MIPS 74K

Atheros AR9344 (MIPS 74K), 560MHz, 128 MB (16-bit DDR2-667D x 2). TP-Link WDR3600.

L1 Data cache = 32 KB. 32 B/line. 4-way. Write allocate.
DTLB size = 32 items (2 pages per item).

L1 Data cache latency = 4 cycles.
The MIPS ISA has no complex addressing modes for LOAD instructions, so an indexed load needs separate address arithmetic. The latency of a LOAD from an integer array (n = p[n]) is 7 cycles.
RAM Latency = 4 cycles + 155 ns (32 cycles + 100 ns ?)
DTLB miss penalty = 40 cycles + 100 ns ?
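
The figures above mix core cycles and nanoseconds. A minimal sketch of the conversion, assuming the core runs at the 560 MHz quoted for this AR9344 (the helper names are made up for illustration):

```python
CPU_MHZ = 560  # core clock taken from the hardware line at the top of these notes

def cycles_to_ns(cycles, mhz=CPU_MHZ):
    """Convert core cycles to nanoseconds at the given clock."""
    return cycles * 1000.0 / mhz

def total_ns(cycles, extra_ns=0.0, mhz=CPU_MHZ):
    """Total time of an 'N cycles + M ns' figure, in nanoseconds."""
    return cycles_to_ns(cycles, mhz) + extra_ns

print(round(total_ns(4), 1))       # L1 hit: 7.1
print(round(total_ns(7), 1))       # n = p[n] load: 12.5
print(round(total_ns(4, 155), 1))  # RAM latency: 162.1
```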

4 KB pages

  Size        Latency        Increase     Description

  32 K     4                              TLB + L1
  64 K     4 +  80 ns           80 ns     + 150 ns RAM
 128 K     4 + 120 ns           40 ns
 256 K     4 + 140 ns           20 ns          
 512 K    24 + 200 ns      20 + 60 ns     + 40 + 100 ns (TLB miss)
   1 M    34 + 225 ns      10 + 25 ns
   2 M    39 + 237 ns       5 + 12 ns               
   4 M    42 + 246 ns       3 +  9 ns               
   8 M    44 + 260 ns       2 + 14 ns     
  16 M    44 + 290 ns           30 ns     + ??? ns (Page walk)
  32 M    44 + 340 ns           50 ns     
  64 M    44 + 370 ns           30 ns
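
The point where the TLB miss penalty first shows up is consistent with the DTLB reach. A small worked calculation, assuming the 32-entry DTLB with 2 pages per entry listed earlier (the function name is illustrative):

```python
KB = 1024

def tlb_reach_bytes(entries, pages_per_entry, page_size):
    """Amount of memory the TLB can map without a miss."""
    return entries * pages_per_entry * page_size

print(tlb_reach_bytes(32, 2, 4 * KB) // KB, "KB")   # 256 KB
print(tlb_reach_bytes(32, 2, 16 * KB) // KB, "KB")  # 1024 KB
```

With 4 KB pages the reach is 256 KB, so 512 K is the first working-set size that cannot be fully mapped. With 16 KB pages the reach grows to 1 MB.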

16 KB pages

  Size        Latency        Increase     Description

  32 K     4                              TLB + L1
  64 K     4 +  80 ns           80 ns     + 155 ns RAM
 128 K     4 + 120 ns           40 ns
 256 K     4 + 140 ns           20 ns          
 512 K     4 + 150 ns           10 ns
   1 M     4 + 155 ns            5 ns
   2 M    24 + 207 ns      20 + 52 ns     + 40 + 100 ns (TLB miss)               
   4 M    34 + 230 ns      10 + 23 ns               
   8 M    39 + 243 ns       5 + 13 ns     
  16 M    43 + 248 ns       3 +  5 ns     
  32 M    44 + 259 ns       2 + 11 ns     
  64 M    44 + 294 ns           35 ns     + ??? ns (Page walk)
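
The rows up to about 8 M can be roughly approximated by a simple hit-probability model. This is only a sketch: it assumes uniformly random accesses, so that the chance of hitting a structure is its capacity divided by the working-set size, and it ignores the page-walk cost seen at the largest sizes. The constants are the measured values from these notes.

```python
KB = 1024
MB = 1024 * KB

L1_BYTES = 32 * KB
L1_CYCLES = 4
RAM_NS = 155.0          # extra time for an L1 miss
TLB_MISS_CYCLES = 40    # DTLB miss penalty, cycle part
TLB_MISS_NS = 100.0     # DTLB miss penalty, ns part
DTLB_ENTRIES = 32
PAGES_PER_ENTRY = 2

def miss_fraction(capacity, working_set):
    """Probability that a random access misses a structure of this capacity."""
    return max(0.0, 1.0 - capacity / working_set)

def model_latency(working_set, page_size):
    """Return (cycles, ns) predicted for one random access."""
    reach = DTLB_ENTRIES * PAGES_PER_ENTRY * page_size
    l1_miss = miss_fraction(L1_BYTES, working_set)
    tlb_miss = miss_fraction(reach, working_set)
    cycles = L1_CYCLES + tlb_miss * TLB_MISS_CYCLES
    ns = l1_miss * RAM_NS + tlb_miss * TLB_MISS_NS
    return cycles, ns

for size in (64 * KB, 256 * KB, 512 * KB, 1 * MB, 2 * MB):
    for page in (4 * KB, 16 * KB):
        cycles, ns = model_latency(size, page)
        print(f"{size // KB:5d} K, {page // KB:2d} KB pages: "
              f"{cycles:4.0f} cycles + {ns:5.0f} ns")
```

For 1 M with 4 KB pages the model gives 34 cycles + about 225 ns, in line with the measured row.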

Penalty for an access that crosses a 4-byte boundary (unaligned access) = 320 cycles
CPU can't process several TLB misses concurrently.
L1 B/W (Parallel Random Read) = 1 cycle per one access
RAM Read B/W (Parallel Random Read) = 44 ns / cache line. (720 MB/s)
RAM Read B/W (Read, 4 Bytes step) = 200 MB/s
RAM Read B/W (Read, 32 Bytes step) = 860 MB/s
RAM Read B/W (Read, 32 Bytes step, pointer-chasing) = 260 MB/s (no hardware prefetch)
RAM Write (4 Bytes step) = 220 MB/s
RAM Write (32 Bytes step) = 120 ns per write = 270 MB/s (32-byte cache line). Write allocate?
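
The time-per-line and MB/s figures above are two views of the same numbers. A small sketch of the conversion, assuming 32-byte cache lines and MB meaning 10^6 bytes (the helper names are illustrative). The last line is a rough estimate, not a measurement: it divides the RAM latency by the parallel-read time per line to see how many misses would have to overlap.

```python
LINE_BYTES = 32

def mb_per_s(bytes_per_op, ns_per_op):
    """Bandwidth in MB/s (10^6 bytes) for one operation of the given size and time."""
    return bytes_per_op / ns_per_op * 1000.0

def ns_per_op(bytes_per_op, mbps):
    """Time per operation in ns for a given bandwidth in MB/s."""
    return bytes_per_op / mbps * 1000.0

print(round(mb_per_s(LINE_BYTES, 44)))    # parallel random read: 727
print(round(mb_per_s(LINE_BYTES, 120)))   # 32-byte-step write: 267
print(round(ns_per_op(LINE_BYTES, 260)))  # pointer chasing, ns per line: 123

# Latency of one miss (4 cycles at 560 MHz + 155 ns) over 44 ns per line.
ram_latency_ns = 4 * 1000.0 / 560 + 155
print(round(ram_latency_ns / 44, 1))      # 3.7
```

The result of about 3.7 overlapping misses is in the same range as the limit of 4 outstanding D-cache misses given in the core feature list.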

Branch misprediction penalty = 10 cycles.

Cache aliasing problem (32 KB data cache, 4-way, 4 KB pages): data cache accesses incur some penalty if the cache holds uninitialized data (data left by another process?).
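
Aliasing is possible here because the L1 cache is virtually indexed and one way is larger than a page. A minimal sketch of that arithmetic (the function names are illustrative, not from any MIPS documentation):

```python
KB = 1024

def way_size(cache_bytes, ways):
    """Bytes covered by the index + line-offset bits of one way."""
    return cache_bytes // ways

def alias_colors(cache_bytes, ways, page_bytes):
    """Number of cache positions one physical line can occupy in a
    virtually indexed cache; 1 means no aliasing is possible."""
    return max(1, way_size(cache_bytes, ways) // page_bytes)

print(way_size(32 * KB, 4) // KB, "KB per way")  # 8 KB per way
print(alias_colors(32 * KB, 4, 4 * KB))          # 2
print(alias_colors(32 * KB, 4, 16 * KB))         # 1
```

With 4 KB pages one index bit comes from the virtual page number, so the same physical line can land in two different sets. With 16 KB pages the whole index lies inside the page offset.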

MIPS 74K

L1 Caches
- 4-way set associative
- 32-byte cache line size
- Virtually indexed, physically tagged
- Cache line locking support
- Up to 4 outstanding I-cache misses
- Virtual tag based hit prediction in data cache
- Up to 4 unique outstanding D-cache misses and 9 total load misses
- Writeback and write-through support in data cache
- Non-blocking data cache prefetches
L1 Data cache:
- Cache Protocols: uncached, write-back (with write-allocate), write-through (without write-allocate).
- Data cache misses are non-blocking and up to 4 may be outstanding.
- The tag array also has a virtual address portion, which is used to compare against the virtual address being accessed and generate a data cache hit prediction.
- 64- or 128-bit wide access to the data cache
L1 Instruction cache.
- 128-bit wide access to the instruction cache
- Instruction cache tag and data access are staggered across 2 cycles, with up to 4 instructions fetched per cycle.
Instruction Fetch Unit
- 4-instruction fetch per cycle
- 8-entry Return Prediction Stack
- Combined Majority Branch Predictor using three 256-entry Branch History Tables (BHT)
- 64-entry (4-way) jump register cache to predict target for indirect jumps
- Hardware prefetching of the next 1 or 2 sequential cache lines on a miss.
- In the MIPS16e mode, the IFU takes an additional 3 stages to recode and expand the compressed code.
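
As an illustration of the majority-vote idea only, here is a toy model with three 256-entry tables. The 2-bit saturating counters, the 8-bit global history, and the index hash functions are all assumptions made for this sketch; these notes do not describe how the real core indexes or updates its tables.

```python
class MajorityPredictor:
    """Toy majority-vote branch predictor with three history tables."""
    ENTRIES = 256

    def __init__(self):
        # 2-bit counters in the range 0..3, starting at 1 (weakly not-taken).
        self.tables = [[1] * self.ENTRIES for _ in range(3)]
        self.history = 0  # global branch history (assumed 8 bits)

    def _indexes(self, pc):
        word = pc >> 2  # MIPS32 instructions are 4-byte aligned
        # Hypothetical index functions, one per table.
        return (
            word % self.ENTRIES,
            (word ^ self.history) % self.ENTRIES,
            (word ^ (self.history << 1)) % self.ENTRIES,
        )

    def predict(self, pc):
        """Predict taken if at least two of the three tables vote taken."""
        votes = sum(self.tables[t][i] >= 2
                    for t, i in enumerate(self._indexes(pc)))
        return votes >= 2

    def update(self, pc, taken):
        """Train all three counters, then shift the outcome into the history."""
        for t, i in enumerate(self._indexes(pc)):
            c = self.tables[t][i]
            self.tables[t][i] = min(3, c + 1) if taken else max(0, c - 1)
        self.history = ((self.history << 1) | int(taken)) & 0xFF


p = MajorityPredictor()
for _ in range(20):
    p.update(0x400100, True)  # a branch that is always taken
print(p.predict(0x400100))    # True
```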
Dual Out-of-Order Instruction Issue
- 12-stage ALU fetch and execution pipe. The latency of the ALU operation is 1 or 2 cycles.
- 13-stage AGEN fetch and execution pipe. AGEN pipe executes load/store and control transfer instructions
- Common 2-stage graduation pipe
- 32 (18 ALU, 14 AGEN) completion buffers hold execution results until instructions are graduated in program order
- 12-entry Instruction Buffer to decouple the instruction fetch from execution. Up to 4 instructions can be written into this buffer, but a maximum of 2 instructions can be read from this buffer by the IDU.
- Up to 4 instructions issued per cycle in 74Kf core with dual issue FPU
Programmable Memory Management Unit
- 16/32/48/64 dual-entry, dual-ported TLB shared by Instruction and Data MMU
- 4-entry ITLB (4KB, 16KB page size)
- 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M byte page size supported in JTLB
TLB: 2 virtual pages (odd and even) per entry. Dual-ported TLB shared by Instruction and Data MMU.
4-entry ITLB (4KB, 1MB page size)
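
Since each TLB entry maps an even/odd pair of pages, a virtual address splits into a tag shared by the pair, one bit selecting the page within the pair, and the page offset. A short sketch of that split, assuming a 32-bit address and the standard MIPS32 layout (the function name is made up for illustration):

```python
def split_vaddr(vaddr, page_shift=12):
    """Split a virtual address for a paired (even/odd) TLB entry.

    page_shift is log2 of the page size: 12 for 4 KB, 14 for 16 KB.
    """
    vpn2 = vaddr >> (page_shift + 1)          # tag shared by the page pair
    odd = (vaddr >> page_shift) & 1           # 0 = even page, 1 = odd page
    offset = vaddr & ((1 << page_shift) - 1)  # byte offset inside the page
    return vpn2, odd, offset

vpn2, odd, offset = split_vaddr(0x00403ABC)       # 4 KB pages
print(hex(vpn2), odd, hex(offset))                # 0x201 1 0xabc

vpn2, odd, offset = split_vaddr(0x00403ABC, 14)   # 16 KB pages
print(hex(vpn2), odd, hex(offset))                # 0x80 0 0x3abc
```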

Integer pipeline:

Unit	#	Stage	Name	Description
Fetch (IFU)	1	IT	Instruction Tag Read	I-cache tag arrays accessed; Branch History Table, JRC accessed; ITLB address translation performed; Instruction watch and EJTAG break compares done
	2	ID	Instruction Data Read	I-cache data array accesses; Tag compare, detect I-cache hit
	3	IS	Instruction Select	Way select; Target calculation start
	4	IB	Instruction Buffer	Instruction Buffer write; Target calculation done
Decode & Dispatch (IDU)	5	DD	Decode	Access Rename Map, get source register availability to resolve source dependency; Decode instructions and assign pipe and instruction identifier; Check execution resources
	6	DR	Rename	Update Rename Map at destination register to resolve output dependency; Send instruction information to Graduation Unit (GRU); Send instruction to Decode and Dispatch Queue (DDQ)
	7	DS	Select for Dispatch	Check for operand and resource availability and mark valid instructions as ready for dispatch; Select 1 out of 8 (6-entry DDQ + 2 staging registers) ready instructions in each ALU and AGEN pipe independently
	8	DM	Instruction Mux	Read out the selected instruction from the previous stage and update the selection information; Generate controls for source-operand bypass mux; ALU pipe will start premuxing operands based on the selected instruction; AGEN pipe will start reading source operands from Register File and Completion Buffers
ALU	9	AF	ALU Register File Read
	10	AM	ALU Operand Mux
	11	AC	ALU Compute
	12	AB	ALU Results Bypass
Graduation Unit (GRU)	13	WB	Writeback
Graduation Unit (GRU)	14	WC	Graduation Complete
