Atheros AR9344 (MIPS 74K), 560MHz, 128 MB (16-bit DDR2-667D x 2). TP-Link WDR3600.
- L1 Data cache = 32 KB. 32 B/line. 4-way.
- DTLB size = 32 items (2 pages per item).
- RAM Latency = 150 ns
- DTLB miss penalty = 40 + 120 ns
4 KB pages
Size     Latency       Increase      Description
 32 K    4                           TLB + L1
 64 K    4 + 80 ns     80 ns         + 150 ns RAM
128 K    4 + 120 ns    40 ns
256 K    4 + 140 ns    20 ns
512 K    24 + 200 ns   20 + 60 ns    + 40 + 120 ns (TLB miss)
  1 M    34 + 225 ns   10 + 25 ns
  2 M    39 + 237 ns   5 + 12 ns
  4 M    42 + 246 ns   3 + 9 ns
  8 M    44 + 260 ns   2 + 14 ns
 16 M    44 + 290 ns   30 ns
 32 M    44 + 340 ns   50 ns
 64 M    44 + 370 ns   30 ns         + ??? ns
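The steps in the table line up with two capacities implied by the figures above: the 32 KB L1 data cache, and the DTLB reach of 32 items x 2 pages x 4 KB = 256 KB. A minimal sketch of that arithmetic follows; the function and constant names are illustrative and not part of the original notes, and the model only says which miss types become possible, not how often they occur.

```python
# Figures taken from the notes above; all names are illustrative.
L1D_BYTES = 32 * 1024       # L1 data cache size
DTLB_ENTRIES = 32           # DTLB items
PAGES_PER_ENTRY = 2         # each item maps a pair of pages
PAGE_BYTES = 4 * 1024       # 4 KB pages


def dtlb_reach_bytes():
    """Amount of memory the DTLB can map without a miss."""
    return DTLB_ENTRIES * PAGES_PER_ENTRY * PAGE_BYTES


def expected_misses(working_set_bytes):
    """Miss types a random-access working set of this size can run into."""
    misses = []
    if working_set_bytes > L1D_BYTES:
        misses.append("L1 miss")
    if working_set_bytes > dtlb_reach_bytes():
        misses.append("DTLB miss")
    return misses


if __name__ == "__main__":
    for size_kb in (32, 64, 256, 512):
        label = ", ".join(expected_misses(size_kb * 1024)) or "none"
        print(f"{size_kb:4d} K: {label}")
```

Under this model the first extra latency appears at 64 K (L1 misses) and the TLB term appears at 512 K, which matches the rows where the table's Description column mentions RAM and the TLB miss.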
- 4-byte range crossing penalty = 318 cycles
- CPU can't process several TLB misses concurrently.
- L1 B/W (Parallel Random Read) = 1 cycle per access
- RAM Read B/W (Parallel Random Read) = 44 ns / cache line (720 MB/s)
- RAM Read B/W (Read, 4 Bytes step) = 200 MB/s
- RAM Read B/W (Read, 32 Bytes step) = 830 MB/s
- RAM Read B/W (Read, 32 Bytes step, pointer-chasing) = 260 MB/s (no hardware prefetch)
- RAM Write (4 Bytes step) = 230 MB/s
- RAM Write (32 Bytes step) = 110 ns per write, 280 MB/s (32-byte cache line). Write-allocate?
Branch misprediction penalty = 10 cycles.
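The bandwidth figures in the list above can be cross-checked against the per-line timings by dividing the 32-byte cache line by the time per access. The sketch below assumes 1 MB = 10^6 bytes (the notes do not say which convention was used), and the helper names are illustrative.

```python
LINE_BYTES = 32  # cache line size from the notes


def mb_per_s(bytes_per_access, ns_per_access):
    """Convert a per-access cost into MB/s (assumes 1 MB = 10**6 bytes)."""
    # bytes per ns equals GB/s, so multiply by 1000 to get MB/s.
    return bytes_per_access / ns_per_access * 1000.0


def ns_per_access(bytes_per_access, mb_s):
    """Inverse conversion: time per access implied by a bandwidth figure."""
    return bytes_per_access * 1000.0 / mb_s


if __name__ == "__main__":
    print(round(mb_per_s(LINE_BYTES, 44)))        # parallel random read, 44 ns per line
    print(round(mb_per_s(LINE_BYTES, 110)))       # 32-byte step write, 110 ns per write
    print(round(ns_per_access(LINE_BYTES, 260)))  # pointer-chasing read at 260 MB/s
```

With these assumptions 44 ns per line gives about 727 MB/s and 110 ns per write gives about 291 MB/s, close to the 720 MB/s and 280 MB/s quoted above; the 260 MB/s pointer-chasing figure corresponds to roughly 123 ns per cache line.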
Cache aliasing problem (32 KB data cache, 4-way, 4 KB pages):
There is some penalty for data cache accesses if the cache contains
uninitialized data (data from another process?).
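Why this cache geometry can alias at all follows from simple arithmetic: each way is 32 KB / 4 = 8 KB, which is larger than a 4 KB page, so one bit of the cache index comes from the virtual page number rather than the page offset. The sketch below works through that; the names are illustrative, and whether aliasing is the actual cause of the penalty observed above is not established by these notes.

```python
CACHE_BYTES = 32 * 1024   # L1 data cache size
WAYS = 4                  # 4-way set associative
LINE_BYTES = 32           # cache line size
PAGE_BYTES = 4 * 1024     # 4 KB pages


def way_bytes():
    """Bytes covered by the index plus line offset, i.e. one way."""
    return CACHE_BYTES // WAYS


def alias_colors():
    """Number of distinct index positions one physical page can land in."""
    return max(1, way_bytes() // PAGE_BYTES)


def cache_set(vaddr):
    """Set selected by a virtual address in a virtually indexed cache."""
    return (vaddr % way_bytes()) // LINE_BYTES


if __name__ == "__main__":
    print(way_bytes(), alias_colors())
    # Two hypothetical virtual mappings of the same physical page that
    # differ only in address bit 12 select different sets.
    print(cache_set(0x10000), cache_set(0x11000))
```

Two virtual addresses that map the same physical page but differ in bit 12 select sets 0 and 128 here, so the same data could sit in two places unless hardware or the OS prevents it.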
- L1 Caches
  - 4-way set associative
  - 32-byte cache line size
  - Virtually indexed, physically tagged
  - Cache line locking support
  - Up to 4 outstanding I-cache misses
  - Virtual tag based hit prediction in data cache
  - Up to 4 unique outstanding D-cache misses and 9 total load misses
  - Writeback and write-through support in data cache
  - Non-blocking data cache prefetches
- L1 Data cache:
  - Cache Protocols: uncached, write-back (with write-allocate), write-through (without write-allocate).
  - Data cache misses are non-blocking and up to 4 may be outstanding.
  - The tag array also has a virtual address portion, which is used to compare against the
    virtual address being accessed and generate a data cache hit prediction.
  - 64- or 128-bit wide access to the data cache
- L1 Instruction cache:
  - 128-bit wide access to the instruction cache
  - Instruction cache tag and data access are staggered across 2 cycles,
    with up to 4 instructions fetched per cycle.
- Instruction Fetch Unit
  - 4-instruction fetch per cycle
  - 8-entry Return Prediction Stack
  - Combined Majority Branch Predictor using three 256-entry Branch History Tables (BHT)
  - 64-entry (4-way) jump register cache to predict target for indirect jumps
  - Hardware prefetching of the next 1 or 2 sequential cache lines on a miss.
  - In the MIPS16e mode, the IFU takes an additional 3 stages to recode and expand the compressed code.
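The combined majority predictor listed above can be pictured as three 256-entry tables that each cast a taken / not-taken vote, with the majority deciding. The sketch below is a toy model of that idea only: the 2-bit saturating counters, the index hashing, and the history lengths (0, 4 and 8 bits) are assumptions chosen for illustration, since the notes do not describe how the 74K actually indexes its tables.

```python
class MajorityPredictor:
    """Toy majority-vote predictor: three tables of 2-bit counters."""

    def __init__(self, entries=256, hist_bits=(0, 4, 8)):
        self.entries = entries
        self.hist_bits = hist_bits      # assumed history length per table
        # Counters 0..3; values >= 2 mean "predict taken". Start weakly not-taken.
        self.tables = [[1] * entries for _ in hist_bits]
        self.history = 0                # global history of recent outcomes

    def _index(self, pc, bits):
        # Assumed hash: word-aligned PC XOR the low `bits` of the history.
        hist = self.history & ((1 << bits) - 1)
        return ((pc >> 2) ^ hist) % self.entries

    def predict(self, pc):
        votes = sum(
            self.tables[t][self._index(pc, bits)] >= 2
            for t, bits in enumerate(self.hist_bits)
        )
        return votes >= 2               # majority of the three tables

    def update(self, pc, taken):
        for t, bits in enumerate(self.hist_bits):
            i = self._index(pc, bits)
            counter = self.tables[t][i]
            self.tables[t][i] = min(3, counter + 1) if taken else max(0, counter - 1)
        # Shift the outcome into an 8-bit global history.
        self.history = ((self.history << 1) | int(taken)) & 0xFF


if __name__ == "__main__":
    predictor = MajorityPredictor()
    pc = 0x400100                       # hypothetical branch address
    for _ in range(20):
        predictor.update(pc, True)
    print(predictor.predict(pc))
```

A branch that is always taken trains all three tables within a few iterations, after which the vote is unanimous. The point of using different history lengths per table is that a branch mispredicted by one table can still be outvoted by the other two.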
- Dual Out-of-Order Instruction Issue
  - 12-stage ALU fetch and execution pipe. The latency of the ALU operation is 1 or 2 cycles.
  - 13-stage AGEN fetch and execution pipe. AGEN pipe executes load/store and control
  - Common 2-stage graduation pipe
  - 32 (18 ALU, 14 AGEN) completion buffers hold execution results until instructions
    are graduated in program order
  - 12-entry Instruction Buffer to decouple the instruction fetch from execution.
    Up to 4 instructions can be written into this buffer,
    but a maximum of 2 instructions can be read from this buffer by the IDU.
  - Up to 4 instructions issued per cycle in 74Kf core with dual issue FPU
- Programmable Memory Management Unit
  - 16/32/48/64 dual-entry, dual-ported TLB shared by Instruction and Data MMU
  - 4-entry ITLB (4KB, 16KB page size)
  - 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M byte page size supported in JTLB
  - TLB: 2 virtual pages (odd and even) per entry. Dual-ported TLB shared by Instruction and Data MMU.
  - 4-entry ITLB (4KB, 1MB page size)
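The "2 virtual pages (odd and even) per entry" arrangement means a TLB entry is matched on the page-pair number, and the lowest bit of the virtual page number then selects the even or odd half. A minimal sketch for 4 KB pages follows; the function name is illustrative, and ASID and page-mask handling are left out.

```python
PAGE_SHIFT = 12  # 4 KB pages


def tlb_lookup_key(vaddr, page_shift=PAGE_SHIFT):
    """Return (pair_number, odd) for a virtual address.

    pair_number identifies the even/odd page pair that one entry covers;
    odd is 0 for the even page and 1 for the odd page of that pair.
    """
    vpn = vaddr >> page_shift
    return vpn >> 1, vpn & 1


if __name__ == "__main__":
    # Hypothetical addresses: the first two share one entry, the third does not.
    for vaddr in (0x00402000, 0x00403000, 0x00404000):
        pair, odd = tlb_lookup_key(vaddr)
        print(hex(vaddr), hex(pair), odd)
```

Adjacent pages 0x402000 and 0x403000 share one entry (pair 0x201), while 0x404000 needs the next one. This is why a 32-item DTLB covers 64 pages when both halves of each pair are in use.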
|| 1 || IT || Instruction Tag Read ||
  I-cache tag arrays accessed
  Branch History Table, JRC accessed
  ITLB address translation performed
  Instruction watch and EJTAG break compares done
|| 2 || ID || Instruction Data Read ||
  I-cache data array accesses
  Tag compare, Detect I-cache hit
|| 3 || IS || Instruction Select ||
  Way select
  Target calculation start
|| 4 || IB || Instruction Buffer ||
  Instruction Buffer write
  Target calculation done
| Decode &
|| 5 || DD || Decode ||
  Access Rename Map, get source register availability to resolve source dependency
  Decode instructions and assign pipe and instruction identifier
  Check execution resources
|| 6 || DR || Rename ||
  Update Rename Map at destination register to resolve output dependency
  Send instruction information to Graduation Unit (GRU)
  Send instruction to Decode and Dispatch Queue (DDQ)
|| 7 || DS || Select for Dispatch ||
  Check for operand and resource availability and mark valid instructions as ready for dispatch
  Select 1 out of 8 (6-entry DDQ + 2 staging registers) ready instructions in each ALU and AGEN pipe independently
|| 8 || DM || Instruction Mux ||
  Read out the selected instruction from the previous stage and update the selection information
  Generate controls for source-operand bypass mux
  ALU pipe will start premuxing operands based on the selected instruction
  AGEN pipe will start reading source operands from Register File and Completion Buffers.
|| 9 || AF || ALU Register File Read ||
|| 10 || AM || ALU Operand Mux ||
|| 11 || AC || ALU Compute ||
|| 12 || AB || ALU Results Bypass ||
|| 13 || WB || Writeback ||
|| 14 || WC || Graduation Complete ||