Samsung Exynos 4210: dual-core Cortex-A9, 1200 MHz,
two 32-bit 800 Mbps LPDDR2/DDR2/DDR3 memory ports (6.4 GB/s).
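As a quick check of the 6.4 GB/s figure, the peak bandwidth follows from the port count, port width, and transfer rate quoted above. The sketch below assumes that 800 Mbps is the per-pin transfer rate and that 1 GB = 10^9 bytes; the function name is illustrative.

```python
def peak_bandwidth_gb_s(ports: int, width_bits: int, rate_mbps: float) -> float:
    """Peak memory bandwidth in GB/s (assuming 1 GB = 10**9 bytes)."""
    bytes_per_transfer = width_bits / 8          # 32-bit port -> 4 bytes
    bytes_per_second = ports * bytes_per_transfer * rate_mbps * 1e6
    return bytes_per_second / 1e9


# Two 32-bit ports at 800 Mbps per pin.
print(peak_bandwidth_gb_s(ports=2, width_bits=32, rate_mbps=800))
```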
- L1 Data cache = 32 KB, 4-WAY, 32 B/line, Physically Indexed, Physically Tagged.
Two 32-byte linefill buffers and one 32-byte eviction buffer.
A 4-entry, 64-bit merging store buffer.
- Automatic prefetcher that monitors cache misses; it can track and prefetch two independent data streams.
- L1 Instruction cache = 32 KB, 4-WAY, Virtually Indexed, Physically Tagged, 64-bit accesses.
- L2 cache = 1 MB, 32 B/line, ?-WAY
- 2*ALU, LS, MUL
- BTAC: 512 entries, 2-WAY.
- GHB (Global History Buffer): 4K entries, 2-bit predictors
- Instruction buffer (<64 bytes): short loops run from it without accessing the instruction cache.
- PRF (Physical Register File): 56 x 32-bit.
- Return Stack: 8 items ?
- Store buffer: 4 x 64-bit slots with data merging capability.
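To illustrate what a table of 2-bit entries does, here is a minimal sketch of a branch predictor built from 2-bit saturating counters. The gshare-style index (branch address XOR global history) is an assumption made for illustration; the notes above do not say how the Cortex-A9 indexes its GHB, and the class and method names are hypothetical.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, entries: int = 4096):
        # entries is assumed to be a power of two (4K in the notes above).
        self.entries = entries
        self.counters = [1] * entries   # start at "weakly not taken" (arbitrary choice)
        self.history = 0                # global history of recent branch outcomes

    def _index(self, pc: int) -> int:
        # Hypothetical gshare-style hash; not taken from ARM documentation.
        return ((pc >> 2) ^ self.history) % self.entries

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the history, keeping log2(entries) bits.
        self.history = ((self.history << 1) | int(taken)) % self.entries


# Train on a branch that is always taken, then ask for a prediction.
predictor = TwoBitPredictor()
for _ in range(50):
    predictor.update(0x1000, taken=True)
print(predictor.predict(0x1000))
```

The point of the 2-bit width is hysteresis: a single atypical outcome moves a counter by one step and does not immediately flip a strongly biased prediction.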
4 KB page mode:
- Micro TLB Data (L1 TLB): 32 entries (8 in first revision ?), fully associative.
- Micro TLB Instr. (L1 TLB): 32 entries (8 in first revision ?), fully associative.
- Main TLB (L2 TLB): 128 entries, 2-WAY. + fully-associative lockable array of 4 elements.
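The entry counts above translate into a TLB "reach", the amount of memory that can be mapped without a miss at that level. The sketch below assumes 4 KB pages and the 32-entry micro TLB figure, and it ignores the four lockable entries; the function name is illustrative.

```python
PAGE_SIZE_BYTES = 4 * 1024  # 4 KB page mode


def tlb_reach_kb(entries: int, page_size: int = PAGE_SIZE_BYTES) -> int:
    """Memory covered by a TLB with the given number of entries, in KB."""
    return entries * page_size // 1024


micro_tlb_reach = tlb_reach_kb(32)    # micro TLB (L1 TLB)
main_tlb_reach = tlb_reach_kb(128)    # main TLB (L2 TLB), lockable entries ignored

print(f"Micro TLB reach: {micro_tlb_reach} KB")
print(f"Main TLB reach:  {main_tlb_reach} KB")
```

Under these assumptions the reach is 128 KB for the micro TLB and 512 KB for the main TLB, which is consistent with the TLB miss penalties in the latency table appearing at the 256 K and 1 M working-set sizes.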
| Size || Latency (cycles) || Description |
| 32 K || 4 || TLB + L1 |
| 64 K || 23 || + 19 (L2) |
| 128 K |
| 256 K || 30 || + 7 (L1 TLB miss) |
| 512 K |
| 1 M || 37 || + 7 (L2 TLB miss) |
| ... || 37 + 110 ns || + 110 ns (RAM) |
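The table mixes cycles and nanoseconds, so it can help to convert everything to one unit. The sketch below assumes the plain numbers are core cycles and that the core runs at the nominal 1200 MHz; only the rows that carry a value are included, and the names are illustrative.

```python
CLOCK_MHZ = 1200          # nominal core clock (assumed for the conversion)
RAM_EXTRA_NS = 110        # extra RAM latency from the last table row

# Working-set size -> measured latency in cycles (rows with values only).
LATENCY_CYCLES = {"32 K": 4, "64 K": 23, "256 K": 30, "1 M": 37}


def cycles_to_ns(cycles: float, clock_mhz: float = CLOCK_MHZ) -> float:
    """Convert core cycles to nanoseconds at the given clock."""
    return cycles * 1000.0 / clock_mhz


for size, cycles in LATENCY_CYCLES.items():
    print(f"{size:>6}: {cycles:2d} cycles = {cycles_to_ns(cycles):5.1f} ns")

# Beyond the L2 cache: 37 cycles plus the RAM access itself.
ram_ns = cycles_to_ns(LATENCY_CYCLES["1 M"]) + RAM_EXTRA_NS
print(f"   RAM: {ram_ns:.1f} ns")
```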
The data prefetcher monitors only RAM misses; it does not prefetch data from the L2 cache.
- Penalty for an access that crosses a 4-byte boundary = 1 cycle
- Penalty for an access that crosses an 8-byte boundary = 6 cycles
- The CPU can handle TLB misses in parallel (at least two accesses at a time).
- L1 B/W (Parallel Random Read) = 1 cycle per access
- L2->L1 B/W (Parallel Random Read) = 7 cycles per cache line
- L2->L1 B/W (Read, 32 bytes step) = 8.7 cycles per cache line
- L2 Write (Sequential) = 1 cycle per 4 bytes.
- L2 Write (Write, 32 bytes step) = 11.5 cycles per write (cache line); write allocate to L1 is probably enabled
- RAM Read B/W (Parallel Random Read) = 68 ns / cache line = 470 MB/s
- RAM Read B/W (Read, 4 Bytes step) = 890 MB/s
- RAM Read B/W (Read, 32 Bytes step) = 1010 MB/s
- RAM Write B/W (Sequential, or 4 bytes step) = 1600 MB/s
- RAM Write B/W (32 bytes step) = 725 MB/s; write allocate is probably enabled
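The per-line timings in this list can be converted to MB/s for comparison with the RAM figures. The sketch below assumes 32-byte cache lines, a 1200 MHz core clock, and 1 MB = 10^6 bytes; the function names are illustrative.

```python
LINE_BYTES = 32      # cache line size from the notes above
CLOCK_MHZ = 1200     # nominal core clock (assumed)


def mb_s_from_ns_per_line(ns_per_line: float) -> float:
    """Bandwidth in MB/s when one cache line takes ns_per_line nanoseconds."""
    # bytes per nanosecond equals GB/s; multiply by 1000 for MB/s.
    return LINE_BYTES / ns_per_line * 1000.0


def mb_s_from_cycles_per_line(cycles_per_line: float) -> float:
    """Bandwidth in MB/s when one cache line takes cycles_per_line core cycles."""
    # bytes per cycle times cycles per microsecond (MHz) gives bytes/us = MB/s.
    return LINE_BYTES / cycles_per_line * CLOCK_MHZ


print(f"RAM parallel random read: {mb_s_from_ns_per_line(68):.0f} MB/s")
print(f"L2->L1 parallel random:   {mb_s_from_cycles_per_line(7):.0f} MB/s")
print(f"L2->L1 32-byte step:      {mb_s_from_cycles_per_line(8.7):.0f} MB/s")
```

The first line reproduces the roughly 470 MB/s quoted for parallel random RAM reads (68 ns per 32-byte line).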
Branch misprediction penalty = 11 cycles.
| 1 || Fe1 || Fetch |
| 2 || Fe2 |
| 3 || Fe3 |
| 4 || De1 || Decode |
| 5 || De2 |
| 6 || Re || Rename |
| 7 || Iss || Issue |
| 8 || Ex || Execute |
| 9 || WB || WriteBack |
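The 11-cycle misprediction penalty can be turned into a rough estimate of how much branch mispredictions add to the average cycles per instruction. This is a simple first-order model, not a measurement; the branch fraction and misprediction rate in the example are hypothetical inputs, and the function name is illustrative.

```python
MISPREDICT_PENALTY_CYCLES = 11   # measured penalty from the notes above


def mispredict_overhead_cpi(branch_fraction: float, mispredict_rate: float) -> float:
    """Extra cycles per instruction caused by branch mispredictions.

    branch_fraction: share of executed instructions that are branches (0..1)
    mispredict_rate: share of those branches that are mispredicted (0..1)
    """
    return branch_fraction * mispredict_rate * MISPREDICT_PENALTY_CYCLES


# Hypothetical workload: 20% branches, 5% of them mispredicted.
extra = mispredict_overhead_cpi(branch_fraction=0.20, mispredict_rate=0.05)
print(f"Extra cycles per instruction: {extra:.2f}")
```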