- L1 Data cache = 8 KB. 32 B/line. 4-Way, 2 ports and 4 banks, dual tagged
(linear tags and physical tags), writeback or write allocate,
MESI, round robin replacement, blocking.
The physical tags are only accessed during cache misses and snoops.
The physical tags are stored in the memory management unit (MMU), where the TLB is also located.
The physical-tag directories for each cache have one port.
Accesses to the data-cache physical tags add two clocks to the one-clock linear-tag access.
- L1 Instruction cache = 16 KB. 32 B/line (2 * 16-byte halves), dual tagged, 4-way, blocking.
5 bits/byte of predecoding information, 1024 branch targets. 1 branch-target entry and
1 bit of prediction per 16-byte half-line.
Instruction-cache accesses can be to any 16 bytes within a single 32-byte line, or
they can be split into two 8-byte accesses across two contiguous lines.
- TLB for 4 KB pages = 128 entries, 4-way. The TLB is not used on an L1 data-cache hit.
- TLB for 4 MB pages = 4 entries, fully associative.
- Out-of-Order execution
- 2 * ALU, 2 * LS, Branch Unit, FPU.
- Decoding rate: 4 ROPs per cycle.
- Reservation Station in each Execution Unit. 2 (or 4?) entries per RS (1-entry RS in FPU).
One ROP can be dispatched to a single reservation station in a given clock, thus up to
four reservation stations receive an ROP each clock.
- ROPs are issued from a reservation station to its execution unit when all operands are
available from the register file, reorder buffer, or prior execution via forwarding
(including from data cache loads), and when the execution unit has completed its prior ROP.
Issue and dispatch occur in the same clock if the operands are available and the unit is
free at dispatch time.
- 16-entry ROB. An entry tag is allocated at the top of the ROB for each ROP
that is dispatched to a reservation station.
- Store buffer: speculative-state, 4-entry, 4-byte-wide, between the two
load/store execution units and the data cache. The store buffer can contain
both speculative- and real-state data. Each entry in the store buffer is in
speculative state until the associated ROP is retired, after which the data is
transferred to the data cache and/or memory, both of which represent the real
(non-speculative) state of data.
A store occurs at the retirement stage of the pipeline, when the processor writes
an entry from the store buffer to the data cache and/or memory.
- 1-entry, 32-byte writeback (copy-back) buffer in the bus interface unit
for replacements and invalidations. The buffer is used for writebacks of modified
data in the data cache (Cache-line replacement during data-cache read miss).
During cache-line replacements, the memory read cycle for the
new cache line is initiated on the bus before the contents of the
modified line to be replaced are copied into the writeback
buffer. When the cache-line fill is completed, the contents of
the writeback buffer are written to memory.
Writethroughs from the data cache do not go through a buffer.
These transfers are between 1 and 8 bytes in length and they
go directly onto the bus from the store buffer.
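The cache and TLB figures in the list above imply a set count and an address split that are easy to cross-check. The following is a minimal sketch, not taken from AMD documentation: the helper name `cache_geometry` is hypothetical, and it assumes power-of-two sizes and conventional set-associative indexing.

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes all three arguments are powers of two.
    """
    sets = size_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1  # log2 of the line size
    index_bits = sets.bit_length() - 1         # log2 of the set count
    return sets, offset_bits, index_bits

# Sizes, line length and associativity as given in the list above.
l1d = cache_geometry(8 * 1024, 32, 4)
l1i = cache_geometry(16 * 1024, 32, 4)

# TLB for 4 KB pages: 128 entries, 4-way.
tlb_4k_sets = 128 // 4

# Data-cache physical-tag access: one-clock linear-tag access plus two clocks.
physical_tag_clocks = 1 + 2

print(l1d)  # (64, 5, 6)
print(l1i)  # (128, 5, 7)
print(tlb_4k_sets, physical_tag_clocks)  # 32 3
```

The instruction cache also holds 16 KB / 16 B = 1024 half-lines, which matches the 1024 branch targets (one per half-line) quoted above.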
AMD K5 75 (75 x 1), FIC VA-503 (VIA VP3, L2 1 MB, PC133 SDRAM).
4 KB pages mode
| Size | Latency (cycles) | Description |
| 8 K | 1 | TLB + L1 |
| 512 K | 10 | + 9 (L1-Cache miss, L2 hit) |
| 1 M | 25 | + 15 (TLB miss) |
| ... | 25 + 100 ns | + 100 ns (RAM) |
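The rows of the table above are cumulative: each level adds its penalty to the previous total. As a rough sketch of that bookkeeping, the snippet below converts the totals to nanoseconds. It assumes the core runs at 75 MHz (reading the "75" in the system line as MHz) and that the penalties simply add up; the names `CPU_MHZ`, `cycles_to_ns` and `steps` are illustrative only.

```python
CPU_MHZ = 75.0  # assumed core clock, read from the system line above

def cycles_to_ns(cycles, mhz=CPU_MHZ):
    """Convert core cycles to nanoseconds at the given clock."""
    return cycles * 1000.0 / mhz

# (level, added cycles, added ns) for each row of the table.
steps = [
    ("L1 hit", 1, 0.0),
    ("L2 hit", 9, 0.0),
    ("TLB miss", 15, 0.0),
    ("RAM", 0, 100.0),
]

total_cycles = 0
total_ns = 0.0
for name, add_cycles, add_ns in steps:
    total_cycles += add_cycles
    total_ns += cycles_to_ns(add_cycles) + add_ns
    print(f"{name:9s} {total_cycles:3d} cycles  {total_ns:6.1f} ns")
```

Under these assumptions the last row, 25 cycles + 100 ns, comes to roughly 433 ns per access.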
- Penalty for crossing a 4-byte boundary = 1 cycle.
- L1 Random Read = 0.60 cycles per read
- L2 Read B/W (32 Bytes stride) = 10 cycles per cache line (260 MB/s)
- RAM Read B/W (4 Bytes stride) = 107 MB/s
- RAM Read B/W (32 Bytes stride) = 166 MB/s
- RAM Write B/W (4 Bytes stride) = 76 MB/s (write-allocate enabled via MSR)
AMD K5 75 MHz (50 MHz x 1.5) : Zida 5STX 1.02 (Intel 430TX, L2 512 KB, RAM: 16 MB - 2xSIMM 4xHY5118160 JC-70).
4 KB pages mode
| Size | Latency (cycles) | Description |
| 8 K | 1 | TLB + L1 |
| 512 K | 15 | + 14 (L1-Cache miss, L2 hit) |
| ... | 32 + 270 ns | + 17 (TLB miss) + 270 ns (RAM) |
- L2 Read B/W (32 Bytes stride) = 14 cycles (190 ns) per cache line (170 MB/s)
- RAM Read B/W (4 Bytes stride) = 49 MB/s
- RAM Read B/W (32 Bytes stride) = 67 MB/s
- RAM Write B/W (4 Bytes stride) = 37 MB/s (write-allocate enabled via MSR)
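The L2 figure for this system ties together cycles per line, time per line, and bandwidth. A small sketch of that arithmetic follows, assuming a 75 MHz core clock, 32-byte lines, and MB meaning 10^6 bytes (the source does not say which convention it uses); the function names are illustrative.

```python
CPU_MHZ = 75.0   # core clock from the system line above
LINE_BYTES = 32  # L2 read stride, one cache line

def line_time_ns(cycles, mhz=CPU_MHZ):
    """Time to read one cache line, in nanoseconds."""
    return cycles * 1000.0 / mhz

def bandwidth_mb_s(cycles, line_bytes=LINE_BYTES, mhz=CPU_MHZ):
    """Sustained bandwidth in MB/s, taking 1 MB as 10**6 bytes (assumption)."""
    return line_bytes * mhz / cycles

print(round(line_time_ns(14)))    # 187
print(round(bandwidth_mb_s(14)))  # 171
```

Both values agree with the rounded 190 ns and 170 MB/s reported above.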
Branch misprediction penalty = 3 cycles?
||Calculate Fetch PC|
||Shift instructions to 16-byte FIFO byte queue.|
||Drive ROPs to decode.|
||Access registers or ROB.|
||Dispatch to function unit. Calculate address and dcache index.|
||Result on Bus.|
||Write to register.|
AMD K5 at Wikipedia
AMD's K5 Designed to Outrun Pentium, Michael Slater, Microprocessor Report.