AMD K5

Cache

L1 Data cache = 8 KB. 32 B/line. 4-way, 2 ports and 4 banks, dual tagged (linear tags and physical tags), writeback or write allocate, MESI, round robin replacement, blocking. The physical tags are only accessed during cache misses and snoops. The physical tags are stored in the memory management unit (MMU), where the TLB is also located. The physical-tag directories for each cache have one port. Accesses to the data-cache physical tags add two clocks to the one-clock linear-tag access.
L1 Instruction cache = 16 KB. 32 B/line (2 * 16-byte halves), dual tagged, 4-way, blocking. 5 bits/byte of predecoding information, 1024 branch targets. 1 branch target entry and 1 bit of prediction per 16-byte half-line. Instruction-cache accesses can be to any 16 bytes within a single 32-byte line or they can be split into two 8-byte accesses across two contiguous lines.
4 KB TLB size = 128 items, 4-way. The TLB is not used on an L1 data cache hit.
4 MB TLB size = 4 items, full assoc.
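
The cache and TLB parameters above imply a few derived numbers (sets, index bits, TLB reach). A minimal sketch of that arithmetic follows; the function name cache_geometry and the variable names are illustrative, not taken from any AMD document, and the sketch assumes power-of-two sizes.

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Derive line count, set count and address-bit split for a set-associative cache."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2 for powers of two
        "index_bits": sets.bit_length() - 1,
    }

# L1 data cache: 8 KB, 32 B/line, 4-way
l1d = cache_geometry(8 * 1024, 32, 4)
# L1 instruction cache: 16 KB, 32 B/line, 4-way
l1i = cache_geometry(16 * 1024, 32, 4)

# One branch target entry per 16-byte half-line of the instruction cache
half_lines = (16 * 1024) // 16

# TLB reach = entries * page size
tlb_4k_reach = 128 * 4 * 1024          # bytes covered by the 4 KB-page TLB
tlb_4m_reach = 4 * 4 * 1024 * 1024     # bytes covered by the 4 MB-page TLB

print(l1d, l1i, half_lines, tlb_4k_reach, tlb_4m_reach)
```

Under these assumptions the data cache has 64 sets and the instruction cache 128 sets; the 1024 half-lines match the 1024 branch targets listed above, and the 4 KB TLB covers 512 KB.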
Out-of-Order execution
2 * ALU, 2 * LS, Branch Unit, FPU.
Decoding rate: 4 ROPs per cycle.
Reservation Station in each Execution Unit. 2 (or 4?) entries per RS (1-entry RS in FPU). One ROP can be dispatched to a single reservation station in a given clock, thus up to four reservation stations receive an ROP each clock.
ROPs are issued from a reservation station to its execution unit when all operands are available from the register file, reorder buffer, or prior execution via forwarding (including from data cache loads), and when the execution unit has completed its prior ROP. Issue and dispatch occur in the same clock if the operands are available and the unit is free at dispatch time.
16-entry ROB. An entry tag is allocated at the top of the ROB for each ROP that is dispatched to a reservation station.
Store buffer: speculative-state, 4-entry, 4-byte-wide, between the two load/store execution units and the data cache. The store buffer can contain both speculative- and real-state data. Each entry in the store buffer is in speculative state until the associated ROP is retired, after which the data is transferred to the data cache and/or memory, both of which represent the real (non-speculative) state of data. A store occurs at the retirement stage of the pipeline, when the processor writes an entry from the store buffer to the data cache and/or memory.
1-entry, 32-byte writeback (copy-back) buffer in the bus interface unit for replacements and invalidations. The buffer is used for writebacks of modified data in the data cache (Cache-line replacement during data-cache read miss). During cache-line replacements, the memory read cycle for the new cache line is initiated on the bus before the contents of the modified line to be replaced are copied into the writeback buffer. When the cache-line fill is completed, the contents of the writeback buffer are written to memory. Writethroughs from the data cache do not go through a buffer. These transfers are between 1 and 8 bytes in length and they go directly onto the bus from the store buffer.
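
The store buffer behavior described above (entries stay speculative until the ROP retires, then the data is written out) can be illustrated with a toy model. This is only a sketch: the class StoreBuffer, its method names, and the assumption that entries drain strictly oldest-first are hypothetical and not taken from AMD documentation; the dict stands in for the data cache and/or memory.

```python
from collections import deque

class StoreBuffer:
    """Toy 4-entry store buffer: entries are speculative until their ROP retires."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = deque()  # oldest entry first

    def push(self, rop_id, addr, data):
        """Add a speculative store; returns False if the buffer is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(
            {"rop": rop_id, "addr": addr, "data": data, "speculative": True}
        )
        return True

    def retire(self, rop_id):
        """Mark the store of a retired ROP as real (non-speculative) state."""
        for entry in self.entries:
            if entry["rop"] == rop_id:
                entry["speculative"] = False

    def drain(self, memory):
        """Write retired entries to memory, oldest first (assumed ordering)."""
        written = 0
        while self.entries and not self.entries[0]["speculative"]:
            entry = self.entries.popleft()
            memory[entry["addr"]] = entry["data"]
            written += 1
        return written

sb = StoreBuffer()
memory = {}
for i in range(4):
    sb.push(i, 0x100 + 4 * i, i)   # four 4-byte-wide stores fill the buffer
full = not sb.push(4, 0x110, 4)    # a fifth store does not fit
before = sb.drain(memory)          # nothing retired yet, nothing written
sb.retire(0)
after = sb.drain(memory)           # oldest store now reaches memory
```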

AMD K5 75 (75 x 1), FIC VA-503 (Via VP3, L2 1MB, PC133 SDRAM).

4 KB pages mode

Size	Latency	Description
8 K	1	TLB + L1
512 K	10	+ 9 (L1-Cache miss, L2 hit)
1 M	25	+15 (TLB miss)
...	25 + 100 ns	+ 100 ns (RAM)
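
For comparison with the RAM figure, the cycle counts in the table above can be converted to nanoseconds. A small sketch, assuming the nominal 75 MHz core clock; the helper cycles_to_ns and the row labels are illustrative.

```python
CLOCK_MHZ = 75.0  # nominal core clock, assumed from the system description

def cycles_to_ns(cycles, clock_mhz=CLOCK_MHZ):
    """Convert core clock cycles to nanoseconds."""
    return cycles * 1000.0 / clock_mhz

# (label, latency in cycles, extra latency in ns) taken from the table rows
rows = [
    ("L1 hit", 1, 0.0),
    ("L2 hit", 10, 0.0),
    ("L2 hit + TLB miss", 25, 0.0),
    ("RAM", 25, 100.0),
]

for label, cycles, extra_ns in rows:
    total_ns = cycles_to_ns(cycles) + extra_ns
    print(f"{label:18s} {total_ns:7.1f} ns")
```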

4-byte range cross penalty = 1 cycle.
L1 Random Read = 0.60 cycles per read
L2 Read B/W (32 Bytes stride) = 10 cycles per cache line (260 MB/s)
RAM Read B/W (4 Bytes stride) = 107 MB/s
RAM Read B/W (32 Bytes stride) = 166 MB/s
RAM Write B/W (4 Bytes stride) = 76 MB/s (write-allocate enabled via MSR)

AMD K5 75 MHz (50 MHz x 1.5) : Zida 5STX 1.02 (Intel 430TX, L2 512 KB, RAM: 16 MB - 2xSIMM 4xHY5118160 JC-70).

4 KB pages mode

Size	Latency	Description
8 K	1	TLB + L1
512 K	15	+ 14 (L1-Cache miss, L2 hit)
...	32 + 270 ns	+17 (TLB miss) + 270 ns (RAM)

L2 Read B/W (32 Bytes stride) = 14 cycles (190 ns) per cache line (170 MB/s)
RAM Read B/W (4 Bytes stride) = 49 MB/s
RAM Read B/W (32 Bytes stride) = 67 MB/s
RAM Write B/W (4 Bytes stride) = 37 MB/s (write-allocate enabled via MSR)
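
The L2 read bandwidth above follows from the cycles-per-line figure. A sketch of the conversion, assuming the nominal 75 MHz core clock, 32-byte lines, and MB = 10^6 bytes; the function name line_bandwidth_mb_s is illustrative.

```python
def line_bandwidth_mb_s(cycles_per_line, clock_mhz=75.0, line_bytes=32):
    """Bandwidth in MB/s (10^6 bytes/s) when one cache line takes cycles_per_line cycles."""
    ns_per_line = cycles_per_line * 1000.0 / clock_mhz
    return line_bytes / ns_per_line * 1000.0  # bytes/ns equals GB/s, so scale to MB/s

ns_per_line = 14 * 1000.0 / 75.0        # about 187 ns per line
l2_mb_s = line_bandwidth_mb_s(14)       # about 171 MB/s
print(f"{ns_per_line:.0f} ns per line, {l2_mb_s:.0f} MB/s")
```

With 14 cycles per line this gives roughly 187 ns and 171 MB/s, close to the rounded 190 ns and 170 MB/s listed.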

Pipeline

Branch misprediction penalty = 3 cycles?

#	Name	Description
1	Fetch	Calculate Fetch PC
1	Fetch	Fetch instruction. Predict branch
2	Decode1	Shift instructions to 16-byte FIFO byte queue.
2	Decode1	Generate ROPs.
3	Decode2	Drive ROPs to decode.
3	Decode2	Access registers or ROB.
4	Execute	Dispatch to function unit. Calculate address and dcache index
4	Execute	Execute. Access Cache
5	Result	Result on Bus. Write ROB. Branch correction
5	Result	.
6	Retire	Write to register. ROB forwarding
6	Retire	.

Links

AMD K5 at Wikipedia

AMD's K5 Designed to Outrun Pentium, Michael Slater, Microprocessor Report.