Sun UltraSPARC III

Sun UltraSPARC IIIi 1000 MHz, Sun Fire V210, 1 GB (256 MB x 4, dual PC-2100 CL2.5)

87.5 M transistors, 178.5 mm2, Texas Instruments, 130 nm, 7-layer metal (copper) CMOS with low-k dielectric. Power: 59 W at 1 GHz.

L1 Data cache = 64 KB, 32 B/line, pseudo 4-WAY, write-through, no write-allocate. Parity checking. Virtually indexed, physically tagged. Latency: 2 cycles. 8-bit microtags do way selection based on virtual addresses.
L1 Instruction cache = 32 KB, 32 B/line, 4-WAY. Virtually indexed, physically tagged. Parity checking. Latency: 2 cycles.
L2 cache = 1 MB, 64 B/line, 4-WAY (pseudo-random replacement policy). Physically indexed, physically tagged. Write-back, write-allocate. The L2 cache does not include the contents of the instruction cache, prefetch cache, and data cache. 63 M transistors. Runs at half the processor's clock frequency. Latency: 6 cycles, throughput: 2 cycles. The load-to-use latency is 15 cycles. 36 ECC bits per 64-byte cache line: corrects 1-bit errors and detects any error within 2 (or 4?) bits.
Data TLB L1 = 16 items, fully-associative (8 KB, 64 KB, 512 KB, 4 MB pages).
2 * Data TLB L2 = 2 * 512 items = 1024 items, 2-WAY. One of the TLBs can be set for large pages (such as 4 MB pages) while the other can be set to the default page size (usually 8 KB pages).
Instruction TLB L1 = 16 items, fully-associative (8 KB, 64 KB, 512 KB, 4 MB pages).
Instruction TLB L2 = 128 items, 2-WAY (8 KB pages).
16K-entry branch predictor.
8-entry Return Address Stack (RAS).
4 instructions are fetched from the L1 instruction cache to instruction buffer.
16-entry instruction queue after fetch. While instructions enter and exit the instruction queue in strict program order, they can complete executing out-of-order.
Up to 4 instructions in a clock cycle can be steered from instruction queue into 6 execution buffers.
Up to 6 instructions in a clock cycle can be dispatched from the 6 execution buffers into the 6 execution units.
Execution units: ALU1, ALU2, BranchUnit, LSU (also handles certain special operations, like integer multiplication and division), FALU1, FALU2.
Integer MUL latency of 6 - 9 cycles depending on the size of the operands.
Integer DIV: 40 - 70 cycles.
Integer loads of unsigned words and double words have a 2-cycle latency. All other loads have a 3-cycle latency.
8-entry store queue to buffer stores. Stores reside in the store queue from the time they are issued until they complete an update to the write cache. The store queue allows successive separate stores to the same cache line to collect. Store forwarding from store queue to quickly following load. Since 3 cycles of latency is required for a load to communicate with the store queue, the LPB bit in the instruction cache is used to force 2-cycle loads to issue as 3-cycle loads. If a 2-cycle load is not correctly predicted to have a RAW hazard, the load must be re-issued.
2 KB prefetch cache (P-cache) for the FPU. 4-WAY. Write-invalidate, 64-byte lines with two 32-byte sub-blocks. Physically indexed and physically tagged; it never contains modified data. The P-cache is used for software prefetch instructions as well as for autonomous hardware prefetch from the L2 cache. It needs to be flushed only for error handling, never for address aliases.
2 KB write cache. 4-WAY. 64-byte lines and 32-byte sub-blocks. Used to coalesce data being stored back to memory by reducing the number of separate store operations needed. The W-cache is included in the L2-cache, and flushing the L2-cache ensures that the W-cache has also been flushed.
43-bit physical address space
160 64-bit integer registers and 32 64-bit registers for FPU and VIS.
Memory controller: 256 MB to 16 GB of 133 MHz DDR-I SDRAM. 137-bit bus: 128 bits data and 9 bits ECC. 4.2 GB/s.
Supports 4-way multiprocessing.
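
A few derived figures follow directly from the numbers in the list above. The sketch below computes the number of sets in each cache and the theoretical peak memory bandwidth. The helper names are illustrative, and the memory clock is taken as exactly 133 MHz with two transfers per clock (DDR), which is an assumption about how the 4.2 GB/s figure was obtained.

```python
KB = 1024
MB = 1024 * KB


def cache_sets(size_bytes: int, line_bytes: int, ways: int) -> int:
    """Number of sets in a set-associative cache: size / (line size * ways)."""
    return size_bytes // (line_bytes * ways)


def peak_bandwidth(clock_hz: float, transfers_per_clock: int, data_bits: int) -> float:
    """Theoretical peak bandwidth in bytes per second (ECC bits excluded)."""
    return clock_hz * transfers_per_clock * data_bits / 8


# Cache geometries as listed above.
l1d_sets = cache_sets(64 * KB, 32, 4)
l1i_sets = cache_sets(32 * KB, 32, 4)
l2_sets = cache_sets(1 * MB, 64, 4)

# 133 MHz DDR, 128 data bits (the 9 ECC bits carry no payload).
ddr_peak = peak_bandwidth(133e6, 2, 128)

print("L1 D sets:", l1d_sets)
print("L1 I sets:", l1i_sets)
print("L2 sets:  ", l2_sets)
print("DDR peak:  %.2f GB/s" % (ddr_peak / 1e9))
```

With these inputs the peak comes out at about 4.26 GB/s, in line with the 4.2 GB/s quoted for the memory controller.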

8 KB pages mode (OpenBSD)

Size	Latency (cycles)	Description
64 K	2	TLB + L1
256 K	15	+13 (L1-Cache miss, L2 hit)
1 M	15 + ?? ns	+?? ns (exclusive cache thrashing)
8 M	15 + 120 ns	+RAM (L2-Cache Miss)
64 M	100 + 120 ns	+85 (TLB miss)
...	100 + 240 ns	+120ns (?)
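
The table mixes cycles and nanoseconds. A minimal sketch of how the rows combine, assuming the 1000 MHz part named at the top (so 1 cycle = 1 ns) and assuming both 512-entry data TLBs are configured for 8 KB pages. Rows with unknown values (`??`) are left out, and the function names are illustrative.

```python
CLOCK_GHZ = 1.0  # assumption: the 1000 MHz part, so one cycle takes 1 ns


def total_ns(cycles: float, extra_ns: float = 0.0, clock_ghz: float = CLOCK_GHZ) -> float:
    """Convert a 'cycles + ns' table entry to nanoseconds."""
    return cycles / clock_ghz + extra_ns


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Amount of memory a TLB can map without a miss."""
    return entries * page_bytes


# (size label, cycles, additional ns) taken from the table rows with known values.
rows = [
    ("64 K", 2, 0),
    ("256 K", 15, 0),
    ("8 M", 15, 120),
    ("64 M", 100, 120),
]
latencies = {size: total_ns(cycles, ns) for size, cycles, ns in rows}

# Two 512-entry second-level data TLBs, both assumed to hold 8 KB pages.
dtlb_reach = tlb_reach(2 * 512, 8 * 1024)

for size, ns in latencies.items():
    print("%-6s %6.1f ns" % (size, ns))
print("D-TLB reach: %d MB" % (dtlb_reach // (1024 * 1024)))
```

Under these assumptions the data TLB reach is 8 MB, which is consistent with the TLB-miss penalty first appearing in the 64 M row rather than the 8 M row.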

Notes: the L1 and L2 caches are exclusive, so a read that brings a line from L2 into L1 also moves some cache line from L1 into L2. Because L2 uses a pseudo-random replacement policy, that incoming line can evict an L2 line that is not the least recently used.

L2->L1 Read B/W (32 Bytes stride) = 18 cycles
L2 Write B/W (8 Bytes stride) = 1.8 GB/s
L2 Write B/W (64 Bytes stride) = 2.1 GB/s
RAM Read B/W (4 Bytes stride) = 300 MB/s
RAM Read B/W (64 Bytes stride) = 456 MB/s
RAM Write B/W (any stride) = 400 MB/s
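
One possible reading of the RAM read figure: if every 64-byte line costs a full L2 miss (15 cycles + 120 ns in the latency table, about 135 ns at 1 GHz) and misses are not overlapped, the achievable bandwidth is bounded by line size divided by miss latency. This is an assumption about how the benchmark behaves, not something the measurements state. MB is taken as 10^6 bytes here.

```python
def latency_bound_bandwidth(bytes_per_miss: int, miss_latency_ns: float) -> float:
    """Bytes per second when only one miss is outstanding at a time."""
    return bytes_per_miss / (miss_latency_ns * 1e-9)


# 64-byte L2 line, 15 cycles + 120 ns at an assumed 1 ns per cycle.
estimate = latency_bound_bandwidth(64, 135.0)

measured = 456e6  # RAM read bandwidth at 64-byte stride, from the list above
peak = 4.2e9      # memory controller peak, from the specification list

print("latency-bound estimate: %.0f MB/s" % (estimate / 1e6))
print("measured / estimate:    %.2f" % (measured / estimate))
print("measured / peak:        %.2f" % (measured / peak))
```

The estimate lands near 474 MB/s, close to the measured 456 MB/s and roughly a tenth of the controller's peak, which suggests the read test is limited by miss latency rather than by the memory bus.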

Pipeline

Branch misprediction penalty = 6 cycles.

#	Name	Description
1	A	Address generation
2	P	Preliminary Fetch: I-cache access start. Branch Predictor (BP) access.
3	F	Fetch instructions from I-cache (second half of the I-cache access)
4	B	Branch target computation
5	I	Instruction group formation: instructions fetched from the I-cache are entered as a group into the instruction queue (4 groups * 4 instructions).
6	J	Grouping: a group of instructions is dequeued from the instruction queue and prepared for being sent to the R-stage. If the R-stage is expected to be empty at the end of the current cycle, the group is sent to the R-stage.
7	R	Register access (dispatch/dependency checking stage): The integer working register file is accessed. The register and pipeline dependencies between the instructions in the group and the instructions in the execution pipelines are calculated concurrently with the register file access. If a dependency is found, the dependent instruction and any older instruction in the group are held in the R-stage until the dependency is resolved.
8	E	Execute
9	C	Cache: The D-cache delivers results for doubleword (64-bit) and unsigned word (32-bit) integer loads in the C-stage. The D-TLB access is initiated in the C-stage and proceeds in parallel with the D-cache access.
10	M	Miss detect: D-cache misses are determined in the M-stage by a comparison of the physical address from the D-TLB to the physical address in the D-cache tags. If the load requires additional alignment or sign extension (such as signed word, all halfword, and all byte loads), it is carried out in this stage, resulting in a three-cycle latency for those load operations.
11	W	Write: The MS integer pipeline results are written into the working register file. The results of the D-cache miss are available in this stage and the requests are sent to the L2-cache if needed.
12	X	eXtend: last execution stage for most floating-point operations (except divide and square root) and for all VIS instructions. Floating-point results from this stage are available for bypass to dependent instructions that will be entering the C-stage in the next cycle.
13	T	Trap: Traps, including floating-point and integer traps, are signalled in this stage.
14	D	Done: Integer results are written into the architectural register file in this stage. At this point, they are fully committed and are visible to any traps generated from younger instructions in the pipeline.
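
The 2-cycle and 3-cycle load latencies quoted earlier can be read off the stage table: data for doubleword and unsigned word loads is delivered in the C-stage, while loads needing alignment or sign extension finish in the M-stage. A small sketch, assuming latency is counted from the E-stage through the delivering stage inclusive (a counting convention chosen here to match the quoted numbers, not one stated in the source):

```python
# Stage letters in pipeline order, as in the table above.
STAGES = ["A", "P", "F", "B", "I", "J", "R", "E", "C", "M", "W", "X", "T", "D"]


def stage_number(name: str) -> int:
    """1-based position of a stage in the pipeline."""
    return STAGES.index(name) + 1


def load_latency(result_stage: str) -> int:
    """Cycles from the E-stage through the stage that delivers load data, inclusive."""
    return stage_number(result_stage) - stage_number("E") + 1


fast_load = load_latency("C")  # doubleword and unsigned word integer loads
slow_load = load_latency("M")  # signed word, halfword, and byte loads

print("pipeline depth:", len(STAGES))
print("load delivered in C:", fast_load, "cycles")
print("load delivered in M:", slow_load, "cycles")
```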

Pipeline Recirculation: When a dependency is encountered in or before the dispatch R-stage, the pipeline is stalled. Most dependencies, like register or FV dependencies, are resolved in the R-stage. When a dependency is encountered after the dispatch R-stage, the pipeline is recirculated. Recirculation involves resetting the PC back to the instruction that invoked the recirculation. Instructions older than the dependent instruction continue to execute. The offending instruction and all younger instructions are recirculated. The offending instruction is retried and goes through the entire pipeline again. Upon recirculation, the instruction responsible for the recirculation becomes a single-group instruction that is held in the R-stage until the dependency is resolved.
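
The stall-versus-recirculate rule can be sketched as a toy model. This is only an illustration of the description above, not the actual control logic; the function name, the instruction mnemonics, and the representation of the pipeline as a program-ordered list are all assumptions.

```python
from typing import List, Tuple


def handle_dependency(
    in_flight: List[str], offender: int, detected_after_r: bool
) -> Tuple[str, List[str], List[str]]:
    """Split in-flight instructions (oldest first) into those that keep
    executing and those that are sent back through the pipeline."""
    older = in_flight[:offender]
    if not detected_after_r:
        # Found in or before the R-stage: the pipeline stalls and the
        # offender waits in R. Nothing is refetched.
        return ("stall", older, [])
    # Found after the R-stage: the PC is reset to the offender, so it and
    # every younger instruction go through the pipeline again.
    return ("recirculate", older, in_flight[offender:])


program = ["add", "ld", "sub", "or"]

# Example: the load at index 1 misses in the D-cache, which is detected after R.
action, continuing, recirculated = handle_dependency(program, 1, True)
print(action, continuing, recirculated)
```

In the recirculate case the older `add` continues to execute, while `ld`, `sub`, and `or` are refetched, with `ld` issued as a single-instruction group.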

Load Instruction Dependency: In the case of a load instruction miss in a primary cache, the pipeline recirculates and the load instruction waits in the R-stage. When the data is returned in the D-cache fill buffer, the load instruction is dispatched again and the data is provided to the load instruction from the fill buffer. The pipeline logic inserts two helpers behind the load instruction to move the data in the fill buffer to the D-cache. The instruction in the instruction fetch stream, after the load instruction, follows the helpers and will re-group with younger instructions, if possible.

Links

UltraSPARC III at Wikipedia