Sun UltraSPARC III

Sun UltraSPARC IIIi 1000 MHz, Sun Fire V210, 1 GB (256 MB x 4, dual PC-2300 CL2.5)

87.5 M transistors, 178.5 mm², Texas Instruments, 130 nm, 7-layer metal (copper) CMOS with low-k dielectric. Power: 59 W at 1 GHz.

8 KB page mode (OpenBSD)

Size    Latency         Description
64 K    2               TLB + L1
256 K   15              +13 (L1-Cache miss, L2 hit)
1 M     15 + ?? ns      +?? ns (exclusive cache thrashing)
8 M     15 + 120 ns     +RAM (L2-Cache Miss)
64 M    100 + 120 ns    +85 (TLB miss)
...     100 + 240 ns    +120 ns (?)
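
The table mixes a cycle count with a nanosecond component. As a rough reading aid, the sketch below converts the rows with known values to nanoseconds. It assumes that the unitless numbers are CPU cycles and that the clock is the 1000 MHz quoted for the test system (1 cycle = 1 ns); the names total_ns and rows are illustrative only.

```python
CLOCK_MHZ = 1000  # test system clock; assumption: unitless latencies are cycles


def total_ns(cycles, extra_ns=0.0, clock_mhz=CLOCK_MHZ):
    """Convert `cycles` at `clock_mhz` to ns and add a fixed ns component."""
    return cycles * 1000.0 / clock_mhz + extra_ns


# (working-set size, cycles, extra ns) for the rows whose values are known
rows = [
    ("64 K", 2, 0),
    ("256 K", 15, 0),
    ("8 M", 15, 120),
    ("64 M", 100, 120),
    ("...", 100, 240),
]

for size, cycles, extra in rows:
    print(f"{size:>6}: {total_ns(cycles, extra):.0f} ns")
```

The 1 M row is left out because its nanosecond component is unknown in the table.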

Notes: The L1 and L2 caches are exclusive, so a read that brings a line from L2 into L1 also moves a cache line from L1 to L2. Because L2 uses a pseudo-random replacement policy, the line arriving from L1 can evict an L2 line that is not the least recently used one.
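
The behaviour in the note can be illustrated with a toy model. This is only a sketch under loud assumptions: both caches are treated as fully associative, L1 is given LRU replacement, and Python's random module stands in for the unspecified pseudo-random L2 policy. The class name ExclusiveCaches and its parameters are hypothetical and do not reflect the real cache geometry.

```python
import random


class ExclusiveCaches:
    """Toy exclusive hierarchy: a line lives in L1 or in L2, never in both."""

    def __init__(self, l1_lines, l2_lines, seed=0):
        self.l1 = []                  # oldest line first (LRU in L1 is an assumption)
        self.l2 = [None] * l2_lines   # L2 slots; None means empty
        self.l1_lines = l1_lines
        self.rng = random.Random(seed)

    def access(self, line):
        """Return 'L1', 'L2' or 'RAM' depending on where the line was found."""
        if line in self.l1:
            self.l1.remove(line)
            self.l1.append(line)      # mark as most recently used
            return "L1"
        if line in self.l2:
            self.l2[self.l2.index(line)] = None  # exclusive: the line leaves L2
            where = "L2"
        else:
            where = "RAM"
        self.l1.append(line)
        if len(self.l1) > self.l1_lines:
            victim = self.l1.pop(0)
            # Pseudo-random slot choice: this may overwrite a recently used line.
            self.l2[self.rng.randrange(len(self.l2))] = victim
        return where
```

Looping over a working set larger than L1 but smaller than L1 plus L2 and counting the "RAM" results after the first pass shows how random victims in L2 can cause misses even though the data would fit in the combined capacity.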

Pipeline

Branch misprediction penalty = 6 cycles.
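
As a back-of-the-envelope illustration of what the penalty means for a code sequence, the sketch below adds 6 cycles per mispredicted branch to a base cycle count. The function name and the simple additive model are assumptions; real code may overlap or hide part of the penalty.

```python
MISPREDICT_PENALTY = 6  # cycles, the figure quoted above


def cycles_with_branches(base_cycles, branches, mispredict_rate):
    """Estimate total cycles: base cycles plus 6 per mispredicted branch."""
    return base_cycles + branches * mispredict_rate * MISPREDICT_PENALTY
```

For example, 100 branches with a 5% misprediction rate add 100 * 0.05 * 6 = 30 cycles under this model.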

# Name Description
1 A Address generation
2 P Preliminary Fetch: start of the I-cache access; Branch Predictor (BP) access.
3 F Fetch instructions from I-cache (second half of the I-cache access)
4 B Branch target computation
5 I Instruction group formation: instructions fetched from the I-cache are entered as a group into the instruction queue (4 groups * 4 instructions).
6 J Grouping: a group of instructions is dequeued from the instruction queue and prepared for being sent to the R-stage. If the R-stage is expected to be empty at the end of the current cycle, the group is sent to the R-stage.
7 R Register access (dispatch/dependency checking stage): The integer working register file is accessed. The register and pipeline dependencies between the instructions in the group and the instructions in the execution pipelines are calculated concurrently with the register file access. If a dependency is found, the dependent instruction and any older instruction in the group is held in the R-stage until the dependency is resolved.
8 E Execute
9 C Cache: The D-cache delivers results for doubleword (64-bit) and unsigned word (32-bit) integer loads in the C-stage. The D-TLB access is initiated in the C-stage and proceeds in parallel with the D-cache access.
10 M Miss detect: D-cache misses are determined in the M-stage by a comparison of the physical address from the D-TLB to the physical address in the D-cache tags. If the load requires additional alignment or sign extension (such as signed word, all halfword, and all byte loads), it is carried out in this stage, resulting in a three-cycle latency for those load operations.
11 W Write: The MS integer pipeline results are written into the working register file. The results of the D-cache miss are available in this stage and the requests are sent to the L2-cache if needed.
12 X eXtend: last execution stage for most floating-point operations (except divide and square root) and for all VIS instructions. Floating-point results from this stage are available for bypass to dependent instructions that will be entering the C-stage in the next cycle.
13 T Trap: Traps, including floating-point and integer traps, are signalled in this stage.
14 D Done: Integer results are written into the architectural register file in this stage. At this point, they are fully committed and are visible to any traps generated from younger instructions in the pipeline.
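
The C- and M-stage descriptions imply that integer load latency on a D-cache hit depends on the load type. The sketch below encodes that rule. The three-cycle figure comes from the M-stage description; the two-cycle figure is an assumption taken from the "TLB + L1" row of the latency table. The function name is hypothetical.

```python
def load_latency_cycles(size_bytes, signed):
    """Integer load latency in cycles on a D-cache hit (sketch, see assumptions)."""
    if size_bytes == 8 or (size_bytes == 4 and not signed):
        # Doubleword and unsigned word loads: data is delivered in the C-stage.
        return 2
    if size_bytes in (1, 2, 4):
        # Signed word, all halfword and all byte loads: aligned or
        # sign-extended in the M-stage, which costs one more cycle.
        return 3
    raise ValueError("unsupported load size")
```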

Pipeline Recirculation: When a dependency is encountered in or before the dispatch R-stage, the pipeline is stalled. Most dependencies, like register or FV dependencies, are resolved in the R-stage. When a dependency is encountered after the dispatch R-stage, the pipeline is recirculated. Recirculation involves resetting the PC back to the instruction that invoked the recirculation. Instructions older than the dependent instruction continue to execute. The offending instruction and all younger instructions are recirculated. The offending instruction is retried and goes through the entire pipeline again. Upon recirculation, the instruction responsible for the recirculation becomes a single-group instruction that is held in the R-stage until the dependency is resolved.
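
The split described above can be sketched as a small function over a list of in-flight instructions in program order (oldest first). This is only a conceptual model, not a description of the hardware logic; the function name and the list representation are assumptions.

```python
def recirculate(in_flight, offender):
    """Split in-flight instructions around the one that caused recirculation.

    Returns (older, retry_group, younger):
      older       - instructions that continue to execute
      retry_group - the offender alone, held in the R-stage as a single group
      younger     - instructions that are refetched after the offender
    """
    idx = in_flight.index(offender)
    older = in_flight[:idx]
    retry_group = [offender]
    younger = in_flight[idx + 1:]
    return older, retry_group, younger
```

Calling recirculate(["a", "b", "c", "d"], "c") leaves "a" and "b" executing, retries "c" on its own, and sends "d" back through the pipeline behind it.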

Load Instruction Dependency: In the case of a load instruction miss in a primary cache, the pipeline recirculates and the load instruction waits in the R-stage. When the data is returned in the D-cache fill buffer, the load instruction is dispatched again and the data is provided to the load instruction from the fill buffer. The pipeline logic inserts two helpers behind the load instruction to move the data in the fill buffer to the D-cache. The instruction in the instruction fetch stream, after the load instruction, follows the helpers and will re-group with younger instructions, if possible.
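
The resulting dispatch order after a D-cache fill can be written down as a short sketch. The string "helper" is only a placeholder for the two internal helper operations, whose actual form is not described here; the function name is hypothetical.

```python
def redispatch_after_fill(load, following):
    """Dispatch order once the fill buffer holds the data for a missed load."""
    # Two helpers move the data from the fill buffer into the D-cache.
    helpers = ["helper", "helper"]
    # The load goes first, then the helpers, then the rest of the fetch stream.
    return [load] + helpers + list(following)
```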

Links

UltraSPARC III at Wikipedia