Sun UltraSPARC II

L1 Data cache = 16 KB, write-through, non-allocating on write miss, direct mapped, two 16-byte sub-blocks per line, virtually indexed / physically tagged.
L1 Instruction cache = 16 KB, pseudo-two-way set-associative cache with 32-byte blocks.
L2 cache: 256 KB - 8 MB, 64-byte line size.
Data TLB size = 64 items.
Instruction TLB size = 64 items.
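
The cache parameters above fix the number of lines, sets, and address bits each cache uses. The following is a minimal sketch of that arithmetic; the helper name `geometry` is hypothetical, and the 32-byte L1 data line is taken as two 16-byte sub-blocks, as stated above.

```python
KB = 1024

def geometry(size_bytes, line_bytes, ways):
    """Return (lines, sets, offset_bits, index_bits) for a cache.

    Assumes all sizes are powers of two.
    """
    lines = size_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return lines, sets, offset_bits, index_bits

# L1 data cache: 16 KB, direct mapped, line = two 16-byte sub-blocks
l1d = geometry(16 * KB, 2 * 16, ways=1)
# L1 instruction cache: 16 KB, two-way, 32-byte blocks
l1i = geometry(16 * KB, 32, ways=2)
# L2 line counts at the smallest and largest listed sizes (64-byte lines)
l2_lines = [(256 * KB) // 64, (8 * 1024 * KB) // 64]

print("L1 D:", l1d)
print("L1 I:", l1i)
print("L2 lines:", l2_lines)
```

For the L1 data cache, offset plus index bits come to 14, one more than the 13-bit offset of an 8 KB page, so one index bit comes from the virtual page number. That is consistent with the "virtually indexed / physically tagged" description.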

Sun UltraSPARC II 400 MHz x 6 (Sun Enterprise 4500, RAM: FPM 60 ns)

L2 cache size = 2 MB.

8 KB pages mode

Size	Latency (cycles)	Description
16 KB	2	TLB + L1
512 KB	10	+ 8 (L1 cache miss, L2 hit)
2 MB	60	+ 50 (TLB miss)
...	60 + 200 ns	+ RAM (L2 cache miss)
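
The table mixes two units: plain numbers and explicit nanoseconds. As a rough sketch, the rows can be put on one scale by converting to nanoseconds, assuming the plain numbers are core clock cycles at 400 MHz (that reading is an assumption). The helper `to_ns` is hypothetical.

```python
def to_ns(cycles, clock_mhz, extra_ns=0.0):
    """Convert a cycle count plus a fixed nanosecond term to total ns."""
    return cycles * 1000.0 / clock_mhz + extra_ns

CLOCK_MHZ = 400  # from the machine description above

# (description, cycles, extra ns) taken from the table rows
rows = [
    ("TLB + L1", 2, 0),
    ("L1 miss, L2 hit", 10, 0),
    ("TLB miss", 60, 0),
    ("L2 miss to RAM", 60, 200),
]

for name, cycles, extra in rows:
    print(f"{name:16s} {to_ns(cycles, CLOCK_MHZ, extra):6.1f} ns")
```

Under that assumption one cycle is 2.5 ns, so the last row comes to 150 ns of cycles plus 200 ns of RAM access.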

RAM Read B/W (4 Bytes stride) = 180 MB/s
RAM Read B/W (64 Bytes stride) = 270 MB/s
RAM Write B/W (4 Bytes stride) = 220 MB/s
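
The two read strides differ in how many loads fall on each 64-byte L2 line. The sketch below counts loads and distinct lines for a buffer walked at a given stride. The function names are hypothetical, and it does not model how the bandwidth figures themselves were measured.

```python
L2_LINE = 64  # bytes, from the L2 cache description

def accesses_per_line(stride_bytes, line_bytes=L2_LINE):
    """Number of loads that fall on one cache line at the given stride."""
    return max(1, line_bytes // stride_bytes)

def lines_touched(buffer_bytes, stride_bytes, line_bytes=L2_LINE):
    """Distinct cache lines touched when walking a buffer at the given stride."""
    offsets = range(0, buffer_bytes, stride_bytes)
    return len({off // line_bytes for off in offsets})

for stride in (4, 64):
    print(stride, accesses_per_line(stride), lines_touched(4096, stride))
```

Both strides touch every line of the buffer, but the 4-byte stride issues 16 loads per line where the 64-byte stride issues one.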

Sun UltraSPARC IIe 500 MHz (Sun Netra T1 200, PC133 SDRAM)

L2 cache size = 256 KB (direct-mapped or 4-way with random replacement and 2-bit line entry).

8 KB pages mode

Size	Latency (cycles)	Description
16 KB	2	TLB + L1
128 KB	10	+ 8 (L1 cache miss, L2 hit)
512 KB	10 + 140 ns	+ RAM (L2 cache miss)
...	60 + 140 ns	+ 50 (TLB miss)
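
In this table the L2 miss row comes before the TLB miss row, the reverse of the Enterprise 4500 table. One plausible reading is that it depends on whether the L2 cache or the data TLB reach (64 entries of 8 KB pages) is exceeded first as the working set grows. The sketch below makes that comparison; `first_boundary` is a hypothetical helper and the model assumes a contiguous working set.

```python
KB, MB = 1024, 1024 * 1024

def first_boundary(l2_bytes, tlb_entries=64, page_bytes=8 * KB):
    """Report which limit a growing working set exceeds first."""
    reach = tlb_entries * page_bytes  # memory covered by the TLB
    if reach < l2_bytes:
        return "TLB miss"
    if reach > l2_bytes:
        return "L2 miss"
    return "both"

print("Enterprise 4500, 2 MB L2:", first_boundary(2 * MB))
print("Netra T1 200, 256 KB L2:", first_boundary(256 * KB))
```

With a 512 KB TLB reach, the 2 MB L2 machine runs out of TLB entries first, while the 256 KB L2 machine misses L2 first. This matches the row order of the two tables.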

RAM Read B/W (4 Bytes stride) = 180 MB/s
RAM Read B/W (64 Bytes stride) = 410 MB/s
RAM Write B/W (4 Bytes stride) = 190 MB/s

Pipeline

4-way superscalar
Execution units: ALU1, ALU2, LSU, FALU1, FALU2, GRU1, GRU2.
Page sizes of 8 KB, 64 KB, 512 KB, and 4 MB.
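
Larger pages extend how much memory the 64-entry TLBs can map without a miss. A minimal sketch of that arithmetic follows, assuming every entry holds a page of the same size (an assumption made here for illustration; `tlb_reach` is a hypothetical helper).

```python
KB, MB = 1024, 1024 ** 2

TLB_ENTRIES = 64                              # data or instruction TLB
PAGE_SIZES = [8 * KB, 64 * KB, 512 * KB, 4 * MB]

def tlb_reach(entries, page_bytes):
    """Bytes of address space covered when all entries map same-size pages."""
    return entries * page_bytes

reach = {page: tlb_reach(TLB_ENTRIES, page) for page in PAGE_SIZES}

for page, covered in reach.items():
    print(f"{page // KB:5d} KB pages -> {covered // KB:7d} KB reach")
```

The reach comes to 512 KB, 4 MB, 32 MB, and 256 MB for the four page sizes.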

#	Name	Description
1	Fetch	Prior to their execution, instructions are fetched from the Instruction Cache (I-cache) and placed in the Instruction Buffer, where eventually they will be selected to be executed. Accessing the I-cache is done during the F Stage. Up to four instructions are fetched along with branch prediction information, the predicted target address of a branch, and the predicted set of the target. The high bandwidth provided by the I-cache (4 instructions/cycle) allows UltraSPARC-IIi to prefetch instructions ahead of time based on the current instruction flow and on branch prediction. Providing a fetch bandwidth greater than or equal to the maximum execution bandwidth assures that, for well behaved code, the processor does not starve for instructions. Exceptions to this rule occur when branches are hard to predict, when branches are very close to each other, or when the I-cache miss rate is high.
2	Decode	After being fetched, instructions are pre-decoded and then sent to the Instruction Buffer. The pre-decoded bits generated during this stage accompany the instructions during their stay in the Instruction Buffer. Upon reaching the next stage (where the grouping logic lives) these bits speed up the parallel decoding of up to 4 instructions. While it is being filled, the Instruction Buffer also presents up to 4 instructions to the next stage. A pair of pointers manages the Instruction Buffer, ensuring that as many instructions as possible are presented in order to the next stage.
3	Grouping	The G-stage logic's main task is to group and dispatch a maximum of four valid instructions in one cycle. It receives a maximum of four valid instructions from the Prefetch and Dispatch Unit (PDU), it controls the Integer Core Register File (ICRF), and it routes valid data to each integer functional unit. The G-stage sends up to two floating-point or graphics instructions out of the four candidates to the Floating-Point and Graphics Unit (FGU). The G-stage logic is responsible for comparing register addresses for integer data bypassing and for handling pipeline stalls due to interlocks.
4	Execute	Data from the integer register file is processed by the two integer ALUs during this cycle (if the instruction group includes ALU operations). Results are computed and are available for other instructions (through bypasses) in the very next cycle. The virtual address of a memory operation is also calculated during the E Stage, in parallel with ALU computation. FLOATING-POINT AND GRAPHICS UNIT: The Register (R) Stage of the FGU. The floating-point register file is accessed during this cycle. The instructions are also further decoded and the FGU control unit selects the proper bypasses for the current instructions.
5	Cache Access	The virtual address of memory operations calculated in the E-stage is sent to the tag RAM to determine if the access (load or store type) is a hit or a miss in the D-cache. In parallel the virtual address is sent to the data MMU to be translated into a physical address. On a load when there are no other outstanding loads, the data array is accessed so that the data can be forwarded to dependent instructions in the pipeline as soon as possible. ALU operations executed in the E-stage generate condition codes in the C Stage. The condition codes are sent to the PDU, which checks whether a conditional branch in the group was correctly predicted. If the branch was mispredicted, earlier instructions in the pipe are flushed and the correct instructions are fetched. The results of ALU operations are not modified after the E Stage; the data merely propagates down the pipeline (through the annex register file), where it is available for bypassing for subsequent operations. FLOATING-POINT AND GRAPHICS UNIT: The X1 Stage of the FGU. Floating-point and graphics instructions start their execution during this stage. Instructions of latency one also finish their execution phase during the X1 Stage.
6	N1	A data cache (D-cache) miss/hit or a TLB miss/hit is determined during the N1 Stage. If a load misses the D-cache, it enters the Load Buffer. The access will arbitrate for the E-cache if there are no older unissued loads. If a TLB miss is detected, a trap will be taken and the address translation is obtained through a software routine. The physical address of a store is sent to the Store Buffer during this stage. To avoid pipeline stalls when store data is not immediately available, the store address and data parts are decoupled and sent to the Store Buffer separately. FLOATING-POINT AND GRAPHICS UNIT: The X2 stage of the FGU. Execution continues for most operations.
7	N2	Most floating-point instructions finish their execution during this stage. After N2, data can be bypassed to other stages or forwarded to the data portion of the Store Buffer. All loads that have entered the Load Buffer in N1 continue their progress through the buffer; they will reappear in the pipeline only when the data comes back. Normal dependency checking is performed on all loads, including those in the load buffer. FLOATING-POINT AND GRAPHICS UNIT: The X3 stage of the FGU.
8	N3	UltraSPARC-IIi resolves traps at this stage.
9	Write	All results are written to the register files (integer and floating-point) during this stage. All actions performed during this stage are irreversible. After this stage, instructions are considered terminated.
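
The grouping limits described for the G stage can be illustrated with a deliberately simplified model: at most four instructions per group, at most two ALU operations (two integer ALUs), one memory operation (one LSU), and two floating-point or graphics operations. This is a sketch only. It ignores register dependencies, branches, and every other real grouping rule, and the names `LIMITS` and `form_groups` are hypothetical.

```python
MAX_GROUP = 4                               # instructions dispatched per cycle
LIMITS = {"alu": 2, "lsu": 1, "fgu": 2}     # per-group limits by unit class

def form_groups(instrs):
    """Split an in-order list of unit classes into dispatch groups.

    A group is closed as soon as the next instruction would exceed
    the group size or the limit for its unit class.
    """
    groups, current = [], []
    used = {kind: 0 for kind in LIMITS}
    for kind in instrs:
        if len(current) == MAX_GROUP or used[kind] == LIMITS[kind]:
            groups.append(current)
            current = []
            used = {k: 0 for k in LIMITS}
        current.append(kind)
        used[kind] += 1
    if current:
        groups.append(current)
    return groups

stream = ["alu", "alu", "alu", "lsu", "fgu", "fgu", "fgu"]
for cycle, group in enumerate(form_groups(stream)):
    print(cycle, group)
```

In this model the third ALU operation cannot join the first group, so the seven-instruction stream takes three groups rather than the two that a pure four-wide count would suggest.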

Links

UltraSPARC II at Wikipedia