Configuration: IBM Power 730 Express server: POWER7 CPU, 3.55 GHz,
2 sockets (8 cores per socket, 16 cores total, 64 threads total), one memory controller per socket,
64 GB RDIMM DDR3 (Registered DDR3-1066 CL7).
- 567 mm^2, 45 nm, 11 layers, Cu, SOI, eDRAM, 1.2 B transistors
- L1 Data cache = 32 KB, 128 B/line, 8-WAY, store-through, EA index, 2 reads + 1 write every cycle
- L1 Instruction cache = 32 KB, 128 B/line, 4-WAY
- L2 cache = 256 KB per core, 128 B/line, 8-WAY, Store-In
- L3 local cache (Fast-L3 Region cache) = up to 4 MB (eDRAM), 128 B/line, 8-WAY, Policy: Partial Victim
- L3 cache = 32 MB per chip (eDRAM), formed from the 4 MB local-L3 regions of all 8 cores,
128 B/line (may return 2 lines per read request). Policy: Adaptive Victim
- Dual DDR3 Memory Controllers per chip. Each DDR3 Memory Controller:
- 8 KB scheduling window
- Connects to up to 4 proprietary memory buffer chips. Differential-signaling interface. Buffer chip:
- 6.4 GHz * 2 bytes (buffer chip -> POWER7 chip, read data) bandwidth
- 6.4 GHz * 1.5 bytes (buffer chip <- POWER7 chip, write data) bandwidth
- dual high-frequency DDR3 DIMM ports (DDR3: 800, 1066, 1333, 1600)
- Scalability up to 32 sockets (360 GB/s SMP bandwidth/chip, 20,000 coherent operations in flight)
- Fetch: 8 instructions, Instruction decode and preprocessing
- 3 BHTs (shared by 4 threads)
- IBUF: 10x4 instr (SMT2: 10x4 instr / thread, SMT4: 5x4 instr / thread)
- 6-wide in-order instruction dispatch: 1 group / cycle (4 non-branch instructions, 2 branches)
- GCT (global completion table): 20 groups (up to 120 in-flight instructions)
The POWER7 core can complete one group per thread pair (T0 and T2 form one pair,
whereas threads T1 and T3 form the other pair) per cycle, for a maximum total of
two group completions per cycle.
- 12 execution units:
- 2 Fixed point units
- 2 Load store units (2 load-store pipes can also execute simple fixed-point operations)
- 4 double-precision floating-point units.
Each of the 4 FPU pipelines can execute a double-precision
multiply-add per cycle, for 8 flops/cycle per core.
The 4 FP units are combined into two 128-bit VSX (Vector-Scalar Extension) units
- 1 Vector unit 128-bit VMX/AltiVec (Vector Multimedia Extension)
- 1 Branch
- 1 Condition register
- 1 Decimal floating point unit
- Unified issue queue (UQ): 2 * 24-entry halves (UQ0 and UQ1).
- In the ST and SMT2 modes, the two physical copies of the GPR have
identical contents. Instructions from the thread(s) can be
dispatched to either one of the issue queue halves (UQ0 or UQ1) in these modes.
- In SMT4 mode, the two copies of the GPR have different contents.
FX and load/store (LS) operations from threads T0 and T1 can only be placed in UQ0,
can only access GPR0, and can only be issued to FX0 and LS0 pipelines.
FX and LS operations from threads T2 and T3 can only be placed in UQ1,
can only access GPR1, and can only be issued to FX1 and LS1 pipelines.
- From IBM docs: the most frequent FX instructions execute in one cycle, and dependent
operations may issue back to back to the same pipeline if they are dispatched to the same UQ half
(otherwise a one-cycle bubble is introduced).
But in real tests the latency of FX instructions is 2 cycles. Why?
- 8-wide issue, out-of-order execution:
- 2 load or store ops
- 2 fixed-point ops
- 2 scalar floating-point, 2 VSX, 2 VMX/AltiVec ops (1 must be a permute op) or 1 DFP op
- 1 branch op
- 1 condition register op
- Segment sizes: 256 MB and 1 TB
- Page sizes: 4 KB, 64 KB, 16 MB, and 16 GB
- D-ERAT : 2 * 64-entry, fully assoc (ST and SMT2: identical contents; SMT4: one 64-entry array shared by T0/T1, the other by T2/T3)
- I-ERAT : 2 * 64-entry
- 32-entry SLB, fully assoc
- TLB size = 512 entries, 4-WAY, shared.
64 KB pages mode (Linux):
| Region size | Latency            | Breakdown                                     |
| 32 KB       | 2 cycles           | ERAT + L1                                     |
| 256 KB      | 8 cycles           | +6 (L2)                                       |
| 4 MB        | 24 cycles          | +16 (L3 local)                                |
| 32 MB       | 158 cycles         | +14 (ERAT miss -> TLB hit) + 120 (L3 global)  |
| ...         | 180 cycles + 70 ns | +22 (TLB miss -> L2) + 70 ns (RAM)            |
- 128-bytes range cross penalty = 36 cycles
- L1 B/W (Parallel Random Read) = 0.59 cycles per access
- L2 -> L1 B/W (Parallel Random Read) = 4.7 cycles per cache line (128 bytes)
- L2 -> L1 B/W (Read, 128 bytes step) = 5.0 cycles per cache line (128 bytes)
- L3-Local -> L1 B/W (Parallel Random Read) = 8.5 cycles per cache line (128 bytes)
- L3-Local -> L1 B/W (Read, 128 bytes step) = 8.5 cycles per cache line (128 bytes)
- L3-Global -> L1 B/W (Parallel Random Read) = 50 cycles per 2 cache lines (256 bytes)
- L3-Global -> L1 B/W (Read, 256 bytes step) = 40 cycles per 2 cache lines (256 bytes)
- RAM Read B/W (Parallel Random Read) = 35 ns / 2 cache lines = 6400 MB/s
- RAM Read B/W (Read, 4 Bytes step) = 7000 MB/s
- RAM Read B/W (Read, 256 Bytes step) = 14300 MB/s
Branch misprediction penalty = 15 cycles.
Execution Latency = 2 cycles for simple dependent integer instructions !!!
Power7 at Wikipedia
B. Sinharoy et al., "IBM POWER7 multicore server processor", IBM Journal of Research and Development, vol. 55, no. 3, 2011.