Intel Sandy Bridge
Configuration
Intel i3-2120 (Sandy Bridge), 3.3 GHz, 32 nm. RAM: 16 GB (4 x 4GB), PC3-10700 (667 MHz) 9-9-9-24-2T.
- L1 Data cache = 32 KB. 64 B/line, 8-WAY. (Write-Allocate?), 2 x 16-byte read ports + 1 x 16-byte store port.
- L1 Instruction cache = 32 KB. 8-WAY. 64 B/line
- L2 Cache = 256 KB. 64 B/line, 8-WAY
- L3 Cache = 3 MB. 64 B/line
- mOp Cache: 1.5K mOps, 8-WAY, 6 mOps / line. Up to 3 lines of 6 mOps each for each aligned and contiguous 32-byte block of code (Agner).
-  Instruction fetch/decode throughput: 16 bytes/clock from ICache, 32 bytes/clock from uop cache (Agner).
-  Each uop cache line is assigned to a specific 32-byte block of code.
-  Instructions that generate multiple uops cannot be split between two uop cache lines.
-  An unconditional jump or call always ends a uop cache line.
-  The same piece of code can have multiple entries in the uop cache if it has multiple jump entries.
-  Each entry in the uop cache has 32 bits of storage space for address and data bits.
-  L1 Data Cache Latency = 4 cycles for simple access via pointer
-  L1 Data Cache Latency = 5 cycles for access with complex address calculation (size_t n, *p; n = p[n]).
-  L2 Cache Latency = 12 cycles
-  L3 Cache Latency = 27.85 cycles
-  RAM Latency = 28 cycles + 49 ns (for open RAM page).  RAM page size = 16 KB?
-  RAM Latency = 28 cycles + 56 ns (for random RAM page).
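Latency figures of this kind are usually obtained with a dependent-load chain such as the n = p[n] pattern quoted above. A minimal sketch in C, assuming a single random cycle over the buffer; build_cycle and chase are hypothetical names, not the tool used for these measurements, and timing (cycle counter around chase, divided by steps) is left to the caller:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fill p[0..count-1] with one random cycle (Sattolo's algorithm), so that
   repeating n = p[n] visits every element before returning to the start.
   Note: rand() has a small range on some platforms; a real benchmark
   would use a wider generator to spread accesses over large buffers. */
void build_cycle(size_t *p, size_t count, unsigned seed)
{
    assert(count > 0);
    for (size_t i = 0; i < count; i++)
        p[i] = i;
    srand(seed);
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j in [0, i), never i itself */
        size_t tmp = p[i];
        p[i] = p[j];
        p[j] = tmp;
    }
}

/* Perform `steps` dependent loads. Each address depends on the previous
   load's result, so the loads cannot overlap and the time per step
   approximates the access latency for the chosen buffer size. */
size_t chase(const size_t *p, size_t start, size_t steps)
{
    size_t n = start;
    while (steps--)
        n = p[n];
    return n;
}
```

The return value of chase should be consumed (printed or stored) so the compiler cannot drop the loop. Varying count sweeps the working set through L1, L2, L3 and RAM.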
2 MB pages mode (64-bit Windows)
-  Data TLB: 32 entries. 4-WAY, Miss Penalty = 16 cycles. Parallel miss: 20 cycles per access
-  PDPTE cache: 4 entries (cover 4 GB). Miss Penalty = 18 cycles.
-  PML4 cache: ? entries.
  Size        Latency       Increase   Description
  32 K     4                           
  64 K     8               4           + 8 (L2)        
 128 K    10               2   
 256 K    11               1
 512 K    20               9           + 16 (L3)
   1 M    24               4
   2 M    26               2
   4 M    27 + 18 ns       1 + 18 ns   + 56 ns (RAM)
   8 M    28 + 38 ns       1 + 20 ns
  16 M    28 + 47 ns            9 ns   
  32 M    28 + 52 ns            5 ns
  64 M    28 + 54 ns            2 ns
 128 M    36 + 55 ns       8 +  1 ns   + 16 (TLB miss)
 256 M    40 + 56 ns       4 +  1 ns
 512 M    42 + 56 ns       2 
1024 M    43 + 56 ns       1 
2048 M    44 + 56 ns       1 
4096 M    44 + 56 ns       0 
8192 M    53 + 56 ns       9           + 18 (PDPTE cache miss)
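The steps in the table line up with the reach of each translation cache (entries x bytes mapped per entry). A small sketch of that arithmetic, assuming 2 MB per DTLB entry and 1 GB per PDPTE; reach_bytes and exceeds_reach are illustrative names:

```c
#include <assert.h>
#include <stdint.h>

/* Memory covered by a translation cache: entries * bytes mapped per entry. */
uint64_t reach_bytes(uint64_t entries, uint64_t bytes_per_entry)
{
    assert(bytes_per_entry != 0);
    return entries * bytes_per_entry;
}

/* Returns 1 if a working set of `size` bytes no longer fits in that reach. */
int exceeds_reach(uint64_t size, uint64_t entries, uint64_t bytes_per_entry)
{
    return size > reach_bytes(entries, bytes_per_entry);
}
```

With 32 DTLB entries of 2 MB the reach is 64 MB, so the 128 M row is the first to show the TLB miss penalty. With 4 PDPTE cache entries of 1 GB the reach is 4 GB, so the 8192 M row is the first to show the PDPTE cache miss.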
4 KB pages mode (64-bit Windows)
- Data TLB L1 size = 64 items. 4-WAY. Miss penalty = 7 cycles. Parallel miss: 1 cycle per access  
- TLB L2 size = 512 items. 4-WAY. Miss penalty = 10 cycles. Parallel miss: 21 cycles per access
- Instruction TLB L1 size = 64 items per thread (128 per core). 4-WAY
- PDE cache = 32 items?
  Size        Latency       Increase   Description
  32 K     4                           
  64 K     8               4           + 8 (L2)
 128 K    10               2   
 256 K    11               1
 512 K    24              13           + 16 (L3) + 7 (L1 TLB miss)
   1 M    30               6
   2 M    32               2
   4 M    39 +  18 ns      7 + 18 ns   + 56 ns (RAM) + 10 (L2 TLB miss)
   8 M    44 +  38 ns      5 + 20 ns
  16 M    49 +  47 ns      5 +  9 ns   
  32 M    51 +  52 ns      2 +  5 ns
  64 M    60 +  54 ns      9 +  2 ns
 128 M    69 +  55 ns      9 +  1 ns   + 18 (PDE cache miss) + 16 (Page walk to L3)
 256 M    76 +  57 ns      7 +  2 ns
 512 M    79 +  70 ns      3 + 13 ns
1024 M    79 +  86 ns      0 + 16 ns   + 56 ns (Page walk to RAM)
2048 M    79 +  93 ns      0 +  7 ns
4096 M    79 + 103 ns      0 + 10 ns
8192 M    88 + 107 ns      9 +  4 ns   + 18 (PDPTE cache miss)
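Table entries of the form "cycles + ns" can be reduced to a single figure in nanoseconds. A minimal sketch, assuming the core runs at the nominal 3.3 GHz listed in the configuration; total_latency_ns is a hypothetical helper name:

```c
#include <assert.h>

/* Convert a "cycles + ns" table entry to nanoseconds.
   cycles    : core-clock part of the latency
   extra_ns  : fixed part already given in nanoseconds
   clock_ghz : core clock in GHz (3.3 assumed for this CPU) */
double total_latency_ns(double cycles, double extra_ns, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return cycles / clock_ghz + extra_ns;
}
```

For the 8192 M row with 4 KB pages (88 + 107 ns) this gives about 133.7 ns. The same row with 2 MB pages (53 + 56 ns) gives about 72.1 ns.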
MISC
- Branch misprediction penalty = 14 cycles (if mOp cache is used).
- Branch misprediction penalty = 17-18 cycles (on mOp cache miss with L1 cache hit).
- 64-byte range (cache line) cross penalty = 5 cycles
- 4096-byte range (page) cross penalty = 24 cycles
- L1 B/W (Parallel Random Read) = 0.54 cycles per one access
- L2->L1 B/W (Parallel Random Read) = 2.50 cycles per cache line
- L2->L1 B/W (Read, 64 bytes step) = 2.10 cycles per cache line
- L2 Write (Write, 64 bytes step) = 6.70 cycles per write (cache line)
- L3->L1 B/W (Parallel Random Read) = 4.65 cycles per cache line
- L3->L1 B/W (Read, 64 bytes step) = 4.92 cycles per cache line
- L3 Write (Write, 64 bytes step) = 9.00 cycles per write (cache line)
- RAM Read B/W (Parallel Random Read) = 8.3 ns / cache line = 7700 MB/s
- RAM Read B/W (Read, 8-64 Bytes step) = 16000 MB/s
- RAM Write B/W (Write, 4-64 Bytes step) = 9200 MB/s
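The per-line figures above convert to bandwidth by dividing the 64-byte line size by the time per line. A short sketch of that conversion, assuming 1 MB = 10^6 bytes; the function names are illustrative only:

```c
#include <assert.h>

#define CACHE_LINE_BYTES 64.0

/* Cycles per cache line -> bytes transferred per core cycle. */
double bytes_per_cycle(double cycles_per_line)
{
    assert(cycles_per_line > 0.0);
    return CACHE_LINE_BYTES / cycles_per_line;
}

/* Nanoseconds per cache line -> MB/s.
   bytes per ns equals GB/s, and multiplying by 1000 gives MB/s. */
double mb_per_s_from_ns(double ns_per_line)
{
    assert(ns_per_line > 0.0);
    return CACHE_LINE_BYTES / ns_per_line * 1000.0;
}
```

8.3 ns per line gives roughly 7710 MB/s, which matches the rounded 7700 MB/s quoted for parallel random RAM reads. 2.10 cycles per line for sequential L2->L1 reads is roughly 30.5 bytes per cycle.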