Intel Ivy Bridge

Intel i7-3770 (Ivy Bridge), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 4 GB (Single PC3-12800 10-10-10-28).

L3 (SLC) cache

Cache latency (in cycles) for reads by different cores from different L3 slices, with additional ALU ops between loads:

  0   1   2   3   4   5   6   7   8   ALU OPs   

  4   5   5   5   5   5   5   5   5   L1
 12  12  12  12  13  12  12  12  12   L2
            
 30  30  30  31  30  30  30  30  30   L3 core 0,3
 29  29  29  30  29  29  29  29  29   L3 core 1,2
                                
 26  27  26  27  26  27  26  27  26   core-N slice-N
 28  31  30  29  28  29  28  29  28   core-0 slice-1 / core-1 slice-0
 32  33  32  33  32  33  32  34  32   core-0 slice-2
 34  33  34  35  34  33  34  33  34   core-0 slice-3
                                
 32  31  30  31  30  29  30  29  30   core-1 slice-2
 32  33  32  33  32  33  32  33  32   core-1 slice-3

When the ALU ops are included, the total L3 iteration latency is always an even number.
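
The even-total observation can be checked against the per-slice rows of the table above. The sketch below assumes that each table entry excludes the ALU ops themselves, so that total = table entry + ALU op count (an interpretation, not something the table states). The names ROWS, totals and odd_totals are made up for illustration.

```python
# Per-slice rows copied from the latency table above.
# List index = number of ALU ops between loads (0..8).
ROWS = {
    "core-N slice-N": [26, 27, 26, 27, 26, 27, 26, 27, 26],
    "core-0 slice-1 / core-1 slice-0": [28, 31, 30, 29, 28, 29, 28, 29, 28],
    "core-0 slice-2": [32, 33, 32, 33, 32, 33, 32, 34, 32],
    "core-0 slice-3": [34, 33, 34, 35, 34, 33, 34, 33, 34],
    "core-1 slice-2": [32, 31, 30, 31, 30, 29, 30, 29, 30],
    "core-1 slice-3": [32, 33, 32, 33, 32, 33, 32, 33, 32],
}


def totals(latencies):
    """Total iteration latency, ASSUMING total = table entry + ALU op count."""
    return [lat + ops for ops, lat in enumerate(latencies)]


def odd_totals(rows):
    """Return (row name, ALU ops, total) for every total that is not even."""
    found = []
    for name, latencies in rows.items():
        for ops, total in enumerate(totals(latencies)):
            if total % 2:
                found.append((name, ops, total))
    return found


if __name__ == "__main__":
    for name, latencies in ROWS.items():
        print(f"{name:33s}", totals(latencies))
    print("odd totals:", odd_totals(ROWS))
```

With the values as listed, every total is even except one: the 34-cycle entry at 7 ALU ops in the core-0 slice-2 row.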

L3 latency penalty (in cycles) for reading from different L3 slices:

Core-0 =##= Slice-0
        || 2c
Core-1 =##= Slice-1
        || 4c
Core-2 =##= Slice-2
        || 2c
Core-3 =##= Slice-3  

Note: the larger penalty between Slice-1 and Slice-2 may be an effect of slice polarity, where some structures operate with a 2-cycle period.
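
The diagram can be turned into a small estimate of the cross-slice penalty. This is a minimal sketch, assuming the per-hop penalties simply add up along the chain of slices and that the 26-cycle core-N slice-N figure (0 ALU ops column) is the base latency. HOP_COST, slice_penalty and estimated_latency are illustrative names.

```python
# Extra cycles between adjacent slices, read off the diagram:
# slice-0 <-2c-> slice-1 <-4c-> slice-2 <-2c-> slice-3
HOP_COST = [2, 4, 2]

# Cycles for a core reading its own slice (core-N slice-N, 0 ALU ops column).
LOCAL_LATENCY = 26


def slice_penalty(core, slice_id):
    """Sum of hop costs between a core and a slice (ASSUMES penalties add up)."""
    lo, hi = sorted((core, slice_id))
    return sum(HOP_COST[lo:hi])


def estimated_latency(core, slice_id):
    """Estimated L3 load latency in cycles under the additive assumption."""
    return LOCAL_LATENCY + slice_penalty(core, slice_id)


if __name__ == "__main__":
    for core in range(4):
        print(f"core-{core}:", [estimated_latency(core, s) for s in range(4)])
```

For core-0 this gives 26, 28, 32 and 34 cycles for slices 0 to 3, which agrees with the 0 ALU ops column of the latency table.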

To read data from a required slice, we use the hash (XOR) functions that map physical address bits to the L3 slice number, as described in [1].
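
The general shape of such an XOR hash can be sketched as follows: each output bit of the slice number is the parity (XOR) of a selected set of physical address bits, and four slices need two output bits. The bit positions in EXAMPLE_MASKS are placeholders chosen only to make the sketch runnable; they are NOT the Ivy Bridge functions, which should be taken from [1]. All names are illustrative.

```python
def parity(value):
    """XOR of all bits in value (1 if an odd number of bits are set)."""
    return bin(value).count("1") & 1


def mask_from_bits(bits):
    """Build an integer mask with the given bit positions set."""
    mask = 0
    for bit in bits:
        mask |= 1 << bit
    return mask


# PLACEHOLDER bit selections, for illustration only.
# These are NOT the real Ivy Bridge hash functions; see [1] for those.
EXAMPLE_MASKS = [
    mask_from_bits([6, 10, 12, 14]),  # selects address bits for slice bit 0
    mask_from_bits([7, 11, 13, 15]),  # selects address bits for slice bit 1
]


def slice_index(phys_addr, masks=EXAMPLE_MASKS):
    """Slice number: bit i is the XOR of the address bits selected by masks[i]."""
    index = 0
    for i, mask in enumerate(masks):
        index |= parity(phys_addr & mask) << i
    return index


if __name__ == "__main__":
    # Consecutive 64-byte cache lines land on different slices.
    for line in range(8):
        addr = line * 64
        print(hex(addr), "-> slice", slice_index(addr))
```

Because the hash is linear over XOR, slice_index(a ^ b) equals slice_index(a) ^ slice_index(b), which is what makes it possible to recover the functions bit by bit from measurements.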

Note: the L3 cache in Sandy Bridge uses a pseudo-LRU replacement policy. The LLC replacement policy in Ivy Bridge, however, looks like random replacement.

2 MB pages mode (64-bit Linux)

  Size        Latency       Increase   Description

  32 K     4                           
  64 K     8                       4   + 8 (L2)        
 128 K    10                       2   
 256 K    11                       1
 512 K    21                      10   + 18 (L3)
   1 M    26                       5
   2 M    28                       2
   4 M    29                       1
   8 M    30                       1
  16 M    30 + 27 ns           27 ns   + 53 ns (RAM)
  32 M    30 + 40 ns           13 ns
  64 M    30 + 47 ns            7 ns
 128 M    38 + 50 ns       8 +  3 ns   + 16 (TLB miss)
 256 M    42 + 52 ns       4 +  2 ns
 512 M    44 + 53 ns       2 +  1 ns
1024 M    45 + 53 ns       1 
2048 M    46 + 53 ns       1 
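
One way to read the Description column is as a chain of incremental penalties, where each level adds its cost to the previous total. The sketch below assumes that reading (an interpretation of the table, not something it states) and accumulates the cycle and nanosecond parts separately. PENALTIES and cumulative are illustrative names.

```python
# (level, extra cycles, extra ns), taken from the 32 K row and the
# Description column of the 2 MB pages table.
PENALTIES = [
    ("L1", 4, 0),
    ("L2", 8, 0),
    ("L3", 18, 0),
    ("RAM", 0, 53),
    ("TLB miss", 16, 0),
]


def cumulative(penalties):
    """Running totals (level, cycles, ns), ASSUMING the penalties add up."""
    result = []
    cycles = 0
    ns = 0
    for level, extra_cycles, extra_ns in penalties:
        cycles += extra_cycles
        ns += extra_ns
        result.append((level, cycles, ns))
    return result


if __name__ == "__main__":
    for level, cycles, ns in cumulative(PENALTIES):
        print(f"{level:9s} {cycles:3d} cycles + {ns:2d} ns")
```

Under this reading the totals are 12 cycles for L2, 30 cycles for L3, 30 cycles + 53 ns for RAM, and 46 cycles + 53 ns once TLB misses are included, which matches the values the table converges to.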

4 KB pages mode (64-bit Linux)

  Size        Latency       Increase   Description

  32 K     4                           
  64 K     8                       4   + 8 (L2)        
 128 K    10                       2   
 256 K    14                       4
 512 K    25                      11   + 18 (L3) +7 (L1 TLB miss)
   1 M    31                       6
   2 M    34                       3
   4 M    41                       7   + 9 (L2 TLB miss)
   8 M    44                       3
  16 M    45 + 27 ns       1 + 27 ns   + 53 ns (RAM)
  32 M    46 + 40 ns       1 + 13 ns
  64 M    49 + 47 ns       3 +  7 ns
 128 M    64 + 50 ns      15 +  3 ns   +  9 (PDE cache miss) + 19 (Page walk to L3)
 256 M    69 + 52 ns       5 +  2 ns   + 
 512 M    76 + 53 ns       7 +  1 ns
1024 M    84 + 53 ns      12 
2048 M    94 + 53 ns      10 
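
The Latency column mixes core cycles with nanoseconds. To compare rows on one scale, the cycle part can be converted using the 3.4 GHz clock given at the top (Turbo Boost off). A minimal sketch, with CPU_GHZ and total_ns as illustrative names:

```python
CPU_GHZ = 3.4  # core clock from the test setup, Turbo Boost off


def total_ns(cycles, extra_ns=0.0, ghz=CPU_GHZ):
    """Convert 'cycles + extra_ns' from the table into nanoseconds."""
    return cycles / ghz + extra_ns


if __name__ == "__main__":
    # (size, cycles, ns) rows copied from the 4 KB pages table.
    for size, cycles, ns in [("8 M", 44, 0), ("16 M", 45, 27), ("2048 M", 94, 53)]:
        print(f"{size:>7s}: {total_ns(cycles, ns):6.1f} ns")
```

For example, the 2048 M row in 4 KB pages mode (94 cycles + 53 ns) comes to roughly 80.6 ns in total.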

MISC

Branch misprediction penalty = 14 cycles.

Links

[1]: Maurice et al., "Reverse Engineering Intel Last-Level Cache Complex Addressing Using Performance Counters", 2015

Ivy Bridge at Wikipedia