By Brian Beach.
NOTE (from Chris Pirazzi): These statistics were collected in early 1997. By the time you read this, clock speed and other system enhancements will have changed the numbers significantly. We recommend you read the text and run the provided test program on the modern platforms you are interested in. In particular, you will find that the R10000 O2 numbers you get from a recent system are significantly better than what is shown here.
This page contains benchmark results for several different SGI computers. For applications that access memory intensively, these numbers can help in tuning the program to attain the highest possible performance on each machine. These tests are designed to mimic the memory activity of programs that do simple processing of large blocks of data, for example running a simple filter on a video-sized image buffer and putting the result into another buffer.
The numbers were gathered with memspeed.c++, a simple test program. The program allocates 2MB memory blocks, and scans through them from beginning to end. The 2MB size was chosen because it is bigger than the cache on all of the machines tested. Reading, writing, and copying are all measured, using a simple for loop that iterates through the memory. The inside of the loop for each of these operations is:
read: | tmp = *src++; |
write: | *dst++ = 0; |
copy: | *dst++ = *src++; |
The program includes three different versions of each test for different data types: uint8_t, uint32_t, and uint64_t. It also runs the tests for both cached and uncached memory.
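The three loop bodies and the per-type variants can be sketched roughly as follows. This is not the code from memspeed.c++: the names `readBlock`, `writeBlock`, `copyBlock`, `megabytesPerSecond`, and `kBlockBytes` are made up for illustration, the timing uses `std::chrono` rather than whatever the original program used, a megabyte is assumed to be 2^20 bytes, and only ordinary cached memory is covered.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Block size from the text: 2MB, bigger than the cache on the machines tested.
constexpr std::size_t kBlockBytes = 2 * 1024 * 1024;

// read: tmp = *src++;
// The volatile sink keeps the compiler from discarding the loads.
template <typename T>
void readBlock(const T* src, std::size_t count) {
    volatile T tmp = 0;
    for (std::size_t i = 0; i < count; ++i) tmp = *src++;
    (void)tmp;
}

// write: *dst++ = 0;
// An optimizing compiler may turn this loop into a memset call.
template <typename T>
void writeBlock(T* dst, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) *dst++ = 0;
}

// copy: *dst++ = *src++;
template <typename T>
void copyBlock(T* dst, const T* src, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) *dst++ = *src++;
}

// Runs fn once and converts the elapsed wall-clock time into MB/s
// (assumption: 1 MB = 2^20 bytes; the original does not say).
template <typename Fn>
double megabytesPerSecond(std::size_t bytes, Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return (bytes / (1024.0 * 1024.0)) / elapsed.count();
}
```

With a `std::vector<std::uint64_t>` named `buf` holding `kBlockBytes / sizeof(std::uint64_t)` elements, `megabytesPerSecond(kBlockBytes, [&] { readBlock(buf.data(), buf.size()); })` would time one 64-bit read pass. Instantiating the same templates with `std::uint8_t` and `std::uint32_t` gives the other two variants.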
The program was compiled with "-n32 -mips4", which produces code that takes advantage of the 64-bit load and store operations, and unrolls the loops pretty well. Here is the loop for reading 64-bit integers:
    .BB57.testReadWrite__GPvPCc:    # 0x1d0
        .loc 1 130 3
        addiu $2,$2,64                          # [0]
        ld $0,-72($2)                           # [1] id:21
        ld $0,-64($2)                           # [2] id:21
        ld $0,-56($2)                           # [3] id:21
        ld $0,-48($2)                           # [4] id:21
        ld $0,-40($2)                           # [5] id:21
        ld $0,-32($2)                           # [6] id:21
        ld $0,-24($2)                           # [7] id:21
        bne $8,$2,.BB57.testReadWrite__GPvPCc   # [8]
        ld $0,-16($2)                           # [8] id:21
The test program runs at the maximum non-degrading priority, and locks itself into memory, to avoid random effects from other applications. All of the tests were run on quiescent systems so that I/O interrupts would not affect the results.
A basic knowledge of the operation of the cache is important if you want to understand the numbers below. All of the machines profiled use a write-back cache, not a write-through cache. This means that before a cache line is loaded, the previous contents of that line must first be written back to memory if they are dirty.
This is what happens when loading from cached memory:
    if the given address is not in the cache then
        write back the cache line (if it's dirty)
        load the cache line that contains the address
    endif
    read from the cache
This is what happens when writing to cached memory:
    if the given address is not in the cache then
        write back the cache line (if it's dirty)
        load the cache line that contains the address
    endif
    write to the cache
For this test program, reading is faster than writing. The blocks are larger than the cache and the read test does nothing but read, so after the first pass through the cache every line is still clean when it comes time to replace it, and nothing has to be written back. Writing is slower because, after the first pass through the cache, every line is dirty when it is replaced, so its previous contents must be written back to memory before the new line is loaded.
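The difference between the two cases can be made concrete with a small simulation of the write-back rule described above. This is only a sketch under stated assumptions: the cache is modeled as direct-mapped for simplicity, the 1MB cache size and 128-byte line size used in the usage note are hypothetical values rather than the parameters of any machine tested here, and `WriteBackCache`, `CacheStats`, and `scanBlock` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct CacheStats {
    std::size_t lineLoads = 0;   // cache lines fetched from memory
    std::size_t writeBacks = 0;  // dirty lines flushed to memory before a load
};

// Direct-mapped write-back cache model (a simplifying assumption).
class WriteBackCache {
public:
    WriteBackCache(std::size_t cacheBytes, std::size_t lineBytes)
        : lineBytes_(lineBytes), lines_(cacheBytes / lineBytes) {
        assert(lineBytes > 0 && !lines_.empty());
    }

    void access(std::uint64_t address, bool isWrite) {
        const std::uint64_t lineNumber = address / lineBytes_;
        Line& line = lines_[lineNumber % lines_.size()];
        if (!line.valid || line.tag != lineNumber) {
            // Miss: write back the old contents if dirty, then load the line.
            if (line.valid && line.dirty) ++stats_.writeBacks;
            ++stats_.lineLoads;
            line.tag = lineNumber;
            line.valid = true;
            line.dirty = false;
        }
        if (isWrite) line.dirty = true;  // a write leaves the line dirty
    }

    const CacheStats& stats() const { return stats_; }

private:
    struct Line {
        std::uint64_t tag = 0;
        bool valid = false;
        bool dirty = false;
    };
    std::size_t lineBytes_;
    std::vector<Line> lines_;
    CacheStats stats_;
};

// Scans a block from beginning to end in 64-bit steps, `passes` times.
CacheStats scanBlock(std::size_t cacheBytes, std::size_t lineBytes,
                     std::size_t blockBytes, bool isWrite, int passes) {
    WriteBackCache cache(cacheBytes, lineBytes);
    for (int pass = 0; pass < passes; ++pass)
        for (std::uint64_t address = 0; address < blockBytes; address += 8)
            cache.access(address, isWrite);
    return cache.stats();
}
```

Scanning a 2MB block twice through the hypothetical 1MB cache loads the same number of lines whether reading or writing, but only the write scan incurs write-backs: every line load after the cache has filled once must first flush a dirty line, which is the extra memory traffic that makes the write numbers lower.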
On some machines (such as O2), uncached writes are pipelined in the memory controller if they're the same width as the memory. The CPU does not have to wait after issuing such a write. This is why the 64-bit uncached writes are so much faster than anything else on the O2.
Here are the numbers (in megabytes per second) for 64-bit read and write access:
machine | read cached | write cached | read uncached | write uncached |
---|---|---|---|---|
O2 180MHz R5K PC | ??? | ??? | ??? | ??? |
O2 180MHz R5K SC | 74.281 | 58.704 | 21.774 | 180.378 |
O2 150MHz R10K SC | 55.968 | 48.809 | 10.704 | 127.330 |
Octane 195MHz R10K SC | 270.135 | 201.890 | 272.189 | 200.687 |
Onyx2 190MHz R10K SC | 259.539 | 142.712 | 253.769 | 144.453 |
Indigo2 200MHz R4K SC | 71.537 | 57.981 | 10.758 | 19.460 |
O2 180MHz R5K PC

I still haven't found one of these to test on.
O2 180MHz R5K SC

test (MB/s) | bytes | 32-bit ints | 64-bit ints | bcopy |
---|---|---|---|---|
read cached | 52.875 | 71.479 | 74.281 | - |
write cached | 49.021 | 58.678 | 58.704 | - |
read uncached | 2.714 | 10.893 | 21.774 | - |
write uncached | 5.525 | 22.099 | 180.378 | - |
copy cached to cached | 18.097 | 28.380 | 31.388 | 32.665 |
copy cached to uncached | 5.177 | 17.477 | 51.948 | 52.333 |
copy uncached to cached | 2.487 | 8.816 | 15.213 | 15.833 |
copy uncached to uncached | 1.786 | 7.142 | 16.517 | 19.641 |
O2 150MHz R10K SC

test (MB/s) | bytes | 32-bit ints | 64-bit ints | bcopy |
---|---|---|---|---|
read cached | 41.966 | 51.072 | 55.968 | - |
write cached | 41.017 | 46.793 | 48.809 | - |
read uncached | 1.338 | 5.353 | 10.704 | - |
write uncached | 5.625 | 22.492 | 127.330 | - |
copy cached to cached | 15.066 | 20.802 | 22.518 | 23.856 |
copy cached to uncached | 5.124 | 16.499 | 40.900 | 43.502 |
copy uncached to cached | 1.300 | 4.848 | 8.739 | 8.714 |
copy uncached to uncached | 1.078 | 4.310 | 9.176 | 9.975 |
Octane 195MHz R10K SC

test (MB/s) | bytes | 32-bit ints | 64-bit ints | bcopy |
---|---|---|---|---|
read cached | 124.430 | 233.586 | 270.135 | - |
write cached | 118.115 | 174.691 | 201.890 | - |
read uncached | 123.799 | 223.918 | 272.189 | - |
write uncached | 117.887 | 172.916 | 200.687 | - |
copy cached to cached | 29.699 | 59.917 | 76.208 | 139.126 |
copy cached to uncached | 46.065 | 103.399 | 121.507 | 156.203 |
copy uncached to cached | 48.635 | 99.496 | 109.698 | 159.957 |
copy uncached to uncached | 25.418 | 62.988 | 88.254 | 158.487 |
Onyx2 190MHz R10K SC

NOTE: Multiple processors don't speed up this benchmark.

NOTE: Buffer size changed to 8MB for this run.

test (MB/s) | bytes | 32-bit ints | 64-bit ints | bcopy |
---|---|---|---|---|
read cached | 112.749 | 222.683 | 259.539 | - |
write cached | 85.087 | 130.116 | 142.712 | - |
read uncached | 114.255 | 224.054 | 253.769 | - |
write uncached | 84.775 | 126.155 | 144.453 | - |
copy cached to cached | 26.531 | 49.233 | 61.999 | 124.973 |
copy cached to uncached | 38.786 | 85.670 | 93.844 | 139.424 |
copy uncached to cached | 38.785 | 83.630 | 91.274 | 145.113 |
copy uncached to uncached | 23.243 | 54.961 | 70.418 | 141.284 |
Indigo2 200MHz R4K SC

NOTE: This machine does not have 64-bit loads and stores.

NOTE: I think that the cached-to-cached copy rate is so low because the Indigo2, unlike the other machines tested, does not have a 2-way associative cache. This means that if the block being read and the block being written line up in the cache, every read and write must reload the cache line.

test (MB/s) | bytes | 32-bit ints | 64-bit ints | bcopy |
---|---|---|---|---|
read cached | 31.829 | 62.903 | 71.537 | - |
write cached | 29.881 | 55.780 | 57.981 | - |
read uncached | 2.670 | 10.675 | 10.758 | - |
write uncached | 4.864 | 19.460 | 19.460 | - |
copy cached to cached | 0.410 | 1.597 | 3.060 | 15.852 |
copy cached to uncached | 4.685 | 17.040 | 17.046 | 17.029 |
copy uncached to cached | 2.425 | 8.640 | 8.876 | 9.038 |
copy uncached to uncached | 1.837 | 7.346 | 7.182 | 6.983 |
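The associativity explanation suggested in the Indigo2 note can be checked with a small miss-counting model. This is a sketch of the hypothesis, not a measurement: the 1MB cache size, 128-byte line size, LRU replacement, and buffer addresses used in the usage note are assumptions chosen for illustration, and `countCopyMisses` is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts cache misses for a 64-bit copy loop (*dst++ = *src++) in a
// set-associative cache with LRU replacement. ways == 1 is direct-mapped.
std::size_t countCopyMisses(std::size_t cacheBytes, std::size_t lineBytes,
                            std::size_t ways, std::uint64_t srcBase,
                            std::uint64_t dstBase, std::size_t copyBytes) {
    assert(lineBytes > 0 && ways > 0 && cacheBytes >= lineBytes * ways);
    const std::size_t sets = cacheBytes / (lineBytes * ways);

    struct Way {
        std::uint64_t line = 0;
        bool valid = false;
        std::uint64_t lastUse = 0;  // 0 means never used
    };
    std::vector<Way> cache(sets * ways);
    std::uint64_t clock = 0;
    std::size_t misses = 0;

    auto touch = [&](std::uint64_t address) {
        const std::uint64_t line = address / lineBytes;
        Way* set = &cache[(line % sets) * ways];
        ++clock;
        Way* victim = &set[0];
        for (std::size_t w = 0; w < ways; ++w) {
            if (set[w].valid && set[w].line == line) {
                set[w].lastUse = clock;  // hit
                return;
            }
            // Track the least recently used way; unused ways have lastUse 0.
            if (set[w].lastUse < victim->lastUse) victim = &set[w];
        }
        ++misses;  // miss: replace the least recently used way
        victim->line = line;
        victim->valid = true;
        victim->lastUse = clock;
    };

    for (std::uint64_t offset = 0; offset < copyBytes; offset += 8) {
        touch(srcBase + offset);  // load from the source block
        touch(dstBase + offset);  // store to the destination block
    }
    return misses;
}
```

With the hypothetical 1MB cache and 128-byte lines, copying 2MB from address 0 to address 4MB (so the two blocks line up in the cache) misses on every single access in the direct-mapped case, because each load evicts the line the previous store just used and vice versa. With two ways, the source and destination lines can sit side by side in the same set, and only the first access to each line misses.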