(src % 2) == (dest % 2) - probably won't help
(src % 4) == (dest % 4) - definitely will help
(src % 8) == (dest % 8) - often helps even more
even more well aligned   - usually same as 8

Referring to the 4-byte alignment case on some Indigo2-class machines, Micheal Minakami tells us that "when copying between buffers with the same alignment, throughput is approximately 400% greater than when copying between buffers with mismatched alignments." He also notes that "the memory allocation routines such as malloc return double-word (64-bit aligned) buffers." Entries in VLBuffers and DMbufferpools are also always at least 8-byte aligned.
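If you control where the copied data lands, you can often get this speedup by choosing a destination offset whose alignment matches the source's. Here is a minimal sketch of the idea; the helper name aligned_copy and the 7-byte slack region are illustrative, not part of any SGI API:

  #include <string.h>
  #include <stddef.h>

  /* Copy nbytes from src into dest_area, starting at the first offset
     whose 8-byte alignment matches src's. dest_area must be at least
     nbytes + 7 bytes long. Returns the (possibly offset) destination. */
  char *aligned_copy(char *dest_area, const char *src, size_t nbytes)
  {
    char *dest = dest_area;

    /* advance dest until (src % 8) == (dest % 8); at most 7 steps */
    while (((unsigned long)dest % 8) != ((unsigned long)src % 8))
      dest++;

    memcpy(dest, src, nbytes);
    return dest;
  }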
SGI primary caches are virtually indexed, meaning that for a cache_size byte cache whose lines are each cache_line_size bytes long, a reference to a virtual address V will map to cache line number (V % cache_size)/cache_line_size. Say you're copying from a buffer A to a buffer B, and both buffers are 32k-aligned. Say your computer has a cache_size==32k primary data cache. Say your cache is direct-mapped, meaning each cache line can store data for only one address. If the inner loop of your copy looks like "B[i]=A[i]", then the math above tells us that &A[i] and &B[i] will always map to the same cache line. Thus each reference to A[i] and B[i] in the inner loop above will miss in the primary cache, severely hurting performance. To solve this, you need to reduce the number of cache misses by reducing the number of times the same cache line gets reused for a different address. If you can control the addresses of buffers A and B, then allocating A and B such that:
(A % cache_size)/cache_line_size != (B % cache_size)/cache_line_size

will minimize cache thrashing (a sketch of this allocation trick appears after the copy loop below). If you can't control the addresses of A and B (for example, if you're copying between VLBuffers or DMbuffers), then you should try to read a whole cache line at a time from buffer A into registers, and then write a whole cache line at a time from registers into buffer B. For cache_line_size == 32 bytes, this might look something like:
  {
    __uint64_t *A, *B;
    int i;

    ...

    assert((max % 4) == 0);

    for (i = 0; i < max; i += 4) {
      __uint64_t tmp0, tmp1, tmp2, tmp3;

      tmp0 = A[i+0];
      tmp1 = A[i+1];
      tmp2 = A[i+2];
      tmp3 = A[i+3];

      B[i+0] = tmp0;
      B[i+1] = tmp1;
      B[i+2] = tmp2;
      B[i+3] = tmp3;
    }
  }

You should disassemble any C code written like this to make sure the compiler did the right thing, or write the assembly yourself.
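When you do control the allocations, the thrashing can be avoided up front, as promised above. The following is a minimal sketch; the constants and the use of memalign() are illustrative, and you should substitute your machine's actual primary cache parameters:

  #include <stdlib.h>
  #include <malloc.h>

  #define CACHE_SIZE      (32*1024)  /* example primary data cache size */
  #define CACHE_LINE_SIZE 32         /* example primary cache line size */

  /* Allocate two nbytes-long buffers whose starting addresses map to
     different primary cache lines, so that B[i]=A[i] will not thrash
     a direct-mapped cache. Returns nonzero on success. */
  int alloc_offset_buffers(size_t nbytes, char **A, char **B)
  {
    char *raw;

    /* align A to the cache size: (A % CACHE_SIZE) == 0, cache line 0 */
    *A = (char *)memalign(CACHE_SIZE, nbytes);

    /* over-allocate B and skip one cache line, so B starts at cache
       line 1 rather than line 0 */
    raw = (char *)memalign(CACHE_SIZE, nbytes + CACHE_LINE_SIZE);
    *B = raw ? raw + CACHE_LINE_SIZE : NULL;

    return (*A != NULL && *B != NULL);
  }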
If your primary cache is 2-way set-associative instead of direct-mapped, then each set can hold lines for two conflicting addresses at once, so a copy operation ("B[i]=A[i]") will not thrash in the cache, but an operation involving three pointers ("C[i]=A[i]+B[i]") will, and the solutions are the same as above.
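As a sketch, the three-pointer case can use the same unrolled read-then-write shape: A's and B's lines can coexist in a 2-way set, and both are fully consumed into registers before the stores to C evict anything. The function name and the 32-byte-line assumption are illustrative:

  #include <assert.h>
  #include <sys/types.h>

  /* C[i] = A[i] + B[i] over max elements, one 32-byte cache line of
     each input at a time (four 8-byte words per line) */
  void add_arrays(__uint64_t *C, __uint64_t *A, __uint64_t *B, int max)
  {
    int i;

    assert((max % 4) == 0);

    for (i = 0; i < max; i += 4) {
      __uint64_t a0 = A[i+0], a1 = A[i+1], a2 = A[i+2], a3 = A[i+3];
      __uint64_t b0 = B[i+0], b1 = B[i+1], b2 = B[i+2], b3 = B[i+3];

      C[i+0] = a0 + b0;
      C[i+1] = a1 + b1;
      C[i+2] = a2 + b2;
      C[i+3] = a3 + b3;
    }
  }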
SGI secondary caches are indexed by physical address, and it is possible for them to thrash in the same way. Because secondary caches are so much larger, and because a given range of virtual memory (say, our buffers A and B) sometimes maps to discontiguous physical pages, secondary cache thrashing is not as common. On O2 (mvp), it happens that video buffers are made up of 64k-aligned physical memory chunks of 64k in size, and so O2 customers notice this problem more often. Since a normal user-mode app cannot control the physical address of any memory it allocates, the only workaround is the second one above: try to read a whole secondary cache line into registers, and then write from registers into a whole secondary cache line.
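As a rough sketch, assuming a 128-byte secondary cache line (the actual size varies by machine; check hinv), the same pattern scales up like this. As with the code above, disassemble the result to confirm the temporaries really stay in registers:

  #include <assert.h>
  #include <sys/types.h>

  #define SCACHE_LINE_WORDS 16  /* 16 * 8 bytes == 128-byte line (example) */

  /* Copy max 8-byte words from A to B, one secondary cache line at a
     time: read a whole line into temporaries, then write it all out. */
  void copy_scache_lines(__uint64_t *B, __uint64_t *A, int max)
  {
    int i, j;
    __uint64_t tmp[SCACHE_LINE_WORDS];

    assert((max % SCACHE_LINE_WORDS) == 0);

    for (i = 0; i < max; i += SCACHE_LINE_WORDS) {
      for (j = 0; j < SCACHE_LINE_WORDS; j++)
        tmp[j] = A[i+j];
      for (j = 0; j < SCACHE_LINE_WORDS; j++)
        B[i+j] = tmp[j];
    }
  }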