The last time I did any x86 assembly was in 1997, when I wrote the lilo graphical boot loader for the linux hotels project. GCC has changed a bit since then, as has modern CPU architecture.
I've been investigating the use of prefetch instructions on the Athlon, P3 and P4 processors as part of my research for The MetaWrap Project. When performing a prefetch you have, depending on the processor, the MMX, 3DNow!, SSE and SSE2 extensions to help you.
| CPU | L1 Cache | L2 Cache | Write back | Cache lines L1/L2 |
|---|---|---|---|---|
| Pentium Pro | 8 KB $I (4 way) and $D (2 way) | 256 KB/512 KB/1 MB @ CPU speed | 2 write buffers | 32B/32B |
| Pentium II | 16 KB $I and $D; 4 way | 512 KB @ 1/2 CPU speed | 2 write buffers | 32B/32B |
| Celeron 266, 300 | 16 KB $I and $D; 4 way | not present | | 32B/32B |
| Celeron 300A & up | 16 KB $I and $D; 4 way | 128 KB @ CPU speed | | 32B/32B |
| Mobile PII | 16 KB $I and $D; 4 way | 256 KB @ CPU speed | | 32B/32B |
| PII Xeon | 16 KB $I and $D; 4 way | 512 KB/1 MB @ CPU speed | | 32B/32B |
| PIII | 16 KB $I and $D; 4 way | 512 KB @ 1/2 CPU speed | uprated | 32B/32B |
| P4 | 8 KB | 256 KB 8-way, unified | ? | 64B/128B |
| PIII Xeon | 16 KB $I and $D; 4 way | 1 MB and up @ CPU speed | uprated | 32B/32B |
| K6 | 32 KB $I and $D; 2 way | on system board; 128/256 KB? | | 64B/64B |
| Athlon Thunderbird | 128 KB | 256 KB | | 64B/64B |
| Duron | 64 KB | 128 KB | | 64B/64B |
| Athlon Classic | 512 KB | 256 KB | | 64B/64B |
* The source of this data is listed below.
Knowing where to use prefetch is one thing. Knowing which instruction to use (there is usually one per level of cache and one for all levels) is another. Measuring the impact, well, that's just plain nasty. The main issue is that on a multitasking OS your cache is always being impinged on by other processes.
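To make the "which one" question concrete, here is a minimal sketch using GCC's `__builtin_prefetch` rather than hand-written assembly. The function names and the `PREFETCH_AHEAD` distance are my own assumptions for illustration, not tuned values. On x86 with SSE enabled, GCC typically lowers locality 3 to `prefetcht0` (all cache levels) and locality 0 to `prefetchnta` (non-temporal), but check the generated assembly for your own compiler and flags.

```c
#include <assert.h>
#include <stddef.h>

/* How far ahead to prefetch, in elements. 16 ints = 64 bytes, one cache
 * line on an Athlon or P4. This is an assumed starting point, not a tuned value. */
#define PREFETCH_AHEAD 16

/* Sum an array while asking for the next line to be kept in all cache levels.
 * __builtin_prefetch(addr, rw, locality):
 *   rw       0 = we are going to read, 1 = we are going to write
 *   locality 3 = keep in all levels ... 0 = non-temporal, evict soon
 * rw and locality must be compile-time constants, hence two functions. */
long sum_prefetch_all_levels(const int *data, size_t n)
{
    long total = 0;
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Issue one prefetch per line's worth of elements, not per element. */
        if (i % PREFETCH_AHEAD == 0 && i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 3);
        total += data[i];
    }
    return total;
}

/* Same loop, but with the non-temporal hint for data we will not revisit. */
long sum_prefetch_non_temporal(const int *data, size_t n)
{
    long total = 0;
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (i % PREFETCH_AHEAD == 0 && i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 0);
        total += data[i];
    }
    return total;
}
```

Both functions return the same sum. Only the cache hint differs, so any difference you see is purely one of timing.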
Why does prefetch work?
The CPU is much faster than the memory bus. If all the memory you want to access is in the lowest level cache (L1), then your latency (the time you wait between instructions) is low, because you are not waiting for memory to be transferred.
| Location of data | Read time |
|---|---|
| L1 | < 3 ns |
| L2 | < 10 ns |
| RAM | < 100 ns, assuming no page table misses, plus possible delays to write back a dirty cache line |
| Disk | 10+ ms |
| Network disk | 100 ms to tens of seconds |
* The source of this data is listed below.
Aims Of Experiment
To find out how effective explicit prefetch is under the two patterns of data access defined below. The main purpose is to see how effective it is under multi-process/multi-processor conditions. I am also curious to see how things run under a single-process DOS prompt, but for that I would need to drag out my old Borland compiler and port generic_crt and the testing framework code to DOS.
Methods
Explicit prefetch is only really useful under two conditions (and it depends heavily on your pattern of data access).
1) When you are going to spend a lot of time reading and writing a small chunk of memory and then go off and do something else, it pays to prefetch the entire block. Matrix multiplication is a perfect example of this pattern.
2) When you know where you _might_ need to access memory next, but need to perform a computation to find out which location, it pays to prefetch those locations so that the memory is transferred into the cache while you work out which one you want to access, e.g. binary trees, tries, etc.
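The two patterns above could be sketched roughly like this with GCC's `__builtin_prefetch`. Everything here is hypothetical illustration: `CACHE_LINE`, `prefetch_block`, `churn_block`, `struct node` and `tree_find` are names I made up, and the 64-byte line size is an assumption that fits the Athlon and the P4's L1 but not the older 32-byte parts.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size in bytes; 32 on PII/PIII */

/* Pattern 1: request every cache line of a small block before working on it. */
static void prefetch_block(const void *block, size_t bytes)
{
    const char *p = (const char *)block;
    for (size_t off = 0; off < bytes; off += CACHE_LINE)
        __builtin_prefetch(p + off, 1, 3);   /* 1 = we will also write to it */
}

/* Hypothetical workload: read and write the same small block many times. */
long churn_block(int *block, size_t n, int passes)
{
    long checksum = 0;
    assert(block != NULL);
    prefetch_block(block, n * sizeof *block);
    for (int pass = 0; pass < passes; pass++) {
        for (size_t i = 0; i < n; i++) {
            block[i] += 1;
            checksum += block[i];
        }
    }
    return checksum;
}

/* Pattern 2: a binary tree where we do not yet know which child comes next. */
struct node {
    int key;
    struct node *left;
    struct node *right;
};

const struct node *tree_find(const struct node *n, int key)
{
    while (n != NULL && n->key != key) {
        /* Start fetching both candidates, then decide. A prefetch of a NULL
         * or otherwise invalid address does not fault. The gain depends on
         * how much work the comparison does; an int compare is very little. */
        __builtin_prefetch(n->left, 0, 3);
        __builtin_prefetch(n->right, 0, 3);
        n = (key < n->key) ? n->left : n->right;
    }
    return n;
}
```

In a real tree the comparison would be something more expensive (a string compare, say), which gives the memory system time to bring the child nodes in while the CPU is busy.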
Measuring
It's not really an option for me to test with no other processes running. The trick I have found useful is to split my tests up into short sharp bursts and to make sure the cache is flushed out before running each one. To do this I allocate the block of memory that I want to test, then I take a second block as big as the largest level of the cache and read and write every byte in it. I then start a timer, run my test, stop my timer, and do the whole thing again multiple times. Sometimes it takes 5 minutes of execution to obtain 0.5 seconds of real quality cache testing time.
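A stripped-down sketch of that flush-then-time loop might look like the following. This is not my actual testing framework: `LARGEST_CACHE_BYTES`, `flush_cache`, `time_in_bursts` and the `sum_job` workload are all hypothetical names, the 512 KB figure is an assumption you should change to match your CPU, and I use the standard `clock()` here only because it is portable.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Assumed size of the largest cache level; adjust for the CPU under test. */
#define LARGEST_CACHE_BYTES (512 * 1024)

/* Push the data under test out of the cache by reading and writing every
 * byte of a scratch block as big as the largest cache level. */
void flush_cache(void)
{
    static volatile unsigned char scratch[LARGEST_CACHE_BYTES];
    for (size_t i = 0; i < LARGEST_CACHE_BYTES; i++)
        scratch[i]++;
}

typedef void (*test_fn)(void *arg);

/* Run `test` in short bursts, flushing before each one, and return the
 * CPU seconds spent inside the test only (the flushing is not timed). */
double time_in_bursts(test_fn test, void *arg, int bursts)
{
    clock_t inside = 0;
    assert(test != NULL && bursts > 0);
    for (int b = 0; b < bursts; b++) {
        flush_cache();
        clock_t start = clock();
        test(arg);
        inside += clock() - start;
    }
    return (double)inside / CLOCKS_PER_SEC;
}

/* Hypothetical workload: sum an array that starts out cold in the cache. */
struct sum_job {
    const int *data;
    size_t n;
    long result;
};

void sum_job_run(void *arg)
{
    struct sum_job *job = arg;
    long total = 0;
    for (size_t i = 0; i < job->n; i++)
        total += job->data[i];
    job->result = total;
}
```

Calling `time_in_bursts(sum_job_run, &job, 1000)` gives the accumulated time for 1000 cold-cache runs. Most of the wall-clock time goes into `flush_cache`, which is exactly why so little of a long run counts as real testing time.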
If you are performing streaming digital signal processing, where you access a lot of memory but each piece only rarely, then prefetch won't do much for you. In fact there are extended instructions to ensure that you don't clog up the cache with this data, but that is another story.
Good article on prefetch (* and origin of tables above) http://www.iseran.com/Win32/CodeForSpeed/memory.html
Good Tutorial On Using asm within GCC http://www-106.ibm.com/developerworks/linux/library/l-ia.html
Nifty website on good ideas in computer science http://cgi.cse.unsw.edu.au/~ideas/
Good description of how to hand roll assembly instructions http://www.stereopsis.com/sse.html
[UPDATE]
http://www.tuleriit.ee/progs/index.php?rexample=1