Monday, February 16, 2004

The last time I did any x86 assembly was in 1997, when I wrote the lilo graphical boot loader for the linux hotels project. GCC has changed a bit since then, as has modern CPU architecture.

I've been investigating the use of prefetch instructions under the Athlon, P3 and P4 processors as part of my research for The MetaWrap Project. When performing a prefetch, depending on the processor, you have the MMX, 3DNow!, SSE and SSE2 extensions that can help you.

CPU | L1 Cache | L2 Cache | Write back | Cache lines L1/L2
Pentium Pro | 8 KB $I (4 way) and $D (2 way) | 256 KB/512 KB/1 MB @ CPU speed | 2 write buffers | 32B/32B
Pentium II | 16 KB $I and $D; 4 way | 512 KB @ 1/2 CPU speed | 2 write buffers | 32B/32B
Celeron 266, 300 | 16 KB $I and $D; 4 way | not present | | 32B/32B
Celeron 300A & up | 16 KB $I and $D; 4 way | 128 KB @ CPU speed | | 32B/32B
Mobile PII | 16 KB $I and $D; 4 way | 256 KB @ CPU speed | | 32B/32B
PII Xeon | 16 KB $I and $D; 4 way | 512 KB/1 MB @ CPU speed | | 32B/32B
PIII | 16 KB $I and $D; 4 way | 512 KB @ 1/2 CPU speed | uprated | 32B/32B
P4 | 8 KB | 256 KB 8-way, unified | ? | 64B/128B
PIII Xeon | 16 KB $I and $D; 4 way | 1 MB and up @ CPU speed | uprated | 32B/32B
K6 | 32 KB $I and $D; 2-way | on system board; 128/256 KB? | | 64B/64B
Athlon Thunderbird | 128 KB | 256 KB | | 64B/64B
Duron | 64 KB | 128 KB | | 64B/64B
Athlon Classic | 512 KB | 256 KB | | 64B/64B

* src of this data below

Knowing where to use prefetch is one thing. Knowing which prefetch instruction to use (there is usually one per level of cache and one for all levels) is another. Measuring the impact, well, that's just plain nasty. The main issue is that on a multitasking OS, your cache is always being impinged on by other processes.
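As a rough illustration of the "which one" question, here is a minimal sketch that uses GCC's __builtin_prefetch instead of hand-written assembly. The function name sum_with_prefetch and the prefetch distance of 16 elements are assumptions made up for this example, not measured values. The third argument is a locality hint from 0 to 3; on x86 targets with SSE the compiler typically maps these onto the prefetchnta/prefetcht2/prefetcht1/prefetcht0 family, but that mapping is the compiler's decision.

```c
#include <stddef.h>

/* GCC/Clang builtin: __builtin_prefetch(addr, rw, locality)
 *   rw:       0 = prefetch for a read, 1 = prefetch for a write
 *   locality: 3 = keep in all cache levels ... 0 = no temporal locality
 * rw and locality must be compile-time constants. */

/* Sum an array while hinting that data a fixed distance ahead is needed soon.
 * The distance is an assumption; a real value has to be tuned per CPU. */
long sum_with_prefetch(const int *data, size_t n)
{
    const size_t ahead = 16; /* prefetch distance in elements (a guess) */
    long total = 0;

    for (size_t i = 0; i < n; i++) {
        if (i + ahead < n)
            __builtin_prefetch(&data[i + ahead], 0, 3);
        total += data[i];
    }
    return total;
}
```

Changing the last argument from 3 down to 0 is the easy way to try each hint in turn. Whether any of them helps on a plain sequential scan like this one is exactly the kind of thing that has to be measured.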

Why does prefetch work?

The CPU is much faster than the memory bus. If all the memory you want to access is in the lowest-level cache, then your latency (the time you wait between instructions) is low, because you are not waiting for memory to be transferred.

Location of data | Read time
L1 | < 3 ns
L2 | < 10 ns
RAM | < 100 ns, assuming no page table misses, plus possible delays to write back a dirty cache line
Disk | 10+ ms
Network disk | 100 ms to tens of seconds

* src of this data below

Aims Of Experiment

To find out how effective explicit prefetch is under the two patterns of data access defined below. The main purpose is to see how effective it is under multi-process/multi-processor conditions. I am also curious to see how things run under a single-process DOS prompt, but for that I would need to drag out my old Borland compiler and port generic_crt and the testing framework code to DOS.

Methods

Explicit prefetch is only really useful under two conditions (and it depends heavily on your pattern of data access).

1) When you are going to spend a lot of time reading and writing a small chunk of memory and then go off and do something else, it pays to prefetch the entire block. Matrix multiplication is a perfect example of this pattern.

2) When you know where you _might_ need to access memory next, but need to perform a computation to find out which location, it pays to prefetch all the candidate locations so that the memory is transferred into the cache while you work out which one you want, e.g. binary trees, tries, etc.
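The two patterns above can be sketched roughly as follows, again leaning on GCC's __builtin_prefetch. Everything named here (prefetch_block, scale_block, struct node, tree_find, and the 64-byte CACHE_LINE) is hypothetical and chosen only for illustration. The line size in particular is an assumption that differs between the CPUs in the table.

```c
#include <stddef.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* Pattern 1: pull a whole small block into the cache before working on it. */
void prefetch_block(const void *block, size_t bytes)
{
    const char *p = (const char *)block;

    for (size_t off = 0; off < bytes; off += CACHE_LINE)
        __builtin_prefetch(p + off, 1, 3); /* 1 = we intend to write */
    if (bytes > 0)
        __builtin_prefetch(p + bytes - 1, 1, 3); /* cover an unaligned tail */
}

/* Hypothetical workload: scale a small block in place, several passes. */
void scale_block(float *block, size_t count, float factor, int passes)
{
    prefetch_block(block, count * sizeof *block);
    for (int pass = 0; pass < passes; pass++)
        for (size_t i = 0; i < count; i++)
            block[i] *= factor;
}

/* Pattern 2: prefetch every candidate location while deciding which to use. */
struct node {
    int key;
    struct node *left;
    struct node *right;
};

const struct node *tree_find(const struct node *n, int key)
{
    while (n != NULL) {
        /* Request both children before the comparison picks one.
         * Prefetching a NULL pointer is harmless: prefetch does not fault. */
        __builtin_prefetch(n->left, 0, 3);
        __builtin_prefetch(n->right, 0, 3);

        if (key == n->key)
            return n;
        n = (key < n->key) ? n->left : n->right;
    }
    return NULL;
}
```

In tree_find the integer comparison is far too cheap to hide a trip to RAM, so the sketch only shows the shape of the idea. The second pattern pays off when the work done per node (a string compare, a hash, a trie step) takes long enough for the prefetched lines to arrive.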

Measuring

It's not really an option for me to test with no other processes running. The trick I have found useful is to split my tests up into short, sharp bursts and to make sure the cache is flushed out before running each one. To do this I allocate the block of memory that I want to test, then I allocate a second block as big as the largest level of the cache and read and write every byte in it. I then start a timer, run my test, stop the timer, and do the whole thing again multiple times. Sometimes it takes 5 minutes of execution to obtain 0.5 seconds of real quality cache testing time.
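The flush-then-time loop described above might look something like the sketch below. It is not the actual MetaWrap testing framework: the names flush_cache, time_burst, time_bursts and touch_block are invented for this example, EVICT_BYTES is an assumed size for the largest cache, and the standard clock() call stands in for whatever higher-resolution timer a real test would use.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define EVICT_BYTES (1024 * 1024) /* assumed size of the largest cache */

/* Write and read every byte of a scratch buffer as big as the largest
 * cache, so the data under test is pushed out. Returns 0 on success. */
int flush_cache(void)
{
    volatile unsigned char *scratch = malloc(EVICT_BYTES);
    unsigned long sink = 0;

    if (scratch == NULL)
        return -1;
    for (size_t i = 0; i < EVICT_BYTES; i++) {
        scratch[i] = (unsigned char)i; /* write */
        sink += scratch[i];            /* read back (volatile keeps it) */
    }
    (void)sink;
    free((void *)scratch);
    return 0;
}

/* One burst: flush, start timer, run test, stop timer.
 * Returns CPU seconds, or a negative value if the flush failed. */
double time_burst(void (*test)(void *), void *arg)
{
    if (flush_cache() != 0)
        return -1.0;

    clock_t start = clock();
    test(arg);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Repeat the burst and add up only the time spent inside the test. */
double time_bursts(void (*test)(void *), void *arg, int bursts)
{
    double total = 0.0;

    for (int i = 0; i < bursts; i++) {
        double t = time_burst(test, arg);
        if (t < 0.0)
            return -1.0;
        total += t;
    }
    return total;
}

/* Hypothetical test body: increment every byte of the block under test. */
struct block {
    unsigned char *data;
    size_t bytes;
};

void touch_block(void *arg)
{
    struct block *b = arg;

    for (size_t i = 0; i < b->bytes; i++)
        b->data[i]++;
}
```

A call such as time_bursts(touch_block, &b, 1000) returns only the time spent inside the test body, which is why the total wall-clock run is so much longer than the measured time: most of it goes into the flushing.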

If you are performing streaming digital signal processing, where you touch a lot of memory but rarely come back to any of it, then prefetch won't do much for you. In fact there are extended instructions to ensure that you don't clog up the cache with this data, but that is another story.

Good article on prefetch (* and origin of tables above) http://www.iseran.com/Win32/CodeForSpeed/memory.html

Good Tutorial On Using asm within GCC http://www-106.ibm.com/developerworks/linux/library/l-ia.html

Nifty website on good ideas in computer science http://cgi.cse.unsw.edu.au/~ideas/

Good description of how to hand roll assembly instructions http://www.stereopsis.com/sse.html

[UPDATE]

http://www.tuleriit.ee/progs/index.php?rexample=1

 

Monday, February 16, 2004 8:31:28 PM (AUS Eastern Standard Time, UTC+10:00)