|Thierry Atheist wrote:|
Since the entire cache is flushed when multitasking one program to the next ...
Actually the cache does NOT need to be flushed.
There are two ways of implementing a cache.
A CPU cache always needs to remember which main-memory address each cached value came from.
This is simple if the CPU has no MMU, as the address the CPU works with always matches the real memory address. On a system with an MMU, an address translation has to be done. This is a problem, as the translation increases latency. There are two options for implementing this.
A) Storing the cached value with the "real" memory address.
This has the disadvantage of needing a translation before being able to access the cache. This will slow the cache down by 1-2 clocks.
B) Storing the value with the "virtual" address. This has the advantage that the CPU does not need to translate when accessing the cache. The disadvantage is that on a task switch you need to flush the whole cache, as otherwise the cache would produce false hits.
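To make the false-hit problem concrete, here is a small sketch (hypothetical code, not anything from the NATAMI project) of a virtually-tagged cache shared by two tasks whose MMU page tables map the same virtual address to different physical pages:

```python
# Hypothetical sketch: why a virtually-tagged cache (option B) must be
# flushed on a task switch.

class VirtuallyTaggedCache:
    """Tiny cache keyed only by the virtual address."""
    def __init__(self):
        self.lines = {}                      # virtual address -> cached value

    def read(self, vaddr, translate, memory):
        if vaddr in self.lines:              # hit: no translation needed
            return self.lines[vaddr]
        paddr = translate(vaddr)             # miss: translate and fetch
        value = memory[paddr]
        self.lines[vaddr] = value
        return value

    def flush(self):
        self.lines.clear()

memory = {0x1400: "task A data", 0x2400: "task B data"}

# Both tasks use virtual address 0x400, but their page tables map it
# to different physical pages (addresses here are made up).
task_a = lambda v: 0x1000 + (v & 0xFFF)
task_b = lambda v: 0x2000 + (v & 0xFFF)

cache = VirtuallyTaggedCache()
print(cache.read(0x400, task_a, memory))   # "task A data"

# Task switch WITHOUT a flush: task B gets a false hit on task A's data.
print(cache.read(0x400, task_b, memory))   # "task A data" -- wrong!

cache.flush()                              # this is why the flush is needed
print(cache.read(0x400, task_b, memory))   # "task B data"
```

With option A (physical tags) the two entries would have different tags and never collide, at the cost of translating on every access.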
The AMIGA OS does not use virtual addresses, therefore this MMU address translation is not needed for us, and all these issues could be removed by simply not implementing it.
One thing you need to mind.
A cache will normally not be big enough to hold your whole program.
This means you will have cache misses all the time.
The purpose of the L1 cache is to allow an important workloop to run at maximum speed. This means the L1 needs to be big enough to cache your important workloops.
During program execution the cache is constantly reloaded with new code and data. Each new subroutine/workloop that you execute is loaded into the cache and replaces older content. This happens all the time during program execution and also happens on a task switch.
There are 4 things you can do to improve cache performance:
A) Try to maximise cache size.
The 68030 had 2 times 256 bytes.
This is only big enough for very tiny workloops.
The 68040 had 2 times 4096 bytes.
This is much better and allows all normal workloops to be cached.
The 68060 had 2 times 8 KB.
This is even better, but bigger would be better still.
Another good speedup could be gained by increasing this to 2 times 16 KB or 2 times 32 KB ...
B) Keep the latency low.
All the 68K caches had 1 clock latency (0 waitstates); we need to keep it this way. An increase in latency would be bad for performance.
C) Implement it multi-way.
A 1-way (direct-mapped) cache is the simplest but also has a high risk of aliasing effects and gets into problems when subroutines map to the same cache index. E.g. if a program calls 2 subroutines whose addresses map to the same cache lines, then they push each other out of the cache even if the cache would in theory be big enough to keep them both.
We did experiment with 1-way, 2-way, 3-way and 4-way cache.
More ways increase the needed routing resources in the FPGA, which becomes costly and at a certain point will decrease performance.
For the Cyclone chips we saw the performance peak at 2-way because of this. 3-way was also very good.
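The aliasing problem from point C can be shown with a small simulation (hypothetical code; the 256-byte size and 16-byte lines are assumed for illustration, roughly 68030-like):

```python
# Hypothetical sketch: aliasing in a 1-way (direct-mapped) cache
# versus a 2-way set-associative cache of the same total size.

LINE_SIZE = 16          # bytes per cache line (assumed)
NUM_SETS  = 16          # 16 sets * 16 bytes = 256-byte cache (assumed)

def simulate(trace, ways):
    """Count misses for an address trace; LRU replacement within a set."""
    sets = [[] for _ in range(NUM_SETS)]   # each set holds up to `ways` tags
    misses = 0
    for addr in trace:
        index = (addr // LINE_SIZE) % NUM_SETS
        tag   = addr // (LINE_SIZE * NUM_SETS)
        tags = sets[index]
        if tag in tags:
            tags.remove(tag)               # hit: refresh LRU position
        else:
            misses += 1
            if len(tags) == ways:
                tags.pop(0)                # evict least recently used
        tags.append(tag)
    return misses

# Two subroutines at 0x0000 and 0x1000: both map to cache index 0,
# so in a 1-way cache they alias. The program calls them alternately.
trace = [0x0000, 0x1000] * 10

print(simulate(trace, ways=1))   # 20: they evict each other every call
print(simulate(trace, ways=2))   # 2:  both fit, only the first misses
```

With 2 ways both routines live in the same set side by side, which is exactly why the multi-way cache wins even at equal total size.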
D) Do prefetching.
A cache that has intelligence and is able to guess which memory location will be used in the future can prefetch this value from memory and have it ready for the CPU to use. Some CPUs can do this; many CPUs can not. A smart cache will greatly increase performance. In real life, even a 32 KB cache which is able to prefetch can easily outrun a 1 MB cache which does not.
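A minimal sketch of the idea, using a simple next-line prefetcher as an assumed example (real prefetchers are more sophisticated, and this is not a description of any specific CPU): whenever a line is accessed, the cache also fetches the following line, so a workloop streaming sequentially through code or data finds its next line already waiting.

```python
# Hypothetical sketch: a next-line prefetcher on a sequential workloop.

LINE_SIZE = 16   # bytes per cache line (assumed)

def run(trace, prefetch):
    """Count misses; optionally prefetch the line after each access."""
    cached, misses = set(), 0
    for addr in trace:
        line = addr // LINE_SIZE
        if line not in cached:
            misses += 1                # CPU has to wait for memory
            cached.add(line)
        if prefetch:
            cached.add(line + 1)       # fetch the next line ahead of time
    return misses

# A loop streaming through 64 sequential cache lines.
trace = list(range(0, 64 * LINE_SIZE, LINE_SIZE))

print(run(trace, prefetch=False))  # 64 misses: one per new line
print(run(trace, prefetch=True))   # 1 miss: only the very first line
```

Capacity does not help a streaming access pattern at all; hiding the memory latency does, which is why the small prefetching cache can beat the big dumb one.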
Let's look at what is possible with the NATAMI:
68050 CPU clock rate: 100 MHz
L1 cache 2 times 32 KB is possible.
Latency 1 clock is also possible.
L2 not needed and not useful.
Main memory latency 12 clocks.
Combined with prefetching, this is a good solution.