Welcome to the Natami / Amiga Forum

Open Brainstorming About 68060 CPU Card Design
Gunnar von Boehn
Germany
(Moderator)
Posts 5775
12 Jun 2011 10:23


As you know, the NATAMI offers a 68060 CPU card.
The 68060 CPU card comes with a local 68060 CPU, its own FPGA, and local SRAM.
This design allows the CPU card to be used in several ways.
The CPU card could in theory be used as an independent computing node with its own local RAM.

The CPU card can also be used as the main host CPU - owning the memory of the mainboard and using the local SRAM as a 2nd level cache to accelerate the system.

I would like to invite people to discuss with us the usage as host CPU, and which options we have to configure the card most effectively.

Let me give you some more information.

Features of the CPU card:
  One 68060 CPU.
  The 68060 CPU has 8KB ICache and 8KB DCache.
  4MB Local SRAM.
  The SRAM has an extremely low latency.
  The SRAM has a theoretical memory throughput of around 400 MB/sec.
  Of course the 68060 has a bus protocol, and because of
  protocol overhead and depending on the CPU clock rate
  the 68060 cannot use every possible MB/sec of SRAM bandwidth.

  With normal C-code the 68060 can currently read about 160 MB/sec from the SRAM and can copy about 155 MB/sec.

  The CPU card is connected to the NATAMI MX over the SyncZorro port.
  The SyncZorro is a direct connection to the mainboard FPGA which acts as main memory controller/Northbridge.
 
  This design (CPU card - connector - northbridge) is similar to how early PC/Athlon systems and the Pegasos or AMIGAOne were designed, with the difference that the 68060 CPU card features a quite large 2nd level cache.

The mainboard DDR2 memory has a bandwidth of around 1000 MB/sec.

It is worth noting that a 68K CPU is not able to utilize 1000 MB/sec.
Under optimal conditions a 68060 CPU can in theory copy about 250 MB/sec at most. Under "normal" conditions with normal code a 68060 is able to copy about 160 MB/sec.

The SyncZorro interface is able to transfer this amount of data.
This means that even without the local SRAM on the CPU card, the 68060 would be able to copy about 150 MB/sec of main memory.

The local SRAM has some advantages.
It offloads the main memory system: as long as your program operates somewhat locally within 4MB, the CPU will work mostly in its 2nd level cache and not use the mainboard memory. This means the whole mainboard bandwidth of around 1000 MB/sec is available to the Blitter and Co.
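
As a rough illustration of the offloading argument, here is a minimal Python sketch. It assumes the CPU demands about 160 MB/sec in total, that a fraction `l2_hit_rate` of that demand is served by the SRAM, and that only the remainder reaches the DDR2. The function name and the hit-rate parameter are made up for this example, and write-through traffic is ignored.

```python
DDR2_MB_S = 1000.0   # approximate mainboard bandwidth quoted above
CPU_MB_S = 160.0     # approximate 68060 demand with normal code, quoted above

def ddr2_left_for_dma(l2_hit_rate):
    """MB/sec of DDR2 bandwidth not consumed by the CPU (hypothetical model)."""
    if not 0.0 <= l2_hit_rate <= 1.0:
        raise ValueError("hit rate must be between 0 and 1")
    # Only the accesses that miss the SRAM travel down to the mainboard.
    cpu_traffic_to_ddr2 = CPU_MB_S * (1.0 - l2_hit_rate)
    return DDR2_MB_S - cpu_traffic_to_ddr2

for rate in (0.0, 0.5, 1.0):
    print(f"L2 hit rate {rate:.0%}: {ddr2_left_for_dma(rate):.0f} MB/sec left")
```

In this simplified model even a hit rate of zero leaves most of the DDR2 bandwidth free, so how much the offloading matters depends on how much the Blitter and other DMA actually need.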

Another important feature of the SRAM is its very low latency.
This means algorithms which jump around a lot in memory, like software texture lookups (e.g. demo effects like Rotozoom), array lookups etc., will run an order of magnitude faster from the 2nd level cache than they could from any DRAM.

OK, now that you know how the CPU card works,
I would like to ask you for your opinion about the setup.

The 68060 1st level cache supports two options:
A) Caching as Writethrough
B) Caching as CopyBack

Both options have their PROs and CONs.

I wonder if it would make sense to map all the mainboard memory as "write through" cacheable. This means the CPU will push all writes immediately out of its 1st level cache to the 2nd level cache, which will echo all writes down to the mainboard.

The disadvantage of this mode is that the CPU card will generate a constant write stream of data down to the mainboard.
Technically this is no problem, as the SyncZorro port can transfer all the writes and the mainboard memory can easily "eat" all of them.

The advantage of this mode is that the 1st level cache will not get polluted with memory regions which are only written to.
A framebuffer generated in fast memory is one example.
The CPU will just echo each write to such memory immediately and it will get pushed down to the main memory.

The other option would be to run the memory map in "copyback" mode.
Copyback mode allows keeping memory operations "inside" of the CPU.
This means that stack operations, e.g. storing parameters or a return address on the stack, stay inside the CPU and create less noise on the memory bus. Copyback is ideal for stack memory.
For a framebuffer copyback is bad, as copyback would initiate a line load on touched memory regions, which is totally unneeded in such cases.
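
To compare the two modes on the two workloads mentioned above (stack traffic and a framebuffer that is only written), here is a minimal Python sketch. It is not a model of the real 68060 cache: it assumes a direct-mapped 8 KB cache with 16-byte lines, counts every store that leaves the cache as one bus write and every pushed dirty line as one bus write, and leaves write allocation as a parameter. All names are hypothetical.

```python
LINE_BYTES = 16          # assumed cache line size
CACHE_BYTES = 8 * 1024   # 8 KB data cache, as stated above

def simulate(addresses, copyback, write_allocate):
    """Return (bus_writes, line_fills) for a stream of 4-byte stores."""
    n_lines = CACHE_BYTES // LINE_BYTES
    tags = [None] * n_lines      # which memory line each cache slot holds
    dirty = [False] * n_lines    # copyback only: slot modified but not pushed
    bus_writes = 0
    line_fills = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        idx = line % n_lines     # direct-mapped placement (assumption)
        hit = tags[idx] == line
        if not hit and write_allocate:
            if dirty[idx]:       # push the old dirty line before reuse
                bus_writes += 1
                dirty[idx] = False
            tags[idx] = line
            line_fills += 1      # line load triggered by the store
            hit = True
        if copyback and hit:
            dirty[idx] = True    # the store stays inside the cache
        else:
            bus_writes += 1      # the store goes out on the bus
    return bus_writes, line_fills

# Stack-like traffic: 10,000 stores cycling over the same 64 bytes.
stack = [(i * 4) % 64 for i in range(10_000)]
# Framebuffer-like traffic: one linear pass over 64 KB, never read back.
framebuffer = range(0, 64 * 1024, 4)

for name, stream in (("stack", stack), ("framebuffer", framebuffer)):
    for copyback in (False, True):
        for allocate in (False, True):
            writes, fills = simulate(stream, copyback, allocate)
            mode = "copyback" if copyback else "writethrough"
            print(f"{name:12s} {mode:13s} allocate={allocate!s:5s} "
                  f"bus_writes={writes:6d} line_fills={fills:5d}")
```

Under these assumptions copyback absorbs the repeated stack stores completely, while for the written-once framebuffer it adds one line fill per line that writethrough without allocation avoids. Note that a bus write here is a single store in one case and a whole line push in the other, so the counts are not directly comparable as bandwidth.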

What do you think?
What is your opinion?
In your opinion, would a mixed mode make sense?

 

 

Marcel Verdaasdonk
Netherlands

Posts 3975
12 Jun 2011 10:40


The odd duckling in the design is the size of the L2 cache; even GHz-class CPUs had less.

My idea on this is to use only half of the SRAM as CPU cache in writethrough mode, and to use 1MB for stack and context switches and 1MB to shadow Kickstart and drivers.
Doing this would reduce the bandwidth usage of the SyncZorro bus, freeing up bandwidth for future adapters to operate concurrently with the 68060 CPU card.

Or isn't that one of the Goals here?


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
12 Jun 2011 11:27


Marcel Verdaasdonk wrote:

My idea on this is to use only half of the SRAM as CPU cache in writethrough mode, and to use 1MB for stack and context switches and 1MB to shadow Kickstart and drivers.

Yes, setting up the memory this way would be absolutely possible.
The memory map of the 4MB SRAM can be configured in the CPU-Card FPGA.
This means we have the option to use it as 4MB of 2nd level cache,
or as
2MB of 2nd level cache and 2MB of local store,
with the local store holding e.g. Kickstart and extra stack memory.

Which option gives the most benefit is the key question. :-D

Marcel Verdaasdonk
Netherlands

Posts 3975
12 Jun 2011 11:46


Gunnar, you have the physical hardware, not I.
So you should try and find out, and don't ask me. ;)

SID Hervé
France

Posts 663
12 Jun 2011 11:56


Gunnar von Boehn wrote:

  The 68060 1st level cache supports two options:
  A) Caching as Writethrough
  B) Caching as CopyBack
 
  Both options have their PROs and CONs.
 
  I wonder if it would make sense to map all the mainboard memory as "write through" cacheable. This means the CPU will push all writes immediately out of its 1st level cache to the 2nd level cache, which will echo all writes down to the mainboard.
 
  The disadvantage of this mode is that the CPU card will generate a constant write stream of data down to the mainboard.
  Technically this is no problem, as the SyncZorro port can transfer all the writes and the mainboard memory can easily "eat" all of them.
 
  The advantage of this mode is that the 1st level cache will not get polluted with memory regions which are only written to.
  A framebuffer generated in fast memory is one example.
  The CPU will just echo each write to such memory immediately and it will get pushed down to the main memory.
 
  The other option would be to run the memory map in "copyback" mode.
  Copyback mode allows keeping memory operations "inside" of the CPU.
  This means that stack operations, e.g. storing parameters or a return address on the stack, stay inside the CPU and create less noise on the memory bus. Copyback is ideal for stack memory.
  For a framebuffer copyback is bad, as copyback would initiate a line load on touched memory regions, which is totally unneeded in such cases.
 
  What do you think?
  What is your opinion?
  In your opinion, would a mixed mode make sense?

Hello

Could you introduce the mixed mode please?

thank you.

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
12 Jun 2011 12:17


SID Hervé wrote:

Could you introduce the mixed mode please?

By mixed mode I mean running some parts of the memory as writethrough and some as copyback,
or using part of the SRAM as cache and the rest as a local store for the stack.



Gunnar von Boehn
Germany
(Moderator)
Posts 5775
12 Jun 2011 12:18


Marcel Verdaasdonk wrote:

  Gunnar, you have the physical hardware, not I.
  So you should try and find out, and don't ask me. ;)
 

I was asking people with knowledge about Amiga OS and experience in programming to participate in the PRO and CON discussion.
People with a programming background can form a sensible opinion about this based on the information provided.

Of course if you have no programming background but want to learn something you are invited to ask questions.

Marcel Verdaasdonk
Netherlands

Posts 3975
12 Jun 2011 13:34


Okay, then I'll explain what I know. ;)
Why should we reserve a small space of local storage for shadowing the ROM or for storing stacks locally?
This frees up bus cycles that are better used for loading data; the ROM shadow can be dropped when the OS isn't used!
Storing the stack locally has the same advantage, but this isn't limited to the OS or to applications.

What I was thinking of was how to offload the bus and reduce latency in these cases.
This should result in a more reactive system.

SID Hervé
France

Posts 663
12 Jun 2011 13:39


The copyback option seems interesting because of its predispositions and its impact.

Matt Hey
USA

Posts 727
12 Jun 2011 17:46


I think the first priority would be to remap the Amiga zero page into this super fast memory. Address 4, the (exception/interrupt) vector base and the supervisor stack should all be remapped via the MMU. This provides good backward compatibility and should probably be done by modifying ThoR's MuFastZero. Mapping exec.library, expansion.library, and the 68060.library exception code would be the next step. MuFastZero and the 68060.library would probably need to be modified for this. The next step would be the whole ROM/Kickstart. Perhaps this could be done automatically with MAPROM support if it supports 1MB ROMs. Is MAPROM supported already or can it be added in the fpga of the 060 board? Is ThoR's 68060.library running on the board? Is the Natami team working with ThoR? A Natami aware version of MuFastZero would be awesome and greatly speed up exceptions. There may be other important data that would benefit greatly from being in super fast memory (MMU translation tables?) which ThoR could probably tell you. As far as using the rest of the memory for level 2 cache, I think some testing is in order. ThoR might not even be able to predict that (Ok, maybe ;).


SID Hervé
France

Posts 663
12 Jun 2011 18:45


Correct me if I'm wrong please, but it is expected that the 68060 and the N68050 coexist.

I think that the characteristics of the copyback mode (cf. "Copyback mode allows keeping memory operations inside of the CPU. This means that stack operations, e.g. storing parameters or a return address on the stack, stay inside the CPU and create less noise on the memory bus.") argue for this choice.

Megol .

Posts 672
12 Jun 2011 19:58


L1 as write-through
L2 as copy back

This is IMHO the best of both worlds if the L2 cache controller can be intelligent enough.

Reserving some space for a local store isn't a good use of the resources with such a large L2 cache.

Thomas Richter
Germany
(MX-Board Owner)
Posts 1425
12 Jun 2011 20:26


Gunnar von Boehn wrote:

  The 68060 CPU card comes with a local 68060 CPU, its own FPGA, and local SRAM.
  This design allows the CPU card to be used in several ways.
  The CPU card could in theory be used as an independent computing node with its own local RAM.

My first question would be how the CPU interacts with the RAM. Is it mapped somewhere? Which system components can access it? Does it act as a second-level cache?

What about DMA? Can components DMA into this memory, or is it "local" to the 68060? Does the 68060 snoop only this memory, or all memory? (It might be that it cannot snoop all memory because the Natami bus is probably too fast for it).

Gunnar von Boehn wrote:

  The CPU card can also be used as the main host CPU - owning the memory of the mainboard and using the local SRAM as a 2nd level cache to accelerate the system.

If this is a second level cache, where is its cache controller? Or do you mean it is RAM that just appears under a specific hardware address and the CPU can use it to "cache" something manually?
 
Is this a "full" 68060 or the econo^H^H^H embedded controller version? (-;

Gunnar von Boehn wrote:

  The CPU card is connected to the NATAMI MX over the SyncZorro port.

How much of the Zorro bus cycles are visible to the CPU? Say, if the blitter moves chip memory, does the CPU recognize such cycles? On a standard Amiga design, it wouldn't, thus blitter cycles cannot be snooped.

Gunnar von Boehn wrote:
 
  The local SRAM has some advantages.
  It offloads the main memory system: as long as your program operates somewhat locally within 4MB, the CPU will work mostly in its 2nd level cache and not use the mainboard memory.

I guess I would need to understand this somewhat better. If this is really like a cache, then there should be mechanisms that synchronize this cache with the main memory. That is, I somehow need to "push" cache modifications in this RAM area to the board RAM. How is this done? Similarly, how do I allocate lines in this cache, and how do I invalidate it? CPU instructions won't do that since the CPU cache controller of course knows nothing about this memory.

Gunnar von Boehn wrote:
 
  This means the whole mainboard bandwidth of around 1000MB/sec is available to the Blitter and Co.

Why wouldn't it without this memory? Usually, the blitter *may* (if the blitter nasty bit is set) overrule the CPU cycle allocation in the classical design - isn't that possible here?

I'm asking because I'm usually "afraid" of complicated designs - the win needs to outweigh the complexity. Complexity means people need to understand it correctly, and not understanding it correctly leads to bugs, and bugs lead to stability problems.

Gunnar von Boehn wrote:

  Another important feature of the SRAM is its very low latency.
  This means algorithms which jump around a lot in memory, like software texture lookups (e.g. demo effects like Rotozoom), array lookups etc., will run an order of magnitude faster from the 2nd level cache than they could from any DRAM.

I also need to understand a bit more. How much do these programs run out of the first level cache such that they profit from a second level cache? Or is it a second level cache?

 
Gunnar von Boehn wrote:

  OK, now that you know how the CPU card works,
  I would like to ask you for your opinion about the setup.

I first need to understand the question. Then the answer will be 42. No, wait, wrong...
 

Gunnar von Boehn wrote:

  The 68060 1st level cache supports two options:
  A) Caching as Writethrough
  B) Caching as CopyBack
 
  Both options have their PROs and CONs.
 
  I wonder if it would make sense to map all the mainboard memory as "write through" cacheable. This means the CPU will push all writes immediately out of its 1st level cache to the 2nd level cache, which will echo all writes down to the mainboard.

Likely not. This keeps the push-buffer of the CPU busy, and writing out *local* data that is already in the CPU cache and thus very close to the core to memory that is far away from the core does not sound like a good idea to me. Even if the CPU can access the external SRAM at full bus speed, this still means that this access is *likely* slower than any write that remains local in the first level cache.

Gunnar von Boehn wrote:

  The disadvantage of this mode is that the CPU card will generate a constant write stream of data down to the mainboard.
  Technically this is no problem, as the SyncZorro port can transfer all the writes and the mainboard memory can easily "eat" all of them.
 
  The advantage of this mode is that the 1st level cache will not get polluted with memory regions which are only written to.

Huh? Yes, they will. The 68060 uses write-allocation, thus even if you write through, the data will end up in the cache.

Gunnar von Boehn wrote:

  For a framebuffer copyback is bad, as copyback would initiate a line load on touched memory regions, which is totally unneeded in such cases.

I don't quite understand. Basically, if write allocation is a problem, you need to configure the memory regions where it is a problem as non-cacheable, typically chip mem. If the CPU can snoop, writethrough *might* work, though I'm not yet quite clear on the implications.

Thanks,

Thomas

 
 
 
 
 




Wojtek P
Poland

Posts 1597
12 Jun 2011 21:00


Gunnar von Boehn wrote:

Marcel Verdaasdonk wrote:

  My idea on this is to use only half of the SRAM as CPU cache in writethrough mode, and to use 1MB for stack and context switches and 1MB to shadow Kickstart and drivers.
 

 
  Yes, setting up the memory this way would be absolutely possible.
  The memory map of the 4MB SRAM can be configured in the CPU-Card FPGA.
  This means we have the option to use it as 4MB of 2nd level cache,
  or as
  2MB of 2nd level cache and 2MB of local store,
  with the local store holding e.g. Kickstart and extra stack memory.
 
  Which option gives the most benefit is the key question. :-D

Practice (such things have been tried) shows that simply using the full 4MB as cache and letting it work automatically works best.

unless it's direct mapped.

Jakob Eriksson
Sweden
(Moderator)
Posts 1097
12 Jun 2011 23:59


Well, just using it as ROM-mirror and a local, "fast" fast-mem would be very simple to understand anyway. :-/

Team Chaos Leader
USA
(Moderator)
Posts 2094
13 Jun 2011 00:27


Wasting 1MB of SRAM to hold the entire Kickstart is a terrible idea.  Most of the OS is not used in time-critical loops, most of the time.  Large pieces of the OS are never used in time-critical code, ever.

When is the last time you called OpenScreen() or CloseScreen() in a loop 100,000 times in a row?

Ever opened 100,000 fonts in a row?  Didn't think so.

I bang the OS hard but I don't want kickstart permablocking 1MB of my beautiful lovely SRAM.  SRAM = Sexy Rapid Amiga Memory =)

L2 cache FTW!


Marcel Verdaasdonk
Netherlands

Posts 3975
13 Jun 2011 08:11


Team Chaos Leader wrote:

  Wasting 1MB of SRAM to hold the entire Kickstart is a terrible idea.  Most of the OS is not used in time-critical loops, most of the time.  Large pieces of the OS are never used in time-critical code, ever.
   
    When is the last time you called OpenScreen() or CloseScreen() in a loop 100,000 times in a row?
   
    Ever opened 100,000 fonts in a row?  Didn't think so.
   
    I bang the OS hard but I don't want kickstart permablocking 1MB of my beautiful lovely SRAM.  SRAM = Sexy Rapid Amiga Memory =)
   
   
   
    L2 cache FTW!
   
 

 
TCL, it's a matter of taste; I prefer to offload static data from the bus whenever I can, and guess what a ROM is.
My solution saves bus cycles that can be used for something more important, like your game data.

I think something like what I am trying to say has been said in another thread by Claudio (the happy birthday Arne thread).

I can tell you one thing for sure: the ROM code doesn't change on the fly.
Hence it makes a perfect candidate to shadow in RAM to increase speed and save bandwidth.

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
13 Jun 2011 08:21


Matt Hey wrote:

Is MAPROM supported already or can it be added in the fpga of the 060 board?

Yes, this is supported.

Matt Hey wrote:

Is ThoR's 68060.library running on the board?

Yes it is.


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
13 Jun 2011 09:17


Thomas Richter wrote:

My first question would be how the CPU interacts with the RAM. Is it mapped somewhere? Which system components can access it? Does it act as a second-level cache?

Both are possible.
The 4MB can be mapped somewhere as normal Fastmem,
or all of it, or parts of it, can be used as 2nd level cache.
The address controller / cache controller is located inside the FPGA on the CPU Card.

Thomas Richter wrote:

What about DMA? Can components DMA into this memory, or is it "local" to the 68060? Does the 68060 snoop only this memory, or all memory? (It might be that it cannot snoop all memory because the Natami bus is probably too fast for it).

The FPGA on the card is connected as memory controller and busmaster. This means both DMA writing to the card memory and general "echoing" of accesses to allow CPU snooping are possible.
We have not yet used this though.
 

Thomas Richter wrote:

 
Gunnar von Boehn wrote:

  The CPU card can also be used as the main host CPU - owning the memory of the mainboard and using the local SRAM as a 2nd level cache to accelerate the system.
 

  If this is a second level cache, where is its cache controller? Or do you mean it is RAM that just appears under a specific hardware address and the CPU can use it to "cache" something manually?

The cache controller is inside the CPU card FPGA.
The FPGA contains the tag RAM and decides for each CPU access whether the SRAM is allowed to answer a read or whether the access needs to be answered by the DRAM.
The design is very much like a higher clocked, faster version of a Socket-7 Motherboard.
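
The hit/miss decision described above can be sketched in a few lines of Python. This is only an illustration, not the FPGA logic: it assumes a direct-mapped organization, 16-byte lines, and a dictionary standing in for the tag RAM; the class and method names are invented for the example.

```python
SRAM_BYTES = 4 * 1024 * 1024   # 4MB local SRAM, as stated in the thread
LINE_BYTES = 16                # assumed line size

class DirectMappedL2:
    """Toy model of a tag lookup in front of the SRAM (hypothetical)."""

    def __init__(self):
        self.n_lines = SRAM_BYTES // LINE_BYTES
        self.tag_ram = {}      # line index -> tag; sparse stand-in for tag RAM

    def split(self, addr):
        """Split an address into (line index in the SRAM, tag)."""
        index = (addr // LINE_BYTES) % self.n_lines
        tag = addr // SRAM_BYTES
        return index, tag

    def read(self, addr):
        """Return which memory answers the read: 'SRAM' or 'DRAM'."""
        index, tag = self.split(addr)
        if self.tag_ram.get(index) == tag:
            return "SRAM"      # tag matches: the SRAM may answer
        self.tag_ram[index] = tag   # miss: DRAM answers, line is recorded
        return "DRAM"

l2 = DirectMappedL2()
print(l2.read(0x1000))                 # first touch of the line
print(l2.read(0x1004))                 # same 16-byte line again
print(l2.read(0x1000 + SRAM_BYTES))    # different tag, same index
print(l2.read(0x1000))                 # the earlier line was replaced
```

With direct mapping, two addresses exactly 4MB apart compete for the same SRAM line, which is why the last read in the example goes back to DRAM.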



Deep Sub Micron
Germany
(MX-Board Owner)
Posts 566
13 Jun 2011 10:33


Wojtek P wrote:

  unless it's direct mapped.

As far as I know, direct mapped is the only possible way with the current hardware. This is because all the SRAM address lines are directly connected to the CPU (and to the FPGA as well). This direct connection is what makes the design so fast, but it means the FPGA cannot remap CPU accesses to multiple SRAM memory ranges. Only when the tristate buffers of the CPU address lines are in high impedance can the FPGA access an arbitrary SRAM address.

