
Welcome to the Natami / Amiga Forum





N68k Enhancements Revisited
Cesare Di Mauro
Italy

Posts 528
04 Mar 2011 06:32


Team Chaos Leader wrote:

Cesare Di Mauro wrote:

  (2D) Games, by their nature, have different "patterns" (mostly loading custom registers), and are usually "hardware-bound" (most of the work is done by the chipset).
 
  So they aren't good candidates to show how to improve the CPU.
 

  It depends what 2D game you profile.
 
  My 2D Amiga games spend 1% of the time banging hardware registers and 99% banging the CPU.  My games have AI (Amiga Intelligence :D) which benefits from CPU improvements.  Plus I do a LOT of other things with the CPU.

So you have little chipset usage? I am talking in terms of cycles used by the chipset or the CPU, which is essentially the primary metric for judging whether a game is CPU-bound or chipset/hardware-bound on the Amiga (the latter is the usual scenario).

In Fightin' Spirit I used a lot of CPU, because I needed to move the characters' "blocks" from fast (usually slow) memory to chip memory before drawing them, so I basically used a massive amount of MOVEM.L. But in the end most of the time was spent by the Blitter.

Anyway, except for the MOVEM.Ls (for which I see no optimizations, unless a BLOCKMOVE instruction is made available), the CPU usage was quite low, even considering AI, music playback, and other things.

Natami has no such stupid thing as slowmem, and it has a more powerful chipset, so for 2D games I see very little CPU usage, which is essentially banging the hardware.
680x0 Forever!

Sure! :)

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
04 Mar 2011 06:50


Cesare Di Mauro wrote:

so for 2D games I see very little CPU usage, which is essentially banging the hardware.

I agree. A 2D game should need very little CPU.

Jakob Eriksson
Sweden
(Moderator)
Posts 1097
04 Mar 2011 07:55


In the typical case. (Counterexample: a chess game.)

Marcel Verdaasdonk
Netherlands

Posts 3991
04 Mar 2011 09:25


@Jakob let's say a typical 2D game needs very little CPU power.
Chess isn't what I would call a typical 2D game. ;)

Wojtek P
Poland

Posts 1597
04 Mar 2011 12:12


Cesare Di Mauro wrote:

Team Chaos Leader wrote:

 
Cesare Di Mauro wrote:

    (2D) Games, by their nature, have different "patterns" (mostly loading custom registers), and are usually "hardware-bound" (most of the work is done by the chipset).
   
    So they aren't good candidates to show how to improve the CPU.
 

  It depends what 2D game you profile.
 
  My 2D Amiga games spend 1% of the time banging hardware registers and 99% banging the CPU.  My games have AI (Amiga Intelligence :D) which benefits from CPU improvements.  Plus I do a LOT of other things with the CPU.

  So you have little chipset usage? I am talking in terms of cycles used by the chipset or the CPU, which is essentially the primary metric for judging whether a game is CPU-bound or chipset/hardware-bound on the Amiga (the latter is the usual scenario).

  In Fightin' Spirit I used a lot of CPU, because I needed to move the characters' "blocks" from fast (usually slow) memory to chip memory before drawing them, so I basically used a massive amount of MOVEM.L. But in the end most of the time was spent by the Blitter.

  Anyway, except for the MOVEM.Ls (for which I see no optimizations, unless a BLOCKMOVE instruction is made available), the CPU usage was quite low, even considering AI, music playback, and other things.

  Natami has no such stupid thing as slowmem, and it has a more powerful chipset, so for 2D games I see very little CPU usage, which is essentially banging the hardware.
 
680x0 Forever!

  Sure! :)

For two reasons you will not need to do massive block moves with the CPU:

1) 256MB of chipram is more than 2MB. Most probably you will not need to copy much from fastram to chipram.
2) Natami has ca. 100 times higher bandwidth to chipram, so even if you need to copy, it will be fast.


Wojtek P
Poland

Posts 1597
04 Mar 2011 12:14


Marcel Verdaasdonk wrote:

@Jakob let's say a typical 2d game needs very little CPU power.
  Chess isn't what i would call a typical 2d game. ;)

And you cannot be sure that someone will not invent a chess calculation algorithm that makes use of the blitter or the 3D core.

Contrary to PC-style hardware, Amiga hardware is clearly available to the programmer and can be (ab)used to do things other than what it was made for.

Cesare Di Mauro
Italy

Posts 528
04 Mar 2011 13:19


Wojtek P wrote:
  For two reasons you will not need to do massive block moves with the CPU:

  1) 256MB of chipram is more than 2MB. Most probably you will not need to copy much from fastram to chipram.
  2) Natami has ca. 100 times higher bandwidth to chipram, so even if you need to copy, it will be fast.

Requirements have changed too since the Amiga's time.

Now FullHD (1080p), or at least 720p, at 24/32-bit depth is a common and/or desirable target, and it requires both space and bandwidth.

Strictly speaking, we don't need to run games faster at 320x256 with 256 colors.
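
To put rough numbers on the space and bandwidth point, here is a minimal C sketch. The function names and the 60 Hz refresh rate used in the note below are assumptions made for illustration; the post names only the resolutions and colour depths.

```c
#include <stdint.h>

/* Bytes needed to hold one frame at the given size and pixel depth. */
uint64_t frame_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_pixel)
{
    return (uint64_t)width * height * bytes_per_pixel;
}

/* Bytes per second needed just to scan that frame out on every refresh.
   refresh_hz is an assumed parameter; the thread gives no refresh rate. */
uint64_t scanout_bytes_per_second(uint64_t bytes_per_frame, uint32_t refresh_hz)
{
    return bytes_per_frame * refresh_hz;
}
```

Under these assumptions one 1080p frame at 32 bits takes 8,294,400 bytes, about 101 times the 81,920 bytes of a 320x256 frame with 256 colors, and scanning it out at 60 Hz alone needs roughly 498 MB/s.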

And sound, too, demands more for a better user experience, with a lot of positional (or, at least, left & right panning) channels.

Cesare Di Mauro
Italy

Posts 528
04 Mar 2011 13:28


Wojtek P wrote:

Marcel Verdaasdonk wrote:

  @Jakob let's say a typical 2d game needs very little CPU power.
  Chess isn't what i would call a typical 2d game. ;)
 

  And you cannot be sure that someone will not invent a chess calculation algorithm that makes use of the blitter or the 3D core.

  Contrary to PC-style hardware, Amiga hardware is clearly available to the programmer and can be (ab)used to do things other than what it was made for.

Wojtek, I don't know why you fire such out-of-the-world statements.

You simply don't know how a PC works. Not only a modern one, but an old one too.

EXTERNAL LINK 
Take a look at the "Retro" demo, for example. The good old Frogger game runs entirely on the GPU, using quite old technology: Pixel Shader 2.0 (we have 4.1 now)...

Loïc Dupuy
France

Posts 253
04 Mar 2011 13:39


Marcel Verdaasdonk wrote:

@Jakob let's say a typical 2d game needs very little CPU power.
Chess isn't what i would call a typical 2d game. ;)

Nope, it is a real-life 3D game with wood textures :-D

Thierry Atheist
Canada

Posts 1830
05 Mar 2011 01:51


Wojtek P wrote:

Marcel Verdaasdonk wrote:

@Jakob let's say a typical 2d game needs very little CPU power.
Chess isn't what i would call a typical 2d game. ;)

And you cannot be sure that someone will not invent a chess calculation algorithm that makes use of the blitter or the 3D core.

Contrary to PC-style hardware, Amiga hardware is clearly available to the programmer and can be (ab)used to do things other than what it was made for.


Hi Wojtek,

YES!

And isn't chess like a form of ray tracing if you think about it? What speeds that up???

Wojtek P
Poland

Posts 1597
05 Mar 2011 10:36


Cesare Di Mauro wrote:

  You simply don't know how a PC works. Not only a modern one, but an old one too.

I know very well. I'm not saying it is impossible, but it's VERY difficult to write efficient programs on PC hardware.

It's easy to write ones that run 100 times slower than what PC hardware actually makes possible IN THEORY.
Like that example, where 100 times slower is still fast enough.


Megol .

Posts 690
05 Mar 2011 14:13


Trying to post something more on-topic :
 
  An equivalent to the x86 TEST instruction would be nice: it takes two arguments and changes the flags as a logical AND would, but doesn't change anything else. The implementation would be the same as AND but with the data write suppressed.

  An x86 LEA equivalent. This is one of the few really nice things in the x86 ISA, though it is just an artifact of the shared integer/address register pool. A single extension word should be enough if the word/long bit is reused as a data/address register bit. Another variation would reuse the D/A and W/L bits to indicate subtraction, forcing the use of data registers only; yet another could extend the scale field by one bit.
  LEAD 8(D0, D3*4), D1  ; LEA to Data register ;)
  LEAD 8(-D0, -D3*4), D1 ; The other variation

  RTScc/ADDcc/INDcc/DECcc/MOVEcc. While not required, conditional instructions can help to size-optimize common patterns.

  64-bit shifts equivalent to x86 SHRD/SHLD. Useful for fixed-point calculations.

Edit: The LEAD instruction above could support inverting the input operands instead of negating them. As -A = ~A+1 for two's complement numbers, an inversion plus an increased displacement could possibly be faster.
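
The identity in the edit above can be checked with a small C sketch. LEAD is only a proposed instruction, so `lead_negated` and `lead_inverted` are hypothetical models of what it might compute, using 32-bit wraparound arithmetic.

```c
#include <stdint.h>

/* Two's complement negation written two ways. */
uint32_t negate(uint32_t a)               { return 0u - a; }
uint32_t invert_and_increment(uint32_t a) { return ~a + 1u; }

/* Hypothetical "LEAD disp(-base, -index*scale)": negate both inputs. */
uint32_t lead_negated(uint32_t disp, uint32_t base, uint32_t index, uint32_t scale)
{
    return disp + negate(base) + negate(index) * scale;
}

/* Same result using inversion only: each inverted input is short by one,
   so the displacement absorbs the missing 1 and the missing 1*scale. */
uint32_t lead_inverted(uint32_t disp, uint32_t base, uint32_t index, uint32_t scale)
{
    return (disp + 1u + scale) + ~base + ~index * scale;
}
```

So an inverting version would need the displacement bumped by 1 + scale to match the negating version, which an assembler could fold in at assembly time.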

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
05 Mar 2011 14:36


Megol . wrote:

  Trying to post something more on-topic :
   
    An equivalent to the x86 TEST instruction would be nice: it takes two arguments and changes the flags as a logical AND would, but doesn't change anything else. The implementation would be the same as AND but with the data write suppressed.
 

Where is the benefit over using the 68K-CMP instruction?
 
   
 
Megol . wrote:

    An x86 LEA equivalent. This is one of the few really nice things in the x86 ISA, though it is just an artifact of the shared integer/address register pool. A single extension word should be enough if the word/long bit is reused as a data/address register bit. Another variation would reuse the D/A and W/L bits to indicate subtraction, forcing the use of data registers only; yet another could extend the scale field by one bit.
    LEAD 8(D0, D3*4), D1  ; LEA to Data register ;)
    LEAD 8(-D0, -D3*4), D1 ; The other variation
 

   
On 68K we do have the LEA instruction for the purpose of storing a calculated address.

The separation between the AN and DN banks is thankfully very clean on 68K. This clean separation gives us the edge to do pointer increments like (AN)+ for free.

Now creating a LEA variation which updates DN will violate this clean separation and cause forwarding issues.
This can bring more drawbacks than benefits.
 
Your proposed versions which use DN as sources will obviously often produce bubbles when trying to use the EA unit for this.
This is not good.
 
 
 
Megol . wrote:

    RTScc/ADDcc/INDcc/DECcc/MOVEcc. While not required, conditional instructions can help to size-optimize common patterns.
 

Hmm, what is meant by: INDcc/DECcc ?
Do you mean something like ADDQcc ?
 
You say your goal is size optimisation.
How do you want to encode your proposed instructions to reach this goal?
 
 
 
Megol . wrote:

    64-bit shifts equivalent to x86 SHRD/SHLD. Useful for fixed-point calculations.
 

The funny thing is that if you look at the implementation of those instructions, you see that Intel does not regard them as important at all.
These instructions are terribly slow on some recent x86 cores.
 
 
 
Megol . wrote:

  Edit: The LEAD instruction above could support inverting the input operands instead of negating them. As -A = ~A+1 for two's complement numbers, an inversion plus an increased displacement could possibly be faster.
 

The point of the LEA is to use the result of the ADDRESS decoding unit. So it should be clear that this instruction can only ever do calculations which the EA-unit provides.

Cheers

Wojtek P
Poland

Posts 1597
05 Mar 2011 15:14


Megol . wrote:

Trying to post something more on-topic :
 
  An equivalent to the x86 TEST instruction would be nice: it takes two arguments and changes the flags as a logical AND would, but doesn't change anything else. The implementation would be the same as AND but with the data write suppressed.

That doesn't seem on topic, but rather like badly done x86 architecture promotion.
I don't see any of it replacing 68k architecture instructions and giving better performance.

As a great fan of Intel, you should know that shifts are high-latency on x86 CPUs.

Anything that may seem great on them executes slower than a sequence of simpler instructions.

Look at any longer C compiler produced code and try to manually optimize it, then talk.

PS. It seems you don't really know the 68k architecture if you say TEST and LEA are great features of x86 and not of 68k ;)


Wojtek P
Poland

Posts 1597
05 Mar 2011 15:26


Gunnar von Boehn wrote:

  Where is the benefit over using the 68K-CMP instruction?
 
  On 68K we do have the LEA instruction for the purpose of storing a calculated address.

  The separation between the AN and DN banks is thankfully very clean on 68K. This clean separation gives us the edge to do pointer increments like (AN)+ for free.

And it lets you do some calculations without writing to the flags and some with, without spending extra encoding bits on it. That often saves storing flags or repeating compares.

  Now creating a LEA variation which updates DN will violate this clean separation and cause forwarding issues.
  This can bring more drawbacks than benefits.

On a processor that can access L1 cache in 1 cycle, the same time as registers, 8+8 registers are enough. Even if someone ever produces a very fast 68k CPU in which L1 cache takes 2 cycles but which can do out-of-order execution, it will be OK, but no more.

But separating address and data registers gives an advantage in encoding, and implies unsigned and no messing with flags. Being able to do predecrement/postincrement in every load is great too.

Actually the only advantage of Intel products over other CPUs made in similar silicon technology and of similar size is the market advantage of being able to run windoze. As they are produced in hundreds of millions they are cheaper. But ONLY because of that.

The discussion about advantages of many RISC architectures or other CISC architectures over 68k makes sense. All have advantages and disadvantages. But I don't see any "advantages" of the x86 architecture at all, except maybe SIMD units for number crunching.
But those can easily be added to 68k too if needed.

  The funny thing is if you look at the implementation of those instructions then you see that, Intel does not regard these instructions as important at all.

All shifts - excluding the implied short ones in the LEA instruction - are slower than additions on any Intel processor.
Not just 64-bit but even 32-bit ones.


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
05 Mar 2011 15:49


Wojtek P wrote:

Even if someone ever produces a very fast 68k CPU in which L1 cache takes 2 cycles but which can do out-of-order execution, it will be OK, but no more.

Whether a cache access takes 1 or 2 or 3 clocks is of relatively little importance on a normal 68k CPU that is designed like the 040/050 or 060.

The reason is that the 68k can "hide" the cache access time in its pipeline.

Out of Order is NOT needed for this on the 68K.

This is different from a RISC CPU.
A RISC CPU pipeline is different.
Therefore the RISC cannot do free cache reads like the 68K.

The RISC CPU needs an extra instruction for this, which is a drawback of RISC, and RISC code either needs to be recompiled for each CPU or the chip needs Out-of-Order to work around this latency.

For a 68K, OoO is much less needed. :-D

Megol .

Posts 690
05 Mar 2011 15:50



The TEST instruction is an AND, not a compare. Example use:
MOVE.L #$FF000000,D1
TEST.L D1,D0 ; D0 not touched
BEQ.S MSB_ZERO
--
x86 LEA is used in many cases that have nothing to do with addresses, and yes, it would be good to have the equivalent in 68k. Replicating the functionality of the address generation stage is how x86 does it.
If you don't like the LEA name, SADD (Scaled ADD) could do; however, a three-operand subset is very useful, and reusing the EA extension word format would be natural.
--
It should be INCcc/DECcc (increment/decrement); while ADDQcc would work, I thought it would be harder to fit that encoding. I haven't checked the encoding at all.
--
Yonah (2006):
  SHLD/SHRD r,r,cl/i 2µops latency 2 throughput 1/2
Merom (2006):
  SHLD/SHRD r,r,cl/i 2µops latency 2 throughput 1
Penryn (2008):
  SHLD/SHRD r,r,cl/i 2µops latency 2 throughput 1
Nehalem (2009):
  SHLD r,r,cl/i 2µops latency 3 throughput 1
  SHRD r,r,cl/i 2µops latency 4 throughput 1
Sandy Bridge (2011):
  SHLD/SHRD r,r,i   1µop  latency 1 throughput 2
  SHLD/SHRD r,r,cl  4µops latency 2 throughput 1/2

Does that equal terribly slow to you? Such a short latency will be transparent in most cases, and the throughput is very good. Remember that Intel's P6 and its successors are internally relatively clean RISC-style designs, so every instruction generating more than one result requires at least 2 µops.

Megol .

Posts 690
05 Mar 2011 15:59


Wojtek P wrote:

Megol . wrote:

  Trying to post something more on-topic :
   
    An equivalent to the x86 TEST instruction would be nice: it takes two arguments and changes the flags as a logical AND would, but doesn't change anything else. The implementation would be the same as AND but with the data write suppressed.
 

  Doesn't seem on topic, but rather badly done x86 architecture promotion.
  I don't see any of it being able to replace 68k architecture instruction giving better performance.
 
  As a great fan of Intel, you should know that shifts are high-latency on x86 CPUs.

  Anything that may seem great on them executes slower than a sequence of simpler instructions.

  Look at any longer C compiler produced code and try to manually optimize it, then talk.

  PS. It seems you don't really know the 68k architecture if you say TEST and LEA are great features of x86 and not of 68k ;)

How would you translate this into 68k code:
LEA ESI, 20[EAX+EAX*4]

Yes shifts are very slow: latency 1 throughput 2
Very slow indeed.

If you meant SHRD/SHLD, try to do the equivalent with the same latency/throughput.


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
05 Mar 2011 16:18


Megol . wrote:

  The TEST instruction is an AND, not a compare. Example use:
  MOVE.L #$FF000000,D1
  TEST.L D1,D0 ; D0 not touched
  BEQ.S MSB_ZERO
  --

OK, thanks for the example.

Instead of doing this:


  MOVE.L #$FF000000,D1
  TEST.L D1,D0 ; D0 not touched
  BEQ.S MSB_ZERO

How about doing this?
  MOVE.L #$FF000000,D1
  AND.L D0,D1 ; D0 not touched
  BEQ.S MSB_ZERO
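
The difference between the two sequences is which registers survive. Here is a minimal C model, assuming the proposed TEST sets Z and N the way AND does. The `Regs` struct and the function names are made up for this sketch, and only the two flags relevant to the branch are modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Just enough machine state for the comparison. */
typedef struct {
    uint32_t d0, d1;
    bool z; /* zero flag */
    bool n; /* negative flag (bit 31 of a long result) */
} Regs;

Regs make_regs(uint32_t d0, uint32_t d1)
{
    Regs r = { d0, d1, false, false };
    return r;
}

/* Hypothetical TEST.L D1,D0: flags from D0 & D1, no register written. */
Regs test_l_d1_d0(Regs r)
{
    uint32_t result = r.d0 & r.d1;
    r.z = (result == 0);
    r.n = (result >> 31) != 0;
    return r;
}

/* Existing AND.L D0,D1: same flags, but the result replaces D1. */
Regs and_l_d0_d1(Regs r)
{
    r.d1 &= r.d0;
    r.z = (r.d1 == 0);
    r.n = (r.d1 >> 31) != 0;
    return r;
}
```

Both variants take the same branch and leave D0 alone. With AND.L the mask in D1 is gone afterwards, which only matters if the mask is reused, for example inside a loop.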

Megol . wrote:

x86 LEA is used in many cases that have nothing to do with addresses, and yes, it would be good to have the equivalent in 68k. Replicating the functionality of the address generation stage is how x86 does it.

If you don't like the LEA name, SADD (Scaled ADD) could do; however, a three-operand subset is very useful, and reusing the EA extension word format would be natural.

It would be good to see some clear usage examples showing WHY the SADD/x86 LEA is useful.

Being able to add (2*Dn) or (4*Dn) to Dm looks like too limited a use case to justify adding a new instruction.
Maybe a more general MultiplyADD or MultiplySUB would be more useful?

Megol . wrote:

It should be INCcc/DECcc (increment/decrement); while ADDQcc would work, I thought it would be harder to fit that encoding. I haven't checked the encoding at all.

Is there a need for this if the CPU can execute any instruction conditionally?

Megol . wrote:

  Nehalem (2009):
  SHRD r,r,cl/i 2µops latency 4 throughput 1
  Sandy Bridge (2011):
  SHLD/SHRD r,r,cl  4µops latency 2 throughput 1/2

  Does that equal terribly slow to you? Such a short latency will be transparent in most cases, and the throughput is very good.

Well, 4 is slow, and on P4, ATOM, and NANO the numbers are even higher.

Latency 4 and throughput 0.5 means you could execute 4 normal instructions instead of this.

Honestly, I fail to see the use case for this instruction.
It's needed very rarely.
- With a good FPU, integer fixed-point starts to make little sense.
- Some use cases could be replaced with a 68K BITFIELD instruction.

And where you really need the 64-bit shift, you can emulate
it with 5 normal instructions for the same speed.

But maybe you can show us a use case where the 64-bit shift is really needed.
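
For reference, here is a C sketch of what a 32-bit SHLD/SHRD computes and how it decomposes into ordinary shifts plus an OR. The function names are made up here, and the sketch models only the result value, not the flags, the instruction count, or the timing on any particular core.

```c
#include <stdint.h>

/* Model of x86 "SHLD dest, src, count" for 32-bit operands:
   shift dest left and fill the vacated low bits from the top of src. */
uint32_t shld32(uint32_t dest, uint32_t src, unsigned count)
{
    count &= 31;                 /* x86 masks the count to 5 bits here */
    if (count == 0)
        return dest;             /* also avoids src >> 32, undefined in C */
    return (dest << count) | (src >> (32 - count));
}

/* Model of x86 "SHRD dest, src, count": shift dest right and fill the
   vacated high bits from the bottom of src. */
uint32_t shrd32(uint32_t dest, uint32_t src, unsigned count)
{
    count &= 31;
    if (count == 0)
        return dest;
    return (dest >> count) | (src << (32 - count));
}
```

With a constant count this is two shifts and an OR on a copy of the source; a variable count additionally needs the 32 - count value computed.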

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
05 Mar 2011 16:22


Megol . wrote:

  How would you translate this into 68k code:
  LEA ESI, 20[EAX+EAX*4]

How about:
LEA (20,A0,A0.L*4),A1
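
Both forms compute base + index*scale + displacement, here 20 + 5*x, without touching the flags. A small C model of that arithmetic follows; `lea` and `times5_plus20` are illustrative names, and 32-bit wraparound is assumed.

```c
#include <stdint.h>

/* Effective-address arithmetic shared by the x86 and 68k forms:
   base + index * scale + displacement, wrapping at 32 bits. */
uint32_t lea(uint32_t base, uint32_t index, uint32_t scale, int32_t disp)
{
    return base + index * scale + (uint32_t)disp;
}

/* The example from the thread: base and index are the same register,
   scale is 4 and the displacement is 20, so the result is 5*x + 20. */
uint32_t times5_plus20(uint32_t x)
{
    return lea(x, x, 4, 20);
}
```

On the 68k side the source and destination have to be address registers, which is the restriction the LEAD proposal is about.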
