
Welcome to the Natami / Amiga Forum

This forum is for AMIGA fans interested in the new NATAMI platform.



Do you have ideas and feature wishes? Post them here and discuss your ideas.

SIMD Unit for the NatAmi
Gunnar von Boehn
Germany
(Moderator)
Posts 5775
09 Sep 2010 10:36


Cesare Di Mauro wrote:

  You don't need to implement everything the first time. Just define the complete ISA, but implement a restricted one for Natami 1.0.
 
  For example, you can break 512-bit instructions into 4 micro-ops that each work on 128 bits,
 

 
Your proposal to implement the key instructions first makes perfect sense!

 
Regarding the SIMD WIDTH, I would have thought that 128 bit would be perfect. A design with 512-bit registers is way too costly IMHO, even when limited to a 128-bit ALU that needs 4 clocks.
 
Example: Cost of the Register File:
An FPU register file of 16 registers, each 80 bit, implemented as real registers with the needed muxes will cost about 2500 LE (logic elements).
A register file of 16 registers, each 128 bit, implemented as real registers with muxes will cost about 4000 LE.
If you combine the FPU and the SIMD unit, you get the SIMD register file for an extra 1500 LE.
 
A register file of 16 registers, each 512 bit wide, costs about 16000 LE. This is outrageously expensive.
 
If you implement the register file with FPGA-internal SRAM, then you need 32 memory blocks for a full-clocked 512-bit register file with 2 read ports and 1 write port. This means this register file costs as much as 32 KB of CPU cache!
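
The flip-flop figures above scale almost exactly with the number of register bits. A rough sketch of that arithmetic, assuming the cost per bit implied by the 16 x 80 bit example also holds for wider files (the function name and the linear model are illustrative assumptions, not a synthesis result):

```python
# Rough estimate only: assumes LE cost scales linearly with register bits,
# calibrated on the 16 x 80 bit FPU file quoted above (about 2500 LE).
LE_PER_BIT = 2500 / (16 * 80)  # roughly 1.95 LE per register bit


def register_file_cost(registers: int, width_bits: int) -> int:
    """Estimated logic elements for a flip-flop register file with muxes."""
    return round(registers * width_bits * LE_PER_BIT)


for width in (80, 128, 512):
    print(f"16 x {width:3d} bit: ~{register_file_cost(16, width)} LE")
```

Under this model the 1500 LE quoted for a combined FPU/SIMD file is simply the difference between the 128-bit and the 80-bit file.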
 
 
Our goal is not to beat Larrabee or any upcoming PS4.
I would be more than happy with a 128-bit SIMD unit like the one PowerPC has.
 
 
Also I think that a SIMD Unit which does one 128bit operation every clock is easier to program and a lot more powerful at the end of the day, than a SIMD Unit which does one 512bit operation every 4 clocks.


Deep Sub Micron
Germany
(MX-Board Owner)
Posts 567
09 Sep 2010 13:48


Gunnar von Boehn wrote:

Cesare Di Mauro wrote:

  For example, you can break 512-bit instructions into 4 micro-ops that each work on 128 bits,
 

  Regarding the SIMD WIDTH, I would have thought that 128 bit would be perfect. A design with 512-bit registers is way too costly IMHO, even when limited to a 128-bit ALU that needs 4 clocks.

No, I think the idea is to structure the register file more like
  (16 x 4) x 128 = 64 register each 128bit
and not
  16 x (4 x 128) = 16 register each 512bit

So a 128 bit wide register file would be only 8 Kbit.


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
09 Sep 2010 15:12


deep sub micron wrote:

No, I think the idea is to structure the register file more like
(16 x 4) x 128 = 64 register each 128bit
and not
16 x (4 x 128) = 16 register each 512bit
So a 128 bit wide register file would be only 8 Kbit.

Yes, a 128 bit file is of course cheaper.

It needs to be discussed whether a real register file with flip-flops might be of advantage over a SRAM based file.

The flip-flops offer the possibility to add more READ or WRITE ports.
If we want to support a parallel LOAD instruction an extra write port will be good.
If we want to include a FMAD operation we need 3 Read ports.

This means a FPU/SIMD Unit which supports parallel Load and FMAD will need 3 Read ports and 2 Write ports. I think we can only achieve this with a real register based register file.



Gunnar von Boehn
Germany
(Moderator)
Posts 5775
09 Sep 2010 15:19


A 128-bit SIMD FPU which can do 4 SINGLE FMAD in parallel and can do a parallel VMOVE kicks ass already.

This should give the single 68070 core a floating point performance comparable to today's G3-equipped AMIGAs.

I would suppose that this is more than enough.


Megol .

Posts 672
09 Sep 2010 16:00


Gunnar von Boehn wrote:

deep sub micron wrote:

  No, I think the idea is to structure the register file more like
  (16 x 4) x 128 = 64 register each 128bit
  and not
  16 x (4 x 128) = 16 register each 512bit
  So a 128 bit wide register file would be only 8 Kbit.
 

 
  Yes, a 128 bit file is of course cheaper.
 
  It needs to be discussed whether a real register file with flip-flops might be of advantage over a SRAM based file.
 
  The flip-flops offer the possibility to add more READ or WRITE ports.
  If we want to support a parallel LOAD instruction an extra write port will be good.
  If we want to include a FMAD operation we need 3 Read ports.
 
  This means a FPU/SIMD Unit which supports parallel Load and FMAD will need 3 Read ports and 2 Write ports. I think we can only achieve this with a real register based register file.

With Xilinx distributed memory it would be trivial. AFAIK Altera's higher-end FPGAs have similar distributed memory.
With dedicated memory blocks it's possible to replicate the memory and add a separate 1-bit "register file" that indicates which memory block contains the latest data. The lookups would happen in parallel, so the only extra delay would be a mux (which is required anyway, even if the register file is implemented in logic).
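
A behavioural sketch of the replicated-memory idea as I read it: each write port owns one bank, and a 1-bit-per-register table selects the bank that was written last. The class and method names are hypothetical, and the model ignores timing and the further replication needed for extra read ports.

```python
class ReplicatedRegisterFile:
    """Two write ports built from single-write memory banks.

    Each write port owns one bank. A 1-bit-per-register table records
    which bank was written last; a read looks up both banks and the
    table in parallel and then muxes the newer value out.
    """

    def __init__(self, registers: int = 16) -> None:
        self.banks = [[0] * registers, [0] * registers]
        self.latest = [0] * registers  # the 1-bit "register file"

    def write(self, port: int, reg: int, value: int) -> None:
        self.banks[port][reg] = value
        self.latest[reg] = port        # remember where the newest data is

    def read(self, reg: int) -> int:
        candidates = (self.banks[0][reg], self.banks[1][reg])
        return candidates[self.latest[reg]]  # the final mux


rf = ReplicatedRegisterFile()
rf.write(0, 3, 111)  # e.g. the result of an ALU operation
rf.write(1, 3, 222)  # e.g. a later parallel LOAD into the same register
rf.write(0, 5, 333)
print(rf.read(3), rf.read(5))
```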

Ceti 331
United Kingdom

Posts 282
09 Sep 2010 16:04


For a Cell/Larrabee-inspired 68k, would there be any merit in:-
  - a SIMD ISA uniform across all cores for ease of programming;
  - changing the number of execution units between the cores.
 
  e.g.
  Master:-
  ========
  large cache
  OOOE ?
  slow SIMD/fpu
 
  Slaves:-
  ========
  small caches
  less 'branch prediction'
  in-order
  fast SIMD/fpu
 
  Could this bring some of the benefits of dedicated units (like the Audio DSP being suggested) together with the easy accessibility of SMP?
  Maybe even specialize the slaves further, e.g. half of them have faster implementations of the fiddlier instructions for video/sound decoding, the other half more tailored to 3d transformation... all still have the same instructions, just microcoded on the simplified cases

Amiga Believer
Canada

Posts 282
09 Sep 2010 18:24


> Also, you can implement only 32 bits integers and FP32 instructions
For floating point, I do agree that 32 bit is probably sufficient; 64 bit may be overkill. I would however implement half precision floating point support (16 bit). In many cases, the precision would be good enough, and working on 8 values in parallel would double the performance (versus 4 values in parallel). It may make sense to limit the design to half precision and single precision to simplify it.

For integers, I would suggest having at least partial support for 64 bit. It can be useful to be able to multiply two 32 bit integers and get a 64 bit result, or to calculate the square root of a 64 bit integer and generate a 32 bit result. Thus, I would recommend having three word sizes for integer operations: 16 bit, 32 bit and (at least partial support for) 64 bit, all of these both signed and unsigned. 8 bit is too small; I would not implement 8 bit support.

As for the register size, I agree that 128 bit is appropriate; there is not much point in supporting more than 128 bit. If there are spare LEs, I would rather use them to give the SIMD unit its own registers than to increase the register size. If it is about performance, implementing half precision floating point seems like a great approach to me, and it does not need bigger registers.

Finally, I would like to bring back a point that nobody commented on: what about adding an absolute value instruction?



Loïc Dupuy
France

Posts 253
09 Sep 2010 18:43


Amiga Believer wrote:

For integer, I would suggest to have at least partial support for 64 bit. It can be useful to be able to multiply two 32 bit integers and have a 64 bit result, or to calculate the square root of a 64 bit integer and generate a 32 bit result.

That has been the case since the 68020:
http://en.wikipedia.org/wiki/Motorola_68020#Instruction_set

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
09 Sep 2010 19:00


Amiga Believer wrote:

> Also, you can implement only 32 bits integers and FP32 instructions
  For floating point, I do agree that 32 bit is probably sufficient, 64 bit may be overkill.
I would however implement half precision floating point support (16 bit).

16-bit float is very, very imprecise.
Therefore I would not implement it.

The more sizes the unit supports, the more expensive it gets.
Keep it simple!
 

Amiga Believer wrote:

For integer, I would suggest to have at least partial support for 64 bit. It can be useful to be able to multiply two 32 bit integers and have a 64 bit result

It's been supported in the 68K ISA since 1984!

Amiga Believer wrote:

or to calculate the square root of a 64 bit integer and

1) Convert Int to Float,
2) Do SQRT
3) Convert Float to Int
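
The three steps above can be sketched as follows. This is only an illustration in Python, whose floats have a 53-bit mantissa, so a 64-bit input may be rounded in step 1; the correction loops are my addition to compensate for that and are not part of the recipe in the post. The function name is made up.

```python
import math


def isqrt_via_float(n: int) -> int:
    """Integer square root using the int -> float -> sqrt -> int route."""
    if n < 0:
        raise ValueError("square root of a negative integer")
    root = int(math.sqrt(float(n)))  # steps 1 to 3 from the post
    # Python floats carry a 53-bit mantissa, so a large input may have been
    # rounded in step 1. Nudge until root*root <= n < (root + 1)**2.
    while root * root > n:
        root -= 1
    while (root + 1) * (root + 1) <= n:
        root += 1
    return root


print(isqrt_via_float(16), isqrt_via_float(17), isqrt_via_float(2**64 - 1))
```

With an 80-bit extended format, which has a 64-bit mantissa, the conversion of a 64-bit integer would be exact and the correction would matter less.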

Amiga Believer wrote:

Finally, I would like to bring back a point that nobody commented on: what about adding an absolute value instruction?

Are you talking about FLOAT or Integer?



Loïc Dupuy
France

Posts 253
09 Sep 2010 19:11


OK, SIMD can do 4 operations in parallel.
But what are the real cases that benefit from this?

I can see scalar operations (*,/,+,-,,~) on a vector, for example [a,b,c,d]*k = [a*k,b*k,c*k,d*k]
But a 3D rotation is a 4x4 diagonal matrix multiplied by a 4x4 matrix, so the scalar operation is not directly applicable.
So in fact, the only interesting option is [a,b,c,d](horizontal) x [x,y,z,t](vertical) = a*x + b*y + c*z + d*t (one register)

I agree with Ceti331 that the dot product is the most interesting aspect of a simd.
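
To make the two cases concrete, here is a minimal sketch of the element-wise scalar operation and of the horizontal dot product, with plain Python lists standing in for 4-lane registers (the function names are illustrative only):

```python
def scale(vec, k):
    """Element-wise scalar operation: [a, b, c, d] * k."""
    return [x * k for x in vec]


def dot(row, col):
    """Horizontal dot product: a*x + b*y + c*z + d*t, one result."""
    products = [r * c for r, c in zip(row, col)]  # 4 multiplies in parallel
    return sum(products)                          # horizontal add


print(scale([1.0, 2.0, 3.0, 4.0], 2.0))
print(dot([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]))
```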

Outside of this, all other instructions will be a waste, and it will be very difficult to justify the gates/power ratio of the SIMD unit.

I must be wrong somewhere, please enlighten me!

For FP precision, 32 bits is enough for games (I don't know about 16 bits, but I doubt it). 64 bits is not enough to simulate the world economy, and you never have enough bits in scientific simulation!

Fixed-point 32-bit integers with a scalar register seem more interesting.
We keep the integer engine; the value is scaled only when it is used (could this be done by the 3D chip or the blitter on the fly?)

Another very useful feature is a saturation mode register, which is common in signal processing: for a byte, for example, 128+128=255 and not 0 (=256 mod 256). Perhaps with an extra status register.
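
A minimal sketch of the difference between wrapping and saturating byte addition, using the 128+128 example from above (function names are illustrative):

```python
def add_wrap_u8(a: int, b: int) -> int:
    """Ordinary byte addition: the result wraps modulo 256."""
    return (a + b) & 0xFF


def add_sat_u8(a: int, b: int) -> int:
    """Saturating byte addition: the result is clamped at 255."""
    return min(a + b, 255)


print(add_wrap_u8(128, 128), add_sat_u8(128, 128))
```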

Looking at the SSE4/Altivec/... ISAs: which operations are really used day to day, and which could be emulated in software because they are seldom used, so that the 100 cycles lost once in a while are peanuts compared with the 10000 gates needed to implement them?

IANAHE: I'm not a hardware engineer

Thierry Atheist
Canada

Posts 1828
09 Sep 2010 19:29


FPGAs are getting bigger all the time. Using 10,000 gates is WAY worth it if we could get no frame loss playback of MPEG2 at 640*480 and MP3 audio, with sound being in sync with the video stream.

A NatAmi should even outperform the least expensive DVD players that are made with the absolute cheapest parts (lowest MHz) and lowest amount of RAM to make video playback possible. I've found that even NAME BRAND DVD players can get kludged up before a movie is finished with NEW DVDs being used!

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
09 Sep 2010 19:34


Loïc Dupuy wrote:

I must be wrong somewhere, please enlighten me!

When you do a 3D Matrix Mul with SIMD you still need the same number of instructions. But instead of processing 1 Vector you are processing 4 Vectors in parallel.

In 3D games you always need to rotate Vectors with a Matrix.
The Matrix for each Object is the same.
A 3D Object typically has many hundred Vectors.

This means if you can do 4 Vectors in parallel, you are 4 times faster.
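
A sketch of what "4 Vectors in parallel" can look like in practice, assuming a structure-of-arrays layout in which one register holds the same coordinate of four different vectors. Python lists stand in for 128-bit registers; the function name and the 3x3 matrix are illustrative choices, not something specified in this thread.

```python
def transform4(matrix, xs, ys, zs):
    """Rotate four 3D vectors at once with one 3x3 matrix.

    xs, ys and zs each model one 128-bit register holding the same
    coordinate of four different vectors (structure-of-arrays layout).
    Each list comprehension stands for one multiply and two
    multiply-adds, each working on four lanes.
    """
    out = []
    for m0, m1, m2 in matrix:  # one output register per matrix row
        out.append([m0 * x + m1 * y + m2 * z
                    for x, y, z in zip(xs, ys, zs)])
    return out  # [new_xs, new_ys, new_zs]


# 90 degree rotation around the Z axis: (x, y, z) -> (-y, x, z)
ROT_Z_90 = [[0, -1, 0],
            [1,  0, 0],
            [0,  0, 1]]

new_xs, new_ys, new_zs = transform4(ROT_Z_90,
                                    xs=[1, 0, 2, 3],
                                    ys=[0, 1, 2, 4],
                                    zs=[5, 6, 7, 8])
print(new_xs, new_ys, new_zs)
```

The instruction count is the same as for a single vector; only the data layout changes so that each instruction does useful work in all four lanes.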



Amiga Believer
Canada

Posts 282
09 Sep 2010 21:30


@Loïc Dupuy
> It is the case since the 68020
I know, but this was an answer to the suggestion to drop 64 bit support. I was saying that dropping 64 bit support for float is fine (in the SIMD only, the FPU should support it), but that we should at least keep partial 64 bit support for integers.

> 16bit float is very very imprecise.
> Therefore I would not implement it.
I disagree. Half precision floating point (IEEE 754 binary16) is obviously not precise enough for scientific calculations. It is, however, quite sufficient for many everyday tasks such as MPEG compression and decompression; the difference in the quality of the final image would not even be visible (keep in mind that our eyes' sensitivity is not linear, it is more or less logarithmic). Moreover, supporting the half precision format would double the performance of MPEG decoding. When working on macroblocks of 8 by 8 pixels, one has to do the same sequence of operations on each of the 8 rows to compute the DCT or IDCT, and then the same operations on each of the columns. If our 128 bit registers have 8 "positions", each holding a half precision float, then each row can use one of the "positions" in the register and all the rows of a macroblock can be done in a single pass; the columns can then be done in a second pass. When working on 16 by 16 pixel macroblocks, the rows can be processed in two passes and the columns in two more. This is double what can be done when working with four 32 bit floats in parallel. It can make the difference between h264 support on the NatAmi being usable or unusable.
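
To get a feel for the format under discussion, the following sketch packs eight binary16 values into 128 bits with Python's `struct` module and prints the rounding error of each. The sample values are arbitrary picks for illustration, not data from an actual codec.

```python
import struct

# Eight half-precision values fill exactly one 128-bit register.
row = [0.1, 1.0, 3.14159, 100.0, 255.0, 1000.0, 2049.0, -0.5]
packed = struct.pack("<8e", *row)  # "e" = IEEE 754 binary16
assert len(packed) * 8 == 128

unpacked = struct.unpack("<8e", packed)
for original, stored in zip(row, unpacked):
    print(f"{original:>10} -> {stored!r:>22}  error {abs(original - stored):.6f}")
```

binary16 has 11 bits of significand precision, so small integers such as 8-bit pixel values are exact, while integers above 2048 are no longer all representable.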

> 1) Convert InT to Float,
> 2) Do SQRT
> 3) Convert Float to Int
It may be useful to have, like I said, a square root instruction for integers.

> Do you talk about FLOAT or Integer?
Well... both...
I was thinking of adding it for signed integers and floating point.

> In 3D games you always need to rotate Vectors with a Matrix.
I was thinking of creating a SIMD unit with the goal of accelerating compression and decompression of video, audio and images, not to accelerate games. For games, I have another solution which I will bring in another forum topic.

@Loïc Dupuy

> OK, SIMD can do 4 operations in parallel.
> But what are the real cases that benefit from this?
> Outside of this, all other instructions will be a waste, and it will be very difficult to justify the gates/power ratio of the SIMD unit.
See the example which I gave about MPEG in this message. When working with macroblocks, a SIMD unit can give a boost for the DCT or the IDCT calculation.

@Thierry Atheist
> FPGAs are getting bigger all the time. Using 10,000 gates is WAY worth it if we could get no frame loss playback of MPEG2 at 640*480 and MP3 audio, with sound being in sync with the video stream.
I think you got the point: with this unit, we may have usable video compression and decompression.


Loïc Dupuy
France

Posts 253
09 Sep 2010 21:36


Gunnar von Boehn wrote:

When you do a 3D Matrix Mul with SIMD you still need the same number of instructions. But instead of processing 1 Vector you are processing 4 Vectors in parallel.
 
In 3D games you always need to rotate Vectors with a Matrix.
The Matrix for each Object is the same.
A 3D Object typically has many hundred Vectors.
 
This means if you can do 4 Vectors in parallel, you are 4 times faster.

It has been quite a while since I programmed spatial rotation (since the A500, in fact, where I discovered that the 68k was not fast enough for the general case; an A1200 with 4x acceleration was still not enough. Demos do not count: all their rotations are fixed along an axis to avoid multiplications).

So I just retrieved my antique reference book from a box: "Introduction to Computer Graphics" by James D. Foley (Addison-Wesley 1995)

I was afraid of that, we have to adapt the flow of the algorithm to load the registers with a batch of 4 vectors instead of one.

It seems to me that a dot product using only 8 registers would be a lot more effective and more general.

Of course, doing 4 dot products in parallel cannot be too bad either :-D

If we have to break the flow of the algorithm, 4 maidcores with an efficient dot product (4 multiplications in parallel, then adding the 4 results in a second pass) will be a lot more effective than a single SIMD unit.

Program-wise, you give each object its own thread. You will have less memory starvation, because the dot products will not be computed in one cycle, which lets the other cores load from memory.

You can have an independent algorithm on each coordinate or vector (in the SIMD case you have to link 4 values; does every object now need a multiple of 4 coordinates?), so you can apply different rotations to different objects at the same time.

I have not touched 3D computation for a long time, but just because SIMD is the new trend does not mean we have to follow it. As you have occasionally told us, an FPGA allows a lot of tricks that are not possible with a classic architecture.

Thierry Atheist
Canada

Posts 1828
09 Sep 2010 21:48


A SIMD is often 4 vectors only... Is an 8 vector SIMD not very effective, or too demanding on the bus for data? Or otherwise "law of diminishing returns" applies?

Loïc Dupuy
France

Posts 253
09 Sep 2010 21:51


Thierry Atheist wrote:

Using 10,000 gates is WAY worth it if we could get no frame loss playback of MPEG2 at 640*480 and MP3 audio, with sound being in sync with the video stream.

I'm for a specialized DCT/iDCT unit, as a lot of lossy compression algorithms use them, and it has general use in signal processing. No fear of losing 10,000 gates here :-)

But if a SIMD instruction would be seldom used (not the case for DCT/iDCT), do not implement it.
More instructions bring more complexity and make it harder to move to a new generation.

Let's use 100,000 gates if it gives a cutting edge on common algorithms.

Loïc Dupuy
France

Posts 253
09 Sep 2010 22:03


I like integers and I hate floats. Floating point is far from foolproof, and if you want to avoid drift in the results, you need specialised algorithms that focus on the limits of the result (a nightmare in my engineering mathematics class; the algorithms used are not straightforward).

In the end, you have to convert to screen resolution, which is integer only, and the conversion is NOT lossless.

So here is my crazy idea

Why not have an integer unit with Bresenham correction?
The first application is drawing lines, but it can be used for ALL interpolations.
At university, a friend of mine wrote a raytracer using a Bresenham-style approach for the rays on a P133, and it was an order of magnitude faster than using floats (the same could be applied to DCT/iDCT).

The computation error is maintained in the extra accumulator. It's pretty stable over several transformations in a row (unlike floats).
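
A minimal sketch of Bresenham-style interpolation with an error accumulator. This is the generic textbook technique, not the raytracer mentioned above and not a proposed hardware design; the function name is made up.

```python
def interpolate(y0: int, y1: int, steps: int) -> list:
    """Return steps + 1 integers running from y0 to y1.

    Setup needs one integer division; the loop itself uses only integer
    add, subtract and compare. The rounding error is carried in `err`.
    """
    if steps <= 0:
        raise ValueError("steps must be positive")
    whole, frac = divmod(y1 - y0, steps)  # step = whole + frac/steps
    y, err = y0, 0
    values = [y]
    for _ in range(steps):
        y += whole
        err += frac
        if 2 * err >= steps:  # accumulated error reached half a unit
            y += 1
            err -= steps
        values.append(y)
    return values


print(interpolate(0, 3, 8))
print(interpolate(10, 0, 4))
```

Because the error term is an exact integer, the last value always lands on y1, however many steps are taken.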

Of course, it's far from mainstream, but it's really efficient computation-wise; it will give NATAMI an edge even with a "slow" clock.

Perhaps an FFP library could be adapted to this unit, granting us compatibility with old programs.

It will not be as flexible as 32-bit FFP with 32-bit integers, but with 64-bit integers? (hello Amiga Believer)

Deep Sub Micron
Germany
(MX-Board Owner)
Posts 567
09 Sep 2010 23:26


Amiga Believer wrote:

  > 1) Convert InT to Float,
  > 2) Do SQRT
  > 3) Convert Float to Int
  It may be useful to have, like I said, a square root instruction for integers.

It's just that some fast algorithms expect normalized numbers, and floats are normalized by definition. So it would be the simplest solution and, for long integers, also a fast one.


Amiga Believer
Canada

Posts 282
10 Sep 2010 00:08


> It's just that some fast algorithms expect normalized numbers. And float are normalized by definition. So it would be just the simplest and for long integers also a fast solution.
Can you elaborate a bit about this?
I was thinking of the fact that integer calculation is usually faster than floating point calculation. Moreover, doing the conversion from integer to floating point and from floating point to integer will need some extra clock cycles. We want the code to run **fast**.



Gunnar von Boehn
Germany
(Moderator)
Posts 5775
10 Sep 2010 00:38


Amiga Believer wrote:

> 1) Convert InT to Float,
> 2) Do SQRT
> 3) Convert Float to Int
It may be useful to have, like I said, a square root instruction for integers.

SQRT is in general much more often needed for float.
A SQRT instruction is by design a slow and expensive operation,
so saving chip space and reusing the FPU for this makes good sense.
Especially as on the 68K the conversion between INT and FLOAT is swift. By design the 68K can do this conversion much quicker than a PowerPC, for example.
 

Amiga Believer wrote:

> Do you talk about FLOAT or Integer?

Well... both...
I was thinking of adding it for signed integers and floating point.


Well the 68K already has a FABS for float.
ABS for INT is also easy to do with normal instructions.
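
One common way to build an integer ABS from ordinary shift, XOR and subtract operations is sketched below. Whether this is the sequence meant above is an assumption; a compare, branch and negate would work just as well.

```python
def abs32(x: int) -> int:
    """Branch-free absolute value of a signed 32-bit integer."""
    mask = x >> 31            # 0 for x >= 0, -1 (all ones) for x < 0
    return (x ^ mask) - mask  # conditional negate: invert bits, add 1


print(abs32(-5), abs32(7), abs32(0))
```

Note that on real 32-bit hardware the most negative value (-2**31) has no positive counterpart and would wrap; Python's unbounded integers hide that case.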

 
Amiga Believer wrote:
 
  > In 3D games you always need to rotate Vectors with a Matrix.
  I was thinking of creating a SIMD unit with the goal of accelerating compression and decompression of video, audio and images, not to accelerate games.

SIMD is perfect to accelerate 3D games.


