
Welcome to the Natami / Amiga Forum

This forum is for AMIGA fans interested in the new NATAMI platform.
Please read the forum usage manual.



Welcome to the Natami lounge.
Meet new AMIGA friends here and enjoy having a friendly chit chat.

Amiga's Bottlenecks
Geoffrey Kramer

Posts 21
16 Oct 2010 17:02


I'm opening this topic after reading a post by Thomas Richter (thanks to him).

Thomas Richter wrote:

Rather, let's identify existing bottlenecks where important applications could be made a lot faster by smart instructions. For example rendering JPEG images is one such application. Block-transfer between chip and fast (brr, I hate this obsolete concept) due to limitations of the DMA engine to chip memory would be another addition needed.

First question: What are the bottlenecks of the Amiga, the 68k and the OS?

Second question: How could those bottlenecks be remedied?

Let's go, people, but please try to stay clear and point out the bottleneck you're talking about.

No need to have an answer; if you just identify a bottleneck, that's OK.

Samuel D Crow
USA
(Natami Team)
Posts 1295
16 Oct 2010 18:40


Many have already been addressed by the NatAmi hardware:

Chunky graphics modes have more sequential memory access and therefore are faster on modern memory.

The chunky 256 color mode should be able to generate masks dynamically thus eliminating unnecessary accesses to the mask plane that a bit-planar screen mode would need.

The Bopper would allow blit nodes to be queued into the blitter independently of the CPU and would not need to tie up the copper so it could be doing independent functions.

Having a vertical modulo operation on the blitter allows interleaved bitplane screenmodes to blit transparently without CPU intervention for the mask plane, while allowing the mask plane itself to be only one bit deep per row of pixels.

Cesare Di Mauro
Italy

Posts 528
17 Oct 2010 07:08


1) Lack of a modern SIMD unit (CPU)

2) Lack of SMP (OS)

3) Stack frames make interrupts and traps expensive. Shadow registers ala-ARM or FIDO are a better solution in terms of speed and latency. (CPU & OS)

4) Lack of chunky graphic modes (GPU)

5) Lack of modern GPU architecture, and shaders particularly (GPU)

6) Lack of resource tracking, virtual memory, and memory protection (OS)

7) Lack of GUI enhancements, such as vectorial graphic presentation (OS)

8) Lack of modern transactional/journaled filesystem, with metadata support and large storage (OS)

9) chip / fast memory subdivision which makes no sense if the available bandwidth is almost the same. It makes sense only if we have chip mem = embedded or graphic memory with enormous bandwidth compared to the main memory. (Hardware)

10) Lack of multiuser support (OS)

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
17 Oct 2010 11:59


OK, I'll bite on this one.

Cesare Di Mauro wrote:

  1) Lack of a modern SIMD unit (CPU)

A SIMD unit can be added to the 68k in a clean, compatible manner.
But SIMD is useful for only a fraction of software,
while a good INTEGER unit is always needed.

 
Cesare Di Mauro wrote:

  2) Lack of SMP (OS)

Well we can certainly live without it.

 
 
Cesare Di Mauro wrote:
 
  3) Stack frames make interrupts and traps expensive. Shadow registers ala-ARM or FIDO are a better solution in terms of speed and latency. (CPU & OS)
 

But interrupt handling was never a limitation on the AMIGA.
 
 
Cesare Di Mauro wrote:
 
  4) Lack of chunky graphic modes (GPU)
 

This is fixed.
 
 
Cesare Di Mauro wrote:

  5) Lack of modern GPU architecture, and shaders particularly (GPU)
 

Well this could be added.
 
 
Cesare Di Mauro wrote:

  6) Lack of resource tracking, virtual memory, and memory protection (OS)
 

I agree that resource tracking is something very good.
But Virtual memory is fluff.
Memory protection is mainly a kludge - and in a DMA driven system like the AMIGA it can by design never work to 100%.
 
Cesare Di Mauro wrote:

  7) Lack of GUI enhancements, such as vectorial graphic presentation (OS)

Taste is a personal matter but I would consider this fluff.
 
 
Cesare Di Mauro wrote:
 
  8) Lack of modern transactional/journaled filesystem, with metadata support and large storage (OS)
 

PFS?
 
 
Cesare Di Mauro wrote:

  9) chip / fast memory subdivision which makes no sense if the available bandwidth is almost the same. It makes sense only if we have chip mem = embedded or graphic memory with enormous bandwidth compared to the main memory. (Hardware)
 

I think the opposite is true, my friend.
The logical chip and fast subdivision was very important for the AMIGA. This separation is a clever and simple way to achieve working synchronisation between chipset and CPU. Without this separation, no AMIGA with a cache and without an MMU would ever have worked.
 
 
Cesare Di Mauro wrote:
 
  10) Lack of multiuser support (OS)
 

I would say this is only important for servers, not for home computers.

Ceti 331
United Kingdom

Posts 282
17 Oct 2010 12:35


Cesare Di Mauro wrote:

1) Lack of ...

The main bottleneck was Commodore going bust :)



Richard Maudsley
United Kingdom

Posts 821
17 Oct 2010 12:53


I absolutely agree that memory protection and multiuser support are required today for most people, to help stop the Russian mafia from stealing your credit card numbers (etc.).

But I don't really think it matters for the Amiga. I mean, all the users are technical enough that they would notice spyware/viruses/phishing... and the market share is too low for anyone to bother attacking anyway.

Ajc ;)
United Kingdom

Posts 688
17 Oct 2010 12:56


Gunnar von Boehn wrote:

OK, I'll bite on this one.
   
 
Cesare Di Mauro wrote:

    5) Lack of modern GPU architecture, and shaders particularly (GPU)
 

  Well this could be added.

Indeed it'd be needed for OpenGL ES 2.0, that'll mean extending the Tami v1.0 design one day (in the future).

Gunnar von Boehn wrote:
 
 
Cesare Di Mauro wrote:

    6) Lack of resource tracking, virtual memory, and memory protection (OS)
 

  I agree that resource tracking is something very good.
  But Virtual memory is fluff.
  Memory protection is mainly a kludge - and in a DMA driven system like the AMIGA it can by design never work to 100%.

I disagree a _little_ bit, Gunnar. Virtual memory is, I think, very useful, almost an essential ingredient of a modern OS.
Also, DMA needn't relegate memory protection to the scrapheap, but I'll elaborate more in the next question...
Gunnar von Boehn wrote:
   
 
Cesare Di Mauro wrote:

    9) chip / fast memory subdivision which makes no sense if the available bandwidth is almost the same. It makes sense only if we have chip mem = embedded or graphic memory with enormous bandwidth compared to the main memory. (Hardware)
 

  I think the opposite is true, my friend.
  The logical chip and fast subdivision was very important for the AMIGA. This separation is a clever and simple way to achieve working synchronisation between chipset and CPU. Without this separation, no AMIGA with a cache and without an MMU would ever have worked.

...as long as DMA affects only the "chip" memory, memory protection is perfectly possible.
Don't view it the way the Amiga is traditionally thought of; think of it like the PC (sorry, it's gotta be said). If you redraw the Amiga OCS/ECS/AGA chipsets as the "GPU" with chip RAM, and the CPU with fast RAM, then it all makes a nasty kind of sense :)
Chip RAM has always been accessible by both the chipset and the CPU, but fast RAM is purely for the CPU, which is why adding just 2MB of fast RAM to the A1200 would have made a huge (2x) performance difference. It's why 4MB trapdoor expansion cards were popular: they were the cheapest effective upgrade you could get!

André Jernung
Sweden
(MX-Board Owner)
Posts 988
17 Oct 2010 12:57


Yup. The Amiga today is a hacker's and computer enthusiast's box, and some of the "weaknesses" you mention are actually its strengths.


Ajc ;)
United Kingdom

Posts 688
17 Oct 2010 13:02


Geoffrey Kramer wrote:

First Question : What are the bottlenecks of the Amiga, 68k and OS.

Amiga:
The AGA chipset clockspeed and bandwidth to memory.
Lack of true Chunky support.
Lack of Fast Ram installed by default.
Small size of Chip Ram.
No (real) access to cheap PC monitors (VGA etc).
No default inter-amiga communication other than Serial port.

68k:
I never understood why Commodore didn't commission a cut-down, higher-clocked 68030 chip; the A1200's 020 @ 14MHz was always too slow.

Geoffrey Kramer wrote:
 
Second Question : What would be the ways to remedy to those bottlenecks.

I think ThomasH has already solved 90% of my gripes, Gunnar and DSM have solved the remaining 10% with N68050 :D

Thomas Richter
Germany
(MX-Board Owner)
Posts 1425
17 Oct 2010 14:40


Cesare Di Mauro wrote:

1) Lack of a modern SIMD unit (CPU)

This is not a bottleneck, but only one solution for bottlenecks.

Cesare Di Mauro wrote:

  2) Lack of SMP (OS)

This is an OS problem, not a hardware problem. Not fixable by hardware.

Cesare Di Mauro wrote:

  3) Stack frames make interrupts and traps expensive. Shadow registers ala-ARM or FIDO are a better solution in terms of speed and latency. (CPU & OS)

Are they? I'm not sure. Interrupt handling was rarely a problem for me.

Cesare Di Mauro wrote:

  4) Lack of chunky graphic modes (GPU)

This is a hardware problem, but not really CPU related.

However, there is one point here: Amiga gfx is planar-operated and uses many planar concepts like planemasks. Which means that converting from and to chunky will be a bottleneck. It was a bottleneck for P96. This means that some provision in the system should help here. This could be the blitter, but it requires extensions for that.
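
To make the conversion cost concrete, here is a minimal Python sketch of a chunky-to-planar pass. The function name `chunky_to_planar` and the chosen layout (8 pixels per plane byte, leftmost pixel in the most significant bit) are assumptions for illustration; this is not code from P96 or from the Natami.

```python
def chunky_to_planar(pixels, depth=8):
    """Split chunky pixels (one integer per pixel) into `depth` bitplanes.

    Each plane is returned as bytes, 8 pixels per byte, with the leftmost
    pixel stored in the most significant bit (assumed layout).
    """
    if len(pixels) % 8:
        raise ValueError("pixel count must be a multiple of 8")
    planes = [bytearray(len(pixels) // 8) for _ in range(depth)]
    for i, pix in enumerate(pixels):
        byte_index, bit = divmod(i, 8)
        for p in range(depth):
            # Bit p of the pixel value goes into bitplane p.
            if (pix >> p) & 1:
                planes[p][byte_index] |= 0x80 >> bit
    return [bytes(plane) for plane in planes]


planes = chunky_to_planar([1, 0, 3, 2, 0, 0, 0, 1], depth=2)
print([plane.hex() for plane in planes])  # ['a1', '30']
```

Every pixel touches every plane, which is why this inner loop is a natural candidate for blitter or CPU bit-packing support.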

Actually, I would probably design the system in a different way today. Instead of a blitter, I would add blitter-type instructions to the CPU to let it perform most of the operations the blitter would have done. That would include bit-packing and unpacking instructions - this stuff could be useful, either in the CPU or in the blitter.

Actually, as the blitter is limited to chip mem, the CPU is probably a better place. Off-image bitplanes could then be blitted by the CPU.

2D operations, similar topic: This could also be done by the CPU, and would then be possible in fastmem as well. Rectangle-fill, rectangular copy, cookie-cut.

Cesare Di Mauro wrote:

  5) Lack of modern GPU architecture, and shaders particularly (GPU)

True enough. What could be done in the CPU to help here? For 3D, I would need multiply-add a lot. For shaders, you likely need a table-based interpolation, similar to what you find in the CPU32. Further, vector operations to handle red-green-blue simultaneously in a single instruction.

Cesare Di Mauro wrote:

  6) Lack of resource tracking, virtual memory, and memory protection (OS)

Unfortunately, an OS design issue, nothing that requires a fix in hardware. Of course, an MMU would be helpful for some of them. For example, an MMU would help to implement a virtual sandbox for applications, but this is a major piece of work to design and write.

 
Cesare Di Mauro wrote:

  7) Lack of GUI enhancements, such as vectorial graphic presentation (OS)

Not exactly a hardware issue.
 
Cesare Di Mauro wrote:

  8) Lack of modern transactional/journaled filesystem, with metadata support and large storage (OS)

Neither exactly a hardware issue.

Cesare Di Mauro wrote:
 
  9) chip / fast memory subdivision which makes no sense if the available bandwidth is almost the same. It makes sense only if we have chip mem = embedded or graphic memory with enormous bandwidth compared to the main memory. (Hardware)

This is a hardware design issue. If Gunnar wants to stick to this, at least a mechanism needs to be provided to transfer data to and from chip mem fast. As DMA devices can only access chip memory currently, the CPU must shuffle data between chip and fast mem for any type of IO that has the fast memory as target. Thus, a possible resolution would be to have a CPU block-move instruction that does not fill the cache, but invalidates cache-entries only, and that works faster than a series of MOVE16. That *would* be useful.
 

Cesare Di Mauro wrote:
 
  10) Lack of multiuser support (OS)

Not really an issue fixable by hardware.

Let me add a couple from my list:

MP3-Playing: Again, multimedia-related. Would require fast bit-manipulations for decoding, fast vector operations, fast DCT transformations. Here again multiply-add comes into play, and support for fixed-point arithmetic if FPU extensions are too hard. That means multiply-round-shift instructions.

JPEG: Mostly DCT and bit-shuffling. Something like a bit-buffer instruction might be helpful: Append bits from one register into another register, increment bit-position, set carry if full. Bit-field instructions might be helpful here.

Saturating arithmetic: Often used as last stage in such processing chains, both in JPEG and MP3.

Video coding: Fast "sum of absolute differences"; the main speed brake is motion prediction. Better even, "sum of squares" of two rectangular memory regions, aka "scalar product".

Cryptography: Is there a need for this, for example?

Alpha-channel, transparency: Again, multiply-add instructions required here.

Bobs: Can be currently done by the blitter. But should they?

Emulation: Endian-swap, byte-reversal access of memory?

Would it perhaps make sense to have parts of the CPU microcode re-writable while the CPU is working, i.e. "define your own instruction at run time"? No need to go deep into the CPU, but very short instruction sequences could be pre-compiled by the CPU, buffered somewhere within the CPU and executed faster by a single opcode than a subroutine jump.

Emulation: "Virtual hardware": Would require the CPU to run through an exception cycle if a predefined hardware address is read from or written to. Could be done by a full MMU plus complete exception processing, but might be worth implementing separately.

String-handling: strlen,strcpy and strchr could be handled by fast and simple copy/test and loop instructions.

Greetings,
Thomas



Geoffrey Kramer

Posts 21
17 Oct 2010 14:59


Cesare Di Mauro wrote:

  2) Lack of SMP (OS)

  10) Lack of multiuser support (OS)

As the Amiga doesn't have a multicore CPU, I don't consider the lack of SMP a bottleneck (it is for the X1000, not for the NatAmi).

I fail to see how multiuser support would increase the performance of the Amiga architecture.

A tiny example of a bottleneck was the lack of a chunky mode, which forced the use of chunky-to-planar routines and was a huge weakness of AGA. Another is the Kickstart ROM, which can't be updated without tons of patches, and all that becomes a real mess. That's why Cosmos is writing a new "graphics.library".

I know that Thomas.H and the team have already solved most of the weaknesses; they can be proud. But it's still good to make a sort of "checklist" and maybe find some tiny things hidden in the corners that would improve the OS and everyday tasks.


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
17 Oct 2010 15:09


Thomas Richter wrote:

  Let me add a couple from my list:
 
  MP3-Playing: Again, multimedia-related. Would require fast bit-manipulations for decoding, fast vector operations, fast DCT transformations. Here again multiply-add comes into play, and support for fixpoint arithmetics if FPU extensions are too hard. That means multiply-round-shift instructions.
 

  What type of multiply-round-shift is needed?
  What bit width makes most sense? 16-bit?
 
 
Thomas Richter wrote:

  JPEG: Mostly DCT and bit-shuffling. Something like a bit-buffer instruction might be helpful: Append bits from one register into another register, increment bit-position, set carry if full. bit-field instructions might be helpful here.
 

  Can you explain this again in more detail?
  Is there something which you can not do with BFEXT / BFINS?
 
 
 
Thomas Richter wrote:
 
  Saturating arithmetics: Often used as last stage in such processing chains, both in JPEG and MP3.
 

  I was thinking about what makes most sense here.
  I wonder if the Coldfire SATS instruction is really that useful.
  CF SATS does saturate to 32bit.
 
  I wonder if the following would make sense.
  I would assume this could be nice:
  Let's say your values are 8 or 16 bit.
  You extend them to 32bit registers.
  You do the calculations/processing.
  At the end you execute a SATS.W or SATU.W (new instructions!)
  These would check for the 16bit boundary and saturate to it in 1 clock.
  Does this sound reasonable?
 
 
 
 
Thomas Richter wrote:

  Video coding: Fast "sum of absolute differences", the main speed-brake is the motion prediction. Better even, "sum of squares" of two rectangular memory regions, aka "scalar product".
 

  Which instructions make the most sense for this?
 
 
Thomas Richter wrote:
 
  Cryptography: Is there a need for this, for example?
 

  For SSL maybe..
 
 
Thomas Richter wrote:
 
  Emulation: Endian-swap, byte-reversal access of memory?
 

  For this we have BYTEREV and BITREV.
  Instructions stolen from Coldfire :-D
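
  For readers unfamiliar with these two operations, here is a minimal Python sketch of what a 32-bit byte reversal and a 32-bit bit reversal compute. The helper names `byterev32` and `bitrev32` are hypothetical, and the sketch only models the result on a 32-bit value, not instruction timing or flag behaviour.

```python
def byterev32(x: int) -> int:
    """Reverse the byte order of a 32-bit value (endian swap)."""
    return int.from_bytes((x & 0xFFFFFFFF).to_bytes(4, "big"), "little")


def bitrev32(x: int) -> int:
    """Reverse the bit order of a 32-bit value (bit 0 <-> bit 31, and so on)."""
    return int(f"{x & 0xFFFFFFFF:032b}"[::-1], 2)


print(hex(byterev32(0x12345678)))  # 0x78563412
print(hex(bitrev32(0x00000001)))   # 0x80000000
```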
 
 
Thomas Richter wrote:

  Would it probably make sense to have parts of the CPU microcode re-writable while the CPU is working, i.e. "define your own instruction at run time".
 

  This sounds difficult.
 
 
Thomas Richter wrote:

  No need to go deep into the CPU, but very short instruction sequences could be pre-compiled by the CPU, buffered somewhere within the CPU and executed faster by a single opcode than a subroutine jump.
 

  Our goal is that subroutine calls BSR and RTS will be free instructions at some point. This will make subroutines nice to use.

 
 
Thomas Richter wrote:

  Emulation: "Virtual hardware": Would require the CPU to run through an exception cycle if a predefined hardware address is read from or written to. Could be done by a full MMU plus complete exception processing, but might be worth implementing separately.
 

 
  What HW would you like to emulate?
 
Thomas Richter wrote:

  String-handling: strlen,strcpy and strchr could be handled by fast and simple copy/test and loop instructions.
 

  Agreed; for this the 68k instruction set is excellent already.
 
  This is a perfect memcopy :)
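  .loop                    ; D0.W holds the longword count minus 1
  MOVE.L (A0)+,(A1)+       ; copy one longword, post-incrementing both pointers
  DBF    D0,.loop          ; decrement D0.W and loop until it reaches -1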
 
 

Claudio Wieland
Germany
(Natami Team)
Posts 706
17 Oct 2010 15:37



Would it probably make sense to have parts of the CPU microcode re-writable while the CPU is working, i.e. "define your own instruction at run time".

> This sounds difficult.

There's the NIOS from Altera. NIOSII allows for 256 custom instructions. I think Mr. Richter's idea could be quite revolutionary.

Imagine this: If we allow each task to change the microcode of a defined number of instructions, then special computing tasks could be executed much faster. And we would not clog opcodes for fixed instructions. On task switch, an internal SRAM block with microcode could be updated. I think this idea could be quite revolutionary and increase the performance a lot.

Just think about current developments of FPGAs, which allow for realtime partial reconfiguration to better use the available space!

Claudio Wieland
Germany
(Natami Team)
Posts 706
17 Oct 2010 16:58


We could even consider a "FPGA within FPGA" concept to implement realtime HW updates during task switches :) .

Ajc ;)
United Kingdom

Posts 688
17 Oct 2010 17:00


RE: microcode.
Microcode usually has a lot of performance _penalties_ on anything I've ever used. They're not super-fast-instructions but meta-instructions that use several of the real instructions to achieve their operation.

On all the projects I've worked on we use compiler flags to avoid outputting them wherever possible because they cause pipeline stalls and prevent dual issuing of instructions (PowerPC).

They can of course make your compiled code smaller, even a lot smaller where they're used heavily, but they do not make code execute faster.

Andy

André Jernung
Sweden
(MX-Board Owner)
Posts 988
17 Oct 2010 17:09


Claudio, imagine what that could mean for e.g. MAME :)

Thomas Richter
Germany
(MX-Board Owner)
Posts 1425
17 Oct 2010 17:33


Gunnar von Boehn wrote:

Thomas Richter wrote:

  Let me add a couple from my list:
   
  MP3-Playing: Again, multimedia-related. Would require fast bit-manipulations for decoding, fast vector operations, fast DCT transformations. Here again multiply-add comes into play, and support for fixpoint arithmetics if FPU extensions are too hard. That means multiply-round-shift instructions.
 

  What type of multiply-round-shift is needed?

(a * b + ((1 << bits) >> 1)) >> bits

where either a,b are 16-bit signed and the product is 32-bit signed, or
a,b are 32-bit signed and the product is 32-bit signed, and
where bits is a constant that is encoded as an immediate operand in the
instruction.

I mostly need bits == 16, but other choices are reasonable.
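
A minimal Python sketch of the multiply-round-shift operation described above, for anyone who wants to experiment with it. The name `mul_round_shift` is hypothetical, and because Python integers are unbounded the sketch does not model 16-bit or 32-bit overflow of the intermediate product; it only shows the rounding behaviour.

```python
def mul_round_shift(a: int, b: int, bits: int = 16) -> int:
    """Fixed-point multiply: add half of the final unit, then shift right."""
    if bits < 1:
        raise ValueError("bits must be at least 1")
    half = (1 << bits) >> 1        # rounding offset, equal to 2**(bits - 1)
    return (a * b + half) >> bits  # >> on Python ints is an arithmetic shift


# With bits == 15 (Q15-style operands), 0.5 * 0.5 gives 0.25:
print(mul_round_shift(16384, 16384, 15))  # 8192
```

Adding the half-unit before the shift is what turns plain truncation into rounding, which is the part a single instruction would fold in for free.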

Gunnar von Boehn wrote:

   
Thomas Richter wrote:

    JPEG: Mostly DCT and bit-shuffling. Something like a bit-buffer instruction might be helpful: Append bits from one register into another register, increment bit-position, set carry if full. bit-field instructions might be helpful here.
 

  Can you explain this again in more detail?
  Is there something which you can not do with BFEXT / BFINS?

One of the often used schemes in encoding is "bit-packing" of bits into a byte-array, i.e. you hold a bit-pointer, have an array of ULONG or UBYTE and want to insert bits, then incrementing the pointer. BFINS does the insertion quite fine, but it does not increment a bit-pointer, i.e. doesn't adjust its registers.

BFINS Dn,<ea>{offset:width}

would require a form where the offset is incremented by the number of bits inserted, and when this wraps around, <ea> is incremented:

BFINS D0,(a0)+{d1:3}

insert three bits into the bit-buffer pointed to by a0, increment d1 by 3; if that wraps around 32, increment a0 by four and subtract 32 from d1.

Similar for BFEXT.

For JPEG coding/decoding it would be better if the instruction would be byte-oriented, i.e.

insert three bits into the bit-buffer pointed to by a0, increment d1 by 3, if that wraps around 8, increment a0 by one, subtract 8 from d1.

Bits should be filled left to right (i.e. MSB first).

Similar with dynamic width in a third register.

This is the core instruction for Huffman coding, the basis for many codecs.
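
As a software model of the byte-oriented, MSB-first variant described above, here is a minimal Python sketch. The class name `BitWriter` and its fields are hypothetical: `buf` stands in for the buffer at a0 and `bitpos` for the bit offset in d1. It shows the bookkeeping an auto-incrementing BFINS would take over, not how the instruction would be implemented.

```python
class BitWriter:
    """MSB-first bit packer with a byte-oriented, auto-advancing position."""

    def __init__(self) -> None:
        self.buf = bytearray()  # stands in for the buffer pointed to by a0
        self.bitpos = 0         # stands in for d1: bits used in the last byte (0..7)

    def put(self, value: int, width: int) -> None:
        """Append the low `width` bits of `value`, most significant bit first."""
        for shift in range(width - 1, -1, -1):
            if self.bitpos == 0:
                self.buf.append(0)               # start a fresh byte ("increment a0")
            bit = (value >> shift) & 1
            self.buf[-1] |= bit << (7 - self.bitpos)
            self.bitpos = (self.bitpos + 1) & 7  # wrap around 8


w = BitWriter()
w.put(0b101, 3)
w.put(0b11111, 5)
w.put(0b1, 1)
print(w.buf.hex())  # bf80
```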

Gunnar von Boehn wrote:

 
Thomas Richter wrote:
 
    Saturating arithmetics: Often used as last stage in such processing chains, both in JPEG and MP3.
 

  I was thinking about what makes most sense here.
  I wonder if the Coldfire SATS instruction is really that useful.
  CF SATS does saturate to 32bit.
 
  I wonder if the following would make sense.
  I would assume this could be nice:
  Let's say your values are 8 or 16 bit.
  You extend them to 32bit registers.
  You do the calculations/processing.
  At the end you execute a SATS.W or SATU.W (new instructions!)
  These would check for the 16bit boundary and saturate to it in 1 clock.
  Does this sound reasonable?

It would need dynamic sizes, i.e. I would have signed or unsigned arithmetic, and a maximum (or minimum) value. If SATS.W takes a parameter (say, number of bits), then this would be sufficient.

SATS.W #12,d0

-> clip d0 to -2048 to 2047, depending on the V flag (integer overflow) and N flag.

Similarly:

SATU.W #10,d1

Clip d1 to 0..1023, depending on the C flag.
 
SATU.W d2,d1:

Dynamically sized.
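
A minimal Python sketch of the clipping ranges proposed above. The function names `sats` and `satu` simply mirror the suggested mnemonics and are hypothetical; the sketch clamps by comparing values, whereas the proposed instructions would decide from the condition flags.

```python
def sats(value: int, bits: int) -> int:
    """Clamp to the signed `bits`-bit range, e.g. bits=12 gives -2048..2047."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def satu(value: int, bits: int) -> int:
    """Clamp to the unsigned `bits`-bit range, e.g. bits=10 gives 0..1023."""
    return max(0, min((1 << bits) - 1, value))


print(sats(5000, 12), sats(-5000, 12), satu(-7, 10), satu(2000, 10))
# 2047 -2048 0 1023
```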

Gunnar von Boehn wrote:
 
 
Thomas Richter wrote:

  Video coding: Fast "sum of absolute differences", the main speed-brake is the motion prediction. Better even, "sum of squares" of two rectangular memory regions, aka "scalar product".
 

  Which instructions make the most sense for this?

SAD.W (a0),(a1),d0,d1

Compute the differences of the 16-bit words pointed to by a0 and a1 over a block d0 entries long, and add the absolute values of the differences to d1. A "step" instruction (increment, take difference, add up) would probably be sufficient.

The native 68K instruction sequence is longer and requires an additional register.

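move.w (a0)+,d1     ; load the next word of the first block
sub.w (a1)+,d1      ; d1 = a - b
bcc.s .nocarry      ; no borrow (unsigned a >= b): difference is already non-negative
neg.w d1            ; borrow: negate to get |a - b|
.nocarry:
ext.l d1            ; widen to 32 bits (sign-extends, so assumes |a - b| < 32768)
add.l d1,d2         ; accumulate into the running sum

Similarly, the same with multiplication:

move.w (a0)+,d1     ; load the next word of the first block
sub.w (a1)+,d1      ; d1 = a - b
muls.w d1,d1        ; d1 = (a - b)^2 as a 32-bit signed product
add.l d1,d2         ; accumulate the sum of squares

Scalar product:

move.w (a0)+,d1     ; load the next word of the first block
muls.w (a1)+,d1     ; d1 = a * b as a 32-bit signed product
add.l d1,d2         ; accumulate the scalar product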
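
For reference, the three accumulations above can be written in a few lines of Python. The function names `sad`, `ssd` and `dot` are hypothetical, and the sketch works on plain integer lists, so it shows the arithmetic only and ignores 16-bit and 32-bit overflow.

```python
def sad(xs, ys):
    """Sum of absolute differences of two equally long blocks."""
    return sum(abs(x - y) for x, y in zip(xs, ys))


def ssd(xs, ys):
    """Sum of squared differences of two equally long blocks."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys))


def dot(xs, ys):
    """Scalar product of two equally long blocks."""
    return sum(x * y for x, y in zip(xs, ys))


a = [10, 20, 30, 40]
b = [12, 18, 33, 40]
print(sad(a, b), ssd(a, b), dot(a, b))  # 7 17 3070
```

In a motion search one of these is evaluated for every candidate block position, which is why the per-element step is worth a dedicated instruction.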

Gunnar von Boehn wrote:
 
 
Thomas Richter wrote:
 
    Cryptography: Is there a need for this, for example?
 

  For SSL maybe..

Secure web browsing. Makes sense. Unfortunately I do not know enough about it to tell you what it would require.

Gunnar von Boehn wrote:
 
 
Thomas Richter wrote:
 
    Emulation: Endian-swap, byte-reversal access of memory?
 

  For this we have BYTEREV and BITREV.
  Instructions stolen from Coldfire :-D

Excellent.

Gunnar von Boehn wrote:
   
 
Thomas Richter wrote:

  Would it probably make sense to have parts of the CPU microcode re-writable while the CPU is working, i.e. "define your own instruction at run time".
 

  This sounds difficult.

I know...

Gunnar von Boehn wrote:
   
  What HW would you like to emulate?
 
Thomas Richter wrote:

    String-handling: strlen,strcpy and strchr could be handled by fast and simple copy/test and loop instructions.
 

  Agreed; for this the 68k instruction set is excellent already.
 
  This is a perfect memcopy :)
  .loop
  MOVE.L (A0)+,(A1)+
  DBF    D0,.loop 

Not exactly. a) it copies only multiples of 4 bytes, b) it clogs up the cache, c) it requires additional instruction decoding along its way. c) could be resolved by a "zero-level cache" that keeps the instructions looped over in the pipeline (Intel's next generation seems to do that; IIRC the P4 also had an instruction cache for its micro-ops).

Greetings,
Thomas



Gunnar von Boehn
Germany
(Moderator)
Posts 5775
17 Oct 2010 17:34


Mr Copland ;) wrote:

RE: microcode.
Microcode usually has a lot of performance _penalties_ on anything I've ever used. They're not super-fast-instructions but meta-instructions that use several of the real instructions to achieve their operation.

I agree with Andy.
In the best case, microcode will execute at 1 instruction per clock.
This is the same speed as normal code.

What makes sense performance-wise is to "extend" the CPU with special instructions for special purposes. For example, one could develop a special multiply-add instruction. But adding such special instructions needs to be done in VHDL and requires changes to the CPU internals.

This is actually how the NIOS works.
The NIOS is a basic RISC CPU with some free encoding space.
NIOS customers can develop their own instructions (in VHDL) and include them in the CPU at FPGA compile time.

We can of course do the same. Actually, we are doing the same, and are discussing which CPU instruction set enhancements make good sense.

Claudio Wieland
Germany
(Natami Team)
Posts 706
17 Oct 2010 18:14


We should elaborate on the possibility of making an "FPGA within FPGA". The instructions could be done in HW then.

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
17 Oct 2010 18:34


Thomas Richter wrote:

 
Gunnar von Boehn wrote:
   
    This is a perfect memcopy :)
    .loop
    MOVE.L (A0)+,(A1)+
    DBF    D0,.loop 

 
Not exactly. a) it copies only multiples of 4 bytes,


Which should be no problem.
One just needs to copy the remaining 0-3 bytes in a second loop.
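
The two-loop structure can be sketched in Python as follows. The function name `copy_longwords_then_tail` is hypothetical, and the sketch only mirrors the control flow (a 4-byte main loop followed by a 0-3 byte tail loop); it says nothing about cache behaviour or speed.

```python
def copy_longwords_then_tail(src: bytes, dst: bytearray, n: int) -> None:
    """Copy n bytes from src to dst: 4 bytes at a time, then the leftover bytes."""
    longs, tail = divmod(n, 4)
    pos = 0
    for _ in range(longs):   # main loop: one 32-bit move per iteration
        dst[pos:pos + 4] = src[pos:pos + 4]
        pos += 4
    for _ in range(tail):    # second loop: the remaining 0-3 bytes
        dst[pos] = src[pos]
        pos += 1


source = bytes(range(11))
target = bytearray(11)
copy_longwords_then_tail(source, target, 11)
print(bytes(target) == source)  # True
```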

Thomas Richter wrote:

  b) it clogs up the cache,

This could be a good thing too.
Going through the cache allows the CPU to use the cache for stream prefetching - and this is really needed for best performance.

Thomas Richter wrote:

c) it requires additional instruction decoding along its way.

This is no problem. Instruction decoding is sort of free.

Thomas Richter wrote:

  c) could be resolved by a "zero-level cache" that keeps the instructions looped over in the pipeline (Intel's next generation seems to do that; IIRC the P4 also had an instruction cache for its micro-ops).

But you only need to solve this if your decoder is not fast enough to decode in real time. :-D

For a superscalar design it is always a good idea to cheat a little bit like this. :-)
