
Welcome to the Natami / Amiga Forum

This forum is for AMIGA fans interested in the new NATAMI platform.



Welcome to the Natami lounge.
Meet new AMIGA friends here and enjoy having a friendly chit chat.

68K Ideas for the Future
Thierry Atheist
Canada

Posts 1830
19 Sep 2010 18:23


That ABS sounds very useful!

The 68050/070 will be incredibly powerful with any 2 cycle to 1 cycle gain, I mean, that's a 100% performance boost!

Also, we will have the FASTEST Amiga, and Atari ST and Falcon TT, as well as the fastest Macintosh that was ever made, too! And, we all know that some SW on the Mac was better than Amiga's only because they had a bigger market share than Amiga and their customers bought that expensive SW.

The NatAmi is going to be simply amazing!

Matt Hey
USA

Posts 737
19 Sep 2010 19:24


Denis Markovic wrote:

   
Gunnar von Boehn wrote:

      FF1    Find First 1
   

    I guess something like count leading zeros (i.e. count number of most significant 0 bits in a variable)? Very important
    for block floating point implementation, i.e. normalization of a block for higher precision/higher dynamic range;
   
    it would also be very useful, to have the instruction for
    signed numbers, i.e. if the number is positive return number
    of most significant 0 bits (minus 1), if it is negative return number
    of most significant 1 bits (minus 1)
    If you want to save space for op code, one instruction could
    do both and put the unsigned result (# leading 0) in the low
    part and the signed result in the high part of the result
 

 
  Can't you do an abs or neg (if you know it's negative) before the ff1?
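
To make the two variants in this exchange concrete, here is a rough Python sketch of a 32-bit count-leading-zeros and of the signed version Denis describes. The names `clz32` and `norm_shift32` and the fixed 32-bit width are assumptions made for illustration; nothing here is taken from the ColdFire manual or the Natami design.

```python
MASK32 = 0xFFFFFFFF

def clz32(x):
    """Count the most significant 0 bits of a 32-bit value (32 if x is zero)."""
    x &= MASK32
    return 32 - x.bit_length()

def norm_shift32(x):
    """Signed variant: leading 0s (positive) or leading 1s (negative), minus 1."""
    x &= MASK32
    if x & 0x80000000:
        # One's complement turns the leading 1s into leading 0s.
        x = ~x & MASK32
    return clz32(x) - 1
```

Note that the sketch uses NOT rather than abs or neg. For negative powers of two (for example -1 or -2) the count taken after abs/neg comes out one lower than the number of leading 1 bits minus 1, so the two approaches are not bit-exact equivalents.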
 
 

    Important in this context:
    array minimum/maximum search (i.e. example pseudocode for
    findmax.w var0,cntvar,var2
    if(var0[31:16] > var2[31:16]) {
      var2[31:16] = var0[31:16];
      var2[15:0] = cntvar[15:0];
      cntvar++;
    }
    if(var0[15:0] > var2[31:16]) {
      var2[31:16] = var0[15:0];
      var2[15:0] = cntvar[15:0];
      cntvar++;
    });
 

 
  The 68k addressing modes already handle this well if the data in the array is byte, word or long.
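
For readers trying to follow the proposed findmax.w, here is a small Python sketch of one way to read it. It makes assumptions that the quoted pseudocode does not state: the values are treated as unsigned 16-bit words, and the counter advances for every element examined (the quoted version increments it only inside the if). The name `findmax_w` is just a label for the sketch.

```python
def findmax_w(var0, cntvar, var2):
    """One step of the search.

    var0 holds two packed 16-bit values, cntvar is the running element
    index, and var2 packs (current maximum << 16) | index of that maximum.
    Returns the updated (cntvar, var2).
    """
    best = (var2 >> 16) & 0xFFFF
    best_idx = var2 & 0xFFFF
    for value in ((var0 >> 16) & 0xFFFF, var0 & 0xFFFF):
        if value > best:
            best, best_idx = value, cntvar & 0xFFFF
        cntvar += 1  # assumption: count every element, not only new maxima
    return cntvar, (best << 16) | best_idx

def find_max(packed_words):
    """Run findmax_w over a list of 32-bit words, each holding two values."""
    cntvar, var2 = 0, 0
    for word in packed_words:
        cntvar, var2 = findmax_w(word, cntvar, var2)
    return var2 >> 16, var2 & 0xFFFF  # (maximum, index)
```

With the array 3, 9, 4, 7 packed as `[0x00030009, 0x00040007]`, `find_max` returns the maximum 9 at index 1.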
 
 
 
    We could have 3 modes (similar to ARM):
    68k basic version, 68k basic plus DSP instructions (with memory mapped or special registers or registers could be readable/writable with an extra move), 68k basic plus DSP plus SIMD?
 

 
  I don't like it. It gets too complicated and there are more registers to save on context switches. The ColdFire instruction additions (MAC, SATS, FF1, etc.) already add enhanced and fast DSP-like instructions. The 68k addressing modes already make table use reasonably fast. Fast bitfield instructions and fused instructions should add more speed and flexibility. I think the N68050 with CF instructions is already more powerful than the 96000 DSP for integer manipulation in versatility and functionality. I would rather see the FPU expanded with some SIMD functionality than a full-blown SIMD unit.
 
 
Phil G. wrote:

    Area used by things such as ff1, bitrev, byterev have a few places left. So abs could fit in here. I'd also recommend adding a new BitCnt (bit counter), PopCnt if you prefer to name it like that.
 

 
  If the ColdFire instruction additions are going to be added then they should all be added. Full CF support is marketable. Partial CF support would be confusing and require more modification of existing CF software. FF1 is common and still useful because of the smaller size even with the more flexible BFFFO. BITREV is not commonly used but it can make other instructions more powerful and is not easy to duplicate in software. BYTEREV will be used a lot by some code (drivers and datatypes) and not at all by others.
 
  BitCnt is slow to implement in software all right. I have seen it used before (vasm) although it's not very common in my experience. Is this more common in your experience? A 2 word encoding would be adequate if added. In that case, it might be nice to do a new bitfield instruction for flexibility allowing it to be used in more cases. I don't know if Gunnar likes the bitfield instructions although he said they are fast but he hasn't said how fast.
 
    BFCO (Bit Field Count Ones)
 
    BFCO {offset,width},Dn
 
  Description: Counts the number of ones in the source operand and places the result in the destination data register. The instruction sets the condition codes according to the bit field value. The field offset and field width select the field. The field offset specifies the starting bit of the field. The field width determines the number of bits in the field.
 
  This could be useful for bit mapped flags too. The condition codes being set could be useful as well. It all depends on the bit field instructions being fast enough though. Gunnar?
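
A rough software model of the proposed BFCO, as a Python sketch. BFCO is only a proposal in this thread, so the details are assumptions: the sketch follows the usual 68020 bit field convention for data registers (offset 0 is the most significant bit, offsets are taken modulo 32 and the field wraps around), and it leaves out the condition codes. The function name `bfco` is only for illustration.

```python
def bfco(dn, offset, width):
    """Count the 1 bits in a bit field of a 32-bit register value.

    offset 0 is the most significant bit; width is 1..32.
    """
    if not 1 <= width <= 32:
        raise ValueError("width must be in 1..32")
    dn &= 0xFFFFFFFF
    offset %= 32
    # Rotate left so the field starts at the MSB (register fields wrap).
    rotated = ((dn << offset) | (dn >> (32 - offset))) & 0xFFFFFFFF
    field = rotated >> (32 - width)
    return bin(field).count("1")
```

For bit mapped flags, `bfco(flags, 0, 32)` gives the number of flags set in the whole register, and a narrower field counts only one group of flags.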
 
 
Gunnar von Boehn wrote:

    The new A-line instructions could be enabled or disabled with a special bit in the SR register. This way each task could decide to be in AMIGA or MAC mode. The AMIGA OS could default to enabled, and the MAC-Emulator would disable them. Should be simple and fully compatible.
 

 
  I don't like it. The instruction space to recoup is the CALLM and RTM. I bet these instructions have never been used in a commercial piece of software on the Amiga. There is probably room here for a word ABS and 2 word BFCO. I'm a little more hesitant about reusing the BCD instruction space even though they were removed on the CF.
 
 

    Regarding the Shift.
    I wonder if adding an "extend" opcode would make sense.
    Lets say we identify a 16bit opcode which is so far unused and has some free bits.
   
    This opcode could be used to context sensitive extend the existing instructions:
    Adding this opcode to the existing SHIFT #1,(ea) instruction would make it to an SHIFT #n,(ea)
 
    And other instructions could be extended like this also ... Just a thought.
 

 
  That's a good idea for not using up instruction space and could be used for rotates also, but it makes the instruction 2 words for shifts greater than 8. I don't think it's worth it. The superscalar N68070 should be able to handle the moveq + shift/rotate as well as the superscalar 68060. The N68050 can already fuse 2x shifts/rotates for immediate 8-16 (also swap for 16). I can live with shifts and rotates greater than 16 being a little slower until the N68070 ;). The instruction space is just too valuable to make 1 word encodings for shifts/rotates and 2 words is not enough of an advantage IMHO.


Richard Maudsley
United Kingdom

Posts 821
19 Sep 2010 19:52


Thierry Atheist wrote:
And, we all know that some SW on the Mac was better than Amiga's only because they had a bigger market share than Amiga and their customers bought that expensive SW.

No, the Mac was nowhere near as popular as the Amiga. The difference is that the Amiga was a complete failure in the country where good productivity software comes from, and the Mac was only a slight failure.

Thierry Atheist
Canada

Posts 1830
19 Sep 2010 20:58


Richard Maudsley wrote:

Thierry Atheist wrote:
And, we all know that some SW on the Mac was better than Amiga's only because they had a bigger market share than Amiga and their customers bought that expensive SW.

 
  No, the Mac was nowhere near as popular as the Amiga. The difference is that the Amiga was a complete failure in the country where good productivity software comes from, and the Mac was only a slight failure.

So, Apple really was afraid of Commodore and AMIGA.

Denis Markovic
Germany
(Natami Team)
Posts 41
19 Sep 2010 20:58


Matt Hey wrote:

 
Denis Markovic wrote:

     
Gunnar von Boehn wrote:

        FF1    Find First 1
     

      I guess something like count leading zeros (i.e. count number of most significant 0 bits in a variable)? Very important
      for block floating point implementation, i.e. normalization of a block for higher precision/higher dynamic range;
     
      it would also be very useful, to have the instruction for
      signed numbers, i.e. if the number is positive return number
      of most significant 0 bits (minus 1), if it is negative return number
      of most significant 1 bits (minus 1)
      If you want to save space for op code, one instruction could
      do both and put the unsigned result (# leading 0) in the low
      part and the signed result in the high part of the result
   

   
    Can't you do an abs or neg (if you know it's negative) before the ff1?
 
   

 
  You could do that, but it might be difficult to do it in one cycle?
But maybe you are right: if I had to choose between this and the min/max, I would choose min/max as the instruction to implement, as it is much more important (used in loops).
 
 
Matt Hey wrote:

   
   

      Important in this context:
      array minimum/maximum search (i.e. example pseudocode for
      findmax.w var0,cntvar,var2
      if(var0[31:16] > var2[31:16]) {
        var2[31:16] = var0[31:16];
        var2[15:0] = cntvar[15:0];
        cntvar++;
      }
      if(var0[15:0] > var2[31:16]) {
        var2[31:16] = var0[15:0];
        var2[15:0] = cntvar[15:0];
        cntvar++;
      });
   

   
    The 68k addressing modes already handle this well if the data in the array is byte, word or long.
   

 
  Yes, but it is even more important to be fast here than with the norm/exp instruction (FF1);
 
  could this operation be done in one cycle with normal 68k
  instructions on e.g. 68050 (or later)?
 
If not, it would really be useful to have the described instruction for all sorts of block floating point code (executed on bigger blocks in loops) etc., unless you implement everything in floating point...

But it is true that extra registers might be a problem as they would have to be saved in a context change and old software is not aware of it ... so might be that we have to use known registers for that.

We will run into similar problems for selected SIMD instructions as we will need vector registers for this, and that will be a lot of data to save on context switches; any ideas?


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
19 Sep 2010 22:06


Denis Markovic wrote:

could this operation be done in one cycle with normal 68k  instructions on e.g. 68050 (or later)?

You mean a MAX instruction?
Like: MAX.L D0,D1  (D1 will hold the bigger of D0 and D1 ?) 
How important is this in your opinion?

Denis Markovic wrote:
 
  We will run into similar problems for selected SIMD instructions as we will need vector registers for this, and that will be a lot of data to save on context switches; any ideas?

How about overlaying FPU and SIMD registers?



Gunnar von Boehn
Germany
(Moderator)
Posts 5775
19 Sep 2010 22:11


Crazy idea for a possible extension:

Let's say we use the "A"-Line as extension code.
This gives us a 12-bit extension which we could add to any existing instruction.
With this extension it would be possible to:
1) make every instruction conditional
2) add a 3rd parameter at the same time
3) extend the existing parameters by 1 bit.

This means we could have every instruction use all registers equally: all instructions could update address registers and all addressing modes could use data registers as pointers.
Another option for using this extra bit would be to double the number of registers: 16 data + 16 address.

I know these ideas were proposed here before.
Using the A range would allow us to do all of them at the same time.

Crazy?

Matt Hey
USA

Posts 737
19 Sep 2010 22:13


Denis Markovic wrote:

 
Matt Hey wrote:

      Can't you do an abs or neg (if you know it's negative) before the ff1?
 

 
  You could do that, but could you do it in one cycle?
 

 
  No, but the N68050 will probably already be able to do this in fewer cycles than the 96000 DSP. The superscalar 68060 would need 10-17 cycles to do this. The N68050 should take a fraction of that and the N68070 maybe even 1-2 cycles. Is it worth it to add specialized instructions when general purpose instructions can perform this well? It would be better to fuse abs + ff1 if it really is used that much. Creating a new instruction would not be any faster then.
 
  I am not completely against all instruction additions. An ABS instruction would at least be general purpose. It needs to be very commonly used though as it offers only a very small speed up and saves little if any code on the N68050.
 
 
Matt Hey wrote:


     

        Important in this context:
        array minimum/maximum search (i.e. example pseudocode for
        findmax.w var0,cntvar,var2
        if(var0[31:16] > var2[31:16]) {
          var2[31:16] = var0[31:16];
          var2[15:0] = cntvar[15:0];
          cntvar++;
        }
        if(var0[15:0] > var2[31:16]) {
          var2[31:16] = var0[15:0];
          var2[15:0] = cntvar[15:0];
          cntvar++;
        });
     

     
      The 68k addressing modes already handle this well if the data in the array is byte, word or long.
     

 
  Yes, but it is even more important to be fast here than with the norm/exp instruction (FF1);
 

 
  This is where the 68k already excels. It will already outperform RISC processors at over 500MHz and probably more like 1GHz. The larger caches, faster ram speed, and lack of virtual addresses of the N68050 should make this as fast as some of the powerhouse processors. It should handle this code better than the 96000 also.
 
 

  could this operation be done in one cycle with normal 68k
  instructions on e.g. 68050 (or later)?
 

 
  I doubt it even with a specialized instruction but it is already very fast when the data is in the cache.
 

Matt Hey
USA

Posts 737
20 Sep 2010 01:05


Gunnar von Boehn wrote:
 
  Let's say we use the "A"-Line as extension code.

...

  Crazy?

Tempting for what it would add but a big fat NO for compatibility, simplicity and code bloat. You have already touted the virtues of the 68k operating in memory without registers and your code fusion eliminates the need for 3 operand instructions. Leave the 3 operand and conditional instructions as internal representations only please. Save the register expansion and semi-SIMD 3 operand instructions for the FPU.


Cesare Di Mauro
Italy

Posts 528
20 Sep 2010 06:22


Gunnar von Boehn wrote:

  BScc    Conditional BSR
  JScc    Conditional JSR

I never had such a need. I don't scream if a processor lacks such things.

  Jcc    Conditional JMP

Useless. We have long conditional branches.

  JOIN    Src.LW -> Dst.HW
  JOINB  Src.LB -> Dst.HB
  SPLIT  Src.HW  -> Dst.LW (sign extended)
  SPLITB  Src.HB  -> Dst.LB

Very specific instructions; I think very few people will be interested in them.

I would find a "SHUFFLE" instruction more useful: one that lets you permute the data in several ways, also replacing SWAP, ROL.W #8, and other instruction mixes used to place data where you want it to be.

  ORC    Or with Complement

I don't see a use case.

I would find a "MASK"/ANDC more useful: an AND with complement, used to apply data masking.

Cesare Di Mauro
Italy

Posts 528
20 Sep 2010 06:25


Gunnar von Boehn wrote:

We could also think if there are ways to enhance the 68K instruction set without even adding new instructions.
 
  For example:
  Let's say we want to add this instruction:
  LSL.L #9,Dn
 
  Let's say that because of encoding space limitations we find no 16-bit encoding for this, only a 32-bit encoding.
 
  If this would be the case we could also do the same work by using 2 instructions:
  E.g
  LSL.L #8,Dn
  LSL.L #1,Dn
 
  Of course the two instructions would take 2 clocks instead of 1.
  But if we "enhance" the CPU decoder it could merge those 2 instructions into 1, thereby doing these 2 instructions in 1 cycle.
 
  The net effect would be the same as adding a new 32-bit encoding, but without even needing to add a new encoding.
  This means our CPU would run faster on old code and new code without us needing to change the existing 68k compilers. :-)
 
  Another merge example:
 
  MOVE.L D0,D1
  LSL.L #8,D1
  LSL.L #1,D1
 
  These 3 instructions could in theory be all merged into 1.
 
  Another example:
  It would be great if we could do BSRcc in 1 cycle.
  Only the encoding is challenging.
 
  But maybe we could encode it like this:
  Bcc
  BSR
 
  Then it would be backward compatible with the old 68K CPUs
  and if the Decoder is smart enough still be executed in 1 cycle.
 
  What do you think?

I think that instructions merging is a preferable way to follow, instead of introducing new instructions.

It will also be more backward compatible with software which was already written.
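
As a toy illustration of the merging idea quoted above, here is a Python sketch of a peephole pass that fuses adjacent immediate LSL instructions on the same register. It is only a software model under stated assumptions: instructions are represented as hypothetical `(mnemonic, count, register)` tuples, only the LSL case from the quoted example is handled, and fusion is limited to a combined count below 32. How the real decoder would do this is not described in this thread.

```python
def fuse_shifts(program):
    """Merge adjacent immediate LSL instructions that target the same register.

    program is a list of (mnemonic, count, register) tuples.
    """
    fused = []
    for op, count, reg in program:
        if fused:
            prev_op, prev_count, prev_reg = fused[-1]
            same_target = op == prev_op and reg == prev_reg
            if same_target and op.startswith("LSL") and prev_count + count < 32:
                # Replace the pair with one shift by the combined count.
                fused[-1] = (op, prev_count + count, reg)
                continue
        fused.append((op, count, reg))
    return fused
```

Feeding it the pair `LSL.L #8,D1` / `LSL.L #1,D1` yields a single `("LSL.L", 9, "D1")` entry, while shifts on different registers are left alone. The old code stays valid on older 68k CPUs, which is the backward-compatibility argument made here.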

Cesare Di Mauro
Italy

Posts 528
20 Sep 2010 06:32


Gunnar von Boehn wrote:

Claudio Wieland wrote:

  We also have to mind limited cache sizes. Smaller code is better.
 

 
  This is true. But our free encoding space is also very limited.
 
  In the A-range the 68000 has room for 2 new full instructions using the form EA,DN and having B/W/L support.
 
  This means we can not add many instructions that encode in 16bit.
  But we can add many instructions that encode in 32bit.
 
  Simple instructions that only use 1 operand, like ABS Dn, need less encoding space. We can certainly still add a few of those in 16 bits.

If you plan to use the Line-A opcode space, please reserve it to the 3-operand SIMD unit, as I said some time ago.

It'll be perfect for such purpose, giving a clear and flexible design.

Also, I think it'll be much more useful than an ABS instruction.

Cesare Di Mauro
Italy

Posts 528
20 Sep 2010 06:35


Denis Markovic wrote:
  One drawback is the 3 operands; if we don't want that, we could
  have special DSP registers to e.g. store the minimum/maximum value and the minimum/maximum index; after looping with this instruction you could simply read the values from that memory mapped register; while this is not very 68k-like it would be very powerful for signal processing and save a lot of opcode space.
 
  We could have 3 modes (similar to ARM):
  68k basic version, 68k basic plus DSP instructions (with memory mapped or special registers or registers could be readable/writable with an extra move), 68k basic plus DSP plus SIMD?

ARM already added a DSP extension first, and then SIMD instructions, to its family, but they were very limited.

With the Cortex family a brand new, independent SIMD unit (NEON) was introduced to solve the same problems in a much more convenient and elegant way.

So why would we want to make the same mistake?

Cesare Di Mauro
Italy

Posts 528
20 Sep 2010 06:38


Phil G. wrote:
 
Gunnar von Boehn wrote:

  In the A-range the 68000 has room for 2 new full instructions using the form EA,DN and having B/W/L support.
 

You really want to break 68k Mac emulators, don't you ? ;-)

If you use Line-A for the "Robin" SIMD unit, you won't break compatibility, since it doesn't have to be compatible with anything already developed. ;)

Cesare Di Mauro
Italy

Posts 528
20 Sep 2010 06:41


Gunnar von Boehn wrote:
You mean a MAX instruction?
  Like: MAX.L D0,D1  (D1 will hold the bigger of D0 and D1 ?) 
  How important is this in your opinion?

They are commonly used, and there are even RISCs which implement them (STMicroelectronics LX did it).

Denis Markovic wrote:
 
  We will run into similar problems for selected SIMD instructions as we will need vector registers for this, and that will be a lot of data to save on context switches; any ideas?
 

  How about overlaying FPU and SIMD registers?

It's the same mistake that Intel did with MMX. Don't repeat it. :)

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
20 Sep 2010 06:46


Cesare Di Mauro wrote:

Gunnar von Boehn wrote:

How about overlaying FPU and SIMD registers?

It's the same mistake that Intel did with MMX. Don't repeat it. :)

Why should this be a mistake - it's clever!
And others like ARM did it also just recently.

INTEL had different problems.
INTEL's real problem was that you could not mix MMX and FPU code.


Cesare Di Mauro
Italy

Posts 528
20 Sep 2010 06:47


Gunnar von Boehn wrote:

Crazy idea for a possible extension:
 
  Let's say we use the "A"-Line as extension code.
  This gives us a 12-bit extension which we could add to any existing instruction.
  With this extension it would be possible to:
  1) make every instruction conditional
  2) add a 3rd parameter at the same time
  3) extend the existing parameters by 1 bit.
 
  This means we could have every instruction use all registers equally: all instructions could update address registers and all addressing modes could use data registers as pointers.
  Another option for using this extra bit would be to double the number of registers: 16 data + 16 address.
 
  I know these ideas were proposed here before.
  Using the A range would allow us to do all of them at the same time.
 
  Crazy?

For Line-A I have quite different ideas.

Anyway, I don't like opcode prefixes. It sounds like an x86.

Also, doubling the registers isn't that useful, and introduces problems such as providing the necessary extension bits to instructions which use more than 2 registers (long multiplication, bit fields) and to the register indirect addressing modes which use 2 registers.

For me it's a waste of useful opcode space. Also, it resembles a new ISA to my eyes, which doesn't fit well with the 68000 family tradition.

Cesare Di Mauro
Italy

Posts 528
20 Sep 2010 06:50


Gunnar von Boehn wrote:

Cesare Di Mauro wrote:

 
Gunnar von Boehn wrote:

  How about overlaying FPU and SIMD registers?

  It's the same mistake that Intel did with MMX. Don't repeat it. :)
 

 
  Why should this be a mistake - it's clever!

It limited data usage, since FPU registers have a fixed size.

A 64-bit SIMD unit is anachronistic right now.

  And others like ARM did it also just recently.

On the contrary: ARM introduced a brand new SIMD unit, NEON, which brings its own independent register set.

  INTEL had different problems.
  INTEL's real problem was that you could not mix MMX and FPU code

Yes, it was another problem. But not the only one.

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
20 Sep 2010 06:54


Cesare Di Mauro wrote:

Gunnar von Boehn wrote:

 
Cesare Di Mauro wrote:

 
Gunnar von Boehn wrote:

  How about overlaying FPU and SIMD registers?

  It's the same mistake that Intel did with MMX. Don't repeat it. :)
 

 
Why should this be a mistake - it's clever!

It limited data usage, since FPU registers have a fixed size.
A 64-bit SIMD unit is anachronistic right now.

I was not thinking of 64 bit.
From the programming view, FPU registers already behave as if they were 96 bits wide. But as mentioned before, extending the FPU to 128 bits would make good sense. 128 bits would also improve the performance of saving / restoring the FPU registers, as they could then be burst out more efficiently.


Matt Hey
USA

Posts 737
20 Sep 2010 07:33


Gunnar von Boehn wrote:

  I was not thinking of 64 bit.
  From the programming view, FPU registers already behave as if they were 96 bits wide. But as mentioned before, extending the FPU to 128 bits would make good sense. 128 bits would also improve the performance of saving / restoring the FPU registers, as they could then be burst out more efficiently.

What happened to the single-precision-only FPU idea for speed? You went from 32 bits to 128 bits now for the FPU registers? Realistically, at least 64 bits is needed for compatibility and the C double, and a few extra bits are needed beyond that for rounding. That's what the 68k FPU did, but 96 bits isn't so great for alignment. It would be nice to support the new quad precision (binary128) IEEE format even if the least precise bits were set to 0. The 15 bit biased exponent is already the same as for extended precision. It would make compiler support for the long double easier. Adding the half precision would reduce code size and be good for some gfx applications. The word format isn't a problem for CISC. Full support for IEEE binary formats is good for marketing as well.
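
For reference, the field widths of the formats mentioned here can be laid out side by side. The Python sketch below is only a lookup table with two helper functions (the names `FORMATS`, `significant_bits` and `exponent_bias` are made up for illustration). The 68k extended entry counts the 80 significant bits; in memory the 68k FPU pads this to 96 bits with 16 unused bits.

```python
# (sign bits, exponent bits, stored significand bits)
FORMATS = {
    "half (binary16)": (1, 5, 10),
    "single (binary32)": (1, 8, 23),
    "double (binary64)": (1, 11, 52),
    "68k extended": (1, 15, 64),  # integer bit is stored explicitly
    "quad (binary128)": (1, 15, 112),
}

def significant_bits(name):
    """Total of sign, exponent and significand bits for a format."""
    return sum(FORMATS[name])

def exponent_bias(name):
    """Bias of the exponent field: 2**(exponent_bits - 1) - 1."""
    exp_bits = FORMATS[name][1]
    return (1 << (exp_bits - 1)) - 1
```

Both extended and quad precision have a 15 bit exponent with bias 16383, which is the point made above: an extended value widens to quad by extending the significand, with the low bits set to 0.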

