Home   News   Concept   AMIGA-Compatible   Hardware   Forum   Questions+Answers   Pictures   Contact & Team

Welcome to the Natami / Amiga Forum

This forum is for AMIGA fans interested in the new NATAMI platform.
Please read the forum usage manual.



All TopicsNewsQAFeaturesTalkTEAMLogin to post    Create account
Do you have ideas and feature wishes? Post them here and discuss your ideas.

N68k Enhancements Revisitedpage  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 
Matt Hey
USA

Posts 727
31 Jan 2011 09:20


I'd like to revisit some of the N68k enhancements.
 
  1) I have a small update on using byte sized immediate extension words as sign extended longs. I have looked at a lot of disassembled Amiga code. 95% seems to be SAS/C or GCC compiled. GCC, especially 2.95.3, commonly leaves $ff in the upper byte of immediate byte (word sized) extensions. I have seen what might be the case of SAS/C doing the same with a cmp.b #$ff,d0. I haven't seen random trash in the upper byte, but there are a lot of old and odd compilers/assemblers out there. I think this is going to be an awesome enhancement even if only a sign extended byte is possible.
 
  2) I was kind of surprised that only Cesare mentioned he was in favor of base register update for full extension addressing modes when Gunnar mentioned it could be done for free. While up to an 8 bit shift would be nice, I have to say that I would use base register update much more often. Let me demonstrate. Code that fills or copies structures is fairly common. An example might be...
 
      move.l ($20,a0),d0
      move.l ($24,a0),(a1)+
      move.l ($28,a0),d1
 
  This is fast but wastes 2 words for $24 and $28 which are the next in series. The 68k can save the 2 words but is much slower as there is as an extra instruction and maybe a change use stall on some 68k processors...
 
      lea ($20,a0),a0
      move.l (a0)+,d0
      move.l (a0)+,(a1)+
      move.l (a0),d1
 
  With base register update we have fast and small...
 
      move.l ($20,a0)+,d0
      move.l (a0)+,(a1)+
      move.l (a0),d1
 
  I have some series moves that are much longer than this simple example. Also, indexed modes can be used and look to me like they would be more valuable because they can be used once and then post-incremented or pre-decremented after. There is the potential for some nice code reduction here.
 
  3) Last but far from least, I would like to revisit branches. Although my knowledge of predication and fetch decode pipelines is rather basic, it looks to me like predication is less powerful than a branch cache by itself. Most branches are consistently taken or not taken which the branch cache handles well. Predication is good for "evil" inconsistent (unpredictable) branches which are less common. There are a couple of other problems with predication as well. Many chances for predication are skipped because the most common branch route is not taken. Example...
 
    moveq #7,d0
    cmp.l d1,d0
    bcs .skip
.back:
    ...
    rts
 
.skip:
    move.l d0,d1
    bra .back
 
  We expect the value in d1 to be less than 7 most of the time and since branches forward are predicted not taken, we have to take this code out of series. It's easy to say that it's better to have this code all in one series on the N68k like this...
 
    moveq #7,d0
    cmp.l d1,d0
    bhi .skip
    move.l d0,d1
.skip:
    ...
    rts
 
  Having all the code in series is better in this case with predication but what if instead of move.l we had a div, movem or 4 consecutive instructions that we don't know if they will be predicated or not. They may be predicated on one future N68k processor and not on another. Another branch in the predicated part of the code will likely cause a stall too like jsr (-$36,a6). Note that putting the code in series did save a bra instruction, would likely load less instruction cache lines, is easier to read, and is easier to use the code as inlines (whether assembler or the compiler inlining).
 
  I have an idea that may solve these problems and hopefully offers improved branch performance while being more future proof. First, I would use the odd bit in branches for a prediction bit. The example branch above would get the bit set. The code would always respect the bit the first time. In the example above, there would be no predication until the branch prediction fails. If a branch fails, a counter is started keeping track of how many cycles the predicated code uses. The final cycle count would be put in a branch cache with numbers too great for predication clipped. A history bit for the branch would also be stored as a typical branch cache would do. A 1 cycle predicated branch might always be predicated, a 2 cycle predicated branch might be predicated if the branch is taken less than 90% of the time and a 3 cycle predicated branch might be predicated if the branch is taken less than 70% of the time for example.
 
  On a normal branch forward (predicted not taken), predication would be used the first time as there is no disadvantage. The branch history and cycles would be stored in the branch cache as above. If the branch is taken more than 90% of the time for a 2 cycle predication then the predication is skipped. A 3 cycle predication might be skipped at 70% similar to above.
 
  I believe this branch system would make larger predications practical while getting rid of the slow disadvantage of using predication when the branch cache prediction of branch taken is better. Storing the predication cycle count in the branch cache would only require 2 or 3 bits. Giving up the longer displacement branches of bcc2 may save more code but this use saves some code too (remember the bra .back when not inlined) while I believe offering a significant speed advantage. Large inlines make more optimizing possible and easier giving even more speed. This idea makes the programmers life easier if it's possible from the hardware side. I think a branch cache is a better use of fpga space than enhanced sprites. I bet this topic doesn't generate 15 pages though ;).
 
  The real world example below is branch optimized by me for a 68040/68060...
 
W3D_UpdateTexImage: 
    movem.l  d7/a4-a6,-(sp)
    move.l  a0,a4 ;a4=->Context
    move.l  a1,a5 ;a5=->Texture
    move.l  d1,d7 ;d7=level
    btst    #3,($1f,a0) ;Context->state&W3D_INDIRECT
    bne      .indirect ;if indirect
.direct: 
    tst.l    a3
    bne      .setpalette
  .paletteisset: 
    move.l  ($1a,a5),d0 ;Texture->mipmapmask
    lsl.l    #1,d0
    btst    d7,d0
    bne      .freeautomipmap
.newteximage: 
    move.l  a2,($1e,a5,d7.l*4) ;Texture->texsource
    move.l  d7,d1
    addq.l  #1,d1
    btst    d1,d0
    bne      .updateautomipmaps
    move.l  ($c0,a4),a6 ;a6=Context->DriverBase
    move.l  a5,a1 ;a1=->Texture
    move.l  a4,a0 ;a0=->Context
    subq.l  #1,d1 ;d1=last level
    move.l  d7,d0 ;d0=start level
    jsr      (-$ea,a6) ;UpdateTexture()
.notvisible: 
    movem.l  (sp)+,d7/a4-a6
    rts
 
.indirect:
    move.l  ($a4,a0),a1 ;Context->queue
    move.l  (a1),d0
    cmp.l    (8,a1),d0
    beq      .direct ;if QEMPTY
    bsr      W3D_Flush
    tst.l    d0
    beq      .direct
    movem.l  (sp)+,d7/a4-a6
    rts
 
.setpalette: 
    move.l  a3,($66,a1) ;Texture->palette
    bra      .paletteisset
   
.freeautomipmap: 
    move.l  d0,-(sp)
    move.l  ExecBase,a6
    move.l  ($1e,a5,d7.l*4),a1 ;Texture->texsource
    jsr      (-$2b2,a6) ;FreeVec()
    move.l  (sp)+,d0
    move.l  d0,d1 ;mipmapmasktexfmtsrc
    move.l  d1,d3
    move.l  a2,d4
    move.l  d0,d5 ;mipmapmasktexheight
    lsr.l    d3,d0 ;src width
    lsr.l    d3,d1 ;src height
    ; d2=Texture->texfmtsrc
    move.l  a1,d4
    bsr      CreateMipmap
.checkautomipmap: 
    addq.l  #1,d3
    btst    d3,d5
    bne      .updateautomipmap
    subq.l  #1,d3
    move.l  ($c0,a4),a6 ;a6=Context->DriverBase
    move.l  a5,a1 ;a1=->Texture
    move.l  a4,a0 ;a0=->Context
    move.l  d3,d1 ;d1=last level
    move.l  d7,d0 ;d0=start level
    jsr      (-$ea,a6) ;UpdateTexture()
    movem.l  (sp)+,d2-d5/d7/a4-a6
    rts
   
  Now with hint bit branches and always putting code in series we have...
   
W3D_UpdateTexImage: 
    movem.l  d7/a4-a6,-(sp)
    move.l  a0,a4 ;a4=->Context
    move.l  a1,a5 ;a5=->Texture
    move.l  d1,d7 ;d7=level
    btst    #3,($1f,a0) ;Context->state&W3D_INDIRECT
    beq2    .direct ;if direct
    move.l  ($a4,a0),a1 ;Context->queue
    move.l  (a1),d0
    cmp.l    (8,a1),d0
    beq      .direct ;if QEMPTY
    bsr      W3D_Flush
    tst.l    d0
    bne      .notvisible
.direct: 
    tst.l    a3
    beq2    .paletteisset
.setpalette: 
    move.l  a3,($66,a1) ;Texture->palette
.paletteisset: 
    move.l  ($1a,a5),d0 ;Texture->mipmapmask
    lsl.l    #1,d0
    btst    d7,d0
    beq2    .newteximage
.freeautomipmap: 
    move.l  d0,-(sp)
    move.l  ExecBase,a6
    move.l  ($1e,a5,d7.l*4),a1 ;Texture->texsource
    jsr      (-$2b2,a6) ;FreeVec()
    move.l  (sp)+,d0
    move.l  d0,d1 ;mipmapmasktexsource
    move.l  d7,d1
    addq.l  #1,d1
    btst    d1,d0
    beq2    .updatetexture
.updateautomipmaps: 
    movem.l  d2-d5,-(sp)
    move.l  ($62,a5),d2 ;Texture->texfmtsrc
    move.l  d1,d3
    move.l  a2,d4
    move.l  d0,d5 ;mipmapmasktexheight
    lsr.l    d3,d0 ;src width
    lsr.l    d3,d1 ;src height
    ; d2=Texture->texfmtsrc
    move.l  a1,d4
    bsr      CreateMipmap
.checkautomipmap: 
    addq.l  #1,d3
    btst    d3,d5
    bne      .updateautomipmap
    move.l  d3,d1
    movem.l  (sp)+,d2-d5
.updatetexture: 
    move.l  ($c0,a4),a6 ;a6=Context->DriverBase
    move.l  a5,a1 ;a1=->Texture
    move.l  a4,a0 ;a0=->Context
    subq.l  #1,d1 ;d1=last level
    move.l  d7,d0 ;d0=start level
    jsr      (-$ea,a6) ;UpdateTexture()
.notvisible: 
    movem.l  (sp)+,d7/a4-a6
    rts
 
  So much more readable, easier to program, more efficient, and saves a lot of code. Take the rts off and you have an instant inline as well. All we need to do is make series coding always the fastest way. Can we have easy of use and speed?
 

Cesare Di Mauro
Italy

Posts 526
31 Jan 2011 13:20


I have no time now to study point 3; I'll do it ASAP.

I totally agree for point 2, of course.

For point 1, I think that it can be easy to find all available compilers and make some tests, but, again, it'll need some time to do all the checks.

Wojtek P
Poland

Posts 1597
31 Jan 2011 13:53


Matt Hey wrote:

I'd like to revisit some of the N68k enhancements.

1) maybe i'm stupid but i can't really undersstand it. can you give an example
2) it's simply great, and if only can be encoded - be single cycle.
68050 can do register write and memory write in the same time
3) about predicated branches i already told - the encoding IMHO should be - LSB==0 to behave as usual (this is what existing code have), LSB==1 bhave the reverse way.

For example: LSB==0 - assume forward branches as untaken, backward as taken, LSB==1 - do the reverse. and do everything else you've said.

This is great idea.

My idea 4) Instuction to "prepare condition" for conditional jump BEFORE jump.

like
.blah:
instruction1
instruction2
prepeq
instruction3
instruction4
bprep blah

where - prepeq - prepare condition for delayed jump - yes if equal.

bprep - jump if prepared condition is tet.

This way you CAN optimize code even in case of completely impredictable branches, as long as you can have more instructions
to execute after you know the condition.
prep instruction can be "0 cycle" IMHO.

This way you got what powerpc and ARM have - just different way.
on ARM you for example set what instructions update flags. if you
make one instruction that update flags, 10 that not and then conditional jump, you can have processor that will not have to refill pipeline on this conditional jump.
Higher end of ARM CPUs make use of that feature.
On powerPC it's similar IMHO, but i don't have enough knowledge.

on N68k you will sacrifice 2 extra bytes of code for this when it make sense, but it WILL give great adventage when jumps cannot be predicted.

NO BRANCH prediction algorithm can predict jumps that depends on input data - but this will allow decoder to know earlier that jump will/will not be taken

Matt Hey
USA

Posts 727
31 Jan 2011 16:33


@Cesare
I have tested point 1 with GCC, SASC, Dice, Aztec and several assemblers. Most just leave the upper byte of the extension word as 0 except for some versions of GCC which does $ff too.

@Wojtek
Instructions like cmp.b #$12,d0 is encoded as $xxxx 0012. The upper byte of the extension word is padding that can be encoded to sign extend the data to a long value altering the instruction to be cmp.l #$12,d0. This saves a word per immediate long instruction and also a register as a moveq #$12,d1 + cmp.l d1,d0 is not needed. I hope that explains it as I don't have much time at work right now.

Wojtek P
Poland

Posts 1597
31 Jan 2011 17:17


Matt Hey wrote:

 
  Instructions like cmp.b #$12,d0 is encoded as $xxxx 0012. The upper byte of the extension word is padding that can be encoded to sign extend the data to a long value altering the instruction to be cmp.l #$12,d0. This saves a word per immediate long instruction and also a register as a moveq #$12,d1 + cmp.l d1,d0 is not needed. I hope that explains it as I don't have much time at work right now.
 

  Great. Thanks for explanations.
  so:
 
  - on existing code high byte is encoded as 00 or ff. So we have 254 possible values to encode things. Or maybe 255 as GCC written software is rather new and sources are available or at least - people that wrote it are mostly alive and will be able to just rerun with fixed gcc.
 
  So my proposition of encoding 8bits giving more power than you proposed:
 
  high byte:
  bit 7 - operate on value(0) or register(1):
  bit 6 - a bit to put to bits 31-8 of value BEFORE doing following for constants, or for registers when ==1 - sign extend values cut from registers.
  bit 5,4 - select rotate left, rotate right, shift left, shift right.
  bit 3,2,1,0 - amount of shift/rotate halved
  lower byte: value if bit7==0, register in 4 high bits, lower 4 bits selects:
  0000 - bits 7-0 from register
  0001 - bits 8-15 from register
  0010 - bits 23-16 from register
  0011 - bits 31-24 from register
  0100 - bits 15-0 from register (2 bytes)
  0101 - bits 23-8 from register (2 bytes)
  0110 - bits 31-16 from register (2 bytes)
  0111 - bits 7-0 and 31-24 from register (2 bytes)
  1000 - bits 24-0 from register (3 bytes)
  1001 - bits 31-8 from register (3 bytes)
  1010 - bits 7-0 and 31-16 from register (3 bytes)
  1011 - bits 15-0 and 31-24 from register (3 bytes)
  1100 - all register as is (but still allows shifting)
  1101 - unused
  111X - unused               
 
  special case 00 means nothing and will signal to work as usual as cmp.b
  other will mean cmp.l withs "compressed" value. Same for other .b  instructions of course.
 
  Why so strange?
 
  Just look at bunch of existing programs to see than MOST long values fits that scheme! In the same time it allows cutting bytes and shifting from registers. Sometimes that trick means 1 instruction instead of 3-4!
 
  My idea is not the same but similar as ARM ISA use to encode one parameter values - this is probably most important thing that
  gives this not very complex RISC it's power. Anyway ARM can't cut selected bytes like this.
 
  For implementing it you will need "byte shifter and cutter" hardware, rest can be done by reusing shift/rotate and sign extend module.
 
  How about:
  - taking second and third byte from register
  - sign extending it
  - shifting left 4 bits
  - inverting it
 
  when 0101010111001100 is in second and third byte of register, we get
  000000000000101010100011001111111
 
  Usages:
 
  - creating masks for oring, anding, eoring, testing whatever
  - large values for adds,subs etc. that have most bits==1  or ==0
  - cutting bytes from registers - for example single color values
  - shifting/rotating, sign extending and inverting data in registers within one instruction.
 
  Such instruction can be single cycle, but probably not all with latency==1, some with ==2 as barrel shifter takes full cycle.
 

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
31 Jan 2011 17:22


I would use this mode as:
OPI.L #15bit(signexted),(ea)


Wojtek P
Poland

Posts 1597
31 Jan 2011 17:48


Gunnar von Boehn wrote:

  I would use this mode as:
    OPI.L #15bit(signexted),(ea)
   
 

  Exactly like that usage. 0x80 sign extended and shifted 8 bits left gives that 0xFFFF8000, or 0x80 shifted left 8 bits and inverted to get 0xFFFF7FFFF.
  Or whatever. Idea of shifting and cutting within instruction as ARM do is great IMHO and means both faster and denser code.
 
  Fixing gcc to not emit FF is trivial (actually it's gnu assembler not gcc), fixing eg. GNU assembler to support that is not difficult but AFTER people decide how syntax would have to like
 
  for constants i suggest
 
  #~-23 SL 6
 
  and assembler should automatically fit ANY constant if possible that way when .l is specified - automatically shortening instructions.

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
31 Jan 2011 19:56


Wojtek P wrote:

Gunnar von Boehn wrote:

    I would use this mode as:
    OPI.L #15bit(signexted),(ea)
   
   

    Exactly like that usage. 0x80 sign extended and shifted 8 bits left gives that 0xFFFF8000, or 0x80 shifted left 8 bits and inverted to get 0xFFFF7FFFF.

I said a sign extended 15bit value. Any 15bit value.


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
31 Jan 2011 19:58


Wojtek P wrote:

  0000 - bits 7-0 from register
  0001 - bits 8-15 from register
  0010 - bits 23-16 from register
  0011 - bits 31-24 from register
  0100 - bits 15-0 from register (2 bytes)
  0101 - bits 23-8 from register (2 bytes)
  0110 - bits 31-16 from register (2 bytes)
  0111 - bits 7-0 and 31-24 from register (2 bytes)
  1000 - bits 24-0 from register (3 bytes)
  1001 - bits 31-8 from register (3 bytes)
  1010 - bits 7-0 and 31-16 from register (3 bytes)
  1011 - bits 15-0 and 31-24 from register (3 bytes)
  1100 - all register as is (but still allows shifting)
  1101 - unused
  111X - unused               
 

Way to many options.
This will slow the chip down - therefore not a good idea.


Wojtek P
Poland

Posts 1597
31 Jan 2011 21:02


Gunnar von Boehn wrote:

Wojtek P wrote:

 
Gunnar von Boehn wrote:

    I would use this mode as:
      OPI.L #15bit(signexted),(ea)
     
   

    Exactly like that usage. 0x80 sign extended and shifted 8 bits left gives that 0xFFFF8000, or 0x80 shifted left 8 bits and inverted to get 0xFFFF7FFFF.
 

 
  I said a sign extended 15bit value. Any 15bit value.
 

With my coding it's not possible - any 15 bit value having any 8 bits in any position padded with 0/1 left - it is.

It may be possible - 1 bit to sign it's sign extended 15 bit value.
But no shifting/rotating.

Wojtek P
Poland

Posts 1597
31 Jan 2011 21:04


Gunnar von Boehn wrote:

Wojtek P wrote:

 
    0000 - bits 7-0 from register
    0001 - bits 8-15 from register
    0010 - bits 23-16 from register
    0011 - bits 31-24 from register
    0100 - bits 15-0 from register (2 bytes)
    0101 - bits 23-8 from register (2 bytes)
    0110 - bits 31-16 from register (2 bytes)
    0111 - bits 7-0 and 31-24 from register (2 bytes)
    1000 - bits 24-0 from register (3 bytes)
    1001 - bits 31-8 from register (3 bytes)
    1010 - bits 7-0 and 31-16 from register (3 bytes)
    1011 - bits 15-0 and 31-24 from register (3 bytes)
    1100 - all register as is (but still allows shifting)
    1101 - unused
    111X - unused               
   
 

 
  Way to many options.
  This will slow the chip down - therrefore not a good idea.
 

I'm not an expert in CPU design but .... why ARM CPUs are not slow?
isn't that operation 1 more cycle for preparing argument - making it 2 cycle instruction that replaces 3 or more instructions?
if result of that instruction is not needed by following one it can be pipelined hiding extra cycle.
Am i wrong?


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
31 Jan 2011 21:17


Matt Hey wrote:

  1) I have a small update on using byte sized immediate extension words as sign extended longs.
 

  I would interprete the high byte the following way
  Case A)
  #00xxxxxx  -- Byte Operation
  #11xxxxxx  -- Byte Operation
  #01xxxxxx  -- Long Operation #1+ the next 14 bits
  #10xxxxxx  -- Long Operation #0+ the next 14 bits
 
This encoding is backward compatible with all the two types of existing encoding 0x00 and 0xff. And it allows for 15bit immediate.
15bit immediate gives a real advantage over the 8bit that the moveq
  would offer us.
 
 
Matt Hey wrote:

    2)
        move.l ($20,a0),d0
        move.l ($24,a0),(a1)+
        move.l ($28,a0),d1
   
   
    With base register update we have fast and small...
   
        move.l ($20,a0)+,d0
        move.l (a0)+,(a1)+
        move.l (a0),d1
 

 
Tricky one.
How would you encode the first instruction?
Its more than (update An flag) as you combine two addres modes in addition to this.
 
 
 
Matt Hey wrote:

  3) Last but far from least, I would like to revisit branches. Although my knowledge of predication and fetch decode pipelines is rather basic, it looks to me like predication is less powerful than a branch cache by itself.
 

 
The current predication is only the start.
To be honest I'm not such a big fan of hint bits.
Old code will not benefit from them at all.
And compilers do not them as they could turn the code around to match the static predication to work perfectly.
And statistics show that programmer do not use them effectively.

There are reliable ways then a predication bits.

Our real goal is to implement a branch target cache.
A branch target cache does remember not only TAKEN/NOT TAKEN but it caches where branch did go to. This allows effective caching of even computed gotos. This makes AMIGA OS calls like this
  JSR -132(a6) take only 1 clock cycle. :-D
 

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
31 Jan 2011 21:22


Wojtek P wrote:

Am i wrong?

Yes, because of two reasons.

1) Your multicycle instructions are bad for going superscalar.
They will hurt our performance when we go super scalar.

2) An real chip (full custom desing) has different ways of creating logic. In an FPGA you ONLY have the logic cells and you need to create what you want with them. The result is that certain logic chains create huge multiplexers which will hurt the speed of the whole design. This is not what you want.

 



Matt Hey
USA

Posts 727
01 Feb 2011 01:00


Ok, I'm back from work now.

@Wojtek
The shifting and register use you suggest for the immediate value would not be so useful on the 68k. PPC does have instructions that load a word into the upper 1/2 of a register or load and shift instructions but it's necessary because the instructions must always be 32 bits and there is not room for a 32 bit immediate value. The CISC 68k can load a 32 bit immediate value so this is not necessary. It's really not that inefficient for the N68k to do an op.l #immediate,EA with it's large fetch either. It's nice to save a word though as most of the time the upper word is unused. Using moveq #immediate,d0 + op.l d0,EA works just as fast because of instruction fusion but uses a trash register and is limited to byte size immediate. For loading a value into the upper 1/2 of a register it may be possible to instruction fuse an op.size + swap. The now fast bitfield instructions provide some shift plus operation functionality as well. I prefer Gunnars encoding for 15 bit immediate long operations.

My branch idea does include the use of the least significant bit of bcc instruction's branch address as the hint just as you say. A LSB of 1 (odd address) would reverse the branch logic of the branch from the normal forward predicted not taken and backward predicted taken. This was originally suggested by ThoR but Gunnar really wants to use the bit to double the branch range. Your idea of prepcc + bprep is interesting. It looks like a delayed branch strategy which would probably work but it's a far departure from the 68k way. I think we want something that will work reasonably well with existing code, not introduce very many new instructions, and have flexibility. My idea does not solve the branch that depends on dynamic/computed data but it looks like Gunnar has a simple solution of storing the last branch destination in the branch cache making it faster if it is used again.


Matt Hey
USA

Posts 727
01 Feb 2011 03:27


Gunnar von Boehn wrote:

  I would interpret the high byte the following way
  Case A)
    #00xxxxxx  -- Byte Operation
    #11xxxxxx  -- Byte Operation
    #01xxxxxx  -- Long Operation #1+ the next 14 bits
    #10xxxxxx  -- Long Operation #0+ the next 14 bits
 
  This encoding is backward compatible with all the two types of existing encoding 0x00 and 0xff. And it allows for 15bit immediate.
  15bit immediate gives a real advantage over the 8bit that the moveq would offer us.

Perfect!

Gunnar von Boehn wrote:

 
Matt Hey wrote:

      With base register update we have fast and small...
     
        move.l ($20,a0)+,d0
        move.l (a0)+,(a1)+
        move.l (a0),d1
 

 
  Tricky one.
  How would you encode the first instruction?
  Its more than (update An flag) as you combine two address modes in addition to this.
 

I would use bit 3 of the full extension word format (bd,An,Xn.SIZE*SCALE) if possible. Setting bit 3 would cause the calculated EA of ($20,a0) to be put in (update) the base register a0. In the example above, the index register would be suppressed.

Gunnar von Boehn wrote:

  The current predication is only the start.
  To be honest I'm not such a big fan of hint bits.
  Old code will not benefit from them at all.

True, although it would be fairly easy to modify existing assembler (or disassembled) code to the correct branch logic. A profiler utility could also do the job. This would especially be good for 68020/68030 code (most Amiga programs) where the branch logic is reversed. Changing the hint bit is much easier than turning the branch around and doesn't require pulling the conditional code out of series or putting it in series to remain optimal. Putting the code in series for predication is always a good idea IMHO, but if it's faster to branch ahead and skip the predication and we know that's the common case, then programmers will not put the code in series. If we count on the branch cache history, we will have to take the slow route through the predication many times until the history says the branch is taken enough to skip the predication and just branch. Programmers will then not put the code in series as that is faster and there will be no possibility for predication of forward branches predicted as taken. Older code will not fully be optimal, but a consistent future API and future proof API will have been defined.

Gunnar von Boehn wrote:

  And compilers do not them as they could turn the code around to match the static predication to work perfectly.

There is a problem that compilers don't do good branch optimization without profiler data. It would be possible to define the first option of an if then or case as the most likely used. Then the compiler could optimize the branches accordingly. Assembler inlines would be easier to write optimally. There are 3 possibilities here for enhancing compiled code with a hint bit. If the branch reverse bit doesn't get set correctly, then we are usually not that bad off.

Gunnar von Boehn wrote:

  And statistics show that programmer do not use them effectively.

Many programmers don't use profilers or look at their compiled code either. The hint bit would allow fine tuning of branch optimizations for those that want it and encourage programmers to always put there code in series because that would always be the fastest way.

Gunnar von Boehn wrote:

  There are reliable ways then a predication bits.

I don't know if you are talking about the hint bit or the predication counter bits in the branch cache I suggested.

Gunnar von Boehn wrote:
 
  Our real goal is to implement a branch target cache.
  A branch target cache does remember not only TAKEN/NOT TAKEN but it caches where branch did go to. This allows effective caching of even computed gotos. This makes AMIGA OS calls like this
  JSR -132(a6) take only 1 clock cycle. :-D 

Good. The 68060 branch cache has a target address as well as history and it is very effective. The predicted branch target would work well for library calls that are consistently going to the same place but not the greatest for jump tables. There is nothing to lose though as taking a chance is better than stalling.


Wojtek P
Poland

Posts 1597
01 Feb 2011 07:24


Gunnar von Boehn wrote:

Wojtek P wrote:

  Am i wrong?
 

 
  Yes, because of two reasons.
 
  1) Your multicycle instructions are bad for going superscalar.
  They will hurt our performance when we go super scalar.
 
  2) An real chip (full custom desing) has different ways of creating logic. In an FPGA you ONLY have the logic cells and you need to create what you want with them. The result is that certain logic chains create huge multiplexers which will hurt the speed of the whole design. This is not what you want.

These explains why FPGA implementations of ARM processors in FPGA and not custom silicon are both damn slow (20-50MHz) and quite large.

Wojtek P
Poland

Posts 1597
01 Feb 2011 07:39


Matt Hey wrote:

  Ok, I'm back from work now.
 
  @Wojtek
  The shifting and register use you suggest for the immediate value would not be so useful on the 68k. PPC does have instructions that load a word into the upper 1/2 of a register or load and shift instructions but it's necessary because the instructions must always be 32 bits and there is not room for a 32 bit immediate value. The CISC 68k can load a 32 bit immediate value so this is not necessary. It's really not that inefficient for the N68k to do an op.l #immediate,EA with it's large fetch either. It's nice to save a word though as most of the time the upper word is unused. Using moveq #immediate,d0 + op.l d0,EA works just as fast because of instruction fusion but uses a trash register and is limited to byte size immediate. For loading a value into the upper 1/2 of a register it may be possible to instruction fuse an op.size + swap. The now fast bitfield instructions provide some shift plus operation functionality as well. I prefer Gunnars encoding for 15 bit immediate long operations.
 

  For immediates my idea only save space. As with quite fast memory and large caches code size are not that important you are right.
  What i wanted to include is the way to "compress" few instruction in one, as sequences of shifts/rotate+other operations are very common in code.
  But as gunnar said it is not possible to impleement it without slowing down all things - then let's forget it.
 
 
  My branch idea does include the use of the least significant bit of bcc instruction's branch address as the hint just as you say. A LSB of 1 (odd address) would reverse the branch logic of the branch from the normal forward predicted not taken and backward predicted taken. This was originally suggested by ThoR but Gunnar really wants to use the bit to double the branch range. Your idea of prepcc + bprep is interesting. It looks like a delayed branch strategy which would probably work but it's a far departure from the 68k way. I think we want something that will work reasonably
 

  The problem is that all CISC ISA including 68k assumes sequential execution. branch prediction is great but there are simply branches
  that can't be predicted as it's input depended.
  Common case are large switch/case in C code which are often used and execute different parts depending of input data.
  You may think about other way than my proposition to hint a branch logic beforehand. When writing programs there are often few instructions to execute between calculating condition and executing it.
 
  Another - more general way - would be "branch if condition after X instructions" where X is somehow like 1-s.
 
  anyway most important IMHO is solving efficient branches from tables. conditional or not, but depending of value.
  Every p-code interpreters for example do this constantly - get a byte from p-code and jump to table of 256 pt rocedure address of possible p-code actions.
  There are many more examples, all of them ignored by current CPU manufacturers that concentrate more of DSP-alike performance, not general purpose.
 
 

  well with existing code, not introduce very many new instructions, and have flexibility. My idea does not solve the branch that depends on dynamic/computed data but it looks like Gunnar has a simple solution of storing the last branch destination in the branch cache making it faster if it is used again.
 

  Which is good but gives no help for my examples.

Megol .

Posts 672
01 Feb 2011 12:57


Wojtek P wrote:

Gunnar von Boehn wrote:

 
Wojtek P wrote:

  Am i wrong?
 

 
  Yes, because of two reasons.
 
  1) Your multicycle instructions are bad for going superscalar.
  They will hurt our performance when we go super scalar.
 
  2) An real chip (full custom desing) has different ways of creating logic. In an FPGA you ONLY have the logic cells and you need to create what you want with them. The result is that certain logic chains create huge multiplexers which will hurt the speed of the whole design. This is not what you want.
 

  These explains why FPGA implementations of ARM processors in FPGA and not custom silicon are both damn slow (20-50MHz) and quite large.

ARM implemented in FPGAs can be small and fast-ish (~same as other soft cores). However in both ASICs and FPGAs one have to pay the penalty for the shifter.
ARM is a very nice design for a small powerful RISC, extending it isn't as nice though.

Matt Hey
USA

Posts 727
02 Feb 2011 07:23


Wojtek P wrote:

    The problem is that all CISC ISA including 68k assumes sequential execution. branch prediction is great but there are simply branches that can't be predicted as it's input depended.
    Common case are large switch/case in C code which are often used and execute different parts depending of input data.
 

 
  Good programmers and good compilers can greatly reduce the number of input dependent unpredictable branches to the point that they really aren't a big problem. Small case statements can be turned into if then else statements. Case statements with repeating values like this...
 
    move.l (jmp_table,pc,d0.l*4),a0
    jmp (a0)
 
  jmp_table:
    dc.l routine_1
    dc.l routine_1
    dc.l routine_2
    dc.l routine_3
    dc.l routine_2
    dc.l routine_1
    dc.l routine_3
 
  can sometimes do this (with branch hinting)...
 
    moveq #%1011100,d1
    btst d0,d1
    beq2 routine_1
    moveq #%1101011,d1
    btst d0,d1
    beq2 routine_2
  routine_3:
 
  or this without branch hint (hmm, I wonder which is more readable, faster to write and may be faster and shorter)...
 
    moveq #%1011100,d1
    btst d0,d1
    bne continue_1
  routine_1:
    ...
    rts or bra
  continue_1:
    moveq #%1101011,d1
    btst d0,d1
    bne routine_3
  routine_2:
    ...
    rts or bra
  routine_3:
 
  There can't be more than 32 case elements though. There are actually a lot of case statements that have cases that go to the same routine. Gunnar's strategy of using the last branch target may in some cases not be much worse than my optimization. If the last branch was to routine_1 given a random d0, there would be a 3 in 7 chance of jumping to the same target address. In the last library written in C that I optimized, I was able to do the above tricks, turn the jump table into a data table, or delete the jump table entirely in 15 of 15 cases so far. We need better compilers and programmers is all ;).
 
 
Wojtek P wrote:

    You may think about other way than my proposition to hint a branch logic beforehand. When writing programs there are often few instructions to execute between calculating condition and executing it.

 
  Yes, that's the problem with delayed branches. There has to be something to do before branching and there often is not. Good programming and compilers minimize the amount of these input dependent branches. The rest aren't necessarily that bad IF they do enough work. ADis, the disassembler I have worked on, uses a opcode table with pointers to functions to handle the different opcodes. This branch is very difficult to predict, will often be different each time, and has many choices but it does a lot of work too. I don't think there is a more efficient way to handle it. Gunnar's last branch target guess will be correct sometimes and help here also. Every little bit helps with branches being so important to performance.
 

Wojtek P
Poland

Posts 1597
02 Feb 2011 12:48


Megol . wrote:

ARM is a very nice design for a small powerful RISC, extending it isn't as nice though.

At least in custom IC - you are wrong. Look at high-end ARM implementations how fast they are and how much power they use.

posts 435page  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22