
Welcome to the Natami / Amiga Forum

This forum is for AMIGA fans interested in the new NATAMI platform.
Please read the forum usage manual.



Do you have ideas and feature wishes? Post them here and discuss your ideas.

N68k Enhancements Revisited
Team Chaos Leader
USA
(Moderator)
Posts 2094
11 Apr 2011 08:51


S P wrote:

  This frees 2 registers

If your code needs more registers then we should be talking about adding more registers instead of smc. :)

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
11 Apr 2011 09:26


With SMC you effectively have infinite registers (limited only by the size of the instruction cache). Some inner loops can get more than 2 times faster on the N070.

By enabling SMC, the N070 can read 4 longwords from the cache in one cycle and write to two registers.

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
11 Apr 2011 10:57


With SMC the LEA instruction could have 3 registers as input

lea (a0,d0.l+d8.b),a2

a2=a0+d0.w+d8.b in a single cycle!

Angel of Paradise
Germany

Posts 61
11 Apr 2011 11:56


S P wrote:

  With SMC (2 cycles on N050 and one cycle on N070)
 
  move.l #xxx,d0  ;fused
  add.l (sp)+,d0 ;1 P1
  move.l #xxx,d1  ;fused
  add.l (sp)+,d1 ;1 P2
 
 
  I can create more examples..

This sounds like a good speed trick.
Can you show us a complete example?

Megol .

Posts 690
11 Apr 2011 13:53


Remember that SMC isn't free.
It isn't reentrant.
It requires an initialization loop.
It complicates cache data paths and requires more synchronization between instruction and data caches.
It greatly complicates the use of a pre-decoded instruction cache.


Phil "meynaf" G.
France
(Natami Team)
Posts 393
11 Apr 2011 14:51


I've had issues with SMC in the past, going from 68000 to 68020. I now just consider it a dirty and undesirable practice.

More data registers cannot IMO be encoded in a clean way. My idea was once about new hybrid data/address registers, but now I'm no longer sure of that. It may well not be worth the hassle.

I'm now in favor of new instructions using the high part of data regs in a better way. Say, something like DBFH which counts in the high word, and MOVE from/to register parts (directly to/from memory, which bit-field instructions cannot do).

But I do not want to bash SMC either.
So, SP, please provide a whole routine where it's a saviour. Perhaps it can be rewritten without SMC with little loss in speed, and if not, you'll have a nice point.


Thomas Richter
Germany
(MX-Board Owner)
Posts 1425
11 Apr 2011 19:32


Phil G. wrote:

  I've had issues with SMC in the past, going from 68000 to 68020. I now just consider it a dirty and undesirable practice.

Self-modifying code, especially on the 68Ks, should be avoided and banned. The 68Ks follow a Harvard-type approach (separate data/code caches) and hence require costly synchronization of caches to make this work reliably. Besides, there are only two reasons for SMC: number one is obfuscation (and I don't remember a legitimate reason for this), and the other is patching software on the fly for bugs or features. The latter doesn't need to be fast.

The argument that SMC makes code faster is certainly no longer true, and the best way to speed up an algorithm is usually either to redesign it by understanding the bottleneck or, if that doesn't help, to wait a year and get a new processor.

Designing a CPU such that it allows "easy SMC" is quite insane. The reverse should be the case: CPUs should make SMC hard, because SMC is typically an indication that something suspicious is going on, e.g. a virus in the system. Thus, write-protecting and checksumming code sections would be much more sensible.

Phil G. wrote:

  I'm now in favor of new instructions using the high part of data regs in a better way. Say, something like DBFH which counts in the high word, and MOVE from/to register parts (directly to/from memory, which bit-field cannot do).

Hmm. Applications? I have rarely had a case where a dbfh was necessary and a subq.l #1,dx wouldn't have done the job.
 
 
Phil G. wrote:

  But, i do not want to bash SMC either.

But I do. (-: It's a coding method from 30 years ago, and such times have luckily passed. SMC is hard to control, error prone, and - as said - typically an indication of modifications that are not in the interest of the user. Rarely, at least.

Greetings from Cambridge (this time, Cambridge UK, not Cambridge, MA).

Thomas



Team Chaos Leader
USA
(Moderator)
Posts 2094
11 Apr 2011 20:15


Thomas Richter wrote:

  Greetings from Cambridge (this time, Cambridge UK, not Cambridge, MA).

You never visit me in Houston, TX.  I feel so discriminated :D

I agree with you and Phil about smc.


Marcel Verdaasdonk
Netherlands

Posts 3991
11 Apr 2011 21:15


The reason the Amiga scene is still going is the performance gain from dirty coding methods.
Totally disallowing SMC would be counterproductive for this project,

since it would kill some old games and demos.

We are stuck with it, so why not make the best of it, people?

Thomas Richter
Germany
(MX-Board Owner)
Posts 1425
11 Apr 2011 22:31


Team Chaos Leader wrote:

Thomas Richter wrote:

  Greetings from Cambridge (this time, Cambridge UK, not Cambridge, MA).
 

  You never visit me in Houston, TX.  I feel so discriminated :D

Oh, I'm so, so sorry. I was in Houston on April 1st, but didn't have the time to send greetings. I was only passing through Houston "Intercontinental". Now guess my air carrier. (-;

Don't worry, I'll end up there sooner or later again.

Greetings,
Thomas



Rune Stensland
Norway
(MX-Board Owner)
Posts 871
11 Apr 2011 22:32


Phil G. wrote:

  So, SP, please provide a whole routine where it's a saviour. Perhaps it can be rewritten without SMC and little loss in speed, and if not, you'll have a nice point.

Here is a fresh example. The N050 has an 8-cycle latency for add and mul. To create a latency-free 4x4 matrix multiply (used in all 3D games) we can use SMC.
With SMC this method will run in 225 clocks.
Without SMC (datacache only) it's 249 clocks.
This is around 10% faster..

To get the SMC to work, the CPU needs to fix the cache issues.
fmove.s fp5,fp24+4(a3) ;190
...
.fp24 fmove.s #0,fp7 ;fused(216)

As you can see, we have 26 cycles between the move to the instruction cache and the read. In this time a parallel cache snooper can fix the cache...
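
A quick sanity check of these figures, as a small Python sketch. The only inputs are the numbers quoted in the post (225 and 249 clocks, and the cycle labels 190 and 216 from the listing); the variable names are just illustrative.

```python
# Cycle counts quoted in the post.
smc_clocks = 225
no_smc_clocks = 249

saved = no_smc_clocks - smc_clocks          # clocks saved by the SMC version
time_reduction = saved / no_smc_clocks      # fraction of run time saved
speedup = no_smc_clocks / smc_clocks        # throughput ratio

# Distance between the store into the instruction stream (cycle 190)
# and the fused instruction that consumes the patched immediate (cycle 216).
patch_gap = 216 - 190

print(f"saved {saved} clocks, {time_reduction:.1%} less time, {speedup:.3f}x speedup")
print(f"gap between write and use: {patch_gap} cycles")
```

So "around 10%" holds either way you measure it: about 9.6% less run time, or a speedup of about 1.107x.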



MULS_MATRIX_4x4_N050_ZERO_LATENCY
;a0 start of dest matrix
;a1 start of src matrix
;a2 start of src2 matrix
;a3 should be PC but the assembler doesn't support it yet.
  sub.w #16*4,sp

  fmove.s M11(a2),fp0 ;1
  fmul.s M11(a1),fp0 ;2
  fmove.s M21(a2),fp1 ;2
  fmul.s M12(a1),fp1 ;3
  fmove.s M31(a2),fp2 ;4
  fmul.s M13(a1),fp2 ;5
  fmove.s M41(a2),fp3 ;6
  fmul.s M14(a1),fp3 ;7
  fmove.s M12(a2),fp4 ;8
  fmul.s (a1)+,fp4 ;9
  fmove.s M22(a2),fp5 ;10
  fmul.s (a1)+,fp5 ;11
  fmove.s M32(a2),fp6 ;12
  fmul.s (a1)+,fp6 ;13
  fadd.x fp0,fp1  ;14
  fmove.s M42(a2),fp7 ;16
  fmul.s (a1)+,fp7 ;17
  fadd.x fp2,fp3  ;18

  fmove.s M13(a2),fp0 ;19
  fmul.s M11(a1),fp0 ;20
  fadd.x fp4,fp5  ;21
  fmove.s fp1,fp9+4(a3) ;22
  fmove.s M23(a2),fp1 ;23
  fmul.s M12(a1),fp1 ;24
  fmove.s M33(a2),fp2 ;25
  fmul.s M13(a1),fp2 ;26
  fadd.x fp6,fp7  ;27
  fmove.s fp3,(sp)+ ;28
  fmove.s M43(a2),fp3 ;29
  fmul.s M14(a1),fp3 ;30
  fmove.s M14(a2),fp4 ;31
  fmul.s (a1)+,fp4 ;32
  fmove.s fp5,fp10+4(a3) ;33
  fmove.s M24(a2),fp5 ;34
  fmul.s (a1)+,fp5 ;35
  fmove.s M34(a2),fp6 ;36
  fmul.s (a1)+,fp6 ;37
  fmove.s fp7,(sp)+ ;38
  fadd.x fp0,fp1  ;39
  fmove.s M44(a2),fp7 ;40
  fmul.s (a1)+,fp7 ;41
  fadd.x fp2,fp3  ;42

  fmove.s M11(a2),fp0 ;43
  fmul.s M11(a1),fp0 ;44
  fadd.x fp4,fp5  ;45
  fmove.s fp1,fp11+4(a3) ;46
  fmove.s M21(a2),fp1 ;47
  fmul.s M12(a1),fp1 ;48
  fmove.s M31(a2),fp2 ;49
  fmul.s M13(a1),fp2 ;50
  fadd.x fp6,fp7  ;51
  fmove.s fp3,(sp)+ ;52
  fmove.s M41(a2),fp3 ;53
  fmul.s M14(a1),fp3 ;54
  fmove.s M12(a2),fp4 ;55
  fmul.s (a1)+,fp4 ;56
  fmove.s fp5,fp12+4(a3) ;57
  fmove.s M22(a2),fp5 ;58
  fmul.s (a1)+,fp5 ;59
  fmove.s M32(a2),fp6 ;60
  fmul.s (a1)+,fp6 ;61
  fmove.s fp7,(sp)+ ;62
  fadd.x fp0,fp1  ;63
  fmove.s M42(a2),fp7 ;64
  fmul.s (a1)+,fp7 ;65
  fadd.x fp2,fp3  ;66

  fmove.s M13(a2),fp0 ;67
  fmul.s M11(a1),fp0 ;68
  fadd.x fp4,fp5  ;69
  fmove.s fp1,fp13+4(a3) ;70
  fmove.s M23(a2),fp1 ;71
  fmul.s M12(a1),fp1 ;72
  fmove.s M33(a2),fp2 ;73
  fmul.s M13(a1),fp2 ;74
  fadd.x fp6,fp7  ;75
  fmove.s fp3,(sp)+ ;76
  fmove.s M43(a2),fp3 ;77
  fmul.s M14(a1),fp3 ;78
  fmove.s M14(a2),fp4 ;79
  fmul.s (a1)+,fp4 ;80
  fmove.s fp5,fp14+4(a3) ;81
  fmove.s M24(a2),fp5 ;82
  fmul.s (a1)+,fp5 ;83
  fmove.s M34(a2),fp6 ;84
  fmul.s (a1)+,fp6 ;85
  fmove.s fp7,(sp)+ ;86
  fadd.x fp0,fp1  ;87
  fmove.s M44(a2),fp7 ;88
  fmul.s (a1)+,fp7 ;89
  fadd.x fp2,fp3  ;90

  fmove.s M11(a2),fp0 ;91
  fmul.s M11(a1),fp0 ;92
  fadd.x fp4,fp5  ;93
  fmove.s fp1,fp15+4(a3) ;94
  fmove.s M21(a2),fp1 ;95
  fmul.s M12(a1),fp1 ;96
  fmove.s M31(a2),fp2 ;97
  fmul.s M13(a1),fp2 ;98
  fadd.x fp6,fp7  ;99
  fmove.s fp3,(sp)+ ;100
  fmove.s M41(a2),fp3 ;101
  fmul.s M14(a1),fp3 ;102
  fmove.s M12(a2),fp4 ;103
  fmul.s (a1)+,fp4 ;104
  fmove.s fp5,fp16+4(a3) ;105
  fmove.s M22(a2),fp5 ;106
  fmul.s (a1)+,fp5 ;107
  fmove.s M32(a2),fp6 ;108
  fmul.s (a1)+,fp6 ;109
  fmove.s fp7,(sp)+ ;110
  fadd.x fp0,fp1  ;111
  fmove.s M42(a2),fp7 ;112
  fmul.s (a1)+,fp7 ;113
  fadd.x fp2,fp3  ;114
  fmove.s M13(a2),fp0 ;115
  fmul.s M11(a1),fp0 ;116
  fadd.x fp4,fp5  ;117
  fmove.s fp1,fp17+4(a3) ;118
  fmove.s M23(a2),fp1 ;119
  fmul.s M12(a1),fp1 ;120
  fmove.s M33(a2),fp2 ;121
  fmul.s M13(a1),fp2 ;122
  fadd.x fp6,fp7  ;123
  fmove.s fp3,(sp)+ ;124
  fmove.s M43(a2),fp3 ;125
  fmul.s M14(a1),fp3 ;126
  fmove.s M14(a2),fp4 ;127
  fmul.s (a1)+,fp4 ;128
  fmove.s fp5,fp18+4(a3) ;130
  fmove.s M24(a2),fp5 ;131
  fmul.s (a1)+,fp5 ;132
  fmove.s M34(a2),fp6 ;133
  fmul.s (a1)+,fp6 ;134
  fmove.s fp7,(sp)+ ;135
  fadd.x fp0,fp1  ;136
  fmove.s M44(a2),fp7 ;137
  fmul.s (a1)+,fp7 ;138
  fadd.x fp2,fp3  ;139

  fmove.s M11(a2),fp0 ;140
  fmul.s M11(a1),fp0 ;141
  fadd.x fp4,fp5  ;142
  fmove.s fp1,fp19+4(a3) ;143
  fmove.s M21(a2),fp1 ;144
  fmul.s M12(a1),fp1 ;145
  fmove.s M31(a2),fp2 ;146
  fmul.s M13(a1),fp2 ;147
  fadd.x fp6,fp7  ;148
  fmove.s fp3,(sp)+ ;149
  fmove.s M41(a2),fp3 ;150
  fmul.s M14(a1),fp3 ;151
  fmove.s M12(a2),fp4 ;152
  fmul.s (a1)+,fp4 ;153
  fmove.s fp5,fp20+4(a3) ;154
  fmove.s M22(a2),fp5 ;155
  fmul.s (a1)+,fp5 ;156
  fmove.s M32(a2),fp6 ;157
  fmul.s (a1)+,fp6 ;158
  fmove.s fp7,(sp)+ ;159
  fadd.x fp0,fp1  ;160
  fmove.s M42(a2),fp7 ;161
  fmul.s (a1)+,fp7 ;162
  fadd.x fp2,fp3  ;163
  fmove.s M13(a2),fp0 ;164
  fmul.s M11(a1),fp0 ;164
  fadd.x fp4,fp5  ;165
  fmove.s fp1,fp21+4(a3) ;166
  fmove.s M23(a2),fp1 ;167
  fmul.s M12(a1),fp1 ;168
  fmove.s M33(a2),fp2 ;169
  fmul.s M13(a1),fp2 ;170
  fadd.x fp6,fp7  ;171
  fmove.s fp3,(sp)+ ;172
  fmove.s M43(a2),fp3 ;173
  fmul.s M14(a1),fp3 ;174
  fmove.s M14(a2),fp4 ;175
  fmul.s (a1)+,fp4 ;175
  fmove.s fp5,fp22+4(a3) ;176
  fmove.s M24(a2),fp5 ;177
  fmul.s (a1)+,fp5 ;178
  fmove.s M34(a2),fp6 ;179
  fmul.s (a1)+,fp6 ;180
  fmove.s fp7,(sp)+ ;181
  fadd.x fp0,fp1  ;182
  fmove.s M44(a2),fp7 ;183
  fmul.s (a1)+,fp7 ;184
  fadd.x fp2,fp3  ;185

  fadd.x fp4,fp5  ;186
  fmove.s fp1,fp23+4(a3) ;187
  fadd.x fp6,fp7  ;188
  fmove.s fp3,(sp)+ ;189
  fmove.s fp5,fp24+4(a3) ;190
  fmove.s fp7,(sp)+ ;191
  sub.w #16*4,sp ;192

.fp9 fmove.s #0,fp0  ;fused
  fadd.s (sp)+,fp0 ;193
.fp10 fmove.s #0,fp1  ;fused
  fadd.s (sp)+,fp1 ;194
.fp11 fmove.s #0,fp2  ;fused
  fadd.s (sp)+,fp2 ;195
.fp12 fmove.s #0,fp3  ;fused
  fadd.s (sp)+,fp3 ;196
.fp13 fmove.s #0,fp4  ;fused
  fadd.s (sp)+,fp4 ;197
.fp14 fmove.s #0,fp5  ;fused
  fadd.s (sp)+,fp5 ;198
.fp15 fmove.s #0,fp6  ;fused
  fadd.s (sp)+,fp6 ;199
.fp16 fmove.s #0,fp7  ;fused
  fadd.s (sp)+,fp7 ;200

  fmove.s fp0,(a0)+ ;201
  fmove.s fp1,(a0)+ ;202
  fmove.s fp2,(a0)+ ;203
  fmove.s fp3,(a0)+ ;204
  fmove.s fp4,(a0)+ ;205
  fmove.s fp5,(a0)+ ;206
  fmove.s fp6,(a0)+ ;207
  fmove.s fp7,(a0)+ ;208

.fp17 fmove.s #0,fp0  ;fused
  fadd.s (sp)+,fp0 ;209
.fp18 fmove.s #0,fp1  ;fused
  fadd.s (sp)+,fp1 ;210
.fp19 fmove.s #0,fp2  ;fused
  fadd.s (sp)+,fp2 ;211
.fp20 fmove.s #0,fp3  ;fused
  fadd.s (sp)+,fp3 ;212
.fp21 fmove.s #0,fp4  ;fused
  fadd.s (sp)+,fp4 ;213
.fp22 fmove.s #0,fp5  ;fused
  fadd.s (sp)+,fp5 ;214
.fp23 fmove.s #0,fp6  ;fused
  fadd.s (sp)+,fp6 ;215
.fp24 fmove.s #0,fp7  ;fused
  fadd.s (sp)+,fp7 ;216

  fmove.s fp0,(a0)+ ;217
  fmove.s fp1,(a0)+ ;218
  fmove.s fp2,(a0)+ ;219
  fmove.s fp3,(a0)+ ;220
  fmove.s fp4,(a0)+ ;221
  fmove.s fp5,(a0)+ ;222
  fmove.s fp6,(a0)+ ;223
  fmove.s fp7,(a0)+ ;224
  rts  ;225
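
For reference, here is what a plain 4x4 matrix product has to compute, as a minimal Python sketch that also counts the floating-point operations. The function name and the nested-list layout are assumptions made for illustration; they are not taken from the listing. The counts it reports (64 multiplies, 48 adds) are the same totals as the fmul and fadd instructions in the listing, so the scheduling trick changes when the work happens, not how much of it there is.

```python
def matmul4(a, b):
    """Plain 4x4 product c = a * b on nested lists, counting multiplies and adds."""
    muls = adds = 0
    c = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            acc = a[i][0] * b[0][j]          # first product starts the sum
            muls += 1
            for k in range(1, 4):
                acc += a[i][k] * b[k][j]     # one multiply and one add per term
                muls += 1
                adds += 1
            c[i][j] = acc
    return c, muls, adds


identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
m = [[float(4 * r + c) for c in range(4)] for r in range(4)]
product, muls, adds = matmul4(identity, m)
print(muls, adds)
```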




Thomas Richter
Germany
(MX-Board Owner)
Posts 1425
11 Apr 2011 22:33


Marcel Verdaasdonk wrote:

The reason the amiga scene is still going is because of the performance gain in dirty coding methods.
  Totally disallowing SMC would be counter productive for this project.
 
  since it would kill some old games, and demo's.
 
  we are stuck with it so why not make the best of it people?

That is the question to begin with. One could also argue that such dirty coding principles caused the death of the platform, because software wasn't as portable as it should have been, which caused a lot of problems when upgrading the OS to open new perspectives.

Anyhow, enough of that.

So long,
Thomas


Rune Stensland
Norway
(MX-Board Owner)
Posts 871
11 Apr 2011 23:12


Thomas Richter wrote:

  That is the question to begin with. One could also argue that such dirty coding principles caused the death of the platform because software wasn't as portable as it should, and thus caused a lot of problems when upgrading the Os to open new perspectives.
 

 
If support is implemented in the N050, the Natami will be more compatible, with more possibilities.
I just showed you a loop that will execute 10% faster. There are many other cases.

My Mandelbrot zoomer can be 100% faster with SMC.

But before I tune it, I need to know how many cycles of latency I have, or whether it's even possible to implement this for free in hardware. In my example code there are many cycles where the read/write ports of the cache are free. With a smarter cache controller which fetches and stores on every available cycle it should be possible to correct SMC in 26 cycles.



Gunnar von Boehn
Germany
(Moderator)
Posts 5775
12 Apr 2011 01:59


S P wrote:

  With SMC this method will run in 225 clocks.
  Without SMC (datacache only) it's 249 clocks.
  This is around 10% faster..
 
  As you can see, we have 26 cycles between the move to the instruction cache and the read. In this time a parallel cache snooper can fix the cache...
 

 
 
I'm very sorry, but actually the SMC will make your code _slower_ and not faster.

The reason is that a cache snooper works a little differently than you expect here. The cache snooper will purge the affected cache regions; it will not merge your updates by itself.

This means your SMC will make you lose the content of ICache lines.

The cache misses that you create with your SMC code will in the end make the code slightly slower than not using SMC.

The great thing about a cache snooper is that it ensures that the code runs error free. This means a cache snooper allows "accidental" SMC (like loading of code blocks) to work fine even if the coder forgot to purge the cache.
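
The purge-instead-of-merge behaviour can be illustrated with a toy model. This is only a sketch under stated assumptions: the 16-byte line size, the `ToyICache` class and the `run_loop` helper are all made up for the example and say nothing about the real N050 cache. It simply counts how many line refills a loop suffers when it patches two of its own instructions on every pass.

```python
LINE_SIZE = 16  # bytes per cache line (assumed for this sketch)


class ToyICache:
    """Instruction cache whose snooper invalidates lines on data writes."""

    def __init__(self):
        self.valid_lines = set()
        self.misses = 0

    def fetch(self, addr):
        line = addr // LINE_SIZE
        if line not in self.valid_lines:
            self.misses += 1              # line must be refilled from memory
            self.valid_lines.add(line)

    def snoop_write(self, addr):
        # The snooper purges the affected line; it does not merge the new data.
        self.valid_lines.discard(addr // LINE_SIZE)


def run_loop(cache, code_addrs, patched_addrs, iterations):
    for _ in range(iterations):
        for addr in patched_addrs:
            cache.snoop_write(addr)       # SMC store into the code
        for addr in code_addrs:
            cache.fetch(addr)
    return cache.misses


code = list(range(0, 64, 2))              # 64 bytes of code = 4 cache lines
plain = run_loop(ToyICache(), code, [], iterations=10)
smc = run_loop(ToyICache(), code, [20, 52], iterations=10)
print(plain, smc)
```

Without patching, the loop misses only on the first pass (4 refills). With two patched lines it pays 2 extra refills on every later pass (22 in total over 10 passes).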
 

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
12 Apr 2011 02:05


Team Chaos Leader wrote:

 
S P wrote:

    This frees 2 registers
 

  If your code needs more registers then we should be talking about adding more registers instead of smc. :)
 

 
I agree with TCL.
 
Maybe more registers would be better for your code?
SP, what do you think about this?
SP, how much better would the matrix mul work with 16 FPU registers?

Also, I suppose adding compiler support for 8 more FPU registers will be much easier than adding SMC support to a C compiler. What do you think?

Marcel Verdaasdonk
Netherlands

Posts 3991
12 Apr 2011 06:29


If I understood SP correctly, adding registers would be a temporary solution to the problem; he's trying to do as much as he can in as few instructions as possible.

Rune Stensland
Norway
(MX-Board Owner)
Posts 871
12 Apr 2011 08:39


More registers would solve my problem... but more registers cost more.
I can get the same speed with SMC.

I am interested in finding the maximum speed with the current limitations. I understand that a data write into the instruction cache will flush this cache line immediately. But why can't this line be queued, so that when the instruction cache fetcher is idle it will retrieve my fresh line?

In CPU code:
If the INSTCACHE read port is IDLE, fetch from the QUEUE.
In my example code I had 26 cycles between the write and the read. It should be enough...

I found a way to save more cycles.

The trick is to move the matrix inside the instruction cache while the FPU is busy calculating (latency). For every move done I save a cycle later with fusion...

Then I don't need to store the temp variables on the stack. This will probably save 64-128 cycles...

I will finish a version tonight.

Take a look at my idea:
   
   

   

   
      fmove.s (a1)+,fp0
      fmove.s (a1)+,fp1
      fmove.s (a1)+,fp2
      fmove.s (a1)+,fp3
   
      fmove.s fp0,fp4    ; fused
      fmul.s  M11(a2),fp4 ;1
      fmove.s fp1,fp5    ; fused
      fmul.s  M21(a2),fp5 ;2
      fmove.s fp2,fp6    ; fused
      fmul.s  M31(a2),fp6 ;3
      fmove.s fp3,fp7    ; fused
      fmul.s  M41(a2),fp7 ;4
   
      ; 4 cycle latency that can be used to fill the matrix in the code
      fadd.x  fp4,fp5    ;1
      fadd.x  fp6,fp7    ;2
      fmove.s fp0,fp4    ;fused
      fmul.s  M12(a2),fp4 ;3
      fmove.s fp3,fp6    ;fused
      fmul.s  M32(a2),fp6 ;4
      ; 4 cycle latency that can be used to fill the matrix in the code
   
      fadd.x  fp5,fp7
   
   
    ....
    The generated instructions will look like this:
   
        fmove.s #0,fp0 ;fused
        fmul.s M11(a1),fp0  ;1
        fmove.s #0,fp1 ;fused
        fmul.s M12(a1),fp1  ;2
        fmove.s #0,fp2 ;fused
        fmul.s M13(a1),fp2  ;3
        fmove.s #0,fp3 ;fused
        fmul.s M14(a1),fp3  ;4
        fmove.s #0,fp4 ;fused
        fmul.s M11(a1),fp4  ;5
        fmove.s #0,fp5 ;fused
        fmul.s M12(a1),fp5  ;6
        fmove.s #0,fp6 ;fused
        fmul.s M13(a1),fp6  ;7
        fmove.s #0,fp7 ;fused
        fmul.s M14(a1),fp7  ;8
    ...
   

   

   

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
12 Apr 2011 08:48


S P wrote:

  The generated instructions will look like this:
   
  fmove.s #0,fp0                16 Bytes
  fmul.s M11(a1),fp0            6  Bytes
 

 
Our ICache bandwidth is huge with 128 bits per clock.
But to fuse 24 bytes per clock we would need even more. :-D

Also, it would be very sensible to put a limit on the length
of the first fused instruction. The reason is that by supporting many different lengths our fusing mechanism becomes a lot more expensive.

We have 3 fuse options:
1) The fusing logic can fuse 2-byte prefix instructions.
2) Fusing which allows prefix sizes of 2, 4, or 6 bytes.
3) Any size.

The more complex our fusing option gets, the more costly it becomes.
I would seriously prefer to limit it to either 1) or 2) for cost savings.
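
A rough way to see the bandwidth side of this, as a Python sketch. It assumes (this is an assumption of the sketch, not a statement about the real fusing logic) that both instructions of a fused pair have to arrive within a single 128-bit fetch. The sizes are the ones given in this post: prefixes of 2, 4 or 6 bytes from options 1) and 2), a 6-byte fmul.s, and the 16-byte fmove.s prefix. The function name is made up.

```python
FETCH_BYTES = 128 // 8   # ICache bandwidth quoted in the post: 128 bits per clock


def fits_in_one_fetch(prefix_bytes, second_bytes):
    """True if a fused pair fits in a single fetch (assumed requirement)."""
    return prefix_bytes + second_bytes <= FETCH_BYTES


second = 6  # size of the fmul.s instruction as given in the post
short_prefixes_ok = all(fits_in_one_fetch(p, second) for p in (2, 4, 6))
long_prefix_ok = fits_in_one_fetch(16, second)
print(short_prefixes_ok, long_prefix_ok)
```

Under that assumption every prefix allowed by options 1) and 2) fits together with the 6-byte instruction, while the 16-byte prefix does not.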
 

BTW: Adding 8 more FPU registers costs less chip size
than making the fusing support ultra-long prefix instructions.
;-D

The 68K has the encoding room in the FPU instruction to add more registers. Encoding 16 registers should be no problem if we want.


Thomas Richter
Germany
(MX-Board Owner)
Posts 1425
12 Apr 2011 08:56


S P wrote:

  More registers would solve my problem... But I am interested in finding the maximum speed with the current limitations. I understand that a data write into the instruction cache will flush this cache line immediately. But why can't this line be queued, so that when the instruction cache fetcher is idle it will retrieve my fresh line?

Basically, you're asking why a coding technique that had its merits 30 years ago is no longer appropriate today. Well, the hardware has changed: CPUs became disproportionately faster than memory access, and thus caches were invented. So, I'm afraid, you're facing a situation where your knowledge and experience of the old hardware is simply obsolete, and you need to learn something new - the world has changed around you. Testing a 2010 CPU design through a 1980s "looking glass" is not quite appropriate. The CPU is designed for different coding principles, and one of the principles of the Harvard architecture is that the CPU *does not* modify its own code. This means that you're trying to work against the design, and hence face the challenge.

More registers do not necessarily make the code faster; besides, it is probably fast enough as it is. A more modern approach to speeding up such matrix multiplications is rather to use multiply-add instructions and SIMD extensions of the CPU. Thus, I do not suggest adding more registers at this time, but - should this be appropriate - adding the necessary instructions or parallelism.

Greetings,
Thomas


Gunnar von Boehn
Germany
(Moderator)
Posts 5775
12 Apr 2011 09:03


Thomas Richter wrote:

  A more modern approach of speeding up such matrix multiplications is rather to use multiply-add instructions 
 

 
But unfortunately this is not as easy as it looks.

A Multiply-Add instruction does more work internally than a MUL or an ADD.
This means the latency has to be higher.
And _latency_ is the problem here that SP tries to solve.
 
 
Why are you against more registers?
Do you see unsolvable problems with more registers that we do not see?

The advantage of more registers is that with them you could unroll the code and solve the latency issues nicely.
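
To make the unrolling argument concrete, here is a toy in-order pipeline model in Python. Everything in it is an assumption for illustration: one instruction issues per cycle, a result becomes available `latency` cycles after issue, and the helper names are invented. The latency of 8 is the N050 figure SP quoted earlier in the thread. The program is 16 accumulate steps, spread over 1, 2, 4 or 8 independent accumulator registers.

```python
def cycles(program, latency):
    """In-order, single-issue model: an instruction issues once all of its
    source registers are ready; its result is ready `latency` cycles later."""
    ready = {}      # register -> cycle at which its value is available
    clock = 0
    for dest, sources in program:
        issue = max([clock] + [ready.get(s, 0) for s in sources])
        ready[dest] = issue + latency
        clock = issue + 1
    return max([clock] + list(ready.values()))


def dependent_adds(n_regs, n_ops):
    """n_ops accumulate steps spread round-robin over n_regs accumulators."""
    return [(f"fp{i % n_regs}", [f"fp{i % n_regs}"]) for i in range(n_ops)]


for n_regs in (1, 2, 4, 8):
    print(n_regs, cycles(dependent_adds(n_regs, 16), latency=8))
```

In this model a single accumulator needs 128 cycles for the 16 steps, while 8 independent accumulators need 23, because each register's previous result is ready again by the time its turn comes round. Real code also needs registers for operands, so this only shows the trend.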
