 |
Welcome to the Natami / Amiga Forum. This forum is for AMIGA fans interested in the new NATAMI platform.
Please read the forum usage manual.
|
Do you have ideas and feature wishes? Post them here and discuss your ideas. |
|
|---|
Matt Hey USA
| | Posts 733 14 Mar 2011 15:19
| Phil G. wrote:
|
Team Chaos Leader wrote:
| Jostein Aarbakk wrote:
| 1) EOR instruction - More addressing modes |
I need this too. |
So do I. Not related to speed, but to register use. Sometimes you just don't have a temporary. |
Are you guys using EOR that much? Personally, I don't use it much but I could see it used frequently for particular uses. Adding EOR.size EA,Dn would need to be a word encoding to be worthwhile and word encodings do take up a lot of encoding space.
| |
Team Chaos Leader USA
| | (Moderator) Posts 2094 14 Mar 2011 15:39
| Matt Hey wrote:
|
Phil G. wrote:
| Team Chaos Leader wrote:
| Jostein Aarbakk wrote:
| 1) EOR instruction - More addressing modes |
I need this too. |
So do I. Not related to speed, but to register use. Sometimes you just don't have a temporary. |
Are you guys using EOR that much? Personally, I don't use it much but I could see it used frequently for particular uses. Adding EOR.size EA,Dn would need to be a word encoding to be worthwhile and word encodings do take up a lot of encoding space.
|
I use it in the deep innerloop of my encryption and decryption routines. It needs to be as fast as possible.
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 14 Mar 2011 16:56
| Team Chaos Leader wrote:
| I use it in the deep innerloop of my encryption and decryption routines. It needs to be as fast as possible.
|
Would you mind posting the workloop for us as reference?
| |
Thomas Richter Germany
| | (MX-Board Owner) Posts 1425 14 Mar 2011 18:49
| Wojtek P wrote:
| | Can MOVEM be improved to read 4 registers or store 2 registers per cycle? That would make quite a big difference.
|
I agree, this would be a real improvement, given that most subroutines start with saving the register set on the stack. Greetings, Thomas
| |
Cesare Di Mauro Italy
| | Posts 526 14 Mar 2011 19:33
| Team Chaos Leader wrote:
|
Matt Hey wrote:
| | Are you guys using EOR that much? Personally, I don't use it much but I could see it used frequently for particular uses. Adding EOR.size EA,Dn would need to be a word encoding to be worthwhile and word encodings do take up a lot of encoding space. |
I use it in the deep innerloop of my encryption and decryption routines. It needs to be as fast as possible. |
You don't need an enhanced EOR for this, since this instruction is rarely used in common code. You need a SIMD unit with binary operations support, so your loop will be A LOT faster.
| |
Megol .
| | Posts 675 14 Mar 2011 19:40
| Cesare Di Mauro wrote:
|
Team Chaos Leader wrote:
| Matt Hey wrote:
| | Are you guys using EOR that much? Personally, I don't use it much but I could see it used frequently for particular uses. Adding EOR.size EA,Dn would need to be a word encoding to be worthwhile and word encodings do take up a lot of encoding space. |
I use it in the deep innerloop of my encryption and decryption routines. It needs to be as fast as possible. |
You don't need an enhanced EOR for this, since this instruction is rarely used in common code. You need a SIMD unit with binary operations support, so your loop will be A LOT faster.
|
Depends on whether he uses a real encryption algorithm or just some "xor with something" algorithm; the former would need SIMD support for either scatter/gather or register-wide lookups. SIMD acceleration for Rijndael (AKA AES) or IDEA would be nice... ;)
| |
Rune Stensland Norway
| | (MX-Board Owner) Posts 871 14 Mar 2011 22:04
| Shifts and swaps are commonly used when working with fixed point. I understand that the current fusion design will enable the following 4 instructions in 2 cycles (N050), and in 1 cycle (N070) with superscalar:
    move.l d0,d1
    swap d1          ; useful for 16:16 fixed point
    move.l d2,d3
    lsr.l #8,d3      ; useful for 8:8 fixed point
How about these:
    add.l d0,d1
    swap d1

    or.l d0,d1
    swap d1

    sub.l d0,d1
    swap d1

    muls.l d0,d1
    swap d1

    eor.l d0,d1
    swap d1

    neg.l d1
    swap d1
2 reads and 1 write, etc.
Here are two common merge passes in Chunky2Planar conversion routines. The merges can also be used when combining 2 buffers with shade tables.
input: d0=ABCD d2=abcd
output: d0=AaCc d2=BbDd
    move.l d2,d3     ; swap 8 (d2,d4)
    lsr.l #8,d3
    eor.l d4,d3
    and.l d7,d3      ; d7=0x00ff00ff
    eor.l d3,d4
    lsl.l #8,d3
    eor.l d3,d2
input: d0=AABB d2=aabb
output: d0=AAaa d2=BBbb
    move.l d2,d4     ; swap 16 (d0Xd2)
    move.w d0,d4
    swap d4
    move.w d4,d0
    move.w d2,d4
Most Amiga programs which contain chunky graphics will use a variation of these merges. So I suggest a new instruction:
    c2pmerge d0,d2,#n (n=1,2,4,8,16)
The new N050 can interpret 128 bits per clock. That means that it can look at 8 (16-bit) instructions and search for a pattern per clock. The first c2pmerge is 7 instructions (112 bits wide). The second c2pmerge is 5 instructions (80 bits wide). Can they be fused?
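For readers who do not follow the register shuffling, the two merge passes can be modelled outside 68k code. This is only a minimal Python sketch of the input/output behaviour stated in the post (d0=ABCD, d2=abcd and so on); the function names `merge8` and `merge16` are made up for illustration, and the sketch follows the input/output comments rather than the exact registers named in the listing.

```python
MASK32 = 0xFFFFFFFF  # keep values inside a 32-bit register


def merge8(d0, d2, mask=0x00FF00FF):
    """8-bit merge pass: d0=ABCD, d2=abcd -> d0=AaCc, d2=BbDd."""
    t = ((d2 >> 8) ^ d0) & mask   # differences between the bytes to exchange
    d0 ^= t                       # low byte of each word now comes from d2
    d2 ^= (t << 8) & MASK32       # high byte of each word now comes from d0
    return d0, d2


def merge16(d0, d2):
    """16-bit merge pass: d0=AABB, d2=aabb -> d0=AAaa, d2=BBbb."""
    new_d0 = (d0 & 0xFFFF0000) | (d2 >> 16)
    new_d2 = ((d0 & 0x0000FFFF) << 16) | (d2 & 0x0000FFFF)
    return new_d0, new_d2
```

With d0=$11223344 and d2=$AABBCCDD, `merge8` exchanges byte B with byte a and byte D with byte c, which is the XOR-swap trick the seven-instruction sequence implements.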
| |
Jostein Aarbakk Norway
| | Posts 7 14 Mar 2011 23:05
| Megol . wrote:
| EOR should be handled by fusion already IIRC?
|
Matt Hey wrote:
| The N68050 should be able to fuse the 2 instructions into 1 so they take just 1 cycle. ... The code will still be 2 words which might be slightly slower in rare cases but eor isn't that common and encoding space is valuable.
|
Fusion sounds like a good idea. If there isn't enough space for adding these new addressing modes to the EOR instructions, I agree that other optimizations might be more important. However, if enough space is available, and EOR gets new addressing modes, I could speed up my program even further, because the fusion could be made with (perhaps) the next instruction in my program.
Megol . wrote:
| Not overwriting registers is (again IIRC) handled by instruction fusion
|
Matt Hey wrote:
| We have been discussing how valuable and needed 3 op instructions are in the 68k. ...but it's not that much of a priority because code fusion is so good and speeds up existing 68k code.
|
Ok. To me it sounds like fusion is a good replacement for 3 OP instructions.
Megol . wrote:
| Lookup tables would be "free" in the N68 design as long as it hits cache. Complicating the cache access path for this special case would slow down general code execution.
|
Matt Hey wrote:
| There would be no locking method but if you are not using too much data then the data will stay in the cache.
|
Ok. However, in my case, there are 2 parts of the program that I want to make use of the cache, but 1 of them (the 1st) will probably "mess up" the cache for the other one:
1) The outer loop, which loops through each byte in a file (which can be large).
2) The inner loop. For each byte in the outer loop, it uses the byte to fetch a longword from the correct position in the lookup table.
I guess the problem with the automatic cache in this case is that the outer loop will fill the cache after a while, thus removing the "lookup table" from the cache. It would be great if the CPU could manage to cache the data read in the outer loop without removing the "lookup table" from the cache.
Matt Hey wrote:
| A series of TST.L (An) in a loop with An pointing to the start of each cache line (16 bytes) in the table will work to force data caching.
|
Good idea. I have to test this out.
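To get a feel for how many TST.L touches the quoted suggestion implies, here is a small Python sketch (not 68k code) that lists the byte offsets to touch. It assumes the table is 256 longwords (1 KB) and starts on a cache-line boundary; the 16-byte line size is the figure quoted above, and the names `TABLE_BYTES`, `CACHE_LINE_BYTES` and `touch_offsets` are made up for illustration.

```python
TABLE_BYTES = 256 * 4      # 256 longwords in the CRC lookup table
CACHE_LINE_BYTES = 16      # line size quoted in the thread (assumption for other CPUs)


def touch_offsets(table_bytes=TABLE_BYTES, line_bytes=CACHE_LINE_BYTES):
    """Offsets from the table start that a preload loop would TST.L,
    one per cache line, assuming the table is aligned to line_bytes."""
    return list(range(0, table_bytes, line_bytes))
```

With these assumptions the preload loop touches 64 addresses (offsets 0, 16, ..., 1008); with 32-byte lines it would be 32.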
| |
Jostein Aarbakk Norway
| | Posts 7 14 Mar 2011 23:16
| Gunnar von Boehn wrote:
| Cache are going to be a lot bigger than on classic systems. So far we tested cache ranging from 4 KB to 128 KB.
|
Sounds very good indeed.
Gunnar von Boehn wrote:
| One question: You said you precalculate values for the CRC. How does this algorithm work? What are you precalculating in detail, how many instructions would it take to calculate this on the fly?
|
Calculating on the fly is not an option, because it requires a series of bitwise shifts and (in the worst case) an EOR for each bit shifted. See the documented example code below (it uses a precalculated lookup table). Calculating "on the fly" can be done by something similar to the "Calc_CRC_longword" subroutine below.

Start:
    Bsr.s   Calc_CRC_table          ; Create the CRC lookup table

    ; Calculate CRC checksum of our string.
    Lea     TestString(PC),a0       ; String to check
    Lea     CRC_Table(PC),a1        ; CRC lookup table
    Moveq.l #-1,d0                  ; Initialize result value
    Moveq.l #0,d2
.ByteLP
    Move.b  (a0)+,d2
    Beq.s   .End
    Eor.b   d0,d2
    Lsr.l   #8,d0
    Move.l  (a1,d2.w*4),d3
    Eor.l   d3,d0
    Bra.s   .ByteLP
.End
    Not.l   d0
    Rts

; Precalculates 256 LONGWORDs (for values 0-255), and stores them in memory.
Calc_CRC_table:
    Move.l  #255,d6
    Moveq.l #0,d1
    Lea     CRC_Table(PC),a1
.Loop
    Movem.l d1/d6/a1,-(sp)
    Bsr.s   Calc_CRC_longword
    Movem.l (sp)+,d1/d6/a1
    Move.l  d0,(a1)+
    Addq.l  #1,d1
    Dbf     d6,.Loop
    Rts

; Precalculates 1 LONGWORD.
; For every 8 bits in the source BYTE, the loop does this:
; 1) Shifts source BYTE 1 bit right
; 2) Shifts result LONGWORD 1 bit right
; 3) If ONLY 1 of the 2 outshifted bits is 1,
;    EOR the result (d0) with the QUOTIENT.
;
; INPUT PARAMETER:
; d1=Source BYTE (the byte to precalculate a code for).
;    You can change this between 0 and 255.
; Test values (Source leads to this result in d0):
; 0 = $00000000
; 1 = $77073096
; 2 = $ee0e612c
; 3 = $990951ba
; 4 = $076dc419
;
Calc_CRC_longword:
    Move.l  #$EDB88320,d5           ; QUOTIENT
    Moveq.l #0,d0                   ; Result LONGWORD. Don't change this.
    Moveq   #7,d7
.Loop
    Lsr.b   #1,d1
    Scs.b   d2
    Lsr.l   #1,d0
    Scs.b   d3
    Eor.b   d2,d3
    Extb.l  d3
    And.l   d5,d3
    Eor.l   d3,d0
    Dbf     d7,.Loop
    Rts

CRC_Table:  ds.l 256
TestString: dc.b "AB",0             ; NULL-terminated string.
;Testresults: "A"=D3D99E8B, "AB"=30694C07
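As a cross-check of the listing above, the same algorithm can be modelled in a few lines of Python. This is only a sketch for comparison, not part of the original program: `calc_crc_longword` mirrors the bit loop of Calc_CRC_longword, `crc32` mirrors the .ByteLP loop, and `zlib.crc32` from the Python standard library is imported as a reference implementation of the same polynomial. Unlike the assembly, the sketch takes an explicit byte string instead of stopping at a NUL byte.

```python
import zlib  # reference CRC-32 implementation, used only for comparison

POLY = 0xEDB88320  # called QUOTIENT in the assembly listing


def calc_crc_longword(byte):
    """Mirror of Calc_CRC_longword: table entry for one source byte (0..255)."""
    result = 0
    for _ in range(8):
        byte_carry = byte & 1       # bit shifted out of the source byte
        byte >>= 1
        result_carry = result & 1   # bit shifted out of the result longword
        result >>= 1
        if byte_carry ^ result_carry:
            result ^= POLY
    return result


# Mirror of Calc_CRC_table: 256 precalculated longwords.
CRC_TABLE = [calc_crc_longword(b) for b in range(256)]


def crc32(data):
    """Mirror of the .ByteLP loop: one table lookup and one EOR per byte."""
    crc = 0xFFFFFFFF                       # Moveq.l #-1,d0
    for byte in data:
        index = (crc ^ byte) & 0xFF        # Eor.b d0,d2
        crc = (crc >> 8) ^ CRC_TABLE[index]
    return crc ^ 0xFFFFFFFF                # Not.l d0
```

Comparing `crc32(b"AB")` with `zlib.crc32(b"AB")` is a quick way to confirm that the table and loop agree with the standard CRC-32; the first five entries of `CRC_TABLE` correspond to the test values listed in the assembly comments.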
| |
Cesare Di Mauro Italy
| | Posts 526 15 Mar 2011 06:24
| Jostein Aarbakk wrote:
| Fusion sounds like a good idea. If there isn't enough space for adding these new addressing modes to the EOR-instructions, I agree that other optimizations might be more important. |
There's no space to add them with a 16-bit opcode. You need at least a 32-bit one, so opcode fusion is a better option.
Jostein Aarbakk wrote:
| Ok. However, in my case, there are 2 parts of the program that I want to make use of the cache, but 1 of them (the 1st) will probably "mess up" the cache for the other one: 1) The outer loop, which loops through each byte in a file (which can be large). 2) The inner loop. For each byte in the outer loop, it uses the byte to fetch a longword from the correct position in the lookup table. I guess the problem with the automatic cache in this case, is that the outer loop after a while will fill the cache, thus removing the "lookup table" from the cache. It would be great if the CPU could manage to cache the data read in the outer loop without removing the "lookup table" from the cache.
|
Don't worry: the cache is designed to fit your needs. The lookup table's cache lines will not be evicted, since the whole LUT needs only 1KB of space (consider, however, aligning it properly on a 16- or 32-byte boundary, depending on the CPU). If your outer loop works with a file, it will usually use a small buffer (128 bytes is a common value) to read (or write) data, so the buffer will fit entirely in a few cache lines. The same applies if the OS does file block caching (AmigaOS does not, if I remember correctly), since blocks are usually small (512 bytes for AmigaOS, 1KB for old Unixes, 4KB for modern OS filesystems).
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 15 Mar 2011 09:12
| Hi SP,
S P wrote:
| I understand that the current fusion design will enable the following 4 instructions in 2 cycles (N050) and 1 cycle (N070) with superscalar.
|     move.l d0,d1
|     swap d1          ; useful for 16:16 fixed point
|     move.l d2,d3
|     lsr.l #8,d3      ; useful for 8:8 fixed point
|
Yes, the above 2 examples can be fused. Fusing works under the following conditions. 1) 2 sources, 1 destination, 1 ALU operation. Technically this means you can fuse a move together with another instruction.
S P wrote:
| How about these:
|     add.l d0,d1
|     swap d1
|
2 ALU operations cannot be fused.
S P wrote:
|     or.l d0,d1
|     swap d1
|     sub.l d0,d1
|     swap d1
|     muls.l d0,d1
|     swap d1
|     eor.l d0,d1
|     swap d1
|     neg.l d1
|     swap d1
|
None of them can be fused. Fusing requires that the combined operation of BOTH instructions already exists as an ALU operation in the CPU.
S P wrote:
| Here are two common merge pass in Chunky2Planar conversion routines. The merges can also be used when combining 2 buffers with shadetables
| input: d0=ABCD d2=abcd
| output: d0=AaCc d2=BbDd
|     1 move.l d2,d3   ; swap 8 (d2,d4)
|     2 lsr.l #8,d3
|     3 eor.l d4,d3
|     4 and.l d7,d3    ; d7=0x00ff00ff
|     5 eor.l d3,d4
|     6 lsl.l #8,d3
|     7 eor.l d3,d2
|
Only 1 and 2 can be fused.
S P wrote:
| input: d0=AABB d2=aabb
| output: d0=AAaa d2=BBbb
|     1 move.l d2,d4   ; swap 16 (d0Xd2)
|     2 move.w d0,d4
|     3 swap d4
|     4 move.w d4,d0
|     5 move.w d2,d4
|
1 and 2 could maybe be fused. I'll have to double-check this one.
S P wrote:
| Most Amiga programs which contain chunky graphics will use a variation of these merges. So I suggest a new instruction c2pmerge d0,d2,#n (n=1,2,4,8,16). The new N050 can interpret 128 bits per clock. That means that it can look at 8 (16-bit) instructions and search for a pattern per clock.
|
It means that the CPU can execute UP TO 128 bits per clock.
S P wrote:
| The first c2pmerge is 7 instructions (112 bits wide). The second c2pmerge is 5 instructions (80 bits wide).
|
In these examples fusing will improve performance only a little. But luckily C2P routines are not needed at all any more on NATAMI, as you have native chunky support now. This means that by leaving these routines OUT, such programs will be accelerated a LOT on NATAMI. :-D
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 15 Mar 2011 13:12
| Hi, OK, let's look at your code and analyse it.
1) First, let's look at it optimistically. The MOVE and the EOR cannot be fused, as the combination does not use the same DST. But a superscalar CPU could forward the result to the second ALU. Here is the number of cycles the workloop will take:
                                     68050    68070
ByteLP: Move.b (a0)+,d2              1        1
        Beq.s  End                   2        1
        Eor.b  d0,d2                 3        2
        Lsr.l  #8,d0                 4        2
        Move.l (a1,d2.w*4),d3        5        3
        Eor.l  d3,d0                 6        3       *Forwarding
        Bra.s  ByteLP                7        3
By slightly changing the code you can remove one instruction and get down from 7 to 6 instructions for the inner loop:
                                     68050    68070
        bra.b  Start
ByteLP: Eor.b  d0,d2                 1        1
        Lsr.l  #8,d0                 2        1
        Move.l (a1,d2.w*4),d3        3        2
        Eor.l  d3,d0                 4        2       *Forwarding
Start:  Move.b (a0)+,d2              5        3
        Bne.s  ByteLP                6        3
In this special case having an EOR ea,Dn would be beneficial, as you could then write it like this:
                                     68050    68070
        bra.b  Start
ByteLP: Eor.b  d0,d2                 1        1
        Lsr.l  #8,d0                 2        1
        EOR.l  (a1,d2.w*4),d0        3        2
Start:  Move.b (a0)+,d2              4        2
        Bne.s  ByteLP                5        2
This would get the workloop down from 3 cycles to 2 cycles for the 070. This would be rather swift then.
2) Now let's look at the code realistically:
                                     68050    68070
        bra.b  Start
ByteLP: Eor.b  d0,d2                 1        1
        Lsr.l  #8,d0                 2        1
        -- ALU to EA flow usage penalty of 1/2 cycles!!
        Move.l (a1,d2.w*4),d3        3,4      2,3,4
        Eor.l  d3,d0                 5        2,3,4   *Forwarding
Start:  Move.b (a0)+,d2              6        5
        Bne.s  ByteLP                7        5
The table lookup using an index will lose performance on the 68050, 68060 and 68070. You could try to rework the code to get the D2 update and the D2 usage further apart from each other.
| |
Matt Hey USA
| | Posts 733 15 Mar 2011 14:36
| @Gunnar Don't forget we have ColdFire MVZ and MVS now!
ByteLP:
    Mvz.b (a0)+,d2
    Beq.s End
    Eor.l d0,d2
    Lsr.l #8,d0
    Move.l (a1,d2.l*4),d3
    Eor.l d3,d0
    Bra.s ByteLP
It looks like an immediate can be extended for free in a code fusion, but can a register be extended for free also?
    moveq #7,d0      ; free extb.l
    eor.l d0,d1
So is this possible too...
    mvz.b (a0)+,d0   ; is this zero extension free also?
    eor.l d0,d1
And can something like this be fused even though the first move is not long and there are 3 instructions...
    move.b (a0)+,d0
    extb.l d0
    eor.l d0,d1
It also looks like code fusion and result forwarding is still possible for multiple sizes if the first instruction is a long. Is this so?
    move.l (a0),d0
    and.w #$f0f0,d0  ; upper 16 bits do not need to be reread
The source is long, data is ANDed to the long source and the output is long. Even this would be great. I understand that starting with a word or byte won't work, as you explained. We can use MVS and MVZ to start with a long if we can just do word and byte operations after. Is code fusion possible for this...
    mvz.b (a0)+,d0
    swap d0
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 15 Mar 2011 15:53
| Matt Hey wrote:
| @Gunnar Don't forget we have ColdFire MVZ and MVS now!
| ByteLP:
|     Mvz.b (a0)+,d2
|     Beq.s End
|     Eor.l d0,d2
|     Lsr.l #8,d0
|     Move.l (a1,d2.l*4),d3
|     Eor.l d3,d0
|     Bra.s ByteLP
|
Where would be the benefit of this change?
Matt Hey wrote:
| It looks like an immediate can be extended for free in a code fusion but can a register be extended for free also?
|     moveq #7,d0      ; free extb.l
|     eor.l d0,d1
|
Sorry this does not work. The requirement for fusing is that the destinations of BOTH instructions are the same. This means only one destination has to be updated by both instructions. Your example will update two registers - Therefore it can't be fused. Cheers
| |
Megol .
| | Posts 675 15 Mar 2011 16:10
| But MVZ isn't useful for this routine, right? Here is my simple attempt to optimize it a bit by making the move and eor potentially fusable:
    move.b (a0)+,d2
    beq.s .end
.nextbyte
    eor.b d0,d2
    lsr.l #8,d0
    move.l (a1,d2.w*4),d3
    eor.l d0,d3          ; d3 replaces d0
    move.b (a0)+,d2
    beq.s .end2
    eor.b d3,d2
    lsr.l #8,d3
    move.l (a1,d2.w*4),d0
    eor.l d3,d0          ; d0 is now the crc again
    move.b (a0)+,d2
    bne.s .nextbyte
.end
    move.l d0,d3
.end2                    ; here d3 is the crc value...
| |
Matt Hey USA
| | Posts 733 15 Mar 2011 17:23
| Gunnar von Boehn wrote:
| Matt Hey wrote:
| @Gunnar Don't forget we have ColdFire MVZ and MVS now!
| ByteLP:
|     Mvz.b (a0)+,d2
|     Beq.s End
|     Eor.l d0,d2
|     Lsr.l #8,d0
|     Move.l (a1,d2.l*4),d3
|     Eor.l d3,d0
|     Bra.s ByteLP
| Where would be the benefit of this change? |
At least on the 68060, move.l (a1,d2.l*4),d3 instead of move.l (a1,d2.w*4),d3 would save a cycle. I believe there is a penalty here also on the 68020/68030 and maybe the 68040. Unfortunately, the 68060 doesn't have the mvz.b instruction, but it would only have saved an instruction before the loop, and it's better to use long instructions if possible ;). For the 68020+, a moveq #0,d2 before the loop and using move.l (a1,d2.l*4) should save at least a cycle per iteration of the loop after the first...
    Lea     TestString(PC),a0       ; String to check
    Lea     CRC_Table(PC),a1        ; CRC lookup table
    Moveq.l #0,d2
    Moveq.l #-1,d0                  ; Initialize result value
.ByteLP
    Move.b  (a0)+,d2
    Beq.s   .End
    Eor.b   d0,d2
    Lsr.l   #8,d0
    Move.l  (a1,d2.l*4),d3          ; this is better than (a1,d2.w*4)
    Eor.l   d3,d0
    Bra.s   .ByteLP
.End
    Not.l   d0
    Rts
Will the N68k be able to sign extend and do all shift sizes in the EA without penalty?
Gunnar von Boehn wrote:
| Matt Hey wrote:
| It looks like an immediate can be extended for free in a code fusion but can a register be extended for free also?
|     moveq #7,d0      ; free extb.l
|     eor.l d0,d1
Sorry this does not work. The requirement for fusing is that the destinations of BOTH instructions are the same. This means only one destination has to be updated by both instructions. Your example will update two registers - Therefore it can't be fused. |
Doh! I should be more careful. I should have put...
    moveq #7,d0      ; free extb.l
    eor.l d1,d0
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 15 Mar 2011 17:43
| Matt Hey wrote:
| Will the N68k be able to sign extend and do all shift sizes in the EA without penalty?
|
Yes - the 68050 can already do EVERY EA for free (zero extra cycles!). Exceptions are of course the memory indirect modes... This means the timing on the 68050 looks like this:
    Move.l D0,d3                      -- 1 cycle
    Move.l (a1),d3                    -- 1 cycle
    Move.l (a1)+,d3                   -- 1 cycle
    Move.l (a1,d2.l),d3               -- 1 cycle
    Move.l (a1,d2.w*8),d3             -- 1 cycle
    Move.l (12345678,a1,d2.w*8),d3    -- 1 cycle
What is a problem is:
    Eor.l d0,d2                       -- Register write to D2 in ALU
    Move.l (a1,d2.l*4),d3             -- Register usage of D2 in EA Unit
The two ALUs (EA-Unit and main ALU) are below each other in the pipeline. You need to have 3 instructions between an instruction which updates a data register in the ALU and an instruction which needs this data register in the EA-Unit. BTW, this is conceptual and is how the ideal 68K core works. The 68060 behaves the same; this is quite well explained in the 060 manual.
Matt Hey wrote:
| I should be more careful. I should have put...
|     moveq #7,d0      ; free extb.l
|     eor.l d1,d0
|
Yes, this is fusable! MVS is currently handled by the 68050 like a fused MOVE and EXT instruction. This means the extension is done in the ALU, which means MVS can NOT be fused atm with another instruction. MOVEQ is extended earlier in the pipeline, as the immediate value is already available during decoding - the decoder does the extending in the very first pipeline stage already. Cheers
| |
Matt Hey USA
| | Posts 733 15 Mar 2011 18:05
| Gunnar von Boehn wrote:
| Matt Hey wrote:
| Will the N68k be able to sign extend and do all shift sizes in the EA without penalty? |
Yes - the 68050 can already do EVERY EA for free. (zero extra cycles!) |
Sweet! Gunnar von Boehn wrote:
| What is a problem is:
|     Eor.l d0,d2 -- Register write to D2 in ALU
|     Move.l (a1,d2.l*4),d3 -- Register usage of D2 in EA Unit
| The two ALUs (EA-Unit and main ALU) are below each other in the pipeline. You need to have 3 instructions between an instruction which updates a Data-register in the ALU and an Instruction which needs this Data-Register in the EA-Unit. BTW this is conceptional and of how the ideal 68K CORE works. The 68060 behaves the same this is quite well explained in the 060 manual.
|
Jostein already does a better job of instruction scheduling than all 68k compilers I've seen 8-/. The 68060 change/use stall is only 2 cycles in some cases. Is this because it's superscalar or because of register result forwarding? Is it possible on the N68070 at least?
"The OEP does not experience any sequence-related pipeline stalls. The most common example of this type of stall is a change/use register stall. This type of stall results from a register being modified by an instruction and a subsequent instruction generating an address using the previously modified register. The second instruction must stall in the OEP until the register is actually updated by the previous instruction. For example:
    muls.l #<data>,d0
    move.l (a0,d0.l*4),d1
In this sequence, the second instruction is held for 2 clock cycles stalling for the first instruction to complete the update of the d0 register. If consecutive instructions load a register and then use that register as the base for an address calculation (An), a 2-clock-cycle wait may be incurred. This represents the maximum change/use penalty for a base register. The maximum change/use penalty for an index register (Xi) is 3 clock cycles (for Xi.l*2, Xi.l*8, and Xi.w). The change/use penalty for an index register if Xi.l*1 or Xi.l*4 is 2 clock cycles. Certain instructions have been optimized to ensure no change/use stall occurs on subsequent instructions.
The destination register of the following instructions is available for subsequent instructions:
    lea mov.l &imm,Rn movq clr.l Dn, any op (An)+ any op -(An)
as a base register for address calculation with no stall, or as an index register for address calculation with no stall, if Xi.l*{1,4}. If the index register used is Xi.l*2, Xi.l*8, or Xi.w, then the previously described 3 cycle stall occurs."
| |
Megol .
| | Posts 675 15 Mar 2011 18:14
| Gunnar von Boehn wrote:
| (stuff removed) MVS is currently handled by the 68050 like a fused MOVE and EXT instruction. This means the extension is done in the ALU. Which means MVS can NOT be fused atm with another instruction. MOVEQ is extended earlier in the pipeline as the immediate value is already available during decoding - the decoder does the extending in the very first pipeline stage already. Cheers
|
I understand why MVS requires the execution stage but does the same apply to MVZ too?
| |
Gunnar von Boehn Germany
| | (Moderator) Posts 5775 15 Mar 2011 18:15
| Matt Hey wrote:
| lea mov.l &imm,Rn movq clr.l Dn, any op (An)+ any op -(An)
|
The manual is written in a complicated way. What they are saying is that they do not need to use the lower ALU for executing those instructions. This is basically what I said before, in different words.
| |