
Welcome to the Natami / Amiga Forum

This forum is for AMIGA fans interested in the new NATAMI platform.
Please read the forum usage manual.



Do you have questions about the Natami?
Post it here and we will answer it!

OCS/ECS/AGA Flaws In the Coprocessors Themselves
Megol .

Posts 672
30 Mar 2011 19:44


SID Hervé wrote:

Hello
 
  Like the system programming, the direct programming is a simple tool that followed, follows, and will follow any future hardware changes.
 
  Reduced documentation will not prevent direct use of the hardware. Making the RKMs available has never encouraged anyone to bypass system programming. Instead, they document the internal workings of the Amiga and, where necessary, recommend and detail direct access to the hardware. The absence of this information would surely have led to more development that would not have been really friendly.
 
  On the other hand, the fact that the hardware becomes more powerful does not mean that the democoder's Holy Grail has been reached. It will give him more power at a roughly equivalent size (excluding data).
 
  And it might be wise to distinguish between demo coding and hacking; those are two different sports.

Hacking = doing stuff with things that they aren't designed for. Whether that is circumventing security ("cracking"), optimizing code for a certain system/removing generality or even manipulating people's normal responses ("social engineering") doesn't matter. I have seen an angler hack a toothbrush into a lure, a car mechanic hack a tool for some special purpose...

SID Hervé
France

Posts 663
30 Mar 2011 19:52


Hello
And what about the democoder?

Cesare Di Mauro
Italy

Posts 526
30 Mar 2011 19:57


Kowalski . wrote:

Cesare Di Mauro wrote:

  a more polite hardware interface.
     
 

 
  Being Italian like Cesare, I think he meant "polished" or "neat" with this sentence.
  Just for the record. If I'm wrong, I stand corrected.
  Cheers,
 
  Andre

Indeed. :)

Thanks Andre (or andres? O:-). I often make this mistake, confusing polish and polished. :-/

Cesare Di Mauro
Italy

Posts 526
30 Mar 2011 20:24


Samuel D Crow wrote:
For example, so far the two additional display modes for NatAmi are byte-planar: 8-bit (1 byte-plane) and 24-bit (3 byte-planes).

But they are "alien" to the Amiga hardware, which is bitplane(s) based.

You still need additional hardware in order to support them. And OS support too.
There's no point in making a 32-bit display mode (although a 4 byte-plane Blitter mode will exist for per-pixel alpha-blending) since it would take more bandwidth but yield no more color depth.  There's also no point in making a 2 byte-plane mode since the Blitter might need additional hardware to deal with that situation.

If the Blitter supports 32 bits chunky, it's still far away from the traditional bitplanes. So you need additional hardware again.
The point is that we'll likely be able to reuse some of the register mapping with byte-planes that were used with bit-planes.

You can do the same with chunky modes. You still do it for 8 bits chunky and for the 3 byteplane modes, which are quite different from the usual bitplanes.
The future will not need to hold much more of the PC-style chunky modes since they waste bandwidth and chip resources simply because they can, and they are slightly easier to use.

I don't agree. Bitplanes/byteplanes waste space and bandwidth too, especially with DDR2 memory where, once started, a burst must be completed.

With bitplanes you raise the granularity to 128 bits, i.e. 128 pixels, which is A LOT.
With byteplanes it's better, but it is still 16 pixels (x3 in terms of bandwidth and space).

16 and 32 (and 24 too) bits chunky modes are much more compact and have a better granularity (8, 4 and 5 respectively).

So, at same depth, a chunky mode is always and absolutely preferable in terms of space and bandwidth.

Another related thing is latency. Since bitplanes and byteplanes require all the data to be loaded before it can be used, the output has greater latency compared to chunky modes, where a single load already carries usable data.

But we have already discussed these things.
With careful planning, the existing design will scale up to better FPGAs with minimal modification.  The design is meant to be optimal without having to be extensible at all.  This will make design costs lower over time.
 
Now do you see what we are trying to do?

I think that you are making the same mistakes again.

Anyway, that's my opinion. ;)
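
As a rough check of the granularity figures in this post, here is a minimal Python sketch. It assumes the 128-bit burst used in the discussion and that a planar mode needs one burst per plane before a group of pixels is complete; the function name and the mode table are made up for illustration.

```python
BURST_BITS = 128  # burst width assumed in the discussion

def pixels_per_burst(bits_per_element: int) -> int:
    """Whole pixels (or per-plane pixel slices) carried by one burst."""
    return BURST_BITS // bits_per_element

# (mode, bits one pixel occupies within a single burst, bursts per pixel group)
MODES = [
    ("8 bitplanes", 1, 8),
    ("3 byteplanes", 8, 3),
    ("16-bit chunky", 16, 1),
    ("24-bit chunky", 24, 1),
    ("32-bit chunky", 32, 1),
]

for name, bits, bursts in MODES:
    print(f"{name:14s} granularity {pixels_per_burst(bits):3d} pixels, "
          f"{bursts} burst(s) per pixel group")
```

The 24-bit case rounds down because 128 is not a multiple of 24, which is where the "5" in the post comes from.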

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
30 Mar 2011 20:32


Thomas Richter wrote:

From today's perspective, it is a nightmare because now higher bandwidths are available, and when using them for images, you need to fire the blitter as often as 24 times to render an image. 

Actually I do not see Bitplanes as negatively as you do.
First of all, supporting Bitplanes in the chipset is no problem at all.
The Blitter logic is small, and takes little chip space.
If you want to enhance your chipset, e.g. to support Byteplanes or pixels of XXX bytes, then you can do this without any need to remove the Bitplane logic.

Let's say, for example, your Bitplane Blitter needs 1000 LE.
What is this compared to a 55,000 LE FPGA?
And what is this compared to a 1,000,000 LE ASIC?

Is there a need to remove the Bitplane support from future chipsets?
No, it's really not!

Regarding the need for several blits:
if you want, this need could also be removed quite easily.

The AMIGA has always supported interleaved Bitplanes.
With interleaved Bitplanes you can do as many Planes as you want using a single Blit.

With a tiny little enhancement to the AMIGA Blitter you could also do such a Blit using a single Plane mask.
Another possible improvement would be to add a FIFO to the Blitter allowing it to accept several jobs in a row. This is also not difficult and would free the CPU from waiting between Blits.

Cheers
Gunnar
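
To make the interleaved-bitplane point concrete, here is a small Python sketch of the two memory layouts. It is only an illustration under simple assumptions: the same operation is applied to every plane, rows are `bytes_per_row` bytes wide, and all function names are invented for this example.

```python
def plane_row_offset(y, plane, bytes_per_row, height, depth, interleaved):
    """Byte offset of row y of one bitplane from the start of the bitmap."""
    if interleaved:
        # Stored as: row0/plane0, row0/plane1, ..., row1/plane0, ...
        return (y * depth + plane) * bytes_per_row
    # Each plane is one contiguous block of `height` rows.
    return (plane * height + y) * bytes_per_row

def blits_needed(depth, interleaved):
    """Blits for a rectangle that touches every plane with the same operation."""
    return 1 if interleaved else depth

def interleaved_blit_rows(rect_height, depth):
    """Height of the single blit that covers all planes when interleaved."""
    return rect_height * depth

# A 24-plane image: 24 blits in the separate layout, 1 when interleaved.
print(blits_needed(24, interleaved=False), blits_needed(24, interleaved=True))
```

In the interleaved layout the planes of one row are simply consecutive rows in memory, so a rectangle of height h becomes one blit of h times depth rows. A mask source would normally have to be repeated per plane in the same way, which is the case the "single Plane mask" enhancement mentioned above would avoid.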

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
30 Mar 2011 20:36


Cesare Di Mauro wrote:

Another related thing is latency. Since bitplanes and byteplanes require all the data to be loaded before it can be used, the output has greater latency compared to chunky modes, where a single load already carries usable data.

Let's go into detail here. It's always better to discuss these things using real technical numbers.

Maybe you can give us an example of how much latency this is in your opinion.

Please explain to us why 3 Byteplanes have more latency than, for example, a 32bit Chunky mode - and how much more.

Cheers


Cesare Di Mauro
Italy

Posts 526
30 Mar 2011 20:56


Suppose that the display logic needs to fetch the data to display it, and you have a 128 bits burst memory interface.

With a 24 bits chunky mode it starts to output the pixels just after the first burst.

With a 3 byteplane mode it needs to fetch all three 128-bit bursts, combine them (suppose that this takes zero cycles), and it starts to output the pixels after the third burst.

So the 3 byteplane mode has at least triple the latency of the 24 bits chunky.

The same happens with the Blitter and the CPU: once a burst has completed, with chunky modes you already have usable data.

Another thing that I forgot before is the fact that 16, 24 and 32 bits chunky modes stay in contiguous memory blocks, so fetching a row doesn't require to send commands to the memory controller to start a burst into another memory address, whereas bitplanes and byteplanes need to do this.

I'm not a memory interface expert, but maybe this can require additional cycles, so increasing latency and wasting bandwidth too.
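
The latency model described in this post can be written down as a tiny Python sketch. It assumes exactly what the post assumes: bursts are issued strictly one after another, and combining the planes costs zero cycles. The `burst_cycles` parameter and the function name are hypothetical.

```python
def first_pixel_cycles_serial(planes: int, burst_cycles: int = 1) -> int:
    """Cycles until the first complete pixel, with one burst at a time.

    planes: bursts that must arrive before a pixel is usable
            (1 for chunky, 3 for three byteplanes).
    burst_cycles: duration of one full burst round trip (hypothetical unit).
    """
    return planes * burst_cycles

chunky_24 = first_pixel_cycles_serial(1)
byteplanes_3 = first_pixel_cycles_serial(3)
print(byteplanes_3 / chunky_24)  # ratio of waiting times under this model
```

The factor of three only holds as long as no second request is issued before the first one has returned.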

Samuel D Crow
USA
(Natami Team)
Posts 1295
30 Mar 2011 21:58


@Cesare Di Mauro

Remember that the byteplanes are a display mode.  The Amiga chipset required the screen and sprites to all be a multiple of 16 pixels across anyway.

If you're referring to the Blitter's latency, it simply takes 3 passes at the same data chunk with the job queue overhead built into the new hardware whether through the Bopper or through job queue registers on the Blitter itself.

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
30 Mar 2011 23:03


Cesare Di Mauro wrote:

Suppose that the display logic needs to fetch the data to display it, and you have a 128 bits burst memory interface.
 
With a 24 bits chunky mode it starts to output the pixels just after the first burst.

In the 1980s your calculation would have been correct.
But for about 15 years now memory has worked totally differently.

Today all DRAM memory has many waitstates.
And this is by design.
Therefore from the time when you start your burst request
until the time when the data from the memory arrives, many cycles will have passed.

Our NATAMI has a very low latency (compared to a PC) but nevertheless about 26 cycles will have passed until the data arrives.

By design the memory controller will use this time and continuously send out new burst requests. This means the DMA engine will shoot out a full rain of DMA requests, long before the first data words arrive back.

Cesare Di Mauro wrote:

I'm not a memory interface expert, but maybe this can require additional cycles, so increasing latency and wasting bandwidth too.

Today's memory works differently.
You have pages and banks, and memory chips can hold several banks open in parallel without a drawback.

You can safely trust those who are memory experts - there are no problems with using Byteplanes.
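
A rough Python sketch of the pipelined picture described here, for comparison with the one-burst-at-a-time view. The 26-cycle latency is the figure from this post; the 4 cycles a burst occupies the data bus is a purely hypothetical value, and the model ignores bank conflicts, refresh and other traffic.

```python
def first_pixel_cycle_pipelined(planes, latency=26, burst_cycles=4):
    """Cycle at which the last burst needed for the first pixel has arrived.

    Requests are issued back to back, so every additional plane adds only
    the transfer time of one burst, not another full round trip.
    latency: cycles from request to first data (figure from the post).
    burst_cycles: cycles one burst occupies the data bus (assumed value).
    """
    return latency + planes * burst_cycles

chunky = first_pixel_cycle_pipelined(1)      # single stream
byteplanes = first_pixel_cycle_pipelined(3)  # three planes
print(chunky, byteplanes)
```

Under these assumptions the three-plane fetch still arrives later than the single-stream fetch, but by a few transfer cycles rather than by a factor of three.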

One Thousand
USA

Posts 832
30 Mar 2011 23:12


@Cesare
 
Just so you know, I like chunky too.  You are not alone.
 
 
@Sam
 
It seems to me that we are more bound by the memory interface than what old chipsets were bound by.  We get to mess with the chipset design. 

@Gunnar

Even though you can have multiple pages open, there are only so many you can have open.  And the more you have open for the display, the less you can have open for other things.

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
30 Mar 2011 23:44


One Thousand wrote:

  Even though you can have multiple pages open, there are only so many you can have open.  And the more you have open for the display, the less you can have open for other things.
 

 
Wow, it's great that you want to help us design memory controllers.
I was not aware that we have so many DDR memory controller experts in the forum that could have helped us. :-D
 
Ok, let us make some progress here. ;-D
I'm sure you are all aware of the burst behavior of Altera's latest high-performance memory controllers.
Unfortunately the maximum burst length cannot simply be doubled as someone would naively expect.
Therefore the burst length, in combination with the maximum number of bursts in flight, is a design constant. This constant stands in opposition to the design-specific minimal latency resulting from the combination of the altmemphy and hpmc units. The design latency amounts to more cycles than the maximum usable cycles for transfers, which is the product of burst length and number of bursts. In other words, the design includes timing gaps, to be used for bank switch activities.

Now, based on your post, I assume you know a way to use those reserved gaps not for bank switching activities, but to further increase the throughput for purely sequential access?

Is this what you wanted to propose?

No offense to anyone.
DDR memory controllers and their implementation in FPGAs are a very interesting topic, but unless you know something about this topic, any advice on how to improve the NATAMI memory design might not be as helpful as you think.

I think it's more fruitful to discuss topics where people have more background knowledge.
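
The relationship between latency, burst length and bursts in flight described in this post can be sketched in Python as follows. Only the 26-cycle latency appears in the thread; the burst duration and the in-flight limit below are hypothetical placeholders, not NATAMI or Altera figures.

```python
def idle_cycles_per_window(latency, burst_cycles, max_in_flight):
    """Cycles within one latency window that cannot carry burst data.

    usable = burst_cycles * max_in_flight is the most the controller can
    transfer while a full set of requests is outstanding; if the latency
    is longer than that, the remainder is a timing gap.
    """
    usable = burst_cycles * max_in_flight
    return max(0, latency - usable)

# Hypothetical numbers: 4-cycle bursts, at most 4 bursts in flight.
print(idle_cycles_per_window(latency=26, burst_cycles=4, max_in_flight=4))
```

With these placeholder values 10 of the 26 cycles are gaps, which in the description above is time the design reserves for bank switching rather than for more sequential data.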
 

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
31 Mar 2011 00:06


Marcel Verdaasdonk wrote:

ah, so you do read people's post.
 
  Gunnar I was asking about paula's state-machine's features.

Yes, Paula has a state machine.
Now which feature are you talking about?
What is your question exactly?
Is there something you would change or not change, and why?

One Thousand
USA

Posts 832
31 Mar 2011 00:17


Gunnar, your post was sarcastic and uninviting.  Is your real intention to have a proper discussion? 

Now, how does what you said relate to 3x8 being better than chunky?  Really?  Simplify it and make it relevant.

You can line up the requests for chunky to keep the flow going nicely too.  The nice thing about display logic, for the most part, is you can plan way ahead.  With chunky, you only have one stream to deal with.  Planar, you have to deal with multiples.

And besides, you do not want to saturate the bus with only display.  Then you'd have nothing else. 



Gunnar von Boehn
Germany
(Moderator)
Posts 5775
31 Mar 2011 00:19


Cesare Di Mauro wrote:

16 and 32 (and 24 too) bits chunky modes are much more compact and have a better granularity (8, 4 and 5 respectively).
 
  So, at same depth, a chunky mode is always and absolutely preferable in terms of space and bandwidth.

  Anyway, that's my opinion. ;)

I think many others will disagree with your opinion.
I think we all have the same opinion about 16bit chunky.
So this is easy.

32bit chunky does increase bandwidth requirements by 33%.
Which means it slows your system down by 33%.

24bit chunky does not exist as a mode.
24bit can NEVER be chunky because neither a Blitter nor a CPU can write 24bit properly in a single access.

What you call 24bit chunky needs three accesses of 1 byte each, the same way the 3xByteplane mode needs them.

Both modes have the same optimal bandwidth requirement.
Both modes have different advantages.
- Your proposal is clearly more difficult to address.
+ Yours is sequential, so while it needs 3 byte accesses to write, they have a good chance of falling in 1 or 2 cachelines, while the Hybrid mode will always fall in 3 lines. For bobs this makes no real difference, but for drawing vertical lines this might be a small advantage.
- The Hybrid mode has the advantage that you can have the different planes easily in different resolutions. This allows optimal display of video modes which use color planes in lower resolutions.
- The Hybrid mode has the advantage that you can easily update the planes one by one. This is ideal if you work with images which come with individually compressed byte planes. This makes decompressing those images or decompressing video streams work optimally in this mode. Here the layout of the Hybrid mode is locally optimal for the CPU cache.

In short both modes have pluses and minuses.
I can see a lot of nice features of the Hybrid mode which make it quite useful in many ways.
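
Two of the claims in this comparison are easy to check numerically. The Python sketch below is an illustration only: the 16-byte cache line is an assumed value, rows are assumed to start on a line boundary, and the function names are invented here.

```python
CACHE_LINE = 16  # bytes per cache line; assumed value for illustration

def lines_touched_packed24(x, row_base=0):
    """Cache lines hit when writing the 3 bytes of pixel x in a packed row."""
    first = (row_base + 3 * x) // CACHE_LINE
    last = (row_base + 3 * x + 2) // CACHE_LINE
    return last - first + 1

def lines_touched_byteplanes(planes=3):
    """Each plane lives in its own memory area, so one line per plane."""
    return planes

def extra_bandwidth(bits_a, bits_b):
    """Extra bandwidth of mode A relative to mode B, as a fraction."""
    return bits_a / bits_b - 1

print(lines_touched_packed24(0), lines_touched_packed24(5))
print(lines_touched_byteplanes())
print(round(extra_bandwidth(32, 24) * 100), "% more for 32bit vs 24bit")
```

Pixel 5 occupies bytes 15 to 17 and therefore straddles two lines, which is the "1 or 2 cachelines" case; the three-plane write always touches three.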

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
31 Mar 2011 00:32


One Thousand wrote:

  Gunnar, your post was sarcastic and uninviting.
 

Sorry, I couldn't help it. :-D
Those comments were so funny.

There is nothing wrong with not being an expert in a certain area.
But it starts to get funny if a lot of "advice" is given
or comments like "chunky is simpler" are posted as facts.

The story would look totally different if people asked, instead of giving advice about topics in which they lack deep expert knowledge.
   
 
One Thousand wrote:

  With chunky, you only have one stream to deal with.
 
  Planar, you have to deal with multiples.
 

 
Well, let's look at this in detail, maybe the result will surprise some people here.
 
To display a good resolution you need to shoot out a bunch of requests. If you do not do this you will not have enough video bandwidth.
 
For the sake of argument let's compare 3x8 with 1x32 mode.
Let's say for your desired resolution you need 6 requests in flight for 24bit. This means for each of the 3 byteplanes you need to have 2 requests in flight in parallel.
 
Your 32bit screen has a 33% higher DMA requirement,
which means for the same resolution this 1 plane
needs to have not 6 but 8 memory bursts in flight in parallel.
 
Does your 32bit screen look simpler?
Only if you do not care how a memory interface works.
Because in reality this one plane needs 33% more requests in flight than the 3 plane mode.
 
 
The math is very simple.
The 32bit chunky mode is wasting bandwidth.
The 32bit chunky mode needs 33% more buffers and 33% more requests in flight compared to the Hybrid mode.
 
The 32bit mode is clearly simpler for the CPU if you really want to draw each and every pixel one by one with the CPU.
This goes without saying.
Of course the CPU is too slow to do any fullscreen single pixel drawing anyway - so you would not go far with this, at least not if you want to do something in good resolution in real time....
 
If you do not care about real time then having to write 3 bytes for the Hybrid mode should normally not be a drawback.
And the Hybrid mode has some clear advantages in certain video modes.
 
The 32bit mode makes sense as a display mode - just for backward compatibility reasons.
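
The 6-versus-8 arithmetic in this post, as a small Python sketch. It assumes, as the post does, that the number of bursts that must be in flight scales linearly with the bits fetched per pixel at a fixed resolution; the function name is made up.

```python
def requests_in_flight(base_requests, base_bits, mode_bits):
    """Scale the number of outstanding bursts with the bits per pixel.

    Uses integer ceiling division so a fractional request rounds up.
    """
    return (base_requests * mode_bits + base_bits - 1) // base_bits

base = 6  # requests in flight assumed for 24 bits per pixel (from the post)
print(requests_in_flight(base, 24, 24))  # 3 byteplanes: 2 per plane
print(requests_in_flight(base, 24, 32))  # 32bit chunky
print(requests_in_flight(base, 24, 16))  # 16bit chunky, for comparison
```

The same scaling gives 4 requests for a 16bit mode, so the comparison depends heavily on which chunky depth is put up against the 3x8 mode.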
 

One Thousand
USA

Posts 832
31 Mar 2011 00:52


Thank you, that was a better response.

What I meant by "multiple streams" was "multiple distinct streams" because of the jumping around that is inherent in planar.  My bad for not clarifying that well enough.  But . . .

That was nice of you to use 32bit chunky.  However, you ought to remember that I am more for a nice 16bit chunky.

So in other words, go with a nice 16bit chunky screen with brightness/color separated colorspace, right?  It has less DMA requirement, less memory space, and it is one nice stream without jumping around as well as a more efficient use of bits.  :)

---

By the way, I do not think people are saying drop the 3x8 mode, which you are for.  What they are saying is they want the chunky modes.  Why not have both?  You may want one mode for one reason, or another mode for another reason.  Just recently you gave a big spiel about how well FPGAs are growing and that we have space.

Gunnar von Boehn
Germany
(Moderator)
Posts 5775
31 Mar 2011 01:04


One Thousand wrote:

  By the way, I do not think people are saying drop the 3x8 mode, which you are for.  What they are saying is they want the chunky modes.  Why not have both?  You may want one mode for one reason, or another mode for another reason.  Just recently you gave a big spiel about how well FPGAs are growing and that we have space.
 

 
I don't think anyone said we would ultimately drop chunky.
 
First of all the AMIGA Blitter is so flexible in its design that you can use it for any mode.
This means the Blitter works fine for bitplanes, for the basics it works fine for 8bit chunky or byteplanes, and it also works fine for 16bit chunky or 32bit chunky.
 
For display modes, supporting 16bit and 32bit is not a huge deal and it's certainly useful for backward compatibility with Cyber and SDL games.

Thomas Richter would probably say it's bad that people exposed the information about those modes and about the framebuffer layout before ... therefore we now have to support them..
 

But the interesting part, in my opinion, comes when you support them not only for display but also as rendering targets with all HW acceleration.
And here begins the part where you want to discuss if you need or want this.

I wonder if providing full HW acceleration or 3D acceleration for all those different modes makes the best sense.
 
For future games 16bit looks more and more limited (not to say it starts to suck).
So the first question is: Does it make sense to support this mode with the 3D core at all?

Isn't the overhead of rendering the pixels so high that saving some bits in display memory becomes a bad tradeoff?
 
I agree with Sam here that we should not spread ourselves too thin.

If we advertise one of the new modes as the "recommended mode" and provide "strong HW acceleration" for this one, then this probably makes more sense than providing 5 modes with only half-working acceleration for each of them.
 
But maybe I'm too cautious here?


One Thousand
USA

Posts 832
31 Mar 2011 01:24


Actually, *you* said drop chunky.  You did that when you changed focus to planar and wanted to extend planar to 10,11,12 bits.  And your behavior has also appeared to be against having chunky.  That is why everybody was concerned, including me.
 
On the 16bit, it depends.  Maybe someone prefers using the extra bandwidth for a higher res or more FPS.  As for color quality, a good brightness/color separated colorspace (for example, the Jag's CRY) is near 24bit RGB.
 
Now, as for Tami.  That does take a lot more thought than having some screenmodes.  I would ask that you are careful about that.  There have been numerous requests for information, but you have avoided that.
 
There is also the thought that maybe it is better to just have the multiple 68k cores.  This way, one can program whatever kind of rendering you want.  But a fixed function, high level pipeline is much more limited.  The trade-off being higher performance in a fixed function or flexibility.
 
Personally, I would go the multiple 68k cores over the one fixed function pipeline.  But maybe we can filter out some functions that could be very useful on a lower level and have a combination of both.  For example, since 32bits is the magical number, packing and unpacking of pixels (chunky or hybrid) could be useful.
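
As an illustration of the packing and unpacking mentioned here, a minimal Python sketch that converts between packed pixels and three byteplanes. The 0xRRGGBB packing and the assignment of one colour channel per plane are assumptions made for the example; the thread does not specify either.

```python
def chunky_to_byteplanes(pixels):
    """Split packed 0xRRGGBB pixels into three byteplanes (R, G, B)."""
    r = bytes((p >> 16) & 0xFF for p in pixels)
    g = bytes((p >> 8) & 0xFF for p in pixels)
    b = bytes(p & 0xFF for p in pixels)
    return r, g, b

def byteplanes_to_chunky(r, g, b):
    """Recombine three byteplanes into packed 0xRRGGBB pixels."""
    return [(rv << 16) | (gv << 8) | bv for rv, gv, bv in zip(r, g, b)]

row = [0x112233, 0xFF0000, 0x00FF00, 0x0000FF]
planes = chunky_to_byteplanes(row)
assert byteplanes_to_chunky(*planes) == row  # round trip is lossless
```

In hardware this would be a fixed byte shuffle on 32-bit words rather than a loop, which is the kind of low-level helper the post suggests could be shared between chunky and hybrid modes.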

--

Another thought about Tami, if she was the tile-based, deferred renderer we last talked about, the tile is in full color res all the time.  Could there not be a separate writing unit that is smart enough to dump out and rearrange according to the desired screenmode?

--

Curious:  Why can there not be a packed 24bit chunky mode?  Wouldn't you only need a little padding at end of the screen or maybe line, instead of the alpha padding on every 32bits?  Just find a nice LCM to match up?  Perhaps I am missing something. 
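
The LCM idea in this question can be worked through in a few lines of Python. The sketch assumes the 128-bit burst used elsewhere in the thread; the helper names are made up. It only covers fetch alignment and says nothing about whether a CPU or Blitter can write a 24-bit pixel in one access.

```python
from math import gcd

def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)

def packed24_alignment(burst_bits=128, pixel_bits=24):
    """Smallest group that holds whole pixels AND whole bursts."""
    group_bits = lcm(burst_bits, pixel_bits)
    return group_bits // pixel_bits, group_bits // burst_bits  # (pixels, bursts)

def padded_width(width, group_pixels):
    """Round a line width up to a whole number of pixel groups."""
    return -(-width // group_pixels) * group_pixels

pixels, bursts = packed24_alignment()
print(pixels, bursts)               # pixels and bursts per aligned group
print(padded_width(1000, pixels))   # a 1000-pixel line padded to a full group
```

With a 128-bit burst the group is 16 pixels in 3 bursts, so padding each line to a multiple of 16 pixels would be enough to keep lines burst-aligned.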

Cesare Di Mauro
Italy

Posts 526
31 Mar 2011 05:39


Samuel D Crow wrote:

  @Cesare Di Mauro
 
  Remember that the byteplanes are a display mode.

  So do you plan to use it only for the framebuffer?
 
The Amiga chipset required the screen and sprites to all be a multiple of 16 pixels across anyway.

  That was because the chipset works with bitplanes and words (16 bits).
 
  This constraint doesn't necessarily apply to different modes (chunky and byteplanes), except for the chipset implementation.
 
If you're referring to the Blitter's latency, it simply takes 3 passes at the same data chunk with the job queue overhead built into the new hardware whether through the Bopper or through job queue registers on the Blitter itself.

  I was talking in general, so counting any possible usage: display logic, Blitter and CPU.

Cesare Di Mauro
Italy

Posts 526
31 Mar 2011 05:54


Gunnar von Boehn wrote:

 
Cesare Di Mauro wrote:

  Suppose that the display logic needs to fetch the data to display it, and you have a 128 bits burst memory interface.
   
  With a 24 bits chunky mode it starts to output the pixels just after the first burst.

  [...]By design the memory controller will use this time and continuously send out new burst requests. This means the DMA engine will shoot out a full rain of DMA requests, long before the first data words arrive back.

  OK
 
Cesare Di Mauro wrote:

  I'm not a memory interface expert, but may be this can require additional cycles, so increasing latency and wasting bandwidth too.
 

  Today's memory works differently.
  You have pages and banks, and memory chips can hold several banks open in parallel without a drawback.

  How many requests can be accepted? What is the bank size?
 
You can safely trust those who are memory experts - there are no problems with using Byteplanes.

  What you said before doesn't change the scenario much.
 
  With byteplanes you need to wait for the 3 memory requests before you can start to output data. With (24 bits) chunky you can start just after you have received the first burst. So byteplane latency is greater than chunky latency.
 
  Also, they require 3 internal buffers to hold all the data, whereas chunky requires just one.
 
  With byteplanes the granularity is much higher: 16 pixels x 3. Chunky has a much lower one. So if your operation isn't "16 pixels aligned" you are wasting space and/or bandwidth.
 
  Is it correct?
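
One way to put numbers on the alignment question is to count the bytes that whole bursts transfer for a given pixel span. The Python sketch below assumes 128-bit (16-byte) bursts, rows that start burst-aligned, and that every plane is read in whole bursts; the function names are hypothetical.

```python
BURST_BYTES = 16  # one 128-bit burst

def bytes_fetched(x, n, bytes_per_pixel, planes=1):
    """Bytes transferred to read n pixels starting at pixel x of one row."""
    start = (x * bytes_per_pixel) // BURST_BYTES
    end = ((x + n) * bytes_per_pixel - 1) // BURST_BYTES
    return (end - start + 1) * BURST_BYTES * planes

def useful_bytes(n, bytes_per_pixel, planes=1):
    """Bytes that actually belong to the requested pixels."""
    return n * bytes_per_pixel * planes

# A small, unaligned span: 4 pixels starting at x = 6.
print(bytes_fetched(6, 4, 1, planes=3), useful_bytes(4, 1, planes=3))  # 3 byteplanes
print(bytes_fetched(6, 4, 3), useful_bytes(4, 3))                      # packed 24-bit
# A full 320-pixel line starting at x = 0.
print(bytes_fetched(0, 320, 1, planes=3), bytes_fetched(0, 320, 3))
```

Under these assumptions the small span costs three bursts with byteplanes and one with packed 24-bit, while the full line costs the same in both layouts, so the waste depends on how small and how unaligned the typical operation is.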
