
Welcome to the Natami / Amiga Forum

This forum is for AMIGA fans interested in the new NATAMI platform.
Please read the forum usage manual.



The team will post updates and news here

Multicore Processors Capable of Repairing Itself
SID Hervé
France

Posts 663
23 Apr 2011 23:26


Probably a value added for the future:

EXTERNAL LINK

Phil "meynaf" G.
France
(Natami Team)
Posts 393
25 Apr 2011 08:26


SID Hervé wrote:

Probably a value added for the future:
 
  EXTERNAL LINK 

I've read about that too some time ago.

These chips don't exactly "repair" themselves. They just switch the dead cores off. Definitely not useful for a personal computer.


Wojtek P
Poland

Posts 1597
25 Apr 2011 09:23


Phil G. wrote:

SID Hervé wrote:

  Probably a value added for the future:
 
  EXTERNAL LINK 

  I've read about that too some time ago.
 
  These chips don't exactly "repair" themselves. They just switch the dead cores off. Definitely not useful for a personal computer.
 

For a "personal computer" - no.
For a windoze PC - yes. They can make new multicore trash even cheaper, skip testing altogether, and then rely on that mechanism.
Sell a 16-core CPU to the client, then half or more of the cores will fail and switch off, but what's the problem? PC/windoze clients buy a new computer to "have one", not because they need it. They don't really need more than a single core in most cases.

Good idea :)


SID Hervé
France

Posts 663
25 Apr 2011 11:38


Hello

You're probably right. I also assume that this technology will not be for the mass market because of its economic implications. I think specific sectors could be concerned, probably those whose purpose or priority is not money.

Wojtek P
Poland

Posts 1597
25 Apr 2011 12:18


SID Hervé wrote:

Hello
 
  You're probably right. I also assume that this technology will not be for the mass market because of its economic implications. I think specific sectors could be concerned, probably those whose purpose or priority is not money.

I think it IS for the mass market. Just without even telling the customer about it. The right technology for the right customer :)

It is already common to sell partially failed chips as less capable parts. This is normal, always was, and is fine.
A producer makes a 4-core processor, finds that one core has failed, and sells it as a 3-core one.
But if more cores fail later, you can request a warranty replacement.

As process technology keeps shrinking, parts become steadily less reliable over time.
Very small transistors are prone to wearing out quickly.
That's why that "technology" got invented.
Sell the customer, say, an 8-core CPU; within a short time half of the cores will fail, but 99% of customers will not notice and will not ask for a replacement.


SID Hervé
France

Posts 663
25 Apr 2011 13:17


A longer lifespan means fewer sales; it contradicts the principle of mass consumption (by eliminating the need, it delays the purchase).

I think this rationalization will be implemented mainly where security is required. Other niches could probably adopt it too, but for other reasons.

Marcel Verdaasdonk
Netherlands

Posts 3977
25 Apr 2011 14:56


SID, depending on how this self-test is implemented, it could be valid to say it enhances system stability too.

André Jernung
Sweden
(MX-Board Owner)
Posts 988
25 Apr 2011 18:15


I removed some personal attacks started by Megol. Keep it civil.

Thomas Richter
Germany
(MX-Board Owner)
Posts 1425
25 Apr 2011 19:41


Such CPUs could have some advantage in data centers where fault times/down times must be minimized. In case a core dies, the job can be taken over by another CPU provided the overall hardware is designed smart enough, and jobs run on "virtual servers" rather than real servers. I don't think it's very relevant for consumer hardware - if the CPU dies, get another one, it doesn't matter that the system is not usable in the meantime.

In professional applications, losing one CPU is probably better than losing an entire blade, provided the OS can deal with this situation.

Greetings,
Thomas


SID Hervé
France

Posts 663
25 Apr 2011 20:43


I do not think the computer will be the only one affected.
 
This kind of self-management of resources could be found in future devices that contain several identical sub-assemblies. For example, a set (CPU, ASIC) could be replaced by a set (CPU, CPU). This could ultimately help to further reduce manufacturing costs.

Thomas Richter
Germany
(MX-Board Owner)
Posts 1425
25 Apr 2011 22:58


SID Hervé wrote:

I do not think the computer will be the only one affected.
 
  This kind of self-management of resources could be found in future devices that contain several identical sub-assemblies. For example, a set (CPU, ASIC) could be replaced by a set (CPU, CPU). This could ultimately help to further reduce manufacturing costs.

You already find racks like this, i.e. racks using a non-uniform memory-architecture, assembled for two or four CPU slots. Usually, if a CPU wants to access memory not local to itself, it needs to contact a neighbouring CPU to get the memory contents passed over, of course at a speed impact.

Now, vendors have started selling such racks without all CPU slots populated. However, to be able to deliver them with full memory, *something* then has to replace the CPU in the otherwise unoccupied slot, just to allow the occupied CPU slot to reach the memory behind it. Vendors replace this CPU with a "dummy" ASIC that has nothing more to do than implement the bus-access protocol of a CPU, providing access to the memory local to the slot. Of course, these dummy ASICs don't do any computation. They only act as bus bridges.

It's all reality already, you know.

I don't remember the vendor, though.


Wojtek P
Poland

Posts 1597
25 Apr 2011 23:21


Thomas Richter wrote:

Such CPUs could have some advantage in data centers where fault times/down times must be minimized. In case a core dies, the job can be taken over by another CPU provided the overall hardware is

This is not what it provides. For what you said, you need lots of additional logic to

a) detect the fault - which is what the article described
b) perform RECOVERY - which is NOT what the article described.

Recovery means transferring the CPU state to another CPU and continuing the task.

This is what IBM z/Architecture mainframes do.

Without this it's nothing more than the ability to disable any processor,
but every faulting circuit WILL mean a crash of the whole system.

This is what you can do at present without this invention.

But this invention is to HIDE it from the customer.
Which is wrong, except for trash-consumers who don't even use that processing power and don't even know how to check how much is available.


Wojtek P
Poland

Posts 1597
25 Apr 2011 23:25


Marcel Verdaasdonk wrote:

SID, depending on how this self-test is implemented, it could be valid to say it enhances system stability too.

No, it will not. The self-test is performed at startup.

To increase stability you have to build redundant circuits that check EVERY step of a CPU, detect faults and are ABLE to recover.

The extreme case is IBM mainframe CPUs. For each processor there are actually two of them, plus extra logic that checks the results.
The extra logic is redundant too.
When a single fault is detected, the cycle is restarted. This fixes failures caused by noise, natural radiation, etc.

When multiple faults are detected within a short time, it means the circuit has failed: the processors are halted, and the FULL STATE of the CPU is extracted, pushed to a spare CPU and restarted there.

From the user's point of view nothing happened, apart from a notification about the failure.
If no spare CPUs are left, you have to replace boards.
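
The retry-then-failover scheme described in this post can be sketched as a toy model. This is only an illustration of the idea, not how IBM hardware actually implements it: the names `LockstepUnit` and `run`, the `faulty` fault injector, the single `acc` register standing in for the full CPU state, and the retry limit of 2 are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, int]  # stands in for the full CPU state (here: one register)


@dataclass
class LockstepUnit:
    """Two copies of one processor plus a comparator (toy model)."""
    name: str
    # Hypothetical fault injector: True if the copies disagree on this attempt.
    faulty: Callable[[int], bool]

    def step(self, state: State, attempt: int) -> Optional[State]:
        a = state["acc"] + 1                         # result from copy A
        b = a + (1 if self.faulty(attempt) else 0)   # copy B, possibly corrupted
        return {"acc": a} if a == b else None        # comparator: None on mismatch


def run(units: List[LockstepUnit], cycles: int,
        max_retries: int = 2) -> Tuple[State, List[str]]:
    state: State = {"acc": 0}
    log: List[str] = []
    active = 0    # index of the unit currently executing
    attempt = 0   # global attempt counter, fed to the fault injector
    for _ in range(cycles):
        retries = 0
        while True:
            attempt += 1
            result = units[active].step(state, attempt)
            if result is not None:
                state = result            # cycle committed
                break
            retries += 1
            if retries <= max_retries:
                # Single fault: restart the cycle on the same unit.
                log.append(f"{units[active].name} retry")
                continue
            # Repeated faults: treat the unit as dead and move the
            # unchanged state to the next spare unit.
            if active + 1 == len(units):
                raise RuntimeError("no spare units left")
            log.append(f"{units[active].name} halted, "
                       f"state moved to {units[active + 1].name}")
            active += 1
            retries = 0
    return state, log


# cpu0 has one transient fault (attempt 2) and fails for good from attempt 5 on.
cpu0 = LockstepUnit("cpu0", lambda n: n == 2 or n >= 5)
cpu1 = LockstepUnit("cpu1", lambda n: False)
final_state, events = run([cpu0, cpu1], cycles=6)
print(final_state)
for line in events:
    print(line)
```

The point of the sketch is that the program's result is the same as on fault-free hardware; only the event log shows that anything happened.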


Jakob Eriksson
Sweden
(Moderator)
Posts 1097
26 Apr 2011 06:14



Wojtek, hiding is not always bad. If the redundancy is at a higher level (a web application backend, for instance), then it could just be a more economical way to get the last out of a piece of hardware before it dies.


Thomas Richter
Germany
(MX-Board Owner)
Posts 1425
26 Apr 2011 09:38


Wojtek P wrote:

Thomas Richter wrote:

  Such CPUs could have some advantage in data centers where fault times/down times must be minimized. In case a core dies, the job can be taken over by another CPU provided the overall hardware is
 

  This is not what it provides. For what you said, you need lots of additional logic to
 
  a) detect the fault - which is what the article described
  b) perform RECOVERY - which is NOT what the article described.
 
  Recovery means transferring the CPU state to another CPU and continuing the task.
 
  This is what IBM z/Architecture mainframes do.
 
  Without this it's nothing more than the ability to disable any processor,
  but every faulting circuit WILL mean a crash of the whole system.
 
  This is what you can do at present without this invention.
 
  But this invention is to HIDE it from the customer.
  Which is wrong, except for trash-consumers who don't even use that processing power and don't even know how to check how much is available.
 

Where does it say that it "hides" anything? The important fact is that the system can continue running. Of course, a service technician must be able to detect the fault and replace the faulty circuit at the next convenient time, if replacement is necessary to restore the full computation power.

Not everyone who doesn't share your crude opinions is "stupid" or "trash". They may just have different opinions or motivations beyond the ones you understand.


Wojtek P
Poland

Posts 1597
26 Apr 2011 20:39


Jakob Eriksson wrote:

  Wojtek, hiding is not always bad. If the redundancy is at a higher level (a web application backend, for instance), then it could just be a more economical way to get the last out of a piece of hardware before it dies.
 


Hiding is always bad. The ability to recover from failures is always good. That ability in hardware is even better.

Hiding = the user is not signalled of a problem.

I am 99.9% sure this invention is only there to be able to hide things from the user.


Wojtek P
Poland

Posts 1597
26 Apr 2011 20:40


Thomas Richter wrote:

  Where does it say that it "hides" anything? The important fact is that the system can continue running. Of course, a service technician must be able to detect the fault and replace the faulty circuit at the next convenient time, if replacement is necessary to restore the full computation power.
 
  Not everyone who doesn't share your crude opinions is "stupid" or "trash". They may just have different opinions or motivations beyond the ones you understand.
 

I already wrote about the difference between true "failure recovery" - like on IBM mainframes (but not only there) and this "invention". If you missed this please reread my posts.


Steve Thomas
United Kingdom

Posts 178
27 Apr 2011 10:40


Developments like this could extend the working life of chips. Imagine a 16-core chip is produced and 4 cores are dead. They sell it as an 8-core chip (who really wants to pay the premium on a 12-core chip?), which leaves 4 cores as spares to replace faulty cores later in the chip's life. Their chips could get a reputation for being more reliable and longer lasting than the competition.
Obviously the chip will have a "core manager" that restricts the number of cores allowed to work to 8 and can switch cores when one fails.
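
A rough sketch of what such a "core manager" might do, using the numbers from this post (16 cores, 4 dead at manufacture, sold as 8). The class name `CoreManager`, its methods and the policy of promoting the lowest-numbered spare are assumptions for illustration; the linked article does not describe an interface like this.

```python
from typing import Iterable, List, Optional


class CoreManager:
    """Toy model: expose a fixed number of cores, keep the rest as spares."""

    def __init__(self, total: int, dead: Iterable[int], advertised: int) -> None:
        dead_set = set(dead)
        healthy = [c for c in range(total) if c not in dead_set]
        if len(healthy) < advertised:
            raise ValueError("not enough working cores for this product")
        self.advertised = advertised
        self.active: List[int] = healthy[:advertised]   # cores the OS may use
        self.spares: List[int] = healthy[advertised:]   # held in reserve

    def report_failure(self, core: int) -> Optional[int]:
        """Retire a failed core; return the spare that replaced it, if any."""
        if core in self.spares:
            self.spares.remove(core)      # a spare died before it was needed
            return None
        if core not in self.active:
            raise ValueError(f"core {core} is not in use")
        self.active.remove(core)
        if not self.spares:
            return None                   # capacity now drops below advertised
        replacement = self.spares.pop(0)
        self.active.append(replacement)
        return replacement


mgr = CoreManager(total=16, dead={3, 7, 11, 15}, advertised=8)
print(mgr.active, mgr.spares)
print(mgr.report_failure(2), len(mgr.active))
```

With these numbers the chip can absorb four failures among its active cores before the usable count drops below the advertised 8.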

Vidar Hokstad
United Kingdom

Posts 70
27 Apr 2011 12:58


Wojtek P wrote:

  But this invention is to HIDE it from the customer.
  Which is wrong, except for trash-consumers who don't even use that processing power and don't even know how to check how much is available.
 

The article doesn't support your claim. It only states that it'll disable damaged cores, it does not say whether or not the user will have access to this information. You're jumping to conclusions you have no basis for.

As for who will use it, consumers are not major purchasers of CPUs with a high number of cores, as very little consumer software takes advantage of them.

People who want fast servers are. I'm taking delivery of our first 32 core machine today. We could've gone for 48 core but it wasn't really cost effective yet (it'd mean going for 12 core AMD chips, and they're not cheap). It takes up less space in terms of volume than an average consumer level desktop PC.

The larger the number of cores, the higher the cost of keeping stability up. If I could for example get a 48 core version for what I'm paying for the 32 core now rather than the extortionate prices I'd actually have to pay, I'd jump at the opportunity even if a core would fail every 6 months as long as the hardware would detect it and keep running with less capacity.

It's typically cost effective for us to replace hardware every 3 years or so (because of the cost of space and power and operational costs of managing the hardware vs. improving performance of new servers) so we'd end up with more capacity for the same money, even if the total capacity would drop off over time.

EVEN if you're right and this invention won't support automatic recovery (and I don't think so - it explicitly describes allocating the chip's workload to available cores, which would seem to at least imply that it's being designed to keep executing if a chip fails), it'd *still* be far better for us to take a crash and a reboot on a core failure than having to yank a server for repairs (and no, z/Architecture is not an option - far too expensive for what we do).
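
The capacity argument in this post (48 cores losing one every 6 months, against a stable 32 cores, over a 3-year replacement cycle) can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: the function name `core_months` and the assumption that each failure lands exactly at the end of a 6-month period are illustrative, and price is assumed equal as in the post.

```python
from typing import Optional


def core_months(cores: int, lifetime_months: int,
                months_per_failure: Optional[int] = None) -> int:
    """Total core-months delivered over the machine's service life."""
    total = 0
    alive = cores
    for month in range(1, lifetime_months + 1):
        total += alive
        # Assumption: one core dies at the end of every failure interval.
        if months_per_failure and month % months_per_failure == 0 and alive > 0:
            alive -= 1
    return total


degrading = core_months(48, 36, months_per_failure=6)
stable = core_months(32, 36)
print(degrading, stable, round(degrading / stable, 2))
```

Under these assumptions the degrading 48-core machine delivers 1638 core-months against 1152 for the stable 32-core machine, about 42% more, and it still has 43 working cores during its last six months.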


