To clarify and contextualize a bit what you're saying: The one big obstacle in c...

paulmd · on April 4, 2022

> changes in the way chips are designed can make them resilient to defects, so you no longer need to trash that chip on your wafer that's affected by such a defect,

no, it's basically "chiplets but you don't cut the chiplets apart". You design the chiplets to be nodes in a mesh interconnect, and failed chiplets can simply be disabled entirely and then routed around. But they're still "chiplets" that have their own functionality and provide a coarser conceptual block than a core itself and thus simplify some of the rest of the chip design (communications/interconnect, etc).
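To make the "disable and route around" idea concrete, here is a minimal sketch, assuming a plain 2D mesh where each tile talks only to its four neighbours. The function name `route` and the grid layout are made up for illustration; a real on-wafer fabric would have its own routing scheme.

```python
from collections import deque

def route(width, height, disabled, src, dst):
    """Shortest hop path from src to dst on a 2D mesh, skipping disabled tiles.

    Tiles are (x, y) tuples; `disabled` is a set of tiles fused off after test.
    Returns the list of tiles on the path, or None if dst is unreachable.
    """
    if src in disabled or dst in disabled:
        return None
    prev = {src: None}          # also serves as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and nxt not in disabled and nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None

# Hypothetical 3x3 mesh with two dead tiles in the middle column.
print(route(3, 3, {(1, 0), (1, 1)}, (0, 0), (2, 0)))
```

The detour costs extra hops, which is the "as long as it doesn't adversely affect program characteristics too much" caveat in practice.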

note that technically (if you don't mind the complexity) there's nothing wrong with harvesting at multiple levels like this! You could have "this chiplet has 8 cores, that one has 6, that one failed entirely and is disabled" and as long as it doesn't adversely affect program characteristics too much (data load piling up or whatever) that can be fine too.

however, there's nothing about "changes in the way the chips are designed that makes them more resilient to defects", you still get the same failure rates per chiplet, and will still get the same amount of failed (or partially failed) chiplets per wafer, but instead of cutting out the good ones and then repackaging, you just leave them all together and "route around the bad ones".
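A rough way to see this, assuming the textbook Poisson yield model (probability of zero defects in area A is exp(-A * D0)). The tile area, tile count and defect density below are invented round numbers, not figures for any real process or product.

```python
import math

def poisson_yield(area_cm2, defects_per_cm2):
    """Probability that a region of the given area has zero defects (Poisson model)."""
    return math.exp(-area_cm2 * defects_per_cm2)

# Illustrative numbers only, not taken from any real process or product.
TILE_AREA_CM2 = 1.0
TILES_PER_WAFER = 84
DEFECTS_PER_CM2 = 0.1

# Per-tile yield is the same whether the tiles are diced apart or left on the wafer.
tile_yield = poisson_yield(TILE_AREA_CM2, DEFECTS_PER_CM2)
expected_good_tiles = TILES_PER_WAFER * tile_yield

# Chance that every tile on the wafer is good, i.e. one giant chip with no redundancy.
all_tiles_good = poisson_yield(TILE_AREA_CM2 * TILES_PER_WAFER, DEFECTS_PER_CM2)

print(f"per-tile yield:       {tile_yield:.3f}")
print(f"expected good tiles:  {expected_good_tiles:.1f} of {TILES_PER_WAFER}")
print(f"whole wafer flawless: {all_tiles_good:.6f}")
```

Under these assumptions the defect statistics don't change at all; what changes is that a wafer with a handful of dead tiles is still a usable part.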

The advantage is that MCM-style chiplet/interposer packaging actually makes data movement much more expensive, because you have to run a more powerful interconnect, whereas this isn't moving anything "off-chip", so you avoid a lot of that power cost. There are other technologies like EMIB and copper-copper bonding that potentially can lessen those costs for chiplets of course.

What Intel is looking at doing with "tiles" in their future architectures, with chiplets connected by EMIB at the edges (especially if they use copper-copper bonding), is sort of a half-step in engineering terms here. But I think there are still engineering benefits (and downsides of course) to doing it as a single wafer rather than hopping through the bridge, even with a really good copper-copper bond. Actual full-on MCM/interposer packaging is a step worse than cu-cu bonding and requires more energy, but even cu-cu bonding is not perfect and thus not as good as just "on-chip" routing. So WSI is designed to get everything "on-chip" but without the yield problems of just a single giant chip.

Brian_K_White · on April 8, 2022

> there's nothing about "changes in the way the chips are designed that makes them more resilient to defects"

Of course there is. Scarequotes invalid.

Ever since the first PAL/GAL/CPLD/FPGA-like device, the essence of malleable silicon has existed, and has been incorporated into things more and more over time.

It's 40 year old stuff by now.

AceJohnny2 · on April 4, 2022

I'll add that many DRAM chips already do something like this, but ironically enough the re-routing mechanism adds complexity which is itself a source of problems, (be it manufacturing or design, such as broken timing promises)

Also, NAND Flash storage (SSD) is designed around the very concept of re-routing around bad blocks, because the very technology means they have a wear-life.
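As a toy illustration of the bad-block idea (not how any particular SSD controller works), assume a logical-to-physical table plus a small pool of spare blocks. The class and method names are made up; a real flash translation layer also does wear levelling, garbage collection and much more.

```python
class BlockMap:
    """Toy logical-to-physical block map with a pool of spare blocks."""

    def __init__(self, n_logical, n_spare):
        # Initially logical block i lives in physical block i.
        self.l2p = list(range(n_logical))
        # Spare physical blocks sit after the regular ones.
        self.spares = list(range(n_logical, n_logical + n_spare))

    def retire(self, logical):
        """Remap a logical block whose physical block has worn out or failed."""
        if not self.spares:
            raise RuntimeError("no spare blocks left")
        self.l2p[logical] = self.spares.pop(0)

    def physical(self, logical):
        return self.l2p[logical]

m = BlockMap(n_logical=4, n_spare=2)
m.retire(1)                 # physical block 1 went bad
print(m.physical(1))        # logical block 1 now lives in a spare
```

The host keeps using logical block 1 and never sees that the physical block underneath it changed.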

Dylan16807 · on April 4, 2022

> I'll add that many DRAM chips already do something like this, but ironically enough the re-routing mechanism adds complexity which is itself a source of problems, (be it manufacturing or design, such as broken timing promises)

The best-performing solution there is probably software. Tell the OS about bad blocks and keep the hardware simple.

nine_k · on April 4, 2022

I think this is already implemented both in Linux and in Windows; you can tell the OS which RAM ranges are defective.
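For the Linux side, one mechanism I'm aware of is the `memmap=nn$ss` kernel parameter, which marks `nn` bytes starting at address `ss` as reserved. Below is a small sketch that turns a list of failing addresses (say, from a memtest run) into such arguments. The helper name, the 4 KiB page size and the sample addresses are assumptions for illustration; check your kernel's documentation before relying on it.

```python
PAGE = 4096  # assumed page size in bytes

def memmap_args(bad_addresses, page=PAGE):
    """Turn failing physical addresses into Linux-style memmap=SIZE$START arguments.

    Each address is widened to its whole page, and adjacent bad pages are merged.
    """
    pages = sorted({addr // page for addr in bad_addresses})
    ranges = []  # list of [first_page, end_page) pairs
    for p in pages:
        if ranges and p == ranges[-1][1]:
            ranges[-1][1] = p + 1      # extend the current run
        else:
            ranges.append([p, p + 1])  # start a new run
    return [f"memmap={(end - start) * page:#x}${start * page:#x}"
            for start, end in ranges]

# Hypothetical failing addresses.
for arg in memmap_args([0x12345678, 0x12346000, 0x20000000]):
    print(arg)
```

Note that the `$` typically needs escaping when the argument goes through a bootloader config file.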

Doing this from the chip side is not there yet, apparently. I wonder when this will be included in the DRAM feature list, if ever. I suspect that detecting defects from the RAM side is not trivial.

Dylan16807 · on April 4, 2022

> I suspect that detecting defects from the RAM side is not trivial.

Factory testing or a basic self-test mode could easily find any parts that are flat-out broken. And as internal ECC rolls out as a standard feature, that could help find weaker rows over time.

CamouflagedKiwi · on April 5, 2022

Yep, my last PC developed a defect in one of the RAM modules. Finding it using memtest86 was trivial; easier than figuring out exactly how to tell Windows what to do about it...

Of course it did take a little bit of a hunch to go from "the game I'm playing crashes at this point" to "maybe my RAM is defective". I suppose ECC would help spot this.

EricE · on April 6, 2022

Spot and correct it too :)
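To show what "spot and correct" means mechanically, here is a toy Hamming(7,4) code: 4 data bits, 3 parity bits, and any single flipped bit can be located and fixed. This is only an illustration of the principle; real ECC memory uses wider codes over larger words.

```python
def encode(d):
    """4 data bits -> 7-bit codeword laid out as [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def decode(c):
    """Return (data bits, error position). Position 0 means no error was seen.

    A single flipped bit is corrected; its 1-based position is the syndrome.
    """
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3
    if pos:
        c[pos - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], pos

word = encode([1, 0, 1, 1])
word[4] ^= 1                      # simulate a single-bit fault
data, where = decode(word)
print(data, where)
```

The nonzero error position is the "spot" part, and logging it over time is one way weak cells could be noticed.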

candiddevmike · on April 4, 2022

Wikipedia link on microlithography if you want a rabbit hole about wafer making:

https://wikipedia.org/wiki/Microlithography

Being able to print something in nanometers is an overlooked technical achievement for human manufacturing.

adhesive_wombat · on April 4, 2022

If that rabbit hole appeals, the ITRS[1] reports (now called IRDS[2]) are a very good mid-level, year-by-year summary of the state of the art in chipmaking, including upcoming challenges and future directions.

> Being able to print something in nanometers is an overlooked technical achievement for human manufacturing.

IMO, a semiconductor fab probably is the highest human achievement in terms of process engineering. Not only do you "print" nanometric devices, you do it continuously, in a multi-month pipelined system and sell the results for as little as under a penny (micros, and even the biggest baddest CPUs are "only" a thousand pounds, far less than any other item with literally a billion functional designed features on it).

[1]: https://en.wikipedia.org/wiki/International_Technology_Roadm...

[2]: https://en.wikipedia.org/wiki/International_Roadmap_for_Devi...

RugnirViking · on April 5, 2022

is it the same reports accessed by the process in https://irds.ieee.org/home/how-to-download-irds ?

adhesive_wombat · on April 5, 2022

For IRDS, yes.

hadlock · on April 14, 2022

This is briefly touched upon in Neal Stephenson's "Seveneves": after The Cataclysm happens, humanity regrows most technology, but for the most part ICs are limited to 8086-level devices, as even thousands of years later they are unable to reach the level of semiconductor technology before it. The amount of technology and power in something as small as a cell phone or lowly Raspberry Pi is nothing short of a marvel, particularly at their price points.

dylan42 · on April 4, 2022

> change in the way chips are designed can make them resilient to defects

This is already happening for almost all modern chips manufactured in the last 10+ years. DRAM chips have extra rows/cols. Even Intel CPUs have redundant cache lines, internal bus lines and other redundant critical parts, which are burned-in during initial chip testing.

zitterbewegung · on April 4, 2022

The Cerebras WSE design has the ability on each wafer to disable / efuse a portion of itself to account for defects. This is what you can do if you control the wafer.

ip26 · on April 4, 2022

The other one big obstacle is chips are square while wafers are round.

paulmd · on April 4, 2022

it depends on the exact shape of your mask of course, but typically losses around the edges are in the 2-3% range.

It's not really possible to fix this either since wafers need to be round for various manufacturing processes (spinning the wafer for coating or washing stages) and round obviously isn't a dense packing of the mask itself. It just kinda is how it is, square mask and round wafer means you lose a bit off the edges, fact of life.

ip26 · on April 5, 2022

The interesting part is loss scales with the die size.
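A quick way to see the scaling, under simplifying assumptions: dies sit on a grid aligned to the wafer centre, a die counts only if all four corners are on the wafer, and scribe lines, edge exclusion and grid-offset tuning are ignored. The function names and die sizes are just for illustration, so the absolute percentages shouldn't be read as real-world figures.

```python
import math

def dies_per_wafer(wafer_diameter_mm, die_w_mm, die_h_mm):
    """Count grid-aligned dies whose four corners all fall inside the wafer."""
    r = wafer_diameter_mm / 2
    nx = int(r // die_w_mm) + 1
    ny = int(r // die_h_mm) + 1
    count = 0
    for i in range(-nx, nx):
        for j in range(-ny, ny):
            x0, y0 = i * die_w_mm, j * die_h_mm
            corners = [(x0, y0), (x0 + die_w_mm, y0),
                       (x0, y0 + die_h_mm), (x0 + die_w_mm, y0 + die_h_mm)]
            if all(math.hypot(x, y) <= r for x, y in corners):
                count += 1
    return count

def edge_loss(wafer_diameter_mm, die_w_mm, die_h_mm):
    """Fraction of wafer area not covered by whole dies."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    used = dies_per_wafer(wafer_diameter_mm, die_w_mm, die_h_mm) * die_w_mm * die_h_mm
    return 1 - used / wafer_area

for side in (5, 10, 20):
    print(f"{side} mm square die: {edge_loss(300, side, side):.1%} of a 300 mm wafer unused")
```

Bigger dies leave a coarser staircase around the rim, so the wasted fraction grows with die size.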