There is a lot of righteous anger directed toward Intel over CPU bugs that were revealed by Spectre/Meltdown. I agree that things could have been handled better, particularly with regards to transparency and the sharing of information among the relevant user communities that could have worked together to deploy effective patches in a timely fashion. People also aren’t wrong that consumer protection laws obligate manufacturers to honor warranties, particularly when a product is not fit for use as represented, if it contains defective material or workmanship, or fails to meet regulatory compliance.
However, as an open source hardware optimist, and someone who someday aspires to see more open source silicon on the market, I want to highlight that demanding Intel return, exchange, or offer rebates on CPUs purchased within a reasonable warranty period is entirely at odds with demands that Intel act with greater transparency in sharing bugs and source code.
Transparency is Easy When There’s No Penalty for Bugs
It’s taken as motherhood and apple pie in the open source software community that transparency leads to better products. The more eyes staring at a code base, the more bugs that can be found and patched. However, a crucial difference between open source software and hardware is that open source software carries absolutely no warranty. Even the most minimal, stripped down OSS licenses stipulate that contributors carry no liability. For example, the BSD 2-clause license has 189 words, of which 116 (60%) are dedicated to a “no warranty” clause – and all in caps, in case you weren’t paying attention. The no-warranty clause is so core to any open source license it doesn’t even count as a clause in the 2-clause license.
Of course contributors have no liability: this lack of liability is fundamental to open source. If people could sue you for some crappy code you you pushed to github years ago, why would you share anything? Github would be a ticking time bomb of financial ruin for every developer.
It’s also not about code being easier to patch than hardware. The point is that you don’t have to patch your code, even if you could. Someone can file a bug against you, and you have the legal right to ignore it. And if your code library happens to contain an overflow bug that results in a house catching fire, you walk away scot-free because your code came with no warranty of fitness for any purpose whatsoever.
Oohh, Shiny and New!
Presented a bin of apples, most will pick a blemish-free fruit from the bushel before heading to the check-out counter. Despite the knowing the reality of nature – that every fruit must grow from a blossom under varying conditions and hardships – we believe our hard-earned money should only go toward the most perfect of the lot. This feeling is so common sense that it’s codified in the form of consumer protection laws and compulsory warranties.
This psychology extends beyond obvious blemishes, to defects that have no impact on function. Suppose you’re on the market to buy a one-slot toaster. You’re offered two options: a one-slot toaster, and a two-slot toaster but with the left slot permanently and safely disabled. Both are exactly the same price. Which one do you buy?
Most people would buy the toaster with one slot, even though the net function might be identical to the two-slot version where one slot is disabled. In fact, you’d probably be infuriated and demand your money back if you bought the one-slot toaster, but opened the box to find a two-slot toaster with one slot disabled. We don’t like the idea of being sold goods that have anything wrong with them, even if the broken piece is irrelevant to performance of the device. It’s perceived as evidence of shoddy workmanship and quality control issues.
News Flash: Complex Systems are Buggy!
Hold your breath – I’d wager that every computer you’ve bought in the past decade has broken parts inside of them, almost exactly like the two-slot toaster with one slot permanently disabled. There’s the set of features that were intended to be in your chips – and there’s the subset of series of features that finally shipped. What happened to the features that weren’t shipped? Surely, they did a final pass on the chip to remove all that “dead silicon”.
Nope – most of the time those partially or non-functional units are simply disabled. This ranges from blocks of cache RAM, to whole CPU cores, to various hardware peripherals. Patching a complex chip design can cost millions of dollars and takes weeks or even months, so no company can afford to do a final “clean-up” pass to create a “perfect” design. To wit, manufacturers never misrepresent the product to consumers – if half the cache was available, the spec sheet would simply report the cache size as 128kB instead of 256kB. But surely some customers would have complained bitterly if they knew of the defect sold to them.
Despite being chock full of bugs, vendors of desktop CPUs or mobile phone System on Chips (SoCs) rarely disclose these bugs to users – and those that do disclose almost always disclose a limited list of public bugs, backed by an NDA-only list of all the bugs. The top two reasons cited for keeping chip specs secret are competitive advantage and liability, and I suspect in reality, it’s the latter that drives the secrecy, because the crappier the chipset, the more likely the specs are under NDA. Chip vendors are deathly afraid users will find inconsistencies between the chip’s actual performance and the published specs, thus triggering a recall event. This fear may seem more rational if you consider the magnitude of Intel’s FDIV bug recall ($475 million in 1994).
This is a pretty typical list of SoC bugs, known as “errata”. If your SoC’s errata is much shorter than this, it’s more likely due to bugs not being disclosed than there actually being less bugs.
If you Want Messages, Stop Shooting the Messengers
Highly esteemed and enlightened colleagues of mine are strongly of the opinion that Intel should reimburse end users for bugs found in their silicon; yet in the same breath, they complain that Intel has not been transparent enough. The point that has become clear to me is that consumers, even open-source activists, are very sensitive to imperfections, however minor. They demand a “perfect” machine; if they spend $500 on a computer, every part inside better damn well be perfect. And so starts the vicious cycle of hardware manufacturers hiding all sorts of blemishes and shortcomings behind various NDAs, enabling them to bill their goods as perfect for use.
You can’t have it both ways: the whole point of transparency is to enable peer review, so you can find and fix bugs more quickly. But if every time a bug is found, a manufacturer had to hand $50 to every user of their product as a concession for the bug, they would quickly go out of business. This partially answers the question why we don’t see open hardware much beyond simple breakout boards and embedded controllers: it’s far too risky from a liability standpoint to openly share the documentation for complex systems under these circumstances.
To simply say, “but hardware manufacturers should ship perfect products because they are taking my money, and my code can be buggy because it’s free of charge” – is naïve. A modern OS has tens of millions of lines of code, yet it benefits from the fact that every line of code can be replicated perfectly. Contrast to a modern CPU with billions of transistors, each with slightly different electrical characteristics. We should all be more surprised that it took so long for a major hardware bug to be found, than the fact that one was ever found.
Complex systems have bugs. Any system with primitives measured in the millions or billions – be it lines of code, rivets, or transistors – is going to have subtle, if not blatant, flaws. Systems simple enough to formally verify are typically too simple to handle real-world tasks, so engineers must rely on heuristics like design rules and lots and lots of hand-written tests.
There will be bugs.
Realities of the Open Hardware Business
About a year ago, I had a heated debate with a SiFive founder about how open they can get about their documentation. SiFive markets the RISC-V CPU, billed as an “open source CPU”, and many open source enthusiasts got excited about the prospect of a fully-open SoC that could finally eliminate proprietary blobs from the boot chain and ultimately through the same process of peer review found in the open source software world, yield a more secure, trustable hardware environment.
However, even one of their most ardent open-source advocates pushed back quite hard when I suggested they should share their pre-boot code. By pre-boot code, I’m not talking about the little ROM blob that gets run after reset to set up your peripherals so you can pull your bootloader from SD card or SSD. That part was a no-brainer to share. I’m talking about the code that gets run before the architecturally guaranteed “reset vector”. A number of software developers (and alarmingly, some security experts) believe that the life of a CPU begins at the reset vector. In fact, there’s often a significant body of code that gets executed on a CPU to set things up to meet the architectural guarantees of a hard reset – bringing all the registers to their reset state, tuning clock generators, gating peripherals, and so forth. Critically, chip makers heavily rely upon this pre-boot code to also patch all kinds of embarrassing silicon bugs, and to enforce binning rules.
The gentleman with whom I was debating the disclosure of pre-boot code adamantly held that it was not commercially viable to share the pre-boot code. I didn’t understand his point until I witnessed open-source activists en masse demanding their pound of flesh for Intel’s mistakes.
As engineers, we should know better: no complex system is perfect. We’ve all shipped bugs, yet when it comes to buying our own hardware, we individually convince ourselves that perfection is a reasonable standard.
The Choice: Truthful Mistakes or Fake Perfection?
The open source community could use the Spectre/Meltdown crisis as an opportunity to reform the status quo. Instead of suing Intel for money, what if we sue Intel for documentation? If documentation and transparency have real value, then this is a chance to finally put that value in economic terms that Intel shareholders can understand. I propose a bargain somewhere along these lines: if Intel releases comprehensive microarchitectural hardware design specifications, microcode, firmware, and all software source code (e.g. for AMT/ME) so that the community can band together to hammer out any other security bugs hiding in their hardware, then Intel is absolved of any payouts related to the Spectre/Meltdown exploits.
This also sets a healthy precedent for open hardware. In broader terms, my proposed open hardware bargain is thus: Here’s the design source for my hardware product. By purchasing my product, you’ve warranted that you’ve reviewed the available design source and decided the open source elements, as-is, are fit for your application. So long as I deliver a product consistent with the design source, I’ve met my hardware warranty obligation on the open source elements.
In other words, the open-source bargain for hardware needs to be a two-way street. The bargain I set forth above:
- Rewards transparency with indemnity against yet-to-be-discovered bugs in the design source
- Burdens any residual proprietary elements with the full liability of fitness for purpose
- Simultaneously conserves a guarantee that a product is free from defects in materials and workmanship in either case
The beauty of this bargain is it gives a real economic benefit to transparency, which is exactly the kind of wedge needed to drive closed-source silicon vendors to finally share their full design documentation, with little reduction of consumer protection.
So, if we really desire a more transparent, open world in hardware: give hardware makers big and small the option to settle warranty disputes for documentation instead of cash.
Author’s Addendum (added Feb 2 14:47 SGT)
This post has 2 aspects to it:
The first is whether hardware makers will accept the offer to provide documentation in lieu of liability.
The second, and perhaps more significant, is whether you would make the offer for design documentation in lieu of design liability in the first place. It’s important that companies who choose transparency be given a measurable economic advantage over those who choose obscurity. In order for the vicious cycle of proprietary hardware to be broken, both consumer and producer have to express a willingness to value openness.