Nate just linked me to this post interviewing an inside source at Microsoft about the causes of the RROD. Now that I’m involved in hardware manufacturing of consumer devices, it’s a fascinating case study of what not to do, so I’m paying attention and taking notes.
A while back I posted that I was looking for an RROD Xbox 360; I actually sent it off to MEFAS to get digested for solder joint inspection on the GPU through a process called “dye and pry”. In this process, the motherboard is flooded with red ink, and then the GPU is mechanically pried off the board. The red ink flows into any of the tiny cracks in the solder balls, and at least in theory, when you pry the GPU off, the cracked regions will shear first, so you will be left with visible red spots at the points of failure.
The findings were interesting. Below is what a normal ball looks like after the test:
And here is one of several balls on the GPU that exhibited signs of partial failure:
There was also some “voiding” seen in the balls, i.e. trapped gas bubbles inside the solder balls that might serve as starting points for mechanical failure. Some voiding is expected, and there’s not a lot of data I can find correlating failure with voiding, but I could imagine that in a stressful mechanical environment these things don’t help.
I was a bit puzzled by these results because you didn’t see any “catastrophic” failure — pools of red ink over a connection interface — just partial cracking. Partial cracking isn’t terribly uncommon, and many products work quite well despite such artifacts. However, after reading the article linked above, if Microsoft shorted safety margins around many of the design parameters to get the product out on time, it makes sense that the summation of many partial failures could lead to a total system failure — failures that have symptoms that vaguely cluster together but are difficult to point to any single root cause. Heisenbugs. Yuck.
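To get a feel for how quickly thin margins compound, here is a back-of-the-envelope sketch. It assumes every joint fails independently and with the same probability, and the joint count and per-joint figures below are invented for illustration; none of them come from the dye-and-pry results or from Microsoft.

```python
def system_survival(per_joint_survival: float, joints: int) -> float:
    """Probability that every joint survives, assuming independent failures."""
    return per_joint_survival ** joints


def survival_table(joints: int, per_joint_values):
    """Pair each per-joint survival figure with the whole-system figure."""
    return [(p, system_survival(p, joints)) for p in per_joint_values]


if __name__ == "__main__":
    JOINTS = 1000  # hypothetical ball count, not the real GPU package
    # Hypothetical per-joint survival probabilities over some service period.
    for p, s in survival_table(JOINTS, [0.99999, 0.9999, 0.999]):
        print(f"per-joint survival {p:.5f} -> system survival {s:.3f}")
```

Under those assumptions, a hypothetical 1,000-joint package drops from roughly 99% to roughly 37% survival as the per-joint figure slips from 99.999% to 99.9%, even though no single joint looks bad on its own.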
Complex systems are a bitch to get right — and reliable. I think about that every time I step onto an airplane, or when I read about the space program. Respect to the engineers at Boeing and NASA!
I am starting to believe the larger issue lies in how the CPU and RAM are mounted on the wafer that is then mounted via BGA to the board.
In my own tests on faulty systems I have found that the core of the GPU and the graphics RAM could be removed from the wafer quite easily under relatively low temperatures.
I can send pictures if you like.
Also, I have noticed that under partial failures (still functional but intermittent) these units can generate a great deal more heat than they do after they have been properly reflowed, which indicates to me that the problems will just snowball once things start to go bad.
And if a small number of VDD or VSS balls fail, all sorts of failures are possible. I once tracked down a failure on an ASIC, proving that one bit column of an internal SRAM was getting flipped, but the problem showed up in just one proto (of a few hundred). Never got any further, but I was left wondering if it was power rail noise, perhaps due to failed C4 or BGA balls.
[…] TOP: Is what a normal solder ball looks like after the test BOTTOM: One of several balls on the GPU that exhibited signs of partial failure As you can see, by these results, you don’t see any “catastrophic” failure, which would be pools of red ink over a connection interface, here we see it’s just partial cracking. Partial cracking isn’t terribly uncommon, and many products work quite well despite such artifacts. However, after reading the [SeattlePI RRoD] article, if Microsoft shorted safety margins around many of the design parameters to get the product out on time, it makes sense that the summation of many partial failures could lead to a total system failure, and those are failures that have symptoms that vaguely cluster together, but are difficult to point to any single root cause; See Heisenbugs. […]
Thanks for publishing your test results. I noticed that some solder balls pulled off the board while others stuck with the GPU. So perhaps the failures are visible at the other end, hidden by the solder ball? This wouldn’t make sense intuitively since you’d think the weakest side (GPU or board) would be where the ball separates when you pry. Still, wondering if there’s more behind this…
Just thought I’d throw my two cents in: my own personal bout with the RRoD started last June or so. I play my 360 in the garage, and there were a few days that hit 100+ degrees, and here are my friends, leaving the 360 on to simply PLAY MUSIC outside in hot, HOT weather. Well, first there were a few iffy boot-ups, where it worked after 4 tries or so, then after a day or so of that it went full-on RRoD. Time passes; I am loath to try any heat gun tricks or whatnot when supposedly MS is covering my console under warranty. I even had them send me a box, but then I heard that if your console was opened, blah blah, they won’t fix it, and my case was very OBVIOUSLY tampered with. So, with nothing better to do, I decided to give the “eraser” trick a try after I ran across this one guy’s site who actually gave me a feasible explanation as to why it might work, not to mention he said that he got the idea because he started thinking about some BGA video card he has which has a common problem that (I guess) a lot of people ran into, and what he said was that for some reason the BGA chips (on the video card) required PRESSURE on them for them to sit right. So, looking at the 360’s chips and seeing the SAME ones as on his video card, he tried the eraser thing out. So, not having anything to lose, I figured I’d give it a shot and LO AND BEHOLD it freaking worked. Of course, I never really imagined that the 360 would last very long, as it’s supposed to be a temp fix according to most people’s experience.
So I agree that it’s the BGA chips and how they’re mounted, as apparently this is the alleged (sp?) reason putting PRESSURE on the chips makes them work.
Oh, and my 360 finally did die, from condensation though, from me playing it outside (doh). Ended up getting a Halo Special Edition with the HDMI and the newer power supply, and when I opened it up what did I notice but that there had been an addition of a whole other heat-sink assembly type thing which looks like it’s for those damn BGA chips. Probably applies pressure to them as well, and paddles their bottoms when they’re bad. Or something. ;)
Nate, which side the solder ball sticks to seems to be random to me. I’ve done a few dye and prys and “perfectly healthy” chips will exhibit an arbitrary pattern of which half of the solder junction is the weakest. I think it depends partially upon things like how much copper there is for the ball to adhere to. For example, if you have a flooded ground plane on the PCB, it will adhere the ball better than just a pad surrounded by an air gap.
It may also depend upon subtle things like where the voids are in the ball–a large portion of the balls will have tiny voids in them from the volatilization of the flux, and their distribution may play a factor in which side breaks first, the board side or the package side. Short answer is, I don’t know of any correlation between failure and which side the ball sticks to, and there are a lot of possible explanations for why the coin might land on one side or the other.
Xmonster–I’m not quite sure what you’re referring to when you say “wafer” but I assume you’re referring to the silicon flip chip on the package. That, too, is a point of failure, and you can have delamination and solder ball cracking there as well. For various reasons it’s typically less likely that’s the cause, but under extreme heat conditions I would agree that everything is suspect.
“failures that have symptoms that vaguely cluster together but are difficult to point to any single root cause. Heisenbugs.”
I’m sorry, but this is not the definition of a Heisenbug. A Heisenbug is a bug that goes away when you try to analyze/debug it. Just being difficult to pinpoint is not enough to call it a Heisenbug. Please edit your post to correct this error. Thank you.
shut your pie hole tabicat you fool.
Please correct your retardiness. Thank you
I’d like to add the phone company (at least the guys who make the gear) to NASA and Boeing. Anybody who works with the Internet knows it breaks all the time. Individual hosts, even whole neighborhoods can just disappear for minutes or hours. Of course, there’s always a reason: weather, rain, chipmunks, power, whatever.
But it amazes me that every day, for as long as I can remember, when I pick up the phone, there’s a dial tone. Power’s out, call the power company. Heat’s out, call the gas company. Internet is out, call the cable company. No matter what else is broken, the damn phone still works. Kudos to the guys who make phone switches. This is why I haven’t ditched my POTS landline, even though I mostly use my cellphone.
PS. Of course, I click the submit button, and wouldn’t you know it, but MY INTERNET WENT DOWN. It took 2.5 hours to come back. Thank you, Comcast.
tabicat, in hardware, these vague clusters of failures typically manifest as a heisenbug. When you attempt to debug the problem, you often start by attacking just one possible cause of failure. You put your scope probe on it, and because you’re probing slight parametric shifts, the system starts working better all of a sudden (or worse, but the point is you think you “found it”). So you fix the bug and it seems to go away, but because the actual root cause is due to multiple, linked parametric shifts, the symptoms eventually come back. Oftentimes “fixing” one bug can in fact aggravate other parametric shifts over time, so the symptomatic band-aid can actually worsen the root cause of the problem. But that’s not the point of the post, so I don’t go into an analysis of why this is a heisenbug; however, it is a fair point for discussion in the comment area.
To make the example more concrete, suppose you are Microsoft and trying to debug the problem. You know there is an issue with the GPU, so you remove the GPU and put a new one back on. The process of doing that destroys information about the bad solder joints, so in the process of trying to do the “obvious thing” to debug/analyze it you’ve also fixed the bug, at least in the short term. Even if you did suspect the solder joints, the process of a “dye and pry” destroys the system integrity (obviously, you can’t boot the system again once you’ve pulled off the GPU), so inconclusive results like the ones I had about partial cracks in the BGA are insufficient to declare a root cause. Lack of margin in the silicon, system design, manufacturing and mechanical integration all come together in this case to create a system that has a net high rate of failure, but only a vague clustering of symptoms with no single root cause, and direct attempts to characterize any single root cause affect or destroy information about other causes.
This is fairly similar to the original software-derived notion of a heisenbug where inserting a printf or attaching gdb fixes the bug, again due to shifts in linked phenomena (variations in the way the stack is organized or the timing of execution around a race condition), making it very difficult to diagnose.
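A toy version of the software case, for the curious. This is a deterministic simulation rather than real threads: the round-robin scheduler, the worker generator and the probe flag are all made up for the example, with the probe standing in for a printf that shifts the timing.

```python
def worker(shared, probe=False):
    """One simulated thread doing a non-atomic increment of shared['counter'].

    Each yield hands control back to the scheduler, like a context switch.
    """
    if probe:
        # Stand-in for a printf or debugger attach: it changes nothing in the
        # data, but it consumes one scheduling step before the read happens.
        yield "probe"
    local = shared["counter"]       # read
    yield "read"
    shared["counter"] = local + 1   # write back a possibly stale value
    yield "write"


def run(probe_second_worker: bool) -> int:
    """Run two workers round-robin and return the final counter value."""
    shared = {"counter": 0}
    tasks = [worker(shared), worker(shared, probe=probe_second_worker)]
    while tasks:
        for task in list(tasks):    # iterate over a copy while removing
            try:
                next(task)
            except StopIteration:
                tasks.remove(task)
    return shared["counter"]


if __name__ == "__main__":
    print("without probe:", run(False))  # lost update, prints 1
    print("with probe:   ", run(True))   # bug hidden, prints 2
```

With no probe, both workers read 0 before either writes, so one increment is lost and the counter ends at 1. Adding the probe step to the second worker delays its read until after the first worker’s write, and the counter ends at the correct 2: looking at the bug makes it go away.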
Marc — indeed, that’s the true value of a “landline”. I regret that I did away with mine, but instead I’ve made up for it with redundancy in numbers — I’ve got two separate ways to access the internet (cable modem & EVDO) plus a cell phone (on a different network, GSM, so it is truly redundant), so it’s rare that all three go down at once.
That’s interesting, re: the ink-flood process. Wouldn’t it be better to use x-rays for that? I talked my dentist into imaging some chip-scale packages I’d (successfully) hand-soldered, and while the resolution wasn’t great, you can still tell where the solder went if you look closely.
(For what it’s worth, the best of those X-ray shots are up at http://www.ke5fx.com/hpll.htm , about halfway down the page.)
Given a bit more resolution and some additional time to tweak the exposure levels, I think I’d be able to tell a lot about the connection quality underneath a BGA package.
I have had lots of first-hand experience with the X-ray and dye-and-pry diagnostics, and don’t have a lot of faith in the tests. They are often used by manufacturers to defend their manufacturing quality, but can also obscure the real source of defects (solder whiskers, loose solder micro-balls, …). The soldering defects can often be localized to a part that is particularly hard to get down properly (fine pitch BGAs, odd thermal characteristics of the solder lands, …). I have seen dye-and-pry tests “pass” on an easy-to-solder part when the neighboring, challenging part on the board was the real culprit. X-ray imaging can often miss defects too: you have to look at precisely the “right” level in a 5DX scan.
The general observation of “apply pressure to a part and it is fixed” that people report can also be very deceptive. It is a good example of the Heisenberg effect. I have witnessed multiple situations in complex electronics assemblies where this was tried (“it is so simple, push down on the part”) and the symptoms vanished. The root cause turned out to be circuits outside of the “pressurized part”. One example was a defective reference voltage circuit connected to the “pressurized part”. The symptoms vanished with applied force because the capacitance/impedance of the component inputs changed enough with the applied force, and this altered the behavior of the external reference circuits. Root cause identification made the guy who had all the C-clamps attached to the board feel a little foolish.
It is not clear to me the MS development team has the depth of experience needed for complex electronics. They are a software company after all….
[…] Somehow I’m taking over the internet without even trying. First there was that picture of me on the Wired blog. Then the other day my Xbox 360 I bought on launch day was laid bare on bunnie’s blog. Just now I saw my serial LCD, that fbz and I had been playing with over the holidays, on We Make Money Not Art. Frankly, I think it’s just an indication that I’m not publishing enough on my own blog… […]
Just thought I’d let you know we lost our 4th xbox last week…the drive seemed to get really noisy and crunchy and then kapow! The ring of death. Traded up for a black xbox elite…apparently they have a better track record. Time will tell.
Maybe one aspect is also the tin solder. I am not sure, but is the Xbox manufactured with lead-free solder? I solder every day and must say that lead-free solder is not very good to handle. Maybe the manufacturer has problems with it.
Hello, excuse me, my English is not good… I found your blog through a Google search and was interested in your texts. Could I translate them into Russian for publication in our company’s small edition? I would be grateful to you. Thanks.
Nice blog! Thank you :)
[…] To start his research, [Chris] purchased an XR400 RFID reader off of eBay. This is an industrial reader with four antenna ports and Windows CE. He got a great deal… because it didn’t work. He guessed that the ball grid array (BGA) solder joints had cracked. Putting enough pressure on the chips allowed the device to boot. He repaired the board using a heat gun to reflow the solder. He referenced this video of an Xbox 360 being repaired with the same technique. [bunnie] has a post from last year investigating Xbox 360 RRODs and possible BGA failures. […]
Interesting article. Where did you get all the information from… :)
[…] From day one, the Xbox 360 has been plagued by hardware failures. So many failures that Microsoft ended up pushing the 90 day warranty up to a full year. Less than a year later they acknowledge the systemic RROD problem and extended replacement for affected consoles to three years. The RROD is named because of the three red lights displayed when the console failed. The culprit appears to be poor cooling of the console’s components. Components like the GPU would overheat causing solder joints to fail. People were able to repair their own consoles by reflowing with a heatgun. Microsoft has never officially disclosed why these systems fail. Our console purchased on launch day RROD’d, but [bunnie]’s solder joint inspection of it proved inconclusive. Every Xbox owner on Joystiq’s staff has had an RROD. […]
A picture’s worth 1k words. Take a look at the heart of 99.99% of all Xbox 360 problems:
http://www.allxboxrepair.com/allxboxrepair1.html
Yeah I can see how those solder balls full of voids and cracks could fail with enough thermal cycling, especially if they’re lead free.
Well, I found out some interesting things regarding the GPU. I got the pinouts for the ATI/Microsoft GPU and found a cluster of pins that seem to be the first to fatigue open. Also, as the resistance increases so does the heat, a little bit of a thermal runaway. The topology of a non-tampered 360 PC board, from what I could measure, crowns in the middle of the GPU and toes up on all four corners when at 68 °C and above, and almost reverses after cool-down. So there’s a cluster of pins in a ring just inside the edges that seem to have the greatest mechanical stress, but this has only appeared on one test board so it’s still inconclusive.
I had to build a reflow machine :( I would much rather have bought one.
Rob
all xbox repair
Once I made similar experiment. It was curious to esteem this article
I’m not 100% sure on what to do, but do you still have a warranty?
I think my brain just exploded when reading this post. Awesome work.
How do you reflow with a heat gun? I know this is how I removed parts from an SMT circuit board. I would guess it has to be done just right or you will ruin the board.
Do you have any idea whether the slimline Xbox will have the same problem?
I’ve bookmarked this site because of the useful information
I’m a spammer, I barely speak any english but I came here to post my link, just so that I don’t get any benefit because I’m too stupid to notice the nofollow that wordpress adds to all the links.
P.S.: Nice work, love your site. Any thoughts on actual cures for this malaise with lead-free solder? I’ve been testing new OLPCs with people from the foundation, but they’re having some problems related to cold solder joints, which is strange on a laptop that isn’t particularly subject to bending on the CPU side, nor particularly hot, especially when compared to the early Xboxes.
Just came to this blog for a little info. My mate just lost his xbox system to a hard drive fault. His favourite game was COD and he spent hours playing. Now that it’s gone his wife will not let him buy a new one. Does anyone have a cheap second hand unit lying around?