The hardest part of product development for the hardware engineer to estimate is the debug phase. It is also one of the most neglected parts of planning.
CAD tools have progressed over the years in terms of ease of use and integration with PCB and mechanical design. But ultimately, the design work is carried out by a person who is not only fallible, but may also be working with incomplete or incorrect data. Some bugs are inevitable on all but the simplest designs, and so the art of troubleshooting these bugs is all-important.
Bugs can range from something going BANG the first time power is applied to intermittent glitches reported in association with completely unrelated things like “it was raining” or “it only happens on his bench, not on mine”. Consequently, the effort needed to fix a bug ranges from a five-minute job to months of work.
Debugging can be the most fun part of electronics design when it is going well. There is a great deal of satisfaction in finding and fixing the intractable bugs. But to succeed, it is important to be systematic in the approach taken to fixing bugs.
This article lists the steps needed to bring such a systematic approach to troubleshooting hardware in product development. To illustrate these principles, I will refer back occasionally to work I performed as an embedded engineer on a system. It looked something like Figure 1 below:
Software bugs had been eliminated as the cause, so this looked like a hardware bug and I was asked to investigate.
Step 1: Picture success
An important part of debugging is having the right mental attitude, as persistent problems can grind down your morale. In particular, it feels bad going to work two days in succession with the investigation stuck at exactly the same point. In such a case, ask yourself “Will I still be working on this bug in a year’s time?” The answer: Of course not! This bug isn’t forever, it’s going to be fixed. It’s not that there’s no solution, it’s that I simply haven’t seen it yet.
Step 2: Keep notes
Resist the temptation to dive straight in and try to fix the bug immediately. It is important to determine first whether others have dealt successfully with a similar problem. Collect reports from multiple sources, even though they may sometimes have conflicting data attached. A spreadsheet can work well here to organize what you find.
Step 3: Reproduce the problem
This is often the hardest and most time-consuming part. The frequency with which bugs show themselves varies enormously. So at this point, based on the information you have collected, you need to create the conditions under which you can make the bug happen on command.
At this point diagnosis can begin. The initial bug report may be “It stopped working”, “It crashed”, or something equally vague. Keep working until you have all the information you can get from the person who reported the problem and have enough to narrow down the range of possible causes.
Don’t worry about or speculate on the cause, just focus on reproducing the bug. Be careful at this point. Sometimes, similar but different issues can appear to be caused by the same bug, with one masking the other. If other bugs are uncovered while looking for the initial bug, make a note of them and go back to them later, but don’t get side-tracked.
Step 4: Gather the evidence
Be methodical and document what you see and what happens. Don’t theorise at this point about causes, just create a table of what aggravates or alleviates the bug, as well as what has no effect. Be aware that multiple bugs can have the same symptoms, which can produce contradictory evidence.
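The table of what aggravates, alleviates, or has no effect on the bug can live in a spreadsheet or in a few lines of Python. The following is a minimal sketch rather than a prescribed tool: the `EvidenceLog` class, its method names, and the example conditions are all hypothetical, chosen only for illustration. It records observations without theories, and flags any condition logged with conflicting effects, which may hint that more than one bug is present.

```python
from collections import defaultdict

# Allowed effects of a condition on the bug (labels are illustrative).
EFFECTS = ("aggravates", "alleviates", "no effect")


class EvidenceLog:
    """Records observations only; it stores no theories about causes."""

    def __init__(self):
        self._rows = []

    def record(self, condition, effect, note=""):
        """Add one observation: a condition and its observed effect."""
        if effect not in EFFECTS:
            raise ValueError(f"effect must be one of {EFFECTS}")
        self._rows.append((condition, effect, note))

    def by_effect(self):
        """Group the recorded conditions by their observed effect."""
        grouped = defaultdict(list)
        for condition, effect, _ in self._rows:
            grouped[effect].append(condition)
        return dict(grouped)

    def contradictions(self):
        """Return conditions that were logged with more than one effect."""
        seen = defaultdict(set)
        for condition, effect, _ in self._rows:
            seen[condition].add(effect)
        return sorted(c for c, effects in seen.items() if len(effects) > 1)


# Hypothetical observations, for illustration only.
log = EvidenceLog()
log.record("cold start", "aggravates")
log.record("long cable fitted", "no effect")
log.record("cold start", "alleviates", note="second unit only")
print(log.by_effect())
print(log.contradictions())
```

A contradictory entry is not an error in the log. It is evidence in its own right, and worth keeping until Step 6.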
Step 5: Try the easy stuff first
Now that you can reproduce the problem, look for the easy explanations first. For instance, are the connectors wired back-to-front? Are the chip pin-outs as per the data sheet? Are clocks running at the correct frequency? Very often, bugs are caused by mistakes that look dumb only in hindsight.
Remember that when you did the design work, you probably had thousands of small decisions to make, and most of them were correct. If the error proves to be “obvious”, you can correct it and skip to Step 9.
Step 6: Break the problem down
So now the easy things have been checked out. You can reproduce the problem, but perhaps only occasionally, or there are conflicting messages. It is important to remember that in complex systems, multiple bugs can show the same symptoms on the surface but require different cures.
To help clarify the issue, eliminate as much of the system as possible that doesn’t appear to be relevant to the bug. For instance, you could power down devices on the PCB that are unrelated, or unplug cables to other boards. Do this while retesting. If the bug suddenly stops when an unrelated module is taken out of the equation, you have a smoking gun. Document it. Try to reproduce the bug again, with the module and then again without it.
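The elimination procedure above can be sketched as a simple loop. This is only an illustration under stated assumptions: the module names and the `bug_reproduces` callback are hypothetical, and on a real bench each call would be a manual test (power down the device, rerun, note the result) rather than a function. The `repeats` parameter is there because an intermittent bug may not show on a single attempt.

```python
def find_suspects(modules, bug_reproduces, repeats=3):
    """Remove one module at a time and list those whose removal stops the bug.

    modules: names of removable parts of the system (hypothetical labels).
    bug_reproduces: callable that takes the set of enabled modules and
        returns True if the bug was observed with that configuration.
    repeats: how many times each configuration is retested.
    """
    all_on = frozenset(modules)
    suspects = []
    for module in modules:
        without = all_on - {module}
        # Retest with the module present, then again without it.
        seen_with = any(bug_reproduces(all_on) for _ in range(repeats))
        seen_without = any(bug_reproduces(without) for _ in range(repeats))
        if seen_with and not seen_without:
            suspects.append(module)
    return suspects


# Stand-in for the bench: in this made-up system the bug only shows
# while the (hypothetical) "fan driver" module is powered.
def fake_bench(enabled):
    return "fan driver" in enabled


print(find_suspects(["display", "fan driver", "usb hub"], fake_bench))
```

A module that appears in the result is a lead to document and retest, not yet a diagnosis.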
Step 7: Talk it over with a colleague
When dealing with what seem to be intractable bugs, just talking it over with someone else can often help, even when that person is from a different engineering discipline. Explaining what you see to another can be all that is required for you to see the bug from a different point of view and realize a crucial fact. At the very least you may come up with inconsistencies that need ironing out or receive suggestions of other things to try.
This conversation is best had away from the action. Go through exactly what the evidence is, one bit at a time, then look for what other experiments or investigations can be carried out. Then go back to the board and carry on.
Step 8: Apply the fix
You understand the bug and have come up with a rational solution. You run the code and the problem appears to be solved. However, your job isn’t over yet.
Step 9: Try to break it again
Try to break the system again. To be sure you have succeeded, you will need to put the system through an appropriate series of stress tests an order of magnitude beyond that of the original implementation.
For instance, if a real-time system such as the one above crashed every ten minutes and never lasted longer than an hour, but now runs for ten hours, the bug is almost certainly fixed.
You may find that the system behaves better but still crashes. In that case, you may have discovered a new bug that had been masked by the previous one. Treat it as such and go back to Step 1, creating a fresh investigation on the “cured” system.
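To put a rough number on “almost certainly”, one can ask how likely it is that an unfixed bug would stay silent for the whole test by luck alone. The sketch below assumes failures arrive at random with a constant average rate (an exponential model). That is an assumption added here for illustration, not something the example above establishes, and the function name is hypothetical.

```python
import math


def chance_of_lucky_run(mean_minutes_between_failures, clean_run_minutes):
    """Probability that an unfixed bug stays silent for the whole run.

    Assumes failures occur at random with a constant average rate
    (exponential model); real bugs may not behave this way.
    """
    return math.exp(-clean_run_minutes / mean_minutes_between_failures)


# Pessimistic reading of the example: treat the longest run ever seen
# (one hour) as the mean time between failures, then run for ten hours.
print(chance_of_lucky_run(60, 600))

# Reading "crashed every ten minutes" as the mean gives a far smaller value.
print(chance_of_lucky_run(10, 600))
```

Even on the pessimistic reading, the chance of a lucky ten-hour run is well under one in ten thousand, which is why a test an order of magnitude longer than the worst previous run is persuasive. The model says nothing about bugs triggered by a specific condition that the test never exercised.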
Step 10: Remember ‘disappearing’ bugs are still there if you haven’t fixed them
Sometimes bugs just appear to go away by themselves. This can be frustrating, but you can be sure that you haven’t fixed the bug. Either the initial report was incorrect or the bug is still there. These are the sort of bugs that reappear when your boss, his boss, or a customer is present.
Step 11: Celebrate
Remember how bad it felt when the bug was grinding you down? Now, celebrate when you win. It’s you: 1, bugs: 0. Now the game can move on and you can be sure you'll fix the next bug too.
well! still more to come.....
thanks
shashank sharma