I am pretty sure that everyone who has programmed something a bit more complex than spitting out "Hello World!" on the standard output has some idea of how frustrating finding certain bugs can be. Especially if the said bug lives in some embedded system whose only external interface is a serial line on which it should perform its main function. Especially if it manifests itself as a real nasty example of a Heisenbug. And, guess what, in the end it turns out that the bug crept in from the very silicon itself: Atmel's flagship, the AVR32 microcontroller.
To cut it short, for the impatient: the whole story of the bug itself, as presented by me (as "dev.null") and "Catweax", can be read on the AVR Freaks forum here. In the following I will just talk about its impact on me, and through me, on the company I work for.
What would the typical reaction of an experienced software developer be when he sees that his creation doesn't behave exactly as he wanted? Well, certainly not blaming the compiler, much less the hardware itself.
It was sometime in April or maybe May when I was assigned to build, from the ground up, the software of a filtering, multi-endpoint analog/digital input-output module on an AVR32 UC, with Modbus interfacing. High performance (including high throughput on the Modbus interface) and some degree of provable safety were in mind, so Atmel's ASF with its shoddy interrupt handling was out of the question.
All went well until I implemented some self-checks in the software; then it was as if Cerberus had fallen asleep at the gates of Hell and all the demons had broken loose. The checks would signal inconsistencies all around for no apparent reason, while the software otherwise seemed to work just right.
Probably the entire month of June was spent hunting for where I might have missed some strategic "volatile", and then, after proving line by line that at the C language level this damned thing must be correct, I started studying the assembly listings (which also meant I had to learn the assembly language; not much of a leap with my experience, but still something).
At that time I blamed the compiler (avr32-gcc), or at least some of its optimizations, which seemed to make the problem appear or disappear. After some group discussions, it was decided that the compiler had to be used with optimizations off until more was figured out, and that the clock frequencies would be pumped up to meet the requirements. Another AVR32 project, also under my development from then on, went through with this policy.
Meanwhile, of course, I got to browse around, including AVRFreaks, and including Catweax's topic (see the link above), where he described what his experience suggested was a DMA-related problem in the processor. I was aware of this, and also that my DMA was working just fine (an assumption which indeed turned out to be correct).
The company's decision on the input-output module was kind of on hold, since without optimizations it wouldn't meet the performance requirements; as things stood, it looked like we needed a different microcontroller. Costly stuff, but keeping the project on hold was costly, too.
In September the other AVR32 project was ready; thankfully its implementation went without many hiccups, apart from the fact that, optimizations being off, I had to fiddle with the clocks and run it at 24MHz with a PLL to end up with a proper safety margin.
So back to the problem. Another month of work, now starting from the assumption that I was probably after a compiler bug. So I reduced, reduced, and reduced the software towards a simple test case. Too bad the bug sometimes just decided to vanish completely, so I frequently needed to revert, start over, and give it another go. Many days of frustrating messing around got me to a small test case.
This small test case finally showed something like a memory read getting corrupted after an interrupt disable, provided some other indeterminate prerequisites were also met. I still had no clue whether it was in any way related to Catweax's find, as his seemed to involve a DMA versus interrupt race condition.
That was when we (the company) tried to contact Atmel, and when I posted my findings on AVRFreaks (to tell the truth, I had popped up a question there beforehand too, asking someone to check a chunk of assembly generated by the C compiler relating to this behaviour, but no one helped; maybe because the instruction sequence itself, when interpreted rather than run on the actual UC, led to correct behaviour).
Thankfully Catweax found the topic, did a little testing, and in the end suggested that the two problems were probably actually one. Well, that was the light, at last!
After several more days and some additional discussions, the case was proven and the bug got somewhat properly documented there. Too bad Catweax just disappeared; maybe the problem simply wasn't interesting any more once solved. Anyway, he did just what was necessary there; maybe he just has a job like me.
So everyone is happy, the dragon was slain, the prince kissed the frog who turned into a lovely princess. But, oh well, is this still a happy ending?
Anyone missing anything?
Where was Atmel all this time? Neither we, nor Catweax before us, got any credible response from them, nor is the issue documented anywhere, even though Catweax found the problem more than a year ago (posted in August 2011).
As for us, counting just my time (which the company couldn't utilize for "useful" stuff then), and calculating with the numbers they like to quote (far, far from my actual salary; I'd be happy if it were!), it cost $23,000. Yes, more than twenty kilobucks, just because I got sucked into this mess.
Costly little adventure, indeed.