Tuesday, October 7, 2008

Threading a fine line

I have come to realize something about myself as a programmer: my past programming experience has not been in development but in hacking. Sure, I've written a large chunk of code here or there, but the vast majority of my code has been written in a blitz. From homework assignments in college to Perl scripts for molecular dynamics research, the average lifetime of my code is probably about two weeks.

Things are different now that I've taken on the responsibility of writing software for analyzing large, image-based biological screens. It used to be easier to just dodge code speed bumps: "Hack it, just make it work, because no one else will have to learn how to use it." Now when I encounter a speed bump, I can't go around it or even drive over it; I have to get out of the car and find a way to peel this damn thing off the road, or consider the nightmare of paving a new one!

At the start of the day yesterday, I was bogged down by the thought of two highly elusive bugs hiding somewhere in my code. I knew I couldn't just look the other way; further development would have to wait until the code was 100% stable. The frustrating part was that the code seemed completely stable, because the errors were often difficult to reproduce. Still, they were there, and I knew they weren't going to be easy to fix, so I set to mailing the wxpython-users group for help before attacking the problem on my own.

The first bug was related to a worker thread whose job it was to fetch image tiles and add them to a bin. I noticed that if I resized the window while the thread was hard at work, I would occasionally get error output in my console that didn't mean anything to me. Even less frequently, I would get other errors that caused the thread to crash.

This was unnerving psychologically because neither of these errors was terribly disruptive to the user. Even when the worker thread crashed, it could easily be restarted by clicking a button. Still, as a developer I realized that small problems like these snowball as an application grows. It is far easier to thwart them at the outset than to put them off in the hope that they will resolve themselves.

After playing with it for a while, I found that triggering lots of resize events on the bin while the thread was adding a cell was more likely to cause an error. Clearly there was a lack of mutual exclusion for the bin. The solution was simple enough: do the image loading in the worker, then signal the main thread whenever an image is ready, so that only the main thread ever touches the bin.
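For anyone who wants to see the shape of that fix, here is a minimal sketch of the pattern, not the actual application code. It assumes a plain `queue.Queue` as the signalling channel so it can run without a GUI; in a wxPython app the hand-off would more likely go through something like `wx.CallAfter`. The names `load_tile`, `Bin`, `worker`, and `run` are all made up for illustration.

```python
import queue
import threading


def load_tile(tile_id):
    # Hypothetical stand-in for the slow part: reading an image tile.
    return f"tile-{tile_id}"


class Bin:
    """Holds tiles. Only the main thread is allowed to touch it."""

    def __init__(self):
        self.tiles = []

    def add(self, tile):
        self.tiles.append(tile)


def worker(tile_ids, ready):
    # The worker only loads images; it never modifies the bin itself.
    for tile_id in tile_ids:
        ready.put(load_tile(tile_id))
    ready.put(None)  # sentinel: no more tiles are coming


def run(tile_ids):
    ready = queue.Queue()  # thread-safe channel from worker to main thread
    tile_bin = Bin()
    thread = threading.Thread(target=worker, args=(tile_ids, ready))
    thread.start()

    # Main thread: the only place where the bin is modified.
    while True:
        tile = ready.get()
        if tile is None:
            break
        tile_bin.add(tile)

    thread.join()
    return tile_bin.tiles
```

Because the bin is only ever changed from one thread, a resize handler running on that same thread can no longer catch it halfway through an update. In a real GUI the blocking loop in `run` would be replaced by a callback scheduled on the event loop.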

Today I stand with one bug left to fix. This one unpredictably crashes the entire app in a few inept-user scenarios. Let's just say I've already started paving over that road.