I had thought of the case for <title/> but basically out of laziness talked myself out of writing a separate test case for it, presuming the test cases for a zero-length title and an un-closed title covered the corner cases. (The entire document was guaranteed to be converted to valid UTF-8, perhaps with invalid character substitution characters, by that late in the Content Converter pipeline.)
So, as soon as someone asked me if I had changed the title parsing code, I was 50% sure of which corner case I had screwed up before looking at any code. It took me about 30 minutes to submit a code fix with updated test cases. I think fewer than 1 billion documents had been processed, resulting in fewer than 1,000 pages missing updates due to my bug.
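For anyone wondering what "a separate test case for it" would have looked like, here is a minimal sketch of a table-driven test that lists each corner case explicitly instead of assuming one covers another. The `extract_title` helper is hypothetical and written only for this illustration; it is not the Content Converter code. Treating `<title/>` as an empty title, and ending an un-closed title at the next tag, are both assumptions of the sketch.

```python
import re
from typing import Optional

# Hypothetical helper for illustration only.
# Group 1 captures the "/" of a self-closing tag such as <title/> or <title />.
_OPEN = re.compile(r"<title\b[^>]*?(/?)>", re.IGNORECASE)
_CLOSE = re.compile(r"</title\s*>", re.IGNORECASE)


def extract_title(html: str) -> Optional[str]:
    """Return the title text, "" for an empty title, or None if no title tag exists."""
    opening = _OPEN.search(html)
    if opening is None:
        return None  # no title element at all
    if opening.group(1):
        return ""  # <title/>: assumed here to mean an empty title
    start = opening.end()
    closing = _CLOSE.search(html, start)
    if closing is not None:
        end = closing.start()
    else:
        # Un-closed title: assume it ends at the next tag, or at end of input.
        next_tag = html.find("<", start)
        end = next_tag if next_tag != -1 else len(html)
    return html[start:end].strip()


# One row per corner case, so no case is only "presumed" covered by another.
CASES = [
    ("<head><title>Hello</title></head>", "Hello"),  # ordinary title
    ("<head><title></title></head>", ""),            # zero-length title
    ("<head><title/></head><body>x</body>", ""),     # self-closed title
    ("<head><title>Oops</head>", "Oops"),            # un-closed title
    ("<p>no title here</p>", None),                  # no title at all
]

for html, expected in CASES:
    assert extract_title(html) == expected, (html, expected)
```

The point of the table is that adding the `<title/>` row costs one line, which is a lot cheaper than talking yourself out of it.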
It helped me empathize with young engineers dealing with their first high-responsibility bug.
About 4 years ago, I was managing a guy right out of school who pushed a minor bug that broke real-time risk calculations for a major multinational financial institution in the middle of the European trading day, prior to the NYC open, and people were yelling over email that they were trading blind. Someone had committed an important change right after his, so a simple rollback was highly sub-optimal.
Remembering how I felt years ago, I reassured the new guy that people were yelling over email because it was important, not because they were mad at him. I told him that I thought he was the most familiar with his change and the most capable person to fix it, and that he should do his best to calm down and focus, but he should let me know if he needed help, and I would do my best to calm folks down. I told him he would probably remember that mistake the rest of his life, but nobody else was going to remember it a week later. He had the bug fix in production in under an hour.
He sent me an email from home that night worried that he had let the team down, and I reiterated that he was going to be the only one who remembered the mistake longer than a week. The post-mortem follow-up was just to reiterate to authors and reviewers the importance of corner-case tests, and nobody brought it up later.
I really only remember it because my manager sent me an email that night praising how well I handled the new guy's first big production bug.
That's a fair criticism. Deep down, I usually have a pretty high opinion of my abilities. I think I'm pretty good at hiding it in person, but I'm less good at hiding it in my writing. I feel happiest and most excited to write when I'm thinking about some of my happiest memories. I try to also be open about the mistakes I've made. I've generally been much more lucky than skilled.
I've definitely written more than one bug where post-mortem estimates were over $10,000 in losses.
On August 20, 2013, I finished a code change (in Hong Kong) to Goldman's global algorithmic trading system and sent it out to a colleague in Europe to review. A friend of mine was a machine learning person in our Tokyo office and was in town for work, so a bunch of us had dinner and a small number of drinks. I stopped by the office on my way home to check if the change had been approved. It had, and I hesitated a bit to put it in production, because I had had a couple of drinks and it was late at night. However, I rationalized that I had written all of the code while awake and without a drop of alcohol, and pushed the change into production.
I woke up the next morning to read news [1] that Goldman had lost up to 100 million dollars in an automated trading problem within 1-2 hours after I pushed my change. I couldn't see how my change could possibly have caused that error, but was still a bit panicked until I reassured myself that my cell phone would have been called once a minute until I woke up if I had made a change that caused a loss of that magnitude.
I went into the office and saw that a chat window I had open with a friend in the NY office showed "presence unknown". An email sent to them bounced. So, I walked over to the derivatives (Flow) Strats desk, sat down in an empty chair next to one of my friends, and just quietly said "... so " and the name of my friend in NY. My friend on the Flow desk went wide-eyed and said "how did you know?". I actually didn't know until the Flow Strat's reaction confirmed my guess.
My friend in New York was actually very careful, but he had been working under time pressure late at night and pushed a bug into production. He'd been more responsible than I had the night before. I got really lucky, and he got really unlucky. He's actually a really solid engineer. He caught plenty of very subtle bugs in other people's code, at least once when he hadn't been asked to review the code.
After August 20, 2013, if at all possible, I push changes into production before noon, and not on Fridays.
If memory serves, the "maybe $100 million" ended up being around $28 million.
And that's the time that I could have easily caused a $28 million loss.
There was also a time I misplaced a paren and, had a bad actor noticed, they could have used 60 million customer computers in a DDoS UDP traffic amplification attack. My test cases weren't matching my hand-worked-out examples, but I eventually just gave up, assumed my code was correct, and put incorrect values in the message authentication code test vectors. Never roll your own crypto, especially if your test vectors aren't coming out as you expect. That was 2004.
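The safer habit is to check against a published test vector that nobody on the team computed, and to treat a mismatch as a bug in the code rather than in the vector. Here is a minimal sketch using Python's standard `hmac` module and test case 1 from RFC 4231. HMAC-SHA-256 is only a stand-in here, since I'm not describing the actual MAC construction from 2004, and the `mac_hex` and `matches_published_vector` names are made up for the illustration.

```python
import hashlib
import hmac

# RFC 4231, test case 1 (HMAC-SHA-256). These values come from the RFC,
# not from the code under test, which is the whole point.
KEY = bytes([0x0B]) * 20
DATA = b"Hi There"
EXPECTED_HEX = "b0344c61d8db38535ca8afceaf0bf12b881dc200c9833da726e9376c2e32cff7"


def mac_hex(key: bytes, data: bytes) -> str:
    """Compute HMAC-SHA-256 with the standard library (hypothetical wrapper)."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def matches_published_vector() -> bool:
    """True if our MAC agrees with the published vector."""
    return hmac.compare_digest(mac_hex(KEY, DATA), EXPECTED_HEX)


# If this fails, fix the code. Do not edit EXPECTED_HEX to make it pass.
assert matches_published_vector()
```

Editing the expected value until the test goes green is exactly the mistake I made.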
The "yelling" was coming to the team email list, asking for ETAs and progress updates for when real-time risk would be back up. Roughly 4 people at the time knew the bug could be traced to the new guy's commit, and none of those people were doing the "yelling". And it was Goldman, so the "yelling" was kept very professional (no swearing, strictly enforced). But there were literally tens of billions of dollars that needed to be dynamically hedged, which wasn't possible without real-time risk; the European markets were open, and markets in the Americas were going to be open within a couple of hours. Trading and management were making sure that everyone on the team email list understood that this was drop-everything important, perhaps using all caps.
Yes, I and the person who reviewed the change bear more responsibility than the new developer. Also, I say "new guy", but the person who had interned with us the summer after "the new guy" had already joined full time at that point, so "the new guy" had been working full time with the team for at least 9 months at that point. I also remember the room where it happened, which wasn't the first room we were in, so maybe he had been with us full time more like 18 months. In any case, it was the first time he was trying to keep the weight of billions of dollars out of his head and calmly but quickly fix a bug.
Did you consider the fact that you probably know nothing about the dynamics of their workplace, the structure of their management/leadership, etc. before assigning blame?
Probably typing on a non-English keyboard. When I learned C a long time ago, I read somewhere 'isn't it nice that one doesn't need a lot of keystrokes like for begin and end', and I thought, please give me begin and end instead of this unpleasant, slow hand movement.
Spot on. It's the reason many non-native speaker developers I know nevertheless use an English keyboard layout. I personally made my own hybrid layout that is basically an English US layout with the letters rearranged according to my native layout.
The author was born and raised in Brooklyn and went to Cornell, if I remember correctly. As far as I know, English was his only language, and he was almost certainly using either a US QWERTY or a Dvorak keyboard layout.
Yes, "laziness" is unfair and imprecise, a laziness on my part. :(
I think it was an issue of familiarity and comfort, not newness. The author joined Google before I did. If you're basically working with one other person, and you're rarely getting code reviews from outside your coding pair, and few other people interact with your code, it's easy to develop some bad habits and forget that your code choices have externalities. To be fair, the externalities were usually rather small.