Start writing tests, change no production code. You need to understand how it wo...

akavel · on Aug 23, 2018

I once did a major cleanup at the start of my career. The above advice is closest to what proved successful to us, together with the response by JohnBooty https://news.ycombinator.com/item?id=17825093.

In our case, I think we didn't write a lot of tests, because we didn't know they're so useful then. But much later I did a few ports of FOSS software between programming languages, and comprehensive test suites were absolutely godsend in those cases. So, if you can, try to write tests as first thing indeed.

Other than that, we took a kind of "organism fighting a cancer" approach. In a spaghetti codebase, we tried to find some smallest possible islands of the most isolated code (a lot of detectivistic work). Then try to improve the isolation even more — slowly engulfing the "tumor" in a more reasonable API. Once we had a somewhat acceptable API, with mostly well understood semantics, rewrite the internals of the tumor from scratch. Rinse and repeat.

Notably, as mentioned by JohnBooty, we did talk this with management. It made our product release late by ~1yr. The argument was that there was a particular critical bug, which would result in lost consumer data. This one bug was found very close to original release date, and opened our eyes to the horrors in the codebase (actually, a big part written by a subcontractor).

Sorry if the reply is somewhat chaotic, I didn't have much time to write it.

EDIT: Also, I believe the term "technical debt" is reasonably good when talking with management/owners, as it has correct connotation of an investment at early stage, which however has to be repaid later at a cost.

yashap · on Aug 23, 2018

Definitely this, though I’d say the first thing to address is good observability. Are all errors being logged, with stack traces? If it’s a web service, are errors rates and timings for all endpoints tracked?

Next, the deployment process. Are tests run on all deploys? Are there canary deploys to minimize impact when bad code goes out? Easy rollbacks?

After that, I’d address the dev environment. Can you run it easily locally? Can you easily attach a debugger?

After that, add end-to-end, integration, then unit tests, in that order. Only then, once you understand the code well, and have it nicely instrumented and tested, should you consider significant refactors. By that point, you may even realize that a lot of it is good enough and rarely touched, and doesn’t really need to be refactored.

If any of the above steps are already well covered, skip them. And don’t spend all your time doing this - talk to your manager(s), try to get buy in to spend say 1/3 of your time on the health/quality of the code, 2/3 on features/bug fixes/etc. Justify WHY - in this case it sounds like the code would be hard for anyone but the original author to modify with confidence, to make sure the code scales past this one person without the product becoming full of bugs and instability, it’s necessary to devote significant time each sprint (or whatever) to improving quality. Can frame it as “original author was just trying to ship quickly, and could keep track of all the hacks because they wrote them, but to bring more people onto the project it has to be made safer/easier to modify for non-experts”. Maybe wait until you’ve been there a month or three before you start pushing this.

At the end of the day, the code doesn’t have to be perfect, it just has to be decently easy to modify with reasonable confidence. Tests should catch most errors, monitoring should quickly catch the rest to minimize the impact. Work towards that, over time, without killing yourself trying to sneak in refactors that your manager(s) don’t know about.

askmike · on Aug 23, 2018

I think you are right, though in my experience in these situations is that defining the exact behavior is very hard. In not many cases was the messy code written messy from the start, it's usually an accumulation of changes (to behavior) - which are very hard to figure out.

I've found it's also hard to test code that isn't modular and split up: before you can test anything you need to mock 2 database connections, one external API endpoint and reading/writing from 3 different folder locations. Either that or you split them up yourself and risk breaking something.

haney · on Aug 23, 2018

So for a project that lacks tests, I’d suggest higher level integration/smoke tests so that you don’t have to modularize in order to write tests. Don’t mock a database, use a “real” test database with fixtures, if necessary use tools like Selenium and snapshot testing. These tests may be brittle and they may not cover all branches. But you’re looking for things like “can a user register”, “can an order be placed”. For http you can use tools like Charles Proxy to record / replay network traffic from real observations of the existing code running.

herbst · on Aug 23, 2018

This is what i would suggest as well. I did some "rewrites" in my carreer and they only really went well when i worked long enough with the legacy code before. Essentially getting a feeling for what it really does, being that with tests (preferred) or by reading the same lines again and again while introducing new features/bugs.

If you feel the code is messy it is likely you are missing some essential business logic as well when blindly rebuilding it.