As an ML-focused python dev I have never been able to break the habit of REPL-driven development, but I find it works really well for "building code that works" rather than coming up with a tower of abstractions immediately. A typical python development workflow for me is:
* Start with a blank `main` file and proceed linearly down the page, executing as I go.
* Gradually pull out visually awkward chunks of code and put them into functions with no arguments at the top of the file.
* If I need to parameterize them, add those parameters as needed - don't guess at what I might want to change later.
* Embrace duplication - don't unnecessarily add loops or abstractions.
* Once the file is ~500 LOC or becomes too dense, start to refactor a bit. Perhaps introduce some loops or some global variables.
* At all times, ensure the script is idempotent - just highlighting the entire page and spamming run should "do what I want" without causing trouble.
* Once the script has started to take shape, it can be time to bring some OO into it - perhaps there is an object or set of objects I want to pass around; I can make a class for that. Perhaps I can start to think about how to make the functionality more "generalized" and accessible to others via a package.
This is literally the only way I've ever found to be productive with green field development. If my first LOC has the word "class" or "def" in it - I am absolutely going to be ripping my hair out 12 hours later, guaranteed.
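For illustration, a rough sketch of what one of my files might look like partway through this process (the file, data, and names are all invented, not from a real project):

```python
# explore.py - run top to bottom; re-running the whole thing is always safe.
import json
from pathlib import Path

DATA_PATH = Path("data/runs.json")  # hypothetical input file

def load_runs():
    # Pulled out of the main flow only because it was visually awkward inline.
    with DATA_PATH.open() as f:
        return json.load(f)

def mean_score(run):
    # Got a parameter only when a second call site actually needed one.
    return sum(run["scores"]) / len(run["scores"])

runs = load_runs()

# Duplication embraced: one literal line per run, no premature loop.
print("baseline:", mean_score(runs[0]))
print("tuned:   ", mean_score(runs[1]))
```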
> * Gradually pull out visually awkward chunks of code and put them into functions with no arguments at the top of the file.
I have seen scientific code written in this manner in both Python and Fortran.
This may be some intuitive way to start off, and even complete the task at hand.
But for people trying to read, understand, and realistically, debug your code, this complicates things.
Because each no-argument function can only work via its side effects.
Your script becomes a succession of state transitions, and to understand it, you have to keep the intended state after each step in your head.
And in case of a mistake, you can't even call these functions individually with their intended arguments from the REPL. You have to set up all the state beforehand manually, call your no-arg function, and observe the state afterwards. That becomes more awkward the further down you are in your script, since all your dependencies are implicit and possibly even completely undocumented.
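A toy example of the difference (names invented): the first version can only be exercised by rebuilding the globals by hand, while the second is directly callable from the REPL:

```python
# Implicit-state style: step2() silently depends on step1() having run,
# because both communicate through a module-level dict.
state = {}

def step1():
    state["values"] = [3.0, 4.0, 5.0]  # stands in for loading real data

def step2():
    state["mean"] = sum(state["values"]) / len(state["values"])

step1()  # forget this, and step2() raises a KeyError far from the cause
step2()
print(state["mean"])

# Explicit style: testable from the REPL with any input you like.
def mean(values):
    return sum(values) / len(values)

print(mean([3.0, 4.0, 5.0]))
```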
Sure that's like the worst case scenario, but in practice the entire point is that as soon as it becomes difficult to reason about, that's when you start cleaning up the functions and add some obvious OO. You don't leave it as a mess.
> Embrace duplication - don't unnecessarily add loops or abstractions
I’ll usually make a function or perhaps tiny class as soon as I start reusing bits of code.
Apart from that, agree as stated. At my previous job (Python shop), a lot of the data engineers came from a Java background and had a tendency to think top-down. Many things were over-engineered ‘just because we might need it’:
- Factory classes used only once or twice in entire code base
- Let's make an AbstractReaderInterface because we might want to abstract the file type or location later (while 100% of files are Parquet on S3)
I've really enjoyed using dataclasses and Pydantic's BaseModel prolifically, and adding type hints (coupled with type checks in CI).
Model the data, write a well structured imperative workflow, set up CI, write unit tests, enforce typing. Add OOP if needed, then close ticket.
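For instance, a minimal sketch of that "model the data first" step with a stdlib dataclass (the field names here are invented):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParquetFile:
    # Model the data up front; a type checker in CI catches misuse early.
    bucket: str
    key: str
    size_bytes: int

def total_size(files: list[ParquetFile]) -> int:
    return sum(f.size_bytes for f in files)

files = [
    ParquetFile("my-bucket", "events/2024/01.parquet", 1_024),
    ParquetFile("my-bucket", "events/2024/02.parquet", 2_048),
]
print(total_size(files))  # 3072
```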
Agreed. The general principle is (and it's a hard balance, of course, to avoid just slapdash work): don't do work up front you might not need. You will make your current work take longer than it should, which you will (correctly) be blamed for, and any time you save in the future you won't get the credit for.
This isn't a cynical statement; it's just my experience. Use it to your advantage!
People are getting caught up on the loops thing. All I meant was in my line of work I often end up with many special cases of general processes. Writing a loop prematurely always bites me - I end up writing control flow for handling the one-offs, it somehow always becomes more obtuse than just listing things out literally.
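A contrived sketch of what I mean (the helpers are stand-ins, not real code):

```python
def load_split(name: str) -> list[int]:
    # Stand-in for real data loading.
    return {"train": [1, 2, 2], "val": [3], "test": [4, 4]}[name]

def export(rows: list[int], name: str) -> None:
    print(name, rows)

# Premature loop: the one-off for "test" turns into control flow.
for name in ["train", "val", "test"]:
    rows = load_split(name)
    if name == "test":
        rows = sorted(set(rows))  # dedupe only this split
    export(rows, name)

# Listed out literally: each special case stays local and obvious.
export(load_split("train"), "train")
export(load_split("val"), "val")
export(sorted(set(load_split("test"))), "test")
```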
Goto. In the end it's all about manipulation of the instruction pointer.
Loops as deduplication is a very specific subset of looping that is very popular in some languages and almost nonexistent in others. If you don't have destructuring and convenient list literals you might never ever see it in the wild.
But even the Java 7 version (iterating an ImmutableList.of multi-nested Maps.immutableEntry of new FunctionN) can be workable despite its hilarious amount of repeated type annotations, if you have learned to stop worrying and love the tooling. Stuff like TypeScript makes it a breeze, so much so that one might occasionally forget that it's not the regular form of looping.
> Embrace duplication - don't unnecessarily add loops or abstractions.
> Once the file is ~500 LOC or becomes too dense, start to refactor a bit. Perhaps introduce some loops or some global variables.
I agree with most of it, but if I "embrace duplication" I can reach 500+ LOC in half an afternoon :P. It has really paid off for me to start some degree of abstraction (generally not OO yet) early enough. Tbf, it is easier for me to tidy up the code with some abstractions than to ensure the "main"-titled script runs from beginning to end without errors at every point in time, which imo can hinder experimentation more than abstraction does. But it also depends on what I do, I guess - the greener the field, the more I feel this way.
Wow, not an ML person, but controls and robotics. Yet this describes my workflow for a lot of things almost to a tee. Even down to avoiding loops. I tend to do that when I want to run the same simulation or analysis for a couple of variables or datasets. It's interesting, because in the past I was a lot more prone to turn that into a loop early on. But this makes your code brittle. You'll want to do something slightly different for the two datasets, which means a bunch of conditionals in your loop. It's actually really similar to the problems you get with boolean flags when you try to abstract into a function too soon. It actually takes discipline for me to commit to copy-paste, but I think it pays off.
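That boolean-flag failure mode, sketched with made-up analyses:

```python
# Abstracted too soon: a flag forks behavior inside the shared function,
# and every new one-off adds another branch.
def run_analysis(data: list[float], baseline: bool = False) -> float:
    if baseline:
        data = [x - min(data) for x in data]  # baseline-only tweak
    return sum(data) / len(data)

# Copy-pasted instead: each variant can drift freely, no flags to thread.
def run_experiment(data: list[float]) -> float:
    return sum(data) / len(data)

def run_baseline(data: list[float]) -> float:
    shifted = [x - min(data) for x in data]
    return sum(shifted) / len(shifted)

print(run_experiment([1.0, 2.0, 3.0]), run_baseline([1.0, 2.0, 3.0]))
```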
I too agree with pretty much everything you say. Just want to add that I pretty much solely use ptpython[0]. It handles line breaks in pasted code, offers vim (or emacs) bindings and syntax highlighting, and much more.
Pretty much follow an identical process. When I do finally rewrite the code, after getting a working version, the duplication pretty much screams clean me up and simplify/generalize. I have never been able to just see the whole thing before I start. The process itself teaches you things.
I usually use VS code and the "interactive" python functionality (not jupyter). I highlight code and execute just that code with a hotkey. Works just as well with any kind of vim-slime like functionality.
I also primarily write ML-focused Python. For me, having originally learned R and C at the same time, nothing has ever surpassed RStudio as a dev environment. For the past several years my preferred setup has been tmux and Vim with vim-slime in one pane and IPython in the other.
(Personally, and speaking only for myself, I hate Jupyter notebooks with a burning passion. I think they are one of the worst things ever to have happened to software development, and definitely the worst thing ever to have happened to ML/data science.)
> Why do you hate Jupyter notebooks so much that it reaches “worst thing to ever have happened“ status?
It's the "with a passion" part. A certain sub-population is prone to deciding that they love or hate something, based on some early experience or social context, and then every future experience with the thing is then strong-armed into supporting that supposed strong opinion. There is no rational reason for this. It's a very extreme form of confirmation bias.
It's pretty fascinating actually, as it's often employed by rather intelligent people. With a slight tendency towards the autistic end of the spectrum, though more research into this is certainly needed. Perhaps somebody working on a degree in sociology is interested in digging further?
I can't justify it - it's pure preference and opinion, irrationally held. A big part of it is probably that the type of programming I generally need to do is closer to using an overgrown calculator (with DataFrames) than doing proper Software Development to build a thing.
I much prefer having the code over _here_, and then having the results in a separate pane over _there_. Jupyter style mixing of inputs and outputs tends to confuse me, and in my hands gets very messy very quickly.
The slides in this light-hearted talk from JupyterCon in 2018 probably give a better explanation than I could.
I'd like to know too. I learned python in jupyter notebooks. It makes experimenting and incremental development much easier (IMO), provided you remember to account for the current state of the notebook, which sometimes has me pulling my hair out.
> provided you remember to account for the current state of the notebook
Notebook-style development is considered an anti-pattern in most situations for this reason. It is too easy to execute cells out of order. Even the original parent of this thread said they re-run the entire script every time to ensure they catch these issues. But it's not perfect, and you can have leftover state this way too if you're not careful.
My guess is that this is the reason the GP here is so against them. I find them helpful for data exploration, but that’s it.
Not OP but I often code in the REPL for python as well. Sometimes I'll stub out my code and just drop into an interactive debugger where I'm writing the next section.
In the python debugger, if you type `interact`, it'll give you the normal python repl. This, combined with `help` and `dir`, is super useful for learning new frameworks/libraries and coding with your actual data.
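For anyone who hasn't tried it, a small example of the pattern (the function and data are made up):

```python
import pdb

def summarize(rows):
    cleaned = [r.strip().lower() for r in rows]
    # Pause here while writing the next section; at the (Pdb) prompt,
    # type `interact` for a full REPL, then use help()/dir() on `cleaned`.
    pdb.set_trace()
    return cleaned

summarize(["  Alpha ", "BETA  "])
```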