
As an ML-focused python dev I have never been able to break the habit of REPL-driven development, but I find it works really well for "building code that works" rather than coming up with a tower of abstractions immediately. A typical python development workflow for me is:

* Start with a blank `main` file and proceed linearly down the page, executing as I go.

* Gradually pull out visually awkward chunks of code and put them into functions with no arguments at the top of the file.

* If I need to parameterize them, add those parameters as needed - don't guess at what I might want to change later.

* Embrace duplication - don't unnecessarily add loops or abstractions.

* Once the file is ~500 LOC or becomes too dense, start to refactor a bit. Perhaps introduce some loops or some global variables.

* At all times, ensure the script is idempotent - just highlighting the entire page and spamming run should "do what I want" without causing trouble.

* Once the script has started to take shape, it can be time to bring some OO into it - perhaps there is an object or set of objects I want to pass around, and I can make a class for that. Perhaps I can start to think about how to make the functionality more "generalized" and accessible to others via a package.
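To make the shape concrete, here's a toy sketch of what such a script might look like mid-evolution (all names made up): chunks pulled into no-arg functions at the top, a linear body below, and safe to re-run from the top.

```python
# Hypothetical sketch of the workflow above: a linear script where
# awkward chunks have been pulled into no-arg functions at the top.
RAW = [3, 1, 2, 2]  # stands in for data loaded at the top of the file

def dedupe_and_sort():
    # no-arg function: reads and writes module-level state
    global rows
    rows = sorted(set(RAW))

def summarize():
    global total
    total = sum(rows)

# linear "main" body: re-running the whole file from the top is idempotent
dedupe_and_sort()
summarize()
print(rows, total)
```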

This is literally the only way I've ever found to be productive with green field development. If my first LOC has the word "class" or "def" in it - I am absolutely going to be ripping my hair out 12 hours later, guaranteed.



> * Gradually pull out visually awkward chunks of code and put them into functions with no arguments at the top of the file.

I have seen scientific code written in this manner in both Python and Fortran. This may be an intuitive way to start off, and even to complete the task at hand.

But for people trying to read, understand, and realistically, debug your code, this complicates things.

Because each no-argument function can only work via its side effects. Your script becomes a succession of state transitions, and to understand it, you have to keep the intended state after each step in your head.

And in case of a mistake, you can't even call these functions individually by passing their intended arguments in the REPL. You have to set up all the state beforehand manually, call your no-arg function, and observe the state afterwards. Which becomes more awkward the further down you are in your script, since all your dependencies are implicit and possibly even completely undocumented.
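A toy illustration of the difference (hypothetical names): the first version can only be probed after manually recreating its state, while the second can be called straight from the REPL with test inputs.

```python
# Sketch of why no-arg functions are hard to probe in a REPL: the first
# version only works if module state is set up first; the second is
# callable in isolation, e.g. normalize_explicit([5, 5, 1]).
state = {"raw": [2, 1, 2]}

def normalize_implicit():
    # depends on, and mutates, hidden module-level state
    state["clean"] = sorted(set(state["raw"]))

def normalize_explicit(raw):
    # same logic, but inputs and outputs are explicit
    return sorted(set(raw))

normalize_implicit()
assert state["clean"] == normalize_explicit(state["raw"])
```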


The correct way is:

1. Use comments to split visually awkward parts into chunks.

# a lot of code...

## ===

# a lot more code...

2. Use inner functions if that chunk needs to be reused

3. Only move the chunk to a top-level function if you think it's worth taking the time to turn its required state into parameters/return values
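A minimal sketch of those three steps (names invented for illustration):

```python
# Step 1: comment dividers mark the chunks. Step 2: a helper nested in
# the section that reuses it. Step 3: promotion to a top-level function
# only once its state has become explicit parameters/return values.
values = [1, -2, 3]

## === chunk 1: local cleanup ===
def _clip(x):          # inner helper, reused only within this section
    return max(x, 0)
clipped = [_clip(v) for v in values]

## === chunk 2: promoted, state made into parameters ===
def total(xs):
    return sum(xs)

result = total(clipped)
```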


Sure that's like the worst case scenario, but in practice the entire point is that as soon as it becomes difficult to reason about, that's when you start cleaning up the functions and add some obvious OO. You don't leave it as a mess.


Agree 99% except this statement:

> Embrace duplication - don't unnecessarily add loops or abstractions

I’ll usually make a function or perhaps tiny class as soon as I start reusing bits of code.

Apart from that, agree as stated. At my previous job (Python shop), a lot of the data engineers came from a Java background and had a tendency to think top-down. Many things were over-engineered 'just because we might need it':

- Factory classes used only once or twice in entire code base

- Let's make an AbstractReaderInterface because we might want to abstract the file type or location later (while 100% of files are Parquet on S3)

I’ve really enjoyed using dataclasses and Pydantic's BaseModels prolifically, and adding type hints (coupled with type checks in CI).

Model the data, write a well structured imperative workflow, set up CI, write unit tests, enforce typing. Add OOP if needed, then close ticket.
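A minimal sketch of the "model the data first" idea, using a stdlib dataclass (Pydantic's BaseModel layers validation on top of the same pattern); the field names are hypothetical:

```python
# Model the data with a typed dataclass, then write plain typed
# functions over it; a type checker in CI can verify both.
from dataclasses import dataclass

@dataclass
class Reading:
    sensor_id: str
    value: float

def mean_value(readings: list[Reading]) -> float:
    return sum(r.value for r in readings) / len(readings)

data = [Reading("a", 1.0), Reading("b", 3.0)]
```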


Agreed. The general principle is (and it's a hard balance, of course, to avoid just slapdash work): don't do work up front you might not need. You will make your current work take longer than it should, which you will (correctly) be blamed for, and any time you save in the future you won't get the credit for.

This isn't a cynical statement; it's just my experience. Use it to your advantage!


I don't understand why you consider loops an abstraction. They are some of the most basic building blocks.


People are getting caught up on the loops thing. All I meant was in my line of work I often end up with many special cases of general processes. Writing a loop prematurely always bites me - I end up writing control flow for handling the one-offs, it somehow always becomes more obtuse than just listing things out literally.


Loop as an abstraction of copy-paste.


They considered it a dedupe not an abstraction

It would be unnecessary to have a loop over keys of a dictionary to call function xyz when you can just repeat the xyz calls (it would look nicer too)

Unless the dictionary is huge and dynamically loaded.
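For example (with `xyz` as a stand-in name), both forms below do the same thing; for a small, fixed dict the spelled-out calls can read better, and the loop earns its keep only when the dict is large or built at runtime:

```python
# Same effect, two spellings: explicit calls vs. a loop over the dict.
results = []

def xyz(key, value):
    results.append(f"{key}={value}")

config = {"lr": 0.1, "epochs": 3}

# spelled out literally:
xyz("lr", config["lr"])
xyz("epochs", config["epochs"])

# equivalent loop (deduplicated form):
for k, v in config.items():
    xyz(k, v)
```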


Loops are an abstraction over conditional branching and, depending on the kind of loop, some other things.


> Loops are an abstraction over conditional branching

How?


Goto. In the end it's all about manipulation of the instruction pointer.
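One way to make that concrete (a toy sketch, not how CPython actually compiles loops): the while loop below computes the same thing as a hand-rolled state machine whose only control flow is a conditional branch and a backward jump.

```python
def count_with_while(n):
    total, i = 0, 0
    while i < n:          # branch: exit when the condition fails
        total += i
        i += 1            # ...then jump back to the test
    return total

def count_with_branches(n):
    total, i = 0, 0
    pc = "test"           # a tiny "instruction pointer"
    while pc != "done":   # dispatch driver (a trampoline)
        if pc == "test":
            pc = "body" if i < n else "done"   # the conditional branch
        else:  # pc == "body"
            total += i
            i += 1
            pc = "test"                        # the backward jump
    return total
```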

Loops as deduplication is a very specific subset of looping that is very popular in some languages and almost nonexistent in others. If you don't have destructuring and convenient list literals you might never ever see it in the wild.

But even the Java 7 version (iterating an ImmutableList.of multi-nested Maps.immutableEntry of new FunctionN) can be workable despite its hilarious amount of repeated type annotations, if you have learned to stop worrying and love the tooling. Stuff like typescript makes it a breeze, so much that one might occasionally forget that it's not the regular form of looping.


What do you think is being evaluated on each iteration of a loop?


> Embrace duplication - don't unnecessary add loops or abstractions.

> Once the file is ~500 LOC or becomes too dense, start to refactor a bit. Perhaps introduce some loops or some global variables.

I agree with most, but if I "embrace duplication" I can reach 500+ LOC in half an afternoon :P. It seems to have really paid off for me to start some degree of abstraction (not OO yet in general) early enough. Tbf, it is easier for me to tidy up the code with some abstractions, rather than ensure that the "main"-titled script runs from beginning to end each point of time with no errors, which imo can hinder experimentation more than abstraction. But also depends on what I do, I guess, the greener the field the more I feel this way.


Wow, not an ML person, but controls and robotics. Yet this describes my workflow for a lot of things almost to a tee. Even down to avoiding loops. I tend to do that when I want to run the same simulation or analysis for a couple of variables or datasets. It's interesting, because in the past I was a lot more prone to turn that into a loop early on. But this makes your code brittle. You'll want to do something slightly different for the two datasets, which means a bunch of conditionals in your loop. It's actually really similar to the problems you get with boolean flags when you try to abstract into a function too soon. It actually takes discipline for me to commit to copy-paste, but I think it pays off.
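A toy version of that brittleness (dataset names made up): once the two cases diverge, the "deduplicated" loop sprouts conditionals, while the copy-pasted version stays flat.

```python
datasets = {"sim": [1, 2, 3], "field": [1, 2, 300]}

# premature loop: the special case leaks in as a conditional
out_loop = {}
for name, data in datasets.items():
    if name == "field":
        data = [x for x in data if x < 100]   # one-off outlier handling
    out_loop[name] = sum(data)

# copy-paste version: each case reads straight through
out_flat = {}
out_flat["sim"] = sum(datasets["sim"])
field = [x for x in datasets["field"] if x < 100]
out_flat["field"] = sum(field)
```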


I too agree with pretty much everything you say. Just want to add that I pretty much solely use ptpython[0]. It can handle line breaks in pasted code, vim (or emacs) bindings and syntax highlighting, and much more.

[0] https://github.com/prompt-toolkit/ptpython


Pretty much follow an identical process. When I do finally rewrite the code, after getting a working version, the duplication pretty much screams clean me up and simplify/generalize. I have never been able to just see the whole thing before I start. The process itself teaches you things.


Interesting thought (and coding) process. I love "design, tinkering and basically active thinking" with pseudo-code in a txt file, i.e. design.txt.

Just noting functions, data structures and some flow usually helps me arrive at something worthwhile...


by REPL here you mean jupyter notebook?


I usually use VS code and the "interactive" python functionality (not jupyter). I highlight code and execute just that code with a hotkey. Works just as well with any kind of vim-slime like functionality.
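For reference, the interactive window also picks up `# %%` cell markers in a plain .py file, so you can run cell-by-cell (via the Run Cell command) without a notebook; a tiny sketch:

```python
# A plain .py file; in VS Code, each `# %%` marks a cell you can send
# to the interactive window. It also runs top-to-bottom as a script.

# %%
import math
x = math.sqrt(2)

# %%
y = x ** 2   # each cell sees state from cells run before it
```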


REPL = Read-Eval-Print Loop. So it could be IPython or just plain `python` in general; can’t say what OP is using, of course.


I also primarily write ML-focused Python. For me, having originally learned R and C at the same time, nothing has ever surpassed RStudio as a dev environment. For the past several years my preferred setup has been tmux and Vim with vim-slime in one pane and IPython in the other.

(Personally, and speaking only for myself, I hate Jupyter notebooks with a burning passion. I think they are one of the worst things ever to have happened to software development, and definitely the worst thing ever to have happened to ML/data science.)


Why do you hate Jupyter notebooks so much that it reaches “worst thing to ever have happened“ status?

Why do you love R Studio so much? (I’ve never used it, so no judgment)


> Why do you hate Jupyter notebooks so much that it reaches “worst thing to ever have happened“ status?

It's the "with a passion" part. A certain sub-population is prone to deciding that they love or hate something, based on some early experience or social context, and then every future experience with the thing is then strong-armed into supporting that supposed strong opinion. There is no rational reason for this. It's a very extreme form of confirmation bias.

It's pretty fascinating actually, as it's oftentimes employed by rather intelligent people. With a slight tendency towards the autistic end of the spectrum, but more research into this is certainly needed. Perhaps somebody working on a degree in sociology is interested in digging further?


I can't justify it - it's pure preference and opinion, irrationally held. A big part of it is probably that the type of programming I generally need to do is closer to using an overgrown calculator (with DataFrames) than doing proper Software Development to build a thing.

I much prefer having the code over _here_, and then having the results in a separate pane over _there_. Jupyter style mixing of inputs and outputs tends to confuse me, and in my hands gets very messy very quickly.

The slides in this light hearted talk from JupyterCon in 2018 probably give a better explanation than I could.

https://conferences.oreilly.com/jupyter/jup-ny/public/schedu...


I'd like to know too. I learned Python in Jupyter notebooks. It makes experimenting and incremental development much easier (IMO), provided you remember to account for the current state of the notebook, which sometimes has me pulling my hair out.


> provided you remember to account for the current state of the notebook

Notebook-style development is considered an anti-pattern in most situations for this reason. It is too easy to execute cells out of order. Even the original parent of this thread said they re-run the entire script every time to ensure they catch these issues. But it's not perfect, and you can have leftover state this way too if you're not careful.

My guess is that this is the reason the GP here is so against them. I find them helpful for data exploration, but that’s it.
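A tiny reproduction of the leftover-state failure mode, simulating the notebook namespace with a dict: a definition deleted from the source lives on in the running session, so dependent cells keep working until a fresh run fails.

```python
# Each exec() call stands in for running one cell in a live session.
session = {}                      # the notebook/REPL namespace
exec("rate = 0.5", session)       # cell 1 (later deleted from the file)
exec("half = rate / 2", session)  # cell 2 still runs: `rate` lingers

# A fresh kernel (empty namespace) exposes the bug:
fresh = {}
try:
    exec("half = rate / 2", fresh)
    stale_ok = True
except NameError:
    stale_ok = False
```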


Yeah, I can't argue against that - I've been stung way too many times by it. Never heard of the term "anti-pattern" before, but yeah, I get it.


Not OP but I often code in the REPL for python as well. Sometimes I'll stub out my code and just drop into an interactive debugger where I'm writing the next section.

In the Python debugger, if you type `interact`, it'll give you the normal Python REPL. This, combined with `help` and `dir`, is super useful for learning new frameworks/libraries and coding with your actual data.
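Those builtins work anywhere, not just inside `interact`; for example (shown on a plain list):

```python
# dir() lists an object's attributes; help() prints its docs.
# Uncomment breakpoint() to drop into pdb here, then type: interact
# breakpoint()
methods = [name for name in dir([]) if not name.startswith("_")]
```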


What interactive debugger IDE are you using that lets you enter “interact”?


pdb, the one shipped with python


Good idea to code in the debugger. Going to try that.



