As an ML-focused python dev I have never been able to break the habit of REPL-driven development, but I find it works really well for "building code that works" rather than coming up with a tower of abstractions immediately. A typical python development workflow for me is:
* Start with a blank `main` file and proceed linearly down the page, executing as I go.
* Gradually pull out visually awkward chunks of code and put them into functions with no arguments at the top of the file.
* If I need to parameterize them, add those parameters as needed - don't guess at what I might want to change later.
* Embrace duplication - don't unnecessarily add loops or abstractions.
* Once the file is ~500 LOC or becomes too dense, start to refactor a bit. Perhaps introduce some loops or some global variables.
* At all times, ensure the script is idempotent - just highlighting the entire page and spamming run should "do what I want" without causing trouble.
* Once the script has started to take shape, it can be time to bring some OO into it - perhaps there is an object or set of objects I want to pass around; I can make a class for that. Perhaps I can start to think about how to make the functionality more "generalized" and accessible to others via a package.
This is literally the only way I've ever found to be productive with green field development. If my first LOC has the word "class" or "def" in it - I am absolutely going to be ripping my hair out 12 hours later, guaranteed.
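For illustration, a rough sketch of what one of my files might look like partway through this process (the file, data, and names are all invented, not from a real project):

```python
# explore.py - run top to bottom; re-running the whole thing is always safe.
import json
from pathlib import Path

DATA_PATH = Path("data/runs.json")  # hypothetical input file

def load_runs():
    # Pulled out of the main flow only because it was visually awkward inline.
    with DATA_PATH.open() as f:
        return json.load(f)

def mean_score(run):
    # Got a parameter only when a second call site actually needed one.
    return sum(run["scores"]) / len(run["scores"])

runs = load_runs()

# Duplication embraced: one literal line per run, no premature loop.
print("baseline:", mean_score(runs[0]))
print("tuned:   ", mean_score(runs[1]))
```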
> * Gradually pull out visually awkward chunks of code and put them into functions with no arguments at the top of the file.
I have seen scientific code written in this manner in both Python and Fortran.
This may be some intuitive way to start off, and even complete the task at hand.
But for people trying to read, understand, and realistically, debug your code, this complicates things.
Because each no-argument function can only work via its side effects.
Your script becomes a succession of state transitions, and to understand it, you have to keep the intended state after each step in your head.
And in case of a mistake, you can't even call these functions individually with their intended arguments from the REPL. You have to set up all the state beforehand manually, call your no-arg function, and observe the state afterwards. That becomes more awkward the further down you are in your script, since all your dependencies are implicit and possibly even completely undocumented.
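A toy example of the difference (names invented): the first version can only be exercised by rebuilding the globals by hand, while the second is directly callable from the REPL:

```python
# Implicit-state style: step2() silently depends on step1() having run,
# because both communicate through a module-level dict.
state = {}

def step1():
    state["values"] = [3.0, 4.0, 5.0]  # stands in for loading real data

def step2():
    state["mean"] = sum(state["values"]) / len(state["values"])

step1()  # forget this, and step2() raises a KeyError far from the cause
step2()
print(state["mean"])

# Explicit style: testable from the REPL with any input you like.
def mean(values):
    return sum(values) / len(values)

print(mean([3.0, 4.0, 5.0]))
```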
Sure that's like the worst case scenario, but in practice the entire point is that as soon as it becomes difficult to reason about, that's when you start cleaning up the functions and add some obvious OO. You don't leave it as a mess.
> Embrace duplication - don't unnecessarily add loops or abstractions
I’ll usually make a function or perhaps tiny class as soon as I start reusing bits of code.
Apart from that, agree as stated. At my previous job (Python shop), a lot of the data engineers came from a Java background and had a tendency to think top-down. Many things were over-engineered ‘just because we might need it’:
- Factory classes used only once or twice in entire code base
- Let's make an AbstractReaderInterface because we might want to abstract the file type or location later (while 100% of files are Parquet on S3)
I've really enjoyed using dataclasses and Pydantic's BaseModel prolifically, and adding type hints (coupled with type checks in CI).
Model the data, write a well structured imperative workflow, set up CI, write unit tests, enforce typing. Add OOP if needed, then close ticket.
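For instance, a minimal sketch of that "model the data first" step with a stdlib dataclass (the field names here are invented):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParquetFile:
    # Model the data up front; a type checker in CI catches misuse early.
    bucket: str
    key: str
    size_bytes: int

def total_size(files: list[ParquetFile]) -> int:
    return sum(f.size_bytes for f in files)

files = [
    ParquetFile("my-bucket", "events/2024/01.parquet", 1_024),
    ParquetFile("my-bucket", "events/2024/02.parquet", 2_048),
]
print(total_size(files))  # 3072
```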
Agreed. The general principle is (and it's a hard balance, of course, to avoid just slapdash work): don't do work up front you might not need. You will make your current work take longer than it should, which you will (correctly) be blamed for, and any time you save in the future you won't get the credit for.
This isn't a cynical statement; it's just my experience. Use it to your advantage!
People are getting caught up on the loops thing. All I meant was in my line of work I often end up with many special cases of general processes. Writing a loop prematurely always bites me - I end up writing control flow for handling the one-offs, it somehow always becomes more obtuse than just listing things out literally.
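A contrived sketch of what I mean (the helpers are stand-ins, not real code):

```python
def load_split(name: str) -> list[int]:
    # Stand-in for real data loading.
    return {"train": [1, 2, 2], "val": [3], "test": [4, 4]}[name]

def export(rows: list[int], name: str) -> None:
    print(name, rows)

# Premature loop: the one-off for "test" turns into control flow.
for name in ["train", "val", "test"]:
    rows = load_split(name)
    if name == "test":
        rows = sorted(set(rows))  # dedupe only this split
    export(rows, name)

# Listed out literally: each special case stays local and obvious.
export(load_split("train"), "train")
export(load_split("val"), "val")
export(sorted(set(load_split("test"))), "test")
```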
Goto. In the end it's all about manipulation of the instruction pointer.
Loops as deduplication is a very specific subset of looping that is very popular in some languages and almost nonexistent in others. If you don't have destructuring and convenient list literals you might never ever see it in the wild.
But even the Java 7 version (iterating an ImmutableList.of multi-nested Maps.immutableEntry of new FunctionN) can be workable despite its hilarious amount of repeated type annotations, if you have learned to stop worrying and love the tooling. Stuff like TypeScript makes it a breeze, so much so that one might occasionally forget that it's not the regular form of looping.
> Embrace duplication - don't unnecessarily add loops or abstractions.
> Once the file is ~500 LOC or becomes too dense, start to refactor a bit. Perhaps introduce some loops or some global variables.
I agree with most of it, but if I "embrace duplication" I can reach 500+ LOC in half an afternoon :P. It has really paid off for me to start some degree of abstraction (generally not OO yet) early enough. Tbf, it is easier for me to tidy up the code with some abstractions than to ensure the "main"-titled script runs from beginning to end without errors at every point in time, which imo can hinder experimentation more than abstraction does. But it also depends on what I do, I guess - the greener the field, the more I feel this way.
Wow, not an ML person, but controls and robotics. Yet this describes my workflow for a lot of things almost to a tee. Even down to avoiding loops. I tend to do that when I want to run the same simulation or analysis for a couple of variables or datasets. It's interesting, because in the past I was a lot more prone to turn that into a loop early on. But this makes your code brittle. You'll want to do something slightly different for the two datasets, which means a bunch of conditionals in your loop. It's actually really similar to the problems you get with boolean flags when you try to abstract into a function too soon. It actually takes discipline for me to commit to copy-paste, but I think it pays off.
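That boolean-flag failure mode, sketched with made-up analyses:

```python
# Abstracted too soon: a flag forks behavior inside the shared function,
# and every new one-off adds another branch.
def run_analysis(data: list[float], baseline: bool = False) -> float:
    if baseline:
        data = [x - min(data) for x in data]  # baseline-only tweak
    return sum(data) / len(data)

# Copy-pasted instead: each variant can drift freely, no flags to thread.
def run_experiment(data: list[float]) -> float:
    return sum(data) / len(data)

def run_baseline(data: list[float]) -> float:
    shifted = [x - min(data) for x in data]
    return sum(shifted) / len(shifted)

print(run_experiment([1.0, 2.0, 3.0]), run_baseline([1.0, 2.0, 3.0]))
```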
I too agree with pretty much everything you say. Just want to add that I pretty much solely use ptpython[0]. It handles line breaks in pasted code, offers vim (or emacs) bindings and syntax highlighting, and much more.
Pretty much follow an identical process. When I do finally rewrite the code, after getting a working version, the duplication pretty much screams clean me up and simplify/generalize. I have never been able to just see the whole thing before I start. The process itself teaches you things.
I usually use VS code and the "interactive" python functionality (not jupyter). I highlight code and execute just that code with a hotkey. Works just as well with any kind of vim-slime like functionality.
I also primarily write ML-focused Python. For me, having originally learned R and C at the same time, nothing has ever surpassed RStudio as a dev environment. For the past several years my preferred setup has been tmux and Vim with vim-slime in one pane and IPython in the other.
(Personally, and speaking only for myself, I hate Jupyter notebooks with a burning passion. I think they are one of the worst things ever to have happened to software development, and definitely the worst thing ever to have happened to ML/data science.)
> Why do you hate Jupyter notebooks so much that it reaches “worst thing to ever have happened“ status?
It's the "with a passion" part. A certain sub-population is prone to deciding that they love or hate something, based on some early experience or social context, and then every future experience with the thing is then strong-armed into supporting that supposed strong opinion. There is no rational reason for this. It's a very extreme form of confirmation bias.
It's pretty fascinating actually, as it's often employed by rather intelligent people. With a slight tendency towards the autistic end of the spectrum, though more research into this is certainly needed. Perhaps somebody working on a degree in sociology is interested in digging further?
I can't justify it - it's pure preference and opinion, irrationally held. A big part of it is probably that the type of programming I generally need to do is closer to using an overgrown calculator (with DataFrames) than doing proper Software Development to build a thing.
I much prefer having the code over _here_, and then having the results in a separate pane over _there_. Jupyter style mixing of inputs and outputs tends to confuse me, and in my hands gets very messy very quickly.
The slides in this light-hearted talk from JupyterCon in 2018 probably give a better explanation than I could.
I'd like to know too. I learned python in jupyter notebooks. It makes experimenting and incremental development much easier (IMO), provided you remember to account for the current state of the notebook, which sometimes has me pulling my hair out.
> provided you remember to account for the current state of the notebook
Notebook-style development is considered an anti-pattern in most situations for this reason. It is too easy to execute cells out of order. Even the original parent of this thread said they re-run the entire script every time to ensure they catch these issues. But it's not perfect, and you can have leftover state this way too if you're not careful.
My guess is that this is the reason the GP here is so against them. I find them helpful for data exploration, but that’s it.
Not OP but I often code in the REPL for python as well. Sometimes I'll stub out my code and just drop into an interactive debugger where I'm writing the next section.
In the python debugger, if you type `interact`, it'll give you the normal python repl. This, combined with `help` and `dir`, is super useful for learning new frameworks/libraries and coding with your actual data.
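For anyone who hasn't tried it, a small example of the pattern (the function and data are made up):

```python
import pdb

def summarize(rows):
    cleaned = [r.strip().lower() for r in rows]
    # Pause here while writing the next section; at the (Pdb) prompt,
    # type `interact` for a full REPL, then use help()/dir() on `cleaned`.
    pdb.set_trace()
    return cleaned

summarize(["  Alpha ", "BETA  "])
```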