> Moreover, you may quickly realize much of this work is repetitive and while time-consuming, is “easy”. In fact, most analyses involve a great deal of time to understand the data, clean it and organize it. You may spend a minimal amount of time doing the “fun” parts that data scientists think of: complex statistics, machine learning and experimentation with tangible results.
This. Universities and online challenges provide clean labeled data, and score on model performance. The real world will provide you... “real data” and score you (hopefully) by impact. Real data work requires much more than modeling. Understanding the data, the business, and the value you create is important.
As per #6, better data and model infrastructure is crucial in keeping the time spent on these activities manageable, but I do think they’re important parts of the job.
I’ve seen data science teams at other companies working for years on topics that never see production because they only saw modeling as their responsibility. Even the best data and infrastructure in the world won’t help if data scientists do not feel co-responsible for the realization of measurable value for their business.
Training integrative data professionals could be a great opportunity for bootcamps. Universities will (understandably) focus on the academically interesting topic of models, while companies will increasingly realize they need people with skills across the data value chain. I know I would be interested in such profiles. :)
I took a data visualisation class in uni that handled this really cleverly. The second assignment sounded very easy. The teacher provided links to the sources where we could find data.
Most people figured that with such a simple assignment (not significantly harder than the first one, which was also easy-ish) they could put off doing it until the last moment.
Most people failed.
This real world data needed hours upon hours of cleaning before it was in any way useable. Of course, the teacher knew this, gave bonus points to the ones who did start in time, and then extended the deadline as he had expected to from the start.
Never again will I underestimate the dirtiness of real world data. One of the best teachers I had.
This is universal to STEM degrees I think. In mechanical engineering classes you analyze a beam, in real life you analyze an assembly with 50 components that have undergone 100 revisions with 20 different materials and loading from 4 directions that vary with time. Oh, and you have 4 sensors to give you information to analyze critical stresses. But one of them is broken, and Bob who can fix it is on PTO until next Monday, so...
Internships are supposed to fill this gap but it'd be nice if all students could get a taste of real world systems and data. For tech, maybe if they could partner with the IT department at the school to get them exposed to real, messy data. Maybe there are some teaching datasets with over a billion rows that people could play around with.
The biggest surprise to me when I got out of school was how messy things were - data, systems, management, priorities...everything.
When I went back to grad school, we had arguments about the assumptions. It was a total 180 from undergrad, and much more useful. So when I came out of grad school, I was able to deal with the ambiguities - maybe even thrived because I understood them.
I majored in nonprofit management and every class had a required field work component with an area charity. I learned so much from the combination of intense coursework and real world experience. Now that I'm the head of data science at a corporation, I wish such integration existed in this field.
> This is universal to STEM degrees I think. In mechanical engineering classes you analyze a beam, in real life you ...
Hard to believe this. Don't these degrees require rigorous laboratory assignments where the student learns to differentiate a best-case scenario from real-world uncertainties? STEM is not just some IT certification.
Hmmm. We had a whole course on measurement systems that got to the heart of understanding that the source of your data, and its inevitable bias/error, is more important than just crunching the data as given. And that was in a typical four-year degree.
Not really. MechE courses are really theoretical, and the labs are focused on just being enough to demo the theories. Most of my professors had never worked in industry, they had been in academia their entire lives. Even they wouldn't know how to bridge the gap.
In an ideal world, we'd have separate tracks for people entering industry versus academia/research, but that's a long way off.
That's insane. ME degrees that I know seem to be defined by industry (i.e. application of theory). Nobody pursues that degree to stay in academia/research. Anyway, you can always pursue an advanced degree if you want to stay in academia. Don't get it twisted though - STEM is not a vocation, as per your suggestion that "people entering industry" deserve a special path.
Edit: the comment I replied to has since been edited to show how the teacher understood what he was doing, and how he made it a teachable lesson and not a punitive one.
Not to miss the point, but I don't think "All my students failed" is the mark of a good teacher. It sounds like the teacher failed to prepare their students for the nature of the assignment. Perhaps he was as surprised as they were when they all failed, as I doubt failing most of his class was his intention.
You are being downvoted but you are exactly on point. If some fail, they may be bad students; but if the majority of my students fail, they're not bad students - it is me who is a bad teacher.
> Of course, the teacher knew this, gave bonus points to the ones who did start in time, and then extended the deadline as he had expected to from the start.
I think what he meant is they 'failed' to get it completed on time and it was meant as a teaching lesson.
Agreed, and the now edited comment illustrates how the teacher made it a safe lesson. That portion wasn't in the comment when I replied, and it sounded more like the teacher simply failed to prepare their students.
"Of course, the teacher knew this, gave bonus points to the ones who did start in time, and then extended the deadline as he had expected to from the start."
That wasn't in the original comment. It has been edited since I replied, which is fine. I do it all the time, sometimes you miss that someone replied during your editing.
Yeah, plenty of time for workplaces to do that for you. I can count on one hand the number of times something has been a hard deadline. This teacher taught a valuable lesson usable for the rest of the students' careers. The "most students shouldn't fail" mentality has led to professors I know personally questioning the caliber of student they are receiving - and this is a top 30 program I'm referring to. More people should fail; maybe they'd start treating things seriously and the problem of underqualified technical applicants would resolve itself.
I’m currently preparing a data visualization course to be taught this fall, and I would love to hear more about this! If you’d be willing to share some of those resources or the contact information for your professor, I’d really appreciate it. You can find contact info at the link in my profile :)
>You may spend a minimal amount of time doing the “fun” parts that data scientists think of: complex statistics, machine learning and experimentation with tangible results.
I don't get why people consider building a model to be the "fun" part. That's mostly feeding data in, watching a loading screen, and then observing the output.
That's not fun, that's boring. The fun part is looking at the data and gleaning all these potential patterns from it, seeing what potential is there and what could be. Likewise, learning the business side and seeing what is possible that no one has considered is great fun too.
My favorite part is feature engineering. Pre-processing and cleaning is fun too, but morphing the data into formats that extract a diamond from coal is a lot of fun, and what data science is all about. Clicking go on some ML algo is just icing on the cake, seeing it reveal bits maybe even I overlooked in the data.
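For a concrete flavour, here's a tiny sketch of what I mean by morphing raw data into model-ready features (the column names and the pandas approach are just illustrative assumptions, not from any real project):

```python
import pandas as pd

# Hypothetical raw transactions table: one row per purchase.
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "purchased_at": pd.to_datetime(
        ["2021-01-03", "2021-02-14", "2021-01-20", "2021-01-21", "2021-03-01"]
    ),
    "amount": [20.0, 35.0, 5.0, 7.5, 120.0],
})

# Morph raw events into per-customer features a model can actually use.
features = raw.groupby("customer_id").agg(
    n_purchases=("amount", "size"),
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
    days_active=("purchased_at", lambda s: (s.max() - s.min()).days),
)
features["spend_per_active_day"] = (
    features["total_spend"] / features["days_active"].clip(lower=1)
)
print(features)
```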
If you like ML why not be an MLE? That's what MLEs do, and they're a more desirable job. DS is all about the research, discovering and learning new information, and making the impossible possible.
The standard whatever.fit(X, y) isn't very appealing but there are much more bespoke models that require creative engagement with stats/CS knowledge, e.g. Bayesian hierarchical models or deep learning models that are more complicated than what can be copy/pasted from Medium.
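To give a flavour of what I mean by bespoke, here's a rough sketch of a small Bayesian hierarchical model (the data, the store/region structure, and the use of PyMC are all assumptions of mine, not from any particular project):

```python
import numpy as np
import pymc as pm

# Hypothetical data: conversion counts per store, with stores nested in regions.
region_idx = np.array([0, 0, 1, 1, 2])       # which region each store belongs to
trials     = np.array([200, 150, 300, 120, 250])
successes  = np.array([23, 17, 45, 11, 30])

with pm.Model() as model:
    # Global prior shared across regions
    mu = pm.Normal("mu", 0.0, 1.5)
    sigma = pm.HalfNormal("sigma", 1.0)
    # Region-level effects drawn from the global prior (partial pooling)
    region_effect = pm.Normal("region_effect", mu, sigma, shape=3)
    # Per-store conversion probability via the store's region
    p = pm.Deterministic("p", pm.math.invlogit(region_effect[region_idx]))
    pm.Binomial("obs", n=trials, p=p, observed=successes)
    idata = pm.sample(1000, tune=1000)
```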
I've done a lot of ensemble and stacked ensemble learning. I've also used BERT and a couple of other advanced ML methods, but usually I resort to advanced feature engineering first if I can, so I get what you mean - but it's still not as fun to me as figuring out patterns in data.
It's sort of two-sided, I think. It can be fun to figure out _meaningful_ patterns in data. I don't really find it fun to figure out that "so and so didn't use software that understood NA values back in nineteen tickety two, so some NA values are NA because they're newer, and some NA values are 0 because 0 is just like NULL in somebody's head, and some NA values are -999 because that was a thing they did in the Before Times."
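That archaeology usually ends in a small normalisation step, something like this (hypothetical column name; pandas assumed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"rainfall_mm": [12.0, -999, 0, np.nan, 3.5, -999]})

# Legacy sentinel: -999 always meant "missing" in the Before Times.
df["rainfall_mm"] = df["rainfall_mm"].replace(-999, np.nan)

# 0 is ambiguous: it may be a real zero or a NULL stand-in, so it needs
# a domain decision rather than a blanket replace; here we only flag it.
df["zero_suspect"] = df["rainfall_mm"].eq(0)
print(df)
```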
MLE is a fairly new title that, as best I can tell, exists primarily in those few places that have a mature enough workflow to have people who can actually dedicate their time to the ML part and have other roles take care of the rest.
Everywhere else, there is only DS, and it involves everything.
To answer your first question though, the training and testing of these models is fun because it feels like a puzzle game: did all my understanding and preparation of the data (and the business) pay off and the model does its job as expected? Is there something I’m missing? What’s the simplest model + configuration I can use that produces acceptable results and what does that say about the problem space? Can I combine models in some way to get the results? Is nothing working because it’s an ultimately fruitless exercise and our hypothesis is wrong? Or is there something we’re missing that is in turn the reason the model is missing something? Etc etc.
Then as the output you get something that ingests some data and then makes a decision with it! That’s cool to me.
I get where you're coming from. I guess just the problem domain I'm in, and my experience level, I tend to get what I expect from a model, and if I don't I'm more like, "wtf?" which isn't anywhere as fun of a way to do that part of the process.
Also, I know what is possible and impossible before I start writing code (if you don't count EDA code). There are exceptions, like it should be possible but it turns out the data is bad, but it didn't look bad from the EDA. Thankfully I've never had that. I always perform a Feasibility Assessment before anything else.
Not to imply what you're doing is somehow incorrect. Problems can vary quite a bit and I recognize that. For example, there have been times where I've had to mine to see if anything is there, doing ML over it to validate a hypothesis then using that information to create a new hypothesis, rinse and repeat. That's scary, because I could turn up nothing. I haven't done a lot of mining I admit though. Usually my problems are much more obvious from the get go, or much more research intensive.
One time I did three months of reading papers on arxiv.org just to figure out if something was feasible and how to best do it. Though that was definitely not a standard problem.
> That's not fun, that's boring. The fun part is looking at the data and gleaning all these potential patterns from it, seeing what potential is there and what could be
Exactly! This is the reason why I love my job. It gets even better when you uncover a non-intuitive insight.
I have been in the data analytics space for 15+ years. The one mantra I always try to focus on is: what's the business impact of what our team is creating?
This is a simple yet very powerful rule that helps us quickly discard ideas that:
1. Do not have a robust testing mechanism. No model is useful unless it performs in the real world. Measuring this is a severely non-trivial problem with multiple operational considerations.
For example: are you able to run and manage true control/test groups? How do you build a “reverse” data pipeline to verify your models? And, if you are required to update model weights constantly, where and how will you update the model parameters? (A minimal sketch of a control/test comparison is at the end of this comment.)
2. Conversely, some of the most impactful products I worked on were probably delivered in simple Excel sheets or had just under 20 lines in my Jupyter notebook. Not every business problem demands a deep learning network. For example, we worked on a data-driven capacity forecasting exercise for a call centre. I can tell you that the sophistication of the model was the last thing on my mind, as I had to work on careful interpretation and data collection.
3. Data Science departments should sit closer to business than what currently appears to be the trend - at least business data science teams (as distinct from technical data teams focusing on product analytics to improve performance, etc.). Courses and academic programs, I think, have developed a bias towards tools and techniques without the underlying analytical and interpretative skills needed to work with data. For example, a new data scientist on my team delivered excellent code but she couldn't detect logical misses in the data (e.g. losing some data during processing, or using columns with almost all values missing).
On the other end of this spectrum, we are still at the lagging end of the hype bubble, so there are many top leaders who expect to plug in “data science” and realise billions of dollars in savings, new sales, etc.
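As the minimal sketch of the control/test comparison mentioned under point 1 (the numbers are made up, and statsmodels' proportions_ztest is just one convenient way to do it):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical outcome: conversions out of users exposed to the model (test)
# vs. users left on the old process (control).
conversions = np.array([380, 310])   # test, control
exposures   = np.array([5000, 5000])

stat, p_value = proportions_ztest(conversions, exposures)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
print(f"lift = {conversions[0]/exposures[0] - conversions[1]/exposures[1]:.3%}")
```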
There was a remark in the old-school Linear Algebra book we had in university (Edwards & Penney) that stuck with me, to the effect (I probably recall the details wrong) that one of the authors was once involved in data analysis of water samples collected from a bunch of rivers by 15 engineers, and it turned out no 6 of these engineers' measurements were internally consistent. The moral of the story was that real-world data is messy; you need to learn least squares and related methods to make sense of the data.
Now with "data science" you've taken a step further, and instead of applying the math to lab reports on meticulously filled out forms, you're going to aggregate all the messy sources you can get your hands on. Of course your headaches will multiply.
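For what it's worth, the least-squares idea from that anecdote is simple enough to sketch (synthetic numbers; np.linalg.lstsq is just one way to do the fit):

```python
import numpy as np

# Synthetic "river samples": several engineers measure concentration vs. flow,
# each with their own noise, and we fit one line through all of it.
rng = np.random.default_rng(0)
flow = rng.uniform(1, 10, size=60)
true_a, true_b = 2.5, 4.0
concentration = true_a * flow + true_b + rng.normal(0, 1.5, size=60)

# Least squares: solve min ||X @ beta - y||^2 for beta = (slope, intercept).
X = np.column_stack([flow, np.ones_like(flow)])
beta, residuals, rank, _ = np.linalg.lstsq(X, concentration, rcond=None)
print(f"fitted slope={beta[0]:.2f}, intercept={beta[1]:.2f}")
```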
>This. Universities and online challenges provide clean labeled data, and score on model performance.
The first homework assignment in the stats class I teach is to clean data that the class generated with directions they all perceived as clear. It's just about the most hated assignment I have ever given. Amazing how many ways there are to encode the gender of an experimental participant.
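The cleanup typically boils down to a mapping table, something like this (the responses here are made up; the real ones are worse):

```python
import pandas as pd

responses = pd.Series(["F", "female", " Female ", "m", "MALE", "man", "f.", "nonbinary", ""])

canonical = {
    "f": "female", "female": "female", "f.": "female",
    "m": "male", "male": "male", "man": "male",
    "nonbinary": "nonbinary",
}

cleaned = (responses.str.strip()
                    .str.lower()
                    .map(canonical)          # anything unmapped becomes NaN
                    .fillna("unknown"))
print(cleaned.value_counts())
```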
This rings true to me. I've seen a lot of models get built that are never used. Although in my experience it wasn't that data scientists didn't care about business value, it's just that data science often requires breaking down silos and asking other teams to change their behavior.
This article mentions that leadership often doesn't support data science, but I think it actually doesn't go far enough. Leadership doesn't just have to support the data scientists, it has to actually tell other teams to prioritize data science projects over what they are currently doing. Since these data science projects are riskier than standard projects, it makes sense that leadership doesn't often do this (and focusing on the standard projects could be the right call). However, it also means that it's very hard for data scientists to create business value.