I'm a data engineer. It's a fairly new role, so it's not well defined yet, but most data engineers write data pipelines to ingest data into a data warehouse and then transform it for the business to use.
I'm not sure why using a static language would make translating data types difficult, but I add as many type hints as possible to my Python, so I rarely do anything with dynamic types. I guess they're saying that for small tasks involving lots of types, most of your code in a static language will be type definitions, so a dynamic language lets you focus on writing the transformation code.
Thank you for the reply. Your definition of data engineer makes sense. From my experience, I would not call it a new role. People were doing similar things 25 years ago when building the first generation of "data warehouses". (Remember that term from the late 1990s!?)
I am surprised that you are using Python for data transformation. Isn't it too slow for huge data sets? (If you are using C/C++-backed libraries like Pandas/NumPy, then ignore this question.) When I have huge amounts of data, I always want to use something like C/C++/Rust/C#/Java to do the heavy lifting because it is so much faster than Python.
Yes, it's definitely a new word for an old concept, same as the term data scientist for data analyst or statistician.
I find Python is fast enough for small to medium datasets. I've normally worked with data that needs to be loaded each morning or sometimes hourly, so it doesn't matter whether the transformation takes 1 minute or 10 minutes. The better way is of course to dump the data into a data warehouse as soon as possible and then use SQL for everything, so I only use Python for things that SQL isn't suited for, like making HTTP requests.
Using a static language to manipulate complex types, particularly those sourced from a different type system (say complex nested Avro, SQL, or even complex JSON) is much more awkward when the types cannot be normalized into the language automatically as can be done with dynamic languages. Static languages require more a priori knowledge of data types, and are very awkward at handling collections with diverse type membership. Data has many forms in reality -- dynamic languages are much more effective at manipulating data on its own terms.
You realize every single thing that dynamically-typed languages can do with data types, statically-typed languages can do too? Except when it matters, they can also choose to do things dynamically-typed languages can't.
Lots of people assume static typing means creating domain types for the semantics of every single thing, and then complain that those types contain far more information than they need. Well, stop doing that. Create types that actually contain the information you need. Or use the existing ones. If you're deserializing JSON data, it turns out that the deserialization library already has a type for arbitrary JSON. Just use it, if all you're doing is translating that data to another format. Saying "this data is JSON I didn't bother to understand the internal content of" is a perfectly fine level to work at.
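To make that concrete, here is a minimal sketch in type-hinted Python of working at the "it's just JSON" level. The alias name JSON and the function flatten are made up for this example; the point is only that one generic type is enough to translate data you never modelled.

```python
import json
from typing import Union

# Recursive alias for "any JSON value"; the name JSON is an assumption of this sketch.
JSON = Union[dict[str, "JSON"], list["JSON"], str, int, float, bool, None]


def flatten(value: JSON, prefix: str = "") -> dict[str, str]:
    """Turn arbitrary JSON into flat path -> value pairs without modelling its contents."""
    if isinstance(value, dict):
        out: dict[str, str] = {}
        for key, child in value.items():
            out.update(flatten(child, f"{prefix}.{key}" if prefix else key))
        return out
    if isinstance(value, list):
        out = {}
        for index, child in enumerate(value):
            out.update(flatten(child, f"{prefix}[{index}]"))
        return out
    # Leaf: re-encode as JSON text so strings, numbers, booleans and null stay distinguishable.
    return {prefix: json.dumps(value)}


doc: JSON = json.loads('{"user": {"name": "Ada", "tags": ["x", "y"]}, "active": true}')
flat = flatten(doc)
```

Nothing here knows what a "user" is, yet a type checker can still verify that flatten only ever receives and returns the shapes it claims to.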
About monkeypatching, perhaps we have different definitions. From time to time, I need to modify a Java class from a dependency that I do not own/control. I copy the decompiled class into my project with the same package name. I make changes, then run. To me, this is monkeypatching for Java. Do you agree? If not, how is it different? I would like to learn. Honestly, I discovered that Java technique years ago by accident.
Another technique: While the JVM is running with a debugger attached, it is possible to inject a new version of a class. IDEs usually make this seamless. It also works when remote debugging. Do you consider this monkeypatching also?
> You can’t do monkeypatching or dynamically modify the inheritance chain of an object in a statically typed language.
There's no theoretical reason you can't. No languages that I know of provide that combination of features, because monkey-patching is a terrible idea for software engineering... But there's no theoretical reason you couldn't make it happen.
I think you've conflated static typing with a static language. They're not the same thing and can be analyzed separately.
So how would a statically typed language support conditionally adding methods at runtime? Let's say the code adds a method with name and parameters specified by user input at runtime. How could this possibly be checked at compile time?
You could add methods that nothing could call, sure. It would be like replacing the value with an instance of an anonymous subclass with additional methods. Not useful, but fully possible. Ok, it would be slightly useful if those methods were available to other things patched in at the same time. So yeah, exactly like introducing an anonymous subclass.
But monkey-patching is also often used to alter behaviors of existing things, and that could be done without needing new types.
You would need another feature in addition: the ability to change the runtime type tag of a value. Then monkey-patching would be changing the type of a value to a subclass that has overridden methods as you request. The subclasses could be named, but it wouldn't have much value. As you could repeatedly override methods on the same value, the names wouldn't be of much use, so you might as well make the subclass anonymous.
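For what it's worth, Python already exposes the "change the runtime type tag" operation, so the mechanics can be sketched there, minus the static checking. This is only an illustration of the idea, not a design for a typed language; Greeter and patch_greet are hypothetical names.

```python
class Greeter:
    def greet(self) -> str:
        return "hello"


def patch_greet(obj: Greeter) -> None:
    """Re-tag obj as a freshly built subclass that overrides greet()."""
    original_class = type(obj)

    def greet(self) -> str:
        # Reach the previous behaviour through the class being subclassed.
        return original_class.greet(self).upper() + "!"

    # type(name, bases, namespace) builds a new class at runtime; the name is
    # incidental, which is why an anonymous subclass would do just as well.
    patched_class = type(
        "Patched" + original_class.__name__, (original_class,), {"greet": greet}
    )
    obj.__class__ = patched_class  # change the runtime type tag of this one value


a = Greeter()
b = Greeter()
patch_greet(a)
```

After the call, a.greet() takes the overridden path while b is untouched, and a is still an instance of Greeter, so every existing caller keeps type-checking against the original interface.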
In another dimension, you could use that feature in combination with something rather like Ruby's metaclasses to change definitions globally in a statically-typed language.
I can't think of a language that works this way currently out there, but there's nothing impossible about the design. It's just that no one wants it.
In a dynamic language, everything is only defined at runtime.
Given that, a sketch of a statically-typed system would be something like... At the time a definition is added to the environment, you type check against known definitions. Future code can change implementations, as long as types remain compatible. (Probably invariantly, unless you want to include covariant/contravariant annotations in your type system...)
This doesn't change that much about a correct program in a dynamic language, except that it may impose some additional ordering requirements on code execution: all the various method definitions must be loaded before code using them is loaded. That's a bit stricter than the current requirement that the methods must be loaded before code using them is run. But the difference would be pretty tractable to code around.
And in exchange, you'd get immediate feedback on typos. Or even more complex cases, like failing to generate some method you had expected to create dynamically.
Ok, I can actually see some appeal here, though it's got nothing to do with monkey-patching.
I love using "mixed" dynamic/static typed languages in these scenarios... you can do that data manipulation without types, but benefit from types everywhere else... my two favourite "mixed" languages are Groovy on the JVM, and Dart elsewhere... Dart now has a very good type system, but still supports `dynamic` which makes it as easy as Python to manipulate data.
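A major problem with doing data transformation in statically typed languages is that it's easy to introduce issues during serialization and deserialization. If the file holds an object with name, value, and an extraProperty field, and you round-trip it through a DTO like this: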
class MyDTO {
    string name;
    string value;
}
var myObjs = DeserializeFromFile<MyDTO>(filePath);
SerializeToFile(myObjs, filePath2);
filePath2 would end up without the extraProperty field.
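The same loss can be reproduced in a few lines of Python with a dataclass standing in for the DTO. This is a sketch with in-memory strings instead of files; from_json and to_json are hypothetical helpers, not a real library API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class MyDTO:
    name: str
    value: str


def from_json(text: str) -> MyDTO:
    raw = json.loads(text)
    # Only the declared fields are copied; anything else in raw is silently ignored.
    return MyDTO(name=raw["name"], value=raw["value"])


def to_json(dto: MyDTO) -> str:
    return json.dumps(asdict(dto))


source = '{"name": "a", "value": "b", "extraProperty": "c"}'

# Typed round trip: extraProperty is gone.
typed_round_trip = json.loads(to_json(from_json(source)))

# Untyped round trip through a plain dict: everything survives.
untyped_round_trip = json.loads(json.dumps(json.loads(source)))
```

The typed path is only lossy because the DTO describes less than the data contains; the parser itself is happy to carry the extra field along if you let it.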
You can also write code like
function PrintFullname(person) {
    WriteLine(person.FirstName + " " + person.LastName)
}
And it will just work so long as the object has those properties. In a statically typed language, you’d have to have a version for each object type or be sure to have a thoughtful common interface between them, which is hard.
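Some type systems narrow that gap with structural typing, where the "common interface" is just a description of the properties you use. A minimal sketch with Python's typing.Protocol follows; HasFullName, Employee and Customer are invented for the example, and the static check itself would come from an external checker such as mypy.

```python
from dataclasses import dataclass
from typing import Protocol


class HasFullName(Protocol):
    # Hypothetical structural interface: any object with these two attributes matches.
    first_name: str
    last_name: str


def full_name(person: HasFullName) -> str:
    return person.first_name + " " + person.last_name


@dataclass
class Employee:
    first_name: str
    last_name: str
    salary: int


@dataclass
class Customer:
    first_name: str
    last_name: str
    email: str


names = [
    full_name(Employee("Ada", "Lovelace", 100)),
    full_name(Customer("Alan", "Turing", "alan@example.com")),
]
```

Neither class inherits from or even mentions HasFullName, so there is one function rather than one version per object type.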
All that being said, I generally prefer type safe static languages because that type system has saved my bacon on numerous occasions (it’s great at telling me I just changed something I use elsewhere).
You can write code in a statically typed language that treats the data as strings. The domain modelling is optional, just choose the level of detail that you need:
1. String
2. JSON
3. MyDTO
If you do choose 3, then you can catch serde errors using property-based testing.
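As a rough idea of what that looks like, here is a hand-rolled round-trip property check using only the Python standard library. A dedicated library such as Hypothesis would generate and shrink inputs for you; serialize, deserialize and random_text are hypothetical helpers written for this sketch.

```python
import json
import random
import string
from dataclasses import asdict, dataclass


@dataclass
class MyDTO:
    name: str
    value: str


def serialize(dto: MyDTO) -> str:
    return json.dumps(asdict(dto))


def deserialize(text: str) -> MyDTO:
    return MyDTO(**json.loads(text))


def random_text(rng: random.Random) -> str:
    # Mix printable ASCII with a few non-ASCII and escape-prone characters.
    alphabet = string.printable + "éß漢\u0000\""
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 20)))


def check_round_trip(trials: int = 500, seed: int = 0) -> int:
    """Property: deserialize(serialize(x)) == x for every generated x."""
    rng = random.Random(seed)
    for _ in range(trials):
        original = MyDTO(name=random_text(rng), value=random_text(rng))
        assert deserialize(serialize(original)) == original
    return trials


checked = check_round_trip()
```

Note that this property only covers values that start life as a MyDTO. It would not notice fields that exist in the source data but not in the DTO, so it complements rather than replaces choosing the right level of detail.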
"Most" (I mean "all", but meh - I'm sure there's some obscure exception somewhere) parsers will have the ability to swap between a strict DTO interpretation of some data, and the raw underlying data which is generally going to be something like a map of maps that resolves to strings at the leaf nodes. Both have their uses. The same can also be done easily enough by hand as well, if necessary.