The Rise of the Data Engineer
The introduction of AI into products and services – across all sectors – is creating new capabilities at a scale that software developers could never have dreamed of just a decade ago. But this development is not just about the tech.
The combination of AI and Cloud infrastructure is triggering a fascinating change in personnel that you cannot ignore; it is reshaping the roles of certain engineers and creating a newfound requirement that demands an entirely new engineering specialty.
Let’s review the evolution thus far. Prior to the age of the Cloud, things were simpler: Engineers were expected to manage production processes and worry about scale within the software itself. It made sense at the time because there were no frameworks that enabled the separation of software logic from compute resources. The software was tied directly to predefined, discrete hardware resources.
But today, in the age of the Cloud and Elastic Computing resources, we break engineers into more specialized, distinct teams to build software solutions, products, and services that take advantage of these Elastic Computing platforms:
Back-end Engineers are typically responsible for building the logic behind the software. Sometimes, depending on the specific application, part of this team will include algorithm experts. This will occur in projects where building the logic – and especially building logic that can scale – requires more than just “engineering” or simple “if this then that” logic. The need for this specialized expertise is a natural evolution based on the ever-growing complexity and demands of the software, and the dramatic increase of computing horsepower available to support it.
Front-end Engineers build the top application layer and the user interface. Building an engaging, logical and adaptable Man-Machine Interface does require considerable skill and is an important aspect of the development process. That said, I believe that this area still awaits a huge disruption and paradigm change, as the limitations of the browser interface have created significant hurdles to streamlined, efficient application development and production.
DevOps Engineers are responsible for scaling the software applet (the code container) onto the Elastic Cloud for deployment so that it can effortlessly cater to as many users as are expected, and elegantly handle as much load as needed. DevOps engineers typically neither know, nor need to know, much about the actual logic of the software they support.
So… What’s Changing?
AI challenges the organizational structure of roles we’ve just discussed, and that change is driven by one core factor: the role of data as a critical cog in the development engine.
Both Machine Learning and its more “cerebral” cousin, Deep Learning, are disciplines that leverage algorithms such as neural networks, which are, in turn, nourished by massive feeds of data to create and refine the logic for the core app. In Deep Learning, of course, the method goes further in its attempt to mimic how the human brain learns from the data it collects through experience and its senses. In effect, both these technologies eventually create their own logic pathways to complete a given task, and in this, replace the job of the Back-end Engineer as we knew it.
So who manages this new process? The simple answer is that we turn to the Data Scientist, whose job is to choose the right initial algorithms and then train, test and tune (and tune, and tune and tune…) them to optimize the algorithms to do their job of ultimately “spitting out” the software’s core application logic. His or her training jobs or experiments will couple a certain model (or neural network) with a particular dataset and a set of execution parameters.
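The coupling described above (a model, a dataset, and a set of execution parameters per training job) can be sketched as a simple data structure. This is a hypothetical illustration, not any particular platform's API; all names and parameter values here are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    """One training job: a model coupled with a dataset and execution parameters."""
    model_name: str       # which algorithm or network architecture to try (assumed name)
    dataset_path: str     # which data the model will train on (assumed path)
    learning_rate: float  # an execution parameter to tune (and tune, and tune...)
    batch_size: int
    epochs: int

# A Data Scientist typically sweeps many such configurations, varying
# one parameter at a time to find the best-performing combination:
experiments = [
    Experiment("resnet50", "data/train", lr, batch_size=64, epochs=20)
    for lr in (0.1, 0.01, 0.001)
]
```

Each entry in `experiments` is one job to be scheduled on compute resources, which is precisely where the story continues below.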
If only it were that easy. Indeed, this is not the end of our story.
Organizations have realized that it is not enough to hire a few good Data Scientists. What we have discovered is that we are missing one more piece of the puzzle: Someone to focus exclusively on the selection, optimization and management of the raw materials that these Data Scientists’ algorithms need to chew through, and then on scaling the experimentation process to test out the potential configurations needed. This is called building the Data Pipeline, and this isn’t a task to be plugged in later in the process, into a system that is up and running in deployment. Relevant, usable, scalable data pipelines are required for development from day one.
And it’s not a simple task; unlike the human brain, machine/deep learning algorithms need a lot of help in tagging or categorizing the data before it’s used. There are also a lot of algorithm configuration parameters that need to be tuned. All cutting-edge development notwithstanding, these are still very simplistic models created to solve only specific problems — not to actually “think” for themselves or demonstrate genuine human-style judgment when confronting the unexpected. Someone needs to help the algorithm “solve” for edge cases and data biases. Without it, the software can’t adjust to outliers and unexpected situations as the human brain “automagically” does.
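To make the tagging and edge-case work above concrete, here is a minimal, hypothetical sketch of a pre-training data step. The fraud-detection framing, field names, and thresholds are invented for illustration only; real pipelines are far more involved.

```python
def label_record(record: dict) -> dict:
    """Attach a category tag: learning algorithms need labeled data before training."""
    # Assumed rule for illustration: large transactions get flagged.
    record["label"] = "fraud" if record["amount"] > 10_000 else "ok"
    return record

def handle_edge_cases(records: list[dict]) -> list[dict]:
    """Outliers and malformed entries must be handled explicitly; the model
    will not 'automagically' adjust the way a human brain does."""
    return [r for r in records if r.get("amount") is not None and r["amount"] >= 0]

raw = [{"amount": 120}, {"amount": 15_000}, {"amount": -5}]
clean = [label_record(r) for r in handle_edge_cases(raw)]
```

Every rule in this sketch (what counts as malformed, where the label threshold sits) is a human decision that someone must own, which is exactly the gap the article identifies.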
This is why it’s a hardware issue as well; to effectively create machine/deep learning solutions, the organization now needs to leverage a lot of specialized hardware optimized for the task (currently, the vast majority of this is handled by GPUs). Some have begun to refer to this discipline as ML-Ops. From the get-go, leveraging the organization’s Cloud/Elastic compute resources becomes an issue during development, not something to be addressed only during deployment.
All these are challenges in search of an owner, and smart organizations are looking at their org charts to discover that the owner of this task is not necessarily there. It’s certainly not a Back-end Engineer’s job, as it is not about developing application logic. And it is not, in truth, the responsibility of conventional DevOps engineers, who traditionally do not involve themselves in the underlying software logic or use-cases, or the underlying data sources or the required pipes to connect it all.
Moreover, unlike traditional DevOps (where the core task is to replicate the core software applet in as many instances as needed and maintain high availability), here the core task is twofold: replicating big training jobs, and running multiple, ongoing, disparate training and experiment jobs in parallel, so as to enable an efficient and timely development process.
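The contrast is worth sketching: rather than N identical copies of one applet, the workload is many different jobs running side by side. A minimal illustration using Python's standard `concurrent.futures` (the job function and its configurations are placeholders, not a real training system):

```python
from concurrent.futures import ThreadPoolExecutor

def run_training_job(config: dict) -> dict:
    """Stand-in for one training job; a real job would train a model on a cluster."""
    # Placeholder computation in lieu of actual training.
    return {"job": config["name"], "lr": config["lr"], "status": "done"}

# Disparate experiments, not identical replicas:
configs = [
    {"name": "exp-a", "lr": 0.1},
    {"name": "exp-b", "lr": 0.01},
    {"name": "exp-c", "lr": 0.001},
]

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_training_job, configs))
```

Scheduling, prioritizing, and tearing down this kind of heterogeneous fleet, on GPU-backed elastic compute, is the orchestration problem the new role must own.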
The Curtain Rises…
So, at long last, we’ve arrived at the second phase of organizational change at the core of AI development following the introduction of Data Scientists: Enter the Data Engineer.
This newly emerging class of engineer (often called a Data Engineer, though the industry is still settling on a term) is tasked with building the data pipelines and the scaling mechanisms that leverage elastic compute resources for AI workloads. Their job is to supply the Data Scientists with the Cloud-based or on-prem data and infrastructure so their algorithms can effectively access the data and run their experiments to build the ultimate model for deployment.
Data Engineers, then, need to deal, on one hand, with data management (what used to be the DBA's domain) and, on the other, with DevOps-like tasks that require specific hardware configured to scale with the software, along with the orchestration of many different (but related) tasks for each software application.
Organizations that have recognized this need are now moving quickly to restructure their AI teams by introducing Data Engineers into the process; this adjustment gives them a clear advantage over the competition still struggling ‒ and failing ‒ to force their Data Science team to effectively function within their existing IT or R&D organizational structure.