Everything you need to know about ETL developers
Humans interfacing with machines generate information. Many humans interfacing with many machines over the internet produce a staggering amount of information called "Big Data". It is a source of tremendous power, the digital equivalent of dormant nuclear fuel lying dispersed across databases.
Extracting, compiling, and processing Big Data reveals valuable, actionable insights that take the business to the next level. For example, Big Data can be used for analytics or machine learning (ML) models. The infrastructure and the procedures for processing Big Data — the nuclear reactor, as it were — are called "data science".
The first step in turning raw data into actionable insights is called "Extract-Transform-Load" and the person who lays the groundwork for it is called an "ETL developer". This article will explain:
- the ins and outs of data science
- the definition of business intelligence (BI)
- ETL developer’s role on the BI team
- skills needed for an ETL developer resume and job description
- whether ETL developer is a stressful job
- whether ETL and SQL are the same
- data marts and how they compose a data warehouse
The ins and outs of data science
According to IBM, data science is a way to discover data and parse it into business insights by using multiple disciplines, which can involve math, AI, ML, and specialized programming. No matter what tools and methods are used, the goal is always the same — discover trends that are actionable on the business level.
The simplest way to store data is in a plain spreadsheet, but numerous problems usually arise when the business tries to scale or pivot. For example, if the business decides to add another source of data, the spreadsheet might get corrupted unless the data is sanitized before inclusion. Adding an automated analysis or reporting tool may also prove troublesome due to poor performance or compatibility.
Therefore, the data storage, formatting, analysis, and reporting must be planned in a way that lets the business experiment while enjoying the benefits of a sturdy infrastructure. Layman users must also be able to access the data, which is the task of the Business Intelligence team.
What is Business Intelligence?
As explained by Paul Turley, consultant, manager, and Microsoft Data Platform MVP, and Douglas McDowell, CEO of SolidQ, an achieved data science milestone can be a smashing success and an abject failure at the same time. The internal team of experts can consider a metric, such as data mapping, a success, but if clients can’t access or use it, it is ultimately a failure.
Paul says that, in his experience, successful BI teams are small and focus on quick iteration without much managerial overhead. Such teams have a less formal structure and excel in environments where ambitious managers or developers don’t push them past the breaking point trying to impress the higher-ups. In larger BI teams, roles may be narrowed to a specific part of data science.
ETL developer’s role on the BI team
In the grand scheme of BI, an ETL developer is a data infrastructure specialist responsible for:
- designing data storage systems
- defining data warehouse systems
- setting out physical data flow models
- identifying data storage requirements
- explaining to stakeholders what is going on in layman’s terms
Paul says an ETL developer should not be expected to achieve perfection. Rather, the BI team should balance quality against time investment to reach a "good enough" state and move on. If there’s a need for improvement later, the ETL developer can iterate cautiously on the existing design.
The ETL developer, a technical solution specialist
ETL developers are tasked with finding technical solutions to three problems:
- how to find, catalog, and extract data from various databases and systems
- how to format or otherwise transform that data into the appropriate form
- how to load the data into the needed system
Any database or process can be used in all stages of Extract-Transform-Load, though scaling may prove to be a challenge if the choice is incorrect. Paul notes that ETL developers doing a fine job can make the job of others down the line easier by identifying and eliminating bottlenecks, though that isn’t as easy as it appears.
In other words, an ETL developer implements technical solutions for extract-transform-load problems. Cataloging and compiling scattered dynamic or static information requires knowledge of indexing tools and procedures. Extraction and formatting of compiled information to mesh with the existing data require intricate knowledge of data file types and their internal structure. Finally, loading the data into the data infrastructure in conformance with business needs asks for hands-on management of the corresponding hardware and software.
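The three stages described above can be sketched as a minimal pipeline. This is an illustrative example only; the file names, schema, and cleaning rules are invented for the sketch, not taken from any real system:

```python
import csv
import sqlite3

def extract(csv_path):
    """Extract: read raw rows from a source (here, a hypothetical CSV export)."""
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: sanitize and reshape records into the target schema."""
    cleaned = []
    for row in rows:
        name = row["name"].strip().title()
        if not name:  # drop rows with no usable name
            continue
        cleaned.append({"name": name, "visits": int(row["visits"])})
    return cleaned

def load(rows, db_path):
    """Load: write the transformed records into the target database."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS traffic (name TEXT, visits INTEGER)")
    con.executemany("INSERT INTO traffic VALUES (:name, :visits)", rows)
    con.commit()
    con.close()
```

In a real deployment each stage hides most of the complexity the article describes: the extract step must know the source file types, the transform step must mesh with existing data, and the load step must respect the target infrastructure.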
ETL testing
ETL testing involves closely cooperating with the QA team to ensure proper performance of the system prior to launch. For example, the ETL developer in charge would perform data flow validation checks and download/upload speed tests, alongside overall system stress tests.
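A data flow validation check at its simplest confirms that records survive the load intact. A minimal sketch, assuming SQLite connections and an invented table name:

```python
import sqlite3

def validate_row_counts(source, target, table):
    """Data flow validation: every source row should reach the target."""
    count = f"SELECT COUNT(*) FROM {table}"  # table name is trusted input here
    return source.execute(count).fetchone()[0] == target.execute(count).fetchone()[0]

def validate_no_null_keys(con, table, key):
    """No record should lose its key column during the load."""
    q = f"SELECT COUNT(*) FROM {table} WHERE {key} IS NULL"
    return con.execute(q).fetchone()[0] == 0
```

Real ETL test suites add many more checks of this shape: referential integrity, duplicate detection, and value-range assertions, usually run automatically after every load.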
Other roles on the data science team
Some other roles on a data science team are:
- data analyst (statistician that discovers trends in the data)
- data developer (software engineer that builds products that handle data)
- data scientist (statistician that formulates best practices for handling data)
- product owner (marketer who represents the customer’s point of view in the team)
- data architect (IT professional who develops databases)
- data engineer (an umbrella term for data architects and similar roles, including ETL developers)
- model designer (defines hierarchies and relationships between data entities)
Those only scratch the surface, and each business will adopt different additional specialists; roles may even be interchangeable to the point that the title "ETL developer" becomes meaningless. For example, according to a 2017 presentation by Airbnb data scientist Martin Daniel, his company focused on experimentation with the data, expanding its data science team with academics from the social sciences, economics, and physics, and even professional poker players; there was no mention of an ETL developer.
The case of Airbnb
Daniel explains that Airbnb created a data science team that was radically different from that employed by a social media company such as Facebook. He reveals that Airbnb created small and nimble data science sub-teams that were embedded in the company but fully autonomous in their area of work.
Every year, Airbnb had different goals and sub-teams worked with one another to achieve those goals. All their work was made open-source and published online. Daniel breaks down their work process into three stages:
- analysis
- experimentation
- data products
Example of analysis
Daniel provides an example of a question that was answered by Airbnb’s data science team, "I have $X to increase supply in Europe. In which markets should I invest first?" Answering it required understanding the dynamics of supply and demand to maximize return, which he represents with the following formula:
B = A * S^y * D^a
(bookings = matching efficiency parameter * supply to the power of elasticity of supply * demand to the power of elasticity of demand)
Data scientists gathered market data and fed it into the formula to reveal not just the markets with the greatest immediate return on investment, but also the markets that would provide the greatest long-term return. Daniel gives Cannes, a city in France renowned for hosting an annual film festival visited by celebrities and globetrotters, as an example of a market worth investing in.
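To illustrate how the formula ranks markets for the "where should I invest supply first?" question, the sketch below computes the extra bookings produced by adding one unit of supply to each market. All the numbers, market names, and elasticity values are invented for the example; they are not Airbnb’s data:

```python
def bookings(eff, supply, demand, y, a):
    """B = A * S^y * D^a: predicted bookings for one market."""
    return eff * supply ** y * demand ** a

def marginal_return(eff, supply, demand, y, a, delta=1.0):
    """Extra bookings gained by adding `delta` units of supply to a market."""
    return (bookings(eff, supply + delta, demand, y, a)
            - bookings(eff, supply, demand, y, a))

# Hypothetical markets: (matching efficiency A, supply S, demand D),
# with illustrative elasticities y = a = 0.5.
markets = {"Cannes": (0.9, 400, 900), "Lyon": (0.9, 800, 600)}
best = max(markets, key=lambda m: marginal_return(*markets[m], 0.5, 0.5))
```

With these made-up inputs, the supply-constrained, high-demand market wins the marginal comparison, which matches the intuition behind investing where supply lags demand.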
Experimenting with the data
Launching a new feature can lead to an increase in the desired metric, such as bookings, but without experimentation there is no way to know its actual impact. Daniel explains that experimenting allowed the data scientists at Airbnb to understand causality, with the company running 400 experiments at any given time. In fact, according to him, simply visiting Airbnb’s website enrolls the visitor in an experiment of some kind.
Data products
In London, Airbnb hosts were limited by the city officials to 90 days of hosting before they were considered professional hosts. To prevent hosts from removing and relisting their property to bypass that limit, Daniel created Jumeau, a tool that finds "twins" as properties are listed. The tool analyzed data but the decision for delisting a property was made by a human, minimizing the rate of false positives.
Data marts, warehouses, and lakes
Oracle defines a data mart as a unit of data centered on a single line, theme, or subject of business. For example, one data mart is for traffic metrics, another for demographic information, and so on; many of them put together, plus tools to parse the data, are called a data warehouse. The former serves a sub-team in the company while the latter serves the entire company.
A data lake is a repository of unstructured, unprocessed data. Analytics and reporting tools can grab the data from the lake as needed to generate reports or create ML frameworks on data storage systems.
The ETL developer is the one who validates all data system designs and activities and conducts user interviews to understand user and business needs. This person must understand data models, which describe how the data is represented in the system, to choose the appropriate underlying technologies for marts, warehouses, and lakes.
Building a data warehouse
According to the DataSchool guide, building a data warehouse starts with combining two databases with related data into one. If one database contains names and the other contains roles, combining them gives more context. Filtering deprecated rows, dropping unused columns, and sanitizing all the data in both databases makes them join seamlessly. This can be done by ETL developers in a series of SQL statements.
The ScienceSoft data warehouse building guide projects the time needed at 3–12 months and the cost at $70,000+. The suggested team size is 7 people. There is no mention of an ETL developer but the diagram does display the ETL step. The finished data warehouse has three layers:
- data source
- staging area
- data storage
Data marts can be built prior to the data warehouse database or after it. The former approach is cheaper and faster but may result in redundant data while the latter is more consistent and provides company-wide support for data analysts.
In any case, ETL developers must balance out the data warehouse, which provides more secure and streamlined access to data, with data marts that provide more accessibility to the data warehousing environment.
How does an ETL developer fit in?
The ETL developer has a major role in defining the data flow structure and processes while managing the development, testing, and implementation of the necessary tools. However, ETL developer jobs are not prominent or flashy, and ETL developers most often work as part of the BI team, not solo.
The actual tasks put in front of the ETL developer depend on the system being built, meaning collaborative and team skills are a must on the ETL developer’s resume. One of the mandatory ETL developer skills is understanding data formats and how they will load into the system to avoid data issues. Once those tasks are complete, the ETL developer may be given an advisory or analytical role.
ETL developer resume
This VelvetJobs page allegedly shows 251 ETL developer resume samples, though the wording in them indicates the majority are actually ETL developer job postings. In any case, the built-in browser search showed the following word frequency related to ETL developers:
- support (169 times)
- Informatica (135 times)
- Oracle (131 times)
- develop (105 times)
- analysis (98 times)
- UNIX (84 times)
- maintain (54 times)
- Java (46 times)
- Talend (37 times)
- communicate (36 times)
- analyze (34 times)
- assist (34 times)
- mentor (33 times)
- Pentaho (28 times)
- manage (27 times)
- Linux (25 times)
- coordinate (17 times)
- troubleshoot (17 times)
- demonstrate (16 times)
- UNIX/Linux (11 times)
- Perl (11 times)
- research (11 times)
- monitor (8 times)
Based on the above breakdown of word frequency in ETL developer resumes and job postings, ETL developers primarily serve in support, analysis, and maintenance capacities.
Is ETL developer a stressful job?
Without automation, yes. ETL tools are the go-to solution for automated testing and reporting of data issues in data science. There are many such tools for ETL developers, but they should all share the same core features:
- data connectors for sources such as APIs, files, and databases
- a data validation engine that compares data across a range of formats
- a proper interface that minimizes the time spent micromanaging the tool
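The second feature, a validation engine that compares data across formats, can be sketched in a few lines. The file contents and the `id` key below are hypothetical:

```python
import csv
import io
import json

def records_from_csv(text):
    """Parse a CSV extract into a list of dicts (all values arrive as strings)."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]

def records_from_json(text):
    """Parse a JSON extract into a list of dicts."""
    return json.loads(text)

def same_records(a, b, key):
    """Compare two record sets by key, ignoring row order and source format."""
    index = lambda records: {r[key]: r for r in records}
    return index(a) == index(b)
```

A production validation engine would additionally normalize types (CSV delivers everything as strings), handle nested structures, and report which records differ rather than returning a bare boolean.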
The ETL developer should also be ready to switch tasks on the fly and help out business intelligence developers at a moment’s notice.
Does ETL developer need coding?
Yes. At the very least, an ETL developer should be familiar with Java to understand the data system implementation and perform troubleshooting and/or optimization. Knowledge of scripting languages will help as well in terms of automating processes and reports on a large scale beyond what ETL tools can provide.
There is little actual programming involved — there are heavy-duty data science ETL tools and systems that can be deployed by ETL developers with relative ease. They can perform all advanced data science tasks, provided the ETL developer makes some tweaks to ensure they are integrated properly with the rest of the infrastructure.
Some of these heavy-duty tools and systems for ETL developers are:
- Pentaho, a software suite that provides a mix of BI and ETL functionalities, running on Java and using XML for automation, with add-ons to expand its feature list;
- Talend, a database integration and data health solution that helps businesses absorb Big Data and find workarounds for critical problems;
- Informatica, a data management platform that integrates applications to provide automated, intelligent reports.
Are ETL and SQL the same?
No. ETL is a niche skill while SQL is a flexible, jack-of-all-trades skill. However, the job market has blurred the lines between those two and other data science job roles. There is such enormous demand for data science personnel that businesses are willing to turn a blind eye to skill mismatches.
Two ETL developers in two companies will probably have different tasks and skill requirements. What their ETL processes will have in common are analysis, support, and presentation stages.
ETL developer job demand
According to an IBM survey titled "The Quant Crunch", there were 82,920 job postings for an ETL developer in the US in 2015, making it the 10th most needed role in data science teams; the first was SQL with 338,000. The survey predicted 2.7 million job openings in the US across all data science job categories by 2020.
By analyzing over 130 million job postings, IBM and its partners identified 300 skills loosely correlated with data science. These skills were placed in six categories, with SQL identified as fitting five. On the other hand, ETL fit only one category — data systems developers.
The same survey breaks down the ETL developer role by industry — 69% of all ETL developer jobs were in three industries:
- professional services (41%)
- finance & insurance (14%)
- manufacturing (14%)
The value of an ETL skillset
The IBM survey states the average salary for a data science job in 2015 was $80,265 and the average salary of an ETL developer was $83,000, with ETL job postings hard to fill. Nearly all data science job postings required 3+ years of experience.
In comparison, analytics managers earned $107,000–$113,000 a year, though 11% of those job postings required a Master’s degree or higher compared to 3% for data systems developers. ETL developer job descriptions most often mentioned UNIX, Linux, and Java.
Conclusion
Big Data has forever changed businesses, prompting them to adopt data science to survive in the endlessly fickle global market. Those that can fine-tune the data science process to fit their needs and adapt it on the fly have the greatest chance to survive tectonic market shifts.
The ETL developer role will have to undergo the same fine-tuning and adaptation to stay viable. As this article has shown, that means becoming less formal and more experimental so that novel insights can be reached. Even then, the title of ETL developer will briefly open the door to the data science industry, but not much more than that.
Although needed by companies, ETL developers’ skills are used only until the underlying data infrastructure is good enough. After that, other data science roles, such as data analysts and business intelligence developers, take over and fine-tune the system to suit the client. This further indicates ETL developer jobs do not provide enough career advancement opportunities to warrant solely focusing on them.