ETL (Extract, Transform, Load) code is the set of computer instructions that extracts data from its source systems, transforms it to suit various business intelligence needs, and loads it into one or more target systems. ETL code enables companies to build data pipelines that process and transfer data at massive scale and speed.
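The three ETL steps can be sketched in a few lines of Python. The field names and sample data below are hypothetical, and a plain list stands in for the real target system:

```python
import csv
import io

# Hypothetical raw export from a source system (CSV text).
RAW_EXPORT = """order_id,amount,currency
1001,19.99,usd
1002,5.00,eur
"""

def extract(raw_csv):
    """Extract: parse rows out of the source system's export."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows):
    """Transform: standardize types and casing for BI use."""
    return [
        {"order_id": int(r["order_id"]),
         "amount": float(r["amount"]),
         "currency": r["currency"].upper()}
        for r in rows
    ]

def load(rows, target):
    """Load: append the cleaned rows to the target store."""
    target.extend(rows)

warehouse = []  # stand-in for the real target system
load(transform(extract(RAW_EXPORT)), warehouse)
print(warehouse[0])  # {'order_id': 1001, 'amount': 19.99, 'currency': 'USD'}
```

In a production pipeline each step would talk to real systems (a database export, a transformation engine, a warehouse loader), but the separation of concerns is the same.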
Historically, data consumers have relied heavily on hard-coded ETL operations and data pipelines to standardize the heterogeneous datasets they receive. Others have turned to spreadsheets and manual processes to make sense of the data. Either approach applies standardization logic directly where the data lives: the former does so after the data is onboarded; the latter just before it is analyzed.
These approaches have worked in the past, but with the explosion of business data today they are no longer viable. Hard-coding mapping rules introduces several problems, listed below.
Increased maintenance costs
The more standardization code you have, the more expensive and time-consuming it becomes to maintain. As your code base grows over time, data lineage becomes murkier, and your ability to reverse-engineer the mapping process and ensure that it is still accurate for all cases diminishes.
Increased risk of errors
Related to the earlier point, as your standardization code base grows, it becomes more difficult to test the data and catch errors early before they reach business users. And when you do catch errors, it may take a while to identify the source of the problem and fix it. The problem increases markedly when it comes to capturing custom data elements from various endpoints.
Data availability limitations
Often, data providers offer only the datasets that are easy and cheap for them to produce, so aggregators that require full business visibility are forced to work with limited data. Moreover, partner data may arrive in various forms (log files, CSV files, JSON files, database dumps, etc.), further complicating the aggregator's ability to access and blend the datasets.
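Blending feeds that arrive in different formats means converting each one to a common record shape before analysis. A minimal sketch, assuming two hypothetical partner feeds that describe the same kind of event:

```python
import csv
import io
import json

# Hypothetical partner feeds: one CSV, one JSON, same underlying events.
CSV_FEED = "sku,qty\nA-1,3\nB-2,1\n"
JSON_FEED = '[{"sku": "C-3", "qty": 5}]'

def from_csv(text):
    """Normalize a CSV feed into the common record shape."""
    return [{"sku": r["sku"], "qty": int(r["qty"])}
            for r in csv.DictReader(io.StringIO(text))]

def from_json(text):
    """Normalize a JSON feed into the same record shape."""
    return [{"sku": r["sku"], "qty": int(r["qty"])}
            for r in json.loads(text)]

# Blend: once every feed shares one shape, the datasets combine freely.
blended = from_csv(CSV_FEED) + from_json(JSON_FEED)
print(len(blended))  # 3
```

Every new feed format adds another normalizer, which is exactly the per-partner code growth the surrounding points describe.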
Reduced technology flexibility
New technologies continuously appear on the scene, improving business productivity and flexibility. But when you hard-code mapping logic directly into your data infrastructure components, you reduce your ability to rip and replace those components, given the cost of building new mapping logic. Moving from expensive databases to a scalable and cost-effective data lake, for example, becomes cumbersome and expensive.
Lack of reusability
Hard-coded mapping logic is not easily reused in subsequent pipelines or ETL processes. Instead, it must essentially be re-implemented, which is costly and error-prone. Brands lose the opportunity to reuse metadata and combine it in different ways to increase efficiency.
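One common alternative to hard-coding is to express mapping rules as data so that a single generic function can apply them anywhere. The field names and converters below are illustrative only:

```python
# A declarative mapping table: source field -> (target field, converter).
# Because the rules are data, not code, the same apply_mapping() function
# can be reused across pipelines instead of being re-implemented each time.
ORDER_MAPPING = {
    "OrderID": ("order_id", int),
    "Amt":     ("amount", float),
}

def apply_mapping(record, mapping):
    """Apply a declarative mapping table to one source record."""
    return {target: convert(record[src])
            for src, (target, convert) in mapping.items()}

row = apply_mapping({"OrderID": "42", "Amt": "9.50"}, ORDER_MAPPING)
print(row)  # {'order_id': 42, 'amount': 9.5}
```

Adding a new pipeline then means writing a new mapping table, not new transformation code, which is the reuse opportunity this point describes.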
Increased cost of doing business
Perhaps the biggest impact of hard-coded data mapping is that it creates friction wherever the business attempts to advance its mission. A brand that sells primarily through a retail channel, for example, is in the business of selling more goods through more retailers. Hard-coded data standardization prolongs the onboarding of new retailers, lengthening time to value and causing the brand to lose revenue and market share to more nimble competitors.