The Rise of ELT for DW Data Integration

What will you learn in this blog?

Answers to these questions:
  • The difference between ETL and ELT
  • The benefits of using an ELT tool over ETL or “hand-cranked” code
  • How the Cloud, with the next generation of tools, can simplify the data integration landscape
Definitions of the following Data Integration terms:
  • ETL – Extract, Transform and Load
  • ELT – Extract, Load and Transform
  • Streaming – The continuous delivery of data, constantly transmitted via various means, such as HVR replication or a Kafka topic
  • RLT – Replicate, Load and Transform (a new term we use to refer to a blend of replication tools and ELT tools)
  • Data Pipeline – The end-to-end movement (may include Transformation) of data from a source to a target, involving one or more steps

ETL

Let’s go back to the beginning and the birth of ETL tools, back in the 90s when the Data Warehouse became mainstream. Data had to be extracted from source systems, transformed and then loaded into a database to allow analysis and reporting.


Before tools existed to help with this process, it was performed using hand-coded scripts to extract the data, transform it, and finally load it for use by the business – requiring a range of skills.
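
To make that concrete, here is a minimal, hypothetical sketch of what such a hand-cranked pipeline might look like in Python. The file, database and table names are invented for illustration, and a real script would also need error handling and scheduling:

```python
import csv
import sqlite3  # stand-in for whatever database held the warehouse

# Extract: read raw order records exported from the source system
with open("orders_export.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: cleanse and derive fields in application code
for row in rows:
    row["order_date"] = row["order_date"].strip()
    row["amount"] = round(float(row["amount"]), 2)

# Load: push the transformed records into the warehouse table
conn = sqlite3.connect("warehouse.db")
conn.executemany(
    "INSERT INTO fact_orders (order_id, order_date, amount) VALUES (?, ?, ?)",
    [(r["order_id"], r["order_date"], r["amount"]) for r in rows],
)
conn.commit()
conn.close()
```

Every step – extraction, business rules, loading – had to be written, tested and maintained by hand, with nothing recording what moved where.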
 
All of this was very time-consuming, tracing data lineage was effectively impossible, and maintenance was difficult. Most DW projects failed to deliver on the promised value; the business was often oversold on what could be achieved in given time-frames using magical ‘boxed technology solutions’ alone, and it wasn’t straightforward.

Then came the birth of ETL tools, with products like Oracle Warehouse Builder (OWB), IBM DataStage, Cognos DecisionStream (which became IBM Data Manager), Informatica and Ab Initio.

 
The common factor for all these tools was that the transformation was performed on application servers, not on the source system or the target database. This required the data to be extracted from source, transferred to and modified by the application infrastructure (the ETL server/s), and then pushed on to the Data Warehouse.
 
The benefits this brought were:

  • Lineage – The ability to trace data points from sources to targets
  • Logging – Recording details about the processing performed, such as record counts and rejected-record information
  • SCD – Simplified the process of creating Slowly Changing Dimensions (see the sketch after this list)
  • GUI – A Graphical User Interface to build the jobs, improving readability and understanding
  • Metadata – Data about the data; this is important to understand the context of the data you’re looking at
  • Documentation – Some tools could produce documentation from their own code / metadata
  • Business and IT – Working more closely. Using a tool allows less technical people to contribute and collaborate.
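
To illustrate the SCD point above, here is a hedged sketch of the Type 2 pattern these tools automated: close the current row and insert a new version whenever a tracked attribute changes. The table and column names are hypothetical, and SQLite stands in for the warehouse:

```python
import sqlite3
from datetime import date

conn = sqlite3.connect("warehouse.db")
today = date.today().isoformat()

# Close out the current version of any customer whose address changed
conn.execute(
    """
    UPDATE dim_customer
       SET end_date = ?, is_current = 0
     WHERE is_current = 1
       AND customer_id IN (
           SELECT s.customer_id
             FROM stg_customer s
             JOIN dim_customer d
               ON d.customer_id = s.customer_id AND d.is_current = 1
            WHERE d.address <> s.address)
    """,
    (today,),
)

# Insert a new current row for changed and brand-new customers
conn.execute(
    """
    INSERT INTO dim_customer (customer_id, address, start_date, end_date, is_current)
    SELECT s.customer_id, s.address, ?, NULL, 1
      FROM stg_customer s
      LEFT JOIN dim_customer d
        ON d.customer_id = s.customer_id AND d.is_current = 1
     WHERE d.customer_id IS NULL
    """,
    (today,),
)
conn.commit()
conn.close()
```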

ELT

The benefit of ELT tools is that the transform is performed within the database, using the power of the underlying Data Warehouse, while still providing the functionality of ETL tools.
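
A minimal sketch of the pattern (not any particular tool’s API – the connection details and table names are placeholders): the raw data is loaded first, then the tool issues set-based SQL so the warehouse itself does the heavy lifting, with nothing pulled back to the application layer.

```python
import snowflake.connector  # assumes the snowflake-connector-python package

# Connect to the warehouse; credentials here are placeholders
conn = snowflake.connector.connect(
    account="my_account", user="elt_user", password="...",
    warehouse="TRANSFORM_WH", database="ANALYTICS",
)

# The transform happens inside the database as one set-based statement
conn.cursor().execute("""
    INSERT INTO fact_orders (order_id, order_date, amount)
    SELECT order_id, TO_DATE(order_date), ROUND(amount, 2)
      FROM raw.orders_staging
""")
conn.close()
```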
 

I believe the first ELT tool with heterogeneous capabilities was Sunopsis (acquired by Oracle in 2006 and renamed Oracle Data Integrator, ODI), and its arrival coincided with the rise of more powerful DW appliance databases like Oracle Exadata.

 
The immense elastic compute available in Data Warehouses such as Snowflake makes ELT processing a breeze, coupled with the fact that the data stays within the same ecosystem, avoiding large volumes of data movement and the latency that comes with it.
 
Many existing vendors followed Oracle, giving their ETL tools the capability to push processing down to the database, to varying extents.

Hadoop - the beginning of a new "Big Data" era

With ever-increasing volume, variety and velocity of data, it made sense to move the compute to the data. Hadoop was a platform that promised expandable compute and the capability to address large storage pools, so it became the go-to place for “Big Data” projects.

 
Hadoop development was largely hand-coding MapReduce or Spark programs in Java, Scala, Python or C++. Some existing tools, like Talend and Pentaho, produced versions providing adapters to interact with the Hadoop ecosystem. Predominantly, however, the Data Warehouse and Big Data systems worked in silos: very few companies managed a single data integration tool across the DW and Big Data landscapes, or a single place to get data for analytics.
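
For a flavour of that hand-coding, here is a minimal PySpark sketch (the paths and columns are invented):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_agg").getOrCreate()

# Read raw files from the distributed store and aggregate them in code
orders = spark.read.csv("hdfs:///data/raw/orders/*.csv", header=True, inferSchema=True)
daily = orders.groupBy("order_date").agg(F.sum("amount").alias("total_amount"))
daily.write.mode("overwrite").parquet("hdfs:///data/curated/daily_orders")
```

Even a simple aggregation like this needed engineers comfortable with the cluster, the file layout and the framework – with none of the lineage or metadata an ETL tool recorded for free.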
 
What we have seen at many companies is competing departments, with duplication of capabilities, data and processes across the Enterprise Data Warehouse and the Big Data ecosystem.
 
This requirement for higher levels of programming skill, needed to navigate the ecosystems, pushed development further away from the business. With no easy lineage and multiple versions of the same data, it was hard to evidence the accuracy of the data.

How the Cloud, with the next generation of tools, can simplify the data integration landscape

ELT is back with a vengeance. What’s changed? The next generation of ELT tools, like Matillion, is built for Cloud-based data warehouses like Snowflake, Google BigQuery and Amazon Redshift. They remove the legacy need to perform data crunching within the application layer, fully utilising the underlying cloud database to pick up the load (the computations) and never requiring the data to leave the database, eliminating latency issues.

 
Also, because platforms like Snowflake can satisfy both your company’s Data Lake and Enterprise Data Warehouse requirements, you only need one ELT tool to perform your transformations. There has never been a better time to build a modern analytical platform. Products like Matillion and Snowflake can reduce the TCO of your Data Warehouse: the ease of use of a GUI and re-usable shared jobs simplify the creation and scaling of your Data Pipeline, using the elasticity of Snowflake, reducing development time and making maintenance and impact assessment of changes more manageable.
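
The “shared job” idea is easy to picture: define a transformation pattern once and stamp it out per source. A hypothetical sketch of the concept (the helper and table names are ours, not Matillion’s feature):

```python
def build_staging_sql(source_table: str, target_table: str, columns: list[str]) -> str:
    """Template one reusable load-and-dedupe step for any source table."""
    col_list = ", ".join(columns)
    return f"""
        INSERT INTO {target_table} ({col_list})
        SELECT DISTINCT {col_list}
          FROM {source_table}
    """

# Reuse the same component for every feed instead of hand-writing each job
for src, tgt, cols in [
    ("raw.orders", "staging.orders", ["order_id", "order_date", "amount"]),
    ("raw.customers", "staging.customers", ["customer_id", "name", "address"]),
]:
    print(build_staging_sql(src, tgt, cols))  # in practice, executed in the warehouse
```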

Streaming

Another complementary capability for your data warehouse is the ability to stream data, allowing real-time analytics that combine data sources so your business can make decisions quickly – driving efficiencies, or sales via real-time marketing. You can read more about real-time replication, and how it can work with your ELT tool, in our blog: Replicate your way to fast results. Using HVR is a cost-effective way to replicate your data – something we have coined RLT (Replicate, Load and Transform), blending replication tools with ELT.
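
A hedged sketch of the RLT pattern: continuously land replicated or streamed events into a raw staging table, then let the warehouse transform them in bulk. The topic name, broker address and loader are assumptions for illustration:

```python
import json
from kafka import KafkaConsumer  # assumes the kafka-python package

consumer = KafkaConsumer(
    "orders_topic",
    bootstrap_servers="broker:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

# Replicate/Load: land each event in staging as it arrives; the Transform
# step runs later, inside the warehouse, as set-based SQL
for message in consumer:  # blocks and consumes indefinitely
    event = message.value
    load_into_staging("raw.orders_stream", event)  # hypothetical loader
```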

Governance

This is a huge subject, worthy of its own blog. But without lineage, rich metadata, and audit and log information, governance for things like GDPR becomes an even more significant challenge. Tools like Matillion give you that information easily: you can navigate the metadata within the tool, use the HTML documentation generated from your project, or use the API to connect data governance tools like Collibra.

DataOps

With the rise of DataOps, easy deployment of changes using tools like Git is essential, and is supported by ELT tools such as Matillion. Gone are the days of quarterly / monthly releases to the Data Warehouse. Modern agile working requires build, deployment and regression-testing capabilities, allowing regular changes to accommodate source-system changes, new data sources or new data-marts while ensuring the existing data’s integrity. Snowflake’s cloning capability makes this an easy task – something we cover in a different blog: How is Snowflake so much better for development cycles than traditional databases used for Data Warehousing. We plan to cover more on this in a future blog.
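
For example, Snowflake’s zero-copy clone gives you a full, writable development copy of production in a single statement. A sketch, with placeholder names and credentials:

```python
import snowflake.connector  # assumes the snowflake-connector-python package

conn = snowflake.connector.connect(
    account="my_account", user="dataops_user", password="...",
)
cur = conn.cursor()

# Zero-copy clone: an instant development copy of the production database
cur.execute("CREATE DATABASE DEV_ANALYTICS CLONE ANALYTICS")

# Develop and regression-test against the clone, then drop it when done
cur.execute("DROP DATABASE DEV_ANALYTICS")
conn.close()
```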

Migration

So how do you get to this new world? This is something we have covered in our blog: Transitioning from a legacy analytics platform to a modern insights architecture – and something we at Leading Edge IT will be happy to discuss with you.

Conclusion

In the era of Cloud-based data platforms like Snowflake, it makes sense to utilise the scalable compute they provide. But if you go the hand-scripted route, you will miss out on the benefits of an ELT tool like Matillion, and the TCO of your platform will be higher in the long run. You can now get all the benefits the original ETL tools provided, at a fraction of the price, with a Cloud-based ELT tool like Matillion. The hand-coding vs ETL debate has been around since the early 2000s, and lost some focus during the Hadoop “bubble”. But with the combination of HVR for replication, Matillion for ELT and Snowflake for your Data Warehouse – providing multi-user and multi-environment capabilities and scaling flexibility – you can automate and re-use templates with shared components, so there should be no reason to hand-code a complete solution.


 

Why is all of the above important? We love data, analytics, software, and enabling the business to do smart things. It’s frustrating to hear of failing solutions, or programmes not capitalising on rich data sets, when it is relatively simple to do – if you know what you’re doing. You don’t need the budget of Facebook® to make your business data-driven: talk to us about our DAaaS (Data Analytics as a Service) or our Accelerator templated solutions.

 
Contact Leading Edge IT to see how we can help you in your data journey, no matter what stage you are at.
#LEIT | #Snowflake | #Matillion | #HVR | #ELT | #DataIntegration | #DataStrategy | #DataGovernance | #Replication | #RLT 
 
DILBERT © Scott Adams. Used by permission of ANDREWS MCMEEL SYNDICATION. All rights reserved.