Replicate Your Way to Fast Results

Replication as part of an overall Information Delivery Strategy

In this blog, we’ll cover how replication can fit very well within the overall architecture for information and analytics delivery.

We’ll cover points such as:
  • Traditional Data Warehouse population processes
  • How replication fits within a modern architecture
  • The time and resource savings when migrating from on-premise to cloud
  • The renewed need for replication products in a modern, diverse enterprise architecture
  • The marriage of Replication and ELT

The Pre-Cloud Data Warehouse Process

The pre-cloud generation of transformation tools typically used ETL (Extract, Transform and Load) to apply business rules and transform data from source systems into analytical data sets that then lived in the Data Warehouse. ETL worked very well in pre-cloud environments where multiple systems were all on the same local network, typically with fast connections in the same data-centre.

 
The typical ETL server (or servers) needed to be pretty hefty, with a large amount of disk and memory. The ETL server was typically used for only a few hours a day for daily refreshes, so the resources allocated to it were under-used for a good deal of the day. Yet it had to be highly specified to cope with peak volumes of data within the window between operational systems completing their overnight batch runs and the analytics Data Warehouse needing to be available for users.
 
Whilst Replication / CDC solutions existed, they were often not deployed, due to very substantial costs and being proprietary to particular databases. In addition, because batch windows left unused resource on the source databases, the ETL tools could place their extraction load directly on those databases without too much impact.

Modern Day Information Problems

Things have changed since the dawn of the Data Warehouse. Where once we were dealing with typically user-driven content generation in Customer Relationship Management (CRM) or Enterprise Management Systems (EMS), there has been an explosion of additional data sources and data points that didn’t exist when Data Warehouses started to appear in organisations’ architectures.

 
Today’s systems can churn out an awful lot of data points, and the volumes can be huge! Capture points on EMS systems have increased to allow greater understanding of the journey through, for example, a manufacturing process or a retail delivery chain. Digitisation is a key priority for many organisations now, and with it even more information is being captured. All this extra data needs to be understood and analysed to gain greater insight and value for the organisation. Couple this with a growing need for information that builds a picture of what’s happening in real time, to influence what’s happening during the online day, and the overnight process starts to break down a little.
 
The problem with ETL is that, without replication technologies, the drain placed on the source systems to gather data more frequently than daily is an operational risk that organisations need to mitigate. After all, the operational systems are the oil that greases the cogs that allow the organisation to function. Risks to those systems are unacceptable.
 
Access to near real-time data is becoming increasingly valuable to growing communities within organisations, and centralising this data is critical so that analysts can make use of it in one place – the Data Warehouse – rather than relying on the many varied screen requests or reports out of each individual system.

How does Replication help, and what's the modern approach?

Firstly, what is Replication? It’s a process where data is taken from a source system (or systems) and replicated (or copied) over to a landing area in a Data Warehouse. Replication needs to:

  • Be as light-touch as possible, avoiding undue strain on the operational system the data is being sourced from.
  • Refresh frequently, either on a schedule or by constantly polling the source systems.
  • By preference, use CDC capabilities where the source database supports them, for example SQL Server Enterprise Edition®; otherwise fall back to database logs, with other capture options available where neither is supported.
  • Replicate only data that is new, changed or deleted since the last replication, so it needs to keep track of what’s going on (a minimal sketch of this appears after the list)!
  • Capture structural changes too: where fields or even whole tables are added or altered on the source system, replicate those to the Data Warehouse target as well, without user intervention.
  • Replicate data to multiple targets where required
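
To illustrate the “keep track of what’s going on” point, here is a minimal, hypothetical Python sketch of high-water-mark based incremental extraction. The table and column names are invented, and a log- or CDC-based tool such as HVR captures changes from the database logs rather than querying like this; the sketch only shows the book-keeping idea.

```python
import sqlite3  # stand-in for any source/target database driver

def replicate_changes(source_conn, target_conn, last_seen_version):
    """Pull only rows changed since the last run, using a change-tracking column."""
    rows = source_conn.execute(
        "SELECT id, payload, change_version, is_deleted "
        "FROM orders WHERE change_version > ?",
        (last_seen_version,),
    ).fetchall()

    for row_id, payload, version, is_deleted in rows:
        if is_deleted:
            # Propagate deletes as well as inserts and updates
            target_conn.execute("DELETE FROM landing_orders WHERE id = ?", (row_id,))
        else:
            target_conn.execute(
                "INSERT OR REPLACE INTO landing_orders (id, payload) VALUES (?, ?)",
                (row_id, payload),
            )
        last_seen_version = max(last_seen_version, version)

    target_conn.commit()
    return last_seen_version  # persist this high-water mark for the next run
```

The key point is the returned high-water mark: the next run only asks the source for rows beyond it, which is what keeps the load on the operational system small.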

 

Replication is especially useful against traditional database systems or ERPs such as SAP, whether on-premise or cloud, delivering data to your target in an effectively streaming fashion.
 
Migrations of structures and data from traditional databases can be performed with HVR, for example, giving you a real quick-start to gather historical data from these systems and get it loaded into the Data Warehouse quickly. Then, if these systems continue to be used (rather than being retired), it can replicate data as it changes.
 
Many organisations no longer have a “typical” working day: demands come from across time-zones, and with operations sometimes running 24 hours a day, 7 days a week, there isn’t really an ideal time to get data. This means there is no effective batch window, and users of centralised Data Warehouses need access to data on a more continuous basis that fits their geography.
 
This must be done without impacting the source systems, and without interfering with patches or upgrades to those systems. This is where Replication Technology really helps!
The diagram below shows our replication architecture point between the source systems and the Data Warehouse. In this case HVR is being used, which can use CDC, logs or other mechanisms that reduce the load on source systems as much as possible. HVR can replicate this data from both on-premise and cloud sources to multiple targets – for example a combination of Snowflake, AWS S3 storage, or even Kafka for streaming!
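
To make the multi-target point concrete, here is a purely illustrative Python sketch of fanning captured change records out to several targets. The delivery functions are hypothetical stand-ins (real code would use the Snowflake, S3 or Kafka client libraries, or leave delivery to the replication tool entirely); the sketch only shows the fan-out shape.

```python
from typing import Callable, Iterable

# Hypothetical stand-ins for real delivery code.
def deliver_to_snowflake(record: dict) -> None:
    print(f"Snowflake <- {record}")

def deliver_to_s3(record: dict) -> None:
    print(f"S3 <- {record}")

def deliver_to_kafka(record: dict) -> None:
    print(f"Kafka <- {record}")

def fan_out(changes: Iterable[dict], targets: list[Callable[[dict], None]]) -> None:
    """Send every captured change record to every configured target."""
    for record in changes:
        for deliver in targets:
            deliver(record)

fan_out(
    [{"table": "orders", "op": "insert", "id": 42}],
    [deliver_to_snowflake, deliver_to_s3, deliver_to_kafka],
)
```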

Replication and ELT

HVR delivers the replication part using a very lightweight pull from the source systems: where logs or CDC are used, the source system is barely impacted, and things like data-locking are avoided because the data is not read from the live tables but from the CDC or log capture. This makes it very feasible to feed data continuously into Snowflake, where it can then be transformed by ELT – in this case with Matillion.

The diagram above shows a typical set-up where data is replicated and/or ingested by HVR into an ingest zone. This is the first point where the data is stored, and from here the compute and storage power of a modern Data Warehouse platform – Snowflake being the best one! – can be brought to bear. An ELT tool is then used to transform the data onwards from the ingest zone, using the power of the Data Warehouse itself. The ELT environment no longer needs to be huge, as most of the processing is pushed down to the Data Warehouse; in the case of Snowflake this can be scaled to requirements and run with the processing power needed, for only as long as it is needed. Batch windows during the day, overnight, or a combination of the two can then be executed as required, with compute turned on and off as needed, and by using different warehouses this doesn’t affect the users.
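
To make the ELT push-down idea concrete, here is a minimal sketch in Python, assuming the snowflake-connector-python package; the account details, warehouse and table names are invented, and in practice a tool like Matillion generates and orchestrates this kind of SQL for you. The important point is that the MERGE runs inside Snowflake, so the ELT server itself does very little work.

```python
# Minimal ELT push-down sketch: the transformation is expressed as SQL and
# executed inside Snowflake, not on the ELT server. Names are illustrative only.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="elt_user", password="***",        # hypothetical credentials
    warehouse="TRANSFORM_WH", database="ANALYTICS", schema="INGEST",
)
cur = conn.cursor()

merge_sql = """
MERGE INTO analytics.model.dim_customer AS tgt
USING analytics.ingest.customers_landing AS src
    ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET tgt.name = src.name, tgt.updated_at = src.updated_at
WHEN NOT MATCHED THEN INSERT (customer_id, name, updated_at)
    VALUES (src.customer_id, src.name, src.updated_at)
"""

cur.execute("ALTER WAREHOUSE TRANSFORM_WH RESUME IF SUSPENDED")  # spin compute up only when needed
cur.execute(merge_sql)                                           # heavy lifting happens in Snowflake
cur.execute("ALTER WAREHOUSE TRANSFORM_WH SUSPEND")              # release compute after the batch
cur.close()
conn.close()
```

Scaling the warehouse up for a big batch and suspending it afterwards is what makes the “pay for what you use” model work.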

Summary

What we’ve talked about here is having the capability to replicate data from your on-premise and cloud systems – in this case with HVR – with vastly reduced impact compared with using an ETL or ELT solution alone!
 
The Replication takes care of getting the data and structures from key source systems automatically, without the need for scripting or designing a large number of ELT ingestion routines, allowing modern ELT tools like Matillion to work in co-ordination with powerful Data Warehouse systems like Snowflake from the ingested data zone onwards. This will enable you to move into cloud Data Warehousing, where you can pay for what you use and scale as required (reducing upfront costs and the need for up-scaled hardware bought to cope with peak demand), while maintaining your current source systems without necessitating change on their part.
 
Contact Leading Edge IT to see how we can help you in your Data Journey to the Clouds, where we cover using HVR for replication and Matillion for ELT over Snowflake.