In typical on-premise datawarehousing projects there are phases that require scoping and sizing, and teams often try to do this at the same time as the business requirements are being gathered. This is a difficult balance: even if the business requirements are fully articulated and well understood ahead of time, things can, and often do, change as the project moves on. This approach also pushes the project towards a more ‘Waterfall’ methodology, which means the business community has to think of most scenarios up-front, before they’ve even seen how anything works.
Even when the business requirements are understood, sizing can be a complex task and building volumetrics can be tricky, particularly if the system(s) being sourced from are themselves still in development. If the datawarehouse is being delivered as part of a bigger programme of work, some of this detail won’t be apparent, and moreover data availability is going to be scarce.
So… let’s say we’ve worked out the infrastructure requirement, whether that’s new infrastructure, an expansion of current infrastructure or somewhere in between. That target infrastructure caters for the eventual size scoped for the productionised system, so it doesn’t take into account the time-lines for development or the resources needed for development and testing! Sites will therefore often need to build infrastructure for multiple environments – for example Dev, Test, Systems Integration, Unit Testing, Operational Acceptance and Production – and that’s more environments to keep running and potentially upgrade! Then comes the cost relative to the use they get, and the question arises – do we need to size these environments the same as production? On the majority of sites we’ve ever worked on the answer is ‘No’ – the cost is too prohibitive, or it results in environments that sit oversized for large parts of the day or for whole phases of a project.
Another consideration is that use of the environments can clash, for example when multiple projects are running against your datawarehouse simultaneously but overlap in terms of the structures they use. In addition, if you have a pre-production environment it is often used for two different things: firstly pre-production testing/Operational Acceptance, and secondly production support, where support issues will sometimes necessitate pulling the environment away from a project in order to prove a fix against operational-grade data.
As a datawarehouse built specifically to utilize the best capabilities the cloud can offer, Snowflake gives us options we previously didn’t have.
Firstly, in terms of the scoping phase: it’s just not necessary! The elastic nature of the storage and compute available in Snowflake means that upfront provisioning isn’t required, and capacity can be adjusted during the project as needed, within minutes or even seconds.
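As a flavour of what that looks like in practice – the warehouse name here is purely illustrative – scaling compute up or down mid-project is a single statement:

  -- Start development on a small warehouse that suspends itself when idle
  CREATE WAREHOUSE IF NOT EXISTS dev_wh
    WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

  -- Later in the project, scale it up in seconds when the workload demands it
  ALTER WAREHOUSE dev_wh SET WAREHOUSE_SIZE = 'LARGE';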
What this means for your project methodology is that you’re free to adopt any style you wish rather than fitting a rigid Waterfall methodology: you can go Agile using a DataOps process-oriented methodology if you wish, and you can have multiple iterations being run simultaneously, either with the same or entirely different engineering and test teams. This is made possible by taking clones of entire databases and provisioning non-competing compute power for performance testing using warehouses, or, if you want to share compute, creating compute power and sharing it amongst groups of users for certain tasks. All of this is very easily performed via the Snowflake graphical interface or at the command line.
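As a minimal sketch of spinning up an iteration this way – the database, warehouse and role names are invented for illustration – it comes down to a handful of statements:

  -- Give the iteration its own zero-copy clone of the database
  CREATE DATABASE sales_dw_iter2 CLONE sales_dw;

  -- Give the team its own, non-competing compute for performance testing
  CREATE WAREHOUSE iter2_wh
    WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;

  -- Share that compute amongst the group of users working on the iteration
  GRANT USAGE ON WAREHOUSE iter2_wh TO ROLE iter2_test_team;
  GRANT USAGE ON DATABASE sales_dw_iter2 TO ROLE iter2_test_team;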
Unlike typical cloud options, where the infrastructure build-out that was required on-premise is simply shifted to the cloud, provisioning can be performed on-the-fly, with compute and storage used only when needed rather than being ‘always on’.
Secondly, and we’ve already alluded to this, because of the ability to provision and ring-fence flexible compute and storage on demand, we can speed up the development and testing processes. How? Let’s find out…
By cloning sources/targets, the same set of data can be used with no additional storage overhead other than for data that changes between the source of the clone and the target. Test cases can be run through against a “golden” set of data each time by cloning it, and the clones can then easily be dropped. Multiple sets of entities can be cloned and used by different developers without affecting each other!
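A sketch of what a single test run might look like – the object names are illustrative:

  -- Clone the "golden" data for this test run; storage is only consumed for data that changes
  CREATE TABLE test_db.work.customer_run1 CLONE test_db.golden.customer;

  -- ...run the test cases against the clone...

  -- Throw the clone away when the run is finished
  DROP TABLE test_db.work.customer_run1;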
As cloning can be performed at database level and even down to table level, integration testing preparation and test durations can be vastly reduced. This can be done by taking a clone of the production database or databases, cloning in any changed objects from the new development path in the dev/test environment, and then running the transformations/analytics against the newly cloned copy. Traditionally this type of activity would require backups and restores taking hours or even days, plus a lot of assembly. It can now be done in minutes, or in complicated scenarios a few hours. Even better, because the cloning operations are standard Snowflake commands (DDL), the whole thing can be scripted and replayed each time, running in minutes!
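A scripted version of that assembly might look something like this – again, the database and object names are made up for the example:

  -- Clone the production database for integration testing
  CREATE DATABASE sales_dw_int CLONE sales_dw_prod;

  -- Overlay just the objects that changed in the new development path
  CREATE OR REPLACE TABLE sales_dw_int.core.orders      CLONE sales_dw_dev.core.orders;
  CREATE OR REPLACE TABLE sales_dw_int.core.order_lines CLONE sales_dw_dev.core.order_lines;

  -- The transformations/analytics can now be run against sales_dw_int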
A good deal of the practice described for integration testing applies equally to User Acceptance Testing. Capabilities I’d add here include being able to provide full data volumes to users for testing, which in typical agile datawarehouses isn’t, well, typical! Another is that when we perform User Acceptance Testing, whilst there’s an appreciation that UAT is not performance testing, users expect to be able to walk through their test scenarios without too much hindrance and would like to see at the very least representative performance. A big part of UAT from a Datawarehousing and Analytics standpoint is the user experience, and slow performance can be a real problem in getting out of UAT in the time-frames set out in the project.
In addition, UAT can be further sped up by creating additional compute (virtual warehouses) for groups of UAT testers, say one per department, or even by providing enhanced performance via larger compute resources to avoid any delays playing through UAT scripts.
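For example – warehouse and role names invented for illustration – each department could be handed its own compute for UAT:

  -- One warehouse per UAT group, so departments never queue behind one another
  CREATE WAREHOUSE uat_finance_wh   WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
  CREATE WAREHOUSE uat_marketing_wh WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
  GRANT USAGE ON WAREHOUSE uat_finance_wh   TO ROLE uat_finance;
  GRANT USAGE ON WAREHOUSE uat_marketing_wh TO ROLE uat_marketing;

  -- If scripts are dragging, bump the size rather than re-planning the test window
  ALTER WAREHOUSE uat_finance_wh SET WAREHOUSE_SIZE = 'XLARGE';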
Remember as well that all of the above can be done while people are actively using the system, and the compute can be turned off either automatically during periods of inactivity in the day or outside of the test window. Remember – when the compute is off, there’s no cost!
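Keeping that cost switched off is just warehouse settings (the warehouse name continues the illustrative examples above):

  -- Suspend automatically after five minutes of inactivity, resume on the next query
  ALTER WAREHOUSE uat_finance_wh SET AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;

  -- Or suspend explicitly outside the test window; no credits are consumed while it's off
  ALTER WAREHOUSE uat_finance_wh SUSPEND;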
Traditionally, running ETL/ELT and Analytics is a cyclic affair, with the ELT run prior to the Analytics cycle and each waiting for the other before the next phase commences. This means timing is absolutely critical, and it can lead to missed windows and delays in testing. By cloning the data and using different compute resources for the ETL/ELT and the Analytics pieces, we can change this cycle and test in two parallel paths, where the ETL/ELT stream can run ahead of the Analytics stream whilst providing a mechanism for the next iteration to be available to the Analytics team as soon as they want it.
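A sketch of the hand-off between the two streams, with invented names:

  -- The ELT stream loads into its own database on its own warehouse;
  -- when a load cycle completes, the result is handed to the Analytics team as a clone
  CREATE DATABASE sales_dw_analytics_iter1 CLONE sales_dw_elt;
  GRANT USAGE ON DATABASE sales_dw_analytics_iter1 TO ROLE analytics_testers;

  -- The ELT warehouse carries straight on with the next cycle,
  -- while the Analytics tests run against the clone on separate compute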
The figure above demonstrates the multiple paths that could be taken with test teams testing different components of the overall Analytics architecture. You can see that it’s possible to start Iteration #2 and run its ELT straight after the first ELT has run, once a clone has been taken, in advance of reviewing or walking through test patterns. The dotted line is where, more traditionally, you’d wait until satisfactory test results are yielded from reviewing the ELT transformations; however, if you’re under time pressure, or if the test patterns for a second run differ from the first, this could give valuable insights into potential future issues ahead of time.
You’ll then see that the Analytics stage of testing can be performed independently of the ELT if required. This means multiple runs can be made and kept as clones of the data-sets ahead of the Analytics tests, and even a queue can be formed of these.
Going live with a project is always an interesting phase, and there’s no avoiding the planning that’s required. However, Snowflake makes life so much easier from an implementation window perspective.
We’ve worked on implementations that have spanned multiple days before now, when complex rework is required – say when overhauling prior designs and replacing them with new ones. This usually requires taking the current production system out of action to make the changes, re-running the data through the system, and then running any batch cycles etc. That means the data isn’t available downstream for your analytics users for that period of time. It also means you’re on the clock in a big way and have to facilitate the process with potentially sophisticated rollback strategies, which generally incorporate deadline dates/times to complete by – or roll back and then reschedule the whole thing again…
The good news is that with Snowflake you can take a point-in-time set of clones and perform the project implementation against them at the same time as your users and the existing ELT are running against the original set of databases. To boot, this can be done using different compute power, so it has no effect on any data transformation processes or analytics during the implementation. Then, once the project has been implemented, the databases can be swapped, which also means you’ve an easy fall-back position. This suits an ELT process particularly well, as the compute for the transformations is handled by the warehouses associated with either the implementation or the original databases.
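A minimal sketch of the pattern, assuming illustrative names and an agreed cut-off time:

  -- Clone production as it stood at the agreed point in time
  CREATE DATABASE sales_dw_release CLONE sales_dw
    AT (TIMESTAMP => '2020-01-31 23:00:00'::TIMESTAMP_LTZ);

  -- Run the implementation and re-processing against the clone on its own warehouse,
  -- while users and the existing ELT carry on against sales_dw untouched

  -- Once the new build is verified, swap the two databases in a single statement;
  -- falling back is simply the same swap again
  ALTER DATABASE sales_dw SWAP WITH sales_dw_release;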
To summarise, whilst Snowflake has a lot of clever technology and capability underneath the bonnet, the way it can be utilized is extremely flexible from a business realization viewpoint, and this power can be used with very little trouble. Accelerating your development to production times has never been easier!