Best Practices for Transforming Data in Snowflake
Some helpful hints to get your data where and how it needs to be.
The death of the star schema is not exaggerated. Gone are the days of all-encompassing data warehouse models and the 24-month projects to build them.
We live in a highly disruptive, event-driven world. New analytics are required almost daily to understand how our customers, business, and markets shift. A modern data stack using the speed, flexibility, and scalability of Snowflake needs to allow an organization to “model as you go” to answer critical business questions on the fly.
To this end, modern data modeling and transformation tools need to offer three things:
- Allow anyone in the organization, regardless of technical skills, to model and transform data in Snowflake,
- Support collaborative data modeling and transformation processes with Snowflake at the heart, and
- Let rich knowledge and information about data models be easily captured and shared.
At the same time, there does need to be some structure to the data warehouse and models. You can’t just dump data in using your EL tools and tell the analysts to “have at it.” You need to instill a set of best practices and collaborative processes that help you make the most of your Snowflake data. Let’s explore how you can achieve this in a modern data modeling and transformation platform, such as Datameer, that meets all three criteria above.
Step 1: Getting the Data Out of Raw Form
The initial data model your team will work from is a set of tables in Snowflake that mirror the objects from your data sources, are grouped similarly, and contain all the same fields. But because the data is still in a form close to the SaaS application or cloud service objects, it can be cryptic and hard for a data analyst to understand.
First, if any of the data is private or sensitive, it needs to be anonymized or masked. This is critical to maintain the privacy of the data and ensure regulatory compliance. Second, after the raw data is loaded, the data engineering team may apply data transformation for data cleansing to (a) find and correct missing or invalid values, (b) transform incorrectly formatted fields, and (c) extract individual fields from complex, multi-faceted columns.
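As an illustration, this first pass can be expressed directly in Snowflake SQL. Here is a minimal sketch, assuming a hypothetical raw.contacts table; every table and column name below is illustrative:

```sql
-- Minimal sketch of masking and first-pass cleansing on a hypothetical raw table.
CREATE OR REPLACE VIEW staging.contacts_clean AS
SELECT
    contact_id,
    SHA2(email, 256)                       AS email_hash,   -- mask PII for privacy and compliance
    COALESCE(TRIM(first_name), 'UNKNOWN')  AS first_name,   -- correct missing values
    TRY_TO_DATE(signup_date, 'YYYY-MM-DD') AS signup_date,  -- fix incorrectly formatted dates
    SPLIT_PART(full_address, ';', 1)       AS street,       -- extract fields from a compound column
    SPLIT_PART(full_address, ';', 2)       AS city
FROM raw.contacts;
```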
Step 2: Canonical Data Modeling
Once the data is in the CDW and has gone through the first pass of data transformation, the data engineering team can transform the raw data into canonical data models that represent specific subjects. Examples of these would be data models representing accounts, contacts, leads, opportunities, activities, and more. Canonical data models create shared, reusable components for multiple use cases.
Since the raw data that came from the data sources is often normalized (in some cases mildly normalized and others highly normalized), the canonical data models will typically blend (JOIN, UNION, etc.) data from multiple objects to create a rich and complete set of fields to represent the subject. In addition, the canonical data models may also have some data enrichment to calculate new fields for standardized use in different use cases.
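For illustration, here is a sketch of a canonical opportunities model. The raw tables, join keys, and fields are assumptions, not from any particular source system:

```sql
-- Sketch: blend normalized raw objects into one canonical subject and
-- enrich it with a standardized calculated field for reuse.
CREATE OR REPLACE VIEW canonical.opportunities AS
SELECT
    o.opportunity_id,
    o.account_id,
    a.account_name,
    u.user_name                                   AS owner_name,
    o.amount,
    o.close_date,
    DATEDIFF('day', o.created_date, o.close_date) AS days_to_close  -- standardized enrichment
FROM raw.opportunities o
JOIN raw.accounts a   ON a.account_id = o.account_id
LEFT JOIN raw.users u ON u.user_id = o.owner_id;
```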
Step 3: Use Case Data Modeling
The final step in data modeling is to create datasets that are specific to the analytics use case. For modern data modeling in Snowflake, this task is typically done by the data analyst because they have a more complete understanding of how the business would use the data and how it would be incorporated into analytics.
Data analysts may have varying technical skills, but most would prefer to spend more time on what they are good at (analysis) and less on coding data transformations. This is where a low-code or no-code data transformation UI becomes important, eliminating the need for analysts to write complex SQL code or Python-like scripts.
Use case data modeling and transformation will typically involve the following (a SQL sketch follows the list):
- Data cleansing that is specific for the use case, such as identifying and fixing outliers or de-duping records,
- Data shaping and reduction such as sorting and organizing the data, eliminating unneeded fields, or narrowing the scope of the data, and
- Data enrichment to add new calculated fields or uploading and blending local files specific to the use case.
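Here is a sketch of the first two items, building on the canonical model above; the region field and the view name are hypothetical:

```sql
-- Sketch: use-case-specific cleansing and shaping: narrow the scope,
-- project only the needed fields, and de-dupe records.
CREATE OR REPLACE VIEW analytics.emea_pipeline AS
SELECT
    opportunity_id,
    account_name,
    amount,
    close_date
FROM canonical.opportunities
WHERE region = 'EMEA'                  -- hypothetical field narrowing the scope
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY opportunity_id
    ORDER BY close_date DESC
) = 1;                                 -- keep only the latest record per opportunity
```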
The optimal, final form of the data model will be a single flattened data structure – a very large, wide table. This, along with materialization, eliminates the need to perform expensive JOINs each time an analytical query runs.
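As a sketch, that wide table can simply be built as a regular table from the views above (canonical.contacts is a hypothetical second canonical model):

```sql
-- Sketch: flatten the use case model into one wide, materialized table so
-- analytical queries avoid repeating expensive JOINs.
CREATE OR REPLACE TABLE analytics.opportunity_360 AS
SELECT
    o.opportunity_id,
    o.account_name,
    o.owner_name,
    o.amount,
    o.close_date,
    o.days_to_close,
    c.city AS primary_contact_city     -- blended in from a second canonical model
FROM canonical.opportunities o
LEFT JOIN canonical.contacts c ON c.account_id = o.account_id;
```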
Step 4: Optimize the Data Models for Snowflake
In the final step, data engineers will come back into the picture to tune the models specifically for Snowflake for fast and scalable execution. Here are some optimization techniques:
Use of Virtual Data Warehouses – Snowflake recommends you have a different virtual data warehouse for each use case and that you separate “loading” and “execution” into separate virtual data warehouses to best manage different workloads. Place use case-specific data models in the designated use case virtual warehouse, and keep raw data loading (for your EL tools) in its own loading virtual warehouse.
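A sketch of that separation (warehouse names and sizes are illustrative):

```sql
-- Sketch: one warehouse for EL loading, one per use case for execution.
CREATE WAREHOUSE IF NOT EXISTS load_wh
    WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 60;    -- EL tools point here
CREATE WAREHOUSE IF NOT EXISTS sales_bi_wh
    WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300;  -- sales analytics runs here

USE WAREHOUSE sales_bi_wh;  -- analyst sessions use the use case warehouse
```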
Data Transformation Execution – Snowflake recommends materialized views to improve performance for queries that are frequent and repeated, and data transformation models fit this pattern perfectly. Also, place views that are unique to a use case in that use case’s database, and data transformation views that are shared across use cases in a separate one. A good data transformation tool, such as Datameer, can maintain a complete repository of data transformation models regardless of where they physically reside.
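For example (database and view names are illustrative; note that Snowflake materialized views must select from a single table, with no JOINs, and require Enterprise Edition or higher):

```sql
-- Sketch: separate shared and use-case-specific homes for transformation views.
CREATE DATABASE IF NOT EXISTS shared_models;    -- models reused across use cases
CREATE DATABASE IF NOT EXISTS sales_analytics;  -- views unique to one use case

-- Materialize a frequent, repeated transformation over a single table.
CREATE MATERIALIZED VIEW analytics.daily_won_amount AS
SELECT close_date, SUM(amount) AS won_amount
FROM analytics.opportunity_360
GROUP BY close_date;
```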
Data Transformation Techniques – There are four specific techniques for optimizing the performance of your data transformation views or queries (an example of the last one follows the list):
- Reduce complexity and combine multiple queries into one. Simpler is always better.
- JOINs are very costly, so if a data model JOINs multiple tables, materialize the view.
- Discard data you don’t need by performing projection in your data transformation.
- Pre-aggregate for common aggregation values and dimensions then materialize these.
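A sketch of the last technique, assuming a hypothetical region dimension on the wide table:

```sql
-- Sketch: pre-aggregate common measures and dimensions, then materialize the
-- result so downstream queries scan a small table instead of the wide one.
CREATE OR REPLACE TABLE analytics.pipeline_by_region_month AS
SELECT
    region,                                     -- hypothetical common dimension
    DATE_TRUNC('month', close_date) AS month,
    SUM(amount) AS total_amount,
    COUNT(*)    AS opportunity_count
FROM analytics.opportunity_360
GROUP BY region, DATE_TRUNC('month', close_date);
```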
In practice, data engineers typically revisit data transformation models once they have been created and optimize them along these dimensions.
Wrap-up
Following the best practices outlined here will not only bring better structure and organization to your Snowflake cloud data warehouse but also much greater agility to your analytics. New analytics can be produced in hours rather than days or even weeks. Entirely new use cases can be delivered in days rather than through many-month monolithic projects.