SNOWFLAKE began as a best-in-class cloud data storage and querying tool. It set out to simply store data, while also providing the near-unlimited scalability of the cloud. Right out of the gate, it offered incredible flexibility with its decoupled storage and compute. You could model your data in a Kimball dimensional model, a Data Vault, or even a loosely structured data lake, all in the same instance. It also handled table clustering, auto-scaling, security, caching, and many other features most data engineers and analysts don't want to worry about. However, it fell short when it came to data extraction and transformation. Other platforms like Redshift, BigQuery, and Synapse integrated tightly with their respective cloud data stacks, making data engineering and processing as simple as it gets. Snowflake, on the other hand, offers three options for loading and transforming data.
Snowflake has a proprietary continuous data loading solution called Snowpipe. It uses named Snowflake objects called pipes, which contain a COPY statement that loads data from a stage into a table. A stage is a location where data files are stored in the cloud; stages can be internal (within Snowflake itself) or external (on a cloud platform like AWS S3 or Azure Data Lake Storage). Snowpipe can be triggered through a public REST API endpoint or by event notifications from a cloud messaging service like AWS SNS.
For example:
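The sketch below is illustrative only: the stage, pipe, table, and bucket names are hypothetical, and credentials and notification setup are omitted.

-- Create an external stage pointing at a cloud storage location (credentials omitted).
CREATE OR REPLACE STAGE my_s3_stage
  URL = 's3://my-bucket/orders/'
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Create a pipe whose COPY statement loads staged files into a landing table.
-- AUTO_INGEST = TRUE lets cloud event notifications trigger the load automatically.
CREATE OR REPLACE PIPE my_orders_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_orders
  FROM @my_s3_stage;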
There are numerous benefits to this approach:
However, Snowpipe may not be appropriate for the following reasons:
From the beginning, Snowflake created a robust partner network and integrations with some of the most popular ETL tools on the market. These ETL vendors have also built native integrations with Snowflake, simplifying the process of extracting, transforming, and loading data from various sources into Snowflake. While each ETL tool is different, they should all be able to offer the following:
There are many use cases where a third-party ETL tool is the right choice, and, when implemented successfully, it can save time and money for data engineering teams. There are also reasons not to use third-party tools:
There are other options for loading data into Snowflake:
The data loading wizard allows you to load data directly into tables with no code. It is intended for small datasets in rare, one-off circumstances and should not be used for regular data loading. Bulk loading via the COPY command allows you to manually load data by staging it and then copying it into a table; this requires a running virtual warehouse. It, too, should not be used for regularly scheduled loads or for any kind of volatile or business-critical data.
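As an illustration, a manual bulk load might look like the following sketch, assuming a hypothetical internal stage, table, and local file path:

-- Stage a local file into an internal named stage (run from SnowSQL or another client).
PUT file:///tmp/customers.csv @my_internal_stage;

-- Copy the staged files into a table; this step requires a running virtual warehouse.
COPY INTO raw_customers
FROM @my_internal_stage
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);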
While Snowflake started off as a cloud storage and querying platform, competition has forced it to create more data engineering capabilities. Platforms like Databricks are starting to encroach on data storage, and Snowflake has responded by adding new features:
Stored procedures allow you to execute procedural logic. Procedures can take parameters and run operations such as DML statements. As of October 2021, stored procedures can only be written in JavaScript.
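For instance, a simple JavaScript stored procedure might look like the sketch below; the procedure name, table, and load_date column are hypothetical:

-- Hypothetical example: a JavaScript stored procedure that runs a parameterized DML statement.
CREATE OR REPLACE PROCEDURE purge_old_rows(TABLE_NAME STRING, DAYS_TO_KEEP FLOAT)
  RETURNS STRING
  LANGUAGE JAVASCRIPT
AS
$$
  // Build and run a DELETE against the table passed in as a parameter.
  var sql = "DELETE FROM " + TABLE_NAME +
            " WHERE load_date < DATEADD(day, -" + DAYS_TO_KEEP + ", CURRENT_DATE())";
  snowflake.createStatement({ sqlText: sql }).execute();
  return "Purged rows older than " + DAYS_TO_KEEP + " days from " + TABLE_NAME;
$$;

-- Call it like any other procedure.
CALL purge_old_rows('RAW_ORDERS', 90);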
Tasks are named objects that execute a SQL statement at a scheduled time. The emphasis is on the scheduled aspect, as this is the first semblance of orchestration in Snowflake. A task will use a virtual warehouse to run a SQL statement, such as calling a stored procedure. In other words, you can combine tasks and stored procedures to create data pipelines!
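As a sketch, a task that calls a stored procedure (such as the hypothetical purge_old_rows above) on a nightly schedule could be defined like this; the task and warehouse names are illustrative:

CREATE OR REPLACE TASK nightly_purge_task
  WAREHOUSE = etl_wh                        -- virtual warehouse that supplies the compute
  SCHEDULE  = 'USING CRON 0 2 * * * UTC'    -- run every night at 2:00 UTC
AS
  CALL purge_old_rows('RAW_ORDERS', 90);

-- Tasks are created suspended; resume the task to start the schedule.
ALTER TASK nightly_purge_task RESUME;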
Because tasks are named database objects, you can use them in third party ETL solutions as you would a table or view. Certain limitations may apply depending on the tool.
Streams are the last piece of the puzzle for creating a data engineering environment. A stream is a named object created on an existing table. It exposes the same columns as the table, plus three additional columns: METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID.
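A stream can be created and queried like this sketch; the table, stream, and data columns are hypothetical:

-- Create a stream that tracks changes on a landing table.
CREATE OR REPLACE STREAM raw_orders_stream ON TABLE raw_orders;

-- Each row in the stream is a change since the stream was last consumed,
-- flagged by the metadata columns.
SELECT order_id, order_total, METADATA$ACTION, METADATA$ISUPDATE, METADATA$ROW_ID
FROM raw_orders_stream;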
With these three columns, you can implement change data capture (CDC) on tables. THIS ARTICLE shows how to build a relatively simple Type II slowly changing dimension and all the upsert logic associated with it.
Streams can provide an easier alternative to stored procedures for processing CDC. In a traditional Kimball architecture, they are best used in the ODS layer and for Type II slowly changing dimensions in the presentation layer. Instead of having to program a stored procedure in JavaScript, you can combine a stream, which provides the table history, with a view that handles the insert/update/delete logic. However, a stream only applies to a single table. Streams are not appropriate for ETL processes that require complex joins and aggregations, such as those typically used in a staging layer for fact and dimension tables.
Streams and stored procedures still need to be used together for CDC, but the process is simplified. Instead of comparing the ODS table with the new raw data, the stream determines the table changes for you. The stream and a corresponding view that manages the upsert logic are used in a simple stored procedure to expire changed and deleted records, insert new records and the current versions of changed records, and append metadata to all records.
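As a simplified sketch (not a full Type II implementation), the apply step inside such a procedure could be a single MERGE driven by the stream; the table, stream, and column names below are hypothetical:

MERGE INTO ods_orders AS tgt
USING (
    -- The DELETE half of an update is ignored; the matching INSERT row carries the new values.
    SELECT *
    FROM raw_orders_stream
    WHERE NOT (METADATA$ACTION = 'DELETE' AND METADATA$ISUPDATE)
) AS src
  ON tgt.order_id = src.order_id
-- Expire records that were deleted in the source.
WHEN MATCHED AND src.METADATA$ACTION = 'DELETE' THEN
  UPDATE SET tgt.is_current = FALSE, tgt.effective_to = CURRENT_TIMESTAMP()
-- Apply changed records.
WHEN MATCHED AND src.METADATA$ACTION = 'INSERT' THEN
  UPDATE SET tgt.order_total = src.order_total, tgt.updated_at = CURRENT_TIMESTAMP()
-- Insert brand-new records.
WHEN NOT MATCHED AND src.METADATA$ACTION = 'INSERT' THEN
  INSERT (order_id, order_total, is_current, effective_from)
  VALUES (src.order_id, src.order_total, TRUE, CURRENT_TIMESTAMP());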
With these native Snowflake objects, data processing no longer requires third-party tools. Achieving a robust persistent staging layer with auditability is now possible with streams, and when these are chained with tasks and called by stored procedures, the entire process can be automated. Pipes can continuously bring new data into Snowflake, so your data will never be stale. The data orchestration and processing components that previously lived in a third-party tool now reside entirely within Snowflake, and the code is entirely SQL.
Launch Consulting is a business and technology consulting firm and proud partner of Snowflake, with direct experience in Snowflake query usage and in helping our clients understand and maximize the benefits the Snowflake Data Platform presents. Our team of experts can work hand-in-hand with you to determine if leveraging Snowflake is right for your organization. Check out Launch's additional Snowflake thought leadership HERE.
Snowflake delivers the Data Cloud – a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Join the Data Cloud at SNOWFLAKE.COM.