What exactly is Snowflake?
Snowflake is a cloud-based data warehousing platform, built on top of the public cloud and delivered as a SaaS (Software as a Service) offering. Unlike traditional data warehouse options, it is easier to set up, faster, and more flexible. Thanks to its many unique and innovative features, Snowflake has become a leader in data management solutions for analytics.
Why choose Snowflake?
The Snowflake database is gaining popularity because it is easy to use: it has a simple user interface and abstracts away technical complexity. Due to the SaaS model, there are no software upgrades to manage, and ramping up the infrastructure as your databases and user base grow is much easier. Snowflake supports innovative features like auto-scaling, cloning, data sharing, and auto-suspend. For highly concurrent workloads, Snowflake can automatically scale out.
Snowflake compresses (and encrypts) all data, which reduces storage requirements. It also supports the Apache Parquet file format, enabling in-place querying of data lakes.
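As a hedged sketch of what in-place querying looks like (the bucket URL and the user_id field below are hypothetical placeholders), staged Parquet files can be queried without loading them first:
-- Hypothetical sketch: query Parquet files in a data lake without loading them.
-- Assumes credentials or a storage integration for the bucket are already configured.
CREATE STAGE lake_stage
  URL = 's3://my-data-lake/events/'
  FILE_FORMAT = (TYPE = 'PARQUET');
-- Each Parquet record is exposed as a single VARIANT column named $1.
SELECT $1:user_id::STRING AS user_id
FROM @lake_stage
LIMIT 10;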
What is a Snowflake data warehouse?
Snowflake is a cloud-native data warehouse with subscription and usage-based pricing: you only pay for compute resources while they are running, and you can suspend them when idle to increase your savings.
As of early 2022, Snowflake runs on Google Cloud, Azure, and AWS. Snowflake has taken a first-principles approach to data warehousing. This includes:
- Zero-copy cloning (sketched below)
- Data sharing
- Row-level security
- Automatic refresh of materialized views
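For example, zero-copy cloning creates a fully writable copy of a table, schema, or database without duplicating the underlying storage. A minimal sketch (the table names are illustrative):
-- Clone a production table for development; storage is shared until either copy changes.
CREATE TABLE contacts_dev CLONE contacts;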
Snowflake Architecture
On AWS, Snowflake uses S3 as the storage layer. S3 is a cloud-native object storage service capable of processing large amounts of data in parallel, and Snowflake builds on this massively parallel storage capability.
The data is stored in micro-partitions, each holding between 50MB and 500MB of uncompressed data. The metadata about each micro-partition, such as row counts, distinct counts, and min/max values, is stored in the cloud services layer (FoundationDB). Compute resources are allocated dynamically based on the load and the configuration, and users are charged for compute and storage based on actual usage.
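Because these statistics are tracked per micro-partition, a query like the hypothetical one below (using the CONTACTS table created later in this guide) can often be answered, or heavily pruned, from metadata alone rather than a full scan:
-- Served largely from micro-partition metadata rather than scanning table data.
SELECT COUNT(*), MIN(postalcode), MAX(postalcode) FROM contacts;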
Snowflake combines the benefits of shared-nothing and shared-disk designs in a unique, innovative design that it calls the multi-cluster, shared-data architecture. In summary, the architecture of the Snowflake database consists of three layers:
- Query Processing
- Database Storage
- Cloud Services
Query Processing in Snowflake
The query processing layer in Snowflake consists of the compute resources; in the case of AWS, these are EC2 virtual machines. Depending on the cluster configuration, one or more virtual machines are allocated. These compute resources read data from the micro-partitions and perform tasks like sorting, aggregating, and filtering. The compute/query layer caches the queried micro-partitions, which minimizes communication overhead for future queries on the same dataset. Snowflake charges for the actual compute resources used with per-second billing. Users can reduce cost by using the AUTO_SUSPEND and AUTO_RESUME options. In addition, multiple virtual warehouses can be created to meet workload requirements; for example, time-sensitive queries can run on a larger warehouse.
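A minimal sketch of creating such a warehouse (the name and size are illustrative):
CREATE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE = 'LARGE'   -- a larger size for time-sensitive queries
  AUTO_SUSPEND = 60          -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE;        -- resume automatically when a new query arrives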
Database storage in Snowflake
Snowflake stores data in databases. A database is a logical grouping of objects and consists of schemas; a schema is a grouping of tables and views (and procedures, sequences, etc.). Users can interact with a large number of objects in Snowflake using SQL.
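A minimal sketch of this hierarchy (all names are illustrative):
CREATE DATABASE sales_db;
CREATE SCHEMA sales_db.crm;
CREATE TABLE sales_db.crm.accounts (id NUMBER, name STRING);
-- Objects can be addressed by their fully qualified name: database.schema.object
SELECT * FROM sales_db.crm.accounts;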
The storage layer in Snowflake consists of micro-partitions. These partitions are physically stored in cloud object storage such as S3. The micro-partitions are immutable. For updates, a new version of a micro-partition is created. All data in the partitions is compressed and encrypted.
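This immutability is also what enables Snowflake's Time Travel feature: older micro-partition versions are retained for a period and can still be queried. A hedged example (retention windows vary by edition and settings):
-- Query the table as it looked five minutes ago.
SELECT * FROM contacts AT (OFFSET => -300);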
Cloud services layer in Snowflake
The Cloud Services layer works as Snowflake's central nervous system. It handles metadata management, security, authentication, encryption, SQL compilation, query parsing and optimization, and session management. The metadata required to optimize a query or filter data is stored in this layer. This also eliminates much of the manual tuning that traditional data warehouses require.
The different layers scale independently and are redundant by design. To see how they work together, consider the lifecycle of a query: when a Snowflake client connects, it starts a session. A query is submitted through the session to the services layer, which verifies access and authorization to the data, optimizes the query, and builds a query plan. It then sends execution instructions to a virtual warehouse, where the query processing layer allocates resources and executes the query. The results are returned to the user.
How Snowflake differs from other traditional architectures
A ‘shared disk’ architecture uses multiple nodes to access shared data on a single storage system, while a ‘shared nothing’ architecture stores a part of the data on each node of the data warehouse. Snowflake combines the benefits of both in a unique, innovative design: queries are executed using MPP (massively parallel processing), with each node in the compute cluster working on a portion of the dataset, while all nodes share access to a central data store.
Advantages of Virtual Warehouses
Virtual warehouses can be started and stopped at any time, and can be scaled up or down without affecting queries. They can be set to auto-suspend and auto-resume, which helps reduce cost, and they can auto-scale between a defined minimum and maximum cluster size. Charges for virtual warehouses and the underlying storage are separate. Cloud Services usage is free as long as it stays within a certain percentage of total compute usage (10% of daily compute credits at the time of writing).
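A sketch of a multi-cluster warehouse with auto-scaling bounds (the name and sizes are illustrative; multi-cluster warehouses require Enterprise edition or higher):
CREATE WAREHOUSE concurrent_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1      -- scale in to one cluster when load is low
  MAX_CLUSTER_COUNT = 4      -- scale out to four clusters under high concurrency
  AUTO_SUSPEND = 120
  AUTO_RESUME = TRUE;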
Connecting to Snowflake
Snowflake can be connected to in the following ways:
- ODBC and JDBC drivers
- Command-line clients (e.g. SnowSQL)
- Native connectors (e.g. Python)
- The web-based user interface
- Third-party connectors (e.g. ETL tools and BI tools)
Data Loading into Snowflake
Snowflake supports the following options for data loading:
- The web UI, for limited amounts of data
- Snowpipe, to automate bulk loading of data
- SnowSQL, for bulk loading
- Third-party tools, to bulk load data from external sources
SnowSQL for Bulk Loading
Bulk loading is performed in two phases: phase 1 stages the files, and phase 2 loads the data. Our emphasis will be on loading data from CSV files.
Stage the Data
Staging involves uploading the data files to a location where Snowflake can access them; in phase 2 of the loading process, data is loaded into tables from this stage. There are two types of stage locations:
- Internal Stage
- External Stage
An active virtual warehouse is required to load data into a database. The data can be located in one or more files. In the following example, an internal named stage is created, files are uploaded to it using the PUT command, and the data is then loaded from the stage into the destination table using the COPY INTO command.
The steps are as below:
- Use the demo_db database.
Superuser-MacBook-Pro: Documents xyzdata$ snowsql -a bulk_data_load
User: peter
Password:
* SnowSQL * V1.1.65
Type SQL statements or !help
peter#(no warehouse)@(no database).(no schema)>USE DATABASE demo_db;
+----------------------------------------------------+
| status                                             |
|----------------------------------------------------|
| Statement executed successfully.                   |
+----------------------------------------------------+
1 Row(s) produced. Time Elapsed: 0.219s
- Create the CONTACTS table using the following SQL:
peter#(no warehouse)@(DEMO_DB.PUBLIC)>CREATE OR REPLACE TABLE contacts
(
id NUMBER (38, 0),
first_name STRING,
last_name STRING,
company STRING,
email STRING,
workphone STRING,
cellphone STRING,
streetaddress STRING,
city STRING,
postalcode NUMBER (38, 0)
);
+----------------------------------------------------+
| status |
|----------------------------------------------------|
| Table CONTACTS successfully created. |
+----------------------------------------------------+
1 Row(s) produced. Time Elapsed: 0.335s
- Next, create an internal stage called csvfiles.
peter#(no warehouse)@(DEMO_DB.PUBLIC)>CREATE STAGE csvfiles;
+----------------------------------------------------+
| status |
|----------------------------------------------------|
| Stage area CSVFILES successfully created. |
+----------------------------------------------------+
1 Row(s) produced. Time Elapsed: 0.311s
- Use the PUT command to stage the files in csvfiles. This command uses the wildcard contacts0*.csv to upload multiple files; the @ symbol defines where to stage the files – in this case, @csvfiles.
peter#(no warehouse)@(DEMO_DB.PUBLIC)>PUT file:///tmp/load/contacts0*.csv @csvfiles;
contacts01.csv_c.gz(0.00MB): [##########] 100.00% Done (0.417s, 0.00MB/s),
contacts02.csv_c.gz(0.00MB): [##########] 100.00% Done (0.377s, 0.00MB/s),
contacts03.csv_c.gz(0.00MB): [##########] 100.00% Done (0.391s, 0.00MB/s),
contacts04.csv_c.gz(0.00MB): [##########] 100.00% Done (0.396s, 0.00MB/s),
contacts05.csv_c.gz(0.00MB): [##########] 100.00% Done (0.399s, 0.00MB/s),
+----------------+-------------------+-------------+-------------+----------+
| source         | target            | source_size | target_size | status   |
|----------------+-------------------+-------------+-------------+----------|
| contacts01.csv | contacts01.csv.gz |         554 |         412 | UPLOADED |
| contacts02.csv | contacts02.csv.gz |         524 |         400 | UPLOADED |
| contacts03.csv | contacts03.csv.gz |         491 |         399 | UPLOADED |
| contacts04.csv | contacts04.csv.gz |         481 |         388 | UPLOADED |
| contacts05.csv | contacts05.csv.gz |         489 |         376 | UPLOADED |
+----------------+-------------------+-------------+-------------+----------+
5 Row(s) produced. Time Elapsed: 2.111s
- To confirm that the CSV files have been staged, use the LIST command.
peter#(no warehouse)@(DEMO_DB.PUBLIC)>LIST @csvfiles;
- To load the staged files into the CONTACTS table, first specify a virtual warehouse to use.
peter#(no warehouse)@(DEMO_DB.PUBLIC)>USE WAREHOUSE dataload;
+----------------------------------------------------+
| status |
|----------------------------------------------------|
| Statement executed successfully. |
+----------------------------------------------------+
1 Row(s) produced. Time Elapsed: 0.203s
- Load the staged files into the Snowflake table:
peter#(DATALOAD)@(DEMO_DB.PUBLIC)>COPY INTO contacts
FROM @csvfiles
PATTERN = '.*contacts0[1-4].csv.gz'
ON_ERROR = 'skip_file';
INTO specifies the table into which the data is loaded, PATTERN specifies which staged files to load, and ON_ERROR tells the command how to proceed when it encounters errors.
- If the load was successful, you can now query your table using SQL:
peter#(DATALOAD)@(DEMO_DB.PUBLIC)>SELECT * FROM contacts LIMIT 10;
Third-Party Tools
Third-party ETL/ELT tools can be used for bulk loading of data. Snowflake supports a growing number of integrations for loading data from external sources.
Snowpipe
Snowpipe is a feature that enables continuous loading of data from stage locations. Snowpipe uses the COPY command to copy data from a stage into database tables, and data can be transformed with functions during loading. Snowpipe uses Snowflake-managed compute resources, which are charged separately; this removes the need for a user-managed virtual warehouse.
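A minimal sketch of a pipe (the pipe name is illustrative, and AUTO_INGEST additionally assumes an external stage with cloud event notifications configured):
CREATE PIPE contacts_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO contacts
  FROM @csvfiles
  FILE_FORMAT = (TYPE = 'CSV');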
Web Interface
Another option for loading data is the web interface: select the table to load and click the Load button. Using this method, you can load a limited amount of data into Snowflake. It simplifies loading by combining staging and loading into a single operation, and it deletes the staged files after loading.
The Benefits of Snowflake Cloud Data Warehouse
1) Intuitive and simple Interface
Snowflake has an intuitive interface. Most operations can be performed using either the UI or SQL.
2) High efficiency and throughput
The elastic nature of the cloud allows users to load data faster and run high volumes of concurrent queries. Virtual warehouses can be scaled up or down to take advantage of extra compute resources, and users only pay for the time used. The Snowflake platform ensures that query processing is performed at an optimal rate.
3) BI Tools
Snowflake is compatible with a large number of BI tools (e.g. Tableau, Power BI). These tools empower users to analyze and visualize large amounts of data.
4) Seamless sharing of data
Snowflake's architecture allows seamless data sharing with any data consumer. There are two options for sharing: direct sharing, which requires the consumer to have a Snowflake account, and reader accounts, for consumers without one. The data owner pays for the compute resources used by a reader account.
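A sketch of direct sharing (the share name and consumer account are hypothetical):
CREATE SHARE contacts_share;
GRANT USAGE ON DATABASE demo_db TO SHARE contacts_share;
GRANT USAGE ON SCHEMA demo_db.public TO SHARE contacts_share;
GRANT SELECT ON TABLE demo_db.public.contacts TO SHARE contacts_share;
-- Make the share visible to a consumer account (placeholder account identifier).
ALTER SHARE contacts_share ADD ACCOUNTS = partner_account;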
5) Cost-effective
Using a combination of auto-suspend and auto-resume, the idle time of compute resources is minimized. Charges for compute and storage are separate. This enables cost optimization by using right-sized compute resources based on the workload.
6) Flexibility and elasticity
Due to the elastic nature of compute and its separation from storage, Snowflake offers a high degree of elasticity. Users can query data in the warehouse or directly query external data lakes in S3.
7) Support for multiple data formats
Snowflake supports various formats (XML, JSON, Parquet, etc.) and works with structured, semi-structured, and unstructured data. Combined with its data transformation capabilities, this lets you work with almost any dataset.
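A brief sketch of handling semi-structured JSON through a VARIANT column (the table and fields are illustrative):
CREATE OR REPLACE TABLE events (payload VARIANT);
INSERT INTO events SELECT PARSE_JSON('{"user": "peter", "action": "login"}');
-- Traverse the JSON with the : operator and cast with ::
SELECT payload:user::STRING AS user_name, payload:action::STRING AS action FROM events;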
8) Scalability
Snowflake also supports instant data warehouse scaling to handle highly concurrent workloads, automatically scaling up and down depending on the load.
Conclusion
Snowflake is an extremely powerful tool for data warehousing. It differs from other data warehouse solutions through its cloud-native, SaaS model, and it brings innovative features such as zero-copy cloning, data sharing, and time travel, coupled with support for enterprise features such as row-level security and automatic refresh of materialized views.
Given the booming popularity of cloud data warehousing, tools like Snowflake are extremely important. Integrating Snowflake into an organization's data infrastructure can lead to better analytical results and to better prediction of the company's future growth.