Snowflake Concurrency and Parallel Processing

This article explains Snowflake's concurrency and parallel processing capabilities and options, and how virtual warehouse configuration affects performance and concurrency so that data can be processed efficiently within a required performance target. It also aims to clear up some of the confusion in the industry around Snowflake's concurrent processing limits.

In general, we think of multi-cluster warehouses whenever concurrency comes up in discussion. But in Snowflake there are two ways to handle concurrency:

1. Concurrency or parallel processing within a single-cluster warehouse

2. Concurrency or parallel processing in a multi-cluster warehouse

It should also be noted that performance and concurrency are essentially two sides of the same coin: if concurrency is not handled properly, it looks like a performance issue to the end user.

The text below explains both types of concurrency in detail, along with some of the important parameters and how to tweak them to control the concurrency and performance of a warehouse.

Snowflake Virtual Warehouse and its internal Architecture

Before going into the details of concurrency, we need to understand what virtual warehouses are in Snowflake.

  1. Virtual warehouses in Snowflake are compute clusters that internally use Amazon EC2 instances or Azure VMs. The user need not know these internal details and can create warehouses in T-shirt sizes: X-SMALL, SMALL, MEDIUM, LARGE, X-LARGE, 2X-LARGE, 3X-LARGE and 4X-LARGE. (A creation example follows this list.)
  2. Each compute cluster consists of one or more servers with multi-core CPU(s) and hyper-threading enabled, which can execute multiple tasks in parallel.
  3. The maximum possible number of parallel processes or queries in a Snowflake compute cluster depends on the number of processing threads available in the machines that make up the cluster.
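
As a quick illustration of how warehouses are created and resized, here is a minimal sketch in Snowflake SQL; the warehouse name demo_wh and the settings are illustrative assumptions, not recommendations:

```sql
-- Create an X-SMALL warehouse (name and settings are hypothetical examples)
CREATE WAREHOUSE IF NOT EXISTS demo_wh
  WITH WAREHOUSE_SIZE = 'XSMALL'
       AUTO_SUSPEND   = 60     -- suspend after 60 seconds of inactivity to save credits
       AUTO_RESUME    = TRUE;  -- resume automatically when a query arrives

-- Resize later without recreating the warehouse
ALTER WAREHOUSE demo_wh SET WAREHOUSE_SIZE = 'MEDIUM';
```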

For more details on Snowflake basics and virtual warehouses, please see the Snowflake documentation: https://docs.snowflake.com/en/user-guide/warehouses.html

To dive into the matter, let's have a look at the probable* architecture of a Snowflake X-SMALL warehouse:

A typical Snowflake X-SMALL warehouse with assumed* compute capacity:

[Figure: assumed internal architecture of a Snowflake X-SMALL warehouse server]

Note*: Number of physical servers: 1; processors per server: 2; cores per processor: 4; threads per core: 2.

*With these assumed numbers, the X-SMALL cluster above can execute a maximum of 1 × 2 × 4 × 2 = 16 parallel processes (servers × processors × cores × threads per core).

Given the above, below are the maximum concurrency or parallel processing limits of the various single-cluster warehouse sizes.
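
Note*: Snowflake does not publish these internals; the figures below simply extend the assumed 16-threads-per-server capacity above, together with the commonly cited doubling of servers at each size step:

  • X-SMALL: 1 server, up to 16 parallel processes
  • SMALL: 2 servers, up to 32
  • MEDIUM: 4 servers, up to 64
  • LARGE: 8 servers, up to 128
  • X-LARGE: 16 servers, up to 256
  • 2X-LARGE: 32 servers, up to 512
  • 3X-LARGE: 64 servers, up to 1,024
  • 4X-LARGE: 128 servers, up to 2,048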

Note*: Remember that these are the maximum numbers of parallel processes possible; that does not mean they will be the actual numbers of parallel processes running. The actual number of processes running in parallel depends on various other factors, such as query complexity and the number of threads a query can be divided into.

Concurrency or Parallel processing within a single cluster warehouse

A Snowflake warehouse uses all of the compute resources available within it to process the queries and jobs submitted to it. Since differently sized warehouses have different numbers of CPUs, and therefore threads, available for processing, any warehouse allows a level of concurrency within itself. The number of processes that can run on a warehouse depends on its size (which determines the number of CPUs and threads).

There is a misconception about Snowflake's parallel processing capability in the database user community, especially among those coming from Teradata, Oracle or Redshift. The misconception probably comes from the warehouse-level parameter MAX_CONCURRENCY_LEVEL, which has a default value of 8. But this is not a limit on the number of allowed concurrent processes or queries; it actually caps the share of compute resources each process or query can get. In reality there is no such hard limit on parallelism in Snowflake.

To understand this in detail, we first need to understand the role the warehouse parameter MAX_CONCURRENCY_LEVEL plays in deciding the concurrency of a single-cluster warehouse.

[Table: effect of different MAX_CONCURRENCY_LEVEL values on the compute share per query and on concurrency in a single-cluster warehouse]

Note*: Reducing the value of MAX_CONCURRENCY_LEVEL does not reduce the maximum possible concurrency level.

The above table explains the effect of the MAX_CONCURRENCY_LEVEL parameter on the parallel processing capability of a single cluster of a virtual warehouse.

Please note that this parameter cannot and does not reduce the number of processes that can execute concurrently in the cluster; rather, it sets the maximum amount of compute resources available to each process. Snowflake will always use the free resources in a server to execute processes, irrespective of the value set for MAX_CONCURRENCY_LEVEL.

When MAX_CONCURRENCY_LEVEL is set low, the warehouse allows more compute resources to a single process. When the value is increased, it limits the resources available to each individual process, which in turn makes room for more concurrent queries to run, but each with fewer resources. A short example of inspecting and changing this parameter follows.
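
As a minimal sketch (the warehouse name demo_wh is a hypothetical carried over from the earlier example), the parameter can be inspected and adjusted like this:

```sql
-- Inspect the current value at the warehouse level
SHOW PARAMETERS LIKE 'MAX_CONCURRENCY_LEVEL' IN WAREHOUSE demo_wh;

-- Lower the value so each query can claim a larger share of the cluster
ALTER WAREHOUSE demo_wh SET MAX_CONCURRENCY_LEVEL = 4;

-- Revert to the account default
ALTER WAREHOUSE demo_wh UNSET MAX_CONCURRENCY_LEVEL;
```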

Concurrency or Parallel processing in a multi-cluster warehouse

In the discussion above we have seen how concurrency is handled within a single-cluster warehouse and the effect of the MAX_CONCURRENCY_LEVEL parameter on it. Now we will see how Snowflake handles concurrency in a multi-cluster warehouse.

Multi-cluster warehouses provide the ability to run multiple clusters in the same warehouse, scaling out and back in (adding and removing clusters) based on processing needs. Keep in mind that:

  1. All clusters in a warehouse are of the same size.
  2. The value set for the MAX_CONCURRENCY_LEVEL parameter applies to every cluster in that warehouse.
  3. Snowflake does not spread a single process across clusters. A single process executes in a single cluster only, even if the warehouse is set up as multi-cluster.
  4. The warehouse-level SCALING_POLICY parameter controls when additional clusters are started or shut down (see the example below).
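
A minimal sketch of a multi-cluster warehouse definition, with a hypothetical name and illustrative settings:

```sql
-- Multi-cluster warehouse: 1 to 3 same-size clusters, scaled out as queries queue up
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WITH WAREHOUSE_SIZE    = 'MEDIUM'
       MIN_CLUSTER_COUNT = 1
       MAX_CLUSTER_COUNT = 3
       SCALING_POLICY    = 'STANDARD'  -- favors performance; 'ECONOMY' favors cost
       AUTO_SUSPEND      = 300
       AUTO_RESUME       = TRUE;
```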

Managing the Performance and Concurrency in Snowflake

  1. For a multi-cluster warehouse, the SCALING_POLICY decides when additional clusters spin up. When the value is set to ECONOMY, Snowflake starts additional clusters in a delayed fashion, giving more importance to cost control than to performance. When the value is set to STANDARD, Snowflake favors performance and starts additional clusters as soon as queries begin to queue.
  2. If the MAX_CONCURRENCY_LEVEL value is lower, additional clusters in a multi-cluster warehouse may start sooner.
  3. The STATEMENT_QUEUED_TIMEOUT_IN_SECONDS parameter affects when an additional cluster in a multi-cluster warehouse spawns. The default is 0, which means no timeout. Any non-zero value is the number of seconds a queued query will wait: in a single-cluster warehouse the query is cancelled if it does not get compute resources within that time, while in a multi-cluster warehouse an additional cluster is spawned and compute resources are allocated to the query (a configuration sketch follows this list).
  4. It is very important to use separate warehouses for different types and sizes of processing needs. In particular, if a process is comparatively very complex, deals with a huge volume of data, and consumes a lot of time and compute resources, use a separate, bigger warehouse to handle that process and do not use the same warehouse for anything else.
  5. Consider tweaking the MAX_CONCURRENCY_LEVEL parameter to provide more compute resources to a single process so that it executes faster. Keeping in mind the discussion about “Concurrency within a single cluster warehouse”, below is an example of how a smaller warehouse can deliver performance like a bigger one and in turn reduce cost.
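
A hedged sketch of the settings from points 3 to 5 above; the warehouse names and values are illustrative assumptions, not recommendations:

```sql
-- Queued queries wait at most 10 minutes: cancelled on a single-cluster
-- warehouse, or served by an extra cluster on a multi-cluster warehouse
ALTER WAREHOUSE reporting_wh SET STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 600;

-- Dedicated warehouse for one heavy process: a low concurrency level lets
-- the single complex query claim a larger share of the cluster's resources
CREATE WAREHOUSE IF NOT EXISTS heavy_job_wh
  WITH WAREHOUSE_SIZE        = 'LARGE'
       MAX_CONCURRENCY_LEVEL = 1    -- one query gets the whole cluster
       AUTO_SUSPEND          = 60   -- shut down soon after the job finishes
       AUTO_RESUME           = TRUE;
```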

Below are possible virtual warehouse configurations for a complex query with a target execution time of 30 minutes.

[Table: possible virtual warehouse configurations for the complex query with a 30-minute target]

Note: A warehouse with the above configuration can be designed to spin up and shut down each time, dedicated to that particular process only.

Keep in mind that increasing a warehouse's size may not always improve performance. In general, for a given process, the execution time is halved for each size increase of the virtual warehouse, but this is not true for all processes. Please check the two examples below: in one, performance consistently improves with each increase in warehouse size; in the other, it does not.
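
To see why sizing up can be cost-neutral when it works and wasteful when it does not, recall that each size step doubles the credit rate (for example, a MEDIUM warehouse consumes 4 credits per hour and a LARGE consumes 8). If a job runs in 2 hours on MEDIUM (2 × 4 = 8 credits) and halves to 1 hour on LARGE (1 × 8 = 8 credits), the cost is identical and the result arrives sooner. But if the job still takes 2 hours on LARGE, the cost doubles (2 × 8 = 16 credits) for no benefit.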

a. Example where performance consistently improves with each warehouse size increase:

[Table: example where execution time halves with each warehouse size increase]

b. Example where performance does not improve with further increases in warehouse size; the LARGE warehouse will not improve performance in this case:

[Table: example where execution time stops improving beyond a certain warehouse size]

*Important Notes:

  • The numbers of CPUs and threads are used for discussion purposes and are not disclosed by Snowflake Computing. There is no Snowflake documentation describing the exact type of servers used in the warehouses or the detailed architecture of those servers / instances. Even if the number of CPUs per server or the number of cores per CPU differs from the numbers used in this discussion, the basic concepts behind concurrency and how Snowflake handles it do not change.
  • We may assume that the number of servers (nodes) per cluster will remain the same under Snowflake's current standards, but over time the underlying architecture of the compute clusters will keep changing, from the number of CPUs per node to the amount of RAM and SSD available, as the compute instances offered by the underlying cloud platforms evolve.
  • This article does not intend to override any topics, concepts or approaches documented in the official Snowflake documentation. The Snowflake platform keeps getting better, enriched with new features and updates, and the Snowflake documentation should always be considered the latest and most authoritative source. Please reach out to me if you find any confusion or discrepancy in this article.
  • For more details on Snowflake, please check https://docs.snowflake.com

About the author:

Subhrajit Bandyopadhyay is a data management professional with over 19 years of experience in data management and data governance areas such as data architecture, modeling, metadata management, master data management, and data quality. He has worked for multiple clients as a lead data architect and is experienced in leading and mentoring teams of database architects and data modelers on data warehousing, data analytics and data governance initiatives. He has worked with major database systems including Oracle, Teradata and SQL Server, and the cloud databases Amazon Redshift and Snowflake, and has spent the last 3 years working with Snowflake as an architect, helping clients migrate on-premise data warehousing and analytics systems to the cloud and setting up Snowflake-specific standards and best practices.
