
From Storage-Computing Separation to Serverless: Key Issues for Data Warehouses

wenjun

Jan 1, 2025

Challenges of Serverless Data Warehouses

Under a storage-computing separation architecture, a data warehouse can leverage the elasticity of the cloud platform's IaaS layer: machine resources are requested dynamically to build the compute cluster and released dynamically when there is no load, providing users with a serverless experience. This effectively reduces user costs and improves resource utilization, and users no longer need to care about operating and maintaining the underlying infrastructure, only about data analysis and business logic.

However, implementing a serverless data warehouse poses several challenges:

● Elasticity Granularity: Controlling the granularity of resource scaling finely and responding to load changes quickly are key issues for serverless data warehouses. For example, when facing a sudden traffic peak, how can computing resources be expanded quickly and efficiently to meet demand and avoid performance bottlenecks? Conversely, when the load drops, how can resources be released promptly to avoid waste?

● Cold Start: When a new query arrives and the required computing resources have not yet been started, starting and initializing them takes time, which increases query latency. Reducing cold start time and improving response speed is therefore an important problem for serverless data warehouses; for example, warm-up pools and quick-start techniques can be used to optimize cold start performance.

● Query Performance Optimization: The performance of a serverless data warehouse is affected by many factors, such as the allocation of computing resources, network latency, and data storage access speed. Optimizing these factors to improve query performance is an important challenge. For example, how can caching be used to accelerate data access? How can the query execution plan be optimized to reduce the amount of computation?

● Security: Under a serverless architecture, data security and access control also need special attention. For example, how can we ensure that only authorized users can access data? How can data leaks and malicious attacks be prevented?

Cold Start Issue

Take an architecture that uses cloud IaaS + K8S as the resource base as an example. From requesting machine resources to finally launching the data warehouse service, the process goes through the following call stack: IaaS -> K8S -> pod (disk/network mounting) -> process -> thread -> service initialization logic. From initiating a request to the IaaS layer to the completion of pod startup often takes several minutes. This multi-minute provisioning delay is often the first stumbling block a data warehouse hits in moving from storage-computing separation to serverless. It is also why many data warehouses initially claimed "serverless capabilities" but in essence only delivered an auto-suspend, auto-resume experience.

For a serverless data warehouse, the acceptable resource provisioning delay needs to be on the order of hundreds of milliseconds, or even milliseconds. In the foreseeable future, the IaaS layer + K8S cannot deliver this level of latency. Moreover, from the perspective of architectural layering, the traffic of the data warehouse layer should not penetrate proportionally down to the IaaS layer + K8S; such amplification would also be a great challenge to the stability of resource provisioning.

To meet the low-latency demands of resource allocation, a warm-up pool is typically employed: machine resources are requested from the IaaS layer in advance, and new resource requests are served from the warm-up pool first, bypassing the IaaS + K8S provisioning chain. Meanwhile, the warm-up pool's capacity is managed asynchronously in the background: excess machines are released back to the IaaS layer, and when the pool runs low, additional machines are requested from the IaaS layer to replenish it.

Of course, quick-start optimizations at the IaaS and K8S layers are also necessary: when the warm-up pool's capacity runs low, they speed up replenishing the pool back to its target watermark.
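As a minimal sketch of this mechanism, consider the Python code below. All names here (`WarmPool`, `iaas_provision`, `iaas_release`, the watermark values) are hypothetical; a production pool would also handle health checks, availability zones, and instance types.

```python
import threading
import time
from collections import deque

def iaas_provision():
    """Hypothetical call into the IaaS layer; takes minutes in practice."""
    time.sleep(0.1)  # stand-in for slow provisioning
    return object()  # an opaque machine handle

def iaas_release(machine):
    """Hypothetical call that returns a machine to the IaaS layer."""
    pass

class WarmPool:
    """Keeps pre-provisioned machines between a low and a high watermark."""

    def __init__(self, low=4, high=16):
        self.low, self.high = low, high
        self.machines = deque()
        self.lock = threading.Lock()
        threading.Thread(target=self._reconcile, daemon=True).start()

    def acquire(self):
        """Serve a request from the pool (ms) instead of the IaaS path (minutes)."""
        with self.lock:
            if self.machines:
                return self.machines.popleft()
        return iaas_provision()  # pool empty: fall back to the slow path

    def release(self, machine):
        with self.lock:
            self.machines.append(machine)

    def _reconcile(self):
        """Background loop: top up below the low watermark, shed above the high."""
        while True:
            with self.lock:
                n = len(self.machines)
            if n < self.low:
                self.release(iaas_provision())
            elif n > self.high:
                with self.lock:
                    extra = self.machines.pop() if self.machines else None
                if extra is not None:
                    iaas_release(extra)
            time.sleep(1)
```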

Elasticity Granularity

Once the resource provisioning delay is solved, serverless data warehouses face another serious problem: how to control the granularity of elastic scaling effectively, so as to optimize both resource utilization and performance.

The query performance of a data warehouse does not necessarily scale linearly with resources; it is determined by factors such as data parallelism, query load characteristics, and cluster scale. Take small-data-volume queries as an example: the query itself does not process much data, so as the cluster grows, the relative overhead of task distribution rises and query performance can actually decrease.
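This effect can be illustrated with a back-of-the-envelope model. The formula below is not from the original article; it is a standard Amdahl-style sketch in which a fraction p of the work parallelizes and each added node contributes a fixed coordination cost c:

```python
# Toy model: speedup of a query on n nodes when a fraction p of the work
# parallelizes and each node adds a fixed coordination overhead c.
def speedup(n, p=0.95, c=0.01):
    return 1.0 / ((1 - p) + p / n + c * n)

for n in (1, 2, 4, 8, 16, 32, 64):
    print(n, round(speedup(n), 2))
# For a small query (low p, noticeable c), speedup peaks at a modest
# cluster size and then declines as task-distribution overhead dominates.
```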

It is precisely because of this uncertainty between scale and performance that Snowflake's documentation recommends using multi-cluster warehouses to improve query throughput:

Multi-cluster warehouses are best utilized for scaling resources to improve concurrency for users/queries. They are not as beneficial for improving the performance of slow-running queries or data loading. For these types of operations, resizing the warehouse provides more benefits.

Additionally, Snowflake suggests resizing a warehouse to improve the performance of slow queries:

Resizing a warehouse to a larger size is useful when the operations being performed by the warehouse will benefit from more compute resources, including:

● Improving the performance of large, complex queries against large data sets.

● Improving performance while loading and unloading significant amounts of data.
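The two recommendations condense into a simple routing rule: scale out (add clusters) when the problem is concurrency, scale up (resize) when the problem is a single slow query. A hypothetical sketch of such a policy, with made-up thresholds:

```python
# Hypothetical autoscaling policy distilled from the Snowflake guidance above:
# queuing -> add a cluster (scale out); heavy single queries -> resize (scale up).
def scaling_action(queued_queries: int, avg_bytes_scanned: int,
                   clusters: int, max_clusters: int = 10) -> str:
    HEAVY_QUERY_BYTES = 100 * 2**30  # illustrative threshold: 100 GiB
    if queued_queries > 0 and clusters < max_clusters:
        return "add_cluster"     # concurrency problem: more clusters
    if avg_bytes_scanned > HEAVY_QUERY_BYTES:
        return "resize_up"       # slow, large query: a bigger warehouse
    if queued_queries == 0 and clusters > 1:
        return "remove_cluster"  # load dropped: shed a cluster
    return "no_op"
```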

The Amazon Redshift Serverless team, meanwhile, has introduced AI-driven scaling and optimization to address its auto-scaling issues, hoping to achieve a better balance between performance and cost.

Query Performance Optimization

Compared with traditional data warehouses, serverless data warehouses are at an inherent performance disadvantage: all data must be loaded remotely from object storage. The scan operator therefore often becomes the biggest bottleneck in a query, with most of its time spent on remote object storage access.

● Local Cache: A local cache seems a good way to address this problem. By caching object storage files locally, and using consistent hashing to avoid redundant caching of the same files across nodes, the cache hit rate can be improved effectively (see the sketch after this list).

● Remote Cache: Although a local cache provides the best performance, if the serverless data warehouse is released frequently, the local cache risks being lost; Snowflake, for example, recommends setting different auto-suspension times for different scenarios. An efficient remote cache service that decouples the cache from the data warehouse's compute nodes is therefore a good choice, though it comes with additional storage cost.
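As referenced above, here is a minimal consistent-hashing sketch: each cache node is hashed onto a ring (with virtual nodes for balance), and a file is cached on the node that owns its position on the ring, so different nodes do not redundantly cache the same file. This is an illustrative implementation, not any particular vendor's.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent hashing: maps each cached file to exactly one cache node."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node appears vnodes times on the ring for balance.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def node_for(self, file_path: str) -> str:
        """The first node clockwise from the file's position on the ring."""
        i = bisect.bisect(self._keys, _hash(file_path)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["cache-node-1", "cache-node-2", "cache-node-3"])
print(ring.node_for("s3://bucket/warehouse/part-00001.parquet"))
```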

About the cache and auto-suspension (from the Snowflake documentation):

The auto-suspend setting of the warehouse can have a direct impact on query performance because the cache is dropped when the warehouse is suspended. If a warehouse is running frequent and similar queries, it might not make sense to suspend the warehouse in between queries because the cache might be dropped before the next query is executed.

You can use the following general guidelines when setting the auto-suspension time limit:

● For tasks, Snowflake recommends immediate suspension.

● For DevOps, DataOps, and Data Science use cases, Snowflake recommends setting auto-suspension to approximately 5 minutes because the cache is not as important for ad-hoc and unique queries.

● For query warehouses, for example, BI and SELECT use cases, Snowflake recommends setting auto-suspend to at least 10 minutes to maintain the cache for users.
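These guidelines translate directly into warehouse settings. Below is a sketch using the snowflake-connector-python package; the warehouse names, connection parameters, and the 60-second stand-in for "immediate" suspension are placeholders.

```python
import snowflake.connector

# Illustrative mapping of the guidelines above: warehouse -> AUTO_SUSPEND seconds.
AUTO_SUSPEND_SECONDS = {
    "TASK_WH": 60,      # tasks: suspend as soon as possible
    "DATAOPS_WH": 300,  # DevOps/DataOps/Data Science: ~5 minutes
    "BI_WH": 600,       # BI / SELECT workloads: at least 10 minutes
}

conn = snowflake.connector.connect(
    user="<user>", password="<password>", account="<account>"  # placeholders
)
cur = conn.cursor()
for warehouse, seconds in AUTO_SUSPEND_SECONDS.items():
    cur.execute(f"ALTER WAREHOUSE {warehouse} SET AUTO_SUSPEND = {seconds}")
```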

Cost Optimization

While guaranteeing the performance of machine-resource elasticity, how well the warm-up pool's capacity is managed directly affects the data warehouse vendor's cost and profit margin. If the warm-up pool holds a large number of idle machines that go unused, that cost is borne by the data warehouse vendor; if the pool's idle watermark is kept low, the holding cost shifts to the cloud provider. Managing warm-up pool capacity well therefore matters both for the serverless experience and for the vendor's cost.

Cloud providers now all offer spot instances, often at only 1/10 the price of regular instances, but with relatively short and uncertain availability. Using spot instances at scale in the warm-up pool is a good choice: it preserves the performance of machine-resource elasticity without significantly increasing the holding cost.
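A rough cost comparison shows why this is attractive. The numbers below are illustrative, not real price quotes:

```python
# Illustrative holding cost of a 100-machine warm pool over one hour.
ON_DEMAND_PRICE = 1.00  # $/machine-hour (made-up number)
SPOT_PRICE = ON_DEMAND_PRICE / 10  # spot at ~1/10 the on-demand price

def pool_cost(size=100, spot_ratio=0.9):
    spot = int(size * spot_ratio)
    return spot * SPOT_PRICE + (size - spot) * ON_DEMAND_PRICE

print(pool_cost(spot_ratio=0.0))  # all on-demand: $100/hour
print(pool_cost(spot_ratio=0.9))  # 90% spot:      $19/hour
```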

Applying spot instances well places a relatively high demand on the data warehouse kernel: it must absorb the impact of spot-instance reclamation on query latency and success rate. Databricks, for example, thanks to the fault tolerance of Spark's stage-by-stage execution, supports building instance pools from spot instances.
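The kernel-side requirement can be illustrated abstractly: when a spot machine is reclaimed mid-query, only the affected stage's lost tasks should be re-run on surviving machines rather than failing the whole query. A toy sketch of this retry idea (not Databricks' or Spark's actual implementation):

```python
class SpotReclaimed(Exception):
    """Raised when the machine running a task is reclaimed mid-flight."""

def run_stage(tasks, execute, max_attempts=3):
    """Run one stage's tasks; re-run only tasks lost to spot reclamation."""
    pending, results = list(tasks), {}
    for attempt in range(max_attempts):
        failed = []
        for task in pending:
            try:
                results[task] = execute(task)  # runs on some (possibly spot) node
            except SpotReclaimed:
                failed.append(task)            # reschedule on a surviving node
        if not failed:
            return results
        pending = failed
    raise RuntimeError(f"stage failed after {max_attempts} attempts: {pending}")
```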