
Decoupled Serverless Architecture with Four Layers: Technical Challenges and Key Designs for Metadata and Transaction Systems

yuanfei

Jan 26, 2025

Introduction

Since Snowflake introduced this architecture, the company's market value has at its peak approached $100 billion. Snowflake's three-layer decoupled architecture for compute, storage, and metadata has inspired the architectural designs of many vendors, as the following examples show:

• Hashdata's three-layer architecture

• GBase's architecture

• The open-sourced Doris 2.x architecture

The core design principles of this architecture are as follows:

Stateful modules, such as metadata, transaction management, lock management, and even execution plan caching and query cache, are placed in the metadata layer. Compute and storage are made stateless to facilitate elastic scaling.

As stateful services are not well-suited for serverless implementation, the metadata layer is designed as a multi-tenant shared service to reduce costs in large-scale scenarios.

While many vendors have not yet clearly articulated the challenges and technical details faced after decoupling the metadata layer, Relyt AI-ready Data Cloud, as a cloud-native data warehouse with a four-layer decoupled architecture (metadata, compute, cache, and storage), has accumulated rich experience through years of technical exploration and extensive customer practice. In this issue, we invite Yuan Fei, the Evangelist of Relyt's AI-ready Data Cloud team and a senior expert in the cloud-native field, to share insights on the topic of "Challenges and Key Technologies of Metadata and Transaction Systems in a Three-Layer Decoupled Architecture."

Given this context, how should we address the technical challenges after decoupling the metadata layer?

Consistent Experience with PostgreSQL Ecosystem: We strive to make our cloud-native data warehouse as user-friendly as a single-node PostgreSQL. While open-source users may not demand high transaction capabilities, closed-source products typically offer standard ACID transaction capabilities, which have become a key competitive advantage for enhancing user experience.

Solutions for Performance Differences between Remote and Local Storage: We focus on bridging the performance gap between remote and local storage to ensure that users' DDL operations and read/write processes achieve the same latency as using a local database. In particular, we have optimized for high throughput and low latency in TPC-H scenarios.

Challenges in Metadata Management: Traditional big data metadata management is relatively simple because it mainly deals with immutable large files. We explore how the metadata layer can handle real-time small-batch writes in a real-time data warehouse, frequent update/delete operations, and high-QPS point-query serving scenarios.

Future Evolution: We discuss how to evolve towards a lakehouse architecture, achieve open data sharing, and meet the challenges of AI-Ready.

As a core component of the data warehouse, the metadata layer will be analyzed in depth from the practical experience of Relyt AI-ready Data Cloud, focusing on key designs and technical challenges.

Architecture Design

Metadata Service Layer: This consists of two layers: a stateless service layer and a scalable distributed storage layer.

Stateless Service Layer: This layer is responsible for the abstraction of tables, columns, files, and statistics. It supports MVCC capabilities and provides a consistent metadata view. It encapsulates KV structures into table APIs, facilitating access by multiple DPS or application services. DPS refers to a set of compute resources in Relyt, similar to Snowflake's virtual warehouse.

Scalable Distributed Storage Layer: This layer uses FoundationDB (FDB) as the persistent storage, offering secure, reliable, and scalable foundational storage capabilities.

The benefits of this design are:

Global Consistency: Unified metadata across multiple DPS, with real-time visibility for update/delete operations across DPS.

Extreme Elasticity: Stateless service nodes can dynamically scale, with performance scaling linearly with the number of nodes.

Security and Reliability: FDB supports 3AZ deployment, real-time backup, and recovery.

Expanding further on the stateless service layer, there are several key services:

Meta Service: This service is responsible for the abstraction of tables, columns, files, and statistics. It supports MVCC capabilities and provides a consistent metadata view. It encapsulates KV structures into table APIs, facilitating access by multiple DPS or application services.

Coordinator: This service provides access based on PostgreSQL, generates execution plans, and distributes them.

Master: This service is responsible for global transaction management, transaction ID allocation, active transaction management, 2PC implementation, and metadata and lock management.

FDB: This is an ordered key-value store using range partitioning. To achieve resource isolation for multi-tenancy, we prefix every key with the user's instance ID, so each tenant's data lives under its own instance ID (see the key-encoding sketch after this list).

Table Meta: This contains table metadata information, including table descriptions and permissions, similar to PostgreSQL's pg_class.

Column Meta: This contains column information, similar to PostgreSQL's pg_attribute.

File Meta: This contains file information. We also encode file filtering and shard information into the keys to enable efficient filtering based on key prefixes.

Delete Bitmap: This marks deleted rows in a file, using one bit per row to indicate deletion. In our columnar storage implementation, updates and deletes use a mark-for-deletion approach similar to Greenplum's.

Statistic: This provides statistical information for the optimizer.

Query Cache: Following Snowflake's design, we store the query cache in FDB to enable efficient cross-instance and cross-user querying.
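To make the key layout above concrete, here is a minimal Python sketch of how such a multi-tenant key space could be encoded with FoundationDB's tuple layer. The namespace names, value encoding, and function names are illustrative assumptions, not Relyt's actual schema.

# Illustrative sketch only: key names, layout, and value encoding are assumptions.
import json
import fdb

fdb.api_version(630)
db = fdb.open()

def meta_key(instance_id, namespace, *parts):
    # Prefix every key with the tenant's instance ID, so each tenant's metadata
    # occupies its own contiguous key range and range scans never cross tenants.
    return fdb.tuple.pack((instance_id, namespace) + parts)

@fdb.transactional
def put_table_meta(tr, instance_id, table_oid, table_desc):
    # Analogue of the "Table Meta" key space (similar to pg_class).
    tr[meta_key(instance_id, "table_meta", table_oid)] = json.dumps(table_desc).encode()

@fdb.transactional
def list_files(tr, instance_id, table_oid, shard_id):
    # Analogue of the "File Meta" key space: shard information is encoded into the
    # key, so a prefix range scan returns only the files of one shard.
    prefix = fdb.tuple.pack((instance_id, "file_meta", table_oid, shard_id))
    return [json.loads(kv.value) for kv in tr.get_range_startswith(prefix)]

Because the keys are ordered, the metadata of one table or one shard can be fetched with a single range read, which is what the table APIs exposed by the Meta Service build on.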

The benefits of this design are:

Transaction Compatibility: Supports MVCC and concurrent writes.

PostgreSQL Compatibility: The Coordinator is implemented as a PostgreSQL plugin, ensuring compatibility with PostgreSQL.

Master: Transaction management is offloaded from the Coordinator to the Master, improving transaction throughput to support high-TPS real-time writes and high-QPS serving scenarios.

Key Technologies

Key Read/Write Processes

1. The client sends a begin command to the Master to start a transaction, initiating the transaction state machine.

2. The Master sends the SQL and transaction information to the Coordinator.

3. The Coordinator reads metadata information from FDB.

4. The Coordinator parses the SQL to generate an execution plan.

5. The Coordinator sends the execution plan to the Worker.

6. The Worker reads files from object storage and visimaps from FDB based on file information and executes the plan.

7. The Coordinator returns the execution results to the Master.

8. The Master performs the transaction commit, first checking for write-write conflicts before committing the FDB metadata. Here, we divide the FDB metadata into a temp space and a work space: the writes produced in step 6 go into the transaction's own temp space, and during commit the data is "moved" into the work space, with atomicity guaranteed by FDB transactions (see the sketch after this list).

9. The execution results are returned to the client.
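As a rough illustration of the commit in step 8, the sketch below moves a transaction's metadata writes from its temp space into the shared work space inside a single FoundationDB transaction, so the move is atomic. The key layout and the conflict-check hook are assumptions for illustration only.

# Illustrative sketch: the temp-space/work-space key layout is an assumption.
import fdb

fdb.api_version(630)

@fdb.transactional
def commit_move(tr, instance_id, txid, check_ww_conflict):
    # Abort the commit if the caller-supplied write-write conflict check fails.
    if not check_ww_conflict(tr, instance_id, txid):
        raise RuntimeError("write-write conflict detected")
    tmp_range = fdb.tuple.range((instance_id, "tmp", txid))
    for kv in tr.get_range(tmp_range.start, tmp_range.stop):
        # Re-home each key from the transaction's temp space to the work space.
        # Everything happens inside one FDB transaction, so the move is atomic.
        _, _, _, *suffix = fdb.tuple.unpack(kv.key)
        tr[fdb.tuple.pack((instance_id, "work") + tuple(suffix))] = kv.value

In this sketch the Master would call commit_move with a database handle after receiving the execution results and before replying to the client; the temp-space keys are intentionally left in place, matching the GC behavior described later.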

Key Design

Unlike most data warehouses, we do not roll back the entire transaction upon encountering write-write conflicts. This is because our system heavily relies on the background vacuum process to merge and reorder data to improve query performance. If the vacuum process conflicts with user updates, rolling back would degrade user experience. Additionally, both vacuum and update operations are resource-intensive, leading to unnecessary work. Instead, we use write compensation by replaying user delete operations on the merged files after vacuuming, achieving "write-write concurrency."
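The following is a minimal, self-contained sketch of that write-compensation idea in plain Python; the data structures and names (the row mapping produced by vacuum, the delete bitmaps) are hypothetical simplifications, not Relyt's internal structures.

# Conceptual sketch of "write compensation"; all names here are hypothetical.
def replay_deletes(user_deletes, vacuum_row_mapping, delete_bitmaps):
    # user_deletes: {old_file_id: set of row numbers deleted by the user transaction}
    # vacuum_row_mapping: {old_file_id: [(new_file_id, new_row_no), ...]} indexed by old row number
    # delete_bitmaps: {file_id: set of deleted row numbers} (the committed Delete Bitmap)
    for old_file, rows in user_deletes.items():
        if old_file not in vacuum_row_mapping:
            # File untouched by vacuum: apply the delete directly.
            delete_bitmaps.setdefault(old_file, set()).update(rows)
            continue
        # File was merged/reordered by vacuum: instead of rolling the transaction
        # back, translate each deleted row to its position in the merged file.
        row_map = vacuum_row_mapping[old_file]
        for row in rows:
            new_file, new_row = row_map[row]
            delete_bitmaps.setdefault(new_file, set()).add(new_row)
    return delete_bitmaps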

Insert Sequence Diagram

• Lock table t: This is the table-level lock in PostgreSQL, where insert/update/delete operations acquire a level-2 lock on the data table.

• Unlock table t: This corresponds to releasing the level-2 lock on the data table.

• Atomicity Guarantee: This is achieved through 2PC.

Delete Sequence Diagram

The core challenge here is resolving write-write conflicts.

Write-Write Conflict Detection

  • Lock file_visibility table: At the start of write-write conflict detection, we take a level-4 lock on the auxiliary table of the modified table (not on the data table itself), which prevents other transactions from running conflict detection on the same table at the same time. While this lock is held, other transactions may only select from the auxiliary table; their updates/deletes/inserts on it are blocked until the transaction ends and the lock is released. Because it is only a level-4 lock on the auxiliary table, it does not block other transactions' update/delete/insert operations on the data table (which, unlike the traditional PostgreSQL approach, acquire only level-1 locks rather than level-2 locks, since they do not modify work-space data); it blocks only their prepare/commit and conflict-detection operations. A conceptual sketch follows below.
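As a purely conceptual sketch (the real protocol runs inside the Master, and every name here is hypothetical), conflict detection under the auxiliary-table lock can be thought of as checking whether a concurrently committed transaction, such as a background vacuum, rewrote any file whose delete bitmap this transaction modified; conflicting files are then handed to write compensation instead of forcing a rollback.

# Hypothetical sketch of commit-time write-write conflict detection.
def detect_ww_conflicts(touched_files, snapshot_versions, committed_versions, aux_table_lock):
    # touched_files: files whose delete bitmaps this transaction modified
    # snapshot_versions: file version this transaction saw when it started
    # committed_versions: file version currently committed in the work space
    with aux_table_lock:  # stand-in for the level-4 lock; in the real protocol it is held until the transaction ends
        conflicting = {f for f in touched_files
                       if committed_versions.get(f) != snapshot_versions.get(f)}
    # Files rewritten by a concurrent commit (e.g. vacuum) are not rolled back;
    # they are handed to write compensation, which replays this transaction's
    # deletes onto the merged files.
    return conflicting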

MVCC Implementation

We essentially adopt PostgreSQL's MVCC algorithm, with adjustments based on the aforementioned data structure. We implement a visibility judgment algorithm based on KV:

          /* Visibility judgment for data in tmp space (own transaction) */

          /* t_xmin status = ABORTED (for sub-transactions) */
Rule 1:   IF t_xmin status is 'ABORTED' THEN
              RETURN 'Invisible'
          END IF

          /* t_xmin status = IN_PROGRESS */
          IF t_xmin status is 'IN_PROGRESS' THEN
              IF t_xmin = current_txid THEN
Rule 2:           IF t_xmax = INVALID THEN
                      RETURN 'Visible'
Rule 3:           ELSE  /* this tuple has been deleted or updated by the current transaction itself */
                      RETURN 'Invisible'
                  END IF
Rule 4:       ELSE  /* t_xmin ≠ current_txid */
                  RETURN 'Invisible'
              END IF
          END IF

          /* Visibility judgment for data in work space */

          /* t_xmin status = COMMITTED */
          IF t_xmin status is 'COMMITTED' THEN
Rule 5:       IF t_xmin is active in the obtained transaction snapshot THEN
                  RETURN 'Invisible'
Rule 6:       ELSE IF t_xmax = INVALID OR status of t_xmax is 'ABORTED' THEN
                  RETURN 'Visible'
              ELSE IF t_xmax status is 'IN_PROGRESS' THEN
Rule 7:           IF t_xmax = current_txid THEN
                      RETURN 'Invisible'
Rule 8:           ELSE  /* t_xmax ≠ current_txid */
                      RETURN 'Visible'
                  END IF
              ELSE IF t_xmax status is 'COMMITTED' THEN
Rule 9:           IF t_xmax is active in the obtained transaction snapshot THEN
                      RETURN 'Visible'
Rule 10:          ELSE
                      RETURN 'Invisible'
                  END IF
              END IF
          END IF

Although the implementation is somewhat complex, the advantage is that it maintains consistent behavior with PostgreSQL, reducing the learning curve for Relyt users.

Time Travel Implementation

Based on the above MVCC algorithm, Time Travel is straightforward to implement. Time Travel essentially means reading a snapshot of the metadata as of a historical point in time: by simply substituting a historical transaction snapshot for the read and applying the MVCC algorithm to the data's t_xmin and t_xmax, we obtain the version of the metadata required for Time Travel.
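A minimal sketch of that idea, assuming a simplified work-space visibility rule and a snapshot structure that records which transactions were still in progress at the chosen point in time; all types and names are illustrative, not Relyt's implementation.

# Illustrative only: time travel is an ordinary MVCC read with an old snapshot.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Snapshot:
    xmax: int            # first transaction ID not yet assigned at snapshot time
    active: frozenset    # transaction IDs still in progress at snapshot time

@dataclass(frozen=True)
class MetaEntry:
    key: bytes
    t_xmin: int
    t_xmax: Optional[int]

def visible(entry: MetaEntry, snap: Snapshot) -> bool:
    # Simplified work-space rule: the inserting transaction must be visible in the
    # snapshot, and the deleting transaction (if any) must not be.
    ins_visible = entry.t_xmin < snap.xmax and entry.t_xmin not in snap.active
    del_visible = (entry.t_xmax is not None
                   and entry.t_xmax < snap.xmax
                   and entry.t_xmax not in snap.active)
    return ins_visible and not del_visible

def time_travel_read(entries, historical_snapshot):
    # Reading metadata "as of" a past time just means using a snapshot captured then.
    return [e for e in entries if visible(e, historical_snapshot)]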

Extreme Performance Optimization

GMeta Cache Implementation

One advantage of using FDB as the metadata storage is its large-capacity KV storage and automatic scaling. However, large-scale metadata scans, especially visimap scans, still put significant load on it. For example, querying 80,000 segfiles takes 400-500 ms, with a total data volume of around 23 MB for these segfiles. Since a single FDB interaction can only retrieve about 200 KB of data, it takes over 100 round trips to fetch everything, resulting in high network overhead. If we could fetch the 23 MB in one go, the time would be around 20 ms based on network bandwidth alone. To address this, we have built a metadata caching service in front of FDB, called the gmeta service, which maintains a local copy of the data it manages in RocksDB. The process is as follows:

Write Process:

1. Maintain the existing logic to write segfile and visimap data into tmp space.

2. During the move stage, transfer the data from tmp space to work space (without cleaning up the data under tmp space).

3. At the end of the move stage, set the txn info of tmp space to Committed status and write a commit message for this gxid to the transaction commit queue. The transaction commit queue is a separate key range allocated in FDB to store the commit status of transactions.

Read Process:

1. During the initial startup phase (or out-of-sync phase):

a. The service retrieves the last record from the current transaction commit queue, noting its distributed transaction ID (gxid).

b. (Cache initialization) The service scans all keys in the workspace and writes them into the local RocksDB.

c. To enhance availability, the service can fall back to directly accessing FDB if the cache has not been built yet.

2. The client initiates a query request, which includes two pieces of information: the transaction information and the zdbrelnode+seqno.

a. If the client has no local cache, the ZdbRelNode+seqno information will be empty.

3. The service receives the request, computes the delta between its current seqno and the seqno supplied by the client, and returns this delta to the client (see the sketch after this list).

4. The client caches the latest seqno and delta information in memory for use in subsequent queries.

a. Note that for sortkey queries, if local caching is used, there is no need to push the sortkey down to the gmeta service; instead, the full data can be pulled locally and then filtered.
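Below is a rough sketch of the client-side incremental sync described above. The RPC shape, the seqno semantics, and the cache structure are assumptions for illustration, not the actual gmeta protocol (the real client caches in memory and can fall back to FDB directly).

# Hypothetical sketch of the gmeta client cache; not the real protocol.
class GMetaClientCache:
    def __init__(self, gmeta_rpc):
        # gmeta_rpc(rel_node, txn_info, from_seqno) -> (latest_seqno, delta_entries)
        self.gmeta_rpc = gmeta_rpc
        self.seqno = {}     # rel_node -> last seqno applied locally
        self.entries = {}   # rel_node -> {key: metadata value}

    def query(self, rel_node, txn_info, sortkey_filter=None):
        # Send the last known seqno (None on a cold cache) and apply only the delta.
        latest_seqno, delta = self.gmeta_rpc(rel_node, txn_info, self.seqno.get(rel_node))
        cache = self.entries.setdefault(rel_node, {})
        for key, value in delta:
            if value is None:
                cache.pop(key, None)   # a deletion in the delta
            else:
                cache[key] = value
        self.seqno[rel_node] = latest_seqno
        # Sortkey filtering runs locally against the cached full copy, so the
        # filter does not need to be pushed down to the gmeta service.
        rows = cache.values()
        return [r for r in rows if sortkey_filter(r)] if sortkey_filter else list(rows)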

GC Process:

1. In the existing GC logic, add cleanup logic for each relation's transaction commit queue and also clean up the corresponding tmp space for that transaction.

File Trimming:

Data trimming is an important optimization in OLAP-type databases. To further optimize performance, we encode the sortkey of the files into the filenames. During file retrieval, we can filter directly based on the filename, reducing the number of files that need to be returned and also eliminating the overhead of opening files and then trimming based on file meta information.
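A toy illustration of the idea, with a made-up filename scheme (the real encoding is not described here): the file's sortkey range is embedded in its name, so candidate files can be pruned without being opened.

# Toy example only; the actual filename encoding is an assumption.
def make_filename(table_id, file_id, sortkey_min, sortkey_max):
    # Embed the sortkey range in the name, zero-padded so lexicographic order
    # matches numeric order.
    return f"{table_id}/{sortkey_min:012d}-{sortkey_max:012d}-{file_id}.dat"

def prune_files(filenames, query_lo, query_hi):
    # Keep only files whose embedded sortkey range overlaps the query range;
    # no file has to be opened or its metadata read to make this decision.
    kept = []
    for name in filenames:
        rng = name.rsplit("/", 1)[-1]
        lo, hi = int(rng[:12]), int(rng[13:25])
        if hi >= query_lo and lo <= query_hi:
            kept.append(name)
    return kept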

Reducing Visimap Requests:

In an OLAP system, data is rarely modified after being written, so fetching the visimap for every file from GMeta is wasteful, especially in high-concurrency ad-hoc queries, where frequent access to GMeta can easily become a bottleneck. When returning files to the worker, we include a flag indicating whether the file has a visimap; if it does not, the visimap request is skipped entirely.

After these optimizations, in a 30TB TPC-H test scenario:

1. Fileinfo queries, backed by the gmeta service cache solution, complete within 500 ms.

2. Visimap queries initially took 600-700 ms in total; the gmeta client cache solution described above reduced this while keeping the network traffic of a single query to around 1 GB.

Real-Time Processing

Traditional big data sources are mostly logs, and data from TP databases is still synchronized with a T+1 full-load approach, which cannot meet the needs of real-time dashboards and real-time operational analysis. We have implemented a Flink-CDC plugin that supports real-time data writes with minute-level latency. Millisecond-level real-time writes and primary-key deduplication will be covered in detail in the next article.

Lakehouse Integration

Compared to data lake solutions, our current metadata is well optimized for large-scale datasets but lacks openness. In the future, we will implement a Spark connector to open our metadata and data directly to the Spark compute engine.

Summary

From the very beginning, traditional big data has been designed with separated storage and compute, but compared to databases, big data ecosystems' separation of storage and compute has several shortcomings in usage scenarios: it cannot well support concurrent CRUD operations; it provides a poor user experience with no transaction guarantees and requires users to handle failover themselves; and it necessitates the combination of multiple products and technical architectures, increasing complexity and operational costs. Compared to other data warehouse products, Relyt supports "write-write concurrency," allowing users' update/delete operations to run in parallel with the background vacuum process. It offers complete ACID transaction capabilities and extreme performance, providing customers with a better choice for integrated online and offline data warehouse scenarios.