yuanfei
Mar 18, 2025

During a visit to a top game company, we learned that their data warehouse (DW) platform is run by fewer than five people, half of whom graduated less than three years ago. With such limited staffing and experience, they are responsible for the company's PB-scale data. The client wants to use the data warehouse as a "larger version of a database" and complained that no real-time data warehouse in the industry can properly handle the following two typical analytical scenarios:
Scenario 1: Real-time Analysis of Order Tables.
Real-time ingestion and querying of a trillion-record order table, with 30% updates and deletes on the data from the past 7 days.
Scenario 2: Real-time Ingestion and Analysis of Data from Hundreds of RDS Instances at the Source.
The data needs to be synchronized in real-time to the target data warehouse.
The challenges these scenarios pose to traditional data warehouses are as follows:
Write Operations with Frequent Updates: The order table involves not only inserts but also status updates for each order record, such as from creation -> payment -> shipping -> delivery confirmation. This means each record may be updated five times.
High-Concurrency Writes: The source has hundreds or even thousands of RDS instances, resulting in hundreds or thousands of concurrent writes.
Primary Key Deduplication: The order table uses order ID as the primary key.
ACID Compliance for Writes: Data written by users should be immediately queryable, and data should not be lost even in the event of node failures; otherwise, it would lead to incorrect analytical results.
High-Performance Columnar Storage and Vectorized Execution Engine: Capable of delivering top-tier real-time analytical performance in the industry.
Cost-Effective Real-time Ingestion: Resources should be used on-demand without over-provisioning for peak business traffic.
Large companies generally address these requirements with a Lambda architecture or by relaxing data freshness. For small and medium-sized vendors, however, the need to make data analysis easier to use, cheaper, and more timely prompted us to rethink the data warehouse and ask how to build one that is as easy to use as a single-machine database. Before diving into the detailed design of Relyt AI-ready Data Cloud by Zhibian Technology, let's look at some existing solutions in the industry.
Industry Similar Solutions
Data Lake Solutions
Merge on Read:
When CDC (Change Data Capture) data is obtained, it is sorted and directly written into new files without checking for duplicate keys. During read operations, data is merged based on key values, combining multiple versions of data and returning only the latest version. This method is inspired by the widely used LSM (Log-Structured Merge) tree data structure in storage engines.
Hudi's MOR (Merge on Read) table and StarRocks' Unique Key table are implemented using this approach.
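To make the trade-off concrete, here is a minimal Python sketch of the merge-on-read idea, assuming versioned key-value records: writes only sort and append a new run, and reads merge all runs and keep the newest version of each key. The names (`Run`, `write_batch`, `read_latest`) are illustrative and not taken from Hudi or StarRocks.

```python
# Minimal merge-on-read sketch: append-only sorted runs, dedup at read time.
import heapq
from typing import Dict, Iterator, List, Tuple

Run = List[Tuple[str, int, dict]]   # (key, version, payload), sorted by key
runs: List[Run] = []                # each run models one immutable data file

def write_batch(batch: Dict[str, dict], version: int) -> None:
    """Sort the incoming CDC batch and append it as a new run; no key lookup is done."""
    runs.append(sorted((key, version, payload) for key, payload in batch.items()))

def read_latest() -> Iterator[Tuple[str, dict]]:
    """Merge all runs by key and emit only the newest version of each key."""
    merged = heapq.merge(*runs, key=lambda r: (r[0], -r[1]))  # key ascending, version descending
    last_key = object()
    for key, _version, payload in merged:
        if key != last_key:          # the first record seen for a key is its newest version
            last_key = key
            yield key, payload
```

Writes are cheap because no existing data is touched; the cost is paid at read time by the merge.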
Copy on Write:
After obtaining CDC data, a full join is performed between the new data and the existing records to check for conflicts (i.e., records with the same key value). For conflicting files, a new file containing the updated data is written.
Reading data is straightforward, as the latest data file is directly read without any merging or additional operations, resulting in optimal query performance. However, the write cost is high, making it suitable for T+1 scenarios with infrequent updates, not real-time updates.
Delta Lake, Hudi's COW (Copy on Write) table, Iceberg, and commercial solutions like Snowflake use this method to achieve data updates.
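For contrast, here is a copy-on-write sketch in the same style: applying a batch rewrites every file that contains a conflicting key, so reads need no merging at all. The structures are illustrative, not actual Delta Lake, Hudi, or Iceberg internals.

```python
# Minimal copy-on-write sketch: rewrite conflicting files on write, read without merging.
from typing import Dict, List

files: List[Dict[str, dict]] = []   # each entry models one immutable data file

def apply_batch(batch: Dict[str, dict]) -> None:
    remaining = dict(batch)
    for i, f in enumerate(files):
        conflicts = remaining.keys() & f.keys()
        if conflicts:                        # rewrite the whole file with the updated records
            new_file = dict(f)
            for key in conflicts:
                new_file[key] = remaining.pop(key)
            files[i] = new_file              # the old file would be dropped / garbage-collected
    if remaining:                            # keys never seen before land in a brand-new file
        files.append(remaining)

def read_all() -> Dict[str, dict]:
    out: Dict[str, dict] = {}
    for f in files:                          # no per-key merging at read time
        out.update(f)
    return out
```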
Frequent updates are problematic because they produce a large number of small files. Small files are unfriendly to object storage, hurting performance and significantly increasing cost. In addition, deduplicating by primary key at read time degrades read performance.
The common industry solution is to build an aggregation and buffering layer between the data source and the data lake using Kafka. Data is aggregated and batched in Kafka before being written into the data lake. For modified or deleted data, deduplication is achieved using the aforementioned MoR or CoW methods, with upper-layer analysis performed using Hive/Spark/Presto. For data with very frequent updates/deletes, this approach is less feasible. Typically, after communicating with the business side, data freshness is reduced by adopting a T+1 approach.
For large companies with experienced and large teams, each component is maintained by a dedicated team. Although the system is complex, it has the advantage of openness and is widely used.
Real-time Data Warehouse Solutions with Doris/StarRocks
Doris and StarRocks are open-source OLAP data warehouses that have invested heavily in real-time analysis and have become a key component in many vendors' data analysis portfolios for filling the "real-time analysis" gap. [1]
System Overview

Inside the Tablet Implementation

Rowset: All data of a Tablet is divided into multiple Rowsets, each stored in a columnar file format, which can be understood as a row group in a Parquet file. Multiple Rowsets can be stored in a single file.
Meta (Metadata): Stores the version history of the Tablet and information about each version, such as which Rowsets are included. It is serialized and stored in RocksDB, with quick access enabled by caching in memory.
DelVector: Records rows marked for deletion in each Rowset, also stored in RocksDB and cached in memory for fast access. It can be understood as a visibility map, with each Rowset in Doris corresponding to one or zero DelVectors.
Primary Index: Stores the mapping between primary keys and the location of the records. Currently, it is mainly maintained in memory, with ongoing efforts to persist the primary index to disk to save memory.
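Based only on the description above, the Tablet components can be sketched roughly as follows; the field names and types are assumptions for illustration and are not Doris/StarRocks code.

```python
# Illustrative sketch of the Tablet layout described above.
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class Rowset:
    rowset_id: int
    columnar_file: str                                    # columnar segment, akin to a Parquet row group

@dataclass
class DelVector:
    rowset_id: int
    deleted_rows: Set[int] = field(default_factory=set)   # visibility-map-style row marks

@dataclass
class TabletMeta:
    versions: Dict[int, List[int]] = field(default_factory=dict)  # version -> rowset ids (persisted in RocksDB)

@dataclass
class Tablet:
    meta: TabletMeta
    rowsets: Dict[int, Rowset]
    del_vectors: Dict[int, DelVector]                     # zero or one DelVector per rowset
    primary_index: Dict[bytes, Tuple[int, int]]           # key -> (rowset_id, row_id), kept in memory
```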
The issues with Doris/StarRocks' approach are:
Delta data is stored in an in-memory memtable. In typical RDS multi-database synchronization scenarios with thousands or tens of thousands of tables, the memory of Doris' BE (Backend) is quickly exhausted. For example, with 1,000 tables being written simultaneously, each memtable at 10MB, each BE would require 10GB of memory, causing node OOM (Out of Memory) and rendering the system unusable.
According to available public information, the primary key (PK) is also fully in-memory without being persisted to disk. Node failures would lead to time-consuming index rebuilding, especially in scenarios with trillion-record tables, which is clearly not suitable.
The community has clearly considered these issues, but persisting data to disk while maintaining high reliability and performance is a massive undertaking. This scenario is not their core focus, so while some capabilities (such as persisting primary keys to disk) are planned, they have not yet been implemented. [1]
Relyt AI-ready Data Cloud Technical Solution
Overall Solution

Relyt uses object storage as the primary storage and implements a four-layer architecture that separates data, cache, computation, and metadata. The Hybrid DPS handles real-time ingestion and deduplication, the Extreme DPS provides high-performance queries and ETL, and the NDP layer provides high-speed data caching plus computation on the cache, such as predicate pushdown and encoding/decoding.
The core of Hybrid DPS is the row-based Delta, combined with the columnar Base data stored in object storage to build a real-time engine. Real-time ingested data first lands on the row-based storage and uses a B-tree index for primary key deduplication. Background tasks periodically flush the data into the columnar Base storage, merging, sorting, and deleting data during the flush process. Extreme DPS reads both Delta and Base data during queries.
The overall principle of the solution is straightforward, but in actual engineering practice, we face significant challenges:
Write-Write Conflict and Mutual Blocking Issues: In typical delta solutions, when users delete data, the background flush from row to column storage also needs to delete delta data. This means two transactions need to write simultaneously. Common solutions are either locking and waiting or aborting one transaction, both of which degrade user experience.
Bandwidth and Latency Issues with Extreme DPS Reading Delta Data: To ensure effective batching of delta data, Extreme DPS needs to maintain a certain volume of delta data. However, larger delta data volumes lead to high bandwidth usage when Extreme DPS reads data, potentially causing read delays.
Redistribution of Row-Based Data During Scaling: To ensure effective batching of delta data flushed to column storage, row-based storage holds a certain amount of data. Given the large number of tables in actual user scenarios, traditional solutions like Greenplum require re-distributing data, which takes a long time and significantly increases IO pressure, affecting user business during scaling.
Module Design
Below, we further break down the Delta+Base architecture. Before diving into this section, readers may first refer to how we decouple columnar data and metadata transactions. [1] We will not repeat that here and will only introduce how to achieve high-concurrency real-time updates in the database.
As mentioned above, real-time ingestion is split into Delta + Base: Delta is built on PostgreSQL's row-based storage engine, and Base is a self-developed columnar storage engine. Breaking the Delta and columnar parts down further (a sketch of these structures follows after the list below):
Delta Data: Real-time ingested data is written into Delta data in an append-only manner, with each Delta record assigned a unique TUID (Transaction Unique ID) for deletion purposes.
Tombstone: Deleted data is written into the Tombstone in an append-only manner, also using a unique TUID.
Delete Queue: If deleted records have been flushed to columnar storage files, they are also written into the Delete Queue to generate a visibility map for columnar storage.
Primary Key (PK): The primary key for data in both Delta and Base parts is stored in the PK, using a classic B-tree structure. This leverages PostgreSQL's capabilities.
Columnar Storage Part:
ColumnStorageFile: PAX (Partitioned Attributes eXtended) mixed row-column storage.
Delete Bitmap: Marks deleted data in ColumnStorageFile.
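Here is the sketch promised above: a rough Python model of these structures, using plain dicts and sets as stand-ins for the PostgreSQL B-tree PK, the append-only row store, and the PAX files in object storage. All names are illustrative.

```python
# Illustrative sketch of the Delta + Base structures described above.
import itertools
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

_tuid_gen = itertools.count(1)    # stand-in for TUID allocation

@dataclass
class DeltaStore:
    rows: Dict[int, dict] = field(default_factory=dict)                 # TUID -> row (append-only Delta data)
    tombstone: Set[int] = field(default_factory=set)                    # TUIDs marked deleted (append-only)
    delete_queue: List[Tuple[str, int]] = field(default_factory=list)   # (base file, row id) pending deletes
    pk: Dict[str, tuple] = field(default_factory=dict)                  # key -> ("delta", tuid) or ("base", file, row_id)

@dataclass
class BaseStore:
    files: Dict[str, List[dict]] = field(default_factory=dict)          # PAX-style columnar files
    delete_bitmap: Dict[str, Set[int]] = field(default_factory=dict)    # file -> deleted row ids (visibility map)
```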
Write Process:
First, the PK is probed to check for duplicate data. If no duplicate is found, the data is written into Delta data and a new entry is added to the PK. If a duplicate exists, then in the upsert scenario (on conflict, do update) the TUID of the existing record is looked up in the PK and written into the Tombstone to mark it as deleted; a new TUID is then generated and the new data is written.
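Continuing the sketch above, the write path for the upsert case might look like this; `upsert` and the tuple layout of PK entries are illustrative assumptions.

```python
def upsert(delta: DeltaStore, key: str, row: dict) -> None:
    hit = delta.pk.get(key)                       # 1. probe the primary-key index
    if hit is not None:
        if hit[0] == "delta":                     #    old version still lives in the row store:
            delta.tombstone.add(hit[1])           #    mark it via an append-only Tombstone record
        else:                                     #    old version was already flushed to Base:
            _, file_name, row_id = hit
            delta.delete_queue.append((file_name, row_id))
    tuid = next(_tuid_gen)                        # 2. append the new version with a fresh TUID
    delta.rows[tuid] = row
    delta.pk[key] = ("delta", tuid)               # 3. point the PK at the new version
```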
Query Process:
Delta data is deduplicated against the Tombstone and returned in columnar format. When reading ColumnStorageFile, the delete visibility map is applied; if the Delete Queue still contains records that have not yet been written to the visibility map, they are merged into the visibility map first.
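Continuing the same sketch, a simplified read path: pending Delete Queue entries are merged into an effective visibility map first, then Base and Delta rows are combined with tombstoned TUIDs filtered out.

```python
def read_rows(delta: DeltaStore, base: BaseStore) -> List[dict]:
    # 1. effective visibility map = persisted delete bitmap + not-yet-applied Delete Queue entries
    effective: Dict[str, Set[int]] = {f: set(s) for f, s in base.delete_bitmap.items()}
    for file_name, row_id in delta.delete_queue:
        effective.setdefault(file_name, set()).add(row_id)
    out: List[dict] = []
    # 2. Base rows, skipping anything the visibility map marks as deleted
    for file_name, rows in base.files.items():
        dead = effective.get(file_name, set())
        out.extend(row for i, row in enumerate(rows) if i not in dead)
    # 3. Delta rows, skipping tombstoned TUIDs
    out.extend(row for tuid, row in delta.rows.items() if tuid not in delta.tombstone)
    return out
```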
Background Flush and Merge Process:
When Delta data and Tombstone data reach a set size or at scheduled times, they are written into columnar storage. Data in the Delete Queue is periodically written into the delete bitmap. ColumnStorageFile also merges when there are too many small files or too many deletions indicated by the delete bitmap. During the merge process, records in the Tombstone are cleaned up.
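And a simplified flush/merge step, again continuing the sketch; the real system runs this as a per-table background task and also compacts small files, which is omitted here.

```python
def flush(delta: DeltaStore, base: BaseStore, new_file: str) -> None:
    # 1. drain the Delete Queue into the persisted delete bitmap
    for file_name, row_id in delta.delete_queue:
        base.delete_bitmap.setdefault(file_name, set()).add(row_id)
    delta.delete_queue.clear()
    # 2. surviving (non-tombstoned) Delta rows become a new columnar file
    live = [(tuid, row) for tuid, row in delta.rows.items() if tuid not in delta.tombstone]
    base.files[new_file] = [row for _, row in live]
    # 3. repoint PK entries from Delta TUIDs to their new Base (file, row_id) locations
    tuid_to_row_id = {tuid: i for i, (tuid, _) in enumerate(live)}
    for key, loc in list(delta.pk.items()):
        if loc[0] == "delta" and loc[1] in tuid_to_row_id:
            delta.pk[key] = ("base", new_file, tuid_to_row_id[loc[1]])
    # 4. Tombstone records are physically cleaned up only at this point
    delta.rows.clear()
    delta.tombstone.clear()
```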
Solution Summary:
The benefits of this approach are:
The Delta table is an append-only structure. Deletions are achieved by inserting data into the Tombstone table rather than deleting data from Delta data directly. Tombstone data is only deleted during flush and merge operations, avoiding write-write conflicts between user deletions and background flushes. Flush and merge operations can be managed at the table level to avoid conflicts.
To address the bandwidth and latency issues caused by Extreme DPS reading Delta data, we configured a read-only replica for Extreme DPS. This replica consumes PostgreSQL's WAL (Write-Ahead Logging) to obtain the latest LSN (Log Sequence Number) and reads page data from the Page Server, caching it locally based on the LSN version. This approach resolves the bandwidth and latency issues when Extreme DPS reads data from Hybrid DPS.
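A minimal sketch of an LSN-versioned page cache of this kind, assuming a hypothetical `fetch_page(page_id, lsn)` call against the Page Server; this is illustrative and not the actual replica implementation.

```python
from typing import Callable, Dict, Tuple

class VersionedPageCache:
    """Cache pages by (page_id, lsn) so a replica can serve a consistent snapshot locally."""

    def __init__(self, fetch_page: Callable[[int, int], bytes]):
        self._fetch = fetch_page                         # (page_id, lsn) -> page bytes from the Page Server
        self._cache: Dict[Tuple[int, int], bytes] = {}

    def get(self, page_id: int, lsn: int) -> bytes:
        key = (page_id, lsn)
        if key not in self._cache:                       # only a cache miss touches the Page Server
            self._cache[key] = self._fetch(page_id, lsn)
        return self._cache[key]
```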
To solve the problem of data redistribution during scaling, which requires moving data in traditional row-based storage, we changed the redistribution to a flush operation. Delta data is written into columnar storage, which includes shard information. By simply modifying the metadata to remap the relationship between compute nodes and shards, we avoid the need to move data, which would otherwise cause prolonged unavailability.
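The scaling idea can be illustrated with a tiny sketch: only the shard-to-node mapping in metadata is rewritten, while the shard data in object storage stays where it is. The round-robin placement below is an illustrative assumption.

```python
from typing import Dict, List

def remap_shards(shard_ids: List[int], nodes: List[str]) -> Dict[int, str]:
    """Assign every shard to a compute node; scaling only rewrites this mapping."""
    return {shard: nodes[i % len(nodes)] for i, shard in enumerate(shard_ids)}

shards = list(range(8))
before = remap_shards(shards, ["node-1", "node-2"])                       # 2 compute nodes
after = remap_shards(shards, ["node-1", "node-2", "node-3", "node-4"])    # scaled to 4 nodes, no data moved
```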
Of course, this solution is not perfect. The Tombstone-based deletion approach leads to significant write amplification in scenarios with a large number of deletions. Therefore, for bulk delete scenarios (e.g., full table deletes), we automatically switch to writing columnar storage and generating visibility maps. If an entire file is deleted, we further degrade to file deletion rather than writing visibility maps. Additionally, we have optimized the reading of visibility maps extensively, which can be referenced in [3]. We have also implemented compute-storage separation for the row-based storage by referencing the architecture of Neon. [4]
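The choice among these delete paths can be summarized as a simple decision, sketched below; the 20% threshold is an illustrative assumption, not Relyt's actual tuning.

```python
def choose_delete_path(rows_to_delete: int, file_row_count: int) -> str:
    """Pick a deletion strategy for one columnar file (thresholds are assumptions)."""
    if rows_to_delete >= file_row_count:
        return "drop-file"          # the whole file is deleted: remove the file itself
    if rows_to_delete > 0.2 * file_row_count:
        return "visibility-map"     # bulk delete: write the delete bitmap directly
    return "tombstone"              # normal path: append-only Tombstone records
```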
Other Work
In addition to the foundational work described above, to support real-time ingestion from TP database sources and further improve user experience and cost, we have also worked on ALTER TABLE compatibility and serverless capabilities.
ALTER TABLE:
Most source TP databases are MySQL, while we provide PostgreSQL syntax. To make integration easier for MySQL users, we have implemented syntax compatibility such as ALTER TABLE ADD/MODIFY COLUMN FIRST/AFTER.
Spot Instance Support and Automatic Start/Stop:
Our compute and Page Server nodes are stateless, allowing us to deploy them on spot instances. Spot instances cost approximately 1/10 to 1/5 of the price of equivalent instances, significantly reducing costs. For businesses with distinct peak and off-peak periods, we offer automatic start/stop instances. When not in use, user data is preserved, compute nodes are released, and nodes are launched on-demand based on user requests, further reducing costs.
Performance Comparison
Write Capability:
The write test environment is equivalent to 32C. The 150GB lineitem table from TPC-H was divided into 20 parts and imported concurrently with 20 threads.
Test results:
RPS (Requests Per Second) results are shown in the figure below:
Write throughput:
The write capability is comparable to that of Apache Doris 2.0 with 3-replica local disks [2]. We have not pushed performance to the extreme; roughly 7.5 MB/s per core for conflict-free writes is sufficient for our purposes. Relyt's differentiation lies in stably synchronizing data from TP databases with high-concurrency, small-batch, high-frequency updates.
Read Capability:
Our test results in the TPC-H scenario are as follows:
| Test Environment | Test Item | 1TB (64C): Competitor S (16C 64GB * 4) | 1TB (64C): Relyt (14C 52GB * 4) | 1TB (96C): Competitor S (16C 64GB * 6) | 1TB (96C): Relyt (14C 52GB * 6) |
| --- | --- | --- | --- | --- | --- |
| Alibaba Cloud | Import Time | 6119s | 1245s | 6066s | 852s |
| Alibaba Cloud | Query Time | 142.4s | 143.8s | 96.7s | 116s |
Our query capability ranks in the top tier of the TPC-H benchmark. Additionally, our read and write nodes follow a typical MPP architecture, with linear scalability for both read and write capabilities.
Conclusion
In summary, Relyt AI-ready Data Cloud by Zhibian Technology offers a new choice for data analysis. Our goal is not to compete with data lakes or existing OLAP engines in their areas of expertise. Instead, we aim to provide a new paradigm for data analysis that lets users work with data analysis as easily as with a single-machine database, without building a large infrastructure team or sacrificing data freshness for cost reasons, turning T+0 analysis into a reality instead of T+1.
This solution has been refined over nearly two years through collaboration with seed customers. It has withstood data growth from 10 source databases to over 100, as well as extreme scenarios such as bulk table rewrites leading to massive deletions and frequent ALTER TABLE operations. The solution is now mature and ready for broader adoption.