
How to Perform Semi-Structured Data Analysis Effectively


wenjun

Jan 3, 2025

Introduction

Data warehouses excel at managing and analyzing structured data. For semi-structured data, many data warehouse products offer struct, map, and array types, but these all require a predefined schema, which incurs significant costs during data import and whenever the schema changes.

JSON, on the other hand, is highly flexible: it requires no predefined schema, which greatly simplifies data collection, data analysis, and schema changes. In the fast-moving internet industry, especially the gaming sector, JSON is widely adopted because fields can be added or removed quickly as business requirements evolve. As a cross-language format, it is also well supported by big data platforms and middleware.

Challenges

JSON's flexibility comes at a price: schema information is stored with every line. Taking the GitHub events dataset (https://www.gharchive.org) as an example, every line of data repeats the key fields (id, type, actor, payload, etc.). This characteristic presents three challenges for data warehouses:

1) Every line of JSON data repeats the key names, consuming significant storage space.

2) Analyzing JSON in a data warehouse requires parsing the JSON structure line by line, consuming a large amount of CPU.

3) As a table column, a JSON value is the smallest logical unit of IO: subfields cannot be loaded on demand to reduce IO, and subfields have no indexing or pruning capabilities.

Sample GitHub events data:

```json
{
    "id": "2489651045",
    "type": "CreateEvent",
    "actor": {
        "id": 665991,
        "login": "petroav",
        "gravatar_id": "",
        "url": "https://api.github.com/users/petroav",
        "avatar_url": "https://avatars.githubusercontent.com/u/665991?"
    },
    "repo": {
        "id": 28688495,
        "name": "petroav/6.828",
        "url": "https://api.github.com/repos/petroav/6.828"
    },
    "payload": {
        "ref": "master",
        "ref_type": "branch",
        "master_branch": "master",
        "description": "Solution to homework and assignments from MIT's 6.828 (Operating Systems Engineering). Done in my spare time.",
        "pusher_type": "user"
    },
    "public": true,
    "created_at": "2015-01-01T15:00:00Z"
}
```
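To see the IO problem concretely, consider a typical analytical query over this dataset. The sketch below is hypothetical: it assumes the raw events have been loaded into a github_events table with a single JSON column ev. Even though the query touches only the type subfield, a plain JSON column forces the scan to load and parse every complete document.

```sql
-- Hypothetical table: each line of the GitHub events dataset loaded
-- as one JSON value in column ev.
-- Counting events per type needs only the 'type' subfield, yet with a
-- plain JSON column the whole document is read and parsed for every row.
SELECT ev->>'type' AS event_type, count(*) AS events
FROM github_events
GROUP BY ev->>'type';
```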

To address these issues, simdjson (https://github.com/simdjson/simdjson) is widely used to accelerate JSON parsing; compared to naive implementations it often brings a 10 to 20 times improvement in parsing performance. From an end-to-end perspective, however, the time consumed by JSON analysis queries is dominated by the Scan and Projection operators. simdjson largely solves the Projection side, but when the JSON documents are large, the IO overhead of the Scan still makes query latency unacceptable.

Relyt JsonB Type

To improve the analysis performance and storage density of JSON, Relyt AI-ready Data Cloud adds support for the JsonB type. Unlike the plain JSON type, Relyt parses the JSON string when JsonB data is written, automatically recognizes its schema, and stores the data in structured form. As a result, analyzing JSON subfields approaches the performance of scalar types, while writing and querying remain fully compatible with JSON type behavior.

Usage Introduction

```sql
-- Create a table with a JsonB field
CREATE TABLE json_test (f0 json)
WITH (json_expand=on);

-- Insert JSON data
INSERT INTO json_test
VALUES ('{"key1": "value1", "key2": 1234, "key3": true}');

-- Query a JSON subfield with the ->> operator
SELECT f0->>'key1' FROM json_test;

-- Query a JSON subfield with the json_extract_text function
SELECT json_extract_text(f0, '$.[key1]') FROM json_test;
```
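Once created, the JsonB column can be used like any other column in filters and aggregations. A small illustrative sketch (the Postgres-style ::bigint cast is an assumption here, not documented Relyt syntax):

```sql
-- Subfield predicates and aggregates behave like scalar-column queries;
-- with json_expand=on they benefit from the structured storage
-- described in the next section.
SELECT f0->>'key1' AS key1, count(*) AS cnt
FROM json_test
WHERE (f0->>'key2')::bigint >= 1000
GROUP BY f0->>'key1';
```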

Implementation Principle

JsonB Structured Storage

When handling JSON data, Relyt analyzes the actual content being written, counting how frequently each JSON key occurs and which types its values take. Before data is persisted to disk, Relyt uses these statistics to dynamically decide which JSON keys to store in structured form and which storage type to use for their values. JSON keys are kept as schema information in the columnar files rather than in the data area. JSON values are mapped to Relyt types according to the following conversion table:

| JSON Value Type | Relyt Type |
|-----------------|------------|
| JSON array | Array |
| JSON object | Struct (not supported yet) |
| JSON number (integer) | Bigint |
| JSON number (floating-point) | Json |
| JSON string | Varchar |
| JSON boolean | Boolean |

For multi-line JSON data, if the value types under the same JSON key are inconsistent, the value type is uniformly classified as the JSON type. All JSON keys that Relyt dynamically decides not to store in structured form are merged into a single binary segment whose value type is likewise the JSON type.

Example 1:

Inserted JSON data

```json
{"key1": "value1", "key2": 1234, "key3": true}
{"key1": "value2", "key2": 4321, "key3": "true"}
```

The value of key3 in the first line is a JSON boolean, while in the second line it is a JSON string, so when the data is structured for storage, the type of key3 is mapped to the JSON type.
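As a minimal reproduction, assuming the json_test table created in the usage section above:

```sql
-- key3 is a JSON boolean in the first row and a JSON string in the
-- second, so structured storage falls back to the JSON type for key3.
INSERT INTO json_test VALUES
    ('{"key1": "value1", "key2": 1234, "key3": true}'),
    ('{"key1": "value2", "key2": 4321, "key3": "true"}');
```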

Example 2:

Inserted JSON data

```json
{"key1": "value1", "key2": 1234, "key3": true}
{"key1": "value2", "key2": 4321}
{"key1": "value3", "key2": 1000}
```

```

Since JSON key "key3" appears in no more than 50% of the rows (here, one of three), key3 is not stored in structured form.
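Again as a minimal reproduction against the same json_test table:

```sql
-- key3 appears in only one of three rows (below the 50% threshold), so
-- it stays in the merged non-structured JSON segment instead of
-- getting its own column.
INSERT INTO json_test VALUES
    ('{"key1": "value1", "key2": 1234, "key3": true}'),
    ('{"key1": "value2", "key2": 4321}'),
    ('{"key1": "value3", "key2": 1000}');
```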

JsonB Query Optimization

Because JsonB data is already stored in structured form, Relyt can load only the JSON keys a query actually touches. Moreover, when the Relyt storage type matches the return type of the json_extract function, the projection computation can be skipped entirely. Below we describe Relyt's query-phase optimizations from the perspectives of IO pruning and projection.

IO Pruning

```sql
SELECT f0->>'key1' FROM json_test WHERE f0->>'key1' = 'value2';
```

Taking the SQL above as an example, Relyt recognizes the operands of the ->> operator when generating the execution plan. Since column f0 is stored in structured form, with the subfields 'key1', 'key2', and 'key3' each laid out as columns, the Scan operator can read only the 'key1' subfield, achieving vertical IO pruning.

In addition, 'key1' is no different from an ordinary column in the columnar storage files, so the statistics and indexes on that column can still be used during queries, achieving horizontal IO pruning.
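As an illustration of horizontal pruning (hypothetical: it assumes Relyt keeps min/max statistics on shredded subfields as on ordinary columns, and uses the Postgres-style ::bigint cast):

```sql
-- With 'key2' shredded into its own Bigint column, per-file min/max
-- statistics can skip files whose value range cannot satisfy the
-- predicate, exactly as for an ordinary scalar column.
SELECT count(*) FROM json_test
WHERE (f0->>'key2')::bigint > 4000;
```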

Projection Pushdown

As shown in Figure 3, for queries on JsonB the json_extract_text call in the Projection operator is optimized away: the storage type of 'key1' is Varchar, and the return type of json_extract_text is also Varchar, so no additional projection computation is needed. If the storage type of 'key1' were Bigint instead, inconsistent with json_extract_text's Varchar return type, an extra cast would still be required. This cast is pushed down to the Scan operator: because the JsonB schemas of different columnar storage files may differ, the decision to cast has to be made at file granularity.
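To make the two cases concrete, here is a sketch using the table from above; the queries are illustrative, the behavior is as described in this section:

```sql
-- 'key1' is stored as Varchar and json_extract_text returns Varchar,
-- so the projection is a no-op: the value is read straight from storage.
SELECT json_extract_text(f0, '$.[key1]') FROM json_test;

-- 'key2' is stored as Bigint while json_extract_text returns Varchar,
-- so a Bigint -> Varchar cast is pushed down into the Scan operator and
-- decided per columnar file, since files may shred different schemas.
SELECT json_extract_text(f0, '$.[key2]') FROM json_test;
```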

Query Performance Evaluation

The JsonB type is already in production use with Relyt's core customers, and the following performance data comes from their business scenarios. Under identical business data, the performance results of the wide-table, JsonB, and JSON solutions are compared below.

Industry Related Work

Doris and ClickHouse both support a variant type to address the dynamic typing of JSON, representing JSON data in structured form in memory and thereby avoiding the overhead of accessing JSON subfields. On disk, both structure the JSON data in high-density columnar storage.

The Spark community also supports the variant type and uses the variant shredding feature to store JSON subfields in columnar form, further enhancing IO pruning capabilities.

Future Work

Relyt currently supports structured storage only for first-level JSON subfields. Since multi-level nested subfields remain common in business scenarios, we are working on supporting them as well.

As Spark and the data lake ecosystem mature their support for the Variant type, Relyt will also support the Variant type in its current lakehouse architecture, further strengthening the data lake's ability to process semi-structured data.