# Schema Design
Effective schema design is key to optimizing ClickHouse performance. It involves choices that often carry trade-offs, with the optimal approach depending on the queries being served as well as factors such as data update frequency, latency requirements, and data volume. This guide provides an overview of schema design best practices and data modeling techniques for optimizing ClickHouse performance.
## Stack Overflow dataset
For the examples in this guide, we use a subset of the Stack Overflow dataset. This contains every post, vote, user, comment, and badge created on Stack Overflow from 2008 to April 2024. The data is available in Parquet, using the schemas below, under the S3 bucket `s3://datasets-documentation/stackoverflow/parquet/`:
The primary keys and relationships indicated are not enforced through constraints (Parquet is a file format, not a table format); they purely indicate how the data is related and the unique keys it possesses.
![Stack Overflow Schema](/docs/assets/images/stackoverflow-schema-e65f899fb048e8a4c530c6f9c30d07b0.png)
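Because the bucket is public, the Parquet files can be queried in place before any tables are created. As a quick, minimal sketch (the query below simply counts rows across all per-year files), the `s3` table function reads the data directly:

```sql
-- Query the Parquet files in place, no table required.
-- The glob pattern matches every per-year file under posts/.
SELECT count()
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/posts/*.parquet')
```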
The Stack Overflow dataset contains a number of related tables. In any data modeling task, we recommend users focus on loading their primary table first. This may not necessarily be the largest table, but rather the one on which you expect to receive the most analytical queries. Starting here allows you to familiarize yourself with the main ClickHouse concepts and types, which is especially important if you are coming from a predominantly OLTP background. This table may require remodeling as additional tables are added to fully exploit ClickHouse features and obtain optimal performance.
The above schema is intentionally not optimal for the purposes of this guide.
## Establish initial schema
Since the `posts` table will be the target for most analytics queries, we focus on establishing a schema for this table. This data is available in the public S3 bucket `s3://datasets-documentation/stackoverflow/parquet/posts/*.parquet`, with one file per year.
Loading data from S3 in Parquet format is the most common and preferred way to load data into ClickHouse. ClickHouse is optimized for processing Parquet and can potentially read and insert tens of millions of rows per second from S3.
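As a minimal sketch of this load path (assuming a target table, here named `posts`, already exists with a compatible schema; the table name and the `SELECT *` projection are illustrative), an `INSERT INTO ... SELECT` from the `s3` table function performs the bulk load:

```sql
-- Bulk load every per-year Parquet file into an existing posts table.
-- ClickHouse parallelizes both the S3 reads and the inserts.
INSERT INTO posts
SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/posts/*.parquet')
```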
ClickHouse provides a schema inference capability to automatically identify the types for a dataset. This is supported for all data formats, including Parquet. We can exploit this feature to identify the ClickHouse types for the data via the `s3` table function and the `DESCRIBE` command. Note below that we use the glob pattern `*.parquet` to read all files in the `stackoverflow/parquet/posts` folder.
```sql
DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/posts/*.parquet')
SETTINGS describe_compact_output = 1
```
┌─name──────────────────┬─type───────────────────────────┐
│ Id │ Nullable(Int64) │
│ PostTypeId │ Nullable(Int64) │
│ AcceptedAnswerId │ Nullable(Int64) │
│ CreationDate │ Nullable(DateTime64(3, 'UTC')) │
│ Score │ Nullable(Int64) │
│ ViewCount │ Nullable(Int64) │
│ Body │ Nullable(String) │
│ OwnerUserId │ Nullable(Int64) │
│ OwnerDisplayName │ Nullable(String) │
│ LastEditorUserId │ Nullable(Int64) │
│ LastEditorDisplayName │ Nullable(String) │
│ LastEditDate │ Nullable(DateTime64(3, 'UTC')) │
│ LastActivityDate │ Nullable(DateTime64(3, 'UTC')) │
│ Title │ Nullable(String) │
│ Tags │ Nullable(String) │