DEV Community 3h ago

Day 35 – ClickHouse® and S3 Integration: Querying Data Lakes

Introduction

Modern organizations generate massive amounts of data that need to be stored and analyzed efficiently. As data volumes continue to grow, storing everything inside a database can become expensive and difficult to manage. Amazon S3 has become one of the most popular storage solutions for building data lakes because it offers virtually unlimited, durable, and cost-effective object storage.

At the same time, ClickHouse® is known for delivering extremely fast analytical queries on large datasets. By integrating ClickHouse® with Amazon S3, organizations can query data directly from their data lake without first importing it into database tables. This reduces storage duplication, simplifies data pipelines, and enables fast analytics over massive datasets.

What Is Amazon S3?

Amazon Simple Storage Service (S3) is a cloud-based object storage service that allows organizations to store and retrieve virtually unlimited amounts of data. It is widely used for storing:

CSV files
JSON documents
Parquet datasets
ORC files
Application logs
Backups
Machine learning datasets
Historical archives

Because of its scalability, durability, and low storage cost, Amazon S3 serves as the foundation for many modern data lake architectures.

Key Benefits

Virtually unlimited storage capacity
High durability and availability
Cost-effective storage for large datasets
Seamless integration with analytics platforms
Ideal for long-term data retention

What Is a Data Lake?

A data lake is a centralized repository that stores structured, semi-structured, and unstructured data in its original format. Unlike traditional databases, data lakes do not require a predefined schema before storing data. Instead, data is stored as-is and processed only when needed, providing greater flexibility for analytics.

Common examples of data stored in data lakes include:

Application logs
Business transactions
IoT sensor readings
Clickstream data
Machine learning datasets
Historical business records

Why Integrate ClickHouse® with Amazon S3?

Traditionally, data stored in cloud storage is first imported into a database before it can be queried. This approach increases storage costs, duplicates data, and introduces additional ETL steps. ClickHouse® provides native support for querying files directly from Amazon S3 using the s3() table function.

This approach offers several advantages:

No data duplication
Immediate access to large datasets, with no load step before the first query
Lower infrastructure costs
Simplified ETL pipelines
Easy access to historical data

Querying Data from Amazon S3

The s3() table function takes the URL of an object and the name of its format, and exposes the file as a table that can be used in a FROM clause.

Query a CSV File

SELECT * FROM s3(
  'https://my-bucket.s3.amazonaws.com/sales.csv',
  'CSVWithNames'
) LIMIT 10;

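This query treats the CSV file as a virtual table and returns ten rows without importing the data into ClickHouse. The CSVWithNames format takes column names from the file's header row, and column types are inferred from the data.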
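Schema inference can be checked, or bypassed, before running a larger query. The sketch below first asks ClickHouse which columns and types it infers for the file, then passes an explicit structure as a third argument instead. The column list is an assumption borrowed from the sales schema used later in this article and may not match the real sales.csv.

```sql
-- Show the column names and types ClickHouse infers for the file.
DESCRIBE TABLE s3(
  'https://my-bucket.s3.amazonaws.com/sales.csv',
  'CSVWithNames'
);

-- Skip inference by declaring the structure explicitly.
-- The column names and types here are assumed, not taken from the real file.
SELECT *
FROM s3(
  'https://my-bucket.s3.amazonaws.com/sales.csv',
  'CSVWithNames',
  'customer_id UInt32, order_id UInt64, amount Float64, order_date Date'
)
LIMIT 10;
```

Declaring the structure is mostly useful when inference picks a wider or different type than intended, for example a string where a date was expected.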

Query a Parquet File

SELECT customer_id, SUM(amount) AS total_sales
FROM s3(
  'https://my-bucket.s3.amazonaws.com/orders.parquet',
  'Parquet'
)
GROUP BY customer_id
ORDER BY total_sales DESC;

Parquet is particularly efficient because it is a columnar format: ClickHouse reads only the columns the query references (here, customer_id and amount), reducing both the data fetched from S3 and query execution time.
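Both queries above omit credentials, which only works when the bucket is readable by the ClickHouse server as it is configured. As a sketch, a public bucket can be read with the NOSIGN keyword, and a private one by passing an access key pair after the URL. The key values below are placeholders, not real credentials.

```sql
-- Public bucket: NOSIGN tells ClickHouse not to sign the request.
SELECT count()
FROM s3(
  'https://my-bucket.s3.amazonaws.com/orders.parquet',
  NOSIGN,
  'Parquet'
);

-- Private bucket: access key ID and secret access key follow the URL.
-- '<access_key_id>' and '<secret_access_key>' are placeholders.
SELECT count()
FROM s3(
  'https://my-bucket.s3.amazonaws.com/orders.parquet',
  '<access_key_id>',
  '<secret_access_key>',
  'Parquet'
);
```

Keys typed into a query can end up in shell history or client-side logs, so production setups often keep them in server-side configuration instead.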

Query Multiple Files

Large data lakes typically organize data across thousands of partitioned files. ClickHouse supports wildcard patterns for querying multiple files simultaneously.

SELECT count()
FROM s3(
  'https://my-bucket.s3.amazonaws.com/logs/2026/*.parquet',
  'Parquet'
);

This makes it easy to analyze large datasets without manually combining files.
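Beyond a single *, the URL can also list alternatives in braces, and the _path virtual column reports which object each row came from. The example below is a sketch that assumes a hypothetical layout with one subdirectory per month under logs/2026/.

```sql
-- Hypothetical layout: logs/2026/<month>/<file>.parquet
-- {01,02,03} matches any one of the listed strings;
-- * matches any characters within a single path segment (it does not cross '/').
SELECT
  _path,
  count() AS row_count
FROM s3(
  'https://my-bucket.s3.amazonaws.com/logs/2026/{01,02,03}/*.parquet',
  'Parquet'
)
GROUP BY _path
ORDER BY _path;
```

Grouping by _path gives a per-file row count, which is a quick way to spot empty or unexpectedly large files before running heavier queries.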

Loading Data from Amazon S3 into ClickHouse

Although querying data directly from S3 is convenient, frequently accessed datasets can be imported into ClickHouse tables for even better performance.

Method 1: Create and Load in a Single Step

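CREATE TABLE sales
ENGINE = MergeTree
ORDER BY customer_id
AS SELECT * FROM s3(
  'https://my-bucket.s3.amazonaws.com/sales.parquet',
  'Parquet'
)
SETTINGS schema_inference_make_columns_nullable = 0;

This method creates the table and loads the data in a single query, making it useful for quick analysis and experimentation. Column types are inferred from the file. By default, inferred Parquet columns are Nullable, and MergeTree does not accept a Nullable column in the sorting key unless allow_nullable_key is enabled. The SETTINGS clause asks ClickHouse to infer non-nullable types instead, which means any NULL values in the file are stored as the column type's default value.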
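Because the schema in this method is inferred rather than declared, it is worth inspecting the result before relying on it. A minimal check, assuming the sales table above was created successfully:

```sql
-- Show the column names and types ClickHouse inferred from the Parquet file.
SHOW CREATE TABLE sales;

-- Confirm that rows were actually loaded.
SELECT count() FROM sales;
```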

Method 2: Create the Table First

Create the table schema.

CREATE TABLE sales (
  customer_id UInt32,
  order_id UInt64,
  amount Float64,
  order_date Date
) ENGINE = MergeTree
ORDER BY customer_id;

Then insert the data.

INSERT INTO sales
SELECT * FROM s3(
  'https://my-bucket.s3.amazonaws.com/sales.parquet',
  'Parquet'
);

This approach offers greater control over schema design and is commonly used in production environments.
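SELECT * relies on the file's columns lining up with the table definition. A slightly more defensive variant, sketched here on the assumption that sales.parquet contains columns with the same four names as the table, lists the columns on both sides:

```sql
-- Assumes sales.parquet has columns named exactly like the table's columns.
INSERT INTO sales (customer_id, order_id, amount, order_date)
SELECT
  customer_id,
  order_id,
  amount,
  order_date
FROM s3(
  'https://my-bucket.s3.amazonaws.com/sales.parquet',
  'Parquet'
);
```

Naming the columns also leaves an obvious place to add casts or filters later, for example restricting the load to a date range.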

Benefits of Loading Data into ClickHouse

Importing frequently queried datasets provides several advantages:

Improved query performance
Better schema management
Reduced S3 access costs
Faster dashboard response times
Ideal for production workloads

Supported File Formats

ClickHouse® supports reading several popular file formats directly from Amazon S3.

Format	Typical Use Case
CSV	General-purpose data exchange
JSON	APIs and application data
Parquet	Analytics and data lakes
ORC	Big data processing
TSV	Tab-separated datasets

Among these formats, Parquet is generally the best choice for analytical workloads because of its columnar storage format and efficient compression.
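The table lists file types rather than the exact names ClickHouse expects in the format argument. As an illustration, newline-delimited JSON is read with JSONEachRow, and a tab-separated file with a header row is read with TabSeparatedWithNames. The object paths below are hypothetical.

```sql
-- Newline-delimited JSON, one object per line.
-- Compression is detected from the .gz file extension.
SELECT count()
FROM s3(
  'https://my-bucket.s3.amazonaws.com/events/2026-01-01.ndjson.gz',
  'JSONEachRow'
);

-- Tab-separated file whose first row holds the column names.
SELECT *
FROM s3(
  'https://my-bucket.s3.amazonaws.com/exports/customers.tsv',
  'TabSeparatedWithNames'
)
LIMIT 10;
```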

Best Practices

To achieve the best performance when querying S3 data with ClickHouse®:

Store analytical datasets in Parquet format.
Partition data by date or business dimensions.
Query only the required columns.
Compress files to reduce storage costs.
Load frequently accessed datasets into local ClickHouse tables.
Organize S3 directories for efficient filtering.
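Several of these practices show up together in a single query. The sketch below assumes a hypothetical layout of logs/<year>/<month>/ under the bucket and hypothetical columns event_time and level. The URL prefix limits the scan to one month of files, and only two columns are referenced.

```sql
-- Hypothetical layout: logs/<year>/<month>/<file>.parquet
-- Hypothetical columns: event_time (DateTime), level (String)
SELECT
  toDate(event_time) AS day,
  count() AS error_count
FROM s3(
  'https://my-bucket.s3.amazonaws.com/logs/2026/01/*.parquet',
  'Parquet'
)
WHERE level = 'ERROR'
GROUP BY day
ORDER BY day;
```

If a query like this ends up backing a dashboard, that is usually the signal to load the data into a local MergeTree table instead.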

Common Use Cases

Log Analytics - Analyze application logs and server logs stored in Amazon S3 without importing them into ClickHouse.
Historical Reporting - Generate reports from archived datasets directly within the data lake.
Data Warehousing - Use ClickHouse as a high-performance query engine on top of an S3-based data lake.
Business Intelligence - Power dashboards and analytics platforms using data stored directly in Amazon S3.

Conclusion

ClickHouse® and Amazon S3 together provide a powerful solution for modern data lake analytics. By allowing users to query data directly from object storage, ClickHouse eliminates unnecessary data movement while delivering exceptional analytical performance. Whether you're analyzing logs, exploring historical business data, or building a scalable data warehouse, integrating ClickHouse® with Amazon S3 simplifies data architectures, reduces infrastructure costs, and enables fast, efficient analytics at scale.
