Day 35 – ClickHouse® and S3 Integration: Querying Data Lakes
Introduction
Modern organizations generate massive amounts of data that need to be stored and analyzed efficiently. As data volumes continue to grow, storing everything inside a database can become expensive and difficult to manage. Amazon S3 has become one of the most popular storage solutions for building data lakes because it offers virtually unlimited, durable, and cost-effective object storage.
At the same time, ClickHouse® is known for delivering extremely fast analytical queries on large datasets. By integrating ClickHouse® with Amazon S3, organizations can query data directly from their data lake without first importing it into database tables. This reduces storage duplication, simplifies data pipelines, and enables fast analytics over massive datasets.
What Is Amazon S3?
Amazon Simple Storage Service (S3) is a cloud-based object storage service that allows organizations to store and retrieve virtually unlimited amounts of data. It is widely used for storing:
- CSV files
- JSON documents
- Parquet datasets
- ORC files
- Application logs
- Backups
- Machine learning datasets
- Historical archives
Because of its scalability, durability, and low storage cost, Amazon S3 serves as the foundation for many modern data lake architectures.
Key Benefits
- Virtually unlimited storage capacity
- High durability and availability
- Cost-effective storage for large datasets
- Seamless integration with analytics platforms
- Ideal for long-term data retention
What Is a Data Lake?
A data lake is a centralized repository that stores structured, semi-structured, and unstructured data in its original format. Unlike traditional databases, data lakes do not require a predefined schema before storing data. Instead, data is stored as-is and processed only when needed, providing greater flexibility for analytics.
Common examples of data stored in data lakes include:
- Application logs
- Business transactions
- IoT sensor readings
- Clickstream data
- Machine learning datasets
- Historical business records
Why Integrate ClickHouse® with Amazon S3?
Traditionally, data stored in cloud storage is first imported into a database before it can be queried. This approach increases storage costs, duplicates data, and introduces additional ETL steps. ClickHouse® provides native support for querying files directly from Amazon S3 using the s3() table function.
This approach offers several advantages:
- No data duplication
- Immediate access to large datasets, with no load step before the first query
- Lower infrastructure costs
- Simplified ETL pipelines
- Easy access to historical data
Querying Data from Amazon S3
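The s3() table function takes the file URL and, optionally, credentials, a format name, and a column structure, and exposes the file contents as a table that can be used in a FROM clause.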
Query a CSV File
SELECT * FROM s3(
'https://my-bucket.s3.amazonaws.com/sales.csv',
'CSVWithNames'
) LIMIT 10;
This query treats the CSV file as a virtual table and returns the first ten rows without importing the data into ClickHouse.
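The query above relies on schema inference and on whatever credentials the server can find. As a minimal sketch, the statements below show how one might inspect the inferred schema and pass access settings explicitly. The bucket URL is the same placeholder used above, the key values are hypothetical placeholders, and the NOSIGN keyword assumes a reasonably recent ClickHouse release.

```sql
-- Inspect the column names and types ClickHouse infers from the file.
DESCRIBE TABLE s3(
    'https://my-bucket.s3.amazonaws.com/sales.csv',
    'CSVWithNames'
);

-- Public bucket: NOSIGN tells ClickHouse not to sign the requests.
SELECT *
FROM s3(
    'https://my-bucket.s3.amazonaws.com/sales.csv',
    NOSIGN,
    'CSVWithNames'
)
LIMIT 10;

-- Private bucket: the two key strings are placeholders, not real credentials.
SELECT *
FROM s3(
    'https://my-bucket.s3.amazonaws.com/sales.csv',
    'YOUR_ACCESS_KEY_ID',
    'YOUR_SECRET_ACCESS_KEY',
    'CSVWithNames'
)
LIMIT 10;
```

Keys embedded in query text can end up in logs and query history, so a server-side credential configuration is usually preferable outside of quick experiments.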
Query a Parquet File
SELECT customer_id, SUM(amount) AS total_sales
FROM s3(
'https://my-bucket.s3.amazonaws.com/orders.parquet',
'Parquet'
)
GROUP BY customer_id
ORDER BY total_sales DESC;
Parquet is particularly efficient because it is a columnar format: ClickHouse reads only the columns the query references (here, customer_id and amount), which reduces both the data fetched from S3 and query execution time.
Query Multiple Files
Large data lakes typically organize data across thousands of partitioned files. ClickHouse supports wildcard patterns for querying multiple files simultaneously.
SELECT count()
FROM s3(
'https://my-bucket.s3.amazonaws.com/logs/2026/*.parquet',
'Parquet'
);
This makes it easy to analyze large datasets without manually combining files.
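When a pattern matches many files, it can help to see which files contributed which rows. The sketch below uses the _file virtual column that the s3() function exposes, plus a brace pattern to read two directories at once. The logs/2025/ prefix is a hypothetical addition for illustration.

```sql
-- Row count per matched file; _file holds the object name without its path.
SELECT
    _file,
    count() AS row_count
FROM s3(
    'https://my-bucket.s3.amazonaws.com/logs/2026/*.parquet',
    'Parquet'
)
GROUP BY _file
ORDER BY _file;

-- {a,b} matches either alternative; the 2025 prefix is assumed to exist.
SELECT count()
FROM s3(
    'https://my-bucket.s3.amazonaws.com/logs/{2025,2026}/*.parquet',
    'Parquet'
);
```

Note that * matches within a single path segment, so files in nested subdirectories need a longer pattern.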
Loading Data from Amazon S3 into ClickHouse
Although querying data directly from S3 is convenient, frequently accessed datasets can be imported into ClickHouse tables for even better performance.
Method 1: Create and Load in a Single Step
CREATE TABLE sales
ENGINE = MergeTree
ORDER BY customer_id
AS SELECT * FROM s3(
'https://my-bucket.s3.amazonaws.com/sales.parquet',
'Parquet'
);
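This method creates the table and loads the data in a single query, making it useful for quick analysis and experimentation. Because the column types are inferred from the file, they are typically Nullable; if ClickHouse rejects customer_id as a nullable sorting key, pass an explicit structure to s3() or define the table first as in Method 2.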
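One way to keep the single-statement form while controlling the column types is to pass the structure as a third argument to s3(). This is a sketch only: the table name sales_typed is hypothetical, and the column list assumes the Parquet file contains the four columns used in the Method 2 schema.

```sql
-- Explicit structure: columns are non-Nullable, so customer_id is a valid sorting key.
CREATE TABLE sales_typed
ENGINE = MergeTree
ORDER BY customer_id
AS SELECT *
FROM s3(
    'https://my-bucket.s3.amazonaws.com/sales.parquet',
    'Parquet',
    'customer_id UInt32, order_id UInt64, amount Float64, order_date Date'
);
```

If the file's real column names or types differ from this assumed list, the structure string has to be adjusted to match.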
Method 2: Create the Table First
Create the table schema.
CREATE TABLE sales (
customer_id UInt32,
order_id UInt64,
amount Float64,
order_date Date
) ENGINE = MergeTree
ORDER BY customer_id;
Then insert the data.
INSERT INTO sales
SELECT * FROM s3(
'https://my-bucket.s3.amazonaws.com/sales.parquet',
'Parquet'
);
This approach offers greater control over schema design and is commonly used in production environments.
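SELECT * depends on the file's columns lining up with the table definition. A slightly more defensive variant, sketched here under the assumption that the Parquet file uses the same column names as the table, lists the columns on both sides and then checks the result.

```sql
-- Name the columns explicitly so extra or reordered file columns do not matter.
INSERT INTO sales (customer_id, order_id, amount, order_date)
SELECT
    customer_id,
    order_id,
    amount,
    order_date
FROM s3(
    'https://my-bucket.s3.amazonaws.com/sales.parquet',
    'Parquet'
);

-- Quick sanity check on the loaded data.
SELECT
    count() AS loaded_rows,
    min(order_date) AS first_order,
    max(order_date) AS last_order
FROM sales;
```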
Benefits of Loading Data into ClickHouse
Importing frequently queried datasets provides several advantages:
- Improved query performance
- Better schema management
- Reduced S3 access costs
- Faster dashboard response times
- Ideal for production workloads
Supported File Formats
ClickHouse® supports reading several popular file formats directly from Amazon S3.
| Format | Typical Use Case |
|---|---|
| CSV | General-purpose data exchange |
| JSON | APIs and application data |
| Parquet | Analytics and data lakes |
| ORC | Big data processing |
| TSV | Tab-separated datasets |
Among these formats, Parquet is generally the best choice for analytical workloads because of its columnar storage format and efficient compression.
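The format is selected by the name passed to s3(), and that name is not always the same as the file extension. The sketch below shows format names that could be used for newline-delimited JSON, tab-separated files with a header row, and ORC. All three file names are hypothetical.

```sql
-- Newline-delimited JSON: one object per line.
SELECT *
FROM s3('https://my-bucket.s3.amazonaws.com/events.jsonl', 'JSONEachRow')
LIMIT 10;

-- Tab-separated values with a header row.
SELECT *
FROM s3('https://my-bucket.s3.amazonaws.com/export.tsv', 'TSVWithNames')
LIMIT 10;

-- ORC file.
SELECT *
FROM s3('https://my-bucket.s3.amazonaws.com/orders.orc', 'ORC')
LIMIT 10;
```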
Best Practices
To achieve the best performance when querying S3 data with ClickHouse®:
- Store analytical datasets in Parquet format.
- Partition data by date or business dimensions.
- Query only the required columns.
- Compress files to reduce storage costs.
- Load frequently accessed datasets into local ClickHouse tables.
- Organize S3 directories for efficient filtering.
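Several of these practices can be combined in a single query. As an illustration, assume a hypothetical layout in which logs are stored under logs/&lt;year&gt;/&lt;month&gt;/ and each Parquet file has a status column. Restricting the path pattern limits which objects are read, and naming the column limits what is read from each object.

```sql
-- Reads only January to March 2026 and only the status column.
-- The directory layout and the status column are assumptions for this sketch.
SELECT
    status,
    count() AS requests
FROM s3(
    'https://my-bucket.s3.amazonaws.com/logs/2026/{01..03}/*.parquet',
    'Parquet'
)
GROUP BY status
ORDER BY requests DESC;
```

The {01..03} range expands to 01, 02, and 03, so objects under other months are never listed or fetched.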
Common Use Cases
- Log Analytics: Analyze application and server logs stored in Amazon S3 without importing them into ClickHouse.
- Historical Reporting: Generate reports from archived datasets directly within the data lake.
- Data Warehousing: Use ClickHouse as a high-performance query engine on top of an S3-based data lake.
- Business Intelligence: Power dashboards and analytics platforms using data stored directly in Amazon S3.
Conclusion
ClickHouse® and Amazon S3 together provide a powerful solution for modern data lake analytics. By allowing users to query data directly from object storage, ClickHouse eliminates unnecessary data movement while delivering exceptional analytical performance. Whether you're analyzing logs, exploring historical business data, or building a scalable data warehouse, integrating ClickHouse® with Amazon S3 simplifies data architectures, reduces infrastructure costs, and enables fast, efficient analytics at scale.