Most ClickHouse mistakes don’t come from SQL syntax.
They come from using the wrong mental model.
ClickHouse looks like a familiar analytical SQL database, but internally it behaves very differently from traditional row-based systems. If you approach it with PostgreSQL or MySQL instincts, you’ll eventually hit confusing behavior: wrong results, strange duplicates, unexpected merges, or queries that don’t scale the way you expected.
Two internals in particular completely changed how I design tables and write queries:
- AggregatingMergeTree stores aggregation states, not final values
- argMax solves group-wise max problems the way ClickHouse expects you to
These aren’t tricks. They’re core to how the engine works.
1. AggregatingMergeTree Is About States, Not Values
One of the most misunderstood engines in ClickHouse is AggregatingMergeTree.
At first glance, it looks like a way to store pre-aggregated results. So it’s tempting to do something like this:
```sql
CREATE TABLE daily_metrics
(
    day Date,
    avg_delay Float32
)
ENGINE = AggregatingMergeTree()
ORDER BY day;
```
Then insert data like:
```sql
INSERT INTO daily_metrics
SELECT
    toDate(event_time) AS day,
    avg(arrival_delay) AS avg_delay
FROM flights
GROUP BY day;
```
And later query it with:
```sql
SELECT day, avg_delay FROM daily_metrics;
```
This looks reasonable. But conceptually, it’s wrong.
Why This Fails
AggregatingMergeTree does not store final aggregated values. It stores aggregation states - intermediate, mergeable representations of computations.
ClickHouse merges parts in the background. As new data arrives:
- Rows are appended to new parts
- Background merges combine parts asynchronously
- Aggregation states are merged during these part merges
If you store plain values like avg() or sum(), ClickHouse has nothing mergeable to work with. You’ve effectively disabled the engine’s intended behavior.
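To see concretely why final values are not mergeable, here is a small Python sketch of the arithmetic (an illustration only, not ClickHouse internals; the numbers are made up):

```python
# Two parts each hold rows for the same day.
part_a = [10.0, 20.0, 30.0]          # three flights
part_b = [100.0]                     # one flight

avg_a = sum(part_a) / len(part_a)    # 20.0
avg_b = sum(part_b) / len(part_b)    # 100.0

# Merging final averages loses the row counts, so the result is wrong:
naive_merge = (avg_a + avg_b) / 2    # 60.0

# The true average over all four rows:
true_avg = sum(part_a + part_b) / len(part_a + part_b)  # 40.0

assert naive_merge != true_avg
```

Once the counts are discarded, no merge strategy can recover the correct answer. That is exactly the information an aggregation state preserves.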
The Correct Mental Model
Think in states, not values.
Instead of storing avg(), you store avgState().
Instead of storing sum(), you store sumState().
Instead of storing uniq(), you store uniqState().
You define your table like this:
```sql
CREATE TABLE daily_metrics
(
    day Date,
    avg_delay AggregateFunction(avg, Float32)
)
ENGINE = AggregatingMergeTree()
ORDER BY day;
```
And insert data like this:
```sql
INSERT INTO daily_metrics
SELECT
    toDate(event_time) AS day,
    avgState(arrival_delay) AS avg_delay
FROM flights
GROUP BY day;
```
Now you are storing mergeable states.
When querying, you finalize them:
```sql
SELECT
    day,
    avgMerge(avg_delay) AS avg_delay
FROM daily_metrics
GROUP BY day;
```
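The state/merge lifecycle can be modeled in a few lines of Python. This is a toy sketch of the idea, not ClickHouse code: an average state is a mergeable (sum, count) pair, and the function names only mimic their SQL counterparts.

```python
def avg_state(values):
    """Like avgState(): build a partial, mergeable state."""
    return (sum(values), len(values))

def merge_states(a, b):
    """How merges combine two states for the same key."""
    return (a[0] + b[0], a[1] + b[1])

def avg_merge(state):
    """Like avgMerge(): finalize a state into a value."""
    total, count = state
    return total / count

# Two parts holding states for the same day:
s1 = avg_state([10.0, 20.0, 30.0])   # (60.0, 3)
s2 = avg_state([100.0])              # (100.0, 1)

print(avg_merge(merge_states(s1, s2)))  # 40.0
```

Because states carry the counts along, they can be merged in any order, at any time, and still finalize to the correct answer.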
Why GROUP BY Is Still Required
Another common confusion is:
“If I already aggregated before inserting, why do I still need GROUP BY when querying?”
Because merges are asynchronous.
You may have multiple parts containing states for the same day. The final aggregation only happens when you explicitly merge those states using functions like avgMerge().
GROUP BY ensures that all states for the same key are combined correctly at query time.
Once you understand that AggregatingMergeTree is designed for incremental, mergeable aggregation, everything becomes predictable:
- Materialized views make sense
- Backfills behave correctly
- Background merges are no longer mysterious
The engine isn’t broken. The mental model was.
2. Solving Group-Wise Max Correctly with argMax
Another common analytical requirement looks simple on the surface:
“For each group, return the row with the maximum value.”
For example:
“For each airline, return the flight with the worst arrival delay.”
The naive approach might look like this:
```sql
SELECT
    airline,
    flight_number,
    max(arrival_delay)
FROM flights
GROUP BY airline, flight_number;
```
This computes the maximum delay per (airline, flight_number) combination. That’s not the same as “the worst flight per airline.”
It also produces multiple rows per airline, which means extra filtering or subqueries later.
ClickHouse provides a much cleaner way.
The ClickHouse-Native Solution
```sql
SELECT
    airline,
    argMax(flight_number, arrival_delay) AS flight_number,
    max(arrival_delay) AS max_delay
FROM flights
GROUP BY airline;
```
argMax(arg, val) means:
“Return the arg value for the row where val is maximal.”
In this case:
- Track the maximum arrival_delay per airline
- Return the flight_number corresponding to that maximum
- Produce exactly one row per airline
No subqueries. No joins. No row explosion.
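The semantics are easy to model in Python. This sketch (with made-up data) shows what argMax does per group: keep the arg whose val is largest, in a single pass over the rows.

```python
# Rows of (airline, flight_number, arrival_delay); values are illustrative.
flights = [
    ("UA", "UA101", 45.0),
    ("UA", "UA202", 95.0),
    ("AA", "AA303", 12.0),
    ("AA", "AA404", 70.0),
]

best = {}  # airline -> (flight_number, arrival_delay)
for airline, flight_number, delay in flights:
    # Keep the flight_number attached to the largest delay seen so far.
    if airline not in best or delay > best[airline][1]:
        best[airline] = (flight_number, delay)

print(best)  # {'UA': ('UA202', 95.0), 'AA': ('AA404', 70.0)}
```

One pass, one row per group, and the associated value travels along with the maximum. That is the whole trick.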
Why This Matters
argMax aligns with how ClickHouse performs aggregation internally. It combines value selection and aggregation into a single pass.
Instead of thinking in terms of relational rewrites, you’re using an aggregate function that expresses the intent directly.
This is both faster and clearer.
Once you start using argMax, many complex “best per group” queries collapse into a single SELECT.
The Bigger Pattern
Both of these examples point to the same lesson:
ClickHouse is not just “Postgres, but faster.”
It is designed around:
- Immutable parts
- Background merges
- Mergeable aggregation states
- Columnar execution
- Specialized aggregate functions
If you treat it like a traditional row-based database, you’ll keep fighting it.
If you align your thinking with how it stores and merges data, entire classes of bugs and inefficiencies disappear.
Final Thoughts
Understanding AggregatingMergeTree as a state-based engine - and using functions like argMax for group-wise logic - changes how you model data in ClickHouse.
These aren’t edge cases. They show up in real production systems:
- Pre-aggregated materialized views
- Streaming ingestion pipelines
- Analytical dashboards
- Per-tenant metrics
- Best/worst reporting
ClickHouse rewards engineers who understand its internals.
Once you stop thinking in values and start thinking in states - and once you use the aggregate functions the engine was built around - query design becomes simpler and more predictable.
And performance follows naturally.