Monitoring ClickHouse Inserts with Datadog Agent

Inserts in ClickHouse are easy to get wrong: frequent small inserts create many small parts and can trigger the notorious "too-many-parts" error. In this post, we discuss how to leverage the Datadog Agent together with the Datadog-ClickHouse integration to improve monitoring of insert operations.

Why

The existing integration provides many metrics, but they lack table-level granularity. For example, you cannot filter metrics by table because there is no table tag.

clickhouse.query.insert.count{*} by {table} # This won't work

(Screenshot: list of available tags)

Fortunately, the integration supports custom SQL queries, which Datadog can ingest as additional metrics. Since these are emitted through the Datadog-ClickHouse integration, they are treated as integration metrics and won't be charged as custom metrics.

We can leverage this feature to create custom SQL queries, emit them as extra metrics, and use them for visualization and alerting in Datadog.

SQL Queries

We can use these queries to monitor insert performance.

Initial Part Size

Tracks the creation of new parts after inserts.

SELECT
  query_id,
  part_type,
  rows,
  size_in_bytes,
  bytes_uncompressed
FROM system.part_log
WHERE
  database = 'my-db-name'
  AND table = 'my-table-name'
  AND event_type = 'NewPart'
  AND event_time > (now() - toIntervalSecond(60))

Example output

| query_id        | part_type | rows      | size_in_bytes  | bytes_uncompressed |
|-----------------|-----------|-----------|----------------|--------------------|
| 20260214_001    | Wide      | 1000000   | 524288000      | 1048576000         |
| 20260214_002    | Wide      | 2500000   | 1310720000     | 2621440000         |
| 20260214_003    | Compact   | 500000    | 104857600      | 209715200          |
| 20260214_004    | Wide      | 3000000   | 1572864000     | 3145728000         |
  • rows (Rows per insert)

    Measures how many rows were inserted in each part.

    Helps evaluate batch efficiency: very small batches → more overhead, slower throughput.

  • size_in_bytes (Compressed part size)

    ClickHouse splits data into parts.

    Larger parts compress better, but frequent small inserts create many parts and can lead to the notorious "too-many-parts" error.

  • bytes_uncompressed (Uncompressed part size)

    Used to calculate the compression ratio (bytes_uncompressed / size_in_bytes); see the query sketch after this list.

    Helps monitor storage efficiency and disk I/O impact, especially for large inserts.
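
To sanity-check the ratio before wiring it into Datadog, you can compute it directly in ClickHouse. A minimal sketch, reusing the placeholder names and filters from the query above:

SELECT
  query_id,
  rows,
  bytes_uncompressed / size_in_bytes AS compression_ratio
FROM system.part_log
WHERE
  database = 'my-db-name'
  AND table = 'my-table-name'
  AND event_type = 'NewPart'
  AND event_time > (now() - toIntervalSecond(60))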

Insert queries

Tracks insert execution performance for completed queries.

SELECT
  query_id,
  arrayJoin(tables) AS table,
  query_duration_ms,
  read_rows,
  read_bytes,
  written_rows,
  written_bytes,
  memory_usage
FROM
  system.query_log
WHERE
  type = 'QueryFinish'
  AND query_kind = 'Insert'
  -- entries in system.query_log.tables are database-qualified
  AND has(tables, 'my-db-name.my-table-name')
  AND has(databases, 'my-db-name')
  AND event_time > (now() - toIntervalSecond(60))

Example output

| query_id          | table          | query_duration_ms | read_rows | read_bytes | written_rows | written_bytes | memory_usage |
|-------------------|----------------|-------------------|-----------|------------|--------------|---------------|--------------|
| 20260214_101523_1 | my-table-name  | 120               | 0         | 0          | 500000       | 157286400     | 10485760     |
| 20260214_101524_2 | my-table-name  | 115               | 0         | 0          | 600000       | 188743680     | 12582912     |
| 20260214_101525_3 | my-table-name  | 130               | 0         | 0          | 450000       | 141557760     | 9437184      |
  • query_duration_ms

    Time taken for each insert query.

    Essential to detect slow inserts, which can indicate bottlenecks.

  • written_rows / written_bytes

    Actual data inserted per query.

    Use this with query_duration_ms to calculate insert throughput (rows/sec or bytes/sec); a sketch follows this list.

  • read_rows / read_bytes

    Shows any reads performed by the insert (e.g., an INSERT ... SELECT or materialized-view processing).

    Important for understanding I/O impact on the system during large inserts.

  • memory_usage

    Memory consumed by the insert query.

    Helps detect inserts that could cause OOM or memory spikes, especially for huge batches.
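
To compute throughput directly in ClickHouse, here is a minimal sketch reusing the filters from the query above (placeholder names; the query_duration_ms > 0 guard avoids division by zero for sub-millisecond inserts):

SELECT
  query_id,
  written_rows / (query_duration_ms / 1000) AS rows_per_sec,
  written_bytes / (query_duration_ms / 1000) AS bytes_per_sec
FROM system.query_log
WHERE
  type = 'QueryFinish'
  AND query_kind = 'Insert'
  AND has(tables, 'my-db-name.my-table-name')
  AND has(databases, 'my-db-name')
  AND event_time > (now() - toIntervalSecond(60))
  AND query_duration_ms > 0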

Datadog Config

This configuration enables the Datadog Agent to monitor ClickHouse inserts by querying system tables and emitting metrics.

It tracks new parts (rows, size_in_bytes, bytes_uncompressed) to monitor batching and compression, and insert query performance (query_duration_ms, read_rows, written_rows, memory_usage) to monitor throughput and resource usage.

Metrics are collected at regular intervals, allowing real-time monitoring and alerts.

The custom_queries section is where you define the SQL queries that emit metrics for both initial parts and insert query performance. All the metrics discussed later are generated based on these custom queries.

Config File: /etc/datadog-agent/conf.d/clickhouse.d/conf.yaml

After editing the file, restart the Datadog Agent for the changes to take effect.


init_config:
instances:
  - server: localhost
    username: datadog
    password: password
    custom_queries:
        # Query to monitor initial parts
        - query: |
            SELECT
              query_id,
              part_type,
              rows,
              size_in_bytes,
              bytes_uncompressed
            FROM system.part_log
            WHERE
              database = 'my-db-name'
              AND table = 'my-table-name'
              AND event_type = 'NewPart'
              AND event_time > (now() - toIntervalSecond(60))
          columns:
            - name: query_id
              type: tag
            - name: part_type
              type: tag
            - name: rows
              type: gauge
            - name: size_in_bytes
              type: gauge
            - name: bytes_uncompressed
              type: gauge
          collection_interval: 60
          # Hardcoded way to inject tags
          tags:
            - "table:my-table-name"
          metric_prefix: clickhouse.my_custom_queries.insert.initial_parts
        # Query to monitor insert queries
        - query: |
            SELECT
              query_id,
              arrayJoin(tables) AS table, -- expand array to one row per table
              query_duration_ms,
              read_rows,
              read_bytes,
              written_rows,
              written_bytes,
              memory_usage
            FROM
              system.query_log
            WHERE
              type = 'QueryFinish'
              AND query_kind = 'Insert'
              AND has(tables, 'my-db-name.my-table-name') -- entries are database-qualified
              AND has(databases, 'my-db-name')
              AND event_time > (now() - toIntervalSecond(60))
          columns:
            - name: query_id
              type: tag
            # We inject the table name as a tag using the result from the SQL query
            - name: table
              type: tag
            - name: query_duration_ms
              type: gauge
            - name: read_rows
              type: gauge
            - name: read_bytes
              type: gauge
            - name: written_rows
              type: gauge
            - name: written_bytes
              type: gauge
            - name: memory_usage
              type: gauge
          collection_interval: 60
          metric_prefix: clickhouse.my_custom_queries.insert.queries

Metrics Generated

These are the metrics generated by the Datadog Agent based on the custom_queries configuration.

From the initial parts query

We get 3 metrics:

  • clickhouse.my_custom_queries.insert.initial_parts.rows
  • clickhouse.my_custom_queries.insert.initial_parts.size_in_bytes
  • clickhouse.my_custom_queries.insert.initial_parts.bytes_uncompressed

Each of these metrics will have the tags query_id, part_type, and table.

(Screenshot: tags on the new initial-parts metrics)
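
These tags let you slice dashboard queries; for example, to compare batch sizes across part types (metric and tag names as configured above):

avg:clickhouse.my_custom_queries.insert.initial_parts.rows{table:my-table-name} by {part_type}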

From the insert queries query

We get 6 metrics:

  • clickhouse.my_custom_queries.insert.queries.query_duration_ms
  • clickhouse.my_custom_queries.insert.queries.read_rows
  • clickhouse.my_custom_queries.insert.queries.read_bytes
  • clickhouse.my_custom_queries.insert.queries.written_rows
  • clickhouse.my_custom_queries.insert.queries.written_bytes
  • clickhouse.my_custom_queries.insert.queries.memory_usage

Each of these metrics will have the tags query_id and table.

(Screenshot: example tags on the new metrics)

These metrics help monitor insert performance, including query duration, resource usage, and data throughput.

Using the metrics in Datadog

Once the metrics are emitted, they can be visualized and monitored for insert performance.

Number of insert queries

Track how many insert queries are executed per table over time using:

count_nonzero(sum:clickhouse.my_custom_queries.insert.queries.written_rows{table:default.nyc_taxi, db:default} by {query_id})

(Screenshot: number of insert queries)

Use the count_nonzero function to visualize how many distinct insert queries ran in each interval.

Number of rows in initial parts

Monitor how many rows each initial part contains:

max:clickhouse.my_custom_queries.insert.initial_parts.rows{table:nyc_taxi} by {query_id}

(Screenshot: number of rows in each initial part)

Monitoring compression ratio

Compression ratio helps evaluate storage efficiency for inserts.

Metric A

max:clickhouse.my_custom_queries.insert.initial_parts.bytes_uncompressed{table:nyc_taxi} by {query_id}

Metric B

max:clickhouse.my_custom_queries.insert.initial_parts.size_in_bytes{table:nyc_taxi} by {query_id}

Formula

Compression Ratio: A / B
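
In a Datadog timeseries widget this maps to two queries plus a formula (a and b are the widget's default query aliases):

a: max:clickhouse.my_custom_queries.insert.initial_parts.bytes_uncompressed{table:nyc_taxi} by {query_id}
b: max:clickhouse.my_custom_queries.insert.initial_parts.size_in_bytes{table:nyc_taxi} by {query_id}
formula: a / b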

(Screenshot: uncompressed size, compressed size, and compression ratio)
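
The same pattern extends to the remaining metrics. For example, a monitor on insert latency (the threshold is hypothetical and depends on your workload) could watch:

max:clickhouse.my_custom_queries.insert.queries.query_duration_ms{table:default.nyc_taxi} by {query_id}

and alert when it stays above an acceptable insert duration.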

Conclusion

Using the Datadog Agent with custom SQL queries, you can gain table-level visibility into ClickHouse inserts and monitor batching efficiency, part sizes, compression ratios, and query performance in real time.

This approach is also easily extensible. You can define additional custom queries to monitor specific SQL statements.

For example, suppose your application serves an API request like:

GET /userActionsCount?user-id=10

which translates to a ClickHouse query such as:

SELECT user_id, COUNT(*) AS total_actions
FROM user_activity
WHERE user_id = 10
  AND event_date >= today() - 7
GROUP BY user_id

You can track its performance by querying the system.query_log table for that specific query pattern and sending the results to Datadog as metrics:

- query: |
    SELECT
      query_id,
      query_duration_ms,
      read_rows,
      read_bytes,
      memory_usage
    FROM system.query_log
    WHERE
      type = 'QueryFinish'
      -- note: LIKE matches the raw query text, so the pattern must match how the client formats the query
      AND query LIKE 'SELECT user_id, COUNT(*) AS total_actions FROM user_activity WHERE user_id = % AND event_date >= today() - 7 GROUP BY user_id'
      AND event_time > (now() - toIntervalSecond(60))
  columns:
    - name: query_id
      type: tag
    - name: query_duration_ms
      type: gauge
    - name: read_rows
      type: gauge
    - name: read_bytes
      type: gauge
    - name: memory_usage
      type: gauge
  collection_interval: 60
  tags:
    - api:user_actions_count  # all metrics will have this tag
  metric_prefix: clickhouse.my_custom_queries.application_queries

This lets you monitor the performance of API-generated queries, detect bottlenecks, and set up alerts, giving full observability over both inserts and application-level queries.
