Mohamed Hussain S

Posted on Apr 1 • Edited on Jun 5

Full Text Search in ClickHouse: What Works in 2026

#clickhouse #database #backend #opensource

ClickHouse is the undisputed heavyweight champion of analytics, famed for fast aggregations, massive columnar storage, and processing trillions of rows. Historically, however, if you wanted "real" full-text search, the engineering consensus was clear: don't use ClickHouse. You had to pay the "architectural tax" of syncing your data to a dedicated engine like Elasticsearch or OpenSearch.

But as of 2026, that consensus has shifted. With the general availability of full-text search (the native text index) and tokenization functions, the question is no longer whether ClickHouse can do search, but how much of your infrastructure you can simplify by moving it into one place.

What Do We Mean by "Full-Text Search"?

Full-text search is fundamentally different from simple string filtering. In a dedicated search ecosystem, it typically requires:

Tokenization: Breaking sentences into individual, searchable words.
Inverted Indexing: A specialized data structure that maps tokens to row IDs so the engine doesn't have to scan the entire table.
Relevance Scoring: Ranking results using algorithms like BM25 so the best matches appear first.

In the past, ClickHouse only handled basic filtering. Today, it natively handles the first two items: fast tokenization and index-accelerated lookups. Relevance scoring remains out of scope.
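
To make the tokenization step concrete, here is a minimal sketch using the tokens() function, which by default splits a string at non-alphanumeric characters. The sample string is made up for illustration, and whether tokens() is available depends on your ClickHouse version.

```sql
-- Split a sample log line into the tokens a text index would store.
-- The input string is a made-up example, not real log data.
SELECT tokens('Connection timed out; retry=3') AS toks;
```

An inverted index then records, for each such token, which parts of the table contain it, so a search only has to read those parts.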

What Actually Works: The 2026 Reality

ClickHouse now provides a tiered approach to text search. Depending on your performance needs, you have two primary tools:

1. The Heavyweight: Native Inverted Indices

This is the single biggest update to the ClickHouse ecosystem. You no longer need to rely on brute-force LIKE patterns that scan every byte of data. By defining a text index (an inverted index over tokens), you let ClickHouse build a token-to-row mapping that allows it to jump directly to the relevant data blocks.

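-- Add a text (inverted) index on the message column
ALTER TABLE logs ADD INDEX text_idx(message) TYPE text(tokenizer = splitByNonAlpha) GRANULARITY 1;
-- ADD INDEX only covers newly written parts; build the index for existing data as well
ALTER TABLE logs MATERIALIZE INDEX text_idx;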

The performance impact is massive. For datasets in the billions of rows, an indexed search can be 10x to 100x faster than an unindexed LIKE scan because it narrows the search space to a few "granules" of data.
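
Assuming the logs table and the text_idx index from the statement above, a query along these lines should be able to use the index. The search token 'timeout' is a placeholder. Prefixing the query with EXPLAIN indexes = 1 is one way to check whether the index was actually applied on your version.

```sql
-- Token search that the text index can accelerate.
-- 'timeout' is a placeholder search term.
SELECT count()
FROM logs
WHERE hasToken(message, 'timeout');

-- Show which indexes the planner applied to the same query.
EXPLAIN indexes = 1
SELECT count()
FROM logs
WHERE hasToken(message, 'timeout');
```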

2. Precision Tokenization

Using functions like hasToken(), ClickHouse understands word boundaries. It knows that a search for the word "log" should not return results for "logger" or "biological." This brings a level of precision previously reserved for dedicated search engines.
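
The difference from substring matching can be seen in a small self-contained sketch. The three sample strings are invented for illustration. Note that hasToken() is case-sensitive; hasTokenCaseInsensitive() is the variant that ignores case.

```sql
-- Compare whole-token matching with substring matching on made-up samples.
SELECT
    message,
    hasToken(message, 'log') AS token_match,      -- matches the whole word only
    message LIKE '%log%'     AS substring_match   -- matches any occurrence
FROM
(
    SELECT arrayJoin([
        'rotate the log now',
        'logger started',
        'biological sample'
    ]) AS message
);
```

Only the first row should report a token match, while all three rows match the LIKE pattern.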

Where ClickHouse Excels

In the current landscape, ClickHouse is the "sweet spot" for several specific high-growth use cases:

Log Analytics & Observability: This is the primary "Elasticsearch killer." You can search billions of logs for a specific error message and, in the same query, calculate the average latency or error rate.
Architectural Simplicity: Managing a ClickHouse cluster and a search cluster is an operational nightmare. Moving both workloads to ClickHouse reduces your infrastructure footprint, simplifies your ingestion pipelines, and slashes your cloud bill.
Hybrid Queries: ClickHouse allows you to join search results with structured metadata (like user IDs or pricing tables) in the same SQL query - something that is notoriously difficult in traditional search engines.
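
As a rough sketch of the log analytics case, the query below filters by token and aggregates in one pass. The columns timestamp, latency_ms, and level are assumptions made for this example; the original post only names the logs table and its message column.

```sql
-- Search and aggregate in a single query.
-- Assumed (hypothetical) columns: timestamp, latency_ms, level.
SELECT
    toStartOfMinute(timestamp)          AS minute,
    count()                             AS matching_rows,
    avg(latency_ms)                     AS avg_latency_ms,
    countIf(level = 'ERROR') / count()  AS error_share
FROM logs
WHERE hasToken(message, 'timeout')
GROUP BY minute
ORDER BY minute;
```

In a split architecture, the filter would typically run in the search cluster and the aggregation somewhere else, which is the duplication the points above argue against.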

What Still "Doesn't Work"

Despite these massive strides, ClickHouse is not a magic bullet for every search problem. There are still areas where dedicated engines hold the lead:

Complex Linguistics: If you need deep morphological analysis (e.g., matching "mice" to "mouse" or handling complex compounding in German), dedicated engines still have more mature language plugins.
Fuzzy Matching & Auto-Correct: While ClickHouse can calculate levenshteinDistance(), it isn't yet optimized for high-concurrency "did you mean?" style suggestions found on major e-commerce sites.
Search-First Products: If you are building a consumer-facing product where search is the entire product, the fine-grained tuning of a search-first engine is still superior.
Relevance Scoring & Ranking (BM25): ClickHouse is an acceleration engine, not a relevance engine. It does not natively support TF-IDF or BM25 scoring models, nor does it store word position data for advanced phrase ranking.
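
To illustrate the fuzzy matching point, here is a minimal sketch of levenshteinDistance(), which is an alias of editDistance(). The candidate terms are invented for the example. The distance is computed row by row with no index involved, which is why this approach does not scale to high-concurrency suggestion features.

```sql
-- Rank made-up candidate terms by edit distance to the intended word.
-- This is a brute-force computation per row, not an index lookup.
SELECT
    term,
    levenshteinDistance(term, 'timeout') AS distance
FROM
(
    SELECT arrayJoin(['timeout', 'timeuot', 'time-out', 'latency']) AS term
)
ORDER BY distance;
```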

ClickHouse vs. Search Engines: The 2026 Comparison

Feature	ClickHouse (2026)	Elasticsearch / OpenSearch
Primary Strength	Analytics + Search	High-Relevance Search
Storage Cost	Very Low (Columnar)	High (Index Overhead)
Aggregation Speed	Best-in-class	Moderate
Relevance (BM25)	Not Supported	Industry Standard
Operational Effort	Low (One System for Analytics and Search)	High (Separate Search Cluster to Sync and Run)

Final Thoughts

The boundary between "Analytics" and "Search" has officially blurred.

If you are analyzing logs, building internal observability tools, or need to search across massive datasets where cost and aggregation speed matter most, ClickHouse is now a credible full-text search option for token-based filtering.

Choosing ClickHouse in 2026 means opting for a simpler architecture and better performance, as long as fast token matching rather than relevance ranking is the search capability your team needs.

Top comments (2)

Joseph Redfern • Jun 4 • Edited

This is not accurate. ClickHouse does not currently support BM25.

To quote from the blog post announcing GA of Full-Text Search:

It is important to be clear about what Full-text Search in ClickHouse is not. It is not a relevance engine and does not implement scoring models such as TF IDF or BM25, nor does it store positional information for advanced phrase ranking. It is designed to accelerate token based filtering, not to replace dedicated search engines built for rich NLP and relevance driven use cases. If you need sophisticated ranking and linguistic features, a traditional search engine may be a better fit. If you need extremely fast token and string matching over terabytes or petabytes of data, combined with real time aggregation and analytics, ClickHouse Full-text Search is purpose built for that workload.

Mohamed Hussain S • Jun 5

Thanks for the feedback! You’re totally right, I mischaracterized the GA release regarding BM25. I’ll update the post to clarify that it's an acceleration engine for token matching, not a relevance engine. Appreciate the save!