In a previous post, I introduced my TUI tool. This time, I'd like to talk about the performance optimizations behind octorus.
What Do We Mean by "Fast"?
"Fast" can mean many things. Even just for rendering, there's initial display speed, syntax highlighting speed, scroll smoothness (fps), and more.
Perceived speed and internal speed aren't always the same. No matter how much you optimize with zero-copy or caching, if the PR is massive, the API call becomes the bottleneck. And without rendering-level optimizations, the UI can freeze entirely.
In octorus, I push internal optimizations as far as possible while also applying web-app-style thinking (FCP / LCP / INP) to the TUI.
Core Concept
The fundamental approach is to asynchronously build and consume caches based on the current display state. By maintaining 5 layers of caching, the perceived initial display time approaches 0ms, while also improving fps and minimizing allocations.
Session Cache for PR Data
PR data is fetched via gh api when a PR is opened. The fetched diff and comment data are cached in memory. This cache remains valid for the entire octorus session — even when switching between PRs. Each PR is fetched only once.
The cache isn't unlimited; when the maximum entry count is reached, the oldest entries are evicted.
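In shape, it's a small bounded map with FIFO eviction. A minimal sketch with hypothetical names (`SessionCache` and `PrData` are not octorus's actual types):

```rust
use std::collections::{HashMap, VecDeque};

struct PrData { /* diff text, comments, ... */ }

// Bounded session cache: oldest PR entries are evicted first.
struct SessionCache {
    max_entries: usize,
    order: VecDeque<u64>,          // PR numbers, oldest at the front
    entries: HashMap<u64, PrData>, // PR number -> fetched diff + comments
}

impl SessionCache {
    fn insert(&mut self, pr: u64, data: PrData) {
        if self.entries.len() >= self.max_entries {
            if let Some(oldest) = self.order.pop_front() {
                self.entries.remove(&oldest);
            }
        }
        self.order.push_back(pr);
        self.entries.insert(pr, data);
    }

    fn get(&self, pr: u64) -> Option<&PrData> {
        self.entries.get(&pr) // a hit means no second `gh api` call for this PR
    }
}
```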
Background Processing for Diff Cache Construction
When a PR is opened and data arrives, diff parsing and syntax highlighting begin asynchronously in the background. By the time the user actually opens a diff, most of the work is already done — making the perceived latency effectively 0ms. This is the single biggest win for user experience.
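The pattern itself is simple: as soon as the PR payload lands, a background task starts building the cache. A minimal sketch using a plain thread and hypothetical helper types (octorus's actual task setup may differ):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

struct PrData;
struct DiffCache;

// Stand-in for the real work: diff parsing + syntax highlighting.
fn parse_and_highlight(_pr: &PrData) -> DiffCache { DiffCache }

fn on_pr_loaded(pr: PrData, slot: Arc<Mutex<Option<DiffCache>>>) {
    thread::spawn(move || {
        let built = parse_and_highlight(&pr); // heavy work starts immediately
        *slot.lock().unwrap() = Some(built);  // usually ready before the user asks
    });
}
```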
DiffCache Construction
The background processing aims to finish before the user opens a diff, but what happens when the diff is enormous — hundreds of thousands of lines — or when the language has complex syntax (like Haskell), making highlighting significantly heavier? Blocking the user until processing completes would be a terrible experience.
To solve this, octorus can display diffs in a plain (unhighlighted) state while highlighting is still in progress. Once highlighting completes, the view seamlessly transitions to the fully highlighted version. This holds up even for a 300K-line diff.
The cache exists in two locations: the active display cache (the file currently being viewed) and the prefetched standby store (pre-built in the background).
When a file is selected, the lookup proceeds in three stages. Stage 1: check whether the active cache can be reused (e.g., the user is just scrolling within the same file). Stage 2: if not, pull from the standby store (assuming prefetching finished in time). Stage 3: if neither is available, build the cache on the spot.
When switching from File A to File B, the diff_cache is replaced with File B's cache. File A's cache remains in the standby store, so switching back to File A hits Stage 2 and restores instantly.
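Putting the three stages together, the lookup might look roughly like this (hypothetical types and fields, not octorus's actual code):

```rust
use std::collections::HashMap;

struct DiffCache { path: String /* , lines, rodeo, ... */ }

fn build_diff_cache(path: &str) -> DiffCache {
    DiffCache { path: path.to_string() }
}

struct DiffCaches {
    active: Option<DiffCache>,           // the file currently on screen
    standby: HashMap<String, DiffCache>, // prefetched / previously viewed
}

impl DiffCaches {
    fn cache_for_file(&mut self, path: &str) -> &DiffCache {
        // Stage 1: still on the same file -> reuse the active cache as-is.
        if self.active.as_ref().map_or(false, |c| c.path == path) {
            return self.active.as_ref().unwrap();
        }
        // Park the outgoing file in the standby store, so switching back
        // to it later hits Stage 2 instead of a rebuild.
        if let Some(prev) = self.active.take() {
            self.standby.insert(prev.path.clone(), prev);
        }
        // Stage 2: pull from standby if prefetching finished in time;
        // Stage 3: otherwise build on the spot.
        let cache = self
            .standby
            .remove(path)
            .unwrap_or_else(|| build_diff_cache(path));
        self.active.insert(cache)
    }
}
```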
This cache is scoped per PR. Unlike the session-level API cache, it's discarded when switching PRs. Since octorus also supports opening a PR directly by number, this design keeps the overall behavior consistent — diff_cache is bound to a single PR's lifetime.
Efficient Highlighting via CST + Semantic Boundary-Level Interning
So far I've covered caching of diff data itself. Now let's talk about optimizing the highlighting process.
Each line in DiffCache is stored as a sequence of styled Spans. If each Span naively held a String, every occurrence of the same token would trigger a separate allocation. To avoid this, I adopted lasso::Rodeo, a string interner.
An interner returns the same reference for identical strings. So even if let appears hundreds of times, only one copy exists in memory.
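With lasso, the basic mechanics look like this (a minimal illustration, not octorus's actual code):

```rust
use lasso::Rodeo;

fn main() {
    let mut rodeo = Rodeo::default();

    // Interning the same token twice yields the same 4-byte key (Spur).
    let first = rodeo.get_or_intern("let");
    let second = rodeo.get_or_intern("let");
    assert_eq!(first, second);

    // Only one copy of "let" exists in memory; keys resolve back to it.
    assert_eq!(rodeo.resolve(&first), "let");
}
```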
A typical String takes 24 bytes (pointer + length + capacity) plus ~8 bytes for the highlight style — about 32 bytes total. A lasso::Rodeo reference (Spur) is just 4 bytes.
This reduces not only per-Span size but also eliminates duplication. For a 1,000-line diff where let appears 200 times:
| | String | Rodeo |
|---|---|---|
| Reference × 200 | 24 B × 200 = 4,800 B | 4 B × 200 = 800 B |
| String body | 3 B × 200 = 600 B | 3 B × 1 = 3 B |
| Total | 5,400 B | 803 B |
However, the effectiveness of interning depends heavily on granularity. Interning entire lines yields near-zero deduplication; interning individual characters makes the management overhead dominate.
The key insight is to reuse tree-sitter captures as the interning boundary. octorus parses source code extracted from diffs using tree-sitter (the same engine used in Zed, Helix, and other editors).
tree-sitter parses source code into a CST (Concrete Syntax Tree) and returns captures like @keyword, @function.call, etc. These correspond precisely to semantic units of programming languages (fn, let, {, ...) — making them an ideal granularity for interning.
In other words, tree-sitter provides both the style information for highlighting and the optimal split boundaries for interning.
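Conceptually, the interning pass walks the query captures and interns each capture's text. A sketch (signatures follow tree-sitter 0.22-era APIs; 0.23+ changed `matches()` to a streaming iterator):

```rust
use lasso::Rodeo;
use tree_sitter::{Parser, Query, QueryCursor};

// Sketch only: each capture's byte range becomes one interned token.
fn intern_captures(source: &str, parser: &mut Parser, query: &Query, rodeo: &mut Rodeo) {
    let tree = parser.parse(source, None).expect("parse failed");
    let mut cursor = QueryCursor::new();
    for m in cursor.matches(query, tree.root_node(), source.as_bytes()) {
        for cap in m.captures {
            // A capture corresponds to one semantic unit (fn, let, {, ...),
            // so identical tokens collapse into a single interned string.
            let text = &source[cap.node.byte_range()];
            let _spur = rodeo.get_or_intern(text);
        }
    }
}
```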
During initial cache construction, a Rodeo for plain diff is initialized, and tokens like + and - are interned with their fixed colors. This is what enables the "display plain first" behavior mentioned earlier. Meanwhile, a highlighted Rodeo is built in the background.
Since Rodeo internally uses an arena allocator, there's no need for individual drops — freeing the arena frees all interned strings at once. Furthermore, the Rodeo is moved into DiffCache, so it's bound to the DiffCache's lifetime. When the cache is dropped, all interning data is cleanly released. The fact that tree-sitter parse/query frequency equals Rodeo cache construction frequency is another nice alignment.
Resolving Overlapping Captures
tree-sitter captures are returned per syntax tree node, so parent and child nodes can overlap in range. For example, in #[derive(Debug, Clone)]:
- @attribute covers the entire range [0..23)
- @constructor individually captures Debug [9..14) and Clone [16..21)
Naively processing from the start, @attribute's style would advance the cursor to position 23, and the inner @constructor captures would be missed.
```
[0..23)  @attribute            "#[derive(Debug, Clone)]"  ← style applied to entire range
[1..2)   @punctuation.bracket  "["                        ← nested
[8..9)   @punctuation.bracket  "("                        ← nested
[9..14)  @constructor          "Debug"                    ← nested
[16..21) @constructor          "Clone"                    ← nested
[21..22) @punctuation.bracket  ")"                        ← nested
[22..23) @punctuation.bracket  "]"                        ← nested
```
The solution: generate independent start/end events for each capture, sort them by position, and sweep left-to-right in a single pass. Active captures are managed on a stack, so the innermost (most specific) style always takes priority.
```
(0,  start, @attribute)
(9,  start, @constructor)   ← takes priority
(14, end,   @constructor)
                            ↓ falls back to @attribute
(16, start, @constructor)   ← takes priority again
(21, end,   @constructor)
(23, end,   @attribute)
```
The time complexity is O(m log m) where m is the number of captures — independent of line length. For minified JS with extremely long lines, this scales only with capture count, not byte length. A naive byte-map approach would require O(n) memory and traversal for line length n, so the gap widens with longer lines.
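A minimal sketch of the sweep itself (hypothetical types; octorus's real version also carries style information alongside each capture):

```rust
#[derive(Clone, Copy, PartialEq)]
struct Capture {
    start: usize,
    end: usize,
    id: u32, // which capture kind, e.g. @attribute, @constructor
}

// For each boundary position, report which capture is innermost (if any).
fn innermost_at_boundaries(captures: &[Capture]) -> Vec<(usize, Option<u32>)> {
    // 0 = end event, 1 = start event: ends sort before starts at equal
    // positions, which is what half-open ranges [a..b) require.
    let mut events: Vec<(usize, u8, Capture)> = Vec::with_capacity(captures.len() * 2);
    for &c in captures {
        events.push((c.start, 1, c));
        events.push((c.end, 0, c));
    }
    events.sort_by_key(|&(pos, kind, _)| (pos, kind)); // O(m log m)

    let mut stack: Vec<Capture> = Vec::new(); // active captures, innermost on top
    let mut out = Vec::new();
    for (pos, kind, cap) in events {
        if kind == 1 {
            stack.push(cap); // a more specific (nested) capture takes priority
        } else {
            stack.retain(|&c| c != cap); // closing: fall back to the parent
        }
        out.push((pos, stack.last().map(|c| c.id)));
    }
    out
}
```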
Parser and Query Caching
Some languages don't map 1:1 from file extension to a single parser/highlight query. Vue and Svelte are prime examples — their Single File Components combine HTML, JS, and CSS in one file.
This means highlighting a single file requires initializing 3 parsers/queries. If a PR contains 50 .vue or .svelte files, that's 150 initializations.
To solve this, once a parser/query is created, it's stored in a ParserPool cache shared across all files. No matter how many files there are, only 3 initializations are needed. Given that some query compilations involve nearly 100KB of data, this is a non-trivial optimization.
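A sketch of the pool idea (hypothetical names; signatures follow tree-sitter 0.22-era APIs, where set_language and Query::new take &Language):

```rust
use std::collections::HashMap;
use tree_sitter::{Language, Parser, Query};

// Hypothetical pool: one parser/query pair per language, shared across files.
struct ParserPool {
    entries: HashMap<&'static str, (Parser, Query)>,
}

impl ParserPool {
    fn get_or_init(
        &mut self,
        lang_name: &'static str, // e.g. "html", "javascript", "css"
        language: &Language,
        highlights_scm: &str,    // the language's highlight query source
    ) -> &mut (Parser, Query) {
        self.entries.entry(lang_name).or_insert_with(|| {
            // Query compilation is the expensive part; it runs once per
            // language for the whole session, not once per file.
            let mut parser = Parser::new();
            parser.set_language(language).expect("language version mismatch");
            let query = Query::new(language, highlights_scm).expect("bad query");
            (parser, query)
        })
    }
}
```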
Other Optimizations
Beyond the multi-layer cache, several smaller optimizations contribute to the overall experience.
Viewport-Restricted Rendering
octorus uses the ratatui crate for TUI rendering.
Rather than rendering all lines, only the visible range is sliced and passed to ratatui. Pre-rendering transformations (Span → Line conversion) and Rodeo string lookups are also limited to this range. Simple, but more directly impactful on perceived performance than something like ParserPool.
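A sketch of the idea (hypothetical types; octorus's real conversion resolves Rodeo keys into styled spans):

```rust
use ratatui::{layout::Rect, text::Line, widgets::Paragraph, Frame};

// Stand-in for one cached diff row.
struct CachedRow;
impl CachedRow {
    // Imagine this resolving interned strings and building styled spans.
    fn to_line(&self) -> Line<'static> {
        Line::from("…")
    }
}

// Only rows inside the viewport are converted and handed to ratatui;
// everything above and below the visible window is never touched.
fn render_diff(frame: &mut Frame, area: Rect, rows: &[CachedRow], scroll: usize) {
    let end = (scroll + area.height as usize).min(rows.len());
    let visible: Vec<Line> = rows[scroll..end].iter().map(|r| r.to_line()).collect();
    frame.render_widget(Paragraph::new(visible), area);
}
```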
Lazy Composition of Comment Markers
Comment data is intentionally excluded from the cache. In octorus, comments are fetched after the diff data, so they're composed at render time via iterator composition.
As a result, comment markers appear slightly after the diff viewer opens (noticeable on very large diffs).
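A sketch of what render-time composition can look like (hypothetical shape, not octorus's actual code):

```rust
use std::collections::HashSet;

use ratatui::style::{Color, Style};
use ratatui::text::{Line, Span};

// Markers are overlaid on the cached lines lazily, instead of being baked
// into the cache, so the diff can render before comment data has arrived.
fn with_comment_markers<'a>(
    diff_lines: &'a [Line<'a>],
    commented_rows: &'a HashSet<usize>,
) -> impl Iterator<Item = Line<'a>> + 'a {
    diff_lines.iter().enumerate().map(move |(row, line)| {
        if commented_rows.contains(&row) {
            // Prepend a marker span without mutating the cached line.
            let mut spans = vec![Span::styled("● ", Style::default().fg(Color::Yellow))];
            spans.extend(line.spans.iter().cloned());
            Line::from(spans)
        } else {
            line.clone()
        }
    })
}
```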
No Moves Between Cache Construction and Rendering
The Rodeo is moved into DiffCache during cache construction, but after that, everything through rendering is purely borrowed. As mentioned earlier, since the Rodeo is owned by DiffCache, dropping the cache drops all interning data — guaranteeing no lifetime leaks across the entire pipeline. This is less of an "optimization" and more of a strength of Rust's ownership system.
Closing Thoughts
Because octorus is written in Rust, I've been able to introduce optimizations incrementally. None of the techniques described here were introduced all at once; they were spread across dozens of PRs. The ability to start with a naive implementation for correctness and layer in zero-copy and multi-stage caching later speaks to how well Rust scales as a codebase grows.
Beyond raw speed, octorus also features AI-Rally, a powerful AI-assisted review capability. Give it a try!
