Software Jutsu

Token Pruning and Prompt Compression in Modern AI

In the landscape of Large Language Models (LLMs), "context is king," but context is also expensive.
As we move toward models with million-token windows, the efficiency of how we fill that space has become a critical engineering hurdle.

Token pruning and prompt compression have emerged as the primary solutions to the "bloated prompt" problem.


What is Token Pruning?

Token pruning is the process of identifying and removing redundant, less informative, or "noisy" tokens from a prompt before or during the inference phase.

Unlike simple summarization, which rewrites text to be shorter, pruning often involves algorithmic selection—deciding which specific tokens the model's attention mechanism can afford to ignore without losing the semantic "gist" of the message.

How it works

  1. Importance Scoring: The system assigns a "budget" to the prompt. It uses a smaller, faster model (like a BERT-base or a lightweight GPT) to calculate the perplexity or attention weight of each token.
  2. Filtering: Tokens with low importance scores (e.g., "the," "actually," redundant adjectives, or repetitive background info) are discarded.
  3. Re-alignment: The remaining tokens are concatenated into a condensed string that is then passed to the primary, high-performance LLM (like GPT-4o or Gemini 1.5 Pro).
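
A minimal sketch of steps 1 and 2, assuming GPT-2 as the small scoring model and per-token negative log-likelihood (a perplexity-style signal) as the importance score; production tools such as LLMLingua use more refined, query-aware scoring, but the shape of the pipeline is the same.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def prune_prompt(text: str, keep_ratio: float = 0.6) -> str:
    ids = tokenizer(text, return_tensors="pt").input_ids            # [1, seq_len]
    with torch.no_grad():
        logits = model(ids).logits                                  # [1, seq_len, vocab]
    # Importance score: negative log-likelihood of each token under the small model
    # (surprising tokens carry more information than predictable "glue" words).
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    nll = -log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)[0]
    keep = max(1, int(nll.numel() * keep_ratio))
    top = torch.topk(nll, keep).indices.sort().values + 1           # +1 maps scores back to ids
    kept = torch.cat([ids[0, :1], ids[0, top]])                     # always keep the first token
    return tokenizer.decode(kept)

print(prune_prompt("The system administrator reported that the primary database "
                   "server experienced a critical hardware failure at 14:02:45 UTC."))
```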

Usage Scenarios

1. The RAG "Needle in a Haystack"

In Retrieval-Augmented Generation (RAG), a system might pull 10 different PDFs to answer one question. This creates a massive prompt where 90% of the text is irrelevant.

Pruning allows the model to focus its "attention heads" specifically on the relevant paragraphs.

It mitigates the "Lost in the Middle" phenomenon, where models perform worse when key information is buried in the center of a long prompt.
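
Below is a hedged sketch of this idea at the chunk level, assuming sentence-transformers for relevance scoring; the model name and the top_k value are illustrative, not prescriptive.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")        # small, fast bi-encoder

def prune_chunks(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    # Keep only the chunks most similar to the query so the expensive model
    # never sees the irrelevant bulk of the retrieved text.
    q_emb = encoder.encode(query, convert_to_tensor=True)
    c_emb = encoder.encode(chunks, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, c_emb)[0]               # one similarity score per chunk
    top = scores.topk(min(top_k, len(chunks))).indices.tolist()
    return [chunks[i] for i in sorted(top)]              # preserve original document order
```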

2. Infinite-Turn Conversations

Customer service bots or roleplay AI often accumulate massive chat histories. If you don't prune, the cost per message increases linearly until the model hits its limit and starts "forgetting" the beginning of the interaction.

By pruning 30-50% of the middle-history tokens, you can maintain the factual consistency of a conversation for thousands of turns at a roughly constant per-message cost.
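
A minimal sketch of that middle-history pruning, assuming a crude word-count token estimate; a real implementation would use the provider's token counter instead.

```python
def prune_history(messages: list[dict], budget: int = 4000,
                  keep_head: int = 2, keep_tail: int = 6) -> list[dict]:
    """Keep the oldest and newest messages verbatim; drop middle turns oldest-first
    until the estimated token count fits the budget."""
    def est_tokens(msg: dict) -> int:
        return int(len(msg["content"].split()) * 1.3)    # rough words-to-tokens ratio

    if len(messages) <= keep_head + keep_tail or \
       sum(map(est_tokens, messages)) <= budget:
        return messages                                  # already within budget

    head, middle, tail = (messages[:keep_head],
                          messages[keep_head:-keep_tail],
                          messages[-keep_tail:])
    pruned, used = head + tail, sum(map(est_tokens, head + tail))
    # Re-admit middle messages newest-first while the budget allows.
    for msg in reversed(middle):
        cost = est_tokens(msg)
        if used + cost > budget:
            break
        pruned.insert(keep_head, msg)
        used += cost
    return pruned
```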


Scenario Example: Legal Document Analysis

The Setup:
A law firm needs to use an LLM to find conflicting clauses across five different 50-page contracts (approx. 60,000 tokens).

Without Token Pruning:
Cost: Processing 60,000 tokens per query costs roughly $0.15 - $0.60 (depending on the model).
Latency: The model takes 30-40 seconds to process the "prefill" (reading the prompt).
Risk: The model might hallucinate or miss a conflict because the relevant clauses are buried under 40 pages of standard "boilerplate" definitions.

With Token Pruning (using a tool like LLMLingua):
Preprocessing: A small model scans the 60,000 tokens and identifies that 45,000 tokens are "filler" that appears in every contract.
Compression: The prompt is compressed by 4x, leaving only 15,000 tokens containing the unique, substantive clauses.
Result:
  • Cost drops by 75%.
  • Latency improves significantly as the "prefill" time is slashed.
  • Accuracy increases because the LLM's attention is focused solely on the unique variables of the contracts.
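
A sketch of that preprocessing step with LLMLingua is shown below; the constructor defaults, the question argument, the target_token value, and the contracts.txt file are assumptions based on the project's README rather than a drop-in recipe.

```python
from llmlingua import PromptCompressor

contract_text = open("contracts.txt").read()     # hypothetical file: the ~60,000-token bundle

compressor = PromptCompressor()                  # downloads a small scoring model on first use
result = compressor.compress_prompt(
    contract_text,
    question="Which clauses conflict across these contracts?",
    target_token=15000,                          # aim for roughly 4x compression
)
compressed_prompt = result["compressed_prompt"]  # send this to the expensive model
print(result["origin_tokens"], "->", result["compressed_tokens"])
```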


Optimizing Gemini with Token Pruning

An LLM such as Gemini does not prune tokens automatically. When you use one in your projects, you should implement one of the following strategies in your application layer; the examples here use Gemini.

Strategy A: The "Semantic Filter"

Use a lightweight tool like LLMLingua or a smaller model (like Gemini 1.5 Flash) to "compress" your prompt before sending it to the more expensive Gemini 1.5 Pro.

Implementation Steps:

  1. Scoring: Calculate the "importance" of each sentence or paragraph relative to the user's query.
  2. Pruning: Remove segments with low importance scores.
  3. Inference: Send the compressed prompt to the Gemini API.
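
One hedged way to wire this up with the google-generativeai SDK: score paragraphs against the query with Gemini embeddings, drop the low scorers, then send the survivors to Gemini 1.5 Pro. The embedding model name, the 0.5 keep ratio, and the placeholder paragraphs are illustrative assumptions.

```python
import os
import numpy as np
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

def embed(texts: list[str]) -> np.ndarray:
    # Batch-embed with a small Gemini embedding model (model name is an assumption).
    resp = genai.embed_content(model="models/text-embedding-004", content=texts)
    return np.array(resp["embedding"])

def semantic_filter(query: str, paragraphs: list[str], keep_ratio: float = 0.5) -> str:
    q, p = embed([query])[0], embed(paragraphs)
    scores = p @ q / (np.linalg.norm(p, axis=1) * np.linalg.norm(q))   # cosine similarity
    keep = max(1, int(len(paragraphs) * keep_ratio))
    top = sorted(np.argsort(scores)[-keep:])         # best paragraphs, original order
    return "\n\n".join(paragraphs[i] for i in top)

query = "Which clauses conflict across these contracts?"
paragraphs = ["...contract paragraph 1...", "...contract paragraph 2..."]   # placeholders
pro = genai.GenerativeModel("gemini-1.5-pro")
answer = pro.generate_content(f"{semantic_filter(query, paragraphs)}\n\nQuestion: {query}")
print(answer.text)
```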

Strategy B: Hard Truncation & Summarization

For chat history, don't just send the last 100 messages.
Pruning Logic: Keep the last 5 messages in full. For messages 6-50, provide a one-sentence summary of each exchange rather than the full transcript.
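
A minimal sketch of that compaction, assuming Gemini 1.5 Flash as the cheap summarizer (one call per old message here; you would likely batch these in practice):

```python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
flash = genai.GenerativeModel("gemini-1.5-flash")

def compact_history(messages: list[dict], keep_full: int = 5) -> list[dict]:
    """Keep the newest messages verbatim; replace each older one with a one-line summary."""
    recent, older = messages[-keep_full:], messages[:-keep_full]
    compacted = []
    for msg in older:
        resp = flash.generate_content(
            "Summarize this chat message in one sentence:\n\n" + msg["content"])
        compacted.append({"role": msg["role"], "content": resp.text.strip()})
    return compacted + recent
```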

Tip: Use the count_tokens method in the Gemini SDK before sending a request to monitor how much your pruning is actually saving you.
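
For example (assuming the google-generativeai Python SDK and an API key in the environment):

```python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

original = "...the full, unpruned prompt..."
pruned = "...the compressed prompt..."
print("before:", model.count_tokens(original).total_tokens)
print("after: ", model.count_tokens(pruned).total_tokens)
```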

Pruning vs. Summarization

Pruning is different from summarizing.

| Feature | Token Pruning / Compression | Standard Summarization |
| --- | --- | --- |
| Logic | Removes low-entropy tokens algorithmically. | Rewrites text into a shorter version. |
| Speed | Very fast (minimal compute). | Slower (requires an LLM pass). |
| Integrity | Preserves original phrasing of key info. | May lose specific technical terminology. |
| Primary Goal | Cost and latency reduction. | Human readability. |

Example

1. The Original Prompt (Input)

"The system administrator reported that the primary database server, located in the Northern Virginia (us-east-1) region, experienced a critical hardware failure at exactly 14:02:45 UTC. The error log specifically identified a 'TimeoutException' on the NVMe storage controller (Serial: 99-XJ-22). We need to determine if this is a recurring issue across the cluster."


2. The Pruned Version (Extractive)

Logic: Removes "glue" words and low-information adjectives while keeping high-entropy identifiers.

"administrator reported primary database server us-east-1 experienced critical hardware failure 14:02:45 UTC. error log identified 'TimeoutException' NVMe storage controller Serial: 99-XJ-22. determine recurring issue cluster."

  • Tokens Saved: ~40% reduction.
  • Human Readability: Poor (looks like a telegram).
  • The Gemini model still sees the exact timestamp, the exact error code, the exact serial number, and the exact region. It has all the "signal" it needs to solve the problem.
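
A toy illustration of that extractive logic, assuming a hand-picked set of "glue" words; real pruners score tokens statistically instead of using a fixed list, but the effect on the text is similar.

```python
GLUE = {"the", "that", "a", "an", "we", "to", "is", "this", "at", "on", "in",
        "if", "need", "exactly", "specifically", "located", "across"}

def extractive_prune(text: str) -> str:
    # Drop glue words; identifiers, numbers, and error codes pass through untouched.
    return " ".join(w for w in text.split() if w.lower().strip(".,()'\"") not in GLUE)
```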

3. The Summarized Version (Generative)

Logic: Rewrites the meaning into a new, shorter sentence.

"The admin noted a hardware crash in the Virginia region at 2:02 PM due to a storage error. We need to check if other servers are affected."

  • Tokens Saved: ~60% reduction.
  • Human Readability: Excellent.
  • Precision Loss: The exact timestamp (14:02:45) is gone.
  • Identifier Loss: The specific Serial Number (99-XJ-22) is gone.
  • Technical Loss: "TimeoutException" was replaced with "storage error," which is too vague for troubleshooting.

If you need the model to have 100% of the information but still want it to be cheaper and faster, you can explore Context Caching instead.
