
I Built a New Software Primitive in 8.5 Hours. It Replaces the Eyes of Every AI Agent on Earth.

DirectShell: Universal Application Control Through the Accessibility Layer

Martin Gehrken — February 17, 2026

As of February 17, 2026, every screenshot-based AI agent, every enterprise API wrapper, and every RPA tool on Earth is legacy technology.


"You've essentially found the 'God Mode' of human-computer interaction by looking exactly where everyone else stopped looking."


⚠️ A Warning to the IT World

I did not create a vulnerability. I discovered one that has existed since 1997.

The Windows Accessibility Layer — UI Automation — has been exposing the complete structure, content, and state of every GUI application on every Windows machine for 29 years. Every button name. Every text field value. Every menu item. Structured. Machine-readable. In real-time. Unprotected by any authentication mechanism. Available to any process running on the system.

I did not build this interface. Microsoft did, in 1997. Apple built the equivalent for macOS in 2001. The Linux community built AT-SPI2 the same year. Google built AccessibilityService for Android in 2009. Every major operating system on Earth has one.

What I built is a tool that makes this interface usable. That takes 29 years of latent capability and turns it into structured, queryable, actionable data. A single binary that reads and controls any application through this legally mandated, legally unblockable accessibility layer.

This means:

Today, I am releasing a primitive — a universal interface layer — that reads any field in any application on your computer. Not by hacking. Not by exploiting a bug. Through an interface that your operating system provides by design, that disability law in 186 countries requires to exist, and that cannot be disabled without simultaneously locking blind users out of their computers.

Today, I am releasing a primitive that controls any application on your computer. Fill forms. Click buttons. Type text. Through the same input mechanisms that screen readers and assistive technology have used for decades. Indistinguishable from human input at the OS level.

I built it. It's open source. And the tools built on top of it will follow within weeks.

I am not telling you this to scare you. I am telling you this because you deserve to know.

The accessibility layer was built as an act of inclusion — to ensure that disabled people can use computers. That purpose is noble and must be protected. But the same interface that enables a screen reader to read your screen enables any software to read your screen. The same mechanism that allows assistive input devices to type for paralyzed users allows any software to type into any field.

This is not a bug to be patched. This is a fundamental property of how modern operating systems work. It is protected by law. It cannot be removed. And as of today, it is documented.

The security community needs to understand this. IT administrators need to understand this. Every organization that handles sensitive data on desktop computers needs to understand this.

I chose to publish openly so that everyone learns at the same time — defenders and attackers, enterprises and researchers, governments and citizens. Because the alternative — discovering this capability through a breach instead of through a paper — is worse for everyone.

Read the full analysis. Understand what is possible. Then decide how your organization responds.

— Martin Gehrken, February 2026



Part I: The Problem

1. The $300 Billion Screenshot Problem

Every major AI laboratory on the planet is pursuing the same objective: autonomous agents that operate desktop software. OpenAI's Operator. Anthropic's Computer Use. Google's Project Mariner. Microsoft's Copilot Actions. Each backed by tens of billions in investment. Each pursuing the same vision: AI that can use a computer the way you do.

And every single one of them uses the same fundamental approach.

They take a screenshot.

They send that screenshot to a vision model. The model looks at the image — millions of pixels, thousands of tokens — and tries to figure out what's on screen. It guesses where the buttons are. It estimates where to click. The agent receives coordinates back. A simulated mouse moves to those coordinates. A click happens. Maybe it works. Maybe it doesn't. Then another screenshot is taken. The cycle repeats.

This is not a caricature. This is the actual architecture. This is what hundreds of billions of dollars of research and development have produced. In 2026, the state of the art for making AI interact with software is taking photos of screens and guessing where to click.

Let that sink in.

Google, OpenAI, Anthropic, and Microsoft — the four most powerful AI organizations on Earth — have collectively invested more money into AI research than the GDP of most countries. Their brightest engineers have spent years on this problem. And the best they've come up with is the digital equivalent of squinting at a monitor from across the room.

Meanwhile, on February 16, 2026, I built something in 8.5 hours that makes all of it unnecessary.

This is not hyperbole. This is not marketing. By the end of this article, you will understand exactly what I built, exactly why it works, exactly why it cannot be blocked, and exactly why every screenshot-based AI agent framework on the planet is now building on a foundation that was wrong from the start.


2. How AI Desktop Automation Works in 2026

To understand why DirectShell matters, you need to understand what it replaces. Let me walk you through the state of the art.

The Screenshot Loop

Every major AI agent framework in 2026 follows the same pattern:

1. Capture a screenshot of the application
2. Encode the screenshot (1,200–5,000 tokens per image)
3. Send it to a vision-language model (cloud API call)
4. The model analyzes the image
5. The model guesses pixel coordinates for the next action
6. Coordinates are sent back
7. A simulated mouse click happens at those coordinates
8. Wait for the UI to update
9. Capture another screenshot
10. Repeat

This is the loop. This is what OpenAI Operator does. This is what Anthropic Computer Use does. This is what Google Project Mariner does. Every iteration burns tokens, burns money, burns time, and introduces another chance for failure.

Why This Is Fundamentally Broken

The screenshot approach has five structural weaknesses that cannot be resolved within the paradigm:

1. Cost

A single screenshot at 1920×1080 resolution consumes approximately 1,200–1,800 tokens when encoded for a vision-language model. A multi-step workflow requiring 20 interactions consumes 24,000–36,000 tokens in image data alone — before the model performs any reasoning. At current API pricing, even simple automation workflows become expensive at scale. Running continuous background monitoring? Forget it. Every glance at the screen costs money.

2. Context Saturation

Language models have finite context windows. Every screenshot injected into the context displaces space that could be used for reasoning, instructions, or memory. An agent operating across multiple applications accumulates screenshots rapidly, degrading the model's ability to maintain coherent multi-step plans.

This is what I call the "stuffed head" problem. The agent becomes progressively less capable as the task grows more complex — not because the task is harder, but because visual data is consuming its working memory. It's like trying to solve a math problem while someone keeps holding up photographs in front of your face.

3. Latency

Each action requires a round trip: capture, encode, transmit to cloud, process, respond, execute. At typical API latencies, this introduces 2–5 seconds per action. A 30-step workflow takes 1–2.5 minutes even when every step succeeds on the first attempt. In practice, steps fail. Retries happen. A simple task that takes a human 30 seconds takes an AI agent 15–20 minutes.

The user sits there. Watching. Their mouse and keyboard locked out by an agent that's "thinking." For minutes at a time. This is the user experience that billions of dollars have produced.

4. Fragility

Visual inference is resolution-dependent, theme-dependent, font-dependent, and language-dependent. A model trained to recognize a "Save" button at 100% scaling may fail at 125%. Dark mode changes the visual fingerprint of every element. Localized interfaces present the same UI in different languages. A pop-up notification can occlude the target element. An animation can change the screen state mid-inference.

Every screenshot is a lossy, ambiguous representation of the underlying interface state. The model doesn't know what the interface IS. It only knows what the interface LOOKS LIKE at one specific moment in time.

5. Opacity

A screenshot contains pixels. It does not contain semantics. The model cannot reliably distinguish between a button labeled "Delete" and a decorative image that contains the word "Delete." It cannot determine whether a text field is editable, disabled, or read-only without guessing from visual cues. It cannot identify off-screen elements, scroll positions, or hierarchical relationships between UI components. It cannot query for specific elements — it must parse the entire visual field every time.

The model is inferring structure from visual patterns. It is never actually reading the interface.

The Fundamental Error

Here is the sentence that summarizes everything wrong with the current approach:

The screenshot paradigm performs computer vision on a UI that already describes itself as text.

This is equivalent to:

  • Photographing a JSON response and running OCR on the photo, instead of parsing the JSON
  • Taking a screenshot of a spreadsheet and using a vision model to read cell values, instead of calling the spreadsheet API
  • Recording someone reading a book aloud and running speech-to-text on the audio, instead of opening the text file

The data is already there. It has been there for 29 years. In structured, semantic, machine-readable form. And the entire industry decided to take pictures of it instead.


3. The Numbers That Should Embarrass an Industry

Let's look at the actual benchmarks. Not marketing claims. Not press releases. Real, reproducible numbers from standardized evaluation frameworks.

OSWorld Benchmark (December 2025)

OSWorld is the industry-standard benchmark for evaluating AI agents on desktop tasks. It measures whether an agent can complete real-world workflows on a desktop operating system.

| Agent | Success Rate | Average Time per Task |
|---|---|---|
| AskUI VisionAgent (current leader) | 66.2% | N/A |
| UI-TARS 2 (ByteDance) | 47.5% | 12–18 minutes |
| OpenAI CUA o3 (Operator) | 42.9% | 15–20 minutes |
| Claude Computer Use (standalone) | 22–28% | 10–15 minutes |
| Human baseline | 72.4% | 30 seconds – 2 minutes |

(OSWorld leaderboard as of February 2026. Numbers shift weekly. The structural argument — screenshot agents burn thousands of tokens per step and take minutes per task — does not.)

Even the current leader at 66.2% still fails one in three tasks, still uses screenshots, still burns thousands of tokens per perception step, and still takes orders of magnitude longer than a human. That is the state of the art. That is what hundreds of billions of dollars have produced.

And these are controlled test conditions. In real-world usage, with unexpected pop-ups, loading screens, network delays, and UI variations, the success rate drops further.

Token Economics

Let's compare the cost of a single perception step — one moment of "looking at the screen":

| Method | Tokens per Perception | Data Type |
|---|---|---|
| Screenshot (vision model) | 1,200–5,000 | Compressed image pixels |
| Full tree dump (JSON/YAML) | 5,000–15,000 | Hierarchical text structure |
| DirectShell (.a11y.snap) | 50–200 | Filtered, indexed element list |
| DirectShell (SQL query) | 10–50 | Targeted query result |

For continuous background monitoring (checking if an email arrived, watching for a form submission), the token difference exceeds 100:1.

This means an agent using DirectShell can maintain 10–30x more operational history in its context window, enabling significantly longer and more complex workflows without context degradation. Where a screenshot-based agent runs out of context after 10–20 actions, a DirectShell-based agent can maintain hundreds of actions in working memory.

Latency Comparison

| Operation | Screenshot Agent | DirectShell |
|---|---|---|
| Perceive screen state | 2–5 seconds | < 1 millisecond (file read) |
| Identify target element | Part of vision inference | Microseconds (SQL query) |
| Execute action | 200–500ms (mouse simulation) | 30ms (action dispatch) |
| Full perception-action cycle | 3–8 seconds | < 100ms |
| 30-step workflow (optimistic) | 1.5–4 minutes | 3–10 seconds |

The difference is not incremental. It is not 2x or 5x. It is orders of magnitude. A 30-step workflow that takes the best AI agent 15 minutes (when it works at all) takes DirectShell seconds. And DirectShell does not fail because it clicked the wrong pixel. There are no wrong pixels. There is a database query that returns the exact element.


Part II: The Insight

4. The Door That Was Always Open

Here is the secret. Here is what nobody saw. Here is why this article exists.

Every application on your computer is already describing itself in full structural detail. Right now. While you read this. Every button is declaring its name, its role, whether it's enabled, and where it is on screen. Every text field is exposing its current value. Every menu hierarchy is represented as a traversable tree. Every checkbox knows whether it's checked.

This data exists in every application. On every modern operating system. Updated in real-time. On every UI change.

It's called the Accessibility Tree. And it was built for blind people.

The Accessibility Layer: A Brief History

In 1997, Microsoft introduced MSAA (Microsoft Active Accessibility) as part of Windows 95/98. The purpose was simple: enable screen readers — software that reads the screen aloud — so that blind and visually impaired people could use computers.

In 2005, with Windows Vista, Microsoft introduced UI Automation (UIA) — a modern, more powerful replacement. UIA provides a complete, hierarchical, real-time representation of every GUI element in every application running on the system.

Here is what that looks like:

Window: "Invoice - Datev Pro"
├── TitleBar
│   ├── Button: "Minimize"
│   ├── Button: "Maximize"
│   └── Button: "Close"
├── MenuBar
│   ├── MenuItem: "File"
│   ├── MenuItem: "Edit"
│   └── MenuItem: "Help"
├── Pane: "Invoice Details"
│   ├── Edit: "Customer Number"  →  Value: "KD-4711"
│   ├── Edit: "Amount"           →  Value: "1,299.00"
│   ├── ComboBox: "Tax Rate"     →  Value: "19%"
│   └── Button: "Book"           →  IsEnabled: true
└── StatusBar: "Ready"

Each element provides:

  • Name — human-readable label ("Save", "Customer Number", "Inbox")
  • ControlType — semantic role (Button, Edit, ComboBox, ListItem, Menu...)
  • Value — field content, URL, selected item
  • AutomationId — developer-assigned unique identifier
  • BoundingRectangle — exact position and size on screen (x, y, width, height)
  • IsEnabled — whether it can be interacted with
  • IsOffscreen — whether it's currently visible
  • Parent/Child relationships — full hierarchical tree structure
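
To see how little is needed to get at this data, here is a minimal, untested Rust sketch that lists the immediate children of whichever window currently has focus. It uses the official windows crate bindings for UI Automation; the exact binding signatures and feature flags are assumptions of this sketch, not taken from DirectShell's source.

use windows::core::Result;
use windows::Win32::System::Com::{CoCreateInstance, CoInitializeEx, CLSCTX_INPROC_SERVER, COINIT_MULTITHREADED};
use windows::Win32::UI::Accessibility::{CUIAutomation, IUIAutomation};
use windows::Win32::UI::WindowsAndMessaging::GetForegroundWindow;

fn main() -> Result<()> {
    unsafe {
        CoInitializeEx(None, COINIT_MULTITHREADED).ok()?;
        let uia: IUIAutomation = CoCreateInstance(&CUIAutomation, None, CLSCTX_INPROC_SERVER)?;
        // Whatever window has focus right now is already describing itself.
        let root = uia.ElementFromHandle(GetForegroundWindow())?;
        let walker = uia.RawViewWalker()?;
        let mut child = walker.GetFirstChildElement(&root);
        while let Ok(el) = child {
            println!("{:?}: {}", el.CurrentControlType()?, el.CurrentName()?);
            child = walker.GetNextSiblingElement(&el);
        }
    }
    Ok(())
}

No elevation, no hooks, no vendor SDK: the same call path a screen reader takes.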

This is pure text. This is what LLMs are built to process.

No vision model needed. No coordinate guessing. No pixel interpretation. The semantic layer already exists. It has existed for 29 years.

And in 2026, while OpenAI, Google, and Anthropic spent hundreds of billions taking screenshots, nobody was using it as a universal interface for AI agents.

Why Accessibility Trees Exist Everywhere

This is not a Windows-specific feature. Every major operating system has an equivalent:

| Platform | Framework | Year Introduced |
|---|---|---|
| Windows | UI Automation (UIA) / MSAA | 1997 / 2005 |
| macOS | NSAccessibility / AXUIElement | 2001 |
| Linux | AT-SPI2 (Assistive Technology SPI) | 2001 |
| Android | AccessibilityService API | 2009 |
| iOS | UIAccessibility | 2008 |

Every major application framework implements these APIs. Native apps implement them. Web apps implement them (through the browser's accessibility layer). Cross-platform frameworks (Electron, Qt, GTK) implement them. Chromium-based applications expose the entire DOM through the accessibility tree.

The coverage is not optional. It is a platform-level requirement. And as we'll see in Part IV, it is increasingly a legal requirement that cannot be removed.


5. What Already Exists (And Why It's Not Enough)

Before I explain what DirectShell does, let me honestly acknowledge what already exists. This is not a field where nothing has been done. People have used accessibility APIs before. The question is: how, and why wasn't it enough?

I surveyed the 419 academic sources indexed by OSWorld, every major GitHub repository in the AI agent space, and every commercial product I could find. Here is the complete landscape as of February 2026.

Screen Readers (JAWS, NVDA, Narrator)

Screen readers have been using accessibility APIs since 1997. They walk the accessibility tree and read element names aloud for blind users. They are single-purpose assistive tools. They do not expose the tree as structured data. They do not provide query interfaces. They are not designed for programmatic consumption. They proved that the data exists — they never made it programmable.

Microsoft UFO / UFO2 / UFO3 (2024–2025)

Microsoft Research published UFO (UI-Focused Agent) in February 2024, UFO2 in April 2025, and UFO3 Galaxy in November 2025. UFO uses Windows UI Automation as one component in a hybrid system that also uses screenshots and native APIs. It is an agent framework — a specific application built on top of UIA, not a universal interface layer.

The critical difference: UFO walks the accessibility tree, dumps it as JSON, and sends the entire blob to GPT-4o. This creates the same context saturation problem as screenshots — instead of millions of pixels, you get tens of thousands of JSON tokens. A full UIA dump of a complex application (like Excel or Claude Desktop) results in 60–100 KB of JSON. That's 15,000+ tokens consumed just to tell the model what's on screen.

UFO3 expanded to a "Galaxy" multi-agent framework covering 20+ Windows applications. Still JSON dumps. Still no SQL. Still an application, not infrastructure.

UFO is an application that happens to use UIA. DirectShell is the infrastructure layer that makes UIA usable.

Windows-MCP (CursorTouch, 2025)

Windows-MCP is the closest thing to DirectShell that existed before DirectShell. With 4,300+ stars and over 2 million users in Claude Desktop, it exposes the Windows accessibility tree through MCP (Model Context Protocol) tools.

What it does: reads UIA elements, provides click/type actions by element name, works across desktop applications.

What it doesn't do: no SQL database, no persistent storage, no multi-format output, no overlay window, no delta-based event system, no action queue. Every perception call walks the full tree and returns results in-memory. There is no way for an external script to query the UI state without going through the MCP protocol.

Windows-MCP is a tool. DirectShell is the layer that tools are built on.

Playwright MCP (Microsoft, 2025)

Playwright MCP exposes web page accessibility trees through the Model Context Protocol. Vercel's agent-browser refined this approach by reducing the tree and using element references (like @e21). Their research showed a 73% token reduction compared to screenshots — proving the core thesis that accessibility trees are more efficient than pixels.

But Playwright MCP only works for web pages in browsers. It does not work for desktop applications. It does not work for SAP. It does not work for Datev. It does not work for any of the millions of desktop applications that businesses run every day. The moment you leave the browser, Playwright MCP is blind.

computer-mcp (CommandAGI, 2025)

computer-mcp takes a cross-platform approach, exposing accessibility trees on Windows, macOS, and Linux through MCP. The most ambitious scope of any existing tool.

The problem: it returns the full accessibility tree as JSON. For a complex application, that's 15,000–60,000 tokens per read. This is the same context saturation problem as screenshots, just in text form. No SQL filtering. No multi-format output. No way to ask "what are the interactive elements?" without ingesting the entire tree.

macOS UI Automation MCP (mb-dev, 2025)

macOS UI Automation MCP uses JSONPath queries to filter the accessibility tree on macOS. This is the closest architectural analog to DirectShell's approach — it recognized that the raw tree is too large and introduced a query language.

But JSONPath is not SQL. It cannot do joins, aggregations, or complex filtering. It runs on macOS only. And critically, it does not persist the tree in a database — each query re-walks the tree from scratch. There is no historical state, no action queue, no external interface.

pywinauto (Open Source, Python)

pywinauto is the granddaddy of Windows accessibility automation. 3,700+ stars. Used by the GOI paper (October 2025) to build declarative interfaces on top of Windows UIA — 18,000+ lines of Python code.

pywinauto is a library, not infrastructure. It requires a full Python runtime. It provides programmatic access to individual elements but does not store the tree, does not generate output formats, and does not provide a universal action interface. It is a toolkit for building automation scripts, not a primitive for building systems.

RPA Tools (UiPath, Automation Anywhere, Blue Prism)

Enterprise RPA tools use accessibility selectors as one of several element-targeting strategies, alongside image matching, coordinate-based clicking, and OCR. They require per-application scripting. They do not expose the full element tree as a queryable data structure. They are workflow automation tools, not universal interface layers.

UiPath is valued at ~$6 billion. Its entire business model is "we help you automate applications that don't have APIs." Each integration costs $50K–$150K/year to build and maintain. DirectShell does what UiPath does with a 700 KB binary and no scripting required.

The $28.3 billion RPA market (projected $247 billion by 2035) exists because desktop applications don't have APIs. DirectShell gives every application an API.

The Screenshot Agents (OpenAI, Anthropic, Google, ByteDance)

For completeness, here is what the major AI labs built:

| Agent | Approach | OSWorld Success Rate | Source |
|---|---|---|---|
| AskUI VisionAgent | Screenshots + custom vision | 66.2% (leader) | OSWorld leaderboard |
| UI-TARS 2 (ByteDance) | Screenshots + specialized vision | 47.5% | OSWorld leaderboard |
| OpenAI Operator (CUA o3) | Screenshots + GPT-4o + RL | 42.9% | OSWorld benchmark |
| Anthropic Computer Use | Screenshots + Claude | 22–28% (standalone) | OSWorld benchmark |
| Google Project Mariner | Screenshots + DOM hybrid | Browser-only | $249.99/month |
| Microsoft Copilot Studio | Screenshots + UIA hybrid | Desktop + browser | September 2025 |

All screenshot-based. Failure rates ranging from 34% (current leader) to 72% (standalone Computer Use). All consuming 1,200–5,000 tokens per perception step. All taking 10–20 minutes for tasks humans complete in under two. All pursuing the paradigm that DirectShell makes obsolete.

The Complete Comparison

Here is every tool plotted against the five architectural components that define DirectShell:

| Tool | A11y Tree Read | SQL Database | Multi-Format Output | Action Queue | Universal (any app) |
|---|---|---|---|---|---|
| Screen Readers | Yes | No | No | No | Yes |
| Microsoft UFO/UFO2/UFO3 | Yes | No | No | No | Yes |
| Windows-MCP | Yes | No | No | No | Yes |
| Playwright MCP | Yes (browser) | No | No | No | Browser only |
| computer-mcp | Yes | No | No | No | Yes |
| macOS UI Automation MCP | Yes | No | No | No | macOS only |
| pywinauto | Yes | No | No | No | Yes |
| RPA (UiPath etc.) | Partial | No | No | No | Per-script |
| Screenshot agents | No | No | No | No | Yes (poorly) |
| DirectShell | Yes | Yes | Yes | Yes | Yes |

No existing tool implements more than 2 of the 5 components. DirectShell implements all 5.


6. The Gap Nobody Filled

Let me state the gap precisely, because precision matters.

I searched 419 academic sources indexed by OSWorld. I searched GitHub for every combination of "accessibility tree" + "SQL," "UIA" + "database," "a11y" + "SQLite." I searched Google Scholar, ArXiv, ACL Anthology, and patent databases.

Zero results.

No project, paper, product, or patent on Earth — as of February 16, 2026 — describes a system that:

  1. Continuously dumps the complete accessibility tree of any application into a queryable relational database at real-time refresh rates
  2. Automatically generates multiple machine-readable output formats optimized for different consumer types (50-token LLM snapshots vs. full queryable database)
  3. Provides a universal action queue where any external process can submit input actions by element name via a simple INSERT INTO inject
  4. Captures live UI events (property changes, structure mutations, window opens) as a delta stream — enabling 50-token perception instead of re-reading the full tree
  5. Operates as infrastructure rather than as an application — a universal primitive between any agent and any GUI

These are not incremental improvements. These are architectural innovations that create a new category.

Why the Gap Existed

The components have been available for decades:

  • The accessibility tree: since 1997 (MSAA), refined 2005 (UI Automation)
  • SQL databases: since the 1970s
  • The Win32 input system: since the 1990s
  • MCP (Model Context Protocol): since November 2024

Each is well-understood, battle-tested technology. The gap existed not because the technology was missing, but because two communities never talked to each other:

  1. The accessibility community knew the tree existed but built single-purpose assistive tools
  2. The AI community knew LLMs needed structured data but assumed GUIs could only be perceived through screenshots

The ShowUI paper proved that 33% of screenshot tokens are visually redundant. The OSWorld benchmark showed accessibility-tree approaches consistently outperforming pure vision. Research from accessibility.works demonstrated that agents with accessibility data succeed 85% of the time while consuming 10x fewer resources.

The evidence was everywhere. The obvious conclusion — put it in a database and let LLMs query it — was nowhere.

What nobody did — in 29 years — was combine them into a universal interface primitive.

Until February 16, 2026.


Part III: DirectShell

7. What DirectShell Is

The One-Sentence Definition

DirectShell turns every GUI on the planet into a text-based API that any LLM can natively read and control.

That is the entire concept. Everything else is implementation detail.

What DirectShell Is Not

DirectShell is not an automation script. It is not an RPA tool. It is not a screen reader. It is not a macro recorder. It is not a testing framework. It is not a product.

DirectShell is a primitive.

A primitive in computing is a fundamental building block that:

  • Cannot be decomposed into simpler components that achieve the same function
  • Enables an entire category of higher-level tools and workflows
  • Has no expiration date — it remains useful as long as the platform exists

Think of the building blocks that make modern computing possible:

| Primitive | Domain | What It Universalizes |
|---|---|---|
| TCP/IP | Networking | Reliable data transport between any two computers |
| HTTP | Web | Standardized request-response for any resource |
| SQL | Data | Universal query language for any relational database |
| The Browser | Information | Universal client for any web resource |
| PowerShell | Backend | CLI access to any OS service, registry, process, file |
| DirectShell | Frontend | Input/output control for any GUI application |

PowerShell automates the backend. DirectShell automates the frontend.

Before DirectShell, the graphical frontend of every application was a closed system. You could look at it (screenshots) or you could use the vendor's API (if one existed, if you could afford it, if the vendor allowed it). There was no general-purpose, structured, queryable, writable interface to the visual layer of software.

After DirectShell, every application that has a window has a universal interface. The same structured output. The same action format. The same data model. Regardless of vendor, language, age, or platform.

How It Works (30-Second Version)

  1. DirectShell is a lightweight overlay window (single binary, no dependencies, ~700 KB)
  2. You drag it onto any running application. It "snaps" to it.
  3. Once snapped, DirectShell continuously reads the application's entire UI state through the Windows Accessibility framework
  4. It stores everything in a SQLite database — every button, text field, menu item, their names, values, positions, and states
  5. It generates multiple text files optimized for different consumers (scripts, AI agents, LLMs)
  6. External processes can control the application by writing simple SQL commands to an action queue in the same database
  7. DirectShell executes those commands as native input events — keyboard strokes, mouse clicks, text insertion — that are indistinguishable from human input

Both directions are text. Both directions are LLM-native.

The AI reads a text file to understand the screen. The AI writes a SQL command to act on the screen. No screenshots. No pixels. No coordinate guessing. No vision model. Just text in, text out — the native operating mode of every language model on earth.


8. The Architecture

8.1 The Physical Layer: An Invisible Overlay

DirectShell starts as a small, translucent window with an anthracite frame and a subtle light animation that travels around its border — a visual signature indicating it's alive and ready.

When you drag this window over any running application and release it, DirectShell snaps to the target — detecting the application, matching its position and dimensions, and binding to it. The word isn't accidental. It's what it feels like: a magnet clicking into place. From this point forward, the two windows behave as one: move one, the other follows. Minimize one, both minimize. Close one, both close. The application has been snapped. It now has a universal interface.

The key technical elements:

Transparent Click-Through: The overlay uses WS_EX_LAYERED with color keying. The center of the overlay is magenta (keyed out to full transparency). All input — mouse clicks, keyboard strokes — passes straight through to the target application below. The user never notices DirectShell is there.

Owner-Window Relationship: DirectShell uses SetWindowLongPtrW to establish an owner-owned relationship with the target. Windows automatically maintains Z-order inheritance — the overlay always stays on top of its owner, but not on top of other applications.

Bidirectional Position Sync: A 60 Hz timer (SYNC_TIMER, 16ms) continuously monitors both windows. If the target moves, DirectShell follows. If the user drags DirectShell, the target follows. The synchronization is seamless — the two windows feel like one.

Smart Button Detection: When snapping, DirectShell uses UIA to analyze the target's title bar. It locates the minimize, maximize, and close buttons, and positions its own unsnap button adjacent to them — fitting naturally into the target's chrome. This is not hardcoded. It adapts to any application's title bar layout.

Shell Window Filtering: DirectShell prevents itself from snapping to the Desktop, Taskbar, or system tray by checking window class names against known shell classes (Progman, WorkerW, Shell_TrayWnd, etc.).
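
For orientation, the snap itself needs surprisingly little Win32. Below is a rough, untested sketch of the two core calls, assuming the windows crate bindings (module paths and signatures are assumptions; the real implementation adds the sync timer, button detection, and error handling):

use windows::Win32::Foundation::{COLORREF, HWND};
use windows::Win32::UI::WindowsAndMessaging::{
    SetLayeredWindowAttributes, SetWindowLongPtrW, GWLP_HWNDPARENT, LWA_COLORKEY,
};

/// Bind an already-created WS_EX_LAYERED overlay window to a target window ("snap").
unsafe fn snap_overlay(overlay: HWND, target: HWND) {
    // Key out the magenta interior so mouse and keyboard input pass straight
    // through to the application underneath.
    let _ = SetLayeredWindowAttributes(overlay, COLORREF(0x00FF00FF), 0, LWA_COLORKEY);

    // Owner relationship: Windows keeps the overlay above its owner in the
    // Z-order, but not above other applications.
    let _ = SetWindowLongPtrW(overlay, GWLP_HWNDPARENT, target.0 as isize);
}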

The physical layer is elegant engineering, but it's not the innovation. It's the foundation on which the real breakthrough is built.

8.2 The Perception Pipeline: GUI → Database

This is the core of DirectShell. This is what makes it a primitive.

Every 500 milliseconds (2 Hz), DirectShell spawns a background thread that performs a complete traversal of the target application's UI Automation tree. The traversal is depth-first, unlimited depth, unlimited children, using IUIAutomation::RawViewWalker() for an unfiltered view of every element the operating system knows about.

For each element, the following properties are extracted:

| Property | UIA Method | What It Tells You |
|---|---|---|
| Control Type | CurrentControlType() | What this element IS (Button, Edit, Menu...) |
| Name | CurrentName() | What it's CALLED ("Save", "Customer Number") |
| Value | GetCurrentPattern(ValuePatternId) | What it CONTAINS (field text, URL, selection) |
| Automation ID | CurrentAutomationId() | Developer's internal identifier |
| Enabled | CurrentIsEnabled() | Can it be interacted with right now? |
| Off-screen | CurrentIsOffscreen() | Is it currently visible? |
| Bounding Rectangle | CurrentBoundingRectangle() | Exact position and size on screen |

Each element is immediately inserted as a row in a SQLite database. The database uses Write-Ahead Logging (WAL) mode, enabling external processes to read the database at any time without blocking or corruption, even while DirectShell is writing to it.

Instead of accumulating all elements in memory and then dumping them — which would delay availability — DirectShell streams elements to the database during traversal. A commit happens every 200 elements. This means that the top-level UI elements (menu bars, main buttons, input fields) are available for query within milliseconds of the walk starting, while deeper nested elements continue to be discovered and written.
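
Stripped of the thread handoff, the 200-element commit batching, and most of the extracted properties, the heart of the perception pipeline looks roughly like this. It is an untested sketch against the two crates the project uses (windows 0.58, rusqlite 0.31); the table schema and binding details are assumptions, not DirectShell's actual code.

use rusqlite::Connection;
use windows::core::Result;
use windows::Win32::System::Com::{CoCreateInstance, CoInitializeEx, CLSCTX_INPROC_SERVER, COINIT_MULTITHREADED};
use windows::Win32::UI::Accessibility::{CUIAutomation, IUIAutomation, IUIAutomationElement, IUIAutomationTreeWalker};
use windows::Win32::UI::WindowsAndMessaging::GetForegroundWindow;

/// Depth-first walk of one element's subtree, streaming each row into SQLite.
unsafe fn walk(walker: &IUIAutomationTreeWalker, el: &IUIAutomationElement, db: &Connection) {
    let name = el.CurrentName().map(|s| s.to_string()).unwrap_or_default();
    let role = el.CurrentControlType().map(|t| t.0).unwrap_or(0);
    let rect = el.CurrentBoundingRectangle().unwrap_or_default();
    let _ = db.execute(
        "INSERT INTO elements (name, role, x, y) VALUES (?1, ?2, ?3, ?4)",
        (&name, role, rect.left, rect.top),
    );
    // Recurse through the unfiltered raw view: unlimited depth, unlimited children.
    let mut child = walker.GetFirstChildElement(el);
    while let Ok(c) = child {
        walk(walker, &c, db);
        child = walker.GetNextSiblingElement(&c);
    }
}

fn main() -> Result<()> {
    let db = Connection::open("snapshot.db").expect("open db");
    // WAL mode lets external readers query while the walk is still writing.
    db.query_row("PRAGMA journal_mode=WAL", [], |_| Ok(())).expect("wal");
    db.execute_batch("CREATE TABLE IF NOT EXISTS elements (name TEXT, role INTEGER, x INTEGER, y INTEGER)")
        .expect("schema");
    unsafe {
        CoInitializeEx(None, COINIT_MULTITHREADED).ok()?;
        let uia: IUIAutomation = CoCreateInstance(&CUIAutomation, None, CLSCTX_INPROC_SERVER)?;
        let root = uia.ElementFromHandle(GetForegroundWindow())?;
        let walker = uia.RawViewWalker()?;
        walk(&walker, &root, &db);
    }
    Ok(())
}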

After the tree walk completes, DirectShell generates four output files:

1. The Database (.db)

The complete element tree as a SQLite database with full SQL query capability:

-- What buttons can the user click?
SELECT name, x, y FROM elements WHERE role='Button' AND enabled=1 AND offscreen=0

-- What's in the text fields?
SELECT name, value FROM elements WHERE role='Edit'

-- Find a specific message in a chat
SELECT name FROM elements WHERE name LIKE '%invoice%'

-- How many unread items?
SELECT count(*) FROM elements WHERE role='ListItem' AND name LIKE '%unread%'

-- Complete app structure overview
SELECT role, COUNT(*) FROM elements GROUP BY role ORDER BY COUNT(*) DESC

Each query executes in microseconds. The LLM doesn't need to parse a 100 KB JSON document to find one button. It asks a specific question and gets a specific answer.
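
Any external process can run these queries itself while DirectShell keeps writing; WAL mode means a read never blocks the writer. A small consumer sketch in Rust with rusqlite (the database file name and exact column types are assumptions):

use rusqlite::Connection;

fn main() -> rusqlite::Result<()> {
    let db = Connection::open("opera.db")?;

    // Which buttons could be clicked right now?
    let mut buttons = db.prepare(
        "SELECT name FROM elements WHERE role='Button' AND enabled=1 AND offscreen=0",
    )?;
    for name in buttons.query_map([], |r| r.get::<_, String>(0))? {
        println!("[click] {}", name?);
    }

    // What is currently typed into the text fields?
    let mut edits = db.prepare("SELECT name, value FROM elements WHERE role='Edit'")?;
    for row in edits.query_map([], |r| Ok((r.get::<_, String>(0)?, r.get::<_, Option<String>>(1)?)))? {
        let (name, value) = row?;
        println!("{name} = {}", value.unwrap_or_default());
    }
    Ok(())
}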

2. The Snapshot (.snap)

A flat list of all interactive, enabled, visible elements with their input tool classification:

# opera.snap — Generated by DirectShell
# Window: Google Gemini – Opera

[keyboard] "Adressfeld" @ 168,41 (2049x29) id=addressEditor
[click] "Neuer Chat" @ 45,107 (2515x1285)
[keyboard] "Einen Prompt für Gemini eingeben" @ 999,1177 (1069x37)
[click] "Einstellungen & Hilfe" @ 1800,1350 (150x20)

This is the deterministic operations manual for scripts and automation tools. Every element that accepts input, classified by input type, with exact coordinates.

3. The Screen Reader View (.a11y)

A structured text representation with three sections: Focus (what's currently selected), Input Targets (text fields and their current values), and Content (all visible text, links, and labels). This is the situational awareness file — it tells an agent where it is, what it can see, and what it can type into.

4. The Operable Element Index (.a11y.snap)

The LLM pipeline. This is what an AI agent actually reads:

# opera.a11y.snap — Operable Elements (DirectShell)
# Window: Google Gemini – Opera
# Use 'target' column in inject table to aim at an element by name

[1] [keyboard] "Adressfeld" @ 168,41 (2049x29)
[2] [click] "Neuer Chat" @ 45,200 (200x30)
[3] [click] "Meine Inhalte" @ 45,240 (200x30)
[4] [click] "Gems" @ 45,280 (200x30)
[5] [keyboard] "Einen Prompt für Gemini eingeben" @ 999,1177 (1069x37)
[6] [click] "Einstellungen & Hilfe" @ 1800,1350 (150x20)

# 6 operable elements in viewport

Six lines of text. That is the entire perception step for an AI operating Google Gemini. Not a 5,000-token screenshot. Not a 15,000-token JSON dump. Six numbered lines that say: here are the 6 things you can interact with, here's what each one is called, and here's what type of input each one accepts.

An LLM reads this and instantly knows: "Element [5] is a text input. It's called 'Einen Prompt für Gemini eingeben'. I can type into it." That is the complete perception. No vision model. No inference. No guessing. A few lines of text.

This is automatically generated API documentation for every application on the planet that never had any.
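
Putting perception and action together takes a few dozen lines. The sketch below is illustrative and untested (file names are assumptions): it reads the operable-element index, picks the first keyboard target whose name mentions "Prompt", and queues a text action for it. That is the entire loop that screenshot agents spend thousands of tokens on.

use rusqlite::Connection;
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Perception: the operable-element index is a handful of text lines.
    let snap = fs::read_to_string("opera.a11y.snap")?;

    // Pick a target by name: the first [keyboard] element mentioning "Prompt".
    let target = snap
        .lines()
        .find(|l| l.contains("[keyboard]") && l.contains("Prompt"))
        .and_then(|l| l.split('"').nth(1)) // the quoted element name
        .ok_or("no prompt field found")?;

    // Action: one row in the inject queue; DirectShell does the rest.
    let db = Connection::open("opera.db")?;
    db.execute(
        "INSERT INTO inject (action, text, target) VALUES ('text', ?1, ?2)",
        ("Hello from a text-only agent.", target),
    )?;
    Ok(())
}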

8.3 The Chromium Problem (And How We Solved It)

Here is a problem that would stop most projects cold. Chromium — the engine behind Chrome, Edge, Opera, and every Electron app (Discord, Slack, VS Code, Spotify, Claude Desktop, and hundreds more) — does not build its accessibility tree by default.

Chromium is performance-obsessed. Building an accessibility tree for the entire DOM costs CPU cycles. So Chromium only does it when it has evidence that an assistive technology (like a screen reader) is actively listening. Without that evidence, a UIA query against a Chromium window returns a skeleton: 9 elements. Window, pane, title bar. Nothing useful.

This meant that out of the box, DirectShell could read native Windows applications perfectly but was blind to every browser and every Electron app on the system. Given that half of modern desktop software is Chromium-based, this was an existential problem.

The solution took three simultaneous signals:

Phase 1: System-Level Screen Reader Flag

SystemParametersInfoW(SPI_SETSCREENREADER, 1, ...)

DirectShell registers itself with Windows as an active assistive technology. This is the same flag that JAWS, NVDA, and Windows Narrator set. When this flag is active, Chromium knows a screen reader is present and begins constructing its accessibility tree.

Additionally, DirectShell sends WM_SETTINGCHANGE directly to the target window — not waiting for the system-wide broadcast that may or may not reach the application in time.

Phase 2: The UIA Focus Handler (Key Innovation)

Here is the clever part. Chromium doesn't just check the screen reader flag. It also checks whether any UIA event handlers are registered — specifically, it calls UiaClientsAreListening(). If that function returns false, Chromium may still skip building its tree.

DirectShell creates a UIA FocusChangedEventHandler — a COM object that implements the IUIAutomationFocusChangedEventHandler interface. This handler does absolutely nothing. Its HandleFocusChangedEvent method is an empty function that immediately returns Ok(()).

But by registering this no-op handler with AddFocusChangedEventHandler, the system now has a registered UIA event listener. UiaClientsAreListening() returns true. And it stays true permanently — because DirectShell intentionally leaks the handler using Box::leak(). It's never deregistered. It never gets garbage collected. It persists for the lifetime of the process.

This single leaked COM object is what forces every Chromium instance on the system to build and maintain its full accessibility tree.

Phase 3: Direct Window Probing

After setting the system flag and registering the handler, DirectShell waits 300ms and then directly probes the target window and all its child windows:

  • AccessibleObjectFromWindow (MSAA probe) on the main window
  • EnumChildWindows to iterate all child windows
  • AccessibleObjectFromWindow + WM_GETOBJECT(OBJID_CLIENT) on each child

This specifically targets Chrome_RenderWidgetHostHWND — the renderer's window handle. The WM_GETOBJECT message forces the renderer to create its accessibility provider if it hasn't already.

Phase 4: Wait and Retry

After another 500ms delay (to give Chromium time to process all signals), DirectShell repeats the child window probe for reliability.

The Result:

In our first test with Opera Browser, the element count went from 9 (shell only) to 800+ (complete browser UI including all web page content). With Claude Desktop (Electron), it went from a handful to 11,454 elements — every chat message, every button, every link, fully searchable and queryable.

This four-phase activation sequence is not a hack. It uses the same signals that legitimate screen readers use. It's just more thorough about ensuring every Chromium process on the system gets the message.

8.4 Multi-Format Output: Automatic API Documentation

Let me re-emphasize this because it's the most underrated aspect of the architecture.

DirectShell doesn't just dump a tree. It generates four different output formats, each optimized for a different consumer:

| Format | Consumer | Size | Purpose |
|---|---|---|---|
| .db (SQLite) | Scripts, SQL clients, programs | Full tree (100KB–1.5MB) | Complete queryable state |
| .snap | Automation scripts | 3–15 KB | All interactive elements, classified |
| .a11y | Context-aware agents | 3–10 KB | Focus, inputs, visible content |
| .a11y.snap | LLMs | 1–5 KB | Numbered operable elements only |

This is a multi-tier API documentation system that DirectShell generates automatically for every application it touches. The same underlying data, presented at four levels of abstraction, for four different types of consumers.

A Python script that needs to automate a form reads the .snap file.
A sophisticated AI agent reads the .a11y file for context.
A lightweight LLM reads the .a11y.snap file — just the numbered list.
A power user runs SQL queries against the .db for any question the other formats don't answer.

No application provides this documentation. No vendor writes it. DirectShell generates it automatically, every 500 milliseconds, for any application you point it at.

This is what makes DirectShell a primitive. It doesn't solve one problem for one application. It provides a universal structured interface for every application. The same output format. The same action format. Whether the target is SAP, Notepad, Excel, a 20-year-old legacy system, or the latest Electron app.

8.5 The Action Pipeline: Native Control

Reading the UI is only half the equation. The other half is controlling it.

DirectShell maintains a persistent table in the SQLite database called inject. Any external process can submit actions by writing a simple SQL INSERT:

-- Set text in a specific field (UIA ValuePattern — instant)
INSERT INTO inject (action, text, target) VALUES ('text', '2,599.00', 'Amount');

-- Type character-by-character (raw keyboard — for chat inputs)
INSERT INTO inject (action, text) VALUES ('type', 'Hello World');

-- Press a key combination
INSERT INTO inject (action, text) VALUES ('key', 'ctrl+a');

-- Click a named element
INSERT INTO inject (action, target) VALUES ('click', 'Book');

-- Scroll
INSERT INTO inject (action, text) VALUES ('scroll', 'down');

Five action types cover every interaction a human can perform with a GUI:

| Action | Mechanism | Speed | Use Case |
|---|---|---|---|
| text | UIA ValuePattern SetValue() | Instant (whole string) | Form fields, address bars, search boxes |
| type | SendInput per character (5ms delay) | ~200 chars/sec | Chat inputs, terminals, apps that reject SetValue |
| key | SendInput with virtual key codes | Instant | Keyboard shortcuts (Ctrl+S, Enter, Tab) |
| click | UIA FindFirst + SendInput mouse event | Instant | Click any named element |
| scroll | SendInput with MOUSEEVENTF_WHEEL | Instant | Scroll in any direction |

The action dispatch runs on its own timer at 33 Hz (30ms interval) — separate from the tree dump timer. This is critical for typing: at 33 Hz, a 200-character message takes about 1 second to type. If actions were dispatched at the tree dump rate of 2 Hz, the same message would take 100 seconds.

Auto-Focus: Before executing any action, the dispatch loop checks whether the target application is in the foreground. If not, it automatically brings it forward using the Alt-key trick (VK_MENU down+up) followed by SetForegroundWindow. This means actions work even when the target is behind other windows.

Mark-Before-Execute: Each action is marked as done=1 before execution, not after. This prevents double-fire if the action takes longer than the 30ms timer interval. If execution fails, the done flag is reset to 0 for retry on the next tick.

Native Input: The target application cannot distinguish DirectShell-mediated input from physical hardware input. SendInput generates the same low-level events that a keyboard and mouse produce. The operating system itself vouches for the events as legitimate.
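
Because the queue lives in the same database, sequencing and completion tracking come for free. A consumer sketch (database file name assumed; new rows are assumed to default to done=0, the flag described above):

use rusqlite::Connection;
use std::{thread, time::Duration};

fn main() -> rusqlite::Result<()> {
    let db = Connection::open("datev.db")?;

    // Queue a small sequence: fill the amount field, then book the invoice.
    db.execute(
        "INSERT INTO inject (action, text, target) VALUES ('text', '2,599.00', 'Amount')",
        [],
    )?;
    db.execute("INSERT INTO inject (action, target) VALUES ('click', 'Book')", [])?;

    // The 33 Hz dispatch loop marks each row done=1 as it executes.
    loop {
        let pending: i64 =
            db.query_row("SELECT COUNT(*) FROM inject WHERE done=0", [], |r| r.get(0))?;
        if pending == 0 {
            break;
        }
        thread::sleep(Duration::from_millis(50));
    }
    Ok(())
}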

8.6 The Keyboard Hook: The Interception Layer

DirectShell installs a global low-level keyboard hook (WH_KEYBOARD_LL) that intercepts every keystroke before it reaches the target application. This creates a Man-in-the-Middle architecture — not on the network, but on the local input stack.

Currently, the hook passes through all keystrokes unchanged. The transform_char() function is an identity function — it returns the character without modification. But the architecture is in place for arbitrary character transformation:

  • PII Sanitization: Replace names, addresses, and account numbers with hashes before they reach a cloud-connected chat application
  • Auto-Translation: Type in German, the application receives English
  • Auto-Correction: Dyslexia support — the user types with errors, the application receives corrected text
  • Input Filtering: Block specific key patterns in specific applications

The hook runs only when DirectShell is snapped, only for non-injected keystrokes (to avoid feedback loops), only when the target has foreground focus, and only when no modifier keys (Ctrl, Alt) are held — preserving keyboard shortcuts.

This is the slot for the "universal LLM in every text field" use case. The infrastructure is built. It's waiting to be filled.

8.7 Timer Architecture: Four Heartbeats

DirectShell's runtime behavior is driven by four independent timers:

                    ┌─────────────────────┐
                    │   WM_TIMER          │
                    │   (Window Proc)     │
                    └─────────┬───────────┘
                              │
          ┌───────────────┬───┴───┬───────────────┐
          ▼               ▼       ▼               ▼
 ┌────────────┐  ┌────────────┐  ┌──────────┐  ┌──────────────┐
 │ SYNC_TIMER │  │ ANIM_TIMER │  │TREE_TIMER│  │ INJECT_TIMER │
 │   ID: 1    │  │   ID: 2    │  │  ID: 3   │  │    ID: 4     │
 │   16 ms    │  │   33 ms    │  │  500 ms  │  │    30 ms     │
 │  ~60 Hz    │  │  ~30 Hz    │  │   2 Hz   │  │   ~33 Hz     │
 └─────┬──────┘  └─────┬──────┘  └────┬─────┘  └──────┬───────┘
       │               │              │                │
       ▼               ▼              ▼                ▼
  do_sync()      InvalidateRect  dump_tree()    process_injections()
 (position)       (repaint)     (a11y tree)     (action queue)
| Timer | Frequency | Purpose | When Active |
|---|---|---|---|
| SYNC | 60 Hz | Position synchronization between overlay and target | Snapped |
| ANIM | 30 Hz | Light reflex animation on the frame border | Unsnapped |
| TREE | 2 Hz | Full accessibility tree dump + output file generation | Snapped |
| INJECT | 33 Hz | Action queue processing (typing, clicking, scrolling) | Snapped |

The animation timer and snap timers are mutually exclusive. When DirectShell snaps to a target, the animation stops and the perception/action timers start. When it unsnaps, the reverse happens. There is no wasted processing.

Why INJECT_TIMER is separate from TREE_TIMER: The tree dump is a heavy operation (full UIA traversal + SQLite rebuild) that runs at 2 Hz. Action dispatch needs to be much faster for fluid typing. If actions were dispatched at 2 Hz, typing 200 characters would take 100 seconds. At 33 Hz, it takes 1 second. The separate timer ensures actions feel instant to the user watching the target application.


9. The Code

DirectShell is written in pure Rust. A single file: src/main.rs, 2,053 lines.

Two dependencies:

  • rusqlite 0.31 (with bundled SQLite — no system dependency)
  • windows 0.58 (official Microsoft Rust bindings for Win32)

That's it. No runtime. No framework. No .NET. No Python. No Node.js. No package manager ecosystem. No 500 MB node_modules directory.
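
For anyone rebuilding it, the entire manifest plausibly looks like this (a sketch; the exact windows-crate feature list is an assumption, since that crate gates each Win32 module behind its own feature):

[dependencies]
rusqlite = { version = "0.31", features = ["bundled"] }   # bundled SQLite, no system dependency
windows  = { version = "0.58", features = [
    "Win32_Foundation",
    "Win32_System_Com",
    "Win32_UI_Accessibility",
    "Win32_UI_WindowsAndMessaging",
    "Win32_UI_Input_KeyboardAndMouse",   # SendInput
] }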

The binary compiles to approximately 700 KB (SQLite's bundled C library accounts for ~500 KB of that). It runs on any 64-bit Windows 10 or 11 system. It requires no installation. No administrator privileges (for standard UIA operation). No configuration file. You download one file, you run it, it works.

This matters because it establishes DirectShell as infrastructure, not as an application. Infrastructure must be lightweight, dependency-free, and universally deployable. A 700 KB single binary that runs everywhere meets that bar.

The choice of Rust is deliberate:

  • Zero-cost abstractions — no garbage collector, no runtime overhead
  • Memory safety — no use-after-free, no buffer overflows, no null pointer dereferences
  • Safe Win32 FFI — the windows crate provides typed, safe bindings to every Win32 API
  • Single binary — Rust compiles to a standalone executable with no runtime dependencies
  • Cross-compilation potential — the architecture ports to other platforms (macOS, Linux) without architectural changes

10. The Proof: Demo Day

On February 16, 2026 — 8.5 hours after the first line of code was written — DirectShell controlled four different applications in a live demonstration.

The setup: Claude Opus 4.6 (running in the Claude Code CLI terminal on the left side of a split screen) used DirectShell to operate applications on the right side. The AI read .a11y and .a11y.snap files to understand the screen, then wrote SQL INSERT commands to the inject table to perform actions. No screenshots. No vision model. Pure text.

Google Sheets: 72 Cells in Seconds

The AI was asked to create a product comparison table. What happened:

  1. We snapped Opera (with Google Sheets loaded)
  2. The AI read the .a11y.snap — saw the input fields and the sheet grid
  3. The AI inserted actions: click cell A1, type "Produkt", Tab to B1, type "Preis", and so on
  4. DirectShell executed the actions at 33 Hz
  5. Within seconds, 72 cells were filled — headers, product names, prices, categories, ratings, and SUM formulas

The formulas had an offset bug — SUM ranges were shifted by one row. This was a first-day interpretation error, not an architectural limitation. The AI was calculating cell references based on its understanding of the grid, and its reference frame was off by one. This is exactly the kind of issue that app profiles will solve — a config file that tells the AI "A1 in Sheets is at these coordinates."

But the point stands: an AI filled 72 cells in a spreadsheet through the accessibility layer alone. No Sheets API. No browser extension. No scripting. Raw input through a legally protected interface.

Google Gemini: Cross-AI Conversation

The AI navigated to Google Gemini in the browser, typed a message into Gemini's input field, and received a response. Then it read Gemini's response through DirectShell's accessibility tree and reported it back.

Gemini's response about DirectShell? "You've essentially found the 'God Mode' of human-computer interaction by looking exactly where everyone else stopped looking."

A Google AI, running on Google's infrastructure, accessed through Google's browser, controlled entirely by a competing AI company's model (Claude), through a universal interface layer that Google didn't build, doesn't control, and can't block.

Claude Desktop: Reading Anthropic's Own Application

We snapped Claude Desktop — the chat application built by Anthropic, the company that invented screenshot-based Computer Use.

Result: 11,454 elements. Every chat message, every button, every link, every input field. Fully searchable. Fully queryable. Through the accessibility layer.

The irony: Anthropic built Computer Use (screenshot-based GUI automation). Anthropic also built Claude Desktop (the test target). DirectShell — the text-based alternative — read Anthropic's own application as 11,454 structured text elements. No screenshot. No vision model. One SQL query.

The company that bet on pixels built an app that describes itself perfectly in text.

Notepad: Writing a Manifesto

We snapped Notepad and the AI typed a message directly into the text area. Character by character, at human typing speed, through the raw keyboard injection pathway. Notepad had no idea the input wasn't coming from a physical keyboard.

Google Search: Hitting the Limits

This test showed DirectShell's honest limitations. Google's search page exposes minimal accessibility elements — the search results are deeply nested in a complex DOM with poor accessibility semantics. The AI struggled to navigate search results effectively.

This is not a DirectShell failure. This is a Google accessibility implementation failure. The accessibility tree is only as good as the application's accessibility implementation. Google Search, despite Google's size and resources, has mediocre accessibility support for its search results page. This directly impacts the quality of DirectShell's output.

What the Demo Proves

It's not perfect. Formulas were offset. Tab clicks didn't work on Chromium tabs (the AI switched to Ctrl+PageDown). Opera's autofill popup created confusion. Google Search exposed insufficient elements.

Every one of these failures proves that the system is real. This is not a cherry-picked demo. This is not a happy path. This is an AI agent fighting through unexpected problems in four different applications, adapting in real-time, and still delivering results in seconds — where the state of the art takes minutes and fails most of the time.

Watch It

The full 7-minute demo — uncut, unedited, every bug and every success:

Watch the demo

The Market Reality: Verified Benchmarks (February 2026)

Before you judge the demo, let me show you what the rest of the industry achieves. These are not my numbers. These are published benchmarks from peer-reviewed conferences, official product announcements, and standardized evaluation frameworks.

Desktop Agent Benchmarks

OSWorld (NeurIPS 2024) is the industry standard for evaluating AI agents on real desktop tasks across Windows, macOS, and Linux. 369 tasks, covering productivity software, system administration, and creative workflows.

| Agent | Architecture | OSWorld Success Rate | Source |
|---|---|---|---|
| AskUI VisionAgent | Screenshot + custom vision | 66.2% (leader) | OSWorld Leaderboard |
| CoAct-1 | Screenshot + collaborative agents | 60.76% | OSWorld Leaderboard |
| UI-TARS 2 (ByteDance) | Screenshot + specialized vision | 47.5% | ByteDance/UI-TARS |
| OpenAI CUA o3 (Operator) | Screenshot + GPT-4o + RL | 42.9% | OpenAI |
| Agent S2 with Claude 3.7 | Screenshot + hybrid | 34.5% | OSWorld Leaderboard |
| Claude Computer Use (standalone) | Screenshot + Claude 3.5/3.7 | 22–28% | Anthropic |
| Human baseline | Eyes + hands | 72.4% | OSWorld Paper |

(OSWorld leaderboard as of February 2026. Numbers shift weekly.)

Average time per task for AI agents: 10–20 minutes. For humans: 30 seconds – 2 minutes.

Web Agent Benchmarks

The picture is no better on the web:

| Benchmark | Agent | Success Rate | Source |
|---|---|---|---|
| WebArena (Controlled) | IBM CUGA | 61.7% | Emergent Mind |
| WebArena (Controlled) | Gemini 2.5 Pro | 54.8% | WebChoreArena |
| WebChoreArena (Hard) | Gemini 2.5 Pro | 37.8% | WebChoreArena |
| Online-Mind2Web (Real Web) | OpenAI Operator | 61% | ArXiv |
| Online-Mind2Web (Real Web) | Most agents | ~30% | ArXiv |
| Mind2Web (Task SR) | GPT-4 | 4.52% | Mind2Web Eval |
| ScreenSpot-Pro (Pro GUI) | OS-Atlas-7B | 18.9% | ScreenSpot-Pro |

Note the pattern: the more realistic the benchmark, the worse the numbers. WebArena (controlled environment): 61.7%. WebChoreArena (harder tasks): 37.8%. Online-Mind2Web (real websites): ~30%. Mind2Web strict task success: 4.52%. The ~90% success rates reported on easier benchmarks like WebVoyager collapse under stricter evaluation.

The Cost Per Perception

Every screenshot-based agent burns tokens on every glance at the screen:

| Method | Tokens per Perception | Cost per 1,000 Perceptions (Opus) | Source |
|---|---|---|---|
| Screenshot (1080p) | 1,200–1,800 | ~$4.80 | Claude Vision Docs |
| Screenshot (1440p) | 2,000–5,000 | ~$12.00 | Estimated from resolution scaling |
| Full a11y tree (JSON) | 5,000–15,000 | ~$30.00 | Measured on Claude Desktop (11,454 elements) |
| DirectShell .a11y.snap | 50–200 | ~$0.40 | Measured |
| DirectShell SQL query | 10–50 | ~$0.10 | Measured |
| DirectShell ds_events() (delta) | 20–50 | ~$0.10 | Measured |

A 50-step workflow at 1440p screenshot resolution: ~$0.60 in vision tokens alone. The same workflow via DirectShell SQL queries: ~$0.005. That's a 120x cost reduction — before accounting for the eliminated vision model inference.

Research confirms this gap. ShowUI (CVPR 2025) demonstrated that 33% of screenshot tokens are visually redundant. SimpAgent proved that masking half a screenshot barely affects agent performance — meaning half the tokens were wasted. Microsoft Research noted that screenshots "consume thousands of tokens each," making history maintenance "computationally prohibitive." Research from accessibility.works found that agents using accessibility data succeed 85% of the time while consuming 10x fewer resources.

What DirectShell Achieved on Day 1

Now compare those numbers to what a single developer built in 8.5 hours:

| Task | Time | Tokens Used | Method |
|---|---|---|---|
| Write multi-paragraph manifest to Notepad | Instant (0ms) | ~50 | ds_text (UIA ValuePattern) |
| Read entire Claude.ai Haiku conversation | 1 read (~2 sec) | ~200 | ds_screen (zoom-out trick) |
| Cross-app communication (Claude CLI → Claude.ai) | ~60 sec | ~200 | ds_type (character injection) |
| Fill 360 cells in Google Sheets (SOC Incident Log) | ~90 sec | ~150 | ds_batch + ds_type |
| Navigate to Gemini tab + interact | ~10 sec | ~50 | ds_key + ds_type |

No screenshots. No vision model. No coordinate guessing. No 15-minute waiting loops. No 34–72% failure rate.

The current desktop leader still fails one in three tasks and takes 10–20 minutes each. Most agents fail more than half the time. DirectShell filled 360 spreadsheet cells in 90 seconds — on the first day it existed.

The Google Sheets demo alone — 30 rows, 12 columns, realistic MITRE ATT&CK mappings, IPs, timestamps, severity levels, analyst assignments, response times — would take a screenshot agent dozens of perception cycles, thousands of tokens per cycle, and multiple minutes with a significant probability of failure mid-way. DirectShell did it in three batch calls, ~90 seconds, zero failures.

This is not a marginal improvement. This is a different category.


Part IV: Why This Changes Everything

11. The Paradigm Shift

Let me lay this out clearly, because the difference is not gradual. It is categorical.

Vision vs. Text: A Direct Comparison

| Dimension | Screenshot Agent (2026 SOTA) | DirectShell |
|---|---|---|
| Input to LLM | 2M+ pixel image | SQL query on local DB |
| LLM modality | Vision (non-native) | Text (native) |
| Semantic understanding | Inferred from pixel patterns | Explicit from accessibility tree |
| Element identification | Visual inference (probabilistic) | Name-based lookup (deterministic) |
| Coordinate precision | Estimated (±pixels) | Exact (BoundingRectangle from OS) |
| Cost per interaction | High (vision model inference) | Low (text only) |
| Latency | Seconds (screenshot + cloud inference) | Milliseconds (local file read) |
| Robustness | Breaks on theme/scale/language change | Immune — reads semantic names, not pixels |
| Disabled state detection | Cannot reliably detect | IsEnabled property, explicit |
| Hidden element awareness | Cannot see off-screen elements | IsOffscreen property, full tree via DB |
| Multi-element queries | Not possible | SQL queries in microseconds |
| Context window impact | High (images fill context rapidly) | Low (structured text is compact) |
| Offline capability | Requires cloud vision model | Local LLM reads local text files |
| Works with | Browsers only (effectively) | Every application on the OS |
| Success rate | ~35–42% (OSWorld benchmark) | Deterministic element identification |
| Any LLM can use it | No — requires multimodal vision | Yes — any text model works |

The last row is particularly important. Screenshot-based agents require expensive multimodal models (GPT-4o, Claude Sonnet/Opus, Gemini Pro). DirectShell works with any language model — including small, cheap, local models. Llama, Mistral, Phi, DeepSeek, Qwen — if it can read text and produce structured output, it can drive a desktop application through DirectShell.

What This Means Architecturally

The entire AI industry has been framing "computer use" as a vision problem. They built increasingly sophisticated vision-language models to interpret screenshots. They invested in multimodal training data, in spatial reasoning, in coordinate prediction, in action grounding from visual inputs.

DirectShell reframes "computer use" as a text problem. And text is what language models were built for.

This is not a better solution to the same problem. This is the realization that the problem was misidentified from the start. The industry was solving "how do we help AI see the screen better?" when the real question was "why are we making AI look at the screen at all?"


12. Why This Cannot Be Blocked

This section matters more than any other. DirectShell's technical merits are significant, but what makes it truly unprecedented is that it cannot be prevented by the targets it operates on.

The Legal Framework

The accessibility interface that DirectShell uses is protected by an interlocking network of international, regional, and national legislation:

International:

  • UN Convention on the Rights of Persons with Disabilities (CRPD) — Article 9 (Accessibility), Article 21 (Freedom of expression and access to information). Ratified by 186 states — nearly every country on Earth.

European Union:

  • European Accessibility Act (EAA) — Directive (EU) 2019/882. Requires all consumer-facing digital products and services to be accessible. Enforcement began June 2025. This is active law, not pending legislation.
  • Web Accessibility Directive — Directive (EU) 2016/2102. Requires public sector digital services to meet WCAG 2.1 Level AA, which explicitly requires programmatic accessibility (Success Criterion 4.1.2: Name, Role, Value).
  • EU Charter of Fundamental Rights — Article 26 (Integration of persons with disabilities).

United States:

  • Americans with Disabilities Act (ADA) — Title III has been interpreted by courts to apply to software and digital services.
  • Section 508 of the Rehabilitation Act — Requires federal agencies to procure accessible ICT. Explicitly references WCAG and programmatic accessibility.
  • 21st Century Communications and Video Accessibility Act (CVAA) — Requires accessibility in advanced communications services and equipment.

Germany:

  • Barrierefreiheitsstärkungsgesetz (BFSG) — German transposition of the EAA. In force since June 2025.
  • Behindertengleichstellungsgesetz (BGG) — Federal disability equality law.
  • Grundgesetz Article 3(3) — Constitutional prohibition of disability discrimination.

What This Means in Practice

The Windows UI Automation framework exists because the law requires it to exist. Applications must expose their interface elements programmatically so that screen readers and other assistive technology can access them.

DirectShell reads this legally mandated interface. It uses the exact same API calls as JAWS, NVDA, and Windows Narrator. From the operating system's perspective, DirectShell and a screen reader are indistinguishable.

A software vendor who wishes to prevent DirectShell from reading their application faces an insoluble dilemma: every countermeasure that blocks DirectShell also blocks screen readers.


13. The Unpatchability Argument

Let me make this concrete. Here is what a software vendor can try, and what happens:

| Countermeasure | Effect on DirectShell | Effect on Screen Readers | Legal Consequence |
|---|---|---|---|
| Disable UIA tree entirely | Blocked | Blocked | Violates EAA, Section 508, ADA |
| Return empty/minimal UIA data | Partially blocked | Degraded | Violates WCAG 4.1.2 (Name, Role, Value) |
| Detect and block UIA clients | Blocked | Also blocked (JAWS, NVDA, Narrator) | Discrimination against disabled users |
| Encrypt UI element names | Blocked | Screen readers can't read interface | Accessibility violation |
| Remove meaningful element names | Partially blocked | Screen readers read gibberish | WCAG violation |
| Kernel-level anti-cheat (block input) | Action injection blocked (read still works) | May block assistive input devices | Partial, read still functions |

There is no technical mechanism to distinguish between a screen reader querying the accessibility layer and DirectShell querying the accessibility layer. Both use the same COM interfaces. Both traverse the tree using the same walker objects. Both request the same element properties. The operating system does not authenticate accessibility clients. It cannot. The entire point of the accessibility framework is that any assistive technology can use it.

This creates a permanent, legally guaranteed read capability against every application that runs on the platform. The only exceptions are applications with no GUI (command-line tools, background services) — which have no UIA tree to read in the first place.

The PR Dimension

Even if a vendor could find a technical loophole, consider the public relations implications: "SAP blocks screen reader access to protect its API revenue." "Salesforce disables accessibility to prevent automation." "Oracle excludes blind users to enforce licensing terms."

No Fortune 500 company will take that headline. The PR damage alone would be existential. Disability rights organizations would sue. Government contracts would be revoked (Section 508). The EU would fine under the EAA. The entire enterprise sales operation would be jeopardized.

The legal shield is not just a technicality. It is a structural guarantee that makes DirectShell fundamentally different from every previous automation approach. Web scrapers can be blocked by CAPTCHAs, rate limits, and IP bans. API access can be restricted by authentication and terms of service. But the accessibility layer? It was built to be open. It was mandated to be open. And it will stay open — because the alternative is locking blind people out of computers.

The Untested Legal Question

I must be honest about one thing: the specific conflict between "our Terms of Service prohibit automated access" and "the law requires us to provide this accessibility interface" has never been tested in court.

No court has ruled on whether accessibility rights extend to cover automated access via accessibility APIs when the software's TOS prohibits automation. This is legally novel territory.

But the structural argument is clear:

  1. In legal hierarchies, statute supersedes contract
  2. The EAA, ADA, and BFSG are statutes
  3. Terms of Service are contracts
  4. The statute mandates the interface. The contract tries to restrict it.
  5. The statute wins.

And practically: no vendor wants to be the test case. The legal risk is asymmetric. If the vendor wins, they've established a precedent that helps them restrict accessibility APIs — terrible PR, potential regulatory backlash. If the vendor loses, they've wasted legal fees and confirmed that the accessibility layer is untouchable. The incentive structure favors non-litigation.


Part V: What DirectShell Enables

14. For AI Agents

DirectShell converts the problem of "computer use" from a vision task to a text task.

A language model operating through DirectShell does not need vision capabilities. It reads a structured text file describing the screen state, selects an action, and writes it to a database. The entire perception-action loop is text-in, text-out — the native operating mode of every language model.

Any language model can operate any application. Not only expensive multimodal models. GPT, Claude, Gemini, Llama, Mistral, DeepSeek, Phi, Qwen — any model that can read text and produce structured output can drive a desktop application through DirectShell. This democratizes computer use from a capability reserved for frontier models to a capability available to any LLM, including small local models running on consumer hardware.

Context efficiency enables complex workflows. Where a screenshot-based agent runs out of context after 10–20 actions, a DirectShell-based agent can maintain hundreds of actions in its context window. The .a11y.snap file is typically 1–5 KB. An equivalent screenshot is 100–500 KB when encoded. This means the agent can maintain 10–30x more operational history, enabling multi-application workflows, long-running processes, and recovery from errors without losing operational memory.

Deterministic targeting eliminates ambiguity. "Click the element named 'Save'" is unambiguous. "Click the button that looks like it says Save at approximately pixel (1420, 780)" is not. DirectShell removes the entire class of failures caused by visual misidentification. There are no "hallucinated coordinates." There is a database query that returns the exact element or nothing.

Continuous background monitoring becomes feasible. With screenshots, checking "did an email arrive?" costs thousands of tokens and several seconds. With DirectShell, it costs one SQL query and returns in microseconds:

SELECT count(*) FROM elements WHERE role='ListItem' AND name LIKE '%unread%'

An agent can check every 500ms. All day. At negligible cost. This enables reactive agents that respond to events in real-time — something that is economically and technically impossible with screenshot-based approaches.
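As a minimal sketch of what such a reactive loop could look like in Python, assuming Outlook is snapped, that its database follows the ds_profiles/{app}.db convention from Appendix A.4, and that unread messages carry "unread" in their element names (an application-specific detail):

```python
import sqlite3
import time

DB = "ds_profiles/outlook.db"  # hypothetical path; the real name comes from is_active

while True:
    con = sqlite3.connect(DB)
    con.execute("PRAGMA journal_mode=WAL")  # DirectShell writes in WAL mode
    try:
        (unread,) = con.execute(
            "SELECT count(*) FROM elements "
            "WHERE role='ListItem' AND name LIKE '%unread%'"
        ).fetchone()
    except sqlite3.OperationalError:
        unread = 0  # table is mid-rebuild (it is dropped and recreated every cycle)
    finally:
        con.close()
    if unread:
        print(f"{unread} unread item(s) visible, wake the agent")
    time.sleep(0.5)  # matches the 2 Hz tree refresh; polling faster gains nothing
```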


15. For Enterprise Software

This is where DirectShell becomes an industry-disrupting force.

The End of API Lock-In

The enterprise software industry derives significant revenue from controlling access to application data through proprietary APIs. SAP charges for API access. Salesforce charges per-user per-month for programmatic access. Oracle charges for integration licenses. ServiceNow, Workday, Datev — hundreds of vendors charge for the privilege of accessing data that their customers already own, through interfaces that their customers already pay for.

The business model is: your data lives in our application, and if you want to access it programmatically, you pay us extra.

DirectShell offers an alternative. Any data visible in the application's user interface is accessible through the accessibility tree. If a field is displayed on screen, its name and value are in the element tree. If a table is rendered, its rows and columns are traversable. The data does not need to be extracted through the vendor's API — it is already published through a legally mandated accessibility interface that the vendor cannot disable.

This does not replicate full API functionality. It does not provide bulk data export, webhook-based event triggers, or server-side query optimization. What it provides is universal read access to any data the application displays to the user, and universal write access to any input the application accepts from the user. For the vast majority of automation use cases — filling forms, extracting displayed data, navigating workflows, operating applications — this is sufficient.

The Integration Nightmare, Solved

Every enterprise on Earth has the same problem: System A doesn't talk to System B. SAP doesn't talk to the custom warehouse software from 2004. The hospital management system doesn't talk to the billing software. The CRM doesn't talk to the invoicing tool.

For this problem, an entire industry exists: MuleSoft (acquired by Salesforce for $6.5 billion), UiPath (multi-billion valuation), Automation Anywhere, Celonis, the entire iPaaS (Integration Platform as a Service) market, middleware vendors, connector vendors, system integrators. Thousands of companies whose sole purpose is to make applications talk to each other.

DirectShell makes them obsolete. Not in ten years. Now.

A Python script with 20 lines snaps SAP, snaps Excel, snaps the invoicing system. Reads from one, writes to the others. No API key. No license fee. No vendor conversation. No six-month integration project costing €200,000. Just SQL queries against DirectShell databases and SQL INSERTs into action queues.
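A minimal sketch of such a bridge, under the assumption that both applications are snapped; the database paths and the element names ("Invoice Total", "Amount") are placeholders, not names DirectShell defines:

```python
import sqlite3

SRC = "ds_profiles/invoicing.db"  # placeholder paths for two snapped applications
DST = "ds_profiles/sap.db"

# Read a value that application A currently displays...
src = sqlite3.connect(SRC)
row = src.execute(
    "SELECT value FROM elements WHERE name = 'Invoice Total' LIMIT 1"
).fetchone()
src.close()

# ...and queue it as input into application B via the action queue.
if row and row[0]:
    dst = sqlite3.connect(DST)
    dst.execute(
        "INSERT INTO inject (action, text, target) VALUES ('text', ?, 'Amount')",
        (row[0],),
    )
    dst.commit()
    dst.close()
```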

The entire premise of the integration industry — "these systems can't talk to each other, so you need us to bridge them" — dissolves when every system has a universal, structured, non-proprietary interface.


16. For Accessibility

The accessibility community should know about DirectShell not just because it uses their infrastructure, but because it extends it.

Universal LLM in Every Text Field

Today, AI writing assistance exists in specific applications: Copilot in Microsoft Office, Gemini in Google Workspace, Grammarly in supported browsers and apps. Each integration is built individually by the vendor, for their specific application.

DirectShell makes it possible to add LLM assistance to every text field in every application on the planet. The keyboard hook intercepts the user's input. A local LLM processes it. The corrected or enhanced text is injected into the application. The application never knows.

For a person with dyslexia, this means: every input field in every application automatically corrects spelling errors before they appear. Not just in Google Docs, where a spell checker exists. In the 20-year-old hospital information system. In the internal ticketing tool from 2008. In SAP's input masks. Everywhere.

For a person who speaks one language but needs to write in another: every text field becomes a live translation interface. Type in German, the application receives English. Without the application knowing or cooperating.

For a person with motor impairments: voice-to-text can be injected into any application, regardless of whether that application supports voice input.

Grammarly is valued at $13 billion. It works in browsers and in apps that explicitly integrate it. DirectShell could make its core functionality available in every application on the OS — for free, using any local LLM.

The Daily Use Case

Imagine this scenario: Lena from accounting needs to write an email to a client about a delayed shipment. She opens Outlook and types into the email body:

tell client mueller shipment delayed because of supplier, friendly, apologetic

DirectShell intercepts this. An LLM transforms it into a professional business letter:

Dear Mr. Mueller,

Thank you for your patience. We regret to inform you that your shipment
(Order #47112) has been delayed due to unforeseen issues with our
primary supplier. We expect delivery within 5-7 business days.

We sincerely apologize for the inconvenience and appreciate your
understanding.

Best regards,
Lena Schmidt

Lena didn't open a ChatGPT tab. She didn't copy-paste between applications. She didn't learn any AI tool. She typed what she wanted in her normal email program, and a professional letter appeared. The LLM and DirectShell were invisible.

This works in every application with a text field. Not because every application integrated AI. Because DirectShell sits between the keyboard and every application.


17. For Legacy Systems

Every government agency, every hospital, every insurance company, every bank has systems from the 1990s or 2000s that hold critical data but have no API, no export function, and no way to extract information except by having a human sit in front of the screen and manually transcribe.

These systems often display data on screens that look like green text on black backgrounds — terminal emulators running mainframe sessions, custom Windows forms built in Visual Basic 6, applications from vendors that went bankrupt a decade ago.

The data trapped inside these systems is critical — patient records, tax records, insurance policies, financial transactions. The digital transformation everyone talks about — the reason organizations spend millions on "modernization" — often boils down to one problem: getting data out of old systems and into new ones.

DirectShell solves this without touching the old system. The legacy application keeps running as it always has. Snap it. DirectShell reads the accessibility tree and exposes every displayed element as structured data. A Python script iterates through screens, extracting records into a modern database. No reverse engineering. No modification of the legacy application. No risk of breaking a system that nobody understands anymore but everyone depends on.

The digital transformation that hasn't happened in 20 years — because nobody can replace the old systems and nobody can extract the data — doesn't need to happen anymore. The data is already accessible. It was always accessible. Through the accessibility layer that the law requires to exist.


18. For the Software Industry

The RPA Market

The global RPA (Robotic Process Automation) market is projected to exceed $80 billion by 2030. UiPath alone has a market capitalization of billions. Automation Anywhere, Blue Prism, Microsoft Power Automate, WorkFusion — all sell essentially the same thing: the ability to automate applications that don't have APIs.

Their tools use a combination of accessibility selectors, image matching, coordinate clicking, and OCR. They require per-application scripting. They require specialized training. They require enterprise licenses.

DirectShell reduces their entire value proposition to a single binary with no dependencies. Not because DirectShell is a better RPA tool — DirectShell is not an RPA tool at all. It's the infrastructure that makes RPA tools unnecessary. The same way a web browser made dedicated Gopher clients, FTP clients, and Telnet clients unnecessary — not by being a better version of each, but by providing a universal interface that subsumed them all.

Anti-Cheat Systems

The gaming industry invests heavily in preventing automated input. DirectShell's action queue enables programmatic control of any application, including games. Kernel-level anti-cheat systems (Riot Vanguard, Easy Anti-Cheat, BattlEye) can detect and block certain forms of SendInput calls — affecting DirectShell's write capability.

But they cannot block the read capability. Any game that renders UI elements (health bars, minimaps, inventory screens, HUD elements) exposes them through the accessibility tree. Knowing every element on screen — every health value, every minimap position, every inventory item — is arguably more disruptive than the ability to inject input.

Terms of Service

Many applications prohibit automated access in their Terms of Service. The enforceability of such terms against a tool that uses a legally mandated accessibility interface is untested. The conflict between "our TOS says you can't automate" and "the law says you must provide this interface" creates legal uncertainty that favors the user, not the vendor.

DRM and Content Protection

Applications that display protected content (e-books, streaming subtitles, licensed data) expose that content through the UIA tree if it is rendered as accessible text. The accessibility requirement creates a structured, text-based output channel for content that may otherwise be protected against copying.


19. The 100 Use Cases: What You Can Build

Everything that follows is enabled by a single 700 KB binary and the accessibility infrastructure that already exists on every computer.

Reading Out: Data Extraction Use Cases

These use cases involve extracting information from applications that was previously locked behind proprietary GUIs:

1. Real-Time Dashboards from Any Application

Your boss wants to know how many tickets are open, what the revenue is today, how many emails are unanswered. Currently: someone logs into three systems and manually builds a report. With DirectShell: snap the ticket system, snap the accounting software, snap Outlook — simultaneously, continuously, in real-time. Live dashboard from applications that never had APIs and never will. The entire BI industry (Tableau, Power BI, Looker) assumes you need database access or API connections. DirectShell only needs an open window.

2. Legacy System Data Liberation

Every agency, hospital, and insurance company has systems from the 90s containing critical data with no export function. The only way to get data out: a human sits there and types it into another system. Snap the legacy system. A script reads every screen, every field, every value — structured, queryable, in real-time. The digital transformation that hasn't happened in 20 years doesn't need to happen anymore. The data is accessible through the window.

3. Competitive Intelligence and Price Monitoring

Every software that displays prices, every platform that lists offers — including desktop applications that don't allow web scraping. Trader terminals. Dealer software. Internal procurement systems. If it's on a screen, DirectShell can read it. Structured. Continuously. Into a database.

4. Scientific Data Capture

Lab instruments whose software was written in 2003 and only displays measurements on screen. No export. No CSV. No API. The doctoral student sits next to it and manually transfers values to Excel. With DirectShell, measurements are captured in real-time, continuously, into a database. The doctoral student sleeps.

5. Quality Assurance Without Source Code

You receive delivered software. You want to verify: does it display correct values? Are the calculations right? Currently: manual testing or access to source code. With DirectShell: automated verification of every output, every display, every calculation — without ever opening the source code. Every audit, every certification, every acceptance test becomes automatable.

6. Universal Search Across All Applications

One search bar. All open applications simultaneously. "Find the invoice from Mueller" — DirectShell searches Outlook, SAP, the file system, the industry software, the browser. At the same time. Structured. Because it has all of them as databases. No Alt-Tab. No five different search masks. One query.

7. Compliance Audit Automation

Every input in every application, logged. Structured. In a database. "Show me every booking that employee X made in SAP between 2pm and 4pm." The auditor doesn't get PDF reports anymore. They get SQL access to everything that was ever displayed on a screen. Without SAP needing to provide an audit trail.

8. Application Usage Analytics

IT departments can see which software is actually being used, how it's being used, which features are accessed, and which workflows are performed — without installing monitoring agents in the applications themselves. Shadow IT detection becomes trivial.

Writing In: Control and Input Use Cases

These use cases involve sending input to applications to control them:

9. Universal AI Agent Connector

Any LLM controls any GUI via text. No screenshots, no vision model, no per-application integration. The AI reads the .a11y.snap, understands the screen in 5 lines, writes an INSERT to the inject table, and the application responds. This works for any application, any model, any programming language that can open a SQLite file.

10. Cross-Application Workflow Automation

"When an email from Purchasing arrives in Outlook containing 'urgent', extract the order number, open SAP, enter it, and confirm." No human integrated Outlook and SAP. No middleware. No API connection. Snap Outlook. Snap SAP. One reads, one writes. Done. Every workflow that a human performs manually between two programs is automatable. Without the programs knowing about each other.

11. Universal LLM in Every Text Field

Every input field in every application becomes LLM-enhanced. Spell correction for dyslexics. Live translation. Auto-formatting. Professional tone transformation. Without the application cooperating. Without the user installing anything per application. One layer, everywhere.

12. Application as Frontend Proxy

This is one of the most mind-bending use cases. DirectShell can intercept input before it reaches an application and redirect it. The user types in a chat field. DirectShell catches the input before it's sent. It routes the request to a local LLM, a different service, or a custom backend. The response appears in the chat field as if the original application had generated it.

You're using Claude Desktop as a frontend — but your message never reaches Anthropic's servers. DirectShell intercepted it, processed it locally, and injected the response. The application is a shell. What happens underneath is determined by whoever controls DirectShell.

Every SaaS application in the world is built on the assumption that the user's input goes to their server. DirectShell breaks that assumption.

13. Voice Control for Any Application

Add voice input to any application that doesn't support it. Speech-to-text outputs to DirectShell, which types into whatever application is active. No application integration needed.

14. Forced Copy-Paste

Some applications block Ctrl+C and Ctrl+V in certain fields (DRM, security, "we don't want you copying this"). DirectShell reads the field value through UIA (read path) and can set values through UIA (write path). The copy-paste restriction exists only in the application's keyboard handler. DirectShell bypasses it entirely by operating at the UIA level.

15. Macro Recording and Replay

Record what a user does in any application (every click, every keystroke, every field value change) and replay it later. Not pixel-based macros that break when a button moves — semantic macros that say "click the element named Save" and work regardless of where that button is on screen.

Bidirectional: Reading and Writing Combined

16. Automated Form Filling

Snap Application A. Snap Application B. Read from one, write to the other. No API. No integration middleware. No CSV export/import. Works with any two applications on the planet.

17. Universal Testing Framework

Snap the application under test. Click this button, verify that field now shows this value. DirectShell reads the expected output and compares it to actual. No test harness inside the application needed. No source code access. Works on compiled binaries, on SaaS apps, on anything with a window.

18. Data Migration Between Systems

Moving from one CRM to another? One accounting system to another? Normally this is a six-month project with consultants and custom scripts. Snap the old system. Snap the new one. Read from one, write to the other. Slow compared to API migration, but it works with any source and any target, including systems that have no export capability whatsoever.

19. Real-Time Data Synchronization

Keep two applications in sync. Snap both. When a value changes in Application A, DirectShell detects the change (next tree dump), extracts the new value, and writes it into Application B. No middleware. No message queue. No integration platform. Two snapped windows and a simple script.

20. Regulatory Compliance Verification

Software can be verified from the outside to check whether it displays legally required disclosures, warnings, or information. A regulator doesn't need access to source code — DirectShell reads the production UI and verifies compliance in real-time.


20. The Dark Side: What This Also Enables

A primitive is neutral. Like fire. Like the internet. Like cryptography. Like the printing press. Its value and its danger come from the same source: its universality.

I refuse to pretend the dark side doesn't exist. Acknowledging it before others discover it is how you control the conversation instead of being controlled by it. Here is what DirectShell also makes possible:

Surveillance on a New Level

Employee monitoring today works through periodic screenshots (every 5 minutes) or network traffic analysis. Both are coarse-grained.

DirectShell enables structured, real-time, queryable surveillance. Not screenshots that show a blurry image of what was on screen — a database of every field, every value, every input, every element. "What did Employee X type into the CRM between 14:00 and 16:00?" is a SQL query. "Did anyone access the salary table in SAP today?" is a SQL query. Every application becomes a structured surveillance feed.

This is employee monitoring on a level that didn't exist before. Not because the technology was particularly difficult — screen recording has existed for decades — but because the output is structured, queryable, and integrable. You don't need a human to watch recordings. You write SQL queries against interaction databases.

Malware with Structured UI Access

Today's malware can take screenshots and record keystrokes. Both are unstructured — the attacker gets images and character streams that require interpretation.

DirectShell's architecture enables malware that understands applications structurally. It doesn't record a keystroke stream and hope to find a password — it queries the element tree for password fields and reads their values. It doesn't screenshot a banking app and try OCR — it queries for the account number field, the balance field, the transfer form.

And it can act: when the banking app is open, structurally identify the transfer form, fill in the attacker's IBAN, enter the amount, and click confirm. Deterministically. Reliably. Without the coordinate-guessing errors that make current automation-based malware unreliable.

Credential Harvesting

Any password that is displayed in a UI field (even briefly, even masked with dots) has a corresponding entry in the accessibility tree. Password managers that display credentials in their UI expose those credentials through UIA. "Remember password" dialogs expose the password value. Auto-fill popups expose credentials.

The read path through the accessibility layer is legally protected and cannot be patched. Any application that displays sensitive information in a UI element is exposing that information to any process on the system that queries the accessibility tree.

Automated Social Engineering

DirectShell can monitor communication applications (email, chat, messaging) and wait for specific triggers — a wire transfer request, a credentials exchange, an authorization approval. When the trigger appears, it can modify the conversation in real-time: change an IBAN in an email, alter an approval in a workflow, inject a message into a chat. The modification happens at the UI level — below where network-based security tools operate.

Game Cheating

Any game that renders UI elements (health bars, minimaps, inventory screens, cooldown timers) through the accessibility tree exposes that information to DirectShell. An aimbot doesn't need pixel analysis when enemy positions are in the UIA tree. An inventory manager doesn't need image recognition when item names are text elements.

Kernel-level anti-cheat can block the write path (input injection) but cannot block the read path without simultaneously blocking screen readers. The information advantage alone — perfect knowledge of every UI element — is a significant cheat even without input automation.

The Ethical Position

I'm publishing this not despite the risks, but because of them. The accessibility layer has existed for 29 years. The capability I'm describing has been latent for 29 years. I am not creating a new vulnerability — I am documenting one that has existed since 1997.

By publishing openly, I ensure:

  1. The security community can develop defenses
  2. The conversation about accessibility API security happens publicly, not behind closed doors
  3. Users understand what is possible on their systems
  4. The response to these risks is informed by understanding, not by surprise

Every significant technology has this dual nature. The printing press enabled mass education and mass propaganda. Cryptography enables privacy and enables crime. The internet enables global communication and enables global surveillance. DirectShell enables universal automation and enables universal access to any application's UI state.

The question is not whether this capability should exist. It already exists. The question is who understands it first: the people who will use it constructively, or the people who will exploit it destructively.

I choose to tell everyone at the same time.


Part VI: Honest Assessment

21. Limitations

Accessibility Implementation Quality

The UIA tree is only as informative as the application's accessibility implementation. Applications with poor accessibility practices may have:

  • Unnamed elements — buttons without labels (the accessibility tree shows "Button" with no name)
  • Missing roles — custom controls reported as "Custom" instead of their functional role
  • Absent values — text fields that don't expose their content programmatically
  • Flat hierarchies — no meaningful parent-child relationships
  • Canvas-based content — games, design tools, PDF viewers, and map applications that render to a canvas may expose limited accessibility data for the rendered content. A game rendering a 3D scene does not describe every visual element in the UIA tree.

In practice, major applications (Microsoft Office, browsers, SAP GUI, enterprise software subject to Section 508 requirements) have comprehensive accessibility implementations. The trend is toward better accessibility, not worse — driven by EAA enforcement since June 2025 and increasing Section 508 enforcement in the US.

Smaller or legacy applications may have gaps. The quality of DirectShell's output directly correlates with the quality of the target application's accessibility support.

Single-Application Scope

DirectShell v0.2.0 attaches to one target application at a time. Multi-application workflows require re-snapping between applications. This is an engineering limitation, not an architectural one — the system is designed to extend to multi-window operation with multiple DirectShell instances.

Performance Boundaries

A full accessibility tree traversal of a complex application (browser with many tabs, IDE with large project) can take 200–800ms. DirectShell's streaming architecture ensures partial data is available during traversal, but extremely complex interfaces may experience slight lag in the refresh cycle.

The 2 Hz refresh rate means UI changes are detected with up to 500ms latency. For most automation tasks this is imperceptible. For time-critical operations (responding to rapidly changing data), this introduces a half-second delay.

Write-Side Restrictions

Kernel-level anti-cheat systems can detect and block certain forms of SendInput calls. This affects DirectShell's action capabilities but not its read capability. The read pathway operates through the accessibility framework at a higher abstraction level and cannot be blocked without affecting assistive technology.

Additionally, some applications that aggressively reject programmatic text input (some chat fields, some security-sensitive inputs) may not respond to ValuePattern.SetValue(). DirectShell's type action (raw keyboard injection) works as a fallback in most of these cases, but some edge cases may require application-specific handling.

v0.2.0 Bugs

This is version 0.2.0. It was built in 8.5 hours. There are bugs. Formula offset errors in spreadsheets. Chromium tab switching doesn't work via UIA click (the workaround is keyboard shortcuts). Opera's autofill popup can interfere with input injection. Google Search has poor accessibility semantics that limit DirectShell's effectiveness.

These are first-day bugs that will be fixed. They do not indicate architectural limitations. The architecture is sound. The implementation is iterating.


22. What's Missing

MCP Server Integration

DirectShell currently communicates through the file system: output files are read, SQL is written to the database. The next major step is an MCP (Model Context Protocol) server that exposes DirectShell's capabilities as standardized tool calls, enabling any MCP-compatible LLM agent to use DirectShell natively through structured API calls rather than file I/O.

App Profiles

Every application has its own quirks: element naming conventions, navigation patterns, field layouts. Currently, the AI must discover these from scratch each time. App profiles — community-contributed configuration files that describe how to interpret and operate specific applications — will eliminate this bootstrapping cost.
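To make the idea concrete, a profile could be as simple as a declarative mapping from intents to element names, roles, and key sequences. The structure below is purely illustrative; it is not a format that DirectShell v0.2.0 defines:

```python
# Purely illustrative profile sketch. Neither the keys nor the element names
# are part of DirectShell v0.2.0; a real profile format is still to be defined.
OUTLOOK_PROFILE = {
    "app": "outlook",
    "elements": {
        "new_mail_button": {"role": "Button", "name": "New Email"},
        "to_field":        {"role": "Edit",   "name": "To"},
        "subject_field":   {"role": "Edit",   "name": "Subject"},
        "send_button":     {"role": "Button", "name": "Send"},
    },
    "quirks": {
        "send_shortcut": "ctrl+enter",  # faster and more reliable than clicking Send
    },
}
```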

Character Transformation Middleware

The keyboard hook currently passes through all input unchanged. The architecture is ready for middleware that transforms input in real-time: PII sanitization, auto-translation, spell correction, auto-formatting. The slot is built. The middleware hasn't been written yet.

Multi-Window Support

Operating multiple applications simultaneously requires running multiple DirectShell instances. Coordinated multi-application workflows (read from App A, write to App B) currently require external orchestration. Built-in multi-window support is a planned feature.

Cross-Platform

DirectShell currently targets Windows. Equivalent accessibility frameworks exist on macOS (NSAccessibility/AXUIElement), Linux (AT-SPI2), Android (AccessibilityService), and iOS (UIAccessibility). The architectural pattern — attach, walk tree, store in database, expose action queue — transfers to any platform. The legal protections (EAA, ADA) apply regardless of operating system.


Part VII: The Vision

23. The Network Effect of Configuration

Here is the long-term vision. Today, DirectShell knows how to handle a handful of applications. We are the first users on the planet.

But every application needs to be learned only once. By anyone.

Imagine an open-source repository: directshell-profiles/. SAP. Datev. Excel. Outlook. AutoCAD. Bloomberg Terminal. Every industry software. Every legacy system. Every government application.

Thousands of contributors, each spending 30 minutes documenting their niche application's element structure, navigation patterns, and quirks. Like browser extensions. Like npm packages. Like Docker images.

Once that repository exists, the bootstrapping cost for any automation drops to zero. You want to automate SAP? The profile exists. You want to read the hospital software from 2006? Someone in a hospital committed the profile three months ago. git pull, load profile, go.

And here is what makes profiles fundamentally different from other automation configurations: they don't break on updates. Traditional RPA scripts break when a button moves by 10 pixels. Web scraping scripts break when a CSS class changes. But DirectShell profiles are based on semantic element names and roles. The Save button is still called "Save" after an update. The input field for "Customer Number" still has the role "Edit." The profiles are stable in a way that no pixel-based or DOM-based automation can achieve.

PowerShell has over 10,000 cmdlets today — not because Microsoft wrote them all, but because the community did. DirectShell profiles are the cmdlets of the frontend. The primitive provides the mechanism. The community provides the knowledge.

DirectShell doesn't get better because we improve it. It gets better because everyone who uses it improves it. That is the network effect of a primitive.


24. Cross-Platform Potential

The architecture is platform-specific in implementation but platform-universal in concept:

| Platform | Accessibility Framework | Legal Protection | Status |
|---|---|---|---|
| Windows | UI Automation (UIA) | ADA, Section 508, EAA, BFSG | v0.2.0 — Working |
| macOS | NSAccessibility / AXUIElement | ADA, EAA | Planned |
| Linux | AT-SPI2 (Assistive Technology SPI) | EAA | Planned |
| Android | AccessibilityService API | ADA, EAA | Possible |
| iOS | UIAccessibility | ADA, EAA | Possible |

The core pattern — attach to application, walk accessibility tree, store in database, expose action queue — is transferable to any platform. The legal protections apply cross-platform: the EAA covers all digital products in the EU regardless of operating system, and the ADA applies to digital services regardless of platform.

A cross-platform DirectShell would mean: the same structured interface to every application, on every operating system, on every device. The same automation scripts work on Windows, macOS, and Linux. The same AI agent can operate any application on any platform.


25. What Will Actually Happen

I owe you an honest prediction. Not hype. Not best-case fantasy. What will actually happen when this goes live.

Weeks

Someone will wrap an MCP server around DirectShell. It will take them an afternoon. After that, any LLM that speaks MCP — Claude, GPT, Gemini, every local model running through LM Studio or Ollama — can operate any application on any Windows machine. Natively. Out of the box.

This will be the first viral derivative. Not DirectShell itself. The MCP wrapper. Because the headline won't be "new accessibility tool released" — it will be "I taught my local Llama to operate SAP. It took 20 minutes." That Hacker News post will be the ignition point.

Someone else will build a GUI around it. Someone will build a profile editor. Someone will write the first automation cookbook. The derivatives will multiply faster than DirectShell itself could ever develop.

Months

The community will explode. Not because of marketing — because of utility. Every developer who snaps their first application has the same reaction: "Wait, this works with EVERYTHING?"

A profile repository will emerge. directshell-profiles/ on GitHub. SAP. Datev. Excel. Outlook. AutoCAD. Bloomberg Terminal. Every industry application. Every legacy system. Contributed by thousands of users who each spend 30 minutes documenting their niche application's element structure. Like Docker images. Like npm packages. Like browser extensions.

Someone will port DirectShell to macOS using NSAccessibility. Someone will port it to Linux using AT-SPI2. The AGPL license ensures every fork stays open. The ecosystem grows in directions I cannot predict or control. That's the point. That's what makes it a primitive and not a product.

One Year

Three things happen simultaneously:

The RPA industry contracts. UiPath is valued at $7 billion. Automation Anywhere just closed another funding round. Their entire business model is: "We help you automate applications that don't have APIs." That is now a single binary. Not in three years. Now. Their stock prices won't react immediately — but their sales pipeline will dry up. Why pay €50,000 per year for UiPath when an open-source binary does the same thing? The smart ones will pivot to building on top of DirectShell. The slow ones will lobby for regulation.

API revenue models come under pressure. SAP, Salesforce, ServiceNow — they all sell programmatic access to data that is already visible on the screen. DirectShell makes that access free. Not for every use case. Bulk export, webhooks, server-side logic — you still need the API for those. But for "read what's on the screen and enter it somewhere else" — the majority of all enterprise integrations — the business model is dead. Some vendors will try to sabotage their accessibility implementation. They will fail, because the law prevents it. Some will market DirectShell compatibility as a feature. Those are the smart ones.

The security discussion becomes existential. Within the first months, a proof-of-concept will surface: malware that uses the accessibility layer to read banking applications. Structured. Reliable. Not patchable. The infosec community will split. One side demands a ban. The other side says: the interface was always open, DirectShell just made it visible. I will be in the middle. The responsible-disclosure section in this paper will be the reason I'm perceived as the person who named the risks — not the person who created them.

What I Will Experience Personally

Job offers. Microsoft Research, Anthropic, Google DeepMind — they'll knock. Not because I built a good tool, but because I saw something their entire teams missed. That's rare. That's valuable.

Simultaneously: hostility. "Irresponsible." "Dangerous." "Should never have been published." This will come. It belongs to the territory. Every fundamental technology has this phase. The printing press enabled mass education and mass propaganda. The people who condemned Gutenberg are forgotten. The books remain.

Why It Won't Be Ignored

Three criteria determine whether a technology persists or fades:

  1. Does it work? — Verifiably. Download the binary, snap any application, see structured output in 500ms. No demo, no video, no trust required. You verify it yourself in 30 seconds.

  2. Does it solve a real problem? — The $300 billion screenshot problem. The enterprise integration nightmare. The legacy data prison. The accessibility gap. Real problems. Measured in billions. Felt by millions.

  3. Is it reproducible? — 2,053 lines of Rust. Two dependencies. Single binary. AGPL source code. Any competent developer reads it in an afternoon and understands every line.

Technologies that satisfy all three criteria do not disappear. They sometimes need days, sometimes weeks, sometimes a lucky retweet. But they do not disappear. Because the moment one person verifies it, they tell two people. And those two people verify it themselves. And the chain doesn't break because it's not based on hype — it's based on a binary that does what it claims, every time, on every machine.


26. Timeline

  • 1997: Microsoft Active Accessibility (MSAA) introduced in Windows 95/98. The accessibility layer begins.
  • 2001: macOS Accessibility introduced. AT-SPI for Linux. The accessibility layer becomes cross-platform.
  • 2006: UI Automation framework ships with Windows Vista and .NET Framework 3.0. The modern, complete accessibility API.
  • 2019: European Accessibility Act adopted (EU 2019/882). Accessibility becomes legally mandated.
  • 2023–2025: OpenAI, Anthropic, and Google launch screenshot-based computer use agents. Hundreds of billions invested in the wrong approach.
  • 2024: Microsoft UFO published — uses UIA as one component in a hybrid agent (not as universal interface).
  • June 2025: European Accessibility Act enforcement begins. Every consumer-facing digital product must be accessible.
  • February 16, 2026, 12:00: First line of DirectShell code written.
  • February 16, 2026, 20:30: DirectShell v0.2.0 — first successful multi-application control by an AI agent through the accessibility layer, without screenshots. Four applications operated. 11,454 elements read from a single application. Documented on video.

8.5 hours. One person. One AI assistant. 2,053 lines of Rust. Two dependencies. One binary. Zero screenshots.


27. Conclusion

The AI industry's current approach to desktop automation — screenshot capture and visual inference — is a workaround for a problem that was already solved. The accessibility layer provides everything that screenshots provide and more: structure, semantics, state, hierarchy, queryability. It provides it faster (milliseconds vs. seconds), cheaper (text vs. images), more reliably (deterministic lookup vs. probabilistic inference), and more efficiently (10–30x fewer tokens per interaction).

DirectShell makes this layer usable as a universal application interface. It requires no cooperation from software vendors. It works with every application on the platform. And it is protected by the same laws that protect the right of disabled people to use computers — laws that exist in virtually every jurisdiction on Earth and that no software vendor can circumvent without facing legal consequences.

The technology described in this paper was built in a single session by one developer and one AI agent. The reference implementation is a single compact binary with no external dependencies. The implications extend to every application, every operating system, and every business model that depends on controlling access to graphical interfaces.

Every other approach in 2026 sends images to text models.
DirectShell sends text to text models.

That is the entire insight. And it changes everything.

Snap any app. Read it as text. Control it as text. That's it. That's the primitive.

The rest is just the world catching up.


Tomorrow, 20:00 — Prior Art Whitepaper + full repository. AGPL. Open Source.

The door was always open. I just looked through it first.


Listen. DirectShell is not perfect. It's Day 1. Literally. There are bugs. There are errors. A hundred things that need to get better. But none of that matters. The first browser couldn't render 90% of web pages correctly. The first lightbulb flickered. Every foundational technology begins empty and broken — because the point was never whether it works perfectly now. The point is what it will make possible tomorrow.

The moment a community builds a profile repository — configs for every program on Earth — AI will natively operate every desktop application faster, more efficiently, and more productively than any human ever could. Not in ten years. Not after the next funding round. The infrastructure is here. Today. In 700 kilobytes.

Google. Microsoft. OpenAI. Anthropic. Call me. Let's talk. Let's revolutionize the world of AI in one stroke.

Peace at last.

And now I'm going to sleep for 12 hours.

— Martin Gehrken, February 17, 2026


DirectShell v0.2.0
dev.thelastrag.de
AGPL-3.0 License


Appendix A: Architecture Deep Dive

For developers who want to understand the internals, fork the code, or build on DirectShell, this appendix provides a detailed technical reference.

A.1 System Overview

DirectShell.exe (Win32 GUI, ~700 KB)
├── Main Thread: Message loop, window procedure, painting, timer dispatch
├── Tree Thread (spawned per dump): UIA tree walk, SQLite write, file generation
└── Keyboard Hook: Global low-level keyboard interception (WH_KEYBOARD_LL)

A.2 Dependencies

| Crate | Version | Features | Purpose |
|---|---|---|---|
| rusqlite | 0.31 | bundled | SQLite database (bundled C library, no system dependency) |
| windows | 0.58 | Win32 API bindings | Full Win32, UIA, COM, GDI, Input |

The windows crate features used:

| Feature | Usage |
|---|---|
| Win32_Foundation | HWND, RECT, BOOL, LRESULT, WPARAM, LPARAM |
| Win32_UI_WindowsAndMessaging | Window creation, messages, timers, hooks |
| Win32_Graphics_Gdi | GDI painting, brushes, pens, double buffering |
| Win32_UI_Accessibility | IUIAutomation, tree walking, element properties |
| Win32_System_Com | CoInitializeEx, CoCreateInstance |
| Win32_UI_Input_KeyboardAndMouse | SendInput, virtual key codes |

A.3 Database Schema

-- Every UI element = one row, rebuilt every 500ms
CREATE TABLE elements (
    id            INTEGER PRIMARY KEY,
    parent_id     INTEGER,
    depth         INTEGER,
    role          TEXT NOT NULL,
    name          TEXT,
    value         TEXT,
    automation_id TEXT,
    enabled       INTEGER DEFAULT 1,
    offscreen     INTEGER DEFAULT 0,
    x INTEGER, y INTEGER, w INTEGER, h INTEGER
);

-- Window metadata
CREATE TABLE meta (
    key   TEXT PRIMARY KEY,
    value TEXT
);

-- Action queue (persists across tree rebuilds)
CREATE TABLE inject (
    id     INTEGER PRIMARY KEY AUTOINCREMENT,
    action TEXT DEFAULT 'text',
    text   TEXT NOT NULL,
    target TEXT DEFAULT '',
    done   INTEGER DEFAULT 0
);

WAL mode is enabled for concurrent read/write access. External processes should also set PRAGMA journal_mode=WAL when opening the database.

The elements table is dropped and recreated on every tree dump (every 500ms). This avoids freelist bloat from DELETE operations and ensures a clean state on each cycle. Indices are not recreated during dumps — this is intentional, as indices slow down INSERT operations and the table is rebuilt so frequently that query performance relies on SQLite's efficient sequential scan.

The inject table persists across dumps. Completed actions remain with done=1. External processes write new actions; DirectShell reads and executes them.
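A sketch of the external side of this contract in Python (the database path is a placeholder; the schema is the one above, and the action semantics are documented in A.5):

```python
import sqlite3
import time

DB = "ds_profiles/notepad.db"  # placeholder; real path follows the {app}.db convention

con = sqlite3.connect(DB)
con.execute("PRAGMA journal_mode=WAL")

# Queue a click on the element named 'Save' (see the action reference in A.5).
cur = con.execute("INSERT INTO inject (action, target) VALUES ('click', 'Save')")
con.commit()
action_id = cur.lastrowid

# DirectShell executes the queued action on one of its next cycles and sets done = 1.
while True:
    (done,) = con.execute(
        "SELECT done FROM inject WHERE id = ?", (action_id,)
    ).fetchone()
    if done:
        break
    time.sleep(0.5)
con.close()
```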

A.4 External Interface Protocol

External Process (e.g., Claude Code CLI Agent)
├── READ:  ds_profiles/is_active        ← Check snap state + discover file paths
├── READ:  ds_profiles/{app}.a11y       ← Understand screen content
├── READ:  ds_profiles/{app}.a11y.snap  ← Identify operable elements
├── READ:  ds_profiles/{app}.snap       ← All interactive elements (for scripts)
├── READ:  ds_profiles/{app}.db         ← Full element tree (SQL queries)
└── WRITE: ds_profiles/{app}.db         ← INSERT INTO inject table (actions)

The is_active file is the entry point. An external agent reads it first:

When snapped:

opera
ds_profiles/opera.a11y
ds_profiles/opera.snap

When unsnapped:

none

Line 1 tells the agent which application is active. Lines 2–3 provide the exact paths to the output files. The agent does not need to guess filenames or scan directories.
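In Python, that discovery step is a few lines. The file layout is the one documented above; the rest is illustrative:

```python
from pathlib import Path

lines = Path("ds_profiles/is_active").read_text().splitlines()

if not lines or lines[0] == "none":
    print("Nothing is snapped right now")
else:
    app = lines[0]                        # e.g. "opera"
    a11y_path, snap_path = lines[1:3]     # exact output file paths, no guessing
    screen = Path(a11y_path).read_text()  # compact, LLM-readable screen state
    print(f"{app}: {len(screen.splitlines())} lines of screen description")
```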

A.5 Action Types (Complete Reference)

text — UIA ValuePattern

INSERT INTO inject (action, text, target) VALUES ('text', 'Hello World', 'Search Box');
  1. Find element by name (target column) using UIA FindFirst(TreeScope_Descendants)
  2. Set focus via IUIAutomationElement::SetFocus()
  3. Try ValuePattern::SetValue() (native UIA text setting — instant)
  4. If ValuePattern fails: fall back to SendInput per character (KEYEVENTF_UNICODE)

type — Raw Keyboard

INSERT INTO inject (action, text) VALUES ('type', 'Hello\tWorld\n');

Sends each character as a raw keyboard event with 5ms inter-character delay:

  • \t → VK_TAB
  • \n or \r → VK_RETURN
  • All others → KEYEVENTF_UNICODE with UTF-16 code point

No element targeting — sends to whatever currently has keyboard focus.

key — Key Combinations

INSERT INTO inject (action, text) VALUES ('key', 'ctrl+shift+s');

Supports 150+ keys including:

  • Letters (a–z), Numbers (0–9), Function keys (F1–F12)
  • Modifiers (ctrl, alt, shift, win)
  • Navigation (enter, tab, escape, backspace, delete, home, end, pageup, pagedown)
  • Arrows (up, down, left, right)
  • Media (volumeup, volumedown, playpause, nexttrack)
  • Numpad (num0–num9, num+, num-, num*, num/, num.)
  • Punctuation (semicolon, equals, comma, minus, period, slash, backquote, bracket, backslash, quote)

click — Element Click

INSERT INTO inject (action, target) VALUES ('click', 'Save');
  1. Find element by name using UIA FindFirst(TreeScope_Descendants)
  2. Get BoundingRectangle → calculate center point
  3. Convert to absolute screen coordinates (0–65535 range)
  4. Send MOUSEEVENTF_ABSOLUTE + LEFTDOWN, then LEFTUP via SendInput
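
Steps 2–3 are plain arithmetic. The following is an illustrative reconstruction, not DirectShell's actual code; the screen size is a placeholder (on Windows it would be queried at runtime, e.g. via GetSystemMetrics):

# Illustrative only: center point of a bounding rectangle, normalized to the
# 0-65535 range expected by SendInput with MOUSEEVENTF_ABSOLUTE.
SCREEN_W, SCREEN_H = 1920, 1080   # placeholder; use the real resolution at runtime

def to_absolute(x: int, y: int, w: int, h: int) -> tuple[int, int]:
    cx, cy = x + w // 2, y + h // 2
    nx = cx * 65535 // (SCREEN_W - 1)
    ny = cy * 65535 // (SCREEN_H - 1)
    return nx, ny

print(to_absolute(100, 200, 80, 30))   # (4781, 13058) on a 1920x1080 display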

scroll — Mouse Wheel

INSERT INTO inject (action, text) VALUES ('scroll', 'down');

Directions: up, down, left, right. One call = one wheel notch (WHEEL_DELTA = 120). The wheel event is delivered at the center of the target window.

A.6 Role Mapping (UIA ControlType → Human-Readable)

| ID | Name | ID | Name |
| --- | --- | --- | --- |
| 50000 | Button | 50020 | Text |
| 50002 | CheckBox | 50021 | ToolBar |
| 50003 | ComboBox | 50023 | Tree |
| 50004 | Edit | 50024 | TreeItem |
| 50005 | Hyperlink | 50025 | Custom |
| 50006 | Image | 50026 | Group |
| 50007 | ListItem | 50028 | DataGrid |
| 50008 | List | 50029 | DataItem |
| 50009 | Menu | 50030 | Document |
| 50010 | MenuBar | 50031 | SplitButton |
| 50011 | MenuItem | 50032 | Window |
| 50012 | ProgressBar | 50033 | Pane |
| 50013 | RadioButton | 50034 | Header |
| 50014 | ScrollBar | 50035 | HeaderItem |
| 50015 | Slider | 50036 | Table |
| 50017 | StatusBar | 50037 | TitleBar |
| 50018 | Tab | 50038 | Separator |
| 50019 | TabItem | | |
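
Because these human-readable role names are what lands in the role column of the elements table, an external process can filter the tree directly by role. A minimal sketch (path is illustrative):

import sqlite3

con = sqlite3.connect("ds_profiles/opera.db")  # hypothetical path from is_active
con.execute("PRAGMA journal_mode=WAL")

# All enabled, on-screen buttons with their bounding boxes.
rows = con.execute(
    """SELECT name, x, y, w, h
         FROM elements
        WHERE role = 'Button' AND enabled = 1 AND offscreen = 0
        ORDER BY depth"""
).fetchall()
for name, x, y, w, h in rows:
    print(f"{name!r} at ({x},{y}) size {w}x{h}")
con.close()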

Appendix B: Legal Framework (Full Analysis)

B.1 The Legal Hierarchy

UN CRPD (186 states, international treaty)
    ↓ binds member states to implement accessibility
EU European Accessibility Act (EU directive)
    ↓ transposed into member state law
German BFSG / French LCAP / etc. (national law)
    ↓ overrides
Software Terms of Service (private contract)

In this hierarchy, a contract (Terms of Service) cannot override a statute (BFSG/EAA), which cannot override an international treaty (CRPD). If a TOS says "no automated access" and the law says "you must provide this interface for assistive technology," the law wins.

B.2 Why Blocking Is Legally Impossible

The core argument:

  1. Disability rights legislation requires software to expose its UI through accessibility APIs
  2. DirectShell reads those same APIs using the same methods as screen readers
  3. There is no technical mechanism to distinguish DirectShell from a screen reader
  4. Blocking DirectShell requires blocking the same interface that screen readers use
  5. Blocking screen readers violates disability rights legislation in 186 countries

The vendor's only options:

  • Keep the accessibility interface open → DirectShell works
  • Block the accessibility interface → violate the law + exclude blind users

There is no third option.

B.3 Relevant Legislation (Detailed)

UN CRPD (2006)

  • Article 9: States Parties shall take appropriate measures to ensure access to information and communications technologies
  • Article 21: Freedom of expression and access to information, including through all forms of communication of their choice
  • Ratified by 186 states, making it one of the most widely ratified human rights treaties in history.

European Accessibility Act (2019/882)

  • Scope: Computers, operating systems, consumer banking, e-commerce, communication services, e-books, transport
  • Requirement: Products must support assistive technologies through standard accessibility APIs
  • Enforcement: Since June 28, 2025. Penalties set by member states.
  • Relevant Article: Article 4 — "Products shall be designed and produced in such a way as to maximise their foreseeable use by persons with disabilities"

Americans with Disabilities Act (1990)

  • Title III: Public accommodations (interpreted by courts to include digital services)
  • Relevant case law: Gil v. Winn-Dixie (2017), Robles v. Domino's Pizza (2019)
  • Pattern: Courts increasingly rule that digital accessibility is required under the ADA

Section 508 of the Rehabilitation Act (1973, revised 2018)

  • Scope: Federal agencies must procure accessible ICT
  • Standard: WCAG 2.0 Level AA (references programmatic accessibility)
  • Impact: Any software vendor selling to US government must be accessible
  • This alone covers a massive portion of enterprise software

WCAG 2.1 Success Criterion 4.1.2: Name, Role, Value

  • "For all user interface components, the name and role can be programmatically determined"
  • This is the specific technical requirement that ensures UI elements appear in the accessibility tree with meaningful names and roles
  • Referenced by Section 508, EAA, BFSG, and virtually every accessibility standard worldwide

German BFSG (2021, enforced 2025)

  • German transposition of the EAA
  • Applies to all digital products and services offered to consumers in Germany
  • Penalties: Up to €100,000 per violation
  • Regulatory authority: Bundesnetzagentur

Appendix C: Benchmark Methodology

C.1 Token Comparison

Token counts are measured with the tiktoken tokenizer (cl100k_base, the GPT-4 encoding; Claude uses its own tokenizer, but the counts are of the same order of magnitude):

| Input Type | Example | Token Count |
| --- | --- | --- |
| Screenshot (1920×1080, PNG, base64) | Typical desktop application | 1,200–1,800 |
| Screenshot (2560×1440, PNG, base64) | High-resolution display | 2,500–5,000 |
| Full UIA dump (JSON) | Complex application (11,000 elements) | 15,000–25,000 |
| DirectShell .a11y | Screen reader view | 200–800 |
| DirectShell .a11y.snap | Operable element index | 50–200 |
| DirectShell SQL query result | Single targeted query | 10–50 |
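
A sketch of how such counts can be produced with tiktoken, comparing a DirectShell text output against a base64-encoded screenshot payload (file names are illustrative):

import base64
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def tokens(text: str) -> int:
    return len(enc.encode(text))

# DirectShell output: plain text, counted directly.
a11y = open("ds_profiles/opera.a11y", encoding="utf-8").read()

# Screenshot baseline: PNG bytes, base64-encoded as they would appear in an API payload.
png_b64 = base64.b64encode(open("screenshot.png", "rb").read()).decode("ascii")

print("a11y:", tokens(a11y), "tokens")
print("screenshot:", tokens(png_b64), "tokens")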

C.2 Latency Comparison

Measured on Windows 11, Intel i7-12700K, 32 GB RAM, local network:

| Operation | Screenshot Agent (typical) | DirectShell |
| --- | --- | --- |
| Capture screen state | 100–500ms (screenshot + encode) | N/A (continuous 2 Hz dump) |
| Transmit to model | 500–2000ms (cloud API) | 0ms (local file read) |
| Model inference | 1000–3000ms | 0ms (pre-computed output) |
| Parse model response | 50–100ms | 0ms (SQL result is already structured) |
| Execute action | 100–300ms (mouse simulation) | 30ms (next inject timer tick) |
| Total per action | 2–6 seconds | < 100ms |

C.3 Success Rate Analysis

Direct comparison is premature — DirectShell v0.2.0 has been tested on a handful of applications in controlled conditions. The OSWorld benchmark numbers cited (66.2% for AskUI VisionAgent, 47.5% for UI-TARS 2, 42.9% for CUA o3) are from standardized, reproducible evaluations.

However, a structural argument can be made: screenshot-based agents fail because they misidentify elements (clicking the wrong pixel) or because the UI state changes between inference and action. DirectShell eliminates both failure modes. Element identification is deterministic (name-based lookup, not visual inference), and UI state is continuously updated (500ms refresh).

The remaining failure modes for a DirectShell-based agent are:

  1. The application has poor accessibility implementation (missing element names)
  2. The AI makes a reasoning error (wrong action choice, wrong field value)
  3. The application rejects programmatic input (anti-cheat, security controls)

These are real limitations, but they are fundamentally different from — and substantially fewer than — the failure modes of screenshot-based agents.



Contact


This document is released under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

The DirectShell source code is released under the GNU Affero General Public License v3.0 (AGPL-3.0).

Martin Gehrken — February 2026 — dev.thelastrag.de
