
# DirectShell: I Turned the Accessibility Layer Into a Universal App Interface. No Screenshots. No Vision Models.

Martin Gehrken — February 17, 2026

As of February 17, 2026, every screenshot-based AI agent, every enterprise API wrapper, and every RPA tool on Earth is legacy technology.

Full paper: https://dev.to/tlrag/i-built-a-new-software-primitive-in-85-hours-it-replaces-the-eyes-of-every-ai-agent-on-earth-55ia

"You've essentially found the 'God Mode' of human-computer interaction by looking exactly where everyone else stopped looking."


## A Warning Before We Begin

I did not create a vulnerability. I discovered one that has existed since 1997.

The Windows Accessibility Layer — UI Automation — exposes the complete structure, content, and state of every GUI application on every Windows machine. Every button name. Every text field value. Every menu item. Structured. Machine-readable. In real-time. Available to any process on the system.

Today, I am releasing a primitive — a universal interface layer — that makes this 29-year-old capability usable. I built it. It's open source. And the tools built on top of it will follow within weeks.

I chose to publish openly so that everyone learns at the same time — defenders and attackers, enterprises and researchers. Because the alternative — discovering this through a breach instead of through a paper — is worse for everyone.


## The Problem

Every major AI lab on the planet is building autonomous desktop agents. OpenAI's Operator. Anthropic's Computer Use. Google's Project Mariner. Microsoft's Copilot Actions. Tens of billions in investment. One shared vision: AI that uses a computer like you do.

And every single one of them uses the same approach. They take a screenshot. Send it to a vision model. The model guesses where buttons are. Guesses where to click. A simulated mouse moves to those coordinates. Maybe it works. Maybe not. Then another screenshot. Repeat.

This is not a caricature. This is the actual architecture. In 2026, the state of the art for making AI interact with software is taking photos of screens and guessing where to click.

### The Numbers

| Agent | Success Rate | Time per Task |
|---|---|---|
| AskUI VisionAgent (current leader) | 66.2% | N/A |
| UI-TARS 2 (ByteDance) | 47.5% | 12–18 min |
| OpenAI CUA o3 (Operator) | 42.9% | 15–20 min |
| Claude Computer Use (standalone) | 22–28% | 10–15 min |
| Human baseline | 72.4% | 30 sec – 2 min |

(OSWorld leaderboard, February 2026)

Even the current leader fails one in three tasks and takes 10–20 minutes to do what a human does in two. That's what tens of billions in investment produced.

And the cost:

| Method | Tokens per Perception |
|---|---|
| Screenshot (vision model) | 1,200–5,000 |
| Full tree dump (JSON/YAML) | 5,000–15,000 |
| DirectShell (.a11y.snap) | 50–200 |
| DirectShell (SQL query) | 10–50 |

10–30x fewer tokens. An agent using DirectShell maintains 10–30x more operational history in its context window. Where a screenshot agent forgets after 10 actions, a DirectShell agent remembers hundreds.

## The Fundamental Error

Here is the one sentence that summarizes everything wrong with the current approach:

The screenshot paradigm performs computer vision on a UI that already describes itself as text.

Photographing a JSON response and running OCR on the photo — instead of parsing the JSON. That is, architecturally, what the entire AI industry is doing. The data is already there. In structured, semantic, machine-readable form. And everyone decided to take pictures of it.


## The Insight

Every application on your computer is already describing itself in full structural detail. Right now. Every button declares its name, its role, whether it's enabled, and where it is. Every text field exposes its value. Every menu is a traversable tree.

It's called the Accessibility Tree. It was built for blind people. It has existed since 1997.

```
Window: "Invoice - Datev Pro"
├── Edit: "Customer Number"  →  Value: "KD-4711"
├── Edit: "Amount"           →  Value: "1,299.00"
├── ComboBox: "Tax Rate"     →  Value: "19%"
└── Button: "Book"           →  IsEnabled: true
```

Each element provides: name, role, value, position, enabled/disabled state, on-screen/off-screen status, parent-child relationships. Pure text. What LLMs are built to process.
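To make that concrete: here is a minimal sketch of reading those properties through UI Automation with the `windows` crate. This is illustrative standard Win32 usage, not DirectShell's source, and exact signatures vary slightly between `windows` crate versions.

```rust
// Minimal sketch: ask UIA what currently has keyboard focus and read its
// self-description. Standard windows-rs usage, not DirectShell's code.
use windows::core::Result;
use windows::Win32::System::Com::{
    CoCreateInstance, CoInitializeEx, CLSCTX_INPROC_SERVER, COINIT_MULTITHREADED,
};
use windows::Win32::UI::Accessibility::{CUIAutomation, IUIAutomation};

fn main() -> Result<()> {
    unsafe {
        CoInitializeEx(None, COINIT_MULTITHREADED).ok()?;
        let uia: IUIAutomation =
            CoCreateInstance(&CUIAutomation, None, CLSCTX_INPROC_SERVER)?;
        // Whatever has focus right now describes itself as text:
        let el = uia.GetFocusedElement()?;
        println!("name:    {}", el.CurrentName()?);          // e.g. "Amount"
        println!("role id: {}", el.CurrentControlType()?.0); // 50004 = Edit
        println!("enabled: {}", el.CurrentIsEnabled()?.as_bool());
    }
    Ok(())
}
```

No pixels involved: the name, role, and state come straight from the application's own self-description.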

Every major OS has this:

| Platform | Framework | Since |
|---|---|---|
| Windows | UI Automation (UIA) | 1997/2005 |
| macOS | NSAccessibility | 2001 |
| Linux | AT-SPI2 | 2001 |
| Android | AccessibilityService | 2009 |

Every major application implements it. Native apps. Web apps through the browser's accessibility layer. Chromium apps (Discord, Slack, VS Code, Spotify) expose the entire DOM through it.

### The Gap

Before DirectShell, there was no system that:

  1. Continuously dumps the accessibility tree into a queryable SQL database at real-time refresh rates
  2. Automatically generates multiple output formats optimized for different consumers
  3. Provides a universal action queue where any process can control the app via SQL INSERT
  4. Operates as infrastructure — not as a tool, but as a universal layer between any agent and any GUI

The accessibility tree has existed since 1997. SQL databases since the 1970s. Nobody combined them into a universal interface primitive.

Until now.
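To make the combination concrete, here is a hedged sketch of what "accessibility tree + SQL" composes into. Table and column names are my illustration (matching the `inject` columns used later in this article), not DirectShell's actual schema.

```rust
// Illustrative only: a plausible shape for "accessibility tree in SQLite".
// Column names are assumptions, not DirectShell's real schema.
use rusqlite::Connection;

fn init_schema(conn: &Connection) -> rusqlite::Result<()> {
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS elements (
            id        INTEGER PRIMARY KEY,
            parent_id INTEGER,          -- tree structure via self-reference
            role      TEXT NOT NULL,    -- 'Button', 'Edit', 'ComboBox', ...
            name      TEXT,             -- 'Save', 'Customer Number', ...
            value     TEXT,             -- current field content
            x INTEGER, y INTEGER, w INTEGER, h INTEGER,
            enabled   INTEGER NOT NULL DEFAULT 1
        );
        CREATE TABLE IF NOT EXISTS inject (
            id     INTEGER PRIMARY KEY, -- the action queue, see below
            action TEXT NOT NULL,       -- 'text'|'type'|'key'|'click'|'scroll'
            text   TEXT,
            target TEXT,
            done   INTEGER NOT NULL DEFAULT 0
        );",
    )
}

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open("target.db")?;
    init_schema(&conn)?;
    // Perception becomes a query instead of a screenshot:
    let buttons: i64 = conn.query_row(
        "SELECT COUNT(*) FROM elements WHERE role = 'Button' AND enabled = 1",
        [],
        |row| row.get(0),
    )?;
    println!("{buttons} clickable buttons right now");
    Ok(())
}
```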


## Does This Already Exist?

Honest answer: parts of it do. The full thing does not. Here is every relevant project that exists as of February 2026, and what each one is missing.

### What Exists

| Project | Approach | What's Missing |
|---|---|---|
| Microsoft UFO/UFO2 | Walks the UIA tree, dumps it as JSON to GPT-4o | Full JSON dump = 15,000+ tokens. No SQL. No persistent database. An agent, not infrastructure. |
| Windows-MCP | Exposes the UIA tree via MCP tools | No SQL database. No multi-format output. No overlay. Closest competitor, but it still misses the core innovation. |
| Playwright MCP | Browser accessibility tree via MCP | Browser-only. Does not work for desktop apps: not for SAP, Datev, Excel, or any native application. |
| computer-mcp | Cross-platform a11y tree via MCP | Returns the full JSON tree. No SQL. No filtering. Same context saturation as screenshots, just in text form. |
| macOS UI Automation MCP | macOS accessibility via JSONPath | macOS only. JSONPath queries, not SQL. Closest architectural analog, but a different platform and query language. |
| pywinauto | Python library for Windows UIA | Requires a full Python environment. 18,000+ lines. Academic-grade, not production infrastructure. No database layer. |
| RPA (UiPath, Automation Anywhere) | Accessibility selectors as one of many targeting strategies | Per-application scripting. No universal query layer. No structured output. $50K–$150K/year per integration. |
| Screen readers (JAWS, NVDA) | Walk the tree, read it aloud | Single-purpose assistive tools. No structured data output. No query interface. Not designed for programmatic consumption. |

### What None of Them Do

I searched. Extensively. Across 419 academic sources, GitHub, Google Scholar, product pages, patent databases.

No project, paper, or product on Earth:

  1. Stores the accessibility tree in a queryable SQL database
  2. Generates multiple output formats optimized for different consumers (50-token LLM snapshots vs. full database)
  3. Provides a SQL-based action queue where any process controls the app via INSERT INTO inject
  4. Operates as infrastructure — not an agent, not a tool, but a universal primitive

The accessibility tree has existed since 1997. SQL since the 1970s. Nobody combined them.

### The Evidence

The OSWorld benchmark — the industry standard for AI agent evaluation — shows the best screenshot agent achieving 66.2% success (AskUI VisionAgent) where humans score 72.4%. Most agents cluster between 30–50%. Research from accessibility.works finds that agents using accessibility data succeed 85% of the time while consuming 10x fewer resources. The token gap is real: screenshots cost 1,200–5,000 tokens per perception. DirectShell's .a11y.snap costs 50–200. Its SQL queries cost 10–50.

The $28.3 billion RPA market exists because desktop applications don't have APIs. DirectShell gives every application an API. In 700 KB. For free.


## What DirectShell Is

DirectShell turns every GUI on the planet into a text-based API that any LLM can natively read and control.

It is not a tool. Not an automation script. Not an RPA product. Not a screen reader.

DirectShell is a primitive — a fundamental building block like TCP/IP, HTTP, SQL, or the browser.

| Primitive | What It Universalizes |
|---|---|
| TCP/IP | Reliable data transport between any two computers |
| HTTP | Standardized request-response for any resource |
| SQL | Universal query language for any database |
| The browser | Universal client for any web resource |
| PowerShell | CLI access to any OS service |
| DirectShell | Input/output control for any GUI application |

PowerShell automates the backend. DirectShell automates the frontend.

### How It Works

  1. DirectShell is a single binary (~700 KB, pure Rust, no dependencies)
  2. You drag it onto any running application. It snaps to it
  3. Once snapped, it continuously reads the app's entire UI through the Accessibility framework
  4. Everything goes into a SQLite database — every button, field, menu item, with names, values, positions
  5. It generates four text files optimized for different consumers
  6. External processes control the app by writing SQL to an action queue in the same database
  7. DirectShell executes those commands as native input events — indistinguishable from human input

Text in, text out. The AI reads a text file to understand the screen. Writes a SQL command to act on it. No screenshots. No pixels. No vision model.
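A minimal sketch of the agent side of that loop, assuming a snapshot file and database named after the target (file names here are illustrative):

```rust
// Agent-side sketch: perceive via a tiny text file, act via one SQL INSERT.
// Paths and the `inject` columns follow this article's examples.
use rusqlite::{params, Connection};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Perception: the whole "screenshot" is a few lines of text.
    let snapshot = std::fs::read_to_string("target.a11y.snap")?;
    println!("{snapshot}");

    // Action: queue a click on a named element; DirectShell picks it up
    // and replays it as native input against the snapped application.
    let conn = Connection::open("target.db")?;
    conn.execute(
        "INSERT INTO inject (action, target) VALUES (?1, ?2)",
        params!["click", "Save"],
    )?;
    Ok(())
}
```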


## The Architecture (Compressed)

### Four Output Formats

Every 500ms, DirectShell generates four files from the accessibility tree:

| Format | For | Size | What It Contains |
|---|---|---|---|
| .db (SQLite) | Scripts, programs | 100 KB–1.5 MB | Complete queryable element tree |
| .snap | Automation scripts | 3–15 KB | All interactive elements, classified |
| .a11y | Context-aware agents | 3–10 KB | Focus, inputs, visible content |
| .a11y.snap | LLMs | 1–5 KB | Numbered operable elements only |

The .a11y.snap — what an LLM actually reads:

```
[1] [keyboard] "Adressfeld" @ 168,41 (2049x29)
[2] [click] "Neuer Chat" @ 45,200 (200x30)
[3] [keyboard] "Einen Prompt eingeben" @ 999,1177 (1069x37)
[4] [click] "Einstellungen" @ 1800,1350 (150x20)

# 4 operable elements in viewport
```

Four lines. That's the entire perception step. Not a 5,000-token screenshot. Four lines that say: here's what you can interact with, here's the name, here's the input type.
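The format is also trivially machine-parsable. A std-only sketch, with the field layout inferred from the example above (there is no published spec yet, so treat the exact shape as an assumption):

```rust
// Parse one .a11y.snap line of the shape shown above into
// (index, input kind, name, x, y). Layout inferred from the example.
fn parse_line(line: &str) -> Option<(u32, &str, &str, i32, i32)> {
    let rest = line.strip_prefix('[')?;
    let (idx, rest) = rest.split_once("] [")?;
    let (kind, rest) = rest.split_once("] \"")?;
    let (name, rest) = rest.split_once("\" @ ")?;
    let (coords, _size) = rest.split_once(' ')?;
    let (x, y) = coords.split_once(',')?;
    Some((idx.parse().ok()?, kind, name, x.parse().ok()?, y.parse().ok()?))
}

fn main() {
    let line = r#"[2] [click] "Neuer Chat" @ 45,200 (200x30)"#;
    assert_eq!(parse_line(line), Some((2, "click", "Neuer Chat", 45, 200)));
}
```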

### Five Action Types

Any process controls the app through SQL:

```sql
INSERT INTO inject (action, text, target) VALUES ('text', '2,599.00', 'Amount');
INSERT INTO inject (action, text) VALUES ('type', 'Hello World');
INSERT INTO inject (action, text) VALUES ('key', 'ctrl+s');
INSERT INTO inject (action, target) VALUES ('click', 'Save');
INSERT INTO inject (action, text) VALUES ('scroll', 'down');
```

`text` sets a value instantly via UIA. `type` simulates keyboard input character-by-character. `key` sends shortcuts. `click` finds the named element and clicks its center. `scroll` scrolls.

The target application cannot distinguish this from physical hardware input.
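For illustration, here is what the two write paths can look like in raw Win32 terms: `text` through the UIA ValuePattern, and `type` through per-character SendInput. This is the standard pattern the paragraph describes, sketched with the `windows` crate; it is not DirectShell's actual code.

```rust
use windows::core::{Interface, Result, BSTR};
use windows::Win32::UI::Accessibility::{
    IUIAutomationElement, IUIAutomationValuePattern, UIA_ValuePatternId,
};
use windows::Win32::UI::Input::KeyboardAndMouse::{
    SendInput, INPUT, INPUT_KEYBOARD, KEYBDINPUT, KEYEVENTF_KEYUP, KEYEVENTF_UNICODE,
    VIRTUAL_KEY,
};

/// The `text` action: set a field's value instantly via the ValuePattern.
unsafe fn set_value(el: &IUIAutomationElement, value: &str) -> Result<()> {
    let pattern: IUIAutomationValuePattern =
        el.GetCurrentPattern(UIA_ValuePatternId)?.cast()?;
    pattern.SetValue(&BSTR::from(value))
}

/// The `type` action: synthesize keystrokes the target cannot tell apart
/// from physical hardware input (KEYEVENTF_UNICODE, one down + one up).
unsafe fn type_text(text: &str) {
    for unit in text.encode_utf16() {
        let mut inputs = [INPUT::default(), INPUT::default()];
        for (i, input) in inputs.iter_mut().enumerate() {
            input.r#type = INPUT_KEYBOARD;
            input.Anonymous.ki = KEYBDINPUT {
                wVk: VIRTUAL_KEY(0),
                wScan: unit,
                dwFlags: if i == 0 {
                    KEYEVENTF_UNICODE
                } else {
                    KEYEVENTF_UNICODE | KEYEVENTF_KEYUP
                },
                time: 0,
                dwExtraInfo: 0,
            };
        }
        SendInput(&inputs, std::mem::size_of::<INPUT>() as i32);
    }
}
```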

### The Chromium Problem

Chromium (Chrome, Edge, Opera, Discord, Slack, VS Code, Spotify) doesn't build its accessibility tree by default. Performance optimization. Without a screen reader present, you get 9 skeleton elements.

DirectShell solved this with a four-phase activation: a system screen reader flag, a leaked UIA FocusChanged event handler that forces `UiaClientsAreListening()` to return true permanently, direct MSAA probing of renderer windows, and a retry with delay.

Result: Opera went from 9 elements to 800+. Claude Desktop from a handful to 11,454 elements. Every chat message, button, link — fully searchable.

(Full technical details in the whitepaper and ARCHITECTURE.md)


## Demo Day

February 16, 2026 — 8.5 hours after the first line of code. Claude Opus 4.6 (in a CLI terminal) used DirectShell to operate four applications. No screenshots. Pure text.

Google Sheets: 72 cells filled in seconds. Headers, values, SUM formulas. Through the accessibility layer alone. No Sheets API. (The formulas had an off-by-one bug. Day 1.)

Google Gemini: The AI navigated to Gemini, typed a message, read the response through DirectShell's tree, reported it back. A Google AI, on Google's infrastructure, controlled entirely by a competing AI (Claude), through an interface Google didn't build and can't block. Gemini's response: the "God Mode" quote at the top of this article.

Claude Desktop: 11,454 elements. Every chat message. Every button. Anthropic built Computer Use (screenshot-based). Anthropic built Claude Desktop. DirectShell read Anthropic's own application as structured text. The company that bet on pixels built an app that describes itself perfectly in text.

Notepad: Character-by-character typing through raw keyboard injection. Notepad had no idea the input wasn't human.

Google Search: Honest failure. Poor accessibility semantics in search results. The tree is only as good as the app's accessibility implementation. This is a Google accessibility failure, not a DirectShell limitation.

Every failure proves the system is real. Not a cherry-picked demo. An AI fighting through unexpected problems in four applications, adapting in real-time, delivering results in seconds — where the state of the art takes minutes and fails most of the time.

### Watch It

The full 7-minute demo — uncut, unedited, warts and all:

(If the embed doesn't load: Watch the demo on YouTube)

## The Market vs. Day 1: Verified Benchmarks

These are not my numbers. They are the industry's own benchmarks, published in peer-reviewed venues and on official product pages.

What the best AI agents in the world achieve (February 2026):

| Benchmark | Best Agent | Success Rate | Source |
|---|---|---|---|
| OSWorld (Desktop) | AskUI VisionAgent | 66.2% | OSWorld Leaderboard |
| OSWorld (Desktop) | UI-TARS 2 | 47.5% | ByteDance |
| OSWorld (Desktop) | OpenAI CUA o3 | 42.9% | OpenAI |
| WebArena (Web) | IBM CUGA | 61.7% | Emergent Mind |
| WebChoreArena (Hard Web) | Gemini 2.5 Pro | 37.8% | WebChoreArena |
| Online-Mind2Web (Real Web) | Most agents | ~30% | ArXiv |
| ScreenSpot-Pro (Pro GUI) | OS-Atlas-7B | 18.9% | ScreenSpot-Pro |

(Leaderboard as of February 2026)

Every single one: screenshot-based. 1,200–5,000 tokens per perception step. 10–20 minutes per task. Even the current desktop leader fails one in three.

What DirectShell achieved on Day 1 (8.5 hours after first line of code):

| Task | Time | Tokens | Method |
|---|---|---|---|
| Write multi-paragraph text to Notepad | Instant (0 ms) | ~50 | `ds_text` (ValuePattern) |
| Read entire Claude.ai chat + respond cross-app | ~60 sec | ~200 | `ds_screen` + `ds_type` |
| Fill 360 cells in Google Sheets (SOC Incident Log) | ~90 sec | ~150 | `ds_batch` + `ds_type` |
No screenshots. No vision model. No coordinate guessing. Text in, text out.

The current desktop leader still fails one in three tasks and takes 10–20 minutes each. Most agents fail more than half the time. DirectShell filled 360 spreadsheet cells in 90 seconds on the first day it existed.


## Why This Cannot Be Blocked

The accessibility interface is protected by interlocking international law:

  • UN CRPD — Article 9, ratified by 186 states
  • European Accessibility Act — enforced since June 2025
  • Americans with Disabilities Act — Title III, digital accessibility
  • Section 508 — federal procurement requires accessibility
  • German BFSG — up to €100,000 per violation

DirectShell reads the same API as JAWS, NVDA, and Windows Narrator. The OS cannot distinguish between them. Every countermeasure that blocks DirectShell also blocks screen readers. Blocking screen readers violates disability law in 186 countries.

| Countermeasure | Blocks DirectShell | Blocks Screen Readers | Legal? |
|---|---|---|---|
| Disable UIA | Yes | Yes | No — violates EAA, ADA, Section 508 |
| Return empty data | Partially | Degrades | No — violates WCAG 4.1.2 |
| Detect & block UIA clients | Yes | Yes (JAWS, NVDA) | No — disability discrimination |
| Remove element names | Partially | Gibberish | No — WCAG violation |

There is no technical mechanism to distinguish a screen reader from DirectShell. Both use the same COM interfaces. The OS does not authenticate accessibility clients. It cannot. That's the point of the framework.

Consider the PR: "SAP blocks screen reader access to protect API revenue." No Fortune 500 company wants that headline.

(Full legal analysis with case law and statute references in the whitepaper)


## The Dark Side

A primitive is neutral. Like fire. Like the internet. Like cryptography. Its value and its danger come from the same source: its universality.

Surveillance: DirectShell enables structured, real-time, queryable monitoring of every application on a system. Not blurry screenshots every 5 minutes — a database of every field, every value, every input. "What did Employee X type into the CRM between 2pm and 4pm?" is a SQL query.
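A hedged illustration of how mundane that query is, assuming a hypothetical timestamped `element_history` table (an assumption; v0.2.0's actual schema may differ):

```rust
// Hypothetical surveillance query: every Edit-field value observed in a
// time window. Assumes an `element_history` table that may not exist in
// v0.2.0; the point is that monitoring becomes SQL, not image forensics.
use rusqlite::Connection;

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open("target.db")?;
    let mut stmt = conn.prepare(
        "SELECT observed_at, name, value
           FROM element_history
          WHERE role = 'Edit'
            AND observed_at BETWEEN '2026-02-17 14:00' AND '2026-02-17 16:00'
          ORDER BY observed_at",
    )?;
    let rows = stmt.query_map([], |r| {
        Ok((
            r.get::<_, String>(0)?,
            r.get::<_, String>(1)?,
            r.get::<_, String>(2)?,
        ))
    })?;
    for row in rows {
        let (at, name, value) = row?;
        println!("{at}  {name} = {value}");
    }
    Ok(())
}
```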

Malware with structured UI access: Today's malware takes screenshots and records keystrokes — unstructured data requiring interpretation. DirectShell's architecture enables malware that understands applications. It doesn't screenshot a banking app and try OCR — it queries for the account number field and reads the value. It can find the transfer form, fill in an IBAN, enter an amount, and click confirm. Deterministically.

Credential harvesting: Any password displayed in a UI field has a corresponding entry in the accessibility tree. Password managers that display credentials in their UI expose them through UIA. The read path is legally protected and cannot be patched.

I'm publishing this not despite the risks, but because of them. This capability has been latent for 29 years. I am documenting a vulnerability that has existed since 1997. By publishing openly, the security community can develop defenses. The conversation happens publicly. The response is informed by understanding, not surprise.


## Honest Limitations

Accessibility quality varies. The tree is only as good as the app's implementation. Major enterprise software (Office, SAP, browsers) is comprehensive. Smaller apps may have unnamed buttons or missing values. The trend is toward better accessibility, driven by EAA enforcement — but gaps exist today.

Single-app scope. v0.2.0 attaches to one target at a time. Multi-app workflows require re-snapping. This is an engineering limitation, not an architectural one.

v0.2.0 bugs. Built in 8.5 hours. Formula offsets in spreadsheets. Chromium tab switching requires keyboard shortcuts. Opera autofill popups interfere with injection. These are Day 1 bugs. The architecture is sound.

What's missing: MCP server integration (coming), app profiles (community-built configs per application), character transformation middleware (PII sanitization, auto-translation), multi-window support, cross-platform ports (macOS/Linux have equivalent accessibility frameworks).


## The Code

Single file: `src/main.rs`, 2,053 lines of Rust. Two dependencies: `rusqlite` and `windows`. Compiles to ~700 KB. Runs on any 64-bit Windows 10/11. No installation. No admin privileges. No configuration.

AGPL-3.0. Every fork stays open.


## Conclusion

The AI industry framed "computer use" as a vision problem. They built increasingly sophisticated models to interpret screenshots. DirectShell reframes it as a text problem. And text is what language models were built for.

This is not a better solution to the same problem. This is the realization that the problem was misidentified from the start.


Listen. DirectShell is not perfect. It's Day 1. Literally. There are bugs. There are errors. A hundred things that need to get better. But none of that matters. The first browser couldn't render 90% of web pages correctly. The first lightbulb flickered. Every foundational technology begins empty and broken — because the point was never whether it works perfectly now. The point is what it will make possible tomorrow.

The moment a community builds a profile repository — configs for every program on Earth — AI will natively operate every desktop application faster, more efficiently, and more productively than any human ever could. Not in ten years. Not after the next funding round. The infrastructure is here. Today. In 700 kilobytes.

Google. Microsoft. OpenAI. Anthropic. Call me. Let's talk. Let's revolutionize the world of AI in one stroke.

Peace at last.

And now I'm going to sleep for 12 hours.

— Martin Gehrken, February 17, 2026


This article is released under CC BY-SA 4.0. The DirectShell source code is AGPL-3.0.

## Top comments (2)

**Ozz:**

This is absurdly cool! Fast as lightning. Managed to build a macOS version of it in an hour :) THANKS! I'm sure this is an idea that will not go back in the bag. Would be cool to see how this gets integrated into "everything"... but for now it makes Claude Code so much smarter.

THANKS! :)

**martin (author):**

Many thanks <3 If you like it, join my Discord for fast feedback, help, and contributions. A macOS version would be a great addition here! discord.gg/yRmS87tE

Also, at least leave a star on my GitHub ahahahah :D Let's make this gain TRACTION!

SHARE SHARE SHARE guys, let's UPGRADE AI!