Aabhas Sao

Posted on May 24

Google I/O 2026: AI Built an OS in 12 Hours. I Spent Mine Sorting Screenshots. 🤦

#devchallenge #googleiochallenge #ai #webdev

This is a submission for the Google I/O Writing Challenge

I haven't watched a tech keynote in a really long time. They usually feel like 2 hours of "the future is here!" slides with product demos that never work on stage.

But Google I/O 2026 actually got me. Watched the whole thing and came away genuinely excited and slightly stressed about how fast things are moving.

They Built a Full OS in 12 Hours

Google's team built a full-blown OS using Gemini agents: 93 sub-agents, 15K model requests, 2.6 billion tokens, all under $1,000 in credits. Twelve hours total.

Less than a thousand dollars. For 2.6 billion tokens. That's wild efficiency.
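To put those numbers in perspective, here is a quick back-of-the-envelope sketch using only the figures quoted above. The $1,000 is an upper bound ("under $1,000"), and the per-token figure is a blended average, since the keynote numbers don't split input, output, and cached tokens.

```python
# Figures quoted from the keynote; the cost is an upper bound ("under $1,000").
SUB_AGENTS = 93
MODEL_REQUESTS = 15_000
TOTAL_TOKENS = 2_600_000_000
MAX_COST_USD = 1_000


def per_million_tokens(cost_usd: float, tokens: int) -> float:
    """Blended cost in USD per one million tokens."""
    return cost_usd / (tokens / 1_000_000)


cost_per_million = per_million_tokens(MAX_COST_USD, TOTAL_TOKENS)
tokens_per_request = TOTAL_TOKENS / MODEL_REQUESTS
requests_per_agent = MODEL_REQUESTS / SUB_AGENTS

print(f"At most ${cost_per_million:.2f} per million tokens")
print(f"About {tokens_per_request:,.0f} tokens per request")
print(f"About {requests_per_agent:.0f} requests per sub-agent")
```

That works out to at most roughly $0.38 per million tokens, around 173K tokens per request, and about 161 requests per sub-agent on average.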

Gemini 2.5 Flash clocks around 200 tokens/second of output. Claude Sonnet sits around 40-60 tokens/second. That gap is huge when you're running 93 agents in parallel, and it helps explain why this was even possible in that timeframe.

Gemini Spark Felt Like a MoltBot Alternative

Gemini Spark runs agents in a secure private cloud fully managed by Google. You don't worry about infrastructure, just give it tasks. The closest thing I could think of was MoltBot but living entirely inside Google's ecosystem.

Agentic AI you don't have to host, secure, or manage yourself sounds genuinely appealing. Haven't gotten deep access yet but it's on my list.

Search Is Getting Agentic

For years Google Search has been a box you type into. Now it's a box that talks back, remembers, and proactively reaches out.

They showed concert updates for your city. You can ask Search to notify you when shows for an artist you like get announced. I live in Mumbai and would love to test how well this actually works here because currently I find out about shows after tickets are sold out.

There's also AI autocomplete that doesn't just finish your query but suggests better ones based on intent, plus proactive updates that follow up on your interests without you asking again. Search is slowly becoming something closer to a smart assistant.

Yes, Google, take my data, please.

Agents Are Buying Coffee Now

This was the part that felt most science-fiction-turned-real.

Universal Commerce Protocol lets AI agents interact with e-commerce systems. Agent Payments Protocol lets them actually complete payments on your behalf.

The live demo had an agent order coffee through DoorDash: selecting the item, going through checkout, completing payment. No human in the loop at any point.

Picture this: you tell your AI "book me a flight to Bangalore next Friday under ₹8,000, window seat" and it just... does it. Hotels too. Payments too.

Is this a privacy nightmare waiting to happen? Probably worth thinking about. But as a demo, genuinely impressive.

Gemini Live Replied in Haryanvi and I Was Not Ready

Small moment but a meaningful one. During the Gemini Live demo, the model replied in Haryanvi, a regional dialect spoken mostly in Haryana, India. Not just Hindi. Haryanvi. That was cool to see.

Work Onward: An Inspiring Real-World Antigravity and Stitch Case Study

Holly Jooyoung Diamond built Work Onward using Antigravity and Stitch. The problem she was solving was real: how do restaurant owners post jobs easily?

Her answer: job postings via SMS, so owners don't need to sit at a laptop filling out forms, and automatically generated multilingual job descriptions, so the listings open up to more candidates immediately.

Really inspiring to see that the ability to build on an idea is now within everyone's reach.

Small thing, but I actually used Gemini myself while writing this article to fetch the specific timestamp of the Work Onward demo from the keynote video, because I had forgotten the details after watching. That worked surprisingly well.

WeatherNext Predicted a Category 5 Hurricane 3 Days Out

WeatherNext predicted a Category 5 hurricane in Jamaica three full days before it happened.

3-day advance warnings at category-level accuracy are the kind of thing that saves lives. Not a dev tool but a reminder that the same underlying models powering our autocomplete are doing genuinely important work elsewhere.

Fine-Tuning Gemma 4 with Antigravity

This is the one I keep coming back to.

Google showed fine-tuning Gemma 4 directly from the Antigravity CLI for custom use cases. I haven't worked much with local models yet but the cost of calling a large cloud LLM for every single query adds up fast, especially for repetitive domain-specific tasks.

If you're building a product that does the same type of classification or extraction thousands of times a day, running that against a fine-tuned Gemma locally is far cheaper than hitting a frontier API each time. That's the promise here.
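A rough way to sanity-check that promise is a break-even calculation. This is only a sketch: the function name and every price in the example are made up for illustration, not real Gemma or API pricing.

```python
import math
from typing import Optional


def break_even_queries_per_day(
    api_cost_per_query: float,
    local_fixed_cost_per_day: float,
    local_cost_per_query: float = 0.0,
) -> Optional[int]:
    """Daily query volume at which a local model starts to cost less.

    Returns None when the local per-query cost is not lower than the API's,
    in which case the local setup never pays for itself.
    """
    saving_per_query = api_cost_per_query - local_cost_per_query
    if saving_per_query <= 0:
        return None
    return math.ceil(local_fixed_cost_per_day / saving_per_query)


# Hypothetical numbers: $0.002 per API call, $5/day to keep a local box
# running, $0.0001 of electricity per local query.
print(break_even_queries_per_day(0.002, 5.0, 0.0001))
```

With those placeholder numbers the break-even lands at a few thousand queries a day, which is the "thousands of times a day" territory where a local model starts to make sense. Plug in your own costs before drawing conclusions.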

I want to try this. Will write about it when I do.

Playing With the Antigravity CLI

Enough keynote recap. Here's what I actually tried.

Installing it

curl -fsSL https://antigravity.google/cli/install.sh | bash

One thing I hit right away: after installation, typing agy or antigravity opened the Antigravity IDE instead of the CLI. Turned out I had an older IDE version installed and its PATH entry was winning.

Had to manually remove the old PATH entry from ~/.zshrc and re-source it. After that the CLI came up fine. Not sure if it's a bug or just my setup, but if you hit the same thing, check your .zshrc for conflicting PATH entries before assuming something is broken.
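If you want to check for this kind of conflict before editing your shell config, here is a small sketch that lists every executable with a given name on your PATH, in lookup order. The first hit is the one your shell runs. The helper name find_on_path is my own, and the command names are the two mentioned above.

```python
import os
from pathlib import Path
from typing import List, Optional


def find_on_path(command: str, path_value: Optional[str] = None) -> List[Path]:
    """Return every executable named `command` on PATH, in lookup order."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    matches: List[Path] = []
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH segments
        candidate = Path(directory) / command
        if candidate.is_file() and os.access(candidate, os.X_OK):
            matches.append(candidate)
    return matches


if __name__ == "__main__":
    for name in ("agy", "antigravity"):
        hits = find_on_path(name)
        if len(hits) > 1:
            print(f"{name}: {len(hits)} candidates, the first one wins")
        for hit in hits:
            print(f"  {hit}")
```

If more than one path shows up for the same command, the PATH entry for the first one is the line to look for in your .zshrc.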

Organizing 200+ Screenshots

I had a Desktop full of screenshots going back two years. Totally unorganized, no naming convention, nothing. I thought, let's see what Antigravity does with this.

My prompt was: organize my screenshots, categorize them into folders, and give them meaningful names.

Gemini came back with a solid plan: a Swift OCR tool using macOS's native Vision Framework to extract text from each screenshot, paired with a Python script that classifies them into folders (Coding, Meetings, AI-Assistants, Communication, Finance, Design, Media, General) and renames them with date and keyword info like 2024-12-15_Brave_GitHub_GoogleMeet.png.

Using macOS's native Vision framework instead of a third-party OCR library was a smart call: zero extra dependencies.
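To give a feel for the classify-and-rename half of that plan, here is a minimal sketch. It is not the script Gemini generated. It assumes the OCR text has already been extracted, the folder names come from the plan above, and the keyword lists and function names are hypothetical.

```python
import re
from datetime import date
from typing import Dict, Iterable, List, Tuple

# Folder names from the plan; the keyword lists are illustrative guesses.
CATEGORY_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "Coding": ("github", "pull request", "terminal", "vs code"),
    "Meetings": ("google meet", "zoom", "calendar"),
    "AI-Assistants": ("gemini", "chatgpt", "claude"),
    "Communication": ("slack", "whatsapp", "gmail"),
    "Finance": ("invoice", "payment", "bank"),
    "Design": ("figma", "canva"),
    "Media": ("youtube", "spotify", "netflix"),
}
FALLBACK_CATEGORY = "General"


def classify(ocr_text: str) -> str:
    """Pick the category whose keywords occur most often in the OCR text."""
    text = ocr_text.lower()
    scores = {
        category: sum(text.count(word) for word in words)
        for category, words in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=lambda category: scores[category])
    return best if scores[best] > 0 else FALLBACK_CATEGORY


def build_name(taken_on: date, keywords: Iterable[str], suffix: str = ".png") -> str:
    """Build a name like 2024-12-15_Brave_GitHub_GoogleMeet.png."""
    cleaned = [re.sub(r"[^A-Za-z0-9]", "", word) for word in keywords]
    parts = [taken_on.isoformat()] + [word for word in cleaned if word]
    return "_".join(parts) + suffix


def plan_moves(
    screenshots: Iterable[Tuple[str, date, str, List[str]]]
) -> List[Tuple[str, str]]:
    """Dry run: map each old file name to a new relative path, moving nothing."""
    plan = []
    for old_name, taken_on, ocr_text, keywords in screenshots:
        new_path = f"{classify(ocr_text)}/{build_name(taken_on, keywords)}"
        plan.append((old_name, new_path))
    return plan


if __name__ == "__main__":
    sample = [(
        "Screenshot 12.png",
        date(2024, 12, 15),
        "Brave - GitHub pull request - Google Meet",
        ["Brave", "GitHub", "Google Meet"],
    )]
    for old, new in plan_moves(sample):
        print(f"{old} -> {new}")
```

Returning a plan instead of moving files right away keeps the run reviewable before anything on disk changes. Note that keyword counting only ever sees the extracted text, never the image itself.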

I gave the green light and it started running. Midway through the dry run I got impatient and used the /btw command to check progress without interrupting the session. That's a genuinely useful feature, like tapping someone on the shoulder to ask "hey, how far along are you?" without stopping their work.

Files got organized in the end but honestly OCR alone isn't enough context to make great categorization decisions. A screenshot of a GitHub PR and a screenshot of VS Code might have similar text but very different purposes. Some files ended up in slightly wrong folders.

Not the model's fault, it's a genuinely hard problem. But it got me thinking: if the agent could actually see the screenshot using vision instead of just reading extracted text, the categories would be way more accurate.

Modern Web Guidance Plugin

agy plugin install https://github.com/GoogleChrome/modern-web-guidance

This gives Antigravity context about modern web best practices, similar to what the Chrome Modern Web Guidance docs cover for Claude Code.

Using the CLI feels noticeably faster than AI-powered IDEs. No Electron overhead, no waiting for a UI to re-render. Just your terminal, the model, and results.

Every time I use a native CLI tool I wonder why Teams and VS Code are built on Electron. I get the history. Still.

I tried redesigning a section of my portfolio to add a tech stack display with icons. Gave it a screenshot reference and it couldn't nail the visual. Gave it a full URL to a reference site and the result was still not great honestly. Not sure if my prompts were bad or if this is just a CLI vs IDE thing, because the IDE felt like it gave better results. I don't know, need to experiment more.

Gemini Omni and SynthID

Almost forgot: Gemini Omni now has an evolved sense of physics for generating stylized videos. Generated video that actually behaves like it understands how things move and interact.

Also: SynthID watermarking and C2PA content credentials for identifying AI-generated content. As generated media gets better, tooling to authenticate what's real becomes critical infrastructure. Good to see it being built in at the platform level.

Final Thoughts

Google I/O 2026 didn't feel like a hype keynote. It felt like a company that had spent a year quietly building and was now ready to show it.

Fine-tuning Gemma 4 is the thing I most want to play with.

The agents-everywhere story is clearly where things are heading. The question is whether the underlying protocols stay open enough that indie devs can build on top of them. Hoping they do.

What was your favourite part of Google I/O 2026? Let me know in the comments, especially if you've played with Antigravity or Gemini Spark!

Top comments (10)

UnitBuilds • May 25

That AI is buying coffee now is very true. In a week, I'm open sourcing a stripped-down version of my MCP for browsing. Using the AOM instead of the DOM, it's 8x more efficient. But that's the reason it's stripped down... If I released the full one, WAF is dead.

So for context, I built it to browse, and it does that beautifully; it even maps sites based on their AOM. So I ran it on bot.sannysoft.com for interest's sake, because it was browsing G2.com, which is secured by DataDome, and it came up all clear. So I played around a bit more, creating a swarm running Gemma 4 e2b, and to my surprise it didn't set off any alarms anywhere... Not DataDome, not Cloudflare, not Google, not X. It could log in to my Google account and send an email, and it created its own X account and sent Elon a taunt. It swarmed 50 airline sites (albeit 5 in parallel at a time, due to the 8 GB VRAM constraint) to find the cheapest flight and book it, and it did it. I had it go to jspaint and draw that rectangle with an X that phones use to test the screen (I used it to test element isolation and coordinate translation), and it did it flawlessly. At 8x fewer tokens than puppeteering.

So naturally, I thought, lemme try it via a datacenter proxy. Finally, the captchas hit! And as fast as they hit, it solved them... Stripping down the required context for any task to the bare minimum, e.g. the nav AOM, or the individual challenge.hcaptcha iframe for a cropped screenshot, or sequential screenshots that get stitched, it bypassed everything I could throw at it. This is me, running on a single 8 GB card, yet it can run from a datacenter and do the same; someone could use a larger model that takes less time solving the captchas, while keeping the browser agent lean... Unfortunately, that marked the death of internet security. Because if I can build it as a quick cost-saving system for doing automated UI testing for my foundry, someone else built it for hacking and scraping.

And the funny part is, the AOM is protected by law, because any blocking of it would block impaired users from using the internet. So it's undetectable, unpatchable and 8x faster than whatever people are using now. Did I mention it allows scripting? It maps a task sequence once, you select a variable, and it can do it a million times, without even needing an LLM. While really cool for anyone wanting to use it for workflow automation, etc., think about it: it can log into a secured bank portal and execute a transfer...

That's why stripped down, detectable, predictable, so it's just a performance booster, not a malicious tool that strips every internet user of any safety they think they have.

Aabhas Sao • May 25

Wow, this is so cool. You have already built agents to do so much stuff, and using Gemma 4 too, so I guess a lot of savings. Would love to see a public repo or article for this, if you have one, to see how you did it.

I have not done anything with Gemma yet; this could be very useful.

UnitBuilds • May 25

Preview of it going through the paces at the moment. jspaint is a little different because of its canvas use, which LLMs are usually blind to, but the AOM makes it natively queryable. That lets the LLM map the canvas space to screen-space coordinates and execute tasks like click and drag to 'paint' lines via the tool, using xy start and xy end coordinates in canvas space. So even if you resize the canvas, change the window size, or the canvas extends past screen space, it can interact cleanly with it. Take note: these are straight lines drawn with the simple pencil tool, not the line-draw tool. The impressive part is that it can be pixel-perfect at the corners, which Puppeteer would struggle with.

UnitBuilds • May 25

At the moment I'm putting the mcp-lite through a full gauntlet to ensure it's web-safe and doesn't bypass anti-bot by default. That way it's a nifty tool for developers to automate workflows like UI testing (especially cool when swarming; the max I had in a headless swarm was 20 concurrent on a laptop), site-mapping, or simply not burning a baby brain model up while doing a simple web search.

Varsha Ojha • May 25

This is such a relatable contrast. Every keynote makes AI look like it’s building the future in minutes, but most real workflows are still stuck in tiny messy tasks like sorting screenshots, naming files, cleaning folders, or finding that one image from last week. Honestly, those boring use cases are where AI could be most useful. Not always replacing big creative work, but removing the small digital clutter we waste time on every day.

UnitBuilds • May 25

It's what I spent most of my career doing: finding production workflows that suck to do, then automating them. E.g. scanning in documents: write a scanner app that continuously scans. Sorting documents: write an OCR detection system that categorizes them by type and files them. Finding that one file you can't name but know a single line of: back the OCR results with a DB and have a lookup table. Small things that take up small amounts of time in their own right, but cascade into days/weeks/months when you crank the scale up. And one thing is always true... It's boring.

Aabhas Sao • May 25

Yes, true, hand over the boring and repetitive tasks to AI 😂. Maybe I can add an hourly trigger to automate it further.

Varsha Ojha • May 26

Haha yes, exactly. The small boring tasks are where AI feels most useful day to day. An hourly trigger could actually make sense here. Sort, rename, tag, and clean up before the mess piles up again.

Andy Stewart • May 27

Keeping high-frequency, domain-specific tasks local is the only true way to master data sovereignty. As for the classification errors you hit while organizing screenshots via the CLI, that's completely normal. Pure OCR fails to capture the visual context. Once on-device multimodal capabilities are fully deployed, local agents will finally have a real set of eyes.
