
Xavier Plantaz for Google AI


Building a Gemini-Powered Robotics Simulator in the Browser with MuJoCo WASM

Education Track: Build Apps with Google AI Studio

Robotics simulations historically required expensive hardware, complex local installations, or heavy computation. MuJoCo WASM and the AI Studio "Build" feature are changing that!

Robotics pick-up animation: the simulated robot picks up 4 red cubes and puts them in a tray

In this post, we break down how to build a "Pick and Place" robotics demo that runs entirely in the browser. We combine MuJoCo (via WebAssembly) for physics, Three.js for rendering, and Gemini Robotics ER as the reasoning engine to control a Franka Emika Panda robot loaded from the MuJoCo Menagerie, as described below.


1. Building in AI Studio Build

A working strategy for developing in AI Studio is to use Chrome tab groups. They let you keep visual track of everything being worked on, experiment with different prompting strategies, and do some lightweight version control:

Chrome Tabs Groups

To start a new project, begin at the left side of a new window, clearly stating what the final goal is. For a Gemini-powered MuJoCo simulation, our initial prompt in https://aistudio.google.com/apps was the following:

A simulated Franka Panda executing Pick and Place 
using Gemini Robotics ER.

Initial prompt in AI Studio Build

To optimize development speed without sacrificing reasoning depth, a workflow worth trying is a parallelized "four-tab" system designed to balance rapid exploration with complex problem-solving. The current stable build sits as an anchor on the far right ("Working Version"), while three active experimental tabs generate new solutions. Two of these run Gemini 3 Flash for high-frequency iteration, letting us quickly explore the problem space and refine our prompting strategy (think of it as an ongoing A/B testing mechanism). The third tab runs Gemini 3 Pro, leveraging its stronger reasoning to tackle the harder logic, albeit at a slower pace. This creates a continuous evolutionary cycle: the moment any of the three active tabs produces a breakthrough, that result is immediately promoted to the "Working Version", duplicated to populate three fresh tabs, and the process repeats, ensuring every new experiment starts from the latest, most robust foundation.

Prompt A/B testing in AI Studio


2. Getting MuJoCo WASM to work in AI Studio

The first and most critical technical hurdle is establishing a robust physics engine running entirely in AI Studio.

MuJoCo is an advanced, open-source physics engine developed at Google DeepMind for high-fidelity robotics simulation, and its WebAssembly (WASM) port lets developers run these complex, contact-rich physics calculations efficiently, directly inside a web browser.

Working in tandem with the MuJoCo WASM team, we leveraged the Alpha release of the engine. Following the very clear instructions on GitHub, we were able to compile it and get the bindings. The resulting dist/ folder gave us both mujoco_wasm.js and mujoco_wasm.wasm, which we can easily upload into our AI Studio directory.

To get them to work, the core challenge was fetching and initializing the WebAssembly binary correctly within AI Studio's file structure.

To solve this, we specifically focused on file handling: we leveraged the public/ folder and implemented a precise loading mechanism. The solution required fetching the mujoco_wasm.js and mujoco_wasm.wasm files, converting them into Blobs, and creating object URLs to bypass standard loading restrictions.

In practice, all that is required is to upload the files from mujoco/wasm/dist directly to the public folder:

Using AI Studio public folder

And use the following prompt:

To load and initialize the MuJoCo physics engine (WebAssembly version) in the browser, you need to use the following Blob URL workaround to bypass strict bundler restrictions. Instead of importing files directly, you will:

1. Manually fetch the raw `.wasm` and `.js` files.
const wasmRes = await fetch('/mujoco_wasm.wasm');
if (!wasmRes.ok) throw new Error("Failed to fetch wasm");

const jsRes = await fetch('/mujoco_wasm.js');
if (!jsRes.ok) throw new Error("Failed to fetch wasm.js");


2. Convert them into Blob Object URLs (memory-based URLs).
const wasmBlob = new Blob([await wasmRes.arrayBuffer()], { type: 'application/wasm' });
const wasmUrl = URL.createObjectURL(wasmBlob);

const jsBlob = new Blob([await jsRes.arrayBuffer()], { type: 'application/javascript' });
const jsUrl = URL.createObjectURL(jsBlob);

3. Dynamically import the JavaScript module from the Blob URL and instruct it to locate the WASM file using the created Blob URL.
// @ts-ignore - Dynamic import of blob URL
const module = await import(/* @vite-ignore */ jsUrl);
const loadMujoco = module.default;

return await loadMujoco({
  locateFile: (path: string) => path.endsWith('.wasm') ? wasmUrl : path
});
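Stitching the three steps together, the loading helper this prompt describes ends up looking roughly like the sketch below (the function name and typing are ours, not necessarily what AI Studio generates):

// Rough consolidation of the three steps above (function name and typing are illustrative).
async function loadMujocoFromPublicFolder(): Promise<any> {
  // 1. Fetch the raw binaries served from the public/ folder.
  const [wasmRes, jsRes] = await Promise.all([
    fetch('/mujoco_wasm.wasm'),
    fetch('/mujoco_wasm.js'),
  ]);
  if (!wasmRes.ok || !jsRes.ok) throw new Error('Failed to fetch MuJoCo WASM files');

  // 2. Convert them into in-memory Blob URLs to sidestep bundler restrictions.
  const wasmUrl = URL.createObjectURL(
    new Blob([await wasmRes.arrayBuffer()], { type: 'application/wasm' })
  );
  const jsUrl = URL.createObjectURL(
    new Blob([await jsRes.arrayBuffer()], { type: 'application/javascript' })
  );

  // 3. Import the module from the Blob URL and point it at the in-memory .wasm.
  // @ts-ignore - Dynamic import of blob URL
  const module = await import(/* @vite-ignore */ jsUrl);
  return module.default({
    locateFile: (path: string) => (path.endsWith('.wasm') ? wasmUrl : path),
  });
}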

Then, simply copying the code from the MuJoCo demo_app and appending it to the following prompt recreated the initial demo and got us started on building more advanced simulations:

Recreate the following demo: [demo_app code]

Here is what the result looked like in AI Studio:

Importing MuJoCo WASM in AI Studio

PS: The MuJoCo WASM files are being moved to NPM, and the demo has since been modified to import them directly as a library.


3. Importing a Robot in MuJoCo WASM

Once the physics engine was operational, the next step was introducing the robot itself. We used the MuJoCo Menagerie, a repository of high-quality robot models, specifically selecting the Franka Emika Panda. However, simply pointing to a URL is insufficient in a WASM environment. The simulation relies on a Virtual File System (VFS): all relevant assets, including meshes and XML definitions, must be fetched and written into a directory structure that the compiled MuJoCo engine can read.

This process required first building a "working directory" to mount the necessary assets dynamically, allowing the simulation to load the robot's geometry and kinematic chain without local dependencies.

Here is a prompt capturing all those learnings that will get a robot's files directly into AI Studio:

Load Franka Emika Panda (id: `franka_emika_panda`) robot, implementing the following VFS: 

## 1. How the VFS Works

The VFS is a layer provided by Emscripten that MuJoCo uses to "read" model files. We manage this through the `mujoco.FS` object.

### Initialization
Before loading a model, we create a workspace directory:

// From RobotLoader.ts
try { this.mujoco.FS.mkdir('/working'); } catch (e) { }

### Writing Files
To make a file available to MuJoCo, we fetch it from the web and write it to the VFS:

// Binary files (STL meshes, PNG textures)
const buffer = new Uint8Array(await res.arrayBuffer());
this.mujoco.FS.writeFile('/working/mesh.stl', buffer);

// Text files (XML/MJCF)
const text = await res.text();
this.mujoco.FS.writeFile('/working/model.xml', text);

### Loading the Model
Once all files are in the VFS, we tell MuJoCo to load the entry point:

// From MujocoSim.ts
this.mjModel = this.mujoco.MjModel.loadFromXML('/working/scene.xml');


## 2. Importing a Robot from MuJoCo Menagerie

Here is the step-by-step process to import them:

### Step 1: Define the Base URL
Every robot in Menagerie follows a standard path structure:
`https://raw.githubusercontent.com/google-deepmind/mujoco_menagerie/main/{robot_id}/`

### Step 2: Recursive Dependency Scanning
Models are rarely a single file. An XML might reference other XMLs (`<include>`), meshes (`<mesh>`), or textures (`<texture>`).

Use `DOMParser` to find these:

// Simplified scanning logic
const xmlDoc = parser.parseFromString(xmlString, 'text/xml');
xmlDoc.querySelectorAll('[file]').forEach(el => {
    let fileName = el.getAttribute('file');
    // Add to a download queue to fetch and write to VFS
});

Note: mesh and texture files sometimes live in an assets/ folder, in which case a `meshdir="assets"` attribute will be declared (example: `<compiler angle="radian" meshdir="assets" autolimits="true"/>`).

### Step 3: XML Patching
Sometimes the base Menagerie models need adjustments for specific tasks (like adding Inverse Kinematics targets or workspace objects). 

Dynamically inject XML elements before writing to the VFS:

// Adding a TCP (Tool Center Point) site for IK tracking
text = text.replace(
    /(<body[^>]*name=["']hand["'][^>]*>)/, 
    '$1<site name="tcp" pos="0 0 0.1" size="0.01" rgba="1 0 0 0.5" group="1"/>'
);

---

## 3. Implementation Summary

The `load()` method acts as the orchestrator:
1. **Cleans** the `/working` directory.
2. **Fetches** the root `scene.xml`.
3. **Parses** the XML to find assets (STL, PNG) and nested XMLs.
4. **Downloads** assets recursively.
5. **Patches** the XML strings to include demo-specific bodies (trays, cubes).
6. **Writes** everything to `this.mujoco.FS`.

By the time `load()` resolves, the WASM engine sees a complete, local directory structure containing all meshes and configurations required to boot the simulation.

This should get us almost directly to a working version of the robot file import:

First attempt at importing a Franka Panda
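For reference, the loading mechanism the prompt describes boils down to something like this simplified sketch (error handling and meshdir handling in the real generated code are more involved; fetchToVFS and MENAGERIE_BASE are illustrative names):

// Simplified sketch of the recursive fetch-and-write loop described in the prompt.
// Assumes `mujoco.FS` (the Emscripten VFS) is available; helper names are illustrative.
const MENAGERIE_BASE =
  'https://raw.githubusercontent.com/google-deepmind/mujoco_menagerie/main/franka_emika_panda/';

async function fetchToVFS(mujoco: any, fileName: string, meshDir = ''): Promise<void> {
  const subDir = meshDir ? `${meshDir}/` : '';
  const res = await fetch(MENAGERIE_BASE + subDir + fileName);
  if (!res.ok) throw new Error(`Failed to fetch ${fileName}`);

  if (fileName.endsWith('.xml')) {
    // Text files: write them as-is, then scan for nested dependencies.
    const text = await res.text();
    mujoco.FS.writeFile(`/working/${fileName}`, text);

    const doc = new DOMParser().parseFromString(text, 'text/xml');
    const dir = doc.querySelector('compiler')?.getAttribute('meshdir') ?? '';
    const deps = Array.from(doc.querySelectorAll('[file]')).map((el) => el.getAttribute('file')!);
    for (const dep of deps) {
      // Nested XMLs live next to the parent; meshes/textures live under meshdir.
      await fetchToVFS(mujoco, dep, dep.endsWith('.xml') ? '' : dir);
    }
  } else {
    // Binary files (STL meshes, PNG textures) go under the meshdir subfolder.
    if (meshDir) { try { mujoco.FS.mkdir(`/working/${meshDir}`); } catch (e) { /* exists */ } }
    const buffer = new Uint8Array(await res.arrayBuffer());
    mujoco.FS.writeFile(`/working/${subDir}${fileName}`, buffer);
  }
}

Calling fetchToVFS(mujoco, 'scene.xml') and then mujoco.MjModel.loadFromXML('/working/scene.xml') mirrors the flow summarized in the prompt above.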


4. Iterating on Robot import

A common issue faced during the development of this demo was a robot that looked "exploded":

Exploded robot views, signalling issues

The robot parts move correctly in relation to each other, but the visual meshes are offset or rotated incorrectly relative to their physics bodies. In MuJoCo, a Body (physics frame) can have multiple Geoms (visual/collision shapes), and those Geoms often have their own local offsets (pos and quat) defined in the XML relative to the Body. To replicate this in Three.js, do not apply physics updates directly to the loaded Mesh. Instead, use a Parent-Child hierarchy, which is what solved the problem here.

Specifically, the following prompt unlocked the situation:

1. Parent: A `THREE.Group` that represents the MuJoCo Body. *This* is what receives the `xpos` and `xquat` updates in the sync loop.  
2. Child: The loaded STL/OBJ. This sits inside the Parent. You apply the XML `<geom>` offsets to this child *once* during initialization.

After much back and forth, the following implementation worked:

// 1. Create the Physics Anchor 
const bodyGroup = new THREE.Group();
scene.add(bodyGroup);

// 2. Load the Visual Mesh
loader.load('link1.stl', (geometry) => {
    const mesh = new THREE.Mesh(geometry, material);

    // 3. Apply the local transform defined in the XML <geom> tag
    // (These values come from parsing the MJCF/URDF)
    const geomPos = { x: 0, y: 0, z: -0.333 }; // Example offset
    const geomQuat = { w: 1, x: 0, y: 0, z: 0 }; 

    mesh.position.set(geomPos.x, geomPos.y, geomPos.z);

    // Remember to swap quaternion order for Three.js!
    mesh.quaternion.set(geomQuat.x, geomQuat.y, geomQuat.z, geomQuat.w);

    // 4. Add mesh to the anchor
    bodyGroup.add(mesh);
});

// 5. In your Update Loop, only move 'bodyGroup'
// bodies[i] = bodyGroup; 
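The example geomPos / geomQuat values above come from the MJCF itself; a minimal way to read them (assuming the XML string is already fetched, and keeping MuJoCo's w-x-y-z quaternion order) could look like this:

// Minimal sketch: read a <geom>'s local offset for a given body from the MJCF string.
function getGeomOffset(xmlString: string, bodyName: string) {
  const doc = new DOMParser().parseFromString(xmlString, 'text/xml');
  const geom = doc.querySelector(`body[name="${bodyName}"] > geom`);

  // MuJoCo defaults when the attributes are omitted: pos="0 0 0", quat="1 0 0 0" (w x y z).
  const pos = (geom?.getAttribute('pos') ?? '0 0 0').split(/\s+/).map(Number);
  const quat = (geom?.getAttribute('quat') ?? '1 0 0 0').split(/\s+/).map(Number);

  return {
    pos: { x: pos[0], y: pos[1], z: pos[2] },
    quat: { w: quat[0], x: quat[1], y: quat[2], z: quat[3] }, // Three.js wants (x, y, z, w)
  };
}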

We also identified axis mismatches between Three.js and MuJoCo; the following prompt solved the alignment issues:

Alignment: MuJoCo aligns along the Z-axis by default. Three.js aligns along the Y-axis. Rotate the geometry once during creation, for example: 

const geo = new THREE.CylinderGeometry(radius, radius, height, segments);
geo.rotateX(Math.PI / 2); // Rotate to align with Z-axis

For a smooth simulation, perform the update in your `requestAnimationFrame` loop after the physics step:

function update() {
    // 1. Step Physics (advance the simulation by one display frame)
    const startSimTime = mjData.time; // physics time at the start of this frame
    while (mjData.time - startSimTime < 1.0 / 60.0) {
        mujoco.mj_step(mjModel, mjData);
    }

    // 2. Sync Visuals
    for (let i = 0; i < bodies.length; i++) {
        const mesh = bodies[i];
        // Position
        mesh.position.fromArray(mjData.xpos, i * 3);

        // Quaternion (Manual swap)
        mesh.quaternion.set(
            mjData.xquat[i * 4 + 1],
            mjData.xquat[i * 4 + 2],
            mjData.xquat[i * 4 + 3],
            mjData.xquat[i * 4 + 0]
        );

        // Ensure matrices are updated for the renderer
        mesh.updateMatrixWorld();
    }

    renderer.render(scene, camera);
    requestAnimationFrame(update);
}

5. Implementing Inverse Kinematics

Inverse Kinematics

Perhaps the most mathematically intensive portion of the project was implementing Inverse Kinematics (IK)—the logic required to calculate the necessary joint angles to move the robot's gripper to a specific point in 3D space.

To achieve smooth, continuous motion, we referenced the "Franka Analytical Inverse Kinematics" research paper and tasked the model with porting the mathematical solution into JavaScript. This required translating complex geometric formulas into performant code. The result was a custom IK solver that allows for precise control of the robotic arm, complete with a visual gizmo for debugging, ensuring the robot moves naturally without singularity errors.

Final Inverse Kinematics
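The full 7-DOF analytical solver is far too long to reproduce here, but to give a flavor of what "translating geometric formulas into code" means, here is a deliberately tiny 2-link planar IK example (a simplified stand-in for illustration only, not the Franka solver):

// Illustrative only: closed-form IK for a 2-link planar arm (law of cosines).
// The real Franka solver handles 7 joints, orientation and joint limits.
function solveTwoLinkIK(
  x: number, y: number,   // target point in the arm's plane
  l1: number, l2: number  // link lengths
): { shoulder: number; elbow: number } | null {
  const d2 = x * x + y * y;
  const d = Math.sqrt(d2);

  // Unreachable if the target lies outside the annulus the arm can cover.
  if (d > l1 + l2 || d < Math.abs(l1 - l2)) return null;

  // Elbow angle from the law of cosines (clamped for numerical safety).
  const cosElbow = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2);
  const elbow = Math.acos(Math.min(1, Math.max(-1, cosElbow)));

  // Shoulder angle: direction to target minus the offset introduced by the bent elbow.
  const shoulder =
    Math.atan2(y, x) - Math.atan2(l2 * Math.sin(elbow), l1 + l2 * Math.cos(elbow));

  return { shoulder, elbow };
}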

With this Inverse Kinematics Solver in place, building a pick-up sequence for a specific target is just a few prompts away! Here is the final animated sequence, showcasing the IK Solver distance, the joint positions and the Sequencer:

Sequencer
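Conceptually, the Sequencer is just an ordered list of timed gripper targets fed to the IK solver frame by frame. A hypothetical sketch of such a sequence (helper names and offsets are ours, not the generated app's):

// Hypothetical pick-and-place sequence: each step is an IK target plus a gripper state.
type Vec3 = [number, number, number];
type Step = { target: Vec3; gripper: 'open' | 'closed'; duration: number };

const above = ([x, y, z]: Vec3, h = 0.15): Vec3 => [x, y, z + h]; // hover offset in metres

function buildPickSequence(cube: Vec3, tray: Vec3): Step[] {
  return [
    { target: above(cube), gripper: 'open',   duration: 1.0 }, // hover over the cube
    { target: cube,        gripper: 'open',   duration: 0.8 }, // descend
    { target: cube,        gripper: 'closed', duration: 0.4 }, // grasp
    { target: above(cube), gripper: 'closed', duration: 0.8 }, // lift
    { target: above(tray), gripper: 'closed', duration: 1.2 }, // carry to the tray
    { target: above(tray), gripper: 'open',   duration: 0.4 }, // release
  ];
}

The idea is that each step's target is handed to the IK solver, and the resulting joint angles drive the arm until the step's duration elapses.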


6. Integrating Gemini Spatial Understanding

The final layer was adding intelligence. We used the Gemini Spatial Understanding demo on AI Studio as the basis for the robot's perception. By feeding the rendered scene into the model, Gemini can identify objects based on visual prompts such as "red cubes". The model analyzes the spatial relationships in the scene and returns the coordinates of the target objects. These coordinates are then fed into the Inverse Kinematics system, closing the loop between perception and action. This integration transforms the demo from a passive simulation into an active, agentic application capable of executing complex "pick and place" commands based on visual inputs.

Implement Gemini ER based on the following file. The camera should move to a top view position, take a screenshot and send that data alongside a prompt that the user would have added. 
[Insert Prompt.tsx from https://aistudio.google.com/apps/bundled/robotics-spatial-understanding] 

Gemini ER Prompt
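Under the hood, the perception half of that loop boils down to something like the sketch below (using the @google/genai SDK; the model id, the 0-1000 normalized point format and the image-to-world conversion are assumptions based on the spatial-understanding demo, not the exact generated code):

import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: process.env.API_KEY });

// Ask Gemini to locate objects in a top-down screenshot of the simulated scene.
async function locateObjects(screenshotBase64: string, query: string) {
  const response = await ai.models.generateContent({
    model: 'gemini-robotics-er-1.5-preview', // assumed model id
    contents: [
      { inlineData: { mimeType: 'image/png', data: screenshotBase64 } },
      { text: `Point to every ${query}. Answer as JSON: [{"point": [y, x], "label": "..."}]` },
    ],
    config: { responseMimeType: 'application/json' },
  });
  // Points come back normalized to a 0-1000 grid, as in the spatial-understanding demo.
  return JSON.parse(response.text ?? '[]') as { point: [number, number]; label: string }[];
}

// Convert a normalized image point to a world-space IK target, assuming a known
// top-down camera that covers a square workspace of `extent` metres.
function pointToWorld([y, x]: [number, number], extent = 1.0) {
  return { x: (x / 1000 - 0.5) * extent, y: (0.5 - y / 1000) * extent };
}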


7. Next Steps and Future Development

This workflow is not limited to the Panda. For example, here is what importing Aloha, solving its IK, and adding MediaPipe-based control looks like after just a few more prompts:

Aloha MediaPipe

This demo is available at https://ai.studio/apps/drive/13B34fAndVesfHue7GSZ_pqdTYT2GqeOo

Top comments (2)

Ofri Peretz

The four-tab parallel prompting workflow is a clever pattern — essentially treating Gemini tabs like worker threads with different speed/quality tradeoffs and promoting the best result. I've been doing something similar when iterating on complex prompts, though never formalized it this cleanly. Curious whether you hit any divergence issues where Flash and Pro produced fundamentally incompatible approaches that were hard to merge back into a single "working version."

Tahir yamin

Awesome tutorial. Thanks for sharing such detailed work.