Ankit Khandelwal

From Perception to Embodied Intelligence: Evolution, Architectures, and the Humanoid Gap

Vision-Language-Action (VLA) models represent a paradigm shift from passive multimodal understanding to active embodied control. This brief maps the lineage from foundational Vision-Language Models (VLMs) like CLIP and BLIP to current state-of-the-art VLA systems, revealing critical architectural transitions, data strategies, and failure modes that define the frontier of humanoid manipulation.

The analysis identifies three core evolutionary phases:

(1) VLM pre-training for semantic understanding
(2) action tokenization enabling end-to-end control
(3) hybrid architectures balancing reasoning with real-time execution

For humanoid robotics, fundamental gaps remain in proprioceptive reasoning, long-horizon planning, and physics-aware action generation; current open-source models address these challenges only partially.


1. The Evolutionary Timeline: From VLMs to VLAs

Phase 1: Foundation (2021–2022) – VLMs as Semantic Engines

CLIP (2021) and BLIP (2022) established contrastive learning as the dominant paradigm for aligning vision and language modalities. These models excelled at matching images to text descriptions but lacked any mechanism for action generation. Their legacy persists in modern VLAs: OpenVLA inherits SigLIP's vision encoder, while Pi0 leverages PaliGemma's VLM backbone. hankyukim

Key Limitation: VLMs were fundamentally passive, optimized for retrieval and classification, not sequential decision-making. Early attempts like CLIPort (2022) demonstrated that grafting CLIP representations onto robotic policies via imitation learning could achieve task-specific success but failed to generalize across embodiments or semantic concepts beyond the training distribution. arxiv

Phase 2: Tokenization Breakthrough (2023) – RT-2 and the Birth of VLAs

Google DeepMind's RT-2 (July 2023) catalyzed the field by reconceptualizing robot actions as text tokens. The architecture quantized continuous actions into discrete bins (typically 256 per dimension) and appended them to the vocabulary of a PaLM-E or PaLI-X VLM. This enabled training with standard next-token prediction objectives, unifying web-scale vision-language pre-training with robotic demonstrations. madison-proceedings

Performance Leap: RT-2 achieved 3× improvement in generalization over RT-1, demonstrating emergent capabilities like reasoning about object categories and improvising tools. The model could interpret novel commands ("place the apple on the 3") despite never observing such combinations in robot data. deepmind
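
A minimal sketch of the binning scheme described above, with illustrative action bounds and dimension ordering (the exact normalization, bin edges, and vocabulary mapping differ per model):

```python
import numpy as np

N_BINS = 256  # bins per action dimension, as described for RT-2 / OpenVLA

def discretize(action, low, high, n_bins=N_BINS):
    """Map a continuous action vector to integer bin indices (one token per dimension)."""
    action = np.clip(action, low, high)
    norm = (action - low) / (high - low)                        # normalize to [0, 1]
    return np.minimum((norm * n_bins).astype(int), n_bins - 1)

def undiscretize(bins, low, high, n_bins=N_BINS):
    """Recover bin-center values from token indices; the quantization error is irreducible."""
    return low + (bins + 0.5) / n_bins * (high - low)

# Example: 7-DoF end-effector action (dx, dy, dz, droll, dpitch, dyaw, gripper)
low, high = np.full(7, -1.0), np.full(7, 1.0)
action = np.array([0.12, -0.40, 0.05, 0.0, 0.30, -0.90, 1.0])
tokens = discretize(action, low, high)   # these ids map onto tokens appended to the VLM vocabulary
print(tokens, undiscretize(tokens, low, high))
```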

Phase 3: Scaling and Open-Source (2024–2025) – OpenVLA, SmolVLA, and Pi0

OpenVLA (2024) democratized access with a 7B-parameter model trained on 970k demonstrations from the Open X-Embodiment dataset. Built on Llama 2 + DINOv2 + SigLIP, it outperformed closed models like RT-2-X (55B parameters) with 7× fewer parameters by leveraging more diverse training data and 27 training epochs (vs. typical 1-2 epochs for VLMs). arxiv

SmolVLA (2025) pushed efficiency, achieving OpenVLA-level performance with <0.5B parameters by pairing a compact VLM backbone with a flow-matching action expert and an asynchronous inference stack. Its key insight: action-generation quality depends more on architectural efficiency than on raw parameter count. youtube

Pi0 Series (Physical Intelligence, 2024–2025) introduced hybrid architectures combining autoregressive action tokens with continuous flow matching. Pi0.5 added temporal awareness through timestep conditioning, while Pi0.6 scaled to 5B parameters and incorporated knowledge insulation, training the VLM backbone on FAST tokens while isolating the action expert's gradients. arxiv


2. Thematic Deep Dives: What Worked vs. What Failed

2.1 Key Ideas That Worked

Action Tokenization as Sequence Prediction

Treating actions as discrete tokens enabled direct transfer of LLM training infrastructure to robotics. RT-2's 256-bin quantization scheme remains the default in OpenVLA, providing a simple bridge between continuous control and autoregressive generation. This approach inherits powerful properties from language modeling: in-context learning, few-shot adaptation, and chain-of-thought reasoning. arxiv

Evidence: OpenVLA achieves 95% action-token accuracy after 27 training epochs, with this accuracy correlating strongly with robot success rates. The discrete representation also simplifies multi-task training across heterogeneous robot embodiments. arxiv
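
To make the "direct transfer of LLM training infrastructure" concrete, here is a hedged sketch of the next-token objective over an extended vocabulary, where ids at or above `action_token_start` are assumed to be the 256 added action tokens (shapes and names are illustrative, not OpenVLA's actual code):

```python
import torch
import torch.nn.functional as F

def action_token_loss(logits, labels, action_token_start):
    """
    logits: (B, T, V) VLM outputs over the extended vocabulary
    labels: (B, T) target token ids; action tokens occupy ids >= action_token_start
    Standard next-token cross-entropy; action-token accuracy is tracked separately.
    """
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = labels[:, 1:].reshape(-1)
    loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
    with torch.no_grad():
        action_mask = shift_labels >= action_token_start          # positions holding action tokens
        preds = shift_logits.argmax(dim=-1)
        acc = (preds[action_mask] == shift_labels[action_mask]).float().mean()
    return loss, acc
```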

Flow Matching for Continuous Control

Diffusion-based action heads address the continuity problem inherent in tokenization. Pi0 and SmolVLA use flow matching to predict action chunks as continuous trajectories, avoiding quantization errors. This enables smoother, more precise control, which is critical for contact-rich manipulation. youtube

Performance Impact: Pi0 outperforms tokenized baselines on action chunking tasks (e.g., folding laundry) where precise force modulation matters. Flow matching also supports variable horizon predictions, unlike fixed-length token sequences. arxiv
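
For intuition, here is a minimal sketch of the flow-matching objective on an action chunk, using a linear interpolation path between Gaussian noise and the demonstrated actions; the conditioning features, chunk horizon, and tiny network are placeholders rather than Pi0's or SmolVLA's actual architecture:

```python
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    """Tiny velocity-prediction head: (noisy chunk, timestep, VLM features) -> velocity."""
    def __init__(self, act_dim=7, horizon=16, cond_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim * horizon + 1 + cond_dim, hidden), nn.GELU(),
            nn.Linear(hidden, act_dim * horizon),
        )
        self.act_dim, self.horizon = act_dim, horizon

    def forward(self, x_t, t, cond):
        inp = torch.cat([x_t.flatten(1), t[:, None], cond], dim=-1)
        return self.net(inp).view(-1, self.horizon, self.act_dim)

def flow_matching_loss(expert, actions, cond):
    """actions: (B, horizon, act_dim) demonstrated chunk; cond: (B, cond_dim) VLM features."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.size(0), device=actions.device)        # interpolation time in [0, 1]
    x_t = (1 - t)[:, None, None] * noise + t[:, None, None] * actions
    target_velocity = actions - noise                             # d x_t / dt for the linear path
    pred = expert(x_t, t, cond)
    return ((pred - target_velocity) ** 2).mean()
```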

Knowledge Insulation and Modularity

VLA-Adapter and Pi0.6 demonstrate that decoupling VLM reasoning from action generation improves training efficiency. By freezing the VLM backbone and training only a lightweight action expert, these models avoid catastrophic forgetting of web-scale knowledge while specializing for robot control. arxiv

Efficiency Gains: VLA-Adapter trains a powerful VLA in 8 hours on a single consumer GPU, while Pi0.6's insulated gradients prevent performance degradation on vision-language benchmarks. website.pi-asset
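
The adapter-style recipe can be sketched as below. Note that Pi0.6's knowledge insulation goes further (the backbone keeps training on FAST tokens while action-expert gradients are stopped at the interface), which this freeze-everything sketch does not capture:

```python
import torch

def freeze_backbone_train_expert(vlm, action_expert, lr=1e-4):
    """Freeze the VLM backbone so only the lightweight action expert receives gradients."""
    for p in vlm.parameters():
        p.requires_grad_(False)
    vlm.eval()  # keep normalization statistics fixed as well
    return torch.optim.AdamW(action_expert.parameters(), lr=lr)

# Illustrative training step: the backbone runs under no_grad, only the expert updates.
# optimizer = freeze_backbone_train_expert(vlm, action_expert)
# with torch.no_grad():
#     cond = vlm(images, instruction)                  # frozen semantic features
# loss = flow_matching_loss(action_expert, actions, cond)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```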

2.2 Key Ideas That Failed

Naive Proprioception Integration

Feeding raw robot state (joint angles, end-effector poses) directly as additional tokens creates shortcut learning. Policies overfit to state-action memorization rather than visual reasoning, degrading spatial generalization. In testing, models trained with proprioception fail when object positions deviate slightly from training trajectories. arxiv

Failure Mode: A study on visuomotor policies found that proprioceptive states cause "shortcuts where the policy directly associates absolute configurations with actions," leading to 40-60% success rate drops under spatial perturbations. arxiv

Monolithic Scaling Without Architectural Innovation

Simply increasing VLM backbone size (e.g., RT-2-X's 55B parameters) yields diminishing returns for robot control. The computational overhead (15GB of GPU memory for 6 Hz inference) makes real-time deployment impractical. Larger models also struggle with action-token accuracy, as the vast parameter space prioritizes language modeling over control precision. arxiv

Empirical Evidence: OpenVLA's 7B model matches RT-2-X's performance despite 7× fewer parameters, suggesting data diversity and training recipe matter more than scale. arxiv

Single-Modality Action Generation

Pure autoregressive or pure diffusion approaches each have blind spots. Autoregressive models struggle with continuous precision (quantization error), while diffusion models lack the reasoning depth of VLMs for long-horizon planning. HybridVLA attempted to combine both but introduced training interference between the two generation paradigms, requiring complex collaborative ensemble mechanisms that increased inference latency. arxiv


3. Open Source Model Comparison: OpenVLA vs. SmolVLA vs. Pi0

| Feature | OpenVLA (7B) | SmolVLA (<0.5B) | Pi0.6 (5B) |
| --- | --- | --- | --- |
| Backbone | Llama 2 + DINOv2 + SigLIP | Qwen 2.5 0.5B + custom ViT | Gemma3 4B |
| Action Head | Autoregressive tokens (256 bins) | Flow matching (continuous) | Hybrid: FAST tokens + flow matching website.pi-asset |
| Training Data | 970k demos (OpenX dataset) | Public community datasets | Proprietary large-scale corpus |
| Inference Speed | 6 Hz on RTX 4090 arxiv | 12.5 Hz on L40s (2.5× faster than OpenVLA) ai.stanford | 5-10 Hz (denoising steps dependent) |
| Key Innovation | Cross-embodiment generalization | Asynchronous inference stack | Knowledge insulation + RL fine-tuning pi |
| Simulation Performance | 62% on LIBERO-90 ai.stanford | 77% on LIBERO-90 (w/ action chunks) ai.stanford | State-of-the-art on LIBERO-5 (96.5%) arxiv |
| Real-World Strength | Generalization across robots | Deployment on consumer GPUs | Long-horizon tasks (coffee making, laundry) pi |
| Critical Weakness | Slow inference, quantization error | Limited long-horizon reasoning | Proprietary, computationally intensive |

Architectural Deep Dive

OpenVLA follows the RT-2 blueprint faithfully: discretize actions, append to vocabulary, train with cross-entropy loss. Its strength lies in the diversity of the curated OpenX dataset, enabling zero-shot control of unseen robots. However, the autoregressive generation bottleneck limits real-time performance: 15GB of GPU memory and 6 Hz inference constrain deployment to high-end hardware. arxiv

SmolVLA challenges the "bigger is better" orthodoxy. By using a compact VLM and flow matching action expert, it achieves comparable performance with 14× fewer parameters. The asynchronous inference stack decouples perception from action generation, allowing new chunks to be predicted while the robot executes previous commands. This is particularly impactful for dynamic environments where reaction time matters. huggingface
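
The decoupling idea can be sketched as a producer/consumer pair. This is not SmolVLA's actual asynchronous stack (which, among other things, handles stale chunks more carefully), and `policy.predict_chunk`, `get_observation`, and `send_action` are hypothetical placeholders:

```python
import queue
import threading
import time

chunk_queue = queue.Queue(maxsize=2)   # small buffer of pending action chunks

def planner(policy, get_observation):
    """Producer: run the (slow) VLA forward pass and enqueue predicted action chunks."""
    while True:
        obs = get_observation()
        chunk = policy.predict_chunk(obs)   # hypothetical: returns a list of actions
        chunk_queue.put(chunk)              # blocks if the executor falls behind

def executor(send_action, control_hz=30):
    """Consumer: stream actions to the robot at a fixed control rate."""
    while True:
        chunk = chunk_queue.get()
        for action in chunk:
            send_action(action)
            time.sleep(1.0 / control_hz)

# threading.Thread(target=planner, args=(policy, get_observation), daemon=True).start()
# executor(send_action)
```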

Pi0.6 represents the hybrid extreme: it trains the VLM backbone on FAST discrete tokens while the action expert predicts continuous flows. Knowledge insulation prevents gradient interference, and offline RL pre-training (Recap) doubles throughput on complex tasks. The model's hierarchical design supports heterogeneous prompts, enabling high-level task conditioning. The trade-off is accessibility: Pi0.6's training requires proprietary data and substantial compute, limiting reproducibility. pi


4. The Humanoid Gap Report: Missing Capabilities for Hand Manipulation

4.1 Proprioception and Tactile Integration

Current VLAs treat proprioception as an auxiliary input, leading to shortcut learning and poor spatial generalization. Humanoid hands require fine-grained force feedback and slip detection, capabilities absent from standard VLA pipelines. arxiv

Gap: No open-source VLA integrates tactile sensing end-to-end. ForceVLA and AnyTouch explore Mixture-of-Experts for contact-rich tasks, but these remain research prototypes. The lack of large-scale tactile datasets mirrors the early scarcity of robot demonstrations. themoonlight

Opportunity: Develop a "Tactile VLA" that fuses vision, language, and distributed pressure sensor arrays. The architecture should use tactile tokens analogous to image patches, enabling the VLM backbone to reason about contact forces and friction constraints.
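
Purely as a hypothetical illustration of "tactile tokens analogous to image patches" (no open-source VLA currently ships such a module), a patch-style tokenizer over a pressure grid might look like this:

```python
import torch
import torch.nn as nn

class TactileTokenizer(nn.Module):
    """Hypothetical: turn a pressure-sensor array into patch tokens for a VLM backbone."""
    def __init__(self, grid=(16, 16), patch=4, d_model=512):
        super().__init__()
        self.proj = nn.Conv2d(1, d_model, kernel_size=patch, stride=patch)
        n_patches = (grid[0] // patch) * (grid[1] // patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d_model))  # learned position embedding

    def forward(self, pressure):
        # pressure: (B, 1, H, W) normalized sensor readings
        tokens = self.proj(pressure).flatten(2).transpose(1, 2)      # (B, n_patches, d_model)
        return tokens + self.pos   # would be concatenated with image/text tokens downstream
```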

4.2 Long-Horizon Planning and Memory

Humanoid manipulation tasks (e.g., assembling furniture) span 5–20 minutes and require remembering partial progress. Standard VLAs operate with Markovian assumptions and fixed context windows, causing failure when intermediate steps are ambiguous. arxiv

Gap: MemoryVLA demonstrates perceptual-cognitive memory banks for manipulation, but its evaluation is limited to tabletop tasks. Humanoid whole-body control introduces additional complexity: locomotion plans must be retained while hands execute fine manipulations. arxiv

Opportunity: Implement a hierarchical memory system with (1) working memory for immediate action chunks and (2) episodic memory for task-level progress. The hippocampal-inspired consolidation mechanism from MemoryVLA could scale to humanoid tasks by encoding proprioceptive trajectories alongside visual observations. arxiv
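
A hedged sketch of what the interface to such a two-level memory could look like; the consolidation trigger and the encoders that would produce the embeddings are hypothetical:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class HierarchicalMemory:
    """Hypothetical two-level memory: working memory for recent chunks, episodic for task progress."""
    working: deque = field(default_factory=lambda: deque(maxlen=32))  # recent (observation, chunk) pairs
    episodic: list = field(default_factory=list)                      # consolidated subtask summaries

    def step(self, obs_embedding, action_chunk):
        self.working.append((obs_embedding, action_chunk))

    def consolidate(self, subtask_summary):
        """Called when a subtask completes (e.g., 'left leg attached'); frees working memory."""
        self.episodic.append(subtask_summary)
        self.working.clear()

    def context(self):
        """Context fed back to the policy: full episodic trace plus the recent working buffer."""
        return {"episodic": list(self.episodic), "working": list(self.working)}
```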

4.3 Physics-Aware Action Generation

VLAs hallucinate physically implausible actions, predicting grasps that violate kinematic constraints or object trajectories that ignore gravity. This stems from the VLM backbone's pixel-space reasoning, which lacks 3D physical grounding.

Gap: GeoVLA and 3D-VLA integrate point clouds and depth maps, but these are add-ons rather than core architectural features. The models still prioritize semantic alignment over physical feasibility. arxiv

Opportunity: Embed a differentiable physics simulator within the VLA training loop. Actions could be penalized for violating Newtonian mechanics, similar to how RL uses physics-based rewards. The "visual foresight" approach in F1-VLA shows promise: predicting next visual states correlates with action reliability, suggesting that generative world models could enforce physical consistency. arxiv
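
As a rough illustration, assuming the policy (or a world model) predicts a 3D trajectory, a differentiable plausibility term could be added to the loss; this is a sketch, not a substitute for a full physics simulator:

```python
import torch

GRAVITY = 9.81  # m/s^2, used only to set a plausibility scale for accelerations

def physics_penalty(pred_positions, dt=0.05, max_accel=2 * GRAVITY):
    """
    Hypothetical plausibility penalty on a predicted 3D trajectory of shape (B, T, 3):
    finite-difference accelerations far beyond gravity-scale magnitudes are penalized.
    A real differentiable simulator would additionally enforce contact and rigid-body constraints.
    """
    vel = (pred_positions[:, 1:] - pred_positions[:, :-1]) / dt
    acc = (vel[:, 1:] - vel[:, :-1]) / dt
    excess = torch.relu(acc.norm(dim=-1) - max_accel)
    return excess.mean()

# total_loss = imitation_loss + lambda_phys * physics_penalty(predicted_trajectory)
```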

4.4 Sim-to-Real for Humanoid Morphology

Humanoid robots exhibit high-dimensional action spaces (30+ DOF) and complex contact dynamics. Current sim-to-real methods rely on domain randomization, which fails to capture the nuance of bipedal balance and bimanual coordination. pmc.ncbi.nlm.nih

Gap: HumanVLA demonstrates vision-language-directed object rearrangement but requires privileged state information and hand-crafted finite state machines. The sim-to-real gap remains substantial, with a 17% failure rate in real-world experiments, primarily due to depth-sensing errors and contact-estimation delays. arxiv

Opportunity: Leverage human video data as an intermediate domain. EgoVLA extracts wrist and hand actions from egocentric videos, using inverse kinematics to retarget to robot hands. This "human-to-robot" transfer could bootstrap humanoid VLA training without expensive real robot data collection. rchalyang.github


5. Critical Disagreements and Uncertainties

Disagreement 1: Proprioception's Role

  • Proponents: Proprioception provides compact, accurate state information essential for precise servo control. arxiv
  • Critics: End-to-end visuomotor policies without explicit state inputs achieve better spatial generalization, as they cannot memorize trajectories. arxiv
  • Resolution: The consensus is shifting toward conditioned proprioception, using state inputs only for low-level control while keeping high-level reasoning vision-driven, as seen in Helix's dual-system architecture. iotworldtoday

Disagreement 2: Action Representation

  • Tokenization Camp: Discrete tokens enable direct VLM transfer and chain-of-thought reasoning (OpenVLA, RT-2). arxiv
  • Diffusion Camp: Continuous flow matching captures action continuity and supports variable horizons (Pi0, SmolVLA). youtube
  • Resolution: Hybrid approaches (Pi0.6, HybridVLA) are emerging as the synthesis, but training interference remains an open problem. arxiv

Uncertainty: The optimal data-mixture ratio for humanoid VLAs is unknown. RT-2 used 10% robotics data, while OpenVLA uses 100%. For humanoids, where demonstration data is scarcer, more aggressive web-scale pre-training may be needed, but it risks physics misalignment.


6. Conclusion

VLA models have evolved from passive VLMs to active embodied agents, but the leap to reliable humanoid manipulation remains incomplete. The open-source ecosystem (OpenVLA, SmolVLA) has democratized access, yet critical gaps persist in proprioceptive reasoning, long-horizon memory, and physics-aware generation.
