Ankit Khandelwal

Teleoperation Data Quality for Imitation Learning: What Actually Breaks the Model

Practical rubric design and failure modes from auditing robot teleop datasets (e.g. LeRobot).


Why this post

We audited teleoperation episodes for an imitation-learning pipeline. Removing poor-quality episodes (about 20–40% in our case) led to clearly better learning; the literature often reports ~10–15% policy improvement from similar filtering. This post covers the rubric mistakes that caused inconsistent scores and the failure modes we kept seeing in the data.


1. Rubric mistakes and how to fix them

Mistake 1: Metrics that sound clear but aren’t.

Example: “Mistake-to-Recovery-Ratio.” People disagree: Is it (total mistakes)/(total recoveries) or (total mistakes)/(total recovery attempts)? If a pick fails, then fails again, then succeeds, is that one recovery or two attempts?

How it should be: Define one ratio per episode. Count each distinct mistake once (each new failure event). Count a recovery only when the operator successfully got back on track; failed attempts in between don’t add extra recoveries. Write this in the rubric: “Count a recovery only when intended behavior has resumed; don’t count failed attempts as new mistakes unless it’s a new failure (e.g. new drop).” If you also want to penalize messy recoveries, add a separate “recovery attempts per mistake” number.
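To make the counting rule concrete, here is a minimal Python sketch. It assumes each episode is annotated with an ordered event log using three placeholder labels — "mistake" (a new failure event), "attempt" (a failed recovery attempt), and "recovery" (intended behavior resumed). That representation is an illustration, not part of any particular tool or dataset format.

```python
def mistake_to_recovery_ratio(events: list[str]) -> float | None:
    """One ratio per episode: distinct mistakes / successful recoveries.

    Failed attempts between a mistake and its recovery add nothing to
    either count. A clean episode (no mistakes) scores 0.0; an episode
    with mistakes but no successful recovery returns None so it can be
    flagged instead of dividing by zero.
    """
    mistakes = sum(1 for e in events if e == "mistake")
    recoveries = sum(1 for e in events if e == "recovery")
    if mistakes == 0:
        return 0.0
    if recoveries == 0:
        return None  # never got back on track; flag the episode
    return mistakes / recoveries


def attempts_per_mistake(events: list[str]) -> float | None:
    """Separate 'messy recovery' metric: failed attempts per distinct mistake."""
    mistakes = sum(1 for e in events if e == "mistake")
    attempts = sum(1 for e in events if e == "attempt")
    if mistakes == 0:
        return None
    return attempts / mistakes
```

With this rule, the pick that fails, fails again, then succeeds becomes `["mistake", "attempt", "recovery"]`: the ratio is 1.0 (one mistake, one recovery), and `attempts_per_mistake` separately records how messy the recovery was.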

Mistake 2: No rule for overall quality.

Scorers mark the overall episode High when most dimensions are High but one is Low, so "high quality" ends up not being strict.

How it should be: Overall = Low if any dimension is Low; High only if all dimensions are High. One bad dimension pulls the episode down.
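As a minimal sketch of that aggregation rule, assuming per-dimension scores on a Low/Medium/High scale (the dimension names and the Medium tier are placeholders):

```python
def overall_quality(dimension_scores: dict[str, str]) -> str:
    """Overall = Low if any dimension is Low; High only if all are High."""
    values = list(dimension_scores.values())
    if any(v == "Low" for v in values):
        return "Low"
    if all(v == "High" for v in values):
        return "High"
    return "Medium"  # mixed case; map to whatever middle tier your rubric uses
```

For example, `overall_quality({"safety": "High", "visibility": "Low", "recovery": "High"})` returns "Low": the one bad dimension pulls the episode down.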


2. Failure modes we kept seeing

Each item below is a short name (with the formal term) and a one-line plain-language meaning; add a screenshot or GIF per item when you publish.

  • Post-task idle / run-on footage (extra 10–15 s of video after the task is done). Dilutes the signal; policy can learn to linger (a rough trimming heuristic is sketched after this list).

  • Temporal misalignment (sync issues between cameras or sensors). Bad for multi-view or fusion; causes inconsistent state.

  • Self-collision / kinematic clash (arm hits itself or the body). Unsafe; don’t let the policy imitate it.

  • Low contrast / poor observability (white background, same-color object, or bad lighting). Object hard to see; weak visual signal.

  • Rubric incompleteness (scorers disagree or don’t know how to score). Add explicit rules and examples; flag “undefined” cases and fix the rubric before locking scores.

  • Repeated failures before success (e.g. 3–5 pick attempts before one works). Noisy trajectory; can teach hesitation.

  • Over-ideal / low-complexity conditions (too easy, no obstacles). Can bias the dataset; score complexity separately or down-weight.
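Some of these can be caught programmatically before a human ever scores the episode. Below is a rough heuristic sketch for the post-task idle case, assuming per-frame joint positions are available as a (T, D) array; the thresholds are made-up defaults to tune on a handful of episodes, not calibrated values.

```python
import numpy as np


def trim_post_task_idle(joint_positions: np.ndarray,
                        fps: float = 30.0,
                        motion_eps: float = 1e-3,
                        min_idle_s: float = 2.0) -> int:
    """Return the number of frames to keep, cutting a trailing idle tail.

    joint_positions: (T, D) array of joint angles, one row per frame.
    A trailing run of frames whose per-step motion stays below motion_eps
    and that lasts at least min_idle_s seconds is trimmed; shorter idle
    tails are left alone. Thresholds are dataset-specific guesses.
    """
    total = len(joint_positions)
    if total < 2:
        return total
    deltas = np.linalg.norm(np.diff(joint_positions, axis=0), axis=1)
    keep = total
    # Walk back from the end while the arm is essentially still.
    while keep > 1 and deltas[keep - 2] < motion_eps:
        keep -= 1
    if (total - keep) / fps >= min_idle_s:
        return keep
    return total
```

The same pattern (compute a cheap per-frame signal, walk the episode, flag or trim) extends to other items on the list, e.g. checking camera timestamp deltas for temporal misalignment.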


3. Impact

After fixing the rubric and removing Low-quality episodes (20–40% of the data), retraining gave noticeably better results. Studies on filtering teleop data often report ~10–15% (or more) policy gain. The takeaway: define metrics and overall quality in writing, then audit before scaling data collection.
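The mechanical part of that filtering is small once scores exist. A sketch, assuming the audited scores live in a JSONL manifest with one record per episode and an "overall_quality" field — that manifest format is an assumption for illustration, not a LeRobot convention:

```python
import json


def filter_low_quality(manifest_in: str, manifest_out: str) -> None:
    """Drop episodes whose audited overall quality is Low before retraining."""
    with open(manifest_in) as f:
        episodes = [json.loads(line) for line in f if line.strip()]
    kept = [ep for ep in episodes if ep.get("overall_quality") != "Low"]
    with open(manifest_out, "w") as f:
        for ep in kept:
            f.write(json.dumps(ep) + "\n")
    print(f"kept {len(kept)}/{len(episodes)} episodes")
```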


Summary

  • Rubric: Define “mistake” and “recovery” in writing; one ratio per episode. Overall quality = Low if any dimension is Low.
  • Failure modes: Post-task idle, sensor sync, arm clashes, poor visibility, rubric gaps, repeated failed attempts, over-ideal setup. Name them, add examples (screenshots/GIFs), score consistently.
  • Filtering a chunk of bad episodes is high leverage; do it before collecting more data.
