Problems to solve to bring embodied AI

Prediction

With Unitree and Engine AI aggressively reducing the price of robots with each iteration, hardware is going to become standard and modular faster than most people realize.
Software and the AI brain (general or embodiment-specific) will be the only moat in robotics.

[Gif 1] 1

People are betting on scaling data to improve the accuracy of current systems, and that is not a wrong bet. However, the amount of data required can be reduced significantly by figuring out better system and architectural designs.

A parallel case can be seen in large language models, where continued data scaling has shown diminishing returns. The real progress is now coming from architectural innovations such as Retrieval-Augmented Generation (RAG), which introduces external memory and contextual reasoning instead of relying solely on larger pretraining datasets. Robotics is likely to follow a similar trajectory: shifting from collecting ever-larger demonstration datasets to building systems that can recall prior experiences, reason within context, and adapt dynamically to new environments.
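
To make the analogy concrete, here is a minimal sketch of retrieval over prior experience: past episodes are embedded, and the most relevant ones are pulled back into the model's context at inference time. This is purely illustrative; `ExperienceMemory`, `embed_fn`, and the rest are hypothetical names, not part of any system described in this post.

```python
# Minimal, hypothetical sketch of experience retrieval: embed past episode
# summaries, then recall the most similar ones into the model's context
# instead of relying on ever-larger training sets.
import numpy as np

class ExperienceMemory:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn          # any text/observation -> vector encoder
        self.keys, self.entries = [], []  # parallel lists of embeddings and episodes

    def add(self, summary: str):
        self.keys.append(self.embed_fn(summary))
        self.entries.append(summary)

    def recall(self, query: str, k: int = 3):
        if not self.keys:
            return []
        q = self.embed_fn(query)
        keys = np.stack(self.keys)
        # cosine similarity between the query and every stored episode
        sims = keys @ q / (np.linalg.norm(keys, axis=1) * np.linalg.norm(q) + 1e-8)
        top = np.argsort(-sims)[:k]
        return [self.entries[i] for i in top]

# Usage idea: the recalled episodes are prepended to the policy/VLM prompt,
# e.g. memory.recall("open the second drawer in the kitchen")
```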

Interesting Bottlenecks

Below are a few core bottlenecks that can't be solved by simply collecting more data:

  • Non-Markovian tasks 2 (a toy sketch follows this list)
  • Long-horizon tasks 3
  • Data flywheel / in-context learning 4
  • Loss of base-model knowledge (VLM) during finetuning for the action head 5
  • The VLA needs to be optimized for edge devices; less compute means longer battery life.
  • Data collection hardware needs to be cheaper; people can't keep shipping or producing high-fidelity copies of their embodiment just for data collection.
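
To illustrate the first bullet, here is a toy sketch of why non-Markovian tasks break purely reactive policies: a policy that sees only the current observation acts the same on every visit to a state, while a history-conditioned one can behave differently on the second visit. The task, observations, and class names below are made up purely for illustration.

```python
# Toy illustration of why non-Markovian tasks need memory: a reactive policy
# sees only the current observation, so repeated visits to the same state get
# the same action; a history-conditioned policy can act on what happened before.
from collections import deque

def reactive_policy(obs):
    # depends only on the current state -> cannot tell "first visit" from "second visit"
    return "search_drawer" if obs == "at_drawer" else "move_to_drawer"

class HistoryPolicy:
    def __init__(self, horizon=50):
        self.history = deque(maxlen=horizon)   # bounded memory of past observations

    def act(self, obs):
        self.history.append(obs)
        # if we already searched this drawer earlier in the episode, move on
        if obs == "at_drawer" and self.history.count("at_drawer") > 1:
            return "search_next_drawer"
        return reactive_policy(obs)
```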

Opportunity

The problems above are largely company and use-case agnostic. Every robotics team working with AI foundation models would want these improvements, whether as better base models or modular plugins. This is especially true for humanoid and general robotics startups that have raised solid seed or Series A rounds (>$5M). That kind of capital isn’t enough to train a VLA from scratch on a brand-new architecture, but it’s perfect for fine-tuning existing models and making them work for focused, initial deployments. Robot integrators are also prime customers here. Ironically, most of the money still gets funneled into hardware R&D, which, honestly, I don’t agree with.

[Gif 3] 6

Our Approach

We are solving this by adding memory and a context system, which we are calling the Spatial-Temporal Context Layer. Hopefully some of these experiments will succeed and you'll hear more about this in the coming weeks :)
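
For flavour, here is a rough sketch of what a spatial-temporal context buffer could look like: entries tagged with time and position, queried by a weighted mix of spatial proximity, semantic similarity, and recency. This is not our actual implementation; every name, field, and weight below is a placeholder.

```python
# Purely illustrative sketch of a spatial-temporal context buffer; all names,
# fields, and weights are placeholders, not the layer described in this post.
import time
import numpy as np
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    t: float              # wall-clock time when the observation was recorded
    position: np.ndarray  # robot/base position at that time (x, y, z)
    embedding: np.ndarray # visual or language embedding (assumed L2-normalized)
    note: str             # short summary, e.g. "keys on kitchen counter"

@dataclass
class SpatialTemporalContext:
    entries: list = field(default_factory=list)

    def add(self, position, embedding, note):
        self.entries.append(MemoryEntry(time.time(), np.asarray(position),
                                        np.asarray(embedding), note))

    def query(self, position, embedding, k=5, w_space=1.0, w_sem=1.0, w_time=0.1):
        now = time.time()
        def score(e):
            spatial = -w_space * np.linalg.norm(e.position - position)  # closer is better
            sem = w_sem * float(np.dot(e.embedding, embedding))         # more similar is better
            recency = -w_time * (now - e.t)                             # newer is better
            return spatial + sem + recency
        return sorted(self.entries, key=score, reverse=True)[:k]
```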

Behavior-1K

Behavior-1K is a benchmark competition organised by Stanford University, led by Dr. Fei-Fei Li. We are using it as our test bench for checking the performance of our VLA experiments.
Dataset: https://huggingface.co/datasets/behavior-1k/2025-challenge-demos

Data collection:

Behavior-1K already provides training data, but it's all success cases. We also need failure and partial-failure data to tackle edge cases, increase the accuracy of the model, and add recovery mechanisms to the VLA's behavior.
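
As a sketch of what that means in practice, the snippet below shows one hypothetical way to label teleoperated episodes so that failures and partial failures (and the recovery segments inside them) can be kept for finetuning. The `Outcome` and `Episode` structures are placeholders, not our actual data format.

```python
# Hypothetical episode-labeling scheme: keep failures and partial failures
# alongside successes so a fine-tuned VLA can learn recovery behaviour instead
# of only seeing clean demonstrations.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Outcome(Enum):
    SUCCESS = "success"
    PARTIAL_FAILURE = "partial_failure"   # e.g. grasp slipped but the task was recovered
    FAILURE = "failure"                   # task not completed

@dataclass
class Episode:
    task: str
    frames: list                          # (observation, action) pairs from teleoperation
    outcome: Outcome
    recovery_start: Optional[int] = None  # frame index where recovery behaviour begins

def recovery_segments(episodes):
    """Yield the sub-trajectories where the operator recovered from an error."""
    for ep in episodes:
        if ep.outcome == Outcome.PARTIAL_FAILURE and ep.recovery_start is not None:
            yield ep.task, ep.frames[ep.recovery_start:]
```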

There are four major ways we could have collected data for VLA finetuning when going the teleoperation route.

  1. Teleoperation using VR
  2. Teleoperation using keyboard/joystick
  3. Teleoperation using a leader-follower setup
  4. Exoskeleton-based teleoperation

We used the VR method because we had a friend's VR headset and it's quite fun :)

Results:

In the current iteration we have not implemented everything we planned; we are still in the process. But we have added system 2 thinking and trained on a recovery dataset. This has given us a total accuracy of 9.5% across all 50 tasks, evaluated continuously in random order. I think this is a good point to iterate from. Also, we are at the top of the leaderboard with this lol.
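
For context, "system 2 thinking" here follows the common pattern of a slow reasoning step re-planning on top of a fast action head. The sketch below shows that generic split, not our exact wiring; every name in it is a placeholder.

```python
# Generic "system 2 over system 1" control split (illustrative only): a slow
# reasoning step re-plans a sub-goal at a low rate, while a fast action head
# maps (observation, sub-goal) to a low-level action on every tick.
import time

def system2_plan(observation, task):
    # slow, deliberate step: e.g. a VLM decomposing the task into the next sub-goal
    return f"next sub-goal for '{task}'"

def system1_act(observation, sub_goal):
    # fast action head: low-level action conditioned on the current sub-goal
    return {"sub_goal": sub_goal, "action": "joint_deltas"}

def control_loop(task, get_obs, steps=200, replan_every=1.0):
    sub_goal, last_plan = None, 0.0
    for _ in range(steps):
        obs = get_obs()
        if sub_goal is None or time.time() - last_plan > replan_every:
            sub_goal, last_plan = system2_plan(obs, task), time.time()
        yield system1_act(obs, sub_goal)
```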

Video: https://youtu.be/-HpCKoEfRAs
Will share a more detailed GitHub repo and videos soon.

The robot in the video is fully controlled by the VLA (both navigation and manipulation); the robot model is Galaxea's R1 Pro.

Footnotes

  1. Gif 1 Credit: https://x.com/jloganolson/status/1981102506228011361

  2. Non-Markovian tasks: tasks where the next action depends on more than just the current state; the system must recall previous steps.
    Example: When a robot is assembling furniture, it must remember where it placed each screw earlier to correctly attach the next panel.

  3. Long-horizon tasks: tasks made up of many interdependent actions over time. Small errors can accumulate and derail the final goal.
    Example: A robot asked to fetch a pair of keys from where you left them yesterday in your home.

  4. Data flywheel: a feedback loop where improved models generate better data, and that data, in turn, makes models even better. When real-world data is scarce, automation keeps the cycle running.
    Example: An autonomous driving system that uses simulated edge cases to generate new data when real accident footage is limited.

  5. Action-head finetuning: updating a model’s control layer to specialize for a new task can accidentally overwrite earlier knowledge (catastrophic forgetting).
    Example: A vision-language model trained on general tasks may forget how to describe objects accurately if finetuned too aggressively for robotic grasping.

  6. Gif 3 Credit: https://x.com/stash_pomichter/status/1984755495606108241