Announcing Spatial & Physical Intelligence at Protege

Why we’re launching our data vertical aimed at robotics, world models, and more — and an invitation to build with us.

Robots have worked alongside humans on factory floors for decades. What this typically looks like: a single mechanical arm, bolted to the floor, repeating the same motion thousands of times in a space designed entirely around its limitations. Every object is in the same place, every surface a fixed distance, shape, and size, everything pre-determined.

Today, leading AI companies are trying to build something quite different: the equivalent of a robot that works in a car mechanic’s shop. This environment is messy by nature. Tools scattered across the bench. A manual that doesn’t exist. Human colleagues moving through the same space.

The robot of tomorrow needs to know which tool to pick up, that the wrench is heavier than the screwdriver, and that the person next to it just stepped into its path. It needs to understand the world around it well enough to act in it safely, without anyone choreographing the scene in advance.

That is a very different problem, and it requires a very different kind of data than what’s been used before.

Billions of dollars are flowing into robotics and world models right now. The ambition is real. But the training data these systems need, the dense and diverse recordings of how humans actually move through and interact with the physical world, barely exists. The models are advancing. The data has not kept pace — at least, not the full breadth of data required for the next generation of robotics and world models.

That’s why we’ve been building something new.

Today, we’re announcing Spatial & Physical Intelligence, a new vertical at Protege focused on curating and structuring high-quality training data for robotics and world models.

Where this leads: we are building the foundational multi-modal training data assets for bringing AI into the physical world.

Having worked closely with builders and data creators in the space, we now clearly see that these two fields depend on the same underlying data. Despite serving different builders with different model objectives, both need to understand how humans, objects, and environments interact in physical space.

Two fields, one data gap

A world model is a predictive model of how the environment evolves. It learns the dynamics of the world (e.g., “if I push this cup, it will slide”) and can simulate future states. World models are studied broadly in AI and reinforcement learning, not just robotics, and can be trained purely from 3D, video or sensor data without ever controlling a physical robot (Ha & Schmidhuber, 2018; LeCun, 2022).

A robotics model, by contrast, must translate perception into action commands for real hardware under physical constraints. It may use a world model internally for “planning”, but it also needs control policies, safety constraints, embodiment awareness, and real-time execution.
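
To make the distinction concrete, here is a toy sketch in Python. It is an illustration under loud assumptions, not how any production system works: the names `CupState`, `world_model_step`, and `choose_push` are hypothetical, the world is one-dimensional, and the hand-written physics stands in for what a real world model would learn from data. The world model only predicts; the robotics side uses those predictions to pick an action.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CupState:
    position: float  # metres along a table edge (1-D toy world)
    velocity: float  # metres per second


def world_model_step(state, push_force, dt=0.1, mass=0.3, friction=2.0):
    """Predict the next state of the cup.

    Hand-written toy dynamics standing in for a learned model:
    "if I push this cup, it will slide".
    """
    acceleration = push_force / mass - friction * state.velocity
    velocity = state.velocity + acceleration * dt
    position = state.position + velocity * dt
    return CupState(position, velocity)


def rollout(state, forces):
    """Simulate a sequence of pushes using only the world model."""
    for force in forces:
        state = world_model_step(state, force)
    return state


def choose_push(state, target, candidates=(0.0, 0.5, 1.0, 2.0), horizon=5):
    """Robotics-side planning: try each candidate push in simulation
    (one push, then let the cup coast) and keep the one that ends
    closest to the target position."""
    def error(force):
        final = rollout(state, [force] + [0.0] * (horizon - 1))
        return abs(final.position - target)

    return min(candidates, key=error)


start = CupState(position=0.0, velocity=0.0)
best_push = choose_push(start, target=1.0)
```

A real robotics model would also need the pieces the toy leaves out: control policies, safety constraints, embodiment awareness, and real-time execution.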

Technically, these are separate problems. But here’s what we keep seeing: the data that teaches one teaches the other. We have seen world model companies benefit from robotics data, and vice versa.

When you think about what cutting-edge builders want to build, this makes sense.

How does a robot know what tool to pick up? How does it know the hammer is heavier than the apple? A world model is learning to simulate exactly these dynamics. A robotics model is learning to act on them. The raw material is the same: recordings of humans interacting with objects and environments in real physical space.

Autonomous vehicle companies figured this out years ago. Simulation and on-road deployment draw from overlapping sensor data. The rest of physical AI is arriving at the same conclusion. We’ve seen it firsthand: both a robotics research team and a world model research team end up needing the same data. That convergence will only accelerate.

Four data types, one thesis

We are anchoring our initial data sources to four key categories:

  • Ego & Exo-centric video. First-person cameras worn by real workers in real settings like factories, logistics centers, workshops, and kitchens. Our focus is on purpose-built captures with depth data, multi-sensor synchronization, and task diversity across geographies. Vehicles and robots also capture egocentric footage simultaneously with LiDAR, spatial positioning data, and photography.
  • Motion capture. Many existing motion capture datasets cover a fair share of locomotion, human-object interaction, and everyday motions like sitting, standing, or climbing stairs. We are building on that foundation with activities like working with tools of different weights, handling fragile objects, and moving through spaces that aren’t stages. This closes the gap between the activities mocap has historically captured and the real-world activities researchers need.
  • Video game data. Gaming studios have built 3D environments with realistic physics and player interaction data over the course of decades. None of it was built for AI training, but it carries genuine value for understanding how agents interact with complex physical spaces.
  • 3D assets. 3D scans of objects and scenes from scanning platforms and high-quality scanning rigs. Metadata like friction coefficient, mass, or reflectivity adds depth to a model’s understanding of environment and object interaction.
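
As an illustration of the 3D assets category, here is a minimal sketch of what a single asset record with physical metadata could look like. Everything in it is an assumption made for the example: the `Asset3D` class, its field names and units, the file path, and the numeric values are hypothetical, and none of it describes an actual Protege schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Asset3D:
    asset_id: str
    mesh_path: str               # location of the scanned geometry (hypothetical path)
    mass_kg: float               # physical mass of the real object
    friction_coefficient: float  # dimensionless surface friction
    reflectivity: float          # assumed convention: 0.0 (matte) to 1.0 (mirror-like)

    def validate(self):
        """Reject physically implausible metadata before it reaches training."""
        if self.mass_kg <= 0:
            raise ValueError("mass_kg must be positive")
        if self.friction_coefficient < 0:
            raise ValueError("friction_coefficient must be non-negative")
        if not 0.0 <= self.reflectivity <= 1.0:
            raise ValueError("reflectivity must be between 0 and 1")


# Illustrative values only, not measurements.
wrench = Asset3D(
    asset_id="wrench-001",
    mesh_path="scans/wrench-001.glb",
    mass_kg=0.45,
    friction_coefficient=0.6,
    reflectivity=0.7,
)
wrench.validate()
record = json.dumps(asdict(wrench))  # serialised form for a dataset manifest
```

Keeping properties like mass next to the geometry is what lets a model connect how an object looks with how it behaves, such as the wrench being heavier than the screwdriver.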

Where this data can be used

Spatially grounded data helps world models learn physics, human behavior, and environmental dynamics. These multimodal datasets are also critical for training robotics-focused vision-language models (VLMs), enabling them to better understand actions, predict outcomes, and operate effectively in real-world 3D environments. We aim to be the data layer for training physical intelligence and to fulfill all needs across all stages of the training process.

Collaborate with us!

Robotics and world models are in their early “GPT-2 era.” The demos are promising and the funding is piling up. But there is no established data supply chain for physical AI, no dominant provider, and no consensus on what “good” versus “great” spatial and physical data looks like.

Here are examples of questions we’re thinking about:

  • How do you curate millions of 3D scans without losing the diversity that makes the archive valuable?
  • How do you combine egocentric video with motion capture to create richer multimodal training sets (and does it actually help)?
  • What does a “gold-standard” spatial and physical dataset look like?
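
On the second question, one likely first step in combining two streams is aligning them in time. The sketch below pairs each video frame with the nearest motion capture sample by timestamp. It is an assumption-laden illustration rather than our pipeline: the function name `align_nearest`, the 20 ms tolerance, and the premise that both streams already share a clock are all hypothetical.

```python
from bisect import bisect_left


def align_nearest(video_ts, mocap_ts, tolerance=0.02):
    """For each video timestamp (seconds), return the index of the nearest
    mocap sample, or None if no sample lies within `tolerance` seconds.

    Assumes `mocap_ts` is sorted ascending and both streams share a clock.
    """
    matches = []
    for t in video_ts:
        i = bisect_left(mocap_ts, t)
        # The nearest sample is either just before or just after position i.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(mocap_ts)]
        best = min(candidates, key=lambda j: abs(mocap_ts[j] - t), default=None)
        if best is not None and abs(mocap_ts[best] - t) <= tolerance:
            matches.append(best)
        else:
            matches.append(None)
    return matches


# Video at roughly 30 fps, mocap at 100 Hz; the last frame has no mocap coverage.
video_ts = [0.0, 0.033, 0.5]
mocap_ts = [0.0, 0.01, 0.02, 0.03, 0.04]
pairs = align_nearest(video_ts, mocap_ts)
```

Frames that come back as None are the interesting cases: whether to drop them, interpolate, or keep them as video-only samples is part of what the "does it actually help" question is really asking.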

If you’re building robotics or world models and are actively exploring training data across any of these categories, we’d love to talk. If you hold data in any of these areas and want to explore licensing it for AI training, we want to hear from you too.

AI models have improved rapidly at understanding text and speech. Understanding the physical world is the next frontier. We’re here to build the datasets to drive the industry forward.

If you’re a model builder and want to learn more, visit https://withprotege.ai/model-builders/spatial.

If you’re a data provider looking for the right partner for your data, visit https://withprotege.ai/data-provider/spatial.