The Physical Turn: Humanoid Robots, Dexterity, and the 2025–26 Inflection Point

Humanoid robots have begun to move with genuine dexterity — not because the hardware suddenly improved, but because the learning algorithms finally caught up to what sixty years of robotics promised. Here's what actually changed.

47 distinct hand joints in Figure 02's dexterous manipulator
1,000+ hours of robot-learning data generated per week at Figure
700 ms full-body reaction latency for Optimus Gen 2 object manipulation
2030: projected year humanoid robots enter mass manufacturing facilities

1. The Dexterity Problem

For sixty years, robotics solved the wrong problem. Beginning in the 1960s with the Stanford Arm and continuing through the era of industrial automation, roboticists built machines of extraordinary precision and speed — but only for tasks that had been entirely pre-programmed. A robot on an automotive assembly line could weld a chassis seam with submillimetre accuracy, repeating the same motion ten thousand times. A factory picker could identify semiconductor components and place them on a board with inhuman consistency. These feats were genuine engineering triumphs. But they were also profoundly brittle.

Place an unexpected object in front of that welder, and the arm would crash into it. Ask the picker to retrieve something it had never seen before, and it would fail. The machines had no understanding of what they were manipulating: no sense of texture, weight, friction, or intent. They were executing carefully choreographed sequences, nothing more. This limitation was never a mystery to roboticists. It was simply accepted as the boundary of what was possible.

The deeper problem became apparent in the early 2010s with the DARPA Robotics Challenge, which ran from 2012 to 2015. Researchers were asked to design humanoid robots that could handle unstructured disaster scenarios: navigating debris, turning valves, opening doors, and climbing ladders, all in environments that had not been pre-mapped or engineered for the robot. The machines that emerged were impressively articulated and mobile, yet fundamentally helpless when asked to manipulate an object for the first time. Pick up a bottle cap? Fold a towel? Pour a glass of water? The robots could not do any of it without spilling, dropping, or crushing the object. The hardware existed. The problem was intelligence: the ability to sense, reason, and adapt in real time.

For decades, the robotics community laboured under what might be called the "pick-and-place delusion." If you wanted a warehouse robot to move objects, you needed only four steps: (1) engineer a gripper strong enough to grasp the object, (2) program the robot to detect the object in 3D space, (3) instruct it to move the gripper to the object's centroid, and (4) close the gripper. This worked perfectly for boxes of identical size and shape. It failed catastrophically when the robot had to pick up anything novel, anything fragile, or anything that required even minimal understanding of how to approach it without crushing, dropping, or destabilising it.
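The four-step recipe is short enough to write down, which was part of its appeal. The sketch below is purely illustrative: `DetectedObject`, `centroid`, and `naive_pick` are hypothetical names, the force and size numbers are invented, and the "gripper" is a pair of threshold checks rather than a real device. It shows the recipe succeeding on a sturdy box and crushing a fragile object, because nothing is sensed after detection.

```python
from dataclasses import dataclass


@dataclass
class DetectedObject:
    """Output of a (hypothetical) detector: step 2 of the recipe."""
    points: list          # 3D points (x, y, z) in metres
    width_m: float        # extent along the gripper's closing axis
    crush_force_n: float  # force above which the object is damaged


def centroid(points):
    """Step 3 target: the mean of the detected 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def naive_pick(obj, max_opening_m=0.10, grip_force_n=40.0):
    """Steps 3 and 4 with one fixed grip force and no feedback."""
    if obj.width_m > max_opening_m:
        return "fail: too wide for gripper"
    target = centroid(obj.points)       # move the gripper here
    if grip_force_n > obj.crush_force_n:  # close with the fixed force
        return f"fail: crushed at {target}"
    return f"ok: grasped at {target}"


# Invented example objects: a sturdy box and something fragile.
box = DetectedObject([(0.0, 0.0, 0.0), (0.08, 0.0, 0.0)], 0.08, 200.0)
egg = DetectedObject([(0.0, 0.0, 0.0), (0.04, 0.0, 0.0)], 0.04, 25.0)
print(naive_pick(box))
print(naive_pick(egg))
```

The fixed `grip_force_n` is the whole problem in miniature: any value strong enough for the box is too strong for the fragile object, and the code has no way to find out until it is too late.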

True dexterous manipulation requires four things the pre-programmed approach simply lacks. First, it requires genuine tactile sensing — fingertips that can feel pressure, texture, slip, and temperature in real time. Second, it requires force feedback — the ability to modulate grip strength and hand position based on what the fingers are reporting. Third, it requires something closer to a motor imagination: the robot must be able to predict how an object will respond to different movements before it makes them. And fourth, it must be able to re-plan constantly — not in minutes, but in milliseconds — as the real world deviates from expectation. A human toddler can acquire all four of these capabilities by watching and touching things for a few months. Roboticists had spent sixty years trying to code them explicitly. In 2025 and 2026, they finally found a different way.
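The second and fourth requirements, force feedback and constant re-planning, can be illustrated with a toy control loop. This is a minimal sketch under strong assumptions: slip is simulated with a single-contact friction rule (`friction * force < weight`), and `adjust_grip`, `hold`, and all numeric values are hypothetical choices made for illustration, not any vendor's controller.

```python
def adjust_grip(force_n, slip_detected, min_n=1.0, max_n=15.0, step_n=0.5):
    """One control tick: tighten on slip, otherwise relax slowly."""
    if slip_detected:
        force_n += step_n
    else:
        force_n -= step_n / 5  # ease off so the grip stays gentle
    return max(min_n, min(max_n, force_n))


def hold(object_weight_n, friction=0.5, ticks=200):
    """Run the loop against a simulated slip sensor and return the final force."""
    force = 1.0
    for _ in range(ticks):
        # Toy slip model: the object slides when friction cannot carry its weight.
        slipping = friction * force < object_weight_n
        force = adjust_grip(force, slipping)
    return force


print(hold(2.0))  # settles near the minimum force that stops a 2 N object slipping
print(hold(4.0))  # a heavier object ends up with a firmer grip
```

The loop never needs to know the object's weight or friction in advance. It only reacts to what the simulated fingertip reports on each tick, which is the behaviour the pre-programmed approach could not express.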

A human toddler can generalise manipulation to any new object within milliseconds of seeing it. Replicating that in silicon has taken sixty years — and required a revolution in how machines learn, not how they move.

2. What Actually Changed: The Algorithmic Breakthrough

The robotics world has a running joke: for every five years of progress, the hardware stays roughly the same. In 2015, a well-equipped robotics lab could build a robot arm with 7 degrees of freedom in the shoulder, elbow, and wrist. In 2025, the mechanical specifications were only marginally better. The motors were slightly faster. The actuators were slightly smoother. The sensors were slightly more precise. But nothing had fundamentally changed the raw capabilities of the machine itself. What changed was what you could do with a machine of that specification, and that change was algorithmic.

The inflection point arrived with Diffusion Policy, a paper published in 2023 by researchers at Columbia University, Toyota Research Institute, and MIT, led by Cheng Chi. At its core, Diffusion Policy addresses a problem that had bedevilled robot learning for years: how do you train a robot to imitate a human performing a complex, multi-step manipulation task? Traditional approaches used behavioural cloning in its simplest form: you show the robot many examples of a human performing the task, and the robot learns to predict a single "best" action at each step. But this assumes there is only one best action, and when demonstrations disagree the prediction tends to average them into an action nobody actually demonstrated. In reality, humans perform manipulation with tremendous flexibility. You can pick up a coffee mug by the handle, or grip the body, or cradle it in your palm. You can pour from the right side or the left. The task succeeds across a range of behaviours.

Diffusion Policy borrowed an insight from the generative AI revolution. Just as diffusion models in image generation learn to model the full distribution of possible images, not just one "best" image, Diffusion Policy trains robots to model the full distribution of possible actions. The robot learns not what to do, but what kinds of things it's reasonable to do in a given situation. This distribution-modelling approach allows graceful recovery. If an object slips during manipulation, the robot's learned distribution reflects that unexpected outcomes sometimes happen, and the next action is sampled from a distribution that has learned to recover from slips. If a surface is slightly rougher than expected, the distribution has learned how hands react to friction variations. The robot doesn't crash and try again from the beginning. It adapts.
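The difference between predicting one action and modelling a distribution can be seen in a one-dimensional toy. Suppose half the demonstrations pass an obstacle on the left (action -1) and half on the right (action +1). The sketch below is not the paper's implementation: it replaces the trained denoising network with an analytic score computed from a kernel density estimate over the demonstrations, and samples with annealed Langevin steps, a close relative of diffusion sampling. All names and step sizes are assumptions chosen for illustration.

```python
import math
import random


def mean_action(demos):
    """What single-action regression converges to: the average demonstration."""
    return sum(demos) / len(demos)


def kde_score(x, demos, sigma):
    """Gradient of the log-density of a Gaussian kernel estimate over the demos."""
    logits = [-((x - d) ** 2) / (2 * sigma ** 2) for d in demos]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    pull = sum(w * (d - x) for w, d in zip(weights, demos))
    return pull / (sum(weights) * sigma ** 2)


def sample_action(demos, rng, sigmas=(1.0, 0.5, 0.2, 0.1), steps=50):
    """Start from noise and follow the score while the noise level shrinks."""
    x = rng.gauss(0.0, 1.0)
    for sigma in sigmas:
        step = 0.1 * sigma ** 2
        for _ in range(steps):
            noise = rng.gauss(0.0, 1.0)
            x += step * kde_score(x, demos, sigma) + math.sqrt(2 * step) * noise
    return x


demos = [-1.0] * 50 + [1.0] * 50   # go left or go right around an obstacle
rng = random.Random(0)
samples = [sample_action(demos, rng) for _ in range(20)]

print("averaged action:", mean_action(demos))
print("sampled actions:", [round(a, 2) for a in samples])
```

The averaged action sits at 0, straight into the obstacle, even though no demonstration ever did that. The sampled actions land near one demonstrated mode or the other, which is the property the paragraph above describes.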

By late 2023, every leading robotics company had incorporated some version of Diffusion Policy into their manipulation stacks. Figure AI used it to train their dexterous hand. Tesla integrated it into Optimus. The algorithm became the de facto standard because it worked. More importantly, it worked with small amounts of data. A robot didn't need a million examples of a task. It could learn from hundreds of demonstrations and generalise far beyond what it had been shown. This wasn't magic. It was a more intelligent way to let machines learn from human example.

But Diffusion Policy was only part of the revolution. Running in parallel was the adaptation of Transformer architectures — the same neural network design that powers GPT — to visuomotor control. Where earlier robot controllers had been relatively shallow networks with modest memory, Transformer-based policies could attend to entire motion sequences, learning long-horizon reasoning about multi-step tasks. A Transformer policy could understand that you're picking up a bottle to pour it, not to set it down. This kind of semantic understanding, baked into the learned policy itself, was new.
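The mechanism that lets a Transformer policy "attend to entire motion sequences" is scaled dot-product attention. The sketch below shows that operation for a single query over a short history of timesteps, in plain Python. The two-dimensional vectors are made up for illustration; a real policy would use learned, much higher-dimensional embeddings of images and joint states.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Three past timesteps; only the first resembles what the query is looking for.
keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]
values = [[1.0], [0.0], [0.0]]
query = [1.0, 0.0]

output, weights = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(o, 3) for o in output])
```

Because every timestep in the history is scored against the query, an early observation (the bottle being picked up) can directly shape a much later action (the pour), without having to survive a long chain of intermediate states.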

A third breakthrough came from integrating locomotion and manipulation seamlessly. Boston Dynamics, CMU's humanoid group, and others published research on "whole-body loco-manipulation" — the problem of controlling a humanoid robot that is simultaneously walking, balancing, and moving its arms with precise dexterity. For most of the history of robotics, locomotion and manipulation were handled as separate problems by separate controllers. But a human doesn't walk and manipulate independently. You lean, shift your weight, turn your torso to position your arms. These actions are coupled. When robots learned to couple them, suddenly they could perform tasks in the real world that required both — picking something off a high shelf, opening a door while walking through it, assembling something while standing at a workbench. The coupling wasn't a minor detail. It changed what was fundamentally possible.
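Why arms and balance cannot be controlled independently shows up even in a static, one-dimensional model. The sketch below treats the robot as two point masses, a body and an outstretched arm holding a payload, and asks how far the body must shift back to keep the combined centre of mass over the feet. The masses, reach, and foot position are invented numbers, and `center_of_mass_x` and `body_shift_needed` are hypothetical helpers, not part of any published controller.

```python
def center_of_mass_x(parts):
    """Horizontal centre of mass of (mass_kg, x_m) point masses."""
    total_mass = sum(m for m, _ in parts)
    return sum(m * x for m, x in parts) / total_mass


def body_shift_needed(body, arm, foot_front_x):
    """Backward body shift (m) that brings the centre of mass to the foot edge."""
    overshoot = center_of_mass_x([body, arm]) - foot_front_x
    if overshoot <= 0:
        return 0.0
    body_mass, arm_mass = body[0], arm[0]
    # Moving only the body by d moves the combined centre of mass
    # by d * body_mass / (body_mass + arm_mass).
    return overshoot * (body_mass + arm_mass) / body_mass


body = (60.0, 0.0)     # torso and legs, centred over the ankles
arm = (10.0, 0.7)      # arm plus payload, reaching 0.7 m forward
foot_front_x = 0.08    # front edge of the support area

shift = body_shift_needed(body, arm, foot_front_x)
com_before = center_of_mass_x([body, arm])
com_after = center_of_mass_x([(body[0], body[1] - shift), arm])
print(round(com_before, 3), round(shift, 4), round(com_after, 3))
```

A manipulation controller that ignored this would extend the arm and tip the robot forward. A coupled controller leans the body as the arm extends, which is the behaviour the paragraph above describes.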

Then there was the question of simulation. For decades, roboticists had trained policies in simulation and tried to transfer them to real robots. The gap was always enormous. A policy trained in perfect physics simulation would fail on a real robot because real friction, real motors, and real sensor noise don't match the simulation. But by 2024, simulators had become accurate enough, and transfer learning techniques had become sophisticated enough, that you could train a policy in simulation, add a small amount of real-world refinement, and have it work. This cut training time from years to months. Some leading companies began training in massive, high-fidelity simulations and deploying on real hardware with only minimal additional tuning.
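The article does not say which transfer techniques were used. One widely used option is domain randomisation: vary the simulator's physics during training so the policy cannot overfit to one set of parameters. The toy below tunes a single grip force, first against one nominal friction value and then against many randomised ones, and tests both on a "real" friction the simulator never matched exactly. Every number and function name here is an assumption for illustration.

```python
import random


def holds(force_n, friction, weight_n=2.0, crush_n=8.0):
    """Toy success test: enough friction to carry the weight, without crushing."""
    return friction * force_n >= weight_n and force_n <= crush_n


def success_rate(force_n, frictions):
    """Fraction of simulated friction values under which the grasp holds."""
    return sum(holds(force_n, mu) for mu in frictions) / len(frictions)


def tune(frictions, candidates):
    """Pick the gentlest force with the best success rate over the given sims."""
    return max(candidates, key=lambda f: (success_rate(f, frictions), -f))


CANDIDATES = [3.5 + 0.5 * i for i in range(9)]   # 3.5 N to 7.5 N

rng = random.Random(0)
nominal = [0.6]                                   # one "perfect" simulator
randomized = [rng.uniform(0.4, 0.8) for _ in range(200)]

f_nominal = tune(nominal, CANDIDATES)
f_random = tune(randomized, CANDIDATES)
real_friction = 0.45                              # reality differs from nominal

print(f_nominal, holds(f_nominal, real_friction))
print(f_random, holds(f_random, real_friction))
```

The force tuned on the single nominal simulator is just barely sufficient there and drops the object in reality. The force tuned across randomised simulators is less efficient but survives the mismatch, leaving only a small amount of real-world refinement to do.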

Finally, and perhaps most audaciously, 1X Technologies pursued a different path entirely. Rather than collecting data from their own robots, they built a world model trained on internet video — YouTube videos of humans performing manual tasks. Their robot, called NEO, learned what human hands do from millions of hours of unstructured video, without ever being teleoperated to generate training data. This is the ultimate expression of a new principle in robotics: you don't need to hand-engineer the robot's experience. You can learn from the open world.

Diffusion Policy — The Key Insight

Traditional robot controllers output a single best action. Diffusion Policy instead models the full distribution of possible actions — like a generative AI for motion. The robot learns not just "what to do" but "what kinds of things to do," allowing graceful adaptation when objects slip, surfaces shift, or tasks are partially novel. First demonstrated at scale in 2023; now the backbone of every leading humanoid's manipulation stack.

3. The 2025–26 Inflection: Figure, Optimus, 1X

Figure AI entered 2024 with a reputation for ambitious claims and limited public demonstrations. That changed during 2024, when the company released video of Figure 02 working at BMW's manufacturing plant in Spartanburg, South Carolina. The robot was not being teleoperated. It was not executing a pre-programmed sequence. It was autonomously identifying automotive components (interior trim pieces, brackets, fasteners) and moving them from a conveyor to an assembly station. The task itself was not revolutionary. But the autonomy was. The robot's vision system, trained on Figure's own teleoperated demonstrations, identified objects it had never seen before in configurations it hadn't been explicitly shown. It adapted when a component was rotated or partially occluded. It made grip-strength decisions based on surface texture. For thirty minutes, it worked without human intervention.

What made this possible was a partnership with OpenAI. Figure had given OpenAI access to their robot's camera feeds and teleoperation logs, and OpenAI had trained a vision-language model — a system that understands both images and natural language — to plan the robot's actions. This model could interpret a command like "move this bracket to the assembly line" and break it into low-level motor control. The robot's arm controller handled the manipulation using the now-standard Diffusion Policy algorithm. But the high-level reasoning came from the same architecture that powers ChatGPT. This wasn't anthropomorphism. It was engineering pragmatism. Why reinvent the wheel when large language models had already learned to reason about physical tasks?
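The division of labour described here, a high-level planner that turns language into subgoals and a low-level controller that executes each one, can be sketched as two functions. Everything below is a stand-in: `plan_subgoals` is a hard-coded template where the real system would call a vision-language model, and `low_level_controller` returns a log line where the real system would run a learned manipulation policy. Neither reflects Figure's or OpenAI's actual interfaces.

```python
# Hypothetical plan template standing in for a vision-language model's output.
MOVE_TEMPLATE = [
    "locate {obj}",
    "grasp {obj}",
    "carry {obj} to {dest}",
    "release {obj}",
]


def plan_subgoals(command):
    """High level: turn 'move <object> to <destination>' into ordered subgoals."""
    words = command.split()
    if not words or words[0] != "move" or "to" not in words:
        raise ValueError(f"unsupported command: {command!r}")
    split = words.index("to")
    obj = " ".join(words[1:split])
    dest = " ".join(words[split + 1:])
    return [step.format(obj=obj, dest=dest) for step in MOVE_TEMPLATE]


def low_level_controller(subgoal):
    """Low level: stand-in for the learned policy that would drive the motors."""
    return f"executed: {subgoal}"


def run(command):
    """Plan once, then hand each subgoal to the controller in order."""
    return [low_level_controller(goal) for goal in plan_subgoals(command)]


for line in run("move this bracket to the assembly line"):
    print(line)
```

The appeal of the split is that each half can improve independently: a better planner needs no retraining of the motor policy, and a better motor policy needs no change to how commands are interpreted.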

Tesla's Optimus Gen 2 embodied a different but equally significant achievement. In demonstrations throughout 2024, the robot — which stood 173 centimetres tall and weighed 57 kilograms — showed eye-watering hand speed and precision. Most striking was a video of Optimus threading a needle. Threading requires seeing a tiny hole, navigating a filament through three-dimensional space with sub-millimetre accuracy, all while maintaining tension. It's a task that defeats most humans without magnification. Optimus did it smoothly, on the first try. The robot achieved a 30x improvement in hand speed over its Gen 1 predecessor, and that speed didn't come at the cost of dexterity. Every finger had tactile sensors. The control loop was running at 700 milliseconds end-to-end — sensor input to motor output — fast enough to catch a falling object or adjust grip mid-manipulation.

Tesla's advantage was data. The company had been collecting footage from eight million vehicles on the road, using camera systems that had been recording their surroundings for years. While Tesla couldn't directly extract robot training data from a vehicle camera, they had access to millions of examples of how hands and objects interact in the real world. They had videos of humans picking things up, putting things down, grasping, releasing, adjusting. This ambient data, fed into their learning pipeline, let them train more robust policies faster than competitors relying on teleoperation data from their own machines. By 2025, Tesla had shifted from relying on human teleoperation to relying on their data advantage, and the difference showed in Optimus's capabilities.

1X Technologies took the most radical approach. Their robot NEO — Neural Embodied Operation — was trained almost exclusively on internet video. The company's researchers had scraped YouTube and other open video platforms for footage of human hands performing manipulation tasks: opening drawers, picking up objects, gesturing, cooking, building. They trained a world model — a neural network that predicts the future — on this video. Then they fine-tuned the model on a small amount of actual robot data. The result was a system that could perform tasks it had never been explicitly trained on, because it had learned the physics of the world from watching billions of human actions. This approach didn't require massive teleoperation teams. It scaled by leveraging what humans had already recorded.

Beyond these three, 2024 and 2025 saw an explosion of capability across the field. Boston Dynamics released an all-electric version of Atlas, its humanoid robot, which could move with balletic grace and strength. Agility Robotics' Digit robot began work at Amazon fulfilment centres, handling bin-sorting tasks at a rate that matched human workers. Unitree and Sanctuary AI released updates to their humanoid systems. But the most significant marker that something fundamental had changed came when major robotics companies signed their first deployment contracts for commercial manufacturing. These weren't pilots or proofs-of-concept. These were signed agreements to deploy humanoid robots into actual production facilities, with delivery timelines stretching into 2025 and 2026. The inflection point had been crossed. Dexterity had stopped being a research goal and become a shipping feature.

What 2025 and early 2026 have clarified is that this transition wasn't driven by a single breakthrough. It was the convergence of multiple advances: Diffusion Policy for learning from human demonstration, Transformers for reasoning over sequences, large-scale simulation for training efficiency, vision-language models for planning, and finally, the abundance of internet data from which to learn. No single company had all these pieces independently. What distinguished the leaders — Figure, Tesla, 1X, Boston Dynamics — was the willingness to combine insights from across the machine learning field and apply them to embodied AI. They didn't invent a new kind of robot. They invented a new way to teach robots to use the bodies they'd already built.

4. The Open Questions

Yet despite these advances, profound challenges remain. The first is what might be called the "last mile" problem. All of the demonstrations of dexterous humanoid robots have occurred in controlled environments: manufacturing facilities with known objects and predictable lighting, laboratory settings where the scene can be carefully arranged. Deploying a humanoid into a fully unstructured human home — with variations in floor texture, clutter, unknown objects, and the thousand small surprises of domestic life — remains unsolved. A robot can learn to manipulate objects it has seen before. Learning to generalise to the full diversity of human environments is orders of magnitude harder. This is not a hardware problem. This is a learning problem. And it's far from solved.

The second is safety and corrigibility. A 70-kilogram robot moving at full speed can do genuine damage. An error in a factory setting — the robot misidentifies an object, misestimates a weight, miscalculates trajectory — could injure a human worker or destroy equipment. Unlike a software bug that crashes silently, a robot that makes a mistake makes it in physical space, with physical consequences. How do you build confidence that a learned policy won't fail in ways that matter? Traditional approaches — extensive testing, simulation verification, conservative speed limits — help but don't eliminate risk. This is not yet solved. Every humanoid deployment so far has occurred in human-free zones or with extensive safety infrastructure precisely because we don't yet have robust answers.

The third is the question that underlies all automation: labour economics. Humanoid robots will displace workers. This is not speculation. Digit at Amazon is already performing jobs that humans previously did. Optimus and Figure 02 are being deployed in manufacturing environments where humans worked. The standard economic narrative is that technological disruption creates new jobs even as it destroys old ones — the loom destroyed hand-weaving but created factory employment. This may be true at the macro level over decades. But at the level of an individual worker whose job is being automated, the disruption is real and immediate. The challenge facing society is not whether humanoid robots will be economically useful — they clearly will be — but whether we have policies in place to handle the transition. This is a policy problem, not a technology problem. But it's a problem that matters.

There's also a flip side. In ageing societies with shrinking workforces, humanoid robots may be essential for economic continuity. Japan, Germany, and parts of China face demographic challenges where the population is declining and the labour pool is shrinking. Humanoid robots won't replace the ethical or emotional dimensions of care work, but they could perform the physical labour associated with manufacturing, logistics, construction, and other sectors facing labour shortages. The same technology that threatens displacement in wealthy countries might prevent economic collapse in others. The question of how to govern this transition — how to ensure that the benefits are shared, and the harms are mitigated — remains unresolved.

Finally, there's the question of what embodied AI might mean for AI alignment and safety more broadly. The systems currently deployed are relatively narrow: they're trained to perform specific tasks in specific environments. They're not general-purpose agents with arbitrary access to physical systems. But as robots become more capable and more autonomous, the alignment problem becomes more urgent. A language model that hallucinates is wrong but rarely dangerous. A robot that hallucinates about where it is in 3D space, or misunderstands an instruction, can cause injury. The field of embodied AI safety is nascent. How do you ensure that as robots become more autonomous, they remain corrigible and aligned with human intent? How do you design robots that are robust to adversarial inputs and physical perturbations? These are hard open problems.

Looking toward 2030, the optimist says: humanoid robots enter mass production. They're in factories and warehouses globally. They handle hazardous work that humans shouldn't do. Labour displacement is significant but manageable through retraining programmes. Safety has improved through continuous iteration and rigorous testing. The pessimist says: mass production stalls due to unresolved safety and corrigibility questions. Labour displacement outpaces job creation. Robots remain expensive and specialised enough that only large corporations can afford them, deepening economic inequality. Alignment failures occur, leading to regulatory backlash. Both scenarios are plausible. What seems certain is that the trajectory set in 2025 and 2026 will define the next decade. The dexterity problem has a solution. What we solve next is the wisdom problem.
