
Machine Learning in Robots: How Modern Teams Train Skills Without ML PhDs
The Robotics Stall Nobody Talks About: When Your Team Has Everything Except an ML Researcher
Picture the bench: a mechanical engineer, a systems integrator, and a ROS2 developer standing around a $35,000 Franka manipulator. A signed pilot contract sits on the laptop. The Jetson Orin Nano arrives Thursday. The deadline is ten weeks out. The workspace is mapped, the gripper is selected, and the customer demo script is written. What this team does not have is anyone who has trained a reinforcement learning policy from scratch — and the prevailing assumption inside the industry says that single gap disqualifies them.
It doesn't. The bottleneck in machine learning in robots today is not hardware and not compute. Per the International Federation of Robotics' World Robotics 2023, 553,000 new industrial robots were installed worldwide in 2022, bringing the global operational stock to roughly 3.9 million units. The compute substrate — Jetson Orin Nano, T4000/T5000, AGX Orin — is a catalog item. What's stalled is the legacy belief that deploying robot intelligence requires a dedicated research team. The traditional model said: build the ML stack end-to-end in-house. The emerging model says: consume pretrained policies, capture environments, deploy to edge hardware. Those are two completely different operating assumptions, and they imply two completely different team compositions.
The market data underscores how unsustainable the legacy model has become. The AI robots market is projected to grow from $6.11B in 2025 to $33.39B by 2030 at a 40.4% CAGR according to MarketsandMarkets — a growth curve that simply cannot be sustained if every deployment requires a PhD pipeline behind it. The teams shipping fastest in 2025 aren't the ones with the most ML expertise. They're the ones who've correctly identified which layers of the stack they need to own versus consume.

Table of Contents
- Why Training Robot Skills Used to Require a Specialist Pipeline
- The Four Layers of Machine Learning in Robots
- How Real-to-Sim LiDAR Scanning Eliminates the Biggest Bottleneck
- Pretrained RL Policies vs. Training From Scratch
- What One-Click Deployment Actually Demands From Your Team
- Three Robotics Team Archetypes and Their ML Approaches
- Three Questions Every Team Asks After Deploying ML-Powered Skills
- The Pre-Flight Readiness Checklist
Why Training Robot Skills Used to Require a Specialist Pipeline
The traditional ML-in-robotics pipeline required four distinct specialist functions to operate concurrently, and the cost of running all four simultaneously is the entire reason this work historically lived inside FAANG-tier labs and well-funded research startups.
Environment modeling came first. 3D artists or simulation engineers built URDF/USD scene files by hand, modeling fixtures, lighting, materials, and contact surfaces with enough fidelity that a trained policy might survive transfer to physical hardware. Two to six weeks per environment was typical. Change the cell layout, restart the modeling work.
Reward function engineering ran in parallel. ML researchers iterated on reward shaping — the process of defining numerical signals that encourage the policy toward useful behavior — and this phase often ran longest because RL exploration is sample-inefficient by nature. The agent spends thousands of episodes producing nothing useful before stumbling into rewarded behavior. Reward shaping is the discipline of making that stumble happen faster.
Sim-to-real gap closure came third. Engineers tuned domain randomization, dynamics randomization, and system identification — three of the four canonical transfer strategies categorized in the IEEE survey on sim-to-real transfer in deep reinforcement learning for robotics. The fourth strategy, real-world fine-tuning, runs after deployment. Each strategy demands different expertise: randomization is statistical, system identification is control-theoretic, fine-tuning is empirical.
Embedded deployment was the fourth function. Firmware and middleware engineers ported policies to edge hardware, handled ROS2 integration, managed inference latency, and built safety wrappers around learned behaviors that — by design — have no formal guarantees of correctness.
Each layer required different specialists. A team needed ML researchers, simulation engineers, embedded systems developers, and safety engineers in the same room at the same time. That hiring profile priced out every robotics team except the ones with venture capital deep enough to absorb 18 months of pre-revenue payroll.
What collapsed that requirement is a stack of independent advances that converged around 2022–2024.
Pretrained RL policies now encode broad motion primitives — grasping, locomotion, navigation, articulated manipulation — that teams fine-tune rather than train from scratch. The conceptual antecedent traces to work by Sergey Levine and collaborators at UC Berkeley, demonstrating that reinforcement learning can learn complex motor skills directly when combined with expressive function approximators and off-policy algorithms. The practical consequence is that motion primitives behave like infrastructure now, not like research artifacts.
Real-to-sim scanning pipelines automate environment capture using LiDAR, replacing manual URDF construction with a sequential workflow a robotics engineer can run without a simulation specialist on staff.
Cloud simulation infrastructure parallelizes training across hundreds of environment instances, compressing what used to be week-long training runs into hours.
Standardized edge runtimes — Jetson Orin Nano, T4000/T5000, AGX Orin — abstract hardware-specific deployment. ONNX and TensorRT compile inference artifacts that drop onto any of these targets with predictable performance.
The distinction that matters here is between doing ML and using ML. Doing ML means building algorithms, architectures, and training infrastructure. Using ML means selecting a pretrained policy, fine-tuning it to your environment, and deploying it to your hardware. Most robotics teams need the second category. Hiring boards still demand the first — which is why robotics pilots stall in the hiring phase before any code ships.
The IEEE sim-to-real survey's framing reinforces the point: transfer strategies decompose cleanly into four named approaches, which means teams can choose how much of the stack they own based on the strategy they adopt, rather than owning all of it by default. The modern question is not "can our team do ML?" It is "which ML layers does our team need to own?"
The Four Layers of Machine Learning in Robots
The question is not whether to use ML in your robot. The question is at which abstraction level you engage with it. Decompose the stack into four functional layers and the ownership decisions become tractable.
| ML Stack Layer | Traditional Expertise Required | Time Cost (Traditional) | Platform Abstraction | Risk if Skipped |
|---|---|---|---|---|
| Environment capture | Sim engineer + 3D modeler | 2–6 weeks per env | High (LiDAR → mesh) | Sim-to-real gap widens |
| Policy training | ML researcher + RL specialist | 4–12 weeks per skill | High (pretrained + fine-tune) | Reinventing motion primitives |
| Policy packaging | ML + embedded engineer | 1–3 weeks | Full (one-click export) | Runtime incompatibility |
| Edge deployment | Embedded + ROS2 + DevOps | 2–4 weeks | Partial (team retains safety) | Latency, safety, integration failures |
The pattern that emerges from this decomposition is uncomfortable for teams that have already over-staffed. Layers 2 and 3 are where most robotics organizations catastrophically over-invest. Hiring an ML researcher to retrain motion primitives that already exist as pretrained policies is the robotics equivalent of training your own foundation model from scratch to power a customer support chatbot. The economics don't survive contact with a pilot deadline.
The IEEE sim-to-real survey reinforces this implicitly: the most successful transfer outcomes come from teams who combined high-fidelity simulation with targeted fine-tuning, not from teams who built training infrastructure from scratch. The framework rewards layering, not rebuilding.
Layers 1 and 4 — environment capture and edge deployment — are where teams should retain ownership. These layers encode the team's specific physical context (cell layout, fixtures, payload geometry) and specific safety requirements (risk assessment, force limits, e-stop integration). No platform can know these in advance. The peer-reviewed ScienceDirect review of ISO 15066 makes this point explicitly: a collaborative robot system "comprises the robot arm, controller, end-effector, workpiece, and associated equipment" — and that integration responsibility cannot be outsourced to a vendor, no matter how slick the deployment pipeline.
Conflating all four layers into a single "we need to do ML" project is the most common reason robotics pilots overrun their timelines by quarters. The teams that ship treat the four layers as independent ownership decisions, not as a monolithic build.
How Real-to-Sim LiDAR Scanning Eliminates the Biggest Bottleneck
In operational terms, the sim-to-real problem is straightforward. Per the IEEE survey on sim-to-real transfer and the ScienceDirect article "Overcoming the Sim-to-Real Gap in Autonomous Robots," policy performance on real hardware correlates strongly with how accurately core physical interactions are captured in simulation. The corollary is brutal in production: if your simulated environment doesn't match your real one, your trained policy will degrade on hardware, sometimes catastrophically, sometimes silently, and usually in ways that surface during customer demos rather than during validation.
Historically, closing this gap meant manual 3D modeling (URDF/USD construction) or photogrammetry. Both required simulation specialists. Both consumed weeks per environment. Both produced output that aged poorly the moment a fixture moved.
Real-to-sim LiDAR scanning collapses this work into a sequential pipeline that a robotics engineer — not a simulation specialist — can operate end to end.
Step 1: LiDAR scan of physical environment. A handheld or mounted scanner captures spatial geometry of the actual workspace — the cell, the lab, the warehouse aisle — as a dense point cloud. Scan resolution should match the spatial precision the downstream task demands. Sub-centimeter precision is achievable with industrial scanners; coarser scans work for navigation tasks where contact dynamics aren't critical.
Step 2: Automated mesh generation. The pipeline converts the point cloud into a simulation-ready mesh with collision geometry. This is the step that historically required a 3D artist and weeks of cleanup. Automation here is the central unlock.
Step 3: Sim validation. The team confirms the generated environment matches physical reality. Walk around the mesh, check critical dimensions against tape-measure reality, verify that fixtures are positioned where they actually sit. This is not a passive step — but it's a verification task, not a modeling task, and the time delta is enormous.
Step 4: RL training against validated environment. A pretrained policy fine-tunes within a simulation that mirrors the deployment target. Because the environment is already accurate, the policy is learning the task, not learning to compensate for sim/real mismatch.
Step 5: Edge deployment. The trained policy ships to Jetson hardware running in the scanned environment. The hardware sees what the policy was trained against.
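Step 3 is easiest to keep honest when the comparison is written down rather than eyeballed. The sketch below is one way to record tape-measure values next to the same dimensions read off the generated mesh and flag anything outside tolerance. The `DimensionCheck` class, the feature names, the numbers, and the 5 mm tolerance are all illustrative assumptions, not part of any scanning tool's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionCheck:
    name: str            # feature being compared, e.g. "bin width"
    measured_mm: float   # tape-measure value from the physical cell
    mesh_mm: float       # same dimension read off the generated mesh
    tolerance_mm: float  # allowed absolute deviation for this feature

    @property
    def error_mm(self) -> float:
        return abs(self.mesh_mm - self.measured_mm)

    @property
    def passed(self) -> bool:
        return self.error_mm <= self.tolerance_mm


def failed_checks(checks):
    """Return the checks whose mesh dimension is outside tolerance."""
    return [c for c in checks if not c.passed]


# Hypothetical values for illustration only.
checks = [
    DimensionCheck("bin width", measured_mm=600.0, mesh_mm=601.5, tolerance_mm=5.0),
    DimensionCheck("bin depth", measured_mm=400.0, mesh_mm=399.0, tolerance_mm=5.0),
    DimensionCheck("fixture height", measured_mm=250.0, mesh_mm=262.0, tolerance_mm=5.0),
]

for c in failed_checks(checks):
    print(f"{c.name}: off by {c.error_mm:.1f} mm (limit {c.tolerance_mm:.1f} mm)")
```

With these made-up numbers, only the fixture height is flagged, which is the signal to rescan or correct the mesh before any training run starts.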
The reason this compresses the downstream pipeline matters as much as the time savings themselves. With a high-fidelity environment baseline, teams need less domain randomization (because the sim already matches reality), less reward shaping (because spatial constraints are correctly encoded), and they hit fewer sim-to-real failures during validation. The IEEE survey identifies four transfer strategies: domain randomization, dynamics randomization, system identification, and real-world fine-tuning. Real-to-sim scanning primarily addresses system identification by encoding the actual environment, which reduces the burden the other three strategies have to carry.
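To make "less domain randomization" concrete, here is a minimal sketch of sampling randomized parameters around nominal values, with a wide range for a poorly characterized cell and a narrow one for a cell that has been scanned and measured. The parameter names, nominal values, and spreads are hypothetical, and which parameters a given team can actually narrow depends on what they have measured; this is not any simulator's API.

```python
import random


def sample_dynamics(nominal, spread, rng):
    """Draw one randomized parameter set around nominal values.

    `spread` is the fractional half-width of a uniform range; 0.05 means
    each parameter is sampled within +/-5% of its nominal value.
    """
    return {
        name: rng.uniform(value * (1.0 - spread), value * (1.0 + spread))
        for name, value in nominal.items()
    }


# Hypothetical nominal values for illustration only.
nominal = {"friction": 0.8, "payload_kg": 1.2, "joint_damping": 0.05}
rng = random.Random(0)  # seeded so runs are repeatable

wide = sample_dynamics(nominal, spread=0.30, rng=rng)    # poorly known cell
narrow = sample_dynamics(nominal, spread=0.05, rng=rng)  # scanned and measured cell
```

The narrower the range, the less of the training budget goes to conditions the deployed robot will never see.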
The sim-to-real gap isn't a physics problem — it's a data problem. When your simulation is built from your actual environment, the gap closes before training even starts.
The teams that misunderstand this step treat scanning as an optional preprocessing nicety. The teams that ship treat it as the first irreducible engineering deliverable in the entire training workflow.
Pretrained RL Policies vs. Training From Scratch
Pretrained RL policies are policies trained on broad motion primitives — grasping, locomotion, navigation, articulated manipulation — that can be fine-tuned to specific environments, payloads, or tools. The analogy that lands for technical readers: fine-tuning a foundation model on domain-specific data versus pretraining one from raw tokens. Same architectural pattern, applied to motor control rather than language.
The decision of when to use pretrained bases versus when to train from scratch is not philosophical. It's operational, and it follows a clean matrix.
| Use Case | Recommended Starting Point | Fine-Tuning Time | Risk | Platform Support |
|---|---|---|---|---|
| New manipulator task, known cell | Pretrained manipulation policy | Days to 2 weeks | Low | Strong (Franka, UR) |
| Locomotion on novel terrain | Pretrained locomotion policy | 1–3 weeks | Medium | Strong (Unitree, custom) |
| Existing skill, new payload/tool | Pretrained + targeted fine-tune | Days | Low | Strong |
| Entirely novel motion primitive | Scratch training or hybrid | 8–16+ weeks | High | Limited |
The matrix surfaces a non-obvious insight: pretrained policies don't eliminate training. They eliminate the worst part of training. RL's notorious sample inefficiency comes mostly from the early random exploration phase — the phase where the agent produces no useful behavior for thousands of episodes while it learns that random flailing isn't rewarded. Pretrained policies skip that phase entirely. Fine-tuning starts from a policy that already knows how to move. You're teaching it what to move toward in your specific context, not teaching it the concept of moving.
This is consistent with Precedence Research's finding that the machine learning and deep learning segment of AI-driven industrial robotics represents roughly 32% of technology adoption — ahead of traditional computer vision alone. The ecosystem has consolidated around the assumption that learned policies are deployable infrastructure, not research artifacts.
Pretrained RL policies don't skip the training process. They skip the 80 percent of it where nothing useful happens.
The honest counter-evidence sits in the same body of literature. Per the IEEE sim-to-real survey and the ScienceDirect sim-to-real paper, even with pretrained bases, transfer is not automatic. Teams should expect to invest engineering effort in dynamics matching, sensor noise modeling, and real-world fine-tuning. The right framing is not "pretrained equals free." The right framing is "pretrained means the floor is dramatically higher." You still climb. You just don't start from the basement.
For most robotics teams, the only scenario where scratch training is justified is the bottom row of the matrix: an entirely novel motion primitive with no analog in any pretrained base. That scenario exists. It is rare. Teams who assume they're in it almost always aren't — they're in a fine-tuning scenario they've misclassified as a research scenario, often because someone on the team has incentive to scope it as research.
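One way to keep the matrix from being quietly overridden is to encode it and make the scratch-training row demand a written rationale. The sketch below does that; the dictionary keys, the `StartingPoint` class, and the `recommend` function are hypothetical names, and the row contents are copied from the matrix above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StartingPoint:
    recommendation: str
    fine_tuning_time: str
    risk: str


# Rows copied from the decision matrix; the keys are hypothetical labels.
DECISION_MATRIX = {
    "new_manipulator_task_known_cell": StartingPoint(
        "Pretrained manipulation policy", "Days to 2 weeks", "Low"),
    "locomotion_novel_terrain": StartingPoint(
        "Pretrained locomotion policy", "1-3 weeks", "Medium"),
    "existing_skill_new_payload_or_tool": StartingPoint(
        "Pretrained + targeted fine-tune", "Days", "Low"),
    "novel_motion_primitive": StartingPoint(
        "Scratch training or hybrid", "8-16+ weeks", "High"),
}


def recommend(use_case: str, rationale: str = "") -> StartingPoint:
    """Look up a use case; require a written rationale for the high-risk row."""
    choice = DECISION_MATRIX[use_case]
    if choice.risk == "High" and not rationale.strip():
        raise ValueError("scratch training needs a written rationale")
    return choice


print(recommend("existing_skill_new_payload_or_tool").recommendation)
```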
What One-Click Deployment Actually Demands From Your Team
Deployment automation abstracts infrastructure complexity. It does not abstract operational responsibility. Teams who interpret "one-click" as "zero work" are the same teams who fail safety validation on first attempt and lose two weeks they thought they'd already saved.
Five ownership areas survive the abstraction. Each follows the same pattern: what the platform handles versus what the team still owns.

- Hardware target configuration. The platform handles runtime compilation for Jetson Orin Nano, T4000/T5000, or AGX Orin variants; ONNX/TensorRT conversion; container packaging. The team owns selection of the correct edge hardware — Kinematics Mini at $1,499 for development and light production, or Kinematics Max for industrial enclosures — plus network topology, power budget, and thermal envelope. The platform cannot know whether your cell has a 24V rail or whether your enclosure rejects heat.
- Policy performance validation. The platform handles the deployment pipeline, version tagging, and artifact distribution. The team owns latency benchmarking on actual hardware (measure inference time per control cycle under realistic load), accuracy validation against task success criteria, and edge case probing. A policy that runs at 50Hz in sim but 12Hz on the Jetson under thermal throttling is not a deployed skill — it's a pending failure.
- Safety envelope definition. The platform handles runtime integration with ROS2 safety nodes. The team owns ISO 10218-1/2 and ISO/TS 15066 compliance — including risk assessment, safety-rated monitored stops, force/pressure limits for collaborative operation, and emergency stop integration. Per ISO/TS 15066, this means defining quasi-static and transient contact force limits across human body regions when cobots are in play. That work is not delegable. No vendor can do your risk assessment for you because no vendor knows your workspace.
- ROS2 integration checkpoints. The platform handles standard message types, topic publication for policy outputs, and node lifecycle. The team owns mapping topics to the existing control stack, confirming message frequency matches downstream consumers, and handling out-of-distribution fallback behaviors. If your policy publishes joint commands at 100Hz and your controller consumes at 50Hz, the symptom presents as policy underperformance — and the root cause is integration, not ML.
- Fleet versioning and rollback. The platform handles policy artifact storage, version metadata, and over-the-air distribution. The team owns rollback triggers, A/B deployment ratios, and the operational runbook for when a deployed policy underperforms. "Deploy" is one verb. "Revert in under five minutes when production stops" is a different verb. Both need owners before the first deployment goes live.
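The latency benchmarking item in the list above can be sketched in a few lines of timing code. This is a minimal sketch, assuming the policy is wrapped in a zero-argument callable; `time_cycles`, `summarize`, and the sample numbers are hypothetical, and a real benchmark would run on the target Jetson under realistic thermal load rather than on a workstation.

```python
import statistics
import time


def time_cycles(infer, cycles=200):
    """Call `infer()` repeatedly and return per-call durations in seconds."""
    samples = []
    for _ in range(cycles):
        start = time.perf_counter()
        infer()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples_s, control_hz):
    """Compare measured inference times against the control-cycle budget."""
    budget_s = 1.0 / control_hz
    ordered = sorted(samples_s)
    n = len(ordered)
    rank = (99 * n + 99) // 100  # nearest-rank 99th percentile, integer ceil
    p99 = ordered[rank - 1]
    return {
        "budget_ms": budget_s * 1e3,
        "median_ms": statistics.median(ordered) * 1e3,
        "p99_ms": p99 * 1e3,
        "meets_budget": p99 <= budget_s,
    }


# Illustrative numbers: most cycles take 10 ms, a throttled few take 83 ms (about 12 Hz).
samples = [0.010] * 98 + [0.083] * 2
print(summarize(samples, control_hz=50))

# On real hardware the samples would come from the policy itself, for example:
# samples = time_cycles(lambda: policy(observation), cycles=1000)
```

The median here looks healthy against a 50 Hz budget, but the tail does not, which is why the check uses a high percentile rather than the average.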
The ScienceDirect review of ISO 15066 makes the broader point directly: collaborative robots are not intrinsically safe. Safety depends on detailed risk assessment, correct application of force and speed limits, and validation testing under worst-case conditions. The same logic applies to ML-powered skill deployment. The platform makes shipping fast. It does not make shipping safe by default. That responsibility stays with the integrator, where it always belonged.
Three Robotics Team Archetypes and Their ML Approaches
Three operational archetypes dominate the teams currently deploying ML-powered skills. Each has different constraints, different hardware ecosystems, and a different correct answer to the "which layers do we own?" question. Read them and calibrate.
The Velocity-Constrained Startup
A seed-stage robotics company building on a Unitree Go2 quadruped or a custom platform. One robotics engineer wearing four hats. Their constraint isn't budget — it's velocity. They need a working skill demo to close a Series A round or land a pilot contract. The deadline is six weeks, not six quarters.
Their ML approach must be almost entirely platform-delegated: pretrained locomotion policies, a LiDAR scan of the demo environment, fine-tuning measured in days, and deployment to existing hardware without rewriting the driver stack. They cannot afford to own any of the four layers from the stack decomposition except environment capture and deployment configuration, and even those need to be lightweight. Anyone trying to sell them custom RL development as a six-month engagement is selling them a missed deadline and a dead company. The right partner for this archetype shows up with infrastructure, not with a research proposal.
The Industrial Integrator
An established systems integrator deploying Franka Emika or Universal Robots manipulators for a tier-2 automotive or electronics manufacturer. They have robotics expertise on staff. They have ROS2 production experience. What they don't have is a six-month R&D budget — their client contract has a six-month deployment SLA, and the difference matters.
They need pretrained manipulation policies fine-tuned to specific gripping tasks, often with custom end-effectors. They need reliable sim-to-real transfer on known cell geometry. They need enterprise-grade deployment with documented rollback. They must comply with ANSI/A3 R15.06-2025 and ISO 10218-1/2 for industrial robot safety — the Machine Safety Specialists training reference lays out the validation framework integrators use in North America. For this archetype, the platform's value isn't raw speed. It's audit trail and reproducibility across cells. They're deploying skill 47 of 200, not skill 1 of 1.
The ROS2 Academic or Maker Team
A university lab, an applied research group, or a maker collective building on open-source hardware and ROS2. They want to contribute to policy development, not just consume pretrained models. Their priority is openness. They want an MIT-licensed inference stack, an open framework underneath it, and compatibility with community hardware they can buy, modify, and document.
They want to inspect and modify the policy layer. For them, the platform is a starting substrate, not a black box. They're the archetype that pushes the ecosystem forward — the one that finds the bug in the locomotion policy, files the GitHub issue, and submits the PR. Open licensing isn't a procurement preference; it's a structural requirement for the kind of work they do.
The robotics teams moving fastest right now aren't the ones with the most ML expertise — they're the ones who've been most honest about which parts of the stack they actually need to own.
The synthesis across all three archetypes: machine learning in robots looks different depending on which layers your team chooses to own. The velocity-constrained startup owns almost nothing in the ML stack and is correct to do so. The integrator owns the safety and validation layers and is correct to do so. The academic team owns potentially all of it and is correct to do so. A serious platform supports all three. The first diagnostic move for any robotics team is to recognize which archetype they're operating as — often without realizing it — and stop pretending they're a different one because it sounds more impressive on a pitch deck.
Three Questions Every Team Asks After Deploying ML-Powered Skills
- Does using a pretrained RL policy lock my team into one vendor's training data? Distinguish between policy weights (which encode learned behavior) and the architecture and runtime around them (which may be open-source). Stacks built on MIT-licensed inference components allow inspection, modification, and portability of the surrounding infrastructure even when specific weights are platform-distributed. Lock-in risk is real but addressable through architectural choice. Evaluate three things: the license of the inference runtime, the portability of the policy artifact format (ONNX is widely portable; proprietary binary formats are not), and whether the platform allows export of fine-tuned weights for redeployment elsewhere. If those three answers are open, portable, and yes, you have an exit path.
- How does machine learning in robots differ for mobile robots versus manipulators? Locomotion policies operate in continuous state spaces with balance, gait, and terrain adaptation constraints. They're trained on broad motion patterns and benefit heavily from domain randomization across surface types — friction, compliance, slope. Manipulation policies are tighter and more task-specific. Precision requirements run higher (sub-centimeter for assembly tasks), and the sim-to-real gap manifests in contact dynamics rather than terrain variation. The IEEE sim-to-real survey categorizes both under the same transfer framework, but the engineering emphasis diverges: locomotion teams invest in dynamics randomization; manipulation teams invest in system identification and contact modeling.
- What happens when a deployed policy encounters a scenario it wasn't trained on? The honest answer is that out-of-distribution behavior in RL policies is the open frontier. Policies may degrade gracefully (slower or less precise execution), degrade unsafely (unpredictable motion), or fail to act at all. This is precisely why the safety envelope from the deployment section exists. Workspace boundaries, velocity limits, force and pressure constraints per ISO/TS 15066, and e-stop integration are not redundant safeguards. They are the operational answer to OOD behavior. Define fallback behaviors at the ROS2 layer as well: when policy confidence drops below a threshold, the system should revert to a deterministic safe state — a controlled stop, a return-to-home, a handoff to a teleoperator — rather than continuing inference into territory the policy doesn't understand.
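The fallback described in the last answer can be sketched as a small gate between the policy and the controller. This is an illustration under stated assumptions: the policy is assumed to expose some confidence score (many do not without extra work), the threshold and velocity limit are made-up values, and `gate_command` is a hypothetical name. A software gate like this sits alongside safety-rated stops and e-stop hardware; it does not replace them.

```python
from enum import Enum


class Mode(Enum):
    POLICY = "policy"
    SAFE_STOP = "safe_stop"


def clamp(value, limit):
    """Limit a command to the symmetric range [-limit, +limit]."""
    return max(-limit, min(limit, value))


def gate_command(joint_velocities, confidence, *, min_confidence, max_velocity):
    """Pass clamped policy output, or command zero velocity when confidence is low.

    Returns the mode taken and the velocities to send downstream.
    """
    if confidence < min_confidence:
        return Mode.SAFE_STOP, [0.0] * len(joint_velocities)
    return Mode.POLICY, [clamp(v, max_velocity) for v in joint_velocities]


# Confident policy output: the over-limit joint is clamped, the rest pass through.
mode, cmd = gate_command([0.2, -1.5, 0.4], confidence=0.9,
                         min_confidence=0.6, max_velocity=1.0)
print(mode, cmd)
```

Commanding zero velocity is the simplest deterministic safe state; a return-to-home or teleoperator handoff would slot into the same branch.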
The Pre-Flight Readiness Checklist
A three-tier operational checklist. Use it Monday morning. Each item maps to an actual deliverable, not an aspiration.
Tier 1 — Environment and Hardware Readiness (before scanning begins)
- Physical environment is stable enough to scan. If the workspace will change in the next 30 days — furniture, fixtures, lighting, machine layout — scan after the change. A scan of a non-final environment trains a policy for an environment that won't exist when the policy ships.
- LiDAR scanning hardware identified. Either owned (a handheld scanner in the Leica BLK2GO class or equivalent) or accessible via service provider. Scan resolution should match the spatial precision your task demands. Don't pay for sub-millimeter resolution on a navigation task; don't accept centimeter resolution on a precision assembly task.
- Target robot platform confirmed. Manipulator (Franka, UR), quadruped (Unitree or analog), or mobile base. The platform determines which pretrained policy family applies. Confirming this before scanning prevents wasted work if a hardware swap is on the table.
- Edge hardware spec finalized. Kinematics Mini at $1,499 (Jetson Orin Nano class) for development and light production, or Kinematics Max (Jetson T4000/T5000 or AGX Orin) for industrial enclosures. Compute budget should account for inference latency targets under thermal load — not benchtop best-case.
Tier 2 — Policy and Training Readiness (before training starts)
- Task defined in measurable terms. Not "pick objects." Try "pick 50mm cylindrical objects from a 600mm × 400mm bin at 95% first-attempt success across 100 trials, with cycle time under 4 seconds." Measurable tasks have measurable success criteria. Vague tasks have vague outcomes.
- Pretrained policy base selected, or scratch-training rationale documented. Default to pretrained unless the decision matrix's bottom-row criteria genuinely apply. If someone on the team is pushing for scratch training, ask them to write the rationale down. Most rationales don't survive being written.
- Simulation environment validated against physical scan. Visually inspect the generated mesh. Verify dimensions of critical features against tape-measure reality. Walk the sim, walk the cell, compare.
- Reward function or fine-tuning parameters scoped. For pretrained fine-tuning, this is usually narrow — task-specific reward signals layered on existing primitives. For scratch training, this is the longest-running task in the project. Scope honestly. Budget honestly.
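A measurable task definition can be checked mechanically. The sketch below encodes the example criteria from the first Tier 2 item (95% first-attempt success across 100 trials, cycle time under 4 seconds). The `Trial` record and `meets_criteria` function are hypothetical, and reading "cycle time under 4 seconds" as a limit on every trial rather than on the average is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    first_attempt_success: bool
    cycle_time_s: float


def meets_criteria(trials, min_success_rate=0.95, max_cycle_time_s=4.0, min_trials=100):
    """Check logged trials against the example task definition."""
    if len(trials) < min_trials:
        return False  # not enough evidence yet
    successes = sum(t.first_attempt_success for t in trials)
    success_rate = successes / len(trials)
    slowest = max(t.cycle_time_s for t in trials)
    return success_rate >= min_success_rate and slowest < max_cycle_time_s


# Illustrative log: 96 successes and 4 failures, all at 3.5 s per cycle.
log = [Trial(True, 3.5)] * 96 + [Trial(False, 3.5)] * 4
print(meets_criteria(log))
```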
Tier 3 — Deployment Readiness (before going live)
- ROS2 node compatibility confirmed. Topic names match. Message types match. Publication frequency matches what downstream consumers expect. Integration testing happens on real ROS2 graphs, not on isolated test rigs.
- Safety envelope documented and tested. ISO 10218-1/2 risk assessment complete. If collaborative operation is in scope, ISO/TS 15066 force and pressure limits validated under worst-case conditions, not nominal. Per the ScienceDirect review of ISO 15066, treat the cobot system as the arm plus controller plus end-effector plus workpiece plus surrounding equipment — not just the arm alone.
- Policy performance benchmarked on target hardware. Inference latency measured under load. Success rate measured on actual hardware in the actual environment, not in simulation. Sim numbers and hardware numbers should be reported separately; a team that conflates them loses credibility the first time the customer asks which is which.
- Rollback procedure defined and tested. If a deployed policy underperforms, you should be able to revert to the previous version in under five minutes without a redeploy from source. Test the rollback before you need it.
- Deployment owner named. Not "the team." One person whose name is on the runbook, whose phone rings when the robot stops, and who has the authority to trigger rollback without convening a meeting.
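The rollback item above is easier to test when the mechanism is small enough to reason about. Here is a minimal in-memory sketch of version tracking with a one-step revert; `PolicyRegistry` is a hypothetical class, and a real system would keep the previous artifact on the device so that reverting never requires a redeploy from source.

```python
class PolicyRegistry:
    """Tracks deployed policy versions so the previous one can be restored."""

    def __init__(self):
        self._history = []  # oldest first; the last entry is the active version

    @property
    def active(self):
        return self._history[-1] if self._history else None

    def deploy(self, version):
        self._history.append(version)

    def rollback(self):
        """Drop the active version and return the one that is now active."""
        if len(self._history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._history.pop()
        return self.active


registry = PolicyRegistry()
registry.deploy("pick-v1")
registry.deploy("pick-v2")
print(registry.rollback())
```

The error on an empty history is deliberate: a first deployment with nothing to revert to should be a known condition in the runbook, not a surprise during an outage.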
The items on this checklist define the work that remains even when the ML pipeline is fully platform-delegated. Whichever path you take — building this stack in-house, partnering with a platform like OpenKinematics, or running a hybrid — these are the items that determine whether your first deployed skill ships in weeks or stalls in months. The teams that treat the checklist seriously are the ones whose pilots become production. The teams that treat it as paperwork are the ones whose pilots become postmortems.