Vision-language reasoning over a simulated robotics scene.
A vision-language model (VLM) observes the Isaac Sim scene through screenshots and reasons about what to do next (e.g., 'clear the spilled bottle from aisle 3'). It bridges high-level reasoning with low-level skill execution.
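
The observe, reason, act loop described above might be sketched roughly as follows. This is a minimal illustration, not the project's implementation: every name in it (SkillCall, run_episode, fake_capture, scripted_planner, and the skill names navigate_to and pick) is hypothetical, treating the quoted instruction as the episode goal is an assumption, and no Isaac Sim or VLM library calls are used. The screenshot source and the model are passed in as plain callables so that real ones could be substituted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SkillCall:
    """One low-level skill invocation chosen by the planner (hypothetical type)."""
    name: str
    args: Dict[str, str] = field(default_factory=dict)


def run_episode(
    goal: str,
    capture: Callable[[], bytes],
    plan_next: Callable[[str, bytes, List[SkillCall]], SkillCall],
    skills: Dict[str, Callable[..., None]],
    max_steps: int = 10,
) -> List[SkillCall]:
    """Loop: take a screenshot, ask the planner for the next skill, execute it."""
    history: List[SkillCall] = []
    for _ in range(max_steps):
        image = capture()                       # observe the scene
        call = plan_next(goal, image, history)  # high-level reasoning step
        if call.name == "done":                 # planner reports the goal is met
            break
        if call.name not in skills:
            raise ValueError(f"unknown skill: {call.name}")
        skills[call.name](**call.args)          # low-level skill execution
        history.append(call)
    return history


def fake_capture() -> bytes:
    """Stand-in for a simulator screenshot; returns placeholder bytes."""
    return b"placeholder-image"


def scripted_planner(goal: str, image: bytes, history: List[SkillCall]) -> SkillCall:
    """Stand-in for a VLM: replays a fixed plan, one step per call."""
    script = [
        SkillCall("navigate_to", {"location": "aisle 3"}),
        SkillCall("pick", {"obj": "bottle"}),
        SkillCall("done"),
    ]
    return script[len(history)]


if __name__ == "__main__":
    log = []
    demo_skills = {
        "navigate_to": lambda location: log.append(("navigate_to", location)),
        "pick": lambda obj: log.append(("pick", obj)),
    }
    run_episode("clear the spilled bottle from aisle 3",
                fake_capture, scripted_planner, demo_skills)
    print(log)
```

Keeping the planner behind a single callable means the same loop can be exercised with a scripted stub, as here, before a real model and simulator are attached.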