Definition

Curriculum learning is a training strategy that presents tasks or data samples in a structured order of increasing difficulty, rather than sampling uniformly at random. Inspired by how human education progresses from simple exercises to complex problems, curriculum learning helps a learning agent discover initial successes on easy tasks and then build on those skills to solve harder variants.

In robotics, the motivation is practical: many real-world tasks have sparse rewards — the agent receives no positive signal until it completes the entire task. A grasping policy trained from scratch on small, slippery objects may never succeed during early training and therefore never learn. Starting with large, easy-to-grasp objects provides early reward signals that bootstrap the learning process. As the policy improves, the curriculum introduces progressively more challenging objects, poses, or environmental conditions.

Curriculum learning is not a replacement for reward shaping or domain randomization — it is complementary. The curriculum controls what the agent trains on at each stage, while reward shaping controls how success is measured, and domain randomization controls how varied the training conditions are. Used together, these techniques can dramatically improve sample efficiency and final policy performance.

How It Works

A curriculum is defined by two components: a difficulty measure (how hard is each task variant?) and a progression schedule (when does the agent advance to harder tasks?). The difficulty measure can be hand-designed (object size, trajectory length, terrain slope) or learned (based on the agent's current success rate). The progression schedule can be fixed (advance after N epochs), performance-based (advance when success rate exceeds a threshold), or fully automatic.

During training, the environment or dataset is parameterized by a difficulty level. At each training iteration, the curriculum selects the current difficulty, the agent collects experience (RL) or trains on a batch (supervised), and the curriculum updates the difficulty based on performance metrics. This creates a feedback loop where the training distribution adapts to the agent's competence.
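The feedback loop above can be sketched as a small helper. The class below is an illustrative, performance-based curriculum — the class name, threshold, and window size are assumptions for this sketch, not a specific library's API:

```python
# Minimal sketch of a performance-based curriculum: advance to the next
# difficulty level once the recent success rate clears a threshold.
# All names and values are illustrative.

class Curriculum:
    """Advance difficulty when the recent success rate exceeds a threshold."""

    def __init__(self, levels, threshold=0.8, window=100):
        self.levels = levels        # ordered difficulty settings, easy -> hard
        self.threshold = threshold  # success rate required to advance
        self.window = window        # episodes used to estimate success rate
        self.idx = 0
        self.results = []

    @property
    def difficulty(self):
        return self.levels[self.idx]

    def update(self, success):
        """Record one episode outcome and advance if warranted."""
        self.results.append(float(success))
        self.results = self.results[-self.window:]
        if (len(self.results) == self.window
                and sum(self.results) / self.window >= self.threshold
                and self.idx < len(self.levels) - 1):
            self.idx += 1
            self.results = []  # restart statistics at the new level
```

A training loop would read `curriculum.difficulty` when configuring each episode and call `curriculum.update(success)` afterwards, closing the loop between the training distribution and the agent's competence.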

The key insight is that learning on the right distribution at the right time is more efficient than learning on the full distribution from the start. Early training on hard tasks wastes compute on episodes where the agent never receives useful gradient signal.

Types of Curriculum Learning

  • Self-paced learning — The agent implicitly controls the curriculum by weighting training samples according to its current loss. In the classic formulation, only samples whose loss falls below a pace threshold are trained on, and the threshold rises over training to gradually admit harder samples. No external teacher is needed.
  • Teacher-student (adversarial) — A separate "teacher" proposes task configurations at the boundary of the "student" agent's ability. PAIRED (Dennis et al., 2020) trains an adversarial teacher network to generate such environments, while PLR (Jiang et al., 2021) instead prioritizes replaying previously encountered levels with high estimated learning potential. In both cases the curriculum targets learning progress, not raw difficulty.
  • Task curriculum — A fixed sequence of distinct tasks, each building on skills from the previous one. For example: reach → push → pick up → place → stack. Common in multi-stage manipulation training.
  • Environment curriculum — The task remains the same, but environment parameters change: terrain roughness for locomotion, object size and friction for grasping, lighting conditions for visual policies. This is closely related to domain randomization but with structured progression instead of uniform sampling.
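As a concrete illustration of the self-paced variant, the helper below implements the classic threshold-based weighting in the style of Kumar et al. (2010); the pace parameter `lam` is raised over training to admit harder samples. The function name and values are assumptions for this sketch:

```python
import numpy as np

def self_paced_weights(losses, lam):
    """Classic self-paced weighting: keep only samples whose current loss
    is below the pace threshold lam; raising lam over training gradually
    admits harder samples into the objective."""
    losses = np.asarray(losses)
    return (losses < lam).astype(float)

# Early training: a small lam keeps only the easiest samples.
w_early = self_paced_weights([0.1, 0.5, 2.0, 5.0], lam=1.0)   # [1, 1, 0, 0]
# Later training: a larger lam includes nearly everything.
w_late = self_paced_weights([0.1, 0.5, 2.0, 5.0], lam=10.0)   # [1, 1, 1, 1]
```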

Robotics Applications

Locomotion: Legged robot locomotion policies (Unitree Go2, ANYmal) are typically trained with terrain curricula in simulation. Training begins on flat ground, progresses to gentle slopes, then stairs, rubble, and gaps. NVIDIA Isaac Lab and Legged Gym implement terrain curriculum as a standard feature. Without a curriculum, policies for extreme terrains often fail to converge.
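A terrain curriculum of this kind typically tracks a per-robot terrain level, promoting robots that traverse most of their terrain tile and demoting those that stall. The sketch below is modeled loosely on the open-source legged_gym logic; the distance thresholds are illustrative, not the library's actual values:

```python
import numpy as np

def update_terrain_levels(levels, distance_walked, terrain_length, max_level):
    """Per-robot terrain-level update: promote robots that crossed most of
    their tile, demote robots that covered little of it. Thresholds (0.8
    and 0.4 of the tile length) are illustrative choices."""
    levels = levels.copy()
    promote = distance_walked > 0.8 * terrain_length
    demote = distance_walked < 0.4 * terrain_length
    levels[promote] += 1
    levels[demote] -= 1
    return np.clip(levels, 0, max_level)
```

In a massively parallel simulator, this update runs once per episode reset across thousands of robots, so the population naturally spreads across terrain difficulties matched to each robot's competence.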

Manipulation: Dexterous hand manipulation (Rubik's cube, pen spinning) uses curricula that start with the object close to the goal state and progressively increase the initial randomization. This ensures early episodes are short enough for the agent to accidentally succeed and receive reward. OpenAI's famous Rubik's cube result relied heavily on automatic domain randomization with curriculum-like progression.
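One common way to implement this idea is a "reverse" curriculum over the initial state: episodes start with the object near the goal, and the randomization radius grows once the policy succeeds reliably. The class below is a hypothetical sketch; all names and constants are assumptions:

```python
import numpy as np

class InitStateCurriculum:
    """Grow the initial-state randomization radius as the policy succeeds."""

    def __init__(self, radius=0.01, max_radius=0.30, grow=1.5,
                 threshold=0.8, window=200):
        self.radius = radius          # current randomization radius (m)
        self.max_radius = max_radius  # target difficulty endpoint
        self.grow = grow              # multiplicative growth factor
        self.threshold = threshold    # success rate required to grow
        self.window = window          # episodes per evaluation window
        self.successes = []

    def sample_offset(self, rng):
        """Random initial-pose offset within the current radius."""
        direction = rng.normal(size=3)
        direction /= np.linalg.norm(direction)
        return direction * rng.uniform(0.0, self.radius)

    def report(self, success):
        """Record an episode outcome; widen the radius when warranted."""
        self.successes.append(float(success))
        self.successes = self.successes[-self.window:]
        if (len(self.successes) == self.window
                and np.mean(self.successes) >= self.threshold):
            self.radius = min(self.radius * self.grow, self.max_radius)
            self.successes = []
```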

Imitation learning: Curricula can also improve imitation learning by presenting demonstrations in order of trajectory length or complexity. Starting with short, simple demonstrations helps the policy learn basic motions before tackling longer, multi-stage tasks.

Comparison with Alternatives

Curriculum learning vs. reward shaping: Reward shaping modifies the reward function to provide intermediate signals (e.g., distance to goal). Curriculum learning keeps the reward function unchanged but controls task difficulty. Reward shaping risks introducing reward hacking; curriculum learning avoids this but requires a meaningful difficulty parameterization.

Curriculum learning vs. hindsight experience replay (HER): HER relabels failed episodes with achieved goals, providing reward signal even when the task was not completed. This is an alternative to curriculum for sparse-reward settings, but it only works with goal-conditioned policies and does not scale well to high-dimensional goal spaces.

Curriculum learning vs. uniform domain randomization: Domain randomization samples all conditions with equal probability from the start. This is simpler but wastes training time on conditions far beyond the agent's current ability. Seen this way, an environment curriculum is domain randomization with a schedule.

Practical Requirements

Simulation: Curriculum learning is almost always implemented in simulation, where environment parameters can be adjusted programmatically. NVIDIA Isaac Lab, MuJoCo, and PyBullet all support parameterized environments. Real-world curricula are harder to implement but possible (e.g., physically swapping objects between training sessions).

Compute: Curriculum adds negligible overhead to training — the cost is in the RL training itself. A typical locomotion curriculum with PPO trains in 4–24 hours on a single RTX 4090 using Isaac Lab. Manipulation curricula may take 1–3 days depending on task complexity.

Design effort: The main engineering cost is defining the difficulty parameterization and progression schedule. For well-studied domains (locomotion, grasping), standard curricula exist in open-source codebases. For novel tasks, expect to iterate on the curriculum design as part of the training pipeline development.

Automatic Domain Randomization (ADR)

Automatic Domain Randomization, introduced by OpenAI in their Rubik's Cube work (2019), is a curriculum over domain randomization parameters. Instead of manually setting randomization ranges, ADR starts with narrow ranges and automatically widens them as the policy demonstrates competence. The criterion is simple: if the policy's success rate at the boundary of the current randomization range exceeds a threshold (e.g., 80%), the range is expanded by a fixed increment; if it falls below a lower threshold, the range is narrowed instead.
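A much-simplified, single-parameter sketch of this expand/contract logic follows. The real ADR implementation evaluates at both range boundaries and tracks many parameters at once; the class name, thresholds, and step size here are illustrative:

```python
class ADRParam:
    """Simplified ADR for one randomization parameter's upper boundary."""

    def __init__(self, low, high, step=0.05,
                 expand_at=0.8, contract_at=0.2, window=50):
        self.low, self.high = low, high   # current randomization range
        self.step = step                  # fixed expansion/contraction increment
        self.expand_at = expand_at        # success rate that triggers expansion
        self.contract_at = contract_at    # success rate that triggers contraction
        self.window = window              # boundary episodes per decision
        self.boundary_results = []

    def report_boundary_episode(self, success):
        """Record an evaluation episode run at the upper boundary value."""
        self.boundary_results.append(float(success))
        if len(self.boundary_results) < self.window:
            return
        rate = sum(self.boundary_results) / self.window
        self.boundary_results = []
        if rate >= self.expand_at:
            self.high += self.step                            # policy copes: widen
        elif rate <= self.contract_at:
            self.high = max(self.low, self.high - self.step)  # too hard: narrow
```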

ADR unifies curriculum learning and domain randomization into a single framework. The "difficulty" is the width of the randomization distribution. Narrow ranges are easy (conditions are predictable); wide ranges are hard (conditions vary enormously). By progressively widening, ADR ensures the policy is always training at the frontier of its ability — the optimal curriculum regime.

Implementing ADR requires tracking per-parameter success rates, which means running separate evaluation episodes for different randomization dimensions. This adds computational overhead but produces policies that are robust to a precisely characterized set of conditions, with clear documentation of what variation the policy can handle.

Designing Curricula for Your Task

A practical guide to curriculum design for robotics practitioners:

  • Identify the difficulty axis: What makes your task hard? For grasping, it might be object size or surface friction. For locomotion, terrain roughness. For insertion, tolerance clearance. The curriculum varies this specific axis.
  • Define the easy endpoint: Configure the environment so the agent can succeed within the first 100 episodes. If the agent never succeeds, the curriculum cannot bootstrap learning. For grasping: start with large, textured objects in known positions. For insertion: start with loose tolerances (5 mm clearance).
  • Define the target endpoint: The conditions under which the deployed policy must succeed. For grasping: small, slippery objects in cluttered scenes. For insertion: tight tolerances (0.1 mm clearance). The curriculum must reach this difficulty level.
  • Choose a progression schedule: Performance-based thresholds (advance when success rate > 80%) are more robust than fixed epoch schedules. Track success rate with exponential moving average over the last 100–1000 episodes.
  • Monitor for regression: When difficulty increases, success rate should drop temporarily, then recover. If it drops permanently, the difficulty step was too large. Add intermediate levels or reduce the step size.

See Also

  • RL Environment Service — Pre-built curricula for locomotion and manipulation in Isaac Lab
  • Data Services — Expert guidance on curriculum design for your task
  • Robot Leasing — Access Unitree G1 for sim-to-real validation of curriculum-trained policies

Key Papers

  • Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). "Curriculum Learning." ICML 2009. The foundational paper that formalized curriculum learning for neural networks and demonstrated its benefits on vision and language tasks.
  • Dennis, M. et al. (2020). "Emergent Complexity and Zero-Shot Transfer via Unsupervised Environment Design." NeurIPS 2020. Introduced PAIRED, an adversarial teacher-student framework for automatic curriculum generation.
  • Rudin, N. et al. (2022). "Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning." CoRL 2022. Demonstrated terrain curriculum learning for legged locomotion in Isaac Gym, training policies in under 20 minutes.

Related Terms

  • Reinforcement Learning — The primary learning paradigm where curriculum learning is applied
  • Reward Shaping — Complementary technique for providing intermediate reward signals
  • Domain Randomization — Training across varied conditions; curriculum adds structured progression
  • Sim-to-Real Transfer — Curriculum-trained policies are typically transferred from simulation to real hardware
  • Policy Learning — The broader paradigm of training robot controllers

Design Your Curriculum at SVRC

Silicon Valley Robotics Center provides GPU-accelerated simulation environments with Isaac Lab and MuJoCo, pre-built terrain and manipulation curricula, and expert guidance on designing custom training schedules for your specific robot and task. Our RL environment service includes curriculum design as part of the policy training pipeline.

Explore Data Services   Contact Us