Troubleshooting Guide#

This guide helps you diagnose and resolve common issues when working with LeggedGym-Ex. Each section covers one category of problem and lists the symptom, typical error messages, the root cause, and a solution.

Installation Issues#

CUDA Version Mismatch#

Problem: Training fails with CUDA-related errors, or PyTorch cannot detect the GPU.

Error Messages:

RuntimeError: CUDA out of memory. Tried to allocate X.XX MiB
AssertionError: Torch not compiled with CUDA enabled
NVIDIA driver version is incompatible with CUDA version

Root Cause: The CUDA version that PyTorch was built against must be compatible with your system's NVIDIA driver and CUDA toolkit.

Solution:

  1. Check your NVIDIA driver version:

nvidia-smi
  2. Verify the CUDA version PyTorch was built with:

import torch
print(torch.version.cuda)
print(torch.cuda.is_available())
  3. Install a compatible PyTorch version:

For IsaacGym (Python 3.8, CUDA 12.1):

pip install torch==2.4.1 torchvision==0.19.1 --index-url https://download.pytorch.org/whl/cu121

For Genesis (Python 3.10, CUDA 12.6):

pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu126

For IsaacLab (Python 3.11, CUDA 12.6):

# IsaacSim includes PyTorch, verify after installation
python -c "import torch; print(torch.cuda.is_available())"

Prevention: Always check driver compatibility before creating conda environments. Minimum driver version: 570.
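The driver check above can be scripted. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH and using the minimum driver version of 570 stated in this guide; the helper names (`driver_major`, `meets_minimum`, `query_driver_version`) are illustrative and not part of LeggedGym-Ex.

```python
import shutil
import subprocess
from typing import Optional

MIN_DRIVER_MAJOR = 570  # minimum driver version stated in this guide


def driver_major(version: str) -> int:
    """Return the major component of a driver string such as '570.86.15'."""
    return int(version.strip().split(".")[0])


def meets_minimum(version: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """True if the driver's major version is at least the required minimum."""
    return driver_major(version) >= minimum


def query_driver_version() -> Optional[str]:
    """Ask nvidia-smi for the driver version; return None if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0 or not result.stdout.strip():
        return None
    # nvidia-smi prints one line per GPU; all GPUs share one driver.
    return result.stdout.strip().splitlines()[0]


version = query_driver_version()
if version is None:
    print("nvidia-smi not found or no GPU visible")
else:
    print(f"driver {version}: meets minimum = {meets_minimum(version)}")
```

Running this before creating a conda environment gives an early warning if the driver is older than the stated minimum.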

IsaacGym Installation Errors#

Problem: IsaacGym Preview 4 fails to install or import.

Error Messages:

ModuleNotFoundError: No module named 'isaacgym'
ImportError: libpython3.8.so.1.0: cannot open shared object file
OSError: Cannot load IsaacGym library

Root Cause: IsaacGym requires a specific Python version (3.8) and correctly configured library paths.

Solution:

  1. Download IsaacGym Preview 4 from the NVIDIA Developer website.

  2. Create a Python 3.8 environment:

conda create -n lr_gym python=3.8
conda activate lr_gym
  3. Install IsaacGym:

cd isaacgym/python
pip install -e .
  4. Verify the installation:

python -c "from isaacgym import gymapi; print('IsaacGym installed successfully')"
  5. If library errors persist, add the following to ~/.bashrc:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib

Alternative: Use conda to manage library paths:

conda install -c conda-forge gcc_linux-64 gxx_linux-64

Conda Environment Conflicts#

Problem: Package conflicts or the wrong Python version in a conda environment.

Error Messages:

UnsatisfiableError: The following specifications were found to be incompatible
ERROR: pip's dependency resolver does not currently take into account all the packages
AssertionError: Python version mismatch

Root Cause: Mixing pip and conda installations, or creating an environment with the wrong Python version.

Solution:

  1. Start clean by removing the conflicting environment:

conda deactivate
conda env remove -n problematic_env
  2. Create a fresh environment with the correct Python version:

# IsaacGym requires Python 3.8
conda create -n lr_gym python=3.8 -y

# Genesis requires Python 3.10
conda create -n lr_gen python=3.10 -y

# IsaacLab requires Python 3.11
conda create -n lr_lab python=3.11 -y
  3. Install packages in the correct order:

# Install PyTorch first
pip install torch torchvision --index-url <appropriate_url>

# Then install LeggedGym-Ex
pip install -e ".[isaacgym]"  # or [genesis], [isaaclab]

Best Practice: Never mix conda install and pip install for core packages. Use pip exclusively for PyTorch and project dependencies.
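A quick way to catch a wrong-Python environment before installing anything is to compare the running interpreter against the versions listed above. This is a minimal sketch; the `REQUIRED_PYTHON` table simply restates the versions from this guide, and `python_matches` is a hypothetical helper, not a LeggedGym-Ex function.

```python
import sys

# Python versions listed in this guide for each simulator backend.
REQUIRED_PYTHON = {
    "isaacgym": (3, 8),
    "genesis": (3, 10),
    "isaaclab": (3, 11),
}


def python_matches(simulator: str, version=None) -> bool:
    """True if the (major, minor) of `version` matches the simulator's requirement.

    `version` defaults to the running interpreter's sys.version_info.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return major_minor == REQUIRED_PYTHON[simulator]


for name, required in REQUIRED_PYTHON.items():
    status = "ok" if python_matches(name) else "mismatch"
    print(f"{name}: needs Python {required[0]}.{required[1]} -> {status}")
```

Run it inside the activated environment; exactly one backend should report "ok" if the environment was created for one of the three simulators.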

Training Issues#

NaN/Inf Loss Values#

Problem: The loss becomes NaN or Inf and training fails.

Error Messages:

RuntimeError: Function 'MseLossBackward0' returned nan values in its 0th output
ValueError: Detected inf or nan values in loss

Root Cause: There are several possible causes, including a learning rate that is too high, badly scaled rewards, and numerical instability.

Diagnostic Commands:

# Run inside the training loop, where `env` and `actions` are defined
import torch

# Check for NaN or Inf in observations
if not torch.isfinite(env.obs_buf).all():
    print("Non-finite values detected in observations")

# Check for NaN or Inf in actions
if not torch.isfinite(actions).all():
    print("Non-finite values detected in actions")

# Check reward values
print(f"Rewards - min: {env.rew_buf.min()}, max: {env.rew_buf.max()}, mean: {env.rew_buf.mean()}")

Solutions:

  1. Reduce learning rate in config:

class LeggedRobotCfgPPO:
    class algorithm:
        learning_rate = 1e-4  # Reduce from 1e-3
  2. Check reward scales - ensure no extreme values:

class rewards:
    class scales:
        tracking_lin_vel = 1.0  # Typical values are 0.1 to 2.0
        # Avoid extremely large scales like 100.0
  3. Enable gradient clipping:

class algorithm:
    max_grad_norm = 1.0  # Add gradient clipping
  4. Check action bounds in config:

class control:
    action_scale = 0.25  # Reduce if actions are too large
    clip_actions = 100.0  # Clip extreme actions
  5. Validate observation normalization:

# Add normalization debugging
python -m legged_gym.scripts.train --task go2 --debug
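To see what the `max_grad_norm` setting does, the sketch below applies PyTorch's `torch.nn.utils.clip_grad_norm_` to a tiny stand-in model. It assumes the PPO implementation applies `max_grad_norm` through this call, as rsl_rl-style PPO code commonly does; the model and the inflated loss are purely illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 2)  # stand-in for a policy network

# Inflate the loss so the gradients are deliberately large.
inputs = torch.randn(8, 4)
loss = (model(inputs) * 1e3).pow(2).mean()
loss.backward()

max_grad_norm = 1.0
# clip_grad_norm_ rescales gradients in place and returns the total norm
# measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(f"grad norm before: {norm_before.item():.1f}, after: {norm_after.item():.4f}")
```

Clipping bounds the size of each update, so one batch with extreme rewards cannot push the weights into a region that produces NaN.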

Out of Memory (OOM) Errors#

Problem: Training crashes with GPU or CPU out of memory errors.

Error Messages:

RuntimeError: CUDA out of memory. Tried to allocate X.XX MiB
torch.OutOfMemoryError: CUDA out of memory
MemoryError: Unable to allocate array

Diagnostic Commands:

# Check GPU memory usage
nvidia-smi

# Monitor during training
watch -n 1 nvidia-smi

Solutions:

  1. Reduce number of parallel environments:

python -m legged_gym.scripts.train --task go2 --num_envs 2048  # Default is 4096
  2. Reduce the mini-batch size in the config:

class LeggedRobotCfgPPO:
    class algorithm:
        num_mini_batches = 8  # Increase to reduce batch size
  3. Enable gradient checkpointing (if available):

class algorithm:
    use_gradient_checkpointing = True
  4. Clear cache between iterations:

import torch
torch.cuda.empty_cache()  # Add in training loop; releases cached, unused blocks but does not free live tensors
  5. Use mixed precision training:

class algorithm:
    use_amp = True  # Automatic Mixed Precision

Memory Estimation:

  • Each environment: ~10-50 MB depending on robot complexity

  • For 10GB GPU: max ~2000 environments with safety margin

  • For 24GB GPU: max ~4000 environments
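The effect of `--num_envs` and `num_mini_batches` on the size of each PPO mini-batch can be worked out directly. This sketch assumes the rsl_rl convention, where one rollout holds `num_envs * num_steps_per_env` transitions split evenly into `num_mini_batches`; the value `num_steps_per_env = 24` is an assumed example and may differ in your config.

```python
def mini_batch_size(num_envs: int, num_steps_per_env: int, num_mini_batches: int) -> int:
    """Transitions per PPO mini-batch under the rsl_rl-style convention."""
    return (num_envs * num_steps_per_env) // num_mini_batches


NUM_STEPS_PER_ENV = 24  # assumed example value, check your PPO config

for num_envs, num_mini_batches in [(4096, 4), (4096, 8), (2048, 8)]:
    size = mini_batch_size(num_envs, NUM_STEPS_PER_ENV, num_mini_batches)
    print(f"num_envs={num_envs}, num_mini_batches={num_mini_batches} -> {size} transitions")
```

Doubling `num_mini_batches` or halving `num_envs` each halve the mini-batch, which lowers peak memory during the learning step. Only reducing `num_envs` also shrinks the simulator's own memory footprint.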

Training Not Progressing#

Problem: Training runs but rewards/losses don’t improve over iterations.

Error Messages: No explicit error, but learning curves are flat.

Diagnostic Commands:

# Check tensorboard logs
tensorboard --logdir logs/

# Monitor reward components
python -m legged_gym.scripts.train --task go2 --debug

Root Causes and Solutions:

  1. Reward function issues: Check that rewards are being computed correctly.

# In robot config, enable reward logging
class rewards:
    class scales:
        tracking_lin_vel = 1.0  # Ensure weight is non-zero
  2. Observation issues: Verify observations contain useful information.

# Add observation debugging
class env:
    debug_observations = True
  3. Learning rate too low: Increase learning rate.

class algorithm:
    learning_rate = 3e-4  # Try higher if stuck
  4. Insufficient exploration: Increase action noise or entropy coefficient.

class algorithm:
    entropy_coef = 0.01  # Increase for more exploration
  5. Check termination conditions: Overly strict terminations prevent learning.

class terminations:
    termination_if_close_to_ground = 0.3  # Adjust threshold

Slow Training Speed#

Problem: Training is significantly slower than expected.

Expected Performance:

  • IsaacGym: 20,000-50,000 FPS with 4096 environments

  • Genesis: 10,000-30,000 FPS with 4096 environments

  • IsaacLab: 5,000-15,000 FPS with 4096 environments

Diagnostic Commands:

# Check FPS during training
python -m legged_gym.scripts.train --task go2 --headless

# Monitor GPU utilization
nvidia-smi dmon -s u

Solutions:

  1. Enable headless mode:

python -m legged_gym.scripts.train --task go2 --headless
  2. Optimize terrain generation:

class terrain:
    mesh_type = 'plane'  # Fastest for initial testing
    # mesh_type = 'trimesh'  # Slower but more realistic
  3. Reduce observation size:

class env:
    num_observations = 48  # Minimize for speed
    num_privileged_obs = None  # Disable if not using Teacher-Student (TS) training
  4. Adjust simulation frequency:

class sim:
    dt = 0.02  # Time step in seconds (0.02 s = 50 Hz)
    substeps = 1  # Reduce for speed (default: 4)
    # Warning: may affect stability
  5. Use simpler robot models:

class asset:
    self_collisions = 1  # In upstream legged_gym, 1 disables self-collision checking and 0 enables it (bitwise filter); verify for your version
    fix_base_link = False  # Keep False for locomotion
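To compare your run against the expected FPS ranges above, it helps to know how the figure is derived. The sketch below assumes the rsl_rl-style definition, where FPS is the number of environment steps collected per iteration divided by the total iteration time (rollout collection plus learning); the timings and `num_steps_per_env = 24` are made-up example values.

```python
def training_fps(num_envs: int, num_steps_per_env: int,
                 collection_time_s: float, learning_time_s: float) -> int:
    """Environment steps per second for one training iteration."""
    steps = num_envs * num_steps_per_env
    return int(steps / (collection_time_s + learning_time_s))


# Example: 4096 envs, 24 steps each, 2.0 s collecting and 0.5 s learning.
fps = training_fps(4096, 24, collection_time_s=2.0, learning_time_s=0.5)
print(f"{fps} steps/s")
```

If collection time dominates, simulation settings such as terrain type and substeps are the place to look. If learning time dominates, the network size and mini-batch settings matter more.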

Inference Issues#

Model Not Found#

Problem: Cannot locate trained model checkpoint for inference.

Error Messages:

FileNotFoundError: [Errno 2] No such file or directory: 'logs/...'
RuntimeError: Cannot load model from path

Root Cause: Wrong path or model not saved.

Solution:

  1. List available models:

ls -R logs/
  2. Check experiment directory structure:

logs/
└── <experiment_name>/
    └── <datetime>/
        ├── config.json
        ├── model_<iteration>.pt
        └── model_latest.pt
  3. Specify the correct path:

# Using experiment name (finds latest)
python -m legged_gym.scripts.play --task go2 --resume

# Using specific run
python -m legged_gym.scripts.play --task go2 --load_run <datetime>

# Using exact path
python -m legged_gym.scripts.play --task go2 --load_run logs/go2/20250403_123456
  4. Verify the model exists:

import torch
checkpoint = torch.load('logs/go2/.../model_1000.pt')
print(checkpoint.keys())  # Should contain 'model_state_dict', 'optimizer_state_dict'
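Locating the newest checkpoint by hand is error-prone, so a small helper can walk the directory layout shown above. This is a sketch under two assumptions: run directories are named by datetime, so that sorting by name is chronological, and numbered checkpoints follow the `model_<iteration>.pt` pattern. `latest_checkpoint` is a hypothetical helper, not part of LeggedGym-Ex.

```python
import re
from pathlib import Path
from typing import Optional

CHECKPOINT_RE = re.compile(r"model_(\d+)\.pt")


def latest_checkpoint(log_root: Path, experiment: str) -> Optional[Path]:
    """Return the highest-iteration checkpoint in the most recent run, or None."""
    exp_dir = Path(log_root) / experiment
    if not exp_dir.is_dir():
        return None
    run_dirs = sorted((d for d in exp_dir.iterdir() if d.is_dir()), reverse=True)
    for run_dir in run_dirs:  # newest run first (assumes datetime-named dirs)
        numbered = []
        for path in run_dir.iterdir():
            match = CHECKPOINT_RE.fullmatch(path.name)
            if match:
                numbered.append((int(match.group(1)), path))
        if numbered:
            # Compare iterations numerically so model_1000 beats model_200.
            return max(numbered, key=lambda item: item[0])[1]
    return None


print(latest_checkpoint(Path("logs"), "go2"))
```

The function returns `None` instead of raising when nothing is found, which makes a missing or empty `logs/` directory easy to tell apart from a wrong run name.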

JIT Export Errors#

Problem: Cannot export model to TorchScript format for deployment.

Error Messages:

RuntimeError: Exporting the operator 'aten::grid_sampler_2d' is not supported
RuntimeError: Cannot extract guaranteed root tensor from output

Root Cause: Model contains operations not supported by TorchScript.

Solutions:

  1. Use correct export command:

python -m legged_gym.scripts.play --task go2 --export
  2. Check the exported model:

import torch
model = torch.jit.load('logs/go2/.../exported/policy.pt')
print(model.code)
  3. Handle observation normalization:

# Export includes normalization if configured
# Check config:
class normalizations:
    class observations:
        clip_observations = 100.0
  4. For Teacher-Student models, export student network:

python -m legged_gym.scripts.play --task go2_ts --export
# Exported model only uses student observations (no privileged info)
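Before suspecting the export script, it can help to confirm that TorchScript itself works in your environment and that a scripted network reproduces the original outputs. The sketch below scripts a small stand-in MLP, saves it to an in-memory buffer, reloads it, and compares outputs. The network shape (48 observations, 12 actions) is illustrative and is not the LeggedGym-Ex policy.

```python
import io

import torch
from torch import nn

torch.manual_seed(0)
# Stand-in for a policy network; layer sizes are illustrative only.
policy = nn.Sequential(nn.Linear(48, 64), nn.ELU(), nn.Linear(64, 12)).eval()

# Compile to TorchScript, then round-trip through save/load.
scripted = torch.jit.script(policy)
buffer = io.BytesIO()
torch.jit.save(scripted, buffer)
buffer.seek(0)
loaded = torch.jit.load(buffer)

obs = torch.randn(1, 48)
with torch.no_grad():
    max_diff = (policy(obs) - loaded(obs)).abs().max().item()

print(f"max abs difference after round trip: {max_diff:.2e}")
```

If this round trip succeeds but the project export fails, the problem is likely an unsupported operation in the actual policy module, not the PyTorch installation.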

Visualization Issues#

Problem: Cannot visualize training or inference, or viewer crashes.

Error Messages:

RuntimeError: Failed to create window
GLFW Error: X11: Failed to open display
Segmentation fault (core dumped)

Root Cause: Display or graphics driver issues.

Solutions:

  1. For headless servers, disable visualization:

python -m legged_gym.scripts.train --task go2 --headless
  2. Check display settings:

echo $DISPLAY
# Should output something like :0 or :1

# If empty, set it:
export DISPLAY=:0
  3. For remote servers with X forwarding:

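# On local machine (optional: `xhost +` disables X access control for every host,
# so avoid it on untrusted networks)
xhost +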

# SSH with X forwarding
ssh -X user@server

# Then run training
python -m legged_gym.scripts.train --task go2
  4. Use VirtualGL for remote rendering:

vglrun python -m legged_gym.scripts.train --task go2
  5. Record video instead of live viewer:

python -m legged_gym.scripts.play --task go2 --record
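Launch scripts that run on both workstations and headless servers can choose the `--headless` flag automatically from the `DISPLAY` variable. This is a minimal sketch; `should_run_headless` and `build_train_command` are hypothetical helpers, and only the command-line flags already shown in this guide are used.

```python
import os
from typing import List, Mapping, Optional


def should_run_headless(environ: Optional[Mapping[str, str]] = None) -> bool:
    """True when no X display is configured (DISPLAY unset or empty)."""
    env = os.environ if environ is None else environ
    return not env.get("DISPLAY", "").strip()


def build_train_command(task: str, environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build the training command, adding --headless when no display exists."""
    cmd = ["python", "-m", "legged_gym.scripts.train", "--task", task]
    if should_run_headless(environ):
        cmd.append("--headless")
    return cmd


print(" ".join(build_train_command("go2")))
```

The resulting list can be passed to `subprocess.run`. A set `DISPLAY` does not guarantee the display is reachable, so this only guards against the most common case.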

Multi-Simulator Issues#

IsaacGym Reset Bug#

Problem: After calling reset(), rigid body states are incorrect, causing abnormal terminations.

Error Messages:

# No explicit error, but unexpected behavior:
# - Robot appears in wrong position after reset
# - Reference motion tracking fails after reset
# - Termination triggers incorrectly

Root Cause: IsaacGym requires one simulation step after reset() to update rigid body states properly. This is a known bug in IsaacGym Preview 4.

Solution:

Add a simulator.forward() call after the reset:

# In reset_idx or post_physics_step method:
def reset_idx(self, env_ids):
    # ... reset logic ...
    
    # BUG FIX: Call forward() to update rigid body states
    if self.cfg.simulator == 'isaacgym':
        self.simulator.forward()
    
    # Now rigid body states are correct

Affected Methods:

  • _reset_dofs()

  • _reset_root_states()

  • _reset_dofs_from_reference_motion()

  • _reset_root_states_from_reference_motion()

Example from codebase (see g1_deepmimic.py:73):

# BUG: IsaacGym requires 1 step after resetting to get the correct rigid body states
# When enabling reference motion termination, the rigid body state does not update
# after this reset, which causes the termination abnormally.
# The dof state and root state is reset correctly, but the rigid body state is not updated

# Solution is already implemented in DeepMimic tasks
# Apply same pattern if you encounter this issue
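The failure mode is easier to see in isolation. The toy class below is not IsaacGym and not the LeggedGym-Ex simulator wrapper; it is a hypothetical stand-in in which link ("rigid body") state is derived from the root state only when `forward()` runs, which is the behavior described above.

```python
class ToySimulator:
    """Toy stand-in: rigid body state is refreshed from root state only in forward()."""

    def __init__(self) -> None:
        self.root_height = 0.1        # robot has fallen before the reset
        self.link_offset = 0.05
        self.rigid_body_height = 0.0
        self.forward()

    def reset_root(self, height: float) -> None:
        # Root state is written immediately, derived state is not.
        self.root_height = height

    def forward(self) -> None:
        # Recompute derived rigid body state from the current root state.
        self.rigid_body_height = self.root_height + self.link_offset


def should_terminate(sim: ToySimulator, min_height: float = 0.3) -> bool:
    """Terminate when the tracked body is below the height threshold."""
    return sim.rigid_body_height < min_height


sim = ToySimulator()
sim.reset_root(0.5)
stale = should_terminate(sim)   # uses pre-reset rigid body state
sim.forward()
fresh = should_terminate(sim)   # uses refreshed rigid body state

print(f"terminate before forward(): {stale}, after forward(): {fresh}")
```

Any termination or reward term that reads rigid body state right after a reset sees the stale values until the state is refreshed, which is why the fix belongs at the end of the reset logic.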

IsaacLab CPU Tensor Requirement#

Problem: Domain randomization functions fail with device errors in IsaacLab.

Error Messages:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
AssertionError: Domain randomization tensors must be on CPU for IsaacLab

Root Cause: IsaacLab’s backend requires domain randomization tensors to be on the CPU, unlike IsaacGym, which uses GPU tensors.

Solution:

Move tensors to CPU before calling randomization functions:

# Wrong (IsaacGym style):
self.simulator.set_material_properties(
    env_ids, 
    friction_tensors.cuda()  # Fails in IsaacLab
)

# Correct (IsaacLab compatible):
self.simulator.set_material_properties(
    env_ids,
    friction_tensors.cpu()  # Must be CPU tensors
)

Affected Functions:

  • set_material_properties()

  • set_masses()

  • set_coms() (center of mass)

  • set_friction()

Cross-simulator Code:

def _randomize_friction(self, env_ids):
    friction = torch.rand(len(env_ids), device='cpu')  # Create on CPU
    friction = friction * (self.cfg.domain_rand.friction_range[1] - 
                          self.cfg.domain_rand.friction_range[0]) + \
               self.cfg.domain_rand.friction_range[0]
    
    # Works for all simulators
    self.simulator.set_material_properties(
        env_ids.cpu() if hasattr(env_ids, 'cpu') else env_ids,
        friction
    )

Genesis XML Requirement#

Problem: Genesis simulator fails to load robot or environment.

Error Messages:

ValueError: XML file path must be provided for Genesis simulator
FileNotFoundError: Cannot find robot XML file
genesis.sim: Failed to load URDF from XML

Root Cause: Genesis requires XML configuration files for robot and scene setup, unlike IsaacGym, which uses a programmatic API.

Solution:

  1. Ensure XML file exists:

ls resources/robots/go2/urdf/go2.xml
  2. Check XML configuration:

class asset:
    file = '{LEGGED_GYM_ROOT_DIR}/resources/robots/go2/urdf/go2.xml'
  3. Verify path resolution:

# In robot config
import os
xml_path = os.path.join(
    os.path.dirname(__file__),
    '../../../resources/robots/go2/urdf/go2.xml'
)
assert os.path.exists(xml_path), f"XML not found: {xml_path}"
  4. Set SIMULATOR environment variable:

export SIMULATOR=genesis
python -m legged_gym.scripts.train --task go2
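Paths written with the `{LEGGED_GYM_ROOT_DIR}` placeholder have to be expanded before they can be checked on disk. The sketch below assumes the placeholder is filled in with `str.format`, as in upstream legged_gym; `resolve_asset_path` is a hypothetical helper, and the root directory is read from an environment variable of the same name purely for illustration.

```python
import os


def resolve_asset_path(template: str, root_dir: str) -> str:
    """Expand the {LEGGED_GYM_ROOT_DIR} placeholder and check that the file exists."""
    path = template.format(LEGGED_GYM_ROOT_DIR=root_dir)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Robot XML not found: {path}")
    return path


template = "{LEGGED_GYM_ROOT_DIR}/resources/robots/go2/urdf/go2.xml"
root = os.environ.get("LEGGED_GYM_ROOT_DIR", os.getcwd())  # illustrative default
try:
    print("resolved:", resolve_asset_path(template, root))
except FileNotFoundError as err:
    print(err)
```

Printing the fully expanded path usually makes the problem obvious, for example a repository root that points at the wrong checkout.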

Terrain Configuration Conflicts#

Problem: Terrain generation fails with conflicting options.

Error Messages:

ValueError: Curriculum and selected terrain cannot be both True.

Root Cause: Curriculum terrain and selected terrain are mutually exclusive modes, and the config enables both.

Solution:

Choose one terrain mode:

# Option 1: Curriculum terrain (difficulty increases over training)
class terrain:
    curriculum = True
    selected = False
    terrain_curriculum_difficulty = 0.5

# Option 2: Selected terrain (fixed terrain types)
class terrain:
    curriculum = False
    selected = True
    terrain_proportions = [0.2, 0.3, 0.5]  # Proportions for each terrain type

Code Reference (see terrain.py:63):

if cfg.curriculum and cfg.selected:
    raise ValueError("Curriculum and selected terrain cannot be both True.")

Heightfield Terrain Limitation#

Problem: Heightfield terrain does not work in IsaacLab.

Error Messages:

NotImplementedError: Heightfield terrain not implemented for IsaacLabSimulator
RuntimeError: Cannot create heightfield in IsaacLab

Root Cause: Heightfield terrain generation is not implemented for the IsaacLab backend.

Solution:

Use trimesh terrain instead:

class terrain:
    mesh_type = 'trimesh'  # Works for IsaacLab
    # mesh_type = 'heightfield'  # NOT supported in IsaacLab

Simulator Compatibility Matrix:

Terrain Type   IsaacGym   Genesis   IsaacLab
plane          ✓          ✓         ✓
heightfield    ✓          ✓         ✗
trimesh        ✓          ✓         ✓
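The compatibility matrix can be encoded as a lookup so that an unsupported combination fails early with a clear message, before the simulator starts. This is a sketch; `SUPPORTED_TERRAIN` restates the matrix above, and `check_mesh_type` is a hypothetical helper, not part of LeggedGym-Ex.

```python
# Terrain types supported by each simulator, per the compatibility matrix.
SUPPORTED_TERRAIN = {
    "isaacgym": {"plane", "heightfield", "trimesh"},
    "genesis": {"plane", "heightfield", "trimesh"},
    "isaaclab": {"plane", "trimesh"},
}


def check_mesh_type(simulator: str, mesh_type: str) -> None:
    """Raise ValueError if the simulator does not support the terrain type."""
    supported = SUPPORTED_TERRAIN[simulator]
    if mesh_type not in supported:
        options = ", ".join(sorted(supported))
        raise ValueError(
            f"mesh_type '{mesh_type}' is not supported on {simulator}; use one of: {options}"
        )


check_mesh_type("isaaclab", "trimesh")  # passes silently
try:
    check_mesh_type("isaaclab", "heightfield")
except ValueError as err:
    print(err)
```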

Debugging Commands Quick Reference#

Check Environment#

# Verify CUDA
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}, Version: {torch.version.cuda}')"

# Check simulator
echo $SIMULATOR

# List available tasks
python tests/test_all_tasks.py --list

# Check conda environment
conda list | grep -E "torch|genesis|isaac"

Debug Training#

# Run with debug output
python -m legged_gym.scripts.train --task go2 --debug

# Monitor GPU
watch -n 0.5 nvidia-smi

# Check logs
tail -f logs/go2/<experiment>/log.txt

# TensorBoard
tensorboard --logdir logs/

Validate Installation#

# Test specific task
python tests/test_all_tasks.py --tasks go2 --iterations 1

# Test all tasks
python tests/test_all_tasks.py

# Verify simulator backend
python -c "from legged_gym.simulator import get_simulator; print(get_simulator.__name__)"

Getting Help#

If you cannot resolve an issue using this guide:

  1. Check Documentation: Full documentation at https://leggedgym-ex-doc.readthedocs.io/

  2. Search Issues: Check existing GitHub issues for similar problems

  3. Debug Information: When asking for help, provide:

    • Simulator type (echo $SIMULATOR)

    • Python version (python --version)

    • PyTorch version (python -c "import torch; print(torch.__version__)")

    • CUDA version (nvcc --version for the installed toolkit, nvidia-smi for the highest CUDA version the driver supports)

    • Complete error message and stack trace

    • Minimal reproduction script

  4. Community: Join the Feishu group (see README) for discussions

  5. Bug Reports: Open a GitHub issue with:

    • Clear description

    • Steps to reproduce

    • Expected vs actual behavior

    • System information
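The debug information listed above can be gathered in one step and pasted into an issue. This is a minimal sketch using only the standard library plus an optional PyTorch import; `collect_debug_info` is a hypothetical helper, and the `SIMULATOR` environment variable is the one used elsewhere in this guide.

```python
import os
import platform


def collect_debug_info() -> dict:
    """Gather simulator, Python, platform, and PyTorch details for a bug report."""
    info = {
        "simulator": os.environ.get("SIMULATOR", "<unset>"),
        "python": platform.python_version(),
        "platform": platform.platform(),
        "torch": "<not installed>",
        "torch_cuda": "<unknown>",
    }
    try:
        import torch  # optional: report versions only if PyTorch is importable
        info["torch"] = torch.__version__
        info["torch_cuda"] = str(torch.version.cuda)
    except ImportError:
        pass
    return info


for key, value in collect_debug_info().items():
    print(f"{key}: {value}")
```

Include the printed output with the complete error message and stack trace when asking for help.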