Troubleshooting Guide#
This guide helps you diagnose and resolve common issues when working with LeggedGym-Ex. Each section covers one category of problems and lists typical error messages, root causes, and solutions.
Installation Issues#
CUDA Version Mismatch#
Problem: Training fails with CUDA-related errors, or PyTorch cannot detect the GPU.
Error Messages:
RuntimeError: CUDA out of memory. Tried to allocate X.XX MiB
AssertionError: Torch not compiled with CUDA enabled
NVIDIA driver version is incompatible with CUDA version
Root Cause: The CUDA version that PyTorch was built against must be compatible with your system’s NVIDIA driver and CUDA toolkit.
Solution:
Check your NVIDIA driver version:
nvidia-smi
Verify PyTorch CUDA version:
import torch
print(torch.version.cuda)
print(torch.cuda.is_available())
Install compatible PyTorch version:
For IsaacGym (Python 3.8, CUDA 12.1):
pip install torch==2.4.1 torchvision==0.19.1 --index-url https://download.pytorch.org/whl/cu121
For Genesis (Python 3.10, CUDA 12.6):
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu126
For IsaacLab (Python 3.11, CUDA 12.6):
# IsaacSim includes PyTorch, verify after installation
python -c "import torch; print(torch.cuda.is_available())"
Prevention: Always check driver compatibility before creating conda environments. Minimum driver version: 570.
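The driver check can also be scripted. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH and that only the major driver version matters for the minimum stated above; the helper names `parse_driver_version`, `query_driver_version`, and `meets_minimum` are hypothetical and not part of LeggedGym-Ex.

```python
import subprocess
from typing import Optional, Tuple


def parse_driver_version(text: str) -> Tuple[int, ...]:
    """Parse a driver version string such as '570.86.15' into a tuple of ints.

    With several GPUs, nvidia-smi prints one line per GPU; the first is used.
    """
    first_line = text.strip().splitlines()[0]
    return tuple(int(part) for part in first_line.strip().split("."))


def query_driver_version() -> Optional[Tuple[int, ...]]:
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    if not result.stdout.strip():
        return None
    return parse_driver_version(result.stdout)


def meets_minimum(version: Tuple[int, ...], minimum_major: int = 570) -> bool:
    """Compare the major driver version against the minimum from this guide."""
    return version[0] >= minimum_major
```

Call `query_driver_version()` before creating an environment; a `None` result means `nvidia-smi` is missing or failed, which usually points to a driver installation problem rather than a PyTorch one.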
IsaacGym Installation Errors#
Problem: IsaacGym Preview 4 fails to install or import.
Error Messages:
ModuleNotFoundError: No module named 'isaacgym'
ImportError: libpython3.8.so.1.0: cannot open shared object file
OSError: Cannot load IsaacGym library
Root Cause: IsaacGym requires a specific Python version (3.8 in this guide) and correctly configured library paths.
Solution:
Download IsaacGym Preview 4 from NVIDIA Developer website.
Create Python 3.8 environment:
conda create -n lr_gym python=3.8
conda activate lr_gym
Install IsaacGym:
cd isaacgym/python
pip install -e .
Verify installation:
python -c "from isaacgym import gymapi; print('IsaacGym installed successfully')"
If library errors persist, add to ~/.bashrc:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib
Alternative: Use conda to manage library paths:
conda install -c conda-forge gcc_linux-64 gxx_linux-64
Conda Environment Conflicts#
Problem: Package conflicts or wrong Python version in conda environment.
Error Messages:
UnsatisfiableError: The following specifications were found to be incompatible
ERROR: pip's dependency resolver does not currently take into account all the packages
AssertionError: Python version mismatch
Root Cause: Mixing pip and conda installations, or creating environments with wrong Python versions.
Solution:
Clean start - remove conflicting environment:
conda deactivate
conda env remove -n problematic_env
Create fresh environment with correct Python version:
# IsaacGym requires Python 3.8
conda create -n lr_gym python=3.8 -y
# Genesis requires Python 3.10
conda create -n lr_gen python=3.10 -y
# IsaacLab requires Python 3.11
conda create -n lr_lab python=3.11 -y
Install packages in correct order:
# Install PyTorch first
pip install torch torchvision --index-url <appropriate_url>
# Then install LeggedGym-Ex
pip install -e ".[isaacgym]" # or [genesis], [isaaclab]
Best Practice: Never mix conda install and pip install for core packages. Use pip exclusively for PyTorch and project dependencies.
Training Issues#
NaN/Inf Loss Values#
Problem: Training produces NaN or Inf loss values, causing training to fail.
Error Messages:
RuntimeError: Function 'MseLossBackward0' returned nan values in its 0th output
ValueError: Detected inf or nan values in loss
Root Cause: Multiple possible causes including learning rate too high, reward scaling issues, or numerical instability.
Diagnostic Commands:
# Check for NaN in observations
import torch
if torch.isnan(env.obs_buf).any():
    print("NaN detected in observations")
# Check for NaN in actions
if torch.isnan(actions).any():
    print("NaN detected in actions")
# Check reward values
print(f"Rewards - min: {env.rew_buf.min()}, max: {env.rew_buf.max()}, mean: {env.rew_buf.mean()}")
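When a buffer does contain bad values, it helps to know how many there are and what the finite range looks like. The sketch below is a standard-library helper (the name `summarize_values` is hypothetical) that works on any flat sequence of floats, for example `env.rew_buf.flatten().tolist()`.

```python
import math
from typing import Dict, Iterable


def summarize_values(values: Iterable[float]) -> Dict[str, float]:
    """Count NaN and Inf entries and report the range of the finite entries."""
    nan_count = 0
    inf_count = 0
    finite = []
    for value in values:
        if math.isnan(value):
            nan_count += 1
        elif math.isinf(value):
            inf_count += 1
        else:
            finite.append(value)
    return {
        "nan": nan_count,
        "inf": inf_count,
        # min/max are NaN when no finite value is present
        "min": min(finite) if finite else math.nan,
        "max": max(finite) if finite else math.nan,
    }
```

A non-zero `inf` count with a sane finite range often points at a single exploding term (for example a division by a near-zero value), whereas a very wide finite range suggests a reward scale problem.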
Solutions:
Reduce learning rate in config:
class LeggedRobotCfgPPO:
    class algorithm:
        learning_rate = 1e-4 # Reduce from 1e-3
Check reward scales - ensure no extreme values:
class rewards:
    class scales:
        tracking_lin_vel = 1.0 # Typical values are 0.1 to 2.0
        # Avoid extremely large scales like 100.0
Enable gradient clipping:
class algorithm:
    max_grad_norm = 1.0 # Add gradient clipping
Check action bounds in config:
class control:
    action_scale = 0.25 # Reduce if actions are too large
    clip_actions = 100.0 # Clip extreme actions
Validate observation normalization:
# Add normalization debugging
python -m legged_gym.scripts.train --task go2 --debug
Out of Memory (OOM) Errors#
Problem: Training crashes with GPU or CPU out of memory errors.
Error Messages:
RuntimeError: CUDA out of memory. Tried to allocate X.XX MiB
torch.OutOfMemoryError: CUDA out of memory
MemoryError: Unable to allocate array
Diagnostic Commands:
# Check GPU memory usage
nvidia-smi
# Monitor during training
watch -n 1 nvidia-smi
Solutions:
Reduce number of parallel environments:
python -m legged_gym.scripts.train --task go2 --num_envs 2048 # Default is 4096
Reduce mini-batch size in config:
class LeggedRobotCfgPPO:
    class algorithm:
        num_mini_batches = 8 # Increase to reduce the size of each mini-batch
Enable gradient checkpointing (if available):
class algorithm:
    use_gradient_checkpointing = True
Clear cache between iterations:
import torch
torch.cuda.empty_cache() # Add in training loop
Use mixed precision training:
class algorithm:
    use_amp = True # Automatic Mixed Precision
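The first two remedies both shrink the tensors processed per optimizer step. As a rough sketch, assuming the PPO rollout batch is `num_envs * num_steps_per_env` split evenly into `num_mini_batches` (the convention in rsl_rl-style runners; `num_steps_per_env` and the value 24 used below are assumptions, not taken from this guide):

```python
def mini_batch_size(num_envs: int, num_steps_per_env: int, num_mini_batches: int) -> int:
    """Samples per optimizer step for a PPO rollout split into equal mini-batches."""
    if num_mini_batches <= 0:
        raise ValueError("num_mini_batches must be positive")
    return (num_envs * num_steps_per_env) // num_mini_batches


# Hypothetical settings: halving the environments and doubling the
# mini-batch count each halve the per-step sample count.
baseline = mini_batch_size(4096, 24, 4)
fewer_envs = mini_batch_size(2048, 24, 4)
more_batches = mini_batch_size(4096, 24, 8)
print(baseline, fewer_envs, more_batches)
```

Note that reducing `num_envs` also lowers simulator memory, while raising `num_mini_batches` only affects the learning step.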
Memory Estimation:
Each environment: ~10-50 MB depending on robot complexity
For 10GB GPU: max ~2000 environments with safety margin
For 24GB GPU: max ~4000 environments
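Because the per-environment footprint varies with the robot and simulator, it is more reliable to measure it than to rely on fixed figures. The sketch below assumes memory grows linearly with the number of environments and fits that line from two `nvidia-smi` readings; the function names and all numbers in the example are hypothetical.

```python
from typing import Tuple


def fit_memory_model(
    small: Tuple[int, float], large: Tuple[int, float]
) -> Tuple[float, float]:
    """Fit memory_mb = overhead_mb + per_env_mb * num_envs from two measurements.

    Each measurement is (num_envs, used_memory_mb).
    """
    (n1, m1), (n2, m2) = small, large
    if n2 == n1:
        raise ValueError("measurements must use different environment counts")
    per_env_mb = (m2 - m1) / (n2 - n1)
    overhead_mb = m1 - per_env_mb * n1
    return per_env_mb, overhead_mb


def estimate_max_envs(
    total_gpu_mb: float,
    per_env_mb: float,
    overhead_mb: float = 0.0,
    safety_margin: float = 0.2,
) -> int:
    """Largest environment count that fits after reserving a safety margin."""
    if per_env_mb <= 0:
        raise ValueError("per_env_mb must be positive")
    usable_mb = total_gpu_mb * (1.0 - safety_margin) - overhead_mb
    return max(0, int(usable_mb // per_env_mb))


# Hypothetical readings: 3600 MB at 1024 envs, 6160 MB at 2048 envs.
per_env, overhead = fit_memory_model((1024, 3600.0), (2048, 6160.0))
print(per_env, overhead, estimate_max_envs(10240.0, per_env, overhead))
```

The linear model ignores allocator caching and PPO storage that scales with rollout length, so treat the result as a starting point and confirm it with `watch -n 1 nvidia-smi`.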
Training Not Progressing#
Problem: Training runs but rewards/losses don’t improve over iterations.
Error Messages: No explicit error, but learning curves are flat.
Diagnostic Commands:
# Check tensorboard logs
tensorboard --logdir logs/
# Monitor reward components
python -m legged_gym.scripts.train --task go2 --debug
Root Causes and Solutions:
Reward function issues: Check that rewards are being computed correctly.
# In robot config, enable reward logging
class rewards:
    class scales:
        tracking_lin_vel = 1.0 # Ensure weight is non-zero
Observation issues: Verify observations contain useful information.
# Add observation debugging
class env:
    debug_observations = True
Learning rate too low: Increase learning rate.
class algorithm:
    learning_rate = 3e-4 # Try higher if stuck
Insufficient exploration: Increase action noise or entropy coefficient.
class algorithm:
    entropy_coef = 0.01 # Increase for more exploration
Check termination conditions: Overly strict terminations prevent learning.
class terminations:
    termination_if_close_to_ground = 0.3 # Adjust threshold
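A flat learning curve can be detected programmatically before applying any of the fixes above. The following sketch compares the mean reward of the most recent window of iterations with the window before it; the function name, window size, and threshold are hypothetical and should be tuned per task.

```python
from statistics import mean
from typing import Sequence


def is_plateaued(
    rewards: Sequence[float], window: int = 50, min_improvement: float = 0.01
) -> bool:
    """Return True if the mean reward gained less than min_improvement
    between the previous window and the most recent window of iterations."""
    if len(rewards) < 2 * window:
        return False  # Not enough history to judge
    previous = mean(rewards[-2 * window:-window])
    recent = mean(rewards[-window:])
    return (recent - previous) < min_improvement
```

Feed it the mean episode reward logged once per iteration. Early in training rewards are often flat for a while, so only act on a plateau that persists across several windows.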
Slow Training Speed#
Problem: Training is significantly slower than expected.
Expected Performance:
IsaacGym: 20,000-50,000 FPS with 4096 environments
Genesis: 10,000-30,000 FPS with 4096 environments
IsaacLab: 5,000-15,000 FPS with 4096 environments
Diagnostic Commands:
# Check FPS during training
python -m legged_gym.scripts.train --task go2 --headless
# Monitor GPU utilization
nvidia-smi dmon -s u
Solutions:
Enable headless mode:
python -m legged_gym.scripts.train --task go2 --headless
Optimize terrain generation:
class terrain:
    mesh_type = 'plane' # Fastest for initial testing
    # mesh_type = 'trimesh' # Slower but more realistic
Reduce observation size:
class env:
    num_observations = 48 # Minimize for speed
    num_privileged_obs = None # Disable if not using Teacher-Student (TS) training
Adjust simulation frequency:
class sim:
    dt = 0.02 # Time step in seconds (0.02 s = 50 Hz)
    substeps = 1 # Reduce for speed (default: 4)
    # Warning: may affect stability
Use simpler robot models:
class asset:
    self_collisions = 1 # 1 disables self-collision checking, 0 enables it (bitwise filter, as in upstream legged_gym)
    fix_base_link = False # Keep False for locomotion
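To compare a run against the expected ranges listed above, throughput can be computed from the iteration time. This sketch assumes FPS means total environment steps per second across all parallel environments, and that the per-iteration step count (24 in the example) comes from your runner config; the names `EXPECTED_FPS`, `steps_per_second`, and `below_expected` are hypothetical.

```python
# Expected ranges from this guide, valid for 4096 environments.
EXPECTED_FPS = {
    "isaacgym": (20_000, 50_000),
    "genesis": (10_000, 30_000),
    "isaaclab": (5_000, 15_000),
}


def steps_per_second(num_envs: int, steps_per_iteration: int, iteration_seconds: float) -> float:
    """Total environment steps per second across all parallel environments."""
    if iteration_seconds <= 0:
        raise ValueError("iteration_seconds must be positive")
    return num_envs * steps_per_iteration / iteration_seconds


def below_expected(simulator: str, fps: float) -> bool:
    """True if fps falls under the lower bound of the expected range."""
    lower, _upper = EXPECTED_FPS[simulator]
    return fps < lower


fps = steps_per_second(num_envs=4096, steps_per_iteration=24, iteration_seconds=4.0)
print(fps, below_expected("isaacgym", fps))
```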
Inference Issues#
Model Not Found#
Problem: Cannot locate trained model checkpoint for inference.
Error Messages:
FileNotFoundError: [Errno 2] No such file or directory: 'logs/...'
RuntimeError: Cannot load model from path
Root Cause: Wrong path or model not saved.
Solution:
List available models:
ls -R logs/
Check experiment directory structure:
logs/
└── <experiment_name>/
    └── <datetime>/
        ├── config.json
        ├── model_<iteration>.pt
        └── model_latest.pt
Specify correct path:
# Using experiment name (finds latest)
python -m legged_gym.scripts.play --task go2 --resume
# Using specific run
python -m legged_gym.scripts.play --task go2 --load_run <datetime>
# Using exact path
python -m legged_gym.scripts.play --task go2 --load_run logs/go2/20250403_123456
Verify model exists:
import torch
# map_location='cpu' lets the check run on machines without a GPU
checkpoint = torch.load('logs/go2/.../model_1000.pt', map_location='cpu')
print(checkpoint.keys()) # Should contain 'model_state_dict', 'optimizer_state_dict'
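Locating the newest checkpoint by hand is error-prone when many runs exist. The sketch below walks the directory layout shown above, assuming run directory names sort chronologically (as a `<datetime>` name such as `20250403_123456` does) and that numbered checkpoints follow the `model_<iteration>.pt` pattern; `latest_checkpoint` is a hypothetical helper, not a LeggedGym-Ex function.

```python
import re
from pathlib import Path
from typing import Optional

CHECKPOINT_PATTERN = re.compile(r"model_(\d+)\.pt")


def latest_checkpoint(experiment_dir: Path) -> Optional[Path]:
    """Return the highest-iteration checkpoint from the newest run that has one."""
    if not experiment_dir.is_dir():
        return None
    run_dirs = sorted(p for p in experiment_dir.iterdir() if p.is_dir())
    for run_dir in reversed(run_dirs):  # Newest run first
        best_path = None
        best_iteration = -1
        for candidate in run_dir.glob("model_*.pt"):
            match = CHECKPOINT_PATTERN.fullmatch(candidate.name)
            # model_latest.pt does not match and is skipped
            if match and int(match.group(1)) > best_iteration:
                best_iteration = int(match.group(1))
                best_path = candidate
        if best_path is not None:
            return best_path
    return None
```

For example, `latest_checkpoint(Path("logs/go2"))` returns `None` when no run has saved a numbered checkpoint yet, which distinguishes "model not saved" from "wrong path".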
JIT Export Errors#
Problem: Cannot export model to TorchScript format for deployment.
Error Messages:
RuntimeError: Exporting the operator 'aten::grid_sampler_2d' is not supported
RuntimeError: Cannot extract guaranteed root tensor from output
Root Cause: Model contains operations not supported by TorchScript.
Solutions:
Use correct export command:
python -m legged_gym.scripts.play --task go2 --export
Check exported model:
import torch
model = torch.jit.load('logs/go2/.../exported/policy.pt')
print(model.code)
Handle observation normalization:
# Export includes normalization if configured
# Check config:
class normalizations:
    class observations:
        clip_observations = 100.0
For Teacher-Student models, export student network:
python -m legged_gym.scripts.play --task go2_ts --export
# Exported model only uses student observations (no privileged info)
Visualization Issues#
Problem: Cannot visualize training or inference, or viewer crashes.
Error Messages:
RuntimeError: Failed to create window
GLFW Error: X11: Failed to open display
Segmentation fault (core dumped)
Root Cause: Display or graphics driver issues.
Solutions:
For headless servers, disable visualization:
python -m legged_gym.scripts.train --task go2 --headless
Check display settings:
echo $DISPLAY
# Should output something like :0 or :1
# If empty, set it:
export DISPLAY=:0
For remote servers with X forwarding:
# On local machine (usually unnecessary with ssh -X; note that `xhost +` disables X access control for all hosts)
xhost +
# SSH with X forwarding
ssh -X user@server
# Then run training
python -m legged_gym.scripts.train --task go2
Use VirtualGL for remote rendering:
vglrun python -m legged_gym.scripts.train --task go2
Record video instead of live viewer:
python -m legged_gym.scripts.play --task go2 --record
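A launcher script can pick headless mode automatically when no display is available. This is a minimal sketch that only inspects the `DISPLAY` variable and uses the `--task` and `--headless` flags shown above; the helper names are hypothetical, and a set `DISPLAY` does not guarantee that a window can actually be created.

```python
import os
from typing import List, Mapping, Optional


def should_run_headless(environ: Optional[Mapping[str, str]] = None) -> bool:
    """True when DISPLAY is unset or empty, so no X11 window can be opened."""
    env = os.environ if environ is None else environ
    return not env.get("DISPLAY", "").strip()


def build_train_command(task: str, environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build the training command, adding --headless when no display is set."""
    command = ["python", "-m", "legged_gym.scripts.train", "--task", task]
    if should_run_headless(environ):
        command.append("--headless")
    return command
```

The returned list can be passed directly to `subprocess.run`.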
Multi-Simulator Issues#
IsaacGym Reset Bug#
Problem: After calling reset(), rigid body states are incorrect, causing abnormal terminations.
Error Messages:
# No explicit error, but unexpected behavior:
# - Robot appears in wrong position after reset
# - Reference motion tracking fails after reset
# - Termination triggers incorrectly
Root Cause: IsaacGym requires one simulation step after reset() to update rigid body states properly. This is a known bug in IsaacGym Preview 4.
Solution:
Add simulator.forward() call after reset:
# In reset_idx or post_physics_step method:
def reset_idx(self, env_ids):
    # ... reset logic ...
    # BUG FIX: Call forward() to update rigid body states
    if self.cfg.simulator == 'isaacgym':
        self.simulator.forward()
    # Now rigid body states are correct
Affected Methods:
_reset_dofs()
_reset_root_states()
_reset_dofs_from_reference_motion()
_reset_root_states_from_reference_motion()
Example from codebase (see g1_deepmimic.py:73):
# BUG: IsaacGym requires 1 step after resetting to get the correct rigid body states
# When enabling reference motion termination, the rigid body state does not update
# after this reset, which causes the termination abnormally.
# The dof state and root state is reset correctly, but the rigid body state is not updated
# Solution is already implemented in DeepMimic tasks
# Apply same pattern if you encounter this issue
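The failure mode is easier to see in isolation. The toy class below is a hypothetical stand-in, not the IsaacGym or LeggedGym-Ex API: it models a simulator whose derived rigid body state is refreshed only by `forward()`, so a reset leaves that state stale until `forward()` is called.

```python
class ToySimulator:
    """Hypothetical simulator: rigid body state is derived from root state
    and only refreshed inside forward()."""

    def __init__(self) -> None:
        self.root_height = 1.0
        self.rigid_body_height = 1.0  # Derived quantity

    def forward(self) -> None:
        """Recompute derived state from the root state."""
        self.rigid_body_height = self.root_height

    def step(self, delta: float) -> None:
        """Advance the root state, then refresh derived state."""
        self.root_height += delta
        self.forward()

    def reset(self, height: float = 1.0) -> None:
        """Reset the root state only; derived state is left stale."""
        self.root_height = height


def reset_and_refresh(sim: ToySimulator, height: float = 1.0) -> None:
    """The fix pattern: refresh derived state immediately after a reset."""
    sim.reset(height)
    sim.forward()


sim = ToySimulator()
sim.step(-0.7)                 # Robot falls
sim.reset()                    # Root state is reset...
stale = sim.rigid_body_height  # ...but the body state still reflects the fall
reset_and_refresh(sim)
print(stale, sim.rigid_body_height)
```

A termination check that reads the stale body height right after `reset()` would fire immediately, which matches the symptom described above.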
IsaacLab CPU Tensor Requirement#
Problem: Domain randomization functions fail with device errors in IsaacLab.
Error Messages:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
AssertionError: Domain randomization tensors must be on CPU for IsaacLab
Root Cause: IsaacLab’s backend requires domain randomization tensors to be on CPU, unlike IsaacGym which uses GPU tensors.
Solution:
Move tensors to CPU before calling randomization functions:
# Wrong (IsaacGym style):
self.simulator.set_material_properties(
    env_ids,
    friction_tensors.cuda() # Fails in IsaacLab
)
# Correct (IsaacLab compatible):
self.simulator.set_material_properties(
    env_ids,
    friction_tensors.cpu() # Must be CPU tensors
)
Affected Functions:
set_material_properties()
set_masses()
set_coms() (center of mass)
set_friction()
Cross-simulator Code:
def _randomize_friction(self, env_ids):
    low, high = self.cfg.domain_rand.friction_range
    # Create on CPU, sampled uniformly from [low, high)
    friction = torch.rand(len(env_ids), device='cpu') * (high - low) + low
    # Works for all simulators
    self.simulator.set_material_properties(
        env_ids.cpu() if hasattr(env_ids, 'cpu') else env_ids,
        friction
    )
Genesis XML Requirement#
Problem: Genesis simulator fails to load robot or environment.
Error Messages:
ValueError: XML file path must be provided for Genesis simulator
FileNotFoundError: Cannot find robot XML file
genesis.sim: Failed to load URDF from XML
Root Cause: Genesis requires XML configuration files for robot and scene setup, unlike IsaacGym which uses programmatic API.
Solution:
Ensure XML file exists:
ls resources/robots/go2/urdf/go2.xml
Check XML configuration:
class asset:
    file = '{LEGGED_GYM_ROOT_DIR}/resources/robots/go2/urdf/go2.xml'
Verify path resolution:
# In robot config
import os
xml_path = os.path.join(
    os.path.dirname(__file__),
    '../../../resources/robots/go2/urdf/go2.xml'
)
assert os.path.exists(xml_path), f"XML not found: {xml_path}"
Set SIMULATOR environment variable:
export SIMULATOR=genesis
python -m legged_gym.scripts.train --task go2
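The asset path in the config contains a `{LEGGED_GYM_ROOT_DIR}` placeholder, so a quick way to rule out path problems is to expand it and check the file before launching training. The sketch below assumes the placeholder is filled with `str.format`; `resolve_asset_path` is a hypothetical helper and the actual expansion in LeggedGym-Ex may differ.

```python
from pathlib import Path


def resolve_asset_path(template: str, root_dir: str) -> Path:
    """Expand the {LEGGED_GYM_ROOT_DIR} placeholder and verify the file exists."""
    path = Path(template.format(LEGGED_GYM_ROOT_DIR=root_dir))
    if not path.is_file():
        raise FileNotFoundError(f"Robot XML not found: {path}")
    return path
```

For example, `resolve_asset_path('{LEGGED_GYM_ROOT_DIR}/resources/robots/go2/urdf/go2.xml', '/path/to/repo')` raises `FileNotFoundError` with the fully expanded path, which makes a wrong root directory easy to spot.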
Terrain Configuration Conflicts#
Problem: Terrain generation fails with conflicting options.
Error Messages:
ValueError: Curriculum and selected terrain cannot be both True.
Root Cause: Cannot use curriculum terrain and selected terrain simultaneously.
Solution:
Choose one terrain mode:
# Option 1: Curriculum terrain (difficulty increases over training)
class terrain:
    curriculum = True
    selected = False
    terrain_curriculum_difficulty = 0.5
# Option 2: Selected terrain (fixed terrain types)
class terrain:
    curriculum = False
    selected = True
    terrain_proportions = [0.2, 0.3, 0.5] # Proportions for each terrain type
Code Reference (see terrain.py:63):
if cfg.curriculum and cfg.selected:
    raise ValueError("Curriculum and selected terrain cannot be both True.")
Heightfield Terrain Limitation#
Problem: Heightfield terrain not working in IsaacLab.
Error Messages:
NotImplementedError: Heightfield terrain not implemented for IsaacLabSimulator
RuntimeError: Cannot create heightfield in IsaacLab
Root Cause: Heightfield terrain generation is not implemented for IsaacLab backend.
Solution:
Use trimesh terrain instead:
class terrain:
    mesh_type = 'trimesh' # Works for IsaacLab
    # mesh_type = 'heightfield' # NOT supported in IsaacLab
Simulator Compatibility Matrix:
| Terrain Type | IsaacGym | Genesis | IsaacLab |
|---|---|---|---|
| plane | ✓ | ✓ | ✓ |
| heightfield | ✓ | ✓ | ✗ |
| trimesh | ✓ | ✓ | ✓ |
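The matrix can be encoded as a lookup so that an unsupported combination is caught before the simulator starts. This is a sketch built only from the table above; the names `SUPPORTED_TERRAIN` and `is_terrain_supported` are hypothetical, and the simulator keys are assumed to match the lowercase values used for the SIMULATOR variable.

```python
SUPPORTED_TERRAIN = {
    "isaacgym": {"plane", "heightfield", "trimesh"},
    "genesis": {"plane", "heightfield", "trimesh"},
    "isaaclab": {"plane", "trimesh"},  # heightfield is not implemented
}


def is_terrain_supported(simulator: str, mesh_type: str) -> bool:
    """Check a (simulator, mesh_type) pair against the compatibility matrix."""
    try:
        supported = SUPPORTED_TERRAIN[simulator]
    except KeyError:
        raise ValueError(f"Unknown simulator: {simulator!r}") from None
    return mesh_type in supported
```

Calling `is_terrain_supported("isaaclab", cfg.terrain.mesh_type)` at config load time turns the `NotImplementedError` above into an early, explicit check.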
Debugging Commands Quick Reference#
Check Environment#
# Verify CUDA
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}, Version: {torch.version.cuda}')"
# Check simulator
echo $SIMULATOR
# List available tasks
python tests/test_all_tasks.py --list
# Check conda environment
conda list | grep -E "torch|genesis|isaac"
Debug Training#
# Run with debug output
python -m legged_gym.scripts.train --task go2 --debug
# Monitor GPU
watch -n 0.5 nvidia-smi
# Check logs
tail -f logs/go2/<experiment>/log.txt
# TensorBoard
tensorboard --logdir logs/
Validate Installation#
# Test specific task
python tests/test_all_tasks.py --tasks go2 --iterations 1
# Test all tasks
python tests/test_all_tasks.py
# Verify simulator backend
python -c "from legged_gym.simulator import get_simulator; print(get_simulator.__name__)"
Getting Help#
If you cannot resolve an issue using this guide:
Check Documentation: Full documentation at https://leggedgym-ex-doc.readthedocs.io/
Search Issues: Check existing GitHub issues for similar problems
Debug Information: When asking for help, provide:
Simulator type (echo $SIMULATOR)
Python version (python --version)
PyTorch version (python -c "import torch; print(torch.__version__)")
CUDA version (nvcc --version or nvidia-smi)
Complete error message and stack trace
Minimal reproduction script
Community: Join the Feishu group (see README) for discussions
Bug Reports: Open a GitHub issue with:
Clear description
Steps to reproduce
Expected vs actual behavior
System information
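Most of the system information requested above can be gathered in one step. The following sketch collects it into a dictionary that can be pasted into an issue; `collect_debug_info` is a hypothetical helper, and PyTorch is imported lazily so the script still runs in a broken environment.

```python
import os
import platform
from typing import Dict


def collect_debug_info() -> Dict[str, str]:
    """Gather simulator, Python, platform, PyTorch, and CUDA details."""
    info = {
        "simulator": os.environ.get("SIMULATOR", "<unset>"),
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    try:
        import torch  # Imported lazily: it may be missing or broken
    except (ImportError, OSError):
        info["torch"] = "<not importable>"
        info["cuda"] = "<unknown>"
    else:
        info["torch"] = str(torch.__version__)
        info["cuda"] = str(torch.version.cuda)
    return info


if __name__ == "__main__":
    for key, value in collect_debug_info().items():
        print(f"{key}: {value}")
```

The driver version is not included; add the output of `nvidia-smi` separately along with the complete error message and stack trace.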