
Multi-Node Tasks

This guide explains how multi-node tasks behave when resources.num_nodes > 1, and how to write task.yaml for the SLURM and SkyPilot providers.

Quick rules

  • Set resources.num_nodes in task.yaml to request more than one node (see the minimal fragment after this list).
  • Keep your launch command explicit in script: the system does not rewrite your command.
  • Prefer launcher-aware commands (torchrun, srun, mpirun) for distributed training.
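
For example, requesting two nodes is a single extra field under resources; a minimal fragment (with a placeholder provider) looks like this:

resources:
  provider: skypilot
  num_nodes: 2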

What Transformer Lab sets for you

When num_nodes > 1, provider integrations add a common distributed environment baseline so training code can use standard variables.

SLURM provider

For multi-node jobs, SLURM scripts include:

  • #SBATCH --nodes=<num_nodes>
  • Default task layout if not already set in custom flags:
    • #SBATCH --ntasks=<num_nodes>
    • #SBATCH --ntasks-per-node=1

And these environment defaults (overridable by user-provided env vars; see the sketch after this list):

  • MASTER_ADDR (first host from SLURM_JOB_NODELIST)
  • MASTER_PORT (derived from SLURM_JOB_ID)
  • NODE_RANK (from SLURM_NODEID)
  • RANK (from SLURM_PROCID)
  • LOCAL_RANK (from SLURM_LOCALID)
  • WORLD_SIZE (from SLURM_NTASKS, fallback to num_nodes)
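
Putting these together, the generated portion of a 2-node batch script could look roughly like the sketch below. How the provider actually resolves MASTER_ADDR from SLURM_JOB_NODELIST or derives MASTER_PORT from the job ID is an internal detail, so treat this as an illustration only:

#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1

# Illustrative derivations; the provider's exact commands may differ.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=${MASTER_PORT:-$((29500 + SLURM_JOB_ID % 1000))}  # derived from the job ID
export NODE_RANK=${SLURM_NODEID:-0}
export RANK=${SLURM_PROCID:-0}
export LOCAL_RANK=${SLURM_LOCALID:-0}
export WORLD_SIZE=${SLURM_NTASKS:-2}   # falls back to num_nodes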

SkyPilot provider

For multi-node jobs, run commands are prefixed with portable distributed defaults (sketched below):

  • MASTER_ADDR (first IP from SKYPILOT_NODE_IPS)
  • MASTER_PORT (default 29500)
  • NODE_RANK / RANK (from SKYPILOT_NODE_RANK)
  • LOCAL_RANK (default 0)
  • WORLD_SIZE (default SKYPILOT_NUM_NODES * SKYPILOT_NUM_GPUS_PER_NODE)

SkyPilot also exposes its native variables (for example SKYPILOT_NODE_IPS, SKYPILOT_NUM_NODES, SKYPILOT_NODE_RANK), which you can still use directly.
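
The prefix is roughly equivalent to the following shell sketch (an illustration of the defaults above, not the provider's literal code):

export MASTER_ADDR=$(echo "${SKYPILOT_NODE_IPS}" | head -n 1)   # first IP in the list
export MASTER_PORT=${MASTER_PORT:-29500}
export NODE_RANK=${SKYPILOT_NODE_RANK}
export RANK=${SKYPILOT_NODE_RANK}
export LOCAL_RANK=${LOCAL_RANK:-0}
export WORLD_SIZE=$(( SKYPILOT_NUM_NODES * SKYPILOT_NUM_GPUS_PER_NODE ))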

Example: multi-node on SkyPilot

name: train-multinode-skypilot
description: '2-node distributed PyTorch run on SkyPilot'

resources:
  provider: skypilot
  accelerators: 'L4:2'
  cpus: 8
  num_nodes: 2

env:
  # Optional override (otherwise defaults to 29500 on SkyPilot multi-node)
  MASTER_PORT: '8008'

script: |
  set -euo pipefail
  cd /workspace/my-train-code

  # You can rely on pre-exported vars, or use SkyPilot-native vars directly.
  torchrun \
    --nnodes="${SKYPILOT_NUM_NODES}" \
    --nproc_per_node="${SKYPILOT_NUM_GPUS_PER_NODE}" \
    --node_rank="${SKYPILOT_NODE_RANK}" \
    --master_addr="${MASTER_ADDR}" \
    --master_port="${MASTER_PORT}" \
    train.py

Example: multi-node on SLURM

name: train-multinode-slurm
description: '2-node distributed PyTorch run on SLURM'

resources:
  provider: slurm
  num_nodes: 2
  # Configure the SLURM partition and custom sbatch flags in provider settings.

env:
  # Optional override (otherwise derived from the job ID)
  MASTER_PORT: '23456'

script: |
  set -euo pipefail
  cd /workspace/my-train-code

  # Launch style is user-controlled. Choose srun/torchrun/etc. explicitly.
  srun python -m torch.distributed.run \
    --nnodes="${SLURM_NNODES}" \
    --nproc_per_node=1 \
    --node_rank="${NODE_RANK}" \
    --master_addr="${MASTER_ADDR}" \
    --master_port="${MASTER_PORT}" \
    train.py
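
If you prefer not to pass node ranks explicitly, torchrun's c10d rendezvous can assign them for you. An equivalent launch, assuming the same one-task-per-node layout, is sketched here:

srun torchrun \
  --nnodes="${SLURM_NNODES}" \
  --nproc_per_node=1 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" \
  --rdzv_id="${SLURM_JOB_ID}" \
  train.py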

Notes and troubleshooting

  • If your cluster needs a custom task layout, set custom_sbatch_flags (or the provider-level user_sbatch_flags); see the sketch after these notes.
  • If your training framework computes rank and world size itself, these env vars still help keep behavior consistent across providers.
  • If your script hard-codes rendezvous values, make sure they match the requested num_nodes.
  • Multi-node (num_nodes > 1) distributed training is currently not supported on the Runpod provider.
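
For instance, to run one task per GPU on nodes with 4 GPUs each, the extra flags would end up as #SBATCH directives like the sketch below; the GPU count and exactly where you set custom_sbatch_flags depend on your cluster and provider settings:

#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4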