Submitting a Slurm Task (Script Development)
In the HPC Phase 4 platform, jobs can be submitted to the Kunpeng CPU cluster via the Slurm scheduler, which allocates resources and schedules computations based on specific rules.
For more information on Slurm command usage, please refer to the official documentation: Slurm Documentation
1. Preparation
SSH Connection to the HPC Platform
Connect to the HPC Phase 4 login node using an SSH client:
ssh username@hpc4login.hpc.hkust-gz.edu.cn
Check Available Resources with sinfo
- The `sinfo` command lists all Slurm partitions, node counts, node states (idle/alloc/down), and other details.
- The domestic cluster for HPC Phase 4 consists of 22 Kunpeng CPU nodes. The partition name is `hpc`.
Check Pre-configured Software with module
- The HPC platform typically uses the `module` system to manage software environments.
- Use `module av` to view pre-configured software and their versions.
- Use `module load` in your script to load the required software and version.
- Use `module list` to check currently loaded software and versions.
2. Writing a Slurm Job Script
A Slurm job script is a Bash script containing resource request directives and the actual commands to execute. Job parameters are specified using `#SBATCH` comments at the beginning of the script.
Basic Structure
Please save the following content to `script.sh`. Important: all `#SBATCH` directives must be placed at the very beginning of the script, before any non-comment commands.
#!/bin/bash
#SBATCH --job-name=my_job # Job name
#SBATCH --output=%j.out # Standard output file (%j is the job ID)
#SBATCH --error=%j.err # Error output file
#SBATCH --nodes=1 # Number of nodes requested
#SBATCH --ntasks-per-node=1 # Number of tasks per node
#SBATCH --cpus-per-task=8 # Number of CPU cores per task
#SBATCH --mem=16G               # Memory per node
#SBATCH --time=02:00:00 # Maximum runtime (HH:MM:SS)
#SBATCH --partition=hpc # Specify partition (queue)
# Load modules
module load your_software_module
# Run your program
./my_program input.txt
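Inside the job, Slurm exports the requested resources as environment variables. A minimal sketch (assuming `--cpus-per-task` was set, so Slurm defines `SLURM_CPUS_PER_TASK`) that passes the core count through to a threaded program, with a fallback so the same script also runs outside Slurm:

```shell
# Read the core count Slurm exports for this job; default to 1 when
# the script runs outside Slurm (SLURM_CPUS_PER_TASK then being unset).
NCORES="${SLURM_CPUS_PER_TASK:-1}"

# Hand the core count to threaded programs via OpenMP's standard variable.
export OMP_NUM_THREADS="${NCORES}"
echo "running with ${NCORES} threads"
```

Reading the count from the environment keeps the resource request in the `#SBATCH` header as the single source of truth, instead of repeating the number in the script body.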
3. Submitting a Slurm Task: sbatch
Use the `sbatch` command to submit a Slurm job. If the job meets all running conditions (resources, quotas, permissions, etc.), Slurm will return a Job ID. Otherwise, the job waits in the queue in the pending (PD) state.
sbatch script.sh
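Because `#SBATCH` lines are ordinary comments to bash, you can sanity-check a job script's shell syntax locally before submitting it. A quick sketch (the script content below is a stand-in, not the full example above):

```shell
# Write a minimal stand-in job script.
cat > script.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=hpc
echo "hello from the job"
EOF

# bash -n parses the script without executing it, so shell syntax
# errors surface here rather than after the job has queued and started.
bash -n script.sh && echo "syntax OK"
```

Note that `bash -n` only checks shell syntax; mistakes inside `#SBATCH` directives are only caught by `sbatch` itself at submission time.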
Common Slurm Commands
| Command | Description |
|---|---|
| `sbatch script.sh` | Submit a job |
| `squeue -u $USER` | View the status of your own jobs |
| `scancel <JOBID>` | Cancel a specific job |
| `sinfo` | View cluster partitions and node status |
4. Iterative Interactive Development: tmux + srun
- Running `srun --pty bash` starts an interactive job where you can perform iterative script development.
- Limitation: if the interactive session is disconnected (e.g., due to a network interruption or SSH disconnect), the session terminates and cannot be reconnected.
Solution: Combine tmux (a terminal multiplexer) with srun to create a persistent interactive environment. This allows you to reconnect even after an SSH disconnect.
Step 1: Start a Named tmux Session on the Login Node
tmux new -s debug_session
This creates a session named debug_session and enters it.
Step 2: Submit an Interactive Slurm Task within the tmux Session
srun --job-name=interactive_debug \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=4 \
--mem=16G \
--time=02:00:00 \
--partition=hpc \
--pty bash
If the job conditions are met, you will connect to a compute node and obtain an interactive shell.
Step 3: Run Commands on the Compute Node
module load python/3.9
python hello.py
Step 4: Safely Disconnect (Without Terminating the Task)
Press the key combination Ctrl-b, then press d (the tmux detach command).
This returns you to the login node's shell, but the tmux session and the Slurm task running within it continue to run in the background.
Step 5: Reconnect to the Session
Reconnect to the session from the login node:
tmux attach -t debug_session
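If you forget the session name, tmux can list everything running on the login node. A guarded sketch that also degrades gracefully when tmux is missing or no server is up yet:

```shell
# List tmux sessions; "tmux ls" exits non-zero when no server is
# running, so fall back to a message instead of an error.
if command -v tmux >/dev/null 2>&1; then
    tmux ls 2>/dev/null || echo "no tmux sessions running"
else
    echo "tmux not installed"
fi
```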