Submitting a Slurm Task (Script Development)

In the HPC Phase 4 platform, jobs can be submitted to the Kunpeng CPU cluster via the Slurm scheduler, which allocates resources and schedules computations based on specific rules.

Note

For more information on Slurm command usage, please refer to the official documentation: Slurm Documentation

1. Preparation

SSH Connection to the HPC Platform

Connect to the HPC Phase 4 login node using an SSH client:

ssh username@hpc4login.hpc.hkust-gz.edu.cn

Check Available Resources with sinfo

  • The sinfo command lists all Slurm partitions, node counts, node states (idle/alloc/down), and other details.
  • The domestic cluster for HPC Phase 4 consists of 22 Kunpeng CPU nodes; the partition name is hpc.

Check Pre-configured Software with module

  • The HPC platform uses the module system to manage software environments.
  • Use module avail to view pre-configured software and their versions.
  • Use module load in your script to load the required software and version.
  • Use module list to check currently loaded software and versions.
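A typical sequence in a shell or job script might look like the following (the module name gcc/12.2.0 is a placeholder; pick a real one from the output of module avail on this cluster):

```shell
module avail            # list all pre-configured software and versions
module load gcc/12.2.0  # hypothetical module name; substitute one from `module avail`
module list             # confirm which modules are currently loaded
```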

2. Writing a Slurm Job Script

A Slurm job script is a Bash script containing resource request directives and the commands to execute. Job parameters are specified with #SBATCH comment lines at the beginning of the script.

Basic Structure

Save the following content as script.sh. Important: all #SBATCH directives must be placed at the very beginning of the script, before any non-comment command.

#!/bin/bash
#SBATCH --job-name=my_job # Job name
#SBATCH --output=%j.out # Standard output file (%j is the job ID)
#SBATCH --error=%j.err # Error output file
#SBATCH --nodes=1 # Number of nodes requested
#SBATCH --ntasks-per-node=1 # Number of tasks per node
#SBATCH --cpus-per-task=8 # Number of CPU cores per task
#SBATCH --mem=16G # Memory per node (use --mem-per-cpu to request memory per core instead)
#SBATCH --time=02:00:00 # Maximum runtime (HH:MM:SS)
#SBATCH --partition=hpc # Specify partition (queue)

# Load modules
module load your_software_module

# Run your program
./my_program input.txt
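If the program is multi-threaded (e.g. OpenMP), it is common to tie its thread count to the cores Slurm granted. A minimal sketch; the fallback value 8 matches the --cpus-per-task request above and only takes effect when the snippet is run outside a Slurm job:

```shell
# SLURM_CPUS_PER_TASK is set by Slurm inside the job; fall back to 8 (the
# value requested above) so this also works when tested outside a job.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-8}
echo "$OMP_NUM_THREADS"
```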

3. Submitting a Slurm Task: sbatch

Use the sbatch command to submit a Slurm job. If the request is valid (partition, quotas, permissions, etc.), Slurm accepts it and returns a Job ID; the job then starts once resources are available, and waits in the pending queue until then.

sbatch script.sh
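On success, sbatch prints a line like "Submitted batch job 12345"; in automation scripts you often want just the number. A sketch of extracting it, where the out= line stands in for a real sbatch call (sbatch --parsable prints the bare ID directly, if you prefer):

```shell
out="Submitted batch job 12345"   # stand-in for: out=$(sbatch script.sh)
jobid=$(echo "$out" | awk '{print $4}')
echo "$jobid"                     # prints the job ID: 12345
```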

Common Slurm Commands

| Command | Description |
| --- | --- |
| sbatch script.sh | Submit a job |
| squeue -u $USER | View the status of your own jobs |
| scancel &lt;JOBID&gt; | Cancel a specific job |
| sinfo | View cluster partitions and node status |
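These commands compose well in pipelines. For instance, to cancel all of your own jobs at once, you would feed the job IDs from squeue into scancel (squeue's -h flag suppresses the header and -o %i prints only job IDs); the same pipe pattern is demonstrated below with stand-in IDs so it can be tried off the cluster:

```shell
# On the cluster you would run:
#   squeue -u "$USER" -h -o %i | xargs -r scancel
# The same pattern, demonstrated with stand-in job IDs:
printf '101\n102\n' | xargs -r echo scancel   # prints: scancel 101 102
```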

4. Iterative Interactive Development: tmux + srun

  • Running srun --pty bash starts an interactive job where you can perform iterative script development
  • Limitation: If the interactive session is disconnected (e.g., due to network interruption or SSH disconnect), the session terminates and cannot be reconnected.

Solution: Combine tmux (a terminal multiplexer) with srun to create a persistent interactive environment. This allows you to reconnect even after an SSH disconnect.

Step 1: Start a Named tmux Session on the Login Node

tmux new -s debug_session

This creates a session named debug_session and enters it.

Step 2: Submit an Interactive Slurm Task within the tmux Session

srun --job-name=interactive_debug \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=4 \
--mem=16G \
--time=02:00:00 \
--partition=hpc \
--pty bash

If the job conditions are met, you will connect to a compute node and obtain an interactive shell.
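A quick way to confirm you are on a compute node rather than the login node is to check the hostname and the SLURM_JOB_ID variable, which Slurm sets only inside a job:

```shell
hostname                                        # the compute node's name
echo "${SLURM_JOB_ID:-not inside a Slurm job}"  # the job ID when inside a job
```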

Step 3: Run Commands on the Compute Node

module load python/3.9
python hello.py

Step 4: Safely Disconnect (Without Terminating the Task)

Press the key combination Ctrl-b, then d. This returns you to the login node's shell, but the tmux session, and the Slurm task running inside it, continue to run in the background.

Step 5: Reconnect to the Session

Reconnect to the session from the login node:

tmux attach -t debug_session