How to run your codes on Acres
These documents are useful references for managing jobs:
- SLURM cheat sheet - a useful PDF sheet of SLURM commands
- SLURM Job Manager Overview - an introduction to SLURM from virginia.edu
Step-by-step guide
If you are a Linux or Mac user, log in to the system with the following command in your local terminal:
[user@localhost~]$ ssh username@acres-login0.clarkson.edu
Your username is your Clarkson ID, i.e. the part of your Clarkson email address before the @ sign. If you are working from home, remember to first connect to the Clarkson VPN server at vpn.clarkson.edu.
For Windows users, PuTTY (www.chiark.greenend.org.uk/~sgtatham/putty/) is a good choice to serve as your local terminal.
Alternatively, enable the Windows Subsystem for Linux (WSL) feature that comes with Windows and use its terminal.
If you want to use remote software with a graphical user interface (GUI), such as Tecplot, log in with the -X option:
[user@localhost~]$ ssh -X username@acres-login0.clarkson.edu
To transfer files between the local and remote systems, use the scp command. If you want to transfer a directory instead of a single file, add the **-r** option:
[user@localhost~]$ scp -r Documents/my_directory/ username@acres-login0.clarkson.edu:codes/
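For example, to copy a single results file from the cluster back to your local machine (the file and directory names below are only placeholders), run scp in the other direction:
[user@localhost~]$ scp username@acres-login0.clarkson.edu:codes/my_directory/results.dat Documents/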
Once you have logged in, load the packages needed by your codes. Commands to load packages can also be placed in your job script; if you load packages there, you do not have to load them through the terminal in advance.
Use module spider to find all possible modules. Use module load to load a compiler:
[user@acres-login0~]$ module load gnu8/8.3.0
With that, the Fortran compiler gfortran and the C/C++ compiler gcc will be available to use. Use module list to see the currently loaded packages.
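As an example, a typical session before building an MPI code might look like the following; the openmpi3/3.1.4 module name is only an assumption, so use module spider to check what is actually installed on Acres:
[user@acres-login0~]$ module spider openmpi
[user@acres-login0~]$ module load gnu8/8.3.0
[user@acres-login0~]$ module load openmpi3/3.1.4
[user@acres-login0~]$ module list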
Job scripts
**write a job script**
Jobs are submitted through a job script, which is a shell script (usually written in bash). Since it is a shell script, it must begin with a "shebang":
#!/bin/bash
The shebang is followed by a preamble describing the resource requests the job is making. Each request begins with #SBATCH followed by an option.
-------------------------------------------------------------------------------------------------------------------------------------------
#SBATCH --time=00-02:00:00
This option gives the wall-clock time limit, i.e. the maximum running time of the job. It is written in the format
#SBATCH --time=day-hour:minute:second
so the example above requests two hours.
-------------------------------------------------------------------------------------------------------------------------------------------
#SBATCH --partition=general
This option stipulates the partition requested; here we use the partition named general. What is a partition? In SLURM, multiple nodes can be grouped into partitions, which are sets of nodes with associated limits on wall-clock time, job size, etc. To see the status and limits of the different partitions, use the sinfo command:
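An illustrative sinfo listing is sketched below; the field layout matches the default sinfo output, but apart from the general partition the partition names, node counts, and time limits are placeholders, so check the actual output on Acres:
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
general*     up 7-00:00:00     40   idle node[001-040]
debug        up    2:00:00      2   idle node[041-042]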
From this output you can see the TIMELIMIT of each partition. If the wall-clock time you request exceeds the TIMELIMIT of the partition you choose, your job will not run.
-------------------------------------------------------------------------------------------------------------------------------------------
#SBATCH --ntasks=8
This option stipulates the number of CPUs requested. Here we use 8 CPUs. This option is important if you run a parallel code.
-------------------------------------------------------------------------------------------------------------------------------------------
These are the basic options. Numerous other options are available; for a code to run we may not need that many, but they may help if you have other needs. Check the ones below to see whether you need them.
#SBATCH --nodes= #number of nodes
You may not need this, since when you stipulate the number of CPUs, SLURM automatically assigns the needed nodes for you.
#SBATCH --mem= #total memory per node in megabytes
System error file:
#SBATCH -e slurm%j.err
System output file:
#SBATCH -o slurm%j.out
Another thing worth mentioning is that most SLURM options have two forms: a short (single-letter) form preceded by a single hyphen and followed by a space, and a longer form preceded by a double hyphen and followed by an equal sign. For example, we can use either
#SBATCH --partition=general
or
#SBATCH -p general
and we can use either
#SBATCH --ntasks=
or
#SBATCH -n
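As a sketch, a preamble that combines several of these optional requests might look like the following; the memory figure is a placeholder, not a recommended value:
#!/bin/bash
#SBATCH --time=00-02:00:00
#SBATCH --partition=general
#SBATCH --ntasks=8
#SBATCH --nodes=1        # optional: let the 8 tasks share one node
#SBATCH --mem=16000      # placeholder: total memory per node in megabytes
#SBATCH -e slurm%j.err   # system error file; %j expands to the job ID
#SBATCH -o slurm%j.out   # system output file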
Here is a sample script called jobscript.sh for submitting an MPI job.
#!/bin/bash
#SBATCH -t 00-02:00:00
#SBATCH -p general
#SBATCH -n 120
RUN=/path_of_directory/
mpirun ./YourCode
Note that RUN=/path_of_directory/ stipulates where your executable file is located.
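As written, the script only defines RUN; a common pattern (an assumption here, not something the sample above necessarily relies on) is to change into that directory before launching the executable:
RUN=/path_of_directory/
cd $RUN          # run from the directory that contains the executable
mpirun ./YourCode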
**submit a job**
Job scripts are submitted with the sbatch command, e.g.:
[user@acres-login0~]$ sbatch jobscript.sh
The job identification number is returned when you submit the job, e.g.:
[user@acres-login0~]$ sbatch jobscript.sh
Submitted batch job 831
**display job status**
The squeue command is used to obtain status information about jobs submitted to all queues, like:
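An illustrative squeue listing is sketched below; the field layout matches the default squeue output, but the values shown are placeholders:
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   831   general jobscrip username  R       1:30      4 node[001-004]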
The TIME field indicates the elapsed walltime. JOBID is the job identification number which is returned when you submit the job.
The ST field lists the state of the job. Commonly listed states include:
PD: Pending, job is waiting for idle CPUs
R: Running, job has the allocated CPUs and is running
S: Suspended, job has the allocated resources, but execution has been suspended
CG: Completing, job is nearly complete and cannot yet be terminated, usually because an I/O operation is still in progress
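To list only your own jobs rather than those in every queue, filter squeue by user name (substitute your own Clarkson ID):
[user@acres-login0~]$ squeue -u username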
**cancel a job**
SLURM provides the scancel command for deleting jobs from the system using the job identification number:
[user@acres-login0~]$ scancel 831
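scancel can also take a user name instead of a job number, which cancels all of your jobs at once (substitute your own Clarkson ID):
[user@acres-login0~]$ scancel -u username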