Unit 2: Batch Job Submission

Presentation:

Moab-talk

Documentation on Moab batch jobs:

https://www.bwhpc-c5.de/wiki/index.php/Batch_Jobs

Cluster hardware 

Commands overview:

  • msub jobscript.moab
  • echo sleep 250 | msub -l nodes=1:ppn=1,walltime=00:05:00
  • checkjob
  • showq

Special workshop msub option for bwUniCluster:

  • msub -l advres=bwhpc-workshop.65 -A workshop -l nodes=1:ppn=1 …

This option is not needed on the bwForCluster Justus; your course jobs will start quickly there as well. (Any job with a walltime below 5 minutes starts quickly on Justus even in normal operation.)
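For reference, a minimal jobscript for the first command above might look roughly like this (the job name and the sleep command are only illustrative; the full header format is shown later in this unit):

#!/bin/bash
#MOAB -N minimal_test
#MOAB -l nodes=1:ppn=1
#MOAB -l walltime=00:05:00

sleep 250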

Exercises/Questions:
Short excursion on workspaces:
ws_allocate shortlived
ws_allocate alsoshortlived
ws_list # show all your workspaces and some info on them
ws_find shortlived # shows where it is
cd $(ws_find shortlived)
ls -l # well, it is empty
cd
ws_extend shortlived 90 # extend it to the maximum of 90 days
ws_extend shortlived 90
ws_extend shortlived 90
ws_extend shortlived 90 # end of possible extensions
ws_release shortlived # delete it
ws_release alsoshortlived
We load a module (math/matlab) and look at the example jobscript provided.
module load math/matlab
cd $MATLAB_EXA_DIR
ls
less README
ws_allocate exercise 60 # 60 is max on bwuni, 90 on justus
ws_find exercise
cp * $(ws_find exercise)
cd $(ws_find exercise)

less README
less bwhpc_matlab_single.moab

We use a workspace to run the job inside it. Why might this actually be the wrong choice for bwhpc_matlab_single.moab?

bwhpc_matlab_single.moab runs on a single node (and actually uses only a single core).
If this job produced a large amount of output, it would be much better to use the node-local disks from within the job.
You need workspaces for multi-node jobs in which the program must access the same data from several nodes (and you need to submit from the workspace so that the job finds its files in the first place).
For single-node runs that produce a lot of data, it is better to copy the input to the local disks ($TMPDIR) at the start of the job and copy all results back at the end, because roughly 500 nodes have to share the same workspace resources.
Your $HOME is only for storing results at the end of the job.
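A minimal sketch of that staging pattern inside a jobscript (the file names input.dat and results*.txt are only illustrative placeholders):

# copy the input data from the submit directory to the node-local disk
cp "$MOAB_SUBMITDIR"/input.dat "$TMPDIR"
cd "$TMPDIR"

# ... run the program here, writing its output into $TMPDIR ...

# copy the results back to a workspace (or to $HOME) before the job ends
cp results*.txt "$(ws_find exercise)"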
Looking at the beginning of the file yields:
########## Begin MOAB/Slurm header ##########
#
# Give job a reasonable name
#MOAB -N bwhpc_matlab_single
#
# Request number of nodes and CPU cores per node for job
#MOAB -l nodes=1:ppn=1
#
# Estimated wallclock time for job
#MOAB -l walltime=00:05:00
#
# Specify a queue class
Look up the options -N and -l in the msub manpage:
man msub
Search inside the manpage for e.g. " -N" by typing "/ -N" (without the quotes).
Note that putting one or two spaces before "-N" in the search pattern skips over mentions of -N in the text describing other options.

Queues: Justus

See the Queues section on the bwHPC batch jobs wiki page (link above).
The queues quick, short, normal and long exist, but you never have to select a queue manually: every job is automatically put into the correct queue depending on its requested walltime. The limits of each queue still apply.
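As an illustration (the walltime value here is arbitrary; which queue the job lands in depends on the queue limits listed on the wiki page):

msub -l nodes=1:ppn=1,walltime=02:00:00 jobscript.moab
# no -q option needed: the job is routed automatically based on the requested walltime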

Queues: bwUniCluster

List of tables can be found at https://www.bwhpc-c5.de/wiki/index.php/Batch_Jobs_-_bwUniCluster_Features#msub_-q_queues

msub bwhpc_matlab_single.moab
This returns a number, e.g. 6177199: this is your jobid.
Examples written here will use the jobid 6177199; use the real jobid of YOUR job instead.
showq # show all your jobs
showq -r # show only your jobs that are currently running
# do not forget to check the manpage!
man showq

checkjob 6177199 # moab summary of job
checkjob -v -v -v 6177199 # very verbose moab job summary

Inspect the job output file in the directory where you submitted the job.
This file contains everything that programs write to STDOUT while they run.
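One way to find and inspect it (the exact output file name depends on the cluster's naming scheme, so simply look for the newest file in the submit directory):

ls -lrt
less »name of the job output file«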

Submit the same job again, this time request 30GB of memory (RAM) for the job.
-l mem=30gb
Note that there is also -l pmem=... ; look up in the bwhpc-c5 wiki what the difference is.
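The resubmission could then look like this (assuming the extra option is given on the command line in addition to the script header):

msub -l mem=30gb bwhpc_matlab_single.moab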

We will make an even simpler job that will run as long as we want and use a lot of CPU while it does:
Create the file with: cat > load.moab << EOT
#!/bin/bash
module load mpi/openmpi

mpirun perl -e 'for(\$i=360,\$t=time;time-\$t<\$i;){}'
EOT
i=360 specifies 360 seconds of running time.
We specify all Moab options on the command line, not inside the script:
msub -l nodes=1:ppn=16,walltime=0:08:00 load.moab
bwUniCluster: You need to get an interactive node instead, e.g. with:
msub -I -V -q singlenode -l nodes=1:ppn=16
(or -q develop instead of -q singlenode for a short development session).
You can then run the script with: bash load.moab &
Justus: You can ssh into any node on which one of your jobs is running, because Justus always reserves whole nodes for you.
showq -r # wait for job to start
checkjob -v -v -v 6177199 | less
# find out which node the job runs on
ssh »nodename«
htop # should show 16 threads of CPU busy
ps aux | grep $USER
# if this doesn't find anything, username may be too long. Try
ps aux | grep ul_ # to find ulm users
vmstat 1 # end with ctrl-C
df -h
You can use gdb, strace, etc. on the PID of the job.
Repeat the same with nodes=1:ppn=8.

Why does htop show that only half of the cores on the node are in use while the other half is idle?

The nodes have hyper-threading enabled (https://de.wikipedia.org/wiki/Hyper-Threading). That means that for every real physical core two "virtual cores" are visible on the system. Most HPC applications do not benefit from using more compute threads than the real number of cores in the machine.
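One way to check this on a node (assuming the standard lscpu tool is installed there):

lscpu | grep -E 'Thread|Core|Socket'
# "Thread(s) per core: 2" means two hardware threads (virtual cores) per physical core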
Run another job:
echo sleep 10 | msub -l nodes=1:ppn=40 -l walltime=1:00:00
Find out why the job does not start.
Cancel the job.
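A possible way to do this (assuming the usual Moab canceljob command is available; mjobctl -c is an alternative):

canceljob 6177199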


Array Jobs (JUSTUS)

An array job is a single job script that spawns many almost identical sub-jobs. A regular job becomes an array job when the -t flag is passed to the msub command. The <indexlist> parameter specifies the number and indices of the submitted sub-jobs (see the job arrays documentation) and thus uniquely identifies the individual sub-jobs.

If you have many jobs that need only one core each and have a variable runtime, it is very useful to limit execution to a certain number of nodes. Otherwise, towards the end many nodes will each be running only a single one-core job. Limiting the number of nodes helps to fill those nodes with new sub-jobs more efficiently.

Q: Try to create an array job called “bwhpc_jobscript” that spawns 100 sub-jobs executing the MATLAB script “bwhpc_example”. Each sub-job requires 1 CPU core, 1 GB memory, and 10 min runtime. Furthermore, make sure that a different input file will be passed to MATLAB within each sub-job. You find applicable input files in the “input” folder.
A: Submit the following job script with the command msub -t array.[1-100]%16 bwhpc_jobscript.moab (the %16 limits the number of simultaneously running sub-jobs to 16).

#!/bin/bash
#
########## Begin MOAB header ##########
#
# Give job a reasonable name
#MOAB -N array_job
#
# Request number of nodes and CPU cores per node for job
#MOAB -l nodes=1:ppn=1
#
# Request maximum memory used by any single process of the job
#MOAB -l pmem=1000mb
#
# Estimated wallclock time for job
#MOAB -l walltime=00:10:00
#
# Write standard output and errors in same file
#MOAB -j oe
#
########### End MOAB header ##########

cd "$MOAB_SUBMITDIR"

initial_state="'input/${MOAB_JOBARRAYINDEX}.dat'"

# Load the required module
module load math/matlab/R2014b

# Start the Matlab example
matlab -nodisplay -singleCompThread -nojvm -r "bwhpc_example($initial_state)"

exit
Q: Concatenate and plot the data stored in the “output” folder.
cd output
cat *.txt > result.txt

gnuplot
plot [:] [:] 'result.txt' using 1:2 with points notitle
quit


Job Output and Feedback Files

Download the tar.gz file namd-kurs.tgz to your cluster account with:
wget https://www.uni-ulm.de/fileadmin/website_uni_ulm/kiz/it/rechner-compute-server/workshops/namd-kurs.tgz

Unpack the file and cd into “namd-kurs”. The command grep WallTime *feedback shows you that only one of the jobs ran for over three hours. Look at the corresponding *_feedback file. It contains some information about your job that may help you determine how efficiently it ran. There is a long link under "LINK to the job statistic (per node)"; open that link in your browser. Then look at the corresponding job script namd_singlenode.moab. Why do you see the load distribution during the job runtime?
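The unpacking step could look like this (standard tar options, assuming the archive is in the current directory):

tar xzf namd-kurs.tgz
cd namd-kurs
grep WallTime *feedback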