Unit 2: Batch Job Submission

Presentation:

Moab-talk

Documentation on Moab batch jobs:

https://www.bwhpc-c5.de/wiki/index.php/Batch_Jobs

Cluster hardware 

Commands overview:

  • msub jobscript.moab
  • echo sleep 250 | msub -l nodes=1:ppn=1,walltime=00:05:00
  • checkjob
  • showq

Special workshop msub option for bwUniCluster:

  • msub -l advres=bwhpc-workshop.65 -A workshop -l nodes=1:ppn=1 …

This option is not needed on the bwForCluster Justus; your course jobs will start quickly there as well. (Any job with a walltime below 5 minutes starts quickly on Justus even in normal operation.)
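For reference, a minimal jobscript for the first command above might look roughly like this (the job name and the sleep command are only illustrative; the full header format is shown later in this unit):

#!/bin/bash
#MOAB -N minimal_test
#MOAB -l nodes=1:ppn=1
#MOAB -l walltime=00:05:00

sleep 250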

Exercises/Questions:
Short excursion on workspaces:
ws_allocate shortlived
ws_allocate alsoshortlived
ws_list # show all your workspaces and some info on them
ws_find shortlived # shows where it is
cd $(ws_find shortlived)
ls -l # well, it is empty
cd
ws_extend shortlived 90 # extend it to the maximum of 90 days
ws_extend shortlived 90
ws_extend shortlived 90
ws_extend shortlived 90 # end of possible extensions
ws_release shortlived # delete it
ws_release alsoshortlived
We load a module (math/matlab) and look at the example jobscript provided.
module load math/matlab
cd $MATLAB_EXA_DIR
ls
less README
ws_allocate exercise 60 # 60 is max on bwuni, 90 on justus
ws_find exercise
cp * $(ws_find exercise)
cd $(ws_find exercise)

less README
less bwhpc_matlab_single.moab

We use a workspace to run the job inside it. Why might this actually be the wrong choice for bwhpc_matlab_single.moab?

bwhpc_matlab_single.moab runs on a single node (and actually uses only a single core).
If this job produced a large amount of output, it would be much better to use the node-local disks from within the job.
You need workspaces for multi-node jobs in which the program must access the same data from several nodes (and you need to submit from the workspace so that the job finds its files in the first place).
For single-node runs that produce a lot of data, it is better to copy the input to the local disks ($TMPDIR) at the start of the job and copy all results back at the end, because roughly 500 nodes have to share the same workspace resources.
Your $HOME is only for storing results at the end of the job.
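A minimal sketch of that staging pattern inside a jobscript (the file names input.dat and results*.txt are only illustrative placeholders):

# copy the input data from the submit directory to the node-local disk
cp "$MOAB_SUBMITDIR"/input.dat "$TMPDIR"
cd "$TMPDIR"

# ... run the program here, writing its output into $TMPDIR ...

# copy the results back to a workspace (or to $HOME) before the job ends
cp results*.txt "$(ws_find exercise)"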
Looking at the beginning of the file yields:
########## Begin MOAB/Slurm header ##########
#
# Give job a reasonable name
#MOAB -N bwhpc_matlab_single
#
# Request number of nodes and CPU cores per node for job
#MOAB -l nodes=1:ppn=1
#
# Estimated wallclock time for job
#MOAB -l walltime=00:05:00
#
# Specify a queue class
Look up the options -N and -l in the msub manpage:
man msub
Search inside the manpage for e.g. " -N" by typing "/ -N" (without the quotes).
Note that putting one or two spaces before "-N" in the search pattern skips over mentions of -N in the text describing other options.

Queues: Justus

See the Queues section on the bwHPC batch jobs wiki page (link above).
The queues quick, short, normal and long exist, but you never have to select a queue manually: every job is automatically put into the correct queue depending on its requested walltime. The limits of each queue still apply.
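As an illustration (the walltime value here is arbitrary; which queue the job lands in depends on the queue limits listed on the wiki page):

msub -l nodes=1:ppn=1,walltime=02:00:00 jobscript.moab
# no -q option needed: the job is routed automatically based on the requested walltime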

Queues: bwUniCluster

List of tables can be found at https://www.bwhpc-c5.de/wiki/index.php/Batch_Jobs_-_bwUniCluster_Features#msub_-q_queues

msub bwhpc_matlab_single.moab
This returns a number, e.g. 6177199: this is your jobid.
Examples written here will use the jobid 6177199; use the real jobid of YOUR job instead.
showq # show all your jobs
showq -r # show only your jobs that are currently running
# do not forget to check the manpage!
man showq

checkjob 6177199 # moab summary of job
checkjob -v -v -v 6177199 # very verbose moab job summary

Inspect the job output file in the directory where you submitted the job.
This file contains everything that programs write to STDOUT while they run.
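One way to find and inspect it (the exact output file name depends on the cluster's naming scheme, so simply look for the newest file in the submit directory):

ls -lrt
less »name of the job output file«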

Submit the same job again, this time request 30GB of memory (RAM) for the job.
-l mem=30gb
Note that there is also -l pmem=... ; look up in the bwhpc-c5 wiki what the difference is.
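The resubmission could then look like this (assuming the extra option is given on the command line in addition to the script header):

msub -l mem=30gb bwhpc_matlab_single.moab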

We will make an even simpler job that will run as long as we want and use a lot of CPU while it does:
Create the file with: cat > load.moab << EOT
#!/bin/bash
module load mpi/openmpi

mpirun perl -e 'for(\$i=360,\$t=time;time-\$t<\$i;){}'
EOT
i=360 specifies 360 seconds of running time.
We specify all Moab options on the command line, not inside the script:
msub -l nodes=1:ppn=16,walltime=0:08:00 load.moab
bwUniCluster: You need to get an interactive node instead, e.g. with:
msub -I -V -q singlenode -l nodes=1:ppn=16
(or -q develop instead of -q singlenode for a short development session).
You can then run the script with: bash load.moab &
Justus: You can ssh into any node on which one of your jobs is running, because Justus always reserves whole nodes for you.
showq -r # wait for job to start
checkjob -v -v -v 6177199 | less
# find out which node the job runs on
ssh »nodename«
htop # should show 16 threads of CPU busy
ps aux | grep $USER
# if this doesn't find anything, username may be too long. Try
ps aux | grep ul_ # to find ulm users
vmstat 1 # end with ctrl-C
df -h
You can use gdb, strace, etc. on the PID of the job.
Repeat the same with nodes=1:ppn=8.

Why does htop show that only half of the cores on the node are in use while the other half is idle?

The nodes have hyper-threading enabled (https://de.wikipedia.org/wiki/Hyper-Threading). That means that for every real physical core two "virtual cores" are visible on the system. Most HPC applications do not benefit from using more compute threads than the real number of cores in the machine.
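One way to check this on a node (assuming the standard lscpu tool is installed there):

lscpu | grep -E 'Thread|Core|Socket'
# "Thread(s) per core: 2" means two hardware threads (virtual cores) per physical core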
Run another job:
echo sleep 10 | msub -l nodes=1:ppn=40 -l walltime=1:00:00
Find out why the job does not start.
Cancel the job.
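A possible way to do this (assuming the usual Moab canceljob command is available; mjobctl -c is an alternative):

canceljob 6177199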


Array Jobs (JUSTUS)

An array job is a single job script that spawns many almost identical sub-jobs. A regular job becomes an array job when the -t flag is passed to the msub command. The <indexlist> parameter specifies the number and indices of the submitted sub-jobs (see the job arrays documentation) and thus uniquely identifies the individual sub-jobs.

If you have many jobs that need only one core each and have a variable runtime, it is very useful to limit execution to a certain number of nodes. Otherwise, towards the end many nodes will each be running only a single one-core job. Limiting the number of nodes helps to fill those nodes with new sub-jobs more efficiently.

Q: Try to create an array job called “bwhpc_jobscript” that spawns 100 sub-jobs executing the MATLAB script “bwhpc_example”. Each sub-job requires 1 CPU core, 1 GB memory, and 10 min runtime. Furthermore, make sure that a different input file will be passed to MATLAB within each sub-job. You find applicable input files in the “input” folder.
A: Submit the following job script with the command msub -t array.[1-100]%16 bwhpc_jobscript.moab (the %16 limits the number of simultaneously running sub-jobs to 16).

#!/bin/bash
#
########## Begin MOAB header ##########
#
# Give job a reasonable name
#MOAB -N array_job
#
# Request number of nodes and CPU cores per node for job
#MOAB -l nodes=1:ppn=1
#
# Request maximum memory used by any single process of the job
#MOAB -l pmem=1000mb
#
# Estimated wallclock time for job
#MOAB -l walltime=00:10:00
#
# Write standard output and errors in same file
#MOAB -j oe
#
########### End MOAB header ##########

cd "$MOAB_SUBMITDIR"

initial_state="'input/${MOAB_JOBARRAYINDEX}.dat'"

# Load the required module
module load math/matlab/R2014b

# Start the Matlab example
matlab -nodisplay -singleCompThread -nojvm -r "bwhpc_example($initial_state)"

exit
Q: Concatenate and plot the data stored in the “output” folder.
cd output
cat *.txt > result.txt

gnuplot
plot [:] [:] 'result.txt' using 1:2 with points notitle
quit


Job Output and Feedback Files

Download the tar.gz file namd-kurs.tgz to your cluster account with:
wget https://www.uni-ulm.de/fileadmin/website_uni_ulm/kiz/it/rechner-compute-server/workshops/namd-kurs.tgz

Unpack the file and cd into “namd-kurs”. The command grep WallTime *feedback shows you that only one of the jobs ran for over three hours. Look at the corresponding *_feedback file. It contains some information about your job that may help you determine how efficiently it ran. There is a long link under "LINK to the job statistic (per node)"; open that link in your browser. Then look at the corresponding job script namd_singlenode.moab. Why do you see the load distribution during the job runtime?
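The unpacking step could look like this (standard tar options, assuming the archive is in the current directory):

tar xzf namd-kurs.tgz
cd namd-kurs
grep WallTime *feedback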