Unit 2: Process Monitoring, Redirection

Slides to the talk

Remark: The setup on Justus allows you to connect to a node with ssh while one of your jobs is running on it. You can use the tools below to check on a job while it is running on a node. Other clusters may not allow ssh access.

Run the program numpy_test-print.py from the tarfile using:
python numpy_test-print.py
Redirect the output of the program to a file. (See cheat sheet on redirection/pipes)

python numpy_test-print.py > out

Q: You now want to watch the output as the file "out" grows. How can you do this? (See cheat sheet "viewing and editing files")

A: open a second shell and run:
tail -f out
By default, tail shows the last 10 lines of a file. The -f option is short for "follow": tail keeps running and prints new lines as the file grows.
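To make the idea behind "follow" concrete, here is a minimal Python sketch of the same behaviour. It is not how tail itself is implemented; the function names read_new_lines and follow, the polling interval and the max_polls limit are all assumptions made for this illustration.

```python
import os
import time


def read_new_lines(handle):
    """Return every complete line appended since the previous call."""
    lines = []
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line.endswith("\n"):
            # Nothing new, or only a partial line so far: rewind and retry later.
            handle.seek(position)
            return lines
        lines.append(line.rstrip("\n"))


def follow(path, poll_interval=0.5, max_polls=None):
    """Print lines as they are appended to path, roughly like `tail -f`.

    max_polls is a hypothetical safety limit; None means run until Ctrl-C.
    """
    with open(path) as handle:
        handle.seek(0, os.SEEK_END)  # start at the current end of the file
        polls = 0
        while max_polls is None or polls < max_polls:
            for line in read_new_lines(handle):
                print(line)
            time.sleep(poll_interval)
            polls += 1
```

Calling follow("out") in a second shell while the program is writing would print each new line shortly after it appears. The sketch polls the file at a fixed interval, which is simple but slightly delayed.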

Another way to keep seeing the updated end of a growing file is to press "G" repeatedly in less, or to press "F" once, which makes less follow the file.

Open a second terminal and run htop in it. Now start the program and watch what changes.
Your program should appear at the top of the list. You can see that it runs on only one core. The nodes on Justus have 16 cores, so if you have many such calculations to do, you could run the same program 16 times on one node, each with a different input. htop also shows the virtual (VIRT) and resident (RES) memory used, along with other data on the process.
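On Linux, the memory figures that htop displays are also available in the file /proc/[PID]/status, where VmSize corresponds to VIRT and VmRSS to RES. The following sketch reads them from Python; the function names parse_status and memory_of and the result keys are assumptions chosen for this example.

```python
import os


def parse_status(text):
    """Extract VmSize and VmRSS (in kB) from the text of /proc/<pid>/status."""
    wanted = {"VmSize": "virt_kb", "VmRSS": "res_kb"}
    result = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            # Values look like "   123456 kB"; keep the number only.
            result[wanted[key]] = int(value.split()[0])
    return result


def memory_of(pid):
    """Read the memory figures of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_status(handle.read())


if __name__ == "__main__" and os.path.exists("/proc/self/status"):
    # Report the memory use of this Python process itself.
    print(memory_of(os.getpid()))
```

To inspect your running calculation instead, pass the PID shown by htop to memory_of.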

While the program is still running, also look at the output of:

ps aux | grep -e PID -e numpy_test

This shows essentially the same information as htop. The "|" is a pipe: instead of displaying the output of "ps" directly, the shell feeds it into a second program, "grep", which prints only the lines that match the given patterns. Here the pattern PID keeps the header line and numpy_test keeps the line for your program.
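What the shell does for a pipe can be sketched in Python with the subprocess module: the standard output of the first process is connected to the standard input of the second. The function name run_pipeline is an assumption for this illustration, and the sketch handles exactly two commands.

```python
import subprocess


def run_pipeline(producer, consumer):
    """Run `producer | consumer` and return the consumer's output as text."""
    first = subprocess.Popen(producer, stdout=subprocess.PIPE)
    second = subprocess.Popen(consumer, stdin=first.stdout,
                              stdout=subprocess.PIPE, text=True)
    # Close our copy of the pipe so that the producer notices
    # if the consumer exits early.
    first.stdout.close()
    output, _ = second.communicate()
    first.wait()
    return output
```

With this, run_pipeline(["ps", "aux"], ["grep", "-e", "PID", "-e", "numpy_test"]) should return the same lines as the shell command above, provided ps and grep are installed.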

Also look at the output of:

lsof | grep python

Even better, grep for the PID (process identifier) number that you got from the output of ps or from htop:

lsof | grep [PID]

What you see are all the files that your process has open at that moment. Because the program does not read or write data files of its own while working, most entries are system files and libraries needed to execute it.
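On Linux, part of this information can also be read directly from /proc/[PID]/fd, which contains one symbolic link per open file descriptor. The sketch below lists those links; it covers only open descriptors, not the memory-mapped libraries that lsof also reports, and the function name open_files is an assumption for this example.

```python
import os


def open_files(pid):
    """Return {fd_number: target} for the open file descriptors of pid (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return result
```

You can only inspect processes that belong to you (or any process as root). Descriptors 0, 1 and 2 are stdin, stdout and stderr; if you redirected the output to "out", descriptor 1 should point to that file.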

To look even more closely at what your program is doing, you can attach to it via a debugging interface, either with strace or with gdb, the GNU debugger. Using gdb to any extent would go too far for this exercise, but let us have a very short glance at strace.

strace shows you all system calls (calls into the kernel) that a program makes.
First, trace the script directly by running: strace python numpy_test-print.py

How can you write the output of strace to one file and the output of the program to a different file? (Search for ^REDIRECTION in man bash for the answer. The program writes to stdout, strace to stderr.)

You can also attach to your already-running process by running:

strace -p [PID]

Stop tracing by hitting ^C (^ is short for the Control key, usually labelled Strg or Ctrl) in the shell with strace.

The output of strace is often very verbose, and you will want to limit it to the system calls you are interested in. For example, to see only the calls that open files (look at the manpage of strace and search for -e for other options), trace open and openat. On current Linux systems most files are opened via openat, so tracing open alone may show nothing:

strace -e trace=open,openat -p [PID]

Stop tracing by hitting ^C in the shell with strace.

Often this is still too much output to read on screen. Processes can write output to different "streams". The two most common are "standard output" and "standard error", stdout and stderr for short, with the file descriptor numbers 1 (stdout) and 2 (stderr). strace writes to stderr.
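The separation of the two streams can also be illustrated from Python, where subprocess.run accepts a separate target for each stream. This is only a sketch of the concept, not a replacement for the shell syntax from the cheat sheet; the function name run_with_split_output is an assumption for this example.

```python
import subprocess
import sys


def run_with_split_output(command, stdout_path, stderr_path):
    """Run command, sending stdout (1) and stderr (2) to separate files."""
    with open(stdout_path, "w") as out, open(stderr_path, "w") as err:
        return subprocess.run(command, stdout=out, stderr=err).returncode


if __name__ == "__main__":
    # A small child process that writes one line to each stream.
    child = [sys.executable, "-c",
             "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"]
    run_with_split_output(child, "demo-stdout", "demo-stderr")
```

The file names demo-stdout and demo-stderr are placeholders. After running the sketch, each file should contain only the line written to its stream.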

Connect again to the process with strace and write stderr to a file. Use the Cheat Sheet to find out how.

strace -p [PID] 2> strace-out
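Once the trace is in a file, you can summarise it instead of reading it line by line. The following sketch counts how often each system call appears in strace output. It assumes the default line format of strace, in which each call starts with the call name followed by "(" (optionally prefixed by "[pid N]"); the names SYSCALL and count_syscalls are assumptions for this example.

```python
import re
from collections import Counter

# A system call line starts with the call name and an opening parenthesis,
# optionally preceded by a "[pid 1234]" prefix.
SYSCALL = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(")


def count_syscalls(lines):
    """Count system calls by name; lines that are not calls are ignored."""
    counts = Counter()
    for line in lines:
        match = SYSCALL.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

For example, opening strace-out and printing count_syscalls(handle).most_common(5) would list the five most frequent calls. strace can also produce a summary of its own with the -c option.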