3 Submitting Jobs

3.1 Checking Usage

At any time, a user can check the current availability of the cluster by typing SGE_Avail on the command line. The output will look something like this:

               #HOST  TOTRAM FREERAM    TOTSLOTS             Q  QSLOTS  QFREESLOTS   QSTATUS     QTYPE
               bacon   503.6   500.3          48         all.q      48          48    normal        BP
             lettuce   503.6   500.2          48         all.q      48          48    normal        BP
              tomato   503.6   500.2          48         all.q      48          48    normal        BP

According to this output, there are three hosts running: bacon, lettuce, and tomato. Each has 48 total slots, all 48 of which are free, and roughly 500 GB of free RAM out of 503.6 GB total.

Additionally, users can check the job queue to see which jobs are waiting to be run and which are currently running. To do this, run the qstat command. If qstat comes back with no output, there are no jobs running at the moment. Here is some example output from the qstat command:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
 -----------------------------------------------------------------------------------------------------------------
     62 0.00000 runtime_test     glick        r     01/26/2018 18:59:00                                    1        
     63 0.00000 runtime_test2    glick        qw    01/26/2018 18:59:02                                    1
     64 0.00000 runtime_test3    glick        qw    01/26/2018 18:59:04                                    1    

There are currently 3 jobs on the cluster, all submitted by the user “glick.” They have job IDs 62, 63, and 64, and each takes up one slot (another name for a core). One is running, while the other two have state qw, which is short for “queued and waiting.” This usually indicates that either the cluster is busy or the scheduler has not yet scheduled the jobs.

3.2 A Note on Data

The home directories, /local/cluster/bin, and a few other things are mounted remotely to all of the worker nodes. This makes life easy. It means that if your script edits, reads, or otherwise depends on data from your home directory, you do not need to move the data, because the workers can access it directly. However, this also means that if your data is edited by multiple jobs, there is no guarantee about the order in which those jobs will change it, so keep that in mind.
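
For example, a simple way to avoid two jobs writing to the same file is to give each job its own output path. In this sketch, which uses SGE_Batch (described below), the script name and its --out flag are placeholders for your own program:

SGE_Batch -r "run_A" -c "python analyze.py --out results_run_A.csv"
SGE_Batch -r "run_B" -c "python analyze.py --out results_run_B.csv"

Each job then works on a separate file, so the order in which the jobs run does not matter.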

3.3 Running Python in Virtual Environments

Beginning in 2021, we encourage everyone to run Python scripts inside virtual environments. This keeps the main Python installation from getting too unruly with packages. BLT hero Ben Glick set up a program called venv_man that makes this fairly easy.

The general steps are:

* Create a virtual environment (unless you’re using one already created)
* Activate the virtual environment
* Run your job using “SGE_Batch” (more on that below)
* Deactivate the virtual environment

3.3.1 Should I create a Virtual Environment, or use a pre-existing one?

It depends! If you’re part of an ongoing research project, or want to use one specific to a subject (e.g. biology or economics), you may want to use one that already exists. Run the following to see a list of existing environments:

venv_man -l

3.3.2 Creating Virtual Environments
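
venv_man can create a new environment for you. If you would rather use Python’s built-in tooling instead, a standard virtual environment can be created directly with the venv module; this is a plain-Python sketch, not the venv_man command, and the path is just a placeholder:

python3 -m venv ~/venvs/my_project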

3.3.3 Activating Virtual Environments
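
Assuming your environment is a standard Python virtual environment, activate it by sourcing its activate script. The path below is a placeholder; use the location of the environment you created or selected:

source ~/venvs/my_project/bin/activate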

3.3.4 Deactivating Virtual Environments
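
Once your job has been submitted, leave the environment with the standard command:

deactivate

Putting the steps together, a full session might look like this sketch, where the environment path, runtime id, and script name are all placeholders:

source ~/venvs/my_project/bin/activate
SGE_Batch -r "venv_test" -c "python my_script.py"
deactivate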

3.4 Jobs on the BLT Cluster

3.4.1 Grid Engine

BLT uses the GridEngine scheduler system to schedule HPC jobs. Good documentation on the SGE toolkit is available online.

3.4.2 Batch Jobs

A batch job is a set of UNIX command-line commands executed in serial (one after another) on a single core of a worker node. Batch jobs can be submitted using the following command:

SGE_Batch -r "<some runtime id>" -c "<a UNIX command or commands>"
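
For example, a minimal test job might look like this, where the runtime id test_run and the commands are just placeholders:

SGE_Batch -r "test_run" -c "hostname; sleep 30"

The runtime id given with -r is simply a name used to label the run.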

3.4.3 Parallel Jobs

Parallel jobs are just like batch jobs, except that a parallel job reserves multiple cores rather than a single core. To reserve multiple cores, simply add the -P flag to the SGE_Batch command like so:

SGE_Batch -r "<runtime id>" -c "UNIX command" -P <number of processors>
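
For example, to reserve 8 cores for a script that is written to use multiple threads (the script name is a placeholder):

SGE_Batch -r "parallel_run" -c "python my_threaded_script.py" -P 8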

3.4.4 GPU Jobs

To submit a job to the GPU queue, all you need to do is add the -q gpu.q option. This will submit your job to the GPU node, which has 4 NVIDIA GeForce RTX 2080 Ti accelerators.

An example of this is:

SGE_Batch -r "<runtime id>" -c "UNIX command" -P <number of processors> -q gpu.q
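
For instance, a GPU job that also reserves 4 cores might look like this, with the script name as a placeholder:

SGE_Batch -r "gpu_run" -c "python train_model.py" -P 4 -q gpu.q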

3.4.5 Deleting A Job

To delete a job, use the qdel command. The syntax is as follows:

qdel <JOB ID>

The job ID can be found using qstat.
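
For example, to delete the waiting job 63 from the qstat output shown earlier:

qdel 63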

Remember that SGE_Batch will not parallelize your code for you. If your code is not meant to run on multiple cores, then using any more than 1 processor core is a waste.

3.4.6 Parsl Workflows

Parsl is a Python-based workflow management system that we can use to run jobs on the cluster without having to interact with the scheduler at all. Parsl workflows are run the same way you would run any script on your local machine, and they can orchestrate inter-process communication between almost any kind of application. In-depth documentation about running Parsl jobs is available on the Parsl Workflows page.
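
As a minimal sketch of what a Parsl script looks like, the example below uses Parsl’s built-in local-threads configuration, so it runs entirely on the machine where it is launched; a cluster run would instead load a configuration that points at the Grid Engine scheduler.

import parsl
from parsl import python_app
from parsl.configs.local_threads import config  # local threads only; swap in a Grid Engine config for cluster runs

parsl.load(config)

@python_app
def double(x):
    # each call becomes an independent task that Parsl schedules
    return 2 * x

futures = [double(i) for i in range(4)]   # launch four tasks
print([f.result() for f in futures])      # .result() blocks until each task finishes

Save this as a normal Python script and run it with python; Parsl handles the task scheduling.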