Command Basics
There are a few commands that you will need to know to work effectively with distribute. Don't worry, they are not too complex. The full list of commands and their specific parameters can be found by running
$ distribute
At the time of writing, this yields:
distribute 0.9.4
A utility for scheduling jobs on a cluster

USAGE:
    distribute [FLAGS] <SUBCOMMAND>

FLAGS:
    -h, --help         Prints help information
        --save-log
        --show-logs
    -V, --version      Prints version information

SUBCOMMANDS:
    add              add a job set to the queue
    client           start this workstation as a node and prepare it for a server connection
    help             Prints this message or the help of the given subcommand(s)
    kill             terminate any running jobs of a given batch name and remove the batch from the queue
    node-status      check the status of all the nodes
    pause            pause all currently running processes on this node for a specified amount of time
    pull             Pull files from the server to your machine
    run              run a apptainer configuration file locally (without sending it off to a server)
    server           start serving jobs out to nodes using the provied configuration file
    server-status    check the status of all the nodes
    template         generate a template file to fill for executing with `distribute add`
add
distribute add is how you add jobs to the server queue. The command needs two things: a configuration file and the IP address of the main server node. If you do not specify the name of a configuration file, it defaults to distribute-jobs.yaml. In most cases, the command can be run as follows:
distribute add --ip <server ip address here> my-distribute-jobs-file.yaml
or, using defaults:
distribute add --ip <server ip address here>
If no node matches all of your required capabilities, the job will not be run. There is also a --dry flag if you want to check that your configuration file syntax is correct, and a --show-caps flag to print the capabilities of each node.
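The capability requirement can be pictured as a subset check: a node is eligible only if it offers every capability the job asks for. The following is a minimal Python sketch of that idea, not distribute's actual implementation. The function name `eligible_nodes`, the node names, and their capability lists are all hypothetical.

```python
def eligible_nodes(required, nodes):
    """Return the names of nodes that offer every required capability."""
    required = set(required)
    return [name for name, caps in nodes.items() if required <= set(caps)]


# Hypothetical nodes and capabilities, for illustration only.
nodes = {
    "node-a": ["gfortran", "python3", "apptainer"],
    "node-b": ["python3"],
}

# Only node-a offers both capabilities.
print(eligible_nodes(["python3", "apptainer"], nodes))

# No node offers "cuda", so a job requiring it would never be scheduled.
print(eligible_nodes(["cuda"], nodes))
```

An empty result corresponds to the case described above in which the job is not run, which is why checking the node capabilities with --show-caps before submitting can save time.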
template
distribute template is a simple way to create a distribute-jobs.yaml file that runs either python or apptainer jobs. The specifics of each configuration file will be discussed later.
distribute template python
---
meta:
  batch_name: your_jobset_name
  namespace: example_namespace
  matrix: ~
  capabilities:
    - gfortran
    - python3
    - apptainer
python:
  initialize:
    build_file: /path/to/build.py
    required_files:
      - path: /file/always/present/1.txt
        alias: optional_alias.txt
      - path: /another/file/2.json
        alias: ~
      - path: /maybe/python/utils_file.py
        alias: ~
  jobs:
    - name: job_1
      file: execute_job.py
      required_files:
        - path: job_configuration_file.json
          alias: ~
        - path: job_configuration_file_with_alias.json
          alias: input.json
and
distribute template apptainer
---
meta:
  batch_name: your_jobset_name
  namespace: example_namespace
  matrix: ~
  capabilities:
    - gfortran
    - python3
    - apptainer
apptainer:
  initialize:
    sif: execute_container.sif
    required_files:
      - path: /file/always/present/1.txt
        alias: optional_alias.txt
      - path: /another/file/2.json
        alias: ~
      - path: /maybe/python/utils_file.py
        alias: ~
    required_mounts:
      - /path/inside/container/to/mount
  jobs:
    - name: job_1
      required_files:
        - path: job_configuration_file.json
          alias: ~
        - path: job_configuration_file_with_alias.json
          alias: input.json
pause
If you use a compute node as a workstation, distribute pause will pause all locally running jobs so that you can use the workstation normally. It takes a single argument: an upper bound on how long the tasks can be paused. The maximum amount of time that a job can be paused is four hours (4h), but if this is not enough you can simply rerun the command. This upper bound exists only to remove any chance of you accidentally leaving the jobs paused for an extended period of time.
If you decide that you no longer need the tasks paused, you can press Ctrl-C to quit the hanging command, and all processes will be resumed automatically. Do not close your terminal before the pause finishes or you have canceled it with Ctrl-C; otherwise the jobs on your machine will never resume.
Some examples of this command:
sudo distribute pause --duration 4h
sudo distribute pause --duration 1h30m10s
sudo distribute pause --duration 60s
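To make the duration strings above concrete, here is a small Python sketch that converts values such as 1h30m10s into seconds and compares them against the four-hour upper bound. It is only an illustration under stated assumptions: `parse_duration` and `MAX_PAUSE_SECONDS` are hypothetical names, the parser accepts only the h/m/s units seen in the examples, and how distribute itself parses durations or reacts to a value above four hours is not specified here.

```python
import re

MAX_PAUSE_SECONDS = 4 * 60 * 60  # the four-hour upper bound described above

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_duration(text):
    """Convert a string such as '1h30m10s' into a number of seconds."""
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject empty input, or input containing anything other than number+unit pairs.
    if not parts or "".join(number + unit for number, unit in parts) != text:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(int(number) * _UNIT_SECONDS[unit] for number, unit in parts)


for example in ("4h", "1h30m10s", "60s"):
    seconds = parse_duration(example)
    # Print the string, its length in seconds, and whether it fits the bound.
    print(example, seconds, seconds <= MAX_PAUSE_SECONDS)
```

All three example durations fit within the bound; 4h sits exactly at the limit.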
server-status
distribute server-status prints out all the running jobs at the head node. It shows all the job batches that are currently running, the number of jobs in each batch that are currently running, and the names of the jobs that have not been run yet. You can use this command to fetch the parameters required by the kill command if needed.
distribute server-status --ip <server ip here>
If there is no output then there are no jobs currently in the queue or executing on nodes.
An example output:
260sec
    :jobs running now: 1
10sec_positive
    -unforced_viscous_decay
    -unforced_inviscid_decay
    -viscous_forcing_no_compensation_eh_first
    -viscous_forcing_no_compensation_eh_second
    -viscous_forcing_no_compensation_eh_both
    :jobs running now: 0
pull
distribute pull takes a distribute-jobs.yaml config file and pulls all the files associated with that batch to a specified --save-dir (default is the current directory). This is really convenient because the only thing you need to fetch your files is the original file you used to compute the results in the first place!
Since you often don't want to pull all the files (which might include tens or hundreds of gigabytes of flowfield files), this command also accepts include or exclude filters, which consist of a list of regular expressions to apply to the file path. With an include query, any file matching one of the regexes will be pulled to your machine. With an exclude query, any file matching one of the regexes will not be pulled to your machine.
The full documentation on regular expressions is found here, but luckily most character strings are valid regular expressions (barring characters like +, -, (, )). Let's say the meta section of your config file looks like this:
---
meta:
  batch_name: incompressible_5second_cases
  namespace: brooks_openfoam_cases
  capabilities: []
and your directory tree looks something like this:
├── incompressible_5second_cases
    ├── case1
    │   ├── flowfield.vtk
    │   └── statistics.csv
    ├── case2
    │   ├── flowfield.vtk
    │   └── statistics.csv
    └── case3
        ├── flowfield.vtk
        └── statistics.csv
If you wanted to exclude any file with a vtk extension, you could run:
distribute pull distribute-jobs.yaml --ip <server ip here> \
exclude \
--exclude "vtk"
Or, if you wanted to exclude all of the case3 files and all vtk files:
distribute pull distribute-jobs.yaml --ip <server ip here> \
exclude \
--exclude "vtk" \
--exclude "case3"
Maybe you only want to pull case1 files:
distribute pull distribute-jobs.yaml --ip <server ip here> \
include \
--include "case1"
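The effect of the three filter commands above can be reproduced offline with a short Python sketch. It rests on assumptions that are not confirmed by this page: that each pattern is searched for anywhere in the file path (which is what the "vtk" example suggests), that the paths are relative to the batch directory, and that Python's `re` dialect behaves like the one distribute uses for these simple patterns. The helper name `apply_filters` is hypothetical.

```python
import re

# File paths from the example directory tree (assumed relative to the batch directory).
paths = [
    "case1/flowfield.vtk", "case1/statistics.csv",
    "case2/flowfield.vtk", "case2/statistics.csv",
    "case3/flowfield.vtk", "case3/statistics.csv",
]


def apply_filters(paths, patterns, mode):
    """Select paths using a list of regexes in 'include' or 'exclude' mode."""
    compiled = [re.compile(pattern) for pattern in patterns]

    def matches(path):
        # A path matches if any pattern is found anywhere in it.
        return any(regex.search(path) for regex in compiled)

    if mode == "include":
        return [p for p in paths if matches(p)]
    if mode == "exclude":
        return [p for p in paths if not matches(p)]
    raise ValueError(f"unknown mode: {mode!r}")


print(apply_filters(paths, ["vtk"], "exclude"))           # only the statistics.csv files remain
print(apply_filters(paths, ["vtk", "case3"], "exclude"))  # statistics.csv of case1 and case2
print(apply_filters(paths, ["case1"], "include"))         # both case1 files

# Characters with special regex meaning, such as "+" or "(", need escaping
# when they should match literally; re.escape does this in Python.
print(re.escape("run(1)+extra"))
```

Trying a pattern against a list of expected paths like this is a cheap way to check a filter before pulling a large batch.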
run
distribute run will run an apptainer job locally. It is useful for debugging apptainer jobs, since the exact commands that are passed to the container are not always intuitive.
distribute run --help
distribute-run 0.6.0
run a apptainer configuration file locally (without sending it off to a server)

USAGE:
    distribute run [FLAGS] [OPTIONS] [job-file]

FLAGS:
        --clean-save    allow the save_dir to exist, but remove all the contents of it before executing the code
    -h, --help          Prints help information
    -V, --version       Prints version information

OPTIONS:
    -s, --save-dir <save-dir>    the directory where all the work will be performed [default: ./distribute-run]

ARGS:
    <job-file>    location of your configuration file [default: distribute-jobs.yaml]
An example is provided in the apptainer jobs section.