Configuration
Configuration files are fundamental to how distribute works. Without a configuration file, the server would not know which nodes a job can run on, or even what each job actually does. Configuration files are also used to pull the files you want from your compute job back to your local machine. Therefore, it is important to understand them.
Configuration files
As mentioned in the introduction, configuration files (usually named distribute-jobs.yaml) come in two flavors: python scripts and apptainer images.
The advantage of python scripts is that they are relatively easy to produce: you need a single script that specifies how to build your project, and one script per job that specifies how to run that job. The disadvantage of python configurations is that they are very brittle - the exact configuration of a server may differ slightly from your own environment, and so jobs can fail in unpredictable ways. Since all nodes with your capabilities are treated equally, a node that fails to execute your files will quickly chew through your jobs and spit out some errors.
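For a rough sense of what a python-flavored distribute-jobs.yaml looks like, here is a minimal sketch. The exact field names under the python section (initialize, build_file, file) are assumptions for illustration only; the python subchapter documents the real schema.

```yaml
meta:
  batch_name: example_python_batch
  namespace: example_namespace
  capabilities: []
  # other meta fields (e.g. matrix) omitted

# assumed layout: one script builds the project,
# then one script per job runs that job
python:
  initialize:
    build_file: build.py          # how to compile / set up the project
    required_files: []
  jobs:
    - name: job_1
      file: run_job_1.py          # how to run this particular job
      required_files: []
    - name: job_2
      file: run_job_2.py
      required_files: []
```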
The advantage of apptainer jobs is that you can be sure that the way the job runs on distribute nodes is exactly how it would run on your local machine. This means that, while it may take slightly longer to make an apptainer job, you can directly ensure that all the dependencies are present and that there won't be any unexpected differences in the environment to ruin your job execution. The importance of this cannot be overstated. The other advantage of apptainer jobs is that they can be run directly on other compute clusters (as well as every lab machine), and they are much easier to debug if you want to hand off the project to another lab member for help. The disadvantage of apptainer jobs is that the file system is not mutable - you cannot write to any files in the container. Any attempt to write a file in the apptainer filesystem will result in an error and the job will fail. Fear not, the fix for this is relatively easy: you simply bind folders from the host file system (via the configuration file) into your container, and those folders will be writable. All you have to do then is ensure that your compute job only writes to folders that have been bound into the container from the host filesystem.
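As a concrete illustration (using the same required_mounts field that appears in the full example below, and assuming the nesting shown there), the bound folder is the only place a job should write to:

```yaml
apptainer:
  initialize:
    sif: execute_container.sif
    required_files: []
    required_mounts:
      # bound from the host at runtime, so it is writable;
      # everything else inside the .sif image is read-only
      - /path/inside/container/to/mount
```

Your job should therefore direct all of its output to /path/inside/container/to/mount (or whichever paths you choose to bind).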
Regardless of whether you use a python or apptainer configuration, the three main sections of the configuration file remain the same:
| Section | Python Configuration | Apptainer Configuration |
|---|---|---|
| Meta | Information about the batch: batch_name, namespace, matrix user, and capabilities | The same as python |
| Building | A python script that builds your project, plus any files it needs | A .sif image, any required files, and the folders to mount |
| Running | A python script for each job, plus any files that job needs | The name of each job and any files it needs |
How files are saved
Files are saved on the server using your namespace, batch_name, and job_names. Take the following configuration file, which specifies an apptainer job that does not save any of its own files:
```yaml
meta:
  batch_name: example_jobset_name
  namespace: example_namespace
  matrix: @your-username:matrix.org
  capabilities: []

apptainer:
  initialize:
    sif: execute_container.sif
    required_files: []
    required_mounts:
      - /path/inside/container/to/mount
  jobs:
    - name: job_1
      required_files: []
    - name: job_2
      required_files: []
    - name: job_3
      required_files: []
```
The resulting folder structure on the head node will be:

```
.
└── example_namespace
    └── example_jobset_name
        ├── example_jobset_name_build_ouput-node-1.txt
        ├── example_jobset_name_build_ouput-node-2.txt
        ├── example_jobset_name_build_ouput-node-3.txt
        ├── job_1
        │   └── stdout.txt
        ├── job_2
        │   └── stdout.txt
        └── job_3
            └── stdout.txt
```
The nice thing about distribute is that you also receive the output that would appear on your terminal as text files. Namely, you will have a text file recording how your project was built on each node (example_jobset_name_build_ouput-node-1.txt is the build output from node-1), as well as the output of each job inside its respective folder.
If you were to execute another configuration file using a different batch_name, like this:

```yaml
meta:
  batch_name: another_jobset
  namespace: example_namespace
  matrix: @your-username:matrix.org
  capabilities: []

# -- snip -- #
```
the output would look like this:
```
.
└── example_namespace
    ├── another_jobset
    │   ├── example_jobset_name_build_ouput-node-1.txt
    │   ├── example_jobset_name_build_ouput-node-2.txt
    │   ├── example_jobset_name_build_ouput-node-3.txt
    │   ├── job_1
    │   │   └── stdout.txt
    │   ├── job_2
    │   │   └── stdout.txt
    │   └── job_3
    │       └── stdout.txt
    └── example_jobset_name
        ├── example_jobset_name_build_ouput-node-1.txt
        ├── example_jobset_name_build_ouput-node-2.txt
        ├── example_jobset_name_build_ouput-node-3.txt
        ├── job_1
        │   └── stdout.txt
        ├── job_2
        │   └── stdout.txt
        └── job_3
            └── stdout.txt
```
Therefore, it's important to ensure that your batch_name fields are unique: if they are not, the output of the previous batch will be deleted or combined with the output of the new batch.
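One simple convention (a suggestion, not something distribute requires) is to put a run number or short description in the batch name so that successive runs never collide:

```yaml
meta:
  # first submission
  batch_name: example_jobset_name_run1
  # a later re-run of the same study would instead use, e.g.:
  # batch_name: example_jobset_name_run2
  namespace: example_namespace
  capabilities: []
```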
Examples
Examples of creating each configuration file can be found in this page's subchapters.