Python

Python configuration file templates can be generated as follows:

distribute template python

At the time of writing, it outputs something like this:

---
meta:
  batch_name: your_jobset_name
  namespace: example_namespace
  matrix: ~
  capabilities:
    - gfortran
    - python3
    - apptainer
python:
  initialize:
    build_file: /path/to/build.py
    required_files:
      - path: /file/always/present/1.txt
        alias: optional_alias.txt
      - path: /another/file/2.json
        alias: ~
      - path: /maybe/python/utils_file.py
        alias: ~
  jobs:
    - name: job_1
      file: execute_job.py
      required_files:
        - path: job_configuration_file.json
          alias: ~
        - path: job_configuration_file_with_alias.json
          alias: input.json

What You Are Provided

You may ask: what do your scripts see when they are executed on a node? While the base folder structure remains the same, the files you are provided differ. Let's say you are executing the following section of a configuration file:

python:
  initialize:
    build_file: /path/to/build.py
    required_files:
      - path: file1.txt
      - path: file999.txt
        alias: file2.txt
  jobs:
    - name: job_1
      file: execute_job.py
      required_files:
        - path: file3.txt
    - name: job_2
      file: execute_job.py
      required_files: []

When the compilation (build) step executes, the folder structure on the node would look like this:

.
├── build.py
├── distribute_save
├── initial_files
│   ├── file1.txt
│   └── file2.txt
└── input
    ├── file1.txt
    └── file2.txt

In other words: when building, you only have access to the files from the required_files section in initialize. Another thing to note is that even though you have specified the path to the file999.txt file on your local computer, the file has actually been named file2.txt on the node. This is an additional feature to help your job execution scripts work with uniform file names; you don't actually need to keep a bunch of solver inputs named solver_input.json in separate folders to prevent name collisions. You can instead have several inputs solver_input_1.json, solver_input_2.json, solver_input_3.json on your local machine and then set the alias field to solver_input.json so that your run script can simply read the file at ./input/solver_input.json!
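
As a minimal sketch (assuming the solver_input.json alias from the example above), the run script can then load the same path no matter which local file was actually supplied:

import json

# the alias guarantees the job always sees its configuration at the same path,
# no matter which solver_input_N.json was listed on the local machine
with open("./input/solver_input.json") as f:
    config = json.load(f)

print(config)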

Lets say your python build script (which has been renamed to build.py by distribute for uniformity) clones the STREAmS solver repository and compiles the project. Then, when executing job_1, your folder structure would look something like this:

.
├── job.py
├── distribute_save
├── initial_files
│   ├── file1.txt
│   └── file2.txt
├── input
│   ├── file1.txt
│   ├── file2.txt
│   └── file3.txt
└── STREAmS
    ├── README.md
    └── src
        └── main.f90

Now, the folder structure is exactly as you have left it, plus the addition of a new file3.txt that you specified in your required_files section under jobs. Since job_2 does not specify any additional required_files, the directory structure when running the python script would look like this:

.
├── job.py
├── distribute_save
├── initial_files
│   ├── file1.txt
│   └── file2.txt
├── input
│   ├── file1.txt
│   └── file2.txt
└── STREAmS
    ├── README.md
    └── src
        └── main.f90

In general, the presence of ./initial_files is an implementation detail. The files in this folder are not refreshed between job executions, so you should not rely on the existence of this folder or modify any of its contents. The contents of the folder are copied to ./input with every new job; use those files instead.
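
If you are ever unsure which files a given job actually received, a quick (purely illustrative) check from inside an execution script is to list the ./input directory:

import os

# print whatever distribute placed in ./input for this job; useful for
# verifying which required_files (and their aliases) are present
for name in sorted(os.listdir("./input")):
    print(name)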

Saving results of your compute jobs

Archiving jobs to the head node is super easy. All you have to do is ensure that your execution script moves all files you wish to save into the ./distribute_save folder before exiting. distribute will automatically read all the files in ./distribute_save and save them to the corresponding job folder on the head node permanently. distribute will also clear out the ./distribute_save folder for you between jobs so that you don't end up with duplicate files.
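
For example, an execution script might end with something like the following sketch (the result file names here are hypothetical placeholders for whatever your solver produces):

import shutil

# hypothetical result files produced by the solver in the working directory
results = ["energy.csv", "final_field.vtk"]

for result in results:
    # anything placed in ./distribute_save before the script exits is archived
    # to this job's folder on the head node
    shutil.move(result, f"./distribute_save/{result}")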

Build Scripts

The build script is specified in the initialize section under the build_file key. The build script is simply responsible for cloning relevant git repositories and compiling any code in the project. Since private repositories require a GitHub SSH key, a read-only SSH key is provided on the system so that you can clone any private Fluid-Dynamics-Group repo. An example build script that I have personally used for working with hit3d looks like this:

import subprocess
import os
import sys
import shutil
# hit3d_helpers is a python script that I have specified in 
# my `required_files` section of `initialize`
from initial_files import hit3d_helpers
import traceback

HIT3D = "https://github.com/Fluid-Dynamics-Group/hit3d.git"
HIT3D_UTILS = "https://github.com/Fluid-Dynamics-Group/hit3d-utils.git"
VTK = "https://github.com/Fluid-Dynamics-Group/vtk.git"
VTK_ANALYSIS = "https://github.com/Fluid-Dynamics-Group/vtk-analysis.git"
FOURIER = "https://github.com/Fluid-Dynamics-Group/fourier-analysis.git"
GRADIENT = "https://github.com/Fluid-Dynamics-Group/ndarray-gradient.git"
DIST = "https://github.com/Fluid-Dynamics-Group/distribute.git"
NOTIFY = "https://github.com/Fluid-Dynamics-Group/matrix-notify.git"

# executes a command as if you were typing it in a terminal
def run_shell_command(command):
    print(f"running {command}")
    output = subprocess.run(command, shell=True, check=True)
    if output.stdout is not None:
        print(output.stdout)

# construct a `git clone` string to run as a shell command
def make_clone_url(ssh_url, branch=None):
    if branch is not None:
        return f"git clone -b {branch} {ssh_url} --depth 1"
    else:
        return f"git clone {ssh_url} --depth 1"

def main():
    build = hit3d_helpers.Build.load_json("./initial_files")

    print("input files:")
    run_shell_command(make_clone_url(HIT3D, build.hit3d_branch))
    run_shell_command(make_clone_url(HIT3D_UTILS, build.hit3d_utils_branch))
    run_shell_command(make_clone_url(VTK))
    run_shell_command(make_clone_url(VTK_ANALYSIS))
    run_shell_command(make_clone_url(FOURIER))
    run_shell_command(make_clone_url(GRADIENT))
    run_shell_command(make_clone_url(DIST, "cancel-tasks"))
    run_shell_command(make_clone_url(NOTIFY))
    
    # move the directory for bookkeeping purposes
    shutil.move("fourier-analysis", "fourier")

    # build hit3d
    os.chdir("hit3d/src")
    run_shell_command("make")
    os.chdir("../../")

    # build hit3d-utils
    os.chdir("hit3d-utils")
    run_shell_command("cargo install --path .")
    os.chdir("../")

    # build vtk-analysis
    os.chdir("vtk-analysis")
    run_shell_command("cargo install --path .")
    os.chdir("../")

    # all the other projects cloned are dependencies of the built projects
    # they don't need to be explicitly built themselves

if __name__ == "__main__":
    main()

Note that os.chdir is the Python equivalent of the shell's cd command: it simply changes the current working directory of the script.
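
This matters because each run_shell_command call above spawns its own short-lived shell, so a cd inside one of those commands does not carry over to the next command; changing directories from Python does. A small illustration:

import os
import subprocess

# this cd only affects the shell that subprocess spawns for this one command
subprocess.run("cd /tmp && pwd", shell=True, check=True)

# the python process itself is still in its original directory
print(os.getcwd())

# os.chdir changes the working directory for everything that follows
os.chdir("/tmp")
print(os.getcwd())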

Job Execution Scripts

Execution scripts are specified in the file key of each list item under jobs. Execution scripts can do a lot of things. I have found it productive to write a single generic_run.py script that reads a configuration file from ./input/input.json (which is specified under my required_files for the job) and then runs the solver from there.

One important thing about execution scripts is that they are run with a command line argument specifying how many cores you are allowed to use. If you hardcode the number of cores, you will either oversaturate the processor (slowing down the overall execution speed) or undersaturate the resources available on the machine. Your script will be "executed" as if it were a command line program. If the computer had 16 cores available, this would be the command:

python3 ./job.py 16

You can parse this value using sys.argv in your script:

import sys
allowed_processors = sys.argv[1]
allowed_processors_int = int(allowed_processors)
assert allowed_processors_int == 16

You must ensure that you use all available cores on the machine. If your code can only use a reduced number of cores, make sure you specify this in your capabilities section! Do not run single-threaded processes on the distributed computing network - they will not go any faster.
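
As a sketch of what this usually looks like in practice, assuming an MPI-based solver (the executable path below is hypothetical), the allowed core count is simply forwarded to the launch command:

import subprocess
import sys

# number of cores distribute allows this job to use, passed as argv[1]
nprocs = int(sys.argv[1])

# hypothetical MPI launch; substitute the command your solver actually expects
subprocess.run(
    f"mpirun -np {nprocs} ./path/to/solver",
    shell=True,
    check=True,
)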

A full working example of a run script that I use is this:

import os
import sys
import json
from input import hit3d_helpers
import shutil
import traceback

IC_SPEC_NAME = "initial_condition_espec.pkg"
IC_WRK_NAME = "initial_condition_wrk.pkg"

def load_json():
    path = "./input/input.json"

    with open(path, 'r') as f:
        data = json.load(f)
        print(data)
        return data

# copies some initial condition files from ./input 
# to the ./hit3d/src directory so they can be used 
# by the solver
def move_wrk_files():
    outfile = "hit3d/src/"
    infile = "input/"

    shutil.copy(infile + IC_SPEC_NAME, outfile + IC_SPEC_NAME)
    shutil.copy(infile + IC_WRK_NAME, outfile + IC_WRK_NAME)


# copy the ./input/input.json file to the output directory
# so that we can see it later when we download the data for viewing
def copy_input_json():
    outfile = "distribute_save/"
    infile = "input/"

    shutil.copy(infile + "input.json", outfile + "input.json")

if __name__ == "__main__":
    try:
        data = load_json()

        # get the number of cores that we are allowed to use from the command line
        nprocs = int(sys.argv[1])

        print(f"running with nprocs = ", nprocs)

        # parse the json data into parameters to run the solver with
        skip_diffusion = data["skip_diffusion"]
        size = data["size"]
        dt = data["dt"]
        steps = data["steps"]
        restarts = data["restarts"]
        reynolds_number = data["reynolds_number"]
        path = data["path"]
        load_initial_data = data["load_initial_data"]
        export_vtk = data["export_vtk"]
        epsilon1 = data["epsilon1"]
        epsilon2 = data["epsilon2"]
        restart_time = data["restart_time"]
        skip_steps = data["skip_steps"]
        scalar_type = data["scalar_type"]
        validate_viscous_compensation = data["validate_viscous_compensation"]
        viscous_compensation = data["viscous_compensation"]
        require_forcing = data["require_forcing"]
        io_steps = data["io_steps"]
        export_divergence = data["export_divergence"]
        
        # if we need initial condition data then we copy it into ./hit3d/src/
        if load_initial_data != 1:
            move_wrk_files()

        root = os.getcwd()

        # open hit3d folder
        os.chdir("hit3d")

        # run the solver using the `hit3d_helpers` file that we have
        # ensured is present from `required_files`
        hit3d_helpers.RunCase(
            skip_diffusion,
            size,
            dt,
            steps, 
            restarts, 
            reynolds_number,
            path,
            load_initial_data, 
            nprocs, 
            export_vtk, 
            epsilon1,
            epsilon2, 
            restart_time,
            skip_steps,
            scalar_type,
            validate_viscous_compensation,
            viscous_compensation,
            require_forcing,
            io_steps,
            export_divergence
        ).run(0)

        # go back to main folder that we started in 
        os.chdir(root)

        copy_input_json()

        sys.exit(0)

    # this section will ensure that the exception and traceback
    # is printed to the console (and therefore appears in the stdout files
    # saved on the server)
    except Exception as e:
        print("EXCEPTION occurred:\n", e)
        print(e.__cause__)
        print(e.__context__)
        traceback.print_exc()
        sys.exit(1)

Full Example

A simpler example of a python job has been compiled and verified here.