
bin+lib gosh-remote

Distributed parallel computing over multiple nodes

4 releases

0.3.2 Feb 14, 2023
0.3.1 Feb 12, 2023
0.2.2 Feb 11, 2023
0.2.0 Feb 7, 2023


GPL-3.0 license


gosh-remote turns a script that runs several processes in parallel on one machine into one that distributes those processes across multiple nodes in an HPC environment.

Usage

  1. Start (bootstrap) the scheduler and workers automatically using mpirun in a batch system

    run-gosh-remote.sh

    # install scheduler
    gosh-remote bootstrap as-scheduler &
    
    # install workers on allocated nodes from batch system
    mpirun gosh-remote -v bootstrap as-worker
    

    The above works as a normal batch script, which can be submitted to the batch system with a command such as bsub:

    bsub -J test -R "span[ptile=24]" -n 72 ./run-gosh-remote.sh
    

    The command above requests 3 nodes for remote execution (72 slots at 24 slots per node).

  2. Change the job script

    The master script calls test.sh in parallel using 3 processes:

    xargs -P 3 -n 1 gosh-remote client run <<<$'./test.sh ./test.sh ./test.sh'
    

    The job script test.sh:

    #! /usr/bin/env bash
    echo running on $(hostname)
    

    Example output:

    "Ok(\"{\\\"JobCompleted\\\":\\\"running on node037\\\\n\\\"}\")"
    "Ok(\"{\\\"JobCompleted\\\":\\\"running on node038\\\\n\\\"}\")"
    "Ok(\"{\\\"JobCompleted\\\":\\\"running on node042\\\\n\\\"}\")"
    
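To check that the jobs really landed on different nodes, the host names can be pulled out of the client output. The following is a minimal sketch, not part of gosh-remote: `hosts_from_output` is a hypothetical helper, and it assumes both the escaped output format shown above and the "running on" message printed by test.sh.

```shell
# hosts_from_output (hypothetical helper): read client output lines on stdin
# and print the sorted, distinct host names found after "running on ".
# Assumes the escaped format shown in the example output above.
hosts_from_output() {
    # Capture everything between "running on " and the next backslash.
    sed -n 's/.*running on \([^\\]*\)\\.*/\1/p' | sort -u
}

# Sample input copied from the example output; the quoted heredoc keeps
# the backslashes literal.
hosts_from_output <<'EOF'
"Ok(\"{\\\"JobCompleted\\\":\\\"running on node037\\\\n\\\"}\")"
"Ok(\"{\\\"JobCompleted\\\":\\\"running on node038\\\\n\\\"}\")"
"Ok(\"{\\\"JobCompleted\\\":\\\"running on node042\\\\n\\\"}\")"
EOF
```

For the sample input this prints node037, node038 and node042, one per line.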

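The xargs call in step 2 can be wrapped so that the job list comes from a file or a pipe instead of a here-string. This is only a sketch: `fan_out` is a hypothetical helper name, and `echo would-run` stands in for `gosh-remote client run` so the example works without a running scheduler. Note that `-P` is an extension available in GNU and BSD xargs, not part of POSIX.

```shell
# fan_out NPROC CMD [ARGS...]   (hypothetical helper)
# Read jobs from stdin, one whitespace-separated word per job, and run
# "CMD ARGS... JOB" for each, with at most NPROC jobs running at a time.
fan_out() {
    local nproc="$1"
    shift
    xargs -P "$nproc" -n 1 "$@"
}

# Stand-in runner for demonstration. With the services from step 1 running,
# the equivalent call would be:
#   fan_out 3 gosh-remote client run < jobs.txt
printf '%s\n' ./test.sh ./test.sh ./test.sh | fan_out 3 echo would-run
```

Because `xargs -n 1` splits its input on blanks as well as newlines, each job has to be a single word, such as the path to a job script.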
Example in action (for magman)

run.sh

The main script, which starts the scheduler and workers and then runs magman

#! /usr/bin/env bash

set -x
#export SPDKIT_RANDOM_SEED=2227866437669085292

LOCK_FILE="gosh-remote-scheduler.lock"
# run at most MAX_NPROC processes at a time
MAX_NPROC=8

# step 1: start remote execution services
(
# install scheduler on the master node; the service address will be recorded in LOCK_FILE
gosh-remote -v bootstrap -w "$LOCK_FILE" as-scheduler &

# use mpirun to install one worker on each node by creating a machinefile for mpirun
which mpirun
# for the LSF batch system, the nodes can be read from an environment variable
#echo $LSB_HOSTS | xargs -n 1 | uniq | xargs -I{} echo {}:1 >machines
# or
mpirun hostname | sort | uniq |xargs -I{} echo {}:1 >machines
# works for Intel MPI, MPICH, MVAPICH
mpirun -bootstrap=ssh -prepend-rank -machinefile machines gosh-remote -vv bootstrap as-worker -w "$LOCK_FILE"
) 2>&1 | tee gosh-remote.log &
sleep 2

# step 2: run magman
# NOTE: to run vasp remotely using the scheduler, run-vasp.sh needs to be set up accordingly
# write clean output to magman.out, and everything (including stderr) to magman.log
(magman -j $MAX_NPROC -r -vvv | tee magman.out) 2>&1 | tee magman.log

# step 3: when magman is done, kill background services
sleep 1
pkill gosh-remote
# could be better if using mpirun?
# mpirun pkill gosh-remote
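The machinefile step in run.sh reduces a list of host names to one `host:1` entry per node. A minimal sketch of that step in isolation follows, with `printf` standing in for `mpirun hostname`; `make_machinefile` is a hypothetical helper name, not part of gosh-remote.

```shell
# make_machinefile (hypothetical helper): read host names on stdin, one per
# line with duplicates allowed, and print one "host:1" line per distinct host.
make_machinefile() {
    sort -u | sed 's/$/:1/'
}

# Stand-in for `mpirun hostname`, which prints one line per MPI rank.
printf '%s\n' node038 node037 node038 node037 | make_machinefile
# prints:
#   node037:1
#   node038:1
```

Using `sort -u` rather than `uniq` alone removes duplicates even when identical host names are not adjacent in the input.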

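run.sh waits a fixed 2 seconds before starting magman. An alternative is to poll for the lock file, since the scheduler records its service address there. The sketch below rests on an assumption not confirmed by the text above: that a non-empty lock file means the scheduler is ready. `wait_for_file` is a hypothetical helper.

```shell
# wait_for_file PATH [TIMEOUT_SECONDS]   (hypothetical helper)
# Poll once per second until PATH exists and is non-empty.
# Returns 0 on success, 1 if the timeout (default 30 s) expires first.
wait_for_file() {
    local path="$1"
    local timeout="${2:-30}"
    local waited=0
    while [ ! -s "$path" ]; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Demonstration with a stand-in for the scheduler: a background subshell
# writes a placeholder address after one second.
demo_dir=$(mktemp -d)
( sleep 1; echo "placeholder-address" > "$demo_dir/demo.lock" ) &
if wait_for_file "$demo_dir/demo.lock" 10; then
    echo "lock file is ready"
fi
wait
rm -rf "$demo_dir"
```

In run.sh this would replace `sleep 2` with something like `wait_for_file "$LOCK_FILE" 30 || exit 1`. Workers may still need a moment to register after the scheduler is up, so a short extra delay can remain useful.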
run-vasp.sh

The script that calls VASP

#! /usr/bin/env bash

# get the directory that contains this script file
# (inner quotes keep paths with spaces intact)
SCRIPT_DIR=$(dirname "$(realpath "${BASH_SOURCE[0]:-$0}")")
LOCK_FILE="$SCRIPT_DIR/gosh-remote-scheduler.lock"

# NOTE: the "-host" option is required for avoiding process migration due to
# nested mpirun call
gosh-remote -vv client -w "$LOCK_FILE" run "mpirun -np 72 -host \$(hostname) vasp"
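The backslash in `\$(hostname)` matters: it keeps the client's shell from expanding the substitution, so the literal text reaches the worker and is expanded there. The sketch below illustrates the difference with a plain variable. It assumes the worker passes the command string to a shell, which the text above implies but does not state; `run_on_worker` and `WHERE` are hypothetical names used only for illustration.

```shell
# run_on_worker (hypothetical stand-in for the worker side): hand the command
# string to a fresh shell whose WHERE variable differs from the caller's.
run_on_worker() {
    WHERE=worker sh -c "$1"
}

WHERE=client
eager="echo running on $WHERE"       # expanded now, by the client's shell
deferred="echo running on \$WHERE"   # expanded later, by the worker's shell

run_on_worker "$eager"      # prints: running on client
run_on_worker "$deferred"   # prints: running on worker
```

Without the backslash, `$(hostname)` in run-vasp.sh would resolve to the node where the client runs rather than the node where the worker executes the command.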

FAQ

How to avoid conflicts caused by nested mpirun calls

In the client-side script, make sure vasp runs on the worker node, without migrating to other nodes:

mpirun -host `hostname` vasp
