Getting started¶
This plugin supplies a special scheduler that uses fireworks as the backend for job launching and placement, rather than the actual scheduler installed on the remote computer. The main advantage is that task farming becomes possible, e.g. a single scheduler job may run multiple AiiDA calculations in serial or in parallel. In addition, AiiDA calculations can be launched alongside jobs that use fireworks directly, and resources may be allocated dynamically based on the priority of each job.
Note
Prerequisites:
A working MongoDB server
Remote cluster is not fenced, i.e. the compute nodes have internet access at run time.
Remote cluster’s scheduler is one of the following:
SLURM
SGE
Python environments are available on the remote cluster (typically through conda or virtualenv)
The second requirement (unfenced compute nodes) is not necessary if local arrangements can be made for hosting the MongoDB server inside the firewall. Support for other schedulers can be implemented easily if you know how to fetch the job time limit from inside a running job.
Installation¶
Use the following commands to install the plugin:
git clone https://github.com/zhubonan/aiida-fireworks-scheduler
cd aiida-fireworks-scheduler
pip install -e . # Note: this will not install aiida itself, since you may be installing the plugin on the remote machine where aiida is not needed!
#pip install -e .[pre-commit,testing] # install extras for more features
reentry scan -r aiida # refresh the entry points
verdi plugin list aiida.schedulers # should now show a scheduler entrypoint named `fw`
Setup for fireworks¶
The plugin requires a working fireworks installation; please consult the installation guide for fireworks. In particular, you will need to have a LaunchPad (the MongoDB server) up and running. The default LaunchPad will be used by the plugin. You can check that your setup is working with the following command:
lpad get_fws -m 1
This should return exactly one Firework if any exist, or give an empty output otherwise.
Warning
Make sure you have worked through the basic tutorials for fireworks before continuing the setup process. In particular, you should be familiar with using the lpad and rlaunch commands to manage and launch fireworks workflows manually. There is no need to learn how to design fireworks workflows, so you can skip those parts.
Please confirm that the following work:
- lpad get_fws works on both the local and the remote computers.
- rlaunch can be used to launch workflows on the remote computer.
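For instance, a minimal check on the remote cluster (a sketch, assuming the Python environment with fireworks is activated and the default LaunchPad configuration is in place) might look like:
lpad get_fws -m 1   # confirms the LaunchPad (MongoDB server) is reachable
rlaunch singleshot  # fetches and runs a single ready Firework, if any exists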
Note
A MongoDB server is required to use this plugin; a source of free hosting is: https://www.mongodb.com/atlas/database. Only a minimal amount of data is stored on the server, so it does not require a lot of storage space, e.g. the free tier is usually fine. Alternatively, there are many VPS providers that offer pre-configured MongoDB instances. For the fireworks tutorials you can install MongoDB locally, but this will not work for production calculations, because the server needs to be reachable by both the local workstation (where AiiDA is installed) and the remote cluster. If the remote cluster does not have internet access, you will have to make other arrangements, for example using one of the post-processing nodes to host the MongoDB server.
Setup for AiiDA¶
Once the plugin is installed, a fireworks scheduler will be available to choose when setting up a new Computer node.
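For example, a new Computer using the plugin could be set up along the following lines (a sketch only: the label, hostname and work directory are placeholders, and the scheduler entry point should be one of those listed by verdi plugin list aiida.schedulers):
verdi computer setup --label my-cluster-fw --hostname my.cluster.address \
    --transport ssh --scheduler fireworks_scheduler.default \
    --work-dir /scratch/{username}/aiida_run/ --non-interactive
verdi computer configure ssh my-cluster-fw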
It is also possible to migrate an existing Computer and its associated Code nodes with a helper command:
verdi data fireworks-scheduler duplicate-computer -Y <computer-name> --included-codes
The new Computer will be assigned a suffix -fw, with the original settings copied over.
By default, each AiiDA calculation is performed in a clean login shell on the node running fireworks.
This is needed when the scheduler does not support running sub-jobs, or when smart wrappers are used for launching MPI programs.
Certain schedulers, such as SLURM, do support running sub-jobs inside a single job.
In this case, the _aiidasubmit.sh should be launched in the same environment rather than in a fresh login shell.
To select this mode of operation, use the fireworks_scheduler.keepenv entrypoint rather than fireworks_scheduler.default when setting up the scheduler for the Computer.
In this case, the --job-should-keep-env flag should be passed to the duplicate-computer command.
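For example (combining the duplicate-computer command shown above with this flag):
verdi data fireworks-scheduler duplicate-computer -Y <computer-name> --included-codes --job-should-keep-env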
Running calculations¶
On the local Computer¶
The process of launching jobs through fireworks is identical to that with the native schedulers.
For resources, only the single key tot_num_mpiprocs must be passed; no other keys should be given.
A few other options are supported:
- maximum_wallclock_seconds: Specifies the maximum length of the job. The Firework that the AiiDA job corresponds to will only run if there is enough time left in the actual cluster job. The AiiDA job will be terminated if it exceeds the wall clock limit.
- priority: Specifies the priority of the underlying Firework (see this page for more); those with higher priority will run first. AiiDA treats priority as a string, so a string representation of the integer priority value can be passed. The default priority is '100'.
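As an illustration, these settings map onto the standard AiiDA process builder options. The sketch below assumes an arbitrary CalcJob plugin and a code set up on a Computer using the fireworks scheduler; the entry point 'quantumespresso.pw' and the code label 'pw@my-cluster-fw' are placeholders and not part of this plugin:
from aiida import load_profile
from aiida.orm import load_code
from aiida.plugins import CalculationFactory

load_profile()

# Placeholder calculation plugin and code - replace with your own
builder = CalculationFactory('quantumespresso.pw').get_builder()
builder.code = load_code('pw@my-cluster-fw')

# Only tot_num_mpiprocs should be given in resources
builder.metadata.options.resources = {'tot_num_mpiprocs': 24}
# Maximum wall clock time in seconds; the Firework only runs if the cluster job
# has at least this much time left, and the AiiDA job is terminated beyond it
builder.metadata.options.max_wallclock_seconds = 3600
# Priority of the underlying Firework, passed as a string
builder.metadata.options.priority = '200'

# ... set the remaining calculation-specific inputs, then submit with aiida.engine.submit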
On the remote Computer¶
fireworks launches jobs in a slightly indirect way - the submission script contains a line calling the rlaunch command, which selects a Firework to run dynamically.
Job placement is controlled by the definition of a FireWorker, which is typically provided as a yaml file.
Each Firework may have one or more categories defined, and a FireWorker will only select those matching its own category list (see here for more).
The plugin handles assigning the category of the underlying Firework for running the AiiDA job, but the user needs to manually point to the right FireWorker.
The latter can be generated using a helper command:
verdi data fireworks-scheduler generate-worker -Y <computer> --mpinp <tot_num_mpiprocs> myworker.yaml
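For example, to generate a worker file for jobs requesting 24 MPI processes on a computer labelled my-cluster-fw (a placeholder label), matching the 24-core job script shown further below:
verdi data fireworks-scheduler generate-worker -Y my-cluster-fw --mpinp 24 aiida-fworker-24core.yaml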
Note
Each FireWorker will only run jobs requesting a certain number of MPI processes.
Transfer the myworker.yaml file to the remote computer, and use the following line in the job submission script:
arlaunch -w myworker.yaml rapidfire
The arlaunch command is an enhanced version of the original rlaunch command provided by fireworks, and it knows how to select the correct Firework containing the AiiDA job to run.
To ensure that each Firework will have enough time to run, as defined by maximum_wallclock_seconds, arlaunch must be able to query the time left from the actual scheduler.
At the moment, only SGE and SLURM are supported, but it should be relatively easy to add support for other schedulers as well.
Example job script (SGE):
#!/bin/bash -l
# Batch script to run fireworks, each with 24 MPI processes
#$ -l h_rt=48:00:00
#$ -l mem=4G
#$ -l tmpfs=15G
#$ -N aiida-fw-launcher
# Select the MPI parallel environment and 24 processes
#$ -pe mpi 24
# Set the working directory to the current directory
#$ -cwd

# Activate the conda environment where aiida-fireworks-scheduler is installed
conda activate $HOME/Scratch/fireworks_env

CMD="arlaunch -l $HOME/Scratch/fw-config/my_launchpad.yaml -w ./aiida-fworker-24core.yaml rapidfire"
eval $CMD
Example job script with task-farming (SGE):
#!/bin/bash -l
# Batch script to run fireworks with 24 MPI processes in total
#$ -l h_rt=48:00:00
#$ -l mem=4G
#$ -l tmpfs=15G
#$ -N aiida-fw-launcher
# Select the MPI parallel environment and 24 processes
#$ -pe mpi 24
# Set the working directory to the current directory
#$ -cwd

# Activate the conda environment where aiida-fireworks-scheduler is installed
conda activate $HOME/Scratch/fireworks_env

# Launch 4 concurrent workers, each using 6 cores - only AiiDA jobs requesting
# 6 cores will be launched by these workers
for i in $(seq 4); do
    arlaunch -l $HOME/Scratch/fw-config/my_launchpad.yaml -w ./aiida-fworker-6core.yaml rapidfire &
done
wait
where aiida-fworker-24core.yaml and aiida-fworker-6core.yaml are the FireWorker files generated as described above.