Getting Started

Welcome to seekrflow! This guide will walk you through installing and setting up seekrflow for molecular dynamics simulation workflows.

What is seekrflow?

seekrflow is a workflow management system designed for molecular dynamics simulations using the seekr package. It provides:

  • Automated parameter generation for molecular systems

  • Streamlined workflow execution for binding/unbinding simulations

  • Integration with high-performance computing resources

  • Support for complex molecular systems including protein-ligand interactions

Installation

The easiest and quickest way to install seekrflow is to use Mamba. If you don’t already have Mamba installed, download the Miniforge install script and run it.

curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh

Fill out the prompts as they appear.
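The $(uname) and $(uname -m) substitutions in the download command fill in the operating system and CPU architecture. As a rough illustration, the Python sketch below builds the same URL with the standard library. The helper name miniforge_installer_url is hypothetical, and the sketch assumes that platform.system() and platform.machine() report the same strings as uname, which is the case on typical Linux and macOS machines.

```python
import platform

MINIFORGE_BASE = "https://github.com/conda-forge/miniforge/releases/latest/download"


def miniforge_installer_url(system=None, machine=None):
    """Return the Miniforge installer URL for an OS name and CPU architecture."""
    # Fall back to the values for the machine this script runs on.
    system = system or platform.system()
    machine = machine or platform.machine()
    return f"{MINIFORGE_BASE}/Miniforge3-{system}-{machine}.sh"


if __name__ == "__main__":
    print(miniforge_installer_url())
```

Printing the URL before downloading is a quick way to confirm that an installer exists for your platform.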

Once this has been done, set up a new environment:

mamba create -n SEEKR2 python=3.11 --yes

If you plan to use seekrflow for parameterization, you will probably need a second environment to avoid conflicts.

mamba create -n SEEKRFLOW_PARAM python=3.11 --yes

Dependencies

Many of the dependencies for seekrflow are installed alongside seekrflow, but some must be installed separately, before seekrflow itself.

Git (required)

Make sure git is installed to clone repositories. If git isn’t already installed on your computer, run:

mamba install git --yes

Espaloma Machine-learned Forcefield (optional)

If you want to parameterize your molecular system with the machine-learned forcefield espaloma, you will need to install it.

mamba install espaloma=0.3.2 --yes

You will also need to download the correct espaloma .pt file, and save it somewhere on your computer system.

curl -L -O https://github.com/choderalab/espaloma/releases/download/0.3.2/espaloma-0.3.2.pt

SEEKR2 plugin, SEEKR2, and SeekrTools (required)

This step installs the OpenMM plugin for the SEEKR2 package.

mamba activate SEEKR2
mamba install seekr2_openmm_plugin openmm=8.1 --yes

Run the following command to check if the SEEKR2 OpenMM plugin is correctly installed. If no error message appears, the installation was successful.

python -c "import seekr2plugin"

Install SEEKR2.

cd ~
git clone https://github.com/seekrcentral/seekr2.git
cd seekr2
python -m pip install .

Optionally, one may run the tests for SEEKR2.

pytest

Next, clone and install SeekrTools.

cd ~
git clone https://github.com/seekrcentral/seekrtools.git
cd seekrtools
python -m pip install .

Optionally, one may run the tests for SeekrTools.

pytest

Install seekrflow

Finally, with the dependencies out of the way, we can install seekrflow.

cd ~
git clone https://github.com/seekrcentral/seekrflow.git
cd seekrflow
python -m pip install .

Testing seekrflow (Optional)

To test seekrflow, run the following command in the seekrflow/ directory:

pytest
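Beyond the test suites, a quick way to confirm that everything landed in the active environment is to check that each package can be found by Python. The following is a minimal sketch. The import name seekr2plugin comes from the plugin check earlier in this guide, while seekr2, seekrtools, and seekrflow are assumed from the repository directory names and may need adjusting. The helper missing_modules is a hypothetical name.

```python
import importlib.util

# Import names are assumptions: "seekr2plugin" is taken from the plugin check
# earlier in this guide; the others are guessed from the repository names.
EXPECTED_MODULES = ["openmm", "seekr2plugin", "seekr2", "seekrtools", "seekrflow"]


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(EXPECTED_MODULES)
    if missing:
        print("Not found in this environment:", ", ".join(missing))
    else:
        print("All expected modules were found.")
```

Run it with the SEEKR2 environment activated. A module reported as missing usually means the corresponding install step ran in a different environment.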

Installing Dependencies for Remote Execution on HPC Systems

If one wants to use seekrflow to run SEEKR on a remote system, one must decide whether to control the jobs with SSH or with the Globus Compute SDK. The two function very similarly; the pros and cons of each are listed below.

SSH vs Globus Compute SDK Comparison

Remote System Approach | Pros | Cons
SSH | Simple, supported on most Linux systems | Requires password or key-based authentication, passwords in plain text, cannot bypass 2-factor authentication
Globus Compute SDK | Only an endpoint id is needed, avoiding password and 2-factor authentication. | More complicated setup of software and settings

Install Globus Compute SDK (optional - if not using SSH)

To use remote execution on high-performance computing (HPC) systems with the Globus Compute SDK, you will need to install the package. This allows you to run SEEKR jobs on remote resources without having to deal with passwords and 2-factor authentication.

Important

The Python versions must be the same (or at least very similar) across the local and remote machines.
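One way to check this requirement is to compare the output of python --version on both machines. The sketch below is only an illustration: it interprets "very similar" as "same major and minor version", which is an assumption rather than a documented rule, and the function names are hypothetical. The remote version string is a placeholder to be replaced by hand.

```python
import sys


def major_minor(version):
    """Extract (major, minor) from a version string such as '3.11.4'."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def versions_compatible(local, remote):
    """Treat two versions as compatible when major.minor match (an assumption)."""
    return major_minor(local) == major_minor(remote)


if __name__ == "__main__":
    local = "{}.{}.{}".format(*sys.version_info[:3])
    # Placeholder: paste the version reported by the remote SEEKR2 environment.
    remote = "3.11.9"
    print(f"local={local} remote={remote} compatible={versions_compatible(local, remote)}")
```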

On the local machine:

python -m pip install globus-compute-sdk

On the remote HPC system, you will need to follow the steps above to create the Mamba environment (on the head or login node; the environment is assumed to be named “SEEKR2”). Into this environment, install SEEKR, SeekrTools, and seekrflow. Then, make sure it is activated. (There is probably no need to make a parameterization environment on the remote machine, as it is presumed that parameterization, if any, will be performed on the local machine.)

mamba activate SEEKR2

Install pipx for environmental isolation.

python3 -m pip install --user pipx

Once this is completed, install globus-compute-endpoint on the remote machine:

python3 -m pipx install globus-compute-endpoint

Configure the endpoint:

globus-compute-endpoint configure my_seekr_endpoint

At this point, one should modify the file at ~/.globus-compute/my_seekr_endpoint/user_config_template.yaml.j2 in order to make full use of the HPC resource’s capabilities.

Here is an example Globus Compute Endpoint configuration file that I used for the NCSA Delta supercomputer:

# Comments here...

endpoint_setup: ''

engine:
    type: GlobusComputeEngine
    max_workers_per_node: 2

    provider:
        type: LocalProvider

        min_blocks: 0
        max_blocks: 1
        init_blocks: 1

        worker_init: 'source /path/to/your/miniforge3/etc/profile.d/conda.sh && mamba activate SEEKR2'

Note

This configuration applies to the head/login node before SLURM/PBS job submission. Notice that max_workers_per_node is set to 2 for proper handling of the Globus Compute SDK client in seekrflow.

Note

Make sure you have defined the OPENMM_CUDA_COMPILER environment variable in your shell initialization file (e.g., ~/.bashrc) on the remote machine.
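To confirm that the variable is visible to Python on the remote machine, a short check like the one below can help. This is a minimal sketch using only the standard library; cuda_compiler_status is a hypothetical helper, and the three status strings it returns are arbitrary labels chosen for this example.

```python
import os
import shutil


def cuda_compiler_status(environ):
    """Describe the state of OPENMM_CUDA_COMPILER in an environment mapping."""
    path = environ.get("OPENMM_CUDA_COMPILER")
    if not path:
        return "unset"
    # The variable should point at an existing, executable compiler binary.
    if os.path.isfile(path) and os.access(path, os.X_OK):
        return "ok"
    return "not executable"


if __name__ == "__main__":
    print("OPENMM_CUDA_COMPILER:", cuda_compiler_status(os.environ))
    # For comparison, show where nvcc is found on PATH (None if it is absent).
    print("nvcc on PATH:", shutil.which("nvcc"))
```

Run it in a fresh login shell on the remote machine so that it reflects what the shell initialization file actually sets.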

Start the endpoint.

globus-compute-endpoint start my_seekr_endpoint

You will need to authenticate with Globus in a browser to start the endpoint, and enter the authorization code given in the browser into the terminal. This authentication session should last for a number of days or weeks, depending on your Globus account settings.

One can see the endpoint, as well as its endpoint ID, and those of any other endpoints, by listing them:

globus-compute-endpoint list

Copy the endpoint ID from the ‘start’ or ‘list’ commands above, and save for future reference.

One can also stop the endpoint if/when desired:

globus-compute-endpoint stop my_seekr_endpoint

Before submitting remote jobs, always check the endpoints to make sure they are started on the remote machine, and start them if they are not.

globus-compute-endpoint list
globus-compute-endpoint start my_seekr_endpoint

More information about Globus endpoints and the Globus SDK can be found here: https://globus-compute.readthedocs.io/en/latest/endpoints/endpoints.html

Install Globus Connect Personal (optional)

File transfers may be done with Globus or with rsync. Some resources prefer the use of Globus for large file transfers, instead of rsync.

If you want to use Globus to transfer files to and from your remote HPC system, install the latest version of Globus Connect Personal on your local machine.

Follow the instructions here: https://www.globus.org/globus-connect-personal

–or– For a typical Linux/Unix installation:

curl -O https://downloads.globus.org/globus-connect-personal/linux/stable/globusconnectpersonal-latest.tgz
tar xzf globusconnectpersonal-latest.tgz
cd globusconnectpersonal-x.y.z
./globusconnectpersonal

And follow the on-screen instructions to set up your Globus Connect Personal instance.

Find the Collection IDs Using the Globus Web Portal

Now, one must find the Collection IDs for the Globus Connect Personal instance on the local machine, and the Globus Compute Endpoint on the remote HPC system. This is done by logging into the Globus Web Portal (https://app.globus.org) and navigating to the “Collections” tab.

Find your local Globus Connect Personal instance under “Administered By You”, click its name, and search the page for “UUID”. This is the Collection ID for your local Globus Connect Personal instance. Save this UUID for your seekrflow configuration input file.

Find your remote Globus Collection ID by backtracking to the “Collections” tab, and searching for your remote Globus Compute Endpoint. Click on the endpoint name, and search the page for “UUID”. This is the Collection ID for your remote Globus Compute Endpoint. Save this UUID for your seekrflow configuration input file.
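Because these IDs are UUIDs, a copy-and-paste mistake can be caught before it reaches the configuration file. The sketch below is one possible check using Python's standard uuid module; is_uuid is a hypothetical helper, and it accepts only the hyphenated form shown on the Globus pages.

```python
import uuid


def is_uuid(value):
    """Return True if value is a UUID written in the canonical hyphenated form."""
    try:
        # uuid.UUID also accepts unhyphenated input, so compare with the
        # canonical rendering to reject anything in a different format.
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    for candidate in ["7e936164-de58-4e3d-85da-21aa23c07169", "YOUR LOCAL COLLECTION ID"]:
        print(candidate, is_uuid(candidate))
```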

Quick Start Example System - Trypsin/Benzamidine

A seekrflow calculation needs an input settings JSON file to run, as well as a starting PDB file containing a bound complex of the receptor and ligand molecules. An example system can be found in seekrflow/seekrflow/examples/trypsin_benzamidine. Here, the bound complex of the receptor and ligand is found in the file “protein_ligand.pdb”, and an example input JSON file can be found in “seekrflow.json”. A full workflow run can be done with the following steps.

mamba activate SEEKRFLOW_PARAM
python ~/seekrflow/seekrflow/parameterize.py --input_json seekrflow.json --ligand_resname BEN
mamba deactivate
mamba activate SEEKR2
python ~/seekrflow/seekrflow/flow.py prepare --input_json seekrflow.json
python ~/seekrflow/seekrflow/flow.py run --input_json seekrflow.json
python ~/seekr2/seekr2/analyze.py work/root/model.xml
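If one prefers to drive these steps from a script, the commands can be assembled as argument lists and printed for review first. The following is a sketch under stated assumptions, not part of seekrflow: build_workflow_commands is a hypothetical helper, the script paths assume the repositories were cloned into the home directory as in this guide, and mamba run -n is used in place of the activate/deactivate steps.

```python
import os
import shlex


def build_workflow_commands(input_json, ligand_resname, home=None):
    """Return the quick-start steps as argument lists, one list per command."""
    home = home or os.path.expanduser("~")
    param = f"{home}/seekrflow/seekrflow/parameterize.py"
    flow = f"{home}/seekrflow/seekrflow/flow.py"
    analyze = f"{home}/seekr2/seekr2/analyze.py"
    # Assumption: `mamba run -n ENV` replaces `mamba activate ENV`.
    in_param = ["mamba", "run", "-n", "SEEKRFLOW_PARAM", "python"]
    in_seekr = ["mamba", "run", "-n", "SEEKR2", "python"]
    return [
        in_param + [param, "--input_json", input_json,
                    "--ligand_resname", ligand_resname],
        in_seekr + [flow, "prepare", "--input_json", input_json],
        in_seekr + [flow, "run", "--input_json", input_json],
        in_seekr + [analyze, "work/root/model.xml"],
    ]


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for cmd in build_workflow_commands("seekrflow.json", "BEN"):
        print(shlex.join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) once the printed commands look right. Stopping at the first failure matters here because every step depends on the output of the previous one.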

Set up your .seekrflow_resources.json (Optional)

One can save a lot of time preparing to submit remote jobs if one sets up their .seekrflow_resources.json file. Instead of always including one’s run_settings.resources list within the seekrflow.json files, one may define all one’s resources in the .seekrflow_resources.json file. The file may exist either in one’s home directory (recommended), within one’s seekrflow work/ directory, or within one’s seekrflow work/root/ directory. The file is structured as follows:

{
"resources": [
{
"type": "slurm_remote",
"name": "delta",
"remote_seekr2_directory": "/PATH/TO/seekr2/seekr2/",
"remote_seekrtools_directory": "/PATH/TO/seekrtools/seekrtools/",
"remote_working_directory": "/PATH/TO/seekrflow_playground/",
"max_workers_per_node": 1,
"partition": "gpuA100x4,gpuA40x4",
"account": "YOUR ACCOUNT NAME",
"constraint": "scratch",
"nodes_per_block": 1,
"cpus_per_task": 32,
"memory_per_node": 5000,
"time_limit": "00:30:00",
"scheduler_options": "--gpus-per-node=1 --gpu-bind=closest",
"worker_init": "source $HOME/.bashrc; conda activate SEEKR2; export OPENMM_CUDA_COMPILER=`which nvcc`",
"remote_interface": {
    "type": "globus_compute_sdk",
    "endpoint_id": "YOUR GLOBUS COMPUTE ENDPOINT"
},
"transfer_settings": {
    "type": "globus",
    "local_collection_id": "YOUR LOCAL COLLECTION ID",
    "remote_collection_id": "7e936164-de58-4e3d-85da-21aa23c07169"
}
},
{
"type": "slurm_remote",
"name": "anvil",
"remote_seekr2_directory": "/PATH/TO/seekr2/seekr2/",
"remote_seekrtools_directory": "/PATH/TO/seekrtools/seekrtools/",
"remote_working_directory": "/PATH/TO/seekrflow_playground",
"max_workers_per_node": 1,
...
}
]
}

As one utilizes resources to run seekrflow, one may populate this file with various resources that one may use to run seekr remotely.
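Since the template above contains placeholder values such as "YOUR ACCOUNT NAME" and "/PATH/TO/", it can be useful to check a resources file for entries that were never filled in. The sketch below is an illustration only: the function names are hypothetical, the placeholder markers are simply the ones used in the example above, and nothing here reflects how seekrflow itself chooses between the three file locations.

```python
import json
from pathlib import Path

FILENAME = ".seekrflow_resources.json"


def candidate_paths(work_dir, home=None):
    """List the three locations described above where the file may live."""
    home = Path(home) if home is not None else Path.home()
    work = Path(work_dir)
    return [home / FILENAME, work / FILENAME, work / "root" / FILENAME]


def unfilled_placeholders(resource):
    """Return keys (dotted for nested settings) that still hold template text."""
    hits = []
    for key, value in resource.items():
        if isinstance(value, dict):
            hits += [f"{key}.{sub}" for sub in unfilled_placeholders(value)]
        elif isinstance(value, str) and ("YOUR " in value or "/PATH/TO/" in value):
            hits.append(key)
    return hits


def summarize(data):
    """Map each resource name in parsed file contents to its unfilled keys."""
    return {r.get("name", "<unnamed>"): unfilled_placeholders(r)
            for r in data["resources"]}


if __name__ == "__main__":
    for path in candidate_paths("work"):
        if path.is_file():
            print(path, summarize(json.loads(path.read_text())))
```

Loading the file with json.loads also catches syntax problems such as a missing brace or a trailing comma before a remote job is submitted.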

Important Options and Hints

  • In general, seekrflow, seekr, and SeekrTools programs can be run with the ‘-h’ argument to see all available options. Please see https://seekr2.readthedocs.io/en/latest for a detailed description of programs and options.

For a complete tutorial, see the Tutorials section.

Troubleshooting

Getting Help

If you encounter issues:

  1. Check the User Guide for detailed usage instructions

  2. Review the API Reference for complete API documentation

  3. Look at the example in seekrflow/examples/trypsin_benzamidine/

  4. See https://seekr2.readthedocs.io/en/latest for SEEKR2-specific help

  5. Submit issues to the project repository

Next Steps

Now that you have seekrflow installed, you can: