Getting Started
Welcome to seekrflow! This guide will walk you through installing and setting up seekrflow for molecular dynamics simulation workflows.
What is seekrflow?
seekrflow is a workflow management system designed for molecular dynamics simulations using the seekr package. It provides:
Automated parameter generation for molecular systems
Streamlined workflow execution for binding/unbinding simulations
Integration with high-performance computing resources
Support for complex molecular systems including protein-ligand interactions
Installation
The easiest, quickest way to install seekrflow is to use Mamba. If you don’t already have Mamba installed, download the Miniforge install script and run it.
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh
Fill out the prompts as they appear.
Once this has been done, set up a new environment:
mamba create -n SEEKR2 python=3.11 --yes
If you plan to use seekrflow for parameterization, you will probably need a second environment to avoid conflicts.
mamba create -n SEEKRFLOW_PARAM python=3.11 --yes
Dependencies
Many of the dependencies for seekrflow will be installed alongside seekrflow, but some must be installed separately, before seekrflow itself.
Git (required)
Make sure git is installed to clone repositories. If git isn’t already installed on your computer, run:
mamba install git --yes
PDBFixer (recommended; needed for parameterization)
Installing PDBFixer is recommended if you plan to run parameterization with seekrflow.
mamba activate SEEKRFLOW_PARAM
mamba install pdbfixer --yes
pdb2pqr (recommended; needed for parameterization)
pdb2pqr is used in the parameterization workflow to choose protonation states and, optionally, to produce PQR files for BD simulations.
pip install pdb2pqr
OpenEye Toolkits (recommended; possibly needed for parameterization)
The OpenEye Toolkits are used for quantum chemistry-based force field parameterization; seekrflow uses them to generate SDF files from PDB files of small molecules. The toolkits require a valid OpenEye license, which is free for academic users but must be obtained directly from https://www.eyesopen.com/academic-licensing. If you do not plan to use seekrflow for parameterization, or already have SDF files for your small molecules, you do not need to install OpenEye. Otherwise, follow these steps.
mamba install openeye::openeye-toolkits --yes
After obtaining an OpenEye academic license, save the provided oe_license.txt file in a secure location on your computer system. For example, you may place it in:
/home/USERNAME/licenses/oe_license.txt
To ensure that OpenEye toolkits can find the license file at runtime, export the license path by adding the following line to your ~/.bashrc.
export OE_LICENSE="/home/USERNAME/licenses/oe_license.txt"
Then apply the change within the current terminal session.
source ~/.bashrc
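As a quick sanity check, the following minimal Python sketch confirms that OE_LICENSE is set and points to an existing file. The helper name check_oe_license is hypothetical (it is not part of seekrflow or the OpenEye Toolkits), and it only inspects the path; it does not validate the license itself.

```python
import os
from pathlib import Path


def check_oe_license(env=None):
    """Return (ok, message) describing whether OE_LICENSE points to a file.

    Hypothetical helper for illustration only: it checks that the path
    exists, not that the license contents are valid.
    """
    env = os.environ if env is None else env
    license_path = env.get("OE_LICENSE")
    if not license_path:
        return False, "OE_LICENSE is not set"
    path = Path(license_path).expanduser()
    if not path.is_file():
        return False, f"OE_LICENSE points to a missing file: {path}"
    return True, f"Found license file: {path}"


if __name__ == "__main__":
    ok, message = check_oe_license()
    print(message)
```

If the message reports that the variable is not set, open a new terminal or run the source command again before retrying.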
OpenMM Forcefields (recommended; needed for parameterization)
If you wish to parameterize your molecular system with a common forcefield, such as AMBER FF14SB, GAFF2, CHARMM 36, OpenFF’s SMIRNOFF, or espaloma, you will need to install the OpenMM forcefields package.
mamba install openmmforcefields --yes
Espaloma Machine-learned Forcefield (optional)
If you want to parameterize your molecular system with the machine-learned forcefield espaloma, you will need to install it.
mamba install espaloma=0.3.2 --yes
You will also need to download the correct espaloma .pt file, and save it somewhere on your computer system.
curl -L -O https://github.com/choderalab/espaloma/releases/download/0.3.2/espaloma-0.3.2.pt
Browndye2 (recommended)
SEEKR uses Browndye2 to run Brownian dynamics (BD) simulations, which are necessary for k-on calculations. Please see https://browndye.ucsd.edu/ for Browndye2 installation instructions. Some of these steps require sudo privileges (administrator access). If you do not have sudo access, contact your system administrator.
SEEKR2 plugin, SEEKR2, and SeekrTools (required)
This step installs the OpenMM plugin for the SEEKR2 package.
mamba activate SEEKR2
mamba install seekr2_openmm_plugin openmm=8.1 --yes
Run the following command to check if the SEEKR2 OpenMM plugin is correctly installed. If no error message appears, the installation was successful.
python -c "import seekr2plugin"
Install SEEKR2.
cd ~
git clone https://github.com/seekrcentral/seekr2.git
cd seekr2
python -m pip install .
Optionally, one may run the tests for SEEKR2.
pytest
Next, clone and install SeekrTools.
cd ~
git clone https://github.com/seekrcentral/seekrtools.git
cd seekrtools
python -m pip install .
Optionally, one may run the tests for SeekrTools.
pytest
Install seekrflow
Finally, with the dependencies out of the way, we can install seekrflow.
cd ~
git clone https://github.com/seekrcentral/seekrflow.git
cd seekrflow
python -m pip install .
Testing seekrflow (Optional)
To test seekrflow, run the following command in the seekrflow/ directory:
pytest
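Before running the full test suites, one may also confirm that the expected packages can be located in the active environment. The sketch below is a minimal check, not part of seekrflow; the import name seekr2plugin comes from the plugin check earlier in this guide, while openmm, seekr2, seekrtools, and seekrflow are assumed to match the package and repository names.

```python
import importlib.util

# Import names are assumptions: "seekr2plugin" is used in the plugin check
# earlier in this guide; the others are assumed to match the repository names.
MODULES = ("openmm", "seekr2plugin", "seekr2", "seekrtools", "seekrflow")


def find_missing(modules=MODULES):
    """Return the names in `modules` that cannot be located for import."""
    missing = []
    for name in modules:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Not importable in this environment:", ", ".join(missing))
    else:
        print("All expected packages were found.")
```

Run it with the SEEKR2 environment activated; a package reported as missing usually means it was installed into a different environment.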
Installing Dependencies for Remote Execution on HPC Systems
If one wants to use seekrflow to run SEEKR on a remote system, one must decide whether to use SSH or the Globus Compute SDK to control the jobs. Both function very similarly, but here are the pros and cons of each.
| Remote System Approach | Pros | Cons |
|---|---|---|
| SSH | Simple, supported on most Linux systems | Requires password or key-based authentication, passwords in plain text, cannot bypass 2-factor authentication |
| Globus Compute SDK | Only an endpoint ID is needed, avoiding password and 2-factor authentication | More complicated setup of software and settings |
Install Globus Compute SDK (optional - if not using SSH)
To use remote execution on high-performance computing (HPC) systems with the Globus Compute SDK, you will need to install the package. This allows you to run SEEKR jobs on remote resources without repeated password and 2-factor authentication prompts.
Important
The Python versions must be the same (or at least very similar) on the local and remote machines.
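One way to compare the two interpreters is sketched below. It treats a matching major.minor release (for example 3.11) as "similar enough", which is an assumption rather than a rule stated by seekrflow or Globus; the function names are illustrative, and the remote version string is a placeholder to be replaced with the output of python --version on the remote machine.

```python
import sys


def minor_version(version_text):
    """Extract (major, minor) from text such as 'Python 3.11.9' or '3.11.9'."""
    number = version_text.strip().split()[-1]
    major, minor = number.split(".")[:2]
    return int(major), int(minor)


def versions_match(local_text, remote_text):
    """True when both version strings share the same major.minor release."""
    return minor_version(local_text) == minor_version(remote_text)


if __name__ == "__main__":
    local = "{}.{}.{}".format(*sys.version_info[:3])
    # Paste the output of `python --version` from the remote machine here.
    remote = "Python 3.11.9"  # placeholder value
    print("local:", local, "| remote:", remote)
    print("match:", versions_match(local, remote))
```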
On the local machine:
pip install globus-compute-sdk
On the remote HPC system, you will need to follow the steps above to create the Mamba environment (on the head or login node; the environment is assumed to be named “SEEKR2”). Into this environment, install SEEKR, SeekrTools, and seekrflow. Then, make sure it is activated. (There is probably no need to make a parameterization environment on the remote machine, as it is presumed that parameterization, if any, will be performed on the local machine.)
mamba activate SEEKR2
Install pipx for environmental isolation.
python3 -m pip install --user pipx
Once this is completed, install globus-compute-endpoint on the remote machine:
python3 -m pipx install globus-compute-endpoint
Configure the endpoint:
globus-compute-endpoint configure my_seekr_endpoint
At this point, one should modify the file at ~/.globus-compute/my_seekr_endpoint/user_config_template.yaml.j2 to make full use of the HPC resource’s capabilities.
Here is an example Globus Compute Endpoint configuration file used for the NCSA Delta supercomputer:
# Comments here...
endpoint_setup: ''
engine:
  type: GlobusComputeEngine
  max_workers_per_node: 2
  provider:
    type: LocalProvider
    min_blocks: 0
    max_blocks: 1
    init_blocks: 1
    worker_init: 'source /path/to/your/miniforge3/etc/profile.d/conda.sh && mamba activate SEEKR2'
Note
This configuration applies to the head/login node before SLURM/PBS job submission. Notice that max_workers_per_node is set to 2 for proper handling of the Globus Compute SDK client in seekrflow.
Note
Make sure you have defined the OPENMM_CUDA_COMPILER environment variable in your shell initialization file (e.g., ~/.bashrc) on the remote machine.
Start the endpoint.
globus-compute-endpoint start my_seekr_endpoint
You will need to authenticate with Globus on a browser to start the endpoint, and enter the Authorization code given in the browser into the terminal. This authentication session should last for a number of days or weeks, depending on your Globus account settings.
One can see the endpoint, as well as its endpoint ID, and those of any other endpoints, by listing them:
globus-compute-endpoint list
Copy the endpoint ID from the ‘start’ or ‘list’ commands above, and save it for future reference.
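To confirm from the local machine that the endpoint accepts work, one may submit a trivial function through the Globus Compute SDK Executor. This is a minimal sketch, not how seekrflow itself talks to the endpoint: the helper names are hypothetical, the endpoint ID is a placeholder, and the remote call requires the SDK to be installed, the endpoint to be started, and a Globus login.

```python
import uuid


def is_valid_endpoint_id(text):
    """Return True if `text` parses as a UUID, the format of endpoint IDs."""
    try:
        uuid.UUID(text)
    except (ValueError, AttributeError, TypeError):
        return False
    return True


def report_python_version():
    """Function to run remotely; imports live inside it for serialization."""
    import platform
    return platform.python_version()


def remote_python_version(endpoint_id):
    """Run report_python_version on the endpoint and return its result.

    The SDK import is deferred so the rest of this module works without
    globus-compute-sdk installed.
    """
    from globus_compute_sdk import Executor

    with Executor(endpoint_id=endpoint_id) as executor:
        future = executor.submit(report_python_version)
        return future.result()


if __name__ == "__main__":
    endpoint_id = "YOUR GLOBUS COMPUTE ENDPOINT"  # placeholder
    if is_valid_endpoint_id(endpoint_id):
        print("Remote Python version:", remote_python_version(endpoint_id))
    else:
        print("Replace the placeholder with a real endpoint ID.")
```

The returned version string can also be compared against the local interpreter, since the two are expected to match closely.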
One can also stop the endpoint if/when desired:
globus-compute-endpoint stop my_seekr_endpoint
Before submitting remote jobs, always check the endpoints to make sure they are started on the remote machine, and start them if they are not.
globus-compute-endpoint list
globus-compute-endpoint start my_seekr_endpoint
More information about Globus endpoints and the Globus SDK can be found here: https://globus-compute.readthedocs.io/en/latest/endpoints/endpoints.html
Install Globus Connect Personal (optional)
File transfers may be done with Globus or with rsync. Some resources prefer the use of Globus for large file transfers, instead of rsync.
If you want to use Globus to transfer files to and from your remote machine, install the latest version of Globus Connect Personal on your local machine. This allows you to transfer files to and from the remote HPC system using Globus.
Follow the instructions here: https://www.globus.org/globus-connect-personal
–or– For a typical Linux/Unix installation:
curl -O https://downloads.globus.org/globus-connect-personal/linux/stable/globusconnectpersonal-latest.tgz
tar xzf globusconnectpersonal-latest.tgz
cd globusconnectpersonal-x.y.z
./globusconnectpersonal
And follow the on-screen instructions to set up your Globus Connect Personal instance.
Find the Collection IDs Using the Globus Web Portal
Now, one must find the Collection IDs for the Globus Connect Personal instance on the local machine, and the Globus Compute Endpoint on the remote HPC system. This is done by logging into the Globus Web Portal (https://app.globus.org) and navigating to the “Collections” tab.
Find your local Globus Connect Personal instance under “Administered By You”, click its name, and search the page for “UUID”. This is the Collection ID for your local Globus Connect Personal instance. Save this UUID for your seekrflow configuration input file.
Find your remote Globus Collection ID by backtracking to the “Collections” tab, and searching for your remote Globus Compute Endpoint. Click on the endpoint name, and search the page for “UUID”. This is the Collection ID for your remote Globus Compute Endpoint. Save this UUID for your seekrflow configuration input file.
Quick Start Example System - Trypsin/Benzamidine
A seekrflow calculation needs an input settings JSON file to run, as well as a starting PDB file containing a bound complex of the receptor and ligand molecules. An example system can be found in seekrflow/seekrflow/examples/trypsin_benzamidine. Here, the bound complex of the receptor and ligand is found in the file “protein_ligand.pdb”, and an example input JSON file can be found in “seekrflow.json”. A full workflow run can be done with the following steps.
mamba activate SEEKRFLOW_PARAM
python ~/seekrflow/seekrflow/parameterize.py --input_json seekrflow.json --ligand_resname BEN
mamba deactivate
mamba activate SEEKR2
python ~/seekrflow/seekrflow/flow.py prepare --input_json seekrflow.json
python ~/seekrflow/seekrflow/flow.py run --input_json seekrflow.json
python ~/seekr2/seekr2/analyze.py work/root/model.xml
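The same sequence can be scripted. The sketch below is one possible wrapper, not a seekrflow feature: it assumes the repositories were cloned into the home directory as described above and that your Mamba version provides mamba run -n to execute a command inside a named environment. By default it only prints the commands (a dry run).

```python
import shlex
import subprocess
from pathlib import Path


def build_steps(input_json="seekrflow.json", ligand_resname="BEN", home=None):
    """Return (environment, command) pairs mirroring the steps above."""
    home = Path.home() if home is None else Path(home)
    flow_dir = home / "seekrflow" / "seekrflow"
    analyze = home / "seekr2" / "seekr2" / "analyze.py"
    flow = str(flow_dir / "flow.py")
    return [
        ("SEEKRFLOW_PARAM", ["python", str(flow_dir / "parameterize.py"),
                             "--input_json", input_json,
                             "--ligand_resname", ligand_resname]),
        ("SEEKR2", ["python", flow, "prepare", "--input_json", input_json]),
        ("SEEKR2", ["python", flow, "run", "--input_json", input_json]),
        ("SEEKR2", ["python", str(analyze), "work/root/model.xml"]),
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it too when dry_run is False."""
    for env_name, command in steps:
        full_command = ["mamba", "run", "-n", env_name] + command
        print(shlex.join(full_command))
        if not dry_run:
            # check=True stops the workflow at the first failing step.
            subprocess.run(full_command, check=True)


if __name__ == "__main__":
    run_steps(build_steps(), dry_run=True)
```

Run it from the directory containing seekrflow.json, and switch dry_run to False once the printed commands look right.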
Set up your .seekrflow_resources.json (Optional)
One can save a lot of time preparing to submit remote jobs if one sets up their .seekrflow_resources.json file. Instead of always including one’s run_settings.resources list within the seekrflow.json files, one may define all one’s resources in the .seekrflow_resources.json file. The file may exist either in one’s home directory (recommended), within one’s seekrflow work/ directory, or within one’s seekrflow work/root/ directory. The file is structured as follows:
{
    "resources": [
        {
            "type": "slurm_remote",
            "name": "delta",
            "remote_seekr2_directory": "/PATH/TO/seekr2/seekr2/",
            "remote_seekrtools_directory": "/PATH/TO/seekrtools/seekrtools/",
            "remote_working_directory": "/PATH/TO/seekrflow_playground/",
            "max_workers_per_node": 1,
            "partition": "gpuA100x4,gpuA40x4",
            "account": "YOUR ACCOUNT NAME",
            "constraint": "scratch",
            "nodes_per_block": 1,
            "cpus_per_task": 32,
            "memory_per_node": 5000,
            "time_limit": "00:30:00",
            "scheduler_options": "--gpus-per-node=1 --gpu-bind=closest",
            "worker_init": "source $HOME/.bashrc; conda activate SEEKR2; export OPENMM_CUDA_COMPILER=`which nvcc`",
            "remote_interface": {
                "type": "globus_compute_sdk",
                "endpoint_id": "YOUR GLOBUS COMPUTE ENDPOINT"
            },
            "transfer_settings": {
                "type": "globus",
                "local_collection_id": "YOUR LOCAL COLLECTION ID",
                "remote_collection_id": "7e936164-de58-4e3d-85da-21aa23c07169"
            }
        },
        {
            "type": "slurm_remote",
            "name": "anvil",
            "remote_seekr2_directory": "/PATH/TO/seekr2/seekr2/",
            "remote_seekrtools_directory": "/PATH/TO/seekrtools/seekrtools/",
            "remote_working_directory": "/PATH/TO/seekrflow_playground",
            "max_workers_per_node": 1,
            ...
        }
    ]
}
As one utilizes resources to run seekrflow, one may populate this file with various resources that one may use to run seekr remotely.
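Because a single misplaced brace or comma makes the whole file unreadable, it can help to parse the file before submitting jobs. The sketch below is a standalone check, not seekrflow's own loader: it looks in the three locations named above, parses any file it finds, and lists the name and type of each resource. The function names are hypothetical, and the order in which seekrflow itself consults these locations is not assumed here.

```python
import json
from pathlib import Path

RESOURCES_FILENAME = ".seekrflow_resources.json"


def candidate_paths(work_dir="work", home=None):
    """Return the three locations described above, in no particular priority."""
    home = Path.home() if home is None else Path(home)
    work = Path(work_dir)
    return [
        home / RESOURCES_FILENAME,
        work / RESOURCES_FILENAME,
        work / "root" / RESOURCES_FILENAME,
    ]


def list_resources(paths):
    """Map each existing file to a list of (name, type) for its resources.

    json.load raises json.JSONDecodeError if a file is not valid JSON.
    """
    found = {}
    for path in paths:
        if not path.is_file():
            continue
        with open(path) as handle:
            data = json.load(handle)
        found[str(path)] = [
            (resource.get("name"), resource.get("type"))
            for resource in data.get("resources", [])
        ]
    return found


if __name__ == "__main__":
    results = list_resources(candidate_paths())
    if not results:
        print("No", RESOURCES_FILENAME, "file found.")
    for path, resources in results.items():
        print(path)
        for name, resource_type in resources:
            print("  ", name, "(" + str(resource_type) + ")")
```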
Important Options and Hints
In general, seekrflow, SEEKR, and SeekrTools programs can be run with the ‘-h’ argument to see all available options. Please see https://seekr2.readthedocs.io/en/latest for a detailed description of programs and options.
For a complete tutorial, see the Tutorials section.
Troubleshooting
Getting Help
If you encounter issues:
Check the User Guide for detailed usage instructions
Review the API Reference for complete API documentation
Look at the example in seekrflow/examples/trypsin_benzamidine/
See https://seekr2.readthedocs.io/en/latest for SEEKR2-specific help
Submit issues to the project repository
Next Steps
Now that you have seekrflow installed, you can:
Follow the Tutorials for step-by-step examples
Read the User Guide for detailed usage information
Explore the API Reference for complete documentation
Check out the Developer Guide if you want to contribute