Getting Started
===============

Welcome to seekrflow! This guide walks you through installing and setting up seekrflow for molecular dynamics simulation workflows.

What is seekrflow?
------------------

seekrflow is a workflow management system designed for molecular dynamics simulations using the seekr package. It provides:

- Automated parameter generation for molecular systems
- Streamlined workflow execution for binding/unbinding simulations
- Integration with high-performance computing resources
- Support for complex molecular systems including protein-ligand interactions

Installation
------------

The easiest, quickest way to install seekrflow is to use Mamba. If you don't already have Mamba installed, download the Miniforge install script and run it.

.. code-block:: bash

   curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
   bash Miniforge3-$(uname)-$(uname -m).sh

Fill out the prompts as they appear. Once this has been done, set up a new environment:

.. code-block:: bash

   mamba create -n SEEKR2 python=3.11 --yes

If you plan to use seekrflow for parameterization, you will probably need a second environment to avoid conflicts.

.. code-block:: bash

   mamba create -n SEEKRFLOW_PARAM python=3.11 --yes

Dependencies
~~~~~~~~~~~~

Many of the dependencies for seekrflow are installed alongside seekrflow, but some must be installed separately, before seekrflow itself.

Git (required)
++++++++++++++

Make sure git is installed so that you can clone repositories. If git isn't already installed on your computer, run:

.. code-block:: bash

   mamba install git --yes

PDBFixer (recommended; needed for parameterization)
+++++++++++++++++++++++++++++++++++++++++++++++++++

Installing PDBFixer is recommended if you want to run parameterization with seekrflow.

.. code-block:: bash

   mamba activate SEEKRFLOW_PARAM
   mamba install pdbfixer --yes

pdb2pqr (recommended; needed for parameterization)
++++++++++++++++++++++++++++++++++++++++++++++++++

pdb2pqr is used in the parameterization workflow to choose protonation states, and optionally to produce PQR files for BD simulations.

.. code-block:: bash

   pip install pdb2pqr

OpenEye Toolkits (recommended; possibly needed for parameterization)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The OpenEye Toolkits are used for quantum chemistry-based force field parameterization. They require a valid OpenEye license, which is free for academic users but must be obtained directly from https://www.eyesopen.com/academic-licensing. seekrflow uses the toolkits to generate SDF files from PDB files of small molecules. If you do not plan to use seekrflow for parameterization, or already have SDF files for your small molecules, you do not need to install OpenEye. If you wish to install it, follow these steps.

.. code-block:: bash

   mamba install openeye::openeye-toolkits --yes

After obtaining an OpenEye academic license, save the provided oe_license.txt file in a secure location on your computer system. For example, you may place it in:

.. code-block:: bash

   /home/USERNAME/licenses/oe_license.txt

To ensure that the OpenEye toolkits can find the license file at runtime, export the license path by adding the following line to your ~/.bashrc.

.. code-block:: bash

   export OE_LICENSE="/home/USERNAME/licenses/oe_license.txt"

Then apply the change within the current terminal session.

.. code-block:: bash

   source ~/.bashrc

OpenMM Forcefields (recommended; needed for parameterization)
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

If you wish to parameterize your molecular system with a common forcefield, such as AMBER FF14SB, GAFF2, CHARMM 36, OpenFF's SMIRNOFF, or espaloma, you will need to install the OpenMM forcefields package.

.. code-block:: bash

   mamba install openmmforcefields --yes

Espaloma Machine-learned Forcefield (optional)
++++++++++++++++++++++++++++++++++++++++++++++

If you want to parameterize your molecular system with the machine-learned forcefield espaloma, you will need to install it.

.. code-block:: bash

   mamba install espaloma=0.3.2 --yes

You will also need to download the correct espaloma .pt file, and save it somewhere on your computer system.

.. code-block:: bash

   curl -L -O https://github.com/choderalab/espaloma/releases/download/0.3.2/espaloma-0.3.2.pt

Browndye2 (recommended)
+++++++++++++++++++++++

Seekr can use Browndye2 if Brownian dynamics (BD) simulations will be run (necessary for k-on calculations). Please see https://browndye.ucsd.edu/ for Browndye2 installation instructions. Some of these steps require sudo privileges (administrator access). If you do not have sudo access, contact your system administrator.

SEEKR2 plugin, SEEKR2, and SeekrTools (required)
++++++++++++++++++++++++++++++++++++++++++++++++

This step installs the OpenMM plugin for the SEEKR2 package.

.. code-block:: bash

   mamba activate SEEKR2
   mamba install seekr2_openmm_plugin openmm=8.1 --yes

Run the following command to check whether the SEEKR2 OpenMM plugin is correctly installed. If no error message appears, the installation was successful.

.. code-block:: bash

   python -c "import seekr2plugin"

Install SEEKR2.

.. code-block:: bash

   cd ~
   git clone https://github.com/seekrcentral/seekr2.git
   cd seekr2
   python -m pip install .

Optionally, one may run the tests for SEEKR2.

.. code-block:: bash

   pytest

Next, clone and install SeekrTools.

.. code-block:: bash

   cd ~
   git clone https://github.com/seekrcentral/seekrtools.git
   cd seekrtools
   python -m pip install .

Optionally, one may run the tests for SeekrTools.

.. code-block:: bash

   pytest

Install seekrflow
~~~~~~~~~~~~~~~~~

Finally, with the dependencies out of the way, we can install seekrflow.

.. code-block:: bash

   cd ~
   git clone https://github.com/seekrcentral/seekrflow.git
   cd seekrflow
   python -m pip install .

Testing seekrflow (Optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To test seekrflow, run the following command in the seekrflow/ directory:

.. code-block:: bash

   pytest

Installing Dependencies for Remote Execution on HPC Systems
-----------------------------------------------------------

If one wants to use seekrflow to run SEEKR on a remote system, one must decide whether to use SSH or the Globus Compute SDK to control the jobs. Both function very similarly, but here are the pros and cons of each.

.. list-table:: SSH vs Globus Compute SDK Comparison
   :widths: 20 20 20
   :header-rows: 1

   * - Remote System Approach
     - Pros
     - Cons
   * - SSH
     - Simple, supported on most Linux systems
     - Requires password or key-based authentication, passwords in plain text, cannot bypass 2-factor authentication
   * - Globus Compute SDK
     - Only an endpoint ID is needed, avoiding password and 2-factor authentication.
     - More complicated setup of software and settings

Install Globus Compute SDK (optional - if not using SSH)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To use remote execution on high-performance computing (HPC) systems with the Globus Compute SDK, you will need to install the package. This allows you to run seekr jobs on remote resources while avoiding annoying things like passwords and 2-factor authentication.

.. important::

   The Python versions must be the same (or at least very similar) across the local and remote machines.

On the local machine:

.. code-block:: bash

   pip install globus-compute-sdk

On the remote HPC system, follow the steps above to create the Mamba environment (on the head or login node; the environment is assumed to be named "SEEKR2"). Into this environment, install SEEKR, SeekrTools, and seekrflow. Then, make sure it is activated. (There is probably no need to make a parameterization environment on the remote machine, as it is presumed that parameterization, if any, will be performed on the local machine.)

.. code-block:: bash

   mamba activate SEEKR2

Install pipx for environment isolation.

.. code-block:: bash

   python3 -m pip install --user pipx

Once this is completed, install globus-compute-endpoint on the remote machine:

.. code-block:: bash

   python3 -m pipx install globus-compute-endpoint

Configure the endpoint:

.. code-block:: bash

   globus-compute-endpoint configure my_seekr_endpoint

At this point, one should modify the file at ~/.globus-compute/my_seekr_endpoint/user_config_template.yaml.j2 in order to make full use of the HPC resource's capabilities. Here is an example Globus Compute Endpoint configuration file that was used for the NCSA Delta supercomputer:

.. code-block:: yaml

   # Comments here...
   endpoint_setup: ''
   engine:
     type: GlobusComputeEngine
     max_workers_per_node: 2
     provider:
       type: LocalProvider
       min_blocks: 0
       max_blocks: 1
       init_blocks: 1
       worker_init: 'source /path/to/your/miniforge3/etc/profile.d/conda.sh && mamba activate SEEKR2'

.. note::

   This configuration applies to the head/login node *before* SLURM/PBS job submission. Notice that **max_workers_per_node** is set to **2** for proper handling of the Globus Compute SDK client in seekrflow.

.. note::

   Make sure you have defined the OPENMM_CUDA_COMPILER environment variable in your shell initialization file (e.g., ~/.bashrc) on the remote machine.

Start the endpoint.

.. code-block:: bash

   globus-compute-endpoint start my_seekr_endpoint

You will need to authenticate with Globus in a browser to start the endpoint, and enter the authorization code given in the browser into the terminal. This authentication session should last for a number of days or weeks, depending on your Globus account settings.

One can see the endpoint, as well as its endpoint ID, and those of any other endpoints, by listing them:

.. code-block:: bash

   globus-compute-endpoint list

Copy the endpoint ID from the 'start' or 'list' commands above, and save it for future reference.

One can also stop the endpoint if/when desired:

.. code-block:: bash

   globus-compute-endpoint stop my_seekr_endpoint

Before submitting remote jobs, always check the endpoints to make sure they are started on the remote machine, and start them if they are not.

.. code-block:: bash

   globus-compute-endpoint list
   globus-compute-endpoint start my_seekr_endpoint

More information about Globus endpoints and the Globus SDK can be found here: https://globus-compute.readthedocs.io/en/latest/endpoints/endpoints.html

Install Globus Connect Personal (optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

File transfers may be done with Globus or with rsync. Some resources prefer the use of Globus, instead of rsync, for large file transfers. If you want to use Globus to transfer files to and from your remote HPC system, install the latest version of Globus Connect Personal on your local machine.

Follow the instructions here: https://www.globus.org/globus-connect-personal

Alternatively, for a typical Linux/Unix installation:

.. code-block:: bash

   curl -O https://downloads.globus.org/globus-connect-personal/linux/stable/globusconnectpersonal-latest.tgz
   tar xzf globusconnectpersonal-latest.tgz
   cd globusconnectpersonal-x.y.z
   ./globusconnectpersonal

Then follow the on-screen instructions to set up your Globus Connect Personal instance.

Find the Collection IDs Using the Globus Web Portal
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now, one must find the Collection IDs for the Globus Connect Personal instance on the local machine, and the Globus Compute Endpoint on the remote HPC system. This is done by logging into the Globus Web Portal (https://app.globus.org) and navigating to the "Collections" tab.

Find your local Globus Connect Personal instance under "Administered By You", click its name, and search the page for "UUID". This is the Collection ID for your local Globus Connect Personal instance. Save this UUID for your seekrflow configuration input file.

Find your remote Globus Collection ID by returning to the "Collections" tab and searching for your remote Globus Compute Endpoint. Click on the endpoint name, and search the page for "UUID". This is the Collection ID for your remote Globus Compute Endpoint. Save this UUID for your seekrflow configuration input file.

Quick Start Example System - Trypsin/Benzamidine
------------------------------------------------

A seekrflow calculation needs an input settings JSON file to run, as well as a starting PDB file containing a bound complex of the receptor and ligand molecules. An example system can be found in seekrflow/seekrflow/examples/trypsin_benzamidine. Here, the bound complex of the receptor and ligand is found in the file "protein_ligand.pdb", and an example input JSON file can be found in "seekrflow.json".

A full workflow run can be done with the following steps.

.. code-block:: bash

   mamba activate SEEKRFLOW_PARAM
   python ~/seekrflow/seekrflow/parameterize.py --input_json seekrflow.json --ligand_resname BEN
   mamba deactivate
   mamba activate SEEKR2
   python ~/seekrflow/seekrflow/flow.py prepare --input_json seekrflow.json
   python ~/seekrflow/seekrflow/flow.py run --input_json seekrflow.json
   python ~/seekr2/seekr2/analyze.py work/root/model.xml

Set up your .seekrflow_resources.json (Optional)
------------------------------------------------

One can save a lot of time preparing to submit remote jobs by setting up a `.seekrflow_resources.json` file. Instead of always including the run_settings.resources list within each seekrflow.json file, one may define all of one's resources in the `.seekrflow_resources.json` file. The file may exist in one's home directory (recommended), within one's seekrflow work/ directory, or within one's seekrflow work/root/ directory. The file is structured as follows:

.. code-block:: bash

   {
       "resources": [
           {
               "type": "slurm_remote",
               "name": "delta",
               "remote_seekr2_directory": "/PATH/TO/seekr2/seekr2/",
               "remote_seekrtools_directory": "/PATH/TO/seekrtools/seekrtools/",
               "remote_working_directory": "/PATH/TO/seekrflow_playground/",
               "max_workers_per_node": 1,
               "partition": "gpuA100x4,gpuA40x4",
               "account": "YOUR ACCOUNT NAME",
               "constraint": "scratch",
               "nodes_per_block": 1,
               "cpus_per_task": 32,
               "memory_per_node": 5000,
               "time_limit": "00:30:00",
               "scheduler_options": "--gpus-per-node=1 --gpu-bind=closest",
               "worker_init": "source $HOME/.bashrc; conda activate SEEKR2; export OPENMM_CUDA_COMPILER=`which nvcc`",
               "remote_interface": {
                   "type": "globus_compute_sdk",
                   "endpoint_id": "YOUR GLOBUS COMPUTE ENDPOINT"
               },
               "transfer_settings": {
                   "type": "globus",
                   "local_collection_id": "YOUR LOCAL COLLECTION ID",
                   "remote_collection_id": "7e936164-de58-4e3d-85da-21aa23c07169"
               }
           },
           {
               "type": "slurm_remote",
               "name": "anvil",
               "remote_seekr2_directory": "/PATH/TO/seekr2/seekr2/",
               "remote_seekrtools_directory": "/PATH/TO/seekrtools/seekrtools/",
               "remote_working_directory": "/PATH/TO/seekrflow_playground",
               "max_workers_per_node": 1,
               ...
           }
       ]
   }

As one uses more resources to run seekrflow, one may add each of them to this file so that it is available for future remote seekr runs.

Important Options and Hints
---------------------------

* In general, seekrflow, seekr, and SeekrTools programs can be run with the '-h' argument to see all available options.

Please see https://seekr2.readthedocs.io/en/latest for a detailed description of programs and options.

For a complete tutorial, see the :doc:`tutorials` section.

Troubleshooting
---------------

Getting Help
~~~~~~~~~~~~

If you encounter issues:

1. Check the :doc:`user_guide` for detailed usage instructions
2. Review the :doc:`api` for complete API documentation
3. Look at the example in ``seekrflow/examples/trypsin_benzamidine/``
4. See https://seekr2.readthedocs.io/en/latest for SEEKR2-specific help
5. Submit issues to the project repository

Next Steps
----------

Now that you have seekrflow installed, you can:

- Follow the :doc:`tutorials` for step-by-step examples
- Read the :doc:`user_guide` for detailed usage information
- Explore the :doc:`api` reference for complete documentation
- Check out the :doc:`developer_guide` if you want to contribute
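
Before moving on, it can be worth confirming that the packages installed in this guide are importable from the active environment. The following is a minimal sketch and is not part of seekrflow. Only the ``seekr2plugin`` module name comes from the installation check shown earlier; the other names (``openmm``, ``seekr2``, ``seekrtools``, ``seekrflow``) are assumed import names, and ``find_missing`` is a hypothetical helper written for this example.

.. code-block:: python

   import importlib.util

   # Module names to look for. Only "seekr2plugin" is taken from this guide;
   # the remaining names are assumptions about each package's import name.
   MODULES = ["seekr2plugin", "openmm", "seekr2", "seekrtools", "seekrflow"]


   def find_missing(modules):
       """Return the names in ``modules`` that cannot be located for import."""
       missing = []
       for name in modules:
           try:
               found = importlib.util.find_spec(name) is not None
           except (ImportError, ValueError):
               # find_spec can raise for broken or partially loaded packages.
               found = False
           if not found:
               missing.append(name)
       return missing


   if __name__ == "__main__":
       missing = find_missing(MODULES)
       if missing:
           print("Not importable in this environment: " + ", ".join(missing))
       else:
           print("All listed modules were found.")

Run the script with the SEEKR2 environment activated. A module reported as not importable may simply use a different import name than assumed here, so treat the output as a hint rather than a definitive diagnosis.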