Working on Wynton

Part 2

Natalie Elphick

April 16th, 2024

Introductions

Instructor:

      Natalie Elphick
      Bioinformatician I

TAs:

      Alex Pico
      Bioinformatics Core Director
      Michela Traglia
      Senior Statistician

Target Audience

  • Prior experience with the UNIX command line

Part 2:

  1. Custom Containers
  2. Submitting Compute Jobs
  3. Array Jobs
  4. GPU Jobs
  5. Running Pipelines
  6. Jupyter Notebooks
  7. RStudio Server
  8. How to get help

Custom Containers

Motivation

  • Compute-heavy jobs (high RAM, multiple cores) should be run on compute nodes
  • Containers allow us to make additional software available to the compute nodes
    • They also allow the use of software that might be hard to install on Rocky 8 Linux
    • Improves reproducibility

Dockerfile Basics

  • Dockerfiles contain instructions to build an image in layers
  • Layers are added using Dockerfile instruction syntax
  • Images are built by navigating to the directory that contains the Dockerfile and running:
docker build .

Dockerfile Instructions

  • The first instruction is FROM, which specifies the base image (only ARG may precede it)
    • Base images are a starting point with basics such as the OS and build tools already installed; find them on DockerHub
  • RUN : Run a shell command while building the image
  • SHELL : Set the shell used by subsequent instructions
  • USER : Set the user (within the image)
  • CMD : Set the default command run when a container starts from the image
  • COPY : Copy files into the image

See the Dockerfile documentation for a full list of instructions

Example Dockerfile

  • Click here to download the example Dockerfile
  • Open in your preferred text editor
# Bioconductor base image gives us access to a lot of bioinformatics tools and R packages.
FROM bioconductor/bioconductor_docker:RELEASE_3_17

# Shell options: make a pipeline fail if any command in it fails
SHELL ["/bin/bash", "-o", "pipefail", "-c"]

# Root permissions are required to install packages
USER root


# Install any UNIX packages you need
# First we update the package list and then install GNU make
# We clean up after ourselves to reduce the image size
RUN apt-get update && apt-get upgrade -y \
    && apt-get install -y --no-install-recommends make \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Install Seurat and harmony
RUN Rscript -e 'install.packages(c("Seurat","harmony"))'
# Check if installs worked
RUN Rscript -e 'lapply(c("Seurat","harmony"), library, character.only = TRUE)'


# Run container as non-root to avoid permission issues
RUN groupadd -g 10001 notroot && \
   useradd -u 10000 -g notroot notroot 

# Switch to the non-root user
USER notroot:notroot

# Default command to run when the container starts
CMD ["/bin/bash"]

# Copy dockerfile into the image (optional, but can be useful for reproducibility)
COPY Dockerfile /Dockerfile

Building Example Image

  • Do not run this during the workshop
    • It requires a lot of RAM
  • On macOS, make sure you have the Docker Desktop App running
  • We can provide an additional argument to the build command, -t, to set the name of the docker image
    • We can add version tags after the name using “:”
docker build -t docker_hub_user/seurat-harmony:1.0 .

Pushing Images to DockerHub

  • Make sure you are signed in to your DockerHub account locally (Docker Desktop for macOS)
  • The image name must start with your user name
docker push docker_hub_user/seurat-harmony:1.0
  • These can then be “pulled” onto Wynton as Apptainer image files (the image must be public)
[alice@dev1 ~]$ apptainer pull docker://docker_hub_user/seurat-harmony:1.0

Notes on Building Custom Images

  • Building an image is time-consuming and can use a lot of RAM on your local machine
  • A good base image can save you a lot of time
  • You must run apt-get update and apt-get install in the same command
    • Otherwise you will encounter caching issues
    • These commands are for Ubuntu-based images; for other distributions, run the equivalent package list retrieval and install commands together
  • Remember to use apt-get install -y
    • The build is non-interactive, so you cannot answer installation prompts while it runs

Compute Jobs

Submission Script - Basics

#!/bin/bash           # the shell language when run outside of the job scheduler
#                     # lines starting with #$ are instructions to the job scheduler
#$ -S /bin/bash       # the shell language when run via the job scheduler [IMPORTANT]
#$ -cwd               # job should run in the current working directory
#$ -j y               # STDERR and STDOUT should be joined
#$ -l mem_free=1G     # job requires up to 1 GiB of RAM per slot (core)
#$ -l scratch=2G      # job requires up to 2 GiB of local /scratch space
#$ -l h_rt=1:00:00   # job requires up to 1 hour of runtime 
#$ -r y               # if job crashes, it should be restarted

date
hostname

## End-of-job summary, if running as a job
[[ -n "$JOB_ID" ]] && qstat -j "$JOB_ID"  # This is useful for debugging and usage purposes,
                                          # e.g. "did my job exceed its memory request?"
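
A minimal sketch of how the local scratch space requested above might be used. It assumes the scheduler sets TMPDIR to a job-specific scratch directory (falling back to /tmp when run outside a job); the file names are hypothetical.

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -l scratch=2G

# Assumption: TMPDIR points at job-specific local scratch when run as a job.
# Outside the scheduler we fall back to /tmp.
workdir=$(mktemp -d "${TMPDIR:-/tmp}/work_XXXXXX")

# Write intermediate files to local scratch (hypothetical file name)
echo "intermediate result" > "$workdir/result.txt"

# Copy what you want to keep back to the submission directory, then clean up
cp "$workdir/result.txt" ./result.txt
rm -rf "$workdir"
```

Working in local scratch keeps heavy intermediate I/O off the shared file system; only the final results are copied back.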

Submission Script - Apptainer

  • Download this example job submission script that uses a container
  • Paths that the container needs read/write access to need to be mounted with APPTAINER_BINDPATH
#!/bin/bash
#$ -S /bin/bash      # the shell language when run via the job scheduler
#$ -cwd               # job should run in the current working directory
#$ -j y               # STDERR and STDOUT should be joined
#$ -l mem_free=1G     # job requires up to 1 GiB of RAM per slot
#$ -l scratch=2G      # job requires up to 2 GiB of local /scratch space
#$ -l h_rt=1:00:00    # job requires up to 1 hour of runtime


# Mount the current directory to the container
# Any directory that needs to be accessed by the container should be mounted
directory=$(pwd)
export APPTAINER_BINDPATH="$directory"

h=$(hostname)

apptainer run hello-world_1.0.sif figlet "$h" > "$directory/hello.txt"

[[ -n "$JOB_ID" ]] && qstat -j "$JOB_ID"
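
If the container needs more than one directory, APPTAINER_BINDPATH accepts a comma-separated list. A small sketch of building that list from a bash array; the two /path/to/... entries are placeholders, not real Wynton paths.

```shell
# Directories the container needs access to (placeholders; replace with your own)
bind_dirs=("$PWD" "/path/to/reference_data" "/path/to/output")

# Join the array elements with commas: inside the subshell, IFS=, makes
# "${bind_dirs[*]}" expand to a single comma-separated string
APPTAINER_BINDPATH=$(IFS=,; echo "${bind_dirs[*]}")
export APPTAINER_BINDPATH

echo "$APPTAINER_BINDPATH"
```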

Parallel Processing Jobs

  • By default jobs run on a single core

  • Multicore jobs must run in an SGE parallel environment (PE) and tell SGE how many cores the job will use

  • Do not use more cores than requested

  • There are four parallel environments on Wynton:

    • smp: for single-host parallel jobs using Symmetric multiprocessing (SMP)
    • mpi: for multiple-host parallel jobs based on MPI parallelization
    • mpi_onehost: for single-host parallel jobs based on MPI parallelization
    • mpi-8: for multi-threaded multi-host jobs based on MPI parallelization

Example Parallel Job

  • The simplest parallel environment on Wynton is smp, a single node with n cores
  • Download this example smp job submission script
#!/bin/bash
#$ -S /bin/bash 
#$ -cwd
#$ -j y
#$ -pe smp 4                    # 4 cores on a single node
#$ -l mem_free=2G               # 2 GiB of RAM per slot (core), so 8 GiB total
#$ -l scratch=5G                # 5 GiB of local /scratch space
#$ -l h_rt=08:00:00             # 8 hours of runtime


# Code that requires 4 cores
# **Specify the number of cores as ${NSLOTS}**



[[ -n "$JOB_ID" ]] && qstat -j "$JOB_ID"
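
One way the "Code that requires 4 cores" placeholder above might read NSLOTS. This is a sketch: OMP_NUM_THREADS only affects OpenMP-based tools, and other tools need the value passed through their own thread option.

```shell
# NSLOTS is set by SGE to the number of slots requested with -pe smp.
# Default to 1 so the script also works when run outside the scheduler.
threads="${NSLOTS:-1}"

# OpenMP-based tools read this variable; for other tools, pass "$threads"
# to whatever option they use to set the number of threads
export OMP_NUM_THREADS="$threads"

echo "Using $threads core(s)"
```

Reading the core count from NSLOTS rather than hard-coding it keeps the script in sync with the -pe smp request.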

Array Jobs

  • This is a good option if the script you want to run operates on discrete sets of data
    • e.g. sample or chromosome
  • Download this example array job submission script
#!/bin/bash           
#$ -S /bin/bash       
#$ -cwd               
#$ -j y               
#$ -l mem_free=1G     
#$ -l scratch=2G     
#$ -l h_rt=1:00:00   
#$ -t 1-5          # Number of tasks to run in the array (each is a job with the same resource requirements above)

params=(sample1 sample2 sample3 sample4 sample5)

# The task ID is stored in the variable SGE_TASK_ID
# This variable is used to index the array of parameters
# The task ID is 1-indexed
param=${params[$SGE_TASK_ID - 1]}

echo "Running task $SGE_TASK_ID with parameter $param"

# Code for each task

[[ -n "$JOB_ID" ]] && qstat -j "$JOB_ID"
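
For longer parameter lists, the parameters could be kept in a file with one entry per line instead of a bash array. A sketch, assuming a hypothetical samples.txt and a hypothetical helper get_param:

```shell
# Hypothetical parameter file with one sample name per line
printf '%s\n' sample1 sample2 sample3 sample4 sample5 > samples.txt

# Print line number $1 of file $2 (task IDs and line numbers are both 1-indexed)
get_param() {
  sed -n "${1}p" "$2"
}

# Default to task 1 so the script can be tried outside the scheduler
task_id="${SGE_TASK_ID:-1}"
param=$(get_param "$task_id" samples.txt)

echo "Running task $task_id with parameter $param"
```

With this layout, the range given to -t should match the number of lines in the file.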

GPU Jobs

  • To run a GPU job, submit it to a GPU queue, e.g. -q gpu.q
    • Other GPU queues may be available to you depending on your lab
  • It is important to specify the GPU using the SGE_GPU variable so that your job uses its assigned GPU
    • For CUDA based tools, add export CUDA_VISIBLE_DEVICES=$SGE_GPU to your submission script
  • GPU jobs must include a runtime request or they will be removed from the queue
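
Putting these points together, a GPU submission script might look like the sketch below. The 2-hour runtime is an arbitrary example value, and the fallback to an empty value when SGE_GPU is unset is an assumption added so the script can be tried outside a GPU job.

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -q gpu.q                 # submit to the GPU queue
#$ -l h_rt=02:00:00         # runtime request (required for GPU jobs; example value)

# Restrict CUDA-based tools to the GPU assigned by the scheduler.
# Outside a GPU job SGE_GPU is unset, so this becomes empty (no GPU visible).
export CUDA_VISIBLE_DEVICES="${SGE_GPU:-}"
echo "Assigned GPU(s): ${CUDA_VISIBLE_DEVICES:-none}"

# GPU code goes here

# End-of-job summary, if running as a job
if [[ -n "${JOB_ID:-}" ]]; then
  qstat -j "$JOB_ID"
fi
```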

Submitting and Querying jobs

  • Use qsub to submit jobs
[alice@dev1 ~]$ qsub job1.sh
Your job 714888 ("job1.sh") has been submitted
  • Use qstat to check the status of your jobs
[alice@dev1 ~]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
 714888 0.06532 job1 alice     r     03/25/2024 19:54:18 member.q@msg-hmio1                 1        
 714889 0.06532 job2 alice     r     03/25/2024 19:54:19 member.q@msg-hmio1                 1        

Read the querying jobs Wynton documentation for more information.
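
When scripting around qsub, it can be handy to capture the job ID. A sketch that parses the confirmation message shown above; it assumes the message keeps that exact wording.

```shell
# Confirmation message in the format printed by qsub in the example above
msg='Your job 714888 ("job1.sh") has been submitted'

# The job ID is the third whitespace-separated field
job_id=$(echo "$msg" | awk '{print $3}')
echo "Submitted job $job_id"

# In a real session the message would come from qsub itself, e.g.:
#   job_id=$(qsub job1.sh | awk '{print $3}')
#   qstat -j "$job_id"
```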

Estimating Job Resources

  • Try to estimate the amount of RAM needed using a small test dataset
  • Request a little more RAM than you need to avoid having your job cancelled
  • Check on jobs you are running for the first time with qstat -j to make sure they are not exceeding their resource requests
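
One way to pull the peak memory out of the qstat -j output is to filter for the maxvmem field. The usage line below is illustrative only; its exact layout is an assumption and may differ on Wynton.

```shell
# Illustrative "usage" line (assumed format, not real output)
usage_line='usage    1:   cpu=00:00:01, mem=0.00251 GBs, io=0.00012, vmem=19.516M, maxvmem=19.516M'

# Capture the value that follows "maxvmem=" up to the next comma or space
peak=$(echo "$usage_line" | sed -n 's/.*maxvmem=\([^, ]*\).*/\1/p')
echo "Peak memory so far: $peak"

# For a running job, the same filter could be applied to live output:
#   qstat -j "$JOB_ID" | sed -n 's/.*maxvmem=\([^, ]*\).*/\1/p'
```

Comparing this value against the mem_free request (multiplied by the number of slots) gives a rough idea of how much headroom the job has.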

Poll 3

Any job submitted to the compute nodes can also be run on development nodes.

  1. True
  2. False

Running Pipelines

Nextflow RNA-seq

Example - RNA-seq Pipeline

Do not run this during the workshop as it will fill up the Wynton SGE queue

  • Download the testing script
    • Runs a minimal test on the RNA-seq pipeline
  • Download the config file
    • Configures nextflow to use the SGE job scheduler and sets limits on compute job resources for each process
  • Put these in the same directory (do not use your user home directory for this) and run the script in a screen/tmux session
  • When not running the test, the -profile should be apptainer

Jupyter Notebooks

Installing Jupyter Notebooks

  • The preferred way to install and use Jupyter notebooks on Wynton is through pip, not conda
python3 -m pip install --user notebook
  • Jupyter notebooks can only be run on development nodes
  • See the Wynton python documentation for more info on managing python environments on Wynton

Running Jupyter Notebooks - Step 1

  • You cannot connect from outside Wynton HPC directly to a development node
    • Instead we need to use SSH port forwarding to establish the connection with a local web browser
  • Find an available TCP port:
[alice@dev1 ~]$ module load CBI port4me
[alice@dev1 ~]$ port4me --tool=jupyter
47467

Note the port number returned by port4me; you will need it later.

Running Jupyter Notebooks - Step 2

  • Launch Jupyter notebook using the port number from step 1
[alice@dev1 ~]$ jupyter notebook --no-browser --port 47467
[I 2024-03-20 14:48:45.693 ServerApp] jupyter_lsp | extension was successfully linked.
[I 2024-03-20 14:48:45.698 ServerApp] jupyter_server_terminals | extension was successfully linked.
[I 2024-03-20 14:48:45.703 ServerApp] jupyterlab | extension was successfully linked.
[I 2024-03-20 14:48:45.708 ServerApp] notebook | extension was successfully linked.
[I 2024-03-20 14:48:46.577 ServerApp] notebook_shim | extension was successfully linked.
[I 2024-03-20 14:48:46.666 ServerApp] notebook_shim | extension was successfully loaded.
[I 2024-03-20 14:48:46.668 ServerApp] jupyter_lsp | extension was successfully loaded.
[I 2024-03-20 14:48:46.669 ServerApp] jupyter_server_terminals | extension was successfully loaded.
[I 2024-03-20 14:48:46.675 LabApp] JupyterLab extension loaded from /wynton/home/boblab/alice/.local/lib/python3.11/site-packages/jupyterlab
[I 2024-03-20 14:48:46.675 LabApp] JupyterLab application directory is /wynton/home/boblab/alice/.local/share/jupyter/lab
[I 2024-03-20 14:48:46.677 LabApp] Extension Manager is pypi.
[I 2024-03-20 14:48:46.707 ServerApp] jupyterlab | extension was successfully loaded.
[I 2024-03-20 14:48:46.711 ServerApp] notebook | extension was successfully loaded.
[I 2024-03-20 14:48:46.712 ServerApp] Serving notebooks from local directory: /wynton/home/boblab/alice
[I 2024-03-20 14:48:46.712 ServerApp] Jupyter Server 2.13.0 is running at:
[I 2024-03-20 14:48:46.712 ServerApp] http://localhost:47467/tree?token=8e37f8d62fca6a1c9b2da429f27df5ebcec706a808c3a8f2
[I 2024-03-20 14:48:46.712 ServerApp]     http://127.0.0.1:47467/tree?token=8e37f8d62fca6a1c9b2da429f27df5ebcec706a808c3a8f2
[I 2024-03-20 14:48:46.712 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 2024-03-20 14:48:46.725 ServerApp]

    To access the server, open this file in a browser:
        file:///wynton/home/boblab/alice/.local/share/jupyter/runtime/jpserver-2853162-open.html
    Or copy and paste one of these URLs:
        http://localhost:47467/tree?token=8e37f8d62fca6a1c9b2da429f27df5ebcec706a808c3a8f2
        http://127.0.0.1:47467/tree?token=8e37f8d62fca6a1c9b2da429f27df5ebcec706a808c3a8f2

Running Jupyter Notebooks - Step 3

  • Set up SSH port forwarding on your local machine in a separate terminal, and leave both terminals open
{local}$ ssh -J alice@log1.wynton.ucsf.edu -L 47467:localhost:47467 alice@dev1
...
[alice@dev1 ~]$ 

The notebook should now be available at the URL from step 2
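
Because the port and development node can change between sessions, a small helper on your local machine can assemble the forwarding command. This is a sketch; the function name tunnel_cmd is hypothetical, and it only prints the command rather than running it.

```shell
# Print the SSH port-forwarding command for a given user, dev node, and port
tunnel_cmd() {
  local user="$1" node="$2" port="$3"
  echo "ssh -J ${user}@log1.wynton.ucsf.edu -L ${port}:localhost:${port} ${user}@${node}"
}

# Example: the values used in steps 1 to 3
tunnel_cmd alice dev1 47467
```

The same local and remote port number is used on both sides of -L so that the URL printed by Jupyter works unchanged in the local browser.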

RStudio Server

RStudio Server

  • RStudio Server is already available in the CBI module
  • This allows you to set up a personal RStudio instance that only you can access
  • Requires two separate SSH connections to the cluster:
    • One to launch RStudio Server
    • One to connect to it

RStudio Server - Step 1

  • Launch your own RStudio Server instance
[alice@dev1 ~]$ module load CBI rstudio-server-controller
[alice@dev1 ~]$ rsc start
alice, your personal RStudio Server 2023.09.1-494 running R 4.3.2 is available on:

  <http://127.0.0.1:20612>

Importantly, if you are running from a remote machine without direct access
to dev1, you need to set up SSH port forwarding first, which you can do by
running:

  ssh -L 20612:dev1:20612 alice@log1.wynton.ucsf.edu

in a second terminal from your local computer.

Any R session started times out after being idle for 120 minutes.
WARNING: You now have 10 minutes, until 2023-11-15 17:06:50-08:00, to
connect and log in to the RStudio Server before everything times out.
Your one-time random password for RStudio Server is: y+IWo7rfl7Z7MRCPI3Z4

Note the password and URL; they will be needed to log in to the server instance.

RStudio Server - Step 2

  • Connect to your personal RStudio Server instance from your local machine in a separate terminal
{local}$ ssh -L 20612:dev1:20612 alice@log1.wynton.ucsf.edu
alice@log1.wynton.ucsf.edu's password: XXXXXXXXXXXXXXXXXXX
[alice@log1 ~]$ 

RStudio Server - Step 3

  • Open RStudio Server in your local web browser
  • Open the link from step 1
  • Enter your Wynton user name
  • Enter the password from step 1

How to Get Help

Wynton Questions

  • Follow the Wynton question checklist
  • Email
  • Slack
    • ucsf-wynton
    • Sign-up using a UCSF email address
    • Email support if that does not work
  • Zoom office hours every Tuesday, 11am-12pm
    • Zoom URL in the message-of-the-day (MOTD) that you get when you log into Wynton

Bioinformatics Questions

For any bioinformatics specific questions feel free to reach out to the Gladstone Bioinformatics Core.

End of Part 2

Thank You!

Upcoming Data Science Training Program Workshops

Introduction to Linear Mixed Effects Models
April 25-April 26, 2024 1-3pm PDT

Single Cell RNA-Seq Data Analysis
April 29-April 30, 2024 9am-4pm PDT

Single Cell ATAC-Seq Data Analysis Part 1
May 6-May 7, 2024 1-4pm PDT

Complete Schedule