Working on Wynton

Part 1

Natalie Elphick

March 24th, 2025


Introductions

Instructor:

      Natalie Elphick
      Bioinformatician II

Target Audience

  • Prior experience with the UNIX command line

Part 1:

  1. What is an HPC cluster?
  2. Node Types and Logging in
  3. Storage
  4. Data Transfer
  5. Installing Software
  6. Containers

What is Wynton HPC?

High-performance Computing Cluster

  • A collection of specialized computers (nodes) connected together on a fast local network

HPC Diagram

HPC File System

Wynton

  • An HPC Linux environment available to all UCSF researchers for free
  • Uses the Rocky 8 Linux OS
  • Includes several hundred compute nodes and a large shared storage system (Cluster specifications)
  • Funded and administered cooperatively by UCSF campus IT and key research groups

https://wynton.ucsf.edu

Node Types & Logging in

Node Types

  • Login: Submit and query jobs. SSH to development nodes. File management.
  • Development: Compile and install software. Test job scripts. Submit and query jobs. Version control. File management.
  • Compute: Running job scripts.
  • Transfer: Fast in- & outbound file transfers. File management.

The Login Nodes

  • Only capable of basic tasks (file management, submitting and checking on jobs)
  • Lack access to the pre-installed software tools that the development nodes have
  • The primary method to log in is to use an SSH client application

Names:

log1, log2 and plog1 (for PHI users)

Login

  • Connect to the UCSF or Gladstone WiFi network (or the respective VPN), or log in using 2FA
  • ssh [your-username]@[node].wynton.ucsf.edu
{local}$ ssh alice@log1.wynton.ucsf.edu
alice@log1.wynton.ucsf.edu's password: 
[alice@log1 ~]$
  • There will not be any visual feedback when typing your password

The Development Nodes

  • Have a set of core software installed
    • e.g. git, vim, nano, make and python
  • Also have access to software repositories, some of which are maintained by other users or research groups
    • e.g. matlab, R and openjdk
  • Cannot be reached via SSH directly, only from a login node
ssh dev1

Names:

dev[1-3], gpudev1, pdev1 (PHI) and pgpudev1 (PHI)
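
Because the development nodes are only reachable through a login node, the two hops can be combined into one command on your local machine. The following is a minimal sketch, assuming your local OpenSSH client supports the -J (ProxyJump) option and that jump connections are permitted. The helper name wynton_dev is hypothetical, and alice is the example username used in these slides.

```shell
# Hypothetical helper (not part of Wynton): reach a development node in one
# command by jumping through the login node log1.
# Usage: wynton_dev <username> [dev-node]
wynton_dev() {
    user="$1"
    node="${2:-dev1}"   # default to dev1, as in the example above
    # -J makes OpenSSH connect to the login node first, then hop to the dev node.
    ssh -J "${user}@log1.wynton.ucsf.edu" "${user}@${node}"
}
```

For example, `wynton_dev alice dev2` would ask for your password once per hop and then open a shell on dev2.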

Data Transfer Nodes

  • Can SSH in to directly
  • Fast network speed
  • Limited software
  • Use for transferring files to and from Wynton

Example:

{local}$ scp local_file.tsv alice@dt1.wynton.ucsf.edu:~/

Names:

dt1 and dt2

Compute Nodes

  • Cannot SSH in to directly
  • No internet or UCSF network access
  • Used to run non-interactive compute job scripts
  • The software to run the job script is provided using a container

Compute Jobs

Storage

Storage

  • Wynton storage is not backed up
  • /wynton/home/[group_name]/[user]
    • PHI users : /wynton/protected/home/[group_name]/[user]
    • User home directory - limited to 500 GiB
  • /wynton/group/[group_name]
    • PHI users : /wynton/protected/group/[group_name]
    • User group directory - disk quota varies by group
    • Use this directory for any analysis you want to share with your lab
  • More information on disk quotas

To check your group disk quota run:

beegfs-ctl --getquota --storagepoolid=12 --gid "$(id --group)"
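
To get a rough idea of how close a home directory is to the 500 GiB limit, disk usage can also be estimated with standard tools. This is only a sketch: the function name home_usage_pct is hypothetical, and it assumes that the size reported by du approximates what the quota system counts, which may not hold exactly.

```shell
# Hypothetical helper: print the share of the 500 GiB home quota that a
# directory tree occupies, as a whole-number percentage.
# Assumption: du's size estimate is close to what the quota system counts.
home_usage_pct() {
    dir="${1:-$HOME}"
    quota_kib=$((500 * 1024 * 1024))        # 500 GiB expressed in KiB
    used_kib=$(du -sk "$dir" | cut -f1)     # total size of the tree in KiB
    echo $((used_kib * 100 / quota_kib))
}
```

Running du over a very large directory tree can take a long time, so the quota command above is the better first choice when it is available.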

Scratch - Temporary Storage

  • Local /scratch - 0.1-1.8 TiB/node storage unique to each compute node
    • Can only be accessed from the specific compute node
    • Use this to store intermediate files only needed for a job
  • /wynton/scratch and /wynton/protected/scratch (for PHI users)
    • 703 TiB storage accessible from everywhere
  • No quotas


Files not used for 2 weeks are automatically deleted
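
One way to use scratch for intermediate files is to create a private working directory at the start of a script and remove it on exit. A minimal sketch under stated assumptions: SCRATCH_BASE is a hypothetical override variable, /scratch/$USER is the same location used in the Samtools install example in these slides, and the script falls back to /tmp where local scratch is not writable.

```shell
# Sketch of a per-job working directory that is deleted when the script exits.
# Assumption: SCRATCH_BASE is a hypothetical override, not a Wynton variable.
scratch_base="${SCRATCH_BASE:-/scratch/${USER:-$(id -un)}}"
mkdir -p "$scratch_base" 2>/dev/null
# Fall back to /tmp when local scratch is not writable (e.g. on a laptop).
[ -w "$scratch_base" ] || scratch_base="${TMPDIR:-/tmp}"

workdir=$(mktemp -d "$scratch_base/job.XXXXXX")
trap 'rm -rf "$workdir"' EXIT     # remove intermediate files on exit

# Placeholder for real work: write an intermediate file, then list it.
echo "intermediate data" > "$workdir/step1.tmp"
ls "$workdir"
```

Anything that must outlive the job should be copied out of the working directory before the script ends.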

Gladstone HIVE

  • Gladstone’s HIVE storage server is mounted directly to Wynton under /gladstone
    • Only certain HIVE folders are accessible directly on Wynton
    • Files under /gladstone are backed up
  • Naming: /gladstone/[lab]
    • Directories that are shared between multiple labs can be set up by contacting Gladstone IT
  • For more information visit the IT knowledge base page

Storage Advice

  • Always back up anything you store under /wynton
  • If you have access to it keep all of your data on /gladstone
    • Running many jobs that read from and write to these directories may be slow, because /gladstone is NFS-mounted rather than on BeeGFS
  • Use the scratch directories to store temporary files
    • e.g. a large number of .fastq files that you do not need after the alignment step
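
As one possible way to follow the backup advice, a results directory can be archived into a dated tarball in a backed-up location. This is a sketch, not an official procedure. The function name backup_dir is hypothetical and both paths are supplied by the caller.

```shell
# Hypothetical helper: archive a directory into a dated .tar.gz file.
# Usage: backup_dir <source-directory> <destination-directory>
backup_dir() {
    src="$1"
    dest="$2"
    name="$(basename "$src")_$(date +%Y%m%d).tar.gz"
    mkdir -p "$dest"
    # -C switches to the parent directory so the archive stores relative paths.
    tar -czf "$dest/$name" -C "$(dirname "$src")" "$(basename "$src")"
    echo "$dest/$name"    # print the path of the archive that was written
}
```

For example, `backup_dir /wynton/group/[group_name]/results /gladstone/[lab]/backups` would write an archive named results_ followed by the current date.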

Data Transfer

Secure Copy - scp

  • Local file to Wynton
{local}$ scp /path/to/local_file.tsv alice@dt1.wynton.ucsf.edu:/destination/path
  • Copy a directory to a folder on Wynton
{local}$ scp -r local_folder/ alice@dt1.wynton.ucsf.edu:/destination/path
  • Copy a single file from Wynton to your local machine
{local}$ scp alice@dt1.wynton.ucsf.edu:/path/to/remote_file.tsv /destination/path
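
After a transfer, comparing checksums is a simple way to confirm that a copy is intact. A minimal sketch, assuming sha256sum (GNU coreutils) is installed. On macOS, `shasum -a 256` prints the same format. The function name same_checksum is hypothetical.

```shell
# Hypothetical helper: succeed only if two files have the same SHA-256 checksum.
# Assumption: sha256sum is available on the machine running this.
same_checksum() {
    a=$(sha256sum "$1" | cut -d' ' -f1)   # keep only the hash, drop the file name
    b=$(sha256sum "$2" | cut -d' ' -f1)
    [ "$a" = "$b" ]
}
```

To check a file that lives on Wynton, its hash can be printed remotely with a command such as `ssh alice@dt1.wynton.ucsf.edu sha256sum /destination/path/local_file.tsv` (assuming sha256sum is installed on the transfer node) and compared with the local value by eye.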

Hands-on

  • Use scp to copy this file into your home directory on Wynton

GUI SFTP Clients

  • These let you transfer files to and from Wynton using a GUI
  • Two-factor authentication (2FA) may be required
  • Cyberduck
    • Navigate to Preferences -> Transfers -> General
    • Change the Transfer Files setting to “Use browser connection” instead of “Open multiple connections”
  • FileZilla
    • In the General tab, select ‘SFTP’ as the Protocol instead of ‘FTP’
    • For Logon Type, select ‘Interactive’ instead of ‘Ask for Password’
    • Under the Transfer Settings tab, you might need to enable ‘Limit number of simultaneous connections’ and set ‘Maximum number of connections’ to 1

Globus

  • Globus is a service for moving, syncing, and sharing large amounts of data
  • Wynton accounts are not required to transfer data with Globus
  • Useful for transferring data between institutions

Rclone

  • Rclone is a command-line program to manage files on remote storage
  • Can be used to transfer data from Wynton directly to Dropbox or other storage systems (AWS, Azure, Google Drive etc.)
    • Do this from a data transfer node using screen/tmux
  • Do not use Rclone for transfers to Box; follow the Wynton to UCSF Box instructions instead
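
A transfer with Rclone might look like the sketch below. It assumes a remote has already been set up with `rclone config`. The remote name mydropbox and the function name push_to_remote are hypothetical.

```shell
# Hypothetical wrapper around `rclone copy`.
# Usage: push_to_remote <directory-on-wynton> <remote:path>
push_to_remote() {
    src="$1"       # directory on Wynton
    remote="$2"    # e.g. mydropbox:backups/project (name chosen in `rclone config`)
    # --progress prints live transfer statistics while copying
    rclone copy "$src" "$remote" --progress
}
```

Starting a session first, for example with `tmux new -s transfer` on dt1, lets a long transfer keep running if your connection drops.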

Poll 1 - Which of these can you not SSH in to?

  1. Login Nodes
  2. Development Nodes
  3. Data transfer Nodes
  4. Compute Nodes

Poll 2

The /wynton directory is backed up on a nightly basis, so there is no need to back up anything stored here.

  1. True
  2. False

Installing Software

Basics

  • Check if the tool is already available in a module
  • Ensure the software you are trying to install is compatible with Rocky 8 Linux (use a container if not)
  • Always install software on a development node
  • Download a precompiled binary or install from source

Install Samtools from Source

  1. Download and extract source code
[alice@dev1 ~]$ mkdir -p "/scratch/$USER"
[alice@dev1 ~]$ cd "/scratch/$USER"
[alice@dev1 alice]$ wget https://github.com/samtools/samtools/releases/download/1.21/samtools-1.21.tar.bz2
[alice@dev1 alice]$ tar -x -f samtools-1.21.tar.bz2
  2. Create install location and configure
[alice@dev1 alice]$ mkdir -p "$HOME/software/samtools-1.21"
[alice@dev1 alice]$ cd samtools-1.21
[alice@dev1 samtools-1.21]$ ./configure --prefix="$HOME/software/samtools-1.21"
  3. Build and install
[alice@dev1 samtools-1.21]$ make
[alice@dev1 samtools-1.21]$ make install

Install Samtools from Source

  4. Add to PATH
[alice@dev1 ~]$ echo "export PATH=$HOME/software/samtools-1.21/bin:\$PATH" >> $HOME/.bashrc
[alice@dev1 ~]$ source $HOME/.bashrc
  5. Test Installation
[alice@dev1 ~]$ samtools --help
Program: samtools (Tools for alignments in the SAM format)
Version: 1.21 (using htslib 1.21)

Usage:   samtools <command> [options]
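
Running the echo command from the "Add to PATH" step more than once appends the same line to .bashrc each time. A small guard avoids the duplicates. This is a sketch, and the function name add_to_path_once is hypothetical.

```shell
# Hypothetical helper: append a PATH export to a startup file only if that
# exact line is not already present.
# Usage: add_to_path_once <bin-directory> [startup-file]
add_to_path_once() {
    line="export PATH=$1:\$PATH"
    rc="${2:-$HOME/.bashrc}"
    touch "$rc"    # make sure the file exists before searching it
    # -q quiet, -x match whole lines only, -F treat the pattern as a fixed string
    grep -qxF -e "$line" "$rc" || echo "$line" >> "$rc"
}
```

For example, `add_to_path_once "$HOME/software/samtools-1.21/bin"` can be run repeatedly and will add the line only once.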

Install Nextflow

  • Scientific workflow system with a community maintained set of core bioinformatics analysis pipelines
    • We will cover an example RNA-seq pipeline in part 2
  • These can be configured to use the Wynton compute job submission system
[alice@dev1 ~]$ mkdir -p ~/software
[alice@dev1 ~]$ curl -s "https://get.sdkman.io" | bash
[alice@dev1 ~]$ exit
[alice@log1 ~]$ ssh dev1
[alice@dev1 ~]$ sdk install java 17.0.6-tem
[alice@dev1 ~]$ cd ~/software
[alice@dev1 software]$ wget -qO- https://get.nextflow.io | bash
[alice@dev1 software]$ ./nextflow -v

Containers

Motivation

  • Compute heavy jobs (high RAM, multiple cores) should be run on compute nodes
  • Containers allow us to make additional software available to the compute nodes
    • Also allows the use of software that might be hard to install on Rocky 8 Linux
    • Improves reproducibility

Compute Jobs

Definitions

  • Containers: Isolated environments for running software that avoid conflicts with the host system. Containers are stored, shared and executed as image files with a .sif extension.
  • Images: Built from definition files (or Dockerfiles), which are sets of instructions that specify your environment.

Apptainer

  • Wynton supports Apptainer (formerly Singularity) containers

  • Docker is a commonly used tool for creating images; Docker images can easily be converted into Apptainer image files (.sif)

  • apptainer run

    • Run predefined script within container
  • apptainer exec

    • Execute any command within container
  • apptainer shell

    • Run bash shell within container
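
The three subcommands differ mainly in what gets started inside the container. The sketch below uses a placeholder image name (my_image.sif). The function name container_os is hypothetical, and it assumes the image contains an /etc/os-release file, as most Linux images do.

```shell
image="my_image.sif"   # placeholder image file name

# run: start the script that the image author predefined
#   apptainer run "$image"
# shell: open an interactive shell inside the container
#   apptainer shell "$image"

# exec: run one command of your choice inside the container.
# Hypothetical helper that prints the container's OS release information.
container_os() {
    apptainer exec "$1" cat /etc/os-release
}
```

Calling `container_os "$image"` on a development node shows which Linux distribution the image is based on, which can differ from the host's Rocky 8.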

Example Container - Hello World

  • Run this command to convert the public Docker image to an Apptainer image file
[alice@dev1 ~]$ apptainer pull docker://natalie23gill/hello-world:1.0
  • Execute the “hi” command in the container
[alice@dev1 ~]$ apptainer exec hello-world_1.0.sif hi
    __  __     ____         _       __           __    __   __
   / / / /__  / / /___     | |     / /___  _____/ /___/ /  / /
  / /_/ / _ \/ / / __ \    | | /| / / __ \/ ___/ / __  /  / / 
 / __  /  __/ / / /_/ /    | |/ |/ / /_/ / /  / / /_/ /  /_/  
/_/ /_/\___/_/_/\____/     |__/|__/\____/_/  /_/\__,_/  (_) 
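
The pull command above wrote a file called hello-world_1.0.sif, that is, the image name and tag joined by an underscore. A short sketch of that naming pattern, useful when scripting pulls. The function name sif_name is hypothetical, and the "latest" fallback assumes Docker's default tag is used when none is given.

```shell
# Hypothetical helper: derive the <name>_<tag>.sif file name for a docker:// URI.
sif_name() {
    ref="${1##*/}"        # drop everything up to the last slash
    name="${ref%%:*}"     # part before the colon: image name
    tag="${ref#*:}"       # part after the colon: image tag
    [ "$tag" = "$ref" ] && tag="latest"   # no colon present: assume default tag
    echo "${name}_${tag}.sif"
}

sif_name docker://natalie23gill/hello-world:1.0   # prints hello-world_1.0.sif
```

A script can use the result to skip the pull when the file already exists.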

Example Container

  • This container has figlet installed which creates ASCII art from text input
  • Try running this command to create your own using exec
[alice@dev1 ~]$ apptainer exec hello-world_1.0.sif figlet your_text

Docker

  • Docker uses Dockerfiles to specify image creation
  • Preferred by the Gladstone Bioinformatics Core to create new images
  • In part 2, we will go over how to build custom container images from Dockerfiles
  • To see the Dockerfile used to create the hello-world image, run:
[alice@dev1 ~]$ apptainer exec hello-world_1.0.sif cat /Dockerfile

End of Part 1

Thank You!

  • Please take some time to fill out the workshop survey if you are not attending part 2:

https://www.surveymonkey.com/r/bioinfo-training

Upcoming Data Science Training Program Workshops

Single Cell RNA-Seq Analysis
March 27-March 28, 2025 9:00am-12:00pm PDT

Introduction to Linear Mixed Effects Models
April 3-April 4, 2025 1:00-3:00pm PDT

Introduction to scATAC-seq Data Analysis
April 17-April 18, 2025 9:00am-12:00pm PDT

Introduction to Pathway Analysis
April 22, 2025 1:00-4:00pm PDT

Complete Schedule