Working on Wynton

Part 1

Natalie Elphick

April 15th, 2024


Introductions

Instructor:

      Natalie Elphick
      Bioinformatician I

TAs:

      Alex Pico
      Bioinformatics Core Director
      Ayushi Agrawal
      Bioinformatician III
      Min-Gyoung Shin
      Bioinformatician III

Target Audience

  • Prior experience with the UNIX command line

Part 1:

  1. What is an HPC cluster?
  2. Node Types and Logging in
  3. Storage
  4. Data Transfer
  5. Installing Software
  6. Containers

What is Wynton HPC?

High-performance Computing Cluster

  • A collection of specialized computers (nodes) connected together on a fast local network

HPC Diagram

Wynton

  • An HPC Linux environment available to all UCSF researchers for free
  • Uses the Rocky 8 Linux OS
  • Includes several hundred compute nodes and a large shared storage system (Cluster specifications)
  • Funded and administered cooperatively by UCSF campus IT and key research groups

https://wynton.ucsf.edu

Node Types and Logging in

Node Types

  • Login: Submit and query jobs. SSH to development nodes. File management.
  • Development: Compile and install software. Test job scripts. Submit and query jobs. Version control. File management.
  • Compute: Running job scripts.
  • Transfer: Fast in- & outbound file transfers. File management.

The Login Nodes

  • Only capable of basic tasks (file management, submitting and checking on jobs)
  • Lack access to the pre-installed software tools that the development nodes have
  • The primary method to log in is to use an SSH client application
  • The Wynton HPC website has up-to-date information on logging in: Access Cluster

Names:

log1, log2 and plog1 (for PHI users)

Login

  • Connect to the UCSF or Gladstone WiFi networks (or the respective VPN), or use 2FA
  • ssh [your-username]@[node].wynton.ucsf.edu
{local}$ ssh alice@log1.wynton.ucsf.edu
alice@log1.wynton.ucsf.edu's password: 
[alice@log1 ~]$
  • There will not be any visual feedback when typing your password

The Development Nodes

  • Have a set of core software installed
    • e.g. git, vim, nano, make and python
  • Also have access to software repositories, some of which are maintained by other users or research groups
    • e.g. matlab, R and openjdk
  • Cannot be reached by SSH directly, only from a login node
ssh dev1

Names:

dev[1-3], gpudev1, pdev1 (PHI) and pgpudev1 (PHI)
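
Because the development nodes are only reachable through a login node, the two hops can be chained in an SSH client configuration. The following is a minimal sketch, not an official Wynton recipe: it assumes OpenSSH 7.3 or later on your local machine (for ProxyJump) and that the login node permits jump connections. The aliases wynton-log1 and wynton-dev1 are made up for illustration, and alice is the placeholder username used in these slides. The snippet only writes the entries to a temporary file so they can be reviewed first.

```shell
# Write example SSH client entries to a temporary file for review.
# The aliases (wynton-log1, wynton-dev1) are hypothetical; "alice" is a placeholder.
cfg="$(mktemp)"
cat > "$cfg" <<'EOF'
Host wynton-log1
    HostName log1.wynton.ucsf.edu
    User alice

Host wynton-dev1
    HostName dev1
    User alice
    ProxyJump wynton-log1
EOF
echo "Example entries written to $cfg"
```

Once the entries have been reviewed and appended to ~/.ssh/config on your local machine, running ssh wynton-dev1 should connect through log1 in a single step.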

Data Transfer Nodes

  • Can SSH in to directly
  • Fast network speed
  • Limited software
  • Use for transferring files to and from Wynton

Example:

{local}$ scp local_file.tsv alice@dt1.wynton.ucsf.edu:~/

Names:

dt1 and dt2

Compute Nodes

  • Cannot SSH in to directly
  • No internet or UCSF network access
  • Used to run non-interactive compute job scripts
  • The software to run the job script is provided using a container

Compute Jobs

Storage

The File System

  • A file system is how information is stored and retrieved on a computer
    • Consists of files and directories
  • A local file system is a function of the operating system and is only accessible from a single computer
  • A shared file system is accessible from multiple computers

BeeGFS

  • Wynton uses a parallel shared file system called BeeGFS
    • The files are stored as “chunks” spread across many different servers
  • BeeGFS has multiple services that work together to manage the file system
    • Storage (stores the chunks)
    • Metadata (tracks the chunks and information about their file)
    • Management (tracks all of the services)
    • Client (provides linux access to the file system)

BeeGFS - Advantages

  • High throughput
  • Redundancy can be built in by mirroring services
  • Adding new storage is fast and does not require downtime

BeeGFS - Caveats

  • For any client node, performance is limited by the network bandwidth of that node
  • Network latency becomes extremely important for all metadata requests
  • Certain input/output patterns can be problematic

BeeGFS - I/O patterns

  • Anything that requires lots of metadata operations can feel slow
    • e.g. many writes to the same directory, or many file lookups and directory searches (conda)
  • Keep the number of reads and writes to a single directory to a reasonable number

BeeGFS - Takehome Message

  • Prefer fewer, large files over many small ones
  • Distribute reading and writing over several directories
  • Use local scratch (/scratch) when possible
  • Don’t include anything in /wynton in your default LD_LIBRARY_PATH
  • If using conda, putting the conda application inside an Apptainer (formerly Singularity) container will result in better performance
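
The first and third points can be combined: write many small files to local scratch, then move a single archive to shared storage. This is a minimal sketch under stated assumptions. The 100 sample files and the directory name bundle_demo are invented for illustration, and the fallback to the system temp directory is only there so the sketch also runs away from Wynton.

```shell
set -euo pipefail

# Use node-local scratch when it exists (as on Wynton); otherwise fall back
# to the system temp directory so this also runs on a laptop.
user_name="${USER:-$(id -un)}"
if [ -d /scratch ] && [ -w /scratch ]; then
    scratch_root="/scratch/$user_name"
else
    scratch_root="${TMPDIR:-/tmp}"
fi
mkdir -p "$scratch_root"
workdir="$(mktemp -d "$scratch_root/bundle_demo.XXXXXX")"

# Hypothetical job output: 100 small files written to local scratch,
# so the metadata-heavy work never touches BeeGFS.
mkdir "$workdir/results"
for i in $(seq 1 100); do
    echo "sample $i" > "$workdir/results/sample_$i.txt"
done

# Bundle them into one archive; only this single file would then be
# copied to shared storage (for example a directory under /wynton/group).
tar -czf "$workdir/results.tar.gz" -C "$workdir" results

# Count the archived sample files as a sanity check.
n_files="$(tar -tzf "$workdir/results.tar.gz" | grep -c 'sample_.*\.txt$')"
echo "archived $n_files files into $workdir/results.tar.gz"
```

Remove the working directory once the archive has been copied, since local scratch is shared with other jobs on the node.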

Storage

  • Wynton storage is not backed up
  • /wynton/home/[group_name]/[user]
    • PHI users : /wynton/protected/home/[group_name]/[user]
    • User home directory - limited to 500 GiB
  • /wynton/group/[group_name]
    • PHI users : /wynton/protected/group/[group_name]
    • User group directory - disk quota varies by group
    • Use this directory for any analysis you want to share with your lab
  • More information on disk quotas

To check your group disk quota run:

beegfs-ctl --getquota --storagepoolid=12 --gid "$(id --group)"

Scratch - Temporary Storage

  • Local /scratch - 0.1-1.8 TiB/node storage unique to each compute node
    • Can only be accessed from the specific compute node
    • Use this to store intermediate files only needed for a job
  • /wynton/scratch and /wynton/protected/scratch (for PHI users)
    • 703 TiB storage accessible from everywhere
  • No quotas


Files not used for 2 weeks are automatically deleted
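
To see which scratch files might be at risk, something like the following could be used. It is a rough sketch, not the cleanup rule Wynton actually applies: it assumes that "not used" can be approximated by file access time, and the function name stale_files is made up for this example.

```shell
# List files under a directory that have not been accessed for more than
# a given number of whole days (default 13, i.e. 14 days or older).
# Assumption: "not used" is approximated here by access time (-atime).
stale_files() {
    dir="$1"
    days="${2:-13}"
    find "$dir" -type f -atime "+$days" -print
}

# Example call; the path is a placeholder for your own scratch directory.
# stale_files /wynton/scratch/your_directory
```

Anything listed that is still needed should be copied to backed-up storage rather than kept alive in scratch.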

Gladstone HIVE

  • Gladstone’s HIVE storage server is mounted directly to Wynton under /gladstone
    • Only certain HIVE folders are accessible directly on Wynton
    • Files under /gladstone are backed up
  • Naming: /gladstone/[lab]
    • Directories that are shared between multiple labs can be set up by contacting Gladstone IT
  • For more information visit the IT knowledge base page

Storage Advice

  • Always back up anything you store under /wynton
  • If you have access to it, keep all of your data on /gladstone
    • A large number of jobs reading from and writing to these directories may run slower, since /gladstone is NFS-mounted rather than on BeeGFS
  • Use the scratch directories to store temporary files
    • e.g. A large number of .fastq files that you do not need after the alignment step

Data Transfer

Secure Copy - scp

  • Local file to Wynton
{local}$ scp /path/to/local_file.tsv alice@dt1.wynton.ucsf.edu:/destination/path
  • Copy a directory to a folder on Wynton
{local}$ scp -r local_folder/ alice@dt1.wynton.ucsf.edu:/destination/path
  • Copy a single file from Wynton to your local machine
{local}$ scp alice@dt1.wynton.ucsf.edu:/path/to/remote_file.tsv /destination/path
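
After a transfer, comparing checksums is one way to confirm that a file arrived intact. The helpers below are an illustrative sketch: the function names file_digest and same_contents are invented here, and the fallback to shasum is only for machines (such as macOS) that lack sha256sum.

```shell
# Print the SHA-256 digest of a file (hex string only).
file_digest() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum < "$1" | cut -d ' ' -f 1
    else
        shasum -a 256 < "$1" | cut -d ' ' -f 1
    fi
}

# Succeed only when both files have identical contents.
same_contents() {
    [ "$(file_digest "$1")" = "$(file_digest "$2")" ]
}
```

For a file copied with scp, the local digest from file_digest can be compared with the output of sha256sum run on the copy over SSH on a data transfer node. This assumes sha256sum is available there, which is likely on Rocky 8 but not confirmed in these slides.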

Hands-on

  • Use scp to copy this file into your home directory on Wynton

GUI SFTP Clients

  • These let you transfer files to and from Wynton using a GUI
  • 2 factor authentication may be required
  • Cyberduck
    • Navigate to Preferences -> Transfers -> General
    • Change the Transfer Files setting to “Use browser connection” instead of “Open Multiple connections”
  • FileZilla
    • In the General tab, select ‘SFTP’ as the Protocol instead of ‘FTP’
    • For Logon Type, select ‘Interactive’ instead of ‘Ask for Password’
    • Under the Transfer Settings tab, you might need to click the ‘Limit number of simultaneous connections’ and make sure the ‘Maximum number of connections’ is set to 1

Globus

  • Globus is a service for moving, syncing, and sharing large amounts of data
  • Wynton Accounts are not required to transfer data with Globus
  • Useful for transferring data between institutions

Rclone

  • Rclone is a command-line program to manage files on remote storage
  • Can be used to transfer data from Wynton directly to Dropbox or other storage systems (AWS, Azure, Google Drive etc.)
    • Do this from a data transfer node using screen/tmux
  • Do not use rclone for transfers to Box; follow the Wynton to UCSF Box instructions instead

Poll 1

Poll 1 - Which of these can you not SSH in to?

  1. Login Nodes
  2. Development Nodes
  3. Data transfer Nodes
  4. Compute Nodes

Poll 2

The /wynton directory is backed up on a nightly basis, so there is no need to back up anything stored here.

  1. True
  2. False

Installing Software

Basics

  • Check if the tool is already available in a module
  • Ensure the software you are trying to install is compatible with Rocky 8 Linux (use a container if not)
  • Always install software on a development node
  • Download a precompiled binary or install from source

Install Samtools from Source

  1. Download and extract source code
[alice@dev1 ~]$ mkdir -p "/scratch/$USER"
[alice@dev1 ~]$ cd "/scratch/$USER"
[alice@dev1 alice]$ wget https://github.com/samtools/samtools/releases/download/1.19.2/samtools-1.19.2.tar.bz2
[alice@dev1 alice]$ tar -x -f samtools-1.19.2.tar.bz2
  2. Create install location and configure
[alice@dev1 alice]$ mkdir -p "$HOME/software/samtools-1.19.2"
[alice@dev1 alice]$ cd samtools-1.19.2
[alice@dev1 samtools-1.19.2]$ ./configure --prefix="$HOME/software/samtools-1.19.2"
  3. Build and install
[alice@dev1 samtools-1.19.2]$ make
[alice@dev1 samtools-1.19.2]$ make install
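
After make install, the binaries sit in the bin directory under the chosen prefix and are not on PATH by default. One way to make them available is sketched below. The helper name prepend_to_path is made up for this example, and the path in samtools_bin is an assumption that should be replaced with the bin directory under whatever --prefix was passed to configure.

```shell
# Prepend a directory to PATH unless it is already listed.
prepend_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                        # already present, nothing to do
        *) PATH="$1:$PATH"; export PATH ;;
    esac
}

# Assumed location: replace with "<your --prefix>/bin" from the configure step.
samtools_bin="$HOME/software/samtools-1.19.2/bin"
prepend_to_path "$samtools_bin"

# Report which samtools (if any) the shell now resolves.
command -v samtools || echo "samtools not found on PATH yet"
```

Adding the same lines to ~/.bashrc would make the change persist across logins.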

Install Nextflow

  • Scientific workflow system with a community maintained set of core bioinformatics analysis pipelines
    • We will cover an example RNA-seq pipeline in part 2
  • These can be configured to use the Wynton compute job submission system
[alice@dev1 ~]$ curl -s "https://get.sdkman.io" | bash
[alice@dev1 ~]$ exit
[alice@log1 ~]$ ssh dev1
[alice@dev1 ~]$ sdk install java 17.0.6-tem
[alice@dev1 ~]$ cd ~/software
[alice@dev1 software]$ wget -qO- https://get.nextflow.io | bash
[alice@dev1 software]$ ./nextflow -v

Containers

Motivation

  • Compute heavy jobs (high RAM, multiple cores) should be run on compute nodes
  • Containers allow us to make additional software available to the compute nodes
    • Also allow the use of software that might be hard to install on Rocky 8 Linux
    • Improves reproducibility

Compute Jobs

Definitions

  • Containers: An isolated environment for running software that is created from an image file, preventing conflicts with the host system.
  • Images: An ordered collection of root filesystem changes that contain all necessary dependencies, ensuring software runs identically across various computing platforms.

Apptainer

  • Wynton supports Apptainer (formerly Singularity) containers

  • Docker is commonly used to create images, and Docker images can easily be converted into Apptainer image files (.sif)

  • apptainer run

    • Run predefined script within container
  • apptainer exec

    • Execute any command within container
  • apptainer shell

    • Run bash shell within container

Example Container - Hello World

  • Run this command to convert the public Docker image to an Apptainer image file
[alice@dev1 ~]$ apptainer pull docker://natalie23gill/hello-world:1.0
  • Execute the “hi” command in the container
[alice@dev1 ~]$ apptainer exec hello-world_1.0.sif hi
    __  __     ____         _       __           __    __   __
   / / / /__  / / /___     | |     / /___  _____/ /___/ /  / /
  / /_/ / _ \/ / / __ \    | | /| / / __ \/ ___/ / __  /  / / 
 / __  /  __/ / / /_/ /    | |/ |/ / /_/ / /  / / /_/ /  /_/  
/_/ /_/\___/_/_/\____/     |__/|__/\____/_/  /_/\__,_/  (_) 

Example Container

  • This container has figlet installed which creates ASCII art from text input
  • Try running this command to create your own using exec
[alice@dev1 ~]$ apptainer exec hello-world_1.0.sif figlet your_text
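
When the same image is used repeatedly, for instance in job scripts, the exec call can be wrapped in a small function that fails early with a readable message. This is only a sketch: the function name run_in_container is hypothetical, and the checks are a convenience rather than anything Apptainer requires.

```shell
# Run a command inside an Apptainer image, with basic sanity checks first.
# Usage: run_in_container IMAGE.sif COMMAND [ARGS...]
run_in_container() {
    image="$1"
    shift
    if ! command -v apptainer >/dev/null 2>&1; then
        echo "apptainer not found; run this on a node where it is installed" >&2
        return 127
    fi
    if [ ! -f "$image" ]; then
        echo "image not found: $image" >&2
        return 1
    fi
    apptainer exec "$image" "$@"
}

# Example call, equivalent to the figlet command above:
# run_in_container hello-world_1.0.sif figlet your_text
```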

Docker

  • Docker uses Dockerfiles to specify image creation
  • Preferred by the Gladstone Bioinformatics Core to create new images
  • In part 2, we will go over how to build custom container images from Dockerfiles
  • To see the Dockerfile used to create the hello-world image, run:
[alice@dev1 ~]$ apptainer exec hello-world_1.0.sif cat /Dockerfile

End of Part 1

Thank You!

Upcoming Data Science Training Program Workshops

Introduction to Linear Mixed Effects Models
April 25-April 26, 2024 1-3pm PDT

Single Cell RNA-Seq Data Analysis
April 29-April 30, 2024 9am-4pm PDT

Single Cell ATAC-Seq Data Analysis Part 1
May 6-May 7, 2024 1-4pm PDT

Complete Schedule