update for 2024, closes #18

2025-11-30 09:45:43 -08:00 · 2024-12-03 17:11:42 -08:00 · 2024-12-03 17:11:42 -08:00 · 98e1c6b026
commit 98e1c6b026
parent 5dc458a476
6 changed files with 543 additions and 473 deletions
--- a/docs/Working_on_Wynton_Part_1.html
+++ b/docs/Working_on_Wynton_Part_1.html
--- a/docs/Working_on_Wynton_Part_2.html
+++ b/docs/Working_on_Wynton_Part_2.html
--- a/working-on-wynton-hpc/Working_on_Wynton_Part_1.Rmd
+++ b/working-on-wynton-hpc/Working_on_Wynton_Part_1.Rmd
@ -2,7 +2,7 @@
 title: "Working on Wynton"
 subtitle: "Part 1"
 author: "Natalie Elphick"
-date: "April 15th, 2024"
+date: "December 5th, 2024"
 knit: (function(input, ...) {
    rmarkdown::render(
      input,
@ -17,7 +17,7 @@ output:
 ---

 ```{r, setup, include=FALSE}
-
+knitr::opts_chunk$set(comment = "")
 ```

 ## 
@ -35,10 +35,8 @@ TAs:

 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Alex Pico**    
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Bioinformatics Core Director*   
-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Ayushi Agrawal**    
-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Bioinformatician III*   
-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Min-Gyoung Shin**    
-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Bioinformatician III*
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Michela Traglia**    
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Senior Statistician*

 ## Target Audience

@ -61,6 +59,13 @@ TAs:

 ![HPC Diagram](slide_materials/HPC_diagram.png)

+
+## HPC File System {.smaller-picture}
+
+![HPC File System](slide_materials/file_system_node_relationship.png)
+
+
+
 ## Wynton {.small-bullets}

 -   An HPC Linux environment available to all UCSF researchers for free\
@ -84,7 +89,6 @@ TAs:
 -   Only capable of basic tasks (file management, submitting and checking on jobs)
 -   Lacks access to pre-installed software tools that the development nodes have
 -   The primary method to log in is to use an SSH client application
-   The Wynton HPC is up to date with information on logging in: [Access Cluster](https://wynton.ucsf.edu/hpc/get-started/access-cluster.html)

 <u>Names</u>:

@ -147,49 +151,6 @@ dt1 and dt2

 # Storage

-## The File System {.small-bullets}
-
-   A file system how information is stored and retrieved on a computer
-    -   Consists of files and directories
-   A local file system is function of the operating system and only accessible from a single computer
-   A shared file system is accessible from multiple computers
-
-## BeeGFS {.small-bullets}
-
-   Wynton uses a *parallel* shared file system called BeeGFS
-    -   The files are stored as "chunks" spread across many different servers
-   BeeGFS has multiple services that work together to manage the file system
-    -   Storage (stores the chunks)
-    -   Metadata (tracks the chunks and information about their file)
-    -   Management (tracks all of the services)
-    -   Client (provides linux access to the file system)
-
-## BeeGFS - Advantages
-
-   High throughput
-   Redundancy can be built in by mirroring services
-   Adding new storage is fast and does not require downtime
-
-## BeeGFS - Caveats
-
-   For any client node, performance is limited by the network bandwidth of that node
-   Network latency becomes extremely important for all metadata requests
-   Certain input/output patterns can be problematic
-
-## BeeGFS - I/O patterns 
-
-   Anything that requires lots of metadata operations can feel slow
-    -   e.g: lots of writes to the same directory and lots of file lookups and directory searches (**conda**)
-   Keep the number of reads and writes to a single directory to a reasonable number
-
-
-## BeeGFS - Takehome Message {.small-bullets}
-
-   Prefer fewer, large files over many small ones
-   Distribute reading and writing over several directories
-   Use local scratch (**/scratch**) when possible
-   Don't include anything in **/wynton** in your default LD_LIBRARY_PATH
-   If using conda, putting the conda application inside a Apptainer (formerly singularity) container will result in better performance

 ## Storage {.small-bullets}

@ -327,16 +288,16 @@ The **/wynton** directory is backed up on a nightly basis, so there is no need t
 ```{r, engine='bash', eval=TRUE, results='markup',comment=NA, highlight=TRUE, echo=FALSE}
 echo '[alice@dev1 ~]$ mkdir -p "/scratch/$USER"
 [alice@dev1 ~]$ cd "/scratch/$USER"
-[alice@dev1 alice]$ wget https://github.com/samtools/samtools/releases/download/1.19.2/samtools-1.19.2.tar.bz2
-[alice@dev1 alice]$ tar -x -f samtools-1.19.2.tar.bz2'
+[alice@dev1 alice]$ wget https://github.com/samtools/samtools/releases/download/1.21/samtools-1.21.tar.bz2
+[alice@dev1 alice]$ tar -x -f samtools-1.21.tar.bz2'
 ```

 2.  Create install location and configure

 ```{r, engine='bash', eval=TRUE, results='markup',comment=NA, highlight=TRUE, echo=FALSE}
-echo '[alice@dev1 ~]$ mkdir -p $HOME/software/samtools-1.14'
-echo '[alice@dev1 ~]$ cd samtools-1.19.2'
-echo '[alice@dev1 ~]$ ./configure --prefix=$HOME/software/samtools-1.14'
+echo '[alice@dev1 ~]$ mkdir -p $HOME/software/samtools-1.21'
+echo '[alice@dev1 ~]$ cd samtools-1.21'
+echo '[alice@dev1 ~]$ ./configure --prefix=$HOME/software/samtools-1.21'
 ```

 3.  Build and install
@ -346,6 +307,29 @@ echo '[alice@dev1 ~]$ make'
 echo '[alice@dev1 ~]$ make install'
 ```

+## Install Samtools from Source {.small-list}
+
+4. Add to PATH
+
+```{r, engine='bash', eval=TRUE, results='markup',comment=NA, highlight=TRUE, echo=FALSE}
+echo '[alice@dev1 ~]$ echo "export PATH=$HOME/software/samtools-1.21/bin:\$PATH" >> $HOME/.bashrc'
+echo '[alice@dev1 ~]$ source $HOME/.bashrc'
+```
+
+5. Test Installation
+```{r, engine='bash', eval=TRUE, results='markup',comment=NA, highlight=TRUE, echo=FALSE}
+echo '[alice@dev1 ~]$ samtools --help'
+```
+
+```{r, engine='bash', echo=FALSE}
+echo 'Program: samtools (Tools for alignments in the SAM format)
+Version: 1.21 (using htslib 1.21)
+
+Usage:   samtools <command> [options]'
+```
+
+
+
 ## Install Nextflow

 -   Scientific workflow system with a community maintained set of [core bioinformatics analysis](https://nf-co.re/) pipelines
@ -375,8 +359,8 @@ echo '[alice@dev1 ~]$ nextflow -v'

 ## Definitions {.small-bullets}

-   **Containers:** An isolated environment for running software that is created from an *image* file, preventing conflicts with the host system.
-   **Images:** An ordered collection of root filesystem changes that contain all necessary dependencies, ensuring software run identically across various computing platforms.
+-   **Containers:** An isolated environment for running software that avoids conflicts with the host system. Containers are stored, shared, and executed as **image files** with a .sif extension.
+-   **Images:** Built from definition files (or Dockerfiles), which are a set of instructions you specify for your environment.

 ## Apptainer {.small-bullets}

@ -444,18 +428,11 @@ echo '[alice@dev1 ~]$ apptainer exec hello-world_1.0.sif cat /Dockerfile'

 ## Thank You!

-   Please take some time to fill out the workshop survey if you are not attending part 2:\
-    <https://www.surveymonkey.com/r/F75J6VZ>
+-   Please take some time to fill out the workshop survey if you are not attending part 2:   
+    <https://www.surveymonkey.com/r/bioinfo-training>

 ## Upcoming Data Science Training Program Workshops

-[Introduction to Linear Mixed Effects Models](https://gladstone.org/events/introduction-linear-mixed-effects-models)\
-April 25-April 26, 2024 1-3pm PDT
-
-[Single Cell RNA-Seq Data Analysis](https://gladstone.org/events/single-cell-rna-seq-data-analysis)\
-April 29-April 30, 2024 9am-4pm PDT
-
-[Single Cell ATAC-Seq Data Analysis Part 1](https://gladstone.org/events/single-cell-atac-seq-data-analysis-part-1-1)\
-May 6-May 7, 2024 1-4pm PDT
+This is our last workshop for 2024; please check the link below for future workshop dates.

 [Complete Schedule](https://gladstone.org/events?series=189)
--- a/working-on-wynton-hpc/Working_on_Wynton_Part_2.Rmd
+++ b/working-on-wynton-hpc/Working_on_Wynton_Part_2.Rmd
@ -2,7 +2,7 @@
 title: "Working on Wynton"
 subtitle: "Part 2"
 author: "Natalie Elphick"
-date: "April 16th, 2024"
+date: "December 6th, 2024"
 knit: (function(input, ...) {
    rmarkdown::render(
      input,
@ -17,7 +17,7 @@ output:
 ---

 ```{r, setup, include=FALSE}
-
+knitr::opts_chunk$set(comment = "")
 ```

 ## 
@ -36,8 +36,7 @@ TAs:
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Alex Pico**    
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Bioinformatics Core Director*   
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Michela Traglia**    
-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Senior Statistician* 
-
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Senior Statistician*

 ## Target Audience
 -   Prior experience with UNIX command-line 
@ -46,101 +45,18 @@ TAs:

 ## Part 2:

-1.    Custom Containers
-2.    Submitting Compute Jobs
-3.    Array Jobs
-4.    GPU Jobs
-5.    Running Pipelines
-6.    Jupyter Notebooks 
-7.    RStudio Server
+1.    Submitting Compute Jobs
+2.    Array Jobs
+3.    GPU Jobs
+4.    Running Pipelines
+5.    Jupyter Notebooks 
+6.    RStudio Server
+7.    Advanced Tips and Tricks
 8.    How to get help




-# Custom Containers
-
-## Motivation {.small-bullets .small-picture}
-
-   Compute heavy jobs (high RAM, multiple cores) should be run on compute nodes
-   Containers allow us to make additional software available to the compute nodes
-    -   Also allows the use of software that might be hard to install on Rocky 8 Linux
-    -   Improves reproducibility
-
-![Compute Jobs](slide_materials/compute_job_workflow.png)
-
-
-
-
-## Dockerfile Basics
-
-   Dockerfiles contain instructions to build an image in **layers**
-   Layers are added using Dockerfile instruction syntax
-   Images are built by navigating to the directory that contains the Dockerfile and running:
-
-```{r, engine='bash', eval=TRUE, results='markup',comment=NA, highlight=TRUE, echo=FALSE}
-echo 'docker build .'
-```
-
-## Dockerfile Instructions {.small-bullets}
-   First instruction is always **FROM** which specifies the base image
-    -   Base images are a starting point with some basics already installed like the OS and build tools, find them on [DockerHub](https://hub.docker.com/)
-   **RUN** : Use before running any shell commands
-   **SHELL** : Set the shell
-   **USER** : Set the user (within the image)
-   **CMD** : Set the default instruction to be run by the image
-   **COPY** : COPY files into the image
-
-
-See the [Dockerfile documentation](https://docs.docker.com/reference/dockerfile/) for a full list of instructions
-
-## Example Dockerfile {.code-alt}
-
-   Click [here](https://www.dropbox.com/scl/fi/mdbefp3h8ahdvxtgjypqo/Dockerfile?rlkey=7d4zd9ge1m3wwszlfy78712ky&dl=1) to download the example Dockerfile
-   Open in your preffered text editor
-
-
-```{r, engine='bash', eval=TRUE, results='markup',comment=NA, highlight=TRUE, echo=FALSE}
-curl -s -L -o Dockerfile 'https://www.dropbox.com/scl/fi/mdbefp3h8ahdvxtgjypqo/Dockerfile?rlkey=7d4zd9ge1m3wwszlfy78712ky&dl=0'
-cat Dockerfile
-rm Dockerfile
-```
-
-## Building Example Image
-
-   Do not run this during the workshop
-    -   It requires a lot of RAM
-   On macOS, make sure you have the Docker Desktop App running
-   We can provide an additional argument to the **build** command, -t, to set the name of the docker image
-      -   We can add version tags after the name using ":" 
-```{r, engine='bash', eval=TRUE, results='markup',comment=NA, highlight=TRUE, echo=FALSE}
-echo "docker build -t docker_hub_user/seurat-harmony:1.0 ."
-```
-
-
-## Pushing Images to DockerHub  {.small-bullets}
-
-   Make sure you are signed in to your DockerHub account locally (Docker Desktop for macOS)
-   The image name must start with your user name
-
-```{r, engine='bash', eval=TRUE, results='markup',comment=NA, highlight=TRUE, echo=FALSE}
-echo "docker push docker_hub_user/seurat-harmony:1.0"
-```
-
-   These can then be "pulled" on to Wynton as apptainer image files (image must be public)
-```{r, engine='bash', eval=TRUE, results='markup',comment=NA, highlight=TRUE, echo=FALSE}
-echo "[alice@dev1 ~]$ apptainer pull docker://docker_hub_user/seurat-harmony:1.0"
-```
-
-## Notes on Building Custom Images {.small-bullets}
-
-   Time consuming process and can use a lot of RAM on your local machine
-   A good base image can save you a lot of time
-   You must run **apt-get update** and **apt-get install** in the same command
-    -   Otherwise you will encounter caching issues
-    -   These are only for Ubuntu, for other OS run the equivalent package list retrieval and install commands together
-   Remember to use **apt-get install -y**
-    -   You will have no control over the process while it's building

 # Compute Jobs

@ -159,9 +75,15 @@ cat submission.sh
 rm submission.sh
 ```

+## Submission Script - Apptainer
+
+-   Download the example job submission script that uses a container
+```{r,engine='bash', eval=FALSE, echo=TRUE}
+curl -s -L -o apptainer_submission_script.sh 'https://www.dropbox.com/scl/fi/zzl9fnfcoxu3pyrx5ffd1/apptainer_submission_script.sh?rlkey=w05e18ahw4hvbvaucac379za9&dl=1'
+```
+
 ## Submission Script - Apptainer {.small-bullets .code-alt}

-   [Download](https://www.dropbox.com/scl/fi/zzl9fnfcoxu3pyrx5ffd1/apptainer_submission_script.sh?rlkey=w05e18ahw4hvbvaucac379za9&dl=1) this example job submission script that uses a container
 -   Paths that the container needs read/write access to need to be mounted with APPTAINER_BINDPATH

 ```{r, engine='bash', eval=TRUE, results='markup',comment=NA, highlight=TRUE, echo=FALSE}
@ -194,18 +116,27 @@ rm submission.sh
 ```


-## Array Jobs {.small-bullets .code-alt}
+## Array Jobs {.small-bullets}

 -   This is a good option if the script you want to run operates on discrete sets of data
    - e.g. sample or chromosome
-   [Download](https://www.dropbox.com/scl/fi/upl71jeny62fxfzkxao1f/array_job_submission_script.sh?rlkey=ggkyjxx8nz400e1t96mif5t34&dl=1) this example array job submission script
+-   Array jobs allow one file to create multiple jobs that are indexed by a task ID
+-   Download the example array job submission folder

 ```{r, engine='bash', eval=TRUE, results='markup',comment=NA, highlight=TRUE, echo=FALSE}
-curl -s -L -o submission.sh 'https://www.dropbox.com/scl/fi/upl71jeny62fxfzkxao1f/array_job_submission_script.sh?rlkey=ggkyjxx8nz400e1t96mif5t34&dl=0'
-cat submission.sh
-rm submission.sh
+echo 'curl -L -o array_job_example.zip "https://www.dropbox.com/scl/fo/j0muxevls22ylwxqe76ws/ANFEeLzPH4D_GmHpldiVCTg?rlkey=h6y0ginsrtlsc02beb65zbysh&dl=1"'
 ```

+## Array Jobs {.small-bullets}
+
+-   Unzip it 
+```{r, engine='bash', eval=TRUE, results='markup',comment=NA, highlight=TRUE, echo=FALSE}
+echo 'unzip array_job_example.zip -d array_job_example'
+```
+
+- Follow along with the demo
+
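As a rough sketch of how a task ID can select one input per task (this is not the workshop's demo script): the example below assumes a hypothetical `samples.txt` with one sample name per line and relies on the scheduler setting `SGE_TASK_ID` for each task. The `-t 1-3` range, the `SAMPLE_LIST` variable, and the `pick_sample` helper are illustrative assumptions.

```shell
#!/bin/bash
#$ -S /bin/bash      # run the job with bash
#$ -cwd              # start in the submission directory
#$ -t 1-3            # assumed range: three tasks with IDs 1, 2 and 3

# Hypothetical input list: one sample name per line.
sample_list="${SAMPLE_LIST:-samples.txt}"

# Print the sample found on line N of the list (N = task ID).
pick_sample() {
  sed -n "${1}p" "$2"
}

# SGE_TASK_ID is set by the scheduler; default to 1 for a local dry run.
task_id="${SGE_TASK_ID:-1}"
if [ -r "$sample_list" ]; then
  sample="$(pick_sample "$task_id" "$sample_list")"
  echo "Task ${task_id} processing sample: ${sample}"
fi
```

Each task runs the same script, so the only thing that differs between tasks is the value of the task ID and therefore the line picked from the list.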
+
 ## GPU Jobs  {.small-bullets}

 -   To run a [GPU job](https://wynton.ucsf.edu/hpc/scheduler/gpu.html), select the GPU queue with **-q gpu.q**
@ -430,33 +361,144 @@ For any bioinformatics specific questions feel free to reach out to the Gladston
 -   Slack channel #questions-about-bioinformatics
    -   Contact us at the email above to be added to the channel

+# Advanced Tips and Tricks
+
+
+## BeeGFS {.small-bullets}
+
+-   Wynton uses a *parallel* shared file system called BeeGFS
+    -   The files are stored as "chunks" spread across many different servers
+-   BeeGFS has multiple services that work together to manage the file system
+    -   Storage (stores the chunks)
+    -   Metadata (tracks the chunks and information about their file)
+    -   Management (tracks all of the services)
+    -   Client (provides Linux access to the file system)
+
+## BeeGFS - Advantages
+
+-   High throughput
+-   Redundancy can be built in by mirroring services
+-   Adding new storage is fast and does not require downtime
+
+## BeeGFS - Caveats
+
+-   For any client node, performance is limited by the network bandwidth of that node
+-   Network latency becomes extremely important for all metadata requests
+-   Certain input/output patterns can be problematic
+
+## BeeGFS - I/O patterns 
+
+-   Anything that requires lots of metadata operations can feel slow
+    -   e.g., lots of writes to the same directory and lots of file lookups and directory searches (**conda**)
+-   Keep the number of reads and writes to a single directory to a reasonable number
+
+## BeeGFS - Takehome Message {.small-bullets}
+
+-   Prefer fewer, large files over many small ones
+-   Distribute reading and writing over several directories
+-   Use local scratch (**/scratch**) when possible
+-   Don't include anything in **/wynton** in your default LD_LIBRARY_PATH
+-   If using conda, putting the conda application inside an Apptainer (formerly Singularity) container will result in better performance
+
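The advice about local scratch can be sketched as a stage-in, compute, stage-out pattern. This is an illustration under assumptions, not a documented Wynton recipe: `stage_and_run` is a hypothetical helper, the `wc -l` step stands in for real work, and the paths in the usage comment are made up.

```shell
#!/bin/bash

# Hypothetical helper: copy one input to a scratch directory, do the work
# there, then copy a single result file back to shared storage.
# Usage: stage_and_run INPUT_FILE OUTPUT_DIR SCRATCH_ROOT
stage_and_run() {
  local input="$1" outdir="$2" scratch_root="$3"
  local workdir name
  workdir="$(mktemp -d "${scratch_root}/job.XXXXXX")" || return 1
  name="$(basename "$input")"

  # Stage in: a single read from shared storage.
  cp "$input" "$workdir/" || return 1

  # Placeholder for the real analysis: count lines, writing only to scratch.
  wc -l < "$workdir/$name" | tr -d ' ' > "$workdir/$name.linecount"

  # Stage out: a single write back, then clean up scratch.
  mkdir -p "$outdir"
  cp "$workdir/$name.linecount" "$outdir/"
  rm -rf "$workdir"
}

# Example call (paths are assumptions):
# mkdir -p "/scratch/$USER"
# stage_and_run "$HOME/data/reads.txt" "$HOME/results" "/scratch/$USER"
```

Keeping intermediate files on scratch means the shared file system only sees one read and one write per job, which fits the "fewer, larger files" guidance above.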
+## Custom Containers
+
+## Motivation {.small-bullets .small-picture}
+
+-   Compute heavy jobs (high RAM, multiple cores) should be run on compute nodes
+-   Containers allow us to make additional software available to the compute nodes
+    -   Also allows the use of software that might be hard to install on Rocky 8 Linux
+    -   Improves reproducibility
+
+![Compute Jobs](slide_materials/compute_job_workflow.png)
+
+
+
+
+## Dockerfile Basics
+
+-   Dockerfiles contain instructions to build an image in **layers**
+-   Layers are added using Dockerfile instruction syntax
+-   Images are built by navigating to the directory that contains the Dockerfile and running:
+
+```{r, engine='bash', eval=TRUE, results='markup',comment=NA, highlight=TRUE, echo=FALSE}
+echo 'docker build .'
+```
+
+## Dockerfile Instructions {.small-bullets}
+-   First instruction is always **FROM** which specifies the base image
+    -   Base images are a starting point with some basics already installed like the OS and build tools, find them on [DockerHub](https://hub.docker.com/)
+-   **RUN** : Run a shell command while building the image
+-   **SHELL** : Set the shell
+-   **USER** : Set the user (within the image)
+-   **CMD** : Set the default command to be run by the image
+-   **COPY** : Copy files into the image
+
+
+See the [Dockerfile documentation](https://docs.docker.com/reference/dockerfile/) for a full list of instructions
+
+## Example Dockerfile {.code-alt}
+
+-   Click [here](https://www.dropbox.com/scl/fi/mdbefp3h8ahdvxtgjypqo/Dockerfile?rlkey=7d4zd9ge1m3wwszlfy78712ky&dl=1) to download the example Dockerfile
+-   Open in your preferred text editor
+
+
+```{r, engine='bash', eval=TRUE, results='markup',comment=NA, highlight=TRUE, echo=FALSE}
+curl -s -L -o Dockerfile 'https://www.dropbox.com/scl/fi/mdbefp3h8ahdvxtgjypqo/Dockerfile?rlkey=7d4zd9ge1m3wwszlfy78712ky&dl=0'
+cat Dockerfile
+rm Dockerfile
+```
+
+## Building Example Image
+
+-   Do not run this during the workshop
+    -   It requires a lot of RAM
+-   On macOS, make sure you have the Docker Desktop App running
+-   We can provide an additional argument to the **build** command, -t, to set the name of the docker image
+      -   We can add version tags after the name using ":" 
+```{r, engine='bash', eval=TRUE, results='markup',comment=NA, highlight=TRUE, echo=FALSE}
+echo "docker build -t docker_hub_user/seurat-harmony:1.0 ."
+```
+
+
+## Pushing Images to DockerHub  {.small-bullets}
+
+-   Make sure you are signed in to your DockerHub account locally (Docker Desktop for macOS)
+-   The image name must start with your user name
+
+```{r, engine='bash', eval=TRUE, results='markup',comment=NA, highlight=TRUE, echo=FALSE}
+echo "docker push docker_hub_user/seurat-harmony:1.0"
+```
+
+-   These can then be "pulled" onto Wynton as Apptainer image files (image must be public)
+```{r, engine='bash', eval=TRUE, results='markup',comment=NA, highlight=TRUE, echo=FALSE}
+echo "[alice@dev1 ~]$ apptainer pull docker://docker_hub_user/seurat-harmony:1.0"
+```
+
+## Notes on Building Custom Images {.small-bullets}
+
+-   Building is time-consuming and can use a lot of RAM on your local machine
+-   A good base image can save you a lot of time
+-   You must run **apt-get update** and **apt-get install** in the same command
+    -   Otherwise you will encounter caching issues
+    -   These commands apply to Ubuntu base images; for other OSes, run the equivalent package list retrieval and install commands together
+-   Remember to use **apt-get install -y**
+    -   You will have no control over the process while it's building
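
To make the **apt-get** notes concrete, here is a minimal sketch that writes a small example Dockerfile from the shell. It is not the workshop's example Dockerfile: the file name `Dockerfile.example`, the `ubuntu:22.04` base image, and the two packages are assumptions chosen only for illustration.

```shell
#!/bin/bash

# Write a minimal example Dockerfile. The base image tag and package names
# are assumptions chosen for illustration.
cat > Dockerfile.example <<'EOF'
FROM ubuntu:22.04

# Update the package list and install in the same RUN instruction so that a
# cached, stale package list is never reused; -y answers prompts automatically.
RUN apt-get update && apt-get install -y \
    curl \
    ca-certificates
EOF

cat Dockerfile.example
```

Because both commands share one **RUN** instruction, they end up in the same layer and are always rebuilt together when the package list changes.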


 # End of Part 2

 ## Thank You!

+
 -   Please take some time to fill out the workshop survey:   
-[https://www.surveymonkey.com/r/F75J6VZ](https://www.surveymonkey.com/r/F75J6VZ)
-
-   Want some additional Wynton training?    
-Check out the UCSF library [Introduction to Wynton HPC Cluster](https://calendars.library.ucsf.edu/event/12197724) Workshop
-
+    <https://www.surveymonkey.com/r/bioinfo-training>

 ## Upcoming Data Science Training Program Workshops

+This is our last workshop for 2024; please check the link below for future workshop dates.

-[Introduction to Linear Mixed Effects Models](https://gladstone.org/events/introduction-linear-mixed-effects-models)   
-April 25-April 26, 2024 1-3pm PDT
-
-[Single Cell RNA-Seq Data Analysis](https://gladstone.org/events/single-cell-rna-seq-data-analysis)     
-April 29-April 30, 2024 9am-4pm PDT
-
-[Single Cell ATAC-Seq Data Analysis Part 1](https://gladstone.org/events/single-cell-atac-seq-data-analysis-part-1-1)    
-May 6-May 7, 2024 1-4pm PDT
-
-
-[Complete Schedule](https://gladstone.org/events?series=189)     
+[Complete Schedule](https://gladstone.org/events?series=189)
+ 



--- a/working-on-wynton-hpc/renv.lock
+++ b/working-on-wynton-hpc/renv.lock
@ -1,6 +1,6 @@
 {
  "R": {
-    "Version": "4.3.2",
+    "Version": "4.4.1",
    "Repositories": [
      {
        "Name": "CRAN",
--- a/working-on-wynton-hpc/slide_materials/file_system_node_relationship.png
+++ b/working-on-wynton-hpc/slide_materials/file_system_node_relationship.png