diff --git a/docs/Intro_to_R_data_analysis_part_1.html b/docs/Intro_to_R_data_analysis_part_1.html index 3ec194f..a3cb4b6 100644 --- a/docs/Intro_to_R_data_analysis_part_1.html +++ b/docs/Intro_to_R_data_analysis_part_1.html @@ -2562,7 +2562,7 @@ class CountdownTimer {

Introduction to R Data Analysis - Part 1

Natalie Elphick

-

January 22nd

+

January 22nd, 2024

@@ -2593,16 +2593,15 @@ Matlab etc.)

Part 1:

-  1. What is R and why should you use it?
-  2. The RStudio interface
-  3. File types
-  4. Error messages
-  5. Variables
-  6. Types & data structures
-  10 min break
-  7. Math and logic operations
-  8. Functions and libraries
-  9. Reading data into R
+  1. What is R and why should you use it?
+  2. The RStudio interface
+  3. File types
+  4. Error messages
+  5. Variables
+  6. Types & data structures
+  7. Math and logic operations
+  8. Functions and libraries
+  9. Reading data into R
@@ -2732,13 +2731,13 @@ and periods one style of names
# Snake case
-dog_breeds <- c("Labrador Retriever","Akita", "Bulldog")
+dog_breeds <- c("Labrador Retriever", "Akita", "Bulldog")
 
 # Period separated
-dog.breeds <- c("Labrador Retriever","Akita", "Bulldog")
+dog.breeds <- c("Labrador Retriever", "Akita", "Bulldog")
 
 # Camel case
-DogBreeds <- c("Labrador Retriever","Akita", "Bulldog")
+DogBreeds <- c("Labrador Retriever", "Akita", "Bulldog")

Poll 2

@@ -2821,7 +2820,7 @@ types/structures (ex. nested lists)

10 min break

-
+
10:00
@@ -2966,131 +2965,6 @@ RNA-Seq Analysis Using R
  • Feb 1, 2024 (9:30am-12:00pm)
  • -
    -
    -
    -

    ChatGPT Tips for R

    - -
    -
    -

    General Tips

    -
      -
    • Always confirm ChatGPT’s outputs are correct
    • -
    • Provide as much detail as possible about the problem in the 1st -prompt
    • -
    • Use separate chats for separate tasks/projects
    • -
    • Try the ‘Custom Instructions’ function that adds additional -information to every prompt
    • -
    • Can visit webpages (GPT 4 only), which can help get more specific -answers
    • -
    -
    -
    -

    Code Tips

    -
      -
    • Commented R code yields better responses in my experience
    • -
    • Provide the code and error message in the same prompt
    • -
    • ChatGPT can work well to convert syntax and improve your code: -
        -
      • “Turn this loop into a function : [your code]”
      • -
      • “Is there a better way to do this : [your code]”
      • -
    • -
    • Check out the file: -example_code/1_convert_syntax_example.R for an example use -case
    • -
    -
    -
    -
    -

    Finding R Packages

    - -
    -
    -

    Key Questions

    -
      -
    • What assay was the package designed for?
    • -
    • When was the last release?
    • -
    • Is it maintained (frequent updates)?
    • -
    • Does it work on all operating systems?
    • -
    • Are other people using it? (citations)
    • -
    • Do they respond to github issues?
    • -
    • Is there a benchmarking paper?
    • -
    -
    -
    -

    BioConductor and CRAN

    -
      -
    • Both of these have stringent requirements for packages they host -(eg. for BioConductor they have to run on all major operating -systems)

    • -
    • Prefer BioConductor packages if available over CRAN

    • -
    • Prefer CRAN packages over ones only hosted on GitHub

    • -
    -
    -
    -

    Start with the Assay

    -
      -
    • Click here -to go to BioC views
    • -
    • Pick the assay you want to analyse
    • -
    • Pick the type of analysis you want to do
    • -
    • Find a package that does it
    • -
    • Find benchmarking papers to narrow the list of packages down
    • -
    • Find the vignette on the package page and refer to the manual for -any questions not covered by it
    • -
    -
    -
    -
    -

    Additional Resources

    - -
    -
    -

    R

    - -
    -
    -

    Statistics

    - -
    -
    -

    RNA-seq Analysis

    - -
    -
    -

    Dimensional Reduction

    -
    diff --git a/docs/Intro_to_R_data_analysis_part_2.html b/docs/Intro_to_R_data_analysis_part_2.html new file mode 100644 index 0000000..45adb18 --- /dev/null +++ b/docs/Intro_to_R_data_analysis_part_2.html @@ -0,0 +1,10031 @@ + + + + + + + Introduction to R Data Analysis - Part 2 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    +
    + +
    +

    Introduction to R Data Analysis - Part 2

    +

    Natalie Elphick

    +

    January 23rd, 2024

    +
    + +
    +

    +
    +Press the ? key for tips on navigating these slides +
    +
    +
    +

    Schedule

    +
      +
    1. Introduction to Tidyverse
    2. Filtering and reformatting data
    3. Plotting data
    4. Hands-on data analysis
    +
    + +
    +
    +

    Introduction to Tidyverse

    + +
    +
    +

    Tidyverse

    +
      +
    • The tidyverse packages work well together because they share common +data representations and design principles +
        +
      • Rows = observations, columns = variables
      • +
    • +
    • ggplot2, for data visualization.
    • +
    • dplyr, for data manipulation.
    • +
    • tidyr, for data tidying.
    • +
    • readr, for data import.
    • +
    • purrr, for iteration.
    • +
    • and more..
    • +
    +
    +
    +

    dplyr

    +
      +
    • Offers a common “grammar” of functions for data manipulation +
        +
      • mutate() +adds new variables that are functions of existing columns
      • +
      • select() +picks columns based on their names
      • +
      • filter() +picks rows based on their values
      • +
      • summarise() +reduces multiple values down to a single summary
      • +
      • arrange() +changes the ordering of the rows
      • +
      • group_by() +allows any operation to be done “by group”
      • +
    • +
    +
    +
    +
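mutate() is the one verb in the list above that the following slides do not demonstrate. A minimal sketch, assuming the dplyr and ggplot2 packages are installed and using the mpg data frame that ships with ggplot2; the column name avg_mpg is made up for illustration:

```r
library(dplyr)
library(ggplot2)  # provides the built-in mpg data frame

# Add a new column computed from two existing columns.
# avg_mpg is a hypothetical name chosen for this example.
mpg_avg <- mutate(.data = mpg,
                  avg_mpg = (cty + hwy) / 2)

# Show only the columns involved, to check the new one
head(select(.data = mpg_avg,
            model, cty, hwy, avg_mpg))
```

The original columns are left untouched; mutate() returns a new data frame with the extra column appended.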

    Example Dataframe

    +
      +
    • mpg is a dataframe built into the ggplot2 package
    • +
    +
    head(mpg)
    +
    +| manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
    +|---|---|---|---|---|---|---|---|---|---|---|
    +| audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
    +| audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
    +| audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
    +| audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
    +| audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
    +| audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
    +
    +
    +
    +

    Select Columns

    +
    select(.data = mpg,
    +       year, cty, hwy, manufacturer)
    +
    +| year | cty | hwy | manufacturer |
    +|---|---|---|---|
    +| 1999 | 18 | 29 | audi |
    +| 1999 | 21 | 29 | audi |
    +| 2008 | 20 | 31 | audi |
    +| 2008 | 21 | 30 | audi |
    +| 1999 | 16 | 26 | audi |
    +| 1999 | 18 | 26 | audi |
    +
    +
    +
    +

    Filter Rows

    +
    filter(.data = mpg,
    +       year == 2008)
    +
    +| manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
    +|---|---|---|---|---|---|---|---|---|---|---|
    +| audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
    +| audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
    +| audi | a4 | 3.1 | 2008 | 6 | auto(av) | f | 18 | 27 | p | compact |
    +| audi | a4 quattro | 2.0 | 2008 | 4 | manual(m6) | 4 | 20 | 28 | p | compact |
    +| audi | a4 quattro | 2.0 | 2008 | 4 | auto(s6) | 4 | 19 | 27 | p | compact |
    +| audi | a4 quattro | 3.1 | 2008 | 6 | auto(s6) | 4 | 17 | 25 | p | compact |
    +
    +
    +
    +
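filter() also accepts more than one condition. A short sketch, again assuming dplyr and ggplot2 are installed; the object names audi_2008 and four_or_six are illustrative only:

```r
library(dplyr)
library(ggplot2)  # provides the built-in mpg data frame

# Comma-separated conditions are combined with AND:
# keep rows that are from 2008 and made by audi
audi_2008 <- filter(.data = mpg,
                    year == 2008,
                    manufacturer == "audi")

# %in% keeps rows whose value matches any element of the vector
four_or_six <- filter(.data = mpg,
                      cyl %in% c(4, 6))

head(audi_2008)
```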

    Arrange Rows

    +
      +
    • desc() arranges rows in descending order; the default is ascending
    • +
    +
    arrange(.data = mpg,
    +        desc(cyl))
    +
    +| manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
    +|---|---|---|---|---|---|---|---|---|---|---|
    +| audi | a6 quattro | 4.2 | 2008 | 8 | auto(s6) | 4 | 16 | 23 | p | midsize |
    +| chevrolet | c1500 suburban 2wd | 5.3 | 2008 | 8 | auto(l4) | r | 14 | 20 | r | suv |
    +| chevrolet | c1500 suburban 2wd | 5.3 | 2008 | 8 | auto(l4) | r | 11 | 15 | e | suv |
    +
    +
    +
    +
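arrange() can take several sort keys; later keys break ties in earlier ones. A small sketch under the same assumptions (dplyr and ggplot2 installed); by_cyl_cty is a name invented for this example:

```r
library(dplyr)
library(ggplot2)  # provides the built-in mpg data frame

# Sort by cylinders (largest first), then by city mileage
# (smallest first) within each cylinder count
by_cyl_cty <- arrange(.data = mpg,
                      desc(cyl), cty)

head(select(.data = by_cyl_cty,
            manufacturer, model, cyl, cty))
```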

    Summarising data

    +
      +
    • The dplyr summarize() function computes a table of summaries for a data frame
    • +
    • group_by() groups the input data frame by the +specified variable(s)
    • +
    • Combining these two allows us to easily create summaries for +different categorical groupings
    • +
    +
    +
    +

    Group and Summarise

    +
    summarise(group_by(.data = mpg,
    +                   manufacturer),
    +          mean_cty = mean(cty),
    +          median_cty = median(cty))
    +
    +| manufacturer | mean_cty | median_cty |
    +|---|---|---|
    +| audi | 17.61111 | 17.5 |
    +| chevrolet | 15.00000 | 15.0 |
    +| dodge | 13.13514 | 13.0 |
    +| ford | 14.00000 | 14.0 |
    +| honda | 24.44444 | 24.0 |
    +| hyundai | 18.64286 | 18.5 |
    +
    +
    +
    +
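Grouping is not limited to one variable. The sketch below groups by two columns and counts rows with n(); it assumes dplyr 1.0 or later (for the .groups argument) and uses made-up output names n_cars and mean_hwy:

```r
library(dplyr)
library(ggplot2)  # provides the built-in mpg data frame

# One summary row per manufacturer/year combination.
# n() counts the rows in each group; .groups = "drop"
# returns an ungrouped data frame.
by_maker_year <- summarise(group_by(.data = mpg,
                                    manufacturer, year),
                           n_cars = n(),
                           mean_hwy = mean(hwy),
                           .groups = "drop")

head(by_maker_year)
```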

    The pipe operator |>

    +
      +
    • Allows “chaining” of function calls to make code more readable
    • +
    +
    mpg |>
    +  group_by(manufacturer) |>
    +  summarise(mean_cty = mean(cty),
    +            median_cty = median(cty))
    +
    +| manufacturer | mean_cty | median_cty |
    +|---|---|---|
    +| audi | 17.61111 | 17.5 |
    +| chevrolet | 15.00000 | 15.0 |
    +| dodge | 13.13514 | 13.0 |
    +| ford | 14.00000 | 14.0 |
    +| honda | 24.44444 | 24.0 |
    +| hyundai | 18.64286 | 18.5 |
    +
    +
    +
    +
    +
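The piped version is the same computation as the nested call on the previous slide, just written left to right. A sketch that compares the two and then chains several verbs; it assumes R 4.1 or later (where the native |> pipe was introduced) with dplyr and ggplot2 installed, and the object names are illustrative:

```r
library(dplyr)
library(ggplot2)  # provides the built-in mpg data frame

# Nested form: read from the inside out
nested <- summarise(group_by(.data = mpg, manufacturer),
                    mean_cty = mean(cty))

# Piped form: read from top to bottom
piped <- mpg |>
  group_by(manufacturer) |>
  summarise(mean_cty = mean(cty))

# Check that both forms give the same result
all.equal(nested, piped)

# A longer chain: filter, group, summarise, then sort
ranked <- mpg |>
  filter(year == 2008) |>
  group_by(manufacturer) |>
  summarise(n_cars = n(),
            mean_cty = mean(cty)) |>
  arrange(desc(mean_cty))

head(ranked)
```

Each step receives the result of the previous step as its first argument, which is why .data no longer needs to be written out.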

    Plotting

    + +
    +
    +

    ggplot2

    +
      +
    • The most popular tidyverse package
    • +
    • Create publication quality, highly customizable plots +
    • +
    • ggplots use “layers” to build, modify and overlap visualizations +
        +
      • Layers are added using the + symbol and can be added to an existing +ggplot
      • +
    • +
    • Many popular packages output ggplots which can then be easily +modified by adding layers
    • +
    +
    +
    +

    Creating ggplots

    +



    [Figure: Plotting (assets/plotting.png)]

    +
    +
    +

    Plot Example

    +
    ggplot(data = mpg,                         # Input dataframe
    +       mapping = aes(x = cty, y = hwy)) +  # Aesthetic mapping
    +  geom_point()                             # Point graph
    +

    +
    +
    +

    Adding and Modifying Layers

    +
    ggplot(data = mpg,
    +       mapping = aes(x = class, y = cty, fill = class)) +
    +  geom_violin() +
    +  geom_boxplot(width = 0.1,
    +               fill = "white")
    +

    +
    +
    +
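Because layers are added with +, a plot can be stored in a variable and extended later. A minimal sketch, assuming ggplot2 is installed; base_plot and labelled_plot are names invented here, and the trend line and axis labels are illustrative choices rather than part of the workshop material:

```r
library(ggplot2)

# Store a basic scatter plot in a variable
base_plot <- ggplot(data = mpg,
                    mapping = aes(x = cty, y = hwy)) +
  geom_point()

# Add more layers to the existing plot object
labelled_plot <- base_plot +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # linear trend line
  labs(x = "City miles per gallon",
       y = "Highway miles per gallon")

# Printing the object draws the plot
labelled_plot
```

The same pattern applies to ggplots returned by other packages: keep the returned object and add layers to it.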

    10 min break

    +
    +
    +
    +10:00 +
    +
    +
    + +
    +
    +

    Hands-on Data Analysis

    + +
    +
    +

    Dataset Description

    +
      +
    • PanTHERIA +
        +
      • A global species-level data set of key life-history, ecological and +geographical traits of all known extant and recently extinct mammals +compiled from the literature
      • +
      • Used in macroecological and macroevolutionary research projects
      • +
      • Data is organized by taxonomic rank
      • +
    • +
    +
    +
    +

    Taxonomic Rank

    +

    [Figure: Taxonomy (assets/Taxonomic_Rank_Graph.svg)]

    +
    +
    +

    Data Preview

    +
    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    +Order + +Family + +Genus + +Species + +Binomial + +ActivityCycle + +AdultBodyMass_g + +AdultForearmLen_mm + +AdultHeadBodyLen_mm + +AgeatEyeOpening_d + +AgeatFirstBirth_d + +BasalMetRate_mLO2hr + +BasalMetRateMass_g + +DietBreadth + +DispersalAge_d + +GestationLen_d + +HabitatBreadth + +HomeRange_km2 + +HomeRange_Indiv_km2 + +InterbirthInterval_d + +LitterSize + +LittersPerYear + +MaxLongevity_m + +NeonateBodyMass_g + +NeonateHeadBodyLen_mm + +PopulationDensity_n/km2 + +PopulationGrpSize + +SexualMaturityAge_d + +SocialGrpSize + +Terrestriality + +TrophicLevel + +WeaningAge_d + +WeaningBodyMass_g + +WeaningHeadBodyLen_mm + +References + +AdultBodyMass_g_EXT + +LittersPerYear_EXT + +NeonateBodyMass_g_EXT + +WeaningBodyMass_g_EXT + +GR_Area_km2 + +GR_MaxLat_dd + +GR_MinLat_dd + +GR_MidRangeLat_dd + +GR_MaxLong_dd + +GR_MinLong_dd + +GR_MidRangeLong_dd + +HuPopDen_Min_n/km2 + +HuPopDen_Mean_n/km2 + +HuPopDen_5p_n/km2 + +HuPopDen_Change + +Precip_Mean_mm + +Temp_Mean_01degC + +AET_Mean_mm + +PET_Mean_mm +
    +Carnivora + +Canidae + +Canis + +latrans + +Canis latrans + +crepuscular + +11989.1 + +NA + +872.39 + +11.94 + +365 + +3699 + +10450 + +1 + +255 + +61.74 + +1 + +18.88 + +19.91 + +365 + +5.72 + +NA + +262 + +200.01 + +NA + +0.25 + +NA + +372.9 + +NA + +fossorial + +carnivore + +43.71 + +NA + +NA + +367;542;543;730;1113;1297;1573;1822;2655 + +NA + +1.1000000000000001 + +NA + +NA + +17099094.300000001 + +71.39 + +8.02 + +39.700000000000003 + +-67.069999999999993 + +-168.12 + +-117.6 + +0 + +27.27 + +0 + +0.06 + +53.03 + +58.18 + +503.02 + +728.37 +
    +Carnivora + +Canidae + +Canis + +lupus + +Canis lupus + +crepuscular + +31756.51 + +NA + +1055 + +14.01 + +547.5 + +11254.2 + +33100 + +1 + +180 + +63.5 + +1 + +159.86000000000001 + +43.13 + +365 + +4.9800000000000004 + +2 + +354 + +412.31 + +NA + +0.01 + +NA + +679.37 + +NA + +fossorial + +carnivore + +44.82 + +NA + +NA + +367;542;543;730;1015;1052;1113;1297;1573;1594;2338;2655 + +NA + +NA + +NA + +NA + +50803439.700000003 + +83.27 + +11.48 + +47.38 + +179.65 + +-171.84 + +3.9 + +0 + +37.869999999999997 + +0 + +0.04 + +34.79 + +4.82 + +313.33 + +561.11 +
    +Carnivora + +Canidae + +Canis + +simensis + +Canis simensis + +diurnal + +14361.86 + +NA + +938.19 + +NA + +NA + +NA + +NA + +1 + +180 + +63.61 + +1 + +4.2 + +5.0199999999999996 + +365 + +NA + +NA + +NA + +NA + +NA + +1.2 + +NA + +754.74 + +NA + +fossorial + +carnivore + +69.599999999999994 + +NA + +NA + +542;730;1113;1573;2655 + +NA + +1.1000000000000001 + +NA + +NA + +11402.81 + +13.31 + +6.55 + +9.93 + +39.96 + +38.020000000000003 + +38.99 + +30 + +99.87 + +30 + +0.15 + +83.87 + +99.03 + +931.35 + +1471.36 +
    +Carnivora + +Canidae + +Atelocynus + +microtis + +Atelocynus microtis + +NA + +8363.2199999999993 + +NA + +831.01 + +NA + +NA + +NA + +NA + +1 + +NA + +NA + +1 + +NA + +NA + +NA + +NA + +NA + +132 + +NA + +NA + +NA + +NA + +NA + +1 + +fossorial + +carnivore + +NA + +NA + +NA + +543;890;1113;2655 + +NA + +NA + +NA + +NA + +7634256.5999999996 + +4.79 + +-32.31 + +-13.76 + +-43.54 + +-78.61 + +-61.08 + +0 + +7.43 + +0 + +0.12 + +163.06 + +235.49 + +1316.27 + +1488 +
    +Cetacea + +Balaenopteridae + +Balaenoptera + +musculus + +Balaenoptera musculus + +NA + +154321304.5 + +NA + +30480 + +NA + +NA + +NA + +NA + +1 + +NA + +326.97000000000003 + +1 + +NA + +NA + +821.25 + +1 + +0.45 + +1320 + +2738612.79 + +7236.55 + +NA + +1 + +1959.8 + +1.25 + +NA + +carnivore + +211.71 + +16999999.969999999 + +NA + +172;511;543;899;1004;1015;1217;1297;2151;2409;2655 + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA +
    +Cetacea + +Balaenopteridae + +Balaenoptera + +physalus + +Balaenoptera physalus + +NA + +47506008.229999997 + +NA + +20641.060000000001 + +NA + +NA + +NA + +NA + +2 + +NA + +338.36 + +1 + +NA + +NA + +730 + +1.01 + +0.37 + +1392 + +1899999.99 + +6273.75 + +NA + +1.5 + +2666.41 + +NA + +NA + +carnivore + +196.58 + +NA + +12000 + +24;27;543;899;1004;1015;1217;1297;1577;2151;2655 + +NA + +NA + +NA + +6395530.4199999999 + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA + +NA +
    +
    +
    +
    +

    Hands-on Analysis

    +
      +
    • Open part_2.Rmd
    • +
    +
    +
    +
    +

    ChatGPT Tips for R

    + +
    +
    +

    General Tips

    +
      +
    • Always confirm ChatGPT’s outputs are correct
    • +
    • Provide as much detail as possible about the problem in the 1st +prompt
    • +
    • Use separate chats for separate tasks/projects
    • +
    • Try the ‘Custom Instructions’ function that adds additional +information to every prompt
    • +
    • Can visit webpages (GPT 4 only), which can help get more specific +answers
    • +
    +
    +
    +

    Code Tips

    +
      +
    • Commented R code yields better responses in my experience
    • +
    • Provide the code and error message in the same prompt
    • +
    • ChatGPT can work well to convert syntax and improve your code: +
        +
      • “Turn this loop into a function : [your code]”
      • +
      • “Is there a better way to do this : [your code]”
      • +
    • +
    • Check out the file: +example_code/1_convert_syntax_example.R for an example use +case
    • +
    +
    +
    +
    +

    Finding R Packages

    + +
    +
    +

    Key Questions

    +
      +
    • What assay was the package designed for?
    • +
    • When was the last release?
    • +
    • Is it maintained (frequent updates)?
    • +
    • Does it work on all operating systems?
    • +
    • Are other people using it? (citations)
    • +
    • Do they respond to github issues?
    • +
    • Is there a benchmarking paper?
    • +
    +
    +
    +

    BioConductor and CRAN

    +
      +
    • Both of these have stringent requirements for packages they host +(eg. for BioConductor they have to run on all major operating +systems)

    • +
    • Prefer BioConductor packages if available over CRAN

    • +
    • Prefer CRAN packages over ones only hosted on GitHub

    • +
    +
    +
    +

    Start with the Assay

    +
      +
    • Click here +to go to BioC views
    • +
    • Pick the assay you want to analyse
    • +
    • Pick the type of analysis you want to do
    • +
    • Find a package that does it
    • +
    • Find benchmarking papers to narrow the list of packages down
    • +
    • Find the vignette on the package page and refer to the manual for +any questions not covered by it
    • +
    +
    +
    +
    +

    Additional Resources

    + +
    +
    +

    R

    + +
    +
    +

    Statistics

    + +
    +
    +

    RNA-seq Analysis

    + +
    +
    +

    Dimensional Reduction

    + +
    +
    +
    +

    End of Part 2

    + +
    +
    +

    Workshop survey

    +
      +
    • Please fill out our workshop survey so we +can continue to improve these workshops
    • +
    +
    +
    +

    Upcoming Workshops

    +
      +
    1. Introduction to Statistics, Experimental Design, and Hypothesis Testing
      • Jan 25, 2024 (Session 1 - 10am–12pm) (Session 2 - 1pm–3pm)
      • Jan 26, 2024 (Session 3 - 10am–12pm)
    2. Intermediate RNA-Seq Analysis Using R
      • Feb 1, 2024 (9:30am-12:00pm)
    +
    +
    +
    + + + + + + + + + + + + + diff --git a/docs/index.html b/docs/index.html index 0d4fb06..b31097d 100644 --- a/docs/index.html +++ b/docs/index.html @@ -8,6 +8,7 @@
  • Introduction to Unix - Part 1
  • Introduction to Unix - Part 2
  • Introduction to R Data Analysis - Part 1
  • +
  • Introduction to R Data Analysis - Part 2
  • diff --git a/intro-r-data-analysis/Intro_to_R_data_analysis_part_1.Rmd b/intro-r-data-analysis/Intro_to_R_data_analysis_part_1.Rmd index f4b4839..b348b7e 100644 --- a/intro-r-data-analysis/Intro_to_R_data_analysis_part_1.Rmd +++ b/intro-r-data-analysis/Intro_to_R_data_analysis_part_1.Rmd @@ -1,7 +1,7 @@ --- title: "Introduction to R Data Analysis - Part 1" author: "Natalie Elphick" -date: "January 22nd" +date: "January 22nd, 2024" knit: (function(input, ...) { rmarkdown::render( input, @@ -47,8 +47,8 @@ Bioinformatician I 5. Variables 6. Types & data structures 7. Math and logic operations -8. Functions and libraries -9. Reading data into R +8. Functions and packages + # What is R? @@ -142,13 +142,13 @@ to one style of names ```{r} # Snake case -dog_breeds <- c("Labrador Retriever","Akita", "Bulldog") +dog_breeds <- c("Labrador Retriever", "Akita", "Bulldog") # Period separated -dog.breeds <- c("Labrador Retriever","Akita", "Bulldog") +dog.breeds <- c("Labrador Retriever", "Akita", "Bulldog") # Camel case -DogBreeds <- c("Labrador Retriever","Akita", "Bulldog") +DogBreeds <- c("Labrador Retriever", "Akita", "Bulldog") ``` ## Poll 2 @@ -329,81 +329,3 @@ packages - Feb 1, 2024 (9:30am-12:00pm) -# ChatGPT Tips for R - -## General Tips - -- Always confirm ChatGPT's outputs are correct -- Provide as much detail as possible about the problem in the 1st prompt -- Use separate chats for separate tasks/projects -- Try the 'Custom Instructions' function that adds additional information to every prompt -- Can visit webpages (GPT 4 only), which can help get more specific answers - -## Code Tips - -- Commented R code yields better responses in my experience -- Provide the code and error message in the same prompt -- ChatGPT can work well to convert syntax and improve your code: - - "Turn this loop into a function : [your code]" - - "Is there a better way to do this : [your code]" -- Check out the file: `example_code/1_convert_syntax_example.R` for an example use case - -# 
Finding R Packages - -## Key Questions - -- What assay was the package designed for? -- When was the last release? -- Is it maintained (frequent updates)? -- Does it work on all operating systems? -- Are other people using it? (citations) -- Do they respond to github issues? -- Is there a benchmarking paper? - -## BioConductor and CRAN - -- Both of these have stringent requirements for packages they host (eg. for BioConductor they have to run on all major operating systems) - -- Prefer BioConductor packages if available over CRAN - -- Prefer CRAN packages over ones only hosted on GitHub - -## Start with the Assay - -- Click [here](https://www.bioconductor.org/packages/release/BiocViews.html#___Sequencing) to go to BioC views -- Pick the assay you want to analyse -- Pick the type of analysis you want to do -- Find a package that does it -- Find benchmarking papers to narrow the list of packages down -- Find the vignette on the package page and refer to the manual for any questions not covered by it - - -# Additional Resources - -## R - -- [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/how-to-read-this-book.html) : Excellent R markdown reference - -- [R for Data Science](https://r4ds.hadley.nz/) - -- [ggplot2: elegant graphics for data analysis](https://ggplot2-book.org/) - -- [Advanced R](https://adv-r.hadley.nz/) - -## Statistics - -- [Data Analysis in R](https://bookdown.org/steve_midway/DAR) : This book has more statistics details than *R for Data Science* -- [Generalized Linear Models](https://bookdown.org/steve_midway/DAR/glms-generalized-linear-models.html)\ -- [Random Effects](https://bookdown.org/steve_midway/DAR/random-effects.html) - -## RNA-seq Analysis - -- [RNA-seqlopedia](https://rnaseq.uoregon.edu/) : Everything you need to know about RNA-seq experiments -- [RNA-seq Expression Units](https://luisvalesilva.com/datasimple/rna-seq_units.html) : Blog post on understanding common units -- [Introduction to Single-Cell Analysis with 
Bioconductor](https://bioconductor.org/books/3.17/OSCA.intro/index.html) : Covers the basics of scRNA-seq analysis in R - -## Dimensional Reduction - -- [Tutorial on PCA](https://uw.pressbooks.pub/appliedmultivariatestatistics/chapter/pca/) : PCA explained with R code examples -- [Understanding UMAP](https://pair-code.github.io/understanding-umap/) : Short explanation with great visualizations, mainly useful for scRNA-seq analysis - diff --git a/intro-r-data-analysis/Intro_to_R_data_analysis_part_2.Rmd b/intro-r-data-analysis/Intro_to_R_data_analysis_part_2.Rmd new file mode 100644 index 0000000..0d3ba15 --- /dev/null +++ b/intro-r-data-analysis/Intro_to_R_data_analysis_part_2.Rmd @@ -0,0 +1,348 @@ +--- +title: "Introduction to R Data Analysis - Part 2" +author: "Natalie Elphick" +date: "January 23rd, 2024" +knit: (function(input, ...) { + rmarkdown::render( + input, + output_dir = "../docs" + ) + }) +output: + revealjs::revealjs_presentation: + theme: simple + css: style.css +--- + +```{r, setup, include=FALSE} +library(kableExtra) +library(tidyverse) +library(readxl) +theme_set(theme_grey(base_size = 16)) +``` + +## + +
    *Press the ? key for tips on navigating these slides*
    + +# Schedule + +1. Introduction to Tidyverse +2. Filtering and reformatting data +3. Plotting data +4. Hands on data analysis + +# Introduction to Tidyverse + +## Tidyverse + +- The tidyverse packages work well together because they share +common data representations and design principles + - Rows = observations, columns = variables +- [ggplot2](), for data visualization. +- [dplyr](), for data manipulation. +- [tidyr](), for data tidying. +- [readr](), for data import. +- [purrr](), for iteration. +- and more.. + +## dplyr +- Offers a common “grammar” of functions for data manipulation + - [mutate()](https://dplyr.tidyverse.org/reference/mutate.html) adds new variables that are functions of existing + columns + - [select()](https://dplyr.tidyverse.org/reference/select.html) picks columns based on their names + - [filter()](https://dplyr.tidyverse.org/reference/filter.html) picks rows based on their values + - [summarise()](https://dplyr.tidyverse.org/reference/summarise.html) reduces multiple values down to a single summary + - [arrange()](https://dplyr.tidyverse.org/reference/arrange.html) changes the ordering of the rows + - [group_by()](https://dplyr.tidyverse.org/reference/group_by.html) allows any operation to be done “by group” + + + +## Example Dataframe +- mpg is a dataframe built into the ggplot2 package +```{r, eval = FALSE} +head(mpg) +``` + +```{r, echo = FALSE} +head(mpg) |> + kable() |> + kable_styling("striped") |> + scroll_box(width = "100%") +``` + +## Select Columns + +```{r, eval = FALSE} +select(.data = mpg, + year, cty, hwy, manufacturer) +``` + +```{r, echo = FALSE} +select(.data = mpg, + year, cty, hwy, manufacturer) |> + head() |> + kable() |> + kable_styling("striped") |> + scroll_box(width = "100%") +``` + + +## Filter Rows + + +```{r, eval = FALSE} +filter(.data = mpg, + year == 2008) +``` + +```{r, echo = FALSE} +filter(.data = mpg, + year == 2008) |> + head() |> + kable() |> + kable_styling("striped") |> + scroll_box(width = 
"100%") +``` +## Arrange Rows + +- desc() is used to arrange rows in descending order, the default is ascending +```{r, eval = FALSE} +arrange(.data = mpg, + desc(cyl)) +``` + +```{r, echo = FALSE} +arrange(.data = mpg, + desc(cyl)) |> + head(n = 3) |> + kable() |> + kable_styling("striped") |> + scroll_box(width = "100%") +``` +## Summarising data +- The dplyr **summarize()**function computes a table of +summaries for a data frame +- **group_by()** groups the input data frame by the specified +variable(s) +- Combining these two allows us to easily create summaries for +different categorical groupings + +## Group and Summarise +```{r, eval = FALSE} +summarise(group_by(.data = mpg, + manufacturer), + mean_cty = mean(cty), + median_cty = median(cty)) +``` + +```{r, echo = FALSE} +summarise(group_by(.data = mpg, + manufacturer), + mean_cty = mean(cty), + median_cty = median(cty)) |> + head() |> + kable() |> + kable_styling("striped") |> + scroll_box(width = "100%") +``` + +## The pipe operator |> +- Allows "chaining" of function calls to make code more readable +```{r, eval = FALSE} +mpg |> + group_by(manufacturer) |> + summarise(mean_cty = mean(cty), + median_cty = median(cty)) +``` + +```{r, echo = FALSE} +mpg |> + group_by(manufacturer) |> + summarise(mean_cty = mean(cty), + median_cty = median(cty)) |> + head(n = 4) |> + kable() |> + kable_styling("striped") |> + scroll_box(width = "100%") +``` + + +# Plotting + +## ggplot2 +- The most popular tidyverse package +- Create publication quality, highly customizable plots + - See the [R graph gallery](https://r-graph-gallery.com/index.html) for examples +- ggplots use “layers” to build, modify and overlap visualizations + - Layers are added using the + symbol and can be added to an existing ggplot +- Many popular packages output ggplots which can then be easily modified by adding layers + + +## Creating ggplots + +
    +
    +![Plotting](assets/plotting.png) + + +## Plot Example + +```{r, fig.dim=c(6,4)} +ggplot(data = mpg, # Input dataframe + mapping = aes(x = cty, y = hwy)) + # Aesthetic mapping + geom_point() # Point graph +``` + +## Adding and Modifying Layers + +```{r, fig.dim=c(10,4)} +ggplot(data = mpg, + mapping = aes(x = class, y = cty, fill = class)) + + geom_violin() + + geom_boxplot(width = 0.1, + fill = "white") +``` + + +# 10 min break + +
    + +```{r, echo=FALSE} + +countdown::countdown(minutes = 10, + seconds = 0, + color_border = "black", + padding = "50px", + margin = "5%", + font_size = "5em", + style = "position: relative; width: min-content;") +``` + +
    + + +# Hands-on Data Analysis + +## Dataset Description +- PanTHERIA + - A global species-level data set of key life-history, ecological and geographical traits of all known extant and recently extinct mammals compiled from the literature + - Macroecological and macroevolutionary research projects + - Data is organized by taxonomic rank + +## Taxonomic Rank + +![Taxonomy](assets/Taxonomic_Rank_Graph.svg) + +## Data Preview + +```{r, echo = FALSE} +read_xlsx("Intro_to_R_workshop_materials/PanTHERIA.xlsx") |> + head() |> + kable() |> + kable_styling("striped") |> + scroll_box(width = "100%") +``` + + + +## Hands-on Analysis +- Open part_2.Rmd + + + +# ChatGPT Tips for R + +## General Tips + +- Always confirm ChatGPT's outputs are correct +- Provide as much detail as possible about the problem in the 1st prompt +- Use separate chats for separate tasks/projects +- Try the 'Custom Instructions' function that adds additional information to every prompt +- Can visit webpages (GPT 4 only), which can help get more specific answers + +## Code Tips + +- Commented R code yields better responses in my experience +- Provide the code and error message in the same prompt +- ChatGPT can work well to convert syntax and improve your code: + - "Turn this loop into a function : [your code]" + - "Is there a better way to do this : [your code]" +- Check out the file: `example_code/1_convert_syntax_example.R` for an example use case + +# Finding R Packages + +## Key Questions + +- What assay was the package designed for? +- When was the last release? +- Is it maintained (frequent updates)? +- Does it work on all operating systems? +- Are other people using it? (citations) +- Do they respond to github issues? +- Is there a benchmarking paper? + +## BioConductor and CRAN + +- Both of these have stringent requirements for packages they host (eg. 
for BioConductor they have to run on all major operating systems) + +- Prefer BioConductor packages if available over CRAN + +- Prefer CRAN packages over ones only hosted on GitHub + +## Start with the Assay + +- Click [here](https://www.bioconductor.org/packages/release/BiocViews.html#___Sequencing) to go to BioC views +- Pick the assay you want to analyse +- Pick the type of analysis you want to do +- Find a package that does it +- Find benchmarking papers to narrow the list of packages down +- Find the vignette on the package page and refer to the manual for any questions not covered by it + + +# Additional Resources + +## R + +- [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/how-to-read-this-book.html) : Excellent R markdown reference + +- [R for Data Science](https://r4ds.hadley.nz/) + +- [ggplot2: elegant graphics for data analysis](https://ggplot2-book.org/) + +- [Advanced R](https://adv-r.hadley.nz/) + +## Statistics + +- [Data Analysis in R](https://bookdown.org/steve_midway/DAR) : This book has more statistics details than *R for Data Science* +- [Generalized Linear Models](https://bookdown.org/steve_midway/DAR/glms-generalized-linear-models.html)\ +- [Random Effects](https://bookdown.org/steve_midway/DAR/random-effects.html) + +## RNA-seq Analysis + +- [RNA-seqlopedia](https://rnaseq.uoregon.edu/) : Everything you need to know about RNA-seq experiments +- [RNA-seq Expression Units](https://luisvalesilva.com/datasimple/rna-seq_units.html) : Blog post on understanding common units +- [Introduction to Single-Cell Analysis with Bioconductor](https://bioconductor.org/books/3.17/OSCA.intro/index.html) : Covers the basics of scRNA-seq analysis in R + +## Dimensional Reduction + +- [Tutorial on PCA](https://uw.pressbooks.pub/appliedmultivariatestatistics/chapter/pca/) : PCA explained with R code examples +- [Understanding UMAP](https://pair-code.github.io/understanding-umap/) : Short explanation with great visualizations, mainly useful 
for scRNA-seq analysis + + + +# End of Part 2 + +## Workshop survey +- Please fill out our [workshop survey](https://www.surveymonkey.com/r/F75J6VZ) so we can continue to improve these workshops + +## Upcoming Workshops + +1. [Introduction to Statistics, Experimental Design, and Hypothesis Testing](https://gladstone.org/index.php/events/introduction-statistics-experimental-design-and-hypothesis-testing-0) + - Jan 25, 2024 (Session 1 - 10am–12pm) (Session 2 - 1pm–3pm) + - Jan 26, 2024 (Session 3 - 10am–12pm) + +2. [Intermediate RNA-Seq Analysis Using R](https://gladstone.org/index.php/events/intermediate-rna-seq-analysis-using-r-4) + - Feb 1, 2024 (9:30am-12:00pm) + diff --git a/intro-r-data-analysis/Intro_to_R_workshop_materials/PanTHERIA.xlsx b/intro-r-data-analysis/Intro_to_R_workshop_materials/PanTHERIA.xlsx new file mode 100644 index 0000000..4343810 Binary files /dev/null and b/intro-r-data-analysis/Intro_to_R_workshop_materials/PanTHERIA.xlsx differ diff --git a/intro-r-data-analysis/Intro_to_R_workshop_materials/part_2.html b/intro-r-data-analysis/Intro_to_R_workshop_materials/part_2.html new file mode 100644 index 0000000..d25ea3e --- /dev/null +++ b/intro-r-data-analysis/Intro_to_R_workshop_materials/part_2.html @@ -0,0 +1,1341 @@ + + + + + + + + + + + + + + +Intro to R Data Analysis: Part 2 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + + + + + +
    +

    R Markdown

    +

    This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown, see http://rmarkdown.rstudio.com. For a guide to Markdown syntax, see https://www.markdownguide.org/basic-syntax/.

    +

    When you click the Knit button, a document will be generated that includes both the content and the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

    +
    # Simulates 100 observations from a normal distribution
    +# and plots a histogram
    +val <- rnorm(n = 100)
    +hist(val) 
    +

    +
    # The ggplot version of the same plot
    +  ggplot(data = tibble(values = val),
    +         mapping = aes(x = values))+
    +  geom_histogram(bins = 20)
    +

    +

    Important: before running the code below, click Session -> Set Working Directory -> To Source File Location

    +
    +
    +

    Exercise 3: Reading in Data

    +

    The data we will be analyzing is from the PanTHERIA database, which is “a global species-level data set of key life-history, ecological and geographical traits of all known extant and recently extinct mammals (PanTHERIA) developed for a number of macroecological and macroevolutionary research projects.”

    +
    # The data is spread across 3 sheets in an excel file. We need to 
    +# combine these data into one table/data frame.
    +
    +# na = "NA" tells read_xlsx how missing values appear in the data
    +# the default is empty cells. Run "?read_xlsx" for more info
    +sheet1 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 1, na = "NA")
    +sheet2 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 2, na = "NA")
    +sheet3 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 3, na = "NA")
    +
    +# rbind (row-bind) combines data frames by row
    +pantheria <- rbind(sheet1, sheet2, sheet3)
    +
    # How many rows and columns are there?
    +nrow(pantheria)
    +
    ## [1] 2161
    +
    ncol(pantheria)
    +
    ## [1] 54
    +
    # What does the data look like?
    +head(pantheria)
    +
    ## # A tibble: 6 × 54
    +##   Order     Family          Genus Species Binomial ActivityCycle AdultBodyMass_g
    +##   <chr>     <chr>           <chr> <chr>   <chr>    <chr>                   <dbl>
    +## 1 Carnivora Canidae         Canis latrans Canis l… crepuscular            11989.
    +## 2 Carnivora Canidae         Canis lupus   Canis l… crepuscular            31757.
    +## 3 Carnivora Canidae         Canis simens… Canis s… diurnal                14362.
    +## 4 Carnivora Canidae         Atel… microt… Atelocy… <NA>                    8363.
    +## 5 Cetacea   Balaenopteridae Bala… muscul… Balaeno… <NA>               154321304.
    +## 6 Cetacea   Balaenopteridae Bala… physal… Balaeno… <NA>                47506008.
    +## # ℹ 47 more variables: AdultForearmLen_mm <dbl>, AdultHeadBodyLen_mm <dbl>,
    +## #   AgeatEyeOpening_d <dbl>, AgeatFirstBirth_d <dbl>,
    +## #   BasalMetRate_mLO2hr <dbl>, BasalMetRateMass_g <dbl>, DietBreadth <dbl>,
    +## #   DispersalAge_d <dbl>, GestationLen_d <dbl>, HabitatBreadth <dbl>,
    +## #   HomeRange_km2 <dbl>, HomeRange_Indiv_km2 <dbl>, InterbirthInterval_d <dbl>,
    +## #   LitterSize <dbl>, LittersPerYear <dbl>, MaxLongevity_m <dbl>,
    +## #   NeonateBodyMass_g <dbl>, NeonateHeadBodyLen_mm <dbl>, …
    +
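As a rough sketch of what `rbind()` does in this exercise, the toy data frames below stand in for the three Excel sheets. The names `sheet_a`, `sheet_b`, `sheet_c` and all values are made up for illustration and are not taken from PanTHERIA. `rbind()` expects every data frame to have the same column names, which is why the three sheets can be stacked directly.

```r
# Toy stand-ins for the three Excel sheets (hypothetical values)
sheet_a <- data.frame(Order = c("Carnivora", "Cetacea"), Mass_g = c(12000, 47500000))
sheet_b <- data.frame(Order = "Rodentia", Mass_g = 20)
sheet_c <- data.frame(Order = "Chiroptera", Mass_g = 8)

# Put the sheets in a list so any number of them can be combined in one call
sheets <- list(sheet_a, sheet_b, sheet_c)
combined <- do.call(rbind, sheets)

# The row count is the sum of the rows in each sheet; the columns are unchanged
nrow(combined)
ncol(combined)
```

Collecting the pieces in a list first is a design choice that scales to files with many sheets, where writing one `read_xlsx()` call per sheet becomes tedious.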
    +
    +

    Exercise 4: Filtering and Reformatting Data

    +

    We will be exploring adult body mass from these mammals as it relates to their trophic level using dplyr and ggplot2. Download the cheatsheets for these packages at the following links:

    + +

    Let’s start by subsetting the data with select()

    +
    # Pipes (%>%) pass the data in front of the pipe to the first argument
    +# of the function after it. This avoids deeply nested function calls and makes
    +# code easier to read.
    +
    +pantheria <- pantheria %>%     # Passes pantheria as the first argument of select
    +  select(Order,   
    +         Family,               # select returns the specified columns
    +         Genus,                
    +         Species,
    +         TrophicLevel,
    +         AdultBodyMass_g) %>%
    +  drop_na() %>%                # Remove any rows that have NAs
    +  distinct()                   # Remove any duplicate rows
    +
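To make the pipe comment above concrete, here is a minimal sketch that writes the same three steps twice: once as nested calls and once with pipes. The `toy` data frame and its values are hypothetical; it assumes `dplyr` and `tidyr` are installed, as in the workshop setup.

```r
library(dplyr)
library(tidyr)

# Toy data frame (hypothetical values) with one NA and one duplicated row
toy <- tibble(
  Order  = c("Carnivora", "Carnivora", "Rodentia", "Rodentia"),
  Mass_g = c(100, 100, NA, 20),
  Notes  = c("a", "b", "c", "d")
)

# Nested version: read from the inside out
nested <- distinct(drop_na(select(toy, Order, Mass_g)))

# Piped version: read from top to bottom
piped <- toy %>%
  select(Order, Mass_g) %>%
  drop_na() %>%
  distinct()

# Both versions give the same result
identical(nested, piped)
```

The piped version lists the steps in the order they run, which is usually easier to follow once a chain grows beyond two or three functions.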

    Data is almost never clean. For example, there should be only 3 trophic levels:

    +
    unique(pantheria$TrophicLevel) # unique elements of a vector
    +
    ## [1] "carnivore" "herbivore" "omnivore"  "Omnivore"
    +

    Let’s fix the TrophicLevel column using mutate()

    +
    # mutate allows us to add columns or modify existing ones
    +pantheria <- pantheria %>%
    +  mutate(TrophicLevel = tolower(TrophicLevel)) # Make column lowercase
    +
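A quick way to confirm that a fix like this worked is to check the unique values again afterwards. The sketch below uses a small made-up vector that mirrors the capitalization problem shown above, not the real column.

```r
# Toy vector with inconsistent capitalization (mirrors the issue above)
trophic <- c("carnivore", "herbivore", "omnivore", "Omnivore")

unique(trophic)               # four distinct labels before cleaning

cleaned <- tolower(trophic)   # convert every label to lowercase

unique(cleaned)               # three distinct labels after cleaning
table(cleaned)                # counts per label
```

On the workshop data, rerunning `unique(pantheria$TrophicLevel)` after the `mutate()` step would be the equivalent check.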
    +
    +

    Exercise 5: Summarizing data

    +

    Now we can summarize the adult body mass by trophic level by computing standard metrics like mean and standard deviation.

    +
    pantheria %>%
    +  group_by(TrophicLevel) %>%                            # Group observations by this column
    +  summarize(Mean = mean(AdultBodyMass_g),               # Summarize will calculate these group wise
    +            `Standard Deviation` = sd(AdultBodyMass_g), # Backticks let us use spaces in column names
    +            Min = min(AdultBodyMass_g),
    +            Max = max(AdultBodyMass_g)) %>%
    +  ungroup() %>%
    +  arrange(desc(Mean))                                   # Order the data frame by descending mean body mass
    +
    ## # A tibble: 3 × 5
    +##   TrophicLevel    Mean `Standard Deviation`   Min        Max
    +##   <chr>          <dbl>                <dbl> <dbl>      <dbl>
    +## 1 carnivore    824948.             7963841.  1.96 154321304.
    +## 2 omnivore      77419.             1173853.  3.29  27324024.
    +## 3 herbivore     59854.              305498.  5.55   4750000.
    +
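In the table above, the carnivore mean sits far above the other groups while the carnivore maximum is also the largest value in the table, so a few very large species may be pulling the mean up. The sketch below uses made-up numbers to show how one extreme value moves the mean but not the median. Adding a `Median = median(AdultBodyMass_g)` column to the `summarize()` call is one possible way to check this on the real data.

```r
library(dplyr)

# Toy data (hypothetical values): one carnivore is far larger than the rest
toy <- tibble(
  TrophicLevel = c("carnivore", "carnivore", "carnivore", "herbivore", "herbivore"),
  Mass_g       = c(10, 20, 1e6, 500, 700)
)

toy_summary <- toy %>%
  group_by(TrophicLevel) %>%
  summarize(Mean   = mean(Mass_g),     # sensitive to the extreme value
            Median = median(Mass_g)) %>% # robust to the extreme value
  arrange(desc(Mean))

toy_summary
```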
    +
    +

    Exercise 6: Plotting Data

    +

    According to the table above, body masses have a really wide range across trophic levels. Let’s visualize the distribution of adult body masses.

    +
    # ggplot2 constructs graphics in layers, each layer is separated by "+"
    +# x and y values are supplied in aes(), the type of plot is specified using
    +# the "geom" functions
    +
    +ggplot(data = pantheria,                             # input data
    +       mapping =  aes(x = log10(AdultBodyMass_g))) + # log10 transform adult body mass
    +  geom_histogram(fill = "#CE3274",                   # type of plot
    +                 bins = 40) + 
    +  xlab(label = "log10 Adult Body Mass (g)") +        # x label
    +  ylab(label = "Frequency") +                        # y label
    +  labs(title = "Histogram of log10 Adult Body Mass") # title 
    +
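The layering idea described in the comments above can also be seen by building a plot one step at a time and storing it in a variable. This is only a sketch on simulated data (the `toy` data frame is hypothetical), assuming `ggplot2` is installed.

```r
library(ggplot2)

# Simulated toy data, not PanTHERIA
set.seed(1)
toy <- data.frame(Mass_g = 10^rnorm(200, mean = 2, sd = 1))

# Start with the data and the aesthetic mapping only; no geom has been added yet
p <- ggplot(data = toy, mapping = aes(x = log10(Mass_g)))

# Add layers and labels one at a time with "+"
p <- p + geom_histogram(bins = 40)
p <- p + xlab(label = "log10 Mass (g)") + ylab(label = "Frequency")

# Printing the object draws the plot
p
```

Storing the plot in a variable makes it easy to reuse the same base plot with different geoms or themes.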

    +

    The data looks skewed even after log10 transformation. Let’s view the distribution by trophic level.

    +
    pantheria %>%
    +  ggplot(aes(x = log10(AdultBodyMass_g), fill = TrophicLevel)) + # Color by trophic level
    +  geom_histogram(bins = 40) +
    +  facet_grid(rows = vars(TrophicLevel)) +                        # Split the plot into rows by trophic level
    +  ylab(label = "Frequency") + 
    +  xlab(label = "log10 Adult Body Mass (g)") + 
    +  labs(title = "Histograms of log10 Adult Body Mass") +
    +  theme(plot.title = element_text(hjust = 0.5))                  # Center the plot title
    +

    +

    The histograms suggest that trophic level is associated with the distribution of adult body mass: carnivores tend to be smaller. One possible explanation is that carnivores have higher metabolic demands, so there might be a selection pressure towards smaller carnivores. If we wanted to test this by fitting a model, we could use the lm() function to fit a linear model.
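A minimal sketch of what such a model could look like is shown below. It uses simulated data rather than PanTHERIA, so the column names are borrowed from the workshop data but the values and group means are assumptions made purely for illustration.

```r
# Simulated toy data (hypothetical values, not PanTHERIA)
set.seed(42)
toy <- data.frame(
  TrophicLevel    = rep(c("carnivore", "herbivore", "omnivore"), each = 30),
  AdultBodyMass_g = 10^c(rnorm(30, mean = 2),
                         rnorm(30, mean = 3),
                         rnorm(30, mean = 2.5))
)

# Model log10 body mass as a function of trophic level.
# R treats the character column as a categorical predictor and uses the
# first level alphabetically ("carnivore") as the reference group.
fit <- lm(log10(AdultBodyMass_g) ~ TrophicLevel, data = toy)

# Each coefficient is a difference in mean log10 mass relative to the reference group
summary(fit)
```

On the real data, the same formula with `data = pantheria` would be a starting point, although the overrepresentation of some Orders discussed in the next exercise would still need to be considered.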

    +
    +
    +

    Exercise 7.1: Hands on coding

    +

    An important caveat of the data is that some Orders of mammals are more biodiverse than others and are therefore overrepresented in the dataset. Using the dplyr cheatsheet, write code that generates a table of Orders and the percentage of the data each one makes up. Scroll down to see the hint if you are having trouble.

    +
    # Your code 
    +

    Exercise 7.1 hint: group_by + summarize + arrange

    +
    +
    +

    Exercise 7.2: Hands on coding

    +

    Now that we see which Orders are overrepresented, we can plot their body masses by trophic level to see if they are skewing the overall distributions.

    +
    top_orders <- c("Rodentia", "Chiroptera") # Character vector of the top 2 Orders from above
    +# filter uses a conditional to select rows from the data
    +pantheria %>%
    +  filter(Order %in% top_orders) %>%
    +  ggplot(aes(x = log10(AdultBodyMass_g), fill = Order)) +
    +  geom_histogram(bins = 40) +
    +  facet_grid(rows = vars(TrophicLevel),
    +             cols = vars(Order)) +
    +  ylab(label = "Frequency") + 
    +  xlab(label = "log10 Adult Body Mass (g)") + 
    +  theme(plot.title = element_text(hjust = 0.5))   
    +

    +

    It looks like one of them is mostly made up of small carnivores. Let’s remove it and redo the plot of body mass distribution by trophic level.

    +
    pantheria %>%
    +  filter(Order != "Chiroptera") %>%
    +  ggplot(aes(x = log10(AdultBodyMass_g),fill = TrophicLevel)) + # Color by trophic level
    +  geom_histogram(bins = 40) +
    +  facet_grid(rows = vars(TrophicLevel)) +                       # Split the plot into rows by trophic level
    +  ylab(label = "Frequency") + 
    +  xlab(label = "log10 Adult Body Mass (g)") + 
    +  labs(title = "Histograms of log10 Adult Body Mass")
    +

    +

    We can now see that the body mass of carnivorous mammals is much less skewed than the initial plots showed. There is still an effect of trophic level on body mass, but the effect size is likely much smaller than we would estimate by including all 393 Chiroptera. Now that we have generated these plots, we can generate a full report that contains all of the text and code: click Knit to render the HTML report.
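The two `filter()` calls in this exercise rely on logical conditions: `%in%` keeps rows whose Order appears in a vector, and `!=` keeps rows whose Order differs from a value. The sketch below shows what those conditions evaluate to on a small made-up vector (the `orders` values are hypothetical), and how summing a condition gives a count like the number of Chiroptera quoted above.

```r
# Toy vector of Orders (hypothetical, for illustration only)
orders     <- c("Rodentia", "Chiroptera", "Carnivora", "Chiroptera", "Rodentia")
top_orders <- c("Rodentia", "Chiroptera")

orders %in% top_orders     # TRUE where the Order is one of the top Orders
orders != "Chiroptera"     # TRUE where the Order is not Chiroptera

# Summing a logical vector counts the TRUE values
sum(orders == "Chiroptera")
```

`filter()` keeps exactly the rows where its condition is `TRUE`, so inspecting the condition on its own is a handy way to debug an unexpected result.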

    +
    +
    +

    End of workshop exercises

    +

    Hopefully this workshop has provided a good foundation for you to learn R. If you would like some additional practice, check out the resources on the workshop wiki. R also contains many built-in datasets you can use for practice:

| Package | Item | Title |
|---|---|---|
| ggplot2 | diamonds | Prices of over 50,000 round cut diamonds |
| ggplot2 | economics | US economic time series |
| ggplot2 | economics_long | US economic time series |
| ggplot2 | faithfuld | 2d density estimate of Old Faithful data |
| ggplot2 | luv_colours | ‘colors()’ in Luv space |
| ggplot2 | midwest | Midwest demographics |
| ggplot2 | mpg | Fuel economy data from 1999 to 2008 for 38 popular models of cars |
| ggplot2 | msleep | An updated and expanded version of the mammals sleep dataset |
| ggplot2 | presidential | Terms of 12 presidents from Eisenhower to Trump |
| ggplot2 | seals | Vector field of seal movements |
| ggplot2 | txhousing | Housing sales in TX |
| tidyr | billboard | Song rankings for Billboard top 100 in the year 2000 |
| tidyr | cms_patient_care | Data from the Centers for Medicare & Medicaid Services |
| tidyr | cms_patient_experience | Data from the Centers for Medicare & Medicaid Services |
| tidyr | construction | Completed construction in the US in 2018 |
| tidyr | fish_encounters | Fish encounters |
| tidyr | household | Household data |
| tidyr | population | World Health Organization TB data |
| tidyr | relig_income | Pew religion and income survey |
| tidyr | smiths | Some data about the Smith family |
| tidyr | table1 | Example tabular representations |
| tidyr | table2 | Example tabular representations |
| tidyr | table3 | Example tabular representations |
| tidyr | table4a | Example tabular representations |
| tidyr | table4b | Example tabular representations |
| tidyr | table5 | Example tabular representations |
| tidyr | us_rent_income | US rent and income data |
| tidyr | who | World Health Organization TB data |
| tidyr | who2 | World Health Organization TB data |
| tidyr | world_bank_pop | Population data from the world bank |
| dplyr | band_instruments | Band membership |
| dplyr | band_instruments2 | Band membership |
| dplyr | band_members | Band membership |
| dplyr | starwars | Starwars characters |
| dplyr | storms | Storm tracks data |
| datasets | AirPassengers | Monthly Airline Passenger Numbers 1949-1960 |
| datasets | BJsales | Sales Data with Leading Indicator |
| datasets | BJsales.lead (BJsales) | Sales Data with Leading Indicator |
| datasets | BOD | Biochemical Oxygen Demand |
| datasets | CO2 | Carbon Dioxide Uptake in Grass Plants |
| datasets | ChickWeight | Weight versus age of chicks on different diets |
| datasets | DNase | Elisa assay of DNase |
| datasets | EuStockMarkets | Daily Closing Prices of Major European Stock Indices, 1991-1998 |
| datasets | Formaldehyde | Determination of Formaldehyde |
| datasets | HairEyeColor | Hair and Eye Color of Statistics Students |
| datasets | Harman23.cor | Harman Example 2.3 |
| datasets | Harman74.cor | Harman Example 7.4 |
| datasets | Indometh | Pharmacokinetics of Indomethacin |
| datasets | InsectSprays | Effectiveness of Insect Sprays |
| datasets | JohnsonJohnson | Quarterly Earnings per Johnson & Johnson Share |
| datasets | LakeHuron | Level of Lake Huron 1875-1972 |
| datasets | LifeCycleSavings | Intercountry Life-Cycle Savings Data |
| datasets | Loblolly | Growth of Loblolly pine trees |
| datasets | Nile | Flow of the River Nile |
| datasets | Orange | Growth of Orange Trees |
| datasets | OrchardSprays | Potency of Orchard Sprays |
| datasets | PlantGrowth | Results from an Experiment on Plant Growth |
| datasets | Puromycin | Reaction Velocity of an Enzymatic Reaction |
| datasets | Seatbelts | Road Casualties in Great Britain 1969-84 |
| datasets | Theoph | Pharmacokinetics of Theophylline |
| datasets | Titanic | Survival of passengers on the Titanic |
| datasets | ToothGrowth | The Effect of Vitamin C on Tooth Growth in Guinea Pigs |
| datasets | UCBAdmissions | Student Admissions at UC Berkeley |
| datasets | UKDriverDeaths | Road Casualties in Great Britain 1969-84 |
| datasets | UKgas | UK Quarterly Gas Consumption |
| datasets | USAccDeaths | Accidental Deaths in the US 1973-1978 |
| datasets | USArrests | Violent Crime Rates by US State |
| datasets | USJudgeRatings | Lawyers’ Ratings of State Judges in the US Superior Court |
| datasets | USPersonalExpenditure | Personal Expenditure Data |
| datasets | UScitiesD | Distances Between European Cities and Between US Cities |
| datasets | VADeaths | Death Rates in Virginia (1940) |
| datasets | WWWusage | Internet Usage per Minute |
| datasets | WorldPhones | The World’s Telephones |
| datasets | ability.cov | Ability and Intelligence Tests |
| datasets | airmiles | Passenger Miles on Commercial US Airlines, 1937-1960 |
| datasets | airquality | New York Air Quality Measurements |
| datasets | anscombe | Anscombe’s Quartet of ‘Identical’ Simple Linear Regressions |
| datasets | attenu | The Joyner-Boore Attenuation Data |
| datasets | attitude | The Chatterjee-Price Attitude Data |
| datasets | austres | Quarterly Time Series of the Number of Australian Residents |
| datasets | beaver1 (beavers) | Body Temperature Series of Two Beavers |
| datasets | beaver2 (beavers) | Body Temperature Series of Two Beavers |
| datasets | cars | Speed and Stopping Distances of Cars |
| datasets | chickwts | Chicken Weights by Feed Type |
| datasets | co2 | Mauna Loa Atmospheric CO2 Concentration |
| datasets | crimtab | Student’s 3000 Criminals Data |
| datasets | discoveries | Yearly Numbers of Important Discoveries |
| datasets | esoph | Smoking, Alcohol and (O)esophageal Cancer |
| datasets | euro | Conversion Rates of Euro Currencies |
| datasets | euro.cross (euro) | Conversion Rates of Euro Currencies |
| datasets | eurodist | Distances Between European Cities and Between US Cities |
| datasets | faithful | Old Faithful Geyser Data |
| datasets | fdeaths (UKLungDeaths) | Monthly Deaths from Lung Diseases in the UK |
| datasets | freeny | Freeny’s Revenue Data |
| datasets | freeny.x (freeny) | Freeny’s Revenue Data |
| datasets | freeny.y (freeny) | Freeny’s Revenue Data |
| datasets | infert | Infertility after Spontaneous and Induced Abortion |
| datasets | iris | Edgar Anderson’s Iris Data |
| datasets | iris3 | Edgar Anderson’s Iris Data |
| datasets | islands | Areas of the World’s Major Landmasses |
| datasets | ldeaths (UKLungDeaths) | Monthly Deaths from Lung Diseases in the UK |
| datasets | lh | Luteinizing Hormone in Blood Samples |
| datasets | longley | Longley’s Economic Regression Data |
| datasets | lynx | Annual Canadian Lynx trappings 1821-1934 |
| datasets | mdeaths (UKLungDeaths) | Monthly Deaths from Lung Diseases in the UK |
| datasets | morley | Michelson Speed of Light Data |
| datasets | mtcars | Motor Trend Car Road Tests |
| datasets | nhtemp | Average Yearly Temperatures in New Haven |
| datasets | nottem | Average Monthly Temperatures at Nottingham, 1920-1939 |
| datasets | npk | Classical N, P, K Factorial Experiment |
| datasets | occupationalStatus | Occupational Status of Fathers and their Sons |
| datasets | precip | Annual Precipitation in US Cities |
| datasets | presidents | Quarterly Approval Ratings of US Presidents |
| datasets | pressure | Vapor Pressure of Mercury as a Function of Temperature |
| datasets | quakes | Locations of Earthquakes off Fiji |
| datasets | randu | Random Numbers from Congruential Generator RANDU |
| datasets | rivers | Lengths of Major North American Rivers |
| datasets | rock | Measurements on Petroleum Rock Samples |
| datasets | sleep | Student’s Sleep Data |
| datasets | stack.loss (stackloss) | Brownlee’s Stack Loss Plant Data |
| datasets | stack.x (stackloss) | Brownlee’s Stack Loss Plant Data |
| datasets | stackloss | Brownlee’s Stack Loss Plant Data |
| datasets | state.abb (state) | US State Facts and Figures |
| datasets | state.area (state) | US State Facts and Figures |
| datasets | state.center (state) | US State Facts and Figures |
| datasets | state.division (state) | US State Facts and Figures |
| datasets | state.name (state) | US State Facts and Figures |
| datasets | state.region (state) | US State Facts and Figures |
| datasets | state.x77 (state) | US State Facts and Figures |
| datasets | sunspot.month | Monthly Sunspot Data, from 1749 to “Present” |
| datasets | sunspot.year | Yearly Sunspot Data, 1700-1988 |
| datasets | sunspots | Monthly Sunspot Numbers, 1749-1983 |
| datasets | swiss | Swiss Fertility and Socioeconomic Indicators (1888) Data |
| datasets | treering | Yearly Treering Data, -6000-1979 |
| datasets | trees | Diameter, Height and Volume for Black Cherry Trees |
| datasets | uspop | Populations Recorded by the US Census |
| datasets | volcano | Topographic Information on Auckland’s Maunga Whau Volcano |
| datasets | warpbreaks | The Number of Breaks in Yarn during Weaving |
| datasets | women | Average Heights and Weights for American Women |
    +
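As a starting point for practice, here is a short sketch that loads one of the built-in datasets and takes a first look at it. `ToothGrowth` is used only as an example; any item from the `datasets` package works the same way.

```r
# ToothGrowth ships with R in the "datasets" package, so no download is needed
data("ToothGrowth")

str(ToothGrowth)    # structure: column names and types
head(ToothGrowth)   # first six rows

# Mean tooth length for each supplement type
aggregate(len ~ supp, data = ToothGrowth, FUN = mean)
```

Running `?ToothGrowth` opens the help page that describes each column. Datasets that come with a package such as `ggplot2` or `dplyr` become available once that package is loaded with `library()`.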
    + + + + +
    + + + + + + + + + + + + + + + diff --git a/intro-r-data-analysis/Intro_to_R_workshop_materials/part_2_filled_out.Rmd b/intro-r-data-analysis/Intro_to_R_workshop_materials/part_2_filled_out.Rmd new file mode 100644 index 0000000..c97fae2 --- /dev/null +++ b/intro-r-data-analysis/Intro_to_R_workshop_materials/part_2_filled_out.Rmd @@ -0,0 +1,259 @@ +--- +title: "Intro to R Data Analysis: Part 2" +output: html_document # knitr report document type +date: "`r Sys.Date()`" # This will update the date everytime you knit the doc +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) + +# Load packages +library(dplyr) # tidyverse data frame manipulation package +library(tidyr) # functions to help clean data +library(magrittr) # this package provides the pipe operator %>% +library(readxl) # read excel files +library(ggplot2) # highly customizable plots +``` + +## R Markdown + +This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see . Guide to markdown syntax . + +When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. 
You can embed an R code chunk like this:
+
+```{r}
+# Simulates 100 observations from a normal distribution
+# and plots a histogram
+val <- rnorm(n = 100)
+hist(val, breaks = 20)
+```
+
+```{r}
+# The ggplot version of the same plot
+  ggplot(data = tibble(values = val),
+         mapping = aes(x = values))+
+  geom_histogram(bins = 20)
+```
+
+
+
+**Important: before running the code below, click Session -> Set Working Directory -> To Source File Location**
+
+
+## Exercise 3: Reading in Data
+
+The data we will be analyzing is from the PanTHERIA database which is "a global species-level data set of key life-history, ecological and geographical traits of all known extant and recently extinct mammals (PanTHERIA) developed for a number of macroecological and macroevolutionary research projects."
+
+```{r}
+# The data is spread across 3 sheets in an excel file. We need to
+# combine these data into one table/data frame.
+
+# na = "NA" tells read_xlsx how missing values appear in the data
+# the default is empty cells. Run "?read_xlsx" for more info
+sheet1 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 1, na = "NA")
+sheet2 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 2, na = "NA")
+sheet3 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 3, na = "NA")
+
+# rbind (row-bind) combines data frames by row
+pantheria <- rbind(sheet1, sheet2, sheet3)
+```
+
+
+```{r}
+# How many rows and columns are there?
+nrow(pantheria)
+ncol(pantheria)
+```
+
+
+```{r}
+# What does the data look like?
+head(pantheria)
+```
+
+
+## Exercise 4: Filtering and Reformatting Data
+
+We will be exploring adult body mass from these mammals as it relates to their trophic level using `dplyr` and `ggplot2`.
Download the cheatsheets for these packages at the following links:
+
+* [dplyr cheatsheet](https://posit.co/wp-content/uploads/2022/10/data-transformation-1.pdf)
+* [ggplot2 cheatsheet](https://posit.co/wp-content/uploads/2022/10/data-visualization-1.pdf)
+
+Let's start by subsetting the data with `select()`
+```{r}
+# Pipes (%>%) pass the data in front of the pipe to the first argument
+# of the function after it. This avoids deeply nested function calls and makes
+# code easier to read.
+
+pantheria <- pantheria %>%     # Passes pantheria as the first argument of select
+  select(Order,
+         Family,               # select returns the specified columns
+         Genus,
+         Species,
+         TrophicLevel,
+         AdultBodyMass_g) %>%
+  drop_na() %>%                # Remove any rows that have NAs
+  distinct()                   # Remove any duplicate rows
+```
+
+Data is almost never clean. For example, there should be only 3 trophic levels:
+```{r}
+unique(pantheria$TrophicLevel) # unique elements of a vector
+```
+
+Let's fix the TrophicLevel column using `mutate()`
+
+```{r}
+# mutate allows us to add columns or modify existing ones
+pantheria <- pantheria %>%
+  mutate(TrophicLevel = tolower(TrophicLevel)) # Make column lowercase
+```
+
+
+## Exercise 5: Summarizing data
+
+Now we can summarize the adult body mass by trophic level by computing standard metrics like mean and standard deviation.
+
+```{r}
+pantheria %>%
+  group_by(TrophicLevel) %>%                            # Group observations by this column
+  summarize(Mean = mean(AdultBodyMass_g),               # Summarize will calculate these group wise
+            `Standard Deviation` = sd(AdultBodyMass_g), # Backticks let us use spaces in column names
+            Min = min(AdultBodyMass_g),
+            Max = max(AdultBodyMass_g)) %>%
+  ungroup() %>%
+  arrange(desc(Mean))                                   # Order the data frame by descending mean body mass
+```
+
+
+
+## Exercise 6: Plotting Data
+
+According to the table above, body masses have a really wide range across trophic levels. Let's visualize the distribution of adult body masses.
+
+```{r}
+# ggplot2 constructs graphics in layers, each layer is separated by "+"
+# x and y values are supplied in aes(), the type of plot is specified using
+# the "geom" functions
+
+ggplot(data = pantheria,                             # input data
+       mapping =  aes(x = log10(AdultBodyMass_g))) + # log10 transform adult body mass
+  geom_histogram(fill = "#CE3274",                   # type of plot
+                 bins = 40) +
+  xlab(label = "log10 Adult Body Mass (g)") +        # x label
+  ylab(label = "Frequency") +                        # y label
+  labs(title = "Histogram of log10 Adult Body Mass") # title
+```
+
+The data looks skewed even after log10 transformation. Let's view the distribution by trophic level.
+
+```{r}
+pantheria %>%
+  ggplot(aes(x = log10(AdultBodyMass_g), fill = TrophicLevel)) + # Color by trophic level
+  geom_histogram(bins = 40) +
+  facet_grid(rows = vars(TrophicLevel)) +                        # Split the plot into rows by trophic level
+  ylab(label = "Frequency") +
+  xlab(label = "log10 Adult Body Mass (g)") +
+  labs(title = "Histograms of log10 Adult Body Mass") +
+  theme(plot.title = element_text(hjust = 0.5))                  # Center the plot title
+```
+
+The histograms suggest that trophic level is associated with the distribution of adult body mass: carnivores tend to be smaller. One possible explanation is that carnivores have higher metabolic demands, so there might be a selection pressure towards smaller carnivores. If we wanted to test this by fitting a model, we could use the `lm()` function to fit a linear model.
+
+## Exercise 7.1: Hands on coding
+
+An important caveat of the data is that some Orders of mammals are more biodiverse than others and are therefore overrepresented in the dataset. Using the `dplyr` cheatsheet, write code that generates a table of Orders and the percentage of the data each one makes up. Scroll down to see the hint if you are having trouble.
+
+```{r}
+# Your code
+pantheria %>%
+  group_by(Order) %>%
+  summarise(n = n()) %>%                  # Number of rows in each Order
+  mutate(Percent = 100 * n / sum(n)) %>%  # Percentage of all rows in each Order
+  arrange(desc(n))
+
+```
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+*Exercise 7.1 hint: group_by + summarize(n = n()) + arrange*
+
+## Exercise 7.2: Hands on coding
+
+Now that we see which Orders are overrepresented, we can plot their body masses by trophic level to see if they are skewing the overall distributions.
+
+
+```{r}
+
+top_orders <- c("Rodentia", "Chiroptera") # Character vector of the top 2 Orders from above
+# filter uses a conditional to select rows from the data
+pantheria %>%
+  filter(Order %in% top_orders) %>%
+  ggplot(aes(x = log10(AdultBodyMass_g), fill = Order)) +
+  geom_histogram(bins = 40) +
+  facet_grid(rows = vars(TrophicLevel),
+             cols = vars(Order)) +
+  ylab(label = "Frequency") +
+  xlab(label = "log10 Adult Body Mass (g)") +
+  theme(plot.title = element_text(hjust = 0.5))
+```
+
+It looks like one of them is mostly made up of small carnivores. Let's remove it and redo the plot of body mass distribution by trophic level.
+
+```{r}
+pantheria %>%
+  filter(Order != "Chiroptera") %>%
+  ggplot(aes(x = log10(AdultBodyMass_g),fill = TrophicLevel)) + # Color by trophic level
+  geom_histogram(bins = 40) +
+  facet_grid(rows = vars(TrophicLevel)) +                       # Split the plot into rows by trophic level
+  ylab(label = "Frequency") +
+  xlab(label = "log10 Adult Body Mass (g)") +
+  labs(title = "Histograms of log10 Adult Body Mass")
+```
+
+We can now see that the body mass of carnivorous mammals is much less skewed than the initial plots showed. There is still an effect of trophic level on body mass, but the effect size is likely much smaller than we would estimate by including all `r sum(pantheria$Order == "Chiroptera")` *Chiroptera*. Now that we have generated these plots, we can generate a full report that contains all of the text and code: click `Knit` to render the HTML report.
+ +## End of workshop exercises + +Hopefully this workshop has provided a good foundation for you to learn R. If you would like some additional practice, check out the resources on the [workshop wiki](https://github.com/gladstone-institutes/Bioinformatics-Workshops/wiki/Introduction-to-R-for-Data-Analysis). R also contains many built in datasets you can use for practice: +```{r, echo=FALSE} +available_datasets <- data() +available_datasets$results %>% + as_tibble() %>% + select(-LibPath) %>% knitr::kable() +``` + + + diff --git a/intro-r-data-analysis/Intro_to_R_workshop_materials/part_2_filled_out.html b/intro-r-data-analysis/Intro_to_R_workshop_materials/part_2_filled_out.html new file mode 100644 index 0000000..daa94b5 --- /dev/null +++ b/intro-r-data-analysis/Intro_to_R_workshop_materials/part_2_filled_out.html @@ -0,0 +1,1360 @@ + + + + + + + + + + + + + + +Intro to R Data Analysis: Part 2 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + + + + + +
    +

    R Markdown

    +

    This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown, see http://rmarkdown.rstudio.com. For a guide to Markdown syntax, see https://www.markdownguide.org/basic-syntax/.

    +

    When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

    +
    # Simulates 100 observations from a normal distribution
    +# and plots a histogram
    +val <- rnorm(n = 100)
    +hist(val, breaks = 20) 
    +

    +
    # The ggplot version of the same plot
    +  ggplot(data = tibble(values = val),
    +         mapping = aes(x = values))+
    +  geom_histogram(bins = 20)
    +

    +

    Important: before running the code below, click Session -> +Set Working Directory -> To Source File Location

    +
    +
    +

    Exercise 3: Reading in Data

    +

    The data we will be analyzing is from the PanTHERIA database which is +“a global species-level data set of key life-history, ecological and +geographical traits of all known extant and recently extinct mammals +(PanTHERIA) developed for a number of macroecological and +macroevolutionary research projects.”

    +
    # The data is spread across 3 sheets in an excel file. We need to 
    +# combine these data into one table/data frame.
    +
    +# na = "NA" tells read_xlsx how missing values appear in the data
    +# the default is empty cells. Run "?read_xlsx" for more info
    +sheet1 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 1, na = "NA")
    +sheet2 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 2, na = "NA")
    +sheet3 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 3, na = "NA")
    +
    +# rbind (row-bind) combines data frames by row
    +pantheria <- rbind(sheet1, sheet2, sheet3)
    +
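The row-binding step above can be tried on toy data first. The sketch below is only an illustration: the three small data frames are made-up stand-ins for the Excel sheets, and `bind_rows()` from dplyr is shown as an alternative to `rbind()`, not as what the workshop code uses.

```r
library(dplyr)

# Made-up stand-ins for the three sheets (not the real PanTHERIA data)
sheet_a <- data.frame(Order = "Carnivora", AdultBodyMass_g = 11989)
sheet_b <- data.frame(Order = "Cetacea", AdultBodyMass_g = 47506008)
sheet_c <- data.frame(Order = "Rodentia")   # this piece lacks the mass column

# rbind() stacks data frames that share the same columns
combined <- rbind(sheet_a, sheet_b)
nrow(combined)

# rbind(sheet_a, sheet_c) would stop with an error because the columns differ.
# bind_rows() matches columns by name and fills the gaps with NA instead.
combined_all <- bind_rows(sheet_a, sheet_b, sheet_c)
combined_all
```

Because all three PanTHERIA sheets have the same columns, `rbind()` is enough for the exercise; `bind_rows()` is mainly useful when the pieces do not line up exactly.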
    # How many rows and columns are there?
    +nrow(pantheria)
    +
    ## [1] 2161
    +
    ncol(pantheria)
    +
    ## [1] 54
    +
    # What does the data look like?
    +head(pantheria)
    +
    ## # A tibble: 6 × 54
    +##   Order     Family          Genus Species Binomial ActivityCycle AdultBodyMass_g
    +##   <chr>     <chr>           <chr> <chr>   <chr>    <chr>                   <dbl>
    +## 1 Carnivora Canidae         Canis latrans Canis l… crepuscular            11989.
    +## 2 Carnivora Canidae         Canis lupus   Canis l… crepuscular            31757.
    +## 3 Carnivora Canidae         Canis simens… Canis s… diurnal                14362.
    +## 4 Carnivora Canidae         Atel… microt… Atelocy… <NA>                    8363.
    +## 5 Cetacea   Balaenopteridae Bala… muscul… Balaeno… <NA>               154321304.
    +## 6 Cetacea   Balaenopteridae Bala… physal… Balaeno… <NA>                47506008.
    +## # ℹ 47 more variables: AdultForearmLen_mm <dbl>, AdultHeadBodyLen_mm <dbl>,
    +## #   AgeatEyeOpening_d <dbl>, AgeatFirstBirth_d <dbl>,
    +## #   BasalMetRate_mLO2hr <dbl>, BasalMetRateMass_g <dbl>, DietBreadth <dbl>,
    +## #   DispersalAge_d <dbl>, GestationLen_d <dbl>, HabitatBreadth <dbl>,
    +## #   HomeRange_km2 <dbl>, HomeRange_Indiv_km2 <dbl>, InterbirthInterval_d <dbl>,
    +## #   LitterSize <dbl>, LittersPerYear <dbl>, MaxLongevity_m <dbl>,
    +## #   NeonateBodyMass_g <dbl>, NeonateHeadBodyLen_mm <dbl>, …
    +
    +
    +

    Exercise 4: Filtering and Reformatting Data

    +

    We will be exploring adult body mass from these mammals as it relates to their trophic level using dplyr and ggplot2. Download the cheatsheets for these packages at the following links:

    + +

    Let’s start by subsetting the data with select()

    +
    # Pipes (%>%) work by passing the data in front of the pipe to the first argument
    +# of the function after it. This avoids deeply nested function calls and makes
    +# code easier to read.
    +
    +pantheria <- pantheria %>%     # Passes pantheria as the first argument of select
    +  select(Order,   
    +         Family,               # select returns the specified columns
    +         Genus,                
    +         Species,
    +         TrophicLevel,
    +         AdultBodyMass_g) %>%
    +  drop_na() %>%                # Remove any rows that have NAs
    +  distinct()                   # Remove any duplicate rows
    +
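To see what the pipe is doing, it can help to write the same steps both ways. This is a minimal sketch on a made-up data frame (the body mass values are invented for illustration); it assumes dplyr and tidyr are installed, since `drop_na()` comes from tidyr.

```r
library(dplyr)
library(tidyr)   # drop_na() lives in tidyr

# Made-up example rows: one duplicate and one missing value
toy <- data.frame(
  Genus = c("Canis", "Canis", "Vulpes", "Vulpes"),
  Species = c("lupus", "lupus", "vulpes", "zerda"),
  AdultBodyMass_g = c(31757, 31757, 4820, NA)
)

# Nested calls: read from the inside out
nested <- distinct(drop_na(select(toy, Genus, AdultBodyMass_g)))

# Piped calls: read from top to bottom
piped <- toy %>%
  select(Genus, AdultBodyMass_g) %>%   # keep two columns
  drop_na() %>%                        # drop the row with the missing mass
  distinct()                           # drop the duplicated Canis row

identical(nested, piped)
```

Both versions run the same three functions in the same order; the piped version just lists them in the order they happen.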

    Data is almost never clean. For example, there should be only 3 trophic levels:

    +
    unique(pantheria$TrophicLevel) # unique elements of a vector
    +
    ## [1] "carnivore" "herbivore" "omnivore"  "Omnivore"
    +

    Let’s fix the TrophicLevel column using mutate()

    +
    # mutate allows us to add columns or modify existing ones
    +pantheria <- pantheria %>%
    +  mutate(TrophicLevel = tolower(TrophicLevel)) # Make column lowercase
    +
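A quick way to check that a cleaning step like this worked is to count the values before and after. The sketch below uses a small made-up column rather than the real data.

```r
library(dplyr)

# Made-up column with the same kind of inconsistency as TrophicLevel
toy <- data.frame(
  TrophicLevel = c("carnivore", "omnivore", "Omnivore", "herbivore")
)

# Before: "omnivore" and "Omnivore" are counted as separate levels
toy %>% count(TrophicLevel)

cleaned <- toy %>%
  mutate(TrophicLevel = tolower(TrophicLevel))

# After: the two spellings are merged into one level
cleaned %>% count(TrophicLevel)
```

Note that `tolower()` only fixes capitalization. Other problems, such as stray whitespace or misspellings, would need their own cleaning step.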
    +
    +

    Exercise 5: Summarizing data

    +

    Now we can summarize the adult body mass by trophic level by +computing standard metrics like mean and standard deviation.

    +
    pantheria %>%
    +  group_by(TrophicLevel) %>%                            # Group observations by this column
    +  summarize(Mean = mean(AdultBodyMass_g),               # Summarize will calculate these group wise
    +            `Standard Deviation` = sd(AdultBodyMass_g), # Backticks let us use spaces in column names
    +            Min = min(AdultBodyMass_g),
    +            Max = max(AdultBodyMass_g)) %>%
    +  ungroup() %>%
    +  arrange(desc(Mean))                                   # Order the data frame by descending mean body mass
    +
    ## # A tibble: 3 × 5
    +##   TrophicLevel    Mean `Standard Deviation`   Min        Max
    +##   <chr>          <dbl>                <dbl> <dbl>      <dbl>
    +## 1 carnivore    824948.             7963841.  1.96 154321304.
    +## 2 omnivore      77419.             1173853.  3.29  27324024.
    +## 3 herbivore     59854.              305498.  5.55   4750000.
    +
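In the table above, the carnivore maximum (154,321,304 g) is far larger than the carnivore mean, which suggests a few very large species are pulling the mean up. One possible addition, not part of the original exercise, is to report the median next to the mean. The sketch below shows the idea on made-up numbers.

```r
library(dplyr)

# Made-up body masses in grams: one group contains a single huge value
toy <- data.frame(
  TrophicLevel = c("carnivore", "carnivore", "carnivore", "carnivore",
                   "herbivore", "herbivore", "herbivore"),
  AdultBodyMass_g = c(10, 20, 30, 1e8, 500, 1000, 1500)
)

toy_summary <- toy %>%
  group_by(TrophicLevel) %>%
  summarize(Mean = mean(AdultBodyMass_g),       # pulled up by the extreme value
            Median = median(AdultBodyMass_g)) %>% # barely affected by it
  ungroup()

toy_summary
```

When the mean and median of a group differ by orders of magnitude, as they do for the toy carnivores here, the distribution is strongly skewed. That is one reason the next exercise plots body mass on a log10 scale.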
    +
    +

    Exercise 6: Plotting Data

    +

    According to the table above, body masses have a really wide range +across trophic levels. Let’s visualize the distribution of adult body +masses.

    +
    # ggplot2 constructs graphics in layers, each layer is separated by "+"
    +# x and y values are supplied in aes(), the type of plot is specified using
    +# the "geom" functions
    +
    +ggplot(data = pantheria,                             # input data
    +       mapping =  aes(x = log10(AdultBodyMass_g))) + # log10 transform adult body mass
    +  geom_histogram(fill = "#CE3274",                   # type of plot
    +                 bins = 40) + 
    +  xlab(label = "log10 Adult Body Mass (g)") +        # x label
    +  ylab(label = "Frequency") +                        # y label
    +  labs(title = "Histogram of log10 Adult Body Mass") # title 
    +

    +

    The data looks skewed even after log10 transformation. Let’s view the +distribution by trophic level.

    +
    pantheria %>%
    +  ggplot(aes(x = log10(AdultBodyMass_g), fill = TrophicLevel)) + # Color by trophic level
    +  geom_histogram(bins = 40) +
    +  facet_grid(rows = vars(TrophicLevel)) +                        # Split the plot into rows by trophic level
    +  ylab(label = "Frequency") + 
    +  xlab(label = "log10 Adult Body Mass (g)") + 
    +  labs(title = "Histograms of log10 Adult Body Mass") +
    +  theme(plot.title = element_text(hjust = 0.5))                  # Center the plot title
    +

    +

    It is clear that trophic level affects the distribution of adult body mass. Carnivores tend to be smaller, which makes sense because carnivores have higher metabolic demands, so there might be selection pressure towards smaller carnivores. If we wanted to confirm this by fitting a model, we could use the lm() function to fit a linear model.
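As a rough sketch of what fitting such a model could look like, the code below uses simulated data, not PanTHERIA. The group means, sample sizes, and column name `log10_mass` are assumptions chosen only for illustration.

```r
set.seed(1)

# Simulated data (not PanTHERIA): log10 body mass for two made-up groups
toy <- data.frame(
  TrophicLevel = rep(c("carnivore", "herbivore"), each = 50),
  log10_mass = c(rnorm(50, mean = 2, sd = 1),
                 rnorm(50, mean = 3, sd = 1))
)

# Model log10 body mass as a function of trophic level
fit <- lm(log10_mass ~ TrophicLevel, data = toy)

# (Intercept) is the mean of the reference level (carnivore, first alphabetically).
# TrophicLevelherbivore is the estimated difference between herbivores and carnivores.
coef(fit)
summary(fit)
```

On the workshop data, the analogous call would be along the lines of `lm(log10(AdultBodyMass_g) ~ TrophicLevel, data = pantheria)`. Whether a simple linear model is appropriate for these data is a separate question that this sketch does not address.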

    +
    +
    +

    Exercise 7.1: Hands on coding

    +

    An important caveat of the data is that some Orders of mammals are more biodiverse than others and are therefore overrepresented in the dataset. Using the dplyr cheatsheet, write code that generates a table of Orders and the percentage of the data each one makes up. Scroll down to see the hint if you are having trouble.

    +
    # Your code 
    +pantheria %>%
    +  group_by(Order) %>%
    +  summarise(n = n()) %>%
    +  arrange(desc(n))
    +
    ## # A tibble: 29 × 2
    +##    Order               n
    +##    <chr>           <int>
    +##  1 Rodentia          575
    +##  2 Chiroptera        393
    +##  3 Carnivora         227
    +##  4 Primates          175
    +##  5 Artiodactyla      164
    +##  6 Diprotodontia      80
    +##  7 Soricomorpha       80
    +##  8 Cetacea            49
    +##  9 Didelphimorphia    39
    +## 10 Lagomorpha         37
    +## # ℹ 19 more rows
    +

    Exercise 7.1 hint: group_by + summarize(n = n()) + +arrange
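The solution above returns counts per Order, while the exercise also asks for percentages. One way to add them is sketched below on a made-up data frame; `count()` is used as a shorthand for the `group_by()` + `summarize(n = n())` pattern in the hint.

```r
library(dplyr)

# Made-up data: 5 Rodentia, 3 Chiroptera, 2 Carnivora
toy <- data.frame(
  Order = rep(c("Rodentia", "Chiroptera", "Carnivora"), times = c(5, 3, 2))
)

order_pct <- toy %>%
  count(Order) %>%                        # one row per Order with its count n
  mutate(Percent = 100 * n / sum(n)) %>%  # share of all rows, in percent
  arrange(desc(Percent))                  # largest share first

order_pct
```

Replacing `toy` with `pantheria` should give the same kind of table for the workshop data.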

    +
    +
    +

    Exercise 7.2: Hands on coding

    +

    Now that we know which Orders are overrepresented, we can plot their body masses by trophic level to see if they are skewing the overall distributions.

    +
    top_orders <- c("Rodentia", "Chiroptera") # Character vector of the top 2 Orders from above
    +# filter uses a conditional to select rows from the data
    +pantheria %>%
    +  filter(Order %in% top_orders) %>%
    +  ggplot(aes(x = log10(AdultBodyMass_g), fill = Order)) +
    +  geom_histogram(bins = 40) +
    +  facet_grid(rows = vars(TrophicLevel),
    +             cols = vars(Order)) +
    +  ylab(label = "Frequency") + 
    +  xlab(label = "log10 Adult Body Mass (g)") + 
    +  theme(plot.title = element_text(hjust = 0.5))   
    +
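The `%in%` operator in the `filter()` call above tests membership in a vector. The sketch below, on made-up data, compares it with the written-out `==` version and shows a common mistake.

```r
library(dplyr)

toy <- data.frame(Order = c("Rodentia", "Chiroptera", "Carnivora", "Rodentia"))
top_orders <- c("Rodentia", "Chiroptera")

# %in% asks, for each row, whether Order appears anywhere in top_orders
kept <- toy %>% filter(Order %in% top_orders)

# The same condition written out with == and | (or)
kept_or <- toy %>% filter(Order == "Rodentia" | Order == "Chiroptera")

identical(kept, kept_or)

# Common mistake: == compares element by element and recycles top_orders,
# so rows are matched by position and the last Rodentia row is dropped
recycled <- toy %>% filter(Order == top_orders)
nrow(kept)
nrow(recycled)
```

Using `%in%` keeps the condition correct no matter how many Orders are in the vector or what order the rows are in.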

    +

    It looks like one of them is mostly made up of small carnivores. Let’s remove it and redo the plot of body mass distribution by trophic level.

    +
    pantheria %>%
    +  filter(Order != "Chiroptera") %>%
    +  ggplot(aes(x = log10(AdultBodyMass_g),fill = TrophicLevel)) + # Color by trophic level
    +  geom_histogram(bins = 40) +
    +  facet_grid(rows = vars(TrophicLevel)) +                       # Split the plot into rows by trophic level
    +  ylab(label = "Frequency") + 
    +  xlab(label = "log10 Adult Body Mass (g)") + 
    +  labs(title = "Histograms of log10 Adult Body Mass")
    +

    +

    We can now see that the body mass of carnivorous mammals is much less skewed than the initial plots suggested. There is still an effect of trophic level on body mass, but the effect size is likely much smaller than we would estimate by including all 393 Chiroptera. Now that we have generated these plots, we can generate a full report that contains all of the text and code. Click Knit to render the HTML report.

    +
    +
    +

    End of workshop exercises

    +

    Hopefully this workshop has provided a good foundation for you to learn R. If you would like some additional practice, check out the resources on the workshop wiki. R also contains many built-in datasets you can use for practice:

    + +++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    PackageItemTitle
    ggplot2diamondsPrices of over 50,000 round cut diamonds
    ggplot2economicsUS economic time series
    ggplot2economics_longUS economic time series
    ggplot2faithfuld2d density estimate of Old Faithful data
    ggplot2luv_colours‘colors()’ in Luv space
    ggplot2midwestMidwest demographics
    ggplot2mpgFuel economy data from 1999 to 2008 for 38 popular +models of cars
    ggplot2msleepAn updated and expanded version of the mammals sleep +dataset
    ggplot2presidentialTerms of 12 presidents from Eisenhower to Trump
    ggplot2sealsVector field of seal movements
    ggplot2txhousingHousing sales in TX
    tidyrbillboardSong rankings for Billboard top 100 in the year +2000
    tidyrcms_patient_careData from the Centers for Medicare & Medicaid +Services
    tidyrcms_patient_experienceData from the Centers for Medicare & Medicaid +Services
    tidyrconstructionCompleted construction in the US in 2018
    tidyrfish_encountersFish encounters
    tidyrhouseholdHousehold data
    tidyrpopulationWorld Health Organization TB data
    tidyrrelig_incomePew religion and income survey
    tidyrsmithsSome data about the Smith family
    tidyrtable1Example tabular representations
    tidyrtable2Example tabular representations
    tidyrtable3Example tabular representations
    tidyrtable4aExample tabular representations
    tidyrtable4bExample tabular representations
    tidyrtable5Example tabular representations
    tidyrus_rent_incomeUS rent and income data
    tidyrwhoWorld Health Organization TB data
    tidyrwho2World Health Organization TB data
    tidyrworld_bank_popPopulation data from the world bank
    dplyrband_instrumentsBand membership
    dplyrband_instruments2Band membership
    dplyrband_membersBand membership
    dplyrstarwarsStarwars characters
    dplyrstormsStorm tracks data
    datasetsAirPassengersMonthly Airline Passenger Numbers 1949-1960
    datasetsBJsalesSales Data with Leading Indicator
    datasetsBJsales.lead (BJsales)Sales Data with Leading Indicator
    datasetsBODBiochemical Oxygen Demand
    datasetsCO2Carbon Dioxide Uptake in Grass Plants
    datasetsChickWeightWeight versus age of chicks on different diets
    datasetsDNaseElisa assay of DNase
    datasetsEuStockMarketsDaily Closing Prices of Major European Stock Indices, +1991-1998
    datasetsFormaldehydeDetermination of Formaldehyde
    datasetsHairEyeColorHair and Eye Color of Statistics Students
    datasetsHarman23.corHarman Example 2.3
    datasetsHarman74.corHarman Example 7.4
    datasetsIndomethPharmacokinetics of Indomethacin
    datasetsInsectSpraysEffectiveness of Insect Sprays
    datasetsJohnsonJohnsonQuarterly Earnings per Johnson & Johnson Share
    datasetsLakeHuronLevel of Lake Huron 1875-1972
    datasetsLifeCycleSavingsIntercountry Life-Cycle Savings Data
    datasetsLoblollyGrowth of Loblolly pine trees
    datasetsNileFlow of the River Nile
    datasetsOrangeGrowth of Orange Trees
    datasetsOrchardSpraysPotency of Orchard Sprays
    datasetsPlantGrowthResults from an Experiment on Plant Growth
    datasetsPuromycinReaction Velocity of an Enzymatic Reaction
    datasetsSeatbeltsRoad Casualties in Great Britain 1969-84
    datasetsTheophPharmacokinetics of Theophylline
    datasetsTitanicSurvival of passengers on the Titanic
    datasetsToothGrowthThe Effect of Vitamin C on Tooth Growth in Guinea +Pigs
    datasetsUCBAdmissionsStudent Admissions at UC Berkeley
    datasetsUKDriverDeathsRoad Casualties in Great Britain 1969-84
    datasetsUKgasUK Quarterly Gas Consumption
    datasetsUSAccDeathsAccidental Deaths in the US 1973-1978
    datasetsUSArrestsViolent Crime Rates by US State
    datasetsUSJudgeRatingsLawyers’ Ratings of State Judges in the US Superior +Court
    datasetsUSPersonalExpenditurePersonal Expenditure Data
    datasetsUScitiesDDistances Between European Cities and Between US +Cities
    datasetsVADeathsDeath Rates in Virginia (1940)
    datasetsWWWusageInternet Usage per Minute
    datasetsWorldPhonesThe World’s Telephones
    datasetsability.covAbility and Intelligence Tests
    datasetsairmilesPassenger Miles on Commercial US Airlines, +1937-1960
    datasetsairqualityNew York Air Quality Measurements
    datasetsanscombeAnscombe’s Quartet of ‘Identical’ Simple Linear +Regressions
    datasetsattenuThe Joyner-Boore Attenuation Data
    datasetsattitudeThe Chatterjee-Price Attitude Data
    datasetsaustresQuarterly Time Series of the Number of Australian +Residents
    datasetsbeaver1 (beavers)Body Temperature Series of Two Beavers
    datasetsbeaver2 (beavers)Body Temperature Series of Two Beavers
    datasetscarsSpeed and Stopping Distances of Cars
    datasetschickwtsChicken Weights by Feed Type
    datasetsco2Mauna Loa Atmospheric CO2 Concentration
    datasetscrimtabStudent’s 3000 Criminals Data
    datasetsdiscoveriesYearly Numbers of Important Discoveries
    datasetsesophSmoking, Alcohol and (O)esophageal Cancer
    datasetseuroConversion Rates of Euro Currencies
    datasetseuro.cross (euro)Conversion Rates of Euro Currencies
    datasetseurodistDistances Between European Cities and Between US +Cities
    datasetsfaithfulOld Faithful Geyser Data
    datasetsfdeaths (UKLungDeaths)Monthly Deaths from Lung Diseases in the UK
    datasetsfreenyFreeny’s Revenue Data
    datasetsfreeny.x (freeny)Freeny’s Revenue Data
    datasetsfreeny.y (freeny)Freeny’s Revenue Data
    datasetsinfertInfertility after Spontaneous and Induced Abortion
    datasetsirisEdgar Anderson’s Iris Data
    datasetsiris3Edgar Anderson’s Iris Data
    datasetsislandsAreas of the World’s Major Landmasses
    datasetsldeaths (UKLungDeaths)Monthly Deaths from Lung Diseases in the UK
    datasetslhLuteinizing Hormone in Blood Samples
    datasetslongleyLongley’s Economic Regression Data
    datasetslynxAnnual Canadian Lynx trappings 1821-1934
    datasetsmdeaths (UKLungDeaths)Monthly Deaths from Lung Diseases in the UK
    datasetsmorleyMichelson Speed of Light Data
    datasetsmtcarsMotor Trend Car Road Tests
    datasetsnhtempAverage Yearly Temperatures in New Haven
    datasetsnottemAverage Monthly Temperatures at Nottingham, +1920-1939
    datasetsnpkClassical N, P, K Factorial Experiment
    datasetsoccupationalStatusOccupational Status of Fathers and their Sons
    datasetsprecipAnnual Precipitation in US Cities
    datasetspresidentsQuarterly Approval Ratings of US Presidents
    datasetspressureVapor Pressure of Mercury as a Function of +Temperature
    datasetsquakesLocations of Earthquakes off Fiji
    datasetsranduRandom Numbers from Congruential Generator RANDU
    datasetsriversLengths of Major North American Rivers
    datasetsrockMeasurements on Petroleum Rock Samples
    datasetssleepStudent’s Sleep Data
    datasetsstack.loss (stackloss)Brownlee’s Stack Loss Plant Data
    datasetsstack.x (stackloss)Brownlee’s Stack Loss Plant Data
    datasetsstacklossBrownlee’s Stack Loss Plant Data
    datasetsstate.abb (state)US State Facts and Figures
    datasetsstate.area (state)US State Facts and Figures
    datasetsstate.center (state)US State Facts and Figures
    datasetsstate.division (state)US State Facts and Figures
    datasetsstate.name (state)US State Facts and Figures
    datasetsstate.region (state)US State Facts and Figures
    datasetsstate.x77 (state)US State Facts and Figures
    datasetssunspot.monthMonthly Sunspot Data, from 1749 to “Present”
    datasetssunspot.yearYearly Sunspot Data, 1700-1988
    datasetssunspotsMonthly Sunspot Numbers, 1749-1983
    datasetsswissSwiss Fertility and Socioeconomic Indicators (1888) +Data
    datasetstreeringYearly Treering Data, -6000-1979
    datasetstreesDiameter, Height and Volume for Black Cherry Trees
    datasetsuspopPopulations Recorded by the US Census
    datasetsvolcanoTopographic Information on Auckland’s Maunga Whau +Volcano
    datasetswarpbreaksThe Number of Breaks in Yarn during Weaving
    datasetswomenAverage Heights and Weights for American Women
    +
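Any dataset in the table above can be used by name once its package is loaded. As a small sketch, the code below looks at `mtcars`, which ships with the `datasets` package that R loads by default.

```r
# Built-in data frames are available by name
head(mtcars)
dim(mtcars)      # number of rows and columns

# List the datasets bundled with a specific package
builtin <- data(package = "datasets")$results
head(builtin[, c("Item", "Title")])
```

Running `?mtcars` in the console opens the help page describing each column. Datasets from other packages, such as `msleep` from ggplot2, become available after calling `library()` on that package.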
    + + + + +
    + + + + + + + + + + + + + + + diff --git a/intro-r-data-analysis/assets/Taxonomic_Rank_Graph.svg b/intro-r-data-analysis/assets/Taxonomic_Rank_Graph.svg new file mode 100644 index 0000000..4852495 --- /dev/null +++ b/intro-r-data-analysis/assets/Taxonomic_Rank_Graph.svg @@ -0,0 +1,303 @@ + + + + + + + + + + + + + + + লাল শিয়াল + (ভালপেস ভালপেস) + Rotfuchs + (Vulpes vulpes) + Zorro rojo + (Vulpes vulpes) + + မြေခွေးနီ (Vulpes vulpes) + обыкновенная лисица + (Vulpes vulpes) + + රතු හිවලා (වුල්පෙස් වුල්පෙස්) + Rödräv + (Vulpes vulpes) + + लाल लोमड़ी + वुल्पेस वुल्पेस + Црвена лисица + (Vulpes vulpes) + Red fox + (Vulpes vulpes) + + + অধিজগৎজগৎপর্বশ্রেণিবর্গগোত্রগণ + প্রজাতি + DomäneReichStammKlasseOrdnungFamilieGattung + Art + DominioReinoFiloClaseOrdenFamiliaGénero + Especie + နယ်ပယ်လောကမျိုးပေါင်းစုမျိုးပေါင်းမျိုးစဉ်မျိုးရင်းမျိုးစု + မျိုးစိတ် + ДоменЦарствоТипКлассОтрядСемействоРод + Вид + වසමරාජධානියවංශයවර්ගයගෝත්‍රයකුලයගණය + විශේෂය + DomänRikeFylumKlassOrdningFamiljSläkte + Art + + अधिजगत्जगत् संघवर्गगणकुटुम्बवंश + जाति + ДоменЦарствоКоленоКласаРедСемејствоРод + Вид + DomainKingdomPhylumClassOrderFamilyGenus + Species + + + সুকেন্দ্রিকপ্রাণীমেরুদণ্ডীস্তন্যপায়ীশ্বাপদক্যানিডেভালপেস + ভালপেস ভালপেস + EucariotaAnimaliaCordadosMamíferosCarnívoraCánidosVulpes + Vulpes vulpes + ယူကာရုတ်တိရစ္ဆာန်ကော်ဒိတ်နို့တိုက်သတ္တဝါကာနီဗိုရာခွေးမျိုးရင်းVulpes + Vuples vulpes + ЭукариотыЖивотныеХордовыеМлекопитающиеХищныеПсовыеVulpes + Vulpes vulpes + යුකේරියාඇනිමේලියාකෝඩේටාමමාලියාකානිවෝරාකානිඩේවුල්පෙස් + Vuples vulpes + EukaryoterDjurRyggsträngsdjurDäggdjurRovdjurHunddjurVulpes + Vulpes vulpes + + सुकेन्द्रकप्राणीरज्जुकीस्तनधारी मांसाहारीश्वानवुल्पेस + वुल्पेस वुल्पेस + ЕукариотиЖивотниХордовиЦицачиЅверовиКучињаЛисици + Црвена лисица + EukaryaAnimaliaChordataMammaliaCarnivoraCanidaeVulpes + Vulpes vulpes + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/intro-r-data-analysis/assets/plotting.png b/intro-r-data-analysis/assets/plotting.png new file mode 100644 index 0000000..6cf8b20 Binary files /dev/null and b/intro-r-data-analysis/assets/plotting.png differ diff --git a/intro-r-data-analysis/style.css b/intro-r-data-analysis/style.css index eddb990..6e7ace6 100644 --- a/intro-r-data-analysis/style.css +++ b/intro-r-data-analysis/style.css @@ -14,7 +14,7 @@ .reveal pre code { background-color: #d5d5d5 !important; color: #333 !important; - font-size: 1.5em !important; + font-size: 1.25em !important; } /* Left-align all code outputs */ .reveal pre code {