Introduction to R Data Analysis - Part 2

Natalie Elphick

January 23rd, 2024


Introductions

Natalie Elphick
Bioinformatician I

Yihang Xin (TA)
Software Engineer III

Schedule

  1. Introduction to Tidyverse
  2. Filtering and reformatting data
  3. Plotting data
  4. Hands-on data analysis

Introduction to Tidyverse

Tidyverse

  • The tidyverse packages work well together because they share common data representations and design principles
    • Rows = observations, columns = variables
  • ggplot2, for data visualization.
  • dplyr, for data manipulation.
  • tidyr, for data tidying.
  • readr, for data import.
  • purrr, for iteration.
  • and more...
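
As a rough illustration of the "rows = observations, columns = variables" idea, the sketch below reshapes a small made-up table with tidyr. The data frame `wide` and the column names `road_type` and `mpg` are hypothetical, and whether the wide or the long layout counts as tidy depends on what you treat as one observation. Here we assume each fuel-economy measurement is its own observation.

```r
library(tidyr)

# Hypothetical "wide" table: one row per car model,
# one column per type of fuel-economy measurement
wide <- data.frame(
  model = c("a4", "a6"),
  cty   = c(18, 16),
  hwy   = c(29, 23)
)

# Reshape so that each row is a single measurement (model + road type)
# and each column holds one variable
long <- pivot_longer(wide,
                     cols = c(cty, hwy),
                     names_to = "road_type",
                     values_to = "mpg")
long
```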

dplyr

  • Offers a common “grammar” of functions for data manipulation
    • mutate() adds new variables that are functions of existing columns
    • select() picks columns based on their names
    • filter() picks rows based on their values
    • summarise() reduces multiple values down to a single summary
    • arrange() changes the ordering of the rows
    • group_by() allows any operation to be done “by group”
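
A minimal sketch of mutate(), using the mpg data frame that ships with ggplot2. The new column name `avg_mpg` is made up for this example; it is simply the mean of the city and highway columns.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data frame

# Add a new column computed from two existing columns
mpg_avg <- mutate(.data = mpg,
                  avg_mpg = (cty + hwy) / 2)

# Keep only a few columns to inspect the result
head(select(.data = mpg_avg,
            manufacturer, model, cty, hwy, avg_mpg))
```

mutate() returns a new data frame; mpg itself is left unchanged.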

Example Dataframe

  • mpg is a dataframe built into the ggplot2 package
library(tidyverse)  # loads ggplot2 (which provides mpg) and dplyr
head(mpg)
manufacturer model displ year cyl trans drv cty hwy fl class
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
audi a4 2.0 2008 4 auto(av) f 21 30 p compact
audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
audi a4 2.8 1999 6 manual(m5) f 18 26 p compact

Select Columns

select(.data = mpg,
       year, cty, hwy, manufacturer)
year cty hwy manufacturer
1999 18 29 audi
1999 21 29 audi
2008 20 31 audi
2008 21 30 audi
1999 16 26 audi
1999 18 26 audi

Filter Rows

filter(.data = mpg,
       year == 2008)
manufacturer model displ year cyl trans drv cty hwy fl class
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
audi a4 2.0 2008 4 auto(av) f 21 30 p compact
audi a4 3.1 2008 6 auto(av) f 18 27 p compact
audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28 p compact
audi a4 quattro 2.0 2008 4 auto(s6) 4 19 27 p compact
audi a4 quattro 3.1 2008 6 auto(s6) 4 17 25 p compact
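
filter() also accepts several conditions at once. The sketch below is one possible illustration: conditions separated by commas must all be true, while `|` means "or". The object names `efficient_2008` and `audi_or_honda` are made up for this example.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data frame

# Comma-separated conditions are combined with AND
efficient_2008 <- filter(.data = mpg,
                         year == 2008,
                         cyl == 4,
                         hwy > 30)

# Use | when either condition may be true
audi_or_honda <- filter(.data = mpg,
                        manufacturer == "audi" | manufacturer == "honda")

head(efficient_2008)
```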

Arrange Rows

  • desc() sorts rows in descending order; the default is ascending
arrange(.data = mpg,
        desc(cyl))
manufacturer model displ year cyl trans drv cty hwy fl class
audi a6 quattro 4.2 2008 8 auto(s6) 4 16 23 p midsize
chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv
chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 11 15 e suv

Summarising data

  • The dplyr summarise() function computes a table of summaries for a data frame
  • group_by() groups the input data frame by the specified variable(s)
  • Combining these two allows us to easily create summaries for different categorical groupings

Group and Summarise

summarise(group_by(.data = mpg,
                   manufacturer),
          mean_cty = mean(cty),
          median_cty = median(cty))
manufacturer mean_cty median_cty
audi 17.61111 17.5
chevrolet 15.00000 15.0
dodge 13.13514 13.0
ford 14.00000 14.0
honda 24.44444 24.0
hyundai 18.64286 18.5

The pipe operator |>

  • Allows “chaining” of function calls to make code more readable
mpg |>
  group_by(manufacturer) |>
  summarise(mean_cty = mean(cty),
            median_cty = median(cty))
manufacturer mean_cty median_cty
audi 17.61111 17.5
chevrolet 15.00000 15.0
dodge 13.13514 13.0
ford 14.00000 14.0
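
The pipe becomes more useful as more steps are chained. The sketch below combines several of the dplyr verbs from the earlier slides; the summary column names `n_cars` and `mean_hwy` are made up for this example, and the native pipe `|>` assumes R 4.1 or later.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data frame

hwy_2008 <- mpg |>
  filter(year == 2008) |>               # keep only the 2008 rows
  group_by(manufacturer) |>             # one group per manufacturer
  summarise(n_cars = n(),               # number of rows in each group
            mean_hwy = mean(hwy)) |>    # average highway mileage
  arrange(desc(mean_hwy))               # highest average first

hwy_2008
```

Each step receives the output of the previous one as its first argument, so the code reads top to bottom in the order the operations happen.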

Plotting

ggplot2

  • One of the most widely used tidyverse packages
  • Creates publication-quality, highly customizable plots
  • ggplots use “layers” to build, modify and overlap visualizations
    • Layers are added using the + symbol and can be added to an existing ggplot
  • Many popular packages output ggplots, which can then be easily modified by adding layers
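
A small sketch of adding layers to an existing ggplot. The object names `p` and `p2` are made up for this example; the same pattern should apply to a ggplot object returned by another package.

```r
library(ggplot2)

# Store a basic scatter plot in a variable
p <- ggplot(data = mpg,
            mapping = aes(x = cty, y = hwy)) +
  geom_point()

# Add more layers to the stored plot with +
p2 <- p +
  geom_smooth(method = "lm") +               # linear trend line
  labs(x = "City miles per gallon",
       y = "Highway miles per gallon") +     # axis labels
  theme_bw()                                 # black-and-white theme

p2
```

The original plot `p` is not modified; `p + ...` returns a new plot object.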

Creating ggplots




Plot Example

ggplot(data = mpg,                         # Input dataframe
       mapping = aes(x = cty, y = hwy)) +  # Aesthetic mapping
  geom_point()                             # Point graph

Adding and Modifying Layers

ggplot(data = mpg,
       mapping = aes(x = class, y = cty, fill = class)) +
  geom_violin() +
  geom_boxplot(width = 0.1,
               fill = "white")

10 min break


Hands-on Data Analysis

Dataset Description

  • PanTHERIA
    • A global species-level data set of key life-history, ecological and geographical traits of all known extant and recently extinct mammals compiled from the literature
    • Intended for macroecological and macroevolutionary research projects
    • Data is organized by taxonomic rank

Taxonomic Rank

Taxonomy

Data Preview

Order Family Genus Species Binomial ActivityCycle AdultBodyMass_g AdultForearmLen_mm AdultHeadBodyLen_mm AgeatEyeOpening_d AgeatFirstBirth_d BasalMetRate_mLO2hr BasalMetRateMass_g DietBreadth DispersalAge_d GestationLen_d HabitatBreadth HomeRange_km2 HomeRange_Indiv_km2 InterbirthInterval_d LitterSize LittersPerYear MaxLongevity_m NeonateBodyMass_g NeonateHeadBodyLen_mm PopulationDensity_n/km2 PopulationGrpSize SexualMaturityAge_d SocialGrpSize Terrestriality TrophicLevel WeaningAge_d WeaningBodyMass_g WeaningHeadBodyLen_mm References AdultBodyMass_g_EXT LittersPerYear_EXT NeonateBodyMass_g_EXT WeaningBodyMass_g_EXT GR_Area_km2 GR_MaxLat_dd GR_MinLat_dd GR_MidRangeLat_dd GR_MaxLong_dd GR_MinLong_dd GR_MidRangeLong_dd HuPopDen_Min_n/km2 HuPopDen_Mean_n/km2 HuPopDen_5p_n/km2 HuPopDen_Change Precip_Mean_mm Temp_Mean_01degC AET_Mean_mm PET_Mean_mm
Carnivora Canidae Canis latrans Canis latrans crepuscular 11989.1 NA 872.39 11.94 365 3699 10450 1 255 61.74 1 18.88 19.91 365 5.72 NA 262 200.01 NA 0.25 NA 372.9 NA fossorial carnivore 43.71 NA NA 367;542;543;730;1113;1297;1573;1822;2655 NA 1.1 NA NA 17099094.3 71.39 8.02 39.7 -67.07 -168.12 -117.6 0 27.27 0 0.06 53.03 58.18 503.02 728.37
Carnivora Canidae Canis lupus Canis lupus crepuscular 31756.51 NA 1055 14.01 547.5 11254.2 33100 1 180 63.5 1 159.86 43.13 365 4.98 2 354 412.31 NA 0.01 NA 679.37 NA fossorial carnivore 44.82 NA NA 367;542;543;730;1015;1052;1113;1297;1573;1594;2338;2655 NA NA NA NA 50803439.7 83.27 11.48 47.38 179.65 -171.84 3.9 0 37.87 0 0.04 34.79 4.82 313.33 561.11
Carnivora Canidae Canis simensis Canis simensis diurnal 14361.86 NA 938.19 NA NA NA NA 1 180 63.61 1 4.2 5.02 365 NA NA NA NA NA 1.2 NA 754.74 NA fossorial carnivore 69.6 NA NA 542;730;1113;1573;2655 NA 1.1 NA NA 11402.81 13.31 6.55 9.93 39.96 38.02 38.99 30 99.87 30 0.15 83.87 99.03 931.35 1471.36
Carnivora Canidae Atelocynus microtis Atelocynus microtis NA 8363.22 NA 831.01 NA NA NA NA 1 NA NA 1 NA NA NA NA NA 132 NA NA NA NA NA 1 fossorial carnivore NA NA NA 543;890;1113;2655 NA NA NA NA 7634256.6 4.79 -32.31 -13.76 -43.54 -78.61 -61.08 0 7.43 0 0.12 163.06 235.49 1316.27 1488
Cetacea Balaenopteridae Balaenoptera musculus Balaenoptera musculus NA 154321304.5 NA 30480 NA NA NA NA 1 NA 326.97 1 NA NA 821.25 1 0.45 1320 2738612.79 7236.55 NA 1 1959.8 1.25 NA carnivore 211.71 16999999.97 NA 172;511;543;899;1004;1015;1217;1297;2151;2409;2655 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Cetacea Balaenopteridae Balaenoptera physalus Balaenoptera physalus NA 47506008.23 NA 20641.06 NA NA NA NA 2 NA 338.36 1 NA NA 730 1.01 0.37 1392 1899999.99 6273.75 NA 1.5 2666.41 NA NA carnivore 196.58 NA 12000 24;27;543;899;1004;1015;1217;1297;1577;2151;2655 NA NA NA 6395530.42 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
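
A rough sketch of how the dplyr verbs from earlier might apply to this kind of data. It is not taken from part_2.Rmd. The tibble below contains only the six preview rows and three of the columns (values rounded to two decimals), and the names `pantheria_preview`, `by_order`, `n_species`, `mean_mass_kg` and `mean_litter_size` are made up for this example. Because many traits are missing (NA), `na.rm = TRUE` is used so that the mean is computed from the available values only.

```r
library(dplyr)

# Six species from the preview, with three of the columns
pantheria_preview <- tibble(
  Order = c("Carnivora", "Carnivora", "Carnivora",
            "Carnivora", "Cetacea", "Cetacea"),
  Binomial = c("Canis latrans", "Canis lupus", "Canis simensis",
               "Atelocynus microtis", "Balaenoptera musculus",
               "Balaenoptera physalus"),
  AdultBodyMass_g = c(11989.1, 31756.51, 14361.86,
                      8363.22, 154321304.5, 47506008.23),
  LitterSize = c(5.72, 4.98, NA, NA, 1, 1.01)
)

# Summarise traits for each taxonomic order
by_order <- pantheria_preview |>
  group_by(Order) |>
  summarise(n_species = n(),
            mean_mass_kg = mean(AdultBodyMass_g) / 1000,
            mean_litter_size = mean(LitterSize, na.rm = TRUE))

by_order
```

Without `na.rm = TRUE`, the mean litter size for Carnivora would be NA, because two of its four species have no recorded value.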

Hands-on Analysis

  • Open part_2.Rmd

ChatGPT Tips for R

General Tips

  • Always confirm that ChatGPT’s outputs are correct
  • Provide as much detail as possible about the problem in the first prompt
  • Use separate chats for separate tasks/projects
  • Try the ‘Custom Instructions’ feature, which adds additional information to every prompt
  • ChatGPT can visit webpages (GPT-4 only), which can help it give more specific answers

Code Tips

  • Commented R code yields better responses in my experience
  • Provide the code and error message in the same prompt
  • ChatGPT can work well to convert syntax and improve your code:
    • “Turn this loop into a function: [your code]”
    • “Is there a better way to do this: [your code]”
  • Check out the file: example_code/1_convert_syntax_example.R for an example use case
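
To give a sense of what a “turn this loop into a function” request might produce, here is a hypothetical before-and-after pair. It is not the content of example_code/1_convert_syntax_example.R, and the names `ranges_loop`, `column_ranges` and `ranges_fn` are made up for this sketch. It uses the iris data frame that ships with R.

```r
# Before: a loop that computes the range (max - min) of each numeric column
ranges_loop <- numeric(0)
for (col in names(iris)) {
  if (is.numeric(iris[[col]])) {
    ranges_loop[col] <- max(iris[[col]]) - min(iris[[col]])
  }
}

# After: the same logic as a function that works on any data frame
column_ranges <- function(df) {
  numeric_cols <- df[sapply(df, is.numeric)]           # keep numeric columns
  sapply(numeric_cols, function(x) max(x) - min(x))    # range of each column
}

ranges_fn <- column_ranges(iris)

# Confirm that the rewritten code gives the same answer as the original
all.equal(ranges_loop, ranges_fn)
```

Comparing the new result against the original, as in the last line, is one simple way to confirm that suggested code is correct.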

Finding R Packages

Key Questions

  • What assay was the package designed for?
  • When was the last release?
  • Is it maintained (frequent updates)?
  • Does it work on all operating systems?
  • Are other people using it? (citations)
  • Do the maintainers respond to GitHub issues?
  • Is there a benchmarking paper?

Bioconductor and CRAN

  • Both have stringent requirements for the packages they host (e.g., Bioconductor packages have to run on all major operating systems)

  • Prefer Bioconductor packages over CRAN packages when both are available

  • Prefer CRAN packages over ones only hosted on GitHub

Start with the Assay

  • Go to the Bioconductor BiocViews page
  • Pick the assay you want to analyse
  • Pick the type of analysis you want to do
  • Find a package that does it
  • Find benchmarking papers to narrow the list of packages down
  • Find the vignette on the package page and refer to the manual for any questions not covered by it

Additional Resources

R

Statistics

RNA-seq Analysis

Dimensional Reduction

End of Part 2

Workshop survey

  • Please fill out our workshop survey so we can continue to improve these workshops

Upcoming Workshops

  1. Introduction to Statistics, Experimental Design, and Hypothesis Testing
    • Jan 25, 2024 (Session 1 - 10am–12pm) (Session 2 - 1pm–3pm)
    • Jan 26, 2024 (Session 3 - 10am–12pm)
  2. Intermediate RNA-Seq Analysis Using R
    • Feb 1, 2024 (9:30am–12:00pm)