This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com. Guide to markdown syntax https://www.markdownguide.org/basic-syntax/.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
# Simulates 100 observations from a normal distribution
# and plots a histogram
val <- rnorm(n = 100)
hist(val, breaks = 20)
# The ggplot version of the same plot
ggplot(data = tibble(values = val),
mapping = aes(x = values)) +
geom_histogram(bins = 20)
Important: before running the code below, click Session -> Set Working Directory -> To Source File Location
The data we will be analyzing is from the PanTHERIA database which is “a global species-level data set of key life-history, ecological and geographical traits of all known extant and recently extinct mammals (PanTHERIA) developed for a number of macroecological and macroevolutionary research projects.”
# The data is spread across 3 sheets in an excel file. We need to
# combine these data into one table/data frame.
# na = "NA" tells read_xlsx how missing values appear in the data
# the default is empty cells. Run "?read_xlsx" for more info
sheet1 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 1, na = "NA")
sheet2 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 2, na = "NA")
sheet3 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 3, na = "NA")
# rbind (row-bind) combines data frames by row
pantheria <- rbind(sheet1, sheet2, sheet3)
# How many rows and columns are there?
nrow(pantheria)
## [1] 2161
ncol(pantheria)
## [1] 54
# What does the data look like?
head(pantheria)
## # A tibble: 6 × 54
## Order Family Genus Species Binomial ActivityCycle AdultBodyMass_g
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Carnivora Canidae Canis latrans Canis l… crepuscular 11989.
## 2 Carnivora Canidae Canis lupus Canis l… crepuscular 31757.
## 3 Carnivora Canidae Canis simens… Canis s… diurnal 14362.
## 4 Carnivora Canidae Atel… microt… Atelocy… <NA> 8363.
## 5 Cetacea Balaenopteridae Bala… muscul… Balaeno… <NA> 154321304.
## 6 Cetacea Balaenopteridae Bala… physal… Balaeno… <NA> 47506008.
## # ℹ 47 more variables: AdultForearmLen_mm <dbl>, AdultHeadBodyLen_mm <dbl>,
## # AgeatEyeOpening_d <dbl>, AgeatFirstBirth_d <dbl>,
## # BasalMetRate_mLO2hr <dbl>, BasalMetRateMass_g <dbl>, DietBreadth <dbl>,
## # DispersalAge_d <dbl>, GestationLen_d <dbl>, HabitatBreadth <dbl>,
## # HomeRange_km2 <dbl>, HomeRange_Indiv_km2 <dbl>, InterbirthInterval_d <dbl>,
## # LitterSize <dbl>, LittersPerYear <dbl>, MaxLongevity_m <dbl>,
## # NeonateBodyMass_g <dbl>, NeonateHeadBodyLen_mm <dbl>, …
We will exploring adult body mass from these mammals as it relates to
their trophic level using dpylr and ggplot2.
Download the cheatsheets for these packages at the following links:
Let’s start by subsetting the data with select()
# Pipes (|>) work by passing the data in front of the pipe to the first argument
# of the function after it, this prevents a lot of nested function calls and makes
# code easier to read.
pantheria <- pantheria |> # Passes pantheria as the first argument of select
select(Order,
Family, # select returns the specified columns
Genus,
Species,
TrophicLevel,
AdultBodyMass_g) |>
drop_na() |> # Remove any rows that have NAs
distinct() # Remove any duplicate rows
Data is almost never clean, for example there should be only 3 trophic levels:
unique(pantheria$TrophicLevel) # unique elements of a vector
## [1] "carnivore" "herbivore" "omnivore" "Omnivore"
Let’s fix the TrophicLevel column using mutate()
# mutate allows us to add columns or modify existing ones
pantheria <- pantheria |>
mutate(TrophicLevel = tolower(TrophicLevel)) # Make column lowercase
Now we can summarize the adult body mass by trophic level by computing standard metrics like mean and standard deviation.
pantheria |>
group_by(TrophicLevel) |> # Group observations by this column
summarize(Mean = mean(AdultBodyMass_g), # Summarize will calculate these group wise
`Standard Deviation` = sd(AdultBodyMass_g), # Quasi quotation lets us add spaces to column names
Min = min(AdultBodyMass_g),
Max = max(AdultBodyMass_g)) |>
ungroup() |>
arrange(desc(Mean)) # Order the data frame by descending mean body mass
## # A tibble: 3 × 5
## TrophicLevel Mean `Standard Deviation` Min Max
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 carnivore 824948. 7963841. 1.96 154321304.
## 2 omnivore 77419. 1173853. 3.29 27324024.
## 3 herbivore 59854. 305498. 5.55 4750000.
According to the table above, body masses have a really wide range across trophic levels. Let’s visualize the distribution of adult body masses.
# ggplot2 constructs graphics in layers, each layer is separated by "+"
# x and y values are supplied in aes(), the type of plot is specified using
# the "geom" functions
ggplot(data = pantheria, # input data
mapping = aes(x = log2(AdultBodyMass_g))) + # log2 transform adult body mass
geom_histogram(fill = "#CE3274", # type of plot
bins = 40) +
xlab(label = "log2 Adult Body Mass (g)") + # x label
ylab(label = "Frequency") + # y label
labs(title = "Histogram of log2 Adult Body Mass") # title
The data looks skewed even after log2 transformation. Let’s view the distribution by trophic level.
pantheria |>
ggplot(aes(x = log2(AdultBodyMass_g), fill = TrophicLevel)) + # Color by trophic level
geom_histogram(bins = 40) +
facet_grid(rows = vars(TrophicLevel)) + # Split the plot into rows by trophic level
ylab(label = "Frequency") +
xlab(label = "log2 Adult Body Mass (g)") +
labs(title = "Histograms of log2 Adult Body Mass") +
theme(plot.title = element_text(hjust = 0.5)) # Center the plot title
It is clear that trophic level does have an impact on the
distribution of adult body mass, carnivores tend to be smaller which
makes sense because carnivores have higher metabolic demands and so
there might be a selection pressure towards smaller carnivores. If we
wanted to confirm this by fitting a model, we could use the
lm() function to fit a linear model.
An important caveat of the data is that some Orders of mammals are
more biodiverse than others and are therefore over represented in the
dataset. Using the dplyr cheatsheet, write code generates a
table of Orders and what percentage of the data they are. Scroll down to
see the hint if you are having trouble.
# Your code
pantheria |>
group_by(Order) |>
summarise(n = n()) |>
arrange(desc(n))
## # A tibble: 29 × 2
## Order n
## <chr> <int>
## 1 Rodentia 575
## 2 Chiroptera 393
## 3 Carnivora 227
## 4 Primates 175
## 5 Artiodactyla 164
## 6 Diprotodontia 80
## 7 Soricomorpha 80
## 8 Cetacea 49
## 9 Didelphimorphia 39
## 10 Lagomorpha 37
## # ℹ 19 more rows
Exercise 7.1 hint: group_by + summarize(n = n()) + arrange
Now that we see what the over represented Orders are, we can plot their body masses by trophic level to see if they are skewing the overall distributions.
top_orders <- c("Rodentia", "Chiroptera") # Character vector of the top 2 Orders from above
# filter uses a conditional to select rows from the data
pantheria |>
filter(Order %in% top_orders) |>
ggplot(aes(x = log2(AdultBodyMass_g), fill = Order)) +
geom_histogram(bins = 40) +
facet_grid(rows = vars(TrophicLevel),
cols = vars(Order)) +
ylab(label = "Frequency") +
xlab(label = "log2 Adult Body Mass (g)") +
theme(plot.title = element_text(hjust = 0.5))
It looks like one of them is mostly made up of small carnivores. Let’s remove it and redo the plot of body mass distribution by trophic level.
pantheria |>
filter(Order != "Chiroptera") |>
ggplot(aes(x = log2(AdultBodyMass_g),fill = TrophicLevel)) + # Color by trophic level
geom_histogram(bins = 40) +
facet_grid(rows = vars(TrophicLevel)) + # Split the plot into rows by trophic level
ylab(label = "Frequency") +
xlab(label = "log2 Adult Body Mass (g)") +
labs(title = "Histograms of log2 Adult Body Mass")
We can see now that body mass of carnivorous mammals is much less skewed than the initial plots show. There is still an effect of trophic level on body mass, but the effect size is likely much smaller than we would estimate by including all 393 Chiroptera.
# Set the reference level to herbivore
pantheria <- pantheria |>
mutate(TrophicLevel = factor(TrophicLevel,
levels = c("herbivore","omnivore","carnivore")))
lm(data = pantheria, log2(AdultBodyMass_g) ~ TrophicLevel) |>
tidy()
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 10.2 0.166 61.6 0
## 2 TrophicLevelomnivore -1.72 0.238 -7.22 7.52e-13
## 3 TrophicLevelcarnivore -2.82 0.253 -11.2 4.66e-28
# Remove Chiroptera
pantheria <- pantheria |>
filter(Order != "Chiroptera")
lm(data = pantheria, log2(AdultBodyMass_g) ~ TrophicLevel) |>
tidy()
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 11.0 0.173 63.6 0
## 2 TrophicLevelomnivore -2.08 0.243 -8.53 3.26e-17
## 3 TrophicLevelcarnivore -0.813 0.296 -2.74 6.16e- 3
Now that we have generated these plots and results, we can generate a
full report that contains all of the text and code, click
Knit to render the HTML report.
Hopefully this workshop has provided a good foundation for you to learn R. If you would like some additional practice, check out the resources on the workshop wiki. R also contains many built in datasets you can use for practice:
| Package | Item | Title |
|---|---|---|
| ggplot2 | diamonds | Prices of over 50,000 round cut diamonds |
| ggplot2 | economics | US economic time series |
| ggplot2 | economics_long | US economic time series |
| ggplot2 | faithfuld | 2d density estimate of Old Faithful data |
| ggplot2 | luv_colours | ‘colors()’ in Luv space |
| ggplot2 | midwest | Midwest demographics |
| ggplot2 | mpg | Fuel economy data from 1999 to 2008 for 38 popular models of cars |
| ggplot2 | msleep | An updated and expanded version of the mammals sleep dataset |
| ggplot2 | presidential | Terms of 12 presidents from Eisenhower to Trump |
| ggplot2 | seals | Vector field of seal movements |
| ggplot2 | txhousing | Housing sales in TX |
| tidyr | billboard | Song rankings for Billboard top 100 in the year 2000 |
| tidyr | cms_patient_care | Data from the Centers for Medicare & Medicaid Services |
| tidyr | cms_patient_experience | Data from the Centers for Medicare & Medicaid Services |
| tidyr | construction | Completed construction in the US in 2018 |
| tidyr | fish_encounters | Fish encounters |
| tidyr | household | Household data |
| tidyr | population | World Health Organization TB data |
| tidyr | relig_income | Pew religion and income survey |
| tidyr | smiths | Some data about the Smith family |
| tidyr | table1 | Example tabular representations |
| tidyr | table2 | Example tabular representations |
| tidyr | table3 | Example tabular representations |
| tidyr | table4a | Example tabular representations |
| tidyr | table4b | Example tabular representations |
| tidyr | table5 | Example tabular representations |
| tidyr | us_rent_income | US rent and income data |
| tidyr | who | World Health Organization TB data |
| tidyr | who2 | World Health Organization TB data |
| tidyr | world_bank_pop | Population data from the world bank |
| dplyr | band_instruments | Band membership |
| dplyr | band_instruments2 | Band membership |
| dplyr | band_members | Band membership |
| dplyr | starwars | Starwars characters |
| dplyr | storms | Storm tracks data |
| datasets | AirPassengers | Monthly Airline Passenger Numbers 1949-1960 |
| datasets | BJsales | Sales Data with Leading Indicator |
| datasets | BJsales.lead (BJsales) | Sales Data with Leading Indicator |
| datasets | BOD | Biochemical Oxygen Demand |
| datasets | CO2 | Carbon Dioxide Uptake in Grass Plants |
| datasets | ChickWeight | Weight versus age of chicks on different diets |
| datasets | DNase | Elisa assay of DNase |
| datasets | EuStockMarkets | Daily Closing Prices of Major European Stock Indices, 1991-1998 |
| datasets | Formaldehyde | Determination of Formaldehyde |
| datasets | HairEyeColor | Hair and Eye Color of Statistics Students |
| datasets | Harman23.cor | Harman Example 2.3 |
| datasets | Harman74.cor | Harman Example 7.4 |
| datasets | Indometh | Pharmacokinetics of Indomethacin |
| datasets | InsectSprays | Effectiveness of Insect Sprays |
| datasets | JohnsonJohnson | Quarterly Earnings per Johnson & Johnson Share |
| datasets | LakeHuron | Level of Lake Huron 1875-1972 |
| datasets | LifeCycleSavings | Intercountry Life-Cycle Savings Data |
| datasets | Loblolly | Growth of Loblolly pine trees |
| datasets | Nile | Flow of the River Nile |
| datasets | Orange | Growth of Orange Trees |
| datasets | OrchardSprays | Potency of Orchard Sprays |
| datasets | PlantGrowth | Results from an Experiment on Plant Growth |
| datasets | Puromycin | Reaction Velocity of an Enzymatic Reaction |
| datasets | Seatbelts | Road Casualties in Great Britain 1969-84 |
| datasets | Theoph | Pharmacokinetics of Theophylline |
| datasets | Titanic | Survival of passengers on the Titanic |
| datasets | ToothGrowth | The Effect of Vitamin C on Tooth Growth in Guinea Pigs |
| datasets | UCBAdmissions | Student Admissions at UC Berkeley |
| datasets | UKDriverDeaths | Road Casualties in Great Britain 1969-84 |
| datasets | UKgas | UK Quarterly Gas Consumption |
| datasets | USAccDeaths | Accidental Deaths in the US 1973-1978 |
| datasets | USArrests | Violent Crime Rates by US State |
| datasets | USJudgeRatings | Lawyers’ Ratings of State Judges in the US Superior Court |
| datasets | USPersonalExpenditure | Personal Expenditure Data |
| datasets | UScitiesD | Distances Between European Cities and Between US Cities |
| datasets | VADeaths | Death Rates in Virginia (1940) |
| datasets | WWWusage | Internet Usage per Minute |
| datasets | WorldPhones | The World’s Telephones |
| datasets | ability.cov | Ability and Intelligence Tests |
| datasets | airmiles | Passenger Miles on Commercial US Airlines, 1937-1960 |
| datasets | airquality | New York Air Quality Measurements |
| datasets | anscombe | Anscombe’s Quartet of ‘Identical’ Simple Linear Regressions |
| datasets | attenu | The Joyner-Boore Attenuation Data |
| datasets | attitude | The Chatterjee-Price Attitude Data |
| datasets | austres | Quarterly Time Series of the Number of Australian Residents |
| datasets | beaver1 (beavers) | Body Temperature Series of Two Beavers |
| datasets | beaver2 (beavers) | Body Temperature Series of Two Beavers |
| datasets | cars | Speed and Stopping Distances of Cars |
| datasets | chickwts | Chicken Weights by Feed Type |
| datasets | co2 | Mauna Loa Atmospheric CO2 Concentration |
| datasets | crimtab | Student’s 3000 Criminals Data |
| datasets | discoveries | Yearly Numbers of Important Discoveries |
| datasets | esoph | Smoking, Alcohol and (O)esophageal Cancer |
| datasets | euro | Conversion Rates of Euro Currencies |
| datasets | euro.cross (euro) | Conversion Rates of Euro Currencies |
| datasets | eurodist | Distances Between European Cities and Between US Cities |
| datasets | faithful | Old Faithful Geyser Data |
| datasets | fdeaths (UKLungDeaths) | Monthly Deaths from Lung Diseases in the UK |
| datasets | freeny | Freeny’s Revenue Data |
| datasets | freeny.x (freeny) | Freeny’s Revenue Data |
| datasets | freeny.y (freeny) | Freeny’s Revenue Data |
| datasets | infert | Infertility after Spontaneous and Induced Abortion |
| datasets | iris | Edgar Anderson’s Iris Data |
| datasets | iris3 | Edgar Anderson’s Iris Data |
| datasets | islands | Areas of the World’s Major Landmasses |
| datasets | ldeaths (UKLungDeaths) | Monthly Deaths from Lung Diseases in the UK |
| datasets | lh | Luteinizing Hormone in Blood Samples |
| datasets | longley | Longley’s Economic Regression Data |
| datasets | lynx | Annual Canadian Lynx trappings 1821-1934 |
| datasets | mdeaths (UKLungDeaths) | Monthly Deaths from Lung Diseases in the UK |
| datasets | morley | Michelson Speed of Light Data |
| datasets | mtcars | Motor Trend Car Road Tests |
| datasets | nhtemp | Average Yearly Temperatures in New Haven |
| datasets | nottem | Average Monthly Temperatures at Nottingham, 1920-1939 |
| datasets | npk | Classical N, P, K Factorial Experiment |
| datasets | occupationalStatus | Occupational Status of Fathers and their Sons |
| datasets | precip | Annual Precipitation in US Cities |
| datasets | presidents | Quarterly Approval Ratings of US Presidents |
| datasets | pressure | Vapor Pressure of Mercury as a Function of Temperature |
| datasets | quakes | Locations of Earthquakes off Fiji |
| datasets | randu | Random Numbers from Congruential Generator RANDU |
| datasets | rivers | Lengths of Major North American Rivers |
| datasets | rock | Measurements on Petroleum Rock Samples |
| datasets | sleep | Student’s Sleep Data |
| datasets | stack.loss (stackloss) | Brownlee’s Stack Loss Plant Data |
| datasets | stack.x (stackloss) | Brownlee’s Stack Loss Plant Data |
| datasets | stackloss | Brownlee’s Stack Loss Plant Data |
| datasets | state.abb (state) | US State Facts and Figures |
| datasets | state.area (state) | US State Facts and Figures |
| datasets | state.center (state) | US State Facts and Figures |
| datasets | state.division (state) | US State Facts and Figures |
| datasets | state.name (state) | US State Facts and Figures |
| datasets | state.region (state) | US State Facts and Figures |
| datasets | state.x77 (state) | US State Facts and Figures |
| datasets | sunspot.month | Monthly Sunspot Data, from 1749 to “Present” |
| datasets | sunspot.year | Yearly Sunspot Data, 1700-1988 |
| datasets | sunspots | Monthly Sunspot Numbers, 1749-1983 |
| datasets | swiss | Swiss Fertility and Socioeconomic Indicators (1888) Data |
| datasets | treering | Yearly Treering Data, -6000-1979 |
| datasets | trees | Diameter, Height and Volume for Black Cherry Trees |
| datasets | uspop | Populations Recorded by the US Census |
| datasets | volcano | Topographic Information on Auckland’s Maunga Whau Volcano |
| datasets | warpbreaks | The Number of Breaks in Yarn during Weaving |
| datasets | women | Average Heights and Weights for American Women |