layout: true <style> .onehundredtwenty { font-size: 120%; } <style> .ninety { font-size: 90%; } .eightyfive { font-size: 85%; } .eighty { font-size: 80%; } .seventyfive { font-size: 75%; } .seventy { font-size: 70%; } .fifty { font-size: 50%; } .forty { font-size: 40%; } </style> --- class: banner --- name: title-slide class: primary #.fancy[A {fun} Intro to R Programming] ###.fancy[The basics,<br>data wrangling,<br>and more!] <br> Fabio Votta [
@favstats](http://twitter.com/favstats)<br> [
@favstats](http://github.com/favstats)<br> [
favstats.eu](https://www.favstats.eu) August 14 2023 .fifty[Link to slides: [favstats.github.io/ds3_r_intro](https://favstats.github.io/ds3_r_intro)] --- ### Your friendly neighborhood R Instructor .leftcol40[ <img src="https://github.com/favstats/WarwickSpringCamp_QTA/blob/main/docs/slides/day1/images/me.jpg?raw=true" style="width: 90%" /> ] .rightcol60[ + Ph.D. Candidate in Political Communication at University of Amsterdam + Passionate about R and Data Science + I love to travel + I enjoy and (occasionally) create R memes <center> <img src="https://pbs.twimg.com/profile_banners/866598352577867776/1495448334/1500x500" width="60%"> .font60[ [twitter.com/rstatsmemes](https://twitter.com/rstatsmemes) </center> ] ] --- class: middle, center ## But enough of me ### Let's learn something about .blue[you]! <center> <img src="images/mentimeter_qr_code.png" width="20%"> </center> Go to `menti.com` and type in the code: 6410 5559 or visit this website: [menti.com/aloxtiea8mx3](https://www.menti.com/aloxtiea8mx3) --- <div style='position: relative; padding-bottom: 56.25%; padding-top: 35px; height: 0; overflow: hidden;'><iframe sandbox='allow-scripts allow-same-origin allow-presentation' allowfullscreen='true' allowtransparency='true' frameborder='0' height='315' src='https://www.mentimeter.com/app/presentation/al367t83vwruzc9sy2tu65mrpwjfqqtt/embed' style='position: absolute; top: 0; left: 0; width: 100%; height: 100%;' width='420'></iframe></div> --- ### It's not unusual to struggle at first but it gets better! <img src="images/r_first_then_new.png" width="80%" style="display: block; margin: auto;" /> <!-- ![](images/r_first_then_new.png){width=50%} --> .fifty[Illustration adapted from [Allison Horst](https://twitter.com/allison_horst)] -- + My experience is that this stuff isn't super easy... but it gets better! 
-- + Awesome inclusive community that is always ready to help + Active blogosphere with use cases and examples --- ### A Note on Live-Coding <img src="https://cdn.myportfolio.com/45214904-6a61-4e23-98d6-b140f8654a40/118ae091-4329-4382-8cb8-012035593dfc_rw_1920.png?h=29e95985575f458232d53ad93d1860cb" width="80%" style="display: block; margin: auto;" /> --- ## Overview + R Basics + Operators + Objects (inc. vectors) + Functions + Exercises + Data frames `\(\text{B R E A K at 2pm (CET) - (5:30 PM IST) - (8:00 PM CST)}\)` + Data Manipulation + the tidyverse and friends + `janitor` + `tidyr` + `dplyr` + Exercises --- <!-- #### What is <img src="images/Rlogo.svg" width="30px" inline-block/> ? --> #### What is <img src="images/Rlogo.svg" style="display: inline-block; margin: 0"; width="30px"/>? R is a .fancy[statistical] programming language developed for data analysis and visualization. #### What is <img src="images/rstudio.png" style="display: inline-block; margin: 0"; width="80px"/>? RStudio is an IDE (Integrated Development Environment). * Write, save and open R Code (.R/.Rmd files) * Provides syntax-highlighting and auto-completion & much more -- # But why learn <img src="images/Rlogo.svg" style="display: inline-block; margin: 0"; width="60px"/>? <!-- ![](http://www.favstats.eu/img/headers/season_views.png) --> --- <!-- #### What is <img src="images/Rlogo.svg" width="30px" inline-block/> ? --> .leftcol[ #### Why learn <img src="images/Rlogo.svg" style="display: inline-block; margin: 0"; width="30px"/>? 
<!-- ![](http://www.favstats.eu/img/headers/season_views.png) --> + Amazing Community .forty[(but I already said that)] <!-- + Outstanding repertoire of statistical & computational methods --> <!-- -- --> <!-- + Integrates well with other programming languages (like Python) --> <!-- -- --> <!-- + Beautiful Data Visualization with `ggplot2` and more --> ] .rightcol[ ![](https://raw.githubusercontent.com/allisonhorst/stats-illustrations/main/rstats-artwork/welcome_to_rstats_twitter.png) .fifty[Artist: [Allison Horst](https://github.com/allisonhorst)] ] --- <!-- #### What is <img src="images/Rlogo.svg" width="30px" inline-block/> ? --> .leftcol60[ #### Why learn <img src="images/Rlogo.svg" style="display: inline-block; margin: 0"; width="30px"/>? <!-- ![](http://www.favstats.eu/img/headers/season_views.png) --> + Amazing Community .forty[(but I already said that)] + Data wrangling is accessible & fun <!-- + Outstanding repertoire of statistical & computational methods --> <!-- -- --> <!-- + Integrates well with other programming languages (like Python) --> <!-- -- --> <!-- + Beautiful Data Visualization with `ggplot2` and more --> ] .rightcol40[ <br> ![](https://github.com/allisonhorst/stats-illustrations/blob/main/rstats-artwork/tidyverse_celestial.png?raw=true) .fifty[Artist: [Allison Horst](https://github.com/allisonhorst)] ] --- .leftcol60[ #### Why learn <img src="images/Rlogo.svg" style="display: inline-block; margin: 0"; width="30px"/>? <!-- ![](http://www.favstats.eu/img/headers/season_views.png) --> + Amazing Community .forty[(but I already said that)] + Data wrangling is accessible & fun + Outstanding repertoire of statistical & computational methods <!-- + Integrates well with other programming languages (like Python) --> <!-- + Beautiful Data Visualization with `ggplot2` and more --> ] .rightcol40[ <br> ![](https://pbs.twimg.com/media/E4mDfxSXEAA_kvG.jpg) .fifty[ [easystats](https://github.com/easystats/easystats) packageverse for statistical analysis.] 
] --- .leftcol60[ #### Why learn <img src="images/Rlogo.svg" style="display: inline-block; margin: 0"; width="30px"/>? <!-- ![](http://www.favstats.eu/img/headers/season_views.png) --> + Amazing Community .forty[(but I already said that)] + Data wrangling is accessible & fun + Outstanding repertoire of statistical & computational methods + Integrates well with other programming languages <!-- + Beautiful Data Visualization with `ggplot2` and more --> ] .rightcol40[ <br> ![](https://rstudio.github.io/reticulate/images/reticulated_python.png) .fifty[ [reticulate](https://github.com/easystats/easystats) integrates Python into R] ] --- .leftcol60[ #### Why learn <img src="images/Rlogo.svg" style="display: inline-block; margin: 0"; width="30px"/>? <!-- ![](http://www.favstats.eu/img/headers/season_views.png) --> + Amazing Community .forty[(but I already said that)] + Data wrangling is accessible & fun + Outstanding repertoire of statistical & computational methods + Integrates well with other programming languages + Beautiful data visualization with `ggplot2` <br> <br> .right[.fifty[Data visualization by [Cédric Scherer](https://www.cedricscherer.com/)]] ] .rightcol40[ ![](https://raw.githubusercontent.com/Z3tt/TidyTuesday/master/plots/2020_31/2020_31_PalmerPenguins.png?raw=true) ] --- .leftcol75[ ### This workshop is held in .green[Rmarkdown] ] .rightcol25[ <img src="https://rmarkdown.rstudio.com/docs/reference/figures/logo.png" width="80" height="100" style="display: block; margin: auto 0 auto auto;" /> ] <center> <img src="https://github.com/allisonhorst/stats-illustrations/blob/main/rstats-artwork/rmarkdown_wizards.png?raw=true" style="width: 76%" /> </center> .fifty[Artist: [Allison Horst](https://github.com/allisonhorst)] --- .leftcol75[ ## .green[Rmarkdown] ] .rightcol25[ <img src="https://rmarkdown.rstudio.com/docs/reference/figures/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" /> ] .leftcol[ <center> <img 
src="https://github.com/allisonhorst/stats-illustrations/blob/main/rstats-artwork/rmarkdown_wizards.png?raw=true" style="width: 100%" /> </center> .fifty[Artist: [Allison Horst](https://github.com/allisonhorst)] ] .rightcol[ * create *documents* with R * mix code, text and graphs as you like * write your thesis, create slides, automated reports, dashboards and interactive web apps all from within [Rmarkdown](https://rmarkdown.rstudio.com/docs/articles/rmarkdown.html) ] --- ## R vs. Rmarkdown .seventy[ There are two (main) ways you can create scripts in R ] .pull-left[ .seventy[ R scripts (file ending `.R`): ] ![](images/rcode.png) ] .pull-right[ .seventy[ Rmarkdown (`.Rmd`) scripts produce HTML: ] <img src="images/rmarkown_html.png" style="width: 72%" /> ] --- ## R vs. Rmarkdown .seventy[ There are two (main) ways you can create scripts in R ] .pull-left[ .seventy[ R scripts (file ending `.R`): ] ![](images/rcode.png) ] .pull-right[ .seventy[ Rmarkdown (`.Rmd`) scripts produce PDF: ] ![](images/rmarkown_pdf.png) ] --- ## R vs. Rmarkdown .seventy[ There are two (main) ways you can create scripts in R ] .pull-left[ .seventy[ R scripts (file ending `.R`): ] ![](images/rcode.png) ] .pull-right[ .seventy[ Rmarkdown (`.Rmd`) scripts produce so much more: ] slides (*like the ones you are looking at right now!*), automated reports, dashboards and interactive web apps ] --- ## We'll switch to RStudio in a moment > The next few slides will have a couple of links that you may want to follow. 
You can find **all** materials of this workshop, including links, in this GitHub repository: [tinyurl.com/ds3repo](https://github.com/favstats/ds3_r_intro) .font80[ Hopefully at this point you have already followed the **pre-workshop instructions:** [tinyurl.com/ds3prep](https://favstats.github.io/ds3_r_intro/prep/instructions.html) *Once in RStudio, feel free to code along with me, changing a few bits here and there and see how results change.* ] --- ## Alternatives to local RStudio If you cannot install R or RStudio on your computer for any reason, you have two options to follow along: + **Binder** Simply start a **Binder** instance which will create a session of RStudio in your Browser (may take a little bit): [tinyurl.com/ds3binder](https://tinyurl.com/ds3binder) [![Binder](https://img.shields.io/badge/launch-binder-579aca.svg?logo=)](https://mybinder.org/v2/gh/favstats/ds3_r_intro/rstudio?urlpath=rstudio) > Note: If you are using Binder don't forget to download the files before you close out the session because otherwise anything you added will be lost! --- ## Alternatives to local RStudio + **Google Colab** Google Colab instantaneously runs Jupyter Notebooks in your browser with an R Kernel. + [Part I (R Basics): tinyurl.com/ds3rintro1](https://colab.research.google.com/drive/1dLsdGbkvgn1JbWgsy9Z-pFmPd_2MG4Xu?usp=sharing) + [Part II (Data Manipulation with the `tidyverse`): tinyurl.com/ds3rintro2](https://colab.research.google.com/drive/14CRElnKewnp5MnlxhqVu6OOcIXd-Bkaj?usp=sharing) --- ## Workshop files Please download the workshop files from here: [https://tinyurl.com/ds3files](https://www.dropbox.com/sh/jievqgwl43nwnbf/AADhoYQW5oMZ-JygK7aNklHra?dl=0). The link will download a `.zip` file. Extract it to **its own folder**. 
Now double-click on **ds3_intro.Rproj**

![](images/click_on_rproject.png)

---

## Within RStudio

![](images/rstudio_instructions.png)

<!-- ## RStudio Tour -->

<!-- <center> -->
<!-- <img src="https://bookdown.org/ageraci/STAT160Companion/images/rstudiopanes.png" style="width: 82%" /> -->
<!-- </center> -->

<!-- --- -->

<!-- ## RStudio Tour -->

<!-- * Program pane -->
<!-- * Shows R Files, typically .R or .Rmd -->
<!-- * Environment pane -->
<!-- * All the objects that you created during your session -->
<!-- * Console pane -->
<!-- * For executing short snippets of R code -->
<!-- * Files pane -->
<!-- * Gives access to files, but also help, and plots -->

<!-- --- -->

---

class: inverse, middle, center

# R Basics

--

### Math Operators

---

### Math Operators

At its core R is just a fancy *calculator*

You can do:

`+` addition

`-` subtraction

`*` multiplication

`/` division

`^` exponentiation

---

### Math Operators

At its core R is just a fancy *calculator*

### `+` addition

.details[
```r
15 + 5
```

```
#> [1] 20
```
]

--

### Mixing operators

.details[
```r
(15 + 5) / (2 * 5)
```

```
#> [1] 2
```
]

---

## Short excursus: a new data set appears..

But adding up numbers for no reason is no fun.

--

That's why we will use a data set about .fancy[animals] to learn some R Basics.

.leftcol45[
<img src="images/kisspng-animal-clip-art-portable-network-graphics-illustra-cartoon-animal-transparent-amp-png-clipart-free-5c7ccdbb1c9c98.7009291415516830031172.png" style="width: 100%" />
]

.rightcol55[
[Animal Ageing and Longevity Database](https://www.johnsnowlabs.com/marketplace/the-animal-aging-and-longevity-database/)

Data on over 4200 animals.

Information on age of maturity, gestation or incubation periods but also **longevity (in years)**.
]

---

## Animal Ageing and Longevity Database

Say we want to know how old an animal is in *human years*.
--

We can use the following simple formula to determine that:

<br>

`$$\frac{\text{Maximum lifespan human}}{\text{Maximum lifespan non-human animal}} = \text{animal to human years ratio}$$`

<br>

*Note: This is just a **very rough** way to determine the conversion ratio. It is **much more** complicated in [reality](https://www.akc.org/expert-advice/health/how-to-calculate-dog-years-to-human-years/).*

---

## Animal Ageing and Longevity Database

.leftcol[

| Animal | Maximum Lifespan (in years) |
| --- | --- |
| Human | 122.5 |
| Domestic dog | 24.0 |
| Domestic cat | 30.0 |
| American alligator | 77.0 |
| Golden hamster | 3.9 |
| King penguin | 26.0 |
| Lion | 27.0 |
| Greenland shark | 392.0 |
| Galapagos tortoise | 177.0 |

]

.rightcol[

| Animal | Maximum Lifespan (in years) |
| --- | --- |
| African bush elephant | 65.0 |
| California sea lion | 35.7 |
| Fruit fly | 0.3 |
| House mouse | 4.0 |
| Giraffe | 39.5 |
| Wild boar | 27.0 |

Source: [Animal Ageing and Longevity Database](https://www.johnsnowlabs.com/marketplace/the-animal-aging-and-longevity-database/)

]

---

### Math Operators

Say we want to know how old a dog is in *human years*.

--

The observed maximum lifespan of a human is 122.5 years. For dogs it is 24.

--

.details[
```r
122.5/24
```

```
#> [1] 5.104167
```
]

So for every year a dog lives, it "ages" about 5.1 human years.

How old is a 15-year-old dog in human years?

.details[
```r
5.104167*15
```

```
#> [1] 76.56251
```
]

---

## So many numbers..

Now it can be quite tedious to juggle all those numbers around.

Especially if we want to keep reusing numbers we calculated before.

--

Here to simplify that process are: **Objects**

---

class: inverse, middle, center

# R Basics

### R Objects

---

## R Objects

You can think of R objects as *saving information*, for example simple numbers or just plain text.

Once saved, we can recall the information whenever we want by just running the name of the object.

--

> Everything that exists in R is *.red[an object]*. .fifty[~John M.
Chambers] -- We create R objects by using the assignment operator: <center> .large[**`<-`**] </center> .right[ .fifty[You can also assign with the **`=`** sign but we will not be using this here.] ] --- ## R Objects Here is an example: ```r human_lifespan <- 122.5 dog_lifespan <- 24 ``` If we now run the respective objects we retrieve the saved numbers. .details[ ```r human_lifespan ``` ``` #> [1] 122.5 ``` ] .details[ ```r dog_lifespan ``` ``` #> [1] 24 ``` ] --- ## R Objects Now we can perform the same calculation as before but this time using objects! ```r dogs_to_human <- human_lifespan / dog_lifespan ``` The object `dogs_to_human` now holds the dog to human years conversion ratio. -- *Now* we ask again: how old is a 15 year old dog in human years? .details[ ```r dogs_to_human*15 ``` ``` #> [1] 76.5625 ``` ] --- ## A quick note on naming things .leftcol[ > Note that object names could be *anything* here! I could have chosen to just name them `x`, `y` and `z`. I typically use lower-case snake case in this style: `animal_rights`. Also recommended by the [tidyverse style guide](https://style.tidyverse.org/syntax.html). ] .rightcol[ <img src="images/in_that_case.png" style="width: 90%" /> ] --- ## More operators.. But wait there are more operators: .fancy[logical operators] -- Logical operators are used for logical tests which can result in either: `\(\text{TRUE}\)` or `\(\text{FALSE}\)` *(sometimes this is also called a boolean variable)* --- class: inverse, middle, center # R Basics ### Logical Operators --- ## Logical Operators Let's first create some more objects to try some logical tests! 
```r
lion_lifespan <- 27
mouse_lifespan <- 4
fly_lifespan <- 0.3
boar_lifespan <- 27
alligator_lifespan <- 77
greenland_shark_lifespan <- 392
galapagos_tortoise_lifespan <- 177
```

---

## Logical Operators

`==` asks whether two values are the same or **equal**

The code below tests the following statement:

*The maximum lifespan of a lion equals that of a boar.*

.details[
```r
lion_lifespan == boar_lifespan
```

```
#> [1] TRUE
```
]

--

Since both maximum lifespans are `27` this is of course a **`TRUE`** statement.

---

## Logical Operators

`!=` asks whether two values are *not* the same or **unequal**

The code below tests the following statement:

*The maximum lifespan of a lion **does not equal** that of a boar.*

.details[
```r
lion_lifespan != boar_lifespan
```

```
#> [1] FALSE
```
]

--

Since both maximum lifespans are `27` (as we saw before) this is of course a **`FALSE`** statement.

---

## Logical Operators

We can also test whether certain values are greater or smaller than others:

`>` greater than

The code below tests the statement:

*The lifespan of a human is greater than the lifespan of a fly.*

.details[
```r
human_lifespan > fly_lifespan
```

```
#> [1] TRUE
```
]

--

Since the maximum human lifespan is `122.5` and a fly does not live longer than `0.3` years this is of course a **`TRUE`** statement.

---

## Logical Operators

`<` smaller than

The code below tests the statement:

*The lifespan of an alligator is smaller than the lifespan of a mouse.*

.details[
```r
alligator_lifespan < mouse_lifespan
```

```
#> [1] FALSE
```
]

--

Since the maximum alligator lifespan is `77` and a mouse lives for `4` years maximum, this is of course a **`FALSE`** statement.
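
The result of a logical test is itself a value, so it can be saved in an object just like a number. A minimal sketch, assuming the lifespans from before (they are repeated here so the chunk runs on its own); the object names `human_outlives_fly` and `alligator_younger_than_mouse` are made up for this illustration:

```r
# Lifespans as defined earlier in the slides
human_lifespan <- 122.5
fly_lifespan <- 0.3
alligator_lifespan <- 77
mouse_lifespan <- 4

# Save the outcome of each logical test in its own object (hypothetical names)
human_outlives_fly <- human_lifespan > fly_lifespan
alligator_younger_than_mouse <- alligator_lifespan < mouse_lifespan

# Running the object names recalls the saved TRUE/FALSE values
human_outlives_fly
alligator_younger_than_mouse
```

Saving test results like this becomes handy once a test is reused in several places.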
--

Also note the following options: `>=` greater than or equal to, and `<=` smaller than or equal to

---

## Combine Logical Operators

We can also combine logical tests by testing multiple statements at the same time:

* `&` stands for "and" (unsurprisingly)
* `|` stands for "or"

For example both `alligator_lifespan` and `fly_lifespan` have to be greater than `mouse_lifespan` for the code below to evaluate as `TRUE`.

.details[
```r
alligator_lifespan > mouse_lifespan & fly_lifespan > mouse_lifespan
```

```
#> [1] FALSE
```
]

---

## Combine Logical Operators

We can also combine logical tests by testing multiple statements at the same time:

* `&` stands for "and" (unsurprisingly)
* `|` stands for "or"

If we say `|` (= or) instead, it means either statement evaluating to `TRUE` is enough!

.details[
```r
alligator_lifespan > mouse_lifespan | fly_lifespan > mouse_lifespan
```

```
#> [1] TRUE
```
]

---

## Even more objects..?

Now we have learned about operators and some basic objects.

But so far objects have only ever held a *single numeric value*.

R is of course much more powerful than that and objects can hold any number and type of data.

--

Now we will take a look at **vectors**, or objects that include more than one value.

---

class: inverse, middle, center

# R Basics

### Vectors

---

## Vectors

You can simply imagine vectors as a list of values. They can consist of numbers but also *strings* (or: text).
In order to create a vector in R we make use of `c()` (stands for *concatenate*)

.details[
```r
c(1, 100, 1000, 2000, 5000)
```

```
#> [1]    1  100 1000 2000 5000
```
]

We can also create a vector of strings by using quotes:

.details[
```r
c("I", "am", "a", "vector", "of", "strings")
```

```
#> [1] "I"       "am"      "a"       "vector"  "of"      "strings"
```
]

---

## Vectors

We can combine the lifespans we assigned into objects earlier:

```r
animal_lifespans <- c(greenland_shark_lifespan, dog_lifespan,
                      galapagos_tortoise_lifespan, mouse_lifespan,
                      fly_lifespan, lion_lifespan, boar_lifespan,
                      alligator_lifespan, human_lifespan)
```

.details[
```r
animal_lifespans
```

```
#> [1] 392.0  24.0 177.0   4.0   0.3  27.0  27.0  77.0 122.5
```
]

---

## Vectors

We can also create a vector of strings by using quotes:

.details[
```r
animals <- c("greenland_shark", "dog", "galapagos_tortoise",
             "mouse", "fly", "lion", "boar", "alligator", "human")
```
]

.details[
```r
animals
```

```
#> [1] "greenland_shark"    "dog"                "galapagos_tortoise"
#> [4] "mouse"              "fly"                "lion"
#> [7] "boar"               "alligator"          "human"
```
]

---

## Vectors

Now if we wanted to get the different year conversion ratios we can simply divide the maximum human lifespan by the vector.

.details[
```r
human_lifespan / animal_lifespans
```

```
#> [1]   0.3125000   5.1041667   0.6920904  30.6250000 408.3333333   4.5370370
#> [7]   4.5370370   1.5909091   1.0000000
```
]

Notice how the operation is performed for each item separately and the result is yet another vector.

---

## Vectors

We can also use *logical* operators with vectors:

.details[
```r
animal_lifespans > human_lifespan
```

```
#> [1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
```
]

Again, notice how the operation is performed for each item separately and the result is yet another vector, this time consisting of `TRUE`s and `FALSE`s.

---

## A logical operator for vectors: `%in%`

An incredibly useful operator for vectors is **`%in%`**.
The operator checks whether multiple elements occur somewhere in your vectors. Its basic usage looks like this: `\(\color{red}{\text{vector1}}\)` %in% `\(\color{orange}{\text{vector2}}\)` --- ## A logical operator for vectors: `%in%` Let's say we want to check whether `giraffe`, `greenland_shark` or `lion` occur in `animals`. If we use `|` we would have to write something like this: ```r animals == "giraffe" | animals == "greenland_shark" | animals == "lion" ``` -- With `%in%` we can simply pass a vector like this: ```r animals_to_check <- c("giraffe", "greenland_shark", "lion") animals %in% animals_to_check ``` --- ## A logical operator for vectors: `%in%` Doesn't that look much better? Now imagine you have dozens or hundreds of animals to check! --- ## A logical operator for vectors: `%in%` With **`|`** it's very repetitive and utter chaos. .fifty[ ```r animals == "honey_bee" | animals == "cardiocondyla_obscurior" | animals == "black_garden_ant" | animals == "pheidole_dentata" | animals == "squinting_bush_brown" | animals == "american_lobster" | animals == "firebelly_toad" | animals == "oriental_firebelly_toad" | animals == "yellow_bellied_toad" | animals == "american_toad" | animals == "western_toad" | animals == "yosemite_toad" | animals == "great_plains_toad" | animals == "green_toad" | animals == "canadian_toad" | animals == "red_spotted_toad" | animals == "sonoran_green_toad" | animals == "southern_toad" | animals == "veragoa_stubfoot_toad" | animals == "common_european_toad" | animals == "colorado_river_toad" | animals == "kihansi_spray_toad" | animals == "ridge_headed_toad" | animals == "cuban_toad" | animals == "european_green_toad" | animals == "colombian_giant_toad" | animals == "argentine_toad" | animals == "cane_toad" | animals == "eurura_toad" | animals == "common_horned_frog" | animals == "colombian_horned_frog" | animals == "amazonian_horned_frog" | animals == "ornate_horned_frog" | animals == "green_and_black_dart_poison_frog" ``` ``` #> [1] FALSE 
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ``` ]

---

## A logical operator for vectors: `%in%`

With **`%in%`** it's much more readable.

.fifty[
```r
animals_to_check <- c("honey_bee", "cardiocondyla_obscurior", "black_garden_ant",
                      "pheidole_dentata", "squinting_bush_brown", "american_lobster",
                      "firebelly_toad", "oriental_firebelly_toad", "yellow_bellied_toad",
                      "american_toad", "western_toad", "yosemite_toad",
                      "great_plains_toad", "green_toad", "canadian_toad",
                      "red_spotted_toad", "sonoran_green_toad", "southern_toad",
                      "veragoa_stubfoot_toad", "common_european_toad", "colorado_river_toad",
                      "kihansi_spray_toad", "ridge_headed_toad", "cuban_toad",
                      "european_green_toad", "colombian_giant_toad", "argentine_toad",
                      "cane_toad", "eurura_toad", "common_horned_frog",
                      "colombian_horned_frog", "amazonian_horned_frog", "ornate_horned_frog")
```
]

```r
animals %in% animals_to_check
```

```
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
```

---

class: inverse, middle, center

# R Basics

### Indexing (with vectors)

---

## Indexing

When you want to know a specific value within your object you can use indexing.

Indexing is done via square brackets `[]`.

The basic setup looks like this:

`$$\color{red}{\text{vector}}[\color{orange}{\text{elements}}]$$`

---

## Indexing

Extracting the first element of a vector:

```r
animal_lifespans[1]
```

```
#> [1] 392
```

```r
animals[1]
```

```
#> [1] "greenland_shark"
```

---

## Indexing

Extracting the fifth element of a vector:

```r
animal_lifespans[5]
```

```
#> [1] 0.3
```

```r
animals[5]
```

```
#> [1] "fly"
```

---

## Indexing with logical tests

You can also index using logical tests.

So if an expression evaluates to `TRUE` it will **keep** that element and when it evaluates to `FALSE` it will **remove** that element.
`$$\color{red}{\text{vector}}[\color{orange}{\text{vector of TRUE/FALSE of same length}}]$$`

---

## Indexing with logical tests

Let's first take a look at a logical test that identifies all animals that have greater lifespans than humans:

```r
longer_living <- animal_lifespans > human_lifespan

longer_living
```

```
#> [1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
```

Now we can use square brackets to only keep those animals that have greater lifespans than humans.

```r
animals[longer_living]
```

```
#> [1] "greenland_shark"    "galapagos_tortoise"
```

---

## Short excursus: variable types

There are three-ish main types of variables:

--

.seventy[
* **logical**: Boolean/binary, is either `TRUE` or `FALSE`
]

```r
class(TRUE)
```

```
#> [1] "logical"
```

.seventy[
* **character (or string)**: simple text in quotes, which may include symbols and digits, e.g. `"text"`
]

```r
class("I am a character")
```

```
#> [1] "character"
```

.seventy[
* **numeric**: Numbers. Mathematical operators can be used here.
]

```r
class(2020)
```

```
#> [1] "numeric"
```

---

## Short excursus: variable types

Another important value to consider is `NA` (*Not Available*).

`NA` is a special value that simply means *missing value*.

```r
c(12, NA, 23, 22, NA)
```

```
#> [1] 12 NA 23 22 NA
```

---

class: inverse, middle, center

# R Basics

### Functions

---

# Functions

![](https://cdn.myportfolio.com/45214904-6a61-4e23-98d6-b140f8654a40/9718f461-8060-433b-b014-b294da38d172_rw_1920.png?h=5749aa3c82e02c0c4c9428a6788714da)

Artist: [Allison Horst](https://github.com/allisonhorst/stats-illustrations)

---

## Functions

> Everything that happens in R is *.red[a function]*. .fifty[~John M. Chambers]

You can think of functions as little machines that (in most cases) process some kind of **input** and create an **output**.

Input is everything that goes *into* a function:

* **arguments** you can think of as (pre-determined) input types like a lever or numpad.
* **values** you can think of as the various settings that the levers or numpads can have.

`$$\text{function_name}(\color{orange}{\text{argument}}=\color{lightblue}{\text{value}})$$`

---

## Functions

> Everything that happens in R is *.red[a function]*. .fifty[~John M. Chambers]

You can think of functions as little machines that (in most cases) process some kind of **input** and create an **output**.

Input is everything that goes *into* a function:

* **arguments** you can think of as (pre-determined) input types like a lever or numpad.
* **values** you can think of as the various settings that the levers or numpads can have.

*Let's take a look at an example: the star producer!*

---

<img src="https://raw.githubusercontent.com/favstats/hertieschool_datasciencesummerschool/master/img/starproducer2.png" width="90%" />

---

## The Star Producer

Let's consider the following function (that does not exist unfortunately):

A `star_producer`!

This little machine creates tiny hand-drawn stars depending on some input.

It takes two arguments:

* `how_many` tells the machine how many stars to produce
* `type` tells the machine what the stars should look like (in this case the machine only supports `"squiggly"` stars but it could be upgraded in the future when we learn how to create our own functions later on)

---

## Getting `?help`

How do we know which function takes which kind of arguments?

Within R you can always run the code:

```r
?function_name
```

This opens the *documentation* for the function, which tells you how to use it.

Googling the function (adding R or rstats) will also bring you to some documentation in most cases!

---

## An example function: `seq`

--

From the help file we can learn that this function is used to

> "[g]enerate regular sequences".

Its first three arguments are these:

* `from`, `to`: the starting and (maximal) end values of the sequence.
* `by`: number: increment of the sequence.

Let's first take a look at this within our machine analogy.
---

<img src="https://raw.githubusercontent.com/favstats/hertieschool_datasciencesummerschool/master/img/seq2.png" width="90%" />

---

## An example function: `seq`

If we want to create a vector from 1 to 10 that increments by 1 we can simply specify the following input values for the arguments:

* `from`: 1
* `to`: 10
* `by`: 1

This is what that looks like in code:

```r
seq(from = 1, to = 10, by = 1)
```

```
#>  [1]  1  2  3  4  5  6  7  8  9 10
```

---

### A tip

There is however a much simpler way to create sequences that increment by one.

Simply use a *`:`* between two numbers and it generates a sequence:

```r
1:10
```

```
#>  [1]  1  2  3  4  5  6  7  8  9 10
```

```r
1000:1010
```

```
#>  [1] 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010
```

---

### Passing Values

Now there are two ways to pass values to functions in R:

1. Pass by argument *names* (we already did this!)
2. Pass by argument *position*

In the former case, we specifically mention which arguments we want to pass our values to.

For that, it doesn't matter in which **order** we pass our arguments.

```r
seq(to = 10, by = 1, from = 1)
```

```
#>  [1]  1  2  3  4  5  6  7  8  9 10
```

---

### Passing Values by position

But: **coders are lazy**.

There is no need to always specify which argument you mean exactly when you can just match *by position*.

So our sequence example could just as well look like this:

```r
seq(1, 10, 1)
```

```
#>  [1]  1  2  3  4  5  6  7  8  9 10
```

And it works because the documentation tells us that the first three arguments are `from`, `to`, and `by`.

In the future you will often see people leave out the argument names completely, so it's good to get used to it.

---

## More examples: Mean and Median

Many functions have such intuitive arguments that we often don't even need to look up the documentation.

An easy function to use is `mean` which simply calculates the average of a numeric vector.

Let's try this with the `animal_lifespans` vector we created earlier.
```r mean(animal_lifespans) ``` ``` #> [1] 94.53333 ``` The mean value is quite high! --- ## More examples: Mean and Median We can also try and take the median value: ```r median(animal_lifespans) ``` ``` #> [1] 27 ``` When we take the median instead of the mean we can see that this is due to high outliers (the median is of course more robust to extreme values). There are many more functions in R and we will get to learn some of them during this workshop. --- class: inverse, middle, center # R Basics ### Creating our own Functions -- <center> <img src="https://media.giphy.com/media/f74WDV59cP0NArh8gu/giphy.gif" style="width: 50%" /> </center> --- ## Creating our own functions We can create our own function using the call: `function()`. We encode what is supposed to happen within curly brackets `{}`. Here is the anatomy of a function: `\(\color{purple}{\text{my_function_name}}\)` <- `\(\text{function}(\color{orange}{\text{argument}})\)`{ `\(\color{green}{\text{# function body}}\)` `\(\color{lightblue}{\text{output}}\)` <- `\(\color{orange}{\text{argument}}\)` `\(\text{return(}\color{lightblue}{\text{output}}\text{)}\)` } --- ## Creating our own functions * `\(\color{purple}{\text{Function name}}\)`: * An identifier by which the function is called * `\(\color{orange}{\text{Argument(s)}}\)`: * Contains a list of values passed to the function * Can also contain a default value like this: `argument = 1` * `\(\color{green}{\text{Function body}}\)`: * This is executed each time the function is called * `\(\color{lightblue}{\text{Return value}}\)`: * Ends function call & sends the value back to the global environment --- ## Creating our own functions Let's try this basic example: ```r my_function_name <- function(argument){ # function body output <- argument return(output) } ``` ```r my_function_name("I am output!") ``` ``` #> [1] "I am output!" 
``` > Tip: In RStudio we can just type `fun` and enter after the popup and RStudio will just automatically generate a template for a function. --- ## Creating our own functions We can also specify *default values* for our arguments: ```r my_function_name <- function(argument = "I am a default value"){ # function body output <- argument return(output) } ``` ```r my_function_name() ``` ``` #> [1] "I am a default value" ``` --- ## Creating our own functions Let's create a slightly more useful function: a function which squares numeric values. ```r square <- function(here_goes_my_number) { output <- here_goes_my_number^2 return(output) } ``` ```r square(2) ``` ``` #> [1] 4 ``` --- ## Creating our own functions Let's create a function that is able to calculate dog years into human years. We call the function `dog_to_human_years`. ```r dog_to_human_years <- function(animal_years){ human_lifespan <- 122.5 dog_lifespan <- 24 ratio <- human_lifespan/dog_lifespan human_years <- animal_years*ratio return(human_years) } ``` ```r dog_to_human_years(15) ``` ``` #> [1] 76.5625 ``` --- ## A quick note on errors and debugging Soon we will go into exercise mode! Before that, however, it's important to understand: > Seasoned R user or complete beginner, **everyone makes mistakes.** > Encountering errors sometimes is normal. No R programmer ever just fell from the sky. *Debugging* is the process of finding and resolving bugs/problems in code and it **happens all the time**. --- ## A quick note on errors and debugging So you encountered an error: ``` #> Error: object of type 'closure' is not subsettable ``` Steps to take: * Try to understand what it says. * Easier said than done, because many R errors are actually quite cryptic unfortunately. --- ## A quick note on errors and debugging So you encountered an error: ``` #> Error: object of type 'closure' is not subsettable ``` Steps to take: * Try to understand what it says. 
* Google (or other search engines) * Search for the error * Search for what you were trying to do (add R or rstats) --- ## A quick note on errors and debugging So you encountered an error: ``` #> Error: object of type 'closure' is not subsettable ``` Steps to take: * Try to understand what it says. * Google (or other search engines) * If the error occurs in a function, check whether you passed an object of a type the function doesn't expect. * Checking the documentation can help here! Type: `?function_name` --- ## A quick note on errors and debugging So you encountered an error: ``` #> Error: object of type 'closure' is not subsettable ``` Steps to take: * Try to understand what it says. * Google (or other search engines) * If the error occurs in a function, check whether you passed an object of a type the function doesn't expect. * Ask for help! Create a *reproducible example* (reprex) and post it to online communities --- class: center, middle, inverse # Exercises ### It's time to type some R code Open `02_exercises_I.Rmd` <center> <img src="https://media1.tenor.com/images/72bf7922ac0b07b2f7f8f630e4ae01d2/tenor.gif?itemid=11364811" style="width: 50%" /> </center> --- class: center, middle, inverse # Data frames --- ## Data frames Data frames are the main R object that we will be interacting with. In many ways you already know about them too. An example of a data frame would be the table from the [Animal Ageing and Longevity Database](https://www.johnsnowlabs.com/marketplace/the-animal-aging-and-longevity-database/) we already saw earlier. | Animal | Maximum Longevity (in years)| | --- | --- | | Human | 122.5 | | Domestic dog | 24.0 | | Domestic cat | 30.0 | | American alligator | 77.0 | --- ## Data frames To create a data frame from scratch we can simply pass two vectors of the same length to the function `data.frame`.
```r animals_data <- data.frame(animals, animal_lifespans) animals_data ``` ``` #> animals animal_lifespans #> 1 greenland_shark 392.0 #> 2 dog 24.0 #> 3 galapagos_tortoise 177.0 #> 4 mouse 4.0 #> 5 fly 0.3 #> 6 lion 27.0 #> 7 boar 27.0 #> 8 alligator 77.0 #> 9 human 122.5 ``` --- ## Variable Names We can also retrieve the variable names of any data frame by passing it to `names()`. ```r names(animals_data) ``` ``` #> [1] "animals" "animal_lifespans" ``` --- ## Retrieve variables If we want to retrieve specific variables from a data frame we can do that via the `$` operator. $$\color{red}{\text{dataset}}$\color{orange}{\text{variable_name}}$$ Think of the `$` symbol as a door opener that helps you check what is inside an object. ```r animals_data$animal_lifespans ``` ``` #> [1] 392.0 24.0 177.0 4.0 0.3 27.0 27.0 77.0 122.5 ``` ```r animals_data$animals ``` ``` #> [1] "greenland_shark" "dog" "galapagos_tortoise" #> [4] "mouse" "fly" "lion" #> [7] "boar" "alligator" "human" ``` --- ## (Re-)Code variables We can also use the `$` data access to add **new variables**. In the below case we create a variable called `animal_to_human` which holds all the human to animal years conversions. We do that by simply assigning a vector containing that information to `animals_data$animal_to_human` even if that variable doesn't exist yet. ```r animals_data$animal_to_human <- animals_data$animal_lifespans / human_lifespan ``` ```r animals_data ``` ``` #> animals animal_lifespans animal_to_human #> 1 greenland_shark 392.0 3.20000000 #> 2 dog 24.0 0.19591837 #> 3 galapagos_tortoise 177.0 1.44489796 #> 4 mouse 4.0 0.03265306 #> 5 fly 0.3 0.00244898 #> 6 lion 27.0 0.22040816 #> 7 boar 27.0 0.22040816 #> 8 alligator 77.0 0.62857143 #> 9 human 122.5 1.00000000 ``` --- ## Indexing Just as we did before with vectors we can also index data frames with square brackets: `[]`. However, unlike vectors, data frames have **two dimensions**. 
So that is why the square brackets in this case take two inputs, separated by a comma: `$$\color{red}{\text{dataset}}[\color{orange}{\text{rows}},\color{lightblue}{\text{columns}}]$$` * The first value after the opening square bracket refers to `\(\color{orange}{\text{which rows}}\)` you want to keep. * The second value refers to `\(\color{lightblue}{\text{which columns}}\)` you want to keep. --- ## Indexing So if we only want to keep the first row of the first column of our `animals_data` that is how we would do that: ```r animals_data[1, 1] ``` ``` #> [1] "greenland_shark" ``` *If* we want to keep a certain row but all columns we can do this by leaving the *second* value within the square brackets empty. ```r animals_data[1, ] ``` ``` #> animals animal_lifespans animal_to_human #> 1 greenland_shark 392 3.2 ``` --- ## Indexing *If* we want to keep a certain column but keep all rows we can do this by leaving the *first* value within the square brackets empty. ```r animals_data[, 1] ``` ``` #> [1] "greenland_shark" "dog" "galapagos_tortoise" #> [4] "mouse" "fly" "lion" #> [7] "boar" "alligator" "human" ``` --- ## Indexing with logical tests We can also do more complex indexing by keeping only the rows that fulfill a certain condition. Let's say we only want to keep the rows that contain animals that have longer lifespans than humans. ```r animals_to_check <- animals_data$animal_lifespans > human_lifespan ``` ```r animals_data[animals_to_check, ] ``` ``` #> animals animal_lifespans animal_to_human #> 1 greenland_shark 392 3.200000 #> 3 galapagos_tortoise 177 1.444898 ``` --- class: center, middle, inverse ## Break? Break! 
--- <div style='position: relative; padding-bottom: 56.25%; padding-top: 35px; height: 0; overflow: hidden;'><iframe sandbox='allow-scripts allow-same-origin allow-presentation' allowfullscreen='true' allowtransparency='true' frameborder='0' height='315' src='https://www.mentimeter.com/app/presentation/alm31w5ojicxzhdkv4poagbty1vvbmjp/embed' style='position: absolute; top: 0; left: 0; width: 100%; height: 100%;' width='420'></iframe></div> --- class: center, middle, inverse # R Packages <center> <img src="https://media.tenor.com/images/0a9c3d898ea6fb84d7f190b67db91a0e/tenor.gif" style="width: 30%" /> </center> --- ## R Packages Packages are at the heart of R: * R packages are basically a collection of functions that you load into your working environment. * They contain code that other R users have prepared for the community. -- * It's good to know your packages: they can really make your life easier. * I suggest keeping track of package developments, for example on Twitter via #rstats --- ## R Packages You can install packages in R using the `install.packages` function: ```r install.packages("janitor") ``` However, installing is not enough. You also need to load the package via `library`. ```r library(janitor) ``` Think of `install.packages` as buying a set of tools (for free!) and `library` as pulling out the tools each time you want to work with them. --- class: center, middle, inverse ![](https://predictivehacks.com/wp-content/uploads/2020/11/tidyverse-default.png) --- ## What is the `tidyverse`? The tidyverse describes itself: > The tidyverse is an opinionated **collection of R packages** designed for data science. All packages share an underlying design philosophy, grammar, and data structures. <center> <img src="https://rstudio-education.github.io/tidyverse-cookbook/images/data-science-workflow.png" style="width: 60%" /> </center> --- ## Core principle: tidy data * Every column is a variable. * Every row is an observation. * Every cell is a single value.
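A minimal sketch of these three rules, using a small made-up data frame (the name `tidy_example` and its values are only for illustration):

```r
# Hypothetical toy data: one row per animal, one column per variable
tidy_example <- data.frame(
  animal   = c("Domestic dog", "Domestic cat", "American alligator"),
  lifespan = c(24, 30, 77)
)

names(tidy_example)  # every column is a variable
nrow(tidy_example)   # every row is an observation (here: one animal)

# every cell is a single value, e.g. the lifespan of the cat
tidy_example$lifespan[tidy_example$animal == "Domestic cat"]
```
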
We have already seen tidy data: | Animal | Maximum Lifespan | Animal/Human Years Ratio | | --- | --- | --- | | Domestic dog | 24.0 | 5.10 | | Domestic cat | 30.0 | 4.08 | | American alligator | 77.0 | 1.59 | | Golden hamster | 3.9 | 31.41 | | King penguin | 26.0 | 4.71 | --- ## Untidy data I .leftcol[ | Animal | Type | Value | | --- | --- | --- | | Domestic dog | lifespan | 24.0 | | Domestic dog | ratio | 5.10 | | Domestic cat | lifespan | 30.0 | | Domestic cat | ratio | 4.08 | | American alligator | lifespan | 77.0 | | American alligator | ratio | 1.59 | | Golden hamster | lifespan | 3.9 | | Golden hamster | ratio | 31.41 | | King penguin | lifespan | 26.0 | | King penguin | ratio | 4.71 | ] .rightcol[ <br> <br> The data on the left has multiple rows for the same observation (animal). = not tidy ] --- ## Untidy data II | Animal | Lifespan/Ratio | | --- | --- | | Domestic dog | 24.0 / 5.10 | | Domestic cat | 30.0 / 4.08 | | American alligator | 77.0 / 1.59 | | Golden hamster | 3.9 / 31.41 | | King penguin | 26.0 / 4.71 | The data above has multiple variables per column. = not tidy --- ## Core principle: tidy data <center> <img src="https://www.openscapes.org/img/blog/tidydata/tidydata_2.jpg" style="width: 80%" /> </center> .fifty[Artist: [Allison Horst](https://github.com/allisonhorst)] --- ## Core principle: tidy data Tidy data has two decisive advantages: * Consistently prepared data is easier to read, process, load and save. * Many procedures (or the associated functions) in R require this type of data. <center> <img src="https://www.openscapes.org/img/blog/tidydata/tidydata_4.jpg" style="width: 40%" /> </center> .fifty[Artist: [Allison Horst](https://github.com/allisonhorst)] --- ## Installing and loading the tidyverse First we install the packages of the tidyverse like this: ```r install.packages("tidyverse") ``` Then we load them: ```r library(tidyverse) ``` --- ## A new dataset appears.. We are going to work with a new dataset from here on out.
No worries, we will stay within the animal kingdom but we need a dataset that is a little more complex than what we have seen already. --- ## A new dataset appears.. We are going to work with a new dataset from here on out. Meet the Palmer penguins! Data were collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/). .leftcol[ <center> <img src="https://github.com/allisonhorst/palmerpenguins/raw/main/man/figures/lter_penguins.png" style="width: 80%" /> </center> ] .rightcol[ <center> <img src="https://github.com/allisonhorst/palmerpenguins/raw/main/man/figures/culmen_depth.png" style="width: 80%" /> </center> .right[ .fifty[Artist: [Allison Horst](https://github.com/allisonhorst)]] ] --- ## Palmer Penguins We could install the R package `palmerpenguins` and then access the data. However, we are going to use a different method: directly loading a .csv file (comma-separated values) into R from the internet. We can use the `readr` package, which provides many convenient functions to load data into R.
Here we need `read_csv`: ```r penguins_raw <- read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins_raw.csv") ``` --- ## Palmer Penguins ```r penguins_raw ``` ``` #> # A tibble: 344 × 17 #> studyName `Sample Number` Species Region Island Stage `Individual ID` #> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> #> 1 PAL0708 1 Adelie Penguin… Anvers Torge… Adul… N1A1 #> 2 PAL0708 2 Adelie Penguin… Anvers Torge… Adul… N1A2 #> 3 PAL0708 3 Adelie Penguin… Anvers Torge… Adul… N2A1 #> 4 PAL0708 4 Adelie Penguin… Anvers Torge… Adul… N2A2 #> 5 PAL0708 5 Adelie Penguin… Anvers Torge… Adul… N3A1 #> 6 PAL0708 6 Adelie Penguin… Anvers Torge… Adul… N3A2 #> 7 PAL0708 7 Adelie Penguin… Anvers Torge… Adul… N4A1 #> 8 PAL0708 8 Adelie Penguin… Anvers Torge… Adul… N4A2 #> 9 PAL0708 9 Adelie Penguin… Anvers Torge… Adul… N5A1 #> 10 PAL0708 10 Adelie Penguin… Anvers Torge… Adul… N5A2 #> # ℹ 334 more rows #> # ℹ 10 more variables: `Clutch Completion` <chr>, `Date Egg` <date>, #> # `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>, #> # `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>, #> # `Delta 15 N (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr> ``` --- ## Palmer Penguins We can also take a look at data set using the `glimpse` function from `dplyr`. 
```r glimpse(penguins_raw) ``` ``` #> Rows: 344 #> Columns: 17 #> $ studyName <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL… #> $ `Sample Number` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1… #> $ Species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P… #> $ Region <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"… #> $ Island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse… #> $ Stage <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu… #> $ `Individual ID` <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", … #> $ `Clutch Completion` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", … #> $ `Date Egg` <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16,… #> $ `Culmen Length (mm)` <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34… #> $ `Culmen Depth (mm)` <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18… #> $ `Flipper Length (mm)` <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190,… #> $ `Body Mass (g)` <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 34… #> $ Sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE"… #> $ `Delta 15 N (o/oo)` <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18… #> $ `Delta 13 C (o/oo)` <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.298… #> $ Comments <chr> "Not enough blood for isotopes.", NA, NA, "Adult… ``` --- class: center, middle, inverse ## initial data cleaning ### using `janitor` <center> <img src="https://github.com/sfirke/janitor/raw/main/man/figures/logo_small.png" style="width: 20%" /> </center> --- .leftcol75[ ## cleaning with `janitor` ] .rightcol25[ <img src="https://github.com/sfirke/janitor/raw/main/man/figures/logo_small.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" /> ] `janitor` is not officially part of the tidyverse collection of packages, but in my view it is incredibly important to know. It provides some convenient functions for basic data cleaning.
Just like any tidyverse-style package, its functions fulfill the following criterion: > The data is always the first argument. This helps us to match by position. --- .leftcol75[ ## cleaning with `janitor` ] .rightcol25[ <img src="https://github.com/sfirke/janitor/raw/main/man/figures/logo_small.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" /> ] One annoyance with the `penguins_raw` data is that it has spaces in the variable names. Urgh! We have to wrap variable names that contain spaces in backticks: ```r penguins_raw$`Delta 15 N (o/oo)` penguins_raw$`Flipper Length (mm)` ``` `janitor` can help with that, using a function called `clean_names()`. --- .leftcol75[ ## cleaning with `janitor` ] .rightcol25[ <img src="https://github.com/sfirke/janitor/raw/main/man/figures/logo_small.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" /> ] `clean_names()` just magically turns all our messy column names into readable lower-case snake case: ```r library(janitor) penguins_clean <- clean_names(penguins_raw) ``` This is what the variables look like now: ```r penguins_clean$delta_15_n_o_oo penguins_clean$flipper_length_mm ``` --- ## cleaning with `janitor` ```r glimpse(penguins_clean) ``` ``` #> Rows: 344 #> Columns: 17 #> $ study_name <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708… #> $ sample_number <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1… #> $ species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu… #> $ region <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A… #> $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", … #> $ stage <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, … #> $ individual_id <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A… #> $ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"… #> $ date_egg <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200… #> $ culmen_length_mm <dbl> 
39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, … #> $ culmen_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, … #> $ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186… #> $ body_mass_g <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, … #> $ sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F… #> $ delta_15_n_o_oo <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18718,… #> $ delta_13_c_o_oo <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.29805, … #> $ comments <chr> "Not enough blood for isotopes.", NA, NA, "Adult not… ``` --- .leftcol75[ ## cleaning with `janitor` ] .rightcol25[ <img src="https://github.com/sfirke/janitor/raw/main/man/figures/logo_small.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" /> ] Now we have another problem. Not all variables in the `penguins_clean` data set are that useful. Some of them are the same across all observations. We don't need those variables, like `region`. ```r table(penguins_clean$region) ``` ``` #> #> Anvers #> 344 ``` We can use the base R function `table` to quickly get some tabulations of our variable. --- .leftcol75[ ## cleaning with `janitor` ] .rightcol25[ <img src="https://github.com/sfirke/janitor/raw/main/man/figures/logo_small.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" /> ] Here to help get rid of these *constant* columns is the function `remove_constant()`. ```r penguins_clean <- remove_constant(penguins_clean, quiet = F) ``` ``` #> Removing 2 constant columns of 17 columns total (Removed: region, stage). ``` When we set `quiet = F` we even get some input as to what exactly was removed. Neat! Another useful function in `janitor` is `remove_empty()` which removes all rows or columns that just consist of missing values (i.e. 
`NA`). --- class: center, middle, inverse ## Data cleaning using `tidyr` <center> <img src="https://tidyr.tidyverse.org/logo.png" style="width: 32%" /> </center> --- .leftcol75[ ## Data cleaning using `tidyr` ] .rightcol25[ <img src="https://tidyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" /> ] Now we are already fairly advanced in our tidying. But our dataset is not entirely tidy yet. Consider the `species` variable: ```r table(penguins_clean$species) ``` ``` #> #> Adelie Penguin (Pygoscelis adeliae) #> 152 #> Chinstrap penguin (Pygoscelis antarctica) #> 68 #> Gentoo penguin (Pygoscelis papua) #> 124 ``` --- ## `tidyr` ```r table(penguins_clean$species) ``` ``` #> #> Adelie Penguin (Pygoscelis adeliae) #> 152 #> Chinstrap penguin (Pygoscelis antarctica) #> 68 #> Gentoo penguin (Pygoscelis papua) #> 124 ``` This variable violates the tidy rule that each cell should contain a single value. `species` holds both the *common name* and the *Latin name* of the penguin. --- .leftcol75[ ## `tidyr` ] .rightcol25[ <img src="https://tidyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" /> ] We can use a `tidyr` function called `separate()` to turn this into two variables.
Two arguments are important for that: + `sep`: specifies by which character the value should be split + `into`: a vector which specifies the resulting new variable names --- .leftcol75[ ## `tidyr` ] .rightcol25[ <img src="https://tidyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" /> ] In our case we want to split by opening bracket `\\(` and will name our variables `species` and `latin_name`: ```r penguins_clean <- separate(penguins_clean, species, sep = " \\(", into = c("species", "latin_name")) ``` ```r penguins_clean ``` ``` #> # A tibble: 344 × 16 #> study_name sample_number species latin_name island individual_id #> <chr> <dbl> <chr> <chr> <chr> <chr> #> 1 PAL0708 1 Adelie Penguin Pygoscelis adel… Torge… N1A1 #> 2 PAL0708 2 Adelie Penguin Pygoscelis adel… Torge… N1A2 #> 3 PAL0708 3 Adelie Penguin Pygoscelis adel… Torge… N2A1 #> 4 PAL0708 4 Adelie Penguin Pygoscelis adel… Torge… N2A2 #> 5 PAL0708 5 Adelie Penguin Pygoscelis adel… Torge… N3A1 #> 6 PAL0708 6 Adelie Penguin Pygoscelis adel… Torge… N3A2 #> 7 PAL0708 7 Adelie Penguin Pygoscelis adel… Torge… N4A1 #> 8 PAL0708 8 Adelie Penguin Pygoscelis adel… Torge… N4A2 #> 9 PAL0708 9 Adelie Penguin Pygoscelis adel… Torge… N5A1 #> 10 PAL0708 10 Adelie Penguin Pygoscelis adel… Torge… N5A2 #> # ℹ 334 more rows #> # ℹ 10 more variables: clutch_completion <chr>, date_egg <date>, #> # culmen_length_mm <dbl>, culmen_depth_mm <dbl>, flipper_length_mm <dbl>, #> # body_mass_g <dbl>, sex <chr>, delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, #> # comments <chr> ``` --- .leftcol75[ ## `tidyr` ] .rightcol25[ <img src="https://tidyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" /> ] In our case we want to split by an empty space and an opening bracket ` \\(` and we will name our variables `species` and `latin_name`: ```r penguins_clean <- separate(penguins_clean, species, sep = " \\(", into = c("species", "latin_name")) ``` 
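To see what this does on a smaller scale, here is a rough sketch with a made-up two-row data frame (`toy` and `toy_split` are hypothetical names). Splitting on ` \\(` leaves the closing bracket behind in `latin_name`; the `gsub()` line is an optional extra clean-up step and not part of the workflow above:

```r
library(tidyr)

# Hypothetical toy data mimicking the `species` column
toy <- data.frame(
  species = c("Adelie Penguin (Pygoscelis adeliae)",
              "Gentoo penguin (Pygoscelis papua)")
)

# Split each value at " (" into two new columns
toy_split <- separate(toy, species, sep = " \\(",
                      into = c("species", "latin_name"))

# Optional clean-up: drop the closing bracket left over from the split
toy_split$latin_name <- gsub("\\)$", "", toy_split$latin_name)

toy_split
```
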
There is also a function called `unite()` which works in the opposite direction. --- .leftcol75[ ## `tidyr` ] .rightcol25[ <img src="https://tidyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" /> ] Now our data is in tidy format! We were in luck because the data pretty much already came in the right format: one observation per row. But what if that is not the case? --- .leftcol75[ ### `pivot_wider()` and `pivot_longer()` ] .rightcol25[ <img src="https://tidyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" /> ] `tidyr` also comes equipped to deal with data where a single observation is spread across more than one row. The function to use here is called `pivot_wider`. Now our `penguins_clean` data is already tidy. But we can just read in a dataset that isn't: ```r untidy_animals <- read_csv("https://github.com/favstats/ds3_r_intro/blob/main/data/untidy_animals.csv?raw=true") ``` --- ### `pivot_wider()` and `pivot_longer()` ```r untidy_animals ``` ``` #> # A tibble: 10 × 3 #> Animal Type Value #> <chr> <chr> <dbl> #> 1 Domestic dog lifespan 24 #> 2 Domestic dog ratio 5.1 #> 3 Domestic cat lifespan 30 #> 4 Domestic cat ratio 4.08 #> 5 American alligator lifespan 77 #> 6 American alligator ratio 1.59 #> 7 Golden hamster lifespan 3.9 #> 8 Golden hamster ratio 31.4 #> 9 King penguin lifespan 26 #> 10 King penguin ratio 4.71 ``` --- .leftcol75[ ### `pivot_wider()` and `pivot_longer()` ] .rightcol25[ <img src="https://tidyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" /> ] You may recognize this data from the slide *Untidy data I*. Now let's use `pivot_wider` to make every row an observation. We need two main arguments for that: 1. `names_from`: tells the function where the new column names come from 2. 
`values_from`: tells the function where the values should come from --- ### `pivot_wider()` and `pivot_longer()` ```r tidy_animals <- pivot_wider(untidy_animals, names_from = Type, values_from = Value) tidy_animals ``` ``` #> # A tibble: 5 × 3 #> Animal lifespan ratio #> <chr> <dbl> <dbl> #> 1 Domestic dog 24 5.1 #> 2 Domestic cat 30 4.08 #> 3 American alligator 77 1.59 #> 4 Golden hamster 3.9 31.4 #> 5 King penguin 26 4.71 ``` --- .leftcol75[ ### `pivot_wider()` and `pivot_longer()` ] .rightcol25[ <img src="https://tidyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" /> ] `pivot_longer` can untidy our data again. The argument `cols` tells the function which variables to turn into long format: ```r pivot_longer(tidy_animals, cols = c(lifespan, ratio)) ``` ``` #> # A tibble: 10 × 3 #> Animal name value #> <chr> <chr> <dbl> #> 1 Domestic dog lifespan 24 #> 2 Domestic dog ratio 5.1 #> 3 Domestic cat lifespan 30 #> 4 Domestic cat ratio 4.08 #> 5 American alligator lifespan 77 #> 6 American alligator ratio 1.59 #> 7 Golden hamster lifespan 3.9 #> 8 Golden hamster ratio 31.4 #> 9 King penguin lifespan 26 #> 10 King penguin ratio 4.71 ``` --- class: center, middle, inverse ## Data manipulation using `dplyr` <center> <img src="https://github.com/allisonhorst/stats-illustrations/blob/main/rstats-artwork/dplyr_wrangling.png?raw=true" style="width: 62%" /> </center> .fifty[Artist: [Allison Horst](https://github.com/allisonhorst)] --- class: center, middle, inverse ## `select()` helps you select variables --- .leftcol75[ ## `select()` ] .rightcol25[ <img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" /> ] ![](images/select.png) `select()` is part of the dplyr package and helps you select variables. Remember: with tidyverse-style functions, **data is always the first argument**.
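Because the data comes first, both ways of passing values that we saw earlier work here too. A small sketch with a made-up data frame (`toy`, `by_position` and `by_name` are hypothetical names; `.data` is the name of the first argument of `select()`):

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  id   = 1:3,
  sex  = c("MALE", "FEMALE", "MALE"),
  mass = c(3750, 3800, 3250)
)

# Data passed by position (the usual style)
by_position <- select(toy, id, sex)

# The same call with the data argument named explicitly
by_name <- select(.data = toy, id, sex)

# Both calls give the same result
identical(by_position, by_name)
```
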
--- .leftcol75[ ## `select()` ] .rightcol25[ <img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" /> ] ![](images/select.png) Here we only keep `individual_id`, `sex` and `species`. ```r select(penguins_clean, individual_id, sex, species) ``` ``` #> # A tibble: 344 × 3 #> individual_id sex species #> <chr> <chr> <chr> #> 1 N1A1 MALE Adelie Penguin #> 2 N1A2 FEMALE Adelie Penguin #> 3 N2A1 FEMALE Adelie Penguin #> 4 N2A2 <NA> Adelie Penguin #> 5 N3A1 FEMALE Adelie Penguin #> 6 N3A2 MALE Adelie Penguin #> 7 N4A1 FEMALE Adelie Penguin #> 8 N4A2 MALE Adelie Penguin #> 9 N5A1 <NA> Adelie Penguin #> 10 N5A2 <NA> Adelie Penguin #> # ℹ 334 more rows ``` --- .leftcol75[ ## `select()` ] .rightcol25[ <img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" /> ] We can also **remove** variables with a **`-`** (minus). Here we remove `individual_id`, `sex` and `species`. ```r names(select(penguins_clean, -individual_id, -sex, -species)) ``` ``` #> [1] "study_name" "sample_number" "latin_name" #> [4] "island" "clutch_completion" "date_egg" #> [7] "culmen_length_mm" "culmen_depth_mm" "flipper_length_mm" #> [10] "body_mass_g" "delta_15_n_o_oo" "delta_13_c_o_oo" #> [13] "comments" ``` `individual_id`, `sex` and `species` are now removed, just as we wanted! --- .leftcol75[ #### Selection helpers ] .rightcol25[ <img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" /> ] These *selection helpers* match variables according to a given pattern. `starts_with()`: Starts with a prefix. `ends_with()`: Ends with a suffix. `contains()`: Contains a literal string. `matches()`: Matches a regular expression. 
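A quick sketch of how some of these helpers behave, using a made-up one-row data frame whose column names imitate the penguin data (`toy` is hypothetical):

```r
library(dplyr)

# Hypothetical toy data with penguin-style column names
toy <- data.frame(
  study_name       = "PAL0708",
  culmen_length_mm = 39.1,
  culmen_depth_mm  = 18.7,
  body_mass_g      = 3750,
  delta_15_n_o_oo  = 8.95
)

names(select(toy, ends_with("_mm")))     # columns ending in "_mm"
names(select(toy, contains("mass")))     # columns containing "mass"
names(select(toy, matches("_[0-9]+_")))  # columns matching a regular expression
```
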
---

.leftcol75[
#### Selection helpers
]

.rightcol25[
<img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" />
]

For example: let's keep all variables that start with `s`:

```r
select(penguins_clean, starts_with("s"))
```

```
#> # A tibble: 344 × 4
#>    study_name sample_number species        sex
#>    <chr>              <dbl> <chr>          <chr>
#>  1 PAL0708                1 Adelie Penguin MALE
#>  2 PAL0708                2 Adelie Penguin FEMALE
#>  3 PAL0708                3 Adelie Penguin FEMALE
#>  4 PAL0708                4 Adelie Penguin <NA>
#>  5 PAL0708                5 Adelie Penguin FEMALE
#>  6 PAL0708                6 Adelie Penguin MALE
#>  7 PAL0708                7 Adelie Penguin FEMALE
#>  8 PAL0708                8 Adelie Penguin MALE
#>  9 PAL0708                9 Adelie Penguin <NA>
#> 10 PAL0708               10 Adelie Penguin <NA>
#> # ℹ 334 more rows
```

---

.leftcol75[
#### Even more ways to select
]

.rightcol25[
<img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" />
]

Select the first 5 variables:

```r
select(penguins_clean, 1:5)
```

```
#> # A tibble: 344 × 5
#>    study_name sample_number species        latin_name          island
#>    <chr>              <dbl> <chr>          <chr>               <chr>
#>  1 PAL0708                1 Adelie Penguin Pygoscelis adeliae) Torgersen
#>  2 PAL0708                2 Adelie Penguin Pygoscelis adeliae) Torgersen
#>  3 PAL0708                3 Adelie Penguin Pygoscelis adeliae) Torgersen
#>  4 PAL0708                4 Adelie Penguin Pygoscelis adeliae) Torgersen
#>  5 PAL0708                5 Adelie Penguin Pygoscelis adeliae) Torgersen
#>  6 PAL0708                6 Adelie Penguin Pygoscelis adeliae) Torgersen
#>  7 PAL0708                7 Adelie Penguin Pygoscelis adeliae) Torgersen
#>  8 PAL0708                8 Adelie Penguin Pygoscelis adeliae) Torgersen
#>  9 PAL0708                9 Adelie Penguin Pygoscelis adeliae) Torgersen
#> 10 PAL0708               10 Adelie Penguin Pygoscelis adeliae) Torgersen
#> # ℹ 334 more rows
```

---

.leftcol75[
#### Even more ways to select
]

.rightcol25[
<img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" />
]

Select everything from `individual_id` to `flipper_length_mm`.

```r
select(penguins_clean, individual_id:flipper_length_mm)
```

```
#> # A tibble: 344 × 6
#>    individual_id clutch_completion date_egg   culmen_length_mm culmen_depth_mm
#>    <chr>         <chr>             <date>                <dbl>           <dbl>
#>  1 N1A1          Yes               2007-11-11             39.1            18.7
#>  2 N1A2          Yes               2007-11-11             39.5            17.4
#>  3 N2A1          Yes               2007-11-16             40.3            18
#>  4 N2A2          Yes               2007-11-16             NA              NA
#>  5 N3A1          Yes               2007-11-16             36.7            19.3
#>  6 N3A2          Yes               2007-11-16             39.3            20.6
#>  7 N4A1          No                2007-11-15             38.9            17.8
#>  8 N4A2          No                2007-11-15             39.2            19.6
#>  9 N5A1          Yes               2007-11-09             34.1            18.1
#> 10 N5A2          Yes               2007-11-09             42              20.2
#> # ℹ 334 more rows
#> # ℹ 1 more variable: flipper_length_mm <dbl>
```

---
class: center, middle, inverse

## `filter()`

helps you filter rows

---

.leftcol75[
## `filter()`
]

.rightcol25[
<img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" />
]

helps you filter rows

![](images/filter.png)

Here we only keep penguins from the island `Dream`.

```r
filter(penguins_clean, island == "Dream")
```

```
#> # A tibble: 124 × 16
#>    study_name sample_number species        latin_name       island individual_id
#>    <chr>              <dbl> <chr>          <chr>            <chr>  <chr>
#>  1 PAL0708               31 Adelie Penguin Pygoscelis adel… Dream  N21A1
#>  2 PAL0708               32 Adelie Penguin Pygoscelis adel… Dream  N21A2
#>  3 PAL0708               33 Adelie Penguin Pygoscelis adel… Dream  N22A1
#>  4 PAL0708               34 Adelie Penguin Pygoscelis adel… Dream  N22A2
#>  5 PAL0708               35 Adelie Penguin Pygoscelis adel… Dream  N23A1
#>  6 PAL0708               36 Adelie Penguin Pygoscelis adel… Dream  N23A2
#>  7 PAL0708               37 Adelie Penguin Pygoscelis adel… Dream  N24A1
#>  8 PAL0708               38 Adelie Penguin Pygoscelis adel… Dream  N24A2
#>  9 PAL0708               39 Adelie Penguin Pygoscelis adel… Dream  N25A1
#> 10 PAL0708               40 Adelie Penguin Pygoscelis adel… Dream  N25A2
#> # ℹ 114 more rows
#> # ℹ 10 more variables: clutch_completion <chr>, date_egg <date>,
#> #   culmen_length_mm <dbl>, culmen_depth_mm <dbl>, flipper_length_mm <dbl>,
#> #   body_mass_g <dbl>, sex <chr>, delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>,
#> #   comments <chr>
```

---

.leftcol75[
## `filter()`
]

.rightcol25[
<img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" />
]

Here the **`%in%`** operator can come in handy again if we want to filter more than one island:

```r
islands_to_keep <- c("Dream", "Biscoe")

filter(penguins_clean, island %in% islands_to_keep)
```

```
#> # A tibble: 292 × 16
#>    study_name sample_number species        latin_name       island individual_id
#>    <chr>              <dbl> <chr>          <chr>            <chr>  <chr>
#>  1 PAL0708               21 Adelie Penguin Pygoscelis adel… Biscoe N11A1
#>  2 PAL0708               22 Adelie Penguin Pygoscelis adel… Biscoe N11A2
#>  3 PAL0708               23 Adelie Penguin Pygoscelis adel… Biscoe N12A1
#>  4 PAL0708               24 Adelie Penguin Pygoscelis adel… Biscoe N12A2
#>  5 PAL0708               25 Adelie Penguin Pygoscelis adel… Biscoe N13A1
#>  6 PAL0708               26 Adelie Penguin Pygoscelis adel… Biscoe N13A2
#>  7 PAL0708               27 Adelie Penguin Pygoscelis adel… Biscoe N17A1
#>  8 PAL0708               28 Adelie Penguin Pygoscelis adel… Biscoe N17A2
#>  9 PAL0708               29 Adelie Penguin Pygoscelis adel… Biscoe N18A1
#> 10 PAL0708               30 Adelie Penguin Pygoscelis adel… Biscoe N18A2
#> # ℹ 282 more rows
#> # ℹ 10 more variables: clutch_completion <chr>, date_egg <date>,
#> #   culmen_length_mm <dbl>, culmen_depth_mm <dbl>, flipper_length_mm <dbl>,
#> #   body_mass_g <dbl>, sex <chr>, delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>,
#> #   comments <chr>
```

---
class: center, middle, inverse

## `mutate()`

helps you create variables

---

.leftcol75[
## `mutate()`
]

.rightcol25[
<img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" />
]

![](images/mutate.png)

`mutate` will take a statement like this:

`variable_name = some_calculation`

and attach `variable_name` at the *end of the dataset*.
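
A minimal sketch of that pattern, assuming `dplyr` is installed. The tibble `toy` and its columns are made up for illustration and are not part of the penguins data:

```r
library(dplyr)

# Hypothetical toy data, only used to keep the example small
toy <- tibble(name = c("a", "b", "c"), weight_g = c(1500, 2500, 4000))

# variable_name = some_calculation
toy_new <- mutate(toy, weight_kg = weight_g / 1000)

# The new variable is attached as the last column
names(toy_new)
```

The existing columns keep their order and `weight_kg` shows up last.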
---

.leftcol75[
## `mutate()`
]

.rightcol25[
<img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" />
]

![](images/mutate.png)

Let's say we want to calculate penguin body mass in kg rather than grams.

```r
pg_new <- mutate(penguins_clean, bodymass_kg = body_mass_g / 1000)
```

```r
select(pg_new, bodymass_kg, body_mass_g)
```

```
#> # A tibble: 344 × 2
#>    bodymass_kg body_mass_g
#>          <dbl>       <dbl>
#>  1        3.75        3750
#>  2        3.8         3800
#>  3        3.25        3250
#>  4       NA             NA
#>  5        3.45        3450
#>  6        3.65        3650
#>  7        3.62        3625
#>  8        4.68        4675
#>  9        3.48        3475
#> 10        4.25        4250
#> # ℹ 334 more rows
```

---

#### Recoding with `ifelse`

`ifelse()` is a very useful function that allows you to easily recode variables based on logical tests.

Its basic functionality looks like this:

`$$\color{red}{\text{ifelse}}(\color{orange}{\text{logical test}},\color{blue}{\text{value if TRUE}}, \color{green}{\text{value if FALSE}})$$`

Here is a very basic example:

```r
ifelse(1 == 1, "Pick me if test is TRUE", "Pick me if test is FALSE")
```

```
#> [1] "Pick me if test is TRUE"
```

```r
ifelse(1 != 1, "Pick me if test is TRUE", "Pick me if test is FALSE")
```

```
#> [1] "Pick me if test is FALSE"
```

---

#### Recoding with `ifelse`

Let's use `ifelse` in combination with `mutate`.
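
Before doing so, here is a minimal sketch of how `ifelse()` behaves on a whole vector, which is what a column is. The vector `sex_example` is made up for illustration. Each element is tested separately, and a missing value in the test stays missing in the result:

```r
# Made-up vector for illustration (not taken from the penguins data)
sex_example <- c("MALE", "FEMALE", NA)

# The logical test is evaluated element by element;
# NA == "MALE" is NA, so the third result is NA as well
recoded <- ifelse(sex_example == "MALE", "m", "f")
recoded
```
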
Let's create the variable `sex_short`, which has a shorter label for `sex`:

```r
pg_new <- mutate(penguins_clean, sex_short = ifelse(sex == "MALE", "m", "f"))

select(pg_new, sex, sex_short)
```

```
#> # A tibble: 344 × 2
#>    sex    sex_short
#>    <chr>  <chr>
#>  1 MALE   m
#>  2 FEMALE f
#>  3 FEMALE f
#>  4 <NA>   <NA>
#>  5 FEMALE f
#>  6 MALE   m
#>  7 FEMALE f
#>  8 MALE   m
#>  9 <NA>   <NA>
#> 10 <NA>   <NA>
#> # ℹ 334 more rows
```

---

.leftcol75[
#### Recoding with `case_when`
]

.rightcol25[
<img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" />
]

`case_when` (from the `dplyr` package) is like `ifelse` but allows for much more complex combinations.

The basic setup for a `case_when` call looks like this:

case_when(

`\(\color{orange}{\text{logical test}}\)` ~ `\(\color{blue}{\text{what should happen if TRUE}}\)`,

`\(\color{orange}{\text{logical test}}\)` ~ `\(\color{blue}{\text{what should happen if TRUE}}\)`,

`\(\color{orange}{\text{logical test}}\)` ~ `\(\color{blue}{\text{what should happen if TRUE}}\)`,

`\(TRUE\)` ~ `\(\color{green}{\text{what should happen with everything else}}\)`

)

---

.leftcol75[
#### Recoding with `case_when`
]

.rightcol25[
<img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" />
]

The following code recodes a numeric vector (1 through 50) into three categories:

```r
x <- 1:50

case_when(
  x %in% 1:10 ~ "1 through 10",
  x %in% 11:30 ~ "11 through 30",
  TRUE ~ "above 30"
)
```

```
#>  [1] "1 through 10"  "1 through 10"  "1 through 10"  "1 through 10"
#>  [5] "1 through 10"  "1 through 10"  "1 through 10"  "1 through 10"
#>  [9] "1 through 10"  "1 through 10"  "11 through 30" "11 through 30"
#> [13] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
#> [17] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
#> [21] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
#> [25] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
#> [29] "11 through 30" "11 through 30" "above 30"      "above 30"
#> [33] "above 30"      "above 30"      "above 30"      "above 30"
#> [37] "above 30"      "above 30"      "above 30"      "above 30"
#> [41] "above 30"      "above 30"      "above 30"      "above 30"
#> [45] "above 30"      "above 30"      "above 30"      "above 30"
#> [49] "above 30"      "above 30"
```

---

.leftcol75[
#### Recoding with `case_when`
]

.rightcol25[
<img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" />
]

Let's use `case_when` in combination with `mutate`.

Let's create the variable `island_short`, which has a shorter label for `island`:

```r
pg_new <- mutate(penguins_clean,
                 island_short = case_when(
                   island == "Torgersen" ~ "T",
                   island == "Biscoe" ~ "B",
                   island == "Dream" ~ "D"
                 ))
```

```r
select(pg_new, island, island_short)
```

```
#> # A tibble: 344 × 2
#>    island    island_short
#>    <chr>     <chr>
#>  1 Torgersen T
#>  2 Torgersen T
#>  3 Torgersen T
#>  4 Torgersen T
#>  5 Torgersen T
#>  6 Torgersen T
#>  7 Torgersen T
#>  8 Torgersen T
#>  9 Torgersen T
#> 10 Torgersen T
#> # ℹ 334 more rows
```

---
class: center, middle, inverse

## `rename()`

helps you rename variables

---

.leftcol75[
## `rename()`
]

.rightcol25[
<img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" />
]

`rename()` just changes the variable name and leaves everything else intact:

```r
rename(penguins_clean, sample = sample_number)
```

```
#> # A tibble: 344 × 16
#>    study_name sample species   latin_name island individual_id clutch_completion
#>    <chr>       <dbl> <chr>     <chr>      <chr>  <chr>         <chr>
#>  1 PAL0708         1 Adelie P… Pygosceli… Torge… N1A1          Yes
#>  2 PAL0708         2 Adelie P… Pygosceli… Torge… N1A2          Yes
#>  3 PAL0708         3 Adelie P… Pygosceli… Torge… N2A1          Yes
#>  4 PAL0708         4 Adelie P… Pygosceli… Torge… N2A2          Yes
#>  5 PAL0708         5 Adelie P… Pygosceli… Torge… N3A1          Yes
#>  6 PAL0708         6 Adelie P… Pygosceli… Torge… N3A2          Yes
#>  7 PAL0708         7 Adelie P… Pygosceli… Torge… N4A1          No
#>  8 PAL0708         8 Adelie P… Pygosceli… Torge… N4A2          No
#>  9 PAL0708         9 Adelie P… Pygosceli… Torge… N5A1          Yes
#> 10 PAL0708        10 Adelie P… Pygosceli… Torge… N5A2          Yes
#> # ℹ 334 more rows
#> # ℹ 9 more variables: date_egg <date>, culmen_length_mm <dbl>,
#> #   culmen_depth_mm <dbl>, flipper_length_mm <dbl>, body_mass_g <dbl>,
#> #   sex <chr>, delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>
```

---
class: center, middle, inverse

## `arrange()`

orders your dataset

---

.leftcol75[
## `arrange()`
]

.rightcol25[
<img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" />
]

You can order your data to show the highest or lowest value first.

Lowest first:

```r
arrange(penguins_clean, sample_number)
```

Highest first:

```r
arrange(penguins_clean, desc(sample_number))
```

---
class: center, middle, inverse

## `group_by()` and `summarize()`

when you want to aggregate your data (by groups)

---

.leftcol75[
## `group_by()` and `summarize()`
]

.rightcol25[
<img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" />
]

Sometimes we want to calculate group statistics.

In other languages this is often a pain. With `dplyr` this is fairly easy **and** readable.

<img src="https://learn.r-journalism.com/wrangling/dplyr/images/groupby.png" style="width: 80%" />

---

.leftcol75[
## `group_by()` and `summarize()`
]

.rightcol25[
<img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" />
]

First we group `penguins_clean` by `sex`.
```r
grouped_by_sex <- group_by(penguins_clean, sex)
```

`summarize` works in a similar way to `mutate`:

`variable_name = some_calculation`

```r
summarize(grouped_by_sex, avg_culmen_length_mm = mean(culmen_length_mm, na.rm = TRUE))
```

```
#> # A tibble: 3 × 2
#>   sex    avg_culmen_length_mm
#>   <chr>                 <dbl>
#> 1 FEMALE                 42.1
#> 2 MALE                   45.9
#> 3 <NA>                   41.3
```

---
class: center, middle, inverse

## `count()`

when you want to count how often each value of one or more variables occurs

---

.leftcol75[
## `count()`
]

.rightcol25[
<img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" />
]

Now this is a function that I use all the time.

Simply specify which variable you want to count:

```r
count(penguins_clean, species, sort = TRUE)
```

```
#> # A tibble: 3 × 2
#>   species               n
#>   <chr>             <int>
#> 1 Adelie Penguin      152
#> 2 Gentoo penguin      124
#> 3 Chinstrap penguin    68
```

---
class: center, middle, inverse

# **`%>%`**

## The pipe operator

<center>
<img src="https://rpodcast.github.io/officer-advrmarkdown/img/magrittr.png" style="width: 62%" />
</center>

---

.leftcol75[
## The `%>%` operator
]

.rightcol25[
<img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" />
]

The point of the pipe is to help you write code in a way that is easier to read and understand.
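
As a minimal sketch of the mechanics, assuming the `magrittr` package is installed (loading `dplyr` makes `%>%` available too): the pipe passes whatever is on its left as the first argument of the function on its right. The vector `x` is made up for illustration.

```r
library(magrittr)  # dplyr also makes %>% available

# Made-up input vector
x <- c(1, 4, 9)

# The left-hand side becomes the first argument of the function on the right
piped  <- x %>% sqrt()   # same as sqrt(x)
nested <- sqrt(x)

identical(piped, nested)
```
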
Let's consider an example with the data manipulation we have done so far:

```r
## first I select variables
pg <- select(penguins_clean, individual_id, island, body_mass_g)

## then I filter to only Dream island
pg <- filter(pg, island == "Dream")

## then I convert body_mass_g to kg
pg <- mutate(pg, bodymass_kg = body_mass_g / 1000)

## rename individual id to simply id
pg <- rename(pg, id = individual_id)
```

---

.leftcol75[
## The `%>%` operator
]

.rightcol25[
<img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" />
]

Now this works, but the problem is: we have to write a lot of code that repeats itself!

```r
pg
```

```
#> # A tibble: 124 × 4
#>    id    island body_mass_g bodymass_kg
#>    <chr> <chr>        <dbl>       <dbl>
#>  1 N21A1 Dream         3250        3.25
#>  2 N21A2 Dream         3900        3.9
#>  3 N22A1 Dream         3300        3.3
#>  4 N22A2 Dream         3900        3.9
#>  5 N23A1 Dream         3325        3.32
#>  6 N23A2 Dream         4150        4.15
#>  7 N24A1 Dream         3950        3.95
#>  8 N24A2 Dream         3550        3.55
#>  9 N25A1 Dream         3300        3.3
#> 10 N25A2 Dream         4650        4.65
#> # ℹ 114 more rows
```

---

.leftcol75[
## The `%>%` operator
]

.rightcol25[
<img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" />
]

Another (hardly readable) alternative is to *nest all the functions*:

```r
rename(mutate(filter(select(penguins_clean, individual_id, island, body_mass_g), island == "Dream"), bodymass_kg = body_mass_g / 1000), id = individual_id)
```

```
#> # A tibble: 124 × 4
#>    id    island body_mass_g bodymass_kg
#>    <chr> <chr>        <dbl>       <dbl>
#>  1 N21A1 Dream         3250        3.25
#>  2 N21A2 Dream         3900        3.9
#>  3 N22A1 Dream         3300        3.3
#>  4 N22A2 Dream         3900        3.9
#>  5 N23A1 Dream         3325        3.32
#>  6 N23A2 Dream         4150        4.15
#>  7 N24A1 Dream         3950        3.95
#>  8 N24A2 Dream         3550        3.55
#>  9 N25A1 Dream         3300        3.3
#> 10 N25A2 Dream         4650        4.65
#> # ℹ 114 more rows
```

---

.leftcol75[
## The `%>%` operator
]

.rightcol25[
<img src="https://dplyr.tidyverse.org/logo.png" width="100" height="120" style="display: block; margin: auto 0 auto auto;" />
]

*The piping style*:

Read from top to bottom and from left to right, and read the `%>%` as "and then".

```r
penguins_clean %>%
  select(individual_id, island, body_mass_g) %>%
  filter(island == "Dream") %>%
  mutate(bodymass_kg = body_mass_g / 1000) %>%
  rename(id = individual_id)
```

```
#> # A tibble: 124 × 4
#>    id    island body_mass_g bodymass_kg
#>    <chr> <chr>        <dbl>       <dbl>
#>  1 N21A1 Dream         3250        3.25
#>  2 N21A2 Dream         3900        3.9
#>  3 N22A1 Dream         3300        3.3
#>  4 N22A2 Dream         3900        3.9
#>  5 N23A1 Dream         3325        3.32
#>  6 N23A2 Dream         4150        4.15
#>  7 N24A1 Dream         3950        3.95
#>  8 N24A2 Dream         3550        3.55
#>  9 N25A1 Dream         3300        3.3
#> 10 N25A2 Dream         4650        4.65
#> # ℹ 114 more rows
```

---

## Small Note on the Pipe

Since R version 4.1.0, base R also provides a pipe. It looks like this: `\(|>\)`

While it shares many similarities with `%>%`, there are also some differences.

Going over them is beyond the scope of this workshop, so for the sake of simplicity we will stick with the `magrittr` pipe.

---
class: center, middle, inverse

# Exercises

### It's time to type some R code

Open `04_exercises_II.Rmd`

<center>
<img src="https://media1.tenor.com/images/72bf7922ac0b07b2f7f8f630e4ae01d2/tenor.gif?itemid=11364811" style="width: 50%" />
</center>

---
class: center, middle, inverse

# Some Final Things

## If only we had more time...

<center>
<img src="https://cdn.myportfolio.com/45214904-6a61-4e23-98d6-b140f8654a40/de7f2fcf-0f01-43bb-8580-d489c877d672_rw_1920.png?h=d651e514f5a4cfd0128ed6972bed764e" style="width: 75%" />
</center>

---

## Using RStudio Projects

[RStudio projects](https://r4ds.had.co.nz/workflow-projects.html) make it straightforward to divide your work into multiple contexts, each with their own working directory, workspace, history, and source documents.

![](http://www.rstudio.com/images/docs/projects_new.png)

---

## Further Resources to learn R

* [Book: R for Data Science](https://r4ds.had.co.nz/)
* [Danielle Navarro's YouTube channel](https://www.youtube.com/channel/UCfNGzUFfsy_3udMY8UyaqBA)
* [Start coding using RStudio.cloud Primers](https://rstudio.cloud/learn/primers)
* [RStudio Cheat Sheets](https://www.rstudio.com/resources/cheatsheets/)
* [Book: A ModernDive into R and the Tidyverse](https://moderndive.com/)
* [TidyTuesday - Community](https://www.tidytuesday.com/)

---
class: center, middle, inverse

## Q&A

--

<center>
<img src="https://i.gifer.com/24OD.gif" style="width: 22%" />
</center>

Ways to stay in touch with me:

[@favstats](http://twitter.com/favstats)<br>

---
class: center, middle, inverse

### Thank you for participating!

I hope you had fun!

<center>
<img src="https://media1.tenor.com/images/da0f7d5d93faa11dfc36db1e6c6fdf2a/tenor.gif?itemid=6159389" style="width: 32%" />
</center>

.fifty[Link to slides: [favstats.github.io/ds3_r_intro](https://favstats.github.io/ds3_r_intro)]