A {fun} Intro to R Programming

<style>
.eightyfive {
  font-size: 85%;
   }
   
.eighty {
  font-size: 80%;
   }
   
.seventyfive {
  font-size: 75%;
   }
   
.seventy {
  font-size: 70%;
   }
   
.fifty {
  font-size: 50%;
   }
   
.forty {
  font-size: 40%;
   }
</style>

---
class: banner

---
name: title-slide
class: primary

#.fancy[A {fun} Intro to R Programming]

###.fancy[The basics,<br>data wrangling,<br>and more!]

<br>

Fabio Votta

[<svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:black;overflow:visible;position:relative;"><path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"/></svg> @favstats](http://twitter.com/favstats)<br>
[<svg aria-hidden="true" role="img" viewBox="0 0 496 512" style="height:1em;width:0.97em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:black;overflow:visible;position:relative;"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg> @favstats](http://github.com/favstats)<br>
[<svg aria-hidden="true" role="img" viewBox="0 0 640 512" style="height:1em;width:1.25em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:black;overflow:visible;position:relative;"><path d="M579.8 267.7c56.5-56.5 56.5-148 0-204.5c-50-50-128.8-56.5-186.3-15.4l-1.6 1.1c-14.4 10.3-17.7 30.3-7.4 44.6s30.3 17.7 44.6 7.4l1.6-1.1c32.1-22.9 76-19.3 103.8 8.6c31.5 31.5 31.5 82.5 0 114L422.3 334.8c-31.5 31.5-82.5 31.5-114 0c-27.9-27.9-31.5-71.8-8.6-103.8l1.1-1.6c10.3-14.4 6.9-34.4-7.4-44.6s-34.4-6.9-44.6 7.4l-1.1 1.6C206.5 251.2 213 330 263 380c56.5 56.5 148 56.5 204.5 0L579.8 267.7zM60.2 244.3c-56.5 56.5-56.5 148 0 204.5c50 50 128.8 56.5 186.3 15.4l1.6-1.1c14.4-10.3 17.7-30.3 7.4-44.6s-30.3-17.7-44.6-7.4l-1.6 1.1c-32.1 22.9-76 19.3-103.8-8.6C74 372 74 321 105.5 289.5L217.7 177.2c31.5-31.5 82.5-31.5 114 0c27.9 27.9 31.5 71.8 8.6 103.9l-1.1 1.6c-10.3 14.4-6.9 34.4 7.4 44.6s34.4 6.9 44.6-7.4l1.1-1.6C433.5 260.8 427 182 377 132c-56.5-56.5-148-56.5-204.5 0L60.2 244.3z"/></svg> favstats.eu](https://www.favstats.eu)

August 14 2023

---

### Your friendly neighborhood R Instructor

.leftcol40[
<img src="https://github.com/favstats/WarwickSpringCamp_QTA/blob/main/docs/slides/day1/images/me.jpg?raw=true" style="width: 90%" />

]

+ Passionate about R and Data Science

+ I love to travel

+ I enjoy and (occasionally) create R memes

</center>

]

---

## But enough of me

### Let's learn something about .blue[you]!

</center>

Go to `menti.com` and type in the code: 6410 5559

or visit this website: [menti.com/aloxtiea8mx3](https://www.menti.com/aloxtiea8mx3)

---

### It's not unusual to struggle at first but it gets better!

+ My experience is that this stuff isn't super easy... but it gets better!
  
--

+ Awesome inclusive community that is always ready to help
+ Active blogosphere with use cases and examples

---

### A Note on Live-Coding

---

## Overview

+ R Basics
  + Operators
  + Objects (inc. vectors)
  + Functions
  + Exercises
  + Data frames
  
`$\text{B R E A K at 2pm (CET) - (5:30 PM IST) - (8:00 PM CST)}$`
  
+ Data Manipulation
  + the tidyverse and friends
  + `janitor`
  + `tidyr`
  + `dplyr`
  + Exercises

---

#### What is <img src="images/Rlogo.svg" style="display: inline-block; margin: 0"; width="30px"/>?

R is a .fancy[statistical] programming language developed for data analysis and visualization.

#### What is <img src="images/rstudio.png" style="display: inline-block; margin: 0"; width="80px"/>?

RStudio is an IDE (Integrated Development Environment).

* Write, save and open R Code (.R/.Rmd files)

* Provides syntax-highlighting and auto-completion & much more

# But why learn <img src="images/Rlogo.svg" style="display: inline-block; margin: 0"; width="60px"/>?

---

#### Why learn <img src="images/Rlogo.svg" style="display: inline-block; margin: 0"; width="30px"/>?

+ Amazing Community .forty[(but I already said that)]

]

![](https://raw.githubusercontent.com/allisonhorst/stats-illustrations/main/rstats-artwork/welcome_to_rstats_twitter.png)

]

---

#### Why learn <img src="images/Rlogo.svg" style="display: inline-block; margin: 0"; width="30px"/>?

+ Amazing Community .forty[(but I already said that)]

+ Data wrangling is accessible & fun

]

<br>

![](https://github.com/allisonhorst/stats-illustrations/blob/main/rstats-artwork/tidyverse_celestial.png?raw=true)

]

---

#### Why learn <img src="images/Rlogo.svg" style="display: inline-block; margin: 0"; width="30px"/>?

+ Amazing Community .forty[(but I already said that)]

+ Data wrangling is accessible & fun

+ Outstanding repertoire of statistical & computational methods

]

<br>

![](https://pbs.twimg.com/media/E4mDfxSXEAA_kvG.jpg)
.fifty[ [easystats](https://github.com/easystats/easystats) packageverse for statistical analysis.]

]

---

#### Why learn <img src="images/Rlogo.svg" style="display: inline-block; margin: 0"; width="30px"/>?

+ Amazing Community .forty[(but I already said that)]

+ Data wrangling is accessible & fun

+ Outstanding repertoire of statistical & computational methods

+ Integrates well with other programming languages

]

<br>

![](https://rstudio.github.io/reticulate/images/reticulated_python.png)
.fifty[ [reticulate](https://rstudio.github.io/reticulate/) integrates Python into R]

]

---

#### Why learn <img src="images/Rlogo.svg" style="display: inline-block; margin: 0"; width="30px"/>?

+ Amazing Community .forty[(but I already said that)]

+ Data wrangling is accessible & fun

+ Outstanding repertoire of statistical & computational methods

+ Integrates well with other programming languages

+ Beautiful data visualization with `ggplot2`

<br>

]

![](https://raw.githubusercontent.com/Z3tt/TidyTuesday/master/plots/2020_31/2020_31_PalmerPenguins.png?raw=true)

]

---

### This workshop is held in .green[Rmarkdown]

]

]

---

## .green[Rmarkdown]

]

]

.leftcol[
<center>
<img src="https://github.com/allisonhorst/stats-illustrations/blob/main/rstats-artwork/rmarkdown_wizards.png?raw=true" style="width: 100%" />
</center>

* Write your thesis and create slides, automated reports, dashboards and interactive web apps, all from within [Rmarkdown](https://rmarkdown.rstudio.com/docs/articles/rmarkdown.html)
  
]

---

## R vs. Rmarkdown

R scripts (file ending `.R`):

]

![](images/rcode.png)

]

]

]

---

## R vs. Rmarkdown

R scripts (file ending `.R`):

]

![](images/rcode.png)

]

]

![](images/rmarkown_pdf.png)

]

---

## R vs. Rmarkdown

R scripts (file ending `.R`):

]

![](images/rcode.png)

]

]

slides (*like the ones you are looking at right now!*), automated reports, dashboards and interactive web apps

]

---

## We'll switch to RStudio in a moment

> The next few slides will have a couple of links that you may want to follow. You can find **all** materials of this workshop, including links, in this GitHub repository: [tinyurl.com/ds3repo](https://github.com/favstats/ds3_r_intro)

Hopefully at this point you have already followed the **pre-workshop instructions:**

[tinyurl.com/ds3prep](https://favstats.github.io/ds3_r_intro/prep/instructions.html)

*Once in RStudio, feel free to code along with me, changing a few bits here and there to see how the results change.*

]

---

## Alternatives to local RStudio

If you cannot install R or RStudio on your computer for any reason, you have two options to follow along:

+ **Binder**

Simply start a **Binder** instance, which will create an RStudio session in your browser (this may take a little while):

[tinyurl.com/ds3binder](https://tinyurl.com/ds3binder)

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/favstats/ds3_r_intro/rstudio?urlpath=rstudio)

> Note: If you are using Binder, don't forget to download your files before you close the session; otherwise anything you added will be lost!

---

## Alternatives to local RStudio

+ **Google Colab**

Google Colab instantaneously runs Jupyter Notebooks in your browser with an R Kernel.

+ [Part I (R Basics): tinyurl.com/ds3rintro1](https://colab.research.google.com/drive/1dLsdGbkvgn1JbWgsy9Z-pFmPd_2MG4Xu?usp=sharing)
+ [Part II (Data Manipulation with the `tidyverse`): tinyurl.com/ds3rintro2](https://colab.research.google.com/drive/14CRElnKewnp5MnlxhqVu6OOcIXd-Bkaj?usp=sharing)

---

## Workshop files

Please download the workshop files from here: [https://tinyurl.com/ds3files](https://www.dropbox.com/sh/jievqgwl43nwnbf/AADhoYQW5oMZ-JygK7aNklHra?dl=0).

The link will download a `.zip` file. Extract it to **its own folder**.

Now double-click on **ds3_intro.Rproj**

![](images/click_on_rproject.png)

---

## Within RStudio

![](images/rstudio_instructions.png)

---

# R Basics

### Math Operators

---

### Math Operators

At its core R is just a fancy *calculator*

You can do:

`+` addition

`-` subtraction

`*` multiplication

`/` division

`^` exponentiation
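A quick sketch of each operator in action; the numbers are arbitrary and chosen only for illustration:

```r
15 + 5   # addition
#> [1] 20
15 - 5   # subtraction
#> [1] 10
15 * 5   # multiplication
#> [1] 75
15 / 5   # division
#> [1] 3
15^2     # exponentiation
#> [1] 225
```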

---

### Math Operators

At its core R is just a fancy *calculator*

###  `+` addition

```r
15 + 5
```

```
#> [1] 20
```
]

### Mixing operators

```r
(15 + 5) / (2 * 5)
```

```
#> [1] 2
```
]

---

## Short detour: a new data set appears...

But adding up numbers for no reason is no fun.

That's why we will use a data set about .fancy[animals] to learn some R Basics.

.leftcol45[
<img src="images/kisspng-animal-clip-art-portable-network-graphics-illustra-cartoon-animal-transparent-amp-png-clipart-free-5c7ccdbb1c9c98.7009291415516830031172.png" style="width: 100%" />
]

[Animal Ageing and Longevity Database](https://www.johnsnowlabs.com/marketplace/the-animal-aging-and-longevity-database/)

Data on over 4200 animals.

Information on age of maturity, gestation or incubation periods but also **longevity (in years)**.

]

---

## Animal Ageing and Longevity Database

Say we want to know how old an animal is in *human years*.

We can use the following simple formula to determine that:

<br>

`$$\frac{\text{Maximum lifespan human}}{\text{Maximum lifespan non-human animal}} = \text{animal to human years ratio}$$`

<br>

*Note: This is just a **very rough** way to determine the conversion ratio. It is **much more** complicated in [reality](https://www.akc.org/expert-advice/health/how-to-calculate-dog-years-to-human-years/).*

---

## Animal Ageing and Longevity Database

| Animal | Maximum lifespan (years) |
| --- | --- |
| Human | 122.5 |
| Domestic dog | 24.0 |
| Domestic cat | 30.0 |
| American alligator | 77.0 |
| Golden hamster | 3.9 |
| King penguin | 26.0 |
| Lion | 27.0 |
| Greenland shark | 392.0 |
| Galapagos tortoise | 177.0 |
]

| Animal | Maximum lifespan (years) |
| --- | --- |
| African bush elephant | 65.0 |
| California sea lion | 35.7 |
| Fruit fly | 0.3 |
| House mouse | 4.0 |
| Giraffe | 39.5 |
| Wild boar | 27.0 |

Source: [Animal Ageing and Longevity Database](https://www.johnsnowlabs.com/marketplace/the-animal-aging-and-longevity-database/)

]

---

### Math Operators

Say we want to know how old a dog is in *human years*.

The observed maximum lifespan of a human is 122.5 years. For dogs it is 24.

```r
122.5/24
```

```
#> [1] 5.104167
```
]

So for every year that passes, a dog "ages" about 5.1 human years. How old is a 15-year-old dog in human years?

```r
5.104167*15
```

```
#> [1] 76.56251
```
]

---

## So many numbers..

Now it can be quite tedious to juggle all those numbers around.

Especially if we want to keep reusing numbers we calculated before.

Here to simplify that process are:

**Objects**

---

# R Basics

### R Objects

---

## R Objects

You can think of R objects as *saving information*, for example simple numbers or just plain text.

Once saved we can recall it whenever we want by just running the name of the object.

> Everything that exists in R is *.red[an object]*. .fifty[~John M. Chambers]

We create R objects by using the assignment operator:

<center>
.large[**`<-`**]
</center>

---

## R Objects

Here is an example:

```r
human_lifespan  <- 122.5
dog_lifespan <- 24
```

If we now run the respective objects we retrieve the saved numbers.

```r
human_lifespan
```

```
#> [1] 122.5
```
]

```r
dog_lifespan
```

```
#> [1] 24
```
]

---

## R Objects

Now we can perform the same calculation as before but this time using objects!

```r
dogs_to_human <- human_lifespan / dog_lifespan
```

The object `dogs_to_human` now holds the dog to human years conversion ratio.

*Now* we ask again: how old is a 15 year old dog in human years?

```r
dogs_to_human*15
```

```
#> [1] 76.5625
```
]
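The same pattern works for any other animal in the lifespan table. As a sketch, here is the cat version; the object names `cat_lifespan` and `cats_to_human` are made up for this example, and the value 30 is the domestic cat's maximum lifespan from the table:

```r
human_lifespan <- 122.5
cat_lifespan <- 30          # maximum lifespan of a domestic cat (years)

# conversion ratio, stored so we can reuse it
cats_to_human <- human_lifespan / cat_lifespan

# how old is a 10 year old cat in human years?
cats_to_human * 10
#> [1] 40.83333
```

Because the ratio lives in an object, we only need to change `cat_lifespan` once if the number ever needs correcting.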

---

## A quick note on naming things

.leftcol[
> Note that object names could be *anything* here!  I could have chosen to just name them `x`, `y` and `z`.

I typically use lower-case snake case in this style: `animal_rights`.
 
Also recommended by the [tidyverse style guide](https://style.tidyverse.org/syntax.html). 
]

]

---

## More operators..

But wait there are more operators: .fancy[logical operators]

Logical operators are used for logical tests which can result in either:

`$\text{TRUE}$` or `$\text{FALSE}$`

*(sometimes this is also called a boolean variable)*

---
class: inverse, middle, center

# R Basics

### Logical Operators

---

## Logical Operators

Let's first create some more objects to try some logical tests!

```r
lion_lifespan <- 27
mouse_lifespan <- 4
fly_lifespan <- 0.3
boar_lifespan <- 27
alligator_lifespan <- 77
greenland_shark_lifespan <- 392
galapagos_tortoise_lifespan <- 177
```

---

## Logical Operators

`==` asks whether two values are the same or **equal**

The code below tests the following statement:

*The maximum lifespan of a lion equals that of a boar.*

```r
lion_lifespan == boar_lifespan
```

```
#> [1] TRUE
```
]

Since both maximum lifespans are `27` this is of course a **`TRUE`** statement.

---

## Logical Operators

`!=` asks whether two values are *not* the same or **unequal**

The code below tests the following statement:

*The maximum lifespan of a lion **does not equal** that of a boar.*

```r
lion_lifespan != boar_lifespan
```

```
#> [1] FALSE
```
]

Since both maximum lifespans are `27` (as we saw before) this is of course a **`FALSE`** statement.

---

## Logical Operators

We can also test whether certain values are greater or smaller than others:

`>` greater than

The code below tests the statement:

*The lifespan of a human is greater than the lifespan of a fly.*

```r
human_lifespan > fly_lifespan
```

```
#> [1] TRUE
```
]

Since the maximum human lifespan is `122.5` and a fly does not live longer than `0.3` years this is of course a **`TRUE`** statement.

---

## Logical Operators

`<` smaller than

The code below tests the statement:

*The lifespan of an alligator is smaller than the lifespan of a mouse.*

```r
alligator_lifespan < mouse_lifespan
```

```
#> [1] FALSE
```
]

Since the maximum alligator lifespan is `77` and a mouse lives for `4` years maximum, this is of course a **`FALSE`** statement.

Also note the following options:

`>=` greater than or equal to and `<=` smaller than or equal to
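A small sketch of how the "or equal" variants differ from the strict ones, reusing lifespans from the table (the objects are re-created here so the example runs on its own):

```r
lion_lifespan <- 27
boar_lifespan <- 27
mouse_lifespan <- 4

lion_lifespan >= boar_lifespan   # equal values pass the "or equal" test
#> [1] TRUE
lion_lifespan > boar_lifespan    # ...but fail the strict test
#> [1] FALSE
mouse_lifespan <= lion_lifespan  # 4 is smaller than 27
#> [1] TRUE
```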

---

## Combine Logical Operators

We can also combine logical tests by testing multiple statements at the same time:

* `&` stands for "and" (unsurprisingly)
* `|` stands for "or"

For example both `alligator_lifespan` and `fly_lifespan` have to be greater than `mouse_lifespan` for the code below to evaluate as `TRUE`.

```r
alligator_lifespan > mouse_lifespan & 
fly_lifespan > mouse_lifespan
```

```
#> [1] FALSE
```
]

---

## Combine Logical Operators

We can also combine logical tests by testing multiple statements at the same time:

* `&` stands for "and" (unsurprisingly)
* `|` stands for "or"

If we say `|` (= or) instead, it is enough for either statement to evaluate to `TRUE`!

```r
alligator_lifespan > mouse_lifespan | 
fly_lifespan > mouse_lifespan
```

```
#> [1] TRUE
```
]
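It can help to see the basic combinations on their own. A minimal sketch using bare `TRUE` and `FALSE` values instead of the lifespan objects:

```r
TRUE & TRUE     # "and": both sides must be TRUE
#> [1] TRUE
TRUE & FALSE
#> [1] FALSE
TRUE | FALSE    # "or": one TRUE side is enough
#> [1] TRUE
FALSE | FALSE
#> [1] FALSE
```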

---

## Even more objects..?

Now we learned about operators and some basic objects.

But so far objects have only ever held a *single numeric value*.

R is of course much more powerful than that: objects can hold any number of values and many types of data.

Now we will take a look at **vectors**, or objects that include more than one value.

---
class: inverse, middle, center

# R Basics

### Vectors

---

## Vectors

You can simply imagine vectors as a list of values. They can consist of numbers but also *strings* (or: text).

In order to create a vector in R we make use of `c()` (stands for *concatenate*)

```r
c(1, 100, 1000, 2000, 5000)
```

```
#> [1]    1  100 1000 2000 5000
```
]

We can also create a vector of strings by using quotes:

```r
c("I", "am", "a", "vector", "of", "strings")
```

```
#> [1] "I"       "am"      "a"       "vector"  "of"      "strings"
```
]

---

## Vectors

We can combine the lifespans we assigned into objects earlier:

```r
animal_lifespans <-  c(greenland_shark_lifespan, dog_lifespan, 
  galapagos_tortoise_lifespan,
  mouse_lifespan, fly_lifespan,
  lion_lifespan, boar_lifespan,
  alligator_lifespan, human_lifespan)
```

```r
animal_lifespans
```

```
#> [1] 392.0  24.0 177.0   4.0   0.3  27.0  27.0  77.0 122.5
```
]

---

## Vectors

We can also create a vector of strings by using quotes:

```r
animals <- c("greenland_shark", "dog", 
  "galapagos_tortoise", "mouse", 
  "fly", "lion", "boar",
  "alligator", "human")
```
]

```r
animals
```

```
#> [1] "greenland_shark"    "dog"                "galapagos_tortoise"
#> [4] "mouse"              "fly"                "lion"              
#> [7] "boar"               "alligator"          "human"
```
]

---

## Vectors

Now if we want the conversion ratios for all animals at once, we can simply divide the maximum human lifespan by the vector.

```r
human_lifespan / animal_lifespans
```

```
#> [1]   0.3125000   5.1041667   0.6920904  30.6250000 408.3333333   4.5370370
#> [7]   4.5370370   1.5909091   1.0000000
```
]

Notice how the operation is performed for each item separately and the result is yet another vector.
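The same element-by-element behaviour applies to other operators and to two vectors of the same length. A short sketch; `some_lifespans` is a hypothetical helper vector holding the dog, cat and lion values from the table:

```r
some_lifespans <- c(24, 30, 27)   # dog, cat, lion (years)

# one number applied to every element: lifespans in months
some_lifespans * 12
#> [1] 288 360 324

# two vectors of equal length are combined pairwise
c(1, 2, 3) + c(10, 20, 30)
#> [1] 11 22 33
```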

---

## Vectors

We can also use *logical* operators with vectors:

```r
animal_lifespans > human_lifespan
```

```
#> [1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
```
]

Again, notice how the operation is performed for each item separately and the result is yet another vector, this time consisting of `TRUE`s and `FALSE`s.

---

## A logical operator for vectors: `%in%`

An incredibly useful operator for vectors is **`%in%`**.

The operator checks, for each element of the first vector, whether it occurs anywhere in the second vector.

Its basic usage looks like this:

`$\color{red}{\text{vector1}}$` %in% `$\color{orange}{\text{vector2}}$`
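A tiny sketch of this, using a shortened, made-up `animals` vector. Note that the result always has one `TRUE`/`FALSE` per element of the vector on the *left*:

```r
animals <- c("dog", "mouse", "lion")

# is each of "dog" and "giraffe" found in animals?
c("dog", "giraffe") %in% animals
#> [1]  TRUE FALSE

# is each element of animals found in c("dog", "giraffe")?
animals %in% c("dog", "giraffe")
#> [1]  TRUE FALSE FALSE
```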

---

## A logical operator for vectors: `%in%`

Let's say we want to check whether `giraffe`, `greenland_shark` or `lion` occur in `animals`.

If we use `|` we would have to write something like this:

```r
animals == "giraffe" | animals == "greenland_shark"  | animals == "lion"
```

With `%in%` we can simply pass a vector like this:

```r
animals_to_check <-  c("giraffe", "greenland_shark", "lion")
animals %in% animals_to_check
```
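Both versions give the same answer. A self-contained sketch that re-creates the `animals` vector and compares the two results; `with_or` and `with_in` are names invented for this comparison:

```r
animals <- c("greenland_shark", "dog", "galapagos_tortoise", "mouse",
             "fly", "lion", "boar", "alligator", "human")

# the repetitive version with |
with_or <- animals == "giraffe" | animals == "greenland_shark" | animals == "lion"

# the compact version with %in%
animals_to_check <- c("giraffe", "greenland_shark", "lion")
with_in <- animals %in% animals_to_check

with_in
#> [1]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

identical(with_or, with_in)
#> [1] TRUE
```

Only the first and sixth positions are `TRUE`, because `greenland_shark` and `lion` are in `animals` while `giraffe` is not.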

---

## A logical operator for vectors: `%in%`

Doesn't that look much better?

Now imagine you have dozens or hundreds of animals to check!

---

## A logical operator for vectors: `%in%`

With **`|`** it's very repetitive and utter chaos.

```
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
```
]

---

## A logical operator for vectors: `%in%`

With **`%in%`** it's much more readable.

```r
animals_to_check <- c(
  "honey_bee", "cardiocondyla_obscurior", "black_garden_ant",
  "pheidole_dentata", "squinting_bush_brown", "american_lobster",
  "firebelly_toad", "oriental_firebelly_toad", "yellow_bellied_toad",
  "american_toad", "western_toad", "yosemite_toad", "great_plains_toad",
  "green_toad", "canadian_toad", "red_spotted_toad", "sonoran_green_toad",
  "southern_toad", "veragoa_stubfoot_toad", "common_european_toad",
  "colorado_river_toad", "kihansi_spray_toad", "ridge_headed_toad",
  "cuban_toad", "european_green_toad", "colombian_giant_toad",
  "argentine_toad", "cane_toad", "eurura_toad", "common_horned_frog",
  "colombian_horned_frog", "amazonian_horned_frog", "ornate_horned_frog"
)
```
]

```r
animals %in% animals_to_check
```

```
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
```

---

# R Basics

### Indexing (with vectors)

---

## Indexing

When you want to know a specific value within your object you can use indexing.

Indexing is done via square brackets `[]`.

The basic setup looks like this:

`$$\color{red}{\text{vector}}[\color{orange}{\text{elements}}]$$`

---

## Indexing

Extracting the first element of a vector:

```r
animal_lifespans[1]
```

```
#> [1] 392
```

```r
animals[1]
```

```
#> [1] "greenland_shark"
```

---

## Indexing

Extracting the fifth element of a vector:

```r
animal_lifespans[5]
```

```
#> [1] 0.3
```

```r
animals[5]
```

```
#> [1] "fly"
```
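You can also ask for several positions at once by putting a vector of positions inside the brackets. A minimal sketch, with the two vectors re-created so it runs on its own:

```r
animal_lifespans <- c(392, 24, 177, 4, 0.3, 27, 27, 77, 122.5)
animals <- c("greenland_shark", "dog", "galapagos_tortoise", "mouse",
             "fly", "lion", "boar", "alligator", "human")

# first and fifth element in one go
animals[c(1, 5)]
#> [1] "greenland_shark" "fly"

animal_lifespans[c(1, 5)]
#> [1] 392.0   0.3
```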

---

## Indexing with logical tests

You can also index using logical tests.

If the test evaluates to `TRUE` for an element, indexing will **keep** that element; if it evaluates to `FALSE`, it will **remove** it.

`$$\color{red}{\text{vector}}[\color{orange}{\text{vector of TRUE/FALSE of same length}}]$$`

---

## Indexing with logical tests

Let's first take a look at a logical test that flags all animals with greater lifespans than humans:

```r
longer_living <- animal_lifespans > human_lifespan

longer_living
```

```
#> [1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
```

Now we can use square brackets to only keep those animals that have greater lifespans than humans.

```r
animals[longer_living]
```

```
#> [1] "greenland_shark"    "galapagos_tortoise"
```
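The intermediate object is optional: the logical test can go straight inside the brackets. A self-contained sketch of that shortcut, plus a second test (lifespans under 5 years) chosen just for illustration:

```r
human_lifespan <- 122.5
animal_lifespans <- c(392, 24, 177, 4, 0.3, 27, 27, 77, 122.5)
animals <- c("greenland_shark", "dog", "galapagos_tortoise", "mouse",
             "fly", "lion", "boar", "alligator", "human")

# test and index in a single step
animals[animal_lifespans > human_lifespan]
#> [1] "greenland_shark"    "galapagos_tortoise"

# animals with a maximum lifespan below 5 years
animals[animal_lifespans < 5]
#> [1] "mouse" "fly"
```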

---

## Short detour: variable types

There are three-ish main types of variables:

```r
class(TRUE)
```

```
#> [1] "logical"
```

```r
class("I am a character")
```

```
#> [1] "character"
```

```r
class(2020)
```

```
#> [1] "numeric"
```
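A vector can only hold one of these types at a time. As a small sketch (the example values are arbitrary), mixing numbers and text inside `c()` turns everything into text:

```r
class(c(27, 4))            # only numbers
#> [1] "numeric"
class(c("lion", "mouse"))  # only text
#> [1] "character"
class(c(27, "lion"))       # mixed: the number is converted to text
#> [1] "character"
```

This is worth remembering when a column of numbers unexpectedly behaves like text.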

---

## Short detour: variable types

Another important value to consider is `NA` (*Not Available*).

`NA` is a special value that simply means *missing value*.

```r
c(12, NA, 23, 22, NA)
```

```
#> [1] 12 NA 23 22 NA
```
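Missing values stay missing when you calculate with them, and `is.na()` tells you where they are. A minimal sketch; `values_with_gaps` is a name made up for this example:

```r
values_with_gaps <- c(12, NA, 23, 22, NA)

# which elements are missing?
is.na(values_with_gaps)
#> [1] FALSE  TRUE FALSE FALSE  TRUE

# calculations keep the gaps
values_with_gaps * 2
#> [1] 24 NA 46 44 NA
```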

---
class: inverse, middle, center

# R Basics

### Functions

---

# Functions

![](https://cdn.myportfolio.com/45214904-6a61-4e23-98d6-b140f8654a40/9718f461-8060-433b-b014-b294da38d172_rw_1920.png?h=5749aa3c82e02c0c4c9428a6788714da)

Artist: [Allison Horst](https://github.com/allisonhorst/stats-illustrations)

---

## Functions

> Everything that happens in R is *.red[a function]*. .fifty[~John M. Chambers]

You can think of functions as little machines that (in most cases) process some kind of **input** and create an **output**.

Input is everything that goes *into* a function:

*   **arguments** you can think of as (pre-determined) input types like a lever or numpad.

*   **values** you can think of as the various settings that the levers or numpads can have.

`$$\text{function_name}(\color{orange}{\text{argument}}=\color{lightblue}{\text{value}})$$`

---

## Functions

> Everything that happens in R is *.red[a function]*. .fifty[~John M. Chambers]

You can think of functions as little machines that (in most cases) process some kind of **input** and create an **output**.

Input is everything that goes *into* a function:

*   **arguments** you can think of as (pre-determined) input types like a lever or numpad.

*   **values** you can think of as the various settings that the levers or numpads can have.

*Let's take a look at an example:  the star producer!*

---

## The Star Producer

Let's consider the following function (that does not exist unfortunately):
A `star_producer`!

This little machine creates tiny hand-drawn stars depending on some input. It takes two arguments:

* `how_many` tells the machine how many stars to produce
* `type` tells the machine what the stars should look like (in this case the machine only supports `"squiggly"` stars but it could be upgraded in the future when we learn how to create our own functions later on)

---

## Getting `?help`

How do we know what function takes what kind of arguments?

Within R you can always run the code:

```r
?function_name
```

And it will open up the *documentation* for the function that will tell you how to use it.

Googling the function name (adding "R" or "rstats") will also usually bring you to some documentation!

---

## An example function: `seq`

From the help file we can learn that this function is used to

> "[g]enerate regular sequences".

Its first three arguments are:

* `from`, `to`: the starting and (maximal) end values of the sequence.

* `by`: a number giving the increment of the sequence.

Let's first take a look at this within our machine analogy.

---

## An example function: `seq`

If we want to create a vector from 1 to 10 that increments by 1, we can simply specify the following input values for the arguments:

* `from`: 1
* `to`: 10
* `by`: 1

This is what that looks like in code:

```r
seq(from = 1, to = 10, by = 1)
```

```
#>  [1]  1  2  3  4  5  6  7  8  9 10
```
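Changing `by` changes the step size. A short sketch with arbitrary values; the second call shows why the help file calls `to` the *maximal* end value, since the sequence stops before overshooting it:

```r
# steps of 2
seq(from = 0, to = 10, by = 2)
#> [1]  0  2  4  6  8 10

# steps of 4: the next value (13) would exceed 10, so the sequence ends at 9
seq(from = 1, to = 10, by = 4)
#> [1] 1 5 9
```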

---

### A tip

There is however a much simpler way to create sequences that increment by one.

Simply use a *`:`* between two numbers and it generates a sequence:

```r
1:10
```

```
#>  [1]  1  2  3  4  5  6  7  8  9 10
```

```r
1000:1010
```

```
#>  [1] 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010
```
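The `:` shortcut also counts downwards, and it is handy inside square brackets. A minimal sketch, using a shortened lifespan vector re-created here so it runs on its own:

```r
# counting down
5:1
#> [1] 5 4 3 2 1

# use a sequence to pick a range of elements
animal_lifespans <- c(392, 24, 177, 4)
animal_lifespans[2:3]
#> [1]  24 177
```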

---

### Passing Values

Now there are two ways to pass values to functions in R:

1.   Pass by argument *names* (we already did this!)
2.   Pass by argument *position*

In the former case, we specifically mention which arguments we want to pass our values to.

For that, it doesn't matter in which **order** we pass our arguments.

```r
seq(to = 10, by = 1, from = 1)
```

```
#>  [1]  1  2  3  4  5  6  7  8  9 10
```

---

### Passing Values by position

But: **coders are lazy**.

There is no need to always specify which argument you mean exactly when you can just match *by position*.

So our sequence example could just as well look like this:

```r
seq(1, 10, 1)
```

```
#>  [1]  1  2  3  4  5  6  7  8  9 10
```

And it works because the documentation tells us that the first three arguments are `from`, `to`, and `by`.

In the future you will often see people leave out the argument names completely, so it's good to get used to it.
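When matching by position, the order *does* matter. A quick sketch with arbitrary numbers, including a call that mixes both styles:

```r
# by position: from = 2, to = 8, by = 2
seq(2, 8, 2)
#> [1] 2 4 6 8

# same numbers in a different order mean something else: from = 8, to = 2, by = -2
seq(8, 2, -2)
#> [1] 8 6 4 2

# mixing styles: first two by position, the last one by name
seq(1, 10, by = 3)
#> [1]  1  4  7 10
```

Naming the less obvious arguments, as in the last call, is a common compromise between brevity and readability.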

---

## More examples: Mean and Median

Many functions have such intuitive arguments that we often don't even need to look up the documentation.

An easy function to use is `mean` which simply calculates the average of a numeric vector.

Let's try this with the `animal_lifespans` vector we created earlier.

```r
mean(animal_lifespans)
```

```
#> [1] 94.53333
```

The mean value is quite high!

---

## More examples: Mean and Median

We can also try and take the median value:

```r
median(animal_lifespans)
```

```
#> [1] 27
```

The median is far lower than the mean, which tells us that the mean is driven up by a few very high outliers (the median is of course more robust to extreme values).
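To see that robustness in action, here is a small sketch that computes the mean "by hand" and then drops the most extreme value, the Greenland shark. The name `without_shark` and the cut-off of 300 years are choices made just for this example:

```r
animal_lifespans <- c(392, 24, 177, 4, 0.3, 27, 27, 77, 122.5)

# the mean is the sum divided by the number of values
sum(animal_lifespans) / length(animal_lifespans)
#> [1] 94.53333

# keep only lifespans below 300 years (this removes the shark)
without_shark <- animal_lifespans[animal_lifespans < 300]

mean(without_shark)     # drops a lot
#> [1] 57.35
median(without_shark)   # does not move at all
#> [1] 27
```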

There are many more functions in R and we will get to learn some of them during this workshop.

---
class: inverse, middle, center

# R Basics

### Creating our own Functions

</center>

---

## Creating our own functions

We can create our own functions using the `function()` keyword.

We encode what is supposed to happen within curly brackets `{}`.

Here is the anatomy of a function:

`$\color{purple}{\text{my_function_name}}$` <- `$\text{function}(\color{orange}{\text{argument}})$`{

&nbsp;&nbsp;&nbsp; `$\color{green}{\text{# function body}}$`

&nbsp;&nbsp;&nbsp; `$\color{lightblue}{\text{output}}$` <- `$\color{orange}{\text{argument}}$`

&nbsp;&nbsp;&nbsp; `$\text{return(}\color{lightblue}{\text{output}}\text{)}$`

}

---

## Creating our own functions

* `$\color{purple}{\text{Function name}}$`:
  * An identifier by which the function is called

* `$\color{orange}{\text{Argument(s)}}$`:
  * Contains a list of values passed to the function
  * Can also contain a default value like this: `argument = 1`

* `$\color{green}{\text{Function body}}$`:
  * This is executed each time the function is called

* `$\color{lightblue}{\text{Return value}}$`:
  * Ends the function call and sends the value back to wherever the function was called from

---

## Creating our own functions

Let's try this basic example:

```r
my_function_name <- function(argument){
  # function body
  output <- argument
  return(output)
}
```

```r
my_function_name("I am output!")
```

```
#> [1] "I am output!"
```

> Tip: In RStudio we can type `fun` and press Enter once the autocomplete popup appears; RStudio will then generate a function template for us.

---

## Creating our own functions

We can also specify *default values* for our arguments:

```r
my_function_name <- function(argument = "I am a default value"){
  # function body
  output <- argument
  return(output)
}
```

```r
my_function_name()
```

```
#> [1] "I am a default value"
```
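Arguments with and without defaults can be combined. In the sketch below, `greet`, `name` and `greeting` are made-up names for illustration: `name` always has to be supplied, while `greeting` falls back to its default unless we override it.

```r
# `greet` and its arguments are hypothetical, for illustration only
greet <- function(name, greeting = "Hello") {
  output <- paste0(greeting, ", ", name, "!")
  return(output)
}

greet("Ada")                        # uses the default greeting
#> [1] "Hello, Ada!"
greet("Ada", greeting = "Welcome")  # overrides the default by name
#> [1] "Welcome, Ada!"
```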

---

## Creating our own functions

Let's create a slightly more useful function: a function which squares numeric values.

```r
square <- function(here_goes_my_number) { 
  output <- here_goes_my_number^2        
  
  return(output)                  
}
```

```r
square(2)   
```

```
#> [1] 4
```

---

## Creating our own functions

Let's create a function that converts dog years into human years. We call the function `dog_to_human_years`.

```r
dog_to_human_years <- function(animal_years){

  # maximum recorded lifespans in years
  human_lifespan <- 122.5
  dog_lifespan <- 24

  # how many human years correspond to one dog year
  ratio <- human_lifespan/dog_lifespan

  human_years <- animal_years*ratio

  return(human_years)
}
```

```r
dog_to_human_years(15)
```

```
#> [1] 76.5625
```
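The same idea can be generalised so that it is not tied to dogs. The following is a sketch rather than part of the workshop material: `animal_to_human_years` and its `animal_lifespan` argument are hypothetical names, and the human lifespan becomes a default value.

```r
# Hypothetical generalisation: the animal's maximum lifespan is now an argument
animal_to_human_years <- function(animal_years, animal_lifespan,
                                  human_lifespan = 122.5) {
  ratio <- human_lifespan / animal_lifespan
  human_years <- animal_years * ratio
  return(human_years)
}

animal_to_human_years(15, animal_lifespan = 24)           # dog, same result as before
#> [1] 76.5625
animal_to_human_years(c(1, 5, 10), animal_lifespan = 30)  # cat, several ages at once
```

Because arithmetic in R works element by element, the function accepts a whole vector of ages without any extra code.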

---

## A quick note on errors and debugging

Soon we will go into exercise mode!

Before that, however, it's important to understand:

> Seasoned R user or complete beginner, **everyone makes mistakes.**
> Encountering errors sometimes is normal.

Nobody is born an expert R programmer.

*Debugging* is the process of finding and resolving bugs/problems in code and it **happens all the time**.

---

## A quick note on errors and debugging

So you encountered an error:

```
#> Error: object of type 'closure' is not subsettable
```

Steps to take:

* Try to understand what it says.

* Easier said than done, because many R errors are actually quite cryptic unfortunately.

---

## A quick note on errors and debugging

So you encountered an error:

```
#> Error: object of type 'closure' is not subsettable
```

Steps to take:

* Try to understand what it says.

* Google (or other search engines)
  * Search for the error
  * Search for what you were trying to do (add R or rstats)

---

## A quick note on errors and debugging

So you encountered an error:

```
#> Error: object of type 'closure' is not subsettable
```

Steps to take:

* Try to understand what it says.

* Google (or other search engines)

* If the error occurs in a function check whether you passed an object type that isn't expected by it.

* Checking the documentation can help here! Type `?` followed by the function name, e.g. `?mean`

---

## A quick note on errors and debugging

So you encountered an error:

```
#> Error: object of type 'closure' is not subsettable
```

Steps to take:

* Try to understand what it says.

* Google (or other search engines)

* If the error occurs in a function check whether you passed an object type that isn't expected by it.

* Ask for help! Create a *reproducible example* (reprex) and post to online communities
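One way this kind of error typically comes about is by putting square brackets on a function instead of on the data. The sketch below provokes the error on purpose and catches it with `tryCatch()` so the script keeps running; the exact wording of the message may differ between R versions, and the three-value vector is made up for illustration.

```r
# `mean` is a function, so indexing it with [ ] fails
result <- tryCatch(mean[1], error = function(e) e)
conditionMessage(result)   # the error text, stored instead of stopping the script

# What was probably intended: index the data, then call the function
animal_lifespans <- c(392, 24, 177)
mean(animal_lifespans[1:2])
#> [1] 208
```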

---

# Exercises

### It's time to type some R code

Open `02_exercises_I.Rmd`

---

# Data frames

---

## Data frames

Data frames are the main R object that we will be interacting with. In many ways you already know about them too.

An example of a data frame is the table from the [Animal Ageing and Longevity Database](https://www.johnsnowlabs.com/marketplace/the-animal-aging-and-longevity-database/) we already saw earlier.

| Animal | Maximum Longevity (in years)|
| --- | --- | 
| Human | 122.5 | 
| Domestic dog | 24.0 | 
| Domestic cat | 30.0 | 
| American alligator | 77.0 |

---

## Data frames

To create a data frame from scratch we can simply pass two (same-sized) vectors to the function `data.frame`.

```r
animals_data <- data.frame(animals, animal_lifespans)

animals_data
```

```
#>              animals animal_lifespans
#> 1    greenland_shark            392.0
#> 2                dog             24.0
#> 3 galapagos_tortoise            177.0
#> 4              mouse              4.0
#> 5                fly              0.3
#> 6               lion             27.0
#> 7               boar             27.0
#> 8          alligator             77.0
#> 9              human            122.5
```
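We can also name the columns ourselves while creating the data frame, and then check its size. This is a small sketch with made-up names (`small_animals`, `animal`, `lifespan`) so that it runs on its own:

```r
# Column names are set explicitly; the names are illustrative
small_animals <- data.frame(
  animal   = c("dog", "mouse", "lion"),
  lifespan = c(24, 4, 27)
)

nrow(small_animals)  # number of rows
#> [1] 3
ncol(small_animals)  # number of columns
#> [1] 2
str(small_animals)   # compact overview of the column types
```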

---

## Variable Names

We can also retrieve the variable names of any data frame by passing it to `names()`.

```r
names(animals_data)
```

```
#> [1] "animals"          "animal_lifespans"
```

---

## Retrieve variables

If we want to retrieve specific variables from a data frame we can do that via the `$` operator.

`$$\color{red}{\text{dataset}}\$\color{orange}{\text{variable_name}}$$`

Think of the `$` symbol as a door opener that helps you check what is inside an object.

```r
animals_data$animal_lifespans
```

```
#> [1] 392.0  24.0 177.0   4.0   0.3  27.0  27.0  77.0 122.5
```

```r
animals_data$animals
```

```
#> [1] "greenland_shark"    "dog"                "galapagos_tortoise"
#> [4] "mouse"              "fly"                "lion"              
#> [7] "boar"               "alligator"          "human"
```

---

## (Re-)Code variables

We can also use the `$` operator to add **new variables**.

In the case below we create a variable called `animal_to_human`, which holds each animal's maximum lifespan as a share of the human maximum lifespan.

We do that by simply assigning a vector containing that information to `animals_data$animal_to_human`, even though that variable doesn't exist yet.

```r
animals_data$animal_to_human <- animals_data$animal_lifespans / human_lifespan
```

```r
animals_data
```

```
#>              animals animal_lifespans animal_to_human
#> 1    greenland_shark            392.0      3.20000000
#> 2                dog             24.0      0.19591837
#> 3 galapagos_tortoise            177.0      1.44489796
#> 4              mouse              4.0      0.03265306
#> 5                fly              0.3      0.00244898
#> 6               lion             27.0      0.22040816
#> 7               boar             27.0      0.22040816
#> 8          alligator             77.0      0.62857143
#> 9              human            122.5      1.00000000
```

---

## Indexing

Just as we did before with vectors we can also index data frames with square brackets: `[]`. However, unlike vectors, data frames have **two dimensions**.

So that is why the square brackets in this case take two inputs, separated by a comma:

`$$\color{red}{\text{dataset}}[\color{orange}{\text{rows}},\color{lightblue}{\text{columns}}]$$`

* The first value after the opening square bracket refers to `$\color{orange}{\text{which rows}}$` you want to keep.

* The second value refers to `$\color{lightblue}{\text{which columns}}$` you want to keep.

---

## Indexing

So if we only want to keep the value in the first row of the first column of our `animals_data`, this is how we would do it:

```r
animals_data[1, 1]
```

```
#> [1] "greenland_shark"
```

*If* we want to keep a certain row but all columns we can do this by leaving the *second* value within the square brackets empty.

```r
animals_data[1, ]
```

```
#>           animals animal_lifespans animal_to_human
#> 1 greenland_shark              392             3.2
```

---

## Indexing

*If* we want to keep a certain column but keep all rows we can do this by leaving the *first* value within the square brackets empty.

```r
animals_data[, 1]
```
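Rows and columns can also be picked by range or by name. The sketch below uses a small made-up data frame (`small_animals`) so that it runs on its own; note that selecting a single column of a base R data frame returns a plain vector unless we ask for `drop = FALSE`.

```r
small_animals <- data.frame(
  animal   = c("dog", "mouse", "lion"),
  lifespan = c(24, 4, 27)
)

small_animals[1:2, ]                       # first two rows, all columns
small_animals[, "lifespan"]                # one column by name, returned as a vector
#> [1] 24  4 27
small_animals[, "lifespan", drop = FALSE]  # same column, kept as a data frame
```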

---

## Indexing with logical tests

We can also do more complex indexing by keeping only the rows that fulfill a certain condition. Let's say we only want to keep the rows that contain animals that have longer lifespans than humans.

```r
animals_to_check <- animals_data$animal_lifespans > human_lifespan
```

```r
animals_data[animals_to_check, ]
```

```
#>              animals animal_lifespans animal_to_human
#> 1    greenland_shark              392        3.200000
#> 3 galapagos_tortoise              177        1.444898
```
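Several tests can be combined into one condition. Here is a minimal sketch on a made-up data frame (`small_animals` and `mid_range` are illustrative names), keeping only the animals with a lifespan between 10 and 50 years:

```r
small_animals <- data.frame(
  animal   = c("dog", "mouse", "lion", "alligator"),
  lifespan = c(24, 4, 27, 77)
)

# & means "and", | means "or"; each test yields one TRUE/FALSE per row
mid_range <- small_animals$lifespan > 10 & small_animals$lifespan < 50
mid_range
#> [1]  TRUE FALSE  TRUE FALSE

small_animals[mid_range, ]
```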

---

## Break?

Break!

---

# R Packages

---

## R Packages

Packages are at the heart of R:

* R packages are basically a collection of functions that you load into your working environment.

* They contain code that other R users have prepared for the community.

* It's good to know your packages; they can really make your life easier.

* I suggest keeping track of package developments on Twitter via #rstats.

---

## R Packages

You can install packages in R using the `install.packages` function:

```r
install.packages("janitor")
```

However, installing is not enough. You also need to load the package via `library`.

```r
library(janitor)
```

Think of `install.packages` as buying a set of tools (for free!) and `library` as pulling out the tools each time you want to work with them.
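Since installing only has to happen once, it can be handy to check whether a package is already there before installing it. The helper below is a sketch with a made-up name (`is_installed`); it relies on base R's `requireNamespace()`, and the actual install step is commented out so that running the example does not trigger a download.

```r
# Hypothetical helper: TRUE if a package is installed, FALSE otherwise
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")  # ships with R, so this should be TRUE

# Install only when missing, then load
# if (!is_installed("janitor")) install.packages("janitor")
# library(janitor)
```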

---

![](https://predictivehacks.com/wp-content/uploads/2020/11/tidyverse-default.png)

---

## What is the `tidyverse`?

The tidyverse describes itself as follows:

> The tidyverse is an opinionated **collection of R packages** designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

---

## Core principle: tidy data

* Every column is a variable.
* Every row is an observation.
* Every cell is a single value.

We have already seen tidy data:

| Animal | Maximum Lifespan | Animal/Human Years Ratio  |
| --- | --- | --- | 
| Domestic dog | 24.0 | 5.10 |
| Domestic cat | 30.0 | 4.08 |
| American alligator | 77.0 | 1.59 | 
| Golden hamster | 3.9 | 31.41 |
| King penguin | 26.0 |  4.71 |

---

## Untidy data I

.leftcol[
| Animal | Type | Value  |
| --- | --- | --- | 
| Domestic dog | lifespan | 24.0 |
| Domestic dog | ratio | 5.10 |
| Domestic cat | lifespan | 30.0 |
| Domestic cat | ratio | 4.08 |
| American alligator | lifespan | 77.0 | 
| American alligator | ratio | 1.59 |
| Golden hamster | lifespan | 3.9 |
| Golden hamster | ratio | 31.41 |
| King penguin | lifespan |  26.0 |
| King penguin | ratio |  4.71 |
]

<br>

.rightcol[
The data on the left has multiple rows for the same observation (animal).

= not tidy

]

---

## Untidy data II

| Animal | Lifespan/Ratio  |
| --- | --- | 
| Domestic dog | 24.0 / 5.10 |
| Domestic cat | 30.0 / 4.08 |
| American alligator | 77.0 / 1.59 | 
| Golden hamster | 3.9 / 31.41 |
| King penguin | 26.0 /  4.71 |

The data above has multiple variables per column.

= not tidy

---

## Core principle: tidy data

Tidy data has two decisive advantages:

* Consistently prepared data is easier to read, process, load and save.

* Many procedures (or the associated functions) in R require this type of data.

---

## Installing and loading the tidyverse

First we install the packages of the tidyverse like this:

```r
install.packages("tidyverse")
```

Then we load them:

```r
library(tidyverse)
```

---

## A new dataset appears..

We are going to work with a new dataset from here on out.

No worries, we will stay within the animal kingdom but we need a dataset that is a little more complex than what we have seen already.

---

## A new dataset appears..

We are going to work with a new dataset from here on out.

Meet the Palmer penguins! Data were collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/).

.leftcol[
<center>
<img src="https://github.com/allisonhorst/palmerpenguins/raw/main/man/figures/lter_penguins.png" style="width: 80%" />
</center>
]

.rightcol[
<center>
<img src="https://github.com/allisonhorst/palmerpenguins/raw/main/man/figures/culmen_depth.png" style="width: 80%" />
</center>
.right[
.fifty[Artist: [Allison Horst](https://github.com/allisonhorst)]]
]

---

## Palmer Penguins

We could install the R package `palmerpenguins` and then access the data.

However, we are going to use a different method: directly load a .csv file (comma-separated values) into R from the internet.

We can use the `readr` package which provides many convenient functions to load data into R. Here we need `read_csv`:

```r
penguins_raw <- read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins_raw.csv")
```
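`read_csv` works the same way on any comma-separated text. To try it without an internet connection, here is a sketch that reads a tiny made-up CSV straight from a string; it assumes readr 2.0 or newer, where `I()` marks a string as literal data and `show_col_types = FALSE` silences the column type message.

```r
library(readr)

# A tiny, made-up CSV; I() marks the string as data rather than a file path
csv_text <- "animal,lifespan\ndog,24\nmouse,4"

small_animals <- read_csv(I(csv_text), show_col_types = FALSE)
small_animals
```

The first line of the text becomes the variable names, and `read_csv` guesses the type of each column from its values.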

---

## Palmer Penguins

```r
penguins_raw
```

```
#> # A tibble: 344 × 17
#>    studyName `Sample Number` Species         Region Island Stage `Individual ID`
#>    <chr>               <dbl> <chr>           <chr>  <chr>  <chr> <chr>          
#>  1 PAL0708                 1 Adelie Penguin… Anvers Torge… Adul… N1A1           
#>  2 PAL0708                 2 Adelie Penguin… Anvers Torge… Adul… N1A2           
#>  3 PAL0708                 3 Adelie Penguin… Anvers Torge… Adul… N2A1           
#>  4 PAL0708                 4 Adelie Penguin… Anvers Torge… Adul… N2A2           
#>  5 PAL0708                 5 Adelie Penguin… Anvers Torge… Adul… N3A1           
#>  6 PAL0708                 6 Adelie Penguin… Anvers Torge… Adul… N3A2           
#>  7 PAL0708                 7 Adelie Penguin… Anvers Torge… Adul… N4A1           
#>  8 PAL0708                 8 Adelie Penguin… Anvers Torge… Adul… N4A2           
#>  9 PAL0708                 9 Adelie Penguin… Anvers Torge… Adul… N5A1           
#> 10 PAL0708                10 Adelie Penguin… Anvers Torge… Adul… N5A2           
#> # ℹ 334 more rows
#> # ℹ 10 more variables: `Clutch Completion` <chr>, `Date Egg` <date>,
#> #   `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>,
#> #   `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>,
#> #   `Delta 15 N (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>
```

---

## Palmer Penguins

We can also take a look at the dataset using the `glimpse` function from `dplyr`.

```r
glimpse(penguins_raw)
```

```
#> Rows: 344
#> Columns: 17
#> $ studyName             <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL…
#> $ `Sample Number`       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
#> $ Species               <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P…
#> $ Region                <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"…
#> $ Island                <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse…
#> $ Stage                 <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu…
#> $ `Individual ID`       <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", …
#> $ `Clutch Completion`   <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", …
#> $ `Date Egg`            <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16,…
#> $ `Culmen Length (mm)`  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34…
#> $ `Culmen Depth (mm)`   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18…
#> $ `Flipper Length (mm)` <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190,…
#> $ `Body Mass (g)`       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 34…
#> $ Sex                   <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE"…
#> $ `Delta 15 N (o/oo)`   <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18…
#> $ `Delta 13 C (o/oo)`   <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.298…
#> $ Comments              <chr> "Not enough blood for isotopes.", NA, NA, "Adult…
```

---

## initial data cleaning

### using `janitor`

---

## cleaning with `janitor`

`janitor` is not officially part of the tidyverse collection of packages, but in my view it is incredibly important to know.

It provides some convenient functions for basic data cleaning.

Just like any tidyverse-style package, it fulfills the following criterion for its functions:

> The data is always the first argument.

This helps us to match by position.

---

## cleaning with `janitor`

One annoyance with the `penguins_raw` data is that it has spaces in the variable names. Urgh!

R requires backticks around variable names that contain spaces:

```r
penguins_raw$`Delta 15 N (o/oo)`
penguins_raw$`Flipper Length (mm)`
```

`janitor` can help with that, using a function called `clean_names()`.

---

## cleaning with `janitor`

`clean_names()` just magically turns all our messy column names into readable lower-case snake case:

```r
library(janitor)

penguins_clean <- clean_names(penguins_raw) 
```

This is what the variables look like now:

```r
penguins_clean$delta_15_n_o_oo
penguins_clean$flipper_length_mm
```
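To see the renaming in isolation, here is a sketch on a one-row data frame. The data frame `messy` is made up for illustration and only borrows two of the raw penguin column names:

```r
library(janitor)

# check.names = FALSE keeps the messy names exactly as typed
messy <- data.frame(
  studyName             = "PAL0708",
  `Flipper Length (mm)` = 181,
  check.names = FALSE
)

names(messy)               # "studyName", "Flipper Length (mm)"
names(clean_names(messy))  # "study_name", "flipper_length_mm"
```

Note that `clean_names()` returns a new data frame; `messy` itself stays unchanged unless we assign the result back.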

---

## cleaning with `janitor`

```r
glimpse(penguins_clean)
```

```
#> Rows: 344
#> Columns: 17
#> $ study_name        <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708…
#> $ sample_number     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
#> $ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
#> $ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
#> $ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ stage             <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
#> $ individual_id     <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A…
#> $ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
#> $ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
#> $ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ culmen_depth_mm   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…
#> $ delta_15_n_o_oo   <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18718,…
#> $ delta_13_c_o_oo   <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.29805, …
#> $ comments          <chr> "Not enough blood for isotopes.", NA, NA, "Adult not…
```

---

## cleaning with `janitor`

Now we have another problem. Not all variables in the `penguins_clean` data set are that useful.

Some of them are the same across all observations. We don't need those variables, like `region`.

```r
table(penguins_clean$region)
```

```
#> 
#> Anvers 
#>    344
```

We can use the base R function `table` to quickly count how often each value of a variable occurs.

---

## cleaning with `janitor`

Here to help get rid of these *constant* columns is the function `remove_constant()`.

```r
penguins_clean <- remove_constant(penguins_clean, quiet = FALSE)
```

```
#> Removing 2 constant columns of 17 columns total (Removed: region, stage).
```

When we set `quiet = FALSE` we even get some feedback on what exactly was removed. Neat!

Another useful function in `janitor` is `remove_empty()`, which removes all rows or columns that consist only of missing values (i.e. `NA`).
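Here is a minimal sketch of `remove_empty()` on a made-up data frame (`gappy`) in which one row and one column contain nothing but `NA`:

```r
library(janitor)

# Made-up data: row 2 and column `b` contain nothing but NA
gappy <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, NA),
  c = c("x", NA, "z")
)

tidied <- remove_empty(gappy, which = c("rows", "cols"))
dim(tidied)  # 2 rows and 2 columns remain
```

The `which` argument controls whether empty rows, empty columns, or both are dropped.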

---

## Data cleaning using `tidyr`

---

## Data cleaning using `tidyr`

Now we are already fairly advanced in our tidying.

But our dataset is not entirely tidy yet.

Consider the `species` variable:

```r
table(penguins_clean$species)
```

```
#> 
#>       Adelie Penguin (Pygoscelis adeliae) 
#>                                       152 
#> Chinstrap penguin (Pygoscelis antarctica) 
#>                                        68 
#>         Gentoo penguin (Pygoscelis papua) 
#>                                       124
```

---

## `tidyr`

```r
table(penguins_clean$species)
```

This variable violates the tidy rule that each cell should hold a single value.

`species` holds both the *common name* and the *Latin name* of the penguin.

---

## `tidyr`

We can use a `tidyr` function called `separate()` to turn this into two variables.

Two arguments are important for that:

+ `sep`: specifies by which character the value should be split
+ `into`: a vector which specifies the resulting new variable names

---

## `tidyr`

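In our case we want to split by a space followed by an opening bracket (` \\(`) and will name our variables `species` and `latin_name`. The bracket is escaped with `\\` because `sep` is interpreted as a regular expression: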

```r
penguins_clean <- separate(penguins_clean, species, sep = " \\(", into = c("species", "latin_name"))
```

```r
penguins_clean
```

```
#> # A tibble: 344 × 16
#>    study_name sample_number species        latin_name       island individual_id
#>    <chr>              <dbl> <chr>          <chr>            <chr>  <chr>        
#>  1 PAL0708                1 Adelie Penguin Pygoscelis adel… Torge… N1A1         
#>  2 PAL0708                2 Adelie Penguin Pygoscelis adel… Torge… N1A2         
#>  3 PAL0708                3 Adelie Penguin Pygoscelis adel… Torge… N2A1         
#>  4 PAL0708                4 Adelie Penguin Pygoscelis adel… Torge… N2A2         
#>  5 PAL0708                5 Adelie Penguin Pygoscelis adel… Torge… N3A1         
#>  6 PAL0708                6 Adelie Penguin Pygoscelis adel… Torge… N3A2         
#>  7 PAL0708                7 Adelie Penguin Pygoscelis adel… Torge… N4A1         
#>  8 PAL0708                8 Adelie Penguin Pygoscelis adel… Torge… N4A2         
#>  9 PAL0708                9 Adelie Penguin Pygoscelis adel… Torge… N5A1         
#> 10 PAL0708               10 Adelie Penguin Pygoscelis adel… Torge… N5A2         
#> # ℹ 334 more rows
#> # ℹ 10 more variables: clutch_completion <chr>, date_egg <date>,
#> #   culmen_length_mm <dbl>, culmen_depth_mm <dbl>, flipper_length_mm <dbl>,
#> #   body_mass_g <dbl>, sex <chr>, delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>,
#> #   comments <chr>
```
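One detail to watch for: splitting on ` \\(` removes the opening bracket but not the closing one, so the `latin_name` values still end in `)`. A possible clean-up, sketched here on a made-up two-row data frame (`toy`) with base R's `sub()`:

```r
library(tidyr)

# Two example values in the same "Common name (Latin name)" pattern
toy <- data.frame(
  species = c("Adelie Penguin (Pygoscelis adeliae)",
              "Gentoo penguin (Pygoscelis papua)")
)

toy_split <- separate(toy, species, sep = " \\(",
                      into = c("species", "latin_name"))

# Remove the closing bracket left behind at the end of each value
toy_split$latin_name <- sub("\\)$", "", toy_split$latin_name)
toy_split
```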

---

## `tidyr`

In our case we want to split by a space and an opening bracket (` \\(`), and we will name our variables `species` and `latin_name`:

```r
penguins_clean <- separate(penguins_clean, species, sep = " \\(", into = c("species", "latin_name"))
```

There is also a function called `unite()` which works in the opposite direction.

---

## `tidyr`

Now our data is in tidy format!

We were in luck because the data already came in a format with one observation per row.

But what if that is not the case?

---

### `pivot_wider()` and `pivot_longer()`

`tidyr` also comes equipped to deal with data that spreads one observation across more than one row.

The function to use here is called `pivot_wider`.

Now our `penguins_clean` data is already tidy.

But we can just read in a dataset that isn't:

```r
untidy_animals <- read_csv("https://github.com/favstats/ds3_r_intro/blob/main/data/untidy_animals.csv?raw=true")
```

---

### `pivot_wider()` and `pivot_longer()`

```r
untidy_animals
```

```
#> # A tibble: 10 × 3
#>    Animal             Type     Value
#>    <chr>              <chr>    <dbl>
#>  1 Domestic dog       lifespan 24   
#>  2 Domestic dog       ratio     5.1 
#>  3 Domestic cat       lifespan 30   
#>  4 Domestic cat       ratio     4.08
#>  5 American alligator lifespan 77   
#>  6 American alligator ratio     1.59
#>  7 Golden hamster     lifespan  3.9 
#>  8 Golden hamster     ratio    31.4 
#>  9 King penguin       lifespan 26   
#> 10 King penguin       ratio     4.71
```

---

### `pivot_wider()` and `pivot_longer()`

You may recognize this data from the slide *Untidy data I*.

Now let's use `pivot_wider` to make every row an observation.

We need two main arguments for that:

1. `names_from`: tells the function where the new column names come from
2. `values_from`: tells the function where the values should come from

---

### `pivot_wider()` and `pivot_longer()`

```r
tidy_animals <- pivot_wider(untidy_animals, 
                            names_from = Type, 
                            values_from = Value)

tidy_animals
```

```
#> # A tibble: 5 × 3
#>   Animal             lifespan ratio
#>   <chr>                 <dbl> <dbl>
#> 1 Domestic dog           24    5.1 
#> 2 Domestic cat           30    4.08
#> 3 American alligator     77    1.59
#> 4 Golden hamster          3.9 31.4 
#> 5 King penguin           26    4.71
```

---

### `pivot_wider()` and `pivot_longer()`

`pivot_longer` can untidy our data again.

The argument `cols = ` tells the function which variables to turn into long format:

```r
pivot_longer(tidy_animals, cols = c(lifespan, ratio))
```

```
#> # A tibble: 10 × 3
#>    Animal             name     value
#>    <chr>              <chr>    <dbl>
#>  1 Domestic dog       lifespan 24   
#>  2 Domestic dog       ratio     5.1 
#>  3 Domestic cat       lifespan 30   
#>  4 Domestic cat       ratio     4.08
#>  5 American alligator lifespan 77   
#>  6 American alligator ratio     1.59
#>  7 Golden hamster     lifespan  3.9 
#>  8 Golden hamster     ratio    31.4 
#>  9 King penguin       lifespan 26   
#> 10 King penguin       ratio     4.71
```
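By default the new columns are called `name` and `value`. If we want to get the original column names back, the `names_to` and `values_to` arguments let us choose them. A small sketch on a two-row version of the data, re-typed here so the example runs on its own (`long_animals` is an illustrative name):

```r
library(tidyr)

tidy_animals <- data.frame(
  Animal   = c("Domestic dog", "Domestic cat"),
  lifespan = c(24, 30),
  ratio    = c(5.10, 4.08)
)

long_animals <- pivot_longer(tidy_animals,
                             cols = c(lifespan, ratio),
                             names_to = "Type",
                             values_to = "Value")
names(long_animals)  # "Animal", "Type", "Value"
```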

---

## Data manipulation using `dplyr`

---

## `select()`

helps you select variables

---

## `select()`

![](images/select.png)

`select()` is part of the dplyr package and helps you select variables

Remember: with tidyverse-style functions, **data is always the first argument**.

---

## `select()`

![](images/select.png)

Here we only keep `individual_id`, `sex` and `species`.

```r
select(penguins_clean, individual_id, sex, species)
```

```
#> # A tibble: 344 × 3
#>    individual_id sex    species       
#>    <chr>         <chr>  <chr>         
#>  1 N1A1          MALE   Adelie Penguin
#>  2 N1A2          FEMALE Adelie Penguin
#>  3 N2A1          FEMALE Adelie Penguin
#>  4 N2A2          <NA>   Adelie Penguin
#>  5 N3A1          FEMALE Adelie Penguin
#>  6 N3A2          MALE   Adelie Penguin
#>  7 N4A1          FEMALE Adelie Penguin
#>  8 N4A2          MALE   Adelie Penguin
#>  9 N5A1          <NA>   Adelie Penguin
#> 10 N5A2          <NA>   Adelie Penguin
#> # ℹ 334 more rows
```

---

## `select()`

We can also **remove** variables with a **`-`** (minus).

Here we remove `individual_id`, `sex` and `species`.

```r
names(select(penguins_clean, -individual_id, -sex, -species))
```

```
#>  [1] "study_name"        "sample_number"     "latin_name"       
#>  [4] "island"            "clutch_completion" "date_egg"         
#>  [7] "culmen_length_mm"  "culmen_depth_mm"   "flipper_length_mm"
#> [10] "body_mass_g"       "delta_15_n_o_oo"   "delta_13_c_o_oo"  
#> [13] "comments"
```

`individual_id`, `sex` and `species` are now removed, just as we wanted!

---

#### Selection helpers

These *selection helpers* match variables according to a given pattern.

`starts_with()`: Starts with a prefix.

`ends_with()`: Ends with a suffix.

`contains()`: Contains a literal string.

`matches()`: Matches a regular expression.

---

#### Selection helpers

For example: let's keep all variables that start with `s`:

```r
select(penguins_clean, starts_with("s"))
```

```
#> # A tibble: 344 × 4
#>    study_name sample_number species        sex   
#>    <chr>              <dbl> <chr>          <chr> 
#>  1 PAL0708                1 Adelie Penguin MALE  
#>  2 PAL0708                2 Adelie Penguin FEMALE
#>  3 PAL0708                3 Adelie Penguin FEMALE
#>  4 PAL0708                4 Adelie Penguin <NA>  
#>  5 PAL0708                5 Adelie Penguin FEMALE
#>  6 PAL0708                6 Adelie Penguin MALE  
#>  7 PAL0708                7 Adelie Penguin FEMALE
#>  8 PAL0708                8 Adelie Penguin MALE  
#>  9 PAL0708                9 Adelie Penguin <NA>  
#> 10 PAL0708               10 Adelie Penguin <NA>  
#> # ℹ 334 more rows
```
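The other helpers work the same way. Here is a sketch on a made-up, one-row stand-in for the penguin data (`toy_penguins`), showing only the selected column names:

```r
library(dplyr)

# A made-up, one-row stand-in for the penguin data
toy_penguins <- data.frame(
  species           = "Adelie Penguin",
  culmen_length_mm  = 39.1,
  flipper_length_mm = 181,
  body_mass_g       = 3750
)

names(select(toy_penguins, ends_with("_mm")))     # both millimetre columns
names(select(toy_penguins, contains("length")))   # the same two columns here
names(select(toy_penguins, matches("_(mm|g)$")))  # all three measurement columns
```

`matches()` is the most flexible of the helpers because it takes a regular expression, here "ends in `_mm` or `_g`".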

---

#### Even more ways to select

Select the first 5 variables:

```r
select(penguins_clean, 1:5)
```

```
#> # A tibble: 344 × 5
#>    study_name sample_number species        latin_name          island   
#>    <chr>              <dbl> <chr>          <chr>               <chr>    
#>  1 PAL0708                1 Adelie Penguin Pygoscelis adeliae) Torgersen
#>  2 PAL0708                2 Adelie Penguin Pygoscelis adeliae) Torgersen
#>  3 PAL0708                3 Adelie Penguin Pygoscelis adeliae) Torgersen
#>  4 PAL0708                4 Adelie Penguin Pygoscelis adeliae) Torgersen
#>  5 PAL0708                5 Adelie Penguin Pygoscelis adeliae) Torgersen
#>  6 PAL0708                6 Adelie Penguin Pygoscelis adeliae) Torgersen
#>  7 PAL0708                7 Adelie Penguin Pygoscelis adeliae) Torgersen
#>  8 PAL0708                8 Adelie Penguin Pygoscelis adeliae) Torgersen
#>  9 PAL0708                9 Adelie Penguin Pygoscelis adeliae) Torgersen
#> 10 PAL0708               10 Adelie Penguin Pygoscelis adeliae) Torgersen
#> # ℹ 334 more rows
```

---

#### Even more ways to select

Select everything from `individual_id` to `flipper_length_mm`.

```r
select(penguins_clean, individual_id:flipper_length_mm)
```

```
#> # A tibble: 344 × 6
#>    individual_id clutch_completion date_egg   culmen_length_mm culmen_depth_mm
#>    <chr>         <chr>             <date>                <dbl>           <dbl>
#>  1 N1A1          Yes               2007-11-11             39.1            18.7
#>  2 N1A2          Yes               2007-11-11             39.5            17.4
#>  3 N2A1          Yes               2007-11-16             40.3            18  
#>  4 N2A2          Yes               2007-11-16             NA              NA  
#>  5 N3A1          Yes               2007-11-16             36.7            19.3
#>  6 N3A2          Yes               2007-11-16             39.3            20.6
#>  7 N4A1          No                2007-11-15             38.9            17.8
#>  8 N4A2          No                2007-11-15             39.2            19.6
#>  9 N5A1          Yes               2007-11-09             34.1            18.1
#> 10 N5A2          Yes               2007-11-09             42              20.2
#> # ℹ 334 more rows
#> # ℹ 1 more variable: flipper_length_mm <dbl>
```

---

## `filter()`

helps you filter rows

---

## `filter()`

helps you filter rows

![](images/filter.png)

Here we only keep penguins from the island `Dream`.

```r
filter(penguins_clean, island == "Dream")
```

```
#> # A tibble: 124 × 16
#>    study_name sample_number species        latin_name       island individual_id
#>    <chr>              <dbl> <chr>          <chr>            <chr>  <chr>        
#>  1 PAL0708               31 Adelie Penguin Pygoscelis adel… Dream  N21A1        
#>  2 PAL0708               32 Adelie Penguin Pygoscelis adel… Dream  N21A2        
#>  3 PAL0708               33 Adelie Penguin Pygoscelis adel… Dream  N22A1        
#>  4 PAL0708               34 Adelie Penguin Pygoscelis adel… Dream  N22A2        
#>  5 PAL0708               35 Adelie Penguin Pygoscelis adel… Dream  N23A1        
#>  6 PAL0708               36 Adelie Penguin Pygoscelis adel… Dream  N23A2        
#>  7 PAL0708               37 Adelie Penguin Pygoscelis adel… Dream  N24A1        
#>  8 PAL0708               38 Adelie Penguin Pygoscelis adel… Dream  N24A2        
#>  9 PAL0708               39 Adelie Penguin Pygoscelis adel… Dream  N25A1        
#> 10 PAL0708               40 Adelie Penguin Pygoscelis adel… Dream  N25A2        
#> # ℹ 114 more rows
#> # ℹ 10 more variables: clutch_completion <chr>, date_egg <date>,
#> #   culmen_length_mm <dbl>, culmen_depth_mm <dbl>, flipper_length_mm <dbl>,
#> #   body_mass_g <dbl>, sex <chr>, delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>,
#> #   comments <chr>
```

---

## `filter()`

Here the **`%in%`** operator can come in handy again if we want to filter more than one island:

```r
islands_to_keep <- c("Dream", "Biscoe")

filter(penguins_clean, island %in% islands_to_keep)
```

```
#> # A tibble: 292 × 16
#>    study_name sample_number species        latin_name       island individual_id
#>    <chr>              <dbl> <chr>          <chr>            <chr>  <chr>        
#>  1 PAL0708               21 Adelie Penguin Pygoscelis adel… Biscoe N11A1        
#>  2 PAL0708               22 Adelie Penguin Pygoscelis adel… Biscoe N11A2        
#>  3 PAL0708               23 Adelie Penguin Pygoscelis adel… Biscoe N12A1        
#>  4 PAL0708               24 Adelie Penguin Pygoscelis adel… Biscoe N12A2        
#>  5 PAL0708               25 Adelie Penguin Pygoscelis adel… Biscoe N13A1        
#>  6 PAL0708               26 Adelie Penguin Pygoscelis adel… Biscoe N13A2        
#>  7 PAL0708               27 Adelie Penguin Pygoscelis adel… Biscoe N17A1        
#>  8 PAL0708               28 Adelie Penguin Pygoscelis adel… Biscoe N17A2        
#>  9 PAL0708               29 Adelie Penguin Pygoscelis adel… Biscoe N18A1        
#> 10 PAL0708               30 Adelie Penguin Pygoscelis adel… Biscoe N18A2        
#> # ℹ 282 more rows
#> # ℹ 10 more variables: clutch_completion <chr>, date_egg <date>,
#> #   culmen_length_mm <dbl>, culmen_depth_mm <dbl>, flipper_length_mm <dbl>,
#> #   body_mass_g <dbl>, sex <chr>, delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>,
#> #   comments <chr>
```
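
---

## `filter()`

The same test can be turned around to drop islands instead of keeping them. A small sketch, assuming a made-up tibble `toy_penguins` and a hypothetical vector `islands_to_drop`:

```r
library(dplyr)

# Hypothetical toy data standing in for penguins_clean
toy_penguins <- tibble(island = c("Dream", "Biscoe", "Torgersen", "Dream"))

islands_to_drop <- c("Dream", "Biscoe")

# ! negates the test, so only rows NOT in islands_to_drop are kept
not_dropped <- filter(toy_penguins, !(island %in% islands_to_drop))
```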

---

## `mutate()`

helps you create variables

---

## `mutate()`

![](images/mutate.png)

`mutate` will take a statement like this:

`variable_name = some_calculation`

and attach `variable_name` at the *end of the dataset*.

---

## `mutate()`

![](images/mutate.png)

Let's say we want to express penguin body mass in kilograms rather than grams.

```r
pg_new <- mutate(penguins_clean, bodymass_kg = body_mass_g/1000)
```

```r
select(pg_new, bodymass_kg, body_mass_g)
```

```
#> # A tibble: 344 × 2
#>    bodymass_kg body_mass_g
#>          <dbl>       <dbl>
#>  1        3.75        3750
#>  2        3.8         3800
#>  3        3.25        3250
#>  4       NA             NA
#>  5        3.45        3450
#>  6        3.65        3650
#>  7        3.62        3625
#>  8        4.68        4675
#>  9        3.48        3475
#> 10        4.25        4250
#> # ℹ 334 more rows
```
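
---

## `mutate()`

A single `mutate()` call can create several variables, and later ones may use those created earlier in the same call. A minimal sketch on made-up data (`toy_penguins`, `flipper_length_cm`, and `mass_per_cm` are names invented for this illustration):

```r
library(dplyr)

# Hypothetical toy data standing in for penguins_clean
toy_penguins <- tibble(
  body_mass_g       = c(3750, 3800, NA),
  flipper_length_mm = c(181, 186, 195)
)

pg_toy <- mutate(
  toy_penguins,
  bodymass_kg       = body_mass_g / 1000,
  flipper_length_cm = flipper_length_mm / 10,
  # bodymass_kg and flipper_length_cm were created just above
  mass_per_cm       = bodymass_kg / flipper_length_cm
)

names(pg_toy)
```

The new columns are appended in the order they were written, and a missing input (`NA`) gives a missing result.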

---

#### Recoding with `ifelse`

`ifelse()` is a very useful function that lets you easily recode variables based on logical tests.

Its basic functionality looks like this:

`$$\color{red}{\text{ifelse}}(\color{orange}{\text{logical test}},\color{blue}{\text{value if TRUE}}, \color{green}{\text{value if FALSE}})$$`

Here is a very basic example:

```r
ifelse(1 == 1, "Pick me if test is TRUE", "Pick me if test is FALSE")
```

```
#> [1] "Pick me if test is TRUE"
```

```r
ifelse(1 != 1, "Pick me if test is TRUE", "Pick me if test is FALSE")
```

```
#> [1] "Pick me if test is FALSE"
```

---

#### Recoding with `ifelse`

Let's use `ifelse` in combination with `mutate`.

Let's create the variable `sex_short` which has a shorter label for sex:

```r
pg_new <- mutate(penguins_clean, sex_short = ifelse(sex == "MALE", "m", "f"))

select(pg_new, sex, sex_short)
```

```
#> # A tibble: 344 × 2
#>    sex    sex_short
#>    <chr>  <chr>    
#>  1 MALE   m        
#>  2 FEMALE f        
#>  3 FEMALE f        
#>  4 <NA>   <NA>     
#>  5 FEMALE f        
#>  6 MALE   m        
#>  7 FEMALE f        
#>  8 MALE   m        
#>  9 <NA>   <NA>     
#> 10 <NA>   <NA>     
#> # ℹ 334 more rows
```
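
---

#### Recoding with `ifelse`

In the output above, penguins with a missing `sex` stay missing. If an explicit label is wanted instead, one option is to test for `NA` first. A small sketch using a made-up vector (the label `"unknown"` is an arbitrary choice for illustration):

```r
# Hypothetical toy vector standing in for the sex column
sex <- c("MALE", "FEMALE", NA)

# A missing value in the test gives a missing result
ifelse(sex == "MALE", "m", "f")
#> [1] "m" "f" NA

# Checking is.na() first lets us handle missing values explicitly
sex_short <- ifelse(is.na(sex), "unknown", ifelse(sex == "MALE", "m", "f"))
sex_short
#> [1] "m"       "f"       "unknown"
```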

---

#### Recoding with `case_when`

`case_when` (from the `dplyr` package) is like `ifelse` but allows for much more complex combinations.

The basic setup for a `case_when` call looks like this:

case_when(

&nbsp;&nbsp;&nbsp; `$\color{orange}{\text{logical test}}$` ~ `$\color{blue}{\text{what should happen if TRUE}}$`,

&nbsp;&nbsp;&nbsp; `$TRUE$` ~ `$\color{green}{\text{what should happen with everything else}}$`,

)

---

#### Recoding with `case_when`

The following code recodes a numeric vector (1 through 50) into three categories:

```r
x <- 1:50

case_when(
  x %in% 1:10 ~ "1 through 10",
  x %in% 11:30 ~ "11 through 30",
  TRUE ~ "above 30"
)
```

```
#>  [1] "1 through 10"  "1 through 10"  "1 through 10"  "1 through 10" 
#>  [5] "1 through 10"  "1 through 10"  "1 through 10"  "1 through 10" 
#>  [9] "1 through 10"  "1 through 10"  "11 through 30" "11 through 30"
#> [13] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
#> [17] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
#> [21] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
#> [25] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
#> [29] "11 through 30" "11 through 30" "above 30"      "above 30"     
#> [33] "above 30"      "above 30"      "above 30"      "above 30"     
#> [37] "above 30"      "above 30"      "above 30"      "above 30"     
#> [41] "above 30"      "above 30"      "above 30"      "above 30"     
#> [45] "above 30"      "above 30"      "above 30"      "above 30"     
#> [49] "above 30"      "above 30"
```

---

#### Recoding with `case_when`

Let's use `case_when` in combination with `mutate`.

Let's create the variable `island_short`, which has a shorter label for `island`:

```r
pg_new <- mutate(penguins_clean, 
        island_short = case_when(
          island == "Torgersen" ~ "T",
          island == "Biscoe" ~ "B",
          island == "Dream" ~ "D"
        ))
```

```r
select(pg_new, island, island_short)
```

```
#> # A tibble: 344 × 2
#>    island    island_short
#>    <chr>     <chr>       
#>  1 Torgersen T           
#>  2 Torgersen T           
#>  3 Torgersen T           
#>  4 Torgersen T           
#>  5 Torgersen T           
#>  6 Torgersen T           
#>  7 Torgersen T           
#>  8 Torgersen T           
#>  9 Torgersen T           
#> 10 Torgersen T           
#> # ℹ 334 more rows
```
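
---

#### Recoding with `case_when`

The call above has no `TRUE ~ ...` line, so any value that matches none of the tests becomes `NA`. A minimal sketch on a made-up vector (the island `"Elephant"` and the label `"other"` are invented for illustration):

```r
library(dplyr)

# Hypothetical toy vector; "Elephant" is not in the penguin data
island <- c("Torgersen", "Biscoe", "Dream", "Elephant")

# Without a fallback, unmatched values become NA
no_fallback <- case_when(
  island == "Torgersen" ~ "T",
  island == "Biscoe" ~ "B",
  island == "Dream" ~ "D"
)

# With TRUE ~ ..., unmatched values get the fallback label
with_fallback <- case_when(
  island == "Torgersen" ~ "T",
  island == "Biscoe" ~ "B",
  island == "Dream" ~ "D",
  TRUE ~ "other"
)
```

The tests are checked from top to bottom and the first match wins, so the `TRUE` line belongs at the end.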

---

## `rename()`

helps you rename variables

---

## `rename()`

Just changes the variable name but leaves all else intact:

```r
rename(penguins_clean, sample = sample_number)
```

```
#> # A tibble: 344 × 16
#>    study_name sample species   latin_name island individual_id clutch_completion
#>    <chr>       <dbl> <chr>     <chr>      <chr>  <chr>         <chr>            
#>  1 PAL0708         1 Adelie P… Pygosceli… Torge… N1A1          Yes              
#>  2 PAL0708         2 Adelie P… Pygosceli… Torge… N1A2          Yes              
#>  3 PAL0708         3 Adelie P… Pygosceli… Torge… N2A1          Yes              
#>  4 PAL0708         4 Adelie P… Pygosceli… Torge… N2A2          Yes              
#>  5 PAL0708         5 Adelie P… Pygosceli… Torge… N3A1          Yes              
#>  6 PAL0708         6 Adelie P… Pygosceli… Torge… N3A2          Yes              
#>  7 PAL0708         7 Adelie P… Pygosceli… Torge… N4A1          No               
#>  8 PAL0708         8 Adelie P… Pygosceli… Torge… N4A2          No               
#>  9 PAL0708         9 Adelie P… Pygosceli… Torge… N5A1          Yes              
#> 10 PAL0708        10 Adelie P… Pygosceli… Torge… N5A2          Yes              
#> # ℹ 334 more rows
#> # ℹ 9 more variables: date_egg <date>, culmen_length_mm <dbl>,
#> #   culmen_depth_mm <dbl>, flipper_length_mm <dbl>, body_mass_g <dbl>,
#> #   sex <chr>, delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>
```
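
---

## `rename()`

The pattern is `new_name = old_name`, and several variables can be renamed in one call. A small sketch on a made-up tibble (`toy` is an assumption for illustration):

```r
library(dplyr)

# Hypothetical toy data with two of the penguin column names
toy <- tibble(
  sample_number = 1:2,
  individual_id = c("N1A1", "N1A2")
)

# new_name = old_name, separated by commas
renamed <- rename(toy, sample = sample_number, id = individual_id)

names(renamed)
#> [1] "sample" "id"
```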

---

## `arrange()`

orders your dataset

---

## `arrange()`

You can order your data to show the highest or lowest value first.

Lowest first:

```r
arrange(penguins_clean, sample_number)
```

Highest first:

```r
arrange(penguins_clean, desc(sample_number))
```
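
---

## `arrange()`

`arrange()` can also sort by more than one variable: later variables break ties in earlier ones. A minimal sketch on a made-up tibble (`toy_penguins` is an assumption for illustration):

```r
library(dplyr)

# Hypothetical toy data standing in for penguins_clean
toy_penguins <- tibble(
  island      = c("Dream", "Biscoe", "Dream", "Biscoe"),
  body_mass_g = c(3250, 5000, 4650, NA)
)

# Sort by island (A to Z), then by body mass within each island (heaviest first)
sorted <- arrange(toy_penguins, island, desc(body_mass_g))
```

Missing values are placed last within their group, even when `desc()` is used.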

---

## `group_by()` and `summarize()`

when you want to aggregate your data (by groups)

---

## `group_by()` and `summarize()`

Sometimes we want to calculate group statistics.

In other languages this is often a pain. With `dplyr` this is fairly easy **and** readable.

---

## `group_by()` and `summarize()`

First we group `penguins_clean` by `sex`.

```r
grouped_by_sex <- group_by(penguins_clean, sex)
```

`summarize` works in a similar way to `mutate`:

`variable_name = some_calculation`

```r
summarize(grouped_by_sex, avg_culmen_length_mm = mean(culmen_length_mm, na.rm = TRUE))
```

```
#> # A tibble: 3 × 2
#>   sex    avg_culmen_length_mm
#>   <chr>                 <dbl>
#> 1 FEMALE                 42.1
#> 2 MALE                   45.9
#> 3 <NA>                   41.3
```
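
---

## `group_by()` and `summarize()`

Just like `mutate()`, `summarize()` can compute several statistics in one call. A minimal sketch on a made-up tibble (`toy_penguins` and `n_penguins` are names invented for illustration):

```r
library(dplyr)

# Hypothetical toy data standing in for penguins_clean
toy_penguins <- tibble(
  sex              = c("FEMALE", "FEMALE", "MALE", "MALE"),
  culmen_length_mm = c(40, 44, 46, NA)
)

by_sex <- group_by(toy_penguins, sex)

stats <- summarize(
  by_sex,
  n_penguins           = n(),   # number of rows per group
  avg_culmen_length_mm = mean(culmen_length_mm, na.rm = TRUE)
)
```

`n()` counts all rows in a group, including the one with a missing measurement, while `na.rm = TRUE` leaves that row out of the mean.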

---
class: center, middle, inverse

## `count()`

When you want to count how often each value of one or more variables occurs

---

## `count()`

Now this is a function that I use all the time. Simply specify which variable you want to count:

```r
count(penguins_clean, species, sort = TRUE)
```

```
#> # A tibble: 3 × 2
#>   species               n
#>   <chr>             <int>
#> 1 Adelie Penguin      152
#> 2 Gentoo penguin      124
#> 3 Chinstrap penguin    68
```
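
---

## `count()`

`count()` can be thought of as a shortcut for grouping, summarizing with `n()`, and sorting. A minimal sketch on a made-up tibble (`toy_penguins` is an assumption for illustration):

```r
library(dplyr)

# Hypothetical toy data standing in for penguins_clean
toy_penguins <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Adelie", "Chinstrap", "Gentoo")
)

# The shortcut
counted <- count(toy_penguins, species, sort = TRUE)

# The same result, step by step
grouped    <- group_by(toy_penguins, species)
summarized <- summarize(grouped, n = n())
by_hand    <- arrange(summarized, desc(n))
```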

---

# **`%>%`**

## The pipe operator

---

## The `%>%` operator

The point of the pipe is to help you write code in a way that is easier to read and understand.

Let's consider an example with the data manipulation we have done so far:

```r
## first I select variables
pg <- select(penguins_clean, individual_id, island, body_mass_g)

## then I filter to only Dream island
pg <- filter(pg, island == "Dream")

## then I convert body_mass_g to kg
pg <- mutate(pg, bodymass_kg = body_mass_g/1000)

## rename individual id to simply id
pg <- rename(pg, id = individual_id)
```

---

## The `%>%` operator

Now this works but the problem is: we have to write a lot of code that repeats itself!

```r
pg
```

```
#> # A tibble: 124 × 4
#>    id    island body_mass_g bodymass_kg
#>    <chr> <chr>        <dbl>       <dbl>
#>  1 N21A1 Dream         3250        3.25
#>  2 N21A2 Dream         3900        3.9 
#>  3 N22A1 Dream         3300        3.3 
#>  4 N22A2 Dream         3900        3.9 
#>  5 N23A1 Dream         3325        3.32
#>  6 N23A2 Dream         4150        4.15
#>  7 N24A1 Dream         3950        3.95
#>  8 N24A2 Dream         3550        3.55
#>  9 N25A1 Dream         3300        3.3 
#> 10 N25A2 Dream         4650        4.65
#> # ℹ 114 more rows
```

---

## The `%>%` operator

Another (hardly readable) alternative is to *nest all the functions*:

```r
rename(mutate(filter(select(penguins_clean, individual_id, island, body_mass_g), island == "Dream"), bodymass_kg = body_mass_g/1000), id = individual_id)
```

---

## The `%>%` operator

*The piping style*:

Read the code from top to bottom and from left to right, and read the `%>%` as "and then".

```r
penguins_clean %>% 
  select(individual_id, island, body_mass_g) %>% 
  filter(island == "Dream") %>% 
  mutate(bodymass_kg = body_mass_g/1000) %>% 
  rename(id = individual_id)
```
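
---

## The `%>%` operator

Under the hood, the pipe passes whatever is on its left as the first argument of the function on its right, so `x %>% f(y)` is the same as `f(x, y)`. A minimal sketch (the tibble `toy_penguins` is made up for illustration):

```r
library(dplyr)

# A piped chain and its nested equivalent give the same number
piped_sum  <- c(1, 4, 9) %>% sqrt() %>% sum()
nested_sum <- sum(sqrt(c(1, 4, 9)))
piped_sum
#> [1] 6

# The same holds for dplyr verbs
toy_penguins <- tibble(
  island      = c("Dream", "Biscoe"),
  body_mass_g = c(3250, 5000)
)

piped  <- toy_penguins %>% filter(island == "Dream")
nested <- filter(toy_penguins, island == "Dream")
```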

---

## Small Note on the Pipe

Since version 4.1.0, base R also provides its own pipe.

It looks like this: `|>`

While it shares many similarities with `%>%`, there are also some differences.

Going over them is beyond the scope of this workshop, so for the sake of simplicity we will stick with the `magrittr` pipe.
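
---

## Small Note on the Pipe

For simple chains like the ones in this workshop, the two pipes can be swapped. A minimal sketch, assuming R 4.1.0 or newer and a made-up tibble `toy_penguins`:

```r
library(dplyr)

# Hypothetical toy data standing in for penguins_clean
toy_penguins <- tibble(
  island      = c("Dream", "Biscoe"),
  body_mass_g = c(3250, 5000)
)

# magrittr pipe
with_magrittr <- toy_penguins %>%
  filter(island == "Dream") %>%
  mutate(bodymass_kg = body_mass_g / 1000)

# base R pipe (requires R >= 4.1.0)
with_native <- toy_penguins |>
  filter(island == "Dream") |>
  mutate(bodymass_kg = body_mass_g / 1000)
```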

---

# Exercises

### It's time to type some R code

Open `04_exercises_II.Rmd`

---
class: center, middle, inverse

# Some Final Things

## If only we had more time...

---

## Using RStudio Projects

[RStudio projects](https://r4ds.had.co.nz/workflow-projects.html) make it straightforward to divide your work into multiple contexts, each with their own working directory, workspace, history, and source documents.

![](http://www.rstudio.com/images/docs/projects_new.png)

---

## Further Resources to learn R

* [Book: R for Data Science](https://r4ds.had.co.nz/)

* [Danielle Navarro's YouTube channel](https://www.youtube.com/channel/UCfNGzUFfsy_3udMY8UyaqBA)

* [Start coding using RStudio.cloud Primers](https://rstudio.cloud/learn/primers)

* [RStudio Cheat Sheets](https://www.rstudio.com/resources/cheatsheets/)

* [Book: A ModernDive into R and the Tidyverse](https://moderndive.com/)

* [TidyTuesday - Community](https://www.tidytuesday.com/)

---

## Q&A

Ways to stay in touch with me:

---

### Thank you for participating!

I hope you had fun!