# R Language Notes

Notes on how to install and use the R statistical language.

## Installing R and RStudio

The easiest and quickest way to install or update R on Debian is to execute the following commands:

```
sudo apt-get update
sudo apt-get install r-base r-base-dev
```

**Note:** `r-base-dev` is only required to compile R packages or software that depends on R.

Having done that, download and install the latest RStudio package as follows.

Copy the link to the latest version of RStudio from https://www.rstudio.com/products/rstudio/download/. For example, this will look like `https://download1.rstudio.org/rstudio-1.1.453-amd64.deb`.

Download it using `wget`, for example, `wget https://download1.rstudio.org/rstudio-1.1.453-amd64.deb`. Finally, install it using `dpkg` as follows: `sudo dpkg -i rstudio-1.1.453-amd64.deb`

## Installing a package

Use `install.packages("package name")`

## Loading a package

Use `library("package name")`
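Installing and loading are often combined in practice. The sketch below is one possible way to do that; `load_or_install` is a hypothetical helper name, not part of R, and the `stats` package is used only because it ships with every R installation.

```r
# Hypothetical helper: install a package only if it is missing, then attach it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Needs network access and a configured CRAN mirror.
    install.packages(pkg)
  }
  # character.only = TRUE tells library() that pkg holds the name as a string.
  library(pkg, character.only = TRUE)
}

load_or_install("stats")
```

Checking with `requireNamespace()` first avoids re-downloading a package that is already installed.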

## Viewing data available in a package

`data(package='package name')`

## Accessing data in a package

Use `data(dataset_name, package='package name')`. Following this command you can type `dataset_name` to view the data set contents.
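As a concrete illustration of the two `data()` calls, the sketch below uses the built-in `datasets` package and its `mtcars` data set. The variable name `available` is chosen here for illustration only.

```r
# List the data sets shipped with the 'datasets' package.
available <- data(package = "datasets")
# The result holds a character matrix with one row per data set.
head(available$results[, c("Item", "Title")])

# Load a single data set by name, then inspect it.
data(mtcars, package = "datasets")
head(mtcars)
```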

## Viewing data

Use `View(variable)`

## Listing objects in current environment

Use `ls()` or `objects()` to get a character vector listing the names of objects in the current environment.

If these functions are called inside a function, they list only the names of objects defined within that function.

For example:

```
test <- function() {internal_variable <- 'testing'; ls()}
test()
```

will output:

```
[1] "internal_variable"
```
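The same scoping behaviour can be checked against an explicit environment. This is a minimal sketch; `scratch`, `list_locals`, and the object names are made up for the example.

```r
# Create a separate environment and put two objects in it.
scratch <- new.env()
assign("alpha", 1, envir = scratch)
assign("beta", 2, envir = scratch)

# ls() accepts the environment to inspect and returns sorted names.
scratch_names <- ls(envir = scratch)

# Inside a function, ls() sees only that function's local objects.
list_locals <- function() {
  first <- 1
  second <- 2
  ls()
}
local_names <- list_locals()

print(scratch_names)
print(local_names)
```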

## Creating a vector

Use the `c()` function to combine values into a vector or list, for example, `ages <- c(12,14,19,28,42)`.
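One detail worth seeing in code is that `c()` builds an atomic vector, so mixed inputs are coerced to a common type. The sketch below illustrates this; the variable names other than `ages` are illustrative.

```r
ages <- c(12, 14, 19, 28, 42)
class(ages)    # "numeric"
length(ages)   # 5

# Mixed types are coerced to the most general type, here character.
mixed <- c(1, "two", TRUE)

# Nested calls are flattened into a single vector.
flat <- c(c(1, 2), 3)

# Use list() when the elements must keep their own types.
kept <- list(1, "two", TRUE)
```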

## Removing objects from an environment

Use the `rm()` or `remove()` functions as follows:

To remove a single object, type:

```
rm(object_name)
```

To remove multiple objects, type:

```
rm(list=c('an_object','other_object','final_object'))
```

To remove multiple objects matching a specific pattern, say starting with `pa`, type:

```
rm(list=ls()[grep('^pa',ls())])
```
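Pattern-based removal can also be written with the `pattern` argument of `ls()`, which avoids the separate `grep()` call. The sketch below runs in a throwaway environment so that nothing in the workspace is deleted; `scratch` and the object names are made up.

```r
# Populate a scratch environment with three objects.
scratch <- new.env()
for (name in c("pa_one", "pa_two", "keep_me")) {
  assign(name, 0, envir = scratch)
}

# Remove every object whose name starts with "pa".
rm(list = ls(envir = scratch, pattern = "^pa"), envir = scratch)

remaining <- ls(envir = scratch)
print(remaining)
```

In an interactive session the `envir` arguments can be dropped, giving `rm(list = ls(pattern = '^pa'))`.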

## Computing five-number summary plus mean

Suppose we have a vector of ages as follows:

```
ages = c(18,22,21,24,19,22,20,20,30,42)
```

To compute its five-number summary plus mean, just type `summary(ages)`, which gives us:

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   18.0    20.0    21.5    23.8    23.5    42.0
```
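The quartiles reported by `summary()` match the default method of `quantile()`. Base R also offers `fivenum()`, which uses Tukey's hinges and can give slightly different quartiles. The sketch below compares the two on the same vector; `q` and `hinges` are illustrative names.

```r
ages <- c(18, 22, 21, 24, 19, 22, 20, 20, 30, 42)

# Default quantile method, the same one summary() relies on.
q <- quantile(ages)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max.
hinges <- fivenum(ages)

print(q)
print(hinges)
```

For this vector the third quartile is 23.5 from `quantile()` and 24 from `fivenum()`, because the upper hinge is the median of the upper half of the sorted data.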

## Computing sample mean and standard deviation

Using the above vector `ages`, you can easily compute the sample mean using `mean(ages)` and sample standard deviation using `sd(ages)`, which return `23.8` and `7.223` respectively.

We can double-check these statistics by computing them using the following formulas for sample mean $\bar x$ and standard deviation $s$.

\[\bar x=\frac{\sum_{i=1}^nx_i}{n}\]

\[s=\sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar x)^2}{n-1}}\]

```
> ages
[1] 18 22 21 24 19 22 20 20 30 42
> x_bar = sum(ages)/length(ages)
> x_bar
[1] 23.8
> var = sum((ages - x_bar)^2)/(length(ages) - 1)
> var
[1] 52.17778
> s = sqrt(var)
> s
[1] 7.223419
```
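The comparison can also be automated instead of read off by eye. The sketch below uses `all.equal()`, which compares numbers within a small tolerance, and the built-in `var()` function; `manual_mean` and `manual_sd` are illustrative names.

```r
ages <- c(18, 22, 21, 24, 19, 22, 20, 20, 30, 42)

# Recompute the statistics from the formulas.
manual_mean <- sum(ages) / length(ages)
manual_sd <- sqrt(sum((ages - manual_mean)^2) / (length(ages) - 1))

# Compare against the built-in functions.
mean_matches <- isTRUE(all.equal(manual_mean, mean(ages)))
sd_matches <- isTRUE(all.equal(manual_sd, sd(ages)))

# var() uses the same n - 1 denominator, so it equals sd() squared.
var_matches <- isTRUE(all.equal(var(ages), sd(ages)^2))

print(c(mean_matches, sd_matches, var_matches))
```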

## Quick introduction to R

A comprehensive quick introduction to the R language can be found at https://www.statmethods.net.

If you prefer a more hands-on approach, you might find DataCamp’s free interactive course, Introduction to R, more interesting.

## Basic data frame filtering

Suppose we have a data frame, `people`, with two columns, `country` and `age`.

```
> country = c("Italy","Greece","Spain","Italy","France","Italy")
> age = c(20,43,16,20,25,34)
> people = data.frame(country,age)
> people
  country age
1   Italy  20
2  Greece  43
3   Spain  16
4   Italy  20
5  France  25
6   Italy  34
```

To extract only the records for a specific country, say Italy, do the following:

```
> people[people$country == "Italy", ]
  country age
1   Italy  20
4   Italy  20
6   Italy  34
```

Since nothing is specified after the comma, R retrieves all columns, i.e. all record fields. If we need only the age of people from Italy, we specify the column age after the comma, as follows:

```
> people[people$country == "Italy", "age"]
[1] 20 20 34
```
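Two variations on this filtering are sketched below: combining conditions with `subset()`, and keeping the result as a data frame with `drop = FALSE`. The names `older_italians` and `italian_ages` are chosen for illustration.

```r
country <- c("Italy", "Greece", "Spain", "Italy", "France", "Italy")
age <- c(20, 43, 16, 20, 25, 34)
people <- data.frame(country, age)

# subset() evaluates the condition using the column names directly.
older_italians <- subset(people, country == "Italy" & age > 20)

# Selecting a single column normally returns a plain vector;
# drop = FALSE keeps it as a one-column data frame.
italian_ages <- people[people$country == "Italy", "age", drop = FALSE]

print(older_italians)
print(italian_ages)
```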

## Check proportion of samples falling within range

There is more than one method to compute the proportion of values falling within a specific range. Each method uses the fact that R generates a boolean (`logical`) vector whenever a vector is compared to a value. We will use the `mtcars` built-in dataset in the code below.

```
> data("mtcars")
> dim(mtcars)
[1] 32 11
> mtcars$mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4
[17] 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
```

If we compare the `mpg` vector to identify all values below 20 mpg, R will generate a boolean vector as follows, with `TRUE` in place of values that satisfy our condition.

```
> mtcars$mpg < 20
 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
[14]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
[27] FALSE FALSE  TRUE  TRUE  TRUE FALSE
```

If we call `sum` on this boolean vector it will count the number of `TRUE` values, since these are treated as `1`, whilst `FALSE` is treated as `0`.

```
> sum(mtcars$mpg < 20)
[1] 18
```

If we divide the above sum by the length of the vector, we get the proportion of values satisfying a condition.

```
> sum(mtcars$mpg < 20) / length(mtcars$mpg)
[1] 0.5625
```

The same approach can be used to compute proportions of values falling within a specific range. All we need to do is use logical operators such as `&`. Below we compute the proportion of cars that have miles per gallon (mpg) falling in the range $\left[20,25\right]$.

```
> sum(mtcars$mpg >= 20 & mtcars$mpg <= 25) / length(mtcars$mpg)
[1] 0.25
```

This could also be computed using the `mean` function, since the mean of a boolean vector is the proportion of its `TRUE` values, as follows.

```
> mean(mtcars$mpg <= 25) - mean(mtcars$mpg < 20)
[1] 0.25
```