R Language Notes

Notes on how to install and use the R statistical language.

Installing R and RStudio

The quickest way to install or update R on Debian is to run the following commands as root (or prefix each one with sudo):

apt-get update
apt-get install r-base r-base-dev

Note: r-base-dev is only required to compile R packages or software that depends on R.

Having done that, download and install the latest RStudio package as follows.

Copy the link to the latest version of RStudio from https://www.rstudio.com/products/rstudio/download/.

For example, this will look like https://download1.rstudio.org/rstudio-1.1.453-amd64.deb.

Download it using wget, substituting the link you copied. With the example link above, that is:

wget https://download1.rstudio.org/rstudio-1.1.453-amd64.deb

Finally, install the downloaded file using dpkg:

sudo dpkg -i rstudio-1.1.453-amd64.deb

Installing a package

Use install.packages("package name")

Loading a package

Use library(package_name). Unlike install.packages(), library() accepts the package name without quotes.
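
The two steps above are often combined so that a package is installed only when it is missing. The following is a minimal sketch of that pattern; the helper name ensure_package is hypothetical and not part of R, and the install step assumes a CRAN mirror is configured.

```r
# Hypothetical helper (not part of base R): install a package only if it
# is missing, then attach it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # assumes a CRAN mirror is configured
  }
  # character.only = TRUE lets library() take the name as a string
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

ensure_package("stats")   # ships with R, so nothing is downloaded
```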

Viewing data available in a package

data(package='package name')

Accessing data in a package

Use data(dataset_name, package='package name')

Following this command you can type dataset_name to view the data set contents.
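
As a concrete illustration of the two commands above, the sketch below uses the datasets package that ships with R and its mtcars data set. The variable name available is chosen here for illustration only.

```r
# List the data sets shipped with the built-in 'datasets' package
available <- data(package = "datasets")
head(available$results[, c("Item", "Title")])

# Load one of them into the current environment and inspect it
data(mtcars, package = "datasets")
head(mtcars)   # first six rows
str(mtcars)    # column types and dimensions
```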

Viewing data

Use View(variable) to open the object in a spreadsheet-style viewer.

Listing objects in current environment

Use ls() or objects() to get a character vector listing the names of the objects in the current environment.

When these functions are called inside a function, they list only the names of objects local to that function.

For example, given:

test <- function() {internal_variable <- 'testing'; ls()}

calling test() will output:

[1] "internal_variable"

Creating a vector

Use the c() function to combine values into a list or vector, for example, ages <- c(12,14,19,28,42).
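
The short sketch below shows how c() behaves in a few common situations. The variable names mixed, more_ages and as_list are made up for illustration.

```r
ages <- c(12, 14, 19, 28, 42)
length(ages)        # 5
class(ages)         # "numeric"

# Mixing types in an atomic vector coerces everything to one type
mixed <- c(1, "two", TRUE)
class(mixed)        # "character"

# Vectors passed to c() are flattened, not nested
more_ages <- c(ages, 50, 61)
length(more_ages)   # 7

# c() returns a list only when at least one argument is a list
as_list <- c(list(1, "two"), 3)
class(as_list)      # "list"
```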

Removing objects from an environment

Use the rm() or remove() functions as follows:

To remove a single object type:

rm(object_name)

To remove multiple objects type:

rm(list=c('an_object','other_object','final_object'))

To remove multiple objects matching a specific pattern, say starting with pa, type:

rm(list=ls(pattern='^pa'))
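
To try pattern-based removal without touching your workspace, the same idea can be sketched in a throwaway environment. The environment name scratch and the object names below are made up for this example.

```r
# A scratch environment keeps the demonstration away from your workspace
scratch <- new.env()
assign("pa_one", 1, envir = scratch)
assign("pa_two", 2, envir = scratch)
assign("other",  3, envir = scratch)

# The pattern argument of ls() is a regular expression applied to the names
to_remove <- ls(envir = scratch, pattern = "^pa")
rm(list = to_remove, envir = scratch)

ls(envir = scratch)   # "other"
```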

Computing five-number summary plus mean

Suppose we have a vector of ages as follows:

ages = c(18,22,21,24,19,22,20,20,30,42)

To compute its five-number summary plus mean, just type summary(ages), which gives us:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
18.0    20.0    21.5    23.8    23.5    42.0
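
Note that the quartiles reported by summary() come from quantile() with its default settings, whereas fivenum() uses Tukey's hinges, so the two can disagree slightly. A short comparison on the same vector:

```r
ages <- c(18, 22, 21, 24, 19, 22, 20, 20, 30, 42)

quantile(ages)   # 0%, 25%, 50%, 75%, 100%: 18, 20, 21.5, 23.5, 42
fivenum(ages)    # Tukey's hinges:          18, 20, 21.5, 24,   42
```

Here only the upper quartile differs (23.5 versus 24).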

Computing sample mean and standard deviation

Using the above vector ages, you can easily compute the sample mean using mean(ages) and sample standard deviation using sd(ages), which return 23.8 and 7.223 respectively.

We can double-check these statistics by computing them using the following formulas for sample mean $\bar x$ and standard deviation $s$.

\[\bar x=\frac{\sum_{i=1}^nx_i}{n}\]

\[s=\sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar x)^2}{n-1}}\]

> ages
 [1] 18 22 21 24 19 22 20 20 30 42
> x_bar = sum(ages)/length(ages)
> x_bar
[1] 23.8
> var = sum((ages - x_bar)^2)/(length(ages) - 1)
> var
[1] 52.17778
> s = sqrt(var)
> s
[1] 7.223419
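
The manual check can also be written as a script that fails loudly if the values disagree. This is a minimal sketch; the names n and s_manual are introduced here for illustration.

```r
ages <- c(18, 22, 21, 24, 19, 22, 20, 20, 30, 42)

n        <- length(ages)
x_bar    <- sum(ages) / n
s_manual <- sqrt(sum((ages - x_bar)^2) / (n - 1))

# all.equal() compares with a small tolerance, which is safer than ==
# for floating-point results
stopifnot(
  isTRUE(all.equal(x_bar, mean(ages))),
  isTRUE(all.equal(s_manual, sd(ages))),
  isTRUE(all.equal(s_manual^2, var(ages)))
)
```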

Quick introduction to R

A quick yet comprehensive introduction to the R language can be found at https://www.statmethods.net.

If you prefer a more hands-on approach, you might find DataCamp’s interactive free course, Introduction to R, more interesting.

Basic data frame filtering

Suppose we have a data frame, people, with 2 columns, country and age.

> country = c("Italy","Greece","Spain","Italy","France","Italy")
> age = c(20,43,16,20,25,34)
> people = data.frame(country,age)
> people
  country age
1   Italy  20
2  Greece  43
3   Spain  16
4   Italy  20
5  France  25
6   Italy  34

To extract only the records for a specific country, say Italy, do the following:

> people[people$country == "Italy", ]
  country age
1   Italy  20
4   Italy  20
6   Italy  34

Since nothing is specified after the comma, R retrieves all columns, i.e. all record fields. If we need only the age of people from Italy, we specify the column age after the comma, as follows:

> people[people$country == "Italy", "age"]
[1] 20 20 34
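
A few variations on the same filtering idea are sketched below: combining conditions, using subset(), and keeping the result as a data frame. The name ages_df is made up for this example.

```r
people <- data.frame(
  country = c("Italy", "Greece", "Spain", "Italy", "France", "Italy"),
  age     = c(20, 43, 16, 20, 25, 34)
)

# Combine conditions with & (and) or | (or)
people[people$country == "Italy" & people$age > 20, ]   # row 6 only

# subset() expresses the same kind of filter without repeating 'people$'
subset(people, country == "Italy", select = age)

# Selecting one column returns a plain vector; drop = FALSE keeps a data frame
ages_df <- people[people$country == "Italy", "age", drop = FALSE]
class(ages_df)   # "data.frame"
```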

Checking the proportion of samples falling within a range

There is more than one method to compute the proportion of values falling within a specific range. Each method relies on the fact that comparing a vector to a value in R produces a logical (boolean) vector. We will use the mtcars built-in dataset in the code below.

> data("mtcars")
> dim(mtcars)
[1] 32 11
> mtcars$mpg
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4
[17] 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4

If we compare the mpg vector against 20 to identify all values below 20 mpg, R generates a logical vector as follows, with TRUE in place of each value that satisfies the condition.

> mtcars$mpg < 20
 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
[14]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
[27] FALSE FALSE  TRUE  TRUE  TRUE FALSE

If we call sum on this logical vector, it counts the number of TRUE values, since each TRUE is treated as 1 and each FALSE as 0.

> sum(mtcars$mpg < 20)
[1] 18

If we divide the above sum by the length of the vector, we get the proportion of values satisfying a condition.

> sum(mtcars$mpg < 20) / length(mtcars$mpg)
[1] 0.5625

The same approach can be used to compute proportions of values falling within a specific range. All we need to do is use logical operators such as &. Below we compute the proportion of cars that have miles per gallon (mpg) falling in the range $\left[20,25\right]$.

> sum(mtcars$mpg >= 20 & mtcars$mpg <= 25) / length(mtcars$mpg)
[1] 0.25

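This could also be computed using the mean function, since the mean of a logical vector is the proportion of TRUE values. Subtracting the proportion below 20 from the proportion at or below 25 leaves the proportion in the range, as follows.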

> mean(mtcars$mpg <= 25) - mean(mtcars$mpg < 20)
[1] 0.25
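
If this computation is needed often, it can be wrapped in a small function. The sketch below is one possible way to do so; the name prop_in_range and its arguments are hypothetical, and the choice to ignore missing values by default is an assumption rather than something the notes above prescribe.

```r
# Hypothetical helper: proportion of values of x in the closed range
# [lower, upper]. Missing values are ignored by default.
prop_in_range <- function(x, lower, upper, na.rm = TRUE) {
  # the mean of a logical vector is the proportion of TRUE values
  mean(x >= lower & x <= upper, na.rm = na.rm)
}

prop_in_range(mtcars$mpg, 20, 25)    # 0.25
prop_in_range(c(1, NA, 3), 0, 2)     # 0.5, the NA is dropped
```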