# R Language Notes

Notes on how to install and use the R statistical language.

## Installing R and RStudio

The quickest way to install or update R on Debian is to run the following commands as root (or prefixed with `sudo`):

    apt-get update
    apt-get install r-base r-base-dev


Note: r-base-dev is only required to compile R packages or software that depends on R.

Having done that, download and install the latest RStudio package. Look up the URL of the current `.deb` package on the RStudio download page; it will look like https://download1.rstudio.org/rstudio-1.1.453-amd64.deb.

Download it using `wget`, for example, `wget https://download1.rstudio.org/rstudio-0.99.878-amd64.deb`. Finally, install it using `dpkg`, passing the name of the file you downloaded: `sudo dpkg -i rstudio-0.99.878-amd64.deb`

## Installing a package

Use `install.packages("package name")` to install a package.

Use `library(package name)` to load an installed package into the current session.
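
A common pattern is to install a package only when it is missing and then load it. The sketch below is one way to do that; the helper name `ensure_package` is hypothetical, and `stats` is used only because it ships with R, so nothing is downloaded when the example runs.

```r
# Hypothetical helper: install a package only if it is not already available,
# then attach it to the current session.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the name as a string
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# 'stats' is part of every R installation, so this only attaches it
ensure_package("stats")
```

Passing the name as a string is what makes the helper reusable; a bare `library(pkg)` would look for a package literally called `pkg`.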

## Viewing data available in a package

Use `data(package='package name')` to list the data sets that a package provides.

## Accessing data in a package

Use `data(dataset_name, package='package name')`

After running this command, you can type `dataset_name` to view the data set contents.
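
As a concrete illustration, the sketch below lists and loads data from the built-in `datasets` package, which is assumed here only because it is always installed. The variable name `available` is illustrative.

```r
# List the data sets shipped with the built-in 'datasets' package.
# The result holds a matrix in $results with one row per data set.
available <- data(package = "datasets")
head(available$results[, "Item"])

# Load one data set by name, then inspect its first rows
data(mtcars, package = "datasets")
head(mtcars, 3)
```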

## Viewing data

Use `View(variable)` to open the object in a spreadsheet-style viewer.

## Listing objects in current environment

Use `ls()` or `objects()` to get a character vector listing the names of the objects in the current environment.

If these functions are called inside a function, they list only the names of objects local to that function.

For example, given:

    test <- function() {internal_variable <- 'testing'; ls()}


calling `test()` will output:

    [1] "internal_variable"

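To make the scoping difference visible, the following sketch defines a top-level variable as well and checks where each name shows up. The name `outer_variable` is an assumption added for illustration.

```r
outer_variable <- "defined outside the function"

test <- function() {
  internal_variable <- "testing"
  ls()  # lists only the names local to this call
}

local_names <- test()
local_names                         # "internal_variable"
"outer_variable" %in% local_names   # FALSE

# Called outside any function, ls() sees the variable defined above
"outer_variable" %in% ls()
```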

## Creating a vector

Use the `c()` function to combine values into a list or vector, for example, `ages <- c(12,14,19,28,42)`.
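
A few basic operations on such a vector are sketched below; `more_ages` and `mixed` are illustrative names, not part of the notes above.

```r
ages <- c(12, 14, 19, 28, 42)
length(ages)        # 5
ages[2]             # 14 (indexing starts at 1)
ages > 18           # FALSE FALSE TRUE TRUE TRUE

# c() flattens its arguments into a single vector
more_ages <- c(ages, c(50, 61))
length(more_ages)   # 7

# Mixing types coerces every element to the most general type
mixed <- c(1, "two", TRUE)
class(mixed)        # "character"
```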

## Removing objects from an environment

Use the `rm()` or `remove()` functions as follows.

To remove a single object, type:

    rm(object_name)


To remove multiple objects, type:

    rm(list=c('an_object','other_object','final_object'))


To remove multiple objects matching a specific pattern, say starting with pa, type:

    rm(list=ls()[grep('^pa',ls())])
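
Pattern-based removal is easy to try out safely in a throwaway environment. The sketch below uses the `pattern` argument of `ls()` as an alternative to `grep()`; the environment `scratch` and the object names in it are made up for the example.

```r
# Work in a separate environment so nothing in the workspace is touched
scratch <- new.env()
assign("pa_one", 1, envir = scratch)
assign("pa_two", 2, envir = scratch)
assign("other", 3, envir = scratch)

# ls(pattern = ...) returns only the names matching the regular expression
to_remove <- ls(scratch, pattern = "^pa")
rm(list = to_remove, envir = scratch)

ls(scratch)   # "other"
```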


## Computing five-number summary plus mean

Suppose we have a vector of ages as follows:

    ages = c(18,22,21,24,19,22,20,20,30,42)


To compute its five-number summary plus mean, just type `summary(ages)`, which gives us:

       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       18.0    20.0    21.5    23.8    23.5    42.0
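
The quartiles can also be obtained individually. The sketch below compares `quantile()`, which matches the quartiles reported by `summary()`, with `fivenum()`, which uses Tukey's hinges and can therefore give a slightly different upper or lower value.

```r
ages <- c(18, 22, 21, 24, 19, 22, 20, 20, 30, 42)

# Default quantile() probabilities are 0%, 25%, 50%, 75%, 100%
quantile(ages)   # 18.0 20.0 21.5 23.5 42.0

# fivenum() uses hinges; here the upper hinge is 24 rather than 23.5
fivenum(ages)    # 18.0 20.0 21.5 24.0 42.0
```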


## Computing sample mean and standard deviation

Using the above vector ages, you can easily compute the sample mean using mean(ages) and sample standard deviation using sd(ages), which return 23.8 and 7.223 respectively.

We can double-check these statistics by computing them using the following formulas for sample mean $\bar x$ and standard deviation $s$.

$$\bar x=\frac{\sum_{i=1}^{n}x_i}{n}$$

$$s=\sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar x)^2}{n-1}}$$

    > ages
    [1] 18 22 21 24 19 22 20 20 30 42
    > x_bar = sum(ages)/length(ages)
    > x_bar
    [1] 23.8
    > var = sum((ages - x_bar)^2)/(length(ages) - 1)
    > var
    [1] 52.17778
    > s = sqrt(var)
    > s
    [1] 7.223419
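
The manual calculation can be checked against the built-in functions programmatically. This is a minimal sketch that uses `all.equal()` so that tiny floating-point differences do not cause a false mismatch.

```r
ages <- c(18, 22, 21, 24, 19, 22, 20, 20, 30, 42)
n <- length(ages)

# Sample mean and sample standard deviation from the formulas
x_bar <- sum(ages) / n
s <- sqrt(sum((ages - x_bar)^2) / (n - 1))

# Compare with the built-in implementations
all.equal(x_bar, mean(ages))   # TRUE
all.equal(s, sd(ages))         # TRUE
```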


## Quick introduction to R

A comprehensive quick introduction to the R language can be found at https://www.statmethods.net.

If you prefer a more hands-on approach, you might find DataCamp’s interactive free course, Introduction to R, more interesting.

## Basic data frame filtering

Suppose we have a data frame, people, with 2 columns, country and age.

    > country = c("Italy","Greece","Spain","Italy","France","Italy")
    > age = c(20,43,16,20,25,34)
    > people = data.frame(country,age)
    > people
      country age
    1   Italy  20
    2  Greece  43
    3   Spain  16
    4   Italy  20
    5  France  25
    6   Italy  34


To extract only the records for a specific country, say Italy, do the following:

    > people[people$country == "Italy", ]
      country age
    1   Italy  20
    4   Italy  20
    6   Italy  34

Since nothing is specified after the comma, R retrieves all columns, i.e. all record fields. If we need only the age of people from Italy, we specify the column age after the comma, as follows:

    > people[people$country == "Italy", "age"]
    [1] 20 20 34
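
The same bracket syntax extends to combined conditions, and base R's `subset()` offers an alternative spelling. The sketch below rebuilds the `people` data frame so that it runs on its own; the age threshold of 20 is an arbitrary choice for illustration.

```r
country <- c("Italy", "Greece", "Spain", "Italy", "France", "Italy")
age <- c(20, 43, 16, 20, 25, 34)
people <- data.frame(country, age)

# Combine conditions with & (and) or | (or); only row 6 matches here
people[people$country == "Italy" & people$age > 20, ]

# subset() expresses the same kind of filter without repeating people$
subset(people, country == "Italy", select = age)

# drop = FALSE keeps a data frame even when a single column is selected
people[people$country == "Italy", "age", drop = FALSE]
```

Note that selecting a single column with brackets returns a plain vector by default, whereas `subset()` and `drop = FALSE` both return a one-column data frame.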


## Check proportion of samples falling within range

There is more than one method to compute the proportion of values falling within a specific range. Each method uses the fact that R generates a boolean vector whenever a vector is compared to a value. We will use the mtcars built-in dataset in the code below.

    > data("mtcars")
    > dim(mtcars)
    [1] 32 11
    > mtcars$mpg
     [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4
    [17] 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4

If we compare the mpg vector to 20 to identify all values below 20 mpg, R generates a boolean vector as follows, with TRUE in place of values that satisfy our condition.

    > mtcars$mpg < 20
     [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
    [14]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
    [27] FALSE FALSE  TRUE  TRUE  TRUE FALSE

If we call sum on this boolean vector, it counts the number of TRUE values, since TRUE is treated as 1 and FALSE as 0.

    > sum(mtcars$mpg < 20)
    [1] 18


If we divide the above sum by the length of the vector, we get the proportion of values satisfying a condition.

    > sum(mtcars$mpg < 20) / length(mtcars$mpg)
    [1] 0.5625


The same approach can be used to compute proportions of values falling within a specific range. All we need to do is use logical operators such as &. Below we compute the proportion of cars that have miles per gallon (mpg) falling in the range $\left[20,25\right]$.

    > sum(mtcars$mpg >= 20 & mtcars$mpg <= 25) / length(mtcars$mpg)
    [1] 0.25

This could also be computed using the mean function, as follows.

    > mean(mtcars$mpg <= 25) - mean(mtcars$mpg < 20)
    [1] 0.25
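
If this calculation is needed repeatedly, it can be wrapped in a small function. The helper below is a sketch; the name `proportion_in_range` and its arguments are hypothetical, and the `na.rm = TRUE` choice (ignoring missing values) is an assumption that may not suit every data set.

```r
# Hypothetical helper: proportion of values in the closed interval [lower, upper]
proportion_in_range <- function(x, lower, upper) {
  # mean() of a logical vector is the fraction of TRUE values
  mean(x >= lower & x <= upper, na.rm = TRUE)
}

data("mtcars")
proportion_in_range(mtcars$mpg, 20, 25)   # 0.25
```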

