R Language Notes

First published: 13 Nov 2016
Last updated: 01 Jul 2018

Installing R and RStudio

The easiest and quickest way to install or update R on Debian is to run the following commands as root (or prefix each with sudo):

apt-get update
apt-get install r-base r-base-dev

Note: r-base-dev is only required to compile R packages or software that depends on R.

Having done that, download and install the latest RStudio package as follows.

Copy the link to the latest version of RStudio from https://www.rstudio.com/products/rstudio/download/.

For example, the link will look like https://download1.rstudio.org/rstudio-1.1.453-amd64.deb.

Download it using wget, for example, wget https://download1.rstudio.org/rstudio-1.1.453-amd64.deb. Finally, install it using dpkg as follows: sudo dpkg -i rstudio-1.1.453-amd64.deb

Installing a package

Use install.packages("package name")

Loading a package

Use library(package_name), where package_name is the name of an installed package.

Viewing data available in a package

Use data(package='package name')

Accessing data in a package

Use data(dataset_name, package='package name')

Following this command you can type dataset_name to view the data set contents.
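As a quick illustration of the two data() calls, the sketch below uses the datasets package that ships with R and its mtcars data set. Both are picked here only as an example and are not part of the original notes.

```r
# List the data sets bundled with the 'datasets' package (ships with base R).
available <- data(package = "datasets")
head(available$results[, c("Item", "Title")])

# Load one data set by name into the current environment, then inspect it.
data(mtcars, package = "datasets")
head(mtcars)   # first six rows
dim(mtcars)    # number of rows and columns
```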

Viewing data

Use View(variable)

Listing objects in current environment

Use ls() or objects() to get a character vector of the names of the objects in the current environment.

When these functions are called inside a function, they list only the names of the objects local to that function.

For example, defining and then calling:

test <- function() {internal_variable <- 'testing'; ls()}
test()

will output:

[1] "internal_variable"

Creating a vector

Use the c() function to combine values into a list or vector, for example, ages <- c(12,14,19,28,42).
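A short sketch of how c() behaves may help here. The variable names more_ages and mixed are made up for this example; the point is that c() also concatenates existing vectors, and that mixing types coerces every element to a common type.

```r
# Numeric vector built from individual values.
ages <- c(12, 14, 19, 28, 42)
length(ages)        # 5

# c() also concatenates existing vectors.
more_ages <- c(ages, c(51, 63))

# Mixing types coerces everything to the most general type (here: character).
mixed <- c(1, "two", TRUE)
class(mixed)        # "character"
```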

Removing objects from an environment

Use the rm() or remove() functions as follows:

To remove a single object, type rm(object_name)

To remove multiple objects, type rm(list=c('an_object','other_object','final_object'))

To remove multiple objects whose names match a specific pattern, say those starting with pa, type rm(list=ls()[grep('^pa',ls())])
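The pattern-based removal can also be written with the pattern argument of ls(), which filters names by regular expression without a separate grep() call. The sketch below is one possible way to try this safely: it works in a throwaway environment (the name scratch and the object names are invented for the example) so that nothing in your workspace is deleted.

```r
# Work in a scratch environment so the example does not touch your workspace.
scratch <- new.env()
assign("pa_one", 1, envir = scratch)
assign("pa_two", 2, envir = scratch)
assign("other", 3, envir = scratch)

# ls(pattern = ...) filters names by regular expression, so grep() is not needed.
rm(list = ls(scratch, pattern = "^pa"), envir = scratch)

ls(scratch)   # "other"
```

Dropping the scratch and envir arguments applies the same call to the current environment.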

Computing five-number summary plus mean

Suppose we have a vector of ages as follows: ages = c(18,22,21,24,19,22,20,20,30,42). To compute its five-number summary plus mean, just type summary(ages), which gives us:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   18.0    20.0    21.5    23.8    23.5    42.0 
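If only the five numbers are needed, two related base R functions can be used. The sketch below assumes the same ages vector. Note that fivenum() reports Tukey's hinges rather than quartiles, so its fourth value can differ from the 3rd Qu. reported by summary(), as it does for this data.

```r
ages <- c(18, 22, 21, 24, 19, 22, 20, 20, 30, 42)

# quantile() uses the same default method (type 7) as summary().
quantile(ages)   # 0%: 18, 25%: 20, 50%: 21.5, 75%: 23.5, 100%: 42

# fivenum() returns Tukey's five numbers (hinges instead of quartiles).
fivenum(ages)    # 18 20 21.5 24 42
```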

Computing sample mean and standard deviation

Using the following vector of ages as an example, ages = c(18,22,21,24,19,22,20,20,30,42), you can easily compute the sample mean using mean(ages) and sample standard deviation using sd(ages), which return 23.8 and 7.223 respectively.

We can double-check these statistics by computing them using the following formulas for sample mean $\bar x$ and standard deviation $s$.

$$\bar x=\frac{\sum_{i=1}^nx_i}{n}$$

$$s=\sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar x)^2}{n-1}}$$

> ages
 [1] 18 22 21 24 19 22 20 20 30 42
> x_bar = sum(ages)/length(ages)
> x_bar
[1] 23.8
> var = sum((ages - x_bar)^2)/(length(ages) - 1)
> var
[1] 52.17778
> s = sqrt(var)
> s
[1] 7.223419
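The same check can be scripted rather than read off by eye. This is a minimal sketch, assuming the same ages vector; s_manual is a name chosen for the example. It uses all.equal(), which compares numbers within a small tolerance, and also checks the built-in var() function against the manual variance.

```r
ages <- c(18, 22, 21, 24, 19, 22, 20, 20, 30, 42)

# Manual computation, following the sample mean and standard deviation formulas.
x_bar <- sum(ages) / length(ages)
s_manual <- sqrt(sum((ages - x_bar)^2) / (length(ages) - 1))

# all.equal() compares with a small numerical tolerance.
all.equal(x_bar, mean(ages))      # TRUE
all.equal(s_manual, sd(ages))     # TRUE
all.equal(s_manual^2, var(ages))  # TRUE
```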

Quick introduction to R

A quick yet comprehensive introduction to the R language can be found at https://www.statmethods.net.

If you prefer a more hands-on approach, you might find DataCamp's free interactive course, Introduction to R, more interesting.

Some basic dplyr commands

Suppose we have a data frame called data with two columns, country and age, and that the dplyr package has been loaded with library(dplyr). To keep only the rows for Italy, use italy_only = filter(data, country=="Italy"). This keeps the rows where country is equal to Italy and drops all the others.

To select only one column, say age, use select(italy_only, age).

Commands can be chained on one line with the pipe operator %>% as follows:

Italy_age_only = filter(data, country=="Italy") %>% select(age) %>% unlist
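To make these commands concrete, here is a small self-contained sketch. It assumes dplyr is installed, and the data frame contents are invented for the example. It ends the pipeline with dplyr's pull(), one alternative to unlist that returns the column as a plain vector without names.

```r
library(dplyr)

# Hypothetical data, invented for this example.
data <- data.frame(
  country = c("Italy", "Malta", "Italy", "Spain"),
  age     = c(34, 27, 45, 52)
)

italy_only <- filter(data, country == "Italy")   # keep rows for Italy
select(italy_only, age)                          # keep only the age column

# The same steps chained with pipes, ending in a plain numeric vector.
italy_ages <- data %>%
  filter(country == "Italy") %>%
  pull(age)

italy_ages   # 34 45
```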

Convert from a data frame to a vector

Use unlist(dataframe)
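One thing to watch for: unlist() flattens every column of the data frame into a single named vector. The sketch below, using a made-up two-column data frame, shows this behaviour and two ways of extracting just one column instead.

```r
df <- data.frame(age = c(34, 45), height = c(170, 182))  # hypothetical data

# unlist() flattens every column into one named vector.
flat <- unlist(df)
names(flat)                 # "age1" "age2" "height1" "height2"

# To get a single column as a vector, extract it directly.
df$age                      # 34 45
unname(unlist(df["age"]))   # 34 45, names dropped
```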

Check proportion of samples falling within range

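Use mean(averages <= high) - mean(averages < low), where averages is a numeric vector and low and high are the lower and upper limits of the range (both inclusive). This works because the mean of a logical vector is the proportion of TRUE values.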
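A brief worked example, with values and limits chosen purely for illustration, compares this expression with the more direct form that tests both limits at once. The two should agree up to floating-point rounding, hence the use of all.equal().

```r
averages <- c(18, 22, 21, 24, 19, 22, 20, 20, 30, 42)  # example values
low  <- 20
high <- 24

# Share of values <= high minus share of values < low.
p1 <- mean(averages <= high) - mean(averages < low)

# Equivalent direct form: share of values inside [low, high].
p2 <- mean(averages >= low & averages <= high)

p1                 # approximately 0.6
all.equal(p1, p2)  # TRUE
```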