A gentle introduction to R

R is a free software environment for statistical computing and graphics. To use R we might use RStudio, the most popular R IDE, or directly use R in a terminal. In both cases, we need to first download and install R. To install R we can create a virtual environment by conda or directly download R source code or use a Linux package manager (apt, yum, and etc.). In this article we will learn some fundamental syntax in R including data structures and operators, control flows, functions, and an overview of R packages.

You may find more about plotting and programming in R at:


Operators

R operators include:

  • Arithmetic: +, -, *, /, ^, % any arithmetic operarors %
  • Negation: !
  • Indexing: [, [[
  • Sequence operator: :
  • Component/slot extraction: $, @
  • Logical (and/or): &, &&, |, ||
  • Membership: %in%
  • Assignment: =, <-, ->
  • Ordering and comparison: <, >, <=, >=, ==, !=

For example:

a = 8
b = 3
n = 2
a %/% b  # Intiger division
## 2 

a %% a   # Remainder
## 0

a ^ n    # nth power 
## 64

a ^ 1/n  # nth root
## 4

A = matrix(c(1,2,3,4), ncol = 2)
A
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

A %*% A  # Matrix multiplication
##      [,1] [,2]
## [1,]    7   15
## [2,]   10   22

Data structures

There are four major data structures in R:

  • Vectors: c
  • Matrices: matrix
  • Data frames: data.frame
  • Lists: list

Vectors are generating by c command which combines values into a vector. Vectors are subscriptable and mutable objects that can be concatenated. We can call them by using vector[index]. Vectors keeps all array with a same type.

matrix creates a matrix from the given set of values. Matrices are subscriptable and mutable objects and we can use matrix[row,col] to call columns and rows. Matrices keeps all array with a same type and they cannot be concatenated.

data.frame creates data frames, store each column separately as a different variable with different observations (n obs. of m variables). When we read a csv file it saves as a dataframe. Data frames are subscriptable objects and we can use data.frame[row,col] or data.frame[col] to call columns and rows and data.frame$col_name can be used to call certain column by their names. They also can concatenate.

R list is the object which contains elements of different types – like strings, numbers, vectors, matrices, functions and another list inside it. It also could contains different number of objects at each row. For example if we have a loop that do not generate same amount of results at each iteration then we can store them in a list format. Lists are subscriptable and we can use list$name or list[index] to call components (rows) and list$name[index_2] or list[[index]][index_2] to call members of each component (row). They also can concatenate.

# Vectors
c1 = c(1:3,7) # all int 
typeof(c1)
## [1] "double"

str(c1) # structure of c1
## num [1:4] 1 2 3 7

c2 = c(1:3,'a',7) # all str
typeof(c2)
## [1] "character"

str(c2)
##chr [1:5] "1" "2" "3" "a" "7"

letter = c('a','b','c','d')
letter[1] # first element
## [1] "a"

letter[1:3] # elements 1 to 3
## [1] "a" "b" "c"

letter[4] = 'z' # mutable
letter
## [1] "a" "b" "c" "z"

c(letter, 'cat') # concatenate
## [1] "a"   "b"   "c"   "z"   "cat"

append(letter, 'append')
## [1] "a"   "b"   "c"   "z"   "append"

# Matrices
mm = matrix(c(1:8), 2, 4)
mm
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8

typeof(mm)
## [1] "integer"

str(mm)
## int [1:2, 1:4] 1 2 3 4 5 6 7 8

mm[1,2] # row 1 col 2
## [1] 3

mm[2,4] = 100 # mutable
mm
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6  100

## Dataframes
df = data.frame(col1 = 1:3, col2 = letters[1:3], col3 = 31:33)
df
##   col1 col2 col3
## 1    1    a   31
## 2    2    b   32
## 3    3    c   33

typeof(df)
## [1] "list"

str(df)
## 'data.frame':    3 obs. of  3 variables:
##  $ col1: int  1 2 3
##  $ col2: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ col3: int  31 32 33

df$col1 # column col1
## [1] 1 2 3

df[,1] # column 1
## [1] 1 2 3

df[,"col1"] # column 1
## [1] 1 2 3

df[["col1"]]
## [1] 1 2 3

df[1,] # row 1
##   col1 col2 col3
## 1    1    a   31

df[1,1] # row 1 and col 1
## [1] 1

df[1,1] = 100 # mutable
df
##   col1 col2 col3
## 1  100    a   31
## 2    2    b   32
## 3    3    c   33

df$col4 = c(103,102,101) # concatenate
df
##   col1 col2 col3 col4
## 1  100    a   31  103
## 2    2    b   32  102
## 3    3    c  500  101

## Lists
ls = list(x = 11:15, y = 1:7)
typeof(ls)
## "list"

str(ls)
## List of 2
##  $ x: int [1:5] 11 12 13 14 15
##  $ y: int [1:7] 1 2 3 4 5 6 7

ls$y[7] # or ls[[2]][7]
## [1] 7

ls$y[8] = 80 # concatenate
ls$y
## [1]  1  2  3  4  5  6  7 80

mpty_list = vector("list", 2) # make an empty list
names(empty_list) = paste("list", 1:2, sep = "_") # rename the list

Note that not only we can select by indexing the objects, but also we can remove entries. For instance:

letter[-4] # remove 4th element 
## [1] "a" "b" "c"

mm[,c(-2,-3)] # remove column 2,3
##      [,1] [,2]
## [1,]    1    7
## [2,]    2  100

df = df[,-4] # remove 4th col
df
##   col1 col2 col3
## 1  100    a   31
## 2    2    b   32
## 3    3    c  500

And since most of R data structures are subscriptable, we can easily filter them as well. For example:

## Let's select rows when:
df[df$col1 < 100,] # col1 < 100
##   col1 col2 col3
## 2    2    b   32
## 3    3    c  500

df[df$col3 %in% c(31,32),] # col3 is 31 or 32
##   col1 col2 col3
## 1  100    a   31
## 2    2    b   32

df[!df$col3 %in% c(31,32),] # col3 is not 31 nor 32
##   col1 col2 col3
## 3    3    c  500

df[df$col1 > 10 & df$col3 > 30, ] # col1 > 10 and col3 > 30 
##   col1 col2 col3
## 1  100    a   31

df[df$col1 > 10 | df$col3 > 40, ] # col1 > 10 or col3 > 30 
##   col1 col2 col3
## 1  100    a   31
## 3    3    c  500

## Let's order based col1
df[order(df$col1),]
##   col1 col2 col3
## 2    2    b   32
## 3    3    c  500
## 1  100    a   31

## Let's find which elemnts in col3 are > 31
which(df$col3 > 31)
## [1] 2 3

## Let's find percentage of col3 > 31
length(which(df$col3 > 31))/nrow(df)
## [1] 0.6666667

## Change col1 to 0,1 such that
df$col1[df$col1 < 100] = 0
df$col1[df$col1 >= 100] = 1
df
##   col1 col2 col3
## 1    1    a   31
## 2    0    b   32
## 3    0    c  500

Conversion

We can use the following commands to convert main R objects to other types:

  • as.numeric
  • as.integer
  • as.character
  • as.matrix
  • as.data.frame
  • as.list
  • as.Date
  • as.factor

Control flow tools

These statements allow us to control flow of the R script. The most common control statements include:

  • if, else
  • for
  • while
  • break
  • return
  • repeat

The following are some simple examples of using these statements in R.

n = 10
if (n == 7) {
  print("n is equal 7")
} else if (n > 7) {
  print("n is greater than 7")
} else {
  print("n is smaller than 7")
}
## [1] "n is greater than 7"

n = 7
while (n < 10) {
  print(n)
  n = n + 1 
}
## [1] 7
## [1] 8
## [1] 9

mysum = 0
for (i in c(10,20,30)) {
  mysum = mysum + i
}
print(mysum)
## [1] 60

mysum = 0
for (i in 1:100) { 
  mysum = mysum + i
  if (mysum > 25) {
    break
  }
}
print(mysum)
## [1] 28

a = 1:2
b = 1:2
for (i in a) {
  stopifnot(all.equal(a,b)) # if all are not TRUE then stop
  cat("'a' and 'b' both are equal to: ", i,"\n")
}
## 'a' and 'b' both are equal to:  1 
## 'a' and 'b' both are equal to:  2

Defining functions

By using function command we can define our own functions in R. For instance, lets define function Δ = b2 − 4ac and find the solution for a = 2, b = 3 and c = 4:

# Delta
delta = function(a, b, c) {
  b^2 - 4*a*c
}
delta(a = 2, b = 3, c = 4)
## [1] -23

Some other examples:

# Norm
norm = function(x) sqrt(x %*% x)
norm(1:4)
##          [,1]
## [1,] 5.477226

# Square
square = function(x) return(x * x)
square(2)
## [1] 4

# Factorial
fact_iter = function(n) {
  p = 1
  for (i in 1:n) {
     p = p * i # Not recursive 
  }
  return(p)
} 
fact_iter(8)
## [1] 40320

# Recersive function that compute n!
fact_rec = function(n) { 
  if (n == 1) 
    return(1)
  else 
    return(n * fact_rec(n - 1)) # Recursive function
}
fact_rec(8)
## [1] 40320

# Recersive function that compute a * b
mult = function(a, b) {
  if (b == 1) {
  return(a)
  } else {
    return(a + mult(a, b-1)) # Recursive function
  }
}
mult(6, 5)
## [1] 30

# Recersive function that compute matrix power
matrix.power = function(p, n) { 
  if (n == 1) 
    return(p)
  else 
    return(p %*% matrix.power(p, n-1)) # Recursive function
}
matrix.power(matrix(c(4,2,2,4), 2, 2), 3)
##      [,1] [,2]
## [1,]  112  104
## [2,]  104  112

# Matrix symmetric test
sym = function(a) {
  if (is.matrix(a) == TRUE) {
    if (identical(a, t(a)) == TRUE) {
      return("Matrix is symmetric")
    } else return("Matrix is not symmetric")
  } else return("Entry is not a Matrix")
}
sym(matrix(c(4,2,2,4), 2, 2))
## [1] "Matrix is symmetric"

Reading and writing

In R we can use read. and write. to read and write the file types that we want.

gpa = data.frame(name = c("Ashki", "Ari", "Dori", "Pishi"), gpa = c(3.4,3.7,3.9,3.5))

# write
write.table(gpa, file = "~/Documents/gpa.txt", sep = " ", row.names = FALSE, col.names = TRUE)

# add
write.table(data.frame(name = "Ellie", gpa = 3.3), file = "~/Documents/gpa.txt", append = TRUE, sep = " ", row.names = FALSE, col.names = FALSE)

# read
read.table("~/Documents/gpa.txt", header = T)

# csv
write.csv(gpa, file = "~/Documents/gpa.csv", row.names = FALSE)
read.csv("~/Documents/gpa.csv") # header is TRUE by default

Packages

Packages are very important component of R. RStudio is a great IDE for R that provides some basic libraries. But based on your requirements you may need to install and import other packages. We can use install.packages("package name") and library("package name") functions to install and import packages in RStudio. Knowing packages in R is a very important topic, some of packages that I am using are include:

  • Documentation: rmarkdown, kintr, kableExtra
  • Web application: shiny
  • Plot: lattice, ggplot2
  • GIS: sf, maps, leaflet
  • Bayesian analysis: R2OpenBUGS, RStan (need openBUGS and Stan)
  • Interface to Python: reticulate
  • JSON objects: rjson
  • Statistical learning:
    • Linear/quadratic discriminant analysis (LDA/QDA): MASS
    • k-nearest neighbors (KNN): class
    • Bootstrapping: boot
    • Ridge and LASSO: glmnet
    • Principal components regression (PCR) and Partial Least Squares (PLS): pls
    • Spline: splines
    • Generalized additive models (GAM): gam
    • Gradient Boosting Machines (GBM): gbm
    • tree, Random forest and bagging: tree, randomForest
    • Support Vector Machine (SVM): e1071
    • Linear, non-Linear and generalized mixed-effects models: lme4, nlme, MASS
    • Profile analysis of multivariate data: profileR
    • Panel regression: plm, splm