A gentle introduction to R
R is a free software environment for statistical computing and
graphics. We can work with R through RStudio, the most popular R IDE,
or run R directly in a terminal. In both cases, we first need to
download and install R. To install R we can create a virtual
environment with conda, download the R source code directly, or use a
Linux package manager (apt, yum, etc.).
In this article we will learn some fundamental R syntax, including
data structures and operators, control flow, functions, and an
overview of R packages.
You may find more about plotting and programming in R at:
- Basic graphics in R
- Helpful functions in R
- R Tutorial
- R Language Definition
- Programming with R
- R for Reproducible Scientific Analysis
Operators
R operators include:
- Arithmetic: +, -, *, /, ^, and the %...% operators such as %% (remainder), %/% (integer division) and %*% (matrix multiplication)
- Negation: !
- Indexing: [, [[
- Sequence operator: :
- Component/slot extraction: $, @
- Logical (and/or): &, &&, |, ||
- Membership: %in%
- Assignment: =, <-, ->
- Ordering and comparison: <, >, <=, >=, ==, !=
For example:
a = 8
b = 3
n = 2
a %/% b # Integer division
## [1] 2
a %% b # Remainder
## [1] 2
a ^ n # nth power
## [1] 64
a ^ (1/n) # nth root; parentheses are needed because ^ binds tighter than /
## [1] 2.828427
A = matrix(c(1,2,3,4), ncol = 2)
A
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
A %*% A # Matrix multiplication
##      [,1] [,2]
## [1,]    7   15
## [2,]   10   22
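The logical, membership and assignment operators from the list above can be illustrated with a small sketch. The variable names (x, y, v1, v2, v3 and so on) are made up for this example.

```r
# Element-wise versus scalar logical operators
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)
elementwise_and <- x & y        # compares the vectors element by element
scalar_or <- x[1] || y[3]       # || works on single values, e.g. inside if()

# Membership: is each left-hand value present in 1:3?
is_member <- c(2, 5) %in% 1:3

# The three assignment forms all bind a value to a name
v1 = 1
v2 <- 2
3 -> v3

# Sequence, remainder and comparison combined: keep the even numbers in 1..6
evens <- (1:6)[(1:6) %% 2 == 0]
```

In short, & and | return one result per element, while && and || are meant for single TRUE/FALSE values such as the condition of an if statement.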
Data structures
There are four major data structures in R:
- Vectors: c
- Matrices: matrix
- Data frames: data.frame
- Lists: list
Vectors are created with the c function, which combines values into a vector. Vectors are subscriptable and mutable objects that can be concatenated. We access their elements with vector[index]. All elements of a vector have the same type.
The matrix function creates a matrix from a given set of values. Matrices are subscriptable and mutable objects, and we use matrix[row,col] to access rows and columns. All elements of a matrix have the same type. Matrices cannot be concatenated with c, which flattens them into a vector; rbind and cbind combine them by rows or by columns instead.
The data.frame function creates data frames, which store each column as a separate variable with its own type (n obs. of m variables). Reading a csv file with read.csv returns a data frame. Data frames are subscriptable objects: data.frame[row,col] or data.frame[col] selects rows and columns, and data.frame$col_name selects a column by its name. New columns can also be appended to them.
An R list is an object that can contain elements of different types, such as strings, numbers, vectors, matrices, functions, and even other lists. Its components can also have different lengths. For example, if a loop does not produce the same number of results at each iteration, we can store the results in a list. Lists are subscriptable: list$name or list[[index]] extracts a component, list[index] returns a sub-list, and list$name[index_2] or list[[index]][index_2] extracts a member of a component. Lists can also be concatenated.
# Vectors
c1 = c(1:3,7) # all numeric (7 is a double, so 1:3 is coerced to double)
typeof(c1)
## [1] "double"
str(c1) # structure of c1
##  num [1:4] 1 2 3 7
c2 = c(1:3,'a',7) # all str
typeof(c2)
## [1] "character"
str(c2)
##  chr [1:5] "1" "2" "3" "a" "7"
letter = c('a','b','c','d')
letter[1] # first element
## [1] "a"
letter[1:3] # elements 1 to 3
## [1] "a" "b" "c"
letter[4] = 'z' # mutable
letter
## [1] "a" "b" "c" "z"
c(letter, 'cat') # concatenate
## [1] "a"   "b"   "c"   "z"   "cat"
append(letter, 'append')
## [1] "a"      "b"      "c"      "z"      "append"
# Matrices
mm = matrix(c(1:8), 2, 4)
mm
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
typeof(mm)
## [1] "integer"
str(mm)
##  int [1:2, 1:4] 1 2 3 4 5 6 7 8
mm[1,2] # row 1 col 2
## [1] 3
mm[2,4] = 100 # mutable
mm
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6  100
# Data frames
df = data.frame(col1 = 1:3, col2 = letters[1:3], col3 = 31:33)
df
##   col1 col2 col3
## 1    1    a   31
## 2    2    b   32
## 3    3    c   33
typeof(df)
## [1] "list"
str(df) # in R >= 4.0, col2 is shown as chr instead of Factor
## 'data.frame': 3 obs. of 3 variables:
##  $ col1: int  1 2 3
##  $ col2: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ col3: int  31 32 33
df$col1 # column col1
## [1] 1 2 3
df[,1] # column 1
## [1] 1 2 3
df[,"col1"] # column 1
## [1] 1 2 3
df[["col1"]]
## [1] 1 2 3
df[1,] # row 1
##   col1 col2 col3
## 1    1    a   31
df[1,1] # row 1 and col 1
## [1] 1
df[1,1] = 100 # mutable
df
##   col1 col2 col3
## 1  100    a   31
## 2    2    b   32
## 3    3    c   33
df[3,3] = 500 # change row 3 and col 3 (the outputs below use this value)
df$col4 = c(103,102,101) # concatenate
df
##   col1 col2 col3 col4
## 1  100    a   31  103
## 2    2    b   32  102
## 3    3    c  500  101
# Lists
ls = list(x = 11:15, y = 1:7)
typeof(ls)
## [1] "list"
str(ls)
## List of 2
##  $ x: int [1:5] 11 12 13 14 15
##  $ y: int [1:7] 1 2 3 4 5 6 7
ls$y[7] # or ls[[2]][7]
## [1] 7
ls$y[8] = 80 # concatenate
ls$y
## [1]  1  2  3  4  5  6  7 80
empty_list = vector("list", 2) # make an empty list
names(empty_list) = paste("list", 1:2, sep = "_") # rename the list
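Two points about these structures are easy to miss: how matrices are combined, and how single and double brackets differ on a list. The following is a minimal sketch; m1, m2 and lst are names made up for this illustration.

```r
m1 <- matrix(1:4, nrow = 2)
m2 <- matrix(5:8, nrow = 2)
flat <- c(m1, m2)      # c() drops the dimensions and returns a plain vector
wide <- cbind(m1, m2)  # 2 x 4: columns of m2 placed to the right of m1
tall <- rbind(m1, m2)  # 4 x 2: rows of m2 placed below m1

lst <- list(x = 11:15, y = letters[1:3])
sub_list <- lst["x"]     # single bracket: still a list, of length 1
component <- lst[["x"]]  # double bracket: the integer vector itself
```

Here sub_list is still a list, so arithmetic such as sub_list + 1 would fail, whereas component + 1 works as usual.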
Note that indexing can be used not only to select entries but also to remove them. For instance:
letter[-4] # remove 4th element
## [1] "a" "b" "c"
mm[,c(-2,-3)] # remove columns 2 and 3
##      [,1] [,2]
## [1,]    1    7
## [2,]    2  100
df = df[,-4] # remove 4th col
df
##   col1 col2 col3
## 1  100    a   31
## 2    2    b   32
## 3    3    c  500
Because most R data structures are subscriptable, we can also filter them easily. For example:
## Let's select rows when:
df[df$col1 < 100,] # col1 < 100
##   col1 col2 col3
## 2    2    b   32
## 3    3    c  500
df[df$col3 %in% c(31,32),] # col3 is 31 or 32
##   col1 col2 col3
## 1  100    a   31
## 2    2    b   32
df[!df$col3 %in% c(31,32),] # col3 is neither 31 nor 32
##   col1 col2 col3
## 3    3    c  500
df[df$col1 > 10 & df$col3 > 30, ] # col1 > 10 and col3 > 30
##   col1 col2 col3
## 1  100    a   31
df[df$col1 > 10 | df$col3 > 40, ] # col1 > 10 or col3 > 40
##   col1 col2 col3
## 1  100    a   31
## 3    3    c  500
## Let's order the rows by col1
df[order(df$col1),]
##   col1 col2 col3
## 2    2    b   32
## 3    3    c  500
## 1  100    a   31
## Let's find which elements in col3 are > 31
which(df$col3 > 31)
## [1] 2 3
## Let's find the proportion of col3 > 31
length(which(df$col3 > 31))/nrow(df)
## [1] 0.6666667
## Change col1 to 0,1 such that values < 100 become 0 and values >= 100 become 1
df$col1[df$col1 < 100] = 0
df$col1[df$col1 >= 100] = 1
df
##   col1 col2 col3
## 1    1    a   31
## 2    0    b   32
## 3    0    c  500
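The bracket-based filters above also have shorter equivalents in base R. The sketch below uses a made-up data frame called scores, so none of its names or values come from the examples above.

```r
scores <- data.frame(id = 1:4,
                     grp = c("a", "b", "a", "b"),
                     val = c(10, 25, 40, 5))

# Bracket form, as used in the examples above
big <- scores[scores$val > 8 & scores$grp == "a", ]

# subset() evaluates the condition inside the data frame, so no scores$ prefix
big2 <- subset(scores, val > 8 & grp == "a")

# mean() of a logical vector gives the proportion of TRUE values
share <- mean(scores$val > 8)

# ifelse() recodes a column in one step instead of two assignments
scores$flag <- ifelse(scores$val >= 25, 1, 0)
```

subset is convenient for interactive work, while the bracket form is usually the safer choice inside functions because it does not rely on non-standard evaluation.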
Conversion
We can use the following functions to convert R objects to other types:
- as.numeric
- as.integer
- as.character
- as.matrix
- as.data.frame
- as.list
- as.Date
- as.factor
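A few of these conversions in action, as a minimal sketch. The input values are arbitrary and the variable names are only for illustration.

```r
num <- as.numeric("3.14")              # character -> double
int <- as.integer(3.9)                 # truncates toward zero
chr <- as.character(1:3)               # integer -> character
fac <- as.factor(c("lo", "hi", "lo"))  # levels are sorted: "hi", "lo"
codes <- as.integer(fac)               # the underlying level codes, not the labels
day <- as.Date("2020-01-31")           # character -> Date
next_day <- day + 1                    # Date arithmetic works in days
m_df <- as.data.frame(matrix(1:4, 2))  # matrix -> data frame with columns V1, V2
parts <- as.list(1:3)                  # vector -> list with one element per value
```

Note that converting a factor with as.integer returns its level codes; to recover numbers stored as factor labels, convert with as.character first.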
Control flow tools
These statements allow us to control the flow of an R script. The most common control statements include:
- if, else
- for
- while
- break
- return
- repeat
The following are some simple examples of using these statements in R.
n = 10
if (n == 7) {
  print("n is equal 7")
} else if (n > 7) {
  print("n is greater than 7")
} else {
  print("n is smaller than 7")
}
## [1] "n is greater than 7"
n = 7
while (n < 10) {
  print(n)
  n = n + 1
}
## [1] 7
## [1] 8
## [1] 9
mysum = 0
for (i in c(10,20,30)) {
  mysum = mysum + i
}
print(mysum)
## [1] 60
mysum = 0
for (i in 1:100) {
  mysum = mysum + i
  if (mysum > 25) {
    break
  }
}
print(mysum)
## [1] 28
a = 1:2
b = 1:2
for (i in a) {
  stopifnot(all.equal(a,b)) # stop if a and b are not equal
  cat("'a' and 'b' both are equal to: ", i,"\n")
}
## 'a' and 'b' both are equal to:  1
## 'a' and 'b' both are equal to:  2
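The repeat statement from the list above has no example yet, so here is a small sketch of it. It also uses next, which is not in the list but is a standard R keyword for skipping to the following iteration. The counter names are made up.

```r
# repeat loops forever until an explicit break
count <- 0
repeat {
  count <- count + 1
  if (count >= 5) break
}

# next skips the rest of the current iteration
odd_sum <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next   # ignore even numbers
  odd_sum <- odd_sum + i
}
```

After running this, count is 5 and odd_sum is the sum of the odd numbers from 1 to 9, which is 25.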
Defining functions
Using the function keyword we can define our own functions in R. For
instance, let's define the function $\Delta = b^2 - 4ac$ and evaluate
it for $a = 2$, $b = 3$ and $c = 4$:
# Delta
delta = function(a, b, c) {
  b^2 - 4*a*c
}
delta(a = 2, b = 3, c = 4)
## [1] -23
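Since $\Delta$ is the discriminant of $ax^2 + bx + c = 0$, one natural use of delta is to compute the real roots $x = (-b \pm \sqrt{\Delta}) / (2a)$ when $\Delta \ge 0$. The helper quad_roots below is a hypothetical name introduced only for this sketch.

```r
delta <- function(a, b, c) b^2 - 4 * a * c

# Hypothetical helper: real roots of a*x^2 + b*x + c = 0, smallest first
quad_roots <- function(a, b, c) {
  d <- delta(a, b, c)
  if (d < 0) {
    return(numeric(0))   # negative discriminant: no real roots
  }
  sort(c((-b - sqrt(d)) / (2 * a), (-b + sqrt(d)) / (2 * a)))
}

no_roots <- quad_roots(2, 3, 4)     # delta is -23, so the result is empty
two_roots <- quad_roots(1, -3, 2)   # delta is 1, roots are 1 and 2
```

For the values used above ($a = 2$, $b = 3$, $c = 4$) the discriminant is negative, so the sketch returns an empty vector rather than complex roots.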
Some other examples:
# Norm (note: this definition masks base::norm)
norm = function(x) sqrt(x %*% x)
norm(1:4)
##          [,1]
## [1,] 5.477226
# Square
square = function(x) return(x * x)
square(2)
## [1] 4
# Factorial
fact_iter = function(n) {
  p = 1
  for (i in seq_len(n)) { # seq_len(0) is empty, so fact_iter(0) returns 1
    p = p * i # Not recursive
  }
  return(p)
}
fact_iter(8)
## [1] 40320
# Recursive function that computes n!
fact_rec = function(n) {
  if (n <= 1) # base case, also covers n = 0
    return(1)
  else
    return(n * fact_rec(n - 1)) # Recursive call
}
fact_rec(8)
## [1] 40320
# Recursive function that computes a * b
mult = function(a, b) {
  if (b == 1) {
    return(a)
  } else {
    return(a + mult(a, b-1)) # Recursive call
  }
}
mult(6, 5)
## [1] 30
# Recursive function that computes a matrix power
matrix.power = function(p, n) {
  if (n == 1)
    return(p)
  else
    return(p %*% matrix.power(p, n-1)) # Recursive call
}
matrix.power(matrix(c(4,2,2,4), 2, 2), 3)
##      [,1] [,2]
## [1,]  112  104
## [2,]  104  112
# Matrix symmetric test
sym = function(a) {
  if (is.matrix(a)) {
    if (identical(a, t(a))) {
      return("Matrix is symmetric")
    } else return("Matrix is not symmetric")
  } else return("Entry is not a Matrix")
}
sym(matrix(c(4,2,2,4), 2, 2))
## [1] "Matrix is symmetric"
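The recursive functions above expect a single number, so calling them on a whole vector needs a helper such as sapply. The sketch below also shows a default argument. The function power and the variable names are made up for this illustration, and fact_rec is redefined in compact form so the block runs on its own.

```r
# Compact version of the recursive factorial, repeated here to be self-contained
fact_rec <- function(n) if (n <= 1) 1 else n * fact_rec(n - 1)

# sapply calls fact_rec once per element and collects the results in a vector
facts <- sapply(1:5, fact_rec)

# A default argument: p is 2 unless the caller overrides it
power <- function(x, p = 2) x^p
squares <- power(1:3)
cubes <- power(1:3, p = 3)
```

Functions built from vectorized operators, like power, accept vectors directly, while functions that branch on a single value, like fact_rec, need sapply or a loop.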
Reading and writing
In R we can use the read.* and write.* functions (for example
read.table, read.csv, write.table and write.csv) to read and write
the file types that we want.
gpa = data.frame(name = c("Ashki", "Ari", "Dori", "Pishi"), gpa = c(3.4,3.7,3.9,3.5))
# write
write.table(gpa, file = "~/Documents/gpa.txt", sep = " ", row.names = FALSE, col.names = TRUE)
# add
write.table(data.frame(name = "Ellie", gpa = 3.3), file = "~/Documents/gpa.txt", append = TRUE, sep = " ", row.names = FALSE, col.names = FALSE)
# read
read.table("~/Documents/gpa.txt", header = TRUE)
# csv
write.csv(gpa, file = "~/Documents/gpa.csv", row.names = FALSE)
read.csv("~/Documents/gpa.csv") # header is TRUE by default
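To try the write and read steps without touching ~/Documents, a temporary file can be used instead. This is a minimal sketch with a shortened version of the gpa table; path and gpa_back are names made up for this example.

```r
gpa <- data.frame(name = c("Ashki", "Ari"), gpa = c(3.4, 3.7),
                  stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")              # temporary file, deleted below
write.csv(gpa, file = path, row.names = FALSE)  # write
gpa_back <- read.csv(path, stringsAsFactors = FALSE)  # read it back
unlink(path)                                    # clean up
```

Setting row.names = FALSE matters for the round trip: without it, write.csv adds a column of row names that read.csv would read back as an extra column.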
Packages
Packages are a very important component of R. R itself ships with a
set of base libraries, and RStudio is a great IDE for working with
them. Depending on your requirements, you may need to install and
load other packages. We can use install.packages("package name") to
install a package and library("package name") to load it. Knowing the
package ecosystem is important; some of the packages that I use
include:
- Documentation: rmarkdown, knitr, kableExtra
- Web application: shiny
- Plot: lattice, ggplot2
- GIS: sf, maps, leaflet
- Bayesian analysis: R2OpenBUGS, rstan (need OpenBUGS and Stan)
- Interface to Python: reticulate
- JSON objects: rjson
- Statistical learning:
  - Linear/quadratic discriminant analysis (LDA/QDA): MASS
  - k-nearest neighbors (KNN): class
  - Bootstrapping: boot
  - Ridge and LASSO: glmnet
  - Principal components regression (PCR) and Partial Least Squares (PLS): pls
  - Spline: splines
  - Generalized additive models (GAM): gam
  - Gradient Boosting Machines (GBM): gbm
  - Tree, random forest and bagging: tree, randomForest
  - Support Vector Machine (SVM): e1071
  - Linear, non-linear and generalized mixed-effects models: lme4, nlme, MASS
  - Profile analysis of multivariate data: profileR
  - Panel regression: plm, splm
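A common pattern is to install a package only when it is missing and then load it. The helper use_package below is a hypothetical name, not a function from R or from any package. The sketch calls it on stats, which ships with R, so nothing is downloaded.

```r
# Hypothetical helper: load a package, installing it first only if it is missing
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needs internet access and a configured CRAN mirror
  }
  library(pkg, character.only = TRUE)  # pkg holds the package name as a string
}

use_package("stats")                     # already installed, so it is only loaded
loaded <- "package:stats" %in% search()  # TRUE once the package is attached
```

Calling use_package("ggplot2") would work the same way, except that the first call on a machine without ggplot2 triggers an installation.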