R is a powerful tool for statistical learning, data analysis, and graphics, and it includes numerous built-in functions. R is also well documented: you can learn about any R command by running help(command) in R. In this article, we will learn more about R built-in functions related to calculus, matrix operations, summary statistics, and regular expressions.
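The following functions manage the R workspace and interact with the operating system:
ls(): list the objects in the workspace
rm(object): remove an object
rm(list = ls()): remove all objects
dir(), dir(path, pattern): list the files in a directory
getwd(): get the working directory
setwd("path"): set the working directory
Sys.getenv("ENV"): get the value of an environment variable
Sys.setenv(ENV = "path"): set an environment variable
system("command", intern = TRUE): run a system command and capture its output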
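As a rough sketch of how these workspace and system functions fit together, the snippet below creates and removes an object, sets and reads an environment variable, and changes the working directory temporarily. The object name demo_vec and the variable DEMO_HOME are made up for illustration.

```r
# demo_vec and DEMO_HOME are hypothetical names used only for this sketch
demo_vec <- 1:3
"demo_vec" %in% ls()        # TRUE: the object is in the workspace
rm(demo_vec)                # remove a single object
"demo_vec" %in% ls()        # FALSE after removal

# Set an environment variable for this session, then read it back
Sys.setenv(DEMO_HOME = "/tmp/demo")
Sys.getenv("DEMO_HOME")

# Switch to a temporary directory, list its CSV files, and switch back
old_dir <- getwd()
setwd(tempdir())
dir(pattern = "\\.csv$")    # character(0) if there are no CSV files
setwd(old_dir)
```

Restoring the original working directory at the end keeps the sketch from affecting later code in the same session.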
R is one of the best applications for statistical analysis and natively supports many advanced statistical methods. The following are some of these functions:
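Descriptive statistics: mean, median, sd, var, cov, cor, quantile
Probability density functions: dnorm, dchisq, dt, df
Cumulative distribution functions: pgamma, pbeta, pexp
Quantile functions: qpois, qunif, qbinom
Random number generation: runif, rlogis, rlnorm
Linear regression: lm
Generalized linear models: glm
Nonlinear least squares: nls
Analysis of variance: anova
Multivariate analysis of variance: manova
Confidence intervals for model parameters: confint
Student's t-test: t.test
Principal component analysis: prcomp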
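The statistical functions above can be tried on a small vector and on the built-in mtcars data set. The following is only a minimal sketch: the vector x is made up, and the model mpg ~ wt is an arbitrary choice for illustration.

```r
# x is a made-up sample used only for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
mean(x)      # 5
median(x)    # 4.5
sd(x)        # sample standard deviation (divides by n - 1)

# The d/p/q/r prefixes, shown for the normal distribution
dnorm(0)               # density at 0
pnorm(1.96)            # P(Z <= 1.96), roughly 0.975
qnorm(0.975)           # roughly 1.96
pnorm(qnorm(0.975))    # the quantile function inverts the CDF

# A simple linear model on the built-in mtcars data
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)              # intercept and slope
confint(fit)           # 95% confidence intervals by default
```

The d, p, q, and r prefixes follow the same pattern for the other distributions in the list above.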
R provides an abundance of computational functions, such as:
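Dimensions and length: nrow, ncol, length
Sorting and sequences: sort, order, seq
Cumulative sums and products: cumsum, cumprod
Column and row sums: colSums, rowSums
Index of the minimum or maximum: which.min, which.max
Apply a function over rows or columns: apply
Center and scale columns: scale
Random sampling: sample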
The following are some examples of these functions:
# Cumulative sums, products
cumsum(1:5)
## [1] 1 3 6 10 15
cumprod(1:5)
## [1] 1 2 6 24 120
set.seed(23)
mydata = data.frame(sample = sample(1:100, 5), random = rnorm(5))
mydata
## sample random
## 1 29 2.7075823
## 2 28 0.5284939
## 3 72 -0.4823752
## 4 43 -1.0835666
## 5 45 0.2366887
# Which min/max
which.min(mydata$sample)
## [1] 2
which.max(mydata$sample)
## [1] 3
# Apply
apply(mydata , 2, quantile) # Calculate quantile
## sample random
## 0% 28 -1.0835666
## 25% 29 -0.4823752
## 50% 43 0.2366887
## 75% 45 0.5284939
## 100% 72 2.7075823
apply(mydata, 2, mean) # Calculate mean
## sample random
## 43.4000000 0.3813646
colMeans(mydata)
## sample random
## 43.4000000 0.3813646
scale(mydata) # scale(x) = (x - mean(x)) / sd(x)
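The remaining functions from the list (sort, order, seq, nrow, ncol, length, rowSums, colSums) are not covered by the examples above, so here is a short sketch on made-up data. It also checks what scale does: after scaling, each column should have mean 0 and standard deviation 1.

```r
# v and m are made-up inputs used only for illustration
v <- c(30, 10, 20)
sort(v)                  # 10 20 30
order(v)                 # 2 3 1: positions of the elements in sorted order
v[order(v)]              # same result as sort(v)
seq(0, 1, by = 0.25)     # 0.00 0.25 0.50 0.75 1.00

m <- matrix(1:6, nrow = 2)   # columns are (1,2), (3,4), (5,6)
nrow(m); ncol(m); length(m)  # 2, 3, 6
rowSums(m)               # 9 12
colSums(m)               # 3 7 11

s <- scale(m)            # center and scale each column
colMeans(s)              # all (numerically) zero
apply(s, 2, sd)          # all one
```

order is usually the more useful of the two sorting functions for data frames, because its result can index the rows of a whole table rather than a single vector.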
The most common matrix operations include:
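Dimensions: dim
Diagonal elements of a matrix: diag
Identity matrix of a given size: diag(number)
Sparse diagonal matrix (Matrix package): Diagonal
Symmetry test: isSymmetric
Matrix multiplication: %*%
Exact equality test: identical
Transpose: t
Determinant: det
Inverse: solve
Solution of a linear system: solve
QR decomposition: qr
Eigenvalues and eigenvectors: eigen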
The following are some examples of the above functions.
M = matrix(c(4,4,-2,2,6,2,2,8,4), 3, 3)
N = matrix(c(3,2,1,0,1,-2,4,5,6), 3, 3)
b = matrix(c(0,0,1))
dim(M)
## [1] 3 3
diag(M)
## [1] 4 6 4
diag(3)
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
Matrix::Diagonal(3, c(.1,.2,.3))
## [,1] [,2] [,3]
## [1,] 0.1 . .
## [2,] . 0.2 .
## [3,] . . 0.3
isSymmetric(M)
## [1] FALSE
M %*% N
## [,1] [,2] [,3]
## [1,] 18 -2 38
## [2,] 32 -10 94
## [3,] 2 -6 26
identical(M, N)
## [1] FALSE
t(M)
## [,1] [,2] [,3]
## [1,] 4 4 -2
## [2,] 2 6 2
## [3,] 2 8 4
det(M)
## [1] 8
qr(M)
## $qr
## [,1] [,2] [,3]
## [1,] -6.0000000 -4.6666667 -5.3333333
## [2,] 0.6666667 -4.7140452 -7.4481914
## [3,] -0.3333333 0.7071068 0.2828427
##
## $rank
## [1] 3
##
## $qraux
## [1] 1.6666667 1.7071068 0.2828427
##
## $pivot
## [1] 1 2 3
##
## attr(,"class")
## [1] "qr"
solve(M) # Inverse of M
## [,1] [,2] [,3]
## [1,] 1.0 -0.5 0.5
## [2,] -4.0 2.5 -3.0
## [3,] 2.5 -1.5 2.0
solve(M,b) # Solve a system
## [,1]
## [1,] 0.5
## [2,] -3.0
## [3,] 2.0
eigen(M)
## eigen() decomposition
## $values
## [1] 9.4185507 4.3878731 0.1935761
##
## $vectors
## [,1] [,2] [,3]
## [1,] 0.3994272 -0.6725085 0.1632537
## [2,] 0.8980978 -0.5844552 -0.8354557
## [3,] 0.1840605 0.4540313 0.5247494
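One way to sanity-check the matrix results above is to verify the defining identities numerically. The sketch below redefines M and b so that it runs on its own, and it uses all.equal, which compares numbers up to a small tolerance rather than exactly.

```r
M <- matrix(c(4, 4, -2, 2, 6, 2, 2, 8, 4), 3, 3)
b <- matrix(c(0, 0, 1))

# The matrix times its inverse should be the identity (up to rounding)
all.equal(M %*% solve(M), diag(3))

# The solution of the linear system should satisfy M %*% x = b
x <- solve(M, b)
all.equal(M %*% x, b)

# Each eigenpair should satisfy M v = lambda v
e <- eigen(M)
v1 <- e$vectors[, 1]
all.equal(as.vector(M %*% v1), e$values[1] * v1)

# The determinant should equal the product of the eigenvalues
all.equal(det(M), prod(e$values))
```

Using identical here instead of all.equal would likely return FALSE, because floating-point rounding leaves tiny differences in the results.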
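Base R supports both symbolic differentiation and numerical integration with:
D(expression, "name"): derivative of an expression with respect to the named variable
integrate(function, lower, upper): definite integral of a function over [lower, upper]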
For instance:
## Derivative
f = expression(x^2+3*x)
ff = D(f, "x")
ff
## 2 * x + 3
x = 2
eval(ff) # evaluate the derivative at x = 2
## [1] 7
D(ff, "x") # second derivative
## [1] 2
## Integrate
f = function(x) x^2+3*x
integrate(f, lower = 0, upper = 1)
## 1.833333 with absolute error < 2e-14
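The symbolic and numerical results above can be cross-checked against each other. The sketch below compares the derivative from D with a central finite difference, and the value from integrate with the antiderivative x^3/3 + 3x^2/2 worked out by hand. The names numeric_slope, symbolic_slope, and antideriv are introduced here for illustration only.

```r
f_expr <- expression(x^2 + 3 * x)
df_expr <- D(f_expr, "x")            # symbolic derivative: 2 * x + 3

f <- function(x) x^2 + 3 * x
x <- 2
h <- 1e-6
numeric_slope <- (f(x + h) - f(x - h)) / (2 * h)  # central difference at x = 2
symbolic_slope <- eval(df_expr)                    # evaluates 2 * x + 3 with x = 2
c(numeric = numeric_slope, symbolic = symbolic_slope)

# Antiderivative of f, derived by hand
antideriv <- function(x) x^3 / 3 + 3 * x^2 / 2
res <- integrate(f, lower = 0, upper = 1)
res$value                      # numerical estimate
antideriv(1) - antideriv(0)    # exact value, 11/6
```

integrate returns a list, so the numeric estimate is available as res$value when it is needed in further calculations.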
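There are several functions for working with date and time classes, such as:
Current time and date: Sys.time, Sys.Date, timestamp
Date components: weekdays, months, quarters
Format codes: %a %b %d %Y %X %H %M %S. Use help(strftime) to see all formats
Time differences: difftime
Date sequences: seq
Date-time formatting: strftime
Time series: ts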
For example:
timestamp()
##------ Wed Feb 12 21:47:43 2020 ------##
Sys.time()
## [1] "2020-02-12 21:47:43 CST"
d = Sys.Date()
d
## [1] "2020-02-12"
weekdays(d)
## [1] "Wednesday"
format(Sys.time(), "%Y-%m-%d")
## [1] "2020-02-12"
format(Sys.time(), "%a %b %d %Y %X")
## [1] "Wed Feb 12 2020 21:47:43"
format(Sys.time(), "%H:%M:%S")
## [1] "21:47:43"
seq(Sys.Date(), length.out = 3, by = "1 week") # Three dates, one week apart, starting today
## [1] "2020-02-12" "2020-02-19" "2020-02-26"
difftime("2020-02-26", "2020-02-12", units = "auto")
## Time difference of 14 days
# Date-time conversion
strptime("20/2/06 11:16:16.683", format = "%d/%m/%y %H:%M:%OS") # convert a character string to a date-time object
## [1] "2006-02-20 11:16:16 CST"
The ts function creates a time-series object from a vector or matrix.
ts(1:12, start = c(1956, 3), frequency = 12) # for quarterly data, use frequency = 4
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1956           1   2   3   4   5   6   7   8   9  10
## 1957  11  12
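The examples above depend on the current date, so their output changes from run to run. The following sketch uses fixed dates instead, which makes the results reproducible. The variable names are made up, and the weekday name assumes an English locale.

```r
d1 <- as.Date("2020-02-12")
d2 <- as.Date("2020-02-26")
d2 - d1                                        # Time difference of 14 days
as.numeric(difftime(d2, d1, units = "weeks"))  # 2
weekdays(d1)                                   # "Wednesday" in an English locale
format(d1, "%d/%m/%Y")                         # "12/02/2020"
seq(d1, by = "month", length.out = 3)          # 2020-02-12, 2020-03-12, 2020-04-12

# Date-time arithmetic works in seconds
stamp <- as.POSIXct("2020-02-12 21:47:43", tz = "UTC")
format(stamp + 3600, "%H:%M:%S")               # one hour later: "22:47:43"

# A monthly series starting in March 1956 ends in February 1957
z <- ts(1:12, start = c(1956, 3), frequency = 12)
end(z)                                         # 1957 2
window(z, start = c(1957, 1))                  # the 1957 values: 11 12
```

Setting tz explicitly avoids results that depend on the time zone of the machine running the code.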
The aggregate function splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form. The table function uses cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
head(warpbreaks)
## breaks wool tension
## 1 26 A L
## 2 30 A L
## 3 54 A L
## 4 25 A L
## 5 70 A L
## 6 52 A L
aggregate(breaks ~ wool + tension, data = warpbreaks, mean)
## wool tension breaks
## 1 A L 44.55556
## 2 B L 28.22222
## 3 A M 24.00000
## 4 B M 28.77778
## 5 A H 24.55556
## 6 B H 18.77778
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
## Month Ozone Temp
## 1 5 23.61538 66.73077
## 2 6 29.44444 78.22222
## 3 7 59.11538 83.88462
## 4 8 59.96154 83.96154
## 5 9 31.44828 76.89655
# Table of frequencies
letters = c('d','b','c','d','a','d','a', 'c')
fr = data.frame(table(letters))
fr[order(fr$Freq, decreasing = TRUE),]
## letters Freq
## 4 d 3
## 1 a 2
## 3 c 2
## 2 b 1
# Correlation matrix
cor(mtcars[,1:3])
## mpg cyl disp
## mpg 1.0000000 -0.8521620 -0.8475514
## cyl -0.8521620 1.0000000 0.9020329
## disp -0.8475514 0.9020329 1.0000000
# Symbolic coding correlation matrix
symnum(cor(mtcars))
##      m cy ds h dr w q v a g cr
## mpg  1
## cyl  + 1
## disp + *  1
## hp   , +  ,  1
## drat , ,  ,  . 1
## wt   + ,  +  , ,  1
## qsec . .  .  ,      1
## vs   , +  ,  , .  . , 1
## am   . .  .    ,  ,     1
## gear . .  .    ,  .     , 1
## carb . .  .  ,    . , .     1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
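As a complement to the examples above, the sketch below builds a two-way contingency table with table, reproduces the warpbreaks group means with tapply (a base R alternative not mentioned above), and checks how the formula interface of aggregate treats missing values. The names group_means, agg, and manual are introduced here for illustration.

```r
# Counts for each wool/tension combination (9 observations per cell)
table(warpbreaks$wool, warpbreaks$tension)

# The same group means as aggregate(), returned as a wool-by-tension matrix
group_means <- tapply(warpbreaks$breaks, warpbreaks[c("wool", "tension")], mean)
group_means

# With a formula, aggregate() drops rows that contain NA before summarizing
agg <- aggregate(Ozone ~ Month, data = airquality, FUN = mean)
manual <- mean(airquality$Ozone[airquality$Month == 5], na.rm = TRUE)
all.equal(agg$Ozone[agg$Month == 5], manual)
```

The last comparison shows why the monthly Ozone means above are not NA even though the airquality data contain missing values.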
The split function divides the data into a list of groups defined by a factor. The merge function merges two data frames by common columns or row names. merge is similar to JOIN in SQL:
merge(x, y, by): inner join
merge(x, y, by, all.x = TRUE): left outer join
merge(x, y, by, all.y = TRUE): right outer join
merge(x, y, by, all = TRUE): full outer join
data = expand.grid(meat = c("grade-1","grade-2","grade-3"), food = c("burger", "steak", "pizza"))
data$value = round(rnorm(nrow(data)), 2)
data
## meat food value
## 1 grade-1 burger 0.33
## 2 grade-2 burger -0.60
## 3 grade-3 burger 0.85
## 4 grade-1 steak 0.92
## 5 grade-2 steak 1.19
## 6 grade-3 steak 0.77
## 7 grade-1 pizza -0.60
## 8 grade-2 pizza -0.39
## 9 grade-3 pizza 0.88
data_by_food = split(data, data$food)
data_by_food
## $burger
## meat food value
## 1 grade-1 burger 0.33
## 2 grade-2 burger -0.60
## 3 grade-3 burger 0.85
##
## $steak
## meat food value
## 4 grade-1 steak 0.92
## 5 grade-2 steak 1.19
## 6 grade-3 steak 0.77
##
## $pizza
## meat food value
## 7 grade-1 pizza -0.60
## 8 grade-2 pizza -0.39
## 9 grade-3 pizza 0.88
class(data_by_food)
## [1] "list"
x = data.frame(ID = 1:6, Product = c(rep("TV", 3), rep("Mobile", 3)))
x
## ID Product
## 1 1 TV
## 2 2 TV
## 3 3 TV
## 4 4 Mobile
## 5 5 Mobile
## 6 6 Mobile
y = data.frame(ID = c(2,4,6), Made_in = c(rep("Japan", 2), rep("China", 1)))
y
## ID Made_in
## 1 2 Japan
## 2 4 Japan
## 3 6 China
merge(x, y, by = "ID") # Inner join
## ID Product Made_in
## 1 2 TV Japan
## 2 4 Mobile Japan
## 3 6 Mobile China
merge(x, y, by = "ID", all = TRUE) # Full outer join
## ID Product Made_in
## 1 1 TV <NA>
## 2 2 TV Japan
## 3 3 TV <NA>
## 4 4 Mobile Japan
## 5 5 Mobile <NA>
## 6 6 Mobile China
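The all.x and all.y variants of merge are not shown in the examples above. A minimal sketch with the same two data frames follows. The names left and right are introduced here for illustration.

```r
x <- data.frame(ID = 1:6, Product = c(rep("TV", 3), rep("Mobile", 3)))
y <- data.frame(ID = c(2, 4, 6), Made_in = c("Japan", "Japan", "China"))

left <- merge(x, y, by = "ID", all.x = TRUE)    # left outer join: keeps all rows of x
right <- merge(x, y, by = "ID", all.y = TRUE)   # right outer join: keeps all rows of y
nrow(left)                   # 6
nrow(right)                  # 3
sum(is.na(left$Made_in))     # 3 rows of x (IDs 1, 3, 5) have no match in y
```

Because every ID in y also appears in x, the right outer join here returns the same rows as the inner join.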
Regular expressions (regex) are a simple way to find patterns in text. The most common functions implementing regex in R are in the global regular expression print (grep) family. For more information, see help("regex") and help(grep). Learn more about regex functions in R at D-RUG. The following are some simple examples.
sub("a", "#", "abcdfa") # Substitution: substitute the first "a" in "abcdfa" with "#"
## [1] "#bcdfa"
gsub("a", "#", "abcdfa") # Global substitution: substitute every "a" in "abcdfa" with "#"
## [1] "#bcdf#"
substr("abcdef", 1, 3) # substr(x, start, stop)
## [1] "abc"
substr("abcdef", 1, regexpr("d","abcdef")[1]) # From first to "d"
## [1] "abcd"
substr("abcdef", nchar("abcdef")+1-3, nchar("abcdef")) # Last 3 characters
## [1] "def"
grep("a", c("abc", "def", "cba a", "aa"), value = FALSE) # Indices of the elements that contain "a"
## [1] 1 3 4
grep("a+", c("abc", "def", "cba a", "aa"), value = TRUE) # Elements that contain one or more "a"
## [1] "abc" "cba a" "aa"
grepl("Jojo", "my name is Jojo") # True if "Jojo" is in the text
## [1] TRUE
grepl("^Jojo", "my name is Jojo") # True if "Jojo" is at the beginning of the text
## [1] FALSE
grepl("Jojo$", "my name is Jojo") # True if "Jojo" is at the end of the text
## [1] TRUE
grepl("[Jj]ojo", "my name is Jojo") # True if "Jojo" or "jojo" is in the text
## [1] TRUE
pattern = "(\\d{3})[-. )]*(\\d{3})[-. ]*(\\d{4})"
grep(pattern, "888-555-7766", value = TRUE)
## [1] "888-555-7766"
r = regexec(pattern, "888-555-7766")
regmatches("888-555-7766", r)
## [[1]]
## [1] "888-555-7766" "888" "555" "7766"
list.files(path = "./doc", pattern = "\\.pdf$") # List the PDF files in the "doc" directory
## [1] "file1.pdf" "file2.pdf"
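Two further patterns come up often with the functions above: extracting every match from a string, and reusing captured groups in a replacement. The sketch below reuses the phone-number pattern from the example. The input string and the second phone number are made up for illustration.

```r
# msg is a made-up input string
msg <- "Call 888-555-7766 or 777.444.1234 today"
pattern <- "(\\d{3})[-. )]*(\\d{3})[-. ]*(\\d{4})"

# gregexpr finds every match, not just the first one
regmatches(msg, gregexpr(pattern, msg))[[1]]
# "888-555-7766" "777.444.1234"

# Backreferences \\1, \\2, \\3 reuse the captured groups in the replacement
gsub(pattern, "(\\1) \\2-\\3", msg)
# "Call (888) 555-7766 or (777) 444-1234 today"

# fixed = TRUE treats the pattern as a literal string instead of a regex
grepl(".", "abc", fixed = TRUE)   # FALSE: there is no literal dot
grepl(".", "abc")                 # TRUE: "." matches any character
```

The fixed = TRUE option is a simple way to avoid escaping when the search text contains characters such as "." or "*" that have a special meaning in regex.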