# Helpful functions in R

R is a powerful tool for statistical learning, data analysis and graphics that includes numerous built-in functions. R is a very self explanatory software in terms of documentation. We can learn about any R commands only by using help(command) in R. In this article, generally we will learn more about R built-in functions related to calculus, matrix operations, summary statistics and regular expressions.

As a prerequisite, first review the following article:

The following are also more resources to learn R programming:

## Shell-like commands

• List of objects: ls()
• Remove an object: rm(object)
• Remove all objects: rm(list = ls())
• List the files in a directory: dir(), dir(path, pattern)
• Get working directory: getwd()
• Set working directory: setwd("path")
• Get environmental variables: Sys.getenv("ENV")
• Set environmental variables: Sys.setenv(ENV = "path")
• Running shell commands: system("command", intern = TRUE)

## Statistical functions

R is one of the best application statistical analysis and natively supports many advanced statistical methods. The following show some of these functions:

• Basic stats: mean, median, sd, var, cov, cor, quantile
• Probability functions
• Density functions (ddist): dnorm, dchisq, dt, df
• Cumulative distribution functions (pdist): pgamma, pbeta, pexp
• Quantile functions (qdist): qpois, qunif, qbinom
• Random numbers (rdist) : runif, rlogis, rlnorm
• Regression analysis
• Linear models: lm
• Generalized linear models: glm
• Nonlinear models: nls
• Analysis of varience: anova
• Multivariate aalysis of varience: manova
• Confidence intervals: confint
• t-test: t.test
• principal components: prcomp

## Computational functions

R provides an abundance of computational functions, such as:

• Counting: nrow, ncol, length
• Ordinal and sequential: sort, order, seq
• Cumulative: cumsum, cumprod, colSums, rowSums
• Finding min/mix: which.min, which.max
• Mapping: apply
• Normalizing: scale
• Sampling: sample

The following are some examples of these functions:

# Cumulative sums, products
cumsum(1:5)
##   1  3  6 10 15

cumprod(1:5)
##    1   2   6  24 120

set.seed(23)
mydata = data.frame(sample = sample(1:100, 5), random = rnorm(5))
mydata
##   sample     random
## 1     29  2.7075823
## 2     28  0.5284939
## 3     72 -0.4823752
## 4     43 -1.0835666
## 5     45  0.2366887

# Which min/max
which.min(mydata$sample) ##  2 which.max(mydata$sample)
##  3

# Apply
apply(mydata , 2, quantile) # Calculate quantile
##      sample     random
## 0%       28 -1.0835666
## 25%      29 -0.4823752
## 50%      43  0.2366887
## 75%      45  0.5284939
## 100%     72  2.7075823

apply(mydata, 2, mean) # Calculate mean
##     sample     random
## 43.4000000  0.3813646

colMeans(mydata)
##     sample     random
## 43.4000000  0.3813646

scale(mydata) # scale(x) = (x - mean(x)) / sd(x)

## Matrix operations

The most common matrix operations include:

• Dimensions of a matrix: dim
• Diagonal of a matrix: diag
• Identity matrix: diag(number)
• Matrix diagonal: Diagonal
• Matrix symmetric test: isSymmetric
• Matrix multiplication: %*%
• Test matrices for exact equality: identical
• Matrix transpose: t
• Determinant of a matrix: det
• Inverse of a matrix: solve
• Solve a system of equations: solve
• QR Decomposition of a Matrix: qr
• Spectral Decomposition of a Matrix: eigen

The following are some examples of above functions.

M = matrix(c(4,4,-2,2,6,2,2,8,4), 3, 3)
N = matrix(c(3,2,1,0,1,-2,4,5,6), 3, 3)
b = matrix(c(0,0,1))

dim(M)
##  3 3

diag(M)
##  4 6 4

diag(3)
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

Matrix::Diagonal(3, c(.1,.2,.3))
##      [,1] [,2] [,3]
## [1,]  0.1    .    .
## [2,]    .  0.2    .
## [3,]    .    .  0.3

isSymmetric(M)
##  FALSE

M %*% N
##      [,1] [,2] [,3]
## [1,]   18   -2   38
## [2,]   32  -10   94
## [3,]    2   -6   26

identical(M, N)
##  FALSE

t(M)
##      [,1] [,2] [,3]
## [1,]    4    4   -2
## [2,]    2    6    2
## [3,]    2    8    4

det(M)
##  8

qr(M)
## $qr ## [,1] [,2] [,3] ## [1,] -6.0000000 -4.6666667 -5.3333333 ## [2,] 0.6666667 -4.7140452 -7.4481914 ## [3,] -0.3333333 0.7071068 0.2828427 ## ##$rank
##  3
##
## $qraux ##  1.6666667 1.7071068 0.2828427 ## ##$pivot
##  1 2 3
##
## attr(,"class")
##  "qr"

solve(M) # Inverse of M
##      [,1] [,2] [,3]
## [1,]  1.0 -0.5  0.5
## [2,] -4.0  2.5 -3.0
## [3,]  2.5 -1.5  2.0

solve(M,b) # Solve a system
##      [,1]
## [1,]  0.5
## [2,] -3.0
## [3,]  2.0

eigen(M)
## eigen() decomposition
## $values ##  9.4185507 4.3878731 0.1935761 ## ##$vectors
##           [,1]       [,2]       [,3]
## [1,] 0.3994272 -0.6725085  0.1632537
## [2,] 0.8980978 -0.5844552 -0.8354557
## [3,] 0.1840605  0.4540313  0.5247494

## Differentiation and integration

R base system supports both differentiation and integration with:

• Derivative: D(function, "wrt")
• Integrate: integrate(function, lower, upper)

For instance:

## Derivative
f = expression(x^2+3*x)
ff = D(f, "x")
ff
## 2 * x + 3

x = 2
eval(ff) # to find the answer for x = 2
##  7

D(ff, "x") # to find the second derivate
##  2

## Integrate
f = function(x) x^2+3*x
integrate(f, lower = 0, upper = 1)
## 1.833333 with absolute error < 2e-14

## Date and time

There are several functions to work with date and time classes such as:

• Current time and data: Sys.time, Sys.Date, timestamp
• Date information: weekdays, month, quarters
• Time formats: %a %b %d %Y %X %H %M %S. Use help(strftime) to see all formats
• Time difference: difftime
• Time sequences: seq
• Date-time conversion: strftime
• Time-series: ts

For example:

timestamp()
##------ Wed Feb 12 21:47:43 2020 ------##

Sys.time()
##  "2020-02-12 21:47:43 CST"

d = Sys.Date()
##  "2020-02-12"

weekdays(d)
##  "Wednesday"

format(Sys.time(), "%Y-%m-%d")
##  "2020-02-12"

format(Sys.time(), "%a %b %d %Y %X")
##  "Wed Feb 12 2020 21:47:43"

format(Sys.time(), "%H:%M:%S")
##  "21:47:43"

seq(Sys.Date(), length = 3, by = "1 week") # Next three weeks
##  "2020-02-12" "2020-02-19" "2020-02-26"

difftime("2020-02-26", "2020-02-12", units = "auto")
## Time difference of 14 days

# Date-time conversion
strptime("20/2/06 11:16:16.683", format = "%d/%m/%y %H:%M:%OS") # convert character to time-date objects
##  "2006-02-20 11:16:16 CST"

ts function can be used to create a vector or matrix of time-series objects.

ts(1:12, start = c(1956, 3), frequency = 12) # for season: frequency = 4
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1956           1   2   3   4   5   6   7   8   9  10
## 1957  11  12

## Summary statistics

aggregate function splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form. table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.

head(warpbreaks)
##   breaks wool tension
## 1     26    A       L
## 2     30    A       L
## 3     54    A       L
## 4     25    A       L
## 5     70    A       L
## 6     52    A       L

aggregate(breaks ~ wool + tension, data = warpbreaks, mean)
##   wool tension   breaks
## 1    A       L 44.55556
## 2    B       L 28.22222
## 3    A       M 24.00000
## 4    B       M 28.77778
## 5    A       H 24.55556
## 6    B       H 18.77778

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
##   Month    Ozone     Temp
## 1     5 23.61538 66.73077
## 2     6 29.44444 78.22222
## 3     7 59.11538 83.88462
## 4     8 59.96154 83.96154
## 5     9 31.44828 76.89655

# Table of frequencies
letters = c('d','b','c','d','a','d','a', 'c')
fr = data.frame(table(letters))
fr[order(fr$Freq, decreasing = TRUE),] ## letter Freq ## 4 d 3 ## 1 a 2 ## 3 c 2 ## 2 b 1 # Correlation matrix cor(mtcars[,1:3]) ## mpg cyl disp ## mpg 1.0000000 -0.8521620 -0.8475514 ## cyl -0.8521620 1.0000000 0.9020329 ## disp -0.8475514 0.9020329 1.0000000 # Symbolic coding correlation matrix symnum(cor(mtcars)) ## m cy ds h dr w q v a g cr ## mpg 1 ## cyl + 1 ## disp + * 1 ## hp , + , 1 ## drat , , , . 1 ## wt + , + , , 1 ## qsec . . . , 1 ## vs , + , , . . , 1 ## am . . . , , 1 ## gear . . . , . , 1 ## carb . . . , . , . 1 ## attr(,"legend") ##  0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1 ## Split and merge data split function divides the data into a list of defined groups. merge function merge two data frames by common columns or row names. merge is similar to JOIN in SQL: • Inner join: merge(x, y, by) • Left join: merge(x, y, by, all.x = T) • Right join: merge(x, y, by, all.y = T) • Outer join: merge(x, y, by, all = T) data = expand.grid(meat = c("grade-1","grade-2","grade-3"), food = c("burger", "steak", "pizza")) data$value = round(rnorm(nrow(data)), 2)
data
##      meat   food value
## 1 grade-1 burger  0.33
## 2 grade-2 burger -0.60
## 3 grade-3 burger  0.85
## 4 grade-1  steak  0.92
## 5 grade-2  steak  1.19
## 6 grade-3  steak  0.77
## 7 grade-1  pizza -0.60
## 8 grade-2  pizza -0.39
## 9 grade-3  pizza  0.88

data_by_food = split(data, data$food) data_by_food ##$burger
##      meat   food value
## 1 grade-1 burger  0.33
## 2 grade-2 burger -0.60
## 3 grade-3 burger  0.85
##
## $steak ## meat food value ## 4 grade-1 steak 0.92 ## 5 grade-2 steak 1.19 ## 6 grade-3 steak 0.77 ## ##$pizza
##      meat  food value
## 7 grade-1 pizza -0.60
## 8 grade-2 pizza -0.39
## 9 grade-3 pizza  0.88

class(data_by_food)
##  "list"

x = data.frame(ID = 1:6, Product = c(rep("TV", 3), rep("Mobile", 3)))
x
##   ID Product
## 1  1      TV
## 2  2      TV
## 3  3      TV
## 4  4  Mobile
## 5  5  Mobile
## 6  6  Mobile

y = data.frame(ID = c(2,4,6), Made_in = c(rep("Japan", 2), rep("China", 1)))
y
## 1  2   Japan
## 2  4   Japan
## 3  6   China

merge(x, y, by = "ID") # Inner join
##   ID Product Made_in
## 1  2      TV   Japan
## 2  4  Mobile   Japan
## 3  6  Mobile   China

merge(x, y, by = "ID", all = TRUE) # Full outer join
##   ID Product Made_in
## 1  1      TV    <NA>
## 2  2      TV   Japan
## 3  3      TV    <NA>
## 4  4  Mobile   Japan
## 5  5  Mobile    <NA>
## 6  6  Mobile   China

## Regular expressions

Regular expressions (regex) are a simple way to find patterns in text. The most common functions implementing regex in R are in the global regular expression print (grep) family. For more information see help("regex") and help(grep). Learn more about regex functions in R at D-RUG. Following are some simple examples.

sub("a", "#", "abcdfa") # Substitution: substitue first "a" in the expression "abcdfa" with "#"
##  "#bcdfa"

gsub("a", "#", "abcdfa") # General substitution: substitue all "a" in the expression "abcdfa" with "#"
##  "#bcdf#"

substr("abcdef", 1, 3) # substr(x, start, stop)
##  "abc"

substr("abcdef", 1, regexpr("d","abcdef")) # From first to "d"
##  "abcd"

substr("abcdef", nchar("abcdef")+1-3, nchar("abcdef")) # First 3 caracters from right
##  "def"

grep("a", c("abc", "def", "cba a", "aa"), value = FALSE) # Shows arrays' number that include "a"
##  1 3 4

grep("a+", c("abc", "def", "cba a", "aa"), value = TRUE) # Shows expressions that include one or more "a"
##  "abc"   "cba a" "aa"

grepl("Jojo", "my name is Jojo") # True if "Jojo" is in the text
##  TRUE

grepl("^Jojo", "my name is Jojo") # True if "Jojo" is at the begining of the text
##  FALSE

grepl("Jojo$", "my name is Jojo") # True if "Jojo" is at the end of the text ##  TRUE grepl("[J-j]ojo", "my name is Jojo") # True if "Jojo" or "jojo" is in the text ##  TRUE pattern = "(\\d{3})[-. )]*(\\d{3})[-. ]*(\\d{4})" grep(pattern, "888-555-7766", value = TRUE) ##  "888-555-7766" r = regexec(pattern, "888-555-7766") regmatches("888-555-7766", r) ## [] ##  "888-555-7766" "888" "555" "7766" list.files(path = "./doc", pattern = "*.pdf$") # List of "pdf" files in "doc" directory
##  "file1.pdf"   "file2.pdf"