2 Intermediate R
https://learn.datacamp.com/courses/intermediate-r
2.1 Conditionals And Control Flow
Reminder: == is for comparison, while = (like <-) is for assignment. TRUE is treated as 1 in arithmetic, and FALSE is treated as 0.
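Because TRUE counts as 1 and FALSE as 0, summing or averaging a logical vector counts the matches. A minimal sketch with made-up values (the names views_demo, popular, n_popular and share_popular are hypothetical, not from the course):

```r
# Hypothetical example data, not from the course
views_demo <- c(16, 9, 13, 5)

# Logical vector: which values are greater than 10?
popular <- views_demo > 10      # TRUE FALSE TRUE FALSE

# TRUE counts as 1, so sum() counts the TRUE values
n_popular <- sum(popular)       # 2

# mean() gives the proportion of TRUE values
share_popular <- mean(popular)  # 0.5
```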
Compare Vectors and Matrices
Number of views on each site:
linkedin <- c(16, 9, 13, 5, 2, 17, 14)
facebook <- c(17, 7, 5, 16, 8, 13, 14)
To find out when LinkedIn had at least as many views as Facebook:
linkedin >= facebook
## [1] FALSE TRUE TRUE FALSE FALSE TRUE TRUE
Compare data in matrices:
views <- matrix(c(linkedin, facebook), nrow = 2, byrow = TRUE)
# When is views less than or equal to 14?
views <= 14
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] FALSE TRUE TRUE TRUE TRUE FALSE TRUE
## [2,] FALSE TRUE TRUE FALSE TRUE TRUE TRUE
Logical Operators
With the “or” operator, |, only one condition (the left one or the right one) needs to be satisfied for the result to be TRUE:
socks <- 13
# Is socks under 5 or above 10?
socks < 5 | socks > 10
## [1] TRUE
With the “and” operator, &, both conditions (the left and the right ones) need to be satisfied for the result to be TRUE. In this case, the left condition is not satisfied, hence:
# Is socks between 15 (exclusive) and 20 (inclusive)?
socks > 15 & socks <= 20
## [1] FALSE
The ! operator negates a logical value:
x <- 5
y <- 7
!((x < 4) & !(y > 12))
## [1] TRUE
This can be a brain-twister. Both comparisons, x < 4 and y > 12, are FALSE. However, the ! in front of (y > 12) flips that part to TRUE. Inside the outer parentheses, FALSE & TRUE gives FALSE because one component is FALSE. Finally, the outer ! flips the result, which becomes TRUE in the end.
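The evaluation can be traced one step at a time. This is only an illustrative sketch; the intermediate names (inner_left, inner_right, combined, result) are made up for this walk-through:

```r
x <- 5
y <- 7

inner_left  <- x < 4                     # FALSE
inner_right <- !(y > 12)                 # y > 12 is FALSE, so negating it gives TRUE
combined    <- inner_left & inner_right  # FALSE & TRUE is FALSE
result      <- !combined                 # negating FALSE gives TRUE
result
```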
The if statement implements the logic “if this happens, then do A.” For example:
num_views <- 14

if (num_views > 15) {
  print("You're popular!")
}
This extends the logic above: “if this happens, do A; otherwise, if that happens, do B; and if neither happens, do C”:
if (num_views > 15) {
  print("You're popular!")
} else if (num_views <= 15 & num_views > 10) {
  print("Your number of views is average")
} else {
  print("Try to be more visible!")
}
## [1] "Your number of views is average"
If-else constructs can be nested freely inside one another. This is a good example from DataCamp:
if (number < 10) {
  if (number < 5) {
    result <- "extra small"
  } else {
    result <- "small"
  }
} else if (number < 100) {
  result <- "medium"
} else {
  result <- "large"
}
print(result)
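The snippet above assumes that number already exists. One way to try it with several inputs is to wrap the same logic in a function; the name size_label is hypothetical and used only for this sketch:

```r
# Hypothetical wrapper around the nested if-else logic
size_label <- function(number) {
  if (number < 10) {
    if (number < 5) {
      result <- "extra small"
    } else {
      result <- "small"
    }
  } else if (number < 100) {
    result <- "medium"
  } else {
    result <- "large"
  }
  result
}

size_label(4)    # "extra small"
size_label(7)    # "small"
size_label(50)   # "medium"
size_label(150)  # "large"
```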
2.2 Loops
“While” Loop
A while loop follows the logic “while this condition is true, keep doing the task”. An example from DataCamp:
speed <- 64

# Extend/adapt the while loop
while (speed > 30) {
  print(paste("Your speed is", speed))
  if (speed > 48) {
    print("Slow down big time!")
    speed <- speed - 11
  } else {
    print("Slow down!")
    speed <- speed - 6
  }
}
## [1] "Your speed is 64"
## [1] "Slow down big time!"
## [1] "Your speed is 53"
## [1] "Slow down big time!"
## [1] "Your speed is 42"
## [1] "Slow down!"
## [1] "Your speed is 36"
## [1] "Slow down!"
“For” Loop
This loop prints all the views listed in the linkedin vector, in order from left to right:
linkedin <- c(16, 9, 13, 5, 2, 17, 14)

# Loop version 1
for (views in linkedin) {
print(views)
}
## [1] 16
## [1] 9
## [1] 13
## [1] 5
## [1] 2
## [1] 17
## [1] 14
This loop also prints all the views from the vector, but it does so by referring to the position (index) of each element within the vector:
# Loop version 2
for (i in 1:length(linkedin)) {
print(linkedin[i])
}
## [1] 16
## [1] 9
## [1] 13
## [1] 5
## [1] 2
## [1] 17
## [1] 14
Note: when looping over a list with loop version 2, use [[ ]] instead of [ ] to select the elements.
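A small sketch of index-based looping over a list, assuming a made-up list called profile. It uses seq_along() in place of 1:length(), which behaves the same here but also handles an empty list gracefully:

```r
# Hypothetical list; unlike a vector, a list can hold a different type per element
profile <- list(name = "linkedin", views = 16, active = TRUE)

collected <- character(0)
for (i in seq_along(profile)) {
  # [[ ]] extracts the element itself; [ ] would return a one-element list
  element <- profile[[i]]
  collected <- c(collected, class(element))
}
collected  # "character" "numeric" "logical"
```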
A for loop inside a for loop is called a nested loop. An example from DataCamp:
ttt <- matrix(c("O", NA, "X", NA, "O", "O", "X", NA, "X"), byrow = TRUE, nrow = 3)

for (i in 1:nrow(ttt)) {
  for (j in 1:ncol(ttt)) {
    print(paste("On row", i, "and column", j, "the board contains", ttt[i, j]))
  }
}
## [1] "On row 1 and column 1 the board contains O"
## [1] "On row 1 and column 2 the board contains NA"
## [1] "On row 1 and column 3 the board contains X"
## [1] "On row 2 and column 1 the board contains NA"
## [1] "On row 2 and column 2 the board contains O"
## [1] "On row 2 and column 3 the board contains O"
## [1] "On row 3 and column 1 the board contains X"
## [1] "On row 3 and column 2 the board contains NA"
## [1] "On row 3 and column 3 the board contains X"
“Break” And “Next”
The break statement terminates the loop as soon as it is reached, usually inside an if condition.
The next statement skips the rest of the current iteration and moves on to the next element.
likes <- c(16, 9, 13, 5, 2, 17, 14)

for (heart in likes) {
  if (heart > 10) {
    print("You're popular!")
  } else {
    print("Be more visible!")
  }
  if (heart > 16) {
    print("This is ridiculous, I'm outta here!")
    break
  }
  if (heart < 5) {
    print("This is too embarrassing!")
    next
  }
  print(heart)
}
## [1] "You're popular!"
## [1] 16
## [1] "Be more visible!"
## [1] 9
## [1] "You're popular!"
## [1] 13
## [1] "Be more visible!"
## [1] 5
## [1] "Be more visible!"
## [1] "This is too embarrassing!"
## [1] "You're popular!"
## [1] "This is ridiculous, I'm outta here!"
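A stripped-down sketch that isolates the two statements, using made-up data (values and kept are hypothetical names): next drops a single element, while break ends the loop for good.

```r
values <- c(3, 8, 1, 12, 20, 4)  # hypothetical data
kept <- numeric(0)

for (v in values) {
  if (v < 2) {
    next   # skip this element and continue with the following one
  }
  if (v > 15) {
    break  # leave the loop entirely; later elements are never visited
  }
  kept <- c(kept, v)
}
kept  # 3 8 12
```

The 1 is skipped by next, and the loop stops at 20, so the trailing 4 is never reached.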
2.3 Functions
A way to see the arguments of a function is the args() function:
args(sum)
## function (..., na.rm = FALSE)
## NULL
Exclude “NA” From A Calculation
linkedin <- c(16, 9, 13, 5, NA, 17, 14)
facebook <- c(17, NA, 5, 16, 8, 13, 14)

# Basic average of linkedin
mean(linkedin)
## [1] NA
# Advanced average of linkedin
mean(linkedin, na.rm = TRUE)
## [1] 12.33333
The default setting in the mean() function is na.rm = FALSE, which means it does not exclude NA values, so the result is NA. When na.rm is set to TRUE, the function removes the NA values before calculating the mean.
When Is It Required?
mean(x, trim = 0, na.rm = FALSE, ...)
x is required; if you do not specify it, R will throw an error. trim and na.rm are optional arguments: they have a default value which is used if the arguments are not explicitly specified.
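The same distinction applies to functions you write yourself. A minimal sketch, assuming a hypothetical function rounded_mean() in which x is required and digits has a default:

```r
# Hypothetical function: x is required, digits is optional (default 1)
rounded_mean <- function(x, digits = 1) {
  round(mean(x), digits)
}

rounded_mean(c(1, 2, 4.1))              # default digits = 1 gives 2.4
rounded_mean(c(1, 2, 4.1), digits = 0)  # overriding the default gives 2

# Omitting the required argument raises an error; capture its message
msg <- tryCatch(rounded_mean(), error = function(e) conditionMessage(e))
msg
```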
Create A Function
To create a function, assign function(arguments) { body } to a variable name:
pow_two <- function(x) {
  y <- x ^ 2
  print(paste(x, "to the power two equals", y))
  return(y)
}
pow_two(6)
## [1] "6 to the power two equals 36"
## [1] 36
# NOTE: "y" is defined inside the "pow_two()" function and is therefore not accessible outside of that function. The same is true for the function's argument, "x" in this case.
Internal Variables of “Function()” are FIXED
A function works on a copy of the value passed to it, so it cannot change the external variable that was passed in:
triple <- function(x) {
  x <- 3 * x
  x
}
# Testing whether R updates the variable "a":
a <- 5
triple(a)
## [1] 15
a
## [1] 5
Even though triple(a) returned 15, the variable a outside the function was not changed: it is still 5, not 15.
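If the goal is to update the variable, the usual approach is to assign the function's return value back to it. A short sketch reusing the triple() function from above:

```r
triple <- function(x) {
  x <- 3 * x
  x
}

a <- 5
triple(a)       # returns 15, but a itself is still 5
a <- triple(a)  # assigning the result back is what changes a
a               # now 15
```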
Example from Datacamp:
likes <- c(10, 18, 4)

interpret <- function(num_views) {
  if (num_views > 15) {
    print("You're popular!")
    return(num_views)
  } else {
    print("Try to be more visible!")
    return(0)
  }
}

interpret(likes[2])
## [1] "You're popular!"
## [1] 18
interpret(likes[3])
## [1] "Try to be more visible!"
## [1] 0
Load an R Package
There are basically two important functions when it comes to R packages:
install.packages(): installs a given package.
library(): loads packages, i.e. attaches them to the search list on your R workspace.
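A brief sketch of the workflow. The install step is commented out because it needs internet access, and ggplot2 is only an example package name; tools is used for the library() call because it ships with R:

```r
# install.packages("ggplot2")  # one-time download (example package; needs internet)

# The 'tools' package ships with R, so it can be attached without installing anything
library(tools)

# The attached package now appears on the search list
"package:tools" %in% search()

# Functions from the package can then be called directly
file_ext("report.csv")  # "csv"
```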
Anonymous functions
An anonymous function is a function that is NOT assigned to a variable (name):
# Named function
triple <- function(x) { 3 * x }

# Anonymous function with same implementation
function(x) { 3 * x }
## function(x) { 3 * x }
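Anonymous functions are most useful when passed straight to another function. A small sketch using sapply(), which is covered in the next section; the input vector is made up:

```r
# The anonymous function is defined in place and never given a name
tripled <- sapply(c(1, 2, 3), function(x) { 3 * x })
tripled  # 3 6 9
```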
2.4 The apply family
“lapply()”
lapply() applies the function passed to it over a vector or list, and returns a list.
lapply(X, FUN, ...)
For example:
numbers_list <- list(c(17, 28, -2, 9, 22), c(2, -19, 54, 27, 11), c(91, 76, -34, 8, 10))

extremes_avg <- function(x) {
  (min(x) + max(x)) / 2
}

# Apply extremes_avg() over numbers_list using lapply()
lapply(numbers_list, extremes_avg)
## [[1]]
## [1] 13
##
## [[2]]
## [1] 17.5
##
## [[3]]
## [1] 28.5
“sapply()”
sapply() applies the function passed to it over a vector or list, and tries to simplify the resulting list into a vector or array. If that is not possible, sapply() returns the same list that lapply() would return.
sapply(X, FUN, ...)
Continuing the example above:
# Apply extremes_avg() over numbers_list using sapply()
sapply(numbers_list, extremes_avg)
## [1] 13.0 17.5 28.5
The result is more compact than the output of lapply().
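Simplification fails when the results have different lengths. A sketch of that fallback, assuming a hypothetical helper above_25() that returns a varying number of values per list element:

```r
numbers_list <- list(c(17, 28, -2, 9, 22), c(2, -19, 54, 27, 11), c(91, 76, -34, 8, 10))

# Hypothetical helper: keeps the values above 25, so the output length varies
above_25 <- function(x) {
  x[x > 25]
}

result <- sapply(numbers_list, above_25)

# The lengths are 1, 2 and 2, so sapply() cannot build a vector or matrix
is.list(result)  # TRUE
result[[1]]      # 28
```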
“vapply()”
vapply() applies the function passed to it over a vector or list, like lapply() or sapply(). However, vapply() also requires you to specify the output format explicitly, that is, the type and length of the result that each call should return:
vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
The FUN.VALUE argument expects a template for the return value of the function FUN.
USE.NAMES is TRUE by default; in this case vapply() tries to generate a named array, if possible.
Example:
# Definition of the basics() function
basics <- function(x) {
  c(min = min(x), mean = mean(x), median = median(x), max = max(x))
}
# Apply basics() over numbers_list using vapply()
vapply(numbers_list, basics, numeric(4))
## [,1] [,2] [,3]
## min -2.0 -19 -34.0
## mean 14.8 15 30.2
## median 17.0 11 10.0
## max 28.0 54 91.0
In this example, if numeric(3) had been specified instead of numeric(4), the code would NOT run and would give an error, because vapply() requires each result to match the specified template. The basics() function returns 4 elements (min, mean, median, and max); therefore, FUN.VALUE needs to be specified as numeric(4).
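The mismatch can be demonstrated safely by wrapping the failing call in tryCatch(). This is a sketch; the names ok and bad are made up:

```r
numbers_list <- list(c(17, 28, -2, 9, 22), c(2, -19, 54, 27, 11), c(91, 76, -34, 8, 10))

basics <- function(x) {
  c(min = min(x), mean = mean(x), median = median(x), max = max(x))
}

# Template matches: every call returns 4 numeric values
ok <- vapply(numbers_list, basics, numeric(4))
dim(ok)  # 4 rows (statistics) by 3 columns (list elements)

# Template does not match: vapply() stops with an error instead of guessing
bad <- tryCatch(
  vapply(numbers_list, basics, numeric(3)),
  error = function(e) "error"
)
bad  # "error"
```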
2.5 Utilities
Mathematical Utilities
abs(): Calculate the absolute value.
sum(): Calculate the sum of all the values in a data structure.
mean(): Calculate the arithmetic mean.
round(): Round the values to 0 decimal places by default.
Example:
digits <- c(1.9, -2.6, 4.0, -9.5, -3.4, 7.3)

# Sum of absolute rounded values of digits
sum(abs(round(digits)))
## [1] 29
Data Utilities
seq(): Generate sequences, by specifying the from, to, and by arguments.
rep(): Replicate elements of vectors and lists.
rep(seq(1, 7, by = 2), times = 7)
## [1] 1 3 5 7 1 3 5 7 1 3 5 7 1 3 5 7 1 3 5 7 1 3 5 7 1 3 5 7
sort(): Sort a vector in ascending order by default. Works on numerics, but also on character strings and logicals.
rev(): Reverse the elements in a data structure for which reversal is defined.
str(): Display the structure of any R object.
append(): Merge vectors or lists.
is.*(): Check for the class of an R object.
as.*(): Convert an R object from one class to another.
unlist(): Flatten (possibly embedded) lists to produce a vector.
Example:
linkedin <- list(16, 9, 13, 5, 2, 17, 14)
facebook <- list(17, 7, 5, 16, 8, 13, 14)

# Convert linkedin and facebook to a vector: li_vec and fb_vec
li_vec <- unlist(linkedin)
fb_vec <- unlist(facebook)

# Append fb_vec to li_vec: social_vec
social_vec <- append(li_vec, fb_vec)

# Sort social_vec
sort(social_vec, decreasing = TRUE)
## [1] 17 17 16 16 14 14 13 13 9 8 7 5 5 2
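The utilities that the example above does not use can be tried in the same way. A short sketch with made-up inputs (all variable names here are hypothetical):

```r
steps <- seq(1, 10, by = 3)      # 1 4 7 10
backwards <- rev(steps)          # 10 7 4 1

text_value <- "16"
is.character(text_value)         # TRUE: it is a string, not a number
as_number <- as.numeric(text_value) + 1  # convert first, then calculate: 17

# str() prints a compact summary of an object's structure
str(list(a = 1, b = "x"))
```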
Regular Expressions
Regular expressions can be used to see whether a pattern exists inside a character string or a vector of character strings.
grepl(): the l in grepl() stands for logical; the function returns TRUE for each character string in which the pattern is found.
grep(): returns a vector containing the indices of the character strings that contain the pattern searched for.
The caret, ^, matches content located at the start of a string.
The dollar sign, $, matches content located at the end of a string.
emails <- c("john.doe@ivyleague.edu", "education@world.gov", "dalai.lama@peace.org",
            "invalid.edu", "quant@bigdatacollege.edu", "cookie.monster@sesame.tv")
# Search the vector above for email addresses that contain "@", followed by anything, and end in ".edu":
grepl(pattern = "@.*\\.edu$", emails)
## [1] TRUE FALSE FALSE FALSE TRUE FALSE
grep(pattern = "@.*\\.edu$", emails)
## [1] 1 5
# Subset emails using hits
emails[grep(pattern = "@.*\\.edu$", emails)]
## [1] "john.doe@ivyleague.edu" "quant@bigdatacollege.edu"
.*: can be read as “any character, matched zero or more times”. Both the dot and the asterisk are metacharacters.
\\.: the \\ escapes the dot, so that it matches a literal “.” instead of any character. Together with $, the pattern requires the string to end in “.edu”.
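The effect of escaping the dot is easiest to see on a string that has no literal dot. A minimal sketch with made-up test strings:

```r
# Unescaped dot: matches any single character, so "abc" matches "a.c"
unescaped <- grepl("a.c", "abc")     # TRUE

# Escaped dot: only a literal "." matches
escaped_no  <- grepl("a\\.c", "abc") # FALSE
escaped_yes <- grepl("a\\.c", "a.c") # TRUE
```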
sub() and gsub() take an additional replacement argument. If the regular expression pattern is found inside the character vector x, the matching part is replaced with replacement. sub() only replaces the first match in each string, whereas gsub() replaces all matches.
emails <- c("john.doe@ivyleague.edu", "education@world.gov", "dalai.lama@peace.org",
            "invalid.edu", "quant@bigdatacollege.edu", "cookie.monster@sesame.tv")
sub(pattern = "@.*\\.edu$", replacement = "@datacamp.edu", emails)
## [1] "john.doe@datacamp.edu" "education@world.gov"
## [3] "dalai.lama@peace.org" "invalid.edu"
## [5] "quant@datacamp.edu" "cookie.monster@sesame.tv"
# Elements [1] and [5] have been changed.
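The difference between sub() and gsub() only shows up when a string contains more than one match. A minimal sketch with a made-up string:

```r
phrase <- "a.b.c"  # hypothetical string with two literal dots

first_only  <- sub("\\.", "-", phrase)   # replaces the first match:  "a-b.c"
all_matches <- gsub("\\.", "-", phrase)  # replaces every match:      "a-b-c"
```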
\\s: Match a whitespace character, such as a space. The “s” is normally a plain character; escaping it with \\ makes it a metacharacter.
[0-9]+: Match the digits 0 to 9, at least once (+).
([0-9]+): The parentheses capture the matched part of the string so that it can be reused in the replacement.
\\1: Refers to the content captured by the first pair of parentheses, here ([0-9]+), inside the replacement argument.
awards <- c("Won 1 Oscar.",
            "Another 9 wins & 24 nominations.",
            "2 wins & 3 nominations.",
            "Nominated for 2 Golden Globes. 1 more win & 2 nominations.")
sub(".*\\s([0-9]+)\\snomination.*$", "\\1", awards)
## [1] "Won 1 Oscar." "24" "3" "2"
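The logic behind the pattern is as follows: .* matches any characters up to a space, ([0-9]+) captures the number that sits right before the word “nomination”, and .*$ matches everything after it. The whole match is then replaced by the captured number. The first string contains no match, so it is returned unchanged.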
Date And Time
Dates are represented by Date objects. Times are represented by POSIXct objects.
Under the hood, however, dates and times are simple numerical values. Date objects store the number of days since the 1st of January in 1970. POSIXct objects store the number of seconds since the 1st of January in 1970.
# Get the current date: today
today <- Sys.Date()
today
## [1] "2023-02-23"
# See what today looks like under the hood
unclass(today)
## [1] 19411
# Get the current time: now
now <- Sys.time()
now
## [1] "2023-02-23 12:31:11 CST"
# See what now looks like under the hood
unclass(now)
## [1] 1677177072
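The mapping between dates and day counts works in both directions. A small sketch with fixed dates so the results do not depend on when it is run (the names d and back are made up):

```r
# A Date is stored as the number of days since 1970-01-01
d <- as.Date("1970-01-10")
as.numeric(d)  # 9

# Going the other way: turn a day count back into a Date
back <- as.Date(19411, origin = "1970-01-01")
format(back)   # "2023-02-23", the date shown in the output above
```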
Use the as.Date() function to create a Date object from a simple character string. The format argument accepts these symbols:
%Y: 4-digit year (1982)
%y: 2-digit year (82)
%m: 2-digit month (01)
%d: 2-digit day of the month (13)
%A: weekday (Wednesday)
%a: abbreviated weekday (Wed)
%B: month (January)
%b: abbreviated month (Jan)
# Definition of character strings representing dates
str1 <- "May 23, '96"
str2 <- "2012-03-15"
str3 <- "30/January/2006"

# Convert the strings to dates: date1, date2, date3
date1 <- as.Date(str1, format = "%b %d, '%y")
date2 <- as.Date(str2, format = "%Y-%m-%d")
date3 <- as.Date(str3, format = "%d/%B/%Y")

# Convert dates to formatted strings
format(date1, "%A")
## [1] "Thursday"
format(date2, "%d")
## [1] "15"
format(date3, "%b %Y")
## [1] "Jan 2006"
Because both Date and POSIXct objects are represented by simple numerical values under the hood, you can do arithmetic with them:
today <- Sys.Date()
today + 1
## [1] "2023-02-24"
today - 1
## [1] "2023-02-22"
as.Date("2015-03-12") - as.Date("2015-02-27")
## Time difference of 13 days
Use as.POSIXct() to convert a character string to a POSIXct object.
%H: hours as a decimal number (00-23)
%I: hours as a decimal number (01-12)
%M: minutes as a decimal number
%S: seconds as a decimal number
%T: shorthand notation for the typical format %H:%M:%S
%p: AM/PM indicator
For a full list of conversion symbols, consult the strptime documentation in the console: ?strptime
# Definition of character strings representing times
str1 <- "May 23, '96 hours:23 minutes:01 seconds:45"
str2 <- "2012-3-12 14:23:08"

# Convert the strings to POSIXct objects: time1, time2
time1 <- as.POSIXct(str1, format = "%B %d, '%y hours:%H minutes:%M seconds:%S")
time2 <- as.POSIXct(str2, format = "%Y-%m-%d %H:%M:%S")

# Convert times to formatted strings
format(time1, "%M")
## [1] "01"
format(time2, "%I:%M %p")
## [1] "02:23 PM"
Examples of doing calculations with POSIXct objects:
now <- Sys.time()
now + 3600 # add an hour
## [1] "2023-02-23 13:31:11 CST"
now - 3600 * 24 # subtract a day
## [1] "2023-02-22 12:31:11 CST"
birth <- as.POSIXct("1879-03-14 14:37:23")
death <- as.POSIXct("1955-04-18 03:47:12")
einstein <- death - birth
einstein
## Time difference of 27792.56 days
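Subtracting two times lets R pick the unit automatically. When a specific unit is needed, difftime() with its units argument is one option. A sketch with made-up timestamps, fixed to UTC so the result does not depend on the local time zone:

```r
# Hypothetical timestamps, 13 days apart
start <- as.POSIXct("2015-02-27 00:00:00", tz = "UTC")
end   <- as.POSIXct("2015-03-12 00:00:00", tz = "UTC")

gap_days  <- difftime(end, start, units = "days")
gap_hours <- difftime(end, start, units = "hours")

as.numeric(gap_days)   # 13
as.numeric(gap_hours)  # 312
```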