Helpful YouTube video about Regular Expressions

Ch 14: Strings in R for Data Science by Garrett Grolemund and Hadley Wickham

Answer key to exercises

Introduction

“A regular expression (shortened as regex or regexp) is a sequence of characters that define a search pattern. Usually such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.

The concept arose in the 1950s when the American mathematician Stephen Cole Kleene formalized the description of a regular language. The concept came into common use with Unix text-processing utilities. Different syntaxes for writing regular expressions have existed since the 1980s, one being the POSIX standard and another, widely used, being the Perl syntax.

Regular expressions are used in search engines, search and replace dialogs of word processors and text editors, in text processing utilities such as sed and AWK and in lexical analysis. Many programming languages provide regex capabilities either built-in or via libraries.” - Wikipedia

Base R functions:

R’s regular expression utilities work similarly to those in other languages. To learn how to use them in R, one can consult the main help page on this topic with ?regexp. Here is an overview from a Basic Regular Expressions in R Cheat Sheet.

The base R functions for dealing with regular expressions include grep(), grepl(), regexpr(), gregexpr(), regexec(), sub(), and gsub().

The stringr package in tidyverse.

The stringr package is part of the tidyverse collection of packages and wraps the underlying stringi package in a series of convenience functions. Some of the complexity of using the base R regular expression functions is usefully hidden by the stringr functions. In addition, the stringr functions provide a more rational interface to regular expressions, with more consistency in the arguments and argument ordering.

Given what we have discussed so far, there is a fairly straightforward mapping from the base R functions to the stringr functions. In general, for the stringr functions, the data are the first argument and the regular expression is the second argument, with optional arguments afterwards.
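
As a rough sketch of that mapping (the example vector is made up for illustration, and stringr is assumed to be installed), the base R functions take the pattern first and the data second, while the stringr functions reverse that order:

```r
library(stringr)

fruits <- c("apple", "banana", "pear")  # hypothetical example data

# Detect a match: base R puts the pattern first, stringr puts the data first
grepl("an", fruits)
str_detect(fruits, "an")

# Keep only the matching elements
grep("an", fruits, value = TRUE)
str_subset(fruits, "an")

# Replace the first match in each element
sub("a", "A", fruits)
str_replace(fruits, "a", "A")

# Replace every match in each element
gsub("a", "A", fruits)
str_replace_all(fruits, "a", "A")
```

Each pair should return the same result; the main practical difference is that the data-first order of stringr works more naturally with the pipe.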

14.3: MATCHING PATTERNS WITH REGULAR EXPRESSIONS

Regexps are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you understand them, you’ll find them extremely useful.

To learn regular expressions, we’ll use str_view() and str_view_all(). These functions take a character vector and a regular expression, and show you how they match. We’ll start with very simple regular expressions and then gradually get more and more complicated. Once you’ve mastered pattern matching, you’ll learn how to apply those ideas with various stringr functions.

library(stringr)
library(tidyverse)

14.3.1: Basic matches

x <- c("apple", "banana", "pear")
str_view(x, "an")

. matches any character (except a newline)

str_view(x, ".a.")

For example, in the regex a., a is a literal character that matches just ‘a’, while ‘.’ is a metacharacter that matches every character except a newline.

Metacharacters that need to be escaped:

.[{()\^$|?*+

BACKSLASHES

Backslashes are the key to “escaping” things. So if you want to include a literal single or double quote in your string, you can use a backslash to escape it.

double_quote <- "\""
single_quote <- '\''

That means if you want to include a literal backslash in a string, you’ll need to double it up: "\\".

To create the regular expression \., we need two backslashes in the string

dot <- "\\."

But the string itself contains only one:

writeLines(dot)
## \.

FYI : writeLines() writes character vectors to the console. Each element is written on its own line, without quotes. See print() for comparison.

print(dot)
## [1] "\\."

Pattern matches may vary from a precise equality to a very general similarity, as controlled by the metacharacters. For example, . is a very general pattern, [a-z] (match all lower case letters from ‘a’ to ‘z’) is less general and a is a precise pattern (matches just ‘a’).

And this tells R to look for an explicit .

x <- c("abc", "a.c", "bef")
str_view(x, "a\\.c")

THIS CONFUSES ME. The double backslashes tell R to treat the period as a literal period (not as “any character”), but when we used double backslashes before in dot, the string actually included one literal backslash along with the period.
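
One way to untangle this, sketched below, is to think of two separate layers. R first turns the string literal "a\\.c" into the four characters a\.c; the regex engine then reads \. as “a literal period”. So the string really does contain one backslash, and that single backslash is what tells the regex engine to stop treating the period as “any character”.

```r
library(stringr)

pattern <- "a\\.c"

# Layer 1: the R string. The literal "a\\.c" holds four characters: a \ . c
nchar(pattern)
writeLines(pattern)

# Layer 2: the regex engine sees \. and matches only a literal period
str_detect(c("abc", "a.c", "bef"), pattern)

# Without the backslash, . matches any character, so "abc" matches too
str_detect(c("abc", "a.c", "bef"), "a.c")
```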

If \ is used as an escape character in regular expressions, how do you match a literal \? 
Well you need to escape it, creating the regular expression with \\. To create that regular expression, you need to use a string, which also needs to escape the \. That means to match a literal backslash you need to write \\\\ - four backslashes to match one!
x <- "a\\b"
writeLines(x)
## a\b
str_view(x, "\\\\")

14.3.1.1 Exercises

1. Explain why each of these strings don’t match a \: "\", "\\", "\\\".
y <- "a\bc"
str_view(y, "\\\\")

From answer key:

"\": This will escape the next character in the R string.

"\\": This will resolve to \ in the regular expression, which will escape the next character in the regular expression.

"\\\": The first two backslashes will resolve to a literal backslash in the regular expression, the third will escape the next character. So in the regular expression, this will escape some escaped character.
2. How would you match the sequence "'\?

From answer key:

str_view("\"'\\", "\"'\\\\", match = TRUE)  
h <- "\"'\\"
writeLines(h)
## "'\
str_view(h, "\"'\\\\")

I don’t understand this ^ at all
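
Reading the pattern one piece at a time may help. This is a sketch of how the string "\"'\\\\" breaks down: \" is an escaped double quote, ' needs no escaping inside a double-quoted string, and \\\\ is the four-backslash form that matches one literal backslash. The variable names are my own.

```r
library(stringr)

target  <- "\"'\\"      # the three characters " ' \
pattern <- "\"'\\\\"    # the regex: a quote, an apostrophe, an escaped backslash

# writeLines() shows what each string really contains
writeLines(target)
writeLines(pattern)

nchar(target)    # number of characters in the string to be matched
nchar(pattern)   # the regex has one more: the extra backslash that escapes the \

str_detect(target, pattern)
```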

3. What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

From answer key:

It will match any patterns that are a dot followed by any character, repeated three times.

str_view(c(".a.b.c", ".a.b", "....."), c("\\..\\..\\.."), match = TRUE)

14.3.2: Anchors

By default, regular expressions will match any part of a string. It’s often useful to anchor the regular expression so that it matches from the start or end of the string.

You can use:

^ to match the start of the string.

$ to match the end of the string.

x <- c("apple", "banana", "pear")
str_view(x, "^a")
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
str_view(x, "^apple$")

14.3.2.1 Exercises

1. How would you match the literal string "$^$"?

My answer attempt:

str_view("$^$", "\\$^$")

But what if this was embedded in a larger string…

str_view("a$^$bcd", "\\$^$")

Hmm, that gave me back the entire string, not just the parts I wanted…

From answer key:

str_view(c("$^$", "ab$^$sfas"), "^\\$\\^\\$$", match = TRUE)

Ok they had to include what they wanted to start with by using the ^ and what they wanted to end with using the $. And then we need the double backslashes in front of each of the characters. That makes sense.
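
A small sketch comparing the three variants with str_detect(), which returns TRUE or FALSE instead of a highlighted view. It suggests why my first attempt did not behave as hoped: only the first $ was escaped, so the remaining ^ and $ were still treated as anchors.

```r
library(stringr)

x <- c("$^$", "ab$^$sfas")

# Every metacharacter escaped, no anchors: finds $^$ anywhere in the string
str_detect(x, "\\$\\^\\$")

# Outer ^ and $ left unescaped as anchors: the whole string must be $^$
str_detect(x, "^\\$\\^\\$$")

# Only the first $ escaped: the unescaped ^ then demands "start of string"
# right after a $ has been consumed, which can never happen
str_detect(x, "\\$^$")
```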

2. Given the corpus of common words in stringr::words, create regular expressions that find all words that:
Start with “y”.
End with “x”
Are exactly three letters long. (Don’t cheat by using str_length()!)
Have seven letters or more.

My answers:

str_view(stringr::words, "^y", match = TRUE)
str_view(stringr::words, "x$", match = TRUE)
str_view(stringr::words, "^...$", match = TRUE)
str_view(stringr::words, ".......", match = TRUE)

# not going to print those two because outputs are super long

14.3.3: Character classes and alternatives

There are a number of special patterns that match more than one character. You’ve already seen ., which matches any character apart from a newline. There are four other useful tools:

   \d : matches any digit.
   \s : matches any whitespace (e.g. space, tab, newline).
   [abc] : matches a, b, or c.
   [^abc] : matches anything except a, b, or c.

Remember, to create a regular expression containing \d or \s, you’ll need to escape the \ for the string, so you’ll type "\\d" or "\\s".
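
A minimal sketch of the four tools on a few made-up strings (note the doubled backslashes when the patterns are written as R strings):

```r
library(stringr)

x <- c("room 101", "tab\there", "abc", "xyz")  # made-up example strings

str_detect(x, "\\d")      # contains a digit
str_detect(x, "\\s")      # contains whitespace (space, tab, newline)
str_detect(x, "[abc]")    # contains a, b, or c
str_detect(x, "[^abc]")   # contains at least one character other than a, b, c
```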

A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex. Many people find this more readable.

Look for a literal character that normally has special meaning in a regex

x <- c("abc", "a.c", "a*c", "a c")
str_view(x, "a[.]c")
str_view(x, ".[*]c")
str_view(x, "a[ ]")

This works for most (but not all) regex metacharacters: $ . | ? * + ( ) [ {. Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: ] \ ^ and -.

You can use alternation to pick between one or more alternative patterns. For example, abc|d..f will match either “abc” or “deaf”. Note that the precedence for | is low, so that abc|xyz matches abc or xyz, not abcyz or abxyz. Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:

x <- c("grey", "gray")
str_view(x, "gr(e|a)y")

14.3.3.1 Exercises

  1. Create regular expressions to find all words that:
    • Start with a vowel.
str_subset(stringr::words, "^[aeiou]")
##   [1] "a"           "able"        "about"       "absolute"    "accept"     
##   [6] "account"     "achieve"     "across"      "act"         "active"     
##  [11] "actual"      "add"         "address"     "admit"       "advertise"  
##  [16] "affect"      "afford"      "after"       "afternoon"   "again"      
##  [21] "against"     "age"         "agent"       "ago"         "agree"      
##  [26] "air"         "all"         "allow"       "almost"      "along"      
##  [31] "already"     "alright"     "also"        "although"    "always"     
##  [36] "america"     "amount"      "and"         "another"     "answer"     
##  [41] "any"         "apart"       "apparent"    "appear"      "apply"      
##  [46] "appoint"     "approach"    "appropriate" "area"        "argue"      
##  [51] "arm"         "around"      "arrange"     "art"         "as"         
##  [56] "ask"         "associate"   "assume"      "at"          "attend"     
##  [61] "authority"   "available"   "aware"       "away"        "awful"      
##  [66] "each"        "early"       "east"        "easy"        "eat"        
##  [71] "economy"     "educate"     "effect"      "egg"         "eight"      
##  [76] "either"      "elect"       "electric"    "eleven"      "else"       
##  [81] "employ"      "encourage"   "end"         "engine"      "english"    
##  [86] "enjoy"       "enough"      "enter"       "environment" "equal"      
##  [91] "especial"    "europe"      "even"        "evening"     "ever"       
##  [96] "every"       "evidence"    "exact"       "example"     "except"     
## [101] "excuse"      "exercise"    "exist"       "expect"      "expense"    
## [106] "experience"  "explain"     "express"     "extra"       "eye"        
## [111] "idea"        "identify"    "if"          "imagine"     "important"  
## [116] "improve"     "in"          "include"     "income"      "increase"   
## [121] "indeed"      "individual"  "industry"    "inform"      "inside"     
## [126] "instead"     "insure"      "interest"    "into"        "introduce"  
## [131] "invest"      "involve"     "issue"       "it"          "item"       
## [136] "obvious"     "occasion"    "odd"         "of"          "off"        
## [141] "offer"       "office"      "often"       "okay"        "old"        
## [146] "on"          "once"        "one"         "only"        "open"       
## [151] "operate"     "opportunity" "oppose"      "or"          "order"      
## [156] "organize"    "original"    "other"       "otherwise"   "ought"      
## [161] "out"         "over"        "own"         "under"       "understand" 
## [166] "union"       "unit"        "unite"       "university"  "unless"     
## [171] "until"       "up"          "upon"        "use"         "usual"
  • That only contain consonants. (Hint: thinking about matching “not”-vowels.)
str_subset(stringr::words, "[aeiou]", negate=TRUE)
## [1] "by"  "dry" "fly" "mrs" "try" "why"
#str_subset(stringr::words, "[^aeiou]") 
# ^ why didn't this one work?
  • End with ed, but not with eed.
str_subset(stringr::words, "[^e]ed$")
## [1] "bed"     "hundred" "red"
  • End with ing or ise.
str_subset(stringr::words, "i(ng|se)$")
##  [1] "advertise" "bring"     "during"    "evening"   "exercise"  "king"     
##  [7] "meaning"   "morning"   "otherwise" "practise"  "raise"     "realise"  
## [13] "ring"      "rise"      "sing"      "surprise"  "thing"
#str_subset(stringr::words, "[ing|ise]$")
# ^ why didn't that work? 

str_subset(stringr::words, "ing|ise$") 
##  [1] "advertise" "bring"     "during"    "evening"   "exercise"  "king"     
##  [7] "meaning"   "morning"   "otherwise" "practise"  "raise"     "realise"  
## [13] "ring"      "rise"      "sing"      "single"    "surprise"  "thing"
# that mostly worked, but "single" slipped in: without parentheses the $ only applies to ise
# and I think when the letters are inside brackets, the order no longer matters?
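
A sketch that may answer the two questions left in the comments above. Square brackets always describe a single character, so order inside them does not matter and | has no special meaning there: [ing|ise]$ just means “ends in one of i, n, g, |, s, e”. Likewise, [^aeiou] on its own only asks for one non-vowel somewhere in the word; to require that every character is a non-vowel, the class has to be repeated and anchored. The test words below are picked for illustration.

```r
library(stringr)

test_words <- c("bring", "rise", "single", "the", "dry", "about")

# [ing|ise]$ is a one-character class: any word ending in i, n, g, |, s, or e
str_subset(test_words, "[ing|ise]$")

# ing|ise$ means "ing anywhere" OR "ise at the end", so "single" sneaks in
str_subset(test_words, "ing|ise$")

# Parentheses make the $ apply to both alternatives
str_subset(test_words, "(ing|ise)$")

# [^aeiou] matches any word with at least one non-vowel somewhere
str_subset(test_words, "[^aeiou]")

# Anchoring and repeating the class keeps only words made entirely of non-vowels
str_subset(test_words, "^[^aeiou]+$")
```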
  1. Empirically verify the rule “i before e except after c”.
str_subset(stringr::words, "cie|[^c]ei")
## [1] "science" "society" "weigh"
  1. Is “q” always followed by a “u”?
str_subset(stringr::words, "q[^u]") 
## character(0)
str_subset(stringr::words, "qu")
##  [1] "equal"    "quality"  "quarter"  "question" "quick"    "quid"    
##  [7] "quiet"    "quite"    "require"  "square"
  1. Write a regular expression that matches a word if it’s probably written in British English, not American English.

British English tends to use the following:

  • “ou” instead of “o”
  • use of “ae” and “oe” instead of “a” and “o”
  • ends in ise instead of ize
  • ends in yse

"ou|ise$|ae|oe|yse$"
  1. Create a regular expression that will match telephone numbers as commonly written in your country.
x <- c("123-456-7890", "(123)456-7890", "(123) 456-7890", "1235-2351")
str_view(x, "\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d")
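
The pattern above only matches the first format. As a sketch, optional pieces can cover the parenthesised forms too. The ? and {n} quantifiers used here are covered in the Repetition section, and which formats should count as valid is my own assumption:

```r
library(stringr)

phones <- c("123-456-7890", "(123)456-7890", "(123) 456-7890", "1235-2351")

# \\(? and \\)?  : optional literal parentheses around the area code
# [- ]?          : optional hyphen or space after the area code
# \\d{3}, \\d{4} : exactly three or four digits
phone_pattern <- "\\(?\\d{3}\\)?[- ]?\\d{3}-\\d{4}"

str_detect(phones, phone_pattern)
```

This is deliberately loose: it would also accept an unbalanced form such as "(123-456-7890", so it is a starting point and not a validator.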

14.3.4: Repetition

The next step up in power involves controlling how many times a pattern matches:

  • ?: 0 or 1
  • +: 1 or more
  • *: 0 or more

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
str_view(x, "CC+")
str_view(x, 'C[LX]+')

Note that the precedence of these operators is high, so you can write: colou?r to match either American or British spellings. That means most uses will need parentheses, like bana(na)+.
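
A short sketch of what “high precedence” means in practice, using made-up strings: a quantifier attaches only to the single item directly before it unless parentheses widen its reach.

```r
library(stringr)

# The quantifier binds to the single item before it: here only the final "a"
str_extract("bananaaa", "banana+")

# Parentheses make the quantifier apply to the whole group "na"
str_extract("bananana", "bana(na)+")

# colou?r: the ? applies to "u" alone, so both spellings match
str_detect(c("color", "colour"), "colou?r")
```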

You can also specify the number of matches precisely:

  • {n}: exactly n
  • {n,}: n or more
  • {n,m}: between n and m

str_view(x, "C{2}")
str_view(x, "C{2,}")
#str_view(x, "C{,2}") # why doesn't this work?
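
The commented-out line likely fails because, as far as I know, the ICU regex engine that stringr uses does not accept an interval with the lower bound left out. Writing the zero explicitly, as in {0,2}, expresses “at most two”. A minimal sketch on just the numeral part of the string:

```r
library(stringr)

numeral <- "MDCCCLXXXVIII"

# {0,2}: between zero and two C's after the D
str_extract(numeral, "DC{0,2}")

# {2}: exactly two; {2,}: two or more
str_extract(numeral, "C{2}")
str_extract(numeral, "C{2,}")
```

The D is included in the first pattern on purpose: a bare C{0,2} can match zero characters, so its first match would be an empty string at the very start of the input.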

By default these matches are “greedy”: they will match the longest string possible. You can make them “lazy”, matching the shortest string possible by putting a ? after them. This is an advanced feature of regular expressions, but it’s useful to know that it exists:

str_view(x, 'C{2,3}?')
str_view(x, 'C[LX]+?')
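
Since str_view() only shows highlighting, a sketch with str_extract() makes the greedy and lazy results easier to compare side by side (same numeral, shortened string):

```r
library(stringr)

x <- "MDCCCLXXXVIII"

str_extract(x, "C{2,3}")    # greedy: takes as many C's as allowed
str_extract(x, "C{2,3}?")   # lazy: stops at the minimum of two
str_extract(x, "C[LX]+")    # greedy: C followed by the full run of L and X
str_extract(x, "C[LX]+?")   # lazy: C followed by just one L or X
```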

14.3.4.1 Exercises

  1. Describe the equivalents of ?, +, * in {m,n} form.

  • ? : {0,1} : match at most 1
  • + : {1,} : match 1 or more
  • * : {0,} : match 0 or more

I don’t understand how you could match zero of a pattern
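
One way to think about it: matching zero of something just means that piece is allowed to be absent while the rest of the pattern still matches. A sketch with made-up spellings:

```r
library(stringr)

spellings <- c("color", "colour", "colouur")

# u? allows zero or one "u": zero matches "color", one matches "colour"
str_detect(spellings, "^colou?r$")

# u* allows zero or more
str_detect(spellings, "^colou*r$")

# u+ needs at least one, so "color" no longer matches
str_detect(spellings, "^colou+r$")
```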

  1. Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)

  • ^.*$ : will match any string
  • "\\{.+\\}" : will match any string with curly brackets surrounding at least one character
  • \d{4}-\d{2}-\d{2} : will match four digits followed by a hyphen, followed by two digits, followed by another hyphen, followed by another two digits. This is a regular expression that can match dates formatted like “YYYY-MM-DD”
  • "\\\\{4}" : is \\{4}, which will match four backslashes.

  1. Create regular expressions to find all words that:
  • Start with three consonants.
str_view(words, "^[^aeiou]{3}", match = TRUE)
  • Have three or more vowels in a row.
str_view(words, "[aeiou]{3,}", match = TRUE)
  • Have two or more vowel-consonant pairs in a row.
str_view(words, "([aeiou][^aeiou]){2,}", match = TRUE)
  1. Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.

14.3.5: Grouping and backreferences

Earlier, you learned about parentheses as a way to disambiguate complex expressions. Parentheses also create a numbered capturing group (number 1, 2 etc.). A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with backreferences, like \1, \2 etc. For example, the following regular expression finds all fruits that have a repeated pair of letters.

str_view(fruit, "(..)\\1", match = TRUE)
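
To see what the capturing group actually holds, here is a sketch on a small made-up vector. str_subset() keeps the matching strings, and str_match() returns the full match in the first column and the text captured by group 1 in the second:

```r
library(stringr)

fruits <- c("banana", "coconut", "apple", "pear")  # small made-up sample

# (..) captures two characters; \\1 requires the same two again immediately
str_subset(fruits, "(..)\\1")

# str_match() shows the full match and what group 1 captured (NA if no match)
str_match(fruits, "(..)\\1")
```

Note that "apple" does not match: the repeated letter “pp” is one pair, but the pattern needs that pair to appear twice in a row.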

14.3.5.1 Exercises

  1. Describe, in words, what these expressions will match:

  • (.)\1\1 : the same character appearing three times in a row
  • "(.)(.)\\2\\1" : a pair of characters followed by the same pair of characters in reversed order
  • (..)\1 : any two characters repeated
  • "(.).\\1.\\1" : a character followed by any character, the original character, any other character, then the original character again.
  • "(.)(.)(.).*\\3\\2\\1" : three characters followed by zero or more characters of any kind followed by the same three characters but in reverse order.

  1. Construct regular expressions to match words that:
  • Start and end with the same character.
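# capture the first character, then require it again at the end;
# the \\1?$ branch covers one- and two-character words such as "a"
str_view(words, "^(.)((.*\\1$)|\\1?$)", match = TRUE)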
  • Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
str_subset(words, "([A-Za-z][A-Za-z]).*\\1")
##  [1] "appropriate" "church"      "condition"   "decide"      "environment"
##  [6] "london"      "paragraph"   "particular"  "photograph"  "prepare"    
## [11] "pressure"    "remember"    "represent"   "require"     "sense"      
## [16] "therefore"   "understand"  "whether"
  • Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
str_subset(words, "([a-z]).*\\1.*\\1")
##  [1] "appropriate" "available"   "believe"     "between"     "business"   
##  [6] "degree"      "difference"  "discuss"     "eleven"      "environment"
## [11] "evidence"    "exercise"    "expense"     "experience"  "individual" 
## [16] "paragraph"   "receive"     "remember"    "represent"   "telephone"  
## [21] "therefore"   "tomorrow"

14.4 TOOLS

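Now that you’ve learned the basics of regular expressions, it’s time to learn how to apply them to real problems. In this section you’ll learn a wide array of stringr functions that let you:

  • Determine which strings match a pattern.
  • Find the positions of matches.
  • Extract the content of matches.
  • Replace matches with new values.
  • Split a string based on a match.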

14.4.1: Detect matches

To determine if a character vector matches a pattern, use str_detect(). It returns a logical vector the same length as the input:

x <- c("apple", "banana", "pear")
str_detect(x, "e")
## [1]  TRUE FALSE  TRUE

Remember that when you use a logical vector in a numeric context, FALSE becomes 0 and TRUE becomes 1. That makes sum() and mean() useful if you want to answer questions about matches across a larger vector:

# How many common words start with t?
sum(str_detect(words, "^t"))
## [1] 65
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
## [1] 0.2765306

When you have complex logical conditions (e.g. match a or b but not c unless d) it’s often easier to combine multiple str_detect() calls with logical operators, rather than trying to create a single regular expression. For example, here are two ways to find all words that don’t contain any vowels:

# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(words, "[aeiou]")

# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")

identical(no_vowels_1, no_vowels_2)
## [1] TRUE

The results are identical, but I think the first approach is significantly easier to understand. If your regular expression gets overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining the pieces with logical operations.

A common use of str_detect() is to select the elements that match a pattern. You can do this with logical subsetting, or the convenient str_subset() wrapper:

words[str_detect(words, "x$")]
## [1] "box" "sex" "six" "tax"
str_subset(words, "x$")
## [1] "box" "sex" "six" "tax"

Typically, however, your strings will be one column of a data frame, and you’ll want to use filter instead:

df <- tibble(
  word = words, 
  i = seq_along(word)
)

df %>% 
  filter(str_detect(word, "x$"))
## # A tibble: 4 x 2
##   word      i
##   <chr> <int>
## 1 box     108
## 2 sex     747
## 3 six     772
## 4 tax     841

A variation on str_detect() is str_count(): rather than a simple yes or no, it tells you how many matches there are in a string:

x <- c("apple", "banana", "pear")
str_count(x, "a")
## [1] 1 3 1
# On average, how many vowels per word?
mean(str_count(words, "[aeiou]"))
## [1] 1.991837

It’s natural to use str_count() with mutate():

df %>% 
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )
## # A tibble: 980 x 4
##    word         i vowels consonants
##    <chr>    <int>  <int>      <int>
##  1 a            1      1          0
##  2 able         2      2          2
##  3 about        3      3          2
##  4 absolute     4      4          4
##  5 accept       5      2          4
##  6 account      6      3          4
##  7 achieve      7      4          3
##  8 across       8      2          4
##  9 act          9      1          2
## 10 active      10      3          3
## # … with 970 more rows

Note that matches never overlap. For example, in “abababa”, how many times will the pattern “aba” match? Regular expressions say two, not three:

str_count("abababa", "aba")
## [1] 2
str_view_all("abababa", "aba")

Note the use of str_view_all(). As you’ll shortly learn, many stringr functions come in pairs: one function works with a single match, and the other works with all matches. The second function will have the suffix _all.

14.4.1.1 Exercises

For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.

  • Find all words that start or end with x.

# one regex
words[str_detect(words, "^x|x$")]
## [1] "box" "sex" "six" "tax"
# split regex into parts
start_with_x <- str_detect(words, "^x")
end_with_x <- str_detect(words, "x$")
words[start_with_x | end_with_x]
## [1] "box" "sex" "six" "tax"
  • Find all words that start with a vowel and end with a consonant.
str_subset(words, "^[aeiou].*[^aeiou]$") %>% head()
## [1] "about"   "accept"  "account" "across"  "act"     "actual"
# or

start_with_vowel <- str_detect(words, "^[aeiou]")
end_with_consonant <- str_detect(words, "[^aeiou]$")
words[start_with_vowel & end_with_consonant] %>% head()
## [1] "about"   "accept"  "account" "across"  "act"     "actual"
  • Are there any words that contain at least one of each different vowel?
pattern <-
  cross(rerun(5, c("a", "e", "i", "o", "u")),
    .filter = function(...) {
      x <- as.character(unlist(list(...)))
      length(x) != length(unique(x))
    }
  ) %>%
  map_chr(~str_c(unlist(.x), collapse = ".*")) %>%
  str_c(collapse = "|")

str_subset("aseiouds", pattern)
## [1] "aseiouds"
str_subset(words, pattern)
## character(0)
# or

words[str_detect(words, "a") &
  str_detect(words, "e") &
  str_detect(words, "i") &
  str_detect(words, "o") &
  str_detect(words, "u")]
## character(0)
  1. What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)
# highest number: 
vowels <- str_count(words, "[aeiou]")
words[which(vowels == max(vowels))]
## [1] "appropriate" "associate"   "available"   "colleague"   "encourage"  
## [6] "experience"  "individual"  "television"
# highest proportion:
prop_vowels <- str_count(words, "[aeiou]") / str_length(words)
words[which(prop_vowels == max(prop_vowels))]
## [1] "a"

14.4.2: Extract matches

To extract the actual text of a match, use str_extract(). To show that off, we’re going to need a more complicated example. I’m going to use the Harvard sentences, which were designed to test VOIP systems, but are also useful for practicing regexps. These are provided in stringr::sentences:

length(sentences)
## [1] 720
head(sentences)
## [1] "The birch canoe slid on the smooth planks." 
## [2] "Glue the sheet to the dark blue background."
## [3] "It's easy to tell the depth of a well."     
## [4] "These days a chicken leg is a rare dish."   
## [5] "Rice is often served in round bowls."       
## [6] "The juice of lemons makes fine punch."

Imagine we want to find all sentences that contain a colour. We first create a vector of colour names, and then turn it into a single regular expression:

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
## [1] "red|orange|yellow|green|blue|purple"

Now we can select the sentences that contain a colour, and then extract the colour to figure out which one it is:

has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
head(matches)
## [1] "blue" "blue" "red"  "red"  "red"  "blue"

The above example has an error.

has_colour
##  [1] "Glue the sheet to the dark blue background."       
##  [2] "Two blue fish swam in the tank."                   
##  [3] "The colt reared and threw the tall rider."         
##  [4] "The wide road shimmered in the hot sun."           
##  [5] "See the cat glaring at the scared mouse."          
##  [6] "A wisp of cloud hung in the blue air."             
##  [7] "Leaves turn brown and yellow in the fall."         
##  [8] "He ordered peach pie with ice cream."              
##  [9] "Pure bred poodles have curls."                     
## [10] "The spot on the blotter was made by green ink."    
## [11] "Mud was spattered on the front of his white shirt."
## [12] "The sofa cushion is red and of light weight."      
## [13] "The sky that morning was clear and bright blue."   
## [14] "Torn scraps littered the stone floor."             
## [15] "The doctor cured him with these pills."            
## [16] "The new girl was fired today at noon."             
## [17] "The third act was dull and tired the players."     
## [18] "A blue crane is a tall wading bird."               
## [19] "Lire wires should be kept covered."                
## [20] "It is hard to erase blue or red ink."              
## [21] "The wreck occurred by the bank on Main Street."    
## [22] "The lamp shone with a steady green flame."         
## [23] "The box is held by a bright red snapper."          
## [24] "The prince ordered his head chopped off."          
## [25] "The houses are built of red clay bricks."          
## [26] "The red tape bound the smuggled food."             
## [27] "Nine men were hired to dig the ruins."             
## [28] "The flint sputtered and lit a pine torch."         
## [29] "Hedge apples may stain your hands green."          
## [30] "The old pan was covered with hard fudge."          
## [31] "The plant grew large and green in the window."     
## [32] "The store walls were lined with colored frocks."   
## [33] "The purple tie was ten years old."                 
## [34] "Bathe and relax in the cool green grass."          
## [35] "The clan gathered on each dull night."             
## [36] "The lake sparkled in the red hot sun."             
## [37] "Mark the spot with a sign painted red."            
## [38] "Smoke poured out of every crack."                  
## [39] "Serve the hot rum to the tired heroes."            
## [40] "The couch cover and hall drapes were blue."        
## [41] "He offered proof in the form of a lsrge chart."    
## [42] "A man in a blue sweater sat at the desk."          
## [43] "The sip of tea revives his tired friend."          
## [44] "The door was barred, locked, and bolted as well."  
## [45] "A thick coat of black paint covered all."          
## [46] "The small red neon lamp went out."                 
## [47] "Paint the sockets in the wall dull green."         
## [48] "Wake and rise, and step into the green outdoors."  
## [49] "The green light in the brown box flickered."       
## [50] "He put his last cartridge into the gun and fired." 
## [51] "The ram scared the school children off."           
## [52] "Tear a thin sheet from the yellow pad."            
## [53] "Dimes showered down from all sides."               
## [54] "The sky in the west is tinged with orange red."    
## [55] "The red paper brightened the dim stage."           
## [56] "The hail pattered on the burnt brown grass."       
## [57] "The big red apple fell to the ground."

Can you spot the issue with the regular expression?

Note that str_extract() only extracts the first match. We can see that most easily by first selecting all the sentences that have more than 1 match:

more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)
str_extract(more, colour_match)
## [1] "blue"   "green"  "orange"

This is a common pattern for stringr functions, because working with a single match allows you to use much simpler data structures. To get all matches, use str_extract_all(). It returns a list:

str_extract_all(more, colour_match)
## [[1]]
## [1] "blue" "red" 
## 
## [[2]]
## [1] "green" "red"  
## 
## [[3]]
## [1] "orange" "red"

You’ll learn more about lists in lists and iteration.

If you use simplify = TRUE, str_extract_all() will return a matrix with short matches expanded to the same length as the longest:

str_extract_all(more, colour_match, simplify = TRUE)
##      [,1]     [,2] 
## [1,] "blue"   "red"
## [2,] "green"  "red"
## [3,] "orange" "red"
x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
##      [,1] [,2] [,3]
## [1,] "a"  ""   ""  
## [2,] "a"  "b"  ""  
## [3,] "a"  "b"  "c"

14.4.2.1 Exercises

  1. In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a colour. Modify the regex to fix the problem.
It matches “flickered” because it matches “red”. The problem is that the previous pattern will match any word with the name of a color inside it. We want to only match colors in which the entire word is the name of the color. We can do this by adding a \b (to indicate a word boundary) before and after the pattern:

colour_match2 <- str_c("\\b(", str_c(colours, collapse = "|"), ")\\b")
colour_match2
## [1] "\\b(red|orange|yellow|green|blue|purple)\\b"
more2 <- sentences[str_count(sentences, colour_match) > 1]

str_view_all(more2, colour_match2, match = TRUE)
  1. From the Harvard sentences data, extract:
  • The first word from each sentence.

Finding the first word in each sentence requires defining what pattern constitutes a word. For the purposes of this question, I’ll consider a word to be any contiguous set of letters. Since str_extract() will extract the first match, if it is provided a regular expression for words, it will return the first word.

str_extract(sentences, "[A-Za-z]+") %>% head()
## [1] "The"   "Glue"  "It"    "These" "Rice"  "The"

However, the third sentence begins with “It’s”. To catch this, I’ll change the regular expression to require the string to begin with a letter, but allow for a subsequent apostrophe.

str_extract(sentences, "[A-Za-z][A-Za-z']*") %>% head()
## [1] "The"   "Glue"  "It's"  "These" "Rice"  "The"
  • All words ending in ing.
pattern <- "\\b[A-Za-z]+ing\\b"

sentences_with_ing <- str_detect(sentences, pattern)
unique(unlist(str_extract_all(sentences[sentences_with_ing], pattern))) %>%
  head()
## [1] "spring"  "evening" "morning" "winding" "living"  "king"
  • All plurals.

Finding all plurals cannot be correctly accomplished with regular expressions alone. Finding plural words would at least require morphological information about words in the language. See WordNet for a resource that would do that. However, identifying words that end in “s” and have more than three characters, in order to remove “as”, “is”, “gas”, etc., is a reasonable heuristic.

unique(unlist(str_extract_all(sentences, "\\b[A-Za-z]{3,}s\\b"))) %>%
  head()
## [1] "planks" "days"   "bowls"  "lemons" "makes"  "hogs"

14.4.3: Grouped matches

Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching. You can also use parentheses to extract parts of a complex match. For example, imagine we want to extract nouns from the sentences. As a heuristic, we’ll look for any word that comes after “a” or “the”. Defining a “word” in a regular expression is a little tricky, so here I use a simple approximation: a sequence of at least one character that isn’t a space.

noun <- "(a|the) ([^ ]+)"

has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>% 
  str_extract(noun)
##  [1] "the smooth" "the sheet"  "the depth"  "a chicken"  "the parked"
##  [6] "the sun"    "the huge"   "the ball"   "the woman"  "a helps"

str_extract() gives us the complete match; str_match() gives each individual component. Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:

has_noun %>% 
  str_match(noun)
##       [,1]         [,2]  [,3]     
##  [1,] "the smooth" "the" "smooth" 
##  [2,] "the sheet"  "the" "sheet"  
##  [3,] "the depth"  "the" "depth"  
##  [4,] "a chicken"  "a"   "chicken"
##  [5,] "the parked" "the" "parked" 
##  [6,] "the sun"    "the" "sun"    
##  [7,] "the huge"   "the" "huge"   
##  [8,] "the ball"   "the" "ball"   
##  [9,] "the woman"  "the" "woman"  
## [10,] "a helps"    "a"   "helps"

(Unsurprisingly, our heuristic for detecting nouns is poor, and also picks up adjectives like smooth and parked.)

If your data is in a tibble, it’s often easier to use tidyr::extract(). It works like str_match() but requires you to name the matches, which are then placed in new columns:

tibble(sentence = sentences) %>% 
  tidyr::extract(
    sentence, c("article", "noun"), "(a|the) ([^ ]+)", 
    remove = FALSE
  )
## # A tibble: 720 x 3
##    sentence                                    article noun   
##    <chr>                                       <chr>   <chr>  
##  1 The birch canoe slid on the smooth planks.  the     smooth 
##  2 Glue the sheet to the dark blue background. the     sheet  
##  3 It's easy to tell the depth of a well.      the     depth  
##  4 These days a chicken leg is a rare dish.    a       chicken
##  5 Rice is often served in round bowls.        <NA>    <NA>   
##  6 The juice of lemons makes fine punch.       <NA>    <NA>   
##  7 The box was thrown beside the parked truck. the     parked 
##  8 The hogs were fed chopped corn and garbage. <NA>    <NA>   
##  9 Four hours of steady work faced us.         <NA>    <NA>   
## 10 Large size in stockings is hard to sell.    <NA>    <NA>   
## # … with 710 more rows

Like str_extract(), if you want all matches for each string, you’ll need str_match_all().
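
As a small sketch of that, here is str_match_all() applied to a single sentence with the same noun pattern. It returns a list with one matrix per input string, and each row of the matrix is one match.

```r
library(stringr)

noun <- "(a|the) ([^ ]+)"
x <- "It's easy to tell the depth of a well."

# One matrix per input string; take the first (and only) one.
str_match_all(x, noun)[[1]]
##      [,1]        [,2]  [,3]   
## [1,] "the depth" "the" "depth"
## [2,] "a well."   "a"   "well."
```

Note that the second “noun” keeps its trailing period, because the pattern only excludes spaces.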

14.4.3.1 Exercises

  1. Find all words that come after a “number” like “one”, “two”, “three”, etc. Pull out both the number and the word.
numword <- "\\b(one|two|three|four|five|six|seven|eight|nine|ten) +(\\w+)"
sentences[str_detect(sentences, numword)] %>%
  str_extract(numword)
##  [1] "seven books"   "two met"       "two factors"   "three lists"  
##  [5] "seven is"      "two when"      "ten inches"    "one war"      
##  [9] "one button"    "six minutes"   "ten years"     "two shares"   
## [13] "two distinct"  "five cents"    "two pins"      "five robins"  
## [17] "four kinds"    "three story"   "three inches"  "six comes"    
## [21] "three batches" "two leaves"
  2. Find all contractions. Separate out the pieces before and after the apostrophe.
contraction <- "([A-Za-z]+)'([A-Za-z]+)"
sentences[str_detect(sentences, contraction)] %>%
  str_extract(contraction) %>%
  str_split("'")
## [[1]]
## [1] "It" "s" 
## 
## [[2]]
## [1] "man" "s"  
## 
## [[3]]
## [1] "don" "t"  
## 
## [[4]]
## [1] "store" "s"    
## 
## [[5]]
## [1] "workmen" "s"      
## 
## [[6]]
## [1] "Let" "s"  
## 
## [[7]]
## [1] "sun" "s"  
## 
## [[8]]
## [1] "child" "s"    
## 
## [[9]]
## [1] "king" "s"   
## 
## [[10]]
## [1] "It" "s" 
## 
## [[11]]
## [1] "don" "t"  
## 
## [[12]]
## [1] "queen" "s"    
## 
## [[13]]
## [1] "don" "t"  
## 
## [[14]]
## [1] "pirate" "s"     
## 
## [[15]]
## [1] "neighbor" "s"
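
Since the pattern already has a group on each side of the apostrophe, another option is str_match(), which returns the pieces as matrix columns instead of a list. This is a sketch on two example strings (the second one is made up), not a replacement for the answer above.

```r
library(stringr)

contraction <- "([A-Za-z]+)'([A-Za-z]+)"
# The first string is from the Harvard sentences; the second is hypothetical.
x <- c("It's easy to tell the depth of a well.", "We don't get much money.")

# Column 1 is the full match, columns 2 and 3 are the two groups.
str_match(x, contraction)
##      [,1]    [,2]  [,3]
## [1,] "It's"  "It"  "s" 
## [2,] "don't" "don" "t"
```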

14.4.4: Replacing matches

str_replace() and str_replace_all() allow you to replace matches with new strings. The simplest use is to replace a pattern with a fixed string:

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
## [1] "-pple"  "p-ar"   "b-nana"
str_replace_all(x, "[aeiou]", "-")
## [1] "-ppl-"  "p--r"   "b-n-n-"

With str_replace_all() you can perform multiple replacements by supplying a named vector:

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
## [1] "one house"    "two cars"     "three people"

Instead of replacing with a fixed string you can use backreferences to insert components of the match. In the following code, I flip the order of the second and third words.

sentences %>% 
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
  head(5)
## [1] "The canoe birch slid on the smooth planks." 
## [2] "Glue sheet the to the dark blue background."
## [3] "It's to easy tell the depth of a well."     
## [4] "These a days chicken leg is a rare dish."   
## [5] "Rice often is served in round bowls."
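
The same idea on a smaller, made-up input: each \\1, \\2 in the replacement refers to the corresponding parenthesised group in the pattern. The name string here is only an illustration.

```r
library(stringr)

# Hypothetical "Last, First" string; group 1 is the last name, group 2 the first.
x <- "Wickham, Hadley"

str_replace(x, "(\\w+), (\\w+)", "\\2 \\1")
## [1] "Hadley Wickham"
```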

14.4.4.1 Exercises

  1. Replace all forward slashes in a string with backslashes.
str_replace_all("past/present/future", "/", "\\\\")
## [1] "past\\present\\future"
  2. Implement a simple version of str_to_lower() using str_replace_all().
replacements <- c("A" = "a", "B" = "b", "C" = "c", "D" = "d", "E" = "e",
                  "F" = "f", "G" = "g", "H" = "h", "I" = "i", "J" = "j", 
                  "K" = "k", "L" = "l", "M" = "m", "N" = "n", "O" = "o", 
                  "P" = "p", "Q" = "q", "R" = "r", "S" = "s", "T" = "t", 
                  "U" = "u", "V" = "v", "W" = "w", "X" = "x", "Y" = "y", 
                  "Z" = "z")
lower_words <- str_replace_all(words, pattern = replacements)
head(lower_words)
## [1] "a"        "able"     "about"    "absolute" "accept"   "account"
  3. Switch the first and last letters in words. Which of those strings are still words?
swapped <- str_replace_all(words, "^([A-Za-z])(.*)([A-Za-z])$", "\\3\\2\\1")
intersect(swapped, words)
##  [1] "a"          "america"    "area"       "dad"        "dead"      
##  [6] "lead"       "read"       "depend"     "god"        "educate"   
## [11] "else"       "encourage"  "engine"     "europe"     "evidence"  
## [16] "example"    "excuse"     "exercise"   "expense"    "experience"
## [21] "eye"        "dog"        "health"     "high"       "knock"     
## [26] "deal"       "level"      "local"      "nation"     "on"        
## [31] "non"        "no"         "rather"     "dear"       "refer"     
## [36] "remember"   "serious"    "stairs"     "test"       "tonight"   
## [41] "transport"  "treat"      "trust"      "window"     "yesterday"
# Or use the POSIX character class for letters ([[:alpha:]]):

swapped2 <- str_replace_all(words, "^([[:alpha:]])(.*)([[:alpha:]])$", "\\3\\2\\1")
intersect(swapped2, words)
##  [1] "a"          "america"    "area"       "dad"        "dead"      
##  [6] "lead"       "read"       "depend"     "god"        "educate"   
## [11] "else"       "encourage"  "engine"     "europe"     "evidence"  
## [16] "example"    "excuse"     "exercise"   "expense"    "experience"
## [21] "eye"        "dog"        "health"     "high"       "knock"     
## [26] "deal"       "level"      "local"      "nation"     "on"        
## [31] "non"        "no"         "rather"     "dear"       "refer"     
## [36] "remember"   "serious"    "stairs"     "test"       "tonight"   
## [41] "transport"  "treat"      "trust"      "window"     "yesterday"
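
A note on the forward-slash exercise above: the printed result shows doubled backslashes only because R escapes them when printing. A quick check, assuming the same input string, is to use writeLines() and str_length() to look at the actual contents:

```r
library(stringr)

y <- str_replace_all("past/present/future", "/", "\\\\")

writeLines(y)    # shows the raw string, with single backslashes
## past\present\future
str_length(y)    # same length as the input: each "/" became one "\"
## [1] 19
```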

14.4.5: Splitting

Use str_split() to split a string up into pieces. For example, we could split sentences into words:

sentences %>%
  head(5) %>% 
  str_split(" ")
## [[1]]
## [1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
## [8] "planks."
## 
## [[2]]
## [1] "Glue"        "the"         "sheet"       "to"          "the"        
## [6] "dark"        "blue"        "background."
## 
## [[3]]
## [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."
## 
## [[4]]
## [1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
## [8] "rare"    "dish."  
## 
## [[5]]
## [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls."

Because each component might contain a different number of pieces, this returns a list. If you’re working with a length-1 vector, the easiest thing is to just extract the first element of the list:

"a|b|c|d" %>% 
  str_split("\\|") %>% 
  .[[1]]
## [1] "a" "b" "c" "d"

Otherwise, like the other stringr functions that return a list, you can use simplify = TRUE to return a matrix:

sentences %>%
  head(5) %>% 
  str_split(" ", simplify = TRUE)
##      [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]     [,8]         
## [1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth" "planks."    
## [2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"   "background."
## [3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"     "a"          
## [4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"      "rare"       
## [5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls." ""           
##      [,9]   
## [1,] ""     
## [2,] ""     
## [3,] "well."
## [4,] "dish."
## [5,] ""

You can also request a maximum number of pieces:

fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
##      [,1]      [,2]    
## [1,] "Name"    "Hadley"
## [2,] "Country" "NZ"    
## [3,] "Age"     "35"
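
The n argument matters most when the separator can also appear inside a value. In this sketch the "Time" field is made up for illustration; with n = 2 only the first ": " is used as a split point and the rest of the string is kept intact.

```r
library(stringr)

# Hypothetical fields; the first value itself contains ": ".
fields <- c("Time: 10: 30", "Name: Hadley")

str_split(fields, ": ", n = 2, simplify = TRUE)
##      [,1]   [,2]    
## [1,] "Time" "10: 30"
## [2,] "Name" "Hadley"
```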

Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word boundary()s:

x <- "This is a sentence.  This is another sentence."
str_view_all(x, boundary("word"))
str_split(x, " ")[[1]]
## [1] "This"      "is"        "a"         "sentence." ""          "This"     
## [7] "is"        "another"   "sentence."
str_split(x, boundary("word"))[[1]]
## [1] "This"     "is"       "a"        "sentence" "This"     "is"       "another" 
## [8] "sentence"
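
The other boundary types work the same way. A minimal sketch on the same string: boundary("sentence") should give two pieces here, and boundary() can also be passed to other stringr functions such as str_count().

```r
library(stringr)

x <- "This is a sentence.  This is another sentence."

# Split into sentences rather than words.
pieces <- str_split(x, boundary("sentence"))[[1]]
length(pieces)
## [1] 2

# Count words using the same boundary rules as the split above.
str_count(x, boundary("word"))
## [1] 8
```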

14.4.5.1 Exercises

  1. Split up a string like “apples, pears, and bananas” into individual components.
x <- c("apples, pears, and bananas")
str_split(x, ", +(and +)?")[[1]]
## [1] "apples"  "pears"   "bananas"
  2. Why is it better to split up by boundary("word") than " "?

Splitting by boundary("word") is a more sophisticated way to split a string into words. It recognizes non-space punctuation that separates words, and it removes punctuation while keeping internal non-letter characters that are part of a word, e.g., “can’t”. See the ICU website for a description of the rules used to determine word boundaries.

Consider this sentence from the official Unicode Report on word boundaries,

sentence <- "The quick (“brown”) fox can’t jump 32.3 feet, right?"

Splitting the string on spaces groups the punctuation with the words,

str_split(sentence, " ")
## [[1]]
## [1] "The"       "quick"     "(“brown”)" "fox"       "can’t"     "jump"     
## [7] "32.3"      "feet,"     "right?"

However, splitting the string using boundary("word") correctly removes punctuation, while not separating “32.3” and “can’t”,

str_split(sentence, boundary("word"))
## [[1]]
## [1] "The"   "quick" "brown" "fox"   "can’t" "jump"  "32.3"  "feet"  "right"
  3. What does splitting with an empty string ("") do? Experiment, and then read the documentation.
str_split("ab. cd|agt", "")[[1]]
##  [1] "a" "b" "." " " "c" "d" "|" "a" "g" "t"

It splits the string into individual characters.
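
According to the stringr documentation, an empty pattern is equivalent to boundary("character"). A short check on the same string:

```r
library(stringr)

x <- "ab. cd|agt"

by_empty <- str_split(x, "")[[1]]
by_char  <- str_split(x, boundary("character"))[[1]]

identical(by_empty, by_char)
## [1] TRUE
length(by_empty) == str_length(x)   # one piece per character
## [1] TRUE
```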

14.7: stringi

stringr is built on top of the stringi package. stringr is useful when you’re learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation functions. stringi, on the other hand, is designed to be comprehensive. It contains almost every function you might ever need: stringi has 250 functions to stringr’s 49.

If you find yourself struggling to do something in stringr, it’s worth taking a look at stringi. The packages work very similarly, so you should be able to translate your stringr knowledge in a natural way. The main difference is the prefix: str_ vs. stri_.
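
As a rough sketch of that translation, here are two stringr calls next to what I believe are their closest stringi counterparts. The pairing is my own reading of the two packages, not an official mapping. One caveat: stringi writes backreferences in replacements as $1 rather than \\1, so replacement strings do not always carry over unchanged.

```r
library(stringr)
library(stringi)

x <- c("apple", "pear", "banana")

# Detecting a pattern: str_detect() vs. stri_detect_regex()
str_detect(x, "an")
## [1] FALSE FALSE  TRUE
stri_detect_regex(x, "an")
## [1] FALSE FALSE  TRUE

# Replacing all matches: str_replace_all() vs. stri_replace_all_regex()
str_replace_all(x, "[aeiou]", "-")
## [1] "-ppl-"  "p--r"   "b-n-n-"
stri_replace_all_regex(x, "[aeiou]", "-")
## [1] "-ppl-"  "p--r"   "b-n-n-"
```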