String processing with the stringr package (Wickham 2019b)

library(rvest)
## Loading required package: xml2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Newsurl <- "https://edition.cnn.com/cnn10"
articlebody <- read_html(Newsurl) %>% 
  html_nodes(".cd__description") %>% html_text()
articlebody
## [1] "Today's show discusses response plans, closures, market drops, and testing related to the new coronavirus. We also report on the rescue of Rockstar Freddy. Transcript and Newsquiz"
oneword <- strsplit(articlebody, split = " ")[[1]]
oneword
##  [1] "Today's"      "show"         "discusses"    "response"     "plans,"      
##  [6] "closures,"    "market"       "drops,"       "and"          "testing"     
## [11] "related"      "to"           "the"          "new"          "coronavirus."
## [16] "We"           "also"         "report"       "on"           "the"         
## [21] "rescue"       "of"           "Rockstar"     "Freddy."      "Transcript"  
## [26] "and"          "Newsquiz"
# find the position of "coronavirus"
grep("coronavirus", oneword) # index of each matching element
## [1] 15
grepl("coronavirus", oneword) # TRUE/FALSE
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE
# append "weather" to each element of oneword
paste(oneword, "weather")
##  [1] "Today's weather"      "show weather"         "discusses weather"   
##  [4] "response weather"     "plans, weather"       "closures, weather"   
##  [7] "market weather"       "drops, weather"       "and weather"         
## [10] "testing weather"      "related weather"      "to weather"          
## [13] "the weather"          "new weather"          "coronavirus. weather"
## [16] "We weather"           "also weather"         "report weather"      
## [19] "on weather"           "the weather"          "rescue weather"      
## [22] "of weather"           "Rockstar weather"     "Freddy. weather"     
## [25] "Transcript weather"   "and weather"          "Newsquiz weather"
# replace every ", " with " %and% "
gsub(x = articlebody, pattern = ", ", replacement = " %and% ")
## [1] "Today's show discusses response plans %and% closures %and% market drops %and% and testing related to the new coronavirus. We also report on the rescue of Rockstar Freddy. Transcript and Newsquiz"
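
The calls above all use base R. Since this section is about stringr, the following is a minimal sketch of the same steps using what are, to the best of my knowledge, the closest stringr counterparts. It assumes the stringr package is installed, and it uses a shortened, hard-coded sentence in place of the scraped text so that it runs offline. Note that stringr functions take the string first and the pattern second, the reverse of grep() and gsub().

```r
library(stringr)  # assumption: the stringr package is installed

# Shortened stand-in for the scraped text (hypothetical, not the live page)
articlebody <- "Today's show discusses response plans, closures, and testing related to the new coronavirus."

oneword <- str_split(articlebody, " ")[[1]]         # counterpart of strsplit()
idx     <- str_which(oneword, "coronavirus")        # counterpart of grep(): indices
hit     <- str_detect(oneword, "coronavirus")       # counterpart of grepl(): TRUE/FALSE
pasted  <- str_c(oneword, "weather", sep = " ")     # counterpart of paste()
swapped <- str_replace_all(articlebody, ", ", " %and% ")  # counterpart of gsub()

print(idx)
print(swapped)
```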

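Regular Expression

String processing is usually combined with regular expressions.

A regular expression is a pattern that describes the common structure shared by a set of strings, for example the fact that mobile phone numbers are always 10 digits long. In virtually any programming language, string matching, string replacement, and related operations make use of regular expressions. The syntax differs slightly from one language to another, but the core concepts are the same.

Some common examples: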

Type                   Regular expression              Example
Integer                [0-9]+                          5815
Floating-point number  [0-9]+\.[0-9]+                  58.15
English letters only   [A-Za-z]+                       CGUIM
Email                  [a-zA-Z0-9_]+@[a-zA-Z0-9._]+
URL                    http://[a-zA-Z0-9./_]+          http://www.yahoo.com.tw/
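
As a quick check of a few of these patterns, here is a small sketch in base R. The anchors ^ and $ are added so that the whole string has to fit the pattern, and the variable names are only illustrative. It also shows why the decimal point has to be escaped: an unescaped . matches any character, not just a literal dot.

```r
# ^ and $ force the whole string to fit the pattern
is_int      <- grepl("^[0-9]+$", c("5815", "58a15"))
is_float    <- grepl("^[0-9]+\\.[0-9]+$", c("58.15", "58x15"))
loose_float <- grepl("^[0-9]+.[0-9]+$", "58x15")   # unescaped dot accepts "x"
is_alpha    <- grepl("^[A-Za-z]+$", c("CGUIM", "CGU1M"))

print(is_int)
print(is_float)
print(loose_float)
print(is_alpha)
```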

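Commonly used regular expression syntax falls into the following groups:

  1. Quantifiers
  • *: zero or more times
  • +: one or more times
  • ?: zero or one time
  • {n}: exactly n times
  • {n,}: n or more times
  • {n,m}: between n and m times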
stringVector <- c("a","abc","ac","abbc","abbbc","abbbbc")
grep("ab*", stringVector, value=TRUE) # value=TRUE returns the matching elements instead of their indices
## [1] "a"      "abc"    "ac"     "abbc"   "abbbc"  "abbbbc"
grep("ab+", stringVector, value=TRUE)
## [1] "abc"    "abbc"   "abbbc"  "abbbbc"
grep("ab?c", stringVector, value=TRUE)
## [1] "abc" "ac"
grep("a{1}b{1}c{1}", stringVector, value=TRUE)
## [1] "abc"
grep("ab{2}c", stringVector, value=TRUE)
## [1] "abbc"
grep("ab{3}", stringVector, value=TRUE) == 
  grep("ab{3,}", stringVector, value=TRUE)
## [1] TRUE TRUE
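
The last comparison returns TRUE TRUE because grep() looks for the pattern anywhere inside each element, so "ab{3}" is also found inside "abbbbc". The following sketch, with illustrative variable names, makes that visible by extracting the matched part, and also covers the {n,m} form, which is not used above.

```r
stringVector <- c("a", "abc", "ac", "abbc", "abbbc", "abbbbc")

# {n,m}: two or three b's, followed by c
bounded <- grep("ab{2,3}c", stringVector, value = TRUE)

# Show which part of each element "ab*" actually matched
matched <- regmatches(stringVector, regexpr("ab*", stringVector))

# With a trailing "c", {3} and {3,} no longer select the same elements
exactly3 <- grep("ab{3}c", stringVector, value = TRUE)
atleast3 <- grep("ab{3,}c", stringVector, value = TRUE)

print(bounded)
print(matched)
print(exactly3)
print(atleast3)
```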
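  2. Anchors and boundaries
  • ^: matches at the start of the string
  • $: matches at the end of the string
  • \b: matches at a word boundary, that is, at the start or end of a word
  • \B: matches at any position that is not a word boundary, for example between two letters inside a word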
stringVector <- c("abc","bcd","cde","def","abc def","bcdefg abc","blablammmc","k a")
grep("^bc",stringVector,value=T)
## [1] "bcd"        "bcdefg abc"
grep("^b",stringVector,value=T)
## [1] "bcd"        "bcdefg abc" "blablammmc"
grep("bc$",stringVector,value=T)
## [1] "abc"        "bcdefg abc"
grep("c$",stringVector,value=T)
## [1] "abc"        "bcdefg abc" "blablammmc"
grep("\\ba",stringVector,value=T) # the backslash must be doubled inside an R string
## [1] "abc"        "abc def"    "bcdefg abc" "k a"
grep("\\Ba",stringVector,value=T)
## [1] "blablammmc"
grep("\\bde",stringVector,value=T)
## [1] "def"     "abc def"
grep("\\Bde",stringVector,value=T)
## [1] "cde"        "bcdefg abc"
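
A common use of these position markers is validation: without them, a pattern only has to occur somewhere in the string. The sketch below uses made-up inputs to contrast an unanchored and an anchored phone pattern, and a plain and a word-bounded search.

```r
phones <- c("0988123456", "x0988123456", "09881234567")

# Unanchored: the 10-digit pattern is found somewhere in every element
anywhere <- grepl("09[0-9]{8}", phones)
# Anchored: the whole string must be exactly 09 followed by 8 digits
whole    <- grepl("^09[0-9]{8}$", phones)

words <- c("cat", "concatenate", "a cat sat")
# Plain search also hits "cat" inside "concatenate"
substring_hit <- grepl("cat", words)
# \b on both sides keeps only "cat" as a separate word
word_hit      <- grepl("\\bcat\\b", words)

print(anywhere)
print(whole)
print(substring_hit)
print(word_hit)
```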
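  3. Operators
  • .: matches any single character, including a space
  • […]: matches one character from the list (…); a hyphen denotes a range, as in [A-Z], [a-z], [0-9]
  • [^…]: matches one character that is not in the list (…)
  • \: escapes a special character so that it is matched literally
  • |: alternation (or)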
stringVector <- c("03-1234567","02-87654321","0988123456",
                  "07-118","0-888","03548445",
                  "csim@mail.cgu.edu.tw","csim@.","csim@","@gms.",
                  "http://www.is.cgu.edu.tw/","https://www.yahoo.com.tw/")
grep("[0-9]{2}-[0-9]{7,8}",stringVector,value=T)
## [1] "03-1234567"  "02-87654321"
grep("09[0-9]{8}",stringVector,value=T) # cellphone
## [1] "0988123456"
grep("03-|02-",stringVector,value=T) # area code 03 (e.g. Hualien) or 02 (Taipei)
## [1] "03-1234567"  "02-87654321"
grep("[A-Za-z.]{2,}@[A-Za-z.]{2,}",stringVector,value=T) # email
## [1] "csim@mail.cgu.edu.tw"
grep("http:|https:",stringVector,value=T) # url
## [1] "http://www.is.cgu.edu.tw/" "https://www.yahoo.com.tw/"
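
The examples above do not exercise the negated list [^…] or the escape character, so here is a small sketch of both, using made-up inputs and illustrative variable names. The fixed = TRUE argument of grepl() is shown as an alternative to escaping when the pattern should be taken literally.

```r
x <- c("03-1234567", "0988123456", "csim@mail.cgu.edu.tw", "07-118")

# [^0-9-] matches any character that is neither a digit nor a hyphen,
# so negating the result keeps elements made only of digits and hyphens
digits_and_hyphens_only <- !grepl("[^0-9-]", x)

y <- c("abc", "a.c", "ac")
any_char    <- grepl("a.c", y)                 # . matches any single character
literal_dot <- grepl("a\\.c", y)               # \\. matches a literal dot
fixed_dot   <- grepl("a.c", y, fixed = TRUE)   # pattern treated as plain text

print(digits_and_hyphens_only)
print(any_char)
print(literal_dot)
print(fixed_dot)
```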
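  4. Character classes and shorthands (POSIX classes such as [:lower:] are only valid inside a bracket expression, so in a pattern they are written [[:lower:]])
  • \d: a digit, same as [0-9]
  • \D: a non-digit, same as [^0-9]
  • [:lower:]: a lowercase letter, same as [a-z]
  • [:upper:]: an uppercase letter, same as [A-Z]
  • [:alpha:]: any letter, same as [[:lower:][:upper:]] or [A-Za-z]
  • [:alnum:]: any letter or digit, same as [[:alpha:][:digit:]] or [A-Za-z0-9]
  • \w: a letter, digit, or underscore, same as [[:alnum:]_] or [A-Za-z0-9_]
  • \W: anything other than a letter, digit, or underscore, same as [^A-Za-z0-9_]
  • [:blank:]: a blank character, that is, a space or a tab
  • \s: a whitespace character (space, tab, newline, and similar)
  • \S: a non-whitespace character
  • [:punct:]: a punctuation character: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~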
stringVector<-c("03-2118800","02-23123456","0988123456",
                "07-118","0-888","csim@mail.cgu.edu.tw","http://www.is.cgu.edu.tw/")
grep("\\d{2}-\\d{7,8}",stringVector,value=T)
## [1] "03-2118800"  "02-23123456"
grep("\\d{10}",stringVector,value=T)
## [1] "0988123456"
grep("\\w+@[a-zA-Z0-9._]+",stringVector,value=T)
## [1] "csim@mail.cgu.edu.tw"
grep("\\w{2,}@[a-zA-Z0-9._]{2,}",stringVector,value=T)
## [1] "csim@mail.cgu.edu.tw"
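
To round off the character classes, the sketch below applies them to a few made-up inputs: stripping the trailing punctuation seen in the word list earlier, pulling out runs of digits, and showing why the range [A-z] is not a safe substitute for [A-Za-z]. The last point uses perl = TRUE so that the range is evaluated by character code, where A to z also covers the characters [ \ ] ^ _ and the backtick; without that argument the result may depend on the locale.

```r
words <- c("plans,", "coronavirus.", "Today's")
# POSIX classes go inside a bracket expression: [[:punct:]]
cleaned <- gsub("[[:punct:]]", "", words)

msg <- "Call 03-2118800 or 0988123456"
# gregexpr() finds every run of digits; regmatches() extracts them
digit_runs <- regmatches(msg, gregexpr("\\d+", msg))[[1]]

# [A-z] also accepts the underscore, [A-Za-z] does not
loose_range  <- grepl("^[A-z]+$", "a_b", perl = TRUE)
strict_range <- grepl("^[A-Za-z]+$", "a_b")

print(cleaned)
print(digit_runs)
print(loose_range)
print(strict_range)
```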