This vignette shows how to measure asymptotic performance of regular expression engines. First we define a grid of data sizes.
(subject.size.vec <- unique(as.integer(10^seq(0,3.5,l=100))))
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#> [16] 17 18 20 22 23 25 28 30 33 35 38 42 45 49 53
#> [31] 58 63 68 74 81 87 95 103 112 121 132 143 155 168 183
#> [46] 198 215 233 253 275 298 323 351 380 413 448 486 527 572 620
#> [61] 673 730 792 859 932 1011 1097 1190 1291 1401 1519 1648 1788 1940 2104
#> [76] 2283 2477 2687 2915 3162
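The construction above is worth unpacking: seq(0, 3.5, l=100) gives 100 exponents evenly spaced on the log scale, 10^ maps them to sizes between 1 and about 3162, as.integer() truncates to whole numbers, and unique() drops the duplicates that truncation creates among the small sizes. A standalone sketch of the same expression, step by step:

```r
# Re-create the size grid step by step (same expression as above).
exponents <- seq(0, 3.5, l=100)        # 100 exponents, evenly spaced on log scale
raw.sizes <- as.integer(10^exponents)  # truncate each 10^exponent to an integer
sizes <- unique(raw.sizes)             # drop duplicated small sizes
c(first=sizes[1], last=sizes[length(sizes)])
length(raw.sizes) - length(sizes)      # how many duplicates were removed
```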
Then we define a list of R expressions, each of which runs a regex pattern on a subject. Note that neither subject nor pattern is defined yet, and that is OK: these expressions will be evaluated later, so we can define subject and pattern to depend on the data size. The list is called backtrackers because these are the regex engines which use backtracking, and therefore have worst-case exponential time complexity.
(backtrackers <- c(
  if(requireNamespace("stringi"))atime::atime_grid(
    ICU=stringi::stri_match(subject, regex=pattern)),
  atime::atime_grid(
    PCRE=regexpr(pattern, subject, perl=TRUE),
    TRE=regexpr(pattern, subject, perl=FALSE))))
#> $ICU
#> stringi::stri_match(subject, regex = pattern)
#>
#> $PCRE
#> regexpr(pattern, subject, perl = TRUE)
#>
#> $TRE
#> regexpr(pattern, subject, perl = FALSE)
#>
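To get a feel for the exponential claim before running the full benchmark, the sketch below (a hypothetical helper, not part of the vignette's benchmark; elapsed times are machine-dependent) times a single PCRE match on successively larger pathological inputs, using the same subject/pattern construction that appears in the setup further down.

```r
# Hypothetical helper: time one PCRE match on a pathological input of size N.
time.pcre <- function(N){
  subject <- paste(rep("a", N), collapse="")
  pattern <- paste(rep(c("(a)?", "\\1"), each=N), collapse="")
  system.time(regexpr(pattern, subject, perl=TRUE))[["elapsed"]]
}
sapply(10:15, time.pcre)  # elapsed seconds, growing roughly geometrically in N
```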
The code below executes the setup expression for every value of N, defining subject and pattern to be a pathological combination that triggers the worst-case time complexity.
backtrackers.result <- atime::atime(
  N=subject.size.vec,
  setup={
    subject <- paste(rep("a", N), collapse="")
    pattern <- paste(rep(c("(a)?", "\\1"), each=N), collapse="")
  },
  expr.list=backtrackers)
backtrackers.best <- atime::references_best(backtrackers.result)
plot(backtrackers.best)
#> log-10 transformation introduced infinite values.
The plot above shows that ICU/PCRE/TRE are all exponential in N (subject/pattern size) when the pattern contains backreferences.
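To make the pathological combination concrete, here is what the setup produces for a small size (a standalone sketch with N=3): each of the N optional groups may or may not consume an "a", so a backtracking engine can explore up to 2^N combinations before giving up.

```r
# Same construction as in the setup above, shown for N = 3.
N <- 3
subject <- paste(rep("a", N), collapse="")
pattern <- paste(rep(c("(a)?", "\\1"), each=N), collapse="")
cat(subject, "\n")  # aaa
cat(pattern, "\n")  # (a)?(a)?(a)?\1\1\1
```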
The code below creates a list with a fourth regex engine, RE2, which uses a different algorithm (not backtracking), so its worst-case time complexity is polynomial rather than exponential, and therefore much faster.
(all.exprs <- c(
  if(requireNamespace("re2"))atime::atime_grid(
    RE2=re2::re2_match(subject, pattern)),
  backtrackers))
#> $RE2
#> re2::re2_match(subject, pattern)
#>
#> $ICU
#> stringi::stri_match(subject, regex = pattern)
#>
#> $PCRE
#> regexpr(pattern, subject, perl = TRUE)
#>
#> $TRE
#> regexpr(pattern, subject, perl = FALSE)
#>
Below we run the same subject with a different pattern, one which does not include the \1 backreference (that feature is not supported in RE2, which is why it achieves a better worst-case time complexity).
all.result <- atime::atime(
  N=subject.size.vec,
  setup={
    subject <- paste(rep("a", N), collapse="")
    pattern <- paste(rep(c("a?", "a"), each=N), collapse="")
  },
  expr.list=all.exprs)
all.best <- atime::references_best(all.result)
plot(all.best)
#> log-10 transformation introduced infinite values.
The plot above shows that ICU/PCRE are exponential time whereas RE2/TRE are polynomial time. Exercise for the reader: modify the code above to use the seconds.limit argument, so you can see what happens to ICU/PCRE for larger N (hint: you should see a difference at larger sizes).
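For comparison, here is what the non-backreference pattern from this second benchmark looks like for a small size (a standalone base-R sketch with N=3). Unlike the backreference version, this pattern actually matches the whole subject, as the TRE engine (regexpr with the default perl=FALSE) confirms.

```r
# Same construction as in the second setup, shown for N = 3.
N <- 3
subject <- paste(rep("a", N), collapse="")
pattern <- paste(rep(c("a?", "a"), each=N), collapse="")
cat(pattern, "\n")              # a?a?a?aaa
m <- regexpr(pattern, subject)  # TRE engine (perl=FALSE is the default)
c(start=m, match.length=attr(m, "match.length"))
```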
Typically the best way to present an atime benchmark is by plotting the result of the predict() method.
(all.pred <- predict(all.best))
#> atime_prediction object
#> unit expr.name unit.value N label
#> <char> <char> <num> <num> <char>
#> 1: seconds RE2 0.01 481.28376 RE2\nN=481
#> 2: seconds ICU 0.01 17.66099 ICU\nN=18
#> 3: seconds PCRE 0.01 17.69629 PCRE\nN=18
#> 4: seconds TRE 0.01 169.82779 TRE\nN=170
summary(all.pred)
#> Length Class Mode
#> seconds.limit 1 -none- numeric
#> references 16 data.table list
#> plot.references 16 data.table list
#> measurements 23 data.table list
#> by.vec 1 -none- character
#> prediction 5 data.table list
The predict method above returns a list with a new element named prediction, which shows the data sizes that can be computed within a given time budget (throughput). The plot method is used below.
plot(all.pred)
#> log-10 transformation introduced infinite values.
The figure above shows each expression as a different colored curve, each with a direct label to indicate the throughput at the time limit. This representation makes it easy to see which expression is fastest, and it shows numbers that you can cite to explain what data sizes are possible in the given time limit.
atime_grid to compare different engines

In the nc package there is an engine argument which controls which C regex library is used, so the first comparison above can be re-done using the code below. Using atime_grid() as below can simplify the code required for benchmark comparisons (when possible).
(nc.exprs <- atime::atime_grid(
  list(ENGINE=c(
    if(requireNamespace("re2"))"RE2",
    "PCRE",
    if(requireNamespace("stringi"))"ICU")),
  nc=nc::capture_first_vec(subject, pattern, engine=ENGINE)))
#> $`nc ENGINE=ICU`
#> nc::capture_first_vec(subject, pattern, engine = "ICU")
#>
#> $`nc ENGINE=PCRE`
#> nc::capture_first_vec(subject, pattern, engine = "PCRE")
#>
#> $`nc ENGINE=RE2`
#> nc::capture_first_vec(subject, pattern, engine = "RE2")
#>
#> attr(,"parameters")
#> expr.name expr.grid ENGINE
#> <char> <char> <char>
#> 1: nc ENGINE=ICU nc ICU
#> 2: nc ENGINE=PCRE nc PCRE
#> 3: nc ENGINE=RE2 nc RE2
nc.result <- atime::atime(
  N=subject.size.vec,
  setup={
    rep.collapse <- function(chr)paste(rep(chr, N), collapse="")
    subject <- rep.collapse("a")
    pattern <- list(maybe=rep.collapse("a?"), rep.collapse("a"))
  },
  expr.list=nc.exprs)
nc.best <- atime::references_best(nc.result)
plot(nc.best)
The result/plot above is consistent with the previous result.