Considering the C implementation using PCRE is 4.1x Rust, whereas PHP (implement...

burntsushi · on Feb 9, 2017

The regex-dna benchmark biases toward regex engines with aggressive literal optimizations. Namely, there are 21 distinct regexes in the benchmark. Of those 21, only 1 of them actually uses Rust's core regex engine. The rest are compiled down to literal searches. Even in that one case, literal optimizations are used to speed it up.
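To make "compiled down to literal searches" concrete, here is a minimal sketch of matching an alternation of plain literals with substring search alone. It is not how the regex crate is actually implemented. The pattern `agggtaaa|tttaccct` is used only as an illustrative input, and the helper name `count_literal_alternation` is hypothetical.

```rust
/// Counts non-overlapping, leftmost matches of an alternation of plain
/// literals using only substring search (no regex engine involved).
/// Hypothetical helper, for illustration only.
fn count_literal_alternation(haystack: &str, literals: &[&str]) -> usize {
    let mut count = 0;
    let mut pos = 0;
    while pos <= haystack.len() {
        // Find the earliest occurrence of any literal at or after `pos`.
        // On a tie, the literal listed first wins.
        let mut best: Option<(usize, usize)> = None; // (start, length)
        for lit in literals {
            if lit.is_empty() {
                continue; // skip empty literals so the scan always advances
            }
            if let Some(i) = haystack[pos..].find(lit) {
                let start = pos + i;
                if best.map_or(true, |(s, _)| start < s) {
                    best = Some((start, lit.len()));
                }
            }
        }
        match best {
            Some((start, len)) => {
                count += 1;
                pos = start + len; // resume after the match
            }
            None => break,
        }
    }
    count
}

fn main() {
    let dna = "ccagggtaaaggtttaccctaa";
    let n = count_literal_alternation(dna, &["agggtaaa", "tttaccct"]);
    println!("{} matches", n);
}
```

A real engine would typically use a multi-pattern or vectorized search rather than one `find` call per literal; the point is only that no general regex machinery is needed for patterns of this shape.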

I'm not familiar with Tcl's engine, but if it also applies aggressive literal optimizations, then that could explain its performance. However, given a more fine-grained analysis of regex-dna[1], that seems unlikely. The C program is large, so perhaps there is more to it than meets the eye.

The perf difference between PHP w/ PCRE and C w/ PCRE has always perplexed me. But there are... so many variables. Maybe PHP enables JIT in PCRE (the C program does not), or perhaps PHP's preg_replace is doing something clever, e.g., by combining all of the replacement regexes into one.[2]
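As a rough sketch of what "combining all of the replacement regexes into one" could look like (this says nothing about what PHP actually does), several single-character replacements can be applied in one pass over the input instead of one full pass per pattern. The function name `replace_in_one_pass` is hypothetical, and the replacement table is an assumed, illustrative subset.

```rust
/// Applies several single-character replacements in one pass over the input,
/// instead of running one full-text replacement per pattern.
/// The table is an illustrative subset, not a complete list.
fn replace_in_one_pass(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            'B' => out.push_str("(c|g|t)"),
            'D' => out.push_str("(a|g|t)"),
            'K' => out.push_str("(g|t)"),
            other => out.push(other), // everything else is copied unchanged
        }
    }
    out
}

fn main() {
    println!("{}", replace_in_one_pass("aBcK"));
}
```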

> One of the big pieces that can easily be glossed over, especially during performance comparisons, is unicode handling.

Indeed. The regex-dna benchmark does not require any Unicode handling at all. It's a strictly ASCII-based benchmark.

If your regex engine uses finite state machines, then you can typically build the encoding into the machines themselves, which results in little or no performance degradation in matching in the common case. (It matters mostly when your regex is large, e.g., `\pL{100}` is large indeed.)
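A small hand-written sketch of the idea, assuming UTF-8 as the encoding: the automaton below matches the character class `[α-ω]` (U+03B1 through U+03C9) by stepping over raw bytes, so the haystack is never decoded into code points. A real engine would generate such a machine from the pattern; the `State` enum and the function name `find_greek_lower` are hypothetical.

```rust
/// States of a tiny byte-level automaton for `[α-ω]` in UTF-8.
/// U+03B1..=U+03BF encode as 0xCE 0xB1..=0xBF,
/// U+03C0..=U+03C9 encode as 0xCF 0x80..=0x89.
#[derive(Clone, Copy)]
enum State {
    Start,
    AfterCe, // saw the lead byte 0xCE
    AfterCf, // saw the lead byte 0xCF
}

/// Returns the byte offset of the first character in `[α-ω]`, if any.
/// The input is scanned one byte at a time, with no UTF-8 decoding step.
fn find_greek_lower(haystack: &[u8]) -> Option<usize> {
    let mut state = State::Start;
    for (i, &b) in haystack.iter().enumerate() {
        state = match (state, b) {
            // Second byte completes a match; it began one byte earlier.
            (State::AfterCe, 0xB1..=0xBF) => return Some(i - 1),
            (State::AfterCf, 0x80..=0x89) => return Some(i - 1),
            // A lead byte moves to the corresponding intermediate state.
            (_, 0xCE) => State::AfterCe,
            (_, 0xCF) => State::AfterCf,
            // Anything else resets the machine.
            _ => State::Start,
        };
    }
    None
}

fn main() {
    // "\u{3bb}" is the Greek small letter lambda.
    let text = "abc \u{3bb}";
    println!("{:?}", find_greek_lower(text.as_bytes()));
}
```

The per-byte work here is the same kind of table step an ASCII-only automaton would do, which is why the common case costs little. The number of states grows with the size of the class and any repetition, which is where something like `\pL{100}` gets expensive.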

[1] - https://github.com/rust-lang/regex/blob/master/bench/log/05/...

[2] - https://github.com/rust-lang/regex/blob/master/examples/shoo...

kibwen · on Feb 8, 2017

> One of the big pieces that can easily be glossed over, especially during performance comparisons, is unicode handling.

Patterns in Rust's regex library are Unicode-aware by default: https://doc.rust-lang.org/regex/regex/index.html#unicode