
Considering the C implementation using PCRE is 4.1x Rust, whereas PHP (implemented in C, also using PCRE) is 1.2x Rust, that makes me think that this benchmark is… unhelpful at best.

The fastest C implementation uses TCL’s simplified regexes. However, http://lh3lh3.users.sourceforge.net/reb.shtml indicates that oniguruma and re2 generally trounce TCL.

Some other comparisons of regex engine performance are at http://sljit.sourceforge.net/regex_perf.html. Unfortunately, neither of these is new enough to include Rust.

One of the big pieces that can easily be glossed over, especially during performance comparisons, is unicode handling.



The regex-dna benchmark biases toward regex engines with aggressive literal optimizations. Namely, there are 21 distinct regexes in the benchmark. Of those 21, only 1 of them actually uses Rust's core regex engine. The rest are compiled down to literal searches. Even in that one case, literal optimizations are used to speed it up.
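To illustrate what "compiled down to literal searches" can mean: a pattern such as `agggtaaa|tttaccct` (the shape of the regex-dna variant patterns as I recall them, so treat the exact strings as an assumption) is only an alternation of fixed strings, so its matches can be counted with plain substring search. A minimal std-only sketch follows; the helper name `count_literal_alternation` is hypothetical, and this is not how the regex crate is actually implemented.

```rust
/// Counts non-overlapping matches of an alternation of literals,
/// scanning left to right the way a regex like `a|b` would.
/// Hypothetical helper for illustration only.
fn count_literal_alternation(haystack: &str, literals: &[&str]) -> usize {
    let mut count = 0;
    let mut pos = 0;
    while pos < haystack.len() {
        // Find the earliest occurrence of any literal at or after `pos`.
        // On a tie, the literal listed first wins.
        let next = literals
            .iter()
            .filter(|lit| !lit.is_empty())
            .filter_map(|lit| haystack[pos..].find(*lit).map(|i| (pos + i, lit.len())))
            .min_by_key(|&(start, _)| start);
        match next {
            Some((start, len)) => {
                count += 1;
                // Resume after the match so matches do not overlap.
                pos = start + len;
            }
            None => break,
        }
    }
    count
}
```

Calling `count_literal_alternation(input, &["agggtaaa", "tttaccct"])` never touches a general regex matcher, which is why a benchmark dominated by such patterns mostly measures substring search speed.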

I'm not familiar with Tcl's engine, but if it also applies aggressive literal optimizations, then that could explain its performance. However, given a more fine-grained analysis of regex-dna[1], that seems unlikely. The C program is large, so perhaps there is more to it than meets the eye.

The perf difference between PHP w/ PCRE and C w/ PCRE has always perplexed me. But there are... so many variables. Maybe PHP enables JIT in PCRE (the C program does not), or perhaps PHP's preg_replace is doing something clever, e.g., by combining all of the replacement regexes into one.[2]
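The "combine all of the replacements into one" idea can be sketched without any regex engine: instead of running one substitution pass per pattern, scan the input once and look up each character's replacement. The substitution pairs below are a few IUPAC-code rewrites of the kind regex-dna performs (quoted from memory, so an assumption), and `replace_iupac_single_pass` is a hypothetical name. Nothing here is taken from the PHP or Rust benchmark programs.

```rust
/// Rewrites a handful of IUPAC ambiguity codes in one left-to-right pass.
/// The table is illustrative and deliberately incomplete.
fn replace_iupac_single_pass(seq: &str) -> String {
    let mut out = String::with_capacity(seq.len());
    for ch in seq.chars() {
        match ch {
            'B' => out.push_str("(c|g|t)"),
            'D' => out.push_str("(a|g|t)"),
            'K' => out.push_str("(g|t)"),
            'N' => out.push_str("(a|c|g|t)"),
            // Anything without a replacement is copied through unchanged.
            other => out.push(other),
        }
    }
    out
}
```

With N separate replacement regexes the input is scanned and copied N times; a combined approach like this one scans it once, which is one plausible source of a large constant-factor difference.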

> One of the big pieces that can easily be glossed over, especially during performance comparisons, is unicode handling.

Indeed. The regex-dna benchmark does not require any Unicode handling at all. It's a strictly ASCII based benchmark.

If your regex engine uses finite state machines, then you can typically build the encoding into the machines themselves, which results in little or no performance degradation in matching in the common case. (The cases where it matters are those where your regex is large, e.g., `\pL{100}` is large indeed.)
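As a rough illustration of building the encoding into the machine, here is a hand-written byte-level automaton that accepts one or more code points in U+0370..=U+03FF, matched directly against UTF-8 bytes with no decoding step. The range is the Greek and Coptic block, chosen because it maps to a small set of two-byte sequences; it is not the same thing as a Unicode script property. The state names and `is_all_greek_block` are hypothetical. A real engine would generate such states automatically.

```rust
/// States of a byte-level automaton over UTF-8 input.
#[derive(Clone, Copy, PartialEq)]
enum State {
    Start,    // nothing matched yet
    NeedHigh, // saw 0xCD; next byte must be 0xB0..=0xBF (U+0370..=U+037F)
    NeedAny,  // saw 0xCE or 0xCF; next byte must be 0x80..=0xBF (U+0380..=U+03FF)
    Accept,   // one or more complete code points matched
    Dead,     // input cannot match
}

/// One transition: consumes a single byte, never a decoded `char`.
fn step(state: State, byte: u8) -> State {
    match (state, byte) {
        (State::Start | State::Accept, 0xCD) => State::NeedHigh,
        (State::Start | State::Accept, 0xCE | 0xCF) => State::NeedAny,
        (State::NeedHigh, 0xB0..=0xBF) => State::Accept,
        (State::NeedAny, 0x80..=0xBF) => State::Accept,
        _ => State::Dead,
    }
}

/// Returns true if `input` is entirely one or more code points
/// from U+0370..=U+03FF, encoded as UTF-8.
fn is_all_greek_block(input: &[u8]) -> bool {
    let mut state = State::Start;
    for &byte in input {
        state = step(state, byte);
        if state == State::Dead {
            return false;
        }
    }
    state == State::Accept
}
```

The matching loop does the same per-byte work as an ASCII-only automaton; the cost of Unicode shows up as extra states, which is why a very large class repeated many times can blow up the machine.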

[1] - https://github.com/rust-lang/regex/blob/master/bench/log/05/...

[2] - https://github.com/rust-lang/regex/blob/master/examples/shoo...


> One of the big pieces that can easily be glossed over, especially during performance comparisons, is unicode handling.

Patterns in Rust's regex library are Unicode-aware by default: https://doc.rust-lang.org/regex/regex/index.html#unicode



