https://github.com/google/re2 is probably a better-matched library from Google, since the post focuses on truly regular expressions, while a lot of Irregexp's design is about making the irregular parts less costly.
What I wonder is whether the regexps could be preprocessed so that the fast variant is used in all, or at least most, of the cases where it is possible to use it.
Yes, you may have an exponential space problem, but let's be precise about that: the regular expression is compiled to a graph, and in the worst case that graph may occupy exponential space. In other words, if you do hit the worst case, your code will fail at compile time (of the regexp).
The alternative is that your regexp may take exponential time if it is given a pathological input. Unfortunately, when this happens depends on the backtracking algorithm used, and worse still, it often happens on regexps that wouldn't have used exponential space (those are very rare). So what you end up with is a program with hidden bugs: there are inputs that make your program effectively go infinite and fail at random times, but your tests almost certainly won't find them. This has bitten me several times. I'll take a program that fails unconditionally when tested over a program that fails at random in production any time.
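As a minimal illustration of such a pathological input (Python's re is a backtracking engine; the pattern here is the classic textbook example, not one from the article):

```python
import re

# (a|aa)+ has exponentially many ways to partition a run of 'a's, and a
# backtracking engine explores them all before concluding there is no
# match once the trailing 'b' prevents '$' from matching.
pattern = re.compile(r'(a|aa)+$')

# Matches instantly: the first partition tried succeeds.
assert pattern.match('a' * 10) is not None

# Fails to match, but only after trying every partition. At n = 10 this
# is still fast; the work grows exponentially with each extra 'a', so by
# n ~ 30 the same call effectively hangs a backtracking engine.
assert pattern.match('a' * 10 + 'b') is None
```

Your tests with short strings pass; the production input with a long run of 'a's is the one that hangs.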
PHP 7 managed to make the situation worse. They decided they would solve the randomly-hanging problem by adding a step limit to the regexp execution. When the regexp hit the step limit, it simply appeared to return no matches. So a program that worked on PHP 5 would mysteriously misbehave at random times on PHP 7. And not in the easily diagnosable way PHP 5 would fail - hanging in an infinite loop at the point the problem occurred. No - the program would proceed on its merry way until thousands of statements later, when the fact that it missed a match would rear its ugly head, usually in the most obtuse manner.
I can't speak about RE2, but whether the graphs are created when the program is compiled is an implementation issue. It's certainly true that NFAs generally aren't - probably because construction is O(n), where n is the size of the regexp, so it's very cheap. But since a DFA could be O(2^n), DFAs often are created at compile time. For example, lex does that, IIRC a Rust RE library does, and the DFAs created by LR parser generators invariably are.
Secondly, I specifically said "compile time (of the regular expression)". It is true that may not happen when the program is compiled. But there always is a compile time, the outcome of the compile depends only on the source and not on any runtime inputs, and it is completely deterministic. If PHP REs were compiled to DFAs, for instance, and you did hit the rare exponential-space case, you would find out the first time the program attempted to use the regexp, rather than some random user finding it by having the program fail at some arbitrary time later.
If you want to use an extended regex that includes backreferences, matching is still polynomial as long as the number of backreferences is fixed - and since many regex syntaxes only support \1 through \9, the count is fixed by construction.
The algorithm works using more complex state tracking, but even in the worst case of being in all states at once at a given point of the input, you're at O(len(regex)) space, where the input string length still doesn't factor in at all.
A quick skim of the article doesn't show any sophisticated pattern reduction, so the parent's comment would appear fair. Straightforward regex implementations have to be exponential in time or space for pathological cases, and if it's not time here then it's space, so why is this being downvoted?
No, your understanding and the parent's aren't correct.
The NFA is linear in the size of the regex. That is, a regex like a+b+c+ will have an NFA that's 3x the size of the NFA for a+.
Interpreting the NFA is done in time LINEAR with respect to the input data. Code is given in Cox's articles. Yes, worst case linear.
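The linear-time simulation is easy to sketch. This is not RE2's code, just a toy illustration of the "track all active states at once" idea, with an NFA hand-built for (a|b)*a(a|b):

```python
def run_nfa(transitions, start, accept, text):
    """Simulate an NFA by tracking the set of currently active states.

    Each input character is processed exactly once, and the active set
    can never exceed the number of NFA states, so the whole run is
    O(len(text) * num_states) time and O(num_states) space -- linear in
    the input no matter how nondeterministic the automaton is.
    """
    states = {start}
    for ch in text:
        states = {t for s in states for t in transitions.get((s, ch), ())}
        if not states:          # dead: no path can lead to a match
            return False
    return bool(states & accept)

# NFA for (a|b)*a(a|b): state 1 "guesses" that the final a(a|b) has begun.
nfa = {
    (0, 'a'): {0, 1}, (0, 'b'): {0},
    (1, 'a'): {2},    (1, 'b'): {2},
}
assert run_nfa(nfa, 0, {2}, 'bbaab') is True   # second-to-last char is 'a'
assert run_nfa(nfa, 0, {2}, 'bbbb') is False
```

No backtracking ever happens: all the "guesses" are carried forward in parallel inside the state set.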
You're probably thinking of NFA -> DFA conversion, i.e. the "subset construction".
First of all, RE2 doesn't always do that. You can always operate on NFAs. (And it does stay with NFAs in some cases because capturing is hard with DFAs, IIRC.)
Second of all, "exponential in size of input pattern" is NOT a problem. Exponential in the size of the DATA is a problem, and that doesn't happen. When you are measuring the runtime of regexes, the size of the pattern is a small constant. (Think matching a+b+c+ against gigabytes of data.)
The magic of regular expressions is that you can match any regular language in linear time and constant space. You can't do that with "regexes" in the Perl sense. That's the WHOLE POINT of these articles.
I have wanted to write a summary of Russ Cox's articles on my blog for a while. They are very good, but they are very dense, and most people still aren't getting the message.
The message is: "Use regular languages and not regexes". (e.g. constructs like backreferences and lookaround assertions make the language nonregular)
[1] Side note: I tested re2c and its regex compilation time is linear up to matching 47,000 fixed strings in parallel (from /usr/share/dict/words)
re2c ALWAYS builds a DFA, unlike RE2. (Note: they are completely unrelated projects, despite their similar names.)
Even taking into account that exponential in the size of the pattern is not a problem, it's still not exponential in the size of the pattern (for the fgrep problem). If the DFA were exponentially sized, then it would take exponential time to construct it, but it doesn't in this case.
There's definitely room for a helpful blog post laying out these principles in a less dense way. Go for it!
There's also a lot of room for discussion of the choices around implementation techniques different libraries use (things like whether or not DFAs are created on the fly). I'm interested in this area, so I should probably work up to writing up some of these examples, as a way of forcing myself to learn them better.
Also, I think you're not painting the full picture about the issue of DFA conversion being exponential. You're right that it's much less of an issue than being exponential in the input size, as it affects fewer use cases. However, it's still a potential DoS issue, so it definitely matters for any application that takes user-controlled input and produces a DFA. I think it also matters for getting optimal performance out of an engine even in non-adversarial conditions.
For an example of how easy this is to trigger: (a|b)*a(a|b)^(n-1) produces a DFA of size exponential in n. With n = 15, re2c generates a 4 MB file.
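You can watch that blow-up directly with a toy subset construction (the NFA here is hand-built for the pattern; this is my own sketch, nothing to do with how re2c represents things):

```python
def subset_construction(nfa, start, alphabet):
    """Determinize an NFA (dict: (state, ch) -> set of states) and
    return the set of reachable DFA states, each a frozenset of NFA
    states."""
    start_set = frozenset([start])
    seen, todo = {start_set}, [start_set]
    while todo:
        cur = todo.pop()
        for ch in alphabet:
            nxt = frozenset(t for s in cur for t in nfa.get((s, ch), ()))
            if nxt and nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

def pathological_nfa(n):
    """NFA for (a|b)*a(a|b)^(n-1): only n+1 states, but the equivalent
    DFA has to remember whether each of the last n characters was 'a'."""
    nfa = {(0, 'a'): {0, 1}, (0, 'b'): {0}}
    for i in range(1, n):
        nfa[(i, 'a')] = {i + 1}
        nfa[(i, 'b')] = {i + 1}
    return nfa

for n in range(1, 8):
    dfa_states = subset_construction(pathological_nfa(n), 0, 'ab')
    assert len(dfa_states) == 2 ** n   # exponential in n
```

The NFA grows linearly with n while the reachable DFA state count doubles with every increment, which is exactly why a generator like re2c emits such a large file.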
I learned a few months ago that Rust's regex crate, which is based heavily on RE2, does the same thing.
There is perhaps something worth exploring there, but after reading the Cox articles, I can see that there are so many issues that come up in regex implementation, and this seems to be one of the smaller ones.
I don't think the "regex as untrusted data" use case is very common. That is, there are probably 10,000 programs that use regexes for every one that accepts regexes from users.
-----
Although, one thing I've wanted to explore is Regex derivatives. Supposedly that technique constructs an optimal DFA directly. You don't have to make an NFA, convert it to a DFA, and then optimize the DFA.
Cox describes it as a "win-win" here (in contrast to parsing with derivatives, which is asymptotically worse than traditional parsing algorithms)
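For anyone curious, the core of the derivative idea fits in a page. This is a bare-bones sketch over a tiny regex AST of my own devising, with no simplification step (real implementations simplify the derivatives with smart constructors, which is what keeps the set of distinct derivatives finite and yields the DFA):

```python
# Regex AST as tuples: ('empty',), ('eps',), ('chr', c),
# ('cat', r, s), ('alt', r, s), ('star', r)

def nullable(r):
    """Does r match the empty string?"""
    tag = r[0]
    if tag in ('eps', 'star'):
        return True
    if tag in ('empty', 'chr'):
        return False
    if tag == 'cat':
        return nullable(r[1]) and nullable(r[2])
    return nullable(r[1]) or nullable(r[2])        # alt

def deriv(r, c):
    """Brzozowski derivative: a regex matching w iff r matches c + w."""
    tag = r[0]
    if tag in ('empty', 'eps'):
        return ('empty',)
    if tag == 'chr':
        return ('eps',) if r[1] == c else ('empty',)
    if tag == 'cat':
        left = ('cat', deriv(r[1], c), r[2])
        return ('alt', left, deriv(r[2], c)) if nullable(r[1]) else left
    if tag == 'alt':
        return ('alt', deriv(r[1], c), deriv(r[2], c))
    return ('cat', deriv(r[1], c), r)              # star

def matches(r, text):
    for c in text:
        r = deriv(r, c)
    return nullable(r)

# a(b|c)*
re_ast = ('cat', ('chr', 'a'), ('star', ('alt', ('chr', 'b'), ('chr', 'c'))))
assert matches(re_ast, 'abcb') is True
assert matches(re_ast, 'ba') is False
```

Each distinct (simplified) derivative becomes a DFA state, with deriv giving the transition function, so you go from regex to DFA without ever materializing an NFA.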
So basically Rust is the only newer language that appears to have gotten the message.
The Oil shell will of course use "regular languages" rather than regexes, but that's natural because shell never had Perl-style regexes.
One thing I think Cox didn't explore enough is that most libcs appear to be using the fast method, because they don't need Perl constructs.
I guess that would be a good follow-up article. To test GNU libc and musl libc on those pathological cases. I looked at their code a few years ago and they looked "right" (I've forgotten the details), but I don't think I ever did the test.
The various kinds of lookaround are essentially intersections plus some trickery to extract the correct spans.
E.g. in Python re.match('(?=(a.)+$)(.b.|..b)+', 'abacabax') matches 'abacab', while applying it to 'abacabxa' doesn't match. If you only focus on the question of whether there's a match starting at a given position, this is equivalent to there being a match for '(a.)+$' as well as '(.b.|..b)+' from the same position, i.e. the intersection of the languages matched by the two expressions.
If you have two automata with n and m states for the two expressions, then you can match the intersection by creating an automaton with nm states corresponding to all possible combinations of states of the smaller automata, with a transition from (i,j) to (k,l) iff there are transitions from i to k and j to l. (i,j) is an accepting state iff both i and j are. This essentially simulates both automata in parallel.
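The product construction amounts to just running the two automata in lockstep, which is easy to demonstrate with toy DFAs (my own encoding, not how any particular engine represents them):

```python
def intersect_accepts(d1, d2, text):
    """Run two DFAs in lockstep. The pair (s1, s2) at each step is
    exactly the product automaton's state (i, j), so we accept iff both
    components end in accepting states.

    Each DFA is (transitions dict (state, ch) -> state, start, accepts).
    """
    (t1, s1, a1), (t2, s2, a2) = d1, d2
    for ch in text:
        s1 = t1.get((s1, ch))
        s2 = t2.get((s2, ch))
        if s1 is None or s2 is None:   # either automaton is stuck
            return False
    return s1 in a1 and s2 in a2

# DFA 1: even number of a's (over the alphabet {a, b})
even_a = ({(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 0, (1, 'b'): 1}, 0, {0})
# DFA 2: string ends in b
ends_b = ({(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 0, (1, 'b'): 1}, 0, {1})

assert intersect_accepts(even_a, ends_b, 'aab') is True
assert intersect_accepts(even_a, ends_b, 'ab') is False    # odd a's
assert intersect_accepts(even_a, ends_b, 'aaba') is False  # ends in a
```

Simulating in lockstep never materializes the nm product states; you only pay for the pairs the input actually visits.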
Once you've determined whether or not there's a match by using that automaton, you could extract the span that matched by re-running only the part of the regular expression that's not part of a lookaround assertion. I guess actual implementations just do some additional bookkeeping to do both matching and span extraction in one pass.
A program that is not exponential in time can't require exponential space, because it can't touch that much space in that little time.
Even if you count the NFA as part of the input, the expansion into a DFA only happens while matching one character at a time, so to blow up the cache to a size exponential in the number of NFA states, you'd have to be matching on a string of comparable length that forces all possible state combinations to occur.
If you don't count the NFA as part of the input, the matching algorithm for any particular NFA is actually constant-space, because the maximum space required is independent of the string you're processing.
I'm assuming the DFA is built ahead of time, which is pretty conventional in my understanding. This may be something in the article I missed, that the DFA is built incrementally at runtime by this algo. If so, my mistake. I'll read again.
@chubot: ok, I am wrong, thanks for the details! Upvoted both.
The Rust regex crate docs even refer to his posts as a reference, since the Rust version is based on his efficient RE2 regex implementation at Google.