https://github.com/google/re2 is probably a better-matched library from Google, since the post focuses on truly regular expressions, while a lot of Irregexp's design is about making the irregular parts less costly.
What I wonder is whether the regexps could be preprocessed so that the fast variant is used in all, or at least most, of the cases where it is possible to use it.
Yes, you may have an exponential space problem, but let's be precise about that: the regular expression is compiled to a graph, and in the worst case that graph may occupy exponential space. In other words, if you do hit the worst case, your code will fail at compile time (of the regexp).
The alternative is that your regexp may take exponential time if it is given a pathological input. Unfortunately, when this happens depends on the backtracking algorithm used, and worse still, it often happens on regexps that wouldn't have used exponential space (those are very rare). So what you end up with is a program with hidden bugs: there are inputs that make your program effectively go infinite and fail at random times, but your tests almost certainly won't find them. This has bitten me several times. I'll take a program that fails unconditionally when tested over a program that fails at random in production any time.
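As a minimal illustration of such a pathological input (Python's re is a backtracking engine; the pattern here is the classic textbook example, not one from the article):

```python
import re

# (a|aa)+ has exponentially many ways to partition a run of 'a's, and a
# backtracking engine explores them all before concluding there is no
# match once the trailing 'b' prevents '$' from matching.
pattern = re.compile(r'(a|aa)+$')

# Matches instantly: the first partition tried succeeds.
assert pattern.match('a' * 10) is not None

# Fails to match, but only after trying every partition. At n = 10 this
# is still fast; the work grows exponentially with each extra 'a', so by
# n ~ 30 the same call effectively hangs a backtracking engine.
assert pattern.match('a' * 10 + 'b') is None
```

Your tests with short strings pass; the production input with a long run of 'a's is the one that hangs.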
PHP 7 managed to make the situation worse. They decided they would solve the randomly-hanging problem by adding a step limit to the regexp execution. When the regexp hit the step limit, it simply appeared to return no matches. So a program that worked on PHP 5 would mysteriously misbehave at random times on PHP 7. And not in the easily diagnosable way PHP 5 would fail - hanging in an infinite loop at the point the problem occurred. No - the program would proceed on its merry way until thousands of statements later, when the fact that it missed a match would rear its ugly head, usually in the most obtuse manner.
I can't speak about RE2, but whether the graphs are created when the program is compiled is an implementation issue. It's certainly true that NFAs generally aren't - probably because construction is O(n), where n is the size of the regexp, so it's very cheap. But since a DFA could be O(2^n), DFAs often are created at compile time. For example, lex does that, IIRC a Rust RE library does, and the DFAs created by LR parser generators invariably are.
Secondly, I specifically said "compile time (of the regular expression)". It is true that may not happen when the program is compiled. But there always is a compile time, the outcome of the compile depends only on the source and not on any runtime inputs, and it is completely deterministic. If PHP REs were compiled to DFAs, for instance, and you did hit the rare exponential-space case, you would find out the first time the program attempted to use the regexp, rather than some random user finding it by having the program fail at some arbitrary time later.
If you want to use an extended regex that includes backreferences, matching is still polynomial as long as the number of backreferences is fixed - and since many regex syntaxes only support \1 through \9, the count is fixed by construction.
The algorithm works using more complex state tracking, but even in the worst case of being in all states at once at a given point of the input, you're at O(len(regex)) space, where the input string length still doesn't factor in at all.
A quick skim of the article doesn't show any sophisticated pattern reduction, so the parent's comment would appear fair. Straightforward regex implementations have to be exponential in time or space for pathological cases, and if it's not time here then it's space, so why is this being downvoted?
No, your understanding and the parent's aren't correct.
The NFA is linear in the size of the regex. That is, a regex like a+b+c+ will have an NFA that's 3x the size of the NFA for a+.
Interpreting the NFA is done in time LINEAR with respect to the input data. Code is given in Cox's articles. Yes, worst case linear.
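The linear-time simulation is easy to sketch. This is not RE2's code, just a toy illustration of the "track all active states at once" idea, with an NFA hand-built for (a|b)*a(a|b):

```python
def run_nfa(transitions, start, accept, text):
    """Simulate an NFA by tracking the set of currently active states.

    Each input character is processed exactly once, and the active set
    can never exceed the number of NFA states, so the whole run is
    O(len(text) * num_states) time and O(num_states) space -- linear in
    the input no matter how nondeterministic the automaton is.
    """
    states = {start}
    for ch in text:
        states = {t for s in states for t in transitions.get((s, ch), ())}
        if not states:          # dead: no path can lead to a match
            return False
    return bool(states & accept)

# NFA for (a|b)*a(a|b): state 1 "guesses" that the final a(a|b) has begun.
nfa = {
    (0, 'a'): {0, 1}, (0, 'b'): {0},
    (1, 'a'): {2},    (1, 'b'): {2},
}
assert run_nfa(nfa, 0, {2}, 'bbaab') is True   # second-to-last char is 'a'
assert run_nfa(nfa, 0, {2}, 'bbbb') is False
```

No backtracking ever happens: all the "guesses" are carried forward in parallel inside the state set.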
You're probably thinking of NFA -> DFA conversion, i.e. the "subset construction".
First of all, RE2 doesn't always do that. You can always operate on NFAs. (And it does stay with NFAs in some cases because capturing is hard with DFAs, IIRC.)
Second of all, "exponential in size of input pattern" is NOT a problem. Exponential in the size of the DATA is a problem, and that doesn't happen. When you are measuring the runtime of regexes, the size of the pattern is a small constant. (Think matching a+b+c+ against gigabytes of data.)
The magic of regular expressions is that you can match any regular language in linear time and constant space. You can't do that with "regexes" in the Perl sense. That's the WHOLE POINT of these articles.
I have wanted to write a summary of Russ Cox's articles on my blog for a while. They are very good, but they are very dense, and most people still aren't getting the message.
The message is: "Use regular languages and not regexes". (e.g. constructs like backreferences and lookaround assertions make the language nonregular)
[1] Side note: I tested re2c and its regex compilation time is linear up to matching 47,000 fixed strings in parallel (from /usr/share/dict/words)
re2c ALWAYS builds a DFA, unlike RE2. (Note: they are completely unrelated projects, despite their similar names.)
Even taking into account that exponential in the size of the pattern is not a problem, it's still not exponential in the size of the pattern (for the fgrep problem). If the DFA were exponentially sized, then it would take exponential time to construct it, but it doesn't in this case.
There's definitely room for a helpful blog post laying out these principles in a less dense way. Go for it!
There's also a lot of room for discussion of the choices around implementation techniques different libraries use (things like whether or not DFAs are created on the fly). I'm interested in this area, so I should probably work up to writing up some of these examples, as a way of forcing myself to learn them better.
Also, I think you're not painting the full picture about the issue of DFA conversion being exponential. You're right that it's much less of an issue than being exponential in the input size, as it affects fewer use cases. However, it's still a potential DoS issue, so it definitely matters for any application that takes user-controlled input and produces a DFA. I think it also matters for getting optimal performance out of an engine even in non-adversarial conditions.
For an example of how easy this is to trigger: (a|b)*a(a|b)^(n-1) produces a DFA of size exponential in n. With n = 15, re2c generates a 4 MB file.
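You can watch that blow-up directly with a toy subset construction (the NFA here is hand-built for the pattern; this is my own sketch, nothing to do with how re2c represents things):

```python
def subset_construction(nfa, start, alphabet):
    """Determinize an NFA (dict: (state, ch) -> set of states) and
    return the set of reachable DFA states, each a frozenset of NFA
    states."""
    start_set = frozenset([start])
    seen, todo = {start_set}, [start_set]
    while todo:
        cur = todo.pop()
        for ch in alphabet:
            nxt = frozenset(t for s in cur for t in nfa.get((s, ch), ()))
            if nxt and nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

def pathological_nfa(n):
    """NFA for (a|b)*a(a|b)^(n-1): only n+1 states, but the equivalent
    DFA has to remember whether each of the last n characters was 'a'."""
    nfa = {(0, 'a'): {0, 1}, (0, 'b'): {0}}
    for i in range(1, n):
        nfa[(i, 'a')] = {i + 1}
        nfa[(i, 'b')] = {i + 1}
    return nfa

for n in range(1, 8):
    dfa_states = subset_construction(pathological_nfa(n), 0, 'ab')
    assert len(dfa_states) == 2 ** n   # exponential in n
```

The NFA grows linearly with n while the reachable DFA state count doubles with every increment, which is exactly why a generator like re2c emits such a large file.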
I learned a few months ago that Rust's regex crate, which is based heavily on RE2, does the same thing.
There is perhaps something worth exploring there, but after reading the Cox articles, I can see that there are so many issues that come up in regex implementation, and this seems to be one of the smaller ones.
I don't think the "regex as untrusted data" use case is very common. That is, there are probably 10,000 programs that use regexes for every one that accepts regexes from users.
-----
Although, one thing I've wanted to explore is Regex derivatives. Supposedly that technique constructs an optimal DFA directly. You don't have to make an NFA, convert it to a DFA, and then optimize the DFA.
Cox describes it as a "win-win" here (in contrast to parsing with derivatives, which is asymptotically worse than traditional parsing algorithms)
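For anyone curious, the core of the derivative idea fits in a page. This is a bare-bones sketch over a tiny regex AST of my own devising, with no simplification step (real implementations simplify the derivatives with smart constructors, which is what keeps the set of distinct derivatives finite and yields the DFA):

```python
# Regex AST as tuples: ('empty',), ('eps',), ('chr', c),
# ('cat', r, s), ('alt', r, s), ('star', r)

def nullable(r):
    """Does r match the empty string?"""
    tag = r[0]
    if tag in ('eps', 'star'):
        return True
    if tag in ('empty', 'chr'):
        return False
    if tag == 'cat':
        return nullable(r[1]) and nullable(r[2])
    return nullable(r[1]) or nullable(r[2])        # alt

def deriv(r, c):
    """Brzozowski derivative: a regex matching w iff r matches c + w."""
    tag = r[0]
    if tag in ('empty', 'eps'):
        return ('empty',)
    if tag == 'chr':
        return ('eps',) if r[1] == c else ('empty',)
    if tag == 'cat':
        left = ('cat', deriv(r[1], c), r[2])
        return ('alt', left, deriv(r[2], c)) if nullable(r[1]) else left
    if tag == 'alt':
        return ('alt', deriv(r[1], c), deriv(r[2], c))
    return ('cat', deriv(r[1], c), r)              # star

def matches(r, text):
    for c in text:
        r = deriv(r, c)
    return nullable(r)

# a(b|c)*
re_ast = ('cat', ('chr', 'a'), ('star', ('alt', ('chr', 'b'), ('chr', 'c'))))
assert matches(re_ast, 'abcb') is True
assert matches(re_ast, 'ba') is False
```

Each distinct (simplified) derivative becomes a DFA state, with deriv giving the transition function, so you go from regex to DFA without ever materializing an NFA.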
So basically Rust is the only newer language that appears to have gotten the message.
The Oil shell will of course use "regular languages" rather than regexes, but that's natural because shell never had Perl-style regexes.
One thing I think Cox didn't explore enough is that most libcs appear to be using the fast method, because they don't need Perl constructs.
I guess that would be a good follow-up article. To test GNU libc and musl libc on those pathological cases. I looked at their code a few years ago and they looked "right" (I've forgotten the details), but I don't think I ever did the test.
The various kinds of lookaround are essentially intersections plus some trickery to extract the correct spans.
E.g. in Python re.match('(?=(a.)+$)(.b.|..b)+', 'abacabax') matches 'abacab', while applying it to 'abacabxa' doesn't match. If you only focus on the question of whether there's a match starting at a given position, this is equivalent to there being a match for '(a.)+$' as well as '(.b.|..b)+' from the same position, i.e. the intersection of the languages matched by the two expressions.
If you have two automata with n and m states for the two expressions, then you can match the intersection by creating an automaton with nm states corresponding to all possible combinations of states of the smaller automata, with a transition from (i,j) to (k,l) iff there are transitions from i to k and j to l. (i,j) is an accepting state iff both i and j are. This essentially simulates both automata in parallel.
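The product construction amounts to just running the two automata in lockstep, which is easy to demonstrate with toy DFAs (my own encoding, not how any particular engine represents them):

```python
def intersect_accepts(d1, d2, text):
    """Run two DFAs in lockstep. The pair (s1, s2) at each step is
    exactly the product automaton's state (i, j), so we accept iff both
    components end in accepting states.

    Each DFA is (transitions dict (state, ch) -> state, start, accepts).
    """
    (t1, s1, a1), (t2, s2, a2) = d1, d2
    for ch in text:
        s1 = t1.get((s1, ch))
        s2 = t2.get((s2, ch))
        if s1 is None or s2 is None:   # either automaton is stuck
            return False
    return s1 in a1 and s2 in a2

# DFA 1: even number of a's (over the alphabet {a, b})
even_a = ({(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 0, (1, 'b'): 1}, 0, {0})
# DFA 2: string ends in b
ends_b = ({(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 0, (1, 'b'): 1}, 0, {1})

assert intersect_accepts(even_a, ends_b, 'aab') is True
assert intersect_accepts(even_a, ends_b, 'ab') is False    # odd a's
assert intersect_accepts(even_a, ends_b, 'aaba') is False  # ends in a
```

Simulating in lockstep never materializes the nm product states; you only pay for the pairs the input actually visits.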
Once you've determined whether or not there's a match by using that automaton, you could extract the span that matched by re-running only the part of the regular expression that's not part of a lookaround assertion. I guess actual implementations just do some additional bookkeeping to do both matching and span extraction in one pass.
A program that is not exponential in time can't require exponential space, because it can't touch that much space in that little time.
Even if you count the NFA as part of the input, the expansion into a DFA only happens while matching one character at a time, so to blow up the cache to a size exponential in the number of NFA states, you'd have to be matching on a string of comparable length that forces all possible state combinations to occur.
If you don't count the NFA as part of the input, the matching algorithm for any particular NFA is actually constant-space, because the maximum space required is independent of the string you're processing.
I'm assuming the DFA is built ahead of time, which is pretty conventional in my understanding. This may be something in the article I missed, that the DFA is built incrementally at runtime by this algo. If so, my mistake. I'll read again.
@chubot: ok, I am wrong, thanks for the details! Upvoted both.
The Rust regex crate docs even refer to his posts as a reference, since the Rust version is based on his efficient RE2 regex implementation at Google.