
What is backtracking?


Looks like there's (a lot of) confusion in these comments about the difference between backtracking and backreferences.

The `\2` in Phritzy's snippet is a backreference.

Backtracking is an implementation strategy for writing a regular expression engine.

I don't know why anyone choosing an engine would "require backtracking". It's an implementation detail, not a feature. (Although the fact that Thompson NFAs avoid the exponential time complexity inherent to backtracking is something that I suppose could be considered a feature.)

Here's a link to some real literature:

https://swtch.com/~rsc/regexp/regexp1.html


From the RE2 wiki:

"As a matter of principle, RE2 does not support constructs for which only backtracking solutions are known to exist. Thus, backreferences and look-around assertions are not supported."

From your link:

One common regular expression extension that does provide additional power is called backreferences. A backreference like \1 or \2 matches the string matched by a previous parenthesized expression, and only that string: (cat|dog)\1 matches catcat and dogdog but not catdog nor dogcat. As far as the theoretical term is concerned, regular expressions with backreferences are not regular expressions. The power that backreferences add comes at great cost: in the worst case, the best known implementations require exponential search algorithms, like the one Perl uses. Perl (and the other languages) could not now remove backreference support, of course, but they could employ much faster algorithms when presented with regular expressions that don't have backreferences, like the ones considered above. This article is about those faster algorithms.
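
For a concrete feel of that quoted example, here's how a backtracking engine such as Python's re module behaves (a quick illustration of my own, not from the article):

    import re

    # A backreference demo in a backtracking engine (Python's re module):
    # (cat|dog)\1 matches only a repetition of whatever group 1 matched.
    pattern = re.compile(r"(cat|dog)\1")

    print(bool(pattern.fullmatch("catcat")))  # True
    print(bool(pattern.fullmatch("dogdog")))  # True
    print(bool(pattern.fullmatch("catdog")))  # False: \1 must repeat "cat"
    print(bool(pattern.fullmatch("dogcat")))  # False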

If your regex has backreferences, or look-ahead/look-behind assertions, or similar things, then your regex engine must support a backtracking strategy.

If you do not need backreferences or similar things, then your regex engine may still choose to use backtracking, but does not need to.

RE2 is a regex engine that prioritizes never using backtracking as an implementation detail, and therefore does not support backreferences as a matter of external API.

RE2 is not able to evaluate Phritzy's regex.


> If your regex has backreferences, or look-ahead/look-behind assertions, or similar things, then your regex engine must support a backtracking strategy.

There's a morphism that maps Phritzy's pattern to an equivalent pattern that can be matched with a Thompson NFA. (It's a similar strategy to the one that allows you to take an NFA and turn it into a DFA.) You can perform a sort of "static decomposition" on a subset of patterns containing backreferences, such that /(cat|dog)\1/ for example decomposes into /catcat|dogdog/. The caveat is that this isn't fully generalizable for all patterns containing backreferences.
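
Here's a minimal sketch of that static decomposition for the narrowest case (a single group of literal alternatives followed immediately by \1); the helper name and the restriction to literal alternatives are my own simplifications, not a general algorithm:

    import re

    def decompose_backref(alternatives):
        """Rewrite (a|b|...)\\1 into a backreference-free union, e.g.
        /(cat|dog)\\1/ -> /(catcat|dogdog)/. Only handles a single group
        of literal alternatives immediately followed by \\1."""
        return "(" + "|".join(re.escape(alt) * 2 for alt in alternatives) + ")"

    print(decompose_backref(["cat", "dog"]))  # (catcat|dogdog)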

But saying that backreferences cannot be supported by a pattern matching engine that uses something other than backtracking is like saying that because there is no general algorithm to prove any given program P halts, then no program can be proven to halt. But that's trivially demonstrated to be untrue. ∃ ≠ ∀

RE2 chooses not to support any PCRE-style backreferences, because RE2 is a pattern matching engine meant to support patterns that are actually regular, and backreferences are not regular.


> But saying that backreferences cannot be supported by a pattern matching engine that uses something other than backtracking is like saying that because there is no general algorithm to prove any given program P halts, then no program can be proven to halt. But that's trivially demonstrated to be untrue. ∃ ≠ ∀

Yeah, sorry, I mixed up the sense of the quantifiers a bit. More accurately: "If you are interested in regexes that use backreferences, or look-ahead/look-behind assertions, or similar things, then your regex engine must support a backtracking strategy, or you need to rewrite all the regexes you care about to not use any of these."

Is it possible to automate that morphism, i.e., is it possible to build a wrapper around RE2 that converts (in suitably bounded time) a regex using backreferences into an RE2-compatible regex if the language described is in fact regular, and prints an error otherwise?


I did something similar for word boundary zero-width assertions (i.e. \b and \B); that is, rewriting the PCRE by expanding permutations inline. The purpose was to compile as many PCREs as possible to DFAs using Ragel.

Fortunately, Ragel supported a limited form of assertions called semantic conditions, which it implements by expanding the code space of each state, allowing you to associate a C boolean expression as a predicate for entering that state. For various reasons the semantic condition needed to be attached to a terminal, not to subexpressions (i.e. groupings and alternations).

Also, expressions like (cat?|dog?)\b needed to be permuted similarly to the backreference problem, resulting in something like (ca\b|cat\b|do\b|dog\b). This strategy didn't work for Kleene stars (*), but usually a semantic condition could be attached to trailing terminals instead of leading terminals.
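
To illustrate the kind of rewrite I mean, here's a toy sketch (the real code worked on a parsed PCRE AST in Lua; this Python version only handles literal alternatives whose final character is optional):

    def expand_word_boundary(alternatives):
        """Expand alternatives so \\b attaches to each resulting terminal,
        e.g. (cat?|dog?)\\b -> (ca\\b|cat\\b|do\\b|dog\\b)."""
        expanded = []
        for alt in alternatives:
            if alt.endswith("?"):
                literal = alt[:-1]
                expanded.append(literal[:-1] + r"\b")  # without the optional char
                expanded.append(literal + r"\b")       # with the optional char
            else:
                expanded.append(alt + r"\b")
        return "(" + "|".join(expanded) + ")"

    print(expand_word_boundary(["cat?", "dog?"]))  # (ca\b|cat\b|do\b|dog\b)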

The actual code to perform these AST transformations was surprisingly simple. Excluding parsing of the PCREs into an annotated AST and generation of the code, it was maybe less than 200 lines of Lua.

The PCREs weren't simple, but most were only moderately complex. For a corpus of about 10,000 PCREs, about half used word boundary assertions. The vast majority of those (possibly even all of them... I can't remember) could be successfully transformed. With nested subexpressions (groups, alternations) the number of permutations grows rapidly, but I don't think our production corpus ever caused a problem in that regard, and only the most insane regular expressions nest subexpressions more than a few levels deep.

Before we tackled backreferences and other gnarly PCRE features I think we managed to coax better than 90% of our PCRE corpus into semantically equivalent Ragel-based DFAs. In other words, backreferences were rare. With 90%+ compiled completely to native machine code and the remainder simplified into prefilters, we saw greater than 10x performance improvement (> 1000%) over libpcre in terms of production throughput, not just benchmarks. (Ragel is awesome!) RE2 was usually faster than libpcre, but RE2's performance wasn't even remotely in the same league as the native machine code DFAs Ragel produced, so I quickly ditched the idea of using RE2 anywhere in the pipeline.

That last 10% or so took much longer to tackle, and eventually most of my original code was dumped. But getting to about 90% was surprisingly easy. We did eventually get (and continue to maintain) 100% transformation to native machine code, but a small percentage still require backtracking. All in all I _think_ we're at 20x to 30x over baseline libpcre; that is, tackling the last 10% put us 2x or 3x atop the 10x. It wasn't a simple case of diminishing returns, as the most difficult expressions to transform also tended to be the most costly at runtime, but if we had stopped early on at 90% it still would have been a resounding success.

Regarding RE2, it's important to note that once expressions were transformed into proper regular expressions, they could easily be joined into a single union. Setting aside the bytecode vs machine code differences, RE2 just isn't capable of compiling huge unions of expressions (on the order of 100s or 1000s), whereas for Ragel it was a relative breeze. Another team was using re2c (not RE2) for a project, and it turned out that what literally took re2c DAYS to compile took Ragel only SECONDS. GCC and clang also became bottlenecks compiling the Ragel-generated code. Interestingly, they both, coincidentally, exhibited (and still exhibit, AFAIK) quadratic complexity when parsing certain C constructs, such as initializer lists, so that necessitated some workarounds.


Neat! What was the application? Or was it pure research?


For a content filtering/scanning service.

By the time we wrapped up this project Intel finally released Hyperscan as Open Source. Hyperscan implements these and other transformation tricks, not to mention the SIMD optimizations.

However, Hyperscan doesn't have as strong PCRE compatibility as what we ended up with--effectively 100%--and would have necessitated keeping libpcre as a failover. Hyperscan is substantially faster than Ragel for small-to-medium length expressions thanks to their SIMD optimizations, but with huge, unioned expressions (larger than what Hyperscan can handle) Ragel-generated code is on par.

Starting from scratch today, and presuming Hyperscan is still well-maintained, it would be most practical to build a solution around Hyperscan. Especially if you need capturing, as capturing expressions can't be unioned effectively. Ragel, however, makes a ton of sense for many tasks other than merely speeding up PCRE matching.

What makes Ragel unique is:

1) The way it integrates with the host language (C, C++, Java, etc). It allows executing host-language expressions at state machine transitions. It doesn't directly tie into the host language's typing model (e.g. the way Boost's template-based regular expressions do), but for various reasons that's hardly a liability and often an asset.

2) It allows yielding and resumption of matching at any and all input stream boundaries, with the runtime machine state being a simple integer. (The application keeps any additional state however it wants.) Most regular expression tools require a complete string to match against, while a Ragel machine can be iteratively applied to an infinite stream of data. That capability wasn't useful in this scenario, but it's extremely useful for concisely, performantly, and securely implementing network protocols, for example. In a low-level language like C, and especially with asynchronous I/O, this can drastically simplify the task of I/O and buffer management. You could parse (if you wanted) an entire HTTP stream in a single logical pass, without having to deal with line buffering, header continuations, etc. as separate concerns in your code, greatly simplifying networking code. In some sense it can be thought of as a complement, if not an alternative, to callback schemes, futures/promises, or green threading (a la Go).
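
To illustrate what "state is a simple integer" buys you, here's a toy hand-rolled machine in the same spirit (not Ragel output; the pattern and state names are just for the example):

    # A toy recognizer for "ca+t" whose entire runtime state is one integer,
    # so matching can be suspended and resumed at any input boundary.
    START, SAW_C, SAW_A, MATCHED, DEAD = range(5)

    def step(state, ch):
        if state == START:
            return SAW_C if ch == "c" else DEAD
        if state == SAW_C:
            return SAW_A if ch == "a" else DEAD
        if state == SAW_A:
            if ch == "a":
                return SAW_A
            return MATCHED if ch == "t" else DEAD
        return state  # MATCHED and DEAD are absorbing

    def feed(state, chunk):
        for ch in chunk:
            state = step(state, ch)
        return state

    # The input can arrive split across arbitrary buffer boundaries:
    state = START
    for chunk in ("c", "aa", "at"):
        state = feed(state, chunk)
    print(state == MATCHED)  # True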


(Hyperscan team lead here)

Interesting. I should point out that Hyperscan is not abandonware; it is still maintained by Intel.

Streaming is not unique to Ragel.

You're not wrong about libpcre compatibility. We have very tight syntactic compatibility with libpcre (that is, we won't misidentify an unsupported construct and supply some erroneous semantics) but we make no secret of our inability to handle general lookarounds and backreferences.

I'm interested in how you went about handling backreferences in Ragel. They have been in our 'too hard' basket for years, although many instances of backrefs are tractable in our framework. It's always seemed easier to not do any of them rather than handle just a few.

Ragel certainly ties into the host language differently, and more tightly, than Hyperscan. We use it ourselves, to lex regexes on the way into Hyperscan, as it happens.


As I alluded to earlier, the 100% (possibly +/- edge cases) solution was finished primarily by a contractor.[1] So the following only reflects the state of things while my Lua parser, analyzer, and transformation code was still in use. Also, my grasp of this stuff is nowhere near as strong as yours or the contractor's. I'm a pretender ;)

Backreferences are implemented with prefilters, NFA machines, and what are effectively zero-width assertions. It's not sexy, and the rest is easy to sketch out from there as a conceptual matter. What's more interesting are the Ragel-specific details.

Before we tackled backreferences we had a problem with repetitions causing unacceptable state blow-up in combination with unioned DFAs and semantic conditions, and especially with their interactions. This was the biggest barrier for most of the final 10%. To help solve that, the contractor added an NFA construct to Ragel, along with the magic needed for the generated Ragel state machines to switch from DFA to NFA modes mid-expression. The NFA construct expands on the notion of embedding host-language code in transition actions and semantic conditions, which allows embedding inline C code to track and control the matching behavior of each subexpression, as well as to handle bookkeeping, including state stack management.

Most simple captures can be accomplished using Ragel's regular action features. But IIRC even captures that could be done from a DFA machine would sometimes blow up state too much, so I think the captures necessary for backreferences are always done in NFA mode. Similarly, the code is quick to fall back to NFA mode for expressions without backreferences, as that simplified grouping some of the complex expressions into secondary unions, which allowed compiling their prefixes as a DFA, as opposed to immediately falling back to prefilters and a linear match of each expression that couldn't be added to the primary or secondary unions.

Matching of backreferences is implemented similarly to lookarounds. Lookarounds, at least in my early code, are implemented using semantic conditions. In my later code I think lookbehinds still had to be a fixed size, but it looks like (bugs notwithstanding) backreferences could be variable sized. The Ragel match is anchored on the empty string at the relevant position using the NFA construct, auxiliary code handles matching against the capture, and on a match the `p' index variable is updated to skip over the length of the match. The NFA construct saves and restores `p' as it trials each subexpression. Of course this approach doesn't work well in streaming mode. Similar to Ragel scanners, I think the host application could theoretically arrange for windowing, but I don't know if everything is in place to make that feasible. In our case we didn't need the streaming.

And at least in my code, anything with backreferences automatically gets a prefilter and is only run conditionally in a second phase. Presumably that's so we don't need to do captures for stuff that will never match, whether or not it involves an NFA. I'm sure that's how it's still implemented.

Regarding streaming, I totally forgot about Hyperscan's streaming mode. Streaming isn't something that (IME) can be readily bolted on after the fact, so it's a testament to its careful design and forethought.

But in C, especially in a complex networking application doing asynchronous I/O, callbacks are extremely costly--they break code locality and compound the normal headaches with ownership and lifetime. Managing that complexity is of utmost importance in a language like C. Using Hyperscan in streaming mode, especially in combination with DPDK... I imagine the project failure rate is really high. I know of at least one skillful team that bombed precisely as I predicted because of the excessive complexity and the resulting inability to stabilize the product. The callback/push/interrupt model is too difficult in C. In languages with lambdas, Hyperscan's streaming mode has no handicap. But especially in C, Ragel's idiosyncratic host-language interface, with its embedded action handling and minimal state, is just so incredibly important for reducing complexity and for sticking as closely as possible to a pull model for data consumption without sacrificing performance. If developers spend most of their time working on and fixing the I/O code and related interface boundaries (as is typical), there's much less time for implementing the functional features that matter. I don't doubt people have built a ton of amazing things with DPDK and Hyperscan's streaming mode, but the teams that can do that _correctly_ are few and far between. Honestly, I wouldn't want a solution like that protecting my data unless I knew the team had the requisite skill.

On second thought... I should reel that criticism in a little bit. ;) For just matching and cataloging "hits", rather than performing complex operations on them, it's much less of a problem. The costs will be a function of how much work is done on the other side of the callback. But it's just so much more difficult to balance efficiency and complexity, and to constrain complexity in general. I've been writing asynchronous I/O services since before libevent and I've just become constitutionally averse to callbacks. I've never used libevent's buffering API because before it even debuted I had internalized how poorly suited that sort of interface is in terms of interface composition. Callbacks are necessary--e.g. at the interface boundary of the event loop--but I try so hard to avoid them when possible, including switching to languages other than C when possible. It's just a huge consideration for me, perhaps unreasonably. :)

Also, I understand that Hyperscan's callback interface is largely dictated by other considerations, like the perfectly sensible choice for a runtime library interface. In streaming mode a generic queuing and retrieval interface would be awkward and suboptimal for most cases, and there's no good substitute for pushing that implementation detail to the application. DPDK, OTOH, wasn't forced to adopt an asynchronous interrupt model; it was just the easiest.

[1] I don't like to mention names of people not participating in a thread, even though in this case I'm sure you're familiar with him and he probably would appreciate the plug. You may have conversed with him at some point because your SIMD work is awesome, Hyperscan in general is amazing, and when it was finally open sourced we had to take a really hard look at adopting it or, alternatively, whether we should pursue SIMD optimizations on our own. And IIRC at some point he may have chatted with someone on the Hyperscan team.


Thanks for the kind words about Hyperscan. If you'd like to talk more, please get in contact (via the list, via our support alias, or our emails if you have them already).

I'm curious about the design issues w.r.t callbacks. We certainly haven't had complaints from most of our users about this, but maybe there would have been better models. Note that callbacks are partly a design pattern for our convenience - we need those big, complex stack frames for many of our engines. It's nice to throw a match out of the middle of a big unrolled loop and know that we can come back to the middle of the loop - not into the tedious 'preconditioning and warmup' part of the loop.

An alternate design is to fill a buffer of matches, but that has its own issues, especially with timeliness. Some users want to stop matching quickly after the first match - a problem if we go off and scan another megabyte to get the second through Nth matches.

In any case, we're always interested to hear about cases where we aren't the best solution, as this is always the most productive type of feedback to get. "Your library is great and you guys are really smart" is gratifying, but it doesn't lend itself well to continuous improvement.


> It allows yielding and resumption of matching at any and all input stream boundaries [whereas] Most regular expression tools require a complete string to match against

Yes! This is how I've always implemented from-scratch pattern matchers, even for toy systems. To be fair, about 75% of it comes down to it being the only way that feels natural. But on the other hand, even if it's something you don't need now, the alternative all-at-once approach is so inflexible that, should you ever come to need it, then it's basically impossible to refactor an existing implementation to have this kind of orthogonally-cutting "re-entrancy". So you'll have to either throw out the whole implementation and start over, or just try to ignore it and be forced to work around it as you bump against it time and time again as punishment for not doing it right the first time.


You can run into theoretical worst-case problems similar to what can happen when converting an arbitrary NFA to a DFA. See beeforpork's comment:

https://news.ycombinator.com/item?id=13603461

You'd want to target a restricted subset containing patterns that use only certain "tractable", er... motifs.[1] It would make for an interesting project, specifically in how much you have to restrict your focus while still producing something useful. I'm not going to pretend that I've got an up-to-date perspective that includes all available research in this area. Something like this may have already been done.

1. For lack of a better word. (I'm trying to avoid using the word "pattern" multiple times in the same sentence, where each has a different meaning.)


There are features of some regular expressions for which the only known solution is backtracking. If you want those features then you "require backtracking".


Out of interest, what are some of these? I have a hard time believing that the implementors of the Perl regex engine chose to write it that way for no reason, given how the Thompson NFA figures are thrown about. I knew there must have been something this 'implementation detail' was good for.


An easy example is matching palindromes. You simply can't match a palindrome by moving forward only; you have to go back and see if every letter matches. So, if you want to search for the longest palindrome in a string, you'll necessarily be doing a lot of backtracking.

There's no RE2-compatible regular expression for matching palindromes of arbitrary length, but additional features found in PCRE and similar "regex" engines can do it with backreferences or with look-around assertions. See http://stackoverflow.com/q/3746487 and http://stackoverflow.com/q/3664881 for two ways to write such a regex.
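
For the fixed-length case a plain backreference is enough; here's a small illustration using Python's backtracking re module (the unbounded-length versions need the recursion or lookaround tricks in the linked answers):

    import re

    # Backreferences make 5-character palindromes easy to express:
    # groups 1 and 2 must reappear in reverse order at the end.
    five_char_palindrome = re.compile(r"^(.)(.).\2\1$")

    print(bool(five_char_palindrome.match("level")))  # True
    print(bool(five_char_palindrome.match("radar")))  # True
    print(bool(five_char_palindrome.match("hello")))  # False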


I believe backreferences require backtracking.


No, but variable-size lookahead/lookbehind do. This is because the engine has to go back if the remaining part of a regex fails. For some examples, see http://www.regular-expressions.info/recursebacktrack.html

EDIT: you are correct, backreferences do require backtracking, my bad.



A feature of certain types of regular expression engines. It allows for certain kinds of expressions, but at the cost of possibly going exponential if you aren't careful about your expression. [1]

1: http://www.regular-expressions.info/catastrophic.html
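
To see "going exponential" concretely, here's the classic nested-quantifier example discussed on that page, run through Python's backtracking re module (exact timings will vary; each extra 'x' roughly doubles the work):

    import re
    import time

    # (x+x+)+y has exponentially many ways to split a run of x's between
    # the quantifiers, all of which are tried before the match can fail.
    pattern = re.compile(r"(x+x+)+y")
    subject = "x" * 26  # no 'y', so every split is explored before failing

    start = time.perf_counter()
    print(pattern.match(subject))  # None, eventually
    print(f"{time.perf_counter() - start:.2f} seconds")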




