Badly needs static analysis. Coverity is free for open source projects, will find many of the same issues, and produces reports that directly identify the buggy code, rather than the crash, which might come thousands of instructions later in a completely different layer, requiring extensive reverse engineering to identify the source. And then when you've fixed one, the other test cases all need to be re-tested because they might have come from the same root cause, despite crashing in different locations.
Particularly when the hit rate is very high, fuzzing is a stupidly inefficient way to find bugs.
Assert like mad. Break out your debug allocators, your ASan, your checked iterators. Fuzzing can be as bad as you say, but it can also give you extremely easy-to-diagnose repro cases - and catch an incredibly wide range of issues for incredibly little work, in my experience.
Static analysis generally won't tell you "Hey, if I almost-but-not-quite exhaust your address space while allocating to successfully parse this file, I can get your next closed-source API call to segfault!" Fuzzing, on the other hand, is useful for tracking down each and every possible allocation related to network deserialization so you can configure sane quotas (such as "an array size probably shouldn't be larger than the number of bytes sent; bail out safely instead of trying to allocate that").
With most sane setups that come to mind, no false positives either.
> And then when you've fixed one, the other test cases all need to be re-tested because they might have come from the same root cause, despite crashing in different locations.
Your tooling should do this automatically for you. If it doesn't, get better tooling. AFL. SDL MiniFuzz. Whatever floats your boat.
> Particularly when the hit rate is very high, fuzzing is a stupidly inefficient way to find bugs.
I've had it be stupidly efficient in terms of programmer time once setup. Occasionally leaving my computer on overnight? Worth.
Any Rust lovers out there: could I ask you to do a benchmark comparison and a fuzz comparison? I'd be genuinely interested in the result, and if (as you might hope) Rust's regex is as fast as boost::regex and never crashes, that would persuade at least me to finally learn some Rust!
You can look at a well known (but not very complete) benchmark comparison here [0]. Rust wins; the fastest Boost program is "C++ g++ #3" and takes 8.5 times as long, and the fastest C++ implementation (using RE2) takes twice as long.
I don't know of a fuzz comparison, but there has been fuzzing done on the Rust library without finding anything bad, e.g. see this issue [1].
Considering the C implementation using PCRE is 4.1x Rust, whereas PHP (implemented in C, also using PCRE) is 1.2x Rust, that makes me think that this benchmark is… unhelpful at best.
The fastest C implementation uses Tcl's simplified regexes. However, http://lh3lh3.users.sourceforge.net/reb.shtml indicates that they found that, generally, Oniguruma and RE2 trounce Tcl.
The regex-dna benchmark biases toward regex engines with aggressive literal optimizations. Namely, there are 21 distinct regexes in the benchmark. Of those 21, only 1 of them actually uses Rust's core regex engine. The rest are compiled down to literal searches. Even in that one case, literal optimizations are used to speed it up.
I'm not familiar with Tcl's engine, but if it also applies aggressive literal optimizations, then that could explain its performance. However, given a more fine-grained analysis of regex-dna[1], that seems unlikely. The C program is large, so perhaps there is more to it than meets the eye.
The perf difference between PHP w/ PCRE and C w/ PCRE has always perplexed me. But there are... so many variables. Maybe PHP enables JIT in PCRE (the C program does not), or perhaps PHP's preg_replace is doing something clever, e.g., by combining all of the replacement regexes into one.[2]
> One of the big pieces that can easily be glossed over, especially during performance comparisons, is unicode handling.
Indeed. The regex-dna benchmark does not require any Unicode handling at all. It's a strictly ASCII based benchmark.
If your regex engine uses finite state machines, then you can typically build your encoding into the machines themselves, which results in little or no performance degradation in matching in the common case. (The case where it matters is if your regex is large, e.g., `\pL{100}` is large indeed.)
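To make that concrete, here's a small sketch using Rust's regex crate (assuming any recent version of the crate; the haystack is just made up): `\pL` is a Unicode letter class, and its UTF-8 encoding is baked into the automaton when the regex is compiled.

    use regex::Regex;

    fn main() {
        // \pL is "any Unicode letter"; the UTF-8 encoding of that class is
        // compiled into the byte-level machine, not checked per character.
        let re = Regex::new(r"\pL+").unwrap();
        let haystack = "héllo, κόσμε, 世界!";
        let words: Vec<&str> = re.find_iter(haystack).map(|m| m.as_str()).collect();
        assert_eq!(words, vec!["héllo", "κόσμε", "世界"]);
    }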
Issue 203 indicates that the fuzzing was done, but doesn't link to the issues that were uncovered by fuzzing. The PR at https://github.com/rust-lang/regex/pull/262 does: one infinite loop, four panics, and one instance of the NFA reporting the wrong position of a match.
It's testing each language's implementation of regexes. A lot of the languages are probably linking in a regex library written in C or C++ anyways. It's evidence that if you're writing something that's heavy on regexes, any slowdown probably isn't due to the language itself but its library implementation.
Indeed, it is the regex implementation that is being tested (and so the language of that implementation is relevant), which is exactly what this discussion is about and also why that comment is careful to be specific about which programs are using which regex implementation.
Which resulted in just one issue? I'm not sure how long they fuzzed or what the methodology was, but this was the panic they found (still not a memory safety issue, more akin to an unchecked exception in Java than a crash in C++):
Since that issue is from the same time as Rust's 1.0 release, I suspect that either it hasn't been run again recently or that things are pretty stable in the regex crate w.r.t. fuzzing.
If you were interested in performance you probably would not have been using boost::regex to begin with. RE2 is often an order of magnitude faster. You might choose boost if you require backtracking, but that's crazy anyway due to exponential time.
In my experience, the exponential time thing isn't really a big deal. I've used Perl regular expressions on a very regular basis for about 16 years now. Exponential time has been an issue only once. Obviously if I were accepting regular expressions from random people, I'd use RE2. But for my day to day purposes, it's pretty much a complete non issue.
re2 is not only about exponential time: matching a regex like a|b|c is O(N) in the number of alternatives in a backtracking engine and O(1) in a DFA-based engine like re2. It can make a big difference in practice for generated regexes - e.g. if you want to check whether a URL contains one of a thousand substrings (think adblock-like use cases). With a backtracking regex or with a loop it'd be O(N) in the number of alternatives, but with a DFA it is O(1) in the number of alternatives. I've seen 1000x speedups for similar use cases with re2 vs. re from Python's stdlib.
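A rough sketch of what I mean, using Rust's regex crate as a stand-in for re2 (both are finite-automaton based; the `trackerN.example` substrings are made up for illustration):

    use regex::Regex;

    fn main() {
        // Pretend these are the thousand "bad" substrings from a blocklist.
        let needles: Vec<String> =
            (0..1000).map(|i| format!("tracker{}.example", i)).collect();

        // One big alternation of escaped literals. A finite-automaton engine
        // scans the haystack once, regardless of how many alternatives exist.
        let pattern = needles
            .iter()
            .map(|s| regex::escape(s))
            .collect::<Vec<_>>()
            .join("|");
        let re = Regex::new(&pattern).unwrap();

        assert!(re.is_match("https://cdn.tracker742.example/pixel.gif"));
        assert!(!re.is_match("https://example.org/index.html"));
    }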
From my understanding the main benefit of RE2 is not speed, but linearly scaling execution time with respect to the input size, along with bounded memory usage. For certain inputs, it may outperform other engines, but the converse may also be true. As with any feature that may be abused, backtracking may also be useful: for example, you may need to write a script to munge text. Since you're not exposing it to arbitrary user input, it's a reasonable feature.
RE2 is not dramatically faster than all other regex implementations, but it is dramatically faster than boost::regex, which is among the slowest I've ever tested.
All regexes run in O(N) where N is the length of the string matched.
But some regex engines accept non-regular expressions. [0] The usual notation for it is an escaped number: \1 or \2 or so on. They're used to refer back to capturing groups earlier in the expression, usually marked by parentheses.
Regular expressions don't have backreferences but various enhanced expressions add them. If you use those extensions, you are in danger of exponential execution time unless you are careful and know what you're doing. In particular you should know not to use regular expressions as your principal tool to build a parser.
It's worth pointing out that if you're using a regex engine that only uses backtracking, then you can't assume all regular expressions take linear time. For example, running `(a*)*c` against `aaaaaaaaaa` takes exponential time in the number of `a` characters even though it is regular.
A hybrid regular expression engine could, in theory, recognize that a particular expression is regular and therefore use a finite state machine to guarantee linear time and space execution (where the size of the regex is held constant).
But unfortunately, converting a non-deterministic finite automaton (i.e., regexp) to a deterministic finite automaton (i.e., an engine that can do matches in linear time) may take exponential time and/or space.
Yet, I should add, flex does that with extraordinary success. Most grammars are not that bad, it seems.
Executing an NFA on search text takes linear time and space, so what I said is true. ;-) In practice, it is hard to make NFA execution as fast as backtracking engines. (PCRE famously implements an NFA, calls it a DFA, and uses that to declare that the DFA engine is slow, which is incredibly misleading.[1] Thank you, Mr. Friedl. sigh)
Production grade regex engines with a DFA (like GNU grep, RE2 and Rust's) do conversion lazily. By doing it lazily, at most one new DFA state is added for each byte in the input in the worst case, which maintains the linear time bound. Unfortunately, this can result in memory growth proportional to the search text, which is why all such implementations use a fixed-size cache of states that is flushed once it's full. It works well in practice, but can slow down dramatically (to about the speed of an NFA) if the cache of states needs to be flushed frequently. The most common provoker of such behavior is large counted repetitions, e.g., `\pL{100}`.
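To sketch just the cache policy described above (a toy in Rust; `NfaStateSet` and the whole struct are made up for illustration and omit everything a real engine needs, like transitions and match states):

    use std::collections::HashMap;

    // The key stands in for a set of NFA states (the "powerset" identity of a
    // DFA state); a real engine also stores transitions and match info.
    type NfaStateSet = Vec<u32>;

    struct LazyDfaCache {
        states: HashMap<NfaStateSet, usize>, // NFA state set -> DFA state id
        capacity: usize,
    }

    impl LazyDfaCache {
        fn get_or_insert(&mut self, key: NfaStateSet) -> usize {
            if let Some(&id) = self.states.get(&key) {
                return id; // hot path: this DFA state was already built
            }
            if self.states.len() >= self.capacity {
                // Cache full: flush everything. A real engine also discards the
                // transitions pointing at flushed states and rebuilds them
                // lazily, which is why frequent flushes degrade to ~NFA speed.
                self.states.clear();
            }
            let id = self.states.len();
            self.states.insert(key, id);
            id // at most one new state per input byte keeps matching linear
        }
    }

    fn main() {
        let mut cache = LazyDfaCache { states: HashMap::new(), capacity: 4 };
        // Fake "state sets" keyed off the input byte, purely for demonstration.
        for byte in b"abracadabra" {
            cache.get_or_insert(vec![*byte as u32]);
        }
        println!("states cached: {}", cache.states.len());
    }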
Looks like there's (a lot of) confusion in these comments about the difference between backtracking and backreferences.
The `\2` in Phritzy's snippet is a backreference.
Backtracking is an implementation strategy for writing a regular expression engine.
I don't know why anyone choosing an engine would "require backtracking". It's an implementation detail, not a feature. (Although the fact that Thompson NFAs avoid exponential time complexity inherent to backtracking is something that I suppose could be considered a feature.)
"As a matter of principle, RE2 does not support constructs for which only backtracking solutions are known to exist. Thus, backreferences and look-around assertions are not supported."
From your link:
One common regular expression extension that does provide additional power is called backreferences. A backreference like \1 or \2 matches the string matched by a previous parenthesized expression, and only that string: (cat|dog)\1 matches catcat and dogdog but not catdog nor dogcat. As far as the theoretical term is concerned, regular expressions with backreferences are not regular expressions. The power that backreferences add comes at great cost: in the worst case, the best known implementations require exponential search algorithms, like the one Perl uses. Perl (and the other languages) could not now remove backreference support, of course, but they could employ much faster algorithms when presented with regular expressions that don't have backreferences, like the ones considered above. This article is about those faster algorithms.
If your regex has backreferences, or look-ahead/look-behind assertions, or similar things, then your regex engine must support a backtracking strategy.
If you do not need backreferences or similar things, then your regex engine may still choose to use backtracking, but does not need to.
RE2 is a regex engine that prioritizes never using backtracking as an implementation detail, and therefore does not support backreferences as a matter of external API.
> If your regex has backreferences, or look-ahead/look-behind assertions, or similar things, then your regex engine must support a backtracking strategy.
There's a morphism that maps Phritzy's pattern to an equivalent pattern that can be matched with a Thompson NFA. (It's a similar strategy to the one that allows you to take an NFA and turn it into a DFA.) You can perform a sort of "static decomposition" on a subset of patterns containing backreferences, such that /(cat|dog)\1/ for example decomposes into /catcat|dogdog/. The caveat is that this isn't fully generalizable for all patterns containing backreferences.
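A quick sanity check of that decomposition with a non-backtracking engine (Rust's regex crate here, since it rejects backreferences much like RE2 does):

    use regex::Regex;

    fn main() {
        // Backreferences aren't regular, so the crate refuses them outright.
        assert!(Regex::new(r"(cat|dog)\1").is_err());

        // The statically decomposed form is a plain regular expression.
        let re = Regex::new(r"^(catcat|dogdog)$").unwrap();
        assert!(re.is_match("catcat"));
        assert!(re.is_match("dogdog"));
        assert!(!re.is_match("catdog"));
        assert!(!re.is_match("dogcat"));
    }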
But saying that backreferences cannot be supported by a pattern matching engine that uses something other than backtracking is like saying that because there is no general algorithm to prove any given program P halts, then no program can be proven to halt. But that's trivially demonstrated to be untrue. ∃ ≠ ∀
RE2 chooses not to support any PCRE-style backreferences, because RE2 is a pattern matching engine meant to support patterns that are actually regular, and backreferences are not regular.
> But saying that backreferences cannot be supported by a pattern matching engine that uses something other than backtracking is like saying that because there is no general algorithm to prove any given program P halts, then no program can be proven to halt. But that's trivially demonstrated to be untrue. ∃ ≠ ∀
Yeah, sorry, I mixed up the sense of the quantifiers a bit. More accurately: "If you are interested in regexes that use backreferences, or look-ahead/look-behind assertions, or similar things, then your regex engine must support a backtracking strategy, or you need to rewrite all the regexes you care about to not use any of these."
Is it possible to automate that morphism, i.e., is it possible to build a wrapper around RE2 that converts (in suitably bounded time) a regex using backreferences to RE2 if the language described is in fact regular, and prints an error otherwise?
I did something similar for word boundary zero-width assertions (i.e. \b and \B); that is, rewrite the PCRE by expanding permutations inline. The purpose was to compile as many PCREs as possible to DFAs using Ragel.
Fortunately, Ragel supported a limited form of assertions called semantic conditions, which it implements by expanding the code space of each state and allows you to associate a C boolean expression as a predicate for entering that state. For various reasons the semantic condition needed to be attached to a terminal, not to subexpressions (i.e. groupings and alternations).
Also, expressions like (cat?|dog?)\b needed to be permuted similar to the backreference problem, resulting in something that might look like (ca\b|cat\b|do\b|dog\b). This strategy didn't work for Kleene stars (*), but usually a semantic condition could be attached to trailing terminals instead of leading terminals.
The actual code to perform these AST transformations was surprisingly simple. Excluding parsing of PCREs into an annotated AST and generation of the code, less than 200 lines of Lua code, maybe.
The PCREs weren't simple, but most were only moderately complex. For a corpus of about 10,000 PCREs about 1/2 used word boundary assertions. The vast majority of those (possibly even all of them... I can't remember) could be successfully transformed. With nested subexpressions (groups, alternations) the number of permutations rapidly grows, but I don't think our production corpus ever caused a problem in that regard, and only the most insane regular expressions nest subexpressions more than a few deep.
Before we tackled back references and other gnarly PCRE features I think we managed to coax better than 90% of our PCRE corpus to transform into semantically equivalent Ragel-based DFAs. In other words, back references were rare. With 90%+ compiled completely to native machine code and the remainder simplified into prefilters we saw greater than 10x performance improvement (> 1000%) over libpcre in terms of production throughput, not just benchmarks. (Ragel is awesome!) RE2 was usually faster than libpcre, but RE2's performance wasn't even remotely in the same league as the native machine code DFA Ragel produced, so I quickly ditched the idea of using RE2 anywhere in the pipeline.
That last 10% or so took much longer to tackle, and eventually most of my original code was dumped. But getting to about 90% was surprisingly easy. We did eventually get (and continue to maintain) 100% transformation to native machine code, but a small percentage still require backtracking. All-in-all I _think_ we're at 20x to 30x over baseline libpcre; that is, tackling the last 10% put us 2x or 3x atop the 10x. It wasn't a simple case of diminishing returns as the most difficult expressions to transform also tended to be the most costly at runtime, but if we had stopped early on at 90% it still would have been a resounding success.
Regarding RE2, it's important to note that once expressions were transformed into proper regular expressions, they could be easily joined into a single union. Setting aside the bytecode vs machine code differences, RE2 just isn't capable of compiling huge unions of expressions (on the order of 100s or 1000s), whereas for Ragel it was a relative breeze. Another team was using re2c (not RE2) for a project, and it turned out that what literally took re2c DAYS to compile only took SECONDS for Ragel to compile. GCC and clang also became bottlenecks compiling the Ragel-generated code. And interestingly they both, coincidentally, exhibited (and still exhibit, AFAIK) quadratic complexity when parsing certain C constructs, such as initializer lists. So that necessitated some workarounds.
By the time we wrapped up this project Intel finally released Hyperscan as Open Source. Hyperscan implements these and other transformation tricks, not to mention the SIMD optimizations.
However, Hyperscan doesn't have as strong compatibility for PCRE as what we ended up with--100% effectively--and would have necessitated keeping libpcre as a failover. Hyperscan is substantially faster than Ragel for small-to-medium length expressions thanks to their SIMD optimizations, but with huge, unioned expressions (larger than what Hyperscan can handle) Ragel-generated code is on par.
Starting from scratch today and presuming Hyperscan is still well-maintained, it would be most practical to build a solution around Hyperscan. Especially if you need capturing, as capturing expressions can't be unioned effectively. Ragel, however, makes a ton of sense for many other tasks than merely speeding up PCRE matching.
What makes Ragel unique is:
1) The way it integrates with the host language (C, C++, Java, etc). It allows executing host-language expressions at state machine transitions. It doesn't directly tie into the host language typing model (e.g. like Boost's template-based regular expressions), but for various reasons it's hardly a liability and often an asset.
2) It allows yielding and resumption of matching at any and all input stream boundaries, with runtime machine state being a simple integer. (Application keeps any additional state however it wants.) Most regular expression tools require a complete string to match against, while a Ragel machine can be iteratively applied to an infinite stream of data. That capability wasn't useful in this scenario, but it's extremely useful for concisely, performantly, and securely implementing network protocols, for example. In a low-level language like C, and especially with asynchronous I/O, this can drastically simplify the task of I/O and buffer management. You could parse (if you wanted) an entire HTTP stream in a single logical pass, without having to deal with line buffering, header continuations, etc as separate concerns in your code, greatly simplifying networking code. In some sense it can be thought of as a complement if not an alternative to callback schemes, futures/promises, or green threading (a la Go).
Interesting. I should point out that Hyperscan is not abandonware; it is still maintained by Intel.
Streaming is not unique to Ragel.
You're not wrong about libpcre compatibility. We have very tight syntactic compatibility with libpcre (that is, we won't misidentify an unsupported construct and supply some erroneous semantics) but we make no secret of our inability to handle general lookarounds and backreferences.
I'm interested in how you went about handling backreferences in Ragel. They have been in our 'too hard' basket for years, although many instances of backrefs are tractable in our framework. It's always seemed easier to not do any of them rather than handle just a few.
Ragel certainly ties into the host language differently, and more tightly, than Hyperscan. We use it ourselves, to lex regexes on the way into Hyperscan, as it happens.
As I alluded to earlier, the 100% (possibly +/- edge cases) solution was finished primarily by a contractor.[1] So the following only reflects the state of things while my Lua parser, analyzer, and transformation code was still in use. Also, my grasp of this stuff is nowhere near as strong as yours or the contractor's. I'm a pretender ;)
Backreferences are implemented with prefilters, NFA machines, and what are effectively zero-width assertions. It's not sexy, and the rest is easy to sketch out from there as a conceptual matter. What's more interesting are the Ragel-specific details.
Before we tackled backreferences we had a problem with unacceptable state blow-up from unioned DFAs, repetitions, semantic conditions, and especially their interactions. This was the biggest barrier for most of the final 10%. To help solve that the contractor added an NFA construct to Ragel, along with the magic needed for the generated Ragel state machines to switch from DFA to NFA modes mid-expression. The NFA construct expands on the notion of embedded host-language code in transition actions and semantic conditions, which allows embedding inline C code to track and control the matching behavior of each subexpression, as well as to handle bookkeeping, including state stack management.
Most simple captures can be accomplished using Ragel's regular action features. But IIRC even captures that could be done from a DFA machine would sometimes blow up state too much so I think the captures necessary for backreferences are always done in NFA mode. Similarly, the code is quick to fallback to NFA mode for expressions without backreferences as it simplified grouping some of the complex expressions into secondary unions, which allowed compiling their prefixes as a DFA; as opposed to immediately falling back to prefilters and a linear match of each expression that couldn't be added to the primary or secondary unions.
Matching of backreferences is implemented similarly to lookarounds. Lookarounds, at least in my early code, are implemented using semantic conditions. In my later code I think lookbehinds still had to be a fixed size, but it looks like (bugs notwithstanding) backreferences could be variable sized. The Ragel match is anchored on the empty string at the relevant position using the NFA construct, auxiliary code handles matching against the capture, and on a match the `p' index variable is updated to skip over the length of the match. The NFA construct saves and restores `p' as it trials each subexpression. Of course this approach doesn't work well in streaming mode. Similar to Ragel scanners I think theoretically the host application could arrange for windowing, but I don't know if everything is in place to make that feasible. In our case we didn't need the streaming.
And at least in my code, anything with backreferences automatically gets a prefilter and is only run conditionally in a second phase. Presumably that's so we don't need to do captures for stuff that will never match, whether or not it involves an NFA. I'm sure that's how it's still implemented.
Regarding streaming, I totally forgot about Hyperscan's streaming mode. Streaming isn't something that (IME) can be readily bolted on after the fact, so it's a testament to its careful design and forethought.
But in C, especially in a complex networking application doing asynchronous I/O, callbacks are so extremely costly--they break code locality and compound the normal headaches with ownership and lifetime. Managing that complexity is of utmost importance in a language like C. Using Hyperscan in streaming mode, especially in combination with DPDK... I imagine the project failure rate is really high. I know of at least one skillful team that bombed precisely as I predicted because of the excessive complexity and the resulting inability to stabilize the product. The callback/push/interrupt model is too difficult in C. In languages with lambdas, Hyperscan's streaming mode has no handicap. But especially in C, Ragel's idiosyncratic host-language interface, with its embedded action handling and minimal state, is just so incredibly important for reducing complexity and for sticking as closely as possible to a pull model for data consumption without sacrificing performance. If developers spend most of their time working on and fixing the I/O code and related interface boundaries (as is typical), there's much less time for implementing the functional features that matter. I don't doubt people have built a ton of amazing things with DPDK and Hyperscan's streaming mode, but the number of teams that can do that _correctly_ is few and far between. Honestly, I wouldn't want a solution like that protecting my data unless I knew the team had the requisite skill.
On second thought... I should reel that criticism in a little bit. ;) For just matching and cataloging "hits", rather than performing complex operations on them, it's much less of a problem. The costs will be a function of how much work is done on the other side of the callback. But it's just so much more difficult to balance efficiency and complexity, and to constrain complexity in general. I've been writing asynchronous I/O services since before libevent and I've just become constitutionally averse to callbacks. I've never used libevent's buffering API because before it even debuted I had internalized how poorly suitable that sort of interface is in terms of interface composition. Callbacks are necessary--e.g. at the interface boundary of the event loop--but I try so hard to avoid them when possible, including switching to languages other than C when possible. It's just a huge consideration for me, perhaps unreasonably. :)
Also, I understand that Hyperscan's callback interface is largely dictated by other considerations, like the perfectly sensible choice for a runtime library interface. In streaming mode a generic queuing and retrieval interface would be awkward and suboptimal for most cases, and there's no good substitute for pushing that implementation detail to the application. DPDK, OTOH, wasn't forced to adopt an asynchronous interrupt model, it was just the easiest.
[1] I don't like to mention names of people not participating in a thread, even though in this case I'm sure you're familiar and he probably would appreciate the plug. You may have conversed with him at some point because your SIMD work is awesome, Hyperscan in general is amazing, and when it was finally open sourced we had to take a really hard look at adopting it or, alternatively, whether we should pursue SIMD optimizations on our own. And IIRC at some point he may have chatted with someone on the Hyperscan team.
Thanks for the kind words about Hyperscan. If you'd like to talk more, please get in contact (via the list, via our support alias, or our emails if you have them already).
I'm curious about the design issues w.r.t callbacks. We certainly haven't had complaints from most of our users about this, but maybe there would have been better models. Note that callbacks are partly a design pattern for our convenience - we need those big, complex stack frames for many of our engines. It's nice to throw a match out of the middle of a big unrolled loop and know that we can come back to the middle of the loop - not into the tedious 'preconditioning and warmup' part of the loop.
An alternate design is to fill a buffer of matches, but that has its own issues, especially with timeliness. Some users want to stop matching quickly after the first match - a problem if we go off and scan another megabyte to get the second through Nth matches.
In any case, we're always interested to hear about cases where we aren't the best solution, as this is always the most productive type of feedback to get. "Your library is great and you guys are really smart" is gratifying, but it doesn't lend itself well to continuous improvement.
> It allows yielding and resumption of matching at any and all input stream boundaries [whereas] Most regular expression tools require a complete string to match against
Yes! This is how I've always implemented from-scratch pattern matchers, even for toy systems. To be fair, about 75% of it comes down to it being the only way that feels natural. But on the other hand, even if it's something you don't need now, the alternative all-at-once approach is so inflexible that, should you ever come to need it, then it's basically impossible to refactor an existing implementation to have this kind of orthogonally-cutting "re-entrancy". So you'll have to either throw out the whole implementation and start over, or just try to ignore it and be forced to work around it as you bump against it time and time again as punishment for not doing it right the first time.
You can run into similar theoretical worst-case problems as what can happen when converting an arbitrary NFA -> DFA. See beeforpork's comment:
You'd want to target a restricted subset containing patterns that use only certain "tractable", er... motifs.[1] It would make for an interesting project. Specifically, how much you have to restrict yourself in your focus while still aiming to produce something that's useful. I'm not going to pretend that I've got an up-to-date perspective that includes all available research in this area. Something like this may have already been done.
1. For lack of a better word. (I'm trying to avoid using the word "pattern" multiple times in the same sentence, where each has a different meaning.)
There are features of some regular expressions for which the only known solution is backtracking. If you want those features then you "require backtracking".
Out of interest, what are some of these? I have a hard time believing that the implementors of the Perl regex engine chose to write it that way for no reason while the Thompson NFA figures are thrown about. I knew there must have been something this 'implementation detail' was good for.
An easy example is matching palindromes. You simply can't match a palindrome by moving forward only; you have to go back and see if every letter matches. So, if you want to search for the longest palindrome in a string, you'll necessarily be doing a lot of backtracking.
There's no RE2-compatible regular expression for matching palindromes, but additional features as found in PCRE and similar "regex" engines can do it with backreferences or with look-around assertions. See http://stackoverflow.com/q/3746487 and http://stackoverflow.com/q/3664881 for two ways to write such a regex.
A feature of certain types of regular expression engines. It allows for certain types of regular expressions but at the cost of possibly going exponential if you aren't careful about your expression. [1]
As others have mentioned, Rust's regex library has been fuzz tested. Most of the bugs have been fixed. There are a couple outstanding that I hope to get to soon. None of them have been unsafe memory bugs. :-)
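For anyone curious what a fuzz target for this kind of library can look like, here's a hedged sketch in the cargo-fuzz/libFuzzer style (hypothetical file name; not necessarily the harness that was actually used, and real harnesses usually also cap the pattern size):

    // fuzz/fuzz_targets/regex_compile.rs (hypothetical)
    #![no_main]
    use libfuzzer_sys::fuzz_target;

    fuzz_target!(|data: &[u8]| {
        if let Ok(pattern) = std::str::from_utf8(data) {
            // Compiling exercises the parser and compiler; panics, hangs, and
            // (in unsafe code) memory errors under ASan all count as findings.
            if let Ok(re) = regex::Regex::new(pattern) {
                // Then run the compiled regex against a fixed haystack.
                let _ = re.is_match("the quick brown fox jumps over the lazy dog");
            }
        }
    });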
Benchmarking is a bit murkier. The benchmark game has already been linked, but the regex-dna benchmark can't really be used to judge the overall performance of regex engines. It's not that the regex-dna benchmark is bad on its own, but rather, it's not enough. You need more coverage. For example, regex engines with aggressive literal optimizations will do very well on regex-dna.
Rust's regex library does have a benchmark harness that compares it to many other popular regex engines: PCRE1 w/ JIT, PCRE2 w/ JIT, RE2, Tcl's engine and Oniguruma. In general, Rust's regex engine is competitive with PCRE{1,2} and RE2, but does quite a bit better than both Tcl and Oniguruma. I haven't added Boost to this harness, but in theory it would be relatively easy. You can see how I did it for RE2[1].
Unfortunately, the set of benchmarks in the harness---while probably better than any other benchmark I've seen---has two major problems with it:
1. It is biased. I wrote most of the benchmarks and I'm not terribly familiar with the types of optimizations performed by sophisticated backtracking engines. Moreover, many of the benchmarks were added to measure optimizations I added to Rust's regex engine, so there will naturally be more benchmarks where Rust does well.
2. The benchmarks themselves lack analysis. Microbenchmarks are too hard to interpret using numbers alone. Someone needs to do the work to explain the results. I've always intended to do this, but it is a gargantuan task.
I've just updated the repo to contain the benchmark results as of today[2]. The most interesting comparison is PCRE2, which is quite fast.[3] Most of the interesting benchmarks are defined in sherlock.rs[4], which also contains some Rust specific notes.
Another source of benchmarks is my blog post on ripgrep[5]. This isn't really a benchmark for regex engines, but rather, a benchmark for line oriented search tools. Nevertheless, it is a collection of data points.
It's not related to fuzzing, but this article is definitely an interesting one that involved a lot of regex in Rust: http://blog.burntsushi.net/ripgrep/
The Trac instance (and legacy SVN server) isn't really designed for slashdot (well, HN) effect traffic; it runs on some jiggly piece of rust at some uni somewhere.
Not that I necessarily think it'd avoid the problem even then, but I wouldn't call this modern C++ exactly. For one thing, it's (c) 2002, so written in C++98. And for another, it looks very, uh, C-ish (not that uncommon in the C++98 era). In the bad sense of lots of error-prone pointer-based string manipulation, stuff like this:
Maybe you're right that it's wrong to think that modern C++17 (unique_ptr, shared_ptr) written by experts is free of memory safety issues.
But in this specific case, it looks like old code from 2002[1]. No modern techniques such as std::unique_ptr. Just 1970s C style raw pointer manipulation. I suppose the contemporaneous compiler used by John Maddock would have been C++98?!?
The main windows compiler would have been Visual C++ 6, which was released before standardization, and had notable differences from C++98. It was also buggy for non-trivial templates. Visual C++ 2002 wasn't much better.
Well, yes if the strawman "all modern C++ written by experts is free from memory safety issues" is what you're countering. I find that to be gratuitous and petty, and not a good representation of Rust, however.
> Well, yes if the strawman "all modern C++ written by experts is free from memory safety issues" is what you're countering.
Well, I have seen exactly that sentiment. But, more importantly, this isn't exactly an obscure memory safety issue. It's a huge collection of flaws that showed up the instant Dmitry Vyukov threw a fuzzer at it. It's not just "all expertly-written C++ is free of memory safety issues" that this is a counterexample to: it's also a counterexample to "most C++ written by experts doesn't have memory safety issues that matter in practice".
> I find that to be gratuitous and petty, and not a good representation of Rust, however.
I haven't brought up Rust here. You can eliminate the memory safety issues by writing in Go, or Java, or C#, etc. etc.
I've seen it too but as a bunch of other people pointed out, the C++ in this library is not the C++ that sort of sentiment is about. This seems like the criticism you should be responding to.
Boost is peer-reviewed and receives more scrutiny than most C++ in the wild. Its peer review is its major selling point, in fact.
> Oh come on, you can't be serious given your advocacy for Rust.
I think it's possible for me to be able to express opinions in favor of memory safe programming languages in general (which is what I'm doing here) while also having worked on one.
If someone were to say "this illustrates why you should use Go for untrusted regexes", I'd agree with them!
Boost.regex was peer reviewed in 2001, the same year XP was released and three years before Firefox was released. Modern C++.
Let's talk about the state of security back then if you want and compare.
The disappointing thing in this is that they haven't done a security review and put a plan in place for handling security. Par for the course in the world of software development, sadly.
Safety may seem like a binary property, but it's really not.
Modern C++ is not as safe as Rust, but it is much safer than C, and significantly safer than doing manual memory management and raw pointer manipulation in C++. The interesting question is if that's enough for a particular project.
In general, I would argue that it is, because security is but one of the non-functional properties of software and the types of bugs that Rust prevents compared to modern C++ are but one category of security-relevant bugs.
The advantages that modern C++ can bring over Rust would compensate on average for its safety gap.
C on the other hand is simply a losing proposition safety-wise. Whatever I do, the type system won't do much if anything to help me prevent or catch bugs.
> the types of bugs that Rust prevents compared to modern C++ are but one category of security-relevant bugs
Memory safety bugs frequently result in basically the worst possible compromise imaginable: arbitrary remote code execution. It is possible to get RCE in other ways, but lack of memory safety makes it way way easier.
This doesn't change anything, you've just added another link at the end of the system > non-functional > security > code execution security chain.
Each project has the option at every link to decide that they're willing to accept a certain risk there and then tools delivering better results don't matter.
Memory safety is somewhat special because it's foundational: it is a prerequisite for any other sort of safety/security, as memory safety violations often can be(/are) exploited to trigger pretty much any other problem (e.g. RCE can do anything); other security problems are usually more constrained.
The thing is, and Rust advocates seem to consistently plug their ears when they hear it, that most software written specifically in C++ is very highly specialized stuff with a small, specific client set, where security is just not an issue.
According to the poll that JetBrains did when it started working on CLion, about half of C++ usage is in financial software. I will tell you first hand that in most applications in financial software, C++ servers will be talking to other internal servers, exchanges, reputable data distributors, etc. The idea that memory bug = security risk is just not a connection that people around here generally make.
I also know some game developers and my strong sense is that games are similar. Video games may talk to a specialized multiplayer server, an update server, and that's mostly it.
I could continue, but you get the idea. For most C++ developers I've ever encountered, memory issue == bug, != security risk.
Browsers, which operate on maximally complex data (a Turing-complete programming language, JS) and are by nature exposed to the entire world, including malicious users who want to harm other browser users, are simply not the typical domain for C++. I think there's every chance that Rust is a good fit for writing browsers, but outside of that things are much less clear (at least, at the moment).
Not sure what that has to do with anything. The parent is a very vocal Rust proponent and made a ridiculous comment which has been called out by myself and several others. It hasn't anything to do with any "small group of people" talking smack about C programmers. Although that's unfortunate, too.
I don't think it's ridiculous. It seems like almost every discussion about Rust vs C++ here has a few people saying that you don't need Rust's guarantees about memory safety if you're writing modern C++. And that oft-repeated comment is what pcwalton is referencing.
pcwalton is not framing the problem in a particularly useful way.
This is a question of risk management and his argument is basically that one should always reduce the risk of memory management errors to zero. Others say that they can tolerate some risk, as long as it's in acceptable margins, since it's expensive to totally eliminate it.
I don't think that lecturing everyone "No, you really want to have 0 risk, you fools" is a successful programming language advocacy strategy.
I wasn't talking about a specific example or these particular flaws, but in general, because this is just an instance of your generic argument that modern C++ is unsafe, where "unsafe" means not verifiably memory-safe.
That's correct, but you are framing the problem in an unrealistic way in which Rust wins by default. In reality, this type of memory-safety guarantees will be evaluated against other concerns and those other concerns might be more important.
For you memory-safety errors are unacceptable, that much is clear. For many they are more or less acceptable and insisting that they're unacceptable won't make them change their minds.
> lecturing everyone "No, you really want to have 0 risk, you fools"
On HN, please don't use quotation marks that make it look like you're quoting someone when you're not. It may seem a minor point, but we've found that it's important for clarity and respect.
Point taken about the quotes. It was meant as a figure of speech, but I can see how that could be misinterpreted.
"Lecturing everyone" is not at all name-calling. That's what he's been doing repeatedly, with code examples in several if not most threads related to C++ and safety.
I also don't consider pointing out to someone that they're lecturing offensive or abusive. In this case, this behaviour has derailed the thread and attracted criticism from many other people.
Memory-safety bugs can lead to RCEs, which are considered unacceptable risks to most in computer science. Additionally, as Rust shows, memory-safety bugs can be checked by a computer, which can consistently apply those checks, making it an excellent bang-for-buck to use Rust or a comparable memory-safety checker. That's what pcwalton is saying.
> How can you say these are considered unacceptable risks to most, when people write so much code in C and C++?
This argument was very compelling in, for example, 1997. But, nowadays, most code is written in memory safe languages. Choosing to write your next Unix daemon in Go or your next Windows app in C# is not exactly an uncommon choice in 2017.
> And that's what I'm trying to communicate: they're only unacceptable to pcwalton and the Rust community.
I'm not a member of the security community, but I have never seen anyone in that community disagree with my claim that C and C++ are unsuitable for writing secure software at scale (at least without a robust sandbox or restrictive subsets). I have seen the claim that sandboxes are sufficient to mitigate memory safety problems, but even in that case the comparison becomes "C and C++ with a correctly maintained, leak-proof sandbox" vs. "a memory-safe language", not "C and C++" vs. "a memory-safe language".
There exist entire industries built on C and C++ today. The security community may think that they're unsuitable, but they're not the ones building the software and making the decisions.
e.g: Mozilla exists because of C and C++. Take that away and Mozilla ceases to exist.
I don't know how to explain this more clearly: you are trying to enter an established market and "sell" a product based on its safety capabilities.
But your problem is that the market doesn't think it has a big safety problem, they think they can manage it. And instead of recognizing that and adapting your marketing efforts, you're just repeating the same safety pitch with more examples and details.
How has this been working for you, besides antagonizing the potential "customers"?
> But your problem is that the market doesn't think it has a big safety problem, they think they can manage it.
Is that not what his original comment was addressing though? There are a number of people who feel there is no safety problem, but examples like this are good indications otherwise. Boost is a very fundamental (as in low-level, not as in essential) library in many cases and having a memory safety problem at such a level should be concerning.
That's not what's happening, this is an old(er) discussion.
pcwalton is asserting in general that modern C++ is not memory-safe, with the goal of promoting Rust and discouraging C++ usage.
Now that assertion is true, but it's not really useful, because programming language choices are not done only on the basis of a language being completely memory-safe. It's also misleading to emphasize only this one aspect (while ignoring others) and to such an extent (complete vs. good-enough safety).
It is concerning that this regex library has security issues. People should review their usage of it. But the original comment is something between offtopic and being inflammatory for the sake of it.
"There are a number of people who feel there is no safety problem"
The thing to remember here is: no safety problem... big enough that one should abandon C++ for Rust instead of working on tooling and idioms to improve it. That's what pcwalton would like.
That's quite a leap. It's not as if the world is C, C++, and Rust. The community behind any managed language is going to include not just people who've never used C++, but also plenty who have, and who reject it for the same reasons that motivate the Rust folks try their hand at doing better.
It is ridiculous. Both because he doesn't qualify his targets the way you have and for the reason others noted (it's not modern C++). It's a gratuitous and inflammatory insult.
I don't fundamentally disagree with you (any C++ bigger than a screenful of code probably has memory safety issues), but contrary to its reputation, I've found Boost to be of really poor code quality.
If your program isn't memory safe, it's very often the case that someone can make your program run their program, at which point the kernel doesn't know that your program didn't intend to modify itself.
W^X/NX bits and other technologies don't totally obviate the issue, as ROP gadgets can be used to defeat it. And so on, there's a whole domain of computer science dedicated to that arms race and no evidence that it's stopping any time soon.
On the other hand, a Rust program without unsafe should simply never execute code outside of its defined control flow, it's not possible to hijack any control flow on the stack or heap, and so a running program under arbitrary user input can only ever explore execution paths that exist in the absence of memory unsafety. No ROP gadgets or heap/stack smashing, no overflows leading to overwriting a function definition in memory, no overwriting the parameters of a function to cause impossible branches to be taken, and so on.
I recall that there was at least one shipping video game where "someone" was "my future self" -- they exploited a buffer overflow after-the-fact in a shipped video game (a EULA dialog?) in order to make it run an updater.
> Not every program is intended to be connected to the internet or have user provided input.
While true, I think the vast majority of programs do do one of these two things (and increasingly, both), whereas the majority of programmers seem to think that what they work on "isn't a security issue".
It's this disconnect that is giving us the Internet of Crap. Your phone apps, your fridge, your video games, your text editor -- every program I interact with every day -- it's all a security issue
So there's no such thing as a C++ expert? That statement is dangerously close to a No True Scotsman: IME even people who seem to fit any reasonable definition of expert (committee members, compiler developers) still make memory-safety mistakes.
This is Poe's law territory, e.g. a lot of your own comments sound very similar to the parent, and there are other emoticons more commonly used to indicate sarcasm/joke.
Heh, I think that's possibly the only virtue to sarcasm: it confuses the pedants. I use friendly smileys or the word "heh" when I'm trying to be jovial, and I'm willing to give that poster the benefit of the doubt. :-)
> a lot of your own comments sound very similar to the parent
Which comments?
> there are other emoticons more commonly used to indicate sarcasm/joke
I think any common standard about something like this probably only applies to a narrow clique.
Given that we're talking about bugs, I don't think this is a valid argument. Being forced to avoid bugs is good, yes? Even if it's not necessarily less hard to make them in the first place (which is arguable, too!).
I mean, I spend most of my time coding searching for bugs, not avoiding them.
Avoiding bugs is the default in software development in any language, no? If you have a pointer DSL that can lead to various bugs, replacing it with a more complex pointer DSL is not a good way to avoid bugs. And I don't think C++ 'smart pointers' and other crap are very simple (relative to C pointers) or much less error-prone. That's just my opinion though.
Avoiding bugs is what you try to do by default in any language, yes. But success is not guaranteed :)
I agree that in general, if you have a pointer DSL that can lead to bugs, replacing it with a more complex one just makes things worse, assuming all else stays the same.
But! Maybe the extra complexity buys you something. For instance, maybe the complexity allows the compiler to prove that certain classes of bugs are impossible. Now we have a tradeoff. Let's say we want to get a correct program as fast as possible. Then you can quickly write a buggy program in the simple DSL, but you will need to spend some time debugging it. Or you can spend a long time wrestling with the complicated one, but you don't need to debug pointer issues anymore (you still have to fix other bugs of course.)
Then the question is: do you save more time in debugging than you pay in "fighting" a complex language. And the answer is always the same: maybe! Maybe you make less pointer bugs in C than most people, and maybe you're more allergic to smart pointers than most people, so for you it's pointless. Certainly pointer bugs are not very rare in the world though. Worse, they're not always noticed right away.
What you were doing is advocating Rust in a C++ thread, again. Safety first, and all that. Congratulations on getting voted to the top of this topic, but it's tiresome.
Let's pretend C++ is a car. I get in my car, and I drive somewhere. Yes, there are hundreds of thousands of accidents per year, but really, most people get where they want to go, and we aren't all dead. I've had my share of fender benders, but no RCE has ever been exploited in 25 years of my C or C++ code.
Let's pretend Rust is a car. Half the time I want to go somewhere, I can't even get out of the driveway before I stumble into a limitation of the language or the standard libraries. One time it's overly strict coherence rules that forbid me from doing something that shouldn't break coherence [1]. Some other time it's because the standard library doesn't have a trait for a commonly implemented method, so I can't write a generic function to call that method. Yet another time, I find out you really can't accomplish the task without writing unsafe code, so the compiler really wasn't going to protect me from myself anyways [2].
So yeah, Rust is saving me from car accidents because I don't even make it to the road much less my destination half the time. Do you drive wearing a 5 point harness and a helmet? Do you really think a true expert couldn't make a safe C++ regex library?
--
[1] Honestly, I think you guys are screwed on that one. I suspect you're going to have to break backwards compatibility to really fix it, so I believe you'll leave it broken instead.
[2] Outside of FFI calls, I have no idea what rules I'm supposed to respect in an unsafe block. I guess I just can't drive my Rust car to that neighborhood then. Very safe indeed.
> Do you really think a true expert couldn't make a safe C++ regex library?
I do believe that a C++ regex library written in reasonable time using normal development practices will have memory safety problems in it. This is based on the real-world experience we have with C++ projects.
> Yet another time, I find out you really can't accomplish the task without writing unsafe code, so the compiler really wasn't going to protect me from myself anyways [2].
If your project is 5% unsafe code and 95% safe code, that's a win due to isolating the trusted computing base. In fact, when you get down to it, this sort of setup describes all Rust projects, as the standard library has unsafe code.
> [1] Honestly, I think you guys are screwed on that one. I suspect you're going to have to break backwards compatibility to really fix it, so I believe you'll leave it broken instead.
I don't agree that it's broken. In order to "fix it", we'd either have to throw out generics in favor of C++/D-style templates or have Haskell-like overlapping instances link errors. Each of those options is unpalatable. You lose some expressiveness, sure, but writing newtype wrappers is not the end of the world.
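For readers who haven't hit the coherence rules yet, the newtype workaround looks roughly like this (a minimal sketch; `Bytes` is just an illustrative name): you can't implement a foreign trait for a foreign type directly, so you wrap the type locally.

    use std::fmt;

    // Orphan rule: no `impl fmt::Display for Vec<u8>` in your own crate, since
    // both the trait and the type are foreign. A local newtype makes it legal.
    struct Bytes(Vec<u8>);

    impl fmt::Display for Bytes {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            for b in &self.0 {
                write!(f, "{:02x}", b)?;
            }
            Ok(())
        }
    }

    fn main() {
        println!("{}", Bytes(vec![0xde, 0xad, 0xbe, 0xef])); // prints "deadbeef"
    }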
In general, I find issues with the expressiveness of generics are often overblown. Remember that people write all sorts of software in Go, which doesn't have generics at all, much less overlapping instances!
> I do believe that a C++ regex library written in reasonable time using normal development practices will have memory safety problems in it. This is based on the real-world experience we have with C++ projects.
Me too. I'm guessing the JavaScript RegExp types in Firefox and Chrome are pretty battle hardened at this point though. I submit those as existence proofs that it could be possible. However, I'm sure they've had their exploitable bugs along the way, and I would guess the code is now ugly.
For what it's worth, I would never submit all of boost as a shining example of clean and modern C++. There are gems in there, but there is a lot of cruft too. I think the biggest benefits it has brought to the C++ world is as a testing ground for new ideas and as a compiler stress test. Some parts of boost are practically a fuzz test for finding the bugs in g++ and clang++. :-)
> If your project is 5% unsafe code and 95% safe code, that's a win
Unfortunately for me, it was the first 5%, and honestly I'd call it more like 20%. It was a weekend learning project, and all I wanted to do was create a freshman level data structure from scratch. I had already had other successes, so this seemed like a good next step, but it wasn't. The weekend passed, and what I really learned was to expect more pain the next time I want to write a low level data structure in Rust.
> I don't agree that it's broken.
I can't make you agree - we might be in opinion territory. It's certainly broken in my opinion.
I have been following along in the various blog posts and forum discussions. It seems like some of your coworkers think the rules are overly restrictive, so broken or not, it looks like they're trying to fix it at least a little. Unfortunately, I don't think they're very concerned about my particular use cases (and maybe they shouldn't be).
> In order to "fix it", we'd either have to throw out generics in favor of C++/D-style templates or have Haskell-like overlapping instances link errors.
I don't think those are the only two options, but you might find the other options even more unpalatable.
> In general, I find issues with the expressiveness of generics are often overblown.
I think all this says is that you don't write very much of the kind of code which benefits from generics. You can dismiss my point of view, but some of us do use them a lot, and it's one of the driving reasons I use C++.
Generics are a documented feature of Rust, but every time I try to use them I hit a wall and end up falling back on the (admittedly powerful) macro system. If I write a library, should I tell my users they need to invoke my macros for use with their types? This is exactly the use case for generics.
(I suspect I need an obligatory smiley here to let you know it's still a friendly discussion :-)
> Remember that people write all sorts of software in Go, which doesn't have generics at all
People write all sorts of software in all sorts of languages. That doesn't say much one way or the other.
Go seems really nice in some ways. The learning curve looks small, so I don't think it would frustrate my coworkers like Rust would. However, if I tried to bring it to work I'd need some sort of code generator to avoid maintaining multiple copies of functions that work on float32, float64, complex64, complex128, and so on.
Maybe they'd consider adding a macro facility like the one in Rust to work around the deficiencies in the language. :-)
Besides, Go has GC and some of our software already needs a lot of memory. The pause times might be getting pretty good, but I'm guessing the memory footprint would be at least twice what it is in C++. (I should measure that though.)
> , much less overlapping instances!
Disclosure: I don't really know what an "overlapping instance" means in this context. It might be possible we're talking about different deficiencies in Rust generics, and I just don't have the right vocabulary.
> It was a weekend learning project, and all I wanted to do was create a freshman level data structure from scratch.
It's worth mentioning that the implementation difficulty of data structures in Rust is not representative of the general difficulty of programming in Rust. Just because this is hard doesn't make it a bad language. Rust's philosophy is that you deal with the trouble of writing unsafe code each time you need a new low level abstraction, and when done, you don't have to worry about it again.
Also, writing generic datastructures in C++ without falling afoul of UB, and/or getting destruction semantics right can be quite tricky. Rust ... really doesn't have additional issues here; raw pointers in Rust work the same way as in C++, except they're a tad verbose. The only difference is that in Rust you probably want to provide a safe API, whereas in "freshman C++ datastructures" the datastructure does not have the unsafety neatly encapsulated. This is a hard task in both Rust and C++, especially with generic datastructures.
Writing datastructures isn't really a common task. It's a "simple" task because it's simple in C and C++, but there's no real reason it has to be in Rust. You can write inefficient datastructures in 100% safe Rust without too many problems, much like you can implement datastructures in regular-joe Java. Writing efficient, low-level datastructures can be pretty hard, but like I said it's a rare task so not much of an issue.
In general I advise folks to not write unsafe Rust till they are sure that they have a clear grasp of safe Rust. But if you want I can have a look at what you tried and help you improve your design. I'm very interested in making such things more accessible (on one hand, the Rust community doesn't really want to encourage the use of unsafe code, on the other hand, sometimes you need it and it would be nice if it was easier for people to use when that situation arises) so learning what people stumble on is important to me.
> Unfortunately, I don't think they're very concerned about my particular use cases (and maybe they shouldn't be).
>
> I think all this says is that you don't write very much of the kind of code which benefits from generics. You can dismiss my point of view, but some of us do use them a lot, and it's one of the driving reasons I use C++.
It seems like you're using generics the way you use templates for metaprogramming in C++. I totally get how powerful TMP is (I love to use it myself, though I can get carried away), but be aware that Rust is a different language and you have to approach it differently. It's important to consider the limitations of the system when designing a solution -- if I designed a solution for C++ and implemented it in Rust, I'd hit problems in the last mile and find it very annoying to make it work (using macros or something to fill in the gap). However, if you design the code from the start with Rust's advantages and limitations in mind, you may come up with a different but just as good system. This is something I hit every time I learn a new language. I hit it with Rust, Go, and many years ago, C++. So it may help to approach the language with a fresh mind.
Yes, coherence is painful. I seem to hit coherence issues pretty rarely myself, but they're annoying when they happen. There's work going on to improve these pain points (specialization!), but it's overall not that big a problem.
I think an example of what you tried might help. Preferably a more holistic example; condensed minimal examples tend to hide instances of the XY problem.
> Disclosure: I don't really know what an "overlapping instance" means in this context.
Cases where you have more than one implementation that makes sense for a given call. C++ solves this by allowing things to overlap and introducing overload resolution rules. Rust solves this by not allowing overlap.
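To make that concrete, here's a minimal sketch of the kind of overlap Rust rejects at compile time (the trait and type are made up for illustration):

use std::fmt::Debug;

trait Describe {
    fn describe(&self) -> &'static str;
}

struct Wrapper<T>(T);

impl<T: Clone> Describe for Wrapper<T> {
    fn describe(&self) -> &'static str { "clonable" }
}

// error[E0119]: conflicting implementations of trait `Describe` for type `Wrapper<_>`
// Trait bounds aren't used to rule out overlap, and both impls could apply to, say, Wrapper<String>.
impl<T: Debug> Describe for Wrapper<T> {
    fn describe(&self) -> &'static str { "debuggable" }
}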
> but I'm guessing the memory footprint would be at least twice what it is in C++. (I should measure that though.)
Go does make extensive use of the stack and has decent escape analysis, so it might not be that bad, really. But measuring it is probably the best way to go here.
> Some other time it's because the standard library doesn't have a trait for a commonly implemented method, so I can't write a generic function to call that method.
Do you have examples? In some cases generic methods are outside the stdlib in different crates, e.g. num_traits.
>> It was a weekend learning project, and all I wanted to do was create a freshman level data structure from scratch.
> I'm very interested in making such things more accessible
At that point, I simply wanted to implement my own version of something like a statically sized matrix to learn how to manage low level memory. Doing a growable array the right way in C++ involves placement new, explicit destructors, std::move and/or std::swap (and exception safety is a bitch), so I'll agree it would be unfair to expect implementing Vec to be simple. However, you can get a rudimentary matrix working almost trivially. I eventually ended up abusing the unsafe code in Vec to get what I wanted:
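(A sketch of that approach, reconstructed from the description rather than copied from the original snippet:)

// A fixed-size matrix backed by a boxed slice; Vec::into_boxed_slice() drops
// the capacity field, so only the pointer, length, rows, and cols remain.
struct Matrix<T> {
    rows: usize,
    cols: usize,
    data: Box<[T]>,
}

impl<T: Clone> Matrix<T> {
    fn new(rows: usize, cols: usize, fill: T) -> Matrix<T> {
        Matrix {
            rows: rows,
            cols: cols,
            data: vec![fill; rows * cols].into_boxed_slice(),
        }
    }
}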
For what it's worth, I would really prefer the equivalent of this C++:
template<class T, long rows, long cols>
struct Matrix {
    // data lives on the stack
    // the type knows the sizes
    T data[rows][cols];
};
But that will have to wait for type level integers.
> It seems like you're using generics the way you use templates for metaprogramming in C++.
I'm not opposed to TMP in C++, but I really only use it in C++98 to make up for the lack of decltype. In C++11, I don't really need it for what I do. Maybe you and I consider TMP to be different things, but I don't think I was trying to do TMP in Rust. I talked with @steveklabnik about this one already:
fn simplified_example<T>(a: T, b: T) -> T
    where T: Copy + PartialOrd
{
    if a.abs() < b.abs() {
        // some irrelevant math goes here
    } else {
        // slightly different math goes here
    }
}
What I learned is that I would need to write my own Abs trait to make this work, and that I would need a Cos, Sin, Exp, Log, and many others to make other similar generic functions work. (I've had a good look at the num crate, and I think it got a lot of things wrong, so that's not an attractive option.) This at least has a workaround, but it's tempting to just fall back on macros.
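A minimal sketch of the kind of hand-rolled Abs trait described (the trait name and impls are illustrative, not from any particular crate):

trait Abs {
    fn abs(self) -> Self;
}

impl Abs for f64 {
    fn abs(self) -> f64 { f64::abs(self) }
}

impl Abs for i32 {
    fn abs(self) -> i32 { i32::abs(self) }
}

fn simplified_example<T>(a: T, b: T) -> T
    where T: Abs + Copy + PartialOrd
{
    // stand-in for the real math: return whichever value has the smaller magnitude
    if a.abs() < b.abs() { a } else { b }
}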
> [regarding coherence] I think an example of what you tried might help
I think it's completely broken that this is ok:
impl<T: Mul<T, Output=T> + Copy> Mul<T> for Matrix<T>
while this is not:
impl<T: Mul<T, Output=T> + Copy> Mul<Matrix<T>> for T
I've read an explanation or two, but this is a binary operator! Why does the order of the arguments matter for coherence here? I can write a macro to work around this, but maybe operator traits shouldn't be restricted with the same logic as other traits.
> There's work going on to improve these pain points (specialization!)
You've mentioned that before (at least I think it was you), around the time I was looking at this. I thought specialization had already been accepted, but I haven't stayed all that current. Regardless, how would it help with my binary operators example above (or the generic function above that)?
Anyways, these are all really simple things in C++, and they aren't simple in Rust, and that was my main point a few comments back.
> I eventually ended up abusing the unsafe code in Vec to get what I wanted:
That code isn't unsafe. It uses the [T] DST, which isn't a common type, but it's safe code.
Yeah, I thought you were implementing a vector or a doubly linked list or something, not a matrix. Lack of integer generics does make matrices a bit harder to implement -- heapifying is the simpler solution here (which you did try), and it should work, really.
> . (I've had a good look at the num crate, and I think it got a lot of things wrong, so that's not an attractive option.)
It does give you the Float and Signed traits, which get you exactly what you need here? You may need to cast to float for integers. It's not great, but it should be good enough.
> I've read an explanation or two, but this is a binary operator! Why does the order of the arguments matter for coherence here?
>
> but maybe operator traits shouldn't be restricted with the same logic as other traits.
I mean, operators have the same conflicting impl problem regardless of how you implement them (as traits or otherwise). C++ solves this with overload resolution rules. Rust doesn't want that, and instead doesn't allow overlaps to occur in the first place. An order-independent coherence system would be stricter than the order-dependent one we have. So, to be less strict, a choice has been made -- to make it easier to implement operator overloads for the first type than for the second.
(I think C++ has a similar asymmetry when it comes to choosing between overlapping operator overloads when the overlap is in the pre or in the post type)
> You've mentioned that before (at least I think it was you), around the time I was looking at this. I thought specialization had already been accepted, but I haven't stayed all that current.
Probably wasn't me. A simpler form of specialization has been accepted and it's currently experimental (so, nightly-only, but should stabilize). The design of the stronger form ("intersection impls") is still being discussed I think.
Not sure if the current form of specialization would help for that impl (it looks like it would, but I'm not sure), but intersection impls might. Either way, specialization helps by letting you explicitly opt in to your impl being overridden by another (giving you a system more like C++), and in turn will relax coherence requirements since overlaps are now allowed.
Like I said, this is all ongoing work. Not there yet :)
> Anyways, these are all really simple things in C++, and they aren't simple in Rust, and that was my main point a few comments back.
But here's the thing -- you're not actually doing anything with these. This is sort of my point, and similar to Patrick's point about "languages can do fine without generics". You're designing a library for general use, but you're designing it with the concerns for use in C++. Asymmetric operators are considered bad in C++. Not in Rust; many operators are asymmetric. But ultimately, how much do asymmetric operators really affect you when programming? Just reorder the things. No big deal.
To give an analogy, I can make a similar complaint about C++ that it's hard to design bounded generic APIs (C++ concepts fixes this, but it's not there yet). This is a true complaint, but it doesn't really say anything about a programmer's ability to get things done in C++. I wouldn't say that it's hard to "get my car on the road" in C++ because I can't use strict bounds on my APIs. It just means that this feature is not one I can rely on when designing my overall code. I can make similar complaints about how hard it is to use algebraic datatypes in C++. The point is that I shouldn't be designing my abstractions around the assumption that these features work well in C++.
Rust and C++ differ in many areas when it comes to expressiveness. But ultimately you're still able to get things done well in both languages, in comparable time. There are some design patterns that work well in C++ and don't in Rust, and the reverse is true too. On average this does not affect your ability to get stuff done -- to "get your car on the road" as you say.
I was pretty tired last night when I wrote that previous reply to you, and I don't think I stated my complaints very clearly. In several of the places where I quote you below, it's not your fault you didn't understand what I was getting at. Generally, I was trying to be concise to make my point, but it still turned into a wall of text, and I don't think I succeeded at either goal (making my point or being concise). Sorry in advance, but this message is a wall of text too.
> Yeah, I thought you were implementing a vector or a doubly linked list or something, not a matrix.
I was trying several things. For example, I got an immutable finger tree working (comparable difficulty to linked lists), but I was trying to describe the pain points, not where I succeeded. Please remember, these were all learning exercises for me, so I wasn't concerned about whether or not there was an existing library that did what I was trying to do - I wanted to learn how to write that kind of library.
Early on, I did intend to re-implement Vec. That is non-trivial in Rust and C++ for reasons I think we might both agree on, although honestly I did wish it was easy in Rust. I didn't think that should involve unsafe code, but it really seems like it does, so life goes on.
Later, I wanted to build a Matrix type wrapping around a low level array of memory. I did not want to wrap a Matrix around a Vec because I don't need a growable storage area, and I don't think I should have to pay for a capacity and current length field when I already keep track of the number of rows and columns. Reading the docs, it seemed like what I wanted was a "slice" to act as my array. I figured Vec must use a slice internally, so I should be able to do that as well! Diving through the Vec source code, I see RawVec, Unique, NonZero<*const T> and finally a pointer. There's also something in there about PhantomData which I suspect is related to the phantom type stuff I've read about in OCaml, but I figured I didn't really need to understand that right away. Translating to pseudo C++, this looks something like:
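(A rough sketch of that layering, written in Rust terms rather than pseudo C++; the names and fields are simplified approximations, not the actual stdlib definitions:)

use std::marker::PhantomData;

struct NonZero<T>(T);                // wrapper asserting the value is never null/zero

struct Unique<T> {
    pointer: NonZero<*const T>,
    _marker: PhantomData<T>,         // "we logically own Ts" (variance and drop-check)
}

struct RawVec<T> {
    ptr: Unique<T>,
    cap: usize,
}

struct Vec<T> {                      // the public type: a RawVec plus a length
    buf: RawVec<T>,
    len: usize,
}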
Piecing this together, my first thought was "crap, why is something as simple as Vec so complicated?". My second thought was "uh oh, where's the slice?". My next thought was, "well hmm, I can copy all that stuff", followed up by, "but a lot of those are marked as unstable, so I probably shouldn't do that".
So I poked around all over the docs and asked some questions about how to get a slice, and it seemed like I would need a Box to hold it. Even today, I really don't know if that's what I wanted, but mostly I gathered that you can't create one from a runtime chosen size without some unsafe code somewhere. That's when I found Vec::into_boxed_slice() which seemed to do almost what I wanted. I didn't want a Vec, and the implementation of that method has an unsafe block, so I used it and moved on. That's the first code snippet I included in the previous message.
In a previous post, you said you were interested in finding out what causes new Rust users to stumble. For me, I got hung up trying to figure out how to create a fixed sized array. Please don't say, "just use a Vec!" - that really misses the point. I'm still not sure what type I should use, but if it is a Box<[T]>, I would love a function like:
where f is called once for each element to provide a value. If something else is the right choice, imagine a different return type.
> That code isn't unsafe. It uses the [T] DST, which isn't a common type, but it's safe code.
Maybe DSTs are what I was looking for, but I see the docs are in the "nomicon", and I'm not sure that even existed when I was trying to figure out the stuff above. I will read it in the near future though.
> [the num crate] does give you the Float and Signed traits, which get you exactly what you need here? You may need to cast to float for integers. It's not great, but it should be good enough.
I didn't really want to dive into the ways I think the num crate is broken, but here goes. Floating point numbers and signed integers are not the only things I want to call .abs() on. For instance, .abs() is relevant to complex numbers too (as are exp() and many others). Unfortunately, you can't implement Float or Signed for complex numbers because those traits have many other methods which don't make any sense for complex numbers. For example, Signed requires .is_positive() and Float requires .min_value().
Let's dodge the topic about calling .abs() on unsigned types, but in generic code that's really not as silly as it sounds. The num crate is trying to solve an ugly problem, and I don't think it's made the right trade offs. Really it isn't good enough, and I'm glad you guys removed it from the std library before version 1.0 rather than committing to it for the foreseeable future.
> I mean, operators have the same conflicting impl problem regardless of how you implement them (as traits or otherwise).
I don't see the conflict. These impls are fine:
impl Mul<f64> for Matrix<f64>
impl Mul<Matrix<f64>> for f64
And I can write a macro to call on any type I want. Try to use generics instead of a macro though, and the second one breaks.
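(A sketch of that macro workaround; the Matrix definition is a stand-in and the macro name is made up:)

use std::ops::Mul;

struct Matrix<T> {
    rows: usize,
    cols: usize,
    data: Box<[T]>,
}

// Generate the scalar-on-the-left impl separately for each concrete scalar type,
// sidestepping the coherence restriction on the generic version.
macro_rules! impl_scalar_mul {
    ($t:ty) => {
        impl Mul<Matrix<$t>> for $t {
            type Output = Matrix<$t>;
            fn mul(self, rhs: Matrix<$t>) -> Matrix<$t> {
                let data: Vec<$t> = rhs.data.iter().map(|x| self * *x).collect();
                Matrix { rows: rhs.rows, cols: rhs.cols, data: data.into_boxed_slice() }
            }
        }
    };
}

impl_scalar_mul!(f32);
impl_scalar_mul!(f64);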
> C++ solves this with overload resolution rules. Rust doesn't want that, and instead doesn't allow overlaps to occur in the first place.
I understand why Rust doesn't want SFINAE, and I understand why coherence is valuable, but you shouldn't believe something like SFINAE is required to make generic operators work. I've read nikomatsakis's blog posts several times, and it seems like he almost went with the "covered" rule. If the "covered" rule was applied to binary operator traits, but the current (nuanced /cough) rules applied everywhere else, I think the kind of generic operators I want to write would pass coherence. I'm also not saying this is the only way it could work, but I am saying that generic binary operators really should work symmetrically.
> I think C++ has a similar asymmetry when it comes to choosing between overlapping operator overloads when the overlap is in the pre or in the post type
That might be true if you define the operator as a member, so don't do that. If you declare it as a standalone function, these are really very general and symmetric:
template<class A, class B>
auto operator *(const A& a, const Matrix<B>& b)
    -> Matrix<decltype(A() * B())>
{ /* implementation */ }

template<class A, class B>
auto operator *(const Matrix<A>& a, const B& b)
    -> Matrix<decltype(A() * B())>
{ /* implementation */ }
> But here's the thing -- you're not actually doing anything with these.
Please don't assume that because I provided short examples of things which don't work like I think they should that you have any idea what I do. I'm not some novice who gets a kick out of finding ways to break the language. I understand why you might think that, but you're wrong, and it's condescending. I was trying to create the tools I would need for the kind of work I do, and I stumbled in several places. I'd like to ignore this part of your reply and get back to the rest of it.
> You're designing a library for general use, but you're designing it with the concerns for use in C++.
No. I was trying to design it with my understanding of idiomatic Rust, and I was trying to learn the language. We're only discussing C++ because it's the topic of this thread and because Rust aspires to be an alternative to it.
> Asymmetric operators are considered bad in C++. Not in Rust; many operators are asymmetric.
You state that like it was a well reasoned design decision instead of an oversight or unfortunate consequence, and I don't believe that is true. If you read nikomatsakis's blog post from my point of view, it looks like it was basically an accidental casualty because it wasn't in his list of use cases. Why wouldn't you want them to be symmetric? The num crate uses macros to implement symmetric operations on complex numbers in exactly the way I've described above. If asymmetry is something valuable, why would they do that?
> But ultimately, how much do asymmetric operators really affect you when programming? Just reorder the things. No big deal.
Sure, I could get by in Forth or Scheme too if they met my other requirements. However, processing arrays of numeric data is something I do nearly every single day. I translate equations I know and new ones I learn from published papers all the time. The order matters for clarity, and not all operations are commutative. Why have operators at all if you aren't trying to represent math notation?
Look, it's fine if you don't want Rust to appeal to numerical programmers like me - you don't owe me anything. You seemed interested in what I found lacking, so I tried to share. Honestly, I had already assumed these kinds of things won't be fixed, I've already suffered the learning process for how to work around them, and maybe they'll be revisited in Rust 2.0.
> I can make similar complaints about how hard it is to use algebraic datatypes in C++.
Yes, implementing ADTs in C++ is terrible. This is an area where Rust shines, and I greatly prefer sum types to classes and inheritance.
> Piecing this together, my first thought was "crap, why is something as simple as Vec so complicated?".
This is because it uses some reusable primitives in the stdlib. A standalone vec can be done in a much simpler way. NonZero isn't necessary; it just enables optimizations. PhantomData is for variance stuff (explained in the nomicon) and drop order, which are sort of niche but interesting things. The variance problem in this case is only about being able to allow things like a Vec of a borrowed reference (so not including it just means that you can't use the vec for more niche things). The drop order part is necessary for safety in situations involving arenas and whatnot, but this is again one of those things you need to think about in C++ too.
The nomicon does build up a vec impl from scratch (https://doc.rust-lang.org/stable/nomicon/vec.html) and starts with a simple impl and slowly adds optimizations and refactorings. It depends on knowledge from the rest of the nomicon, however.
> and the implementation of that method has an unsafe block
Ah, I see, when you said "abusing the unsafe code" I thought you meant you were actually using unsafe code. Almost all stdlib things eventually drill down to unsafe calls so using a safely-wrapped API like into_boxed_slice is OK. That's what I mean by "that code isn't unsafe" :)
> For me, I got hung up trying to figure out how to create a fixed sized array. Please don't say, "just use a Vec!" - that really misses the point. I'm still not sure what type I should use, but if it is a Box<[T]>, I would love a function like
Box<[T]> is basically it, though it's a more obscure type (most newcomers would just use Vec, which is really fine, but if you are more acquainted with the language there's nothing wrong with using a boxed DST, so you should use it). I wish we could get type level integers so that you can write generic types over [T; n] though.
Generally the stdlib doesn't include functions that are simple compositions of others, and since you can do something like `(0..n).iter().map(|_| func()).collect().into_boxed_slice()` such a function probably wouldn't exist. But it's not that clear-cut; if you propose it, it could happen! DSTs don't get used much in your average rust code so this is an area of the stdlib that could get more convenience functions.
> Maybe DSTs are what I was looking for, but I see the docs are in the "nomicon",
Yeah, DSTs are a more advanced feature of Rust. I'd prefer to wait for type level integers than bring them out to the forefront.
> but here goes. Floating point numbers and signed integers...
Good points; hadn't thought of that. If you have the time/inclination, I'd love to see an alternative traits lib better suited for this purpose.
> Try to use generics instead of a macro though, and the second one breaks.
So there's no conflict in the code written the way it is right now, but other blanket impls from other crates may conflict, basically.
> I understand why Rust doesn't want SFINAE,
Not talking about SFINAE; just talking about overload resolution (SFINAE is something built on top of it)
> If the "covered" rule was applied to binary operator traits, but the current (nuanced /cough) rules applied everywhere else, I think the kind of generic operators I want to write would pass coherence.
This is interesting. I think you would still have a problem with some kinds of blanket impls that currently are allowed on operators, but the ones you have listed would work.
Ultimately it's a tradeoff, though. The covered rule reduces some of the power of genericness of the RHS of operator overloads and balances it out. E.g. right now `impl<T: MaybeSomeBoundHere> Add<T> for Foo` works, but it wouldn't be allowed under the covered rule. That's a pretty useful impl to have.
It might be possible to introduce a coherence escape hatch like `#[fundamental]` to be used with the operator traits. I'm not sure.
> If you declare it as a standalone function, these are really very general and symmetric:
Oh, forgot you can do that :)
> Please don't assume that because I provided short examples of things which don't work like I think they should that you have any idea what I do
I apologize. I inferred this from "all I wanted to do was create a freshman level data structure", which has the implication of "I can design this abstraction easily in C++, why not Rust".
Sorry about that :)
> You state that like it was a well reasoned design decision instead of an oversight or unfortunate consequence, and I don't believe that is true. If you read nikomatsakis's blog post from my point of view, it looks like it was basically an accidental casualty because it wasn't in his list of use cases.
I do think it's an unfortunate consequence. It was a tradeoff: operator symmetry was forgone so that other things could exist -- a well reasoned design decision that simply wasn't decided in symmetry's favor. I don't think it was an oversight; these things were discussed extensively and operators were some of the main examples used, because operators are the primary example of traits with type parameters in Rust (and thus great fodder for coherence discussions).
> Look, it's fine if you don't want Rust to appeal to numerical programmers like me
I do! :) I used to be a physics person in the past, and did try to use Rust for my numerical programming. It was ... okay (this was many years ago, before some of the numerical inference features -- explicit literal types was hell). It's improved since then. I recognize that it's not the greatest for numerical programming (I still prefer mathematica, though I don't do much of that anymore anyway).
I think specialization (the "final form", not the current status) will help address your issues a lot. Also, type level integers should exist; I have some scratch proposals for them, but I keep getting bogged down in making it work with things like variadic generics (I feel that a type level integer system should not be designed separately from whatever gets used to make it possible to operate generically on tuples, as is done in some functional programming languages).
> The order matters for clarity, and not all operations are commutative.
This is a great point. Ultimately macros pretty much are your solution here, which is not a great situation. Specialization would help, again.
I'm glad you replied. I was beginning to worry we were going too far into "agree to disagree" territory. I'm at work now, but I'd like to respond to a few of your items above this weekend.
We're getting pretty deep into a Hacker News thread about an almost unrelated topic, and the formatting options here are limited. Is there a better forum to have this kind of discussion? Some of it seems relevant to Rust internals, but I don't know if it's welcome there or not.
Really just posting on users.rust-lang.org (or /r/rust) about your issues would be nice. In particular if you're interested in creating a new num traits crate I recommend creating a separate post about that focused on the issues you came across and a sketch of what you'd prefer to see.
I'll put together a post on the users forum about num crate traits, but it'll probably be a day or two. In the mean time, a few replies to some of the other items above:
> I wish we could get type level integers so that you can write generic types over [T; n] though.
Yes, that would be very useful. I use fixed sized matrices for things like Kalman filters from time to time. These aren't usually the 3x3 or 4x4 kinds of matrices you see in the graphics world. For instance, they might be 6x9 or 12x4 in some specific case. It makes a huge difference in performance if they can be stack allocated (Eigen provides a template specialization for this).
For other problems, I use very large vectors and matrices, and those should be heap allocated to keep from blowing the stack. In those cases, the allocation time is usually dwarfed by the O(N^2) or O(N^3) algorithms anyways.
I just tried this, but rustc version 1.15.1 can't find the .iter() method for the Range. I'm assuming it's a small change (which I'd really like to see if you're willing), but that's quite a stack of legos you've snapped together there :-)
Let's add that to the list of things a new user like myself stumbles on: Even knowing I wanted a boxed slice, I'm not sure I would piece together "let's take a range, convert it to an iterator, map a function over each item, and collect that into a Vec so that I can extract the boxed slice I want".
Does that create and then copy (possibly large) temporaries? Walking through the code, I see it calls RawVec::shrink_to_fit() - which looks like it's possibly a no-op if the capacity is the right size. Then it calls Unique::read() - which looks like a memcpy. I honestly don't know if this does make copies, but if it does, that cost can be significant sometimes.
> just talking about overload resolution (SFINAE is something built on top of it)
I think Rust already dodges 90% of that problem by not providing implicit conversions (a good thing, IMO). However, really all I was trying to say is I don't believe you need to copy C++'s approach for generic operator traits and functions to work like I think they can/should in Rust. I don't understand the details well enough to know if you could fix things and maintain backwards compatibility, but it's a false dichotomy to say Rust's (current) way and C++'s way are the only possibilities.
> E.g. right now `impl<T: MaybeSomeBoundHere> Add<T> for Foo` works, but it doesn't by the covered rule. That's a pretty useful impl to have.
I think you're referring to the table in the orphan impls post:
+-----------------------------+---+---+---+---+---+
| Impl Header                 | O | C | S | F | E |
+-----------------------------+---+---+---+---+---+
| impl<T> Add<T> for MyBigInt | X |   | X | X |   |
| impl<U> Add<MyBigInt> for U |   |   |   |   |   |
I honestly don't know if either of those should be allowed! They both seem very presumptuous and not at all in the spirit of avoiding implicit conversions. Let's instantiate T with a String, a File, or a HashTable - I don't see how adding MyBigInt could possibly make sense on either the left or the right. Maybe they make sense with the right bounds added.
I think it's a very different thing when the user of your crate explicitly instantiates your type with one of their choosing. If I had any say, my contribution to the use-case list would look like this:
+------------------------------+---+
| Impl Header                  | ? |
+------------------------------+---+
| impl<T> Add<T> for MyType<T> | X |
| impl<U> Add<MyType<U>> for U | X |
| impl<T> Sub<T> for MyType<T> | X |
| impl<U> Sub<MyType<U>> for U | X |
| impl<T> Mul<T> for MyType<T> | X |
| impl<U> Mul<MyType<U>> for U | X |
... and so on for 20 or 30 more lines :-)
When MyType is parameterized like this, I'm declaring something stronger, and I don't think it should introduce a coherence problem.
> I just tried this, but rustc version 1.15.1 can't find the .iter() method for the Range. I'm assuming it's a small change (which I'd really like to see if you're willing), but that's quite a stack of legos you've snapped together there :-)
Yeah, ranges are iterators already; you don't need to create iterators out of them.
`let boxslice = (0..10).map(|_| func()).collect::<Vec<_>>().into_boxed_slice();` is something that will actually compile. The turbofish `::<Vec<_>>` is necessary because `collect()` can collect into arbitrary containers (like HashSets) and we need to tell it which one to collect into. A two-liner `let myvec: Vec<_> = ....collect(); let boxslice = myvec.into_boxed_slice();` would also work and wouldn't need the turbofish.
In the case of functions returning Clone types, you can just do `vec![func(); n].into_boxed_slice();`. My example was the fully generic one that would be suitable for implementing a function in the stdlib, not exactly what you might use -- I didn't expect you to be able to piece it together :). For your purposes just using the vec syntax is fine, and would work for most types.
Using ranges as iterators is basically the go-to pattern for "iterate n times", for future reference.
> Does that create and then copy (possibly large) temporaries? Walking through the code, I see it calls RawVec::shrink_to_fit() - which looks like it's possibly a no-op if the capacity is the right size. Then it calls Unique::read() - which looks like a memcpy. I honestly don't know if this does make copies, but if it does, that cost can be significant sometimes.
In this case .collect() is operating on an ExactSizeIterator (runtime known length) so it uses Vec::with_capacity and shrink_to_fit would be a noop. In general .collect().into_boxed_slice() may do a shrink (which involves copying) if it operates on iterators of a-priori unknown length. This is not one of those cases. At most you may have a copy of each element when it is returned from func() and placed into the vector. I suspect it can get optimized out.
vec![func(); n] will call func once and then create n copies by calling .clone(). Usually that cost is about the same as calling func() n times.
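A tiny check of that behavior (illustrative only):

#[derive(Clone)]
struct Noisy;

fn func() -> Noisy {
    println!("func called");
    Noisy
}

fn main() {
    let v = vec![func(); 3]; // prints "func called" exactly once; the rest are clones
    println!("{}", v.len()); // 3
}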
> Let's instantiate T with a String, a File, or a HashTable - I don't see how adding MyBigInt could possibly make sense on either the left or the right. Maybe they make sense with the right bounds added.
Yeah, that's why I had a bound there. I personally feel these impls make sense, both for traits and for operators. Perhaps more for non-operator traits.
I think your use case is a good one, and it's possible that the covered rule could be made to work with the current rules. I don't know. It would be nice to see a post exploring these possibilities. It might be worth looking at how #[fundamental] works (https://github.com/rust-lang/rfcs/blob/1f5d3a9512ba08390a222...) -- it's a coherence escape hatch put on Box<T> and some other stdlib types, which makes a tradeoff: Box<T> and some other types can be used in certain places in impls without breaking coherence, but the stdlib is not allowed to make certain trait impls on Box without it being a breaking change (unless the trait is introduced in the same release as the impl). The operator traits may have a solution in a similar spirit -- restrict how the stdlib may use them, but open up more possibilities outside. It's possible that this may not even be necessary; the current coherence rules are still conservative and could be extended. I don't think I'm the right person to really help you figure this out, though; I recommend posting about this on the internals forum.
(I'm not sure if this discussion is over, but if it isn't I think it makes more sense to continue over email. username@gmail.com. Fine to continue here if you don't want to use email for whatever reason)
Those new examples work nicely. I'll have to remember the word "turbofish" :-)
> (I'm not sure if this discussion is over, but if it isn't I think it makes more sense to continue over email. username@gmail.com. Fine to continue here if you don't want to use email for whatever reason)
Nah, I think we're at a good stopping point. I'll post the num traits topic on users, and the operator coherence one on internals, so maybe you will jump in there.
I'm generally pretty private online, so I wouldn't take you up on the offer to continue in email. However, you've been really helpful and patient, and I sincerely appreciate it. Thank you again.
> no RCE has ever been exploited in 25 years of my C or C++ code
I'm genuinely curious how you say this with any sort of surety. Do you have any sort of, say, crash reporting from users' computers, or some other way to know if a problem occurred in the wild? (Not that these will actually detect a successful RCE, only failed ones.)
Additionally, not being (known to be) exploited doesn't mean that much without more context, e.g. have malicious people/machines/fuzzers actually tried to find exploits in your code?
> I'm genuinely curious [how] you say this with any sort of surety.
Of all the things I said, is that really the only one you want to address?
In the future, I'll try harder to qualify my statements with a "to the best of my knowledge" clause when replying to Rust core developers, but constantly adding caveats to every word I say is tedious. No, I'm not omniscient - I can't prove my software has never been exploited.
I have some doubts whether your question is really all that sincere, but giving you the benefit of the doubt: For the last 15-20 years, my job-related software runs on networks which are effectively air-gapped. My users aren't shy about submitting bug reports, and they generally have my phone number. For what it's worth, truly "malicious users" run the risk of getting fired or facing a court-martial. Anyone who runs my code already has a shell and a compiler on the machine. We give them the source. They frequently have sudo. It would be much simpler for them to just write a program and run it than to inject it into my (hypothetical) buffer overflows. If they want to crash my software, they can simply kill -9 it (or not start it in the first place).
Before that I worked for a few unsuccessful startups, and I wish my software had seen enough exposure to run the risk of being exploited. I've written Netscape plugins which I know were exploitable, but that code vanished along with the stock options before most people had gotten past dial-up connections. Maybe someone somewhere curses the day I was born, but they didn't send a bug report.
Having said all of that, you probably think I'm in some rare position. However, it's a really big world, and you might be surprised how much sloppy C, C++, Fortran, and COBOL software is out there quietly getting the job done without the constant onslaught of black hats attacking it. We don't all write web browsers and servers. There are a lot of potentially profitable C++ targets in the finance industry, but somehow they survive.
I'm not one of the people saying modern C++ solves all the problems. I'm keenly aware of many shortcomings in C++. The language sucks in a lot of ways, and I dread trying to explain the complexities to non-computer-science developers.
Given all of that, I have been interested in Rust for reasons having nothing to do with safety. You have some great features, and I think you should advertise those. If you fixed the pain points in Rust instead of emphasizing the shortcomings in C++, I think you could win a lot more converts (and a lot of new developers who could grow your ecosystem outside of web clients and services).
> Of all the things I said, is that really the only one you want to address?
Yes, I guessed (correctly!) that others would inquire about the rest of the comment; I wanted the context. Thank you for taking the time to explain.
> Having said all of that, you probably think I'm in some rare position. However, it's a really big world, and you might be surprised how much sloppy C, C++, Fortran, and COBOL software is out there quietly getting the job done without the constant onslaught of black hats attacking it. We don't all write web browsers and servers. There are a lot of potentially profitable C++ targets in the finance industry, but somehow they survive.
I'm not surprised.
Your argument is something along the lines of "a lot of code doesn't have particularly high security requirements" and "sandboxing/airgapping mitigates problems", which are both totally reasonable and indeed are things that the Rust core team acknowledges (although the first is becoming less and less true, in "surprising" places, as more things are internet connected). Additionally, the latter is a "defense in depth" strategy that the Rust community is keen on (e.g. https://github.com/servo/gaol): it is understood that code can have bugs, inside unsafe code or not, and so limiting the interaction with things outside the program is always good.
However, neither of these facts supports, for instance, an expert being able to write a memory-safe regex library in C++ in a reasonable time (e.g. how long it took for Rust's regex library to be written by a single Rust expert), nor do they say anything about code that does have strict requirements about security or correctness (a latent memory safety bug might not be exploited, but it can still lead to weird crashes and data corruption).
> Given all of that, I have been interested in Rust for reasons having nothing to do with safety. You have some great features, and I think you should advertise those. If you fixed the pain points in Rust instead of emphasizing the shortcomings in C++, I think you could win a lot more converts (and a lot of new developers who could grow your ecosystem outside of web clients and services).
And, you might be interested to know that the 2017 roadmap https://blog.rust-lang.org/2017/02/06/roadmap.html includes many things like fixing pain points, and features for both web and non-web developers.
> Yes, I guessed (correctly!) that others would inquire about the rest of the comment
Fair enough.
> Your argument is something along the lines of "a lot of code doesn't have particularly high security requirements" and "sandboxing/airgapping mitigates problems"
Nah, you're mixing up separate posts, but it doesn't really matter. I didn't start out with any intention of making an argument. I'm just tired of the "ZOMG! RCE!" sentiment as though that's the most important thing for everyone.
> [marketing, pain points, roadmap] was extensively discussed a few weeks ago
Note that it is possible to design regex engines that deliberately avoid these pitfalls (e.g. RE2, which was originally written for Google Code Search, as well as Go's regex engine (written by the RE2 author) and Rust's regex engine). The downside is that these engines have to disallow features that require backtracking, such as backreferences.
Not exactly sure what you're getting at, but web browsers these days have a lot of protection from malicious pages, including the "JavaScript on this page is using 100% cpu; terminate? Yes/No?" dialogs.
pcre was fuzzed by quite a few people a while ago; I remember reporting a couple of issues myself. When I tested re2, nothing showed up.
oniguruma also had quite a few issues; I tested it with libfuzzer recently, but everything should be fixed now.
The top two comment chains right now are (to paraphrase) "See, modern C++ isn't free of memory issues" and "Maybe we should rewrite it in Rust and compare".
> "Maybe we should rewrite it in Rust and compare"
That already happened, almost three years ago.
It's an interesting comparison point. The OP contains memory corruption bugs in C++'s standard regex library. If Rust claims to prevent these kinds of bugs, does it actually hold up to scrutiny? One way of testing that is throwing a fuzzer against a regex library written in Rust.
They seem to be fuzzing the regex, not just the input it is applied to. This may or may not change the results, but if you're allowing users to input arbitrary regex patterns you have a whole lot of other problems.
Huh? It would be perfectly valid to have it power the regex part of the scripting engine in a browser, for example. If that would then lead to memory safety errors, you just got yourself a 0-day.
Regex engines used in browsers are both fast and hardened against these attacks.
Finding vulnerabilities in existing regexes would be significantly more dangerous/interesting though. Like say if an exploit was found that could be triggered with a carefully crafted input to an email validator. Half of the servers on the web would be vulnerable! See e.g. https://en.wikipedia.org/wiki/ReDoS
If you have to be able to run an arbitrary regex on the victim's machine, that makes it much more limited. You shouldn't be letting users access regex any more than you should let them execute arbitrary code. Sure browsers do both of those things. But only through monumental effort that shouldn't be expected everywhere else.
I've seen plenty of places that you may want to accept an arbitrary regex from the user. An app could allow the user to set up a filter for messages or usernames by putting in a regex. Or an interpreter for a sandboxed language could provide regex support.
All manner of problems in the programmer's mind become trivial if only we allow users to input what is essentially code to express exactly what they want. Of course this is basically never a good solution.
The issue with allowing arbitrary regex patterns is DoS through exponential blowup. But if you allow running code anyway you might not very much care for that.
Assuming untrue things about the mathematical properties of your regular expression engine -- things not supported by the documentation -- and allowing that to be exploited by user input is a different beast entirely from a bug in that library. The first, with careful examination of those properties, may be preventable through sanitization of the input to exclude certain edge cases. You can't assume you'll be able to sanitize input ahead of time for bugs nobody knows about yet, which might be hidden in normally safe features.
> if you're allowing users to input arbitrary regex patterns you have a whole lot of other problems.
Why? I am legitimately asking; you say that like it is meant to be obvious or common knowledge. But many regular expression engines are self-contained entities that can crash but cannot expose information or poison other parts of the program.
A DoS-like attack might be an argument against unfettered user-inputted regular expressions. But off the top of my head that is the only BIG risk. Particularly in secure-by-design languages like Java, C#, and Rust.
Allowing users to enter regular expressions is definitely something that should only be entered into with caution. It is effectively code in a programming language that will be run, after all.
For example, it means their input needs parsing - and parsing is very hard to get right and frequently results in exploitable bugs. There are also potential problems where regexps could take a very long time to execute, which could be a denial of service issue.
Then don't vet it. Just run it as is, and limit how many local resources it can consume (CPU cycles and RAM). Then add on a timeout for good measure, and you are good to go.
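For Rust's regex crate specifically, a sketch of that kind of limiting (size_limit and dfa_size_limit are real builder options; the numbers are arbitrary, and a wall-clock timeout would still have to be enforced around the call):

use regex::RegexBuilder;

// Compile an untrusted, user-supplied pattern with memory caps. The crate never
// backtracks, so matching time is linear in the input; these limits bound the
// compiled program size and the lazy DFA cache instead.
fn compile_user_pattern(pattern: &str) -> Result<regex::Regex, regex::Error> {
    RegexBuilder::new(pattern)
        .size_limit(1 << 20)     // cap the compiled program at ~1 MiB
        .dfa_size_limit(1 << 20) // cap the lazy DFA cache at ~1 MiB
        .build()                 // fails cleanly instead of exhausting memory
}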