A millisecond isn't fast (and how we made it 100x faster) (jvns.ca)
109 points by Symmetry on Sept 10, 2015 | 39 comments


I love this stuff. I'm currently working on a project where, left to my own devices, I would have come up with some complicated scheme to reduce an O(n) (or even O(n^2)) task to O(log n). Fortunately, the guidelines I've been given (government, of course) offer a series of extremely-hard-to-understand-but-O(1) methods for retrieving the data. I haven't benchmarked it yet, but I'd imagine my scheme would have taken about 5ms per record, while this process will probably take something on the order of 5 microseconds per record.

I'm learning all kinds of tricks that older developers can probably do in their sleep. Very cool.

(Note: the reason I say "older" instead of "experienced", even though for these purposes they're interchangeable, is that the tricks I'm learning are coming from a COBOL codebase written back in the 70s that I'm modernizing into Go.)

I also learned a long time ago that lookup tables for common requests are a no-brainer. If you only have 256 possible answers to a formula, just stick them in a lookup table.
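
For example, something like this (a Java sketch; expensiveFormula is a hypothetical stand-in for whatever the real computation is):

    public class ByteTable {
        // Precompute all 256 answers once; each request is then one array load.
        static final double[] TABLE = new double[256];
        static {
            for (int i = 0; i < 256; i++) TABLE[i] = expensiveFormula(i);
        }

        static double expensiveFormula(int b) {
            return Math.sqrt(b) * Math.log1p(b);  // pretend this is costly
        }

        static double lookup(byte b) {
            return TABLE[b & 0xFF];  // mask so the byte indexes as unsigned
        }

        public static void main(String[] args) {
            System.out.println(lookup((byte) 200) == expensiveFormula(200));  // true
        }
    }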

Love it.


Migrating COBOL to Go? Government job? Where is this glorious country?


I don't want to be the guy rejecting change... But is someone expecting a project in Go to last 40+ years, as that COBOL codebase did?? Wow!


It's not because COBOL was good that things written in it lasted; it's because it was there. The things written in it made COBOL last, not vice-versa.


A lookup table seems silly when you can take advantage of the floating point format. In C, I think you can more or less just do something like:

    uint64_t e = (uint64_t)(1023 - mj) << 52;
    double exp = *((double*) &e);

edit: Whether or not there's a way to express that in Scala, I don't know...


In C you can use scalbn(), which a good compiler would hopefully optimize to something equivalent to what you wrote. Java has scalb(), presumably based on the C function.


That's not a legal optimization as it gets all sorts of edge cases wrong. For example scalbn(0, anything) is zero, whereas simply adding to the exponent will produce some arbitrary power of two or infinity.


It doesn't look like gcc (4.8.3) inlines scalbn on -O3 (unless I need another flag...). I see a callq to scalbn in the disassembly.


Scala has bitwise operators

http://www.tutorialspoint.com/scala/scala_bitwise_operators....

So the "<<" is easy enough to cover.

I do not know what the * is doing; if it involves pointer something-something then it may not be possible to replicate.


I'm just reinterpreting the 64-bit integer as a double-precision floating point number (there are other ways to do this in C, for example using a union).


Java has a type called ByteBuffer that might do the trick: http://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer...

I don't know how that would get optimized by the JIT at runtime though.
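
For what it's worth, Java also has Double.longBitsToDouble, which does the reinterpretation directly; I'd guess the JIT handles it well, but that's worth verifying. A sketch of both routes:

    import java.nio.ByteBuffer;

    public class Reinterpret {
        public static void main(String[] args) {
            int mj = 3;
            long bits = (1023L - mj) << 52;  // IEEE 754 exponent field for 2^-mj

            // Route 1: via ByteBuffer, as suggested above
            ByteBuffer buf = ByteBuffer.allocate(8);
            double viaBuffer = buf.putLong(0, bits).getDouble(0);

            // Route 2: reinterpret the long's bits as a double directly
            double viaBits = Double.longBitsToDouble(bits);

            System.out.println(viaBuffer + " " + viaBits);  // both print 0.125
        }
    }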



From the article, she says they tried "1.0 / (1 << mj)", which turned out to be slower than the lookup table. May or may not be equivalent to java.lang.Math.scalb, I dunno.


That doesn't look like valid Scala code.


> In C


So, not valid Scala then.


Optimizations are so fun. But the key is always to measure from end to end before you investigate, and after you change things, in a comparable manner. Too many people guess or theorize or make random changes.


And you need to know how to perf test. The OP shows a test that runs for a fraction of a second. The JVM will not even think about JIT until it runs a piece of code 5-10k times... so this is likely a test of how fast interpreted code runs, which has little to no value.


To be fair, the blog post just shows the command output of a test run. That could have been the test runner extrapolating the run time from a longer-running benchmark.

The linked pull request references some Benchmark annotations and all manner of Scala magic I don't want to interpret right now.

That said, your point holds and I'd be very skeptical that this benchmark measures what they think it does.


Optimization is a drug. If you do it too much you will get addicted and it will ruin your productivity. But damn it feels great :P


The reason Math.pow is so slow is that underneath it's really doing:

    fn pow(x, n): exp( n * ln(x) )
This is how you ensure that power operations take constant time regardless of the input. However, exp and ln are relatively slow in numerical computing land, so if you're using small powers, simply doing the multiplication out will be faster, and if you're using powers of two, bit shifting will be.
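
You can see the identity (and its rounding error) directly in Java:

    public class PowIdentity {
        public static void main(String[] args) {
            double x = 2.0, n = 10.0;
            // The two should agree up to rounding error in exp/ln:
            System.out.println(Math.pow(x, n));             // 1024.0
            System.out.println(Math.exp(n * Math.log(x)));  // ~1024, off by a few ulps
        }
    }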


* Why does this ensure constant time?

* Why is constant time desirable here, if the operation can be quicker for some inputs?


It ensures constant time because exp and ln each take a fixed amount of time to execute, no matter what n is. Think about the naive implementation in the general case:

x^6 = x * x * x * x * x * x // five multiplications, linear in n

Here's a (probably slightly wrong) optimized version of that which is O(log(n)):

    fn pow(x, n):
      if n == 0: return 1
      elif n == 1: return x
      elif n is even:
         return pow(x * x, n/2)
      elif n is odd:
         return x * pow(x * x, (n-1)/2)
The nice thing about the constant-time version is that it's relatively speedy (< 1ms) and is guaranteed to handle even large inputs gracefully[1]. Some implementations might add an additional optimization like if n == 2: x * x or n == 3: x * x * x, since those are super common, but if you're really crunched, you've added if statements or function calls when you could have just inlined it.

[1] Though you'll start running into quantization error at some point from ln.
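
For reference, the same idea as runnable Java (a sketch; non-negative integer exponents only):

    public class FastPow {
        // Exponentiation by squaring: O(log n) multiplications.
        static double pow(double x, long n) {
            double result = 1.0;
            while (n > 0) {
                if ((n & 1) == 1) result *= x;  // odd exponent: peel off one factor
                x *= x;                         // square the base
                n >>= 1;                        // halve the exponent
            }
            return result;
        }

        public static void main(String[] args) {
            System.out.println(pow(2.0, 10));  // 1024.0
            System.out.println(pow(3.0, 5));   // 243.0
        }
    }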


A compiler won't generally turn x * x * x * x into (x * x) * (x * x) without being asked to (e.g. -ffast-math), as floating point math is not associative.


Java's Math.pow accepts non-integer exponents so your example doesn't handle all cases.


Right, it only works on non-negative integers, but my original example using exp and ln works in nearly all cases (though you probably want to do some rounding in the case of an integer input). That's actually another point in its favor: it's constant time and handles the odd cases.


This looks like an ideal case for something like ldexp(), but I couldn't find an equivalent function in Java at first glance.

A lookup table seems like a bit of a shame for something that should be inherently fast in binary floating point, but it is probably the next best thing if poking around in the floating point bit representation isn't possible.


ldexp( ) is a surprisingly expensive operation because it has to handle edge cases correctly; even if you don't hit those edge cases, the best implementation of the fast path on most architectures is approximately: extract exponent, add adjustment, check result, branch to fixup if out of range, insert exponent. If the range of adjustments you need to support is small, and you're not totally load-store bound, a lookup table is faster.

(That's not to say that Java shouldn't have ldexp( ), just that it's not a fix-all.)
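
In Java, that fast path might look something like this sketch (every edge case -- zero, subnormals, NaN, infinity, over/underflow -- is punted to Math.scalb, which is exactly the handling that makes the real thing expensive):

    public class FastScalb {
        static double fastScalb(double x, int k) {
            long bits = Double.doubleToRawLongBits(x);
            long exp = (bits >>> 52) & 0x7FF;           // extract exponent
            long adjusted = exp + k;                    // add adjustment
            if (exp == 0 || exp == 0x7FF                // zero/subnormal/NaN/infinity
                || adjusted <= 0 || adjusted >= 0x7FF)  // out of normal range
                return Math.scalb(x, k);                // branch to fixup
            return Double.longBitsToDouble(             // insert exponent
                (bits & ~(0x7FFL << 52)) | (adjusted << 52));
        }

        public static void main(String[] args) {
            System.out.println(fastScalb(1.0, -3));  // 0.125, via the fast path
            System.out.println(fastScalb(0.0, 10));  // 0.0, via the fixup path
        }
    }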


Java does have Math.scalb(), which would work. It would be interesting to see how that compares to the Math.pow() and table lookup methods.
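
A naive harness for that comparison might look like the sketch below (not JMH, so mind the warmup caveats upthread; treat any numbers as rough at best):

    import java.util.concurrent.ThreadLocalRandom;

    public class PowBench {
        static final double[] TABLE = new double[64];
        static {
            for (int i = 0; i < 64; i++) TABLE[i] = Math.pow(2.0, -i);
        }

        public static void main(String[] args) {
            int[] mjs = new int[1 << 20];
            for (int i = 0; i < mjs.length; i++)
                mjs[i] = ThreadLocalRandom.current().nextInt(64);

            double sink = 0;  // accumulate so the JIT can't dead-code the loops
            for (int round = 0; round < 5; round++) {  // early rounds double as warmup
                long t0 = System.nanoTime();
                for (int mj : mjs) sink += Math.pow(2.0, -mj);
                long t1 = System.nanoTime();
                for (int mj : mjs) sink += Math.scalb(1.0, -mj);
                long t2 = System.nanoTime();
                for (int mj : mjs) sink += TABLE[mj];
                long t3 = System.nanoTime();
                System.out.printf("pow %.1fms  scalb %.1fms  table %.1fms%n",
                        (t1 - t0) / 1e6, (t2 - t1) / 1e6, (t3 - t2) / 1e6);
            }
            System.out.println(sink);
        }
    }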


Yeah, a millisecond is a long time even in Java. A couple of years ago I was writing a Java backend server for caching + computation. I was severely memory-strapped and was designing the daily expiration of data. The core data structure was a map<map>. The keys I needed for expiration were in the 2nd-level maps, and I was worried about the memory usage of a secondary index for fast expiry, plus the complexity of managing it. In the end I said what the heck and just did a full iteration of the 2nd-level maps with a hash lookup against the expired keys, kind of reversing the logic (it's essentially what a DB hash join does). I don't remember if I did a hashmap benchmark first or not. On my very old desktop (compared to my current machine) I was doing 0.5M hash lookups (2 lookups each, actually) per second per core, and the rest was a very successful story.
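
Roughly this shape (a sketch; the map types and key names are my guesses):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    public class Expiry {
        // outer key -> (inner key -> cached value); shapes assumed
        static Map<String, Map<String, Object>> cache = new HashMap<>();

        static void expire(Set<String> expiredKeys) {
            // No secondary index: scan every 2nd-level map and probe the
            // expired-key set once per entry -- effectively a hash join.
            for (Map<String, Object> inner : cache.values())
                inner.keySet().removeIf(expiredKeys::contains);
        }
    }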

So I think, yeah, with layers upon layers of libraries we tend to forget how fast a computer is, and that's often sad. The misconception around premature optimisation has an unfortunate side effect: people often ignore performance when designing things. The core parts of most designs, especially library or shared infrastructure code, should have some concept of overhead/cost, even though this shouldn't be overblown. Unconsciously bad designs are then often followed by super-efficient but complicated algorithms to fix the design inefficiency.

One thing I very much agree with in the blog post: the more conscious you become of these things, the more you can guess where a bottleneck exists or where you should take care. You won't always be right, and you should still benchmark, especially if it's critical enough or you are changing the design a lot for performance. But a general awareness can be very helpful.

When I completely rewrote that backend server a few months ago, I started with a very simple design, simple data structures, and simple algorithms (linear scanning with a bit of indexing through plain Java HashMap/ConcurrentHashMap), and it worked flawlessly the first time (in terms of performance). It is serving over a million requests per day of fairly complicated dashboard generation at ~0ms latency and 2 young-gen GCs/hr, I think. Some might call it premature optimisation, but I think you need to have an understanding of where performance is critical and design for that. That doesn't mean you should think everything is performance-critical and start optimising every line of your code.


Does anyone know whether the JVM/JIT generates SIMD instructions for code like this? Looks like a pretty obvious target to me.


Is it denormalised numbers [1]? Or is it just Math.pow being slow?

[1] http://stackoverflow.com/a/9314926


I started reading your post thinking I would never get it, and I am now amazed! Thanks for sharing.


This is such basic optimization I have no clue how this got any interest at all.

The level of "yeah, no shit" is pretty high here.


Everything is running just fine in this thread; I don't see the tone you're expressing. There's always some level of interest to someone - sharing is caring.


Some people reading Hacker News are interested in basic optimizations.


... or some people reading Hacker News who knew nothing about basic optimizations now do, which is something the industry desperately needs. Everyone has been incorrectly quoting Knuth to the point where optimization is widely regarded as a worthless exercise (very evident on e.g. StackOverflow) and I think it's absolutely great that someone new has learned to appreciate the art of it.

Lookup tables today, cache lines tomorrow.


Everyone wants everything pre-chewed for them. This is SO basic it's Sophomore CS. This guy is doing "machine learning" but WOW A TABLE LOOKUP IS FAST?!

This sounds like "code bootcamp" level trash.

Nevermind that optimizations as a whole have a rich history and there are fascinating examples aplenty that can actually teach you something.

Like, for example, this: http://files.righto.com/calculator/sinclair_scientific_simul...

About getting a scientific calculator to run on a $100 chip, when the chip's creator (TI) said it was impossible.

Or how the genesis of some of our optimizations (and architectural improvements) date back to parallelizing work...with different colored punch cards at Los Alamos by Feynman: http://blogs.msdn.com/b/msr_er/archive/2011/02/04/celebratin...

If you're a hacker, but everything has to be pre-chewed for you and this garbage rates as "interesting", what the fuck are you doing with your time? Buying "geeky" TShirts and playing ping pong? Jesus wept, what next, amazing Excel spreadsheet calculations? If this person is doing "machine learning", the machine doesn't stand a chance.

Sorry. I use HN as my 'smart things' filter - if I wanted 'for Dummies' I'd be on Reddit (and I have tons of stupid shit questions, and I ask them elsewhere). It makes me irrationally angry for some reason. Like, THIS is now the level of achievement that requires a blog post - and it gets to the front page!? The 'everybody gets a trophy!' level shit is here to stay. I have to do some serious thinking about why this pisses me off so much. Everybody is now a hacker, hackers are now at the level of blog posts over table lookups. I guess I'm angry over the dilution, commercialization and sanitization of the whole thing. Like the pool is now so shallow and full of piss, what the fuck happened? Everybody is a maker, a hacker, a tech-tropeneur, but subversive, active, hard-working, horizon seeking is largely gone. People want to reinvent the wheel in Javascript for the distributed machine learning {buzzword} cloud with great UI, blog about it, and cash out. It isn't about enlightenment anymore - personal or otherwise, unless it's dropping out to "find yourself" making artisanal bread and birdhouses.

How did we get so lost? Maybe it's an existential crisis. Sorry again. Table lookups are great. Go team.


I'd agree, wholeheartedly.

However, it's Friday. Take a nice deep breath. Relax.

Be happy that at least they are investigating the code and trying to find solutions (no matter how 'obvious' they may be) instead of just replacing it with another open-source library from somewhere else that happens to be faster in this case.

Everyone has to learn through their own experience, and I'm always happy when people realise how things can be better - even if it doesn't happen nearly enough these days.

Time for coffee :D



