I’d certainly seen this at Google, where they’re
pretty good at that kind of thing. But at Scalyr,
we settled on a much more brute-force approach: a
linear scan through the log.
The Art of Computer Programming mentions this methodology and the importance of learning it. There are entire sections on "tape algorithms" that maximize the brute-force linear scan of tape drives, complete with diagrams of how a hypothetical tape drive behaves in that situation.
Few "fancy algorithms" books actually get into the low-level stuff. But Knuth knows: modern computers are extremely fast at processing linear data, and accessing data linearly (as well as learning algorithms that access data linearly) is a good idea.
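To make the "brute-force linear scan" concrete, here's a minimal sketch (not Scalyr's actual code) of the idea: walk the whole log buffer front to back and record the offset of every match. Function and parameter names here are my own, for illustration only.

```python
def linear_scan(log: bytes, needle: bytes) -> list[int]:
    """Return the byte offset of every occurrence of needle,
    scanning the buffer front to back with no index at all."""
    hits = []
    off = log.find(needle)
    while off != -1:
        hits.append(off)
        off = log.find(needle, off + 1)
    return hits
```

The whole point of the approach is that this loop is pure sequential memory access, which prefetchers and caches love; a real implementation would do the inner search with something like memchr/SIMD rather than `bytes.find`.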
I'll bet a B-tree approach would actually be fastest. B-trees are usually good at balancing raw linear performance with theoretical asymptotic complexity. IIRC, there is some research in this area: look up cache-sensitive search trees.
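The cache-sensitive idea, sketched very roughly: lay out a sorted array's implicit search tree in breadth-first order, so the hot top levels of the tree sit packed together in memory. This is only an illustration of the layout trick, not the CSS-tree structure from the literature; all names are mine.

```python
def eytzinger(sorted_keys):
    """Reorder sorted keys into BFS order of the implicit binary
    search tree (an in-order fill of indices 0..n-1 yields a BST)."""
    out = [None] * len(sorted_keys)
    it = iter(sorted_keys)
    def fill(i):
        if i < len(out):
            fill(2 * i + 1)   # left subtree first: in-order = sorted
            out[i] = next(it)
            fill(2 * i + 2)   # then right subtree
    fill(0)
    return out

def search(tree, key):
    """Descend the BFS layout by index arithmetic; True if key present."""
    i = 0
    while i < len(tree):
        if tree[i] == key:
            return True
        i = 2 * i + 1 + (key > tree[i])  # pick left or right child
    return False
```

The descent touches at most one cache line per tree level near the top, which is the property cache-sensitive search trees (and B-trees with wide nodes) exploit.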
We use some special tricks for searches that are executed
frequently, e.g. as part of a dashboard. (We’ll describe
this in a future article.)
And...
(You might wonder why we store log messages in this
4K-paged, metadata-and-text format, rather than
working with raw log files directly. There are many
reasons, which boil down to the fact that internally,
the Scalyr log engine looks more like a distributed
database than a file system. Text searches are often
combined with database-style filters on parsed log
fields; we may be searching many thousands of logs at
once; and simple text files are not a good fit for our
transactional, replicated, distributed data
management.)
It sounds like they're doing more than just appending to the end of the log. If you're going to build an index of any kind, it will likely be fastest with some sort of B-tree.
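Why an ordered index suits an append-only log: new keys (timestamps) almost always sort after everything already present, so inserts land at the end. Here's a toy stand-in for a B-tree using the stdlib `bisect` module over parallel sorted arrays; the class and method names are hypothetical, purely for illustration.

```python
import bisect

class OrderedIndex:
    """Toy ordered index mapping timestamps to byte offsets in a log.
    A real system would use a B-tree; sorted arrays show the same API."""

    def __init__(self):
        self._keys = []     # sorted timestamps
        self._offsets = []  # parallel array of log byte offsets

    def add(self, ts, offset):
        i = bisect.bisect_right(self._keys, ts)
        self._keys.insert(i, ts)        # cheap when ts is the newest key
        self._offsets.insert(i, offset)

    def range(self, lo, hi):
        """Offsets of all entries with lo <= ts < hi."""
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_left(self._keys, hi)
        return self._offsets[i:j]
```

Range queries like this (find all log pages in a time window, then scan only those) are the kind of database-style filtering the article alludes to.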