The Rise and Fall of the OLAP Cube (holistics.io)
194 points by shadowsun7 on Jan 30, 2020 | 53 comments


I used to admin OLAP infrastructure for a large Fortune 100 company. I think the article makes a lot of great points, but ultimately there are many paths to the same result. I'm not entirely sure what that group is up to these days, though I'm sure column stores are a big part of it. Ultimately, I think deciding between precalculated and dynamically calculated pivots is the big choice.

Whatever system you have can do that, but the real work IMHO is understanding the org well enough to cover the 80/20 of what people will want to see. Ideally you want to get to a higher level of abstraction, continuously codifying your methods of analysis or pivots into traditional tables where possible, for maximum repeatability.

Sometimes I wonder if graphs or trees really make more sense, though, and whether OLAP being a tree of sorts is just a symptom of RDBMS ubiquity. But this is just a meandering notion, perhaps.


This piece reiterates a bunch of the usual misunderstandings of OLAP. Likely because it’s following the DBMS community’s redefinition of OLAP towards cross-tab group-by, and away from the higher end of the OLAP market. The “cube” is just as much a logical construct as a SQL table is, and just because row stores were bad at certain analytics didn’t mean that SQL was.

Compressed column-stores hurt OLAP, because update throughput (the “on-line” in OLAP) is relatively bad. Uncompressed / array stores are quite good.

The quantity and complexity of SQL necessary for a non-trivial OLAP view is daunting: time-series views that mix YTD and current-period calculations for transactional and balance accounts, ratio measures, and the joins needed to calculate and aggregate across the many (easily dozens of) fact tables that have different dimensionality and granularity and must be joined at their finest detail. The user will want to see a set of metrics for a pair of organizational units, and also as a percentage difference between the two org units. SQL does not have inter-row calculations, so quite a bit of work goes into reshaping the SQL cursor's results into something the user wants to see.
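A tiny illustration of that last point, using a hypothetical two-row fact table: SQL has no direct inter-row arithmetic, so comparing two org units means pivoting rows into columns (conditional aggregation) before the percentage difference can even be expressed. All names here are invented for the sketch.

```python
import sqlite3

# Hypothetical fact table: one revenue measure per (org_unit, period).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (org_unit TEXT, period TEXT, revenue REAL)")
con.executemany("INSERT INTO fact VALUES (?, ?, ?)", [
    ("east", "2020-01", 100.0),
    ("west", "2020-01", 80.0),
])

# Pivot the two units' rows into columns, then compute the ratio on the
# reshaped result -- the reshaping a cube server would generate for you.
row = con.execute("""
    SELECT SUM(CASE WHEN org_unit = 'east' THEN revenue END) AS east,
           SUM(CASE WHEN org_unit = 'west' THEN revenue END) AS west,
           100.0 * (SUM(CASE WHEN org_unit = 'east' THEN revenue END)
                  - SUM(CASE WHEN org_unit = 'west' THEN revenue END))
                 / SUM(CASE WHEN org_unit = 'west' THEN revenue END) AS pct_diff
    FROM fact WHERE period = '2020-01'
""").fetchone()
print(row)  # (100.0, 80.0, 25.0)
```

Multiply this by dozens of fact tables and measures and the generated SQL gets unwieldy fast.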

All that query generation and result transformation is part of the value-add of the “cube” server.

So the OLAP cube as a logical construct is definitely not fallen. Just the bad ones. They’re less flexible than SQL but provide way more productivity within the query space they’re built for.


> Compressed column-stores hurt OLAP, because update throughput (the “on-line” in OLAP) is relatively bad. Uncompressed / array stores are quite good.

But why do you want high update throughput when you're doing mostly reads? Every definition I've read of OLAP says this is one of the fundamental differences, or am I misunderstanding something?


Your users want their dashboards and reports to contain data that's as up-to-date as possible. No one is happy about waiting 24 hours or more for a batch update to run if there are decisions they need to make right now. The difference vs. OLTP is that it might be acceptable for events to take minutes or even hours to be reflected in the database, vs. ~seconds or less for a transaction where the user is interacting and waiting for a confirmation.


Most of the users I work with, when asked why they want more than daily batches, respond with something emotional rather than a business case. I once had a request for real-time updates on a massive metric that is based on monthly usage.

Ya just gotta push back and help people understand what the trade-offs are for real-time. Most people don't need it.


Absolutely true. The ones that do, do. For example, multiuser planning systems absolutely do. And those were what OLAP was originally intended to support.


You’re missing the important definitions from back when OLAP originated. There is a high update rate for the use cases I build software for.


I'm conflicted; it's a nice write-up, and probably generally true right now, for most stuff.

However, I still live with databases big enough to need cubes, although those cubes can afford to be less refined these days. Saying 'Bigtable can do a regex on 30M rows per second' isn't saying it can't be done cheaper and quicker without paying Google etc., if you just have some cubes.

And I think it's going to track the normal sine wave: over time, data sets get bigger, and we keep oscillating between needing to cube and being able to have the reporting tool 'cube on the fly' behind the scenes.

I think there's a similar cycle, not mentioned in the article, as data lakes become faster, then data outstrips them, and so on.

The strength will be tooling that transparently cubes on demand. I wish there were efficient statistics and CDC that tracked metadata so tools could say 'this MySQL table has been written to since I last snapshotted something', and, even better, 'this materialized view that I have in this database is now out of date because of writes affecting the expression it is derived from on that other database over there'. Basic classic data sources can do a lot of new things to make downstream tools able to cache better.

I have a slight problem with the terminology in the middle of the article, as I'm so far down the rabbit-hole that I think of cubes _as_ databases; I suffer cognitive dissonance when I read about shifts from cubes to databases etc. To me, a cube is just a fancy term for a table/view for a particular use-case.

One tool that I'm terribly excited about these days is presto. https://prestosql.io/ allows you to take a constellation of different normal databases and query them as though they were one big database. And you can just keep on adding data-sources. Awesome!


Thanks for your comment, I'm not familiar with presto at all - but I did do a bit of reading of an older article: https://www.slideshare.net/frsyuki/prestogres-internals

Would you view presto in its current state as a replacement for vanilla Postgres with FDW for standard data analysis queries? I don't fully understand the Postgres/Presto relationship.


Hmm, presto is not Postgres.

In a way, presto is like a bunch of FDWs on steroids, and a query planner that has above average cost model for hive etc.

There are plenty of things that presto isn’t, such as a good replacement for Postgres in classic oltp workloads.


Columnar database engines are powerful enough to answer ad-hoc questions, so you often don't need to materialize the summary data somewhere else or use BI tools such as Tableau that fetch the data into their server and let you run queries on their platform.

ELT solutions such as Airflow and DBT let you materialize the data in your database with (incremental) materialized views, similar to the way OLAP cubes work but inside your database and using only SQL. That way you won't get stuck with vendor lock-in (looking at you, Tableau and Looker); instead you manage the ELT workflow easily using these open-source tools.
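A minimal sketch of that incremental-materialization pattern, using SQLite and made-up table names (this is the general idea, not dbt or Airflow themselves):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER, day TEXT, amount REAL)")
con.execute("CREATE TABLE daily_summary (day TEXT PRIMARY KEY, total REAL)")
con.executemany("INSERT INTO events VALUES (?, ?, ?)",
                [(1, "2020-01-01", 5.0), (2, "2020-01-01", 7.0), (3, "2020-01-02", 3.0)])

def refresh_incremental(con, since):
    # Rebuild only the days at or after the watermark, the way an
    # incremental dbt-style model does, instead of recomputing everything.
    con.execute("DELETE FROM daily_summary WHERE day >= ?", (since,))
    con.execute("""
        INSERT INTO daily_summary
        SELECT day, SUM(amount) FROM events WHERE day >= ? GROUP BY day
    """, (since,))

refresh_incremental(con, "2020-01-01")
rows = con.execute("SELECT * FROM daily_summary ORDER BY day").fetchall()
print(rows)  # [('2020-01-01', 12.0), ('2020-01-02', 3.0)]
```

The summary table plays the role the precomputed cube used to, but it lives in the warehouse and is plain SQL.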

These tools target analysts/data engineers, not business users, though. Your data team needs to model your data, manage the ETL workflow, and adopt a BI tool for you. When you want a new measure in a summary table, you need to contact an analyst at your company and have them change the data model. As someone working in this industry, I can say we still have a way to go, but BI workflows will be much more efficient in a few years thanks to columnar databases.

Shameless plug: we're also working on a data platform where you model your data (dimensions, measures, relations, etc.) and build ad-hoc analytics interfaces for the business users. If a business user wants to optimize a specific set of queries (OLAP cubes), they simply select the dimension/measure pairs and the system automatically creates a DBT model that builds a summary table in your database, similar to OLAP cubes, thanks to the GROUPING SETS feature in ANSI SQL. Here are some of the public models if you're interested: https://github.com/rakam-io/recipes
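For the curious, here is roughly what a GROUPING SETS summary computes, sketched against SQLite (which lacks GROUPING SETS, so the equivalent UNION ALL of individual group-bys is spelled out; table and column names are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("east", "a", 10.0), ("east", "b", 20.0), ("west", "a", 5.0),
])

# ANSI SQL: GROUP BY GROUPING SETS ((region), (product)) computes both
# rollups in one statement. SQLite has no GROUPING SETS, so the same
# result is expressed as a UNION ALL of the individual group-bys.
rows = con.execute("""
    SELECT region AS key, SUM(amount) FROM sales GROUP BY region
    UNION ALL
    SELECT product, SUM(amount) FROM sales GROUP BY product
""").fetchall()
print(sorted(rows))  # [('a', 15.0), ('b', 20.0), ('east', 30.0), ('west', 5.0)]
```

One summary table holding several rollup granularities at once is exactly what a small cube gives you.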


After maintaining an OLAP cube system for some years, I'm not that sure after reading the article.

The nice thing about an OLAP cube is the UI and how easily business users can drag and drop items to explore data (standard reports are best created automatically and don't need an OLAP layout/setup).

If the UI (Tableau, Excel Power Pivot) is the same, then yes, OLAP cubes are a thing of the past. Otherwise not.


It's basically the same.


I think the article is a bit simplistic.

It's true that _often_ OLAP cubes are not needed. That's simply because the amount of data and the latency requirements are _often_ not too demanding.

Also, materialized views don't solve the major issue with OLAP cubes: the need to maintain data pipelines.

I wonder if a solution to this problem could come from a different way of caching result sets: new queries that would produce a subset of a previously cached result could be run against the cached result itself. Of course this opens up a new set of problems, cache invalidation, etc.
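A toy sketch of that caching idea, with invented names and no invalidation logic: a filtered query over the same dimensions is a subset of a cached group-by, so it can be answered from the cache without rescanning the base data.

```python
# Base fact data: (region, month, amount).
facts = [("east", "2020-01", 10.0), ("east", "2020-02", 4.0),
         ("west", "2020-01", 6.0)]

cache = {}

def aggregate(dims):
    # Group-by SUM over the base data, memoized by the dimension tuple.
    if dims not in cache:
        out = {}
        for row in facts:
            key = tuple(row[i] for i in dims)
            out[key] = out.get(key, 0.0) + row[2]
        cache[dims] = out
    return cache[dims]

def query(dims, filters):
    # `filters` maps a position in `dims` to a required value. The
    # filtered result is carved out of the cached aggregate; `facts`
    # is never rescanned once the cache is warm.
    full = aggregate(dims)
    return {k: v for k, v in full.items()
            if all(k[i] == val for i, val in filters.items())}

aggregate((0, 1))                     # warm the cache: totals by (region, month)
print(query((0, 1), {1: "2020-01"}))  # {('east', '2020-01'): 10.0, ('west', '2020-01'): 6.0}
```

The hard part, as noted, is knowing when writes to `facts` invalidate which cached entries.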


Two false statements in this article:

> ...Amazon, Airbnb, Uber and Google have rejected the data cube...

Airbnb uses Druid which is essentially an OLAP cube.

> BigQuery, for instance, doesn’t allow you to update data at all

It hasn't been like that for several years now.


Druid is a column store geared towards OLAP workloads, which seems to support the article's point. It does not store its data in a "cube" layout as far as I can tell, just by time and individual columns.


> It's not like that anymore since several years.

Apparently the article's been updated to reflect that


Amazon uses Apache Kylin, which is a modern OLAP cube technology.


Amazon also uses Redshift, which is a modern columnar database.

An example does not an argument make.


Sorry, could someone ELI5 what OLAP is? And while you're at it, what is the Tabular Model? As background, I have worked with SQL and relational databases, and I keep hearing these terms occasionally, but nobody has ever explained to me what they are and why I should be interested. So far I have just shrugged and assumed that my workloads/data models/whatnot simply don't need these fancy things, but whenever I see them, something nags at the back of my head that maybe I should have a look...


OLAP: Online Analytical Processing. Cranking through large amounts of data with a focus on aggregations like sums, averages, medians, etc. Measures (numbers) are defined by dimensions (attributes with usually discrete domains). Aggregations are frequently precomputed on many (or all) dimension axes so that they are immediately available. Models can be built from something as simple as a wide CSV file or as complicated as a snowflake schema.

The kind of relational database you're familiar with is probably OLTP (Online Transactional Processing).


So, when my aggregate queries/views are too slow even after indexing and tuning, then I should start to google what OLAP is? My go-to tool for this has been materialized views (or in some cases simply a new table that is refreshed every now and then). What would be the cases where OLAP is better/worse than a materialized view? (Based on the main article, it sounds like OLAP has pretty much no advantage over a materialized view other than smaller resource requirements.)


If your needs are well-defined in advance, you are fine with a materialized view. What I mean by "needs" is, "What questions are you trying to answer?"

OLAP's strength is that the platforms that implement it can precompute aggregations across all of your data and let you quickly answer questions that you might not have known you had.
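That precomputation is the classic data-cube idea: aggregate the measure over every subset of the dimensions up front, so any later group-by question is just a lookup. A toy sketch with invented data:

```python
from itertools import combinations

# Toy materialization of a full data cube: precompute SUM(amount) for
# every subset of the dimensions (region, product).
facts = [("east", "a", 10.0), ("east", "b", 20.0), ("west", "a", 5.0)]
n_dims = 2

cube = {}
for r in range(n_dims + 1):
    for subset in combinations(range(n_dims), r):
        agg = {}
        for row in facts:
            key = tuple(row[i] for i in subset)
            agg[key] = agg.get(key, 0.0) + row[-1]
        cube[subset] = agg

print(cube[()])                     # {(): 35.0}  (grand total)
print(cube[(0,)])                   # {('east',): 30.0, ('west',): 5.0}
print(cube[(0, 1)][("east", "b")])  # 20.0
```

With d dimensions there are 2^d of these group-bys, which is why cube builds are expensive and why columnar engines that aggregate on the fly are attractive when the data fits.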


What are typical data sizes for this?

I think the OLAP term has been around a long time; some OLAP tasks of the past are probably not so huge today. I wonder if those shrunk-by-time tasks are still called OLAP or if the smaller ones are implemented differently.


It's been about 15 years since I did any OLAP work, so "big" back then was a source database measured in gigabytes with queries that took minutes to run even with a lot of optimization.


OLAP is a way of constructing queries and persisting/refreshing their results. Some OLAP based systems have a query designer where you can drag and drop columns, set values for certain columns to reduce the data, aggregate certain columns or pivot certain columns. They also have special handlers for pivoting such as pivoting datetime columns by a given granularity.
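The datetime-granularity pivoting mentioned above boils down to bucketing a timestamp column before grouping. A sketch of the kind of SQL such a query designer might generate, against SQLite with invented table and column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (ts TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)", [
    ("2020-01-03 09:00:00", 10.0),
    ("2020-01-28 17:30:00", 5.0),
    ("2020-02-02 08:15:00", 7.0),
])

# Bucket the datetime column to a chosen granularity (here, month),
# then aggregate -- what a drag-and-drop designer emits behind the scenes.
rows = con.execute("""
    SELECT strftime('%Y-%m', ts) AS month, SUM(amount)
    FROM orders GROUP BY month ORDER BY month
""").fetchall()
print(rows)  # [('2020-01', 15.0), ('2020-02', 7.0)]
```

Swapping `'%Y-%m'` for `'%Y'` or `'%Y-%m-%d'` changes the granularity without touching the rest of the query.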

All these reports can be kept up to date in a live fashion by the OLAP service as new relevant data comes in, much like a materialized view.

Many accounting systems support OLAP based queries in order for the accounting department to design reports and export them to excel.


Well, I would say the OLAP cube is just re-rising now. More than 1,000 companies have deployed Apache Kylin (an OLAP engine for big data) in the past 5 years, for 100+ billion rows and 100+ concurrent users. Many different use cases are based on that technology. It works very well with BI tools and is friendly to analysts who have used this "old fashion" every day for over a decade (how hard would it be for them to become data scientists?). Check more here: http://kylin.apache.org/community/poweredby.html


+1 for Apache Kylin, it's a great project and awesome open source community. If anyone is curious about what modern OLAP is capable of, check it out.

If you want an alternative take on where OLAP is today and what it is capable of, I usually recommend this article: https://kyligence.io/blog/olap-analytics-is-dead-really/


There are huge problems with that piece. For starters, OLAP != OLAP cube. Columnar databases and OLAP cubes are both designed for OLAP workloads. They are simply different architectures. Therefore, it is impossible to argue that 'OLAP is dead' — it cannot be dead, because OLAP is simply a type of database usage.

At this point you might say, "oh, OLAP cubes refer to an abstraction, it can be implemented using columnar stores!" To that I would point you to 40 years' worth of academic research that stretches back to the early 80s. The OLAP cube or data cube refers to a specific type of data structure. It just so happens that vendors like to use the term 'OLAP cube' even when they are using a columnar engine under the hood, because it sells well.


For anyone wondering, OLAP != OLAP Cubes

OLAP = A category of databases meant for analyzing data. These are eventually consistent db's, and not OLTP db's. OLAP db's include Redshift, Teradata, Snowflake, BigQuery, and others. Generally what makes a database an MPP database is partitioning compute and storage. Generally what differentiates one MPP db from another is whether or not data and compute are colocated.

OLAP Cubes = A feature built into SQL Server that has its own dialect of SQL called MDX. OLAP Cubes are decreasing in popularity because you can achieve the same results through other means with less effort.


Thank you for saying this! I think your definition of OLAP is good, but the definition of an OLAP cube could simply be 'a data structure that is basically a very sophisticated nested array'. I'm oversimplifying, of course; there are over four decades of work at this point on OLAP cubes, including various compression schemes (e.g. https://link.springer.com/chapter/10.1007/978-3-642-17569-5_...) and other techniques to aggregate data when the cube outstrips the host's memory. But by and large it is important to understand that an OLAP cube is simply a data structure ... and different vendors have different implementations and optimisations of said data structure.

I think this confusion exists because too many vendors conflate the two terms. They say OLAP when they mean OLAP cube. BigQuery and Redshift, however, do not: they are very clear that their dbs are designed for OLAP workloads but are not cubes.


Redshift is not eventually consistent, it's fully ACID compliant.


You are correct; 'eventually consistent' is a hand-wavy description of Redshift. While Redshift is ACID compliant, it squarely belongs in the category of OLAP DBs, not OLTP DBs.


Not unique to MS SQL Server - Oracle and IBM (InfoSphere) have cube engines, and there have been independent implementations as well: https://en.wikipedia.org/wiki/Comparison_of_OLAP_servers


That’s a valid point. Thanks.


What's some concrete example of OLAP cube? What does Alice, a data analyst, actually do when she gets to work at her computer? What does she interact with on the screen? Does she use some specialized software to project the data cube into two dimensions (contingency tables) to find hidden meaning in the data? There's a lot of abstract talk and no actual examples on the Internet, except for SQL Server tutorials which always end up with some kind of E-R diagram.


Cost is a big factor the author underestimates in this big-data era. A precalculated cube is not only faster but also many times cheaper in the cloud, thanks to reuse of the precalculated result.

Dynamic query services in the cloud basically charge by processed data volume, like Google BigQuery and Amazon Redshift/Athena. For small and medium datasets, this works well. But for big data close to or above billions of rows, the cost will make you reconsider.

At the recent Apache Kylin meetup in Berlin, OLX Group shared a comparison between OLAP cubes and dynamic queries in a real case. Given 0.1 billion rows, cube technology (Apache Kylin and SSAS) easily prevails over MPP+columnar (Redshift). In particular, Apache Kylin is 3.8x faster and 4.4x cheaper than Redshift for their business. (https://www.slideshare.net/TylerWishnoff/apache-kylin-meetup...)

For me, a mix of precalculation (80%) and dynamic calculation (20%) should hit the sweet spot between cost effectiveness and query flexibility.


Thanks for sharing. Interesting write up.

While this article accurately captures the issues with traditional OLAP cubes, it fails to recognize the latest developments in this domain.

Projects like Apache Kylin, and its commercial version Kyligence, leverage modern computing architectures such as columnar storage, distributed processing, and AI optimization to build cubes over hundreds of billions of rows of data covering hundreds of dimensions. The performance is unprecedented among both traditional OLAP cubes and today's MPP data warehouses. That's why the world's largest banks, retailers, insurance companies, and manufacturers are turning to Kylin/Kyligence for their most challenging analytical problems.

Not to mention the rich semantic layer that modern OLAP cube technology provides, which greatly simplifies analytics architecture in the enterprises.

And, comparing columnar stores to OLAP cubes is like comparing apples to oranges. The former is a storage format and the latter is an analytical pattern. Modern OLAP cube technology like Kylin/Kyligence stores cubes in columnar stores anyway.


> And, comparing columnar stores to OLAP cubes is like comparing apples to oranges. The former is a storage format and the latter is an analytical pattern. Modern OLAP cube technology like Kylin/Kyligence stores cubes in columnar stores anyway.

This is mistaken. I went back and read most of the academic literature on OLAP cubes while working on this piece (where, unlike in vendor marketing, the term has been used consistently since the early 80s). OLAP cubes or data cubes refer specifically to the data structure that grew out of nested arrays. An OLAP cube may be materialized from a column store, but a column store isn't an OLAP cube.

Relevant sources are included at the bottom of the piece.


If you're a hacker interested in SQL and OLAP, you might enjoy Greenspun's (he of Greenspun's tenth rule fame) writings on these subjects:

https://philip.greenspun.com/wtr/data-warehousing.html

https://philip.greenspun.com/sql/

Although technically obsolete (in that it talks about 90s database systems that have since bitten the dust), that's a minor defect. He spends most of his effort on teaching timeless principles.


Interesting perspective. What do you say to someone who's been using OLAP cube for their entire BI implementation? What would be the transition plan to adopting MPP databases?


Background: someone who worked in the old Microsoft Analysis Services stack 8 years ago and has since moved to columnar data formats like Parquet and row-based MPPs.

As a transition plan, I would say find the dataset that is exploding in size or complexity and start your POC there. I did customer service datasets on OLAP, so it only grew at a pretty small scale and the data model didn't change much. OLAP was fine, except for the fact that nobody else knew how to maintain it.

The main growing pain is finding what your front-end developer flow will be. It will be the same governance as OLAP but more democratized, so be ready to make your back end more front end. For MSAS it was Excel, but more modern systems are also more wide open. The article suggests just SQL, but that can get out of control. How do you reduce reinventing the wheel? How do you prevent a lineage mess of derivative on top of derivative if you give users write access? IMO Tableau is a great product that allows OLAP-like exploration but can use SQL as an input. Just make sure people get the SQL behind it under some kind of source control and governance.

From the data-model perspective, I think the main difference is to make the tables wider and "pre-join" your immutable dimensions with higher cardinality (i.e. customer). Just be careful with highly mutable data and keep it in separate tables, because it is very painful to rewrite columnar data: if you partition by date, then to update a single record you rewrite the entire date partition.

(About governance: I mean passive governance, not gatekeeping. Pretend each end user and dataset costs you money. How do you track them passively with some thin yet easily trackable logging? Business unit and unique job are the bare minimums.)


I have one question for you guys. If my company is focused on Spark and Vertica, and I want to learn data modelling on top of those, does Kimball still make sense? The article says yes in general but I'd like to know your opinions.

Currently the BI team doesn't do much dimensional modelling as far as I can see. Everything is taken from Kafka and dumped into some wide tables with all the columns that we analysts need. Actually, there is no data modelling at all.


It seems the article makes a category error in arguing that OLAP cubes were replaced by columnar data stores. I always understood the OLAP cube as an abstract concept that can have various technical implementations, while a column store is a kind of optimization within such an implementation.


The article is generalizing, as such articles do. Necessarily so. There are always exceptions. It is unfortunate that when such articles appear the dominant response is nitpicking or exception pointing (I'd say 80%+ of the comments thus far).

In the overwhelming bulk of cases firms have a column store that they generate cubes from. OLAP is run against the cubes. Some put warehousing in between, though that changes little. Cubes are fundamentally a form of caching because historically it was prohibitive to do large-scale aggregations in real-time. With massive memory servers, and more importantly flash storage that improves aggregate performance at the enterprise scale by many magnitudes -- a million times faster analysis and aggregation is entirely possible -- that historic caching step becomes a hindrance and maintenance/timeliness issue. So it's discarded.

That's all 100% true. It has happened in many orgs.

Not all, of course. But as a general trend.


I went back and read most of the academic literature on OLAP cubes while working on this piece (where, unlike in vendor marketing, the term has been used consistently since the early 80s). OLAP cubes or data cubes refer specifically to a data structure. An OLAP cube may be materialized from a column store, but a column store isn't an OLAP cube.

The proof of this? Go to any serious columnar database provider and search for the words 'OLAP cube'. You will find that they are careful to say 'OLAP workload', but not 'OLAP cube' — because in the strict definition of the term, an OLAP cube or data cube is an entirely different architecture.

Relevant sources are included at the bottom of the piece.

endorphone's comment has it right.


that's true


How does something like Tableau fit in? I know of people using Tableau with a Postgres connector, but I am not sure if that allows you the same kind of performance as you'd get with OLAP or even a columnar DB.


Tableau (the general visualization engine) just runs queries against whatever database (relational, cube, columnar, web service, file).

Tableau has a proprietary extract engine called Hyper that uses a sort of columnar storage. Then you can use the extract in dashboards.


I don't know about Tableau, but Microsoft Power BI does use a columnar engine; I believe it's the same VertiPaq/xVelocity component used in Excel Power Pivot and SSAS Tabular models.


Correct. It also shares a fair bit of implementation with columnstore indices in MSSQL Server.

There's also a proprietary relational query language shared among Power BI, Power Pivot, and SSAS Tabular, called DAX.


The author doesn't know today's OLAP. Look at the Tabular Model in SQL Server Analysis Services (Power BI): an in-memory, column-oriented analytics engine.


> Codd got called out for his conflict of interest and was forced to retract his paper … but without much fallout, it seems: today, Codd is still regarded as ‘the father of the relational database’

I found this passage confusing. He is regarded as such because of his work on the relational algebra, and the shady OLAP backstory is unrelated to that.



