It looks very important, popular and well established, but what is it? I looked at the README hoping to understand what a semantic layer for building data applications means but no love.
> It helps data engineers and application developers access data from modern data stores, organize it into consistent definitions, and deliver it to every application.
Like an ORM? Or a middleware?
> Cube was designed to work with all SQL-enabled data sources, including cloud data warehouses like Snowflake or Google BigQuery, query engines like Presto or Amazon Athena, and application databases like Postgres.
Still not getting it. Is it that it can perform a single query across multiple databases?
As part of the Cube team, I have to admit that all descriptions in the sibling comments make a lot of sense. Of course, the "semantic layer" thing is well known to data engineers/analysts and other data folks in general (they also know things like "metrics store", "headless BI", etc.) but not that well known outside of the data space. Probably, it would be best to describe the major use cases Cube was created for.
1. Embedded analytics: you have your data somewhere (data warehouse, database, etc.) and you'd like to embed it into a data app. Cube would provide connectivity to data sources, data modeling to define the metrics, caching to make your analytics fast, and APIs and SDKs to deliver them to the data app. E.g., if you decided to add a chart to your front-end app, fetching the data from the API would be as easy as sending a JSON query to Cube.
2. Semantic layer for the internal BI: you have your data somewhere and you'd like to provide access to insights based on that data to business users. Cube would provide connectivity to data sources, data modeling to define the metrics, access control to make sure only those who need access to metrics have it, caching to make sure every dashboard loads instantly, and APIs to deliver the data to BI tools, notebooks, etc. E.g., if you want to create some dashboards in Superset, Metabase, Tableau, or Power BI, you'd just need to connect to Cube's SQL API as if it were a regular database and start creating charts/dashboards.
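To make "sending a JSON query" a bit more concrete, here is a rough sketch of what such a request could look like. The `orders` cube, its member names, the host, and the token are all made up for illustration; the `/cubejs-api/v1/load` path and the query keys follow Cube's REST API as I understand it, so double-check them against the docs for your version. The snippet only builds the request and does not send it.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical deployment details: replace with your own host and token.
CUBE_API_URL = "http://localhost:4000/cubejs-api/v1/load"
API_TOKEN = "<your-jwt-here>"

# A JSON query: which metrics, grouped by what, over which time range.
# "orders.count", "orders.status" and "orders.created_at" are assumed names
# from an imaginary data model, not something Cube ships with.
query = {
    "measures": ["orders.count"],
    "dimensions": ["orders.status"],
    "timeDimensions": [
        {
            "dimension": "orders.created_at",
            "dateRange": "last 30 days",
            "granularity": "day",
        }
    ],
}


def build_load_request(query: dict) -> Request:
    """Build (but do not send) a GET request carrying the JSON query."""
    params = urlencode({"query": json.dumps(query)})
    return Request(f"{CUBE_API_URL}?{params}", headers={"Authorization": API_TOKEN})


request = build_load_request(query)
print(request.full_url)
```

The front end never writes SQL here: it names metrics and dimensions, and the layer in between decides how to compute them.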
That makes a lot of sense to me, and I see why it would be hard to coalesce all of that functionality into one or two sentences that would make sense to a more general, non-data, tech audience.
My understanding is that it's essentially Looker minus the dashboarding. What you would define via LookML is essentially the "semantic layer" that this is addressing. DBT is attempting to do similar work: https://www.getdbt.com/product/semantic-layer/
Cube has saved me hundreds of hours. I use it as the backend for reporting and dashboards inside our SaaS. In our frontend I've built a light version of PowerBI, with Cube as the backend. Instead of manipulating SQL directly I use Cube's JSON query format. Kind of difficult to explain, but Cube might be the best piece of software I have ever used.
Maybe a good tagline would be "self-hostable Backend as a Service for data analysis"?
Let’s say you work for a SaaS doing analytics. Your boss says “hey! We need to start reporting on new logos. Can you snag those from the DB?”
But what counts as a new logo? Does a pro serve engagement that doesn’t use the product count? What about a business using the SaaS but still in a trial period? Etc.
A semantic layer helps provide commonly agreed-upon definitions to the business. So anyone looking for common data entities can just look those things up… and can arrive at published definitions (which are backed by queries to databases, data lakes, etc).
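As a toy illustration of a "published definition", here is a sketch where the new-logo rule lives in exactly one place. The `Account` fields and the specific answers (trials don't count, services-only engagements don't count) are invented for the example; your business might decide differently, which is the whole point of writing it down once.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Account:
    name: str
    signed_on: date
    on_trial: bool
    uses_product: bool  # False for e.g. a services-only engagement


def is_new_logo(account: Account, period_start: date, period_end: date) -> bool:
    """The single, hypothetical definition of "new logo" everyone shares."""
    return (
        period_start <= account.signed_on <= period_end
        and not account.on_trial
        and account.uses_product
    )


accounts = [
    Account("Acme", date(2023, 1, 10), on_trial=False, uses_product=True),
    Account("Globex", date(2023, 1, 15), on_trial=True, uses_product=True),
    Account("Initech", date(2023, 1, 20), on_trial=False, uses_product=False),
    Account("Umbrella", date(2022, 12, 5), on_trial=False, uses_product=True),
]

january = (date(2023, 1, 1), date(2023, 1, 31))
new_logos = [a.name for a in accounts if is_new_logo(a, *january)]
print(new_logos)  # ['Acme']
```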
Does that help? Another example of this would be dbt.
It is kind of like an ORM. I find ORMs and semantic layers to be similar in many ways, except that semantic layers are meant for defining metrics too. These metrics describe how to aggregate data, like summing order amounts to get revenue, or counting order_ids to get sales.
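A minimal sketch of that idea, assuming a made-up `orders` table and a hand-rolled `MEASURES` dictionary (this is not how any particular semantic layer is implemented): each metric is a named aggregate, and a query is just a list of metric names plus a column to group by.

```python
import sqlite3

# Hypothetical metric definitions: name -> SQL aggregate over the orders table.
MEASURES = {
    "revenue": "SUM(amount)",
    "sales": "COUNT(order_id)",
}


def metric_sql(measures: list[str], dimension: str) -> str:
    """Compile metric names plus one grouping column into a SQL query.

    Toy code: the dimension is interpolated as-is, so only pass trusted names.
    """
    selected = ", ".join(f"{MEASURES[m]} AS {m}" for m in measures)
    return (
        f"SELECT {dimension}, {selected} FROM orders "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "paid", 20.0), (2, "paid", 30.0), (3, "refunded", 5.0)],
)

rows = conn.execute(metric_sql(["revenue", "sales"], "status")).fetchall()
print(rows)  # [('paid', 50.0, 2), ('refunded', 5.0, 1)]
```

Note the mapping only goes one way, from rows to metrics; nothing here tries to write objects back to the database.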
I think ORMs have got some bad press because they were intended to be used bi-directionally: map data from the data source to business objects and back. With semantic layers, data is only mapped to metrics and rarely back - which makes things much simpler, IMO.
I can't vouch for cube itself as I haven't used it but can confidently say such tools are highly valuable. I built one for use in my own business and have operated other businesses on similar tools.
It brings all data together, provides a consistent interface, and is way faster than writing SQL (though there will still be use cases for that). There is some up front cost to getting configured but it pays off in my case at least.
Say you want to build a dashboard with charts and custom timerange selection using data you already have in Postgres/other DB, without killing your DB under the pressure of queries AND without having to write an additional API?
Cube.js is the tool for that. Handles data modeling (you can define a schema on top of your SQL schema), caching, access control and API for you.
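For the "without killing your DB" part, here is a toy sketch of what a result cache in front of the database buys you. `QueryCache` and `fake_database` are invented for this example, and Cube's real caching is more involved than a dictionary with a time-to-live; the point is only that repeated dashboard queries stop reaching the database.

```python
import json
import time


class QueryCache:
    """Toy result cache keyed by the normalized query, with a time-to-live."""

    def __init__(self, run_query, ttl_seconds: float = 60.0):
        self._run_query = run_query  # function that actually hits the database
        self._ttl = ttl_seconds
        self._entries = {}  # cache key -> (expiry timestamp, result)
        self.db_hits = 0

    def load(self, query: dict):
        # Sorting keys makes logically identical queries share one cache entry.
        key = json.dumps(query, sort_keys=True)
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        self.db_hits += 1
        result = self._run_query(query)
        self._entries[key] = (now + self._ttl, result)
        return result


def fake_database(query: dict):
    # Stand-in for an expensive SQL query against Postgres or similar.
    return [{"day": "2023-01-01", "count": 42}]


cache = QueryCache(fake_database)
dashboard_query = {"measures": ["orders.count"], "granularity": "day"}
for _ in range(100):  # a hundred dashboard loads
    cache.load(dashboard_query)
print(cache.db_hits)  # 1
```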
> It looks very important, popular and well established, but what is it?
It's easier to explain what Cube is if we first define what the Semantic Layer (SL) is. In a few words, the SL is the abstract representation of business objects, for example: sales, users, conversion rates, etc. Cube provides the language to define the SL, an API to access it, access control mechanisms and a caching layer. It's important to emphasize that Cube is a stand-alone SL, decoupled from any BI visualization tool. That's the "headless" part, and I would also add that it is "feetless" since it supports multiple source DBs. Looker, the other big name in the space, has the incentive of selling you more usage of BigQuery and of locking you in with their UI; it only recently started to open up to the idea of APIs. The idea is that you have a central place where you define the SL and then you don't need to duplicate the definition in every downstream application, since duplicated definitions can lead to errors or inconsistencies.
> Is it that it can perform a single query across multiple databases?
Cube allows you to join data from multiple databases at the caching layer, which is fundamentally different from a federated query engine. But from the downstream application's perspective it has the same outcome. Because the join is done at the caching layer, it has inherent advantages and limitations vs federated queries.
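A rough sketch of the general idea, using three in-memory SQLite databases as stand-ins for two sources and a cache store. The table names and the copy-then-join flow are assumptions made for illustration, not a description of Cube's internals.

```python
import sqlite3


def make_db(schema: str, insert: str, rows: list[tuple]) -> sqlite3.Connection:
    """Create an in-memory database standing in for one data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema)
    conn.executemany(insert, rows)
    return conn


# Two separate (hypothetical) source databases that cannot see each other.
orders_db = make_db(
    "CREATE TABLE orders (user_id INTEGER, amount REAL)",
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 20.0), (1, 30.0), (2, 5.0)],
)
crm_db = make_db(
    "CREATE TABLE users (user_id INTEGER, plan TEXT)",
    "INSERT INTO users VALUES (?, ?)",
    [(1, "pro"), (2, "free")],
)

# Step 1: each source answers its own query; results are copied into the cache.
cache = sqlite3.connect(":memory:")
cache.execute("CREATE TABLE orders_rollup (user_id INTEGER, revenue REAL)")
cache.executemany(
    "INSERT INTO orders_rollup VALUES (?, ?)",
    orders_db.execute(
        "SELECT user_id, SUM(amount) FROM orders GROUP BY user_id"
    ).fetchall(),
)
cache.execute("CREATE TABLE users_rollup (user_id INTEGER, plan TEXT)")
cache.executemany(
    "INSERT INTO users_rollup VALUES (?, ?)",
    crm_db.execute("SELECT user_id, plan FROM users").fetchall(),
)

# Step 2: the cross-source join runs entirely inside the cache store.
rows = cache.execute(
    "SELECT u.plan, SUM(o.revenue) FROM orders_rollup o "
    "JOIN users_rollup u ON u.user_id = o.user_id "
    "GROUP BY u.plan ORDER BY u.plan"
).fetchall()
print(rows)  # [('free', 5.0), ('pro', 50.0)]
```

One plausible reading of the trade-off: the sources never see a cross-database query and the join is fast, but the result is only as fresh as the last copy into the cache.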
I really like this series of articles by David Jayatillake that goes into deeper detail: