Show HN: PreQL/Trilogy – A Higher-Level, Composable SQL (github.com/preqldata)
3 points by efromvt on June 19, 2024 | 5 comments
An open-source language, implemented in Python, that replaces the FROM/JOIN clauses of SQL with a lightweight semantic layer and compiles to SQL against supported backends.

Key features:

- Reusability - the semantic layer supports full reuse of calculations, CTEs, and even entire models through a lightweight, Python-inspired import syntax (see the sketch below).
- Simplicity - writing queries that combine data at different granularities is straightforward and avoids common error patterns.
- Easy refactoring - changing the way the semantic layer binds to the database doesn't require rewriting your queries. You don't need to codemod 100 ETL scripts to deprecate a table!
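A rough sketch of the import-based reuse (hypothetical - the words model name and its concepts are invented here, and the select syntax is inferred from the datasource example later in the thread):

  # hypothetical model reuse via a python-style import
  import words as words;

  # concepts and calculations defined once in the model
  # can be selected anywhere the model is imported
  select
      words.sentence_id,
      words.word_one
  ;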

Related packages (basic; links available from https://preqldata.dev/):

- pypreql-etl -> basic DBT integration
- pypreql-nlp -> obligatory GenAI integration
- trilogy-studio -> Electron IDE for running queries directly
- trilogy-public-models -> common repository of public models (mostly BQ public datasets right now) that can be imported/queried

This differs from the other approaches to a modern SQL that I'm aware of (links below) in that it attempts to embrace current SQL syntax, not replace it. Each of the linked projects has a distinct value proposition and is worth checking out on its own!

Note: as you can see from the names in the space, it turns out I was not nearly as creative as I hoped with 'PreQL' as a 'prequel/sequel' play; PreQL/Trilogy is the intermediate Star Wars pun, and it may migrate to just Trilogy (after the sequel!) to avoid confusion.

Other SQL Replacements:

Malloy (full rewrite, semantic focus) https://news.ycombinator.com/item?id=30053860

PRQL (pipelined SQL alternative, all new syntax) https://news.ycombinator.com/item?id=36866861

preql (much more ambitious, all new syntax) https://news.ycombinator.com/item?id=26447070

Things to do:

- Define/implement better null handling
- Rewrite some parsing/CLI in Rust
- More functions/backends, more window flexibility
- Performance hooks/optimization



General comment: the README opens with

    that replaces tables/joins with a lightweight semantic binding layer
But then the included example doesn't do any joins. :(

Similarly, the "Concepts" section of the documentation has a diagram about joins, but no code examples.

On a more technical note, the generated queries seem to use a LOT of CTEs. While they aren't a strict optimization barrier in PG anymore, there are definitely situations where they'll fall back to materialization - and the generated SQL seems to commonly build an "all of the rows" CTE and then apply conditions in a subsequent one.
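For illustration, the shape being described might look like the sketch below (not actual Trilogy output; names are borrowed from the hello world example later in the thread). Since PostgreSQL 12, a CTE referenced exactly once is inlined unless marked MATERIALIZED, but multiply-referenced or volatile CTEs can still materialize:

  -- illustrative only, not actual generated SQL:
  -- a broad "all of the rows" CTE, then a filtering pass over it
  WITH all_rows AS (
      SELECT sentence_id, word
      FROM word_one
  ),
  filtered AS (
      SELECT word
      FROM all_rows
      WHERE sentence_id = 1
  )
  SELECT word
  FROM filtered;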


Ah - so you don't ever express a join in the language - joins are completely abstracted away from the user and resolved at runtime based on the bound tables in the semantic model. The 'hello world' and multi-table demo examples demonstrate how this resolution happens.
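A hedged sketch of what that looks like from the user's side (hypothetical - assumes sentence_id is bound in two datasources, word_one and word_two, as in the hello world example discussed further down):

  # no join is written; both word concepts are simply selected
  select
      sentence_id,
      word_one,
      word_two
  ;
  # the compiler would infer something like:
  #   ... FROM word_one JOIN word_two
  #       ON word_one.sentence = word_two.sentence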

Re: the structure - yeah, it's a lot of CTEs right now, as that's the easiest way to guarantee correctness. I want to implement an optimization pass that will do some consolidation + predicate pushdown, but it's a balance with readability.


So one never specifies the tables/data sources, but refers to columns directly, and that triggers the data source to be pulled in somewhere?

Is this somewhere well defined? Are there docs explaining how it works?

This part of the language seems the most magic to me, and I've found a few times now that things that feel like magic work great, but only 80% of the time.


Datasource bindings are explicit; the resolution to datasources at query time is then automatic/consistent for a given query set and semantic model. If you have only one [non-partial] datasource with a column bound for a concept, that table will be used every time. If there are multiple, selection will optimize for the fewest joins and then the closest match to the target output grain.

The "datasources and joins" section attempts to cover this but I think I need to clean that up a bit!

As long as the semantic bindings are accurate, this can be a performance optimization - e.g. if it's a reporting query and there's an aggregate dataset available that can be safely used (sketched below).
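Sketching that case (hypothetical names throughout; this assumes a datasource can also be bound to a physical table via an address clause, mirroring the query binding shown further down):

  # hypothetical: a pre-aggregated rollup bound at date grain;
  # a report that only needs date-level concepts can resolve
  # here instead of scanning the raw order table
  datasource daily_orders(
    order_date: order_date,
    orders: order_count
  )
  grain(order_date)
  address analytics.daily_orders;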

That's where the challenges will come up - there needs to be easy tooling to vet consistency across datasources (do all these tables report the same # of orders?) to validate the model, and I'm planning some form of query-level time/refresh hint (as of at least <x> date?) to help avoid resolving against stale caches. These challenges should be pretty similar to what you'd get with plain SQL, and the semantic layer hopefully makes the quality checks easier, so I'm optimistic that tooling can help out here as well.
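The kind of check that tooling might generate (hypothetical table names, sketched in plain SQL):

  -- hypothetical: two sources bound to the same order concept
  -- should agree before the rollup is trusted for reporting
  SELECT
      (SELECT count(*) FROM orders_raw)       AS raw_orders,
      (SELECT sum(orders) FROM daily_orders)  AS rollup_orders;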

Ex: this binding from the hello world example says "this is a valid source for sentence_id", and technically any of the 3 datasources could provide it. (The datasource here is a query rather than a table, for portability.)

  datasource word_one(
    sentence: sentence_id,
    word: word_one
  )
  grain(sentence_id)
  query '''
  select 1 as sentence, 'Hello' as word
  union all
  select 2, 'Bonjour'
  ''';
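When more than one binding can provide a concept, the resolution rules above kick in. A hedged sketch of a second binding at the same grain (modeled on word_one; the actual hello world sources may differ):

  datasource word_two(
    sentence: sentence_id,
    word: word_two
  )
  grain(sentence_id)
  query '''
  select 1 as sentence, 'World' as word
  union all
  select 2, 'le monde'
  ''';

A query selecting only sentence_id could resolve from either source; one selecting both word_one and word_two would join the two on the shared sentence_id grain.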


Updated the section on query resolution to be more explicit about datasources + join inference - thank you for the feedback, and let me know if that helps! I'll try to add a more detailed section further on to really dig into it.



