I recently had to write SQL query generation for AWS Athena, which is based off ...

jjoonathan · on Aug 4, 2022

AWS Athena: selling a buggy, old, stale copy of someone else's work (Presto / Trino) for high prices and getting away with it because you control the platform.

If that's not peak Amazon, I don't know what is.

RobinL · on Aug 4, 2022

I think many users just see they can execute a query on huge data cheaply and incredibly quickly and are delighted. That's certainly my experience.

It's one of the backends available in Splink, our FOSS record linkage software and it's revolutionary how it allows users to execute large scale probabilistic record linkage ridiculously cheaply. It wasn't long ago you needed very expensive proprietary software plus a big on prem cluster, costing in the hundreds of thousands, to achieve this.

A lot of the magic for me is on the infrastructure side: how they can read/write large datasets from s3 so quickly, so the value isn't just in the SQL engine.

jfim · on Aug 4, 2022

The biggest value for corporate users is that they get everything already included as part of their existing cloud agreement.

Adding a new vendor to the mix needs to involve procurement, the legal team, vendor negotiations, while using a new AWS feature is just a matter of using it, even if it's not as good as the original ISV's version and doesn't support the long term viability of the project.

pbronez · on Aug 4, 2022

Splink looks cool. I'm familiar with Tamr and Senzing, but this is the first FOSS option I've come across.

jjoonathan · on Aug 4, 2022

Yes, it's good to be platform king. We know. Low friction for you, high friction for everyone else.

dmitrykoval · on Aug 4, 2022

other arguments aside .. Athena costs $5 per 1TB scanned and also supports predicates pushdown to S3 Select. I wouldn't call this expensive, at least in comparison to self hosted Presto.

hashhar · on Aug 4, 2022

At a certain scale it does become very expensive. It's easy math.

When your monthly Athena bill crosses whatever it would cost to have 5 or 10 EC2 machines it'll be cheaper to use Trino. At my previous workplace we moved from ~$40,000/month to ~$18,000/month by replacing Athena.

Athena is a very good tool to start with - unless you have super large scale you'll probably not outgrow it. But when you do there's Trino.

I do contribute to Trino - although I was merely a user when that cost reduction happened.

tfehring · on Aug 4, 2022

I'm not sure the math is so easy. Even knowing the direct cost savings in hindsight, engineers' time is expensive, and it's not obvious that the ongoing engineering cost of maintaining Trino on an EC2 cluster would be that far below $22k/month. Even if you get a net cost savings on an ongoing basis (which, granted, you probably do), you may have a long payback period for the initial engineering time spent evaluating solutions and getting the deployment spun up.

And that's all with the benefit of hindsight - it's hard to know a priori how much cheaper your own deployment will be compared to a managed service or how long it will take to implement. Of course, anecdotes like yours help with that, so thanks for sharing your experience!

dmitrykoval · on Aug 4, 2022

Sure, I agree, above certain usage threshold hosted Trino becomes totally justified. But then, some engineering time to maintain the cluster has to be factored in as well.. for anything ad-hoc in nature, I would start with Athena by default.

lopatin · on Aug 4, 2022

Can’t you say the same thing for EC2 but with Linux instead of Presto? Personally I like Athena. The fact that it’s in the Amazon platform and managed is a plus for me.

jjoonathan · on Aug 4, 2022

EC2 involved substantial VM management and networking innovations that I respect. Ditto lambda and S3. I would not categorize any of these as OSS flips in nearly the way that I would categorize Athena as an OSS flip.

mr_toad · on Aug 4, 2022

Setting up and maintaining your own Trino cluster isn’t exactly trivial. You pay a premium for not having to do that.

slt2021 · on Aug 4, 2022

AWS is just managed open source as a service

hashhar · on Aug 4, 2022

Works just fine in Trino.

  trino> USE memory.default;
  USE
  trino:default> create table artist (artistid int);
  CREATE TABLE
  trino:default> create table album (albumid int, artistid int);
  CREATE TABLE
  trino:default> insert into artist values 1, 2;
  INSERT: 2 rows
  
  Query 20220804_182827_00005_n4rat, FINISHED, 1 node
  Splits: 19 total, 19 done (100.00%)
  0.52 [0 rows, 0B] [0 rows/s, 0B/s]
  
  trino:default> insert into album values (11, 1), (12, 1), (21, 2);
  INSERT: 3 rows
  
  Query 20220804_182857_00006_n4rat, FINISHED, 1 node
  Splits: 19 total, 19 done (100.00%)
  0.18 [0 rows, 0B] [0 rows/s, 0B/s]
  
  trino:default> select * from (select * from artist limit 2) a cross join lateral (select * from album where album.artistid = a.artistid limit 2);
   artistid | albumid | artistid
  ----------+---------+----------
          1 |      12 |        1
          1 |      11 |        1
          2 |      21 |        2
  (3 rows)
  
  Query 20220804_182930_00007_n4rat, FINISHED, 1 node
  Splits: 41 total, 41 done (100.00%)
  0.35 [8 rows, 232B] [22 rows/s, 661B/s]

bitsondatadev · on Aug 4, 2022

Check out this PR. I believe we may have tackled this one but you'd need to try it out on Trino: https://github.com/trinodb/trino/pull/1415

gavinray · on Aug 4, 2022

Hooray! Yet another data point for Trino > Presto as far as I'm concerned ;^)

bitsondatadev · on Aug 4, 2022

If you want to try an SaaS Athena alternative that's backed by Trino you can check out Starburst Galaxy: https://www.starburst.io/platform/starburst-galaxy/

Full disclosure I work at Starburst.

gavinray · on Aug 4, 2022

Oh nice, I have high opinions of you folks!

Guy who goes by the name of "Randgalt" online builds some great Java libraries and works there too I believe.

simpligility · on Aug 4, 2022

Yep .. Jordan works on Trino and Starburst Galaxy. We got lots of other great engineers helping as well btw.

https://github.com/randgalt

ofrzeta · on Aug 7, 2022

On the homepage there's two occurrences of "ELT" (instead of "ETL" I would guess). Is that correct?

stevenmih · on Aug 4, 2022

Steven here, I'm Ahana CEO (Managed Service for Presto) and member of Linux Foundation's Presto Foundation Since you asked about an alternative to AWS Athena, we offer a free-forever in-VPC Managed Service for Presto on AWS, which keeps up with the latest Presto releases.

Note: the original Presto continues to run in production at Meta (fmr. Facebook), Uber, and recently ByteDance TikTok data platform talked about running 1M queries a day with tens of thousands of cores. Some reasons to stay with Presto: - Reliability and scalability per above - Cutting edge innovations only in later versions of Presto: multi-level caching (project RaptorX) to boost query performance by 10X+ and table scan improvements (project Aria) to name a few - Only PrestoDB is hosted by Linux Foundation, giving confidence to community users that future releases will remain open.

bitsondatadev · on Aug 5, 2022

[flagged]

DataGeekette · on Aug 5, 2022

Or possibly he is just passionate about a project that he believes in, just like others in the larger Presto / Trino community. Personally, I think both Ahana and Starburst have interesting takes on how to monetize the original Presto platform. And like many, many other data projects, there is room in the market for different platforms. Long gone are the days where everyone was just stuck with a single RDMBS to do everything!

DataGeekette · on Aug 5, 2022

But IMO, personal attacks seem pretty off topic for a thread about cool tech.

karma_fountain · on Aug 4, 2022

Is it possible to achieve this with a window function?

gavinray · on Aug 4, 2022

I found out it is! Kudos to this kind internet stranger for telling me:

https://stackoverflow.com/a/73129836/13485494

But man is it a huge PITA (especially when doing programmatic code generation of the SQL) compared to LATERAL joins

Someone familiar with the CockroachDB query planner showed me that a window function like this is what Cockroach turns LATERAL joins into for instance:

    demo@127.0.0.1:26257/movr> explain select * from abc, lateral (select * from xyz where x = a limit 2);

    • filter
    │ estimated row count: 1
    │ filter: row_num <= 2
    │
    └── • window
        │ estimated row count: 2
        │
        └── • hash join
            │ estimated row count: 2
            │ equality: (x) = (a)
            │
            ├── • scan
            │     estimated row count: 6 (100% of the table; stats collected 2 minutes ago)
            │     table: xyz@xyz_pkey
            │     spans: FULL SCAN
            │
            └── • scan
                    estimated row count: 1 (100% of the table; stats collected 3 minutes ago)
                    table: abc@abc_pkey
                    spans: FULL SCAN