Scalable PostgreSQL Connection Pooler (github.com/yandex)
234 points by sbuttgereit on Nov 12, 2021 | 88 comments


I appreciate the work going on to make high-quality poolers for Postgres... I just really wish the work were done in-core so we could have built-in connection pooling, or that the work went toward making grabbing a new connection as cheap as if you were using a pooler. It sucks to have to add complexity to your stack to fix a "deficiency" in Postgres' design.

Still, I am glad there is effort put into the project even if I selfishly wish it was done in-core.


There's ongoing work on making Postgres connections more scalable. I really hope that one day we will not need a pooler. https://www.citusdata.com/blog/2020/10/08/analyzing-connecti...


Pooling at all is still pretty much a hack. Ideally, resource management (and tuning) wouldn't revolve around the connection, and actual connections would become as thin as possible.


There has been some work towards that - things did get cheaper in 14.

Note that you very well might still want a separately run pooler for some workloads - having local poolers on application servers can be good for latency.


Yeah: the latency of establishing a new connection fundamentally requires at least one network round trip, and likely a few (so I tend to have my servers going through two layers of pgbouncer, one on each side of the network); and using in-process pools couples your pool state to your front-end code updates. So, since you are going to want to build this connection pooling mechanism anyway--and are then likely going to want to take advantage of application-specific capability limitations (such as not supporting connection state outside of a transaction) to further improve performance--it doesn't really make much sense to prioritize this inside the database server itself.


> it doesn't really make much sense to prioritize this inside the database server itself.

I wouldn't go that far. There are a lot of things that are hard to do within an external pooler unless the pooler uses a 1:1 connection model, which leads most setups to use 1:1 pooling methods. That in turn puts pressure on Postgres to handle large numbers of connections gracefully.


Would an app-side ORM that has built-in connection pooling + PG14 (w/ connection scalability improvements) generally benefit from a connection pooler like Odyssey/PgBouncer, either server-side or app-side, when the number of app servers is low (<5-10)?


As a data point FWIW, I have operated a PG11 cluster (on RDS), client-side pooling only, with 300-450 app servers and max_connections=2000-3000 (can’t remember), and didn’t observe significant backend resource contention or performance degradation, just a memory cost linear in the number of connections.


Unless the app servers are particularly large, or the app-side ORM connection pool doesn't work well (e.g. because the number of concurrent requests changes sufficiently over time that thread/process count constantly changes), I'd guess not. But without knowing a bit more it's hard to say.


Thanks!


Applications shouldn't be connecting once warm though.



But it is quite limited at the moment:

- a connection is bound to the backend process it was passed to,
- a backend process can perform only one transaction at a time,
- therefore, if a backend process runs a long transaction, all other connections on the same backend process have to wait.

Even pgbouncer in transaction mode performs better, since it can switch incoming connections between different outgoing connections.

There will be no good built-in connection pooler until PostgreSQL adopts some kind of intra-process concurrency - threads or green threads (preemptive coroutines). And I'd say it will most likely never happen, due to all the extensions written for PostgreSQL.


Can someone point to an article that explains the connection issue in detail?


The short of it is that Postgres uses a process per connection, so architectures that spin up and close connections frequently can have serious scalability issues.

Note the landing page for the AWS RDS Proxy, https://aws.amazon.com/rds/proxy/ , is as good a discussion as any as to why you'd want to put a pooling proxy in front of Postgres.


You're grey'd out, but based on some quick research this appears correct.

> PostgreSQL is implemented using a simple "process per user" client/server model. In this model there is one client process connected to exactly one server process. As we do not know ahead of time how many connections will be made, we have to use a master process that spawns a new server process every time a connection is requested. This master process is called postgres and listens at a specified TCP/IP port for incoming connections. Whenever a request for a connection is detected the postgres process spawns a new server process. The server tasks communicate with each other using semaphores and shared memory to ensure data integrity throughout concurrent data access.

https://www.postgresql.org/docs/8.2/connect-estab.html


So, we tried AWS RDS Proxy for MySQL (well, actually AWS RDS Aurora MySQL), and we found that it did not improve our situation at all. It added latency, but no additional performance, and memory usage was unchanged.

The applications already had their own connection pooling functionality built-in, so AWS RDS Proxy didn’t buy us anything more beyond that.

I can’t speak for what this technology can do for situations where your application code does not already have connection pooling, or for cases regarding RDS Postgres.



FWIW I maintain Odyssey and will be happy to answer any question regarding it. Actually there was just a CVE-triggered release... I really hope to release more often with more new functionality.

If you are interested in connection pooling, maybe you will find SPQR interesting too: https://github.com/pg-sharding/spqr Currently it's our experiment to build pooling-based PostgreSQL sharding.


Some questions on auth. We have one cluster with a large number of databases, all with automatically provisioned and rotating credentials for service roles using postgres password auth. Currently, clients connect directly to the db cluster (there's a tcp lb but no pgbouncer etc)

In this case, I'd like for services to continue authenticating as previously without changing authentication method, just switching address.

Should I use password_passthrough/auth_query? Does odyssey need pg credentials itself? Is storage_db referring to the db where odyssey stores internal data for itself...?

I guess I don't get why a connection pooler would need to handle auth, much like an HTTPS proxy doesn't need API tokens itself.


Currently in Odyssey password_passthrough works only for cleartext auth, PAM and LDAP. Probably we could make this work for MD5 too. And defeating this kind of feature is the whole purpose of SCRAM :)

The other way to do it is auth_query - you provide a storage password to access the auth data in the DB. This works for MD5 auth. When a user wants to authenticate, we just check the credentials against what we see in the database.
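Roughly, the relevant part of the config looks something like this (a simplified sketch - names and values here are made up, and the exact option names and username-placeholder syntax should be checked against the sample odyssey.conf and the docs):

    storage "postgres_server" {
        type "remote"
        host "127.0.0.1"
        port 5432
    }

    database "mydb" {
        user "app_user" {
            authentication "md5"
            storage "postgres_server"
            pool "transaction"

            # credentials Odyssey itself uses to run the auth query
            auth_query_db "postgres"
            auth_query_user "odyssey_auth"
            # placeholder syntax for the username may differ between versions
            auth_query "SELECT usename, passwd FROM pg_shadow WHERE usename = '%u'"
        }
    }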


Thanks for the pointers! I did play around a bit but it seems to segfault and I'm not sure how to proceed.

https://github.com/yandex/odyssey/issues/361


It seems there's an issue when you log in with the same user that is configured for accessing the authentication table. It fails due to mixing the same pool for auth and for regular work. I'll look into separating these pools more carefully. As a simple workaround, you can set up a different user for accessing passwords and not log into Odyssey with that user.


Thanks for following up. auth_query_user is not the same as the user I'm attempting to authenticate with, though. And I get the same behavior with other users, including non-existing ones.

Or are you referring to something else?


Thanks very much for posting this. Odyssey is analogous to pgbouncer, correct? I haven't set up a connection pooler yet for our postgres instance but was planning to do it eventually with pgbouncer - are you able to comment how Odyssey compares?


A connection pooler allows you to get more transactions per second on the same hardware. All it does is pack a lot of connections into a handful of concurrently running PG backends. In theory it just relays bytes between sockets. But there are workloads where that is CPU intensive, for example lots of concurrent TLS handshakes. And pgbouncer is single-threaded! Odyssey is more scalable.

PgBouncer is great software, actually. We built Odyssey merely to create a competitor for PgBouncer, but then started adding functionality that we needed for cloud installations. E.g. Odyssey computes transaction time quantiles, while PgBouncer only reports the average transaction time.


How does Odyssey compare to other high-performance poolers like pgagroal?


Depends on how you measure performance :) pgagroal uses a lot of Linux processes to track each connection's state; we use coroutines packed into a small number of processes. Actually, if you have a bottleneck in a multithreaded pooler, something has really gone wrong - the database should consume most of the resources. I'd be happy to help tune Odyssey for anyone who will benchmark both poolers (in fact there's only one number - the number of worker processes... well, maybe pool_size too).

pgagroal claims performance superiority over all poolers [0]. I doubt that Odyssey was used in transaction pooling mode in those experiments.

[0] https://github.com/agroal/pgagroal/blob/master/doc/PERFORMAN...
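For reference, those two knobs in the config look roughly like this (a sketch with made-up values; workers is a global option, pool_size goes inside a user section - see the sample odyssey.conf for the exact syntax):

    workers 4        # number of worker threads handling client traffic
    pool_size 100    # per-route limit on server connections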


Thanks! For a Postgres n00b, can Odyssey work alongside a HA solution like pg_auto_failover running on the same nodes?


Sure. We use Odyssey as a component of the HA cluster in Yandex.Cloud. Odyssey runs on each node of a Patroni-like system. One day we want to implement load balancing functionality in Odyssey, so that your queries go to another node if the local one is overloaded or lagging far behind the primary.


This, please! Native support for read replicas would be awesome. Ideally it would know if a query is read-only or not without application changes.


For a variety of reasons this is incredibly difficult. Functions and the like can make SELECT queries perform writes, so it's not just UPDATE/DELETE you have to detect.

It's a lot easier for your application to know what a write is and just establish connections to 2 separate poolers (or hosts on the same poolers) and direct the reads/writes appropriately.


There's already a working part of the libpq protocol for this - target_session_attrs. But the problem with target_session_attrs is that it just takes too long to discover the new primary after a failover. We want to fix this within Odyssey.
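For anyone unfamiliar, target_session_attrs is just a libpq connection-string option: you list several hosts and the client keeps trying them until it finds one matching the requested attribute (hostnames below are made up; values beyond any/read-write require a newer libpq):

    postgresql://app_user@node1.example.com:5432,node2.example.com:5432/mydb?target_session_attrs=read-write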


Maybe a dumb question but does it lock the postgres connection down to one pool client (no sharing) when the client has a transaction in progress? How does it compare to pgbouncer in this regard?


Transaction pooling rules are equivalent in Odyssey and PgBouncer.


Any thoughts on an arm64 version?


Odyssey could work fine on ARM, but we have a small ASM function: https://github.com/yandex/odyssey/blob/master/third_party/ma... We need to reimplement it for ARM64 or at least fall back on some generic context swap.


Everything that comes out of Yandex is seemingly great. It feels like a lighter, leaner FAANG and as a result, it feels like their software is more usable by small teams (as compared to what the actual FAANGs put out).


I heard FAANG is now MANGA :)


For those wondering why use a connection pooler at all with Postgres, I wrote a (really long) post about it: https://sudhir.io/understanding-connections-pools/


This is a great write up that really helped me understand the issue in depth. Thanks for sharing.


Has anyone used Yandex Cloud [0] in production? How does it compare to DigitalOcean, AWS, GCP and others?

Also the tech from Yandex such as Clickhouse [1] is really interesting, as they recently spun it out of Yandex.

[0] https://cloud.yandex.com/en/

[1] https://clickhouse.com/


We are also developing managed databases for the US/EU market: https://double.cloud/


Do you have a plan to support dedicated servers? I.e., I put SSH keys and you manage the cluster.


That sounds like DBA as a Service. No, we don't plan to support that in the near future. One thing that is definitely on our roadmap and may be relevant is BYOA (bring your own account), where the hosts of the cluster run in your AWS/Azure/GCP account, not ours.


It's the same thing but you don't have to pay 10x for your aws vps.


I'm not actually aware of all the detailed plans. My job is to advance the PostgreSQL ecosystem. But I'll pass the question to someone with a broader view.


Only used Clickhouse, although not through their Cloud offering.

It's an impressive piece of engineering. The best column-oriented DB in the open source space, I would say.


Last time I tried they wouldn’t take a US credit card, which made it too big of a challenge to consider


Altinity.Cloud accepts credit card payments.

Disclaimer: I work at Altinity.


Is the Dockerfile not production-ready? https://github.com/yandex/odyssey/blob/master/docker/Dockerf...

I mean, I see valgrind, vim, etc. in there. This would be a very fat Dockerfile.

It seems that way - https://github.com/yandex/odyssey/issues/29#issuecomment-764...

>But it's hard to "bless" some "official" as image: no one from active contributors uses Odyssey in Docker to maintain it thoroughly.

OTOH pgbouncer docker images are rock-solid in production.

Very quickly updated to track upstream.

e.g. the Bitnami ones - https://hub.docker.com/r/bitnami/pgbouncer/ which also have CVE security scans https://quay.io/repository/bitnami/pgbouncer?tab=tags

Microsoft releases an officially supported k8s sidecar of pgbouncer - https://hub.docker.com/_/microsoft-azure-oss-db-tools-pgboun...


Yes, that's a problem. We use Docker only for development, and mostly because Odyssey is based on epoll and just will not compile on macOS.


So I'm not sure what you meant by that - macOS compatibility is generally unrelated to production (which is almost always on Linux).

Am I understanding this wrong? Do you have production workloads that use Odyssey on macOS?


No. It's just about developer experience. PostgreSQL core development under macOS is just so great: you can develop PG patches completely without an internet connection on a very old MacBook Air, and everything will be blazingly fast. For Odyssey development, a macOS user needs a VM, a Linux box or Docker.


We’ve been using this for a while after we found pgbouncer became a bottleneck - it’s so much faster and has been totally reliable. Thanks!


Why might I want to use this or pgbouncer instead of, say, the connection pooling built into ORMs like SQLAlchemy?


Because the way you scale up a Python web app is by adding more servers, and ORMs like SQLAlchemy can't share (aka pool) connections across processes, let alone servers. Scaling a web app means thousands of open connections to Postgres, and the only way to fix that is server-side connection pooling. (Server as in Postgres, because here your app server is the client.)


You might want to have many app backends, each with its own Alchemy pool. This setup would create a lot of Postgres connections. When the network flaps you need to re-establish lots of TLS connections - maybe 100ms of CPU for each handshake. Also, each connection may cost you a fork with lots of copy-on-write afterwards, also up to 100ms of CPU. And Postgres gives you maximum TPS throughput when you run some hundreds of connections, not thousands. And you might want to isolate one microservice from another (in terms of DB throughput utilization). Poolers can help here too.


You would only see the real value of these poolers in dire times. Screwing up apps or deploys and leaking connections is really easy, and if that happens you will bring down the whole DB. With these poolers you will not only be able to limit that but also enforce per-user limits (service-based, if every service has its own creds). These tools are useful beyond a certain scale, and you will only realize their true value once you are working in a distributed microservice world, especially when the services are connecting to the same DB.


The ORM connection pooler is client-side - so each instance of your application using SQLAlchemy will have its own connection pool. This may become unwieldy as you scale up the number of app servers/containers/processes.

In contrast, connection poolers like Odyssey or pgbouncer live external to your applications, meaning that:

1. You can connect to them from any application written in any language (not just Python).
2. They provide a global pool for all of the connecting applications and help to not overload Postgres.


Why not python? Works on my machine.


I believe you read "just not Python", when the parent poster wrote "not just Python", meaning that a connection pooler can be used also from other languages besides Python, which don't have SQLAlchemy specifically.


Oof. Yep. And the timer is over so I can't edit or delete :(


Just to be sure, Odyssey (or pgBouncer) has to run on the same box as Postgres, right? So if you are on AWS, you would have to migrate from RDS to running Postgres on an EC2 instance and install the pooler on the same EC2?

Which comes with some big drawbacks (you obviously lose all the benefits of using RDS) but also some benefits (you can install every pg extension that you need, have access to every pg setting, etc.)


I run pgBouncer on an EC2 instance, connecting to my database on RDS. Does not have to be on the same box.


Oh nice. Thanks for the quick answer!


I'd certainly recommend running the pooler on the same node to avoid double encryption. Re-establishing 2x the connections and dealing with 2x the VM restarts is kind of a downside too. However it's not strictly necessary - Odyssey can run even in another AZ :)


We've run in both of these configurations connecting to Aurora:

1) Run a pgbouncer ASG with a load balancer in front for applications to connect to.
2) Run pgbouncer on all our hosts and applications connect to localhost.

(2) was significantly cheaper once we switched over. All our hosts are 24xlarge so there was plenty of application traffic to make pgbouncer meaningful on the same host.


Does anyone know if you can specify parent roles (group-style non-login roles)? That's something I've wanted from pgbouncer, so instead of making settings at the user level (applications) I could make them at the group level (e.g. apiv2-readonly).


We are going to implement config inheritance. This should bring somewhat similar functionality.


I lean on client-side pooling these days. It works and is easier to reason about for me, e.g. this client process gets 2 connections. Specifically knex has client pooling.

Provides good back pressure and hasn't fallen down yet. Any reason not to do it this way?


If you have a lot of client sources (say, one in each microservice), with different languages and frameworks connecting to the same server (perhaps to different dbs/schemas), it is easier to manage pooling in one place. Also, for dynamic scaling of server pools etc. you only need to change the pooler config instead of every application.

And Postgres connection management (open/close) is expensive; if you can handle it outside the db infra it can be beneficial (only some client libs will pool their connections and others won't). Even with client pooling support in the library, each container/forked runner will have its own connections, and that can quickly become a few hundred.

Finally, AFAIK most libraries don't handle master failover or server restart without crashing the connection, while a pooler will usually keep the upstream open for some time.


Custom poolers are very common even in extremely large deployments with Oracle for example


How can I use this from a nodejs app?


Odyssey sits in front of the database and packs thousands of NodeJS connections into a few tens of real server connections.
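In practice that just means pointing your Node client (e.g. node-postgres) at Odyssey's listen address instead of at Postgres directly - something like this (example values; the port is whatever is set in Odyssey's listen section, conventionally 6432):

    postgres://app_user:secret@odyssey-host:6432/mydb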


What is the use case for this?


Probably the biggest downside in my mind with postgres (which is otherwise an amazing product!) is that it doesn't scale connections well, as there is 1 process per connection. So teams often put a connection pooler in front of postgres - end clients connect to that connection pooler, which forwards on requests to postgres. AFAIK pgbouncer is the 'de facto standard' for postgres connection pooling, so I asked the author how Odyssey compares.


What's a good way to have an HA setup so we don't have a single point of failure (heartbeat etc.)?


I think there are a lot of different HA solutions - Patroni, Stolon, Pacemaker, to name a few.


Question for anyone who knows a lot about Postgres connection pooling:

We want to move to using a connection pooling infrastructure component, but so far, we haven't been able to figure out how to do so given our setup.

We've got a wacky data model where we have N Postgres schemas, each of which has the same tables (but with different data, obviously.) They're not tenants; more like separate "datasets." Picture something like: each schema is a digraph, with tables like "vertices", "edges", "edge_labels", etc. (Not what we're doing, but it's close.)

In theory, we could solve our problem by just sticking all the "vertices" tables together as list-partitions of one mega "vertices" table keyed by dataset_id; but 1. it's kind of incoherent — the data makes no sense when combined together like that, it's only valid when considered in isolation; and 2. it's very operationally convenient for us to manage datasets as separate schemas, re: security, naming, onlining/offlining data, etc.

Also, while each user request is only going to interact with a single dataset (schema), a given user might care about a lot of datasets, and make requests about many of them at arbitrary times—so, as far as I can tell, we don't have the sort of stable load per dataset that would allow us to hold separate connection pools per dataset.

And, since there's no way to make a schema name into a bind variable, fully-qualifying our queries means generating dynamic SQL strings, which is both expensive on the business layer, and on the SQL query planner, which can't just recognize+unify its plans from the query's history from other schemas.

Right now, the way we query the data, is that in our business layer, before each query, we have a middleware that injects a statement like this:

    SET search_path TO 'name_of_dataset', 'public';
We then run a non-fully-schema-qualified query, and it finds whichever tables have been made visible to it.

Our queries currently don't run in transactions (i.e. we're using JDBC auto-commit on the client side), because they're by-and-large just single (complex) SELECT statements. So the middleware-generated statement above (for now) runs as a separate statement, in a separate single-statement transaction, on the same connection, that runs the query. So per-statement connection-pooling would break everything, as our queries would be being routed to backend connections "primed" with the wrong search_path.

And I get the feeling that transaction-level connection pooling won't work for us either (at least for now), because our whole operational problem currently is that our HTTP requests acquire DB connections and then sit on them while other things are processed, depleting the connection pool; and if we turned off JDBC auto-commit, our search_path-injecting middleware (which injects its statement on connection-pool checkout) would just end up starting a DB transaction at the beginning of each of those HTTP requests, where the business layer would then be sitting around with an idle in transaction DB connection, mapping directly to an open pgBouncer backend connection, completely destroying any wins connection-pooling would gain us. We'd be right back where we started.

So, in other words, we want/need per-statement pooling; but we need it to allow us to also specify the search path per statement.

I've seen that at least pgBouncer has an explicit WONTFIX for this requirement, since in their minds it conflicts with their "can't tell it apart from a regular Postgres session" session-state semantics.

Should we bite the bullet and move to putting everything in one schema + list-partitioning + an explicit dataset_id column, so that we can parameterize out the dataset per-query using a bind variable? Or something else?


We are building a solution for this problem at Splitgraph [0] – it sounds like we could probably help with your use case, assuming this is for analytical (OLAP) data. You can get it to work yourself with our open source code [1], but our (private beta, upcoming public) SaaS will put all your schemas on a more scalable “data delivery network,” which incidentally, happens to be implemented with PgBouncer + rewriting + ephemeral instances. We also have private repositories / fine-grained ACL shipping shortly.

In a local engine (just a Postgres DB managed by Splitgraph client to add extra stuff), there is no PgBouncer or batteries-included authN/Z, but we use Foreign Data Wrappers to accomplish the same query resolving. Our goal is for you to have the ability to do everything locally as an individual, with the SaaS becoming useful for “multiplayer” (teams and orgs).

On Splitgraph, every dataset – and every version of every dataset – has an address. Think of it like tagged Docker images. The address either points to an immutable “data image” (in which case we can optionally download objects required to resolve a query on-the-fly, although loading up-front is possible too) or to a live data source (in which case we proxy directly to it via FDW translation). This simple idea of _addressable data products_ goes a long way – for example, it means that computing a diff is now as simple as joining across two tables (one with the previous version, one with the new).

Please excuse the Frankenstein marketing site – we’re in the midst of redesign / rework of info architecture while we build out our SaaS product.

Feel free to reach out if you’ve got questions. And if you have a business case, we have spots available in our private pilot. My email is in my profile – mention HN :)

[0] https://www.splitgraph.com/connect

[1] examples: https://github.com/splitgraph/splitgraph/tree/master/example...


We're already sharding our datasets onto separate PG nodes using FDWs (mostly due to us running PG on GCP, and GCE VMs having inherent vertical scaling limitations on node-local storage, which we rely 100% on for our workloads.)

Also, our "datasets" are all live data. They aren't modified by the users querying our API, but they are constantly appended to (and I mean constantly, i.e. every few milliseconds.)

For us, PG is in our stack at this point because of its hybrid-OLAP nature: it can do realtime, row-at-a-time ingestion of data without degradation, like an OLTP store / time-series DB; but it can then perform intensive OLAP workloads against that up-to-the-moment data, involving joins, CTEs, partial computed-expression indices, etc.

(The use-case for this sort of mixed workload? Think "realtime monitoring/alerting on custom-per-user financial Business-Intelligence queries from a common financial transaction stream." We can't pre-denormalize the data, because each client wants something different. Instead, we need a normalized representation with tons of indices that serves all potential queries equally well, in roughly real time.)

For non-realtime analysis of "data at rest" (i.e. data that can be ingested in at-most-hourly batches), we can just use Snowflake. We already do, for logs and other things.

To be honest, our fondest dream would be to have a two-tiered DB setup:

1. a cold layer, where we get the SQL semantics of Postgres's querying engine, but where the storage engine is similar to those of scale-out DW services like Snowflake, with elastic per-workload compute warehouse nodes fetching compressed-columnar data from object storage into local per-worker caches (it sounds like this is similar to what you're building?)

2. a hot layer, using traditional Postgres storage, which serves as a sort of writeback cache for the cold layer:

• where all INSERTs hit the hot layer and then get async-batched to the cold layer;

• where the hot layer keeps its own indices to enable fancy OLAP querying of the hot data;

• where those queries see the cold data as "part of" the hot data, in a UNION ALL sense (probably through a FDW);

• where those queries will constraint-exclude the cold data (and thus trigger no workload on the cold layer) if the cold data isn't relevant to resolving the query — like PG11 partition constraint-exclusion.

AFAIK, you can't currently use a foreign table as a partition of a local parent table, so that'd be the first thing needing to be solved there. The next problem after that would be resolving the need for an AccessExclusiveLock to modify the range constraint on each of the parititons, since 1. the split-point between "historical" and "timely" data would be sliding forward every hour-or-so, but 2. the table would be just saturated with long-running SELECTs, so much so that it would never get a moment to lock itself to make a modification like that.

(Really, the magic wand there would be to allow partitions to have multiple parents and for "a partition" — the metadata that contains the list/range/hash constraint — to be a separate DB object from the DB table it refers to, where multiple "partitions" can reference one table-with-the-partition's-data-in-it, as long as the data meets the union of all their constraints. With those abstractions, you could build a new parent table that also references the old child tables through new partition metadata, and then just do an atomic swap of the parent tables' names when you're ready, where old queries would go on using the tables they had already dereferenced to their OIDs, and new queries would start using the new tables. Then queue up an effectively-async DROP TABLE on the old parent table, that would resolve when the DB runs out of old queries locking it open.)


The other founder of Splitgraph here! I've been experimenting recently with this, since we want users to be able to write to Splitgraph images efficiently without having to turn them into PG tables and then re-snapshot them.

My setup is similar to your cold-hot idea (diagram [1]), where the cold layer uses our "layered querying" FDW. A scan through it is basically a scan through a UNION of cstore_fdw files (columnar format) and we can use the object metadata to determine which objects to download/scan through. If the constraints don't match the object boundaries at all, the scan returns empty. This emulates PG partitions but more dynamically (you can change them without a full table lock).

This means that we can have a view on top of the "hot" and the "cold" blocks that will query both of them, but exclude the "cold" OLAP layer most of the time. We can also redirect writes to the view to hit the "hot" OLTP layer instead using `INSTEAD OF` triggers. In my PoC, I record inserts/updates/deletes in that table and then collapse them into one action per PK using a window function, but if the table is append-only, this should be much easier.

I'm not completely sure re: the locks at changeover/flush time. The idea is that we can create a new object (up to some bookmark) in one transaction, attach it to the Splitgraph table in the second transaction, then truncate the "hot" table up to the same bookmark in the final transaction. Duplicate rows can be eliminated by the view, so the final table will look consistent between transactions.

Re: "using a foreign table as a partition of a local parent table", I think it's actually possible, even though we don't use it here: there's a cool setup in [3] where they use it to shard a table onto multiple worker nodes (the coordinator node has the partitioned table, each partition points to a postgres_fdw shim on the same machine that then points to the actual physical partition on the worker).

We've had some ideas around using this for distributed querying: in our case, each node responsible for a given partition of a dataset would be able to download just the objects in that partition on the fly (through constraint pruning), so we wouldn't need to knowingly seed each worker with data. Interesting to think about, though at some point it feels like reinventing Presto/Snowflake/Spark.

Hope this helps!

[1] https://imgur.com/a/1p394PI

[2] https://www.splitgraph.com/docs/large-datasets/layered-query...

[3] https://swarm64.com/post/scaling-elastic-postgres-cluster/


> We've had some ideas around using this for distributed querying: in our case, each node responsible for a given partition of a dataset would be able to download just the objects in that partition on the fly (though constraint pruning), so we wouldn't need to knowingly seed each worker with data.

IMHO, if you're going to do this, I'd recommend not doing this in Postgres itself, but rather doing it at the filesystem level. It's effectively just a tiered-storage read-through cache, and filesystems have those all figured out already.

You know how pgBackRest does "partial restore" (https://pgbackrest.org/user-guide.html#restore/option-db-inc...), by making all the heap files seem to be there, but actually the ones you don't need are sparse files ftruncate(2)'d to the right length to make PG happy? And that this works because PG only cares about DB objects it's not actively querying insofar as making sure they're there under readdir(2) with the expected metadata?

Well, an object-storage FUSE filesystems, e.g. https://github.com/kahing/goofys, would make PG just as happy, because PG could see all the right files as "being there" under readdir(2), even though the files aren't really "there", and PG would block on first fopen(2) of each file while goofys fetched the actual object to back the file.

(IIRC PG might fopen(2) all its files once on startup, just to ensure it can; you can hack around this by modding the origin-object-storage filesystem library to not eagerly "push down" its fopen(2)s into object fetches — instead just returning a file-descriptor connected to a lazy promise for the object — and then have read(2) and write(2) thunk that lazy promise, such that the first real IO done against the virtual file be what ends up blocking to fetch the object.)

So you could just make your pg_base dir into an overlayfs mountpoint for:

• top layer: tmpfs (only necessary if you don't give temp tables their own tablespace)

• middle layer: https://github.com/kahing/catfs

• bottom layer: goofys mount of the shared heap-file origin-storage bucket

Note that catfs here does better than just "fetching objects and holding onto them" — it does LRU cache eviction of origin objects when your disk gets full!

(Of course, this setup doesn't allow writes to the tables held under it. So maybe don't make this your default tablespace, but instead a secondary tablespace that "closed" partitions live in, while "open" partitions live in a node-local tablespace, with something like pg_partman creating new hourly tables, and then pg_cron running a session to note down the old ones and do a VACUUM FREEZE ?; ALTER TABLE ? SET TABLESPACE ?; on them to shove them into the secondary tablespace — which will write-through the catfs cache, pushing them down into object storage.)


This may sound naive, but implementing your own pooler is very easy - e.g. in Go, pgproto3 already does all the packet parsing for you. Transaction poolers watch the ReadyForQuery packet and its "in transaction" property, like this [0]. All you need is to stick the server connection on a new ParameterStatus [1] packet for "SET search_path" instead of on ReadyForQuery.

[0] https://github.com/pg-sharding/spqr/blob/358f816cd8a964a9c9e... [1] https://www.postgresql.org/docs/10/protocol-flow.html#id-1.1...
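A minimal sketch of that ReadyForQuery check in Go, assuming the pgproto3 package that ships with pgx v5 (pumpUntilIdle and relayToClient are made-up names for illustration; the real code is linked in [0], and the client-facing side plus error handling are left out):

    // Detect the transaction boundary on a server connection, as described above.
    package pooler

    import (
        "net"

        "github.com/jackc/pgx/v5/pgproto3"
    )

    // pumpUntilIdle reads protocol messages coming from the real Postgres server
    // and forwards them to the client until the server sends ReadyForQuery with
    // TxStatus 'I' (idle, i.e. not inside a transaction). At that point a
    // transaction pooler may detach the server connection and return it to the pool.
    func pumpUntilIdle(serverConn net.Conn, relayToClient func(pgproto3.BackendMessage) error) error {
        // Frontend speaks the client side of the wire protocol towards the server.
        frontend := pgproto3.NewFrontend(serverConn, serverConn)
        for {
            msg, err := frontend.Receive()
            if err != nil {
                return err
            }
            if err := relayToClient(msg); err != nil {
                return err
            }
            if rfq, ok := msg.(*pgproto3.ReadyForQuery); ok && rfq.TxStatus == 'I' {
                // TxStatus: 'I' = idle, 'T' = in transaction, 'E' = failed transaction.
                return nil // safe to give the server connection back to the pool
            }
        }
    }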


It's not enough to intercept `set search_path` - in fact, any arbitrary function could change it, and Postgres doesn't report it back, unlike some other GUCs.


Hmm, yes. ParameterStatus is not sent when search_path is changed. But I think it would be trivial to patch PostgreSQL to send it. I bet one could even push such a patch through a commitfest.


Maybe this can help https://github.com/awslabs/pgbouncer-rr-patch

It is a pgbouncer fork/patch with query rewriting support.



