The native time partitioning in Postgres 10 was a great foundation, but it definitely had a few rough edges. In practice you usually needed an additional extension like pg_partman (https://www.citusdata.com/blog/2018/01/24/citus-and-pg-partm...) or TimescaleDB to smooth those out and make things as usable as you'd expect.
This set of changes in 11 is pretty great and makes it much more usable out of the box. For those not reading the article the short list of improvements:
- Updates move records across partitions
- A default/catch all partition
- Automatic index creation
- Foreign keys for partitions
- Unique indexes
- Hash partitioning (whereas 10 only had range and list partitioning; quick syntax sketch below)
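For a feel of the new syntax, here's a rough sketch (table and column names made up) of hash partitioning plus a default partition; note a default partition only works for range/list partitioned tables, not hash:

    -- hash partitioning, new in 11 (made-up names)
    CREATE TABLE accounts (
        id    bigint NOT NULL,
        email text
    ) PARTITION BY HASH (id);

    CREATE TABLE accounts_p0 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 0);
    CREATE TABLE accounts_p1 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 1);
    CREATE TABLE accounts_p2 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 2);
    CREATE TABLE accounts_p3 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 3);

    -- default/catch-all partition (range or list partitioning only)
    CREATE TABLE events (
        id         bigint NOT NULL,
        created_at timestamptz NOT NULL
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2018_10 PARTITION OF events
        FOR VALUES FROM ('2018-10-01') TO ('2018-11-01');
    CREATE TABLE events_default PARTITION OF events DEFAULT;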
Sounds great. Hopefully someone will do up a guide on how to migrate a partitioning scheme created using pg_partman over to Postgres 11. It would be great to remove that dependency.
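Not a full guide, but a rough sketch of the kind of surgery involved, assuming the old pg_partman setup used trigger-based inheritance partitioning (all names here are made up):

    -- 1. new native partitioned parent with the same columns
    CREATE TABLE measurements_new (LIKE measurements)
        PARTITION BY RANGE (logged_at);

    -- 2. re-home each old inheritance child as a native partition
    --    (an existing CHECK constraint matching the bounds lets
    --    ATTACH skip the full-table validation scan)
    ALTER TABLE measurements_2018_10 NO INHERIT measurements;
    ALTER TABLE measurements_new
        ATTACH PARTITION measurements_2018_10
        FOR VALUES FROM ('2018-10-01') TO ('2018-11-01');

    -- 3. ...repeat for the remaining children, then swap names
    ALTER TABLE measurements RENAME TO measurements_old;
    ALTER TABLE measurements_new RENAME TO measurements;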
1. Dynamic repartitioning: balance query load/space and create new partitions when tables get over a certain size. Without this, once you get more than a few hundred TB of data you'll end up with terrible hotspots where single partitions are much more loaded than others, and there's nothing you can do about it.
2. Query replanning: query plans can sometimes be wildly off in their estimates. By replanning, the database could switch to another query plan when the existing plan turns out to be grossly wrong.
3. Lock elision instead of *Exclusive locks: some major operations, such as schema changes, take locks which effectively stop the server from responding to any other requests until the operation is done. Instead, these operations should work on a copy-on-write image of the database, repeatedly re-syncing with recently changed data until the final change needed is small, and then commit that change atomically.
#1 seems a bit weird. You could theoretically have the DB automatically create new buckets for a hash partition, but hash partitions shouldn't have hot spots unless you hash on a bad column, and creating new buckets will require all existing rows to be moved.
For other types of partitions, partition hotspots would be the fault of the DBA partitioning on a bad column. The DB could theoretically split a range partition in half if it gets very large, but there are other things that are more important.
#2 is a nice performance feature which has been talked about before, but do any DBs actually have this yet? I suspect that it is quite tricky, and won't be universally beneficial.
Is #3 really a problem? I certainly wouldn't expect schema changes to perform well, and increasing complexity just to help in this area would seem silly.
> Is #3 really a problem? I certainly wouldn't expect schema changes to perform well, and increasing complexity just to help in this area would seem silly.
Not the parent, but they're obviously referring to the online schema migration problem. Without that, online schema migrations are basically impossible, or you get handed the burden of implementing or finding a tool that does it for you.
And yes, pt-online-schema-change[1] is such a tool, but for MySQL. MariaDB has this feature built-in with their `ALTER ONLINE TABLE` [2].
So something similar for Postgres would be a godsend.
Oracle has the right to charge money for their products if they want to. Most people here make a living from exchanging things for money that are better in some way than things you can get for free.
> hash partitions shouldn't have hot spots unless you hash on a bad column
In the nice world of all data having an even load, sure...
But in the real world, you can easily get a handful of users in your "Users" table sending millions of requests per second, and in that case, you would really like to re-partition so that they don't happen to all end up on the same partition.
Implementation can be as simple as allowing splittable partitions, so that whenever load/size gets too high, a partition can be split in half, and half the records moved to a new host. The partition map is only a few kilobytes, so is globally shared/updated. There are no concurrency issues, because during the splitting process, either the old or new partition is responsible for each record, and both the old and new partition hosts can respond to a read or write query, either themselves or by forwarding it to the other host.
Whenever two neighbouring partitions both see low load/space usage, join them with the same method in reverse. By joining only neighbouring partitions, you can't suffer terrible fragmentation or blow up the size of the global partition map.
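For contrast, a rough sketch of what a split takes by hand in stock Postgres today (made-up names); the automatic, online version described above is exactly the part that's missing:

    BEGIN;
    ALTER TABLE events DETACH PARTITION events_2018;

    CREATE TABLE events_2018_h1 PARTITION OF events
        FOR VALUES FROM ('2018-01-01') TO ('2018-07-01');
    CREATE TABLE events_2018_h2 PARTITION OF events
        FOR VALUES FROM ('2018-07-01') TO ('2019-01-01');

    -- rows get re-routed into the two new partitions by tuple routing
    INSERT INTO events SELECT * FROM events_2018;
    DROP TABLE events_2018;
    COMMIT;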
I'm not sure what you mean by "move to a new host" - all the built-in partitioning features are single-host for now, so this does not make much sense. There are plans to support remote (foreign) partitions, but that is in the future.
A more immediate use case is partition-aware algorithms. For example joining equi-partitioned tables can be done at the partition level, and the smaller the partitions the faster the join (likely). Or partition-aware aggregation, when the partition key is included in the GROUP BY keys. And so on. In that case it makes sense to split the largest partitions to make them smaller.
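Both of those exist in 11 but are off by default; a quick sketch (made-up tables, both partitioned the same way on customer_id):

    -- off by default in 11
    SET enable_partitionwise_join = on;
    SET enable_partitionwise_aggregate = on;

    -- orders and payments both partitioned by HASH (customer_id):
    -- the join and the aggregate can then run partition by partition
    EXPLAIN
    SELECT o.customer_id, count(*)
    FROM   orders o
    JOIN   payments p USING (customer_id)
    GROUP  BY o.customer_id;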
Partitions provide separation, not load balancing, and thus do not concern themselves with load or size.
While I can imagine a way to make hashed partitioning sub-divide buckets, it won't really be useful, as you can still have a single user blowing their bucket out of proportion while others are almost empty, which partitioning can never solve.
What you want is round robin distribution and parallel query execution, not partitioning. This of course only makes sense if the tables are on different disk systems/hosts or if all tables fit in memory.
No, that's not true. When combined with parallelism, the ability to place partitions on other hosts, and partition-aware algorithms (partition-wise joins/aggregation etc.), it can be a powerful scalability feature.
It is true. However, while partitioning is not load balancing, partitioning can be used to provide some form of load distribution.
Partitioning primarily deals with isolating related data into small pools so that queries do not need to touch full tables. The purpose of the buckets is to group related data so that a single partition may serve a query (instead of having to touch the entire table), thereby lowering the load from each individual query, not distributing all queries.
Load balancing deals with distributing all queries to ensure even node load, not increasing efficiency of queries (which it does not do at all). Data distribution leading to round-robin accesses across nodes/"partitions" is best. This is the exact opposite of partitioning, which tries to group all that data into a single partition.
Hash partitioning on serial or uuid might give you round-robin like behavior, but it's not really what partitioning is meant for.
I'm not really sure what your point is. I was not suggesting partitioning alone is a load-balancing solution, but that partitioning in combination with other features is a powerful scalability feature. That assumes there is (1) a way to keep the partitions within some size limit automatically, and (2) a partitioning scheme such that the queries can leverage it (eliminate most partitions, join tables partitioned in the same way).
We don't have (1) yet - at least not in core PostgreSQL, but this part should be doable in an extension until it gets into the core. We do have (2) to a large extent, although I'm sure there are gaps and room for improvement.
And when combined with the ability to place partitions on other hosts, that will be another step.
No one is claiming partitioning alone magically solves scalability, but IMHO it's an important building block in doing that (for a large class of applications).
Those are distributed databases and have completely different characteristics. It's a nice feature to have, but no traditional RDBMS supports anything close, I presume.
The question was which DBs do this out of the box, like you requested, not plugins or tools. However, seeing that there already is an extension for Postgres to do it, why is it a problem in the first place?
The partitioning features in PG10 and PG11 were also supported by said extensions, but now they're batteries-included features. The GP is implying that, as long as they're eating the lunch of these extensions, they should eat the whole lunch.
(Which I personally agree with, given that you almost never get these extensions, or support for arbitrary extensions, on Postgres DBaaS hosts. Upstreaming a feature = making it available on those hosts.)
The development model of Postgres is to start with a simple implementation and improve it with every release. So maybe with PG13 or 14 this will be available in core, if a developer starts working on the remaining features.
There is a very sharp distinction between providing first-class partitioning, and providing automated maintenance and performance tuning.
Providing first-class partitioning is something that extensions can't possibly provide, due to how many places this has to integrate with.
On the other hand, manipulating partitions (which is effectively automatic table creation) seems like the sort of magic I expect to only exist in extensions.
As mentioned in a different comment, it does sound like people are looking for round-robin load distribution rather than partitioning.
I don't think any of the extensions actually provides the type of auto-tuning / balancing you're asking for. For example pg_partman helps with setting up some partitioning schemes (by time, by ID), but does not rebalance them later (e.g. to split a hot or unexpectedly large partition). It probably provides a more convenient interface to do that, but it's still up to the DBA to do it.
I'm sure this is one of the things people on pgsql-hackers will be talking about, but I wouldn't hold my breath for it to get into PG12. I do expect this to emerge in an extension first, and then getting the most useful bits into core (either in the form of built-in features, or hooks allowing extensions to be smarter).
Keep in mind there are limitations - the unique index / constraint has to include the partition keys, and foreign keys are possible in one direction only. This may be improved in the future, of course.
The relevant limitation is described by CREATE TABLE docs that say
    Partitioned tables do not support EXCLUDE constraints; however, you
    can define these constraints on individual partitions. Also, while
    it's possible to define PRIMARY KEY constraints on partitioned
    tables, creating foreign keys that reference a partitioned table is
    not yet supported.
(the wording is slightly broken in the commit, but got fixed later).
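To make that concrete, a minimal sketch with made-up names:

    CREATE TABLE orders (
        id         bigint NOT NULL,
        created_at timestamptz NOT NULL
    ) PARTITION BY RANGE (created_at);

    -- fails: a unique constraint on a partitioned table must
    -- include all the partition key columns
    ALTER TABLE orders ADD PRIMARY KEY (id);

    -- works, but doesn't give you global uniqueness of id alone
    ALTER TABLE orders ADD PRIMARY KEY (id, created_at);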
> With Postgres 11, you can create a unique index on the master
> The columns in the index definition should be a superset of the partition key columns. This means that the uniqueness is enforced locally to each partition, and you can’t use this to enforce uniqueness of alternate-primary-key columns.
Are there any workarounds for global indexes across partitions outside of the partition key? Say you have a table partitioned by (some_date) with a unique primary key (id); how would one speed up lookups that are purely by id? Having per-partition local indexes would require an index scan per partition.
I suppose you could have a separate/smaller (id, some_date) non-partitioned table (or maybe one partitioned on HASH(id)) and then do the indirection yourself wherever you're doing "SELECT ... WHERE id = ...", but that seems like a kludge. Any other workarounds?
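Roughly what that kludge looks like, with made-up names (and you have to keep the side table in sync yourself, via a trigger or the application):

    -- non-partitioned lookup table maintained alongside the
    -- partitioned events table
    CREATE TABLE events_by_id (
        id        bigint PRIMARY KEY,
        some_date date   NOT NULL
    );

    -- fetch the partition key first; the join on it can then be
    -- pruned down to a single partition at execution time in 11
    SELECT e.*
    FROM   events_by_id b
    JOIN   events e ON e.id = b.id AND e.some_date = b.some_date
    WHERE  b.id = 12345;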
Are your IDs monotonic by date (i.e. increasing the date will never decrease the ID)? Or, to put that another way, is partitioning by date equivalent to partitioning by ID for you, such that your date-based partitions will each end up with defined (if unknown) ranges of IDs?
If so, you could potentially create 1) a table that maps ID ranges to partitions, and 2) a stored procedure that takes a set of IDs and returns the set of partitions you must scan to find them, something like the sketch below.
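One possible shape of that (entirely hypothetical names, and assuming the ranges are maintained by whatever creates the partitions):

    CREATE TABLE partition_id_ranges (
        partition_name text   PRIMARY KEY,
        min_id         bigint NOT NULL,
        max_id         bigint NOT NULL
    );

    -- which partitions do I need to scan for these IDs?
    CREATE FUNCTION partitions_for_ids(ids bigint[])
    RETURNS SETOF text
    LANGUAGE sql STABLE AS $$
        SELECT DISTINCT r.partition_name
        FROM partition_id_ranges r
        JOIN unnest(ids) AS i(id)
          ON i.id BETWEEN r.min_id AND r.max_id;
    $$;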
No. I've considered that, as well as making the key a composite with the partition key, but I'd like an eventual solution for the general case where there's no relationship between the partitioning method and the primary key.
As far as I understand your situation, it seems like you're using range partitioning and hoping to get the benefits of hash partitioning from a range partition. I think you would need to use hash partitions to avoid lookups on every table. You could do something weird like give each range partition an exclusion filter, make sure each of your randomly generated IDs accommodates that filter, and generate IDs until you find one... But it sounds like you would benefit from a hash partition here instead of a range one.
Having done this with MySQL before, there are basically three things you can do:
1) Do what you proposed, but this kinda defeats the whole purpose, as that side table can get very, very large (and if it no longer fits into working memory, good luck keeping up a sane insert rate).
2) Just don't load that kind of data into an RDS subject to these scaling limitations at all (what I would do nowadays, but in a lot of ways, easier said than done).
3) (what we did back then): If you can, accept a window of uniqueness instead of real, true global uniqueness. If you can live with that, define a cron job to build you a VIEW over the last TODAY-N partitions and check for duplicates using that in a SELECT CTE before inserting (roughly like the sketch below).
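Roughly, translated to Postgres with made-up partition names (and note this is not race-free without extra locking):

    -- the cron job recreates this over the last N partitions
    CREATE OR REPLACE VIEW events_recent_ids AS
        SELECT id FROM events_2018_10
        UNION ALL SELECT id FROM events_2018_09
        UNION ALL SELECT id FROM events_2018_08;

    -- "unique within the window" insert
    INSERT INTO events (id, some_date, payload)
    SELECT 12345, current_date, 'x'
    WHERE NOT EXISTS (SELECT 1 FROM events_recent_ids WHERE id = 12345);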
Using the Spider storage engine in MariaDB it is possible to do this by having vertical partitions with different PARTITION BY clauses [1]. Essentially you create sub-tables with the PARTITION BY clauses, then use a Spider table to reference these sub-tables. Spider then intelligently selects the partition to use based on the query and WHERE clause. The downside is that it requires storing the data multiple times on disk, since each sub-table needs the full data set. Spider does handle inserting the data, so once the schemas are built, only the single Spider table is needed for inserts/selects/updates.
In addition, it was mentioned that in 10.4 Spider is working on internal joins. This means it would be possible to have one table defined with the full data set, then a second table with just (id, datefield) partitioned by id. When you SELECT WHERE id = X, it will read from the small table partitioned by id, find the dates, then join against the larger table to pull in the remaining columns/data. This isn't mentioned in the presentation linked, but it was brought up in discussion at the MariaDB developers unconference.
Wouldn’t the solution be to automatically partition the index you use for ensuring uniqueness, along with the main table? I don’t see how it would be very difficult, as long as the number of partitions is stable.
It's technically possible to have both global indexes, and constraint checking that just checks every single partition for conflicts. The former isn't trivial to implement in PG (there's a fixed-width scheme for the target tuples in several index implementations right now) and obviously adds a point of contention (but a smaller one than the side table proposed nearby). The latter is fairly simple to implement (we do something very similar internally for exclusion constraints), but scales linearly with the number of partitions...
I think eventually we'll go with the second option, and do stuff to eliminate the scalability issues with many partitions. For example we could maintain a global bloom index (still global, but tiny compared to the sidetable), which should tell us whether there might be a duplicate value. If yes, it might be a false positive, in which case we have to do the expensive per-partition stuff. I'm sure it's going to be more complicated than this in practice, of course ...
I suspect we'll want both. The bloom index only helps if you have decent spatial locality, which you won't have for, say, a serial, uuid, etc. There are also types of indexes where it's probably going to be more efficient to have one index for a number of use cases (I'd suspect spatial indexes are often going to be like that).
Yes, that's a reasonable solution and how Elasticsearch handles partitions. It would still "require an index scan per partition" though, which the OP wanted to avoid.
Indeed. Or similarly, allow indexes to not be partitioned at all - i.e. your records and most indexes could be partitioned by date, but you could also declare indexes that apply to the entire logical table (very useful for things like primary keys).
That is what was mentioned as "global indexes" in this discussion. The problem with this approach is twofold - firstly it negates many of the partitioning benefits (e.g. removing data is not merely a DROP PARTITION but you have to walk the global structure), secondly it requires fairly significant surgery in the indexes (that now point to rows in a single table, and so [block,item] is enough, but global indexes will require [partition,block,item]).
> so [block,item] is enough, but global indexes will require [partition,block,item]).
That's probably the easier bit, I'd guess that adjusting a lot of the relevant in-memory structures to share relations, uniqueness checks across relations, adjusting the locking-model to deal with multiple underlying relations etc. is going to be more work.
I use Zabbix a lot with a PG database, and there are 2 tables that grow indefinitely. The Zabbix housekeeper lags a bit when the tables go beyond 200GB. I tried the auto-partitioning and the housekeeper becomes irrelevant. I can now easily keep 1 year of data for my 200k+ monitored items. I like to keep 3 months of my critical alerts at a 60-second interval. PG partitioning is really awesome.