The native time partitioning in Postgres 10 was a great foundation, but it definitely had a few rough edges. In practice you usually needed an additional extension like pg_partman (https://www.citusdata.com/blog/2018/01/24/citus-and-pg-partm...) or TimescaleDB to smooth those out and make things as usable as you'd expect.
This set of changes in 11 is pretty great and makes it much more usable out of the box. For those not reading the article the short list of improvements:
- Updates move records across partitions
- A default/catch all partition
- Automatic index creation
- Foreign keys for partitions
- Unique indexes
- Hash partitioning (whereas 10 only had range and list partitioning; quick syntax sketch below)
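For a feel of the new syntax, here's a rough sketch (table and column names made up) of hash partitioning plus a default partition; note a default partition only works for range/list partitioned tables, not hash:

    -- hash partitioning, new in 11 (made-up names)
    CREATE TABLE accounts (
        id    bigint NOT NULL,
        email text
    ) PARTITION BY HASH (id);

    CREATE TABLE accounts_p0 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 0);
    CREATE TABLE accounts_p1 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 1);
    CREATE TABLE accounts_p2 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 2);
    CREATE TABLE accounts_p3 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 3);

    -- default/catch-all partition (range or list partitioning only)
    CREATE TABLE events (
        id         bigint NOT NULL,
        created_at timestamptz NOT NULL
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2018_10 PARTITION OF events
        FOR VALUES FROM ('2018-10-01') TO ('2018-11-01');
    CREATE TABLE events_default PARTITION OF events DEFAULT;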
Sounds great. Hopefully someone will do up a guide on how to migrate a partitioning scheme created using pg_partman over to Postgres 11. It would be great to remove that dependency.
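Not a full guide, but a rough sketch of the kind of surgery involved, assuming the old pg_partman setup used trigger-based inheritance partitioning (all names here are made up):

    -- 1. new native partitioned parent with the same columns
    CREATE TABLE measurements_new (LIKE measurements)
        PARTITION BY RANGE (logged_at);

    -- 2. re-home each old inheritance child as a native partition
    --    (an existing CHECK constraint matching the bounds lets
    --    ATTACH skip the full-table validation scan)
    ALTER TABLE measurements_2018_10 NO INHERIT measurements;
    ALTER TABLE measurements_new
        ATTACH PARTITION measurements_2018_10
        FOR VALUES FROM ('2018-10-01') TO ('2018-11-01');

    -- 3. ...repeat for the remaining children, then swap names
    ALTER TABLE measurements RENAME TO measurements_old;
    ALTER TABLE measurements_new RENAME TO measurements;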
1. Dynamic repartitioning: balance query load/space and create new partitions when tables get over a certain size. Without this, once you get more than a few hundred TB of data you'll end up with terrible hotspots where single partitions are much more loaded than others, and there's nothing you can do about it.
2. Query replanning: query plans can sometimes be wildly off in their estimates. By replanning, the database could switch to another query plan when the existing plan turns out to be grossly wrong.
3. Lock elision instead of *Exclusive locks: some major operations, such as schema changes, take locks which effectively stop the server from responding to any other requests until the operation is done. Instead, these operations should work on a copy-on-write image of the database, repeatedly re-syncing with recently changed data until the final change needed is small, and then commit that change atomically.
#1 seems a bit weird. You could theoretically have the DB automatically create new buckets for a hash partition, but hash partitions shouldn't have hot spots unless you hash on a bad column, and creating new buckets will require all existing rows to be moved.
For other types of partitions, partition hotspots would be the fault of the DBA partitioning on a bad column. The DB could theoretically split a range partition in half if it gets very large, but there are other things that are more important.
#2 is a nice performance feature which has been talked about before, but do any DBs actually have this yet? I suspect that it is quite tricky, and won't be universally beneficial.
Is #3 really a problem? I certainly wouldn't expect schema changes to perform well, and increasing complexity just to help in this area would seem silly.
> Is #3 really a problem? I certainly wouldn't expect schema changes to perform well, and increasing complexity just to help in this area would seem silly.
Not the parent, but they're obviously referring to the online schema migration problem. Without that, online schema migrations are basically impossible, or you get handed the burden of implementing or finding a tool that does it for you.
And yes, pt-online-schema-change[1] is such a tool, but for MySQL. MariaDB has this feature built-in with their `ALTER ONLINE TABLE` [2].
So something similar for Postgres would be a godsend.
Oracle has the right to charge money for their products if they want to. Most people here make a living from exchanging things for money that are better in some way than things you can get for free.
> hash partitions shouldn't have hot spots unless you hash on a bad column
In the nice world of all data having an even load, sure...
But in the real world, you can easily get a handful of users in your "Users" table sending millions of requests per second, and in that case, you would really like to re-partition so that they don't happen to all end up on the same partition.
Implementation can be as simple as allowing splittable partitions, so that whenever load/size gets too high, a partition can be split in half, and half the records moved to a new host. The partition map is only a few kilobytes, so is globally shared/updated. There are no concurrency issues, because during the splitting process, either the old or new partition is responsible for each record, and both the old and new partition hosts can respond to a read or write query, either themselves or by forwarding it to the other host.
Whenever two neighbouring partitions both see low load/space usage, join them with the same method in reverse. By joining only neighbouring partitions, you can't suffer terrible fragmentation or blow up the size of the global partition map.
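For contrast, a rough sketch of what a split takes by hand in stock Postgres today (made-up names); the automatic, online version described above is exactly the part that's missing:

    BEGIN;
    ALTER TABLE events DETACH PARTITION events_2018;

    CREATE TABLE events_2018_h1 PARTITION OF events
        FOR VALUES FROM ('2018-01-01') TO ('2018-07-01');
    CREATE TABLE events_2018_h2 PARTITION OF events
        FOR VALUES FROM ('2018-07-01') TO ('2019-01-01');

    -- rows get re-routed into the two new partitions by tuple routing
    INSERT INTO events SELECT * FROM events_2018;
    DROP TABLE events_2018;
    COMMIT;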
I'm not sure what you mean by "move to a new host" - all the built-in partitioning features are single-host for now, so this does not make much sense. There are plans to support remote (foreign) partitions, but that is in the future.
A more immediate use case is partition-aware algorithms. For example joining equi-partitioned tables can be done at the partition level, and the smaller the partitions the faster the join (likely). Or partition-aware aggregation, when the partition key is included in the GROUP BY keys. And so on. In that case it makes sense to split the largest partitions to make them smaller.
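Both of those exist in 11 but are off by default; a quick sketch (made-up tables, both partitioned the same way on customer_id):

    -- off by default in 11
    SET enable_partitionwise_join = on;
    SET enable_partitionwise_aggregate = on;

    -- orders and payments both partitioned by HASH (customer_id):
    -- the join and the aggregate can then run partition by partition
    EXPLAIN
    SELECT o.customer_id, count(*)
    FROM   orders o
    JOIN   payments p USING (customer_id)
    GROUP  BY o.customer_id;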
Partitions provide separation, not load balancing, and thus do not concern themselves with load or size.
While I can imagine a way to make hashed partitioning sub-divide buckets, it won't really be useful, as you can still have a single user blowing their bucket out of proportion while others are almost empty, which partitioning can never solve.
What you want is round robin distribution and parallel query execution, not partitioning. This of course only makes sense if the tables are on different disk systems/hosts or if all tables fit in memory.
No, that's not true. When combined with parallelism, the ability to place partitions on other hosts, and partition-aware algorithms (partition-wise joins/aggregation etc.), it can be a powerful scalability feature.
It is true. However, while partitioning is not load balancing, partitioning can be used to provide some form of load distribution.
Partitioning primarily deals with isolating related data into small pools so that queries do not need to touch full tables. The purpose of the buckets is to group related data so that a single partition may serve a query (instead of having to touch the entire table), thereby lowering the load from each individual query, not distributing all queries.
Load balancing deals with distributing all queries to ensure even node load, not increasing efficiency of queries (which it does not do at all). Data distribution leading to round-robin accesses across nodes/"partitions" is best. This is the exact opposite of partitioning, which tries to group all that data into a single partition.
Hash partitioning on serial or uuid might give you round-robin like behavior, but it's not really what partitioning is meant for.
I'm not really sure what your point is. I was not suggesting partitioning alone is a load-balancing solution, but that partitioning in combination with other features is a powerful scalability feature. That assumes there is (1) a way to keep the partitions within some size limit automatically, and (2) a partitioning scheme such that the queries can leverage it (eliminate most partitions, join tables partitioned in the same way).
We don't have (1) yet - at least not in core PostgreSQL, but this part should be doable in an extension until it gets into the core. We do have (2) to a large extent, although I'm sure there are gaps and room for improvement.
And when combined with the ability to place partitions on other hosts, that will be another step.
No one is claiming partitioning alone magically solves scalability, but IMHO it's an important building block in doing that (for a large class of applications).
Those are distributed databases and have completely different characteristics. It's a nice feature to have, but no traditional RDBMS supports anything close, I presume.
The question was which DBs do this out of the box, like you requested, not plugins or tools. However, seeing that there already is an extension for Postgres to do it, why is it a problem in the first place?
The partitioning features in PG10 and PG11 were also supported by said extensions, but now they're batteries-included features. The GP is implying that, as long as they're eating the lunch of these extensions, they should eat the whole lunch.
(Which I personally agree with, given that you almost never get these extensions, or support for arbitrary extensions, on Postgres DBaaS hosts. Upstreaming a feature = making it available on those hosts.)
The development model of Postgres is to start with a simple implementation and improve it with every release. So maybe with PG13 or 14 this will be available in core, if a developer starts working on the remaining features.
There is a very sharp distinction between providing first-class partitioning, and providing automated maintenance and performance tuning.
Providing first-class partitioning is something that extensions can't possibly provide, due to how many places this has to integrate with.
On the other hand, manipulating partitions (which is effectively automatic table creation) seems like the sort of magic I expect to only exist in extensions.
As mentioned in a different comment, it does sound like people are looking for round-robin load distribution rather than partitioning.
I don't think any of the extensions actually provides the type of auto-tuning / balancing you're asking for. For example pg_partman helps with setting up some partitioning schemes (by time, by ID), but does not rebalance them later (e.g. to split a hot or unexpectedly large partition). It probably provides a more convenient interface to do that, but it's still up to the DBA to do it.
I'm sure this is one of the things people on pgsql-hackers will be talking about, but I wouldn't hold my breath for it to get into PG12. I do expect this to emerge in an extension first, and then getting the most useful bits into core (either in the form of built-in features, or hooks allowing extensions to be smarter).
Keep in mind there are limitations - the unique index / constraint has to include the partition keys, and foreign keys are possible in one direction only. This may be improved in the future, of course.
The relevant limitation is described by CREATE TABLE docs that say
    Partitioned tables do not support EXCLUDE constraints; however, you
    can define these constraints on individual partitions. Also, while
    it's possible to define PRIMARY KEY constraints on partitioned
    tables, creating foreign keys that reference a partitioned table is
    not yet supported.
(the wording is slightly broken in the commit, but got fixed later).
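To make that concrete, a minimal sketch with made-up names:

    CREATE TABLE orders (
        id         bigint NOT NULL,
        created_at timestamptz NOT NULL
    ) PARTITION BY RANGE (created_at);

    -- fails: a unique constraint on a partitioned table must
    -- include all the partition key columns
    ALTER TABLE orders ADD PRIMARY KEY (id);

    -- works, but doesn't give you global uniqueness of id alone
    ALTER TABLE orders ADD PRIMARY KEY (id, created_at);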
> With Postgres 11, you can create a unique index on the master
> The columns in the index definition should be a superset of the partition key columns. This means that the uniqueness is enforced locally to each partition, and you can’t use this to enforce uniqueness of alternate-primary-key columns.
Are there any workarounds for global indexes across partitions outside of the partition key? Say you have a table partitioned by (some_date) with a unique primary key (id); how would one speed up lookups that are purely by id? Having per-partition local indexes would require an index scan per partition.
I suppose you could have a separate/smaller (id, some_date) non-partitioned table (or maybe one partitioned on HASH(id)) and then do the indirection yourself wherever you're doing "SELECT ... WHERE id = ...", but that seems like a kludge. Any other workarounds?
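Roughly what that kludge looks like, with made-up names (and you have to keep the side table in sync yourself, via a trigger or the application):

    -- non-partitioned lookup table maintained alongside the
    -- partitioned events table
    CREATE TABLE events_by_id (
        id        bigint PRIMARY KEY,
        some_date date   NOT NULL
    );

    -- fetch the partition key first; the join on it can then be
    -- pruned down to a single partition at execution time in 11
    SELECT e.*
    FROM   events_by_id b
    JOIN   events e ON e.id = b.id AND e.some_date = b.some_date
    WHERE  b.id = 12345;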
Are your IDs monotonic by date (i.e. increasing the date will never decrease the ID)? Or, to put that another way, is partitioning by date equivalent to partitioning by ID for you, such that your date-based partitions will each end up with defined (if unknown) ranges of IDs?
If so, you could potentially create 1) a table that maps ID ranges to partitions, and 2) a stored procedure that takes a set of IDs and returns the set of partitions you must scan to find them, something like the sketch below.
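One possible shape of that (entirely hypothetical names, and assuming the ranges are maintained by whatever creates the partitions):

    CREATE TABLE partition_id_ranges (
        partition_name text   PRIMARY KEY,
        min_id         bigint NOT NULL,
        max_id         bigint NOT NULL
    );

    -- which partitions do I need to scan for these IDs?
    CREATE FUNCTION partitions_for_ids(ids bigint[])
    RETURNS SETOF text
    LANGUAGE sql STABLE AS $$
        SELECT DISTINCT r.partition_name
        FROM partition_id_ranges r
        JOIN unnest(ids) AS i(id)
          ON i.id BETWEEN r.min_id AND r.max_id;
    $$;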
No. I've considered that, as well as making the key a composite with the partition key, but I'd like an eventual solution for the general case where there's no relationship between the partitioning method and the primary key.
As far as I understand your situation, it seems like you're using range partitioning and hoping to get the benefits of hash partitioning from a range partition. I think you would need to use hash partitions to avoid lookups on every table. You could do something weird like give each range partition an exclusion filter, make sure each of your randomly generated IDs accommodates that filter, and generate IDs until you find one... But it sounds like you would benefit from a hash partition here instead of a range one.
Having done this with MySQL before, there are basically three things you can do:
1) Do what you proposed, but this kinda defeats the whole purpose, as that side table can get very, very large (and if it no longer fits into working memory, good luck keeping up a sane insert rate).
2) Just don't load that kind of data into an RDS subject to these scaling limitations at all (what I would do nowadays, but in a lot of ways, easier said than done).
3) (what we did back then): If you can, accept a window of uniqueness instead of real, true global uniqueness. If you can live with that, define a cron job to build you a VIEW over the last TODAY-N partitions and check for duplicates using that in a SELECT CTE before inserting (roughly like the sketch below).
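Roughly, translated to Postgres with made-up partition names (and note this is not race-free without extra locking):

    -- the cron job recreates this over the last N partitions
    CREATE OR REPLACE VIEW events_recent_ids AS
        SELECT id FROM events_2018_10
        UNION ALL SELECT id FROM events_2018_09
        UNION ALL SELECT id FROM events_2018_08;

    -- "unique within the window" insert
    INSERT INTO events (id, some_date, payload)
    SELECT 12345, current_date, 'x'
    WHERE NOT EXISTS (SELECT 1 FROM events_recent_ids WHERE id = 12345);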
Using the Spider storage engine in MariaDB it is possible to do this by having vertical partitions with different PARTITION BY clauses [1]. Essentially you create sub-tables with the PARTITION BY clauses, then use a Spider table to reference these sub-tables. Spider then intelligently selects the partition to use based on the query and WHERE clause. The downside is that it requires storing the data multiple times on disk, since each sub-table needs the full data set. Spider does handle inserting the data, so once the schemas are built, only the single Spider table is needed for inserts/selects/updates.
In addition, it was mentioned that in 10.4 Spider is working on internal joins. This means it would be possible to have one table defined with the full data set, then a second table with just (id, datefield) partitioned by id. When you SELECT WHERE id = X, it will read from the small table partitioned by id, find the dates, then join against the larger table to pull in the remaining columns/data. This isn't mentioned in the presentation linked, but it was brought up in discussion at the MariaDB developers unconference.
Wouldn’t the solution be to automatically partition the index you use for ensuring uniqueness, along with the main table? I don’t see how it would be very difficult, as long as the number of partitions is stable.
It's technically possible to have both global indexes, and constraint checking that just checks every single partition for conflicts. The former isn't trivial to implement in PG (there's a fixed-width scheme for the target tuples in several index implementations right now) and obviously adds a point of contention (but a smaller one than the side table proposed nearby). The latter is fairly simple to implement (we do something very similar internally for exclusion constraints), but scales linearly with the number of partitions...
I think eventually we'll go with the second option, and do stuff to eliminate the scalability issues with many partitions. For example we could maintain a global bloom index (still global, but tiny compared to the sidetable), which should tell us whether there might be a duplicate value. If yes, it might be a false positive, in which case we have to do the expensive per-partition stuff. I'm sure it's going to be more complicated than this in practice, of course ...
I suspect we'll want both. The bloom index only helps if you have decent spatial locality, which you won't have for, say, a serial, uuid, etc. There are also types of indexes where it's probably going to be more efficient to have one index for a number of use cases (I'd suspect spatial indexes are often going to be like that).
Yes, that's a reasonable solution and how Elasticsearch handles partitions. It would still "require an index scan per partition" though, which the OP wanted to avoid.
Indeed. Or similarly, allow indexes to not be partitioned at all - i.e. your records and most indexes could be partitioned by date, but you could also declare indexes that apply to the entire logical table (very useful for things like primary keys).
That is what was mentioned as "global indexes" in this discussion. The problem with this approach is twofold - firstly it negates many of the partitioning benefits (e.g. removing data is not merely a DROP PARTITION but you have to walk the global structure), secondly it requires fairly significant surgery in the indexes (that now point to rows in a single table, and so [block,item] is enough, but global indexes will require [partition,block,item]).
> so [block,item] is enough, but global indexes will require [partition,block,item]).
That's probably the easier bit, I'd guess that adjusting a lot of the relevant in-memory structures to share relations, uniqueness checks across relations, adjusting the locking-model to deal with multiple underlying relations etc. is going to be more work.
I use Zabbix a lot with a PG database, and there are 2 tables that grow indefinitely. The Zabbix housekeeper lags a bit when the tables go beyond 200GB. I tried the auto-partitioning and the housekeeper becomes irrelevant. I can now easily keep 1 year of data for my 200k+ monitored items. I like to keep 3 months of my critical alerts at a 60-second interval. PG partitioning is really awesome.