I have done a shitload of different sitemap.xml logic over the years. The most advanced (which is still in use by a top-500 worldwide website) is a real-time sitemap: whenever a new dataset gets added, we calculate whether it will result in new landing pages or update existing ones. If the new/updated landing pages fall within a content quality range, the "shelving logic" looks up which shelf (i.e. which sitemap.xml) they belong to. It then sets the various last-modified dates, which bubble up from the landing pages to the sitemap.xml, to the sitemap index, and on to the robots.txt.
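To make the shelving idea concrete, here is a minimal Python sketch of it. All names (`Shelf`, `SitemapIndex`, `route_page`) and the quality thresholds are illustrative assumptions, not the production code: a page passes a quality gate, gets routed to a shelf (one sitemap.xml file), and its last-modified date bubbles up from shelf to sitemap index.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumed quality range for the gate; the real metric took a lot of
# trial and error, as noted above.
QUALITY_MIN, QUALITY_MAX = 0.4, 1.0

@dataclass
class Shelf:
    """One shelf maps to one sitemap-<name>.xml file."""
    name: str
    urls: dict = field(default_factory=dict)  # url -> lastmod
    lastmod: datetime | None = None

    def add(self, url: str, modified: datetime) -> None:
        self.urls[url] = modified
        # bubble the newest lastmod up to the shelf (sitemap) level
        if self.lastmod is None or modified > self.lastmod:
            self.lastmod = modified

class SitemapIndex:
    """The sitemap index; robots.txt would point at this."""
    def __init__(self) -> None:
        self.shelves: dict[str, Shelf] = {}
        self.lastmod: datetime | None = None

    def route_page(self, url: str, quality: float,
                   modified: datetime, shelf_name: str) -> bool:
        """Return True if the page was shelved, False if it failed the gate."""
        if not (QUALITY_MIN <= quality <= QUALITY_MAX):
            return False
        shelf = self.shelves.setdefault(shelf_name, Shelf(shelf_name))
        shelf.add(url, modified)
        # bubble up again: shelf -> sitemap index
        if self.lastmod is None or modified > self.lastmod:
            self.lastmod = modified
        return True
```

The point of the bubbling is that a crawler only has to compare one date at each level (robots.txt reference, index, sitemap) to find the shelves that actually changed.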
We achieved a 97% crawl-to-landing-page ratio and an overall 90%+ indexing ratio on sites with 35M+ pages (after a lot of trial and error with the quality metric).
That said, for sites with fewer than 10M pages I do not care anymore: just submit big, complete sitemaps and update them when a sitemap gets added (or a big bunch of them deleted). The overhead of running and maintaining a real-time sitemap for smaller websites (fewer than 10M pages) is just too much.
This is where I explain how we did it at eBay, back when I was working there.
eBay is quite unique in that millions of new URLs get created on a daily basis, each with a short lifespan.