
I'd love to see how people approach generating their sitemaps.

I was doing it as a batch job, but once you go over a few million URLs the effort becomes significant. And in a farm of cloud web servers I had one machine doing this big batch job and then syncing to the others... effectively a master.

So I scrapped that and went to dynamic generation upon request. But the problem I found with this was that content deletion changes the URLs in the higher-numbered sitemap files... i.e. when content in the first few files gets deleted, all subsequent files shift slightly across the remaining URLs. Because Google and others may only fetch some sitemaps one day, and some the next... you risk appearing to have duplicate info in your sitemaps. I prefer long-cacheable sitemaps anyway... the URLs in file #23 should always be in #23 and not another file.

So I'm moving towards dynamic generation based on a database table that stores all possible URLs and associates batches of 20,000 URLs with each sitemap file... if I delete content referenced by sitemap #1, then that file now has 19,999 URLs and sitemap #2 remains at 20,000 URLs. A second benefit of such a table is that I can use a flag to indicate whether the content has been deleted, and use that to determine whether to return a 404 or 410 when that URL is accessed.
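
A minimal sketch of that scheme (SQLite for brevity; the table, column, and helper names are my own invention, not a reference implementation): each URL gets its sitemap number once, at insert time, and deletion only flips a flag, so the contents of file #23 never drift into another file.

    import sqlite3

    URLS_PER_SITEMAP = 20_000

    conn = sqlite3.connect("sitemaps.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS sitemap_urls (
            id         INTEGER PRIMARY KEY AUTOINCREMENT,
            url        TEXT UNIQUE NOT NULL,
            sitemap_no INTEGER NOT NULL,  -- fixed at insert time, never changes
            deleted    INTEGER NOT NULL DEFAULT 0
        )""")

    def add_url(url):
        # Assign the sitemap number from the running row count; rows are never
        # physically removed, so the assignment is stable forever.
        (count,) = conn.execute("SELECT COUNT(*) FROM sitemap_urls").fetchone()
        conn.execute("INSERT INTO sitemap_urls (url, sitemap_no) VALUES (?, ?)",
                     (url, count // URLS_PER_SITEMAP + 1))
        conn.commit()

    def delete_url(url):
        # Flip the flag instead of deleting the row: sitemap numbers stay put,
        # and the flag lets the request handler answer 410 Gone instead of 404.
        conn.execute("UPDATE sitemap_urls SET deleted = 1 WHERE url = ?", (url,))
        conn.commit()

    def urls_for_sitemap(sitemap_no):
        # Sitemap #1 may shrink to 19,999 live URLs after a deletion;
        # sitemap #2 keeps its original 20,000.
        return [u for (u,) in conn.execute(
            "SELECT url FROM sitemap_urls WHERE sitemap_no = ? AND deleted = 0",
            (sitemap_no,))]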

If anyone feels that they have a better way of doing this, I'd love to know it.

Ideally it would be generated without a batch job, and would strongly associate each URL with a given sitemap file.



I have done a shitload of different sitemap.xml logic over the years; the most advanced (which is still in use by a top-500 worldwide website) is a real-time sitemap. If a new dataset gets added, we calculate whether it will result in new landing pages or in an update of an existing landing page. If the new/updated landing pages fall within a content quality range, the "shelving logic" looks up which shelf (i.e. which sitemap.xml) they belong in. It then sets the various last-modified dates (which bubble up from the landing pages to the sitemap.xml to the sitemap index to the robots.txt).
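
Schematically, the bubble-up could look something like this (an in-memory sketch, not the production code; the quality threshold and shelf-picking are placeholders for the real, domain-specific logic):

    from datetime import datetime, timezone

    # Hypothetical model of the hierarchy: landing page -> shelf (a sitemap.xml)
    # -> sitemap index. A real system would persist this and rewrite the XML
    # files (and the sitemap reference in robots.txt) as lastmod values change.
    sitemap_index = {"lastmod": None, "shelves": {}}

    def shelf_for(page_url, quality_score):
        # Stand-in for the quality gate: only pages inside the accepted range
        # get shelved at all (the 0.6 threshold is made up).
        if quality_score < 0.6:
            return None
        return hash(page_url) % 1000  # real shelf choice is domain-specific

    def touch_page(page_url, quality_score):
        shelf_no = shelf_for(page_url, quality_score)
        if shelf_no is None:
            return
        now = datetime.now(timezone.utc).isoformat()
        shelf = sitemap_index["shelves"].setdefault(
            shelf_no, {"lastmod": None, "pages": {}})
        # The new lastmod bubbles up: page -> sitemap.xml -> sitemap index.
        shelf["pages"][page_url] = now
        shelf["lastmod"] = now
        sitemap_index["lastmod"] = now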

We achieved a 97% crawl-to-landing-page ratio and an overall 90%+ indexing ratio (after a lot of trial and error with the quality metric) on sites with 35M+ pages.

That said, for sites with fewer than 10M pages I do not care anymore: just submit big, complete sitemaps and update them when pages get added (or a big bunch of them deleted). The overhead of running and maintaining a real-time sitemap for small websites (fewer than 10M pages) is just too much.


You might want to read my answer to this question on Quora: http://www.quora.com/If-I-have-a-website-with-millions-of-un...

There I explain how we did this at eBay when I was working there. eBay is fairly unique in that millions of new URLs get created on a daily basis, each with a short lifespan.


We do the simplest thing we could imagine. Our sitemap of ~50,000,000 entries is written to static XML once a week as part of a batch job and pushed to S3. Is there any reason to believe it needs to be updated in near real time? How often does Google read yours?
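
For reference, such a weekly batch job can be quite small. A sketch, not their actual code: the bucket name and file naming are hypothetical, and 50,000 URLs per file is the sitemap protocol's per-file limit.

    import boto3
    from xml.sax.saxutils import escape

    URLS_PER_FILE = 50_000            # sitemap protocol limit per file
    BUCKET = "example-sitemaps"       # hypothetical bucket name

    def publish_sitemaps(urls):
        # Split the full URL list into protocol-sized chunks, render each as
        # static XML, and push it straight to S3 for serving.
        s3 = boto3.client("s3")
        for file_no, start in enumerate(range(0, len(urls), URLS_PER_FILE), 1):
            chunk = urls[start:start + URLS_PER_FILE]
            body = (
                '<?xml version="1.0" encoding="UTF-8"?>\n'
                '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
                + "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in chunk)
                + "</urlset>\n")
            s3.put_object(Bucket=BUCKET,
                          Key=f"sitemap-{file_no}.xml",
                          Body=body.encode("utf-8"),
                          ContentType="application/xml")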


One thing is: freshness is a factor in Google's ranking. So if you can communicate a fresh page to Google at the exact moment new content arrives (and there is a chance of a "fresh" spike in search demand), it can make a difference.

But more importantly:

Also, with 50,000,000 URLs, if your site gets crawled at about 500,000 pages a day (which is average) or 1M pages a day (which is good), it already takes 50 to 100 days to crawl your whole site. So it makes sense to communicate only the changed sitemaps to Google, at the exact time they change: since those sitemap files get fetched quite fast, you improve the chances that new landing pages get crawled and indexed sooner. Whether that makes sense for you depends on how fast your page turnover is (new pages, updated pages, deleted pages).
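
A sketch of that "only announce what changed" idea (the tracking dict and function are my own; the Google sitemap ping endpoint shown existed at the time, though it has since been deprecated):

    import urllib.parse
    import urllib.request

    published = {}  # sitemap URL -> lastmod value we last announced

    def notify_if_changed(sitemap_url, lastmod):
        # Only re-announce a sitemap file whose contents actually changed,
        # so the crawler's attention goes to the new/updated landing pages.
        if published.get(sitemap_url) == lastmod:
            return
        ping = ("https://www.google.com/ping?sitemap="
                + urllib.parse.quote(sitemap_url, safe=""))
        urllib.request.urlopen(ping, timeout=10)
        published[sitemap_url] = lastmod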

(P.S.: in most cases, for most businesses, a (near) real-time sitemap is unnecessary overhead.)



