Over the weekend, I started working on a sitemap for my business research project. I made one permanent URL per company page and one permanent URL per SIC category.
The result is a theoretical sitemap that is about 605,000 items long. That’s a lot of XML, which means a LOT of data, which means the sitemap.xml is like… 20-50 megabytes. What could possibly go wrong?
It was at this point that I learned from Google’s sitemap documentation that they accept sitemaps in XML, RSS, and plain-text formats. The text variant only has to be a list of fully formed page URLs, one per line. Whoopie!
So, I created a script that produced a long list of URLs according to my project’s existing API schema. Then I saved the file, uploaded it to the root of the site, and tested the sitemap in Google Search Console.
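For anyone curious, here is a minimal sketch of what a script like that might look like. It is not my actual code: the base URL, the `/company/` and `/sic/` path patterns, and the function names are all placeholders I made up for illustration, so swap in whatever your own URL scheme uses.

```python
from typing import Iterable, Iterator

# Hypothetical base URL and path patterns, for illustration only.
BASE_URL = "https://example.com"


def build_urls(company_ids: Iterable[str], sic_codes: Iterable[str]) -> Iterator[str]:
    """Yield one permanent URL per company page, then one per SIC category."""
    for company_id in company_ids:
        yield f"{BASE_URL}/company/{company_id}"
    for code in sic_codes:
        yield f"{BASE_URL}/sic/{code}"


def write_text_sitemap(urls: Iterable[str], path: str) -> int:
    """Write one fully formed URL per line and return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for url in urls:
            f.write(url + "\n")
            count += 1
    return count
```

Calling `write_text_sitemap(build_urls(company_ids, sic_codes), "sitemap.txt")` would produce the single big file described above.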
It turned out I had overlooked a detail in the sitemap documentation about length limits. Each sitemap file can list at most 50,000 URLs (and must be no larger than 50 MB uncompressed). That meant I’d have to go back and regenerate the sitemap as a series of files, with at most 50,000 URLs per file. So, I did that, and had a couple of hiccups along the way, but ultimately got it working. Then I re-tested the sitemap in Google Search Console.
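The splitting step can be sketched roughly like this. Again, this is an illustration rather than my exact script: the `sitemap-1.txt`, `sitemap-2.txt` naming and the helper names are assumptions, and the only thing taken from the documentation is the 50,000-URL cap.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List

MAX_URLS_PER_FILE = 50_000  # per-file URL limit from the sitemap documentation


def chunked(urls: Iterable[str], size: int = MAX_URLS_PER_FILE) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` URLs."""
    it = iter(urls)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def write_sitemap_files(
    urls: Iterable[str], out_dir: str, size: int = MAX_URLS_PER_FILE
) -> List[Path]:
    """Write each batch to its own text file (hypothetical sitemap-N.txt naming)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, batch in enumerate(chunked(urls, size), start=1):
        path = out / f"sitemap-{i}.txt"
        path.write_text("\n".join(batch) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

With roughly 605,000 URLs, this would come out to 13 files: 12 full ones and a last one holding the remaining 5,000 or so.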
This time, Search Console said “Looks good!” and I said “whoopie” again, and uploaded the series of sitemap files. It feels kinda scary seeing such a huge number in there for the registered pages count.
So, for now, I’m just waiting for some of those pages to index. I’m half-expecting to get in trouble and get a bunch of rejections and hack-notices. I’ve never done anything at this scale before.
Wish me luck!