As we described in our earlier article about Google Sitemaps (“Google Sitemaps to Expose Web Dark Matter, Transforming the Face of Search Forever”), the leading search engine’s new facility for gathering URLs to be indexed enables webmasters to submit lists of URLs which they would like Google’s bots to visit. Surprisingly, quite a bit of confusion and controversy has surfaced across the web about the significance of Google’s new service.
As we argued in our earlier article, we believe the primary impact of this new development will be to expose the web’s so-called ‘dark matter’: the billions of pages of information which are presently unavailable to search engine spiders because those pages sit behind a database front-end. Think of the Library of Congress, or the US Copyright Office or the US Patent and Trademark Office: each of these possesses databases of millions of pages, none of which are at present effectively indexed by the leading search engines.
From a spider’s eye view, right now the landscape of the web is limited to those areas which can be visited by following hyperlinks found in web pages. Some search engines, Google included, will even parse links which include parameters passed to a database — but this, of course, is not what we’re talking about when we refer to inaccessible databases. What has been entirely inaccessible to search engines, however — at least until now — are those pages which cannot be found simply by following links found in web pages.
What are those pages?
They are the pages which database engines serve up only in response to specific parameters, where those parameters do not already appear in the source code of any existing web page and must instead be supplied by the end user, usually via an (X)HTML-based form.
Here’s an example…
Suppose I have a set of 10 articles, and I want to provide users access to those 10 articles via a database. I might build a simple web page which asks the user to type a number from 0 to 9 into an input field and press a button. When that button is pressed, the user might then be taken to, say, article.php?id=0, or article.php?id=7, and so on. Of course, I might also provide a pre-built set of links, which users could browse through, simply clicking on the link for the article they would like to read.
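To make the two versions of that page concrete, here is a rough sketch in Python of what a link-following spider would see in each. The markup in FORM_PAGE and LINK_PAGE is invented purely for illustration, and LinkCollector is a hypothetical stand-in for a real spider, which does a great deal more than collect href values.

```python
from html.parser import HTMLParser

# Hypothetical markup for the form-only version of the page.
FORM_PAGE = """
<form action="article.php" method="get">
  <input type="text" name="id">
  <input type="submit" value="Read">
</form>
"""

# Hypothetical markup for the version with a pre-built set of links.
LINK_PAGE = "\n".join(
    f'<a href="article.php?id={n}">Article {n}</a>' for n in range(10)
)


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, much as a link-following spider would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only plain hyperlinks count; form fields and buttons are ignored.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def discover_links(html):
    """Return every hyperlink target found in the given markup."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


print(len(discover_links(FORM_PAGE)), "URLs found on the form-only page")
print(len(discover_links(LINK_PAGE)), "URLs found on the page of links")
```

The form-only page hands the spider nothing to follow, even though all ten articles sit in the database behind it.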
The key point is that until the appearance of Google’s Sitemaps, those 10 articles would only ever be found by a spider if there were a pre-built set of links to those articles. With only the input field and the button, those articles would never be found — because bots do not fill in forms and press buttons. And while some websites running large databases have gone to great effort to make sure that their individual database entries are connected to the web via plain hyperlinks, and thus can be indexed by Google and other search engines — Amazon being the foremost example — most have not.
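For the 10 articles in our example, the list of URLs submitted to Google might be generated along the following lines. This is only a sketch under stated assumptions: the www.example.com host is hypothetical, and the XML namespace is the one used by the later sitemaps.org protocol (version 0.9), which may differ from the schema Google accepted when the service first appeared.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of the sitemaps.org protocol, version 0.9.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical address of the database-backed article page.
BASE_URL = "http://www.example.com/article.php"


def build_sitemap(article_ids):
    """Return sitemap XML listing one <url> entry per article id."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for article_id in article_ids:
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        loc.text = f"{BASE_URL}?id={article_id}"
    return ET.tostring(urlset, encoding="unicode")


# One entry for each of the ten articles, ids 0 through 9.
sitemap_xml = build_sitemap(range(10))
print(sitemap_xml)
```

With a file like this in hand, the form-only version of the page no longer hides anything: every article URL is listed explicitly, with no form to fill in and no button to press.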
So, while the general benefits of Google Sitemaps can be debated (and are being debated at sites like Search Engine Roundtable and Cre8asite Forums), the clear and unequivocal benefit of Sitemaps comes down to Google’s new-found ability to expose the ‘dark matter’ of the web, bringing billions of pages of new information within sight of its bots.
Of course, that ‘benefit’ will cut both ways, by massively expanding the total volume of web pages being indexed and literally changing the shape of the competitive landscape within which we all operate…
