A spat blew up over the weekend regarding Oodle and Vast.com “scraping” content from third-party sites and re-purposing it inside their own environments. This essay is my reaction to that spat. As a founder of edgeio I clearly have an interest in the answer to the question. edgeio does not scrape or crawl: all of its content is permission based, either published using the “listing” tag, uploaded directly into edgeio, or published on edgeio directly to a personal listings blog that we host.
However, there is more at stake here than competitive issues between edgeio on the one hand and Vast/Oodle on the other. The wider issue is whether or not scraping (which is very like crawling and indexing, except that it reads displayed content rather than files) constitutes stealing of data.
The following is taken from an article on ClickZ:
“This is called stealing content… there’s no advantage to me to have them steal,” commented Laurel Touby, founder and CEO of media industry site mediabistro.com, upon learning that Vast.com had linked from its search results to full mediabistro.com job listings pages, even though those pages require registration when accessed on the mediabistro.com site.
Vast.com CEO Naval Ravikant said Vast.com’s crawlers do not automatically register or login to sites, so they must have found passage through the mediabistro.com system via a legitimate entryway.
So let’s try and address this broader issue. Firstly, this is a new discussion. Nobody accuses Google of stealing the data that is in its index (except book publishers, of course). Why not? Well, because Google primarily indexes the “visible” web: sites that are linked to from other sites and are not behind a password-protection system of any kind. Even then, it respects directives in a file called robots.txt, where a publisher can ask not to be indexed. And secondly, Google does not display entire documents (although its cache is getting very close to doing so, and may give rise to similar discussions in future). Rather, it points to the original source for reading/viewing the content. Thus the business model of the original publisher is left intact.
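For readers who have not seen one, a robots.txt file is just a plain-text list of rules served from a site’s root. A publisher who wants to keep a section of its site out of indexes can serve something like this (the path here is an invented example):

```
User-agent: *
Disallow: /listings/
```

Well-behaved crawlers fetch this file before anything else and skip any path matching a Disallow rule for their user-agent.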
With the emergence of vertical search aggregators, especially in the commerce space, the issue of ownership and permission becomes far more pronounced. Why? Because the data represents an inventory, and often an “invisible” web inventory – that is to say, one behind a password-protected site. The effort to aggregate that inventory into a central marketplace is made without the permission of the owner of the inventory. Whether password protected or not, this is going to give rise to disputes like the one between Craigslist and Oodle a little while ago.
There is no need to invent new means of dealing with this, but there is a need for good behavior. Crawlers should always respect robots.txt. Scrapers are different: they read displayed content directly and do not crawl the file system, so they can bypass robots.txt. If scrapers respected robots.txt, then a publisher could effectively put its content out of the reach of the aggregators. It isn’t clear at this point whether the scrapers do respect robots.txt files. A better solution is to use RSS for syndication rather than crawling and scraping. More on this below.
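The check itself is trivial to implement, which is why there is little excuse for skipping it. Here is a minimal sketch, using the Python standard library; the URLs and the “examplebot” user-agent are hypothetical:

```python
# Sketch of the robots.txt check a well-behaved crawler (or scraper)
# performs before fetching a page. All names here are invented examples.
from urllib.robotparser import RobotFileParser

robots_txt = """
User-agent: *
Disallow: /listings/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A crawler identifying itself as "examplebot" must skip disallowed paths.
print(parser.can_fetch("examplebot", "http://example.com/listings/42"))  # False
print(parser.can_fetch("examplebot", "http://example.com/about"))        # True
```

A scraper that reads rendered pages never consults this file unless its author chooses to, which is exactly the behavioral gap the paragraph above describes.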
The second issue is whether the item-level link from a result set points to the original source or to a hosted copy of the original. Oodle and Google Base had a difference of opinion about this issue. Content publishers will care what the answer is.
The third issue with scraping is a quality issue. On its home page Vast.com states:
All results are automatically extracted by crawling the web
Vast.com cannot guarantee the accuracy or availability of the results
And the oodle blog notes that its index:
…only includes listings that are fresh and relevant: we keep track of all the listings we’ve seen and auto-expire old ones that are still online and exclude things that look like listings but aren’t (reviews, spam, etc.).
The issue here is twofold. Staying current with a live inventory of listings is hard, and even attempting to do so creates a need to crawl and index very aggressively; the results are often not good. Craigslist’s gripe with Oodle was at least in part driven by its experience with Oodle’s crawlers, which were apparently polling and sucking content very aggressively, and needed to in order to stay current. If you do not poll aggressively, your index gets even more out of sync with the original source than it already is.
It seems to me that RSS is a custom-made solution to these problems. Scraping and crawling are the wrong tools.
If publishers who wish their content to be syndicated to a third party publish an RSS feed, and the third party consumes that feed, we have (a) a permission-based syndication system and (b) a real-time ability to update inventory. edgeio made the decision to follow the Craigslist model, whereby a listing is explicitly requested by a publisher. Publishers of listings, from your Mom to a large site, are a community, made up of many smaller communities. A central listings service (CLS) should be a service to that community. Permission-based, real-time publishing, via RSS, is the right tool for the job. Over time this is a highly scalable solution, and publishers can opt in and out at will.
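Consuming such a feed is also far simpler than scraping. The following sketch parses a listings feed with the Python standard library; the feed XML and URLs are made-up examples, and a real aggregator would poll a feed URL rather than a literal string:

```python
# Minimal sketch of consuming a listings RSS 2.0 feed. Each <item> is one
# listing, and its <link> points back to the original source, which is what
# preserves the original publisher's business model.
import xml.etree.ElementTree as ET

feed = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Listings</title>
    <item>
      <title>Vintage bicycle - $80</title>
      <link>http://example.com/listings/1</link>
      <pubDate>Mon, 27 Feb 2006 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Apartment for rent</title>
      <link>http://example.com/listings/2</link>
      <pubDate>Mon, 27 Feb 2006 11:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(feed)
for item in root.iter("item"):
    print(item.findtext("title"), "->", item.findtext("link"))
```

Because the publisher decides what goes into the feed, the aggregator gets structured, current data by permission, with no guessing about what is or is not a listing.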
I predict many more accusations of stealing so long as the industry continues to mine the “invisible” web, and the specialist web, via scraping and crawling.
And finally, edgeio publishes RSS feeds of every item (either individual items, or our entire inventory). Oodle and Vast are not competitors, but distribution partners. Our data is more valuable the more people see it, and that will happen if the data is placed in more environments. So, take it, for free. But please, do not scrape it or crawl it. Just read the RSS feeds. That is why we have them.
Last word goes to Andy Beal’s Marketing Pilgrim blog. After reading the ClickZ piece he says:
Certainly Vast is not alone in convincing classified sites that they’re helping them bring new visitors, but if the classified search engines are to see a bright future, they’ll need to secure strong partnerships with their partner sites.
Tech.memeorandum link for this subject.