
Is scraping and crawling stealing?

Posted By: Keith Teare on March 26, 2006 in Internet, Search, Web 2.0

A spat has blown up over the weekend regarding Oodle and Vast.com “scraping” content from third-party sites and repurposing it inside their environments. This essay is my reaction to the spat. As a founder of edgeio I clearly have an interest in the answer to the question. edgeio does not scrape or crawl. All of its content is permission-based: published using the “listing” tag, uploaded directly into edgeio, or published on edgeio directly to a personal listings blog that we host.

However, there is more at stake here than competitive issues between edgeio on the one hand and Vast/Oodle on the other. The wider issue is whether or not scraping (which is very like crawling and indexing except it reads displayed content not files) constitutes stealing of data.

The following is taken from an article on ClickZ:

“This is called stealing content…there’s no advantage to me to have them steal,” commented Laurel Touby, founder and CEO of media industry site mediabistro.com, upon learning that Vast.com had linked from its search results to full mediabistro.com job listings pages, even though those pages require registration when accessed on the mediabistro.com site.

Vast.com CEO Naval Ravikant said Vast.com’s crawlers do not automatically register or login to sites, so they must have found passage through the mediabistro.com system via a legitimate entryway.

So let’s try to address this broader issue. To begin with, this is a new discussion. Nobody accuses Google of stealing the data that is in its index (except book publishers, of course). Why not? For two reasons. First, Google primarily indexes the “visible” web, that is to say, sites that are linked to from other sites and are not behind a password protection system of any kind; and even then it respects directives in a file called robots.txt, where a publisher can ask not to be indexed. Second, Google does not display entire documents (although its cache is getting very close to doing so and may give rise to similar discussions in future). Rather, it points to the original source for reading or viewing the content. Thus the business model of the original publisher is left intact.

With the emergence of vertical search aggregators, especially in the commerce space, the issues of ownership and permission become far more pronounced. Why? Because the data represents an inventory, and often an “invisible” web inventory, that is to say, one behind a password-protected site. The effort to aggregate that inventory into a central marketplace is done without the permission of the owner of the inventory. Whether password protected or not, this is going to give rise to disputes like the one between Craigslist and Oodle a little while ago.

There is no need to invent new means of dealing with this. But there is a need for good behavior. Crawlers should always respect robots.txt. Scrapers are different. Their spiders read displayed content directly and do not crawl the file system. As such they can bypass robots.txt. If scrapers respected robots.txt, then a publisher could effectively put its content out of their reach. It isn’t clear at this point whether the scrapers do respect robots.txt files. A better solution is to use RSS for syndication rather than crawling and scraping. More on this below.
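To make the robots.txt point concrete, here is a minimal sketch of the check a well-behaved crawler or scraper could run before fetching a page. It uses Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs and the "ExampleScraper" and "OtherBot" user agents are all made up for illustration; this is not a description of how Oodle, Vast.com or Google actually behave.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve. The paths and the
# "ExampleScraper" user agent are invented for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /jobs/

User-agent: ExampleScraper
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True only if the publisher's rules allow this agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# A generic bot may read public pages but not the job listings.
print(may_fetch(parser, "OtherBot", "https://example.com/about"))
print(may_fetch(parser, "OtherBot", "https://example.com/jobs/123"))

# The publisher has opted this particular agent out of the whole site.
print(may_fetch(parser, "ExampleScraper", "https://example.com/about"))
```

The check only works if the scraper chooses to run it. Nothing in robots.txt enforces anything, which is why this is a question of good behavior rather than technology.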

The second issue is whether the item-level link from a result set points to the original source or to a hosted copy of the original. Oodle and Google Base had a difference of opinion about this issue. Content publishers will care what the answer is.

The third issue with scraping is a quality issue. On its home page Vast.com states:

All results are automatically extracted by crawling the web
Vast.com cannot guarantee the accuracy or availability of the results

And the Oodle blog notes that its index:

…only includes listings that are fresh and relevant: we keep track of all the listings we’ve seen and auto-expire old ones that are still online and exclude things that look like listings but aren’t (reviews, spam, etc.).

The issue here is twofold. Staying current with a live inventory of listings is hard. Even attempting to do so creates a need to crawl and index very aggressively, and the results are often not good. Craigslist’s gripe with Oodle was at least in part driven by its experience with Oodle’s crawlers. They were apparently polling and sucking content very aggressively, and needed to in order to stay current. If you do not poll aggressively, your index gets even more out of sync with the original source than it already is.
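A rough back-of-the-envelope sketch shows why freshness and aggressive polling go together. It assumes, purely for illustration, that a scraper has to re-fetch every listing page once per freshness window to notice edits and removals. The function name and the figure of 100,000 listing pages are hypothetical, not numbers from Craigslist or Oodle.

```python
def fetches_per_day(listing_pages: int, max_staleness_hours: float) -> float:
    """Page fetches per day needed so no indexed copy is older than the window.

    Assumes every page is re-fetched once per window, since a scraper has
    no other way to learn that a listing changed or was removed.
    """
    if max_staleness_hours <= 0:
        raise ValueError("max_staleness_hours must be positive")
    return listing_pages * 24 / max_staleness_hours


# Hypothetical source site with 100,000 live listing pages.
for hours in (24, 6, 1):
    load = fetches_per_day(100_000, hours)
    print(f"{hours:>2}h freshness -> {load:,.0f} fetches/day")
```

Under these assumptions, tightening the freshness target from a day to an hour multiplies the load on the source site by 24, and the index is still up to an hour out of date.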

It seems to me that RSS is a custom-made solution to these problems. Scraping and crawling are the wrong tools.

If publishers who wish their content to be syndicated to a third party publish an RSS feed, and the third party consumes the feed, we have a) a permission-based syndication system; b) a real-time ability to update inventory. edgeio made the decision to follow the Craigslist model whereby a listing is explicitly requested by a publisher. Publishers of listings, from your Mom to a large site, are a community, made up of many smaller communities. A central listings service (CLS) should be a service to that community. Permission-based, real-time publishing, via RSS, is the right tool for the job. Over time this is a highly scalable solution. Publishers can opt in and out at will.
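As a minimal sketch of what consuming such a feed could look like, the code below parses an RSS 2.0 document with Python's standard xml.etree.ElementTree and mirrors it into a local inventory. The feed content, the publisher.example URLs and the read_items and sync helpers are hypothetical. It also assumes the feed carries the publisher's full current inventory, so an item that disappears from the feed is treated as withdrawn. This is an illustration, not edgeio's actual implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed published by a listings site that has opted in.
FEED_XML = """\
<rss version="2.0">
  <channel>
    <title>Example listings</title>
    <link>https://publisher.example/</link>
    <description>Hypothetical listings feed</description>
    <item>
      <guid>https://publisher.example/listings/1</guid>
      <title>Road bike, lightly used</title>
      <link>https://publisher.example/listings/1</link>
    </item>
    <item>
      <guid>https://publisher.example/listings/2</guid>
      <title>Two-bedroom apartment</title>
      <link>https://publisher.example/listings/2</link>
    </item>
  </channel>
</rss>
"""


def read_items(feed_xml: str) -> dict:
    """Return the feed's items keyed by guid (falling back to link)."""
    root = ET.fromstring(feed_xml)
    items = {}
    for item in root.findall("./channel/item"):
        link = item.findtext("link")
        key = item.findtext("guid") or link
        if not key:
            continue  # skip items we cannot identify
        items[key] = {"title": item.findtext("title", default=""), "link": link}
    return items


def sync(inventory: dict, feed_items: dict) -> list:
    """Mirror the feed into inventory and return the keys that were dropped.

    Assumes the feed lists everything the publisher currently wants
    syndicated, so anything missing from it has been withdrawn.
    """
    removed = [key for key in inventory if key not in feed_items]
    for key in removed:
        del inventory[key]
    inventory.update(feed_items)
    return removed


# An aggregator's copy that still holds a listing the publisher has pulled.
inventory = {
    "https://publisher.example/listings/0": {
        "title": "Sold sofa",
        "link": "https://publisher.example/listings/0",
    }
}
withdrawn = sync(inventory, read_items(FEED_XML))
print("withdrawn:", withdrawn)
print("current:", sorted(inventory))
```

Each stored item keeps the publisher's own link, so results can point back to the original source. The publisher controls the feed, so opting out is as simple as removing an item or taking the feed down.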

I predict many more accusations of stealing insofar as the industry continues to mine the “invisible” web, and the specialist web, via scraping and crawling.

And finally, edgeio publishes RSS feeds of every item (either individual items, or our entire inventory). Oodle and Vast are not competitors, but distribution partners. Our data is more valuable insofar as more people see it. That will happen if the data is placed in more environments. So, take it, for free. But please, do not scrape it or crawl it. Just read the RSS feeds. That is why we have them.

Last word goes to Andy Beal’s Marketing Pilgrim blog. After reading the ClickZ piece he says:

Ouch!

Certainly Vast is not alone in convincing classified sites that they’re helping them bring new visitors, but if the classified search engines are to see a bright future, they’ll need to secure strong partnerships with their partner sites.

My emphasis!

Update:

Tech.memeorandum link for this subject.

