Opinionated

Archive for March, 2006

A spat blew up over the weekend about Oodle and Vast.com “scraping” content from third-party sites and repurposing it inside their own environments. This essay is my reaction to it. As a founder of edgeio I clearly have an interest in the answer to the question. edgeio does not scrape or crawl. All of its content is permission based: published using the “listing” tag, uploaded directly into edgeio, or published on edgeio directly to a personal listings blog that we host.

However, there is more at stake here than competitive issues between edgeio on the one hand and Vast/Oodle on the other. The wider issue is whether or not scraping (which is very like crawling and indexing, except that it reads displayed content rather than files) constitutes stealing of data.

The following is taken from an article on ClickZ:

“This is called stealing content…there’s no advantage to me to have them steal,” commented Laurel Touby, founder and CEO of media industry site mediabistro.com, upon learning that Vast.com had linked from its search results to full mediabistro.com job listings pages, even though those pages require registration when accessed on the mediabistro.com site.

Vast.com CEO Naval Ravikant said Vast.com’s crawlers do not automatically register or login to sites, so they must have found passage through the mediabistro.com system via a legitimate entryway.

So let’s try to address this broader issue. To begin with, this is a new discussion. Nobody accuses Google of stealing the data that is in its index (except book publishers, of course). Why not? First, because Google primarily indexes the “visible” web: sites that are linked to from other sites and are not behind password protection of any kind. Even then it respects directives in a file called robots.txt, where a publisher can ask not to be indexed. Second, Google does not display entire documents (although its cache is getting very close to doing so and may give rise to similar discussions in future). Rather, it points to the original source for reading or viewing the content. Thus the business model of the original publisher is left intact.

With the emergence of vertical search aggregators, especially in the commerce space, the issues of ownership and permission become far more pronounced. Why? Because the data represents an inventory, and often an “invisible” web inventory - that is to say, one behind a password-protected site. The effort to aggregate that inventory into a central marketplace is made without the permission of the owner of the inventory. Whether the site is password protected or not, this is going to give rise to disputes like the one between Craigslist and Oodle a little while ago.

There is no need to invent new means of dealing with this. But there is a need for good behavior. Crawlers should always respect robots.txt. Scrapers are different. Their spiders can read displayed content directly and do not crawl the file system. As such they can bypass robots.txt. If scrapers respected robots.txt then a publisher could effectively put its content out of their reach. It isn’t clear at this point whether the scrapers do respect robots.txt files. A better solution is to use RSS for syndication rather than crawling and scraping. More on this below.
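To make the robots.txt point concrete, here is a minimal sketch in Python of the check a well-behaved crawler or scraper could run before fetching a page. The robots.txt content, the user agent name "example-bot" and the publisher.example URLs are all made up for illustration; nothing here describes how Oodle, Vast or Google actually work.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a listings publisher might serve.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /listings/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the rules from text already in hand (no network access here).
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://publisher.example/listings/123",
                "https://publisher.example/about"):
        print(url, may_fetch(EXAMPLE_ROBOTS_TXT, "example-bot", url))
```

The check is entirely voluntary on the visitor's side, which is why it only helps a publisher if scrapers choose to run it.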

The second issue is whether the item-level link from a result set points to the original source or to a hosted copy of the original. Oodle and Google Base had a difference of opinion about this issue. Content publishers will care what the answer is.

The third issue with scraping is a quality issue. On its home page Vast.com states:

All results are automatically extracted by crawling the web
Vast.com cannot guarantee the accuracy or availability of the results

And the oodle blog notes that its index:

…only includes listings that are fresh and relevant: we keep track of all the listings we’ve seen and auto-expire old ones that are still online and exclude things that look like listings but aren’t (reviews, spam, etc.).

The issue here is twofold. To stay current with a live inventory of listings is hard. To even attempt to do so creates a need to crawl and index very aggressively, and the results are often not good. Craigslist’s gripe with Oodle was at least in part driven by its experience with Oodle’s crawlers. They were apparently polling and sucking content very aggressively, and needed to in order to stay current. If you do not poll aggressively your index gets even more out of sync with the original source than it already is.

It seems to me that RSS is a custom-made solution to these problems. Scraping and crawling are the wrong tools.

If publishers who wish their content to be syndicated to a third party publish an RSS feed, and the third party consumes the feed, we have a) a permission-based syndication system; b) a real-time ability to update inventory. edgeio made the decision to follow the Craigslist model whereby a listing is explicitly requested by a publisher. Publishers of listings, from your Mom to a large site, are a community, made up of many smaller communities. A central listings service (CLS) should be a service to that community. Permission-based, real-time publishing, via RSS, is the right tool for the job. Over time this is a highly scalable solution. Publishers can opt in and out at will.
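As a rough illustration of the permission-based approach, here is a minimal Python sketch of an aggregator reading a publisher's RSS 2.0 feed and keeping only the items the publisher has marked as listings. It assumes the “listing” tag shows up as an RSS category element, and the feed content, URLs and function name are invented for the example; it is not a description of how edgeio itself processes feeds.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: one post tagged "listing", one ordinary blog post.
EXAMPLE_FEED = """\
<rss version="2.0">
  <channel>
    <title>Example listings blog</title>
    <link>https://publisher.example/</link>
    <description>Made-up feed for illustration</description>
    <item>
      <title>Mountain bike for sale</title>
      <link>https://publisher.example/bike</link>
      <guid>https://publisher.example/bike</guid>
      <category>listing</category>
    </item>
    <item>
      <title>Thoughts on PC Forum</title>
      <link>https://publisher.example/pc-forum</link>
      <guid>https://publisher.example/pc-forum</guid>
    </item>
  </channel>
</rss>
"""


def listings_from_feed(feed_xml: str) -> dict:
    """Return {guid: {"title", "link"}} for items tagged "listing"."""
    root = ET.fromstring(feed_xml)
    inventory = {}
    for item in root.iter("item"):
        categories = [c.text.strip() for c in item.findall("category") if c.text]
        if "listing" not in categories:
            continue  # the publisher did not offer this item as a listing
        # Fall back to the link when the item carries no guid.
        guid = item.findtext("guid") or item.findtext("link")
        if not guid:
            continue
        inventory[guid] = {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
    return inventory


if __name__ == "__main__":
    print(listings_from_feed(EXAMPLE_FEED))
```

Because the inventory is rebuilt from whatever the feed currently contains, a publisher who removes an item or drops the tag has opted out the next time the feed is read.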

I predict many more accusations of stealing insofar as the industry continues to mine the “invisible” web, and the specialist web, via scraping and crawling.

And finally, edgeio publishes RSS feeds of every item (either individual items, or our entire inventory). Oodle and Vast are not competitors, but distribution partners. Our data is more valuable insofar as more people see it. That will happen if the data is placed in more environments. So, take it, for free. But please, do not scrape it or crawl it. Just read the RSS feeds. That is why we have them.

Last word goes to Andy Beal’s Marketing Pilgrim blog. After reading the ClickZ piece he says:

Ouch!

Certainly Vast is not alone in convincing classified sites that they’re helping them bring new visitors, but if the classified search engines are to see a bright future, they’ll need to secure strong partnerships with their partner sites.

My emphasis!

Update:

Tech.memeorandum link for this subject.


I posted to the comments on Rob Hof’s BusinessWeek Online blog this morning. A reader has said that edgeio’s new features make it “like CraigsList” and implied that edgeio’s distinctiveness was threatened.

I’m re-posting it here because there are some themes in it that I think are important to expose.

Here is my reply:

Rob,

Thanks for writing about our new features. I want to respond to Mike
Masnick. On the face of it he asks a fair question - if you can list on
edgeio then how is that different from CraigsList?
It’s a multi-part answer:

1. We are giving you a hosted listings blog. It is yours to post to.
edgeio is still aggregating the content from your blog, just as if your
blog were elsewhere. The goal here is to expand the universe of people
who can utilize edgeio from bloggers, to well … everybody. So this is
still self publishing, and although it is hosted by edgeio, still
decentralized. The idea of a blog as a storefront is definitely not a
CraigsList idea.

2. Having said that, let’s assume there is no difference, or that the
difference is mainly semantic (which I don’t think it is). Then there is
still a huge difference. CraigsList has been in existence for many years
now. After all that time it is in around 100 cities, mainly in the USA.
After 2 weeks edgeio has attracted listings from more than 1000 cities
and that is growing by around 300 a week. By the end of the year edgeio
will have around 10,000 cities with listings. This is possible due to
our “bottoms up” approach. Self publishers can “light up” their city
simply by listing an item. Of course one downside of this in the short
term is that a city can have a small number of listings. But the upside
is that it is a highly scalable model. Over time it will be organic and
probably very big. For Craig to launch a new city is a fairly cumbersome
thing.

3. Final point. We don’t have “be different from CraigsList” as a
goal. Our goal is to build a massively scaled global listing
marketplace, with millions of local cities and their citizens
participating. Like CraigsList we are all about “community”. In fact we
share a philosophy with Craig - let the community police the
marketplace. And our tools are all about removing cost and friction from
person to person commerce, whilst expanding the universe of
participants. We do not scrape or crawl for content. We only take
content expressly for edgeio. And we chose bloggers as the first
audience to enable because they represent a community of communities.
So, if we are becoming “like CraigsList” but in thousands more towns
and cities globally, and with tools and services appropriate to a
service built in 2005-06, great. I have no problem with that. There is a
new word beginning to make the rounds - Glocalization. As well as
pioneering structured blogging, microformats and self-publishing, edgeio
is a big believer in creating a Glocal (not a mis-spelling)
marketplace.

Links:


http://tech.memeorandum.com

Technorati


23 Mar, 2006

New edgeio features to launch tonight

Posted by: Keith Teare In: announcement| edgeio| news

I am pleased to report that edgeio is going to launch some new features tonight. I have posted details on the edgeio corporate blog. The features will go live some time after midnight Pacific time.

One of the things we heard loud and clear is that many want to publish listings on a blog but do not want them to be part of their primary blog. So, now you can have a listings blog hosted by edgeio. It’s free and it allows you to put free listings into edgeio and all of the sites that edgeio syndicates its listings to in the future.

We also heard that many do not want to have to deal with the idiosyncrasies of ping servers and tagging, especially those who have blogs on hosting platforms with poor support for tagging. So now we allow you to simply type your URL into the edgeio home page and choose your listings from a list of posts on your blog.

Thirdly we have added Tagclouds. Try clicking on the number of cities listed on the edgeio home page. Or the “more” link at the end of each top level category on the edgeio home page.

Finally, there are now instructions (linked from the home page) for those who like to publish into edgeio via their RSS feed, explaining how to do so and take full advantage of a feature we call the “edgeio control language” or ECL.

There is a lot more coming from us in the not too distant future. We are committed to evolving the system in real time and in full public view. If you have opinions please email feedback@edgeio.com.

Rob Hof has a piece about the new features. And there are already others, posted below:

Other links:

SomeWhat Frank
Pete Cashmore’s Mashable


John Furrier is a genuine entrepreneur. After closing down Labrador and joining me at RealNames in 1999 he branched out on his own. He survived the nuclear winter in Silicon Valley, increased his family to 4 children, and embarked - a little more than a year ago - on PodTech. It is now a formidable podcast network. And today he announced $5.5m in funding from some stellar VCs.

John, congratulations. You deserve the success and you most definitely earned it. PodTech also launched a new site to coincide with the announcement.

Today is a good day.


14 Mar, 2006

PC Forum

Posted by: Keith Teare In: Internet| Strategy| Technology| edge

I’m at PC Forum this week. Monday saw the pitches of 9 startups (including edgeio).

Here is the movie of their 2-minute infomercials.



