Category Archives: Web 2.0

The Pareto Principle is nonsense.

In response to the current discussion on Techmeme and TailRank, hipmojo writes that the Pareto principle is in play on the internet and that, no matter how much we want it to be otherwise, 80% of online advertising will go to 20% of the web sites.

When the dust settles, the top 20% of websites will get 80% of ad revenues. It’s that simple. Portals might change in shape, form or nature, but whatever they represent loosely will still get the bulk of revenues and traffic.

With respect, that is nonsense. Since the advent of Google Adsense the shape of internet advertising spend has mirrored the flattening of traffic I speak of on the edgeio blog. Almost half of Google’s revenue comes from Adsense. And about 75% of the dollars earned through Adsense stay with the publishers whose sites the ads run on. Clearly the lion’s share of the money spent through Google is shared about 50-50 with the publishers in the “foothills”.
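As a back-of-the-envelope check, here is a minimal sketch that takes the two figures above at face value. Both shares are this post's approximations ("almost half" and "about 75%"), and the function name and the 100-dollar total are purely illustrative.

```python
# Rough split of ad dollars spent through Google, using only the two
# approximate figures quoted in the post. These are assumptions, not
# audited numbers.
ADSENSE_SHARE_OF_REVENUE = 0.50    # "almost half" of revenue comes via Adsense
PUBLISHER_SHARE_OF_ADSENSE = 0.75  # "about 75%" of Adsense dollars stay with publishers


def split_revenue(total: float) -> dict:
    """Divide a total ad spend between publishers and Google."""
    adsense = total * ADSENSE_SHARE_OF_REVENUE
    to_publishers = adsense * PUBLISHER_SHARE_OF_ADSENSE
    kept_by_google = total - to_publishers
    return {"publishers": to_publishers, "google": kept_by_google}


split = split_revenue(100.0)
print(split)
```

On those assumed figures, roughly 37.50 of every 100 dollars spent through Google ends up with publishers in the foothills.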

It may be worth listening to the Google Earnings calls on Earningscast to validate this.

That is why Google talks so much about “inventory”. That is, traffic from outside google.com. The size and cost of this inventory is a major variable and the need to grow it helps us to understand deals like the one with YouTube.

If you roll the clock back to the pre-Adsense days when DoubleClick ruled, and online advertising was only going to large sites, it is a huge change in monetization and traffic flows. Give Google credit for this.

One of the things my piece argues is that there is a new trend on top of this established one – publisher monetization of their own content through direct relationships with advertisers (job boards, sponsorships and Techmeme-like ad units being examples).

Sure, the portals are still big, but the collective foothills are as big now, and will be a lot bigger in the future.

De-portalization and Internet revenues

Last week Fred Wilson did a post on a phenomenon he called de-portalization. I think he is right on the money.

I just posted a piece on the edgeio blog that picks up on that theme and discusses the consequences of the trend.

The top 10 consequences are:

1. The revenue growth that has characterized the Internet since 1994 will continue. But more and more of the revenue will be made in the foothills, not the mountains.
2. If the major destination sites want to participate in it they will need to find a way to be involved in the traffic that inhabits the foothills.
3. Widgets are a symptom of this need to embed yourself in the distributed traffic of the foothills.
4. Portals that try to widgetize the foothills will do less well than those who truly embrace distributed content, but better than those who ignore the trends.
5. Every pair of eyeballs in the foothills will have many competing advertisers looking to connect with them. Publishers will benefit from this.
6. Because of this competition the dollar value of the traffic that is in the foothills will be (already is) vastly more than a generic ad platform like Google Adsense or Yahoo’s Panama can realize. Techcrunch ($180,000 last month according to the SF Chronicle) is an example of how much more a publisher can make by selling advertising and listings directly to targeted advertisers than by leaving it in the hands of an advertiser-focused middleman like Google.
7. Publisher-driven revenue models will increasingly replace middlemen. There will be no successful advertiser-driven models in the foothills, only publisher-centric models. Successful platform vendors will put the publisher at the center of the world in a seller’s market for eyeballs. There will be more publishers able to make $180,000 a month.
8. Portals will need to evolve into platform companies in order to participate in a huge growth of Internet revenues. Service to publishers will be a huge part of this. Otherwise they will end up like Infospace, or maybe Infoseek. Relics of the past.
9. Search, however, will become more important as content becomes more distributed. Yet it will command a smaller and smaller proportion of the growing Internet traffic.
10. Smart companies will (a) help content find traffic by enabling its distribution; (b) help users find widely dispersed content by providing great search; and (c) help the publishers in the rising foothills maximize the value of their publications.

Discussion

Kevin Burton
Techmeme
Mike Arrington
Syntagma
Dan Farber at ZDNet
Mark Evans
Fred Wilson
Ivan Pope at Snipperoo
Tech Tailrank
Collaborative Thinking
David Black
Surfing the Chaos
Ben Griffiths
Dave Winer (great pics)
Kosso’s Braingarden
Dizzy Thinks
Mark Evans

Is scraping and crawling stealing?

A spat has blown up over the weekend regarding Oodle and Vast.com “scraping” content from 3rd party sites and re-purposing it inside their environments. This essay is my reaction to the spat. As a founder of edgeio I clearly have an interest in the answer to the question. edgeio does not scrape or crawl. All of its content is permission based (published using the “listing” tag; uploaded directly into edgeio OR published on edgeio directly to a personal listings blog that we host).

However, there is more at stake here than competitive issues between edgeio on the one hand and Vast/Oodle on the other. The wider issue is whether or not scraping (which is very like crawling and indexing except it reads displayed content not files) constitutes stealing of data.

The following is taken from an article on ClickZ:

“This is called stealing content…there’s no advantage to me to have them steal,” commented Laurel Touby, founder and CEO of media industry site mediabistro.com, upon learning that Vast.com had linked from its search results to full mediabistro.com job listings pages, even though those pages require registration when accessed on the mediabistro.com site.

Vast.com CEO Naval Ravikant said Vast.com’s crawlers do not automatically register or login to sites, so they must have found passage through the mediabistro.com system via a legitimate entryway.

So let’s try and address this broader issue. Firstly, this is a new discussion. Nobody accuses Google of stealing the data that is in its index (except book publishers of course). Why not? Well, because Google primarily indexes the “visible” web. That is to say, sites that are linked to from other sites and are not behind a password protection system of any kind; and even then it respects directives in a file called robots.txt where a publisher can ask not to be indexed. And secondly, Google does not display entire documents (although its cache is getting very close to doing so and may give rise to similar discussions in future). Rather it points to the original source for reading/viewing the content. Thus the business model of the original publisher is left intact.

With the emergence of vertical search aggregators, especially in the commerce space, the issues of ownership and permission become far more pronounced. Why? Because the data represents an inventory, and often an “invisible” web inventory – that is to say, behind a password protected site. The effort to aggregate that inventory into a central marketplace is done without permission of the owner of the inventory. Whether password protected or not, this is going to give rise to disputes like the one between Craigslist and Oodle a little while ago.

There is no need to invent new means of dealing with this. But there is a need for good behavior. Crawlers should always respect robots.txt. Scrapers are different. The spiders can read displayed content directly and do not crawl the file system. As such they can bypass robots.txt. If scrapers respected robots.txt then a publisher could effectively put its content out of the reach of the crawlers. It isn’t clear at this point whether the scrapers do respect robots.txt files. A better solution is to use RSS for syndication rather than crawling and scraping. More on this below.
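To make "respecting robots.txt" concrete, here is a minimal sketch of the check a well-behaved crawler or scraper could run before fetching a page. It uses Python's standard `urllib.robotparser`; the robots.txt contents, the user agent name and the example.com URLs are hypothetical, not taken from any of the sites discussed here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a listings publisher might serve to keep
# its inventory out of third-party indexes.
ROBOTS_TXT = """\
User-agent: *
Disallow: /listings/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite scraper would skip the first URL and fetch only the second.
print(allowed(ROBOTS_TXT, "example-scraper", "http://example.com/listings/123"))
print(allowed(ROBOTS_TXT, "example-scraper", "http://example.com/about"))
```

The check is voluntary: nothing in the protocol stops a scraper from ignoring the answer, which is why good behavior matters.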

The second issue is whether the item level link from a result set points to the original source or to a hosted copy of the original. Oodle and Googlebase had a difference of opinion about this issue. Content publishers will care what the answer is.

The third issue with scraping is a quality issue. On its home page Vast.com states:

All results are automatically extracted by crawling the web
Vast.com cannot guarantee the accuracy or availability of the results

And the oodle blog notes that its index:

…only includes listings that are fresh and relevant: we keep track of all the listings we’ve seen and auto-expire old ones that are still online and exclude things that look like listings but aren’t (reviews, spam, etc.).

The issue here is twofold. To stay current with a live inventory of listings is hard. To even attempt to do so creates a need to crawl and index very aggressively, and the results are often not good. Craigslist’s gripe with Oodle was at least in part driven by its experience with Oodle’s crawlers. They were apparently polling and sucking content very aggressively and needed to in order to stay current. If you do not poll aggressively your index gets even more out of sync with the original source than it already is.

It seems to me that RSS is a custom made solution to these problems. Scraping and Crawling are the wrong tools.

If publishers who wish their content to be syndicated to a third party publish an RSS feed and the third party consumes the feed we have a) a permission based syndication system; b) a real time ability to update inventory. edgeio made the decision to follow the Craigslist model whereby a listing is explicitly requested by a publisher. Publishers of listings, from your Mom to a large site, are a community, made up of many smaller communities. A central listings service (CLS) should be a service to that community. Permission based, real-time publishing, via RSS, is the right tool for the job. Over time this is a highly scalable solution. Publishers can opt in and out at will.
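A minimal sketch of what the consuming side of such a permission-based system could look like: read the publisher's RSS 2.0 feed and keep only the items the publisher has tagged "listing". Mapping the tag onto the RSS `<category>` element is an assumption made for illustration, as are the sample feed, the function name and the example.com URLs; this is not edgeio's actual implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed published by a blogger who has opted in
# by tagging one post "listing".
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>http://example.com/</link>
    <description>Hypothetical publisher feed</description>
    <item>
      <title>Bike for sale</title>
      <link>http://example.com/bike</link>
      <category>listing</category>
    </item>
    <item>
      <title>Weekend notes</title>
      <link>http://example.com/notes</link>
      <category>personal</category>
    </item>
  </channel>
</rss>"""


def opted_in_listings(feed_xml: str, tag: str = "listing") -> list:
    """Return title and link for every feed item carrying the given tag."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        categories = [
            c.text.strip().lower() for c in item.findall("category") if c.text
        ]
        if tag in categories:
            results.append(
                {"title": item.findtext("title"), "link": item.findtext("link")}
            )
    return results


listings = opted_in_listings(SAMPLE_FEED)
print(listings)
```

Untagged posts are never picked up, so the publisher opts in or out simply by adding or removing the tag.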

I predict many more accusations of stealing insofar as the industry continues to mine the “invisible” web, and the specialist web, via scraping and crawling.

And finally, edgeio publishes RSS feeds of every item (either individual items, or our entire inventory). Oodle and Vast are not competitors, but distribution partners. Our data is more valuable insofar as more people see it. That will happen if the data is placed in more environments. So, take it, for free. But please, do not scrape it or crawl it. Just read the RSS feeds. That is why we have them.

Last word goes to Andy Beal’s Marketing Pilgrim blog. After reading the ClickZ piece he says:

Ouch!

Certainly Vast is not alone in convincing classified sites that they’re helping them bring new visitors, but if the classified search engines are to see a bright future, they’ll need to secure strong partnerships with their partner sites.

My emphasis!

Update:

Tech.memeorandum link for this subject.

edgeio launches “instant listings”

The product manager at edgeio, Matt Kaufman, has been working hard with the engineering team since the launch to support instant listings on edgeio. Last night they pushed the first release. Now all you have to do is enter the URL of your blog on the edgeio home page and you can add any item to edgeio without having to go through the ping server and RSS feed notification process. This makes the addition instant. Here is a screenshot.

Instant Listings - from Crunchnotes

This uses crunchnotes.com as the blog and edgeio displays the recent posts. The first one is checked “add to edgeio” and is added to the “Events” category with the tags “news” and “announcement”. The rest are not selected. Clicking submit enters it into edgeio.

For bloggers not familiar with tagging, or who use blog platforms that do not allow tags or do not let them change which ping servers are notified, this is the simplest way to post on edgeio. For bloggers who prefer tagging and ping servers, posting to a blog with the tag “listing” still works. Matt posted to the edgeio blog last night. Listings are now over 12,000 (in a little under 2 weeks) and they come from over 1600 cities.

Coming soon – your own personal listings blog on edgeio.com. For those who do not want listings on their main blog, or those who do not have a blog. Matt says this may take a couple more weeks depending on priorities.

Susan Mernit joins Yahoo as Senior Director of Personals

Susan Mernit has announced that she is joining Yahoo. I have had the good fortune to meet with Susan a couple of times over the last 3 months or so. She has a great analytical mind, she is execution focused, she “gets” what is happening with Web 2.0. I suspect Yahoo are the big winners here. Congrats Susan.

Links:

Susan Mernit’s Blog: Newsflash: I’m joining Yahoo!

Teare’s theorem: The first law of RSS (updated)

Umair has a post about why the “Rise of the Edge” is something highly disruptive to orthodox Internet companies. In “Umair Rocks” Fred Wilson says he wants to understand better what Umair means here, and plans to spend the time doing so.

For me the key is to comprehend that “the edge” is a concept that only makes sense in a networked world. In a network “the edge” is “the people”. And “the edge” plays the role of both subject (consumers) and object (creators). Blogs are a great example of the edge. Multi-player gaming is another example. Of course the edge is not yet highly diversified. But with the emergence of AJAX and Tagging the diversity of edge content is set to explode. Inputs from the edge to the center and Outputs from the center to the edge (old fashioned IO where the center plays the role of a hub, not a destination) become more important than web 1.0 aggregators that primarily serve as silos of content.

The growing role of the edge – as the originating point of content and the end point of its consumption – forces the redefinition of the role and meaning of the center of the network. Content hosting is now a peripheral function (at best a means of having an index). Content discovery and distribution take over as the primary role of the center.

Googlebase isn’t yet getting this (it is so far based on too centralized a publishing model). Craigslist, with its centralized publishing model, as evidenced by its recent outlawing of Oodle from taking its content, also isn’t getting it.

Yahoo – which has made some smart acquisitions – also begins to look out of date in this world. It seems to have no concept of enabling the edge; it is a network center seeing the edge as merely a source of user generated (read cheap) content and of potential subscribers to its centralized system. Opening its APIs is a move in the right direction, but then the limits need to be removed. Even Flickr is centralized from a publishing point of view, albeit with good feed APIs for that centralized content. How much better would it be if you could publish photos and albums to your own blog and have Flickr acquire them, organize them and distribute them?

In a few weeks Mike and I will launch edgeio (note: for geeks its meaning is clear – edge content consumed (the I) and then redistributed (the O). For my mom it’s just a cool word, spoken with an Italian accent, edge^ee oh). edgeio may well help clarify the possibilities of the new edge based network we all now use and inhabit. At least that is one of its goals.
edgeio-base :-)

edgeio is founded on a law we believe in. This is the first articulation of the law and we may be able to improve on it. But for now (until Dave, Mike, Scoble, Jeremy or others give feedback :-) )

…the first law of RSS is:

“The value of edge published data (say a post) is directly proportional to the velocity of its consumption and re-production, that is, the number of input and output operations it goes through each day”

RSS has enabled data to be freed from the confines of its initial point of publishing and to re-appear, through an RSS or ATOM feed, at another point in the network. This takes place in a p2p (I read your feed) and an edge to center (I republish your post) and then a center to edge (others read my version of your post and so discover you) manner. As a post is consumed and republished, it, and the links to the original that are generated, create growing awareness, attention and probably traffic value which may or may not have a $ value.
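One possible way to write the first law compactly, as a sketch only: the symbols are introduced here for illustration and are not part of the original statement. Let V be the value of a post, I and O the number of input (consumption) and output (re-publication) operations it goes through per day, and k an unspecified constant of proportionality.

```latex
V = k \,(I + O), \qquad k > 0
```

Read this way, doubling the daily number of reads and republications of a post doubles its value, whatever units that value is measured in.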

edgeio has been built as an enabler of a more diversified edge, with a role as a hub in accelerating the velocity of data as it travels around the network. We can’t wait to show it. We are now on the final UI usability tests for a beta. Shouldn’t be too long.

Links

Bubblegeneration Strategy Lab
Umair Rocks
Techcrunch
edgeio