Blogger Hacks, Categories, Tips & Tricks

Thursday, September 07, 2006
Dapper: Content Integration For Everyone
It's not often that we get to see disruptive technologies emerge. The new Dapper service (from Dappit) is firmly in that category. Billed as a mashup tool, it lets you easily grab content from static or dynamic web pages and integrate it in a mind-boggling range of ways. Services like this will further break down the distinction between content providers and consumers, and force an overhaul of existing customs and business models around web content.

What Does Dapper Do?

Similar to Ning, it's a managed service to let people create and host their own applications, or clone and edit others'. At its heart, the service is an online web scraper; it lets you extract content and shunt it around. You nominate elements on the target page you'd like, and it will go and fetch them for you.

In principle, all you ever needed to do this was the good old Unix utilities wget and grep (this is how FreshTags started, in the days before JSON). Oh, but you'll also need a net-connected, secure machine. And the ability to write hairy regular expressions. Plus a lot of patience. And if you were thinking of actually doing anything with the results, you'd need something like a PHP server and a working knowledge of Perl, with various libraries for handling outputs and transformations. Not to mention a managed website to publish it all through. Yeah, it's starting to sound like one big headache, isn't it?

With Dapper, you simply create a "Dapplication". It's a very straightforward, step-by-step process, with the requisite Web 2.0 flavour. You submit some examples of your target page and use a simple "point and click" interface to nominate the elements you wish to extract. These elements could be tabular data, links, text and so on. The Dapper system makes guesses (which you correct in turn) to figure out the underlying structure of the page. You assign names to these fields and (optionally) group them. That's it - your Dapp is done.

A really neat feature of the system is the way you can specify inputs for the target page (you rarely want to scrape exactly the same page each time). If your target page uses URL parameters, you can instruct Dapper to pass those in for you using curly brackets, eg

http://somewhere.com/action?display=printer&mo={month}&da={day}&range=allusers

This will cause Dapper to prompt users for month and day variables. Or, you can nominate the existing input fields on the target page for data insertion in the same way. It's really very simple and intuitive.
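
To make that substitution concrete, here's a rough JavaScript sketch of what Dapper effectively does with those curly-bracket variables. The function and its name are mine, purely for illustration; Dapper performs this step on its own servers.

// Fill {name} placeholders in a target URL with user-supplied values.
// Illustrative only - Dapper does the equivalent for you server-side.
function fillTemplate(template, values) {
  return template.replace(/\{(\w+)\}/g, function (match, name) {
    return (name in values) ? encodeURIComponent(values[name]) : match;
  });
}

var url = fillTemplate(
  "http://somewhere.com/action?display=printer&mo={month}&da={day}&range=allusers",
  { month: "09", day: "07" }
);
// url is now ".../action?display=printer&mo=09&da=07&range=allusers"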

Once it's up and running, there is a truly dizzying array of options for getting the data out: the usual vegetable soup of web standards (XML, HTML, JSON, YAML, RSS), plus some novel ones (email, image loop, Google Gadget and an alert mechanism). What's more, Dapper's not shy about accepting requests. The one thing the service lacked was callback support for the JSON feed, which makes it easy for lightweights like me to play with the data. I emailed the developer, Jon Aizen, and within a couple of hours it was done! Thanks, Jon!
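
For those wondering why callbacks matter: they let a plain web page pull the JSON feed cross-domain with nothing more than a script tag. A minimal sketch follows; note that the Dapp URL and query parameters here are placeholders I've made up, not Dapper's documented endpoint.

// Called by the dynamically loaded script once the Dapp's JSON arrives.
// The shape of the data object is an assumption for illustration.
function handleDapp(data) {
  alert("Got " + data.results.length + " items from the Dapp");
}

// Loading via a script tag sidesteps XMLHttpRequest's same-origin limits.
// The src below is a made-up placeholder, not Dapper's real endpoint.
var script = document.createElement("script");
script.src = "http://example.com/RunDapp?name=myDapp&format=JSON&callback=handleDapp";
document.getElementsByTagName("head")[0].appendChild(script);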

Case Study

To test out the service, I built a Dapp in a few minutes to extract tabular data from a particular website (I'm currently negotiating an informal content and link-sharing agreement with a website related to Speccy, so I'm afraid I'll have to keep things a bit vague). One of the roadblocks in the negotiation is that some data I want from the other site is locked up in their SQL database, only served up as an HTML table by PHP. It would be very difficult for me to get at it any other way. With Dapper, I was able to nominate the fields and extract the data I wanted, lowering the hassle involved and improving the chances of concluding a mutually beneficial deal.

Dapper has two mechanisms to let you fine-tune your content selection: a slider that sets how "restrictive" it is in guessing what you're after, and a container that limits the grab to certain elements. Unfortunately, in my case, neither was too successful and I was getting unwanted extra content. I tried some other pages and both mechanisms worked as advertised; I must have been dealing with a pathological site. In the end, I opted to "over-extend", grabbing more content than required, and knocked together some regular expressions to parse out the bits I wanted. It works beautifully: I can enter the parameters and Dapper builds the appropriate URL (with parameters inserted into the query string), fetches the pages, strips out the data I want (plus a bit more) and hands it back to me as a JSON object - with a callback function!
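
If you're curious, the post-processing amounts to something like the sketch below. The "rows" field name and the pattern are invented stand-ins (the real site's details stay vague, as above):

// Receive the deliberately over-extended grab and strip it down.
// "rows" and the regex are hypothetical stand-ins for the real fields.
function handleTable(data) {
  var wanted = [];
  for (var i = 0; i < data.rows.length; i++) {
    // Keep only entries that look like "name: value" pairs, say.
    var match = data.rows[i].match(/^\s*(\w+):\s*(.+?)\s*$/);
    if (match) {
      wanted.push({ name: match[1], value: match[2] });
    }
  }
  return wanted;
}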

With a bit of confidence and some practice, I'm sure that anyone can extract content from a page of interest and display it on their own page (perhaps as an iframe element or an image loop, for simplicity). Blogger Beta's new RSS display widget really opens things up. This, I believe, is the disruptive element of the technology.
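
As a hedged example of the iframe route: assuming you've added an empty placeholder element to your template, something like this would drop a Dapp's HTML output into a sidebar (the src URL is a placeholder, not a real endpoint):

// Embed a Dapp's HTML output via a dynamically created iframe.
// Assumes your template has an element with id "sidebar-dapp";
// the src URL is a placeholder for your own Dapp's HTML output.
var frame = document.createElement("iframe");
frame.src = "http://example.com/RunDapp?name=myDapp&format=HTML";
frame.width = "100%";
frame.height = "200";
frame.frameBorder = "0";
document.getElementById("sidebar-dapp").appendChild(frame);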

Implications For The Web

Many of us involved in blog hacking are comfortable with content being passed around like this; we provide RSS and Atom versions of our content (plus social bookmarking of titles and summaries) and actively encourage others to pick it up. We also have an informal code for link sharing ("link love") that defines norms and governs behaviour. Dapper knocks all that on its head, and provides new challenges for content production and consumption.

For starters, web feeds are a push technology; publishers elect to syndicate their content in this way. By contrast, Dapper is a pull approach, whereby others suck content out of your site without your permission or knowledge.

Despite what the odd (and I mean weird) lawyer might believe, no one can control who you link to on the web. But extracting slabs of content ... that's different. Clearly, new customs and practices - not laws - will have to emerge to deal with this. (I believe existing intellectual "property" laws are simply not up to it, being too clumsy and blunt.) Dapper has gone some way to facilitate this, with its "empowering content providers" (ie site-based access restriction) form. Hopefully, more content providers will see the benefits of their users figuring out new and powerful uses of their content rather than just blocking requests from the dappit.com domain.

(For what it's worth, in my case, I'm not using the tech as an excuse to barge in and pillage the target site. Instead, I see it as a means for lowering the barriers to exchange and thus (hopefully) allowing a fruitful partnership to develop where it might not have been possible before.)

Of course, content-sharing issues don't arise if you scrape your own stuff. You could create Dapps to parse out interesting bits and pieces from your blog and offer them as emails, alerts, feeds, looped images and the like. Nearly all blogs - and wikis for that matter - employ templating of some kind. This means you can be more-or-less guaranteed that Dapper will have a good shot at easily parsing the underlying structure. Headings, profiles, post titles, dates, leading paragraphs, links, quotes, tags, authors, comments, timestamps ... all that static content (ie not generated by JavaScript) is ripe for extraction and syndication.
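
For instance, a Dapp that pulls post titles and permalinks from your own template could feed a "recent posts" list, roughly like this (the "posts", "title" and "url" field names are whatever you chose when building the Dapp):

// Render a Dapp's extracted posts as a linked list in the sidebar.
// Assumes an element with id "recent-posts" exists in your template;
// the field names are assumptions, matching your own Dapp's setup.
function showRecentPosts(data) {
  var list = document.createElement("ul");
  for (var i = 0; i < data.posts.length; i++) {
    var item = document.createElement("li");
    var link = document.createElement("a");
    link.href = data.posts[i].url;
    link.appendChild(document.createTextNode(data.posts[i].title));
    item.appendChild(link);
    list.appendChild(item);
  }
  document.getElementById("recent-posts").appendChild(list);
}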

Another suggestion is to look for stuff with RSS or Atom feeds. Chances are, if the publisher is happy pushing out content that way, they'll also be cool with you grabbing it with a Dapp. Ditto for content released under (some) Creative Commons licences. And, hey, you can always ask: I'd love to know if someone's built a Dapp to do something novel with my content! I'm sure that proper attribution, linking back, notification and respect for server/bandwidth load will all form the basis of an emerging Dapper netiquette.

If this post has piqued your interest, please go ahead and check out the growing list of Dapps already available, read the Dapper blog or just dive right in and create your own Dapp. I'm sure that within five minutes you'll grok the disruptive nature of this service and get a glimpse of the jaw-dropping possibilities.

Posted at 2:08 AM by Greg.
10 Comments:
Blogger Singpolyma said...
Will be trying this out... as I write more and more screen-scraping code this just may be my ticket to saving time! :D

Blogger Johan Sundström said...
My play with Dapper a week or so ago (I did some reworkings of a MySpace profile data puller, and got it to find, but not properly name, most content) suggested that Dapps are "scrape once per 24 hours", with no customizations on offer. Might be worth mentioning (if I am right) somewhere too.

Anonymous Anonymous said...
i just made my recent comments appear in the sidebar with dapper. a few weeks ago i switched to beta and the comment feed isn't working yet. so i used my comments blog (i had it from the days of the comment-hack, which included forwarding comments by googlemail to a blog listing my comments on one page). i used that with dapper, and it works. it took me a few hours. it doesn't work quite as i expected, but i am quite satisfied at the moment.

Blogger Greg said...
@Johan: I haven't come across anything about a once-a-day limit. Certainly the Dapps I've created aren't subject to it, unless it kicks in after the 24-hour trial period? Since the service has only just come into beta, could it be that you hit some teething issues?

@失踪: Hooray! That's a great use of Dapper (though it puts Hearsay out of business). Here's a link to the comments Dapp so that others may capitalise on 失踪's labour.

Anonymous Anonymous said...
hi greg, i wrote a post about getting recent comments using dapper. this should help others to implement it.

Anonymous Anonymous said...
A friend of mine always mentions this dapper thing. Now I know what she means by it. Thanks for the information!

Blogger Singpolyma said...
Last time I checked the comment feeds were working fine, just weren't displaying... maybe they're being finicky for others though, huh.

Anonymous Anonymous said...
Dapper is indeed innovative, and their technology is very impressive. It's sure to bring more surprises. Just wait and see :)

Anonymous Anonymous said...
Cool tip! Btw, have you checked out Feedity ( www.feedity.com )? It's like Dapper, but much simpler for creating custom RSS feeds.

Blogger atlas245 said...
Interesting points on web scrapers. For simple stuff I use Python to get or simplify data; data extraction can be a time-consuming process. But for other projects that involve documents, the web, or files, I tried "web scraper", which worked great - they build quick custom screen scrapers, web scrapers, and data parsing programs.

