What Does Dapper Do?
Similar to Ning, it's a managed service that lets people create and host their own applications, or clone and edit others'. At its heart, the service is an online web scraper; it lets you extract content and shunt it around. You nominate elements on the target page you'd like, and it will go and fetch them for you.

In principle, all you ever needed to do this was the good old Unix utilities wget and grep (this is how FreshTags started, in the days before JSON). Oh, but you'll also need a net-connected, secure machine. And the ability to write hairy regular expressions. Plus a lot of patience. And if you were thinking of actually doing anything with the results, then you'd need something like a PHP server and working knowledge of Perl, with various libraries for handling outputs and transformations. Not to mention a managed website to publish it all through. Yeah, it's sounding like one big headache, isn't it?
With Dapper, you simply create a "Dapplication". It's a very straightforward, step-by-step process, with the requisite Web 2.0 flavour. You submit some examples of your target page and use a simple "point and click" interface to nominate the elements you wish to extract. These elements could be tabular data, links, text and so on. The Dapper system guesses at the underlying structure of the page, and you correct its guesses as you go. You assign names to these fields and (optionally) group them. That's it - your Dapp is done.
A really neat feature of the system is the way you can specify inputs for the target page (you rarely want to scrape exactly the same page each time). If your target page uses URL parameters, you can instruct Dapper to pass those in for you using curly brackets, e.g.
http://somewhere.com/action?display=printer&mo={month}&da={day}&range=allusers
This will cause Dapper to prompt users for month and day variables. Or, you can nominate the existing input fields on the target page for data insertion in the same way. It's really very simple and intuitive.
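To make the substitution concrete, here's a minimal sketch in JavaScript of what Dapper is doing on your behalf; the template is the URL above, and the values are made up for illustration - this isn't Dapper's actual code.

// A minimal sketch of Dapper's curly-bracket substitution.
// The values below are illustrative; this isn't Dapper's internals.
var template = "http://somewhere.com/action?display=printer&mo={month}&da={day}&range=allusers";

function fillTemplate(template, values) {
  // Replace each {name} placeholder with the matching user-supplied value.
  return template.replace(/\{(\w+)\}/g, function (match, name) {
    return encodeURIComponent(values[name]);
  });
}

// fillTemplate(template, { month: "10", day: "31" })
// -> "http://somewhere.com/action?display=printer&mo=10&da=31&range=allusers"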
Once it's up and running, there is a truly dizzying array of options for getting the data out: the usual vegetable soup of web standards (XML, HTML, JSON, YAML, RSS), plus some novel ones (email, image loop, Google Gadget and an alert mechanism). What's more, Dapper's not shy about accepting requests. The service was missing callbacks for its JSON feed (callbacks make it easy for lightweights like me to play with the data), so I emailed the developer, Jon Aizen, and within a couple of hours it was done! Thanks, Jon!
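If you haven't met JSON callbacks before, the trick is simple: you name a function, the server wraps its JSON payload in a call to that function, and you load the whole thing as a script. Here's a rough sketch; the endpoint URL and the "callback" parameter name are my assumptions for illustration, not Dapper's documented API.

// Consuming a JSON feed via a callback - a hedged sketch.
// NOTE: the URL and the "callback" parameter name are assumptions;
// check Dapper's own documentation for the real ones.
function handleDapp(data) {
  // "data" is whatever JSON object the Dapp returns (shape is hypothetical).
  alert("Fetched " + data.fields.length + " fields");
}

var script = document.createElement("script");
script.src = "http://www.dappit.com/RunDapp?name=myDapp&format=json" +
             "&callback=handleDapp"; // hypothetical endpoint
document.getElementsByTagName("head")[0].appendChild(script);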
Case Study
To test out the service, I built a Dapp in a few minutes to extract tabular data from a particular website (I'm currently negotiating an informal content and link-sharing agreement with a website related to Speccy, so I'm afraid I'll have to keep things a bit vague). One of the roadblocks in the negotiation is that some data I want from the other site is locked up in their SQL database, only served up as an HTML table by PHP. It would be very difficult for me to get at it otherwise. With Dapper, I was able to nominate the fields and extract the data I wanted, lowering the hassle involved and improving the chances of concluding a mutually-beneficial deal.

Dapper has two mechanisms to let you fine-tune your content selection: a slider that controls how "restrictive" it is in guessing what you're after, and a container that limits the grab to certain elements. Unfortunately, in my case, it wasn't too successful with either and I was getting unwanted extra content. I tried some other pages and both mechanisms worked as advertised; I must have been dealing with a pathological site. In the end, I opted to "over-extend", grabbing more content than required, and knocked together some regular expressions to parse out the bits I wanted. It works beautifully: I can enter the parameters and Dapper builds the appropriate URL (with parameters inserted into the query string), fetches the pages, strips out the data I want (plus a bit more) and hands it back to me as a JSON object - with a callback function!
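For anyone wanting to try the same "over-extend then trim" trick, the cleanup can live right inside the callback. A sketch follows; the field names and the regular expression are placeholders, since they depend entirely on the page being scraped.

// Trimming an over-extended grab inside the JSON callback.
// Field names and the pattern are placeholders for whatever your page needs.
function handleDapp(data) {
  var rows = [];
  for (var i = 0; i < data.items.length; i++) { // hypothetical shape
    var raw = data.items[i].cell;               // over-extended cell text
    var m = raw.match(/^\s*([\w ]+):\s*(\d+)/); // keep only "name: number"
    if (m) {
      rows.push({ name: m[1], value: m[2] });
    }
  }
  // ... render "rows" into the page ...
}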
With a bit of confidence and some practice, I'm sure that anyone can extract content from a page of interest and display it on their own page (perhaps as an iframe element or an image loop, for simplicity; see the sketch below). Blogger Beta's new RSS display widget really opens things up. This, I believe, is the disruptive element of the technology.
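The iframe route really is that simple. A sketch, assuming the Dapp offers an HTML-formatted output at some URL - the address below is made up:

// Drop a Dapp's HTML output into your own page via an iframe.
// The src URL is invented for illustration; use your Dapp's real HTML feed.
var frame = document.createElement("iframe");
frame.src = "http://www.dappit.com/RunDapp?name=myDapp&format=html";
frame.width = "400";
frame.height = "300";
// Assumes your page has an element with id="dapp-container".
document.getElementById("dapp-container").appendChild(frame);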
Implications For The Web
Many of us involved in blog hacking are comfortable with content being passed around like this; we provide RSS and Atom versions of our content (plus social bookmarking of titles and summaries) and actively encourage others to pick it up. We also have an informal code for link sharing ("link love") that defines norms and governs behaviour. Dapper knocks all that on its head, and provides new challenges for content production and consumption.

For starters, web feeds are a push technology; publishers elect to syndicate their content in this way. By contrast, Dapper is a pull approach, whereby others suck content out of your site without your permission or knowledge.
Despite what the odd (and I mean weird) lawyer might believe, no one can control who you link to on the web. But extracting slabs of content ... that's different. Clearly, new customs and practices - not laws - will have to emerge to deal with this. (I believe existing intellectual "property" laws are simply not up to it, being too clumsy and blunt.) Dapper has gone some way to facilitate this, with its "empowering content providers" (i.e. site-based access restriction) form. Hopefully, more content providers will see the benefits of their users figuring out new and powerful uses of their content rather than just blocking requests from the dappit.com domain.
(For what it's worth, in my case, I'm not using the tech as an excuse to barge in and pillage the target site. Instead, I see it as a means for lowering the barriers to exchange and thus (hopefully) allowing a fruitful partnership to develop where it might not have been possible before.)
Of course, content-sharing issues don't arise if you scrape your own stuff. You could create Dapps to parse out interesting bits and pieces from your blog and offer them as emails, alerts, feeds, looped images and the like. Nearly all blogs - and wikis for that matter - employ templating of some kind. This means you can be more-or-less guaranteed that Dapper will have a good shot at parsing the underlying structure. Headings, profiles, post titles, dates, leading paragraphs, links, quotes, tags, authors, comments, timestamps ... all that static content (i.e. not generated by JavaScript) is ripe for extraction and syndication.
Another suggestion is to look for stuff with RSS or Atom feeds. Chances are, if the publisher is happy pushing out content in this way, they'll also be cool with you grabbing it with a Dapp. Ditto for content released under (some) Creative Commons licences. And, hey, you can always ask: I'd love to know if someone's built a Dapp to do something novel with my content! I'm sure that proper attribution, linking back, notification and respect for server/bandwidth load will all form the basis of an emerging Dapper netiquette.
If this post has piqued your interest, please go ahead and check out the growing list of Dapps already available, read the Dapper blog or just dive right in and create your own Dapp. I'm sure that within five minutes you'll grok the disruptive nature of this service and get a glimpse of the jaw-dropping possibilities.
Filed in: dapper, web2.0, feeds, webtech, syndication
@失踪: Hooray! That's a great use of Dapper (though it puts Hearsay out of business). Here's a link to the comments Dapp so that others may capitalise on 失踪's labour.