Harpers: a Semantic Web(ish) site for Harper’s Magazine

Harpers.org: a Semantic Web(ish) site for Harper’s Magazine Paul Ford Associate Web Editor, Harpers.org ford@harpers.org

Harper’s is… • A magazine of literature, politics, culture, and the arts published continuously from 1850 • A small non-profit

Available content • The Weekly Review, an emailed summary of world events, from 2000 • The Harper’s Index, a statistical portrait of the world, from 1998 • Public domain, scanned-in archives from 1850-1982 • Readings • Occasional features

And that’s it. • Maybe full text of issues will be offered someday, but not soon. So… • How do we get more value out of limited content?

Solution • Hack up what we have into bits by content type, then… • Reassemble it according to link targets… • Which are arranged in a taxonomy… • Creating a very small “Semantic Web” for Harpers.org

A quick demo… • >>>

How it works • Simple set of ontological relationships (partOf, supervisorOf) • Taxonomy of content • & narrative content • that is split into smaller pieces • & links into the taxonomy
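
A minimal sketch of how a partOf relationship could place one link target on several pages. The topic names and the idea of walking up the chain are assumptions for illustration; the slides only name the relationship types.

```python
# Hypothetical taxonomy: each topic maps to the topic it is partOf.
PART_OF = {
    "Baghdad": "Iraq",
    "Iraq": "MiddleEast",
    "MiddleEast": "World",
}

def ancestors(topic, part_of=PART_OF):
    """Return every topic that `topic` is (transitively) part of, nearest first."""
    chain = []
    seen = {topic}
    while topic in part_of:
        topic = part_of[topic]
        if topic in seen:  # guard against accidental cycles in the taxonomy
            break
        seen.add(topic)
        chain.append(topic)
    return chain

def pages_for_link(target, part_of=PART_OF):
    """A link to `target` could surface on its own page and on each ancestor's page."""
    return [target] + ancestors(target, part_of)
```

Under these assumptions, `pages_for_link("Iraq")` lists the Iraq page followed by the pages of the topics above it.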

Markup • Text: “Country Y announced that it had cut off relations with country Z. On Wednesday, something happened to persons X and Y.”

Markup <event> Country Y announced that it had cut off relations with country Z. </event> <event> On Wednesday, something happened to persons X and Y. </event>

Markup <event on="2004-03-12" id="24848"> Country Y announced that it had cut off relations with country Z. </event>

Markup <event on="2004-03-12" id="24848"> <link to="#CountryY">Country Y</link> announced that it had cut off relations with <link to="#CountryZ">country Z</link>. </event>
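
The markup above can be read back into plain records with a few lines of code. This is a sketch using Python's standard library rather than the XSLT the site used; the wrapping `<events>` root element and the record field names are assumptions.

```python
import xml.etree.ElementTree as ET

# The <events> wrapper is hypothetical; the <event> element is the one from the slide.
SAMPLE = (
    '<events>'
    '<event on="2004-03-12" id="24848">'
    '<link to="#CountryY">Country Y</link> announced that it had cut off '
    'relations with <link to="#CountryZ">country Z</link>.'
    '</event>'
    '</events>'
)

def parse_events(xml_text):
    """Turn each <event> into a dict with its id, date, plain text, and link targets."""
    root = ET.fromstring(xml_text)
    events = []
    for ev in root.iter("event"):
        events.append({
            "id": ev.get("id"),
            "on": ev.get("on"),
            # itertext() walks the mixed content; split/join normalizes whitespace
            "text": " ".join("".join(ev.itertext()).split()),
            "links": [link.get("to") for link in ev.iter("link")],
        })
    return events
```

Each record keeps the narrative text and the taxonomy links side by side, which is what later lets one event appear on several topic pages.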

Conditionals • Some text required conditional markup • Text: “Country Y announced that it had cut off relations with country Z, and on Wednesday, something happened to persons X and Y.”

Conditionals: ugly, but simple <event> Country Y announced that it had cut off relations with country Z <cond is="id">, and</cond> <cond not="id">.</cond> </event> <event> <cond is="id">on</cond> <cond not="id">On</cond> Wednesday, something happened to persons X and Y. </event>

Conditionals: ugly, but simple • Narrative version • Country Y announced that it had cut off relations with country Z, and on Wednesday, something happened to persons X and Y. • Timeline-friendly version • Country Y announced that it had cut off relations with country Z. • On Wednesday, something happened to persons X and Y.
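
One way to resolve `<cond>` elements into the two versions is sketched below. The slides do not spell out what the `id` flag means, so this assumes a set of active flags: `is="f"` content is kept when `f` is active, `not="f"` content when it is not, and `id` is treated as active for the narrative rendering because that reproduces the two versions shown. The `<events>` wrapper and the tightened whitespace are also assumptions.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<events>'
    '<event>Country Y announced that it had cut off relations with country Z'
    '<cond is="id">, and</cond><cond not="id">.</cond></event>'
    '<event><cond is="id">on</cond><cond not="id">On</cond>'
    ' Wednesday, something happened to persons X and Y.</event>'
    '</events>'
)

def render(elem, flags):
    """Flatten an element to text, keeping only the <cond> branches that match `flags`."""
    parts = [elem.text or ""]
    for child in elem:
        keep = True
        if child.tag == "cond":
            is_flag, not_flag = child.get("is"), child.get("not")
            keep = ((is_flag is None or is_flag in flags)
                    and (not_flag is None or not_flag not in flags))
        if keep:
            parts.append(render(child, flags))
        parts.append(child.tail or "")  # text after the child always belongs to the parent
    return "".join(parts)

def narrative(xml_text):
    """Join all events into one running sentence (flag 'id' assumed active)."""
    root = ET.fromstring(xml_text)
    return " ".join(render(ev, {"id"}) for ev in root.iter("event"))

def timeline(xml_text):
    """Render each event as a standalone sentence (flag 'id' assumed inactive)."""
    root = ET.fromstring(xml_text)
    return [render(ev, set()) for ev in root.iter("event")]
```

The same source text yields both renderings, so the editors only maintain one copy of each event.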

All of it gets slurped up • And turned into a set of triples • Then processed in-memory • With HTML pages spit out as a result
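
The pipeline on this slide can be sketched roughly as follows. The predicate names (`on`, `text`, `mentions`), the second event, and the HTML layout are hypothetical; the real site did this step with XSLT 2.0.

```python
from collections import defaultdict
from html import escape

# Hypothetical event records, shaped like the marked-up <event> elements.
EVENTS = [
    {"id": "24848", "on": "2004-03-12",
     "text": "Country Y announced that it had cut off relations with country Z.",
     "links": ["#CountryY", "#CountryZ"]},
    {"id": "24849", "on": "2004-03-17",  # made-up id and date for illustration
     "text": "Something happened to persons X and Y.",
     "links": ["#PersonX", "#PersonY"]},
]

def to_triples(events):
    """Flatten events into (subject, predicate, object) triples."""
    triples = []
    for ev in events:
        subject = "event:" + ev["id"]
        triples.append((subject, "on", ev["on"]))
        triples.append((subject, "text", ev["text"]))
        for target in ev["links"]:
            triples.append((subject, "mentions", target))
    return triples

def build_pages(triples):
    """Group events by the topics they mention and emit one HTML page per topic."""
    facts = defaultdict(dict)     # subject -> {predicate: object}
    mentions = defaultdict(list)  # topic -> [subjects]
    for s, p, o in triples:
        if p == "mentions":
            mentions[o].append(s)
        else:
            facts[s][p] = o
    pages = {}
    for topic, subjects in mentions.items():
        items = "".join(
            "<li>{}: {}</li>".format(escape(facts[s]["on"]), escape(facts[s]["text"]))
            for s in sorted(subjects, key=lambda s: facts[s]["on"])
        )
        pages[topic] = "<h1>{}</h1><ul>{}</ul>".format(escape(topic.lstrip("#")), items)
    return pages
```

Calling `build_pages(to_triples(EVENTS))` returns a dict of topic to HTML string, with each event listed on every topic page it links to.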

Hard, then easy • Hard to get started (lots of events, facts, and links) • Easy to keep going, if you don’t mind the markup and use a good text editor

Tools used • emacs, vi, BBEdit • XSLT 2.0 (Saxon) • CVS

Why not RDF? • Not right for redundant content and conditionals • Easy enough to transform arbitrary structured XML into RDF with XSLT, as needed • (Or into RSS1.0, RSS2.0, Atom, etc.) • ?
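
To illustrate how little is needed to get from the event markup to RDF, here is a sketch that emits N-Triples lines for one event. It is written in Python rather than XSLT, and the base URI and the `terms/on` and `terms/mentions` predicates are made up for the example.

```python
BASE = "http://example.org/"  # hypothetical base URI, not the site's real namespace

def literal(value):
    """Quote a string as an N-Triples literal, escaping the characters that need it."""
    escaped = (value.replace("\\", "\\\\").replace('"', '\\"')
                    .replace("\n", "\\n").replace("\r", "\\r"))
    return '"' + escaped + '"'

def event_to_ntriples(event_id, on, link_targets, base=BASE):
    """Return one N-Triples line for the date and one per link target."""
    subject = "<{}event/{}>".format(base, event_id)
    lines = ["{} <{}terms/on> {} .".format(subject, base, literal(on))]
    for target in link_targets:
        topic = "<{}topic/{}>".format(base, target.lstrip("#"))
        lines.append("{} <{}terms/mentions> {} .".format(subject, base, topic))
    return lines
```

For example, `event_to_ntriples("24848", "2004-03-12", ["#CountryY", "#CountryZ"])` yields three lines, one for the date and one for each linked topic.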

For free… • From 300 individual pages… • To 1100 pages of “remixed” content – all unique and relevant • And Google-friendly

And also for free… • Semantically relevant in-site advertising, if we want it • Topic-sorted, reusable content • Permanent, readable URIs

Do people get it? • Some do, and others just navigate the site as usual • Harper’s was fine with the learning curve • “Odd but useful” – Gawker

Results • Uptick in traffic and subscription revenues • Low cost of maintenance • Ever-increasing database of facts and events – adding one Weekly Review adds value to 50 different pages • Happy client

Why the SemWeb(ish) framework? • Leaves plenty of room to grow • Web-only content • Full text of issues • Subscriber services • Etc • Take advantage of new SemWeb tools • Incorporate RDF sources into the taxonomy • Anticipate Semantic Web browsers

Next?

Make it pretty • Redesign • Hide some of the navigation • Turn links on and off

Make it scale • Currently maxes out at about 20-30 megs of content, due to limits of in-memory DOM representation (10-12x XML document size) • Use a publicly available storage layer (Kowari, Jena, etc) • Go triple-crazy
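
Taking the figures on this slide at face value, the memory ceiling works out as follows. The function name is hypothetical and the result is only the slide's own arithmetic, not a measurement.

```python
def dom_memory_mb(xml_mb, overhead_factor):
    """Estimated in-memory DOM size for a given XML size and overhead multiplier."""
    return xml_mb * overhead_factor

low = dom_memory_mb(20, 10)   # smallest content size, smallest overhead
high = dom_memory_mb(30, 12)  # largest content size, largest overhead
print(low, high)  # 200 360
```

So 20-30 MB of content at 10-12x implies roughly 200-360 MB of memory for the in-memory representation.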

Make it easy to query and navigate • “Show me everything related to George Bush and Iraq.” or • “Show me everything related to politicians and the Middle East.” • New navigation • ?
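
Both example questions reduce to the same operation if topics are expanded through the taxonomy before matching. The sketch below assumes a child-to-parent taxonomy and an index from event id to linked topics; all of the data and names are hypothetical.

```python
# Hypothetical taxonomy (child -> parent) and index (event id -> linked topics).
PARENT = {
    "GeorgeBush": "Politicians",
    "Iraq": "MiddleEast",
}
EVENT_TOPICS = {
    "e1": {"GeorgeBush", "Iraq"},
    "e2": {"GeorgeBush"},
    "e3": {"Iraq"},
}

def expand(topic, parent=PARENT):
    """Return `topic` plus every topic beneath it in the taxonomy."""
    found = {topic}
    changed = True
    while changed:  # repeat until no new descendants turn up
        changed = False
        for child, par in parent.items():
            if par in found and child not in found:
                found.add(child)
                changed = True
    return found

def related_to(*topics, event_topics=EVENT_TOPICS, parent=PARENT):
    """Events linked to at least one topic under each requested topic."""
    wanted = [expand(t, parent) for t in topics]
    return sorted(
        ev for ev, linked in event_topics.items()
        if all(linked & group for group in wanted)
    )
```

With this data, `related_to("GeorgeBush", "Iraq")` and `related_to("Politicians", "MiddleEast")` select the same event, because the broader topics expand to include the narrower ones.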
