HUMANS do it better! dmoz: The Open Directory Project
What is dmoz? • dmoz stands for Directory MOZilla • Also known as the Open Directory Project (ODP) • Searchable directory, similar to Yahoo! • Administered by Netscape as a non-commercial entity
Who maintains dmoz? • Data maintained by “expert” volunteers • Anyone can become an editor • 47,083 editors • ODP categorizes “quality” information • 378,028 categories
Interface features • Simple • No ads • Browseable directory • Regular and advanced search • http://www.dmoz.org/
Web coverage • dmoz - 3,260,681 documents • Google - 2,073,418,204 documents
RDF Format
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://directory.mozilla.org/rdf">
  <Topic r:id="Top">
    <tag catid="1"/>
    <d:Title>Top</d:Title>
  </Topic>
  <Topic r:id="Top/Arts">
    <tag catid="2"/>
    <d:Title>Arts</d:Title>
    <link r:resource="http://www3.bc.sympatico.ca/PHILLIPSHOTGLASS/GlassPage.html"/>
  </Topic>
  <ExternalPage about="http://www3.bc.sympatico.ca/PHILLIPSHOTGLASS/GlassPage.html">
    <d:Title>John phillips Blown glass</d:Title>
    <d:Description>A small display of glass by John Phillips</d:Description>
  </ExternalPage>
  <Topic r:id="Top/Computers">
    <tag catid="4"/>
    <d:Title>Computers</d:Title>
    <link r:resource="http://www.cs.tcd.ie/FME/"/>
    <link r:resource="http://pages.whowhere.com/computers/pnyhlen/Timeline.html"/>
  </Topic>
  <ExternalPage about="http://www.cs.tcd.ie/FME/">
    <d:Title>FME HUB</d:Title>
    <d:Description>Formal Methods Europe (FME) is a European organization supported by the Commission of the European Union (via ESSI of the ESPRIT programme), with the mission of promoting and supporting the industrial use of formal methods for computer systems development.</d:Description>
  </ExternalPage>
  <ExternalPage about="http://pages.whowhere.com/computers/pnyhlen/Timeline.html">
    <d:Title>Computer Timeline</d:Title>
    <d:Description>A brief description of the eras in computing.</d:Description>
  </ExternalPage>
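To make the RDF structure concrete, here is a minimal sketch of reading such a dump with Python's standard `xml.etree.ElementTree`. It is not part of dmoz's own tooling. The function name `parse_dump`, the prefix `o` used for the default namespace, and the closing `</RDF>` tag (which the slide excerpt omits) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the slide; the prefix "o" for the default
# namespace is our own label, not something dmoz defines.
NS = {
    "r": "http://www.w3.org/TR/RDF/",
    "d": "http://purl.org/dc/elements/1.0/",
    "o": "http://directory.mozilla.org/rdf",
}

# A shortened copy of the slide's excerpt, closed with </RDF> so it parses.
SAMPLE = """<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://directory.mozilla.org/rdf">
  <Topic r:id="Top/Arts">
    <tag catid="2"/>
    <d:Title>Arts</d:Title>
    <link r:resource="http://www3.bc.sympatico.ca/PHILLIPSHOTGLASS/GlassPage.html"/>
  </Topic>
  <ExternalPage about="http://www3.bc.sympatico.ca/PHILLIPSHOTGLASS/GlassPage.html">
    <d:Title>John phillips Blown glass</d:Title>
    <d:Description>A small display of glass by John Phillips</d:Description>
  </ExternalPage>
</RDF>"""


def parse_dump(xml_text):
    """Return (topics, pages) dictionaries built from an RDF excerpt."""
    root = ET.fromstring(xml_text)
    r_id = "{%s}id" % NS["r"]              # namespaced attribute r:id
    r_resource = "{%s}resource" % NS["r"]  # namespaced attribute r:resource

    topics = {}
    for topic in root.findall("o:Topic", NS):
        topics[topic.get(r_id)] = {
            "catid": int(topic.find("o:tag", NS).get("catid")),
            "title": topic.findtext("d:Title", namespaces=NS),
            "links": [l.get(r_resource) for l in topic.findall("o:link", NS)],
        }

    pages = {}
    for page in root.findall("o:ExternalPage", NS):
        # "about" carries no prefix, so it is looked up without a namespace.
        pages[page.get("about")] = {
            "title": page.findtext("d:Title", namespaces=NS),
            "description": page.findtext("d:Description", default="", namespaces=NS),
        }
    return topics, pages


topics, pages = parse_dump(SAMPLE)
```

Each `Topic` lists its pages through `link` elements, and each `ExternalPage` is keyed by the same URL, so joining the two dictionaries on that URL recovers the category listing.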
Using dmoz data • Data is freely available for download • http://dmoz.org/rdf.html • http://dmoz.org/license.html • Must provide attribution and back-link • No Warranty
dmoz data • Many sites use dmoz data • AOL Search • Google • Lycos • HotBot • over 200 others • Some sites add enhancements and extensions • Google adds page rank • Lycos adds targeted ads
Searching dmoz • Boolean • implicitly AND • AND, OR, ANDNOT • allows shorthand (+, |, -) • Wildcard search (pup*) • Phrasal search • Mixed searches • Field based queries
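The query forms listed above can be illustrated with a small normalizer. This is only a sketch: the mapping of shorthand to operators (`+` to AND, `|` to OR, `-` to ANDNOT) is assumed from the order in which the slide lists them, and the function name `normalize` and the flat left-to-right clause list are hypothetical, not dmoz's actual parser.

```python
import re

# Optional shorthand prefix, then either a quoted phrase or a bare token.
TOKEN = re.compile(r'([+|-]?)("([^"]*)"|\S+)')
SHORTHAND = {"+": "AND", "|": "OR", "-": "ANDNOT"}  # assumed mapping
KEYWORDS = {"AND", "OR", "ANDNOT"}


def normalize(query):
    """Turn a query string into a list of (operator, term, kind) triples."""
    clauses = []
    pending = None  # operator named by a preceding keyword, if any
    for prefix, raw, phrase in TOKEN.findall(query):
        if not prefix and raw in KEYWORDS:
            pending = raw
            continue
        # No shorthand and no keyword means the implicit AND.
        op = SHORTHAND.get(prefix) or pending or "AND"
        pending = None
        if len(raw) >= 2 and raw[0] == raw[-1] == '"':
            term, kind = phrase, "phrase"      # phrasal search
        elif raw.endswith("*"):
            term, kind = raw[:-1], "prefix"    # wildcard search, e.g. pup*
        else:
            term, kind = raw, "word"
        clauses.append((op, term.lower(), kind))
    return clauses
```

For example, `normalize('pup* -cat')` yields an AND clause for the prefix `pup` and an ANDNOT clause for the word `cat`, which is one way a mixed search could be represented before it is run.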
Search relevance • Queries performed against fields in the RDF database • For documents: title, description, URL • For categories: title, terms/keywords • Keywords are chosen manually, so they are potentially more relevant • Results clustered by category and ranked according to the number of matches within a given category • Ranking shows some inconsistency, and the ranking method does not seem to be publicly documented • Some documents are flagged with a star and appear at the top of a directory listing (these do not seem to get special promotion in search results)
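The clustering behaviour described above can be sketched as follows. Since the real ranking is not documented, this assumes "number of matches" means the number of matching documents in a category, uses plain substring matching, and breaks ties by category name. The function `search` and the `example.org` sample data are hypothetical.

```python
from collections import defaultdict


def search(documents, terms):
    """Group matching documents by category, largest cluster first."""
    terms = [t.lower() for t in terms]
    clusters = defaultdict(list)
    for doc in documents:
        # Document queries run against title, description and URL.
        haystack = " ".join((doc["title"], doc["description"], doc["url"])).lower()
        if all(t in haystack for t in terms):  # implicit AND, substring match
            clusters[doc["category"]].append(doc["url"])
    # More matching documents ranks higher; ties are ordered by category name.
    return sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Hypothetical sample data, not taken from the dmoz dump.
docs = [
    {"category": "Top/Recreation/Pets/Dogs", "title": "Puppy care",
     "description": "Raising a puppy", "url": "http://example.org/puppy-care"},
    {"category": "Top/Recreation/Pets/Dogs", "title": "Puppy training",
     "description": "House training basics", "url": "http://example.org/training"},
    {"category": "Top/Shopping/Pets", "title": "Pet store",
     "description": "Puppy supplies", "url": "http://example.org/store"},
    {"category": "Top/Arts", "title": "Glass",
     "description": "Blown glass", "url": "http://example.org/glass"},
]

results = search(docs, ["puppy"])
```

Here the Dogs category has two matching documents and the Shopping category has one, so the Dogs cluster is listed first.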
Relevance feedback • Not directly supported • Web forms for reporting feedback • http://dmoz.org/cgi-bin/feedback.cgi
Engine • Uses I-Search • http://www.etymon.com/Isearch/ • Open source • Modules may be added to enable searching of different document types • dmoz extensions to I-Search • RDF parsing module • Special search module, to return sub-records
More about I-Search • Supports many different kinds of queries • Vector search (or at least some sort of weighted keyword search) • Soundex (looks for "similar"-sounding words; works for English and similar languages only) • Boolean search • Geographic search (hits within a given x1,y1,x2,y2 box) • Field searches (for structured documents, like RDF) • Thesaurus expansion and stopword lists supported • Queries are translated into RPN (reverse Polish notation) and pushed onto a stack • Operations/operands are handled in a generic fashion • Has a number of options for searching (for exact terms): • dictionary (hash table) • binary search of sorted index
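The RPN-and-stack idea can be shown with a short evaluator. This is a sketch of the general technique, not I-Search's own code: the function name `evaluate_rpn`, the in-memory inverted index (term to set of document ids), and the sample data are all assumptions.

```python
def evaluate_rpn(tokens, index):
    """Evaluate a Boolean query in RPN against an inverted index.

    `index` maps a term to the set of document ids containing it.
    """
    ops = {
        "AND": lambda a, b: a & b,     # documents in both sets
        "OR": lambda a, b: a | b,      # documents in either set
        "ANDNOT": lambda a, b: a - b,  # documents in the first set only
    }
    stack = []
    for tok in tokens:
        if tok in ops:
            if len(stack) < 2:
                raise ValueError("malformed RPN query")
            right = stack.pop()
            left = stack.pop()
            stack.append(ops[tok](left, right))
        else:
            # Operand: push the posting set (empty if the term is unknown).
            stack.append(index.get(tok, set()))
    if len(stack) != 1:
        raise ValueError("malformed RPN query")
    return stack[0]


# Hypothetical index: term -> ids of documents containing the term.
index = {"chinese": {1, 2, 3}, "calendar": {2, 3, 4}, "zodiac": {3}}

# Infix "(chinese AND calendar) ANDNOT zodiac" written in RPN.
hits = evaluate_rpn(["chinese", "calendar", "AND", "zodiac", "ANDNOT"], index)
```

Because every operator simply pops two sets and pushes one, new operators can be added to `ops` without changing the loop, which is one reading of "handled in a generic fashion".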
dmoz vs. UNCA Library Catalog • UNCA Library Catalog has a fixed vocabulary • Library catalog created by trained professionals; dmoz uses “expert” volunteers • Both use field-based queries • dmoz always searches the same fields
dmoz vs. Google • Google uses dmoz’s data • Google is a search engine (good for finding specific information) • dmoz is a directory (good for finding general information) • Google adds page ranking to dmoz documents
+"Chinese calendar" +"year of the ram“ Documents returned Google: 10 dmoz: 0 Library: 0 No dead links No overlap Relevance Google: 70% dmoz: N/A Library N/A +"Chinese calendar" Documents returned Google: 15,200 dmoz: 10; 7 categories Library: 2 No dead links Overlap 4 pages (Google/dmoz) Relevance Google: 30% dmoz: 30% Library: 50% Query 1: When is the next year of the Ram on the Chinese calendar?
"douglas adams" hitchhiker guide galaxy "meaning of life" Documents returned Google: ~364 dmoz: 0 Library: 0 No dead links No overlap Relevance Google: 60% dmoz: N/A Library N/A “meaning of life“ answer Documents returned Google: 49,700 dmoz: 1 Library: 0 No dead links No overlap Relevance Google: 0% dmoz: 0% Library: 0% Query 2: According to Douglas Adams, author of "HitchHiker's Guide to the Galaxy,“ what is the answer to the question: "What is the meaning of life?"
Query 3: Find Morgan horse breeders in North Carolina
Search: morgan horse breeders north carolina
• Documents returned: Google: 1,140 | dmoz: 0 | Library: 0
• No dead links
• No overlap
• Relevance: Google: 40% | dmoz: N/A | Library: N/A