Wikitology: Wikipedia as an Ontology Tim Finin, Zareen Syed and Anupam Joshi University of Maryland, Baltimore County http://ebiquity.umbc.edu/resource/html/id/250/
Motivation Identifying the topics and concepts associated with text or text entities is a task common to many applications: • Annotation and categorization of documents • Modelling user interests • Business intelligence • Selecting advertisements • Improving information retrieval • Better named entity extraction and disambiguation
What’s a document about? Two common approaches: (1) select words and phrases that characterize the document, e.g., using TF-IDF; (2) map the document to a list of terms from a controlled vocabulary or ontology. Approach (1) is flexible and doesn’t require creating and maintaining an ontology; approach (2) can connect documents to a rich knowledge base.
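Approach (1) can be sketched in a few lines of Python; the toy corpus, tokenization, and smoothed IDF variant below are illustrative assumptions, not details from the talk.

```python
import math
from collections import Counter

def tfidf(doc_tokens, corpus):
    """Return {term: tf-idf score} for one tokenized document.

    corpus is a list of tokenized documents used only to compute
    document frequencies (a stand-in for a real collection).
    """
    n_docs = len(corpus)
    tf = Counter(doc_tokens)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1       # smoothed IDF
        scores[term] = (count / len(doc_tokens)) * idf    # TF * IDF
    return scores

corpus = [["wiki", "ontology"], ["wiki", "graph"], ["tax", "law"]]
scores = tfidf(["wiki", "ontology", "ontology"], corpus)
```

Terms that are frequent in the document but rare in the collection score highest, which is what makes the selected words characteristic of the document.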
Wikitology! • Using Wikipedia as an ontology offers the best of both approaches • Each article (~4M) is a concept in the ontology • Terms are linked via Wikipedia’s category system (~200K categories) and inter-article links • Lots of structured and semi-structured data • It’s a consensus ontology created and maintained by a diverse community • Broad coverage, multilingual, very current • Overall content quality is high
Constructing the Wikitology KB [Diagram: sources feeding the KB — RDF and OWL statements, the Freebase KB, Yago, WordNet, databases, and human input & editing]
ACE 2008 • ACE 2008 is a NIST-sponsored exercise in entity extraction from text • Focus on resolving entities across documents, e.g., “Dr. Rice” mentioned in doc 18397 is the same as “Secretary of State” in doc 46281 • 20K documents in English and Arabic • We participated on a team from the JHU Human Language Technology Center of Excellence
ACE 2008 pipeline • BBN’s Serif system produces text annotated with named entities (people or organizations), e.g., “Dr. Rice, Ms. Rice, the secretary, she, secretary Rice” • Featurizers score pairs of entities for co-reference, e.g., (CNN-264772-E32, AFP-7373726-E19, 0.6543) • A machine learning system combines the evidence • A simple clustering algorithm identifies the final clusters
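The last two pipeline steps — scoring pairs and clustering — can be sketched with a toy union-find over high-scoring pairs. The talk does not specify the clustering algorithm, so this threshold-based scheme, the 0.5 cutoff, and the third entity ID are all illustrative assumptions.

```python
def cluster(pairs, threshold=0.5):
    """Group entity mentions whose pairwise co-reference score >= threshold.

    pairs: list of (entity_id_a, entity_id_b, score) triples.
    Returns a list of sets, one per cluster.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b, _ in pairs:                   # register every mention
        find(a), find(b)
    for a, b, score in pairs:               # union high-scoring pairs
        if score >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

pairs = [("CNN-264772-E32", "AFP-7373726-E19", 0.6543),   # score from the slide
         ("AFP-7373726-E19", "XIN-000000-E3", 0.10)]      # toy low-scoring pair
clusters = cluster(pairs)
```

Transitive merging falls out of union-find: if A matches B and B matches C above threshold, all three land in one cluster.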
Wikitology tagging • Using Serif’s output, we produced an entity document for each entity, including the entity’s name, nominal and pronominal mentions, APF type and subtype, and words in a window around the mentions • We tagged entity documents using Wikitology, producing vectors of (1) terms and (2) categories for the entity • We used the vectors to compute features measuring entity pair similarity/dissimilarity
Entity Document & Tags

Wikitology article tag vector:
Webster_Hubbell 1.000
Hubbell_Trading_Post_National_Historic_Site 0.379
United_States_v._Hubbell 0.377
Hubbell_Center 0.226
Whitewater_controversy 0.222

Wikitology category tag vector:
Clinton_administration_controversies 0.204
American_political_scandals 0.204
Living_people 0.201
1949_births 0.167
People_from_Arkansas 0.167
Arkansas_politicians 0.167
American_tax_evaders 0.167
Arkansas_lawyers 0.167

Entity document:
<DOC> <DOCNO>ABC19980430.1830.0091.LDC2000T44-E2</DOCNO> <TEXT>
Webb Hubbell PER Individual
NAM: "Hubbell" "Hubbells" "Webb Hubbell" "Webb_Hubbell"
NOM: "Mr." "friend" "income"
PRO: "he" "him" "his"
abc's accountant after again ago all alleges alone also and arranged attorney avoid been before being betray but came can cat charges cheating circle clearly close concluded conspiracy cooperate counsel counsel's department did disgrace do dog dollars earned eightynine enough evasion feel financial firm first four friend friends going got grand happening has he help him his hope house hubbell hubbells hundred hush income increase independent indict indicted indictment inner investigating jackie jackie_judd jail jordan judd jury justice kantor ken knew lady late law left lie little make many mickey mid money mr my nineteen nineties ninetyfour not nothing now office other others paying peter_jennings president's pressure pressured probe prosecutors questions reported reveal rock saddened said schemed seen seven since starr statement such tax taxes tell them they thousand time today ultimately vernon washington webb webb_hubbell were what's whether which white whitewater why wife years
</TEXT> </DOC>
Wikitology derived features • Seven features measured entity similarity using cosine similarity of various length article or category vectors • Five features measured entity dissimilarity: • two PER entities match different Wikitology persons • two entities match Wikitology tags in a disambiguation set • two ORG entities match different Wikitology organizations • two PER entities match different Wikitology persons, weighted by 1-abs(score1-score2) • two ORG entities match different Wikitology orgs, weighted by 1-abs(score1-score2)
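The cosine-similarity features above operate on sparse tag vectors; a minimal sketch, representing each vector as a {tag: weight} dict. The tags echo the Hubbell example, but the second entity and its weights are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Article tag vectors for two entity documents (second one is made up)
e1 = {"Webster_Hubbell": 1.0, "Whitewater_controversy": 0.222}
e2 = {"Webster_Hubbell": 0.9, "United_States_v._Hubbell": 0.4}
sim = cosine(e1, e2)
```

A high score suggests the two entity documents were tagged with the same Wikipedia concepts, evidence that they mention the same real-world entity.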
Challenges • Wikitology tagging is expensive • ~2 seconds/document on a single processor • Took ~24 hours on a cluster for 150K entity documents • A spreading activation algorithm over the underlying graphs improves accuracy, at even greater cost • Exploiting the RDF data and metadata and the underlying graphs requires reasoning and graph processing • Extracting entities from Wikipedia text to find more relations requires still more graph processing
Next Steps • Construct a Web-based API and demo system to facilitate experimentation • Process Wikitology updates in real time • Exploit machine learning to classify pages and improve performance • Better use of the cluster using Hadoop, etc. • Exploit Cell processor technology for spreading activation and other graph-based algorithms • e.g., recognize people by the graph of relations they are part of
Spreading activation example [Figure: four slides stepping spreading activation through a six-node weighted graph (edge weights such as 0.5, 0.8, 1.0). Starting from initial activations a0 = 0.9 and 0.3 on two nodes, each step computes a(t) = W · a(t-1), shown as a matrix-vector product alongside the graph.]
SA as matrix multiplication • Good news: SA is matrix multiplication • Model the graph as an n×n matrix W where Wij is the strength of the connection from node i to node j • A is a vector of length n, where Ai is node i’s activation • A(t) = W · A(t-1) • Bad news: n is huge • 140K category nodes and 4.2M edges • 2.9M articles and 50M edges • Good news: the matrices are sparse
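The update A(t) = W · A(t-1) over a sparse graph can be sketched in pure Python with an adjacency dict, which only touches the non-zero entries. The toy graph and weights below are assumptions, not the Wikitology graph.

```python
def spread(W, a, steps=1):
    """One or more spreading-activation steps: a(t) = W * a(t-1).

    W: sparse graph as {src: {dst: weight}}.
    a: sparse activation vector as {node: activation}.
    """
    for _ in range(steps):
        nxt = {}
        for src, act in a.items():
            for dst, w in W.get(src, {}).items():
                nxt[dst] = nxt.get(dst, 0.0) + w * act   # accumulate W[dst][src]*a[src]
        a = nxt
    return a

# Toy graph: node 1 feeds nodes 2 and 3, node 2 feeds 4, node 3 feeds 5
W = {1: {2: 1.0, 3: 0.5}, 2: {4: 0.5}, 3: {5: 0.8}}
a1 = spread(W, {1: 0.9, 2: 0.3})   # initial activations 0.9 and 0.3
```

Because both W and a are stored sparsely, each step costs time proportional to the edges incident to active nodes rather than to n², which is what makes the 50M-edge case tractable.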
Sparse Matrix-Vector Multiplication Exploiting parallelism for sparse matrix-vector multiplication (SpMV) has several challenges: • High storage overhead • Indirect and irregular memory access patterns • How to parallelize • Load balancing
Sparse Matrix Representation Compressed Sparse Row format (CSR) is a simple storage format Values: the non-zero values in the matrix Columns: column indices of the non-zero values Pointer B: index in Values of the first non-zero value of each row Pointer E: index in Values just past the last non-zero value of each row
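A minimal sketch of building the four CSR arrays named above (Values, Columns, Pointer B, Pointer E) from a dense matrix; the toy matrix is illustrative.

```python
def to_csr(dense):
    """Convert a dense row-major matrix (list of lists) to CSR arrays."""
    values, columns, ptr_b, ptr_e = [], [], [], []
    for row in dense:
        ptr_b.append(len(values))      # where this row's non-zeros start in Values
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)       # the non-zero value
                columns.append(j)      # its column index
        ptr_e.append(len(values))      # one past this row's last non-zero
    return values, columns, ptr_b, ptr_e

M = [[5, 0, 0],
     [0, 0, 3],
     [0, 2, 4]]
vals, cols, pb, pe = to_csr(M)
```

Row i's non-zeros are exactly the slice `values[pb[i]:pe[i]]`, so a process assigned a set of rows can multiply them without ever scanning the rest of the matrix.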
Thread Level Parallelism • Partition matrix rows among processors • Statically load balance SPMV by approximately equally distributing the non-zero values among processors/threads
Heuristic Load Balancing • Sort rows in decreasing order of number of non-zeros • Assign rows to processes/threads greedily: • Assign row #1 to process 0 • Assign each subsequent row to the process with the smallest total number of non-zeros so far • Guarantees that the difference in the number of non-zero values between any two processes/threads is at most the largest number of non-zeros in any single row
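The greedy heuristic above can be sketched with a min-heap keyed on each process's current load; the row sizes below are toy values.

```python
import heapq

def balance(row_nnz, n_procs):
    """Greedily assign rows to processes, largest rows first.

    row_nnz: {row_id: number of non-zeros in that row}.
    Returns {proc_id: [row_ids assigned to it]}.
    """
    assign = {p: [] for p in range(n_procs)}
    heap = [(0, p) for p in range(n_procs)]   # (total non-zeros, proc id)
    heapq.heapify(heap)
    # Rows in decreasing order of non-zero count
    for row, nnz in sorted(row_nnz.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)         # least-loaded process
        assign[p].append(row)
        heapq.heappush(heap, (load + nnz, p))
    return assign

row_nnz = {0: 9, 1: 7, 2: 4, 3: 3, 4: 1}     # toy row sizes
assign = balance(row_nnz, 2)
```

With the heap, each assignment costs O(log p), and the resulting loads differ by at most the largest single-row non-zero count, matching the guarantee stated above.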
Conclusion • Our initial experiments showed that the Wikitology idea has merit • Wikipedia is increasingly being used as a knowledge source of choice • Easily extended to other wikis and collaborative KBs, e.g., Intellipedia • Serious use requires exploiting cluster machines and Cell processors • Key processing of the associated graph data can exploit the Cell architecture