Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis Shuai Yuan @ Emory
Semantic relatedness Association between (two) texts according to background knowledge. • Example: • “Cat” <-> “mouse” • “Preparing a manuscript” <-> “writing an article” • “Chair” <-> “Sun”
Why does it matter? Association of ideas is important for artificial intelligence. It helps to extract meaningful information from texts.
What is ESA? Explicit Semantic Analysis. • Explicitly represents the meaning of any text in terms of Wikipedia-based concepts • What about the implicit alternative? It relies on “latent concepts”, as in Latent Semantic Analysis
Indices • Direct index • Concept -> words • Inverted index • Word -> concepts
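The relation between the two indices can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function name `invert_index` and the toy concepts and words are assumptions made up for the example.

```python
from collections import defaultdict

def invert_index(direct_index):
    """Turn a concept -> words mapping into a word -> concepts mapping."""
    inverted = defaultdict(set)
    for concept, words in direct_index.items():
        for word in words:
            inverted[word].add(concept)
    # Sort the concept names so the output is deterministic.
    return {word: sorted(concepts) for word, concepts in inverted.items()}

# Toy direct index; concept names and words are invented for illustration.
direct_index = {
    "Cat": ["cat", "pet", "mouse"],
    "Computer mouse": ["mouse", "button"],
}
inverted_index = invert_index(direct_index)
```

Looking up `inverted_index["mouse"]` then returns every concept whose article mentions the word, which is the lookup the semantic interpreter needs.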
Building Semantic Interpreter(2) • For a single word • Use TF-IDF to weight the word's association with every concept • Discard insignificant associations.
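The single-word step above can be sketched as follows. The exact TF-IDF variant and the pruning rule are not given on the slide, so the formula (term frequency times log inverse document frequency), the `min_weight` threshold, and the name `build_interpreter` are all assumptions for illustration.

```python
import math
from collections import Counter

def build_interpreter(concept_texts, min_weight=0.12):
    """Map each word to {concept: TF-IDF weight}, dropping weak associations."""
    n_concepts = len(concept_texts)
    term_counts = {c: Counter(t.lower().split()) for c, t in concept_texts.items()}

    # Document frequency: in how many concept articles each word occurs.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    interpreter = {}
    for concept, counts in term_counts.items():
        total = sum(counts.values())
        for word, count in counts.items():
            # Assumed TF-IDF variant: relative frequency * log(N / df).
            weight = (count / total) * math.log(n_concepts / doc_freq[word])
            if weight >= min_weight:  # assumed pruning rule
                interpreter.setdefault(word, {})[concept] = weight
    return interpreter

# Toy "articles", invented for illustration.
concept_texts = {
    "Cat": "cat cat mouse pet",
    "Computer mouse": "mouse button computer",
    "Sun": "sun star light",
}
interpreter = build_interpreter(concept_texts)
```

With this toy data the weak association between "mouse" and the "Cat" article falls below the threshold and is discarded, while "mouse" stays linked to "Computer mouse".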
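Building Semantic Interpreter(4) • For a text fragment • Each word Wi has a TF-IDF weight Vi in the fragment and a vector of concept weights Kij
TF-IDF  Word  Concepts
V1      W1    C1:K11 C2:K12 C3:K13
V2      W2    C1:K21 C2:K22 C3:K23
V3      W3    C1:K31 C2:K32 C3:K33
Result        C1:K1  C2:K2  C3:K3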
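One way to read the table: the fragment's weight for concept Cj is the sum over its words of Vi * Kij. The sketch below assumes that combination rule and, for simplicity, uses raw word counts in place of the TF-IDF weights Vi; `interpret_text` and the toy vectors are hypothetical names and values.

```python
from collections import Counter, defaultdict

def interpret_text(text, word_vectors):
    """Combine per-word concept vectors into one vector for the fragment."""
    # Assumption: raw counts stand in for the fragment-level weights V_i.
    word_weights = Counter(text.lower().split())
    result = defaultdict(float)
    for word, v in word_weights.items():
        for concept, k in word_vectors.get(word, {}).items():
            result[concept] += v * k  # K_j = sum_i V_i * K_ij
    return dict(result)

# Toy per-word concept vectors (the K_ij values), invented for illustration.
word_vectors = {
    "w1": {"C1": 0.5, "C2": 0.2},
    "w2": {"C1": 0.1, "C3": 0.4},
}
vector = interpret_text("w1 w2 w2", word_vectors)
```

Words missing from `word_vectors` simply contribute nothing, which mirrors the earlier removal of stop words and rare terms.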
Concepts vector comparison. • Cosine metric.
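A minimal sketch of the cosine metric on sparse concept vectors, stored here as dictionaries from concept name to weight. The dictionary representation and the function name `cosine` are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {concept: weight}."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

same_direction = cosine({"C1": 1.0, "C2": 2.0}, {"C1": 2.0, "C2": 4.0})
no_overlap = cosine({"C1": 1.0}, {"C2": 1.0})
```

Two texts whose concept vectors point the same way score close to 1, and texts with no shared concepts score 0.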
Implementation(1) • Using Wikipedia • Wikipedia snapshot as of March 26, 2006 • Parsing the Wikipedia XML dump: 2.9 GB of text in 1,187,839 articles • Removing small and overly specific concepts: 241,393 articles • Removing stop words and rare words: 389,202 distinct terms
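The last preprocessing step, removing stop words and rare words, might look like the sketch below. The stop-word list, the `min_doc_count` cutoff, and the name `build_vocabulary` are assumptions; the slide does not say which thresholds were used.

```python
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "is"}

def build_vocabulary(documents, min_doc_count=2):
    """Keep words that are not stop words and occur in enough documents."""
    doc_freq = Counter()
    for text in documents:
        doc_freq.update(set(text.lower().split()))
    return {
        word
        for word, n in doc_freq.items()
        if n >= min_doc_count and word not in STOP_WORDS
    }

# Toy documents, invented for illustration.
docs = [
    "the cat chased the mouse",
    "a mouse is a rodent",
    "the cat is a pet",
]
vocabulary = build_vocabulary(docs)
```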
Implementation(2) • Using Open Directory Project (ODP, http://www.dmoz.org) • ODP snapshot as of April 2004 • Pruning non-English material: 436 MB, 400k concepts and 2.8M URLs • Crawling all of its URLs: 70 GB of additional textual data • Removing stop words and rare words: 20,700,000 distinct terms
Evaluation • The “gold standard” -- Human judgements
Human evaluation of word relatedness • WordSimilarity-353 collection containing 353 word pairs. • Human raters (13-16 for each word pair). • Ratings averaged into a single relatedness score for each pair.
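Averaging the individual ratings into one gold-standard score per pair is a one-liner. The ratings below are invented purely to show the shape of the data and are not taken from WordSimilarity-353; `gold_standard` is a hypothetical helper name.

```python
from statistics import mean

def gold_standard(ratings):
    """Average all human ratings for each pair into a single score."""
    return {pair: mean(scores) for pair, scores in ratings.items()}

# Invented ratings for illustration only.
ratings = {
    ("cat", "mouse"): [7, 8, 6],
    ("chair", "sun"): [1, 0, 2],
}
scores = gold_standard(ratings)
```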
Human evaluation of doc similarity • 50 documents from the ABC’s news mail service. • Docs paired in all possible ways (how many pairs?) • Human raters (8-12 for each doc pair). • Ratings averaged into a single relatedness score for each pair. • 1225 = 50*49/2
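The pair count can be checked by enumerating all unordered pairs of 50 documents. The integer document ids are placeholders assumed for the example.

```python
from itertools import combinations

doc_ids = range(50)  # placeholder ids for the 50 documents
pairs = list(combinations(doc_ids, 2))  # every unordered pair, no self-pairs
pair_count = len(pairs)
```

`pair_count` equals 50*49/2, the figure given on the slide.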
Conclusion • Explicit Semantic Analysis is a novel approach to computing semantic relatedness of natural language texts with the aid of large-scale knowledge repositories (Wikipedia and the ODP). • Results are good!
Q&A Thank you!