170 likes | 302 Views
Approximate subgraph matching for detection of topic variations. Mitja Trampu š Dunja Mladenić AI Lab, Jožef Stefan Institute , Slovenia. Mining Diversity. Web content varies in many aspects , e.g. Topical Social (author, target audience, people written about)
E N D
Approximate subgraph matching for detection of topic variations MitjaTrampuš Dunja Mladenić AI Lab, Jožef Stefan Institute, Slovenia
Mining Diversity • Web content varies in many aspects, e.g. • Topical • Social (author, target audience, people written about) • Geographical (publisher, places written about) • Sentiment (positive/negative) • Writing style (structure, vocabulary) • Coverage bias • This work: (micro-)topical diversity • Macroscopic = largely solved • Microscopic = challenge
Task: Given a collection of texts on a topic, • identify a common template • align texts to the template
Template representation • Syntactic • info1: X people were killed / killed X people / resulted in X casualties • info2: blew up Y / destroyed Y / attacked in a Y • Semantic • kill(bomber, people); count(people, X) • destroy(bomber, Y) X count Y bomber people destroy kill
car drive 12 count church suicidebomber believers destroy kill exit vest 2 run wear count policestation bomber policeofficer blow up slaughter Prerequisite: Semantic Graph treatment attack execute 100 receive withstand count hospital terrorist patients demolish kill
Mining Templates • Template := subgraph with frequent specializations • Specializations implied by background taxonomy (WordNet) • Threshold frequency manually defined 2 12 teardown count count policestation church suicidebomber bomber believers policeofficer blow up slaughter kill X count building bomber people destroy kill
Semantic Graph Construction • Data: Google News crawl • HTML cleanup • Named entity tagging • Pronoun resolution (he/she/him/her) • Named entity consolidation (Barack Obama vs President Obama) • Parsing, triple/fact/assertion extraction(for now: subj-verb-obj only) • Ontology/taxonomy alignment • Merging triples into a graph
Approximate subgraph matching number number 2 12 number count count teardown count count destroy slaughter blow up count location location church policestation location suicidebomber person bomber person person policeofficer believers people people people destroy destroy kill kill kill kill FREQUENT SUBTREE MINING GENERALIZE
Approximate subgraph matching 12 2 number number teardown count count count count policestation building church location suicidebomber bomber person bomber policeofficer people people believers destroy blow up destroy slaughter kill kill kill SPECIALIZE
Preliminary results • 5 test domains; for each: • ~10 graphs, ~10000 nodes • 10-60 seconds • At min. support 30% • 20 maximal patterns, 9 manually judged as interesting
Conclusion • Future work: • Mapping text -> semantics • Other ontologies? • Interestingness measure for assertions and patterns • Evaluation (precision, recall; multiple domains) • Alternative approaches to generalizing subgraphs • Template extraction is achievable, but not easy • Human filtering of results hard to avoid • Current approach reasonably fast
Q? Thank you.
Can we extract all relations?Kind of … Thousands of small quakes resumed 18 months ago and continue to rattle Mammoth Lakes, June Lake and other Mono County resort towns. The temblors, most measuring 1 to 3 on the Richter scale, started beneath Mammoth Mountain. Subject – Verb – Object Triplets Semantic Graph
Templates - why • Interpret content • news archives: structure/annotate old texts, enable semantic search • wikipedia: suggestions for infobox entries • Generate content • wikipedia: a starting point for new articles / a checklist of information to be included • No normative definition of “good template”
Evaluation • Qualitative • Usage-specific • Not useful for tuning algorithms • Quantitative • Precision • Recall