Entities, Topics and Events in Community Memories. Elena Demidova, Nicola Barbieri, Stefan Dietze, Adam Funk, Gerhard Gossen, Diana Maynard, Nikos Papailiou, Vassilis Plachouras, Wim Peters, Thomas Risse, Yannis Stavrakas, and Nina Tahmasebi.
Entities, Topics and Events in Community Memories
Elena Demidova, Nicola Barbieri, Stefan Dietze, Adam Funk, Gerhard Gossen, Diana Maynard, Nikos Papailiou, Vassilis Plachouras, Wim Peters, Thomas Risse, Yannis Stavrakas, and Nina Tahmasebi
1st International Workshop on Archiving Community Memories
6 September 2013, Lisbon, Portugal
Architecture Overview
• Offline processing
• ETOEs extraction
• Semantic enrichment & consolidation
• Cross-crawl analysis
• Dynamics detection
Entity & Event Extraction from Text
• Development of applications that
  • identify document sections by language,
  • automatically select appropriate resources to process multilingual text (within as well as across documents),
  • handle different domains appropriately within single pipelines
• GATE applications are wrapped in the offline module
• Entity types: Person, Location, Organisation, …
• Cross-document co-reference within GATE
• Improved linguistic pre-processing for degraded text in tweets (joint development with the TrendMiner project)
• Improvements to event recognition, including use of low-scoring terms as event indicators
• Adaptation to German
Entity Enrichment and Correlation [SDA 12]
• Enrichment and correlation using DBpedia & Freebase
• DBpedia Spotlight: keyword search using entity labels with confidence 0.6
• Freebase: structured queries using ARCOMEM entity types
• FC data: 5,800 enriched entities (DBpedia: 492; Freebase: 5,309)
• Avg. precision 0.89 (between 0.8 and 1, depending on the entity type and source)
• RAR data: 19,429 enriched entities (DBpedia: 6,021; Freebase: 13,408)
Freebase Dataset
• Data: 22 million entities, 350 million facts
• Schema: 7,500 entity types in about 100 domains
• (June 2011)
• Wikipedia, MusicBrainz, …
ARCOMEM Entities and Enrichments - Graph
• Nodes: entities/events (blue), enrichments from DBpedia (green) and Freebase (orange)
• 1,013 clusters of correlated entities/events in FC => cluster expansion using related enrichments
Enrichment and Correlation: Clustering [WOLE 12]
• Direct correlations (entities sharing the same enrichments):
  • E.g. {Mexico, Mexiko, MEXIKO}, {Greece, Griechenland}
• #Clusters with at least 2 correlated entities: FC: 1,013; RAR: 1,381
• Exploit graph analysis methods to detect closeness of the enrichments
• Linking: e.g. related events with organisations and persons
• The Enrichment & Clustering component has been integrated in the offline processing and released.
• SARA integration: Enrichments: direct links to LOD entities; Clusters: finding similar (or related) entities
• Outlook: integration of indirect relationships, studying data quality aspects in LOD
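The direct-correlation step (entities that share an enrichment end up in one cluster) can be sketched as below. This is a minimal illustration, not the released component: the function name, the input shape (entity label mapped to a set of enrichment URIs) and the example URIs are assumptions, and union-find is a swapped-in technique that also merges entities linked only transitively through a third entity.

```python
from collections import defaultdict

def cluster_by_shared_enrichment(enrichments):
    """Group entity labels that share at least one enrichment URI.

    `enrichments` maps an entity label to a set of enrichment URIs
    (hypothetical input shape). Returns the clusters that contain
    at least two entities.
    """
    parent = {entity: entity for entity in enrichments}

    def find(x):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Invert the mapping: which entities point to each enrichment?
    by_uri = defaultdict(list)
    for entity, uris in enrichments.items():
        for uri in uris:
            by_uri[uri].append(entity)

    # Entities listed under the same URI are merged into one cluster.
    for entities in by_uri.values():
        for other in entities[1:]:
            union(entities[0], other)

    clusters = defaultdict(set)
    for entity in enrichments:
        clusters[find(entity)].add(entity)
    return [members for members in clusters.values() if len(members) >= 2]

# Illustrative data only; the label-to-URI mapping is assumed, not taken from the FC or RAR crawls.
example = {
    "Mexico": {"http://dbpedia.org/resource/Mexico"},
    "Mexiko": {"http://dbpedia.org/resource/Mexico"},
    "MEXIKO": {"http://dbpedia.org/resource/Mexico"},
    "Greece": {"http://dbpedia.org/resource/Greece"},
    "Griechenland": {"http://dbpedia.org/resource/Greece"},
    "Lisbon": {"http://dbpedia.org/resource/Lisbon"},
}
clusters = cluster_by_shared_enrichment(example)
```

Entities with no shared enrichment, such as "Lisbon" here, stay out of the result because only clusters with at least two members are returned.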
Topic Modeling on Rock am Ring
• Probabilistic topic models provide a suite of techniques to uncover the hidden semantic themes of a large collection of data
• Documents may exhibit multiple topics
• Each topic is described by a probability distribution over the dictionary
• Associate each topic with a list of representative documents and write them into the ARCOMEM KB
• Rock am Ring data:
  • 32,864 documents
  • Multilingual (English, German, etc.)
The Topic Detection module is based on Mahout's Collapsed Variational Bayes, which scales to very large datasets
Task 1: Topic Detection
Task 2: Assign Documents to Topics
Example topics (word, probability):
• Album 0.021, Metal 0.015, Songs 0.014, Band 0.013, Dj 0.007, Lyrics 0.004
• Rock 0.055, Am 0.050, Ring 0.042, Festival 0.009, Tickets 0.003
• Fashion 0.003, Collection 0.003, Food 0.003, Style 0.003, Color 0.002
• Page 0.007, Site 0.005, Web 0.005, Click 0.004, Link 0.004
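Task 2 (assigning documents to topics and picking representative documents per topic) can be sketched as follows. This is a hedged illustration that assumes the topic model has already produced one topic distribution per document; the function names, document ids and weights are hypothetical and do not come from the Rock am Ring data or from Mahout.

```python
def representative_documents(doc_topics, top_k=2):
    """For each topic index, return the `top_k` document ids with the highest weight.

    `doc_topics` maps a document id to its topic distribution (a list of
    weights, one per topic), as any topic model could provide.
    """
    n_topics = len(next(iter(doc_topics.values())))
    result = {}
    for topic in range(n_topics):
        ranked = sorted(doc_topics, key=lambda doc: doc_topics[doc][topic], reverse=True)
        result[topic] = ranked[:top_k]
    return result

def dominant_topic(distribution):
    """Index of the topic with the largest weight in one document's distribution."""
    return max(range(len(distribution)), key=distribution.__getitem__)

# Hypothetical per-document topic distributions (each row sums to 1).
doc_topics = {
    "doc_a": [0.80, 0.15, 0.05],
    "doc_b": [0.10, 0.70, 0.20],
    "doc_c": [0.45, 0.45, 0.10],
    "doc_d": [0.05, 0.05, 0.90],
}
reps = representative_documents(doc_topics, top_k=2)
```

Because documents may exhibit multiple topics, a document such as `doc_c` can appear in the representative list of more than one topic.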
Temporal Evolution in Topic Modeling [Mantrach 13]
Topics may evolve and emerge over time
• Several challenges:
  • Tracking the evolution of topics
  • Early detection of emerging topics
  • Prediction of trendy topics
Trendy Topic Detection
• Understanding what was trending at a specific time in the past
• Detect events/entities/words that are popular in a time frame
[Figure: processing pipeline with components labelled HBase, POS, Named Entity Rec., Tokens, Trendy Tf-Idf, Ranked List]
• Compute trendiness:
  • The term frequency in a period is penalized with the average term frequency over other time periods
  • Tokens that are popular in all time periods are down-weighted
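The trendiness idea above can be sketched as a small scoring function. The slide does not give the exact formula, so the ratio used here (term frequency in the target period divided by one plus the mean term frequency in the other periods) is an assumption chosen only to show the penalty; the function name, period labels and tokens are hypothetical.

```python
from collections import Counter

def trendiness(tokens_by_period, target):
    """Rank the tokens of the `target` period by an assumed trendiness score.

    score = tf(target) / (1 + mean tf over all other periods)
    Tokens that are frequent in every period get a large denominator
    and are therefore down-weighted.
    """
    counts = {period: Counter(tokens) for period, tokens in tokens_by_period.items()}
    others = [period for period in counts if period != target]
    scores = {}
    for token, tf in counts[target].items():
        if others:
            avg_other = sum(counts[p][token] for p in others) / len(others)
        else:
            avg_other = 0.0
        scores[token] = tf / (1.0 + avg_other)
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical token streams for three time periods.
tokens_by_period = {
    "p1": ["rock", "ring", "tickets", "rock"],
    "p2": ["rock", "ring", "lineup", "lineup", "lineup"],
    "p3": ["rock", "ring", "rock"],
}
ranked = trendiness(tokens_by_period, "p2")
```

In this toy input, "rock" and "ring" occur in every period and end up below "lineup", which only occurs in the target period.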
Twitter Dynamics [WOSS 12]
• Motivation – being able to pose questions like:
  • “What are the hashtags associated with #obama at time t?”
  • “Find tweets that mention #cnn during the periods that #obama is associated with #romney”
  • “How have the hashtags associated with #obamawins evolved over time?”
  • “Find tweets that mention #romney during the peak periods of #obama”
• Designed a model that takes the temporal aspect into account when associating hashtags in tweets (e.g. based on co-occurrence)
• Implemented query operators for retrieving the tweets that satisfy complex conditions: filter, fold, jump, merge, join
• Implemented a prototype system
• Experiments with 25,000 tweets about the US elections
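The first question above ("which hashtags are associated with #obama at time t?") can be approximated with per-period co-occurrence counts. The sketch below is an illustration under assumptions: it does not reproduce the model or the filter/fold/jump/merge/join operators from [WOSS 12], and the function names, integer period ids and example tweets are hypothetical.

```python
import re
from collections import Counter, defaultdict

HASHTAG = re.compile(r"#\w+")

def cooccurrence_by_period(tweets):
    """Count hashtag co-occurrences separately for each time period.

    `tweets` is an iterable of (period, text) pairs. The result maps
    period -> hashtag -> Counter of hashtags seen in the same tweet.
    """
    table = defaultdict(lambda: defaultdict(Counter))
    for period, text in tweets:
        # Lower-case so that #Obama and #obama count as the same hashtag.
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        for tag in tags:
            for other in tags:
                if other != tag:
                    table[period][tag][other] += 1
    return table

def associated(table, tag, period, min_count=1):
    """Hashtags co-occurring with `tag` at least `min_count` times in `period`."""
    counts = table.get(period, {}).get(tag, Counter())
    return sorted(other for other, count in counts.items() if count >= min_count)

# Hypothetical tweets, with time discretised into integer periods.
tweets = [
    (1, "Debate night #obama #romney"),
    (1, "#obama speaks on #cnn"),
    (2, "Results are in #obama #obamawins"),
    (2, "#Obama #obamawins #cnn"),
]
table = cooccurrence_by_period(tweets)
period_1 = associated(table, "#obama", 1)
period_2_strong = associated(table, "#obama", 2, min_count=2)
```

Comparing the result of `associated` across periods gives a crude view of how the associations of a hashtag change over time.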
Named Entity Evolution [TPDL 12]
Named Entities (NE): people, places, companies...
Characteristics of Named Entity Evolution (NEE)
• Same thing but different terms over time
• Change occurs over short periods of time
• Small or no concept shift
• Announced to the public repeatedly
Goal: Find a method for named entity evolution recognition that is independent of external knowledge sources
[Figure: terms used for the same entity around the change period: Joseph Ratzinger, Pope Benedict, Pope Benedict XVI, Benedict XVI, Joseph Aloisius Ratzinger, Cardinal Ratzinger, Cardinal Joseph Ratzinger]
Named Entity Evolution Recognizer (NEER) [NEER Coling 12]
[Figure: processing chain]
Example terms:
• Barack Obama: Senator, State Senator Barack Obama, Senator-elect Barack Obama, Senator Barack Obama, Illinois Democrat
• Vladimir Putin: President-elect Vladimir V Putin, Minister Vladimir Putin, Acting President Vladimir V Putin, President Vladimir V Putin
Evaluation Results
• Burst detection found 73% of all change periods in total
• High recall for an unsupervised method
• Machine learning boosts precision
• Data set: http://www.l3s.de/neer-dataset/
FOKAS – Formerly Known As Search Engine [FOKAS Coling 12] http://www.l3s.de/fokas/
References
[SDA 12] Dietze, S., Maynard, D., Demidova, E., Risse, T., Peters, W., Doka, K., Stavrakas, Y., Entity Extraction and Consolidation for Social Web Content Preservation, 2nd SDA Workshop, Pafos, 2012.
[WOLE 12] Nunes, B. P., Kawase, R., Dietze, S., Taibi, D., Casanova, M.A., Nejdl, W., Can entities be friends?, Proc. of WOLE2012 Workshop at the ISWC2012, Boston, US (2012).
[KECSM 12] Maynard, D., Dietze, S., Hare, J., Peters, W., (Eds.), Proc. of the 1st KECSM Workshop at the ISWC2012, CEUR Workshop Proceedings Vol. 895, 2012.
[TPDL 12] Risse, T., Dietze, S., Peters, W., Doka, K., Stavrakas, Y., Senellart, P., Exploiting the Social and Semantic Web for guided Web Archiving, TPDL2012, Pafos, Cyprus, September 2012.
[ICDM 12] Nicola Barbieri, Francesco Bonchi and Giuseppe Manco. Topic-aware Social Influence Propagation Models. Proc. of the ICDM 2012, Brussels, Belgium, December 2012.
[WSDM 13] Nicola Barbieri, Francesco Bonchi and Giuseppe Manco. Cascade-Based Community Detection. Proc. of the WSDM 2013, Rome, Italy, February 2013.
[NEER Coling 12] Nina Tahmasebi, Gerhard Gossen, Nattiya Kanhabua, Helge Holzmann, Thomas Risse, NEER: An Unsupervised Method for Named Entity Evolution Recognition. Coling 2012, Mumbai.
[FOKAS Coling 12] Helge Holzmann, Gerhard Gossen, Nina Tahmasebi, fokas: Formerly Known As -- A Search Engine Incorporating Named Entity Evolution, Proc. of the Coling 2012, Mumbai, India.
[WOSS 12] Vassilis Plachouras and Yannis Stavrakas. Querying Term Associations and their Temporal Evolution in Social Data. Int. VLDB Workshop on Online Social Systems (WOSS 2012).
[ICMR 12] Hare, Jonathon, Samangooei, Sina, Dupplaw, David and Lewis, Paul H. ImageTerrier: an extensible platform for scalable high-performance image retrieval. ACM ICMR'12, Hong Kong, HK.
[MTA12] Hare, Jonathon S., Samangooei, Sina and Lewis, Paul H. (2012) Practical scalable image analysis and indexing using Hadoop. Multimedia Tools and Applications, 1-34.
[Mantrach 13] Amin Mantrach. A Joint Past and Present NMF for Topic Detection and Transitions in Social Media; Subm. 13
Thank You!
Dr. Elena Demidova
demidova@L3S.de
L3S Research Center
Appelstrasse 9a
30167 Hannover