Where does this new information belong? From developing mining algorithms to supporting knowledge discovery
Bettina Berendt
Thanks for joint work with and support from Ilija Subašić, Mathias Verbeke, Siegfried Nijssen, Luc De Raedt
K.U. Leuven
Yes we can! The problem
The solution? Automatic topic detection
Health 0.017
Care 0.015
Insurance 0.013
American 0.013
Uninsured 0.009
Families 0.008
Working 0.005
Same event/document; different interpretations & categorisations
• Visionary president
• Damp-rag president
• Rhetorics
• Party-politics (right and left)
• Obama’s overall agenda
Similar problems in science and learning
Topic detection in time-indexed corpora of news texts
Conference programme: Text mining • Stream mining • Media studies !
Similar problems in other areas
Music collections, multimedia collections: see Andreas Nürnberger’s talk at SML 2010
The solution? Context-aware systems / personalisation
• Political activist
• Female
• Has problems with anger management
You probably do / should think about it this way: ...
What users want
• ... to structure the world how they see it: interactivity
• ... to re-use their categories (that they worked so hard to find): semantics
• ... to acknowledge that others see the world differently: social similarity / diversity
• ... to be able to see through their eyes: perspective-taking
• ... to provide data mining methods to do all that!
[Figure labels: left / right; squares / circles; green / not green; is (nearly) green]
Research agenda
The problem: automatic topic detection
Support sense-making = provide methods / tools for Knowledge Discovery (in the full sense)
• interactivity
• semantics
• Social similarity / diversity
• perspective-taking
... to provide data mining methods to do all that!
Research agenda
The problem: automatic topic detection
Our solution approach: support sense-making = provide methods / tools for Knowledge Discovery (in the full sense)
• interactivity
• semantics
• Social similarity / diversity
• perspective-taking
... to provide data mining methods to do all that!
STORIES: mining basics (1)
Graphical summarisation of multiple text documents
Document / text pre-processing
• Template recognition
• Multi-document named entities
• Stopword removal, lemmatization
Document summarization strategy
• no topics, but salient concepts & relations
• time window; word-span window
Selection approach for concepts
• concepts = words or named entities
• salient concept = high TF & involved in a salient relation, time-indexed
Similarity measure to determine salient relations
• “fact (assertion) recognition”
• bursty co-occurrence
Burstiness measure
• time relevance, a “temporal co-occurrence lift”
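One way to read the “temporal co-occurrence lift” is as a ratio: how often two concepts co-occur in the documents of one period, relative to how often they co-occur in the whole corpus. The sketch below implements that reading only; the slide does not give the formula, so the ratio, the function names and the `(period_label, tokens)` input format are assumptions made for illustration.

```python
from collections import Counter


def cooccurring_pairs(tokens, window=5):
    """Unordered pairs of distinct concepts that fall inside a span of `window` tokens."""
    pairs = set()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window]:
            if left != right:
                pairs.add(frozenset((left, right)))
    return pairs


def time_relevance(docs, period, window=5):
    """Assumed lift: (share of period docs containing the pair)
    divided by (share of all docs containing the pair).

    docs: list of (period_label, token_list) tuples (hypothetical input format).
    Returns {frozenset({c1, c2}): lift} for pairs seen in `period`.
    """
    in_period = Counter()
    overall = Counter()
    n_period = 0
    for label, tokens in docs:
        pairs = cooccurring_pairs(tokens, window)
        overall.update(pairs)
        if label == period:
            n_period += 1
            in_period.update(pairs)
    if n_period == 0:
        return {}
    n_all = len(docs)
    return {
        pair: (in_period[pair] / n_period) / (overall[pair] / n_all)
        for pair in in_period
    }
```

Under this reading, a value above 1 means the pair is more frequent in the period than in the corpus as a whole, which makes it a candidate for a bursty, salient relation.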
STORIES: mining basics (2)
Graph analysis for query recommendation
Aim: highlight subgraphs that represent an event
• Topological properties
• Change: Subgraph new in this period
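The “subgraph new in this period” criterion can be illustrated by comparing the edge sets of two consecutive story graphs and grouping the new edges into connected components. This is a minimal sketch under that assumption, not the STORIES implementation; the edge-list representation and the function name are hypothetical.

```python
def new_subgraphs(prev_edges, curr_edges):
    """Return the connected components formed by edges that appear in the
    current period but not in the previous one.

    Edges are undirected (node_a, node_b) pairs without self-loops.
    The result is a list of node sets, ordered by their smallest node name.
    """
    def normalise(edges):
        return {tuple(sorted(edge)) for edge in edges}

    fresh = normalise(curr_edges) - normalise(prev_edges)

    # Build an adjacency map over the new edges only.
    adjacency = {}
    for a, b in fresh:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    # Depth-first search to collect connected components.
    components, seen = [], set()
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return sorted(components, key=min)
```

Each returned node set could then be highlighted in the graph or offered as a candidate query.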
STORIES: evaluation
• Information retrieval quality
  • Edges – events: up to 80% recall, ca. 30% precision
• Search quality
  • Subgraphs index coherent document clusters
• Learning effectiveness
  • Document search with story graphs leads to averages of
    • 67-75% accuracy on judgments of story fact truth
    • 1.3-4.7 queries with 3.4-5.2 nodes/words per query
• Comparison with other temporal text mining methods
  • New (and only) framework for cross-method comparison
  • Recall- and precision-style metrics lead to different method rankings
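The edge-level recall and precision figures compare the edges of a story graph with ground-truth events. A small sketch of how such scores are computed, assuming (this is not stated on the slide) that both the graph edges and the ground-truth events are represented as unordered concept pairs:

```python
def edge_precision_recall(story_edges, event_pairs):
    """Set-based precision and recall of graph edges against ground-truth pairs.

    Both arguments are iterables of (concept_a, concept_b) pairs; order within
    a pair is ignored. Returns (precision, recall).
    """
    found = {frozenset(edge) for edge in story_edges}
    truth = {frozenset(pair) for pair in event_pairs}
    hits = len(found & truth)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall
```

A graph with many edges can reach high recall at low precision, which is the pattern the figures above show.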
Damilicious: functionality basics
* Apply my grouping rfid (Security/privacy, Group 2, ...) to the following new search result:
* Show users and how similarly they group
* Apply U4’s grouping to my new search result:
Damilicious: mining basics (1)
Methods and process
• Query
• Automatic clustering
• Manual regrouping
• Re-use
• Learn classifier & present way(s) of grouping
• Transfer the constructed concepts
Features/methods for the conceptual/predictive clustering:
• Lingo phrases, Lingo clustering, Ripper
• co-citation, bibliometric coupling, word or LSA similarity, combinations; k-means, hierarchical
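The “learn classifier, then transfer” step can be illustrated with a deliberately simple stand-in: a nearest-centroid classifier over bag-of-words vectors with cosine similarity. This is not Ripper and not the Lingo-based features named above; it only sketches the idea that a user's grouping becomes a model that can be applied to a new search result. All function names and the input format are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-cased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def learn_grouping(grouped_docs):
    """grouped_docs: {group_label: [document text, ...]}.
    Returns one summed bag-of-words centroid per group."""
    centroids = {}
    for label, texts in grouped_docs.items():
        centroid = Counter()
        for text in texts:
            centroid.update(bag_of_words(text))
        centroids[label] = centroid
    return centroids


def apply_grouping(centroids, new_docs):
    """Assign each new document to the group with the most similar centroid."""
    return {
        doc: max(centroids, key=lambda label: cosine(centroids[label], bag_of_words(doc)))
        for doc in new_docs
    }
```

In this sketch, applying another user's grouping to one's own search result amounts to calling `apply_grouping` with the centroids learned from that user's groups.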
Damilicious: mining basics (2)
Measures of grouping and user diversity
Diversity = 1 - similarity = 1 - normalized mutual information (entropy-based measure)
• “How similarly do two users group documents?”
  • For each query q, consider their groupings gr:
  • For several queries: aggregate
[Figure label: NMI = 0]
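The diversity measure can be made concrete with a short computation of normalized mutual information between two users' groupings of the same documents. The slide does not say which normalisation or which aggregation over queries is used; the sketch below assumes normalisation by the arithmetic mean of the two entropies and a plain average over queries, and its function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a list of group labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(grouping_a, grouping_b):
    """Normalized mutual information between two groupings.

    Each grouping lists the group label of every document, in the same
    document order. Assumed normalisation: mean of the two entropies.
    """
    n = len(grouping_a)
    count_a, count_b = Counter(grouping_a), Counter(grouping_b)
    joint = Counter(zip(grouping_a, grouping_b))
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    denominator = (entropy(grouping_a) + entropy(grouping_b)) / 2
    # Convention chosen here: two single-group groupings count as identical.
    return mutual_info / denominator if denominator > 0 else 1.0


def user_diversity(groupings_a, groupings_b):
    """Assumed aggregation: mean of (1 - NMI) over the queries both users grouped.

    groupings_a, groupings_b: {query: list of group labels}.
    """
    shared = [q for q in groupings_a if q in groupings_b]
    return sum(1 - nmi(groupings_a[q], groupings_b[q]) for q in shared) / len(shared)
```

Two users who split four documents in statistically independent ways get NMI = 0 and therefore diversity 1; identical groupings, even with different group names, get NMI = 1 and diversity 0.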
Damilicious: evaluation
• Clustering: Does it generate meaningful document groups?
  • yes (tradition in bibliometrics), but: data?
  • Small expert evaluation of CiteseerCluster
• Choosing the clustering and classification methods for conceptual clustering
  • Experiments: different features, clustering methods, classification methods; evaluated by quality of reconstruction and extension-over-time (NMI)
• Technology acceptance
  • End-user experiment (clustering & regrouping)
  • 5-person formative user study (transfer of own results)
Conclusions and (some) questions
• Sense-making involves
  • Extracting information from texts
  • Extracting structural information between entities
  • Creating, using and modifying categories
  • Interacting with external representations
  • Acknowledging diversity and perspective-taking
  • ...
• Appropriate mining methods, measures, ...?
• More/better evaluation methods and frameworks?
• Use cases?
KD approach
• Text mining
• Graph mining
• Semantics
• Interactivity
• Usage mining and “model-processing” (conceptual / predictive clustering)
Thank you! Questions?
To Read
• Subašić, I. & Berendt, B. (2009). Discovery of interactive graphs for understanding and searching time-indexed corpora. Knowledge and Information Systems. DOI 10.1007/s10115-009-0227-x
• Berendt, B. & Subašić, I. (2009). STORIES in time: a graph-based interface for news tracking and discovery. In N. Cristianini & M. Turchi (Eds.), Proceedings of Intelligent Analysis and Processing of Web News Content (IAPWNC) at The 2009 IEEE/WIC/ACM International Conferences Web Intelligence (WI'09) / Intelligent Agent Technology (IAT'09). 15 September 2009, Milan, Italy. (Proceedings of WI-IAT 2009, DOI 10.1109/WI-IAT.2009.342, pp. 531-534)
• Verbeke, M., Berendt, B., & Nijssen, S. (2009). Data mining, interactive semantic structuring, and collaboration: A diversity-aware method for sense-making in search. In G. Boato & C. Niederee (Eds.), Proceedings of First International Workshop on Living Web, collocated with the 8th International Semantic Web Conference (ISWC-2009), Washington D.C., USA, October 26, 2009. CEUR Workshop Proceedings Vol-515.
• Berendt, B. (2010). Diversity in search: what, how, and what for? Talk at Barcelona Media / Yahoo! Research and UPF, 4 March 2010.
• Berendt, B., Krause, B., & Kolbe-Nusser, S. (2010). Intelligent scientific authoring tools: Interactive data mining for constructive uses of citation networks. Information Processing & Management, 46(1), 1-10.