Text Mining Workshop of ACM SIGKDD

Speaker: Ping-Tsun Chang

Outline • SIGKDD Text Mining Workshop • Session: Mining Time-Tagged Text • Mining of Concurrent Text and Time Series • TimeMines: Constructing Timelines with Statistical Models of Word Usage • Session: Text Mining Applications • Mining E-mail Authorship

Mining of Concurrent Text and Time Series: Ænalyst • Predicting trends in stock prices based on the content of news stories that precede the trends • Two types of data • Financial time series • Time-stamped news stories • How to connect them? • Learn a language model for every trend type

Mining of Concurrent Text and Time Series: System Design • Time-series data (stock prices) → trends • Textual data (news articles) → relevant documents • Align trends with documents • Language model for each trend type • New document → likelihood that the document is from each model

Mining of Concurrent Text and Time Series: Redescribing Time Series • Identifying trends • Discretizing trends • This step is a subjective one, in which we assign labels to segments based on their characteristics • Length • Slope • Intercept • r²
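The segment characteristics listed on this slide (length, slope, intercept, r²) can be computed with an ordinary least-squares line fit. The following is a minimal sketch, not the authors' implementation: the function names, the label names "surge", "plunge" and "flat", and the slope threshold are all assumptions made for illustration.

```python
def fit_segment(prices):
    """Fit a least-squares line to one segment of a price series.

    The x-axis is the sample index 0, 1, ..., n-1 (an assumption).
    Returns the four characteristics named on the slide.
    """
    n = len(prices)
    if n < 2:
        raise ValueError("a segment needs at least two points")
    mean_x = (n - 1) / 2
    mean_y = sum(prices) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(prices))
    syy = sum((y - mean_y) ** 2 for y in prices)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r^2 is undefined for a constant segment; treat it as a perfect fit here.
    r2 = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return {"length": n, "slope": slope, "intercept": intercept, "r2": r2}


def label_trend(features, slope_threshold=0.5):
    """Discretize a segment into a label; the threshold is hypothetical."""
    if features["slope"] > slope_threshold:
        return "surge"
    if features["slope"] < -slope_threshold:
        return "plunge"
    return "flat"


features = fit_segment([1.0, 2.0, 3.0, 4.0])
label = label_trend(features)
```

Because the labelling step is subjective, the threshold would in practice be chosen by inspecting the data rather than fixed in advance.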

Mining of Concurrent Text and Time Series: Clustering • Agglomerative clustering
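Agglomerative clustering starts with every item in its own cluster and repeatedly merges the two closest clusters. The sketch below assumes that trends are represented as numeric feature tuples and that single linkage with Euclidean distance is used; the slide does not say which linkage or distance was actually chosen.

```python
import math


def agglomerative(points, k):
    """Merge the two closest clusters until only k clusters remain.

    Single linkage: the distance between two clusters is the distance
    between their closest pair of members (an assumed choice).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[p] for p in points]

    def linkage(a, b):
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical (slope, r^2-like) feature vectors for four trend segments.
groups = agglomerative([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```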

Mining of Concurrent Text and Time Series: Language Models (I) • A language model represents a discrete distribution over the words in the vocabulary
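A discrete distribution over the vocabulary can be estimated from word counts. This is a minimal sketch of a unigram model with additive smoothing; the smoothing scheme and the parameter `alpha` are assumptions, since the slide does not state how the probabilities were estimated.

```python
from collections import Counter


def unigram_model(documents, vocabulary, alpha=1.0):
    """Estimate P(word) over a fixed vocabulary with additive smoothing.

    documents: iterable of token lists. Tokens outside the vocabulary
    are ignored. alpha is a hypothetical smoothing constant.
    """
    counts = Counter(w for doc in documents for w in doc if w in vocabulary)
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


vocab = {"stock", "rises", "falls"}
model = unigram_model([["stock", "rises"], ["stock", "falls"]], vocab)
```

The returned probabilities sum to one over the vocabulary, which is what makes the model a distribution rather than a table of raw frequencies.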

Mining of Concurrent Text and Time Series: Language Models (II) • A language model can separate stories that are followed by a surge from stories that are not
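One way to separate the two kinds of stories is to score a new story under each trend's language model and pick the model with the highest likelihood. The sketch below assumes unigram models stored as dictionaries; the model names, the probabilities, and the `floor` value for unseen words are invented for illustration.

```python
import math


def log_likelihood(words, model, floor=1e-6):
    """Sum of log P(word) under a unigram model.

    Words missing from the model get a small floor probability
    (a hypothetical choice) so the logarithm stays defined.
    """
    return sum(math.log(model.get(w, floor)) for w in words)


def most_likely_trend(words, models):
    """Return the name of the model under which the story is most likely."""
    return max(models, key=lambda name: log_likelihood(words, models[name]))


# Hypothetical models with made-up probabilities.
models = {
    "surge": {"profit": 0.6, "loss": 0.1, "shares": 0.3},
    "no_surge": {"profit": 0.1, "loss": 0.6, "shares": 0.3},
}
predicted = most_likely_trend(["profit", "shares"], models)
```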

Mining of Concurrent Text and Time Series: Current Alignment • A document may be associated with more than one trend • It is possible for d2 to influence both trends t1 and t2.
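The many-to-many association between documents and trends can be illustrated with a simple time-window rule: a document is linked to every trend that begins shortly after the document's timestamp. This is only a sketch; the window length and the tuple layout are assumptions, not details taken from the slides.

```python
def align(documents, trends, window_hours=5.0):
    """Pair each document with every trend that starts within the window.

    documents: list of (doc_id, timestamp_in_hours)
    trends:    list of (trend_id, start_time_in_hours)
    window_hours is a hypothetical parameter.
    """
    pairs = []
    for doc_id, t_doc in documents:
        for trend_id, t_start in trends:
            # The document must precede the trend by at most window_hours.
            if 0 <= t_start - t_doc <= window_hours:
                pairs.append((doc_id, trend_id))
    return pairs


docs = [("d1", 1.0), ("d2", 4.0), ("d3", 20.0)]
trends = [("t1", 5.0), ("t2", 8.0)]
pairs = align(docs, trends)
```

With these made-up timestamps, d2 falls inside the window of both t1 and t2, which is the situation the slide describes.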

TimeMines: Constructing Timelines with Statistical Models of Word Usage • Automatically generates timelines from date-tagged free-text corpora • Constructs overviews of text corpora suitable for browsing using timelines • Identifies time-dependent features that mark important topics in text documents

TimeMines: System Overview • Process steps to discover features in text

TimeMines: The Model for Extracting Features • Stationary random model • The occurrence of a feature depends only on its base rate and does NOT vary with time. • The arrival of features is a random process with an unknown binomial distribution • Extracting features • Noun phrases and named entities • Label as a noun phrase any group of fewer than 6 words that matches the regular expression (NOUN|ADJECTIVE)*NOUN
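The noun-phrase rule can be applied to part-of-speech tags by encoding each tag as one character and running the regular expression over the resulting string. The sketch below assumes the text has already been tagged by some tagger; the tag names and the example sentence are hypothetical. It keeps only maximal matches of fewer than 6 words, which is one possible reading of the rule.

```python
import re

# (NOUN|ADJECTIVE)*NOUN over one-character tag codes: N = noun, A = adjective.
NP_PATTERN = re.compile(r"[AN]*N")


def noun_phrases(tagged, max_len=5):
    """Extract noun phrases from a list of (word, tag) pairs.

    Tags other than "NOUN" and "ADJECTIVE" are mapped to "O" so they
    break a phrase. Matches longer than max_len words are dropped.
    """
    codes = "".join(
        "N" if tag == "NOUN" else "A" if tag == "ADJECTIVE" else "O"
        for _, tag in tagged
    )
    phrases = []
    for m in NP_PATTERN.finditer(codes):
        if m.end() - m.start() <= max_len:
            phrases.append(" ".join(w for w, _ in tagged[m.start():m.end()]))
    return phrases


sentence = [
    ("the", "OTHER"), ("federal", "ADJECTIVE"), ("building", "NOUN"),
    ("was", "OTHER"), ("bombed", "OTHER"), ("in", "OTHER"),
    ("Oklahoma", "NOUN"), ("City", "NOUN"),
]
phrases = noun_phrases(sentence)
```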

TimeMines: Finding Significant Features • Many statistics can be used to characterize a 2×2 contingency table • EMIM: Expected Mutual Information Measure • KL: Kullback-Leibler divergence • χ²: Chi-square
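For a 2×2 contingency table, the chi-square statistic has a closed form. The sketch below assumes the table counts documents by whether they contain a feature and whether they fall inside a time window; that cell layout is an assumption made for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table

                      in window   outside window
        has feature       a             b
        no feature        c             d

    Returns 0.0 when a row or column total is zero, where the
    statistic is undefined.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


independent = chi_square_2x2(10, 10, 10, 10)
dependent = chi_square_2x2(20, 0, 0, 20)
```

With one degree of freedom, a value above roughly 3.84 corresponds to significance at the 0.05 level; which threshold the system actually applies is not stated on the slide.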

TimeMines: Grouping Significant Features • The assumption that two features f_j and f_k have independent distributions implies that P(f_k) = P(f_k | f_j)
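The independence assumption can be checked empirically by comparing the two probabilities on document counts: if they differ strongly, the features tend to co-occur and are candidates for grouping. This is a minimal sketch with made-up documents; representing each document as a set of features is an assumption.

```python
def independence_check(doc_features, fj, fk):
    """Return (P(fk), P(fk | fj)) estimated from document counts.

    doc_features: list of sets, one set of features per document.
    """
    n = len(doc_features)
    with_j = [d for d in doc_features if fj in d]
    if n == 0 or not with_j:
        raise ValueError("need at least one document containing fj")
    p_k = sum(fk in d for d in doc_features) / n
    p_k_given_j = sum(fk in d for d in with_j) / len(with_j)
    return p_k, p_k_given_j


# Hypothetical documents reduced to their significant features.
docs = [{"Oklahoma", "FBI"}, {"Oklahoma", "FBI"}, {"weather"}, {"sports"}]
p_k, p_k_given_j = independence_check(docs, "Oklahoma", "FBI")
```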

TimeMines: System Image • The pop-up window shows significant named entities such as Oklahoma, FBI, and Justice Department.

Mining E-mail Authorship • Authorship identification or categorisation of e-mail documents • E-mail document features • Structural characteristics • Linguistic evidence • Support Vector Machine

Mining E-mail Authorship: E-mail Document Body Attributes • Structural features • Pattern of vocabulary usage • Stylistic features • Sub-stylistic features

Mining E-mail Authorship: Experimental Results • SVMlight • F-measure with β = 1.0
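The F-measure combines precision and recall into one score, and β = 1.0 weights them equally. The sketch below uses the standard definition F_β = (1 + β²)·P·R / (β²·P + R); the precision and recall values in the usage lines are invented and are not results from the study.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    Returns 0.0 when the denominator is zero (for example, when both
    precision and recall are zero).
    """
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


balanced = f_measure(0.8, 0.8)   # equal precision and recall
skewed = f_measure(0.5, 1.0)     # high recall, low precision
```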
