Text Mining Workshop of ACM SIGKDD • Speaker: Ping-Tsun Chang
Outline • SIGKDD: Text Mining Workshop: • Session: Mining Time-Tagged Text • Mining of Concurrent Text and Time Series • TimeMines: Constructing Timelines with Statistical Models of Word Usage • Session: Text Mining Applications: • Mining E-mail Authorship
Mining of Concurrent Text and Time Series: Ænalyst • Predicting trends in stock prices based on the content of news stories that precede the trends • Two types of data • Financial time series • Time-stamped news stories • How to connect them? • Learn a language model for every trend type
Mining of Concurrent Text and Time Series: System Design • [Diagram] Time-series data (stock prices) → trends • Textual data (news articles) → relevant documents • Align trends with documents • Language model for each trend type • New document → likelihood that the document is from each model
Mining of Concurrent Text and Time: Redescribe Time Series • Identifying trends • Discretizing trends • This step is a subjective one, in which we assign labels to segments based on their characteristics • Length • Slope • Intercept • r²
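The four segment characteristics listed above can be computed with an ordinary least-squares fit. The sketch below is only an illustration, not the authors' procedure: it assumes a segment is a list of at least two prices sampled at equal intervals, and the label names and the `slope_cut` threshold are hypothetical.

```python
from statistics import mean

def describe_segment(prices):
    """Fit a least-squares line to one segment (needs at least two points)."""
    n = len(prices)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(prices)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, prices))
    syy = sum((y - y_bar) ** 2 for y in prices)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # r2 is the squared correlation; a perfectly flat segment is treated as a perfect fit
    r2 = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return {"length": n, "slope": slope, "intercept": intercept, "r2": r2}

def label_segment(features, slope_cut=0.5):
    """Discretize a segment by slope; slope_cut is a hypothetical threshold."""
    if features["slope"] > slope_cut:
        return "surge"
    if features["slope"] < -slope_cut:
        return "plunge"
    return "flat"
```

For example, `label_segment(describe_segment([10.0, 11.0, 12.0, 13.0]))` labels a steadily rising segment as a surge. In practice the threshold would be chosen by inspection, which is the subjective part the slide mentions.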
Mining of Concurrent Text and Time: Clustering • Agglomerative clustering
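As a rough illustration of agglomerative clustering, the sketch below starts with every point in its own cluster and repeatedly merges the two closest clusters. The slide does not say which linkage or stopping rule was used, so average linkage and a fixed target count `k` are assumptions, as is the idea of representing each trend segment as a numeric tuple.

```python
from itertools import combinations
from math import dist

def agglomerative(points, k):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Average linkage: mean Euclidean distance over all cross-cluster pairs
        return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))

    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters
```

Calling `agglomerative([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)], 2)` groups the two points near the origin together and the two points near (5, 5) together.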
Mining of Concurrent Text and Time: Language Models (I) • A language model represents a discrete distribution over the words in the vocabulary
Mining of Concurrent Text and Time: Language Models (II) • A language model can separate stories that are followed by a surge from stories that are not
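One way to picture a language model per trend type is a smoothed unigram distribution, with a new story assigned to the trend whose model gives it the highest likelihood. The sketch below assumes documents are lists of tokens and uses add-one smoothing; both choices are illustrative assumptions, not details taken from the talk.

```python
from collections import Counter
from math import log

def train_language_model(documents, vocabulary):
    """Unigram model with add-one smoothing over a fixed vocabulary."""
    counts = Counter(w for doc in documents for w in doc if w in vocabulary)
    total = sum(counts.values()) + len(vocabulary)
    return {w: (counts[w] + 1) / total for w in vocabulary}

def log_likelihood(doc, model):
    """Log-probability of a document; out-of-vocabulary words are skipped."""
    return sum(log(model[w]) for w in doc if w in model)

def most_likely_trend(doc, models):
    """Return the trend label whose model scores the document highest."""
    return max(models, key=lambda trend: log_likelihood(doc, models[trend]))
```

With one model trained on stories preceding a surge and another on stories preceding a plunge, `most_likely_trend` returns the label of the better-fitting model for a new story.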
Mining of Concurrent Text and Time: Current Alignment • A document can be associated with more than one trend • It is possible for d2 to influence both trends t1 and t2.
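A minimal sketch of such an alignment, assuming a document is associated with every trend that begins within a fixed window after the document's timestamp. The window rule and the `window_hours` parameter are assumptions made for illustration; the slide only states that one document may be aligned with several trends.

```python
def align(doc_times, trend_starts, window_hours=5.0):
    """Map each document id to the trends that start within window_hours after it.

    doc_times and trend_starts map ids to timestamps expressed in hours.
    """
    return {
        doc: [
            trend
            for trend, start in trend_starts.items()
            if 0 <= start - when <= window_hours
        ]
        for doc, when in doc_times.items()
    }
```

With `doc_times = {"d1": 0.0, "d2": 3.0}` and `trend_starts = {"t1": 4.0, "t2": 7.0}`, document d2 is aligned with both t1 and t2, while d1 is aligned only with t1.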
TimeMines: Constructing Timelines with Statistical Models of Word Usage • Automatically generates timelines from date-tagged free-text corpora • Constructs overviews of text corpora suitable for browsing using timelines • Identifies time-dependent features that indicate important topics in text documents
TimeMines: System Overview • Process steps to discover features in text
TimeMines: The Model for Extracting Features • Stationary random model • The occurrence of a feature depends only on its base rate, and does NOT vary with time. • The arrival of features is a random process with an unknown binomial distribution • Extracting features • Noun phrases and named entities • Label as noun phrases any groups of words of length less than 6 that match the regular expression (NOUN|ADJECTIVE)*NOUN
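The noun-phrase rule can be sketched by running the regular expression over a string of part-of-speech codes rather than over the words themselves. The sketch assumes the input is already tagged as `(word, tag)` pairs with the tag names `NOUN` and `ADJECTIVE`; the tagger itself and the choice to drop matches of six or more words are outside what the slide specifies.

```python
import re

# One character per token so the regex can run over the tag sequence
TAG_CODES = {"NOUN": "N", "ADJECTIVE": "A"}
NP_PATTERN = re.compile(r"[NA]*N")  # (NOUN|ADJECTIVE)*NOUN

def noun_phrases(tagged, max_len=6):
    """Return word groups shorter than max_len whose tags match the pattern."""
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    phrases = []
    for m in NP_PATTERN.finditer(codes):
        if m.end() - m.start() < max_len:
            words = [word for word, _ in tagged[m.start():m.end()]]
            phrases.append(" ".join(words))
    return phrases
```

For a tagged sentence such as "the federal building exploded in Oklahoma City", this yields "federal building" and "Oklahoma City".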
TimeMines: Finding Significant Features • Many statistics can be used to characterize a 2×2 contingency table • EMIM: Expected Mutual Information Measure • KL: Kullback-Leibler divergence • χ²: Chi-square
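For the chi-square case, the statistic for a 2×2 table has a closed form. The sketch below assumes the cells count documents inside or outside a time window that do or do not contain the feature; that cell layout is an assumption about how the table is filled, not something stated on the slide.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Assumed layout: a = in window with feature, b = in window without,
    c = outside window with feature, d = outside window without.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so there is no evidence either way
    return n * (a * d - b * c) ** 2 / denom
```

With one degree of freedom, a value above roughly 3.84 corresponds to significance at the 0.05 level.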
TimeMines: Grouping Significant Features • The assumption that two features $f_j$ and $f_k$ have independent distributions implies that $P(f_k) = P(f_k \mid f_j)$
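To see what the independence assumption means in practice, both sides of the equation can be estimated from document counts and compared; a large gap suggests the two features belong in the same group. The sketch assumes each document is represented as a set of features, which is an illustrative choice rather than the system's actual representation.

```python
def independence_estimates(docs, fj, fk):
    """Estimate P(fk) and P(fk | fj) from a non-empty list of feature sets."""
    n = len(docs)
    with_j = [d for d in docs if fj in d]
    p_k = sum(fk in d for d in docs) / n
    # If fj never occurs, the conditional estimate is undefined; return 0.0 here
    p_k_given_j = sum(fk in d for d in with_j) / len(with_j) if with_j else 0.0
    return p_k, p_k_given_j
```

If "Oklahoma" appears in half of all documents but in two thirds of the documents that mention "FBI", the estimates differ, which is evidence against independence.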
TimeMines: System Image • The pop-up window shows significant named entities such as Oklahoma, FBI, and Justice Department.
Mining E-mail Authorship • Authorship identification or categorisation of e-mail documents • E-mail document features • Structural characteristics • Linguistic evidence • Support Vector Machine
Mining E-mail Authorship: E-mail Document Body Attributes • Structural features • Pattern of vocabulary usage • Stylistic features • Sub-stylistic features
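As a hedged illustration of what such body attributes might look like as numbers, the sketch below computes three simple measurements from an e-mail body. The specific features are hypothetical stand-ins chosen for the example (a blank-line ratio for structure, a type-token ratio for vocabulary usage, an average word length for style); the slide does not list the attributes actually used.

```python
import re

def body_features(body):
    """Compute a few illustrative (hypothetical) attributes of an e-mail body."""
    lines = body.splitlines()
    words = re.findall(r"[A-Za-z']+", body.lower())
    n_words = len(words) or 1  # avoid division by zero on empty bodies
    return {
        # Structural: share of lines that are blank
        "blank_line_ratio": sum(not ln.strip() for ln in lines) / (len(lines) or 1),
        # Vocabulary usage: distinct words divided by total words
        "type_token_ratio": len(set(words)) / n_words,
        # Stylistic: mean word length in characters
        "avg_word_length": sum(map(len, words)) / n_words,
    }
```

A feature vector of this kind, computed per message, is the sort of input a support vector machine classifier would be trained on.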
Mining E-mail Authorship: Experimental Results • SVMlight • F-measure with β=1.0
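For reference, the F-measure combines precision and recall, and with β=1.0 it reduces to their harmonic mean. A minimal sketch follows; the function name and the zero-denominator convention are choices made for this example.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # convention used here when the score is otherwise undefined
    return (1 + b2) * precision * recall / denom
```

Values of β above 1 weight recall more heavily, and values below 1 weight precision more heavily.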