Text Mining Workshop of ACM SIGKDD

Text Mining Workshop of ACM SIGKDD. Speaker: Ping-Tsun Chang. Outline. SIGKDD: Text Mining Workshop: Session: Mining Time-Tagged Text Mining of Concurrent Text and Time Series TimeMines: Constructing Timelines with Statistical Models of Word Usage Session: Text Mining Applications:

Presentation Transcript


  1. Text Mining Workshop of ACM SIGKDD • Speaker: Ping-Tsun Chang

  2. Outline • SIGKDD: Text Mining Workshop: • Session: Mining Time-Tagged Text • Mining of Concurrent Text and Time Series • TimeMines: Constructing Timelines with Statistical Models of Word Usage • Session: Text Mining Applications: • Mining E-mail Authorship

  3. Mining of Concurrent Text and Time Series: Ænalyst • Predicting trends in stock prices based on the content of news stories that precede the trends • Two types of data • Financial time series • Time-stamped news stories • How to connect? • Learn a language model for every trend type

  4. Mining of Concurrent Text and Time Series: System Design • Time-Series Data (Stock Price) → Trends • Textual Data (News Articles) → Relevant Documents • Align Trends With Documents • Language Model for Each Trend Type • New Document → Likelihood That the Document Is from Each Model

  5. Mining of Concurrent Text and Time Series: Redescribe Time Series • Identifying Trends • Discretizing Trends • This step is a subjective one in which we assign labels to segments based on their characteristics • Length • Slope • Intercept • r²
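The four segment characteristics listed on this slide can be computed with an ordinary least-squares line fit. The sketch below is an illustration only: the function name `segment_features`, the use of the sample index as the x-axis, and the convention of reporting r² = 1.0 for a perfectly flat segment are assumptions, not details taken from the talk.

```python
def segment_features(prices):
    """Fit a least-squares line to one segment; x is the sample index 0..n-1."""
    n = len(prices)
    if n < 2:
        raise ValueError("a segment needs at least two points")
    mean_x = (n - 1) / 2
    mean_y = sum(prices) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(prices))
    syy = sum((y - mean_y) ** 2 for y in prices)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Assumed convention: a flat segment is fitted exactly, so report r2 = 1.0.
    r2 = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return {"length": n, "slope": slope, "intercept": intercept, "r2": r2}
```

Segments described this way can then be grouped and labelled, which is the subjective discretization step the slide mentions.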

  6. Mining of Concurrent Text and Time Series: Clustering • Agglomerative clustering

  7. Mining of Concurrent Text and Time Series: Language Models (I) • A Language Model represents a discrete distribution over the words in the vocabulary

  8. Mining of Concurrent Text and Time Series: Language Models (II) • A Language Model can separate stories that are followed by a surge from stories that are not
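As a rough illustration of the two language-model slides, the sketch below trains one unigram model per trend type and picks the model under which a new story is most likely. The add-one smoothing, the function names, and the toy vocabulary and stories are all assumptions made for the example; the talk does not specify the estimation details.

```python
import math
from collections import Counter

def train_unigram(docs, vocab, alpha=1.0):
    """Estimate P(word) over vocab from tokenized docs, with add-alpha smoothing."""
    counts = Counter(w for doc in docs for w in doc if w in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def log_likelihood(doc, model):
    """Log-probability of a tokenized document; out-of-vocabulary words are skipped."""
    return sum(math.log(model[w]) for w in doc if w in model)

# Hypothetical toy data: stories that preceded a surge vs. a plunge.
vocab = {"profit", "loss", "shares", "record"}
models = {
    "surge": train_unigram([["record", "profit", "shares"], ["profit", "record"]], vocab),
    "plunge": train_unigram([["loss", "shares"], ["loss", "loss"]], vocab),
}

new_doc = ["record", "profit"]
best = max(models, key=lambda name: log_likelihood(new_doc, models[name]))
```

Here `best` names the trend type whose model assigns the new story the highest likelihood, which mirrors the "likelihood that the document is from each model" step in the system design.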

  9. Mining of Concurrent Text and Time Series: Current Alignment • A document may be associated with more than one trend • It is possible for d2 to influence both trends t1 and t2.

  10. TimeMines: Constructing Timelines with Statistical Models of Word Usage • Automatically generates timelines from date-tagged free text corpora • Constructs overviews of text corpora suitable for browsing using timelines • Identifies time-dependent features that mark important topics in text documents

  11. TimeMines: System Overview • Process steps to discover features in text

  12. TimeMines: The Model for Extracting Features • Stationary random model • The occurrence of a feature depends only on its base rate and does NOT vary with time. • The arrival of features is a random process with an unknown binomial distribution • Extracting Features • Noun phrases and named entities • Label as noun phrases any groups of words of length less than 6 that match the regular expression (NOUN|ADJECTIVE)*NOUN
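The noun-phrase rule on this slide can be sketched by mapping each part-of-speech tag to a single character and running the pattern over the resulting string. This is a minimal illustration that assumes the text is already tagged; the tag names, the helper `noun_phrases`, the example sentence, and the choice to discard (rather than split) runs of six or more words are assumptions.

```python
import re

TAG_CODES = {"NOUN": "N", "ADJECTIVE": "A"}  # every other tag becomes "x"
PATTERN = re.compile(r"[NA]*N")              # (NOUN|ADJECTIVE)*NOUN over tag codes

def noun_phrases(tagged, max_len=5):
    """Return phrases whose tag sequence matches the pattern, at most max_len words."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    phrases = []
    for m in PATTERN.finditer(codes):
        if m.end() - m.start() <= max_len:   # "length less than 6"
            words = [word for word, _ in tagged[m.start():m.end()]]
            phrases.append(" ".join(words))
    return phrases

# Hypothetical pre-tagged sentence.
tagged = [("the", "DET"), ("federal", "ADJECTIVE"), ("building", "NOUN"),
          ("was", "VERB"), ("bombed", "VERB"), ("in", "PREP"),
          ("Oklahoma", "NOUN"), ("City", "NOUN")]
phrases = noun_phrases(tagged)
```

Because each token maps to exactly one character, match offsets in the code string are also token offsets, so no separate alignment step is needed.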

  13. TimeMines: Finding Significant Features • Many statistics can be used to characterize a 2x2 contingency table • EMIM: Expected Mutual Information Measure • KL: Kullback-Leibler divergence • χ²: Chi-Square
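Of the statistics listed, chi-square is the simplest to show in code. The sketch below uses the standard closed form for a 2x2 table (without continuity correction). The cell layout in the docstring, feature present or absent against documents inside or outside a time window, is an assumption about how the table is built, and the function name is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Assumed layout: a = documents in the window containing the feature,
    b = documents outside the window containing it,
    c = documents in the window without it,
    d = documents outside the window without it.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so no association can be measured
    return n * (a * d - b * c) ** 2 / denom
```

With one degree of freedom, a value above 3.841 is significant at the 0.05 level; which threshold the system actually uses is not stated on the slide.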

  14. TimeMines: Grouping Significant Features • The assumption that two features fj and fk have independent distributions implies that P( fk ) = P( fk | fj )

  15. TimeMines: System Image • The pop-up window shows significant named entities such as Oklahoma, FBI, and Justice Department.

  16. Mining E-mail Authorship • Authorship identification or categorisation of e-mail documents • E-mail document features • Structural characteristics • Linguistic evidence • Support Vector Machine

  17. Mining E-mail Authorship: E-mail document body attributes • Structural features • Pattern of vocabulary usage • Stylistic • Sub-stylistic features

  18. Mining E-mail Authorship: Experimental Results • SVMlight • F-measure with β=1.0
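For reference, the F-measure combines precision and recall as F = (1 + β²)·P·R / (β²·P + R); with β = 1.0 it reduces to the harmonic mean of the two. The helper below is a minimal sketch; the function name and the choice to return 0.0 when both inputs are zero are assumptions.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1.0 weights precision and recall equally."""
    if precision == 0 and recall == 0:
        return 0.0  # assumed convention for the undefined 0/0 case
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, a classifier with precision 0.8 and recall 0.6 has an F1 of 0.96 / 1.4, roughly 0.686.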
