
Finding Text Trends


joanne

Presentation Transcript


  1. Finding Text Trends • Word usage tracks interest changes • Segment documents by time period • Phrase frequency = number of documents • Phrase must have support • Trend is sequence of frequencies
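The counting scheme on this slide might be sketched as follows. This is only an illustration: the function name `phrase_trends`, the input layout (pairs of time period and phrase set), and the reading of "support" as a minimum document count in at least one period are assumptions, not details given in the slides.

```python
from collections import defaultdict


def phrase_trends(documents, min_support):
    """Build per-period document frequencies for each phrase.

    documents: iterable of (period, phrases) pairs, where phrases is the
    collection of phrases found in one document (hypothetical layout).
    min_support: assumed meaning of "support" -- a phrase is kept only if
    it occurs in at least this many documents in some period.
    """
    docs_per_period = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for period, phrases in documents:
        docs_per_period[period] += 1
        # Frequency counts documents, not occurrences, so deduplicate.
        for phrase in set(phrases):
            counts[phrase][period] += 1

    periods = sorted(docs_per_period)
    trends = {}
    for phrase, per_period in counts.items():
        # The trend is the sequence of frequencies in period order.
        freqs = [per_period.get(p, 0) for p in periods]
        if max(freqs) >= min_support:
            trends[phrase] = freqs
    return periods, trends


docs = [
    (1990, {"data mining"}),
    (1990, {"data mining", "neural net"}),
    (1991, {"data mining"}),
    (1991, {"data mining"}),
    (1991, {"data mining"}),
]
periods, trends = phrase_trends(docs, min_support=2)
```

Here "neural net" appears in only one document and is dropped, while "data mining" yields one frequency per period.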

  2. Approach • Reuse current data mining tools • Two major components • phrase identification • trend identification via shape queries • Each word treated as a “transaction” • “timestamp” becomes a variable • Match history to query • visual query language • Visualize results

  3. Identifying Phrases • Phrases defined recursively • <<(IBM)><(data)(mining)>> • (IBM) = a word • <(data)(mining)> = a 1-phrase • <<x><y>> = a 2-phrase
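One way to picture the recursive definition is with nested tuples, where a plain string is a word and a tuple of k-level items is a (k+1)-phrase. The representation and the helper names `level` and `render` are assumptions made for this sketch.

```python
def level(item):
    """Return 0 for a word and k for a k-phrase (assumes non-empty phrases)."""
    if isinstance(item, str):
        return 0
    return 1 + max(level(child) for child in item)


def render(item):
    """Print an item in the bracket notation used on the slide."""
    if isinstance(item, str):
        return f"({item})"
    return "<" + "".join(render(child) for child in item) + ">"


# A 2-phrase built from two 1-phrases: <(IBM)> and <(data)(mining)>.
phrase = (("IBM",), ("data", "mining"))
notation = render(phrase)
depth = level(phrase)
```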

  4. Kludge timestamps • Trying to use existing tools • Queries may specify document sections • same sentence • same paragraph • same section • Word “timestamps” fudged • sentence + 1,000 • paragraph + 100,000 • section + 10,000,000
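A minimal sketch of the fudged timestamps, under one possible reading of the slide: each word advances a running counter by 1, and crossing a sentence, paragraph, or section boundary adds the corresponding jump. The nested-list document layout, the function name `assign_timestamps`, and the choice to apply only the largest boundary crossed are assumptions.

```python
SENTENCE_JUMP = 1_000
PARAGRAPH_JUMP = 100_000
SECTION_JUMP = 10_000_000


def assign_timestamps(sections):
    """Return (word, timestamp) pairs for a nested document.

    sections: list of sections; a section is a list of paragraphs; a
    paragraph is a list of sentences; a sentence is a list of words.
    """
    stamped = []
    t = 0
    for s_idx, section in enumerate(sections):
        for p_idx, paragraph in enumerate(section):
            for n_idx, sentence in enumerate(paragraph):
                # Apply the jump for the largest boundary just crossed.
                if n_idx > 0:
                    t += SENTENCE_JUMP
                elif p_idx > 0:
                    t += PARAGRAPH_JUMP
                elif s_idx > 0:
                    t += SECTION_JUMP
                for word in sentence:
                    t += 1
                    stamped.append((word, t))
    return stamped


doc = [
    [[["a", "b"], ["c"]], [["d"]]],  # section 1: two paragraphs
    [[["e"]]],                        # section 2: one paragraph
]
stamped = assign_timestamps(doc)
```

The scheme only separates the levels cleanly if sentences stay under 1,000 words and paragraphs stay under 100,000 timestamp units.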

  5. Query tricks • Minimum gap = 1000 • same but sequential sentences • Maximum gap = 999 • same sentence • Maximum gap = 99,999 • same paragraph
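The gap constraints can be read as simple checks on the difference between two word timestamps. The sketch below assumes timestamps where a sentence boundary adds 1,000 and a paragraph boundary adds 100,000; the helper name `within_gap` and the sample values are hypothetical.

```python
def within_gap(t1, t2, min_gap=0, max_gap=None):
    """Check whether the distance between two timestamps satisfies the gap bounds."""
    gap = abs(t2 - t1)
    if gap < min_gap:
        return False
    return max_gap is None or gap <= max_gap


def same_sentence(t1, t2):
    return within_gap(t1, t2, max_gap=999)


def same_paragraph(t1, t2):
    return within_gap(t1, t2, max_gap=99_999)


def different_sentences(t1, t2):
    return within_gap(t1, t2, min_gap=1_000)


# Sample timestamps: two words in one sentence, one word in the next
# sentence, and one word in the next paragraph.
a, b, c, d = 1, 2, 1003, 101004
```

With these values, `a` and `b` fall in the same sentence, `b` and `c` share a paragraph but not a sentence, and `c` and `d` are in different paragraphs.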

  6. Shape Definition Language • For describing trends in word frequency • rising • falling • spike • Has graphical front-end • Can be “blurry” • shape significant • interval details neglected
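To make the idea of "blurry" shape matching concrete, here is a rough sketch that is not the actual shape definition language: it reduces a frequency sequence to up/down steps, ignores flat intervals, and tests for the three shapes named on the slide. The function names, the `tolerance` parameter, and the exact definition of a spike are all assumptions.

```python
def steps(freqs, tolerance=0):
    """Label each transition between consecutive frequencies."""
    labels = []
    for prev, cur in zip(freqs, freqs[1:]):
        delta = cur - prev
        if delta > tolerance:
            labels.append("up")
        elif delta < -tolerance:
            labels.append("down")
        else:
            labels.append("stable")
    return labels


def matches(freqs, shape, tolerance=0):
    """Blurry match: only the overall shape matters, not interval details."""
    moves = [m for m in steps(freqs, tolerance) if m != "stable"]
    if not moves:
        return False
    if shape == "rising":
        return all(m == "up" for m in moves)
    if shape == "falling":
        return all(m == "down" for m in moves)
    if shape == "spike":
        # Assumed definition: one or more ups followed only by downs.
        k = 0
        while k < len(moves) and moves[k] == "up":
            k += 1
        return 0 < k < len(moves) and all(m == "down" for m in moves[k:])
    raise ValueError(f"unknown shape: {shape}")
```

For instance, `matches([1, 2, 2, 5], "rising")` holds because the flat middle interval is ignored, which is the sense in which the match is blurry.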

  7. Test Application • U.S. patent database • Database searched by a non-expert user • Identified rising trends for several phrases • Transition from specific query to mining not described

  8. Problems • Tended to identify too many phrases • Worked on pruning of phrases • non-maximal subset near maximal phrase • syntactic sub-phrases
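One plausible form of the pruning, sketched under assumptions: a phrase is dropped when it is a contiguous sub-phrase of a longer phrase whose support is nearly the same. The slides do not give the actual rule, so the `slack` threshold, the tuple-of-words representation, and the function names are hypothetical.

```python
def is_subphrase(short, long):
    """True if short occurs as a contiguous run of words inside long."""
    n = len(short)
    if n >= len(long):
        return False
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def prune(supports, slack=0):
    """Keep phrases that are not near-duplicates of a longer phrase.

    supports: dict mapping a phrase (tuple of words) to its support.
    slack: how much extra support a sub-phrase may have and still be
    treated as redundant (assumed parameter).
    """
    kept = {}
    for phrase, support in supports.items():
        dominated = any(
            is_subphrase(phrase, other) and support - other_support <= slack
            for other, other_support in supports.items()
        )
        if not dominated:
            kept[phrase] = support
    return kept


supports = {
    ("data",): 50,
    ("mining",): 13,
    ("data", "mining"): 12,
    ("IBM", "data", "mining"): 12,
}
kept = prune(supports, slack=1)
```

In this example "data" survives because it is far more frequent than any phrase containing it, while "mining" and "data mining" are absorbed by the maximal phrase.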
