Finding Text Trends • Word usage tracks changes in interest • Segment documents by time period • Phrase frequency = number of documents containing the phrase in a period • Phrase must meet minimum support • Trend is a sequence of frequencies across periods
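The frequency history described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `phrase_history`, the `(period, text)` document layout, the substring match, and treating support as a minimum document count are all assumptions made for the example.

```python
from collections import defaultdict


def phrase_history(docs, phrase, periods, min_support=1):
    """Return the per-period document count for `phrase`.

    docs        -- iterable of (period, text) pairs (assumed layout)
    phrase      -- phrase to look for; matched as a lowercase substring,
                   which is a simplification of real phrase matching
    periods     -- ordered list of periods that define the history
    min_support -- counts below this threshold are reported as 0
    """
    hits = defaultdict(int)
    for period, text in docs:
        if phrase in text.lower():
            hits[period] += 1
    # The trend is the sequence of frequencies, one entry per period.
    return [hits[p] if hits[p] >= min_support else 0 for p in periods]


docs = [
    (1995, "Data mining of patents"),
    (1995, "Compiler design"),
    (1996, "Data mining tools"),
    (1996, "More data mining"),
]
history = phrase_history(docs, "data mining", [1995, 1996])
filtered = phrase_history(docs, "data mining", [1995, 1996], min_support=2)
```

With a stricter `min_support`, periods where the phrase is rare drop out of the history, which is one way to read "phrase must have support".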
Approach • Reuse current data mining tools • Two major components • phrase identification • trend identification via shape queries • Each word treated as a “transaction” • “timestamp” becomes a variable • Match history to query • visual query language • Visualize results
Identifying Phrases • Phrases defined recursively • <<(IBM)><(data)(mining)>> • (IBM) = a word • <(data)(mining)> = a 1-phrase • <<x><y>> = a 2-phrase
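A small sketch of the recursive definition, assuming a word is a Python string and a k-phrase is a tuple of (k-1)-phrases. The helper names `phrase_level` and `render` are hypothetical and exist only for this illustration.

```python
def phrase_level(p):
    """Return 0 for a word, or k for a k-phrase (a tuple of (k-1)-phrases)."""
    if isinstance(p, str):
        return 0
    levels = {phrase_level(part) for part in p}
    if len(levels) != 1:
        raise ValueError("a k-phrase must contain only (k-1)-phrases")
    return levels.pop() + 1


def render(p):
    """Print a phrase in the slide's notation: (word) and <...> nesting."""
    if isinstance(p, str):
        return f"({p})"
    return "<" + "".join(render(part) for part in p) + ">"


# The slide's example: a 2-phrase made of two 1-phrases.
example = (("IBM",), ("data", "mining"))
level = phrase_level(example)
text = render(example)
```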
Kludge timestamps • Trying to use existing tools • Queries may specify document sections • same sentence • same paragraph • same section • Word “timestamps” fudged at each boundary • sentence boundary: + 1,000 • paragraph boundary: + 100,000 • section boundary: + 10,000,000
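One way to picture the fudged timestamps, as a sketch only. It assumes each word advances the counter by 1 and that the jumps from the slide are added cumulatively at the end of each sentence, paragraph, and section; the nested-list document layout and the name `assign_timestamps` are made up for the example.

```python
SENTENCE_JUMP = 1_000
PARAGRAPH_JUMP = 100_000
SECTION_JUMP = 10_000_000


def assign_timestamps(sections):
    """Give every word a pseudo-timestamp that encodes document structure.

    sections -- list of sections; a section is a list of paragraphs,
                a paragraph a list of sentences, a sentence a list of words
    """
    stamped = []
    t = 0
    for section in sections:
        for paragraph in section:
            for sentence in paragraph:
                for word in sentence:
                    t += 1  # consecutive words are one tick apart
                    stamped.append((word, t))
                t += SENTENCE_JUMP  # jump at the sentence boundary
            t += PARAGRAPH_JUMP  # extra jump at the paragraph boundary
        t += SECTION_JUMP  # extra jump at the section boundary
    return stamped


doc = [[[["IBM", "data", "mining"], ["new", "tools"]], [["patents"]]]]
stamped = assign_timestamps(doc)
```

Words in the same sentence end up less than 1,000 ticks apart, while words in different paragraphs are more than 100,000 ticks apart, so an ordinary time-gap constraint can stand in for a structural one.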
Query tricks • Minimum gap = 1,000 • words in different sentences • Maximum gap = 999 • same sentence • Maximum gap = 99,999 • same paragraph
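The gap settings can be illustrated with a small check on two word timestamps. This is a sketch: `satisfies_gap` is a hypothetical helper, the inclusive bounds are an assumption, and the sample timestamps are invented to follow a scheme where a sentence boundary adds 1,000 and a paragraph boundary adds 100,000.

```python
def satisfies_gap(t_first, t_second, min_gap=None, max_gap=None):
    """Check whether two timestamps respect optional gap bounds (inclusive)."""
    gap = t_second - t_first
    if gap <= 0:
        return False  # the second word must come after the first
    if min_gap is not None and gap < min_gap:
        return False
    if max_gap is not None and gap > max_gap:
        return False
    return True


# Invented timestamps: "data" and "mining" share a sentence, "tools" is in
# the next sentence, and "patents" is in the next paragraph.
stamps = {"data": 2, "mining": 3, "tools": 1005, "patents": 102006}

same_sentence = satisfies_gap(stamps["data"], stamps["mining"], max_gap=999)
different_sentence = satisfies_gap(stamps["data"], stamps["tools"], min_gap=1000)
same_paragraph = satisfies_gap(stamps["data"], stamps["tools"], max_gap=99_999)
crosses_paragraph = satisfies_gap(stamps["data"], stamps["patents"], max_gap=99_999)
```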
Shape Definition Language • For describing trends in word frequency • rising • falling • spike • Has graphical front-end • Can be “blurry” • shape significant • interval details neglected
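A rough sketch of a "blurry" shape match over a frequency history. It is not the Shape Definition Language itself: the `shape` and `matches` helpers and the `RISING`, `FALLING`, and `SPIKE` encodings are assumptions chosen to show how a shape can matter while interval details are ignored.

```python
def shape(history, tolerance=0.0):
    """Collapse a history into its coarse sequence of moves.

    Changes no larger than `tolerance` are ignored, and repeated moves in
    the same direction are merged, so only the overall shape remains.
    """
    moves = []
    for prev, cur in zip(history, history[1:]):
        delta = cur - prev
        if abs(delta) <= tolerance:
            continue
        move = "up" if delta > 0 else "down"
        if not moves or moves[-1] != move:
            moves.append(move)
    return moves


def matches(history, query, tolerance=0.0):
    """True when the coarse shape of `history` equals the queried shape."""
    return shape(history, tolerance) == query


# Hypothetical encodings of the three shapes named on the slide.
RISING = ["up"]
FALLING = ["down"]
SPIKE = ["up", "down"]

rising = matches([1, 3, 4, 9], RISING)
falling = matches([9, 4, 4, 1], FALLING)
spike = matches([1, 2, 2, 5, 3], SPIKE)
```

Because runs are merged, a spike that takes two periods to build and one that takes five periods both match the same query.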
Test Application • U.S. patent database • Database searched by a user without domain knowledge • Identified rising trends for several phrases • Transition from specific query to mining not described
Problems • Tended to identify too many phrases • Worked on pruning of phrases • non-maximal sub-phrases with support near that of a maximal phrase • syntactic sub-phrases
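The pruning idea can be sketched as dropping a sub-phrase when a longer phrase that contains it has nearly the same support. The slides give no pruning rule, so everything here is an assumption: phrases are flat tuples of words, `closeness` is a hypothetical ratio threshold, and the support counts are invented.

```python
def is_subphrase(short, long):
    """True when `short` occurs as a contiguous run inside the longer `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def prune(supports, closeness=0.9):
    """Keep phrases that are not explained by a longer, similarly common phrase.

    supports  -- dict mapping a phrase (tuple of words) to its support count
    closeness -- a sub-phrase is dropped when a containing phrase has at
                 least this fraction of its support (assumed threshold)
    """
    kept = {}
    for phrase, sup in supports.items():
        redundant = any(
            is_subphrase(phrase, other) and other_sup >= closeness * sup
            for other, other_sup in supports.items()
        )
        if not redundant:
            kept[phrase] = sup
    return kept


# Invented support counts for illustration.
supports = {
    ("data", "mining"): 40,
    ("data",): 120,
    ("mining",): 42,
}
pruned = prune(supports)
```

Here "mining" is dropped because it almost always appears as part of "data mining", while "data" is kept because it occurs far more often on its own.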