Finding Text Trends

• Word usage tracks interest changes
• Segment documents by time period
• Phrase frequency = number of documents containing the phrase
• Phrase must meet minimum support
• Trend is sequence of frequencies
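The frequency and support bullets can be sketched in a few lines of Python. This is a minimal illustration, not the tool described in the slides: the function name `phrase_trends`, the plain substring match, and the use of an absolute document count for `min_support` are all assumptions made for the example.

```python
from collections import defaultdict

def phrase_trends(docs, phrases, min_support=1):
    """docs: iterable of (period, text) pairs; phrases: list of strings.

    Returns (periods, trends), where trends maps each phrase to its
    per-period document counts, in sorted period order. A phrase is kept
    only if it reaches min_support documents in at least one period
    (assumed interpretation of "support").
    """
    docs = list(docs)
    periods = sorted({period for period, _ in docs})
    counts = {p: defaultdict(int) for p in phrases}
    for period, text in docs:
        lowered = text.lower()
        for p in phrases:
            # Naive substring test; each document counts at most once.
            if p.lower() in lowered:
                counts[p][period] += 1
    trends = {}
    for p in phrases:
        series = [counts[p][period] for period in periods]
        if max(series, default=0) >= min_support:
            trends[p] = series
    return periods, trends

# Hypothetical toy corpus, segmented by year.
docs = [
    (1990, "Data mining is new"),
    (1990, "Relational databases"),
    (1991, "Data mining and more data mining"),
    (1991, "Applied data mining"),
]
periods, trends = phrase_trends(docs, ["data mining", "neural"], min_support=2)
print(periods, trends)  # [1990, 1991] {'data mining': [1, 2]}
```

The third document mentions the phrase twice but adds only one to the count, since frequency is measured in documents, not occurrences.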

Approach
• Reuse current data mining tools
• Two major components
  • phrase identification
  • trend identification via shape queries
• Each word treated as a “transaction”
  • “timestamp” becomes a variable
• Match history to query
  • visual query language
• Visualize results

Identifying Phrases
• Phrases defined recursively
  • <<(IBM)><(data)(mining)>>
  • (IBM) = a word
  • <(data)(mining)> = a 1-phrase
  • <<x><y>> = a 2-phrase
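One way to picture the recursive definition is with nested Python tuples. This sketch assumes that the level k of a k-phrase equals its nesting depth, which is a reading of the notation above rather than something the slides state; `phrase_level` and `render` are hypothetical helper names.

```python
def phrase_level(phrase):
    """Return k for a k-phrase; a bare word (a str) has level 0.

    Assumes every phrase is a non-empty tuple of words or sub-phrases.
    """
    if isinstance(phrase, str):
        return 0
    return 1 + max(phrase_level(part) for part in phrase)

def render(phrase):
    """Print a phrase in the slide's notation: (word) and <...>."""
    if isinstance(phrase, str):
        return f"({phrase})"
    return "<" + "".join(render(part) for part in phrase) + ">"

data_mining = ("data", "mining")          # a 1-phrase
ibm_data_mining = (("IBM",), data_mining)  # a 2-phrase

print(render(data_mining), phrase_level(data_mining))
# <(data)(mining)> 1
print(render(ibm_data_mining), phrase_level(ibm_data_mining))
# <<(IBM)><(data)(mining)>> 2
```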

Kludge timestamps
• Trying to use existing tools
• Queries may specify document sections
  • same sentence
  • same paragraph
  • same section
• Word “timestamps” fudged
  • sentence + 1,000
  • paragraph + 100,000
  • section + 10,000,000
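The timestamp kludge might look roughly like the following. This is a sketch under stated assumptions: each word advances the counter by 1, and the offsets from the slide are added cumulatively at the end of every sentence, paragraph, and section. The slides do not say whether the offsets stack this way, and `assign_timestamps` is a hypothetical name.

```python
SENTENCE_GAP = 1_000
PARAGRAPH_GAP = 100_000
SECTION_GAP = 10_000_000

def assign_timestamps(sections):
    """sections: list of sections, each a list of paragraphs, each a list
    of sentences, each a list of words. Returns [(word, timestamp), ...].
    """
    stamped = []
    t = 0
    for section in sections:
        for paragraph in section:
            for sentence in paragraph:
                for word in sentence:
                    t += 1                 # consecutive words differ by 1
                    stamped.append((word, t))
                t += SENTENCE_GAP          # jump at each sentence boundary
            t += PARAGRAPH_GAP             # extra jump at paragraph boundary
        t += SECTION_GAP                   # extra jump at section boundary
    return stamped

# Hypothetical two-section document.
doc = [
    [  # section 1
        [["IBM", "data", "mining"], ["trends", "rise"]],  # paragraph 1
        [["patents"]],                                    # paragraph 2
    ],
    [  # section 2
        [["new", "section"]],
    ],
]
for word, t in assign_timestamps(doc):
    print(word, t)
```

With these offsets, words in the same sentence stay within 999 of each other as long as a sentence has fewer than 1,000 words, which is what makes the gap-based queries workable.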

Query tricks
• Minimum gap = 1,000
  • same but sequential sentences
• Maximum gap = 999
  • same sentence
• Maximum gap = 99,999
  • same paragraph
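The gap tricks reduce to simple arithmetic on the fudged timestamps. The check below is an illustrative sketch: `satisfies_gap` is a hypothetical helper, the bounds are treated as inclusive (an assumption), and the sample timestamps are made-up values consistent with the offsets of 1,000 per sentence and 100,000 per paragraph.

```python
def satisfies_gap(t_first, t_second, min_gap=None, max_gap=None):
    """True if t_second follows t_first and the gap respects both bounds."""
    gap = t_second - t_first
    if gap <= 0:
        return False
    if min_gap is not None and gap < min_gap:
        return False
    if max_gap is not None and gap > max_gap:
        return False
    return True

# Hypothetical timestamps: "data" and "mining" share a sentence, "trends"
# is in the next sentence, "patents" is in the next paragraph.
data, mining, trends, patents = 2, 3, 1004, 102006

print(satisfies_gap(data, mining, max_gap=999))        # True: same sentence
print(satisfies_gap(mining, trends, max_gap=999))      # False: sentence boundary crossed
print(satisfies_gap(mining, trends, min_gap=1000))     # True: different sentences
print(satisfies_gap(data, trends, max_gap=99_999))     # True: same paragraph
print(satisfies_gap(trends, patents, max_gap=99_999))  # False: paragraph boundary crossed
```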

Shape Definition Language
• For describing trends in word frequency
  • rising
  • falling
  • spike
• Has graphical front-end
• Can be “blurry”
  • shape significant
  • interval details neglected
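To make the shape idea concrete, here is a toy matcher for the three shapes named above. It is a stand-in for illustration only and does not reproduce the Shape Definition Language or its syntax. The function names, the `tolerance` parameter, and the choice to ignore flat steps (a rough nod to "blurry" matching) are all assumptions.

```python
def transitions(series, tolerance=0.0):
    """Label each step between consecutive frequencies as up/down/stable."""
    steps = []
    for prev, cur in zip(series, series[1:]):
        delta = cur - prev
        if delta > tolerance:
            steps.append("up")
        elif delta < -tolerance:
            steps.append("down")
        else:
            steps.append("stable")
    return steps

def matches_shape(series, shape, tolerance=0.0):
    """Check the overall shape, ignoring how long each phase lasts."""
    steps = [s for s in transitions(series, tolerance) if s != "stable"]
    if not steps:
        return False
    if shape == "rising":
        return all(s == "up" for s in steps)
    if shape == "falling":
        return all(s == "down" for s in steps)
    if shape == "spike":
        # One or more rises followed by one or more falls.
        i = 0
        while i < len(steps) and steps[i] == "up":
            i += 1
        return 0 < i < len(steps) and all(s == "down" for s in steps[i:])
    raise ValueError(f"unknown shape: {shape}")

print(matches_shape([1, 2, 2, 5], "rising"))   # True
print(matches_shape([1, 4, 9, 2], "spike"))    # True
print(matches_shape([1, 4, 2, 6], "spike"))    # False
```

Because only the order of rises and falls is checked, a trend that rises over two periods and one that rises over ten both match "rising".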

Test Application
• U.S. patent database
  • database searched by a non-expert user
• Identified rising trends for several phrases
• Transition from specific query to mining not described

Problems
• Tended to identify too many phrases
• Worked on pruning of phrases
  • non-maximal subset near maximal phrase
  • syntactic sub-phrases
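One plausible reading of the pruning bullets is: drop a phrase when it is a contiguous sub-phrase of a longer phrase whose support is nearly as high. The sketch below implements that reading only; the `closeness` threshold, the helper names, and the sample supports are hypothetical and not taken from the slides.

```python
def is_subphrase(short, long):
    """True if `short` appears as a contiguous run of words inside `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )

def prune(supports, closeness=0.9):
    """supports: {tuple_of_words: document_count}.

    Keep a phrase unless some longer phrase contains it and has at least
    `closeness` times its support (assumed pruning criterion).
    """
    kept = {}
    for phrase, support in supports.items():
        redundant = any(
            is_subphrase(phrase, other) and other_support >= closeness * support
            for other, other_support in supports.items()
        )
        if not redundant:
            kept[phrase] = support
    return kept

supports = {
    ("data",): 200,
    ("mining",): 42,
    ("data", "mining"): 40,
    ("text", "data", "mining"): 38,
}
print(prune(supports))
# {('data',): 200, ('text', 'data', 'mining'): 38}
```

Here ("data",) survives because it occurs far more often than any longer phrase containing it, while ("mining",) and ("data", "mining") add little beyond the longer phrases.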
