Text Mining Highlights Marten Trautwein Syllogic Research & Development
Roadmap • TextHub: a parallel information retrieval tool • Text Mine: a document clustering extension • Emile: grammar induction & clustering
What is TextHub? • Intelligent Parallel Information Retrieval Tool • Intuitive Web-based graphical user interface • Compression / Decompression • Indexing / Retrieval • Document clustering & categorization
The star topology • Master receives requests • Master delegates tasks • Slave performs tasks • Master collects results • Master returns answer
Use of parallelism • Documents outnumber processors • Divide and conquer • Distribute documents • Minimal communication overhead • Linear speed-up (1 GB per hour)
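The star topology and the document distribution described on the two slides above can be sketched as follows. This is a minimal illustration, not TextHub's implementation: the function names (`partition`, `slave_search`, `master_search`), the round-robin split, and the use of Python threads as stand-ins for separate processors are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(documents, n_slaves):
    """Split the documents into n_slaves shards (round-robin, an assumption)."""
    return [documents[i::n_slaves] for i in range(n_slaves)]

def slave_search(shard, term):
    """A slave scans only its own shard and returns matching document ids."""
    return [doc_id for doc_id, text in shard if term in text.lower().split()]

def master_search(documents, term, n_slaves=3):
    """Master: delegate one task per slave, collect the results, return the answer."""
    shards = partition(documents, n_slaves)
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        partial = pool.map(slave_search, shards, [term] * n_slaves)
    return sorted(doc_id for hits in partial for doc_id in hits)

docs = [(1, "parallel text retrieval"), (2, "document clustering"),
        (3, "grammar induction"), (4, "text mining highlights")]
print(master_search(docs, "text"))
```

Each slave touches only its own shard, and the only communication is the query going out and the list of hits coming back, which is what keeps the overhead small.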
Functionality details • Compression / Decompression: canonical Huffman encoding • Indexing: inverted file index with canonical terms • Retrieval: Boolean (AND, OR, MINUS); search modifiers (stemming, case folding, stop list, synonyms, semantic network); proximity (AT, FAR, NEAR) • Relevance ranking: score documents
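To make the indexing and Boolean retrieval bullets concrete, here is a small sketch of an inverted index with AND, OR and MINUS queries. It is an assumption-laden illustration: the stop list, the `canonical` function (case folding only, no stemming or synonyms) and the helper names are hypothetical and not taken from TextHub.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "and"}  # hypothetical stop list

def canonical(term):
    """Reduce a term to canonical form; here only case folding (assumption)."""
    return term.lower()

def build_index(documents):
    """Inverted index: canonical term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():
            term = canonical(token)
            if term not in STOP_WORDS:
                index[term].add(doc_id)
    return index

def postings(index, term):
    """Ids of the documents containing the term (empty set if unknown)."""
    return index.get(canonical(term), set())

def q_and(index, a, b):
    return postings(index, a) & postings(index, b)

def q_or(index, a, b):
    return postings(index, a) | postings(index, b)

def q_minus(index, a, b):
    return postings(index, a) - postings(index, b)

docs = {1: "Parallel retrieval of text",
        2: "Text clustering",
        3: "Grammar induction and clustering"}
index = build_index(docs)
print(q_and(index, "text", "clustering"))
print(q_or(index, "parallel", "grammar"))
print(q_minus(index, "clustering", "text"))
```

Because each query is a set operation on posting lists, the documents themselves are never rescanned at query time.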
Relevance ranking • Rate relevance of document • Score based on number of occurrences • Score compensated for large documents • TextHub marks where document is relevant
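A simple way to read "score based on number of occurrences, compensated for large documents" is an occurrence count divided by document length. The slide does not give TextHub's actual formula, so the normalization below and the names `relevance` and `rank` are assumptions for illustration only.

```python
def relevance(text, query_terms):
    """Occurrences of the query terms divided by document length (assumed formula)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(tokens.count(term.lower()) for term in query_terms)
    return hits / len(tokens)

def rank(documents, query_terms):
    """Return ids of matching documents, most relevant first."""
    scored = [(relevance(text, query_terms), doc_id)
              for doc_id, text in documents.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ordered if score > 0]

docs = {
    "short": "text mining",
    "long": "text text about many other unrelated things here",
    "none": "grammar induction",
}
print(rank(docs, ["text"]))
```

In this example the long document contains the term twice, but the short one ranks higher because half of its words match.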
Text Mine - Document clustering • Improve relevance feedback • Clustering of related documents • Categorization of documents • Minimum spanning tree algorithm
Using minimum spanning tree [Figure: graph with node labels V, F, T, D, U, S, A, E, B, C] • Combine different measures • Ordinary query retrieves relevant nodes • Nodes serve as entry-points • No global minimum spanning tree?
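As a rough illustration of clustering with a minimum spanning tree, the sketch below builds the tree with Kruskal's algorithm and then drops the heaviest tree edges so that the remaining components form clusters. The slides name only "minimum spanning tree algorithm"; the choice of Kruskal, the edge-cutting step, the distance values and all names here are assumptions.

```python
class UnionFind:
    """Disjoint sets over nodes 0..n-1, used to track connected components."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return False
        self.parent[root_a] = root_b
        return True

def minimum_spanning_tree(n_nodes, edges):
    """Kruskal: take edges (distance, u, v) in ascending order, skip cycles."""
    components = UnionFind(n_nodes)
    return [edge for edge in sorted(edges) if components.union(edge[1], edge[2])]

def clusters_from_mst(n_nodes, tree, n_clusters):
    """Drop the (n_clusters - 1) heaviest tree edges; what remains are clusters."""
    kept = sorted(tree)[: len(tree) - (n_clusters - 1)]
    components = UnionFind(n_nodes)
    for _, u, v in kept:
        components.union(u, v)
    groups = {}
    for node in range(n_nodes):
        groups.setdefault(components.find(node), []).append(node)
    return sorted(groups.values())

# Hypothetical pairwise document distances: (distance, doc_a, doc_b).
edges = [(0.1, 0, 1), (0.2, 1, 2), (0.9, 2, 3),
         (0.15, 3, 4), (0.8, 0, 3), (0.95, 1, 4)]
tree = minimum_spanning_tree(5, edges)
print(clusters_from_mst(5, tree, 2))
```

A document retrieved by an ordinary query could then serve as an entry point: its cluster is simply the component it belongs to.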
Emile • In cooperation with University of Amsterdam • Engine enabling: grammar induction, knowledge base construction, compound term separation • Language independent
Grammar induction
Fragment of Phaistos disk:
  1 41 40 7.
  2 12 4 40 33.
  2 12 6 18 *.
  2 12 13 1.
  2 12 13 1 18.
  2 12 27 14 32 18 27.
  2 12 27 35 37 21.
  2 12 31 26.
  2 12 32 23 38.
  2 12 41 19 35.
  2 27 25 10 23 18.
  …
  16 14 18.
  16 23 18 43.
Fragment of grammar:
  [0] --> [3] .
  [3] --> [16] [47]
  [14] --> 15 [40]
  [14] --> 2 12
  [16] --> 2 [57] 25 10 23
  [16] --> [14] 13 1
  [16] --> 16 14
  [40] --> 7
  [40] --> 29
  [47] --> 18
  [47] --> 24 40
  [57] --> 27
  [57] --> 29
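One way to read the grammar fragment is to expand it back into sign sequences and compare them with the disk fragment. The sketch below does only that; it does not show how Emile induces the rules. The dictionary encoding and the `expand` helper are assumptions, and the exhaustive expansion works only because this fragment is non-recursive.

```python
from itertools import product

# Rules copied from the grammar fragment. Bracketed symbols are non-terminals;
# everything else is a terminal (a sign number or the sentence-final period).
GRAMMAR = {
    "[0]": [["[3]", "."]],
    "[3]": [["[16]", "[47]"]],
    "[14]": [["15", "[40]"], ["2", "12"]],
    "[16]": [["2", "[57]", "25", "10", "23"], ["[14]", "13", "1"], ["16", "14"]],
    "[40]": [["7"], ["29"]],
    "[47]": [["18"], ["24", "40"]],
    "[57]": [["27"], ["29"]],
}

def expand(symbol):
    """Return every terminal sequence derivable from the given symbol."""
    if symbol not in GRAMMAR:
        return [[symbol]]
    results = []
    for rhs in GRAMMAR[symbol]:
        # Combine every expansion of each right-hand-side symbol.
        for parts in product(*(expand(s) for s in rhs)):
            results.append([token for part in parts for token in part])
    return results

sentences = {" ".join(sequence) for sequence in expand("[0]")}
for observed in ["16 14 18 .", "2 12 13 1 18 .", "2 27 25 10 23 18 ."]:
    print(observed, observed in sentences)
```

The three sequences checked here appear in the disk fragment and are all derivable from the rules shown; sequences such as "1 41 40 7." are not, which is consistent with the slide showing only a fragment of the grammar.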
Knowledge base construction
  Dictionary Type [35]: K033 k033 K105 k33
  Dictionary Type [87]: Vrachtgeb vrachtgeb Vrachtgebouw Vracht
  Dictionary Type [89]: CGOADTP6 Printqueue
  Dictionary Type [114]: is Userid Password
  Dictionary Type [138]: status Error
  Dictionary Type [196]: scarlos vrachtbrieven
  Dictionary Type [215]: G239 g239
  Dictionary Type [237]: enorm ontzettend super
  Dictionary Type [290]: pingen benaderen
Emile outcome
  [16] --> School of Medicine , University of Washington , Seattle 98195 , USA
  [16] --> University of Kitasato Hospital , Sagamihara , Kanagawa , Japan
  [16] --> Heinrich-Heine-University , Dusseldorf , Germany
  [16] --> School of Medicine , Chiba University
  [5] --> Department of Urology , [16]
  [94] --> Chinese
  [94] --> Japanese
  [94] --> Polish
  [101] --> 32 : Cancer Res 1996 Oct
  [101] --> 35 : Genomics 1996 Aug
  [101] --> 44 : Cancer Res 1995 Dec
  [101] --> 50 : Cancer Res 1995 Feb
  [101] --> 54 : Eur J Biochem 1994 Sep
  [101] --> 58 : Cancer Res 1994 Mar
  [105] --> identified in 13 cases ( 72
  [105] --> detected in 9 of 87 informative cases ( 10
  [105] --> observed in 5 ( 55
  [11] --> LOH was [105] %