Practical results with Emile • Marten Trautwein • Syllogic B.V.
Road map • Introducing myself • Context: • Text mining tools • Results with Emile
Introducing myself • Computer Science at UvA (1986 - 1991) • Theoretical computer science • Complexity of Categorial Unification Grammar • Dr Janssen • PhD Computer Science at UvA (1991 - 1995) • Theoretical computer science • Complexity of Unification Grammars • Dr van Emde Boas, Dr Janssen, Dr Torenvliet • Syllogic B.V. (1995 - ...) • Research and development • Text mining
Context • Term clustering • TextAnalyst - Microsystems Co. Ltd. • Intelligent Miner for Text - IBM
TextAnalyst • Microsystems Co. Ltd. • Megaputer Intelligence Inc (distributor) • Version 2.0 • www.megaputer.com
TextAnalyst - Features • Functionality includes • Hierarchical / Structured topics • Knowledge base formation • Semantic search • Abstracting • Languages • English • Russian
Intelligent Miner for Text • IBM Corp. • Version 2.3 • December 1998 • www-4.ibm.com/software/data/iminer/fortext/
IM4Text - Features • Functionality includes • Clustering • Categorization • Search • Summarization • WebCrawler • Languages • English
IM4Text - Clustering • [Figure: clustering diagram; labels 0 and I through XII]
Other tools • Verity Knowledge Organizer • Autonomy Knowledge Server • GrapeVine • TextWise's DR-LINK, CHESS and CINDOR • Data Junction's Cambio • DataSet • Synthema, Italy (IBM Technology Watch) • Semio Corp's SemioMap • Cartia's ThemeScape • Canis' cMap • Inxight's LinguistX and VizControls • Muscat's Empower
Emile • Syllogic / University of Amsterdam • Version 3.1
Emile - Features • Functionality includes • Grammar induction • Knowledge base construction • Compound term separation • Languages • Any
Emile - Grammar induction
Fragment of Phaistos disk
1 41 40 7.
2 12 4 40 33.
2 12 6 18 *.
2 12 13 1.
2 12 13 1 18.
2 12 27 14 32 18 27.
2 12 27 35 37 21.
2 12 31 26.
2 12 32 23 38.
2 12 41 19 35.
2 27 25 10 23 18.
…
16 14 18.
16 23 18 43.
Fragment of grammar
[0] --> [3] .
[3] --> [16] [47]
[14] --> 15 [40]
[14] --> 2 12
[16] --> 2 [57] 25 10 23
[16] --> [14] 13 1
[16] --> 16 14
[40] --> 7
[40] --> 29
[47] --> 18
[47] --> 24 40
[57] --> 27
[57] --> 29
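To make the grammar fragment concrete, the sketch below enumerates every sentence its rules can derive. It is only an illustration of how to read the rules, not Emile's induction algorithm; the helper names `parse_rules` and `expand` are hypothetical, and the code assumes the fragment is non-recursive (true for the rules shown).

```python
from itertools import product

# Rules copied from the grammar fragment; bracketed symbols such as [40]
# are nonterminals, bare symbols such as 40 are terminals.
RULES = """
[0] --> [3] .
[3] --> [16] [47]
[14] --> 15 [40]
[14] --> 2 12
[16] --> 2 [57] 25 10 23
[16] --> [14] 13 1
[16] --> 16 14
[40] --> 7
[40] --> 29
[47] --> 18
[47] --> 24 40
[57] --> 27
[57] --> 29
"""

def parse_rules(text):
    """Map each left-hand side to the list of its right-hand sides."""
    grammar = {}
    for line in text.strip().splitlines():
        head, body = line.split("-->")
        grammar.setdefault(head.strip(), []).append(body.split())
    return grammar

def expand(symbol, grammar):
    """Return every token sequence derivable from symbol.

    Assumes a non-recursive grammar; a symbol without rules is
    treated as a terminal.
    """
    if symbol not in grammar:
        return [(symbol,)]
    results = []
    for body in grammar[symbol]:
        parts = [expand(s, grammar) for s in body]
        for combo in product(*parts):
            results.append(tuple(tok for part in combo for tok in part))
    return results

GRAMMAR = parse_rules(RULES)
SENTENCES = sorted({" ".join(seq) for seq in expand("[0]", GRAMMAR)})

for sentence in SENTENCES:
    print(sentence)
```

Among the printed sentences are "2 12 13 1 18 ." and "16 14 18 .", which also occur in the disk fragment. The final period is printed as a separate token because the rule for [0] treats it as a terminal.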
Emile - Incomplete data set (Dutch: "I cannot read / write / open / send mail with ...")
Ik kan geen mail lezen met MS-Mail
Ik kan geen mail schrijven met MS-Mail
Ik kan geen mail openen met MS-Mail
Ik kan geen mail verzenden met MS-Mail
Ik kan geen mail lezen met MS-Outlook
Ik kan geen mail schrijven met MS-Outlook
Ik kan geen mail openen met MS-Outlook
Ik kan geen mail verzenden met MS-Outlook
Ik kan geen mail lezen met Mail
Ik kan geen mail schrijven met Mail
Ik kan geen mail openen met Mail
Ik kan geen mail verzenden met Mail
Ik kan geen mail lezen met Outlook
Ik kan geen mail schrijven met Outlook
Ik kan geen mail openen met Outlook
Ik kan geen mail verzenden met Outlook
Emile - Variable settings • Default on 12 • context support: 30%, expression support: 30%, total support: 50% • Default on 8 • context support: 40%, expression support: 40%, total support: 60% • context support: 50%, expression support: 50%, total support: 70% • Generate data set • Generate complete language • Generate data set • Generate 15 out of 16 sentences • Generate complete language
Emile - Induced grammar
[0] --> [2] [18]
[0] --> [31] [29]
[0] --> [42] [15]
[2] --> Ik kan geen mail [12] met
[12] --> openen
[12] --> verzenden
[15] --> met [41]
[15] --> met [18]
[18] --> MS-Mail
[18] --> MS-Outlook
[27] --> verzenden
[27] --> lezen
[29] --> met [30]
[30] --> MS-Outlook
[30] --> Mail
[31] --> Ik kan geen mail [27]
[31] --> Ik kan [45]
[39] --> lezen
[39] --> schrijven
[41] --> Mail
[41] --> Outlook
[42] --> Ik kan [45]
[45] --> geen mail [39]
[45] --> geen mail [12]
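One way to check what "generate complete language" means for this grammar is to expand it exhaustively and compare the result with all 16 verb and mail-client combinations. The sketch below does that; the names `parse_rules`, `language`, `VERBS` and `CLIENTS` are hypothetical helpers for illustration and are not part of Emile.

```python
from itertools import product

# The induced grammar, copied rule by rule from the slide.
INDUCED = """
[0] --> [2] [18]
[0] --> [31] [29]
[0] --> [42] [15]
[2] --> Ik kan geen mail [12] met
[12] --> openen
[12] --> verzenden
[15] --> met [41]
[15] --> met [18]
[18] --> MS-Mail
[18] --> MS-Outlook
[27] --> verzenden
[27] --> lezen
[29] --> met [30]
[30] --> MS-Outlook
[30] --> Mail
[31] --> Ik kan geen mail [27]
[31] --> Ik kan [45]
[39] --> lezen
[39] --> schrijven
[41] --> Mail
[41] --> Outlook
[42] --> Ik kan [45]
[45] --> geen mail [39]
[45] --> geen mail [12]
"""

VERBS = ["lezen", "schrijven", "openen", "verzenden"]
CLIENTS = ["MS-Mail", "MS-Outlook", "Mail", "Outlook"]

def parse_rules(text):
    """Map each left-hand side to the list of its right-hand sides."""
    grammar = {}
    for line in text.strip().splitlines():
        head, body = line.split("-->")
        grammar.setdefault(head.strip(), []).append(body.split())
    return grammar

def language(symbol, grammar):
    """Set of token tuples derivable from symbol (non-recursive grammars only)."""
    if symbol not in grammar:
        return {(symbol,)}
    derived = set()
    for body in grammar[symbol]:
        for combo in product(*(language(s, grammar) for s in body)):
            derived.add(tuple(tok for part in combo for tok in part))
    return derived

GENERATED = {" ".join(seq) for seq in language("[0]", parse_rules(INDUCED))}
COMPLETE = {f"Ik kan geen mail {v} met {c}" for v, c in product(VERBS, CLIENTS)}

print(len(GENERATED), GENERATED == COMPLETE)  # prints: 16 True
```

The three alternatives for [0] overlap: the third alone already covers all 16 sentences, while the first two derive subsets of them.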
Emile - Knowledge base
Dictionary Type [35] K033 k033 K105 k33
Dictionary Type [87] Vrachtgeb vrachtgeb Vrachtgebouw Vracht
Dictionary Type [89] CGOADTP6 Printqueue
Dictionary Type [114] is Userid Password
Dictionary Type [138] status Error
Dictionary Type [196] scarlos vrachtbrieven
Dictionary Type [215] G239 g239
Dictionary Type [237] enorm ontzettend super
Dictionary Type [290] pingen benaderen
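The dictionary types above read like grammar types whose rules rewrite to single words. Assuming that reading (the slides do not spell out how Emile builds its knowledge base), a listing in the same shape can be sketched from a set of rules. The function `dictionary_types` is a hypothetical helper, and the sample rules are taken from the induced mail grammar rather than from the corpus behind this slide.

```python
from collections import defaultdict

# Sample rules from the induced mail grammar, used here only as input data.
RULES = """
[2] --> Ik kan geen mail [12] met
[12] --> openen
[12] --> verzenden
[18] --> MS-Mail
[18] --> MS-Outlook
[39] --> lezen
[39] --> schrijven
[41] --> Mail
[41] --> Outlook
"""

def dictionary_types(text):
    """Collect the types whose every rule rewrites to exactly one terminal."""
    bodies = defaultdict(list)
    for line in text.strip().splitlines():
        head, body = line.split("-->")
        bodies[head.strip()].append(body.split())
    types = {}
    for head, alternatives in bodies.items():
        # A bracketed symbol is a nonterminal, so it disqualifies the type.
        if all(len(b) == 1 and not b[0].startswith("[") for b in alternatives):
            types[head] = [b[0] for b in alternatives]
    return types

TYPES = dictionary_types(RULES)
for head, words in TYPES.items():
    print(f"Dictionary Type {head} {' '.join(words)}")
```

Type [2] is skipped because its rule mixes words with the nonterminal [12]; the remaining four types each group words that occur in the same contexts.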
Emile - Knowledge base
[16] --> School of Medicine , University of Washington , Seattle 98195 , USA
[16] --> University of Kitasato Hospital , Sagamihara , Kanagawa , Japan
[16] --> Heinrich-Heine-University , Dusseldorf , Germany
[16] --> School of Medicine , Chiba University
[5] --> Department of Urology , [16]
[94] --> Chinese
[94] --> Japanese
[94] --> Polish
[101] --> 32 : Cancer Res 1996 Oct
[101] --> 35 : Genomics 1996 Aug
[101] --> 44 : Cancer Res 1995 Dec
[101] --> 50 : Cancer Res 1995 Feb
[101] --> 54 : Eur J Biochem 1994 Sep
[101] --> 58 : Cancer Res 1994 Mar
[105] --> identified in 13 cases ( 72
[105] --> detected in 9 of 87 informative cases ( 10
[105] --> observed in 5 ( 55
[11] --> LOH was [105] %
Merits of Emile • Language independent • Clustering within sentences • Incremental learning • No training phase • Raw text input • Access to source code