Explore language barriers in digital libraries, propose solutions through text mining tech. Increase access to diverse content through multilingual search, translation, and summarization systems. Address information overload with efficient language understanding systems.
Research Problems in Digital Libraries: Data Mining and Text Mining • Jaime Carbonell and Raj Reddy • Carnegie Mellon University • April 21, 2006 • Talk presented at CS50 symposium at CMU
Digital Libraries and Universal Access to Information • Create a Universal Digital Library containing all the books ever published • Unfortunately many of the books are in English • Not readable by over 80% of the population
Information Overload • If we read a book every day, we can read at most 40,000 books in a lifetime • Having millions of books online and accessible creates an information overload • “we have a wealth of information and scarcity of (human) attention!” (Herbert Simon) • Multilingual search technology can help to reduce the overload • It permits users to search very large databases quickly and reliably, independent of language and location
Understanding Language • Books in non-native languages remain incomprehensible to most people • Translation and summarization are essential for worldwide use • Current translation systems are not yet perfect • Language understanding systems have improved significantly in the past few decades • Systems based on statistical and linguistic techniques have shown significant performance improvements • Machine learning can improve performance further • Digitization projects will act as a test bed for validating language understanding systems research • e.g. The Million Book Digital Library Project
The Million Book Digital Library • Collaborative venture among many countries including USA, China and India • So far 400,000 books have been scanned in China and 200,000 in India • Content is made freely available around the globe • Those wishing to see the Video in the next slide should download from http://www.rr.cs.cmu.edu/MSRI.zip
Million Book Project: Status • 21 Centers in India • 17 centers in China • 1 Center in Egypt • Planned : Australia and Europe • About 600,000 books scanned • About 120,000+ accessible on the web from India • http://dli.iiit.ac.in/ • Uses 8TB of storage • 10 TB server at CMU Library planned for July 2005 • 1,000,000 books by the end of 2007 • Capacity to scan a million pages a day expected to be operational by the end of 2006
Million Book Project: Research Challenges • Providing access to billions of users every day • Distributed cached servers in every country and region • Self-healing databases • Easy-to-use interfaces for billions of users • Text Mining Challenges • Multilingual Information Retrieval • Summarization • Text Categorization • Named-Entity Identification • Novelty Detection • Translation
Information Bill of Rights • Get the right information • To the right people • At the right time • On the right medium • In the right language • With the right level of detail
Relevant Text Mining Technologies • “…right information”: IR (search engines) • “…right people”: Classification, routing • “…right time”: Anticipatory analysis • “…right medium”: Info extraction, speech • “…right language”: Machine translation • “…right level of detail”: Summarization
… The Right Information: Next Generation Search Engines • Search Criteria Beyond Query-Relevance • Google: Popularity (link density, click freq, …) • Vivisimo: Panoramic view (clustering + labeling) • Information novelty (content differential, recency) • Trustworthiness of source • Appropriateness to user (difficulty level, …) • Hidden web: 10X visible web (Federated search) • “Find What I Mean” Principle • Search on semantically related terms • Induce user profile from past history, etc. • Disambiguate terms (e.g. “Jordan”)
Clustering (Vivisimo-style) Search vs Standard IR • [Figure: documents around a query; standard IR retrieval compared with cluster summaries]
MMR Ranking vs Standard IR • [Figure: documents around a query; MMR ranking compared with standard IR ranking; λ controls spiral curl]
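The trade-off that λ controls in MMR (Maximal Marginal Relevance) can be sketched in a few lines. This is a minimal illustration, not the system shown in the talk: the bag-of-words cosine similarity, the function names `bow`, `cosine`, and `mmr_rank`, and the toy documents are all assumptions made for the example. At each step the sketch picks the document that maximizes λ · sim(d, query) − (1 − λ) · max sim(d, already selected).

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector (an assumed, deliberately simple representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_rank(query, docs, lam=0.7):
    """Return document indices in MMR order.

    lam=1.0 reduces to plain relevance ranking; smaller values penalize
    documents that resemble ones already selected.
    """
    q = bow(query)
    vecs = [bow(d) for d in docs]
    remaining = list(range(len(docs)))
    selected = []
    while remaining:
        def score(i):
            relevance = cosine(vecs[i], q)
            redundancy = max(
                (cosine(vecs[i], vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    docs = [
        "jordan river water",
        "jordan river water",          # exact duplicate of the first document
        "michael jordan basketball",
    ]
    print(mmr_rank("jordan river", docs, lam=1.0))
    print(mmr_rank("jordan river", docs, lam=0.5))
```

With `lam=1.0` the duplicate document stays near the top because only relevance counts; lowering `lam` pushes it down in favor of the less similar document.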
… In The Right Level of Detail: Synthetic Document = Summary++ • Extractive combo (tracking, MMR, …) • Centrality of info • KIT model relevant • Novelty (vs last time) • Entities, relations, dates, … + raw text • Later: contradiction & attitude detection • Combine: CMU, IBM (NE + rel extraction), UMD (user model, summ), Stanford (contradiction detection) • [Figure labels: Entities, Relations, Audio transcripts, Textual summary, Texts (Eng, Arabic, Chinese, …), Analyst zoom-in, Novel, Attitude, Mixed sources]
… In the Right Language (MT) • [Figure: MT approaches between Source (Arabic) and Target (English). Labels: Interlingua; Semantic Analysis, Sentence Planning; Transfer Rules; Syntactic Parsing, Text Generation; Direct: EBMT, SMT]
EBMT example • English: I would like to meet her. • Mapudungun: Ayükefun trawüael fey engu. • English: The tallest man is my father. • Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw. • English: I would like to meet the tallest man • Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru • Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentru engu.
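The "new" output in the example comes from recombining fragments of stored translations. The sketch below reproduces that recombination step under stated assumptions: the fragment alignments in `FRAGMENTS` are hypothetical (inferred from the two example pairs, not taken from any real EBMT system), and the greedy longest-match strategy and the function name `translate_by_fragments` are chosen only for illustration.

```python
# Hypothetical fragment alignments, inferred from the two example pairs.
FRAGMENTS = {
    ("i", "would", "like", "to", "meet"): "Ayükefun trawüael",
    ("the", "tallest", "man"): "Chi doy fütra chi wentru",
}


def translate_by_fragments(sentence, fragments=FRAGMENTS):
    """Greedy longest-match recombination of stored translation fragments.

    Source tokens not covered by any fragment are passed through in
    angle brackets so that gaps stay visible.
    """
    tokens = sentence.lower().rstrip(".").split()
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest span starting at position i first.
        for j in range(len(tokens), i, -1):
            target = fragments.get(tuple(tokens[i:j]))
            if target is not None:
                output.append(target)
                i = j
                break
        else:
            output.append(f"<{tokens[i]}>")
            i += 1
    return " ".join(output)


if __name__ == "__main__":
    print(translate_by_fragments("I would like to meet the tallest man"))
```

Plain concatenation yields the "new" string from the example. It does not produce the "correct" form, whose verb form and word order differ, which is why fragment recombination alone is not enough.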
[Figure: MT research timeline (1986, 1991, 1993, 1996, 2000). Themes: Interlingua, Spoken Language, Multi Engine, Example Based, Statistical, Low Resource, Automatic MT Evaluation, Portable. Projects and systems: Letras, Avenue, MEMT, METEOR, Diplomat, Tongues, GEBMT, KANT, MT Lab, KBMT-89, JANUS, C-STAR I, Pangloss, RADD - MT/TIDES, GALE, Enthusiast, TransTac, C-STAR II, ThaiLator, Nespole, Lingwear, Semantic Annotation, Speechalator, Q & A, Extraction, CALL]
“Language of Life”: vocabulary chemical groups, properties of AA
Evolutionary Methods for Discovering Sequence-Structure Mapping • Distribution of amino acids • [Figure: a multiple sequence alignment across Human, Monkey, Mouse, Rat, Cow, Dog, Fly, Worm, Yeast; conserved properties across rhodopsin]
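One simple way to read conservation off a multiple sequence alignment is to score each column by how dominant its most common residue is. The sketch below is an illustration only: the scoring rule, the function name `column_conservation`, and the toy sequences are assumptions, not the method or data behind the slide.

```python
from collections import Counter


def column_conservation(alignment):
    """Per-column conservation: share of rows holding the most common residue.

    `alignment` maps a species name to its aligned sequence; "-" marks a gap.
    Gaps count toward the column size, so gappy columns score lower.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*alignment.values()):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


if __name__ == "__main__":
    # Toy sequences invented for the example; not real rhodopsin data.
    toy = {"human": "MKT-L", "mouse": "MKTAL", "fly": "MRSAL"}
    print(column_conservation(toy))
```

Columns that score close to 1.0 are candidates for conserved positions; a property-based variant would group amino acids by chemical class before counting.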
Results: β-Helical Rung Prediction • 1DBG: 10 out of 11 rungs correctly identified
Concluding Observations… and Exaggerations • Everything can be reduced to information • Information is the key to everything • All “natural” information has an underlying language (genomics, linguistics, …) • Information exists at all levels of granularity • Subatomic → DNA/proteins → society → … • Information + language + computation = lifetime employment