
Data Integration and Information Retrieval: Moving from the Ivory Tower into the Corporate Office


Presentation Transcript


  1. Data Integration and Information Retrieval: Moving from the Ivory Tower into the Corporate Office. Tae W. Ryu, Department of Computer Science, California State University, Fullerton

  2. Summary of Today’s Talk • Past and current research activities • Data integration and information retrieval • Commercial application to the real estate business by Mr. Shin • Questions & answers

  3. A Bioinformatics Project at CSUF • A bioinformatics research group (BIG) involving several faculty members and students from Computer Science, Biology, Biochemistry, and Mathematics at CSUF and Pomona College in Claremont started in 2001 • Bioinformatics is the study of biological systems using computers.

  4. DNA: The Molecule of Life

  5. DNA (Deoxyribonucleic Acid) • DNA is double-stranded • Base pairs (A-T, G-C) are complementary, known as Watson-Crick base pairs • A double-stranded DNA sequence can be represented by strings of letters (1D) read in either direction • 5' ... TACTGAA ... 3' • 3' ... ATGACTT ... 5' • DNA length is measured in base pairs (e.g., 100 kbp)
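
Because the two strands are complementary, either strand can be derived from the other mechanically. A minimal Python sketch, assuming nothing beyond the base-pairing rule on the slide (the function name and example call are illustrative):

```python
# Reverse complement of a DNA string: complement each base (A<->T, G<->C)
# and reverse the result, giving the other strand read 5' -> 3'.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

if __name__ == "__main__":
    # 5'-TACTGAA-3' pairs with 3'-ATGACTT-5', i.e. 5'-TTCAGTA-3' on the other strand.
    print(reverse_complement("TACTGAA"))  # -> TTCAGTA
```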

  6. Genes and the Genetic Code • What are genes? • Specific sequences of nucleotides (A, T, G, C) along a chromosome carrying the information for constructing a protein • Who defined the concept of a gene? • Mendel, in the 1860s (DNA was elucidated 75 years later) • What is the genetic code? • 3 base pairs in a gene = a codon (representing one amino acid) • A genome is the complete set of chromosomes.
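
Since each codon maps to one amino acid, translating a coding sequence is a table lookup. A minimal sketch with a deliberately partial codon table (only a few entries are listed for illustration; a real implementation would use the full 64-entry genetic code):

```python
# Translate a coding DNA sequence codon-by-codon using a (partial) genetic code table.
CODON_TABLE = {
    "ATG": "M",  # Methionine (start)
    "TGG": "W",  # Tryptophan
    "GAA": "E",  # Glutamate
    "TAA": "*",  # Stop
}

def translate(cds: str) -> str:
    protein = []
    for i in range(0, len(cds) - 2, 3):       # step through codons
        codon = cds[i:i + 3].upper()
        aa = CODON_TABLE.get(codon, "X")      # "X" = codon not in this partial table
        if aa == "*":                         # stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGTGGGAATAA"))  # -> "MWE"
```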

  7. Non-coding Regions in DNA • [Diagram: genes separated by intergenic regions (non-coding) and introns] • Over 90% of the human genome is non-coding sequence (intergenic regions, or "junk DNA"). • The role of these regions is not yet known, but it is speculated to be very important.

  8. Our Project Goal • Understand the importance and roles of the non-coding regions (intergenic regions) in DNA • Build a high-quality integrated data source for the non-coding sequences (intergenic regions) in eukaryotic genomes • Seek pilot projects for bioinformatics research and education at CSUF

  9. Bioinformatics and Integrated Biological Data • The major task for bioinformaticians is to make sense out of the biological data • Typical tasks • Modeling, sequence to structure or functional class, structure to function or mechanism • How? • Biology-oriented approach: • Experiment and DNA manipulation in a wet lab • Computer-oriented approach: • Data mining, pattern recognition and discovery, prediction models, simulation, etc. • Success in most bioinformatics research requires • An integrated view of all the relevant data • High-quality genomic sequence data and other relevant data • The results of analyses, such as patterns produced by other researchers • A user-friendly and powerful information retrieval tool • Data analysis and interpretation • Data analysis by data mining and statistical approaches • Interpretation by biologists (with strong domain knowledge)

  10. Obstacles to Data Integration • Data is spread over multiple, heterogeneous data sources • Databases (MySQL, Oracle, SQL Server, etc.) • Semi-structured sequence files (text or XML) • HTML pages on Web sites • Output of analysis programs (BLAST, PFAM, etc.) • Not all sources represent the biology optimally • Semantics of sources can differ widely • GenBank is sequence-centric, not gene-centric • SwissProt is sequence-centric, not domain-centric • Sources use different terms and definitions • Biological ontologies are being built now • Lack of standards in data representation • XML is emerging as a standard for data transfer
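
As one small example of taming a semi-structured text source, here is a sketch of the kind of reader a source wrapper might use for a FASTA-style sequence file before integration. The record layout and file name are assumptions for illustration, not the project's actual code:

```python
# Parse a FASTA-style text file into simple (id, description, sequence) records,
# the sort of normalization a source wrapper performs before integration.
from typing import Iterator, Tuple

def read_fasta(path: str) -> Iterator[Tuple[str, str, str]]:
    seq_id, desc, chunks = None, "", []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):                 # header line starts a new record
                if seq_id is not None:
                    yield seq_id, desc, "".join(chunks)
                parts = line[1:].split(maxsplit=1)
                seq_id = parts[0] if parts else "unnamed"
                desc = parts[1] if len(parts) > 1 else ""
                chunks = []
            elif line:
                chunks.append(line)                  # sequence lines; blanks ignored
    if seq_id is not None:
        yield seq_id, desc, "".join(chunks)

# Usage (assuming a local file "sequences.fasta"):
# for seq_id, desc, seq in read_fasta("sequences.fasta"):
#     print(seq_id, len(seq))
```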

  11. More Obstacles • Poor data quality (errors) and incomplete data • due to errors in the lab • due to the large amount of data that is computer-generated using heuristic algorithms • Data in the original data sources keeps changing • This is a really challenging problem that requires in-depth knowledge of both computer science and molecular biology • Several approaches are possible (cross-validation, repeating experiments), but they remain limited

  12. Possible Approaches • Database approach (conventional) • Relational or object-oriented database • Data warehouse (or data mart) • A data warehouse maintains integrated, high-quality, current (or historical), and consistent data. • A data mart is a small-scale data warehouse • Often an important prerequisite for sophisticated data mining • Ideal approach (a future system) • A comprehensive information management system with all of the above components plus a powerful search engine and intelligent information retrieval based on text mining
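
To make the database/data-mart option concrete, a minimal sketch of what one table of an intergenic data mart might look like in SQLite; the table and column names are invented for illustration and are not the actual IGDB schema:

```python
# Minimal relational sketch of an intergenic-region table in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE intergenic_region (
        region_id       INTEGER PRIMARY KEY,
        organism        TEXT NOT NULL,        -- e.g. 'S. cerevisiae'
        chromosome      TEXT NOT NULL,
        start_pos       INTEGER NOT NULL,     -- 1-based coordinates
        end_pos         INTEGER NOT NULL,
        upstream_gene   TEXT,                 -- flanking gene identifiers
        downstream_gene TEXT,
        sequence        TEXT NOT NULL
    )
""")
conn.execute(
    "INSERT INTO intergenic_region VALUES "
    "(1, 'S. cerevisiae', 'IV', 100, 180, 'YDL001W', 'YDL002W', 'TATA...')"
)
rows = conn.execute(
    "SELECT organism, end_pos - start_pos + 1 AS length FROM intergenic_region"
).fetchall()
print(rows)   # -> [('S. cerevisiae', 81)]
```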

  13. Virtual Intergenic Data Warehouse • [Architecture diagram: source databases (GenBank, Swiss-Prot, PROSITE, EPD, TRANSFAC, and others) are accessed through wrappers and mediators; a data extraction, cleansing, and reconciliation process, guided by metadata, builds the Intergenic Data Warehouse; multi-dimensional views and cubes feed statistical and data mining tools; transformed data sets and a user interface sit on top for data mining.]
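
To give a rough feel for the wrapper/mediator layer in the diagram above, the sketch below adapts two sources to one query interface and merges their answers. All class and method names (and the placeholder results) are invented for illustration, not the project's actual code:

```python
# Wrapper/mediator sketch: wrappers adapt individual sources to one interface,
# and the mediator fans a query out to all wrappers and combines the results.
from abc import ABC, abstractmethod
from typing import Dict, List

class SourceWrapper(ABC):
    @abstractmethod
    def find_sequences(self, gene_name: str) -> List[Dict]:
        """Return records for a gene in a source-independent format."""

class GenBankWrapper(SourceWrapper):
    def find_sequences(self, gene_name: str) -> List[Dict]:
        # Placeholder: a real wrapper would query GenBank and map its fields.
        return [{"source": "GenBank", "gene": gene_name, "sequence": "ATG..."}]

class SwissProtWrapper(SourceWrapper):
    def find_sequences(self, gene_name: str) -> List[Dict]:
        # Placeholder: a real wrapper would query Swiss-Prot and map its fields.
        return [{"source": "Swiss-Prot", "gene": gene_name, "protein": "M..."}]

class Mediator:
    def __init__(self, wrappers: List[SourceWrapper]):
        self.wrappers = wrappers

    def query(self, gene_name: str) -> List[Dict]:
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.find_sequences(gene_name))
        return results  # a real mediator would also reconcile/deduplicate here

mediator = Mediator([GenBankWrapper(), SwissProtWrapper()])
print(mediator.query("cdc2"))
```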

  14. Current Progress • Intergenic Database (IGDB version 1.1) • Integrated from GenBank for the Caenorhabditis elegans (nematode), Saccharomyces cerevisiae (baker’s yeast), and Arabidopsis thaliana (mouse-ear cress) genomes • Mouse, mosquito, and human genomes are under way • Pattern Summary System (PATSS) • Summarizes the sequence patterns generated by BLAST • Pattern visualization with alignment tools • Distributed BLAST using Web services and clustered computers • Ontology-based data integration • Intelligent wrappers and mediators • A structure description language for data extraction • A powerful information retrieval system based on a customized search engine with the support of text mining • Web crawlers and a customized search engine, document indexing • Text mining, natural language processing

  15. Search Engine: How Does It Work? • [Crawler architecture diagram: a persistent global work pool of URLs feeds per-server queues through a URL approval guard (which handles spider traps and robots.txt); "isUrlVisited?" and "isPageKnown?" checks consult the crawl metadata; a caching DNS resolver with an asynchronous UDP prefetch client resolves host names; page-fetching threads wait for DNS and an available HTTP socket, then send and receive; a hyperlink extractor and normalizer (handling relative links, links embedded in scripts, and images) pushes fresh work back into the pool; fetched pages go to the text repository and index for text indexing and other analyses; a load monitor and work-thread manager oversees the process.]
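
To make the crawl loop concrete, a minimal sketch of the core cycle (pop a URL, fetch it, extract and normalize links, skip visited URLs, enqueue fresh work). It deliberately omits the DNS caching, per-server politeness queues, and robots.txt handling shown in the diagram, and the seed URL is only an example:

```python
# Minimal breadth-first crawler loop over a work pool of URLs.
import re
from collections import deque
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen

LINK_RE = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

def crawl(seed: str, max_pages: int = 10):
    work_pool = deque([seed])      # persistent global work pool of URLs
    visited = set()                # "isUrlVisited?" check
    pages = {}
    while work_pool and len(pages) < max_pages:
        url = work_pool.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue               # skip unreachable pages
        pages[url] = html          # hand the page to the text repository/indexer
        for href in LINK_RE.findall(html):
            link, _ = urldefrag(urljoin(url, href))   # normalize relative links
            if link.startswith("http") and link not in visited:
                work_pool.append(link)                # fresh work
    return pages

# pages = crawl("https://example.com")   # seed URL is illustrative
```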

  16. Search Engine for Web Data Integration and Retrieval • [Indexing pipeline diagram: a fresh batch of documents yields (d, t) postings (d: document id, t: token id); a batch sort produces (t, d) pairs, which a merge-purge step builds into a compact main index (possibly held partly in RAM, preserving the sorted sequence); new or deleted documents generate (d, t, s) postings (s: a bit specifying whether the document has been deleted or inserted), which are batch-sorted into a fast but less compact stop-press index; the query processor consults both indexes for the user, and query logs feed text mining.]
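
A toy version of this pipeline: tokenize documents into (token, doc-id) postings, sort and merge them into an inverted index, and overlay a small "stop-press" structure for recently inserted or deleted documents. The example documents and field names are illustrative only:

```python
# Build a small inverted index from sorted (token, doc_id) postings, plus a
# stop-press overlay recording recently inserted/deleted documents.
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """docs: {doc_id: text}. Returns {token: sorted list of doc_ids}."""
    postings = sorted((t, d) for d, text in docs.items() for t in tokenize(text))
    index = defaultdict(list)
    for token, doc_id in postings:            # merge the sorted (t, d) pairs
        if not index[token] or index[token][-1] != doc_id:
            index[token].append(doc_id)
    return index

main_index = build_index({1: "data integration", 2: "information retrieval"})
stop_press = {"deleted": {2}, "inserted": build_index({3: "text mining"})}

def query(token):
    hits = set(main_index.get(token, [])) - stop_press["deleted"]
    hits |= set(stop_press["inserted"].get(token, []))
    return sorted(hits)

print(query("information"))   # -> [] because doc 2 was deleted
print(query("mining"))        # -> [3] from the stop-press index
```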

  17. What is Text Mining? • Text mining is the process of extracting interesting/useful patterns from text documents (defined in 1997 by the data mining community). • Text is the most natural form of storing and exchanging information • Very high commercial potential • Studies indicate that about 80% of a company’s information is contained in text documents such as emails, memos, reports, etc. • Applications • Customer profile analysis • mining incoming emails for customer complaints and feedback • Information dissemination • organizing and summarizing trade news and reports for personalized information services • Security • email or message scanning, spam blocking • Patent analysis • analyzing patent databases for major technology players and trends • Extracting specific information from the Web (Web mining) • More powerful and intelligent search engines

  18. Text Mining Framework • [Pipeline: text documents → document retrieval → information extraction → information mining → interpretation] • Information extraction: machine-readable dictionaries and lexical knowledge bases are essential. • Fact extraction: • pattern matching, lexical analysis, syntactic and semantic structure • Fact integration and knowledge representation • Information mining: mostly based on data mining and machine learning techniques • Episodes and episode rules • Conceptual clustering and concept hierarchies • Text categorization • clustering, classification (machine learning approach) • Text summarization • Visualization • Natural language processing (very computationally expensive) • Commercial products (mostly for categorization, summarization, and visualization) • iMiner (IBM), TextWise (Syracuse), cMap (Canis), etc.
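
As a concrete instance of the text-categorization step, a tiny bag-of-words naive Bayes classifier in the spirit of the email complaint/feedback application mentioned earlier. The categories and training sentences are made up for illustration; a production system would use lexical resources and far more data:

```python
# Tiny bag-of-words naive Bayes text categorizer with add-one smoothing.
import math, re
from collections import Counter, defaultdict

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

train = [  # illustrative training examples only
    ("complaint", "the product arrived broken and support never replied"),
    ("complaint", "very disappointed with the late delivery"),
    ("feedback",  "great service and the new feature works well"),
    ("feedback",  "thanks for the quick and helpful response"),
]

class_docs = Counter(label for label, _ in train)
word_counts = defaultdict(Counter)
for label, text in train:
    word_counts[label].update(tokens(text))
vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    best_label, best_score = None, float("-inf")
    for label in class_docs:
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = math.log(class_docs[label] / len(train))
        total = sum(word_counts[label].values())
        for w in tokens(text):
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("the delivery was broken and support is late"))  # likely "complaint"
```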

  19. Future Information Management System • [Diagram: Web documents and text documents from the World Wide Web (Internet) are indexed into databases or a data warehouse; a search engine, ontologies, text mining, and data mining operate on top of this store, delivering results to users through browsers and customized windows.]

  20. Techniques Used for the Real Estate Business by Mr. Shin • Data integration from multiple data sources • Database integration • Information extraction from the Web using a Web crawler • Customized search engine with the support of text mining • User-friendly information retrieval tool

  21. Thank You.
