Data and Text Mining for Computational Biology Introduction
Course information • CS 6365 • Data and Text Mining for Computational Biology • Meets Tuesday and Thursday 7:00-8:15 pm at ECSS 2.412
Instructor • Vasileios Hatzivassiloglou • Associate Professor, Computer Science • Founding Professor, Bioengineering • Research focus: Discover knowledge from massive amounts of raw data • data not the same as information • information overload
Research Interests • Text analysis, machine learning, intelligent information retrieval, summarization, question answering, bioinformatics, medical informatics
Contact information • Office hours: Tuesday and Thursday 6:00-7:00 pm and by appointment • Office location: ECSS 3.406 • vh@hlt.utdallas.edu • (972) 883-4342 • Teaching Assistant: TBA
Course goals • Introduce the field of bioinformatics • Discuss primary techniques used for data mining • Introduce text mining and additional issues it brings to data mining methods • Use examples from computational biology
Intended audience • For both computer scientists and biologists • Not an easy task to balance the two • Focus on data and text mining algorithms and applications • Coverage of machine learning background • No extensive algorithmic analysis / computational complexity • Medium level of programming
Prerequisites • Officially CS 6325 – Introduction to Bioinformatics • Waived for this offering of the course • You should • Know basic data structures (multidimensional arrays, hash tables, binary trees) • Know one high-level programming language and be able to adapt to a new one as needed • Be able to install and use external software packages
You need not know • Molecular biology • Machine learning • Data mining (in general) • Text analysis / natural language processing • Information retrieval • Artificial intelligence
Course level • Introductory graduate course (MS or first-year PhD) • Maturity in programming and data structures at the level of a Computer Science senior • Ability to access the primary literature in a guided fashion, and interest in doing so
Course structure • 6 lectures on biological background and bioinformatics in general • 6 lectures on data similarity • 8 lectures on data mining methods • 3 lectures on text mining and knowledge mining methods • student presentations of research papers (3-4 sessions)
Expected work load • Two homework sets given in mid-to-late September and mid-to-late October • Two weeks to turn in each homework set • Mid-term exam in early October • Each student selects two or three research papers to review in late October • Student presentations of research papers in the last week of November / first week of December • Final exam
Course project • In lieu of the research papers and presentation, students may elect to work on a project in teams of two or three • Project is chosen by the students with the advice and consent of the instructor • Project investigation/implementation should be approximately 1.5-2 times the work required for a regular homework
Programming • Each student selects their own programming language (must be available at UTD and accessible to TA) • Examples: C, C++, Java, Perl, Python • Can also use a package/programming environment specifically tailored to bioinformatics
One likely package • R (http://www.r-project.org/) • R is the free alternative to S-Plus, the commercial implementation of the S language developed at AT&T Bell Labs • S-Plus is the extensible, programmable alternative to statistical packages like SAS and SPSS • If you know C, you will be right at home with R
Another likely package • BioPerl (http://bio.perl.org/) • A collection of Perl library modules written by and for bioinformaticians • Perl provides hashes as a basic data structure, along with built-in string matching and regular expressions • Perl is weak at object-oriented programming and run-time efficiency • Easy to learn
Grading • Class participation: 20% • Homework assignments: 30% (total) • Midterm: 10% • Research paper presentation or project: 20% • Final exam: 20%
Textbooks • No good integrated textbook on data mining from a computational biology perspective • We will use one textbook covering bioinformatics algorithms and another textbook on data mining in general, plus additional chapters from other books and research articles • Copies of chapters / research articles will be provided
Recommended textbook #1 • “An Introduction to Bioinformatics Algorithms (Computational Molecular Biology)”, by Neil C. Jones and Pavel A. Pevzner, MIT Press, 2004. • ISBN 0262101068 • 448 pages • Available on Amazon.com for $41, Barnes and Noble for $60
Recommended textbook #2 • “Data Mining: Concepts and Techniques” by Jiawei Han and Micheline Kamber, Elsevier, second edition, 2006. • ISBN 1558609016 • 800 pages • Available on Amazon.com for $52, Barnes and Noble for $65
Supplementary textbooks • “Bioinformatics: The Machine Learning Approach” by Pierre Baldi and Søren Brunak, 2nd edition, 2001. • “Data Mining: Multimedia, Soft Computing, and Bioinformatics” by Sushmita Mitra and Tinku Acharya, 2003. • Both of the above are available as full-text eBooks via http://library.utdallas.edu.
Background reading • Biology: “Molecular Biology of the Cell” by Bruce Alberts et al., 4th edition, 2002. • Machine learning: “Machine Learning” by Tom Mitchell, 1997.
Background reading (II) • Statistics: “The Elements of Statistical Learning: Data Mining, Inference, and Prediction” by Trevor Hastie, Robert Tibshirani and Jerome Friedman, 2001. • Data structures and algorithms: “Introduction to Algorithms” by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein, 2nd edition, 2001.
So what is it all about? • Three parts: • Bioinformatics / computational biology • Data mining • Text mining
Bioinformatics • A fast developing discipline • We will discuss • basic concepts of molecular biology • databases of biological data • structure and function of DNA, RNA, proteins • sequence searching (BLAST) • sequence similarity and comparison • protein structure (2D and 3D) • protein motifs and patterns • microarrays • phylogenetics
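To give a flavor of the "sequence similarity and comparison" topic listed above, here is a minimal Python sketch of edit distance between two sequences, computed by dynamic programming. The unit costs for insertion, deletion, and substitution and the function name `edit_distance` are assumptions made for illustration; they are not taken from the course materials, and biological alignment methods typically use more refined scoring schemes.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between strings a and b.

    Assumes unit cost for each insertion, deletion, and substitution
    (an illustrative choice, not a biologically motivated scoring scheme).
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1       # drop ca from a
            insertion = curr[j - 1] + 1  # insert cb into a
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Two short DNA fragments differing by one deleted base.
    print(edit_distance("ACGT", "AGT"))
```

Only two rows of the dynamic programming table are kept at a time, so memory grows with the length of the second sequence rather than with the product of both lengths.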
Data mining • Given a large amount of data of known types, extract useful information • We will discuss • data cleanup and outliers • model construction • data and dimensionality reduction • classification • prediction / probability estimation • clustering • measuring performance
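As one concrete instance of "measuring performance" from the list above, the following Python sketch computes precision, recall, and F1 for a binary classifier. The function name `precision_recall_f1`, the label encoding (1 for the positive class), and the convention of returning 0.0 when a denominator is zero are all assumptions made for this example; the course may use other measures.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class treated as 'positive'.

    Returns 0.0 for any measure whose denominator is zero
    (an assumed convention for this sketch).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical gold labels and classifier predictions.
    gold = [1, 1, 0, 0, 1]
    predicted = [1, 0, 0, 1, 1]
    print(precision_recall_f1(gold, predicted))
```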
Text mining • Not only do we have a large amount of raw data, but we also do not know what each item means • We will discuss: • tokenization and basics of text processing • recognition of terms and entities • classification • dictionary creation • relationship learning and extraction • document level clustering and information retrieval
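The "tokenization" item above can be illustrated with a small Python sketch. It assumes one simple rule: a token is a run of letters and digits, optionally joined by internal hyphens, so that names such as "IL-2" stay in one piece. The pattern and the function name `tokenize` are hypothetical choices for this example; real biomedical tokenizers handle many more cases.

```python
import re
from typing import List

# Assumed rule: alphanumeric runs, optionally linked by single hyphens.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(text: str) -> List[str]:
    """Split text into tokens, keeping hyphenated names together.

    Case is preserved because gene and protein names are often
    case-sensitive.
    """
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenize("IL-2 activates NF-kappaB in T cells."))
```

Keeping hyphenated names intact at this stage makes the later "recognition of terms and entities" step simpler, at the cost of occasionally gluing together words that should be separate.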