EECS 800 Research Seminar: Mining Biological Data • Instructor: Luke Huan • Fall, 2006
Administrative • Register for 3 hours of credit
Me • Luke Huan, assistant prof. in Electrical Engineering & Computer Science • Homepage: http://people.eecs.ku.edu/~jhuan/ • Office: 2304 Eaton Hall • Email: jhuan@eecs.ku.edu • Office hours: 10:00 – 11:00am Monday and Wednesday
My Lecture Style • I tend to talk fast, especially when excited • Class materials are highly interdisciplinary • Use your questions to slow me down • Ask for clarification, or for repetition of a strange phrase or piece of jargon • “If in doubt, speak it out”
You • Introduction: • Who you are • What department you are in • Why you are taking the course
Outline for Today • What is mining biological data? • What is this course about? • Course home page • Course references • Paper presentation • Final project • Grading • Forward class reviewing
What is Mining Biological Data? • Goal: understanding the structure of biological data • Patterns • Descriptive models • Predictive models • Challenges: • What is the nature of the data? • What are the computational tasks? • How to break a task into a group of computational components? • How to evaluate the computational results? • Applications • Experimental design and hypothesis generation • Synthesis of novel proteins • Drug design • …
What is this Course About? • Learning… • Problems in mining biological data • Available techniques, their pros and cons • How to combine techniques • Enough perspective to avoid pitfalls • Practicing… • To present recent papers on a selected topic • To work on a project that may involve • A domain expert, • A driving biological problem, and • The development of new data mining techniques
Class Information • Class Homepage: http://people.eecs.ku.edu/~jhuan/fall06.html • Meeting time: 9:00 – 9:45 Monday, Wednesday, Friday • Meeting place: Eaton Hall 2001 • Prerequisite: none
Textbook & References • Textbook: none • References • Data Mining: Concepts and Techniques, by Han and Kamber, Morgan Kaufmann, 2001. (ISBN: 1-55860-489-8) • The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Hastie, Tibshirani, and Friedman, Springer, 2001. (ISBN: 0-387-95284-5) • Bioinformatics: Genes, Proteins, and Computers, edited by Christine Orengo, David Jones, Janet Thornton, Bios Scientific Publishers, 2003. (ISBN: 1-85996-054-5)
Paper Presentation • One per student • Research paper(s) • A list of recommendations will be posted on the class webpage a week from now • Your own pick (upon approval) • Three parts • Review the goal of the paper(s) • Discuss the research challenges • Present the techniques and comment on their pros and cons • Questions and comments from the audience • Extra credit for active participants in class discussions • Order of presentation: first come, first pick • Please send in your choice of paper by September 1st.
Final Project • Project (due Nov. 27th) • One project • I will post some suggestions on the class website. • I am soliciting projects from researchers on campus • You are welcome to propose your own • Discuss with me before you start • Checkpoints • Proposal: title and goal (due Sep. 8th) • Background and related work (due Sep. 29th) • Outline of approach (due Oct. 20th) • Implementation & Evaluation (due Nov. 10th) • Class demo (due Nov. 27th)
Grading • Grading scheme • No homework • No exam
Forward Class Reviewing • This is for overview, not content • Don’t worry if you do not understand some of the words; that’s why you want to take this class. • Gives an idea of what is coming • Order of presentation might be shuffled to accommodate everyone’s schedule • Topics may be adjusted as the class progresses
Week 1: Pattern Mining • Frequent patterns: finding regularities in data • Frequent patterns (sets of items) are ones that occur frequently in a data set • Can we automatically profile customers? • What products are often purchased together? • [Figure: table of customers and their shopping baskets] • One hypothesis: {a, c} ⇒ {m}
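To make the shopping-basket question concrete, here is a minimal sketch that counts itemsets by brute force and checks how often baskets containing {a, c} also contain {m}. The baskets, the support threshold, and the function names are made up for illustration; this is plain enumeration, not Apriori or any algorithm the course prescribes.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (toy data, not from the slides)
baskets = [
    {"a", "c", "m"},
    {"a", "c", "m", "p"},
    {"a", "b"},
    {"a", "c", "m"},
    {"b", "c"},
]

def frequent_itemsets(transactions, min_support, max_size=3):
    """Count every itemset of up to max_size items; keep those seen >= min_support times."""
    counts = Counter()
    for t in transactions:
        for k in range(1, max_size + 1):
            for items in combinations(sorted(t), k):
                counts[frozenset(items)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

def confidence(transactions, lhs, rhs):
    """Fraction of the transactions containing lhs that also contain rhs."""
    lhs, rhs = set(lhs), set(rhs)
    with_lhs = [t for t in transactions if lhs <= t]
    return sum(1 for t in with_lhs if rhs <= t) / len(with_lhs)

freq = frequent_itemsets(baskets, min_support=3)
rule_conf = confidence(baskets, {"a", "c"}, {"m"})
print(sorted("".join(sorted(s)) for s in freq))
print(rule_conf)
```

Brute-force enumeration grows exponentially with basket size, which is one reason dedicated frequent-pattern algorithms exist.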
Week 2: Advanced Pattern Mining • Reducing number of patterns • Maximal patterns and closed patterns • Constraint-based mining • Patterns with concept hierarchy • Patterns in quantitative data • Correlation vs. association
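As a rough illustration of how maximal and closed patterns reduce the number of reported patterns, the sketch below filters a table of frequent itemsets. It uses the standard definitions: a frequent itemset is closed if no proper superset has the same support, and maximal if no proper superset is frequent. The support table is hypothetical.

```python
# Hypothetical support counts for all frequent itemsets of a toy data set
freq = {
    frozenset("a"): 4,
    frozenset("c"): 4,
    frozenset("m"): 3,
    frozenset("ac"): 3,
    frozenset("am"): 3,
    frozenset("cm"): 3,
    frozenset("acm"): 3,
}

def closed_and_maximal(freq):
    """Split frequent itemsets into closed and maximal ones.

    closed:  no proper superset has the same support
    maximal: no proper superset is frequent at all
    """
    closed, maximal = set(), set()
    for s, sup in freq.items():
        supersets = [t for t in freq if s < t]  # proper frequent supersets
        if not supersets:
            maximal.add(s)
        if all(freq[t] < sup for t in supersets):
            closed.add(s)
    return closed, maximal

closed, maximal = closed_and_maximal(freq)
print(len(freq), len(closed), len(maximal))
```

In this toy table, seven frequent itemsets shrink to three closed ones and a single maximal one; closed itemsets still allow every support count to be recovered, while maximal ones do not.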
Week 3: Mining Microarray Data • [Figure] from: Spellman, P. T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D. and Futcher, B. (1998), “Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization”, Molecular Biology of the Cell, 9, 3273-3297.
Week 4: Patterns in Sequences, Trees, and Graphs • [Figure: example labeled graphs G1, G2, G3 and patterns P1 to P6, each annotated with its frequency f (e.g. f = 2/3, f = 3/3)]
Week 5: Pattern Discovery in Biomolecules • Protein • A sequence over 20 types of amino acids • Adopts a stable 3D structure that can be measured experimentally • [Figure: protein structure renderings (cartoon, space filling, ribbon, surface); atom legend: oxygen, nitrogen, carbon, sulfur; residue labels: Lys, Gly, Leu, Val, Ala, His]
Week 6: Descriptive Models • Group objects into clusters • Ones in the same cluster are similar • Ones in different clusters are dissimilar • Unsupervised learning: no predefined classes • [Figure: scatter plot showing Cluster 1, Cluster 2, and outliers]
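One common way to group objects so that members of a cluster are similar is k-means (Lloyd's algorithm). The sketch below is a minimal pure-Python version on made-up 2-D points with hand-picked starting centers; it is only an illustration, and the slides do not say which clustering methods the course covers.

```python
def kmeans(points, centers, iters=10):
    """Lloyd's algorithm for 2-D points, starting from the given centers."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean distance)
            j = min(
                range(len(centers)),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            clusters[j].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical data: two well-separated groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

No class labels are used anywhere, which is what makes this unsupervised. Plain k-means assigns every point to some cluster, so it does not flag outliers on its own.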
Week 8: Mining Microarray (II) • Apply subspace clustering to microarray analysis • Find groups of genes that are co-regulated • May integrate data from protein sequences and functional descriptions of genes • Apply subgraph mining to microarray analysis
Week 9: Predictive Models • Two-class version: • Using “training data” from Class +1 and Class -1 • Develop a “rule” for assigning new data to a class • Slides from J.S. Marron in Statistics at UNC
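As one simple instance of such a “rule”, the sketch below fits a nearest-centroid classifier: it averages the training points of each class and assigns a new point to the class with the closer mean. The training data and function names are hypothetical, and this is just one of many possible rules, not necessarily the one presented in class.

```python
def fit_centroids(X, y):
    """Return the mean vector of the training points in each class (+1 and -1)."""
    cents = {}
    for label in (+1, -1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        cents[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def predict(cents, x):
    """Assign x to the class whose centroid is closer (ties go to +1)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return +1 if dist2(cents[+1]) <= dist2(cents[-1]) else -1

# Hypothetical training data from Class +1 and Class -1
X = [(2.0, 2.0), (3.0, 3.0), (-2.0, -1.0), (-3.0, -2.0)]
y = [+1, +1, -1, -1]

cents = fit_centroids(X, y)
print(predict(cents, (1.0, 1.0)), predict(cents, (-1.0, -1.0)))
```

The resulting decision boundary is the perpendicular bisector of the segment joining the two class means, so the rule is linear.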
Week 10: Classification Algorithms and Applications • Decision tree • Fisher’s linear discriminant • Kernel methods
Week 11: Text Mining, Gene Ontology, Data Management • Ontology seeks to describe or posit the basic categories and relationships of being or existence to define entities and types of entities within its framework. Ontology can be said to study conceptions of reality (Wikipedia). • GO is a database of terms for genes • Terms are connected as a directed acyclic graph • Levels represent specificity of the terms (not normalized) • GO contains three different sub-ontologies: • Molecular function • Biological process • Cellular component
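To illustrate terms connected as a directed acyclic graph, here is a small sketch with placeholder term names (they are not real GO terms). It collects all ancestors of a term and lists the lengths of every path up to the root, which shows why a term's level is not a single normalized number.

```python
# Hypothetical mini-ontology: each term maps to its parent terms.
# It is a DAG rather than a tree, so a term may have several parents.
parents = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "B1": ["B"],
    "AB": ["A1", "B"],
}

def ancestors(term, parents):
    """All terms reachable by following parent links upward from term."""
    seen = set()
    stack = list(parents[term])
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents[t])
    return seen

def path_lengths(term, parents):
    """Set of lengths of all distinct paths from term up to a root."""
    if not parents[term]:
        return {0}
    return {1 + n for p in parents[term] for n in path_lengths(p, parents)}

print(sorted(ancestors("AB", parents)))
print(sorted(path_lengths("AB", parents)))
```

Here the term AB sits two steps from the root along one path and three steps along another, so its "level" depends on the path taken.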
Week 12: Systems Biology & Proteomics • A proteome is the set of all proteins in an organism • [Figure: part of the biological system in a cell at the molecular level] • Source: http://www.ircs.upenn.edu/modeling2001/
Week 13: Analyzing Biological Networks • Biological networks pose serious challenges and opportunities for data mining research in computer science • Large volume of data • Heterogeneous data types • [Figure: Protein-protein interaction in yeast] • [Figure: Growth of Known Structures in Protein Data Bank (PDB); axes: # of structures vs. year; top tick 35,000] • Gary D. Bader & Christopher W.V. Hogue, Nature Biotechnology 20, 991 - 997 (2002)
Week 14: Bio-Data Integration • Data are collected from many different sources • Each piece of data describes part of a complicated (and not directly observable) biological process • Combine the data to achieve better understanding and better prediction
Week 15, 16: Project Presentation • Check what you have learned from the class • Celebrate the hard work!
Further References • Data mining • Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc. • Journals: Data Mining and Knowledge Discovery, ACM-TKDD • Bioinformatics • Conferences: ISMB, RECOMB, PSB, CSB, BIBE, etc. • Journals: Bioinformatics, J. of Computational Biology, etc.
Further References (cont.) • AI & Machine Learning • Conferences: Machine Learning (ICML), AAAI, IJCAI, etc. • Journals: Machine Learning, Artificial Intelligence, etc. • Statistics • Conferences: Joint Stat. Meeting, etc. • Journals: Annals of Statistics, etc. • Database systems • Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, etc. • Journals: ACM-TODS, IEEE-TKDE, etc. • Visualization • Conference proceedings: IEEE Visualization, ACM-SIGGRAPH, etc. • Journals: IEEE Trans. Visualization and Computer Graphics, etc.