Development of a Chicken Unigene Database
Project No. 9
Ruoming Jin, Lilian Lacoste, Jianshan Tang
Department of CIS, University of Delaware
Animal Science Dept., University of Delaware
DBI - French National School of Aeronautics and Space
Mentors: Dr. Wellington Martins, Dr. Joan Burnside
Results
Phrap clustering result: 17,090 ESTs grouped by Phrap into 9,205 clusters
• 2,815 contigs
• 6,390 singlets
Second clustering method: using BLAST output
[Diagram: Contig 1 and Contig 2 each have a BLAST output (BLAST output 1, BLAST output 2). The outputs go through filtering, parsing, and comparing; a similarity function then fills the similarity matrix.]
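The slide does not say which similarity function was used. As a minimal sketch, assume each contig's filtered and parsed BLAST output has been reduced to a set of hit identifiers, and that two contigs are compared with the Jaccard index of those sets. The function names, the choice of Jaccard, and the sample hit sets below are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(hits_a, hits_b):
    """Jaccard index of two sets of BLAST hit identifiers (assumed similarity function)."""
    union = hits_a | hits_b
    if not union:
        return 0.0
    return len(hits_a & hits_b) / len(union)


def similarity_matrix(hits_by_contig):
    """Return (contig_ids, matrix); matrix[i][j] is the similarity of contigs i and j."""
    ids = sorted(hits_by_contig)
    n = len(ids)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0  # a contig is treated as identical to itself
    for i, j in combinations(range(n), 2):
        score = jaccard(hits_by_contig[ids[i]], hits_by_contig[ids[j]])
        matrix[i][j] = matrix[j][i] = score
    return ids, matrix


# Hypothetical, hand-made hit sets (not project data).
hits = {
    "contig_a": {"H1", "H2", "H3"},
    "contig_b": {"H1", "H2"},
    "contig_c": {"H9"},
}
ids, matrix = similarity_matrix(hits)
```

Thresholding this matrix (an entry above the threshold becomes an edge) would give a graph that a graph-based clustering step could consume.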
What's "gbc"?
• Graph Based Clustering
  • Clustering: the process of partitioning a set of data (or objects) into meaningful sub-classes, called clusters
  • Graph: the relations among the data can be expressed as a graph
  • If two nodes are related, an edge connects them
• Applications in bioinformatics
  • Protein sequence clustering
  • EST clustering
  • Many other applications
• Objectives of "gbc"
  • Support different input formats
  • Efficiently cluster very large sparse graphs
  • Be flexible for the user
How to use "gbc"
• Output
  • Cluster number and all the nodes that belong to each cluster
• Clique clustering
  • A clique is a completely connected subgraph
  • Each maximal clique in the graph becomes a cluster
  • Clusters may overlap
  • Generally produces small but very tight clusters
• Single-link clustering
  • A maximal connected subgraph becomes a cluster
  • Produces larger but weaker clusters
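The two definitions above can be sketched in a few lines. This is not the "gbc" source code: it is an illustration that finds connected components with a depth-first search (single-link) and maximal cliques with the Bron-Kerbosch algorithm with pivoting (clique). All function names and the sample graph are assumptions.

```python
def build_adjacency(nodes, edges):
    """Undirected adjacency sets; self-loops are ignored."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def single_link_clusters(adj):
    """Each maximal connected subgraph (connected component) becomes one cluster."""
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(component)
    return clusters


def clique_clusters(adj):
    """Each maximal clique becomes one cluster; clusters may overlap."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-explored nodes
        if not p and not x:
            cliques.append(set(r))
            return
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)

    expand(set(), set(adj), set())
    return cliques


# Hypothetical graph: a triangle a-b-c, a tail c-d, and an isolated node e.
adj = build_adjacency("abcde", [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
single = sorted(sorted(c) for c in single_link_clusters(adj))
cliques = sorted(sorted(c) for c in clique_clusters(adj))
```

On this graph, single-link yields the large cluster {a, b, c, d} plus {e}, while clique clustering yields the tighter, overlapping clusters {a, b, c} and {c, d} plus {e}. The recursive clique search is fine for small examples; a very large graph would need an iterative version.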
A little about the implementation
• Two clustering algorithms
  • Single-link
  • Clique
• Graph classes
  • Efficiently support dense and sparse graphs
  • Provide the same interface, so the clustering code needs no modification
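One way to read the "same interface" point is an abstract graph class with a dense (adjacency matrix) and a sparse (adjacency set) implementation, so that a clustering routine only calls the shared methods. The class and method names below are hypothetical and are not taken from the "gbc" code.

```python
from abc import ABC, abstractmethod


class Graph(ABC):
    """Common interface the clustering code relies on (hypothetical names)."""

    @abstractmethod
    def add_edge(self, u, v):
        """Add an undirected edge between nodes u and v."""

    @abstractmethod
    def neighbors(self, u):
        """Return an iterable of the nodes adjacent to u."""

    @abstractmethod
    def num_nodes(self):
        """Return the number of nodes, which are labeled 0..n-1."""


class DenseGraph(Graph):
    """Adjacency matrix: O(n^2) memory, suited to dense graphs."""

    def __init__(self, n):
        self._n = n
        self._matrix = [[False] * n for _ in range(n)]

    def add_edge(self, u, v):
        self._matrix[u][v] = self._matrix[v][u] = True

    def neighbors(self, u):
        return [v for v, linked in enumerate(self._matrix[u]) if linked]

    def num_nodes(self):
        return self._n


class SparseGraph(Graph):
    """Adjacency sets: memory proportional to the number of edges."""

    def __init__(self, n):
        self._adj = [set() for _ in range(n)]

    def add_edge(self, u, v):
        self._adj[u].add(v)
        self._adj[v].add(u)

    def neighbors(self, u):
        return self._adj[u]

    def num_nodes(self):
        return len(self._adj)


def connected_components(graph):
    """Single-link clustering written only against the Graph interface."""
    seen = [False] * graph.num_nodes()
    components = []
    for start in range(graph.num_nodes()):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for nb in graph.neighbors(node):
                if not seen[nb]:
                    seen[nb] = True
                    stack.append(nb)
        components.append(sorted(component))
    return components


edges = [(0, 1), (1, 2), (3, 4)]
dense, sparse = DenseGraph(6), SparseGraph(6)
for u, v in edges:
    dense.add_edge(u, v)
    sparse.add_edge(u, v)
```

Calling `connected_components(dense)` and `connected_components(sparse)` runs the same clustering code on both representations.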
Analysis program
[Screenshot of the analysis program; labeled elements:]
• Analysis tools
• Results output
• Process log output
• Clustering algorithm
• Comparison algorithm
• Number of contigs
• Run analysis
• Reset BLAST output
• New contig set
• Reset semantics
• Change matrix threshold
Analysis tools: contig information
• Displays the BLAST output:
  - sequence references
  - sequence annotations
  - percentage of matching base pairs
• Displays the list of contigs sorted by their best matching percentage in the BLAST output
Analysis tool: EST selector
• Displays:
  - frequency vs. length (in ESTs) of contigs
  - list of ESTs in a contig
• Lets the user select the best representative EST according to length and tissue type
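In the tool the user makes the selection interactively, and the slide does not give an exact rule. As an illustration only, the sketch below assumes a simple automatic rule: prefer ESTs from a chosen tissue if any exist, then take the longest. The `Est` record, its fields, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Est:
    name: str
    length: int  # sequence length in base pairs
    tissue: str  # tissue type the EST was derived from


def pick_representative(ests, preferred_tissue=None):
    """Pick one EST: preferred tissue first (if given), then greatest length."""
    if not ests:
        raise ValueError("a contig must contain at least one EST")
    return max(ests, key=lambda e: (e.tissue == preferred_tissue, e.length))


# Hypothetical ESTs of one contig (not project data).
contig_ests = [
    Est("est_1", 420, "liver"),
    Est("est_2", 610, "spleen"),
    Est("est_3", 550, "liver"),
]
longest = pick_representative(contig_ests)
liver_pick = pick_representative(contig_ests, preferred_tissue="liver")
```

Without a preferred tissue the longest EST wins; with `preferred_tissue="liver"` the longest liver EST wins even though a longer spleen EST exists.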
First results
On a set of 400 contigs representing 1,000 ESTs

Contig number: 133
Contig size: 740
Best matching fraction: 0.9413109756097561
gb|AF178529.1|AF178529 Gallus gallus Rad54b (RAD54B) mRNA, compl...  1235  0.0
gb|BC001965.1|BC001965 Homo sapiens, RAD54, S. cerevisiae, homol...  184  5e-44
ref|XM_005161.3| Homo sapiens RAD54, S. cerevisiae, homolog of, ...  184  5e-44
gb|AF112481.1|AF112481 Homo sapiens RAD54B protein (RAD54B) mRNA...  184  5e-44
ref|NM_012415.1| Homo sapiens RAD54, S. cerevisiae, homolog of, ...  184  5e-44
emb|AL133578.1|HSM801429 Homo sapiens mRNA; cDNA DKFZp434J1672 (...  184  5e-44
dbj|AP003534.1|AP003534 Homo sapiens genomic DNA, chromosome 8q2...  76  3e-11
gb|AC084633.1|CBRG45G04 Caenorhabditis briggsae cosmid G45G04, c...  44  0.11
dbj|AB018110.1|AB018110 Arabidopsis thaliana genomic DNA, chromo...  44  0.11

Contig number: 79
Contig size: 743
Best matching fraction: 0.43587786259541983
gb|AF178529.1|AF178529 Gallus gallus Rad54b (RAD54B) mRNA, compl...  571  e-160
gb|BC001965.1|BC001965 Homo sapiens, RAD54, S. cerevisiae, homol...  143  2e-31
ref|XM_005161.3| Homo sapiens RAD54, S. cerevisiae, homolog of, ...  143  2e-31
gb|AF112481.1|AF112481 Homo sapiens RAD54B protein (RAD54B) mRNA...  143  2e-31
ref|NM_012415.1| Homo sapiens RAD54, S. cerevisiae, homolog of, ...  143  2e-31
emb|AL133578.1|HSM801429 Homo sapiens mRNA; cDNA DKFZp434J1672 (...  143  2e-31
dbj|AP003534.1|AP003534 Homo sapiens genomic DNA, chromosome 8q2...  76  3e-11
gb|AC009623.6|AC009623 Homo sapiens chromosome 8, clone RP11-219...  40  1.7
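Hit lines like the ones above end with two numeric columns, read here as the BLAST bit score and E-value. A minimal parsing sketch, assuming one hit per line and that the last two whitespace-separated fields are always score and E-value; the function name and the returned dictionary keys are hypothetical, and this is not the project's parser.

```python
def parse_hit_line(line):
    """Split one BLAST summary line into identifier, description, score, and E-value."""
    head, score, evalue = line.rsplit(None, 2)
    seq_id, _, description = head.partition(" ")
    # BLAST abbreviates very small E-values as e.g. "e-160"; make that parseable.
    if evalue.startswith("e"):
        evalue = "1" + evalue
    return {
        "id": seq_id,
        "description": description.strip(),
        "score": float(score),
        "evalue": float(evalue),
    }


first = parse_hit_line(
    "gb|AF178529.1|AF178529 Gallus gallus Rad54b (RAD54B) mRNA, compl... 1235 0.0"
)
second = parse_hit_line(
    "gb|AF178529.1|AF178529 Gallus gallus Rad54b (RAD54B) mRNA, compl... 571 e-160"
)
```

The truncated descriptions (ending in "...") are kept exactly as BLAST printed them.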