The Bioinformatics Lab at the University of Toledo offers education in bioinformatics, conducts genomic research, and provides career opportunities in the field. Learn about our courses, projects, and student achievements.
Alexei Fedorov, Ph.D. Associate Professor, Head of Bioinformatics Lab, Department of Medicine; Vice Director, Program in Bioinformatics and Genomics/Proteomics. Tel: (419)‑383‑5270 Email: alexei.fedorov@utoledo.edu http://bpg.utoledo.edu/~afedorov/lab/
Bioinformatics Lab in 2013-2014. PhD students: Shuhao Qiu. Masters students: Ahmed Al-Khudair. Current grants: NSF Career Development 2007-2012, “Investigation of intron cellular roles”
MAJOR GOAL: Bioinformatics Investigation of the Human Genome
Education in Bioinformatics (TWO TYPES OF STUDENTS) • Students with a computer/math background gain experience in biology (Sam, Andy) • Students with a biological background gain experience in programming (Dave, Maryam) • Example of computational projects: Binary-abstracted Markov models and their application to sequence classification http://etd.ohiolink.edu/view.cgi?acc_num=mco1271271172 http://bpg.utoledo.edu/~sshepard/defense/ (video)
Genomic MRI http://bpg.utoledo.edu/gmri/ http://www.jove.com/Details.php?ID=2663
Job prospects (example: Ashwin Prakash) PhD: November 2011, HSC UT. PhD research fellow: from January 2011, Johns Hopkins School of Medicine. Declined offers: • Cold Spring Harbor Laboratory • Baylor College of Medicine
The PI’s students received the following awards: • Jason Bechtel, Outstanding MSBS student in 2008 at HSC UT. • Theodor Rais, Second/Third Poster award by Ohio Bioinformatics Consortium, 2009. • Samuel Shepard, Outstanding PhD student in 2010 at HSC UT. • Lorraine Walters, Undergraduate Research Recognition Award, UT May 2012. • Arnab Saha-Mandal, 1) Outstanding MSBS student in 2013 at HSC UT; and 2) Canadian Institute of Health Research fellowship support ($20,000). • Jasmine Serpen, 1) Ohio Governor's Thomas Edison Award for Excellence in Biotechnology & Biomedical Technologies-1st place; and 2) OSERA Biomedical Research/Bioengineering Award-1st place (for high school students).
Program in Bioinformatics and Genomics/Proteomics (BPG) • http://hsc.utoledo.edu/depts/bioinfo/ • BPG offers a Certificate in association with the degrees of Doctor of Philosophy (Ph.D.) or Doctor of Medicine (M.D.). BPG also offers a Master of Science in Biomedical Sciences (MSBS).
Two courses in Spring semester: • Application of Bioinformatics, Proteomics, and Genomics (BIPG 640) or “Advanced Bioinformatics” (should be taken after “Fundamental Bioinformatics” of Dr. Trumbly) • Introduction to Bioinformatic Computation (BIPG 610) The main goal of this course is to provide basic programming skills to biological and medical students who may lack a background in computer sciences. Programming will be specifically taught using important biological examples, focusing in particular on the PERL language. No programming skills are required!
In the “Introduction to Bioinformatic Computation” course, rather than doing “cookbook” lab exercises, students work on real-world, challenging problems whose resolution advances the field of genome biology. In addition to learning programming and other bioinformatic skills, the students of this course learn how to present the final product of bioinformatic research and how to write a scientific paper on the subject.
• In 2005 the class developed a program to identify novel genes for non-coding RNAs in humans and other mammals. This work resulted in publication of an article in Nucleic Acids Research (ref. 44), coauthored by the group of students who were actively working on this project.
• In 2006 course students created a novel public database (ASMD) and also a novel computational resource, “Splicing Potential”. Ten students were co-authors of two manuscripts (refs. 51, 52).
• In 2007 the class participated in the “Genomic MRI” project. Seven of these students are co-authors of a paper in BMC Genomics, 2008 (ref. 53).
• The 2008 class continued the “Genomic MRI” project. They performed whole-genome comparisons for human, chimpanzee, and macaque and also analyzed the distribution of 4 million SNPs inside and outside MRI regions. The results are in preparation for publication in Genome Research with 6 students among the authors.
Publications with IBC students
54. Prakash A., Shepard S., Mileyeva-Biebesheimer O., He J., Hart B., Chen M., Amarachiniha S., Bechtel J., Fedorov A. Molecular forces shaping human genomic sequence at mid-range scales. BMC Genomics 2009, 10:513.
53. Bechtel J.M., Wittenschlaeger T., Dwyer T., Song J., Arunachalam S., Ramakrishnan S.K., Shepard S., Fedorov A. Genomic mid-range inhomogeneity correlates with an abundance of RNA secondary structures. BMC Genomics 2008, 9:284.
52. Bechtel J.M., Rajesh P., Ilikchyan I., Deng Y., Mishra P.K., Wang G., Wu X., Afonin K., Grose W., Wang Y., Khuder S., Fedorov A. Calculation of Splicing Potential from the Alternative Splicing Mutation Database. BMC Research Notes 2008, 1:4.
51. Bechtel J.M., Rajesh P., Ilikchyan I., Deng Y., Mishra P.K., Wang G., Wu X., Afonin K., Grose W., Wang Y., Khuder S., Fedorov A. The Alternative Splicing Mutation Database: a hub for investigations of alternative splicing using mutational evidence. BMC Research Notes 2008, 1:3.
44. Fedorov A., Stombaugh J., Harr M.W., Yu S., Nasalean L., Shepelev V. Computer identification of snoRNA genes using a Mammalian Orthologous Intron Database. Nucl. Acids Res. 2005, 33:4578-4583.
COURSE: Bioinformatics of Biomarkers and Individualized Medicine, Spring 2012 • Course timeline: 14 weeks • No prerequisites; recommended: introduction to bioinformatics and molecular biology • Reserve materials: None • Unit 1: Biomarker discovery and validation • Unit 2: Individualized Medicine
Investigation of the human genome BASE COUNT 846302 a 578512 c 575805 g 843114 t 1703 others ORIGIN 1 gaattcaaaa aagaaagaca atgacttgta gctgaagcta tgatcaggaa aagatggggt 61 ggacggcatt tgagaaaatc aggacagtgg tgtacttatc aaataagaag atctgggcag 121 aagattgttg aaaaagcaga cacagcactg agtagcagca tggagcagaa aagcataagg 181 aacaagtagt gcagtgtgcc tgaacatagg atgggaaatt aggaaagata aatggaggct 241 gactgtggga agccttacat tccaggctta gtggaataag taaatattta aatctcatga 301 gttcttttct ctctgctttc tatttttcac gacctgaact cacctcccag tgaggagatg 361 tttccaccta gcactaaaca gtaactagtt cagactatat atttaaaaaa aaaaaaaaaa 421 aaaaaaaaaa gcagaacagc tcagatcatc cagtgaagtg gtgctactat tatactatta 481 acggggagat gaaagccaga taagatggag aagtaggaaa tttacgaaac attttaaaag 541 aaaatttatt tattcatcaa tatttacata aatgtttatt aattctaagt actatagtag 601 gcacccattt attactttca aaaattgaca atatacaagt taataaaatc atattagttt 661 cctcttctaa taaaattatc tcactcaaat tcatataact aaaaatacat ttaataaatt 721 ttatttttaa aatataggcc acttctactc tattcatttt tgcacttaac attctcttgc 781 tttcaaaaat gtatgaaaaa tttcagttta gtccccacca aatctcaatt tagaccccgg 841 ataaagagta aataaattaa agagctgtca gaattaaaac actactacag gtctccttca 901 ctttatggca tagatgaagg caggaaatac tggctgaaaa ttttgtttat gtcaaagatt 961 ttgatgatta ccatcagaga tctgatatct cagggaagaa aagcctttca tataccactt 1021 aaaaaattct gccaggcgcg gtggctcacg cctgtaatcc cagcactttg ggaggctgag 1081 gtgggcagat cacctgaggt cagaagttcg agaccagcct gaccaacatg gagaaaccct 1141 gtctctacta aaaatacaaa atcagccggg cgtggtggcg catgcctgta atcccagcta 1201 cttgggaggc tgaggcagga gaatcacttg aacccaggag gcagaggttg cggtgagccg 1261 agatcacacc attgcactcc agcctgggca acaagggcga aactctgtct caaaaaaaaa 1321 aaaacttctg gggaaatggt ggcctgcctt gtaacatcta tgtgtcttag agggccatgg 1381 tatgacaccc ttgggcagtc atttatagag tccttccctg accagggaat catcctgcca
... after the first 50 pages .. 141601 cagcaccaaa tcctctcatt gcctttttaa aaaatgttgt ccaatttaac atcaagacac 141661 tgtccatgca atctgttgaa aaatctggct atttgcaaac aaagaaaaaa tgtatagcct 141721 cccacactat atatcaaaat aaacccaagt gtataaaaga gaaaatttta agtgaaacca 141781 aaacttgaaa atattgagat gaatattagt tagagctttg agtaggaaag gattttttga 141841 acagataacc aacagaggaa gtcagaaaac agtaatcatt tccttaatga aaatacaaaa 141901 cttaagtact tcaaaaaagt cattacaata cttaaaaacc ttacaacaat catgtggaaa 141961 gcatttatta caaataattc agaaaaagga tttatatccc taataactaa agaagtgagg 142021 aagaatgcta agatcacatt ttttaaaaag tagctaaagg ataatataaa tgactaacag 142081 acctgaggaa aaaagctaac ctcacaagta ttcaaccaaa taaaataacc tcgagatacc 142141 acttaaaaac ctatcgaaat aacgaagtgt ttggaaaatg acaagattca aaatctggta 142201 agagcagcat ttttccccat tgtggaggga gtgtgtaaat tggtgtggtc tttctgaaaa 142261 gcaattaggc aatcttgtat caaaaatctt caaagtgttc ttactctttg atgaagaatt 142321 ccacacgtgt gaatcctaaa acaattaaaa gtatgaacat atttttatgc acaaagatgt 142381 ttagccaaaa ggaaaacgac ctaaatgacg aatgatgtgc aactgcatgg ataaattgtt 142441 gtatatcaaa atgatgaaat attttgcagc tttgaaaagg taattttgaa aaaactttaa 142501 agacctcaaa aatgcccaaa atatattaat tgaaaaggat acaaaacttt attatttcac 142561 tacgtaatga aacagaatac agttgatcct tgaacaacgc tggtttgaac tgcactcgtc 142621 cacttacatt cagatttttt tctttttgct tttttttttt gagacgaagt ctcactctgt 142681 cacccaggct ggagggcagt ggcaccattc tggctcacta caacctgcgt ataccaggtt 142741 caagcaattc tcctgcctca gcctcccaag tagctggaat tacaggcgcc tgtcaccacg 142801 tccagctaat ttttgtattt ttagtagaga cggagtttca ccatgttggc caggctggtc 142861 tcgaactcct ggcctcaagt aatccacctg cctcagcctc ccaaagtgct gggattacag 142921 gcatcagccg ggtgcggtgg cttatgcctg caatcccatc ctggctaaca cggtgaaacc 142981 ctgtctctac taaaatacaa aaaattagct gagtgtggtg gcacatgcct atagttccag 143041 ctacttggga ggctgaggga tgagaattgc ttgaacctgg gaggcagagg ttgcagtgag 143101 ccgagatcac accactgtac tccagcctgg gcaacagagc aagactccat ctcaaaaaaa 143161 aaaaaaaaaa aaaaaagaaa aagaaaaaga aaaaggtatg ttatgaatgc 
agaaagtata 143221 tgttgatgct agtctattgt gtaatttacc accataaaat atacacaggt ctattataga 143281 agttaaaatg tatcaaaatg tatacacaaa cacttagaga tagtacatgg tatcattccc 143341 agttgagaaa aatgtaagca aacatgaaga tgcagtatta aatcataact gtataaaatt
... after next 200 pages 683041 ggaggtgggg agcgcctctg cccagccgcc ccatctggga ggtggggagc gcctctgtcc 683101 agccaccaac ccatctggga agtgaggagc gcctctgcct ggccaccccg tctgggaagt 683161 gaggagcacc tctgccgggc tgccccgtct gggaagtgtt cccaacagct ctgaagagac 683221 agcgaccatc gagaatgggc catgatgacg atggtggttt tgtcgaaaag aaaaggggga 683281 aatgtgggga aaagaaagag agatcagatt gttactgtgt ctgtgtagaa agaagtagac 683341 ataggagact ccattttgtt ctgtactaag aaaaattctt ctgccttggg atgctgttaa 683401 tctataacct tacccccaaa cccctgctct ctgaaacatg tgctgtgtca actcagggtt 683461 aaatggatta agggcgatgc aagatgtgct ttgttaaaca gatgcttgaa gacagaaaaa 683521 aaaaaagaaa gagaaaaaaa aaatcattga aggattattt atgccctatg gcatcccttt 683581 ctccaacact tgtcacctaa tgaccaggga tcaataccca caaatacagt aagacctatt 683641 tttaaaggtt ttcagcttaa ctgttttgtc tcttaataaa tttttatata ggaaaaaaaa 683701 aagaatgttg aatattggcc cccactctct tctggcttgt agagtttctg cagagagatc 683761 cactgttagt ctgatggctt ccctttgtgg gtaacccagt ctttctttct gcccttaaca 683821 ttttttcctt catttcaacc atggtgaatc tgacaattat gtgtcttggt gttgctcttc 683881 tcaaggagta tctttgtggt gttctctgta tttcctgaat ttgaatattg gcctgtgtgg 683941 ataggttggg gaagttctcc tggataatat cctgaagagt gttttccaac ttggttccat 684001 tctcccagtc actttcaggt acaccaatca aatgtaggtt tggtcttttc acatagtccc 684061 atatttcttg gaggctttgc tcattccttt tcattctttt ttctctaatc ttgtcttcaa 684121 gctttatttc attaagttag tttatatttg actgtgcttt atacttgaca aagcactttc 684181 acatttcttg tcttttttgg gcctgataat tactctgcaa gttaaaaagg aaaaactcca 684241 agtaccatta cgctccgtga ggacagggac tattttgttc attgttgcaa cctaagcact 684301 taatatgttg cctggtccag agtagatact catatataaa tacttgctga ataaagggat 684361 gaatgggtgg gtggttagat gaatggaatt tgccttaatt ttcaagatgg attcaatttc 684421 caattccact tactggtgag aagccttgtc taagtcttta aaccttactt tcctcatcta 684481 taaaacagtg acaatgatat tgtttctgct accacaatgg aaaaaaggac agaattactt 684541 agtgtcatag tgatcaggaa taaagccagg gcttgaagca tctcctgatt cctagggcat 684601 tgtttgtccc aatgtatatg gcagagggag aaagaaaacc gttgagtctt aatctgtcag 
684661 gcactatttt atgaacttta aaatcctcat agcagggcca ggtgcagtgg ctcacacctg 684721 taatcccagc actttgggag gccaaggcag gcagatcact tgaggtcagg accagcctgt 684781 ccaacgtggt gaaaccacat ctctactaaa aatacaaaaa ttagccaggc gtggtggtgc 684841 atgcctataa tcccagctac ttgggaggct gaggcaggag aaatgcttga acctgggagg 684901 cagaggttgt ggtgagctga gattgtgcca ctgtactcca gcctgggcaa cagaacaaga
Human chromosome 1: 4,814,628 lines ≈ 100,000 pages = 100 books (1,000 pages each)
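The arithmetic behind this slide can be checked with a minimal Python sketch. The BASE COUNT values and the chromosome 1 line count are copied from the slides; the figure of 48 printed lines per page is an assumption (the slides do not state it) chosen because it reproduces the quoted page count.

```python
# BASE COUNT values copied from the GenBank-style header shown on the slide.
base_count = {"a": 846302, "c": 578512, "g": 575805, "t": 843114, "others": 1703}

total_bases = sum(base_count.values())
gc_fraction = (base_count["g"] + base_count["c"]) / total_bases

# Line count for human chromosome 1, as quoted on the slide.
chr1_lines = 4_814_628

# ASSUMPTION: about 48 printed lines per page. This is not stated in the
# source; it is picked because it reproduces the "100,000 pages" figure.
LINES_PER_PAGE = 48
PAGES_PER_BOOK = 1000  # from the slide: "1000 pages each"

pages = chr1_lines / LINES_PER_PAGE
books = pages / PAGES_PER_BOOK

print(f"record shown: {total_bases:,} bases, GC fraction {gc_fraction:.3f}")
print(f"chromosome 1: about {pages:,.0f} pages, about {books:,.0f} books")
```

Under that assumption the estimate comes out at roughly 100,000 pages, or about 100 books of 1,000 pages each.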
The 1000 Genomes Project: A guide to your ancestry
The pattern of human genetic variation is believed to be a key to understanding human population history and diversity. The 1000 Genomes Project has sequenced 1,092 genomes from different populations. By identifying the sequences that correspond to LWK, GBR, JPT and FIN, we aim to learn more about population genetic patterns and to get a picture of the genetic diversity that exists within these populations. This project uses the 1000 Genomes Project's catalogue of human genetic variation to calculate and compare genetic differences between 14 populations. I am presenting the results that our bioinformatics lab's team has obtained so far; we are working on putting them in a paper.
We used Perl programs to compute the differences between every pair of individual genomes from the 1000 Genomes Project for the 14 populations:
• ASW: HapMap African ancestry individuals from SW US
• CEU: CEPH individuals
• CHB: Han Chinese in Beijing
• CHS: Han Chinese South
• CLM: Colombian in Medellin, Colombia
• FIN: HapMap Finnish individuals from Finland
• GBR: British individuals from England and Scotland
• IBS: Iberian populations in Spain
• JPT: Japanese individuals
• LWK: Luhya individuals
• MXL: HapMap Mexican individuals from LA, California
• PUR: Puerto Rican in Puerto Rico
• TSI: Toscani individuals
• YRI: Yoruba individuals
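The pairwise comparison described above can be illustrated with a small sketch. The lab's own programs were written in Perl and are not shown in the slides, so this Python version is only an assumption about the general idea: a "difference" is taken here to mean a variant site where two individuals' genotypes are not identical. The function names, sample names, and genotype values are all hypothetical.

```python
from itertools import combinations


def count_differences(genotypes_a, genotypes_b):
    """Count sites where two individuals' genotypes differ.

    ASSUMPTION: each genotype is the number of non-reference alleles
    (0, 1 or 2) at a site; the slides do not define the exact metric.
    """
    if len(genotypes_a) != len(genotypes_b):
        raise ValueError("genotype lists must cover the same sites")
    return sum(1 for a, b in zip(genotypes_a, genotypes_b) if a != b)


def pairwise_differences(genotypes_by_sample):
    """Return {(sample_1, sample_2): difference_count} for every unordered pair."""
    return {
        (s1, s2): count_differences(genotypes_by_sample[s1], genotypes_by_sample[s2])
        for s1, s2 in combinations(sorted(genotypes_by_sample), 2)
    }


# Toy data invented for illustration: 3 samples, 6 variant sites.
toy_genotypes = {
    "sample_A": [0, 1, 2, 0, 0, 1],
    "sample_B": [0, 1, 1, 0, 2, 1],
    "sample_C": [2, 0, 2, 1, 0, 0],
}

toy_differences = pairwise_differences(toy_genotypes)
for pair, n in sorted(toy_differences.items(), key=lambda item: item[1]):
    print(pair, n)
```

With real data the genotype lists would hold millions of sites per individual, so a production version would stream the variant files instead of holding everything in memory.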
Figure 2: The graph below shows the 14 populations falling into 4 distinct groups of origin, which we call 4 ancestries: 1 African, 2 Hybrid, 3 European, 4 Asian.
Figure 3: The three populations of African origin have total differences distributed close to each other. The LWK population (Luhya individuals) showed some pairs of individuals with almost half the usual number of differences (about 2.7 million compared with 4.8 million). Almost all of these have been declared as siblings or other relatives. Some of them are not declared as relatives by the 1000 Genomes Project, so our results suggest that there might be some undeclared relatives in the 1000 Genomes Project.
We further examined some populations for any declared relationships between individuals; relatives showed the smallest differences in their genetic variation. For example, in the LWK population, as shown in the table below, the relatives fall at the top of the list when the total differences are sorted from lowest to highest. The green highlighted cells mark individuals who are declared as related to each other in the 1000 Genomes Project appendix. For the pairs that are not highlighted, we suggest that they are also relatives in some way but have not been declared by the 1000 Genomes Project.
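The sorting step described above can be sketched as follows. The slides do not give a numerical rule for deciding which pairs count as possible relatives, so the cutoff used here (75% of the median pairwise difference) is purely an assumption, as are the function name and the toy values.

```python
from statistics import median


def flag_possible_relatives(pair_differences, fraction=0.75):
    """Sort pairs by difference (lowest first) and keep unusually similar ones.

    ASSUMPTION: a pair is flagged when its difference is below
    `fraction` times the median difference; the source gives no cutoff.
    """
    cutoff = fraction * median(pair_differences.values())
    ranked = sorted(pair_differences.items(), key=lambda item: item[1])
    return [(pair, n) for pair, n in ranked if n < cutoff]


# Toy values invented for illustration, in the range the slides mention.
toy_pairs = {
    ("ind_1", "ind_2"): 2_700_000,  # much lower than the rest
    ("ind_1", "ind_3"): 4_750_000,
    ("ind_2", "ind_3"): 4_800_000,
    ("ind_3", "ind_4"): 4_780_000,
    ("ind_1", "ind_4"): 4_820_000,
}

flagged = flag_possible_relatives(toy_pairs)
print(flagged)
```

Flagged pairs would then be checked against the declared relationships, with any remaining pairs treated as candidates for undeclared relatives.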
Figure 4: The CLM, PUR and MXL populations show a very wide distribution, ranging from 3.1 to 4.86 million differences. Our results indicate that these populations have a wide range of admixed ancestry. The PUR population has a second peak on the right side (between 4.74 and 4.9 million); we expect that these individuals have a different ancestry. Further investigation of these individuals is under way to find out where their ancestry comes from.
Figure 5: Populations FIN, GBR, TSI, CEU and IBS. All these populations are of European origin. The IBS population appears as a very low curve because only 13 individuals from this population have been sequenced.
Figure 6: The populations of Asian origin are genetically close to each other: their distributions have very similar shapes and range between 3.4 million and 3.69 million differences.
We are further investigating the pairs of individuals with the highest differences, since these individuals may have a different origin. We examined the 40 highest-difference pairs in some populations and found that certain individuals appeared in these pairs far more often than others. An example is shown in the figure below.
The list below shows the CLM individuals with the highest genetic differences from each other. Looking at them individually, we noticed that some appear far more often than others, as shown in the list of repeats on the right. HG01551 and HG01342 each appear 20 times among the highest-difference pairs, while the others appear only 2 or 3 times. We therefore decided to investigate the possibility that these individuals have a different origin.
• HG01551 4479513 HG01136
• HG01365 4480834 HG01342
• HG01342 4481529 HG01250
• HG01551 4481637 HG01250
• HG01551 4483529 HG01375
• HG01551 4485279 HG01125
• HG01488 4487693 HG01342
• HG01366 4488647 HG01342
• HG01551 4490996 HG01259
• HG01342 4493212 HG01271
• HG01342 4493218 HG01277
• HG01377 4494064 HG01342
• HG01462 4494414 HG01390
• HG01551 4496682 HG01365
• HG01461 4497146 HG01342
• HG01342 4498051 HG01125
• HG01551 4499694 HG01148
• HG01551 4499713 HG01345
• HG01375 4500523 HG01342
• HG01551 4501432 HG01134
• HG01551 4503181 HG01495
• HG01389 4506393 HG01342
• HG01342 4508562 HG01148
• HG01551 4510222 HG01377
• HG01342 4514486 HG01134
• HG01551 4519187 HG01389
• HG01342 4520380 HG01124
• HG01440 4527415 HG01342
• HG01342 4533004 HG01275
• HG01342 4535490 HG01272
• HG01551 4537772 HG01272
• HG01551 4541901 HG01488
• HG01551 4542804 HG01461
• HG01551 4558088 HG01462
• HG01551 4561600 HG01275
• HG01390 4562418 HG01342
• HG01462 4564478 HG01342
• HG01551 4577349 HG01440
• HG01551 4608288 HG01390
• HG01551 4678948 HG01342
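The repeat counts quoted for this list can be reproduced with a short Python sketch. The 40 pairs are copied from the list above with the difference values left out; the variable names are illustrative, and this is not the lab's own (Perl) code.

```python
from collections import Counter

# The 40 highest-difference CLM pairs from the slide, written as ID1-ID2.
TOP_PAIRS = """
HG01551-HG01136 HG01365-HG01342 HG01342-HG01250 HG01551-HG01250
HG01551-HG01375 HG01551-HG01125 HG01488-HG01342 HG01366-HG01342
HG01551-HG01259 HG01342-HG01271 HG01342-HG01277 HG01377-HG01342
HG01462-HG01390 HG01551-HG01365 HG01461-HG01342 HG01342-HG01125
HG01551-HG01148 HG01551-HG01345 HG01375-HG01342 HG01551-HG01134
HG01551-HG01495 HG01389-HG01342 HG01342-HG01148 HG01551-HG01377
HG01342-HG01134 HG01551-HG01389 HG01342-HG01124 HG01440-HG01342
HG01342-HG01275 HG01342-HG01272 HG01551-HG01272 HG01551-HG01488
HG01551-HG01461 HG01551-HG01462 HG01551-HG01275 HG01390-HG01342
HG01462-HG01342 HG01551-HG01440 HG01551-HG01390 HG01551-HG01342
"""

pairs = [tuple(token.split("-")) for token in TOP_PAIRS.split()]

# Count how many of the top pairs each individual takes part in.
repeat_counts = Counter(sample for pair in pairs for sample in pair)

for sample, count in repeat_counts.most_common(5):
    print(sample, count)
```

HG01551 and HG01342 each take part in 20 of the 40 pairs, and no other individual appears more than 3 times, which matches the counts quoted on the slide.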
The idea was to take those frequently repeated high-difference individuals together with 10 controls from the same population that showed an average number of genetic differences within that population. We then randomly took individuals from other populations and calculated the genetic differences between our 10 controls plus 2 high repeats and 1 control from each of the other populations. The comparison below is between 10 controls from CLM plus the 2 frequently repeated high-difference individuals (HG01551 and HG01342) and one control individual from the YRI population (Yoruba individuals, African ancestry). HG01551 and HG01342 had the lowest differences, indicating that these two persons might be of African origin.
We further compared the CLM controls and the two high-repeat individuals with an individual from another African population (LWK) and with an individual from an Asian population (CHS). The two high-repeat individuals showed the lowest genetic difference against the LWK control and the highest difference against the CHS individual. This suggests that these two individuals from the CLM population are of African origin. Panels: CLM vs. LWK; CLM vs. CHS.
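The comparison against a single outside reference can be sketched as a simple ranking. The slides do not report the underlying numbers, so every value below is invented for illustration, and the sample names (`control_1`, `repeat_1`, and so on) are placeholders, not the real individuals.

```python
def rank_by_difference_to_reference(differences_to_reference):
    """Return sample names ordered from smallest to largest difference."""
    return sorted(differences_to_reference, key=differences_to_reference.get)


# Invented toy numbers: 10 "average" controls plus 2 frequently repeated
# individuals, each compared with one reference from another population.
toy_vs_reference = {
    f"control_{i}": 4_900_000 + 10_000 * i for i in range(1, 11)
}
toy_vs_reference.update({"repeat_1": 4_500_000, "repeat_2": 4_550_000})

ranking = rank_by_difference_to_reference(toy_vs_reference)
closest_two = ranking[:2]
print(closest_two)
```

If the two high-repeat individuals rank closest to a reference from one ancestry group and farthest from a reference of another, that is the pattern the slides interpret as evidence of their origin.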
Conclusions
• Total variants showed substantial geographic differentiation.
• The total number of differences distinguishes populations that are geographically and ancestrally more remote from each other.
• Populations are grouped by the predominant component of ancestry: Europe (CEU, TSI, GBR, FIN and IBS), Africa (YRI, LWK and ASW), East Asia (CHB, JPT and CHS) and the Americas (MXL, CLM and PUR).
• Relatives within the same population have significantly fewer genotype differences (almost half the number) compared with non-relatives.
• The study of human genetic variation has evolutionary significance. It can help us understand ancient human population migrations as well as how different human groups are biologically related to one another.