Bioinformatics Tools for Genotyping. Frances Tong; Dr. Garry Larson, Ph.D. City of Hope Department of Molecular Medicine. Southern California Bioinformatics Institute, Summer 2003. Funded by the National Science Foundation and the National Institutes of Health.
Overview of Summer Program • Learn ASP and VBScript • Learn the biology • Programming Project I : writing code for mining of online genetic data • Programming Project II : writing a program to graph linkage disequilibrium data
Intro to ASP & VBScript • ASP : Microsoft Active Server Pages * server generated web pages * similar to CGI but easier * works well with databases • VBScript : Microsoft Visual Basic Scripting * scripting language to enhance HTML web pages * default language of ASP
Hello World! • Sample ASP file (one line only!) <% Response.Write("Hello, World!") %>
Genetic Mapping of ASPs • ASPs : affected sibling pairs • Identification of genes associated with cancer in patients and siblings who both have cancer (breast, prostate, lung or colon) • Determine allele sharing statistics of susceptibility genes • Look at gene-gene interactions => Provide information on a person’s genetic risk of developing cancer
DNA Marker Genotyping • Genetic marker : polymorphic gene or section of DNA with an identifiable physical location on a chromosome, used to trace inheritance • Ex. Microsatellite and SNP markers
Programming Project I: Tag Selection for Markers • Need a unique way to identify markers (like social security numbers for people) • Chromosome locations are relative and change frequently (UCSC) • Use ASP to automate data mining and ease the generation of a unique 50 base-pair tag for each marker in the database • Tags will be used to locate markers in the genome
Marker Tag Selection • Submit accession number for microsatellite • Submit accession number for SNP • Submit sequence surrounding simple repeat
Output • Chromosome • Sequence start position • Sequence end position • Link to UCSC browser • Inputted sequence with repeats highlighted in blue
Choosing a 50bp tag • Send sequence to UCSC • Copy and paste here
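One way the tag-picking step could be automated is sketched here in Python (not the ASP/VBScript used for the original tool). It assumes, purely for illustration, that the tag is the 50 bases immediately upstream of the first simple repeat. The function names `find_simple_repeats` and `upstream_tag`, the repeat-unit sizes, and the minimum copy number are hypothetical choices, not the program's actual rules.

```python
import re


def find_simple_repeats(seq, unit_min=2, unit_max=4, min_copies=5):
    """Return (start, end, unit) for each tandem run of a short repeat unit.

    Assumption: a "simple repeat" is a 2-4 base unit repeated at least
    five times in a row; the original tool may define it differently.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (unit_min, unit_max, min_copies - 1)
    )
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq.upper())]


def upstream_tag(seq, tag_len=50):
    """Return the tag_len bases just before the first repeat, or None.

    None is returned when no repeat is found or the upstream flank is
    shorter than tag_len.
    """
    repeats = find_simple_repeats(seq)
    if not repeats:
        return None
    start = repeats[0][0]
    if start < tag_len:
        return None
    return seq[start - tag_len:start].upper()
```

A candidate tag chosen this way would still need to be checked for uniqueness in the genome, which is what the Blat step is for.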
UCSC Blat Results • Blat is similar to BLAST : searches for alignments in the genome
Convert to FASTA format • FASTA format: a header line ">name", followed by the sequence on the next line(s) • Program converts marker tag file into FASTA format automatically
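The conversion itself is small. A minimal Python sketch follows (the original program was written in ASP; `to_fasta` and its 60-column line width are illustrative assumptions):

```python
def to_fasta(name, sequence, width=60):
    """Format one record as FASTA: a '>name' header, then wrapped sequence."""
    # Drop any whitespace in the pasted sequence and normalise the case.
    seq = "".join(sequence.split()).upper()
    lines = [">" + name]
    lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

For example, `to_fasta("tag1", "acgt acgt")` returns the two-line record `>tag1` / `ACGTACGT`.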
Check tag selection • Program sends FASTA file to UCSC Blat
Linkage Disequilibrium A condition where two polymorphisms are found together on the same chromosome at a greater frequency than that predicted from the product of their individual frequencies.
Two SNPs and their base frequencies • SNP 1 (G/A) : G = 0.88, A = 0.12 • SNP 2 (T/C) : T = 0.75, C = 0.25 Expected haplotype frequencies • G T : (0.88)(0.75) = 0.66 • G C : (0.88)(0.25) = 0.22 • A T : (0.12)(0.75) = 0.09 • A C : (0.12)(0.25) = 0.03
IF the observed frequency of two variants together > the expected frequency => LINKAGE DISEQUILIBRIUM • Ex. A and T found together more often than expected are in linkage disequilibrium
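The expected-versus-observed comparison can be written out in a few lines of Python (a sketch, not the original ASP code). The allele frequencies are the ones from the slides. The observed A-T frequency of 0.11 in the usage note is a made-up number for illustration only, and the function names are hypothetical.

```python
def expected_haplotypes(freqs1, freqs2):
    """Expected haplotype frequencies if the two sites are independent."""
    return {a + b: p * q for a, p in freqs1.items() for b, q in freqs2.items()}


def disequilibrium(observed, p_allele1, p_allele2):
    """D = observed haplotype frequency minus the product of allele frequencies."""
    return observed - p_allele1 * p_allele2


snp1 = {"G": 0.88, "A": 0.12}
snp2 = {"T": 0.75, "C": 0.25}
expected = expected_haplotypes(snp1, snp2)
```

With a hypothetical observed A-T frequency of 0.11, `disequilibrium(0.11, 0.12, 0.75)` is positive (about 0.02), so A and T would be found together more often than expected.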
A Quantitative Measure of LD • One of the most common measures of linkage disequilibrium is $r^2$ • It is a squared correlation coefficient => the correlation of alleles at two sites. • Special case: $r^2 = 1$ (“perfect LD”) ~ Exactly two out of the four possible haplotypes are observed. ~ Markers NOT separated by recombination
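The squared correlation coefficient, commonly written r², is usually computed as D² / (pA(1 − pA) pB(1 − pB)), where D = pAB − pA·pB. The Python sketch below (not the original ASP code) applies that standard formula to a table of haplotype frequencies. The function name `r_squared` and the two-letter haplotype keys are illustrative assumptions.

```python
def r_squared(hap, allele1, allele2):
    """r^2 between allele1 at site 1 and allele2 at site 2.

    hap maps two-letter haplotypes (site-1 allele + site-2 allele)
    to their frequencies, which should sum to 1.
    """
    p_a = sum(f for h, f in hap.items() if h[0] == allele1)
    p_b = sum(f for h, f in hap.items() if h[1] == allele2)
    p_ab = hap.get(allele1 + allele2, 0.0)
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    return d * d / denom
```

When only two of the four haplotypes occur, for example `{"GT": 0.88, "AC": 0.12}`, the result is 1 up to floating-point rounding. When the haplotype frequencies equal the products of the allele frequencies, the result is 0.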
Programming Project II • Program that helps visualize linkage disequilibrium by graphing scores such as $r^2$ • Each pair of markers has such a score => pairwise comparisons • The table of scores is symmetric! [Figure: 3 x 3 grid of pairwise scores for Marker 1, Marker 2 and Marker 3, with the values 0.7, 1 and 0.2 mirrored across the diagonal]
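Because each score belongs to an unordered pair of markers, the grid can be filled from one half and mirrored. Below is a small Python sketch of that idea (not the original ASP code). The function name `pairwise_matrix` is hypothetical, and the assignment of the scores 0.7, 1 and 0.2 to particular marker pairs in the usage note is an assumption made for the example.

```python
def pairwise_matrix(markers, scores):
    """Build a symmetric grid of scores; the diagonal is left as None.

    scores maps (marker_a, marker_b) tuples to a pairwise LD score.
    """
    index = {m: k for k, m in enumerate(markers)}
    n = len(markers)
    matrix = [[None] * n for _ in range(n)]
    for (a, b), score in scores.items():
        i, j = index[a], index[b]
        matrix[i][j] = score
        matrix[j][i] = score  # mirror across the diagonal
    return matrix


markers = ["Marker 1", "Marker 2", "Marker 3"]
# Which pair carries which value is assumed here for illustration.
scores = {
    ("Marker 1", "Marker 2"): 0.7,
    ("Marker 1", "Marker 3"): 1.0,
    ("Marker 2", "Marker 3"): 0.2,
}
grid = pairwise_matrix(markers, scores)
```

Only three of the six off-diagonal cells need to be supplied; the other three follow from symmetry.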
Sample data for graphing Read data by row: Pairwise comparison of marker 1 and marker 7 results in two different kinds of measurements
GOLD – Graphical Overview of Linkage Disequilibrium • Existing program from the Univ. of Michigan to graph linkage disequilibrium http://www.sph.umich.edu/csg/abecasis/GOLD/ • Graphs based on a chromosomal position scale • Works very well for long range pattern analysis, but hard to distinguish each specific measurement.
Comparison of Program Output (same input file) • Output from GOLD : difficult to see individual points on graph • Output from LD Color (my program) : easier to distinguish individual points
LD Color Program • Program written in ASP to graphically depict linkage disequilibrium in human genetic data • Color coded for specific numerical ranges of different measures of each pair-wise comparison of markers • Complete program: 4 files ; >1,000 lines of code
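Color coding by numerical range amounts to looking up which bin a score falls into. A minimal Python sketch follows (the real program is ASP, and the boundaries and colors below are invented defaults, since LD Color lets the user define them):

```python
from bisect import bisect_right

# Hypothetical defaults: four boundaries give five color bins.
DEFAULT_BOUNDS = [0.2, 0.4, 0.6, 0.8]
DEFAULT_COLORS = ["#ffffff", "#ffe0e0", "#ffa0a0", "#ff5050", "#ff0000"]


def color_for(score, bounds=DEFAULT_BOUNDS, colors=DEFAULT_COLORS):
    """Return the color of the range that contains score.

    A score equal to a boundary falls into the higher bin.
    """
    if len(colors) != len(bounds) + 1:
        raise ValueError("need exactly one more color than boundaries")
    return colors[bisect_right(bounds, score)]
```

Each cell of the pairwise grid would then be drawn with `color_for(score)`; swapping in user-defined `bounds` and `colors` changes the scheme without touching the lookup.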
Program Features • Data input : file uploading or text pasting • Allows for variable file formats for input • User defined colors and ranges • Switch between different measures of LD • View actual data on graph or just the colors • Change size of graph • Option to select specific rows of data
Upload your file • Paste data
Choose measure of linkage disequilibrium • Specify which column the data is located in
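Letting the user name the column is what allows variable input formats. A Python sketch of that idea follows (not the original ASP code). It assumes whitespace-delimited rows with one marker pair per row; `parse_pairs` and the header names in the usage note are hypothetical.

```python
def parse_pairs(text, marker1_col, marker2_col, value_col, has_header=True):
    """Read pairwise scores from pasted text using user-chosen columns.

    Columns are 0-based. Returns {(marker1, marker2): score}.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if has_header:
        rows = rows[1:]
    return {
        (row[marker1_col], row[marker2_col]): float(row[value_col])
        for row in rows
    }
```

For instance, given rows with the columns `M1 M2 DPRIME RSQ`, passing `value_col=2` or `value_col=3` switches between the two measures without changing the file.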
Same as before => used to specify data for other side of diagonal
Select only the markers you want graphed by choosing rows • Default : all are graphed
Future Directions • LD Color • Mouseover tag to each cell on graph to show marker id (Javascript) • Ability to accept more kinds of file formats • Better form validation and error checking • More functionality and linking to outside sources
Acknowledgements • Dr. Garry Larson, Ph.D • Dave Ko City of Hope Senior Programmer Analyst • Louis Geller City of Hope Senior Research Associate • Dr. Ted Krontiris, M.D.,Ph.D Principal Investigator • The rest of the Krontiris Lab • Southern California Bioinformatics Institute: Dr. Jamil Momand, Dr. Nancy Warter-Perez, Dr. Sandra Sharp & Dr. Wendie Johnston, Jackie Leung & rest of SoCalBSI staff • Fellow interns • NSF & NIH