Applying AI to Human Genome. Part 1: Collecting data. Prof. M. Embrechts, Robert Bress, Bram Heyns
Overview • Basics of DNA • Collecting the data • Collection : my application • Perl • Goal
Basics of DNA • DNA = polymer of 4 molecules: bases or nucleotides • A = Adenine, C = Cytosine, G = Guanine, T = Thymine • Replication (copying) and translation (reading) • Double helix: A pairs with T and G pairs with C, which is what makes copying possible • 3-letter combination = codon • RNA: U = Uracil in place of T => transcription • Protein = polymer built from 20 different amino acids (the result of reading) => more complex structure than DNA
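The three operations on this slide (base pairing, swapping T for U, and reading in 3-letter codons) can be sketched in a few lines. This is only an illustration in Python; the function names `complement`, `transcribe` and `codons` are made up for this example and are not part of the original application.

```python
# Base-pairing rules of the double helix: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(dna):
    """Return the paired base for every base in the strand."""
    return "".join(COMPLEMENT[base] for base in dna)


def transcribe(dna):
    """Write the strand in RNA letters: U (Uracil) in place of T."""
    return dna.replace("T", "U")


def codons(sequence):
    """Split a sequence into complete 3-letter codons, dropping leftovers."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


if __name__ == "__main__":
    strand = "ATGTTTGA"
    print(complement(strand))
    print(transcribe(strand))
    print(codons(strand))
```

The trailing "GA" in the demo strand is ignored by `codons` because it does not form a complete triplet.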
Intron – Exon – Splice junction • Exon: about 200 characters; intron: thousands of characters • 30,000 genes identified out of a possible 100,000 • Identifying a gene => patent
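To make the exon/intron/splice-junction vocabulary concrete, here is a small sketch that cuts the introns out of a toy gene. The sequence, the coordinates and the helper names `splice` and `splice_junctions` are invented for illustration; real exons and introns are far longer, as the slide notes.

```python
def splice(sequence, exons):
    """Join the exon segments and drop the introns between them.

    `exons` is an ordered list of (start, end) pairs, 0-based, end exclusive.
    """
    return "".join(sequence[start:end] for start, end in exons)


def splice_junctions(exons):
    """Return (exon_end, next_exon_start) for every intron between exons."""
    return [(end, next_start)
            for (_, end), (next_start, _) in zip(exons, exons[1:])]


if __name__ == "__main__":
    # Toy gene: exons in upper case, the single intron in lower case.
    gene = "ATGGCC" + "gtaagtttcag" + "TTTGAA"
    exons = [(0, 6), (17, 23)]
    print(splice(gene, exons))
    print(splice_junctions(exons))
```

Each pair returned by `splice_junctions` marks where an exon ends and where the next one begins, which are the boundaries the later AI step would have to predict.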
Summary • Human: 23 pairs of chromosomes • Chromosome => thousands of genes • Gene => information: exons; comments: introns • Exons and introns => codons • Codon => 3 bases
Data collection • Human Genome Project • NCBI website: http://www.ncbi.nlm.nih.gov • Entrez-Nucleotide.htm • NCBI Sequence Viewer.htm
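One way to collect sequences programmatically rather than through the browser is sketched below. It is an assumption, not the method used in the slides: it builds a request URL for NCBI's E-utilities `efetch` service and parses a FASTA-formatted reply. The helper names and the sample text are hypothetical, and no accession number from the original work is implied.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: NCBI E-utilities efetch (not mentioned in the slides).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession):
    """Build a URL asking for one nucleotide record in FASTA text form."""
    params = {"db": "nuccore", "id": accession,
              "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def parse_fasta(text):
    """Return (header, sequence) for the first record in FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; stop after the first
            header = line[1:]
        else:
            chunks.append(line.upper())
    return header, "".join(chunks)


def fetch_sequence(accession):
    """Download and parse one record (needs network access; not run here)."""
    with urlopen(efetch_url(accession)) as reply:
        return parse_fasta(reply.read().decode("ascii"))


if __name__ == "__main__":
    sample = ">demo record (made up)\nacgt\nTTGA\n"
    print(parse_fasta(sample))
```

Keeping URL construction, download and parsing in separate functions means the parser can be exercised on saved pages without touching the network.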
Perl • Practical Extraction and Report Language • POD files -> web • Portability • Free, with CPAN modules • String manipulation • Extremely powerful regex engine • Glue language designed for short and simple tasks, which does not mean it lacks power or “serious” features • Tutorial: http://www.netcat.co.uk/rob/perl/win32perltut.html
Regular Expression – Pattern Matching • Practical Extraction and Report Language • Scan through data and extract useful information • Match: m/PATTERN/ • Substitute: s/PATTERN/REPLACEMENT/ • 1 line of Perl can do the work of roughly 100 lines of C or Java • Complex syntax, but easy to use
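The two operations named on this slide, matching and substituting, look roughly like this when written with Python's `re` module, where `re.search` plays the role of m/PATTERN/ and `re.sub` the role of s/PATTERN/REPLACEMENT/g. The record line and the helper names are invented for illustration and do not come from the original application.

```python
import re


def extract_length(line):
    """Match step: pull the number in front of 'bp' out of a record line."""
    match = re.search(r"(\d+) bp", line)
    return int(match.group(1)) if match else None


def collapse_whitespace(text):
    """Substitute step: replace every run of whitespace with one space."""
    return re.sub(r"\s+", " ", text)


if __name__ == "__main__":
    line = "LOCUS  DEMO  1200 bp  DNA"  # made-up record line
    print(extract_length(line))
    print(collapse_whitespace("acgt   ttga\tccat"))
```

The parenthesised group in the pattern is what Perl would expose as $1; here it is read back with `match.group(1)`.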
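Regex examples • /[KCZ]arl^sa/ • /<I>(.*?)<\/I>/i • Captured groups: $1, $2, … • Modifiers: i, g, c, … • Metacharacters: . , * , + , ? • /([0-9a-zA-Z])+/ or /([\w])+/ (\w also matches _) • s/us[^a-z]/them/g or s/us\W/them/g • /((?:acc|act)(?:ttt|ttc|att))/ • TIMTOWTDI (There's more than one way to do it)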
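A few of these patterns can be tried out directly. The sketch below uses Python's `re` module, which shares the syntax of these particular constructs with Perl; the sample strings are made up. It also shows a common pitfall when matching codon alternatives: inside square brackets, `|` is an ordinary character and the letters form a set, so alternation needs a group instead.

```python
import re

# Non-greedy capture between italic tags, case-insensitive (the /i modifier).
ITALIC = re.compile(r"<I>(.*?)</I>", re.IGNORECASE)

# Pitfall: square brackets define a set of single characters, not words.
CLASS_FORM = re.compile(r"[acc|act][ttt|ttc|att]")

# Alternation between whole codons needs (non-capturing) groups.
GROUP_FORM = re.compile(r"(?:acc|act)(?:ttt|ttc|att)")


def replace_us(text):
    """Equivalent of s/us\\W/them/g: note the \\W character is consumed too."""
    return re.sub(r"us\W", "them", text)


def replace_us_word(text):
    """Variant using word boundaries, which keeps the punctuation."""
    return re.sub(r"\bus\b", "them", text)


if __name__ == "__main__":
    print(ITALIC.findall("<i>Homo</i> and <I>sapiens</I>"))
    print(bool(CLASS_FORM.fullmatch("at")))      # two characters are enough
    print(bool(GROUP_FORM.fullmatch("at")))
    print(bool(GROUP_FORM.fullmatch("actttc")))
    print(replace_us("bus, us."))
    print(replace_us_word("bus, us."))
```

The two substitution helpers give different results on the same input, a small instance of there being more than one way to do it, with different trade-offs.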
Part 2: Applying AI • Our choice: evolutionary computing • First part: identify the exon regions • Second part: identify the splice junctions • Third part: combine the previous parts • Goal: more than 90% accuracy