280 likes | 483 Views
Parsing. Analyze text: split it into meaningful units, tokens Extract relevant information , disregard irrelevant information ‘Meaningful’ and ‘relevant’ depend on application: what are we looking for?. Blast. Program package for finding similarities between biological sequences
E N D
Parsing • Analyze text: split it into meaningful units, tokens • Extract relevant information, disregard irrelevant information • ‘Meaningful’ and ‘relevant’ depend on application: what are we looking for?
Blast • Program package for finding similarities between biological sequences • blastn compares DNA sequences with DNA sequences • Input: • Fasta file with query sequences • Formatted Fasta file with database sequences • Sensitivity parameter (and more) • Output: • Result of comparing each query to each database sequence
Example run Query file: arachis.fasta Database file: arabidopsis_nucleotides.fasta Format the database: formatdb –i arabidopsis.fasta –p F –o T Command: /users/chili/usr/blast-2.2.13/bin/blastall -p blastn-e 0.000000002-d arabidopsis.fasta-i arachis.fasta-o arachis_arab.bn
Example output – query with no match BLASTN 2.2.6 [Apr-09-2003] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Query=CL5Contig1 (797 letters) Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters Searching..................................................done ***** No hits found ****** ..
Example output – query with matches Query=CL69Contig1 (372 letters) Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10 gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10 gb|AV519674.1|AV519674 AV519674 Arabidopsis 68 1e-09 gb|AV557401.1|AV557401 AV557401 Arabidopsis 42 3e-05 gb|BP670151.1|BP670151 BP670151 RAFL21 Arabidopsis 43 1e-04 ..
Example output – match alignment >gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis Length = 1009 Score = 69.9 bits (35), Expect = 3e-10 Identities = 47/51 (92%) Strand = Plus / Plus Query:67 gagctattaacaggtaagggtcttttgaagggaacaggcttcttggacttc 117 ||||||||||||||||| ||||| ||||| |||||||| |||||||||||| Sbjct:776 gagctattaacaggtaaaggtctattgaaaggaacagggttcttggacttc 826 .. General form of output: Repetitions of (query, subject matches, alignments)
Extract information from blast output • Extract the best hit for each query sequence Query= CL69Contig1 (372 letters) .. Score E Sequences producing significant alignments: (bits) Value gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10 gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10 ..
Algorithm • Read blast output file line by line • Introduce two states: • Looking for next query • Looking for hit list • Return dictionary of query best hit
First state: Looking for next query Query= CL69Contig1 (372 letters) Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10 gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10 .. Look for a line starting with Query= (the = is important!)
Why we look for Query= and not just Query >gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis Length = 1009 Score = 69.9 bits (35), Expect = 3e-10 Identities = 47/51 (92%) Strand = Plus / Plus Query:67 gagctattaacaggtaagggtcttttgaagggaacaggcttcttggacttc 117 ||||||||||||||||| ||||| ||||| |||||||| |||||||||||| Sbjct:776 gagctattaacaggtaaaggtctattgaaaggaacagggttcttggacttc 826 ..
Second state: Looking for hit list Query= CL69Contig1 (372 letters) Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10 gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10 .. Case A: hits were found
Case B: no hits were found Query= CL5Contig1 (797 letters) Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters Searching..................................................done ***** No hits found ****** Query= CL69Contig1 (372 letters) Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10 gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10 ..
Second state: Looking for hit list Query= CL5Contig1 (797 letters) Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters Searching..................................................done ***** No hits found ****** Query= CL69Contig1 (372 letters) Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10 gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10 .. Look for a line starting with Searching Then read a few more lines to distinguish case A/B
blastparser.py (part 1) Find the query ID: Query= CL69Contig1
blastparser.py (part 2) Find the best match ID: Searching..................................................done Score E Sequences producing significant alignments: (bits) Value gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10 Find the best match ID: Searching..................................................done ***** No hits found ******
Test >>> from blastparser import parseBlastallOutput >>> d = parseBlastallOutput(“arachis_arab.bn”) >>> d[“gi|30419745”] ‘gb|BP625785.1|BP625785’ >>> d[“gi|30419753”] ‘none’
Two pass-parsing Evolutionary tree of life (animal kingdom) • Huge hierarchy of groups and subgroups • Each node in the tree has a name and a (possibly empty) list of descendant trees (sons) Source: The origin and evolution of model organisms, Nature Genetics, Nov. 2002, vol. 3.
Abstract data structure to represent a general tree (not necessarily binary) tree.py
How can we write a tree to a sequential file? (File format should be readable by other systems, so we can’t use cPickle) • A tree is a labeled node containing a (possibly empty) list of other (sub)trees • Write tree node using start and end tags: <N=“Insects”> [sons] </N> • Formally (context-free grammar): T→ <N=“L”>S</N> S→λ| TS L→string label Insects Flies Beetles B C A E D
Recursive method: string representation of tree Insects Flies Beetles B C A E D tree.py First obtain string representation of sons (empty string if no sons) by calling function recursively.. .. then create string with start tag, label, sons’ representation, and end tag .. <N=“Beetles”><N=“C”></N><N=“D”></N><N=“E”></N></N> ..
Larger tree – How can we read a tree from a sequential file? <N="Terrestrialvertebrates"><N="Synapsida"><N="Therapsida"><N="Mammalia"><N="Marsupialia"><N="Kangaroo"></N><N="Koala"></N></N><N="Eutheria"><N="Primates"><N="Human"></N><N="Gorilla"></N><N="Chimpanzee"></N></N><N="Carnivora"><N="Walrus"></N><N="Wolf"></N></N><N="Proboscidea"><N="Elephant"></N></N></N></N></N></N><N="Reptilia"><N="Diapsida"><N="Archosauromorpha"><N="Tyrannosaurus"></N><N="Penguin"></N><N="Owl"></N></N><N="Lepidosauromorpha"><N="Lizard"></N><N="Snake"></N></N></N><N="Testudines"><N="Turtle"></N></N></N></N> part_of_the_tree_of_life.txt We need a parser!
Two-pass parsing Complex parsing is often split in two passes: • Lexical analysis • Identify and assemble tokens: logical units of text • Structural analysis • Determine the structural hierarchy of the tokens In our case, the tokens are the two kinds of tag:
Lexical analysis Match either a start tag or an end tag Define a group containing the start tag’s label phylogenyparser.py (part 1) Search text from index pointer Create token of right type Move index pointer
Structural analysis Real root will be first son of this node 1 2 3 phylogenyparser.py (part 2) .. <N="Kangaroo"></N><N="Koala"></N>.. .. <N="Kangaroo"></N><N="Koala"></N>.. current_node current_node current_node new_node Kangaroo 1 2 3
Terrestrial vertebrates Turtle Testudines Reptilia Lizard Synapsida Lepidosauromorpha Diapsida Snake Therapsida Archosauromorpha Mammalia Owl Kangaroo Eutheria Penguin Marsupilia Tyrannosaurus Proboscidea Koala Primates Elephant Carnivora Human Wolf Gorilla Walrus Chimpanzee
Testprogram phylogenyparser.py
Navigating in the tree Name: Diapsida Father: Reptilia Siblings: Testudines Sons: Archosauromorpha Lepidosauromorpha (f)ather, (s)on, si(b)ling, (p)rint, (q)uit? b Number of sibling (0-0)? 0 Name: Testudines Father: Reptilia Siblings: Diapsida Sons: Turtle (f)ather, (s)on, si(b)ling, (p)rint, (q)uit? p <N="Testudines"><N="Turtle"></N></N> Name: Testudines Father: Reptilia Siblings: Diapsida Sons: Turtle (f)ather, (s)on, si(b)ling, (p)rint, (q)uit? f Name: Reptilia Father: Terrestrial vertebrates Siblings: Synapsida Sons: Diapsida Testudines Turtle Testudines Reptilia Lepidosauromorpha Diapsida Archosauromorpha