Speaker:
• (Peter) Xiaoyong Wu
• Bioinformatics
• 4/28/03
Topic
• Including Biological Literature Improves Homology Search
• Jeffrey T. Chang, Soumya Raychaudhuri, and Russ B. Altman
• Paper source: http://www.jeffchang.com/
Problem
• Target of bio-sequence study: annotate the huge volume of sequence data on the basis of accurate homology recognition (e.g., reveal the possible functions and relationships of sequences for medical research)
• Current approach: sequence similarity search, such as PSI-BLAST
• Problem: sequence similarity ≠ sequence homology (homologs can diverge beyond detectable similarity, and similar sequences are not always related)
Idea of this paper
• How do expert biologists solve this problem?
• Supplement sequence similarity with information from the biomedical literature
• Modify PSI-BLAST so that, in each iteration, literature similarity restricts the sequence search to a sensible scope
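The iteration idea can be sketched as a loop that alternates a similarity search with a literature filter. This is only an illustration: `search_once`, `lit_similarity`, the threshold `min_similarity`, and the stopping rule are hypothetical placeholders, not the paper's actual procedure or any PSI-BLAST API.

```python
from typing import Callable, List, Set


def literature_guided_search(
    query: str,
    search_once: Callable[[Set[str]], List[str]],  # hypothetical: one similarity-search pass
    lit_similarity: Callable[[str, str], float],   # hypothetical: literature similarity score
    min_similarity: float = 0.2,                   # illustrative threshold, not from the paper
    max_iterations: int = 5,
) -> Set[str]:
    """Iteratively grow a set of hits, dropping hits with dissimilar literature."""
    accepted = {query}
    for _ in range(max_iterations):
        hits = search_once(accepted)
        # Keep only hits whose literature resembles the query's literature.
        kept = {h for h in hits if lit_similarity(query, h) >= min_similarity}
        grown = accepted | kept
        if grown == accepted:  # nothing new was accepted, so stop
            break
        accepted = grown
    return accepted
```

Because rejected hits never enter `accepted`, they cannot seed the next pass, which is how the filter keeps the search from drifting to unrelated sequences.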
Methodology
• Concatenate the collected sequence information and literature into a single document and remove the so-called "stop words"
• Calculate document similarity (Wilbur and Yang) as the cosine between word vectors: cos(A, B) = (A · B) / (|A| |B|), where A and B are the word vectors of two documents
• cos(A, B) = 1: maximally similar documents; cos(A, B) = 0: different documents with no words in common
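A minimal sketch of the preprocessing step, assuming each sequence comes with a list of text snippets (annotation plus abstracts). The stop-word list here is a tiny illustrative one and the tokenizer is a simple regular expression; the slide does not say which list or tokenizer was actually used.

```python
import re
from collections import Counter
from typing import List

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "that", "this"}


def word_counts(texts: List[str]) -> Counter:
    """Concatenate the texts, lowercase them, and count the non-stop words."""
    document = " ".join(texts)
    words = re.findall(r"[a-z0-9]+", document.lower())
    return Counter(w for w in words if w not in STOP_WORDS)
```

For example, `word_counts(["The kinase binds the ATP", "and the kinase"])` counts `kinase` twice and `binds` and `atp` once each.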
Methodology
• Construct the word vectors A and B of two documents:
  A = (a1, a2, a3, …, am)
  B = (b1, b2, b3, …, bm)
• ai and bi represent the same attribute (word) in both vectors
• The m attributes are the union of the words in documents A and B
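The vector construction and the cosine score can be sketched as follows. This version uses raw word counts as vector entries, which is an assumption; the weighting used in the Wilbur and Yang method is not given on the slide.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: Counter, doc_b: Counter) -> float:
    """Cosine of the angle between two word-count vectors."""
    # Shared attribute list: the union of the words in both documents.
    vocabulary = sorted(set(doc_a) | set(doc_b))
    a = [doc_a[w] for w in vocabulary]  # a Counter returns 0 for a missing word
    b = [doc_b[w] for w in vocabulary]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Two documents with the same word counts score 1, and two documents with no words in common score 0.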
Methodology: validation & test
• Superfamilies of proteins: SCOP (http://scop.berkeley.edu/) defines over 1000 protein superfamilies. Proteins in one superfamily share the same function, but one protein may belong to more than one superfamily.
• Gold standard: every protein belongs to exactly one superfamily. All proteins with multiple functions are removed.
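The gold-standard filter amounts to keeping only the proteins assigned to a single superfamily. A small sketch, assuming the assignments are available as a mapping from protein ID to a set of superfamily IDs (the IDs below are made up for illustration):

```python
from typing import Dict, Set


def build_gold_standard(superfamilies_by_protein: Dict[str, Set[str]]) -> Dict[str, str]:
    """Keep proteins that belong to exactly one superfamily."""
    return {
        protein: next(iter(families))
        for protein, families in superfamilies_by_protein.items()
        if len(families) == 1
    }


# Hypothetical assignments: P2 spans two superfamilies and is dropped.
assignments = {"P1": {"sf_a"}, "P2": {"sf_a", "sf_b"}, "P3": {"sf_b"}}
gold = build_gold_standard(assignments)
```

Here `gold` maps `P1` to `sf_a` and `P3` to `sf_b`.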
Results
• Recall = (number of homologous sequences passing a fixed e-value cutoff, i.e. gold-standard sequences retrieved by the modified PSI-BLAST) / (total number of homologous sequences in the gold standard)
• Precision = (number of homologous sequences detected) / (total number of sequences detected, i.e. reported by PSI-BLAST)
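Both measures reduce to set arithmetic once the detected sequences (those passing the e-value cutoff) and the true homologs from the gold standard are known. A minimal sketch, with a made-up example:

```python
from typing import Set, Tuple


def recall_and_precision(detected: Set[str], homologs: Set[str]) -> Tuple[float, float]:
    """Return (recall, precision) for one query.

    detected: sequences reported by the search at the chosen e-value cutoff
    homologs: true homologs of the query according to the gold standard
    """
    true_hits = detected & homologs
    recall = len(true_hits) / len(homologs) if homologs else 0.0
    precision = len(true_hits) / len(detected) if detected else 0.0
    return recall, precision


# Hypothetical query: 4 sequences reported, 3 of them among the 5 true homologs.
recall, precision = recall_and_precision({"a", "b", "c", "x"}, {"a", "b", "c", "d", "e"})
```

In this example recall is 3/5 and precision is 3/4.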
Questions? Thanks!