Finding Functional Gene Relationships Using the Semantic Gene Organizer (SGO)

Finding Functional Gene Relationships Using the Semantic Gene Organizer (SGO) Kevin Heinrich Master’s Defense July 16, 2004

Outline • Problem / Goals • Related Work • Information Retrieval • Vector Space Model • Latent Semantic Indexing (LSI) • Biological Databases • SGO Use & Results

Problem • Biological tools are creating vast amounts of data. • Current techniques are time-consuming and expensive. • Want to know phenotype (function) from genotype (structure/sequence).

Goals • Develop a tool to aid researchers in finding and understanding functional gene relationships. • Use information that covers the whole genome, e.g., the literature.

Related Work • Jenssen et al. (2001) developed PubGene. • Literature network • Assigns functional association if there is a co-occurrence of gene symbols • Wilkinson and Huberman (2004) expanded this idea to find communities of related genes. • Yandell and Majoros (2002) used natural language processing techniques to identify the nature of relationships.

Related Work • Almost all literature-based techniques rely on term co-occurrence. • What about gene aliases? • Solution: Apply a more robust technique.
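To make the alias problem concrete, here is a minimal sketch of symbol co-occurrence counting. It is not PubGene's implementation; the function name `cooccurrence_counts`, the whitespace tokenizer, and the two toy abstracts are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(abstracts, symbols):
    """Count how often each pair of gene symbols appears in the same abstract."""
    pairs = Counter()
    for text in abstracts:
        tokens = set(text.lower().split())
        present = sorted(s for s in symbols if s in tokens)
        for a, b in combinations(present, 2):
            pairs[(a, b)] += 1
    return pairs

# Toy abstracts (made up). The second one spells out "reelin"
# instead of using the symbol "reln".
abstracts = ["reln activates dab1", "reelin binds lrp8"]
symbols = {"reln", "dab1", "lrp8"}
pairs = cooccurrence_counts(abstracts, symbols)
print(pairs)
```

Because the second abstract never uses the symbol itself, no reln/lrp8 link is recorded, which is the kind of miss that motivates a more robust technique.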

Information Retrieval: Vector Space Model • Documents are parsed into tokens. • Each token is assigned a weight w_ij, the weight of the ith token in the jth document. • An m x n term-by-document matrix A = [w_ij] is created, where • Documents are m-dimensional vectors (columns of A). • Tokens are n-dimensional vectors (rows of A).
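The construction above can be sketched in a few lines. This is an illustrative sketch only: the whitespace tokenizer, the helper names, and the three toy documents are assumptions, and raw counts stand in for the weights w_ij.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real parser."""
    return text.lower().split()

def term_document_matrix(docs):
    """Return (terms, A) where A[i][j] is the count of term i in document j."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*counts))
    A = [[c[t] for c in counts] for t in terms]
    return terms, A

# Toy documents (made up for illustration).
docs = ["reelin binds vldlr", "vldlr signals through dab1", "reelin reelin dab1"]
terms, A = term_document_matrix(docs)

for term, row in zip(terms, A):
    print(f"{term:8s} {row}")
```

Each row of `A` is a token vector of length n (one entry per document), and each column is a document vector of length m (one entry per token).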

Information Retrieval: Term Weights • Term weights are the product of a local and a global component. • tf • idf • idf2
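As an illustration of a local-times-global weight, the sketch below uses the textbook tf-idf form, with tf as the raw count and idf as log2(n / df). The exact idf and idf2 variants from the slide are not preserved here, so this formula and the function name `tf_idf` are assumptions.

```python
import math

def tf_idf(A):
    """Weight a term-by-document count matrix: w_ij = f_ij * log2(n / df_i)."""
    n = len(A[0])                                 # number of documents
    W = []
    for row in A:
        df = sum(1 for f in row if f > 0)         # documents containing the term
        idf = math.log2(n / df) if df else 0.0    # global component
        W.append([f * idf for f in row])          # local component is the raw count
    return W

counts = [[1, 1],   # a term that appears in every document
          [2, 0]]   # a term that appears in only one document
weights = tf_idf(counts)
print(weights)
```

A term that occurs in every document gets a global weight of zero, so only the term confined to one document contributes.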

Information Retrieval: Term Weights (cont’d) • log-entropy • Goal is to give distinguishing terms more weight.
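A sketch of log-entropy weighting in its commonly used form: local weight log2(1 + f_ij), global weight 1 + sum_j(p_ij log2 p_ij) / log2 n, with p_ij = f_ij / sum_j f_ij. The slide's own formula is not preserved, so treat this form and the function name `log_entropy` as assumptions.

```python
import math

def log_entropy(A):
    """Apply log-entropy weighting to a term-by-document count matrix (n > 1)."""
    n = len(A[0])
    W = []
    for row in A:
        total = sum(row)
        entropy = 0.0
        for f in row:
            if f > 0:
                p = f / total
                entropy += p * math.log2(p)
        g = 1.0 + entropy / math.log2(n)          # global weight in [0, 1]
        W.append([g * math.log2(1 + f) for f in row])
    return W

counts = [[1, 1, 1, 1],   # spread evenly over all documents
          [4, 0, 0, 0]]   # concentrated in one document
weights = log_entropy(counts)
print(weights)
```

The evenly spread term receives a global weight of 0, while the concentrated term keeps its full local weight, which is the "distinguishing terms get more weight" behavior.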

Information Retrieval: Query & Similarity • Queries are represented by a pseudo-document vector. • Similarity is the cosine of the angle between document vectors.
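A minimal sketch of querying in the vector space model: the query is turned into a pseudo-document over the same terms and ranked against each column by cosine. The helper names and the small count matrix are assumptions for illustration.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def query_vector(query, terms):
    """Represent a query as a pseudo-document over the known terms."""
    counts = Counter(query.lower().split())
    return [counts[t] for t in terms]

def rank(query, terms, A):
    """Return (score, document index) pairs, best match first."""
    q = query_vector(query, terms)
    n = len(A[0])
    scores = [(cosine(q, [row[j] for row in A]), j) for j in range(n)]
    return sorted(scores, reverse=True)

terms = ["dab1", "reelin", "vldlr"]
A = [[0, 1, 1],    # dab1
     [1, 0, 2],    # reelin
     [1, 1, 0]]    # vldlr
ranking = rank("reelin", terms, A)
print(ranking)
```

Document 1 contains no query term, so its score is exactly zero; plain cosine matching cannot return it, however related it may be.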

Information Retrieval: Latent Semantic Indexing (LSI) LSI performs a truncated SVD of A, where A = UΣV^T • U is the m x r matrix of eigenvectors of AA^T • V^T is the r x n matrix of eigenvectors of A^TA • Σ is the r x r diagonal matrix containing the r nonzero singular values of A • r is the rank of A A rank-k approximation is given by A_k = U_k Σ_k V_k^T
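The factorization can be tried directly with NumPy. This is a sketch on a random toy matrix (an assumption, not SGO data); it builds A_k from the k largest singular triplets and checks the standard result that the Frobenius error equals the root of the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))                  # toy 6-term x 4-document matrix

# Thin SVD: U is 6x4, s holds 4 singular values, Vt is 4x4.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k approximation

err = np.linalg.norm(A - A_k, "fro")
expected = np.sqrt(np.sum(s[k:] ** 2))
print(err, expected)
```

NumPy returns the singular values in descending order, so slicing the first k columns of `U` and rows of `Vt` keeps the dominant factors.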

Information Retrieval: LSI (cont’d) • Document-to-document similarity is A_k^T A_k = (V_k Σ_k)(V_k Σ_k)^T • Queries are projected into the low-rank approximation space

Information Retrieval: LSI (cont’d) • Scaled document vectors can be computed once and stored for quick retrieval. • The lower-dimensional space forces queries and documents to be compared in a more conceptual manner and saves storage. • Choice of number of factors is an open question. • End Effect: LSI can find similarities between documents that have no term co-occurrence.
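The end effect can be seen on a toy matrix. In this sketch, documents d0 and d2 use different aliases of the same gene ("apoer2" and "lrp8") and share no terms, while d1 mentions both. The matrix is made up, the scaled document vectors are taken as the rows of V_k Σ_k, and the query is projected as U_k^T q; that projection is one common convention and an assumption here, not necessarily what SGO does.

```python
import numpy as np

terms = ["apoer2", "lrp8", "amyloid", "tau"]
# Columns are documents d0..d4 (toy data).
A = np.array([
    [1, 1, 0, 0, 0],   # apoer2
    [0, 1, 1, 0, 0],   # lrp8
    [0, 0, 0, 1, 1],   # amyloid
    [0, 0, 0, 1, 0],   # tau
], dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_k = (np.diag(s[:k]) @ Vt[:k, :]).T   # rows are scaled document vectors

raw_sim = cosine(A[:, 0], A[:, 2])        # d0 vs d2 in the original space
lsi_sim = cosine(docs_k[0], docs_k[2])    # d0 vs d2 in the rank-k space
cross = cosine(docs_k[0], docs_k[3])      # d0 vs an unrelated document

q = np.array([1.0, 0.0, 0.0, 0.0])        # keyword query "apoer2"
q_k = U[:, :k].T @ q                      # assumed projection of the query
query_sim = cosine(q_k, docs_k[2])        # match against the lrp8-only document

print(raw_sim, lsi_sim, cross, query_sim)
```

`docs_k` only needs to be computed once per collection and choice of k, which is why the scaled vectors can be stored for quick retrieval.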

Information Retrieval: Evaluation Measures • Precision – ratio of relevant returned documents to the total number of returned documents. • Recall – ratio of relevant returned documents to the total number of relevant documents. • Goal is to have high precision at all levels of recall. • Systems are often evaluated by average precision (AP), which is the average of 11 interpolated precision values at recall levels 0%, 10%, ..., 100%.
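A sketch of 11-point interpolated average precision under the usual convention that the interpolated precision at a recall level is the highest precision observed at that recall or beyond. The function name and the example ranking are assumptions.

```python
def eleven_point_ap(ranked_relevance, total_relevant):
    """ranked_relevance: booleans in rank order (True = relevant document)."""
    points = []                       # (hits so far, precision) after each rank
    hits = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
        points.append((hits, hits / rank))

    interpolated = []
    for i in range(11):               # recall levels 0.0, 0.1, ..., 1.0
        # recall >= i/10, compared in integers to avoid rounding issues
        candidates = [p for h, p in points if h * 10 >= i * total_relevant]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / 11

# Relevant documents returned at ranks 1 and 3; two relevant documents in total.
ap = eleven_point_ap([True, False, True, False], total_relevant=2)
print(round(ap, 4))
```

In this example the six recall levels up to 50% interpolate to precision 1, and the five levels above 50% interpolate to 2/3.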

Biological Databases: MEDLINE • MEDLINE (NLM) • Contains 14+ million references to journal articles with a concentration in medicine • Spans over 4,600 journals worldwide • 1966 to present • ~500,000 citations added annually • Each citation is manually indexed with MeSH terms.

Biological Databases: PubMed • PubMed • Retrieves articles from MEDLINE and other journals. • Can be queried via any combination of attributes.

Biological Databases: LocusLink • NCBI human-curated database • Single query interface to a comprehensive directory for genes and gene reference sequences for key genomes. • Provides links to related records in PubMed and other citations when applicable. • Provides RefSeq Summary of gene function and links to key MEDLINE citations relevant to each gene.

Biological Databases: Overview • MEDLINE has a lot of information • Not all articles relate to genes • Gene terminology problem • LocusLink does not cover all relevant citations, only a representative few.

Biological Databases: Gene Document Construction • Concatenate titles and abstracts of MEDLINE citations cross-referenced in Human, Rat, and Mouse LocusLink entries. • Sequencing abstracts included – noise • LocusLink references are not comprehensive, so recall of all relevant abstracts is not guaranteed.
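The concatenation step might look like the sketch below. The data layout is hypothetical: `gene_to_pmids` stands for the LocusLink cross-references and `citations` for MEDLINE records keyed by identifier; the identifiers and texts are made up, and this is not SGO's actual parser.

```python
def build_gene_documents(gene_to_pmids, citations):
    """Concatenate the titles and abstracts of each gene's cited articles."""
    docs = {}
    for gene, pmids in gene_to_pmids.items():
        parts = []
        for pmid in pmids:
            citation = citations.get(pmid)
            if citation is None:          # cross-reference with no record on hand
                continue
            parts.append(citation["title"])
            parts.append(citation["abstract"])
        docs[gene] = " ".join(parts)
    return docs

# Hypothetical identifiers and text, for illustration only.
citations = {
    "pmid-1": {"title": "Title one.", "abstract": "Abstract one."},
    "pmid-2": {"title": "Title two.", "abstract": "Abstract two."},
}
gene_to_pmids = {"geneA": ["pmid-1", "pmid-2"], "geneB": ["pmid-2", "pmid-3"]}
gene_docs = build_gene_documents(gene_to_pmids, citations)
print(gene_docs["geneA"])
```

Each gene then corresponds to one document (one column of the term-by-document matrix), so gene-to-gene comparison becomes document-to-document comparison.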

SGO • Primarily uses LSI to rank genes. • Enables user to specify query method • Gene query • Keyword query • Number of factors • Show latent matches • Saves previous query sessions.

SGO: Interface

SGO: Interface (cont’d)

SGO: Trees • Unfortunately, ranked lists mean little to biologists. • Pairwise distances can be formed into a matrix D = [d_ij], where d_ij is derived from the similarity between documents i and j

SGO: Trees (cont’d) • Fitch-Margoliash (1967) method in PHYLIP is applied to D to generate hierarchical trees. • Thresholds can be applied to the self-similarity matrix to produce graphs.
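The thresholding step can be sketched as follows: every pair of genes whose similarity meets a cutoff becomes an edge. The function name, the cutoff of 0.5, and the similarity values are invented for illustration and are not SGO results.

```python
def similarity_graph(names, S, threshold):
    """Return edges (name_i, name_j, similarity) for pairs at or above threshold."""
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):    # upper triangle only; S is symmetric
            if S[i][j] >= threshold:
                edges.append((names[i], names[j], S[i][j]))
    return edges

names = ["RELN", "DAB1", "APP"]
S = [[1.0, 0.8, 0.1],      # made-up pairwise similarities
     [0.8, 1.0, 0.2],
     [0.1, 0.2, 1.0]]
edges = similarity_graph(names, S, threshold=0.5)
print(edges)
```

Raising the threshold prunes weaker edges and breaks the graph into smaller connected groups; lowering it merges them.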

SGO: Hierarchical Tree

SGO: Graph or Nodal Tree

SGO: Coding Issues • Web interface – must be interactive • Queries are processed on click • Document collections are parsed offline • Trees are constructed offline • Storage will eventually become an issue.

Results: Test Data Set • A 50-gene test data set was constructed. • Alzheimer’s Disease • Cancer • Development • Reelin signaling pathway used as basis for evaluation • 5 primary genes (directly associated) • 7 secondary genes (indirectly associated)

Results: Primary AP • AP for 5 primary genes • 61% for 5 factors • 84% for 25 factors • 84% for 50 factors

Results: Secondary AP • AP for 12 secondary genes • 53% for 5 factors • 59% for 25 factors • 61% for 50 factors

Results: Comparison • LSI comparable to tf-idf for 5 primary genes • Far superior to tf-idf for 12 secondary genes • PubMed co-citation identifies 2 of the 7 indirectly related genes • Abstract overlap of LocusLink citations fails to identify any indirectly related genes • tf-idf fails on many keyword queries • Tested on Gene Ontology classifications (not shown) • Similar tendencies are observed

Results: Abstract Representation • To simulate scaling up, the representation of reelin-related genes was decreased • AP of 47% on 20,856 Human LocusLink abstracts

Results: Hierarchical Tree

Conclusions • SGO allows genes to be compared to each other and to keywords (functions). • SGO identifies latent relationships with promising accuracy. • SGO is not meant to replace existing technologies, but to assist researchers • Verify current results • Direct future exploration

Future Work • Scale up to entire genome • Document construction • Incorporate structural or other information for multi-modal similarity • Test other models, e.g., NMF, QR • Interactive tree building • Keep collections current
