What's in a word? Term-based approaches across bioinformatics, scientometrics and knowledge management
Patrick Glenisson
Bio-informatics group, Dept. Electrical Engineering, K.U.Leuven, Belgium
Steunpunt O&O Statistieken, Faculty of Economy, K.U.Leuven, Belgium
Introduction: K.U. Leuven, Faculty of Applied Sciences, Department of Electrical Engineering
Bio-informatics research
Research on algorithms and software development for:
• clinical bioinformatics
• gene regulation bioinformatics
Methods: text mining, Gibbs sampling, graphical models, classification & clustering
Introduction: K.U. Leuven, Faculty of Applied Sciences, Department of Electrical Engineering
Bio-informatics research: text mining research
• Combine statistical approaches with domain-specific requirements
• Knowledge discovery through literature analysis in various domains:
  • Bio-informatics
  • Sciento- & technometrics
  • Knowledge management
Overview
• Bio-informatics:
  • gene profiling
  • multi-view learning
• Scientific trend mapping
  • clustering and bibliometric indicators
• Innovation & spillovers
  • Tracing of persons in science & technology spaces
(slide timings: 25’, 5-10’)
Overview
• Text mining goals
  • Information retrieval
  • Document analysis & extraction of tokens
  • Information extraction
• Text mining methodology
  • Shallow statistics
  • Shallow parsing
  • Full NLP parsing
• Overall approach
  • Domain-specific
  • Problem-specific
  • Generic
Case 1: Literature & biological data
‘Post-genome’ biology
• focus shift:
  • from single gene to gene groups
  • complex interactions within cellular environment
• microarrays measure the simultaneous activity:
[Figure: gene expression measurement matrix; rows are genes G1, G2, G3, .. with gene annotations; columns are samples C1, C2, C3, .. with sample annotations]
[Figure labels: expression data (genes × conditions); clustering; interpretation]
[Figure labels: expression data (genes × conditions); databases with gene annotations and relations encoded as free text; gene expression; integrated analysis; prior information]
Hence, 2 views:
• Text analysis for interpretation (supportive role)
• Text analytics for ‘inference’ (active role)
GeneRIF
• 12133521: VEGF is associated with the development and prognosis of colorectal cancer.
• 12168088: PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression.
• 11866538: Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex
GO
• cell proliferation
• heparin binding
• growth factor activity
A ‘historical’ quote: ‘Until now it has been largely overlooked that there is little difference between retrieving an abstract from MEDLINE and downloading an entry from a biological database’ (M. Gerstein, 2001)
Increased awareness
• Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems.
• Structured vocabularies are on the rise
  • GO
  • MeSH
  • eVOC
• Standards are systematically being adopted to store biological concepts or annotations:
  • HUGO for gene names
  • GOA
  • …
Vector space model
• Document processing
  • Remove punctuation & grammatical structure (‘bag of words’)
• Define a vocabulary
  • Identify multi-word terms, e.g., tumor suppressor (phrases)
  • Eliminate low-content words, e.g., and, gene, ... (stopwords)
  • Map words with the same meaning (synonyms)
  • Strip plurals, conjugations, ... (stemming)
• Define a weighting scheme and/or transformations (tf-idf, SVD, ..)
• index
[Figure: a gene represented as a vector over vocabulary terms T1, T2, T3 (GOF)]
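The processing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the talk: the stopword list, the naive plural-stripping `stem`, the tf-idf variant (raw term frequency times log(N/df)), and the choice to build a gene's index entry by averaging the vectors of its linked documents are all hypothetical.

```python
import math
import re
from collections import Counter

STOPWORDS = {"and", "the", "of", "is", "in", "by", "with", "gene"}  # assumed list

def stem(word):
    # Crude stand-in for a real stemmer: strip a trailing plural "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def tokenize(text):
    # Bag of words: lowercase, drop punctuation, drop stopwords, stem.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words
            if w not in STOPWORDS and stem(w) not in STOPWORDS]

def tfidf_vectors(docs):
    # docs: list of raw strings; returns one {term: weight} dict per document.
    bags = [Counter(tokenize(d)) for d in docs]
    n = len(bags)
    df = Counter(term for bag in bags for term in bag)  # document frequency
    return [{t: tf * math.log(n / df[t]) for t, tf in bag.items()}
            for bag in bags]

def gene_index(doc_vectors):
    # Assumed aggregation: average the vectors of all documents linked to a gene.
    total = Counter()
    for vec in doc_vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {t: w / len(doc_vectors) for t, w in total.items()}
```

Multi-word terms and synonym mapping are left out here; both would slot into `tokenize` as a lookup against the chosen domain vocabulary.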
Validity of gene index
Genes that are functionally related should be close in text space:
• Text-based coherence score
• Modeled w.r.t. a background distribution of [symbol missing in source]
• through random and permuted gene groups
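The idea of scoring a gene group against a background of random groups can be sketched as follows. The slide's actual score definition did not survive in the transcript, so this sketch assumes a simple one: the mean pairwise cosine similarity of the genes' text profiles, compared with the same score on random gene groups of equal size. The function names and the add-one empirical p-value are illustrative choices.

```python
import math
import random
from itertools import combinations

def cosine(u, v):
    # u, v: sparse text profiles as {term: weight} dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def coherence(group, index):
    # Assumed score: mean pairwise cosine similarity (needs at least 2 genes).
    pairs = list(combinations(group, 2))
    return sum(cosine(index[a], index[b]) for a, b in pairs) / len(pairs)

def coherence_p_value(group, index, n_random=1000, seed=0):
    # Background distribution: coherence of random gene groups of the same size.
    rng = random.Random(seed)
    genes = sorted(index)
    observed = coherence(group, index)
    hits = sum(
        coherence(rng.sample(genes, len(group)), index) >= observed
        for _ in range(n_random)
    )
    return (hits + 1) / (n_random + 1)  # add-one empirical p-value
```

A small p-value says the group is closer in text space than most random groups, which is the behaviour expected of functionally related genes if the index is valid.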
Optimal number of clusters? Define ‘optimal’?
• Data-centered statistical scores
  • Coherence vs separation of clusters
  • Stability of a cluster solution when leaving out data
• Text-based scoring
[Figure: clusters C1, C2, C3]
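One common data-centered score that trades off coherence against separation is the silhouette coefficient. The sketch below is a plain-Python version operating on a precomputed distance matrix; it is offered as an example of this family of scores, not as the score used in the talk.

```python
def silhouette(dist, labels):
    # dist: symmetric distance matrix (list of lists); labels: cluster per item.
    n = len(labels)
    scores = []
    for i in range(n):
        by_cluster = {}
        for j in range(n):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist[i][j])
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or single-cluster solution
            continue
        a = sum(own) / len(own)                                # coherence
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / n
```

Evaluating `silhouette` for each candidate number of clusters and keeping the maximum is one way to make ‘optimal’ concrete.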
Optimal number of clusters? Define ‘optimal’?
• Data-centered statistical scores
• Knowledge-based scores
  • Enrichment of GO annotations in clusters
  • Literature-based scoring
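Enrichment of a GO annotation in a cluster is usually tested with a hypergeometric tail probability. The sketch below assumes that standard test; the parameter names are illustrative and the slide does not state which test was used.

```python
from math import comb

def enrichment_p_value(n_genes, n_annotated, cluster_size, n_hits):
    # P(X >= n_hits) when drawing cluster_size genes without replacement from
    # n_genes, of which n_annotated carry the GO term (hypergeometric tail).
    upper = min(n_annotated, cluster_size)
    total = comb(n_genes, cluster_size)
    tail = sum(
        comb(n_annotated, k) * comb(n_genes - n_annotated, cluster_size - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total
```

When many GO terms are tested per cluster, the resulting p-values would normally need a multiple-testing correction before being compared across cluster solutions.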
TXTGate
• a platform that offers multiple ‘views’ on the vast amounts of (gene-based) free-text information available in selected curated database entries & linked scientific publications.
• incorporates term-based indices ..
• .. and uses them as a starting point
  • to explore the text through the eyes of different domain vocabularies
  • to link out to other resources by query building, or
  • to sub-cluster genes based on text.
Domain vocabularies as ‘views’
[Figure labels: term-centric; gene-centric]
Features of the approach
• Flexible tool for analyzing gene groups (~100 genes) thanks to various term- and gene-centric vocabularies
  • … that allow some level of interoperability with external annotation databases
• Sub-clustering of gene groups is useful for detecting biological sub-patterns
• Reasonably robust to corrupted groups
• Gene index normalizes for unbalanced references
• Text analysis for interpretation (supportive role)
• Text analytics for ‘inference’ (active role)
Meta-clustering text & data
• As multiple information sources are available when analyzing gene expression data, we pose the question: “How can we analyze data in an integrated fashion to extract more information than from the expression data alone?”
Integration of text & data
• In each information space:
  • Appropriate preprocessing
  • Choice of distance measures
Combine data:
• a weight expresses the confidence attributed to either of the two data types
• in the case of distances, the weight can be seen as a scaling constant between the norms of the data and text representations.
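The combination formula itself is not in the transcript. A minimal sketch, assuming a convex combination of the expression-based and text-based distance matrices with a single weight (the function and parameter names are hypothetical):

```python
def combine_distances(d_expr, d_text, weight):
    # weight in [0, 1]: confidence placed in the expression data (assumed form);
    # the text distances receive the complementary weight 1 - weight.
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        [weight * e + (1.0 - weight) * t for e, t in zip(row_e, row_t)]
        for row_e, row_t in zip(d_expr, d_text)
    ]
```

If the two distance measures live on different numeric ranges, the same weight silently doubles as a rescaling factor, which is why it can be read as a scaling constant between the two representations.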
Scaling problem
• However, the distributions of the distances introduce a bias
• Therefore, use a technique from statistical meta-analysis (the so-called omnibus procedure)
[Figure: histogram of expression distances; histogram of text distances]
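A sketch of how an omnibus procedure removes the scaling problem, assuming Fisher's method as the omnibus statistic (the transcript does not name the exact variant). Each distance is first replaced by its empirical left-tail probability within its own distribution, so both data types land on the same (0, 1] scale; the two probabilities are then combined. For two independent p-values, Fisher's statistic is chi-square with 4 degrees of freedom, whose upper tail has the closed form P(1 - ln P) with P the product of the two p-values.

```python
import math
from bisect import bisect_right

def empirical_p(values):
    # Map each distance to the fraction of distances that are <= it,
    # which puts both data types on the same (0, 1] scale.
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]

def omnibus_distances(expr_dists, text_dists):
    # expr_dists, text_dists: distances for the same gene pairs, same order.
    combined = []
    for pe, pt in zip(empirical_p(expr_dists), empirical_p(text_dists)):
        prod = pe * pt
        # Fisher's statistic -2*ln(pe*pt) is chi-square with 4 degrees of
        # freedom under independence; its upper tail is prod * (1 - ln(prod)).
        combined.append(prod * (1.0 - math.log(prod)))
    return combined
```

Because only ranks within each distribution matter, multiplying all text distances by a constant leaves the combined values unchanged.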
Optimal k?
• Various cutoffs k of the cluster tree
[Figure labels: M-score, integrated clustering; M-score, expression data only]
A peek inside
[Figure labels: text profile; expression profile; strong reinforcement]
Case 2: Sciento- & technometrics
Mapping of Science
• Journal ‘Scientometrics’
  • Full-text articles
• Document cluster analysis
  • Co-word mapping
• Temporal dimension: clusters over time
Mapping of Science
• Coupling with bibliometric indicators:
  • Based on reference (hyperlink) information
  • Mean reference age
  • Number of serials
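As an illustration of a reference-based indicator, the sketch below computes a mean reference age under the usual reading of the term: the average difference between a document's publication year and the years of the references it cites. The function name and input layout are assumptions.

```python
def mean_reference_age(publication_year, reference_years):
    # Average age, in years, of the cited references at publication time.
    if not reference_years:
        raise ValueError("document has no dated references")
    ages = [publication_year - year for year in reference_years]
    return sum(ages) / len(ages)
```

For example, a 2004 paper citing works from 2000, 2002 and 1998 has reference ages 4, 2 and 6; averaging such values over the documents in a cluster gives one number per cluster to set beside its text profile.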
Domain studies in patent space
[Figure labels: similarities; ‘seed’ patent; 30 technology classes]
User profiling & author-inventor linkage
• Name resolution
  • Same persons (variants, mistakes)
  • Different persons (similar initials, or even full name)
Example name strings on the slide: Van Veldhoven | Veldhoven, Van | Van Veldhoven | Vanveldhoven | Wim Van Veldhoven | Walter Van Veldhoven | Wim Van Veldhoven | Wim Van Veldhoven
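The first half of name resolution, collapsing spelling variants of the same name, can be sketched with a normalization key. The key below is a hypothetical choice: it undoes a "Last, First" inversion, strips diacritics, and keeps letters only.

```python
import re
import unicodedata

def name_key(name):
    # Hypothetical blocking key: undo "Last, First" inversion, strip
    # diacritics, then keep lowercase ASCII letters only.
    if "," in name:
        last, first = name.split(",", 1)
        name = first + " " + last
    ascii_name = (unicodedata.normalize("NFKD", name)
                  .encode("ascii", "ignore").decode("ascii"))
    return re.sub(r"[^a-z]", "", ascii_name.lower())

def group_variants(names):
    # Group raw name strings that share the same normalized key.
    groups = {}
    for name in names:
        groups.setdefault(name_key(name), []).append(name)
    return groups
```

String normalization cannot separate two different people who share a full name, so such cases still need evidence beyond the name itself.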
Content-based name matching
• Detect spillovers and entrepreneurial activities at (e.g.) university level
• Matching of ‘inventors’ & ‘authors’ is time-consuming
• Semi-automated approach: relevance ranking
[Figure labels: patent DB; publication DB]
Acknowledgements
• Steunpunt O&O Statistieken: Debackere K, Glänzel W
• ESAT / BioI / Text Mining: Coessens B, Van Vooren S, Janssens F, Van Dromme D
• ESAT / BioI: Moreau Y, De Moor B
Thanks!
Contact info: Patrick.glenisson@econ.kuleuven.be