Patrick Glenisson

What's in a word? Term-based approaches across bioinformatics, scientometrics and knowledge management. Patrick Glenisson. Bio-informatics group, Dept. Electrical Engineering, K.U.Leuven, Belgium; Steunpunt O&O Statistieken, Faculty of Economy, K.U.Leuven, Belgium

Introduction

Introduction: K.U. Leuven, Faculty of Applied Sciences, Department of Electrical Engineering. Bio-informatics research: research on algorithms and software development for clinical bioinformatics and gene regulation bioinformatics. Topics shown: text mining, Gibbs sampling, graphical models, classification & clustering.

Introduction: K.U. Leuven, Faculty of Applied Sciences, Department of Electrical Engineering. Bio-informatics research: text mining research. Combine statistical approaches with domain-specific requirements. Knowledge discovery through literature analysis in various domains: bio-informatics, sciento- & technometrics, knowledge management.

Overview • Bio-informatics: • gene profiling • multi-view learning • Scientific trend mapping • clustering and bibliometric indicators • Innovation & spillovers • Tracing of persons in science & technology spaces 25’ 5-10’

Overview • Text mining goals: Information Retrieval; Document analysis & Extraction of tokens; Information Extraction • Text mining methodology: Shallow Statistics; Shallow Parsing; Full NLP parsing • Overall approach: Domain-specific; Problem-specific; Generic

Case 1: Literature & biological data

protein

‘Post-genome’ biology • Focus shift: • from single gene to gene groups • complex interactions within cellular environment • Microarrays measure the simultaneous activity: Figure: gene expression measurement, genes G1, G2, G3, .. against conditions C1, C2, C3, .., with gene annotations and sample annotations.

Expression data (genes × conditions) → Clustering → Interpretation

Expression data (genes × conditions: gene expression) • Databases: gene annotations and relations encoded as free text (prior information) • Integrated analysis

Hence, 2 views: • Text analysis for interpretation (supportive role) • Text analytics for ‘inference’ (active role)

GO • cell proliferation • heparin binding • growth factor activity. GeneRIF • 12133521: VEGF is associated with the development and prognosis of colorectal cancer. • 12168088: PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression. • 11866538: Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex. A ‘historical’ quote: ‘Until now it has been largely overlooked that there is little difference between retrieving an abstract from MEDLINE and downloading an entry from a biological database’ (M. Gerstein, 2001)

Increased awareness • Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems. • Structured vocabularies are on the rise • GO • MeSH • eVOC • Standards are systematically being adopted to store biological concepts or annotations: • HUGO for gene names • GOA • …

Vector space model • Document processing • Remove punctuation & grammatical structure (‘bag of words’) • Define a vocabulary • Identify multi-word terms (phrases), e.g., tumor suppressor • Eliminate low-content words (stopwords), e.g., and, gene, ... • Map words with the same meaning (synonyms) • Strip plurals, conjugations, ... (stemming) • Define weighting scheme and/or transformations (tf-idf, SVD, ..) • Index. Figure: a gene represented as a vector over the terms T1, T2, T3 of the vocabulary (GOF).
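The document-processing steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the stopword list, the phrase table, the crude plural stripping (a stand-in for real stemming) and the function names `tokenize` and `tfidf_index` are all hypothetical and not taken from the talk.

```python
import math
import re
from collections import Counter

# Hypothetical stopword and phrase lists; real ones would be domain-specific.
STOPWORDS = {"and", "the", "of", "is", "a", "in", "gene"}
PHRASES = {"tumor suppressor": "tumor_suppressor"}


def tokenize(text):
    """Lowercase, merge known multi-word terms, drop punctuation and stopwords."""
    text = text.lower()
    for phrase, token in PHRASES.items():
        text = text.replace(phrase, token)
    tokens = re.findall(r"[a-z0-9_]+", text)
    # Crude plural stripping as a stand-in for a real stemmer.
    tokens = [
        t[:-1] if t.endswith("s") and not t.endswith("ss") and len(t) > 3 else t
        for t in tokens
    ]
    return [t for t in tokens if t not in STOPWORDS]


def tfidf_index(docs):
    """Return (vocabulary, list of sparse tf-idf vectors), one dict per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    doc_freq = {t: sum(1 for c in counts if t in c) for t in vocab}
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocab}
    vectors = [{t: c[t] * idf[t] for t in c} for c in counts]
    return vocab, vectors


docs = [
    "VEGF is a growth factor and a tumor suppressor target.",
    "PTEN is a tumor suppressor.",
    "Growth factors bind heparin.",
]
vocab, vectors = tfidf_index(docs)
print(vocab)
print(vectors[0])
```

Terms that occur in every document get weight zero under this idf definition, which is the intended effect of the weighting scheme: ubiquitous words carry no discriminating information.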

Validity of gene index. Genes that are functionally related should be close in text space: • Text-based coherence score • Modeled wrt a background distribution of • through random and permuted gene groups

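One way to make the idea of a text-based coherence score concrete is sketched below. It assumes, hypothetically, that coherence is the mean pairwise cosine similarity of the genes' term vectors and that the background is built from random gene groups of the same size; the slide does not specify the exact score, and the toy index and all function names are illustrative only.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two sparse term vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def coherence(group, index):
    """Mean pairwise cosine similarity; assumes at least two genes in `group`."""
    pairs = [(a, b) for i, a in enumerate(group) for b in group[i + 1:]]
    return sum(cosine(index[a], index[b]) for a, b in pairs) / len(pairs)


def empirical_p_value(group, index, n_random=1000, seed=0):
    """Fraction of random same-size groups at least as coherent as `group`."""
    rng = random.Random(seed)
    genes = sorted(index)
    observed = coherence(group, index)
    hits = sum(
        coherence(rng.sample(genes, len(group)), index) >= observed
        for _ in range(n_random)
    )
    return (hits + 1) / (n_random + 1)


# Toy gene index (made-up weights): g1..g3 share vocabulary, g4..g6 do not.
index = {
    "g1": {"angiogenesis": 1.0, "growth": 1.0},
    "g2": {"angiogenesis": 1.0, "growth": 0.5},
    "g3": {"angiogenesis": 0.8, "growth": 1.0},
    "g4": {"kinase": 1.0},
    "g5": {"ribosome": 1.0},
    "g6": {"membrane": 1.0},
}
group = ["g1", "g2", "g3"]
score = coherence(group, index)
p = empirical_p_value(group, index)
print(score, p)
```

A functionally related group should score well above the bulk of the random groups, which shows up as a small empirical p-value.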

Optimal number of clusters? Define ‘optimal’? • Data-centered statistical scores • Coherence vs separation of clusters • Stability of a cluster solution when leaving out data • Text-based scoring. Figure: clusters C1, C2, C3.
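As an illustration of a data-centered score that trades off coherence against separation, the sketch below computes the mean silhouette width. The silhouette is a standard choice for this purpose but is an assumption here: the slide does not say which statistical score was used, and the one-dimensional toy data are made up.

```python
def silhouette(points, labels, dist):
    """Mean silhouette width; assumes at least two clusters are present."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own:
            # Singleton cluster: silhouette is conventionally set to 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                # within-cluster (coherence)
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster (separation)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


points = [0.0, 1.0, 10.0, 11.0]
good = silhouette(points, [0, 0, 1, 1], lambda x, y: abs(x - y))
bad = silhouette(points, [0, 1, 0, 1], lambda x, y: abs(x - y))
print(good, bad)
```

Evaluating such a score for each candidate number of clusters and keeping the best one is a common way to operationalize ‘optimal’.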

Optimal number of clusters? Define ‘optimal’? • Data-centered statistical scores • Knowledge-based scores • Enrichment of GO annotations in clusters • Literature-based scoring
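Enrichment of a GO annotation in a cluster is commonly tested with a hypergeometric tail probability. The sketch below assumes that test (the slide does not name one); the function name and the example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_genome, n_annotated, n_cluster, n_hits):
    """P(X >= n_hits) when drawing n_cluster genes without replacement from
    n_genome genes, of which n_annotated carry the GO term."""
    upper = min(n_annotated, n_cluster)
    total = comb(n_genome, n_cluster)
    tail = sum(
        comb(n_annotated, k) * comb(n_genome - n_annotated, n_cluster - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total


# Made-up counts: 50 of 1000 genes carry the term; 8 of them fall in a 20-gene cluster.
p_enriched = enrichment_p_value(1000, 50, 20, 8)
print(p_enriched)
```

With an expected overlap of one gene by chance, eight annotated genes in the cluster give a very small p-value. When many GO terms are tested per cluster, a multiple-testing correction would normally follow.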

Collaborative gene filtering

TXTGate • A platform that offers multiple ‘views’ on vast amounts of (gene-based) free-text information available in selected curated database entries & linked scientific publications. • Incorporates term-based indices and uses them as a starting point • to explore the text through the eyes of different domain vocabularies • to link out to other resources by query building, or • to sub-cluster genes based on text.

Term-centric Gene-centric Domain vocabularies as ‘views’

Query building to external DB

Features of the approach • Flexible tool for analyzing gene groups (~100 genes) thanks to various term- and gene-centric vocabularies • … that allow some level of interoperability with external annotation databases • Sub-clustering gene groups is useful to detect biological sub-patterns • Reasonably robust to corrupted groups • Gene index normalizes for unbalanced references

• Text analysis for interpretation (supportive role) • Text analytics for ‘inference’ (active role)

Meta-clustering text & data • As multiple information sources are available when analyzing gene expression data, we pose the question: “How can we analyze data in an integrated fashion to extract more information than from the expression data alone?”

Mathematical integration

Integration of text & data • In each information space • Appropriate preprocessing • Choice of distance measures

Combine data with a weighting parameter that can be read as: • the confidence attributed to either of the two data types • in the case of distances, a scaling constant between the norms of the data and text representations.
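A minimal sketch of such a combination, assuming a simple convex combination of the two distances. The parameter name `alpha` and the function name are hypothetical; the slide's own formula is not preserved in this transcript.

```python
def weighted_distance(d_expr, d_text, alpha):
    """Convex combination of an expression distance and a text distance.

    alpha is the confidence placed in the expression data (assumed in [0, 1]);
    1 - alpha is the confidence placed in the text.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * d_expr + (1.0 - alpha) * d_text


# Made-up values: an expression distance on a scale of ~10, a text distance in [0, 1].
d = weighted_distance(4.0, 0.2, alpha=0.5)
print(d)
```

In this toy example the expression term contributes 2.0 and the text term only 0.1, so with equal confidence the result is still dominated by whichever representation has the larger norm. That is why `alpha` doubles as a scaling constant between the two spaces.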

However, the distributions of distances invoke a bias → scaling problem • Therefore, use a technique from statistical meta-analysis (the so-called omnibus procedure). Figure: expression distance histogram; text distance histogram.
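The sketch below shows one way an omnibus procedure can remove the scale dependence. It assumes Fisher's method, which is one common omnibus procedure but may not be the exact variant used in the talk: each distance is first mapped to an empirical p-value within its own distance distribution, and the two p-values are then combined. All names and numbers are illustrative.

```python
import math
from bisect import bisect_right


def empirical_p(sorted_sample, d):
    """Share of observed distances <= d, kept strictly positive so log() is defined."""
    return (bisect_right(sorted_sample, d) + 1) / (len(sorted_sample) + 1)


def omnibus_distance(d_expr, d_text, expr_sample, text_sample):
    """Combine two distances on different scales via Fisher's omnibus statistic.

    Returns a value in (0, 1]; it is small when both distances are small
    relative to their own distributions.
    """
    p_expr = empirical_p(expr_sample, d_expr)
    p_text = empirical_p(text_sample, d_text)
    x = -2.0 * (math.log(p_expr) + math.log(p_text))  # Fisher's statistic
    # Survival function of a chi-square with 4 degrees of freedom (closed form).
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


# Made-up background samples: expression distances ~[1, 10], text distances ~[0.1, 1].
expr_sample = sorted(float(k) for k in range(1, 11))
text_sample = sorted(0.1 * k for k in range(1, 11))

near = omnibus_distance(1.0, 0.1, expr_sample, text_sample)
far = omnibus_distance(10.0, 1.0, expr_sample, text_sample)
print(near, far)
```

Because only the rank of a distance within its own histogram matters, rescaling either representation leaves the combined value unchanged.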

Optimal k? Various cutoffs k of the cluster tree. Figure: M-score for integrated clustering vs. M-score for expression data only.

A peek inside

A peek inside: Text Profile • Expression Profile • Strong reinforcement

Case 2: Sciento- & technometrics

Mapping of Science • Journal ‘Scientometrics’ • Full-text articles • Document cluster analysis • Co-word mapping • Temporal dimension: clusters over time

Mapping of Science • Coupling with bibliometric indicators • Based on reference (hyperlink) information • Mean reference age • Number of serials

Domain studies in patent space. Figure: similarities; ‘seed’ patent; 30 technology classes.

User profiling & Author-Inventor linkage • Name resolution • Same persons (variants, mistakes) • Different persons (similar initials, or even full name). Example name strings: Van Veldhoven; Veldhoven, Van; Van Veldhoven; Vanveldhoven; Wim Van Veldhoven; Walter Van Veldhoven; Wim Van Veldhoven; Wim Van Veldhoven
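The first half of name resolution, collapsing spelling variants of the same name, can be sketched with a simple normalization key. The function `name_key` and its rules are hypothetical; it only handles case, spacing, accents and ‘Last, First’ inversion.

```python
import re
import unicodedata


def name_key(name):
    """Normalize a name string into a comparison key."""
    # Undo 'Last, First' inversion.
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    # Strip accents (e.g. 'ä' -> 'a'), then keep lowercase letters only.
    name = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in name if not unicodedata.combining(c))
    return re.sub(r"[^a-z]", "", name.lower())


variants = ["Van Veldhoven", "Veldhoven, Van", "Vanveldhoven"]
keys = {name_key(v) for v in variants}
print(keys)
print(name_key("Wim Van Veldhoven") == name_key("Walter Van Veldhoven"))
```

The three variants map to a single key, while the two distinct first names stay apart. Note what the key cannot do: two records that both read ‘Wim Van Veldhoven’ get the same key even if they belong to different persons, which is where content-based evidence is needed.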

Content-based name matching • Detect spillovers and entrepreneurial activities at (e.g.) university level • Matching of ‘inventors’ & ‘authors’ is time-consuming → semi-automated approach: relevance ranking between Patent DB and Publication DB
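A relevance ranking of this kind might look like the following sketch: candidate publication records that share the inventor's name are ordered by the textual similarity between their content and the patent text, so that a human only has to check the top of the list. Raw word counts with cosine similarity are an assumption made for brevity, and the records and texts are invented.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(patent_text, publications):
    """Return (score, record_id) pairs sorted by similarity to the patent, best first."""
    query = Counter(patent_text.lower().split())
    scored = [
        (cosine(query, Counter(text.lower().split())), record_id)
        for record_id, text in publications
    ]
    return sorted(scored, reverse=True)


# Invented example: one patent text, two publication records under the same name.
patent = "microarray gene expression clustering method"
publications = [
    ("record A", "clustering of microarray gene expression data"),
    ("record B", "medieval history of flemish towns"),
]
ranking = rank_candidates(patent, publications)
print(ranking)
```

In practice the word counts would be replaced by the weighted term vectors of the vector space model, but the ranking step stays the same.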

Acknowledgements • Steunpunt O&O Statistieken: Debackere K, Glänzel W • ESAT / BioI / Text Mining: Coessens B, Van Vooren S, Janssens F, Van Dromme D • ESAT / BioI: Moreau Y, De Moor B

Thanks! Questions? Contact info: Patrick.glenisson@econ.kuleuven.be
