Concept Clustering, Summarization and Annotation Qiaozhu Mei
Outline • Theme extraction • Theme summarization • Concept clustering • Entity Annotation
Theme extraction • Motivation • Extract subtopics/themes from a collection • Input • A collection of documents, with an index of terms/phrases • Output • A set of word distributions, each represented by its top-probability words • Future Direction • Incorporate priors of all kinds: prior knowledge usually falls between “know nothing” and “know a lot”, and comes as different types of information
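A minimal sketch of the output format described above, where a theme is a word distribution shown through its top-probability words. The function name `top_words` and the example distribution are hypothetical and not taken from the slides; how the distributions are estimated is not covered here.

```python
from typing import Dict, List, Tuple


def top_words(theme: Dict[str, float], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k highest-probability words of one theme."""
    # Sort by probability (descending); break ties alphabetically so the
    # result is deterministic.
    return sorted(theme.items(), key=lambda wp: (-wp[1], wp[0]))[:k]


# Hypothetical theme: a made-up word distribution used only for illustration.
example_theme = {
    "venom": 0.30,
    "melittin": 0.25,
    "allergy": 0.20,
    "bee": 0.15,
    "the": 0.10,
}

print(top_words(example_theme, k=3))
```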
Theme summarization • Motivation • Themes output as word distributions are hard to interpret. Use phrases to represent a theme (k phrases to summarize each theme). • Input • A text collection and a set of themes • Output • A ranked list of phrases for each theme • Future Direction • Automatically generated phrases vs. a parser (Evaluation)
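One way to picture the input and output above is the sketch below. The slides do not state how phrases are scored, so the scoring rule (mean log-probability of a phrase's words under the theme, with a small floor for unseen words) is an assumption chosen for simplicity. `rank_phrases`, `floor`, and the sample data are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def rank_phrases(
    theme: Dict[str, float],
    phrases: List[str],
    k: int = 3,
    floor: float = 1e-6,
) -> List[Tuple[str, float]]:
    """Rank candidate phrases for one theme and keep the top k.

    Assumed score: mean log p(word | theme) over the words of the phrase.
    Words missing from the theme get the probability `floor`.
    """
    scored = []
    for phrase in phrases:
        words = phrase.split()
        if not words:
            continue  # skip empty candidates
        score = sum(math.log(theme.get(w, floor)) for w in words) / len(words)
        scored.append((phrase, score))
    # Highest score first; ties broken alphabetically.
    scored.sort(key=lambda ps: (-ps[1], ps[0]))
    return scored[:k]


# Hypothetical theme and candidate phrases, for illustration only.
theme = {"bee": 0.4, "venom": 0.4, "allergy": 0.2}
candidates = ["bee venom", "venom allergy", "gene expression"]
for phrase, score in rank_phrases(theme, candidates):
    print(f"{score:8.3f}  {phrase}")
```

The mean (rather than the sum) keeps longer phrases from being penalized only for their length; other scoring rules would fit the same interface.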
Concept clustering • Motivation • Group semantically replaceable/similar terms into tight semantic clusters (tight concepts), e.g. synonyms • Input • A collection of documents and a list of terms • Output • A set of tight clusters • Future Direction • Apply heuristics to speed up clustering without degrading performance (Evaluation)
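The slides do not name the clustering algorithm. As an illustration of what "tight clusters" could mean, the sketch below assumes each term has a context vector and merges clusters greedily with complete linkage, stopping once no pair of clusters is similar enough. The names `cosine`, `tight_clusters`, the `threshold` value, and the sample vectors are all hypothetical.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def tight_clusters(contexts: Dict[str, Vector], threshold: float = 0.8) -> List[List[str]]:
    """Greedy agglomerative clustering with complete linkage.

    Two clusters are merged only if even their least similar pair of terms
    reaches `threshold`, which keeps every cluster tight.
    """
    clusters = [[term] for term in sorted(contexts)]

    def link(c1: List[str], c2: List[str]) -> float:
        return min(cosine(contexts[a], contexts[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        best, bi, bj = -1.0, -1, -1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = link(clusters[i], clusters[j])
                if s > best:
                    best, bi, bj = s, i, j
        if best < threshold:
            break  # no remaining pair is tight enough
        clusters[bi] = clusters[bi] + clusters[bj]
        del clusters[bj]
    return clusters


# Hypothetical context vectors (context word -> weight), for illustration only.
contexts = {
    "sequence": {"dna": 1.0, "gene": 1.0},
    "sequences": {"dna": 1.0, "gene": 1.0, "align": 0.2},
    "venom": {"sting": 1.0, "toxin": 1.0},
}
print(tight_clusters(contexts))
```

The pairwise search is quadratic in the number of clusters per merge, which is the kind of cost the speed-up heuristics in the future direction would target.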
Concept clustering: results • GNAME#glutathione GNAME#COII GNAME#PEC • ((GNAME#glutathione) ((GNAME#COII) (GNAME#PEC))) • GNAME#GST GNAME#IgG4#ap GNAME#COX#2 • (((GNAME#GST) (GNAME#IgG4#ap)) (GNAME#COX#2)) • GNAME#alpha#bungarotoxin GNAME#D2 • ((GNAME#alpha#bungarotoxin) (GNAME#D2)) • GNAME#mrjp1 GNAME#apamin GNAME#E1A • (((GNAME#mrjp1) (GNAME#apamin)) (GNAME#E1A)) • GNAME#Apis GNAME#ribosomal • ((GNAME#Apis) (GNAME#ribosomal)) • GNAME#alpha#glucosidases GNAME#somatostatin GNAME#G • (((GNAME#alpha#glucosidases) (GNAME#somatostatin)) (GNAME#G)) • GNAME#alpha#glucosidase GNAME##alpha##glucosidase • ((GNAME#alpha#glucosidase) (GNAME##alpha##glucosidase)) • GNAME#16S GNAME#mammalian • ((GNAME#16S) (GNAME#mammalian)) • GNAME#signal GNAME#mocambique • ((GNAME#signal) (GNAME#mocambique)) • GNAME#sequence GNAME#sequences • ((GNAME#sequence) (GNAME#sequences)) • GNAME#D GNAME#E • ((GNAME#D) (GNAME#E)) • GNAME#of GNAME#from • ((GNAME#of) (GNAME#from)) • GNAME#mellifera GNAME#specific GNAME#venom#specific • (((GNAME#mellifera) (GNAME#specific)) (GNAME#venom#specific))
Concept clustering: results (II) • GNAME#nicotinic GNAME#ER • ((GNAME#nicotinic) (GNAME#ER)) • GNAME#acetylcholine GNAME#green • ((GNAME#acetylcholine) (GNAME#green)) • GNAME#EFB GNAME#gp120 GNAME#Penncap#M • ((GNAME#EFB) ((GNAME#gp120) (GNAME#Penncap#M))) • GNAME#F#actin GNAME#mtDNA • ((GNAME#F#actin) (GNAME#mtDNA)) • GNAME#tubulin GNAME#mAb • ((GNAME#tubulin) (GNAME#mAb)) • GNAME#hemolymph GNAME#precursor • ((GNAME#hemolymph) (GNAME#precursor)) • GNAME#domain GNAME#element • ((GNAME#domain) (GNAME#element)) • GNAME#Melittin GNAME#mugml#1 GNAME#venom • (((GNAME#Melittin) (GNAME#mugml#1)) (GNAME#venom)) • GNAME#diastase GNAME#invertase GNAME#CAT • ((GNAME#diastase) ((GNAME#invertase) (GNAME#CAT))) • GNAME#peroxidase GNAME#catalase • ((GNAME#peroxidase) (GNAME#catalase)) • GNAME#Vg GNAME#PKG • ((GNAME#Vg) (GNAME#PKG)) • GNAME#GABA GNAME#dopamine • ((GNAME#GABA) (GNAME#dopamine)) • GNAME#TPN GNAME#AMCI#1 GNAME#RJ GNAME#SRs • (((GNAME#TPN) (GNAME#AMCI#1)) ((GNAME#RJ) (GNAME#SRs))) • GNAME#nuclear GNAME#CA • ((GNAME#nuclear) (GNAME#CA)) • GNAME#synthase GNAME#neuron • ((GNAME#synthase) (GNAME#neuron))
Concept clustering: results (III) • GNAME#immunoglobulin GNAME#DraI GNAME#IgM GNAME#AluI • (((GNAME#immunoglobulin) (GNAME#DraI)) ((GNAME#IgM) (GNAME#AluI))) • GNAME#Ig GNAME#TNF#beta • ((GNAME#Ig) (GNAME#TNF#beta)) • GNAME#neurons GNAME#OBPs • ((GNAME#neurons) (GNAME#OBPs)) • GNAME#Mdh#1 GNAME#Mdh GNAME#NF#kappaB • (((GNAME#Mdh#1) (GNAME#Mdh)) (GNAME#NF#kappaB)) • GNAME#MRJP1 GNAME#HGL • ((GNAME#MRJP1) (GNAME#HGL)) • GNAME#promoter GNAME#enzyme • ((GNAME#promoter) (GNAME#enzyme)) • GNAME#mitochondrial GNAME#homeobox • ((GNAME#mitochondrial) (GNAME#homeobox)) • GNAME#AncR#1 GNAME#Nasonov GNAME#Sax1 • (((GNAME#AncR#1) (GNAME#Nasonov)) (GNAME#Sax1)) • GNAME#transcripts GNAME#isozymes • ((GNAME#transcripts) (GNAME#isozymes)) • GNAME#glutamate GNAME#malate • ((GNAME#glutamate) (GNAME#malate)) • GNAME#collagen GNAME#IL#1beta GNAME#IL#4 • ((GNAME#collagen) ((GNAME#IL#1beta) (GNAME#IL#4))) • GNAME#binding GNAME#histone • ((GNAME#binding) (GNAME#histone)) • GNAME#system GNAME#gC GNAME#OBP • (((GNAME#system) (GNAME#gC)) (GNAME#OBP)) • GNAME#calmodulin GNAME#PhTX GNAME#deltamethrin • (((GNAME#calmodulin) (GNAME#PhTX)) (GNAME#deltamethrin)) • GNAME#amylase GNAME#sucrase • ((GNAME#amylase) (GNAME#sucrase)) • GNAME#TNF#alpha GNAME#IgG#ap GNAME#D1 • (((GNAME#TNF#alpha) (GNAME#IgG#ap)) (GNAME#D1)) • GNAME#A2 GNAME#A#2 • ((GNAME#A2) (GNAME#A#2)) • GNAME#IFN#gamma GNAME#DTX • ((GNAME#IFN#gamma) (GNAME#DTX)) • GNAME#MRJP3 GNAME#Mblk#1 • ((GNAME#MRJP3) (GNAME#Mblk#1)) • GNAME#antigen GNAME#alleles • ((GNAME#antigen) (GNAME#alleles))
Concept clustering: results (IV) • GNAME#bovine GNAME#aflatoxin • ((GNAME#bovine) (GNAME#aflatoxin)) • GNAME#albumin GNAME#tryptase • ((GNAME#albumin) (GNAME#tryptase)) • GNAME#4 GNAME#2 • ((GNAME#4) (GNAME#2)) • GNAME#region GNAME#site • ((GNAME#region) (GNAME#site)) • GNAME#AHB GNAME#hexokinase GNAME#rhodopsin • (((GNAME#AHB) (GNAME#hexokinase)) (GNAME#rhodopsin)) • GNAME#PI GNAME#P1 • ((GNAME#PI) (GNAME#P1)) • GNAME#pollen GNAME#plants • ((GNAME#pollen) (GNAME#plants)) • GNAME#lipase GNAME#LDH • ((GNAME#lipase) (GNAME#LDH)) • GNAME#AL GNAME#SCT GNAME#COI#COII • ((GNAME#AL) ((GNAME#SCT) (GNAME#COI#COII))) • GNAME#chymotrypsin GNAME#CAP GNAME#NGF • (((GNAME#chymotrypsin) (GNAME#CAP)) (GNAME#NGF)) • GNAME#PLA GNAME#trehalase • ((GNAME#PLA) (GNAME#trehalase)) • GNAME#IgG1 GNAME#IgG4 • ((GNAME#IgG1) (GNAME#IgG4)) • GNAME#inhibitor GNAME#Phospholipase • ((GNAME#inhibitor) (GNAME#Phospholipase)) • GNAME##s GNAME#P • ((GNAME##s) (GNAME#P))
Concept clustering: results (V) • GNAME#restriction GNAME#Z • ((GNAME#restriction) (GNAME#Z)) • GNAME#PER GNAME#RAST • ((GNAME#PER) (GNAME#RAST)) • GNAME#PLA2s GNAME#EC • ((GNAME#PLA2s) (GNAME#EC)) • GNAME#beta#glucosidase GNAME#GIF • ((GNAME#beta#glucosidase) (GNAME#GIF)) • GNAME#ASP1 GNAME#ASP2 • ((GNAME#ASP1) (GNAME#ASP2)) • GNAME#PKC GNAME#elastase GNAME#Permethrin • ((GNAME#PKC) ((GNAME#elastase) (GNAME#Permethrin))) • GNAME#MLT GNAME#JH#III • ((GNAME#MLT) (GNAME#JH#III)) • GNAME#RyR GNAME#MHC • ((GNAME#RyR) (GNAME#MHC)) • GNAME#filaments GNAME#filament • ((GNAME#filaments) (GNAME#filament)) • GNAME#F1 GNAME#F#1 • ((GNAME#F1) (GNAME#F#1)) • GNAME#TPNQ GNAME#EEP GNAME#MDH#1 • ((GNAME#TPNQ) ((GNAME#EEP) (GNAME#MDH#1))) • GNAME#c GNAME#b5 • ((GNAME#c) (GNAME#b5)) • GNAME#scFv GNAME#Dfd • ((GNAME#scFv) (GNAME#Dfd)) • GNAME#h2 GNAME#HMAP GNAME#ACh • (((GNAME#h2) (GNAME#HMAP)) (GNAME#ACh))
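The parenthesized lines in the results above encode how the terms of each cluster are nested. A minimal sketch of a reader for that notation follows; `parse_cluster` and `leaves` are hypothetical helper names, and the grammar is inferred from the listed output rather than from any documented format.

```python
from typing import List, Union

Tree = Union[str, List["Tree"]]


def parse_cluster(text: str) -> Tree:
    """Parse a line such as '((GNAME#A) ((GNAME#B) (GNAME#C)))' into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def peek() -> str:
        if pos >= len(tokens):
            raise ValueError("unbalanced parentheses")
        return tokens[pos]

    def parse() -> Tree:
        nonlocal pos
        if peek() != "(":
            raise ValueError(f"expected '(' but found {tokens[pos]!r}")
        pos += 1
        children: List[Tree] = []
        while peek() != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ')'
        # A group holding a single term, e.g. '(GNAME#COII)', is just that term.
        if len(children) == 1 and isinstance(children[0], str):
            return children[0]
        return children

    tree = parse()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the cluster")
    return tree


def leaves(tree: Tree) -> List[str]:
    """Flatten a parsed cluster into its list of terms, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


tree = parse_cluster("((GNAME#glutathione) ((GNAME#COII) (GNAME#PEC)))")
print(tree)
print(leaves(tree))
```

Flattening with `leaves` recovers the flat term list printed before each parenthesized line in the results.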
Entity Annotation • Motivation • Annotate an entity (term, biological entity, concept, etc.) with different types of structured information • Generate a dictionary-like entry for each entity • Input • A text collection and an index of sentences • Output • A dictionary-like annotation entry for each entity • Future Direction • Tune each component of the annotator
Entity Annotation: results • GNAME#Mdh#1 11 • Related terms: • GNAME#Hk#1 0.000612038 • GNAME#locus 0.000449124 • GNAME#Est#6 0.000424242 • GNAME#Pgm#1 0.000291602 • GNAME#Est#1a 0.000291602 • ligustica 0.000288879 • GNAME#Est#5 0.000265296 • linkage 0.000191466 • spinula 0.00017993 • characterize 0.000174218 • GNAME#dehydrogenase 0.000172905 • Segregational 0.000160911 • Aegean 0.000160911 • GNAME#Adh#1 0.000160911 • Marginal 0.000160911 • Liguria 0.000160911 • GNAME#Mdh#1A 0.000160911
Example Sentences: • 12182 0.207504 : Segregational analyses demonstrated the absence of close linkage between Lap-D and GNAME#Est#1a , GNAME#Est#2 , GNAME#Est#5 , GNAME#Est#6 , GNAME#Mdh#1 , GNAME#Hk#1 and GNAME#Pgm#1 GNAME#loci GNAME#of GNAME#Apis GNAME#mellifera . • 19949 0.203663 : Genetic linkage studies showed no close linkage between the GNAME#Est#1a GNAME#locus and the genetic markers GNAME#Est#6 , GNAME#Mdh#1 and GNAME#Hk#1 . • 30357 0.176708 : The tests were conducted primarily with biochemical markers ( GNAME#Adh#1 , GNAME#Est#1 , GNAME#Est#3 , GNAME#Est#5 , GNAME#Est#6 , GNAME#Hk#1 , GNAME#Mdh#1 , and GNAME#Pgm#1 ) ; the morphological mutation cordovan ( cd ) is also included . • 48736 0.16039 : Marginal populations of A. m. ligustica differ from the central populations of this subspecies in allele frequencies at the GNAME#Mdh#1 GNAME#locus . • 45078 0.152925 : Electrophoretic analysis of the GNAME#MDH GNAME#[ GNAME#malate GNAME#dehydrogenase GNAME#] GNAME#enzyme GNAME#system demonstrated that honeybee populations of eastern Liguria belong to A. m. ligustica spinula , while , in the Western populations , the frequency of the GNAME#Mdh#1 GNAME#M GNAME#allele , which is characteristic of French A. m. mellifera L. , linearly increases toward the French boundary .
Entity Annotation: results (II) • Semantically Similar entities: • GNAME#Mdh#1 11 1 • GNAME#Hk#1 5 0.94811 • GNAME#Est#6 6 0.932399 • GNAME#Pgm#1 4 0.922537 • GNAME#Est#1a 4 0.922091 • GNAME#Adh#1 2 0.913576 • GNAME#Mdh#1A 2 0.906424 • GNAME#Mdh#1B 2 0.906424 • GNAME#M 3 0.899708 • GNAME#Lap 1 0.898897 • GNAME#Est#1 5 0.898051 • GNAME#PGM2 2 0.897837 • GNAME#aldehyde 1 0.897415 • GNAME#Cypermethrin 1 0.897026 • GNAME#ACP1 1 0.896832 • GNAME#EstIV 1 0.896831 • GNAME#MdhIII 1 0.896831 • GNAME#Est#2s 1 0.896803 • GNAME#aminopeptidases 1 0.896719 • GNAME#Mdh#1C 1 0.896674
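A small sketch of how the rows above might be loaded for further use, for example to keep only the closest entities. The slides do not label the columns; reading them as name, occurrence count, and similarity score is an assumption, and `SimilarEntity` and `parse_similar` are hypothetical names. The sample rows are copied from the list above.

```python
from typing import List, NamedTuple


class SimilarEntity(NamedTuple):
    name: str
    count: int    # assumed meaning: number of occurrences of the entity
    score: float  # assumed meaning: similarity to the annotated entity


def parse_similar(block: str) -> List[SimilarEntity]:
    """Parse bullet-separated rows of the form 'NAME COUNT SCORE'."""
    rows = []
    for item in block.split("•"):
        parts = item.split()
        if len(parts) != 3:
            continue  # headers and other text are skipped
        name, count, score = parts
        try:
            rows.append(SimilarEntity(name, int(count), float(score)))
        except ValueError:
            continue  # three tokens, but not a data row
    return rows


sample = (
    "Semantically Similar entities: • GNAME#Mdh#1 11 1 • GNAME#Hk#1 5 0.94811 "
    "• GNAME#Est#6 6 0.932399 • GNAME#Lap 1 0.898897"
)
rows = parse_similar(sample)
# Keep entities scoring at least 0.93, excluding the annotated entity itself.
close = [r.name for r in rows if r.score >= 0.93 and r.name != "GNAME#Mdh#1"]
print(close)
```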
Future Plan • Summer: • With Microsoft Research. Will help Xu integrate the synonym extraction into gene summarization. • After Summer: • Work on the future directions listed for each module. • Two general functionalities: • Theme extraction, summarization and theme pattern analysis • Synonym extraction