
Concept Clustering, Summarization and Annotation


Presentation Transcript


  1. Concept Clustering, Summarization and Annotation Qiaozhu Mei

  2. Outline • Theme extraction • Theme summarization • Concept clustering • Entity Annotation

  3. Theme extraction • Motivation • Extract subtopics/themes from a collection • Input • A collection of documents, with an index of terms/phrases • Output • A set of word distributions, each represented by its top-probability words • Future Direction • Incorporate all kinds of priors: prior knowledge usually falls between “know nothing” and “know a lot”, i.e. we know something, given as different types of information
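The output format described on this slide (a word distribution shown through its top-probability words) can be illustrated with a minimal sketch. The function name `top_words` and the toy distribution are assumptions made for illustration; they are not taken from the presentation, and the sketch says nothing about how the distributions themselves are estimated.

```python
from typing import Dict, List, Tuple


def top_words(theme: Dict[str, float], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k highest-probability words of one theme."""
    # Sort by probability (descending); break ties alphabetically so the
    # output is deterministic.
    return sorted(theme.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Toy, made-up word distribution standing in for one extracted theme.
theme = {"venom": 0.30, "melittin": 0.25, "bee": 0.20, "the": 0.15, "assay": 0.10}

for word, prob in top_words(theme, k=3):
    print(f"{word}\t{prob:.2f}")
```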

  4. Theme summarization • Motivation • The extracted themes are not easily interpretable. Use phrases to represent a theme (use k phrases to summarize a theme). • Input • A text collection and a set of themes • Output • A ranked list of phrases for each theme • Future Direction • Automatically generated phrases vs. phrases from a parser (Evaluation)
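The slide does not say how phrases are scored against a theme. As a sketch of the input/output contract only, the example below ranks candidate phrases by the mean theme probability of their words. This scoring rule, the name `rank_phrases`, and the toy data are all assumptions, not the presenter's method.

```python
from typing import Dict, List


def rank_phrases(theme: Dict[str, float], phrases: List[str], k: int = 3) -> List[str]:
    """Rank candidate phrases for one theme and return the top k."""

    def score(phrase: str) -> float:
        words = phrase.split()
        if not words:
            return 0.0
        # Assumed scoring rule: average probability of the phrase's words
        # under the theme; words unseen in the theme contribute 0.
        return sum(theme.get(w, 0.0) for w in words) / len(words)

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(phrases, key=lambda p: (-score(p), p))[:k]


# Toy, made-up theme and candidate phrases.
theme = {"bee": 0.4, "venom": 0.3, "protein": 0.2, "of": 0.1}
candidates = ["protein of", "venom protein", "bee venom"]
print(rank_phrases(theme, candidates, k=2))
```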

  5. Concept clustering • Motivation • Group semantically replaceable/similar terms into tight semantic clusters (tight concepts), e.g. synonyms • Input • A collection of documents and a list of terms • Output • A set of tight clusters • Future Direction • Apply heuristics to speed up clustering without degrading performance (Evaluation)
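One way to obtain tight clusters of this kind is threshold-limited agglomerative clustering over term context vectors. The sketch below assumes average-link merging on cosine similarity and a hypothetical `threshold` parameter; the presentation does not state which algorithm or similarity measure was used, and the context vectors are made up.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[key] * v.get(key, 0.0) for key in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tight_clusters(vectors: Dict[str, Vector], threshold: float = 0.8) -> List[Tuple[List[str], str]]:
    """Merge the most similar clusters until no pair reaches the threshold.

    Returns (members, bracketed merge tree) for every cluster with 2+ terms.
    """
    # Each cluster keeps its member terms and a nested-parenthesis string.
    clusters = [([term], f"({term})") for term in sorted(vectors)]

    def avg_link(a: List[str], b: List[str]) -> float:
        total = sum(cosine(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: avg_link(clusters[ij[0]][0], clusters[ij[1]][0]),
        )
        if avg_link(clusters[i][0], clusters[j][0]) < threshold:
            break  # remaining clusters are not similar enough to merge
        merged = (clusters[i][0] + clusters[j][0], f"({clusters[i][1]} {clusters[j][1]})")
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]

    return [c for c in clusters if len(c[0]) > 1]


# Toy, made-up context vectors (context word -> weight).
vectors = {
    "GNAME#sequence": {"dna": 1.0, "gene": 1.0},
    "GNAME#sequences": {"dna": 1.0, "gene": 1.0},
    "GNAME#venom": {"toxin": 1.0},
}
result = tight_clusters(vectors)
for members, tree in result:
    print(" ".join(members), "•", tree)
```

A high threshold keeps clusters tight at the cost of leaving many terms unclustered. The pairwise search is quadratic per merge, which is the sort of cost the "speed up" future direction would target.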

  6. Concept clustering: results • GNAME#glutathione GNAME#COII GNAME#PEC • ((GNAME#glutathione) ((GNAME#COII) (GNAME#PEC))) • GNAME#GST GNAME#IgG4#ap GNAME#COX#2 • (((GNAME#GST) (GNAME#IgG4#ap)) (GNAME#COX#2)) • GNAME#alpha#bungarotoxin GNAME#D2 • ((GNAME#alpha#bungarotoxin) (GNAME#D2)) • GNAME#mrjp1 GNAME#apamin GNAME#E1A • (((GNAME#mrjp1) (GNAME#apamin)) (GNAME#E1A)) • GNAME#Apis GNAME#ribosomal • ((GNAME#Apis) (GNAME#ribosomal)) • GNAME#alpha#glucosidases GNAME#somatostatin GNAME#G • (((GNAME#alpha#glucosidases) (GNAME#somatostatin)) (GNAME#G)) • GNAME#alpha#glucosidase GNAME##alpha##glucosidase • ((GNAME#alpha#glucosidase) (GNAME##alpha##glucosidase)) • GNAME#16S GNAME#mammalian • ((GNAME#16S) (GNAME#mammalian)) • GNAME#signal GNAME#mocambique • ((GNAME#signal) (GNAME#mocambique)) • GNAME#sequence GNAME#sequences • ((GNAME#sequence) (GNAME#sequences)) • GNAME#D GNAME#E • ((GNAME#D) (GNAME#E)) • GNAME#of GNAME#from • ((GNAME#of) (GNAME#from)) • GNAME#mellifera GNAME#specific GNAME#venom#specific • (((GNAME#mellifera) (GNAME#specific)) (GNAME#venom#specific))

  7. Concept clustering: results (II) • GNAME#nicotinic GNAME#ER • ((GNAME#nicotinic) (GNAME#ER)) • GNAME#acetylcholine GNAME#green • ((GNAME#acetylcholine) (GNAME#green)) • GNAME#EFB GNAME#gp120 GNAME#Penncap#M • ((GNAME#EFB) ((GNAME#gp120) (GNAME#Penncap#M))) • GNAME#F#actin GNAME#mtDNA • ((GNAME#F#actin) (GNAME#mtDNA)) • GNAME#tubulin GNAME#mAb • ((GNAME#tubulin) (GNAME#mAb)) • GNAME#hemolymph GNAME#precursor • ((GNAME#hemolymph) (GNAME#precursor)) • GNAME#domain GNAME#element • ((GNAME#domain) (GNAME#element)) • GNAME#Melittin GNAME#mugml#1 GNAME#venom • (((GNAME#Melittin) (GNAME#mugml#1)) (GNAME#venom)) • GNAME#diastase GNAME#invertase GNAME#CAT • ((GNAME#diastase) ((GNAME#invertase) (GNAME#CAT))) • GNAME#peroxidase GNAME#catalase • ((GNAME#peroxidase) (GNAME#catalase)) • GNAME#Vg GNAME#PKG • ((GNAME#Vg) (GNAME#PKG)) • GNAME#GABA GNAME#dopamine • ((GNAME#GABA) (GNAME#dopamine)) • GNAME#TPN GNAME#AMCI#1 GNAME#RJ GNAME#SRs • (((GNAME#TPN) (GNAME#AMCI#1)) ((GNAME#RJ) (GNAME#SRs))) • GNAME#nuclear GNAME#CA • ((GNAME#nuclear) (GNAME#CA)) • GNAME#synthase GNAME#neuron • ((GNAME#synthase) (GNAME#neuron))

  8. Concept clustering: results (III) • GNAME#immunoglobulin GNAME#DraI GNAME#IgM GNAME#AluI • (((GNAME#immunoglobulin) (GNAME#DraI)) ((GNAME#IgM) (GNAME#AluI))) • GNAME#Ig GNAME#TNF#beta • ((GNAME#Ig) (GNAME#TNF#beta)) • GNAME#neurons GNAME#OBPs • ((GNAME#neurons) (GNAME#OBPs)) • GNAME#Mdh#1 GNAME#Mdh GNAME#NF#kappaB • (((GNAME#Mdh#1) (GNAME#Mdh)) (GNAME#NF#kappaB)) • GNAME#MRJP1 GNAME#HGL • ((GNAME#MRJP1) (GNAME#HGL)) • GNAME#promoter GNAME#enzyme • ((GNAME#promoter) (GNAME#enzyme)) • GNAME#mitochondrial GNAME#homeobox • ((GNAME#mitochondrial) (GNAME#homeobox)) • GNAME#AncR#1 GNAME#Nasonov GNAME#Sax1 • (((GNAME#AncR#1) (GNAME#Nasonov)) (GNAME#Sax1)) • GNAME#transcripts GNAME#isozymes • ((GNAME#transcripts) (GNAME#isozymes)) • GNAME#glutamate GNAME#malate • ((GNAME#glutamate) (GNAME#malate)) • GNAME#collagen GNAME#IL#1beta GNAME#IL#4 • ((GNAME#collagen) ((GNAME#IL#1beta) (GNAME#IL#4))) • GNAME#binding GNAME#histone • ((GNAME#binding) (GNAME#histone)) • GNAME#system GNAME#gC GNAME#OBP • (((GNAME#system) (GNAME#gC)) (GNAME#OBP)) • GNAME#calmodulin GNAME#PhTX GNAME#deltamethrin • (((GNAME#calmodulin) (GNAME#PhTX)) (GNAME#deltamethrin)) • GNAME#amylase GNAME#sucrase • ((GNAME#amylase) (GNAME#sucrase)) • GNAME#TNF#alpha GNAME#IgG#ap GNAME#D1 • (((GNAME#TNF#alpha) (GNAME#IgG#ap)) (GNAME#D1)) • GNAME#A2 GNAME#A#2 • ((GNAME#A2) (GNAME#A#2)) • GNAME#IFN#gamma GNAME#DTX • ((GNAME#IFN#gamma) (GNAME#DTX)) • GNAME#MRJP3 GNAME#Mblk#1 • ((GNAME#MRJP3) (GNAME#Mblk#1)) • GNAME#antigen GNAME#alleles • ((GNAME#antigen) (GNAME#alleles))

  9. Concept clustering: results (IV) • GNAME#bovine GNAME#aflatoxin • ((GNAME#bovine) (GNAME#aflatoxin)) • GNAME#albumin GNAME#tryptase • ((GNAME#albumin) (GNAME#tryptase)) • GNAME#4 GNAME#2 • ((GNAME#4) (GNAME#2)) • GNAME#region GNAME#site • ((GNAME#region) (GNAME#site)) • GNAME#AHB GNAME#hexokinase GNAME#rhodopsin • (((GNAME#AHB) (GNAME#hexokinase)) (GNAME#rhodopsin)) • GNAME#PI GNAME#P1 • ((GNAME#PI) (GNAME#P1)) • GNAME#pollen GNAME#plants • ((GNAME#pollen) (GNAME#plants)) • GNAME#lipase GNAME#LDH • ((GNAME#lipase) (GNAME#LDH)) • GNAME#AL GNAME#SCT GNAME#COI#COII • ((GNAME#AL) ((GNAME#SCT) (GNAME#COI#COII))) • GNAME#chymotrypsin GNAME#CAP GNAME#NGF • (((GNAME#chymotrypsin) (GNAME#CAP)) (GNAME#NGF)) • GNAME#PLA GNAME#trehalase • ((GNAME#PLA) (GNAME#trehalase)) • GNAME#IgG1 GNAME#IgG4 • ((GNAME#IgG1) (GNAME#IgG4)) • GNAME#inhibitor GNAME#Phospholipase • ((GNAME#inhibitor) (GNAME#Phospholipase)) • GNAME##s GNAME#P • ((GNAME##s) (GNAME#P))

  10. Concept clustering: results (V) • GNAME#restriction GNAME#Z • ((GNAME#restriction) (GNAME#Z)) • GNAME#PER GNAME#RAST • ((GNAME#PER) (GNAME#RAST)) • GNAME#PLA2s GNAME#EC • ((GNAME#PLA2s) (GNAME#EC)) • GNAME#beta#glucosidase GNAME#GIF • ((GNAME#beta#glucosidase) (GNAME#GIF)) • GNAME#ASP1 GNAME#ASP2 • ((GNAME#ASP1) (GNAME#ASP2)) • GNAME#PKC GNAME#elastase GNAME#Permethrin • ((GNAME#PKC) ((GNAME#elastase) (GNAME#Permethrin))) • GNAME#MLT GNAME#JH#III • ((GNAME#MLT) (GNAME#JH#III)) • GNAME#RyR GNAME#MHC • ((GNAME#RyR) (GNAME#MHC)) • GNAME#filaments GNAME#filament • ((GNAME#filaments) (GNAME#filament)) • GNAME#F1 GNAME#F#1 • ((GNAME#F1) (GNAME#F#1)) • GNAME#TPNQ GNAME#EEP GNAME#MDH#1 • ((GNAME#TPNQ) ((GNAME#EEP) (GNAME#MDH#1))) • GNAME#c GNAME#b5 • ((GNAME#c) (GNAME#b5)) • GNAME#scFv GNAME#Dfd • ((GNAME#scFv) (GNAME#Dfd)) • GNAME#h2 GNAME#HMAP GNAME#ACh • (((GNAME#h2) (GNAME#HMAP)) (GNAME#ACh))
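Each result above pairs a flat list of terms with a bracketed form that appears to record how the terms were grouped. Assuming that notation (every term wrapped in its own parentheses, groups nested inside further parentheses), a small recursive parser can turn a bracketed line into a nested list. The function name `parse_cluster` is hypothetical.

```python
from typing import List, Union

Tree = Union[str, List["Tree"]]


def parse_cluster(text: str) -> Tree:
    """Parse e.g. '((A) ((B) (C)))' into ['A', ['B', 'C']]."""
    pos = 0

    def node() -> Tree:
        nonlocal pos
        if pos >= len(text) or text[pos] != "(":
            raise ValueError(f"expected '(' at position {pos}")
        pos += 1
        children: List[Tree] = []
        label = ""
        while pos < len(text) and text[pos] != ")":
            if text[pos] == "(":
                children.append(node())  # nested group or single term
            elif text[pos].isspace():
                pos += 1
            else:
                start = pos
                while pos < len(text) and text[pos] not in "() ":
                    pos += 1
                label = text[start:pos]  # a bare term such as GNAME#PKC
        if pos >= len(text):
            raise ValueError("unbalanced parentheses")
        pos += 1  # consume ')'
        return children if children else label

    return node()


# Bracketed form taken from slide 10.
tree = parse_cluster("((GNAME#PKC) ((GNAME#elastase) (GNAME#Permethrin)))")
print(tree)
```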

  11. Entity Annotation • Motivation • Annotate an entity (term, biological entity, concept, etc.) with different types of structured information • Generate a dictionary-like entry for each entity • Input • A text collection and an index of sentences • Output • A dictionary-like annotation entry for each entity • Future Direction • Tune each component of the annotator

  12. Entity Annotation: results • GNAME#Mdh#1 11 • Related terms: • GNAME#Hk#1 0.000612038 • GNAME#locus 0.000449124 • GNAME#Est#6 0.000424242 • GNAME#Pgm#1 0.000291602 • GNAME#Est#1a 0.000291602 • ligustica 0.000288879 • GNAME#Est#5 0.000265296 • linkage 0.000191466 • spinula 0.00017993 • characterize 0.000174218 • GNAME#dehydrogenase 0.000172905 • Segregational 0.000160911 • Aegean 0.000160911 • GNAME#Adh#1 0.000160911 • Marginal 0.000160911 • Liguria 0.000160911 • GNAME#Mdh#1A 0.000160911

  13. Example Sentences: • 12182 0.207504 : Segregational analyses demonstrated the absence of close linkage between Lap-D and GNAME#Est#1a , GNAME#Est#2 , GNAME#Est#5 , GNAME#Est#6 , GNAME#Mdh#1 , GNAME#Hk#1 and GNAME#Pgm#1 GNAME#loci GNAME#of GNAME#Apis GNAME#mellifera . • 19949 0.203663 : Genetic linkage studies showed no close linkage between the GNAME#Est#1a GNAME#locus and the genetic markers GNAME#Est#6 , GNAME#Mdh#1 and GNAME#Hk#1 . • 30357 0.176708 : The tests were conducted primarily with biochemical markers ( GNAME#Adh#1 , GNAME#Est#1 , GNAME#Est#3 , GNAME#Est#5 , GNAME#Est#6 , GNAME#Hk#1 , GNAME#Mdh#1 , and GNAME#Pgm#1 ) ; the morphological mutation cordovan ( cd ) is also included . • 48736 0.16039 : Marginal populations of A. m. ligustica differ from the central populations of this subspecies in allele frequencies at the GNAME#Mdh#1 GNAME#locus . • 45078 0.152925 : Electrophoretic analysis of the GNAME#MDH GNAME#[ GNAME#malate GNAME#dehydrogenase GNAME#] GNAME#enzyme GNAME#system demonstrated that honeybee populations of eastern Liguria belong to A. m. ligustica spinula , while , in the Western populations , the frequency of the GNAME#Mdh#1 GNAME#M GNAME#allele , which is characteristic of French A. m. mellifera L. , linearly increases toward the French boundary .

  14. Entity Annotation: results (II) • Semantically similar entities: • GNAME#Mdh#1 11 1 • GNAME#Hk#1 5 0.94811 • GNAME#Est#6 6 0.932399 • GNAME#Pgm#1 4 0.922537 • GNAME#Est#1a 4 0.922091 • GNAME#Adh#1 2 0.913576 • GNAME#Mdh#1A 2 0.906424 • GNAME#Mdh#1B 2 0.906424 • GNAME#M 3 0.899708 • GNAME#Lap 1 0.898897 • GNAME#Est#1 5 0.898051 • GNAME#PGM2 2 0.897837 • GNAME#aldehyde 1 0.897415 • GNAME#Cypermethrin 1 0.897026 • GNAME#ACP1 1 0.896832 • GNAME#EstIV 1 0.896831 • GNAME#MdhIII 1 0.896831 • GNAME#Est#2s 1 0.896803 • GNAME#aminopeptidases 1 0.896719 • GNAME#Mdh#1C 1 0.896674
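The "dictionary-like annotation entry" from slides 11 to 14 can be pictured as a small record type. The sketch below uses a hypothetical `AnnotationEntry` class whose field names are assumptions; reading the integer after each entity as an occurrence count is also an assumption. The values are copied from the slides, shortened to a few rows.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AnnotationEntry:
    entity: str
    count: int  # assumed meaning of the integer printed next to the entity
    related_terms: List[Tuple[str, float]] = field(default_factory=list)
    # (sentence id, score, sentence text)
    example_sentences: List[Tuple[int, float, str]] = field(default_factory=list)
    # (entity, count, similarity score)
    similar_entities: List[Tuple[str, int, float]] = field(default_factory=list)

    def top_similar(self, k: int = 3) -> List[Tuple[str, int, float]]:
        """Most similar entities, excluding the entity itself."""
        others = [s for s in self.similar_entities if s[0] != self.entity]
        return sorted(others, key=lambda s: -s[2])[:k]


entry = AnnotationEntry(
    entity="GNAME#Mdh#1",
    count=11,
    related_terms=[("GNAME#Hk#1", 0.000612038), ("GNAME#locus", 0.000449124)],
    example_sentences=[
        (48736, 0.16039,
         "Marginal populations of A. m. ligustica differ from the central "
         "populations of this subspecies in allele frequencies at the "
         "GNAME#Mdh#1 GNAME#locus ."),
    ],
    similar_entities=[
        ("GNAME#Mdh#1", 11, 1.0),
        ("GNAME#Hk#1", 5, 0.94811),
        ("GNAME#Est#6", 6, 0.932399),
        ("GNAME#Pgm#1", 4, 0.922537),
    ],
)
print(entry.top_similar(2))
```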

  15. Future Plan • Summer: • With Microsoft Research. Will help Xu integrate the synonym extraction into gene summarization. • After Summer: • Work on the future directions listed for each module. • Two general functionalities: • Theme extraction, summarization and theme pattern analysis • Synonym extraction
