MEAD 3.09 A platform for multidocument multilingual text summarization.
University of Michigan, Smith College, Columbia University, University of Pennsylvania, Johns Hopkins University, Chinese University of Hong Kong, University of Alabama, University of Sheffield, University of Cambridge
JHU Summer School 2004 - Baltimore
Text summarization
• Identifying the “most important” information from a document or set of documents.
• Extractive/abstractive
• Single-document/multi-document
• Informative/indicative
MEAD - JHU 2004 2
MEAD
• Multi-document, multilingual, extractive summarization platform
• Open-source (Perl & Java), well-documented API and utilities
• v. 1.0-2.0 (Michigan 2000), v. 3.0 (JHU 2001)
• Latest release is v. 3.09 (Michigan 2001-2004)
Four stages
• Preprocessing and clustering
  • CIDR, XML representation
• Feature extraction
  • Default + custom
• Score extraction
  • Feature combination
• Sentence reranking
  • Cross-sentence relationships: repetitions, chronology, source preferences
Sample .config file
<MEAD-CONFIG TARGET='GA3' LANG='ENG' CLUSTER-PATH='/clair4/mead/data/GA3' DATA-DIRECTORY='/clair4/mead/data/GA3/docsent'>
  <FEATURE-SET BASE-DIRECTORY='/clair4/mead/data/GA3/feature/'>
    <FEATURE NAME='Centroid' SCRIPT='/clair4/mead/bin/feature-scripts/Centroid.pl HK-WORD-enidf ENG'/>
    <FEATURE NAME='Position' SCRIPT='/clair4/mead/bin/feature-scripts/Position.pl'/>
    <FEATURE NAME='Length' SCRIPT='/clair4/mead/bin/feature-scripts/Length.pl'/>
  </FEATURE-SET>
  <CLASSIFIER COMMAND-LINE='/clair4/mead/bin/default-classifier.pl \
      Centroid 1 Position 1 Length 9' SYSTEM='MEADORIG' RUN='10/09'/>
  <RERANKER COMMAND-LINE='/clair4/mead/bin/default-reranker.pl MEAD-cosine 0.7'/>
  <COMPRESSION BASIS='sentences' PERCENT='20'/>
</MEAD-CONFIG>
Sample .sentfeature file
<SENT-FEATURE>
  <S DID="87" SNO="1" > <FEATURE N="Centroid" V="0.2749" /> </S>
  <S DID="87" SNO="2" > <FEATURE N="Centroid" V="0.8288" /> </S>
  <S DID="81" SNO="1" > <FEATURE N="Centroid" V="0.1538" /> </S>
  <S DID="81" SNO="2" > <FEATURE N="Centroid" V="1.0000" /> </S>
  <S DID="41" SNO="1" > <FEATURE N="Centroid" V="0.1539" /> </S>
  <S DID="41" SNO="2" > <FEATURE N="Centroid" V="0.9820" /> </S>
</SENT-FEATURE>
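A file of this shape can be read with any generic XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened copy of the sample; the function name `read_sentfeature` is hypothetical and is not part of MEAD, which ships its own Perl utilities.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample above (three of the six sentences).
SAMPLE = """<SENT-FEATURE>
  <S DID="87" SNO="1"><FEATURE N="Centroid" V="0.2749"/></S>
  <S DID="87" SNO="2"><FEATURE N="Centroid" V="0.8288"/></S>
  <S DID="81" SNO="2"><FEATURE N="Centroid" V="1.0000"/></S>
</SENT-FEATURE>"""


def read_sentfeature(xml_text):
    """Map (DID, SNO) -> {feature name: value} for every <S> element."""
    root = ET.fromstring(xml_text)
    table = {}
    for s in root.iter("S"):
        key = (s.get("DID"), int(s.get("SNO")))
        table[key] = {f.get("N"): float(f.get("V")) for f in s.iter("FEATURE")}
    return table


features = read_sentfeature(SAMPLE)
# Sentence with the highest Centroid value in this fragment.
best = max(features, key=lambda k: features[k]["Centroid"])
print(best, features[best])
```

Keying on the (DID, SNO) pair mirrors how the .extract file identifies sentences, so the two files can be joined on the same key.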
Sample .extract file
<!DOCTYPE EXTRACT SYSTEM '/clair/tools/mead/dtd/extract.dtd'>
<EXTRACT QID='GA3' LANG='ENG' COMPRESSION='7' SYSTEM='MEADORIG' RUN='Sun Oct 13 11:01:19 2002'>
  <S ORDER='1' DID='41' SNO='2' />
  <S ORDER='2' DID='41' SNO='3' />
  <S ORDER='3' DID='41' SNO='11' />
  <S ORDER='4' DID='81' SNO='3' />
  <S ORDER='5' DID='81' SNO='7' />
  <S ORDER='6' DID='87' SNO='2' />
  <S ORDER='7' DID='87' SNO='3' />
</EXTRACT>
Sample .query
<!DOCTYPE QUERY SYSTEM "/clair4/mead/dtd/query.dtd" >
<QUERY QID="Q-551-E" QNO="551" TRANSLATED="NO">
  <TITLE> Natural disaster victims aided </TITLE>
  <DESCRIPTION> The description is usually a few sentences describing the cluster. </DESCRIPTION>
  <NARRATIVE> The narrative often describes exactly what the user is looking for in the summary. </NARRATIVE>
</QUERY>
Features
• Centroid: cosine overlap with the centroid vector of the cluster
• SimWithFirst: cosine overlap with the first sentence in the document (or with the title, if it exists)
• Length: 1 if the length of the sentence is above a given threshold and 0 otherwise
• RealLength: the length of the sentence in words
• Position: the position of the sentence in the document
• QueryOverlap: cosine overlap with a query sentence or phrase
• KeywordMatch: full match from a list of keywords
• CosineCentrality: eigenvector centrality of the sentence on the lexical connectivity matrix with a defined threshold
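Three of these features can be sketched in a few lines. The code below is an illustration under stated assumptions, not MEAD's feature scripts: it tokenizes on whitespace, uses raw term counts with no IDF weighting, and builds the centroid by summing the sentence vectors of the cluster. The function names and the default threshold are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def sentence_features(cluster, length_threshold=9):
    """cluster: list of documents, each a list of sentence strings."""
    vecs = [[Counter(s.lower().split()) for s in doc] for doc in cluster]
    centroid = Counter()
    for doc in vecs:
        for v in doc:
            centroid.update(v)  # assumption: centroid = summed term counts
    rows = []
    for did, doc in enumerate(vecs, start=1):
        for sno, v in enumerate(doc, start=1):
            real_length = sum(v.values())
            rows.append({
                "DID": did,
                "SNO": sno,
                "Centroid": cosine(v, centroid),
                "RealLength": real_length,
                # Length is binary: 1 above the threshold, 0 otherwise.
                "Length": 1 if real_length > length_threshold else 0,
            })
    return rows


cluster = [
    ["iraq refuses to cooperate with inspectors", "blair warns iraq"],
    ["iraq rejects cooperating with the united nations"],
]
rows = sentence_features(cluster, length_threshold=5)
for row in rows:
    print(row)
```

Because cosine similarity is scale-invariant, summing the sentence vectors gives the same Centroid scores as averaging them would.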
Centrality in summarization
• Motivation: capture the most central words in a document or cluster
• Centroid score [Radev & al. 2000, 2004a]
• Alternative methods for computing centrality?
Social networks
• Induced by a relation r
• Prestige (centrality) in social networks:
  • Degree centrality: number of friends
  • Geodesic centrality: bridge quality
  • Eigenvector centrality: who your friends are
Eigenvectors of stochastic graphs
• Square connectivity matrix
• Directed vs. undirected
• An eigenvalue of a square matrix $A$ is a scalar $\lambda$ such that there exists a vector $x \neq 0$ with $Ax = \lambda x$
• The normalized eigenvector associated with the largest $\lambda$ is called the principal eigenvector of $A$
• A matrix is called a stochastic matrix when the entries in each row sum to 1 and none is negative. All stochastic matrices have a principal eigenvector
• The connectivity matrix used in PageRank [Page & al. 1998] is irreducible [Langville & Meyer 2003]
• An iterative method (power method) can be used to compute the principal eigenvector
• That eigenvector corresponds to the stationary distribution of the Markov stochastic process described by the connectivity matrix
• This is also equivalent to performing a random walk on the graph
Eigenvectors of stochastic graphs
• The stationary distribution of the Markov stochastic matrix can be computed using an iterative power method
• PageRank adds an extra twist to deal with dead-end pages. With probability $1-\epsilon$, a random starting point is chosen. This has a natural interpretation in the case of Web page ranking
• Notation: su = successor nodes, pr = predecessor nodes
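The iteration can be sketched as follows. This is a generic PageRank-style power method, not code from MEAD, and it assumes the common update $p(v) = \frac{1-f}{n} + f \sum_{u \in pr(v)} \frac{p(u)}{|su(u)|}$, where $f$ (the `follow` parameter below, a name chosen here) is the probability of following a link and $1-f$ the probability of restarting at a random node. Spreading the mass of dead-end nodes uniformly is a further assumption.

```python
def power_method(succ, follow=0.85, tol=1e-10, max_iter=1000):
    """Stationary scores of a random walk with random restarts.

    succ: dict mapping each node to the list of its successor nodes;
          every successor must also appear as a key.
    """
    nodes = list(succ)
    n = len(nodes)
    p = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Dead ends have no successors: their mass is redistributed uniformly.
        dangling = sum(p[u] for u in nodes if not succ[u])
        base = ((1.0 - follow) + follow * dangling) / n
        new = {v: base for v in nodes}
        for u in nodes:
            for v in succ[u]:
                # Each node splits its mass evenly among its successors.
                new[v] += follow * p[u] / len(succ[u])
        if sum(abs(new[v] - p[v]) for v in nodes) < tol:
            return new
        p = new
    return p


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = power_method(graph)
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 4))
```

Node `d` has no predecessors, so its score is exactly the restart share $(1-f)/n$; the scores always sum to 1 because the restart and dead-end mass are both redistributed.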
LexPageRank (Cosine centrality)
Example (cluster d1003t)
1 (d1s1) Iraqi Vice President Taha Yassin Ramadan announced today, Sunday, that Iraq refuses to back down from its decision to stop cooperating with disarmament inspectors before its demands are met.
2 (d2s1) Iraqi Vice president Taha Yassin Ramadan announced today, Thursday, that Iraq rejects cooperating with the United Nations except on the issue of lifting the blockade imposed upon it since the year 1990.
3 (d2s2) Ramadan told reporters in Baghdad that "Iraq cannot deal positively with whoever represents the Security Council unless there was a clear stance on the issue of lifting the blockade off of it."
4 (d2s3) Baghdad had decided late last October to completely cease cooperating with the inspectors of the United Nations Special Commission (UNSCOM), in charge of disarming Iraq's weapons, and whose work became very limited since the fifth of August, and announced it will not resume its cooperation with the Commission even if it were subjected to a military operation.
5 (d3s1) The Russian Foreign Minister, Igor Ivanov, warned today, Wednesday against using force against Iraq, which will destroy, according to him, seven years of difficult diplomatic work and will complicate the regional situation in the area.
6 (d3s2) Ivanov contended that carrying out air strikes against Iraq, who refuses to cooperate with the United Nations inspectors, "will end the tremendous work achieved by the international group during the past seven years and will complicate the situation in the region."
7 (d3s3) Nevertheless, Ivanov stressed that Baghdad must resume working with the Special Commission in charge of disarming the Iraqi weapons of mass destruction (UNSCOM).
8 (d4s1) The Special Representative of the United Nations Secretary-General in Baghdad, Prakash Shah, announced today, Wednesday, after meeting with the Iraqi Deputy Prime Minister Tariq Aziz, that Iraq refuses to back down from its decision to cut off cooperation with the disarmament inspectors.
9 (d5s1) British Prime Minister Tony Blair said today, Sunday, that the crisis between the international community and Iraq "did not end" and that Britain is still "ready, prepared, and able to strike Iraq."
10 (d5s2) In a gathering with the press held at the Prime Minister's office, Blair contended that the crisis with Iraq "will not end until Iraq has absolutely and unconditionally respected its commitments" towards the United Nations.
11 (d5s3) A spokesman for Tony Blair had indicated that the British Prime Minister gave permission to British Air Force Tornado planes stationed in Kuwait to join the aerial bombardment against Iraq.
Cosine centrality
Cosine centrality (t=0.3) [Figure: graph over the eleven sentences d1s1 through d5s3]
Cosine centrality (t=0.2) [Figure: graph over the eleven sentences d1s1 through d5s3]
Cosine centrality (t=0.1) [Figure: graph over the eleven sentences d1s1 through d5s3, with d4s1 labeled separately]
Sentences vote for the most central sentence!
Cosine centrality vs. centroid centrality
ID    LPR (0.1)  LPR (0.2)  LPR (0.3)  Centroid
d1s1  0.6007     0.6944     0.0909     0.7209
d2s1  0.8466     0.7317     0.0909     0.7249
d2s2  0.3491     0.6773     0.0909     0.1356
d2s3  0.7520     0.6550     0.0909     0.5694
d3s1  0.5907     0.4344     0.0909     0.6331
d3s2  0.7993     0.8718     0.0909     0.7972
d3s3  0.3548     0.4993     0.0909     0.3328
d4s1  1.0000     1.0000     0.0909     0.9414
d5s1  0.5921     0.7399     0.0909     0.9580
d5s2  0.6910     0.6967     0.0909     1.0000
d5s3  0.5921     0.4501     0.0909     0.7902
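The idea of thresholded cosine centrality can be sketched as below: link two sentences when their cosine similarity exceeds the threshold t, then run a random walk with restarts over the resulting graph. This is a minimal illustration, not the MEAD feature script; whitespace tokenization, raw term counts without IDF, the `follow` parameter, and the fixed iteration count are all assumptions. The scores are left unnormalized.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cosine_centrality(sentences, threshold=0.1, follow=0.85, iters=200):
    """Random-walk centrality on the thresholded similarity graph."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    n = len(vecs)
    # Undirected graph: i and j are linked when similarity exceeds the threshold.
    nbrs = [[j for j in range(n)
             if j != i and cosine(vecs[i], vecs[j]) > threshold]
            for i in range(n)]
    p = [1.0 / n] * n
    for _ in range(iters):
        # Mass on isolated sentences is redistributed uniformly.
        isolated = sum(p[i] for i in range(n) if not nbrs[i])
        base = ((1.0 - follow) + follow * isolated) / n
        new = [base] * n
        for i in range(n):
            for j in nbrs[i]:
                new[j] += follow * p[i] / len(nbrs[i])  # i "votes" for j
        p = new
    return p


SENTS = [
    "iraq refuses to cooperate with inspectors",
    "iraq rejects cooperating with the united nations",
    "blair said britain is ready to strike iraq",
    "iraq refuses to back down on inspectors",
    "weather was sunny today",
]
scores = cosine_centrality(SENTS, threshold=0.1)
uniform = cosine_centrality(SENTS, threshold=0.99)
print([round(x, 4) for x in scores])
print([round(x, 4) for x in uniform])
```

When the threshold is high enough that no edges survive, every sentence receives the same score 1/n. For eleven sentences that value is 1/11 ≈ 0.0909.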
Classifiers
• Default: linear combination (possibly using thresholds)
• Lead-based: positional and chronological
• Random
• Decision-tree: trainable
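A linear combination with a threshold might look like the sketch below. It is one plausible reading of a command line such as `Centroid 1 Position 1 Length 9`, not the code of default-classifier.pl: treating the Length argument as a minimum word count that zeroes out short sentences is an assumption, as are the function name and the illustrative feature values.

```python
def linear_score(features, weights, min_length=None):
    """Weighted sum of feature values, with an optional length cutoff.

    features:   dict of feature name -> value for one sentence
    weights:    dict of feature name -> weight
    min_length: sentences with fewer words score 0 (assumed behaviour)
    """
    if min_length is not None and features.get("RealLength", 0) < min_length:
        return 0.0
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


WEIGHTS = {"Centroid": 1.0, "Position": 1.0}

# Illustrative feature values for two sentences.
long_sentence = {"Centroid": 0.8288, "Position": 0.5, "RealLength": 20}
short_sentence = {"Centroid": 1.0, "Position": 1.0, "RealLength": 5}

score_long = linear_score(long_sentence, WEIGHTS, min_length=9)
score_short = linear_score(short_sentence, WEIGHTS, min_length=9)
print(score_long, score_short)
```

The cutoff keeps very short sentences out of the summary even when their other feature values are high.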
Rerankers
• Identity: trivial
• Default: remove sentences that are too similar
• Time-based: use chronology
• Source-based: source preference
• Novelty:
• CST-based: cross-document structure theory [Radev 2000, Zhang & al. 2002, Zhang & Radev 2004]
• MMR: maximal marginal relevance [Carbonell & Goldstein 1998]
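The default strategy, dropping sentences that are too similar to ones already chosen, can be sketched as a greedy filter. This is an illustration rather than the code of default-reranker.pl; the function names, the whitespace tokenization, and the use of raw term counts without IDF are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rerank(scored, threshold=0.7, limit=None):
    """Greedy selection: visit sentences by descending score and skip any
    sentence whose similarity to an already chosen one exceeds threshold.

    scored: list of (score, sentence) pairs
    limit:  optional maximum number of sentences to keep
    """
    chosen = []
    for _, sent in sorted(scored, key=lambda pair: -pair[0]):
        vec = Counter(sent.lower().split())
        if all(cosine(vec, kept_vec) <= threshold for _, kept_vec in chosen):
            chosen.append((sent, vec))
        if limit is not None and len(chosen) == limit:
            break
    return [sent for sent, _ in chosen]


scored = [
    (0.9, "iraq refuses to cooperate with inspectors"),
    (0.8, "iraq refuses to cooperate with un inspectors"),
    (0.5, "blair says britain is ready to strike"),
]
summary = rerank(scored, threshold=0.7)
print(summary)
```

Here the second sentence is a near-duplicate of the first, so it is skipped and the lower-scored third sentence takes its place.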
Evaluation methods
• Precision/recall/F-measure: baseline
• Kappa: interjudge agreement and difficulty
• Relative utility: non-binary judgements [Radev 2000]
• Relevance correlation: IR-based
• Cosine: default or TF*IDF
• Longest common subsequence [Saggion & al. 2002]
• Word overlap
• BLEU: n-gram precision [Papineni & al. 2002]
• ROUGE: n-gram recall and LCS [Lin 2004]
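For extracts, the baseline measures reduce to set overlap on (DID, SNO) pairs. The sketch below scores the seven sentences of the sample .extract file against a reference extract; the reference is invented for illustration, and the function name `prf` is hypothetical.

```python
def prf(system, reference):
    """Precision, recall and F-measure over sets of (DID, SNO) pairs."""
    system, reference = set(system), set(reference)
    hits = len(system & reference)
    precision = hits / len(system) if system else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Sentences selected in the sample .extract file.
system = [(41, 2), (41, 3), (41, 11), (81, 3), (81, 7), (87, 2), (87, 3)]
# Hypothetical human extract, made up for this example.
reference = [(41, 2), (81, 3), (87, 2), (87, 5)]

precision, recall, f_measure = prf(system, reference)
print(round(precision, 4), round(recall, 4), round(f_measure, 4))
```

Exact-match scoring gives no credit for picking a different sentence with the same content, which is one motivation for the content-based measures in the list above.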
Recent applications
• NewsInEssence (www.newsinessence.com)
• DUC 2001-2004
• WapMEAD
• Java-MEAD interface
• Chronological fact extraction
• Novelty detection
• Protein interaction extraction
More recent additions
• MEAD “addons”: conversion from plain text, HTML, PDF, etc. to MEAD XML
• Client + server
• Summary to sentjudge conversion
• Trainable version of MEAD using decision trees, maxent, and SVM
Successes
• Large-scale effort (more than 20 people have participated in it)
• Open architecture
• Downloaded more than 1,000 times in the last 2 years
• Used in teaching
• Novel models of centrality: centroid, degree, cosine centrality
• Currently in five languages: English, Chinese, Korean, Spanish, Japanese
• DUC (including several first-place rankings in 2003, 2004)
Sample .meadrc file
compression_basis sentences
compression_absolute 1
classifier \
  /clair4/projects/mead307/source/mead/bin/default-classifier.pl \
  Centroid 3.0 Position 1.0 Length 15 SimWithFirst 2.0
reranker \
  /clair4/projects/mead307/source/mead/bin/default-reranker.pl \
  MEAD-cosine 0.9 enidf