Recent advances in multi-document summarization
Dragomir Radev, University of Michigan, Ann Arbor, radev@umich.edu
Presentation at UC Berkeley SIMS, November 10, 2004
WWW as a textual database
• Large: 10^10 pages, 200 TB [Lyman&Varian 03]; cf. brain (10^11 neurons)
• Multilingual: English 56.4% of sites, German 7.7%, French 5.6%, Japanese 4.9%, Chinese 2.4%
• Evolving: 22% of sites change every day, another 31% change every month [Cho&Garcia-Molina 00]
• Uneven importance: at different levels
• Adequate representations are needed for user-friendly access
Outline • Introduction • Random walks and social networks • LexRank • Projects in language modeling and machine learning
Natural Language Processing (NLP)
Typical NLP problems
• Entity extraction
• Relation extraction
• Text classification
• Summarization
• Information retrieval
• Machine translation
• Question answering
• Text understanding
• Parsing
• Word sense disambiguation
• Lexical acquisition
• Paraphrasing
NLP is very hard!
• The pen is in the box.
• Every American has a mother.
• Boston called.
• I saw Zoe. The poor girl looked tired.
• Mary and Sue bought each other a book.
• The spirit is willing but the flesh is weak.
• Children make delicious snacks.
• Army head seeks arms.
• Czech President and playwright Havel to receive honors
Recent trends in NLP
• Multidisciplinary
• Statistical
• Well founded
• Scalable
[Figure: NLP and related fields: Linguistics, E-commerce, Bioinformatics, Lin. Algebra, Info. Retrieval, Graph theory, Intelligence, Stat. Mechanics, User interfaces, Sociology, Translation]
Finding structure
• Language doesn’t have a regular structure (like a database)
• Sentences are very unlike each other
• Linguistic analysis: parse trees
• Hard to generalize
• Finding structure: across sentences; across sites/sources/documents; over time
• Representations
• Graphs everywhere!
MEAD: salience-based extractive summarization
• Centroid-based summarization (single and multi-document)
• Vector space model
• Additional features: position, length, LexRank
NewsInEssence
• (1000+ downloads)
• Cross-document structure theory (CST)
• NIE: first robust news summarization system (2001)
Outline • Introduction • Random walks and social networks • LexRank • Projects in language modeling and machine learning
Social networks • Induced by a relation • Symmetric or not • Examples: • Friendship networks • Board membership • Citations • Power grid of the US • WWW
Graph-based representations
• Graph G(V, E) [Figure: example graph with nodes 1 to 8]
• Square connectivity (incidence) matrix P
Markov chains • A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel P. • Path = sequence (x0, x1, …, xn). • The probability of a path can be computed as a product of probabilities for each step i.
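The product rule in the last bullet can be sketched in a few lines of Python. The two-state kernel and the initial distribution below are made-up illustration values, not numbers from the talk.

```python
def path_probability(x, P, path):
    """Probability of a path (x0, x1, ..., xn) under initial distribution x and kernel P."""
    prob = x[path[0]]                      # probability of starting in x0
    for prev, cur in zip(path, path[1:]):  # one factor per step
        prob *= P[prev][cur]
    return prob

# Hypothetical two-state chain (illustration only).
x = [0.5, 0.5]
P = [[0.9, 0.1],
     [0.5, 0.5]]
print(path_probability(x, P, [0, 0, 1, 1]))  # 0.5 * 0.9 * 0.1 * 0.5
```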
Random walks
• Access time H_ij = expected number of steps to go from i to j.
• Example [Lovász 1993]: what is H_ij on a path with nodes 0, 1, …, n-1?
  H(k-1,k) = 2k-1
  H(i,k) = H(i,k-1) + 2k-1
  H(i,k) = (2i+1) + (2i+3) + … + (2k-1) = k^2 - i^2
  H(0,k) = k^2
  (Brownian motion: travel distance sqrt(t) in time t)
• Electrical networks
• R_st is the resistance between two nodes s and t. The round-trip travel time between s and t is exactly 2mR_st, where m is the number of edges.
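The access-time formula for the path can be checked numerically. The sketch below is not from the talk: it assumes a simple random walk that moves left or right with probability 1/2 and always moves right from node 0, writes the one-step equations for H(i,k), and solves them exactly with fractions. The helper names `solve` and `access_times_to` are illustrative.

```python
from fractions import Fraction

def solve(A, b):
    """Gauss-Jordan elimination over exact fractions."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def access_times_to(k):
    """H(i, k) for i = 0..k-1 on a path whose left end is node 0."""
    A = [[Fraction(0)] * k for _ in range(k)]
    b = [Fraction(1)] * k                      # every step costs one unit of time
    for i in range(k):
        A[i][i] = Fraction(1)
        if i == 0:
            if k > 1:
                A[0][1] = Fraction(-1)         # node 0 can only move right
        else:
            A[i][i - 1] = Fraction(-1, 2)      # step left with probability 1/2
            if i + 1 < k:
                A[i][i + 1] = Fraction(-1, 2)  # step right; H(k, k) = 0 drops out
    return solve(A, b)

k = 5
print([int(h) for h in access_times_to(k)])    # [25, 24, 21, 16, 9], i.e. k^2 - i^2
```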
Stationary solutions • The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions: • E is stochastic • E is irreducible • E is aperiodic • To make these conditions true: • All rows of E add up to 1 (and no value is negative) • Make sure that E is strongly connected • Make sure that E is not bipartite • Example: PageRank [Brin and Page 1998]: use “teleportation”
Example
• This graph E has a second graph E’ superimposed on it: E’ is the uniform transition graph.
[Figure: graph with nodes 1 to 8, shown at t=10]
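Superimposing the uniform transition graph can be sketched as a convex mix of two kernels. The mixing weight `d` and the function name `add_teleportation` are assumptions for illustration; the two-node cycle is a made-up example of a periodic (bipartite) chain.

```python
def add_teleportation(E, d):
    """Mix a row-stochastic kernel E with the uniform transition graph E'."""
    n = len(E)
    return [[(1 - d) * E[i][j] + d / n for j in range(n)] for i in range(n)]

# Hypothetical periodic chain: 0 -> 1 -> 0 (bipartite, so not aperiodic).
E = [[0.0, 1.0],
     [1.0, 0.0]]
M = add_teleportation(E, d=0.2)
print(M)  # every entry is now positive and each row still sums to 1
```

Because every entry of the mixed kernel is positive, the chain is both irreducible and aperiodic, which is what the conditions on the previous slide ask for.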
Eigenvectors
• An eigenvector is an implicit “direction” for a matrix. Ev = λv, where v is non-zero, though λ can be any complex number in principle.
• The largest eigenvalue of a stochastic matrix E is λ1 = 1.
• For λ1, the left (principal) eigenvector is p; the right eigenvector is the all-ones vector 1.
• In other words, E^T p = p.
Prestige and centrality • Degree centrality: how many neighbors each node has. • Closeness centrality: how close an actor is to all of the other nodes • Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes • Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects. • Prestige = same as centrality but for directed graphs.
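Two of these measures are easy to sketch on a toy graph. The graph below is made up, and closeness is computed here as the inverse of the average shortest-path distance, which is one common definition and an assumption about what the slide intends.

```python
from collections import deque

def degree_centrality(adj):
    """Number of neighbors of each node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def closeness_centrality(adj):
    """(n - 1) divided by the sum of shortest-path distances; assumes a connected graph."""
    scores = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:                       # breadth-first search from s
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        scores[s] = (len(adj) - 1) / sum(dist.values())
    return scores

# Hypothetical undirected graph: node 1 is a hub, node 5 hangs off node 4.
adj = {1: [2, 3, 4], 2: [1], 3: [1], 4: [1, 5], 5: [4]}
print(degree_centrality(adj))
print(closeness_centrality(adj))
```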
Computing the stationary distribution
Solution for the stationary distribution:
function PowerStatDist(E):
begin
  p(0) = u; i = 1;
  repeat
    p(i) = E^T p(i-1);
    L = ||p(i) - p(i-1)||_1;
    i = i + 1;
  until L < ε
end
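The PowerStatDist pseudocode translates almost line for line into Python. This is a minimal sketch under two assumptions not stated on the slide: the start vector u is the uniform distribution, and the stopping threshold is a small tolerance `eps`. The two-state kernel is a made-up example.

```python
def power_stat_dist(E, eps=1e-10, max_iter=10_000):
    """Power iteration for the stationary distribution of a row-stochastic kernel E."""
    n = len(E)
    p = [1.0 / n] * n                      # p(0) = u, assumed to be uniform
    for _ in range(max_iter):
        # p(i) = E^T p(i-1): entry j collects the mass flowing into state j
        new_p = [sum(E[i][j] * p[i] for i in range(n)) for j in range(n)]
        delta = sum(abs(a - b) for a, b in zip(new_p, p))   # L1 norm of the change
        p = new_p
        if delta < eps:
            break
    return p

# Hypothetical two-state chain (illustration only).
E = [[0.9, 0.1],
     [0.5, 0.5]]
print(power_stat_dist(E))   # close to [5/6, 1/6]
```

The `max_iter` cap is a safety net for kernels that violate the ergodicity conditions and would otherwise never meet the tolerance.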
Example
[Figure: the graph with nodes 1 to 8, shown at t=0, t=1, and t=10]
Outline • Introduction • Random walks and social networks • LexRank
Centrality in summarization • Motivation: capture the most central words in a document or cluster • Centroid score [Radev & al. 2000, 2004a] • Alternative methods for computing centrality?
Sample multidocument cluster (DUC cluster d1003t)
1 (d1s1) Iraqi Vice President Taha Yassin Ramadan announced today, Sunday, that Iraq refuses to back down from its decision to stop cooperating with disarmament inspectors before its demands are met.
2 (d2s1) Iraqi Vice president Taha Yassin Ramadan announced today, Thursday, that Iraq rejects cooperating with the United Nations except on the issue of lifting the blockade imposed upon it since the year 1990.
3 (d2s2) Ramadan told reporters in Baghdad that "Iraq cannot deal positively with whoever represents the Security Council unless there was a clear stance on the issue of lifting the blockade off of it."
4 (d2s3) Baghdad had decided late last October to completely cease cooperating with the inspectors of the United Nations Special Commission (UNSCOM), in charge of disarming Iraq's weapons, and whose work became very limited since the fifth of August, and announced it will not resume its cooperation with the Commission even if it were subjected to a military operation.
5 (d3s1) The Russian Foreign Minister, Igor Ivanov, warned today, Wednesday against using force against Iraq, which will destroy, according to him, seven years of difficult diplomatic work and will complicate the regional situation in the area.
6 (d3s2) Ivanov contended that carrying out air strikes against Iraq, who refuses to cooperate with the United Nations inspectors, "will end the tremendous work achieved by the international group during the past seven years and will complicate the situation in the region."
7 (d3s3) Nevertheless, Ivanov stressed that Baghdad must resume working with the Special Commission in charge of disarming the Iraqi weapons of mass destruction (UNSCOM).
8 (d4s1) The Special Representative of the United Nations Secretary-General in Baghdad, Prakash Shah, announced today, Wednesday, after meeting with the Iraqi Deputy Prime Minister Tariq Aziz, that Iraq refuses to back down from its decision to cut off cooperation with the disarmament inspectors.
9 (d5s1) British Prime Minister Tony Blair said today, Sunday, that the crisis between the international community and Iraq "did not end" and that Britain is still "ready, prepared, and able to strike Iraq."
10 (d5s2) In a gathering with the press held at the Prime Minister's office, Blair contended that the crisis with Iraq "will not end until Iraq has absolutely and unconditionally respected its commitments" towards the United Nations.
11 (d5s3) A spokesman for Tony Blair had indicated that the British Prime Minister gave permission to British Air Force Tornado planes stationed in Kuwait to join the aerial bombardment against Iraq.
Cosine between sentences
• Let s1 and s2 be two sentences.
• Let x and y be their representations in an n-dimensional vector space.
• The cosine between them is then computed as the inner product of x and y divided by the product of their lengths.
• The cosine ranges from 0 to 1, because term weights are non-negative.
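A minimal sketch of the cosine between two sentences follows. It assumes raw term counts over whitespace-separated, lowercased tokens as the vector representation; the slide does not say which term weighting the talk used, so that choice is an assumption.

```python
import math
from collections import Counter

def cosine(s1, s2):
    """Cosine between bag-of-words count vectors of two sentences."""
    x, y = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(x[w] * y[w] for w in x)                 # inner product over shared terms
    norm = math.sqrt(sum(v * v for v in x.values())) * math.sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0

print(cosine("Iraq refuses to back down", "Iraq refuses to cooperate"))
```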
Cosine centrality (t=0.3)
[Figure: similarity graph over the 11 sentences d1s1 to d5s3, with edges where the cosine exceeds 0.3]
Cosine centrality (t=0.2)
[Figure: the same sentence graph with threshold 0.2]
Cosine centrality (t=0.1)
[Figure: the same sentence graph with threshold 0.1]
Sentences vote for the most central sentence!
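The voting idea can be sketched as counting, for each sentence, how many other sentences are similar to it above the threshold t. The similarity matrix below is made up for four sentences and is not the DUC cluster's actual cosine matrix; `thresholded_degree` is an illustrative name.

```python
def thresholded_degree(sim, t):
    """For each sentence, count the other sentences whose cosine exceeds t."""
    n = len(sim)
    return [sum(1 for j in range(n) if j != i and sim[i][j] > t) for i in range(n)]

# Hypothetical symmetric cosine matrix for four sentences (illustration only).
sim = [[1.00, 0.45, 0.25, 0.05],
       [0.45, 1.00, 0.15, 0.35],
       [0.25, 0.15, 1.00, 0.12],
       [0.05, 0.35, 0.12, 1.00]]
for t in (0.3, 0.2, 0.1):
    print(t, thresholded_degree(sim, t))
```

Lowering t adds edges, so the counts can only grow or stay the same, and at very low thresholds most sentences look equally central.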
LexRank
• T1…Tn are pages that link to A, c(Ti) is the outdegree of page Ti, and N is the total number of pages.
• d is the “damping factor”, or the probability that we “jump” to a far-away node during the random walk. It accounts for disconnected components or periodic graphs.
• When d = 1, we have a strict uniform distribution. When d = 0, the method is not guaranteed to converge to a unique solution.
• Typical values for d lie in [0.1, 0.2] (Brin and Page, 1998).
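The score these symbols describe is not spelled out in the text. A form consistent with the definitions, with d read as the jump probability, is PR(A) = d/N + (1 - d) · (PR(T1)/c(T1) + … + PR(Tn)/c(Tn)). That formula is an assumption here, and the sketch below iterates it on a made-up graph in which every node has at least one outgoing link.

```python
def lexrank_scores(links, d=0.15, eps=1e-10, max_iter=10_000):
    """Iterate PR(A) = d/N + (1 - d) * sum(PR(T)/c(T)); links[v] lists the nodes v points to."""
    nodes = list(links)
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        new = {v: d / n for v in nodes}                     # share received by jumping
        for t in nodes:
            share = (1 - d) * score[t] / len(links[t])      # (1 - d) * PR(T) / c(T)
            for a in links[t]:
                new[a] += share
        delta = sum(abs(new[v] - score[v]) for v in nodes)
        score = new
        if delta < eps:
            break
    return score

# Hypothetical sentence graph: "b" is similar to every other sentence.
links = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"], "d": ["b"]}
print(lexrank_scores(links))
```

The example graph is bipartite, so without the jump term the walk would be periodic; a small positive d is what makes the iteration settle.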
Cosine centrality vs. centroid centrality

ID     LPR (0.1)  LPR (0.2)  LPR (0.3)  Centroid
d1s1   0.6007     0.6944     1.0000     0.7209
d2s1   0.8466     0.7317     1.0000     0.7249
d2s2   0.3491     0.6773     1.0000     0.1356
d2s3   0.7520     0.6550     1.0000     0.5694
d3s1   0.5907     0.4344     1.0000     0.6331
d3s2   0.7993     0.8718     1.0000     0.7972
d3s3   0.3548     0.4993     1.0000     0.3328
d4s1   1.0000     1.0000     1.0000     0.9414
d5s1   0.5921     0.7399     1.0000     0.9580
d5s2   0.6910     0.6967     1.0000     1.0000
d5s3   0.5921     0.4501     1.0000     0.7902
Evaluation metrics
• Difficult to evaluate summaries
• Intrinsic vs. extrinsic evaluations
• Extractive vs. non-extractive evaluations
• Manual vs. automatic evaluations
• ROUGE = mixture of n-gram recall for different values of n.
• Example:
  Reference = “The cat in the hat”
  System = “The cat wears a top hat”
  1-gram recall = 3/5; 2-gram recall = 1/4; 3-gram and 4-gram recall = 0
• ROUGE-W = longest common subsequence
• Example above: 3/5
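The numbers in the example can be reproduced with a short sketch. It is not the official ROUGE implementation: it assumes lowercased whitespace tokens, clipped n-gram counts, and plain (unweighted) longest-common-subsequence recall.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_recall(reference, system, n):
    """Fraction of reference n-grams that also occur in the system summary."""
    ref = ngrams(reference.lower().split(), n)
    hyp = ngrams(system.lower().split(), n)
    total = sum(ref.values())
    overlap = sum(min(count, hyp[g]) for g, count in ref.items())   # clipped matches
    return overlap / total if total else 0.0

def lcs_recall(reference, system):
    """Length of the longest common subsequence divided by the reference length."""
    r, s = reference.lower().split(), system.lower().split()
    table = [[0] * (len(s) + 1) for _ in range(len(r) + 1)]
    for i, rw in enumerate(r, 1):
        for j, sw in enumerate(s, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if rw == sw else max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1] / len(r) if r else 0.0

ref, hyp = "The cat in the hat", "The cat wears a top hat"
print([ngram_recall(ref, hyp, n) for n in (1, 2, 3, 4)])   # [0.6, 0.25, 0.0, 0.0]
print(lcs_recall(ref, hyp))                                 # 0.6
```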
Evaluation results
• Centroid: C0.5, C10, C1.5, C1, C2.5, C2
• Degree: D0.5T0.1, D0.5T0.2, D0.5T0.3, D1.5T0.1, D1.5T0.2, D1.5T0.3, D1T0.1, D1T0.2, D1T0.3
• LexRank: Lr0.5T0.1, Lr0.5T0.2, Lr0.5t0.3, Lr1.5t0.1, Lr1.5t0.2, Lr1.5t0.3, Lr1T0.1, Lr1T0.2, Lr1T0.3

Rouge-1
Lr1.5t0.1  0.400
Lr1.5t0.2  0.400
Lr1T0.2    0.396
…
C1         0.382

Rouge-2
Lr1.5t0.2  0.115
D1.5T0.2   0.114
D1T0.2     0.113
…
C1.5       0.099

Rouge-4
Lr1.5t0.1  0.124
Lr1.5t0.2  0.124
Lr1T0.2    0.124
…
C2         0.118
DUC results
[Figure: LCS recall]
Results and applications
• DUC results (MU recall, ROUGE): 1st place 2003 (duc.nist.gov), 1-2 place 2004
• Applications: Web page summarization (WIE), topical crawling, answer focused wireless access, cross-lingual IR-based evaluation, knowledge based
• Beyond summarization: classification, WSD, spam recognition
Outline • Introduction • Random walks and social networks • LexRank • Projects in language modeling and machine learning
Syntax in Statistical Machine Translation
• Noisy channel model: assume that a source sentence has to be translated into a target language sentence
• Goal: find the most probable target sentence
• Obvious problems can be fixed with syntax (?)
• JHU 02 and 03 projects (Franz Och, Jan Hajic, Dan Gildea + others)
• Solution using log-linear combination of features
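The two decision rules named on this slide are usually written as follows. This is the standard textbook form, given as an assumption because the slide's own formulas are not in the text: f is the source sentence, e a candidate target sentence, h_m are feature functions, and λ_m their weights.

```latex
% Noisy channel: choose the target sentence that best explains the source
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)

% Log-linear combination of M feature functions
\hat{e} = \arg\max_{e} \sum_{m=1}^{M} \lambda_m \, h_m(e, f)
```

The noisy channel rule is the special case with two features, log P(f | e) and log P(e), each weighted by 1.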
Setup
• Given: a Chinese sentence + the top 1000 candidate translations in English
• Parse all of these
• Compute features: monolingual, bilingual, syntax-free, and syntactic
• Evaluation using BLEU (BiLingual Evaluation Understudy)
• Example: Is the number of constituents across languages the same? Is the English tree grammatical? Are the two sentences of comparable length?
• Feature combination: use a greedy maxbleu algorithm
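Rescoring a candidate list with a weighted sum of features can be sketched as below. The feature names, values, and weights are all hypothetical, and the sketch only applies given weights; it does not implement the greedy maxbleu search that chooses them.

```python
def rerank(candidates, weights):
    """Sort candidate translations by a weighted sum of their feature values, best first."""
    def score(candidate):
        return sum(weights[name] * value for name, value in candidate["features"].items())
    return sorted(candidates, key=score, reverse=True)

# Hypothetical feature names, values, and weights (illustration only).
weights = {"baseline_logprob": 1.0, "length_mismatch": -2.0, "tree_grammatical": 0.5}
candidates = [
    {"text": "china 14 border open cities economic achievements marked",
     "features": {"baseline_logprob": -9.8, "length_mismatch": 0.1, "tree_grammatical": 0.0}},
    {"text": "china 's 14 open border cities marked economic achievements",
     "features": {"baseline_logprob": -10.0, "length_mismatch": 0.1, "tree_grammatical": 1.0}},
]
best = rerank(candidates, weights)[0]
print(best["text"])
```

In this toy case the syntactic feature outweighs the small baseline advantage of the first candidate, which is the kind of correction the setup is meant to test.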
Chinese parse tree
[Figure: parse tree with nonterminals IP, NP, QP, NP, NP, VP, NP, CLP]
Tagged words: 中国/NR 十四/CD 个/M 边境/NN 开放/NN 城市/NN 经济/NN 建设/NN 成就/NN 显著/VV
Gloss: China 14 border open cities economic achievements marked
Multiple references

1. fourteen chinese open border cities make significant achievements in economic construction
2. xinhua news agency report of february 12 from beijing - the fourteen chinese border cities that have been opened to foreigners achieved satisfactory results in their economic construction in 1995 .
3. according to statistics , the cities achieved a combined gross domestic product of rmb 19 billion last year , an increase of more than 90 % over 1991 before their opening .
4. the state council successively approved the opening of fourteen border cities to foreigners in 1992 , including heihe , pingxiang , hunchun , yining and ruili , and permitted them to set up 14 border economic cooperation zones .

1. significant accomplishment achieved in the economic construction of the fourteen open border cities in china
2. xinhua news agency , beijing , feb. 12 - exciting accomplishment has been achieved in 1995 in the economic construction of china 's fourteen border cities open to foreigners .
3. statistics have indicated that these cities produced a combined gdp of over 19 billion yuan last year , an increase of more than 90 % , compared with that in 1991 before the cities were open to foreigners .
4. in 1992 , the state council successively opened fourteen border cities to foreigners . these included heihe , pingxiang , huichun , yining , and ruili . meanwhile , the state council also gave its approval to these cities to establish fourteen border zones for economic cooperation .

1. in china , fourteen cities along the border opened to foreigners achieved remarkable economic development
2. xinhua news agency , beijing , february 12 - the economic development in china 's fourteen cities along the border opened to foreigners achieved gratifying results in 1995 .
3. according to statistics , these cities completed a gross domestic product in excess of rmb 19 billion in last year , an increase of more than 90 % over 1991 ( the year before they were opened ) .
4. in 1992 , the state council successively approved fourteen cities along the border to be opened to foreigners , which included hei he , pingxiang , hunchun , yining and ruili etc. at the same time , these cities were also given approvals to set up fourteen border @-@ economic @-@ cooperation zones .

1. economic construction achievement is prominent in china 's fourteen border opening up cities .
2. xinhua news agency , beijing , february 12 - delightful economic construction result was achieved in china 's fourteen border opening up cities in 1995 .
3. according to statistics , gdp registered over 19 billion yuan last year in those cities , over 90 % higher than those of year 1991 before opening up .
4. fourteen border cities like heihe , pingxiang , huichun , yinin , and ruili etc were approved successively by the state council in 1992 as the cities opening to the outside world , setting up of fourteen border economic cooperation zones in these cities were also approved simultaneously .

1. china 's 14 open border cities marked economic achievements
2. xinhua news agency , beijing , february 12 chinese 14 border an open city 1995 economic development to achieve good results
3. according to statistics , the city last year 's gross domestic product ( gdp ) over 19 billion yuan , and opening up of more than 90 % growth in 1991 .
4. the state council in 1992 has approved the heihe , pingxiang , huichun , yining and ruili , 14 border cities as an open city , and the city also approved a total of 14 border economic cooperation .
Syntactic features
(S1 (S (PP (IN in) (NP (NNP china))) (, ,) (NP (NP (CD fourteen) (NNS cities)) (PP (IN along) (NP (DT the) (NN border)))) (VP (VBN opened) (PP (TO to) (NP (NP (NNS foreigners)) (VP (VBN achieved) (NP (JJ remarkable) (JJ economic) (NN development))))))))
(S1 (NP (NP (JJ significant) (NN accomplishment)) (VP (VBN achieved) (PP (IN in) (NP (NP (DT the) (JJ economic) (NN construction)) (PP (IN of) (NP (NP (DT the) (CD fourteen) (JJ open) (NN border) (NNS cities)) (PP (IN in) (NP (NNP china))))))))))
(S1 (S (NP (CD fourteen) (ADJP (JJ chinese) (JJ open)) (NN border) (NNS cities)) (VP (VBP make) (NP (JJ significant) (NNS achievements)) (PP (IN in) (NP (JJ economic) (NN construction))))))
(S1 (S (NP (JJ economic) (NN construction) (NN achievement)) (VP (AUX is) (ADJP (JJ prominent) (PP (IN in) (S (NP (NP (NNP china) (POS 's)) (NP (CD fourteen) (NN border))) (VP (VBG opening) (PRT (RP up)) (NP (NNS cities)))))))))
(S1 (S (NP (NP (NNP china) (POS 's)) (CD 14) (ADJP (JJ open)) (NN border) (NNS cities)) (VP (VBD marked) (NP (JJ economic) (NNS achievements)))))
[Figure: two structured sentence representations. TR: a dependency tree with node labels PRED, APPS, ACT, PAT, APP, RSTR, EXT, TWHEN over the words say, increase, Spoon, name, rate, pct, January, Alan, recently, president, ad, Newsweek, 5. FUF: a feature structure with attributes CAT, PROCESS, PARTIC, CIRCUM, AGENT, AFFECTED, CREATED, LEX, TENSE, OBJECT-CLAUSE, HEAD, CLASSIFIER, POSSESSOR, DETERMINER, PREP.]
Results
• BLEU baseline: 31.6%
• Most features: 30.0%-31.8%
• Flipdeps: 31.8%
• Best single feature: 32.5%
• Best combination: 32.9% (statistically significant improvement)
• Results in [Och&al.04]
Phylogenetic Text Modeling
Machine translation identification
其他党政及司法部门也必须从明年年初开始采取类似行动。
1. Other Party, governmental and law enforcement authorities must take similar actions beginning from the start of next year.
2. Other Party and government agencies and judicial departments must also take similar actions early next year.
3. All other Party, Government and Judicial Departments must start similar actions at the beginning of next year.
4. Other Party, government, and judicatory departments must take similar action at the beginning of next year.
5. Other party and government departments as well as judicial departments must take similar action from the beginning of next year.
6. All other party government and judicial departments must also take similar measures from the beginning of next year.
7. Other party and judicial authorities should take similar actions from the beginning of next year.
8. Other departments of the Party, the government and the judicial departments must also take similar actions early next year.
9. Other Party and Government departments as well as judicial departments must also take similar measures from the beginning of next year.
10. The other law enforcement agencies and departments will also take part in similar proceedings from the beginning of next year.
11. Other party, governmental and judicial departments will have to take similar action from the beginning of next year.
12. Other party politics and judicial department also will have to start from next year beginning of the year to adopt similar motion.
13. Other party and judicial section must start from the beginning of year of next year taking similar action also
14. The beginning of a year for and res judiciaria as welling must from next year of other party commences assuming is similar toing the proceeding.
15. At the beginning of next year politics and judicial department other parties must also start to pick to take similar action.
16. Other party politics and the judicial department also will have to start from at the beginning of next year to take the similar action.
17. Other party policies and judicial department must also begin from early next year to take similar action.
t-test: p<0.05
• Chinese: Levenshtein 50/50, BLEU 50/50
• Arabic: Levenshtein 50/50, BLEU 48/50
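Levenshtein distance, one of the two measures named in these results, can be sketched as below. The slide does not say how the distance was applied, so this only illustrates the measure itself; applying it to word tokens of two of the translations listed earlier is an assumption.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))             # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete x
                           cur[j - 1] + 1,             # insert y
                           prev[j - 1] + (x != y)))    # substitute, or match at no cost
        prev = cur
    return prev[-1]

t4 = "Other Party, government, and judicatory departments must take similar action at the beginning of next year.".split()
t11 = "Other party, governmental and judicial departments will have to take similar action from the beginning of next year.".split()
print(levenshtein(t4, t11))   # word-level distance between translations 4 and 11
```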