Information Access I Multilingual Text Summarization GSLT, Göteborg, October 2003 Barbara Gawronska, Högskolan i Skövde
Types of summaries (Spärck Jones 1999, Hovy & Lin 1999)
• With respect to content:
  • Indicative: provide an idea of what the text is about, but do not render the content
  • Informative: shortened versions of the text
• With respect to how they are created:
  • Extracts: reused portions of the text
  • Abstracts: re-generated text reflecting the important content
  • Compressed texts (Knight & Marcu 2000): compressing syntactic parse trees in order to get a shorter text
Text compression (Knight & Marcu 2000, Lin 2003)
• ”Given the original sentence t, find the best short sentence s generated from t, i.e. maximize P(s|t).”
Original sentence (Lin 2003): In Louisiana, the hurricane landed with wind speeds of about 120 miles per hour and caused severe damage in small coastal centres such as Morgan City, Franklin and New Iberia.
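The objective above can be sketched in a few lines. By Bayes' rule, maximizing P(s|t) over s is the same as maximizing P(s)·P(t|s), because P(t) does not depend on s. The sketch below is a word-deletion simplification, not the parse-tree compression cited on the slide: the toy corpus, the add-one bigram source model, the constant `p_insert`, and all function names are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations

# Hypothetical toy corpus used to estimate the source model P(s).
CORPUS = [
    "the hurricane caused severe damage",
    "the hurricane landed in louisiana",
    "the storm caused damage",
]

def train_bigrams(sentences):
    """Count bigrams and their left contexts, with sentence boundary markers."""
    counts, context, vocab = {}, {}, {"</s>"}
    for line in sentences:
        tokens = ["<s>"] + line.split() + ["</s>"]
        vocab.update(tokens[1:])
        for a, b in zip(tokens, tokens[1:]):
            counts[(a, b)] = counts.get((a, b), 0) + 1
            context[a] = context.get(a, 0) + 1
    return counts, context, len(vocab)

COUNTS, CONTEXT, V = train_bigrams(CORPUS)

def log_p_source(words):
    """log P(s) under an add-one smoothed bigram model (toy; ignores OOV mass)."""
    tokens = ["<s>"] + words + ["</s>"]
    return sum(
        math.log((COUNTS.get((a, b), 0) + 1) / (CONTEXT.get(a, 0) + V))
        for a, b in zip(tokens, tokens[1:])
    )

def log_p_channel(original, short, p_insert=0.2):
    """log P(t|s): every word of t that is missing from s costs log(p_insert)."""
    return (len(original) - len(short)) * math.log(p_insert)

def compress(sentence):
    """Return the order-preserving word subsequence s maximizing P(s) * P(t|s)."""
    words = sentence.lower().split()
    best, best_score = None, -math.inf
    for k in range(1, len(words) + 1):
        for idx in combinations(range(len(words)), k):
            cand = [words[i] for i in idx]
            score = log_p_source(cand) + log_p_channel(words, cand)
            if score > best_score:
                best, best_score = cand, score
    return " ".join(best), best_score

if __name__ == "__main__":
    print(compress("In Louisiana the hurricane caused severe damage"))
```

The exhaustive search over subsequences is exponential in sentence length, so it is only usable for short inputs; it is meant to make the argmax explicit, not to be efficient.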
Different genres and tasks require different summaries (informative summaries are not well suited to detective stories), and different texts require different summarization techniques
A special case, dialogue summarization: selecting successful ’dialogue transactions’ with a game-theoretic approach (Verbmobil: Wahlster, Alexandersson)
Multilingual summarization:
• Extracting/compressing + MT, or
• Abstracting + multilingual generation
A possible combination system including multilingual summarization of news reports
The main objectives of the Newspeak project:
• Evaluation of different methods of semantic classification in the lexicon
• Development of a summarization module well suited to the news domain
• A comparison between ‘traditional’ machine translation (MT) on the one hand, and information extraction (IE) combined with reading comprehension (RC) and multilingual text generation (MTG) on the other
• Exploration of the interplay between textual structure, syntax, and prosodic markers
One of the main problems with media texts: there is no way of establishing what is a true fact (hence, some criticism could be raised against TREC factoid questions...)
GUERRILLA FIGHTS IN LEBANON
Israeli warplanes and artillery attacked suspected guerrilla hideouts Friday following a series of clashes in south Lebanon. Four guerrillas were reportedly killed. Guerrillas of the Syrian-backed Amal group attacked Israeli and allied militia positions in the Israeli-occupied zone at daybreak, Lebanese security officials said. Three guerrillas were killed in the assaults, said an Israeli army spokesman in Jerusalem. Amal said none of its fighters was killed.
The Theory of Mental Spaces (Fauconnier 1985, Fauconnier and Sweetser 1996)
The notion of ’mental spaces’ (Fauconnier 1985, Sweetser & Fauconnier 1996, Sanders & Redeker 1996)
Sample text 2 BEIT JALA, West Bank Israeli troops pulled out of Beit Jala before dawn on Thursday, leaving the Palestinian town quiet amid reports of fresh violence in other West Bank towns. The Palestinians said the Israel Defence Forces had staged incursions into Hebron, killing one and injuring 16 others, and Tulkarem, killing one and injuring 10. The Israel Defence Forces (IDF) had no immediate comment on the accusation that troops had entered Tulkarem, and strongly denied there was an incursion at Hebron.
Exploding objects WordNet Classification: missile bomb
Sample text 3 Iraqi President Saddam Hussein is striking a defiant tone a day after U.S. President George Bush's State of the Union address, saying his nation is ready to "destroy and defeat" any American attack. In a televised meeting with his military commanders on Wednesday, Saddam said the U.S. had no right to attack his country, and every American soldier is coming "as an aggressor." "If they have illusions, by God, America will be harmed," the Iraqi leader said. "[It is] not in the American people's interest that such harm come to it, its reputation and economy." In a powerful address Tuesday evening, Bush braced Americans and the rest of the world for a possible war with Iraq, warning that America was determined in its resolve to see Saddam disarmed.
Sample output from SemCat + speaker and speech act identification [source(semcat(Iraqi President Saddam Hussein,[propername,human([]),human([high_status])])),semcat(tone,[[],speech_act(manner)]),circ([semcat(is,[[],cop([])]),semcat(striking,[[],[]]),semcat(a,[[],det([])]),semcat(defiant,[[],[]])]),said([semcat(a,[[],det([])]),semcat(day,[[],time_period([])]),semcat(after,[[],prep([])]),semcat(U.S. President George Bush_s State,[propername,place([country]),group_of_people([]),human([high_status]),human([]),place([d23,convent_borders])]),semcat(of,[[],prep([])]),semcat(the,[[],det([])]),semcat(Union,[propername,explosion([]),group_of_people([]),place([country])]),semcat(address,[[],speech_act([neutral]),place([d2])]),semcat(saying,[[],say_verb([neutral])]),semcat(his,[[],poss([])]),semcat(nation,[[],place([country]),group_of_people([])]),semcat(is,[[],cop([])]),semcat(ready,[[],[]]),semcat(to,[[],prep([])]),semcat(",[[],[]]),semcat(destroy,[[],[]]),semcat(and,[[],konj([])]),semcat(defeat,[[],[]]),semcat(",[[],[]]),semcat(any,[[],det([])]),semcat(American,[propername,human([])]),semcat(attack,[[],military_operation([])]),semcat(.,[[],[]])]),[]] [source(semcat(Saddam,[propername,[]])),semcat(said,[[],say_verb([neutral])])…
Sample output from SemCat + speaker and speech act identification (2) coreference checked [source(semcat(Iraqi President Saddam Hussein,[propername,human([]),human([high_status])])),semcat(tone,[[],speech_act(manner)]),circ([semcat(is,[[],cop([])]),semcat(striking,[[],[]]),semcat(a,[[],det([])]),semcat(defiant,[[],[]])]),said([semcat(a,[[],det([])]),semcat(day,[[],time_period([])]),semcat(after,[[],prep([])]),semcat(U.S. President George Bush_s State,[propername,place([country]),group_of_people([]),human([high_status]),human([]),place([d23,convent_borders])]),semcat(of,[[],prep([])]),semcat(the,[[],det([])]),semcat(Union,[propername,explosion([]),group_of_people([]),place([country])]),semcat(address,[[],speech_act([neutral]),place([d2])]),semcat(saying,[[],say_verb([neutral])]),semcat(his,[[],poss([])]),semcat(nation,[[],place([country]),group_of_people([])]),semcat(is,[[],cop([])]),semcat(ready,[[],[]]),semcat(to,[[],prep([])]),semcat(",[[],[]]),semcat(destroy,[[],[]]),semcat(and,[[],konj([])]),semcat(defeat,[[],[]]),semcat(",[[],[]]),semcat(any,[[],det([])]),semcat(American,[propername,human([])]),semcat(attack,[[],military_operation([])]),semcat(.,[[],[]])]),[]] [source(semcat(Iraqi President Saddam Hussein,[propername,human([]),human([high_status])])),semcat(said,[[],say_verb([])]),…
The classification of speech act phrases in the Newspeak lexicon (1)
The classification of speech act phrases in the system lexicon (2)
Some principles for selection of claims to be rendered:
1) Informatives:
• Neutral, the sender is not marked for high status: officials said, the news agency reported, reportedly… A claim p introduced by a neutral informative is rendered in the summary; the source is omitted if the text contains no denials or confirmations of p and if the source is not marked for high status (like ‘President’).
• Neutral, the sender is marked for high status, and ‘declarations’: the President said…, the government condemned… The source is rendered if it is marked for high status.
• Affirmative, confirmations of explicit claims: Israeli sources confirmed that… Confirmations of previous explicit claims are omitted in the summary.
• Affirmative, confirmations of claims that are not explicitly mentioned: if the speech act confirms a claim not present in the news report, both the information source and the claim, including the type of the speech act phrase, are rendered in the summary.
• Negative, or neutral followed by denied claims: The president denied…, The Israeli source said that it is not true… Both the initial claim and its denial are rendered in the summary, together with the information about the senders.
2) Utterance refusal, negated speech act phrases, hypotheses, commissives, interpretations: The Israeli sources neither denied nor confirmed…, the minister did not say if…, the defense secretary declined to say…, the government had no immediate comment…
• Utterance refusals or negated speech act phrases related to an explicit claim are omitted.
• If a source refuses to confirm or deny a claim that has not been explicitly mentioned earlier in the text, the whole speech act is rendered, including the type of the speech act.
• Hypotheses and commissives are rendered together with their sources and marked for unsure epistemic status.
3) Epistemic spaces, e.g. no one knows if the device was planted deliberately or if it was leftover from New Year’s Eve
• If two claims would exclude each other in the same mental space, and if no source in the text takes responsibility for either of these claims, both claims are to be rendered as hypotheses.
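The selection principles for informatives, refusals, hypotheses and commissives can be read as a small decision procedure. The sketch below is one possible encoding, not the Newspeak implementation: the `Claim` fields, the speech act labels, and the output phrasing are all hypothetical, and the epistemic-space rule (3) is left out.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    text: str                      # propositional content of the claim
    source: str                    # who made the speech act
    act: str                       # 'neutral', 'affirmative', 'negative',
                                   # 'refusal', 'hypothesis' or 'commissive'
    high_status: bool = False      # source marked for high status (e.g. a president)
    target_explicit: bool = True   # is the claim referred to stated elsewhere in the text?
    contested: bool = False        # does the text deny or confirm this claim?

def render(c: Claim) -> Optional[str]:
    """Return the summary rendering of a claim, or None if it is omitted."""
    if c.act == "neutral":
        # Keep the source only if it has high status or the claim is contested.
        if c.high_status or c.contested:
            return f"{c.source} said that {c.text}"
        return c.text
    if c.act == "affirmative":
        # Confirmations of explicit claims are dropped; others are kept in full.
        return None if c.target_explicit else f"{c.source} confirmed that {c.text}"
    if c.act == "negative":
        # Denials are always rendered together with their sender.
        return f"{c.source} denied that {c.text}"
    if c.act == "refusal":
        if c.target_explicit:
            return None
        return f"{c.source} declined to confirm or deny that {c.text}"
    if c.act in ("hypothesis", "commissive"):
        # Rendered with the source and flagged as epistemically unsure.
        return f"[unsure] {c.source}: {c.text}"
    raise ValueError(f"unknown speech act type: {c.act}")

if __name__ == "__main__":
    claims = [
        Claim("three guerrillas were killed in the assaults",
              "An Israeli army spokesman", "neutral", contested=True),
        Claim("any of its fighters was killed", "Amal", "negative"),
    ]
    for c in claims:
        print(render(c))
```

Under this encoding, the contested claim from the Lebanon report keeps its source, and the denial is rendered next to it, which matches the rule that both the initial claim and its denial appear with their senders.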
Sample input text RAMALLAH, West Bank -- Palestinian leader Yasser Arafat said Thursday that elections as part of a reform of the Palestinian Authority will be held this winter, whether or not Israeli forces withdraw from the Palestinian territories. That represented a change of course from Arafat, who said last week that no elections would be held until the Israelis pulled back. Shortly after Arafat's announcement, a committee he had appointed to set up elections resigned, according to Israel Radio, because Arafat would not agree to a specific date for the elections. Other Palestinian leaders said the resignations were a procedural matter. Arafat also condemned Wednesday's suicide bombing in the Israeli town of Rishon Letzion. Two Israelis were killed and at least 37 others wounded when the bomber detonated explosives in the center of a crowded pedestrian district. The terror attack marked the second time in two weeks a suicide bombing directed at civilians has rocked Rishon Letzion, a town about 15 miles southeast of Tel Aviv. On May 8, a suicide attack at a pool hall killed 15 people and wounded dozens of others. "Suddenly there was an explosion," 16-year-old Shmuel Voller told The Associated Press on Wednesday. The bombing occurred on Rothschild Street in the heart of the town around 9:15 p.m. (2:15 p.m. ET).
Generation: sample summary RAMALLAH, West Bank -- Palestinian leader Yasser Arafat said Thursday that elections as part of a reform of the Palestinian Authority will take place this winter, whether or not Israeli forces withdraw from the Palestinian territories. On Wednesday, a suicide bombing took place in the Israeli town of Rishon Letzion, on Rothschild Street in the center of a crowded pedestrian district, around 9:15 p.m. (2:15 p.m. ET). Two Israelis were killed and at least 37 others wounded. Arafat condemned the attack.
Generation
• TL vocabulary more restricted than SL vocabulary
• TL patterns fit textual/semantic relations
Swedish: Israeliska trupper tågade ut ur Beit Jala
Israeli+pl troops marched out of/left Beit Jala (tågade ut ur instead of *drog ut av)
Polish: Wojska izraelskie wycofały się z Beit Jala
Troops Israeli backed out from Beit Jala (wycofały się instead of *wyciągnęły or *wyciągały).
Generation
E: A bomb exploded in Bilbao, Spain, early Friday morning.
S: En bomb exploderade i den spanska staden Bilbao tidigt på fredagsmorgonen
   a bomb explode-past in def Spanish city Bilbao early on Friday-morning-def
E: There were no injuries.
S: Inga personskador rapporterades
   no person-injuries report-past-passive
E: ETA is suspected of being responsible for the attack.
S: Förmodligen ligger ETA bakom bombdådet.
   presumably lie-pres ETA behind bomb-outrage-def
Animacy degree | Grammatical gender | Semantic features | Accusative form | Adjective ending in plural | Verb ending in plural, past tense
inanimate | +ma/+fe; +ne | -alive; +/- alive (for +ne) | acc=nom | -e | -ły
semianimate | +ma | -alive, +mobile or +spherical | sg: acc=gen or acc=nom, pl: acc=nom | -e | -ły
animate | +ma/+fe | +alive | sg: acc=gen, pl: acc=nom | -e | -ły
superanimate | +ma | +human | acc=gen | -i/-y | -li
The grammatical and semantic characteristics of Polish nouns
Duchy sta-ły
Ghost+PL stand+PAST+PL
’The ghosts were standing (there)’

Psy sta-ły
Dog+PL stand+PAST+PL
’The dogs were standing (there)’

Dziewczynki sta-ły
Girl+PL stand+PAST+PL
’The girls were standing (there)’

Krzesła sta-ły
Chair+PL stand+PAST+PL
’The chairs were standing (there)’

Chłopcy sta-li
Boy+PL stand+PAST+PL+MALE+HUMAN
’The boys were standing (there)’
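The contrast in these examples (-ły everywhere except with masculine human subjects, which take -li) can be expressed as a lookup on the animacy degree. The following is a minimal sketch, assuming the endings given in the table of Polish noun characteristics; the names `AGREEMENT`, `NOUN_ANIMACY` and `past_plural` are hypothetical, and the animacy labels assigned to the sample nouns are my reading of that table rather than entries from the project lexicon.

```python
# Plural agreement endings per animacy degree, as listed in the table of
# Polish noun characteristics (adjective ending, past-tense verb ending).
AGREEMENT = {
    "inanimate":    {"adj_pl": "-e",    "verb_past_pl": "-ły"},
    "semianimate":  {"adj_pl": "-e",    "verb_past_pl": "-ły"},
    "animate":      {"adj_pl": "-e",    "verb_past_pl": "-ły"},
    "superanimate": {"adj_pl": "-i/-y", "verb_past_pl": "-li"},
}

# Hypothetical mini-lexicon: animacy labels are assumptions for illustration.
NOUN_ANIMACY = {
    "psy": "animate",            # dogs: masculine, alive
    "dziewczynki": "animate",    # girls: feminine, alive
    "krzesła": "inanimate",      # chairs: neuter
    "chłopcy": "superanimate",   # boys: masculine, human
}

def past_plural(stem: str, animacy: str) -> str:
    """Attach the past-tense plural ending that agrees with the subject's animacy."""
    ending = AGREEMENT[animacy]["verb_past_pl"].lstrip("-")
    return stem + ending

if __name__ == "__main__":
    for noun, animacy in NOUN_ANIMACY.items():
        print(noun.capitalize(), past_plural("sta", animacy))
```

Only the superanimate class changes the verb form, which is why a verb in -li is a useful cue for finding masculine human nouns in running text.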
Extracting ’superanimate’ nouns (1)
Pojawili się więc Algierczycy, Jemeńczycy, obywatele Bangladeszu, Uzbecy, Kirgizi i Tadżycy.
’There arrived Algerians, Yemenis, citizens of Bangladesh, Uzbeks, Kirgizis, and Tadjiks’
Stop-list and a suffix list with declension numbers:
Algierczycy n hum ma pl nom 35
Jemeńczycy n hum ma pl nom 35
obywatele n hum ma pl nom 14
Uzbecy n hum ma pl nom 36
Kirgizi n hum ma pl nom 38
Tadżycy n hum ma pl nom 35
Pojawili v hum ma pl
Extracting ’superanimate’ nouns (2)
Postverbal subjects:
We wtorek w stolicy Kataru zebrali się na nieformalnej konferencji ministrowie 22 państw Ligi Arabskiej.
‘Ministers of the 22 Arab League states gathered for an informal conference in the capital of Qatar on Tuesday.’
Preverbal subjects:
W przyjętej w Dausze wspólnej deklaracji Arabowie zdecydowanie potępili terroryzm we wszelkich formach.
‘In the joint declaration the Arab leaders have strongly condemned all forms of terrorism.’
Antecedents of the relative pronoun ’którzy’:
Komórka składała się z wielu dziesiątek osób, w tym dwóch pilotów, którzy kształcili się w tych samych szkołach amerykańskich, co Mohammed Atta.
‘The cell consisted of dozens of people, including two pilots, who had completed their education at the same American schools that Mohammed Atta attended.’
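The extraction idea on these two slides (a verb in -li signals a masculine human plural subject, and a stop-list plus a suffix list pick out the candidate nouns) can be sketched roughly as follows. This is a simplified illustration, not the project's procedure: the stop-list contents, the suffix boundaries, and the pairing of suffixes with declension numbers are assumptions read off the sample output for the first example sentence, and the -li test is a crude cue that would also fire on non-verbs.

```python
import re

# Hypothetical stop-list of function words to skip.
STOP_LIST = {"się", "więc", "i", "w", "we", "na", "z"}

# Hypothetical suffix list: nominative plural suffix -> declension number.
SUFFIX_DECLENSION = {"ycy": 35, "ecy": 36, "ele": 14, "zi": 38}

def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())

def extract_superanimate(sentence):
    """Return (noun, declension number) pairs for candidate superanimate nouns."""
    tokens = [t for t in tokenize(sentence) if t not in STOP_LIST]
    # Cue: a past-tense plural form in -li agrees only with masculine human subjects.
    if not any(t.endswith("li") for t in tokens):
        return []
    found = []
    for token in tokens:
        # Try longer suffixes first so the most specific entry wins.
        for suffix in sorted(SUFFIX_DECLENSION, key=len, reverse=True):
            if token.endswith(suffix):
                found.append((token, SUFFIX_DECLENSION[suffix]))
                break
    return found

if __name__ == "__main__":
    sentence = ("Pojawili się więc Algierczycy, Jemeńczycy, obywatele "
                "Bangladeszu, Uzbecy, Kirgizi i Tadżycy.")
    for noun, declension in extract_superanimate(sentence):
        print(noun, "n hum ma pl nom", declension)
```

A fuller version would restrict the search to the positions illustrated above (postverbal subjects, preverbal subjects, antecedents of ’którzy’) instead of scanning the whole sentence.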
The decrease of unknown superanimate noun forms during the training phase (training on 4 files, ca 11 000 words each) – normalized data
 | World news | Sport | Science | Business
Nouns (types) found in the database | 166 | 47 | 64 | 48
Nouns (types) added to the database | 24 | 121 | 98 | 78
Total | 190 | 168 | 162 | 126
The results of post-editing after the training phase
The lexical coverage of different text domains
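The counts in this table translate directly into a coverage figure per domain. The short calculation below treats coverage as the share of noun types already in the database before post-editing, found / (found + added); that definition is my reading of the table, and the variable names are hypothetical.

```python
# Noun types per domain, copied from the table above.
found = {"World news": 166, "Sport": 47, "Science": 64, "Business": 48}
added = {"World news": 24, "Sport": 121, "Science": 98, "Business": 78}

def coverage(found, added):
    """Share of noun types per domain that were already in the database."""
    return {domain: found[domain] / (found[domain] + added[domain])
            for domain in found}

if __name__ == "__main__":
    for domain, share in coverage(found, added).items():
        print(f"{domain}: {share:.1%}")
```

On this reading, the lexicon covers about 87% of the noun types in world news but only about 28% in sport, which fits a lexicon that was built on news reports.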
The general procedure for extracting and classifying different word classes in Polish
Further development
• Domain extension
• Further work on the target lexicon
• Feedback from the generation module into the source lexicon
• Continued study of relations between textual structure and prosody