Learn about the importance of information extraction, tools and applications for extracting structured information from unstructured and semi-structured documents, and the future of this field. Examples provided.
Information Extraction (Romanian: Extragerea informației; Russian: Извлечение информации) If information is power and riches, then it is not the amount that gives the value, but access at the right time and in the most suitable form.
Outline • 1. Why? Necessity and importance • 2. What exactly? Description and details • 3. How? Tools • 4. Applications. Software (free downloadable or API) • 5. The future
Example • Having set off from Halifax (eastern Canada) on 1 July, the car, covered in solar cells and shaped like a flying saucer, arrived at the end of last week in Vancouver, in the west, beating the previous world record of 4,000 kilometres, AFP reports. • Nicolae Badea, president of the "Dinamo" Bucharest club, who was robbed on the night of 16 July, was questioned yesterday for more than five hours by prosecutors of the Prosecutor's Office attached to the Bucharest Court of Appeal.
Example The "Red Dogs" promised coach Cornel Dinu a fine birthday present: qualification for the next round. Dinamo wants a miracle. The 3-4 result in the first-leg match against Polonia Warsaw has considerably reduced Dinamo's chances of qualifying for the third preliminary round of the Champions League. Worse still, it has considerably diminished the chance that the "red-and-whites" will play even in the UEFA Cup.
Definition • Information Extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. • In most cases, this activity concerns the analysis of texts using natural language processing (NLP) methods. • More recently, information extraction has also come to include the processing of multimedia documents, such as automatic annotation and content extraction from images, audio and video.
NER - Named Entity Recognition The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply. Finding: locate the entity mentions in the text. Classification: assign each mention a class such as Person, Location, Date, Organization, etc.
Example • Having set off from Halifax (eastern Canada) on 1 July, the car, covered in solar cells and shaped like a flying saucer, arrived at the end of last week in Vancouver, in the west, beating the previous world record of 4,000 kilometres, AFP reports. • Nicolae Badea, president of the "Dinamo" Bucharest club, who was robbed on the night of 16 July, was questioned yesterday for more than five hours by prosecutors of the Prosecutor's Office attached to the Bucharest Court of Appeal. Classification: Person, Location, Date, Organization, etc.
Example The "Red Dogs" promised coach Cornel Dinu a fine birthday present: qualification for the next round. Dinamo wants a miracle. The 3-4 result in the first-leg match against Polonia Warsaw has considerably reduced Dinamo's chances of qualifying for the third preliminary round of the Champions League. Worse still, it has considerably diminished the chance that the "red-and-whites" will play even in the UEFA Cup. Classification: Person, Location, Date, Organization, etc.
Entity classes The contract for the design, execution and completion of the 3.3 "outstanding" kilometres of the Bucharest - Ploiești section of the Bucharest - Brașov motorway has been signed, its value amounting to 129.181 million lei, excluding VAT, according to an announcement by the Romanian National Company of Motorways and National Roads (CNADNR), posted on Friday evening on the Facebook page of the Ministry of Transport (MT). Saturday, 19 Dec 2015, 02:21 • Person • Organization • Location • Date • Time • Percentage • Monetary amount
Entity classes AGERPRES special correspondent Florentina Peia reports: The Romanian injured in the Tel Aviv attack has been discharged from hospital and returned to the country on Friday; President Klaus Iohannis will agree with him on a day for them to meet, representatives of the Presidential Administration said. Friday, 11 Mar 2016, 11:55 • Person • Organization • Location • Date • Time • Percentage • Monetary amount
Entity classes Global sales of the Volkswagen brand fell by 4.7% in February, to 394,000 vehicles, amid declining demand in the US, South America and the Asia-Pacific region, EFE and Reuters report. Friday, 11 Mar 2016, 11:48 • Person • Organization • Location • Date • Time • Percentage • Monetary amount
MUC: Message Understanding Competition / Message Understanding Conference
MUC-7 • The Named Entity task • Three subtasks: • Entity names: organizations, persons, locations • Temporal expressions: dates, times • Number expressions: monetary values, percentages
Examples:
<ENAMEX TYPE="ORGANIZATION">Taga Co.</ENAMEX>
<ENAMEX TYPE="LOCATION">North and South America</ENAMEX>
<TIMEX TYPE="DATE" ALT="1987">all of 1987</TIMEX>
<TIMEX TYPE="DATE">from 1990 through 1992</TIMEX>
<NUMEX TYPE="MONEY">10- and 20-dollar</NUMEX> bills
<NUMEX TYPE="MONEY">175 to 180 million Canadian dollars</NUMEX>
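The inline markup above can be read back into (tag, type, text) triples with a small script. This is only a minimal sketch: the function name `extract_entities` and the glue words in `sample` are made up for illustration, and the regular expression assumes flat, non-nested tags with a TYPE attribute, which is not the full MUC-7 specification.

```python
import re

# One MUC-style inline tag, e.g. <ENAMEX TYPE="LOCATION">...</ENAMEX>.
# The backreference \1 forces the closing tag to match the opening tag.
TAG_RE = re.compile(
    r'<(ENAMEX|TIMEX|NUMEX)\s+TYPE="([A-Z]+)"[^>]*>(.*?)</\1>',
    re.DOTALL,
)

def extract_entities(annotated):
    """Return (tag, type, text) triples for every annotated span."""
    return [(m.group(1), m.group(2), m.group(3))
            for m in TAG_RE.finditer(annotated)]

# The tagged spans come from the slide; "operated in" and "for" are
# invented filler so the sample reads as a sentence.
sample = (
    '<ENAMEX TYPE="ORGANIZATION">Taga Co.</ENAMEX> operated in '
    '<ENAMEX TYPE="LOCATION">North and South America</ENAMEX> for '
    '<TIMEX TYPE="DATE" ALT="1987">all of 1987</TIMEX>.'
)
entities = extract_entities(sample)
for tag, entity_type, text in entities:
    print(tag, entity_type, text)
```

Extra attributes such as ALT are skipped by `[^>]*`, so the same pattern covers all three tag families.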
Complex cases (source: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html)
Titles and roles are left outside the tags:
"Mips Vice President John Hime": <ENAMEX TYPE="ORGANIZATION">Mips</ENAMEX> Vice President <ENAMEX TYPE="PERSON">John Hime</ENAMEX>
"Treasury Secretary": <ENAMEX TYPE="ORGANIZATION">Treasury</ENAMEX> Secretary
"the U.S. Vice President": the <ENAMEX TYPE="LOCATION">U.S.</ENAMEX> Vice President
Multi-word names are tagged as a single entity:
"Arthur Anderson Consulting": <ENAMEX TYPE="ORGANIZATION">Arthur Anderson Consulting</ENAMEX>
"Boston Chicken Corp.": <ENAMEX TYPE="ORGANIZATION">Boston Chicken Corp.</ENAMEX>
"U.S. Fish and Wildlife Service": <ENAMEX TYPE="ORGANIZATION">U.S. Fish and Wildlife Service</ENAMEX>
"Northern California": <ENAMEX TYPE="LOCATION">Northern California</ENAMEX>
"West Texas": <ENAMEX TYPE="LOCATION">West Texas</ENAMEX>
Possessive constructions are split into two entities:
"California's Silicon Valley": <ENAMEX TYPE="LOCATION">California</ENAMEX>'s <ENAMEX TYPE="LOCATION">Silicon Valley</ENAMEX>
"Canada's Parliament": <ENAMEX TYPE="LOCATION">Canada</ENAMEX>'s <ENAMEX TYPE="ORGANIZATION">Parliament</ENAMEX>
Aliases and nicknames are tagged with the type of the entity they stand for:
"Big Blue" [alias for International Business Machines Corp.]: <ENAMEX TYPE="ORGANIZATION">Big Blue</ENAMEX>
"Big Board" [alias for New York Stock Exchange]: <ENAMEX TYPE="ORGANIZATION">Big Board</ENAMEX>
"Mr. Fix-It" [nickname for candidate for head of the CIA]: Mr. <ENAMEX TYPE="PERSON">Fix-It</ENAMEX>
MUC-7 Information Extraction task 1. Template Element
Example: Fred Kilian, the squadron commander of the F-14 pilot in the Nashville crash that killed five people last week.
ENTITY ENT_NAME: "Fred Kilian" ENT_TYPE: PERSON ENT_DESCRIPTOR: "NAVY SQUADRON COMMANDER"
LOCATION LOCALE: "Berry Field" LOCALE_TYPE: AIRPORT COUNTRY: "United States"
MUC-7 Information Extraction task 2. Template Relation
Example: EMPLOYEE_OF :=
PERSON := ENT_NAME: "JOHN O'NEIL" ENT_TYPE: PERSON ENT_CATEGORY: PER_CIV
ORGANIZATION := ENT_NAME: "N.Y. Times News Service" ENT_TYPE: ORGANIZATION ENT_CATEGORY: ORG_CO
Relation Extraction (RE)
– EmployeeOf(Steve Jobs, Apple): a relation between a person and an organisation, extracted from ‘Steve Jobs works for Apple’
– LocatedIn(Smith, New York): a relation between a person and a location, extracted from ‘Mr. Smith gave a talk at the conference in New York’
– SubsidiaryOf(TVN, ITI Holding): a relation between two companies, extracted from ‘Listed broadcaster TVN said its parent company, ITI Holdings, is considering various options for the potential sale.’
Jakub Piskorski and Roman Yangarber. Information Extraction: Past, Present and Future. Chapter 2.
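The first relation can be recovered with a hand-written surface pattern. The sketch below is a toy illustration rather than the method of the cited chapter: `NAME`, `PATTERN` and `extract_employee_of` are hypothetical names, and a "name" is naively assumed to be a run of capitalised words.

```python
import re

# Naive assumption: a name is one or more capitalised words.
NAME = r"[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*"

# Surface pattern for one way of expressing employment.
PATTERN = re.compile(rf"({NAME}) works for ({NAME})")

def extract_employee_of(sentence):
    """Return EmployeeOf(person, organisation) triples found by the pattern."""
    return [("EmployeeOf", m.group(1), m.group(2))
            for m in PATTERN.finditer(sentence)]

relations = extract_employee_of("Steve Jobs works for Apple")
print(relations)
```

A single pattern like this misses paraphrases ("is employed by", "joined"), which is one reason practical systems combine many patterns or learn them from data.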
MUC-7 Information Extraction task 3. Scenario Template • This task is defined according to the domain of the texts. For MUC-7, launch reports for aerial vehicles were analysed, extracting the following information: • the vehicle's payload • the launch date and site • the mission type • the function
MUC-7 Information Extraction task 3. Scenario Template
Text: Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.
JOINT-VENTURE-1 • Relationship: TIE-UP • Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house” • Joint Ent: “Bridgestone Sports Taiwan Co.” • Activity: ACTIVITY-1 • Amount: NT$20 000 000
ACTIVITY-1 • Activity: PRODUCTION • Company: “Bridgestone Sports Taiwan Co.” • Product: “iron and ‘metal wood’ clubs” • Start date: DURING: January 1990
MUC-7 Coreference Task: coreference detection. Examples:
The sleepy boys and girls enjoy their breakfast.
<COREF ID="1" MIN="boys and girls">The sleepy boys and girls</COREF> enjoy <COREF ID="2" REF="1" TYPE="IDENT">their</COREF> breakfast.
Edna Fribble and Sam Morton addressed the meeting yesterday. Ms. Fribble discussed coreference, and Mr. Morton discussed unnamed entities.
<COREF ID="1">Edna Fribble</COREF> and <COREF ID="2">Sam Morton</COREF> addressed the meeting yesterday. <COREF ID="3" REF="1" TYPE="IDENT" MIN="Fribble">Ms. Fribble</COREF> discussed coreference, and <COREF ID="4" REF="2" TYPE="IDENT" MIN="Morton">Mr. Morton</COREF> discussed unnamed entities.
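The ID and REF attributes define chains of mentions that refer to the same entity. A minimal sketch of how such chains could be rebuilt from the markup follows; `coreference_chains` is a hypothetical helper, and it assumes flat (non-nested) COREF tags and no circular REF links.

```python
import re
from collections import defaultdict

COREF_RE = re.compile(r"<COREF\s+([^>]*)>(.*?)</COREF>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def coreference_chains(annotated):
    """Group mention texts by the first mention they ultimately refer to."""
    mentions = {}  # mention ID -> mention text
    parent = {}    # mention ID -> ID named in its REF attribute
    for m in COREF_RE.finditer(annotated):
        attrs = dict(ATTR_RE.findall(m.group(1)))
        mentions[attrs["ID"]] = m.group(2)
        if "REF" in attrs:
            parent[attrs["ID"]] = attrs["REF"]

    def root(mention_id):
        # Follow REF links until a mention without REF is reached.
        while mention_id in parent:
            mention_id = parent[mention_id]
        return mention_id

    chains = defaultdict(list)
    for mention_id, text in mentions.items():
        chains[root(mention_id)].append(text)
    return dict(chains)

annotated = (
    '<COREF ID="1">Edna Fribble</COREF> and <COREF ID="2">Sam Morton</COREF> '
    'addressed the meeting yesterday. '
    '<COREF ID="3" REF="1" TYPE="IDENT" MIN="Fribble">Ms. Fribble</COREF> '
    'discussed coreference, and '
    '<COREF ID="4" REF="2" TYPE="IDENT" MIN="Morton">Mr. Morton</COREF> '
    'discussed unnamed entities.'
)
chains = coreference_chains(annotated)
print(chains)
```

Each chain is keyed by the ID of its first mention, so "Ms. Fribble" ends up in the same group as "Edna Fribble".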
MUC-7 Coreference Task: coreference detection. Example: The "Red Dogs" promised coach Cornel Dinu a fine birthday present: qualification for the next round. Dinamo wants a miracle. The 3-4 result in the first-leg match against Polonia Warsaw has considerably reduced Dinamo's chances of qualifying for the third preliminary round of the Champions League. Worse still, it has considerably diminished the chance that the "red-and-whites" will play even in the UEFA Cup.
Event Extraction (EE)
An email with a formal meeting announcement: COMPUTER SYSTEMS LABORATORY COLLOQUIUM 4:15PM, Wednesday, June 02, 2004 NEC Auditorium, Gates Computer Science Building B03 http://ee380.stanford.edu Topic: Automating CapCom….
An email with an informal invitation: “hey i also wanted to invite you to the final software faire for cs194. It's where we will be showing off our senior project. It's next wed from like 12 -4 i think or something.”
Julie A. Black, Nisheeth Ranjan. Automated Event Extraction from Email
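The contrast between the two emails can be made concrete with a rule that handles only the formal layout. This is a rough sketch, not the approach of the cited work: `EVENT_RE` and `parse_event_time` are hypothetical names, the pattern assumes the exact "time, weekday, month day, year" order seen above, and month-name parsing assumes an English locale.

```python
import re
from datetime import datetime

# Matches e.g. "4:15PM, Wednesday, June 02, 2004".
EVENT_RE = re.compile(
    r"(\d{1,2}:\d{2}\s*[AP]M),\s*\w+,\s*(\w+ \d{1,2}, \d{4})"
)

def parse_event_time(text):
    """Return the event start as a datetime, or None if no match is found."""
    m = EVENT_RE.search(text)
    if m is None:
        return None
    clock = m.group(1).replace(" ", "")
    return datetime.strptime(f"{m.group(2)} {clock}", "%B %d, %Y %I:%M%p")

formal = ("COMPUTER SYSTEMS LABORATORY COLLOQUIUM 4:15PM, Wednesday, "
          "June 02, 2004 NEC Auditorium, Gates Computer Science Building B03")
informal = "It's next wed from like 12 -4 i think or something."

print(parse_event_time(formal))
print(parse_event_time(informal))
```

The informal invitation yields no match: "next wed" is relative to the sending date and "12 -4" has no AM/PM marker, so resolving it needs context that a fixed pattern does not have.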
MUC-7 results [Figure: results for the Coreference, Named Entities, Scenario Templates and Template Elements tasks] Elaine Marsh, Dennis Perzanowski. MUC-7 Evaluation of IE Technology: Overview of Results, 1998
Multilingual IE (NER) • The highly multilingual media analysis application Europe Media Monitor (EMM) makes extensive use of name dictionaries, including not only large lists of person, organisation and location names, but also many spelling variants for the same named entity, both within the same language and across languages and scripts. As EMM could not operate without these non-traditional dictionaries, we wish to make a strong case in their favour. In this chapter, we will explain how such vocabulary lists are used within EMM and how they were produced automatically by analysing over 100,000 news articles per day in over twenty languages. A large part of EMM’s vocabulary lists is made publicly available for download as part of JRC-Names. STEINBERGER Ralf; JACQUET GUILLAUME; DELLA ROCCA Leonida. Creation and use of multilingual named entity variant dictionaries
Multilingual IE (NER) • As developers of a highly multilingual named entity recognition (NER) system, we face an evaluation resource bottleneck problem: we need evaluation data in many languages, the annotation should not be too time-consuming, and the evaluation results across languages should be comparable. We solve the problem by automatically annotating the English version of a multi-parallel corpus and by projecting the annotations into all the other language versions. For the translation of English entities, we use a phrase-based statistical machine translation system as well as a lookup of known names from a multilingual name database. For the projection, we incrementally apply different methods: perfect string matching, perfect consonant signature matching and edit distance similarity. The resulting annotated parallel corpus will be made available for reuse. STEINBERGER Ralf; JACQUET GUILLAUME; DELLA ROCCA Leonida. Creation and use of multilingual named entity variant dictionaries
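The three projection methods are applied incrementally, from strictest to loosest. The sketch below illustrates that cascade under stated assumptions: the exact definition of a "consonant signature" is not given in the abstract, so it is taken here to mean the lowercased name with diacritics, vowels (including y) and non-letters removed; the 0.8 threshold and all function names are hypothetical, and the name variants in the example are invented.

```python
import unicodedata

def consonant_signature(name):
    """Lowercased letters of the name without diacritics and vowels (assumed definition)."""
    decomposed = unicodedata.normalize("NFKD", name)
    letters = [c.lower() for c in decomposed if c.isalpha()]
    return "".join(c for c in letters if c not in "aeiouy")

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def project(name, candidates, threshold=0.8):
    """Try exact match, then consonant signature, then edit-distance similarity."""
    if name in candidates:
        return name, "exact"
    signature = consonant_signature(name)
    for candidate in candidates:
        if consonant_signature(candidate) == signature:
            return candidate, "consonant_signature"
    scored = [(similarity(name.lower(), c.lower()), c) for c in candidates]
    if scored:
        score, best = max(scored)
        if score >= threshold:
            return best, "edit_distance"
    return None, None

print(project("Tony Windsor", ["Toni Windsor", "Rob Oakeshott"]))
print(project("Oakeshott", ["Windsor", "Oakeshot"]))
```

Ordering the methods this way means the cheaper, more precise tests decide first, and the fuzzier similarity test is used only as a fallback.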
We are looking to fill a traineeship position in the field of ‘Multilingual Text Analysis’. If you are interested, please follow the instructions provided at the URL listed below. URL of call: http://recruitment.jrc.ec.europa.eu/ Call number: 2016-IPR-G-000-6713 Title of call: Multilingual Text Analysis Deadline: 21 March 2016 (Brussels time) Starting date: 1 June 2016 Duration: 5 months Eligibility requirements: https://ec.europa.eu/jrc/en/working-with-us/jobs/temporary-positions/jrc-trainees
The European Commission’s Joint Research Centre (JRC) in Ispra, Italy, is looking for a trainee to support the JRC’s Europe Media Monitor (EMM) team with a variety of Language Technology-related tasks. EMM gathers and analyses reports from traditional and social media in dozens of languages by clustering related news items; categorising them; extracting information such as: entities (persons, organisations, locations), events (who did what to whom, where and when), quotations by and about people; identifying sentiment; linking related news clusters over time and across languages. Methods used are mostly hybrid: machine learning tools are used to gather evidence, learn vocabulary and rules, but the results are usually controlled and optimised through human intervention. The public EMM applications can be accessed at the URLs http://emm.newsbrief.eu/overview.html and http://emm.newsexplorer.eu.
The successful trainee will carry out any of the following tasks: a) use third-party software to carry out a terminology use study, which includes comparing occurrences of terms and their variants in English, French and Spanish; b) gather existing definitions of important terms from the internet; c) improve the JRC’s existing entity-oriented sentiment analysis tools, then analyse large quantities of sentiment data and its change over time with the purpose of identifying opinion change patterns and trends; d) contribute to the semi-automatic classification of entities in JRC’s multilingual entity database; e) contribute to improving the recognition of multilingual organisation names; f) annotation of linguistic data and/or evaluation of automatic text analysis results; g) contribute to writing a scientific publication.
Qualifications: Essential: · A degree (or an almost completed degree) in computational linguistics, computer science or related areas; · Programming skills; · Good command of oral and written English (level B2). Advantage: · Knowledge of further foreign languages; · Proven advanced programming skills, especially in Java; · Good knowledge of Language Technology-related tools and methods; · The proven ability to work independently and as part of a team.
Multilingual IE (NER) • Multi-word entities, such as organisation names, are frequently written in many different ways. We have previously automatically identified over one million acronym pairs in 22 languages, consisting of their short form (e.g. EC) and their corresponding long forms (e.g. European Commission, European Union Commission). In order to automatically group such long form variants as belonging to the same entity, we cluster them, using bottom-up hierarchical clustering and pair-wise string similarity metrics (Ehrmann et al., 2013). In this paper, we address the issue of how to evaluate the named entity variant clusters automatically, with minimal human annotation effort. We present experiments that make use of Wikipedia redirection tables and we show that this method produces reasonable results.
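Bottom-up hierarchical clustering over pair-wise string similarity can be sketched in a few lines. This is an illustration only: Python's `difflib.SequenceMatcher` ratio stands in for the similarity metrics used in the paper, single-link merging and the 0.7 threshold are assumptions, and the "World Health" variants are invented test data.

```python
from difflib import SequenceMatcher

def sim(a, b):
    """Case-insensitive similarity in [0, 1] (stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster(names, threshold=0.7):
    """Agglomerative single-link clustering: merge the closest pair until none is close enough."""
    clusters = [[name] for name in names]
    while True:
        best = None  # (score, i, j) of the most similar pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = max(sim(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or score > best[0]:
                    best = (score, i, j)
        if best is None or best[0] < threshold:
            return clusters
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)

variants = [
    "European Commission",
    "European Union Commission",
    "World Health Organization",
    "World Health Organisation",
]
groups = cluster(variants)
print(groups)
```

The threshold decides where merging stops; set too low, unrelated organisations that share generic words would be merged, which is why the resulting clusters need evaluation.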
Methods used for information extraction • Deterministic: regular expressions; finite automata; use of additional knowledge • Statistical: machine learning methods; sequence-based statistical models (Markov chains, Conditional Random Fields)
Example: semi-automatic creation of regular expressions. Yunyao Li, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan. Regular Expression Learning for Information Extraction
Using context The "Red Dogs" promised coach Cornel Dinu a fine birthday present: qualification for the next round. Dinamo wants a miracle. The 3-4 result in the first-leg match against Polonia Warsaw has considerably reduced Dinamo's chances of qualifying for the third preliminary round of the Champions League. Worse still, it has considerably diminished the chance that the "red-and-whites" will play even in the UEFA Cup. Domain: SPORT • Persons: player, coach, referee • Organizations: team, organizer • Events: match, championship • ?
Utilizarea contextului • This hiking route from Ceahlau Massif is located on the north side of the mountain. The starting point is in Durau Resort (800 m), continues to Fantanele Chalet (1220 m), Panaghia Stone, Toaca Peak and Dochia Chalet (1750 m). The marking of the route is a red stripe. • The necessary time to make this route is between 3h and 3h and 30 min and has a length of 7,5 km. The level difference is about 950 m. From Fantanele Chalet you can also go to Duruitoarea Waterfall by following a yellow triangle marked trail.
Using context (concordance lines for "Mr."):
the girl cried. "MR. Morgan, it's the best
come out on"? "Mr. Flannagan is away
Her voice dropped. "MR. Flannagan has been away
He shook his head. "MR. Manuel did that in the war.
A Bay State supporter said, "MR. Hearst's fight
construction of Chain Bridges. "MR. Palmer will attend to any
and said loudly, "MR. Chairman"! Cady Partlow's head
in rolled-up shirt sleeves. "MR. Ferrell , chairman of the board
"That's right". "MR. Wycoff's car is waiting
Lucius Beebe's book, "MR. Pullman's Elegant Palace Car",
an affectionate smile. "MR. Skyros too smart
laid it in the basket. "MR. Jack sets store by that".
I said, "MR. McKenzie, it is as authentic
who had ordered it? "MR. Gross, concerning the formation
Garth hesitated. "MR. Hohlbein and I have noticed
the much quoted remark: "MR. Green indeed writes
of 1162 Sixth Avenue. "MR. Miller was in the shop",
June 21, announced. "MR. Wolfe had been in
a politician. "MR. Jones, you may recall
We can use regular expressions: Mr. [A-Z][a-z]+
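A runnable variant of this pattern needs a few adjustments. In the sketch below (an illustration, with `TITLE_NAME` as a made-up name), the dot is escaped so it matches only a literal full stop, the title is accepted as both "Mr." and "MR." because the concordance contains both, and the surname part allows inner capitals so that "McKenzie" is not cut to "Mc".

```python
import re

# \. matches a literal dot; M[Rr] covers "Mr." and "MR.";
# [A-Z][A-Za-z]+ allows inner capitals such as "McKenzie".
TITLE_NAME = re.compile(r"\bM[Rr]\.\s+([A-Z][A-Za-z]+)")

lines = [
    'the girl cried. "MR. Morgan, it\'s the best',
    'come out on"? "Mr. Flannagan is away',
    'and said loudly, "MR. Chairman"! Cady Partlow\'s head',
    'I said, "MR. McKenzie, it is as authentic',
]

names = [m.group(1) for line in lines for m in TITLE_NAME.finditer(line)]
print(names)
```

"Chairman" is matched as well, although it is a role rather than a surname: context alone produces false positives, so the output still needs filtering, for example against a list of known names.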
Gazetteers • A gazetteer is a geographical dictionary or directory used in conjunction with a map or atlas.
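In information extraction the term is used more broadly for any list of known names that a system looks up in text. A minimal lookup sketch follows; the entries, the `GAZETTEER` mapping and the `tag` function are illustrative assumptions, not a real resource.

```python
import re

# Toy gazetteer: known name -> entity class (illustrative entries only).
GAZETTEER = {
    "Halifax": "LOCATION",
    "Vancouver": "LOCATION",
    "Canada": "LOCATION",
    "New York": "LOCATION",
    "AFP": "ORGANIZATION",
}

# Longest entries first, so a multi-word name wins over any shorter
# entry that happens to be contained in it.
_ENTRIES = sorted(GAZETTEER, key=len, reverse=True)
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(e) for e in _ENTRIES) + r")\b")

def tag(text):
    """Return (name, class) pairs for every gazetteer entry found in the text."""
    return [(m.group(1), GAZETTEER[m.group(1)]) for m in _PATTERN.finditer(text)]

found = tag("The car left Halifax and reached Vancouver, AFP reports.")
print(found)
```

Plain lookup cannot resolve ambiguous names or find names missing from the list, which is why gazetteers are usually combined with contextual rules or statistical models.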
The 2 Minute Guide to GATE
1. Take one large pile of text (documents, emails, tweets, patents, papers, transcripts, blogs, comments, acts of parliament, and so on and so forth) -- call this your corpus.
2. Pick a structured description of interesting things in the text (a telephone directory, or chemical taxonomy, or something from the Linked Data cloud) -- call this your ontology.
3. Use GATE Teamware to mark up a gold standard example set of annotations of the corpus (1.) relative to the ontology (2.).
4. Use GATE Developer to build a semantic annotation pipeline to do the annotation job automatically and measure performance against the gold standard.
5. Take the pipeline from 4. and apply it to your text pile using GATE Cloud (or embed it in your own systems using GATE Embedded).
6. Use GATE Mimir to store the annotations relative to the ontology in a multiparadigm index server. (For techies: this sits in the backroom as a RESTful web service.)
7. Use Ontotext KIM to add semantic search, knowledge facet search, ontology browsing, entity popularity graphing, time series graphing, annotation structure search and (last but not least) boolean full text search. (More techy stuff: mash up these types of search with your existing UIs.)