Daniel Gayo Avello (University of Oviedo)

Naive Algorithms for Key-phrase Extraction and Text Summarization from a Single Document inspired by the Protein Biosynthesis Process

What’s the problem?  • Document reading is a time consuming task… • Many common documents (e.g., e-mail, newsgroup posts, web pages) lack of abstract or keywords… • But, they are “electronic” so we can work on them in some way…   8%

What’s the problem? (cont.)
• Many techniques exist to perform useful Natural Language Processing (NLP) tasks:
  • Language identification.
  • Document categorization and clustering.
  • Keyword extraction.
  • Text summarization.
• They are quite different:
  • With/without human supervision.
  • With/without training.
  • With/without complex linguistic data.
  • With/without document corpora.

Any suggestions?
• It would be great to use a single technique to carry out several of those tasks.
• Desirable goals:
  • Simple (only free text, no linguistic data).
  • Fully automatic (neither supervision nor ad hoc heuristics).
  • Scalable (from one web page to several web sites).
• Could it be a bio-inspired solution?

Our (bio-inspired) hypothesis
• Living beings are defined by their genome.
• Document from a corpus ≈ Individual from a population.
• So…?
• Let’s imagine a “document genome”…
• Similar documents (similar language/topic) ⇒ Similar genomes.
• More interestingly, a translation from the “document genome” to “significance proteins” (i.e., keyphrases and summaries).

Our biological inspiration
• The protein biosynthesis process…
[Figure: the protein biosynthesis process. Labels: DNA; Transcription (copied into a single-stranded mRNA molecule); mRNA (AUGCCGGGUUACUAA); UAC; amino acids; Initiation; Elongation; Termination; Polypeptide chain; Folding process (protein folded into a 3D structure).]
• Could we mimic this to distill keyphrases and summaries from a single document!?

The “ingredients”…

A “DNA” for Natural Language?
• n-grams (slices of n adjoining characters).
• Frequency is not the most relevant weight for each n-gram.
• There are different measures of the relation between the two elements of a bigram:
  • Mutual information.
  • Dice coefficient.
  • Loglike.
  • …
• They cannot be applied directly to n-grams…
• …but they can be generalized (Ferreira and Pereira, 1999).
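To make the bigram measures named on this slide concrete, here is a minimal sketch that scores character bigrams with pointwise mutual information and the Dice coefficient. The function names (`char_ngrams`, `bigram_scores`) and the base-2 logarithm are assumptions made for illustration; the slides do not say how the prototype computes these values.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return every slice of n adjoining characters, in document order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bigram_scores(text):
    """Map each character bigram to (mutual information, Dice coefficient).

    Hypothetical helper: probabilities are plain relative frequencies
    estimated from the single input text.
    """
    unigrams = Counter(text)
    bigrams = Counter(char_ngrams(text, 2))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for bigram, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[bigram[0]] / n_uni
        p_y = unigrams[bigram[1]] / n_uni
        # Pointwise mutual information: log of observed vs. expected co-occurrence.
        mi = math.log2(p_xy / (p_x * p_y))
        # Dice coefficient: 2 * f(xy) / (f(x) + f(y)), always in (0, 1].
        dice = 2 * count / (unigrams[bigram[0]] + unigrams[bigram[1]])
        scores[bigram] = (mi, dice)
    return scores


if __name__ == "__main__":
    sample = "The rain in Spain stays mainly in the plain"
    for bigram, (mi, dice) in sorted(bigram_scores(sample).items(),
                                     key=lambda kv: kv[1][0], reverse=True)[:5]:
        print(repr(bigram), round(mi, 3), round(dice, 3))
```

Both measures need the two halves of the bigram, which is why they do not carry over to longer n-grams without the generalization mentioned above.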

A “DNA” for Natural Language? (cont.)
Original document: The rain in Spain stays mainly in the plain.
n-grams: < in > < mai> < pla> < rai> < Spa> < sta> < the> <ain > <ainl> <ays > <e pl> <e ra>…
Assigning weights to n-grams:
  n-gram    Relative frequency    Fair Specific Mutual Information
  <Spai>    0.025                 2.013
  <inly>    0.025                 1.975
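The weighting step can be sketched as follows. This is only one plausible reading of the generalization cited earlier: the product of the two halves of a bigram is replaced by the average product over every way of splitting the n-gram in two. The formula, the natural logarithm, and the names `ngram_weights` and `char_ngrams` are assumptions, so the second column should not be expected to reproduce the slide's values.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return every slice of n adjoining characters, in document order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_weights(text, n=4):
    """Map each n-gram to (relative frequency, generalized mutual information).

    Assumed formula: log(p(gram) / avg), where avg is the mean of
    p(left) * p(right) over the n - 1 possible split points.
    """
    counts = {k: Counter(char_ngrams(text, k)) for k in range(1, n + 1)}
    totals = {k: sum(c.values()) for k, c in counts.items()}

    def prob(gram):
        return counts[len(gram)][gram] / totals[len(gram)]

    weights = {}
    for gram in counts[n]:
        avg = sum(prob(gram[:i]) * prob(gram[i:]) for i in range(1, n)) / (n - 1)
        weights[gram] = (prob(gram), math.log(prob(gram) / avg))
    return weights


if __name__ == "__main__":
    text = "The rain in Spain stays mainly in the plain"
    weights = ngram_weights(text, 4)
    for gram in ("Spai", "inly"):
        freq, smi = weights[gram]
        print(f"<{gram}> {freq:.3f} {smi:.3f}")
```

With the final period left out, the sample sentence has 43 characters and therefore 40 four-character n-grams, so an n-gram that occurs once gets a relative frequency of 1/40 = 0.025, which matches the first column of the table.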

Document genome translation
• So…
• “Document genome” spliced into “pseudo-tRNA”.
• Document used as “pseudo-mRNA”.
• We “attach” pseudo-tRNA “molecules” (with max. weight) to the document while the average significance per character keeps growing.
• Result: Document spliced into “chunks” with maximum average significance.
[Figure: pseudo-mRNA “The rain in Spain stays mainly in the plain.” with pseudo-tRNA weights <The-> 20, <he-r> 29, <e-ra> 24 (“-” marks a space); accumulated significance: “The ” 20, “The r” 49, “The ra” 73, etc.]
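The attachment rule on this slide can be read as a greedy loop: start a chunk with one n-gram, then keep attaching the next overlapping n-gram (one character further on) for as long as the accumulated weight divided by the chunk length increases. The sketch below follows that reading; the function name `split_into_chunks`, the treatment of unknown n-grams as weight 0, and the handling of the trailing characters are assumptions, not details taken from the slides.

```python
def split_into_chunks(text, weights, n=4):
    """Greedily split text into chunks with growing average significance.

    weights maps n-grams to their significance; n-grams missing from the
    mapping count as 0 (an assumption of this sketch).
    Returns a list of (chunk, accumulated weight) pairs.
    """
    chunks = []
    start = 0
    last = len(text) - n  # last position where a full n-gram still fits
    while start <= last:
        total = weights.get(text[start:start + n], 0.0)
        length = n
        nxt = start + 1
        while nxt <= last:
            w = weights.get(text[nxt:nxt + n], 0.0)
            # Stop as soon as one more character no longer raises the
            # average significance per character.
            if (total + w) / (length + 1) <= total / length:
                break
            total += w
            length += 1
            nxt += 1
        chunks.append((text[start:start + length], total))
        start += length
    if start < len(text):
        # Leftover characters too short to hold an n-gram.
        chunks.append((text[start:], 0.0))
    return chunks


if __name__ == "__main__":
    # Weights taken from the slide's figure; everything else defaults to 0.
    slide_weights = {"The ": 20, "he r": 29, "e ra": 24}
    print(split_into_chunks("The rain", slide_weights))
```

With the three weights from the figure, the averages per character are 20/4 = 5, 49/5 = 9.8 and 73/6 ≈ 12.2, so the first chunk grows to “The ra” with an accumulated significance of 73, the same sequence shown on the slide.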

Folding the “protein” / summarization (work at an early stage)
• To obtain keyphrases, the “protein” (text chunks) must be folded…
• At the moment we are studying different alternatives:
  • Mutual reinforcement?
  • Chunks ≈ Documents ⇒ Apply classical IR techniques?
  • Others?
• Automatic text summarization:
  • Simple but useful approach.
  • Use the shortest paragraphs with the most significant keyphrases.
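The summarization heuristic (shortest paragraphs with the most significant keyphrases) could be sketched as a density score: the total weight of the keyphrases a paragraph contains, divided by the paragraph's length. The slides do not give a scoring formula, so this score, the case-insensitive substring matching, and the names `summarize` and `max_paragraphs` are all assumptions.

```python
def summarize(paragraphs, keyphrases, max_paragraphs=2):
    """Pick the paragraphs with the highest keyphrase weight per character.

    keyphrases maps each keyphrase to its significance. Selected paragraphs
    are returned in their original document order; paragraphs that contain
    no keyphrase are never selected.
    """
    def score(paragraph):
        if not paragraph:
            return 0.0
        lowered = paragraph.lower()
        hit = sum(weight for phrase, weight in keyphrases.items()
                  if phrase.lower() in lowered)
        # Dividing by length favours short paragraphs.
        return hit / len(paragraph)

    ranked = sorted(range(len(paragraphs)),
                    key=lambda i: score(paragraphs[i]), reverse=True)
    chosen = sorted(ranked[:max_paragraphs])
    return [paragraphs[i] for i in chosen if score(paragraphs[i]) > 0]


if __name__ == "__main__":
    paragraphs = [
        "Rain in Spain.",
        "A very long paragraph that mentions rain somewhere among many other unrelated words.",
        "Nothing relevant here.",
    ]
    print(summarize(paragraphs, {"rain": 2.0, "Spain": 3.0}, max_paragraphs=1))
```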

To test the feasibility of these ideas, a prototype was developed.
• blindLight: http://www.purl.org/NET/blindLight
• It receives a user-provided URL and produces:
  • A “blindlighted” version of the original URL.
  • A list of keyphrases.
  • An automatic summary.

Conclusions
• Proof-of-concept tests have been performed:
  • Details in the paper…
  • Results can be improved.
  • Thorough study and analysis are needed.
  • Really promising!
• Summary of the proposal:
  • Free text from just one document.
  • Language independent (currently only western languages).
  • Bio-inspired.
  • Extremely simple to implement.

Thank you! Merci beaucoup! ¡Muchas gracias!
