Extracting Why Text Segment from Web Based on Grammar-gram Iulia Nagy, Master student, 2010-02-27
Summary • Introduction • Related work • Rule Based Methods • Machine Learning Approach • “Bag of Function Words” method • Method outline • Adaptation of “Bag of Function Words” to English • Experiments and Evaluation • Conclusion and Remarks
Problem • Tremendous growth of the Internet • Information is hard to find
Solution • Create a QA system: a system capable of giving an exact answer to an exact question, detecting the answer in arbitrary corpora • Purpose: obtain usable information rapidly
Purpose of our research • Create a why-QA system with an automatically built classifier • Classifier: use a model presented in the Japanese literature, created with machine learning based on the Bag of Grammar approach (the purpose of this paper)
Related work Two main trends • Rule-based methods • Machine learning methods
Rule-based methods in why-QA Suzan Verberne's approach • Improve performance by re-ranking Method: • weight the score assigned to a QA pair by QAP with a number of syntactic features
Machine learning methods Higashinaka and Isozaki's approach • Acquire causal expressions from the Japanese EDR dictionary Method: • train a ranker based on clause structures extracted from EDR
Machine learning methods Tanaka's approach • Build a why-classifier with function words as features Method: • Bag of function words
Bag of function words • Machine learning approach to automatically build a domain-independent why-classifier based on function words • Conditions to obtain domain independence • Function words: the word class fulfilling those conditions
Bag of function words Method – same baseline for Japanese and English • Extract function words from the training segments Ts 1 … Ts n • Create the feature space from the extracted function words (because, at, after, in, under, which, that, why, to, therefore, …) • Create feature vectors Fv 1 … Fv n by mapping each segment onto the feature space using tf-idf • Train a classifier on the vectors with LogitBoost over weak learners
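The mapping step can be sketched in a few lines of scikit-learn; the function-word list below reuses the slide's examples, and the two segments are made up for illustration:

```python
# Sketch of the "bag of function words" mapping: restrict the tf-idf
# vocabulary to function words only, so each text segment becomes a
# tf-idf weighted vector over that fixed feature space.
from sklearn.feature_extraction.text import TfidfVectorizer

FUNCTION_WORDS = ["because", "at", "after", "in", "under",
                  "which", "that", "why", "to", "therefore"]

segments = [
    "The match was cancelled because of heavy rain.",
    "The museum is located in the city centre.",
]

vec = TfidfVectorizer(vocabulary=FUNCTION_WORDS, lowercase=True)
X = vec.fit_transform(segments)
print(X.shape)  # one row per segment, one column per function word
```

In the actual method the feature set is extracted from the corpus rather than fixed in advance; fixing `vocabulary` here just makes the mapping step visible.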
Adaptation to English • Differences • Adjustments • Identify eligible function words in English
Experiment • Data • Processing • Label all words with POS and extract function words • Calculate tf-idf for each function word • Map features from the feature set into feature vectors
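The extraction step can be approximated without any NLP dependencies; the paper POS-tags the corpus and keeps closed-class words, whereas this sketch stands in for the tagger with a small hand-picked lexicon (both the lexicon and the tokenisation are assumptions):

```python
# Dependency-free stand-in for "label all words with POS and extract
# function words": a real run would use a POS tagger and keep
# closed-class tags; here a fixed lexicon plays that role.
FUNCTION_WORDS = {"because", "at", "after", "in", "under", "which",
                  "that", "why", "to", "therefore", "the", "of", "and"}

def extract_function_words(segment: str) -> list[str]:
    # Naive tokenisation: lowercase, strip basic punctuation, split.
    tokens = segment.lower().replace(".", " ").replace(",", " ").split()
    return [t for t in tokens if t in FUNCTION_WORDS]

print(extract_function_words("The bridge collapsed because the cables corroded."))
```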
Experiment • Classifier • Used LogitBoost (Weka) with decision stumps • Created 5 classifiers (50, 100, 150, 200, 250 iterations) • Evaluation • 10-fold cross validation • Models trained on 9 folds and tested on 1 • Measured precision, recall and F-measure
Results – why text segments (WTS) • [Chart: results plotted against the number of boosting iterations]
Results – non-why text segments (NWTS) • [Chart: results plotted against the number of boosting iterations]
Conclusion • The method is effective on English for both types of text segments • Results • 321 of 432 instances correctly classified (74.3%) • 76.1% precision and 70.6% recall on WTS • 72.6% precision and 77.9% recall on NWTS
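The reported precision and recall figures determine the F-measures; a quick check of the arithmetic (the precision/recall values are the slide's, the helper function is just the standard F1 formula):

```python
# F-measure (harmonic mean of precision and recall) for the
# reported per-class results.
def f_measure(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.761, 0.706), 3))  # F1 on why text segments
print(round(f_measure(0.726, 0.779), 3))  # F1 on non-why text segments
```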
Future work • Experiment with a larger dataset (> 5000 instances) • Use the Yahoo! Answers database to extract the dataset • Interest • Include causative constructions in the analysis
Questions and remarks Thank you for your attention!