Genre as Noise - Noise in Genre Andrea Stubbe, Christoph Ringlstetter, Klaus U. Schulz CIS: University of Munich AICML: University of Alberta
Motivation • For search applications we often would like to narrow down the result set to a certain class of documents • For corpus construction, the exclusion of certain document classes could be helpful • Documents with a high rate of errors could be harmful in applications such as computer-aided language learning (CALL) or lexicon construction. Documents of certain classes could be more erroneous than others, so it makes sense to investigate the implications of document genre for noise reduction
Definition of Genre • Partition of documents into distinct classes of text with similar function and form • Independent dimension, ideally orthogonal to topic • Examples of document genres: blogs, guestbooks, science reports • Mixed documents, i.e. documents whose parts belong to different genres, are possible
Two different views on Genre • A document with the wrong genre will often be noise: Macro-Noise • In documents of different genres we find different amounts of noise: Micro-Noise
Outline • Introduction of a new genre hierarchy • Macro-Noise detection • Feature Space • Classifiers • Experiments and applications • Micro-Noise detection • Error dictionaries • Experiments on correlation of genre and noise • Experiments on classification by noise
A hierarchy of Genres Requirements for a genre classification schema: • Task-oriented granularity • Hierarchical • Logically consistent • Complete
A hierarchy of Genres 8 container classes with 32 leaf genres
Corpus Container Classes • Allow comparison with other classification schemas • Allow evaluation of the seriousness of classification errors Training and Evaluation Corpus • For each of the 32 genres, 20 English HTML web documents for training and 20 documents for testing were collected, leading to a corpus of 1,280 files.
Detection of Macro-Noise Macro-Noise detection is a classification problem • Candidate Features • Feature selection mechanism • Build Classifiers • Combine Classifiers for Classification
Feature Space Examples of Features • Form: line length, number of sentences • Vocabulary: specialized word lists, dictionaries, multi-lexemic expressions • Structure: POS • Complex patterns: style Altogether we have over 200 features for the 32 genres (a small extraction sketch follows below)
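To make the form features concrete, the following is a minimal sketch of how measures such as mean line length and sentence count could be computed from a document's plain text; the function name and the exact measures are illustrative assumptions, not the feature extractor used in the paper.

```python
import re

def form_features(text: str) -> dict:
    """Compute simple form-level features from a document's plain text (illustrative sketch)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Naive sentence split on terminal punctuation; a real extractor would
    # use a proper sentence tokenizer.
    sentences = [s for s in re.split(r"[.!?]+\s+", text) if s.strip()]
    words = text.split()
    return {
        "mean_line_length": sum(len(ln) for ln in lines) / max(len(lines), 1),
        "num_sentences": len(sentences),
        "mean_sentence_length": len(words) / max(len(sentences), 1),
    }
```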
Feature Space Key question: selection of features • Global feature sets for the standard machine learning algorithms • Specialized feature sets for our specialized classifiers Small set of significant and natural features for each genre Avoiding accidental similarities between documents
Feature Space Feature selection for specialized genre classifiers:
do
  select a candidate feature (candidates ordered by classification strength)
  add the feature if classification performance improves
  prune features that have become obsolete
until Recall > 90/75% && Precision > 90/75%
Rules: constructed as inequalities with discriminative ranges
Classifiers: conjunction of single rules
(a sketch of this selection loop follows below)
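The selection loop above might be implemented roughly as follows; `evaluate` (returning recall and precision for a feature set) and the ordering of the candidates are assumed to be supplied by the surrounding training setup, so this is only a sketch of the greedy add-and-prune idea, not the authors' implementation.

```python
def select_features(candidates, evaluate, recall_goal=0.9, precision_goal=0.9):
    """Greedy forward selection with pruning, mirroring the loop sketched above.

    candidates: feature names ordered by individual classification strength
    evaluate:   callable mapping a feature list to (recall, precision),
                e.g. via cross-validation on the training corpus (assumed)
    """
    def f1(r, p):
        return 2 * r * p / (r + p) if r + p else 0.0

    selected, best = [], 0.0
    for feature in candidates:                      # ordered by strength
        score = f1(*evaluate(selected + [feature]))
        if score > best:                            # add only if performance improves
            selected, best = selected + [feature], score
            for f in list(selected):                # prune features that became obsolete
                rest = [g for g in selected if g != f]
                rest_score = f1(*evaluate(rest))
                if rest_score >= best:
                    selected, best = rest, rest_score
        if selected:
            recall, precision = evaluate(selected)
            if recall > recall_goal and precision > precision_goal:
                break
    return selected
```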
Classifiers Example: Classifier for reportage as a conjunction of single rules
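The actual reportage classifier is given as a figure in the paper; the snippet below is only a hypothetical illustration of the general form, a conjunction of inequality rules over feature values, with invented feature names and ranges.

```python
# Hypothetical feature ranges -- the real rules and thresholds are in the paper's figure.
REPORTAGE_RULES = {
    # feature name: (lower bound, upper bound) of the discriminative range
    "mean_sentence_length": (15.0, 35.0),
    "num_sentences":        (10.0, 200.0),
    "first_person_ratio":   (0.0, 0.02),
}

def matches_genre(features: dict, rules: dict) -> bool:
    """The classifier fires only if every single rule (inequality) is satisfied."""
    return all(lo <= features.get(name, 0.0) <= hi
               for name, (lo, hi) in rules.items())
```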
Classifiers Classifier Combination • Filtering: one class serves as a disqualification criterion for another class in the case of multiple classification • Ordering by F1 value: classifiers that are more likely to produce a correct classification are applied first • Ordering by dependencies and recall: a graph whose edges represent the number of wrong classifications of one class as another controls the sequence of classifier application. Edges with smaller values are traversed first, leading to fewer wrong classifications
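A minimal sketch of how such a combination could look, assuming each specialized classifier is a (genre, rules, F1) triple and reusing `matches_genre` from the sketch above; ordering along the dependency graph described above would replace the simple F1 sort, and all names here are illustrative.

```python
def classify(features: dict, classifiers: list):
    """Apply specialized classifiers in a fixed order; the first hit wins.

    classifiers: list of (genre, rules, f1) triples. Sorting by descending F1
    approximates "most reliable classifier first"; filtering falls out because
    an earlier hit disqualifies all later candidates.
    """
    ordered = sorted(classifiers, key=lambda c: c[2], reverse=True)
    for genre, rules, _ in ordered:
        if matches_genre(features, rules):
            return genre
    return None  # no specialized classifier fired
```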
Experiments on Macro-Noise Detection of Genre: • On the test corpus we get an overall precision of 72.2% and an overall recall of 54.0% with the specialized classifiers • Superior to machine learning methods, with SVM as the best of those, reaching 51.9% precision and 47.8% recall • This superiority has been shown only for small training corpora • Work on incremental classifier improvement and on the behavior with bigger training sets is forthcoming
Experiments on Macro-Noise Application 1: Retrieving scientific articles on fish • Queries like (cod ∧ habitat) are sent to a search engine to retrieve scientific documents • Evaluation over the 30 top-ranked documents of a query • Precision and recall at cut-points of 5, 10, 15, and 20 documents could be significantly improved by genre recognition, leaving room for further improvement
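For reference, precision and recall at the cut-points mentioned above can be computed as in the following sketch; the relevance judgments and the total number of relevant documents are assumed to come from the manual evaluation of the 30 top-ranked results.

```python
def precision_recall_at_k(ranked_relevance, total_relevant, cut_points=(5, 10, 15, 20)):
    """P@k and R@k over a ranked result list.

    ranked_relevance: booleans, True if the document at that rank is a wanted
                      (here: scientific) document
    total_relevant:   number of relevant documents among all 30 retrieved
    """
    scores = {}
    for k in cut_points:
        hits = sum(ranked_relevance[:k])
        scores[k] = {"precision": hits / k,
                     "recall": hits / total_relevant if total_relevant else 0.0}
    return scores
```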
Experiments on Macro-Noise Application 2: Language models for speech recognition • Language models of speech corpora are notoriously sparse • The standard solution, augmentation with text documents, can be improved by choosing genres similar to spoken text, such as forum, interview, and blog • The noise in a crawled corpus of ~30,000 documents could be reduced to a residue of 2.5%
Detection of Micro-Noise Examples of Micro-Noise: typing errors, cognitive errors • Method: detection of errors with specialized error dictionaries
Error Dictionaries Construction principle: Micro-Noise arises from identifiable channel characteristics. These characteristics can be discovered analytically or by observation in a training corpus. • Transition rules: Ri := lαr ► lβr with l, α, β, r as character sequences • These rules are applied to a vocabulary base that should represent the documents to be processed. Productivity depends on the context l, r. • We get a raw error dictionary D_err-raw with entries [error token | original token | character transition(s)]
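A small sketch of the construction step: transition rules lαr ► lβr are applied at every matching position of every word in the vocabulary base, producing the raw error dictionary. The concrete rules below are invented examples of typing/cognitive confusions, not the rule set used in the paper.

```python
# Each rule is (l, alpha, beta, r); l and r are the (possibly empty) contexts.
RULES = [
    ("", "ie", "ei", ""),   # e.g. believe -> beleive (cognitive transposition)
    ("", "mm", "m", ""),    # doubled consonant dropped, e.g. command -> comand
    ("t", "h", "", ""),     # character omitted after 't', e.g. the -> te
]

def generate_raw_error_dictionary(vocabulary):
    """Apply every rule at every matching position of every base word."""
    d_err_raw = {}
    for word in vocabulary:
        for l, alpha, beta, r in RULES:
            pattern, replacement = l + alpha + r, l + beta + r
            start = word.find(pattern)
            while start != -1:
                erroneous = word[:start] + replacement + word[start + len(pattern):]
                if erroneous != word:
                    # entry: error token -> (original token, character transition)
                    d_err_raw.setdefault(erroneous, (word, f"{pattern}->{replacement}"))
                start = word.find(pattern, start + 1)
    return d_err_raw
```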
Error Dictionaries Filtering Step: • The raw error dictionary D_err-raw is filtered against a collection of relevant positive dictionaries, leading to two error dictionaries: • D_err: non-word errors • D_err-ff: word errors, false friends (a filtering sketch follows below)
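Continuing the sketch above, the filtering step could be expressed as a simple split of D_err-raw against the positive dictionaries, under the assumption that those dictionaries are available as word sets:

```python
def filter_error_dictionary(d_err_raw, positive_dictionaries):
    """Split the raw error dictionary into non-word errors and false friends."""
    real_words = set().union(*positive_dictionaries)   # union of all positive dictionaries
    d_err    = {e: v for e, v in d_err_raw.items() if e not in real_words}  # non-word errors
    d_err_ff = {e: v for e, v in d_err_raw.items() if e in real_words}      # word errors (false friends)
    return d_err, d_err_ff
```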
Error Dictionaries Usage of error dictionaries: • With a base of 100,000 English words we obtained a filtered error dictionary for typing errors with 9,427,051 entries • For cognitive errors we obtained a lexicon with 1,202,997 entries • Recall 60%, Precision 85% on a reference corpus • Error detection: scan the text with the error dictionary and compute the mean error rate per 1,000 tokens (see the sketch below)
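The error-detection step could then look like the following sketch: tokenize the document, look each token up in the error dictionary, and normalize the count to errors per 1,000 tokens. The tokenization and the decision to scan only against D_err are simplifying assumptions, not the authors' exact procedure.

```python
import re

def mean_error_rate(text, d_err):
    """Errors per 1,000 tokens, counting tokens found in the error dictionary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    errors = sum(1 for t in tokens if t in d_err)
    return 1000.0 * errors / len(tokens)
```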
Experiments on Micro-Noise Correlation of error rate and genre: • For each genre in the genre corpus we computed the errors per 1,000 tokens with the help of the two error dictionaries • We got a strong correlation between genre and mean error rate • Extreme values are legal texts with 0.23 errors per 1,000 tokens and guestbooks with 6.23 errors per 1,000 tokens
Experiments on Micro-Noise Stability of the values: the training and test corpora show a similar distribution of error rates across genres
Experiments on Micro-Noise Preliminary experiments on using Micro-Noise for classification: • Extension of the specialized genre classifiers by a filter based on the mean error rate: improved precision for 5 genres, but 1 classifier lost performance and recall was lower for 3 genres • SVM classifier with the mean error rate as an additional feature: also equivocal results, with improvements for some of the genres (a sketch follows below) • Problem: high variance of the error rate; error-free documents also occur in genres with a high mean error rate
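As a sketch of the second variant (not the authors' exact setup), the mean error rate could simply be appended to the existing feature vectors before training a standard SVM, e.g. with scikit-learn; `X`, `texts`, and `mean_error_rate` from the sketch above are assumed to be available.

```python
import numpy as np
from sklearn.svm import SVC

def add_error_rate_feature(X, texts, d_err):
    """Append the mean error rate per 1,000 tokens as one extra feature column."""
    rates = np.array([[mean_error_rate(t, d_err)] for t in texts])
    return np.hstack([X, rates])

# Illustrative usage:
# clf = SVC(kernel="linear").fit(add_error_rate_feature(X_train, train_texts, d_err), y_train)
```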
Conclusion • For certain applications the dimension genre partitions document repositories into noise and wanted documents • We introduced a new genre hierarchy that allows informed corpus construction • Our easy-to-implement specialized classifiers are able to reach competitive results for genre recognition even with small training corpora • Error dictionaries can be used to estimate the mean error rates of documents • We found a strong correlation between genre and the error rate • Classification by noise leads to equivocal results
Future Work • We will try to convince other researchers to build up a corpus with at least 1,000 documents per genre • We are working on an incremental learning algorithm for improving our classifiers based on user click behavior • The correlation of genre and error rates will be further investigated on a bigger genre corpus with an exhaustive statistical analysis • Regarding the effects of errors on IR applications, the repair potential of error dictionaries will be investigated