Genre as Noise - Noise in Genre

Andrea Stubbe, Christoph Ringlstetter, Klaus U. Schulz. CIS: University of Munich; AICML: University of Alberta.

  1. Genre as Noise - Noise in Genre Andrea Stubbe, Christoph Ringlstetter, Klaus U. Schulz CIS: University of Munich AICML: University of Alberta

  2. Motivation • For search applications we often would like to narrow down the result set to a certain class of documents • For corpus construction, excluding certain document classes could be helpful • Documents with a high error rate can harm applications such as computer-aided language learning (CALL) or lexicon construction. Documents of certain classes may be more erroneous than others. It therefore makes sense to investigate the implications of document genre for noise reduction

  3. Definition of Genre • Partition of documents into distinct classes of text with similar function and form • Independent dimension, ideally orthogonal to topic • Examples of document genres: blogs, guestbooks, science reports • Mixed documents are possible, i.e. documents whose parts belong to different genres

  4-8. Two different views on Genre • A document with the wrong genre will often be noise: Macro-Noise • In documents of different genres we find different amounts of noise: Micro-Noise

  9. Outline • Introduction of a new genre hierarchy • Macro-Noise detection: feature space; classifiers; experiments and applications • Micro-Noise detection: error dictionaries; experiments on correlation of genre and noise; experiments on classification by noise

  10. A hierarchy of Genres Requirements for a genre classification schema: • Task-oriented granularity • Hierarchical • Logically consistent • Complete

  11. A hierarchy of Genres 8 container classes with 32 leaf genres

  12. Corpus Container Classes • Allow comparison with other classification schemas • Allow evaluation of the seriousness of classification errors Training and Evaluation Corpus • For each of the 32 genres, 20 English HTML web documents for training and 20 documents for testing were collected, yielding a corpus of 1,280 files.

  13. Detection of Macro-Noise Macro-Noise detection is a classification problem • Candidate Features • Feature selection mechanism • Build Classifiers • Combine Classifiers for Classification

  14. Feature Space Examples of features • Form: line length, number of sentences • Vocabulary: specialized word lists, dictionaries, multi-lexemic expressions • Structure: POS • Complex patterns: style Altogether we obtained over 200 features for the 32 genres

  15. Feature Space Core question: selection of features • Global feature sets for the standard machine learning algorithms • Specialized feature sets for our specialized classifiers: a small set of significant and natural features for each genre, avoiding accidental similarities between documents

  16. Feature Space Feature selection for specialized genre classifiers: do { select candidate feature; add feature if classification performance improves; order features by classification strength; prune features that have become obsolete } until Recall > 90/75% && Precision > 90/75% • Rules: constructed as inequalities with discriminative ranges • Classifiers: conjunction of single rules
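The selection loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `evaluate` callback, the use of recall + precision as the improvement criterion, and the single `target` threshold are all assumptions, and the "ordering by classification strength" step is left out.

```python
from typing import Callable, List, Sequence, Tuple

Scores = Tuple[float, float]  # (recall, precision)


def select_features(
    candidates: Sequence[str],
    evaluate: Callable[[List[str]], Scores],
    target: float = 0.90,
) -> List[str]:
    """Greedy forward selection with a pruning pass (illustrative only).

    `evaluate` is a hypothetical callback that builds a classifier from the
    given features and returns (recall, precision) on held-out data.
    """
    selected: List[str] = []
    best = sum(evaluate(selected))
    for feature in candidates:
        trial = selected + [feature]
        score = sum(evaluate(trial))
        if score <= best:
            continue  # the candidate does not improve classification
        selected, best = trial, score
        # Prune earlier features that the new feature has made obsolete.
        for old in list(selected[:-1]):
            reduced = [f for f in selected if f != old]
            reduced_score = sum(evaluate(reduced))
            if reduced_score >= best:
                selected, best = reduced, reduced_score
        recall, precision = evaluate(selected)
        if recall > target and precision > target:
            break  # stopping criterion reached
    return selected


# Toy evaluation: two informative features, two uninformative ones.
_gain = {"line_length": 0.5, "pos_ratio": 0.45}


def toy_evaluate(features: List[str]) -> Scores:
    score = min(1.0, sum(_gain.get(f, 0.0) for f in features))
    return score, score


chosen = select_features(
    ["line_length", "n_links", "pos_ratio", "n_images"], toy_evaluate
)
```

In the toy run the uninformative candidates are never added, and the loop stops as soon as both scores exceed the target.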

  17. Classifiers Example: Classifier for reportage as a conjunction of single rules
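The rule table from this slide is not part of the transcript. As a rough sketch of what a conjunction of range rules could look like, assuming hypothetical feature names and thresholds that are not taken from the presentation:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RangeRule:
    """One rule: the feature value must lie inside [low, high]."""

    feature: str
    low: float
    high: float

    def holds(self, values: Dict[str, float]) -> bool:
        return self.feature in values and self.low <= values[self.feature] <= self.high


def accepts(rules: List[RangeRule], values: Dict[str, float]) -> bool:
    """A classifier accepts a document only if every single rule holds."""
    return all(rule.holds(values) for rule in rules)


# Hypothetical rules for illustration; the real reportage rules are unknown.
reportage_rules = [
    RangeRule("avg_sentence_length", 15.0, 35.0),
    RangeRule("first_person_ratio", 0.0, 0.02),
]

is_reportage = accepts(
    reportage_rules, {"avg_sentence_length": 22.0, "first_person_ratio": 0.01}
)
```

Because the classifier is a plain conjunction, each rule can be inspected and adjusted on its own, which fits the stated goal of small, natural feature sets per genre.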

  18. Classifiers Classifier Combination • Filtering: a class acts as a disqualification criterion for another class in the case of multiple classification • Ordering by F1 value: classifiers that are more likely to yield a correct classification are applied first • Ordering by dependencies and recall: a graph whose edges represent the number of wrong classifications of one class as another controls the sequence of classifier application. Edges with smaller values are traversed first, leading to fewer wrong classifications
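One possible reading of the first two combination strategies (filtering plus ordering by F1) is sketched below. The function name, the `disqualifies` mapping, and all genre rules and F1 values are assumptions made for illustration; the graph-based ordering is not covered.

```python
from typing import Callable, Dict, Optional, Set

Doc = Dict[str, float]
Classifier = Callable[[Doc], bool]


def classify_genre(
    doc: Doc,
    classifiers: Dict[str, Classifier],
    f1: Dict[str, float],
    disqualifies: Dict[str, Set[str]],
) -> Optional[str]:
    """Return the accepted genre with the highest F1 that is not filtered out."""
    accepted = {genre for genre, accept in classifiers.items() if accept(doc)}
    # Filtering: an accepted class may rule out other accepted classes.
    blocked: Set[str] = set()
    for genre in accepted:
        blocked |= disqualifies.get(genre, set())
    # Ordering by F1: more reliable classifiers get the first say.
    for genre in sorted(classifiers, key=lambda g: f1[g], reverse=True):
        if genre in accepted and genre not in blocked:
            return genre
    return None


# Hypothetical classifiers and scores, for illustration only.
classifiers: Dict[str, Classifier] = {
    "blog": lambda d: d["first_person_ratio"] > 0.03,
    "forum": lambda d: d["n_posts"] > 1,
    "faq": lambda d: d["question_ratio"] > 0.2,
}
f1 = {"blog": 0.70, "forum": 0.80, "faq": 0.90}
disqualifies = {"blog": {"forum"}}  # assumed: a blog hit rules out forum

doc = {"first_person_ratio": 0.05, "n_posts": 3, "question_ratio": 0.1}
genre = classify_genre(doc, classifiers, f1, disqualifies)
```

Without the filter, the same document would be assigned to the accepted genre with the higher F1 value.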

  19. Experiments on Macro-Noise Detection of Genre: • On the test corpus we get a precision of 72.2% and an overall recall of 54.0% with the specialized classifiers • Superior to machine learning methods, with SVM as the best method reaching 51.9% precision and 47.8% recall • The superiority can be stated only for small training corpora • Work on incremental classifier improvement and on the behavior with bigger training sets is forthcoming

  20. Experiments on Macro-Noise Application 1: Retrieving scientific articles on fish • Queries like (cod ∧ habitat) are sent to a search engine to retrieve scientific documents • Evaluation over the 30 top-ranked documents of a query • Precision and recall at cut-points of 5, 10, 15, and 20 documents could be significantly improved by genre recognition, leaving room for further improvement
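Precision and recall at a cut-point can be computed as in the following sketch. The relevance flags are made-up example data, not results from the experiment, and `total_relevant` is assumed to be known from the evaluated result list.

```python
from typing import List, Sequence


def precision_at(relevant: Sequence[bool], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(relevant[:k]) / k


def recall_at(relevant: Sequence[bool], k: int, total_relevant: int) -> float:
    """Fraction of all relevant documents found within the top k."""
    return sum(relevant[:k]) / total_relevant


# Made-up ranking: True marks a scientific article at that rank.
ranking: List[bool] = [True, True, False, True, False,
                       False, True, False, False, True]
total = sum(ranking)

for k in (5, 10):
    print(k, precision_at(ranking, k), recall_at(ranking, k, total))
```

A genre filter that removes non-scientific documents before ranking moves relevant documents toward the top, which is what raises both measures at small cut-points.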

  21. Experiments on Macro-Noise Application 2: Language models for speech recognition • Language models of speech corpora are notoriously sparse • The standard solution, augmentation with text documents, should be improved by choosing genres similar to spoken text, such as forum, interview, blog • The noise in a crawled corpus of ~30,000 documents could be reduced to a residue of 2.5%

  22. Detection of Micro-Noise Examples for Micro-Noise: Typing errors, cognitive errors • Method: Detection of errors with specialized Error dictionaries

  23. Error Dictionaries Construction principle: Micro-Noise arises from explicable channel characteristics. These characteristics can be discovered analytically or by observation in a training corpus. • Transition rules: R_i := lαr → lβr, with l, α, β, r as character sequences • These rules are applied to a vocabulary base that should represent the documents to be processed. Productivity depends on the context l, r. • We get a raw error dictionary D_err-raw with entries [error token | original token | character transition(s)]
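Applying transition rules of the form lαr → lβr to a vocabulary might look like the sketch below. The class names, the two example rules, and the tiny vocabulary are assumptions chosen for illustration; the presentation does not list its actual rules.

```python
from typing import Iterator, List, NamedTuple


class Rule(NamedTuple):
    left: str   # context l
    alpha: str  # correct character sequence
    beta: str   # erroneous replacement
    right: str  # context r


class Entry(NamedTuple):
    error_token: str
    original_token: str
    transition: str


def apply_rule(rule: Rule, word: str) -> Iterator[Entry]:
    """Yield one entry for every position where l + alpha + r occurs."""
    source = rule.left + rule.alpha + rule.right
    target = rule.left + rule.beta + rule.right
    start = word.find(source)
    while start != -1:
        error = word[:start] + target + word[start + len(source):]
        yield Entry(error, word, f"{rule.alpha}>{rule.beta}")
        start = word.find(source, start + 1)


def build_raw_dictionary(rules: List[Rule], vocabulary: List[str]) -> List[Entry]:
    return [e for word in vocabulary for rule in rules for e in apply_rule(rule, word)]


# Hypothetical rules: an "ie"/"ei" confusion and "m" typed as "n" before "e".
rules = [Rule("", "ie", "ei", ""), Rule("", "m", "n", "e")]
raw = build_raw_dictionary(rules, ["believe", "come"])
```

Here "believe" yields the non-word "beleive", while "come" yields "cone", which is itself a valid word and so illustrates why a later filtering step is needed.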

  24. Error Dictionaries Filtering step: • The raw error dictionary D_err-raw is filtered against a collection of relevant positive dictionaries, leading to two error dictionaries: • D_err: non-word errors • D_err-ff: word errors, false friends
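The filtering step can be sketched as a simple split on membership in a positive word list. The function name, the tuple layout of the raw entries, and the sample words are assumptions for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

ErrorDict = Dict[str, Set[str]]  # error token -> possible original tokens


def split_error_dictionary(
    raw_entries: Iterable[Tuple[str, str]],
    positive_words: Set[str],
) -> Tuple[ErrorDict, ErrorDict]:
    """Split raw (error, original) pairs into D_err and D_err-ff."""
    non_word: ErrorDict = {}       # error token is not a valid word
    false_friends: ErrorDict = {}  # error token is itself a valid word
    for error, original in raw_entries:
        target = false_friends if error in positive_words else non_word
        target.setdefault(error, set()).add(original)
    return non_word, false_friends


# Hypothetical raw entries and positive dictionary.
raw_pairs = [("beleive", "believe"), ("cone", "come"), ("teh", "the")]
positive = {"believe", "come", "cone", "the"}
d_err, d_err_ff = split_error_dictionary(raw_pairs, positive)
```

Only the non-word dictionary can be applied safely by plain lookup; false friends need context to decide whether a token is really an error.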

  25. Error Dictionaries Usage of error dictionaries: • With a base of 100,000 English words we got a filtered error dictionary for typing errors with 9,427,051 entries • For cognitive errors we got a lexicon with 1,202,997 entries • Recall 60%, precision 85% on a reference corpus • Error detection: scan the text with the error dictionary and compute the mean error rate per 1,000 tokens
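The error-detection scan amounts to a dictionary lookup per token. A minimal sketch, assuming a simple regular-expression tokenizer and a toy error dictionary (neither is specified in the presentation):

```python
import re
from typing import Iterable, Set


def errors_per_1000(text: str, error_dictionary: Set[str]) -> float:
    """Error rate of one document, normalized to 1,000 tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())  # assumed tokenization
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in error_dictionary)
    return 1000.0 * hits / len(tokens)


def mean_error_rate(texts: Iterable[str], error_dictionary: Set[str]) -> float:
    """Mean of the per-document rates, e.g. over all documents of a genre."""
    rates = [errors_per_1000(text, error_dictionary) for text in texts]
    return sum(rates) / len(rates) if rates else 0.0


toy_errors = {"beleive", "teh"}
rate = errors_per_1000("I beleive this is teh answer", toy_errors)
```

With a reported recall of 60%, such a scan underestimates the true error rate, so the resulting figures are best read as comparable estimates across genres rather than absolute counts.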

  26. Experiments on Micro-Noise Correlation of error rate and genre: • For each genre in the genre corpus we computed the errors per 1,000 tokens with the help of the two error dictionaries • We got a strong correlation between genre and mean error rate • Extreme values are legal texts with 0.23 errors per 1,000 tokens and guestbooks with 6.23 errors per 1,000 tokens

  27. Experiments on Micro-Noise Stability of the values for Training and Test corpora: similar plot

  28. Experiments on Micro-Noise Preliminary experiments on using Micro-Noise for classification: • Extension of specialized genre classifiers by a filter based on the mean error rate: precision improved for 5 genres, but 1 classifier lost performance and recall was lower for 3 genres • SVM classifier with the new feature mean error rate: also equivocal results, with improvements for some of the genres • Problem: high variance of the error rate, with error-free documents occurring even in genres with a high mean error rate

  29-35. Conclusion • For certain applications, the genre dimension partitions document repositories into noise and wanted documents • We introduced a new genre hierarchy that allows informed corpus construction • Our easy-to-implement specialized classifiers reach competitive results for genre recognition, even with small training corpora • Error dictionaries can be used to estimate the mean error rates of documents • We found a strong correlation between genre and error rate • Classification by noise leads to equivocal results

  36. Future Work • We will try to convince other researchers to build up a corpus with at least 1,000 documents per genre • We are working on an incremental learning algorithm that improves our classifiers using user click behavior • The correlation of genre and error rates will be further investigated on a bigger genre corpus with an exhaustive statistical analysis • Regarding the effects of errors on IR applications, the repair potential of error dictionaries will be investigated

  37. Thank you for your attention!