
STYLOMETRY IN IR SYSTEMS

Explore the theory of stylistics and its application in information retrieval (IR) systems. Learn about the history of stylometry, stylistic features, recent studies, and our approach. Discover the diverse applications of stylometry, including authorship attribution, forensic author identification, and genre-based information retrieval.



Presentation Transcript


  1. STYLOMETRY IN IR SYSTEMS Leyla BİLGE, Büşra ÇELİKKAYA, Kardelen HATUN

  2. Outline • Stylistics and Stylometry • Applications of stylometry • History of stylometric research • Stylistic features • Recent Studies • Our approach • Conclusion Stylometry in IR Systems

  3. STYLISTICS • The theoretical framework for stylistics combines: • Halliday’s Language Theory • Sander’s Theories of Stylistics • Halliday says: “A text is what is meant, selected from the total set of options that constitute what can be meant” • Sander says: “Style is the result of choices made by an author from a range of possibilities offered by the language system”

  4. STYLISTICS • Stylistic variation depends on • Author preferences and competence • Familiarity • Genre • Communicative context • Expected characteristics of the intended audience • Modeling, representing and utilizing this variation is the business of stylistic analysis.

  5. STYLOMETRY • The application of the study of linguistic style • Style refers to the linguistic choices of authors that persist over their works, independently of content • The aim is to describe a text from a rather formal perspective, using measures such as: • Number of words • Number of repetitions • Sentence length

  6. APPLICATIONS OF STYLOMETRY • Authorship attribution • Forensic author identification • To find the author of an anonymous text • Observation of the “characteristics” of a particular author • Organization and retrieval of documents based on their writing style • Systems for genre-based information retrieval

  7. HISTORY OF STYLOMETRY • Stylometry grew out of analyzing text for evidence of authenticity and authorial identity • Modern practice in the discipline holds that distinctive patterns of language use can identify authors • With the development of computers and their growing capacities • Large data sets can be analyzed • New methods can be generated and easily applied

  8. HISTORY OF STYLOMETRY, CONT’D • Current research uses techniques based on term frequency counts • Frequency data are collected for common terms • These data are then analyzed using a range of fairly standard statistical techniques • However, they cannot guarantee quality output yet, e.g. Ulysses

  9. Methodology • Use a subset of structural and stylometric features on a set of authors without consideration of author characteristics • Currently, authorship attribution studies are dominated by the use of lexical measures • Generally used statistics: • Word length • Syllables per word • Sentence length • Sentence count • Text length in words • Use of punctuation marks
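
The statistics listed on this slide can be computed with very little code. The following is a minimal sketch, not the presenters' implementation: the function name `basic_stats`, the regular expressions, and the sample sentence are all assumptions made for illustration. Syllables per word is left out because it needs a language-specific syllable counter.

```python
import re

def basic_stats(text):
    """Return simple lexical statistics for a text (illustrative sketch)."""
    # Crude sentence split on ., ! and ?; abbreviations will be miscounted.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Runs of Unicode letters count as words, so Turkish text works too.
    # Apostrophes split words ("don't" becomes "don" and "t").
    words = re.findall(r"[^\W\d_]+", text)
    punctuation = re.findall(r"[.,;:!?]", text)
    n_words = len(words)
    return {
        "text_length_in_words": n_words,
        "sentence_count": len(sentences),
        "mean_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "mean_sentence_length": n_words / len(sentences) if sentences else 0.0,
        "punctuation_per_word": len(punctuation) / n_words if n_words else 0.0,
    }

stats = basic_stats("Style is choice. Choice reveals the author, always!")
print(stats)
```

Normalizing punctuation by word count is one possible choice; it keeps the measure comparable between texts of different lengths.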

  10. Stylistic Features • Lexically-Based Methods • Vocabulary richness of the author • Frequencies of occurrence of individual words • Vocabulary diversity: • Type-token ratio V/N • V: size of vocabulary of sample text • N: number of tokens • Hapax legomena • How many words occur exactly once • Frequencies of occurrence: • Function words
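
As a small worked example of the two vocabulary-diversity measures above, the sketch below computes the type-token ratio V/N and the number of hapax legomena. The function name `vocabulary_richness`, the lowercasing step, and the sample sentence are assumptions for illustration only.

```python
import re
from collections import Counter

def vocabulary_richness(text):
    """Compute N (tokens), V (types), V/N and the hapax legomena count."""
    # Lowercase so "The" and "the" count as one type (an assumption).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    n = len(tokens)   # N: number of tokens
    v = len(counts)   # V: size of the vocabulary
    hapax = sum(1 for c in counts.values() if c == 1)
    return {"N": n, "V": v, "ttr": v / n if n else 0.0, "hapax_legomena": hapax}

richness = vocabulary_richness("the cat saw the dog and the dog saw a bird")
print(richness)
```

Here N = 11 and V = 7, so V/N = 7/11, and four words ("cat", "and", "a", "bird") occur exactly once.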

  11. Stylistic Features • Problems: • Text length dependent • Unstable for short texts • Function word set requires manual effort • Specific to the group of authors considered • Solution: • Use set of most frequent words • Both content-words and function words
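
The proposed solution, taking the most frequent words of the corpus as the feature set regardless of whether they are content or function words, might look roughly like the sketch below. The helper names (`tokenize`, `most_frequent_words`, `relative_frequencies`), the cutoff `k`, and the two-text corpus are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return runs of letters as tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())

def most_frequent_words(texts, k):
    """Return the k most frequent words over all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

def relative_frequencies(text, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total if total else 0.0 for w in vocabulary]

corpus = ["the sea and the sky", "the sky and a storm"]
vocab = most_frequent_words(corpus, 3)
vector = relative_frequencies(corpus[0], vocab)
print(vocab, vector)
```

Because the word list is derived from the corpus itself, no manually curated function word set is needed. Dividing by text length reduces, but does not remove, the dependence on text length.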

  12. Related Studies • Analysis of the text by a natural language processing tool: • Use existing NLP tool • Sentence and Chunk Boundaries Detector (SCBD) • Use sub-word units like character N-grams instead of word frequencies: • Character sequences of length n • Most frequent n-grams provide information about author’s stylistic choices on lexical, syntactical and structural levels
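
Character n-grams are simple to extract. The sketch below lists every character sequence of length n in a text and ranks them by frequency. The function names and the sample string are assumptions, and whitespace is deliberately kept so that n-grams can span word boundaries.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return every overlapping character sequence of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def top_ngrams(text, n, k):
    """Return the k most frequent character n-grams with their counts."""
    counts = Counter(char_ngrams(text, n))
    # Sort by descending count, then by the n-gram itself for stable ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

profile = top_ngrams("the thin thing", 2, 2)
print(profile)
```

For this sample the bigram "th" occurs three times and " t" (space followed by t) twice, which shows how n-grams pick up word-initial patterns without any tokenization.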

  13. Word based features • Bag-of-words • Apply stemming and stopword list • Function words • Content-free • POS Annotation • Feature Selection • Semantic Disambiguation
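
The contrast between a bag-of-words view with a stopword list and a content-free function-word view can be sketched as follows. The tiny `FUNCTION_WORDS` set is a made-up example, not a list from the presentation, and stemming is omitted because it would need an external stemmer.

```python
import re
from collections import Counter

# Tiny illustrative list; a real study would use a larger, curated set.
FUNCTION_WORDS = {"the", "a", "of", "and", "in", "to", "is"}

def tokenize(text):
    """Lowercase the text and return runs of letters as tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())

def bag_of_words(text):
    """Content view: counts of words that are not on the stopword list."""
    return Counter(t for t in tokenize(text) if t not in FUNCTION_WORDS)

def function_word_profile(text):
    """Style view: counts of function words only, ignoring content."""
    return Counter(t for t in tokenize(text) if t in FUNCTION_WORDS)

sentence = "The style of the author is in the choice of words"
content = bag_of_words(sentence)
style = function_word_profile(sentence)
print(content)
print(style)
```

The same list serves as a stopword list in one view and as the feature set in the other, which is why function words are described as content-free.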

  14. Linguistic constituents • The structure of natural language sentences shows that word occurrences follow a specific order • Words are grouped into syntactic units called “constituents” • Use word relationships by extracting constituents for feature construction • Subdivide document into sentences • Construct a syntax tree for each sentence

  15. Syntax tree • Use a syntax tree representation of different authors’ sentences as features

  16. Our Approach • Use stylometry to analyze the following • Texts translated by the same translator but written by different authors • Texts translated by different translators but written by the same authors

  17. Proposed Steps • Feature Extraction • Determine which features represent the style best • Training • Training the classifier with a training set • Many methods exist (SVM, Bayesian…) • Recognition and Classification of texts • Analyzing the results of classification

  18. 1. Feature Extraction • The stylometric features of a text can be: • Word length • Sentence length • Paragraph length • Character n-grams • Function words • Feature choices strongly affect classification results. • Then obtain an n-dimensional feature vector • V = {v1, v2, v3, …, vn}
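
To make the feature vector V concrete, the sketch below maps a text to a list of numbers built from some of the features named on this slide. Which features to include and in what order is an assumption; the three-word `FUNCTION_WORDS` list is hypothetical, and character n-grams are left out to keep the example short.

```python
import re

# Hypothetical function word list, for illustration only.
FUNCTION_WORDS = ["the", "of", "and"]

def feature_vector(text):
    """Map a text to a fixed-length numeric vector V = [v1, ..., vn]."""
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text.lower())
    n_words = len(words) or 1  # avoid division by zero on empty input
    vector = [
        sum(len(w) for w in words) / n_words,   # v1: mean word length
        len(words) / max(len(sentences), 1),    # v2: mean sentence length in words
        len(words) / max(len(paragraphs), 1),   # v3: mean paragraph length in words
    ]
    # v4..vn: relative frequency of each function word.
    vector += [words.count(fw) / n_words for fw in FUNCTION_WORDS]
    return vector

v = feature_vector("The cat sat.\n\nThe dog and the cat ran. The end.")
print(v)
```

Every text yields a vector of the same length, here n = 6, so vectors from different texts can be compared directly.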

  19. 2. Training • Choose training data for every class • May be randomly selected texts • May be manually picked • Determine the parameters corresponding to each class

  20. 3. Recognition and Classification • Use the parameters we obtained from training data • Compute the distance • Label the data • Classify the data
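
The training and classification steps can be illustrated with a nearest-centroid classifier. This is an assumption chosen for simplicity: the slides name SVM and Bayesian methods as options and do not say which classifier or distance is used. Here the "parameters" of a class are the mean of its training vectors, and a new text gets the label of the closest centroid by Euclidean distance. The labels and numbers are invented.

```python
import math

def train(labelled_vectors):
    """Compute one centroid (mean vector) per class label."""
    sums, counts = {}, {}
    for label, vector in labelled_vectors:
        if label not in sums:
            sums[label] = [0.0] * len(vector)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vector)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def classify(vector, centroids):
    """Return the label whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))

# Invented feature vectors: [mean word length, mean sentence length].
training = [
    ("author_a", [4.0, 12.0]),
    ("author_a", [4.2, 14.0]),
    ("author_b", [5.5, 25.0]),
    ("author_b", [5.3, 27.0]),
]
centroids = train(training)
label = classify([5.4, 24.0], centroids)
print(label)
```

In practice the features would need scaling first, since a feature with a large numeric range (sentence length here) otherwise dominates the distance.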

  21. Results of the Classification • We will have two sets of results • The original texts, classified by author • The translated texts, classified with no prior class information • These results will give us a clue about the two issues we stated at the beginning • Example: “The Picture of Dorian Gray” has been translated into Turkish by many translators • Check whether these translations cluster in one class or in separate classes

  22. Our Aim • With the right classification we will be able to identify • Whether stylometric analysis can find an author across two different languages • Whether translations carry more of their translators’ style or still retain their authors’ style • “…yet, to date, no stylometrist has managed to establish a methodology which is better able to capture the style of a text than that based on lexical items.”

  23. Conclusion • Today there are many useful applications of stylometry. • Authorship attribution, plagiarism detection, genre-based information retrieval • What features are valuable for analysis is still an important question. • We aim to find the stylistic connection between a text and its translation.

  24. References • Computational Stylistics in Forensic Author Identification, Carole E. Chaski • Style vs. Expression in Literary Narratives, Özlem Uzuner, Boris Katz • Computer-Based Authorship Attribution Without Lexical Measures, E. Stamatatos, N. Fakotakis, G. Kokkinakis • Ensemble-Based Author Identification Using Character N-grams, E. Stamatatos • Combining Text and Linguistic Document Representations for Authorship Attribution, A. Kaster, S. Siersdorfer, G. Weikum
