Explore the theory of stylistics and its application in information retrieval (IR) systems. Learn about the history of stylometry, stylistic features, recent studies, and our approach. Discover the diverse applications of stylometry, including authorship attribution, forensic author identification, and genre-based information retrieval.
STYLOMETRY IN IR SYSTEMS Leyla BİLGE Büşra ÇELİKKAYA Kardelen HATUN
Outline • Stylistics and Stylometry • Applications of stylometry • History of stylometric research • Stylistic features • Recent Studies • Our approach • Conclusion Stylometry in IR Systems
STYLISTICS • The theoretical framework for stylistics combines: • Halliday’s Language Theory • Sander’s Theories of Stylistics • Halliday says: “A text is what is meant, selected from the total set of options that constitute what can be meant” • Sander says: “Style is the result of choices made by an author from a range of possibilities offered by the language system”
STYLISTICS • Stylistic variation depends on • Author preferences and competence • Familiarity • Genre • Communicative context • Expected characteristics of the intended audience • Modeling, representing and utilizing this variation is the business of stylistic analysis.
STYLOMETRY • The application of the study of linguistic style • Style refers to the linguistic choices of authors that persist across their works, independently of content • The aim is to describe a text from a formal perspective, using measures such as: • Number of words • Number of repetitions • Sentence length
APPLICATIONS OF STYLOMETRY • Authorship attribution • Forensic author identification • To find the author of an anonymous text • Observation of the “characteristics” of a particular author • Organization and retrieval of documents based on their writing style • Systems for genre-based information retrieval
HISTORY OF STYLOMETRY • Stylometry grew out of analyzing texts for evidence of authenticity and authorial identity • According to the modern practice of the discipline, distinctive patterns in language use can identify authors • With the development of computers and their growing capacities: • Large data sets can be analyzed • New methods can be generated and easily applied
HISTORY OF STYLOMETRY, CONT’D • Current research uses techniques based on term frequency counts • Frequency data are collected for common terms • These data are then analyzed using a range of fairly standard statistical techniques • However, these techniques cannot yet guarantee quality output, e.g. Ulysses
Methodology • Use a subset of structural and stylometric features on a set of authors without consideration of author characteristics • Currently, authorship attribution studies are dominated by the use of lexical measures • Commonly used statistics: • Word length • Syllables per word • Sentence length • Sentence count • Text length in words • Use of punctuation marks
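The statistics listed above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the function name `surface_stats`, the regular expressions, and the naive sentence splitting on `.`, `!` and `?` are all assumptions made for the example. Syllables per word is omitted because it needs a language-specific syllabifier.

```python
import re

def surface_stats(text):
    """Return simple surface statistics for a text (illustrative sketch)."""
    # Assumption: sentences end at ., ! or ?; abbreviations will be mis-split.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters (Unicode-aware, so Turkish letters are kept).
    words = re.findall(r"[^\W\d_]+", text)
    # Punctuation marks are single characters that are neither word characters nor whitespace.
    punctuation = re.findall(r"[^\w\s]", text)
    return {
        "text_length_in_words": len(words),
        "sentence_count": len(sentences),
        "mean_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "mean_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "punctuation_count": len(punctuation),
    }

if __name__ == "__main__":
    print(surface_stats("Hello world. How are you?"))
```

A real system would likely replace the naive splitter with a proper sentence boundary detector.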
Stylistic Features • Lexically-Based Methods • Vocabulary richness of the author • Frequencies of occurrence of individual words • Vocabulary diversity: • Type-token ratio V/N • V: size of the vocabulary of the sample text • N: number of tokens • Hapax legomena • How many words occur only once • Frequencies of occurrence: • Function words
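As a worked illustration of the two vocabulary measures above, the following sketch computes the type-token ratio V/N and the number of hapax legomena. The helper names `tokenize` and `vocabulary_richness` and the lowercasing, letters-only tokenizer are assumptions for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its runs of letters as tokens (an assumed tokenizer)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def vocabulary_richness(text):
    """Compute N (tokens), V (vocabulary size), V/N and the hapax legomena count."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    n = len(tokens)                      # N: number of tokens
    v = len(counts)                      # V: number of distinct word types
    hapax = sum(1 for c in counts.values() if c == 1)  # words occurring exactly once
    return {
        "N": n,
        "V": v,
        "type_token_ratio": v / n if n else 0.0,
        "hapax_legomena": hapax,
    }

if __name__ == "__main__":
    print(vocabulary_richness("the cat sat on the mat"))
```

For "the cat sat on the mat", N is 6 and V is 5, and the four words other than "the" are hapax legomena.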
Stylistic Features • Problems: • Text length dependent • Unstable for short texts • Function word set requires manual effort • Specific to the group of authors considered • Solution: • Use set of most frequent words • Both content-words and function words
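The proposed solution, using the most frequent words of the corpus as features, might look like the sketch below. It picks the k most frequent words across all texts (content and function words alike) and represents each text by their relative frequencies, which reduces the dependence on text length. The function names and the tokenizer are illustrative assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its runs of letters as tokens (an assumed tokenizer)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def most_frequent_words(texts, k):
    """Return the k most frequent words over the whole corpus, with no manual word list."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(k)]

def relative_frequencies(text, vocabulary):
    """Represent a text by the relative frequency of each vocabulary word."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    n = len(tokens)
    return [counts[word] / n if n else 0.0 for word in vocabulary]

if __name__ == "__main__":
    corpus = ["the cat and the dog", "the bird and the fish"]
    vocab = most_frequent_words(corpus, 2)
    print(vocab, relative_frequencies(corpus[0], vocab))
```

The choice of k is left open here; it would need to be tuned on the corpus at hand.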
Related Studies • Analysis of the text by a natural language processing tool: • Use existing NLP tool • Sentence and Chunk Boundaries Detector (SCBD) • Use sub-word units like character N-grams instead of word frequencies: • Character sequences of length n • Most frequent n-grams provide information about author’s stylistic choices on lexical, syntactical and structural level
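A minimal sketch of character n-gram extraction, assuming overlapping n-grams taken over the raw text (including spaces and punctuation). The function names are hypothetical, and the cited studies may preprocess text differently.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return all overlapping character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def top_char_ngrams(text, n, k):
    """Return the k most frequent character n-grams with their counts."""
    return Counter(char_ngrams(text, n)).most_common(k)

if __name__ == "__main__":
    print(char_ngrams("abcab", 2))
    print(top_char_ngrams("abcab", 2, 1))
```

Because n-grams are taken over characters, no tokenizer or language-specific tool is needed, which is one reason the approach is attractive for languages with rich morphology.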
Word based features • Bag-of-words • Apply stemming and stopword list • Function words • Content-free • POS Annotation • Feature Selection • Semantic Disambiguation
Linguistic constituents • The structure of natural language sentences shows that word occurrences follow a specific order • Words are grouped into syntactic units called “constituents” • Use word relationships by extracting constituents for feature construction • Subdivide the document into sentences • Construct a syntax tree for each sentence
Syntax tree • Use syntax tree representations of different authors’ sentences as features
Our Approach • Use stylometry to analyze the following • Texts translated by the same translator but written by different authors • Texts translated by different translators but written by the same author
Proposed Steps • Feature Extraction • Determine which features represent the style best • Training • Train the classifier with a training set • Many methods are available (SVM, Bayesian…) • Recognition and Classification of texts • Analyze the results of classification
1. Feature Extraction • The stylometric features of a text can be: • Word length • Sentence length • Paragraph length • Character n-grams • Function words • The choice of features strongly affects classification results. • Then obtain an n-dimensional feature vector • V = {v1, v2, v3, …, vn}
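To make the feature vector concrete, here is a minimal sketch that maps a text to a fixed-length vector built from mean word length, mean sentence length, mean paragraph length, and function word frequencies. The five-word `FUNCTION_WORDS` tuple is a hypothetical English placeholder, and paragraphs are assumed to be separated by blank lines; neither comes from the slides.

```python
import re

# Hypothetical, illustrative function word list; a real study would use a much larger one.
FUNCTION_WORDS = ("the", "and", "of", "to", "in")

def feature_vector(text):
    """Map a text to a fixed-length stylometric feature vector (illustrative sketch)."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text.lower())
    n = len(words)
    if n == 0:
        return [0.0] * (3 + len(FUNCTION_WORDS))
    vector = [
        sum(len(w) for w in words) / n,   # v1: mean word length in characters
        n / len(sentences),               # v2: mean sentence length in words
        n / len(paragraphs),              # v3: mean paragraph length in words
    ]
    # v4..vn: relative frequency of each function word
    vector += [words.count(fw) / n for fw in FUNCTION_WORDS]
    return vector

if __name__ == "__main__":
    sample = "The cat sat. The dog ran.\n\nA bird sang in the tree."
    print(feature_vector(sample))
```

The features here are on very different scales, so a distance-based classifier would normally standardize them first.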
2. Training • Choose training data for every class • May be randomly selected texts • May be manually picked • Determine the parameters corresponding to each class
3. Recognition and Classification • Use the parameters we obtained from training data • Compute the distance • Label the data • Classify the data
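The slides do not fix a classifier (SVM and Bayesian methods are mentioned as options), so the sketch below substitutes a nearest-centroid classifier purely as the simplest instance of "determine parameters per class, compute the distance, label the data". The per-class parameters are mean feature vectors and the distance is Euclidean; both choices, and the function names, are assumptions.

```python
import math

def train_centroids(labelled_vectors):
    """Compute one mean feature vector (centroid) per class from (label, vector) pairs."""
    sums, counts = {}, {}
    for label, vec in labelled_vectors:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def classify(vec, centroids):
    """Label a feature vector with the class whose centroid is nearest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))

if __name__ == "__main__":
    training = [("A", [1.0, 1.0]), ("A", [3.0, 1.0]), ("B", [10.0, 10.0])]
    centroids = train_centroids(training)
    print(classify([2.5, 1.5], centroids), classify([9.0, 9.0], centroids))
```

`math.dist` requires Python 3.8 or later.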
Results of the Classification • We will have two sets of results • The original texts classified by author • The translated texts classified without prior class information • These results will give us a clue about the two issues we stated at the beginning • Example: “The Picture of Dorian Gray” has been translated into Turkish by many translators • Check whether these translations are clustered in one class or in separate classes
Our Aim • With the right classification we will be able to identify • Whether stylometric analysis can identify an author across two different languages • Whether translations carry more of their translators’ style or still retain their authors’ style • “…yet, to date, no stylometrist has managed to establish a methodology which is better able to capture the style of a text than that based on lexical items.”
Conclusion • Today there are many useful applications of stylometry. • Authorship attribution, plagiarism detection, genre-based information retrieval • What features are valuable for analysis is still an important question. • We aim to find the stylistic connection between a text and its translation.
References • Computational Stylistics in Forensic Author Identification, Carole E. Chaski • Style vs. Expression in Literary Narratives, Özlem Uzuner, Boris Katz • Computer-Based Authorship Attribution Without Lexical Measures, E. Stamatatos, N. Fakotakis, G. Kokkinakis • Ensemble-Based Author Identification Using Character N-grams, E. Stamatatos • Combining Text and Linguistic Document Representations for Authorship Attribution, A. Kaster, S. Siersdorfer, G. Weikum