STYLOMETRY IN IR SYSTEMS Leyla BİLGE Büşra ÇELİKKAYA Kardelen HATUN
Outline • Stylistics and Stylometry • Applications of stylometry • History of stylometric research • Stylistic features • Recent Studies • Our approach • Conclusion Stylometry in IR Systems
STYLISTICS • The theoretical framework for stylistics combines: • Halliday’s language theory • Sander’s theories of stylistics • Halliday says: “A text is what is meant, selected from the total set of options that constitute what can be meant” • Sander says: “Style is the result of choices made by an author from a range of possibilities offered by the language system”
STYLISTICS • Stylistic variation depends on • Author preferences and competence • Familiarity • Genre • Communicative context • Expected characteristics of the intended audience • Modeling, representing and utilizing this variation is the business of stylistic analysis.
STYLOMETRY • The application of the study of linguistic style • Style refers to the linguistic choices of authors that persist over their works, independently of content • The aim is to describe a text from a rather formal perspective, using measures such as: • Number of words • Number of repetitions • Sentence length
APPLICATIONS OF STYLOMETRY • Authorship attribution • Forensic author identification • To find the author of an anonymous text • Observation of the “characteristics” of a particular author • Organization and retrieval of documents based on their writing style • Systems for genre-based information retrieval
HISTORY OF STYLOMETRY • Stylometry grew out of analyzing texts for evidence of authenticity and authorial identity • Modern practice of the discipline holds that distinctive patterns of language use can identify authors • With the development of computers and their growing capacity: • Large data sets can be analyzed • New methods can be devised and easily applied
HISTORY OF STYLOMETRY, CONT’D • Current research uses techniques based on term frequency counts • Frequency data are collected for common terms • These data are then analyzed using a range of fairly standard statistical techniques • However, these techniques cannot yet guarantee quality output, e.g. Ulysses
Methodology • Use a subset of structural and stylometric features on a set of authors without consideration of author characteristics • Currently, authorship attribution studies are dominated by the use of lexical measures • Commonly used statistics: • Word length • Syllables per word • Sentence length • Sentence count • Text length in words • Use of punctuation marks
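As a rough illustration of the statistics listed on this slide, the following Python sketch computes a few of them with the standard library only. The function name `basic_stats`, the naive sentence splitter, and the word pattern are assumptions made for this example, not part of the original deck; syllable counting is left out because it needs language-specific rules.

```python
import re
import string

def basic_stats(text):
    """Compute a few surface statistics of a text (illustrative sketch)."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: a word is a run of letters, optionally with one apostrophe part.
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text)
    # ASCII punctuation only; typographic quotes are not counted here.
    punctuation = [ch for ch in text if ch in string.punctuation]
    return {
        "text_length_words": len(words),
        "sentence_count": len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "punctuation_count": len(punctuation),
    }

stats = basic_stats("Style is choice. Is it measurable? Yes, it is!")
print(stats)
```

A real system would probably replace the regular expressions with a proper tokenizer and sentence splitter for the language being analyzed.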
Stylistic Features • Lexically-Based Methods • Vocabulary richness of the author • Frequencies of occurrence of individual words • Vocabulary diversity: • Type-token ratio V/N • V: size of the vocabulary of the sample text • N: number of tokens • Hapax legomena • Number of words that occur exactly once • Frequencies of occurrence: • Function words
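The two vocabulary-diversity measures on this slide can be sketched in a few lines of Python. The helper names `tokenize` and `vocabulary_richness` and the lowercase, letters-only tokenization are assumptions for illustration; V and N follow the slide's definitions.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: tokens are lowercase runs of letters.
    return re.findall(r"[^\W\d_]+", text.lower())

def vocabulary_richness(text):
    """Return N, V, the type-token ratio V/N and the hapax legomena count."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    n = len(tokens)   # N: number of tokens
    v = len(counts)   # V: vocabulary size (distinct tokens)
    hapax = sum(1 for c in counts.values() if c == 1)  # words occurring exactly once
    return {
        "N": n,
        "V": v,
        "type_token_ratio": v / n if n else 0.0,
        "hapax_legomena": hapax,
    }

richness = vocabulary_richness("the cat sat on the mat and the dog sat too")
print(richness)
```

For this sample the function finds N = 11 tokens, V = 8 distinct words, and 6 hapax legomena.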
Stylistic Features • Problems: • Vocabulary richness measures depend on text length • Unstable for short texts • Function word sets require manual effort to select • Specific to the group of authors considered • Solution: • Use the set of most frequent words • Covers both content words and function words
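One possible reading of the "most frequent words" solution is sketched below: the feature set is taken from the corpus itself instead of a hand-picked function word list. The names `most_frequent_words` and `relative_frequencies`, the tie-breaking rule, and the tiny corpus are assumptions for this example.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: tokens are lowercase runs of letters.
    return re.findall(r"[^\W\d_]+", text.lower())

def most_frequent_words(corpus, k):
    """Return the k most frequent words across all texts in the corpus."""
    counts = Counter()
    for text in corpus:
        counts.update(tokenize(text))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

def relative_frequencies(text, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total if total else 0.0 for w in vocabulary]

corpus = ["the sea and the sky", "the style of the author and the text"]
vocab = most_frequent_words(corpus, 2)
profile = relative_frequencies(corpus[0], vocab)
print(vocab, profile)
```

Using relative rather than absolute frequencies reduces, but does not remove, the dependence on text length.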
Related Studies • Analysis of the text by a natural language processing tool: • Use an existing NLP tool • Sentence and Chunk Boundaries Detector (SCBD) • Use sub-word units such as character n-grams instead of word frequencies: • Character sequences of length n • The most frequent n-grams provide information about an author’s stylistic choices at the lexical, syntactic and structural levels
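Character n-gram extraction, as described on this slide, is simple enough to sketch directly. The function names `char_ngrams` and `top_char_ngrams` are hypothetical; whether to keep spaces, case and punctuation is a design choice that this sketch leaves untouched.

```python
from collections import Counter

def char_ngrams(text, n):
    """All overlapping character sequences of length n, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def top_char_ngrams(text, n, k):
    """The k most frequent character n-grams with their counts."""
    counts = Counter(char_ngrams(text, n))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

grams = char_ngrams("style", 3)
top = top_char_ngrams("banana", 2, 2)
print(grams)
print(top)
```

Unlike word-based features, this needs no tokenizer, stemmer or stopword list, which is part of its appeal across languages.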
Word based features • Bag-of-words • Apply stemming and stopword list • Function words • Content-free • POS Annotation • Feature Selection • Semantic Disambiguation
Linguistic constituents • The structure of natural language sentences shows that word occurrences follow a specific order • Words are grouped into syntactic units called “constituents” • Use word relationships by extracting constituents for feature construction • Subdivide the document into sentences • Construct a syntax tree for each sentence
Syntax tree • Use syntax tree representations of different authors’ sentences as features
Our Approach • Use stylometry to analyze the following • Texts translated by the same translator but written by different authors • Texts translated by different translators but written by the same author
Proposed Steps • Feature Extraction • Determine which features represent the style best • Training • Train the classifier with a training set • Many methods are available (SVM, Bayesian…) • Recognition and Classification of texts • Analyzing the results of classification
1. Feature Extraction • The stylometric features of a text can be: • Word length • Sentence length • Paragraph length • Character n-grams • Function words • Feature choices strongly affect classification results. • Then obtain an n-dimensional feature vector • V = {v1, v2, v3, …, vn}
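To make the feature vector V concrete, here is a minimal Python sketch that turns one text into a list of numbers. The `FUNCTION_WORDS` list, the function name `feature_vector`, and the splitting rules are hypothetical choices for illustration; character n-gram features are omitted to keep the vector short.

```python
import re

# Hypothetical, illustrative function word list; a real study would choose its own.
FUNCTION_WORDS = ["the", "of", "and", "to", "in"]

def feature_vector(text):
    """Map a text to a fixed-length numeric vector [v1, v2, ..., vn]."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    # Assumption: sentences end at runs of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    n_words = len(words) or 1  # avoid division by zero on empty input
    vector = [
        sum(len(w) for w in words) / n_words,   # v1: average word length
        len(words) / (len(sentences) or 1),     # v2: average sentence length in words
        len(words) / (len(paragraphs) or 1),    # v3: average paragraph length in words
    ]
    # v4..vn: relative frequency of each function word
    vector += [words.count(fw) / n_words for fw in FUNCTION_WORDS]
    return vector

v = feature_vector("The art of style. The style of art.")
print(v)
```

Every text maps to a vector of the same length, which is what lets a classifier compare texts directly.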
2. Training • Choose training data for every class • May be randomly selected texts • May be manually picked • Determine the parameters corresponding to each class
3. Recognition and Classification • Use the parameters we obtained from training data • Compute the distance • Label the data • Classify the data
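The training and recognition steps can be illustrated with a nearest-centroid classifier, a deliberately simple stand-in for the SVM or Bayesian methods the deck mentions. Here the "parameters" of each class are assumed to be the mean feature vector of its training texts, and the distance is assumed to be Euclidean; the two-dimensional vectors and author labels are made-up illustration data, not results.

```python
import math

def train_centroids(samples):
    """samples: dict mapping a class label to a list of feature vectors.
    Returns one mean vector (centroid) per class."""
    centroids = {}
    for label, vectors in samples.items():
        dims = len(vectors[0])
        centroids[label] = [
            sum(vec[i] for vec in vectors) / len(vectors) for i in range(dims)
        ]
    return centroids

def classify(vector, centroids):
    """Label a vector with the class whose centroid is nearest (Euclidean)."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))

# Made-up vectors: [average word length, average sentence length]
training = {
    "author_a": [[4.0, 12.0], [4.4, 14.0]],
    "author_b": [[5.5, 25.0], [5.9, 27.0]],
}
centroids = train_centroids(training)
label = classify([4.1, 13.5], centroids)
print(label)
```

Because features such as sentence length and word length live on different scales, a practical version would normalize each dimension before computing distances. `math.dist` requires Python 3.8 or later.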
Results of the Classification • We will have two sets of results • The original texts classified by author • The translated texts classified with no prior class information • These results will give us a clue about the two issues we stated at the beginning • Example: “The Picture of Dorian Gray” has been translated into Turkish by many translators • Check whether these translations are clustered in one class or in separate classes
Our Aim • With the right classification we will be able to identify • Whether stylometric analysis can identify an author across two different languages • Whether translations carry more of their translators’ style or still retain their authors’ style • “…yet, to date, no stylometrist has managed to establish a methodology which is better able to capture the style of a text than that based on lexical items.”
Conclusion • Today there are many useful applications of stylometry. • Authorship attribution, plagiarism detection, genre-based information retrieval • What features are valuable for analysis is still an important question. • We aim to find the stylistic connection between a text and its translation.
References • Computational Stylistics in Forensic Author Identification, Carole E. Chaski • Style vs. Expression in Literary Narratives, Özlem Uzuner, Boris Katz • Computer-Based Authorship Attribution Without Lexical Measures, E. Stamatatos, N. Fakotakis, G. Kokkinakis • Ensemble-Based Author Identification Using Character N-grams, E. Stamatatos • Combining Text and Linguistic Document Representations for Authorship Attribution, A. Kaster, S. Siersdorfer, G. Weikum