More on authorship

Feedback on corpora and features

Review • In authorship classification, we want to determine who the author is, given an unknown text sample and samples from potential authors • Authorship classification can be used for… • identifying the author of an anonymous work • settling disputed authorship • determining whether an article was plagiarized • identifying the writer of a threatening or harassing note • verifying the authenticity of notes (e.g. suicide notes)

Features • Features used should be purely stylistic, not dependent on the topic the author is writing about • What types of features are more content independent? • How should we normalize features?
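One possible answer to the normalization question is to turn raw counts into rates per fixed number of tokens, so that long and short documents become comparable. The sketch below assumes a rate per 1,000 tokens; the function name `normalize_counts` and the choice of 1,000 are illustrative, not something prescribed by the slides.

```python
from collections import Counter


def normalize_counts(counts, n_tokens, per=1000):
    """Convert raw feature counts to rates per `per` tokens.

    `per=1000` is an assumed convention; any fixed base works.
    """
    if n_tokens == 0:
        return {name: 0.0 for name in counts}
    return {name: per * c / n_tokens for name, c in counts.items()}


# Toy document, already tokenized on whitespace.
tokens = "the cat sat on the mat , and the dog sat too .".split()
# Count a few topic-independent items (function words and punctuation).
counts = Counter(t for t in tokens if t in {"the", "and", ","})
rates = normalize_counts(counts, len(tokens))
```

Other normalizations (per sentence, per document, or z-scores across the corpus) fit the same interface.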

Individualized Feedback • Corpus collection • New features

Things to fix: • # of authors (need at least 3) • # of files per author (at least 50) • file naming convention (e.g. authorname_filenumber) • empty documents • length of documents (at least 20 sentences) • sentence boundary detection • tokenization • boilerplate • genre: everyone should choose a genre other than newspaper articles
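Several items on this checklist can be checked automatically. The sketch below is a rough illustration: the regular expression for the authorname_filenumber convention and the punctuation-based sentence count are assumptions, and a real sentence boundary detector should replace the latter.

```python
import re

# Hypothetical pattern for the "authorname_filenumber" convention.
NAME_PATTERN = re.compile(r"^[A-Za-z]+_\d+$")
MIN_SENTENCES = 20  # minimum document length from the checklist


def check_document(stem, text, min_sentences=MIN_SENTENCES):
    """Return a list of problems for one document (empty list means OK).

    `stem` is the file name without its extension.
    """
    problems = []
    if not NAME_PATTERN.match(stem):
        problems.append("bad file name")
    if not text.strip():
        problems.append("empty document")
        return problems
    # Crude sentence count: runs of . ! ? followed by whitespace or end of text.
    n_sentences = len(re.findall(r"[.!?]+(?:\s|$)", text))
    if n_sentences < min_sentences:
        problems.append("too short")
    return problems
```

Running `check_document` over every file and grouping the results by author also makes it easy to confirm the minimum numbers of authors and files per author.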

Dennis

Gus

Mohsen

QiangGao

Becky

Sam

Victor

Weifeng

Xiao Liu

Yifeng

Yuan-Lu

Yuan-Lu’s New Features: • productivity of the author (baseline accuracy) • synonymous phrases and words (using counts rather than ratios gives 22.2% on LA Times data) • will v. would • can v. could • may v. might • I v. we Feedback: • Use of modals may signal the politeness or formality of the text: • Can I borrow your pen? • May I borrow your pen? • Additional ways to model synonyms?
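The word-pair features (will v. would, etc.) could be expressed as the share of the first word within each pair, which normalizes for document length. This is a minimal sketch; the function name `pair_preferences` and the decision to return `None` when neither word occurs are assumptions.

```python
from collections import Counter

# Word pairs taken from the slide.
PAIRS = [("will", "would"), ("can", "could"), ("may", "might"), ("i", "we")]


def pair_preferences(tokens):
    """For each pair (a, b), return count(a) / (count(a) + count(b)).

    Returns None for a pair when neither word occurs in the document.
    """
    counts = Counter(t.lower() for t in tokens)
    prefs = {}
    for a, b in PAIRS:
        total = counts[a] + counts[b]
        prefs[f"{a}_vs_{b}"] = counts[a] / total if total else None
    return prefs


prefs = pair_preferences("I will go and we would stay but I can not".split())
```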

Modeling Synonymy • synonymy = higher lexical cohesion • Use WordNet? • # of synonyms per document/sentence?/??? • Lemmatization? • change/changing/changed/changes • ratio of words to lemmas? • binning: how many words have {1, 2, 3, 4, … x} morphologically related words elsewhere in the text?
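The binning idea and the words-to-lemmas ratio can be sketched as follows. Note the loud assumption: `crude_stem` is a toy suffix stripper standing in for a real lemmatizer (for example one backed by WordNet), and it will make mistakes on many English words.

```python
from collections import Counter, defaultdict

# Toy suffix list; a stand-in for real lemmatization, not a substitute for it.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def related_form_histogram(tokens):
    """Map k -> number of word types with k related forms elsewhere in the text."""
    families = defaultdict(set)
    for tok in tokens:
        families[crude_stem(tok)].add(tok.lower())
    hist = Counter()
    for forms in families.values():
        hist[len(forms) - 1] += len(forms)
    return hist


def type_to_stem_ratio(tokens):
    """Ratio of distinct word forms to distinct stems (1.0 = no related forms)."""
    types = {t.lower() for t in tokens}
    stems = {crude_stem(t) for t in types}
    return len(types) / len(stems) if stems else 0.0


tokens = ["change", "changing", "changed", "changes", "pen"]
hist = related_form_histogram(tokens)
ratio = type_to_stem_ratio(tokens)
```

In this toy input the four forms of "change" each have three related forms, and "pen" has none.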

Dennis New Features • frequency of emoticons • frequency of hashtags • Use of subordinate clauses: NP -> NP (SBAR -> S) Feedback: • Hashtag frequency may be content dependent

Using hashtags as features • Are some hashtags more content independent? • e.g. #moscow v. #yolo, #tbt • Other hashtag related features? • # per sentence • average length of hashtag • other?
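The two hashtag features suggested here, hashtags per sentence and average hashtag length, can be computed without looking at what the hashtags say. The sketch assumes sentences have already been split and uses a simple `#\w+` pattern, which may not match every platform's hashtag rules.

```python
import re

# Assumed hashtag pattern: "#" followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def hashtag_features(sentences):
    """Compute content-independent hashtag statistics over pre-split sentences."""
    tags = [tag for s in sentences for tag in HASHTAG.findall(s)]
    n_sent = len(sentences)
    return {
        "hashtags_per_sentence": len(tags) / n_sent if n_sent else 0.0,
        # Length excludes the leading "#".
        "avg_hashtag_length": sum(map(len, tags)) / len(tags) if tags else 0.0,
    }


feats = hashtag_features(["Throwback photo #tbt #yolo", "Landed in #moscow today"])
```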

Gus New Features • length-based: average word/sentence/morpheme length • formatting: uppercase/lowercase • mentions: activities/named entities/foods (content dependent?) • pragmatics • politeness level: estimate by count of polite words and phrases (examples?) • person descriptions and epithets (e.g. “He’s a smart fellow”) • morphemes (frequency? length?, # per word?)

Suggestions… • Named Entity Tags • Frequency of tags (e.g. PERSON, LOCATION) • Politeness words • thanks, thank you, please…. • modals? • swear words?
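A politeness estimate based on counting polite words and phrases might look like the sketch below. The lexicon contains only the three examples from the slide and would need to be extended; the per-100-tokens rate and the name `politeness_rate` are assumptions.

```python
import re

# Lexicon limited to the examples on the slide; extend as needed.
POLITE = re.compile(r"\b(?:thank you|thanks|please)\b", re.IGNORECASE)
WORD = re.compile(r"\w+")


def politeness_rate(text, per=100):
    """Polite words/phrases per `per` tokens (a multi-word phrase counts once)."""
    n_tokens = len(WORD.findall(text))
    if n_tokens == 0:
        return 0.0
    return per * len(POLITE.findall(text)) / n_tokens


rate = politeness_rate("Please pass the salt. Thank you!")
```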

QiangGao Features • Crowdfunding blogs: same topic but authors have different strategies and knowledge/skills • frequency of content words • HTML tags: type of tags and frequencies of each type • Structure Features: font size, color, embedded images and links to other websites

Feedback • Some features may be too content dependent • Type of tags v. Frequencies of each type? • Embedded image and links (frequency of specific links? or # of links per document?) • # of different fonts used, ratio of bold to non-bold; italic to non-italic, etc
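Tag types, tag frequencies, and link or image counts can all be collected in one pass with the standard library's `html.parser`. This is a sketch under assumptions: `bold_share` counts `<b>` and `<strong>` tags relative to all tags, which only approximates the ratio of bold to non-bold text mentioned above.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count every opening (or self-closing) tag in a document."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1


def html_features(html):
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    tags = parser.tags
    total = sum(tags.values())
    return {
        "tag_counts": dict(tags),      # frequency of each tag type
        "n_tag_types": len(tags),      # number of distinct tag types
        "n_links": tags["a"],          # links per document
        "n_images": tags["img"],       # embedded images per document
        # Assumed proxy for "bold to non-bold": share of bold tags among all tags.
        "bold_share": (tags["b"] + tags["strong"]) / total if total else 0.0,
    }


feats = html_features('<p><b>Hi</b> <a href="x">link</a> <img src="y"></p>')
```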

Sam New Features • Frequency of Proper Nouns per sentence/document • Dialogue based features: • “different authors use a higher or lower dialogue-to-narrative ratio” • “different authors use longer or shorter sequences of dialogue” • Suggestions: • direct v. indirect quotes • split quotations: “Yes,” said X, “I’m going to…” • length of quotes
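The dialogue-to-narrative ratio and the length of quotes could be approximated by treating everything inside quotation marks as dialogue. The sketch assumes balanced straight or curly double quotes and counts `\w+` matches as tokens; it does not distinguish direct from indirect quotes or detect split quotations.

```python
import re

# Matches "..." or “...”; assumes quotes are balanced and not nested.
QUOTE = re.compile(r'"([^"]*)"|“([^”]*)”')
WORD = re.compile(r"\w+")


def dialogue_features(text):
    """Token counts inside and outside quotation marks."""
    quote_lengths = []
    for m in QUOTE.finditer(text):
        quoted = m.group(1) if m.group(1) is not None else m.group(2)
        quote_lengths.append(len(WORD.findall(quoted)))
    dialogue = sum(quote_lengths)
    narrative = len(WORD.findall(text)) - dialogue
    return {
        "dialogue_to_narrative": dialogue / narrative if narrative else None,
        "quote_lengths": quote_lengths,
        "avg_quote_length": dialogue / len(quote_lengths) if quote_lengths else 0.0,
    }


feats = dialogue_features('"Yes," said X, "I am going home." Then he left.')
```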

Mohsen New Features: • Length of sentences and paragraphs • Repetition of adjectives in the same paragraph • Usage of different pronouns (my, I, your, you, his, he, her, she, it, its, we, our, they, their) Feedback: • Additional repetition? Repetition of lemmas? (e.g. skills, skilled, skillful) • Extend set of pronouns? • Are gendered pronouns content dependent? • Feature selection: only considered even numbered lines

Pronouns
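A pronoun profile can be computed as the share of each pronoun among all pronoun tokens, using the set listed in Mohsen's features. This is a sketch; the lowercase alphabetic tokenization is an assumption and splits contractions such as "it's" into "it" and "s".

```python
import re
from collections import Counter

# Pronoun set as listed on the slide.
PRONOUNS = ("my", "i", "your", "you", "his", "he", "her", "she",
            "it", "its", "we", "our", "they", "their")


def pronoun_profile(text):
    """Relative frequency of each pronoun among all pronoun tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t in PRONOUNS)
    total = sum(counts.values())
    return {p: (counts[p] / total if total else 0.0) for p in PRONOUNS}


profile = pronoun_profile("I think she saw my dog, and I waved.")
```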

Priya New features • # of exclamation marks used • # of commas used • # of hyphens used Feedback • Similar to punctuation frequency • Other features for poetry: • average length of syllables • rhyme scheme • use of alliteration (e.g. # of words in the sentence that start with the same letter) • frequency of individual letters • Additional poetic devices: http://leavingcertenglish.net/2011/04/poetic-techniques-terminology/
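The punctuation counts and the letter-based alliteration measure can be sketched as below. Using the first letter is the approximation suggested on the slide; true alliteration depends on sounds rather than spelling, so this is an assumption-laden proxy.

```python
import re
from collections import Counter


def punctuation_counts(text):
    """Raw counts of the three punctuation marks; normalize before comparing texts."""
    return {
        "exclamation": text.count("!"),
        "comma": text.count(","),
        "hyphen": text.count("-"),
    }


def max_alliteration(line):
    """Largest number of words in the line that share an initial letter."""
    initials = Counter(w[0] for w in re.findall(r"[a-z]+", line.lower()))
    return max(initials.values(), default=0)


score = max_alliteration("Peter Piper picked a peck of pickled peppers")
```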

Becky • Passive voice • Length/proportion of asides/appositives/sub-clauses? • syntactic richness (syntactic rule types/syntactic rule tokens) Feedback: • condensing POS tags? • syntactic richness of individual types of phrases (e.g. syntactic richness of different NPs) • noisy data from incorrect parses? (maybe limit rewrite rules to those which are shorter than XP → X X X X)?
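Syntactic richness as defined here is a type/token ratio over rewrite rules. The sketch assumes the rules have already been extracted from parse trees by some parser and are represented as `(lhs, rhs_tuple)` pairs; the `rhs_limit` parameter is a hypothetical way to implement the length cutoff suggested in the feedback.

```python
def syntactic_richness(rules, rhs_limit=None):
    """Rule types divided by rule tokens.

    `rules` is a list of (lhs, rhs) pairs, with rhs a tuple of symbols.
    If `rhs_limit` is given, only rules with fewer than `rhs_limit`
    right-hand-side symbols are kept (a guard against noisy parses).
    """
    if rhs_limit is not None:
        rules = [r for r in rules if len(r[1]) < rhs_limit]
    if not rules:
        return 0.0
    return len(set(rules)) / len(rules)


# Toy rule list; in practice these would come from parsed sentences.
rules = [
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "NN")),
    ("VP", ("VB", "NP")),
    ("NP", ("NN",)),
    ("NP", ("DT", "JJ", "JJ", "NN")),
]
richness_all = syntactic_richness(rules)
richness_short = syntactic_richness(rules, rhs_limit=4)
```

Restricting `rules` to those with a given left-hand side (e.g. only NP rules) gives the per-phrase-type variant mentioned in the feedback.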

Victor Features: • Average message length • Lexical diversity • HTML Feedback: • Avg. message length may be useful for specific genres, especially forums, Twitter, or other short texts, but may not be useful across genres • Lexical diversity already used; what about lemma diversity?

Weifeng New Features: • Average # of tokens per sentence • Declarative sentence ratio: ratio of statements to commands, questions, and exclamations • Font (bold, italic, underlined) usage Feedback: • Implementation of declarative sentence ratio?

Sentence type: • Questions → ? • Declaratives → . • Commands → ! • Exclamation marks can also be used for non-commands • Commands don’t always have an exclamation mark • Meet me in the alley with the money at midnight. Come alone. • Another way to model commands could be to look at verbs starting a sentence, or look for verbs without a subject.
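Combining final punctuation with the verb-initial heuristic gives a rough sentence type classifier. In this sketch the verb list `IMPERATIVE_VERBS` is a small hypothetical sample; a POS tagger would be the more robust way to detect a sentence-initial base-form verb or a missing subject.

```python
import re

# Hypothetical sample of base-form verbs; replace with POS tagging in practice.
IMPERATIVE_VERBS = {"meet", "come", "go", "bring", "take",
                    "give", "stop", "wait", "look", "call"}


def sentence_type(sentence):
    """Classify as question, command, exclamation, or declarative."""
    s = sentence.strip()
    words = re.findall(r"[A-Za-z']+", s)
    first = words[0].lower() if words else ""
    if s.endswith("?"):
        return "question"
    if first in IMPERATIVE_VERBS:
        return "command"  # verb-initial, regardless of final punctuation
    if s.endswith("!"):
        return "exclamation"
    return "declarative"


def declarative_ratio(sentences):
    """Share of sentences classified as declarative."""
    if not sentences:
        return 0.0
    types = [sentence_type(s) for s in sentences]
    return types.count("declarative") / len(types)


sentences = [
    "Meet me in the alley with the money at midnight.",
    "Come alone.",
    "What time is it?",
    "It is late.",
    "Wow!",
]
ratio = declarative_ratio(sentences)
```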

Xiao Liu Features: • greetings, signatures, quotes, links, contact information • emoticons :( v. ): • extract domain specific terms using MetaMap Feedback: • domain specific terms work well if unknown text is also in the domain • ratio of generic to brand name drug? • replace with tags? (e.g. SYMPTOM, MEDICATION)
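The emoticon orientation feature (:( v. ):) can be sketched by counting whitespace-separated tokens that are entirely an emoticon in one orientation or the other. The two patterns cover only a handful of simple emoticons and are assumptions; emoticons attached to words or punctuation are missed by design, to avoid false matches such as "(note):".

```python
import re

# Eyes first, e.g. :) ;-( =)
STANDARD = re.compile(r"[:;=]-?[()]")
# Mouth first, e.g. ): (-: (=
REVERSED = re.compile(r"[()]-?[:;=]")


def emoticon_orientation(text):
    """Count standalone emoticons by orientation."""
    standard = 0
    reversed_count = 0
    for tok in text.split():
        if STANDARD.fullmatch(tok):
            standard += 1
        elif REVERSED.fullmatch(tok):
            reversed_count += 1
    return {"standard": standard, "reversed": reversed_count}


counts = emoticon_orientation("great :) but sad ): and :-( (see note): done")
```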

Yi Feng New Features: • date/time from articles Feedback: • Find a way to distinguish between years and model #s • Add additional temporal expressions (e.g. previously) including multi-word expressions (e.g. the next day, one year after, etc.)
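One heuristic for separating years from model numbers is to accept a four-digit number as a year only when it follows a temporal cue word. The cue list, the year range, and the expression list in this sketch are all assumptions chosen for illustration, and the heuristic will miss years that appear without a cue.

```python
import re

# A four-digit number (1000-2099) counts as a year only after a cue word.
YEAR_AFTER_CUE = re.compile(
    r"\b(?:in|since|by|during|until|from)\s+((?:1\d|20)\d{2})\b",
    re.IGNORECASE,
)
# Temporal expressions from the slide, including multi-word ones.
TEMPORAL_EXPRESSIONS = ("previously", "the next day", "one year after")


def temporal_features(text):
    """Extract cue-introduced years and count listed temporal expressions."""
    lowered = text.lower()
    counts = {
        expr: len(re.findall(r"\b" + re.escape(expr) + r"\b", lowered))
        for expr in TEMPORAL_EXPRESSIONS
    }
    return {"years": YEAR_AFTER_CUE.findall(text), "expressions": counts}


feats = temporal_features(
    "In 2008 he bought a Nokia 1100. The next day, he sold it."
)
```

Here "2008" is accepted because it follows "In", while "1100" is rejected because it follows a product name rather than a cue word.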

Zahra New Features: • Topic related words (e.g. {election, government}, {God, bible, worship}, {download, xbox, play, fan}) • Frequency of symbols (specifically parentheses and curly brackets) • How the author ends the article (e.g. link/references/farewell) Feedback • Topic related words are too content dependent • use of symbols is more content independent • implementation of author’s ending?
