More on authorship Feedback on corpora and features
Review • In authorship classification, we want to determine who the author is, given an unknown text sample and samples from potential authors • Authorship classification can be used for… • identifying the author of an anonymous work • settling disputed authorship • determining whether an article was plagiarized • identifying the writer of a threatening or harassing note • verifying the authenticity of notes (e.g. suicide notes)
Features • Features used should be purely stylistic, not dependent on the topic the author is writing about • What types of features are more content independent? • How should we normalize features?
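One possible answer to the normalization question is to report counts relative to document length, so that long and short documents are comparable. The sketch below is only an illustration: the function name, the regex tokenizer, and the per-1,000-token scale are assumptions, not something prescribed by the slides.

```python
import re
from collections import Counter

def relative_frequencies(text, vocabulary):
    """Return occurrences per 1,000 tokens for each word in `vocabulary`."""
    # Crude tokenizer (assumption): lowercase alphabetic strings and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {word: 0.0 for word in vocabulary}
    return {word: 1000.0 * counts[word] / total for word in vocabulary}

if __name__ == "__main__":
    print(relative_frequencies("The cat and the dog", ["the", "and"]))
```

Function words such as "the" or "and" are used here because they tend to depend less on topic than content words do.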
Individualized Feedback • Corpus collection • New features
Things to fix: • # of authors (need at least 3) • # of files per author (at least 50) • file naming convention (e.g. authorname_filenumber) • empty documents • length of documents (at least 20 sentences) • sentence boundary detection • tokenization • boilerplate • genre: everyone should choose a genre other than newspaper articles
Yuan-Lu’s New Features New features: • productivity of the author (baseline accuracy) • synonymous phrases and words (using counts, not ratios, gives 22.2% on LA Times data) • will v. would • can v. could • may v. might • I v. we Feedback: • Use of modals may signal politeness or formality of the text: • Can I borrow your pen? • May I borrow your pen? • Additional ways to model synonyms?
Modeling Synonymy • synonymy = higher lexical cohesion • Use WordNet? • # of synonyms per document/sentence?/??? • Lemmatization? • change/changing/changed/changes • ratio of words to lemmas? • binning: how many words have {1, 2, 3, 4, … x} morphologically related words elsewhere in the text?
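The word-to-lemma ratio and the binning idea above could be sketched as follows. This is a rough illustration only: `crude_stem` is a hypothetical suffix stripper standing in for a real lemmatizer (such as one backed by WordNet), and it will both over- and under-merge real English words.

```python
import re
from collections import Counter

def crude_stem(word):
    """Hypothetical stand-in for a lemmatizer: strip one common suffix."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def word_types(text):
    """Distinct lowercase alphabetic word forms in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def type_to_stem_ratio(text):
    """Distinct word forms divided by distinct stems (1.0 = no related forms)."""
    types = word_types(text)
    if not types:
        return 0.0
    return len(types) / len({crude_stem(w) for w in types})

def related_form_bins(text):
    """bins[k] = number of word forms with k related forms elsewhere in the text."""
    types = word_types(text)
    family_size = Counter(crude_stem(w) for w in types)
    return Counter(family_size[crude_stem(w)] - 1 for w in types)

if __name__ == "__main__":
    sample = "change changing changed changes the plan"
    print(type_to_stem_ratio(sample), dict(related_form_bins(sample)))
```

The bins are computed over word types rather than tokens, so plain repetition of the same form does not count as morphological relatedness.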
Dennis New Features • frequency of emoticons • frequency of hashtags • Use of subordinate clauses: NP -> NP (SBAR -> S) Feedback: • Hashtag frequency may be content dependent
Using hashtags as features • Are some hashtags more content independent? • e.g. #moscow v. #yolo, #tbt • Other hashtag related features? • # per sentence • average length of hashtag • other?
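The two hashtag features listed above (hashtags per sentence and average hashtag length) ignore which hashtag was used, which should make them less content dependent. A minimal sketch, assuming a simple regex for hashtags and a naive punctuation-based sentence splitter:

```python
import re

HASHTAG = re.compile(r"#\w+")

def hashtag_features(text):
    """Return hashtags per sentence and average hashtag length (without '#')."""
    tags = HASHTAG.findall(text)
    # Naive sentence splitting on ., !, ? (assumption; a real splitter is better).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    per_sentence = len(tags) / len(sentences) if sentences else 0.0
    avg_length = sum(len(t) - 1 for t in tags) / len(tags) if tags else 0.0
    return {"hashtags_per_sentence": per_sentence,
            "avg_hashtag_length": avg_length}

if __name__ == "__main__":
    print(hashtag_features("Great day #yolo. Back then #tbt!"))
```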
Gus New Features • length-based: average word/sentence/morpheme length • formatting: uppercase/lowercase • mentions: activities/named entities/foods (content dependent?) • pragmatics • politeness level: estimate by count of polite words and phrases (examples?) • person descriptions and epithets (e.g. “He’s a smart fellow”) • morphemes (frequency? length?, # per word?)
Suggestions… • Named Entity Tags • Frequency of tags (e.g. PERSON, LOCATION) • Politeness words • thanks, thank you, please…. • modals? • swear words?
QiangGao Features • Crowdfunding blogs: same topic but authors have different strategies and knowledge/skills • frequency of content words • HTML tags: type of tags and frequencies of each type • Structure Features: font size, color, embedded image and links to other website
Feedback • Some features may be too content dependent • Type of tags v. Frequencies of each type? • Embedded image and links (frequency of specific links? or # of links per document?) • # of different fonts used, ratio of bold to non-bold; italic to non-italic, etc
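Counts such as links per document or the amount of bold markup can be collected with Python's standard `html.parser` module. The sketch below is one way to do it; the feature names and the choice to treat both `<b>` and `<strong>` as bold are assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Counts how often each start tag occurs in a document."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

def html_structure_features(html):
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    tags = parser.tags
    return {
        "links": tags["a"],                   # number of links in the document
        "images": tags["img"],                # number of embedded images
        "bold": tags["b"] + tags["strong"],   # bold markup, either tag
        "italic": tags["i"] + tags["em"],     # italic markup, either tag
        "tag_types": len(tags),               # number of distinct tag types
    }

if __name__ == "__main__":
    page = '<p>Back <b>us</b> <a href="x">here</a> <img src="y.png"></p>'
    print(html_structure_features(page))
```

Counting links rather than recording their targets keeps the feature from simply memorizing a topic-specific URL.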
Sam New Features • Frequency of Proper Nouns per sentence/document • Dialogue based features: • “different authors use a higher or lower dialogue-to-narrative ratio” • “different authors use longer or shorter sequences of dialogue” • Suggestions: • direct v. indirect quotes • split quotations: “Yes,” said X, “I’m going to…” • length of quotes
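A rough way to estimate the dialogue-to-narrative ratio and the length of quotes is to treat everything between quotation marks as dialogue. The sketch below assumes balanced straight or curly double quotes and whitespace tokenization; it does not distinguish direct from indirect quotes, and a split quotation is counted as two separate quotes.

```python
import re

# Matches straight-quoted or curly-quoted spans; nested quotes are not handled.
QUOTE = re.compile(r'"([^"]*)"|“([^”]*)”')

def dialogue_features(text):
    quotes = [straight or curly for straight, curly in QUOTE.findall(text)]
    dialogue_words = sum(len(q.split()) for q in quotes)
    narrative_words = len(text.split()) - dialogue_words
    return {
        # Words inside quotes divided by words outside quotes.
        "dialogue_to_narrative": (dialogue_words / narrative_words
                                  if narrative_words > 0 else float("inf")),
        # Average number of words per quoted span.
        "mean_quote_length": dialogue_words / len(quotes) if quotes else 0.0,
    }

if __name__ == "__main__":
    print(dialogue_features('"Yes," said X, "I am going home." Then he left.'))
```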
Mohsen New Features: • Length of sentences or paragraphs • Repetition of adjectives in the same paragraph • Usage of different pronouns (my, I, your, you, his, he, her, she, it, its, we, our, they, their) Feedback: • Additional repetition? Repetition of lemmas? (e.g. skills, skilled, skillful) • Extend set of pronouns? • Gender content dependent? • Feature selection: only considered even numbered lines
Priya New features • # of exclamation marks used • # of commas used • # of hyphens used Feedback • Similar to punctuation frequency • Other features for poetry: • average length of syllables • rhyme scheme • use of alliteration (e.g. # of words in the sentence that start with the same letter) • frequency of individual letters • Additional poetic devices: http://leavingcertenglish.net/2011/04/poetic-techniques-terminology/
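The alliteration feature suggested above can be approximated by counting the words in a line whose first letter is shared with at least one other word in that line. This is a letter-based sketch only; true alliteration depends on sounds rather than spelling, and the function name is an assumption.

```python
import re
from collections import Counter

def alliteration_count(line):
    """Number of words whose initial letter is shared by another word in the line."""
    initials = Counter(word[0] for word in re.findall(r"[a-z]+", line.lower()))
    return sum(count for count in initials.values() if count > 1)

if __name__ == "__main__":
    print(alliteration_count("Peter Piper picked a peck of pickled peppers"))
```

Dividing the result by the number of words in the line would make it comparable across lines of different lengths.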
Becky • Passive voice • Length/proportion of asides/appositives/sub-clauses? • syntactic richness (syntactic rule types/syntactic rule tokens) Feedback: • condensing POS tags? • syntactic richness of individual types of phrases (e.g. syntactic richness of different NPs) • noisy data from incorrect parses? (maybe limit rewrite rules to those which are shorter than XP X X X X)?
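Syntactic richness as defined above (rule types divided by rule tokens) can be computed once parse trees are available. The sketch below assumes trees are given as nested tuples of the form `(label, child, child, ...)` with words as plain strings, which is a made-up representation for illustration; output from an actual parser would need converting first. Lexical rules such as `NN -> dog` are skipped so that vocabulary does not inflate the score.

```python
def productions(tree):
    """Yield (parent, child_labels) for every node that has phrase-level children."""
    label, *children = tree
    subtrees = [child for child in children if isinstance(child, tuple)]
    if subtrees:
        yield (label, tuple(child[0] for child in subtrees))
        for child in subtrees:
            yield from productions(child)

def syntactic_richness(trees):
    """Distinct rewrite rules divided by total rewrite rules used."""
    rules = [rule for tree in trees for rule in productions(tree)]
    return len(set(rules)) / len(rules) if rules else 0.0

if __name__ == "__main__":
    t1 = ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VBD", "barked")))
    t2 = ("S", ("NP", ("DT", "a"), ("NN", "cat")), ("VP", ("VBD", "slept")))
    print(syntactic_richness([t1, t2]))
```

Restricting the computation to rules whose parent is `NP` would give the per-phrase-type variant mentioned in the feedback.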
Victor Features: • Average message length • Lexical diversity • HTML Feedback: • Avg. message length may be useful for specific genres, especially forums, Twitter, or other short texts, but may not be useful across genres • Lexical diversity is already used; what about lemma diversity?
Weifeng New Features: • Average # of tokens per sentence • Declarative sentence ratio: ratio between statements, versus commands, questions, and exclamations • Font (bold, italic, underlined) usage Feedback: Implementation of declarative sentence ratio?
Sentence type: • Questions ? • Declaratives . • Commands ! • Exclamation marks can also be used for non-commands • Commands don’t always have an exclamation mark • Meet me in the alley with the money at midnight. Come alone. • Another way to model commands could be to look at verbs starting a sentence, or look for verbs without a subject.
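The points above suggest that punctuation alone is not enough, and that a sentence-initial verb is a useful cue for commands. A minimal sketch of that heuristic follows; the `COMMAND_VERBS` set is a tiny illustrative list, not a real lexicon, and a POS tagger would be a more robust way to detect a verb without a subject.

```python
import re

# Illustrative only: a handful of base-form verbs that often open commands.
COMMAND_VERBS = {"meet", "come", "go", "bring", "send", "stop", "take"}

def sentence_type(sentence):
    s = sentence.strip()
    if not s:
        return "empty"
    if s.endswith("?"):
        return "question"
    words = re.findall(r"[A-Za-z']+", s)
    if words and words[0].lower() in COMMAND_VERBS:
        return "command"          # verb-initial, regardless of final punctuation
    if s.endswith("!"):
        return "exclamation"
    return "declarative"

def declarative_ratio(text):
    """Share of sentences classified as declarative."""
    sentences = re.findall(r"[^.!?]+[.!?]+", text)
    if not sentences:
        return 0.0
    kinds = [sentence_type(s) for s in sentences]
    return kinds.count("declarative") / len(kinds)

if __name__ == "__main__":
    print(sentence_type("Meet me in the alley with the money at midnight."))
    print(declarative_ratio("It is late. Come alone. Is it late? He left."))
```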
Xiao Liu Features: • greetings, signatures, quotes, links, contact information • emoticons :( v. ): • extract domain specific terms using MetaMap Feedback: • domain specific terms work well if unknown text is also in the domain • ratio of generic to brand name drug? • replace with tags? (e.g. SYMPTOM, MEDICATION)
Yi Feng New Features: • date/time from articles Feedback: • Find a way to distinguish between years and model #s • Add additional temporal expressions (e.g. previously) including multi-word expressions (e.g. the next day, one year after, etc.)
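One hedged way to separate years from model numbers is to accept a four-digit number as a year only when the preceding word is a temporal cue such as "in" or "since". The cue list, the year range, and the temporal-expression list below are all assumptions chosen for illustration, and the heuristic will miss years that appear without a cue.

```python
import re

YEAR_LIKE = re.compile(r"\b(?:1[5-9]\d\d|20\d\d)\b")
YEAR_CUES = {"in", "since", "during", "by", "until"}   # illustrative cue words
TEMPORAL = re.compile(r"\b(?:previously|the next day|one year after)\b",
                      re.IGNORECASE)

def likely_years(text):
    """Four-digit numbers that directly follow a temporal cue word."""
    years = []
    for match in YEAR_LIKE.finditer(text):
        preceding = text[: match.start()].split()
        if preceding and preceding[-1].lower() in YEAR_CUES:
            years.append(match.group())
    return years

def temporal_expression_count(text):
    """Count single- and multi-word temporal expressions from the small list."""
    return len(TEMPORAL.findall(text))

if __name__ == "__main__":
    print(likely_years("In 1999 the company released the Model 2000 printer."))
    print(temporal_expression_count("Previously he left; the next day he returned."))
```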
Zahra New Features: • Topic related words (e.g. {election, government}, {God, bible, worship}, {download, xbox, play, fan}) • Frequency of symbols (specifically parentheses and curly brackets) • How the author ends the article (e.g. link/references/farewell) Feedback • Topic related words are too content dependent • use of symbols is more content independent • implementation of author’s ending?