
More on authorship



Presentation Transcript


  1. More on authorship Feedback on corpora and features

  2. Review • In authorship classification, we want to determine who the author is, given an unknown text sample and samples from potential authors • Authorship classification can be used for… • identifying the author of an anonymous work • settling disputed authorship • determining whether an article was plagiarized • identifying the writer of a threatening or harassing note • verifying the authenticity of notes (e.g. suicide notes)

  3. Features • Features used should be purely stylistic, not dependent on the topic the author is writing about • What types of features are more content independent? • How should we normalize features?
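
One possible answer to the normalization question is to turn raw counts into rates per 1,000 tokens, so that long and short documents become comparable. The sketch below is only an illustration: the regex tokenizer and the helper name `rate_per_thousand` are assumptions, not something taken from the slides.

```python
import re

def tokenize(text):
    """Lowercase word tokenizer (an assumption; any tokenizer would do)."""
    return re.findall(r"[a-z']+", text.lower())

def rate_per_thousand(text, targets):
    """Occurrences of each target word per 1,000 tokens of the text."""
    tokens = tokenize(text)
    if not tokens:
        return {w: 0.0 for w in targets}
    return {w: 1000.0 * tokens.count(w) / len(tokens) for w in targets}

short_doc = "the cat sat on the mat"
long_doc = " ".join([short_doc] * 10)  # same style, ten times longer

# Raw counts differ (2 vs. 20), but the normalized rates match.
print(rate_per_thousand(short_doc, ["the"]))
print(rate_per_thousand(long_doc, ["the"]))
```

Rates like these can be compared across documents of different lengths, whereas raw counts mostly reflect how much the author wrote.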

  4. Individualized Feedback • Corpus collection • New features

  5. Things to fix: • # of authors (need at least 3) • # of files per author (at least 50) • file naming convention (e.g. authorname_filenumber) • empty documents • length of documents (at least 20 sentences) • sentence boundary detection • tokenization • boilerplate • genre: everyone should choose a genre other than newspaper articles

  6. Dennis

  7. Gus

  8. Mohsen

  9. QiangGao

  10. Becky

  11. Sam

  12. Victor

  13. Weifeng

  14. Xiao Liu

  15. Yifeng

  16. Yuan-Lu

  17. Yuan-Lu’s New Features New features: • productivity of the author (baseline accuracy) • synonymous phrases and words (using counts not ratios gives 22.2% on LA Times data) • will v. would • can v. could • may v. might • I v. we Feedback: • Use of modals may signal politeness or formality of the text: • Can I borrow your pen? • May I borrow your pen? • Additional ways to model synonyms?
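
The word pairs above could be turned into features in several ways. The slide reports results with counts; the sketch below shows a ratio variant instead, purely as an illustration. The function name `pair_preferences` and the tokenizer are assumptions, and the code makes no attempt to separate the modal "may" from the month "May" or the modal "can" from the noun.

```python
import re
from collections import Counter

# Near-synonymous pairs listed on the slide.
PAIRS = [("will", "would"), ("can", "could"), ("may", "might"), ("i", "we")]

def pair_preferences(text):
    """For each pair (a, b), the share of a among all uses of a or b.

    Returns None for a pair when neither word occurs in the text.
    """
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    prefs = {}
    for a, b in PAIRS:
        total = counts[a] + counts[b]
        prefs[f"{a}_vs_{b}"] = counts[a] / total if total else None
    return prefs

sample = "I will go. We would go. I can, I could."
print(pair_preferences(sample))
```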

  18. Modeling Synonymy • synonymy = higher lexical cohesion • Use WordNet? • # of synonyms per document/sentence?/??? • Lemmatization? • change/changing/changed/changes • ratio of words to lemmas? • binning: how many words have {1, 2, 3, 4, … x} morphologically related words elsewhere in the text?
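
The word-to-lemma ratio and the binning idea can be sketched as follows. A real implementation would use a proper lemmatizer; here a deliberately crude suffix stripper (`crude_stem`, a hypothetical stand-in) groups word forms, so the numbers are only indicative.

```python
import re
from collections import Counter, defaultdict

def crude_stem(word):
    """Very rough stand-in for a lemmatizer: strip one common suffix."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def morphology_features(text):
    """Return (word types per stem, bins of related-form counts)."""
    types = set(re.findall(r"[a-z]+", text.lower()))
    families = defaultdict(set)
    for w in types:
        families[crude_stem(w)].add(w)
    ratio = len(types) / len(families) if families else 0.0
    # bins[k] = number of word types with k related forms elsewhere in the text
    bins = Counter(len(families[crude_stem(w)]) - 1 for w in types)
    return ratio, dict(bins)

ratio, bins = morphology_features("change changing changed changes the plan")
print(ratio, bins)
```

In this toy text the four forms of "change" fall into one family, so four word types each have three related forms and two word types have none.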

  19. Dennis New Features • frequency of emoticons • frequency of hashtags • Use of subordinate clauses: NP -> NP (SBAR -> S) Feedback: • Hashtag frequency may be content dependent

  20. Using hashtags as features • Are some hashtags more content independent? • e.g. #moscow v. #yolo, #tbt • Other hashtag related features? • # per sentence • average length of hashtag • other?
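
The two hashtag features suggested above (number per sentence and average length) ignore which hashtags are used, which may make them less content dependent. A minimal sketch, assuming a naive punctuation-based sentence splitter and a hypothetical helper name `hashtag_features`:

```python
import re

HASHTAG = re.compile(r"#(\w+)")

def hashtag_features(text):
    """Hashtags per sentence and average hashtag length in characters."""
    tags = HASHTAG.findall(text)
    # Naive sentence split on ., ! and ? (an assumption, not a real detector).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "per_sentence": len(tags) / len(sentences) if sentences else 0.0,
        "avg_length": sum(len(t) for t in tags) / len(tags) if tags else 0.0,
    }

post = "Throwback to Moscow #tbt #moscow. Living my best life #yolo!"
print(hashtag_features(post))
```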

  21. Gus New Features • length-based: average word/sentence/morpheme length • formatting: uppercase/lowercase • mentions: activities/named entities/foods (content dependent?) • pragmatics • politeness level: estimate by count of polite words and phrases (examples?) • person descriptions and epithets (e.g. “He’s a smart fellow”) • morphemes (frequency? length?, # per word?)

  22. Suggestions… • Named Entity Tags • Frequency of tags (e.g. PERSON, LOCATION) • Politeness words • thanks, thank you, please…. • modals? • swear words?

  23. QiangGao Features • Crowdfunding blogs: same topic but authors have different strategies and knowledge/skills • frequency of content words • HTML tags: type of tags and frequencies of each type • Structure Features: font size, color, embedded images and links to other websites

  24. Feedback • Some features may be too content dependent • Type of tags v. Frequencies of each type? • Embedded image and links (frequency of specific links? or # of links per document?) • # of different fonts used, ratio of bold to non-bold; italic to non-italic, etc
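
Counting tags per document, rather than recording which specific links appear, is one way to keep the HTML features less content dependent. The sketch below uses Python's standard `html.parser`; the class name `TagCounter` and the particular features returned are illustrative assumptions.

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Counts every opening tag seen in a document."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

def html_features(html):
    """A few structural counts that do not depend on link targets."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    tags = parser.tags
    return {
        "links": tags["a"],
        "images": tags["img"],
        "bold": tags["b"] + tags["strong"],
        "italic": tags["i"] + tags["em"],
        "distinct_tags": len(tags),
    }

page = ('<p>Back <b>our</b> project at <a href="http://example.com">this link</a>. '
        '<img src="x.png"> <i>Thanks</i></p>')
print(html_features(page))
```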

  25. Sam New Features • Frequency of Proper Nouns per sentence/document • Dialogue based features: • “different authors use a higher or lower dialogue-to-narrative ratio” • “different authors use longer or shorter sequences of dialogue” • Suggestions: • direct v. indirect quotes • split quotations: “Yes,” said X, “I’m going to…” • length of quotes
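
The dialogue features could be approximated by treating everything inside quotation marks as dialogue. This is only a sketch under that assumption: it does not separate direct speech from scare quotes or titles, and a split quotation such as the example above is counted as two quotes. The name `dialogue_features` is hypothetical.

```python
import re

# Matches text inside straight or curly double quotes.
QUOTE = re.compile(r'"([^"]*)"|“([^”]*)”')

def dialogue_features(text):
    """Share of words inside quotes, number of quotes, average quote length."""
    quotes = [a or b for a, b in QUOTE.findall(text)]
    quoted_words = sum(len(q.split()) for q in quotes)
    total_words = len(text.split())
    return {
        "dialogue_share": quoted_words / total_words if total_words else 0.0,
        "num_quotes": len(quotes),
        "avg_quote_words": quoted_words / len(quotes) if quotes else 0.0,
    }

passage = '“Yes,” said Anna, “I am going home.” Then she left.'
print(dialogue_features(passage))
```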

  26. Mohsen New Features: • Length of sentences and paragraphs • Repetition of adjectives in the same paragraph • Usage of different pronouns (my, I, your, you, his, he, her, she, it, its, we, our, they, their) Feedback: • Additional repetition? Repetition of lemmas? (e.g. skills, skilled, skillful) • Extend set of pronouns? • Are gendered pronouns content dependent? • Feature selection: only considered even numbered lines

  27. Pronouns

  28. Priya New features • # of exclamation marks used • # of commas used • # of hyphens used Feedback • Similar to punctuation frequency • Other features for poetry: • average length of syllables • rhyme scheme • use of alliteration (e.g. # of words in the sentence that start with the same letter) • frequency of individual letters • Additional poetic devices: http://leavingcertenglish.net/2011/04/poetic-techniques-terminology/
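
The alliteration suggestion (words in a line that start with the same letter) can be sketched directly. Note the assumption: this version is letter-based, not sound-based, so it would treat "cat" and "city" as alliterating. The helper name `alliteration_count` is hypothetical.

```python
import re
from collections import Counter

def alliteration_count(line):
    """Number of words that share their first letter with another word."""
    words = re.findall(r"[a-z]+", line.lower())
    initials = Counter(w[0] for w in words)
    return sum(c for c in initials.values() if c >= 2)

line = "Peter Piper picked a peck of pickled peppers"
print(alliteration_count(line))  # six words start with "p"
```

Dividing the result by the number of words in the line would give a length-independent version of the same feature.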

  29. Becky • Passive voice • Length/proportion of asides/appositives/sub-clauses? • syntactic richness (syntactic rule types/syntactic rule tokens) Feedback: • condensing POS tags? • syntactic richness of individual types of phrases (e.g. syntactic richness of different NPs) • noisy data from incorrect parses? (maybe limit rewrite rules to those which are shorter than XP → X X X X)
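
Syntactic richness as defined above is a type/token ratio over rewrite rules. The sketch below assumes the rules have already been extracted from parses (no parser is included) and are given as `(lhs, rhs)` tuples. The `max_len` cutoff is one reading of the length-limit suggestion and is an assumption.

```python
def syntactic_richness(rules, max_len=4):
    """Distinct rewrite rules divided by total rewrite rules.

    Rules whose right-hand side has max_len or more symbols are dropped,
    as a guard against noisy, overly flat parses.
    """
    kept = [(lhs, rhs) for lhs, rhs in rules if len(rhs) < max_len]
    if not kept:
        return 0.0
    return len(set(kept)) / len(kept)

# Hypothetical rules read off a parsed sentence.
rules = [
    ("S", ("NP", "VP")),
    ("NP", ("DT", "NN")),
    ("VP", ("VBD", "NP")),
    ("NP", ("DT", "NN")),
    ("NP", ("NP", "SBAR")),
    ("NP", ("DT", "JJ", "JJ", "JJ", "NN")),  # long rule, filtered by default
]
print(syntactic_richness(rules))
```

Restricting `rules` to those with a given left-hand side (e.g. only `NP`) gives the per-phrase-type variant mentioned in the feedback.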

  30. Victor Features: • Average message length • Lexical diversity • HTML Feedback: • Avg. message length may be useful for specific genres, especially forums, Twitter, or other short texts, but may not be useful across genres • Lexical diversity already used, what about lemma diversity?

  31. Weifeng New Features: • Average # of tokens per sentence • Declarative sentence ratio: ratio between statements, versus commands, questions, and exclamations • Font (bold, italic, underlined) usage Feedback: Implementation of declarative sentence ratio?

  32. Sentence type: • Questions → ? • Declaratives → . • Commands → ! • Exclamation marks can also be used for non-commands • Commands don’t always have an exclamation mark • Meet me in the alley with the money at midnight. Come alone. • Another way to model commands could be to look at verbs starting a sentence, or look for verbs without a subject.
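
A rough implementation of the declarative sentence ratio might combine final punctuation with a check for a sentence-initial verb. In the sketch below the verb check is a tiny hand-made word list (`IMPERATIVE_STARTERS`, purely hypothetical); a POS tagger would be the more realistic choice.

```python
import re

# Hypothetical, very small list of verbs that often open commands.
IMPERATIVE_STARTERS = {"meet", "come", "go", "bring", "take", "leave", "stop", "wait"}

def sentence_type(sentence):
    """Classify one sentence as question, command, exclamation or declarative."""
    s = sentence.strip()
    if s.endswith("?"):
        return "question"
    first = re.match(r"[A-Za-z']+", s)
    if first and first.group().lower() in IMPERATIVE_STARTERS:
        return "command"  # commands need not end in "!"
    if s.endswith("!"):
        return "exclamation"
    return "declarative"

def declarative_ratio(text):
    """Share of sentences classified as declarative."""
    sentences = re.findall(r"[^.!?]+[.!?]+", text)
    if not sentences:
        return 0.0
    types = [sentence_type(s) for s in sentences]
    return types.count("declarative") / len(types)

note = ("Meet me in the alley with the money at midnight. Come alone. "
        "Who sent you? I was at home. What a night!")
print(declarative_ratio(note))
```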

  33. Xiao Liu Features: • greetings, signatures, quotes, links, contact information • emoticons :( v. ): • extract domain specific terms using MetaMap Feedback: • domain specific terms work well if unknown text is also in the domain • ratio of generic to brand name drug? • replace with tags? (e.g. SYMPTOM, MEDICATION)

  34. Yi Feng New Features: • date/time from articles Feedback: • Find a way to distinguish between years and model #s • Add additional temporal expressions (e.g. previously) including multi-word expressions (e.g. the next day, one year after, etc.)
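
One heuristic for separating years from model numbers is to accept only standalone four-digit numbers in a plausible range, rejecting digits attached to letters or hyphens. The range (1500 to 2099), the phrase list and the name `temporal_features` below are all assumptions made for illustration.

```python
import re

# A standalone number from 1500 to 2099, not glued to letters, digits or hyphens.
YEAR = re.compile(r"(?<![\w-])(1[5-9]\d{2}|20\d{2})(?![\w-])")

# Temporal expressions mentioned in the feedback; extend as needed.
TEMPORAL = ["previously", "the next day", "one year after"]

def temporal_features(text):
    """Year mentions plus counts of a few temporal expressions."""
    lowered = text.lower()
    phrases = {
        p: len(re.findall(r"\b" + re.escape(p) + r"\b", lowered))
        for p in TEMPORAL
    }
    return {"years": YEAR.findall(text), "phrases": phrases}

article = ("The X2000 launched in 1999, and the 747-8 followed in 2011. "
           "The next day, sales doubled.")
print(temporal_features(article))
```

Here "X2000" and "747-8" are skipped as model numbers, while 1999 and 2011 are kept as years. A bare four-digit model number such as "Model 3000" is out of range and skipped, but "Model 2000" would still be misread as a year.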

  35. Zahra New Features: • Topic related words (e.g. {election, government}, {God, bible, worship}, {download, xbox, play, fan}) • Frequency of symbols (specifically parentheses and curly brackets) • How the author ends the article (e.g. link/references/farewell) Feedback • Topic related words are too content dependent • use of symbols is more content independent • implementation of author’s ending?
