Digital Text and Data Processing


  1. Digital Text and Data Processing Week 2

  2. Text Mining Research
     • This class: focus is mostly on computational analysis of literary texts
     • Different names:
       • ‘Text analysis’
       • Digital Literary Studies
       • Literary informatics (Martin Mueller)
       • Algorithmic Criticism (Stephen Ramsay)
     • Two approaches: research based on vocabulary and research based on data about the words

  3. Studies based on vocabulary
     • Segmentation or tokenisation
     • Often based on the fact that there are spaces in between words (at least since scriptura continua was abandoned in the late 9th century)
     Source: Christopher Kelty, Abracadabra: Language, Memory, Representation
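
A minimal sketch of tokenisation in Perl, assuming that any run of non-word characters (spaces, punctuation) separates two tokens. The helper name tokenise is hypothetical and not part of the course materials; lower-casing the input is an extra assumption made here so that ‘If’ and ‘if’ count as the same token.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into lower-cased word tokens.
sub tokenise {
    my ($line) = @_;
    # Split on runs of non-word characters. Note that this simple rule
    # also splits "can't" into "can" and "t".
    my @tokens = grep { length } split /\W+/, lc $line;
    return @tokens;
}

my @tokens = tokenise("If music be the food of love, play on");
print join( "|", @tokens ), "\n";
```

The grep step discards the empty string that split returns when a line starts with punctuation or white space.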

  4. Frequency lists
     • Tokens and types
     • Frequency lists
     • ‘Bag of words’ model: original word order is ignored
     the    2782
     and    1646
     to     1604
     of     1293
     a      1152
     was     950
     I       902
     that    799
     she     776
     in      733
     her     698
     you     652
     he      628
     had     606
     it      518
     not     510
     is      489
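
A frequency list of this kind can be sketched in a few lines of Perl. The helper names frequency_list and top_words are hypothetical, and the sample sentence is invented for illustration; the sketch assumes the text has already been split into tokens.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each type occurs in a list of tokens.
sub frequency_list {
    my (@tokens) = @_;
    my %freq;
    $freq{$_}++ for @tokens;
    return %freq;
}

# Hypothetical helper: the $n most frequent types, ties broken alphabetically.
sub top_words {
    my ( $n, %freq ) = @_;
    my @sorted = sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;
    $n = @sorted if $n > @sorted;
    return @sorted[ 0 .. $n - 1 ];
}

my @tokens = split /\W+/, lc "the cat and the dog and the bird";
my %freq   = frequency_list(@tokens);

foreach my $word ( top_words( 3, %freq ) ) {
    print $word, "\t", $freq{$word}, "\n";
}
```

Because the counts are stored in a hash, the original word order is lost, which is exactly the ‘bag of words’ simplification.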

  5. Stylometrics
     • Study of style on the basis of quantitative aspects
     • Analyses of differences and similarities between texts in different genres, in different periods, and by different authors
     Source: David Hoover, Textual Analysis

  6. Hugh Craig, Stylistic Analysis and Authorship Studies

  7. Vocabulary Diversity
     • Type/token ratio
     • Normalisation for the number of words in a text
     Source: Peter Garrard, Textual Pathology
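
As a rough sketch, the type/token ratio is the number of distinct words (types) divided by the total number of words (tokens). The helper names below are hypothetical. Computing the ratio over the first N tokens only is one simple way to normalise for text length; it is an assumption made here, not necessarily the method used in the cited study.

```perl
use strict;
use warnings;

# Hypothetical helper: number of types divided by number of tokens.
sub type_token_ratio {
    my (@tokens) = @_;
    return 0 unless @tokens;
    my %types;
    $types{$_} = 1 for @tokens;
    return scalar( keys %types ) / scalar(@tokens);
}

# Hypothetical helper: ratio over the first $n tokens only, so that
# texts of different lengths are compared on equal-sized samples.
sub ttr_for_sample {
    my ( $n, @tokens ) = @_;
    my @sample = @tokens > $n ? @tokens[ 0 .. $n - 1 ] : @tokens;
    return type_token_ratio(@sample);
}

my @tokens = qw(the cat sat on the mat);
printf "%.2f\n", type_token_ratio(@tokens);    # 5 types / 6 tokens
```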

  8. Zipf’s law
     • A small number of words have a high frequency, while a large number are ‘hapax legomena’ (words that appear only once)
     • Function words and lexical words
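
The hapax legomena of a text can be found by counting all tokens and keeping the words whose count is exactly one. This is a minimal sketch; the helper name hapax_legomena and the sample tokens are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: words that occur exactly once in a list of tokens.
sub hapax_legomena {
    my (@tokens) = @_;
    my %freq;
    $freq{$_}++ for @tokens;
    my @hapax = grep { $freq{$_} == 1 } keys %freq;
    return sort @hapax;
}

my @tokens = qw(the sun and the moon and the star);
my @hapax  = hapax_legomena(@tokens);
print scalar(@hapax), " hapax legomena: @hapax\n";
```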

  9. Authorship attribution
     • Suggesting an author for texts whose authorship is disputed
     • One possible method: Delta (developed by John Burrows)
     Source: John Burrows, Never Say Always Again: Reflections on the Numbers Game
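
As commonly described, Delta compares two texts by taking the most frequent words, expressing each word's relative frequency as a z-score, and averaging the absolute differences between the two texts' z-scores; the lower the value, the more similar the texts. The sketch below covers only that final step. The helper name delta is hypothetical, the z-scores are invented for illustration, and both hashes are assumed to contain the same words.

```perl
use strict;
use warnings;

# Hypothetical helper: mean absolute difference between two sets of z-scores.
# Both arguments are hash references mapping a word to its z-score in one
# text; the two hashes are assumed to have identical keys.
sub delta {
    my ( $z_a, $z_b ) = @_;
    my @words = keys %$z_a;
    return unless @words;
    my $sum = 0;
    $sum += abs( $z_a->{$_} - $z_b->{$_} ) for @words;
    return $sum / @words;
}

# Invented z-scores, for illustration only.
my %disputed  = ( the => 0.5, and => -1.0, of => 0.2 );
my %candidate = ( the => 0.1, and => -0.4, of => 0.8 );

printf "Delta: %.3f\n", delta( \%disputed, \%candidate );
```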

  10. Applications
      • Authorship attribution
      • Formal similarities and differences between genres, literary periods, authors
      • ‘Thematic summaries’ by creating lists of significant function words (e.g. inverse document frequency)
      • Allusions; intertextual references
      • Investigation of the structure of a book, cf. Tanya Clement’s study of Gertrude Stein’s The Making of Americans

  11. Challenges
      • Case sensitivity, e.g. ‘his’ or ‘His’
      • Compound words and phrasal verbs, e.g. ‘carry out’, ‘look after’, ‘swimming pool’, ‘bus stop’
      • Different spellings (diachronic and synchronic)
      • Polysemous words
      • ‘Reductionist’ approach

  12. Regular expressions
      • Text patterns
      • Simplest regular expression: simple sequence of characters
      Example: /sun/
      Also matches: disunited, sunk, asunder (and Sunday, but only if the match is made case-insensitive)
      / sun /
      Does NOT match: […] the gate of the eastern sun, […] gloom beneath the noonday sun.
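
The difference between the two patterns can be checked with a short script. The helper names are hypothetical; the test strings are taken from the slide.

```perl
use strict;
use warnings;

# Hypothetical helpers: report whether each pattern is found in a string.
sub contains_sun {
    my ($text) = @_;
    return $text =~ /sun/ ? 1 : 0;
}

sub contains_spaced_sun {
    my ($text) = @_;
    # Requires a literal space before and after "sun".
    return $text =~ / sun / ? 1 : 0;
}

foreach my $text ( "disunited", "sunk", "asunder", "the gate of the eastern sun," ) {
    print "$text\t/sun/: ", contains_sun($text),
        "\t/ sun /: ", contains_spaced_sun($text), "\n";
}
```

The pattern with spaces fails on ‘sun,’ and ‘sun.’ because the character after the word is punctuation rather than a space.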

  13. \b can be used in regular expressions to represent word boundaries
      • If you place “i” directly after the second forward slash, the comparison will take place in a case-insensitive manner.
      /\bsun\b/i matches:
      […] Points to the unrisen sun! […]
      […] Startles the dreamer, sun-like truth […]
      […] stamped upon the sun; […]
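
A short sketch of the same pattern in use, with a hypothetical helper name. The first three test lines come from the slide; the last two are added here to show a word that is rejected and a capitalised form that is accepted.

```perl
use strict;
use warnings;

# Hypothetical helper: does the line contain "sun" as a separate word,
# regardless of case?
sub has_word_sun {
    my ($line) = @_;
    return $line =~ /\bsun\b/i ? 1 : 0;
}

my @lines = (
    "Points to the unrisen sun!",
    "Startles the dreamer, sun-like truth",
    "stamped upon the sun;",
    "the ship was sunk",
    "The Sun rose",
);

foreach my $line (@lines) {
    print has_word_sun($line) ? "match:    " : "no match: ", $line, "\n";
}
```

The hyphen in ‘sun-like’ is not a word character, so a word boundary is found directly after ‘sun’.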

  14. Character classes
      .      Any character, except the newline
      \w     Any alphanumerical character: alphabetical characters, numbers and underscore
      \d     Any digit
      \s     White space: space, tab, newline
      [..]   Any of the characters supplied within square brackets

  15. Quantifiers
      {n,m}  Pattern must occur at least n times, at most m times
      {n,}   At least n times
      {n}    Exactly n times
      ?      is the same as {0,1}
      +      is the same as {1,}
      *      is the same as {0,}

  16. Examples
      /\d{4}/
          Matches: 1234, 2013, 1066
      /[a-zA-Z]+/
          Matches any word that consists of alphabetical characters only
          Does not FULLY match: e-mail, catch22, can’t
      /b[aeiou]{1,2}t\w*/
          Matches in full: bit, but, beat, boathouse
          No match: beauty, blister
          Partial match only: boat-house (the pattern matches ‘boat’)
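
Whether a pattern matches a whole word or only part of it can be tested by wrapping the pattern in the anchors ^ and $. The helper fully_matches is hypothetical; the patterns and test words are those of the slide.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if the pattern covers the whole string.
sub fully_matches {
    my ( $string, $pattern ) = @_;
    return $string =~ /^(?:$pattern)$/ ? 1 : 0;
}

my $pattern = qr/b[aeiou]{1,2}t\w*/;

foreach my $word (qw(bit but beat boathouse beauty blister boat-house)) {
    my $partial = $word =~ $pattern ? 1 : 0;
    my $full    = fully_matches( $word, $pattern );
    print "$word\tpartial: $partial\tfull: $full\n";
}
```

For ‘boat-house’ the unanchored pattern still finds ‘boat’, since \w* stops at the hyphen; only the anchored test rejects the word.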

  17. Anchors
      Do not match characters, but locations within strings.
      \b     Word boundaries
      ^      Start of a line
      $      End of a line
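
A small sketch of ^ and $ in use, with hypothetical helper names. Note that in Perl, $ also matches just before a newline at the very end of the string, so a line that has not been chomped still matches.

```perl
use strict;
use warnings;

# Hypothetical helpers: test the start and the end of a line.
sub starts_with_if {
    my ($line) = @_;
    return $line =~ /^If\b/ ? 1 : 0;
}

sub ends_with_on {
    my ($line) = @_;
    return $line =~ /\bon$/ ? 1 : 0;
}

my $line = "If music be the food of love, play on";
print "starts with 'If': ", starts_with_if($line), "\n";
print "ends with 'on':   ", ends_with_on($line),   "\n";
```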

  18. Match variables
      • Parentheses create substrings within a regular expression
      • In Perl, this substring is stored as variable $1
      • Example:
          my $keyword = "computer-aided" ;
          if ( $keyword =~ /(\w+)-\w+/ ) {
              print $1 ;   # This will print "computer"
          }
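
With a second pair of parentheses, the part after the hyphen becomes available as $2. The sketch below uses a hypothetical helper, split_compound, that returns both halves of a hyphenated word, or an empty list if the word does not contain exactly two parts.

```perl
use strict;
use warnings;

# Hypothetical helper: return the two halves of a hyphenated word.
sub split_compound {
    my ($word) = @_;
    if ( $word =~ /^(\w+)-(\w+)$/ ) {
        return ( $1, $2 );    # first and second pair of parentheses
    }
    return;
}

my ( $first, $second ) = split_compound("computer-aided");
print "$first / $second\n";
```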

  19. Regular expressions can be combined with the vertical bar (‘|’)
      /\bsun\b|\bstar\b|\bmoon\b/
      • ‘Special characters’ need to be escaped with the backslash (‘\’)
      /\?/
      /\[/
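
Both points can be tried out in a few lines. The helper names and the sample sentences are invented for illustration; the patterns are those shown on the slide.

```perl
use strict;
use warnings;

# Hypothetical helper: does the line mention the sun, a star or the moon?
sub mentions_sky_object {
    my ($line) = @_;
    return $line =~ /\bsun\b|\bstar\b|\bmoon\b/ ? 1 : 0;
}

# Hypothetical helper: does the line contain a literal question mark?
# Without the backslash, ? would be read as a quantifier.
sub contains_question_mark {
    my ($line) = @_;
    return $line =~ /\?/ ? 1 : 0;
}

print mentions_sky_object("the moon rose"),         "\n";
print mentions_sky_object("bathed in starlight"),   "\n";
print contains_question_mark("Who is there?"),      "\n";
```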

  20. Exercise Download “concordance.pl” and experiment with regular expressions

  21. Recapitulation W1
      • Variables begin with a dollar sign. Two types: strings and numbers
      • Statements end in a semicolon
      • “use strict” has the effect that all variables need to be declared on first use with the “my” keyword
      • “use warnings” means that programmers will be warned when there are errors, even when these are non-fatal

  22. Operators
      • Concatenation of strings with the dot
          my $string1 = "Hello" ;
          my $string2 = "World" ;
          my $string3 = $string1 . " " . $string2 ;   # "Hello World"
      • Mathematical operators:
          my $sum = 5 + 1 ;    # 6
          $sum++ ;             # 7; ++ works on a variable, not on a literal such as 5
          my $number = 2 ;
          $number += 3 ;       # 5

  23. Three types of variables
      • Scalars: a single value; start with $
      • Arrays: multiple values; start with @
      • Hashes: multiple values which can be referenced with ‘keys’; start with %

  24. my $line = "If music be the food of love, play on" ;
      my @array = split( " " , $line ) ;
      # $array[0] contains "If"
      # $array[4] contains "food"

  25. my %freqList ;   # a hash is declared with %, individual values are accessed with $
      $freqList{"if"}++ ;
      $freqList{"music"}++ ;
      print $freqList{"if"} ;

  26. Looping through an array
          foreach my $w ( @words ) {
              print $w ;
          }
      Looping through a hash
          foreach my $w ( keys %freq ) {
              print $w . "\t" . $freq{$w} ;
          }
