Globalisation & Computer systems Week 7 • Text processes and globalisation part 1: • Sorting strings: collation • Searching strings and regular expressions
Text processes Character encoding design: “must provide the set of code values that allows programmers to design applications capable of implementing a variety of text processes in the desired language” • Text processes operate over text elements
Text processes Text elements • The objects of a text • Depends on perspective • Different text processes operate over different objects
Sorting Sorting (collation) “The process of ordering units of textual information. Collation is usually specific to a particular language” (Unicode version 3: glossary)
Sorting Language specific • sort order • phonetically based sort • graphically based sort • sort element
Sorting Levels of comparison • Level 1 (primary difference) • Levels 2 and 3 (similar) • Level 4 (exact match)
Sorting Levels of comparison • Level 4: exact match • match in code value • character equivalence • resumes = resumes
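The difference between a match in code value and a match under character equivalence can be sketched in Python. This is an illustration only: it assumes that "character equivalence" means Unicode canonical equivalence, which the slide does not state, and the function names are hypothetical.

```python
import unicodedata

def same_code_values(a: str, b: str) -> bool:
    """Exact match in code values: identical sequences of code points."""
    return a == b

def canonically_equivalent(a: str, b: str) -> bool:
    """Match under character equivalence (assumed here to mean NFC-normalised equality)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "r\u00e9sum\u00e9s"     # é stored as the single code point U+00E9
decomposed = "re\u0301sume\u0301s"    # e followed by the combining acute accent U+0301

print(same_code_values("resumes", "resumes"))            # True
print(same_code_values(precomposed, decomposed))         # False
print(canonically_equivalent(precomposed, decomposed))   # True
```

The two spellings of "résumés" look identical on screen but differ in code values, so only the normalised comparison treats them as the same string.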
Sorting Levels of comparison • Level 1 (primary difference: alphabetic)
Sorting Levels of comparison • Level 1 (primary difference) • resume < resumes • Level 2 (similar: no accent < accent) • resume < résumé • resumes < résumés • Level 3 (similar: lower case < upper case) • résumé < Résumé
Sorting Forward and backward sequence sort • Forward sequence • Start comparison from beginning of string • Backward sequence • Start comparison from end of string
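A small Python sketch may make the two directions concrete. It uses four French words as an assumed example (traditional French dictionary ordering compares accents from the end of the word); the helper names are hypothetical and the accent weights are a simplification, not a real collation table.

```python
import unicodedata

def base_and_accents(word):
    """Split a word into base letters and a per-letter accent flag (0 = none, 1 = accented)."""
    base, accents = [], []
    for ch in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", ch)
        base.append(decomposed[0])
        accents.append(1 if len(decomposed) > 1 else 0)
    return "".join(base), accents

def forward_key(word):
    """Compare accent differences starting from the beginning of the string."""
    base, accents = base_and_accents(word)
    return (base, accents)

def backward_key(word):
    """Compare accent differences starting from the end of the string."""
    base, accents = base_and_accents(word)
    return (base, accents[::-1])

words = ["côté", "coté", "côte", "cote"]
print(sorted(words, key=forward_key))   # ['cote', 'coté', 'côte', 'côté']
print(sorted(words, key=backward_key))  # ['cote', 'côte', 'coté', 'côté']
```

Both keys agree on the base letters; they differ only in which accent difference is seen first.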
Sorting Implementation • Sort keys • assign set of weights to each character in the string • compare substrings according to weighting • switch weightings on / off
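The sort-key idea can be sketched as follows. This is a toy model under stated assumptions: the weights are derived from Unicode decomposition and letter case rather than from a real collation table, and `sort_key` and its `levels` parameter are hypothetical names used to show how weight levels might be switched on or off.

```python
import unicodedata

def sort_key(text, levels=3):
    """Build a multi-level sort key; `levels` switches the weight levels on or off."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        base = decomposed[0]
        primary.append(base.lower())                        # level 1: base letter
        secondary.append(1 if len(decomposed) > 1 else 0)   # level 2: no accent < accent
        tertiary.append(1 if base.isupper() else 0)         # level 3: lower case < upper case
    return (primary, secondary, tertiary)[:levels]

words = ["Résumé", "résumés", "resumes", "résumé", "resume"]
print(sorted(words, key=sort_key))
# ['resume', 'résumé', 'Résumé', 'resumes', 'résumés']

# With only level 1 switched on, accents and case are ignored.
print(sort_key("resume", levels=1) == sort_key("Résumé", levels=1))  # True
```

Because all level-1 weights are compared before any level-2 weight, an accent difference only matters when the base letters are identical, which reproduces the ordering given for the levels of comparison.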
Searching Text elements • The objects of a text • Depends on perspective • Different text processes operate over different objects
Regular Expressions • Basis of most web-based and word-processor-based searches • Definition 1. An algebraic notation for characterising a set of strings • Definition 2. A set of rules that you can use to specify one or more items, such as words in a file, by using a single character string (Sarwar et al.)
Regular Expressions • regular expression, text corpus • regular expression algebra has variants: Perl, Unix tools • Unix tools: egrep, sed, awk
Regular Expressions • Find occurrences of /Nokia/ in the text: egrep -n 'Nokia' nokia_corpus.txt
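For readers without egrep to hand, a rough Python equivalent of a numbered search is sketched below. It uses Python's `re` module rather than egrep; `grep_n` is a hypothetical helper, and the three sample lines are invented stand-ins for nokia_corpus.txt, whose contents are not shown in the slides.

```python
import re

def grep_n(pattern, text):
    """Return (line_number, line) pairs for lines containing a match, like `egrep -n`."""
    regex = re.compile(pattern)
    return [(number, line)
            for number, line in enumerate(text.splitlines(), start=1)
            if regex.search(line)]

# Invented sample text standing in for the real corpus.
sample = """Nokia issued a profit warning.
Shares in nokia fell sharply.
Analysts said the share price may drop further."""

for number, line in grep_n("Nokia", sample):
    print(f"{number}:{line}")
# 1:Nokia issued a profit warning.
```

The second line is not reported because the search is case-sensitive.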
Regular Expressions • set operator: egrep -n '[Nn]okia' nokia_corpus.txt
Regular Expressions • optional operator: egrep -n 'shares?' nokia_corpus.txt
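The set and optional operators behave the same way in Python's `re` module, which is used here in place of egrep. The sample lines are invented for illustration.

```python
import re

# Invented sample lines.
lines = [
    "Nokia shares fell",
    "one nokia share",
    "NOKIA shareholders met",
]

# Set operator: [Nn] accepts either N or n in the first position.
set_matches = [line for line in lines if re.search(r"[Nn]okia", line)]
print(set_matches)       # ['Nokia shares fell', 'one nokia share']

# Optional operator: s? accepts zero or one s after "share".
optional_matches = [re.search(r"shares?", line).group() for line in lines]
print(optional_matches)  # ['shares', 'share', 'share']
```

Note that "shareholders" still matches /shares?/, because the pattern is satisfied by the substring "share".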
Regular Expressions • Kleene operators: • /string*/ “zero or more occurrences of previous character” • /string+/ “1 or more occurrences of previous character”
Regular Expressions • Wildcard operator: • /string./ “any character after the previous character” • Combine wildcard and kleene: • /string.*/ “zero or more instances of any character after the previous character” • /string.+/ “one or more instances of any character after the previous character”
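The Kleene and wildcard operators can be compared side by side in a short Python sketch. It uses `re.fullmatch`, which requires the whole candidate to match, so that the effect of each operator is visible; egrep instead searches for a match anywhere in a line. The helper name and the candidate words are hypothetical.

```python
import re

def matches_whole(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

candidates = ["strin", "string", "stringgg", "strings", "stringent"]

print(matches_whole(r"string*", candidates))   # ['strin', 'string', 'stringgg']
print(matches_whole(r"string+", candidates))   # ['string', 'stringgg']
print(matches_whole(r"string.", candidates))   # ['strings']
print(matches_whole(r"string.*", candidates))  # ['string', 'stringgg', 'strings', 'stringent']
print(matches_whole(r"string.+", candidates))  # ['stringgg', 'strings', 'stringent']
```

In /string*/ and /string+/ the operator applies only to the final g, whereas in /string.*/ and /string.+/ it applies to the wildcard.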
Regular Expressions egrep -n 'profit.*' nokia_corpus.txt
Regular Expressions • Anchors • Beginning of line operator: ^ egrep '^said' nokia_corpus.txt • End of line operator: $ egrep 'said$' nokia_corpus.txt
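A brief Python sketch of the two anchors, again using `re` in place of egrep and invented sample lines. The `^` anchor is written before the text it anchors and the `$` anchor after it.

```python
import re

# Invented sample lines.
lines = [
    "said the chairman on Monday",
    "the chairman said little",
    "profits fell, analysts said",
]

# ^ anchors the match to the beginning of the line.
starts_with_said = [line for line in lines if re.search(r"^said", line)]
print(starts_with_said)  # ['said the chairman on Monday']

# $ anchors the match to the end of the line.
ends_with_said = [line for line in lines if re.search(r"said$", line)]
print(ends_with_said)    # ['profits fell, analysts said']
```

The middle line contains "said" but matches neither pattern, because the word is at neither edge of the line.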
Regular Expressions • Disjunction: • set operator /[Ss]tring/ “a string which begins with either S or s” • Range /[A-Z]tring/ “a string beginning with a capital letter” • pipe | /string1|string2/ “either string 1 or string 2”
Regular Expressions • Disjunction • egrep -n 'weak|warning|drop' nokia_corpus.txt • egrep -n 'weak.*|warn.*|drop.*' nokia_corpus.txt
Regular Expressions • Negation: /[^a-z]tring/ “any string that does not begin with a lower-case letter”
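The disjunction and negation forms can be tried out together in Python's `re` module, used here as a stand-in for egrep. The word list and the headline are invented examples.

```python
import re

words = ["String", "string", "Xtring", "9tring"]

# Set: first character is S or s.
set_hits = [w for w in words if re.fullmatch(r"[Ss]tring", w)]
print(set_hits)      # ['String', 'string']

# Range: first character is any capital letter A to Z.
range_hits = [w for w in words if re.fullmatch(r"[A-Z]tring", w)]
print(range_hits)    # ['String', 'Xtring']

# Negation: first character is anything except a lower-case letter a to z.
negated_hits = [w for w in words if re.fullmatch(r"[^a-z]tring", w)]
print(negated_hits)  # ['String', 'Xtring', '9tring']

# Pipe: any one of several alternative strings.
headline = "Nokia issues warning as weak demand makes shares drop"
print(re.findall(r"weak|warning|drop", headline))  # ['warning', 'weak', 'drop']
```

The negated set also accepts digits and punctuation, so /[^a-z]tring/ is broader than /[A-Z]tring/.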
Regular Expressions • Precedence (highest to lowest) • Parentheses ( ) • Kleene and optional operators * + ? • Anchors and sequences • Disjunction operator | • /supply|iers/ matches /supply/ or /iers/ • /suppl(y|iers)/ matches /supply/ or /suppliers/
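The effect of precedence on the two patterns can be checked with a short Python sketch. It assumes the spaces around the pipe on the slide are typographic and not part of the pattern, and it uses `re.fullmatch` so that only whole-word matches are reported.

```python
import re

candidates = ["supply", "suppliers", "iers", "suppl"]

# Sequence binds more tightly than |, so the alternatives are "supply" and "iers".
without_parentheses = [c for c in candidates if re.fullmatch(r"supply|iers", c)]
print(without_parentheses)  # ['supply', 'iers']

# Parentheses limit the disjunction to the endings "y" and "iers".
with_parentheses = [c for c in candidates if re.fullmatch(r"suppl(y|iers)", c)]
print(with_parentheses)     # ['supply', 'suppliers']
```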