Text Operations
Liqi Gao
Utilities
• C/C++ Library
• Perl (ActivePerl)
• Regular Expression
• EditPlus / UltraEdit
• Excel
C/C++ Language
• Standard library:
  • Read a line
  • Remove a CR or LF
  • Split a line
• C++ Boost Library:
  • Case Conversion
  • Trimming
  • Replace Algorithm
  • Finding Algorithm
  • Split
C/C++: Read a Line
• Though it's simple, it's useful!
• Three methods:
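The three methods themselves are not listed in the text. As an assumption, the sketch below shows three common ways to read a line in C/C++: `std::getline`, `std::istream::getline` into a fixed buffer, and C's `fgets`. The wrapper names (`read_line_getline`, `read_line_buffer`, `read_line_fgets`) are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>

// 1. std::getline: grows the string as needed and discards the '\n'.
bool read_line_getline(std::istream& in, std::string& line) {
    return static_cast<bool>(std::getline(in, line));
}

// 2. istream::getline: fixed buffer, discards the '\n',
//    and fails if the line does not fit in the buffer.
bool read_line_buffer(std::istream& in, char* buf, std::streamsize size) {
    return static_cast<bool>(in.getline(buf, size));
}

// 3. fgets: fixed buffer, KEEPS the '\n' when it fits in the buffer.
bool read_line_fgets(std::FILE* fp, char* buf, int size) {
    return std::fgets(buf, size, fp) != NULL;
}

// Usage sketch for the two stream-based variants.
void read_line_example() {
    std::istringstream in("first\nsecond\n");
    std::string line;
    assert(read_line_getline(in, line) && line == "first");
    char buf[64];
    assert(read_line_buffer(in, buf, sizeof buf) && std::strcmp(buf, "second") == 0);
    assert(!read_line_getline(in, line));  // no more input
}
```

Note that only `fgets` leaves the line terminator in the buffer, so its result usually needs a cleanup step before further processing.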
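C/C++: Remove CR/LF
• Reading a line on Windows and Linux: Windows text files end each line with CR LF, Linux text files with LF only
C/C++: Remove CR/LF (cont.)
• The noisy CR (carriage return): a line from a Windows file read on Linux keeps a trailing CR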
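One way to make line handling independent of the platform is to strip every trailing CR and LF after reading. The sketch below is a minimal version of that idea; the function name `chomp` is borrowed from Perl and is hypothetical here, not part of the C++ standard library.

```cpp
#include <cassert>
#include <string>

// Remove all trailing '\r' and '\n' characters in place and
// return a reference to the same string.
std::string& chomp(std::string& line) {
    while (!line.empty()) {
        char last = line[line.size() - 1];
        if (last != '\n' && last != '\r') {
            break;
        }
        line.erase(line.size() - 1);
    }
    return line;
}

// Usage sketch: both line-ending styles reduce to the same text.
void chomp_example() {
    std::string windows_line = "hello\r\n";
    std::string linux_line = "hello\n";
    assert(chomp(windows_line) == "hello");
    assert(chomp(linux_line) == "hello");
}
```

Stripping in a loop, rather than removing a single character, also covers a lone CR left behind after `std::getline` has already consumed the LF.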
C/C++: Split a Line
• Split a line by a specific character
C/C++: Split a Line (cont.)
• Split a line
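Splitting by a single character can be sketched with `std::string::find` and `substr` alone. The helper name `split_line` is hypothetical, and keeping empty fields (as needed for tab- or comma-separated data) is an assumption about the desired behaviour.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split `line` at every occurrence of `delim`.
// Empty fields are kept, so "a,,b" yields three fields.
std::vector<std::string> split_line(const std::string& line, char delim) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type pos = line.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start));  // last field
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;  // continue after the delimiter
    }
    return fields;
}

// Usage sketch.
void split_line_example() {
    std::vector<std::string> fields = split_line("word\tnoun\t42", '\t');
    assert(fields.size() == 3);
    assert(fields[0] == "word" && fields[1] == "noun" && fields[2] == "42");
}
```

With this design an empty input line produces one empty field, which keeps the field count predictable for fixed-column data.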
C++ Boost: Case Conversion
• to_upper: Convert a string to upper case
• to_lower: Convert a string to lower case
C++ Boost: Split
• split(): Splits the input into parts
Regular Expression
• Regular expressions are a powerful tool for string operations.
An Example
• Find: " *\([0-9/ ]+\) *[0-9\.\?]+%"   Replace with: (empty)
• Find: "^( *)([0-9]+)( *)"   Replace with: "\2\t"
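The two find/replace pairs above read like editor search-and-replace rules. As a rough sketch, they can be reproduced with `std::regex_replace` from C++11. The helper names `strip_stats` and `number_then_tab` are hypothetical, the leading space before the first `*` is an assumption (a pattern cannot start with a bare `*`), and the sample inputs are invented for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Rule 1: delete a parenthesised count followed by a percentage,
// for example " (3/10) 30.0%". The replacement is the empty string.
std::string strip_stats(const std::string& line) {
    static const std::regex stats(R"re( *\([0-9/ ]+\) *[0-9\.\?]+%)re");
    return std::regex_replace(line, stats, "");
}

// Rule 2: replace a leading number (with optional padding spaces)
// by the number followed by a tab. std::regex writes the
// back-reference \2 as $2 in the replacement text.
std::string number_then_tab(const std::string& line) {
    static const std::regex leading_number("^( *)([0-9]+)( *)");
    return std::regex_replace(line, leading_number, "$2\t");
}

// Usage sketch on invented sample lines.
void regex_example() {
    assert(strip_stats("word (3/10) 30.0%") == "word");
    assert(number_then_tab("  42   apples") == "42\tapples");
}
```

Editors and `std::regex` differ in replacement syntax (`\2` versus `$2`), so the rules need a small translation when moved between tools.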
An Introduction to Perl
• Excels at pattern search and text manipulation (Practical Extraction and Report Language)
• Open source / free software
• Cheap! Free and available for all systems
• Can be used and installed without restriction
• Open source promotes portability
• Vastly expandable through freely available modules (add-on libraries in the CPAN repository)
• Fewer restrictions and lower cost for commercial use
• Fancy development tools can be bought if desired
• A centralized source and a linear development path avoid vendor vicissitudes and incompatibilities!
Perl is not compiled
• Perl: the Perl interpreter runs the source code directly
    #!/usr/bin/perl
    $x = 6e9;
    print "Hello world!\n";
    printf "All %d of you!\n", $x;
• C (compiled): the C compiler first translates the source code into a binary executable (10001110110011000111...)
    #include <stdio.h>
    int main(void) {
        double x = 6e9;
        printf("Hello world!\n");
        /* %d expects an int, so the floating-point value is printed with %.0f */
        printf("All %.0f of you!\n", x);
        return 0;
    }
• Both programs print:
    Hello world!
    All 6000000000 of you!
• Source code: plain text (ASCII), human readable, human editable, platform independent
• Binary executable: NOT human readable, NOT human editable, NOT platform independent!
A Taste of Perl: print a message
perltaste.pl: Greet the entire world.
    #!/usr/bin/perl -w
    # The first line is the command interpretation header.
    $x = 6e9;                        # variable assignment statement
    print "Hello world!\n";          # function call (output statement)
    printf "All %d of you!\n", $x;   # function call (output statement)
Scalar Values
Numerical Values
• integer: 5, "3", 0, -307
• floating point: 6.2e9, -4022.33
• hexadecimal/octal: 0x0d4f, 0477
• NOTE: arithmetic is generally done in floating point (usually "double" precision); Perl also keeps native integers internally where it can
String Values
• Double-quoted: interpolates (replaces each variable name or control character with its value)
• Single-quoted: no interpolation (taken as-is)
• Quoting operators: qq//, qw//, etc.
    $day = "Monday";
    String literal         Printed as
    "Happy Monday!\n"      Happy Monday!<NL>
    "Happy $day!\n"        Happy Monday!<NL>
    'Happy Monday!\n'      Happy Monday!\n
    'Happy $day!\n'        Happy $day!\n
String Manipulation
Concatenation
    $dna1 = "ACTGCGTAGC";
    $dna2 = "CTTGCTAT";
• Juxtapose in a string assignment or print statement:
    $new_dna = "$dna1$dna2";
• Use the concatenation operator '.':
    $new_dna = $dna1 . $dna2;
• Add segments serially using incremental concatenation:
    $new_dna = $dna1;
    $new_dna .= $dna2;    # shorthand for: $new_dna = $new_dna . $dna2;
Substitution
DNA transcription: T → U
Substitution operator s///:
    $dna = "GATTACATACACTGTTCA";
    $rna = $dna;
    $rna =~ s/T/U/g;    # "GAUUACAUACACUGUUCA" (the g modifier replaces every T, not only the first)
Exercise: Start with $dna = "gattACataCACTgttca"; and do the same as above. Print out $rna to the screen.
transcribe.pl:
    $dna = "gattACataCACTgttca";
    $rna = $dna;
    $rna =~ s/T/U/g;
    print "DNA: $dna\n";
    print "RNA: $rna\n";
Does it do what you expect? If not, why not?
• Patterns in substitution are case-sensitive!
What can we do?
• Convert all letters to upper (or lower) case (preferred when possible)
• If we want to retain mixed case, use the transliteration operator tr///:
    $rna =~ tr/tT/uU/;
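Case conversion
    $string = "acCGtGcaTGc";
Upper case:
    $dna = uc($string);            # "ACCGTGCATGC"
    $dna = uc $string;             # same, without parentheses
    $dna = "\U$string";            # same, using a string escape
Lower case:
    $dna = lc($string);            # "accgtgcatgc"
    $dna = "\L$string";            # same, using a string escape
Sentence case:
    $dna = ucfirst(lc($string));   # "Accgtgcatgc" (ucfirst alone leaves the other letters unchanged)
    $dna = "\u\L$string";          # same, using string escapes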
Perl in NLP
• Dictionary lookup
• Word frequency
• Chinese word segmentation
• POS (part-of-speech) tagging
• …
• Whatever else you might need