Text Operations Liqi Gao

Utilities • C/C++ Library • Perl (ActivePerl) • Regular Expressions • EditPlus / UltraEdit • Excel

C/C++ Language • Standard library: • Read a line • Remove a CR or LF • Split a line • C++ Boost Library • Case Conversion • Trimming • Replace Algorithm • Finding Algorithm • Split

C/C++: Read a Line • Though it’s simple, it’s useful! • Three methods:

C/C++: Remove CR/LF • Reading a line on Windows and Linux platforms

C/C++: Remove CR/LF (cont.) • The noisy CR (carriage return)
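Windows text files end each line with CR LF, while Linux files end with LF alone. When a Windows file is read on Linux, `std::getline` consumes the LF but leaves the CR at the end of the string. The slide's own code was not preserved; a minimal sketch of one way to strip it follows, where the helper name `chomp` is a hypothetical borrowed from Perl.

```cpp
#include <string>

// Remove every trailing '\r' or '\n' from the line, in place.
// Characters in the middle of the line are left untouched.
void chomp(std::string& line) {
    while (!line.empty() && (line.back() == '\r' || line.back() == '\n')) {
        line.pop_back();
    }
}
```

Calling `chomp(line)` right after each read makes the rest of the program independent of the platform that produced the file.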

C/C++: Split a Line • Split a line by a specific character

C/C++: Split a Line (cont.) • Split a line
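The splitting code from these two slides is missing from the transcript. One possible sketch, using only `std::string::find` and `substr`, is shown here; the name `split_line` and the choice to keep empty fields are assumptions made for illustration.

```cpp
#include <string>
#include <vector>

// Split a line at every occurrence of delim.
// Empty fields are kept, so "a,,b" yields {"a", "", "b"}.
std::vector<std::string> split_line(const std::string& line, char delim) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type pos = line.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start));  // last field
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;  // continue after the delimiter
    }
    return fields;
}
```

Keeping empty fields preserves column positions in tab- or comma-separated data; callers that do not want them can filter the result.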

C++ Boost: Case Conversion • to_upper: Convert a string to upper case • to_lower: Convert a string to lower case
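Boost's `to_upper` and `to_lower` modify the string in place. For readers without Boost installed, the same effect on ASCII text can be sketched with the standard library alone. This is a stand-in, not the Boost implementation, and the names `to_upper_ascii` and `to_lower_ascii` are hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Convert a string to upper case in place (ASCII letters in the "C" locale).
void to_upper_ascii(std::string& s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
}

// Convert a string to lower case in place (ASCII letters in the "C" locale).
void to_lower_ascii(std::string& s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
}
```

The lambda takes `unsigned char` because passing a negative `char` value to `std::toupper` or `std::tolower` is undefined behavior.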

C++ Boost: Trimming & Replace
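The slide's Boost examples were not preserved. As a rough standard-library stand-in for what trimming and replace-all do, a sketch could look like the following; `trim_ws` and `replace_each` are hypothetical names, and the whitespace set is an assumption.

```cpp
#include <string>

// Return a copy of s without leading or trailing spaces, tabs, CRs, or LFs.
std::string trim_ws(const std::string& s) {
    const char* ws = " \t\r\n";
    std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos) {
        return "";  // the string is empty or all whitespace
    }
    std::string::size_type last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Replace every occurrence of from with to, in place.
void replace_each(std::string& s, const std::string& from, const std::string& to) {
    if (from.empty()) {
        return;  // nothing sensible to search for
    }
    std::string::size_type pos = 0;
    while ((pos = s.find(from, pos)) != std::string::npos) {
        s.replace(pos, from.size(), to);
        pos += to.size();  // skip the text just inserted
    }
}
```

Advancing past the inserted text keeps the loop from rescanning its own output, which matters when `to` contains `from`.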

C++ Boost: Split • split(): splits the input into parts

Regular Expressions • Regular expressions are a powerful tool for string operations.

An Example • Find: *$[0-9/ ]+$ *[0-9\.\?]+% Replace with: (empty) • Find: ^( *)([0-9]+)( *) Replace with: \2\t
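The pattern `^( *)([0-9]+)( *)` with the replacement `\2\t` strips the spaces around a leading number and puts a tab after it. A sketch of the same replacement with C++ `std::regex` follows. It assumes the default ECMAScript grammar, where the second group is written `$2` rather than the editor-style `\2`; the function name `number_then_tab` is hypothetical.

```cpp
#include <regex>
#include <string>

// Replace optional spaces, a number, and optional spaces at the start of
// the line with just the number followed by a tab.
std::string number_then_tab(const std::string& line) {
    static const std::regex leading_number("^( *)([0-9]+)( *)");
    return std::regex_replace(line, leading_number, "$2\t");
}
```

Lines that do not start with a number are returned unchanged, because `std::regex_replace` copies unmatched input through as-is.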

An Introduction to Perl • Excels at pattern search and text manipulation (Practical Extraction and Report Language) • Open source / free software • Cheap! Free and available for all systems • Can be used and installed without restriction • Open source promotes portability • Vastly expandable through freely available modules (add-on libraries in the CPAN repository) • Fewer restrictions and lower cost for commercial use • Can buy fancy development tools if desired • Centralized source and a linear development path avoid vendor vicissitudes and incompatibilities!

Perl is not compiled

Perl: the source code is run directly by the Perl interpreter.

    #!/usr/bin/perl
    $x = 6e9;
    print "Hello world!\n";
    printf "All %d of you!\n", $x;

C (compiled): the C compiler first turns the source code into a binary executable (10001110110011000111011100001110 ...), which is then run.

    #include <stdio.h>
    int main() {
        double x;
        x = 6e9;
        printf("Hello world!\n");
        printf("All %.0f of you!\n", x);
        return 0;
    }

Output in both cases:

    Hello world!
    All 6000000000 of you!

• Source Code • Plain text (ASCII) • Human readable • Human editable • Platform independent • Binary Executable • NOT human readable • NOT human editable • NOT platform independent!

A Taste of Perl: print a message

perltaste.pl: Greet the entire world.

    #!/usr/bin/perl -w
    # ^ command interpretation header
    $x = 6e9;                         # variable assignment statement
    print "Hello world!\n";           # function call (output statement)
    printf "All %d of you!\n", $x;    # function call (output statement)

Scalar Values: Numerical Values • integer: 5, "3", 0, -307 • floating point: 6.2e9, -4022.33 • hexadecimal/octal: 0x0d4f, 0477 • NOTE: all numerical values are stored as floating-point numbers (usually "double" precision)

String Values • Double-quoted: interpolates (replaces a variable name or control character with its value) • Single-quoted: no interpolation done (as-is) • Quoting operators: qq//, qw//, etc.

    $day = "Monday";

    "Happy Monday!\n"    prints    Happy Monday!<NL>
    "Happy $day!\n"      prints    Happy Monday!<NL>
    'Happy Monday!\n'    prints    Happy Monday!\n
    'Happy $day!\n'      prints    Happy $day!\n

String Manipulation: Concatenation

    $dna1 = "ACTGCGTAGC";
    $dna2 = "CTTGCTAT";

• Juxtapose in a string assignment or print statement: $new_dna = "$dna1$dna2"; • Use the concatenation operator '.': $new_dna = $dna1 . $dna2; • Add segments serially using incremental concatenation: $new_dna = $dna1; $new_dna .= $dna2; (shorthand for: $new_dna = $new_dna . $dna2;)

Substitution • DNA transcription: T → U • Substitution operator s///:

    $dna = "GATTACATACACTGTTCA";
    $rna = $dna;
    $rna =~ s/T/U/g;    # "GAUUACAUACACUGUUCA"

Exercise: Start with $dna = "gattACataCACTgttca"; and do the same as above. Print out $rna to the screen.

transcribe.pl:

    $dna = "gattACataCACTgttca";
    $rna = $dna;
    $rna =~ s/T/U/g;
    print "DNA: $dna\n";
    print "RNA: $rna\n";

Does it do what you expect? If not, why not? • Patterns in substitution are case-sensitive! What can we do? • Convert all letters to upper (or lower) case (preferred when possible) • If we want to retain mixed case, use the transliteration operator tr///: $rna =~ tr/tT/uU/;

Case conversion

    $string = "acCGtGcaTGc";

    # Upper case: "ACCGTGCATGC"
    $dna = uc($string);
    $dna = uc $string;             # same, without parentheses
    $dna = "\U$string";            # same, via interpolation

    # Lower case: "accgtgcatgc"
    $dna = lc($string);
    $dna = "\L$string";

    # Sentence case: "Accgtgcatgc"
    $dna = ucfirst(lc($string));   # ucfirst alone leaves the other letters unchanged
    $dna = "\u\L$string";

Perl in NLP • Dictionary lookup • Word frequency • Chinese word segmentation • POS • …… • Whatever else you might need

Case study

Thanks for your attention
