390 likes | 604 Views
Programming Interest Group http://www.comp.hkbu.edu.hk/~chxw/pig/index.htm. Tutorial Three Strings & Sorting. Character Codes. Character codes are mappings between numbers and the symbols which make up a particular alphabet. ASCII: American Standard Code for Information Interchange
E N D
Programming Interest Grouphttp://www.comp.hkbu.edu.hk/~chxw/pig/index.htm Tutorial Three Strings & Sorting
Character Codes • Character codes are mappings between numbers and the symbols which make up a particular alphabet. • ASCII: American Standard Code for Information Interchange • A single byte character code • 128 characters are specified • The highest-order bit is left as zero • Unicode • Two bytes per character • Natively supported by Java
Some Properties about ASCII • Uppercase letters, lowercase letters, and numerical digits appear sequentially. • To iterate through all the lowercase letters • for(ch = ‘a’; ch <= ‘z’; ch++) • is character ch uppercase? • (ch >= ‘A’) && (ch <= ‘Z’) • Convert uppercase character ch to lowercase • ch – (‘A’ – ‘a’)
Strings • Strings are sequences of characters. • Different programming languages may have different representations! • C/C++: • Null-terminated array: the string ends with null character ‘\0’. • Enough array must be allocated to hold the largest possible string (plus the null). • Java: • Array plus length
Manipulating Strings • The length of the string • Copy a string • Reverse a string • Concatenate two strings • Search a character in a string • Search a string in a string • String matching problem (or string searching problem)
String Matching Problem • String matching algorithms are also used to search for particular patterns in DNA sequences. • E.g., find the location of pattern P = abaa in the text T = abcabaabcabac
String Matching Algorithms • http://en.wikipedia.org/wiki/String_searching_algorithm • Given: text T with length n, pattern P with length m • Naïve algorithm • Matching time: O((n-m+1)m) • Rabin-Karp algorithm • Preprocessing time: (m) • Matching time: O((n-m+1)m) • Knuth-Morris-Pratt (or KMP) algorithm • Preprocessing time: (m) • Matching time: (n) • Boyer-Moore (or BM) algorithm • Preprocessing time: (m) • Matching time: worst (n), average (n/m)
C String Library Functions • ctype.h and string.h #include <ctype.h> int isalpha(int c); int isupper(int c); int islower(int c); int isdigit(int c); int ispunct(int c); int isxdigit(int c); int isprint(int c); int toupper(int c); int tolower(int c); #include <string.h> char *strcat(char *dst, const char *src); char *strncat(char *dst, const char *src, size_t n); int strcmp(const char *s1, const char *s2); int strncmp(const char *s1, const char *s2, size_t n); char *strcpy(char *dst, const char *src); char *strncpy(char *dst, const char *src, size_t n); size_t strlen(const char *s); char *strstr(const char *s1, const char *s2); char *strtok(char *s1, const char *s2);
C++ String Library functions • C++ supports the c-style strings • C++ also has a string class string::size() string::empty() string::append(s) string::erase(n, m) string::insert(size_type n, const string&s) string::find(s) string::rfind(s)
Java String Objects • String class: java.lang.String • http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html • In Java, strings are constant; their values cannot be changed after they are created. • StringBuffer class: java.lang.StringBuffer • http://java.sun.com/j2se/1.4.2/docs/api/java/lang/StringBuffer.html • A string buffer is like a String, but can be modified. At any point in time it contains some particular sequence of characters, but the length and content of the sequence can be changed through certain method calls.
Example: Corporate Renaming • Corporate name changes are occurring with ever greater frequency, as companies merge, buy each other out, try to hide from bad publicity…. These changes make it difficult to figure out the current name of a company when reading old documents. • Your company, Digiscam, has put you to work on a program which maintains a database of corporate names changes and does the appropriate substitutions to bring old documents up to date. • Your program should take as input a file with a given number of corporate name changes, followed by a given number of lines of text for you to correct. Only exact matches of the string should be replaced. • There will be at most 100 corporate changes, and each line of text is at most 1,000 characters long.
Sample Input and Output 4 “Anderson Consulting” to “Accenture” “Enron” to “Dynegy” “DEC” to “Compaq” “TWA” to “American” 5 Anderson Accounting begat Anderson Consulting, which offered advice to Enron before it DECLARED bankruptcy, which made Anderson Consulting quite happy it changed its name in the first place! Output: Anderson Accounting begat Accenture, which offered advice to Dynegy before it CompaqLARED bankruptcy, which made Anderson Consulting quite happy it changed its name in the first place!
Required String Operations • Read strings • Store strings • Search strings for patterns • Modify strings • Print strings
Read and Store #include <string.h> #define MAXLEN 1001 /* longest possible string */ #define MAXCHANGES 101 /* maximum number of name changes */ typedef char string[MAXLEN]; string mergers[MAXCHANGES][2]; /* store before/after corporate names */ int nmergers; /* number of different name changes */ read_changes( ) { int i; scanf(“%d\n”, &nmergers); for (i = 0; i < nmergers; i++ ) { read_quoted_string( &(mergers[i][0]) ); read_quoted_string( &(mergers[i][1]) ); } } read_quoted_string(char *s) { int i = 0; char c; while ( (c=getchar()) != ‘\”’); while ( (c=getchar()) != ‘\”’) { s[i] = c; i++; } s[i] = ‘\0’; }
Searching for PatternsReturn the position of the first occurrence of the pattern p in the text t, and -1 if it does not occur. int findmatch( char *p, char *t) { int i, j; int plen, tlen; plen = strlen(p); tlen = strlen(t); for ( i = 0; i <= (tlen – plen); i++) { j = 0; while ( (j < plen) && (t[i+j] == p[j]) ) j++; if (j == plen) return (i); } return (-1); }
Manipulating StringsReplace the substring of length xlen starting at position pos in string s with the contents of string y. replace_x_with_y(char *s, int pos, int xlen, char *y) { int i; int slen, ylen; slen = strlen(s); ylen = strlen(y); if (xlen >= ylen) for( i = (pos+xlen); i <= slen; i++) s[i+(ylen-xlen)] = s[i]; else for( i = slen; i >= (pos+xlen); i-- ) s[i+(ylen-xlen)] = s[i]; for (i = 0; i < ylen; i++) s[pos+i] = y[i]; }
Completing the Merger main () { string s; char c; int nlines; int i, j; int pos; read_changes(); scanf(“%d\n”, &nlines); for ( i = 1; I <= nlines; i++ ) { j = 0; while ( (c=getchar()) != ‘\n’) { s[j] = c; j++; } s[j] = ‘\0’; for( j = 0; j < nmergers; j++ ) while ( (pos = findmatch(mergers[j][0], s) ) != -1 ) { replace_x_with_y (s, pos, strlen(mergers[j][0], mergers[j][1]); } printf(“%s\n”, s); } }
Practice • http://acm.uva.es/p/v8/848.html • http://acm.uva.es/p/v8/850.html • http://acm.uva.es/p/v100/10010.html • http://acm.uva.es/p/v100/10082.html • http://acm.uva.es/p/v101/10132.html • http://acm.uva.es/p/v101/10150.html (*) • http://acm.uva.es/p/v101/10188.html • http://acm.uva.es/p/v102/10252.html
Sorting • Sorting is the most fundamental algorithmic problem in computer science. • Internal sorting: the entire sort can be done in main memory (the input fit into main memory) • External sorting: cannot be performed in main memory and must be done on disk or tape (the input is much too large to fit into memory) • To see a list of sorting algorithms: • http://en.wikipedia.org/wiki/Sorting_algorithm
Properties of sorting algorithms • Computational complexity of element comparisons in terms of the size of the list • Worst case, best case, average case • Sort algorithms which only use an abstract key comparison operation always need at least Ω(n log n) comparisons on average. • Memory usage • some sorting algorithms are "in place", such that only O(1) or O(log n) memory is needed beyond the items being sorted, while others need to create auxiliary locations for data to be temporarily stored. • Stability • stable sorting algorithms maintain the relative order of records with equal keys (i.e. values). That is, a sorting algorithm is stable if whenever there are two records R and S with the same key and with R appearing before S in the original list, R will appear before S in the sorted list. • When equal elements are indistinguishable, such as with integers, stability is not an issue. • Unstable sorting algorithms may change the relative order of records with equal keys. • Unstable sorting algorithms can be specially implemented to be stable.
Some simple algorithms • Bubble sort (or sinking sort) • Selection sort • Insertion sort • Best case: O(N) • Worst case: O(N2) • Average case: (N2) • All are O(N2)
Shellsort • Proposed by Donald Shell in 1959. • Increment sequences h1, h2, h3, ..., ht, used in reverse order; h1=1. • After a phase, with an increment hk, A [i] <= A [i + hk]. All elements spaced hk apart are sorted. • The action of an hk-sort is to perform an insertion sort on hk independent subarrays. • The running time of shell sort depends on the choice of increment sequence. • The average-case running time of shellsort, using Hibbard’s increments, is thought to be O(N5/4) [worst case: (N3/2)] • The average-case running time of shellsort, using Sedgewick’s increments, is conjectured to be O(N7/6) [worst case: (N4/3)] • Shellsort is simple, and the performance is acceptable even for N in the tens of thousands.
More complicated algorithms • Mergesort • A good example of divide and conquer • Stable • Heapsort • Make use of data structure heap • Unstable • Running time: O(NlogN) • Remark: • Merge sort is the cornerstone of most external sorting algorithm
Quicksort • Quicksort is the fastest known sorting algorithm in practice. • A divide-and-conquer recursive algorithm • Average running time is O(NlogN). • Worst running time is O(N2). • Quicksort an array S: • If the number of elements in S is 0 or 1, return; • Pick any element v in S. This is called the pivot. • Partition S-{v} into two disjoint groups:S1 = {x S-{v}|x v}, and S2 = {x S-{v}|x v}. • Return {quicksort (S1) followed by v followed by quicksort (S2)}. • Efficient implementations of Quicksort are typically unstable. • Details can be found at any data structure & algorithm textbook, or goto http://en.wikipedia.org/wiki/Quicksort
Non-comparison sorts • Not limited by the O(nlog n) lower bound • Bucket sort • http://en.wikipedia.org/wiki/Bucket_sort • Radix sort • http://en.wikipedia.org/wiki/Radix_sort • Counting sort • http://en.wikipedia.org/wiki/Counting_sort
Sorting Library Functions • In C: Sort an array: #include <stdlib.h> void qsort(void *base, size_t nmemb, size_t size, int (*compar) (const void *, const void *)); This function sorts an array with nmemb elements pointed by base, where each element is size-bytes long. Binary search: #include <stdlib.h> void *bsearch(const void *key, const void *base, size_t nmemb, size_t size, int (*compar)(const void *, const void *));
qsort( ) example int main(void) { char line[1024]; char *line_array[1024]; int i = 0; int j = 0; while((fgets(line, 1024, stdin)) != NULL) if(i < 1024) line_array[i++] = strdup(line); else break; sortstrarr(line_array, i); while(j < i) printf("%s", line_array[j++]); return 0; } #include <stdio.h> #include <string.h> #include <stdlib.h> void sortstrarr(void *array, unsigned n); static int cmpr(const void *a, const void *b); static int cmpr(const void *a, const void *b) { return strcmp(*(char **)a, *(char **)b); } void sortstrarr(void *array, unsigned n) { qsort(array, n, sizeof(char *), cmpr); }
Sorting and Searching in C++ • The C++ STL includes methods for sorting, searching, and more. void sort(RandomAccessIterator bg, RandomAccessIterator end); void sort(RandomAccessIterator bg, RandomAccessIterator end, BinaryPredicate op); void stable_sort(RandomAccessIterator bg, RandomAccessIterator end); void stable_sort(RandomAccessIterator bg, RandomAccessIterator end, BinaryPredicate op);
Sorting and Searching in Java • java.util.Arrays static void sort(Object [] a) static void sort(Object [] a, Comparator c) static int binarysearch(Object [] a, Object key) static int binarysearch(Object [] a, Object key, Comparator c) sort() methods in jave.util.Arrays are all stable.
Example 1 • http://acm.uva.es/p/v100/10041.html Background The world-known gangster Vito Deadstone is moving to New York. He has a very big family there, all of them living in Lamafia Avenue. Since he will visit all his relatives very often, he is trying to find a house close to them. Problem Vito wants to minimize the total distance to all of them and has blackmailed you to write a program that solves his problem. Input The input consists of several test cases. The first line contains the number of test cases. For each test case you will be given the integer number of relatives r ( 0 < r < 500) and the street numbers (also integers) s1, s2, …, sr where they live ( 0 < si < 30000 ). Note that several relatives could live in the same street number. Output For each test case your program must write the minimal sum of distances from the optimal Vito's house to each one of his relatives. The distance between two street numbers si and sj is dij= |si-sj|.
Example 1 • If there is 0 or 1 relative, just return 0; • If there are 2 relatives: • If there are 3 relatives: • If there are 4 relatives: • Can you see the solution now?
Example 2 The following is a list of some sorting algorithms. Bubble sort, heap sort, insertion sort, merge sort, quick sort, selection sort, shell sort… My business here is to give you some numbers, and to sort them is your business. Attention, I want the smallest number at the top of the sorted list. Input: The input file consist of a series of data sets. Each data set has two parts: the first part contains two non-negative integers, n (1 ≤ n ≤ 100,000) representing the total of numbers you will get, and m (1 ≤ m ≤ n) representing the interval of the output sorted list. The second part contains n positive integers which will be less than 2,000,000,000. The input is terminated by a line with two zeros. Output: For each data set, you should output several numbers in ONE line. After you get the sorted list, you should output the first number of each m numbers, and you should print exact ONE space between two adjacent numbers. And please make sure that there should NOT be any blank line between outputs of two adjacent data sets.
Example 2 Sample Input: 8 2 3 5 7 1 8 6 4 2 0 0 Output for the Sample Input: 1 3 5 7
Example 3 Dr. Lee cuts a string S into N pieces, s[0], s[1], …, s[N-1]. Now, Dr. Lee gives you these N sub-strings. There might be several possibilities that the string S could be. For example, if Dr. Lee gives you three sub-strings {“a”, “ab”, “ac”}, the string S could be “aabac”, “aacab”, “abaac”, …. Your task is to output the lexicographically smallest S. Input: The first line of the input is a positive integer T. T is the number of the test cases. The first line of each test case is a positive integer N (1 ≤ N ≤ 8) which represents the number of sub-strings. After that, N lines followed. The i-th line is the i-th sub-string. Assume that the length of each sub-string is positive and less than 100. Output: The output of each test is the lexicographically smallest S. No redundant spaces are needed.
Example 3 Sample Input: 1 3 a ab ac Output for the Sample Input: aabac
Example 3 • Analysis: • Solution One: brute-force (N is small) • 8! = 40320 • A better solution: • Define a new relation between two strings X and Y • If XY < YX, then X << Y. E.g. X = “b”, Y = “ba”. We have • X < Y. But Y << X because bab < bba • Try to prove that if X << Y and Y << Z, then X << Z • Then we can sort the N strings based on “<<“ operator • Combine the sorted string
Example 3 #include <iostream> #include <string> #include <algorithm> Using namespace std; int T, n; string s[10]; bool cmp(string x, string y) { return x + y < y + x; } int main( ) { int i; cint >> T; while (T--) { cin >> n; for(i = 0; i < n; i++) cin >> s[i]; sort(s, s+n, cmp); for (i = 0; i < n; i++) cout << s[i]; cout << endl; } return 0; }
Practice • http://acm.uva.es/p/v1/120.html • http://acm.uva.es/p/v100/10026.html • http://acm.uva.es/p/v100/10037.html (*) • http://acm.uva.es/p/v101/10138.html • http://acm.uva.es/p/v101/10152.html • http://acm.uva.es/p/v101/10191.html • http://acm.uva.es/p/v101/10194.html