230 likes | 383 Views
Design and Implementation*. Objective: To design and implement a program for a relatively small yet reasonably complicated problem. To introduce and review a variety of implementation languages and to have students review the pros and cons of different implementation choices and languages.
E N D
Design and Implementation* • Objective: To design and implement a program for a relatively small yet reasonably complicated problem. To introduce and review a variety of implementation languages and to have students review the pros and cons of different implementation choices and languages. • “Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious” – Frederick P. Brooks, The Mythical Man Month *The examples in these slides come from Brian W. Kernighan and Rob Pike, The Practice of Programming, Addison-Wesley, 1999.
Themes • “Once the data structures are laid out, the algorithms tend to fall into place, and the coding is comparatively easy” • The choice of programming language is relatively unimportant to the overall design. • Comparing implementations demonstrates how languages can help or hinder, and ways in which they are unimportant.
Topics • Problem: Generate random English text that reads well. • Program: some data comes in, some data goes out, and the processing depends on a little ingenuity. • Implementations: C, C++, Java, Perl
Failed Attempts • Generate random letters (even with proper frequency). • Generate random words chosen from a dictionary • Need a statistical model with more structure (frequency of phrases)
The Markov Algorithm - Learn • "Learn" from input: • Look at alln-word phrases (prefixes); • Keep a list of words that follow (one-word suffices) for each prefix • Store the prefix/suffix-list in a dictionary • The key (entry) is the prefix • The satellite data, associated w/each prefix, is the list of suffices • Note, we're "faking" a multi-map. Each prefix has, potentially, several possible suffices.
Sample • The following example uses a subset of Dr. Brooks' quote • Uses a prefix length of 2 words • We won't bother w/stripping out punctuation • We won't worry about capitalisation (so, "We" and "we" are different strings) • We will use a special (Null, Null) prefix to indicate the start.
Sample (notes) • We don't filter out duplicates • E.g., "and" appears twice for the prefix ("your", "tables") • This is fine. "and" is, from the input, a nice word to follow. It should be a more probable choice • We also use the suffix (null) to say "Here's a decent place to end our story."
Markov Algorithm – generate text • set w1 and w2 to the first two words in the text • print w1 and w2 • loop: • randomly choose w3, one of the successors of w1 and w2 (in sample text) • print w3 • replace w1 and w2 by w2 and w3
Sample • Say the current prefix is (me, your) • Go to our dictionary, find the entry • Choose a random word from the suffix list (say, "tables") • Write that word out • Make a new prefix: ("your", "tables") • Repeat, until we choose the suffix (null), or decide we've output enough words
Implementation • See lecture outline for links to 4 different implementations • C (see Makefile) • C++ • Java • Perl • Python • What are the pros and cons of the different implementations?
The Data Structures • Python and Perl have everything we need built in • Java and C++ provide appropriate containers in their standard libraries • In C we'll need to roll these things ourselves
The Data Structures - Python • The dictionary (dict) is given to us • The prefixes, the keys in the dictionary, will be 2-element tuples (immutable) • The satellite data will be stored in a list (an array) • If a prefix doesn't already exist in the table, we insert it, along w/an empty list, [] • We just append the new suffix onto the end of this list
Hash Tables • We need a dictionary to map a prefix to its list of suffices (given a prefix, what are the possible suffices?) • In C we'll roll our own hash table • The prefix (key) is hashed, returning a value on [0, N-1], where N is the table size • Each element in the table (array) is a bucket, a linked-list of prefixes • Each prefix is associated w/its own list of suffices
The prefix in C • Prefix – stored in an array of strings (ptrs to char): char* pref[ NPREF ]; • NPREF – the # of words in a prefix, a global constant • Note, each string exists once in memory (after call to strdup); various objects just store pointers to these strings.
pref[0] pref[1] suf next Hash table entries in C - State • Use chains (linked list of entries) to handle collisions • Each entry (or State) is a node in a linked-list. It contains: • The key (the prefix) • The satellite data (pointer to a list of suffix) • A pointer to the next State
word word next next Satellite Data – C • More than one suffix may be associated w/a given entry (State) • Stored in a linked list • List made of structs, of type Suffix: • char* word, the suffix (word) itself • pointer to another Suffix
The hash table in C • statetab – the table itself • Use chains (linked list of States) to handle collisions • statetab is an array of pointers to States (to the beginning of a linked list)
a state: pref[0] pref[1] suf next word next pref[0] pref[1] word suf next next The hash table – C “me” “your” “flowcharts” a suffix: another state (same hash key): another suffix: “tables”
C - eprintf • These are simply some user-defined functions (mostly wrappers) to help w/the coding: void eprintf( char *, ... ); • prints an error msg to stderr, then exits void weprintf( char *, ... ); • prints a warning to stderr; doesn't exit char* estrdup( char *s ); • calls strdup( s ) to dynamically allocate memory and copy s; exits if malloc fails
C – eprintf (cont.) void* emalloc(size_t n); Calls malloc(n). Exits upon failure void* erealloc(void *p, size_t n); Calls realloc( p, n ). Exits upon failure void setprogname(char *); If set, uses program name in messages char* progname(void);
memmove • Moves (low-level) a block of memory memmove( d, s, n ) • Moves n-bytes, starting at s (source) to t (target) • Given prefix (w1, …, wn-1), with suffix wn: memove(prefix, prefix+1, (NPREF-1)*sizeof(prefix[0]) ); prefix[NPREF-1] = suffix; prefix is now ( w2, …, wn ) • Just slides everybody down 1, to make/get next prefix
Choosing a suffix • Consider this line, in generate(): if ( rand() % ++nmatch == 0 ) • We walk the list of suffices • For the first one, we mod by 1 (100% chance of choosing this one) • At the next one, we mod by 2 (50% chance of choosing this one) • At the 3rd, 1 in 3 • At the 4th, 1 in 4 • … • Convince yourself that each suffix has equal probability of being chosen.