Arrays and Strings

Arrays and Strings CSCI 2720 University of Georgia Spring 2007

The Array ADT • Stores a sequence of consecutively numbered objects • Each object can be accessed (selected) using its index

More formally … • Given integers l and u with u >= l-1, the interval l..u is defined to be the set of integers i such that l <= i <= u (the interval is empty when u = l-1) • An array is a function from any interval (the index set of the array) to a set of objects or elements (the value set of the array)

Formally, continued … • If X is an array and i is a member of its index set, we write X[i] to denote the value of X at i • The members of the range of X are known as the elements of X

The Array ADT • Access(X,i) • Length(X) • Assign(X,i,v) • Initialize(X,v) • Iterate(X,F)

Access(X,i) • Return X[i]

Length(X) • Return u - l + 1, the number of elements in the index set l..u of X

Assign(X,i,v) • Replace array X with a function whose value on i is v (and whose value on all other arguments is unchanged). • We also write this as: • X[i] <- v

Initialize(X,v) • Assign v to every element of array X

Iterate(X,F) • Apply F to each element of array X in order, from smallest index to largest index. • F is an action on a single array element. • for i = l to u do F(X[i])
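
The five operations above can be sketched in Python. This is a minimal illustration, not part of the original slides: the class name `Array`, the method names, and the choice of a Python list as backing store are all assumptions made for the example.

```python
class Array:
    """Array over the index interval l..u (inclusive), following the ADT above."""

    def __init__(self, l, u, v=None):
        if u < l - 1:
            raise ValueError("need u >= l - 1")
        self.l, self.u = l, u
        # Backing store (an assumption of this sketch): one cell per index.
        self._cells = [v] * (u - l + 1)

    def _offset(self, i):
        # Map an index in l..u to a position in the backing list.
        if not (self.l <= i <= self.u):
            raise IndexError(f"{i} is not in {self.l}..{self.u}")
        return i - self.l

    def access(self, i):
        return self._cells[self._offset(i)]

    def length(self):
        return self.u - self.l + 1

    def assign(self, i, v):
        self._cells[self._offset(i)] = v

    def initialize(self, v):
        for k in range(len(self._cells)):
            self._cells[k] = v

    def iterate(self, f):
        # Apply f to each element, from smallest index to largest.
        for i in range(self.l, self.u + 1):
            f(self.access(i))


X = Array(3, 6, v=0)      # index set 3..6
X.assign(4, 17)
seen = []
X.iterate(seen.append)    # collects the elements in index order
```

Note that `Array(1, 0)` is legal and has length 0, matching the condition u >= l-1.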

String • A special type of array • If Σ is any finite set, then a string over Σ is an array whose value set is Σ and whose index set is 0..n-1 for some non-negative n • The set Σ is called an alphabet • Each element of Σ is called a character • Σ often consists of the Roman alphabet, plus digits, the space, and common punctuation marks

Strings • If w is a string, then • Length(w) = n • Also written |w| • If w = TREE, then • w is a string of length 4 • w[0] = T, w[1] = R • The null string is the string whose domain is the empty interval • Has no elements • Written ε

String-specific operations • Substring(w,i,m) • Concat(w1,w2)

Substring(w,i,m) • w is a string; i, m integers • Returns the string of length m containing the portion of w that starts at i • Formally: • returns a string w' with indices 0..m-1 such that w'[k] = w[i+k] for each k satisfying 0 <= k < m • only applies if • 0 <= i <= |w| and • 0 <= m <= |w| - i • otherwise, returns the null string

Substring … • Example: w = SNICKERING • Substring(w,2,3) returns ICK • Substring(w,3,0) returns the null string • Substring(w,10,3) returns the null string • Prefix • each Substring(w,0,j) for 0 <= j <= |w| is a prefix of w • Suffix • each Substring(w,j,|w|-j) for 0 <= j <= |w| is a suffix of w
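
The Substring operation and the prefix and suffix definitions can be checked with a short Python sketch. It assumes the bounds 0 <= i <= |w| and 0 <= m <= |w| - i, which are the ones consistent with the SNICKERING examples; the function name `substring` and the use of Python's empty string for the null string are choices made for this illustration.

```python
def substring(w, i, m):
    """Return the length-m portion of w starting at index i, or the null string."""
    if 0 <= i <= len(w) and 0 <= m <= len(w) - i:
        return w[i:i + m]
    return ""  # out-of-range requests yield the null string


w = "SNICKERING"
# Every prefix is substring(w, 0, j); every suffix is substring(w, j, |w| - j).
prefixes = [substring(w, 0, j) for j in range(len(w) + 1)]
suffixes = [substring(w, j, len(w) - j) for j in range(len(w) + 1)]
```

Both lists have |w| + 1 entries, because the null string and w itself each count as a prefix and as a suffix.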

Concat(w1,w2) • returns a string • of length |w1| + |w2| • whose characters are the characters of w1 followed by those of w2 • Concat(w,ε) = Concat(ε,w) = w, where ε is the null string • Example: • w1 = BIRD, w2 = DOG, • Concat(w1,w2) = BIRDDOG • Concat(w2,w1) = DOGBIRD

Tables vs. Arrays • Table = physical organization of memory into sequential cells • Array = an abstract data type, with specific operations • Arrays frequently implemented using tables, but may be implemented in other ways

Multi-dimensional arrays • a function whose range is any set V and whose domain is the Cartesian product of any number of intervals • the Cartesian product of intervals I1, I2, … Id, written as I1 × I2 × … × Id, is the set of all d-tuples <i1, i2, … id> such that ik ∈ Ik for each k.

Multi-D arrays • if C is a multidimensional array and if i = <i1, i2, … id> then C[i1, i2, … id] is the value of C at i • The dimension of a multi-D array is the number of intervals whose Cartesian product makes up the index set • The size of the kth dimension of such an array is the number of elements in Ik

Contiguous Representation of Arrays: Why Computer Scientists start counting at 0 • Store elements in a table:

  address:  x     x+4   x+8   x+12  x+16  x+20
  element:  x[0]  x[1]  x[2]  x[3]  x[4]  x[5]
  value:    17    43    87    94    101   143

• Element x[i] begins at address x + 4i • x = starting address of the array • 4 = sizeof(element) • i = index of element of interest

More generally • if X is the address of the first cell in memory of an array with indices l..u, and if each element has size L, then • the element with index i is stored at address X + L * (i - l) (equivalently, the kth element in order is at X + L * (k-1)) • the element can be retrieved in constant time

When iterating through the array • can save a few operations by doing “pointer arithmetic” • just add L to current address to get next element • don’t have to subtract, multiply, add • still linear in number of elements, but faster linear
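
A small Python sketch of the address calculation and of the pointer-arithmetic shortcut. Addresses are modeled as plain integers, and the function names are hypothetical. The sketch assumes an array with indices l..u, elements of size L, and first cell at address X, so the element with index i sits at X + L * (i - l).

```python
def element_address(X, L, l, i):
    """Address of the element with index i: one subtract, one multiply, one add."""
    return X + L * (i - l)


def addresses_by_pointer_arithmetic(X, L, l, u):
    """Visit every element's address by repeatedly adding L to the current address."""
    addr = X
    visited = []
    for _ in range(l, u + 1):
        visited.append(addr)
        addr += L  # one addition per step instead of subtract, multiply, add
    return visited


# Six 4-byte elements with indices 0..5, starting at address 1000.
direct = [element_address(1000, 4, 0, i) for i in range(6)]
stepped = addresses_by_pointer_arithmetic(1000, 4, 0, 5)
```

Both approaches produce the same addresses; the second does less arithmetic per element, which is the "faster linear" point made above.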

Where’s the needed info stored? • Could store L, l, and u at the starting address of X, but would then need to adjust the formula used to calculate the location of individual cells • If the language is strongly typed, some or all of L, l, and u may be part of the definition of X and stored elsewhere • C/C++: L is part of the typing info, l is assumed to be 0, and u is not stored (the programmer needs to keep track of it)

Where’s the needed info stored? • Can use a sentinel value after the last element of the array • C/C++: we do this with strings, storing a ‘\0’ at the end • means that you need to iterate through the array to find Length, so Length is O(n), no longer O(1)
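
To illustrate why a sentinel makes Length linear, here is a sketch in Python that mimics a C-style scan. The list `memory` stands in for raw memory cells and the function name `sentinel_length` is hypothetical. Python strings do not work this way; the sentinel is simulated.

```python
def sentinel_length(cells, start=0, sentinel="\0"):
    """Count cells from `start` up to (not including) the sentinel: O(n) time."""
    n = 0
    while cells[start + n] != sentinel:
        n += 1
    return n


# "TREE" followed by the sentinel; whatever lies beyond it is never examined.
memory = list("TREE") + ["\0"] + list("junk")
```

Here `sentinel_length(memory)` scans four cells before it reaches the sentinel. If the length had been stored alongside the array, it could be returned without any scan.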

What if the elements have different lengths? • allot Max to all elements • wasted space • can still access in O(1) time • store pointers to elements • pointers require memory • need 2 accesses (calculate location of pointer, then follow it), but still O(1) • pointer to the element with index i is at X + P * (i - l), where P is the size of a pointer • easy to swap even large or complex elements … just swap their pointers

2D arrays • can also represent in contiguous memory … but do we keep rows together or do we keep columns together? • Example: array with logical ordering

  A B C D
  E F G H
  I J K L

Row-major v. column-major

  Row-major:    A B C D E F G H I J K L
  Column-major: A E I B F J C G K D H L

Where are 2D elements stored? • Row-major: R[i,j] stored at: • R + L * (NPR * (i-1) + (j-1)), where • R is starting address of the array • L is the size of each element • NPR is the number of elements per row • i is the row number (numbered from 1) • j is the column number (numbered from 1)

Where are 2D elements stored? • Column-major: C[i,j] stored at: • C + L * (NPC * (j-1) + (i-1)), where • C is starting address of the array • L is the size of each element • NPC is the number of elements per column • i is the row number (numbered from 1) • j is the column number (numbered from 1)
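
The two formulas can be checked against the A..L example with a short Python sketch. It assumes rows and columns are numbered from 1, uses a starting address of 0 and an element size of 1 so that addresses double as list positions, and uses function names chosen only for this illustration.

```python
def row_major_address(R, L, NPR, i, j):
    """Address of element [i, j] when rows are kept together."""
    return R + L * (NPR * (i - 1) + (j - 1))


def column_major_address(C, L, NPC, i, j):
    """Address of element [i, j] when columns are kept together."""
    return C + L * (NPC * (j - 1) + (i - 1))


logical = ["ABCD", "EFGH", "IJKL"]   # 3 rows, 4 columns
rows, cols = 3, 4
row_major = [None] * (rows * cols)
col_major = [None] * (rows * cols)
for i in range(1, rows + 1):
    for j in range(1, cols + 1):
        value = logical[i - 1][j - 1]
        # NPR is the number of columns; NPC is the number of rows.
        row_major[row_major_address(0, 1, cols, i, j)] = value
        col_major[column_major_address(0, 1, rows, i, j)] = value
```

Joining the two lists reproduces the sequences A B C D E F G H I J K L and A E I B F J C G K D H L.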

Multi-dimensional arrays

Constant-time initialization

procedure Initialize(ptr M, value v)
    // Initialize each element of M to v
    Count(M) <- 0
    Default(M) <- v

function Valid(int i, ptr M): boolean
    // Return true if M[i] has been modified since the last Initialize
    return (0 <= When(M)[i] < Count(M)) and (Which(M)[When(M)[i]] == i)

Constant-time initialization

function Access(int i, ptr M): value
    // Return M[i]
    if Valid(i, M) then return Data(M)[i]
    else return Default(M)

procedure Assign(ptr M, int i, value v)
    // Set M[i] <- v
    if not Valid(i, M) then
        When(M)[i] <- Count(M)
        Which(M)[Count(M)] <- i
        Count(M) <- Count(M) + 1
    Data(M)[i] <- v

But requires 3x memory … three arrays of the same length: Which(M), When(M), Data(M)
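
The constant-time initialization scheme can be sketched in Python as follows. Python has no uninitialized memory, so the three arrays are deliberately filled with random garbage to show that correctness does not depend on their initial contents. Allocating them is therefore O(n) in this sketch, while `initialize` itself remains O(1). The class and attribute names are assumptions made for the example.

```python
import random


class ConstantTimeArray:
    """Array with indices 0..n-1 whose Initialize runs in O(1) time."""

    def __init__(self, n, v=None):
        # Garbage contents simulate uninitialized memory (an assumption of this sketch).
        self.data = [random.randrange(-10**6, 10**6) for _ in range(n)]
        self.when = [random.randrange(-n, 2 * n) for _ in range(n)]
        self.which = [random.randrange(-n, 2 * n) for _ in range(n)]
        self.initialize(v)

    def initialize(self, v):
        # O(1): forget all earlier assignments and record the new default.
        self.count = 0
        self.default = v

    def valid(self, i):
        # True only if index i was assigned since the last initialize.
        k = self.when[i]
        return 0 <= k < self.count and self.which[k] == i

    def access(self, i):
        return self.data[i] if self.valid(i) else self.default

    def assign(self, i, v):
        if not self.valid(i):
            self.when[i] = self.count
            self.which[self.count] = i
            self.count += 1
        self.data[i] = v


M = ConstantTimeArray(8, v="default")
M.assign(2, "x")
M.assign(5, "y")
M.assign(2, "z")          # reassigning does not consume another slot in which
```

The check in `valid` is safe against garbage: if `when[i]` happens to fall below `count`, then `which[when[i]]` was written by a genuine assignment and can equal i only if that assignment was to index i.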

Sparse Arrays • Definitions • List Representations • Hierarchical Tables • Arrays with Special Shapes

Sparse Arrays • some arrays contain only a few elements … wouldn’t it be more efficient to store only the non-null values? • same idea when only a few values differ from the majority • some arrays have a special shape … upper triangular matrix, symmetric matrix • sparse array: an array in which only a small fraction of the elements are significant in some way • null element: doesn’t need to be stored; is either actually null, or well-known, or easily calculated
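
One way to store only the non-null values is sketched below. It uses a Python dictionary keyed by index, which is a stand-in chosen for brevity rather than one of the list or hierarchical-table representations named in the outline. The class name `SparseArray` and its methods are hypothetical.

```python
class SparseArray:
    """Array with indices 0..n-1 that stores only elements differing from `null`."""

    def __init__(self, n, null=0):
        self.n = n
        self.null = null       # the well-known value that is never stored
        self._stored = {}      # index -> significant value

    def _check(self, i):
        if not (0 <= i < self.n):
            raise IndexError(i)

    def access(self, i):
        self._check(i)
        return self._stored.get(i, self.null)

    def assign(self, i, v):
        self._check(i)
        if v == self.null:
            self._stored.pop(i, None)   # null elements take no space
        else:
            self._stored[i] = v

    def stored_count(self):
        return len(self._stored)


S = SparseArray(1_000_000)
S.assign(5, 42)
S.assign(999_999, 7)
```

Only two entries are stored for a million-element array. The trade-off is that each access goes through a hash lookup instead of a direct address calculation.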

List representations

Hierarchical tables

Upper-triangular matrix

Representation of Strings • Background • Huffman Encoding • Lempel-Ziv Encoding

Representing Strings • How much space do we need? • Assume we represent every character. • How many bits to represent each character? • Depends on |Σ|

Bits to encode a character • Two character alphabet {A,B} • one bit per character: • 0 = A, 1 = B • Four character alphabet {A,B,C,D} • two bits per character: • 00 = A, 01 = B, 10 = C, 11 = D • Six character alphabet {A,B,C,D,E,F} • three bits per character: • 000 = A, 001 = B, 010 = C, 011 = D, 100 = E, 101 = F, 110 = unused, 111 = unused

More generally • The bit sequence representing a character is called the encoding of the character. • There are 2^n different bit sequences of length n, so ceil(lg |Σ|) bits are required to represent each character in Σ • if we use the same number of bits for each character, then the length of the encoding of a word w is |w| * ceil(lg |Σ|)
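
The fixed-length rule can be tried out with a short Python sketch. The function names are hypothetical, the alphabet is assumed to have at least two characters, and codes are assigned in the order the characters are listed. The bit count is computed with integer arithmetic to avoid floating-point rounding in the logarithm.

```python
def bits_per_character(alphabet_size):
    """ceil(lg(alphabet_size)) for alphabet_size >= 1, using integer arithmetic."""
    return (alphabet_size - 1).bit_length()


def fixed_length_codes(alphabet):
    """Assign each character the binary form of its position, padded to equal width."""
    bits = bits_per_character(len(alphabet))
    return {ch: format(k, "b").zfill(bits) for k, ch in enumerate(alphabet)}


def encoding_length(w, alphabet_size):
    """Total bits needed for w when every character uses the same number of bits."""
    return len(w) * bits_per_character(alphabet_size)


codes6 = fixed_length_codes("ABCDEF")   # three bits each; 110 and 111 go unused
```

For the six-character alphabet this yields 100 for E and 101 for F, matching the table on the earlier slide.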

Can we do better?? • If Σ is very small, might use run-length encoding
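
Run-length encoding is only named on the slide. As a rough sketch of the idea, each maximal run of a repeated character can be replaced by a (character, count) pair. The pair representation and the function names below are assumptions made for illustration. A real encoder would also have to decide how many bits to spend on each count.

```python
from itertools import groupby


def run_length_encode(w):
    """Replace each maximal run of equal characters with a (character, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(w)]


def run_length_decode(pairs):
    """Expand (character, count) pairs back into the original string."""
    return "".join(ch * n for ch, n in pairs)


encoded = run_length_encode("AAAABBBAA")
```

This pays off when the alphabet is small and runs are long. On a string with no repeated neighbors it produces one pair per character and saves nothing.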

What if … • the string we encode doesn’t use all the letters in the alphabet? • then only ceil(log2(|set_of_characters_used|)) bits per character are needed • But then also need to store / transmit the mapping from encodings to characters • … and the set of characters used is typically close to the size of the alphabet

Huffman Encoding: • Still assumes encoding on a per-character basis • Observation: assigning shorter codes to frequently used characters can result in overall shorter encodings of strings • requires assigning longer codes to rarely used characters • Problem: • when decoding, need to know how many bits to read off for each character. • Solution: • Choose an encoding that ensures that no character encoding is the prefix of any other character encoding. An encoding tree has this property.

A Huffman Encoding Tree

              21
           0/    \1
          9        12
          E     0/    \1
               5        7
            0/  \1   0/  \1
            3    2   3    4
            A    T   R    N

Weighted path length

Weighted path = Len(code(A)) * f(A) + Len(code(T)) * f(T) + Len(code(R)) * f(R) + Len(code(N)) * f(N) + Len(code(E)) * f(E)
              = (3 * 3) + (3 * 2) + (3 * 3) + (3 * 4) + (1 * 9)
              = 9 + 6 + 9 + 12 + 9
              = 45

Claim (proof in text): no other encoding can result in a shorter weighted path length
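
The weighted path length and the prefix property can both be checked in a few lines of Python. The codes below are read off the encoding tree on the assumption that each left edge is labeled 0 and each right edge 1, giving E = 0, A = 100, T = 101, R = 110, N = 111. The frequencies are the leaf weights from the same tree, and the function names are hypothetical.

```python
codes = {"E": "0", "A": "100", "T": "101", "R": "110", "N": "111"}
freq = {"E": 9, "A": 3, "T": 2, "R": 3, "N": 4}


def weighted_path_length(codes, freq):
    """Sum over characters of (code length) * (frequency)."""
    return sum(len(codes[ch]) * freq[ch] for ch in freq)


def encode(w, codes):
    return "".join(codes[ch] for ch in w)


def decode(bits, codes):
    """Read bits left to right; emit a character as soon as a full code is seen.

    This works only because no code is a prefix of another.
    """
    by_code = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in by_code:
            out.append(by_code[current])
            current = ""
    return "".join(out)


total = weighted_path_length(codes, freq)
bits = encode("TREE", codes)
```

Encoding TREE takes 8 bits with these codes. A fixed-length code for the same five-character alphabet would need 3 bits per character, or 12 bits.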

Building the Huffman Tree • Start with one single-node tree per character, weighted by its frequency: A 3, T 4, R 4, E 5

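Building the Huffman Tree • Combine the two lowest-weight trees, A 3 and T 4, under a new root of weight 7 • The forest is now: 7 (with children A 3 and T 4), R 4, E 5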
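
The construction can be carried through to the end with a priority queue. The sketch below is one common way to build a Huffman tree in Python using `heapq`. It is not taken from the slides, and its tie-breaking rule (insertion order) is an assumption, so the exact codes may differ from a tree drawn by hand even though the weighted path length is the same. It expects at least one character in `freq`.

```python
import heapq
from itertools import count


def huffman_codes(freq):
    """Repeatedly merge the two lowest-weight trees, then read codes off the result."""
    tiebreak = count()  # unique sequence numbers so trees themselves are never compared
    heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):        # internal node: (left subtree, right subtree)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                              # leaf: a single character
            codes[node] = prefix or "0"    # lone-character alphabets still get one bit

    walk(heap[0][2], "")
    return codes


freq = {"A": 3, "T": 4, "R": 4, "E": 5}
codes = huffman_codes(freq)
weighted = sum(len(codes[ch]) * f for ch, f in freq.items())
```

With these frequencies the first merge combines A and T into a tree of weight 7, as on the slide. R and E are merged next into a tree of weight 9, and the final merge gives a root of weight 16, so every character ends up with a two-bit code.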
