String Processing (Chapter 3)
Introduction
Computers are frequently used for data processing. One of the primary applications of computers today is in the field of word processing. Such processing usually involves some type of pattern matching. Here we discuss pattern matching in detail, including two different pattern matching algorithms and their complexity.

Basic terminology
Each programming language contains a character set that is used to communicate with the computer. The character set may differ from one language to another, but it usually includes the following characters:
Alphabet: a, b, c, d, ..., x, y, z
Digits: 0, 1, 2, 3, ..., 9
Special characters: +, -, /, (, ), $, =
A string is a finite sequence S of zero or more characters. The number of characters in a string is called its length. The string with zero characters is called the empty string or the null string. Specific strings are denoted by enclosing their characters in single quotation marks, e.g. 'The End', 'To be or not to be' and '' are strings of length 7, 18 and 0.

Concatenation
Let S1 and S2 be strings. The string consisting of the characters of S1 followed by the characters of S2 is called the concatenation of S1 and S2, denoted S1//S2, e.g. 'THE' // 'END' = 'THEEND'. Note that the length of S1//S2 equals the sum of the lengths of S1 and S2.

Substring
A string Y is called a substring of a string S if there exist strings X and Z such that S = X//Y//Z. If X is the empty string, then Y is called an initial substring of S; if Z is the empty string, then Y is called a terminal substring of S. If Y is a substring of S, then the length of Y does not exceed the length of S.

Storing strings
Strings are stored in three types of structures:
Fixed-length structures
Variable-length structures
Linked structures
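The definitions above can be checked with a small Python sketch. Python's `+` plays the role of `//`, and the helper names `is_substring`, `is_initial_substring` and `is_terminal_substring` are hypothetical names chosen for illustration, not notation from these notes.

```python
def is_substring(y: str, s: str) -> bool:
    """Return True if s = x // y // z for some (possibly empty) strings x and z."""
    # Try every position where y could start inside s.
    for start in range(len(s) - len(y) + 1):
        if s[start:start + len(y)] == y:
            return True
    return False


def is_initial_substring(y: str, s: str) -> bool:
    """Y is an initial substring of S when X is the empty string."""
    return s.startswith(y)


def is_terminal_substring(y: str, s: str) -> bool:
    """Y is a terminal substring of S when Z is the empty string."""
    return s.endswith(y)


s1, s2 = 'THE', 'END'
joined = s1 + s2  # concatenation S1 // S2
```

Here `joined` is 'THEEND', and its length 6 is the sum of the lengths of `s1` and `s2`. Because a substring has to fit inside S, `is_substring` returns False whenever Y is longer than S.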
Fixed-length storage
In fixed-length storage each line of print is viewed as a record, where all records have the same length, i.e. each record accommodates the same number of characters.
Advantages:
Ease of accessing data from any given record.
Ease of updating data in any given record, as long as the new data does not exceed the record length.
Disadvantages:
Time is wasted reading an entire record if most of the storage consists of inessential blank space.
Certain records may require more space than is available.
When a correction consists of more or fewer characters than the original text, changing a misspelled word requires the entire record to be changed.
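A minimal sketch of fixed-length storage, assuming memory is modelled as one long Python string and each record is padded with blanks. The record length and the names `store_fixed` and `read_record` are hypothetical choices for illustration.

```python
RECORD_LEN = 8  # hypothetical record length, chosen only for this example


def store_fixed(lines, record_len=RECORD_LEN):
    """Pad every line with blanks so that all records have the same length."""
    for line in lines:
        if len(line) > record_len:
            # A record that needs more space than is available cannot be stored.
            raise ValueError('line does not fit in one record')
    return ''.join(line.ljust(record_len) for line in lines)


def read_record(storage, k, record_len=RECORD_LEN):
    """Return record number k (counting from 1) by computing its offset directly."""
    start = (k - 1) * record_len
    return storage[start:start + record_len]
```

For example, `read_record(store_fixed(['bat', 'cat'], 4), 2, 4)` returns 'cat ' including the padding blank. The direct offset computation is the ease-of-access advantage, and the padding is the wasted space named among the disadvantages.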
[Figure: bat, cat, sat, vat, NULL]

Variable-length storage
The storage of variable-length strings in memory cells with fixed lengths can be done in two general ways:
• One can use a marker, such as two dollar signs ($$), to signal the end of the string.
• One can list the length of the string as an additional item in the pointer array.

Linked storage
For most extensive word processing applications, strings are stored by means of linked lists. We discuss word processing operations in detail in the next chapter; here we discuss how strings appear in these data structures. By a (one-way) linked list we mean a linearly ordered sequence of memory cells called nodes, where each node contains an item called a link, which points to the next node in the list (i.e. which contains the address of the next node). Example discussed on the board.
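The two variable-length schemes and the linked structure can be sketched in Python as follows. This is an illustration under assumptions: memory is modelled as a Python string, the pointer array as a list of (start, length) pairs, each node holds a single character, and all function and class names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

MARKER = '$$'  # end-of-string marker; assumes no stored string contains '$$'


def pack_with_marker(strings):
    """Scheme 1: store the strings back to back, each followed by the marker."""
    return ''.join(s + MARKER for s in strings)


def unpack_with_marker(memory):
    """Recover the strings by splitting at the markers."""
    return memory.split(MARKER)[:-1]  # drop the empty piece after the last marker


def pack_with_lengths(strings):
    """Scheme 2: keep (start, length) for each string in a pointer array."""
    pointers, start = [], 0
    for s in strings:
        pointers.append((start, len(s)))
        start += len(s)
    return ''.join(strings), pointers


@dataclass
class Node:
    """One node of a one-way linked list: an item plus a link to the next node."""
    char: str
    link: Optional['Node'] = None  # None marks the end of the list


def to_linked_list(s):
    """Build the list back to front so that each node links to its successor."""
    head = None
    for ch in reversed(s):
        head = Node(ch, head)
    return head


def from_linked_list(head):
    """Follow the links from the first node and collect the characters."""
    chars = []
    while head is not None:
        chars.append(head.char)
        head = head.link
    return ''.join(chars)
```

For example, `pack_with_marker(['bat', 'cat'])` gives 'bat$$cat$$', while `pack_with_lengths(['bat', 'cat'])` gives ('batcat', [(0, 3), (3, 3)]).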
Character data type
Here we discuss how various programming languages handle the character data type.

Constants
Many languages denote string constants by placing the string in either single or double quotation marks. Example on the board.

Variables
Each programming language has its own rules for forming character variables. These variables fall into three categories:
A static character variable is one whose length is defined before the program is executed and cannot change throughout the program.
A semistatic character variable is one whose length may vary during the execution of the program, as long as the length does not exceed a maximum value determined by the program before the program is executed.
A dynamic character variable is one whose length can change during the execution of the program.
String operations
Although a string may be viewed simply as a sequence or linear array of characters, groups of consecutive elements in a string (such as words or phrases) are called substrings. Furthermore, the basic units of access in a string are usually these substrings, not individual characters.

Substring
Accessing a substring of a given string requires three pieces of information: the name of the string, the position of the first character of the substring in the given string, and either the length of the substring or the position of its last character. We call this operation SUBSTRING, e.g. SUBSTRING(string, initial, length).

Indexing
Indexing, also called pattern matching, refers to finding the position where a string pattern P first appears in a given string text T. We call this operation INDEX and write INDEX(text, pattern). If the pattern P does not appear in the text T, then INDEX is assigned the value 0. Indexing example is on the board.
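A minimal Python sketch of SUBSTRING and INDEX as described above. Positions count from 1 as in these notes, whereas Python slices count from 0. The lowercase names `substring` and `index` are hypothetical wrappers, and the sample strings are made up for illustration.

```python
def substring(string: str, initial: int, length: int) -> str:
    """SUBSTRING(string, initial, length) with positions counted from 1."""
    start = initial - 1  # convert the 1-based position to a 0-based offset
    return string[start:start + max(length, 0)]


def index(text: str, pattern: str) -> int:
    """INDEX(text, pattern): 1-based position of the first occurrence, or 0."""
    # str.find returns the 0-based position or -1, so adding 1 gives
    # the 1-based position, or 0 when the pattern does not appear.
    return text.find(pattern) + 1
```

For example, `substring('TO BE OR NOT TO BE', 4, 7)` returns 'BE OR N', `index('HIS FATHER IS THE PROFESSOR', 'THE')` returns 7 (the match inside FATHER comes first), and `index('HIS FATHER IS THE PROFESSOR', 'THEN')` returns 0.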
Concatenation
Let S1 and S2 be strings. The concatenation of S1 and S2, denoted S1//S2, is the string consisting of the characters of S1 followed by the characters of S2, e.g. if S1 = 'MARK' and S2 = 'TWIN', then S1//S2 = 'MARKTWIN'.

Length
The number of characters in a string is called its length. We write LENGTH(string), e.g. LENGTH('COMPUTER') = 8.
Word Processing
In earlier times computers could process only simple character data; nowadays they also process printed text such as letters and articles. The operations usually associated with word processing are the following:
• Insertion: inserting a string in the middle of the text.
• Deletion: removing a string from the text.
• Replacement: replacing one string in the text by another.

Insertion
Suppose in a given text T we want to insert a string S so that S begins in position K. We denote this operation by INSERT(text, position, string), e.g. INSERT('ABCDEF', 3, 'XYZ') = 'ABXYZCDEF'.
The INSERT function can be implemented using the string operations as follows:
INSERT(T, K, S) = SUBSTRING(T, 1, K-1) // S // SUBSTRING(T, K, LENGTH(T)-K+1)
That is, the initial substring of T before position K, which has length K-1, is concatenated with the string S, and the result is concatenated with the remaining part of T, which begins in position K and has length LENGTH(T)-(K-1) = LENGTH(T)-K+1.

Deletion
Suppose in a given text T we want to delete the substring which begins in position K and has length L. We denote this operation by DELETE(text, position, length), e.g.
DELETE('PRESTON', 2, 2) = 'PSTON'
DELETE('ABCDEFG', 2, 4) = 'AFG'
Algorithm discussed on the board.
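The insertion formula above translates directly into Python. The deletion function below is written in the same style; its formula is not given in these notes, so treat it as an assumption rather than the algorithm shown on the board. Positions count from 1, and the lowercase function names are hypothetical.

```python
def substring(string: str, initial: int, length: int) -> str:
    """SUBSTRING(string, initial, length) with positions counted from 1."""
    start = initial - 1
    return string[start:start + max(length, 0)]


def insert(text: str, k: int, s: str) -> str:
    """INSERT(T, K, S) = SUBSTRING(T, 1, K-1) // S // SUBSTRING(T, K, LENGTH(T)-K+1)."""
    return substring(text, 1, k - 1) + s + substring(text, k, len(text) - k + 1)


def delete(text: str, k: int, length: int) -> str:
    """Assumed formula: keep the part before position K and the part after the deleted span."""
    remaining = len(text) - k - length + 1  # characters left after the deleted span
    return substring(text, 1, k - 1) + substring(text, k + length, remaining)
```

With these definitions, `insert('ABCDEF', 3, 'XYZ')` returns 'ABXYZCDEF', `delete('PRESTON', 2, 2)` returns 'PSTON', and `delete('ABCDEFG', 2, 4)` returns 'AFG'.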
Replacement
Suppose in a given text T we want to replace the first occurrence of a pattern P1 by a pattern P2. We denote this operation by REPLACE(text, pattern1, pattern2), e.g. REPLACE('ABXYEFGH', 'XY', 'CD') = 'ABCDEFGH'.
Note that the REPLACE function can be expressed as a deletion followed by an insertion. It can be executed using the following three steps:
K := INDEX(T, P1)
T := DELETE(T, K, LENGTH(P1))
T := INSERT(T, K, P2)
The first two steps delete P1 from T, and the third step inserts P2 in position K, from which P1 was deleted. Algorithm discussed on the board.
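The three steps can be sketched in Python as follows. The helpers `index`, `delete` and `insert` are hypothetical stand-ins written with Python slicing, and returning the text unchanged when P1 does not occur is an assumption, since the notes do not say what happens in that case.

```python
def index(text: str, pattern: str) -> int:
    """1-based position of the first occurrence of pattern, or 0 if absent."""
    return text.find(pattern) + 1


def delete(text: str, k: int, length: int) -> str:
    """Remove the substring of the given length that begins in position k."""
    return text[:k - 1] + text[k - 1 + length:]


def insert(text: str, k: int, s: str) -> str:
    """Insert s so that it begins in position k."""
    return text[:k - 1] + s + text[k - 1:]


def replace(text: str, p1: str, p2: str) -> str:
    """Replace the first occurrence of p1 in text by p2."""
    k = index(text, p1)              # step 1: locate P1
    if k == 0:
        return text                  # assumption: nothing to replace
    text = delete(text, k, len(p1))  # step 2: delete P1
    return insert(text, k, p2)       # step 3: insert P2 where P1 was
```

For example, `replace('ABXYEFGH', 'XY', 'CD')` returns 'ABCDEFGH'. Only the first occurrence is affected, so `replace('XYXY', 'XY', 'A')` returns 'AXY'.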
Pattern Matching Algorithm
Pattern matching is the problem of deciding whether or not a given string pattern P appears in a string text T. We assume that the length of P does not exceed the length of T. Here we discuss two pattern matching algorithms, along with their complexity, in order to measure their efficiency.

First pattern matching algorithm
In this algorithm we compare a given pattern P with each of the substrings of T, moving from left to right, until we get a match. In detail, let
WK = SUBSTRING(T, K, LENGTH(P))
That is, WK denotes the substring of T having the same length as P and beginning with the Kth character of T. First we compare P, character by character, with the first substring, W1. If all the characters are the same, then P = W1, so P appears in T and INDEX(T, P) = 1. If some character of P does not match the corresponding character of W1, then P ≠ W1 and we move on to the next substring, W2.
The process stops (a) when we find a match of P with some substring WK, so that P appears in T and INDEX(T, P) = K, or (b) when we exhaust all the WK's with no match, and hence P does not appear in T. The maximum value MAX of the subscript K is equal to LENGTH(T) - LENGTH(P) + 1. (Example and algorithm are discussed on the board.)
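The procedure described above can be sketched in Python as follows. This is one possible rendering, not necessarily the algorithm written on the board. The function name `index_slow` and the loop variable names are hypothetical, and positions count from 1 to match the notes.

```python
def index_slow(text: str, pattern: str) -> int:
    """Compare the pattern with W1, W2, ..., WMAX and return the first K that matches, else 0."""
    max_k = len(text) - len(pattern) + 1  # MAX = LENGTH(T) - LENGTH(P) + 1
    for k in range(1, max_k + 1):
        matched = True
        # Compare P with WK character by character.
        for l in range(1, len(pattern) + 1):
            # Character l of WK is character k + l - 1 of T (1-based),
            # which is offset k + l - 2 in Python's 0-based indexing.
            if pattern[l - 1] != text[k + l - 2]:
                matched = False  # P != WK, move on to the next substring
                break
        if matched:
            return k  # case (a): P = WK, so INDEX(T, P) = K
    return 0  # case (b): all WK exhausted without a match
```

For example, `index_slow('ABCABD', 'ABD')` returns 4 and `index_slow('ABC', 'X')` returns 0. In the worst case the inner loop runs LENGTH(P) times for each of the MAX substrings, which is the quantity a complexity analysis of this method would count.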