CSC 213 – Large Scale Programming
Lecture 21: Indexed Files
Today’s Goals
• Look at how Dictionarys are used in the real world
  • Where this would occur & why they are used there
  • In a real-world setting, what problems can & do occur
• Indexed file usage presented and shown
  • How & why we split index & data files
  • Formatting of each file and how they get used
  • Describe what problems are solved using indexed files
  • Java coding techniques that simplify using these files
  • Idea needed when using multiple indexes shown
Dictionaries in Real World
• Often need large database on many machines
  • Split search terms across machines
  • Updating & searching work split between machines
  • Database way too large for any single machine
• If you think about it, this is incredibly common
  • Where?
Splitting Keys From Values
• In real world, we often have many indices
  • Simple units measure where we can find values
  • Values could be searched for in multiple ways
Index & Data Files
• Split information into two (or more) files
  • Data file uses fixed-size records to store data
  • Index files contain search terms & data locations
• Fixed-size records usually used in data file
  • Each record will use exactly that much space
  • Extra space wasted if the value is smaller
  • But limits data size, cannot get more space
  • Makes it far easier to reuse space & rebuild index
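A minimal sketch of why fixed-size records are convenient: the position of any record can be computed directly, and each value is padded to fill its field. The 60-byte record size and the names `FixedRecords`, `positionOf`, and `padToSize` are assumptions made for illustration; they are not part of the lecture's code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; names and sizes are illustrative only.
class FixedRecords {
    static final int RECORD_SIZE = 60;   // assumed number of bytes per record

    // Record n always starts at n * RECORD_SIZE, so no scanning is needed.
    static long positionOf(int recordNumber) {
        return (long) recordNumber * RECORD_SIZE;
    }

    // Pads a value with zero bytes so it occupies exactly fieldSize bytes.
    // A value that is too large is rejected, since the field cannot grow.
    static byte[] padToSize(String value, int fieldSize) {
        byte[] raw = value.getBytes(StandardCharsets.US_ASCII);
        if (raw.length > fieldSize) {
            throw new IllegalArgumentException("Value does not fit in field");
        }
        return Arrays.copyOf(raw, fieldSize);
    }
}
```

The padding bytes are the "wasted" space mentioned above, and the exception is the size limit: both are the price paid for being able to compute positions.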
Index File Format
• No standard format – depends on type of data
  • Often variable sized, but this is not a specific requirement
  • Each entry in index file begins with exact search term
  • Followed by position containing matching data
• As a result, often find indexes smushed together
• Can read indexes at start of program execution
  • Reasonably assumes index file smaller than data file
  • Changes written immediately, however
  • When program starts, do NOT read data file
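Since there is no standard index format, the following is only one possible layout, sketched under the assumption that each entry is a search term written with `writeUTF` followed by a `long` position. The class name `IndexIO` and its methods are hypothetical. Reading the index fills an in-memory map and never opens the data file.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

// Assumed entry layout: writeUTF(search term), then writeLong(position).
class IndexIO {
    static void write(Map<String, Long> index, OutputStream out) {
        try {
            DataOutputStream data = new DataOutputStream(out);
            for (Map.Entry<String, Long> entry : index.entrySet()) {
                data.writeUTF(entry.getKey());     // exact search term
                data.writeLong(entry.getValue());  // position of matching record
            }
            data.flush();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Reads every entry into memory; the data file is not touched.
    static Map<String, Long> read(InputStream in) {
        try {
            DataInputStream data = new DataInputStream(in);
            Map<String, Long> index = new TreeMap<>();
            while (true) {
                String term;
                try {
                    term = data.readUTF();
                } catch (EOFException endOfIndex) {
                    break;                         // clean end of the index
                }
                index.put(term, data.readLong());  // truncated entry: IOException
            }
            return index;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```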
Indexed Files
• Enables splitting search terms across computers
  • Alphabetical split searches faster on many servers
[Figure: servers each handling one alphabetical range of search terms: A-C, D-E, F-H, I-P, Q-R, S-T, U-X, Y-Z]
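One way an alphabetical split could be routed, sketched with the letter ranges from the figure. The class name `KeyRouter` and the server names `server0` through `server7` are made up for this example; a real system would map ranges to actual machines.

```java
import java.util.TreeMap;

// Hypothetical router: picks a server from the first letter of the search term.
class KeyRouter {
    // Maps the first letter of each range to an (invented) server name.
    private final TreeMap<Character, String> rangeStarts = new TreeMap<>();

    KeyRouter() {
        // Range starts for A-C, D-E, F-H, I-P, Q-R, S-T, U-X, Y-Z
        String starts = "ADFIQSUY";
        for (int i = 0; i < starts.length(); i++) {
            rangeStarts.put(starts.charAt(i), "server" + i);
        }
    }

    // Finds the range whose starting letter is the closest one at or below
    // the key's first letter.
    String serverFor(String key) {
        char first = Character.toUpperCase(key.charAt(0));
        if (first < 'A' || first > 'Z') {
            throw new IllegalArgumentException("Key must start with a letter A-Z");
        }
        return rangeStarts.floorEntry(first).getValue();
    }
}
```

Each server then only needs the part of the index that falls in its own range.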
Indexed Files
• Enables splitting search terms across computers
  • Create indexes for different types of searching
[Figure: indexes labeled "Song name" and "Song length"]
How Does This Work?
• Using index files simplified using positions
  • Look in index structure to find position of data in file
  • With this position can then seek to specific record
  • Create instance & initialize by reading data from file
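The lookup steps above can be sketched as follows. To keep the example short it assumes each record is just a single `int` price, and the names `SeekLookup`, `priceOf`, and `demo` are hypothetical. The index is an in-memory map from ticker to file position.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Simplified assumption: a record is one int (the price).
class SeekLookup {
    // Step 1: find the position in the index. Step 2: seek. Step 3: read.
    static int priceOf(String ticker, Map<String, Long> index, RandomAccessFile data)
            throws IOException {
        Long position = index.get(ticker);
        if (position == null) {
            throw new IllegalArgumentException("No record for " + ticker);
        }
        data.seek(position);
        return data.readInt();
    }

    // Builds a tiny data file and index in a temp file, then does one lookup.
    static int demo() {
        try {
            File tmp = File.createTempFile("stocks", ".dat");
            tmp.deleteOnExit();
            try (RandomAccessFile data = new RandomAccessFile(tmp, "rw")) {
                Map<String, Long> index = new HashMap<>();
                String[] tickers = {"IBM", "T", "F"};
                int[] prices = {106, 23, 12};
                for (int i = 0; i < tickers.length; i++) {
                    index.put(tickers[i], data.getFilePointer());
                    data.writeInt(prices[i]);
                }
                return priceOf("T", index, data);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

Only the one record that was asked for is read from disk; the rest of the data file is never loaded.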
Starting with Indexed Files
[Figure: data file records (name, price, ticker): IBM, 106, IBM | AT & T, 23, T | Ford, 2, F]
Where Was "Searching" Used?
• Indexed files used in Maps and Dictionarys
  • Read data into searchable object after opening file
  • For each record, Entry uses indexed data as its key
• Single data file has multiple indexes to search it
  • Not a problem, each index has own Collection
  • Cannot have multiple instances for each data item
  • Cannot have single instance for each data item
  • Then how can we construct each Entry's value?
Proxy Pattern For The Win!
• Create proxy instances to use as Entry's value
  • Proxy pretends it has the data by defining getters & setters
  • Data's position & file are the only fields these objects have
  • Whenever a method is called, it looks up & returns data
• Other classes will think proxy has fields declared
  • Simplifies using class & ensures up-to-date data used
  • But little memory needed, since data resides on disk!
Starting with Indexed Files
[Figure: data file records (name, price, ticker): IBM, 106, IBM | AT & T, 23, T | Ford, 12, F]
Coding
import java.io.RandomAccessFile;

public class Stock {
    private static final int NAME_OFF = 0;
    private static final int NAME_SZ = 50;
    private static final int PRC_OFF = NAME_OFF + NAME_SZ;
    private static final int PRC_SZ = 4;
    private static final int TICK_OFF = PRC_OFF + PRC_SZ;
    private static final int TICK_SZ = 6;
    private static final int SIZE = TICK_OFF + TICK_SZ;

    private long position;
    private RandomAccessFile theFile;

    public Stock(long pos, RandomAccessFile file) {
        position = pos;
        theFile = file;
    }
    // Getters & setters follow
Coding
• NAME_SZ, PRC_SZ, TICK_SZ: fixed max. size of each field
• SIZE: fixed size of a record in data file
Coding
• NAME_OFF, PRC_OFF, TICK_OFF: offset in record to field start
Coding
public class Stock { // Continues from last time
    // Needs import java.io.IOException; every file access can fail

    public int getStockPrice() throws IOException {
        theFile.seek(position + PRC_OFF);
        return theFile.readInt();
    }

    public void setStockPrice(int price) throws IOException {
        theFile.seek(position + PRC_OFF);
        theFile.writeInt(price);
    }

    public void setTickerSymbol(String sym) throws IOException {
        theFile.seek(position + TICK_OFF);
        theFile.writeUTF(sym);
    }
    // More getters & setters from here…
}
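A small runnable sketch of how such a proxy behaves when used. `PriceProxy` is a hypothetical, cut-down stand-in for `Stock` whose record is assumed to hold a single `int` price. As a design choice made for this example only, it wraps `IOException` in `UncheckedIOException` so the getter and setter look like ordinary ones; the slides' code does not do this.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

// Hypothetical proxy: it stores only where the data lives, not the data.
class PriceProxy {
    private final long position;
    private final RandomAccessFile file;

    PriceProxy(long position, RandomAccessFile file) {
        this.position = position;
        this.file = file;
    }

    // Every call goes to disk, so callers always see the current value.
    int getPrice() {
        try {
            file.seek(position);
            return file.readInt();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    void setPrice(int price) {
        try {
            file.seek(position);
            file.writeInt(price);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Two proxies for one record: a write through one is seen by the other.
    static int demo() {
        try {
            File tmp = File.createTempFile("proxy", ".dat");
            tmp.deleteOnExit();
            try (RandomAccessFile data = new RandomAccessFile(tmp, "rw")) {
                data.writeInt(12);
                PriceProxy first = new PriceProxy(0, data);
                PriceProxy second = new PriceProxy(0, data);
                first.setPrice(15);
                return second.getPrice();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

Because neither proxy caches the price, the second proxy reads the updated value even though the write went through the first.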
Visualizing Indexed Files
[Figure: data file records (name, price, ticker): IBM, 106, IBM | AT & T, 23, T | Ford, 12, F]
How Do We Add Data?
• Adding new records takes only a few steps
  • Add space for record with setLength on data file
  • Update index structure(s) to include new record
  • Records in data file updated at each change
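The steps for adding a record can be sketched as below, again under the simplifying assumption that a record is a single `int` price. `RecordAppender`, `append`, and `demo` are hypothetical names; only `setLength`, `seek`, and `writeInt` come from `RandomAccessFile`.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

class RecordAppender {
    static final int RECORD_SIZE = 4;   // assumed: a record is one int price

    // Grows the data file by one record, fills it in, then updates the index.
    static long append(String ticker, int price, Map<String, Long> index,
                       RandomAccessFile data) throws IOException {
        long position = data.length();
        data.setLength(position + RECORD_SIZE);  // add space for the record
        data.seek(position);
        data.writeInt(price);                    // data file updated right away
        index.put(ticker, position);             // in-memory index updated
        return position;
    }

    // Appends three records and reports where the last one landed.
    static long demo() {
        try {
            File tmp = File.createTempFile("append", ".dat");
            tmp.deleteOnExit();
            try (RandomAccessFile data = new RandomAccessFile(tmp, "rw")) {
                Map<String, Long> index = new HashMap<>();
                append("IBM", 106, index, data);
                append("T", 23, index, data);
                append("C", -2, index, data);
                return index.get("C");
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```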
Adding New Data To The Files
[Figure: data file records (name, price, ticker): IBM, 106, IBM | AT & T, 23, T | Ford, 12, F | new empty record: 0, Ø]
Adding New Data To The Files
[Figure: data file records (name, price, ticker): IBM, 106, IBM | AT & T, 23, T | Ford, 12, F | Citibank, -2, C]
How Do We Remove Data?
• Removing records even easier
  • To prevent using record, remove items from indexes
  • Do NOT update index file(s) until program completes
  • Use impossible magic numbers for record in data file
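A sketch of removal along these lines, assuming once more that a record is a single `int` price. The magic value `Integer.MIN_VALUE` is an assumption chosen for this example (the slides do not say which value is used), and `RecordRemover` and its methods are hypothetical names.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

class RecordRemover {
    // Assumed magic number: a value no real price would ever have.
    static final int DELETED = Integer.MIN_VALUE;

    // Drops the record from the in-memory index and marks its slot on disk.
    // The index file itself is not rewritten here.
    static boolean remove(String ticker, Map<String, Long> index,
                          RandomAccessFile data) throws IOException {
        Long position = index.remove(ticker);  // searches can no longer find it
        if (position == null) {
            return false;
        }
        data.seek(position);
        data.writeInt(DELETED);                // slot can be skipped or reused
        return true;
    }

    static boolean isDeleted(long position, RandomAccessFile data) throws IOException {
        data.seek(position);
        return data.readInt() == DELETED;
    }

    // Writes three records, removes the last, and checks the result.
    static boolean demo() {
        try {
            File tmp = File.createTempFile("remove", ".dat");
            tmp.deleteOnExit();
            try (RandomAccessFile data = new RandomAccessFile(tmp, "rw")) {
                Map<String, Long> index = new HashMap<>();
                String[] tickers = {"IBM", "T", "F"};
                int[] prices = {106, 23, 12};
                for (int i = 0; i < tickers.length; i++) {
                    index.put(tickers[i], data.getFilePointer());
                    data.writeInt(prices[i]);
                }
                boolean removed = remove("F", index, data);
                return removed && !index.containsKey("F")
                        && isDeleted(8, data) && !isDeleted(0, data);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

Marking the slot in the data file means an index rebuilt from the data file later can recognize and skip the removed record.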
Removing Data As We Go
[Figure: data file records (name, price, ticker): IBM, 106, IBM | AT & T, 23, T | Ford, 12, F | Citibank, -2, C]
Removing Data As We Go
[Figure: data file records (name, price, ticker): IBM, 106, IBM | AT & T, 23, T | removed record: 0, Ø | Citibank, -2, C]
Using Multiple Indexes
• Multiple indexes for data file very often needed
  • Provides many ways of searching for important data
  • Since each index file is read individually, this could also create a problem
• Multiple proxy instances for data could be created
  • Duplicates of instance are created for each index
  • Makes removing them all difficult, since not linked
• Very easy to solve: use Map while loading index
  • Converts positions in file to proxy instances to solve this
Linking Multiple Indexes
• Use one Map instance while reading all indexes
  • For each position in file, check if already in Map
  • Use existing proxy instance, if position already in Map
  • If a search in Map returns null, create new instance
  • Make sure to call put() when we must create proxy
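The linking steps above can be sketched as follows. `ProxyLinker` and its nested `Proxy` class are hypothetical; `Proxy` stands in for a real proxy such as `Stock` and holds only the record position. Each raw index is assumed to have already been read into a map from search term to position.

```java
import java.util.HashMap;
import java.util.Map;

class ProxyLinker {
    // Stand-in for a proxy class; holds only the record's position.
    static class Proxy {
        final long position;

        Proxy(long position) {
            this.position = position;
        }
    }

    // The one Map shared while loading every index: position -> proxy.
    private final Map<Long, Proxy> byPosition = new HashMap<>();

    // Returns the existing proxy for a position, creating it only once.
    Proxy proxyFor(long position) {
        Proxy proxy = byPosition.get(position);
        if (proxy == null) {
            proxy = new Proxy(position);
            byPosition.put(position, proxy);   // remember it for later indexes
        }
        return proxy;
    }

    // Converts one raw index (term -> position) into term -> shared proxy.
    <K> Map<K, Proxy> load(Map<K, Long> rawIndex) {
        Map<K, Proxy> result = new HashMap<>();
        for (Map.Entry<K, Long> entry : rawIndex.entrySet()) {
            result.put(entry.getKey(), proxyFor(entry.getValue()));
        }
        return result;
    }
}
```

Loading a ticker index and a name index through the same `ProxyLinker` gives both indexes the very same proxy object for a record, so removing it from every index no longer requires hunting for duplicates.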
What to Study for Midterm
• Study your Maps and Dictionarys
  • When would we use each of the ADTs? Why?
  • What do their methods do? Why do they differ?
• Consider each implementation of these ADTs
  • Explain why method has its given big-Oh complexity
  • Why use an implementation? Where is it used?
  • What are negatives or limitations of implementation?
  • What fields needed by implementation? Why is this?
What to Study for Midterm
• List-based approaches – Why? When?
• Hash tables
  • How do hash functions work? What does mod do?
  • How do we add & remove data from hash table?
  • What are collisions & how do we handle them?
  • What is real & pretend big-Oh complexity? Why?
• Binary Search Trees
  • How do we add, remove, & search in these trees?
  • How are data in BSTs organized? Tricks to their use?
  • How do we code & use BSTs? What methods exist?
What to Study for Midterm
• AVL Trees
  • How do we add, remove, & search in these trees?
  • How are data in them organized? Tricks to their use?
  • When must we reorganize tree? How is this done?
• Splay Trees
  • How do we add, remove, & search in these trees?
  • For each method is node splayed & which one?
  • How to chain splayings together? When do we stop?
What to Study for Midterm
• Class selection & design
  • Where do classes come from? How do we know?
  • When to use each connection between classes?
  • How to list methods & fields in UML class diagram?
• Comments & Outlines
  • When, where, and how much?
  • What should & should not be included?
Midterm Process
• Open-book & open-note test; do not memorize
  • But have methods & information at your fingertips
  • Use my slides ONLY with note(s) on that day's slides
  • Cannot use daily or weekly activities
  • Must submit all printed pages along with test
• Problems resemble the tone of those already seen
  • All new problems, however; do not memorize answers
  • Includes tracing, showing state of ADT, method returns
  • Coding, big-Oh analysis, and more can be asked
For Next Lecture
• Midterm #1 in class a week from Friday
  • Project #2 available on Angel on Friday, too
• Lab phase #2 due on Friday at midnight
  • I still will be out of town, but lab activity will be posted
  • Due a week from Friday; chance to use indexed files
• No class on Monday; take some time to relax
  • I will be out of town serving on an NSF grant panel
  • Updated schedule on Angel accounts for change