Data Organization Lots of Data Quickly Found Image courtesy: http://www.flickr.com/photos/juhansonin/4734829999/sizes/o/in/photostream//
Organization • Are you an organized person? How much time would it take you to find • your keys? • a file on your computer? • a phone number? • your homework paper? • Computers must keep track of literally trillions of items. Organization is essential for finding and processing this data.
Naming • Too many Daves: Did I ever tell you that Mrs. McCave Had twenty-three sons and she named them all Dave? Well, she did. And that wasn't a smart thing to do. You see, when she wants one and calls out, "Yoo-Hoo! Come into the house, Dave!" she doesn't get one. All twenty-three Daves of hers come on the run! This makes things quite difficult at the McCaves'. • Theodor Geisel (Dr. Seuss) wrote over 45 children's books.
Naming • Unique: a name must refer to exactly one thing; never more. 23 Daves! • One item should not have two names. Dr. Seuss or Theodor Geisel? • Descriptive: The name should describe its purpose or nature. Which is better? • zqiy.qcl • The Star Spangled Banner.mp3 • The name should be related to the location of the data if possible.
Case Study in Naming : URL • A URL is unique: it refers to exactly one web site, never two. • One web site rarely has two URLs. • Most URLs are descriptive. • http://en.wikipedia.org/wiki/Coral_snake • Location: the above URL describes where to find the web page on the server
Lists • A list is a sequence of items. Order matters. • Example: Most expensive paintings • The Card Players by Paul Cezanne • No. 5, 1948 by Jackson Pollock • Woman III by Willem de Kooning • Portrait of Adele Bloch-Bauer I by Gustav Klimt • Portrait of Dr. Gachet by Vincent van Gogh • Any item can be identified by its position • Woman III is the 3rd item in the list • When we identify an item by its position, we are using indexing • Indexing associates a unique number with an item in a list. • The index is unique, refers to exactly one item, and is related to the location of the item.
Lists • Conventions: • If X is a list, we will denote the ith item as X[i]. • Most computing systems use i=0 to denote the first item in the list. • Assume that the previous list is named Paintings. • Paintings[2] refers to Woman III • Paintings[0] refers to The Card Players
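The indexing convention above can be tried directly in Python, whose lists also count from 0. This is only a small sketch; the variable name `paintings` is chosen here to mirror the list name on the slide.

```python
# The list from the slides, most expensive first.
# Python lists are 0-indexed, matching the convention X[i] with i = 0 first.
paintings = [
    "The Card Players",
    "No. 5, 1948",
    "Woman III",
    "Portrait of Adele Bloch-Bauer I",
    "Portrait of Dr. Gachet",
]

first = paintings[0]   # the first item in the list
third = paintings[2]   # the 3rd item has index 2
print(first)
print(third)
```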
Storage • Computer memory is linear • Memory is a one-dimensional arrangement of storage units • Each storage unit is numbered with an address (or index) • Each storage unit can hold one item of data • We might consider memory to be a list of storage units.
Arrays • An array is the simplest way to store a list. • A section of memory is used to store the list items sequentially in memory
Arrays • Can we store more than one array in memory? • The name is an anchor • The name is a memory location • Indexing through the name is an offset • Arrays cannot resize • How to add to the paintings list?
Array Retrieval • Item retrieval • A[i] means : get the item at memory location (A + i) • Item deletion • Add an item as the most expensive. Erase the list. Re-write the new list.
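A rough sketch of the anchor-plus-offset idea, assuming we model memory as a plain Python list of storage units. The helper names `store_array` and `get_item` and the addresses 3 and 10 are made up for illustration; they are not part of the slides.

```python
# Hypothetical model: memory is one long list of storage units, addressed from 0.
memory = [None] * 16

def store_array(memory, anchor, items):
    """Write items into consecutive storage units starting at address `anchor`."""
    for offset, item in enumerate(items):
        memory[anchor + offset] = item

def get_item(memory, anchor, i):
    """A[i]: read the storage unit at address (anchor + i)."""
    return memory[anchor + i]

# Two arrays can share the same memory as long as their sections do not overlap.
A = 3    # the name A stands for the address where the first array starts
B = 10   # the name B stands for the address where the second array starts
store_array(memory, A, ["The Card Players", "No. 5, 1948", "Woman III"])
store_array(memory, B, [7, 8, 9])

print(get_item(memory, A, 2))   # reads memory[3 + 2]
print(get_item(memory, B, 0))   # reads memory[10 + 0]
```

Retrieval costs one addition and one memory read no matter how long the array is, which is why array access is called direct.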
Array Deletion • Delete item i from array A • Move A[i+1] to A[i] • Move A[i+2] to A[i+1] • etc… • How efficient is this? • How many items must move?
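The shifting steps above might look like this in Python. It is a sketch that assumes a fixed-size array whose unused slots hold `None`; the function name `delete_at` is invented here, and it counts the moves so the efficiency question can be explored.

```python
def delete_at(a, i, length):
    """Delete item i from a fixed-size array `a` that holds `length` items.

    Every item after position i moves one slot toward the front.
    Returns (new_length, number_of_moves).
    """
    moves = 0
    for j in range(i, length - 1):
        a[j] = a[j + 1]      # move A[j+1] to A[j]
        moves += 1
    a[length - 1] = None     # the last slot is now unused
    return length - 1, moves

a = [10, 20, 30, 40, 50]
new_length, moves = delete_at(a, 1, 5)
print(a, new_length, moves)
```

Deleting item i from an array of n items moves n - 1 - i items, so deleting near the front is the most expensive case.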
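Array Insertion • Insert an item at index 0 of A • Move A[0] to A[1] • Move A[1] to A[2] • Move A[2] to A[3] • etc.. • In practice the moves are carried out starting from the last item, so that no item is overwritten before it has been moved • How efficient is this? • How many items move?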
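A sketch of insertion at the front, assuming a fixed-capacity array with at least one unused slot at the end. The name `insert_front` is invented for this example. The loop runs from the last item backward, because moving A[0] to A[1] first would overwrite the old A[1] before it had been moved.

```python
def insert_front(a, length, item):
    """Insert `item` at index 0 of a fixed-capacity array holding `length` items.

    Returns (new_length, number_of_moves).
    """
    if length >= len(a):
        raise OverflowError("array is full; arrays cannot resize")
    moves = 0
    for j in range(length - 1, -1, -1):   # last item first, then back to A[0]
        a[j + 1] = a[j]                   # move A[j] to A[j+1]
        moves += 1
    a[0] = item
    return length + 1, moves

a = [1, 2, 3, None]          # three items, capacity four
new_length, moves = insert_front(a, 3, 0)
print(a, new_length, moves)
```

Inserting at index 0 moves every item already in the array, so the cost grows in step with the length of the list.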
Arrays • Advantages • direct (fast) access to data • efficiently uses available memory • Disadvantages • requires an index to access data • size of the data is fixed • adding/removing from the middle is 'hard'
GeoCache Metaphor • GeoCaching races: • start with a GPS coordinate • Go to the location • Find the cache (treasure chest) • Find an ‘item’ • Find the next GPS coordinate • Find the 'next' cache with the new coordinate • The first location allows you to access all items in the race
Linked Lists • Lists can be linked together in memory • A ‘node’ (analogous to the treasure chest) is a pair of adjacent memory locations • The 1st part of the node is the item • The 2nd part of the node is the memory location of the next node • If the next memory location is zero, you are at the end.
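To make the node layout concrete, here is a small sketch that again models memory as a Python list. The addresses (6, 2, 8) and the items are invented for illustration. Address 0 is reserved as the end marker, as described above.

```python
# Hypothetical memory of 12 storage units; address 0 is reserved to mean "end".
memory = [0] * 12

# Each node is a pair of adjacent locations: (item, address of next node).
memory[6], memory[7] = "a", 2   # first node:  item "a", next node at address 2
memory[2], memory[3] = "b", 8   # second node: item "b", next node at address 8
memory[8], memory[9] = "c", 0   # third node:  item "c", link 0 means end of list

def walk(memory, anchor):
    """Follow the links from the anchor, collecting items until a zero link."""
    items = []
    addr = anchor
    while addr != 0:
        items.append(memory[addr])   # 1st part of the node: the item
        addr = memory[addr + 1]      # 2nd part of the node: where to go next
    return items

print(walk(memory, 6))
```

Note that the nodes are not in list order in memory; only the links determine the order. A link that points back to an earlier node would make `walk` loop forever.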
Links • Consider storing a list of numbers in memory • What list does the array contain? • Assume that the anchor is "104" • Assume that the 'end link' is zero • What if the value at location 109 were set to 104? • What if the value at location 105 were set to 0?
Linked List Retrieval • Given an index i, how to find the ith item in the list? • Must chain through i-1 items • Deleting an item • Once we find the item to delete we change the value held in one memory location. Which one? • Adding an item • Find an empty pair of memory locations and create a node • Insert the item to store into the first part of the node • Insert the memory location of the next thing into the second part of the node • Change the 2nd part of the previous node to reference the newly created node.
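The three operations above can be sketched as follows, under the same assumed memory model (a Python list where address 0 means "end"). The function names, addresses and values are hypothetical. Here `get` uses 0-based indexing, so it follows i links to reach item i.

```python
def get(memory, anchor, i):
    """Return the item at 0-based index i by chaining through i links."""
    addr = anchor
    for _ in range(i):
        addr = memory[addr + 1]
    return memory[addr]

def insert_after(memory, prev_addr, free_addr, item):
    """Create a node in the empty pair at free_addr and link it after prev."""
    memory[free_addr] = item                       # 1st part: the item
    memory[free_addr + 1] = memory[prev_addr + 1]  # 2nd part: the old next node
    memory[prev_addr + 1] = free_addr              # previous node now points here

def delete_after(memory, prev_addr):
    """Unlink the node that follows prev; only one memory location changes."""
    doomed = memory[prev_addr + 1]
    memory[prev_addr + 1] = memory[doomed + 1]

def to_list(memory, anchor):
    """Helper for inspection: collect all items from the anchor to the end."""
    items, addr = [], anchor
    while addr != 0:
        items.append(memory[addr])
        addr = memory[addr + 1]
    return items

memory = [0] * 12
memory[2], memory[3] = 10, 6   # anchor node at address 2
memory[6], memory[7] = 20, 4
memory[4], memory[5] = 30, 0
anchor = 2

second = get(memory, anchor, 1)
insert_after(memory, 6, 8, 25)   # put 25 after the node holding 20
after_insert = to_list(memory, anchor)
delete_after(memory, anchor)     # remove the node holding 20
after_delete = to_list(memory, anchor)
print(second, after_insert, after_delete)
```

No items are shifted by either update; only links change. Deleting the very first node is a special case in this sketch, since the anchor itself would have to change.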
Graphs • A graph is a mathematical abstraction • Node • sometimes called a vertex • an item in the graph • Arc • a directed connection between two nodes • written as (N1, N2) meaning from N1 to N2 • A graph is a set of nodes and a set of arcs • G = (V, E) • V is a set of nodes • E is a set of arcs
Example • Example: • V = {A,B,C,D,E} • E = {(A,E), (A,B), (B,A), (B,D), (C,E), (D,C), (E,B), (E,C), (E,D)} • G = (V, E)
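The example graph can be written down almost verbatim using Python sets and tuples. This is one possible encoding, not the only one; the `successors` table is an extra convenience added here.

```python
# V is a set of nodes; E is a set of arcs, each arc a (from, to) pair.
V = {"A", "B", "C", "D", "E"}
E = {("A", "E"), ("A", "B"), ("B", "A"), ("B", "D"), ("C", "E"),
     ("D", "C"), ("E", "B"), ("E", "C"), ("E", "D")}
G = (V, E)

# Sanity check: every arc must connect two nodes that are in V.
assert all(u in V and v in V for (u, v) in E)

# For each node, list the nodes its outgoing arcs lead to.
successors = {v: sorted(w for (u, w) in E if u == v) for v in sorted(V)}
for node, targets in successors.items():
    print(node, "->", targets)
```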
Graphs Model the Real World • Graphs are truly ubiquitous in computational thought because they are able to capture the essence of a wide variety of real-world problems and their solutions. • Games : each node represents the board after a player has moved and each arc represents one player's move. • Chemical structures : each node is associated with an atom and each arc represents a bond between atoms. • Electrical circuits : each node represents an electrical connection between two components and each arc represents an electrical component such as a resistor or capacitor. • The national power grid : each node represents a transformer and each arc represents a power line that connects transformers. • Computer networks (e.g. the Internet) : each node represents ? and each arc ? • A university's curriculum : each node represents ? each arc ?
Graphs : Definitions • Adjacency: Assume that U and V are vertices in some graph. Vertex U is adjacent to vertex V if there is an arc (U, V) in the graph. • Loop: any arc such that the first and second nodes of the arc are the same. • In-degree: the in-degree of a vertex V is the number of arcs in the graph having V as the second vertex. • Out-degree: the out-degree of a vertex V is the number of arcs in the graph having V as the first vertex. • Order: the number of vertices. • Size: the number of arcs. • Path: a path is a sequence of vertices such that for every pair of adjacent vertices in the sequence there is a corresponding arc in the graph. Also, a sequence containing a single vertex is a path. • Path length: the number of arcs in the path. • Cycle: a cycle is a path where the length is greater than zero and the first and last vertex are the same. A graph without any cycles is known as an acyclic graph.
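Each definition above translates into a line or two of Python. The sketch below applies them to the example graph G = (V, E) given earlier; the function names are invented here. The acyclic test uses one common approach (repeatedly discarding vertices that have no incoming arcs), which is an assumption of this sketch rather than a method from the slides.

```python
V = {"A", "B", "C", "D", "E"}
E = {("A", "E"), ("A", "B"), ("B", "A"), ("B", "D"), ("C", "E"),
     ("D", "C"), ("E", "B"), ("E", "C"), ("E", "D")}

def is_adjacent(E, u, v):
    """U is adjacent to V if the arc (U, V) is in the graph."""
    return (u, v) in E

def out_degree(E, v):
    """Number of arcs having v as the first vertex."""
    return sum(1 for (a, _) in E if a == v)

def in_degree(E, v):
    """Number of arcs having v as the second vertex."""
    return sum(1 for (_, b) in E if b == v)

def has_loop(E):
    """A loop is an arc whose first and second nodes are the same."""
    return any(a == b for (a, b) in E)

def is_path(V, E, seq):
    """Every consecutive pair in seq must be an arc; one vertex alone is a path."""
    if not seq or any(v not in V for v in seq):
        return False
    return all((a, b) in E for a, b in zip(seq, seq[1:]))

def path_length(seq):
    """The number of arcs in the path."""
    return len(seq) - 1

def is_acyclic(V, E):
    """Repeatedly discard vertices with no incoming arc from a remaining vertex.

    If some vertices can never be discarded, they lie on or behind a cycle.
    """
    remaining = set(V)
    while remaining:
        sources = {v for v in remaining
                   if not any(b == v and a in remaining for (a, b) in E)}
        if not sources:
            return False
        remaining -= sources
    return True

order = len(V)   # the number of vertices
size = len(E)    # the number of arcs
print(order, size, out_degree(E, "A"), in_degree(E, "A"))
```

Calling these functions on a graph is a convenient way to check answers worked out by hand.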
Graphs • The order? • The size? • Is A adjacent to E? • Is E adjacent to A? • Out-degree of A? • In-degree of A? • Is there a loop? • Is [A, E, C, E] a path? • Is [A, B, A] a path? • Is the graph acyclic?
Storing Graphs in Memory • They can be stored using array-like techniques. We will discuss a linking strategy. • Similar to lists but each node may have an out-degree other than 1. A ‘node’ stores the information associated with a single vertex V. • The vertex contents • The out-degree (an integer number we’ll call N) • N addresses of the adjacent nodes.
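One way to picture the linking strategy is to lay a small graph out in a flat list of storage units. The graph below (three vertices, three arcs) and the function name `layout` are hypothetical, chosen only to keep the result short. Each node occupies 2 + N consecutive units: contents, out-degree N, then N addresses.

```python
def layout(vertices, arcs):
    """Store a graph in flat memory; returns (memory, address_of_each_vertex)."""
    succ = {v: [w for (u, w) in arcs if u == v] for v in vertices}

    # First pass: decide where each node starts.
    address = {}
    next_free = 0
    for v in vertices:
        address[v] = next_free
        next_free += 2 + len(succ[v])

    # Second pass: write contents, out-degree, and the adjacent nodes' addresses.
    memory = [None] * next_free
    for v in vertices:
        a = address[v]
        memory[a] = v
        memory[a + 1] = len(succ[v])
        for k, w in enumerate(succ[v]):
            memory[a + 2 + k] = address[w]
    return memory, address

vertices = ["A", "B", "C"]
arcs = [("A", "B"), ("A", "C"), ("C", "A")]
memory, address = layout(vertices, arcs)
print(memory)
print(address)
```

Because each node records its own out-degree, no special end marker is needed, and address 0 can hold an ordinary node in this layout.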
Trees • [Figure: three example graphs, labeled "not a tree", "not a tree", and "a tree"] • A Tree is a type of graph that models hierarchical data. • Has exactly one node with in-degree zero. This node is referred to as the ‘root’. A tree may have no nodes at all; a situation that is an exception to the ‘one-node’ rule given here. • Every node other than the root has an in-degree of one • There is a path from the root to every other vertex
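The three tree conditions can be checked mechanically. The sketch below assumes a graph given as a set of vertices and a set of (from, to) arcs whose endpoints are all in the vertex set; the function name `is_tree` and the small test graphs are made up for illustration.

```python
def is_tree(V, E):
    """Check the tree conditions for a graph with vertex set V and arc set E."""
    if not V:
        return True                      # a tree may have no nodes at all

    in_deg = {v: 0 for v in V}
    for (_, b) in E:
        in_deg[b] += 1

    # Exactly one node (the root) has in-degree zero.
    roots = [v for v in V if in_deg[v] == 0]
    if len(roots) != 1:
        return False
    root = roots[0]

    # Every node other than the root has an in-degree of one.
    if any(in_deg[v] != 1 for v in V if v != root):
        return False

    # There is a path from the root to every other vertex.
    reached = {root}
    frontier = [root]
    while frontier:
        u = frontier.pop()
        for (a, b) in E:
            if a == u and b not in reached:
                reached.add(b)
                frontier.append(b)
    return reached == set(V)

print(is_tree({"R", "X", "Y"}, {("R", "X"), ("R", "Y")}))
print(is_tree({"R", "X", "Y"}, {("R", "X"), ("R", "Y"), ("X", "Y")}))
```

The third check matters: a root sitting next to a separate two-node cycle passes the in-degree tests but is not a tree.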
Example • Consider storing the following tree in memory • We have 30 memory slots • Each slot can hold either a 'letter' or a 'number' • Each 'node' is formatted as • node value • number of children N • N links
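The tree from the slide's figure is not reproduced in this text, so the sketch below uses a made-up four-node tree (root A with children B and C; B with child D) purely to illustrate the node format: value, number of children N, then N links. The helper names and the hand-picked addresses are assumptions of this example.

```python
memory = [None] * 30   # 30 slots, each holding a letter or a number

def write_node(memory, addr, value, links):
    """Write one node at addr: value, number of children, then the links.

    Returns the address of the next free slot after this node.
    """
    memory[addr] = value
    memory[addr + 1] = len(links)
    for k, link in enumerate(links):
        memory[addr + 2 + k] = link
    return addr + 2 + len(links)

# Hypothetical tree, with addresses worked out by hand:
# A uses slots 0-3, B uses 4-6, C uses 7-8, D uses 9-10.
write_node(memory, 0, "A", [4, 7])
write_node(memory, 4, "B", [9])
write_node(memory, 7, "C", [])
next_free = write_node(memory, 9, "D", [])

def preorder(memory, addr):
    """Read the tree back out: visit a node, then each of its children in turn."""
    value, n = memory[addr], memory[addr + 1]
    result = [value]
    for k in range(n):
        result.extend(preorder(memory, memory[addr + 2 + k]))
    return result

print(memory[:next_free])
print(preorder(memory, 0))
```

A node with N children uses 2 + N slots, so this four-node tree fits in 11 of the 30 slots.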