Explore the challenges of data management in computer programs and learn about solutions such as reducing volume, structuring data, and residualizing data. Database management systems play a crucial role in storing and managing residualized data.
Section 2. Horizontal Data

Let's take a look at some issues and problems regarding data in general.

CENTRALITY OF DATA
Data are central to every computer program. If a program has no data, there is no input, no output, no constants, no variables... It is hard to imagine a program in which there is no data. Therefore, virtually all programs are data management programs, and virtually all computing involves data management. However, not all data in computer programs is RESIDUALIZED. RESIDUALIZED data is data stored and managed after the termination of the program that generated it (for later reuse). Database Management Systems (DBMSs) store and manage residualized data.
ISSUES AND PROBLEMS WITH DATA (CONTINUED)

HUGE VOLUME (EVERYONE HAS LOTS OF DATA AVAILABLE TO THEM TODAY!)
Data are collected much faster than data are processed or managed. NASA's Earth Observation System (EOS) alone has already collected over 15 petabytes of data (15,000,000,000,000,000 bytes). This collection includes Landsat data: earth surface "snapshots" (RGB and IR plus some hybrids) which have been collected of nearly every spot on the earth every ~18 days since 1972. Most of it will never be used! Most of it will never be seen! Why not? There is so much volume that the usefulness of much of it will never be discovered.

SOLUTION: Reduce the volume and raise the information density through structuring, querying, filtering, mining, summarizing, aggregating... That's the main task of data and database workers today! Claude Shannon's information theory principle comes into play here: more volume means less information.
Shannon's Law of Information (AKA: Shannon's Canon)
The more volume you have, the less information you have.

A simple illustration: which phone book has more useful information? (Both have the same 4 data granules: Smith, Jones, 234-9816, 231-7237.)

BOOK-1               BOOK-2
Name   Number        Name   Number
Smith  234-9816      Smith  234-9816
Jones  231-7237      Smith  231-7237
                     Jones  234-9816
                     Jones  231-7237

The Red Book (BOOK-2) has no useful phone number information! Data analysis, querying and mining reduce volume and raise the information level.
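One way to put a number on the phone book illustration is to ask how uncertain a reader still is about the number after looking up the name. The sketch below uses conditional entropy for that; the choice of measure and the function name uncertainty_bits are assumptions made for illustration, not something the slide defines.

```python
from collections import Counter, defaultdict
from math import log2

def uncertainty_bits(book):
    """Average uncertainty (in bits) about the number once the name is known."""
    numbers_by_name = defaultdict(list)
    for name, number in book:
        numbers_by_name[name].append(number)
    total = len(book)
    bits = 0.0
    for numbers in numbers_by_name.values():
        weight = len(numbers) / total            # share of entries under this name
        counts = Counter(numbers)
        entropy = -sum((c / len(numbers)) * log2(c / len(numbers))
                       for c in counts.values())
        bits += weight * entropy
    return bits

book_1 = [("Smith", "234-9816"), ("Jones", "231-7237")]
book_2 = [("Smith", "234-9816"), ("Smith", "231-7237"),
          ("Jones", "234-9816"), ("Jones", "231-7237")]

print(uncertainty_bits(book_1))  # 0.0 bits: each name pins down one number
print(uncertainty_bits(book_2))  # 1.0 bit: each name leaves a coin flip between two numbers
```

BOOK-2 holds twice the volume, yet a lookup in it leaves the reader exactly as unsure as before opening the book.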
STRUCTURING and RESIDUALIZING DATA
Proper structuring of data may be the second most important task in data and database system work today! At the highest level is the decision as to whether a data set should be structured as horizontal or vertical data (or some combination).

Another important task to be addressed in data systems work today is RESIDUALIZATION OF DATA. MUCH WELL-STRUCTURED DATA IS DISCARDED PREMATURELY. Databases are about storing data persistently, for later use. RESIDUALIZING DATA may be the third most important task in data and database system work today!
WHAT IS A DATABASE?
There are many definitions in the literature. Here is the one we will use: an integrated, shared repository of operational data of interest to an enterprise.
INTEGRATED: it must be the unification of several distinct files.
SHARED: the same data can be used by more than 1 user (concurrently).
REPOSITORY: implies "persistence".
OPERATIONAL DATA: data on accounts, parts, patients, students, employees, genes, stock, pixels,... By contrast, nonoperational data includes I/O data, transient data in buffers, queues...
ENTERPRISE: bank, warehouse, hospital, school, corp, gov agency, person...
WHAT IS A DATABASE MANAGEMENT SYSTEM (DBMS)?
A program which organizes and manages access to residual data. Databases also contain METADATA (data on the data). Metadata is non-user data which contains the descriptive information about the data and the database organization (i.e., catalog data).
WHY USE A DATABASE?
COMPACTNESS (saves space - no paper files necessary).
EASE OF USE (less drudgery, more of the organizational and search work done by the system; user specifies what, not how).
CENTRALIZED CONTROL (by the DB Administrator (DBA) and by the CEO).
REDUCES REDUNDANCY (1 copy is enough, but concurrent use must be controlled).
NO INCONSISTENCIES (again, since only 1 copy is necessary).
ENFORCE STANDARDS (corporate, dept, industry, national, international).
INTEGRITY CONSTRAINTS (automatically maintained; e.g., GENDER=male => MAIDEN_NAME=null).
BALANCE REQUIREMENTS (even conflicting requirements; the DBA can optimize for the whole company).
DATA INDEPENDENCE (applications are immune to storage structure and access strategy changes: the storage structure can be changed without changing the access programs, and vice versa).
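The integrity constraint example (GENDER=male => MAIDEN_NAME=null) can be sketched as a check that runs on every insert. This is only an illustration of the idea of an automatically maintained constraint: the function names, the dict-based records and the NAME field are assumptions, and a real DBMS would declare such a rule in its catalog rather than in application code.

```python
def satisfies_maiden_name_rule(record):
    """Return True if the record obeys GENDER=male => MAIDEN_NAME=null."""
    if record.get("GENDER") == "male":
        return record.get("MAIDEN_NAME") is None
    return True  # the implication says nothing about other records

def checked_insert(table, record):
    """Append the record only if the constraint holds; otherwise raise ValueError."""
    if not satisfies_maiden_name_rule(record):
        raise ValueError("integrity constraint violated: GENDER=male => MAIDEN_NAME=null")
    table.append(record)

people = []
checked_insert(people, {"NAME": "Jones", "GENDER": "male", "MAIDEN_NAME": None})
checked_insert(people, {"NAME": "Smith", "GENDER": "female", "MAIDEN_NAME": "Thom"})
```

An insert such as {"NAME": "Thom", "GENDER": "male", "MAIDEN_NAME": "Trath"} would be rejected before it reaches the table.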
HORIZONTAL DATA
Almost all commercial databases today are HORIZONTAL. That is, they contain horizontally structured data. Horizontal data is data formed into files of horizontal records of a common type.

HORIZONTAL DATA TERMINOLOGY
FIELDS, RECORDS and FILES can each be described in two ways:
stored (physical, on disk) versus logical (as viewed by the user)
type (e.g., datatype) versus occurrences (instances)
TYPE: defines structure and expected contents (time-independent - changes only upon DB reorganization).
OCCURRENCE: actual data instances at a given time (time-dependent - changes with every insert/delete/update).
STORED FIELD is the smallest unit of stored data in a database. E.g., Jones is a Lname stored field occurrence. Char 25 might be the metadata type of that occurrence.

STORED RECORD is a named horizontal concatenation of related stored fields. E.g.,
| Jones | John | 412 Elm St | Fargo | ND | 58102 |   an instance
Lname, Fname, Address, City, St, Zip   field names
Lname(char25), Fname(char15), Address(char20), City(char15), St(char2), Zip(char5)   field types

STORED FILE is a named collection of all occurrences of 1 type of stored record. E.g.,
Employee
| Lname | Fname | Address  | City  | St | Zip   |   record and field names
| Jones | John  | 412 Elm  | Fargo | ND | 58102 |   record instance
| Smith | James | 415 Oak  | Mhd   | MN | 56560 |   record instance
| Thom  | Bob   | 12 Main  | Mhd   | MN | 56560 |   record instance
| Trath | Phil  | 234 12St | Fargo | ND | 58105 |   record instance
. . .
EoF   End of File marker

Stored? (versus logical)
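The distinction between a record type and its occurrences can be sketched in a few lines. The field names and char widths come from the slide; the representation (a list of name/width pairs as the type, dicts as occurrences) and the names EMPLOYEE_TYPE and make_occurrence are assumptions made for illustration.

```python
# Record TYPE: time-independent metadata (field names and char widths from the slide).
EMPLOYEE_TYPE = [("Lname", 25), ("Fname", 15), ("Address", 20),
                 ("City", 15), ("St", 2), ("Zip", 5)]

def make_occurrence(record_type, values):
    """Build one record OCCURRENCE, checking it against the record type."""
    if len(values) != len(record_type):
        raise ValueError("wrong number of fields")
    occurrence = {}
    for (name, width), value in zip(record_type, values):
        if len(value) > width:
            raise ValueError(f"{name} exceeds char{width}")
        occurrence[name] = value
    return occurrence

# File occurrence: all record occurrences of the one type, at this moment in time.
employee_file = [
    make_occurrence(EMPLOYEE_TYPE, ["Jones", "John", "412 Elm", "Fargo", "ND", "58102"]),
    make_occurrence(EMPLOYEE_TYPE, ["Smith", "James", "415 Oak", "Mhd", "MN", "56560"]),
    make_occurrence(EMPLOYEE_TYPE, ["Thom", "Bob", "12 Main", "Mhd", "MN", "56560"]),
    make_occurrence(EMPLOYEE_TYPE, ["Trath", "Phil", "234 12St", "Fargo", "ND", "58105"]),
]
```

Inserting or deleting an employee changes employee_file (the occurrences) but leaves EMPLOYEE_TYPE untouched, which mirrors the time-dependent versus time-independent split.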
Stored continued

The employee file type IS the common employee record type (plus, possibly, some other type characteristics, e.g., max-#-records).

In today's storage device world there is only linear storage space, so the 2-D picture of a stored file is, strictly speaking, not possible in physical storage media today. Some day there may be truly 2-D storage (e.g., holographic storage) and even 3-D. A more accurate depiction of the stored Employee file (as stored on linear storage):

| Jones | John | 412 Elm |Fargo| ND|58102| Smith | James | 415 Oak | Mhd |MN|56560| Thom | Bob | 12 Main | Mhd |MN|56560| Trath | Phil | 234 12St |Fargo|ND |58105| EoF |

How these entities are stored and how they are viewed or known to users may differ. They may be known to the users in various logical variations. A logical record based on the 1st occurring employee record above might be:

| Jones | John |Fargo| ND|

So we also have:
LOGICAL FIELD = smallest unit of logical data.
LOGICAL RECORD = named collection of related logical fields.
LOGICAL FILE = named collection of occurrences of 1 type of logical record,
which may or may not correspond to the physical entities.
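The gap between linear stored data and a logical record can be sketched with fixed-width fields packed into one long string. The widths are the char sizes given for the Employee record type; the padding scheme and the function names are assumptions, not a description of any particular DBMS's storage format.

```python
FIELDS = [("Lname", 25), ("Fname", 15), ("Address", 20),
          ("City", 15), ("St", 2), ("Zip", 5)]
RECORD_LENGTH = sum(width for _, width in FIELDS)  # 82 characters per stored record

def store(records):
    """Concatenate all records into one linear string of space-padded fields."""
    return "".join(value.ljust(width)
                   for record in records
                   for (_, width), value in zip(FIELDS, record))

def read_stored_record(storage, position):
    """Slice the record at the given position out of linear storage."""
    start = position * RECORD_LENGTH
    chunk = storage[start:start + RECORD_LENGTH]
    record, offset = {}, 0
    for name, width in FIELDS:
        record[name] = chunk[offset:offset + width].rstrip()
        offset += width
    return record

def logical_record(storage, position, wanted=("Lname", "Fname", "City", "St")):
    """A logical view: only some stored fields are exposed to the user."""
    stored = read_stored_record(storage, position)
    return tuple(stored[name] for name in wanted)

storage = store([
    ("Jones", "John", "412 Elm", "Fargo", "ND", "58102"),
    ("Smith", "James", "415 Oak", "Mhd", "MN", "56560"),
])
print(logical_record(storage, 0))  # ('Jones', 'John', 'Fargo', 'ND')
```

The user program sees the four-field logical record and never needs to know that the bytes sit end to end in one linear run.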
Terminology
Unfortunately there is a lot of variation in terminology. It will suffice to "equate" terms as follows in this course:

COMMON USAGE   RELATIONAL MODEL   TABULAR USAGE
File           Relation           Table
Record         Tuple              Row
Field          Attribute          Column

When we need to be more careful we will use:
a relation is a "set" of tuples, whereas a table is a "sequence" of rows or records (has order);
a tuple is a "set" of fields, whereas a row or record is a "sequence" of fields (has order).
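The set-versus-sequence distinction shows up directly when comparing data. In this sketch a table is a Python list of tuples (order matters) and a relation is a set of frozensets of (attribute, value) pairs (order does not matter); that encoding and the helper as_relation are assumptions chosen for illustration.

```python
# Tabular view: a table is a sequence of rows, and each row is a sequence of fields.
table_a = [("Jones", "Fargo"), ("Smith", "Mhd")]
table_b = [("Smith", "Mhd"), ("Jones", "Fargo")]   # same rows, different order

def as_relation(table, attributes):
    """Relational view: a set of tuples, each tuple a set of (attribute, value) pairs."""
    return {frozenset(zip(attributes, row)) for row in table}

relation_a = as_relation(table_a, ("Lname", "City"))
relation_b = as_relation(table_b, ("Lname", "City"))
# Same data again, with the columns listed in the other order.
relation_c = as_relation([("Fargo", "Jones"), ("Mhd", "Smith")], ("City", "Lname"))

print(table_a == table_b)        # False: row order is part of a table
print(relation_a == relation_b)  # True: a relation has no row order
print(relation_a == relation_c)  # True: a tuple has no field order
```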
DATA MODELS
For conceptualizing (logically) and storing (physically) data in a database we have horizontal and vertical models. Here are some of the HORIZONTAL MODELS for files of horizontal records (in which processing is typically done through vertical scans, e.g., get and process the 1st record, get and process the next record, ...):
RELATIONAL (simple flat unordered files or relations of records or tuples of unordered field values)
TABULAR (ordered files of ordered fields)
INVERTED LIST (Tabular with an access path (index?) on every field)
HIERARCHICAL (files with hierarchical links)
NETWORK (files with record chains)
OBJECT-RELATIONAL (Relational with "Large OBject" (LOB) fields: attributes which point to or contain complex objects)
MAP REDUCE (framework for processing parallelizable problems across huge datasets using a large number of computers (nodes) in a cluster arrangement)
DATA MODELS cont.
Here are some of the VERTICAL MODELS (for vertical vectors or trees of attribute values; processing is typically through logical horizontal AND/OR programs):
BINARY STORAGE MODEL (Copeland ~1986) (This model used vertical value and bit vectors. It has virtually disappeared!)
BIT TRANSPOSE FILES (Wang ~1988) (This model used vertical bit files. It has also virtually disappeared!)
VIPER STRUCTURES (~1998) (Used vertical bit vectors for data mining.)
PREDICATE-Trees or pTrees (~1997) (This model and technology is patented by NDSU and uses vertical bit trees.)
(The last one is the only one described in detail in these notes.)
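To give a feel for "vertical bit vectors" processed with logical AND/OR operations, here is a generic bit-slicing sketch. It is not an implementation of the Binary Storage Model, Bit Transpose Files, Viper structures or pTrees; the column values, the 3-bit width and both function names are assumptions made for illustration.

```python
def bit_slices(values, width):
    """Return one vertical bit vector per bit position, most significant bit first."""
    return [[(v >> bit) & 1 for v in values] for bit in range(width - 1, -1, -1)]

def rows_equal_to(slices, target):
    """Row mask for value == target, built only from AND and NOT on bit vectors."""
    width = len(slices)
    mask = [1] * len(slices[0])
    for i, vector in enumerate(slices):
        want = (target >> (width - 1 - i)) & 1
        # AND with the slice when the target bit is 1, with its complement when it is 0.
        mask = [m & (b if want else 1 - b) for m, b in zip(mask, vector)]
    return mask

status = [1, 1, 4, 3, 3]          # one small attribute column, stored horizontally
slices = bit_slices(status, 3)    # the same column, stored as 3 vertical bit vectors
print(slices)                     # [[0, 0, 1, 0, 0], [0, 0, 0, 1, 1], [1, 1, 0, 1, 1]]
print(rows_equal_to(slices, 3))   # [0, 0, 0, 1, 1]
```

The query never touches a whole record: it combines three bit vectors position by position, which is the sense in which vertical processing runs "horizontally" across vectors.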
REVIEW OF HORIZONTAL DATA MODELS

RELATIONAL DATA MODEL
The only construct allowed is a [simple, flat] relation for both entity description and relationship definition, e.g.,

STUDENT
|S#|SNAME |LCODE |
|25|CLAY  |NJ5101|
|32|THAISZ|NJ5102|
|38|GOOD  |FL6321|
|17|BAID  |NY2091|
|57|BROWN |NY2092|

COURSE
|C#|CNAME|SITE|
|8 |DSDE |ND  |
|7 |CUS  |ND  |
|6 |3UA  |NJ  |
|5 |3UA  |ND  |

ENROLL
|S#|C#|GRADE|
|32|8 | 89  |
|32|7 | 91  |
|25|7 | 68  |
|25|6 | 76  |
|32|6 | 62  |
|38|6 | 98  |
|17|5 | 96  |

LOCATION
|LCODE |STATUS|
|NJ5101| 1    |
|NJ5102| 1    |
|FL6321| 4    |
|NY2091| 3    |
|NY2092| 3    |

The STUDENT and COURSE relations represent entities.
The LOCATION relation represents a relationship between the LCODE and STATUS attributes (1-to-many).
The ENROLL relation represents a relationship between Student and Course entities (a many-to-many relationship).
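Because ENROLL is just another flat relation, the many-to-many relationship can be followed in either direction with the same kind of join. The sketch below uses the tuples from the relations above; the function names and the use of plain Python lists instead of a query language are assumptions made for illustration.

```python
STUDENT = [(25, "CLAY", "NJ5101"), (32, "THAISZ", "NJ5102"), (38, "GOOD", "FL6321"),
           (17, "BAID", "NY2091"), (57, "BROWN", "NY2092")]
COURSE = [(8, "DSDE", "ND"), (7, "CUS", "ND"), (6, "3UA", "NJ"), (5, "3UA", "ND")]
ENROLL = [(32, 8, 89), (32, 7, 91), (25, 7, 68), (25, 6, 76),
          (32, 6, 62), (38, 6, 98), (17, 5, 96)]

def enrollment_list(c_num):
    """Join ENROLL to STUDENT: names and grades of everyone in one course."""
    names = {s: name for s, name, _ in STUDENT}
    return [(names[s], grade) for s, c, grade in ENROLL if c == c_num]

def class_list(s_num):
    """Join ENROLL to COURSE: course names and grades for one student."""
    courses = {c: cname for c, cname, _ in COURSE}
    return [(courses[c], grade) for s, c, grade in ENROLL if s == s_num]

print(enrollment_list(6))  # [('CLAY', 76), ('THAISZ', 62), ('GOOD', 98)]
print(class_list(32))      # [('DSDE', 89), ('CUS', 91), ('3UA', 62)]
```

Neither direction is structurally favored: both queries scan the same ENROLL relation and differ only in which attribute they filter on.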
REVIEW OF HORIZONTAL DATA MODELS

HIERARCHICAL DATA MODEL
entities = records
relationships = links of records forming trees

EX: the root type is STUDENT (with attributes S#, NAME, LOCATION), the dependent type is COURSE (with attributes C#, CNAME), and the 2nd-level dependent type is ENROLLMENT (with attributes GRADE, LOC).

[Figure: tree of linked records, listed here by level.
STUDENTS: 25|CLAY|OTBK, 32|THAISZ|KNB, 38|GOOD|GTR
COURSES: 7|CUS, 8|DSDE, 6|3UA, 6|3UA, 7|CUS, 6|3UA
ENROLLMENTS: ND|68, ND|89, NJ|98, NJ|76, ND|62, ND|91]

If the typical workload involves producing class lists for students, this organization is very good. Why?
If the typical workload is producing course enrollment lists for professors, this is very poor. Why?
The problem with the Hierarchical Data Model is that it almost always favors a particular workload category (at the expense of the others).
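The two "Why?" questions can be explored with a nested structure that mimics the STUDENT -> COURSE -> ENROLLMENT tree. The nested-dict layout and function names are assumptions, and the pairing of students, courses and grades is taken from the ENROLL relation in the relational model example, since the slide's tree figure does not survive in this text.

```python
# Hypothetical nesting: STUDENT root -> COURSE dependents -> ENROLLMENT grade.
hierarchy = {
    25: {"NAME": "CLAY", "COURSES": {7: {"CNAME": "CUS", "GRADE": 68},
                                     6: {"CNAME": "3UA", "GRADE": 76}}},
    32: {"NAME": "THAISZ", "COURSES": {8: {"CNAME": "DSDE", "GRADE": 89},
                                       7: {"CNAME": "CUS", "GRADE": 91},
                                       6: {"CNAME": "3UA", "GRADE": 62}}},
    38: {"NAME": "GOOD", "COURSES": {6: {"CNAME": "3UA", "GRADE": 98}}},
}

def class_list(s_num):
    """Courses and grades for one student: one root lookup, then one subtree."""
    courses = hierarchy[s_num]["COURSES"]
    return [(c["CNAME"], c["GRADE"]) for c in courses.values()]

def course_enrollment(c_num):
    """Students in one course: every root is visited, since courses hang below students."""
    roster, roots_visited = [], 0
    for student in hierarchy.values():
        roots_visited += 1
        course = student["COURSES"].get(c_num)
        if course is not None:
            roster.append((student["NAME"], course["GRADE"]))
    return roster, roots_visited

print(class_list(25))        # [('CUS', 68), ('3UA', 76)]
print(course_enrollment(7))  # ([('CLAY', 68), ('THAISZ', 91)], 3)
```

The class list touches a single tree, while the course enrollment list has to walk all of them; note also that the course name CUS is stored once per enrolled student.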
REVIEW OF HORIZONTAL DATA MODELS

NETWORK DATA MODEL
entities = records
relationships = owner-member chains (sets)
many-to-many relationships are easily accommodated

EX: 3 entities (STUDENT, ENROLLMENT, COURSE) and 2 owner-member chains: STUDENT-ENROLLMENT and COURSE-ENROLLMENT.

[Figure: records linked by owner-member chains.
STUDENT records: 25|CLAY|MJ511, 32|THAISZ|NJ512
ENROLLMENT records: 68, 76, 89, 91, 62
COURSE records: 8|DSDE|ND, 7|CUS|ND, 6|3UA|NJ]

Easy to insert (create a new record and reset pointers), delete (reset pointers), and update (there is always just 1 copy to worry about: ZERO REDUNDANCY!).
Network approach: fast processing, complicated structure (usually requires a data processing shop).
Again, it favors one workload type over others.
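The two owner-member chains can be sketched with object references standing in for pointers: each ENROLLMENT record is a member of one STUDENT chain and one COURSE chain. The class names and list-based chains are assumptions, and which grade links to which student and course is taken from the ENROLL relation in the relational example rather than from the lost figure.

```python
class Owner:
    """A STUDENT or COURSE record that owns a chain of ENROLLMENT records."""
    def __init__(self, key, name):
        self.key, self.name = key, name
        self.members = []            # the owner-member chain

class Enrollment:
    """One ENROLLMENT record, stored once and linked into two chains."""
    def __init__(self, student, course, grade):
        self.student, self.course, self.grade = student, course, grade
        student.members.append(self)   # insert = create record and set pointers
        course.members.append(self)

    def delete(self):
        self.student.members.remove(self)  # delete = reset pointers in both chains
        self.course.members.remove(self)

students = {25: Owner(25, "CLAY"), 32: Owner(32, "THAISZ")}
courses = {8: Owner(8, "DSDE"), 7: Owner(7, "CUS"), 6: Owner(6, "3UA")}
enrollments = [Enrollment(students[s], courses[c], g)
               for s, c, g in [(25, 7, 68), (25, 6, 76), (32, 8, 89), (32, 7, 91), (32, 6, 62)]]

def grades_for(owner):
    """Follow one owner's chain, whichever kind of owner it is."""
    return [e.grade for e in owner.members]

print(grades_for(students[32]))  # [89, 91, 62]
print(grades_for(courses[7]))    # [68, 91]
```

Each grade exists in exactly one Enrollment object, so an update changes one copy that both chains see.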
REVIEW OF HORIZONTAL DATA MODELS

INVERTED LIST MODEL (TABULAR)
Flat ordered files (like relational, except there is intrinsic order visible to user programs on both tuples and attributes). Order is usually "arrival order", meaning each record is given a unique "Relative Record Number" or RRN when it is inserted.
- RRNs never change (unless there is a reorganization).
- Programs can access records by RRN.
- Physical placement of records on pages is in RRN order ("clustered on RRN") so that application programs can efficiently retrieve in RRN order.
- Indexes, etc. can be provided for other access paths (and orderings).

page1
|RRN|S# |ST|
| 0 | 25|NJ|
| 1 | 32|NJ|
| 2 | 38|FL|
| 3 | 47|NY|
page2
| 0 | 57|NY|

STATE-INDEX
|RID|STATE|
|1,2| FL  |
|1,0| NJ  |
|1,1| NJ  |
|1,3| NY  |
|2,0| NY  |
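The pages and the STATE-INDEX above can be reproduced with a short sketch in which records are appended in arrival order and an index on ST is kept sorted. The page capacity of 4 is inferred from page1 holding four records; the function names and the (page, RRN) tuple used as a RID are assumptions made for illustration.

```python
import bisect

PAGE_SIZE = 4      # assumed capacity: page1 in the slide holds 4 records
pages = []         # pages[p - 1][rrn] holds the record with RID (p, rrn)
state_index = []   # sorted list of (STATE, RID): the access path on ST

def insert(s_num, state):
    """Place the record in arrival order and register it in the index."""
    if not pages or len(pages[-1]) == PAGE_SIZE:
        pages.append([])
    pages[-1].append((s_num, state))
    rid = (len(pages), len(pages[-1]) - 1)
    bisect.insort(state_index, (state, rid))
    return rid

def fetch(rid):
    """Direct access by record identifier (page number, RRN within the page)."""
    page, rrn = rid
    return pages[page - 1][rrn]

def students_in(state):
    """Use the index to find the RIDs, then fetch only those records."""
    return [fetch(rid) for st, rid in state_index if st == state]

for s_num, state in [(25, "NJ"), (32, "NJ"), (38, "FL"), (47, "NY"), (57, "NY")]:
    insert(s_num, state)

print(state_index)
# [('FL', (1, 2)), ('NJ', (1, 0)), ('NJ', (1, 1)), ('NY', (1, 3)), ('NY', (2, 0))]
print(students_in("NY"))  # [(47, 'NY'), (57, 'NY')]
```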
REVIEW OF HORIZONTAL DATA MODELS

OBJECT RELATIONAL MODEL
The Object Relational Model (OR model) is like the relational model except that repeating groups are allowed (many levels of repeating groups, even nested repeating groups) and pointers to very complex structures are allowed (LOBs for Large OBjects, BLOBs for Binary Large OBjects, etc., for storing, e.g., pictures, movies, and other binary large objects).
Description of Map Reduce and Hadoop

MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers.

Writing a parallel-executable program is difficult, requiring various specialized skills. MapReduce provides regular programmers the ability to produce parallel distributed programs much more easily, by requiring them to write only the simpler Map() and Reduce() functions, which focus on the logic of the specific problem at hand, while the "MapReduce System" (also called "infrastructure" or "framework") automatically takes care of marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, providing for redundancy and failures, and overall management of the whole process.

The model is inspired by the map and reduce functions commonly used in functional programming. MapReduce libraries have been written in many programming languages. A popular free implementation is Apache Hadoop.

MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). MapReduce takes advantage of locality of data, processing data on or near the storage assets to decrease transmission of data.

"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output - the answer to the problem it was originally trying to solve.

MapReduce allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided all outputs of the map operation that share the same key are presented to the same reducer at the same time, or if the reduction function is associative.

While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.
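The division of labor between the user-written Map() and Reduce() functions and the surrounding framework can be sketched in a single process. This is not Hadoop's or Google's API: map_reduce, map_enrollment and reduce_grades are hypothetical names, nothing runs in parallel here, and the input reuses the ENROLL tuples from the relational example.

```python
from collections import defaultdict

def map_enrollment(record):
    """User-written Map(): emit (course number, grade) for one ENROLL record."""
    s_num, c_num, grade = record
    yield (c_num, grade)

def reduce_grades(c_num, grades):
    """User-written Reduce(): combine all grades that share one key (max is associative)."""
    return (c_num, max(grades))

def map_reduce(records, mapper, reducer):
    """Stand-in for the framework: run the maps, group by key, run the reduces."""
    groups = defaultdict(list)
    for record in records:                  # each mapper call is independent of the others
        for key, value in mapper(record):
            groups[key].append(value)       # values with the same key go to the same reducer
    return [reducer(key, values) for key, values in sorted(groups.items())]

ENROLL = [(32, 8, 89), (32, 7, 91), (25, 7, 68), (25, 6, 76),
          (32, 6, 62), (38, 6, 98), (17, 5, 96)]

print(map_reduce(ENROLL, map_enrollment, reduce_grades))
# [(5, 96), (6, 98), (7, 91), (8, 89)]
```

Because each map_enrollment call looks at one record only, a real framework is free to spread those calls over many nodes; the grouping step is what guarantees that every grade for a given course reaches the same reducer.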