Learn about the differences between Data Warehouses and Database Management Systems, their advantages, and the challenges they face. Explore techniques to address data processing challenges and discover successful applications of Horizontal Processing of Vertical Data (HPVD).
Section 3 # 0
3. Vertical Data
First, a brief description of Data Warehouses (DWs) versus Database Management Systems (DBMSs)
• C.J. Date recommended, circa 1980:
  • Do transaction processing on a DataBase Management System (DBMS), rather than doing file processing on file systems.
  • "Using a DBMS, instead of file systems,
    • unifies data resources,
    • centralizes control,
    • standardizes usages,
    • minimizes redundancy and inconsistency,
    • maximizes data value and usage."
• Inmon et al., circa 1990:
  • "Buy a separate Data Warehouse (DW) for long-running queries and data mining" (separate from the DBMS for transaction processing).
  • "Double your hardware! Double your software! Double your fun!"
Section 3 # 0.1
Data Warehouses (DWs) vs. DataBase Management Systems (DBMSs)
• What happened?
  • Inmon's idea was a great marketing success,
  • but it signaled a great Concurrency Control Research & Development (CC R&D) failure. CC R&D people had failed to integrate transaction processing and query processing, also known as OnLine Transaction Processing (OLTP) and OnLine Analytic Processing (OLAP) (that is, update and read workloads), in one system with acceptable performance.
  • Marketing of Data Warehouses was so successful that nobody noticed the failure (or seemed to mind paying double).
• Most enterprises now have a DW separate from their DBMS.
Section 3 # 0.2
Some still hope DWs and DBs will one day be unified again. The industry may demand it eventually; e.g., there is already research work on real-time updating of Data Warehouses (DWs).
For now, let's just focus on DATA. You run up against two curses immediately in data processing:
• Curse of cardinality: solutions don't scale well with respect to record volume. "Files are too deep!"
• Curse of dimensionality: solutions don't scale well with respect to attribute dimension. "Files are too wide!"
The curse of cardinality is a problem in both the horizontal and the vertical world.
• In the horizontal world it was disguised as the "curse of the slow join". In the horizontal world we decompose relations to get good design (e.g., 3rd normal form), but then we pay for that by requiring many slow joins to get the answers we need.
Section 3 # 2
Techniques to address these curses:
• Horizontal Processing of Vertical Data (HPVD), instead of the ubiquitous Vertical Processing of Horizontal (record-oriented) Data (VPHD).
• Parallelizing the processing engine.
  • Parallelize the software engine on clusters of computers.
  • Parallelize the greyware engine on clusters of people (i.e., enable visualization and use the web...).
Again, we need better techniques for data analysis, querying and mining because of:
• Parkinson's Law (applied to data): data volume expands to fill available data storage.
• Moore's law (applied to storage): available storage doubles every 9 months!
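To make the HPVD/VPHD contrast concrete, here is a minimal sketch under stated assumptions. The toy table, the 3-bit value range, and the helper name `vertical_slices` are all hypothetical and are not taken from the slides. It stores each attribute as vertical bit slices (one integer per bit position, one bit per record) and answers a count query with a couple of bitwise operations across the slices, then checks the result against an ordinary record-by-record scan.

```python
# Toy table: each record is (red, green) with 3-bit values (0..7).
# The values are made up for illustration.
records = [(7, 1), (2, 6), (5, 5), (1, 7), (6, 0), (3, 4)]
BITS = 3


def vertical_slices(column, bits=BITS):
    """Decompose a column of ints into `bits` vertical bit slices.

    Slice j is a single Python int whose bit i holds bit j of record i's value.
    slices[0] is the least significant bit position.
    """
    slices = []
    for j in range(bits):
        s = 0
        for i, value in enumerate(column):
            if (value >> j) & 1:
                s |= 1 << i
        slices.append(s)
    return slices


red = vertical_slices([r for r, _ in records])
green = vertical_slices([g for _, g in records])

n = len(records)
all_rows = (1 << n) - 1  # mask with one bit set per record

# Predicate "red >= 4 and green < 4" only needs the high-order slices:
# high bit of red is set AND high bit of green is clear.
mask = red[BITS - 1] & (~green[BITS - 1] & all_rows)
count_vertical = bin(mask).count("1")
matching_rows = [i for i in range(n) if (mask >> i) & 1]

# The same query as a record-at-a-time scan (VPHD), for comparison.
count_horizontal = sum(1 for r, g in records if r >= 4 and g < 4)

print(count_vertical, count_horizontal, matching_rows)
```

The vertical version touches two integers regardless of how many attributes each record has, which is the intuition behind processing vertical data horizontally. This sketch uses uncompressed bit slices only; it does not implement the compressed structure discussed later in the deck.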
Section 3 # 3
A few HPVD successes:
1. Precision Agriculture
Yield prediction: the Remotely Sensed Imagery (RSI) dataset consists of an aerial photograph (RGB TIFF image taken ~July) and a synchronized crop yield map taken at harvest; thus, 4 feature attributes (B, G, R, Y) and ~100,000 pixels.
[Figures: TIFF image; Yield Map]
Producers are able to analyze the color intensity patterns from aerial and satellite photos taken in mid-season to predict yield (find associations between electromagnetic reflection and yield). E.g., "hi_green & low_red → hi_yield". That is very intuitive.
A stronger association, "hi_NIR & low_red → hi_yield" (found through HPVD data mining), allows producers to take and query mid-season aerial photographs for low_NIR & high_red grid cells, and where low yield is anticipated, apply (top dress) additional nitrogen.
Can producers use Landsat images of China to predict wheat prices before planting?
2. Infestation Detection (e.g., Grasshopper Infestation Prediction, again involving RSI)
Grasshoppers cause significant economic loss each year. Early infestation prediction is key to damage control.
Pixel classification on remotely sensed imagery holds much promise for early detection.
Pixel classification (signaturing) has many, many applications: pest detection, flood monitoring, fire detection, wetlands monitoring, …
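As an illustration of what it means for an association such as "hi_NIR & low_red → hi_yield" to be strong, the following sketch computes the usual support and confidence measures for a rule over per-pixel data. Everything here is assumed for the example: the eight pixel values are made up, the threshold `HI` is arbitrary, and `rule_stats` is a hypothetical helper, not the mining method used in the slides.

```python
# Each pixel: (nir, red, yield_) on a 0..255 scale. All values are made up.
pixels = [
    (200, 40, 210),
    (190, 60, 180),
    (80, 170, 60),
    (220, 30, 230),
    (70, 200, 90),
    (210, 50, 100),
    (60, 90, 70),
    (100, 180, 200),
]
HI = 128  # assumed cutoff between "low" and "high" intensity


def rule_stats(rows, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    n_ante = sum(1 for r in rows if antecedent(r))
    n_both = sum(1 for r in rows if antecedent(r) and consequent(r))
    support = n_both / len(rows)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


def hi_nir_low_red(p):
    return p[0] >= HI and p[1] < HI


def hi_yield(p):
    return p[2] >= HI


support, confidence = rule_stats(pixels, hi_nir_low_red, hi_yield)

# Mid-season query: grid cells with low NIR and high red, where low yield
# would be anticipated under the rule above.
cells_to_top_dress = [i for i, p in enumerate(pixels) if p[0] < HI and p[1] >= HI]

print(support, confidence, cells_to_top_dress)
```

On this toy data the rule holds for 3 of the 4 pixels that satisfy the antecedent. On real imagery, the thresholds and the minimum support and confidence would need to be chosen for the data at hand.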
Section 3 # 4
3. Sensor Network Data HPVD
• Micro- and nano-scale sensor blocks are being developed for sensing
  • Biological agents
  • Chemical agents
  • Motion detection
  • Coatings deterioration
  • RF-tagging of inventory (RFID tags for Supply Chain Mgmt)
  • Structural materials fatigue
• There will be trillions++ of individual sensors creating mountains of data which can be data mined using HPVD (maybe it shouldn't be called a success yet?).
Section 3 # 5
4. A Sensor Network Application: CubE for Active Situation Replication (CEASR)
[Figure: a carrier drops nano-sensors into the situation space]
• Nano-sensors are dropped into the situation space.
• Wherever a threshold level is sensed (of chemical, biological, thermal, etc.), a ping is registered in a compressed vertical data structure for that location. (The compressed vertical data structure is a Ptree. A detailed definition of Ptrees is coming up later.)
• Each energized nano-sensor transmits a ping (location is triangulated from the ping). These locations are then translated to 3-dimensional coordinates at the display. The corresponding voxel on the display lights up.
• The single compressed vertical data structure (Ptree) containing all the information is transmitted to the cube, where the pattern is reconstructed (uncompress, display).
• A clear plexiglass cube, with embedded nano-LEDs at each voxel (volume pixel), displays the situation to the user.
• This is the expendable, one-time, cheap sensor version. A more sophisticated CEASR device could sense and transmit the intensity levels, lighting up the display voxel with the same intensity.
• The soldier sees a replica of the sensed situation prior to entering the space.
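The register, compress, transmit, reconstruct flow can be sketched as follows. This is only an illustration under loud assumptions: the cube side `N`, the ping coordinates, and all function names are hypothetical, and plain run-length encoding stands in for the compression step. It is not a Ptree, whose definition comes later in the deck.

```python
N = 4  # assumed cube side length, in voxels


def to_index(x, y, z, n=N):
    """Linearize a voxel coordinate into a position in the bit vector."""
    return x + n * y + n * n * z


def to_coords(i, n=N):
    """Inverse of to_index: recover (x, y, z) from a bit position."""
    return (i % n, (i // n) % n, i // (n * n))


def rle_encode(bits):
    """Run-length encode a list of 0/1 values as [first_bit, run1, run2, ...]."""
    if not bits:
        return []
    encoded = [bits[0]]
    run = 1
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            run += 1
        else:
            encoded.append(run)
            run = 1
    encoded.append(run)
    return encoded


def rle_decode(encoded):
    """Rebuild the 0/1 list from [first_bit, run1, run2, ...]."""
    if not encoded:
        return []
    bit = encoded[0]
    bits = []
    for run in encoded[1:]:
        bits.extend([bit] * run)
        bit ^= 1  # runs alternate between 0s and 1s
    return bits


# Sensor side: register one bit per voxel where a ping was triangulated.
pings = [(1, 0, 0), (2, 0, 0), (3, 3, 3)]  # made-up locations
bits = [0] * N**3
for p in pings:
    bits[to_index(*p)] = 1

packet = rle_encode(bits)  # what would be transmitted to the cube

# Display side: uncompress and light the corresponding voxels.
lit = [to_coords(i) for i, b in enumerate(rle_decode(packet)) if b]

print(packet, lit)
```

A sparse situation compresses to a handful of run lengths, which is why sending one compressed vertical structure is attractive for cheap, one-time sensors. The intensity-sensing variant would need one such bit vector per intensity bit rather than a single one.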
Section 3 # 6
5. Anthropology Application
Digital Archive Network for Anthropology (DANA): analyze, query and mine anthropological artifacts (shape, color, discovery location, …)
Section 3 # 7
What has spawned these successes? (i.e., What is Data Mining?)
Querying is asking specific questions for specific answers.
Data Mining is finding the patterns that exist in data (going into MOUNTAINS of raw data for the information gems hidden in that mountain of data).
[Figure: the data mining process, with these labels:
• Raw data: must be cleaned of missing items, outliers, noise, errors
• Data Warehouse: cleaned, integrated, read-only, periodic, historical database
• Smart files
• Selection: feature extraction, tuple selection
• Task-relevant data
• Data Mining: Classification, Clustering, Rule Mining
• Pattern Evaluation and Assay
• Visualization
• Loop backs]