CSCI 765 Big Data and Infinite Storage
One new idea introduced in this course is the emerging idea of structuring data into vertical structures and processing across those vertical structures. This contrasts with the traditional method of structuring data into horizontal structures and processing down those structures. (Horizontal structures are often called records, e.g., an employee file containing horizontal employee records made up of fields such as Name, Address, Salary, and Phone.) Thus, horizontal processing of vertical data (HPVD) will be introduced as an alternative to the traditional vertical processing of horizontal data (VPHD). Why do we need to structure and process data differently than we have in the past? What has changed? Digital data has gotten really BIG! How big is BIG DATA these days, and how big will it get?
An Example: The US Library of Congress is storing EVERY tweet sent since Twitter launched in 2006. Each tweet record contains fifty fields. Let's assume each of those horizontal tweet records is about 1000 bits wide, and let's estimate approximately 1 trillion tweets from 1 billion tweeters to 1 billion tweetees over 10 years of tweeting. As a full data file, that's 10^30 data items (10^12 * 10^9 * 10^9).
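The estimate above can be checked with a few lines of arithmetic. This is only a sketch of the slide's own round numbers; the variable names are illustrative, and the terabyte figure is derived from the assumed 1000-bit record width, not stated in the original.

```python
# Round-number assumptions taken from the example above.
TWEETS = 10**12          # ~1 trillion tweets
SENDERS = 10**9          # ~1 billion tweeters
RECEIVERS = 10**9        # ~1 billion tweetees
BITS_PER_TWEET = 1000    # assumed width of one horizontal tweet record

# Full (tweet x sender x receiver) data file: one item per combination.
data_items = TWEETS * SENDERS * RECEIVERS

# Raw size of the tweet records alone, in decimal terabytes (10^12 bytes).
raw_bits = TWEETS * BITS_PER_TWEET
raw_terabytes = raw_bits / 8 / 10**12

print(f"data items: 10^{len(str(data_items)) - 1}")
print(f"raw tweet records: {raw_terabytes:.0f} TB")
```

The gap between the raw records and the full data file is what makes the example "big": the records themselves are modest, but the full cross-product is not.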
That's BIG! Is it going to get even bigger? Yes. Let's look at how the definition of "big data" has evolved just over my work lifetime. My first job in this industry was as THE technician at the St. John's University IBM 1620 Computer Center. I did the following:
1. I turned the 1620 switch on.
2. I waited for the ready light bulb to come on (~15 minutes).
3. I put the Op/Sys punch card stack on the card reader (~4 inches high).
4. I put the FORTRAN compiler card stack on the reader (~3 inches).
5. I put the FORTRAN program card stack on the reader (~2 inches).
6. The 1620 produced an object code stack, which I read in (~1 inch).
7. I read in the object stack and a 1964 BIG DATA stack (~40 inches).
The 1st FORTRAN upgrade allowed for a "continue" card so that the data stack could be read in segments (and I could sit down).
How high would a 2013 BIG DATA STACK reach today if it were put on punch cards? Let's be conservative and assume an exabyte (10^18 bytes) of data on cards. How high is an exabyte punch card stack? Take a guess... Keep in mind that we're being conservative, because the US LoC tweet database may be ~10^30 bytes or more soon (if it's fully losslessly stored).
That exabyte stack of punch cards would reach to JUPITER! So, in my work lifetime, BIG DATA has gone from 40 inches high all the way to Jupiter! What will happen to BIG DATA over your work lifetime?
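A rough sketch of the stack-height estimate, under assumptions that are not in the original slides: 80 bytes per card (one character per column of a standard 80-column card) and a card thickness of 0.007 inches. With other card capacities or thicknesses the figure changes, so treat the result as an order of magnitude only.

```python
EXABYTE = 10**18             # bytes, as assumed on the slide
BYTES_PER_CARD = 80          # assumption: 80 columns, one character each
CARD_THICKNESS_IN = 0.007    # assumption: thickness of one card in inches
METERS_PER_INCH = 0.0254
KM_PER_AU = 1.496e8          # one astronomical unit in kilometers

# Exabyte stack: number of cards, then height in km and in AU.
cards = EXABYTE // BYTES_PER_CARD
height_km = cards * CARD_THICKNESS_IN * METERS_PER_INCH / 1000
height_au = height_km / KM_PER_AU

# The 1964 "BIG DATA" stack of ~40 inches, under the same assumptions.
cards_1964 = round(40 / CARD_THICKNESS_IN)
bytes_1964 = cards_1964 * BYTES_PER_CARD

print(f"exabyte stack: {cards:.3e} cards, {height_km:.2e} km, {height_au:.1f} AU")
print(f"1964 stack: {cards_1964} cards, about {bytes_1964 / 1000:.0f} thousand characters")
```

Under these assumptions the exabyte stack comes out at a couple of billion kilometers. Jupiter is roughly 4 to 6 AU from Earth depending on orbital positions, so a stack of this size reaches Jupiter with room to spare, while the 40-inch stack holds well under a megabyte.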
I must deal with a data file that would reach Jupiter as a punch card stack, but I can replace it losslessly by 1000 extendable vertical pTrees and write programs to process across those 1000 vertical structures horizontally. You may have to deal with a data file that would reach the end of space (if on cards), but you can replace it losslessly by 1000 extendable vertical pTrees and write programs to process across those 1000 vertical structures horizontally. The next generation may have to deal with a data file that creates new space, but can replace it losslessly by 1000 extendable vertical pTrees and write programs to process across those 1000 vertical structures horizontally. You will be able to use my code! The next generation will be able to use my code too! It seems clear that DATA WILL HAVE TO BE COMPRESSED and that data will have to be VERTICALLY structured. Let's take a quick look at how one might organize and compress vertical data (more on that later too).
Predicate trees (pTrees): a vertical data structuring

Traditional Vertical Processing of Horizontal Data (VPHD): for horizontally structured, record-oriented data, one scans vertically, e.g., to find the number of occurrences of (7, 0, 1, 4).

    R(A1 A2 A3 A4)
    Base 10      Base 2
    2 7 6 1      010 111 110 001
    3 7 6 0      011 111 110 000
    2 6 5 1      010 110 101 001
    2 7 5 7      010 111 101 111
    3 2 1 4      011 010 001 100
    2 2 1 5      010 010 001 101
    7 0 1 4      111 000 001 100
    7 0 1 4      111 000 001 100

Using vertical pTrees to find the number of occurrences of (7, 0, 1, 4): first slice by column (4 vertical structures: R[A1], R[A2], R[A3], R[A4]), then vertically slice off each bit position (12 vertical structures):

    R11 R12 R13   R21 R22 R23   R31 R32 R33   R41 R42 R43
     0   1   0     1   1   1     1   1   0     0   0   1
     0   1   1     1   1   1     1   1   0     0   0   0
     0   1   0     1   1   0     1   0   1     0   0   1
     0   1   0     1   1   1     1   0   1     1   1   1
     0   1   1     0   1   0     0   0   1     1   0   0
     0   1   0     0   1   0     0   0   1     1   0   1
     1   1   1     0   0   0     0   0   1     1   0   0
     1   1   1     0   0   0     0   0   1     1   0   0

Then compress each bit slice into a tree using a predicate: record the truth of the predicate "purely 1-bits" (pure1) in a tree, recursively on halves, until the half is pure. We will walk through the compression of R11 = 0 0 0 0 0 0 1 1 into the pTree P11:

1. Whole thing pure1? false = 0
2. Left half pure1? false = 0 (but it's pure0, so this branch ends)
3. Right half pure1? false = 0
4. Left half of right half pure1? false = 0
5. Right half of right half pure1? true = 1

    P11:     0
           /   \
          0     0
               / \
              0   1

To count (7, 0, 1, 4)s, use the bit pattern 111 000 001 100: AND the 12 pTrees, taking the complement (') of each pTree whose pattern bit is 0:

    P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43

The 1-count of the resulting pTree is read off level by level: 0*2^3 + 0*2^2 + 1*2^1 + 0*2^0 = 2.

Imagine an excillion records, not just 8 (we need speed!). More typically, we compress strings of bits, not single bits (e.g., 64-bit strings, or strides).
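The walk-through above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the tree encoding (1 for a pure1 half, 0 for a pure0 half, a `(left, right)` tuple for a mixed half) and all function names are assumptions made for this example, and the rows are taken from the 3-bit binary table.

```python
# Relation R(A1, A2, A3, A4) as 3-bit values, from the binary table.
ROWS = [
    (0b010, 0b111, 0b110, 0b001),
    (0b011, 0b111, 0b110, 0b000),
    (0b010, 0b110, 0b101, 0b001),
    (0b010, 0b111, 0b101, 0b111),
    (0b011, 0b010, 0b001, 0b100),
    (0b010, 0b010, 0b001, 0b101),
    (0b111, 0b000, 0b001, 0b100),
    (0b111, 0b000, 0b001, 0b100),
]
BITS = 3  # bits per attribute


def build_ptree(bits):
    """Compress a bit slice: 1 = pure1, 0 = pure0, (left, right) = mixed."""
    if all(bits):
        return 1
    if not any(bits):
        return 0
    mid = len(bits) // 2
    return (build_ptree(bits[:mid]), build_ptree(bits[mid:]))


def complement(tree):
    """Flip every bit represented by the tree (P -> P')."""
    if isinstance(tree, tuple):
        return (complement(tree[0]), complement(tree[1]))
    return 1 - tree


def ptree_and(a, b):
    """AND two pTrees built over the same number of rows."""
    if a == 0 or b == 0:
        return 0
    if a == 1:
        return b
    if b == 1:
        return a
    left, right = ptree_and(a[0], b[0]), ptree_and(a[1], b[1])
    if left == right and left in (0, 1):
        return left  # both halves pure and equal: collapse the node
    return (left, right)


def count_ones(tree, size):
    """Number of 1-bits represented by a tree covering `size` rows."""
    if isinstance(tree, tuple):
        half = size // 2
        return count_ones(tree[0], half) + count_ones(tree[1], size - half)
    return tree * size


def count_pattern(rows, pattern, bits=BITS):
    """Count rows equal to `pattern` by ANDing pTrees (or their complements)."""
    acc = 1  # start with a pure1 tree: every row is still a candidate
    for attr, value in enumerate(pattern):
        for pos in range(bits):
            shift = bits - 1 - pos  # pos 0 is the high-order bit
            tree = build_ptree([(r[attr] >> shift) & 1 for r in rows])
            if not (value >> shift) & 1:
                tree = complement(tree)  # pattern bit 0: use P'
            acc = ptree_and(acc, tree)
    return count_ones(acc, len(rows))


p11 = build_ptree([(r[0] >> 2) & 1 for r in ROWS])
print(p11)                                # (0, (0, 1))
print(count_pattern(ROWS, (7, 0, 1, 4)))  # 2
```

The AND never touches individual records once the trees are built: a pure0 branch on either side ends the recursion immediately, which is where the speed on very long slices is expected to come from.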
The age of Big Data is upon us and so is the age of Infinite Storage. Many of us have enough money in our pockets right now to buy all the storage we will be able to fill for the next 5 years. So having adequate storage capacity is no longer much of a problem. Managing our storage is a problem (especially managing BIG DATA storage). How much data is there?
    Googolplex             10^Googol
    Googol                 10^100
    . . .
    (tredecillion)         10^42
    (duodecillion)         10^39
    (undecillion)          10^36
    (decillion)            10^33
    (nonillion)            10^30
    (octillion)            10^27
    Yotta (septillion)     10^24
    Zetta (sextillion)     10^21
    Exa (quintillion)      10^18
    Peta (quadrillion)     10^15
    Tera (trillion)        10^12
    Giga (billion)         10^9
    Mega (million)         10^6
    Kilo (thousand)        10^3

• TeraBytes (TBs) are certainly here already.
• 1 TB may cost << 1k$ to buy
• 1 TB may cost >> 1k$ to own
• Management and curation are the expensive part
• Searching 1 TB takes a long time.
• I'm Terrified by TeraBytes
• I'm Petrified by PetaBytes
• I'm Exafied by ExaBytes
• I'm Zettafied by ZettaBytes
• You could be Yottafied by YottaBytes.
• You may not be Googified by GoogolBytes, but the next generation may be?
[Figure marker on the scale: "We are here"]
How much information is there?
• Soon everything may be recorded.
• Most of it will never be seen by humans.
• Data summarization, vertical structuring, compression, trend detection, anomaly detection, and data mining are key technologies.
[Figure: scale from Kilo up to Yotta, with markers for A Book, A Photo, A Movie, All books (words), All Books MultiMedia, and Everything Recorded]
Small prefixes: 10^-3 milli, 10^-6 micro, 10^-9 nano, 10^-12 pico, 10^-15 femto, 10^-18 atto, 10^-21 zepto, 10^-24 yocto
First Disk, in 1956
• IBM 305 RAMAC
• 4 MB
• 50 24" disks
• 1200 rpm (revolutions per minute)
• 100 milliseconds (ms) access time
• 35k$/year to rent
• Included computer & accounting software (tubes, not transistors)
7th Grade C.S. lab Tech.
10 years later 30 MB 1.6 meters
Disk Evolution
[Figure: scale from Kilo through Mega, Giga, Tera, Peta, Exa, and Zetta to Yotta]
Memex: As We May Think, Vannevar Bush, 1945
"A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility"
"yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so that he can enter material freely"
On a Personal Terabyte, How Will We Find Anything?
• Need Queries, Indexing, Vertical Structuring?, Compression, Data Mining, Scalability, Replication…
• If you don't use a DBMS, you will implement one of your own!
• The need for Data Mining and Machine Learning is more important than ever!
Of the digital data in existence today,
• 80% is personal/individual
• 20% is Corporate/Governmental DBMS
Parkinson's Law (for data)
• Data expands to fill available storage
Disk-storage version of Moore's Law
• Available storage doubles every 9 months!
How do we get the information we need from the massive volumes of data we will have?
• Vertical Structuring and Compression
• Querying (for the information we know is there)
• Data mining (for answers to questions we don't know to ask precisely)
Moore's Law with respect to processor performance seems to be over (processor performance doubles every x months…). Note that the processors we find in our computers today are the same as the ones we found a few years ago. That's because that technology seems to have reached a limit (miniaturization). Now the direction is to put multiple processors on the same chip or die and to use other types of processors to increase performance.
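To get a feel for the storage-doubling claim above, here is a small sketch that turns "doubles every 9 months" into a growth factor over a number of years. The function name and the chosen time spans are illustrative assumptions; the 9-month doubling period is the slide's figure, not a measured one.

```python
def storage_growth_factor(years, doubling_months=9):
    """Growth factor after `years` if capacity doubles every `doubling_months`."""
    return 2 ** (12 * years / doubling_months)


for years in (0.75, 3, 5):
    print(f"{years} years: x{storage_growth_factor(years):.1f}")
```

At that rate, available storage grows 16-fold in three years and roughly 100-fold in five, which is why the bottleneck moves from buying storage to finding anything in it.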