What Is Big Data? Craig C. Douglas, University of Wyoming
What Is Big Data?... It Depends • What if time counts? • Given a time period t, • How much data can be read and written? • This changes over time as technology changes. • What if the quantity of data counts? • How long does it take to read and write data? • This changes over time as technology changes. • Definition of Big Data is fluid, not static.
Some Sources of Big Data • Interactions with dynamic databases • Internet data • City or regional transportation flow control • Environment and disaster management • Oil/gas fields or pipelines, seismic imaging • Credit cards and online businesses • Government or industry regulation/statistics • Dynamic data-driven apps
Why Is Big Data a Hot Topic? • Open positions in data analytics by 2020 (USA) • up to 200,000 open positions • might only be 140,000 open positions • The Bureau of Labor Statistics projects that, during the 2010s, 70% of all newly created jobs across all STEM fields (engineering, the physical sciences, the life sciences, and the social sciences) will be in computer science
Unprecedented Opportunities • Significant contributions to the development of these transformative technologies have been made from diverse fields including: • mathematics, • natural sciences • engineering • social sciences • arts and entertainment industries • business world
Unprecedented Opportunities • Algorithm and software development have belonged to computer science for the past 50 years: • Computer science researchers have designed and implemented the algorithms, data structures, languages, models, tools, and abstractions that have enabled these transformational technology developments
Quick summary • Simulation-oriented computational science is transformational science, but it is only a niche in the grand scheme of things. • Big data computing capabilities must be broadly available in any institution that strives to compete in the coming decade. • Without them, an institution will simply cease to be competitive, much like one that did not join the ARPANET or CSNET in the 1970s and 1980s.
Big File Format • One line per sentence with no punctuation • Each word is separated by one blank • All lower case • Multiple languages and gibberish • Watch for an extra blank at end of some lines
Goals • In the big file of sentences: • Eliminate similar sentences • Find similar sentences of some distance or less • Either goal is hard work if the file has enough sentences • Both goals are about equally hard • The methods in Chapter 3 (Finding Similar Items) of Mining of Massive Datasets by Leskovec, Rajaraman, and Ullman are useful
Goal 1 • Eliminate all duplicate lines (distance 0) • Eliminate all sentences of distance 1 • Two sentences S1 and S2 are within distance n if S1 can be transformed into S2 by adding, removing, or substituting at most n words. • What happens if you eliminate sentence Si because of sentence Si-j, but you later find a sentence Sk that has distance 0 or 1 from Si? • You need to define how you handle this case.
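The distance definition above is an edit distance computed over words rather than characters. A minimal sketch of it, assuming sentences are already in the big file format (lower case, words separated by blanks); the function names `word_edit_distance` and `within_distance` are hypothetical and not from the slides:

```python
def word_edit_distance(s1: str, s2: str) -> int:
    """Fewest word insertions, deletions, or substitutions turning s1 into s2."""
    a, b = s1.split(), s2.split()
    # prev[j] holds the distance between the words of a seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # remove word_a
                            curr[j - 1] + 1,      # add word_b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def within_distance(s1: str, s2: str, n: int) -> bool:
    """True if s1 and s2 are within distance n in the sense defined above."""
    # Cheap rejection: each edit changes the word count by at most one.
    if abs(len(s1.split()) - len(s2.split())) > n:
        return False
    return word_edit_distance(s1, s2) <= n


print(word_edit_distance("the cat sat", "the dog sat"))        # 1
print(within_distance("the cat sat", "the cat sat down", 1))   # True
```

This quadratic-time comparison is only practical on candidate pairs that survive a cheaper filter; comparing every pair in a file with millions of sentences is not feasible this way.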
Goal 2 • List all sentences that have duplicates. • List all sentences that have other sentences at distance 1 • List the first one followed by all distance 0 or 1 sentences related to it • Can do as separate lists or just one • Should be sorted • Redo for distance n
Preprocessing • Read the whole file and build a dictionary that gives each distinct word a natural number as its index, in order of first appearance: • Given sentence one here as the first one • 1 2 3 4 5 6 7 3 • Next sentence after sentence one • 8 2 9 2 3 • And so on • 10 11 12
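The dictionary-building step can be sketched as follows. This is an illustration under assumptions, not the author's code: indices are assigned in order of first appearance starting at 1, and the name `encode_sentences` and the sample sentences are hypothetical.

```python
def encode_sentences(lines):
    """Map each distinct word to 1, 2, 3, ... and encode each line as a list of indices."""
    word_to_index = {}
    encoded = []
    for line in lines:
        ids = []
        for word in line.split():  # split() also ignores an extra trailing blank
            if word not in word_to_index:
                word_to_index[word] = len(word_to_index) + 1
            ids.append(word_to_index[word])
        encoded.append(ids)
    return word_to_index, encoded


vocab, encoded = encode_sentences(["the cat sat on the mat", "the dog sat "])
print(encoded)  # [[1, 2, 3, 4, 1, 5], [1, 6, 3]]
```

Comparing lists of small integers is cheaper than comparing strings, and the encoded form takes less memory than the raw text when the vocabulary is much smaller than the file.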
Implementation Suggestions • Use hash tables of considerable size • Hash table size should be a prime number • Build and debug your code with small files • Start with < 10 sentences • Next try 100, 1000, and 10,000 sentences • Then try 17,788,002 sentences • Consider using Hadoop (requires knowledge of Java, however) or MR-MPI (C/C++)
Tricky Part • Write code to do Goal 1 or 2. Notes: • Shingling and minhash do not work well for edit distance • Two approaches: • Try Jaccard similarity or distance methodology on sentences considered as sets of words • Modify index-based and length-based methods
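Both approaches can be illustrated with a short sketch. It assumes sentences are blank-separated lower-case strings; `jaccard_similarity` and `candidate_pairs_by_length` are hypothetical names. Note that Jaccard similarity on word sets ignores word order and repeated words, so it is only a proxy for the word edit distance, while the length filter relies on the fact that two sentences within distance n differ by at most n words in length.

```python
from collections import defaultdict


def jaccard_similarity(s1: str, s2: str) -> float:
    """Jaccard similarity of two sentences treated as sets of words."""
    a, b = set(s1.split()), set(s2.split())
    if not a and not b:
        return 1.0  # convention chosen here: two empty sentences are identical
    return len(a & b) / len(a | b)


def candidate_pairs_by_length(sentences, n=1):
    """Yield index pairs (i, j), i < j, whose word counts differ by at most n."""
    buckets = defaultdict(list)
    for idx, sentence in enumerate(sentences):
        buckets[len(sentence.split())].append(idx)
    for length in sorted(buckets):
        for other in range(length, length + n + 1):
            if other not in buckets:
                continue
            for i in buckets[length]:
                for j in buckets[other]:
                    # Within one bucket, emit each unordered pair only once.
                    if other != length or i < j:
                        yield (min(i, j), max(i, j))


sentences = ["a b c", "a b d", "a b c d e f", "x y"]
print(sorted(candidate_pairs_by_length(sentences, n=1)))  # [(0, 1), (0, 3), (1, 3)]
print(jaccard_similarity("a b c", "a b d"))               # 0.5
```

The length filter only removes pairs that cannot match; surviving pairs still need an exact distance check.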
Generalizing • Substitute n for 1 • Not much extra work to do so • Instead of looking at sentences of word length difference 1, look at ones of difference up to n • Makes a much more useful program • Take arbitrary sentences • Convert to one per line, each word separated by one blank • Take lower and upper case into account and convert to all lower case as preprocessing
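The preprocessing of arbitrary sentences into the big file format might look like the sketch below. It assumes each input string is already a single sentence (splitting running text into sentences is a separate problem), and it makes one choice the slides do not specify: punctuation is replaced by a blank, so "don't" becomes "don t". The name `normalize_sentence` is hypothetical.

```python
import re

# Anything that is neither a word character nor whitespace counts as punctuation.
# In Python 3, \w is Unicode-aware, so words in other languages are kept.
_PUNCTUATION = re.compile(r"[^\w\s]")


def normalize_sentence(text: str) -> str:
    """Lower-case, strip punctuation, and separate words by exactly one blank."""
    text = _PUNCTUATION.sub(" ", text.lower())
    # split() with no argument collapses runs of whitespace and drops
    # leading and trailing blanks, including a stray blank at end of line.
    return " ".join(text.split())


print(normalize_sentence("  Hello,   World!  "))  # hello world
```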
Some Interesting Problems • An Open Source, secure Hadoop replacement suitable for hospitals and medical records. • Must be HIPAA compliant. • Must scale well for very large databases. • Must have individual access capabilities. • Must not have complexity O(disk access) on a DFS. • Should use OpenMP and MPI. • Should use cache aware hashing methods. • Will be useful well beyond medical records.
Some Interesting Problems • Dynamic Data-Driven Application Systems and Big Data • A natural fit, and there is no agreed-upon software for DDDAS or DDDAS-BD or DBDDAS. DDDAS has been applied to many, many fields. • DDDAS researchers agree something should be produced: not considered an application and too applied to be considered networking research. • Need to find a niche or a program officer in a funding agency willing to think outside of the box. • Many Big Data issues long common to DDDAS.
Some Interesting Problems • Sensors and telemetry • SensorML was supposed to provide a standard way of describing sensor data and be able to get the data and deliver it to applications. It went commercial ($$$...$$$) after the original PI retired. • A true Open Source, internationally recognized standard would benefit one area of Big Data and DDDAS.
Some Interesting Problems • Reservoirs (oil, gas, water) • Dynamic reservoir meshing • Vertical wells with micro sensors provide updates to fracked reservoirs. • Speed up the meshing so that it can be included in reservoir simulator time (e.g., go from a year to a day). • Dynamically improve predictions. • Corporate oil/gas fields or pipelines (even small ones) produce excessive amounts of data • Open Source data mining tools for specific problems
Some Interesting Problems • Audio and photographic data mining • World’s largest databases based on VoIP and phone monitoring by many governments (e.g., P.R. China, France, Germany, Kingdom of Saudi Arabia, United Kingdom, USA, …). • Keeps disk drive makers in business and lowers hard disk prices very significantly. • Another problem: Find all file duplicates in a file system efficiently. Similar to sentence problem earlier. • Has commercial (e.g., Bing, satellite transmission) and research ramifications that are not nefarious.
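For the file-duplicate problem, one common approach (an assumption here, not something the slides prescribe) mirrors the sentence problem: use a cheap filter first, then an exact check. The sketch below groups files by size and hashes only the files that share a size; `file_digest` and `find_duplicate_files` are hypothetical names, and equal SHA-256 digests are treated as equal content.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicate_files(root):
    """Return lists of paths under root whose contents hash identically."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

The size filter plays the same role as the length-based filter for sentences: most files are ruled out without reading their contents.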