What Is Big Data?

What Is Big Data? Craig C. Douglas, University of Wyoming

What Is Big Data?... It Depends

What Is Big Data?... It Depends • What if time counts? • Given a time period t, how much data can be read and written? • This changes over time as technology changes. • What if the quantity of data counts? • How long does it take to read and write the data? • This changes over time as technology changes. • The definition of Big Data is fluid, not static.

Some Sources of Big Data • Interactions with dynamic databases • Internet data • City or regional transportation flow control • Environment and disaster management • Oil/gas fields or pipelines, seismic imaging • Credit cards and online businesses • Government or industry regulation/statistics • Dynamic data-driven apps

Why is Big Data a Hot Topic? • Open positions in data analytics by 2020 (USA) • up to 200,000 open positions • might only be 140,000 open positions • The Bureau of Labor Statistics projects that 70% of all newly created jobs across all STEM fields during the 2010s (engineering, the physical sciences, the life sciences, and the social sciences) will be in computer science

Unprecedented Opportunities • Significant contributions to the development of these transformative technologies have been made from diverse fields including: • mathematics • natural sciences • engineering • social sciences • arts and entertainment industries • business world

Unprecedented Opportunities • Algorithm and software development have belonged to computer science for the past 50 years: • Computer science researchers have designed and implemented the algorithms, data structures, languages, models, tools, and abstractions that have enabled these transformational technology developments

Quick summary • Simulation-oriented computational science is transformational science, but is only a niche in the grand scheme of things. • Big data computing capabilities must be broadly available in any institution that strives to compete in the coming decade. • If not, an institution will simply cease to be competitive, similar to not joining the ARPAnet or CSnet in the 1970s and 1980s.

Similarities in Sentences in Big Files

Big File Format • One line per sentence with no punctuation • Each word is separated by one blank • All lower case • Multiple languages and gibberish • Watch for an extra blank at end of some lines

Goals • In the big file of sentences: • Eliminate similar sentences • Find similar sentences of some distance or less • Either goal is hard work if the file has enough sentences • Both goals are about equally hard • The methods in Chapter 3 (Finding Similar Items) of Ullman et al.’s data mining book, Mining of Massive Datasets, are useful

Goal 1 • Eliminate all duplicate lines (distance 0) • Eliminate all sentences of distance 1 • Two sentences S1 and S2 are within distance n if S1 can be transformed into S2 by adding, removing, or substituting at most n words. • What happens if you eliminate sentence Si because of an earlier sentence Si-j, but you later find a sentence Sk that has distance 0 or 1 from Si? • You need to define how you handle this case.
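The distance defined on this slide is a Levenshtein (edit) distance computed over words instead of characters. A minimal sketch follows, assuming the standard dynamic-programming formulation; the function names `word_edit_distance` and `within_distance` are illustrative and not from the slides.

```python
def word_edit_distance(s1: str, s2: str) -> int:
    """Word-level edit distance: adds, removes, and substitutions of whole words."""
    # split() with no argument also ignores an extra blank at the end of a line.
    a, b = s1.split(), s2.split()
    # prev[j] is the distance between the words of a seen so far and the first j words of b.
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # remove word_a
                            curr[j - 1] + 1,      # add word_b
                            prev[j - 1] + cost))  # substitute, or keep a match
        prev = curr
    return prev[-1]


def within_distance(s1: str, s2: str, n: int) -> bool:
    """True if s1 can be turned into s2 with at most n word edits."""
    # The difference in word counts is a lower bound on the distance,
    # so most pairs can be rejected without running the full computation.
    if abs(len(s1.split()) - len(s2.split())) > n:
        return False
    return word_edit_distance(s1, s2) <= n
```

Comparing every pair this way is quadratic in the number of sentences, so on a large file it is presumably only practical as the final check on a small set of candidate pairs.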

Goal 2 • List all sentences that have duplicates. • List all sentences that have other sentences at distance 1. • List the first one followed by all distance 0 or 1 sentences related to it • Can be done as separate lists or just one • The output should be sorted • Redo for distance n
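For the distance 0 part of this goal, a hash table keyed on the sentence is enough. The sketch below assumes the whole file fits in memory and uses Python's built-in `dict`; the name `duplicate_groups` and the choice to report 1-based line numbers are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict


def duplicate_groups(lines):
    """Return sorted (sentence, [line numbers]) pairs for sentences that occur more than once."""
    groups = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        # Rejoining the split words removes any extra blank at the end of a line.
        key = " ".join(line.split())
        groups[key].append(line_no)
    return sorted((sentence, numbers)
                  for sentence, numbers in groups.items()
                  if len(numbers) > 1)
```

Calling `duplicate_groups(open(path, encoding="utf-8"))` on a small test file lists each repeated sentence once, followed by every line on which it occurs.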

Preprocessing • Read all of the file and build a dictionary with each word given a natural number as an index: • given sentence one here as the first one • 1 2 3 4 5 6 7 3 • next sentence after sentence one • 8 2 9 2 3 • and so on • 10 11 12
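A minimal sketch of this dictionary-building step, assuming indices are handed out in order of first appearance starting from 1; `encode_sentences` is an illustrative name.

```python
def encode_sentences(lines):
    """Map each distinct word to a natural number and rewrite sentences as index lists."""
    word_ids = {}   # word -> index, assigned in order of first appearance
    encoded = []
    for line in lines:
        ids = []
        for word in line.split():
            if word not in word_ids:
                word_ids[word] = len(word_ids) + 1
            ids.append(word_ids[word])
        encoded.append(ids)
    return word_ids, encoded


if __name__ == "__main__":
    sample = ["next sentence after sentence one", "and so on"]
    word_ids, encoded = encode_sentences(sample)
    for sentence, ids in zip(sample, encoded):
        print(sentence, "->", " ".join(map(str, ids)))
```

Comparing lists of small integers is cheaper than comparing strings, and a repeated word always maps to the same index.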

Implementation Suggestions • Use hash tables of considerable size • Hash table size should be a prime number • Build and debug your code with small files • Start with < 10 sentences • Next try 100, 1000, and 10,000 sentences • Then try 17,788,002 sentences • Consider using Hadoop (requires knowledge of Java, however) or MR-MPI (C/C++)

Tricky Part • Write a program to do Goal 1 or 2. Notes: • Shingling and minhash do not work well for edit distance • Two approaches: • Try the Jaccard similarity or distance methodology on sentences considered as sets of words • Modify index-based and length-based methods
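Both approaches can be sketched briefly. The first function computes the Jaccard similarity of two sentences treated as sets of words. The second is one possible reading of a length-based method: two sentences within distance n must have word counts that differ by at most n, so only those pairs need a full comparison. The function names and the bucketing scheme are assumptions made for illustration.

```python
from collections import defaultdict


def jaccard_similarity(s1: str, s2: str) -> float:
    """Size of the intersection over size of the union of the two word sets."""
    a, b = set(s1.split()), set(s2.split())
    if not a and not b:
        return 1.0  # convention chosen here: two empty sentences count as identical
    return len(a & b) / len(a | b)


def candidate_pairs_by_length(sentences, n=1):
    """Yield index pairs (i, j), i < j, whose word counts differ by at most n."""
    buckets = defaultdict(list)
    for idx, sentence in enumerate(sentences):
        buckets[len(sentence.split())].append(idx)
    for length in sorted(buckets):
        # Only look at equal or longer buckets so each pair is produced once.
        for other in range(length, length + n + 1):
            if other not in buckets:
                continue
            for i in buckets[length]:
                for j in buckets[other]:
                    if other != length or i < j:
                        yield (min(i, j), max(i, j))
```

Because sets ignore word order and repetition, a high Jaccard similarity does not guarantee a small edit distance, so surviving pairs still need an exact check.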

Generalizing • Substitute n for 1 • Not much extra work to do so • Instead of looking at sentences of word length difference 1, look at ones of difference up to n • Makes a much more useful program • Take arbitrary sentences • Convert to one per line, each word separated by one blank • Take lower and upper case into account and convert to all lower case as preprocessing
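Converting arbitrary text into the big file format might look like the following sketch. It assumes sentences end at `.`, `!`, or `?`, which is a simplification that the slides do not specify; `normalize_text` is an illustrative name.

```python
import re


def normalize_text(text: str):
    """Return one lower-case, punctuation-free sentence per list entry."""
    lines = []
    # Assumption: a run of '.', '!' or '?' marks the end of a sentence.
    for sentence in re.split(r"[.!?]+", text):
        # Drop remaining punctuation, then split on any whitespace.
        words = re.sub(r"[^\w\s]", "", sentence.lower()).split()
        if words:
            lines.append(" ".join(words))  # exactly one blank between words
    return lines
```

Writing the result with `"\n".join(normalize_text(text))` gives one sentence per line with no extra blank at the end.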

Some Interesting Problems • An Open Source, secure Hadoop replacement suitable for hospitals and medical records. • Must be HIPAA compliant. • Must scale well for very large databases. • Must have individual access capabilities. • Must not have complexity O(disk access) on a DFS. • Should use OpenMP and MPI. • Should use cache aware hashing methods. • Will be useful well beyond medical records.

Some Interesting Problems • Dynamic Data-Driven Application Systems and Big Data • A natural fit, and there is no agreed-upon software for DDDAS or DDDAS-BD or DBDDAS. DDDAS has been applied to many, many fields. • DDDAS researchers agree that something should be produced, but such software is not considered an application and is too applied to be considered networking research. • Need to find a niche or a program officer in a funding agency willing to think outside of the box. • Many Big Data issues have long been common to DDDAS.

Some Interesting Problems • Sensors and telemetry • SensorML was supposed to provide a standard way of describing sensor data and be able to get the data and deliver it to applications. It went commercial ($$$...$$$) after the original PI retired. • A true Open Source, internationally recognized standard would benefit one area of Big Data and DDDAS.

Some Interesting Problems • Reservoirs (oil, gas, water) • Dynamic reservoir meshing • Vertical wells with micro sensors provide updates to fracked reservoirs. • Speed up the meshing enough to include it in a reservoir simulator’s run time (e.g., go from a year to a day). • Dynamically improve predictions. • Corporate oil/gas fields or pipelines (even small ones) produce excessive amounts of data • Open Source data mining tools for specific problems

Some Interesting Problems • Audio and photographic data mining • World’s largest databases based on VoIP and phone monitoring by many governments (e.g., P.R. China, France, Germany, Kingdom of Saudi Arabia, United Kingdom, USA, …). • Keeps disk drive makers in business and lowers hard disk prices very significantly. • Another problem: Find all file duplicates in a file system efficiently. Similar to the sentence problem earlier. • Has commercial (e.g., Bing, satellite transmission) and research ramifications that are not nefarious.
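The file duplicate problem can be approached like the distance 0 sentence problem: hash the items and compare only those that collide. The sketch below is one common approach and is not taken from the slides. It groups files by size first so that only files of equal size are read and hashed. `find_duplicate_files` and `file_digest` are illustrative names.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(root):
    """Return sorted groups of paths under root whose contents hash identically."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(group) for group in by_hash.values() if len(group) > 1)
    return sorted(groups)
```

The size check is cheap because it reads only file metadata, so on most file systems the hashing cost is paid for a small fraction of the files.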
