CS 519: Big Data Exploration and Analytics 1: Introduction
Welcome to CS 519! • Arash Termehchy • Assistant Professor in the School of EECS • Usable data management and exploration. • Your turn: • Name, field, DB background
The Era of Big Data • People and devices generate and share data at staggering rates • Your friends: social networks, online games, … • 30 billion data items shared on Facebook every month • Your cell phone: your positions, daily activities, … • Your car • Your shopping activities • Web: surface and deep web
The Era of Big Data • Hubble Telescope: 50 GB each month. • High throughput screening devices • Environmental sensor networks
Data is valuable • In the mid-1850s, Dr. John Snow plotted cholera deaths on a map, and in the corner of a particularly hard-hit group of buildings was a water pump. • A 19th-century version of big data, which suggested an association between cholera and the pump.
Data is valuable • “The Fourth Paradigm: Data-Intensive Scientific Discovery”, Jim Gray • Empirical • Theoretical • Computational • Data exploration, eScience • The Sloan Sky Server is one of the most cited resources in astronomy
Data is valuable • Tracking the spread of diseases by analyzing Google query logs • Personalized medicine, drug discovery, … • “The Unreasonable Effectiveness of Data”
Three V’s of big data: Volume • Large Hadron Collider: 500 exabytes per day if all sensors work. • The Sloan Digital Sky Server has to accommodate 30 TB of new data per day in 2016. • According to McKinsey & Company: • 40% growth in global data each year • 90% of the world’s data was generated in the last two years!
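As a rough sanity check on the 40% figure, the sketch below computes how long it takes a quantity to double under a constant compound annual growth rate. It assumes the rate stays fixed and compounds once per year, which the slide does not state; the function names are illustrative only.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double under constant compound annual growth.

    annual_growth is a fraction, e.g. 0.40 for 40% per year (assumed constant).
    """
    if annual_growth <= 0:
        raise ValueError("annual_growth must be positive")
    return math.log(2) / math.log(1 + annual_growth)


def growth_factor(annual_growth: float, years: float) -> float:
    """Multiplicative growth after the given number of years."""
    return (1 + annual_growth) ** years


if __name__ == "__main__":
    print(f"Doubling time at 40%/year: {doubling_time_years(0.40):.2f} years")
    print(f"Growth factor after 2 years: {growth_factor(0.40, 2):.2f}x")
```

Under these assumptions, the global data volume roughly doubles every two years.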
Three V’s of big data: Variety • Valuable information is scattered across various sources in various forms. • Large number of social networks • Large number of life science databases
Three V’s of big data: Variety • “The systemic risks associated with the subprime lending market and the crash of the housing market in 2007 could have been modeled through a comprehensive integration and analysis of available public datasets. …. Integrating these datasets may have provided financial analysts, regulators and academic researchers, with comprehensive models to enable risk assessment.” http://wiki.umiacs.umd.edu/clip/datascience
Three V’s of big data: Variety • It is arguably more challenging than volume, as it requires a deeper understanding of the data. • Data integration has been recognized as a hard problem in the DB community.
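To make the variety problem concrete, here is a minimal sketch of the simplest kind of integration: two sources describe the same entities with different field names, and a hand-written mapping renames them into one shared schema. The source names, field names, and mappings are all hypothetical. Real integration is harder because it must also reconcile differing semantics, units, and duplicate entities, none of which this sketch attempts.

```python
from typing import Any, Dict, List

# Hypothetical per-source mappings: original field name -> shared field name.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "source_a": {"full_name": "name", "dob": "birth_date"},
    "source_b": {"Name": "name", "BirthDate": "birth_date"},
}


def to_common_schema(source: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """Rename the fields of one record into the shared schema.

    Fields without a mapping are dropped; mapped fields missing from the
    record are simply absent from the result.
    """
    mapping = FIELD_MAPS[source]
    return {
        common: record[original]
        for original, common in mapping.items()
        if original in record
    }


def integrate(batches: Dict[str, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Convert every record from every source and concatenate the results."""
    merged: List[Dict[str, Any]] = []
    for source, records in batches.items():
        for record in records:
            merged.append(to_common_schema(source, record))
    return merged


if __name__ == "__main__":
    # Toy records, invented purely for illustration.
    batches = {
        "source_a": [{"full_name": "Ada", "dob": "1815-12-10"}],
        "source_b": [{"Name": "Alan", "BirthDate": "1912-06-23", "extra": 1}],
    }
    for row in integrate(batches):
        print(row)
```

Even in this toy case the mapping has to be written by someone who understands both sources, which is one reason variety tends to need more human insight than volume.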
Three V’s of big data: Velocity • Data is rapidly evolving. • Web sites, social networks, scientific data, … • Trends are changing in a short amount of time. • News media, stock market, … • We like to get the insight fast. • We do not like to rewrite our programs.
An extra V: Veracity • Data is not clean and consistent. • A common experience among data engineers and scientists
Data exploration and analysis • The focus of the course is on variety and velocity: data heterogeneity. • Why is data heterogeneous, and how can we handle it? • We will discuss other issues as well.
Prerequisites • CS 540 or equivalent • Contact instructor if you are not sure.
Course format • Some basic lectures at the beginning. • Mostly paper presentation and discussion. • One paper in most sessions. • Student presentation followed by group discussion.
Student presentation • Select a paper by the end of this week. • Multiple papers for some subjects • Choose a paper/subject that excites you. • Email the presentation material by 5:00 pm the day before.
Discussion and participation • All students must read the paper and post a summary and questions on Piazza. • Each student should ask at least one question per week. • A short wrap-up quiz.
Readings • The list of papers is posted on the course Web site. • Reference textbooks: • Foundations of Databases, Serge Abiteboul, Richard Hull, and Victor Vianu • Database Systems: The Complete Book, Hector Garcia-Molina, Jeffrey Ullman, and Jennifer Widom
Project • A research project on big data exploration • Pick a question • Define the problem rigorously and get insights. • Provide a solution: build a proof-of-concept prototype, prove some interesting results • Groups of 1–4 students • Talk to the instructor • The project proposal is due in the third week of the class.
Grading Scheme • One assignment to cover the basic concepts: 5% • Paper review: 15% • Paper presentation: 15% • Discussion: 10% • Quiz: 15% • Project: 40%
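The weights above can be checked and applied with a short sketch. It assumes each component is scored on a 0–100 scale and combined linearly, which the slide does not state; the component keys and function name are illustrative.

```python
from typing import Dict

# Weights in percent, copied from the grading scheme above.
WEIGHTS: Dict[str, int] = {
    "assignment": 5,
    "paper_review": 15,
    "paper_presentation": 15,
    "discussion": 10,
    "quiz": 15,
    "project": 40,
}


def final_score(scores: Dict[str, float]) -> float:
    """Weighted total on a 0-100 scale.

    Assumes every component score is itself on a 0-100 scale.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


if __name__ == "__main__":
    print("Weights sum to", sum(WEIGHTS.values()))
    example = {name: 80.0 for name in WEIGHTS}  # hypothetical scores
    print("Final score:", final_score(example))
```

The project alone carries 40% of the total, more than any other component.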