
CS 519 : Big Data Exploration and Analytics



Presentation Transcript


  1. CS 519: Big Data Exploration and Analytics 1: Introduction

  2. Welcome to CS519! • Arash Termehchy • Assistant professor in the school of EECS • Usable data management and exploration. • Your turn: • Name, field, DB background

  3. The Era of Big Data • People and devices generate and share data at staggering rates • Your friends: social networks, online games, … • 30 billion data items shared on Facebook every month • Your cell phone: your positions, daily activities, … • Your car • Your shopping activities • Web: Surface and deep web

  4. The Era of Big Data • Hubble Telescope: 50 GB each month. • High throughput screening devices • Environmental sensor networks

  5. Data is valuable • In the mid-1850s, Dr. John Snow plotted cholera deaths on a map, and at the corner of a particularly hard-hit group of buildings was a water pump. • A 19th-century version of big data, which suggested an association between cholera and the pump.

  6. Data is valuable • “The Fourth Paradigm: Data-Intensive Scientific Discovery”, Jim Gray • Empirical • Theoretical • Computational • Data exploration, eScience • The Sloan Sky Server is one of the most cited resources in astronomy

  7. Data is valuable • Tracking the spread of diseases by analyzing Google query logs • Personalized medicine, drug discovery, … • “The Unreasonable Effectiveness of Data”

  8. Three V’s of big data: Volume • Large Hadron Collider: 500 exabytes per day if all sensors work. • The Sloan Digital Sky Server has to accommodate 30 TB of new data per day in 2016. • According to McKinsey & Company: • 40% growth in global data each year • 90% of the world’s data was generated in the last two years!
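
To make the 40% annual growth figure concrete, here is a minimal Python sketch of what it implies under constant compound growth. Only the 40% rate comes from the slide. The function names, the starting volume of 1.0, and the five-year horizon are illustrative assumptions.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed for a quantity to double under constant compound growth."""
    return math.log(2) / math.log(1 + annual_growth)


def projected_volume(current: float, annual_growth: float, years: int) -> float:
    """Volume after `years` of compound growth, in the same unit as `current`."""
    return current * (1 + annual_growth) ** years


if __name__ == "__main__":
    # 40% is the growth rate quoted on the slide; everything else is assumed.
    print(round(doubling_time_years(0.40), 2))    # 2.06
    print(round(projected_volume(1.0, 0.40, 5), 2))  # 5.38
```

Under this assumption, global data doubles roughly every two years and grows more than fivefold in five years.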

  9. Three V’s of big data: Variety • Valuable information is scattered across various sources in various forms. • Large number of social networks • Large number of life science databases

  10. Three V’s of big data: Variety • “The systemic risks associated with the subprime lending market and the crash of the housing market in 2007 could have been modeled through a comprehensive integration and analysis of available public datasets. …. Integrating these datasets may have provided financial analysts, regulators and academic researchers, with comprehensive models to enable risk assessment.” http://wiki.umiacs.umd.edu/clip/datascience

  11. Three V’s of big data: Variety • It is arguably more challenging than volume, as it requires a deeper understanding of the data • Data integration has been recognized as a hard problem in the DB community.
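
As a rough illustration of why variety demands an understanding of the data, the Python sketch below unifies records from two sources that describe the same kind of entity under different field names. The sources, field names, and sample records are hypothetical and not from the course material. Real data integration also has to handle conflicting values, missing fields, and entity matching, which this sketch ignores.

```python
# Hypothetical records from two sources with different schemas.
SOURCE_A = [{"full_name": "Ada Lovelace", "dob": "1815-12-10"}]
SOURCE_B = [{"name": "Alan Turing", "birth_date": "1912-06-23"}]

# Hand-written mapping from each source's field names to a shared target schema.
# Writing these mappings is the part that requires knowing what the data means.
MAPPINGS = {
    "A": {"full_name": "name", "dob": "birth_date"},
    "B": {"name": "name", "birth_date": "birth_date"},
}


def to_target_schema(record: dict, mapping: dict) -> dict:
    """Rename a record's fields according to a source-to-target mapping."""
    return {target: record[source] for source, target in mapping.items()}


def integrate(sources: dict) -> list:
    """Merge records from all sources into one list in the target schema."""
    unified = []
    for source_id, records in sources.items():
        for record in records:
            row = to_target_schema(record, MAPPINGS[source_id])
            row["source"] = source_id  # keep provenance for later cleaning
            unified.append(row)
    return unified


if __name__ == "__main__":
    for row in integrate({"A": SOURCE_A, "B": SOURCE_B}):
        print(row)
```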

  12. Three V’s of big data: Velocity • Data is rapidly evolving. • Web sites, social networks, scientific data, … • Trends change in a short amount of time. • News media, stock market, … • We want insights quickly. • We do not want to rewrite our programs.

  13. An extra V: Veracity • Data is not clean and consistent. • Common experience among data engineers and scientists

  14. Data exploration and analysis • The focus of the course is on variety and velocity: data heterogeneity. • Why is data heterogeneous, and how can we handle it? • We will discuss other issues as well.

  15. Prerequisites • CS 540 or equivalent • Contact instructor if you are not sure.

  16. Course format • Some basic lectures at the beginning. • Mostly paper presentation and discussion. • One paper in most sessions. • Student presentation followed by group discussion.

  17. Student presentation • Select a paper by the end of this week. • Multiple papers for some subjects • Choose a paper/subject that excites you. • Email the presentation material by 5:00 pm the day before.

  18. Discussion and participation • All students must read the paper and post a summary and questions on Piazza. • Each student should ask at least one question per week. • A short wrap-up quiz.

  19. Readings • The list of papers is posted on the course Web site. • Reference textbooks: • Foundations of Databases, Serge Abiteboul, Richard Hull, and Victor Vianu • Database Systems: The Complete Book, Hector Garcia-Molina, Jeffrey Ullman, and Jennifer Widom

  20. Project • A research project on big data exploration • Pick a question • Define the problem rigorously and get insights. • Provide a solution: build a proof-of-concept prototype or prove some interesting results • Groups of 1 to 4 students • Talk to the instructor • Project proposal is due in the third week of the class.

  21. Grading Scheme • One assignment to cover the basic concepts: 5% • Paper review: 15% • Paper presentation: 15% • Discussion: 10% • Quiz: 15% • Project: 40%
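
The weights above sum to 100%, so a final grade can be read as a weighted average of the component scores. The Python sketch below shows that arithmetic. The weights are taken from the slide. The component keys, the 0-100 score scale, and the sample scores are assumptions for illustration, and the instructor's actual computation may differ.

```python
# Weights from the grading slide, expressed as fractions of the final grade.
WEIGHTS = {
    "assignment": 0.05,
    "paper_review": 0.15,
    "paper_presentation": 0.15,
    "discussion": 0.10,
    "quiz": 0.15,
    "project": 0.40,
}


def final_grade(scores: dict) -> float:
    """Weighted average of component scores, each assumed to be on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[component] * scores[component] for component in WEIGHTS)


if __name__ == "__main__":
    # Hypothetical scores, not real student data.
    sample = {
        "assignment": 90,
        "paper_review": 85,
        "paper_presentation": 80,
        "discussion": 100,
        "quiz": 70,
        "project": 88,
    }
    print(round(final_grade(sample), 2))
```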
