
THE QUALITY OF BIG DATA IS MORE THAN A PROBLEM TODAY. Complexity of the framework and factors affecting the quality of big data. Presenter: Dr Nesterov S. (nnesterov@tpg.com.au); co-author: Prof. Simchera V.M. 62nd ISI World Statistics Congress 2019 (ISI WSC 2019)


Presentation Transcript


  1. THE QUALITY OF BIG DATA IS MORE THAN A PROBLEM TODAY. Complexity of the framework and factors affecting the quality of big data. Presenter: Dr Nesterov S. (nnesterov@tpg.com.au); co-author: Prof. Simchera V.M. 62nd ISI World Statistics Congress 2019 (ISI WSC 2019), 18–23 August 2019, Kuala Lumpur, Malaysia. Invited Paper Sessions (IPS146)

  2. Big Data definition • Big data (Gartner, 2012) is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. • The enhanced insight levels are: • Descriptive • Predictive • Prescriptive

  3. Big Data examples: • Web data: page views, searches • Text data: emails, news, feeds • Time and location (GPS) data • Smart grid and sensor data • Data collected from cars … • Social network data • Manual data recording by an observer or expert

  4. Big Data lifecycle (system engineering) • Design of the big data system (modelling the experiment) • Implementation of the big data system (model) • Development platforms: over 50 big data platforms are rated and listed in [www1], with best-practice examples • System operation and maintenance • Big data supporting platforms as tools • Decommissioning • Destroy data • Keep data • Sell data

  5. Big Data generic modelling loop The general activities involved in big data modelling or processing are: • Ingesting data into the system • Persisting the data in storage • Computing and analyzing the data • Visualizing the results • Analyzing the results
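The loop above can be sketched end to end in a few lines. This is a toy, single-machine illustration only; the function names, file path, and sample sensor records are invented for the sketch, and a real system would distribute each stage:

```python
# Minimal sketch of the generic modelling loop:
# ingest -> persist -> compute/analyze -> report.

import json
import os
import statistics
import tempfile

def ingest():
    # Ingest: in practice this would read from a stream, API, or sensors.
    return [{"sensor": "s1", "value": v} for v in (10.0, 12.0, 11.0, 13.0)]

def persist(records, path):
    # Persist: write the raw records to durable storage (here, a JSON file).
    with open(path, "w") as f:
        json.dump(records, f)

def compute(path):
    # Compute/analyze: load the persisted data and derive a summary statistic.
    with open(path) as f:
        records = json.load(f)
    values = [r["value"] for r in records]
    return {"mean": statistics.mean(values), "n": len(values)}

def report(result):
    # Visualize/report: here, just a formatted string.
    return f"mean={result['mean']:.2f} over n={result['n']} records"

path = os.path.join(tempfile.gettempdir(), "bigdata_demo.json")
persist(ingest(), path)
print(report(compute(path)))  # mean=11.50 over n=4 records
```

Each stage only communicates through storage, which mirrors how the loop scales: any stage can be swapped for a distributed equivalent without changing the others.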

  6. Big Data as a trading “product” • Data and application models have to fit in when using big data acquired from a third party • Integration of developed elements (APIs) of big data and acquired data to provide enhanced functionality cost-effectively • Subscription-based access to data is a usual trading mode • The data analytics software and rapid application development platform [www3] is an example of extended and enhanced options

  7. Big Data legal and regulatory issues • Security • Confidentiality • Privacy • Validation (data certification) • A settlement example from practice is given on the next slide

  8. Big Data: Facebook privacy violation case (compilation from the Internet) • Facebook will pay $5 billion and submit to “government restrictions” on its treatment of users’ private data under the terms of the largest-ever Federal Trade Commission settlement imposed for privacy violations, sources said (July 2019). The FTC only narrowly approved the fine, in a vote that came down along party lines. The fine punishes Facebook for allowing as many as 87 million users’ data to fall into the possession of political consulting firm Cambridge Analytica, in violation of a 2012 FTC consent decree. • Even though the fine is larger by a factor of 100 than the next-largest FTC penalty for a privacy violation, it is also less than a tenth of Facebook’s 2018 revenue and a fraction of its $600 billion valuation.

  9. Big Data ecosystem of technological components • Technologies for capturing, storing and accessing big data • Traditionally, data are stored in relational databases (servers) • Cloud computing: infrastructure and technologies • Analytical technologies • Statistical methods: forecasting, regression analysis • Database querying • Data warehouses • Machine learning and data mining • Visualization • Graphics are used primarily for two reasons: exploratory data analysis and presenting results.
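As a small, self-contained illustration of the statistical methods named above, here is simple least-squares linear regression over a toy data set. The data are invented for the sketch; real big-data analytics would run such computations on the distributed platforms listed in this slide:

```python
# Least-squares fit of a line y = a*x + b to a small data set,
# computed from the closed-form formulas a = cov(x, y) / var(x),
# b = mean(y) - a * mean(x).

def fit_line(xs, ys):
    """Return slope a and intercept b of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

xs = [1, 2, 3, 4]
ys = [2.1, 4.0, 6.2, 7.9]
a, b = fit_line(xs, ys)
print(f"slope={a:.2f} intercept={b:.2f}")  # slope=1.96 intercept=0.15
```

The same closed-form computation underlies the regression tooling in the analytical platforms; at big-data scale only the summation step changes, being distributed across machines.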

  10. Big Data vs “traditional” data • There are some important ways in which big data differs from traditional data sources. In his book [R2], Franks suggests the following ways in which big data can be seen as different from traditional data sources: • First, big data can be an entirely new source of data. • Second, sometimes one can argue that the speed of a data feed has increased to such an extent that it qualifies as a new data source. • Third, increasingly more semi-structured and unstructured data are coming in, whereas most traditional data sources are in the structured realm.

  11. Big Data operational characteristics • Volume: more than terabytes (data size) • Velocity: fast transactions with high refresh rates in real time (speed of change) • Variety: different data sources • internal and external • structured and unstructured However, big data or small data does not in and of itself possess any value. It is valuable only when you can get some insight out of the data, and that insight can be used to guide your decision making.

  12. Big Data characteristics cont …. Some common additions are: • Veracity: The variety of sources and the complexity of the processing can lead to challenges in evaluating the quality of the data (and consequently, the quality of the resulting analysis) • Variability: Variation in the data leads to wide variation in quality. Additional resources may be needed to identify, process, or filter low quality data to make it more useful. • Value: The ultimate challenge of big data is delivering value (usability). Sometimes, the systems and processes in place are complex enough that using the data and extracting actual value can become difficult.

  13. Big Data characteristics cont. … • Some further additions are: • Reliability: The variety of sources and the complexity of the processing can lead to challenges in evaluating the quality of the data (and consequently, the quality of the resulting analysis) • Usability: Variation in the data leads to wide variation in quality. Additional resources may be needed to identify, process, or filter low-quality data to make it more useful. • Value (usability): The ultimate challenge of big data is delivering value. Sometimes, the systems and processes in place are complex enough that using the data and extracting actual value can become difficult.

  14. Big Data reliability characteristic • The most important quality of the data itself is reliability. Reliability is the degree to which an assessment tool produces stable and consistent results. The next question is how reliable any given data set is. Today there is no answer to this question and, paradoxically, a general answer may not exist, not least because big data is usually a class of purely random data. • For readers interested in approaches to dealing with this demanding situation, the Annex includes a list of Prof. Simchera V.M.'s publications, which provide more information.

  15. Does the current quality of Big Data fit the purpose? Building the modern digital economy on unreliable and falsified data is useless; such data are not fit for purpose. The lack of reliable data leads to false decisions. Interest is growing in methods for modelling efficient systems for collecting, processing, exchanging and analyzing reliable data; only on such a basis can mankind make “objectively based” decisions grounded in digital evidence and facts.

  16. Big Data in the current digital landscape • Digital economy without adequate and reliable data • Mostly linear scientific hypotheses and plausible models • The foundation of such models is the availability of adequate and complete sets of reliable primary data, known today as big data • Pitfalls in modern big data information systems: • Incomplete, unreliable and repeatedly duplicated or blank data • Scattered methods for cross-data analysis • Deficiency of reliable smart data • Inability to solve non-linear multifactor and multidimensional problems • Lack of unified informational tools • Absence of a new generation of data analysts capable of working with R statistical operators, able to model creative observations and flexibly convert and apply modern smart data in hierarchical multipurpose control systems

  17. Big Data in the future digital landscape • Digital economy based on reliable and sound smart data • Reliable data: how to detect faults, and what to do when they are found • Data certification may be a solution • Big data stochastic modelling for identifying and eliminating false data • Other methods for the search and verification of reliable data

  18. Big Data – tool and source of information for data scientists/analysts • The people who do big data analytics are nowadays called data scientists. • The job title “data scientist” is sometimes criticized because it lacks specificity and can be perceived as a glorified synonym for data analyst. • Regardless, the position is gaining acceptance with large enterprises that are interested in deriving meaning from big data: the large amount of structured, unstructured and semi-structured data that a large enterprise produces.

  19. Big Data summary • Big data is here, and it is here to stay. Big data is a foundation for analytics platforms and applications. It enables enhanced insight, decision making, and process automation. • The key characteristics of big data are the three Vs: Volume, Velocity and Variety; in addition: Veracity, Variability and Value. • Data come from a variety of sources and can be used in various industry applications. Often it is the combination of data sources that counts. • There is a paradigm shift in analytic focus: from descriptive analytics to predictive and prescriptive analytics. • Big data necessitates a new type of data management solution, one that is highly scalable, massively parallel, and cost-effective.

  20. Big Data: End of Presentation

  21. Annexes

  22. Big Data related web references • [www1] https://www.predictiveanalyticstoday.com • PAT Research is a B2B discovery platform which provides best practices, buying guides, reviews, ratings, comparisons, research, commentary, and analysis for enterprise software and services. • [www2] https://www.ntnu.no/iie/fag/big/lessons/lesson2.pdf • Xiaomeng Su, Institutt for informatikk og e-læring ved NTNU; learning material developed for the course IINI3012 Big Data. • [www3] https://www.digitalocean.com/community/tutorials/an-introduction-to-big-data-concepts-and-terminology • DigitalOcean.com, “An Introduction to Big Data Concepts and Terminology”, by Justin Ellingwood.

  23. References • [R1] Mark A. Beyer and Douglas Laney. “The Importance of 'Big Data': A Definition”. Gartner, 2012 • [R2] Bill Franks. “Taming the Big Data Tidal Wave”. Wiley, 2012 • [R3] David R. Hardoon and Galit Shmueli. “Getting Started with Business Analytics – Insightful Decision Making”. Taylor & Francis Group, 2013 • [R4] Foster Provost and Tom Fawcett. “Data Science for Business”. O’Reilly, 2013 • [R5] Thomas H. Davenport and D.J. Patil. “Data Scientist: The Sexiest Job of the 21st Century”. Harvard Business Review, 2012

  24. Big Data: Hadoop Open Source project A popular big data development environment is Hadoop, an Apache project that was an early open-source success in big data. It consists of a distributed filesystem called HDFS, with a cluster manager and resource scheduler on top called YARN (Yet Another Resource Negotiator). Batch processing capabilities are provided by the MapReduce computation engine. Other computational and analysis systems can be run alongside MapReduce in modern Hadoop deployments [www3].
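The MapReduce pattern itself is easy to demonstrate in miniature. The sketch below simulates the map, shuffle, and reduce phases in a single process for the canonical word-count job; a real Hadoop job would distribute these same phases across a cluster via HDFS and YARN:

```python
# In-process simulation of the three MapReduce phases for word count.

from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word occurrence.
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework would
    # when routing pairs to reducer nodes.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data is here to stay"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"], counts["data"], counts["is"])  # 2 2 2
```

Because the map and reduce functions touch only one record or one key at a time, the framework can run many copies of each in parallel, which is what makes the pattern suitable for batch processing at big-data scale.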

  25. Bibliography of Prof. Simchera V.M. Books related to the algorithms used in constructing an application program, with illustrative examples: • Methods of Multivariate Analysis of Statistical Data. Moscow: “Finances and Statistics” Publisher, 2007, 400 pp. (contents and abstract in English, pp. 395–397). • Russia: 100 Years of Economic Growth. First edition: 1900–2000 – Historical Series. Trends of Centuries. Institutional Cycles. Moscow: “Nauka” Publisher, 2006, 587 pp. (in English, pp. 584–587). Second edition: Historical Series: X–XX Centuries – Historical Series. Trends of Centuries. Periodical Cycles. “Economica”, www Economizdat.ru, 684 pp. (in English, pp. 681–684). (With co-authors.) • Introduction to Financial and Actuarial Calculation. Moscow: “Finansy and Statistika” Publisher, 2003, 353 pp. (in English, pp. 346–353). • Encyclopedia of Statistical Publications, X–XX Centuries. Moscow: “Finansy and Statistika” Publisher, 2001, 991 pp. (in English, pp. 3–47). • The Chronological Measurement of Benchmark and Civilization Cycles of Historical Progress (I–XX centuries). In the journal “Economic Strategies”, 2009, No. 3, pp. 88–94 (in Russian); Rhodes Public Forum “Dialogue of Civilizations”, 2008 (in English, 12 pp.).

  26. Big Data folklore • The art of statistics consists in the correct addition of wrong numbers. • Garbage in (the system), garbage out.
