Explore the fundamentals of streaming data, use cases, Azure options, and architectural design in real-time analytics using Kafka and Spark in Microsoft Azure. Includes demos and Q&A.
Getting started with real-time analytics with Kafka and Spark in Microsoft Azure Joe Plumb Cloud Solution Architect – Microsoft UK @joe_plumb
Alternative title: Everything I know about real-time analytics in Microsoft Azure Joe Plumb Cloud Solution Architect – Microsoft UK @joe_plumb
Agenda • Fundamentals of streaming data • What streaming data can be useful for • What options are there to use data streams in Microsoft Azure? • Demo • Q&A
What is streaming data? • “Streaming data is data that is continuously generated by different sources.” https://en.wikipedia.org/wiki/Streaming_data • Streaming system - A type of data processing engine that is designed with infinite datasets in mind. https://learning.oreilly.com/library/view/streaming-systems/9781491983867/ch01.html
Why bother? • Batch processing can give great insights into things that happened in the past, but it lacks the ability to answer the question of "what is happening right now?" • "Data is valuable only when there is an easy way to process and get timely insights from data sources."
What is it good for? • Where is streaming data? Clickstream data, sensors, smart machinery (e.g. production lines), GPS • What is it useful for? Website monitoring, network monitoring, fraud detection, web clicks, advertising, environment monitoring, application usage tracking, recommendations, …
Streaming System architecture Source: https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/real-time-processing
Event vs Message • It could be argued that this is an issue of semantics, as they 'look' the same (e.g. JSON object, CSV, etc.) • Message is a catch-all term, as messages are just bundles of data • An event message is a type of message: "When a subject has an event to announce, it will create an event object, wrap it in a message, and send it on a channel." https://www.enterpriseintegrationpatterns.com/patterns/messaging/EventMessage.html
It’s all about time • Cardinality is important because the unbounded nature of infinite datasets imposes additional burdens on the data processing frameworks that consume them. • We need ways to reason about time
It’s all about time: Event time vs Processing time • In an ideal world, the processing system receives the event when it happens. • In reality, the skew between an event happening and the system processing that event can vary wildly • Event time: the time the event occurs • Processing time: the time the system becomes aware of the event
It’s all about time: Event time vs Processing time • In an ideal world, the processing system receives the event when it happens. • In reality, the skew between an event happening and the system processing that event can vary wildly • Processing-time lag is the delay between when an event occurs and when the system observes it • Event-time skew is how far behind the ideal (event time = processing time) the pipeline is at that moment (see the sketch below)
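To make the two timestamps concrete, here is a minimal Python sketch (not from the talk; names and payload are illustrative) in which the producer stamps event time and the consumer measures processing-time lag on arrival.

```python
# Minimal sketch (illustrative names/payload): the producer stamps event time,
# the consumer stamps processing time, and the difference is the lag.
from datetime import datetime, timezone
import json
import time

def make_event(payload):
    # Event time: when the thing actually happened, stamped at the source.
    return json.dumps({
        "event_time": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    })

def handle_event(raw):
    event = json.loads(raw)
    event_time = datetime.fromisoformat(event["event_time"])
    # Processing time: when this system becomes aware of the event.
    processing_time = datetime.now(timezone.utc)
    lag = (processing_time - event_time).total_seconds()
    print(f"payload={event['payload']} processing-time lag={lag:.3f}s")

raw = make_event({"sensor": "line-1", "temp": 21.4})
time.sleep(0.5)  # simulate network / queueing delay between source and processor
handle_event(raw)
```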
It’s all about time: Watermarking • An event-time marker that indicates all events up to a given point have been fed to the streaming processor. By the nature of streams, the incoming event data never stops, so watermarks indicate progress to a certain point in the stream. • Watermarks can either be a strict guarantee (perfect watermark) or an educated guess (heuristic watermark)
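As a concrete illustration, here is a minimal PySpark Structured Streaming sketch (assumed, not the speaker's code) where a 10-minute heuristic watermark bounds how late events may arrive before Spark finalises the corresponding event-time windows.

```python
# Minimal PySpark sketch (assumed): a heuristic watermark of 10 minutes tells
# Spark that events more than 10 minutes older than the max event time seen so
# far may be dropped, so old event-time windows can be finalised and state bounded.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("watermark-sketch").getOrCreate()

events = (spark.readStream
          .format("rate")              # built-in test source; swap for Kafka in practice
          .option("rowsPerSecond", 5)
          .load())                     # columns: timestamp, value

counts = (events
          .withWatermark("timestamp", "10 minutes")         # heuristic watermark
          .groupBy(window(col("timestamp"), "5 minutes"))    # event-time window
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())

query.awaitTermination()
```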
It’s not just about time: Triggers • Triggers determine when processing of the accumulated data is started. • Repeated update triggers • These periodically generate updated panes for a window as its contents evolve. • Completeness triggers • These materialize a pane for a window only after the input for that window is believed to be complete to some threshold.
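For reference, the sketch below (assumptions: the built-in rate test source and a console sink) shows how Spark exposes trigger settings; `processingTime` gives a repeated-update cadence, while a run-once trigger is closer in spirit to processing up to a completeness point.

```python
# Minimal sketch (assumed: rate test source, console sink) of Spark's trigger API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-sketch").getOrCreate()
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Repeated-update style: materialise a new micro-batch of results every 30 seconds.
query = (stream.writeStream
         .format("console")
         .trigger(processingTime="30 seconds")
         .start())

# Alternative: process whatever input is currently available, then stop, which is
# closer in spirit to "process up to a completeness point".
# query = stream.writeStream.format("console").trigger(once=True).start()

query.awaitTermination()
```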
Delivery Guarantees • At-most-once • means that for each message handed to the mechanism, that message is delivered zero or one times; in more casual terms it means that messages may be lost. • At-least-once • means that for each message handed to the mechanism potentially multiple attempts are made at delivering it, such that at least one succeeds; again, in more casual terms this means that messages may be duplicated but not lost. • Exactly-once • means that for each message handed to the mechanism exactly one delivery is made to the recipient; the message can neither be lost nor duplicated.
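Since the demo uses kafka-python, here is a hedged sketch (broker address and topic are placeholders) of how producer settings map onto the weaker two guarantees; exactly-once additionally requires idempotent or transactional handling downstream and is not achieved by producer configuration alone.

```python
# Hedged kafka-python sketch (placeholder broker/topic): producer settings that
# correspond to the weaker two delivery guarantees.
from kafka import KafkaProducer

# At-most-once flavour: fire and forget, with no acknowledgements and no retries,
# so messages may be lost but are never duplicated by the producer.
at_most_once = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder
    acks=0,
    retries=0,
)

# At-least-once flavour: wait for the leader and in-sync replicas to acknowledge
# and retry on failure, so messages may be duplicated but should not be lost.
at_least_once = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder
    acks="all",
    retries=5,
)

at_least_once.send("telemetry", b'{"sensor": "line-1", "temp": 21.4}')
at_least_once.flush()
```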
Streaming + Batch? • Lambda architecture • Increasingly viewed as a workaround, due to advances in capabilities and reliability of streaming data systems By Textractor - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=34963985 https://en.wikipedia.org/wiki/Lambda_architecture#/media/File:Diagram_of_Lambda_Architecture_(generic).png
Event Hubs • Fully-managed PaaS service • Big data streaming platform and event ingestion service. • It can receive millions of events per second. Data sent to an event hub can be transformed and stored by using any real-time analytics provider or batching/storage adapters. • Wide range of use cases • Scalable • Event Hubs for Apache Kafka: talk to Event Hubs using the Kafka protocol • Data can be captured automatically in either Azure Blob Storage or Azure Data Lake Store
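Because Event Hubs exposes a Kafka endpoint, an existing Kafka client can publish to it with only configuration changes. The sketch below uses kafka-python; the namespace, connection string, and event hub (topic) name are placeholders.

```python
# Sketch of publishing to a Kafka-enabled Event Hubs namespace with kafka-python.
# <your-namespace>, the connection string, and the event hub name are placeholders.
import json
import ssl
from kafka import KafkaProducer

NAMESPACE = "<your-namespace>"
CONNECTION_STRING = "<event-hubs-connection-string>"  # keep out of source control

producer = KafkaProducer(
    bootstrap_servers=f"{NAMESPACE}.servicebus.windows.net:9093",
    security_protocol="SASL_SSL",
    sasl_mechanism="PLAIN",
    sasl_plain_username="$ConnectionString",   # literal value expected by Event Hubs
    sasl_plain_password=CONNECTION_STRING,
    ssl_context=ssl.create_default_context(),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# The Kafka topic name is simply the event hub name within the namespace.
producer.send("clickstream", {"page": "/home", "user": "u-123"})
producer.flush()
```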
Stream Analytics • Event-processing engine that allows you to examine high volumes of data streaming from devices. • Supports extracting information from data streams and identifying patterns and relationships. • These patterns can then be used to trigger other actions downstream, such as creating alerts, feeding information to a reporting tool, or storing it for later use
Integration with Azure Event Hubs and IoT Hub • Azure Stream Analytics has built-in, first-class integration with Azure Event Hubs and IoT Hub • Data from Azure Event Hubs and Azure IoT Hub can be sources of streaming data to Azure Stream Analytics • The connections can be established through the Azure Portal without any coding • Azure Blob Storage is supported as a source of reference data • Azure Stream Analytics supports compression across all data stream input sources: Event Hubs, IoT Hub, and Blob Storage (Diagram: streaming data from Azure Event Hubs and Azure IoT Hub, plus reference data from Azure Blob Storage, flowing into Azure Stream Analytics)
Azure HDInsight: Cloud Spark and Hadoop service for the Enterprise • Fully-managed Hadoop and Spark for the cloud • 99.9% SLA • 100% open source Hortonworks Data Platform • Clusters up and running in minutes • Familiar BI tools, interactive open source notebooks • 63% lower TCO than deploying your own Hadoop on-premises* • Scale clusters on demand • Secure Hadoop workloads via Active Directory and Ranger • Compliance for open source bits • Best-in-class monitoring and predictive operations via OMS • Native integration with leading ISVs *IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
Apache Storm on HDInsight • One of seven HDInsight cluster types • Integrates with Event Hubs • Apache Storm offered as a managed service on Azure HDInsight • Develop with Visual Studio using Java or C# • Scalable: can analyse millions of events per second • Dynamically scale up and scale down • SLA of 99.9 percent uptime
Azure Databricks • Apache Spark-based analytics platform optimized for Microsoft Azure. Designed in collaboration with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace for analytics.
Spark Structured Streaming overview • The simplest way to perform streaming analytics is not having to think about streaming at all! • Unifies streaming, interactive, and batch queries: uses a single API for both static bounded data and streaming unbounded data • Supports streaming aggregations, event-time windows, windowed grouped aggregation, and stream-to-batch joins • Features streaming deduplication, multiple output modes, and APIs for managing and monitoring streaming queries • Also supports interactive and batch queries: aggregate data in a stream, then serve using JDBC; change queries at runtime; build and apply machine learning models • Built-in sources: Kafka, file source (JSON, CSV, text, and Parquet) • App development in Scala, Java, Python, and R • A unified system for end-to-end fault-tolerant, exactly-once, stateful stream processing • Develop continuous applications that need to interact with batch data, interactive analysis, machine learning, … (Diagram: a pure streaming system vs a continuous application; input streams feed the streaming computation alongside batch jobs over static data and ad-hoc queries, writing to output sinks whose transactions are often left up to the user)
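A minimal end-to-end sketch of the unified API, assuming a Kafka-compatible endpoint and topic name and that the Spark Kafka connector package is available; for a Kafka-enabled Event Hub the SASL settings from the producer example earlier would also be passed as `kafka.*` options.

```python
# Minimal Structured Streaming sketch (assumed endpoint/topic; requires the
# spark-sql-kafka connector package on the classpath).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")  # placeholder
       .option("subscribe", "clickstream")                                            # placeholder topic
       .option("startingOffsets", "latest")
       .load())

# Kafka rows arrive with binary key/value columns plus a timestamp; count events
# per minute using an event-time window bounded by a watermark.
per_minute = (raw
              .selectExpr("CAST(value AS STRING) AS body", "timestamp")
              .withWatermark("timestamp", "5 minutes")
              .groupBy(window(col("timestamp"), "1 minute"))
              .count())

query = (per_minute.writeStream
         .outputMode("update")
         .format("console")
         .start())

query.awaitTermination()
```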
What we’re looking at • Python Flask app publishing events with kafka-python • Event Hubs (Kafka-enabled) • Stream Analytics with a simple tumbling window • Power BI for visualisation (all running in Azure)
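The speaker's demo code isn't reproduced here, but a minimal Flask plus kafka-python producer along these lines (route, topic, and broker address are assumptions) shows the shape of the ingestion side.

```python
# Sketch of the ingestion side (route, topic, and broker address are assumptions,
# not the speaker's demo code): each POST is stamped with event time and published.
import json
from datetime import datetime, timezone

from flask import Flask, jsonify, request
from kafka import KafkaProducer

app = Flask(__name__)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder; point at the Kafka-enabled Event Hub
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

@app.route("/events", methods=["POST"])
def publish_event():
    event = {
        "event_time": datetime.now(timezone.utc).isoformat(),
        "payload": request.get_json(silent=True) or {},
    }
    producer.send("clickstream", event)  # placeholder topic / event hub name
    return jsonify(event), 202

if __name__ == "__main__":
    app.run(port=5000)
```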
Ingestion Services: A Comparison • A side-by-side comparison of the capabilities and features
Comparing Streaming Analytics Services (1/2) • A side-by-side comparison of the capabilities and features
Comparing Streaming Analytics Services (2/2) • A side-by-side comparison of the capabilities and features
Further reading Hands on with Event Hubs and python https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-python Hands on with streaming ETL with Azure Databricks https://medium.com/microsoftazure/an-introduction-to-streaming-etl-on-azure-databricks-using-structured-streaming-databricks-16b369d77e34 Choosing the right service(s) for your use case https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/stream-processing
Further reading http://shop.oreilly.com/product/0636920073994.do
We'd love your feedback! aka.ms/SQLBits19
Thanks! Joe Plumb Cloud Solution Architect – Microsoft UK @joe_plumb