This presentation gives an overview of the Apache MADlib AI/ML project. It explains Apache MADlib in terms of its functionality, architecture, and dependencies, and also gives an SQL example.

Links for further information and connecting:
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
What Is Apache MADlib?
● For scalable in-database analytics
● Open source Apache 2.0 license
● For machine learning in SQL
● At big data scale
● Offers graph, statistics, analytics, and deep learning
● Provides data-parallel implementations
● For structured and unstructured data
MADlib Prerequisites
● Currently supports these databases
  – PostgreSQL
    ● Needs the Python extension enabled
  – Greenplum (distributed db)
  – Apache HAWQ (v1.12+) (distributed db)
● Requires the GNU M4 Unix macro processor
● Works with Python 2.6 and 2.7
MADlib Architecture
● MADlib has three main layers
● Python driver functions
  – Main entry point from user input
  – Largely responsible for algorithm flow control
  – Validating input parameters
  – Executing SQL statements
  – Evaluating the results
  – Potentially looping to execute more SQL statements
    ● Until a convergence criterion has been met
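The driver flow on this slide (validate, execute SQL, evaluate, loop until convergence) can be sketched in a few lines. This is a minimal illustration, not MADlib code: `fit_slope` is a hypothetical function, the stdlib `sqlite3` module stands in for PostgreSQL, and the model (fitting y = w * x by gradient descent) is chosen only because it needs one SQL aggregate per iteration.

```python
import sqlite3


def fit_slope(conn, table, x_col, y_col,
              learning_rate=0.01, tolerance=1e-9, max_iter=10000):
    """Hypothetical driver: fit y = w * x, one SQL aggregate per iteration."""
    # Validate input parameters before touching the database.
    for name in (table, x_col, y_col):
        if not name.isidentifier():
            raise ValueError(f"invalid identifier: {name!r}")
    if learning_rate <= 0 or tolerance <= 0:
        raise ValueError("learning_rate and tolerance must be positive")
    if conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0] == 0:
        raise ValueError("training table is empty")

    w = 0.0
    for iteration in range(1, max_iter + 1):
        # Execute a SQL statement: the database computes the mean gradient
        # of the squared error with respect to w.
        (grad,) = conn.execute(
            f"SELECT AVG(2.0 * ({x_col} * ? - {y_col}) * {x_col}) FROM {table}",
            (w,),
        ).fetchone()
        # Evaluate the result and decide whether another pass is needed.
        new_w = w - learning_rate * grad
        if abs(new_w - w) < tolerance:
            return new_w, iteration
        w = new_w
    return w, max_iter


# Usage with an in-memory table whose rows follow y = 2 * x.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany("INSERT INTO points VALUES (?, ?)",
                 [(1, 2), (2, 4), (3, 6), (4, 8)])
weight, iterations = fit_slope(conn, "points", "x", "y")
print(f"w = {weight:.6f} after {iterations} iterations")
```

The point of the split is that the Python side only controls the flow, while the per-row arithmetic stays inside the database as an aggregate.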
MADlib Architecture
● MADlib has three main layers
● C++ implementation functions
  – C++ definitions of the core functions/aggregates
    ● Needed for particular algorithms
  – Implemented in C++ rather than Python
    ● For performance reasons
MADlib Architecture
● MADlib has three main layers
● C++ database abstraction layer
  – Provides a programming interface
  – Abstracts all the Postgres internal details
  – Provides support for different back end platforms
  – Lets MADlib focus on the internal functionality
    ● Rather than the platform integration logic
MADlib Data Types and Transformations
● Arrays and Matrices
● Encoding Categorical Variables
● Path
● Pivot
● Sessionize
● Stemming
MADlib Graph Functionality
● All Pairs Shortest Path
● Breadth-First Search
● HITS
● Measures
● PageRank
● Single Source Shortest Path
● Weakly Connected Components
MADlib Model Selection / Sampling
● Model Selection
  – Cross Validation
  – Prediction Metrics
  – Train-Test Split
● Sampling
  – Balanced Sampling
  – Stratified Sampling
MADlib Statistics / Supervised Learning
● Statistics
  – Descriptive Statistics
  – Inferential Statistics
  – Probability Functions
● Supervised Learning
  – Conditional Random Field
  – k-Nearest Neighbors
  – Neural Network
  – Regression Models
  – Support Vector Machines
  – Tree Methods
MADlib Time Series / Unsupervised Learning
● Time Series Analysis
  – ARIMA
● Unsupervised Learning
  – Association Rules
  – Clustering
  – Dimensionality Reduction
  – Topic Modelling
MADlib Utilities
● Columns to Vector
● Database Functions
● Linear Solvers
● Mini-Batch Preprocessor
● PMML Export
● Term Frequency
● Vector to Columns
MADlib Deep Learning Example SQL
● First define the model configurations to train
● Meaning either model architectures or hyperparameters
● Load them into a model selection table
● The combination of model architectures and hyperparameters
● Constitutes the model configurations to train
● In the picture there are three model configurations
● Represented by the three different purple shapes
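To make the idea of a model selection table concrete, here is a minimal plain-Python sketch of combining architectures with hyperparameter sets. It does not use MADlib's API: `build_model_selection_table`, the architecture names, and the hyperparameter values are all hypothetical, and the "table" is just a list of dictionaries.

```python
from itertools import product


def build_model_selection_table(architectures, hyperparameter_sets):
    """Pair every architecture with every hyperparameter set (hypothetical helper)."""
    rows = []
    combos = product(architectures, hyperparameter_sets)
    for config_id, (arch, params) in enumerate(combos, start=1):
        rows.append({
            "config_id": config_id,           # one row per model configuration
            "architecture": arch,
            "hyperparameters": dict(params),  # copy so rows do not share state
        })
    return rows


# Illustrative inputs only; none of these names come from MADlib.
architectures = ["one_hidden_layer", "two_hidden_layers"]
hyperparameter_sets = [
    {"learning_rate": 0.01, "batch_size": 32},
    {"learning_rate": 0.10, "batch_size": 32},
]

for row in build_model_selection_table(architectures, hyperparameter_sets):
    print(row)
```

Each resulting row is one model configuration, which is the unit the training step then works through.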
MADlib Deep Learning Example SQL
● Once we have model combinations
● In the model selection table
● Call the fit function to train the models
  – In parallel
● In the picture the three orange shapes
● Represent the three models that have been trained
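The "fit every configuration in parallel" step can be sketched as follows. This is an assumption-laden illustration rather than MADlib's fit function: `train_one` and `fit_all` are hypothetical, the "model" is a single weight fitted to toy data, and a stdlib thread pool is used only to show concurrent dispatch of one task per configuration.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy training data following y = 3 * x (illustrative only).
DATA = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]


def train_one(config):
    """Hypothetical stand-in for training one model configuration."""
    w = 0.0
    for _ in range(config["epochs"]):
        # Mean gradient of the squared error with respect to w.
        grad = sum(2.0 * (w * x - y) * x for x, y in DATA) / len(DATA)
        w -= config["learning_rate"] * grad
    loss = sum((w * x - y) ** 2 for x, y in DATA) / len(DATA)
    return {"config_id": config["config_id"], "weight": w, "loss": loss}


def fit_all(configs, max_workers=3):
    """Train every configuration concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(train_one, configs))


# Three configurations, echoing the three shapes mentioned on the slide.
configs = [
    {"config_id": 1, "learning_rate": 0.01, "epochs": 50},
    {"config_id": 2, "learning_rate": 0.05, "epochs": 50},
    {"config_id": 3, "learning_rate": 0.10, "epochs": 50},
]
results = fit_all(configs)
best = min(results, key=lambda r: r["loss"])
print("lowest loss came from config", best["config_id"])
```

Training all configurations in one call makes it straightforward to compare their losses afterwards and keep the best one.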
Available Books
● See “Big Data Made Easy”
  – Apress Jan 2015
● See “Mastering Apache Spark”
  – Packt Oct 2015
● See “Complete Guide to Open Source Big Data Stack”
  – Apress Jan 2018
● Find the author on Amazon
  – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020
Connect
● Feel free to connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
  – open-source-systems.blogspot.com/
● I am always interested in
  – New technology
  – Opportunities
  – Technology-based issues
  – Big data integration