Creating valuable insights from raw data files, such as audio or video, has traditionally been a manual and tedious process, and has produced mixed results because of the human element involved. Thanks to advances in machine learning systems, coupled with the rapidly deployable nature of serverless technology as a middleware layer, we can build sophisticated data insight platforms that remove much of the time this work has typically required. With this in mind, we’ll look at:
- How to build end-to-end data insight and predictor systems on top of serverless and machine learning systems.
- Best practices for using serverless technology to ferry information between raw data files and machine learning systems through an eventing system.
- Considerations and practical examples for handling the security implications of sensitive information.
Better Data with Machine Learning and Serverless
Jonathan LeBlanc (Director of Developer Advocacy @ Box)
Twitter: @jcleblanc
Email: jleblanc@box.com
Agenda for Today
Building Blocks: How are these systems built?
Best Practices: How do we architect the solution?
Security Considerations: How do we ensure data security?
1 What Machine Learning Isn’t
1 Components of the System
Serverless Framework: Provides the compute and data management from the stored data location to the machine learning engine.
Machine Learning System: Provides the data enhancement capabilities that improve the underlying source data’s metadata (information about information).
1 Why Serverless?
On demand: Machine learning tie-ins are only required when files need processing, which may be infrequent.
No hosting: You don’t have to run or manage any servers, containers, or VMs of your own.
Pricing based on use: Execution resources only run (and are only charged for) when you use them, typically resulting in very low server costs.
Different stack options: Multiple serverless systems exist to fit stack needs, including numerous open source options.
1 Components of the System
Webhook / Event Pump System: Handles notifications to the middleware layer when a new file should be processed.
Middleware Layer: Handles communication between the data source and machine learning systems.
Metadata Layer: The storage facility for machine learning data responses.
Token Downscoping System: Allows you to pass tightly scoped read / write tokens through multiple uncontrolled system layers.
1 How a Data / ML System Works
Cloud Data: Data store & initial metadata
Serverless Framework: Callback handler and code execution
Machine Learning: Data processor and enhancer
Flow labels in the diagram: Webhook, Execute, Metadata, Callback
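The flow in this diagram can be sketched as a few plain functions. This is a minimal illustration, not the speaker's implementation: the event shape (`fileId`, `fileName`), the `fakeClassifier` stand-in for a machine learning service, and the in-memory `metadataStore` are all assumptions made for the example.

```javascript
// Hypothetical in-memory stand-in for the data store's metadata layer.
const metadataStore = new Map();

// Hypothetical stand-in for a machine learning service: labels a file by name.
const fakeClassifier = (file) => ({
  labels: file.fileName.endsWith('.mp3') ? ['audio'] : ['unknown']
});

// Middleware step: receives the webhook event, calls the ML system,
// and stores the ML response as metadata for the file.
const handleWebhook = (webhookEvent, classify) => {
  const file = { fileId: webhookEvent.fileId, fileName: webhookEvent.fileName };
  const insights = classify(file);
  metadataStore.set(file.fileId, insights);
  return insights;
};

// Simulate the data store firing a webhook for a newly uploaded file.
handleWebhook({ fileId: '123', fileName: 'call.mp3' }, fakeClassifier);
```

Passing the classifier in as an argument keeps the middleware independent of any one machine learning provider, which also makes it easy to swap in a stub during testing.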
1 Common Serverless Frameworks
AWS Lambda: https://aws.amazon.com/lambda/
Azure Functions: https://azure.microsoft.com/en-us/services/functions/
Google Cloud Functions: https://cloud.google.com/functions/
IronFunctions: https://github.com/iron-io/functions
OpenWhisk: https://openwhisk.apache.org/
Fission: https://fission.io/
Considerations
1. Your stack
2. Pricing / free use
3. Supported languages
4. Regional support
1 Machine Learning Frameworks
Audio / Video / Image: • [video] MS Video Indexer • [audio] Voicebase • [face] Hive AI • [image] Clarifai • [image] Google Vision • [mixed] IBM Watson • [moderation] MS Content Moderator • [face] Kairos • [audio] AT&T Speech • [image] Amazon Rekognition
Text Extraction: • [id] Acuant • [invoice] Rossum.AI • [contract] eBrevia • [lease] Leverton • [resume] Textkernel • [prediction] AmazonML • [analysis] Aylien • [classification] MonkeyLearn • [natural language] ApiAI • [sentiment] AlchemyText
Open Source: • TensorFlow • Keras • Scikit-learn • MS Cognitive Toolkit • Theano • Caffe • Torch • Accord.NET
2 Program Logic and Serverless Separation
Serverless function agnostic: The core logic of the function should be separate from the serverless requirements. Thin handlers / routers may be written on top of the core logic to maintain separation.
Service deployments: To allow for deployment amongst numerous serverless technologies, systems like serverless.com may be utilized.
Testability: The separation of concerns allows you to test the function separately from the container.
Handler: Separate handler from core program logic for testability.
AWS Lambda Handler
    // API Gateway handler
    exports.handler = (event, context, callback) => {
      // Check for a valid event
      if (isValidEvent(event)) {
        // processEvent is defined elsewhere and is expected to invoke callback
        processEvent(event, context, callback);
      } else {
        callback(null, { statusCode: 200, body: 'Event received but invalid' });
      }
    };
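To make the separation between handler and core logic concrete, here is a minimal sketch. The function names (`summarizeFile`, `lambdaHandler`) and the payload shape (`fileId`) are hypothetical and chosen only for illustration; the point is that the core function has no knowledge of Lambda, so it can be tested without a container.

```javascript
// Core logic: knows nothing about Lambda, API Gateway, or callbacks.
// (Hypothetical function; the payload shape is an assumption.)
const summarizeFile = (payload) => {
  if (!payload || typeof payload.fileId !== 'string') {
    return { ok: false, message: 'Missing fileId' };
  }
  return { ok: true, message: `Queued file ${payload.fileId}` };
};

// Thin AWS Lambda adapter: turns the event into a payload and the
// result into an API Gateway style response.
const lambdaHandler = (event, context, callback) => {
  let payload = null;
  try {
    payload = JSON.parse(event.body || 'null');
  } catch (err) {
    payload = null; // Malformed JSON is treated as an invalid payload.
  }
  const result = summarizeFile(payload);
  callback(null, { statusCode: result.ok ? 200 : 400, body: result.message });
};
```

A second adapter for another serverless platform could reuse `summarizeFile` unchanged, which is the practical benefit of keeping the handler thin.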
2 Dealing with Cold Starts
What is it: The latency experienced when a function is triggered and there isn’t a warm / idle container available. A container is automatically dropped after a period of inactivity.
Options: You can either keep the container warm through memory increases and calls, or deal with the cold start.
Fewer libraries: The more libraries that are used, the longer it will take to start the container.
Smaller functions: Writing smaller functions decreases start time.
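The sketch below illustrates two ways of living with cold starts: answering a scheduled warm-up call cheaply, and deferring expensive setup until it is first needed so that warm invocations reuse it. Lazy initialization is a common technique rather than something the slide names, and the `warmup` event flag and the stub client are assumptions made for the example.

```javascript
// Module scope is evaluated once per container, during the cold start.
let initCount = 0;
let client = null;

// Hypothetical expensive setup (for example loading an SDK or opening a
// connection). It runs on first use and is reused by warm invocations.
const getClient = () => {
  if (client === null) {
    initCount += 1;
    client = { send: (data) => `processed ${data}` };
  }
  return client;
};

const handler = (event, context, callback) => {
  // Hypothetical warm-up ping from a scheduler: exit early and cheaply.
  if (event && event.warmup === true) {
    return callback(null, { statusCode: 200, body: 'warm' });
  }
  const body = getClient().send(event.body);
  return callback(null, { statusCode: 200, body });
};
```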
2 Exit Callback Hygiene
Error logging: With many serverless environments, proper callback use will provide full data logging.
Reliability: Failing to exit properly can result in your function executing until a timeout is hit. Timeouts may also cause subsequent invocations to require a cold start, which results in additional latency.
Cost: If a timeout occurs, you will be charged for the entire timeout time.
Processing AWS Lambda Exit Callbacks
    // Success callback
    callback(null, { statusCode: 200, body: 'Event processed' });
    // Error callback
    callback({ statusCode: 400, body: 'Event error' });
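One way to make sure every code path exits through the callback is to wrap the work in a small helper. This is a sketch under assumptions: `withExit` is a hypothetical helper name, and it only covers synchronous work; asynchronous work would need the same guarantee applied to its promise or callback chain.

```javascript
// Wraps a synchronous unit of work so the Lambda callback is always
// called exactly once, whether the work succeeds or throws.
const withExit = (work) => (event, context, callback) => {
  let result;
  try {
    result = work(event);
  } catch (err) {
    // Error callback: the first argument marks the invocation as failed.
    return callback(err);
  }
  // Success callback: first argument is null, second is the response.
  return callback(null, { statusCode: 200, body: result });
};

// Hypothetical unit of work used for illustration.
const handler = withExit((event) => {
  if (!event.body) {
    throw new Error('Event error');
  }
  return 'Event processed';
});
```

The success callback is deliberately placed outside the `try` block, so an exception thrown by the callback itself cannot trigger a second call.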
2 Writing Stateless Single Purpose Functions
Error isolation: Debugging and error handling are easier with function / concern isolation.
Scaling: With monolithic functions, you have to optimize for all elements of the function, rather than for the specific functionality receiving the most calls / traffic.
Planning and testing: It’s easier to plan and write test plans for functions with singular concerns.
Valid Event Function
    /**
     * Check for a valid event.
     * @param {object} indexerEvent - indexer event
     * @return {boolean} - true if valid event
     */
    const isValidEvent = (indexerEvent) => {
      // Coerce to a boolean so the return value matches the documented type
      return Boolean(indexerEvent.body || indexerEvent.queryStringParameters);
    };
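In the same single-purpose style, a companion function might decode the event body. This is a sketch, not part of the original deck: `parseEventBody` is a hypothetical name, and it assumes an API Gateway proxy event, where `body` arrives as a string and `isBase64Encoded` signals base64 content.

```javascript
/**
 * Decode the body of an API Gateway proxy event.
 * @param {object} event - event with `body` and optional `isBase64Encoded`
 * @return {*} - parsed JSON value, or null if the body is absent or malformed
 */
const parseEventBody = (event) => {
  if (!event || typeof event.body !== 'string') {
    return null;
  }
  const raw = event.isBase64Encoded
    ? Buffer.from(event.body, 'base64').toString('utf8')
    : event.body;
  try {
    return JSON.parse(raw);
  } catch (err) {
    return null; // Malformed JSON is reported as "no usable body".
  }
};
```

Because the function has one concern and no side effects, its test plan is just a table of inputs and expected outputs.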
3 Security Considerations
Serverless use consideration: Are serverless systems a viable / approved mechanism within your organization?
Token exposure: Many API auth systems are token based, with broadly scoped tokens, which creates the potential for token leakage.
Credential exposure: With the use of numerous APIs, each with its own auth credentials, there is the potential for credential leakage.
Sensitive information exposure: Data is being passed through multiple systems and we have to be aware of how the information is used / stored.
3 Middleware System
Serverless Solution: All compute functionality is offloaded to the serverless framework.
On-prem Solution: All compute functionality (and connection to the ML system) is run off of existing internal servers.
3 Protecting Credentials
Use Secure Storage: Use a secure system to store API credentials or tokens, such as the AWS Systems Manager Parameter Store.
Least Privilege Principle: Functions requiring access to credentials should follow the least privilege principle, meaning they have access to only as much data as they absolutely need.
Separate Environment Credentials: Credentials used in a more open developer environment should not be the same as those used in a production deployment.
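A minimal sketch of how secure storage and per-environment credentials might fit together is shown below. The parameter naming scheme (`/<stage>/ml-pipeline/<key>`) and the function names are assumptions. The `ssm` argument is an injected client assumed to follow the AWS SDK for JavaScript v2 call shape, `ssm.getParameter(params).promise()`; injecting it keeps the sketch runnable without the SDK installed.

```javascript
// Builds a per-environment parameter name so development and production
// never share a credential. The naming scheme is an assumption.
const parameterName = (stage, key) => `/${stage}/ml-pipeline/${key}`;

// Cache resolved secrets at module scope so warm invocations skip the lookup.
const secretCache = new Map();

// `ssm` is an injected client, assumed to expose the AWS SDK v2 call shape:
// ssm.getParameter({ Name, WithDecryption }).promise()
const loadSecret = async (ssm, stage, key) => {
  const name = parameterName(stage, key);
  if (!secretCache.has(name)) {
    const res = await ssm
      .getParameter({ Name: name, WithDecryption: true })
      .promise();
    secretCache.set(name, res.Parameter.Value);
  }
  return secretCache.get(name);
};
```

For least privilege, the function's execution role would be granted read access only to the parameter path for its own stage, so a development function cannot read production secrets even if it asks for them.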
3 Token Downscoping
Access Token: Fully scoped access token
Downscoped Token: Tightly scoped child token
Channel Transmission: Transmit through uncontrolled channels
3 Token Downscoping Components
Tightly scoped for single file: A token should only be scoped for the item needed for processing, such as a file.
Short lived: Downscoped tokens should only live for their natural useful time (e.g. 1 hour).
Revocable: Downscoped tokens may be revoked before natural expiration through the API.
Split read / write functions: To further scope token exposure, separate read / write tokens can be issued.
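As a sketch of what a downscoping request could look like, the function below builds the form body for an OAuth 2.0 token exchange using the parameter names from RFC 8693. The scope names, the `resource` URL format, and the idea of posting this body to a provider's token endpoint (for Box, assumed to be `https://api.box.com/oauth2/token`) are assumptions for illustration and should be checked against the provider's documentation.

```javascript
// Scope names are assumptions for illustration; check the provider's scope list.
const SCOPES = { read: 'item_download', write: 'item_upload' };

// Builds the form body for an OAuth 2.0 token exchange (RFC 8693) that trades
// a fully scoped access token for one limited to a single file and action.
const buildDownscopeBody = ({ accessToken, fileId, mode }) => {
  const scope = SCOPES[mode];
  if (typeof scope !== 'string') {
    throw new Error(`Unknown mode: ${mode}`);
  }
  const body = new URLSearchParams({
    grant_type: 'urn:ietf:params:oauth:grant-type:token-exchange',
    subject_token: accessToken,
    subject_token_type: 'urn:ietf:params:oauth:token-type:access_token',
    scope,
    // Assumed resource URL format identifying the single file.
    resource: `https://api.box.com/2.0/files/${fileId}`
  });
  return body.toString();
};
```

Issuing one body with `mode: 'read'` for the machine learning system and a separate one with `mode: 'write'` for the metadata update mirrors the split read / write idea from the slide.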
3 Sensitive Information Exposure
Data in the files: What information is being transmitted through the channels in the files, and is it sensitive information?
Are channels secure: Are all connections between your systems, the serverless framework, and the machine learning system secure?
How the ML system handles data: Does the machine learning system store any data long-term, and how secure is that storage?
Logging sensitive information: Are you logging sensitive information during general program flow unintentionally?
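One way to reduce the risk of unintentionally logging sensitive information is to redact known fields before anything reaches the log. The sketch below is illustrative only: the list of sensitive keys and the helper names (`redact`, `safeLog`) are assumptions, and key-based redaction will not catch sensitive values stored under unexpected names.

```javascript
// Keys treated as sensitive; this list is an assumption and should be
// extended for the data you actually handle.
const SENSITIVE_KEYS = new Set(['token', 'access_token', 'authorization', 'password']);

// Returns a deep copy of `value` with sensitive fields replaced, so the
// original object is never mutated and never logged verbatim.
const redact = (value) => {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item));
  }
  if (value !== null && typeof value === 'object') {
    const copy = {};
    for (const [key, inner] of Object.entries(value)) {
      copy[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : redact(inner);
    }
    return copy;
  }
  return value;
};

// Log helper that only ever writes the redacted copy.
const safeLog = (label, data) => console.log(label, JSON.stringify(redact(data)));
```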
3 Tokenisation Specification
Data Request: Sensitive information request
Cloud Data API: Data hosting service API
Secure Data Vault: Secure vault hosting data files
Flow labels in the diagram: 1. PAN, 2. PAN, 3. Token / Status, 4. Token / Status
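The tokenisation flow can be sketched with a toy vault that swaps a sensitive value, such as a PAN (primary account number), for a random token. This is an assumption-heavy illustration: the reading of the numbered steps (value in, token and status out), the `createVault` name, and the in-memory storage are all hypothetical, and a real vault would be a hardened, access controlled service.

```javascript
const nodeCrypto = require('crypto');

// Hypothetical in-memory vault, for illustration only.
const createVault = () => {
  const tokenToValue = new Map();
  return {
    // The sensitive value goes in; a random token and a status come back.
    tokenize: (value) => {
      const token = nodeCrypto.randomBytes(16).toString('hex');
      tokenToValue.set(token, value);
      return { token, status: 'stored' };
    },
    // Only the vault can map a token back to the original value.
    detokenize: (token) =>
      (tokenToValue.has(token) ? tokenToValue.get(token) : null)
  };
};
```

Because the token is random rather than derived from the value, systems outside the vault that see only the token learn nothing about the original data.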
Wrap-up Topics
Building Blocks: How are these systems built?
Best Practices: How do we architect the solution?
Security Considerations: How do we ensure data security?
Better Data with Machine Learning and Serverless
Slides: http://bit.ly/ato-bdml