Efficient Deployment of Predictive Analytics through Open Standards and Cloud Computing. ACM SIGKDD Explorations, Volume 11, Issue 1, July 2009. Presenter: 黃啟智, Student ID: 69821503. Outline: Introduction, Interoperability and Open Standards, Putting Models to Work, Performance, Conclusion.
Efficient Deployment of Predictive Analytics through Open Standards and Cloud Computing • ACM SIGKDD Explorations, Volume 11, Issue 1, July 2009 • Presenter: 黃啟智 • Student ID: 69821503
Outline • Introduction • Interoperability and Open Standards • Putting Models to Work • Performance • Conclusion
Introduction • Deployment and practical application of predictive models: • Limited choice of options • Integrating and deploying a model often takes months (time-consuming) • Requires custom coding or a proprietary process (costly) • Open standards and Internet-based technologies are available to provide a more effective end-to-end solution for deployment.
Introduction • SOA: Service-Oriented Architecture • For the design of loosely coupled IT systems (e.g. based on Web Services) • SaaS: Software-as-a-Service • A license model • Vendors deliver software solutions as a cost-effective service • PMML: Predictive Model Markup Language • An open standard that allows users to exchange predictive models among various software tools
Interoperability and Open Standards • Cloud Computing [Diagram: Cloud Computing (a computing architecture) with SaaS, IaaS, PaaS; Web Services; RPC, SOAP (access); SOA, WSDL, REST, UDDI (SOA-related standards)]
Interoperability and Open Standards • Cloud Computing • Reduces cost and management overhead for IT • A shift in the geography of computation • The Internet as a platform • A set of services that provide computing resources • A variety of services: storage capacity, processing power, business applications… • Cloud infrastructures: Amazon Web Services (AWS), Sector/Sphere, Hadoop, … • The OCC, Open Cloud Consortium (www.opencloudconsortium.org)
Interoperability and Open Standards • Web Service (source: http://zh.wikipedia.org) • W3C definition • Provides the foundation of SOA • Uses XML to encode and decode data • Uses the SOAP (Simple Object Access Protocol) standard to transport data • Data can be exchanged easily between different applications and platforms • Can be described by a WSDL (Web Service Description Language) file • UDDI (Universal Description, Discovery, and Integration): a platform-independent, XML-based registry for businesses to list themselves on the Internet
Interoperability and Open Standards • A SOAP request for a PMML file: a JDM (Java Data Mining) call (the file/model was previously uploaded to the service provider)
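The request shown on this slide is not recoverable from the transcript, so the following is only a minimal sketch of what wrapping a scoring call in a SOAP 1.1 envelope can look like. The service namespace `urn:example:scoring`, the operation name `apply`, and the element names `modelName`, `record`, and `field` are all hypothetical; they are not the JDM or ADAPA message format.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SVC_NS = "urn:example:scoring"  # hypothetical service namespace (assumption)


def build_score_request(model_name, record):
    """Return a SOAP 1.1 envelope (as a string) asking a service to score one record."""
    ET.register_namespace("soapenv", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # "apply", "modelName", "record" and "field" are illustrative names only.
    call = ET.SubElement(body, f"{{{SVC_NS}}}apply")
    ET.SubElement(call, f"{{{SVC_NS}}}modelName").text = model_name
    rec = ET.SubElement(call, f"{{{SVC_NS}}}record")
    for name, value in record.items():
        field = ET.SubElement(rec, f"{{{SVC_NS}}}field", {"name": name})
        field.text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# The model is referenced by name because it was uploaded beforehand.
request_xml = build_score_request("iris_tree", {"sepal_length": 5.1, "petal_width": 0.2})
print(request_xml)
```

The envelope carries only the model name and the input values, which matches the slide's point that the PMML file itself is already stored with the service provider.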
Interoperability and Open Standards • SaaS: Software as a Service • A license model: users access software via the Internet (they do not actually "buy and install" it) • Users pay only for the right to use the software for a certain period (e.g. NT$100 for an hour) • No upfront costs for setting up servers or software • Minimizes the risk of purchasing costly software that may not provide an adequate return on investment • E.g. Salesforce.com, Google Apps
Interoperability and Open Standards • PMML: Predictive Model Markup Language • Developed by the Data Mining Group (www.dmg.org) • An open standard for representing data mining models • An XML-based language • Can describe data preprocessing and predictive algorithms • Can represent input data and data transformations
Interoperability and Open Standards • PMML structure example (a test data file): required (active) data fields and the predicted data field [Figure: annotated PMML listing]
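Since the original listing is missing, here is a small sketch of how the active and predicted fields of a PMML document can be read with Python's standard library. The embedded PMML is a deliberately incomplete, made-up fragment (no Header and no model body), and the field names `age`, `income`, and `risk` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

PMML_NS = {"p": "http://www.dmg.org/PMML-4_0"}  # PMML 4.0 namespace

# Made-up, incomplete fragment: just enough structure to show the field roles.
SAMPLE_PMML = """<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="categorical" dataType="string"/>
  </DataDictionary>
  <RegressionModel functionName="classification">
    <MiningSchema>
      <MiningField name="age" usageType="active"/>
      <MiningField name="income" usageType="active"/>
      <MiningField name="risk" usageType="predicted"/>
    </MiningSchema>
  </RegressionModel>
</PMML>"""


def split_mining_fields(pmml_text):
    """Return (active_field_names, predicted_field_names) from a PMML string."""
    root = ET.fromstring(pmml_text)
    active, predicted = [], []
    for field in root.iterfind(".//p:MiningSchema/p:MiningField", PMML_NS):
        # A MiningField without usageType is treated as an active input.
        usage = field.get("usageType", "active")
        if usage == "predicted":
            predicted.append(field.get("name"))
        elif usage == "active":
            active.append(field.get("name"))
    return active, predicted


active_fields, predicted_fields = split_mining_fields(SAMPLE_PMML)
print("active:", active_fields)
print("predicted:", predicted_fields)
```

A scoring client could use the active list to check that an input record supplies every required field before sending it.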
Interoperability and Open Standards PMML Structure examples
Interoperability and Open Standards • PMML structure example: an array of counts of the different field values under the different class labels
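Counts of field values per class label are what a Naïve Bayes model needs for scoring. The sketch below shows that arithmetic on invented counts held in a plain Python dictionary rather than parsed from PMML; the field names, the values, and the `threshold` floor used for zero counts are assumptions made for this example.

```python
# Hypothetical counts, indexed as pair_counts[field][value][class_label].
pair_counts = {
    "outlook": {
        "sunny": {"yes": 2, "no": 3},
        "rainy": {"yes": 3, "no": 2},
        "overcast": {"yes": 4, "no": 0},
    },
    "windy": {
        "true": {"yes": 3, "no": 3},
        "false": {"yes": 6, "no": 2},
    },
}
class_counts = {"yes": 9, "no": 5}  # number of training records per class label


def naive_bayes_posterior(record, pair_counts, class_counts, threshold=1e-3):
    """Return P(label | record) for each label, computed from raw counts."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(label)
        for field, value in record.items():
            count = pair_counts[field][value][label]
            # Zero counts fall back to a small floor so the product is not wiped out.
            score *= count / n_label if count > 0 else threshold
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}


posterior = naive_bayes_posterior(
    {"outlook": "sunny", "windy": "true"}, pair_counts, class_counts
)
print(posterior)
```

Because the model is fully described by these counts, any engine that reads the same counts and applies the same formula should produce the same scores.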
Interoperability and Open Standards • PMML • Model specifics (parameters, architecture) are defined under different model elements, including: • Neural Networks • Support Vector Machines • Regression Models • Decision Trees • Association Rules • Clustering • Sequences • Naïve Bayes • Text Models • Rules
Interoperability and Open Standards • PMML On-The-Go • PMML 4.0: time series, Boolean data types, model segmentation, lift/gain charts, expanded range of built-in functions… • More applications support PMML export and import • Open-source environments: KNIME (www.knime.org), the R Project (www.R-project.org)
Putting Models to Work • Amazon EC2 (Elastic Compute Cloud) • Powered by Amazon Web Services • ADAPA scoring engine • Uses JDM (Java Data Mining) Web Service calls and therefore allows automatic decisions to be virtually embedded into enterprise systems and applications • Available as a service to minimize total cost
Putting Models to Work • Model Verification and Execution • Typical tasks in the life cycle of a data mining project: building, deploying, testing, and using data mining models (in a cross-platform, multi-vendor environment)
Putting Models to Work • Model Verification and Execution • Model testing/verification • Ensures that the scoring engine and the model development environment produce exactly the same results • A test file can be uploaded for score matching; it may contain any number of records, each with all the necessary input variables and the expected result
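Score matching can be sketched locally as follows. This is not the engine's own procedure: `toy_model` stands in for a deployed model, the column name `expected` is an assumption, and a small absolute tolerance replaces strict equality because floating-point results from two environments rarely match bit for bit.

```python
import csv
import io
import math


def verify_scores(csv_text, score_fn, expected_column="expected", tol=1e-6):
    """Return (row_number, expected, actual) for every record whose score mismatches."""
    mismatches = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        expected = float(row.pop(expected_column))
        inputs = {name: float(value) for name, value in row.items()}
        actual = score_fn(inputs)
        if not math.isclose(expected, actual, rel_tol=0.0, abs_tol=tol):
            mismatches.append((row_number, expected, actual))
    return mismatches


def toy_model(inputs):
    # Stand-in for the deployed model: a fixed linear formula (assumption).
    return 1.0 + 2.0 * inputs["x1"] - 0.5 * inputs["x2"]


# Test file: input variables plus the expected result for each record.
TEST_FILE = """x1,x2,expected
1.0,2.0,2.0
0.0,4.0,-1.0
3.0,0.0,6.5
"""

report = verify_scores(TEST_FILE, toy_model)
print(report)  # the third record is deliberately wrong: expected 6.5, model gives 7.0
```

An empty report means every record matched, which is the condition the slide describes for trusting the deployed model.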
Putting Models to Work • Model Verification and Execution • Model execution • Batch mode: via the web console, by uploading a data file containing records (in CSV format, optionally zipped) • Real-time mode: via Web Services, using embedded calls (SOAP requests) to the instance
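For batch mode, the receiving side has to accept either a plain CSV file or a zipped one. The following sketch shows one way to handle both with the Python standard library; the function name `read_records` and the choice to read only the first CSV in an archive are assumptions, not a description of the actual web console.

```python
import csv
import io
import zipfile


def read_records(payload: bytes):
    """Return rows (as dicts) from raw CSV bytes or from the first CSV inside a ZIP archive."""
    buffer = io.BytesIO(payload)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
            if not csv_names:
                raise ValueError("archive contains no CSV file")
            text = archive.read(csv_names[0]).decode("utf-8")
    else:
        text = payload.decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# Build the same small data file twice: once plain, once zipped in memory.
plain = b"x1,x2\n1,2\n3,4\n"
zipped_buffer = io.BytesIO()
with zipfile.ZipFile(zipped_buffer, "w") as archive:
    archive.writestr("records.csv", plain)
zipped = zipped_buffer.getvalue()

rows_plain = read_records(plain)
rows_zipped = read_records(zipped)
print(rows_plain == rows_zipped)
```

Each returned row could then be passed to a scoring function, one record at a time.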
Putting Models to Work • Demo: Excel add-in
Putting Models to Work • Security on the Cloud • Uploading proprietary information to a third-party service raises security and control questions • The engine should not store any data • An instance shares nothing with other instances • An instance is private (via authentication) • An instance is accessible only via HTTPS • Models and data are deleted after an instance is terminated
Performance • Instance type reference: http://aws.amazon.com/ec2/
Conclusion • Cloud computing: offers a powerful and revolutionary way to put data mining models to work. • Open standard (PMML): makes predictive models easy to access from anywhere in the enterprise (via web-service calls or by uploading data files). • The combination of the two accelerates the deployment of predictive models and makes it more affordable.
Questions • Security (transmission over the Internet to a third-party vendor) and privacy • High dimensionality / large databases: transmission time + processing time