Data Mining on the Web via Cloud Computing

COMS E6125 Web Enhanced Information Management
Presented by Hemanth Murthy

Data Mining on the Web via Cloud Computing
• Introduction to:
  • Web Mining
  • Cloud computing infrastructure
  • Apache's Hadoop
• Web Usage Mining using Hadoop HDFS and Map/Reduce technologies

What is Web Mining…
• What is Web Mining: data mining techniques applied to the Web in order to
  • discover what users are looking for on the Internet,
  • deduce the type of information users are looking for,
  • structure the data available on the Web.
• Why Web Mining:
  • The amount of information available on the Web is enormous.
  • It is difficult for users to find and use the information they need.
  • It is not easy for content providers to classify and catalog documents.

Types of Web Mining
• Web mining types:
  • Web usage mining
  • Web content mining
  • Web structure mining
• Web usage mining applies data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications.
• Web content mining is the automatic search of information available online; it mines the content of Web data.
• Web structure mining is concerned with the description and organization of the content, typically the link structure that connects pages.

More on Web Usage Mining…
• Preprocessing
  • Converts the usage, content, and structure information in the available data sources into a form suitable for pattern discovery.
  • Regarded as the most difficult task in Web Usage Mining.
• Pattern discovery
  • Uses algorithms and techniques from data mining, machine learning, statistics, and pattern recognition.
• Pattern analysis
  • The discovery phase produces many redundant rules and patterns.
  • The main objective is to filter these out so that the analysis can focus on the useful ones.
  • Typical tools: SQL queries and visualization techniques such as graphing patterns.
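The preprocessing step can be illustrated with a minimal Python sketch. It is not taken from the presentation. It assumes the usage data is a Web server access log in Common Log Format, that only successful requests (status 200) are kept, and that a gap of more than 30 minutes between requests from the same host starts a new session. The function names, the timeout, and the sample log lines are all hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Common Log Format: host ident user [timestamp] "METHOD path PROTOCOL" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

# Assumed heuristic: a gap longer than this starts a new session.
SESSION_TIMEOUT = timedelta(minutes=30)


def parse_line(line):
    """Return (host, timestamp, path, status), or None if the line is malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    timestamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return match.group("host"), timestamp, match.group("path"), int(match.group("status"))


def sessionize(lines):
    """Clean the log and group each host's page views into sessions."""
    hits_by_host = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # drop malformed lines
        host, timestamp, path, status = record
        if status != 200:
            continue  # data cleaning: keep successful requests only
        hits_by_host[host].append((timestamp, path))

    sessions = []
    for host, hits in hits_by_host.items():
        hits.sort()  # order each host's requests by time
        current, last_seen = [], None
        for timestamp, path in hits:
            if last_seen is not None and timestamp - last_seen > SESSION_TIMEOUT:
                sessions.append((host, current))
                current = []
            current.append(path)
            last_seen = timestamp
        if current:
            sessions.append((host, current))
    return sessions


# Hypothetical sample data for illustration.
SAMPLE = [
    '10.0.0.1 - - [10/Oct/2009:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2009:13:57:10 -0700] "GET /products.html HTTP/1.0" 200 1045',
    '10.0.0.1 - - [10/Oct/2009:15:20:00 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2009:13:58:00 -0700] "GET /missing.html HTTP/1.0" 404 512',
]

if __name__ == "__main__":
    for host, pages in sessionize(SAMPLE):
        print(host, pages)
```

Identifying users by host alone is a simplification. Real logs often need additional signals such as cookies or user agents, which is part of why preprocessing is considered the hardest step.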

Cloud Computing
• Use of existing commodity resources:
  • reduces the cost of services,
  • helps teams concentrate on deploying services faster,
  • gives more flexibility.
• Virtualization is used as the standard deployment technique:
  • provides an abstraction between the hardware and the computing software,
  • enables loose coupling of resources.
• Services are delivered over the network.

HDFS - Hadoop Distributed File System
• Data-parallel but process-sequential.
• Data is processed in a batch-oriented fashion.
• Data is communicated through the distributed file system, so latency is an issue; HDFS is designed for high throughput rather than low latency.
• At Facebook, jobs that took more than a day were cut to less than a day by using Hadoop.

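Important characteristics of HDFS…
• Hardware Failure: failure is treated as the norm rather than the exception, so fault detection and automatic recovery are core goals.
• Streaming Data Access: designed for batch processing with high throughput rather than interactive, low-latency access.
• Large Data Sets: files are typically gigabytes to terabytes in size.
• Moving Computation is Cheaper than Moving Data: computation is scheduled close to where the data is stored.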

Web Mining, HDFS and Map/Reduce
• HDFS can serve as the storage backbone for Web mining applications.
• HDFS replicates data on several nodes in the cluster, which provides robustness and allows data recovery when a node fails.
• Map/Reduce: a framework for distributed computing, that is, the compute cloud.
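To make the Map/Reduce model concrete, here is a small single-process Python sketch of its three phases (map, shuffle, reduce) applied to a Web usage question: how many times was each page requested? This is not Hadoop's API. The function names and the `(host, page)` record layout are assumptions made for illustration, and a real job would run the map and reduce functions in parallel over data stored in HDFS.

```python
from collections import defaultdict


def map_hit(record):
    """Map: emit one (page, 1) pair for a log record given as (host, page)."""
    _host, page = record
    yield page, 1


def reduce_hits(page, counts):
    """Reduce: add up all the counts emitted for one page."""
    return page, sum(counts)


def run_map_reduce(records, mapper, reducer):
    """Run mapper and reducer in one process, with an in-memory shuffle."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Hypothetical records for illustration.
RECORDS = [
    ("10.0.0.1", "/index.html"),
    ("10.0.0.2", "/index.html"),
    ("10.0.0.1", "/products.html"),
]

if __name__ == "__main__":
    print(run_map_reduce(RECORDS, map_hit, reduce_hits))
```

Because each map call looks at a single record and each reduce call looks at a single key, the framework is free to spread both phases across many nodes.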

Web Mining & HIVE
• Hive was developed by the Facebook Data Infrastructure Team to exploit the features of Hadoop HDFS and Map/Reduce.
• It is a next-generation infrastructure designed to provide data processing systems that:
  • enable easy data summarization,
  • support ad-hoc querying and analysis of large volumes of data.
• It allows users to embed custom map/reduce functions.
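One way Hive lets users plug in custom map/reduce logic is its `TRANSFORM ... USING` clause, which streams rows to an external script as tab-separated lines on standard input and reads rows back from standard output. The Python script below is a hedged sketch of such a custom mapper. The table name `access_log`, its columns `host` and `url`, the file name `section_mapper.py`, and the notion of a "section" (the first path segment) are all hypothetical.

```python
import sys
from urllib.parse import urlsplit

# Hypothetical HiveQL that would invoke this script (table and columns are assumed):
#   ADD FILE section_mapper.py;
#   SELECT TRANSFORM (host, url)
#          USING 'python section_mapper.py'
#          AS (host, section)
#   FROM access_log;


def transform_row(line):
    """Turn one tab-separated (host, url) row into a (host, section) row."""
    host, url = line.rstrip("\n").split("\t", 1)
    path = urlsplit(url).path
    segments = [segment for segment in path.split("/") if segment]
    section = segments[0] if segments else "(root)"
    return f"{host}\t{section}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from stdin and write transformed rows to stdout."""
    for line in stdin:
        if line.strip():  # skip blank lines
            stdout.write(transform_row(line) + "\n")


if __name__ == "__main__":
    main()
```

The script assumes every input row has exactly two tab-separated fields. A row without a tab raises an error, which a production script would need to handle.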

Web Usage Mining Architecture using HDFS, Map/Reduce and HIVE
• How Apache Hadoop can be used in Web Usage Mining:
  • HDFS serves as the Storage Cloud.
  • The Map/Reduce framework serves as the Compute Cloud.
  • Hive can be used to format the data.

Web Usage Mining Architecture

References
• HDFS: http://hadoop.apache.org/hdfs
• Map/Reduce: http://hadoop.apache.org/mapreduce
• Web Mining: Information and Pattern Discovery on the World Wide Web: http://maya.cs.depaul.edu/~mobasher/webminer/survey/survey.html
• Ashish Thusoo: Hive - A Petabyte Scale Data Warehouse using Hadoop: http://www.facebook.com/note.php?note_id=89508453919
• Dhruba Borthakur: Hadoop Introduction: http://hadoop.apache.org/common/docs/r0.18.3/hdfs_design.html#Introduction
• Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan: Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data

Thank You!
