Explore the opportunities and challenges of processing and analyzing large log data from search engines, including web-browsing, social network communications, and sensor data, to extract useful information about user behavior. This case study focuses on a search engine company and includes data mining, machine learning, process mining, and analytic applications. The use of cloud computing and distributed file systems is also discussed.
Processing and Analyzing Large Logs from a Search Engine
Meng Dou 13/9/2012
Big data
• Web-browsing data, social network communications, sensor data -> behavior data
• Google and Facebook, for example, are Big Data companies.
Opportunities
• Behavior data (such as web logs) can be used to improve and support business processes.
• Data mining, process mining, and so on
Challenges
• Big data processing
• Extracting useful information that reflects user behavior from massive logs
• Instance data management
• Data analysis
• Analytic applications: BI/Reporting, Data Mining, Machine Learning, Process Mining
• Big Data processing: Cloud computing (Map/Reduce framework)
• Big Data access: Hive, NoSQL
• Instance data
• Storage: Distributed File System (HDFS), Cloud Storage, key-value database (HBase, Cassandra, MongoDB)
• Raw data: unstructured data
Case study: Search Engine Company
• News, Page, Image, Maps, Music, Navigation
• Dataset:
  • 66 million clicks in one month, 2.2 million clicks per day
  • -> generate behavior in 10 minutes
• User behavior:
  • Visiting path (Referer)
  • Search result effectiveness
  • Ad clicking behavior
  • Source and destination of user visits
  • Robot behavior recognition and analysis
  • Visiting page layout
  • Behavior comparison and product improvement
  • User grouping and recommendation
Data features
• The log contains a massive amount of information in a well-recorded format
• Large scale, with high growth potential
• Real-time analysis
Existing tools
• Data extraction: XESame, ProM Import
• Process mining: ProM
• Because the data set is large, analysis is slow and in most cases the tool crashes
• Offline analysis -> real-time analysis
• Extracting data from the cloud: cloud storage / non-relational DB -> instance data (XES)
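The last step above, turning records from cloud storage or a non-relational store into instance data in XES, could look roughly like the sketch below. It is a minimal illustration, not the converter used in this project: the input shape (a dict of case id to `(activity, unix_ts)` pairs) and the function name `to_xes` are assumptions, and the extension declarations that a complete XES file normally carries are left out.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def to_xes(traces):
    """Serialize traces to a bare-bones XES string.

    traces: dict mapping a case id to a list of (activity, unix_ts) pairs.
    This input shape is an assumption made for the example.
    """
    log = ET.Element("log", {"xes.version": "1.0"})
    for case_id, events in traces.items():
        trace = ET.SubElement(log, "trace")
        # concept:name on the trace identifies the case
        ET.SubElement(trace, "string",
                      {"key": "concept:name", "value": str(case_id)})
        for activity, ts in events:
            event = ET.SubElement(trace, "event")
            # concept:name on the event is the activity name
            ET.SubElement(event, "string",
                          {"key": "concept:name", "value": activity})
            stamp = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
            ET.SubElement(event, "date",
                          {"key": "time:timestamp", "value": stamp})
    return ET.tostring(log, encoding="unicode")


if __name__ == "__main__":
    print(to_xes({"visitor-1": [("search", 0), ("click", 60)]}))
```

Building the whole tree in memory is fine for a sample; for a month of clicks a streaming writer would likely be needed.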
System Structure
• Log processing
• Extracting useful information that reflects user behavior from massive logs
• Understandable model
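As an illustration of the log-processing step, the sketch below imitates a Map/Reduce job in plain Python: the map phase emits one record per click keyed by visitor, and the reduce phase orders each visitor's clicks by time. The tab-separated log layout and all function names are assumptions for the example; the real log format is not shown in these slides, and a production job would run on the Map/Reduce framework rather than in a single process.

```python
from collections import defaultdict

# Hypothetical log layout (assumption): timestamp, ip, user_agent, url
SAMPLE_LOG = [
    "1347526800\t10.0.0.1\tFirefox\t/search",
    "1347526860\t10.0.0.2\tChrome\t/news",
    "1347526830\t10.0.0.1\tFirefox\t/image",
]


def map_phase(lines):
    """Emit ((ip, user_agent), (timestamp, url)) for every log line."""
    for line in lines:
        ts, ip, ua, url = line.rstrip("\n").split("\t")
        yield (ip, ua), (int(ts), url)


def shuffle(pairs):
    """Group mapped values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Order one visitor's clicks by timestamp and keep only the URLs."""
    return key, [url for _, url in sorted(values)]


def run(lines):
    grouped = shuffle(map_phase(lines))
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    for visitor, path in run(SAMPLE_LOG).items():
        print(visitor, path)
```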
• CPU: Intel Xeon 2.40 GHz
• RAM: 2 GB
• 14 nodes
Process Discovery
• One instance (case) is defined as a single visit by one visitor, identified by:
  • IP+UA
  • CookieID
• The definition of an activity varies with the requirements
• Alpha miner
• Heuristic miner
• Fuzzy miner
• Sequence model
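The case definition above can be sketched as follows: events that share a visitor key (IP+UA or CookieID) are split into separate visits whenever the visitor is inactive for too long, and the resulting traces are summarized as directly-follows counts, the relation that the Alpha miner and Heuristic miner build on. The 30-minute inactivity threshold and the function names are assumptions for illustration, not values from this project.

```python
from collections import Counter

# Assumed inactivity timeout in seconds; not taken from the slides.
SESSION_GAP = 30 * 60


def split_visits(events, gap=SESSION_GAP):
    """Split one visitor's events into visits (cases).

    events: list of (unix_ts, activity) pairs for a single visitor key.
    A new visit starts when the gap to the previous event exceeds `gap`.
    """
    visits, current, last_ts = [], [], None
    for ts, activity in sorted(events):
        if last_ts is not None and ts - last_ts > gap:
            visits.append(current)
            current = []
        current.append(activity)
        last_ts = ts
    if current:
        visits.append(current)
    return visits


def directly_follows(traces):
    """Count how often activity a is immediately followed by activity b."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


if __name__ == "__main__":
    events = [(0, "search"), (60, "result"), (10000, "search"), (10030, "news")]
    visits = split_visits(events)
    print(visits)
    print(directly_follows(visits))
```

Keeping case identification separate from the mining step makes it easy to swap the visitor key or the timeout when the requirements change.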
Conclusion
• This is a good project for getting into the data analysis field, combining web data analysis, process mining, and cloud computing technology.
• Future work:
  • 1 More algorithms and technologies should be applied to this data set.
  • 2 Behavior comparison and user recommendation still need to be accomplished.
  • 3 Can process mining analyze behavior that does not follow a fixed pattern?
  • 1 Log sampling
  • 2 Detect errors in the logs before applying analysis techniques to them.
  • 3 Extend the function for “converting data from a key-value database or cloud storage to an event log” in ProM or XESame.
Feedback
• 1 What is the real question?
• 2 Why process mining?
Thank you! Meng Dou 13/9/2012