Specialized Techniques. Chapter 11. 11.1 Time-Series Analysis. Time-series Problems: Prediction applications with one or more time-dependent attributes. An Example with Linear Regression. The Stock Index Dataset. Linear Regression Equations for the Stock Index Dataset.
Specialized Techniques Chapter 11
11.1 Time-Series Analysis Time-series Problems: Prediction applications with one or more time-dependent attributes.
An Example with Linear Regression The Stock Index Dataset
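One way to pose a time-series prediction as a linear regression problem is to use the previous period's value as the single input attribute. The sketch below assumes that setup. The index values are made up for illustration and are not the Stock Index Dataset, and `lagged_pairs` and `fit_line` are hypothetical helper names.

```python
def lagged_pairs(series, lag=1):
    """Pair each value with the value `lag` steps earlier: (earlier, current)."""
    return [(series[i - lag], series[i]) for i in range(lag, len(series))]


def fit_line(pairs):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical weekly index values (illustration only).
index_values = [100.0, 102.0, 104.0, 106.0, 108.0]
slope, intercept = fit_line(lagged_pairs(index_values))

# Predict the next period from the most recent observed value.
next_value = slope * index_values[-1] + intercept
```

With real data, more lagged attributes can be added as further inputs; a single lag is used here only to keep the arithmetic visible.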
General Considerations • Test and modify created models as new data becomes available. • Try one or more data transformations if results are less than optimal. • Exercise caution when predicting future outcomes with training data that has several predicted fields. • Try a nonlinear model if a linear model gives poor results. • Use unsupervised clustering to determine whether the input attribute values allow the output attribute to cluster into meaningful categories.
Web-Based Mining(identifying the goal) • Decrease the average number of pages visited by a customer before a purchase transaction. • Increase the average number of pages viewed per user session. • Increase Web server efficiency • Personalize Web pages for customers • Determine those products that tend to be purchased or viewed together • Decrease the total number of item returns • Increase visitor retention rates
Web-Based Mining (preparing the data) • Data is stored in Web server log files, typically in the form of clickstream sequences • Server log files provide information in extended common log file format
Extended Common Log File Format • Host Address • Date/Time • Request • Status • Bytes • Referring Page • Browser Type
Extended Common Log File Format
80.202.8.93 - - [16/Apr/2002:22:43:28 -0600] "GET /grbts/images/msu-new-color.gif HTTP/1.1" 200 5006 "http://grb.mnsu.edu/doc/index.html" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 2000) Opera 6.01 [nb]"
134.29.41.219 - - [17/Apr/2002:19:23:30 -0600] "GET /resin-doc/images/resin_powered.gif HTTP/1.1" 200 571 "http://grb.mnsu.edu/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Q312461)"
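The seven fields listed earlier can be pulled out of a log entry with a regular expression. This is a minimal sketch that assumes entries follow the layout of the examples above. The group names and the `parse_log_entry` helper are illustrative choices, not part of any log analysis tool.

```python
import re

# One named group per field: host, date/time, request, status, bytes,
# referring page, browser type. The two "-" placeholders are skipped.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<browser>[^"]*)"'
)


def parse_log_entry(line):
    """Return a dict of the seven fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


entry = parse_log_entry(
    '134.29.41.219 - - [17/Apr/2002:19:23:30 -0600] '
    '"GET /resin-doc/images/resin_powered.gif HTTP/1.1" 200 571 '
    '"http://grb.mnsu.edu/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Q312461)"'
)
```

Returning `None` for non-matching lines lets a caller skip malformed entries instead of failing on them.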
Preparing the Data(the session file) A session file is a file created by the data preparation process. Each instance of a session file represents a single user session.
Preparing the Data(the session file) A user session is a set of pageviews requested by a single user from a single Web server. A pageview contains one or more page files each forming a display window in a Web browser. Each pageview is tagged with a unique uniform resource identifier (URI).
Preparing the Data(the session file) • Creating the session file is difficult • Identify individual users in a log file • Host addresses are of limited help • Host address combined with referring page is beneficial • One user page request may generate multiple log file entries from several types of servers • Easiest when sites are allowed to use cookies
Web-Based Mining(mining the data) • Traditional techniques such as association rule generators or clustering methods can be applied. • Sequence miners, which are special data mining algorithms used to discover frequently accessed Web pages that occur in the same order, are often used.
Web-Based Mining (evaluating results) • Consider four hypothetical pageview instances:
P5 P4 P10 P3 P15 P2 P1
P2 P4 P10 P8 P15 P4 P15 P1
P4 P3 P7 P11 P14 P8 P2 P10
P1 P3 P10 P11 P4 P15 P9
Evaluating Results (association rules) • An association rule generator outputs the following rule from our session data: IF P4 & P10 THEN P15 {3/4} • This rule states that P4, P10, and P15 appear together in three session instances, and that four instances have P4 and P10 appearing in the same session instance. The rule is therefore correct in three of the four instances it covers.
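The {3/4} figure can be checked with a short sketch. The boundary between the third and fourth session instances is an assumption read from the flattened listing; any split that keeps P4 and P10 in both instances gives the same counts. `rule_confidence` is a hypothetical helper name.

```python
sessions = [
    ["P5", "P4", "P10", "P3", "P15", "P2", "P1"],
    ["P2", "P4", "P10", "P8", "P15", "P4", "P15", "P1"],
    ["P4", "P3", "P7", "P11", "P14", "P8", "P2", "P10"],  # boundary assumed
    ["P1", "P3", "P10", "P11", "P4", "P15", "P9"],        # boundary assumed
]


def rule_confidence(sessions, antecedent, consequent):
    """Return (instances where the rule holds, instances the rule covers)."""
    antecedent, consequent = set(antecedent), set(consequent)
    covered = [s for s in sessions if antecedent <= set(s)]
    correct = [s for s in covered if consequent <= set(s)]
    return len(correct), len(covered)


correct, covered = rule_confidence(sessions, ["P4", "P10"], ["P15"])
```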
Evaluating Results (unsupervised clustering) • Use agglomerative clustering to place session instances into clusters. • Instance similarity is computed by dividing the number of distinct pageviews a pair of instances share by the total number of distinct pageviews contained within the two instances.
Evaluating Results (unsupervised clustering) • Consider the following session instances:
P5 P4 P10 P3 P15 P2 P1
P2 P4 P10 P8 P15 P4 P15 P1
• The two instances share 5 pageviews (P1, P2, P4, P10, P15) out of 8 distinct pageviews, so the computed similarity is 5/8 = 0.625
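The similarity computation can be sketched as a ratio of set sizes, assuming repeated pageviews within a session count only once (the reading that reproduces the 5/8 figure). `session_similarity` is a hypothetical helper name.

```python
def session_similarity(a, b):
    """Shared distinct pageviews divided by all distinct pageviews in the pair."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


first = ["P5", "P4", "P10", "P3", "P15", "P2", "P1"]
second = ["P2", "P4", "P10", "P8", "P15", "P4", "P15", "P1"]
similarity = session_similarity(first, second)  # 5 shared / 8 distinct
```

An agglomerative clusterer would repeatedly merge the pair of instances or clusters with the highest such score.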
Evaluating Results(summary statistics) • Summary statistics about the activities taking place at a Web site can be obtained using a Web server log analyzer. • The output of the analyzer is an aggregation of log file data displayed in graphical format.
Web-Based Mining(Taking Action) • Implement a strategy based on created user profiles to personalize the Web pages viewed by site visitors. • Adapt the indexing structure of a Web site to better reflect the paths followed by typical users. • Set up online advertising promotions for registered Web site customers. • Send e-mail to promote products of likely interest to a select group of registered customers. • Modify the content of a Web site by grouping products likely to be purchased together, removing products of little interest, and expanding the offerings of high-demand products.
Data Mining for Web Site Evaluation Web site evaluation is concerned with determining whether the actual use of a site matches the intentions of its designer.
Data Mining for Web Site Evaluation • Data mining can help with site evaluation by determining the frequent patterns and routes traveled by the user population. • Sequential ordering of pageviews is of primary interest. • Sequence miners are used to determine pageview order sequencing.
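As a rough illustration of pageview order sequencing, the sketch below counts in how many sessions one pageview is requested before another. It is a simplification for illustration, not a full sequence mining algorithm. The session data repeats the four hypothetical instances used earlier, with the split between the last two instances assumed, and `ordered_pair_support` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations

sessions = [
    ["P5", "P4", "P10", "P3", "P15", "P2", "P1"],
    ["P2", "P4", "P10", "P8", "P15", "P4", "P15", "P1"],
    ["P4", "P3", "P7", "P11", "P14", "P8", "P2", "P10"],  # boundary assumed
    ["P1", "P3", "P10", "P11", "P4", "P15", "P9"],        # boundary assumed
]


def ordered_pair_support(sessions):
    """Count the sessions in which page a is requested somewhere before page b."""
    support = Counter()
    for session in sessions:
        # combinations() preserves positional order, so (a, b) means a came first.
        pairs = {(a, b) for a, b in combinations(session, 2) if a != b}
        support.update(pairs)  # each ordered pair counts once per session
    return support


support = ordered_pair_support(sessions)
frequent = {pair: n for pair, n in support.items() if n >= 3}
```

Unlike the association rule, the count is order-sensitive: `("P4", "P10")` and `("P10", "P4")` are tallied separately.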
Data Mining for Personalization • The goal of personalization is to present Web users with what interests them without requiring them to ask for it directly. • Manual techniques force users to register at a Web site and to fill in questionnaires. • Data mining can be used to automate personalization.
Data Mining for Personalization Automatic personalization is accomplished by creating usage profiles from stored session data.
Data Mining for Personalization(WebPersonalizer) • WebPersonalizer uses two profiling techniques: • Association rules are used to directly create usage profiles. • Usage profiles are created by generalizing clusters.
WebPersonalizer(generalizing clusters) • For each cluster, only those pageviews that satisfy a minimum score become part of an individual usage profile. • The score is the ratio computed by dividing the total number of times a pageview appears in a cluster across all session instances by the total number of session instances within the cluster. • Pageviews meeting the minimum criterion become part of the usage profile representing the cluster.
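The scoring step can be sketched as follows. The sketch assumes a pageview is counted once per session instance, so that the score stays between 0 and 1. The slides do not say how repeats are handled, and `usage_profile` and `min_score` are hypothetical names.

```python
from collections import Counter


def usage_profile(cluster_sessions, min_score):
    """Keep the pageviews whose score meets the minimum for this cluster."""
    counts = Counter()
    for session in cluster_sessions:
        counts.update(set(session))  # assumption: count once per session
    total = len(cluster_sessions)
    return {page: n / total for page, n in counts.items() if n / total >= min_score}


cluster = [
    ["P5", "P4", "P10", "P3", "P15", "P2", "P1"],
    ["P2", "P4", "P10", "P8", "P15", "P4", "P15", "P1"],
]
profile = usage_profile(cluster, min_score=1.0)
```

With a minimum score of 1.0, only pageviews present in every session instance of the cluster survive; lowering the threshold admits less common pageviews.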
WebPersonalizer(recommendation engine) • The recommendation engine takes the set of usage profiles and matches current user navigation activity with the stored profiles. • The engine outputs recommended hypertext links to be displayed to the user.
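A much-simplified sketch of the matching step follows. It assumes each usage profile is a set of pageviews and that the best match is the profile sharing the most pageviews with the current session. The slides do not describe WebPersonalizer's actual matching method, so the overlap measure, the profiles, and the `recommend` helper are all illustrative assumptions.

```python
def recommend(current_pages, usage_profiles):
    """Suggest unvisited pageviews from the profile that best overlaps the session."""
    visited = set(current_pages)
    # Assumption: simple overlap count stands in for the real matching measure.
    best = max(usage_profiles, key=lambda profile: len(visited & set(profile)))
    return sorted(set(best) - visited)


# Hypothetical usage profiles (illustration only).
profiles = [{"P4", "P10", "P15"}, {"P3", "P7", "P11"}]
links = recommend(["P4", "P10"], profiles)
```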
Figure 11.3 Hypertext link recommendations from usage profiles
Data Mining for Web Site Adaptation The index synthesis problem: Given a Web site and a visitor access log, create new index pages containing collections of links to related but currently unlinked pages.
11.3 Mining Textual Data Train: Create an attribute dictionary. Filter: Remove common words. Classify: Classify new documents.
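The three steps can be sketched in a few lines. The sketch assumes the attribute dictionary holds word counts per class and that a new document goes to the class whose dictionary words it uses most. The stop list, the training documents, and the function names are all hypothetical.

```python
from collections import Counter

# Filter: a small hypothetical list of common words to discard.
COMMON_WORDS = {"the", "a", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop common words."""
    return [w for w in text.lower().split() if w not in COMMON_WORDS]


def train(labeled_docs):
    """Train: build a per-class attribute dictionary of word counts."""
    dictionary = {}
    for text, label in labeled_docs:
        dictionary.setdefault(label, Counter()).update(tokenize(text))
    return dictionary


def classify(dictionary, text):
    """Classify: pick the class whose dictionary best covers the document's words."""
    words = tokenize(text)
    return max(dictionary, key=lambda label: sum(dictionary[label][w] for w in words))


docs = [
    ("the stock index rose in trading", "finance"),
    ("stock prices and index funds", "finance"),
    ("the team won the match", "sports"),
    ("a goal in the final match", "sports"),
]
dictionary = train(docs)
label = classify(dictionary, "index funds rose")
```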
11.4 Improving Performance • Bagging • Boosting • Instance Typicality
Improving Performance(bagging) • A supervised learning approach that allows several models to have an equal vote in the classification of new instances • The same mining tool is employed to build each model • Different training instances are selected to build each model
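A minimal sketch of the three points above, assuming training instances are chosen by sampling with replacement (the usual bootstrap choice, which the slide does not spell out). The nearest-class-mean learner stands in for "the same mining tool"; it, the toy data, and the function names are hypothetical.

```python
import random
from collections import Counter


def train_nearest_mean(instances):
    """Toy mining tool: remember each class's mean input value."""
    sums, counts = {}, Counter()
    for x, label in instances:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] += 1
    means = {label: sums[label] / counts[label] for label in sums}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


def bagging(instances, n_models, seed=0):
    """Build n_models from resampled data and return an equal-vote classifier."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Assumption: sample with replacement, same size as the training set.
        sample = [rng.choice(instances) for _ in instances]
        models.append(train_nearest_mean(sample))

    def classify(x):
        votes = Counter(model(x) for model in models)  # one vote per model
        return votes.most_common(1)[0][0]

    return classify


# Hypothetical one-attribute training data (illustration only).
training = [(1.0, "low"), (1.5, "low"), (2.0, "low"),
            (8.0, "high"), (8.5, "high"), (9.0, "high")]
classify = bagging(training, n_models=5)
```

An odd number of models avoids tied votes in a two-class problem.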
Improving Performance (boosting) • Several models are used to vote in the classification of new instances • Each new model is built based upon the results of previous models • Each model is assigned a weight based on its performance on the training data
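The sketch below covers only the weighted voting step, assuming the models have already been built and weighted. How each model is trained from the results of its predecessors is not shown, and the predictions and weights are made-up values.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the class with the largest total weight among the model votes."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical votes from three models and their performance-based weights.
winner = weighted_vote(["yes", "no", "no"], [0.9, 0.3, 0.4])
```

Here the single high-weight model outvotes the two weaker ones (0.9 against 0.7), which is the difference from bagging's equal vote.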
Improving Performance(instance typicality) • The typicality of an instance is its average similarity to the other members of its class. • In some cases, test set classification accuracy can be improved by limiting the inclusion of atypical instances in the training data.
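Typicality can be sketched directly from its definition. The sketch assumes categorical attributes and measures similarity as the fraction of attribute values two instances have in common. The slide does not fix a similarity measure, and the class members below are hypothetical.

```python
def attribute_match(a, b):
    """Assumed similarity: fraction of attribute positions with equal values."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def typicality(index, members, similarity=attribute_match):
    """Average similarity of members[index] to the other members of its class."""
    others = [m for i, m in enumerate(members) if i != index]
    return sum(similarity(members[index], o) for o in others) / len(others)


# Hypothetical instances that all belong to one class (illustration only).
members = [
    ("yes", "high", "male"),
    ("yes", "high", "female"),
    ("yes", "low", "male"),
    ("no", "low", "female"),
]
scores = [typicality(i, members) for i in range(len(members))]
```

The last instance gets the lowest score, so it would be the first candidate to leave out of the training data.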