Specialized Techniques

Chapter 11

11.1 Time-Series Analysis Time-series Problems: Prediction applications with one or more time-dependent attributes.

An Example with Linear Regression The Stock Index Dataset

Linear Regression Equations for the Stock Index Dataset

A Neural Network Example

Categorical Attribute Prediction

General Considerations • Test and modify created models as new data becomes available. • Try one or more data transformations if less than optimal results are obtained. • Exercise caution when predicting future outcomes with training data having several predicted fields. • Try a nonlinear model if a linear model offers poor results. • Use unsupervised clustering to determine if input attribute values allow the output attribute to cluster into meaningful categories.

11.2 Mining the Web

Web-Based Mining(identifying the goal) • Decrease the average number of pages visited by a customer before a purchase transaction. • Increase the average number of pages viewed per user session. • Increase Web server efficiency • Personalize Web pages for customers • Determine those products that tend to be purchased or viewed together • Decrease the total number of item returns • Increase visitor retention rates

Web-Based Mining(preparing the data) • Data is stored in Web server log files, typically in the form of clickstream sequences • Server log files provide information in extended common log file format

Extended Common Log File Format • Host Address • Date/Time • Request • Status • Bytes • Referring Page • Browser Type

Extended Common Log File Format
80.202.8.93 - - [16/Apr/2002:22:43:28 -0600] "GET /grbts/images/msu-new-color.gif HTTP/1.1" 200 5006 "http://grb.mnsu.edu/doc/index.html" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 2000) Opera 6.01 [nb]"
134.29.41.219 - - [17/Apr/2002:19:23:30 -0600] "GET /resin-doc/images/resin_powered.gif HTTP/1.1" 200 571 "http://grb.mnsu.edu/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Q312461)"
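The seven fields listed for the extended common log file format can be pulled out of an entry with a regular expression. The sketch below is illustrative only: the pattern, the group names, and the function name `parse_log_line` are assumptions made for this example, and a production parser would need to handle malformed or escaped entries.

```python
import re

# Illustrative pattern for one extended common log format entry.
# Group names are assumptions chosen to mirror the seven fields on the slide.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ '          # host address, then two unused fields
    r'\[(?P<datetime>[^\]]+)\] '       # date/time in square brackets
    r'"(?P<request>[^"]*)" '           # request line
    r'(?P<status>\d{3}) '              # status code
    r'(?P<bytes>\d+|-) '               # bytes sent
    r'"(?P<referrer>[^"]*)" '          # referring page
    r'"(?P<browser>[^"]*)"'            # browser type
)

def parse_log_line(line):
    """Return a dict of the seven fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

entry = parse_log_line(
    '134.29.41.219 - - [17/Apr/2002:19:23:30 -0600] '
    '"GET /resin-doc/images/resin_powered.gif HTTP/1.1" 200 571 '
    '"http://grb.mnsu.edu/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Q312461)"'
)
print(entry["host"], entry["status"], entry["referrer"])
```

All values are returned as strings; converting the status and byte count to integers is left to the caller.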

Preparing the Data(the session file) A session file is a file created by the data preparation process. Each instance of a session file represents a single user session.

Preparing the Data(the session file) A user session is a set of pageviews requested by a single user from a single Web server. A pageview contains one or more page files each forming a display window in a Web browser. Each pageview is tagged with a unique uniform resource identifier (URI).

Figure 11.1 A generic Web usage model

Preparing the Data(the session file) • Creating the session file is difficult • Identify individual users in a log file • Host addresses are of limited help • Host address combined with referring page is beneficial • One user page request may generate multiple log file entries from several types of servers • Easiest when sites are allowed to use cookies

Web-Based Mining(mining the data) • Traditional techniques such as association rule generators or clustering methods can be applied. • Sequence miners, which are special data mining algorithms used to discover frequently accessed Web pages that occur in the same order, are often used.
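As a rough illustration of what a sequence miner looks for, the sketch below counts contiguous pageview sequences of a fixed length and keeps those that occur in a minimum number of sessions. It is a deliberately simple stand-in, not a specific published algorithm. The function name, the parameters `length` and `min_support`, and the page names are all hypothetical.

```python
from collections import Counter

def frequent_sequences(sessions, length=2, min_support=2):
    """Count contiguous pageview sequences, once per session, and keep the frequent ones."""
    counts = Counter()
    for session in sessions:
        seen = set()
        for i in range(len(session) - length + 1):
            seen.add(tuple(session[i:i + length]))
        counts.update(seen)  # each sequence counts at most once per session
    return {seq: n for seq, n in counts.items() if n >= min_support}

# Hypothetical sessions, each an ordered list of pageviews.
sessions = [
    ["home", "catalog", "cart"],
    ["home", "catalog", "item", "cart"],
    ["catalog", "home", "cart"],
]
frequent = frequent_sequences(sessions)
print(frequent)
```

Unlike an association rule, the pair (home, catalog) is only counted when the pages occur in that order, which is why the third session does not contribute to it.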

Web-Based Mining(evaluating results) • Consider four hypothetical pageview instances:
P5 P4 P10 P3 P15 P2 P1
P2 P4 P10 P8 P15 P4 P15 P1
P4 P3 P7 P11 P14 P8 P2 P10
P1 P3 P10 P11 P4 P15 P9

Evaluating Results(association rules) • An association rule generator outputs the following rule from our session data. IF P4 & P10 THEN P15 {3/4} • This rule states that P4, P10, and P15 appear together in three session instances, and that all four session instances contain both P4 and P10. The rule is therefore correct for three of the four instances it covers.
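The {3/4} figure can be checked by counting. The sketch below assumes the four hypothetical instances split as 7, 8, 8, and 7 pageviews, which is the reading consistent with the two instances used in the clustering example. The function name `rule_counts` is made up for illustration.

```python
# The four hypothetical session instances (split inferred from the slides).
sessions = [
    ["P5", "P4", "P10", "P3", "P15", "P2", "P1"],
    ["P2", "P4", "P10", "P8", "P15", "P4", "P15", "P1"],
    ["P4", "P3", "P7", "P11", "P14", "P8", "P2", "P10"],
    ["P1", "P3", "P10", "P11", "P4", "P15", "P9"],
]

def rule_counts(sessions, antecedent, consequent):
    """Return (instances satisfying the whole rule, instances satisfying the antecedent)."""
    antecedent, consequent = set(antecedent), set(consequent)
    covered = [s for s in sessions if antecedent <= set(s)]
    correct = [s for s in covered if consequent <= set(s)]
    return len(correct), len(covered)

correct, covered = rule_counts(sessions, {"P4", "P10"}, {"P15"})
print(f"IF P4 & P10 THEN P15 {{{correct}/{covered}}}")  # prints IF P4 & P10 THEN P15 {3/4}
```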

Evaluating Results(unsupervised clustering) • Use agglomerative clustering to place session instances into clusters. • Instance similarity is computed by dividing the number of distinct pageviews a pair of instances share by the total number of distinct pageviews contained within the two instances.

Evaluating Results(unsupervised clustering) • Consider the following session instances:
P5 P4 P10 P3 P15 P2 P1
P2 P4 P10 P8 P15 P4 P15 P1
• The computed similarity is 5/8 = 0.625
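The 5/8 value follows if repeated pageviews are counted once: the two instances share P4, P10, P15, P2, and P1, and together contain eight distinct pageviews. A minimal sketch of that computation, with `session_similarity` as a name chosen for this example:

```python
def session_similarity(a, b):
    """Shared distinct pageviews divided by all distinct pageviews in the two sessions."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

s1 = ["P5", "P4", "P10", "P3", "P15", "P2", "P1"]
s2 = ["P2", "P4", "P10", "P8", "P15", "P4", "P15", "P1"]
similarity = session_similarity(s1, s2)
print(similarity)  # prints 0.625
```

An agglomerative clusterer would repeatedly merge the pair of sessions (or clusters) with the highest such score.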

Evaluating Results(summary statistics) • Summary statistics about the activities taking place at a Web site can be obtained using a Web server log analyzer. • The output of the analyzer is an aggregation of log file data displayed in graphical format.

Web-Based Mining(Taking Action) • Implement a strategy based on created user profiles to personalize the Web pages viewed by site visitors. • Adapt the indexing structure of a Web site to better reflect the paths followed by typical users. • Set up online advertising promotions for registered Web site customers. • Send e-mail to promote products of likely interest to a select group of registered customers. • Modify the content of a Web site by grouping products likely to be purchased together, removing products of little interest, and expanding the offerings of high-demand products.

Data Mining for Web Site Evaluation Web site evaluation is concerned with determining whether the actual use of a site matches the intentions of its designer.

Data Mining for Web Site Evaluation • Data mining can help with site evaluation by determining the frequent patterns and routes traveled by the user population. • Sequential ordering of pageviews is of primary interest. • Sequence miners are used to determine pageview order sequencing.

Data Mining for Personalization • The goal of personalization is to present Web users with what interests them without requiring them to ask for it directly. • Manual techniques force users to register at a Web site and to fill in questionnaires. • Data mining can be used to automate personalization.

Data Mining for Personalization Automatic personalization is accomplished by creating usage profiles from stored session data.

Data Mining for Personalization(WebPersonalizer) • WebPersonalizer uses two profiling techniques: • Association rules are used to directly create usage profiles. • Usage profiles are created by generalizing clusters.

WebPersonalizer(generalizing clusters) • For each cluster, only those pageviews that satisfy a minimum score become part of an individual usage profile. • The score is the ratio computed by dividing the total number of times a pageview appears in a cluster across all session instances by the total number of session instances within the cluster. • Pageviews meeting the minimum criterion become part of the usage profile representing the cluster.
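The scoring step can be sketched as follows. This is an illustration under stated assumptions, not WebPersonalizer's actual code: each pageview is counted at most once per session so that the score stays between 0 and 1, and the cluster contents, the function name `usage_profile`, and the 0.75 threshold are hypothetical.

```python
from collections import Counter

def usage_profile(cluster_sessions, min_score):
    """Keep pageviews whose share of sessions in the cluster meets min_score."""
    counts = Counter()
    for session in cluster_sessions:
        counts.update(set(session))  # assumption: count a pageview once per session
    total = len(cluster_sessions)
    return {page: n / total for page, n in counts.items() if n / total >= min_score}

# Hypothetical cluster of four session instances.
cluster = [
    ["P4", "P10", "P15"],
    ["P4", "P10", "P2"],
    ["P4", "P15", "P1"],
    ["P4", "P10", "P15", "P3"],
]
profile = usage_profile(cluster, min_score=0.75)
print(profile)
```

Here P4 scores 4/4 and P10 and P15 score 3/4, so all three enter the profile, while P1, P2, and P3 (1/4 each) are dropped.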

Figure 11.2 Creating usage profiles from session data

WebPersonalizer(recommendation engine) • The recommendation engine takes the set of usage profiles and matches current user navigation activity with the stored profiles. • The engine outputs recommended hypertext links to be displayed to the user.
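One simple way to picture the matching step is shown below. It is a hypothetical sketch, not the WebPersonalizer algorithm: it picks the stored profile whose pageviews best overlap the pages visited so far and recommends that profile's remaining pages. The profile format (pageview mapped to score), the function name `recommend`, and the parameter `top_n` are assumptions.

```python
def recommend(profiles, visited, top_n=3):
    """Recommend unvisited pages from the profile that best matches current navigation."""
    visited = set(visited)
    # Match score: sum of profile scores for pages the user has already visited.
    best = max(profiles, key=lambda p: sum(w for page, w in p.items() if page in visited))
    candidates = [(w, page) for page, w in best.items() if page not in visited]
    candidates.sort(key=lambda c: (-c[0], c[1]))  # highest score first, then by name
    return [page for _, page in candidates[:top_n]]

# Hypothetical usage profiles: pageview -> score within its cluster.
profiles = [
    {"P4": 1.0, "P10": 0.75, "P15": 0.75},
    {"P7": 1.0, "P11": 0.8, "P14": 0.6},
]
links = recommend(profiles, visited=["P4", "P10"])
print(links)  # prints ['P15']
```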

Figure 11.3 Hypertext link recommendations from usage profiles

Data Mining for Web Site Adaptation The index synthesis problem: Given a Web site and a visitor access log, create new index pages containing collections of links to related but currently unlinked pages.

11.3 Mining Textual Data Train: Create an attribute dictionary. Filter: Remove common words. Classify: Classify new documents.
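The three steps can be illustrated with a toy classifier. Everything specific here is an assumption made for the sketch: the stop list `COMMON_WORDS`, the training documents and labels, and the choice to classify by summing per-class word counts. The slides do not specify a particular method.

```python
from collections import Counter

COMMON_WORDS = {"the", "a", "of", "and", "to", "is", "in"}  # hypothetical stop list

def tokenize(text):
    """Filter: lowercase, split on whitespace, and drop common words."""
    return [w for w in text.lower().split() if w not in COMMON_WORDS]

def train(labeled_docs):
    """Train: build a per-class dictionary of word counts."""
    dictionary = {}
    for text, label in labeled_docs:
        dictionary.setdefault(label, Counter()).update(tokenize(text))
    return dictionary

def classify(dictionary, text):
    """Classify: pick the class whose dictionary best covers the document's words."""
    words = tokenize(text)
    return max(dictionary, key=lambda label: sum(dictionary[label][w] for w in words))

# Hypothetical training documents.
training = [
    ("the stock index rose in trading", "finance"),
    ("the market index and stock prices fell", "finance"),
    ("the team won the game", "sports"),
    ("a player scored in the final game", "sports"),
]
dictionary = train(training)
label = classify(dictionary, "stock prices rose")
print(label)  # prints finance
```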

11.4 Improving Performance Bagging Boosting Instance Typicality

Improving Performance(bagging) • A supervised learning approach that allows several models to have an equal vote in the classification of new instances • The same mining tool is employed to build each model • Different training instances are selected to build each model
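A minimal bagging sketch follows. The base learner (a nearest-centroid rule on one numeric attribute) is only a stand-in for whatever mining tool is used, and the data, the function names, and the parameters `n_models` and `seed` are hypothetical. Each model is trained on a bootstrap sample drawn with replacement, and every model gets one equal vote.

```python
import random
from collections import Counter

def train_centroids(instances):
    """Base learner (stand-in for any mining tool): one centroid per class."""
    sums, counts = Counter(), Counter()
    for value, label in instances:
        sums[label] += value
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in counts}

def predict(model, value):
    """Predict the class whose centroid is closest to the value."""
    return min(model, key=lambda label: abs(model[label] - value))

def bagging(instances, n_models=11, seed=0):
    """Build each model from a different bootstrap sample of the training data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(instances) for _ in instances]  # sample with replacement
        models.append(train_centroids(sample))
    return models

def bagged_predict(models, value):
    """Each model casts one equal vote; the majority class wins."""
    votes = Counter(predict(m, value) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (attribute value, class label).
instances = [(float(v), "low") for v in range(1, 6)] + [(float(v), "high") for v in range(11, 16)]
models = bagging(instances)
print(bagged_predict(models, 2.0), bagged_predict(models, 14.0))
```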

Improving Performance (boosting) • Several models are used to vote in the classification of new instances • Each new model is built based upon the results of previous models • Each model is assigned a weight based on its performance on the training data
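The three points above can be made concrete with an AdaBoost-style sketch, one common boosting scheme chosen here for illustration; the slides do not name a specific algorithm. The base model is a one-attribute decision stump. The data set and all function names are hypothetical. Misclassified instances gain weight so the next stump concentrates on them, and each stump's vote is weighted by its training error.

```python
import math

def stump_predict(stump, x):
    """A stump predicts `polarity` below its threshold and the opposite class above it."""
    threshold, polarity = stump
    return polarity if x < threshold else -polarity

def best_stump(data, weights):
    """Return the stump with the lowest weighted training error."""
    xs = sorted(x for x, _ in data)
    thresholds = [xs[0] - 0.5] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best, best_error = None, float("inf")
    for threshold in thresholds:
        for polarity in (1, -1):
            stump = (threshold, polarity)
            error = sum(w for (x, y), w in zip(data, weights) if stump_predict(stump, x) != y)
            if error < best_error:
                best, best_error = stump, error
    return best, best_error

def boost(data, rounds=3):
    """Build stumps in sequence, reweighting instances after each round."""
    weights = [1 / len(data)] * len(data)
    ensemble = []
    for _ in range(rounds):
        stump, error = best_stump(data, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # model weight from training error
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified instances, lower the rest, then normalize.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for (x, y), w in zip(data, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def boosted_predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data that no single stump can classify perfectly: (value, class).
data = [(1, 1), (2, 1), (3, -1), (4, -1), (5, 1), (6, 1)]
ensemble = boost(data, rounds=3)
print([boosted_predict(ensemble, x) for x, _ in data])
```

On this data a single stump misclassifies at least two instances, while the weighted vote of three stumps classifies all six training instances correctly.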

Improving Performance(instance typicality) • The typicality of an instance is its average similarity to the other members of its class. • In some cases, test set classification accuracy can be improved by limiting the inclusion of atypical instances in the training data.
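Typicality as defined above is straightforward to compute once a similarity measure is chosen. The sketch below assumes categorical instances and measures similarity as the fraction of attributes on which two instances agree; that measure, the sample data, and the function names are assumptions made for illustration.

```python
def similarity(a, b):
    """Assumed measure: fraction of attribute positions on which two instances agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def typicality(index, class_members):
    """Average similarity of one instance to the other members of its class."""
    instance = class_members[index]
    others = [m for i, m in enumerate(class_members) if i != index]
    return sum(similarity(instance, m) for m in others) / len(others)

# Hypothetical instances that all belong to the same class.
class_members = [
    ("yes", "high", "male"),
    ("yes", "high", "female"),
    ("yes", "low", "male"),
    ("no", "low", "female"),
]
scores = [typicality(i, class_members) for i in range(len(class_members))]
least_typical = min(range(len(scores)), key=scores.__getitem__)
print(scores, least_typical)
```

The last instance has the lowest score, so it is the first candidate to leave out of the training data when filtering atypical instances.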
