KEYWORD-BASED FILTERING: Content-based filtering that uses keyword counts from documents as representations of items.
• Advantages: Mature technology. Works as well as more sophisticated content-filtering technologies in high-quality document domains.
• Disadvantages: Only works in document domains. Cannot capture subjective notions of quality or style of the documents being filtered.
• Applications: Filtering high-quality news wires and document databases. Web search engines.
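As a rough illustration of keyword counts serving as item representations, the sketch below turns each document into a bag of word counts and ranks documents against a profile. The slide does not say how counts are compared; cosine similarity is an assumption made here, and the function names and sample documents are hypothetical.

```python
from collections import Counter
import math
import re

def keyword_counts(text):
    """Represent a document as a bag of lowercase keyword counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two keyword-count vectors (assumed measure)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical user profile and documents, for illustration only.
profile = keyword_counts("stock market prices")
docs = {
    "wire_1": "Stock prices fell as the market closed lower",
    "wire_2": "Local team wins the football final",
}

# Rank documents by similarity to the profile, best match first.
ranked = sorted(
    docs,
    key=lambda name: cosine_similarity(profile, keyword_counts(docs[name])),
    reverse=True,
)
print(ranked)
```

Note that nothing in this representation reflects how well written a document is, which is the limitation listed under Disadvantages.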
NEURAL NETWORKS: Highly sophisticated content-based filtering technology that can use arbitrary attribute information about the items being filtered.
• Advantages: Very powerful technology (can work with many kinds of items that have attribute information). Given sufficient training examples, can learn almost any concept.
• Disadvantages: Require long training. “Black boxes”: no way to determine exactly what they have learned. Not scalable (works only for small samples). Cannot capture subjective notions.
• Applications: Filters any information stream where the items are tagged with attributes (documents, credit records, etc.) or contain keywords.
ACTIVE COLLABORATIVE FILTERING: “Manual” collaborative filtering technique where users explicitly identify other users in the community whose opinions they are interested in.
• Advantages: Works well for small communities where users know each other and their areas of expertise. Combines elements of feature-based filtering with opinion-based filtering.
• Disadvantages: Not feasible in large communities of users. The burden of identifying the appropriate members of the community and constructing the appropriate query rests on the user.
• Applications: Information and document sharing in small workgroup environments.
AUTOMATED COLLABORATIVE FILTERING (ACF): An automated version of “word of mouth,” where the technology uses the opinions of a large community to filter items for each person.
• Advantages: Incorporates subjective notions of quality into the filtering process. Very effective for domains where items cannot be easily analyzed by computer or that are highly subjective.
• Disadvantages: No knowledge about the “kinds of items” being filtered, which can lead to incorrect results. The technology cannot utilize additional information about the items even when it is available and relevant.
• Applications: Highly subjective domains (music, travel…). Domains that are not amenable to machine analysis (e.g., video). Domains where the perceived quality of items fluctuates very widely (e.g., Web sites).
FEATURE-GUIDED AUTOMATED COLLABORATIVE FILTERING (FGACF): Technology that utilizes features of items to partition items and more effectively apply the ACF algorithm.
• Advantages: Utilizes available feature information to partition the item space to apply ACF effectively. Combines strengths of simple content-based filtering with those of collaborative filtering while addressing the limitations of standard ACF.
• Disadvantages: Feature information used must be relevant to partitioning the item space.
• Applications: “Broad” subjective domains (Web sites, books, restaurants) where additional feature information is available. Any domain standard ACF applies to.
Rule-based Technology
Rule 1: If the visitor is under 40, not married, and has an income greater than $100,000, show a Mercedes ad.
Rule 2: If the visitor is under 40, married, and has an income not greater than $100,000, show a Plymouth ad.
What happens if the visitor is under 40, not married, and has an income not greater than $100,000? Maybe show a VW ad, but the rule must be explicitly given!
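The two rules and the uncovered case can be written out directly. This is a minimal sketch; the function name `choose_ad` and its parameters are hypothetical, and returning `None` is just one way to show that no rule fires.

```python
def choose_ad(age, married, income):
    """Return the ad to show, or None when no explicit rule matches."""
    if age < 40 and not married and income > 100_000:
        return "Mercedes"  # Rule 1
    if age < 40 and married and income <= 100_000:
        return "Plymouth"  # Rule 2
    return None  # no rule was given for this visitor

print(choose_ad(30, False, 150_000))  # Mercedes
print(choose_ad(30, True, 80_000))    # Plymouth
print(choose_ad(30, False, 80_000))   # None
```

The last call is the under 40, not married, lower-income visitor: the system has nothing to show until someone adds a rule for that case.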
Rule-based Technology
There are algorithms, such as ID3, that generate a set of business rules from a list of example cases. These rules can then be examined to verify their validity. Neural networks can perform the same type of classification but are “black boxes”: the business rules are not explicit.
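ID3 builds its rules by repeatedly splitting the example cases on the attribute with the highest information gain. The sketch below shows only that split criterion, not the full tree-building algorithm. The example cases, the attribute names, and the function names are hypothetical.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(cases, attribute, target="ad"):
    """Entropy reduction from splitting the cases on one attribute."""
    base = entropy([c[target] for c in cases])
    remainder = 0.0
    for value in {c[attribute] for c in cases}:
        subset = [c[target] for c in cases if c[attribute] == value]
        remainder += len(subset) / len(cases) * entropy(subset)
    return base - remainder

# Hypothetical example cases (all visitors under 40), for illustration only.
cases = [
    {"married": False, "high_income": True, "ad": "Mercedes"},
    {"married": False, "high_income": True, "ad": "Mercedes"},
    {"married": True, "high_income": False, "ad": "Plymouth"},
    {"married": True, "high_income": False, "ad": "Plymouth"},
    {"married": False, "high_income": False, "ad": "VW"},
    {"married": False, "high_income": False, "ad": "VW"},
]

for attribute in ("married", "high_income"):
    print(attribute, round(information_gain(cases, attribute), 3))
```

Each chosen split becomes a readable condition in the resulting rule set, which is why the rules can be inspected for validity in a way that neural network weights cannot.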
Collaborative Filtering Algorithm
How do we use ratings to make predictions? How do we predict Ken’s rating for product 6?
Collaborative Filtering Algorithm
We use the correlation coefficients. Notation: RKL is the correlation between Ken and Lee. But how?
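The slide does not say which correlation coefficient is meant. A common choice is the Pearson correlation over the products both users have rated, which is the assumption in this sketch. The ratings table did not survive in the transcript, so the ratings below are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length rating lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a user with constant ratings carries no signal
    return cov / math.sqrt(var_x * var_y)

# Hypothetical ratings for products 1-5 (not the values from the slide).
ken = [1, 5, 3, 2, 4]
lee = [4, 1, 3, 5, 2]
print(round(pearson(ken, lee), 2))
```

A value near +1 means the two users tend to agree, a value near -1 means they tend to disagree, and a value near 0 means one user's ratings say little about the other's.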
Collaborative Filtering Algorithm
• Did the user like or dislike the product?
• How close is the user’s rating to his/her average?
• E.g., Lee’s average is 3 and Lee gave Product 6 a 2, so use -1. Write L6 - Lavg = 2 - 3 = -1.
• Weighted average from Ken’s average:
K6 = Kavg + (L6 - Lavg)RKL + (M6 - Mavg)RKM + (N6 - Navg)RKN
= 3 + (2 - 3)(-0.8) + (5 - 3)(0.33) + (3 - 2.6)(0)
= 3 + 0.8 + 0.66 + 0
= 4.46
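The calculation above can be reproduced in a few lines. This sketch follows the slide's formula as written, adding each neighbour's deviation from their own average, weighted by their correlation with Ken. The function name `predict` and the tuple layout are hypothetical.

```python
def predict(target_avg, neighbours):
    """Predict a rating from (rating, average, correlation) triples.

    Implements K6 = Kavg + sum of (rating - average) * correlation.
    """
    return target_avg + sum((rating - avg) * corr
                            for rating, avg, corr in neighbours)

# Values from the slide: Lee, M and N rated product 6.
k6 = predict(3, [(2, 3, -0.8), (5, 3, 0.33), (3, 2.6, 0.0)])
print(round(k6, 2))  # 4.46
```

Lee rated the product below average but correlates negatively with Ken, so Lee's dislike pushes Ken's prediction up. The formula here is not normalized; many collaborative filtering formulations also divide the weighted sum by the sum of the absolute correlations to keep predictions within the rating scale.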
Neural Network Algorithm
[Figure: diagram of a single-layer neural network]
xi is the signal level at input i (attribute i). wi is the weight associated with input i. wi(t) is the weight associated with input i at time t. θ is a threshold level. y = 1 when Σ wixi > θ.
Neural Network Algorithm Neural nets “learn” by adjusting the values of the weights. Initially the values of the weights are set to small random values. Training (learning) involves the readjustment of the input weights to develop the correct response to the training set.
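The text here does not give the rule used to readjust the weights. The sketch below assumes the classic perceptron update, wi(t+1) = wi(t) + rate × (desired − y) × xi, with a fixed threshold. The encoding of the inputs (under 40, married, income above $100,000), the output meaning (1 = Plymouth), the training samples, and all parameter values are hypothetical.

```python
import random

def output(weights, inputs, threshold):
    """Return 1 when the weighted input sum exceeds the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

def train(samples, threshold=0.5, rate=0.1, max_epochs=100, seed=0):
    """Adjust weights until every training sample is classified correctly."""
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    # Start from small random weights.
    weights = [rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
    for _ in range(max_epochs):
        errors = 0
        for inputs, desired in samples:
            error = desired - output(weights, inputs, threshold)
            if error:
                errors += 1
                # Assumed perceptron rule: nudge each active input's weight.
                weights = [w + rate * error * x
                           for w, x in zip(weights, inputs)]
        if errors == 0:
            break  # the whole training set is answered correctly
    return weights

# Hypothetical training set: (under_40, married, high_income) -> 1 for Plymouth.
samples = [((1, 1, 0), 1), ((1, 0, 1), 0), ((0, 1, 0), 1), ((0, 0, 1), 0)]
weights = train(samples)
print([output(weights, x, 0.5) for x, _ in samples])
```

The learned weights are just numbers, which illustrates the “black box” point: unlike the explicit rules, they do not state why a given ad is recommended.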
Neural Network Algorithm
• 15. Calculate the output y: Σ wixi = (0.1)(1) + (2.0)(1) + (-1.7)(0) = 2.1, and 2.1 > θ. So y(4) = 1. Therefore the network also recommends Plymouth correctly and we can stop.
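The arithmetic in this step can be checked with a short sketch. The weights and inputs are the ones shown in the step. The numeric threshold did not survive in the transcript, so the value of `theta` below is a hypothetical placeholder; any value below 2.1 gives the same result.

```python
weights = [0.1, 2.0, -1.7]
inputs = [1, 1, 0]

# Weighted sum of the inputs: (0.1)(1) + (2.0)(1) + (-1.7)(0)
weighted_sum = sum(w * x for w, x in zip(weights, inputs))

theta = 1.0  # hypothetical threshold; the slide's value is not available

# The unit fires (y = 1) when the weighted sum exceeds the threshold.
y = 1 if weighted_sum > theta else 0
print(round(weighted_sum, 2), y)
```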
15 Dimensions of Data Quality (in no particular order of importance)
First 5:
• Believability (believable)
• Accuracy (data are certified error-free, accurate, correct, flawless, reliable, errors can be easily identified, the integrity of the data, precise)
• Timeliness (age of data)
• Accessibility (accessible, retrievable, speed of access, available, up-to-date)
• Value-added (data give you a competitive edge, data add value to your operations)
15 Dimensions of Data Quality
Second 5:
• Relevancy (applicable, relevant, interesting, usable)
• Objectivity (unbiased, objective)
• Concise (well-presented, concise, compactly represented, well-organized, aesthetically pleasing, form of presentation, well-formatted, format of the data)
• Appropriate amount of data (the amount of data)
• Representational consistency (data are continuously presented in same format, consistently represented, consistently formatted, data are compatible with previous data)
15 Dimensions of Data Quality
Last 5:
• Ease of understanding (easily understood, clear, readable)
• Interpretability (interpretable)
• Completeness (breadth, depth, and scope of information contained in the data)
• Reputation (reputation of the data source, reputation of the data)
• Access security (data cannot be accessed by competitors, data are of a proprietary nature, access to data can be restricted, secure)