Qualitative Description of Complex Objects Enrique H. Ruspini Artificial Intelligence Center
Linked Data Objects (1)
Objects: O1: Person, O2: Person, O3: Name, O4: Payment, O5: Person, O6: Fin_Instrument, O7: Institution
Relations: R1: Friend, R2: Name_of, R3: Received, R4: Paid, R5: Owner_of, R6: Drawn_at
Similarity in Linked Structures
• Similarity between Objects:
  • Mexico City is similar to Denver (Altitude)
  • France is similar to Germany (Economy)
  • Checks are similar to Money Orders (Financial Instruments)
• Similarity between Relations:
  • Cash Payments are similar to Money Transfers (Financial Transactions)
  • Sequence (G1, Pos1234) is similar to Sequence (G2, Pos3421) (Genomics)
  • The 1929 Market Crash was similar to the 1987 Market Crash (Economics)
Link Discovery
• Find interesting linked structures:
  • Equal or similar to a predefined pattern
  • Must satisfy (extended) equivalence relations between template and instantiated pattern
    • Similar objects
    • Similar relationships between objects
  “Find transactions matching money-laundering patterns”
• Find interesting relations between structures:
  “The actors/roles in Situation A12 are similar to those in Situation B34”
Biomolecules as seen by a Computer
....
HEADER GENE REGULATING PROTEIN 26-JUL-90 3CRO 3CRO 2
COMPND 434 CRO PROTEIN COMPLEX WITH 20 BASE PAIR PIECE OF /DNA$ 3CRO 3
COMPND 2 CONTAINING OPERATOR /OR1$ 3CRO 4
SOURCE PHAGE 434 3CRO 5
AUTHOR A.MONDRAGON,S.C.HARRISON 3CRO 6
REVDAT 1 15-OCT-91 3CRO 0 3CRO 7
.............................................................................
ATOM 5 O5* A A 1 -16.851 -5.543 74.981 1.00 55.62 3CRO 148
ATOM 6 C5* A A 1 -18.254 -5.683 75.238 1.00 51.97 3CRO 149
ATOM 7 C4* A A 1 -18.600 -7.125 75.571 1.00 37.32 3CRO 150
ATOM 8 O4* A A 1 -19.740 -7.166 76.456 1.00 26.97 3CRO 151
ATOM 9 C3* A A 1 -18.978 -8.004 74.382 1.00 34.63 3CRO 152
ATOM 10 O3* A A 1 -18.314 -9.224 74.465 1.00 30.96 3CRO 153
ATOM 11 C2* A A 1 -20.466 -8.236 74.564 1.00 54.40 3CRO 154
ATOM 12 C1* A A 1 -20.537 -8.253 76.076 1.00 31.85 3CRO 155
ATOM 13 N9 A A 1 -21.868 -7.978 76.551 1.00 18.79 3CRO 156
ATOM 14 C8 A A 1 -22.501 -6.770 76.700 1.00 20.51 3CRO 157
ATOM 15 N7 A A 1 -23.737 -6.871 77.141 1.00 6.86 3CRO 158
ATOM 16 C5 A A 1 -23.910 -8.231 77.267 1.00 2.00 3CRO 159
ATOM 17 C6 A A 1 -24.991 -8.982 77.706 1.00 4.41 3CRO 160
.......................
A graph of a computational object ... Source: http://www.marketguide.com/MGI/PRODUCTS/chart.htm?symb=NKE, Accessed 2/2/98
... and its interpretation in investing terms
Trends:
  Overall Substantial Decline
  Rapid Decline in 1Q following Sharp Reversal
  Relatively Stable Period afterwards
  Moderate Decline in 4Q
  .......
Temporal Patterns:
  Major Upward Spike in late Spring
  4Q Decline followed by Short Panic Reversal and Partial Recovery
  Pronounced Double Top Reversal in Summer
  Descending Triangle Reversal in 3Q
  .....
OBJECTIVES
1. DISCOVER: Interesting patterns within an object, at various levels of detail
   EXAMPLE: “Charged Arm,” “Concave Pocket”
2. RELATE: Discovered patterns according to relevant, interesting relationships
   EXAMPLE: “Spike follows Recovery”
3. DESCRIBE: Discovered Structures and Patterns (Qualitative Descriptions)
   EXAMPLE: <<Spike, Midterm>>
4. ANNOTATE: Objects with textual descriptions based on discovered patterns
   EXAMPLE: “The arm protrudes midway from the ..”
5. MINE: Variable and Object Relationships in Collection on the basis of Qualitative Descriptions (Qualitative DM)
   EXAMPLE: “Panic Reversals follow Short Periods of High Level Buying”
Technical Approach
1. DISCOVER: Based on solution of constrained optimization problems (best fitting) by soft-computing techniques (Fuzzy Clustering)
2. RELATE: Relations are found by optimal fitting of relationships from a catalog of interesting relations to discovered structures, and by summarization of the results of the discovery step
3. DESCRIBE: Hierarchical structures organized by level of granularity/inclusion
4. ANNOTATE: Text generation employing Natural Language methods developed at the AIC
5. MINE: Based on Generalized Association Rule Discovery (e.g., ANFIS), Possibilistic Network Learning, and Fuzzy Clustering
The Notion of Similarity
• Basic primitive concept
• Fundamental cognitive capability
• Commonly expressed by (numerical) measures quantifying “resemblance”
  • $S : X \times X \to [0,1]$
  • $S(x, y) = 1$ means that x and y are very similar
  • $S(x, y) = 0$ means that x and y are very different
• Resemblance is always measured from some perspective
• Generalized Reflexive, Symmetric, Transitive Relation (Fuzzy Equivalence Relation)
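The three properties of a fuzzy equivalence relation can be checked numerically on a finite set. The sketch below is illustrative only: the slide does not say which t-norm generalizes transitivity, so the function takes the t-norm as a parameter, and the names `is_fuzzy_equivalence`, `t_min`, and `t_lukasiewicz` are hypothetical. Transitivity is read as $S(x, z) \ge T(S(x, y), S(y, z))$.

```python
from itertools import product

def t_min(a, b):
    """Minimum t-norm."""
    return min(a, b)

def t_lukasiewicz(a, b):
    """Lukasiewicz t-norm."""
    return max(0.0, a + b - 1.0)

def is_fuzzy_equivalence(S, t_norm=t_min, tol=1e-9):
    """Check reflexivity, symmetry, and T-transitivity of a similarity matrix S."""
    n = len(S)
    idx = range(n)
    reflexive = all(abs(S[i][i] - 1.0) <= tol for i in idx)
    symmetric = all(abs(S[i][j] - S[j][i]) <= tol
                    for i, j in product(idx, repeat=2))
    transitive = all(S[i][k] + tol >= t_norm(S[i][j], S[j][k])
                     for i, j, k in product(idx, repeat=3))
    return reflexive and symmetric and transitive

# Hypothetical similarity values between three objects.
S = [[1.0, 0.8, 0.6],
     [0.8, 1.0, 0.7],
     [0.6, 0.7, 1.0]]
```

With these made-up values the matrix is transitive under the Lukasiewicz t-norm but not under the minimum, which illustrates that the choice of t-norm changes which relations qualify as similarities.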
Fuzzy Clustering (Ruspini, 1969)
• Map each point into a vector representing degrees of membership to a fuzzy partition
  • $C : X \to [0, 1]^c$, $x \mapsto (C_1(x), C_2(x), \dots, C_c(x))$
  • For all x, $C_1(x) + C_2(x) + \dots + C_c(x) = 1$
Similarity-based Clustering
• Basic idea: Map Sample Space into “Classification Space”
• Mapping should be optimal in some sense
• Mapping should define a partition of the sample space
  • Similar sample points should receive similar classifications (Ruspini)
Object Clustering
• Data is generally expressed as a collection of vectors in $\mathbb{R}^n$
• Categorical Data may be considered but requires special handling
• Relies on various (sometimes hidden or implicit) measures of distance/similarity between objects
Comments on c-means Clustering
• In most cases, the distance D is simply the Euclidean Distance in $\mathbb{R}^n$
• Crisp c-means: assign each point x to the class of its closest prototype
• Fuzzy, Possibilistic c-means: degree of fuzzification depends on m
• Possibilistic c-means: The weights in the definition of the Objective Functional J penalize clusters that are too small
• Solution is determined by Alternating Optimization:
  • Find clusters for fixed prototypes
  • Find prototypes for fixed clusters
• Alternative solution algorithms are important from the viewpoint of Data Mining (e.g., Sequential Iteration)
• Related to ISODATA
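The alternating optimization described above can be sketched for the fuzzy case. This is a minimal illustration using the textbook fuzzy c-means update rules with squared Euclidean distance, not necessarily the exact functional J of the slides (which is not reproduced here). The function name, the caller-supplied initial prototypes, and the sample points are assumptions.

```python
def fuzzy_c_means(points, prototypes, m=2.0, iters=50, eps=1e-12):
    """Alternate membership and prototype updates (standard fuzzy c-means)."""
    protos = [list(p) for p in prototypes]
    dim = len(points[0])
    U = []
    for _ in range(iters):
        # Step 1: find (fuzzy) clusters for fixed prototypes.
        U = []
        for x in points:
            d2 = [max(sum((a - b) ** 2 for a, b in zip(x, v)), eps)
                  for v in protos]
            inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
            total = sum(inv)
            U.append([w / total for w in inv])  # each row sums to 1
        # Step 2: find prototypes for fixed clusters (weighted means).
        for i in range(len(protos)):
            weights = [row[i] ** m for row in U]
            wsum = sum(weights)
            protos[i] = [sum(w * x[k] for w, x in zip(weights, points)) / wsum
                         for k in range(dim)]
    return U, protos

# Two well-separated groups of made-up points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
U, protos = fuzzy_c_means(points, prototypes=[(0, 0), (10, 10)])
```

Larger values of `m` spread memberships more evenly across clusters; as `m` approaches 1 the result approaches the crisp assignment to the closest prototype.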
Possibilistic Clustering
• Ruspini, 1977; Krishnapuram and Keller, 1993
• Does not require “probabilistic” constraint
• Based on idea of model fitness and utilization of additional constraints
Generalized Prototypes
• The notion of prototype may be changed to include more general structures
  • Linear varieties
  • Shells
  • Elliptotypes (adaptive cluster dimensionality)
Subtractive Clustering Method
Objective: Describe a time series in terms of significant events or epochs
Approach:
• Stepwise iterative determination of interesting structures
  • “Good clusters” rather than “good clusterings”
  • Clusters may overlap
• Multiple, domain-specific, models of significant structures
• Nonlinear constrained-optimization techniques
  • Optimal Fitness subject to Size and Extent Constraints
  • Clusters are local constrained optima of the fitness function
  • Minimization of functional describing quality of fitness of line over a fuzzy (trapezoidal) interval subject to size constraints
  • Penalty-function Approach
  • GA Tournament Domination Selection Algorithm
Approach (diagram): Object, Models, Feature Identification Algorithm, Relations of Interest, Summarization Algorithm, Qualitative Description
Localized, Non-dominated Solutions
• All local maxima
• Multiple objectives (Effective Frontier)
Genetic Algorithm for Feature Identification
• Multiobjective Optimization (Quality of Fit, Extent)
• Multiple Models of Interesting Features
• Supports Complex Model Definitions (MV Logic/Approximate Matching)
• NLP Genetic Algorithm (extension of GA by Horn, Nafpliotis, and Goldberg):
  • Localized: Solutions are not dominated by their neighbors (i.e., a generalization of the notion of local maxima)
  • Niched: The algorithm clusters candidate solutions to isolate each generalized “peak” in the multimodal distribution
  • Pareto: Multiple objectives define a notion of dominance based on separate consideration of all objectives
  • Tournament Selection: Comparison between randomly chosen population pairs to simulate selective pressure
  • Sharing: Procedure to promote diversity in solution space
  • Exhaustive: Seeks to find all solutions in the “localized” effective frontier
  • Special Genetic Operators: T-norm/conorm-based crossover operators, FP-based mutation operator
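The Pareto, Tournament Selection, and Sharing items can be illustrated with a small sketch of a Pareto domination tournament in the spirit of the niched Pareto GA. It is a simplification under stated assumptions: individuals are represented only by their objective vectors (e.g., quality of fit and extent, both maximized), niche counts are measured in objective space rather than solution space, and the "localized" restriction to neighbors is omitted. All function names are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better once."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def niche_count(candidate, population, radius):
    """Number of population members within `radius` of the candidate."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    return sum(1 for p in population if dist(candidate, p) <= radius)

def pareto_tournament(c1, c2, comparison_set, population, radius):
    """Pick a winner between two candidates using a comparison set."""
    c1_dominated = any(dominates(o, c1) for o in comparison_set)
    c2_dominated = any(dominates(o, c2) for o in comparison_set)
    if c1_dominated and not c2_dominated:
        return c2
    if c2_dominated and not c1_dominated:
        return c1
    # No clear winner: sharing favors the candidate in the less crowded niche.
    n1 = niche_count(c1, population, radius)
    n2 = niche_count(c2, population, radius)
    return c1 if n1 <= n2 else c2

# Made-up objective vectors (quality of fit, extent).
population = [(0.9, 0.2), (0.8, 0.3), (0.2, 0.9), (0.5, 0.5), (0.4, 0.4)]
winner = pareto_tournament((0.4, 0.4), (0.2, 0.9), [(0.5, 0.5)], population, 0.2)
```

In a full algorithm the two candidates and the comparison set would be drawn at random from the old population on every tournament; they are fixed here to keep the example deterministic.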
NLP Genetic Algorithm (flowchart). Stages: Initialization; Old Population; Genetic Selection (Random Selection of Candidate-1, Candidate-2, and Comparison Set; Dominant Comparison; Winner? yes/no; Sharing); Genetic Operations (Crossover, Mutation); New Population
Example of Model Definition (Downtrends)
For all peaks in epoch, $\mathrm{peak}(i) \ge \mathrm{peak}(i+1)$, and
for all valleys in epoch, $\mathrm{valley}(i) \ge \mathrm{valley}(i+1)$
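A crisp version of this downtrend model can be sketched as follows, assuming the model requires successive peaks and successive valleys within the epoch to be non-increasing. The slides describe approximate (multivalued-logic) matching, so this Boolean check is only a simplified illustration; the function names and sample series are hypothetical.

```python
def local_extrema(series):
    """Return the values of interior local peaks and valleys, in time order."""
    peaks, valleys = [], []
    for i in range(1, len(series) - 1):
        if series[i - 1] < series[i] > series[i + 1]:
            peaks.append(series[i])
        elif series[i - 1] > series[i] < series[i + 1]:
            valleys.append(series[i])
    return peaks, valleys

def is_downtrend(series):
    """True if successive peaks and successive valleys never increase."""
    peaks, valleys = local_extrema(series)

    def non_increasing(values):
        return all(a >= b for a, b in zip(values, values[1:]))

    return non_increasing(peaks) and non_increasing(valleys)

falling = [10, 8, 9, 6, 7, 4, 5, 2]   # made-up prices with lower highs and lows
rising = [1, 3, 2, 5, 4, 7]           # made-up prices with higher highs
```

An epoch with fewer than two peaks and fewer than two valleys satisfies both conditions vacuously, so a practical model would combine this test with the size and extent constraints mentioned elsewhere in the slides.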
Example of Feature Extraction (Triangular and Rectangular Patterns)
Visualizing the Effective Frontier
Panels: (i,I) Diagram - All Intervals; (i,I) Diagram - Final Intervals; S-Q Diagram - All Intervals; S-Q Diagram - Final Intervals
Heuristic Summarization Algorithm
• Merging of suboptimal intersecting results (unnecessary if enough NLP generations produced)
• Eliminate “approximately” dominated solutions (i.e., deletion of neighbors of lower quality)
• Hierarchical organization by approximate inclusion
• Removal of conflicts in multimodel applications (where NLP has been run separately for each model)
Hierarchical Organization of Summarized GA Output (Uptrends, Downtrends, H&S)
Biological Sequence Description
• Description of Repetitive Elements
  • Genome of Trypanosoma cruzi
  • Short Interspersed Repetitive Element (SIRE)
• GOAL: Identify all interesting alignments
Pattern: TTTTATT
---------TTTTTATTTTT----TTATT-----
TTTAAAATATTTTTATTTTTAAAATTATTTTATT

----TT---TTT---A-TT-   ---------TATT
ACGTTTCGGTTTCCACCTTG   TGCTTAAATTAT-
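As a rough illustration of searching a sequence for occurrences of a short pattern, the sketch below slides the pattern along the sequence and keeps windows within a given number of mismatches. This is plain ungapped (Hamming-distance) matching, swapped in for simplicity; it is not the alignment method behind the slide, whose alignments include gaps. The function name and the `max_mismatches` parameter are assumptions.

```python
def approximate_matches(sequence, pattern, max_mismatches=1):
    """Return (start, window, mismatches) for every near-match of pattern."""
    hits = []
    k = len(pattern)
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        mismatches = sum(1 for a, b in zip(window, pattern) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits

# Sequence fragment and pattern taken from the slide.
sequence = "TTTAAAATATTTTTATTTTTAAAATTATTTTATT"
exact = approximate_matches(sequence, "TTTTATT", max_mismatches=0)
near = approximate_matches(sequence, "TTTTATT", max_mismatches=1)
```

Raising `max_mismatches` trades precision for recall; handling insertions and deletions would require an edit-distance or dynamic-programming alignment instead.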
Gene Expression From I. Zwir (with permission)
Linked Data Objects
Objects: O1: Person, O2: Person, O3: Name, O4: Payment, O5: Person, O6: Fin_Instrument, O7: Institution
Relations: R1: Friend, R2: Name_of, R3: Received, R4: Paid, R5: Owner_of, R6: Drawn_at
Patterns and Data
Patterns and Data are expressed through logical formulas that:
• Define Patterns
• Specify known information as partially specified database entries
Patterns:
Data: p(UID345), q(UID345, UIF348), r(UID345, UID675, 500), …
Data and Pattern Structure (diagram): PATTERN and DATA related through Similarity Functions Sp, Sq, Sr, Sv
Relational Data as Triples
• N-ary relations as a set of triples
• Triples:
  (Type, PMY001, Payment)
  (Date, PMY001, 1 Oct 1888)
  (Type, PER234, Person)
  (Name, PER234, “David Copperfield”)
  (Paid-by, PMY001, PER234)
  . . .
  (Transaction-type, PMY003, Stock-Transfer)
  . . .
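The decomposition of an n-ary record into triples can be sketched in a few lines. The (attribute, identifier, value) ordering follows the examples on the slide; the function name and the dictionary representation of a record are assumptions made for illustration.

```python
def record_to_triples(entity_id, record):
    """Split one n-ary record into (attribute, entity_id, value) triples."""
    return [(attribute, entity_id, value) for attribute, value in record.items()]

# The payment record from the slide, written as a dictionary.
payment = {"Type": "Payment", "Date": "1 Oct 1888", "Paid-by": "PER234"}
triples = record_to_triples("PMY001", payment)
```

Each attribute of the record becomes one triple, so records with different sets of attributes can share a single uniform store.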
Conceptual Approach
• Data is represented as a set of triples
  • Logical conjunction of corresponding logical clauses
• Pattern is a logical expression that may or may not be satisfied by the Data
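To make "a pattern that may or may not be satisfied by the data" concrete, the sketch below treats a pattern as a conjunction of triple-shaped clauses with variables and searches for bindings that satisfy every clause. It uses exact equality, whereas the approach in the slides relies on similarity functions for approximate matching; the `?name` variable syntax, the function names, and the example pattern are all assumptions.

```python
def is_var(term):
    """Variables are strings that start with '?' (a convention assumed here)."""
    return isinstance(term, str) and term.startswith("?")

def unify(clause, triple, binding):
    """Extend binding so that clause equals triple, or return None."""
    extended = dict(binding)
    for p, t in zip(clause, triple):
        if is_var(p):
            if extended.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return extended

def match(pattern, data, binding=None):
    """Yield every variable binding that satisfies all clauses of pattern."""
    binding = {} if binding is None else binding
    if not pattern:
        yield binding
        return
    first, rest = pattern[0], pattern[1:]
    for triple in data:
        extended = unify(first, triple, binding)
        if extended is not None:
            yield from match(rest, data, extended)

data = [
    ("Type", "PMY001", "Payment"),
    ("Date", "PMY001", "1 Oct 1888"),
    ("Type", "PER234", "Person"),
    ("Name", "PER234", "David Copperfield"),
    ("Paid-by", "PMY001", "PER234"),
    ("Transaction-type", "PMY003", "Stock-Transfer"),
]
# Hypothetical pattern: a payment, the person who paid it, and that person's name.
pattern = [("Type", "?p", "Payment"), ("Paid-by", "?p", "?who"), ("Name", "?who", "?name")]
results = list(match(pattern, data))
```

Replacing the equality tests in `unify` with a graded similarity score, combined across clauses, would move this crisp matcher toward the approximate matching described in the slides.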