ROUGH SETS and FCA: Foundations and Case Studies of Feature Subset Selection and Knowledge Structure Formation
DOMINIK ŚLĘZAK
www.infobright.com | www.infobright.org
Contents
• Rough Sets & Feature Selection
• Association Reducts
• Conceptual Reducts
• Building Ensembles
• Towards Clustering
• Rough Sets & Infobright Story
• Rough & Granular Computation
• Knowledge Structure Formation
Rough Sets
• Rough set theory, proposed by Z. Pawlak in 1982, is a model of approximate reasoning
• In applications, it focuses on deriving approximate knowledge from databases
• It has produced good results in domains such as Web analysis, finance, industry, multimedia, medicine, and bioinformatics
Decision Systems
IF (H = Normal) AND (T = Mild) THEN (S = Yes)
This rule corresponds to a data block included in the positive region of the decision class "Yes"
Rules and Approximations
Lower & Upper Approximations
POS(Sport?|B)
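The lower/upper approximations and the positive region behind these slides can be computed directly from indiscernibility classes. A minimal sketch in Python, using a toy weather-style table (the attribute values here are hypothetical, not the slides' exact data):

```python
from collections import defaultdict

def partition(table, attrs):
    """Group objects into indiscernibility classes w.r.t. the given attributes."""
    blocks = defaultdict(list)
    for name, row in table.items():
        blocks[tuple(row[a] for a in attrs)].append(name)
    return [set(b) for b in blocks.values()]

def approximations(table, attrs, concept):
    """Lower/upper approximation of a concept (a set of objects) w.r.t. attrs."""
    lower, upper = set(), set()
    for block in partition(table, attrs):
        if block <= concept:
            lower |= block   # the block is certainly inside the concept
        if block & concept:
            upper |= block   # the block possibly overlaps the concept
    return lower, upper

# Toy decision table: is the decision class "Sport = yes" definable from
# Outlook and Humidity?
data = {
    "u1": {"Outlook": "sunny", "Humidity": "high",   "Sport": "no"},
    "u2": {"Outlook": "sunny", "Humidity": "normal", "Sport": "yes"},
    "u3": {"Outlook": "rain",  "Humidity": "high",   "Sport": "no"},
    "u4": {"Outlook": "rain",  "Humidity": "high",   "Sport": "yes"},
}
yes = {u for u, r in data.items() if r["Sport"] == "yes"}
low, up = approximations(data, ["Outlook", "Humidity"], yes)
```

The positive region POS(Sport|B) is then the union of lower approximations of all decision classes; objects in the upper but not the lower approximation form the boundary region.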
Feature Reduction (Selection)
• Reducts: optimal attribute subsets that approximate, well enough, the pre-defined target concepts or the whole data source
• Feature reduction is one of the steps of the knowledge discovery in databases (KDD) process
• In real-world situations, we may agree to slightly decrease the quality if this leads to a significantly simpler knowledge model
Association Reducts
({a,b,c}, {d,e})    ({a,b,d,f}, {c,e})    ({a,b,f}, {e})
({a,c,e}, {b,d})    ({a,c,f}, {d})        ({a,d,e}, {b,c})
({a,d,f}, {c})      ({a,e,f}, {b})        ({b,c,d}, {a,e})
({b,d,e}, {a,c})    ({b,e,f}, {a})        ({c,d,f}, {a})
({c,e,f}, {a,b,d})
Association Reducts as Association Rules in Indiscernibility Tables
Most Interesting Reducts
• Given an association reduct (C,D), we evaluate it with the value F(|C|,|D|)
• The function F: N × N → R should satisfy:
  IF n1 < n2 THEN F(n1,m) > F(n2,m)
  IF m1 < m2 THEN F(n,m1) < F(n,m2)
• F(|C|,|D|) is maximized subject to a constraint # from the space of approximation parameters
• Such a maximization problem is NP-hard
What can # actually mean?
1) |POS(d|B)|
2) Disc(d|B) = Disc(B ∪ {d}) − Disc(B), where Disc(X) = |{(u1,u2): X(u1) ≠ X(u2)}|
3) Relative Gain R(d|B) =
4) Entropy H(d|B) = H(B ∪ {d}) − H(B)
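The discernibility-based criterion (2) is easy to compute by brute force on small tables. A sketch, assuming a table of dictionaries as above (the toy data below is hypothetical):

```python
from itertools import combinations

def disc(table, attrs):
    """Disc(X): number of object pairs discerned by attrs, i.e. differing
    on at least one attribute in attrs."""
    rows = list(table.values())
    return sum(
        1
        for r1, r2 in combinations(rows, 2)
        if any(r1[a] != r2[a] for a in attrs)
    )

def disc_gain(table, d, B):
    """Disc(d|B) = Disc(B ∪ {d}) − Disc(B): extra pairs discerned once the
    decision attribute d is added to the attribute set B."""
    return disc(table, list(B) + [d]) - disc(table, B)

# Tiny illustrative table: attribute "a" alone leaves the pair (u1,u2)
# undiscerned, but the decision "d" separates it.
table = {
    "u1": {"a": 0, "b": 0, "d": 0},
    "u2": {"a": 0, "b": 1, "d": 1},
    "u3": {"a": 1, "b": 1, "d": 1},
}
gain = disc_gain(table, "d", ["a"])
```

Disc(d|B) = 0 means B already discerns every pair that d discerns, which is the classical sufficiency condition for B with respect to d.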
Conceptual Reducts
Reduct as a pair (X,B), where X ⊆ U, POS(B) = X, and POS(C) ≠ X for any C ⊊ B
(∅, ∅)                      ({3,7,12,13}, {O})
({1-3,7,9,12,13}, {O,T})    ({1-3,7-9,11-13}, {O,H})    ({3-7,10,12-14}, {O,W})
({10,11,13}, {T,H})         ({2,5,9}, {T,W})            ({5,9,10,13}, {H,W})
({1-3,7-13}, {O,T,H})       ({1-14}, {O,T,W})           ({1-14}, {O,H,W})
({2,5,9-11,13}, {H,T,W})
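The pair condition can be checked mechanically: POS(B) must equal X, and no proper subset of B may yield the same positive region. A small verifier, sketched under the same table-of-dictionaries convention (the toy data is hypothetical):

```python
from collections import defaultdict
from itertools import combinations

def pos(table, attrs, d):
    """Positive region: objects whose indiscernibility class w.r.t. attrs
    is consistent on the decision attribute d."""
    blocks = defaultdict(list)
    for u, row in table.items():
        blocks[tuple(row[a] for a in attrs)].append(u)
    return {
        u
        for members in blocks.values()
        for u in members
        if len({table[v][d] for v in members}) == 1
    }

def is_conceptual_reduct(table, X, B, d):
    """(X,B) is a conceptual reduct iff POS(B) = X and POS(C) ≠ X
    for every proper subset C of B."""
    if pos(table, B, d) != X:
        return False
    return all(
        pos(table, list(C), d) != X
        for k in range(len(B))
        for C in combinations(B, k)
    )

# Hypothetical 3-object table: Outlook alone determines the decision.
table = {
    "u1": {"O": "sunny", "T": "hot",  "d": "no"},
    "u2": {"O": "sunny", "T": "mild", "d": "no"},
    "u3": {"O": "rain",  "T": "mild", "d": "yes"},
}
```

Here ({u1,u2,u3}, {O}) passes, while ({u1,u2,u3}, {O,T}) fails the minimality test because dropping T does not shrink the positive region.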
Reduct „Lattice”
[Diagram: the conceptual reducts from the previous slide arranged in a Hasse-style lattice, from (∅, ∅) at the bottom up to ({1-14}, {O,T,W}) and ({1-14}, {O,H,W}) at the top]
Most Interesting Reducts
• Given a conceptual reduct (X,B), we evaluate it with the value F(|X|,|B|)
• The function F: N × N → R should satisfy:
  IF n1 < n2 THEN F(n1,m) < F(n2,m)
  IF m1 < m2 THEN F(n,m1) > F(n,m2)
• So we should maximize F(|X|,|B|) or...
• ... shall we rather search for ensembles?
“Good” Ensembles of Reducts
• Reducts with minimal cardinalities (of attribute sets or rules)
• Reducts with minimal pairwise intersections
Challenge: how to modify the existing attribute reduction methods to search for such „good” ensembles
[Diagram: reducts R1, R2, R3 drawn as nearly disjoint subsets of the set of ATTRIBUTES]
Hybrid Genetic Algorithm (1)
• Genetic part, where each chromosome encodes a permutation σ of the attributes
• Heuristic part, where permutations are put into the following algorithm:
  LET LEFT = A
  FOR i = 1 TO |A| REPEAT
    LET LEFT ← LEFT \ {aσ(i)}
    IF NOT (LEFT ⇒# d) THEN UNDO (restore aσ(i))
  EVALUATE REDUCT LEFT
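The heuristic part of this algorithm can be sketched in a few lines of Python. The criterion # is instantiated here as "keep the full positive region"; function and variable names are my own, and the toy table is hypothetical:

```python
from collections import defaultdict

def pos_size(table, attrs, d):
    """|POS(d|attrs)|: number of objects in decision-consistent blocks."""
    blocks = defaultdict(list)
    for u, row in table.items():
        blocks[tuple(row[a] for a in attrs)].append(u)
    return sum(
        len(members)
        for members in blocks.values()
        if len({table[v][d] for v in members}) == 1
    )

def reduce_by_permutation(table, attrs, d, perm=None):
    """Try to drop each attribute in permutation order; UNDO the removal
    whenever the positive region would shrink (the # criterion)."""
    perm = list(perm) if perm is not None else list(attrs)
    target = pos_size(table, list(attrs), d)
    left = set(attrs)
    for a in perm:
        left.discard(a)
        if pos_size(table, sorted(left), d) < target:
            left.add(a)  # UNDO: attribute a is still needed
    return left

table = {
    "u1": {"a": 0, "b": 0, "d": 0},
    "u2": {"a": 0, "b": 1, "d": 0},
    "u3": {"a": 1, "b": 0, "d": 1},
}
reduct = reduce_by_permutation(table, ["a", "b"], "d", perm=["b", "a"])
```

In the full hybrid algorithm, the genetic layer evolves the permutations themselves, so that different orderings surface different reducts for the ensemble.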
Hybrid Genetic Algorithm (2)
• LET (LEFT, RIGHT) = (∅, A)
• FOR i = 1 TO |U|+|A| REPEAT
    IF σ(i) ∈ {1,...,|U|} THEN
      IF uσ(i) ∈ POS(RIGHT) THEN LET LEFT ← LEFT ∪ {uσ(i)}
    IF σ(i) ∈ {|U|+1,...,|U|+|A|} THEN
      IF POS(RIGHT \ {aσ(i)}) ⊇ LEFT THEN LET RIGHT ← RIGHT \ {aσ(i)}
• EVALUATE REDUCT (LEFT, RIGHT)
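This two-sided variant grows the object set LEFT while shrinking the attribute set RIGHT, driven by one permutation over objects and attributes together. A minimal sketch under my reading of the pseudocode (names and toy data are hypothetical):

```python
from collections import defaultdict

def pos(table, attrs, d):
    """Positive region of decision d w.r.t. the attribute set attrs."""
    blocks = defaultdict(list)
    for u, row in table.items():
        blocks[tuple(row[a] for a in attrs)].append(u)
    return {u for members in blocks.values() for u in members
            if len({table[v][d] for v in members}) == 1}

def conceptual_reduct_from_permutation(table, attrs, d, perm):
    """perm interleaves object and attribute tokens. Object step: admit the
    object into LEFT if RIGHT still places it in the positive region.
    Attribute step: drop the attribute if LEFT stays covered without it."""
    left, right = set(), set(attrs)
    for token in perm:
        if token in table:  # object step
            if token in pos(table, sorted(right), d):
                left.add(token)
        else:               # attribute step
            if left <= pos(table, sorted(right - {token}), d):
                right.discard(token)
    return left, right

table = {
    "u1": {"a": 0, "b": 0, "d": 0},
    "u2": {"a": 0, "b": 1, "d": 0},
    "u3": {"a": 1, "b": 0, "d": 1},
}
left, right = conceptual_reduct_from_permutation(
    table, ["a", "b"], "d", ["u1", "u2", "u3", "b", "a"])
```

Different permutations reach different pairs (LEFT, RIGHT), which is exactly what the genetic layer exploits when assembling an ensemble of conceptual reducts.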
Reduct „Lattice” once more
[Diagram: the same Hasse-style lattice of conceptual reducts, from (∅, ∅) up to ({1-14}, {O,T,W}) and ({1-14}, {O,H,W})]
Feature Clustering / Selection
• Frequent co-occurrence of representatives in reducts suggests splitting clusters
• Rare occurrence of pairs of close representatives suggests merging clusters
Grużdź, Ihnatowicz, Ślęzak: Interactive gene clustering – a case study of breast cancer microarray data. Information Systems Frontiers 8 (2006).
[Diagram: feedback loop between reducts built from cluster representatives and the clusters of attributes]
How About Groups of Rows? (1)
• Data-based knowledge models, classifiers...
• Database indices, data partitioning, data sorting...
• Difficulty with fast updates of such structures...
Two-Level Computing
Large Data (10 TB) & Mixed Workloads
SELECT MAX(A) FROM T WHERE B > 15;
[Diagram: query evaluation over DATA proceeding in three stages: STEP 1, STEP 2, STEP 3]
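The two-level idea is that compact rough statistics over groups of rows (packs) let the engine classify each pack as irrelevant, fully relevant, or suspect before touching the data itself. A simplified sketch of that filtering for this query; the per-pack min/max layout is my own illustration, not Infobright's actual storage format:

```python
# Evaluate SELECT MAX(A) FROM T WHERE B > 15 using per-pack rough statistics.
def max_a_where_b_gt(packs, threshold):
    best = None
    suspects = []
    for pack in packs:
        if pack["b_max"] <= threshold:
            continue  # irrelevant: no row in the pack can satisfy B > threshold
        if pack["b_min"] > threshold:
            # fully relevant: every row qualifies, so the pack-level A maximum
            # contributes without decompressing the pack
            best = pack["a_max"] if best is None else max(best, pack["a_max"])
        else:
            suspects.append(pack)  # suspect: must be examined row by row
    for pack in suspects:
        if best is not None and pack["a_max"] <= best:
            continue  # rough bound shows this pack cannot improve the answer
        for a, b in pack["rows"]:  # decompress and scan only when unavoidable
            if b > threshold:
                best = a if best is None else max(best, a)
    return best

# Three hypothetical packs of (A, B) rows with their rough statistics.
packs = [
    {"b_min": 10, "b_max": 12, "a_max": 7,  "rows": [(5, 10), (7, 12)]},
    {"b_min": 20, "b_max": 30, "a_max": 9,  "rows": [(9, 20), (4, 30)]},
    {"b_min": 10, "b_max": 16, "a_max": 20, "rows": [(20, 10), (15, 16)]},
]
result = max_a_where_b_gt(packs, 15)
```

The first pack is ruled out by its statistics alone, the second is answered from statistics alone, and only the third is actually scanned, which mirrors the three steps on the slide.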
Knowledge Structures (Nodes)
• Order Detail Table (assume many more rows)
• Supplier/Part Table (assume many more rows)
DATA – Best Inspiration
• New Objectives
• New Schemas
• New Volumes
• New Queries
• New Types
• New KNs (Knowledge Nodes)
• ...
References (Unfinished List)
• D. Ślęzak, J. Wróblewski, V. Eastwood, P. Synak: Brighthouse - An Analytic Data Warehouse for Ad-hoc Queries. VLDB 2008: 1337-1345.
• D. Ślęzak: Rough Sets and Few-Objects-Many-Attributes Problem - The Case Study of Analysis of Gene Expression Data Sets. FBIT 2007: 437-440.
• D. Ślęzak: Rough Sets and Functional Dependencies in Data - Foundations of Association Reducts. To appear.
• ...
THANK YOU!! www.infobright.com www.infobright.org slezak@infobright.com