Improving the Naming Process for Web Site Reverse Engineering
Selima Besbes Essanaa, Nadira Lammari
ISID - CEDRIC Laboratory - CNAM - Paris

Introduction
The context of the research work: reverse engineering of Web sites.
The naming process:
• Assigns labels to words, to concepts, etc.
• Concerns various computer science domains
• Is a challenging task
Our contribution: improving the process by reducing the number of objects to name.
NLDB'04 - Besbes - Lammari

Agenda
• Introduction
• RetroWeb overview
• The naming process in RetroWeb
• Conclusion and Perspectives

RetroWeb Overview
• RetroWeb is based on the inversion of the design life cycle of a Web application.
• RetroWeb is applied to semi-structured and undocumented sites.
• RetroWeb gives a description of the informative content of the site at various abstraction levels.
• RetroWeb uses meta-models and reverse engineering rules.
Process: HTML Pages → Extraction → Physical views → Conceptualization → EER Schemas → Integration → Global EER Schema

The Naming Process in RetroWeb (1)
Extraction
• Pre-treatment phase: HTML Pages → Coded sequence
• Extraction phase: Coded sequence → Unnamed physical views
• Naming phase: Unnamed physical views → Physical views
Conceptualization
• Physical views transformation phase: Physical views → Logical views
• Logical views transformation phase: Logical views → Unnamed EER Schemas
• Naming phase: Unnamed EER Schemas → EER Schemas
Integration: EER Schemas → Global EER Schema

Example (1)
A page from an academic journal publication Web site:
…
Volume N° 19 (3)
Competitive Strategy, Economics, and the Internet
Authors: Chircu Alina M. and Kauffman Robert J.
Volume N° 19 (2)
Enterprise Resource Planning
Authors: Ragowsky Arik
…

Example (2)
[Figure: naming of a physical view]
Legend: multi-valued type variable, composed type variable, simple variable, simple variable domain.
Before naming, the physical view contains the unnamed variables MV1, CV1, MV2, CV2, SV1, SV2, SV3 and SV4, with the following simple variable domains:
• N° Volume 19(3), N° Volume 19(2), …
• Competitive … Internet, Enterprise … Planning, …
• Alina M., Robert J., Arik, …
• Chircu, Kauffman, Ragowsky, …
After naming, the variables carry the labels Volume-Authors, Volume, Volume Number, Title, Authors, Author, First Name and Last Name.

The Naming Process in RetroWeb (1)
The naming phase comprises two steps:
• Defining concept classes: determines automatically classes of concepts that may share the same labels. Its output is the set of concept classes.
• Assigning names to concepts: finds labels and assigns them to concepts.

Defining Concept classes (1)
Based on the comparison of the definition domains of concepts:
• Definition domain of a simple variable = set of its instances
• Definition domain of another type of variable = set of variables constructing this variable
• Definition domain of an entity type = set of properties describing this entity type
IF D(C1) ⊆ D(C2) THEN any label assigned to C1 can also be assigned to C2
We have to build the IS_A hierarchy of concept classes. A label found for a concept may be assigned to all the concepts of its class and to all the concepts of the sub classes.
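
The propagation rule stated on this slide (a label found for a concept may go to the concepts of its class and of the sub classes) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not RetroWeb's implementation: the class names K1 to K3, the dictionaries `members` and `subclasses`, and the function `propagate_label` are all hypothetical.

```python
def propagate_label(label, concept, members, subclasses):
    """Return {concept: label} for every concept that may receive `label`.

    members:    class name -> set of concepts in that class (hypothetical layout)
    subclasses: class name -> list of direct sub class names in the IS_A hierarchy
    """
    # Find the class that contains the concept the label was found for.
    start = next(k for k, cs in members.items() if concept in cs)
    targets, stack, visited = set(), [start], set()
    while stack:
        k = stack.pop()
        if k in visited:
            continue
        visited.add(k)
        targets |= members[k]                # concepts of the class itself
        stack.extend(subclasses.get(k, []))  # then walk down to the sub classes
    return {c: label for c in sorted(targets)}


# Made-up hierarchy for illustration: K2 and K3 are sub classes of K1.
members = {"K1": {"C1", "C2"}, "K2": {"C3"}, "K3": {"C4"}}
subclasses = {"K1": ["K2", "K3"]}

print(propagate_label("Author", "C1", members, subclasses))
print(propagate_label("Author", "C3", members, subclasses))
```

A label found for C1 reaches all four concepts, whereas a label found for C3 stays in K2, because propagation only goes downwards in the hierarchy.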

Defining Concept classes (2)
The relations between definition domains are expressed through existence constraints:
• Case a: D(C1) = D(C2), hence C1 ↔ C2
• Case b: D(C1) ⊂ D(C2), hence C1 → C2
• Case c: D(C1) ∩ D(C2) = ∅, hence C1 ↮ C2
• Otherwise (D(C1) and D(C2) partially overlap): thresholds define the bigness or the smallness of the intersection and of the differences, and the considered case is then assimilated to case a, b or c.
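
The exact cases can be illustrated with plain set comparisons. The sketch below assumes that case a corresponds to equal domains (mutual constraint), case b to strict inclusion (conditioned constraint) and case c to disjoint domains (exclusive constraint). The function name `compare_domains` and the returned strings are hypothetical, and the sample domains are made up from the last names of the example page.

```python
def compare_domains(d1, d2):
    """Map two definition domains to an existence constraint (exact cases only)."""
    d1, d2 = set(d1), set(d2)
    if d1 == d2:
        return "C1 <-> C2"          # case a: mutual existence constraint
    if d1 < d2:
        return "C1 -> C2"           # case b: D(C1) strictly included in D(C2)
    if d2 < d1:
        return "C2 -> C1"           # case b with the roles swapped
    if not (d1 & d2):
        return "C1 exclusive C2"    # case c: disjoint domains
    return "undetermined"           # partial overlap: thresholds are needed


print(compare_domains({"Chircu", "Kauffman"}, {"Chircu", "Kauffman", "Ragowsky"}))
print(compare_domains({"Chircu"}, {"Ragowsky"}))
print(compare_domains({"Chircu", "Kauffman"}, {"Kauffman", "Ragowsky"}))
```

The last call falls outside the three exact cases, which is the situation the thresholds are meant to resolve.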

Defining Concept classes (3)
C1 – C2 and C2 – C1 denote the differences D(C1) – D(C2) and D(C2) – D(C1); Intersection denotes D(C1) ∩ D(C2).

C1 – C2            | Big     | Big     | Small   | Small
C2 – C1            | Big     | Small   | Big     | Small
Intersection Big   | C1 ↮ C2 | C2 → C1 | C1 → C2 | C1 ↔ C2
Intersection Small | C1 ↮ C2 | C1 ↮ C2 | C1 ↮ C2 | C1 ↮ C2
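
A threshold-based version of the comparison might look like the sketch below. Several choices are assumptions, not taken from the slides: each difference is measured as a fraction of its own domain, the intersection as a fraction of the smaller domain, and the threshold values 0.2 and 0.5 are arbitrary. The table also assumes that the cell where both differences are big, and every cell with a small intersection, denote the exclusive constraint. The names `is_big`, `TABLE` and `assimilate` are hypothetical.

```python
def is_big(part, whole, threshold):
    """A set counts as 'big' when it covers at least `threshold` of `whole`."""
    return len(whole) > 0 and len(part) / len(whole) >= threshold


# Decision table keyed by (C1-C2 big?, C2-C1 big?, intersection big?).
TABLE = {
    (True, True, True): "C1 exclusive C2",   # assumption about this cell
    (True, False, True): "C2 -> C1",
    (False, True, True): "C1 -> C2",
    (False, False, True): "C1 <-> C2",
}


def assimilate(d1, d2, diff_threshold=0.2, inter_threshold=0.5):
    """Assimilate a pair of domains to one of the existence constraints."""
    d1, d2 = set(d1), set(d2)
    smaller = d1 if len(d1) <= len(d2) else d2
    key = (
        is_big(d1 - d2, d1, diff_threshold),
        is_big(d2 - d1, d2, diff_threshold),
        is_big(d1 & d2, smaller, inter_threshold),
    )
    # Any small intersection is read as the exclusive constraint (assumption).
    return TABLE.get(key, "C1 exclusive C2")


# Made-up integer domains: D(C1) is almost entirely contained in D(C2).
print(assimilate(set(range(10)), set(range(1, 20))))
```

With these made-up domains only one of the ten elements of D(C1) falls outside D(C2), so the pair is treated as the inclusion case even though the inclusion is not exact.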

Defining Concept classes: The Algorithm
Step 1: Determine valid classes
• Using conditioned existence constraints
• Using mutual existence constraints
• Using exclusive existence constraints
Step 2: Organize them into an inclusion graph
Step 3: Derive the Is_A hierarchy
Example with four concepts C1, C2, C3, C4 and their definition domains D(C1), D(C2), D(C3), D(C4):
• Existence constraints: C1 ↔ C2, C3 → C1, C4 → C1, C3 ↮ C4
• Valid classes: {C1, C2}, {C1, C2, C3}, {C1, C2, C4}
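
Steps 1 and 2 can be sketched in Python under a simplifying assumption: a valid class is taken to be the closure of a single concept under the conditioned and mutual constraints, kept only if it contains no exclusive pair. The slides do not spell out the procedure, so this is an illustration rather than the authors' algorithm; the function names `valid_classes` and `inclusion_graph` are hypothetical, and the constraints are read as C1 ↔ C2, C3 → C1, C4 → C1, with C3 and C4 mutually exclusive.

```python
def valid_classes(concepts, implies, exclusive):
    """Step 1 (simplified): one candidate class per concept.

    implies:   set of (x, y) pairs meaning x -> y; a mutual constraint is two pairs
    exclusive: set of frozensets {x, y} that may never appear together
    """
    def closure(start):
        seen, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for a, b in implies:
                if a == x and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return frozenset(seen)

    candidates = {closure(c) for c in concepts}
    # Discard any candidate that contains an exclusive pair.
    return {k for k in candidates if not any(pair <= k for pair in exclusive)}


def inclusion_graph(classes):
    """Step 2: edges (small, big) between classes with no class in between."""
    return {
        (a, b)
        for a in classes
        for b in classes
        if a < b and not any(a < m < b for m in classes)
    }


concepts = ["C1", "C2", "C3", "C4"]
implies = {("C1", "C2"), ("C2", "C1"), ("C3", "C1"), ("C4", "C1")}
exclusive = {frozenset({"C3", "C4"})}

classes = valid_classes(concepts, implies, exclusive)
for small, big in sorted(inclusion_graph(classes), key=lambda e: sorted(e[1])):
    print(sorted(small), "is included in", sorted(big))
```

Step 3 is left out because the slides do not detail how the Is_A links are derived from the inclusion graph.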

Assigning Labels
• Manual (except for simple-type variables)
• Based on a set of heuristics (for simple variables)
Examples:
H1: An invariant string in all instances of a simple variable is a potential label.
{Volume N° 19(3), Volume N° 19(2), …} → Volume N°
H2: IF a value domain contains the symbol « @ » THEN the corresponding simple variable is an electronic address.
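
The two heuristics could be sketched as follows. The sketch rests on assumptions that the slides do not state: the invariant string of H1 is taken to be the run of leading whitespace-separated tokens shared by all instances, and H2 is applied only when every value contains « @ ». The function names and the label string "Electronic address" are hypothetical.

```python
def invariant_label(instances):
    """H1 (simplified): leading tokens shared by all instances, or None."""
    token_lists = [s.split() for s in instances]
    if len(token_lists) < 2:
        return None  # a single instance gives no evidence of invariance
    common = []
    for column in zip(*token_lists):
        if len(set(column)) != 1:
            break
        common.append(column[0])
    return " ".join(common) or None


def looks_like_email_domain(instances):
    """H2 (simplified): every value of the domain contains the symbol '@'."""
    return bool(instances) and all("@" in s for s in instances)


def suggest_label(instances):
    """Try H2 first, then H1; return None when neither heuristic applies."""
    if looks_like_email_domain(instances):
        return "Electronic address"
    return invariant_label(instances)


print(suggest_label(["Volume N° 19(3)", "Volume N° 19(2)"]))
print(suggest_label(["Chircu", "Kauffman", "Ragowsky"]))
```

When no heuristic applies, as for the last names in the second call, the function returns None and the label has to be found by other means.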

Conclusion and Perspectives
• This research work applies an algorithm that recovers concepts from a flat set of data spread across all the pages of the Web site.
• The naming of recovered concepts has only just been initiated.
• The enrichment of the set of heuristics is in progress.
• The use of ontologies to find pertinent labels
• The study of the applicability of learning approaches for the naming in our context