Component Search and Retrieval Advanced Reuse Seminars Eduardo Cruz
Information Retrieval - 1948 • Structured Documents • Unstructured Documents • No software documentation standard • Semi-Structured Documents Calvin Northrup Mooers
Mooers' Law: “An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have it.” Calvin Northrup Mooers, 1959
Mass Produced Software Components [McIlroy, 1968]
“[My thesis is that the] software industry is weakly founded, and that one aspect of this weakness is the absence of a software components subindustry” [McIlroy, 1968]
“The storage and retrieval of software assets is nothing but a specialized form of information storage and retrieval” [Mili, 1998]
Software Library • Browsing – Inspecting without a predefined criterion • Retrieval – Satisfy a predefined matching criterion
Classification Scheme • Facet-based • Better than hierarchical classification • Manual classification different facets • Automatic classification • Controlled Vocabulary • Semantic information • Uncontrolled Vocabulary • Big software libraries • Little or no descriptors
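A minimal sketch of facet-based classification with a controlled vocabulary, assuming each asset is described by one term per facet. The facet names (`function`, `object`, `language`), the terms, and the asset names are hypothetical and only illustrate the idea; they are not taken from any particular library.

```python
# Hypothetical facets and terms; a real library defines its own controlled vocabulary.
LIBRARY = {
    "QuickSorter": {"function": "sort", "object": "array", "language": "java"},
    "ListSorter": {"function": "sort", "object": "list", "language": "java"},
    "ArraySearcher": {"function": "search", "object": "array", "language": "java"},
}


def retrieve(library, **query):
    """Return the names of assets whose facets match every term in the query."""
    return sorted(
        name
        for name, facets in library.items()
        if all(facets.get(facet) == term for facet, term in query.items())
    )


sorters = retrieve(LIBRARY, function="sort")
array_sorters = retrieve(LIBRARY, function="sort", object="array")
```

Adding a facet to the query narrows the result set, which is the practical appeal of facets over a single hierarchy: each facet can be constrained independently.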
Recall and Precision • High Precision – Most retrieved elements are relevant • High Recall – Few elements left behind • Spreading Activation (Relaxed Search) – Related matches are retrieved • Coverage – The average number of assets that are visited over the total size of the library
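The measures above can be sketched in a few lines, assuming the sets of retrieved and relevant assets are known for a query. The asset names, the library size of 10, and the per-query visit counts are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query over sets of asset names."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


def coverage(visited_per_query, library_size):
    """Average number of assets visited per query, over the library size."""
    average_visited = sum(visited_per_query) / len(visited_per_query)
    return average_visited / library_size


# Hypothetical query: 4 assets are relevant, 3 are retrieved, 2 of those are relevant.
relevant = {"Stack", "Queue", "Deque", "PriorityQueue"}
retrieved = {"Stack", "Queue", "Logger"}

precision, recall = precision_recall(retrieved, relevant)
cov = coverage([3, 5, 4], library_size=10)
```

Here precision is 2/3 (one retrieved asset is irrelevant) and recall is 1/2 (two relevant assets were left behind), which shows how the two measures can diverge for the same query.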
Asset Representation • Library representation is made in full knowledge of the artifact. User representation is made in ignorance of the artifact • Asset representation is purposefully abstract to capture important features while overlooking minor or irrelevant details • The term “asset surrogate” is used in the retrieval literature
Asset retrieval Goals • Exact retrieval – Black box reuse • Approximate retrieval – White box reuse • Generative modification – Reusing the design • Compositional modification – using building blocks of the retrieved asset
Information usually not included • Interface description • Non-functional requirements • Interoperability
Situational Model x System Model Component retrieval model [Lucrédio et al., 2004]
“Repository representation is made in full knowledge of the artifact at hand” “User representation is made in ignorance of the artifact” [Mili, 1998]
Component Search Tools • Web: Delphi Search Engine, Ispey, CSourceSearch.net (2004), Gonzui, SourceBank, Koders (2004), Codase (2005) • Applications: Agora (1998), Codebroker (2002), Koders Enterprise (2004), Maracatu (2005)
Filter SPARS-J – (2003)
SourceBank Filter
CODASE – Launched Sep 9, 2005 Multiple Search Options Example Searches Browsing “…based on the number of people in your company, starting from $5,000 USD”
AGORA - Location and Indexing (1998) [Figure: architecture diagram with labels Internet, JavaBeansAgent, JavaBeansIntrospector, Index, AltaVista Search Index Server, Filter, AltaVista Query Server, Web Server]
Component Rank (1998) • Graph G • Nodes v • Edges e • Weight w • Distribution Ratio d [Figure: example graph with nodes V1, V2, V3 and distribution ratios D12 = 0.5, D13 = 0.5, D23 = 1, D31 = 1; node and edge weights shown: 0.4 and 0.2]
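A minimal sketch of the weight computation on the example graph, assuming that each node passes its weight along its outgoing edges in proportion to the distribution ratios and that the weights are iterated until they stop changing. The function name, the tolerance, and the uniform starting weights are choices made for this sketch, not details given on the slide.

```python
def component_rank(ratios, tolerance=1e-12, max_iterations=10_000):
    """Iterate node weights until they stop changing.

    ratios[i][j] is the distribution ratio d_ij: the share of node i's
    weight passed along the edge i -> j. Every node must appear as a key,
    and each node's ratios are expected to sum to 1.
    """
    nodes = list(ratios)
    weights = {v: 1.0 / len(nodes) for v in nodes}  # start uniform, total 1
    for _ in range(max_iterations):
        updated = {v: 0.0 for v in nodes}
        for source, targets in ratios.items():
            for target, d in targets.items():
                updated[target] += weights[source] * d
        change = max(abs(updated[v] - weights[v]) for v in nodes)
        weights = updated
        if change < tolerance:
            break
    return weights


# Distribution ratios from the example graph: D12, D13, D23, D31.
ratios = {
    "V1": {"V2": 0.5, "V3": 0.5},
    "V2": {"V3": 1.0},
    "V3": {"V1": 1.0},
}
weights = component_rank(ratios)
```

With these ratios the iteration settles at weights of 0.4 for V1, 0.2 for V2, and 0.4 for V3, which agrees with the 0.4 and 0.2 values in the figure.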
“Classes defining data structures and their containers are highly ranked”
Clustered Component Graph • V1 ≡ V4, V2 ≡ V6 [Figure: original graph with nodes V1 to V7; clustered graph with nodes V’14, V’26, V’3, V’5, V7]
No more multiple disconnected components [Figure: graph with nodes V1 to V7]
Component Rank System Architecture • .java file ≡ component • Input • (1) Similarity Measurement • (2) Clustering • (3) Use Relation Extraction • (4) Component Graph Construction • (5) Component Rank Computation by Repetition • (6) De-Clustering to Original Component Graph • Output: order of weights ≡ Component Rank of .java files
Simple Copied Components [Figure: non-clustered component graph with nodes A, A’, B, B’, X, X’, Y, Y’ (legend: copied components, other components), compared under “clustering before weight computation” and “clustering after weight computation”; weights shown: 1/4, 1/6, 1/3]
DO NOT COUNT SIMPLY DUPLICATED COMPONENTS
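One way to avoid counting duplicates is to merge copied components into a single node before any weights are computed. The sketch below assumes the clusters are already known (the slides do not specify the similarity measure), and the use relations in the example are hypothetical.

```python
def cluster_graph(uses, clusters):
    """Merge copied components into one node each.

    uses: dict mapping a component to the set of components it uses.
    clusters: list of sets; each set holds components judged to be copies.
    Components not listed in any cluster keep their own node.
    """
    representative = {}
    for group in clusters:
        label = "+".join(sorted(group))
        for member in group:
            representative[member] = label

    merged = {}
    for source, targets in uses.items():
        src = representative.get(source, source)
        merged.setdefault(src, set())
        for target in targets:
            dst = representative.get(target, target)
            merged.setdefault(dst, set())
            if dst != src:  # drop use relations inside one cluster
                merged[src].add(dst)
    return merged


# Hypothetical example: A' is a plain copy of A; X uses A and Y uses A'.
uses = {"X": {"A"}, "Y": {"A'"}, "A": set(), "A'": set()}
clustered = cluster_graph(uses, clusters=[{"A", "A'"}])
```

After clustering, X and Y both point at the single node `A+A'`, so the copy no longer splits the incoming use relations between two nodes.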
Copied and Modified Components [Figure: component graphs with nodes A, A’, B, B’, C, C’, X, X’, Y, Y’ (legend: copied and modified components, original components, other components); panels: non-clustered component graph, clustering before weight computation; weights shown: 2/5, 1/5, 1/3, 1/6]
Beyond Searching and Browsing • Searching and browsing • Require users to initiate the information seeking process • Information access and Information Delivery
CodeBroker – (2001) • Component repositories are often so large that software developers cannot learn about all of the components • Component repositories are not static • New components added • Old components updated • Context-Aware browsing
• May not have sufficient knowledge about the reuse repository • May perceive that reuse costs more than developing from scratch • May not be able to use the repository by formulating a proper query • May not be able to understand the found components
L4: Entire Information Space Information Islands Belief Vaguely Known Well Known Unknown components
CodeBroker • L4: Entire Information Space • L3: Belief • L2: Vaguely Known • L1: Well Known • Information Use: L1 – Use by Memory, L2 – Use by Recall, L3 – Use by Anticipation, L4 – Use by Delivery [Figure labels: Already Known Components, Task Relevant Information, Irrelevant Components]
Program Aspects • Concept • Formal • Informal • Indentation, comments, identifier names (semantic) • Executability • Code • Constraint environment • Signature
Information delivery • Feedback • After execution of the action • Feedforward • Affects the execution of the action
Information delivery • Interruptive • Noninterruptive
Latent Semantic Analysis (LSA) • Synonymy • Polysemy • “Text documents and queries are represented as vectors in the semantic space, based on the words contained and the similarity between a query and a document is determined by the distance of their respective vectors”
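The quoted description can be written out as follows, using the standard LSA formulation. Here X is the term-by-document matrix, k is the number of retained dimensions, q is the query's term vector, and \hat{d}_j is the representation of document j in the semantic space (row j of V_k). Measuring closeness by the cosine of the angle between vectors is an assumption of this sketch; the quotation only speaks of the "distance" between vectors.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q, \qquad
\operatorname{sim}(q, d_j) =
  \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

Because the truncated decomposition maps co-occurring words to nearby directions, a query can match a document that shares none of its exact words, which is how LSA addresses synonymy.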
Comments signature Discourse model User model
M.A.R.A.C.A.T.U. – Modern Architecture for Retrieving All Components At The Universe (2005)