Building Structured Community Portals: A Top-Down Approach

Building Structured Community Portals:A Top-Down, Compositional, and Incremental ApproachBy Pedro DeRose, Warren Shen, Fei Chen, Anhoi Doan, RaghuRamakrishnan Presented By – Yogesh A. Vaidya

Introduction • What are Structured Web Community Portals? • Advantages of SWCP • powerful capabilities for searching, querying and monitoring community information. • Disadvantage of current approaches • No standardization of process. • Domain specific development. • Current approaches in research community • Top Down, Compositional and Incremental Approach • CIMPLE Workbench • Toolset for development • DBLife case study

What are Community Portals? • These are portals that collect and integrate relevant data obtained from various sources on the web. • These portals enable members to discover, search, query, and track interesting community activities.

What are Structured Web Community Portals? • They extract and integrate information from raw Web Pages to present a unified view of entities and relationships in the community. • These portals can provide users with powerful capabilities for searching, querying, aggregating, browsing, and monitoring community information. • They can be valuable for communities in a wide variety of domains ranging from scientific data management to government agencies and are an important aspect of Community Information Management.

Example

Current Approaches towards SWCP used in Research Community • These approaches maximize coverage by looking at the entire web to discover all possible relevant web sources. • They then apply sophisticated extraction and integration techniques to all data sources. • They expand the portal by periodically running the source discovery step. • These solutions achieve good coverage and incur relatively little human effort. • Examples: Citeseer, Cora, Rexa, Deadliner

Disadvantages of Current Approaches • These techniques are difficult to develop, understand and debug • That is, they require builders to be well versed in complex machine learning techniques. • Due to monolithic nature of the employed techniques, it is difficult to optimize the run-time and accuracy of these solutions.

Top-Down Compositional and Incremental Approach • First, select a small set of important community sources. • Next, create plans that extract and integrate data from these sources to generate entities and relationships. • These plans can act on different sources and use different extraction and integration operators and are hence not monolithic. • Executing this operators yields an initial structured portal. • Then expand the community by monitoring certain sources for mentions of new sources.

Step 1: Selecting Initial Data Sources • Basis for selecting small set of initial data sources is 80-20 phenomenon that applies to Web Community data sources. • That is, 20% of sources often cover 80% of interesting community activities. • Thus the sources used should be highly relevant to the community. • Example: For the Database research community, the portal builder can select homepages of top conferences (SIGMOD, PODS, VLDB, ICDE) and the most active researchers (such as, PC members of top conferences, or those with many citations.)

Selection of Sources contd… • In order to assist the portal builder B to select sources, a tool called RankSource is developed that provides B with the most relevant data sources. • Here B first collects as many community sources as possible (using methods like focused crawling, query search engine etc). • B applies RankSource to these sources to get sources in decreasing order of relevance to the community. • B examines the ranked list, starting from top and selects truly relevant data sources.

RankSource principles: • Three relevance ranking strategies used by RankSource tool: • PageRank only • PageRank + Virtual Links • PageRank + Virtual Links + TF-IDF

PageRank only • This version exploits the intuition that • Community sources often link to highly relevant sources • Sources linked to/by relevant sources are also highly relevant. • Formula used for giving rank: n P(u) = (1-d) + d∑i=1 P(vi)/c(vi). • PageRank only version achieves limited accuracy because some highly relevant sources are often not linked

PageRank + Virtual Links • This is based on the assumption that if a highly relevant source discusses an entity, then all other sources which discuss that entity may also be highly relevant. • Here, first create virtual links between sources that mention overlapping entities, and then do PageRank. • From results, the accuracy reduces after using PageRank + Virtual Links. • The reason for reduction in accuracy is that virtual links are created even with those sources for which a particular entity may not be their main focus.

PageRank + Virtual Links + TF-IDF • From previous case, we want to ensure that we add a virtual link only if both the sources are relevant for the concerned entity. • Hence in this case use of TF-IDF metric for measuring relevance of the document is done. • TF is term frequency given by number of times entity occurs in a source divided by total number of entities (including duplicates) in the source. • IDF is inverse document frequency given by logarithm of total number of sources divided by number of sources in which a particular entity occurs.

Contd… • TF-IDF score is given by TF*IDF. • Next, for each source filter out all entities for which TF-IDF score is below a certain level Ѳ. • After this filtering, apply PageRank + Virtual Links on the results obtained. • This approach gives better accuracy than both of the previously discussed techniques.

Step 2: Constructing the E-R graph • After selecting the sources, B first defines E-R schema G that captures entities and relationships of interest to the community. • This schema consists of many types of entities (e.g., paper, person) and relationships (e.g., write-paper, co-author). • B then crawls daily (actual frequency dependent on domain/developer) to first create a snapshot W of all the concerned data web pages. • B then applies plan Pdaywhich consists of other plans to extract entities and relations from this snapshot to create a daily E-R graph. • B then merges these daily graphs obtained (using plan Pglobal ) to create a global E-R graph on which it offers various services like querying, aggregating etc.

Create Daily Plan Pday • Daily Plan (Pday ) takes daily snapshot W (of web pages to work on) and E-R schema G as input and produces as output a daily E-R graph D. • Here, for each entity type e in G, B creates a Plan Pe that discovers all entities of type e from W. • Next, for each relation type r, B creates a plan Pr that discovers all instances of relation r that connects the entities discovered in the first step.

Workflow of Pday in database domain:

Plans to discover entities • Two plans, Default Plan and Source Aware Plan, are used for discovering entities. • Default Plan (Pdefault) uses three operators • ExtractM to find all mentions of type e in web pages in W, • MatchM to find matching mentions, that is, those referring to the same real world entity, thus forming groups g1,g2,…,gk • and CreateE to create an entity for each group of mentions • The problem of encapsulating and matching mentions as encapsulated by ExtractM and MatchM is known to be difficult and complex implementations are available.

Default Plan contd… • But in our case, since we are looking at community-specific, highly relevant sources, a simple dictionary-based solution that matches a collection of entity names N against the pages in W to find mentions gives accurate results in most of the cases. • Assumption behind this is that in a community entity names are often designed to be as distinct as possible.

Source Aware Plan • Simple dictionary-based extraction and matching may not be appropriate for all cases especially where ambiguous sources are possible. • In DBLife, DBLP is an ambiguous source since it contains information pertaining to other communities too. • To match entities in DBLP sources, a source aware plan is used, which is a stricter version of an earlier default plan.

Source Aware Plan Contd… • First apply simple default plan or Pname (only Extract and Match operators) to all unambiguous sources to get Results R, which is groups of related mentions. • Add to this result set R mentions extracted from DBLP to form result set U. • For matching mentions in U, not just names is used, but also their context in the form of related persons. • Two names are matched only if they have similar names and are related by at least one related person.

Plans for extracting and matching mentions:

Plans to find Relations: • Plans for finding relations are generally domain specific and vary from relation to relation. • Hence we try to find types of relations that commonly occur in community portals and then create plan templates for each one. • First amongst these relation types is Co-occurrence relations. • In Co-occurrence relations, we compute CoStrength between the concerned entities and register the match if the score of CoStrength is greater than a certain threshold. • CoStrength is calculated using ComputeCoStrength operator, for which input is entity pair(e,f) and output is number that quantifies how often and closely mentions of e and f occur. • Example: Co-occurrence relations are: write-paper(person,paper), co-author(person,person)

Label Relations • Second type of relations we focus on are Label Relations. • Example: A label relation is served(person,committee) • We use the ExtractLabel operator to find an instance of a label immediate to a mention of an entity (e.g. Person). • This operator can thus be used by B to make plans that find label relations.

Plans for finding relations:

Neighborhood Relations: • Third type of relations we focus on are Neighborhood relations. • Example: A neighborhood relation is talk(person, organization) • These relations are similar to label relations with the difference that we here consider a window of words instead of immediate occurrence.

Decomposition of Daily Plan • For a real world community portal, the E-R schema is very large and hence it would be overwhelming to cover all entities and relations in a single plan. • Decomposition aids splitting of tasks across people and also makes evolution of schema possible in a smooth manner. • Decomposition is followed by the merge phase in which E-R fragments from individual plans are merged into a complete E-R graph. • We use MatchE and EnrichE operators to find matching entities from different graph fragments, and merge them into a single day E-R graph.

Decomposition and Merging of Plans:

Creating Global plan • Generating global plan is similar to constructing a daily plan from individual E-R fragments. • Hereto operators MatchE and EnrichE are used.

Step 3: Maintaining and Expanding • The Top-down, compositional and incremental approach makes maintaining the portal relatively easy. • The main assumption is that relevance of data sources changes very slowly over time and new data sources would be mentioned within the community in certain sources. • In expanding, we add new relevant data sources. • Approach used for finding new sources is to look for them only at certain locations (e.g. looking for conference announcements at DBWorld.)

CIMPLE Workbench • This workbench consists of an initial portal “shell” (with many built-in administrative controls,) a methodology on populating the shell, a set of operator implementations, and a set of plan optimizers. • The workbench facilitates easy and quick development since a portal builder can make use of built-in facilities. • Workbench provides implementations of operators and also gives facility for builders to have their own implementations.

Case Study: DBLife • DBLife is a structured community portal for the database community. • It was developed as a proof of concept for developing community portals using the top-down approach and the CIMPLE workbench. • Portal is a proof of development of SWCP achieving high accuracy with current set of relatively simple operations.

DBLife lifecycle:

Conclusion & Future Work • Experience with developing DBLife suggests that this approach can effectively exploit common characteristics of Web communities to build portals quickly and accurately using simple extraction/integration operators, and to evolve them over time efficiently with little human effort. • Still a lot of research problems need to be studied: • How to develop a better compositional framework? • How to make such a framework as declarative as possible? • What technologies (XML, relational, etc) need to be used to store portal data?

Building Structured Community Portals: A Top-Down Approach