Web Service Composition and Record Linking Mark Cameron, Kerry Taylor & Rohan Baxter, CSIRO Information & Communication Technologies Centre
Problem… • As more and more data sources become available to an information integration system, it becomes more difficult to track an individual entity. • This is especially true where there is no unique global identifier for an entity • There are many causes: • Data sources contain inconsistent information • Data collected for different purposes • Variable quality control of data entry • Currency of data • [Figure: New Patients and All Patients datasets]
Problem & One Solution… • How can we reconcile entity identity across multiple inconsistent data sources? • Apply record-linking technique(s) • Record linking is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources.
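By way of illustration, a pairwise linking decision typically combines field-level similarity scores across a record pair. A minimal Python sketch using the standard library's difflib; the field names and records are invented for illustration, not taken from the registry example:

```python
from difflib import SequenceMatcher

def field_similarity(a: str, b: str) -> float:
    """Normalised string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_score(rec_a: dict, rec_b: dict,
                 fields=("gname", "sname", "dob")) -> float:
    """Average similarity across the compared fields."""
    return sum(field_similarity(str(rec_a[f]), str(rec_b[f]))
               for f in fields) / len(fields)

new_patient = {"gname": "Jon", "sname": "Smith", "dob": "1970-01-02"}
existing    = {"gname": "John", "sname": "Smith", "dob": "1970-01-02"}
score = record_score(new_patient, existing)
```

A real linking service would weight fields by their discriminating power rather than averaging them uniformly.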
Record Linking • [Figure: a Link Table connecting New Patients with All Patients]
Record Linking • [Figure: classification of compared record pairs into match, partial match and no match] • But record linking is a difficult task for non-specialists • Need to know how to go about record linking to avoid the worst-case n² record-comparison scenario • Access to (multiple) specialist applications • Commercial systems cost big $$$$ • Academic systems are free but built for research: • Febrl, freely extensible biomedical record linkage (Python) http://sourceforge.net/projects/febrl • SecondString (Java string comparison library) • Write task specifications for each application • Bewildering array of parameters for most functions
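The match / partial match / no match classification shown in the figure can be sketched as a two-threshold rule in the style of Fellegi–Sunter; the threshold values here are invented for illustration:

```python
def classify(score: float, upper: float = 0.85, lower: float = 0.5) -> str:
    """Two-threshold classification: scores above `upper` are matches,
    scores below `lower` are non-matches, and the band in between is a
    partial match requiring clerical review."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "no match"
    return "partial match"
```

Tuning the two thresholds trades false matches against clerical-review workload, which is one of the parameter choices that makes the task hard for non-specialists.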
Record Linking • Wouldn’t it be nice if… • Non-specialists could use record-linking services for their domain-specific purposes • We could mix and match different linking service implementations as required • Have flexibility of both high-level (non-specialist) and low-level (specialist) use of record linking services • Non-specialists just use a virtual process definition, supplying data and parameters as needed • Specialists can set low-level parameters • Specialists can modify virtual process specifications
Our Approach… • Aim to support service composition to deliver knowledge-intensive applications and data products • rapidly • flexibly/adaptively • scalably (in complexity, numbers of resources, data volumes) • knowledgeably • Consequently, need • Data-level interoperability • Data integration through views (where standards may not apply) • Fine-grained services • Machine-readable, fine-grained description • Sensible data management • And • Declarative, goal-directed composition (like SQL) • Tools for extracting, merging and displaying the knowledge • Put knowledge tools in the hands of the domain expert • like a spreadsheet
Information Integration Theory… • I = <G,M,S> (Lenzerini 2002): • Source schemas (S) • A (local) representation of the data and services available to or known about by the IIS. • Mapping statements (M) • Expressions that map schemas and services between G and S in one of four styles: P2P; GAV; LAV; GLAV. • A global schema (G) • This is a (global) representation of the (integrated) domain, including real and virtual data schemas, real and virtual transformation services and domain constraints, against which actors can address queries: Q::q(G) • The system essentially turns a user query Q::q(G) into a union of conjunctive queries against individual resources Q::q’(S)
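The GAV style, for instance, defines each relation of the global schema G as a view over the source schemas S, so a global query unfolds into a union of conjunctive queries over the sources. A toy Python sketch; the relation names and data are invented for illustration:

```python
# Two source relations (S): patient tuples (id, gname, sname)
source_lab = [("p1", "Jon", "Smith"), ("p2", "Ann", "Lee")]
source_reg = [("r1", "John", "Smith")]

def patients():
    """GAV-style global relation patients(id, gname, sname),
    defined as a union view over the two sources."""
    yield from source_lab
    yield from source_reg

# A query Q::q(G) against the global schema unfolds into a union of
# conjunctive queries against the sources -- here, a simple selection.
smiths = [r for r in patients() if r[2] == "Smith"]
```

LAV inverts the direction (each source is described as a view over G), which makes adding sources easier but query rewriting harder.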
Our Approach… • [Architecture diagram: Runtime Environment, Composition Compiler, Workflow Engine, Domain GUI, Domain Model, Mappings, Integration Database, Data Resources, Transformation Resources]
Our Approach… • [Diagram: User Query and Declarative Resource Model feed the Integration Compiler for Query Generation; the resulting Workflow drives Workflow Execution, with Web Service «call» invocation, Delegated Call Evaluation and Service Execution Monitoring] • Process flow computed at compile-time from declarative resource model and user query • Runtime infrastructure invokes services
Disease Registry Example • Will our approach work? • How do we build web services? • extract fine-grained functional components from existing packages • Is compile-time composition feasible? • Is our runtime environment going to perform acceptably? • Experiment to see what impact our choices have • Service/operation interface for packages • SQL generation technique • Composition (i.e. workflow) vs stand-alone application performance
Disease Registry Example • Problem specification written in domain terms • To link New Patients with Existing Patients • Get new patient data • Get existing patient data • Use probabilistic linking to identify individuals who match in both datasets • To probabilistically link datasets A and B • Standardise A and Standardise B • Index standardised A and Index standardised B • Compare each in A with each in B when indexes match • Classify compared into (match; partial match; no match)
A Process Model for Record Linking
link(A,B,C) :-
  standardize(A,As), standardize(B,Bs),
  index(As,AsI), index(Bs,BsI),
  comparison(AsI,BsI,Cs), classify(Cs,C).
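The declarative rule above can be mirrored as a toy Python pipeline. The helper implementations below (uppercase standardisation, first-letter blocking, exact-match scoring) are invented stand-ins for the real services, kept minimal to show only the data flow:

```python
def standardize(recs):
    """Normalise the given-name field (toy standardisation)."""
    return [{**r, "gname": r["gname"].strip().upper()} for r in recs]

def index(recs):
    """Block on the first letter of the given name (toy blocking key)."""
    blocks = {}
    for r in recs:
        blocks.setdefault(r["gname"][:1], []).append(r)
    return blocks

def comparison(blocks_a, blocks_b):
    """Compare only records whose blocking keys coincide."""
    for key in blocks_a.keys() & blocks_b.keys():
        for a in blocks_a[key]:
            for b in blocks_b[key]:
                yield (a, b, float(a["gname"] == b["gname"]))

def classify(compared):
    return [(a, b, "match" if s == 1.0 else "no match")
            for a, b, s in compared]

def link(A, B):
    """Python analogue of link(A,B,C) from the process model."""
    return classify(comparison(index(standardize(A)),
                               index(standardize(B))))
```

Blocking is what avoids the worst-case n² comparison: only pairs sharing a blocking key are ever compared.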
Process Template
link(A,B,C) :-
  standardize(A,As), standardize(B,Bs),
  index(As,AsI), index(Bs,BsI),
  comparison(AsI,BsI,Cs), classify(Cs,C).
Virtual Process Specification
index( standard(Id, Gname, Sname, Dob, …),
       indexed(ClusterId, standard(Id, Gname, Sname, Dob, …))) :-
  truncate(Gname,4,GnameTrunc), truncate(Sname,4,SnameTrunc),
  block_index(GnameTrunc, SnameTrunc, ClusterId).
Specify Virtual Service(s)
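The truncate/block_index steps of the virtual index service can be sketched in Python. This is a toy stand-in, with an invented cluster-id encoding, not the actual service implementation:

```python
def truncate(value: str, n: int) -> str:
    """Keep the first n characters, as in truncate(Gname,4,GnameTrunc)."""
    return value[:n]

def block_index(gname: str, sname: str) -> str:
    """Derive a cluster id from the two truncated name keys, mirroring
    block_index(GnameTrunc, SnameTrunc, ClusterId)."""
    return f"{truncate(gname, 4)}|{truncate(sname, 4)}"
```

Records that agree on the first four characters of both names land in the same cluster and are compared; all other pairs are skipped.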
Query The Process Model
probabilistic_link(classified(Z)) :-
  'newpatients@laboratoryDS'(A,B,C,D,E,F,G),
  'allpatients@registryDS'(H,I,J,K,L,M),
  link('newpatients'(A,B,C,D,E,F,G), 'allpatients'(H,I,J,K,L,M), classified(Z)).
SQL Generation • Leaf nodes are data generators • Each non-terminal node is a web service call • SQL for input generated from backward closure of dependency graph • Call results stored in table • Cartesian products have disjoint backward closures • Heuristic optimization delays CP evaluation
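The backward closure used for SQL generation can be sketched as a simple graph traversal, assuming the dependency graph maps each node to the nodes it depends on (the node names in the test are invented):

```python
def backward_closure(graph: dict, node: str) -> set:
    """All nodes reachable backwards from `node`: the leaf data
    generators and intermediate calls that the SQL input for
    `node` depends on."""
    closure, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in closure:
            closure.add(n)
            stack.extend(graph.get(n, []))
    return closure
```

Two subexpressions whose backward closures are disjoint share no inputs, which is how the compiler detects the Cartesian products whose evaluation it delays.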
Performance • At first we were disappointed • On a 5k-record run, one large join took over ½ hour and consumed all available disk space • The SQL statements were not being optimised • Then we were surprised • Using a temporary table to speed the join, the join processed in 30 seconds!
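The temporary-table fix can be illustrated with Python's sqlite3 module. The table and column names are invented, but the pattern is the one described above: materialise and index the intermediate result before joining, instead of re-evaluating a large unoptimised subquery:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE compared (id_a TEXT, id_b TEXT, score REAL)")
con.executemany("INSERT INTO compared VALUES (?, ?, ?)",
                [("p1", "r1", 0.95), ("p2", "r2", 0.30)])

# Materialise the filtered intermediate result into a temporary table
# and index it, so the subsequent join probes a small indexed relation.
con.execute("CREATE TEMP TABLE candidates AS "
            "SELECT * FROM compared WHERE score > 0.8")
con.execute("CREATE INDEX idx_cand ON candidates(id_a)")
rows = con.execute("SELECT id_a, id_b FROM candidates").fetchall()
```

On a toy in-memory database the gain is invisible, but on the 5k-record run this kind of materialisation is what turned a half-hour join into 30 seconds.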
Why Did We Get Non-linear Improvement? • Distribution of work between the database and record-linking service machines • Lots of data parallelism • Task level (e.g. date_standardize; truncate; name_standardize) • Message level (e.g. time to process a 500-record message block) • Bounded memory requirement for the linking service to process a 500-record message block • Non-linear virtual memory requirement is a known issue for Febrl • Limited synchronisation points • Execution time of implementation languages is significant! • Java vs Python between 1:200 and 1:1000
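The 500-record message blocks can be produced with a simple chunking generator. This is a sketch of the idea, not the actual messaging layer:

```python
def message_blocks(records, block_size=500):
    """Split a record stream into fixed-size message blocks so each
    web-service call processes a bounded amount of data, keeping the
    linking service's memory requirement bounded per call."""
    block = []
    for rec in records:
        block.append(rec)
        if len(block) == block_size:
            yield block
            block = []
    if block:  # flush the final partial block
        yield block
```

Fixed-size blocks also give the runtime natural units for message-level parallelism: independent blocks can be dispatched to linking-service instances concurrently.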
Conclusions & Future Work • Simple process model for record linking • We plan to incorporate recursion & iteration • Array style invocation enables us to pass more (uniform) data in each web service message • Much more work needed for complex structured messages • Virtual service specifications enable flexible implementation choice • But someone or something must construct them! • Compiler automates the tedious work • We need to look more closely at join performance! • Incremental treatment of changes not addressed • Obvious application of incremental view maintenance techniques