Improving the effectiveness of Web searching: Methodological issues. Barry Eaglestone. Department of Information Studies University of Sheffield B.Eaglestone@shef.ac.uk. Overview. An inductive study to build evidence-based meta-cognitive models of web searching by the general public.
Improving the effectiveness of Web searching: Methodological issues Barry Eaglestone Department of Information Studies University of Sheffield B.Eaglestone@shef.ac.uk
Overview • An inductive study to build evidence-based meta-cognitive models of web searching by the general public. • Data modelling issues • A Temporal data modelling solution • Discussion & Final thoughts
Setting the scene – the database approach and state of the art. An inductive study of how the general public search on the web.
Motivation • Need to develop new models for searching: update outdated usage paradigms. • Improve training methods • Develop automated assistance systems
Previous studies of search logs • Web search is shallow + promiscuous • Low use of advanced features • Global statistics • number of queries/search • Pages viewed / user • query reformulation (change in no of terms) • Most users enter few terms • Little to be gained by increasing complexity
The Team • Database • Chemoinformatics • Information Seeking
Spectrum of Research Perspective Soft Hard Formally Defined problems Computer world formalisations Hardware / Software solutions Problem Solving formalism Human / organisational issues Modelling/engineering/ empirical Qualitative / quantitative data analysis / modeling Discovery Invention People world IS CS Computer World
Context • Who are the searchers? The GENERAL PUBLIC: volunteers (c. 500 searches) recruited through ICT courses, University evening classes, City Learning Centre courses, Citizens’ forum, personal contacts, library and advertising; students and academics; plus over 1,000,000 anonymous search engine transaction logs • What are they searching for? Self-selected searches explained through interview and think-aloud protocols; 2-3 set searches • How do they search? Observe and record c. 500 searches and talk to searchers; determine query similarity; delimit searches; code query transformations; model searches as transformation graphs; data mine for stereotypical search strategies; correlate with who, why and effectiveness. Thus, establish evidence-based models of search strategy, related to user and problem characteristics and likelihood of success • Effectiveness? Infer effectiveness from search transformation patterns and the subject’s narrative • How will we use it? Meta-cognitive knowledge about web searching; evidence-based meta-cognitive training; intelligent interfaces
Why Meta-cognition? “Meta-cognition refers to higher order thinking which involves active control over the cognitive processes engaged in learning. ….” Livingston (1997) • Meta-cognitive knowledge • “…knowledge of personal variables to general knowledge about how human beings learn and process information, as well as individual knowledge of one’s own learning processes…” e.g. “I have a bad memory!” • Meta-cognitive regulation • “… activities used to ensure that a cognitive goal has been met….”, e.g., question yourself about the text and then re-read. Livingston (1997)
Cognitive Styles Analysis [figure: two dimensions, Verbalizer / Imager and Holist / Analyst] http://www.memletics.com/manual/default.asp?ref=ga&data=999+learning+styles+free+test
Syntactical/quantitative: Excite search logs (~10^6 searches) • Semantic/qualitative: Holistic search logs, supplemented with qualitative data
Preliminary work • Analysis of search logs • Development of descriptive codes • Aim is to form a basis for the analysis of our experimental data
Strengths /Limitations • Large sample • Definitely general public. • No enquiry context – what are they looking for? What are they thinking? • No measure of success. • Are they searching or just browsing? • Where does one enquiry end and another begin? • Limited to one search engine – what did they do during a delay?
Excite Database Sample (~10^6 queries). Sessions are marked on the slide (labelled 1, 2, 3).
qid  uid               time    rank  query                           querymore  totwords
343  000000000000006a  192141  0     alco fence company ohio         No         4
344  000000000000006a  192219  0     alco fence company ohio         No         4
345  000000000000006a  192228  10    alco fence company ohio         No         4
346  000000000000006a  192243  20    alco fence company ohio         No         4
347  000000000000006a  192328  0     lifetime fence company ohio     No         4
348  000000000000006a  192359  10    lifetime fence company ohio     No         4
349  000000000000006a  192455  0     lifetime wire fence             No         3
350  000000000000006a  192634  0     high tensile wire fence         No         4
351  000000000000006b  161906  0     sickle cell anemia              No         3
352  000000000000006b  162006  10    sickle cell anemia              No         3
353  000000000000006b  162130  0     sickle cell anemia              No         3
354  000000000000006c  144303  0     Hilton Garden Inn               No         3
355  000000000000006c  144331  0     Hilton Garden Inn Jacksonville  No         4
356  000000000000006c  144433  0     Hotel Search                    No         2
357  000000000000006c  144541  0     Jacksonvill Hotel               No         2
358  000000000000006c  144728  0     www.hilton.com                  No         1
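To make the sample concrete, here is a minimal sketch of how rows like these could be parsed and grouped. It rests on assumptions not stated on the slide: that one `uid` corresponds to one session, and that `rank` is the result-page offset, so repeated rows for the same query are page views rather than new queries. The names `LogRow`, `parse_row` and `sessions` are illustrative only.

```python
from itertools import groupby
from typing import NamedTuple


class LogRow(NamedTuple):
    qid: int
    uid: str
    time: str   # HHMMSS string, kept exactly as in the log
    rank: int   # assumed: offset of the first result shown (0, 10, 20, ...)
    query: str


def parse_row(line: str) -> LogRow:
    # The first four fields are qid, uid, time, rank; the last two are
    # querymore and totwords. Everything in between is the query text.
    parts = line.split()
    qid, uid, time, rank = parts[:4]
    return LogRow(int(qid), uid, time, int(rank), " ".join(parts[4:-2]))


def sessions(rows):
    """Map each uid to its distinct consecutive queries (page views collapsed)."""
    out = {}
    for uid, group in groupby(rows, key=lambda r: r.uid):
        distinct = out.setdefault(uid, [])
        for row in group:
            if not distinct or distinct[-1] != row.query:
                distinct.append(row.query)
    return out


sample = [
    "351 000000000000006b 161906 0 sickle cell anemia No 3",
    "352 000000000000006b 162006 10 sickle cell anemia No 3",
    "354 000000000000006c 144303 0 Hilton Garden Inn No 3",
    "355 000000000000006c 144331 0 Hilton Garden Inn Jacksonville No 4",
]
print(sessions([parse_row(line) for line in sample]))
```

Grouping by `uid` alone cannot tell where one enquiry ends and another begins within a session, which is the limitation raised on the previous slide.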
Query Transformations • Changes in search strategy • Conceptual, e.g. changes in type of search: broad / specific, text / image • Linguistic: syntactic, query structure • Examples Q1: shakespeare hamlet Q2: shakespeare hamlet quotes Q3: to be or not to be Q4: “to be or not to be” Q5: “to be or not to be” +shakespeare
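The syntactic side of the Q1 to Q5 examples can be illustrated with a small sketch that labels the change between two consecutive queries. The code labels (`ADD_TERMS`, `DROP_TERMS`, `ADD_QUOTES`, `ADD_OPERATOR`, `NO_SHARED_TERMS`, `REPEAT`) are hypothetical and are not the coding scheme developed in the study.

```python
import re

QUOTE_CHARS = '"\u201c\u201d'  # straight and curly double quotes


def terms(query):
    # Lower-cased word tokens; quotes and +/- operators are ignored here.
    return set(re.findall(r"[a-z0-9]+", query.lower()))


def has_quotes(query):
    return any(ch in query for ch in QUOTE_CHARS)


def code_transformation(prev, curr):
    """Return a set of illustrative codes describing the change prev -> curr."""
    p, c = terms(prev), terms(curr)
    codes = set()
    if c - p:
        codes.add("ADD_TERMS")
    if p - c:
        codes.add("DROP_TERMS")
    if not (p & c):
        codes.add("NO_SHARED_TERMS")
    if has_quotes(curr) and not has_quotes(prev):
        codes.add("ADD_QUOTES")
    if "+" in curr and "+" not in prev:
        codes.add("ADD_OPERATOR")
    return codes or {"REPEAT"}


queries = [
    "shakespeare hamlet",
    "shakespeare hamlet quotes",
    "to be or not to be",
    '"to be or not to be"',
    '"to be or not to be" +shakespeare',
]
for prev, curr in zip(queries, queries[1:]):
    print(sorted(code_transformation(prev, curr)), "<-", curr)
```

A purely textual coder like this sees Q2 to Q3 as sharing no terms, although a human reader recognises the same enquiry; that is the gap between syntactic and conceptual transformations.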
Our Preliminary Analysis • To look at textual (syntactic) changes. • Link queries by text similarity. • Infer enquiry change from textual dissimilarity. • Use these elements to develop a machine-readable codification of QT’s. • To mine for characteristic patterns.
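As one possible reading of "link queries by text similarity" and "infer enquiry change from textual dissimilarity", the sketch below uses Jaccard similarity over query terms with a cut-off. Both the similarity measure and the `threshold` value of 0.2 are assumptions chosen for illustration, not the measure used in the study.

```python
def jaccard(a, b):
    """Jaccard similarity of the lower-cased term sets of two queries."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def delimit(queries, threshold=0.2):
    """Split a query sequence into enquiries wherever similarity to the
    previous query drops below the (assumed) threshold."""
    enquiries = []
    for q in queries:
        if enquiries and jaccard(enquiries[-1][-1], q) >= threshold:
            enquiries[-1].append(q)
        else:
            enquiries.append([q])
    return enquiries


hotel_session = [
    "Hilton Garden Inn",
    "Hilton Garden Inn Jacksonville",
    "Hotel Search",
    "Jacksonvill Hotel",
    "www.hilton.com",
]
for enquiry in delimit(hotel_session):
    print(enquiry)
```

On the hotel session from the Excite sample this yields three enquiries, although a reader would probably see a single hotel search; textual dissimilarity is only a proxy for a change of enquiry.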
QT graphs [figure: query transformation graph; node labels: paid undergraduate nursing schools in baltimore city maryland nursing careers]
QT graphs [figure: query transformation graph; node labels: molsworth, "us army"]
Preliminary Conclusions • We have developed a rich set of codes describing the syntactic part of QTs • These can be used to develop a graph-based description • Correlations between the codes are meaningful/interesting • They form part of the analysis for our current experimental study.
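A graph-based description of a search can be sketched as a weighted edge list: nodes are distinct queries and each edge counts how often the searcher moved from one query to another. This is a simplified illustration under that assumption; the study's graphs also carry transformation codes, which are omitted here, and `transformation_graph` is a hypothetical name.

```python
from collections import Counter


def transformation_graph(queries):
    """Count directed transitions between distinct consecutive queries."""
    edges = Counter()
    for prev, curr in zip(queries, queries[1:]):
        if prev != curr:  # repeated queries (e.g. paging) add no edge
            edges[(prev, curr)] += 1
    return edges


search = ["lifetime wire fence", "high tensile wire fence",
          "lifetime wire fence", "high tensile wire fence"]
for (src, dst), count in transformation_graph(search).items():
    print(f"{src!r} -> {dst!r} x{count}")
```

Cycles in such a graph, where a searcher returns to an earlier query, are one kind of pattern that could then be mined for.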
…and if you want to read about our preliminary results… • Whittle M, Eaglestone B, Ford N, Madden A (2007), Data Mining of Search Logs, Journal of the American Society for Information Science and Technology (in press) • Whittle M, Eaglestone B, Ford N, Gillet V J, Madden A (2006), Query Transformations and Their Role in Web Searching by the General Public, Information Research, 12(1), October 2006 • Whittle M, Eaglestone B, Ford N, Gillet V, Madden A (2006), Query transformations and their role in web searching by the general public, Information Seeking in Context Conference (ISIC) 2006, Australia • Madden A, Eaglestone B, Ford N, Whittle M (2006), Search engines: a first step to finding information: preliminary findings from a study of observed searches, Information Seeking in Context Conference (ISIC) 2006, Australia
Model development Temporal database Sheffield Experimental Study Keystrokes Queries Web page titles Screens Audio Transcribing Pre-Processing Qualitative analysis Quantitative analysis
Setting the scene – the database approach and state of the art. Evolution of databases. The database approach – A database should be a natural representation of information as data, suitable for all relevant applications without duplication, including the ones you have not yet thought of. “A well designed database system will mirror its users’ perceptions of the problem space, and thus allows them to address the problem in hand without complexities and distractions of computer world implementation details… Implicit is the notion that users should work within the bounds of ‘good practice’”
Setting the scene – the database approach and state of the art. The semantic gap: the gap between what you wish to represent and what you can represent.
ER diagram: Customer (1) Placed_by (n) SalesOrder (m) Take_by (1) Salesperson
Customer
C#   Name            …
C1   Dr. Eaglestone
C2   Ms Smith
Salesperson
SP#  Names           …
S5   Mr. Chan
…
S8   Dr. Shao
Sales_Order
C#   SP#  Product  Quantity
C1   S5   P99      120
C1   S5   P2       10
Principles of database technology … & Data Independence [figure: layered architecture, top to bottom: Applications/Users, External Model, Logical Model, Internal Model]
QT graphs [figure: query transformation graph; node labels: molsworth, "us army"]
GENREG – A ready-made solution that has also been proposed for healthcare ? • The Organisation: National Museum of Denmark • Multimedia • Pictures as well as descriptions • Distributed • Each department ran their own database system for their collection (ownership!) • Object-oriented design • Entities, not just values • Relational implementation
Database Research Praxis Application Technology Science Theory
Topology Danish Pre-history Ethnographic Department LAN 1,000,000 artefacts 200,000 images Department of Antiquity Coin Collection
Design / Abstractions • Design • Object oriented • Based on a curator’s perspective • “Curators apply scientific training to determine the history of artefacts…creating knowledge about past and present societies by determining relationships which group artefacts within certain times and places in history” • Abstractions • Artefact • Event • Relationship • relate artefacts which participate in common events
Mould used to fabricate Brooches
GENREG data model [figure: EVENT/ARTIFACT and ARTIFACT] One or more artifacts participate in one or more events.
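The idea that relationships "relate artefacts which participate in common events" can be sketched with a plain mapping from events to participating artefacts. The event and artefact names below are invented for illustration (loosely following the mould and brooches slide), and `related_artifacts` is a hypothetical helper, not part of GENREG.

```python
from collections import defaultdict
from itertools import combinations


def related_artifacts(participation):
    """participation: event id -> set of artifact ids.
    Returns artifact id -> set of artifacts sharing at least one event."""
    related = defaultdict(set)
    for artifacts in participation.values():
        for a, b in combinations(sorted(artifacts), 2):
            related[a].add(b)
            related[b].add(a)
    return dict(related)


# Hypothetical data: a fabrication event and a burial event.
participation = {
    "fabrication": {"mould", "brooch_1", "brooch_2"},
    "burial": {"brooch_1", "sword"},
}
for artifact, others in sorted(related_artifacts(participation).items()):
    print(artifact, "->", sorted(others))
```

Because the relationship is derived from events rather than stored directly, adding a newly discovered event automatically relates the artefacts that took part in it.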
Burial site Grave Grave Artefact Artefact Artefact Artefact Artefact Artefact
Manor House A B C D Merchant’s House E Furniture Purchase event G Rooms F H I J K L Furniture
Integrated Care Pathways Application [Procter, P., Eaglestone, B.M. & Burdis, C. “A unified model to support an information intensive healthcare environment”, MIE '99] [figure: timeline from Past through Present to Future(s); states P1 to P6 at times It, It+1, It+2, linked by Treatment, with alternative diagnoses and alternative prognoses]
A formal GENREG Model
type Genreg = abs[ tuple[ Collection : Artifacts, Events : set[Event] ]
    new : () → Genreg,
    = : (Genreg × Genreg) → boolean,
    events : (Genreg) → set[Event],
    collection : (Genreg) → Artifacts ]
type Artifacts = graph[Artifact]
type Event = abs[ tuple[ id : E_Id, type : Event_Type, t : Time, place : Location, actors : set[Actor_Type], edge : set[Edge] ]
    = : (Event × Event) → boolean,
    id : (Event) → E_Id,
    type : (Event) → Event_Type,
    t : (Event) → Time,
    place : (Event) → Location,
    actors : (Event) → set[Actor_Type],
    edgeset : (Event) → set[Edge] ]
…
type Time = abs[ tuple[ lower, upper : T ]
    new : () → Time,
    = : (Time × Time) → boolean,
    before : (Time × Time) → boolean,
    meets : (Time × Time) → boolean,
    overlaps : (Time × Time) → boolean,
    during : (Time × Time) → boolean,
    starts : (Time × Time) → boolean,
    finishes : (Time × Time) → boolean,
    …
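The predicate names on the Time type match Allen's interval relations. Assuming the standard Allen definitions apply (the slide does not spell them out), a minimal sketch of the type with closed numeric bounds could look like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Time:
    """An interval [lower, upper]; semantics assumed to follow Allen's relations."""
    lower: float
    upper: float

    def __post_init__(self):
        if self.lower > self.upper:
            raise ValueError("lower bound must not exceed upper bound")

    def before(self, other):
        return self.upper < other.lower

    def meets(self, other):
        return self.upper == other.lower

    def overlaps(self, other):
        return self.lower < other.lower < self.upper < other.upper

    def during(self, other):
        return other.lower < self.lower and self.upper < other.upper

    def starts(self, other):
        return self.lower == other.lower and self.upper < other.upper

    def finishes(self, other):
        return self.upper == other.upper and self.lower > other.lower


bronze_age = Time(-1700, -500)   # hypothetical period bounds
burial = Time(-900, -850)        # hypothetical event interval
print(burial.during(bronze_age))
```

Equality (`=` on the slide) comes for free from the dataclass, which compares `lower` and `upper` field by field.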
add_artifact / delete_artifact (D, a) • add_event / delete_event (D, e) • merge (D,F,E) • select_artefacts (D,p) • select_events (D,p) • related_to (D,n) • related_by (D,e,n)
Temporal Data Models (see also SQL/Temporal) [figure: three axes, Entity, Attribute and Time] Example: Entity: Barry; Height: 2’ 3’’; Time: 1950 • Entity: Barry; Height: 5’ 10’’; Time: 2004
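The Barry example can be sketched as a time-stamped attribute: each value is stored with the time it was recorded for, and a query returns the value in force at a given time. This is a minimal illustration of attribute time-stamping in general, not of SQL/Temporal or GENREG; `TemporalAttribute` is a hypothetical class, and lookups return the most recent recorded value rather than interpolating.

```python
from bisect import bisect_right


class TemporalAttribute:
    """Keeps (time, value) pairs sorted by time."""

    def __init__(self):
        self._times = []
        self._values = []

    def record(self, time, value):
        # Records may arrive in any order, so histories can be built retrospectively.
        i = bisect_right(self._times, time)
        self._times.insert(i, time)
        self._values.insert(i, value)

    def value_at(self, time):
        # Latest value recorded at or before `time`, or None if there is none.
        i = bisect_right(self._times, time)
        return self._values[i - 1] if i else None


height = TemporalAttribute()
height.record(2004, "5' 10''")
height.record(1950, "2' 3''")
print(height.value_at(1950), "|", height.value_at(2004), "|", height.value_at(1940))
```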
Artefact histories are created retrospectively • Multiple orthogonal time dimensions can be represented (using specialised events), e.g., discovery and historic time. • Relationships between events and states are modelled. • Multiple objects can represent different states and interpretations of an entity.
QT graphs [figure: query transformation graph; node labels: Q4, Q3, molsworth, QJt, "us army"]
Some final thoughts… • The Database Approach? • Semantic gap? • Data independence? • Temporal modelling? • Query language? • So, what’s happening?
IR & DB? [figure: Client(s): users are researchers who derive knowledge from retrieved artefacts. Server(s): Internet-accessible repositories of artefacts (artefact collection). A problem-related query goes to the server; problem-relevant artefacts are returned to the researcher’s workspace, which is developed to model the problem space.] • IR – collections of artefacts are available for ad hoc querying (any relevant problem); the problem is modelled by the query • DB – collections of artefacts are structured to model the problem space.
…final thoughts… • Knowledge of research methodology is important (qualitative and quantitative) • Nudist, Atlas, SPSS don’t support mixed methods • Database approach allows integration of qualitative and quantitative data, and organisation of data to evolve to model emerging theory • Temporal data models are key to modelling evolving strategy…
Acknowledgments • The project team – Nigel Ford, Andrew Madden, Martin Whittle • Arts and Humanities Research Council (formerly Board) for funding • Mark Sanderson and Amanda Spink for making the Excite logs available • Val Gillet and Eleanor Gardiner for help with graphs.