This research investigates how to generate reasonable queries for efficient IR system evaluation by analyzing a patent search query log. The analysis covers term frequencies, term proximity, support in similar IPC classes, and the number of retrieved documents. The study uses a dataset of 242 patent query logs and reports how the frequency of terms in patents relates to their selection in queries. The findings also show the importance of term proximity in query formulation. The work points to applications of query log analysis in patent retrieval systems and query expansion.
Patent Search QUERY Log Analysis Shariq Bashir Department of Software Technology and Interactive Systems Vienna University of Technology
General Theme • In automatic evaluation of IR systems, query generation is very important. • Generally, the query generation space is very large. • We need to understand how to generate reasonable queries. • In this work, we study this issue with the help of a patent search query log.
Automatic Query Generation for Analysis • Motivation/Problem • Patents contain a large number of terms. • Analyzing IR systems using all combinations of terms is a difficult task. • It demands a large amount of processing time. • It can give a misleading picture. • A large share of query term combinations is never used by users. • Question • How can we generate reasonable queries?
Query Log of Patents Search • A patent search query log can help in generating queries for analysis. • Patent search users are more experienced; we can utilize their experience for effective query generation. • In query log analysis, on one side we have the query patents and on the other side we have their query logs. • This helps us in understanding • The types of terms that are mostly used for searching patents. • Which irrelevant terms can be pruned.
Applications of Query Log Analysis • Analyzing Bias of Retrieval Systems (Findability of Documents). • Selecting Terms for Query Expansion. • Learn to Rank for Prior-Art Search.
Experiments (QUERY Log DATASET) • Patent search query logs can be downloaded from the USPTO portal (http://portal.uspto.gov/external/portal/pair). • They cannot be downloaded as a whole; each log must be downloaded manually, one patent at a time. • They are available in scanned format and need OCR to convert them into digital text. • Further cleansing operations are needed to remove noise in the queries. • Some queries contain reference numbers of past queries. • There were a lot of numbers in the queries. • Patent application numbers • IPC classes
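The cleansing step described above could be sketched roughly as follows. This is only an illustration: the slides do not specify the log format, so the patterns for back-references (`S12`, `L3`) and IPC classes (`H04L 12/28`), as well as the function name `clean_query`, are assumptions.

```python
import re

# Hypothetical patterns: the real query log format is not given in the slides.
REF_PATTERN = re.compile(r"\b[SL]\d+\b")  # assumed references to past queries, e.g. "S12"
IPC_PATTERN = re.compile(r"\b[A-H]\d{2}[A-Z]\s*\d{1,4}/\d{2,6}\b")  # e.g. "H04L 12/28"
WORD_PATTERN = re.compile(r"\b[A-Za-z]+\b")  # purely alphabetic tokens


def clean_query(raw):
    """Return the lowercased text terms of one OCR'd query line."""
    text = IPC_PATTERN.sub(" ", raw)   # drop IPC class symbols
    text = REF_PATTERN.sub(" ", text)  # drop references to earlier queries
    # Keeping only alphabetic tokens also drops application numbers
    # and any other numeric noise.
    return [tok.lower() for tok in WORD_PATTERN.findall(text)]
```

Boolean operators such as "and" are kept here; whether to drop them depends on how the queries are used later.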
Experiments (QUERY Log DATASET) • Query logs of 242 patents are used for analysis. • 15,013 queries. • We considered only the text queries for analysis.
Query Log Analysis • Given a query log, we analyze it on the basis of the following factors. • Term frequencies of query terms. • Does the frequency of terms in patents have any importance in query formulation? • Proximity/closeness of query terms in the patent text. • Query term confidence in similar IPC classes. • Number of retrieved documents. [Diagram: for a query patent (Y), all terms of the query patent are compared with all terms of the query log of (Y), to understand the difference between all patent terms and only query log terms for automatic query generation.]
Terms Frequencies in Patents (1) • All terms of query patents: • A large percentage of terms in patents have a low frequency. • Only a small percentage of terms have a frequency > 10.
Terms Frequencies in Patents (1) • [Percentage out of total terms] selected in queries: • Higher-frequency terms have a very good percentage of selection in queries. • Lower-frequency terms (frequency <= 5) have a very poor percentage. Note from the previous slide that almost 75% of terms in patents have a frequency <= 5.
Terms Frequencies in Patents (1) • [Percentage out of query terms] appearing in the query log: • Higher-frequency terms appear more frequently in the query log than lower-frequency terms (frequency <= 5).
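The two percentages on the frequency slides can be illustrated with a minimal sketch. The function name, the two-bucket split, and the input shapes (a token list for the patent, a collection of query terms) are assumptions made for illustration; only the threshold of 5 comes from the slides.

```python
from collections import Counter


def frequency_bucket_stats(patent_tokens, query_terms, threshold=5):
    """Compare low-frequency (<= threshold) and high-frequency patent terms."""
    freq = Counter(patent_tokens)
    # Query terms that actually occur in the patent text.
    used = {t for t in set(query_terms) if t in freq}
    buckets = {
        "low": {t for t, c in freq.items() if c <= threshold},
        "high": {t for t, c in freq.items() if c > threshold},
    }
    stats = {}
    for name, terms in buckets.items():
        selected = terms & used
        stats[name] = {
            # Share of the bucket's terms that were selected in queries.
            "selected_share": len(selected) / len(terms) if terms else 0.0,
            # Share of the used query terms that fall into this bucket.
            "query_share": len(selected) / len(used) if used else 0.0,
        }
    return stats
```

Here `selected_share` corresponds to "percentage out of total terms" and `query_share` to "percentage out of query terms".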
Terms Proximity/Closeness in Query Log (2) • Proximity refers to the closeness of two terms in the patent text. • It helps in understanding whether term proximity has any importance in query formulation. • Proximity of terms is calculated with two approaches • Minimum distance between two terms. • Co-occurrence frequency using a window size. • Term pairs are selected in two ways • All term pairs of the query patent. • Only term pairs that appeared in the query log.
Terms Proximity/Closeness in Query Log • With minimum distance: • Pairs with a lower minimum distance appear in a larger percentage in the query log than pairs with a higher minimum distance. • This indicates that users focus more on terms that are closer together in the text. • Among all term pairs of patents, 71% of pairs have a minimum distance > 7.
Terms Proximity/Closeness in Query Log • With co-occurrence frequency, window size = 14: • Pairs with higher co-occurrence make up a larger percentage (90%) of the query log than pairs with lower co-occurrence (10%). • Almost 75% of all term pairs of patents have a co-occurrence frequency <= 1.
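The two proximity measures could be computed along the following lines. The slides do not give exact definitions, so this sketch assumes distance is measured in token positions and that co-occurrence frequency counts the occurrences of one term that have the other term within the window; all function names are illustrative.

```python
def term_positions(tokens):
    """Map each term to the sorted list of its token positions."""
    positions = {}
    for i, tok in enumerate(tokens):
        positions.setdefault(tok, []).append(i)
    return positions


def min_distance(pos_a, pos_b):
    """Smallest positional gap between two terms (None if either is absent)."""
    i = j = 0
    best = None
    # Two-pointer sweep over the two sorted position lists.
    while i < len(pos_a) and j < len(pos_b):
        gap = abs(pos_a[i] - pos_b[j])
        if best is None or gap < best:
            best = gap
        if pos_a[i] < pos_b[j]:
            i += 1
        else:
            j += 1
    return best


def cooccurrence_frequency(pos_a, pos_b, window=14):
    """Count occurrences of term a with term b at most `window` tokens away."""
    return sum(1 for a in pos_a if any(abs(a - b) <= window for b in pos_b))
```

For example, with `pos = term_positions(text.split())`, the pair ("valve", "pump") is scored by `min_distance(pos["valve"], pos["pump"])`.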
Frequency in Similar IPC Classes • Query patents fall into many IPC classes. • Patent users are usually experienced. • Their terms are more target-oriented. • We need to check the frequency of query log term pairs in similar IPC classes. • Freq(IPC Classes) = Freq / |qd| • Freq = frequency in similar IPC classes • |qd| = total number of retrieved documents.
Support in IPC Classes • Analysis indicates higher support in similar IPC classes for query log term pairs than for all term pairs of patents.
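As a small illustration of the ratio Freq / |qd|, the sketch below treats a retrieved document as being in a "similar" IPC class when it shares at least one class with the query patent. That reading, the function name, and the input shape are assumptions; the slides do not define similarity precisely.

```python
def ipc_support(retrieved_ipc, query_ipc):
    """Fraction of retrieved documents sharing an IPC class with the query patent.

    retrieved_ipc: one collection of IPC classes per retrieved document (|qd| items).
    query_ipc: IPC classes of the query patent.
    """
    if not retrieved_ipc:
        return 0.0
    query_ipc = set(query_ipc)
    # Freq: retrieved documents with at least one IPC class in common.
    freq = sum(1 for classes in retrieved_ipc if query_ipc & set(classes))
    return freq / len(retrieved_ipc)
```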
Number of Retrieved Documents • The number of retrieved documents denotes how many patents contain the query terms. • The more common the query terms are, the larger the number of retrieved documents. • This factor is analyzed with • All term pairs of the patent • All term pairs of the query log
Number of Retrieved Documents • Analysis indicates that term pairs of the query log retrieve a smaller number of patents than all term pairs of patents.
Conclusion • For automatic IR system evaluation, query generation is an important factor. • We believe that past query logs can help us understand this problem. • Across different statistical factors, there is a huge difference between random queries and user queries. • We can consider these factors while generating queries automatically.
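Counting retrieved documents for a term pair can be sketched with a simple inverted index. This assumes conjunctive (Boolean AND) matching, which the slides do not state explicitly; the function names and the dictionary-based collection are illustrative.

```python
from collections import defaultdict


def build_index(documents):
    """Build term -> set of document ids from {doc_id: list of tokens}."""
    index = defaultdict(set)
    for doc_id, tokens in documents.items():
        for term in set(tokens):
            index[term].add(doc_id)
    return index


def retrieved_count(index, terms):
    """Number of documents that contain every term in `terms`."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return 0
    return len(set.intersection(*postings))
```

Running `retrieved_count` over all term pairs of a patent and over the term pairs from its query log gives the two distributions compared on the slide.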