Patent Search using IPC Classification Vectors Manisha Verma, Vasudeva Varma
OUTLINE • Motivation • Proposed Approach • Vector Generation • Vector Propagation • Evaluation • Dataset • Cosine Similarity (COS) • Graded Cosine Similarity (GCS) • Text Score + Similarity Score • Results • Future Work
Search Patents, But How? • Input : A patent application • Process : • Study the patent • Find keywords related to the invention • Formulate a query with the help of several operators • Query the patent database • Refine the query depending on the results • Problem : Search results depend on the query, and the query depends on keyword selection. The quality of search results therefore hinges on the choice of words and their weights in the query
Motivation • Selection of keywords requires domain expertise, which restricts the number and areas of an examiner’s patent searches. • Query creation is a manual and tedious process: with so many words and so many operators, finding the optimal combination needs expertise. • Deciding which claims to focus on and which to leave out is hard. Some patent applications have over 100 claims.
What may help • Automatic query creation and refinement. • Exploiting the meta-data present in the application to improve results.
What's in patents • Inventor information • IPC (International Patent Classification) class code information. • Date of filing • Citations • Images • And of course the patent text – Title, Abstract, Description and Claims
Our approach • Use category and citation information in patents to filter results. • Use text search to improve precision.
IPC Vectors • Category information at each level is represented as a vector. Combining the vectors of all levels yields the IPC Vector of a patent. • Level 1 : Section + Class + Subclass • Level 2 : Section + Class + Subclass + Main Group • Level 3 : Entire classification code
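The three-level construction above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the choice of a sparse count vector (a `Counter` keyed by level-truncated codes) are assumptions, and the slicing assumes standard IPC codes such as `G06F17/30`.

```python
from collections import Counter

def ipc_levels(code):
    """Split a full IPC code, e.g. 'G06F17/30', into the three
    granularity levels described above (names are illustrative)."""
    section_class_subclass = code[:4]   # Level 1: e.g. 'G06F'
    main_group = code.split('/')[0]     # Level 2: e.g. 'G06F17'
    full_code = code                    # Level 3: entire code
    return section_class_subclass, main_group, full_code

def ipc_vector(codes):
    """Represent a patent's IPC codes as a sparse count vector
    over all three levels combined."""
    vec = Counter()
    for code in codes:
        for level_key in ipc_levels(code):
            vec[level_key] += 1
    return vec

vec = ipc_vector(['G06F17/30', 'G06F15/16'])
```

Keeping every level as a separate key means the combined vector can later be split back into per-level sub-vectors, which the graded similarity below needs.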
Vector propagation • The citation graph of patents is used to enrich the vector. It is a directed graph with a link from Node A to Node B if patent A cites patent B. • Inlinks (incoming edges) of a node are used to add information to its vector. • Propagation ensures that if Node A is retrieved then its neighbors are also present in the solution set, which improves the recall of the system
Vector propagation contd.. • For a given node Pi, let In(Pi) be the set of nodes that point to it (predecessors) and k be the current iteration. The vector of node Pi for the (k+1)th iteration is defined as follows: V(Pi, k+1) = V(Pi, k) + (1/2^k) · (1/|In(Pi)|) · Σ_{Pj ∈ In(Pi)} V(Pj, k) • 1/2^k is used to dampen the effect of adjacent vectors as the iterations increase. The above formula simply adds the damped average of the vectors of all nodes that point to Pi.
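A minimal sketch of this update rule, assuming sparse vectors as plain dicts and an `inlinks` map from each node to its predecessors (both representations are illustrative, not from the slides):

```python
def propagate(vectors, inlinks, iterations=3):
    """Each iteration, every node adds the damped average of its
    predecessors' vectors: V(k+1) = V(k) + (1/2^k) * avg of inlink vectors.
    vectors: {node: {feature: weight}}; inlinks: {node: [predecessors]}."""
    for k in range(iterations):
        damp = 1.0 / (2 ** k)
        updated = {}
        for node, vec in vectors.items():
            preds = inlinks.get(node, [])
            new_vec = dict(vec)
            for p in preds:
                for feat, w in vectors[p].items():
                    new_vec[feat] = new_vec.get(feat, 0.0) + damp * w / len(preds)
            updated[node] = new_vec
        vectors = updated
    return vectors

# Toy example: patent B cites... is cited by A, so B inherits A's IPC mass.
vecs = propagate({'A': {'G06F': 1.0}, 'B': {'H04L': 1.0}},
                 {'B': ['A']}, iterations=1)
```

All nodes are updated from the previous iteration's vectors (a synchronous update), which matches the formula's use of iteration-k values on the right-hand side.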
Evaluation • We use the CLEF-IP 2011 collection of the Prior Art Search (PAC) task, which has 2.6 million patent documents pertaining to 1.3 million patents from the European Patent Office (EPO), with content in English, German and French, extended by documents from WIPO. • The dataset includes 300 sample patent applications used as queries. • Both English and original-language patents are used for making queries.
Evaluation • Base: Simple Text Retrieval. 20 words from the query patent with high tf-idf values are used to form a weighted query. The weight of each word is its tf-idf score. • COS: Cosine Similarity. IPC information present in the patent is used to make the vector. The entire vector is used to calculate cosine similarity between a patent and a query. • GCS: Graded Cosine Similarity. Similarity is calculated at each level and the scores are linearly combined to get the final score.
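The Base query construction could look like the following sketch. The tf-idf formula here (raw term frequency times a smoothed log idf) is one common variant and an assumption on my part; the slides do not specify which weighting the authors used.

```python
import math
from collections import Counter

def tfidf_query(doc_tokens, corpus_df, n_docs, top_k=20):
    """Pick the top_k words with the highest tf-idf in the query patent
    and weight each by that score (a sketch of the Base method).
    corpus_df maps word -> document frequency over the collection."""
    tf = Counter(doc_tokens)
    scores = {}
    for word, freq in tf.items():
        idf = math.log(n_docs / (1 + corpus_df.get(word, 0)))
        scores[word] = freq * idf
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_k])

# 'vector' is rare in this toy collection, so it dominates the query.
q = tfidf_query(['vector', 'vector', 'patent'],
                {'vector': 1, 'patent': 100}, n_docs=1000, top_k=1)
```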
Graded Cosine Similarity If Pq is the query patent vector and Pi is the vector of the ith patent in the corpus, we use the following to calculate graded similarity: GCS(Pq, Pi) = Σ_j a_j · sim_level_j(Pq, Pi) where a_j represents the importance of the similarity score at level j and sim_level_j is the cosine similarity between the level-j vectors.
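The linear combination above can be sketched as follows, again with sparse dict vectors. How the level weights a_j were chosen is not stated on the slides, so they are left as an input here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def graded_cosine(query_levels, patent_levels, weights):
    """GCS = sum over levels j of a_j * cosine(level-j vectors).
    query_levels / patent_levels are per-level sparse vectors."""
    return sum(a * cosine(q, p)
               for a, q, p in zip(weights, query_levels, patent_levels))
```

In contrast, plain COS computes a single cosine over the concatenated vector; GCS lets coarse levels (e.g. matching subclass) count differently from exact full-code matches.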
Re-Ranking top Patents • We re-rank the top 1000 documents using the following methods: • COS + Base: top 1000 documents obtained from COS are re-ranked using λBase + (1 − λ)COS. • GCS + Base: top 1000 documents from GCS are re-ranked using λBase + (1 − λ)GCS.
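The interpolation step is a simple weighted sum; a minimal sketch (the tuple layout and default λ are illustrative, and the slides do not report the λ value used):

```python
def rerank(candidates, lam=0.5):
    """Re-rank (doc_id, text_score, sim_score) tuples by the
    interpolated score lam * Base + (1 - lam) * Sim, descending."""
    scored = [(doc, lam * base + (1 - lam) * sim)
              for doc, base, sim in candidates]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Doc 'b' has a weaker text score but a much stronger similarity score.
top = rerank([('a', 0.9, 0.1), ('b', 0.2, 1.0)], lam=0.5)
```

In practice the two score lists would need to be normalized to comparable ranges before interpolating, since tf-idf text scores and cosine similarities live on different scales.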
Conclusion and future work • Both the IPC-based representation and re-ranking perform better than the baseline (text-based retrieval) in terms of precision and recall on the sample queries of the CLEF-IP 2011 dataset. • An extension to this work would be to use a learning-to-rank approach to re-rank the top documents. It would also be interesting to observe the effects of combining the vector representation with patent text directly, to avoid re-ranking.
Questions? • THANK YOU