USPTO Patent Data Source and Data Extraction Mandy Dang MIS 580 University of Arizona 02-06-2008
Outline • Patent • USPTO • Search USPTO Patents • Data Extraction: Case Study of NSE Patents
Patent • “Patent” usually refers to a right granted to anyone who invents or discovers any new and useful process, machine, article of manufacture, or composition of matter, or any new and useful improvement thereof. • A patent is not a right to practice or use the invention. Rather, it provides the right to exclude others from making, using, selling, or offering the invention for sale for a limited term, usually 20 years from the filing date. • It is a limited property right that the government offers to inventors in exchange for their agreement to share the details of their inventions with the public. • A patent is a special type of technical document that records many important innovations and technology advances.
USPTO • The United States Patent and Trademark Office (USPTO) is an agency in the United States Department of Commerce that provides patent protection to inventors and businesses for their inventions, and trademark registration for product and intellectual property identification. • Each year, the USPTO issues thousands of patents to companies and individuals worldwide. As of March 2006, the USPTO had issued over 7 million patents, with 3,500 to 4,500 newly granted patents each week. • The USPTO provides online full-text access for patents issued since 1976. • URLs: • USPTO Official Website: http://www.uspto.gov/ • USPTO Patent Search: http://www.uspto.gov/main/search.html
Search USPTO Patents • http://www.uspto.gov/main/search.html
Data Extraction: Case Study of NSE Patents • Nanoscale Science and Engineering (NSE) field • Fundamental technology that is critical for a nation’s technological competence. • Revolutionize a wide range of application domains. • Nanotechnology • An applied science/technology field that is multi-disciplinary and encompasses engineering and other work taking place at the nanoscale. • Critical for a nation’s technological competence. • R&D status attracts various communities’ interest.
Data Extraction Procedure • The goal is to gather all the related patents from the USPTO Web site as free-text HTML pages, parse them into structured data, and store the results in a database. • Procedure of extracting NSE patents from USPTO: • Spider search results (summary pages) • Spider individual patent documents (detailed pages) • Noise filtering • Parsing
1. Spider search results (summary pages) • A list of keywords can be used to search for patents related to the NSE domain. The keywords were provided by domain experts. • A spider program written in Perl was used to spider the search result pages.
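Example code

use strict;
use HTML::TokeParser;
use LWP;
use URI::Escape;

sub query {
    … … … …
    # Get keywords (one per line in the file named by the first argument)
    open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    my @keywords = <$in>;
    close($in);
    chomp(@keywords);    # drop trailing newlines so they do not end up in the URL
    … … … …
    # Download search pages
    $query_url = "http://patft.uspto.gov/netacgi/nphParser?Sect1=PTO2&Sect2=HITOFF&p=$pno&u=%2Fnetahtml%2Fsearc-bool.html&r=0&f=S&l=50&TERM1=$kw&FIELD1=&co1=AND&TERM2=$start%3E$end&FIELD2=ISD&d=ptx";
    $response = $browser->get($query_url);
    $result = $response->content();
    # Write to a lexical file handle instead of select(), so later prints still reach STDOUT
    open(my $out, '>', "$fpage-$pno.html") or die "Cannot write $fpage-$pno.html: $!";
    print {$out} $result;
    close($out);
}

# Set up time range
query('1/1/2007', '12/31/2007');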
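The URL construction in the example above can also be sketched in Java, the language the parsing step uses. This is a minimal sketch, not the original spider: the endpoint and parameter layout are copied unchanged from the Perl snippet and are not verified here, the class and method names are hypothetical, and URL-encoding the keyword and the date range is an assumption (the Perl snippet imports URI::Escape but does not show where it is applied).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class PatentQueryUrl {
    // Endpoint and parameter layout copied as-is from the slide's Perl example; not verified here.
    static final String BASE = "http://patft.uspto.gov/netacgi/nphParser";

    /** Builds the search URL for one keyword, one issue-date range, and one result page. */
    static String build(String keyword, String start, String end, int page) {
        // trim() drops the trailing newline left by reading the keyword file line by line
        String kw = URLEncoder.encode(keyword.trim(), StandardCharsets.UTF_8);
        // Assumption: the range is sent as "start>end", URL-encoded
        String range = URLEncoder.encode(start + ">" + end, StandardCharsets.UTF_8);
        return BASE + "?Sect1=PTO2&Sect2=HITOFF&p=" + page
                + "&u=%2Fnetahtml%2Fsearc-bool.html&r=0&f=S&l=50"
                + "&TERM1=" + kw + "&FIELD1=&co1=AND"
                + "&TERM2=" + range + "&FIELD2=ISD&d=ptx";
    }

    public static void main(String[] args) {
        // Same time range as the Perl example; the keyword is a made-up sample
        System.out.println(build("carbon nanotube\n", "1/1/2007", "12/31/2007", 1));
    }
}
```

Keeping the URL builder free of network calls makes it easy to check the generated query strings before any page is downloaded.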
Search result page example Patent IDs
2. Spider individual patent documents (detailed pages) • In this step, we need to: • 1st, collect all the patent IDs; • 2nd, download all the patents based on the patent IDs by using proxies. • The data set is often very large, so using proxies can save a lot of time.
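Collecting the patent IDs from the saved summary pages could look roughly like the sketch below. The markup is an assumption: it supposes each ID appears as link text in the form 7,123,456, which may not match the real result pages, so the regular expression would need adjusting against an actual saved page. The class name and sample IDs are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatentIdCollector {
    // Assumption: an ID is link text such as <a href="...">7,123,456</a>
    private static final Pattern ID = Pattern.compile(
            ">\\s*(\\d{1,2},\\d{3},\\d{3})\\s*</a>", Pattern.CASE_INSENSITIVE);

    /** Returns the distinct patent IDs found in the pages, in first-seen order, without commas. */
    static List<String> collect(List<String> pages) {
        // LinkedHashSet removes duplicates (the same patent can match several keywords)
        Set<String> ids = new LinkedHashSet<>();
        for (String html : pages) {
            Matcher m = ID.matcher(html);
            while (m.find()) {
                ids.add(m.group(1).replace(",", ""));
            }
        }
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        String page = "<td><a href=\"x\">7,123,456</a></td><td><a href=\"x\">Sample title</a></td>";
        System.out.println(collect(List.of(page)));
    }
}
```

Deduplicating at this stage avoids downloading the same detailed page once per matching keyword.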
Download detailed patent documents
• Create several files, each of which contains a fixed number of patent IDs (e.g., 300 patent IDs).
• Server: Send different patent ID files to different client threads.

… … … …
open(my $fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
my @theids = <$fh>;
close($fh);
chomp(@theids);    # the newline is added back explicitly when the ID is sent
foreach my $theid (@theids) {
    $new_sock = $sock->accept();          # wait for the next client connection
    my $buf = <$new_sock>;                # read the client's greeting line
    print {$new_sock} $theid . "\n";      # hand the ID to that client
    print $buf . " " . $theid . "\n";     # log which client received which ID
    close $new_sock;
    … … … …

• Client: Use a proxy to download the patents whose IDs are in the file sent from the server.

… … … …
do {
    $response = $browser->get($pat_url);
    if (!$response->is_success()) {
        select(STDOUT);
        print $response->status_line, "\n\n";
        sleep(rand(7) + 1);               # wait a random 1 to 8 seconds before retrying
    }
} while (!$response->is_success());
… … … …
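Two pieces of the download step lend themselves to a short Java sketch: splitting the ID list into fixed-size batches, and retrying a failed request after a random pause. This is an illustration under assumptions, not the original client/server code: the names are hypothetical, the download itself is passed in as a Callable rather than implemented, and unlike the Perl loop the retry here gives up after a fixed number of attempts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;

final class DownloadHelpers {
    /** Splits the ID list into consecutive batches of at most batchSize IDs (e.g., 300). */
    static List<List<String>> batches(List<String> ids, int batchSize) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            out.add(new ArrayList<>(ids.subList(i, Math.min(i + batchSize, ids.size()))));
        }
        return out;
    }

    /**
     * Calls task until it returns without throwing or maxAttempts is reached,
     * sleeping a random 1..maxDelayMillis ms between attempts (maxDelayMillis must be >= 1).
     */
    static <T> T retry(Callable<T> task, int maxAttempts, int maxDelayMillis) {
        Random random = new Random();
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(1 + random.nextInt(maxDelayMillis));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts", last);
    }
}
```

A caller would wrap the actual page fetch in the Callable, for example `retry(() -> fetch(url), 5, 8000)`, where `fetch` is whatever HTTP routine the client uses (hypothetical here). Bounding the attempts keeps one permanently failing ID from stalling a whole batch.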
3. Noise filtering • Some of the gathered patents match only noisy keywords that are unrelated to NSE, and some contain no NSE keywords at all. • Such patents need to be filtered out. • Noise keywords include: • nanosecond • nanoliter • nano$ • nano-second • nano-liter • nano.sub • nano [space] • nano2
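One way to read the filtering rule is: keep a patent only if it contains at least one "nano" term that is not on the noise list. The Java sketch below follows that reading and is an assumption about the rule, not the original filter. It treats the listed keywords as prefixes, treats a bare "nano" as noise (the "nano [space]" entry), and leaves out "nano$" because the slide does not say how that entry is matched. Class and method names are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NoiseFilter {
    // Noise prefixes taken from the slide's list ("nano$" is intentionally omitted)
    private static final List<String> NOISE_PREFIXES = List.of(
            "nanosecond", "nanoliter", "nano-second", "nano-liter", "nano.sub", "nano2");

    // A token is "nano" at a word boundary plus any following letters, digits, dots, or hyphens
    private static final Pattern NANO_TOKEN =
            Pattern.compile("\\bnano[\\w.\\-]*", Pattern.CASE_INSENSITIVE);

    static boolean isNoise(String token) {
        // Lowercase and strip trailing punctuation such as a sentence-ending period
        String t = token.toLowerCase(Locale.ROOT).replaceAll("[.\\-]+$", "");
        if (t.equals("nano")) return true; // bare "nano" followed by a space or punctuation
        for (String prefix : NOISE_PREFIXES) {
            if (t.startsWith(prefix)) return true;
        }
        return false;
    }

    /** True if the text contains at least one "nano" token that is not a noise keyword. */
    static boolean hasNseKeyword(String text) {
        Matcher m = NANO_TOKEN.matcher(text);
        while (m.find()) {
            if (!isNoise(m.group())) return true;
        }
        return false;
    }
}
```

Under this sketch a patent that mentions only "nanoseconds" is dropped, while one that mentions both "nanoparticles" and "nanoseconds" is kept.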
4. Parsing • Extract the individual data fields from the HTML patent documents and load them into the database.
Parsing example: parsing inventor data

public static void processAssignees() throws IOException {
    … … … …
    String[] assignees = assigneeString.split("<BR>");
    for (int i = 0; i < assignees.length; i++) {
        currentassignee = assignees[i].trim();
        if (currentassignee.length() == 0) continue;
        currentassignee = currentassignee.replaceAll("\r\n", "");

        // Process inventor name
        name = findBetween(currentassignee, 0, "<B>", "</B>");

        // Process inventor address
        currPosition = currentassignee.indexOf("</B>") + "</B>".length();
        address = findBetween(currentassignee, currPosition, "(", ")");
        if (address == null) {
            System.err.println("wrong address: " + patentId);
            continue; // skip this entry; the code below would throw a NullPointerException
        }
        int startIndex = 0, endIndex = 0;
        if ((endIndex = address.lastIndexOf(',')) >= 0) {
            city = address.substring(0, endIndex);
            if (city.lastIndexOf(',') >= 0) {
                city = city.substring(city.lastIndexOf(',') + 1);
                // Strings are immutable: this result is discarded, so city is left unchanged
                city.replaceAll("[^a-zA-Z]", "");
            }
            startIndex = endIndex + 1;
        } else {
            city = "-";
        }
        address = address.substring(startIndex);
        country = findBetween(address, 0, "<B>", "</B>");
        if (country == null) {
            // No bold country code in the address: treat it as a US state
            country = "US";
            state = address.trim();
        } else {
            state = "-";
        }
        name = name.trim();
        city = city.trim();
        state = state.trim();

        // Keep the ranking order of inventors
        rank++;
    }
}
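The parsing code calls a helper named findBetween that the slide does not show. A plausible version is sketched below; it is an assumption, written only to match how the slide uses it (text, start position, opening delimiter, closing delimiter, and a null result when nothing is found). The enclosing class name and the sample entry are made up.

```java
final class PatentText {
    /**
     * Returns the text between the first occurrence of open at or after fromIndex
     * and the next occurrence of close, or null if either delimiter is missing.
     */
    static String findBetween(String text, int fromIndex, String open, String close) {
        if (text == null || fromIndex < 0 || fromIndex > text.length()) return null;
        int start = text.indexOf(open, fromIndex);
        if (start < 0) return null;
        start += open.length(); // move past the opening delimiter
        int end = text.indexOf(close, start);
        if (end < 0) return null;
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // Made-up entry in the shape the slide's code expects: bold name, address in parentheses
        String entry = "<B>Doe; Jane</B> (Tucson, AZ)";
        String name = findBetween(entry, 0, "<B>", "</B>");
        int pos = entry.indexOf("</B>") + "</B>".length();
        String address = findBetween(entry, pos, "(", ")");
        System.out.println(name + " | " + address);
    }
}
```

Returning null rather than an empty string lets the caller distinguish "delimiters missing" from "delimiters present but empty", which is what the address and country checks in the slide rely on.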
Data Analysis Examples • Bibliographic analysis • Top 50 countries

select c.countryName, count(distinct b.patentId)
from usp_assignee a, usp_patentAssignee b, usp_countryName c
where a.assigneeId = b.assigneeId
  and a.aCountry not in ('unknown', '')
  and a.aCountry = c.countryCode
group by c.countryName
order by count(distinct b.patentId) desc
Citation Network Analysis Developing software: Graphviz http://www.pixelglow.com/graphviz/download/
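Graphviz draws graphs described in its DOT language, so a citation network can be handed to it as a plain text file of "citing -> cited" edges. The Java sketch below shows one way to produce such a file from citation pairs. The class names and patent IDs are hypothetical, and how the citation pairs are obtained from the database is not covered.

```java
import java.util.List;

final class CitationDot {
    /** One citation: the citing patent points to the patent it cites. */
    static final class Citation {
        final String citing;
        final String cited;

        Citation(String citing, String cited) {
            this.citing = citing;
            this.cited = cited;
        }
    }

    /** Renders the citations as a DOT directed graph, one edge per line. */
    static String toDot(List<Citation> citations) {
        StringBuilder sb = new StringBuilder("digraph citations {\n");
        for (Citation c : citations) {
            // Node names are quoted so that purely numeric IDs are always valid DOT identifiers
            sb.append("  \"").append(c.citing).append("\" -> \"").append(c.cited).append("\";\n");
        }
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) {
        // Made-up patent IDs for illustration
        System.out.print(toDot(List.of(
                new Citation("7123456", "6987654"),
                new Citation("7123456", "6543210"))));
    }
}
```

Saving the output as citations.dot and running `dot -Tpng citations.dot -o citations.png` would render the network as an image.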
Content Map Analysis Developing software: multi-level self-organizing map algorithm developed by the AI Lab at the University of Arizona