Presentation Transcript


  1. Two Semantic-based products for Large Document Corpus Research: Semantic Insights Research Assistant™ and Research Librarian™. Presented to The Federal Big Data Working Group Meetup on 02 June 2014 by Chuck Rehberg, CTO, Semantic Insights™, a Division of Trigent Software

  2. Introducing the SIRA Technology
  • In this presentation we introduce two web-enabled Semantic-based products for “Large Document Corpus[1] Research”:
    • Research Assistant™
    • Research Librarian™
  • These products are just two of many under development based on the SIRA (Semantic Insights Research Assistant) Technology
  • Note: These two products are currently in limited Beta Test
  [1] By “Document Corpus” we mean any discrete [evolving] online collection of documents

  3. Covered in this Presentation
  • Research Assistant
    • What it does
    • How to use it
    • Examples
      • “What causes autism?”
      • BASS using Research Assistant
      • “Finding a needle in a haystack”…
  • How to Improve SIRA Results
  • Who we are
  • Context
    • By “Large Document Corpus” we mean…
    • By “Research” we mean…
    • A few more key points…
    • Forms of Knowledge used by SIRA
  • Research Librarian
    • What it does
    • How to use it
    • Examples
      • “Who has seen Adnan Shukrijuma?”
      • GAO
      • PubMed

  4. By “Large Document Corpus” we mean…
  • SIRA can “read” 100-10,000 documents/min depending on size, hardware, and bandwidth. All reading is done live, without previous indexing or other processing.
  • Although not a requirement, SIRA can use existing keyword search engines.
  • If the Document Corpus has a keyword search engine, SIRA can analyze the investigation statements, develop a keyword search strategy, and automatically drive the search engine to identify a prioritized list of documents to read (see the sketch below).
  • Limited only by time, SIRA could read the whole corpus… well, at least faster than humans could.
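
  To make the “develop a keyword search strategy” step concrete, here is a minimal illustrative sketch in Python. It is not SIRA's implementation: the tokenizer, stopword list, scoring rule, and the search_engine callable are all assumptions made for illustration only.

    # Illustrative sketch: turn investigation sentences into keyword queries and
    # build a prioritized reading list from an existing keyword search engine.
    # The search_engine callable and the simple scoring are assumptions, not SIRA internals.

    from collections import Counter
    from typing import Callable, List

    STOPWORDS = {"what", "who", "the", "a", "an", "is", "are", "has", "have", "to", "of"}

    def keyword_queries(investigation: List[str]) -> List[List[str]]:
        """One keyword query per investigation sentence: content words only."""
        queries = []
        for sentence in investigation:
            words = [w.strip("?.,").lower() for w in sentence.split()]
            queries.append([w for w in words if w and w not in STOPWORDS])
        return queries

    def prioritized_documents(investigation: List[str],
                              search_engine: Callable[[str], List[str]],
                              limit: int = 500) -> List[str]:
        """Run each query, then rank documents by how many queries returned them."""
        hits: Counter = Counter()
        for query in keyword_queries(investigation):
            for url in search_engine(" ".join(query)):
                hits[url] += 1
        return [url for url, _ in hits.most_common(limit)]

  In this toy version, a document returned for several investigation sentences rises to the top of the reading list; the real keyword strategy is proprietary and certainly more sophisticated.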

  5. By “Research” we mean…
  • Given an investigation (i.e. one or more sentences/questions that describe the information of interest) and a document corpus,
  • Read the document corpus, and
  • Return a report (including bibliography) citing the semantically relevant information found for each sentence in the investigation and identifying the most semantically relevant documents/sections/sentences for this investigation.
  • In many cases, reading the content of the report alone may be sufficient, without the need to access the original documents.
  • Of course, hyperlinks to the original documents can be followed to provide better context.
  • Report structure and content are customizable (a sketch of one possible report structure follows).
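
  Below is a minimal sketch of one possible report structure. The class and field names are hypothetical and chosen for illustration; they are not the actual SIRA report format, which the slides note is customizable.

    # Illustrative sketch of one possible report structure (hypothetical names):
    # findings grouped per investigation sentence, plus a bibliography of the
    # documents cited anywhere in the report.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Citation:
        document_url: str
        section: str
        sentence: str
        relevance: float  # e.g. a normalized matching-concept-cluster count

    @dataclass
    class Finding:
        investigation_sentence: str
        citations: List[Citation] = field(default_factory=list)

    @dataclass
    class ResearchReport:
        investigation: List[str]
        findings: List[Finding] = field(default_factory=list)

        def bibliography(self) -> List[str]:
            """Unique documents cited anywhere in the report, most-cited first."""
            counts: dict = {}
            for f in self.findings:
                for c in f.citations:
                    counts[c.document_url] = counts.get(c.document_url, 0) + 1
            return [url for url, _ in sorted(counts.items(), key=lambda kv: -kv[1])]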

  6. A few more key points…
  • SIRA is not a keyword search engine or anything like it.
  • Grammar is important to SIRA. SIRA is not statistical. SIRA generates high-speed readers for each investigation and reads sentence-by-sentence without pre-indexing. Results are ordered by decreasing relevance, determined by counting the matching “concept clusters” identified in the investigation (see the sketch below).
  • SIRA does not use statistical part-of-speech taggers.
  • When you add a term/sense to the dictionary, it is used right away.
  • When SIRA analyzes your investigation, it may add new Semantic Items to the Ontology.
  • When a Semantic Item is added to the Ontology, it is used right away.
  • SIRA can find semantically relevant information even when it doesn’t resemble your investigation terms or grammar.
  • SIRA uses domain knowledge [and adds knowledge in real time].
  • The more domain knowledge SIRA has, the more accurate the results… more later.
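
  The ordering rule above (decreasing relevance by count of matching concept clusters) can be pictured with a minimal sketch. Representing a cluster as a set of surface terms and matching by word overlap are simplifying assumptions; SIRA matches at the semantic level, not the keyword level.

    # Illustrative sketch: rank read sentences by how many of the investigation's
    # "concept clusters" they match. The term-overlap test is a stand-in only.

    from typing import Dict, List, Set

    def cluster_matches(sentence: str, cluster: Set[str]) -> bool:
        """Toy test: the sentence mentions at least one term of the cluster."""
        words = {w.strip("?.,").lower() for w in sentence.split()}
        return bool(words & cluster)

    def rank_by_relevance(sentences: List[str],
                          concept_clusters: List[Set[str]]) -> List[Dict]:
        scored = []
        for s in sentences:
            count = sum(cluster_matches(s, c) for c in concept_clusters)
            if count:
                scored.append({"sentence": s, "matching_clusters": count})
        # Decreasing relevance = decreasing number of matching concept clusters.
        return sorted(scored, key=lambda r: -r["matching_clusters"])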

  7. Forms of Knowledge used by SIRA
  • Language
    • An encryption of Concepts and Relationships (decrypted using Meaning Maps)
    • Equivalent ways of expressing the same meaning are handled using Equivalent Pattern Sets
  • Dictionary
    • Terms and Senses (+ Linguistic Metadata) and Synonymy…
    • Domains (knowledge disciplines)
  • Ontology
    • A “World View” in terms of Concepts, Relationships,…
  • Implication Patterns
    • Allow for “Do what I want, not what I said”
    • Apply knowledge from experience and expertise to improve results
    • Sample requirement for an implication pattern: for companies to be similar, they must have nearly (+/- 10%) the same market cap, age, and annual revenue.
  (A sketch of how these forms of knowledge might be modeled follows.)
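
  Here is a minimal sketch of how the three forms of knowledge listed above might be modeled. The names and structures are hypothetical and for illustration only; they are not SIRA's internal data model.

    # Illustrative sketch (hypothetical names): a dictionary of terms/senses with
    # synonyms, an ontology of concepts and relationships, and implication
    # patterns that rewrite a user request into the statements it implies.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class Sense:
        term: str
        domain: str                      # knowledge discipline, e.g. "medicine"
        synonyms: List[str] = field(default_factory=list)

    @dataclass
    class Ontology:
        concepts: List[str] = field(default_factory=list)
        # relationship name -> list of (subject concept, object concept) pairs
        relationships: Dict[str, List[tuple]] = field(default_factory=dict)

    @dataclass
    class ImplicationPattern:
        replaces: List[str]              # user phrasings this pattern recognizes
        implies: List[str]               # research statements to search for instead

  With structures like these, the “similar companies” requirement on the slide could be encoded as an implication pattern whose implied statements compare market cap, age, and annual revenue within the stated +/- 10%.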

  8. Introducing the Research Librarian™
  Research Librarian™ is a website that allows you to (1) select a set of document sources and (2) provide a description of your investigation. The Research Librarian™ will then read the documents in the selected sources and generate a research report with bibliography.

  9. Research Librarian™
  • What it does
    • Given an investigation and a large document corpus
    • Automatically identifies a subset of the documents by
      • Generating a set of keyword queries from the investigation, and
      • Executing the queries to identify a subset of the documents
    • Reads the documents (limit is variable – default is top 500 documents)
    • Automatically identifies potentially relevant dictionary domains
    • Generates a report with bibliography identifying the relevant information in the documents read. In many cases you do not need to access the original documents.

  10. Simple Research Librarian Example: “Who has seen Adnan Shukrijuma?”
  • Adnan Gulshair el Shukrijumah [according to the Wikipedia entry]
    • Born in Saudi Arabia, Adnan Gulshair el Shukrijumah (born 4 August 1975) is a member of al-Qaeda who grew up in the United States.
    • In March 2003, a provisional arrest warrant was issued calling him a "material witness", and he was subsequently listed by the U.S. Federal Bureau of Investigation (FBI) on the Seeking Information - War on Terrorism list; since then the United States Department of State, through the Rewards for Justice Program, has offered a bounty of up to US$5 million for information about his location.
    • Last known to have lived with his family in Miramar, Florida, Shukrijumah is known to have a Guyanese passport but might also use a Saudi, Canadian, or Trinidadian passport. Saudi Arabia has repeatedly denied that el Shukrijumah is a Saudi citizen.
    • Now he is considered to be a high-ranking member of al-Qaeda.
    • His mother insists that her asthmatic son has been wrongly accused. He also goes by the names Abu Arif and Jafar al-Tayyar, the latter translating to "Jafar the Pilot".
  • Preparation for running Research Librarian
    • Dictionary contains “Adnan Gulshair el Shukrijumah” with synonyms “Adnan Shukrijumah”, “Shukrijumah”, “Abu Arif”, “Jafar al-Tayyar” and “Jafar the Pilot” (see the dictionary-entry sketch below).
    • The investigation [grew as I worked on it – Chuck]
      • Who has seen Adnan Shukrijumah?
      • Who saw Adnan?
      • Who has contact with Adnan Shukrijumah?
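
  A minimal sketch of what such a dictionary entry with synonyms could look like. The structure and the lookup helper are illustrative assumptions, not SIRA's dictionary format or API.

    # Illustrative sketch of a dictionary entry with synonyms for this investigation.

    SHUKRIJUMAH_ENTRY = {
        "term": "Adnan Gulshair el Shukrijumah",
        "synonyms": [
            "Adnan Shukrijumah",
            "Shukrijumah",
            "Abu Arif",
            "Jafar al-Tayyar",
            "Jafar the Pilot",
        ],
    }

    def mentions_subject(sentence: str, entry: dict = SHUKRIJUMAH_ENTRY) -> bool:
        """True if the sentence mentions the term under any of its synonyms."""
        text = sentence.lower()
        names = [entry["term"]] + entry["synonyms"]
        return any(name.lower() in text for name in names)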

  11. “Who has seen Adnan Shukrijuma?” Out of 200 documents read, 18 had relevant information for this limited investigation. Google and Yahoo each identified some of the same documents and some unique documents for SIRA to read. Where to go from here...

  12. Research Librarian™
  • How to use it
    • Log in to the Research Librarian at http://www.semanticinsights.com/products/ResearchLibrarian.htm
    • Select the document collection from among the options provided
    • Enter your investigation (use complete sentences, questions, etc.)
    • Select the Submit button
    • When your report is ready, the Save as PDF button will appear
    • Select the Download PDF button, then open the PDF once it is downloaded
  • Note: Requires an account (username/password)

  13. Research Librarian Demos
  • GAO
  • PubMed and ClinicalTrials.gov
    • Drug-gene interaction of CYP2D6

  14. GAO demo [1]
  Who is GAO?
  [1] This demo was not done at the request of GAO, but solely as a capability demonstration for FBDWG.

  15. BLUF – Progress has been made but much remains to be done
  • GAO's 2014 annual report identifies 64 new actions that executive branch agencies and Congress could take to improve the efficiency and effectiveness of 26 areas of government.
  • GAO identifies 11 new areas in which there is evidence of fragmentation, overlap, or duplication.
    • For example, under current law, individuals are allowed to receive concurrent payments from the Disability Insurance and Unemployment programs. Eliminating the overlap in these payments could save the government about $1.2 billion over the next 10 years.
  • GAO also identifies 15 new areas where opportunities exist either to reduce the cost of government operations or enhance revenue collections.
    • For example, Congress could rescind all or part of the remaining $4.2 billion in credit subsidies for the Advanced Technology Vehicles Manufacturing Loan program unless the Department of Energy demonstrates sufficient demand for this funding.
  • The executive branch and Congress have made progress in addressing the approximately 380 actions across 162 areas that GAO identified in its past annual reports.
    • As of March 6, 2014, nearly 20 percent of these areas were addressed, over 60 percent were partially addressed, and about 15 percent were not addressed, as shown in the figure.
  • Executive branch and congressional efforts to address these and other actions over the past 3 years have resulted in over $10 billion in cost savings, with billions of dollars more in cost savings anticipated in future years.

  16. GAO demo: Answer a question – What areas are identified that improve efficiency?

  17. GAO demo - results
  • The tool reads the GAO website and provides a report with bibliography.
  • Perhaps this is more useful to GAO technology analysts than the existing advanced search, which returns 16,489 results for the same query.
  • For the last four years GAO has produced an annual efficiency and effectiveness report.
    • What if they had a tool like this one? How much time would it save? It takes about 5 minutes or so to run this question through this big data tool.
  • Also, what if they could do this analysis on reports produced before they started doing the annual assessment? How would that impact the statistics they are using today to monitor and track the completion of actions?

  18. Key benefit to GAO
  A key metric for GAO is specified in terms of financial benefits: specifically, a return for every dollar invested in GAO. Our "big data" demonstration leads to the development and substantiation of a savings metric. Using the suggested efficiency question as a start and expanding to all reports with specific recommendations, the tool develops a list of additional data sources to examine in order to provide further evidence of efficiencies gained.

  19. Research Librarian: PubMed Example
  PubMed investigation: “What metabolizes CYP2D6?”
  414 documents read; 11 documents reported, with a total of 15 citations.

  20. Introducing the Research Assistant™
  More than a search engine: Research Assistant™ is a Google Chrome™ plugin that, given an investigation, reads web pages along with their links and generates a research report with bibliography.

  21. Research Assistant™
  • What it does:
    • Given an investigation
    • Automatically identifies potentially relevant dictionary domains
    • Reads the current webpage and “all” the links in it, and “all” the links in the links (currently set at a max of 500 but can be much more) – see the crawl sketch below
    • Then generates a report with bibliography identifying the relevant information in the documents read. In many cases you do not need to access the original documents.
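
  A minimal sketch of a depth-limited link crawl like the one described above, using only the Python standard library. The page cap of 500 mirrors the slide; everything else (depth handling, fetching, link extraction) is an illustrative assumption, not the plugin's implementation.

    # Illustrative sketch: read the current page, "all" the links in it, and
    # "all" the links in those links, up to a page cap (default 500).

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def read_with_links(start_url: str, max_depth: int = 2, max_pages: int = 500) -> dict:
        """Return {url: html_text} for the start page and linked pages, breadth-first."""
        pages, frontier = {}, [(start_url, 0)]
        while frontier and len(pages) < max_pages:
            url, depth = frontier.pop(0)
            if url in pages:
                continue
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
            except Exception:
                continue
            pages[url] = html
            if depth < max_depth:
                collector = LinkCollector()
                collector.feed(html)
                frontier.extend((urljoin(url, link), depth + 1) for link in collector.links)
        return pages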

  22. Simple Research Assistant Example: “What causes Autism?”

  23. Research Assistant™
  • How to use it
    • While viewing a webpage of interest (this identifies your corpus)
    • Select the Research Assistant plugin (Google Chrome only for now)
    • Enter your investigation (use complete sentences, questions, etc.)
    • Select the Submit button, then wait
    • When your report is ready, the Save as PDF button will appear
    • Select the Download PDF button, then open the PDF once it is downloaded

  24. Research Assistant™
  • Prerequisites
    • Available as a Google Chrome plugin at http://www.semanticinsights.com/products/ResearchAssistant.htm
    • Requires a key to download
  • Note: it can also be embedded as a button on a given page.

  25. Research Assistant Demos
  • BASS Demo
  • “Looking for a needle in a haystack”…
    • Search Brand’s Wiki: http://semanticommunity.info/A_NITRD_Dashboard/Making_the_Most_of_Big_Data#Story

  26. Benefits Assisted Support Services (BASS)
  The Department of Veterans Affairs (VA) is committed to ending Veteran homelessness by 2015. As a result, a number of new and improved programs are being implemented. There are programs available to assist veterans, many of which span traditional service silos. The programs come from within the Department of Veterans Affairs, other government agencies, or non-profit organizations. These organizations have diverse eligibility and access criteria, making it difficult to leverage the appropriate services. The Benefits Assisted Support Services (BASS) Project proposes to support VA staff case management services by providing them with the knowledge and skills to help Veterans apply for appropriate benefits.

  27. But first, you need a little more technology background on “Implication Patterns”
  • The Problem:
    • Sometimes the language of the investigation/query alone is insufficient to find the answer.
    • For example, “I need help to pay rent.”
    • Or, “What companies are similar to IBM?”
    • To make such investigations useful often requires more than using synonyms or “kind of” relationships.
  • Implication Patterns are a way to transform and/or augment the user’s investigation into one that can find more relevant and useful information.

  28. An Example of BASS using the Research Assistant™
  Prior to executing BASS investigations we created:
  • A BASS-specific Ontology with 293 Concepts/Instances
  • A corresponding BASS Dictionary with 439 terms
  • Implication Patterns* to translate “user requests” into effective “research requests”
  For example, a (simplified) Implication Pattern could be:
    IMPL_SET "Help to Pay"
      EXAMPLE: I need help to pay rent. END_EXAMPLE
      REPLACE: I need help to pay rent. END_REPLACE
      REPLACE: I need help paying rent. END_REPLACE
      IMPL: #?# pays rent. END_IMPL
      IMPL: #?# subsidizes rent END_IMPL
      IMPL: #?# pays housing. END_IMPL
      IMPL: #?# subsidizes housing. END_IMPL
    END_IMPL_SET
  • The SIRA “read link depth” is set at 1 (i.e. read only the links on the given page)
  *Implication Patterns typically evolve to be generalized and quite powerful. (A sketch of how such a pattern set could be applied follows.)
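
  A minimal sketch of how an implication pattern set like the one above could be applied to a user investigation. The data structure and exact-match logic are illustrative assumptions; SIRA's pattern engine and the semantics of the #?# placeholder are certainly richer.

    # Illustrative sketch: apply a "Help to Pay" style implication pattern set.
    # If an investigation sentence matches one of the REPLACE phrasings, it is
    # replaced by the IMPL statements (with #?# standing for the unknown provider).

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ImplicationSet:
        name: str
        replaces: List[str]   # user phrasings the set recognizes
        implies: List[str]    # research statements to search for instead

    HELP_TO_PAY = ImplicationSet(
        name="Help to Pay",
        replaces=["I need help to pay rent.", "I need help paying rent."],
        implies=["#?# pays rent.", "#?# subsidizes rent",
                 "#?# pays housing.", "#?# subsidizes housing."],
    )

    def expand_investigation(sentences: List[str],
                             pattern_sets: List[ImplicationSet]) -> List[str]:
        expanded = []
        for sentence in sentences:
            matched = [p for p in pattern_sets
                       if sentence.strip().lower() in (r.lower() for r in p.replaces)]
            if matched:
                for p in matched:
                    expanded.extend(p.implies)
            else:
                expanded.append(sentence)
        return expanded

    # Example: ["I need help paying rent."] becomes the four "#?# ..." statements,
    # so the reader looks for *who* pays or subsidizes rent/housing.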

  29. BASS Demo
  • A web page with a set of relevant links
  • Research Assistant investigation
  • Report with most relevant results
  • Note: The answers are per the Implication Patterns. In this case we have found that the usefulness of the document corpus and the completeness of the Implication Patterns tend to drive accuracy.

  30. Demo: “Finding a Needle in a Haystack”
  • Starting with http://semanticommunity.info/A_NITRD_Dashboard/Making_the_Most_of_Big_Data#Story, an information-rich web page with 345 hyperlinks (note: link depth processing was set to 1)
  • The investigation: “What has NASA invested?”
  • A total of 346 pages were read in about 2 minutes.

  31. Some “Needles” were found…
  16 documents were cited, with a total of 21 references.

  32. Coming to a Conclusion: General ways to Improve SIRA Results
  • Enable Validation of the Results [not enabled by default]
    • Employs deep NLP on each sentence
    • Matches the results with the concept clusters from the investigation
    • Takes longer and reduces “false positives”. However, near matches can be informative as well. (See the validation sketch below.)
  • Dictionary
    • The easiest way to improve results is to be sure the dictionary contains all the senses needed to adequately describe your domain.
  • Ontology
    • Add enough generalization/specialization information for the concepts, instances, and relationships needed to adequately describe your domain.
  • Implication Patterns
    • Encode the best practices of experts in the domain.
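
  A minimal sketch of what an optional validation pass could look like: a stricter second pass that filters candidate results rather than ranking them. The threshold and the simple term-overlap test are illustrative assumptions standing in for the deep NLP matching SIRA actually performs.

    # Illustrative sketch: keep only candidate sentences that match enough of the
    # investigation's concept clusters; flag near matches instead of discarding them.

    from typing import Dict, List, Set

    def validate_results(candidates: List[Dict],
                         concept_clusters: List[Set[str]],
                         min_clusters: int = 2) -> List[Dict]:
        """Drop likely false positives; keep near matches flagged for review."""
        validated = []
        for result in candidates:
            words = {w.strip("?.,").lower() for w in result["sentence"].split()}
            matched = sum(bool(words & cluster) for cluster in concept_clusters)
            if matched >= min_clusters:
                validated.append(dict(result, validated=True, matched_clusters=matched))
            elif matched:  # near match: informative, but marked as unvalidated
                validated.append(dict(result, validated=False, matched_clusters=matched))
        return validated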

  33. Who we are
  • Semantic Insights™ is the R&D division of Trigent Software, Inc. (www.trigent.com)
  • We focus on developing semantics-based information products that produce high-value results, serving the needs of general users who require little or no special training.
  • Visit us at www.semanticinsights.com

  34. Chuck Rehberg
  As CTO at Trigent Software and Chief Scientist at Semantic Insights, Chuck Rehberg has developed patented high-performance rules engine technology and advanced natural language processing technologies that empower a new generation of semantic research solutions. Chuck has more than thirty years in the high-tech industry, developing leading-edge solutions in the areas of Artificial Intelligence, Semantic Technologies, analysis, and large-scale configuration software.
