Big Data at BITEM Research Group

Big Data at BITEM Research Group. (Text|Web) Mining Research Group. patrick.ruch@hesge.ch, http://bitem.hesge.ch. Research projects: Digital Libraries, Web, Personalized medicine, Patent analytics, Consumer Analytics, Pharmacovigilance, Clinical trials…

Presentation Transcript


  1. Big Data at BITEM Research Group
     • (Text|Web) Mining Research Group
     • patrick.ruch@hesge.ch, http://bitem.hesge.ch
     • Research projects: Digital Libraries, Web, Personalized medicine, Patent analytics, Consumer Analytics, Pharmacovigilance, Clinical trials…
     • Specialised in (semi|un)structured data
     • We like text, text and more text
     • Especially on the noisy/dirty Web
     • Technological expertise: CouchDB replication, SolrCloud (distributed indexing and search), indexing/searching in SSD/HDFS/Hadoop, SPARQL endpoints…

  2. Pharmacovigilance on Big Social Media Data
     • Web Sources: Forum, RSS, Twitter API
     • NoSQL Replication: CouchDB instances
     • Cleaning, Normalisation
     • Solr Cloud
     • Dynamic and Real Time Data Analysis: Correlation Analysis, Novelty Detection, Trends Analysis
     • Drugbank: 19'000 drug names checked every 10 min
     • 26'000 per day
     • 7 M docs in 9 months
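The volume figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes that "26'000 per day" counts collected documents and that a month has 30 days; neither is stated on the slide, and the function names are hypothetical.

```python
# Cross-check of the slide's volume figures.
# Assumptions (not stated on the slide): "26'000 per day" counts documents,
# and one month is taken as 30 days.

def projected_docs(per_day: int, months: int, days_per_month: int = 30) -> int:
    """Documents accumulated at a constant daily rate."""
    return per_day * months * days_per_month

def polls_per_day(interval_minutes: int) -> int:
    """Number of polling rounds per day for a given interval."""
    return (24 * 60) // interval_minutes

if __name__ == "__main__":
    total = projected_docs(26_000, 9)
    rounds = polls_per_day(10)
    print(f"Projected documents after 9 months: {total:,}")
    print(f"Polling rounds per day: {rounds}")
    print(f"Drug-name checks per day: {rounds * 19_000:,}")
```

Under these assumptions the daily rate gives about 7.0 million documents in 9 months, which agrees with the "7 M" figure, and a 10-minute interval over 19'000 drug names corresponds to roughly 2.7 million name checks per day.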

  3. Managing the data deluge for protein annotation
     • Protein annotation based on literature by curators
     • 23 000 000 articles
     • 40'000 concepts
     • [Big-scale Multiclass Multilabel Classifier] → Lazy learning!
     • Annotated articles: GOA
     • Manual annotation planned for 2045! (Baumgartner et al)
     • Machine Learning based on Information Retrieval methods
     • Assisting curators
     • Macro reading of literature
     • Profiling any textual content
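The slide names lazy learning built on Information Retrieval methods without giving details. A minimal sketch of that general idea is a k-nearest-neighbour multilabel classifier: retrieve the most similar already-annotated articles and propose their labels, weighted by similarity. This is an illustration only, not the group's actual system; the TF-IDF weighting, the toy corpus, the label names and every function name below are assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer (deliberately simplistic)."""
    return text.lower().split()

def tf_idf_vectors(docs):
    """Return (sparse TF-IDF vectors, idf table) for tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_labels(query, corpus, k=2):
    """Rank labels by the summed similarity of the k nearest annotated texts.

    corpus is a list of (text, set_of_labels) pairs. No model is trained in
    advance: all work happens at query time, which is what makes it "lazy".
    """
    vectors, idf = tf_idf_vectors([tokenize(text) for text, _ in corpus])
    # Terms unseen in the corpus get weight 0 and do not affect the ranking.
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    neighbours = sorted(
        ((cosine(q, v), labels) for v, (_, labels) in zip(vectors, corpus)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    scores = Counter()
    for sim, labels in neighbours:
        for label in labels:
            scores[label] += sim
    return [label for label, score in scores.most_common() if score > 0]

# Toy annotated corpus (hypothetical texts and labels).
CORPUS = [
    ("kinase phosphorylates substrate protein", {"kinase activity"}),
    ("transcription factor binds dna promoter", {"DNA binding"}),
    ("protein kinase signalling cascade", {"kinase activity", "signal transduction"}),
]

if __name__ == "__main__":
    print(knn_labels("novel kinase phosphorylates protein", CORPUS, k=2))
```

Because nothing is trained ahead of time, new annotated articles become usable as soon as they are indexed, which is one plausible reason to prefer this family of methods when the label set runs to tens of thousands of concepts.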

  4. Patent retrieval: the real situation (0.5-1 TB) vs. experiments
     • Database: 13 million patents (real situation); a sample of 1 million patents (experiments)
     • Extraction: 33 days (real situation); 2.5 days (experiments)
     • XML patents: 0.221 TB (real situation); 17 GB (experiments)
     • Normalization: 33 days (real situation); 2.5 days (experiments)
     • XML patents + metadata: 0.234 TB (real situation); 18 GB (experiments)
     • Indexing: 5 days (real situation); 10 hours (experiments)
     • Index: 0.1 TB (real situation); 3 GB (experiments)
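The relation between the 1-million-patent sample and the 13-million-patent collection can be examined by scaling the sample figures linearly. The sketch below assumes that the smaller value of each pair on the slide belongs to the experiments and the larger to the real situation, that "Gb"/"Tb" mean gigabytes and terabytes, and that cost grows linearly with collection size; none of this is stated on the slide, and the names are hypothetical.

```python
# Linear extrapolation from the 1M-patent sample to the 13M-patent collection.
# Assumption: cost and size scale linearly with the number of patents.

SCALE = 13  # 13 million patents / 1 million patents in the sample

# Sample (experiment) figures as read from the slide.
SAMPLE = {
    "extraction_days": 2.5,
    "normalization_days": 2.5,
    "indexing_hours": 10,
    "xml_gb": 17,
    "xml_with_metadata_gb": 18,
    "index_gb": 3,
}

def extrapolate(value, factor=SCALE):
    """Scale a sample measurement linearly to the full collection."""
    return value * factor

if __name__ == "__main__":
    for name, value in SAMPLE.items():
        print(f"{name}: sample={value}, extrapolated={extrapolate(value)}")
    indexing_days = extrapolate(SAMPLE["indexing_hours"]) / 24
    print(f"indexing_days (extrapolated): {indexing_days:.1f}")
```

Under these assumptions the extrapolation gives 32.5 days for extraction and normalization, 221 GB and 234 GB for the XML data, and about 5.4 days for indexing, all close to the real-situation figures on the slide. The index size is the exception: linear scaling gives 39 GB against the 0.1 TB shown.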
