Primary Research Team & Capabilities
URL: http://ikt.ui.sav.sk
Dept. of Parallel and Distributed Computing
Research and Development Areas:
• Large-scale HPCN, Grid and MapReduce applications
• Intelligent and Knowledge oriented Technologies
Experience from IST:
• 3 projects in FP5: ANFAS, CrossGrid, Pellucid
• 6 projects in FP6: EGEE II, K-Wf Grid, DEGREE (coordinator), EGEE, int.eu.grid, MEDIGRID
• 4 projects in FP7: Commius, Admire, Secricom, EGEE III
Several National Projects (SPVV, VEGA, APVT)
IKT Group Focus:
• Information Processing (Large Scale)
• Graph Processing
• Information Extraction and Retrieval
• Semantic Web
• Knowledge oriented Technologies
• Parallel and Distributed Information Processing
Solutions:
• SGDB: Simple Graph Database
• gSemSearch: Graph based Semantic Search
• Ontea: Pattern-based Semantic Annotation
• ACoMA: KM tool in Email
• EMBET: Recommendation System
• Experts on MapReduce and IR (Nutch, Solr, Lucene)
Director & leader of PDC: Dr. Ladislav Hluchý
11 October 2013
Towards Entity Search
• Current approaches
  • Confirmed human knowledge
  • Google Knowledge Graph
  • Facebook Graph Search
• Data sets available
  • Wikipedia
  • DBPedia (111 languages)
  • Freebase
  • Linked Data cloud
• Our approach
  • Quite unique mix of skills:
    • IR, Semantic Web, Graphs and Networks
  • Networks, Text, metadata
  • Graph algorithms
  • Information Retrieval techniques
  • Anchor texts: aliases, properties, types
Entity Search Applications
https://www.linkedin.com/today/post/article/20130805134105-50510-search-what-s-cooking-in-the-lab
http://www.siliconrepublic.com/strategy/item/31182-global-enterprise-search-ma
Entity Search Applications
• Online Advertising
  • Query Categorization
  • Keyword Extension
• Business Intelligence
• Enterprise Search
  • Knowledge Management
  • Text analytics
• Multilingual short text categorizations
  • Based on Wikipedia language versions, DBPedia, Freebase
  • Query Categorization
  • Social media (Twitter) categorization, analysis
• Security Domain
  • Information Leakage prevention
  • Categorization
Large scale Text and Graph data processing
Underlined are the technologies developed by IISAS
Core Technology
• Web crawling
  • Nutch + plugins
• Full text indexing and search
  • Lucene, Solr
• Information Extraction
  • Ontea, GATE
• All above large scale
  • Hadoop, S4
• Graph processing and Querying
  • Simple Graph Database (SGDB)
  • gSemSearch
  • Neo4j
  • Blueprints
Relation to Business Intelligence
• Old BI approaches
  • Data integration from RDBMS
  • Data warehouses
  • OLAP
  • …
• New BI approaches
  • Data structures other than RDBMS: Networks, Semantics
    • Networks/Graphs in Telecom, Social Networks, Transactions, Linked Data …
    • NoSQL: key-value (Tokyo Cabinet), column stores (HBase), Graph databases, RDF(s)
  • In-Memory computing
  • Commodity PC solutions for large data:
    • MapReduce style: Hadoop; Pregel style: Giraph, Hama
  • Big unstructured data processing (on Hadoop):
    • Sentiment analysis, topic detection, named entity detection
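The "MapReduce style" of processing named above can be illustrated with a minimal single-machine sketch. This is plain Python, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Map step: emit a (word, 1) pair for every token of one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as a MapReduce framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def word_count(docs):
    """Run map, shuffle and reduce over a dict of doc_id -> text."""
    mapped = chain.from_iterable(map_phase(i, t) for i, t in docs.items())
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical input documents
counts = word_count({"d1": "graph data graph", "d2": "data search"})
print(counts)
```

On a real cluster the map and reduce steps run in parallel on different machines and the framework performs the grouping; the sketch only shows the data flow.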
Ontea: Information Extraction Tool
http://ontea.sf.net
• Regex patterns
• Gazetteers
• Results
  • Key-value pairs
  • Structured into trees
  • Graphs
• Transformers, Configuration
• Automatic loading of extractors
• Visual Annotation Tool
• Integration with external tools
  • GATE, Stemmers, Hadoop …
• Multilingual tests
  • English, Slovak, Spanish, Italian
Figures: Tree of annotations; Text with annotations; Network/Graph of annotations
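A minimal sketch of pattern-based annotation with regex patterns and a gazetteer, producing key-value pairs as listed above. It does not use Ontea itself; the patterns, the gazetteer entries, the `annotate` function and the sample text are all assumptions chosen for illustration.

```python
import re

# Hypothetical regex patterns: annotation type (key) -> compiled pattern
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "date": re.compile(
        r"\b\d{1,2} (?:January|February|March|April|May|June|July|"
        r"August|September|October|November|December) \d{4}\b"
    ),
}

# Hypothetical gazetteer: known term -> annotation type (key)
GAZETTEER = {"Bratislava": "city", "Slovakia": "country"}


def annotate(text):
    """Return (key, value, offset) annotations sorted by position in the text."""
    annotations = []
    for key, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            annotations.append((key, match.group(0), match.start()))
    for term, key in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            annotations.append((key, term, match.start()))
    return sorted(annotations, key=lambda a: a[2])


text = "Contact info@example.org in Bratislava, Slovakia, before 11 October 2013."
result = annotate(text)
for key, value, offset in result:
    print(key, "=", value, "@", offset)
```

Each annotation is a key-value pair plus a character offset; grouping such pairs under a parent node is one way the flat results could be structured into a tree.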
Named Entity Recognition (NER)
• Combination of existing NER tools
  • ANNIE (GATE), Apache OpenNLP,
  • Illinois NER, Illinois Wikifier,
  • LingPipe, Open Calais,
  • Stanford NER, WikiMiner,
  • Miscinator
• Machine Learning
  • Decision tree models
• Second place at MSM 2013 among 17 participating teams worldwide, missing first place by 1%
  http://ikt.ui.sav.sk/index.php?n=Main.IEChallenge2013
gSemSearch: Graph based Semantic Search
• http://ikt.ui.sav.sk/esns/
• Entity relation search in semantic networks/graphs
• Search, Navigation, Data Interaction
• Aiming at data integration of
  • Structured data (relational data, LinkedData)
  • Unstructured data (text, documents, communication)
• Applications:
  • Email, Web, Text documents, LinkedData
SemSets: Semantic Search
• Answering list type questions: astronauts who walked on the Moon
• Wikipedia as text and networks/graph
  • Text: IR methods, Lucene based
  • Graph/network: spreading activation and SemSets
• Winning solution on Semantic Search Challenge 2011
• Example answer set: Eugene_Cernan, Alan_Bean, David_Scott, John_Young_(astronaut), Neil_Armstrong, Pete_Conrad, Harrison_Schmitt, Alan_Shepard, Charles_Duke, Buzz_Aldrin, James_Irwin, Edgar_Mitchell
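Spreading activation on a graph can be sketched generically as follows. This is not the SemSets implementation: the decay factor, the even split of energy among neighbours, the `spread_activation` function and the toy link structure are all assumptions made for illustration.

```python
def spread_activation(graph, seeds, decay=0.5, steps=2):
    """Spread activation from seed nodes along graph edges.

    graph: dict mapping node -> list of neighbour nodes
    seeds: dict mapping node -> initial activation
    In each step, every node activated in the previous step passes
    decay * energy, split evenly, to its neighbours.
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(steps):
        next_frontier = {}
        for node, energy in frontier.items():
            neighbours = graph.get(node, [])
            if not neighbours:
                continue
            share = decay * energy / len(neighbours)
            for neighbour in neighbours:
                next_frontier[neighbour] = next_frontier.get(neighbour, 0.0) + share
        for node, energy in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + energy
        frontier = next_frontier
    return activation


# Toy graph with hypothetical links between pages
graph = {
    "Moon_landing": ["Neil_Armstrong", "Buzz_Aldrin"],
    "Neil_Armstrong": ["Moon_landing", "Apollo_11"],
    "Buzz_Aldrin": ["Moon_landing", "Apollo_11"],
    "Apollo_11": ["Neil_Armstrong", "Buzz_Aldrin"],
}
scores = spread_activation(graph, {"Moon_landing": 1.0})
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(node, score)
```

Nodes reached over several paths accumulate more activation, which is what makes the technique useful for ranking candidate entities around a query node.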
SGDB: Simple Graph Database
• Storage for graphs
• Optimized for graph traversal and spreading activation
• Faster than Neo4j for graph traversal operations
• Supports Blueprints API
• https://simplegdb.svn.sourceforge.net/svnroot/simplegdb/Sgdb3
• Graph Database Benchmarks
  • Graph Traversal Benchmark for Graph Databases
  • http://ups.savba.sk/~marek/gbench.html
  • Blueprints API: makes it possible to test any compliant graph database
Figure source: http://geza.kzoo.edu/bionet/html/scalefree.html
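The kind of traversal operation such a store is optimized for can be sketched as a breadth-first k-hop neighbourhood query. The sketch uses a plain Python dictionary as the graph and does not call SGDB or the Blueprints API; the `k_hop_neighbourhood` function and the sample graph are assumptions for illustration.

```python
from collections import deque


def k_hop_neighbourhood(adjacency, start, max_hops):
    """Breadth-first traversal returning node -> hop distance,
    for all nodes within max_hops of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


# Hypothetical directed graph as adjacency lists
adjacency = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
reached = k_hop_neighbourhood(adjacency, "a", 2)
print(reached)
```

A traversal benchmark would typically time queries of this shape over many start nodes and hop limits.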
Community Detection in Complex Networks
• Task: Identify densely connected subgraphs in complex networks
  • Community collapsing problem
• SCCD
  • Near-linear time complexity
  • Avoids the community collapsing problem (to a certain extent)
• KDD paper
  • Re-weighting approach
  • Better results on real networks
• Marek Ciglan, Kjetil Nørvåg: Fast detection of size-constrained communities in large networks, Proceedings of WISE'10, LNCS Volume 6488/2010
• Marek Ciglan, Michal Laclavík and Kjetil Nørvåg: On Community Detection in Real-World Networks and the Importance of Degree Assortativity, 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2013
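Degree assortativity, named in the KDD 2013 paper title, can be computed as the Pearson correlation between the degrees at the two ends of each edge. The slides give no formula, so the sketch below assumes this standard definition for an undirected graph; it is not the paper's re-weighting method, and `degree_assortativity` and the sample edges are hypothetical.

```python
def degree_assortativity(edges):
    """Pearson correlation of the degrees at both ends of every edge
    in an undirected graph given as a list of (u, v) pairs."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # Each undirected edge is counted in both directions,
    # so both samples have the same mean and variance.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]

    n = len(xs)
    mean = sum(xs) / n
    covariance = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    variance = sum((x - mean) ** 2 for x in xs) / n
    if variance == 0:
        return float("nan")  # undefined when all nodes have the same degree
    return covariance / variance


# Star graph: one hub connected to three leaves (maximally disassortative)
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
r = degree_assortativity(star)
print(r)
```

Negative values mean high-degree nodes tend to link to low-degree nodes; positive values mean nodes tend to link to others of similar degree.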
Future Direction: Entity Search in Large Graph Data
• Motivation
  • Graph/Network data are everywhere: social networks, web, LinkedData, transactions, communication (email, phone).
  • Text can also be converted to a graph.
  • Interconnecting graph data and searching for relations is crucial.
• Approach
  • Forming semantic trees and graphs from text, web, communication, databases and LinkedData
  • User interaction with graph data in order to achieve integration and data cleansing
  • Users will do this if their effort has an immediate impact on search results