Letizia Tanca, Politecnico di Milano. Goal 2, Activities 4, 6, 7
Goal 2: Knowledge Management (Polimi)
• Activity 4: Knowledge extraction from natural language actions (Polimi + IBM + Bari)
• Activity 6: Knowledge extraction, modeling and integration from semi-structured information sources, driven by domain ontologies (PoliMI)
• Activity 7: Knowledge fusion, “tailoring” and dissemination for business model redesign (PoliMI)
Contextual data analysis
• At GialloRosso the oenologist and the agronomist interact with the data related to harvesting and to wine ageing
  • the information they interact with depends on their role and on the workflow phase
  • the agronomist inserts information about the nature of the natural phenomena
  • the agronomist and the oenologist ask for information related to the phase
• At BiancoRosso the sales manager:
  • analyzes sales data
  • at a different moment analyzes the market trends, then
  • reads similar information in natural language from the web
• GialloRosso performs market analyses by accessing its own information combined with market information collected by its ally BiancoRosso
BiancoRosso Logical Schema
VINO(ID_Vino, nome, vinificazione, invecchiamento, denominazione, temperatura, min_temp, note)
EVENTO(ID_Evento, nome, tipo, data, luogo)
TRENDSETTER(ID_Trend, nome, professione)
FONTE(ID_Fonte, nome, uri, tipo, rilevanza, provenienza, descrizione)
DOCUMENTO(ID_Doc, riassunto, url, data, autore, titolo, argomento, descrittore, ID_Fonte)
VALUTAZIONEMERITO(ID_valutazione, descrizione, giudizio, lingua)
RISULTATORICERCA(ID_risultato, ID_vino, ID_evento, ID_trend, ID_fonte, ID_doc, posizione, ID_valutazione)
At GialloRosso the oenologist and the agronomist interact with the data related to cultivation and to the cellar
A portion of the CDT of our scenario
[Figure: a portion of the CDT; the only surviving node label is “oenologist”]
Some contextual views
• C1 = <role=agronomist, *, phase=harvesting>
• C2 = <role=agronomist, *, phase=ageing>
• C3 = <role=oenologist, *, phase=harvesting>
• C4 = <role=oenologist, *, phase=ageing>
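Each contextual view can be read as a tuple of dimension values in which * matches any value. The following is a minimal Python sketch of that reading, not the project's implementation. The slide does not name the middle dimension, so the field `other` and the sample value `cellar-1` are hypothetical.

```python
from typing import NamedTuple

WILDCARD = "*"

class Context(NamedTuple):
    role: str
    other: str   # hypothetical name: the slide leaves this dimension unnamed
    phase: str

def matches(view: Context, current: Context) -> bool:
    """True if every dimension of `view` is a wildcard or equals `current`."""
    return all(v == WILDCARD or v == c for v, c in zip(view, current))

VIEWS = {
    "C1": Context("agronomist", WILDCARD, "harvesting"),
    "C2": Context("agronomist", WILDCARD, "ageing"),
    "C3": Context("oenologist", WILDCARD, "harvesting"),
    "C4": Context("oenologist", WILDCARD, "ageing"),
}

# Made-up current situation: an oenologist working during the ageing phase.
current = Context("oenologist", "cellar-1", "ageing")
active = [name for name, view in VIEWS.items() if matches(view, current)]
print(active)  # ['C4']
```

Only C4 is selected here because both its role and its phase agree with the current situation, while the wildcard accepts any value for the unnamed dimension.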
Some contextual queries
The agronomist during the harvesting phase (context C1) wants to collect all the available information coming from sensors:
    SELECT m.date_time, m.value, s.s_id, s.meas_unit
    FROM sensor s, measure_data m
    WHERE s.s_id = m.s_id;
S/he obtains only the information from sensors placed in the vineyards (see Rel(C1)).
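The effect of Rel(C1) on this query can be imitated with an ordinary database view. The sketch below uses Python's built-in sqlite3 module. The `location` column, the view name `sensor_c1` and all sample rows are assumptions made for illustration; the slides do not say how Rel(C1) is actually defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sensor (s_id INTEGER PRIMARY KEY, meas_unit TEXT, location TEXT);
CREATE TABLE measure_data (s_id INTEGER, date_time TEXT, value REAL);
INSERT INTO sensor VALUES (1, 'C', 'vineyard'), (2, '%', 'cellar');
INSERT INTO measure_data VALUES
    (1, '2010-09-01 08:00', 21.5),
    (2, '2010-09-01 08:00', 70.0);
-- Hypothetical stand-in for Rel(C1): only vineyard sensors are visible.
CREATE VIEW sensor_c1 AS SELECT * FROM sensor WHERE location = 'vineyard';
""")

# The agronomist's query, evaluated against the tailored view instead of
# the full sensor table.
rows = conn.execute("""
    SELECT m.date_time, m.value, s.s_id, s.meas_unit
    FROM sensor_c1 s, measure_data m
    WHERE s.s_id = m.s_id
""").fetchall()
print(rows)  # [('2010-09-01 08:00', 21.5, 1, 'C')]
```

The cellar sensor never appears in the answer, although the query text itself contains no condition on the sensor position.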
Some contextual queries
The oenologist during the harvesting phase (context C3) wants to collect all the available information about bottles of “Aglianico” wine:
    SELECT *
    FROM bottle b
    WHERE b.appellation = 'aglianico';
But the query is out of context: in context C3 only information about vineyards and grapevines is available to the oenologist.
Some more contextual queries
The previous query makes sense in context C4, where the oenologist is in the ageing phase:
    SELECT *
    FROM bottle b
    WHERE b.appellation = 'aglianico';
Here it produces a non-empty result.
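The in-context and out-of-context behaviour of the bottle query can be sketched as a visibility check on the relations each context exposes. This is an illustrative simplification: the table names `vineyard` and `grapevine` are assumed, and C4 certainly exposes more than the single `bottle` relation listed here.

```python
# Relations visible in each context. Only the facts stated on the slides are
# grounded (C3 shows vineyard and grapevine data, C4 shows bottles); the
# table names and the exact sets are assumptions.
REL = {
    "C3": {"vineyard", "grapevine"},
    "C4": {"bottle"},
}

def in_context(context: str, tables_used: set) -> bool:
    """A query is in context when every table it reads is visible there."""
    return tables_used <= REL[context]

query_tables = {"bottle"}  # tables read by the Aglianico query
print(in_context("C3", query_tables))  # False: out of context
print(in_context("C4", query_tables))  # True: the query can be answered
```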
At BiancoRosso:
• the sales manager analyzes sales data
• the oenologist analyzes wine features to design a new wine
• then s/he reads similar information in natural language from the web
• intensional queries are also performed
Sales and promotions planning (Q1)
Sales and promotions planning for events and festivals
• The sales manager of BiancoRosso wants to select the wines to promote for each event or festival
• For each event or type of event he/she needs to identify the most related wines
• Interesting wines for each event can be obtained by analyzing frequent rules in the form
  • EventType=value → Wine=value
  • E.g., EventType=“Summer party” → Wine=“White wine” (support=20%, confidence=36%)
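For a rule A → B, support is the fraction of all records that contain both A and B, and confidence is the fraction of records containing A that also contain B. A minimal sketch of the two measures follows. The records are made up for illustration and are not the BiancoRosso data, so the figures differ from those on the slide.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    `antecedent` and `consequent` are (attribute, value) pairs.
    """
    (a_attr, a_val), (c_attr, c_val) = antecedent, consequent
    n_a = sum(1 for r in records if r.get(a_attr) == a_val)
    n_both = sum(1 for r in records
                 if r.get(a_attr) == a_val and r.get(c_attr) == c_val)
    support = n_both / len(records)
    confidence = n_both / n_a if n_a else 0.0
    return support, confidence

# Made-up records, one per (event, wine) observation.
records = [
    {"EventType": "Summer party", "Wine": "White wine"},
    {"EventType": "Summer party", "Wine": "Red wine"},
    {"EventType": "Wine fair", "Wine": "Red wine"},
    {"EventType": "Wine fair", "Wine": "White wine"},
    {"EventType": "Summer party", "Wine": "White wine"},
]
support, confidence = rule_metrics(
    records, ("EventType", "Summer party"), ("Wine", "White wine"))
print(f"support={support:.0%}, confidence={confidence:.0%}")
# support=40%, confidence=67%
```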
Sales and promotions planning (Q2)
Sales and promotions planning depending on time periods
• The sales manager wants to plan specific promotions for each time period of the year
• For each time period (e.g., month) the manager needs to select the most related wines
• Interesting wines can be obtained by analyzing frequent rules in the form
  • Month=value → Wine=value
  • E.g., Month=“June” → Wine=“White wine” (support=20%, confidence=36%)
Design of wine (Q3)
Analysis of the main characteristics of wines
• The oenologist of BiancoRosso wants to produce new wines
• He/she needs to know the main characteristics of each wine to select the most interesting wines to produce
• He/she obtains the characteristics of each wine by exploiting rules in the form
  • Wine=value → Characteristic=value
  • E.g., Wine=“White wine” → Characteristic=“Mainly drunk in a specific time period” (support=6%, confidence=100%)
Design of wine (Q4)
Identification of correlations between wines and time periods
• The time period in which each wine is mainly consumed is useful to select the wines to produce
• For each wine the oenologist wants to obtain the time period (e.g., month) in which the wine is mainly consumed
• This allows selecting wines related to time periods not already covered by the wines currently produced by BiancoRosso
• He/she uses rules in the form
  • Wine=value → Month=value
  • E.g., Wine=“White wine” → Month=“June” (support=20%, confidence=100%)
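Once rules of the form Wine=value → Month=value are available, picking wines for uncovered time periods is a simple filter. The sketch below is one possible reading of that step. The rules, the set of wines currently produced and the confidence threshold are all hypothetical.

```python
# Hypothetical mined rules Wine -> Month, as (wine, month, confidence).
rules = [
    ("White wine", "June", 1.0),
    ("Sparkling wine", "December", 0.8),
    ("Red wine", "November", 0.7),
]
produced = {"White wine"}   # assumed current production
MIN_CONFIDENCE = 0.75       # assumed threshold

# Months already covered by the wines currently produced.
covered = {month for wine, month, _ in rules if wine in produced}

# Candidate new wines: not yet produced, reliable rule, uncovered month.
candidates = sorted(
    (wine, month) for wine, month, conf in rules
    if wine not in produced and month not in covered and conf >= MIN_CONFIDENCE
)
print(candidates)  # [('Sparkling wine', 'December')]
```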
Design of wine (Q5)
Identification of correlations between wines and information sources
• Once the oenologist has selected the new wines to be produced, he/she needs to identify the sources containing documents related to the selected wines
• The oenologist identifies the sources containing information about the wines of his/her interest by exploiting the following rules
  • Wine=value → Source=value
  • E.g., Wine=“Montello e colli asolani cabernet superiore” → Source=“Gambero Rosso” (support=11%, confidence=100%)
DIESIRAE: a semantic search engine based on Natural Language Processing
Knowledge Indexing & Extraction: Goals
• Domain model
  • Ontology (W3C OWL standard): describes the concepts of the domain
  • Domain vocabulary (semantic network): describes the lemmas of the domain
• Mapping model
  • Stochastic model: 2nd-order HMM-inspired model
  • Transition probabilities approximated by means of MaxEnt models
  • Solves mapping ambiguities
• Queries
  • Keyword-based (AND/OR; max probability/exhaustive)
  • Phrase-based (Disambiguated Word queries and Ontological queries)
Knowledge indexing & extraction: Functionalities
• Training
• Indexing, querying, and extending
Knowledge indexing & extraction: Information Extraction Engine
• Linguistic Context Extractor:
  • Calls linguistic tools (Stanford Parser, FreeLing, JavaRAP, …)
  • words Wi → (lemmas Li, linguistic context information Ii)
• MaxEnt Models:
  • Calculate the HMM transition probabilities (taking into account the linguistic context information)
• Extended Viterbi:
  • (Li, Ii) → concepts Ci
• TF-IDF:
  • Document ranking, based on concept frequencies
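The lemma-to-concept mapping step can be illustrated with a plain first-order Viterbi decoder. This is a deliberately simplified stand-in: the engine described above uses a 2nd-order HMM-inspired model with MaxEnt-estimated transition probabilities and an extended Viterbi algorithm, none of which is reproduced here. The concept names and all probability tables are invented for the example.

```python
import math

FLOOR = 1e-6  # assumed smoothing value for unseen (concept, lemma) pairs

def viterbi(lemmas, concepts, start_p, trans_p, emit_p):
    """Return the most probable concept sequence for the given lemmas."""
    def emit(concept, lemma):
        return math.log(emit_p[concept].get(lemma, FLOOR))

    # best[c] = (log-probability, path) of the best path ending in concept c
    best = {c: (math.log(start_p[c]) + emit(c, lemmas[0]), [c])
            for c in concepts}
    for lemma in lemmas[1:]:
        step = {}
        for c in concepts:
            score, path = max(
                (best[p][0] + math.log(trans_p[p][c]), best[p][1])
                for p in concepts)
            step[c] = (score + emit(c, lemma), path + [c])
        best = step
    return max(best.values())[1]

# Invented model: "fruit" is ambiguous between a flavour and a crop.
concepts = ["FruitFlavour", "FruitCrop", "Taste"]
start_p = {"FruitFlavour": 0.3, "FruitCrop": 0.4, "Taste": 0.3}
emit_p = {
    "FruitFlavour": {"fruit": 0.9},
    "FruitCrop": {"fruit": 0.9},
    "Taste": {"taste": 0.9},
}
trans_p = {
    "FruitFlavour": {"FruitFlavour": 0.1, "FruitCrop": 0.1, "Taste": 0.8},
    "FruitCrop": {"FruitFlavour": 0.3, "FruitCrop": 0.5, "Taste": 0.2},
    "Taste": {"FruitFlavour": 0.3, "FruitCrop": 0.3, "Taste": 0.4},
}
decoded = viterbi(["fruit", "taste"], concepts, start_p, trans_p, emit_p)
print(decoded)  # ['FruitFlavour', 'Taste']
```

Although FruitCrop has the higher prior for "fruit" in isolation, the following lemma "taste" makes the flavour reading the more probable one, which is the kind of ambiguity the mapping model is meant to resolve.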
Keyword-based queries
• Sequence of isolated words: no linguistic structure
• Exhaustive AND/OR keywords
  • No concept disambiguation
  • Searches for multiple tuples
  • Examples: light wine → several meanings found; country wine → search for instances; taste wine → search for subclasses
• Max probability AND/OR keywords
  • Searches for a single tuple
  • Exploits the a-priori concept probabilities
  • Example: [light wine] → max probability meaning
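The difference between the two keyword modes can be sketched as follows: the exhaustive mode keeps every combination of candidate concepts, while the max-probability mode keeps a single tuple. The candidate concepts and their a-priori probabilities are hypothetical, and choosing the best concept independently for each keyword is an assumption of this sketch.

```python
from itertools import product

# Hypothetical candidate concepts per keyword, with a-priori probabilities.
candidates = {
    "light": [("LightBody", 0.7), ("LightColour", 0.3)],
    "wine": [("Wine", 1.0)],
}

def exhaustive(keywords):
    """All concept tuples: no disambiguation, one search per tuple."""
    return [tuple(concept for concept, _ in combo)
            for combo in product(*(candidates[k] for k in keywords))]

def max_probability(keywords):
    """Single tuple built from the most probable concept of each keyword."""
    return tuple(max(candidates[k], key=lambda cp: cp[1])[0]
                 for k in keywords)

print(exhaustive(["light", "wine"]))
# [('LightBody', 'Wine'), ('LightColour', 'Wine')]
print(max_probability(["light", "wine"]))
# ('LightBody', 'Wine')
```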
Phrase-based queries
• Phrase: linguistic structure, context-based disambiguation
• Disambiguated Word queries
  • Context used for concept disambiguation
  • Indexes the phrase (→ extracts concepts)
  • Searches for AND-ed concepts
  • Example: (fruit taste) → disambiguates fruit
• Ontological queries
  • Context used to select the request to the ontology
  • Indexes the sentence
  • Selects the request; searches the ontology for the mapped concepts
  • Example: “type of tannins in wine” → instance list
GialloRosso performs market analyses by accessing its own information combined with market information collected by its ally BiancoRosso
The integration problem from the user's point of view
[Figure labels: User, query, answer; Application; Global knowledge interface; Data source 1 (RDBMS); Data source 2 (XML); Data source 3 (WWW); Data source 4 (Base station)]
Knowledge retrieval from the sources
In order to integrate the two original sources, we define the following query to populate the ontology:
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
    PREFIX do: <file:///home/lele/workspace/TIS-RewSparQL_progetto/ontologies/art-deco/wineDomain.owl#>
    SELECT ?w1 ?w2 ?wn1 ?wn2 ?wb ?bq ?dse ?dso ?sn
    FROM <file:///home/lele/workspace/TIS-RewSparQL_progetto/ontologies/art-deco/wineDomain.owl>
    WHERE {
      ?w1 rdf:type do:WineInFarm .
      ?wb rdf:type do:WineBottle .
      ?wb do:containsWine ?w1 .
      ?wb do:bottleQuantity ?bq .
      ?w1 do:appellationInFarm ?wn1 .
      ?w2 do:appellationInDocument ?wn2 .
      ?w2 rdf:type do:WineInDocument .
      ?dse rdf:type do:DocSearch .
      ?dso rdf:type do:DocSource .
      ?dse do:searchWineID ?w2 .
      ?dse do:searchSrcID ?dso .
      ?dso do:docSrcName ?sn .
    }
Query 1
Quantity of bottles (in the GialloRosso DB) available for each wine cited by the web source “Percorsi di Vino” (stored in the BiancoRosso DB):
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
    PREFIX do: <file:///home/lele/workspace/TIS-RewSparQL_progetto/ontologies/art-deco/wineDomain.owl#>
    SELECT ?wine_name (SUM(?bottle_quantity) AS ?total_quantity)
    FROM <file:///home/lele/workspace/TIS-RewSparQL_progetto/ontologies/art-deco/wineDomain.owl>
    WHERE {
      ?w1 rdf:type do:WineInFarm .
      ?wb rdf:type do:WineBottle .
      ?wb do:containsWine ?w1 .
      ?wb do:bottleQuantity ?bottle_quantity .
      ?w1 do:appellationInFarm ?wn1 .
      ?w2 do:appellationInDocument ?wine_name .
      ?w2 rdf:type do:WineInDocument .
      ?dse rdf:type do:DocSearch .
      ?dso rdf:type do:DocSource .
      ?dse do:searchWineID ?w2 .
      ?dse do:searchSrcID ?dso .
      ?dso do:docSrcName ?source_name .
      FILTER regex(?source_name, "PercorsiDiVino")
      FILTER fn:contains(?wine_name, ?wn1)
    }
    GROUP BY ?wine_name ?source_name
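What Query 1 computes can be followed on a few rows in plain Python: keep the citations from the chosen source, join them with the farm bottles whenever the document appellation contains the farm appellation, and sum the quantities per cited wine. This is only a sketch of the query semantics, not a SPARQL engine, and every row below is invented.

```python
from collections import defaultdict

# Made-up rows standing in for the two integrated sources.
farm_bottles = [  # (appellationInFarm, bottleQuantity), GialloRosso side
    ("Aglianico", 120),
    ("Aglianico", 30),
    ("Fiano", 50),
]
doc_citations = [  # (appellationInDocument, docSrcName), BiancoRosso side
    ("Aglianico del Vulture", "PercorsiDiVino"),
    ("Fiano di Avellino", "Gambero Rosso"),
]

totals = defaultdict(int)
for wine_name, source_name in doc_citations:
    if "PercorsiDiVino" not in source_name:   # FILTER regex(?source_name, ...)
        continue
    for wn1, quantity in farm_bottles:
        if wn1 in wine_name:                  # FILTER fn:contains(?wine_name, ?wn1)
            totals[wine_name] += quantity     # SUM(?bottle_quantity)

print(dict(totals))  # {'Aglianico del Vulture': 150}
```

Because a single source survives the first filter, grouping by wine name alone gives the same groups as grouping by wine name and source name.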
Query 2
Which sources (from BiancoRosso) cite wines of which we (GialloRosso) have at least one bottle available?
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
    PREFIX do: <file:///home/lele/workspace/TIS-RewSparQL_progetto/ontologies/art-deco/wineDomain.owl#>
    SELECT ?wine_name ?source_name
    FROM <file:///home/lele/workspace/TIS-RewSparQL_progetto/ontologies/art-deco/wineDomain.owl>
    WHERE {
      ?w1 rdf:type do:WineInFarm .
      ?wb rdf:type do:WineBottle .
      ?wb do:containsWine ?w1 .
      ?wb do:bottleQuantity ?bottle_quantity .
      ?w1 do:appellationInFarm ?wn1 .
      ?w2 do:appellationInDocument ?wine_name .
      ?w2 rdf:type do:WineInDocument .
      ?dse rdf:type do:DocSearch .
      ?dso rdf:type do:DocSource .
      ?dse do:searchWineID ?w2 .
      ?dse do:searchSrcID ?dso .
      ?dso do:docSrcName ?source_name .
      FILTER (?bottle_quantity > 0)
      FILTER fn:contains(?wine_name, ?wn1)
    }
    GROUP BY ?wine_name ?source_name
Q & A
(If you see this slide, we have not run out of time)
Part 3 of the book
• Ontology-based knowledge elicitation: an architecture (Chapter editors Licia Sbattella, Roberto Tedesco, Giorgio Orsi, Politecnico di Milano, Marcello Montedoro, IBM Italia)
• Knowledge extraction from Natural Language (Chapter editors Licia Sbattella, Roberto Tedesco, Politecnico di Milano)
• Knowledge extraction from event flows (Chapter editor Alberto Sillitti, Università di Bolzano)
• Context-aware knowledge querying in a networked enterprise (Chapter editors Cristiana Bolchini, Elisa Quintarelli, Fabio A. Schreiber, Politecnico di Milano, Teresa Baldassare, Università di Bari)
• On-the-fly and Context-Aware Integration of Heterogeneous Data Sources (Chapter editors Giorgio Orsi, Letizia Tanca, Politecnico di Milano)
• A methodology for context-driven data-warehouse design (Chapter editors Cristiana Bolchini, Elisa Quintarelli, Letizia Tanca, Politecnico di Milano)