The Development and Applications of Knowledge Discovery. 蔣以仁 (I-Jen Chiang). "I know that I don't know anything."
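Knowledge
• I Ching (易經): "Observe the patterns of the heavens to discern the changes of the seasons; observe the patterns of human culture to transform and perfect the world."
• Three Character Classic (三字經): "Know certain numbers; recognize certain characters."
• Laozi (老子), on knowledge: "Those who know do not speak; those who speak do not know." "One who knows others is wise; one who knows oneself is enlightened."
• Zhuangzi (莊子): "To understand the principle of not knowing is to reach the highest state."
• Socrates (image: bust of Socrates): "I know that I don't know anything."
• Francis Bacon: "Nam et ipsa scientia potestas est" (knowledge is power).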
Why is knowledge management so urgent?
"The chief economic priority for developed countries is to raise the productivity of knowledge ... The country that does this first will dominate the twenty-first century economically." (Peter F. Drucker)
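The arrival of the knowledge economy
• Bill Gates of Microsoft (Gates, 1999)
• In his book on the digital nervous system (《數位神經網路》), Gates states plainly that the enterprises of the future will be built on knowledge and networks.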
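Core concepts of the knowledge economy
• The knowledge economy has the following ten core concepts (高希均, 2000):
(1) Knowledge leads the way
(2) Management drives change
(3) Change brings openness
(4) Technology leads innovation
(5) Innovation overturns tradition
(6) Speed decides success or failure
(7) Entrepreneurship turns the impossible into the possible
(8) The Internet transcends the limits of time and space
(9) Globalization creates both opportunities and risks
(10) Competitiveness determines long-term rise and decline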
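Defining the knowledge economy
• The knowledge economy is an economy that emphasizes the creation, dissemination, and use of knowledge. In other words, its real meaning lies in a complete system that encourages the creation of knowledge, disseminates it effectively, and lets it be applied widely to economic development.
• Morris Chang (張忠謀, 2002): "The point of the knowledge economy is not knowledge itself but turning knowledge into profit, so 'using' technological knowledge matters more than 'owning' it."
• Former U.S. President Clinton: the knowledge economy is a new economic model fueled by technology and driven by entrepreneurship and innovation.
• The two most important things in the knowledge economy era are knowledge management and innovation.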
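The value of preserving knowledge
• Retention and transfer of enterprise knowledge • Investment in knowledge assets • Downsizing and retirement • Staff rotation
• Productivity • Capability • Duplicated effort • Too many meetings • Communication problems
• Organizational goals • Issuing decisions • Feasibility • Speed • Informality
Increase: productivity and quality, transfer of enterprise knowledge, fast and effective decisions, training courses, innovation, collective effort, and so on.
Reduce: cycle time, response time, duplicated investment, operating costs, meeting time, outside consultants, and so on.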
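The 6 Cs of knowledge management
• Collect: accumulate and gather individuals' knowledge and professional skills
• Clarify: identify and filter the knowledge content to be captured
• Classify: make retrieval and search easier
• Communicate: build a virtual communication environment
• Comprehend: improve understanding between the organization and individuals
• Create (innovate/share): create new knowledge and raise the organization's overall capability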
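Knowledge hierarchy
[Figure: hierarchy rising from Data through Information and Knowledge to Wisdom, shown alongside the labels "data (text) mining", "resource distribution", and "knowledge architecture".]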
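Telling the levels apart
Scenario: traveling from Taipei to the airport to fly to Kaohsiung for a meeting.
• Data: 100, 12, 34, 15.
• Information: a typhoon is over the sea 100 km southeast of Taipei, moving northwest on a 34-degree heading at 12 km/h, with maximum gusts of force 15.
• Knowledge: the weather will cause delays or force you to cancel your attendance.
• Wisdom (common sense): you may need to rebook onto the next train to Kaohsiung in order to make the meeting.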
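Knowledge architecture
[Figure: layered architecture with the following labels.]
• Leadership: culture, structure
• Enablers
• Applications: brainstorming, problem solving, project control, requirements analysis, design, testing, analysis, etc.
• Functions (integrated): virtual meetings
• Information repository (virtual archive): shared resources
• Communication
• Network: connectivity, contact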
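Evidence-based and case-based reasoning
[Figure labels: case-based reasoning, evidence-based reasoning, diagnosis from medical records, case studies, literature, evidence and case assessment, supporting evidence, similar cases, concept queries, experience analysis, individual case analysis.]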
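Application areas of data mining
• Retail: mine sales data for customers' purchasing habits; use transaction records to find the product combinations customers prefer, the characteristics of customers who churn, the right moment to launch new products, and so on.
• Direct marketing: its emphasis on segmentation and database marketing becomes more powerful once mining techniques are introduced. For example, data mining can analyze the purchasing behavior and transaction records of customer groups, combine them with basic profile data, and segment customers by brand-value tier to achieve differentiated marketing.
• Manufacturing: mostly applied to quality control, finding the factors in the manufacturing process that most affect product quality in order to improve process efficiency.
• Finance: analyze market trends and forecast individual companies' operations or the movement of stock prices and interest rates.
• Healthcare: predict the efficiency of surgery, medication, diagnosis, or process control.
• Fraud detection: telephone companies, credit card companies, insurers, stock brokers, and government agencies lose considerable sums to fraud every year. Data mining can find shared characteristics in the records of customers with poor credit and predict likely fraudulent transactions, reducing those losses.
• Other: finding players' strengths and weaknesses in NBA game data; classifying celestial objects; searching spacecraft imagery for volcanoes on planets.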
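The retail bullet above mentions finding preferred product combinations from transaction records. A minimal sketch of that idea is to count how often two products share a basket. The function name `frequent_pairs`, the `min_support` threshold, and the sample baskets are all hypothetical and not from the slides; real market-basket analysis would typically use an association-rule algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support=2):
    """Count unordered product pairs that co-occur in at least min_support baskets."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) drops duplicate items and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical transaction records, for illustration only
transactions = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "butter"],
]

print(frequent_pairs(transactions))
```

Raising `min_support` keeps only the pairs that appear together more often, which is the usual lever for controlling how many candidate combinations are reported.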
The process from data to knowledge
[Figure: Raw Data → Integration → Data Warehouse → Selection/cleansing → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Pattern → Interpretation/Evaluation → Knowledge; the figure also carries the label "Understanding".]
BI architecture
[Figure: three-tier architecture. Information sources (operational DBs, semistructured sources) are extracted, transformed, loaded, and refreshed into the data warehouse server (Tier 1: data warehouse, data marts). Tier 1 serves the OLAP servers (Tier 2, e.g., MOLAP, ROLAP), which serve the clients (Tier 3: OLAP, query/reporting, data mining).]
Gaining market intelligence from news feeds Sreekumar Sukumaran and Ashish Sureka
Integrated BI Systems (Sreekumar Sukumaran and Ashish Sureka)
[Figure: structured data (RDBMS, DBMS, XML, EA, legacy systems) goes through ETL; unstructured data (CMS, file system, XML, text, scanned documents, email) goes through a text tagger and annotator to intermediate data and then ETL; both feed the complete data warehouse.]
Sources and value of knowledge
Sources: web information, news reports, patents, email, documents…
• "On average, professional users spend 11 hours per week looking for information. Seventy-one percent said they could not find what they were looking for." Source: "Information Management Software", Lazard Freres & Co. LLC, February 2001.
• "The volume of digitized information will double every year from 2000 to 2005 (an increase to 30 times today's volume)." Source: "Knowledge Management vs. Information Management", Gartner Group, September 2000.
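The literature problem: How Can I Keep Up With the Literature?
• Publication statistics
• 8 TB (books), 25 TB (newspapers), 20 TB (magazines), 2 TB (journals)
• Scientific knowledge grows by an average of 2,000 pages per minute
• Reading the new material would take 5 years (at 24 hours/day)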
Find the Evidence
Problems using MEDLINE:
• No articles retrieved
• "Answers" definitively answered years ago
• Manifestations of renal TB
• Viral/bacterial bronchitis: duration of symptoms
• Legionella: prevalence of relative bradycardia
• Acute allergic episodes: ? thrombocytopenia
• MEDLINE indexed using a system obtuse to most clinicians
• Too many articles retrieved
Evolution “To study history one must know in advance that one is attempting something fundamentally impossible, yet necessary and highly important.” Father Jacobus (Hesse's Magister Ludi) Das Glasperlenspiel (The Glass Bead Game)
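Explosive growth of information
• By 2006 the volume of digital information had reached 161 billion GB (161 exabytes).
• IDC estimates that information will grow roughly sixfold between 2006 and 2010.
• In 2010, nearly 70% of the information in the digital universe will be created by individual users, while organizations will be responsible for the security, privacy, reliability, and regulatory compliance of at least 85% of it.
Source: The Expanding Digital Universe, http://www.emc.com/leadership/digital-universe/expanding-digital-universe.htm
[Figure labels: web information, news reports, patents, email, documents…; Oracle.]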
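Web search engines
• Offline: crawl web pages and store their content in an internal data structure called an inverted index
• Online: answer queries against the index
Monika Henzinger, Search Technologies for the Internet, Science, Vol. 317, no. 5837, 468-471, 27 July 2007.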
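The two phases on this slide can be sketched in a few lines. This is a toy illustration under stated assumptions, not how a production engine works: the function names, the whitespace tokenizer, and the three sample pages are hypothetical, and real indexes also store positions, weights, and compressed postings.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Offline step: map each term to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index


def search(index, query):
    """Online step: return the ids of pages that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical crawled pages, for illustration only
pages = {
    "p1": "jaguar is a big cat",
    "p2": "jaguar builds a luxury car",
    "p3": "the cat sat on the car",
}
index = build_inverted_index(pages)
print(search(index, "jaguar car"))
```

Because the index is built ahead of time, a query only touches the postings of its own terms rather than scanning every page.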
Search Engine Problems • Index Comprehensiveness • Relevance
Deterministic Search
• Search query
• Jaguar (animal)
• Jaguar (automobile)
• Problem: scalability
J. Beall, The Weaknesses of Full-Text Searching. The Journal of Academic Librarianship, 34(5):438-444, 2008.
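The evolution of search engines
• First generation (1995-1997: AV, Excite, Lycos, etc.): uses only on-page text data
• Word frequency, language
• Second generation (from 1998; made popular by Google, but now used by everyone): uses off-page, web-specific data
• Link analysis
• Click-through data (what results people click on)
• Anchor text (hyperlinks; how people refer to this page)
• Third generation (still experimental): answers the need behind the query
• Semantic analysis: what is this about?
• Focus on what the user needs, not just the query
• Inferring the key context
• Assisting the user
• Integrating search and document analysis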
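Problems with web search
• Problems
• Queries are too short and imprecise
• Synonyms and similar terms make query matching hard to predict
• Page authors arrange content in deliberately misleading ways, which degrades search results
• Users need extra features, such as filters
• Solutions
• Better understanding
• Result ranking
• Example of an ambiguous query: "Trailblazer" (car, basketball team)
Monika Henzinger, Search Technologies for the Internet, Science, Vol. 317, no. 5837, 468-471, 27 July 2007.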
Clustered retrieval
• Walter Warnick, Problems of Searching in Web Databases. Science, Vol. 316, no. 5829, 1284, June 2007.
• I-Jen Chiang, Discover the Semantic Topology in High-Dimensional Data, Expert Systems with Applications, 33(1), September 2007.
Gartner 2005 Hype Cycle for Emerging Technologies http://www.gartner.com/resources/130100/130115/gartners_hype_c.pdf
Gartner 2006 Hype Cycle for Emerging Technologies http://www.gartner.com/it/page.jsp?id=495475
[Figure: hype cycle chart; highlighted labels: Applications Architecture, Real World Web.]
• Mashup can quickly meet tactical needs with reduced development costs and improved user satisfaction.
• Enables new ways of performing vertical applications that will result in significantly increased revenue or cost savings for an enterprise.
• Enables new ways of doing business across industries that will result in major shifts in industry dynamics.
Knowledge generation
[Figure: raw text → tokenized text → stemming and stop-word removal → term weighting → vectors. Documents d1 … dm and terms t1 … tn form an m × n matrix of weights w11 … wmn. Term similarity and document similarity drive clustering; centroids and sentence selection drive summarization; metadata/annotation drives classification.]
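The document-by-term weight matrix and the document similarity in this figure can be illustrated with a small sketch. The slide does not say which weighting scheme is used, so the TF-IDF weight (term frequency times log(N / document frequency)) and cosine similarity below are assumptions chosen because they are common defaults. Function names and sample documents are hypothetical, and stemming and stop-word removal are omitted for brevity.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document, weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical documents, for illustration only
docs = [
    "magnesium prevents migraine",
    "magnesium deficiency and migraine",
    "search engines index web pages",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The pairwise similarities computed this way are what a clustering step would consume; documents with no shared terms score zero.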
Text ETL to Mining: IBM TAKMI (Nasukawa, Nagano, 1999)
• Mining target: individual text
• Mining unit: texts; category-labeled items extracted from text using NLP
[Figure: original data (unstructured and structured) passes through linguistic analysis, supported by a category dictionary and a synonym dictionary, to produce category/item metadata that feeds mining, visualization, and interactive mining.]
Unstructured data: "Q: cust sys has stopped working. A: checked cust bios and it need updated. …"
Structured data: Call Taker: James; Date: Aug. 30, 2002; Duration: 10 min.; CustomerID: ADC00123
Extracted items: [Call Taker] James; [Date] 2002/08/30; [Duration] 10 min.; [CustomerID] ADC00123; [Noun] Customer; [Software] BIOS; [Subj...Verb] customer system..stop; [SW..Problem] BIOS..need
Linguistic analysis:
• Tagging
• Dependency analysis
• Named entity extraction
• Intention analysis
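Text is Tough
• Text expresses abstract concepts that are very hard to represent (AI-complete)
• It is an endless combination of abstract, complex relationships among many concepts
• One term can stand for many different concepts
• CELL, IV
• Similar concepts can be expressed in many ways (aliases)
• space ship, flying saucer, UFO, figment of imagination
• Concepts are hard to visualize
• High dimensionality
• The number of analysis dimensions can reach the hundreds or thousands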
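Text Mining is Easy
• Text is highly redundant
• A few simple algorithms can get decent results on fairly crude tasks
• Find important phrases
• Find meaningfully related words
• Build a summary from a document
• Main problems:
• Evaluating the results
• Goals and objectives must be defined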
Luhn's ideas (1958) It is here proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance. It is further proposed that the relative position within a sentence of words having given values of significance furnish a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements. Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2, 159-165. van Rijsbergen 79
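Luhn's two measurements can be combined in a short sketch. The quoted passage does not give a formula, so the details here are assumptions: a word counts as significant if it is not a stop word and occurs at least `min_freq` times, significant words separated by at most `max_gap` other words form a cluster, and a cluster scores (significant words)² / (cluster length). The stop-word list, thresholds, and sample text are hypothetical.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not Luhn's list)
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "it", "for", "on"}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def sentence_score(words, significant, max_gap=4):
    """Best cluster score in a sentence: (significant words)^2 / cluster length."""
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1  # close enough: extend the current cluster
        else:
            best = max(best, count * count / (prev - start + 1))
            start, count = pos, 1  # too far apart: start a new cluster
        prev = pos
    return max(best, count * count / (prev - start + 1))


def summarize(text, n_sentences=1, min_freq=2):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for w in tokenize(text) if w not in STOP_WORDS)
    significant = {w for w, c in freq.items() if c >= min_freq}
    ranked = sorted(
        sentences,
        key=lambda s: sentence_score(tokenize(s), significant),
        reverse=True,
    )
    top = set(ranked[:n_sentences])
    return [s for s in sentences if s in top]


# Hypothetical text, for illustration only
text = (
    "Text mining finds patterns in text. "
    "The weather was pleasant today. "
    "Mining text at scale needs simple algorithms."
)
print(summarize(text, n_sentences=1))
```

Frequency decides which words matter, and position decides how tightly they are packed in a sentence, which mirrors the two measurements described in the quotation.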
Information extraction
Extracted record: foodscience.com-Job2
• JobTitle: Ice Cream Guru
• Employer: foodscience.com
• JobCategory: Travel/Hospitality
• JobFunction: Food Services
• JobLocation: Upper Midwest
• Contact Phone: 800-488-2611
• DateExtracted: January 8, 2001
• Source: www.foodscience.com/jobs_midwest.html
• OtherCompanyJobs: foodscience.com-Job1
[Figure: metasearch architecture with the following labels.]
• Data sources: Internet, commercial data sources, public repositories, agency data sources, locally held data
• Access: spiders, library catalogs, dynamic content, one search engine per source
• Metasearch tool
• Services: custom content, email alerts, personalized access, virtual reference, visualization, online collaboration, data/text mining, collaborative environment, automated categorization
• Taxonomy-driven web portal / security control
Text Analysis Spectrum
[Figure: a spectrum running between the questions "What is this document about?" and "Who did what to whom, when, where, etc."; labeled techniques: classification, clustering, concept identification, entity extraction, targeted facts and events.]
Why is getting dimensional data so hard?
Example: "Hank bought plastic explosives from Henry in Tucson yesterday."
[Figure: a NER engine performing named entity extraction (people, weapons, vehicles, dates) pulls out Hank, Henry, plastic explosives, Tucson, and 11/01/07; the figure also labels FrameNet.]
Name Extraction via HMMs
[Figure: training sentences and answers feed a training program that produces NE models; speech passes through speech recognition to text, and the extractor applies the NE models to output entities. Legend: locations, persons, organizations.]
Example: The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic.
• An easy but successful HMM application:
• Prior to 1997: no learning approach competitive with hand-built rule systems
• Since 1997: statistical approaches (BBN (Bikel et al. 1997), NYU, MITRE, CMU/JustSystems) achieve state-of-the-art performance
Annotation and Tagging
Annotation types: Date, Acquiring Organization, Acquisition Event, Acquired Organization, Place, Amount
Input: On November 16, 2005, IBM announced it had acquired Collation, a privately held company based in Redwood City, California for undisclosed amount.
[Figure: the text annotator emits XML output, which is then output to an RDBMS.]
XML output: On <Date>November 16, 2005</Date>, <ACQUIRING ORG>IBM</ACQUIRING ORG> announced it had <ACQUISITION EVENT>acquired</ACQUISITION EVENT> <ACQUIRED ORG>Collation</ACQUIRED ORG>, a privately held company based in <PLACE>Redwood City, California</PLACE> for <AMOUNT>undisclosed</AMOUNT> amount.
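A very rough approximation of such an annotator is a dictionary lookup plus a date pattern. This sketch is not the annotator shown on the slide: the gazetteer, the tag names (written with underscores so they are valid XML names), and the `annotate` function are hypothetical, and the amount is left untagged. A real system would rely on trained models rather than a hand-written dictionary.

```python
import re

# Hypothetical gazetteer mapping known phrases to annotation types
GAZETTEER = {
    "IBM": "ACQUIRING_ORG",
    "Collation": "ACQUIRED_ORG",
    "Redwood City, California": "PLACE",
    "acquired": "ACQUISITION_EVENT",
}

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
DATE_PATTERN = re.compile(rf"(?:{MONTHS}) \d{{1,2}}, \d{{4}}")


def annotate(text):
    """Wrap dates and gazetteer phrases in XML-style tags."""
    text = DATE_PATTERN.sub(lambda m: f"<DATE>{m.group(0)}</DATE>", text)
    # Longest phrases first, so a short entry cannot split a longer one
    for phrase in sorted(GAZETTEER, key=len, reverse=True):
        tag = GAZETTEER[phrase]
        text = re.sub(
            rf"\b{re.escape(phrase)}\b",
            lambda m, tag=tag: f"<{tag}>{m.group(0)}</{tag}>",
            text,
        )
    return text


sentence = (
    "On November 16, 2005, IBM announced it had acquired Collation, "
    "a privately held company based in Redwood City, California "
    "for undisclosed amount."
)
print(annotate(sentence))
```

Once entities are tagged this way, each tag type can be mapped to a column, which is the step the slide describes as output to an RDBMS.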
What the medical literature tells us
• Source of medical literature: Medline
• It can reveal causal associations between diseases, symptoms, and drugs or compounds
1. Swanson, D.R. Searching natural language text by computer. Machine indexing and text searching offer an approach to the basic problems of library automation. Science, 132:1099-1104, 21 Oct. 1960.
2. Swanson, D.R. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med, 30(1):7-18, 1986.
3. Swanson, D.R. Complementary structures in disjoint science literatures. In A. Bookstein, et al. (Eds.), SIGIR91: Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Chicago, Oct 13-16, 280-289, 1991.
Migraine?
• Stress is associated with migraines
• Stress can lead to loss of magnesium
• Calcium channel blockers prevent some migraines
• Magnesium is a natural calcium channel blocker
• Spreading cortical depression (SCD) is implicated in some migraines
• High levels of magnesium inhibit SCD
• Migraine patients have high platelet aggregability
• Magnesium can suppress platelet aggregability
Smalheiser, N.R. & Swanson, D.R. Assessing a gap in the biomedical literature: Magnesium deficiency and neurologic disease. Neuroscience Research Communications, 15, 1-9, 1994.
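The paired statements above follow one pattern: a concept linked to migraine in one literature is also linked to magnesium in another. A minimal sketch of that linking step intersects the two sets of concepts. The function name and dictionary layout are hypothetical; the entries marked as filler are not from the slide and are only there to show that unshared concepts drop out.

```python
def linking_concepts(a_links, c_links):
    """Return the intermediate concepts connected to both A and C, sorted by name."""
    return sorted(set(a_links) & set(c_links))


# Concepts linked to migraine (A), taken from the slide's statements
migraine_links = {
    "stress": "is associated with migraines",
    "calcium channel blockers": "prevent some migraines",
    "spreading cortical depression": "is implicated in some migraines",
    "platelet aggregability": "is high in migraine patients",
    "aura": "illustrative filler, not from the slide",
}

# Concepts linked to magnesium (C), taken from the slide's statements
magnesium_links = {
    "stress": "can lead to loss of magnesium",
    "calcium channel blockers": "magnesium is a natural one",
    "spreading cortical depression": "is inhibited by high magnesium levels",
    "platelet aggregability": "can be suppressed by magnesium",
    "bone density": "illustrative filler, not from the slide",
}

for concept in linking_concepts(migraine_links, magnesium_links):
    print(f"{concept}: {migraine_links[concept]} | {magnesium_links[concept]}")
```

Each shared concept is a candidate bridge between the two literatures, which a researcher would then check against the underlying papers.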
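Evidence from the literature
[Figure: two bodies of literature, "All Migraine Research" and "All Nutrition Research"; migraine and magnesium are connected through the intermediate concepts CCB, PA, SCD, and stress.]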
Finding new leads: hypothesis generation
[Figure: Raynaud's phenomenon and fish oils are connected through intermediate concepts: vasoconstriction, platelet aggregation, blood viscosity.]
• Swanson, D.R. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med, 30(1):7-18, Autumn 1986.
Literature processing (Dina Demner-Fushman)
[Figure: system architecture with the following labels.]
• Literature processing: MEDLINE citations, MetaMap, NER, annotated citations, UMLS, EBP domain model
• Semantic processing: knowledge extraction, clinical task classification, strength of evidence classification
• Document retrieval: query terms, E-Utilities, Essie
• Query formulation: PICO, question frame
• Semantic matching: document frame against question frame
• Answer generation: answer
Semantic processing example (Dina Demner-Fushman)
Title: Amiodarone versus diltiazem for rate control in critically ill patients with atrial tachyarrhythmias.
Abstract excerpts: … Patients with atrial fibrillation (n = 57), … were randomly assigned to one of three intravenous treatment regimens. Group 1 received diltiazem… group 2 received amiodarone…. Sufficient rate control can be achieved in critically ill patients with atrial tachyarrhythmias using either diltiazem or amiodarone …
[Figure: the semantic processor combines a problem extractor, population extractor, intervention extractor, outcome extractor, task classifier, and strength of evidence classifier.]
• Task: Therapy
• Strength of Evidence: A (RCT)
Outcome extractor (Dina Demner-Fushman)
[Figure: the outcome extractor shown alongside the problem, population, and intervention extractors.]
• Score 0.99: Sufficient rate control can be achieved in critically ill patients with atrial tachyarrhythmias using either diltiazem or amiodarone.
• Score 0.75: Although diltiazem allowed for significantly better 24-hr heart rate control, this effect was offset by a significantly higher incidence of hypotension requiring discontinuation of the drug.
• Base classifiers: cue-terms, heuristic, n-gram, Naïve Bayes, position, length
• Meta-classifier: multiple linear regression
• Training: 275 manually annotated abstracts