PAT-Tree-Based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval. Source: Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C. Students: 陳道輝, 周鉦琪, 葉飛. Advisor: Prof. 黃三益.
Abstract • A PAT-tree-based adaptive approach to keyphrase extraction • IR applications: automatic term suggestion, domain-specific lexicon construction, book indexing, and document classification
Introduction • Keyphrase (keyword) extraction in Chinese is a critical problem because of difficulties in word segmentation and unknown-word identification (e.g., 哈電族)
Definition of the Problems • Lexical pattern (LP): a string of more than one successive character that has a certain number of occurrences in a text collection for a specific domain • For example: 關鍵詞抽取 (keyphrase extraction) • LPs: 關鍵、鍵詞、詞抽、抽取、關鍵詞、鍵詞抽、詞抽取、關鍵詞抽、鍵詞抽取、關鍵詞抽取
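The LP definition above can be illustrated with a minimal Python sketch that enumerates every substring of two or more characters and counts its occurrences. The function name `lexical_patterns` and the parameters `min_len` and `min_freq` are hypothetical, chosen only for illustration; the slides do not say what occurrence threshold is used.

```python
from collections import Counter

def lexical_patterns(text, min_len=2, min_freq=1):
    """Count every substring of at least min_len characters in text.

    min_freq stands in for the "certain number of occurrences" in the
    definition; its value here is an assumption.
    """
    counts = Counter(
        text[i:j]
        for i in range(len(text))
        for j in range(i + min_len, len(text) + 1)
    )
    return {p: c for p, c in counts.items() if c >= min_freq}

patterns = lexical_patterns("關鍵詞抽取")
# List the patterns from shortest to longest, left to right.
for p in sorted(patterns, key=lambda p: (len(p), "關鍵詞抽取".find(p))):
    print(p)
```

For the five-character example this yields the ten LPs listed on the slide.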
Definition of the Problems (cont) • Complete lexical pattern (CLP): an LP that has a complete meaning and proper lexical boundaries • For example: 關鍵詞抽取 • CLPs: 關鍵、抽取、關鍵詞、關鍵詞抽取
Definition of the Problems (cont) • Significant lexical pattern (SLP): a CLP that is either “specific” or “significant” in the database • For example: 關鍵詞抽取 • SLPs: 關鍵詞、關鍵詞抽取
Definition of the Problems (cont) • Definition 1: SLP Extraction Problem • Definition 2: CLP Estimation Problem • Problem 2 must be solved before problem 1 can be solved
Definition of the Problems (cont) • Proposed Approach: 3 modules • Text analysis and PAT-tree indexing module • CLP extraction module • SLP extraction module
Estimation of CLP • Most CLPs have strong associations between their composed and overlapped substrings • Association Norm Estimation (AE) function • If AE is large, the patterns y and z in many cases occur together in the text collection (e.g., 關鍵詞抽取 with its substrings 鍵詞抽取 and 關鍵詞抽)
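The slide names the AE function but does not reproduce its formula. The sketch below assumes one common form, AE(x) = f(x) / (f(y) + f(z) - f(x)), where y and z are the two longest substrings of x (x without its last character and x without its first character) and f is the occurrence count. Both the formula and the frequency values are assumptions for illustration, not figures from the source.

```python
def association_estimate(x, freq):
    """AE under the assumed form f(x) / (f(y) + f(z) - f(x)).

    y drops the last character of x, z drops the first one.
    freq maps a pattern to its (hypothetical) occurrence count.
    """
    y, z = x[:-1], x[1:]
    fx = freq.get(x, 0)
    denom = freq.get(y, 0) + freq.get(z, 0) - fx
    return fx / denom if denom > 0 else 0.0

# Hypothetical counts, invented for this example only.
freq = {
    "關鍵詞抽取": 8, "關鍵詞抽": 8, "鍵詞抽取": 8,
    "關鍵詞": 20, "關鍵": 25, "鍵詞": 20,
}
ae_long = association_estimate("關鍵詞抽取", freq)
ae_short = association_estimate("關鍵詞", freq)
print(ae_long, ae_short)
```

Under this form, AE reaches 1 when y and z never occur except as parts of x, which matches the slide's remark that a large AE means y and z tend to occur together.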
Estimation of CLP (cont) • AE alone is not enough to check whether x has complete lexical boundaries (e.g., 關鍵詞) • To overcome this, two additional metrics are used: LCD (left context dependency) and RCD (right context dependency), e.g., 李登輝 • With these metrics: x is a CLP iff it has neither LCD nor RCD and its AE exceeds the threshold t3
Estimation of CLP (cont) • x has LCD if |L| < t1 or max over z ∈ L of f(zx)/f(x) > t2, where t1 and t2 are threshold values and |L| is the number of unique left-adjacent characters of x • x has RCD if |R| < t1 or max over y ∈ R of f(xy)/f(x) > t2, where t1 and t2 are threshold values and |R| is the number of unique right-adjacent characters of x
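The two context-dependency tests can be sketched in Python as follows. This is a plain scan over a list of segments rather than a PAT-tree lookup, and the function name, the sample segments, and the threshold values t1 = 2 and t2 = 0.5 are all assumptions made for the example.

```python
from collections import Counter

def context_dependency(segments, x, t1=2, t2=0.5):
    """Return (has_lcd, has_rcd) for pattern x over a list of segments."""
    left, right, fx = Counter(), Counter(), 0
    for seg in segments:
        start = seg.find(x)
        while start != -1:
            fx += 1                          # one more occurrence of x
            if start > 0:
                left[seg[start - 1]] += 1    # left-adjacent character
            end = start + len(x)
            if end < len(seg):
                right[seg[end]] += 1         # right-adjacent character
            start = seg.find(x, start + 1)

    def dependent(neighbors):
        if fx == 0:
            return True                      # unseen pattern: treat as dependent
        strongest = max(neighbors.values(), default=0) / fx
        return len(neighbors) < t1 or strongest > t2

    return dependent(left), dependent(right)

# Hypothetical segments, invented for this example.
segments = ["關鍵詞抽取", "關鍵詞", "發展關鍵詞", "進行關鍵詞", "關鍵詞很重要"]
for x in ("關鍵詞", "鍵詞", "關鍵"):
    print(x, context_dependency(segments, x))
```

In this toy data 鍵詞 is always preceded by 關 and 關鍵 is always followed by 詞, so each fails one test, while 關鍵詞 passes both.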
Text Analysis and PAT-Tree Indexing • The PAT tree is the primary implementation structure, used for both text retrieval and keyphrase extraction • Delimiters (commas, quotation marks, periods) determine segment boundaries; semi-infinite strings are then built from each segment • For example: 個人電腦,人腦 • 個人電腦, 人電腦, 電腦, 腦, 人腦, 腦 • Node information: (comparison bit, number of external nodes, frequency) • The PAT tree makes prefix search easy • The inverse PAT tree (IPAT) makes postfix search easy
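The segmentation step above can be sketched in a few lines of Python. The function name and the exact delimiter set are assumptions; the slide lists only commas, quotation marks, and periods as examples.

```python
import re

def semi_infinite_strings(text, delimiters=",，。.、\"“”"):
    """Split text at delimiters, then list every suffix of each segment."""
    pattern = "[" + re.escape(delimiters) + "]"
    segments = [s for s in re.split(pattern, text) if s]
    return [seg[i:] for seg in segments for i in range(len(seg))]

strings = semi_infinite_strings("個人電腦,人腦")
print(strings)
```

For the slide's example this produces the same six strings, including 腦 twice (once from each segment).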
Text Analysis and PAT-Tree Indexing (cont) • Convert the semi-infinite strings to bit sequences • Build the PAT tree from the bit sequences of the semi-infinite strings and the positions at which they differ • An inverse PAT tree is also built over the reversed data streams of the database, to check the occurrences of LSs and RSs • (詞鍵關、詞鍵、詞鍵關展發、詞鍵關行進)
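Both conversions can be sketched briefly in Python. The slides do not name the character encoding; Big5 is assumed here, and the helper names `to_bits` and `inverse_strings` are hypothetical.

```python
def to_bits(s, encoding="big5"):
    """Render the encoded bytes of s as space-separated 8-bit groups.

    The Big5 encoding is an assumption; the source does not state one.
    """
    return " ".join(f"{b:08b}" for b in s.encode(encoding))

def inverse_strings(patterns):
    """Reverse each string, as needed for the inverse PAT tree."""
    return [p[::-1] for p in patterns]

print(to_bits("人"))
reversed_patterns = inverse_strings(["關鍵詞", "鍵詞", "發展關鍵詞", "進行關鍵詞"])
print(reversed_patterns)
```

Reversing the strings turns a question about what precedes a pattern into a prefix query on the reversed text, which is why the inverse tree can serve left-context checks.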
Text Analysis and PAT-Tree Indexing (cont) • Why use a PAT tree (Patricia tree)? • The number of key-value comparisons is low (logarithmic) • Computing time and space are reduced • Search is efficient • The PAT tree can be used to check RCD • The inverse PAT tree can be used to check LCD
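The kind of prefix query a PAT tree answers can be imitated with a much simpler structure. The sketch below is not a PAT tree: it substitutes a sorted list of semi-infinite strings with binary search, which also needs only a logarithmic number of comparisons per lookup. The function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

def build_index(segments):
    """Sorted list of all suffixes; a stand-in for the PAT tree."""
    return sorted(seg[i:] for seg in segments for i in range(len(seg)))

def prefix_frequency(index, prefix):
    """Count the suffixes that start with prefix (occurrences of prefix)."""
    lo = bisect_left(index, prefix)
    # "\U0010ffff" is the largest code point, so this bounds all extensions.
    hi = bisect_right(index, prefix + "\U0010ffff")
    return hi - lo

segments = ["個人電腦", "人腦"]
forward = build_index(segments)
inverse = build_index([s[::-1] for s in segments])

print(prefix_frequency(forward, "人"))        # strings beginning with 人
print(prefix_frequency(inverse, "腦"))        # strings ending with 腦
```

Counting the extensions of x in the forward index gives the right-context frequencies used for RCD, and the same query on the reversed index gives the left-context frequencies used for LCD.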
Extraction of SLP • A CLP is not always an SLP • Being a CLP does not by itself establish significance in the text collection • Many CLPs are common in daily use • Every CLP is checked against a set of lexical rules and a general-domain corpus • Rules: numbers, adverbs, timing-related terms • General-domain PAT tree vs. specific-domain PAT tree
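One way the two checks might be combined is sketched below. The slides do not give the actual rules or the comparison measure, so everything here is an assumption: the numeric pattern, the stop-term set, the relative-frequency ratio test, its threshold, and all counts are invented for illustration.

```python
import re

# Assumed rule: reject patterns made only of digits or Chinese numerals.
NUMERIC = re.compile(r"^[0-9０-９一二三四五六七八九十百千萬]+$")

def is_significant(clp, domain_freq, general_freq,
                   domain_total, general_total,
                   stop_terms=frozenset(), ratio=2.0):
    """Keep a CLP if it passes the rules and is relatively more frequent
    in the specific-domain collection than in the general-domain one."""
    if NUMERIC.match(clp) or clp in stop_terms:
        return False
    domain_rate = domain_freq.get(clp, 0) / domain_total
    # Add-one smoothing avoids division by zero for unseen patterns.
    general_rate = (general_freq.get(clp, 0) + 1) / general_total
    return domain_rate / general_rate >= ratio

# Hypothetical counts for a domain corpus and a general corpus.
domain_freq = {"關鍵詞": 30, "關鍵": 35, "抽取": 12, "三十": 5}
general_freq = {"關鍵詞": 2, "關鍵": 400, "抽取": 150, "三十": 40}
kept = [c for c in domain_freq
        if is_significant(c, domain_freq, general_freq, 1000, 10000)]
print(kept)
```

With these invented counts only 關鍵詞 survives, since 關鍵 and 抽取 are about as frequent in the general corpus as in the domain corpus and 三十 is rejected by the numeric rule.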
Evaluation • Extraction of SLP • Three people were asked to select CLPs and keyphrases from 50 “seed sentences” • These test data were used to measure the accuracy of SLP extraction
Evaluation (cont) • Speed and Space Requirements
Conclusion • The method reduces the difficulty of keyphrase extraction in Chinese and gives better performance
Semi-infinite strings and their bit sequences
Data stream: 個 人 電 腦 , 人 腦 (byte offsets 0, 2, 4, 6, 8, 9, 11); bit columns start at positions 1, 9, 17, 25
String / node number    Bits
個人電腦 / node 0        10101101 11010011 10100100 …
人電腦 / node 2          10100100 01001000 10111001 …
電腦 / node 4            10111001 01110001 00000000 …
腦 / node 6              10111000 00000000 00000000 …
人腦 / node 9            10100100 01001000 00000000 …
腦 / node 6              10111000 00000000 00000000 …
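[Figure: PAT tree built from the semi-infinite strings. Each node is labeled (comparison bit, number of external nodes, string frequency); the labels shown are (0,6,1), (4,6,1), (5,3,1), (8,3,2), and (24,2,1), with external nodes numbered 0, 2, 4, 6, and 9.]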