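The First China Big Data Technology Innovation and Entrepreneurship Competition: Keyword Industry Classification • ThuFit team: 周昕宇, 吴育昕, 任杰, 王禺淇, 罗鸿胤 • Advisors: 方展鹏, 唐杰 • Tsinghua University, Future Internet Interest Team (未来互联网兴趣团队)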
Task • Given: • Partially labeled keywords • The first 10 search results for each keyword • Keyword-buyer relationships • Goal: • Predict the class of each unlabeled keyword
Data summary • keyword_class.txt • 10,787,584 keywords • 1,143,928 labeled (10.6%) • 9,963,062 unique keywords • 33 classes • keyword_users.txt • 23,942,643 entries • Each entry is a keyword-buyer pair • keyword_titles.txt • 21,575,166 entries, of which only 10,787,583 are non-empty • Each entry consists of a keyword and its first 10 Baidu search results
Approach • Preprocessing: • Keyword segmentation • Feature Extraction: • Keyword segment • Keyword-buyer relation • Keyword-segment relation • Search result utilization • Model: • liblinear
Keyword segmentation • Keyword • Segment • A substring of a keyword • A semantic unit • Segmentation • Break a keyword into a set of segments • Two ways: • Exact segmentation • 清华大学 (Tsinghua University) => 清华/大学 • Full segmentation • 清华大学 => 清华/大学/华大/清华大学 • jieba (结巴) Chinese word segmentation: https://github.com/fxsjy/jieba
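The difference between the two segmentation modes can be sketched with a toy dictionary. This is not jieba and not the team's code: the vocabulary, the function names, and the greedy forward-maximum-matching rule used for the exact mode are all assumptions made for illustration.

```python
def full_segmentation(keyword, vocab):
    """Return every vocabulary word that occurs as a substring of keyword."""
    found = []
    for start in range(len(keyword)):
        for end in range(start + 1, len(keyword) + 1):
            piece = keyword[start:end]
            if piece in vocab:
                found.append(piece)
    return found


def exact_segmentation(keyword, vocab):
    """Greedy forward maximum matching: a non-overlapping cover of keyword."""
    segments = []
    pos = 0
    while pos < len(keyword):
        # Try the longest candidate first; fall back to a single character.
        for end in range(len(keyword), pos, -1):
            piece = keyword[pos:end]
            if piece in vocab or end == pos + 1:
                segments.append(piece)
                pos = end
                break
    return segments


# Hypothetical toy vocabulary; a real segmenter ships a large dictionary.
vocab = {"清华", "大学", "华大"}
exact = exact_segmentation("清华大学", vocab)            # non-overlapping segments
full = full_segmentation("清华大学", vocab | {"清华大学"})  # all dictionary hits
```

Exact segmentation yields segments that tile the keyword, while full segmentation also keeps overlapping candidates such as 华大, which gives the classifier more (but noisier) terms.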
Feature Extraction - segment • Sparse representation of segments • Smoothed TFIDF-based feature • N-gram • “End-gram”
Feature Extraction - TFIDF • On this slide only: segment = term • The definition of [symbol missing in source] is given later
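The slide's own formula did not survive in this transcript, so the following is only a sketch of one common smoothed TF-IDF variant, treating each segmented keyword as a document and each segment as a term. The add-one smoothing and the function name are assumptions, not necessarily what the team used.

```python
import math
from collections import Counter


def smoothed_tfidf(docs):
    """docs: list of segment lists. Returns one {segment: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: in how many keywords each segment appears.
    df = Counter(seg for doc in docs for seg in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {}
        for seg, count in tf.items():
            # Add-one smoothing keeps the idf finite and positive.
            idf = math.log((1 + n_docs) / (1 + df[seg])) + 1.0
            vec[seg] = (count / len(doc)) * idf
        vectors.append(vec)
    return vectors


# Two toy keywords, already segmented.
docs = [["清华", "大学"], ["北京", "大学"]]
vecs = smoothed_tfidf(docs)
```

In this toy input 大学 appears in both keywords, so it receives a lower weight than 清华 or 北京, which each appear in only one.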
Feature Extraction - N-gram • N-gram • Captures some structural information • Recall • There are two ways to represent a segmented keyword • as a set • as an ordered list <- adopt this one • 2-gram • Limitation • A large character set produces a large keyword set • Noise • Reduced 2-gram
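A minimal sketch of 2-gram features over the ordered segment list. The separator character and the example segmentation are hypothetical, and the "reduced 2-gram" variant is not shown because the slides do not define it.

```python
def segment_ngrams(segments, n=2, sep="|"):
    """Join each run of n consecutive segments into one feature string."""
    return [sep.join(segments[i:i + n]) for i in range(len(segments) - n + 1)]


# Hypothetical segmentation of a keyword; order matters here.
segments = ["hj", "系列", "双锥", "混合机"]
bigrams = segment_ngrams(segments)
```

Because the list is ordered, "系列|双锥" and "双锥|系列" are distinct features, which is the structural information a plain set of segments cannot express.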
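Feature Extraction - End-gram • End-gram • The last segment of a keyword is more likely to carry discriminative information • Emphasize the last segment: append a character that does not appear in the data, e.g. “漢” • Examples • “rnu209e.tvp2轴承” (轴承 = bearing) • “hj系列双锥混合机市场调查报告” (HJ-series double-cone mixer market research report) • Similarly we can define [term missing in source]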
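One possible reading of the end-gram idea, sketched under assumptions: the last segment is emitted a second time with a marker character appended, so the classifier sees it as a separate term. The function names and the example segmentation are hypothetical.

```python
END_MARK = "漢"  # assumed never to occur in any keyword


def end_gram(segments, mark=END_MARK):
    """Return the last segment tagged with the end marker, or None if empty."""
    if not segments:
        return None
    return segments[-1] + mark


def segment_features(segments):
    """All segments plus the tagged copy of the last one."""
    feats = list(segments)
    tagged = end_gram(segments)
    if tagged is not None:
        feats.append(tagged)
    return feats


# Hypothetical segmentation of "rnu209e.tvp2轴承".
features = segment_features(["rnu209e.tvp2", "轴承"])
```

With the marker, 轴承 at the end of a keyword and 轴承 in the middle of one map to different feature dimensions.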
Feature Extraction • Where is [symbol missing in source]? • Experiments showed that adding [symbol missing in source] slightly degrades performance.
Keyword-buyer/segment relation • [Figure, shown in three animation builds: a graph linking buyers B0-B3, keywords K0-K3, segments S0-S3, and classes C0-C3; in the later builds each buyer, keyword, and segment node is annotated with a list of class labels, e.g. B0: C2 C3 and S3: C2 C3 C0 C3]
Keyword-buyer relation • Assumption: a buyer tends to buy keywords of similar classes • On the labeled data, obtain the class distribution of the keywords each buyer buys • Each buyer has a 33-dimensional feature vector • For each keyword, its feature vector is the average of the feature vectors of the buyers who bought it • Using only this feature we get an accuracy of 0.82
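The buyer feature described above can be sketched as follows. The data layout (dicts keyed by keyword and buyer), the function names, and the toy sizes are assumptions; the slides use 33 classes.

```python
from collections import defaultdict


def buyer_class_distributions(keyword_buyers, labels, num_classes):
    """For each buyer, the normalized class histogram of their labeled keywords."""
    counts = defaultdict(lambda: [0.0] * num_classes)
    for keyword, buyers in keyword_buyers.items():
        if keyword not in labels:
            continue  # unlabeled keywords contribute nothing
        for buyer in buyers:
            counts[buyer][labels[keyword]] += 1.0
    return {b: [c / sum(hist) for c in hist] for b, hist in counts.items()}


def keyword_buyer_feature(keyword, keyword_buyers, buyer_dist, num_classes):
    """Average the class distributions of all buyers of this keyword."""
    vectors = [buyer_dist[b] for b in keyword_buyers.get(keyword, [])
               if b in buyer_dist]
    if not vectors:
        return [0.0] * num_classes
    return [sum(col) / len(vectors) for col in zip(*vectors)]


# Toy data with 3 classes instead of 33.
keyword_buyers = {"k0": ["b0"], "k1": ["b0", "b1"], "k2": ["b1"], "k3": ["b0", "b1"]}
labels = {"k0": 0, "k1": 0, "k2": 2}  # k3 is unlabeled
dist = buyer_class_distributions(keyword_buyers, labels, num_classes=3)
k3_feature = keyword_buyer_feature("k3", keyword_buyers, dist, num_classes=3)
```

In this sketch a labeled keyword's own label contributes to the distributions of its buyers. The slides do not say whether the team excluded it when building training features, which may matter for how well the feature transfers to unlabeled keywords.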
Keyword-buyer relation • [Figure: graph linking buyers B0-B3, keywords K0-K3, segments S0-S3, and classes C0-C3]
Keyword-buyer relation • We also tried modeling buyers by the segments of the keywords they bought, and modeling keyword-keyword relationships through the segments they share. • Buyer -> Keyword -> Segment => Buyer -> Segment • We further introduced higher-order relations between buyers and keywords, but the improvement was marginal.
Keyword-segment relation • Reverse the links between segments and keywords • Keyword -> Segment => Segment -> Keyword
Keyword-segment relation • [Figure: graph linking buyers B0-B3, keywords K0-K3, segments S0-S3, and classes C0-C3]
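Reversing the Keyword -> Segment links amounts to building an inverted index. A minimal sketch, assuming the segmentation is held in a dict; the names are hypothetical:

```python
from collections import defaultdict


def invert_links(keyword_segments):
    """Turn {keyword: [segments]} into {segment: {keywords}}."""
    segment_keywords = defaultdict(set)
    for keyword, segments in keyword_segments.items():
        for seg in segments:
            segment_keywords[seg].add(keyword)
    return dict(segment_keywords)


keyword_segments = {
    "清华大学": ["清华", "大学"],
    "北京大学": ["北京", "大学"],
}
segment_keywords = invert_links(keyword_segments)
```

Once each segment points back to its keywords, a segment can presumably play the same role a buyer plays in the keyword-buyer feature, collecting the class labels of the labeled keywords it occurs in.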
Search Result Utilization • Some weird keywords appear, matching /^[0-9a-zA-Z\-_]{1,}$/ • 1-1828169-5: 1 1828169 5 • 1-1838143-0: 1 1838143 0 • Their search results (raw Baidu result titles, mostly listings from electronic-component and connector suppliers) • 1-1838143-0 1-1838143-0全国供货商【IC37旗下站】1-1838143-0价格|PDF ... IC芯片1-1838143-0品牌、价格、PDF参数 - 电子产品资料 - 买卖IC网 PIC16C57-XT/SP145的IC、二极管、三极管查询,采购PIC16C57-XT/SP... 原装进口连接器 TYCO 1-1838143-0 2000pcs 1005+ 现货 泰科Tyco431829-1集成电路、连接器、接插件 AMP欧式背板连接器崧晔达_达价格_优质崧晔达批发/采购 - 阿里巴巴 供应聚氯乙烯_连接器_供应聚 崧晔达价格_优质崧晔达批发/采购 - 阿里巴巴 供应聚氯乙烯_连接器_供应聚氯乙烯批发_供应聚氯乙烯供应_阿里巴巴 上海金庆电子技术有限公司 限位开关12 福州福铭仪器
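The pattern above can be applied directly in Python. This sketch flags such keywords and shows why their own segments carry little meaning; the function names are hypothetical, and splitting on "-" and "_" is an assumption that happens to reproduce the two examples.

```python
import re

# Same character class as the slide's pattern; fullmatch anchors both ends.
WEIRD_KEYWORD = re.compile(r"[0-9a-zA-Z\-_]{1,}")


def is_weird(keyword):
    """True if the keyword is made only of ASCII letters, digits, '-' and '_'."""
    return WEIRD_KEYWORD.fullmatch(keyword) is not None


def split_code(keyword):
    """Split a serial-number-like keyword on '-' and '_'."""
    return [part for part in re.split(r"[-_]", keyword) if part]


parts = split_code("1-1828169-5")
flags = [is_weird(k) for k in ["1-1838143-0", "清华大学", "rnu209e.tvp2轴承"]]
```

Keywords flagged this way are the ones whose search results are worth consulting, since pieces like "1" or "1828169" say nothing about the industry.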
Search Result Utilization • For normal keywords, the keyword itself has semantic meaning. • Keywords with less semantic information are usually product serial numbers or domain-specific terminology, e.g. chemical element names. • This supplementary information yields more accurate results on “weird” keywords. • But these keywords did not seem to be included in the online test set.
Search Result Utilization • Recall: [formula missing in source] • If we add one more term computed from the search results of the keyword: [formula missing in source] • Performance decreased because of the noise introduced • Example • Keyword (HJ-series double-cone mixer market research report): “hj系列双锥混合机市场调查报告” • Its search results (mostly equipment manufacturer and supplier listings): “混合设备 HJ系列双锥混合机 - 常州市华欧干燥制粒设备有限公司 - ...混合机-供应HJ系列双锥混合机-混合机尽在阿里巴巴-常州欧朋干燥... HJ系列双锥混合机厂家_价格-食品机械行业网HJ系列双锥混合机供应信息,常州市步群干燥设备有限公司 HJ系列双锥混合机_百度百科 HJ系列双锥混合机 - 常州普耐尔干燥设备有限公司 HJ系列双锥混合机价格(江苏 常州)-盖德化工网...”
Feature Statistics • Dimensionality: 200,000 • Lower dimensionality yields better generalization ability.
Implementation • Life is short, you need Python
Model • Liblinear: http://www.csie.ntu.edu.tw/~cjlin/liblinear/ • A Library for Large Linear Classification • L2-loss logistic regression • 33 one-vs-all classifiers, one per class.
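To make the one-vs-all setup concrete, here is a dependency-free sketch that trains one L2-regularized logistic regression per class with plain batch gradient descent and predicts the class with the highest score. It does not call LIBLINEAR, which uses its own, far more efficient solvers; the hyperparameters and toy data are assumptions chosen only so the example runs.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_binary(X, y, l2=0.01, lr=0.5, epochs=200):
    """L2-regularized logistic regression by batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [l2 * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def train_one_vs_all(X, labels, num_classes):
    """One binary classifier per class: class k versus everything else."""
    return [train_binary(X, [1.0 if c == k else 0.0 for c in labels])
            for k in range(num_classes)]


def predict(models, x):
    """Pick the class whose classifier gives the highest score."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) + b for w, b in models]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: 3 classes instead of 33, dense 3-dimensional features.
X = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0.1, 0.9, 0], [0, 0, 1], [0, 0.1, 0.9]]
labels = [0, 0, 1, 1, 2, 2]
models = train_one_vs_all(X, labels, num_classes=3)
predictions = [predict(models, x) for x in X]
```

At the scale described in the slides (about 200,000 sparse dimensions and over a million labeled keywords), a dense pure-Python loop like this would be impractical, which is one reason to rely on a dedicated library.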
Experiments and Results • We split the labeled data into a training set and a validation set. • All results that follow are local results. • Online test results are higher because more training data was used. • Because migrating our code to the Hadoop platform was complex (mainly because we used third-party non-Java libraries), not all of the features above were used in our final submission.
Limitations • Two types of features • Relation features: • Use prior knowledge of class labels • Low dimension • May be biased toward the training data • TFIDF features: • Use no class label information • High dimension • Robust, good generalization ability • But a simple combination of the two does not work well • Ensemble methods may work around this problem.