Personal information • Hung-Hsuan Chen 陳弘軒 • PhD, Computer Science and Engineering, the Pennsylvania State University (2008 – 2013) • MS, BS, Computer Science, National Tsing Hua University (2000 – 2004, 2004 – 2006) • Recent honors • Best paper award, College of Engineering, PSU (2013) • Highest F1 score, plagiarism detection competition, PAN (2013) • Invited to the Amazon PhD research symposium to present research work at Amazon, single-digit acceptance rate (2013) • Travel awards: SIGMOD 2013, ICHI 2013, SBP 2012
Data science? • Venn diagram from data scientist Drew Conway: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Research interest in general: data analysis and mining • [Figure: publications arranged by data type (link only, link + content, content only) and by task (ranking function, user analysis, text analysis, link prediction, similarity search, information propagation, P2P trans.)] • Venues shown: SAC’13, DBSocial’13, DMH’13, KDD’12, TKDD’14, SBP’12, K-CAP’11, JCDL’10, JCDL’11, JCDL’13, JCDL’14, ECIR’14, D-Lib’12, ICPADS’05, ASONAM’13, IAAI’14, WWW’10, MIR’10, CLEF’13, WebSci’14, AAAI’14, TMIS’14, SIGCOMM’08, MobiDE’09
Research experience • RA, IST, Pennsylvania State University (2008 – present) • Large-scale text mining + social network analysis • RA, CSIE, National Taiwan University (08/2011 – 01/2012) • Information propagation analysis • Software engineering intern, Google (05/2010 – 08/2010) • Recommender system + user log analysis • RA, IIS, Academia Sinica (11/2007 – 07/2008) • User traffic analysis • RA, CS, National Tsing Hua University (2004 – 2006) • Distributed data stream analysis
CSSeer • An open-source expert recommender system built on a given digital library • Live site (based on CiteSeerX): http://csseer.ist.psu.edu • The framework has been shipped to Dow Chemical • Expert discovery based on internal technical reports • Author disambiguation (by random forest) • Does the romanized name “Wen-Yi” stand for 雯怡 or 文溢? • “C. Giles” = “C. Lee Giles” = “Lee Giles” = “C. L. Giles”? • Google Scholar suffers from a similar problem • Keyphrase extraction (by naïve Bayes) • Expert ranking (by naïve Bayes)
How to rank experts? • Rank authors, not documents • Text indexing and PageRank-like methods cannot be applied directly • Ranking efficiency • Aggregating author scores on the fly is time-consuming • Precompute unigram scores offline • Ranking quality • What is the probability that author a is an expert given a query term q? • P(a|q) ∝ Σ_d P(d) P(q|d) P(a|q,d) = Σ_d P(d) P(q|d) P(a|d), where the sum runs over documents d, the constant factor 1/P(q) is dropped, and a is assumed independent of q given d • [Figure: a current search engine passes a query term through a document ranking function and returns a ranked list of documents d1 to d5]
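The ranking formula on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the CSSeer implementation: the function name `rank_experts`, the toy corpus, the uniform document prior P(d), the term-frequency estimate of P(q|d), and the uniform P(a|d) over a document's authors are all assumptions made for the example.

```python
from collections import defaultdict


def rank_experts(query_term, docs):
    """Score authors by sum over d of P(d) * P(q|d) * P(a|d).

    docs maps a document id to {"authors": [...], "tokens": [...]}.
    Assumed estimates (not taken from the slides):
      P(d)   : uniform over all documents
      P(q|d) : relative frequency of the query term in the document
      P(a|d) : uniform over the document's authors
    """
    scores = defaultdict(float)
    n_docs = len(docs)
    for doc in docs.values():
        tokens, authors = doc["tokens"], doc["authors"]
        if not tokens or not authors:
            continue  # nothing to score for an empty document
        p_d = 1.0 / n_docs
        p_q_given_d = tokens.count(query_term) / len(tokens)
        p_a_given_d = 1.0 / len(authors)
        for author in authors:
            scores[author] += p_d * p_q_given_d * p_a_given_d
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy corpus, for illustration only.
toy_docs = {
    "d1": {"authors": ["alice", "bob"], "tokens": ["graph", "mining", "graph"]},
    "d2": {"authors": ["alice"], "tokens": ["graph", "search"]},
    "d3": {"authors": ["carol"], "tokens": ["text", "search"]},
}

for author, score in rank_experts("graph", toy_docs):
    print(author, round(score, 4))
```

Because the per-document products depend only on the query term, they could be accumulated per author ahead of time for every unigram, which is one way to read the slide's remark about offline computation.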
Comparison with other expert recommenders • [Figure annotation: simulates Google Scholar] • Ground truth obtained from: http://arnetminer.org/lab-datasets/expertfinding/
ASCOS • Discovering similar objects in a network • Symmetric vs. asymmetric similarity • Similarity can be asymmetric • Coauthoring behavior: a young researcher might be more interested in collaborating with a strong researcher than vice versa • The asymmetric property may reveal the hierarchical relationship between objects • Word association network: “fruit” should be the super-class of “banana” and “apple”, but a sub-class of “food” • Link prediction • ASCOS predicts future collaborations better than SimRank and several other state-of-the-art link prediction algorithms
Intuition of ASCOS • The similarity from i to j depends on the similarity scores from i’s neighbors to j • N(i): the set of neighbors of node i • Uses all paths between nodes • Asymmetric: the score from i to j need not equal the score from j to i
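The intuition above can be made concrete with a small sketch. It assumes one plausible reading of the slide: every node has similarity 1 to itself, and for i ≠ j the score from i to j is a decay constant c times the average score from i's neighbors to j. The function name `ascos_sketch`, the parameter c = 0.9, and the star-shaped toy graph are assumptions for illustration and are not taken from the slides.

```python
def ascos_sketch(neighbors, c=0.9, max_iter=1000, tol=1e-9):
    """Iteratively compute an asymmetric neighbor-based similarity.

    neighbors maps each node to the list of its neighbors N(i).
    Assumed update rule (one reading of the slide, not a reference
    implementation):
        s(i, i) = 1
        s(i, j) = c / |N(i)| * sum of s(k, j) for k in N(i),  i != j
    """
    nodes = list(neighbors)
    sim = {i: {j: 1.0 if i == j else 0.0 for j in nodes} for i in nodes}
    for _ in range(max_iter):
        new_sim = {}
        max_change = 0.0
        for i in nodes:
            new_sim[i] = {}
            for j in nodes:
                if i == j:
                    value = 1.0
                elif neighbors[i]:
                    total = sum(sim[k][j] for k in neighbors[i])
                    value = c * total / len(neighbors[i])
                else:
                    value = 0.0  # an isolated node is similar only to itself
                new_sim[i][j] = value
                max_change = max(max_change, abs(value - sim[i][j]))
        sim = new_sim
        if max_change < tol:
            break  # scores have stopped changing
    return sim


# Hypothetical star graph: one hub connected to three leaves.
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
scores = ascos_sketch(star)
print("leaf -> hub:", round(scores["a"]["hub"], 4))
print("hub -> leaf:", round(scores["hub"]["a"], 4))
```

On this toy graph the score from a leaf to the hub comes out higher than the score from the hub to that leaf, which mirrors the coauthoring example: the weakly connected node "looks toward" the well-connected one more than the reverse.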
Hierarchical structure inference • [Figure: the score difference between the neighbor words of “instrument” and the word “instrument” (node i)] • [Figure: the score difference between the neighbor words of “fruit” and the word “fruit” (node i)]
Data science • Hacking skills in handling big data • CiteSeerX, CSSeer, CollabSeer • 3 million+ documents • 1 million+ authors, 300K+ disambiguated authors • 3 billion+ log entries to analyze • Google ad logs • Several TB per day • Knowledge of math and statistics • Various data mining techniques • Social network analysis • Natural language processing • Interdisciplinary collaboration • Collaborated with Dow Chemical, Alcatel-Lucent, etc.
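Data is changing the world • WhosCall • Telephone number crowd-sourcing • Reverse lookup and number identification • Waze • Map crowd-sourcing + Google Maps info • Automatic road updates + real-time traffic updates • g0v (零時政府) • Taiwan particulate matter pollution map: already used in TV news weather reports • Moedict (萌典): Director Lin of the Ministry of Education: “At the application level, it is no longer a question of whether vendors can build it; these are applications we could never even have imagined.” • And many others… • Netflix, Amazon, Walmart, State Farm, Spotify, Yelp, medical data used in hospitals, etc.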
Potential research projects: IR and DM on MOOCs • MOOC: Massive Open Online Course • Mining keyphrases from slides + audio lectures • Slide text is often not a complete sentence, so POS taggers may not work • Slides provide unique style clues for keyphrase extraction • The speaker’s voice, tone, and other features may provide further clues for keyphrase extraction • Combining these heterogeneous perspectives to improve performance • Automatic course topic clustering or classification • Using students’ interactions with a MOOC to predict their learning performance • Find talented students and slow learners as early as possible • Teach each student according to their aptitude
Potential research projects: IR and DM on digital libraries • Math equation retrieval • How to correctly parse equations (from PDF)? • How to index equations? • Query interface? • Music score retrieval • How to parse music notes? • How to index music? • Query interface? • Inferring the “meaning” of figures • Retrieving the x and y labels and the points in a figure could make a search engine more powerful • Sample query: what is the performance of method Y when x = x1? • “Artificial” paper detection • IEEE and Springer withdrew 120 papers • Fake papers affect user experience and scientometrics
Teaching experience • Guest lectures in classes • Information Retrieval and Search Engines (Spring 2011, Spring 2013, at PSU) • TA • Operating Systems (Fall 2005, at NTHU) • Guest speaker at various seminars (in addition to conference presentations) • Dept. of CS, RIT – 2014 • Amazon – 2013 • Graduate Exhibitions, PSU – 2012, 2013 • College of Engineering, PSU – 2013 • Network Science Seminar, PSU – 2013
Teaching philosophy • Incorporate research into teaching • Students understand the usefulness of what they’ve learned • Students understand what’s happening in science • Students may bring useful feedback or fresh ideas • Learning by doing • Students may better understand the topics • Students usually feel more confident when they implement a concept by themselves • Competition • Online competitions motivate students to think and work hard
Advanced courses I can offer • Data mining • Overview • Evaluation methods • Supervised • Classification • Regression • Unsupervised • Clustering • Density estimation • Applications of DM • Practical issues • Advanced techniques • Information retrieval and search engines • Overview • Retrieval evaluation • Concept of documents • Text processing • Query models and indexing • Web crawling and robots.txt • Link analysis
Advanced courses I can offer • Social network analysis • Overview • Graph theory • Properties of real social networks • Community detection • Link prediction • Information propagation • Heterogeneous social networks • Large-scale text document analysis • Word and document representation • Text mining pipeline • Fundamentals of NLP • Association rules • MapReduce framework • Keyphrase extraction • Duplicate detection • Recommender systems
Basic courses I can offer • Linear algebra • Systems of linear equations • Vectors and matrices • Eigenvalues and eigenvectors • Determinants • When LA meets DM • Matrix decomposition vs. recommender systems • PCA vs. dimension reduction • Eigenvectors vs. PageRank • Probability and statistics • Intro to probability • Random variables and Bayes’ theorem • Discrete random variables and PMF • Continuous random variables and PDF • Joint probability distribution • Confidence intervals • Hypothesis testing
Other courses I am interested in offering • Computer programming • Data structures • Introduction to database systems • Web programming • Open-source software development in practice