Personal information

Hung-Hsuan Chen 陳弘軒 • PhD, Computer Science and Engineering, the Pennsylvania State University (2008 – 2013) • BS and MS, Computer Science, National Tsing Hua University (2000 – 2004, 2004 – 2006) • Recent honors • Best paper award, College of Engineering, PSU (2013) • Highest F1 score, plagiarism detection competition, PAN (2013) • Invited to present research at the Amazon PhD Research Symposium, single-digit acceptance rate (2013) • Travel awards, SIGMOD 2013, ICHI 2013, SBP 2012

Data Science? [Figure: the data science Venn diagram, from data scientist Drew Conway] http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Research interest in general: data analysis and mining [Figure: publications arranged by data type (link only, link + content, content only) and by task (ranking function, user analysis, text analysis, link prediction, similarity search, information propagation, P2P Trans.). Venues shown: SAC’13, DBSocial’13, DMH’13, KDD’12, TKDD’14, SBP’12, K-CAP’11, JCDL’10, JCDL’11, JCDL’13, JCDL’14, ECIR’14, D-Lib’12, ICPADS’05, ASONAM’13, IAAI’14, WWW’10, MIR’10, CLEF’13, WebSci’14, AAAI’14, TMIS’14, SIGCOMM’08, MobiDE’09]

Research experience • RA, IST, Pennsylvania State University (2008 – now) • Large-scale text mining + social network analysis • RA, CSIE, National Taiwan University (08/2011 – 01/2012) • Information propagation analysis • Software engineering intern, Google (05/2010 – 08/2010) • Recommender system + user log analysis • RA, IIS, Academia Sinica (11/2007 – 07/2008) • User traffic analysis • RA, CS, National Tsing Hua University (2004 – 2006) • Distributed data stream analysis

Selected recent research

CSSeer • An open-source expert recommender system built on a given digital library • Live site (based on CiteSeerX): http://csseer.ist.psu.edu • The framework has been shipped to Dow Chemical • Expert discovery based on internal technical reports • Author disambiguation (by random forest) • Wen-Yi: 雯怡? 文溢? (one romanized name, different Chinese names) • “C. Giles” = “C. Lee Giles” = “Lee Giles” = “C. L. Giles”? • Google Scholar suffers from a similar problem • Keyphrase extraction (by naive Bayes) • Expert ranking (by naive Bayes)

How to rank experts? • Rank authors, not documents • Text indexing and PageRank-like methods cannot be directly applied • Ranking efficiency • Aggregating author scores on the fly is time-consuming • Precompute scores for unigrams offline • Ranking quality • What is the probability that author a is an expert given a query term q? • P(a|q) ∝ Σ_d P(d) P(q|d) P(a|q,d) ≈ Σ_d P(d) P(q|d) P(a|d), where d ranges over documents and a is assumed to be conditionally independent of q given d [Figure: a current search engine maps a query term to ranked documents (d1 to d5) through a document ranking function]
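The expert score on this slide can be sketched as a sum over documents. This is a minimal illustration, not the CSSeer implementation: the function name `rank_experts`, the document fields (`prior`, `term_prob`, `author_prob`), and the toy numbers are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy collection. Each document carries:
#   prior       : P(d), the prior probability of the document
#   term_prob   : P(q|d) for each query term q
#   author_prob : P(a|d), how strongly the document is attributed to author a
TOY_DOCS = [
    {"prior": 0.5,
     "term_prob": {"ranking": 0.2},
     "author_prob": {"alice": 0.5, "bob": 0.5}},
    {"prior": 0.5,
     "term_prob": {"ranking": 0.1, "privacy": 0.3},
     "author_prob": {"alice": 1.0}},
]


def rank_experts(query_term, docs):
    """Score authors by sum over d of P(d) * P(q|d) * P(a|d), best first."""
    scores = defaultdict(float)
    for doc in docs:
        p_q_given_d = doc["term_prob"].get(query_term, 0.0)
        if p_q_given_d == 0.0:
            continue  # the document contributes nothing for this term
        for author, p_a_given_d in doc["author_prob"].items():
            scores[author] += doc["prior"] * p_q_given_d * p_a_given_d
    # Scores are unnormalized; only their relative order matters here.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for author, score in rank_experts("ranking", TOY_DOCS):
        print(author, round(score, 4))
```

Because the sum for a single term does not depend on the rest of the query, per-term author scores like these could be computed offline and only combined at query time, which is one way to read the efficiency bullet above.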

Comparison with other expert recommenders [Figure: comparison of expert recommenders; one compared method simulates Google Scholar] • Ground truth obtained from: http://arnetminer.org/lab-datasets/expertfinding/

ASCOS • Discovering similar objects in a network • Symmetric vs. asymmetric similarity • Similarity can be asymmetric • Coauthoring behavior: a young researcher might be more interested in collaborating with a strong researcher than vice versa • The asymmetric property may reveal the hierarchical relationship between objects • Word association network: “fruit” should be the super-class of “banana” and “apple”, but a sub-class of “food” • Link prediction • ASCOS predicts future collaborations better than SimRank and several other state-of-the-art link prediction algorithms

Intuition of ASCOS • Similarity from i to j is dependent on the similarity score from i’s neighbors to j • N(i): the set of neighbors of node i • Utilize all paths between nodes • Asymmetric
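The intuition above can be sketched with a small fixed-point iteration. The slide does not give the exact formula, so this sketch assumes a recursive form in which the score from i to j is a decay factor c times the average score from i's neighbors to j, with the score from any node to itself fixed at 1. The function name `asymmetric_similarity`, the decay value, the iteration count, and the toy graph are all illustrative assumptions.

```python
def asymmetric_similarity(neighbors, c=0.9, iters=100):
    """Iterate s(i, j) = c * mean(s(k, j) for k in N(i)), with s(i, i) = 1.

    neighbors maps each node i to the list N(i) of its neighbors.
    c (decay factor) and iters are assumed values, not taken from the slides.
    """
    nodes = list(neighbors)
    # Start from the identity: each node is only similar to itself.
    s = {i: {j: 1.0 if i == j else 0.0 for j in nodes} for i in nodes}
    for _ in range(iters):
        updated = {}
        for i in nodes:
            updated[i] = {}
            for j in nodes:
                if i == j:
                    updated[i][j] = 1.0
                elif neighbors[i]:
                    total = sum(s[k][j] for k in neighbors[i])
                    updated[i][j] = c * total / len(neighbors[i])
                else:
                    updated[i][j] = 0.0  # isolated node: no path to j
        s = updated
    return s


# Toy star graph: one hub connected to three leaves.
STAR = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}

if __name__ == "__main__":
    scores = asymmetric_similarity(STAR)
    print("leaf to hub:", round(scores["a"]["hub"], 3))
    print("hub to leaf:", round(scores["hub"]["a"], 3))
```

On this toy graph the leaf-to-hub score comes out higher than the hub-to-leaf score, which mirrors the coauthoring example: the leaf's only neighbor is the hub, while the hub divides its attention among three leaves.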

Hierarchical structure inference [Figures: (1) the score difference from the neighbor words of “instrument” to the word “instrument” (node i); (2) the score difference from the neighbor words of “fruit” to the word “fruit” (node i)]

Future research directions (and several ongoing projects)

Data science • Hacking skills in handling big data • CiteSeerX, CSSeer, CollabSeer • 3 million+ documents • 1 million+ authors, 300K+ disambiguated authors • 3 billion+ log entries to analyze • Google ad logs • Several TB per day • Knowledge in math and stats • Various data mining techniques • Social network analysis • Natural language processing • Interdisciplinary collaboration • Collaborated with Dow Chemical, Alcatel-Lucent, etc.

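Data is changing the world • WhosCall • Telephone number crowd-sourcing • Reverse lookup and number identification • Waze • Map crowd-sourcing + Google Maps info • Automatic road updates + real-time traffic updates • g0v (零時政府) • Taiwan particulate matter pollution map: already used in TV news weather reports • MoeDict (萌典): Director Lin of the Ministry of Education: “In terms of applications, the question is no longer whether vendors can build them; we could not even have imagined that such applications were possible.” • And many others… • Netflix, Amazon, Walmart, State Farm, Spotify, Yelp, medical data used in hospitals, etc.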

Potential research projects: IR and DM on MOOCs • MOOC: Massive Open Online Course • Mining keyphrases from slides + audio lectures • Slide text may not form complete sentences, so POS taggers may not work • Slides provide unique style clues for keyphrase extraction • The speaker’s voice, tone, and other features may provide further clues for keyphrase extraction • Combining the above heterogeneous perspectives to improve performance • Automatic course topic clustering or classification • Use students’ interactions with MOOCs to predict their learning performance • Find talented students and slow learners as early as possible • Teach students according to their aptitude

Potential research projects: IR and DM on digital libraries • Math equation retrieval • How to correctly parse equations (from PDF)? • How to index equations? • Query interface? • Music score retrieval • How to parse music notes? • How to index music? • Query interface? • Inferring the “meaning” of figures • Extracting the x- and y-axis labels and the data points from a figure could make a search engine more powerful • Sample query: what’s the performance of method Y when x = x1? • “Artificial” paper detection • IEEE and Springer withdrew 120 papers • Fake papers affect user experience and scientometrics

Teaching

Teaching experience • Guest lectures in classes • Information Retrieval and Search Engines (Spring 2011, Spring 2013, at PSU) • TA • Operating Systems (Fall 2005, at NTHU) • Guest speaker at various seminars (in addition to conference presentations) • Dept of CS, RIT – 2014 • Amazon – 2013 • Graduate Exhibitions, PSU – 2012, 2013 • College of Engineering, PSU – 2013 • Network Science Seminar, PSU – 2013

Teaching philosophy • Incorporate research in teaching • Students understand the usefulness of what they’ve learned • Students understand what’s happening in science • Students may bring useful feedback or fresh ideas • Learning by doing • Students may better understand the topics • Students usually feel more confident when they implement a concept by themselves • Competition • Online competitions motivate students to think and work hard

Advanced courses I can offer • Data mining • Overview • Evaluation methods • Supervised • Classification • Regression • Unsupervised • Clustering • Density estimation • Applications of DM • Practical issues • Advanced techniques • Information retrieval and search engines • Overview • Retrieval evaluation • Concept of documents • Text processing • Query models and indexing • Web crawling and robots.txt • Link analysis

Advanced courses I can offer • Social network analysis • Overview • Graph theory • Properties of real social networks • Community detection • Link prediction • Information propagation • Heterogeneous social networks • Large-scale text document analysis • Word and document representation • Text mining pipeline • Fundamentals of NLP • Association rules • MapReduce framework • Keyphrase extraction • Duplicate detection • Recommender systems

Basic courses I can offer • Linear algebra • Systems of linear equations • Vectors and matrices • Eigenvalues and eigenvectors • Determinants • When LA meets DM • Matrix decomposition vs. recommender systems • PCA vs. dimension reduction • Eigenvectors vs. PageRank • Probability and statistics • Intro to probability • Random variables and Bayes’ theorem • Discrete random variables and PMF • Continuous random variables and PDF • Joint probability distribution • Confidence intervals • Hypothesis testing

Other courses I am interested in offering • Computer programming • Data structures • Introduction to database systems • Web programming • Open-source software development in practice
