OAG: Toward Linking Large-scale Heterogeneous Entity Graphs

Fanjin Zhang, Xiao Liu, Jie Tang, Yuxiao Dong, Peiran Yao, Jie Zhang, Xiaotao Gu, Yan Wang, Bin Shao, Rui Li, and Kuansan Wang. Tsinghua University; Microsoft Research

OAG overview • Open Academic Graph (OAG) is a large knowledge graph unifying two web-scale academic graphs: Microsoft Academic Graph (MAG) and AMiner. • Linking large-scale heterogeneous academic graphs

OAG: Open Academic Graph https://www.openacademic.ai/oag/

Problem & Challenges • Input: two heterogeneous entity graphs $G_1$ and $G_2$. • Output: entity linkings $L = \{(e_1, e_2) \mid e_1 \in G_1, e_2 \in G_2\}$ such that $e_1$ and $e_2$ represent exactly the same entity.

Challenges • Entity heterogeneity • Different types of entities • Heterogeneous attributes • Entity ambiguity • Long-standing name ambiguity problem • Large-scale entity linking • Hundreds of millions of publications in each source.

Related work • Rule-based method: DiscR [TKDE’15] • Traditional ML method: RiMOM [JWS’06], Rong et al. [ISWC’12], Wang et al. [WWW’12], COSNET [KDD’15]. • Embedding-based method: IONE [IJCAI’16], REGAL [CIKM’18], MEgo2Vec [CIKM’18].

Framework: LinKG • Author linking module • Venue linking module • Paper linking module

Framework: LinKG • Venue linking — Sequence-based Entities • An LSTM-based method to capture the dependencies • Paper linking • locality-sensitive hashing and convolutional neural networks for scalable and precise linking. • Author linking • heterogeneous graph attention networks to model different types of entities.

Linking venues: sequence-based entities • Input: venue names in each graph • Output: linked venue pairs • Idea: LSTM-based method • Direct name matching (easy cases) • Fuzzy-sequence linking
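The direct name matching step for easy cases can be sketched as follows. This is a minimal illustration, not the authors' code: the normalization rules and the helper names `normalize_venue` and `direct_match` are assumptions made for this example.

```python
import re


def normalize_venue(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())


def direct_match(venues_a, venues_b):
    """Return (a, b) pairs whose normalized names are identical."""
    index = {}
    for b in venues_b:
        index.setdefault(normalize_venue(b), []).append(b)
    pairs = []
    for a in venues_a:
        for b in index.get(normalize_venue(a), []):
            pairs.append((a, b))
    return pairs


pairs = direct_match(
    ["Diagnostic and Interventional Imaging", "KDD"],
    [
        "diagnostic and interventional imaging.",
        "Journal of Diagnostic Imaging and Interventional Radiology",
    ],
)
print(pairs)
```

Venue names that do not match exactly after normalization would be left for the fuzzy-sequence (LSTM-based) stage.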

Venue linking characteristics • Word order matters • E.g. ‘Diagnostic and interventional imaging’ and ‘Journal of Diagnostic Imaging and Interventional Radiology’ • Fuzzy matching for varied-length venue names. • Extra or missing prefix or suffix • E.g. Proceedings of the Second International Conference on Advances in Social Network Mining and Analysis.

Venue linking model • Input: raw word sequence; keywords extracted from integral sequences • Two LSTM layers • Output: similarity score

Linking papers: large-scale entities • Problem setting: To link paper entities, we fully leverage the heterogeneous information, including a paper’s title and authors. • Leverage the hashing technique (LSH) for fast processing • Adopt Doc2Vec to transform titles to real-valued vectors • Use LSH to map real-valued paper features to binary codes. • Use a convolutional neural network for effective linking.
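The step of mapping real-valued paper features to binary codes can be illustrated with random-hyperplane LSH. The slides do not say which LSH family is used, so the hyperplane scheme, the function names, and the toy vectors below are all assumptions; in the described pipeline the input vectors would come from Doc2Vec.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_code(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    bits = []
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)


def hamming(code_a, code_b):
    """Number of differing bits between two codes."""
    return sum(x != y for x, y in zip(code_a, code_b))


planes = make_hyperplanes(dim=4, n_bits=16)
title_vec = [0.3, -1.2, 0.8, 0.1]            # stand-in for a title vector
scaled = [2.0 * x for x in title_vec]        # same direction, different length
opposite = [-x for x in title_vec]           # opposite direction

print(hamming(lsh_code(title_vec, planes), lsh_code(scaled, planes)))
print(hamming(lsh_code(title_vec, planes), lsh_code(opposite, planes)))
```

Vectors pointing in similar directions tend to share most bits, so candidate pairs can be found by comparing short binary codes instead of full vectors.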

Paper linking characteristics • Large-scale entities • Hundreds of millions of academic publications for each graph. • Local and hierarchical matching patterns • Paper titles are often truncated if they contain punctuation marks, such as ‘:’ and ‘?’ • Different author name formats: ‘Jing Zhang’, ‘J. Zhang’, and ‘Zhang, J.’

Paper linking model: CNN model • Convolution on input similarity matrix • Word-level similarity matrix • MLP layers
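A word-level similarity matrix of the kind a CNN could convolve over might be built as below. This is only a sketch: the slides do not specify the word similarity measure, so the character-based ratio from Python's `difflib` and the simple tokenizer are stand-in assumptions.

```python
from difflib import SequenceMatcher


def tokenize(title):
    """Lowercase and split on whitespace, treating ':' and '?' as separators."""
    return title.lower().replace(":", " ").replace("?", " ").split()


def similarity_matrix(title_a, title_b):
    """Entry [i][j] is the similarity of word i of title_a and word j of title_b."""
    words_a, words_b = tokenize(title_a), tokenize(title_b)
    return [
        [SequenceMatcher(None, wa, wb).ratio() for wb in words_b]
        for wa in words_a
    ]


matrix = similarity_matrix(
    "OAG: Toward Linking Entity Graphs",
    "Toward linking entity graph",
)
for row in matrix:
    print([round(x, 2) for x in row])
```

A truncated or slightly altered title still produces a diagonal band of high values in the matrix, which is the kind of local pattern convolutional filters are suited to detect.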

Linking authors: ambiguous entities • Problem setting: To link author entities, we generate a heterogeneous subgraph for each author. • One author’s subgraph is composed of his or her coauthors, papers, and publication venues. • Also incorporate the venue and paper linking results. • Present a heterogeneous graph attention network based technique for author linking.

Author linking characteristics • Name ambiguity • 16,392 Jing Zhang in AMiner and 7,170 Jing Zhang in MAG • Attribute sparsity • Missing affiliations, homepages… • Already linked papers and venues! • View author linking as a subgraph matching problem • Aggregate needed information from neighbors

Graph neural networks • Neighborhood aggregation: • Aggregate neighbor information and pass it into a neural network • It can be viewed as a center-surround filter in CNNs: graph convolutions!
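Neighborhood aggregation can be sketched in a few lines. The example below assumes mean aggregation followed by a single linear layer with ReLU; the toy graph, the feature values, and the weight matrix are made up for illustration.

```python
def aggregate(features, neighbors, node):
    """Mean of the node's own feature vector and its neighbors' vectors."""
    members = [node] + list(neighbors[node])
    dim = len(features[node])
    return [sum(features[m][i] for m in members) / len(members) for i in range(dim)]


def gnn_layer(features, neighbors, weight):
    """Aggregate each neighborhood, then apply a linear map and ReLU."""
    out = {}
    for node in features:
        h = aggregate(features, neighbors, node)
        out[node] = [max(0.0, sum(w * x for w, x in zip(row, h))) for row in weight]
    return out


features = {"v": [1.0, 0.0], "a": [0.0, 1.0], "b": [2.0, 2.0]}
neighbors = {"v": ["a", "b"], "a": ["v"], "b": ["v"]}
identity = [[1.0, 0.0], [0.0, 1.0]]

hidden = gnn_layer(features, neighbors, identity)
print(hidden["v"])
```

Stacking several such layers lets each node's representation depend on nodes more than one hop away.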

GCN: graph convolutional networks • GCN is one form of neighbor aggregation • GraphSAGE • Graph Attention • …

LinKG step 1: paired subgraph construction • Subgraph nodes • direct (heterogeneous) neighbors, including coauthors, papers, and venues • coauthors’ papers and venues (2-hop ego networks) • Merge pre-linked entities (papers or venues) • Construct fixed-size graph
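The construction steps above can be sketched as a breadth-first walk capped at two hops and at a fixed node budget, followed by a merge of pre-linked entities. The traversal order, the truncation rule, and the names `ego_subgraph` and `merge_pair` are assumptions for illustration; the slides do not describe how the fixed size is enforced.

```python
from collections import deque


def ego_subgraph(adjacency, center, max_hops=2, max_nodes=5):
    """Collect up to max_nodes nodes within max_hops of center, in BFS order."""
    visited = [center]
    seen = {center}
    queue = deque([(center, 0)])
    while queue and len(visited) < max_nodes:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for nb in adjacency.get(node, []):
            if nb not in seen:
                seen.add(nb)
                visited.append(nb)
                queue.append((nb, depth + 1))
                if len(visited) == max_nodes:
                    break
    return visited


def merge_pair(nodes_a, nodes_b, links):
    """Union of two node lists; B-side nodes with a pre-linked A-side
    counterpart (given in links) collapse onto that counterpart."""
    merged = list(nodes_a)
    for n in nodes_b:
        target = links.get(n, n)
        if target not in merged:
            merged.append(target)
    return merged


adjacency = {
    "a1": ["p1", "p2", "v1"],   # author a1 with two papers and a venue
    "p1": ["a1", "a2"],
    "p2": ["a1"],
    "v1": ["a1", "p3"],
    "a2": ["p4"],               # p4 is three hops from a1
}
print(ego_subgraph(adjacency, "a1", max_nodes=10))
print(merge_pair(["a1", "p1"], ["b1", "q1"], {"q1": "p1"}))
```

Merging linked papers and venues gives the two authors' subgraphs shared nodes, which is what lets a later model compare them as one paired graph.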

Step 2: linking based on Heterogeneous Graph Attention Networks (HGAT) • Input node features (in subgraphs) • Semantic embedding: average word embedding of author attributes • Structure embedding: trained network embedding on a large heterogeneous graph (e.g. LINE)
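The semantic embedding amounts to averaging word vectors. A minimal sketch, assuming a pre-trained lookup table `word_vectors` (hypothetical here, with made-up two-dimensional values) and skipping out-of-vocabulary tokens:

```python
def average_embedding(tokens, word_vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


word_vectors = {"tsinghua": [1.0, 0.0], "university": [0.0, 1.0]}
embedding = average_embedding(["tsinghua", "university", "unknown"], word_vectors, dim=2)
print(embedding)
```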

Step 2: linking based on Heterogeneous Graph Attention Networks (HGAT) • Encoder layers • Attention coefficient attn learnt by a self-attention mechanism • Normalized attention coefficient: differentiates different types of entities; it is the aggregation weight of a source entity’s embedding on a target entity
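One way to read the normalized attention coefficient is as a softmax over a target's neighbors, with the raw score computed from parameters chosen by the source entity's type. The sketch below assumes a GAT-style score (a learned vector applied to the concatenated target and source embeddings, followed by LeakyReLU); the exact scoring function and the parameter values are assumptions, since the slide's formula did not survive extraction.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_weights(target, sources, source_types, attn_params):
    """Softmax-normalized aggregation weights of each source on the target."""
    scores = []
    for h, entity_type in zip(sources, source_types):
        a = attn_params[entity_type]      # type-specific attention vector (assumed)
        concat = target + h               # concatenate target and source embeddings
        scores.append(leaky_relu(sum(w * x for w, x in zip(a, concat))))
    peak = max(scores)                    # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


attn_params = {
    "paper": [1.0, 0.0, 1.0, 0.0],        # made-up values for illustration
    "venue": [0.0, 0.0, 0.0, 0.0],
}
weights = attention_weights(
    target=[1.0, 0.0],
    sources=[[1.0, 0.0], [0.0, 1.0]],
    source_types=["paper", "venue"],
    attn_params=attn_params,
)
print(weights)
```

The target's new embedding would then be the weighted sum of the source embeddings under these weights.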

Step 2: linking based on Heterogeneous Graph Attention Networks (HGAT) • Encoder layers (cont.) • Multi-head attention • Two graph attention layers in the encoder • Decoder layers • Fuse embeddings of candidate pairs, and use fully-connected layers to produce the final matching score. • Fusion operations: concatenation, element-wise multiplication
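The decoder's fusion step can be sketched as follows. How the two operations are combined is not spelled out in the slides, so concatenating both embeddings with their element-wise product, and using a single linear layer with a sigmoid in place of the fully-connected stack, are assumptions made for this example.

```python
import math


def fuse(emb_a, emb_b):
    """Concatenate both embeddings with their element-wise product."""
    product = [x * y for x, y in zip(emb_a, emb_b)]
    return emb_a + emb_b + product


def matching_score(emb_a, emb_b, weights, bias=0.0):
    """Single linear layer plus sigmoid over the fused vector."""
    fused = fuse(emb_a, emb_b)
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


fused = fuse([1.0, 2.0], [3.0, 4.0])
score = matching_score([1.0, 2.0], [3.0, 4.0], weights=[0.0] * 6)
print(fused, score)
```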

Author linking model: heterogeneous graph attention • Heterogeneous subgraph for a candidate author pair • Different attention parameters for different entity types • Attention coefficient

Experiment Setup • Datasets • Baselines • Rule-based method: Keyword • Traditional ML method: SVM & Dedupe • SOTA author linking model • COSNET: based on factor graph model • MEgo2Vec: based on graph neural networks

Experimental results • LSTM-based method • CNN-based method

Model variants of paper linking • Table 2: Paper linking performance • Table 3: Running time of different methods for paper linking (in seconds). • 100x prediction speed-up

Applications • Data integration • Graph mining • collaboration and citation • Text mining • title and abstract • Science of science… • Citation Network Dataset https://www.aminer.cn/citation

Thank You • Code: https://github.com/zfjsail/OAG • Data: https://www.openacademic.ai/oag/
