OAG: Toward Linking Large-scale Heterogeneous Entity Graphs. Fanjin Zhang, Xiao Liu, Jie Tang, Yuxiao Dong, Peiran Yao, Jie Zhang, Xiaotao Gu, Yan Wang, Bin Shao, Rui Li and Kuansan Wang. Tsinghua University, Microsoft Research.
OAG overview Open Academic Graph (OAG) is a large knowledge graph unifying two web-scale academic graphs: Microsoft Academic Graph (MAG) and AMiner. Linking large-scale heterogeneous academic graphs
OAG: Open Academic Graph https://www.openacademic.ai/oag/
Problem & Challenges Input: two heterogeneous entity graphs. Output: entity linkings, i.e. pairs of entities, one from each graph, such that both entities in a pair represent exactly the same entity.
Challenges • Entity heterogeneity • Different types of entities • Heterogeneous attributes • Entity ambiguity • Long-standing name ambiguity problem • Large-scale entity linking • Hundreds of millions of publications in each source.
Related work • Rule-based method: DiscR [TKDE’15] • Traditional ML method: RiMOM [JWS’06], Rong et al. [ISWC’12], Wang et al. [WWW’12], COSNET [KDD’15]. • Embedding-based method: IONE [IJCAI’16], REGAL [CIKM’18], MEgo2Vec [CIKM’18].
Framework: LinKG • Author linking module • Venue linking module • Paper linking module
Framework: LinKG • Venue linking — Sequence-based Entities • An LSTM-based method to capture the dependencies • Paper linking • locality-sensitive hashing and convolutional neural networks for scalable and precise linking. • Author linking • heterogeneous graph attention networks to model different types of entities.
Linking venues: sequence-based entities • Input: venue names in each graph • Output: linked venue pairs • Idea: LSTM-based method • Direct name matching for easy cases • Fuzzy-sequence linking for the rest
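The two-stage idea on this slide (direct matching for easy cases, a fuzzy model for whatever remains) can be sketched roughly as below. This is a minimal illustration, not the authors' code: `normalize` and `direct_match` are hypothetical helper names, and the normalization used (lowercasing and collapsing whitespace) is an assumption.

```python
def normalize(name):
    """Lowercase a venue name and collapse runs of whitespace (assumed normalization)."""
    return " ".join(name.lower().split())


def direct_match(names_a, names_b):
    """Link names that are identical after normalization.

    Returns (linked_pairs, remaining), where remaining holds the names
    from names_a that still need a fuzzy-sequence model.
    """
    index = {normalize(n): n for n in names_b}
    linked, remaining = [], []
    for name in names_a:
        key = normalize(name)
        if key in index:
            linked.append((name, index[key]))
        else:
            remaining.append(name)
    return linked, remaining


linked, remaining = direct_match(
    ["Nature", "Journal of  Web Semantics"],
    ["nature", "J. Web Semantics"],
)
```

Only the names left in `remaining` would be passed on to the more expensive fuzzy-sequence step.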
Venue linking characteristics • Word order matters • E.g. ‘Diagnostic and interventional imaging’ and ‘Journal of Diagnostic Imaging and Interventional Radiology’ • Fuzzy matching for varied-length venue names • Extra or missing prefix or suffix • E.g. Proceedings of the Second international conference on Advances in social network mining and analysis.
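To see why word order matters, compare an order-insensitive measure with an order-sensitive one on two names that use the same words. The sketch below uses Jaccard overlap and Python's `difflib.SequenceMatcher` purely as stand-ins; neither is the LSTM model from the slides, and the example names are made up for illustration.

```python
from difflib import SequenceMatcher


def tokens(name):
    """Lowercase a venue name and split it into words."""
    return name.lower().split()


def jaccard(a, b):
    """Order-insensitive overlap between the two word sets."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb)


def sequence_ratio(a, b):
    """Order-sensitive similarity computed over the word sequences."""
    return SequenceMatcher(None, tokens(a), tokens(b)).ratio()


name_a = "diagnostic and interventional imaging"
name_b = "imaging interventional and diagnostic"  # same words, reversed order

bag_score = jaccard(name_a, name_b)         # cannot tell the two names apart
seq_score = sequence_ratio(name_a, name_b)  # penalizes the reordering
```

A bag-of-words measure treats the two names as identical, while a sequence-aware measure does not, which is the motivation for a sequence model here.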
Venue linking model • Input: raw word sequence • Keywords extracted from integral sequences • Two LSTM layers • Output: similarity score
Linking papers: large-scale entities • Problem setting: To link paper entities, we fully leverage the heterogeneous information, including a paper’s title and authors. • Leverage the hashing technique (LSH) for fast processing • Adopt Doc2Vec to transform titles to real-valued vectors • Use LSH to map real-valued paper features to binary codes. • Use a convolutional neural network for effective linking.
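Mapping real-valued vectors to binary codes can be illustrated with random-hyperplane LSH. The slides do not say which LSH family is used, so the choice below is an assumption, and `make_hyperplanes`, `binary_code` and `hamming` are hypothetical names. The input vector stands in for a Doc2Vec title embedding.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def binary_code(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side, else 0."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(code_a, code_b):
    """Number of positions where two binary codes differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


planes = make_hyperplanes(dim=4, n_bits=16)
title_vec = [0.3, -1.2, 0.8, 2.0]            # placeholder for a title embedding
code = binary_code(title_vec, planes)
```

Vectors pointing in similar directions tend to share most bits, so candidate pairs can be found by comparing short codes instead of full vectors.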
Paper linking characteristics • Large-scale entities • Hundreds of millions of academic publications for each graph. • Local and hierarchical matching patterns • Paper titles are often truncated if they contain punctuation marks, such as ‘:’ and ‘?’ • Different author name formats: Jing Zhang, J. Zhang & Zhang, J.
Paper linking model: CNN model • Word-level similarity matrix • Convolution on the input similarity matrix • MLP layers
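A word-level similarity matrix and one convolution over it can be sketched as follows. The slides do not state how word similarity is computed, so exact word match (1.0 or 0.0) is an assumption here, and the fixed 2x2 diagonal kernel stands in for filters that a real CNN would learn.

```python
def similarity_matrix(title_a, title_b):
    """Entry [i][j] is 1.0 if word i of title_a equals word j of title_b, else 0.0."""
    words_a = title_a.lower().split()
    words_b = title_b.lower().split()
    return [[1.0 if wa == wb else 0.0 for wb in words_b] for wa in words_a]


def convolve_valid(matrix, kernel):
    """Plain 2-D 'valid' cross-correlation of a matrix with a small kernel."""
    kh, kw = len(kernel), len(kernel[0])
    rows, cols = len(matrix), len(matrix[0])
    return [
        [
            sum(kernel[di][dj] * matrix[i + di][j + dj]
                for di in range(kh) for dj in range(kw))
            for j in range(cols - kw + 1)
        ]
        for i in range(rows - kh + 1)
    ]


# A truncated title against its full form (made-up example titles).
sim = similarity_matrix("linking entity graphs", "linking entity graphs at scale")
diagonal = [[1.0, 0.0], [0.0, 1.0]]   # responds to two consecutive matching words
feature_map = convolve_valid(sim, diagonal)
```

High responses along the diagonal of `feature_map` correspond to runs of consecutive matching words, the kind of local pattern a convolution can pick up even when one title is truncated.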
Linking authors: ambiguous entities • Problem setting: To link author entities, we generate a heterogeneous subgraph for each author. • One author’s subgraph is composed of his or her coauthors, papers, and publication venues. • Also incorporate the venue and paper linking results. • Present a heterogeneous graph attention network based technique for author linking.
Author linking characteristics • Name ambiguity • 16,392 Jing Zhang in AMiner and 7,170 Jing Zhang in MAG • Attribute sparsity • Missing affiliations, homepages… • Already linked papers and venues! • View author linking as a subgraph matching problem • Aggregate needed information from neighbors
Graph neural networks • Neighborhood aggregation: • Aggregate neighbor information and pass into a neural network • It can be viewed as a center-surround filter in CNN: graph convolutions! [Figure: example graph with nodes a, b, c, d, e, v]
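Neighborhood aggregation in its simplest form averages a node's own features with those of its neighbors. The sketch below shows only that averaging step, with no learned weights or nonlinearity; the tiny graph and feature values are made up for illustration, and `aggregate` is a hypothetical name.

```python
def aggregate(node, neighbors, features):
    """Mean of the node's own feature vector and its neighbors' feature vectors."""
    group = [node] + neighbors[node]
    dim = len(features[node])
    return [sum(features[n][k] for n in group) / len(group) for k in range(dim)]


# Hypothetical toy graph: node v with three neighbors.
neighbors = {"v": ["a", "b", "c"]}
features = {
    "v": [1.0, 0.0],
    "a": [0.0, 1.0],
    "b": [1.0, 1.0],
    "c": [2.0, 2.0],
}
h_v = aggregate("v", neighbors, features)
```

In a full graph neural network layer, the aggregated vector would then be passed through a learned linear map and an activation function.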
GCN: graph convolutional networks • GCN is one way of aggregating neighbors • Other variants: • GraphSAGE • Graph Attention • …
LinKG step 1: paired subgraph construction • Subgraph nodes • direct (heterogeneous) neighbors, including coauthors, papers, and venues • coauthors’ papers and venues (2-hop ego networks) • Merge pre-linked entities (papers or venues) • Construct fixed-size graph
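Collecting a 2-hop ego network with a size cap can be sketched with a breadth-first search. This covers only the node collection and the fixed-size cap; merging pre-linked entities is left out. The function name `ego_network`, the adjacency-dict format and the truncation rule (stop once `max_nodes` is reached) are all assumptions for illustration.

```python
from collections import deque


def ego_network(adj, center, max_hops=2, max_nodes=5):
    """Breadth-first list of nodes within max_hops of center, capped at max_nodes."""
    collected = [center]
    visited = {center}
    queue = deque([(center, 0)])
    while queue and len(collected) < max_nodes:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adj.get(node, []):
            if neighbor not in visited and len(collected) < max_nodes:
                visited.add(neighbor)
                collected.append(neighbor)
                queue.append((neighbor, depth + 1))
    return collected


# Hypothetical heterogeneous neighborhood of one author.
adj = {
    "author": ["paper1", "coauthor1", "venue1"],
    "coauthor1": ["paper2", "venue2"],
    "paper2": ["far_node"],  # three hops away from the author
}
subgraph_nodes = ego_network(adj, "author", max_hops=2, max_nodes=10)
```

Breadth-first order means direct neighbors are kept first when the cap forces truncation.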
Step 2: linking based on Heterogeneous Graph Attention Networks (HGAT) • Input node features (in subgraphs) • Semantic embedding: average word embedding of author attributes • Structure embedding: trained network embedding on a large heterogeneous graph (e.g. LINE)
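The semantic embedding (an average of word embeddings) can be sketched as below. The word-vector table is a made-up two-dimensional toy, and skipping out-of-vocabulary words and returning a zero vector for empty input are assumptions rather than details from the slides.

```python
def average_embedding(text, word_vectors, dim):
    """Average the vectors of the known words in text; zero vector if none are known."""
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


# Toy vocabulary; real embeddings would come from a pretrained model.
word_vectors = {"graph": [1.0, 0.0], "mining": [0.0, 1.0]}
semantic = average_embedding("Graph Mining Lab", word_vectors, dim=2)
```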
Step 2: linking based on Heterogeneous Graph Attention Networks (HGAT) • Encoder layers • Attention coefficient attn learnt by self-attention mechanism • Normalized attention coefficient: differentiates different types of entities; it is the aggregation weight of a source entity’s embedding on a target entity
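One way to picture type-aware attention is a softmax over neighbor scores in which each entity type has its own attention vector. The sketch below follows the general graph-attention pattern (score the concatenated target and source embeddings, apply LeakyReLU, normalize with softmax). It omits the learned linear transform, and the exact scoring function in the slides is not recoverable, so treat every detail, including the name `attention_weights`, as an assumption.

```python
import math


def leaky_relu(x, slope=0.2):
    """LeakyReLU activation."""
    return x if x > 0 else slope * x


def attention_weights(target, sources, attn_params):
    """Softmax-normalized weights of each source embedding on the target.

    sources is a list of (entity_type, embedding); attn_params maps each
    entity type to its own attention vector over [target || source].
    """
    scores = []
    for entity_type, embedding in sources:
        a = attn_params[entity_type]          # type-specific parameters
        concat = target + embedding
        scores.append(leaky_relu(sum(w * x for w, x in zip(a, concat))))
    peak = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


target = [1.0, 0.0]
sources = [("paper", [1.0, 0.0]), ("venue", [1.0, 0.0])]
attn_params = {"paper": [1.0, 0.0, 1.0, 0.0], "venue": [0.0, 0.0, 0.0, 0.0]}
weights = attention_weights(target, sources, attn_params)
```

Here the two sources have identical embeddings but different types, so any difference in their weights comes only from the type-specific parameters.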
Step 2: linking based on Heterogeneous Graph Attention Networks (HGAT) • Encoder layers (cont.) • Multi-head attention • Two graph attention layers in the encoder • Decoder layers • Fuse embeddings of candidate pairs (concatenation, element-wise multiplication), and use fully-connected layers to produce the final matching score.
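The decoder step can be pictured as fusing the two candidate embeddings and scoring the result. In the sketch below the fused vector is the two embeddings concatenated with their element-wise product, followed by a single linear layer and a sigmoid. How the slides combine concatenation and multiplication exactly, and how many fully-connected layers are used, is not stated, so this arrangement and the names `fuse` and `matching_score` are assumptions.

```python
import math


def fuse(emb_a, emb_b):
    """Concatenate both embeddings with their element-wise product."""
    product = [x * y for x, y in zip(emb_a, emb_b)]
    return emb_a + emb_b + product


def matching_score(fused, weights, bias):
    """Single linear layer followed by a sigmoid, giving a score in (0, 1)."""
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-z))


fused = fuse([1.0, 2.0], [3.0, 4.0])
# Placeholder weights; a trained model would learn these.
score = matching_score(fused, weights=[0.1] * len(fused), bias=-1.0)
```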
Author linking model: heterogeneous graph attention • Heterogeneous subgraph for a candidate author pair • Different attention parameters for different entity types • Attention coefficient
Experiment Setup • Datasets • Baselines • Rule-based method: Keyword • Traditional ML method: SVM & Dedupe • SOTA author linking model • COSNET: based on factor graph model • MEgo2Vec: based on graph neural networks
Experimental results • LSTM-based method • CNN-based method
Model variants of paper linking • Table 2: Paper linking performance • Table 3: Running time of different methods for paper linking (in seconds) • 100x prediction speed-up
Applications • Data integration • Graph mining • collaboration and citation • Text mining • title and abstract • Science of science… • Citation Network Dataset: https://www.aminer.cn/citation
Thank You • Code: https://github.com/zfjsail/OAG • Data: https://www.openacademic.ai/oag/