Explore metadata harvesting, automatic metadata extraction, text analysis, and social network analysis to foster scholarly communication and collaboration. Utilize OAI metadata, textual characteristics, and contextual network graphs for efficient academic networking. Leverage semi-structured data from OAI-DC and Wikipedia to build social network-like graphs and enhance informal academic collaboration.
Writeslike.us
Em Tonkin (e.tonkin@ukoln.ac.uk), Andrew Hewson (a.hewson@ukoln.ac.uk)
Background
Relevant research themes:
• Metadata harvesting and reuse
• Automatic metadata extraction
• Text analysis
• Social network analysis
• Scholarly communication, particularly informal communication
Aim
Helping people to find each other:
• Finding other researchers with interests similar to your own in your geographic area
• Or in your area of research
• Not everybody with similar interests will attend the same conferences!
• Helping students find potential research supervisors
• Encouraging serendipity
Relevant technologies
In fact there are an awful lot of these.
Social network analysis:
• Requires a very large dataset
• Solvable either by:
  a) being Facebook or similar (but adoption rates are far from 100%)
  b) automated analysis of relevant data
• Solution b) is cheap, simple, and very fallible
• Not a new approach: it is at the core of bibliometrics
Relevant technical problems
• Author identity disambiguation
  • Formal social networks disambiguate between individuals who share a name (for example, if there are many people called 'John Smith', the system can tell you which is which)
  • Needs to be solved to an acceptable level
  • Need to define how good 'acceptable' is
  • Formal solutions usually depend on unique identifiers + registries
  • Cheap, moderately effective solution: disambiguate via textual characteristics + metadata
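The cheap approach in the last bullet can be illustrated with a minimal sketch. It is not the project's actual method: the record layout (`name`, `terms`), the helper names, the choice of Jaccard overlap as the "textual characteristics" measure, and the 0.2 threshold are all assumptions made for illustration.

```python
import re
import unicodedata


def name_key(raw):
    """Reduce a raw author string to a (surname, first-initial) key.

    Handles both 'Surname, Forenames' and 'Forenames Surname' forms.
    """
    # Strip accents so that 'José' and 'Jose' compare equal.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.strip().lower()
    if "," in text:
        surname, _, forenames = text.partition(",")
    else:
        parts = text.split()
        if not parts:
            return ("", "")
        surname, forenames = parts[-1], " ".join(parts[:-1])
    surname = re.sub(r"[^a-z\- ]", "", surname).strip()
    first = re.search(r"[a-z]", forenames)
    return (surname, first.group(0) if first else "")


def jaccard(a, b):
    """Jaccard overlap between two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def same_author(rec_a, rec_b, threshold=0.2):
    """Guess whether two records refer to the same person.

    Names must reduce to the same key, and the records' term sets
    must overlap by at least `threshold` (an arbitrary value here).
    """
    if name_key(rec_a["name"]) != name_key(rec_b["name"]):
        return False
    return jaccard(rec_a["terms"], rec_b["terms"]) >= threshold


r1 = {"name": "Smith, John", "terms": {"metadata", "harvesting", "oai"}}
r2 = {"name": "J. Smith", "terms": {"oai", "metadata", "repositories"}}
r3 = {"name": "John Smith", "terms": {"protein", "folding"}}
print(same_author(r1, r2), same_author(r1, r3))
```

Matching on surname plus first initial is deliberately loose, so the term overlap does most of the work of separating namesakes; where 'acceptable' lies depends on how the threshold is tuned.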
Methodology
• Harvest OAI metadata, which captures a large list of:
  • Author names (somewhat randomly formatted)
  • Digital object titles, descriptions (sometimes), dates (sometimes) and content (sometimes)
  • Citations (sometimes)
• Spider digital objects and analyse them for formal metadata: retrieve email addresses, etc.
• Retain OAI source: useful clue regarding author affiliations (sometimes)
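As a rough illustration of the harvesting step, the sketch below parses an OAI-PMH `ListRecords` response in the `oai_dc` format and pulls email addresses out of spidered text. The sample XML, the repository name `repo.example.org`, the function names, and the email pattern are all invented for the example; a real harvest would fetch such responses over HTTP from each repository's OAI-PMH endpoint and would need to cope with missing or odd fields.

```python
import re
import xml.etree.ElementTree as ET

# Standard OAI-PMH and Dublin Core namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response from a hypothetical repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example title</dc:title>
          <dc:creator>Smith, John</dc:creator>
          <dc:creator>A. N. Other</dc:creator>
          <dc:date>2009</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

# Deliberately simple pattern; it will miss some valid addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def parse_records(xml_text, source):
    """Return one dict per record, keeping the OAI source as an affiliation clue."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):

        def values(tag):
            # Every non-empty dc:<tag> value in this record, in document order.
            found = rec.iterfind(f".//dc:{tag}", NS)
            return [el.text.strip() for el in found if el.text and el.text.strip()]

        ident = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        records.append({
            "source": source,
            "identifier": ident.strip(),
            "creators": values("creator"),
            "titles": values("title"),
            "descriptions": values("description"),  # often absent
            "dates": values("date"),                # often absent
        })
    return records


def extract_emails(text):
    """Return the distinct email addresses found in text, lowercased and sorted."""
    return sorted({m.lower() for m in EMAIL_RE.findall(text)})


records = parse_records(SAMPLE, source="repo.example.org")
print(records[0]["creators"])
print(extract_emails("Contact: Jane.Doe@example.org, j.smith@example.ac.uk."))
```

Every optional field is returned as a possibly empty list, which matches the 'sometimes' caveats above: downstream code has to treat descriptions, dates and citations as bonuses rather than guarantees.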
Methodology (II)
• Analyse text for noun-phrase-like structures: a useful clue as to theme
• Background information required, such as institution names and the domains/URLs associated with each institution
  • Retrieved via harvesting from Wikipedia
  • Much of this information is not well-structured, so unavailable via DBpedia
• Poorly structured information needs filtering: for example, author names are not consistently structured between repositories. This is a machine learning problem.
• Search with contextual network graph algorithm
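The search step can be sketched as a simplified spreading-activation search over an author/term graph, in the spirit of a contextual network graph. This is an illustration only, not the project's implementation: the graph layout, the `decay` and `threshold` parameters, and the toy data are assumptions.

```python
from collections import defaultdict


def build_graph(author_terms):
    """Build an undirected bipartite graph linking authors to their terms."""
    graph = defaultdict(set)
    for author, terms in author_terms.items():
        for term in terms:
            a, t = ("author", author), ("term", term)
            graph[a].add(t)
            graph[t].add(a)
    return graph


def search(graph, query_terms, energy=1.0, decay=0.5, threshold=0.01):
    """Rank authors by the energy that reaches them from the query terms.

    Energy starts at the query term nodes. Each node keeps what it
    receives and passes a decayed, equally divided share on to its
    neighbours; shares smaller than `threshold` are dropped, which
    guarantees termination because decay < 1.
    """
    scores = defaultdict(float)
    frontier = [(("term", t), energy) for t in query_terms if ("term", t) in graph]
    while frontier:
        node, e = frontier.pop()
        scores[node] += e
        share = e * decay / len(graph[node])
        if share < threshold:
            continue
        for neighbour in graph[node]:
            frontier.append((neighbour, share))
    authors = [(name, s) for (kind, name), s in scores.items() if kind == "author"]
    # Highest energy first; ties broken alphabetically.
    return sorted(authors, key=lambda pair: (-pair[1], pair[0]))


# Toy data: terms might come from noun-phrase extraction over harvested text.
author_terms = {
    "alice": {"metadata", "oai"},
    "bob": {"oai", "text mining"},
    "carol": {"protein folding"},
}
graph = build_graph(author_terms)
for name, score in search(graph, ["metadata"]):
    print(name, round(score, 4))
```

In this toy graph a query for "metadata" ranks alice first and also surfaces bob, who never used the term but shares "oai" with alice; that indirect association is the kind of serendipitous link the approach is meant to encourage.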
'Sometimes' and 'usually'
• Statistics are:
  • Cheap
  • Imperfect
  • Available
• Rapid innovation philosophy:
  • Cheap is good
  • Simple is good
  • Solutions requiring novel/additional uptake of infrastructure are out of reach
Results
• Basic concept worked well
• Law of diminishing returns: beyond the first 80-90%, increasing effort led to only minor improvements in the dataset (minor niggles!)
• Interface development actually required more time than the dataset development, and exceeded project length...
• But a useful dataset can be released as linked data and reused for various purposes
Conclusion
• OAI-DC (and Wikipedia!) is a good source of 'semi-structured' data
• There is a great deal of potential for using this together with appropriate analysis tools, such as those explored within the FixRep project, to develop social network-like graphs
• Applying this type of data to encourage informal academic communication/collaboration is an interesting research field with many potential applications