Google and Google Scholar Roger Mills and Judy Reading May 2007
Welcome to the Web The world’s biggest haystack
What can you do in a haystack? • Romp • Get hay fever • Have unexpected encounters • Sleep • Not do research • So what do you fancy?
Finding needles • Google helps you find needles in haystacks • But: • Google is an index of web pages • A journal article is not a web page • So Google is not good at finding journal articles • However: • An image of a journal article may be placed on a web page • So Google may find it • If it’s free and not behind a firewall • How do you know?
Google is fast • Very fast • Proudly fast • Tells you how fast • Found OUCS home page in 0.09 secs • Also found 350,000 other ‘relevant’ pages • But put home page first • Brilliant - How does it do it? • Not telling….
Did I need 350,000 references? • Nobody looks at all the references Google retrieves • So why display them? • Algorithm takes into account links made by other pages • And click-throughs • So the top result for a given search is determined over time by the people who make that search • Is that the same as the ‘best’ result?
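The idea that ranking takes into account links made by other pages can be sketched as a simplified link-analysis iteration. This is not Google's actual algorithm, which is not published in full; the page names, the damping factor and the iteration count are assumptions chosen for illustration, and click-throughs are left out entirely.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages by their incoming links (simplified PageRank-style iteration).

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on in equal shares to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical mini-web: three pages all link to "home".
example_links = {
    "home": ["news"],
    "news": ["home"],
    "blog": ["home"],
    "faq": ["home"],
}
scores = link_rank(example_links)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

The page with the most incoming links comes first, which shows why the top result reflects what other page authors chose to link to rather than any judgement of quality.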
OK, how would you do it? • To index a document, I’d read it first. • Google can’t read • We don’t read the web – we view it • We remember references visually – that red book on the third shelf down… • If Google can list all the red books on all the third shelves down in all the world I’m bound to find it, right? • Actually I remember I saw it in Oxford, so I just need to list all the red books in Oxford – doddle • That’s not really how Google works – is it?
So you read the article, and then…? • Give it some index terms • Not ones I’ve just made up, but ones from a standard list. • That way, everyone will know what the article’s about, and every article on the same topic can be found. • Provided everyone agrees what the article’s about. • Then I’d list the authors in a standard form: so everything by Roger Mills, Roger Anthony Mills, Roger A Mills, R Anthony Mills, Anthony Mills, R A Mills can be found in one go. • That’s a controlled vocabulary. • Works for journal titles too.
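A controlled vocabulary for author names can be sketched as a lookup from every variant form to one standard heading. The heading "Mills, R. A.", the record titles and the function names are assumptions for illustration; a real authority file would be maintained by indexers rather than hard-coded.

```python
# Hypothetical authority file: each variant form maps to one standard heading.
AUTHOR_AUTHORITY = {
    "roger mills": "Mills, R. A.",
    "roger anthony mills": "Mills, R. A.",
    "roger a mills": "Mills, R. A.",
    "r anthony mills": "Mills, R. A.",
    "anthony mills": "Mills, R. A.",
    "r a mills": "Mills, R. A.",
}


def normalise(name):
    """Lower-case the name, drop full stops and collapse repeated spaces."""
    return " ".join(name.replace(".", " ").lower().split())


def standard_author(name):
    """Return the standard heading for a name, or None if it is not listed."""
    return AUTHOR_AUTHORITY.get(normalise(name))


def find_by_author(records, name):
    """Find every record whose author shares a standard heading with name."""
    heading = standard_author(name)
    if heading is None:
        return []
    return [r for r in records if standard_author(r["author"]) == heading]


# Hypothetical records, each catalogued under a different form of the name.
records = [
    {"title": "Article A", "author": "R. A. Mills"},
    {"title": "Article B", "author": "Roger Anthony Mills"},
    {"title": "Article C", "author": "Anthony Mills"},
    {"title": "Article D", "author": "J Reading"},
]
print([r["title"] for r in find_by_author(records, "Roger Mills")])
```

A search under any one form finds all three articles in one go, which is exactly what a search engine without controlled terms cannot promise.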
Google doesn’t do that • No controlled terms • So you must think of synonyms, different forms of name, title abbreviations etc • You must define the context – that matters….
OK, we get it. So let’s invent… • Google Scholar • Let’s team up with publishers so they let us search behind their firewalls • Let’s modify our algorithm so it excludes non-scholarly material (how do we define that?) • Let’s look at citations so when one article we index cites another one we index, we can move it higher up the relevance ranking • Let’s link together different versions of the same article • Let’s include library locations for full-text access • Let’s see how it goes
But let’s not allow: • creation of sets • Or controlled vocabularies • Or combining of searches • Or hit rate figures for individual search terms • Or proximity searching • Or saving and e-mailing results • Or creation of alerts • Or standardisation of journal names/abbreviations • Or info on what is included and what is not • Or info on how the system decides what is scholarly • Or an indication of update frequency – seems slower than normal Google
Which of these statements is true? • Google is comprehensive • Google is all I need • Google is up-to-date • Google is not evil • Google is commercial • Google is independent • Google is secretive • Google wants to rule the world • Google wants to beat Microsoft • Google loves me • I love Google
Google is a family • A range of products under a common brand • Some add value to the basic search engine; others are nothing to do with searching • Google Scholar is a variant of the standard search engine • It uses a different algorithm, but we don’t know how it differs
What’s in Google Scholar? • “Google Scholar provides a simple way to broadly search for scholarly literature. From one place, you can search across many disciplines and sources: peer-reviewed papers, theses, books, abstracts and articles, from academic publishers, professional societies, preprint repositories, universities and other scholarly organizations. Google Scholar helps you identify the most relevant research across the world of scholarly research.”
NB: only in Beta • Features may change • Developing in tandem with Google Books, which will include digitised texts from Oxford collections and others • In competition with WoK, ScienceDirect, SCOPUS, Scirus etc
Content • Algorithm to identify scholarly materials crawled by Google from the open web • Access to materials locked behind subscription barriers • Must include abstract • Full-text access requires institutional subscriptions or individual payment • Includes peer-reviewed papers, theses, books, preprints, abstracts, full-text, citations, etc.
Library links • Includes OpenURL links to local library holdings • In Oxford displays as ‘Oxford Full Text’ beside title
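An OpenURL link carries the article's metadata as key/value pairs to a library's link resolver, which then checks local holdings. The sketch below builds such a link using OpenURL 1.0 (Z39.88-2004) key names; the resolver address and the article details are hypothetical, and this is not a description of how Google Scholar or the Oxford resolver builds its links.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_openurl(resolver_base, article):
    """Build an OpenURL 1.0 key/value link for a journal article."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
        "rft.atitle": article.get("title"),
        "rft.jtitle": article.get("journal"),
        "rft.aulast": article.get("author_last_name"),
        "rft.date": article.get("year"),
        "rft.volume": article.get("volume"),
        "rft.spage": article.get("start_page"),
    }
    # Leave out any field the citation does not supply.
    supplied = {key: value for key, value in params.items() if value}
    return resolver_base + "?" + urlencode(supplied)


# Hypothetical resolver address and article metadata.
example_article = {
    "title": "An example article",
    "journal": "Example Journal",
    "author_last_name": "Author",
    "year": "2005",
    "volume": "12",
    "start_page": "34",
}
url = build_openurl("https://resolver.example.org/openurl", example_article)
decoded = parse_qs(urlparse(url).query)
print(url)
```

Because the link carries only the citation, the same link can resolve to different full-text sources depending on what the reader's library subscribes to.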
Includes citation data • Uses ‘citation extraction’ to build connections between papers • ‘Cited by’ link lists items (known to Google Scholar) that cite the original paper • Cited items not available online are listed with prefix [citation] • ‘Citation analysis’ puts the most-cited papers at the top of the results list
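The citation features above can be sketched as an inverted index: from each paper's reference list, build a "cited by" list for every cited item, then sort by how often each item is cited. The paper identifiers are hypothetical, and Google Scholar's real ranking is not documented, so this sketch assumes citation count is the only factor.

```python
from collections import defaultdict


def build_cited_by(references):
    """Invert reference lists: map each cited item to the set of papers citing it."""
    cited_by = defaultdict(set)
    for paper, cited_items in references.items():
        for item in cited_items:
            cited_by[item].add(paper)
    return cited_by


def rank_by_citations(items, cited_by):
    """Order items with the most-cited first, breaking ties alphabetically."""
    return sorted(items, key=lambda item: (-len(cited_by.get(item, ())), item))


def display_label(item, online):
    """Prefix items that are cited but not available online."""
    return item if item in online else "[citation] " + item


# Hypothetical data: three indexed papers and one book that exists only in print.
references = {
    "paper_a": ["paper_c", "offline_book"],
    "paper_b": ["paper_c"],
    "paper_c": [],
}
online = set(references)
cited_by = build_cited_by(references)
all_items = online | set(cited_by)
results = [display_label(i, online) for i in rank_by_citations(all_items, cited_by)]
print(results)
```

Only citations from items the index already holds are counted, so a paper cited mainly by unindexed sources will rank lower than its true influence suggests.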
Searching • AND implied between words as in normal Google • + to include common words, letters or numbers that Google’s search technology generally ignores • “quote marks” to search for a phrase • minus sign (-) to exclude a term from a search • OR for either search term • author: for author search • intitle: to search document title • restrict by date and publication • advanced search screen available
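The operators listed above can be combined into a single query string. The helper below is a hypothetical sketch (the function and parameter names are not part of any Google interface) and assumes the operator syntax described on this slide; the search terms are invented examples.

```python
def build_query(all_words=(), phrase=None, any_of=(), exclude=(),
                author=None, title_word=None):
    """Assemble a search string from the operators described on the slide."""
    parts = list(all_words)                       # AND is implied between words
    if phrase:
        parts.append('"' + phrase + '"')          # quote marks search for a phrase
    if any_of:
        parts.append(" OR ".join(any_of))         # OR accepts either term
    parts.extend("-" + word for word in exclude)  # minus sign excludes a term
    if author:
        parts.append("author:" + author)          # author search
    if title_word:
        parts.append("intitle:" + title_word)     # word must appear in the title
    return " ".join(parts)


# Invented example: synonyms have to be supplied by the searcher.
query = build_query(
    all_words=["French"],
    phrase="national identity",
    any_of=["culture", "politics"],
    exclude=["tourism"],
)
print(query)
```

With no controlled vocabulary behind the index, the any_of list is where the searcher has to supply synonyms and variant forms by hand.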
Exercise • Try searching for: French national identity • In Google and Google Scholar • With and without quotation marks • Now try searching in Web of Science (or other relevant database) • Is it clear why results differ? • What approach provides the most useful results: • For writing a paper for publication • For quoting in a thesis • For preparing a speech • For preparing for a pub quiz • Or any other purpose…
Alternatives to Google • Google it! • See Charles Knight’s up-to-date ‘Top 100’ list in Read/WriteWeb: http://www.readwriteweb.com/archives/top_100_alternative_search_engines_mar07.php • Use Intute www.intute.ac.uk for reputable human-selected sites, chosen for a UK academic audience • Check OxLIP www.ouls.ox.ac.uk/oxlip for a complete listing of, and subject guide to, university-subscribed databases. Most list the sources they cover and use controlled vocabularies for indexing
An example of Google’s strengths • and weaknesses in finding a specific article: a search done in 2005 and repeated in Nov 2006:
Comparing citation data, 2005: [comparison table not recoverable; column labels GS, SC, GS]
SCIRUS phrase search: 2 journals, this first; 8 other web sources (inc previous versions of this talk!)
SCIRUS keyword search: 735 journals, this first; 6996 others
Biological Abs phrase search: exact match in 1; note controlled keywords
SCIRUS • Very similar to Scholar but can also: • Mark records • Save records • E-mail records • Export set in RIS format (for Endnote)
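Exporting in RIS format means writing each record as tagged lines that reference managers such as EndNote can import. The sketch below writes one journal-article record using common RIS tags; the record contents and the dictionary keys are hypothetical, and it does not reproduce the exact set of fields that SCIRUS exports.

```python
def to_ris(record):
    """Format one journal article as an RIS record (tag, two spaces, hyphen, space)."""
    lines = ["TY  - JOUR"]                 # record type: journal article
    for author in record.get("authors", []):
        lines.append("AU  - " + author)    # one AU line per author
    fields = (
        ("TI", "title"),
        ("JO", "journal"),
        ("PY", "year"),
        ("VL", "volume"),
        ("SP", "start_page"),
    )
    for tag, key in fields:
        value = record.get(key)
        if value:
            lines.append(tag + "  - " + str(value))
    lines.append("ER  - ")                 # end of record
    return "\n".join(lines) + "\n"


# Hypothetical record for illustration.
example_record = {
    "authors": ["Author, A.", "Writer, B."],
    "title": "An example article",
    "journal": "Example Journal",
    "year": 2005,
    "volume": "12",
}
ris_text = to_ris(example_record)
print(ris_text)
```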