Innovation Acceleration by Public Data Analysis. Or, Big Data in Hungary - Archiving and Mining the Academic Web. George Kampis, CEO, PetaByte Nonprofit Research Ltd. www.dynanets.org. www.textrend.org. www.futurict.szte.hu.
Innovation Acceleration by Public Data Analysis. Or, Big Data in Hungary - Archiving and Mining the Academic Web. George Kampis, CEO, PetaByte Nonprofit Research Ltd.
PetaByte Nonprofit Research Ltd. • www.dynanets.org • www.textrend.org • www.futurict.szte.hu • www.petabyte-research.org • www.hungarianscience.org
What we do in futurict.hu • In the context of scientific research and higher education (in particular, in Hungary): • Investment and return ("ROI") analysis • "Science of success" • Structural analysis of institutions • www.hungarianscience.org • http://www.oktatas.hu/felsooktatas/projektek/tamop721_eszafejl/projekthirek/hazai_tudomanymetriai_felmeres • New forms of publication (e.g. data sharing in papers)
Context in FuturICT • "Innovation Accelerator" • ... to help (scientific) innovation with [...] social media as well as data services • "BIG DATA"
Helbing, D., & Balietti, S. (2011). How to create an innovation accelerator. The European Physical Journal Special Topics, 195(1), 101-136.
van Harmelen, F., Kampis, G., Börner, K., van den Besselaar, P., Schultes, E., Goble, C., ... & Helbing, D. (2012). Theoretical and technological building blocks for an innovation accelerator. The European Physical Journal Special Topics, 214(1), 183-214.
Leydesdorff, L., Rotolo, D., & De Nooy, W. (2012). Innovation as a Nonlinear Process, the Scientometric Perspective, and the Specification of an Innovation Opportunities Explorer. Technology Analysis & Strategic Management (Forthcoming).
Partially similar developments • Mendeley: reference manager and collaboration network • ResearchGate: research network and publications portal w/ quality assessment • Altmetrics: article-level online metrics • VIVO: connect, share, discover
Big (web) data is a key • Big Data in Google Trends • "Deep data" • Controversy... • Massive web data: harvesting / archiving • Google itself... • The Internet Archive • UK Web Archive, British Library
Web archiving in Hungary • None. Nope. • "MIA" (Magyar Internet Archivum, HU Internet Archive) • Various documents, plans and small-scale pilots • Since 2006 • Our ambition: to archive and mine HU academia = "HUA" • 500 NIIF institutions (NIIF = Nat'l Information Infrastructure Dev't.) • 42 HAS (HU Acad Sci) research institutes • 47 higher education entities (universities and polytechnics) • Now in collaboration with: OSZK (National Library), NIIF...
A running "HUA" pilot in PetaByte/FuturICT.hu • Hardware: Dell T710 server (2x4 core Xeon E5520, 48GB RAM, 2TB HDD) • Software: Heritrix crawlers called from API and cURL, spawned from timed scripts... • Not downloaded: exe, gz, iso, jar, mp3, ogg, ppt, rar, wav, xls, xlsx, zip • Many technical issues: Flash pages, portlet containers (e.g. WebSphere), CMSs (e.g. Joomla)... • Operation since April 2013 • Longitudinal archiving in mirror format (2-weekly periods), using an in-house form of "diff"
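The two mechanical ideas on this slide, skipping the listed file types and comparing successive mirrors with a kind of "diff", can be illustrated with a minimal Python sketch. This is not the pilot's actual code: the function names, the use of SHA-256 digests, and the `{url: digest}` snapshot layout are all assumptions made for illustration; only the extension list comes from the slide.

```python
import hashlib
from pathlib import PurePosixPath
from typing import Dict, List
from urllib.parse import urlparse

# Extensions listed on the slide as "not downloaded".
EXCLUDED = {"exe", "gz", "iso", "jar", "mp3", "ogg",
            "ppt", "rar", "wav", "xls", "xlsx", "zip"}


def is_excluded(url: str) -> bool:
    """Return True if the URL path ends in one of the excluded extensions."""
    path = urlparse(url).path  # drops the query string and fragment
    suffix = PurePosixPath(path).suffix.lower().lstrip(".")
    return suffix in EXCLUDED


def digest(content: bytes) -> str:
    """Content fingerprint used to detect changed pages (assumed: SHA-256)."""
    return hashlib.sha256(content).hexdigest()


def diff_snapshots(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two hypothetical {url: digest} snapshots of the same site."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(u for u in old.keys() & new.keys() if old[u] != new[u]),
    }
```

Under these assumptions, only the "added" and "changed" entries would need to be stored for each 2-weekly period, and their counts relative to the snapshot size give one possible turnover measure.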
The processing of results • Future plans: keyword extraction, timed (dynamic) keyword nets, correlation with support programs and grant calls (to analyze ROI in publications, citations, ... terms) • "The Science of Success" (A.-L. Barabási) • http://www.eccs13.eu/index.php/satellites • http://barabasilab.com/success/ • http://www.facebook.com/SuccessScience • Bottleneck: availability of public funding data, need for enforcement of open data initiatives • In this pilot phase: basic statistics, turnover rates etc.
Quick results, basic stats • All 89 HU academic institutions: 86GB total (text 42GB) • Rank distributions (total) [figure panels: HAS, Higher Ed.]
Quick results, basic stats 2. • Rank distributions (text, i.e. html, doc, docx, rtf, pdf, ps) [figure panels: HAS, Higher Ed.]
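A rank distribution of this kind simply orders the sites by archived size and plots size against rank. A minimal sketch follows; the function name is hypothetical, and the site names and sizes in the usage example are invented placeholders, not the pilot's data.

```python
from typing import Dict, List, Tuple


def rank_distribution(sizes_mb: Dict[str, float]) -> List[Tuple[int, str, float]]:
    """Return (rank, site, size) tuples, largest site first (rank 1)."""
    ordered = sorted(sizes_mb.items(), key=lambda kv: kv[1], reverse=True)
    return [(rank, site, size) for rank, (site, size) in enumerate(ordered, start=1)]


# Placeholder values for illustration only.
example = {"site-a": 120.0, "site-b": 4500.0, "site-c": 35.0}
for rank, site, size in rank_distribution(example):
    print(rank, site, size)
```

Plotting size against rank on logarithmic axes is a common way to inspect such distributions; running the function separately on the HAS and higher education subsets would give the two panels.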
Quick first insights • (Outliers are chemistry catalogs and astronomy datasets) • Average size: 974 MB per site (median: 137 MB [!]) • Average text size: 474 MB per site (median: 47 MB [!]) • For comparison: Kampis website @ ELTE = 180 MB (text only) • Hypothesis: useful comparisons and metrics possible • Add dynamic aspect...
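The gap between mean and median flagged with "[!]" is the signature of a heavily skewed distribution, where a few very large sites dominate the average. A small sketch makes this concrete: the first part uses only the figures reported on the slide, while the toy list and the `skew_ratio` helper are assumptions for illustration.

```python
from statistics import mean, median


def skew_ratio(values):
    """Mean divided by median; values well above 1 indicate a long right tail."""
    return mean(values) / median(values)


# (mean, median) in MB per site, as reported on the slide.
reported = {"total": (974, 137), "text": (474, 47)}
for label, (avg, med) in reported.items():
    print(f"{label}: mean/median = {avg / med:.1f}")
# total: mean/median = 7.1
# text: mean/median = 10.1

# Toy data (not from the pilot): one outlier pulls the mean far above the median.
toy_sizes_mb = [10, 20, 30, 40, 900]
print(round(skew_ratio(toy_sizes_mb), 2))
```

This is one reason the median, or a normalized measure, is often a safer basis than the mean for comparing sites.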
Conclusions, suggestions • Very first steps, only 2 months into the pilot • Data intensive, has natural timing • Big (web) data are important for research assessment • Big data are often small (also elsewhere...) • Suggests itself for readily available indexes and derivative measures • We have shown a very simple yet instructive case ("size matters") • Caveat: need normalizations!
Thank you! • Coworkers: Laszlo Gulyas (PhD), Sandor Soos (PhD), Balazs Balint (MSc), Zsolt Juranyi (BSc), Attila Palmai (BSc student) • This work was partially supported by the European Union and the European Social Fund through project FuturICT.hu (grant no.: TÁMOP-4.2.2.C-11/1/KONV-2012-0013).