
M/R for MR

M/R for MR: Market Research powered by Hadoop. Who is nurago? Founded in early 2007; technology for usability and ad efficiency research; part of USYS Nutzenforschung: Hanover, Hamburg, Berlin, Munich, London.


Presentation Transcript


  1. M/R for MR: Market Research powered by Hadoop. nurago - applied research technologies

  2. Who is nurago? • founded in early 2007 • technology for usability and ad efficiency research • part of USYS Nutzenforschung: Hanover, Hamburg, Berlin, Munich, London • consultants, developers, operations engineers (30+ employees) • driven by technology and methodology • research platform I: LEOtrace • usability research, audience measurement • Proxy Servers, Browser Add-ons • sampled data from panel members (2k – 25k UU per project, about 6 TB per month overall) • research platform II: BrandSpector • ad efficiency research • Cookies, Online Surveys • full data based on media volume (10m – 100m AIs per project, about 500 GB per month overall)

  3. Technology Stack: Platform Strategy • Data Collection Framework • LEOtrace Browser Add-ons provide a unified JavaScript runtime environment for IE and FF • think of Greasemonkey with a remote control • Services API supports study setups (event-triggered surveys, DOM manipulation, etc.) • BrandSpector • Unified Tracking Tags embedded into ad creatives or sites • Data Processing Framework • Magic Happens Here • Data Reporting Framework • "Portal Server" providing SSO, I18N, ACLs and pluggable reporting modules • PHP Zend Framework for server-side MVC • GUI library ExtJS for a consistent look and feel plus charting

  4. LEOtrace Data Processing: Frontend Example

  5. Data Processing Framework: Magic Happens Here • Examples of LEOtrace Data Analysis • Input Log Data: UserID, Timestamp, URL, Body, etc. • Input Ad Contact: UserID, Timestamp, Landing Page • Standard Output: Frequency by Domains • count the number of Page Impressions and Unique Users for certain URL patterns • projected gross reach, net reach, sessions, duration per URL pattern • what is the combined reach of facebook.com and studivz.net? • Standard Output: Filtered Logs • advanced grep on the URL based on RegEx • UserID, Timestamp, URL, Duration, Session ID • export all males aged 25 to 35 who entered the checkout process on spreadshirt.de • Standard Output: Ad Contacts • advanced grep on Ad Landing Page URL, Google Search Terms, visited URLs • UserID, Timestamp, ContactType, ContactDetails • export all Ad Impressions for airberlin.de Display Ads and Google Searches for "Air Berlin"
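The "Frequency by Domains" output on this slide can be illustrated with a minimal Python sketch in map/reduce style. It is not nurago's implementation: the sample log, the regular expressions, and the function names `map_record` and `reduce_frequencies` are assumptions made for illustration. It shows why the combined (net) reach of two sites is the size of the union of their user sets and not the sum of their unique-user counts.

```python
import re
from collections import defaultdict

# Hypothetical log records: (user_id, timestamp, url)
LOG = [
    ("u1", 1000, "http://www.facebook.com/home"),
    ("u1", 1060, "http://www.studivz.net/profile"),
    ("u2", 1100, "http://www.facebook.com/photos"),
    ("u3", 1200, "http://www.studivz.net/"),
    ("u3", 1300, "http://www.example.org/"),
]

# Assumed URL patterns: match a domain and any of its subdomains.
PATTERNS = {
    "facebook.com": re.compile(r"^https?://([^/]+\.)?facebook\.com(/|$)"),
    "studivz.net": re.compile(r"^https?://([^/]+\.)?studivz\.net(/|$)"),
}

def map_record(record):
    """Map step: emit (pattern_name, user_id) for every pattern the URL matches."""
    user_id, _timestamp, url = record
    for name, rx in PATTERNS.items():
        if rx.search(url):
            yield name, user_id

def reduce_frequencies(pairs):
    """Reduce step: page impressions and the set of unique users per pattern."""
    impressions = defaultdict(int)
    users = defaultdict(set)
    for name, user_id in pairs:
        impressions[name] += 1
        users[name].add(user_id)
    return impressions, users

pairs = [pair for record in LOG for pair in map_record(record)]
impressions, users = reduce_frequencies(pairs)

# Combined net reach: users who visited at least one of the two sites.
combined_reach = len(users["facebook.com"] | users["studivz.net"])
```

In this sample, user `u1` visited both sites, so the combined reach is smaller than the sum of the two unique-user counts. Projecting the panel counts to the whole population, as the slide mentions, would need weighting factors that the slides do not describe.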

  6. Technology Stack: The Missing Link • Data Transport • from web servers to processing nodes: a bunch of shell scripts using scp, rsync, etc. • LEOtrace: Data Processing Framework Attempts • Generation 1: one project = one MySQL DB + ad hoc SQL queries • Generation 2: one project = 23 partitioned MySQL DBs + standalone Java apps • Generation 2.5: one project = 23 partitioned MySQL DBs + MapReduce jobs • Generation 2.8: one project = 23 partitioned MySQL DBs + HDFS flat files + MapReduce jobs • Generation 3: one project = flat files in HDFS + MapReduce jobs • BrandSpector: Data Processing Framework Attempts • Generation 1: one project = one MySQL DB + ad hoc SQL queries • Generation 2: one project = flat files + awk jobs • Generation 3: one project = flat files in HDFS + MapReduce jobs (TBD)

  7. LEOtrace Data Processing: Data Flow Example • raw data (XML-encoded "events") is created by Browser Add-ons • web servers receive and parse the XML into an intermediate format • data is transferred to one of several processing nodes using rsync • a Java application preprocesses the data and stores the results in sharded DBs • a daily M/R job dumps aggregated data into flat files in HDFS (one per day) • end users issue new queries using a point-and-click GUI based on PHP and ExtJS • PHP talks to a Servlet (REST, JSON), which talks to the Hadoop JobTracker • job results are written to HDFS and accessed through the Servlet

  8. Data Processing: Weaknesses • It does its job, but: • still very fragmented storage: many potential pitfalls • too much code dealing with the infrastructure (we should focus more on business logic) • stability issues due to M/R jobs being too complex or bloated • JobTracker sometimes just freezes with many pending maps (forces a restart of mapred) • hard to maintain a set of shared job implementations for common cases • sometimes veeeeery slow (not a real problem, but expensive) • Troubled Operations Engineers • we need to get better to meet the demand of internal and external users • we need to cover just a small set of rather simple aggregations: counts, distinct counts, greps • analysts should be able to issue such queries using a simple GUI • there should be an easy way to implement additional customized jobs and run them in a controlled environment that enforces best practices and code reuse • We need one more framework to complete the Platform Strategy

  9. Dudong: provides a unified framework for data handling and processing • simplified cluster management • include/exclude nodes, Debian packages • centralized description of topology • unified storage management • direct import of data from web servers • provides a common data representation • Dudong Job API • supports dynamic COUNT, GROUP, GREP • supports Job Chaining • covers LEOtrace as well as BrandSpector • Scheduling • based on Hudson
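The slides do not show the Dudong Job API itself, so the following Python sketch only illustrates what "dynamic COUNT, GROUP, GREP" over a common tabular representation could look like. The function names `grep`, `group_count`, and `count_distinct`, the column layout, and the sample rows are all hypothetical. The point is that each operation is parameterized by column index, so no job-specific code is needed.

```python
import re
from collections import Counter

# Hypothetical tabular records: (user_id, domain, path)
ROWS = [
    ("u1", "facebook.com", "/home"),
    ("u1", "spreadshirt.de", "/checkout/step1"),
    ("u2", "spreadshirt.de", "/checkout/step1"),
    ("u2", "spreadshirt.de", "/shop"),
]

def grep(rows, column, pattern):
    """GREP: keep rows whose value in the given column matches the regex."""
    rx = re.compile(pattern)
    return [row for row in rows if rx.search(row[column])]

def group_count(rows, key_columns):
    """GROUP + COUNT: number of rows per combination of key columns."""
    return Counter(tuple(row[i] for i in key_columns) for row in rows)

def count_distinct(rows, column):
    """Distinct COUNT: number of different values in the given column."""
    return len({row[column] for row in rows})

# Rows that entered the checkout process, then the distinct users among them.
checkout = grep(ROWS, 2, r"^/checkout")
checkout_users = count_distinct(checkout, 0)

# Page impressions per domain.
per_domain = group_count(ROWS, [1])
```

Because the operations take plain column indexes, the same three functions could serve both LEOtrace-style and BrandSpector-style tables, which is the kind of reuse the slide describes.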

  10. Outlook • migrate existing data processing infrastructure • we are going to scale out to support more international business • one unified cluster or separate clusters? • how do we charge for cluster utilization? • more advanced studies mean more advanced analytics • Taxonomy, Classification, etc. • Annotation based on collected content (analyze breadcrumbs, shopping baskets, etc.) • Even more data • additional data collection methods are in the pipeline

  11. nurago - applied research technologies

  12. Backup • Our input and output data are table-based. • Dudong lets us create Records and Keys dynamically based on the column index (the number of the column). • Example: column 1 should be treated as a Long, column 2 as Text. • This may allow less complex tasks to be solved faster in the future. • The JobChain makes it easier to run several jobs one after another. Policies can be supplied for the temporary outputs, for example to ensure that temporary files are deleted only once later jobs have succeeded. • In addition, several JobChains can be cascaded into one another. • ListRecords and CombinedKeys can be generated from the RecordSchema mentioned above (index -> Hadoop data type). • Generic counter jobs (counting records) and filter jobs (filtering records) can take over general aggregations. • ExecutionCallbacks make it possible to react to the states of an asynchronous job.
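The ideas on this backup slide (a schema keyed by column index, a job chain with a cleanup policy for temporary outputs, and callbacks on job state) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Dudong code: `SCHEMA`, `parse_record`, and this `JobChain` class are hypothetical, Python types stand in for Hadoop data types, intermediate results are kept in memory instead of HDFS, and the chain runs synchronously.

```python
# Column index -> type. The slide counts columns from 1; this sketch uses 0-based
# indexes, with int and str standing in for Hadoop's Long and Text types.
SCHEMA = {0: int, 1: str}

def parse_record(line, schema):
    """Build a typed record from a tab-separated line using the schema."""
    columns = line.rstrip("\n").split("\t")
    return tuple(schema[i](columns[i]) for i in sorted(schema))

class JobChain:
    """Runs jobs in sequence; each job maps a list of records to a new list."""

    def __init__(self, on_success=None, on_failure=None):
        self.jobs = []
        self.temp_outputs = {}  # job name -> intermediate result
        self.on_success = on_success or (lambda result: None)
        self.on_failure = on_failure or (lambda name, exc: None)

    def add(self, name, job):
        self.jobs.append((name, job))
        return self  # allows fluent chaining

    def run(self, records):
        for name, job in self.jobs:
            try:
                records = job(records)
            except Exception as exc:
                # Policy: keep earlier intermediates when a later job fails.
                self.on_failure(name, exc)
                return None
            self.temp_outputs[name] = records
        # All jobs succeeded, so the intermediates can be deleted.
        self.temp_outputs.clear()
        self.on_success(records)
        return records

events = []
lines = ["3\tfacebook.com", "1\tstudivz.net", "3\tfacebook.com"]

chain = (
    JobChain(on_success=lambda result: events.append("done"),
             on_failure=lambda name, exc: events.append("failed:" + name))
    .add("parse", lambda ls: [parse_record(l, SCHEMA) for l in ls])
    .add("filter", lambda rs: [r for r in rs if r[0] == 3])
    .add("count", lambda rs: [len(rs)])
)
result = chain.run(lines)

def broken_job(records):
    raise RuntimeError("simulated job failure")

failing = (
    JobChain(on_failure=lambda name, exc: events.append("failed:" + name))
    .add("parse", lambda ls: [parse_record(l, SCHEMA) for l in ls])
    .add("broken", broken_job)
)
failed_result = failing.run(lines)
```

Since `run` has the same shape as a job (records in, records out), one chain can be added to another with `outer.add("inner", inner.run)`, which loosely mirrors the cascading of JobChains mentioned on the slide.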
