
Transitioning of existing applications to use HDFS






Presentation Transcript


  1. Transitioning of existing applications to use HDFS (August 2008)

  2. ContextWeb: Traffic
  • Traffic: up to 10 thousand ad requests per second
  • Comscore Trend Data (chart)

  3. ContextWeb Architecture highlights: pre-Hadoop aggregation framework
  • Logs are generated on each server and aggregated in memory into 15-minute chunks
  • Logs from the different servers are aggregated into one log
  • Loaded into a DB, followed by multi-stage aggregation in the DB
  • About 20 different jobs end-to-end
  • Could take 2 hours to process through all stages

  4. Hadoop Data Set
  • Up to 160 GB of raw log files per day; roughly 10 TB for 60 days
  • 30 different aggregated data sets, 25 TB total to cover 1 year (uncompressed)
  • For example, the list of URLs with keywords for every impression: 15 TB total (uncompressed)
  • Multiply by 3 replicas …
  • Compression would help: potential compression ratio of 1:15 to 1:20

  5. Architectural Challenges
  • How to organize the data set so that aggregated data sets stay fresh: logs are constantly appended to the main data set, and reports and aggregated data sets should be refreshed every 15 minutes
  • Mix of .NET and Java applications (over 80% .NET, under 20% Java): how to make .NET applications write logs to Hadoop?
  • Some 3rd-party applications consume the results of MapReduce jobs (e.g. a reporting application): how to make 3rd-party or internal legacy applications read data from Hadoop?

  6. Partitioned Data Set: approach
  • Date/time is the dimension for partitioning
  • Segregate results of MapReduce jobs into daily files/directories
  • Each daily file is regenerated if the input to the MR job contains data for that day
  • Use a revision number for each file, so multi-stage jobs can overlap during processing (HDFS is still write-once, at least for now); see the sketch below
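
A minimal sketch of how a consumer could pick up the latest revision of a daily partition, assuming a hypothetical layout <dataset>/<yyyyMMdd>/r<revision> and a hypothetical helper class LatestRevision (neither appears on the slides):

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LatestRevision {
    // Returns e.g. /data/impressions/20080815/r7, or null if that day is not present yet.
    // Assumes revision directories are named r<number> (a hypothetical convention).
    public static Path find(FileSystem fs, Path dataset, String day) throws IOException {
        Path dayDir = new Path(dataset, day);
        if (!fs.exists(dayDir)) {
            return null;
        }
        Path latest = null;
        int best = -1;
        for (FileStatus status : fs.listStatus(dayDir)) {
            String name = status.getPath().getName();
            if (name.startsWith("r")) {
                int revision = Integer.parseInt(name.substring(1));
                if (revision > best) {
                    best = revision;
                    latest = status.getPath();
                }
            }
        }
        return latest;
    }
}

Because each rewrite lands in a new r<revision> directory rather than touching the previous one, a job still reading the old revision is unaffected; readers simply pick up the newest revision on their next run.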

  7. Partitioned Data Set: processing flow

  8. Partitioned Data Set: Implementation
  • Use MultipleOutputFormat to generate daily files/directories
    • Provide your own generateFileNameForKeyValue()
    • Compression is supported out of the box
  • Use a Partitioner class to make sure that all rows for the same day go to the same reducer
    • Provide your own getPartition(): int partitionID = dateHash % numPartitions;
  (A sketch of both pieces follows below.)
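
A minimal sketch of both pieces against the old org.apache.hadoop.mapred API current in 2008. The class names DailyOutputFormat and DayPartitioner, and the assumption that the map output key starts with a yyyyMMdd date, are illustrative rather than taken from the slides:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each record into a per-day file, e.g. 20080815/part-00000.
public class DailyOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        String day = key.toString().substring(0, 8);   // assumes the key starts with yyyyMMdd
        return day + "/" + name;
    }
}

// Sends all rows for the same day to the same reducer.
class DayPartitioner implements Partitioner<Text, Text> {
    public int getPartition(Text key, Text value, int numPartitions) {
        int dateHash = key.toString().substring(0, 8).hashCode() & Integer.MAX_VALUE;
        return dateHash % numPartitions;               // int partitionID = dateHash % numPartitions;
    }
    public void configure(JobConf conf) { }
}

Both would be wired into the job with conf.setOutputFormat(DailyOutputFormat.class) and conf.setPartitionerClass(DayPartitioner.class) on the JobConf.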

  9. Getting Data in and out
  • Mix of .NET and Java applications (over 80% .NET, under 20% Java)
  • How to make .NET applications write logs to Hadoop?
  • Some 3rd-party applications consume the results of MapReduce jobs (e.g. a reporting application)
  • How to make 3rd-party or internal legacy applications read data from Hadoop?

  10. Getting Data in and out: distcp
  • hadoop distcp <src> <trgt>
    • <src>: HDFS path
    • <trgt>: /mnt/abc, a network share
  • Easy to start: just allocate storage on the network share
  • But…
    • Difficult to maintain if there are more than 10 types of data to copy
    • Needs extra storage, outside of HDFS (oxymoron!)
    • Extra step in processing
    • Clean-up

  11. Getting Data in and out: WebDAV driver
  • A WebDAV server is part of the Hadoop source code tree; it needed some minor clean-up
  • A WebDAV client is pre-installed on Windows
  • Linux: mount modules are available from http://dav.sourceforge.net/
  [root@pglnx mnt]# cd /mnt
  [root@pglnx mnt]# mkdir -p hadoop/prod
  [root@pglnx mnt]# mount -t davfs http://cw-grid100.contextweb.prod/ hadoop/prod/
  [root@pglnx ~]# mount | grep hadoop
  http://cw-grid100.contextweb.prod/ on /mnt/hadoop/prod type davfs (rw,nosuid,nodev,_netdev)
  [root@pglnx ~]# cd /mnt/hadoop/prod/
  [root@pglnx prod]# ls
  geo geo1.txt hadoop home lost+found old_versions rpt system testing tmp user wide

  12. Getting Data in and out: Running Server on Linux

  13. Getting Data in and out: Running Server on the same node where client is installed

  14. WebDAV and compression
  • But your results are compressed…
  • Options:
    • Decompress files on HDFS: an extra step again
    • Refactor your application to read compressed files…
      • Java: OK (see the sketch below)
      • .NET: much more difficult; cannot decompress SequenceFiles
      • 3rd party: not possible
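
For the Java case, reading a compressed SequenceFile mostly comes down to using SequenceFile.Reader, which applies the compression codec transparently. A minimal sketch, assuming Text keys and values and a hypothetical file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ReadCompressedSeqFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical path; record- and block-compressed SequenceFiles read the same way.
        Path path = new Path("/rpt/urls/20080815/r1/part-00000.seq");

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            Text key = new Text();      // assumes the file was written with Text key/value
            Text value = new Text();
            while (reader.next(key, value)) {   // decompression happens inside the reader
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}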

  15. WebDAV and compression
  • Solution: extend the WebDAV server to support compressed SequenceFiles
  • The same driver can serve both compressed and uncompressed files (see the sketch below):
    • If a file with the requested name foo.bar exists, return foo.bar as is
    • If foo.bar does not exist, check whether there is a compressed version foo.bar.seq; uncompress it on the fly and return it as if it were foo.bar
  • Outstanding issues:
    • Temporary files are created on the Windows client side
    • There are no native Hadoop (de)compression codecs on Windows
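
A minimal sketch of that lookup rule (a hypothetical helper, not the actual WebDAV server code); the on-the-fly decompression itself would reuse the SequenceFile.Reader pattern shown after slide 14:

import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RequestResolver {

    /** The HDFS path to serve and whether it must be decompressed before being returned. */
    public static class Resolution {
        public final Path path;
        public final boolean decompress;
        Resolution(Path path, boolean decompress) {
            this.path = path;
            this.decompress = decompress;
        }
    }

    public static Resolution resolve(FileSystem fs, String requestedName) throws IOException {
        Path plain = new Path(requestedName);
        if (fs.exists(plain)) {
            return new Resolution(plain, false);       // foo.bar exists: return it as is
        }
        Path compressed = new Path(requestedName + ".seq");
        if (fs.exists(compressed)) {
            return new Resolution(compressed, true);   // only foo.bar.seq exists: decompress on the fly
        }
        throw new FileNotFoundException(requestedName);
    }
}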
