www.prwatech.in
Address: No. 14, 29th Main, 2nd Cross, V.P road BTM-1st Stage, Behind AXA building, Land Mark: Vijaya Bank ATM, Bangalore – 560068, India

Spark over Hadoop

Nowadays Hadoop is being replaced by Spark. The basic reason is that Spark can run up to 100 times faster than Hadoop MapReduce, so a task performed on Spark completes much faster and more efficiently than on Hadoop. To understand the basic difference between these two technologies and how they differ from each other, we first need to understand how each of them works.

Hadoop:

Hadoop is an Apache.org project: a software library and a framework that allows for distributed processing of large data sets (big
data) across computer clusters using simple programming models. Hadoop can scale from single computer systems up to thousands of commodity systems that offer local storage and compute power. Hadoop, in essence, is the ubiquitous 800-lb gorilla in the Big Data Analytics space.

Hadoop is composed of modules that work together to create the Hadoop framework. The primary Hadoop framework modules are:

Hadoop Common
Hadoop Distributed File System (HDFS)
Hadoop YARN
Hadoop MapReduce

Although the above four modules comprise Hadoop's core, there are several other modules. These include Ambari, Avro, Cassandra, Hive, Pig, Oozie, Flume, and Sqoop, which further enhance and extend Hadoop's power and reach into big data applications and large data set processing.

Many companies that use big data sets and analytics use Hadoop. It has become the de facto standard in big data applications. Hadoop was originally designed to handle crawling and searching billions of web pages and collecting their information into a database. The result of that desire to crawl and search the web was Hadoop's HDFS and its distributed processing engine, MapReduce.
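To make the MapReduce model concrete, here is a toy, single-process Python sketch of the map, shuffle, and reduce phases of a word count (MapReduce's canonical example). Real Hadoop runs each phase distributed across a cluster, with HDFS holding the data in between; this sketch only illustrates the data flow.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["the"])  # 3
```

In real Hadoop, the mappers and reducers run on different machines and every phase boundary involves disk and network I/O, which is exactly the cost Spark's in-memory model avoids.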
Hadoop is useful to companies when data sets become so large or so complex that their current solutions cannot effectively process the information in what the data users consider a reasonable amount of time. MapReduce is an excellent text processing engine, and rightly so, since crawling and searching the web (its first job) are both text-based tasks.

Spark Defined:

The Apache Spark developers bill it as "a fast and general engine for large-scale data processing." By comparison, and sticking with the analogy, if Hadoop's Big Data framework is the 800-lb gorilla, then Spark is the 130-lb big data cheetah.
Although critics of Spark's in-memory processing admit that Spark is very fast (up to 100 times faster than Hadoop MapReduce), they might not be so ready to acknowledge that it also runs up to ten times faster on disk. Spark can also perform batch processing; however, it really excels at streaming workloads, interactive queries, and machine learning. Spark's big claim to fame is its real-time data processing capability, as compared to MapReduce's disk-bound, batch processing engine.

Spark is compatible with Hadoop and its modules. In fact, on Hadoop's project page, Spark is listed as a module. Spark has its own page because, while it can run in Hadoop clusters through YARN (Yet Another Resource Negotiator), it also has a standalone mode. The fact that it can run as a Hadoop module and as a standalone solution makes it tricky to directly compare and contrast. However, as time goes on, some big data scientists expect Spark to diverge and perhaps replace Hadoop, especially in instances where faster access to processed data is critical.

Spark is a cluster-computing framework, which means that it competes more with MapReduce than with the entire Hadoop ecosystem. For example, Spark doesn't have its own distributed filesystem but can use HDFS. Spark uses memory and can use the disk for processing, whereas MapReduce is strictly disk-based. The primary difference between Hadoop
MapReduce and Spark is that MapReduce uses persistent storage and Spark uses Resilient Distributed Datasets (RDDs), which are covered in more detail under the Fault Tolerance section.

Why Choose Spark over Hadoop:

Performance:

The reason Spark is faster than Hadoop is that Spark processes everything in memory. It can also use the disk for data that doesn't all fit into memory. Spark's in-memory processing delivers near real-time analytics for data from marketing campaigns, machine learning, Internet of Things sensors, log monitoring, security analytics, and social media sites. MapReduce, alternatively, uses batch processing and was never built for blinding speed. It was originally set up to continuously gather information from websites, with no requirement for this data in or near real time.

Ease of use:

Spark is well known for its performance, but it's also somewhat well known for its ease of use, in that it comes with user-friendly APIs for Scala (its native language), Java, Python, and Spark SQL. Spark SQL is very similar to SQL 92, so there's almost no learning curve required in order to use it. Spark also has an interactive mode so that developers and users alike can have immediate feedback for queries and other actions. MapReduce has
no interactive mode, but add-ons such as Hive and Pig make working with MapReduce a little easier for adopters.

Cost:

Both Spark and Hadoop are open-source, free software products, so neither requires a license. Also, both products are designed to run on commodity hardware, such as low-cost systems. The only difference in cost comes from their different ways of performing a task. MapReduce uses standard amounts of memory because its processing is disk-based, so a company will have to purchase faster disks and a lot of disk space to run MapReduce. MapReduce also requires more systems to distribute the disk I/O over multiple systems. Spark requires a lot of memory but can deal with a standard amount of disk running at standard speeds. Disk space is a relatively inexpensive commodity, and Spark does not use disk I/O for processing.

Data Processing:

MapReduce is a batch-processing engine. MapReduce operates in sequential steps by reading data from the cluster, performing its operation on the data, writing the results back to the cluster, reading updated data from the cluster, performing the next data operation, writing those results back to the cluster, and so on. Spark performs similar operations, but it does so in a single step and in memory. It reads data from the cluster, performs its operations on the data, and then writes it back to the cluster.
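This contrast can be illustrated with a small sketch: a hypothetical two-step pipeline run MapReduce-style, with every intermediate result written to and re-read from storage (temp files standing in here for cluster storage), versus the same pipeline composed in memory, Spark-style, and materialized once at the end.

```python
import json, os, tempfile

data = list(range(1, 6))

# MapReduce-style: each step reads its input from storage and writes its
# output back to storage before the next step can begin.
def run_step(in_path, out_path, fn):
    with open(in_path) as f:
        values = json.load(f)
    with open(out_path, "w") as f:
        json.dump([fn(v) for v in values], f)

tmp = tempfile.mkdtemp()
paths = [os.path.join(tmp, f"step{i}.json") for i in range(3)]
with open(paths[0], "w") as f:
    json.dump(data, f)
run_step(paths[0], paths[1], lambda v: v * 10)  # step 1: scale
run_step(paths[1], paths[2], lambda v: v + 1)   # step 2: shift
with open(paths[2]) as f:
    mr_result = json.load(f)

# Spark-style: the same two transformations composed in memory,
# conceptually data.map(scale).map(shift).collect().
spark_result = [v * 10 + 1 for v in data]

print(mr_result == spark_result)  # True
```

Both pipelines produce the same answer; the difference is that the MapReduce-style version paid for two full round trips through storage, which is exactly where its latency comes from.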
Spark also includes its own graph computation library, GraphX. GraphX allows users to view the same data as graphs and as collections. Users can also transform and join graphs with Resilient Distributed Datasets (RDDs), discussed in the Fault Tolerance section.

Fault Tolerance:

For fault tolerance, MapReduce and Spark resolve the problem from two different directions. MapReduce uses TaskTrackers that provide heartbeats to the JobTracker. If a heartbeat is missed, the JobTracker reschedules all pending and in-progress operations to another TaskTracker. This method is effective in providing fault tolerance; however, it can significantly increase the completion times for operations that have even a single failure. Spark uses Resilient Distributed Datasets (RDDs), which are fault-tolerant collections of elements that can be operated on in parallel. RDDs can reference a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat. Spark can create RDDs from any storage source supported by Hadoop, including local filesystems or one of those listed previously.

Scalability:

By definition, both MapReduce and Spark are scalable using HDFS.

Compatibility:

Spark can be deployed on a variety of platforms. It runs on Windows and UNIX-like systems (such as Linux and Mac OS) and can be deployed in
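The RDD lineage idea behind Spark's fault tolerance can be sketched with a toy class (ToyRDD is hypothetical, not Spark's actual API): instead of replicating computed data, each RDD remembers only its parent and the transformation that produced it, so a lost result can simply be recomputed from the source.

```python
class ToyRDD:
    """A toy, in-memory stand-in for a Spark RDD: it records its lineage
    (parent + transformation) rather than its computed data, so any lost
    result can be rebuilt by replaying the lineage from the source."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # only the root RDD holds real data
        self.parent = parent
        self.fn = fn

    def map(self, fn):
        # A transformation is lazy: it just extends the lineage graph.
        return ToyRDD(parent=self, fn=fn)

    def compute(self):
        # Replay the lineage; Spark does this per lost partition.
        if self.parent is None:
            return list(self.source)
        return [self.fn(x) for x in self.parent.compute()]

rdd = ToyRDD(source=[1, 2, 3]).map(lambda x: x * 2).map(lambda x: x + 1)
print(rdd.compute())  # [3, 5, 7]
# "Losing" a computed result costs nothing permanent: the lineage
# lets us recompute it on demand.
print(rdd.compute())  # [3, 5, 7]
```

Real Spark partitions each RDD across the cluster and recomputes only the lost partitions, but the recovery principle is the same as in this sketch.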
standalone mode on a single node when it has a supported OS. Spark can also be deployed on a cluster node on Hadoop YARN as well as on Apache Mesos.
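As a rough illustration of these deployment options, here is how the same application might be submitted with spark-submit under each cluster manager (my_app.py and the Mesos master address are placeholders):

```shell
# Local (standalone) mode on a single node, using all available cores:
spark-submit --master "local[*]" my_app.py

# On a Hadoop YARN cluster:
spark-submit --master yarn --deploy-mode cluster my_app.py

# On an Apache Mesos cluster (hypothetical master URL):
spark-submit --master mesos://mesos-master:5050 my_app.py
```

Only the --master URL changes between deployments; the application code itself stays the same.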