Sameh Attia: Hadoop enables distributed ‘big data’ processing across clouds

Open source Hadoop enables distributed data processing for “big data” applications across a large number of servers. The idea is that distributed, parallel processing across clouds provides redundancy to guard against failure and improves application performance.

Hadoop, an open source project from The Apache Software Foundation, emerged from the needs of companies such as Google, Yahoo, AOL and Facebook. These companies need to support daily access to huge data sets across distributed servers.

But two factors will make Hadoop necessary for -- and available to -- many companies: a growing number of applications utilizing very large data sets, and the availability of clouds containing hundreds or thousands of distributed processors with a virtually unlimited amount of storage.

Hadoop in a cloud enables parallel processing spread across these many servers, speeding job completion. Hadoop can seriously boost performance in data search and processing scenarios, such as retail chain data mining that seeks trends across millions of individual retail store purchases, or security information that intelligence agencies collect from a wide variety of sources to detect terrorist activity patterns.

How Hadoop distributed data processing works

The Hadoop Distributed File System (HDFS) consists of HDFS clusters, which each contain one or more data nodes. Incoming data is split into segments and distributed across data nodes to support parallel processing. Each segment is then replicated on multiple data nodes to enable processing to continue in the event of a node failure.
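The splitting and replication described above can be illustrated with a minimal Python sketch. It is not HDFS code: the function names, the tiny segment size, the round-robin placement and the node names are all assumptions chosen for illustration.

```python
def split_into_segments(data, segment_size):
    """Cut the input into fixed-size segments (the last one may be shorter)."""
    return [data[i:i + segment_size] for i in range(0, len(data), segment_size)]


def place_segments(num_segments, nodes, replication):
    """Assign each segment to `replication` distinct nodes, round-robin.

    Round-robin is an assumption of this sketch, not the HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for seg in range(num_segments):
        # Start at a different node for each segment so load is spread out.
        placement[seg] = [nodes[(seg + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_after_failure(placement, failed_node):
    """True if every segment still has a replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


segments = split_into_segments(b"abcdefghij", 4)
placement = place_segments(len(segments), ["node1", "node2", "node3", "node4"], 2)
```

With two replicas per segment, `readable_after_failure(placement, "node2")` returns `True`, because every segment still has a copy on another node.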

To further protect against failure, copies are placed on data nodes residing on more than one rack. With the default replication count of three, two copies of a segment are placed on nodes located on the same rack, and an additional copy is placed on a node on a different rack.
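A short Python sketch of the rack-aware placement just described follows. It models only the rule stated above (two copies on one rack, one copy on another). The cluster layout, the function names and the choice of the first available nodes are assumptions, and a real cluster would weigh factors this sketch ignores.

```python
def place_replicas(nodes_by_rack, first_rack):
    """Pick three nodes: two on `first_rack`, one on a different rack."""
    local_nodes = nodes_by_rack[first_rack]
    if len(local_nodes) < 2:
        raise ValueError("first rack needs at least two nodes")
    other_racks = [
        rack for rack, nodes in nodes_by_rack.items()
        if rack != first_rack and nodes
    ]
    if not other_racks:
        raise ValueError("need a second rack with at least one node")
    # Taking the first candidates is a simplification made for this sketch.
    remote_node = nodes_by_rack[other_racks[0]][0]
    return [local_nodes[0], local_nodes[1], remote_node]


def racks_used(replicas, nodes_by_rack):
    """Return the set of racks that hold at least one replica."""
    return {
        rack for rack, nodes in nodes_by_rack.items()
        if any(replica in nodes for replica in replicas)
    }


cluster = {
    "rack-a": ["a1", "a2", "a3"],
    "rack-b": ["b1", "b2"],
}
replicas = place_replicas(cluster, "rack-a")
```

Because `racks_used(replicas, cluster)` contains two racks, losing either rack still leaves at least one copy of the segment.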

While HDFS protects against some types of failure, it is not entirely fault tolerant. A single NameNode located on a single server is required. If this server fails, the entire file system shuts down. A secondary NameNode periodically backs up the primary. The backup data is used to restart the primary but cannot be used to maintain operation.

HDFS is typically used in a Hadoop installation, yet other distributed file systems are also supported. MapR Technologies recently announced a file system that is compatible with Hadoop. This file system adds new features, including a distributed NameNode that removes the single point of failure present in HDFS.

The Amazon S3 file system can be used but does not maintain information on the location of data segments, reducing the ability of Hadoop to survive server or rack failures. However, other file systems such as open source CloudStore and the MapR file system do maintain location information.

MapReduce engine manages distributed data processing

The MapReduce engine consists of one JobTracker and multiple TaskTrackers. Client applications submit jobs to the JobTracker, which divides each job into tasks and assigns them to TaskTracker nodes. When HDFS or another location-aware file system is in use, the JobTracker takes advantage of knowing the location of each data segment. It attempts to assign processing to the same node on which the required data has been placed.

Each TaskTracker can support a configured number of simultaneous tasks. If no additional tasks can be assigned to the TaskTracker located on the same node as the data, JobTracker attempts to assign processing to a node on the same rack as the data. This strategy helps minimize traffic on the data center backbone network and takes advantage of the higher network bandwidth available within a rack.
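The locality preference described above can be sketched in a few lines of Python. This is an illustration, not the JobTracker's scheduler: the function names, the slot counts and the final fallback to any free node are assumptions that the text above does not spell out.

```python
def pick_tracker(data_node, rack_of, free_slots):
    """Return a TaskTracker node with a free slot, or None.

    Preference order: the node holding the data, then a node on the
    same rack, then (an assumption of this sketch) any other node.
    """
    if free_slots.get(data_node, 0) > 0:
        return data_node
    same_rack = [
        node for node, slots in free_slots.items()
        if node != data_node and slots > 0 and rack_of[node] == rack_of[data_node]
    ]
    if same_rack:
        return same_rack[0]
    anywhere = [node for node, slots in free_slots.items() if slots > 0]
    return anywhere[0] if anywhere else None


def assign(task_data_nodes, rack_of, free_slots):
    """Assign one task per entry, consuming a slot for each assignment."""
    slots = dict(free_slots)  # copy so the caller's dict is left untouched
    assignments = []
    for data_node in task_data_nodes:
        tracker = pick_tracker(data_node, rack_of, slots)
        if tracker is not None:
            slots[tracker] -= 1
        assignments.append(tracker)
    return assignments


rack_of = {"a1": "rack-a", "a2": "rack-a", "b1": "rack-b"}
free_slots = {"a1": 1, "a2": 1, "b1": 1}
result = assign(["a1", "a1", "a1", "a1"], rack_of, free_slots)
```

Four tasks whose data sits on `a1` are placed on `a1`, then `a2` (same rack), then `b1`, and the fourth gets `None` because no slot is left.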

TaskTrackers create a separate Java Virtual Machine for each task to prevent the failure of a single task from crashing the entire TaskTracker. If a task fails, the TaskTracker notifies the JobTracker, which then assigns the task elsewhere. If a TaskTracker fails, all of its assigned tasks are reassigned. The JobTracker periodically checkpoints its progress, and if the JobTracker itself fails, a new JobTracker starts up and resumes processing from the last checkpoint.
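Reassignment after a TaskTracker failure can be pictured with the following Python sketch. The text above says only that the tasks are reassigned; sending each one to the least-loaded surviving tracker is an assumption made here, as are the tracker and task names.

```python
def reassign_tasks(assignments, failed_tracker):
    """Move every task from the failed tracker to a surviving tracker.

    Choosing the least-loaded survivor is an assumption of this sketch.
    """
    survivors = {
        tracker: list(tasks)
        for tracker, tasks in assignments.items()
        if tracker != failed_tracker
    }
    if not survivors:
        raise RuntimeError("no TaskTracker left to take over")
    for task in assignments.get(failed_tracker, []):
        target = min(survivors, key=lambda tracker: len(survivors[tracker]))
        survivors[target].append(task)
    return survivors


before = {"tt1": ["map-0", "map-1"], "tt2": ["map-2"], "tt3": []}
after = reassign_tasks(before, "tt1")
```

Here the two tasks that were running on `tt1` end up split between `tt2` and `tt3`, and no task is lost.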

The output from each Map/Reduce task is combined with the output of others for further processing. Combining output from multiple tasks results in a large amount of data funneled to a smaller number of processors. Network switches are stressed as data converges on the nodes where the next processing phase will take place. Switches must therefore be capable of supporting these transfers without blocking or dropping packets.
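To make the convergence of map output onto fewer processors concrete, here is a small single-process Python sketch of the map, shuffle and reduce phases. It does not use the Hadoop API. The purchase records, the function names and the CRC-based partitioning are assumptions chosen for illustration.

```python
from collections import defaultdict
import zlib


def map_task(segment):
    """Emit (product, 1) for every purchase record in one data segment."""
    return [(product, 1) for product in segment]


def shuffle(map_outputs, num_reducers):
    """Route each key to one reducer, so many map outputs converge on few reducers."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            # crc32 gives a stable partition across runs, unlike Python's
            # randomized hash() for strings.
            index = zlib.crc32(key.encode("utf-8")) % num_reducers
            partitions[index][key].append(value)
    return partitions


def reduce_task(partition):
    """Sum the values collected for each key in one partition."""
    return {key: sum(values) for key, values in partition.items()}


segments = [["milk", "bread"], ["milk", "eggs"], ["bread", "milk"]]
map_outputs = [map_task(segment) for segment in segments]
partitions = shuffle(map_outputs, num_reducers=2)

totals = {}
for partition in partitions:
    totals.update(reduce_task(partition))
```

Three map outputs feed two reducers in this example. In a real cluster each arrow in that fan-in is a network transfer, which is where the switch load comes from.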

Vendors offer Hadoop data processing deployment support

While Hadoop users can download the software from Apache, a variety of vendor offerings make it easier to deploy and support a Hadoop installation. Here are some vendors and their offerings:

Hortonworks Inc. is a Yahoo spinoff that employs many former Yahoo employees who designed and implemented key Hadoop components. Hortonworks’ Hadoop Data Platform is a packaged and tested set of open source Hadoop software components. Hortonworks also offers support and training for Hadoop users.
Cloudera Inc. offers CDH and Cloudera Enterprise, Cloudera Management, and Service and Configuration Manager (SCM) Express. CDH is a freely downloadable integrated and tested package of open source Hadoop components. Cloudera Enterprise is a subscription-based offering that adds the Cloudera Management Suite to the open source components and includes ongoing support from Cloudera staff. Cloudera supports its products both in clients’ internal data centers and on public clouds such as Amazon EC2, Rackspace, SoftLayer or VMware’s vCloud. Cloudera also offers Hadoop training and professional services.
EMC Corp. has teamed with MapR to offer the EMC Greenplum HD Community Edition, the EMC Greenplum HD Enterprise Edition, and the EMC Greenplum HD Data Computing Appliance. The Community Edition is a tested set of open source Apache Hadoop components. The Enterprise Edition is interface-compatible with Apache open source Hadoop. It replaces HDFS with the MapR file system, which features high-availability enhancements and adds the ability to control the placement of data so that applications requiring intense computation can be placed on a server containing a high-performance processor. EMC also claims much higher performance than the Apache components.
IBM and Dell also offer integrated Hadoop packages. IBM offers the InfoSphere BigInsights Basic and Enterprise Editions, while Dell has partnered with Cloudera to offer Cloudera Enterprise.

Software vendors have also stepped up Hadoop support. Microsoft is working with Hortonworks to port Hadoop to the Windows environment. Oracle, Sybase and Informatica have all announced plans for Hadoop support.

Interest in Hadoop is expected to continue to grow. The availability of integrated and tested distributions, vendor offerings, support and training makes it easier for enterprises to deploy and operate Hadoop. These factors, combined with the need to analyze large databases and the availability of public and private clouds, offer high potential for wide adoption of the technology.

Wednesday, February 8, 2012
