Hadoop and Advanced Analytics – SPARK the fire within!

by Ravi Narayanan for Blog

Hadoop, as we know it, has the ability to spread data and processing across several nodes. It can process very large amounts of data using clusters of commodity hardware. The same can be accomplished with a collection of remote virtual servers (leveraging cloud services such as Amazon Web Services).

The two key components of Hadoop are:

  1. Hadoop Distributed File System (HDFS), and
  2. MapReduce, a framework to split up a computing job across multiple processors.

However, the Hadoop MapReduce framework is time-consuming for iterative processing, which makes Hadoop jobs batch-oriented.
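To make the cost of iteration concrete, here is a toy sketch in plain Python (no Hadoop involved). It assumes that each pass of an iterative job reads its input from disk and writes its output back to disk, which is roughly what happens when MapReduce jobs are chained. The function names, the `state.json` file, and the "move halfway toward the mean" step are all made up for illustration.

```python
import json
import os
import tempfile


def step(values):
    """One pass of a toy iterative algorithm: move each value halfway toward the mean."""
    mean = sum(values) / len(values)
    return [(v + mean) / 2 for v in values]


def run_via_disk(values, iterations, workdir):
    """Every pass reads its input from a file and writes its output to a file."""
    path = os.path.join(workdir, "state.json")  # hypothetical intermediate file
    with open(path, "w") as f:
        json.dump(values, f)
    disk_ops = 1  # the initial write
    for _ in range(iterations):
        with open(path) as f:
            current = json.load(f)
        with open(path, "w") as f:
            json.dump(step(current), f)
        disk_ops += 2  # one read and one write per pass
    with open(path) as f:
        result = json.load(f)
    return result, disk_ops + 1  # plus the final read


def run_in_memory(values, iterations):
    """The same passes, with the working data kept in memory throughout."""
    current = list(values)
    for _ in range(iterations):
        current = step(current)
    return current, 0


with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_ops = run_via_disk([1, 2, 3, 4], 5, workdir)
memory_result, memory_ops = run_in_memory([1, 2, 3, 4], 5)
print("disk round trips:", disk_ops, "in-memory round trips:", memory_ops)
```

Both versions compute the same answer; the only difference is how often the data crosses the disk boundary, and that count grows with every extra iteration.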

MapReduce vs Spark

Researchers at UC Berkeley’s AMPLab recognized this limitation and developed Spark as an alternative to MapReduce. Spark takes better advantage of the memory available across the distributed set of machines than MapReduce does, and greatly reduces the need for disk I/O. Its in-memory processing can deliver improvements of 10-50X or more in data-processing times over MapReduce. It also offers the added incentive of being much easier to program. With Spark, developers do not need to split up and coordinate their logic across separate Map and Reduce routines. They can seamlessly combine operations into complex workflows.
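The difference in programming style can be sketched with a word count. The example below is plain Python, not actual Hadoop or Spark code: the first half splits the logic into separate map, shuffle, and reduce routines, and the second half uses `MiniDataset`, a hypothetical stand-in class invented here to mimic the chained style that Spark encourages.

```python
from collections import defaultdict

lines = ["spark extends hadoop", "hadoop stores data", "spark processes data"]


# MapReduce style: the logic is split across separate routines.
def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, the work a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def mapreduce_word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Chained style: one pipeline of operations on a dataset object.
class MiniDataset:
    """Hypothetical stand-in for a Spark-like dataset, backed by a Python list."""

    def __init__(self, items):
        self.items = list(items)

    def flat_map(self, fn):
        return MiniDataset(x for item in self.items for x in fn(item))

    def map(self, fn):
        return MiniDataset(fn(item) for item in self.items)

    def reduce_by_key(self, fn):
        acc = {}
        for key, value in self.items:
            acc[key] = fn(acc[key], value) if key in acc else value
        return MiniDataset(acc.items())

    def collect(self):
        return self.items


chained_counts = dict(
    MiniDataset(lines)
    .flat_map(str.split)
    .map(lambda word: (word, 1))
    .reduce_by_key(lambda a, b: a + b)
    .collect()
)
print(chained_counts)
```

In the chained version the whole workflow reads top to bottom as one expression, and further steps (a filter, a sort, a join) could be appended without writing and wiring up another pair of routines.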

Spark extends value from Hadoop

Spark Components

The key platform components of Spark are:

Spark SQL (interactive, real-time query tool),

Spark Streaming (streaming analytics engine),

MLlib (machine learning library), and

GraphX (graph analysis engine).

Spark does not include its own file system for organizing files. For this reason, many organizations install it on top of Hadoop. Spark’s advanced analytics applications can make use of data stored within the Hadoop Distributed File System (HDFS). Organizations can perform deeper analytics with less coding and faster response times than with typical MapReduce applications. Spark thus plays a very important part in extending the value of Hadoop.

Spark for Data Scientists

Spark is becoming a key data science tool for many iterative modeling challenges. It enhances data scientists’ productivity by enabling them to leverage existing HDFS data. In addition, it can access and process data stored in HBase, Cassandra, and any other Hadoop-supported storage system. Spark can combine SQL, streaming, and graph analytics within cloud analytics applications.

Spark’s three-fold value:

  1. Runtime processing environment,
  2. Development framework for in-memory advanced analytics, and
  3. Next-generation, cluster-computing solution.

Explore Spark's capabilities today for use cases within your business. Take them beyond experimentation and apply them to the business problems and opportunities you encounter daily. Spark the fire within your organization!

We can help realize quantum value from Hadoop and Spark for your business; please contact us.

