Hadoop in the Cloud or Not
NOTE: This is a guest post by Jenny Richards. Contact her through her website.
Any discussion of “Big Data” or Hadoop functionality today should be left in the hands of CIOs or IT heads rather than CEOs and CFOs. From selecting the right Hadoop service provider to implementation and ongoing maintenance, the technical input of CIOs can go a long way toward managing the expectations of the latter, who usually understand just one language: the bottom line.
However, the main goal of implementing Hadoop for big data analytics is to gain actionable insights that give enterprises an edge in an increasingly competitive marketplace, so the decision to implement Hadoop cannot be made in isolation from the business leadership.
It is no secret that Hadoop and cloud computing are two of the biggest trends to have dominated the technology world over the last couple of years, and they show no signs of slowing down. There has been much debate about whether the two can coexist, with many arguing that big data analytics is out of place in the cloud. Is there any truth to that? What’s the real deal?
Hadoop vs. the Cloud – Overview
Many analysts have expressed skepticism about the usability of Hadoop in a cloud-based environment, which is where organizations looking to implement cost-effective remote DBA support services would have to run it. The major premise of their argument is that cloud computing, by its very nature, does not lend itself to Hadoop clusters.
According to Richard Fichera of Forrester, there are three reasons Hadoop cannot be used in a cloud computing environment.
- Heavy and Increasing Workloads
In many enterprises, Hadoop clusters run at high utilization, and capacity has to keep increasing to match an ever-growing need for storage. Hadoop clusters therefore fill with data at a predictable, steady pace, slow or fast depending on the organization. This essentially negates the need for the elasticity that cloud deployments are usually sought for.
- Growing Data Sets
Hadoop clusters amass up to 10 times the amount of data collected by legacy transactional environments. By nature, this is data that customer-centric stakeholders and data scientists are reluctant to discard, which makes storing and accessing it in the cloud inherently more expensive. Also, given the unpredictability of access requirements, on-premise storage may be preferable where cloud access times are sub-optimal, which happens more often than is commonly thought.
- Performance vs Locality
Deploying Hadoop clusters in the cloud makes sense for data that already lives in the cloud, such as social media analytics data. However, for real-time data originating from business-customer interactions in multiple locations, the better option is to deploy Hadoop on-premise, with deterministic latency and enough bandwidth to keep end-to-end application latency low.
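The first objection, that steady workloads gain little from elasticity, can be illustrated with some simple arithmetic. The sketch below is only an illustration: the function name `elastic_savings` and the hourly node counts are hypothetical, and it assumes an idealized cloud in which nodes can be added or released every hour at no extra cost.

```python
def elastic_savings(hourly_demand):
    """Fraction of node-hours saved by scaling to demand each hour
    instead of keeping peak capacity running the whole time.

    hourly_demand: nodes needed in each hour (hypothetical figures).
    """
    if not hourly_demand or max(hourly_demand) <= 0:
        raise ValueError("need at least one hour with positive demand")
    peak = max(hourly_demand)
    fixed_node_hours = peak * len(hourly_demand)   # provision for the peak
    elastic_node_hours = sum(hourly_demand)        # pay only for what is used
    return 1 - elastic_node_hours / fixed_node_hours


# Hypothetical workloads: one that is always busy, one with a single burst.
steady = [95, 100, 98, 100, 97, 100]
bursty = [5, 5, 100, 5, 5, 5]

print(f"steady: {elastic_savings(steady):.1%} of node-hours saved")
print(f"bursty: {elastic_savings(bursty):.1%} of node-hours saved")
```

Under these assumptions the steady workload saves almost nothing from elasticity, while the bursty one saves most of its node-hours, which is the distinction the argument rests on.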
Deploying Hadoop in the Cloud for Cloud-based Data
Whether or not the above points hold water differs from one enterprise to the next, because there is more than one side to the story. What skeptics of cloud-based Hadoop seem to downplay is the ‘data gravity’ problem, which is the main reason enterprises opt to deploy Hadoop in the cloud.
At present, it is too soon to tell where data, and hence the analytical applications that use it, will best sit over the long term. However, the idea is that Hadoop should be deployed where the data resides. If your data is stored in S3 buckets, for instance, it makes sense to deploy Hadoop in the cloud.
For now, there isn’t enough data to make a long-term trend analysis and finally put the matter to bed. But it would be short-sighted to imagine that Hadoop will remain bound to any one location. A more likely prediction is that Hadoop will spread everywhere, a sensible premise given that data is spreading everywhere too.
As data moves everywhere, its gravity will pull Hadoop clusters along with it. It is not far-fetched to imagine that Hadoop workloads will soon be present in vehicles, on wireless base stations and at other emerging edges of IT infrastructure. With data coming from and going everywhere, Hadoop cannot, and will not, remain cloistered in data centers.
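As a rough illustration of the "deploy where the data sits" idea, the sketch below totals data volume per location and suggests a cluster wherever a location holds a meaningful share of the total. Everything in it is an assumption made for illustration: the function `place_clusters`, the 20% threshold, and the dataset names and sizes are hypothetical, and a real decision would also weigh latency, cost and compliance.

```python
def place_clusters(datasets, min_share=0.2):
    """Suggest deployment locations based on where most data resides.

    datasets: iterable of (name, location, size_tb) tuples (hypothetical).
    min_share: minimum fraction of total data a location must hold
               before it is considered a "gravity site".
    """
    totals = {}
    for _name, location, size_tb in datasets:
        totals[location] = totals.get(location, 0.0) + size_tb

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []

    # Keep every location whose share of the data meets the threshold.
    return sorted(
        loc for loc, tb in totals.items() if tb / grand_total >= min_share
    )


# Hypothetical inventory of an enterprise's data sets.
inventory = [
    ("clickstream", "s3", 120.0),
    ("social_media", "s3", 40.0),
    ("pos_transactions", "on_premise", 60.0),
    ("sensor_logs", "edge", 5.0),
]

print(place_clusters(inventory))
```

With this made-up inventory, both the S3 data and the on-premise data are large enough to justify a cluster, while the small edge data set is not.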
Hadoop in the cloud actually makes sense
More likely, deployments will span both the cloud and on-premise data centers, following the multiple sites of data gravity. As the Hadoop architecture and market continue to mature, cloud-based Hadoop deployment will become more significant.
In fact, cloud deployment will become more compelling for historical data, since it offers a more economical storage location for enterprises that still wish to keep historical data for reporting purposes: much like the tapes of yesteryear, but with higher availability and easier access.
Hadoop is a natural fit for cloud-based environments. The reasons for this outnumber the reasons against it, and they include:
- Faster procurement and deployment of large scale resources
- Lower cost of innovation
- More efficient handling of batch workloads
- Ability to handle variable resource requirements seamlessly
- Simpler Hadoop operations
Think about it: it is far easier to say “We need 20 more servers” than it is to actually buy them, make space for them, buy fans to keep them cool and deploy security to keep them safe. You can list all the reasons why Hadoop is better on-premise, but in the current big data environment, cloud deployment is not something enterprises can side-step for long. The economic case is simply too strong.
The truth is that running Hadoop in physical data centers is far less practical than theory suggests. While Hadoop cannot yet be considered a perfect match for cloud infrastructure, practical considerations favor adopting Hadoop clusters in the cloud.
What Forrester seems to have overlooked in arguing for on-premise Hadoop is that many enterprises’ data resources are increasingly live, and that meeting expanding data requirements on-premise is hard.
Another major reason for deploying Hadoop in the cloud is the need for skilled and experienced managers to handle this sophisticated technology through deployment, configuration, scaling and management. Unless you have limitless resources for an in-house team, cloud deployment, which facilitates remote administration, simply makes better sense.