NOTE: This article first appeared on Datanami

Mention ETL (Extract, Transform and Load) and eyes glaze over. The thought goes: “That stuff is old and meant for clunky enterprise data warehouses. What does it have to do with my internet/web/ecommerce application?”

Quite a lot, actually. ETL did originate in enterprise IT, where data from online databases is Extracted, then Transformed to normalize it, and finally Loaded into enterprise data warehouses for analysis. Although internet companies feel they have no use for expensive, proprietary data warehouses, the fact of the matter is that ETL is still a requirement, and so is some kind of a data warehouse. The logic is simple: one doesn’t run business reports on the database powering the online application.

An ETL Example

Consider the classic example of key transformation. The application database uses a customer_id to index into the customer table, while the CRM system references the same customer differently. The business analyst wants to analyze how customers are using the product, and thus the data warehouse needs a single, distinct way to refer to customers, i.e., the keys need to be transformed and mapped to a new key in the DW. Even if there is a single source system, it is still a good idea to do such transformations to isolate the warehouse from the online database. In addition to such basic transformations, data is also often enriched (for example, with geocodes) to create the target customer record in the warehouse. There is no getting away from it: ETL is a requirement whether you are the hottest social media company or a 50-year-old bank.
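To make the key transformation concrete, here is a minimal Python sketch. All table, field and function names are illustrative (not from any particular system): it matches a customer's application-database record to the CRM record, assigns a surrogate warehouse key, and enriches the result with a geocode.

```python
# Sketch: map source-system customer ids to one surrogate warehouse key,
# then enrich the record. All names and the matching rule are illustrative.

def build_customer_dim(app_rows, crm_rows, geocode_lookup):
    """app_rows are keyed by customer_id, crm_rows by crm_ref.
    Rows are matched on email (one possible matching rule)."""
    crm_by_email = {r["email"]: r for r in crm_rows}
    dim, next_key = [], 1
    for app in app_rows:
        crm = crm_by_email.get(app["email"], {})
        dim.append({
            "dw_customer_key": next_key,        # surrogate key owned by the DW
            "src_app_id": app["customer_id"],   # key from the application DB
            "src_crm_ref": crm.get("crm_ref"),  # key from the CRM system
            "email": app["email"],
            "geocode": geocode_lookup.get(app.get("zip")),  # enrichment step
        })
        next_key += 1
    return dim
```

The point is that the warehouse owns its own key; both source-system keys become ordinary attributes of the warehouse record.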

Why Hadoop?

All right. We need ETL. But what has Hadoop got to do with it?

Turns out that Hadoop is an ideal platform to run ETL. You can feed the results into a traditional data warehouse or, better yet, simply use Hadoop itself as your warehouse. Two for the price of one! And ingesting data from all sources into a centralized Hadoop repository is future-proof: as your business scales and the data grows rapidly, the Hadoop infrastructure can scale easily.

The Hadoop platform has tools that can extract data from the source systems, whether they are log files, machine data or online databases, and load it into Hadoop in record time. It is possible to do transformations on the fly as well, although more elaborate processing is better done after the data is loaded into Hadoop. Programming and scripting frameworks allow complex ETL jobs to be deployed and executed in a distributed manner. Rapid improvements in interactive SQL tools make Hadoop an ideal choice for a low-cost data warehouse.

Got it. What needs to be done to get this all to work? Read on to find out.

ETL Process in Hadoop

An architecture for setting up a Hadoop data store for ETL is shown below.

ETL Architecture

Here are the typical steps to set up Hadoop for ETL:

  1. Set up a Hadoop cluster
  2. Connect data sources
  3. Define the metadata
  4. Create the ETL jobs
  5. Create the workflow

Set Up a Hadoop Cluster

This step can be really simple or quite difficult depending on where you want the cluster to be. On the public cloud, you can create a Hadoop cluster with just a few clicks using Amazon EMR, Rackspace CBD or other cloud Hadoop offerings. If the data sources are already on the same public cloud, then this is obviously the no-brainer solution.

If, however, your data sources happen to be in a data center, there are several things to take into consideration.

  1. Can the data be moved to the cloud? Legal, security, privacy and cost considerations apply.
  2. Can test data be used for development?

If the answer is No to both questions, then a cluster will need to be provisioned in the Data Center. Go befriend your IT/OPS guy right away.

Connect Data Sources

The Hadoop ecosystem includes several technologies, such as Apache Flume and Apache Sqoop, to connect various data sources such as log files, machine data and RDBMSs. Depending on the amount of data and the rate of new data generation, a data ingestion architecture and topology must be planned. Start small and iterate, just like any other development project. The goal is to move the data into Hadoop at a frequency that meets analytics requirements.
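Whatever the tool, incremental ingestion from a database usually boils down to tracking a watermark (the highest timestamp or id already loaded) and pulling only newer rows. A Python sketch of that bookkeeping, with illustrative names and data:

```python
# Sketch of incremental-ingest bookkeeping: pull only rows newer than the
# last watermark, then advance it. Names and row layout are illustrative.

def incremental_pull(source_rows, watermark):
    """source_rows: iterable of dicts carrying an 'updated_at' value.
    Returns (new_rows, new_watermark)."""
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark
```

Sqoop's incremental import mode implements the same idea against an RDBMS (the `--check-column` and `--last-value` options), so in practice you configure this logic rather than write it.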

Define the Metadata

Hadoop is a “schema-on-read” platform and there is no need to create a schema before loading data as databases typically require. That does not mean one can throw in any kind of data and expect some magic to happen. It is still important to clearly define the semantics and structure of data (the “metadata”) that will be used for analytics purposes. This definition will then help in the next step of data transformation.

Going back to our example of the customer id, define how exactly this id will be stored in the warehouse. Is it a 10-digit numeric key generated by some algorithm, or is it simply a 4-digit sequence number appended to an existing id?

Many Hadoop projects are begun without any clear definition of metadata. Just like ETL, the term “metadata management” is considered old school, meant for traditional enterprise IT rather than our modern data architecture. But in reality, metadata is crucial for the success of Hadoop as a data warehouse. With a clear design and documentation, there is no ambiguity in what a particular field means or how it was generated. Investing up front in getting this right will save a lot of angst later on.
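One lightweight way to make the metadata definition actionable is to encode it as data and check incoming records against it. A Python sketch, where the field names and rules are purely illustrative:

```python
# Sketch: a metadata definition expressed as data, plus a validator that
# flags records not matching it. Fields and rules are illustrative.

METADATA = {
    "dw_customer_key": {"type": int, "required": True},
    "email":           {"type": str, "required": True},
    "geocode":         {"type": str, "required": False},
}

def validate(record, metadata=METADATA):
    errors = []
    for field, spec in metadata.items():
        value = record.get(field)
        if value is None:
            if spec["required"]:
                errors.append(f"missing required field: {field}")
        elif not isinstance(value, spec["type"]):
            errors.append(f"bad type for {field}: {type(value).__name__}")
    return errors
```

A check like this, run as part of the load, catches ambiguous or malformed fields before they silently pollute the warehouse.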

With the metadata defined, it can easily be transposed to Hadoop using Apache HCatalog, a technology that provides a relational table view of data in Hadoop. HCatalog also allows this view to be shared by different types of ETL jobs, whether Pig, Hive or MapReduce.

Create the ETL jobs

We can finally focus on the process of transforming the various sources of data. Here again, multiple technologies exist: MapReduce, Cascading and Pig are some of the most commonly used frameworks for developing ETL jobs. Which technology to use and how to create the jobs really depends on the data set and what transformations are needed. Many organizations use a combination of Pig and MapReduce, while others use Cascading exclusively. Learn about the different ways transform jobs can be done and the strengths and weaknesses of the various technologies.
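Whichever framework you pick, most ETL transform jobs share the same shape: map each input record to (key, value) pairs, then aggregate per key. A plain-Python sketch of that structure (the real frameworks distribute these phases across the cluster; the sales data here is made up):

```python
from collections import defaultdict

# Sketch of the map/reduce shape common to Pig, Cascading and MapReduce
# ETL jobs: a per-record map phase, then a per-key reduce phase.

def run_job(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for rec in records:
        for key, value in map_fn(rec):        # map phase: emit (key, value)
            groups[key].append(value)
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}  # reduce phase

# Example transform: total sales per customer (illustrative layout)
sales = [{"cust": "A", "amt": 10}, {"cust": "B", "amt": 5}, {"cust": "A", "amt": 7}]
totals = run_job(sales,
                 map_fn=lambda r: [(r["cust"], r["amt"])],
                 reduce_fn=lambda k, vs: sum(vs))
```

Pig and Cascading let you express this pipeline declaratively; raw MapReduce has you write the two phases by hand.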

A word of caution: engineers experienced in enterprise data management may be prone to aggressive data cleansing and transformation. They want order, and they want the data to conform to pre-defined schemas. However, the whole notion of big data is that it can be unstructured. Machine and sensor data are likely to be noisy; social media and other data may not fit into neat buckets. Too much cleansing can get rid of the very insights that big data promises. A thoughtful approach is required to get the most value from your data.

Create the Workflow

Data cleansing and transformations are easier done when multiple jobs cascade into a workflow, each performing a specific task. Often, data mappings and transformations need to be executed in a specific order, and there may be dependencies to check. These dependencies and sequences are captured in workflows; parallel flows allow parallel execution that can speed up the ETL process. Finally, the entire workflow needs to be scheduled. It may have to run weekly, nightly or perhaps even hourly.
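The dependency-and-sequence logic of a workflow is just a directed acyclic graph: each job lists the jobs it depends on, and the scheduler runs a job once its dependencies finish, running independent jobs in parallel. A sketch of computing the execution stages, with made-up job names:

```python
# Sketch: group workflow jobs into stages; jobs within a stage have no
# mutual dependencies and may run in parallel. Job names are illustrative.

def execution_stages(deps):
    """deps maps job -> set of jobs it depends on."""
    remaining, stages = dict(deps), []
    while remaining:
        ready = sorted(j for j, d in remaining.items()
                       if not (d & remaining.keys()))
        if not ready:
            raise ValueError("cycle in workflow")
        stages.append(ready)
        for j in ready:
            del remaining[j]
    return stages

workflow = {
    "extract_db":   set(),
    "extract_logs": set(),
    "transform":    {"extract_db", "extract_logs"},
    "load_dw":      {"transform"},
}
```

Here the two extract jobs land in the same stage and can run in parallel, exactly the speed-up the workflow structure is meant to expose.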

Although technologies such as Oozie provide some workflow management, it is typically insufficient. Many organizations create their own workflow management tools. This can be a complex process as it is important to take care of failure scenarios and restart the workflow appropriately.

A smooth workflow will result in the source data being ingested, transformed based on the metadata definition, and stored in Hadoop. At this point, the data is ready for analysis. And you guessed it: there are many different ways to do that with Hadoop. Hive, Impala and Lingual provide SQL-on-Hadoop functionality, while several commercial BI tools can connect to Hadoop to explore the data visually and generate reports.


We are finally done! We have created a data warehouse in Hadoop. Although this seems complicated (depending on the data and requirements), almost all of the technologies are open source and available for free. Tools are now emerging that help automate parts of this process. If your organization does not have the expertise, it may be a good idea to engage outside services to get started on this new architecture and these technologies while hiring and training your own staff.

Data warehouses are a requirement even for web/internet companies. Data cleansing, data transformation, ETL and metadata are all terms that are still relevant for new data architectures. But they don’t need to be implemented using proprietary, expensive products. Leveraging big data technologies such as Hadoop will ensure your data architecture stands the test of time (at least until the next big wave!)

A version of this article first appeared in the Online Backup News Journal.

For the past few years, we have heard a lot about the benefits of Hadoop, the dominant big data technology. But one less-spoken-of use case is backing up databases to Hadoop.


Data Warehouse Backup to Hadoop

Data warehouses, especially large ones, are expensive, and doing backups by replication to another DW is out of the question for many enterprises. Consequently, the method of choice to back up a data warehouse is quite often tape. Tape backup is neither cheap nor fast. Further, a restore from a backup can cause significant disruption to business depending on the time taken. Yet, there has been no other cost-effective solution until now.

By using commodity hardware and cheap, replicated disks, Hadoop has proven to be a safe and fast backup solution. It is easy to set up, and the attractive costs and recovery time make it an ideal choice for this important function. One of our customers, a major bank, took this approach and saved a considerable amount of money while avoiding a large data warehouse upgrade.

The other big advantage is that the backup system is live and the data in it can be analyzed. Use Hive, Impala or Lingual and users will never know whether they are accessing the active data warehouse or a backup!

Traditional backups are always squirreled away into hiding, never to be seen by engineers or analysts. In contrast, the Hadoop solution is not just active, fast and cheap but can be used for analytics.

Online Database Backup to Hadoop

It is not just large data warehouses that can benefit from backing up to Hadoop, but online, relational databases as well.

As online databases get larger, many DBAs prefer online backups with the ability to do point-in-time recovery. This allows for fast restores, a very important requirement for online databases. To ensure reliable, fast backups and restores, expensive SAN/NAS or specialized disk drives are used. Every backup makes a physical copy of the database, and depending on the frequency of backups and the number of backups you want to retain, the storage costs quickly add up.

In contrast, consider a Hadoop backup solution. Hadoop uses commodity servers with vanilla disks, achieving its scale and reliability because of its redundant, distributed architecture. You can even cobble together a Hadoop cluster using older equipment, perhaps beefing up the disks depending on the amount of storage required.

Back up your database as usual; then copy the backup over to the Hadoop cluster. Multiple backups can all be safely stored in the same cluster.
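Retaining multiple backups turns restores into a selection problem: for point-in-time recovery, pick the newest backup taken at or before the target time. A small Python sketch of that selection (the catalog layout is illustrative; in practice the backups would be files in HDFS, copied in with tools such as `hadoop distcp`):

```python
# Sketch: choose which backup to restore from for point-in-time recovery.
# The catalog structure is illustrative; real backups would live in HDFS.

def pick_backup(catalog, target_time):
    """catalog: list of (timestamp, path) pairs. Returns the path of the
    newest backup taken at or before target_time, or None if none qualifies."""
    candidates = [(ts, path) for ts, path in catalog if ts <= target_time]
    if not candidates:
        return None
    return max(candidates)[1]
```

From the chosen full backup, the database's own transaction logs roll the state forward to the exact target time.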

This article was originally published on
For the past few years, we have heard a lot about the benefits of augmenting the Enterprise Data Warehouse with Hadoop. The data warehouse vendors as well as the Hadoop vendors have been very careful with their terminology, showcasing how Hadoop can handle unstructured data while the EDW continues to remain the central source of truth in an enterprise.
That message was a desperate attempt by Teradata to hold on to its worldview while partnering with Hadoop vendors to ensure they stay on message as well.
Well, the game is over. With the rapid evolution of robust enterprise features like security, data management, governance, and enhanced SQL capabilities, Hadoop is all set to be the single source of truth in the enterprise.

Data Warehouse Augmentation with Hadoop

One of the first use cases for Hadoop in the enterprise was off-loading ETL tasks from the Enterprise Data Warehouse (EDW). Since Hadoop is excellent at batch processing, running ETL on Hadoop was an obvious choice. This provided the big benefit of saving precious resources on the EDW, leaving it to handle the more interactive and complex analytical queries. However, the term “off-load” or “migrate” ETL to Hadoop had negative connotations for Data Warehouse vendors who were concerned that this meant that Hadoop could do things that were traditionally done in the EDW at a much lower cost. Thus was born the term “augmentation”. Hadoop was not off-loading the EDW; it was augmenting it. The typical DW augmentation architecture thus shows Hadoop running ETL jobs and sending the results to the EDW.

DW augmentation with Hadoop

The advantage of this architecture is that the EDW is still the center of the data universe in an enterprise. Downstream data marts and BI tools and applications continue to work in the exact same manner, thus requiring no new tools or training for the business analysts. For first-time Hadoop users, this architecture makes sense – start small with a new technology, don’t bite off more than you can chew.

Other DW Use Cases

Of course, once Hadoop entered the enterprise, there was no stopping it. Within data warehousing itself, it is now used in myriad applications, including:

  • Off-loading Computations: Run ETL on Hadoop
  • Off-loading Storage: Move cold data (not often used) from EDW to Hadoop
  • Backup and Recovery: Replace tape backup with an active Hadoop cluster
  • Disaster Recovery: Use Hadoop in the DR site to backup the EDW

Notice that none of the above touches unstructured data – we are only talking about traditional Enterprise Data Warehousing and how Hadoop can make a difference in both reducing cost and increasing performance, while providing improved access to the data resources. Some of our customers have realized significant cost savings using Hadoop to augment the EDW.

Hadoop for Analytics

The big promise of Hadoop is the potential to gain new business insights with advanced predictive analytics, enabled by processing new sources of unstructured data (social media, multimedia, etc.) in combination with data from the EDW. However, using it for online, real-time analytics has been a problem until now. Hadoop’s architecture and MapReduce framework make it slow for real-time processing; its design point was batch.

This is the reason the Enterprise Data Warehouse continues to hold center stage. Databases like Teradata are excellent at performing complex, analytical queries on large amounts of data at great speed.

SQL on Hadoop

Although the Hadoop community recognized the need for SQL early on, it is only in the last year or two that great strides have been made to create an enterprise-grade SQL solution that can meet the needs of data warehousing analytics.

The Hortonworks Stinger initiative has dramatically improved performance for interactive queries in Hive, the dominant SQL-on-Hadoop technology.

Cloudera developed Impala from scratch, touting MPP-database-like performance.

Meanwhile, Apache Spark has gained momentum, replacing MapReduce with a focus on real-time, streaming analytics. Spark SQL holds much promise, and along with the likes of Hive and Impala, developers and users will have multiple technologies to choose from.

As adoption increases and the products mature, Hadoop will be more powerful than the MPP databases of today. We will see a shift from “augmentation” to “replacement” of the EDW with Hadoop.

The New Enterprise Data Architecture

With all the pieces in place for real-time, streaming analytics and enterprise features such as security (all Hadoop vendors now have security built-in) and data management and governance with the likes of Apache Falcon, Hadoop is ready to become the source of truth in the enterprise. No longer does it need to play second fiddle to the data warehouse; it IS the data warehouse. In fact, in many internet and technology companies, the data warehouse is built solely on Hadoop. Let us examine what this architecture might look like.

Hadoop as the source of truth

In the new enterprise architecture, Hadoop as the source of truth takes center stage. Businesses comfortable with their existing BI and reporting products can continue to use them as those products adapt to access the Big Data platform. Enterprise developers can build custom tools and applications, while data scientists use big data exploratory and predictive analytics tools to obtain new business insights. This is, after all, the true promise of Big Data: combine multiple sources of information and create predictive and prescriptive analytical models.

This transition from Hadoop augmenting the Data Warehouse to replacing it as the source of truth in larger enterprises can be undertaken in a phased approach with new analytical applications being served from Hadoop while the EDW still feeds the legacy BI applications.

Data warehousing vendors recognize this and are coming up with creative ways to stay relevant. Both Teradata and, more recently, Oracle have technologies that allow queries to span Hadoop and the database, letting Hadoop process the data stored in it while the database continues to handle the structured data. This is another good intermediate step in the transition process (albeit one that makes you more dependent on the EDW, not less!)


It is a matter of time before the Enterprise Data Warehouse as we know it, with its expensive proprietary appliances and software technologies, becomes obsolete. The open source Hadoop ecosystem is thriving and evolving very rapidly to perform all of the storage, compute and analytics required, while providing dramatic new functionality to handle huge amounts of unstructured and semi-structured data. All of this comes at a fraction of the cost. This ecosystem has proven that it is no longer true that innovation happens only in closed-source, proprietary companies.

Hadoop as the source of truth in the enterprise is almost here. If your enterprise is yet to begin the Hadoop journey, step on the pedal – otherwise you may be left behind.

Sunnyvale, May 16, 2014— Orzota, a big data solutions company, today announced that it has joined the Hortonworks Systems Integrator Partner Program and will deliver big data solutions powered by Hortonworks Data Platform (HDP). Hortonworks is the leading contributor to and provider of Apache™ Hadoop®, the most widely used big data technology today. Orzota’s customers can now benefit from a big data solution that includes an enterprise-grade Hadoop platform.

By joining the Hortonworks Systems Integrator Partner Program, Orzota will strengthen its ability to implement enterprise-level big data solutions based on the industry’s only 100-percent open source Hadoop distribution, explicitly architected, built, and tested for enterprise-grade deployments.

Orzota helps customers create a sound big data strategy and a robust, scalable big data architecture. Orzota has implemented solutions for Fortune 500 companies, enabling customers to derive new business insights from their data assets. Orzota’s custom process and tools help speed up deployments, reducing cost and complexity. With the Hortonworks partnership, our customers can now benefit from a fully supported Hadoop distribution as part of their data architecture.

“We are very excited to partner with Hortonworks to ensure that our solutions use an enterprise Hadoop platform. Our clients expect a scalable, highly available and secure architecture and this relationship will help us meet their expectations,” said Orzota founder and CTO, Bharath Mundlapudi.

”We welcome Orzota to the Hortonworks Systems Integrator Partner Program and look forward to working with them to provide data-driven applications powered by HDP for their customers,” said John Kreisa, vice president of strategic marketing at Hortonworks. “Orzota’s experience and exclusive focus on big data technologies enables them to deliver high value solutions for enterprises, now including 100-percent open source Hadoop.”

About Us

Orzota is a technology-enabled big data solutions company. Orzota provides a Big Data Management Platform and services to help businesses accelerate their big data deployments and gain insights from their data.

Orzota Partners with Concurrent to Deliver Big Data Applications

June 2, 2014—Orzota, Inc. a Big Data solutions company, today announced a strategic partnership with Concurrent, Inc. to accelerate the development of big data solutions for its clients. Concurrent is the company behind Cascading, the most widely used application development framework for building data applications. Orzota’s customers can now benefit from a big data solution that includes a powerful, robust, enterprise-grade development framework.

By joining forces with Concurrent, Orzota strengthens its ability to implement big data solutions with Cascading’s power to simplify big data application development. With Orzota’s expertise in designing and building big data applications using the right set of technologies, the addition of Cascading represents a further expansion of its toolbox.

Orzota helps customers create a sound big data strategy and a robust, scalable big data architecture. Orzota has implemented solutions for Fortune 500 companies, enabling new business insights from their data assets. Orzota’s Big Data Management Platform and tools help speed up deployments, reducing cost and complexity. With the Concurrent partnership, customers can now benefit from a fully supported Cascading framework to accelerate the time to market for their data applications.

“We are excited to partner with Concurrent and add Cascading to the list of technologies we specialize in. This will help reduce the development time of enterprise Hadoop applications for our clients,” said Orzota founder and CEO, Shanti Subramanyam.

“Concurrent is committed to simplifying data application development so that enterprises can make the most of their data”, said Gary Nakamura, CEO of Concurrent, Inc. “We look forward to working with Orzota to further our mission, by bringing the power of Cascading to their customers’ data-driven applications.”


About Orzota, Inc.

Orzota, Inc. is a technology-enabled big data solutions company. Orzota provides a variety of services powered by the Orzota Big Data Management Platform to help businesses accelerate their big data deployments and gain insights from their data. Visit the website for more information.

About Concurrent, Inc.

Concurrent, Inc. is the leader in Big Data application infrastructure, delivering products that help enterprises create, deploy, run and manage data applications at scale. The company’s flagship enterprise solution, Driven, was designed to accelerate the development and management of enterprise data applications. Concurrent is the team behind Cascading™, the most widely deployed technology for data applications with more than 150,000 user downloads a month.


These days, we see many articles about Big Data in various verticals: manufacturing, media, insurance, oil & gas, finance, retail, etc. We are living in interesting times, and the next 5 years will be fantastic with respect to the quality and safety of life on many fronts: travel safety, innovation in life sciences, new product discoveries in manufacturing, targeting customers with the right products that they care about, assessing the degree of risk to the corporation at any given point with very high accuracy, and so on. All these benefits are made possible primarily by Big Data technologies: Hadoop, Cassandra and MongoDB, to name a few.

At Orzota, we have helped customers ranging from SMBs to large corporations in many verticals – retail, financial and manufacturing. We have provided solutions on public clouds like AWS and Rackspace, private clouds using OpenStack, and of course in data centers ranging from a few nodes to thousands of nodes.

With this experience and knowledge, I’d like to share some of the scenarios that can go wrong if one is not careful, causing big data projects to fail. Hopefully, these highlighted areas will provoke some thought and help you plan and execute big data projects correctly, and of course under the planned budget!

Here are a few areas that can cause big data projects to fail:

  1. Traditional way of thinking
  2. Not having a clear strategy and roadmap
  3. Treating Hadoop as yet another data platform
  4. Not clearly defining the use case(s) to solve
  5. Technology focus rather than business focus
  6. Selecting the wrong tool for the job
  7. Not knowing and planning data access patterns
  8. Not having the right team

And of course, there are many other areas where one could go wrong in Big Data projects. We will continue to share our experience. Do contact us for any help at any stage of your Big Data project life cycle.

The role of the Enterprise Data Warehouse (EDW) is to integrate data across operational systems, which may sit in silos and be geographically distributed. The diagram below shows a typical architecture of a data warehouse.

Enterprise Data Warehouse Architecture

The ETL software extracts data, transforms values to normalize data, filters and cleanses “bad” data, and finally loads data into a target database. Data from operational systems such as transactional databases, ERP and CRM systems is loaded into the Enterprise Data Warehouse (EDW). Some businesses require a staging area to synchronize data coming from different time zones or geographies, or to handle other disparities.

The EDW in most large businesses is not accessed directly but through data marts. Data marts are typically departmental: the HR department is interested in information about employees, the sales department about sales, and so on. Some data marts are refreshed daily, others weekly or monthly.

Data Warehouse as Source of Truth

The EDW was considered the single version of truth across all disparate internal and external systems. Data is stored at a fine-grained level, recording, for example, every sale, so the data can be sliced and diced in different ways. However, not all enterprise data is stored in a data warehouse, only the data that is necessary for business insight and decisions. Still, EDWs tend to be large, storing hundreds of gigabytes of structured data in either a relational database or a columnar store (columnar refers to the method of storage; the databases themselves are still relational; examples include Teradata and Greenplum).

Big Data in the Enterprise Data Warehouse

What can Big Data do for the Enterprise Data Warehouse? Quite a lot, actually.

According to a recent survey by TDWI, some of the reasons for the expansion of the Data Warehousing Environment to include Big Data technologies such as Hadoop are:

  • advanced analytics
  • increasing volume of data
  • requirement for real-time processing
  • raw data exploration

But there are much simpler ways to start on Big Data. Hadoop provides the ability to archive and back up the EDW, and to off-load routine processing and infrequently used data. This can dramatically improve the performance and lower the cost of the Enterprise Data Warehouse, all without implementing a single new application!
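Off-loading storage can be as simple as a rule: rows not touched in N days move to Hadoop, and the rest stay in the EDW. A Python sketch of that split (the 90-day threshold and record layout are illustrative, not a recommendation):

```python
from datetime import date, timedelta

# Sketch: split records into hot (stay in the EDW) and cold (move to
# Hadoop) by last access date. Threshold and layout are illustrative.

def split_hot_cold(records, today, cold_after_days=90):
    cutoff = today - timedelta(days=cold_after_days)
    hot  = [r for r in records if r["last_accessed"] >= cutoff]
    cold = [r for r in records if r["last_accessed"] < cutoff]
    return hot, cold
```

Because the cold data remains queryable in Hadoop via SQL-on-Hadoop tools, off-loading it lowers EDW cost without taking it out of reach.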

Data Warehouse Augmentation

The architecture below shows a first step to take on the big data journey with data warehouse augmentation.


This architecture provides several benefits:

  • Store a large amount of data in Hadoop which becomes the single source of truth
  • Move the infrequently used data from the EDW to Hadoop, thus lowering costs (sometimes dramatically)
  • Run ETL, reports and other types of analytics directly on Hadoop
  • Store only the critical data needed for fast analytics and BI in the data warehouse
  • Allow data scientists to conduct exploratory analytics of the raw data in Hadoop directly
  • No disruption to existing business analysts and reporting functions


Big Data is disrupting the Enterprise Data Warehouse in a big way. When faced with a variety of daunting goals and applications, it can be difficult to figure out how to get started on Big Data. At Orzota, we believe that the best strategy is to start small, exploit Big Data meaningfully before moving on to bigger things.

Orzota has experience creating roadmaps, strategies and big data architectures for augmenting enterprise data warehouses with big data technologies such as Hadoop. Contact us to schedule a free one-hour consultation.