Orzota

Author: Shanti Subramanyam

NOTE: This article first appeared on Datanami

Mention ETL (Extract, Transform and Load) and eyes glaze over. The thought goes: “That stuff is old and meant for clunky enterprise data warehouses. What does it have to do with my internet/web/ecommerce application?”

Quite a lot, actually. ETL did originate in enterprise IT, where data from online databases is Extracted, then Transformed to normalize it, and finally Loaded into enterprise data warehouses for analysis. Although internet companies feel they have no use for expensive, proprietary data warehouses, the fact of the matter is that ETL is still a requirement, and so is some kind of data warehouse. The logic is simple: one doesn’t run business reports on the database powering the online application.

An ETL Example

Consider the classic example of key transformation. The application database uses a customer_id to index into the customer table, while the CRM system references the same customer differently. The business analyst wants to analyze how customers are using the product, so the data warehouse needs a distinct way to refer to customers, i.e., the keys need to be transformed and mapped to a new key in the DW. Even if there is a single source system, it is still a good idea to do such transformations to isolate the warehouse from the online database. In addition to such basic transformations, data is also often enriched (for example, using geocodes) to create the target customer record in the warehouse. There is no getting away from it: ETL is a requirement whether you are the hottest social media company or a 50-year-old bank.
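To make this concrete, here is a minimal sketch in plain Python of the kind of key transformation and enrichment described above. The cross-reference table, field names and geocode lookup are hypothetical placeholders; a real job would apply the same logic at scale on Hadoop.

```python
# Minimal sketch: map source-system customer keys to one warehouse surrogate
# key and enrich the record with a geocode. All names/values are illustrative.

# (source system, source key) -> 10-digit warehouse surrogate key.
# The app DB and the CRM refer to the same customer with different keys.
xref = {
    ("app_db", "42"):        1000000001,
    ("crm",    "CUST-0042"): 1000000001,
}

geocodes = {"94085": "37.39,-122.01"}   # zip code -> lat,long (enrichment)

def transform(record, source_system):
    """Build the target customer record for the warehouse."""
    return {
        "customer_key": xref[(source_system, record["customer_id"])],
        "name": record["name"],
        "geocode": geocodes.get(record["zip_code"]),
    }

app_row = {"customer_id": "42", "name": "Jane Doe", "zip_code": "94085"}
crm_row = {"customer_id": "CUST-0042", "name": "Jane Doe", "zip_code": "94085"}

# Both source records resolve to the same warehouse key.
print(transform(app_row, "app_db"))
print(transform(crm_row, "crm"))
```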

Why Hadoop?

All right. We need ETL. But what has Hadoop got to do with it?

Turns out that Hadoop is an ideal platform to run ETL. You can feed the results into a traditional data warehouse, or better yet, simply use Hadoop itself as your warehouse. Two for the price of one! And ingesting data from all sources into a centralized Hadoop repository is future proof: as your business scales and the data grows rapidly, the Hadoop infrastructure can scale easily.

The Hadoop platform has tools that can extract the data from the source systems, whether they are log files, machine data or online databases, and load it into Hadoop in record time. It is possible to do transformations on the fly as well, although more elaborate processing is better done after the data is loaded into Hadoop. Programming and scripting frameworks allow complex ETL jobs to be deployed and executed in a distributed manner. Rapid improvements in interactive SQL tools make Hadoop an ideal choice for a low-cost data warehouse.

Got it. What needs to be done to get this all to work? Read on to find out.

ETL Process in Hadoop

An architecture for setting up a Hadoop data store for ETL is shown below.

ETL Architecture

Here are the typical steps to set up Hadoop for ETL:

  1. Set up a Hadoop cluster
  2. Connect data sources
  3. Define the metadata
  4. Create the ETL jobs
  5. Create the workflow

Set Up a Hadoop Cluster

This step can be really simple or quite difficult, depending on where you want the cluster to be. On the public cloud, you can create a Hadoop cluster with just a few clicks using Amazon EMR, Rackspace Cloud Big Data or other cloud Hadoop offerings. If the data sources are already on the same public cloud, then this is obviously the no-brainer solution.
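As an illustration of the cloud route, the sketch below uses boto3 (the AWS SDK for Python) to request a small EMR cluster. The instance types, release label and IAM role names are placeholders you would adjust for your own account; Rackspace Cloud Big Data offers an equivalent API and console workflow.

```python
# A minimal sketch: request a small Hadoop cluster on Amazon EMR with boto3.
# Instance types, key/role names and the release label are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="etl-cluster",
    ReleaseLabel="emr-5.30.0",                  # pick a current release
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Pig"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 4,
        "KeepJobFlowAliveWhenNoSteps": True,    # keep it up for interactive ETL
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print("Cluster id:", response["JobFlowId"])
```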

If, however, your data sources happen to be in a data center, there are several things to take into consideration.

  1. Can the data be moved to the cloud? Legal, security, privacy and cost considerations apply.
  2. Can test data be used for development?

If the answer is No to both questions, then a cluster will need to be provisioned in the Data Center. Go befriend your IT/OPS guy right away.

Connect Data Sources

The Hadoop ecosystem includes several technologies, such as Apache Flume and Apache Sqoop, for connecting data sources such as log files, machine data and RDBMSs. Depending on the amount of data and the rate of new data generation, a data ingestion architecture and topology must be planned. Start small and iterate, just like any other development project. The goal is to move the data into Hadoop at a frequency that meets analytics requirements.
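For example, a relational source can be pulled into HDFS with a Sqoop import job. The sketch below drives Sqoop from Python with a hypothetical JDBC URL, table and paths; log and event data would more typically arrive through Flume.

```python
# Sketch: pull a relational table into HDFS with Sqoop, driven from Python.
# The JDBC URL, credentials and paths are placeholders.
import subprocess

sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://appdb.example.com/shop",
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",   # avoid plain-text passwords
    "--table", "customers",
    "--target-dir", "/data/raw/customers",
    "--num-mappers", "4",                          # parallel extract
]
subprocess.run(sqoop_cmd, check=True)
```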

Define the Metadata

Hadoop is a “schema-on-read” platform and there is no need to create a schema before loading data as databases typically require. That does not mean one can throw in any kind of data and expect some magic to happen. It is still important to clearly define the semantics and structure of data (the “metadata”) that will be used for analytics purposes. This definition will then help in the next step of data transformation.

Going back to our example of the customer_id, define exactly how this id will be stored in the warehouse. Is it a 10-digit numeric key generated by some algorithm, or is it simply a 4-digit sequence number appended to an existing id?

Many Hadoop projects are begun without any clear definition of metadata. Just like ETL, the term “metadata management” is considered old school, meant for traditional enterprise IT rather than for our modern data architecture. But in reality, metadata is crucial for the success of Hadoop as a data warehouse. With a clear design and documentation, there is no ambiguity about what a particular field means or how it was generated. Investing up front in getting this right will save a lot of angst later on.

With the metadata defined, it can be easily transposed to Hadoop using Apache HCatalog, a technology that provides a relational table view of data in Hadoop. HCatalog also allows this view to be shared by different types of ETL jobs, whether Pig, Hive or MapReduce.
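As a sketch of what this might look like, the snippet below registers an illustrative customer table through the Hive CLI, so the definition lands in the metastore that HCatalog exposes to Pig, Hive and MapReduce jobs. The database, columns and HDFS location are assumptions, not a prescribed layout.

```python
# Sketch: register the customer metadata as a Hive/HCatalog table so that
# Pig, Hive and MapReduce jobs all share one definition. Names and the HDFS
# location are illustrative.
import subprocess

ddl = """
CREATE DATABASE IF NOT EXISTS dw;
CREATE EXTERNAL TABLE IF NOT EXISTS dw.customers (
  customer_key BIGINT COMMENT '10-digit surrogate key',
  name         STRING,
  geocode      STRING COMMENT 'lat,long derived during enrichment'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
STORED AS TEXTFILE
LOCATION '/data/warehouse/customers';
"""

subprocess.run(["hive", "-e", ddl], check=True)
```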

Create the ETL Jobs

We can finally focus on the process of transforming the various sources of data. Here again, multiple technologies exist: MapReduce, Cascading and Pig are some of the most commonly used frameworks for developing ETL jobs. Which technology to use and how to create the jobs really depends on the data set and the transformations needed. Many organizations use a combination of Pig and MapReduce, while others use Cascading exclusively. Learn about the different ways transform jobs can be built and the strengths and weaknesses of the various technologies.
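As one illustration, a transform can be as simple as a Hadoop Streaming job whose mapper is a small Python script. The field positions and the surrogate-key logic below are hypothetical stand-ins for whatever your metadata definition calls for.

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming mapper: read raw tab-separated customer rows
# from stdin, emit warehouse-keyed rows on stdout. Field positions and the
# key logic are hypothetical. Save as transform_mapper.py (name assumed) and
# run with something like:
#   hadoop jar hadoop-streaming.jar -input /data/raw/customers \
#     -output /data/warehouse/customers \
#     -mapper transform_mapper.py -file transform_mapper.py
import sys

def warehouse_key(source_key):
    # Placeholder for the real surrogate-key lookup (e.g. against a
    # cross-reference table distributed with the job).
    return 1000000000 + int(source_key)

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        continue                      # drop malformed rows (gently!)
    customer_id, name, zip_code = fields[0], fields[1], fields[2]
    print("\t".join([str(warehouse_key(customer_id)), name, zip_code]))
```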

A word of caution: engineers experienced in enterprise data management may be prone to aggressive data cleansing and transformation. They want order, and they want the data to conform to pre-defined schemas. However, the whole notion of big data is that it can be unstructured. Machine and sensor data are likely to be noisy; social media and other data may not fit into neat buckets. Too much cleansing can get rid of the very insights that big data promises. A thoughtful approach is required to get the most value from your data.

Create the Workflow

Data cleansing and transformations are easier when multiple jobs cascade into a workflow, each performing a specific task. Often data mappings/transformations need to be executed in a specific order, and/or there may be dependencies to check. These dependencies and sequences are captured in workflows; parallel flows allow parallel execution that can speed up the ETL process. Finally, the entire workflow needs to be scheduled. It may have to run weekly, nightly or perhaps even hourly.

Although technologies such as Oozie provide some workflow management, it is typically insufficient. Many organizations create their own workflow management tools. This can be a complex process as it is important to take care of failure scenarios and restart the workflow appropriately.
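A home-grown workflow driver often starts out as little more than the sketch below, which runs stages in dependency order, stops on the first failure and lets you resume from a named stage. The stage names and commands are placeholders; Oozie (or a similar scheduler) would normally handle retries and calendar-based scheduling declaratively.

```python
# Minimal sketch of a workflow driver. Stages and commands are placeholders
# that reuse the illustrative Sqoop/streaming/Hive examples above.
import subprocess
import sys

STAGES = [
    ("ingest_customers",
     ["sqoop", "import",
      "--connect", "jdbc:mysql://appdb.example.com/shop",
      "--table", "customers",
      "--target-dir", "/data/raw/customers"]),
    ("transform_customers",
     ["hadoop", "jar", "hadoop-streaming.jar",
      "-input", "/data/raw/customers",
      "-output", "/data/warehouse/customers",
      "-mapper", "transform_mapper.py",
      "-file", "transform_mapper.py"]),
    ("validate_load",
     ["hive", "-e", "SELECT COUNT(*) FROM dw.customers"]),
]

def run_workflow(start_from=None):
    """Run stages in order; optionally resume from a named stage."""
    started = start_from is None
    for name, cmd in STAGES:
        if not started:
            if name != start_from:
                continue              # already completed in an earlier run
            started = True
        print("running stage:", name)
        if subprocess.run(cmd).returncode != 0:
            print("stage %s failed; fix and rerun with start_from=%r" % (name, name))
            sys.exit(1)

if __name__ == "__main__":
    run_workflow()
```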

A smooth workflow will result in the source data being ingested, transformed based on the metadata definition, and stored in Hadoop. At this point, the data is ready for analysis. And you guessed it! There are many different ways to do that with Hadoop: Hive, Impala and Lingual provide SQL-on-Hadoop functionality, while several commercial BI tools can connect to Hadoop to explore the data visually and generate reports.
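For instance, a simple analytical query can be issued against the Hadoop “warehouse” through the Hive CLI (or through Impala, Lingual or a BI tool over JDBC/ODBC). The table and columns below are the illustrative ones used earlier, not a required schema.

```python
# Sketch: a simple analytical query against the Hadoop "warehouse" via the
# Hive CLI. Table and columns match the earlier illustrative DDL.
import subprocess

query = """
SELECT geocode, COUNT(*) AS customers
FROM dw.customers
GROUP BY geocode
ORDER BY customers DESC
LIMIT 10;
"""
subprocess.run(["hive", "-e", query], check=True)
```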

Celebrate!

We are finally done! We have created a data warehouse in Hadoop. Although this may seem complicated (depending on the data and requirements), almost all of the technologies are open source and available for free. Tools are now emerging that help automate parts of this process. If your organization does not have the expertise, it may be a good idea to engage outside services to get started on these new architectures and technologies while hiring and training your own staff.

Data warehouses are a requirement even for web/internet companies. Data cleansing, data transformation, ETL and metadata are all terms that are still relevant for new data architectures. But they don’t need to be implemented using proprietary, expensive products. Leveraging Big Data technologies such as Hadoop will ensure your data architecture stands the test of time (at least until the next big wave!)

A version of this article first appeared in the Online Backup News Journal.

For the past few years, we have heard a lot about the benefits of Hadoop, the dominant big data technology. But one less-spoken-of use case is backing up databases to Hadoop.


Data Warehouse Backup to Hadoop

Data warehouses, especially large ones, are expensive, and doing backups by replication to another DW is out of the question for many enterprises. Consequently, the method of choice to back up a data warehouse is quite often tape. Tape backup is neither cheap nor fast. Further, a restore from a backup can cause significant disruption to business, depending on the time taken. Yet there has been no other cost-effective solution until now.

By using commodity hardware and cheap disks that are replicated, Hadoop has proven to be a safe and fast backup solution. It is easy to set up, and the attractive costs and recovery time make it an ideal choice for this important function. One of our customers, a major bank, took this approach and saved a considerable amount of money while avoiding a large Data Warehouse upgrade.

The other big advantage is that the backup system is live and the data in it can be analyzed. Use Hive, Impala or Lingual and users will never know whether they are accessing the active data warehouse or a backup!

Traditional backups are always squirreled away into hiding, never to be seen by engineers or analysts. In contrast, the Hadoop solution is not just active, fast and cheap but can be used for analytics.

Online Database Backup to Hadoop

It is not just large data warehouses that can benefit from backing up to Hadoop, but online, relational databases as well.

As online databases get larger, many DBAs prefer online backups with the ability to do point-in-time recovery. This allows for fast restores, a very important requirement for online databases. To ensure reliable, fast backups and restores, expensive SAN/NAS or specialized disk drives are used. Every backup makes a physical copy of the database, and depending on the frequency of backups and the number of backups you want to retain, the storage costs quickly add up.

In contrast, consider a Hadoop backup solution. Hadoop uses commodity servers with vanilla disks, achieving its scale and reliability because of its redundant, distributed architecture. You can even cobble together a Hadoop cluster using older equipment, perhaps beefing up the disks depending on the amount of storage required.

Back up your database as usual; then copy it over to the Hadoop cluster. Multiple backups can all be safely stored in the same cluster.
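A minimal sketch of that copy step, assuming a nightly dump file and per-day HDFS directories (both placeholders), might look like this:

```python
# Sketch: push a nightly database backup file into HDFS, keeping one
# directory per day. Paths and the dump filename are placeholders.
import datetime
import subprocess

today = datetime.date.today().isoformat()
local_dump = "/backups/appdb/appdb-%s.dump.gz" % today
hdfs_dir = "/backups/appdb/%s" % today

subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", local_dump, hdfs_dir], check=True)
# For warehouse-scale or cluster-to-cluster copies, `hadoop distcp` can
# parallelize the transfer instead of a single -put.
```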

This article was originally published on SandHill.com.

For the past few years, we have heard a lot about the benefits of augmenting the Enterprise Data Warehouse with Hadoop. The Data Warehouse vendors as well as the Hadoop vendors have been very careful with their terminology, showcasing how Hadoop can handle unstructured data while the EDW continues to remain the central source of truth in an enterprise.

That message was a desperate attempt by Teradata to hold on to its worldview while partnering with Hadoop vendors to ensure they stay on message as well.

Well, the game is over. With the rapid evolution of robust enterprise features like security, data management, governance, and enhanced SQL capabilities, Hadoop is all set to be the single source of truth in the enterprise.

Data Warehouse Augmentation with Hadoop

One of the first use cases for Hadoop in the enterprise was off-loading ETL tasks from the Enterprise Data Warehouse (EDW). Since Hadoop is excellent at batch processing, running ETL on Hadoop was an obvious choice. This provided the big benefit of saving precious resources on the EDW, leaving it to handle the more interactive and complex analytical queries. However, terms like “off-load” or “migrate” ETL to Hadoop had negative connotations for Data Warehouse vendors, who were concerned that they meant Hadoop could do things traditionally done in the EDW at a much lower cost. Thus was born the term “augmentation”. Hadoop was not off-loading the EDW; it was augmenting it. The typical DW augmentation architecture thus shows Hadoop running ETL jobs and sending the results to the EDW.

DW augmentation with Hadoop

The advantage of this architecture is that the EDW is still the center of the data universe in an enterprise. Downstream data marts, BI tools and applications continue to work in exactly the same manner, thus requiring no new tools or training for the business analysts. For first-time Hadoop users, this architecture makes sense: start small with a new technology and don’t bite off more than you can chew.

Other DW Use Cases

Of course, once Hadoop entered the enterprise, there was no stopping it. Within Data Warehousing itself, it is now used in myriad applications, including:

  • Off-loading Computations: Run ETL on Hadoop
  • Off-loading Storage: Move cold data (not often used) from EDW to Hadoop
  • Backup and Recovery: Replace tape backup with an active Hadoop cluster
  • Disaster Recovery: Use Hadoop in the DR site to backup the EDW

Notice that none of the above touches unstructured data – we are only talking about traditional Enterprise Data Warehousing and how Hadoop can make a difference in both reducing cost and increasing performance, while providing improved access to the data resources. Some of our customers have realized significant cost savings using Hadoop to augment the EDW.

Hadoop for Analytics

The big promise of Hadoop is the potential to gain new business insights with advanced predictive analytics, enabled by processing new sources of unstructured data (social media, multimedia, etc.) in combination with data from the EDW. However, using it for online, real-time analytics has been a problem until now. Hadoop’s architecture and MapReduce framework make it slow for real-time processing; its design point was batch.

This is the reason for the continued center stage of the Enterprise Data Warehouse. Databases like Teradata are excellent at performing complex, analytical queries on large amounts of data at great speeds.

SQL on Hadoop

Although the Hadoop community recognized the need for SQL early on, it is only in the last year or two that great strides have been made to create an enterprise-grade SQL solution that can meet the needs of data warehousing analytics.

The Hortonworks Stinger initiative has dramatically improved performance for interactive queries in Hive, the dominant SQL-on-Hadoop technology.

Cloudera developed Impala from scratch, touting MPP-database-like performance.

Meanwhile, Apache Spark has gained momentum, replacing MapReduce to focus on real-time, streaming analytics. Spark SQL holds much promise, and along with the likes of Hive and Impala, developers and users will have multiple technologies to choose from.

As adoption increases and the products mature, Hadoop will be more powerful than the MPP databases of today. We will see a shift from “augmentation” to “replacement” of the EDW with Hadoop.

The New Enterprise Data Architecture

With all the pieces in place for real-time, streaming analytics, and enterprise features such as security (all Hadoop vendors now have security built in), data management and governance with the likes of Apache Falcon, Hadoop is ready to become the source of truth in the enterprise. No longer does it need to play second fiddle to the data warehouse; it IS the data warehouse. In fact, in many Internet and technology companies, the data warehouse is built solely on Hadoop. Let us examine what this architecture might look like.

Hadoop as the source of truth

In the new enterprise architecture, Hadoop as the source of truth takes center stage. Businesses comfortable with their existing BI and reporting products can continue to use them as those products adapt to access the Big Data platform. Enterprise developers can build custom tools and applications, while data scientists use big data exploratory and predictive analytics tools to obtain new business insights. This is, after all, the true promise of Big Data: combine multiple sources of information and create predictive and prescriptive analytical models.

This transition from Hadoop augmenting the Data Warehouse to replacing it as the source of truth in larger enterprises can be undertaken in a phased approach, with new analytical applications being served from Hadoop while the EDW still feeds the legacy BI applications.

Data warehousing vendors recognize this and are coming up with creative ways to stay relevant. Both Teradata and, more recently, Oracle have technologies that allow queries to span Hadoop and the database, allowing Hadoop to process data stored in it while the database continues to handle the structured data. This is another good intermediate step in the transition process (albeit one that makes you more dependent on the EDW, not less!)

Conclusion

It is a matter of time before the Enterprise Data Warehouse as we know it, with expensive proprietary appliances and software technologies, becomes obsolete. The open source Hadoop ecosystem is thriving and evolving very rapidly to perform all of the storage, compute and analytics required, while providing dramatic new functionality to handle huge amounts of unstructured and semi-structured data. All of this functionality comes at a fraction of the cost. This ecosystem has proven that it is no longer true that innovation happens only in closed source, proprietary companies.

Hadoop as the source of truth in the enterprise is almost here. If your enterprise is yet to begin the Hadoop journey, step on the pedal – otherwise you may be left behind.

Sunnyvale, May 16, 2014— Orzota, a big data solutions company, today announced that it has joined the Hortonworks Systems Integrator Partner Program and will deliver big data solutions powered by Hortonworks Data Platform (HDP). Hortonworks is the leading contributor to and provider of Apache™ Hadoop®, the most widely used big data technology today. Orzota’s customers can now benefit from a big data solution that includes an enterprise-grade Hadoop platform.

By joining the Hortonworks Systems Integrator Partner Program, Orzota will strengthen its ability to implement enterprise-level big data solutions based on the industry’s only 100-percent open source Hadoop distribution, explicitly architected, built, and tested for enterprise-grade deployments.

Orzota helps customers create a sound big data strategy and a robust, scalable big data architecture. Orzota has implemented solutions for Fortune 500 companies, enabling customers to derive new business insights from their data assets. Orzota’s custom process and tools help speed up deployments, reducing cost and complexity. With the Hortonworks partnership, our customers can now benefit from a fully supported Hadoop distribution as part of their data architecture.

“We are very excited to partner with Hortonworks to ensure that our solutions use an enterprise Hadoop platform. Our clients expect a scalable, highly available and secure architecture and this relationship will help us meet their expectations,” said Orzota founder and CTO, Bharath Mundlapudi.

“We welcome Orzota to the Hortonworks Systems Integrator Partner Program and look forward to working with them to provide data-driven applications powered by HDP for their customers,” said John Kreisa, vice president of strategic marketing at Hortonworks. “Orzota’s experience and exclusive focus on big data technologies enables them to deliver high-value solutions for enterprises, now including 100-percent open source Hadoop.”

About Us

Orzota is a technology-enabled big data solutions company. Orzota provides a Big Data Management Platform and services to help businesses accelerate their big data deployments and gain insights from their data.

Orzota Partners with Concurrent to Deliver Big Data Applications

June 2, 2014—Orzota, Inc., a Big Data solutions company, today announced a strategic partnership with Concurrent, Inc. to accelerate the development of big data solutions for its clients. Concurrent is the company behind Cascading, the most widely used application development framework for building data applications. Orzota’s customers can now benefit from a big data solution that includes a powerful, robust, enterprise-grade development framework.

By joining forces with Concurrent, Orzota strengthens its ability to implement big data solutions with Cascading’s power to simplify big data application development. Given Orzota’s expertise in designing and building big data applications using the right set of technologies, the addition of Cascading represents a further expansion of its toolbox.

Orzota helps customers create a sound big data strategy and a robust, scalable big data architecture. Orzota has implemented solutions for Fortune 500 companies, enabling new business insights from their data assets. Orzota’s Big Data Management Platform and tools help speed up deployments, reducing cost and complexity. With the Concurrent partnership, customers can now benefit from a fully supported Cascading framework to accelerate the time to market for their data applications.

“We are excited to partner with Concurrent and add Cascading to the list of technologies we specialize in. This will help reduce the development time of enterprise Hadoop applications for our clients,” said Orzota founder and CEO, Shanti Subramanyam.

“Concurrent is committed to simplifying data application development so that enterprises can make the most of their data,” said Gary Nakamura, CEO of Concurrent, Inc. “We look forward to working with Orzota to further our mission by bringing the power of Cascading to their customers’ data-driven applications.”

 

About Orzota, Inc.

Orzota, Inc. is a technology-enabled big data solutions company. Orzota provides a variety of services powered by the Orzota Big Data Management Platform to help businesses accelerate their big data deployments and gain insights from their data. Visit the website for more information.

About Concurrent, Inc.

Concurrent, Inc. is the leader in Big Data application infrastructure, delivering products that help enterprises create, deploy, run and manage data applications at scale. The company’s flagship enterprise solution, Driven, was designed to accelerate the development and management of enterprise data applications. Concurrent is the team behind Cascading™, the most widely deployed technology for data applications with more than 150,000 user downloads a month.

 

The role of the Enterprise Data Warehouse (EDW) is to integrate data across operational systems, which may be in silos and geographically distributed. The diagram below shows a typical architecture of a data warehouse.

Enterprise Data Warehouse Architecture

The ETL software extracts data, transforms values to normalize the data, filters and cleanses “bad” data and finally loads the data into a target database. Data from operational systems such as transactional databases, ERP and CRM systems is loaded into the Enterprise Data Warehouse (EDW). Some businesses require a staging area to synchronize data coming from different time zones or geographies, or to handle other disparities.

The EDW in most large businesses is not accessed directly but through data marts. Data marts are typically departmental: the HR department is interested in information about employees, the sales department about sales, and so on. Some data marts are refreshed daily, while others are refreshed weekly or monthly.

Data Warehouse as Source of Truth

The EDW was considered the single version of truth across all disparate internal and external systems. Data is stored at a fine-grained level, recording, for example, every sale, so that the data can be sliced and diced in different ways. However, not all enterprise data is stored in a data warehouse, only the data that is necessary for business insight and decisions. Still, EDWs tend to be large, storing hundreds of gigabytes of structured data, either in a relational database or a columnar store (columnar refers to the method of storage; the databases themselves are still relational; examples include Teradata and Greenplum).

Big Data in the Enterprise Data Warehouse

What can Big Data do for the Enterprise Data Warehouse? Quite a lot, actually.

According to a recent survey by TDWI, some of the reasons for the expansion of the Data Warehousing Environment to include Big Data technologies such as Hadoop are:

  • advanced analytics
  • increasing volume of data
  • requirement for real-time processing
  • raw data exploration

But there are much simpler ways to start on Big Data. Hadoop provides the ability to archive and back up the EDW, and to off-load routine processing and infrequently used data. This can dramatically improve the performance and lower the cost of the Enterprise Data Warehouse; all this without implementing a single new application!

Data Warehouse Augmentation

The architecture below shows a first step to take on the big data journey with data warehouse augmentation.

Data warehouse augmentation architecture

This architecture provides several benefits:

  • Store a large amount of data in Hadoop which becomes the single source of truth
  • Move the infrequently used data from the EDW to Hadoop, thus lowering costs (sometimes dramatically)
  • Run ETL, reports and other types of analytics directly on Hadoop
  • Store only the critical data needed for fast analytics and BI in the data warehouse
  • Allow data scientists to conduct exploratory analytics of the raw data in Hadoop directly
  • No disruption to existing business analysts and reporting functions

Conclusion

Big Data is disrupting the Enterprise Data Warehouse in a big way. When faced with a variety of daunting goals and applications, it can be difficult to figure out how to get started on Big Data. At Orzota, we believe that the best strategy is to start small and exploit Big Data meaningfully before moving on to bigger things.

Orzota has experience creating roadmaps, strategy and big data architectures for augmenting enterprise data warehouses with big data technologies such as Hadoop. Contact us to schedule a free one-hour consultation.

 

 

The original use case for Hadoop is running ETL to transform the large amounts of raw data collected from the myriad sources, both internal and external, of a business. Today, this use case is still very relevant and a primary one for enterprises intent on leveraging their data assets. The diagram below shows how this can be implemented in a public cloud environment where a business application (e.g. a web application) is deployed on the same cloud platform. Many other architectures are possible, including private and hybrid cloud models.

ETL on Hadoop

Hadoop ETL in the cloud

As we have gone about executing client projects, one thing that has stood out is how difficult it is to set up new infrastructure, even for pilot projects. In many enterprises, IT can take 3-6 months to provision a single rack of servers. Once the servers are up, the next big complexity is in setting up and managing technologies such as Hadoop. It is a well-known fact that Hadoop is hard. Although the availability of pre-packaged distributions from multiple vendors has helped, the fact remains that building and deploying Hadoop applications is not easy.

Big Data Management Platform

At Orzota, we are doing our share to make Hadoop easy. Today, we are announcing the Orzota Big Data Management Platform. Our goals with this platform are two-fold:

  1. Make it easy to launch and manage a correctly configured Hadoop cluster on the cloud
  2. Make it easy to deploy, run and manage a variety of ETL jobs

Cluster Management

Amazon and Rackspace have greatly simplified the launch of clusters through EMR and Cloud Big Data, respectively. These services simplify Hadoop setup for developers at the infrastructure level. With our platform, we look at cluster management from the application level, leveraging the underlying infrastructure services provided by the cloud vendors.

Job Management

Creating a Hadoop cluster on the cloud is now just a couple of clicks away. Our job management module makes deploying, scheduling, monitoring and managing jobs just as easy. We currently support MapReduce, Hive and Pig, the most common frameworks used by ETL applications.

For more details on our platform, please visit the Orzota Big Data Management page.

Oct 28, 2013 – Rackspace announced the availability of “Cloud Big Data”, the Hortonworks Data Platform powered by Apache Hadoop, in both the managed hosting environment and the Rackspace public cloud.

The Rackspace Cloud Big Data Platform features an industry-leading platform powered by an Apache Hadoop distribution that customers can deploy and scale rapidly in the cloud, with simple pricing and no upfront investment. This provides an open service model powered by Apache Hadoop that will help foster a large ecosystem of application vendors. The solution offers:

  • Rapid Deployment/Low Operational Burden – Enables customers to provision a fully featured environment powered by Apache Hadoop and scale it elastically in minutes.
  • Simple Pricing and Cost Effective Infrastructure – Offers a cost effective platform with a simple pricing model to fit a majority of Big Data workloads without any hidden costs.
  • A Rich Partner Ecosystem – Delivers the first truly open cloud platform powered by Apache Hadoop, which will allow a broad ecosystem of partners and applications to integrate with no additional development.

“Using Rackspace’s Cloud Big Data Platform has been transformative to our development efforts. While proprietary platforms lead users down roads littered with caveats, Cloud Big Data Platform gives us an intuitive platform to build our new suite of services,” said Shanti Subramanyam, CEO of Orzota. “With the great support of the Rackspace team, Orzota’s ETL/data science platform atop Rackspace’s Cloud Big Data Platform can greatly simplify deployments and help drive the use of Apache Hadoop.”

Orzota is one of the first partners of Rackspace to build a product on the Rackspace Big Data Cloud. The Orzota Big Data Management platform is the starting point of our suite of services to help our clients rapidly develop and deploy Big Data applications.

Please see Orzota Big Data Management for more information.

 

 

As Big Data makes inroads into enterprises, the hype and assumption was always that there was massive value to be gained by discovering all the hidden nuggets in a company’s data that weren’t being processed. This was especially true for online retailers, who had extensive information about customer behavior. And there was data to back this up.

Nucleus Research conducted a study on ROI from big data in 2012 and found substantial returns reported by companies:

“Examples of those returns included: a 942 percent ROI for a manufacturer that was able to scour large, disparate data sets from vendors for purchasing and cost information; 1,822 percent ROI from reduced labor costs by a resort that integrated shift scheduling processes with data from the National Weather Service; and an 863 percent ROI by a metropolitan police force that was able to combine various crime databases alongside predictive analytics and its department assets.”

Big Data ROI in Retail

CNBC published an article titled “Retail Goes Shopping Through Big Data” in April 2013 in which they report: “Macys.com analyzes tens of millions of terabytes of information every day, including social media, store transactions and even Twitter feeds with compelling results. Big data analytics has helped Macy’s boost store sales by 10 percent.”

The article goes on to give other examples:

“Kroger CEO David Dillon has called big data analytics his “secret weapon” in fending off other grocery competitors. And Sterling Jewelers has harnessed customer analytics in its digital channels, increasing sales 49 percent in the past holiday season.”

Realistic ROI

A legitimate question is whether these cases were singled out. Is the Big Data ROI real and something that any company can expect to attain?

Wikibon tried to answer this question by conducting a survey. Their results showed that nearly 46% of companies with Big Data deployments in production were not seeing the value they expected. They reported a return of 50 cents on each dollar of investment, when their expectation was $3 to $4 over a 3-5 year period.

The blogosphere took to this survey with gusto – everyone quoting it and theorizing on why this was the case.

I must admit that I was tempted to join the fray, but several questions came to mind and it would be good to have answers first. After all, we are data engineers – it behooves us to use data and metrics to evaluate this information.

So here are some questions about the survey:

  1. How did companies come up with $3 – $4 ROI in the first place?
  2. Reading the article carefully, it states that the expected return was over a 3-5 year period. Where were the companies in this cycle when surveyed? Maybe the 50 cents makes sense if their deployments were still in a state of flux.
  3. Shouldn’t the chart below actually reflect the year-over-year ROI rather than one big number? After all, the surveyed companies were presumably at different points in the lifecycle of their big data deployments. It would also be good to see a scatter plot or histogram of all of the individual results. Averages can be meaningless if the data varies widely.
  4. How many companies were surveyed? How many of these had full deployments vs. partial? How many projects?

Wikibon Big Data ROI survey results

With so much ambiguity, it is difficult to analyze this claim of poor ROI. To its credit, Wikibon does mention that the cause of the poor returns is probably the lack of skilled staff and/or a failure to address performance issues as big data deployments are scaled to handle more and more data. Companies that had solved these problems saw better returns.

Conclusions

There is lots of anecdotal evidence about the great insights derived from big data directly leading to increased revenue, reduced customer churn, etc. Kudos to Wikibon for trying to systematically quantify that.

It is a well-known fact that big data technologies are hard. Loading large amounts of data onto a horizontally scaled infrastructure and then analyzing it is a non-trivial exercise. Most companies do not have the skills to tackle their first big data project on their own. Taking Wikibon’s advice, it is better to use expert help to get started on big data.

Orzota is here to help. We have done work for several customers, helping them ramp up or scale their big data deployments. Contact us.

While Data Science, and especially Big Data Science, is still a huge buzzword, there has been a myth that companies should hire their own data scientists and pay them whatever they command. Stories such as Microsoft paying $650,000 for a data scientist quickly spread through the blogosphere. There were widespread reports that the U.S. has close to 200,000 unfilled data scientist jobs.

Let’s take a step back and rethink this notion that every company must have its own team of data scientists. Data Science is part science, part art, part experience. Although it requires rigorous math and statistics, there is never one “right” solution to a problem. Different data scientists may come up with completely different models for the same problem. Only the rigors of time and testing with different sets of data can tell what works and what doesn’t.

The Competition Model

It may be advantageous to have multiple scientists work on a predictive model instead of relying on a single in-house data scientist. This is precisely the reasoning behind the success of Kaggle. Kaggle competitions spur many people to try to solve a problem. The sponsoring company is the winner, as it has many different solutions to pick from.

The Outsourced Model

In many cases, especially for small to medium-sized companies, hiring their own full-time data scientists may not make sense. They are trying to solve a specific problem, and with no in-house expertise in managing big data and/or predictive analytics, it may be better to use outside consultants instead. The consultants can not only analyze business goals and craft the right solution, but also train in-house staff in the deployment and management of that solution.

So it’s not surprising that the analytics outsourcing market is growing at a rapid pace. Information Management reports that the analytic services market is estimated to reach $46.9 billion this year; IDC projects it will grow at 14.3% and reach $70.8 billion by 2016.

 

At Orzota, we know big data analytics can be hard. We have experience converting business goals to solutions and experts who can help architect the correct solution to meet your organization’s needs.

For more information, contact us at info@orzota.com