Orzota

Author: Shanti Subramanyam

(ARTICLE 1 in the Series)

The Rise of Big Data

Big Data technologies made it possible for enterprises to capture and integrate diverse sets of data. They were no longer constrained to the data warehouse for analytics and reporting. Big Data allowed the integration of third-party syndicated data sets and social media data such as tweets and blogs. In addition, it helped break down silos between the various divisions within the enterprise, democratizing data access and helping enterprises gain new insights from their data.

The enriched big data sets can be used not just to understand the past, but also to make predictions about the future – which customers are likely to churn, which customers or pieces of equipment are most likely to generate new claims, which products are most likely to succeed, and so on.

AI-Powered Applications

We are now in the next wave of deriving value from data using AI-powered applications. The big breakthrough for this wave is the ability to use AI-powered neural networks to solve a wide variety of problems, including autonomous vehicles, natural language understanding and image recognition. Translating these technological advancements into real business use cases will result in significant operational benefits – reducing cost and providing faster customer service while creating new business models and sources of revenue.

Let’s look at some of the use cases for AI in insurance.

Underwriting

Underwriting – processing applications for new insurance policies – is the first pillar of any type of insurance. Evaluating the risk and enrolling the client can be complicated, depending on the type, size, prior history and other components of the application. It involves communication among multiple parties – the client, agent and underwriter. This is traditionally a manual process, as it involves a review of many different types of documents from diverse carriers with no standardization that allows easy automation. Further, many carriers still receive paper documents that are faxed or scanned (or worse, sent via snail mail!).

AI-powered systems can help this step in multiple ways:

  1. Natural Language Processing (NLP) systems and chatbots can streamline communication between the parties
  2. AI-driven document extraction systems (DocuAI) can automate the processing of the various documents using AI and Big Data (a minimal sketch of this idea follows the list)
  3. Data from documents can then be analyzed by AI-powered analytics to help the underwriter assess risk
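
To make the document-extraction idea in item 2 concrete, here is a minimal, hypothetical sketch in Python. The field names, patterns and sample text are illustrative assumptions rather than any production DocuAI logic; a real system would layer NLP and deep learning on top of (or in place of) simple rules like these.

```python
import re

# Minimal sketch: pull a few common fields out of OCR'd application text.
# The field names and patterns below are illustrative assumptions, not a
# production extraction model.
FIELD_PATTERNS = {
    "policy_number": re.compile(r"Policy\s*(?:No\.?|Number)[:\s]+([A-Z0-9-]+)", re.I),
    "date_of_birth": re.compile(r"Date\s+of\s+Birth[:\s]+(\d{2}/\d{2}/\d{4})", re.I),
    "coverage_amount": re.compile(r"Coverage\s+Amount[:\s]+\$?([\d,]+)", re.I),
}

def extract_fields(ocr_text: str) -> dict:
    """Return whichever fields we can find; missing fields are simply absent."""
    found = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            found[name] = match.group(1)
    return found

if __name__ == "__main__":
    sample = "Policy Number: AB-12345\nDate of Birth: 04/12/1980\nCoverage Amount: $250,000"
    print(extract_fields(sample))
    # {'policy_number': 'AB-12345', 'date_of_birth': '04/12/1980', 'coverage_amount': '250,000'}
```

Even a rule-based baseline like this lets an underwriting workflow pre-populate fields for review instead of forcing someone to re-key them.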

Claims Processing

Claims processing forms the core of the business for insurance carriers. When a claim is processed in a timely manner, it improves customer satisfaction and retention. At the same time, processing has to minimize financial losses from fraud and other factors in order to maximize profitability. Most companies have focused their energies on improving the claims process using technology.


Many software applications already automate workflows ensuring timely processing and smooth communication with all parties involved. Mobile apps allow users to easily submit claims along with documentation such as photos of the incident, claim form, invoices, etc.

Yet major parts of the process remain heavily manual. Claims adjusters frequently have to go out into the field to make assessments. Even in the case of smaller claims, an adjuster may manually review documents and photos.

 

How can AI-powered systems help claims processing?

  1. Image recognition algorithms can help identify and automatically categorize various pieces of information in claim evidence photos, such as license plates of vehicles, insurance cards, various types of damage, etc. (see the sketch after this list)
  2. AI-driven document extraction systems (DocuAI) can automate the analysis and categorization of line items in medical records, body shop estimates, contractor reports, etc. Using NLP and deep learning techniques allows these systems to recognize a wide variety of content.
  3. Robotic Process Automation (RPA) can automate many parts of the processing workflow, in combination with (1) and (2) above
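
As a rough illustration of item 1, the sketch below builds an image classifier for claim-evidence photos by adapting a pretrained network. The category list is made up, the final layer is untrained for this task, and it assumes a recent PyTorch/torchvision install; this is a starting-point skeleton, not a claims-ready model.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical claim-photo categories; a real taxonomy would come from the
# carrier's own claims process.
CATEGORIES = ["license_plate", "insurance_card", "vehicle_damage", "property_damage", "other"]

def build_claim_photo_classifier(num_classes: int = len(CATEGORIES)) -> nn.Module:
    # Start from an ImageNet-pretrained backbone and replace the final layer
    # so it predicts our claim-evidence categories instead of ImageNet classes.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

if __name__ == "__main__":
    model = build_claim_photo_classifier()
    model.eval()
    dummy_batch = torch.randn(1, 3, 224, 224)   # stands in for a preprocessed photo
    with torch.no_grad():
        logits = model(dummy_batch)
    print(CATEGORIES[int(logits.argmax(dim=1))])  # untrained head, so the label is meaningless here
```

In practice, the classifier would be fine-tuned on labeled claim photos before its predictions could drive any automation.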

Fraud Detection

Fraud detection is usually a part of claims processing, to ensure that no opportunistic fraud has taken place. The biggest loss for insurance companies is due to fraud. Many larger carriers already use predictive analytics to help identify potential fraud in claims. These machine learning models use not just a carrier’s own data but also shared databases across companies to flag potential fraud.

AI-powered systems can take this a step further. They can use the vast amounts of accumulated data and images to detect more subtle instances of fraud as well as previously intractable ones. With the cost of running these models dropping dramatically, even small claims can be analyzed to detect patterns.
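
As one hedged example of what such a system might look like, the sketch below uses an isolation forest to flag claims whose feature combinations look unusual. The features and the synthetic data are invented purely for illustration; a real model would be trained on the carrier's own and shared claims history.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy feature matrix: [claim_amount, days_since_policy_start, prior_claims].
# In practice these would be engineered from the carrier's own and shared data.
rng = np.random.default_rng(42)
normal_claims = np.column_stack([
    rng.normal(2_000, 500, 500),     # typical claim amounts
    rng.uniform(30, 3_000, 500),     # policy age in days
    rng.poisson(1, 500),             # prior claim count
])
suspicious = np.array([[25_000, 3, 9]])  # large claim, brand-new policy, many priors

model = IsolationForest(contamination=0.01, random_state=0).fit(normal_claims)
print(model.predict(suspicious))          # [-1] -> flagged as an outlier
print(model.predict(normal_claims[:3]))   # mostly [1] -> considered normal
```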

Customer Service

Improving customer service is the goal of every organization. With Big Data and AI, it is possible to automate the analysis of customer service calls and emails, allowing customer service agents to proactively address complaints and issues.

AI-driven chatbots are now pervasive on websites and web portals. They provide an easy way of answering customers’ questions while reserving human interaction for more complex issues. Mobile apps that can answer spoken natural-language queries are now possible using technologies like Siri and Alexa, backed by the same knowledge base used by chatbots and customer service agents.
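
A toy sketch of the retrieval idea behind such chatbots is shown below: match the user's question against a knowledge base and hand off to a human when confidence is low. The FAQ entries and the similarity threshold are placeholders, not an actual production bot.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny, made-up knowledge base; a production chatbot would share the same
# knowledge base used by customer service agents.
FAQ = {
    "How do I file a claim?": "You can file a claim in the mobile app under Claims > New Claim.",
    "How do I update my address?": "Log in to the portal and edit your profile details.",
    "When is my premium due?": "Premiums are due on the first of each month.",
}

questions = list(FAQ.keys())
vectorizer = TfidfVectorizer().fit(questions)
question_vectors = vectorizer.transform(questions)

def answer(user_query: str, threshold: float = 0.2) -> str:
    scores = cosine_similarity(vectorizer.transform([user_query]), question_vectors)[0]
    best = scores.argmax()
    if scores[best] < threshold:
        return "Let me connect you with an agent."   # hand off complex issues to a human
    return FAQ[questions[best]]

print(answer("how can I file a new claim?"))
```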

New Business Models

With IoT enabling the gathering of fine-grained data (how many miles do I drive every day, what is the average trip length, how many hours is the property unoccupied), insurance companies are seizing the opportunity to come up with new ways of underwriting policies. AI-powered systems can provide better risk analysis for determining premiums, resulting in new personalized products. These new products can be offered at attractive premiums, driving new business.
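
As a purely illustrative sketch of usage-based pricing, the function below adjusts a base premium using two telematics-derived signals. The factors and weights are invented for the example; a real product would derive them from an AI-powered risk model trained on telematics data.

```python
def usage_based_premium(base_premium: float,
                        miles_per_day: float,
                        harsh_brakes_per_100_miles: float) -> float:
    """Illustrative only: scale a base premium by simple usage-derived factors.
    The weights below are made up; a real product would learn them from data."""
    mileage_factor = 0.8 + 0.01 * min(miles_per_day, 60)        # drive more, pay more
    behavior_factor = 1.0 + 0.05 * harsh_brakes_per_100_miles   # riskier driving, pay more
    return round(base_premium * mileage_factor * behavior_factor, 2)

print(usage_based_premium(100.0, miles_per_day=12, harsh_brakes_per_100_miles=0.5))  # -> 94.3
```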

 

I will be giving a talk titled “Anomaly Detection for Predictive Maintenance” at the Global Artificial Intelligence Conference in Seattle on April 27th 2018. If you are going to the conference, please do reach out.

Abstract

Detecting anomalies in sensor events is a requirement for a wide variety of use cases in the industrial IoT. Examples range from predicting failures of HVAC systems and elevators in property management to identifying potential signals of malfunction in aircraft engines in order to schedule preventive maintenance. When the number of sensors runs into the tens of thousands or more, as is common in large IoT installations, a scalable model for preventive maintenance is needed.

Unlike prediction models for customer churn, inventory forecasts, etc. that rely on multiple sources of data and a wide range of domain-specific parameters, it is possible to detect anomalies in many types of time-series data using statistical techniques alone.
In this session, we will discuss a step-by-step process for anomaly detection, with examples that provide quick insights for building preventive maintenance models.
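
As a small example of the statistical approach described in the abstract, the sketch below flags anomalies in a simulated sensor stream using a rolling z-score. The window size, threshold and simulated readings are arbitrary choices for illustration, not recommendations for any particular sensor type.

```python
import numpy as np
import pandas as pd

def rolling_zscore_anomalies(series: pd.Series, window: int = 60, threshold: float = 3.0) -> pd.Series:
    """Flag points that deviate more than `threshold` standard deviations
    from a trailing rolling mean. Purely statistical -- no labeled failures
    or domain-specific parameters required."""
    rolling = series.rolling(window=window, min_periods=window)
    zscores = (series - rolling.mean()) / rolling.std()
    return zscores.abs() > threshold

if __name__ == "__main__":
    # Simulated sensor readings with an injected spike standing in for real telemetry.
    rng = np.random.default_rng(0)
    readings = pd.Series(rng.normal(70.0, 0.5, 1_000))   # e.g. an HVAC temperature sensor
    readings.iloc[800] += 10                              # anomalous spike
    flags = rolling_zscore_anomalies(readings)
    print(flags[flags].index.tolist())                    # -> [800]
```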

 

I will be giving a talk titled “Time-series analysis in minutes” at the Global Data Science Conference in Santa Clara on April 2nd at 3:30 PM.


The focus of the talk will be on understanding why and how to analyze time-series data quickly and efficiently. You can read the full abstract here.

An interview given as part of this conference is also available at the conference website.

If you are going to the conference and would like to connect, I would be happy to meet with you.

There have been many articles written and talks given over the last several years about abandoning the Enterprise Data Warehouse (EDW) in favor of an Enterprise Data Lake, with some passionately promoting the idea and others just as passionately denying that it is achievable. In this article, I would like to take a more pragmatic approach and lay out a process that enterprises should consider for their data management architecture.

The focus is on data lakes for enterprises, referred to as Enterprise Data Lake to distinguish it from data lakes created by internet, ad-tech or other technology companies that have different types of data and access requirements.

The Enterprise Data Warehouse

The much reviled and beleaguered Data Warehouse has been the mainstay of enterprises for over 20 years, supporting business reports and dashboards and allowing analysts to understand how the business is functioning. Data Warehouses, when built right, provide robust security, audit and governance – which is critical, especially with the increasing cyber-hacks today.

Alas – many data warehouse projects are so complex that they are never finished! Further, the strict, hierarchical governance that many IT departments created around the warehouse causes a lot of frustration, as business analysts and researchers cannot explore the data freely.

The Hadoop Phenomenon

When Hadoop entered the mainstream, the big attraction for business analysts and data scientists was the ability to store and access data outside the restrictive bounds of IT! This raised the exciting possibility of finding new insights into business operations, optimizing spend and finding new revenue streams.

3 Requirements for the Enterprise Data Lake

James Dixon coined the term Data Lake in 2010 to mean data flowing from a single source and stored in its natural state. We have come some way from that definition; the most common definition of a Data Lake today is a repository for many different types and sources of data – structured or unstructured, internal or external – that facilitates different ways of accessing and analyzing the data. The Data Lake is built on Hadoop, with the data stored in HDFS across a cluster of systems.

The 3 requirements for the Enterprise Data Lake are:

  1. It must collect and store data from one or more sources in its original, raw form and optionally, its various processed forms.
  2. It must allow flexible access to the data from different applications; for example, structured access to tables and columns as well as unstructured access to files (see the sketch after this list).
  3. Entity and transaction data must have strong governance defined to prevent the lake from becoming a swamp.
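
The hypothetical PySpark sketch below illustrates requirements 1 and 2 side by side: the same lake holds raw data in its original form while supporting both structured (SQL) and unstructured (raw file) access. The paths, schema and zone layout are assumptions for the example, not a prescribed standard.

```python
from pyspark.sql import SparkSession

# Minimal sketch, assuming a Hive-enabled Spark cluster and illustrative HDFS
# paths; the zone layout, file formats and column names are assumptions.
spark = SparkSession.builder.appName("lake-access-sketch").enableHiveSupport().getOrCreate()

# 1. Raw zone: data landed in its original form (requirement 1).
raw_orders = spark.read.json("hdfs:///lake/raw/crm/orders/2018/03/")

# 2. Structured access for BI-style queries (requirement 2).
raw_orders.createOrReplaceTempView("orders")
daily_totals = spark.sql("SELECT order_date, SUM(amount) AS total FROM orders GROUP BY order_date")

# 2b. Unstructured access to the very same lake, e.g. free-form web logs.
web_logs = spark.read.text("hdfs:///lake/raw/weblogs/2018/03/")

daily_totals.show()
print(web_logs.count())
```
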
Enterprise Data Lake Architecture

The diagram below shows an Enterprise Data Lake that ingests data from many typical systems such as CRM, ERP and other transactional systems. In addition, it is fed unstructured data from web logs, social media, IoT devices and third-party sites (such as DMPs and D&B), creating a rich data repository. This data ecosystem can now support combining multiple sources of data for more accurate analytics and never-before-possible insights into business operations.

With technologies such as BigForce SNAP, it is possible to run existing enterprise Business Intelligence (BI) tools as well as perform exploratory analysis with visualization tools such as Tableau.

Enterprise Data Lake Governance

More importantly, the Hadoop eco-system now supports data governance through technologies like Ranger, Knox and Sentry. In combination with Kerberos and enterprise identity management systems such as Active Directory (AD) or other LDAP frameworks, it is possible to implement strong security and governance rules. See “Implementing Hadoop Security” for details.

The Modern Enterprise Data Architecture

But what if you already have an existing EDW with hundreds of applications, some of which use complex analytics functions? How best can you leverage the EDW while also moving to a modern data architecture that allows new data sources to be integrated and empowers your data scientists to integrate, enrich and analyze lots of data without the restrictions of the EDW?

A happy compromise between the data lake and data warehouse does exist and data architects and businesses have realized that it IS possible to build on the strengths of each system.

In this architecture, the data lake serves as the repository for all raw data, ingested from all the relevant data sources of an organization. Optionally, the data lake can also store cleansed and integrated data, which is then fed into the data warehouse. This way, newer BI applications can be built directly on the enterprise data lake while existing applications continue to run on the EDW.
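
A hedged sketch of this dual flow is shown below: raw data is cleansed in the lake, the curated copy is kept in the lake for new workloads, and the same data is pushed into the existing EDW so current reports keep working. The paths, table names and JDBC endpoint are placeholders, not references to any specific environment.

```python
from pyspark.sql import SparkSession, functions as F

# Sketch of the "lake feeds the warehouse" flow; all names are illustrative.
spark = SparkSession.builder.appName("lake-to-edw").getOrCreate()

raw = spark.read.json("hdfs:///lake/raw/crm/customers/")

# Cleansing/integration step kept trivially simple for illustration.
cleansed = (raw
            .dropDuplicates(["customer_id"])
            .withColumn("email", F.lower(F.col("email")))
            .filter(F.col("customer_id").isNotNull()))

# 1. Keep the cleansed copy in the lake for new BI and data science workloads.
cleansed.write.mode("overwrite").parquet("hdfs:///lake/curated/crm/customers/")

# 2. Feed the same cleansed data into the existing EDW so current reports keep working.
(cleansed.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://edw-host:5432/warehouse")   # placeholder EDW endpoint
    .option("dbtable", "staging.customers")
    .option("user", "etl_user")
    .option("password", "***")
    .mode("overwrite")
    .save())
```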

Data Governance in the Enterprise Data Lake


Data governance policies for enterprise data in the EDW should, in most cases, also apply to the same data within the Enterprise Data Lake. Otherwise, security holes and data inconsistencies may develop between the two systems. If careful consideration is not given to governance, the data lake will turn into a data swamp!

However, since the data lake consists of all the raw data from operational systems as well as new data sources, it is possible to now provide data scientists and other analysts access to these data sets for new exploratory analytics.

Conclusion

Architecting a modern data architecture requires a thorough understanding of the requirements, existing applications, and the future needs and goals of the enterprise. Especially important to consider are master data and metadata management, governance and security, as well as choosing the right technologies.

At Orzota, we have built data lakes for a variety of businesses and have a methodology in place to ensure success. Contact us for more information.

Move to the Cloud – Benefits Of Cloud Computing

NOTE: This is a guest post by Danish Wadhwa

Cloud Computing has changed the way we use software – whether for personal or business use. The process of downloading, installing, configuring and maintaining different types of software is eliminated with a move to the Cloud, saving businesses time and resources. The Cloud has brought us to an era of increased responsiveness and efficiency.

Over the past few years, Cloud Computing has taken over and has become an essential part of our everyday life. Whether it is for updating a status on Facebook or checking account balances on a smartphone, we use the Cloud. The Cloud is well suited to handling various processes in an organization: with it, individuals and businesses can plan, strategize and organize tasks within minutes. The Cloud can also keep information safe, while providing access from anywhere at any time.

Here are some of the top reasons for businesses to move to the Cloud:

Flexibility

Cloud-based services are ideal for businesses with fluctuating demands. The Cloud’s capacity can be increased or decreased according to specific requirements. Such flexibility gives businesses a real advantage over competitors, and this operational flexibility is one of the main reasons for moving to the Cloud.

Fast Data Recovery

The Cloud keeps data protected, while offering backup and recovery options in case of an emergency. A Cloud-based backup and recovery solution saves time and avoids large upfront investments. Further, by backing up to Cloud servers in different geographical regions, a robust backup strategy can provide insurance against natural disasters, power outages, etc.

Automatic Software Updates

Cloud Computing servers are usually off-premises, and cloud providers make sure that all issues are resolved without impacting end users, who can utilize the Cloud’s services without interruption. Systems are maintained and kept up to date, with regular software updates applied automatically. This leaves organizations free to focus on matters more pertinent to their business, rather than on software and hardware infrastructure.

Increased Collaboration

Cloud Computing simplifies various everyday operations and makes work easier: it provides access to data and the option to edit and share documents with different team members anytime, anywhere. One example is Asana, a cloud-based project management tool that helps assign tasks to different team members, edit lists and keep track of progress, thus improving collaboration and coordination.

Security

Safety becomes an important issue when you decide to store all of your data in the Cloud, and this is where Cloud Computing’s high-end security measures come into play. Although many enterprises once pointed to security concerns as their number one reason for not moving to the cloud, that myth has been debunked: today, the Cloud can be more secure than a private data center. Your data is encrypted to protect it against outages and disasters of any sort, both while it is in transit and while it rests on the cloud servers. Not just that: customers can also choose to control their own encryption keys if they wish.

Scalability

The “Pay as you Go” model allows you to pay according to your usage, helping small startups figure out what they need and expand as they grow. It also makes it possible for businesses to commence new ventures regardless of available capital: the initial investment can be considerably low, allowing a company to gradually increase usage as it grows. The Cloud gives organizations access to enterprise-class technology, along with an opportunity to learn and understand the market and plan how to beat the competition.

For mid-to-large enterprises, this one point can be a huge time and money saver, as on-premise infrastructure can take a long time to provision and needs to be planned well in advance of the need for scale.

Eco-friendly

Apart from all the benefits we have discussed above, the Cloud is eco-friendly too. With the ability to change the server size according to usage, organizations only use the energy required at the moment, without leaving giant carbon footprints.

Why NOT move to the Cloud?

As technology leaders, we are challenged to make decisions that impact the organization’s growth. Our primary goals are to deliver on time and resolve problems efficiently, while staying within budget. The Cloud makes it possible to achieve these goals, with a proper plan and process in place. Moving to the Cloud is one of those changes an organization can make. Get the DevOps Certification to benchmark your skills in Cloud Computing and understand its benefits.

 


During the summer of 2016, we had a high school student intern with us. He knew some Java from the AP Computer Science course but was very interested in using machine learning to predict health outcomes. We were skeptical at first – the prospect of teaching a teenager (even a very smart one) the fundamentals of ML along with a new programming language, and then having him apply it to a real data set, all in the span of a summer internship, seemed like a Herculean task. But seeing how keen he was, we decided to take him on.

Sushant Thyagaraj (that was his name) proved us wrong! He learned R within the first week, following that quickly with various ML algorithms through tutorials and sample exercises. He researched various publicly available data sets that might be suitable for his work, went through several iterations with a couple of the data sets before finally settling on predicting survival for lung cancer patients after thoracic surgery.

He continued fine-tuning his results and wrote a full paper detailing his work (I should add that this last part was done after school began). We are pleased to present his paper: Using Machine Learning to Predict the Post-Operative Life Expectancy of Lung Cancer Patients.
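
For readers who want a feel for this kind of model, here is a small scikit-learn sketch on the same general problem. It is not Sushant's method or results; it assumes the UCI Thoracic Surgery data set has been exported to a CSV with a binary (0/1) Risk1Yr outcome column, and the preprocessing is deliberately minimal.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumes a CSV export of the UCI Thoracic Surgery data with a 0/1 "Risk1Yr"
# column (death within one year of surgery); the file name is a placeholder.
df = pd.read_csv("thoracic_surgery.csv")
X = pd.get_dummies(df.drop(columns=["Risk1Yr"]))   # one-hot encode categorical features
y = df["Risk1Yr"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# class_weight="balanced" because deaths are (thankfully) the minority class.
model = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)

probs = model.predict_proba(X_test)[:, 1]
print("ROC AUC:", round(roc_auc_score(y_test, probs), 3))
```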

 

We are pleased to announce a free trial of the Orzota BigForce Social solution. This solution, built on top of the Orzota BigForce platform, provides the capability to analyze text streams from Twitter, media sites, blogs, etc. With built-in search, the solution gives data scientists a means of exploring social media data, with a focus on sentiment analysis. The sentiment analysis free trial will let you quickly determine whether such a solution can meet your needs.

Unlike many other sentiment analysis solutions, the Orzota BigForce Social solution can be customized to meet your needs. So sign up for the trial and reach out to us to understand how we can make this work for you!
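
To see what baseline, off-the-shelf sentiment scoring looks like (as opposed to a customized pipeline like BigForce Social), here is a minimal sketch using NLTK's VADER analyzer on a couple of made-up tweets.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Minimal illustration of tweet-level sentiment scoring with NLTK's VADER;
# this is not the BigForce Social implementation, just a generic baseline.
nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

tweets = [
    "Love the new claims app, so easy to use!",
    "Still waiting three weeks for my claim to be processed. Unacceptable.",
]
for tweet in tweets:
    scores = analyzer.polarity_scores(tweet)
    label = ("positive" if scores["compound"] > 0.05
             else "negative" if scores["compound"] < -0.05
             else "neutral")
    print(label, scores["compound"], tweet)
```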

 

We recently worked with a leading hi-tech manufacturing company to design and implement a brand new, scalable and efficient workforce analytics solution targeted at the mobile workforce.

The solution is designed to raise workers’ confidence and minimize the effort required to train them. It also improved manpower utilization by making inventory adjustments more accurate while fulfilling orders, and it reduced the learning curve for workers, resulting in a substantial reduction in training hours.

Workforce Analytics Solution Overview

The Workforce Analytics solution was built on a common data analytics platform leveraging Hortonworks HDP 2.4 and uses the following technologies: Kafka, Storm, HBase, HDFS, Hive, Knox, Ranger, Spark and Oozie.

The platform collects real time data from the application on mobile devices, stores it, and runs analytics with better performance and lower latency compared to their prior legacy system.

The HDP components at a glance:
Workforce Analytics Solution HDP Components

Workforce Analytics Architecture

The operational real-time data is collected using Kafka and ingested into HDFS and HBase in parallel using Storm (see diagram below). HBase acts as the primary data store for the analytics application. The data in HDFS is encrypted and reserved for other applications. Based on the business logic, the data stored in HBase is processed using Spark on a daily, weekly, monthly and yearly basis, and stored back into HBase as a feed for Spark Analytics (Spark SQL). Spark Analytics is used to run jobs that generate specific insights. The output from Spark Analytics is stored in Hive as a temporary table. The Hive Thrift Server is used to execute queries against Hive and retrieve the results for visualization and exploration using Tableau. Custom dashboards were also built to help business users track higher-level metrics.
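
The sketch below gives a feel for the daily Spark Analytics step in this kind of pipeline. To keep it self-contained it reads events from a placeholder HDFS path rather than through a Spark-HBase connector, and the column, path and table names are illustrative assumptions rather than the customer's actual schema.

```python
from pyspark.sql import SparkSession, functions as F

# Sketch of a daily Spark Analytics job. In the real pipeline the events come
# from HBase; here we read a placeholder path to keep the example runnable.
spark = (SparkSession.builder
         .appName("workforce-daily-metrics")
         .enableHiveSupport()
         .getOrCreate())

events = spark.read.parquet("hdfs:///data/workforce/events/")  # worker_id, task_id, event_time, duration_secs

daily_metrics = (events
                 .withColumn("event_date", F.to_date("event_time"))
                 .groupBy("event_date", "worker_id")
                 .agg(F.countDistinct("task_id").alias("tasks_completed"),
                      F.sum("duration_secs").alias("total_work_secs")))

# Land the results in Hive so Tableau (via the Hive Thrift Server) and the
# custom dashboards can query them; the database name is a placeholder.
daily_metrics.write.mode("overwrite").saveAsTable("analytics.workforce_daily_metrics")
```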

Workforce Analytics - Architecture

To address security requirements, Apache Knox and Apache Ranger were used for perimeter security and access control, respectively. Both are included as a part of HDP 2.4 and are configured in the Access Node.

Workforce Analytics Physical Architecture

The figure below shows the physical layout of the services on the various servers used. The architecture comprises Edge Nodes, Master Nodes and Slave Nodes. Each set of nodes runs a variety of services.

Workforce Analytics Physical Architecture

Issues and Solutions

While implementing this solution, we ran into a variety of issues. We outline some of them here in the hope that they may help others who are designing similar architectures with the Apache Hadoop or Hortonworks HDP eco-system of components. Table creation, user permissions and workflows were the common problem areas.

HBase Table Creation

We ran into permission issues with HBase table creation.

Solution: In Apache Ranger, update the HBase policy to grant the defined user the appropriate read, write and create permissions.

Connection to Hive Thrift Server

Another issue we ran into involved connections to the Hive Thrift Server for a particular user, “ABC”.

Solution: Ensure that the following properties are added to $HADOOP_CONF/core-site.xml:

hadoop.proxyuser.ABC.groups=*

hadoop.proxyuser.ABC.hosts=*

Oozie Workflow Job Submission

Permission errors continued to plague the project while creating workflows in Oozie.

Solution: The following needs to exist within the <shell xmlns="uri:oozie:shell-action:0.2"> action of the corresponding job definition in workflow.xml:

<env-var>HADOOP_USER_NAME=ABC</env-var>

Oozie Workflow Job Stuck in PREP State

When re-running an Oozie workflow job after a period of time, it went into the PREP state and did not execute. When we tried to kill the job via the CLI, the Oozie log showed the job as successfully killed:

USER [test] GROUP[-] TOKEN[-] APP[-] JOB[ABCDEF] ACTION[] User test killed the WF job ABCEDEF-oozie-oozi-W

However, in the Oozie UI, the job is still shown to be in PREP state.

Solution: Further research showed that the Oozie database at the backend (Derby by default) was corrupted, and was not representing the correct state of the jobs.

We decided, for longer term stability, to migrate from Derby to MySQL as the backend database for Oozie. After this migration, we did not run into this issue again.

Conclusion

Big data projects can grow and evolve rapidly. It’s important to realize that the solution chosen must offer the flexibility to scale up or down to meet business needs. Today, in addition to commercial platform distributions such as Hortonworks and Cloudera, higher level tools and applications simplify the process of developing big data applications. However, as seen by some of the issues we describe above, expertise in the underlying technologies is still crucial for timely completion of projects. Orzota can help. Please contact us.

Surveys, research and retailers’ own measurements show that personalization works. According to Internet Retailer, 80% of consumers say they like it when brands’ e-mails recommend products based on their previous purchases.

Personalization Statistics

Personalization for Retail is Hard

The incessant media attention on the personalization and customer intelligence of companies such as Amazon and Google hides the fact that 80% of marketers fail at personalization. Knowing who each consumer is and what his or her likes and dislikes are is a major challenge. There are two major problems: data collection and data integration.

For retailers, the challenge is even more significant. It is not just about omnichannel integration across devices, but about integrating online and in-store traffic and purchases as well. For instance, how can you know whether a user’s purchase in the store is the result of an email marketing campaign? How can you even tell whether a user was driven by the campaign to the store if she does not make a purchase?

Additionally, understanding what metrics are important, how to define and measure them, let alone how to make them actionable are all difficult problems that many retailers grapple with. Traditional analytics fail to provide the necessary data, metrics and insights needed.

A Plethora of Solutions

There are dozens of tools and solutions that solve certain parts of the problem. Some are focused on conversion metrics, others at marketing campaigns, still others at building a 360-degree view of a customer (or so they claim). Some address e-commerce only, others tackle omnichannel traffic. Some target beacon solutions in retail stores, others claim to provide predictions on which users will buy and how much they will spend.

It is easy to get lost in the hype and marketing statements and lose track of the real problem to be solved. At Orzota, our focus is always the customer and their pain points. Our hybrid approach – customized solutions, expertly managed – aims to provide exactly what the customer wants: no more, no less.

Contact us for more information on our retail solutions.

 

Last week, we announced the acquisition of Avarida, a Houston-based big data solutions provider. It has been a lot of work (and a lot of fun) doing the integration and figuring out our combined strategy as we move into 2016.

Many people have asked what the synergies between Orzota and Avarida are and how our strategy and offerings will change. As mentioned in the Press Release, Avarida’s focus on building big data solutions to solve customer problems is a great complement to Orzota’s technology-focused expertise. Integrating some of these solutions with the Orzota Big Data Management Platform will provide our customers with end-to-end big data solutions.

Secondly, the Avarida team brings great enterprise-class experience on all fronts. Founder and CEO Naren Gokul has extensive experience working in and with large enterprises, and his sales and solutions engineering team nicely complements our own.

Finally, Avarida’s locations in Houston and Chicago ease our access to customers in the central US and on the East Coast. We are already seeing the benefits of this expanded team’s reach, with increased interest from prospective customers.

Over the coming weeks and months, we will be launching new solutions and offerings, so please watch this space!