(ARTICLE 2 in the Series)

Automation in Pet Insurance

If you think about it, Pet Insurance should not be very different from human medical insurance. Pets need vaccinations and wellness checks, and can get sick or hurt just as humans do. They can also suffer from serious illnesses like cancer and diabetes, whose treatments can get expensive. Yet, the market is still small, as most pet owners simply pay out of pocket. Consequently, standardization and automation of processes have not yet fully taken hold.

Sometimes, being behind the technology curve can be a good thing. One can skip a cycle and leapfrog into the latest wave (as happened with cellular phones in Africa). Larger companies with some level of legacy automation will take longer to adopt the latest technologies. The pet insurance business can leapfrog this cycle instead of playing catch-up with the adoption of Machine Learning and AI.

Claims Processing

In the world of pets and vets, a pet owner directly submits a claim against incurred medical expenses. The claims may be submitted via mobile apps, web apps, email, fax, etc. Most insurance companies are trying to encourage digital submissions using their online tools so as to avoid paper handling. The claim will include a standard insurance claim form and various receipts and medical reports sent by the doctor and hospital.


The claim is then processed by a claims adjuster, who often has to manually re-enter information from the various documents received into the claims application. The adjuster then translates the line items into category and sub-category codes and validates them against the customer’s policy to determine what is covered, co-pay amounts, etc. Larger companies may split the task, using lower-paid contractors for data entry while relying on the adjuster’s knowledge and experience for the categorization and validation of the claim. Different mechanisms may be used for fraud detection (is only one dog insured while multiple are claimed?).

Businesses also need access to analytics – these are typically derived from the claims database in terms of count of claims processed, amounts claimed vs paid, etc.

How can AI Help?

Artificial Intelligence is the ability of a system to learn and automatically improve itself over time. This goes beyond periodically building Machine Learning models: the entire process of learning and tuning the algorithms is automated.

With the use of Machine Learning and Deep Learning technologies (and AI), a significant portion of the claims process can be automated while providing additional new insights. Let’s look at some of them.

1.   Invoice Automation

One complexity in automating document ingestion is that invoices and medical records can vary widely between doctors and service providers. Handwritten comments and diagnoses complicate the situation further. OCR alone is insufficient to extract the necessary information from these documents. Machine Learning can improve the extraction process by recognizing the various fields in claim and invoice forms, such as the provider name and address.

For embedded handwritten text, images, etc., deep learning techniques based on neural networks are required. With AI, the system can improve itself over time, requiring less and less manual validation.
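To make this concrete, here is a minimal sketch of such an extraction step in Python, assuming the open-source Tesseract OCR engine (via the pytesseract package) and a few hypothetical regular-expression heuristics; a production system would replace these heuristics with a trained field-extraction model.

# Minimal sketch: OCR an invoice image and pull out candidate fields.
# Assumes Tesseract is installed; the field heuristics below are
# illustrative placeholders, not a real extraction model.
import re
import pytesseract
from PIL import Image

def first_match(pattern, text):
    m = re.search(pattern, text)
    return m.group(0) if m else None

def extract_invoice_fields(image_path):
    text = pytesseract.image_to_string(Image.open(image_path))
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return {
        # Naive heuristic: the first non-empty line is often the provider name.
        "provider_name": lines[0] if lines else None,
        "invoice_date": first_match(r"\d{1,2}/\d{1,2}/\d{2,4}", text),
        "total_amount": first_match(r"\$\s?\d[\d,]*\.\d{2}", text),
    }

print(extract_invoice_fields("sample_invoice.png"))   # hypothetical input file

A learned model replaces these brittle heuristics by training on labeled examples of provider blocks, dates and totals, which is what allows the system to keep improving as more claims flow through.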

2.   Medical Records Analysis

It is a well-known fact that not only is a doctor’s handwriting illegible but that no two doctors will describe an ailment in the same way!


With a lack of standardization, it is left to humans to translate the description of the services performed into codes that the claims application and database can understand and process.

This task can be handled by Machine Learning as well. Learning from existing medical records, it can predict and categorize new medical records correctly, and in many cases do so better than a human! This is especially useful when dealing with complex or rare diagnoses and medicines.

Think about it this way – you buy an exotic vegetable (say Chayote Squash) at a grocery store and bring it to the checkout counter. The clerk doesn’t know what it is or even if he does, has to manually look up the code in the book since it’s not an often-purchased item and so is not retained in memory. However, an image recognition application would instantly recognize it and pull up the relevant code.
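In the same spirit, here is a minimal sketch of a line-item categorizer for veterinary claims, using scikit-learn. The labeled descriptions are hypothetical stand-ins for a large corpus of historical, already-coded claims.

# Train a simple text classifier that maps free-form line-item
# descriptions to claim category codes. Training data is hypothetical;
# a real system would learn from thousands of adjudicated claims.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

descriptions = [
    "rabies vaccination booster",
    "dental cleaning under anesthesia",
    "radiograph left hind limb",
    "insulin injection diabetes management",
]
category_codes = ["PREVENTIVE", "DENTAL", "DIAGNOSTIC", "CHRONIC"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(descriptions, category_codes)

# Categorize a new, previously unseen line item.
print(model.predict(["distemper vaccine annual"])[0])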

3.   Claims Analytics

When all the data flows through an AI application, it can provide advanced analytics resulting in new insights.

4. Fraud Detection

Fraud is a huge issue in many forms of insurance and can significantly reduce profits. Today, financial companies have built sophisticated fraud detection algorithms using big data and AI techniques. The insurance industry can do the same. As more claims get processed, big data and machine learning can help predict potential fraud.
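One common building block here is anomaly detection over historical claims. The sketch below uses scikit-learn's IsolationForest; the features and data are hypothetical placeholders for the kind of signals a carrier would actually track.

# Flag claims whose patterns look unusual compared to historical claims.
import numpy as np
from sklearn.ensemble import IsolationForest

# Each row: [claim_amount, claims_in_last_90_days, days_since_policy_start]
historical_claims = np.array([
    [120.0, 1, 400],
    [340.0, 2, 820],
    [95.0,  1, 150],
    [280.0, 1, 600],
    [150.0, 2, 365],
])

detector = IsolationForest(contamination=0.05, random_state=42)
detector.fit(historical_claims)

# A large amount on a brand-new policy with many recent claims.
new_claim = np.array([[4800.0, 6, 12]])
if detector.predict(new_claim)[0] == -1:
    print("Route claim for manual fraud review")

Flagged claims would still go to a human adjuster; the model only prioritizes which claims deserve scrutiny.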

5. Marketing Analytics

Individual customer, patient and provider behavior can be analyzed as well as an aggregated understanding of regions, states, demographics, etc. Enriching the data with external data sets can provide new insights for marketing and sales organizations. Targeted ad campaigns can reduce expenses while improving customer acquisition.

6.   Customer Service

Improving customer service is a goal of every organization. Pet insurance companies are generally ahead of the game, providing easy-to-use web portals and mobile apps. Being able to file claims entirely through a mobile app with minimal documentation enhances the customer experience.

With Big Data and AI, it is possible to automate the analysis of customer service calls and emails allowing customer service agents to proactively address complaints and issues.

 

Trying to find a database that can both store and retrieve your time-series data? There are many databases available today for handling time-series data. We started out using OpenTSDB but then switched to TimescaleDB. In this article, we examine the characteristics of time-series data and the requirements for handling it.

We initially tried OpenTSDB but found it clumsy to combine the metadata (stored in PostgreSQL) with the time-series data. To overcome this burden, we switched to TimescaleDB. We will also compare TimescaleDB against OpenTSDB on key technical features and architecture.

Time-series Data:

Let’s have a look at the basic definitions of time-series data:

  • A sequence of discrete data points taken at successive, equally spaced points in time and indexed in time order.
  • Data that collectively represents how a system, process, or behavior changes over time.

Types of Time-series Data:

  • Seasonal effect (Seasonal Variation or Seasonal Fluctuations)
  • Other Cyclic Changes (Cyclical Variation or Cyclic Fluctuations)
  • Trend (Secular Trend or Long Term Variation)
  • Other Irregular Variation (Irregular Fluctuations)

Some examples are application monitoring data, weather data, stock market data, etc.

In today’s digital environment, a great deal of data is gathered by various devices and applications: current location, browsing history, personal fitness trackers, and so on. In such scenarios, it is important to store this data in an effective time-series database so it can be used for future predictions and forecasts.


Fig 1: Example of time-series data points

TimescaleDB Overview:

TimescaleDB is the first time-series database specifically designed for scale, ease of use, and complex queries. Implemented as an extension of PostgreSQL, it provides the following:

  • Automatic partitioning across time and space (partitioning key)
  • Full SQL support
  • Easy to use; like a relational database
  • Fast and parallel ingestion


Source: https://blog.timescale.com/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/

Fig 2: PostgreSQL and TimescaleDB – A Comparison of Insert Rates

As we can see in the figure above (Fig 2), PostgreSQL insert rates drop as the dataset size increases, while TimescaleDB maintains a steady insert rate irrespective of dataset size. Thus, the performance of an application that sits on top of TimescaleDB improves greatly.

TimescaleDB executes queries against a hypertable, which is composed of many chunks partitioned by time and space but looks and behaves like a regular table.

“ Time-series data is largely immutable. Writes primarily occur as new appends to recent time intervals, not as updates to existing rows. Both read and writes have a natural partitioning across both time and space.”
-TimeScale DB Developers

Data Handling in TimescaleDB:

  • Hypertable abstraction: The table is abstracted as a hypertable composed of many right-sized chunks, partitioned by time and space (a sketch follows this list).
  • Optimized query execution: During query execution, aggressive constraint exclusion ensures that only the chunks needed to satisfy the query are scanned.
  • Data model: TimescaleDB follows the wide-table model, which makes it easy to store and correlate related measurements in a single row.
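A minimal sketch of creating and querying a hypertable from Python with psycopg2 follows; the connection settings and the metrics schema are hypothetical.

# Create a hypertable and run a time-bucketed aggregate over it.
# Connection parameters and schema are illustrative only.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="tsdb",
                        user="postgres", password="secret")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;")
cur.execute("""
    CREATE TABLE metrics (
        time        TIMESTAMPTZ       NOT NULL,
        device_id   TEXT              NOT NULL,
        temperature DOUBLE PRECISION
    );
""")
# Turn the plain PostgreSQL table into a hypertable partitioned by time.
cur.execute("SELECT create_hypertable('metrics', 'time');")
cur.execute("INSERT INTO metrics VALUES (now(), 'sensor-1', 21.4);")
conn.commit()

# time_bucket() aggregates readings into hourly buckets per device.
cur.execute("""
    SELECT time_bucket('1 hour', time) AS bucket, device_id, avg(temperature)
    FROM metrics
    GROUP BY bucket, device_id
    ORDER BY bucket;
""")
print(cur.fetchall())
cur.close()
conn.close()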

Benchmarking TimescaleDB:

Our use case required running complex aggregation queries while also supporting simultaneous ingestion of incoming time-series data. To ensure that the chosen platform can handle this load, we did some benchmarking.

An ingestion application kept pumping data into the database while we ran three types of queries that access a varying number of rows. Each query was run several times to ensure stable results; execution times varied only slightly across runs because a query hits only the chunks that satisfy its filtering conditions. The results are shown in the table below, and a sketch of the timing harness follows it. We will describe the individual queries in a future blog post.

Results of TimescaleDB benchmark:

Records Retrieved    Query 1 Execution Time (s)    Query 2 Execution Time (s)    Query 3 Execution Time (s)
1,098,212            19.35                         20.60                         18.29
1,431,323            27.21                         25.55                         27.42
1,790,467            34.49                         35.19                         34.32
2,152,595            41.94                         43.10                         41.52
2,498,503            46.94                         48.06                         46.54
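For reference, a timing run of this kind can be scripted along the following lines; the query text and connection settings are placeholders, not the actual benchmark queries.

# Run a query repeatedly against TimescaleDB and record execution times.
import time
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="tsdb",
                        user="postgres", password="secret")
cur = conn.cursor()

query = """
    SELECT device_id, avg(temperature)
    FROM metrics
    WHERE time > now() - interval '7 days'
    GROUP BY device_id;
"""

timings = []
for _ in range(5):                      # several runs to ensure stable results
    start = time.perf_counter()
    cur.execute(query)
    cur.fetchall()
    timings.append(time.perf_counter() - start)

print("execution times (s):", [round(t, 2) for t in timings])
cur.close()
conn.close()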

 

TimescaleDB (v0.4.2) vs OpenTSDB (v2.3.0): feature and architecture comparison

Conclusion:

Based on our use case of handling and storing time-series data for an IoT implementation in a Big Data environment, TimescaleDB compares favorably on features such as partitioning, data retention, access methods, and compatibility with scripting for automation. Moreover, its ease of access and simple data retrieval make it more convenient than the other time-series databases we have used.

(ARTICLE 1 in the Series)

The Rise of Big Data

Big Data technologies made it possible for enterprises to capture and integrate diverse sets of data. They were no longer constrained to the data warehouse for analytics and reporting. Big Data allowed the integration of third-party syndicated data sets and social media data such as tweets and blogs. In addition, it helped break down silos between the various divisions within the enterprise, democratizing data access and helping enterprises gain new insights from their data.

The enriched big data sets can be used not just to understand the past, but also to make predictions about the future – which customers are likely to churn, which customers/equipment are most likely to generate new claims, which products are the most likely to succeed, etc.

AI-powered

We are now in the next wave of deriving value from data using AI-powered applications. The big breakthrough for this wave is the ability to use AI-powered neural networks to solve a wide variety of problems including autonomous driving vehicles, natural language understanding, image recognition, etc. Translating these technological advancements to real business use cases will result in significant operational benefits – reducing cost, providing faster customer service while creating new business models and sources of revenue.

Let’s look at some of the use cases for AI in insurance.

Underwriting

Underwriting, or new application processing, is the first pillar of any type of insurance. The process can be complicated depending on the type, size, prior history and other components of the application needed to evaluate the risk and enroll the client. It involves communication among multiple parties – the client, agent, and underwriter. This is traditionally a manual process, as it involves a review of many different types of documents from diverse carriers with no standardization that would allow easy automation. Further, many carriers still receive paper documents that are faxed or scanned (or worse, sent via snail mail!)

AI-powered systems can help this step in multiple ways:

  1. Natural Language Processing (NLP) systems and chatbots can streamline communication between the parties
  2. AI-driven document extraction systems (Docu-AI) can automate the processing of the various documents using AI and Big Data
  3. Data from documents can then be analyzed by AI-powered analytics to help the underwriter assess risk

Claims Processing

Claims processing forms the core of the business for insurance carriers. When a claim is processed in a timely manner, it improves customer satisfaction and retention. Simultaneously, the processing has to minimize financial loss due to fraud or other factors to maximize profitability. Most companies have focused their energies on improving the claims process using technology.


Many software applications already automate workflows ensuring timely processing and smooth communication with all parties involved. Mobile apps allow users to easily submit claims along with documentation such as photos of the incident, claim form, invoices, etc.

Yet, many parts of the process remain heavily manual. Claims adjusters frequently have to go out into the field to make assessments. Even in the case of smaller claims, the adjuster may manually review documents and photos.

 

How can AI-powered systems help claims processing?

  1. Image recognition algorithms can help identify and automatically categorize various pieces of information in claim evidence photos such as license plates of vehicles, insurance cards, various types of damages, etc. (see the sketch following this list)
  2. AI-driven document extraction systems (DocuAI) can automate analysis and categorization of line items in medical records, body shop estimates, contractor reports, etc. Using NLP and Deep learning techniques allows these systems to recognize a wide variety of content.
  3. Robotic Process Automation (RPA) can automate many parts of the processing workflow along with 1) and 2) above
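As an illustration of item 1, here is a minimal inference sketch using a generic pretrained image classifier from torchvision (a recent version that supports the weights argument). A real claims system would fine-tune such a model on its own labeled claim photos, so the model, labels and image path below are placeholders.

# Classify a claim photo with a pretrained CNN. The ImageNet model is a
# stand-in; production use requires fine-tuning on labeled claim imagery
# (damage categories, document types, license plates, and so on).
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(weights="IMAGENET1K_V1")
model.eval()

image = Image.open("claim_photo.jpg").convert("RGB")   # hypothetical input file
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    logits = model(batch)
    predicted_class = logits.argmax(dim=1).item()

print("predicted class index:", predicted_class)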

Fraud Detection

Fraud detection is usually part of claims processing, ensuring that no opportunistic fraud has taken place. The biggest loss for insurance companies is due to fraud. Many larger carriers already use predictive analytics to help identify potential fraud in claims. These Machine Learning models use not just a carrier’s own data but also shared databases across companies to flag potential fraud.

AI-powered systems can take this a step further. They can use the vast amounts of accumulated data and images to detect more subtle instances of fraud as well as previously intractable ones. With the cost of running these models dropping dramatically, even small claims can be analyzed to detect patterns.

Customer Service

Improving customer service is the goal of every organization. With Big Data and AI, it is possible to automate the analysis of customer service calls and emails, allowing customer service agents to proactively address complaints and issues.

AI-driven chatbots are now pervasive on websites and web portals. They provide an easy way of answering customers’ questions while reserving human interaction to handle more complex issues. Mobile apps with the ability to answer spoken natural language queries are now possible using technologies like Siri, Alexa and the same knowledge base used by chatbots and customer service agents.

New Business Models

With IoT enabling the gathering of fine-grained data (how many miles do I drive every day, what is the average trip, how many hours is the property unoccupied), insurance companies are seizing the opportunity to come up with new ways of underwriting policies. AI-powered systems can provide better risk analysis for determining premiums, resulting in new personalized products. These new products can be offered at attractive premiums, driving new business.

 

There have been many articles written and talks given over the last several years on abandoning the Enterprise Data Warehouse (EDW) in favor of an Enterprise Data Lake, with some passionately promoting the idea and others just as passionately denying that it is achievable. In this article, I would like to take a more pragmatic approach and lay out a process that enterprises should consider for their data management architecture.

The focus is on data lakes for enterprises, referred to as Enterprise Data Lake to distinguish it from data lakes created by internet, ad-tech or other technology companies that have different types of data and access requirements.

The Enterprise Data Warehouse

The much reviled and beleaguered Data Warehouse has been the mainstay of enterprises for over 20 years, supporting business reports and dashboards and allowing analysts to understand how the business is functioning. Data warehouses, when built right, provide robust security, audit and governance, which is critical, especially with the increasing number of cyber-attacks today.

Alas – many data warehouse projects are so complex, they are never finished! Further, the strict, hierarchical governance that many IT departments created around the warehouse caused lots of frustration, as business analysts and researchers could not explore the data freely.

The Hadoop Phenomenon

When Hadoop entered the mainstream, the big attraction for business analysts and data scientists was the ability to store and access data outside the restrictive bounds of IT! This raised the exciting possibility of finding new insights into business operations, optimizing spend and finding new revenue streams.

3 Requirements for the Enterprise Data Lake

James Dixon coined the term Data Lake in 2010 to mean data flowing from a single source, stored in its natural state. We have come some way from that definition; the most common definition of a Data Lake today is a data repository for many different types and sources of data, be they structured or unstructured, internal or external, that facilitates different ways of accessing and analyzing the data. The Data Lake is built on Hadoop with the data stored in HDFS across a cluster of systems.

The 3 requirements for the Enterprise Data Lake are:

  1. It must collect and store data from one or more sources in its original, raw form and optionally, its various processed forms.
  2. It must allow flexible access to the data from different applications; for example, structured access to tables and columns as well as unstructured access to files.
  3. Entity and transaction data must have strong governance defined to prevent the lake from becoming a swamp.

Enterprise Data Lake Architecture

The diagram below shows an Enterprise Data Lake that ingests data from many typical systems such as CRM, ERP and other transactional systems. In addition, it is fed unstructured data from web logs, social media, IoT devices, third-party sites (such as DMP, D&B) creating a data repository. This rich data eco-system can now support combining multiple sources of data for more accurate analytics and never-before possible insights into business operations.

With technologies such as BigForce SNAP, it is possible to run existing enterprise Business Intelligence (BI) tools as well as perform exploratory analysis with visualization tools such as Tableau.

Enterprise Data Lake Governance

More importantly, the Hadoop eco-system now supports data governance through technologies like Ranger, Knox and Sentry. In combination with Kerberos, and enterprise identity management systems such as Active Directory (AD) or other LDAP frameworks, it is possible to implement strong security and governance rules. See “Implementing Hadoop Security” for details.

The Modern Enterprise Data Architecture

But what if you already have an existing EDW with hundreds of applications, some of which use complex analytics functions? How best can you leverage the EDW while also moving to a modern data architecture that allows new data sources to be integrated and empower your data scientists to integrate, enrich and analyze lots of data without the restrictions of the EDW?

A happy compromise between the data lake and data warehouse does exist and data architects and businesses have realized that it IS possible to build on the strengths of each system.

In this architecture, the data lake serves as the repository for all raw data, ingested from all the relevant data sources of an organization. Optionally, the data lake can also store cleansed and integrated data, which is then also fed into the data warehouse. This way, newer BI applications can be built directly on the enterprise data lake while existing applications continue to run on the EDW.
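A minimal sketch of this ingestion pattern with PySpark follows; the bucket paths, schema and cleansing step are hypothetical.

# Land raw source data in the lake as-is, then write a cleansed copy that
# both new BI applications and the feed into the EDW can use.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-ingest").getOrCreate()

raw = spark.read.option("header", "true").csv("s3a://acme-lake/landing/crm/customers.csv")

# Raw zone: keep the data in its original form, partitioned by ingest date.
(raw.withColumn("ingest_date", F.current_date())
    .write.mode("append")
    .partitionBy("ingest_date")
    .parquet("s3a://acme-lake/raw/crm/customers/"))

# Cleansed zone: deduplicate and normalize before feeding the warehouse.
cleansed = (raw.dropDuplicates(["customer_id"])
               .withColumn("email", F.lower(F.col("email"))))
cleansed.write.mode("overwrite").parquet("s3a://acme-lake/cleansed/crm/customers/")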

Data Governance in the Enterprise Data Lake


Data Governance policies for enterprise data in the EDW should, in most cases, also apply to the same data within the Enterprise Data Lake. Otherwise, security holes and data inconsistencies between the two systems may result. If careful consideration is not given to governance, the data lake will turn into a data swamp!

However, since the data lake consists of all the raw data from operational systems as well as new data sources, it is possible to now provide data scientists and other analysts access to these data sets for new exploratory analytics.

Conclusion

Designing a modern data architecture requires a thorough understanding of the requirements, existing applications and future needs and goals of the enterprise. Especially important to consider are master data and metadata management, governance and security, as well as the right technologies.

At Orzota, we have built data lakes for a variety of businesses and have a methodology in place to ensure success. Contact us for more information.

What exactly is Artificial Intelligence?

Artificial intelligence is really starting to shape the world as we know it. The field of AI includes everything that has anything to do with the “intelligence” of a machine and, more specifically, that machine’s ability to imitate a human’s thought process and reasoning abilities.

While artificial intelligence is used to develop programs that help solve problems, the way an AI program solves a problem is quite different from the way a human would. In a general sense, these programs are designed to interpret, sort through, and provide insight from vast amounts of data. We want AI programs to handle this data because they can process far more than a human brain ever could.

Four AI abilities

There are four abilities that contribute to artificial intelligence, and without them AI would not be what we expect.

Ability to sense

The first, the ability to sense, correlates directly with object recognition. In this case, object recognition is the picking out and identification of objects from different inputs such as videos and digital images. Natural Language Processing (NLP), the ability to read text and make sense of it, also contributes to the ability to sense.

Ability to converse

The second, the ability to have a conversation, is the foundation for developing the ability to think. Predictive Analytics sums this up: identifying the likelihood of future outcomes based on historical data and algorithms (machine learning).

Ability to act

The third, the ability to act, refers to taking action based on thinking. This is also known as “Prescriptive Analytics,” which determines the best solutions or outcomes among various choices, given known parameters.

Ability to learn

The fourth and final ability, the ability to learn, includes self-improvements that occur automatically. Not only do these improvements need to happen, but we also need to understand how they were made as they occur.

More than sci-fi robots

Throughout the advancement of AI, the technology industry has made it an essential part of its work. This advancement has caused debate over whether or not AI is a threat to humanity. Artificial intelligence is NOT something to fear; and it IS more than just sci-fi robots taking over.

Of course, it’s easy to understand why some may think AI and robots are one and the same. Pop culture can be blamed for this, because robots are often portrayed in ways that may cause humans to worry about what exactly they may become. In reality, robots are physical machines created to carry out specific tasks, while artificial intelligence is used to develop programs that solve problems. When AI and robots are integrated, autonomous robots are born.

Practical uses today

Believe it or not, artificial intelligence systems are in use every day. Interesting Engineering came up with a list of everyday applications of AI, which can be separated into two categories: consumer-focused and enterprise-focused.

Some consumer-focused applications include smart cars, video games, smart homes, and preventing heart attacks.

Examples of enterprise-focused applications are customer service, workflow automation, cybersecurity, and maintenance predictions.

With the increasing advancements in the field of artificial intelligence, we are destined to see more and more practical uses.

Orzota can help!

The Orzota BigForce Docu-AI Solution helps automate document workflows for insurance and finance use cases. It uses sophisticated AI techniques to parse documents (image files, PDFs, etc.), extracting information and key insights while providing instant search and analysis capabilities.

To find out more, please contact info@orzota.com.

Security is an essential part of the Hadoop infrastructure in any organization. Let’s look at some of the key components needed to ensure that the infrastructure is secured from external compromise.

The key aspects of security are authentication, authorization and encryption. We will look at ways to implement all three in the context of Hadoop clusters, be they in the cloud or on-prem.


Network Level Security (Apache Knox):

Apache Knox is used to secure the perimeter of Hadoop clusters for data access and job execution. Knox can be deployed as a cluster of Knox nodes that acts as a single access point, routing requests to the Hadoop REST and HTTP APIs and providing SSO (single sign-on) for multiple UIs. Knox supports LDAP and Active Directory as well as Kerberos authentication.

The most prevalent and popular way to provide secure authentication to Hadoop clusters is Kerberos, which requires client-side configuration and packages. Apache Knox eliminates the need for such client-side libraries and complex configuration.

We can create different topologies that specify the actual hosts and ports of the service components, integrating LDAP/Kerberos authentication.

Example to access HDFS data:

Make a directory:
curl -ik -u knox_username -X PUT 'https://knoxhost:8443/gateway/topology_name/webhdfs/v1/user/hdfs/input?op=MKDIRS'

HTTP/1.1 200 OK
Date: Fri, 01 Sep 2017 09:10:41 GMT
Set-Cookie: JSESSIONID=k9klsdy2yyeg1engj31y5djh8;Path=/gateway/test;Secure;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Set-Cookie: rememberMe=deleteMe; Path=/gateway/test; Max-Age=0; Expires=Thu, 31-Aug-2017 09:10:41 GMT
Cache-Control: no-cache
Expires: Fri, 01 Sep 2017 09:10:41 GMT
Date: Fri, 01 Sep 2017 09:10:41 GMT
Pragma: no-cache
Expires: Fri, 01 Sep 2017 09:10:41 GMT
Date: Fri, 01 Sep 2017 09:10:41 GMT
Pragma: no-cache
Content-Type: application/json; charset=UTF-8
X-FRAME-OPTIONS: SAMEORIGIN
Server: Jetty(6.1.26.hwx)
Content-Length: 16

The above command creates an input directory under /user/hdfs.
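The same WebHDFS API can also be called programmatically through Knox; a minimal sketch with the Python requests library follows, where the gateway host, topology name and credentials are placeholders and certificate verification is disabled only for illustration.

# List a directory through the Knox gateway using the WebHDFS REST API.
import requests

response = requests.get(
    "https://knoxhost:8443/gateway/topology_name/webhdfs/v1/user/hdfs/input",
    params={"op": "LISTSTATUS"},
    auth=("knox_username", "knox_password"),
    verify=False,        # illustration only; trust the Knox certificate in practice
)
response.raise_for_status()
for entry in response.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"])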

Example: access a Hive table using Beeline
beeline> !connect
jdbc:hive2://knoxhost:8443/;ssl=true;sslTrustStore=/opt/jdk1.8.0_144/jre/lib/security/cacerts;trustStorePassword=changeit?hive.server2.transport.mode=http;hive.server2.thrift.http.path=gateway/test/hive

Connecting to jdbc:hive2://knoxhost:8443/;ssl=true;sslTrustStore=/opt/jdk1.8.0_144/jre/lib/security/cacerts;trustStorePassword=changeit?hive.server2.transport.mode=http;hive.server2.thrift.http.path=gateway/test/hive

Enter username for jdbc:hive2://knoxhost:8443/;ssl=true;sslTrustStore=/opt/jdk1.8.0_144/jre/lib/security/cacerts;trustStorePassword=changeit?hive.server2.transport.mode=http;hive.server2.thrift.http.path=gateway/test/hive: knox_user

Enter password for jdbc:hive2://knoxhost:8443/;ssl=true;sslTrustStore=/opt/jdk1.8.0_144/jre/lib/security/cacerts;trustStorePassword=changeit?hive.server2.transport.mode=http;hive.server2.thrift.http.path=gateway/test/hive: **********

log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).

log4j:WARN Please initialize the log4j system properly.

log4j:WARN See https://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Connected to: Apache Hive (version 1.2.1000.2.6.2.0-205)

Driver: Hive JDBC (version 1.2.1.spark2)

Transaction isolation: TRANSACTION_REPEATABLE_READ

0: jdbc:hive2://knoxhost:8443/> show databases;

+-----------------+
| database_name   |
+-----------------+
| default         |
+-----------------+
1 row selected (4.169 seconds)

0: jdbc:hive2://knoxhost:8443/> create database test;

No rows affected (1.271 seconds)

0: jdbc:hive2://knoxhost:8443/> show databases;

+-----------------+
| database_name   |
+-----------------+
| default         |
| test            |
+-----------------+
2 rows selected (2.222 seconds)

The Hortonworks Knox tutorial can be accessed here.

Authentication (Kerberos):

The primary purpose of a Hadoop cluster is to store and process large amounts of data, which requires secure handling to prevent unauthorized access. The Kerberos network authentication protocol provides strong authentication for client/server applications. For each operation, the client is required to present its identity (principal) to the Kerberos server. There are two types of principals: user and service principals.

Another important term in Kerberos is the realm. A realm is an authentication and administrative domain, and all principals are assigned to a specific Kerberos realm.

The Key Distribution Centre (KDC) stores and controls all Kerberos principals and realms.


The KDC (Key Distribution Centre) has three components:

  • Kerberos Database
  • Authentication Server (AS)
  • Ticket Granting Service (TGS)

The Kerberos database stores and controls all principals and realms. Kerberos principals in the database are identities with the following naming convention:

User@EXAMPLE.COM (User Principal)

hdfs/node23.example.com@EXAMPLE.COM (Service Principal)

The AS is responsible for issuing a TGT (Ticket Granting Ticket) when a client initiates a request to it.

The TGS is responsible for validating TGTs and issuing service tickets. A service ticket allows an authenticated principal to use the services provided by an application server, which is identified by a service principal.

To create a principal, as the root user:

kadmin.local -q "addprinc -pw orzota hdfs-user"

The above command adds a new principal hdfs-user with orzota as the password.

To access HDFS data from a Kerberized client machine:

$ kinit

Password for hdfs-user@ORZOTAADMIN.COM:

$ klist

Ticket cache: FILE:/tmp/krb5cc_1013

Default principal: hdfs-user@ORZOTAADMIN.COM

Valid starting Expires Service principal
09/14/2016 14:54:32 09/15/2016 14:54:32 krbtgt/ORZOTAADMIN.COM:

Authorization (Apache Sentry / Ranger)

In a Hadoop infrastructure, Apache Sentry or Apache Ranger can be used as a centralized way to manage security across the various components of a Hadoop cluster. In this blog, we will consider Ranger for authorization.

Ranger is used to authorize users/groups (including users authenticated by Kerberos) to access resources inside the Hadoop ecosystem.

Currently, Ranger provides audits and plugins for each of the Hadoop services, including HDFS, Hive, HBase, YARN, Kafka, Storm, Knox and Solr. Ranger uses Solr to audit user actions on all supported services.

By using these plugins, Hadoop Administrator can create policies to authorize users to access Hadoop services.

For example, the Hive Ranger plugin provides authorization at the database, table and column level. Using it, we can create specific role-based policies for each user/group, thereby controlling the kinds of queries that can be run against a database or table.

The Hortonworks Ranger tutorial can be accessed here.

Encryption (Ranger KMS):

Ranger Key Management Server (KMS) is built on the Hadoop KMS developed by the Apache community. It extends the native Hadoop KMS functions by letting the Ranger Admins store keys in a secure database.

Ranger provides centralized key management through the Ranger Admin UI, which offers the ability to create, delete and update keys using its dashboard or REST APIs. Ranger Admin also provides the ability to manage access control policies within Ranger KMS. These access policies control permissions to generate or manage keys, adding another layer of security for data encrypted in Hadoop.

HDFS Encryption Example:

In the Ranger KMS UI, create a key named hdfs-encryption.

Add a new policy named key-test and give decrypt permission only to the user bob.
Then, in HDFS:

1. Create a test directory and give ownership to the user bob:

hdfs dfs -mkdir /test

hdfs dfs -chown -R bob:hdfs /test

2. Create an encryption zone:

[hdfs@ip-172-31-4-145 ~]$ hdfs crypto -createZone -keyName hdfs-encryption -path /test

Added encryption zone /test

[hdfs@ip-172-31-4-145 ~]$ hdfs crypto -listZones

/test hdfs-encryption

3. Verify read/write permissions for the user bob. Only bob can access data under /test.

4. If you try to access the data as some other user, it will throw the following error:

[hdfs@ip-172-31-4-145 ~]$ hdfs dfs -put test1.txt /test/
put: User:alice not allowed to do 'DECRYPT_EEK' on 'hdfs-encryption'
17/08/17 10:51:02 ERROR hdfs.DFSClient: Failed to close inode 17051

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /test/test1.txt._COPYING_ (inode 17051): File does not exist. Holder DFSClient_NONMAPREDUCE_1683412138_1 does not have any open files.

Move to the Cloud – Benefits Of Cloud Computing

NOTE: This is a guest post by Danish Wadhwa

Cloud Computing has changed the way we use software, whether for personal or business use. The process of downloading, installing, configuring and maintaining different types of software is eliminated with a move to the Cloud, saving businesses time and resources. The Cloud has brought us to an era of increased responsiveness and efficiency.

Over the past few years, Cloud Computing has taken over and become an essential part of our everyday life. Whether it is for updating a status on Facebook or checking account balances on a smartphone, we use the Cloud. The Cloud is adept at handling various processes in an organization. With it, individuals and businesses can plan, strategize and organize tasks within minutes. The Cloud can also keep information safe, while providing access from anywhere at any time.

Here are some of the top reasons for businesses to move to the Cloud:

Flexibility

Cloud-based services are ideal for businesses with fluctuating demands. The Cloud’s capacity can be increased or decreased according to specific requirements. Such flexibility gives businesses a real advantage over competitors, and this operational agility is one of the main reasons for moving to the Cloud.

Fast Data Recovery

Cloud keeps data protected, while offering data backup and recovery options in case of an emergency. A Cloud-based backup and recovery solution saves time and avoids large upfront investments as well. Further, by backing up to Cloud Servers in different geographical regions, a robust backup strategy can provide insurance against natural disasters, power outages, etc.

Automatic Software Updates

Cloud Computing servers are usually off-premises and suppliers of cloud computing make sure that all issues are resolved without impacting the end user, who can utilize the services of the Cloud without interruptions. Systems are maintained and kept up-to-date, with regular software updates that are done automatically. This leaves organizations to focus on matters more pertinent to their business, rather than their software and hardware infrastructure.

Increased Collaboration

Cloud Computing simplifies various everyday operations and makes work easier: it provides access to data and the option to edit and share documents with different team members anytime, anywhere. One example is Asana, a cloud-based project management tool that helps assign tasks to different team members, edit lists and keep track of progress, thus improving collaboration and coordination.

Security

Security becomes an important concern when you decide to store all of your data in the Cloud, and this is where Cloud Computing’s high-end security measures come into play. Although many enterprises have pointed to security concerns as their number one reason for not moving to the cloud, that myth has been debunked. Today, the Cloud can be more secure than a private data center. Your data is encrypted both in transit and at rest on the cloud servers, protecting it against outages and disasters. Not only that, customers can also choose to control their own encryption keys if they wish.

Scalability

The “Pay as you Go” model allows you to pay according to your usage, helping small startups figure out what they need and expand as they grow. It also allows businesses to commence their ventures regardless of available capital: the initial investment can be considerably low, and a company can gradually increase usage as it grows. The Cloud gives organizations access to enterprise-class technology, along with an opportunity to learn and understand the market and plan how to beat the competition.

For mid-to-large enterprises, this one point can be a huge time and money saver, as on-premise infrastructure can take a long time to provision and needs to be planned well in advance of the need for scale.

Eco-friendly

Apart from all the benefits we have discussed above, the Cloud is eco-friendly too. With the ability to change the server size according to usage, organizations only use the energy required at the moment, without leaving giant carbon footprints.

Why NOT move to the Cloud?

As technology leaders, we are challenged to make decisions that impact the organization’s growth. Our primary goals are to deliver on time and resolve problems efficiently, while staying within budget. The Cloud makes it possible to achieve these goals, with a proper plan and process in place. Moving to the Cloud can be one of those changes that an organization makes. Get a DevOps certification to benchmark your skills in Cloud Computing and understand its benefits.

 


During the summer of 2016, we had a high school student intern with us. He knew some Java from the AP Computer Science course but was very interested in using machine learning to predict health outcomes. We were skeptical at first – the prospect of teaching a teenager (even a very smart one) the fundamentals of ML along with a new programming language, and then having him apply it to a real data set, all in the span of a summer internship, seemed like a Herculean task. But seeing how keen he was, we decided to take him on.

Sushant Thyagaraj (that was his name) proved us wrong! He learned R within the first week, following that quickly with various ML algorithms through tutorials and sample exercises. He researched various publicly available data sets that might be suitable for his work, went through several iterations with a couple of the data sets before finally settling on predicting survival for lung cancer patients after thoracic surgery.

He continued fine-tuning his results and wrote a full paper detailing his work (I should add that this last part was done after school began). We are pleased to present his paper: Using Machine Learning to Predict the Post-Operative Life Expectancy of Lung Cancer Patients

 

The topic of data science has been on the rise within the tech industry. Often, we see techies conversing and sharing articles about data science on social media and we hear professionals discussing it as part of their business plan. By now, most of us are aware that it exists and have an inkling about what it does. But can you answer the following questions?

Data Science

Do You Need a Data Scientist?

In the past, it was mostly larger, technologically advanced companies (Facebook, LinkedIn, Google, etc.) that employed data scientists. Now, however, we are seeing non-technology businesses hire data scientists as well. For example, retailers are using data science for everything from understanding customers to managing inventory. Data science allows companies to gain insights from data in many fields and ultimately improve forecasting and decision making.

What Does a Data Scientist Do?

According to Dr. Steve Hanks, there are three major capabilities that data scientists must have: (1) they must understand that data has meaning, (2) they must understand the problem that you need to solve and how the data relates to that, and (3) they must understand the engineering.

A data scientist, in very general terms, investigates a set of data and then comes up with different ways to answer previously posed questions. Along the way, the data scientist may use historical data analysis to develop analytical models and dashboards that provide insights and improve decision making. For example, a data scientist for a large retailer like Macy’s may look at not just past seasons’ data, but also current economic and weather conditions, to make predictions for the upcoming season. Retail executives use this to improve sales, revenue, marketing strategies, advertising efforts, etc.

How Do You Build a Strong Data Science Team?

Choosing people who are aware of, and skilled in, areas that fit your company’s needs is essential. An article from Datafloq says, “The team needs to take the data and understand how it can affect different areas of the company and help those areas implement positive changes.” Not all the skills of a data scientist can be taught; it is important to have a natural affinity for data analytics and the drive to produce beneficial insights that answer your company’s needs. Data scientists are not only computer scientists and statisticians, but must have a solid understanding of the business as well.

Should You Outsource Your Data?

Because this field of work is both complex and intimidating, there is a shortage of skilled professionals in the industry. Advanced analytics require a specific skill set to develop and run machine learning models. Instead of spending the money and effort to develop a team with the necessary skills internally, you can speed up your path to data science by outsourcing. For small-to-medium businesses, it can be cost-prohibitive to maintain their own data science team. There is also data engineering work that must be done before a data scientist can develop models, and hiring both data engineers and data scientists may not be an efficient use of resources for a small-to-medium business.

Shanti Subramanyam, CEO at Orzota says, “Deciding to outsource reflects the core competency of your business. If you don’t have the financial resources or the capacity to focus on it, outsourcing is a faster and more efficient way to stand up a capability.”

Conclusion

If you’re overwhelmed by these questions, don’t be. Although the idea of data science and big data may seem complex, it is important to understand at least the basics. If you can articulate your business pain-points, it will be easier to answer these questions and find the best solutions to fit your company’s needs. Orzota is here to explain further, answer your questions, and offer services to help you feel comfortable with understanding and fulfilling your data science needs.


IoT and Big Data are becoming essential to market growth and customer success. Enter Orzota’s call for partners. There are initiatives in all major verticals, including Manufacturing, Oil & Gas, Transportation, Retail, Financial Services, Insurance, and Life Sciences/Healthcare. Additionally, there are many pieces to the puzzle: data architecture, technology and the resources necessary to deliver a successful Big Data & IoT program that will benefit business users.

At Orzota, we sit at the intersection of IoT and Big Data. As a Silicon Valley based company, we aim to provide solutions that transform the way companies collaborate and derive value from data, whether it comes from sensors, machines, ERPs, websites, industry boards, social media or beyond. To do so, we bring flexible Big Data and IoT platforms that accelerate the delivery of solutions, supported through our Managed Services models. We harness open stacks and cloud technologies that provide the elasticity and economics to quickly generate ROI. Lastly, we augment projects with vetted Data Architects, Data Engineers and Data Scientists.

As a partner, you are a technology or consulting provider that serves the mid-market and is looking to augment your service portfolio while adopting the latest open-stack technologies for Big Data and IoT. Apart from an excellent solution, you can expect the support of an experienced team that worked on the technology side of this domain at Yahoo and brings project experience from companies like Netflix, Boeing, and Bank of America, to name a few.

We’re all about making it easy, so email us at partners@orzota.com.