Security is an essential part of Hadoop infrastructure in any organization. Let’s look at some of the key or need-to-have components in ensuring that infrastructure is secured from external compromises.

The key aspects of security are authentication, authorization and encryption. We will look at ways to implement all three in the content of Hadoop clusters – be it on the cloud or on-prem.

hadoopServices

Network Level Security (Apache Knox):

Apache Knox is used to secure the perimeter of hadoop clusters to access data and to execute jobs. Knox can be deployed as clusters of Knox nodes, which acts as single access point and routes requests to the Hadoop rest and HTTP APIs and Provides SSO (single sign on) for multiple UIs. Knox supports LDAP, Active Directory as well as kerberos authentication.

The most prevalent and popular way to provide secure authentication to Hadoop clusters is by the use of Kerberos, which requires client-side configuration and packages. Apache Knox eliminates the requirement for such client-side library and complex configurations.

We can create different topologies, where we can provide for actual hosts and ports to run service components by integrating LDAP/Kerberos authentication.

Example to access HDFS data:

Make directory:
curl -ik -u knox_username -X PUT ‘https://knoxhost:8443/gateway/topology_name/webhdfs/v1/user/hdfs/input?op=MKDIRS’

HTTP/1.1 200 OK

Date: Fri, 01 Sep 2017 09:10:41 GMT

Set-Cookie: JSESSIONID=k9klsdy2yyeg1engj31y5djh8;Path=/gateway/test;Secure;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT

Set-Cookie: rememberMe=deleteMe; Path=/gateway/test; Max-Age=0; Expires=Thu, 31-Aug-2017 09:10:41 GMT

Cache-Control: no-cache

Expires: Fri, 01 Sep 2017 09:10:41 GMT

Date: Fri, 01 Sep 2017 09:10:41 GMT

Pragma: no-cache

Expires: Fri, 01 Sep 2017 09:10:41 GMT

Date: Fri, 01 Sep 2017 09:10:41 GMT

Pragma: no-cache

Content-Type: application/json; charset=UTF-8

X-FRAME-OPTIONS: SAMEORIGIN

Server: Jetty(6.1.26.hwx)

Content-Length: 16

Above command will create input directory in /user/hdfs location

Example to access hive table using beeline
beeline> !connect
jdbc:hive2://knoxhost:8443/;ssl=true;sslTrustStore=/opt/jdk1.8.0_144/jre/lib/security/cacerts;trustStorePassword=changeit?hive.server2.transport.mode=http;hive.server2.thrift.http.path=gateway/test/hive

Connecting to jdbc:hive2://knoxhost:8443/;ssl=true;sslTrustStore=/opt/jdk1.8.0_144/jre/lib/security/cacerts;trustStorePassword=changeit?hive.server2.transport.mode=http;hive.server2.thrift.http.path=gateway/test/hive

Enter username for jdbc:hive2://knoxhost:8443/;ssl=true;sslTrustStore=/opt/jdk1.8.0_144/jre/lib/security/cacerts;trustStorePassword=changeit?hive.server2.transport.mode=http;hive.server2.thrift.http.path=gateway/test/hive: knox_user

Enter password for jdbc:hive2://knoxhost:8443/;ssl=true;sslTrustStore=/opt/jdk1.8.0_144/jre/lib/security/cacerts;trustStorePassword=changeit?hive.server2.transport.mode=http;hive.server2.thrift.http.path=gateway/test/hive: **********

log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).

log4j:WARN Please initialize the log4j system properly.

log4j:WARN See https://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Connected to: Apache Hive (version 1.2.1000.2.6.2.0-205)

Driver: Hive JDBC (version 1.2.1.spark2)

Transaction isolation: TRANSACTION_REPEATABLE_READ

0: jdbc:hive2://knoxhost:8443/> show databases;

+———————–+–+
| database_name |
+———————–+–+
| default |
+———————–+–+
1 row selected (4.169 seconds)

0: jdbc:hive2://knoxhost:8443/> create database test;

No rows affected (1.271 seconds)

0: jdbc:hive2://knoxhost:8443/> show databases;

+———————–+–+
| database_name |
+———————–+–+
| default |
| test |
+———————–+–+
2 rows selected (2.222 seconds)

Hortonworks knox tutorial can be accessed here.

Authentication (Kerberos):

The primary purpose of an Hadoop cluster is to store and process large amount of data, which requires secure handling to prevent unauthorised access. Kerberos network authentication protocol provides for strong authentication of client/server applications. For each operation, the client is required to provide its identity(principal) to the Kerberos server. There are two types of principals – user and service principals.

Another import term in Kerberos is Realm. Realm is the authentication and administrative domain and all principals are assigned to a specific Kerberos realm.

Key Distribution Centre is used to store and controls all Kerberos principals and Realm.

kerbos

    KDC (Key Distribution Centre) has three components

  • Kerberos Databases,
  • Authentication Server(AS),
  • Ticket Granting Service(TGS)

Kerberos Database stores and controls all principals and realms. Kerberos principals in the database are identities with following naming convention.

User@EXAMPLE.COM (User Principal)

Hdfs/node23.example.com@EXAMPLE.COM (Service Principal)

AS is responsible for issuing TGT (Ticket Granting Ticket) service tickets when client initiate request to AS.

TGS is responsible for validating TGT service tickets. Service tickets allows an authenticated principal to use services provided by the application server, which is identified by service principal.

To create Principal

as root user,

kadmin.local -q “addprinc -pw orzota hdfs-user”

the above command will add new hdfs-user with orzota as password.

To access hdfs data in kerberized client machine,

$ kinit

Password for hdfs-user@ORZOTAADMIN.COM:

$ klist

Ticket cache: FILE:/tmp/krb5cc_1013

Default principal: hdfs-user@ORZOTAADMINS.COM

Valid starting Expires Service principal
09/14/2016 14:54:32 09/15/2016 14:54:32 krbtgt/ORZOTAADMIN.COM:

Authorization (Apache Sentry / Ranger)

In Hadoop infrastructure, Apache Sentry or Ranger can be used to perform the centralized way to manage security across various components in a Hadoop cluster.. In this blog, we will consider Ranger for authorization.

Ranger is used authorize users/group (as well as authenticated user by Kerberos) to access resources inside Hadoop ecosystem.

Currently Ranger provides audits and plugins for each of the Hadoop services which include HDFS, Hive, HBase, YARN, Kafka, Storm,Knox and Solr. Ranger uses Solr to audit the user actions on all supported services.

By using these plugins, Hadoop Administrator can create policies to authorize users to access Hadoop services.

For Example, Hive-Ranger-Plugin provides authorization at database, table and column level. By using this we can create specific / role-based policies for each user/group, thereby controlling the kind of queries that can be run on the database / table.

Hortonworks Ranger Tutorial can be access here.

Encryption (Ranger KMS):

Ranger Key Management Server (KMS) is built on the Hadoop KMS developed by the Apache community. It extends the native Hadoop KMS functions by letting the Ranger Admins store keys in a secure database.

Ranger provides centralized administration of Key management using Ranger admin UI. Ranger admin provides ability to create,delete and update keys using its dashboard or rest APIs. Ranger admin also provides the ability to manage access control policies within Ranger KMS. The access policies control permissions to generate or manage keys, adding another layer of security for data encrypted in Hadoop.

HDFS Encryption Example:

In Ranger KMS UI, Create key in the name of hdfs-encryption.

Add new policy in the name of key-test and give decrypt permission only for bob user.
in hdfs,

1. create test dir and give owner permission to the bob user.

hdfs dfs -mkdir /test

hdfs dfs -chown -R bob:hdfs /test

2. create encryption zone:

[hdfs@ip-172-31-4-145 ~]$ hdfs crypto -createZone -keyName hdfs-encryption -path /test

Added encryption zone /test

[hdfs@ip-172-31-4-145 ~]$ hdfs crypto -listZones

/test hdfs-encryption

3. Verify read write permission for user bob. only bob user can access the data from /test

4. if you try to access data from some other user. it will through following error.

[hdfs@ip-172-31-4-145 ~]$ hdfs dfs -put test1.txt /test/
put: User:alice not allowed to do ‘DECRYPT_EEK’ on ‘hdfs-encryption’
17/08/17 10:51:02 ERROR hdfs.DFSClient: Failed to close inode 17051

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /test/test1.txt._COPYING_ (inode 17051): File does not exist. Holder DFSClient_NONMAPREDUCE_1683412138_1 does not have any open files.

We recently worked with a leading Hi-Tech manufacturing company to design and implement a brand new scalable and efficient workforce analytics solution targeted the mobile workforce.

The solution is designed to raise the workers’ confidence bar, and to minimize the effort required to train the workers. The solution also improved the manpower utilization by optimizing inventory adjustments with higher accuracy while fulfilling orders. It also reduces the learning curve for workers resulting in substantial reduction in training hours.

Workforce Analytics Solution Overview

The Workforce Analytics solution was built on a Common Data Analytics Platform leveraging Hortonworks HDP 2.4 and used the following technologies: Kafka, Storm, HBase, HDFS, Hive, Knox, Ranger, Spark and Oozie.

The platform collects real time data from the application on mobile devices, stores it, and runs analytics with better performance and lower latency compared to their prior legacy system.

The HDP components at a glance:
Workforce Analytics Solution HDP Components

Workforce Analytics Architecture

The operational real-time data is collected using Kafka and ingested into HDFS and HBase in parallel using Storm (see diagram below). HBase acts as the primary data store for the analytics application. The data in HDFS is encrypted and reserved for other applications. Based on the business logic, the data stored in HBase is processed using Spark on a daily, weekly, monthly and yearly basis, and stored back into HBase as a feed for Spark Analytics (Spark SQL). Spark Analytics is used to run jobs to generate specific insights. The output from Spark Analytics in Hive as a temporary table. Hive Thrift Server is used to execute queries against Hive and retrieve the results for visualization and exploration using Tableau. Custom dashboards were also built for business users to help them track higher-level metrics.

Workforce Analytics - Architecture

To address security requirements, Apache Knox and Apache Ranger were used for perimeter security and access control, respectively. Both are included as a part of HDP 2.4 and are configured in the Access Node.

Workforce Analytics Physical Architecture

The figure below shows the physical layout of the services on the various servers used. The architecture comprises of Edge Nodes, Master Nodes and Slave Nodes. Each set of nodes run a variety of services.

Workforce Analytics Physical Architecture

Issues and Solutions

While implementing this solution, we ran into a variety of issues. We outline some of them here in the hope that it may help others who are designing similar architectures with the Apache Hadoop  or Hortonworks HDP eco-system of components. Table creation, user permission and workflows were the common focus areas.

HBase Table Creation

We ran into permission issues with HBase table creation.

Solution: In Apache Ranger, update HBase policy by giving appropriate read, write and create permission for the defined user.

Connection to hive thrift server

Another issue we ran into involved connections to Hive Thrift Server for a particular user “ABC”.

Solution: Ensure that the below properties are added to $HADOOP_CONF/core-site.xml

hadoop.proxyuser.ABC.groups=*

hadoop.proxyuser.ABC.hosts=*

Oozie workflow jobs submission

Permission errors continued to plague the project while creating workflows in oozie.

Solution: The following needs to exist in the section of the corresponding job definition in workflow.xml:

<env-var>

HADOOP_USER_NAME=ABC

</env-var>

within the

<shell xmlns="uri:oozie:shell-action:0.2">

oozie workflow job stuck in prep state

When re-running an Oozie workflow job after a period of time, it went to PREP state and did not execute. While trying to kill the job via CLI, the Oozie log shows the job was successfully killed.

USER [test] GROUP[-] TOKEN[-] APP[-] JOB[ABCDEF] ACTION[] User test killed the WF job ABCEDEF-oozie-oozi-W

However, in the Oozie UI, the job is still shown to be in PREP state.

Solution: Further research showed that the Oozie database at the backend (Derby by default) was corrupted, and was not representing the correct state of the jobs.

We decided, for longer term stability, to migrate from Derby to MySQL as the backend database for Oozie. After this migration, we did not run into this issue again.

Conclusion

Big data projects can grow and evolve rapidly. It’s important to realize that the solution chosen must offer the flexibility to scale up or down to meet business needs. Today, in addition to commercial platform distributions such as Hortonworks and Cloudera, higher level tools and applications simplify the process of developing big data applications. However, as seen by some of the issues we describe above, expertise in the underlying technologies is still crucial for timely completion of projects. Orzota can help. Please contact us.

It’s an exciting time for the Orzota team. During February we will hold some interesting conversations with the local business community in Chicago. chicago-skyline

Orzota’s Founder and CTO, Bharath Mundlapudi and our Director of Sales Ilias Farfouris will be presenting and hosting 2 events. The topics are Big Data in IoT and Modern Data architecture using Flink and Hadoop.

Chicago and the midwest are communities that are taking huge strides in Technology within the Big Data domain. Orzota’s mission and past use case expertise makes a dynamic contributor for these groups to share best practices.

During February 15th we will be a part of the IoT business community in Chicago. Please find the info and join us for a great evening of technological conversation that fuse Big Data and IoT. Here’s the link https://www.meetup.com/Internet-of-Things-IoT-Chicago/events/228330908/

During February 16th we will be a part of the Chicago Apache Flink community. This is a technology that is growing and the group is starting efforts in New York and internationally. Our mission as a big data company is to contribute with new and exciting upcoming technologies. We’re excited to illustrate our domain and fuse it with the new kid on the block Flink. Here’s the link https://www.meetup.com/Chicago-Apache-Flink-Meetup/events/228225429/

Come join us during these fun sessions and interact with peers in Big Data and IoT community. Special thanks to the organizers of the IoT Chicago and Flink Chicago community and thank you for having us!

See you there!