The topic of data science has been on the rise within the tech industry. Often, we see techies conversing and sharing articles about data science on social media and we hear professionals discussing it as part of their business plan. By now, most of us are aware that it exists and have an inkling about what it does. But can you answer the following questions?


Do You Need a Data Scientist?

In the past, data scientists were mostly found at large, technologically advanced companies (Facebook, LinkedIn, Google, etc.). Now, however, non-technology businesses are hiring data scientists as well. For example, retailers are using data science for everything from understanding customers to managing inventory. Data science allows companies in many fields to gain insights from data and ultimately improve forecasting and decision making.

What Does a Data Scientist Do?

According to Dr. Steve Hanks, there are three major capabilities that data scientists must have: (1) they must understand that data has meaning, (2) they must understand the problem you need to solve and how the data relates to it, and (3) they must understand the engineering.

A data scientist, in very general terms, investigates a set of data and then comes up with different ways to answer previously posed questions. Along the way, the data scientist may use historical data analysis to develop analytical models and dashboards that provide insights and improve decision making. For example, a data scientist for a large retailer like Macy’s may look not just at past seasons’ data, but also at current economic and weather conditions to make predictions for the upcoming season. Retail executives use these predictions to improve sales, revenue, marketing strategies, advertising efforts, and more.

How Do You Build a Strong Data Science Team?

Choosing people who are skilled in areas that fit your company’s needs is essential. An article from Datafloq says, “The team needs to take the data and understand how it can affect different areas of the company and help those areas implement positive changes.” Not all of a data scientist’s skills can be taught; a natural affinity for data analytics and the drive to produce beneficial insights for your company’s needs are important. Data scientists are not only computer scientists and statisticians; they must have a solid understanding of the business as well.

Should You Outsource Your Data?

Because this field of work is both complex and demanding, there is a shortage of skilled professionals in the industry. Advanced analytics require a specialized skill set to develop and run machine learning models. Instead of spending the money and effort to build a team with the necessary skills internally, you can speed up your path to data science by outsourcing. For small-to-medium businesses, a dedicated data science team can be cost-prohibitive. Substantial data engineering work must be done before a data scientist can develop models, and hiring both data engineers and data scientists may not be an efficient use of a small-to-medium business’s resources.

Shanti Subramanyam, CEO at Orzota says, “Deciding to outsource reflects the core competency of your business. If you don’t have the financial resources or the capacity to focus on it, outsourcing is a faster and more efficient way to stand up a capability.”

Conclusion

If you’re overwhelmed by these questions, don’t be. Although the idea of data science and big data may seem complex, it is important to understand at least the basics. If you can articulate your business pain-points, it will be easier to answer these questions and find the best solutions to fit your company’s needs. Orzota is here to explain further, answer your questions, and offer services to help you feel comfortable with understanding and fulfilling your data science needs.

The conversation has shifted from trying to define what Big Data is, but we’re still stuck in the layers. Is Big Data fun, like eating an orange full of vitamin C? Or is it more like an onion, which makes you cry when you try to peel it but turns sweet when you cook it properly? I lean toward the orange, and I will tell you why.

One important area is the learning curve. We see efforts taking six months or more without having addressed, let alone figured out, any of the use cases. There is a lot of innovation at the moment, but with it comes a lot of new terminology that takes effort to keep up with. More important still is the use case. How can we evaluate what we need without knowing what the use cases are? A real-time data streaming scenario is very different from one where speed is not important and we can afford to see the results the next day.

Three Steps to get started with Big Data

  1. Start by identifying your vertical, finding what the demands are, and determining who you serve as your customer. This will guide how you decide on the Proof-of-Concept (PoC) and therefore the use case(s). As a follow-up step, look at the technological assets you will need, and then at how to launch the PoC quickly. For companies that consider their data sensitive, follow this rule: decide what data you can live with having in the cloud. You’re only testing a hypothesis at this stage, and the cloud is the fastest route.
  2. Seamless integration. At this stage, you need a solution that is responsive, because multiple business units will come with requirements and you need to be able to accommodate them. I can’t stress enough the value of a managed services approach until you can support the platform internally: the learning curve is huge and the return on investment (ROI) is far better. Many fall into the trap of assuming that because a super enterprise does something, the same approach is feasible for them. That approach will only yield frustration, because it will take longer. At Orzota, BigForce is our effort to accelerate this stage.
  3. Insights. At the end of the day, you are doing all this for the insights, and more specifically for the predictive and hidden potential. Most likely you already have a ton of reporting going on; reporting is not the issue here (if you cannot get proper reports right now, you need serious help). Insights can be discovered internally but also externally, so kindly remind your hardest critics that it is not about how much data you have but how much is out there from which you can derive these crucial insights for your business. A good example that puts the argument to bed is social listening and, by extension, sentiment analysis.
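As a toy illustration of that last point, a sentiment analysis pass over social mentions can start as simply as a word-list scorer. The word lists and sample mentions below are invented for the sketch; production systems use trained models rather than fixed lists:

```python
# Tiny word-list sentiment scorer: positive hits minus negative hits.
POSITIVE = {"great", "love", "fast", "excellent"}
NEGATIVE = {"slow", "broken", "bad", "terrible"}

def sentiment_score(text):
    """Return a crude sentiment score for one social mention."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

mentions = [
    "Love the new app, great update",
    "Checkout is slow and broken again",
]
for m in mentions:
    print(m, "->", sentiment_score(m))
```

Even this crude score, aggregated over thousands of external mentions, hints at insight that no amount of internal data alone would surface.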

In conclusion, if you think about it, there isn’t much overlap if you start from the use-case and Proof-of-Concept (PoC) approach while following the steps above. Starting small will also let you get buy-in and then expand. Partners may seem at first to offer similar services, but at the end of the day you need to get more with less. Lastly, keep in mind that for most companies the dynamics are different: there isn’t much data analysis talent out there, so a managed services or hybrid approach can accelerate your environment and team. Finally, the orange analogy in three easy steps: pick, peel, and consume. You can enjoy it and get all the vitamin C and its benefits.


We recently worked with a leading Hi-Tech manufacturing company to design and implement a brand-new, scalable and efficient workforce analytics solution targeted at the mobile workforce.

The solution is designed to raise workers’ confidence and minimize the effort required to train them. It also improves manpower utilization by optimizing inventory adjustments with higher accuracy while fulfilling orders, and reduces the learning curve for workers, resulting in a substantial reduction in training hours.

Workforce Analytics Solution Overview

The Workforce Analytics solution was built on a Common Data Analytics Platform leveraging Hortonworks HDP 2.4 and used the following technologies: Kafka, Storm, HBase, HDFS, Hive, Knox, Ranger, Spark and Oozie.

The platform collects real time data from the application on mobile devices, stores it, and runs analytics with better performance and lower latency compared to their prior legacy system.

The HDP components at a glance:
Workforce Analytics Solution HDP Components

Workforce Analytics Architecture

The operational real-time data is collected using Kafka and ingested into HDFS and HBase in parallel using Storm (see diagram below). HBase acts as the primary data store for the analytics application. The data in HDFS is encrypted and reserved for other applications. Based on the business logic, the data stored in HBase is processed using Spark on a daily, weekly, monthly and yearly basis, and stored back into HBase as a feed for Spark Analytics (Spark SQL). Spark Analytics is used to run jobs that generate specific insights. The output from Spark Analytics is stored in Hive as a temporary table. Hive Thrift Server is used to execute queries against Hive and retrieve the results for visualization and exploration using Tableau. Custom dashboards were also built to help business users track higher-level metrics.

Workforce Analytics - Architecture
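The cadence-based rollups described above (Spark aggregating HBase data daily, weekly, and so on) boil down to a group-and-sum. A plain-Python sketch of the daily case, with invented event fields, conveys the shape of the computation; the real pipeline does the same at scale in Spark:

```python
from collections import defaultdict
from datetime import date

# Hypothetical operational events: (worker_id, event_date, units_handled)
events = [
    ("w1", date(2016, 5, 1), 12),
    ("w1", date(2016, 5, 1), 8),
    ("w2", date(2016, 5, 1), 15),
]

def daily_rollup(events):
    """Sum units handled per (worker, day), as the daily Spark job would."""
    totals = defaultdict(int)
    for worker, day, units in events:
        totals[(worker, day)] += units
    return dict(totals)

print(daily_rollup(events))
```

The weekly, monthly and yearly feeds differ only in the grouping key derived from the event date.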

To address security requirements, Apache Knox and Apache Ranger were used for perimeter security and access control, respectively. Both are included as a part of HDP 2.4 and are configured in the Access Node.

Workforce Analytics Physical Architecture

The figure below shows the physical layout of the services on the various servers used. The architecture comprises Edge Nodes, Master Nodes and Slave Nodes. Each set of nodes runs a variety of services.

Workforce Analytics Physical Architecture

Issues and Solutions

While implementing this solution, we ran into a variety of issues. We outline some of them here in the hope that they may help others designing similar architectures with the Apache Hadoop or Hortonworks HDP ecosystem of components. Table creation, user permissions and workflows were the common focus areas.

HBase Table Creation

We ran into permission issues with HBase table creation.

Solution: In Apache Ranger, update the HBase policy to grant the appropriate read, write and create permissions to the defined user.

Connection to Hive Thrift Server

Another issue we ran into involved connections to Hive Thrift Server for a particular user “ABC”.

Solution: Ensure that the properties below are added to $HADOOP_CONF/core-site.xml:

hadoop.proxyuser.ABC.groups=*

hadoop.proxyuser.ABC.hosts=*
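Inside core-site.xml itself, these shorthand settings take the usual Hadoop XML property form (using the same example user "ABC" as above):

```xml
<property>
  <name>hadoop.proxyuser.ABC.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.ABC.hosts</name>
  <value>*</value>
</property>
```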

Oozie Workflow Job Submission

Permission errors continued to plague the project while creating workflows in Oozie.

Solution: The following needs to exist in the corresponding job definition in workflow.xml, within the <shell xmlns="uri:oozie:shell-action:0.2"> action:

<env-var>

HADOOP_USER_NAME=ABC

</env-var>

Oozie Workflow Job Stuck in PREP State

When re-running an Oozie workflow job after a period of time, it went into PREP state and did not execute. When we tried to kill the job via the CLI, the Oozie log showed that the job was successfully killed:

USER [test] GROUP[-] TOKEN[-] APP[-] JOB[ABCDEF] ACTION[] User test killed the WF job ABCEDEF-oozie-oozi-W

However, in the Oozie UI, the job is still shown to be in PREP state.

Solution: Further research showed that the backend Oozie database (Derby by default) was corrupted and was not representing the correct state of the jobs.

For longer-term stability, we decided to migrate from Derby to MySQL as the backend database for Oozie. After this migration, we did not run into the issue again.
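For reference, the migration amounts to pointing Oozie’s JPAService at MySQL in oozie-site.xml. The host, database name and credentials below are placeholders, and the MySQL JDBC driver jar must also be placed on Oozie’s classpath:

```
oozie.service.JPAService.jdbc.driver=com.mysql.jdbc.Driver
oozie.service.JPAService.jdbc.url=jdbc:mysql://db-host:3306/oozie
oozie.service.JPAService.jdbc.username=oozie
oozie.service.JPAService.jdbc.password=oozie-password
```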

Conclusion

Big data projects can grow and evolve rapidly. It’s important to realize that the solution chosen must offer the flexibility to scale up or down to meet business needs. Today, in addition to commercial platform distributions such as Hortonworks and Cloudera, higher level tools and applications simplify the process of developing big data applications. However, as seen by some of the issues we describe above, expertise in the underlying technologies is still crucial for timely completion of projects. Orzota can help. Please contact us.


 

It’s no secret that analyzing data is a key initiative for all businesses. Some say we have been doing it for years. That’s true, but new Big Data methodologies give us the opportunity to be more effective and discover new sources of revenue. Today, we can use technology at very low cost to do more, understand more, and reach more consumers while maintaining a unified view of what they are looking for.

We can shape and mold everything that we do based on data, from research to marketing and everything in between. Think about it: if you know you’re shooting only 18% from 3-point range, why would you keep shooting 3-pointers?
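The arithmetic behind that 3-pointer intuition is simple expected value. The 50% two-point figure below is an assumed comparison, not a number from the text:

```python
def expected_points(shot_value, make_pct):
    """Expected points per attempt: shot value times make probability."""
    return shot_value * make_pct

three_at_18 = expected_points(3, 0.18)  # roughly 0.54 points per attempt
two_at_50 = expected_points(2, 0.50)    # 1.0 points per attempt
print(three_at_18 < two_at_50)  # True: stop shooting those threes
```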

Big Data comes with two major idiosyncrasies: complexity, because of the open source frameworks, and a shortage of available talent. Don’t let open source scare you – it actually means savings! The lack of talent is also easy to overcome with big data management.

Here are some benefits of managing big data:

  • Taking out the cost. If you’re not a large, tech-savvy enterprise like Facebook or the other giants, chances are you’re struggling with this decision. First comes the technology part, including data engineering, hardware, and technology solutions. Then comes finding the right people (not the one overseeing everything, but the actual crew). Let’s take a closer look at some of the roles required to build your first big data solution: software engineer, data engineer, data scientist, QA & release engineers, data architect, devops,… have you had enough?
  • Taking out the complexity. Big Data gives us the opportunity to select tools and tailor them to the exact use cases we’re looking to implement, all in an economic fashion. Open source frameworks come with a learning curve but also provide great value. How can one exploit the advantages while minimizing the complexity? Big data management is the answer. Coupled with experienced services offerings that can help build out your solution, you can enjoy the benefits of deep data insights in a matter of weeks at a fraction of the cost.
  • Show me the money. To be competitive, every enterprise today must exploit its data with advanced analytics. There is also a need for a sidekick. Lastly, speed and agility are important. Big data management will give you the edge to operate at the scale of a super enterprise while not breaking the bank.

In conclusion, I was inspired by articles on the complexity of Big Data and on the Big Data talent shortage. These are major issues that can simply be avoided by looking into a big data management option. This approach allows for analytics on large amounts of data, with the possibility of providing new sources of revenue. It’s not about the size of the data; what matters is solving your problems fast and economically.

Does your Analytics journey look like this?

Big Data and Analytics is now moving to midsized businesses (MSBs) at a faster pace than before. The potential value of analytics, once thought useful only to larger global enterprises, has quickly moved downstream. As a consequence, spend in this space is also expected to grow faster than in large-scale enterprises.

Midsize businesses across domains can now gain a significant competitive advantage, get valuable insights to identify potential markets, and build a basis for improving customer experience and operational efficiency.

However, more money is spent coping with the massive influx of available data than on realizing the analytical value that technology offers. To influence business results meaningfully, MSBs must take a multi-dimensional view of Analytics Value, considering it across the following three dimensions:

Intent, Commitment, and Clarity of Business Impact

It is imperative to understand the purpose and focus areas where Big Data and Analytics have the most potential. Here are a few key questions:

  • What do we want to use big data for? Strategic? Tactical?
  • How can we monetize the data streams in terms of customer loyalty, revenue growth and/or cost reduction?
  • What are the areas for Business Impact (Customer, Product/Service, Operations, Supplier/Partner, Finance, Risk)?
  • Where within these areas does Big Data and Analytics provide the most sustainable value?
  • How can we use data-driven customer intelligence to understand customer behavior?
  • What specific customer-centric and operational-centric KPIs or metrics provide insights into a particular component of our business? e.g.: Propensity to Buy, Customer Lifetime Value (CLV)
  • How will better insights and information help overcome the most pressing challenges in our business?
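As a concrete instance of one such metric, a first-cut Customer Lifetime Value estimate is simple arithmetic. The formula variant and the sample figures below are illustrative assumptions, not from the text:

```python
def customer_lifetime_value(avg_order_value, orders_per_year, years_retained, gross_margin):
    """Simple CLV estimate: margin on yearly spend times expected retention."""
    return avg_order_value * orders_per_year * years_retained * gross_margin

# Hypothetical customer: $50 orders, 4 per year, retained 3 years, 30% margin
clv = customer_lifetime_value(50.0, 4, 3, 0.30)
print(clv)
```

Real CLV models add discounting and churn probabilities, but even this back-of-the-envelope version makes the KPI concrete enough to discuss with the business.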

Solution Options Built on a Foundation of Analytics

Most medium-sized businesses lack understanding of the various solution options and tools available, and hence hesitate to employ them. It is imperative to select diligently from the plethora of analytics solutions and tools for cost efficiency, process improvement, data governance, and technology.

From a process perspective, they must be able to collect enough internal data, then normalize and combine it with external data sources to identify patterns and behaviors.

A study by IDC revealed that organizations that use diverse data sources, analytical tools (e.g., predictive analytics) and the right set of metrics are five times more likely to succeed and exceed expectations for their projects than those who don’t use these big data strategies.

Skills Training, Gap Analysis, and Lessons Learned

Having a successful initial implementation or Proof of Concept (more is better; timing is of the essence) within 6-8 months is critical. Shortening decision paths and leveraging domain-centric Big Data and Analytics partners and solution providers is essential.

However, identifying and realizing Analytics Value for midsize businesses is about more than just working with external partners. They must know the critical questions to answer based on the data. The business must understand analytics terminology and technology, as well as possess some internal statistical knowledge. We see a few midsized businesses adding a new role – a Data Strategist – to help with growth strategies, streamline business operations, and integrate technology so the business operates more efficiently.

Please feel free to reach out to learn how Orzota has helped organizations across the above dimensions to:

  • Build models targeted to specific use-cases that can be implemented swiftly, with clear business focus
  • Select and deploy targeted data-analytics solutions
  • Adopt the Analytical solutions and tools
  • Realize Analytics Value Faster


Since the explosion of Big Data, we have early adopters as well as a fan base still in the evaluation phase. Everyone is familiar with the four V’s of Big Data: Volume, Variety, Veracity and Velocity. These are fantastic fundamentals, but we need a fifth V.

Introducing the 5V of Big Data

It seems that many are stuck in the definition stage and in endless discussions that don’t lead anywhere. There is a plethora of innovation, and all kinds of companies are providing solutions to the many moving parts of Big Data. In an earlier post, I wrote about the three pillars of Big Data: modernization of the data warehouse, data plenum and data juncture, and insights. All good to this point, but…

Why Do We Need the 5th V (Vision) in Big Data?

  • Vision helps us identify where to start.

We have to envision the complexity and where we ultimately need to take this program. And yes, folks, it is a program. By spending a minute looking into the next few months, we can identify, at least internally at this stage, an initial framework of action items. Remember that, depending on the size of your company, you have to be a team player on this.

  • We can get a better understanding of the playing field.

Big Data has complexity, and this is why it’s so much fun to work in this domain. Your vision should be centered on the goals of your company as a whole! This means it should serve the needs of your internal and external business efforts. Let’s take the example of a B2C company: if you provide products or offerings to consumers, you will have needs around understanding your customers better, how they shop online vs. in store, marketing, and predictive insights. When you are envisioning, take a minute to understand how you will get there by serving the needs of the business requirements.

  • Building the program

One of the final elements of the Vision is to initiate the program. Remember that it will be an effort requiring external assistance and internal coordination. A major accomplishment in this area is being able to quickly accommodate business requirements, thus providing big-data-driven solutions to the company’s users. I can’t stress enough the importance of a managed services model, and here is why: first, you eliminate the complexity, and second, you can deliver faster to the business users. Also remember that there is a need for nurturing and continuous modeling. Finally, the team structure: in team sports, we often hear the term “franchise player”, the person the team is built around. Whether you follow the NFL or the NBA, you can’t go anywhere without a good quarterback or, in basketball, a point guard. Then you build from there: wide receivers, a center; you get the point.

In conclusion, Vision, the fifth V of Big Data, is a catalyst that initiates the steps to successfully build the process for Volume, Velocity, Variety, and Veracity. Remember that we are all Big Data analysts and that analytics, in one way or another, is ingrained in us. From a young age we played sports, looked at stats, and knew what to do or how much time to spend improving our game; the same goes for business. One last point: it takes a team to win the championship, so keep that in mind! There is no need to feel that we’re drowning in or intimidated by data, and there is plenty of innovation out there at the moment. So take a minute to apply the fifth V (Vision) and keep moving forward to derive value from Big Data.

It’s an exciting time for the Orzota team. In February, we will hold some interesting conversations with the local business community in Chicago.

Orzota’s Founder and CTO, Bharath Mundlapudi, and our Director of Sales, Ilias Farfouris, will be presenting at and hosting two events. The topics are Big Data in IoT and modern data architecture using Flink and Hadoop.

Chicago and the Midwest are communities taking huge strides in technology within the Big Data domain. Orzota’s mission and past use-case expertise make it a dynamic contributor to these groups and their sharing of best practices.

On February 15th, we will be part of the IoT business community in Chicago. Please find the info and join us for a great evening of technological conversation fusing Big Data and IoT. Here’s the link: https://www.meetup.com/Internet-of-Things-IoT-Chicago/events/228330908/

On February 16th, we will be part of the Chicago Apache Flink community. Flink is a technology that is growing, and the group is starting efforts in New York and internationally. Our mission as a big data company is to contribute to new and exciting upcoming technologies. We’re excited to bring our domain expertise to the new kid on the block, Flink. Here’s the link: https://www.meetup.com/Chicago-Apache-Flink-Meetup/events/228225429/

Come join us for these fun sessions and interact with peers in the Big Data and IoT communities. Special thanks to the organizers of the IoT Chicago and Flink Chicago communities; thank you for having us!

See you there!


It’s an exciting time for business. Companies have either amassed a tremendous amount of data or are increasing their efforts to ingest the data available from the digital ecosystem we’re all part of, to find value and new profit streams. No matter how we look at it, there is a need to make sense of it all.

It’s time to stop spending precious time defining Big Data and start actually doing something with the data available. Before we do, though, we need vision. In an earlier post, I added the fifth V of Big Data: Vision. It is an integral component of being on the right path to profitability. Enter the five ways to get to insights fast.

  1. Modernize data access

As we engage in this strategy, there is going to be an increasing need for data engineering, perhaps one of the most important layers. Unfortunately, there are many moving parts in Big Data, and with an open source technological framework the learning curve is steep. A good practice is to do an assessment first so you have the roadmap blueprinted.

  2. Data Management & data processing

This step depends on factors such as organizational structure and the capability to ingest and process a variety of data with velocity. Chances are that, in order to be efficient and accommodate business demands, you have to move swiftly. As such, you’ll need solutions. A good model is to look for managed services. It is important to relieve all stakeholders as much as possible.

  3. Predictive insights

A managed services model, per the above, will allow you to focus on how to improve the business. A lot of companies today focus on reporting. That is all good, but elaborate spreadsheets don’t leave you time to explore where the hidden revenue streams are. Time to put the thinking cap on: the paradox here is human capability. Predictive modeling is where the action should be focused. A good rule: this is a continuous process, so don’t treat this step as one-and-done.

  4. Problem solving selection

Big Data methodology can help you solve many problems: improving product recommendations to clients, adjusting content creation based on listening to social feeds, preventive maintenance, fighting diseases, customizing the customer experience, and so on. If you’re starting out, you need to test the waters. By this point you have heard the terminology PoC (Proof of Concept) and use case (the larger version of the problem you have to solve). A good rule: start with a modest problem that can lead to the full use-case solution.

  5. Anticipate Scalability

Revert back to the vision. You will grow, and not all companies have the same capabilities. Your historical data will grow, demand from the business units will grow, and your team will grow. At this stage, keep in mind that Big Data should be tailored to the exact needs of your organization. Stay away from one-size-fits-all models. Choose nimble technology partners, because there is a learning curve and, more importantly, a high degree of customization.

In conclusion, there is a dependency on external partners and a high degree of teamwork required. You have to constantly evaluate whether, at the end of the day, your problems are really Big Data problems; you shouldn’t do Big Data because it’s fashionable, but because you need to extract value for your business. In my next post, I will write about the 5th V of Big Data, so stay tuned.