Who is a Data Scientist?
In case you missed it, one of the hottest new job titles in tech is that of the Data Scientist. In traditional IT, data was managed by engineers who ensured that the data warehouse had relevant data, which business analysts then used to answer business questions. With the rise of big data, driven by the ability to process large amounts of unstructured data (e.g., data from devices), semi-structured data (e.g., all sorts of machine logs), and traditional structured data, the notion of the data warehouse as a central RDBMS for analytics is being disrupted. Why not do the analysis on all this raw data in the big data platform instead? But analysts who use BI tools do not typically have the ability to deal with raw data, let alone “big data”.
Thus was born the role of the data scientist – someone who has a strong computer science background (read: PhD), along with math, statistics, and the business acumen of an analyst. The data scientist explores the data set independently, creates analytical models, and tries to find patterns in the data to answer business questions. Anjul Bhambhri, vice president of big data products at IBM, goes a step further, describing the data scientist role as “part analyst, part artist”. She says, “A data scientist is somebody who is inquisitive, who can stare at data and spot trends. It’s almost like a Renaissance individual who really wants to learn and bring change to an organization.”
Do you need one?
Companies like Facebook, LinkedIn and Google have derived great benefits by using teams of data scientists. Many non-technology businesses have also started hiring them. Macy’s has boosted store sales by 10% using big data analytics done by its data scientists (https://www.cnbc.com/id/100638141). Organizations of all sizes can gain superior insights from data analytics, leading to improved decision making.
Starting a data science project
Before you decide that your company needs a data scientist, it is important to take a step back and have a clear understanding of what problems need to be solved.
It is best to start small in the data science space by picking a particular problem and trying to solve it. Note that this is counter to the general view that data scientists simply explore data and find interesting patterns. For smaller companies and/or those new to big data, the better option is to stay focused on getting results for one initial question. For example, a fairly universal problem for many businesses is understanding why customers are lost (known as customer churn). This can be a very complex analysis involving many different dimensions – pricing, support, product quality, feedback, social media, etc. – and can be a daunting task to undertake.
It is possible to break this down, solve a subset of the problem and continue to build on it.
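As a concrete illustration of starting with a subset of the problem, the sketch below computes churn rate along a single dimension (pricing tier) before worrying about support tickets, social media, and the rest. The data and field names are entirely hypothetical.

```python
# Illustrative sketch: answer one narrow churn question first -- does churn
# rate differ by pricing plan? The records and field names are made up.
from collections import defaultdict

customers = [
    {"id": 1, "plan": "basic",   "churned": True},
    {"id": 2, "plan": "basic",   "churned": False},
    {"id": 3, "plan": "premium", "churned": False},
    {"id": 4, "plan": "premium", "churned": False},
    {"id": 5, "plan": "basic",   "churned": True},
]

def churn_rate_by(records, dimension):
    """Churn rate per value of a single dimension."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for r in records:
        totals[r[dimension]] += 1
        if r["churned"]:
            churned[r[dimension]] += 1
    return {k: churned[k] / totals[k] for k in totals}

print(churn_rate_by(customers, "plan"))
```

Once this one question is answered, the same pattern extends to other dimensions (support interactions, product usage) one at a time, building toward the full analysis incrementally.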
All of this sounds great. Find a data scientist and you can have great new insights into your data and the ability to solve business problems. Except for one small problem – they are hard to find. The McKinsey Global Institute estimates that there will be a shortage of 190,000 data scientists by 2018.
If you can’t find a data scientist, it is possible to build this skill set with a team consisting of a data architect, data engineer, data analytics engineer, and business analyst working cooperatively. This may even be advantageous, as a larger team can handle multiple projects in the long term. Finding people with a narrower skill set may also be easier.
If your organization does not have the expertise to start a data science project, consider partnering with a consulting firm that can help you get started. Bootstrapping from this project may be quicker than trying to start from scratch and learn as you go. Enterprise IT is littered with tales of failed ambitious projects. Don’t be part of that statistic!
A lot of discussion is now underway about the relationship between Big Data Management (BDM) and Master Data Management (MDM). There is general agreement that there should be a symbiotic relationship between BDM and MDM, and that the relationship is likely to be bi-directional. Big data technology can benefit from master data as a starting point for analysis, and it can also help augment or feed new insights and facts into an MDM system.
Big Data and MDM
Even though MDM does not have a Big Data problem per se, the concept of trusted master data entities and data attributes has an important role to play in Big Data analytics. MDM ensures data integrity while playing a key role in the integration of structured and poly-structured data for advanced analysis. MDM initiatives were undertaken to ensure that there is a structured repository of trusted key enterprise data entities – customers, households, products, locations, and many others. For MDM to work with Big Data, the concept of a “single” master data entity and its relationships with other entities must evolve into a multi-dimensional entity that has complex relationships with other master data entities. In other words, master data entities should evolve from a 360° view to a 720° view and start to be the “glue” that can bring together associated transaction, interaction, and observational data. A more or less complete view of core business entities will enable rapid integration with existing business processes.
Data Science and MDM
So why should data scientists care about Master Data and MDM? Master Data refers to the core entities that organizations of all sizes work with – customers, suppliers, partners, products, materials, and employees. Master Data is generally perceived to be “high value” data; it is critical for addressing business problems and is at the heart of every business transaction, application, and decision. More importantly, MDM enables business agility and sets the foundation for information-rich views of data, enabling linkages between structured and unstructured data and showing how they relate to “core” business data. An MDM solution becomes a repository that delivers a single version of the truth for an organization’s critical data entities – customer, product, supplier, and more – helping it make better decisions and achieve better business outcomes.
The data scientist typically analyzes all the data available in a Big Data platform with the goal of discovering previously hidden insights, which in turn can provide a competitive advantage or address a pressing business problem. A data scientist does not simply collect and report on data, but also looks at it from many angles, determines what it means, and then recommends ways to apply it. Data scientists can now leverage the data ecosystem as shown below.
Master Data Management implementation styles are generally classified as Collaborative MDM (integration with business processes and workflows), Operational MDM (consumption by applications and operational systems during transactions) and Analytical MDM (integration with analytical environments). Data scientists should take particular note of Analytical MDM, as the goal of such an initiative is to provide “trusted” data for analytics and augment operational systems with integrated analytics.
Analytical MDM has typically provided accurate dimensional data to data warehouses and data marts, enabling better Business Intelligence. Analytical MDM thus allows accurate master objects and hierarchical structures to be consumed by, and automatically synchronized with, Big Data analytical applications, providing the potential for better analytical outcomes. Conversely, outcomes derived from Big Data analytical solutions (for example, recommendations, cross-sell, and up-sell suggestions) could be important data to persist in the MDM system for use by consumer-facing applications. Hence there is very little doubt that MDM has a key role to play in Big Data analytics.
Orzota’s consulting services offer a comprehensive view of how Big Data Processing, Big Data Analytics, and Enterprise Information Management come together using proven methods, architectures, and leading practices. We can help you get on the journey to implement a Big Data Information Management solution. For more information, please reach out to us at https://www.orzota.com/contact/.
The last five years have seen a revolution in the data management arena, resulting in a paradigm shift in how we view the role and impact of data in our daily lives and societal interactions. There has been a radical change in the approach, design, and implementation of products to collect, store, and manage data to handle what we have come to call the “big data phenomenon.” The focus (and, some say, hype) on big data has led to exciting innovations and (re)engineering of database solutions – so much so that there is a glut of database solutions hitting the market. Even though the relational model has been around for nearly thirty years and powers the majority of IT applications today, database engineers have boldly gone on to question the status quo and come up with newer (and often improved) database products. These alternatives to relational products have come to be called NoSQL, Not-only-SQL, NewSQL, cloud databases, and such.
Wikipedia lists more than 30 different DBMS types (https://en.wikipedia.org/wiki/Database) based on the type of content and data models supported. In addition, we have the newer database types mentioned above. Such rapid growth and glut of database solutions can result in information overload and lead to confusion and indecision in choosing the right database solution tailored to a company’s needs.
The classic approach
Traditionally, when IT folks wanted to get an idea of the database landscape for knowledge, evaluation, and selection, they talked to data experts and more often than not turned to industry analysts for their reports. An illustrative report from a leading industry analyst is shown below:
The drawback of such a report becomes readily apparent – it doesn’t show all the players, especially the upcoming DBMS offerings, and you really have to understand the methodology behind the selection and ranking. So if we need the complete picture, we need a fairly exhaustive list, one that shows how the offerings can be categorized logically and how they fit together. A recent database landscape from the 451 Group (https://blogs.the451group.com/information_management) is a better attempt to show this in detail, as shown below.
A brave new world
Instead of using quadrants, “waves”, hype cycles, and such, the analysts at the 451 Group have drawn inspiration from the map of the London Tube (https://www.tfl.gov.uk/assets/downloads/standard-tube-map.pdf) and the Real Story Group (https://www.realstorygroup.com/images/RSG-Subway-H1-2013-Large.jpg). I find this approach comprehensive and path-oriented, while incorporating additional information dimensions. The broad categories are Relational, Non-Relational, and Grid/Cache; in addition, if you are seeking categories such as cloud, NewSQL, etc., you can locate them quite easily using the supplied key. Interestingly, databases that don’t fit readily into the available categories are referred to as multi-model databases (they take on the characteristics of multiple database types).
As comprehensive as this aims to be, let’s look at how the landscape has changed rapidly in recent times.
What we have is tremendous choice and flexibility in choosing the right database solution(s) – or a literal Chinese menu, depending on how it’s viewed.
With this rich information, it behooves data architects, data engineers, and data scientists to get into the nitty-gritty of the available solutions: which products should be considered or short-listed for their application needs, and the risk-reward equation of going with a particular implementation. Such insight is critical not only to embark on the appropriate solution blueprint but also to handle demands (and situations shown below, however unlikely!) from non-technical stakeholders.
Twenty Questions (or something close)
Anyone considering and/or involved in an Extreme Information Management initiative will benefit by thinking about the questions below:
– Is RDBMS really going the way of the dinosaur?
– What have been the drivers and motivations for startups, academia and established vendors to invest in and develop new database products?
– Open source, solutions from emerging companies or established big vendors?
– What has been the experience of other companies with implementing non-relational database solutions?
– Why do companies such as Netflix, Facebook, LinkedIn, Twitter, Yahoo and other data engineering powerhouses choose to go the non-relational route?
– How do I do a fit-gap analysis and zoom into the appropriate solutions for my business problem?
– Are some database products more suited to an industry problem or need?
– What are the adoption patterns of database products by business needs, use cases, industry verticals and customer segments?
– How do the newer database solutions fit into existing data landscapes?
– Are there skilled resources available for the desired database product?
– What is the viability and reputation of a company offering a particular product?
– Do I understand or need to understand the technology behind the newer database solutions (sharding, CAP, scaling up, key-value pairs, columnar storage, distributed processing, etc.) for me to consider a database product?
– What are the Operations, Administration, Maintenance and Scaling aspects for a database product?
– Are there any reliable case studies and benchmarks available for me to consider a database product?
How can we help?
Orzota’s consulting services can help you get a grip on the brave new world of database systems and help you navigate it. We understand your business, functional, and technical needs, and we use our consulting arsenal of SMEs, points of view, practical implementation experience, fit-gap analysis, and vendor assessment tools to help our clients choose the right solution for today’s and future needs. For more information, please reach out to us at https://www.orzota.com/contact/.
Does your data warehouse use Teradata? Are you concerned about the growing size of your data? Download our white paper to find out how Hadoop can help augment your data warehouse, improve efficiency, and reduce cost.
The NameNode in Hadoop keeps the directory tree of all files in the Hadoop File System. In other words, it keeps some metadata for every file, not the actual file itself.
This particular customer required a very high data ingestion rate and was concerned that the single NameNode would not be able to handle it. In reality, the situation is usually the reverse: client and/or network bottlenecks arise well before the NameNode becomes a bottleneck.
But what if there are tens or hundreds of clients sending the data over a fat pipe and there is no client bottleneck? Wouldn’t the data from each client have to go through the NameNode and cause it to choke?
The answer is no. The client never sends the actual data through the NameNode; rather, clients make connections directly to the DataNodes. So this is unlikely to be an issue even when a large number of clients connect.
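The split described above can be sketched with a toy model. This is a deliberate simplification, not the real HDFS client API: the point is only that the NameNode handles metadata operations (block allocation, placement) while file bytes flow straight from the client to the DataNodes.

```python
# Toy model of the HDFS write path (greatly simplified; not the real API).
# The NameNode records metadata only; file bytes bypass it entirely.
class NameNode:
    def __init__(self):
        self.metadata = {}   # path -> list of (block_id, datanode)
        self.next_block = 0

    def allocate_block(self, path, datanodes):
        block_id = self.next_block
        self.next_block += 1
        node = datanodes[block_id % len(datanodes)]  # trivial placement policy
        self.metadata.setdefault(path, []).append((block_id, node))
        return block_id, node

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}     # block_id -> bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

def hdfs_write(namenode, datanodes, path, data, block_size=4):
    for i in range(0, len(data), block_size):
        # Metadata-only round trip to the NameNode...
        block_id, node = namenode.allocate_block(path, datanodes)
        # ...then the block payload goes directly to a DataNode.
        node.write_block(block_id, data[i:i + block_size])

nn = NameNode()
dns = [DataNode("dn1"), DataNode("dn2")]
hdfs_write(nn, dns, "/logs/app.log", b"12345678", block_size=4)
```

However many clients write concurrently, the NameNode's work per file stays a handful of small metadata records, which is why the data volume itself does not choke it.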
However, metadata (i.e. the directory info) is written to the NameNode. This data is in fact stored in main memory and is the main bottleneck today (in addition to the fact that the single NameNode is a single point of failure – but that’s a different topic and not directly related to scaling).
There are ways to minimize the impact by using very large RAM (128GB is not unheard of) and tuning garbage collection.
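A quick back-of-the-envelope calculation shows why a large heap goes a long way. A widely cited rule of thumb is roughly 150 bytes of NameNode heap per namespace object (file, directory, or block); treat that figure as an approximation for sizing, not a guarantee.

```python
# Rough NameNode capacity estimate using the ~150-bytes-per-object rule of
# thumb (an approximation, not an exact figure).
BYTES_PER_OBJECT = 150

def max_objects(heap_gb):
    """Approximate number of namespace objects a given heap can hold."""
    return int(heap_gb * 1024**3 / BYTES_PER_OBJECT)

heap = 128  # GB of NameNode heap, as mentioned above
objects = max_objects(heap)
# A single-block file costs at least 2 objects (the file plus its block).
files = objects // 2
print(f"{heap} GB heap -> ~{objects:,} objects, ~{files:,} single-block files")
```

At 128 GB of heap this works out to several hundred million single-block files, which is consistent with the scale the largest deployments report.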
These techniques allow companies such as Facebook to store hundreds of millions of files in a single Hadoop cluster.
So to summarize, it is unlikely that the single NameNode will give rise to vertical scalability concerns for the vast majority of Hadoop deployments. Companies such as Yahoo! and Facebook have successfully stored tens to hundreds of millions of files. Of course, such large systems need adequate tuning and monitoring to ensure performance does not suffer.