Data science and data analytics - false trails

I recently spotted “Data analytics vs data science: what better suits your needs?” being shared on LinkedIn. Despite my best efforts, I’m still a sucker for this type of article.

The shared article has significant flaws. I could criticize the grammar, but that would be cheap. I will say that if you’re going to use zettabytes and exabytes you’d better get it right.

This post references articles I believe are more credible, as well as setting out my own views.

Let’s get down to the serious points covered by the shared article (and more):

Big data
Data analytics and data science defined
A phoney war
Non-phoney challenges (goes beyond the article)
Business use of data analytics and data science (goes beyond the article)
Analytics and data science: careers and recruitment

Big data

My clients tend to be life insurance and reinsurance companies, particularly in the protection sector i.e. their products include life insurance, critical illness and income protection. No life insurer I know of uses genuinely big data. In principle their data can easily be stored in traditional professional grade databases such as SQL Server and Oracle - though see below.

There’s nothing wrong with that, if you’re exploiting your “small” data. What’s more worrying is that some companies still aren’t. Here are two examples, based on supposed database constraints.

One company makes no use of the c.30m quotes a year it issues electronically. Its IT department says it doesn’t have space, so it throws them away. Hopefully that’s changed by now.

Another company also makes no use of its quotes data. It claims to be able to predict its competitors’ prices to within 1%, using data science techniques. But its databases can’t store the predictions - or produce them in real time. What a missed opportunity.

The bottom line: it’s better to ignore the big data hype and focus on problem solving.

Data analytics and data science defined

The article belatedly defines data analytics and data science. I’m not really in favour of the definitions used, but it’s the supposed implications that are wrong and potentially damaging. As an aside I don’t buy the article’s suggestion that data science is necessarily about structured data; natural language processing normally starts with unstructured data.

For the record I use the term (data) analytics in two ways, both of which are quite loose:

Analytics. I prefer to use the encompassing analytics word to cover all forms of data analysis, including data science. This use of analytics covers descriptive, predictive and prescriptive analytics as well as non-descriptive but non-predictive analyses.
Data analytics. I use this phrase to encompass two types of analysis:

Descriptive and exploratory analyses which might focus on concepts as simple as percentages and ratios. In my experience these can yield golden insights.
Statistical modelling which may be used for predictions. It has concepts such as information criteria to prevent over-fitting, but doesn’t use the data science toolkit.

With the first definition I include all analysis of data, while the second has data science complementing and building on data analytics to (e.g.) improve (any) predictive power.
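To illustrate the information criteria mentioned above, here is a minimal Python sketch using the Akaike information criterion (AIC = 2k - 2 ln L). It compares a single pooled lapse rate with separate rates per group. The function names and the (events, trials) input layout are my own assumptions for illustration, and it assumes every group has at least one event and one non-event.

```python
import math


def binomial_log_likelihood(events, trials, rate):
    """Binomial log-likelihood, omitting the binomial coefficient.

    The coefficient is identical for every model fitted to the same data,
    so it cancels when AIC values are compared.
    """
    return events * math.log(rate) + (trials - events) * math.log(1 - rate)


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def compare_pooled_vs_grouped(groups):
    """groups: list of (events, trials), one tuple per group.

    Returns (aic_pooled, aic_grouped). The pooled model has one parameter;
    the grouped model has one parameter per group.
    Assumes 0 < events < trials in every group, so the logs are defined.
    """
    total_events = sum(e for e, _ in groups)
    total_trials = sum(t for _, t in groups)
    pooled_rate = total_events / total_trials

    ll_pooled = sum(binomial_log_likelihood(e, t, pooled_rate) for e, t in groups)
    ll_grouped = sum(binomial_log_likelihood(e, t, e / t) for e, t in groups)
    return aic(ll_pooled, 1), aic(ll_grouped, len(groups))
```

When the groups really do share a rate, the extra parameter buys no improvement in fit and the penalty term makes the simpler model win. That is the sense in which the criterion guards against over-fitting.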

A phoney war

Before defining the relevant terms, the article excitedly declares that “… the war between data analytics vs data science is still on.” As UK politicians have got out of the habit of saying: poppycock.

There is simply no such war. Or at least I’ve not heard a data-literate person say there is. However you define the terms, (data) analytics and data science are going to be closely related, so any “war” is likely contrived - perhaps by a person with an agenda.

The article “Data Analytics vs. Data Science: a Breakdown” again gives no hint of a war and makes little mention of big data. Similarly statistical modelling and machine learning complement one another.

Non-phoney challenges

The main challenges which prevent realization of data science benefits may not be technical.

It could be about culture. Harvard Business Review asked “why is it so hard to become a data driven company?” while some think data science is already dead. Both articles agree that culture rather than data science provides the answer to why data science is not delivering all it “promised”.

It could be about decisions. Becoming decision driven not data-driven might help with prioritization and tying data science to value, especially where data scientists are light on domain expertise.

Business use of data analytics and data science

What’s the marginal value of data science in business?

Data science can take you beyond (e.g.) statistical modelling by:

helping you make predictions more robust
automating model building and selection using e.g. Featuretools and DataRobot
providing more complex models e.g. beyond parametric modelling, though take care
learning and making decisions without traditional coding: impressive or scary?

Specific uses of data analytics and data science

Let’s take a classic example from protection: distributor analytics a.k.a. distributor quality management (DQM), which targets lapse, expense and mortality management.

Lapse experience can be managed - with the first order benefit going to the insurer. Depending on the reinsurance structure the insurer wants the policy to continue for as long as possible.

Some distributors may have such high early lapse rates that the business is hardly profitable. An informed insurer can make a judgement on the value of retaining the agent.

Beyond lapses, similar techniques apply to business that cancels before going on risk (“NPWs”) or that cancels in the cooling-off period in the first month (“CFIs”). These results, together with lapse experience, may be a leading indicator of poor mortality experience.
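As a rough sketch of the kind of descriptive analysis involved, the Python below computes NPW, CFI and early lapse rates per distributor. The record layout and the outcome labels are hypothetical, and for simplicity it uses all applications as the single denominator, whereas in practice lapse rates would usually be measured against policies that went on risk.

```python
from collections import Counter, defaultdict

# Hypothetical outcome labels; "in_force" covers everything else.
TRACKED_OUTCOMES = ("NPW", "CFI", "early_lapse")


def early_outcome_rates(policies):
    """policies: iterable of (distributor, outcome) pairs.

    Returns {distributor: {outcome: rate}} for the tracked outcomes,
    where rate = count of that outcome / all applications for the distributor.
    """
    counts = defaultdict(Counter)
    for distributor, outcome in policies:
        counts[distributor][outcome] += 1

    rates = {}
    for distributor, outcome_counts in counts.items():
        total = sum(outcome_counts.values())
        # Counter returns 0 for outcomes a distributor never had.
        rates[distributor] = {
            outcome: outcome_counts[outcome] / total for outcome in TRACKED_OUTCOMES
        }
    return rates
```

Nothing here goes beyond percentages and ratios, which is rather the point.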

Mortality can be managed - for the mutual benefit of insurer and reinsurer. As well as the point on leading indicators above we may/should also monitor:

Socio-economic group (SEG) of applicants. Even with entirely truthful disclosures and consistent underwriting (with no adjustment for SEG) the mortality experience of different SEGs will differ. It is difficult to target this via standard pricing.
Non-disclosure. For a given SEG, age and smoker status we can measure the actual versus expected nature and level of disclosures. Similarly for given SEG and age we can measure the actual and expected smoker declarations. Action can be taken where actual is much lower than expected. This really needs only basic statistics and is better than random testing.
Underwriting decisions. A reinsurer will make a prospective judgement of the impact on mortality of an underwriting philosophy, including underwriting decisions e.g. rated and declines. This is in addition to rather than dependent on DQM.
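The actual versus expected comparison for non-disclosure really does need only basic statistics. Below is a minimal Python sketch, assuming a single expected disclosure rate per distributor (in practice this would be built up from the SEG, age and smoker mix) and a normal approximation to the binomial. The class, field names and the threshold of -2 are illustrative assumptions, not an industry standard.

```python
import math
from dataclasses import dataclass


@dataclass
class DisclosureCell:
    distributor: str
    applications: int
    expected_rate: float      # hypothetical benchmark disclosure rate, strictly between 0 and 1
    actual_disclosures: int


def z_score(cell):
    """Standardized gap between actual and expected disclosures.

    Uses the normal approximation to the binomial, so it is only
    sensible for reasonably large application counts.
    """
    expected = cell.applications * cell.expected_rate
    std_dev = math.sqrt(cell.applications * cell.expected_rate * (1 - cell.expected_rate))
    return (cell.actual_disclosures - expected) / std_dev


def flag_low_disclosure(cells, threshold=-2.0):
    """Return distributors whose actual disclosures sit well below expected."""
    return [cell.distributor for cell in cells if z_score(cell) < threshold]
```

Only the low side is flagged, since the concern is disclosures that are much lower than expected. The same shape of calculation would apply to smoker declarations.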

DQM can be implemented without data science, using more rudimentary analytics and models, as implied by L&G’s Craig Brown in his What’s being done to clean up UK protection distribution? article.

Finally DQM can also add - and be sold as adding - value to agents: this was the emphasis when L&G developed its Early Warning System way back in 2003.

Analytics and data science: careers and recruitment

Careers

There are arguably good reasons not to become a data scientist. First, think beyond the maths:

For those considering a career in data science or commencing their studies, it may serve you well to constantly refer back to the Venn diagram (*) that you will undoubtedly come across. It describes data science as an confluence of statistics, programming and domain knowledge. Despite each occupying an equal share of the intersecting area, some may warrant a higher weighting than others.

Source: Data scientists will be extinct in 10 years

(*) The article is referencing Drew Conway’s data science Venn diagram.

[Image: Drew Conway’s data science Venn diagram]

There are many alternatives to and critiques of this diagram. The suggested danger zone is interesting (and historic?). Many say you can democratize data science and do without the mathematical understanding. Certainly domain expertise results in invaluable sense checks.

While not intended to be part of career planning, potential employers and employees might give it thought: what does it take to get sufficient competence in all three areas?

Good career progress can clearly be made based on having only domain expertise (though this is perhaps getting harder) and there is still demand for programmers.

So you could give up on the supposed “demands” of being a data scientist and go for industry expertise. Or you could retain a greater data flavour.

A major alternative is software engineer. For me “don’t become a data scientist - become a software engineer instead” presents compelling arguments. I don’t buy the idea that mere mortals can’t deal with the maths, and the argument becomes even stronger if you put your software engineer together with someone from “the business” - horrible phrase.

A data engineer has a more natural data focus. Datacamp provides training in data science and has written the article data scientist versus data engineer. For me the data engineer role seems mundane versus the data analyst and data science alternatives.

Many data scientists still have to spend a lot of time cleaning data, though it’s probably less than the oft-reported 80% (for the avoidance of doubt we are talking internal data and dealing with “dodgy” data rather than transforming it into a format which fits the algorithm). Get a good data engineer!

Perhaps you should become an analytics translator? Harvard Business Review suggests analytics translator will become a must-have role. Initially I suspected that the role would seem artificial, but the requirements are significant:

Knowledge of company and industry - domain knowledge is most important
Ability to determine key business challenges and metrics
Working knowledge of AI and analytics to convey business goals to technicians
Must bridge the technical expertise (data engineers and data scientists) and operational expertise (marketing, supply chain, manufacturing etc) and high level management
Sufficient technical experience to understand what models are available (e.g. logistic regression) and their applications (e.g. customer retention) and risks (e.g. overfitting)

Data analytics: the perfect career and role

Coming full circle, data analytics can be regarded as an alternative and complement to data science.

Let’s suppose you can’t get your perfect data scientist (if they exist). You probably have software engineers and data engineers. Why not get or develop an analytics professional who has:

understanding of your industry
deep experience of data analysis and management
coding ability
enough understanding of data science to understand its marginal contribution
the appetite and ability to stretch to implementing the data science
communication skills to liaise with management and software and data engineers

But where would you find one of those?

In truth it may be quite similar to HBR’s analytics translator.

How not to recruit

Or perhaps the perfect data scientist exists and can work magic for you.

A leading consultancy wants an “advanced Data Scientist” with the following experience:

Hands on experience in developing models (end to end)
Logistic regression
Clustering
Decision Tree
Random Forest
Support Vector Machine
Naïve Bayes
Gradient Boosting Machine
Deep learning
Natural Language Processing

This is not an unusual approach and at least it’s not a complete alphabet soup. But the list seems a bit haphazard; we start with an excellent pipeline requirement then go through some algorithms (SVM, Naïve Bayes etc) mixed with some other areas to which algorithms are applied (e.g. NLP).

But you’re unlikely to get an advanced data scientist based on the requirements.

Aside from the pipeline requirement, the areas and techniques can be picked up from Udemy courses. The interviewer might detect insufficient experience, but this list risks failing to attract the “advanced” applicants while also alienating outstanding mathematicians and coders with appropriate domain experience who may simply never have learned these particular algorithms.

Going forward, the skill set collectively known as data science will be borne by a new generation of data savvy business specialists and subject matter experts able to imbue analysis with their deep domain knowledge, irrespective of whether they can code or not.

Source: Data scientists will be extinct in 10 years

Some of the most successful life insurers have been (relatively!) late to the data science party. Based on decades of analytics experience they know what they want and the marginal value data science can bring versus (data) analytics. They were beating their peers without data science.

In conclusion

The perfect data scientist is hard to define, let alone find.

Don’t buy the hype. Read credible articles. Do your own thinking. Recruit wisely.

Final disclosures

I have worked for L&G (over 2001-2004, including on their pricing, analytics and distribution) and have a substantial (though not life-changing) position in their shares. I have deep experience of pricing, reinsurance, data and analytics and can get by in (but am not expert in) Python and data science. All of this must have influenced the post, for better or worse.