Insights

Data science: the pragmatic approach to AI

Mon 21 Jun 2021

One of the main challenges for companies now is which AI technology to go for.

Mohamed El Mrabti

Putting data at the heart of a company's decisions by adopting a "data-centric" approach is not trivial. Companies have to face several challenges: skills, technological choices, organization... If Data Science has been on the rise in recent years, it is not by chance: this discipline makes it possible to structure projects through a pragmatic approach.

The term AI encompasses data-driven challenges, opportunities and technologies. While the core technologies of AI have been around for several decades, it is only recently that interest in AI and its potential applications has accelerated among large enterprises. The advent of Big Data has profoundly contributed to this evolution, with an exponential increase in the ability to collect and process huge volumes of data.

One of the main challenges for companies now is deciding which AI technology to go for. Two main characteristics help guide this choice:

  • The attempt to reproduce, with artificial systems, human cognitive capacities that cannot easily be defined by rules (e.g. facial recognition, voice recognition, image understanding...);
  • The notion of machine learning (supervised or unsupervised), i.e. the machine's capacity to improve its performance through autonomous and iterative analysis of its results (e.g. prediction, fraud and anomaly detection, recommendation...).

The different stages of Data Science

In recent years, the science of data analysis, or Data Science, has grown rapidly. This discipline allows raw data to be explored and analyzed in order to transform it into valuable information for companies. In other words, Data Science aims to put data at the heart of decisions!

Therefore, each step of this approach is key to making the best decision. We have identified six main steps, which we will illustrate with an example for better understanding: customer churn (the loss of customers or subscribers).

Understanding the business

This involves defining the business perimeter and expectations in order to reformulate the request and establish the precise framework of the study.

In our example, the challenge is to identify the customers with the highest risk of churn in order to better target marketing campaigns.

In the telecom domain, the data related to churn are the reasons (termination, suspension, portability...), the period and the type of churn (total or relative).

Knowledge and preparation of the data

It is important to master the information that will solve the problem. This step, which is often time-consuming, allows us to identify different categories of data: irrelevant data (duplicates, incomplete data, outliers, etc.), missing data, relevant data, and data to be transformed for analysis.
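As an illustration, these categories can be spotted with a few lines of pandas. The dataset, column names and median imputation below are hypothetical choices made for the sketch, not prescriptions:

```python
import pandas as pd
import numpy as np

# Hypothetical customer dataset with the kinds of issues described above.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4, 5],
    "monthly_revenue": [30.0, 45.0, 45.0, np.nan, 38.0, 900.0],
    "contract_months": [12, 24, 24, 6, np.nan, 12],
})

# Irrelevant data: drop exact duplicates.
df = df.drop_duplicates()

# Missing data: count the gaps, then impute (here, with the median).
missing_per_column = df.isna().sum()
df["monthly_revenue"] = df["monthly_revenue"].fillna(df["monthly_revenue"].median())
df["contract_months"] = df["contract_months"].fillna(df["contract_months"].median())

# Outliers: flag values far from the mean (z-score above 3 is one common rule).
z = (df["monthly_revenue"] - df["monthly_revenue"].mean()) / df["monthly_revenue"].std()
outliers = df[z.abs() > 3]
```

Median imputation is only one option; depending on the business meaning of a field, dropping the rows or imputing per segment may be more appropriate.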

In our example on customer churn, it is important to define the sources of the data to be collected, such as type of usage, revenues detailed by type of usage, calls to the service center, complaints, and contract information (end date, duration, segment).

Data analysis

The objective of this step is to cross-reference the different types of data and establish correlations between them. It can be interesting at this stage to explore the data using graphs and descriptive statistics in order to identify:

  • Fields with atypical distributions (difficult to model);
  • Highly correlated fields (keep the most relevant according to the business);
  • Fields that need to be transformed for the analysis (too many modalities that need grouping, dates, timestamps...).
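The three checks above can be sketched with descriptive statistics and a correlation matrix in pandas. The synthetic data, column names and 0.9 correlation threshold are assumptions for illustration:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
n = 200
voice_minutes = rng.normal(300, 50, n)
# Hypothetical usage data: two highly correlated fields and one skewed field.
df = pd.DataFrame({
    "voice_minutes": voice_minutes,
    "voice_revenue": voice_minutes * 0.1 + rng.normal(0, 1, n),  # near-duplicate field
    "data_gb": rng.exponential(2.0, n),  # atypical (skewed) distribution
})

# Descriptive statistics reveal atypical distributions.
print(df.describe())

# The correlation matrix reveals redundant fields: keep the most
# business-relevant one from each highly correlated pair.
corr = df.corr()
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and abs(corr.loc[a, b]) > 0.9]
```

In practice this exploration is usually paired with histograms and scatter plots rather than numbers alone.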

In our churn case, we need to:

  • Collect the identified data with a significant history (six months, for example);
  • Define a number of indicators to look for possible correlations, such as the "churn rate", the percentage of customers who abandon an offer;
  • Draw a number of analyses, such as: the churn rate is high for accounts with more lines; X% of customers cancel at least Y% of their lines; Z% of churn is related to line suspensions; and high churn is more likely to come from customers who pay in cash, have no-commitment contracts, and have very low data and roaming usage.
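The "churn rate" indicator and one of these cross-analyses can be computed directly with pandas. The miniature dataset and its figures below are invented purely to show the mechanics:

```python
import pandas as pd

# Hypothetical history: one row per customer over the observation period,
# with a flag marking whether the customer churned.
history = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "lines":       [1, 1, 2, 3, 1, 4, 2, 1, 3, 1],
    "churned":     [0, 0, 1, 1, 0, 1, 0, 0, 1, 0],
})

# Churn rate: percentage of customers who abandoned the offer.
churn_rate = 100 * history["churned"].mean()  # 40.0

# Cross-analysis: churn rate by number of lines, to test whether
# accounts with more lines churn more often.
by_lines = 100 * history.groupby("lines")["churned"].mean()
```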

Knowledge modeling

This step corresponds to the machine learning phase where we choose the type of statistical model (supervised or unsupervised) to use.

There are many cheat sheets that classify learning use cases and the algorithms used to solve the associated problems.

Here are some examples:

https://www.rankred.com/machine-learning-cheat-sheets/

https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-cheat-sheet

https://www.hebergementwebs.com/news/beat-the-heat-with-machine-learning-cheat-sheet


In this step, consider separating the data randomly into three subsets:

  • The first set of data will be used to build the training model (training data);
  • The second will be used to test the relevance of the model and choose the best one (test data);
  • The third will be used to validate the model and evaluate the performance (validation data).

This allows us to build the models on the training data (optimized by the test data) and to keep the best performing one on the validation data.
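One common way to obtain such a three-way random split is two successive calls to scikit-learn's train_test_split. The 60/20/20 proportions below are one possible choice, not prescribed here:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and churn labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First split off the validation set (20%), then split the remainder
# into training and test sets.
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

# 0.25 of the remaining 80% gives 20% of the original data.
print(len(X_train), len(X_test), len(X_val))  # 600 200 200
```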

In our example, after the cross-analyses of the data, we could conclude that the churn model should be used to:

  • Prioritize at-risk customers in proactive retention campaigns;
  • Assign a probability between zero and one to each customer (the closer the probability is to one, the more likely the customer is to leave).

For these use cases, the recommended algorithms are based on supervised learning. To choose, we could rely on one of the algorithm selection trees above; candidate algorithms include Logistic Regression, Decision Trees, Random Forest, Gradient Boosted Trees and Naive Bayes.
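All of these candidates are available in scikit-learn with a uniform interface, so they can be trained side by side on the prepared data. The synthetic dataset stands in for a real churn table; each model then assigns every customer a churn probability between zero and one, as described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a prepared churn dataset (label 1 = churner).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosted_trees": GradientBoostingClassifier(random_state=0),
    "naive_bayes": GaussianNB(),
}

# Fit each candidate and keep its per-customer churn probabilities.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.predict_proba(X_test)[:, 1]
```

These probabilities are exactly what the retention campaign needs to prioritize at-risk customers.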

Performance evaluation of the different models before industrialization

At this stage, the aim is to select the best model, as the model selected at the previous stage is not necessarily the best in terms of performance.

Performance evaluation is a discipline in its own right. It requires strong skills in statistics (a common method is the AUC, the Area Under the ROC Curve).
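As a minimal sketch of how AUC compares models, scikit-learn's roc_auc_score takes the true labels and each model's predicted probabilities. The labels and scores below are invented: model A ranks every churner above every non-churner, so its AUC is 1.0:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true churn labels and predicted probabilities from two models.
y_true  = np.array([0,   0,   1,   1,   0,   1,   0,   1])
model_a = np.array([0.1, 0.3, 0.8, 0.7, 0.2, 0.9, 0.4, 0.6])
model_b = np.array([0.5, 0.6, 0.4, 0.7, 0.3, 0.5, 0.6, 0.4])

# AUC = 1.0 is a perfect ranking; 0.5 is no better than chance.
auc_a = roc_auc_score(y_true, model_a)  # 1.0
auc_b = roc_auc_score(y_true, model_b)
print(auc_a, auc_b)
```

To follow the split described earlier, this comparison should be run on the held-out validation data, not on the data used for training.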

Industrialization

The data scientist's ability to present the results in a clear and pedagogical way is at least as important as their ability to collect and analyze the data.

After selecting the best model, the company will have to decide whether to deploy it (performance in line with the objective), improve it (by injecting more data) or abandon it (the model has not proven itself, producing poor predictions).

Each of these steps is crucial in the implementation of a "data-centric" project to give it every chance of success.

The implementation of a data science approach is not a neutral act and can disrupt the way a business operates, sometimes for several years. Beyond the necessary data science skills, the success of a "data-centric" project depends on the company's ability to adapt its ways of working and to change certain aspects of its organization in light of the modeling results.

Extract from our white paper : Challenges and advancements in the era of data and artificial intelligence

Mohamed El Mrabti

IT & Networks Maintenance Director