In Europe and around the world, the volume of data transmitted over mobile networks is skyrocketing. With the advent of cloud and virtualization and the introduction of 5G Standalone (5G SA), network architecture is becoming more heterogeneous and complex. Current maintenance and equipment supervision techniques are limited in responsiveness and proactivity. In this context, implementing new cloud- and AI-based techniques to anticipate and detect network incidents has become a top priority for operators worldwide, including Orange.
Network incidents are often detected minutes or even hours after they occur. In the meantime they disrupt customers, who sometimes trigger the incident report themselves by calling customer support, and the result is overall customer dissatisfaction. Supervision based on static thresholds requires those thresholds to be reviewed and maintained periodically, and it detects only spikes, i.e., anomalies following major outages, so less significant anomalies are difficult or impossible to detect. Furthermore, alarm analysis is a complex task because of the volume and repetitiveness of alarms and because their interpretation varies from one equipment manufacturer to another.
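To make the limitation of static thresholds concrete, here is a minimal sketch that compares a fixed threshold with a simple rolling z-score check. It is an illustration under stated assumptions, not the method used in the project: the KPI values, the threshold of 50, the window of 5 samples, and the limit of 3 standard deviations are all invented for the example.

```python
from statistics import mean, stdev


def static_threshold_alerts(values, threshold):
    """Return the indices of samples that exceed a fixed threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


def rolling_zscore_alerts(values, window=5, z_limit=3.0):
    """Return the indices of samples that deviate from the mean of the
    preceding `window` samples by more than `z_limit` standard deviations."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change as a deviation.
            if values[i] != mu:
                alerts.append(i)
        elif abs(values[i] - mu) / sigma > z_limit:
            alerts.append(i)
    return alerts


# Hypothetical KPI samples, invented for illustration only.
kpi = [10, 11, 10, 12, 11, 10, 11, 18, 11, 10, 12, 95]

static_alerts = static_threshold_alerts(kpi, threshold=50)
rolling_alerts = rolling_zscore_alerts(kpi, window=5, z_limit=3.0)

print(static_alerts)   # indices flagged by the fixed threshold
print(rolling_alerts)  # indices flagged by the rolling z-score
```

On this invented series, the fixed threshold flags only the final spike, while the rolling check also flags the smaller excursion to 18. A real deployment would also need to handle seasonality and the way past anomalies contaminate the trailing window, which this sketch ignores.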
At Orange, the strategic program places data and AI at the heart of the innovation model. This is why the AI Empowered Networks program was launched to develop responsible and sustainable AI for smarter networks and improved operational efficiency. Another objective of the program is to provide support to Orange subsidiaries in developing and implementing their network use cases. The predictive maintenance project (core network and end-to-end) is part of this program.
To strengthen the teams of data scientists and engineers in the development of network use cases, automation, and cloudification, as well as the management of trials with external suppliers and partners, Orange Innovation has engaged Sofrecom to provide a project manager with strong network expertise to lead the development, define use cases with business owners, and coordinate with program directors.
Development of use cases in agile mode with operational teams from subsidiaries
The group's subsidiaries have shown a keen interest in using AI to help operational teams detect, and even predict, incidents that can impact the network, services, and customers. Work was organized in agile mode with appropriate collaboration tools in order to develop Minimum Viable Products (MVPs) flexibly, deliver product versions to the client regularly, and receive their feedback. This way of working allowed us to communicate regularly with the client, adapt to changes, and prioritize the work efficiently. Depending on the needs of each use case, we used the Scrum framework (with ceremonies such as the daily standup, sprint planning, sprint review, and retrospective) or Kanban, along with collaboration tools such as Confluence (for documentation management), Jira (for tracking tickets, user stories, etc.), and Microsoft Teams, among others.
Development and deployment steps on the cloud
An innovative methodology was adopted to optimize deployment costs. For each use case, the iterative process consists of the following steps:
- The first step is to define the business case with the client to properly frame the problem to be solved, the data to be utilized, and the type of degradation to be detected.
- In a test environment, the "Data Exploration" phase can begin, which involves exploring the large volume of data and analyzing trends, statistical characteristics, and correlations.
- After data exploration, data scientists select the best algorithms for anomaly detection or, when possible, prediction.
- The next step is to set up the data pipeline and feature engineering on the cloud to process and clean raw data, making it usable for Machine Learning (ML) algorithms.
- Then, the ML models are implemented on the cloud platform.
- In the evaluation stage, the results are studied and analyzed. The team may iterate back to previous stages to improve detection performance.
- The deployment phase follows, during which a deployment environment is created and set up after defining an operating model that outlines roles and responsibilities for each party.
- To monitor the model's performance, a supervision system is established.
Because the methodology is iterative, any step can be revisited and improved with minimal impact on the others, leading to better performance.
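The data pipeline, feature engineering, model, and evaluation steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the project's implementation: the record fields (`attempts`, `failures`, `label`), the failure-ratio feature, the median absolute deviation (MAD) detector, and the cutoff `k=5.0` are all hypothetical choices made for the example.

```python
from statistics import median


def clean(records):
    """Data pipeline: drop records with missing counters or no attempts."""
    return [
        r for r in records
        if r.get("attempts") is not None
        and r.get("failures") is not None
        and r["attempts"] > 0
    ]


def to_features(records):
    """Feature engineering: failure ratio of each record."""
    return [r["failures"] / r["attempts"] for r in records]


def detect(features, k=5.0):
    """Model step: flag values whose distance to the median exceeds
    k times the median absolute deviation (MAD)."""
    med = median(features)
    mad = median(abs(x - med) for x in features)
    if mad == 0:
        return [x != med for x in features]
    return [abs(x - med) / mad > k for x in features]


def evaluate(predicted, labels):
    """Evaluation step: precision and recall against expert labels."""
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum(not p and l for p, l in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical raw records, invented for illustration only.
raw = [
    {"attempts": 1000, "failures": 10, "label": False},
    {"attempts": 1200, "failures": 12, "label": False},
    {"attempts": None, "failures": 5, "label": False},   # removed by clean()
    {"attempts": 900, "failures": 18, "label": False},
    {"attempts": 1000, "failures": 200, "label": True},
    {"attempts": 1100, "failures": 11, "label": False},
    {"attempts": 1000, "failures": 20, "label": False},
]

cleaned = clean(raw)
features = to_features(cleaned)
predicted = detect(features)
precision, recall = evaluate(predicted, [r["label"] for r in cleaned])
print(predicted, precision, recall)
```

Keeping each step as a separate function mirrors the iterative process: the feature or the detector can be replaced after an evaluation round without touching the other steps.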
Generalization of a replicable framework for use cases
Experimentation on individual use cases has led to a common, replicable approach for network anomaly detection. The generalized framework allows end-to-end automation and easy integration of new use cases. It is made up of several modules, starting with data ingestion (sending data to the platform in near-real-time), followed by the data pipeline (cleaning and transforming raw data into features) and the ML pipeline (applying ML models to those features). A customizable data visualization module with dashboards provides a graphical representation of data and results. Finally, a feedback loop module allows business experts to annotate anomalies (by confirming or rejecting them) after receiving a notification or consulting the dashboards.
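As an illustration of the feedback loop module described above, the following minimal sketch records detected anomalies and lets an expert confirm or reject each one. The class names, fields, and the confirmation-rate metric are hypothetical and chosen for the example; the source does not describe how the actual module is built.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Anomaly:
    anomaly_id: str
    kpi: str
    verdict: Optional[bool] = None  # None means not reviewed yet


@dataclass
class FeedbackLog:
    anomalies: Dict[str, Anomaly] = field(default_factory=dict)

    def notify(self, anomaly: Anomaly) -> None:
        """Register a detected anomaly awaiting expert review."""
        self.anomalies[anomaly.anomaly_id] = anomaly

    def annotate(self, anomaly_id: str, confirmed: bool) -> None:
        """Store the expert's confirmation (True) or rejection (False)."""
        self.anomalies[anomaly_id].verdict = confirmed

    def pending(self) -> List[str]:
        """Identifiers of anomalies that have not been reviewed."""
        return [a.anomaly_id for a in self.anomalies.values() if a.verdict is None]

    def confirmation_rate(self) -> Optional[float]:
        """Share of reviewed anomalies that experts confirmed."""
        reviewed = [a for a in self.anomalies.values() if a.verdict is not None]
        if not reviewed:
            return None
        return sum(a.verdict for a in reviewed) / len(reviewed)


# Hypothetical anomalies and KPI names, invented for illustration only.
log = FeedbackLog()
log.notify(Anomaly("a1", "session_failure_ratio"))
log.notify(Anomaly("a2", "attach_latency"))
log.notify(Anomaly("a3", "session_failure_ratio"))

log.annotate("a1", confirmed=True)
log.annotate("a2", confirmed=False)

rate = log.confirmation_rate()
pending = log.pending()
print(rate, pending)
```

Annotations collected this way could serve as labels when a model is evaluated or retrained, which is one plausible use of a confirm-or-reject feedback loop.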
Balancing motivation and resistance to change among some clients
This two-year mission took place during the pandemic, within a multidisciplinary team drawn from various countries; the clients were technical teams from European subsidiaries. Thanks to the collaboration tools, regular agile ceremonies, and several in-person workshops, collaboration and exchanges were smooth and fruitful.
Artificial intelligence is a topic of great interest and sits at the core of the group's priorities, and working on it allowed me to develop skills in cloud, big data, and ML. During this project, the client-side stakeholders ranged from network experts to team leaders and operations team managers. Some were highly motivated by AI-related topics and saw great potential for automating daily tasks and detecting anomalies that are difficult or impossible to detect with current methods. Their motivation and involvement led to concrete results on some use cases. Other collaborators greeted these projects with great skepticism, fearing that their work would be disrupted and changed. Change therefore requires more training and acclimatization to AI, since the ultimate goal is to assist teams in their work and move towards incident resolution for maximum customer satisfaction.