In today’s fast-paced digital environment, telecom operators face numerous challenges in maintaining network performance and ensuring customer satisfaction. Traditional methods of incident management are often cumbersome, reactive, and inefficient, struggling to keep pace with the complexities of modern telecom networks. To address these issues, the telecom industry is increasingly turning to advanced artificial intelligence (AI) solutions. This article explores a cutting-edge Catalyst project that leverages Generative AI (GenAI) technologies to transform incident management, highlighting its architecture, use cases, and the profound benefits it offers.
INTRODUCTION TO INCIDENT CO-PILOT
Traditional telecom incident management processes often involve a significant amount of manual intervention, from diagnosing issues to coordinating with field maintenance engineers. This approach not only increases the time required to resolve incidents but also amplifies the risk of human error. Incident Co-Pilot addresses key questions posed by Network Operations Center (NOC) operators and engineers by leveraging historical data and real-time alarms. It performs semantic analysis of trouble tickets to facilitate a comprehensive understanding of network incidents and provides actionable insights for faster incident resolution.
The Co-Pilot aims to revolutionize telecom incident management by integrating GenAI-driven capabilities into operational workflows. Built by a multi-vendor, multi-operator team, Incident Co-Pilot integrates agents for incident management, providing data correlation and actionable insights to NOCs and incident resolution teams by harnessing multiple techniques and Large Language Models (LLMs). It retrieves relevant information from various data stores and presents it in a tailored manner: it provides a concise summary to a NOC engineer and also generates a plot to represent the output visually.
The Co-Pilot lets users interact with their data without assistance from other team members, so NOC Directors and CXO-level employees can perform similar analyses in a streamlined, efficient, and insightful manner.
THE TEAM
- AIS, China Mobile, China Unicom, du, Orange, MTN, Telkomsel, Telefonica, Telecom Italia, Safaricom, and Vodafone. Project Champions that shared their knowledge, defined scenarios for solving real-life telecom incident and ticket management problems and challenges, and guided the Incident Co-Pilot development overall.
- Huawei. Project and technical lead. Implemented the incident management flow for all telecommunications incidents by using Retrieval-Augmented Generation (RAG) technologies, Chain of Thought prompting, and agent-based workflows, and provided a custom GPT version of Incident Co-Pilot.
- MEF.DEV. Project and technical co-lead. Provided the integration layer platform, along with the integration guidance and support needed to ensure seamless operations and multi-vendor orchestration based on incident diagnosis. Also created tickets for proactive ticket management and performed incident root-cause analysis using different foundation LLMs.
- Infosys. Leveraged its AI-based network assurance solution for observability, diagnostics, and automation of various network domains. Also offered next-best-action recommendations for “Device” incident types.
- Qvantel. Responsible for ticket and case management implementation. Qvantel received TMF621 Trouble Tickets from the MEF.DEV enterprise integration platform.
- Universita degli Studi di Milano. Provided synthetic data for the telecom incident test dataset and shared research on federated learning and LLMs.
THE CHALLENGE
In the rapidly evolving world of telecommunications, managing network incidents effectively has become increasingly complex. Our Incident Co-Pilot project team performed an in-depth analysis and identified the most harmful challenges:
- Manual intervention and cumbersomeness. Traditional incident management involves extensive manual intervention, from diagnosing issues to coordinating with field maintenance engineers. This process is often labor-intensive, slow, and prone to human error.
- Complexity of modern telecom networks. The increasing complexity of telecom networks makes it challenging for traditional methods to keep pace. Managing incidents efficiently requires handling large volumes of data and correlating information from various sources.
- Slow response times. Traditional methods can result in delayed incident resolution due to the time required for manual analysis, ticket creation, and coordination with maintenance teams.
- High risk of human error. With significant reliance on human input for diagnosing and resolving incidents, there is a higher risk of errors that can lead to prolonged downtime and increased operational costs.
- Inefficient use of resources. Manual processes can lead to inefficient use of resources and hinder the ability of NOC engineers to focus on more complex issues.
- Lack of proactive insights. Traditional systems often cannot provide proactive insights and recommendations, leading to reactive rather than proactive incident management.
THE SOLUTION
The Incident Co-Pilot leverages a sophisticated multi-agent architecture designed to enhance the efficiency and effectiveness of NOC engineers. This architecture seamlessly integrates advanced AI technologies, including LLMs and real-time data processing, to deliver comprehensive incident management and resolution capabilities.
The process begins with the NOC Engineer, who initiates the workflow by providing prompts to the Incident Co-Pilot. These prompts, ranging from simple queries to complex commands, are the starting point for the incident management process. The Incident Co-Pilot serves as the central orchestrator. It interprets the NOC Engineer’s prompts, sets goals, and provides context for the tasks at hand. Acting as the hub of the system, the Incident Co-Pilot determines which specialized agents to instantiate from the Agent Repository based on the nature of the incident.
Once the appropriate agent is selected, the Network Incident Agent is instantiated. This agent refines the engineer’s prompt by adding necessary context and querying a comprehensive Vectorized Knowledge Base and Real-Time Data sources. This allows the agent to gather all relevant information needed to address the incident thoroughly. The system utilizes various Foundation Models such as Llama 3, Google Gemma, OpenAI GPT-4, and Mistral. These models are integrated to enhance the response capabilities of the Network Incident Agent, ensuring that the information and solutions provided are both sophisticated and relevant.
The Agent Knowledge component is crucial to the operation. It comprises two main parts:
- Vectorized Knowledge Base. This includes unstructured text, Q&A pairs, rules, constraints, and knowledge graphs. It provides a deep well of contextual information that the agents draw upon to perform their tasks.
- Real-Time Data. This component includes current network data such as alarms and topology, ensuring that the agents’ actions are based on the latest available information.
In short, the overall workflow follows a clear and systematic process:
- The NOC Engineer issues a prompt.
- The Incident Co-Pilot receives the prompt, sets the goal, and adds context.
- The Incident Co-Pilot instantiates the Network Incident Agent.
- The Network Incident Agent queries the Vectorized Knowledge Base and Real-Time Data sources.
- The Agent refines the prompt and sends it to the LLM for a response.
- The LLM processes the refined prompt and generates a response.
- The Network Incident Agent uses this response to address the incident.
- The results are returned to the Incident Co-Pilot and presented to the NOC Engineer.
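The steps above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the project's implementation: the class names mirror the components named in this article, but every method, the keyword-based lookup, and the goal string are assumptions made for the example, and the LLM is a plain callable so any model could be plugged in.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AgentKnowledge:
    """Hypothetical stand-in for the Vectorized Knowledge Base and Real-Time Data."""
    documents: List[str] = field(default_factory=list)
    alarms: List[str] = field(default_factory=list)

    def lookup(self, prompt: str) -> List[str]:
        # Naive keyword overlap; a real system would use vector similarity search.
        words = {w.lower().strip(".,?") for w in prompt.split()}
        return [d for d in self.documents if words & set(d.lower().split())]


class NetworkIncidentAgent:
    def __init__(self, knowledge: AgentKnowledge, llm: Callable[[str], str]):
        self.knowledge = knowledge
        self.llm = llm

    def handle(self, goal: str, prompt: str) -> str:
        # Query the knowledge base and the real-time data.
        context = self.knowledge.lookup(prompt)
        # Refine the engineer's prompt with the goal and the gathered context.
        refined = "\n".join([
            f"Goal: {goal}",
            "Knowledge: " + "; ".join(context),
            "Active alarms: " + "; ".join(self.knowledge.alarms),
            f"Question: {prompt}",
        ])
        # Send the refined prompt to the LLM and return its response.
        return self.llm(refined)


class IncidentCoPilot:
    """Central orchestrator: sets the goal and instantiates an agent from the repository."""

    def __init__(self, agent_repository: Dict[str, Callable[[], NetworkIncidentAgent]]):
        self.agent_repository = agent_repository

    def ask(self, prompt: str) -> str:
        goal = "Diagnose the incident and suggest a resolution"  # assumed goal text
        agent = self.agent_repository["network_incident"]()
        return agent.handle(goal, prompt)
```

Keeping the agent factory in a dictionary mirrors the idea of an Agent Repository: adding another specialized agent would mean registering another factory, without changing the orchestrator.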
THE WORKFLOW
The BPMN workflow above demonstrates what is happening under the hood of Incident Co-Pilot Catalyst at the multi-vendor incident diagnosis orchestration level.
Orchestration begins with receiving an API request from the NOC engineer’s software to diagnose the incident. At stage 1, MEF.DEV Platform validates the request, checks if Qvantel’s trouble ticketing API is available, and creates the corresponding trouble ticket using TM Forum’s standardized TMF621 Trouble Ticket API.
Next, the platform checks which vendor will perform incident diagnosis, selecting between three options depending on the incident category:
- For device-based incidents, the gateway forwards further incident diagnosis to Infosys (stage 2.2);
- For network-based incidents, MEF.DEV Platform performs Root Cause Analysis (RCA) to diagnose the incident (stage 2.3);
- For all the remaining incident categories, the gateway forwards further incident diagnosis to Huawei (stage 2.1).
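The routing rule described in this list amounts to a small dispatch function. The sketch below is an assumption-laden illustration: the category strings "device" and "network" and the enum names are hypothetical labels chosen for the example, and the actual platform expresses this decision as a BPMN gateway rather than Python code.

```python
from enum import Enum


class Diagnoser(Enum):
    """Who diagnoses the incident; values are the stage labels used in the workflow."""
    HUAWEI = "stage 2.1"
    INFOSYS = "stage 2.2"
    MEF_DEV_RCA = "stage 2.3"


def select_diagnoser(incident_category: str) -> Diagnoser:
    """Pick the diagnosing party from the incident category (labels are assumed)."""
    category = incident_category.strip().lower()
    if category == "device":
        return Diagnoser.INFOSYS
    if category == "network":
        return Diagnoser.MEF_DEV_RCA
    # Every remaining category falls through to the default branch.
    return Diagnoser.HUAWEI
```

For example, `select_diagnoser("Device")` returns `Diagnoser.INFOSYS`, while an unlisted category such as `"service"` falls through to `Diagnoser.HUAWEI`.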
If the incident is network-based, the platform calls another BPMN flow (presented below) that generates a suggested resolution according to the TMF724 Incident Management API model.
The RCA begins with checking the list of alarms. If the list of alarms is present, the platform prepares a request to the corresponding LLM by building embeddings, referring to a database of manuals and a history of alarms from other vendors (stages 1 & 2). The goal is to identify the alarm that initially caused the incident and define how to resolve it using vendors’ documentation. At this step, MEF.DEV Platform also adds information about network topology (stage 2).
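To make the retrieval idea concrete, here is a minimal sketch of how alarm text might be matched against vendor documentation and combined with topology. It is not the platform's code: the bag-of-words `embed` function is a toy substitute for a real embedding model, and all function names, the topology format, and the output layout are assumptions for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(alarms: List[str], documents: List[str], top_k: int = 2) -> List[str]:
    """Return the top_k documents most similar to the combined alarm text."""
    query = embed(" ".join(alarms))
    ranked = sorted(documents, key=lambda d: cosine(query, embed(d)), reverse=True)
    return ranked[:top_k]


def build_rca_context(alarms: List[str], documents: List[str],
                      topology: Dict[str, List[str]], top_k: int = 2) -> str:
    """Combine alarms, retrieved documentation, and topology into one context string."""
    if not alarms:
        # The RCA flow only proceeds when an alarm list is present.
        raise ValueError("RCA needs a non-empty alarm list")
    links = [f"{node} -> {', '.join(peers)}" for node, peers in topology.items()]
    return "\n".join([
        "Alarms: " + "; ".join(alarms),
        "Documentation: " + " | ".join(retrieve(alarms, documents, top_k)),
        "Topology: " + "; ".join(links),
    ])
```

The resulting context string is what would be handed to the prompting step. Swapping `embed` for a real embedding model would leave the rest of the sketch unchanged.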
Next, the platform performs prompt engineering based on either the “One Shot” or the “Chain of Thought” technique. It uses the unified SuggestResolution prompt to access RAG, forming the suggested incident resolution (stage 3).
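The difference between the two prompting techniques can be shown with a small template builder. The wording of both templates, the function name, and the technique identifiers are invented for this illustration. The actual SuggestResolution prompt used in the project is not reproduced here.

```python
def build_suggest_resolution_prompt(context: str, technique: str = "chain_of_thought") -> str:
    """Wrap retrieved context in a one-shot or chain-of-thought prompt (illustrative wording)."""
    if technique == "one_shot":
        # One-shot: show the model a single worked example before the real case.
        guidance = (
            "Example:\n"
            "Alarms: LINK_DOWN on R1, BGP_PEER_LOST on R2\n"
            "Root cause: LINK_DOWN on R1\n"
            "Resolution: inspect the fibre on the R1 port and restore the link\n"
        )
    elif technique == "chain_of_thought":
        # Chain of thought: ask the model to reason through explicit intermediate steps.
        guidance = (
            "Think step by step:\n"
            "1. Order the alarms by time.\n"
            "2. Use the topology to find which alarm the others depend on.\n"
            "3. Name the root-cause alarm.\n"
            "4. Cite the documentation that resolves it.\n"
        )
    else:
        raise ValueError(f"unknown technique: {technique}")
    return f"{guidance}\nContext:\n{context}\n\nSuggested resolution:"
```

Because both variants share the same context and closing cue, the same retrieved material can be tried with either technique and the answers compared.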
Next, the suggested resolution is packed into the incident diagnosis object following the TMF724 API and returned to the main incident diagnosis BPMN flow presented in the first workflow picture (stage 4). Finally, depending on the incident diagnosis feasibility, the platform returns either a successful response in the TMF724 Incident Management API model or an error.
Without strict predefined technical requirements, the MEF.DEV platform became a collaboration hub for all the developer teams involved. Its BPMN + Low-Code approach gave them a clear and logical way to interconnect system components and ensure they work together smoothly. It helped to develop, test, and deploy all the required functionality and business logic processes and adapt them to the Incident Co-Pilot project requirements.
BUSINESS VALUE
Enhanced incident resolution efficiency. The LLM-backed Co-Pilot analyzes large amounts of data from various sources, including alarms, tickets, and customer feedback, to identify patterns and anomalies associated with incidents. It enables quicker and more accurate incident RCA, thus leading to faster incident resolution, reducing downtime, and improving business continuity and operational efficiency.
Next-best action guidance. Incident Co-Pilot provides actionable insights and recommendations based on historical incident data and real-time network conditions. This proactive approach minimizes human error, allowing incident management teams to rely on the Co-Pilot’s guidance, improving decision-making and resolution times.
Time-saving and increased efficiency. Business efficiency is optimized by streamlining workflows and minimizing downtime. Experiments show that Co-Pilot helps NOC engineers resolve incidents 40% faster than when using traditional incident resolution methods.
Improved accuracy and rapid error detection. AI-driven data correlation provides accurate root cause identification and actionable insights. It identifies errors more quickly, diagnoses their causes, and recommends solutions, reducing downtime and improving system reliability. This reduces human errors and improves decision-making processes, ensuring precise and effective incident resolution.
Enhanced user experience and trust. Co-Pilot provides tailored responses based on the user’s role and experience level, ensuring relevant and useful information for each user. It offers clear explanations for its recommendations: NOC engineers receive detailed insights, while CXO-level employees get high-level summaries, fostering trust and acceptance.
Continuous learning and LLM improvement. Incident Co-Pilot continuously learns and adapts, improving performance over time. This approach ensures NOC operations can maintain a competitive edge and adapt to evolving network complexities.
Reduced manual workload for NOC engineers. By automating tasks, Co-Pilot significantly reduces the manual workload of NOC engineers, allowing them to focus on strategic tasks, leading to improved job satisfaction and productivity.
Vendor solution integration. Incident Co-Pilot integrates multiple vendor solutions through the TMF724 Incident Management API and associated APIs, combining the best features from different tools and systems, and enhancing incident management capabilities. The project was completed promptly by using TM Forum’s standardized APIs and MEF.DEV Platform’s BPMN, Low-Code, automation, and orchestration capabilities.
For additional information and materials, please visit the Incident Co-Pilot Catalyst webpage.