NOC Services

Troubleshooting and Resolution

Processes and procedures guide the NOC in addressing each task and resolving issues. Beyond these basic functions, a NOC team’s key responsibilities encompass a wider series of specific tasks presented in the ITIL framework. While ITIL includes other processes, in our experience the five described below are the most important to address. Let us briefly look at them through a practical lens.

Event Monitoring and Management

Event Monitoring and Management allows the NOC to monitor, detect, and process events and faults related to the organization’s infrastructure and systems. Events can consist of alarms from systems, calls from internal staff (or customers), and emails or chat messages. The NOC team uses a single tool or, as is more common, multiple tools, including Network Management Systems (NMSs), Element Management Systems (EMSs), Application Performance Management (APM) tools, and others. These platforms receive and filter messages from devices, servers, cloud instances, applications, and other infrastructure using protocols such as SNMP, TL1, WMI, and, more recently, gRPC and gNMI, among others. Once an event is detected, it’s evaluated, correlated, and acknowledged, and if further management is needed, it’s logged as an incident or ticket.
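The evaluate, correlate, and log steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular NMS or EMS works: the `Event` fields, the severity scale, the five-minute correlation window, and the "major or worse" ticketing rule are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

# Hypothetical severity scale; real monitoring platforms define their own.
SEVERITY_RANK = {"info": 0, "warning": 1, "major": 2, "critical": 3}


@dataclass(frozen=True)
class Event:
    source: str       # device, server, or application that raised the event
    kind: str         # e.g. "link_down" or "high_cpu" (illustrative names)
    severity: str     # a key of SEVERITY_RANK
    timestamp: float  # seconds since some reference point


def correlate(events, window_seconds=300.0):
    """Group events that share a source and kind and arrive within
    window_seconds of the first event in their group."""
    groups = []
    latest_group = {}  # (source, kind) -> most recent group for that key
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.source, event.kind)
        group = latest_group.get(key)
        if group is not None and event.timestamp - group[0].timestamp <= window_seconds:
            group.append(event)
        else:
            group = [event]
            latest_group[key] = group
            groups.append(group)
    return groups


def needs_ticket(group, min_severity="major"):
    """Assumed rule: open a ticket when the worst event in the group
    is at least min_severity."""
    worst = max(SEVERITY_RANK[e.severity] for e in group)
    return worst >= SEVERITY_RANK[min_severity]


events = [
    Event("core-rtr-1", "link_down", "critical", 0.0),
    Event("web-01", "high_cpu", "warning", 30.0),
    Event("core-rtr-1", "link_down", "critical", 60.0),
    Event("core-rtr-1", "link_down", "critical", 1000.0),
]
groups = correlate(events)
tickets_to_open = [g for g in groups if needs_ticket(g)]
```

In this sketch the two router alarms one minute apart collapse into a single group, so the NOC would see one candidate incident rather than two separate alarms, while the later alarm falls outside the window and starts a new group.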

Incident Management

Incident Management is, in one respect, the core process of a successful NOC. Using the NOC’s IT service management platform or ticketing system, this process provides support when a network, system, or application event requires action. The event is recorded in a ticket with information in different fields. Tickets are handled by NOC engineers and also sent to other personnel as needed in the form of an email, a call, or a message requesting action to address an issue. These communications also include periodic updates and notifications until the incident has been closed. Incident tickets collectively act as a record of all work efforts in the NOC and allow for reporting that can help manage NOC workflow and resources.
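As a rough illustration of a ticket as a record with fields, updates, and reporting, consider the following Python sketch. The field names, the default queue name, and the `open_tickets_by_priority` report are hypothetical; an actual IT service management platform will have its own schema and reporting features.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentTicket:
    # All field names here are illustrative, not taken from any real platform.
    ticket_id: str
    summary: str
    affected_service: str
    priority: str
    assignee: str = "noc-queue"
    status: str = "open"
    updates: list = field(default_factory=list)

    def add_update(self, author, note):
        """Record a timestamped update; these entries form the work history."""
        self.updates.append((datetime.now(timezone.utc), author, note))

    def close(self, author, resolution):
        """Record the resolution and mark the ticket closed."""
        self.add_update(author, f"Resolved: {resolution}")
        self.status = "closed"


def open_tickets_by_priority(tickets):
    """A simple workload report: count tickets that are not closed, per priority."""
    counts = {}
    for ticket in tickets:
        if ticket.status != "closed":
            counts[ticket.priority] = counts.get(ticket.priority, 0) + 1
    return counts


t1 = IncidentTicket("INC-1", "WAN uplink flapping", "regional-wan", "high")
t2 = IncidentTicket("INC-2", "Disk nearly full", "file-server", "low")
t3 = IncidentTicket("INC-3", "Application latency", "order-portal", "high")
t1.add_update("alice", "Carrier contacted")
t1.close("alice", "Carrier replaced faulty optic")
report = open_tickets_by_priority([t1, t2, t3])
```

Because every update is appended rather than overwritten, the ticket keeps the full trail of work, which is what makes later reporting on workflow and resources possible.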

Problem Management

Problem Management encompasses all the activities needed to diagnose the root cause of incidents and request changes to resolve those problems. Problem Management differs from Incident Management in that its focus is to investigate and identify the root cause of an incident rather than to address its effect. Typically, Problem Management requires greater engineering skills to review the trends leading up to an incident, scour logs for indications that point to possible causes of the failure, and formulate plans to prevent future incidents. The Problem Management service also maintains information about problems and workarounds for use by Incident Management personnel.
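One simple way to review trends across incidents is to count how often the same component shows the same symptom. The Python sketch below assumes incidents are available as dictionaries with hypothetical `component` and `symptom` keys, and that three or more occurrences is worth a root-cause investigation; both are assumptions for illustration only.

```python
from collections import Counter


def problem_candidates(incidents, min_occurrences=3):
    """Return (component, symptom) pairs that recur at least min_occurrences
    times, as candidates for root-cause investigation."""
    counts = Counter((i["component"], i["symptom"]) for i in incidents)
    return sorted(key for key, n in counts.items() if n >= min_occurrences)


incidents = [
    {"component": "edge-fw-2", "symptom": "session table full"},
    {"component": "edge-fw-2", "symptom": "session table full"},
    {"component": "db-cluster", "symptom": "replication lag"},
    {"component": "edge-fw-2", "symptom": "session table full"},
]
candidates = problem_candidates(incidents)
```

A count like this only points at where to look. Diagnosing the actual root cause still requires the log review and engineering judgment described above.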

Capacity Management

Capacity Management oversees the performance, utilization, and capacity of infrastructure components to ensure that the client’s service level targets are achieved. Capacity Management should ensure that business capacity, service capacity, and component capacity needs all continue to be met. Senior engineers’ regular reviews of reports and alarm thresholds, taking into account the desired business outcomes and the impact of utilization on business operations, will ensure that evolving capacity needs are addressed in a timely manner.
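A capacity review often comes down to asking when a utilization trend will cross an alarm threshold. The following Python sketch fits a straight line to daily utilization percentages and estimates the days remaining. The 80 percent threshold, the daily sampling, and the assumption of linear growth are all illustrative; real capacity planning would also weigh seasonality and business forecasts.

```python
def days_until_threshold(daily_utilization, threshold=80.0):
    """Estimate days from the last sample until utilization reaches threshold.

    Fits a least-squares line through the samples (one per day).
    Returns 0.0 if the latest sample is already at or above the threshold,
    and None if the fitted trend is flat or falling.
    """
    n = len(daily_utilization)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    if daily_utilization[-1] >= threshold:
        return 0.0

    mean_x = (n - 1) / 2
    mean_y = sum(daily_utilization) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_utilization))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope  # day index where the line hits threshold
    return max(0.0, crossing_day - (n - 1))


growing = days_until_threshold([50.0, 52.0, 54.0, 56.0, 58.0])
flat = days_until_threshold([60.0, 60.0, 60.0])
over = days_until_threshold([85.0, 90.0])
```

With utilization growing two points per day from 50 percent, the fitted line reaches 80 percent on day 15, which is 11 days after the last sample.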

Change Management

Change Management reduces risk when changes are made to the supported infrastructure environment. This function includes identifying the types of changes the organization anticipates and establishing how each change should be handled to reduce the impact on the organization. Processes and controls are generally oriented around three types of change:

Standard changes

Standard changes are routine and low-impact, like resetting passwords.

Emergency changes

Emergency changes must be made promptly, such as rerouting network traffic when the primary WAN uplink at a regional office is unstable.

Normal changes

Normal changes are planned in advance, such as upgrading the operating system on a server cluster. This type of change is managed through a review process to ensure proper planning.

A Change Advisory Board should review and set policies for all these changes. This group helps mitigate risk by ensuring all possible impacts of a change have been taken into consideration and a proper plan with a recovery process is in place.
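The three change types and the requirement for a recovery plan can be expressed as a small routing rule. The Python sketch below is one possible encoding, not a prescribed policy: the `ChangeRequest` fields, the rule that urgency takes precedence over pre-approval, and the idea that standard changes need no per-change approval are assumptions that an organization’s own policies would replace.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeType(Enum):
    STANDARD = "standard"    # routine, low-impact
    EMERGENCY = "emergency"  # must be made promptly
    NORMAL = "normal"        # planned in advance, goes through review


@dataclass
class ChangeRequest:
    # Field names are hypothetical, chosen only for this example.
    description: str
    pre_approved: bool = False  # matches a routine change already covered by policy
    urgent: bool = False
    rollback_plan: str = ""
    approved: bool = False      # approval recorded by whichever body the policy names


def classify_change(change):
    """Assumed precedence: urgency first, then pre-approval, else normal."""
    if change.urgent:
        return ChangeType.EMERGENCY
    if change.pre_approved:
        return ChangeType.STANDARD
    return ChangeType.NORMAL


def ready_to_implement(change):
    """Every change needs a recovery plan; in this sketch only standard
    changes skip per-change approval."""
    if not change.rollback_plan.strip():
        return False
    if classify_change(change) is ChangeType.STANDARD:
        return True
    return change.approved


reset = ChangeRequest("Reset user password", pre_approved=True,
                      rollback_plan="Issue a new temporary password")
reroute = ChangeRequest("Reroute traffic off unstable WAN uplink", urgent=True,
                        rollback_plan="Restore original routing policy")
upgrade = ChangeRequest("Upgrade OS on server cluster",
                        rollback_plan="Revert to previous OS snapshot", approved=True)
```

Here the password reset can proceed immediately, the reroute is classified as an emergency but still waits for a recorded approval, and the upgrade proceeds because it has both a rollback plan and an approval.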