Cisco has most recently addressed this in their Service Provider Operations certification track; however, there has always been some degree of an "ops" perspective seeded throughout most Professional certifications. My approach is more about the tools, methodologies, and tasks that you use on a daily basis to successfully maintain an enterprise network.
Before we embark down the path of being truly successful in managing our enterprise, let us examine why they call it "Operations":
Quoted from http://dictionary.reference.com/browse/operation
op·er·a·tion [op-uh-rey-shuhn] noun
1. an act or instance, process, or manner of functioning or operating.
2. the state of being operative (usually preceded by in or into): a rule no longer in operation.
3. the power to act; efficacy, influence, or force.
4. the exertion of force, power, or influence; agency: the operation of alcohol on the mind.
5. a process of a practical or mechanical nature in some form of work or production: a delicate operation in watchmaking.
For us, definitions 1 and 5 are most relevant. As a fellow VIP - Scott Morris - has mentioned more than once, you can often break down any given task into a set of smaller, simpler tasks. Network Management is the epitome of this if you really dive into the details; this is why "operations" is the key word given to most groups that carry out the Network Management responsibility. It is a series of processes and acts that collectively make up a full suite of capabilities to help you maintain your IT infrastructure, in this case specifically the network. As most folks who have been in the industry for a bit know by now, networks grow - whether organically or by design. With that growth comes the need to scale your operations to maintain efficiency and rein in costs. Several tools and processes come to mind that allow us to do just that, and we'll discuss them in this blog series. Several of these topics will be expanded on individually in later posts. For now, though, the goal is to show what tools are out there from a conceptual perspective, why they are important, and what they can do for you.
Network Monitoring
One of the primary tools that will enable us to run our networks is a capable network monitoring system. This gives us near real-time visibility into the status and health of our network. Tools such as SolarWinds Orion, HP OpenView, NetCool, and SMARTS all give the network team the ability to see what is happening based on SNMP and Up/Down tracking of devices. When a given metric surpasses a threshold, the tool sends a notification, often called an "alert", which allows the Network Team to react. Most times this involves a ticket being created to track the event. I'll circle back around to incident management down the road, but that is the system you would ideally have in place to handle these "tickets".
When you first roll out a monitoring tool, especially if this is the first tool of its kind in that environment, you may choose to start with only up/down monitoring enabled. This gives the IT staff time to come to terms with handling alerts, having the network tell them what is going on, learning the software, etc. Up/Down alerts are a good way to ease staff into this growth in responsibilities and capabilities. Over time you can introduce link status, errors, utilization, and so on.
Here are a few favorite alarms that I've seen companies track:
- Up/Down Status
- CPU Utilization
- WAN Link utilization
- WAN Link health (errors, drops, etc)
- Critical LAN link status/health
There are many more alerts that most systems employ, but those are the ones you typically see at any given shop, earning them a spot on the list of what I call universal favorites.
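To make the threshold idea concrete, here is a minimal sketch of how a monitoring tool might turn polled values into alerts. The metric names, the threshold percentages, and the device names are all made up for illustration; real products such as the ones named above have their own configuration formats.

```python
from dataclasses import dataclass

# Hypothetical thresholds, in percent; tune these for your own environment.
THRESHOLDS = {
    "cpu_utilization": 80.0,
    "wan_link_utilization": 75.0,
}


@dataclass
class Sample:
    device: str
    metric: str
    value: float  # percent


def evaluate(samples):
    """Return one alert string for each sample that is over its threshold."""
    alerts = []
    for s in samples:
        limit = THRESHOLDS.get(s.metric)
        if limit is not None and s.value > limit:
            alerts.append(
                f"{s.device}: {s.metric} at {s.value:.0f}% exceeds {limit:.0f}%"
            )
    return alerts


# Made-up polling results for three device/metric pairs.
samples = [
    Sample("edge-rtr-01", "cpu_utilization", 92.0),
    Sample("edge-rtr-01", "wan_link_utilization", 40.0),
    Sample("core-sw-02", "cpu_utilization", 35.0),
]
for alert in evaluate(samples):
    print(alert)
```

Only the first sample crosses its threshold, so only one alert is raised. In a real shop each alert would typically open a ticket rather than just print a line.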
Configuration Management
With a capable configuration management tool you can automate many tasks that may otherwise tie up valuable man-hours. Suppose you need to update an on-call number within SNMP for EVERY DEVICE in the network. That could be 30k devices! If you have 30, maybe logging in and changing that one variable by hand is feasible. For 30k, however, that could literally take weeks. With configuration management, you write a script to update the configuration, select the scope of devices to run the script against, and voilà - done! Just be sure your script works prior to blasting it out to 30k devices...
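The "write a script, pick a scope" workflow can be sketched as a dry run. This is not any particular product's API: the inventory, the hostname pattern, and the contact string are invented, and the sketch assumes Cisco IOS style syntax for the SNMP contact line. It only builds the plan; pushing the lines to devices is left to whatever tool you use.

```python
from fnmatch import fnmatch

# Hypothetical inventory; a real tool would pull this from its device database.
INVENTORY = ["nyc-rtr-01", "nyc-sw-01", "lon-rtr-01", "lon-sw-01"]


def select_scope(inventory, pattern):
    """Pick the devices whose hostname matches a shell-style pattern."""
    return [host for host in inventory if fnmatch(host, pattern)]


def build_change(contact):
    """Config lines that set the SNMP contact string (IOS-style syntax assumed)."""
    return ["configure terminal", f"snmp-server contact {contact}", "end"]


def plan(inventory, pattern, contact):
    """Dry run: map each in-scope device to the lines it would receive."""
    change = build_change(contact)
    return {host: change for host in select_scope(inventory, pattern)}


# Test against a small scope first, then widen the pattern to "*".
for host, lines in plan(INVENTORY, "nyc-*", "NOC on-call 555-0100").items():
    print(host, "->", "; ".join(lines))
```

Keeping scope selection separate from the change itself is what lets you prove the script on a handful of devices before it goes out to all 30k.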
Another example is deploying devices - you can have your staff deploy a switch with the meat of the config, VLANs, VTP, uplinks, etc. They get it up and running - and then you pull it into the management domain and deploy your management template. This can include SNMP, AAA, security ACLs, etc. All centrally managed - which reduces the chance of error.
These tools are often useful as well for running custom poll scripts against devices. This can help you tailor reports to a special case that exists on your network, or pull the specific information you need without having to dig through an entire "show run" or "show tech". This becomes more useful the larger your network gets.
Biggest benefits you often gain out of configuration management systems:
- Historical configuration backup
  - Easy way to find last known good config during outage
- Mass change function
  - Intelligent scripting can cut time on large-scale simple changes
- Config reporting
  - Ability to quickly poll a stored data set for patterns/configs w/o impacting production network
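As a rough illustration of config reporting, the sketch below searches stored backups for a pattern instead of logging in to live devices. The backup contents and hostnames are made up, and a real system would read them from its own archive rather than a dictionary.

```python
import re

# Hypothetical stored backups: hostname -> most recent saved config text.
BACKUPS = {
    "nyc-rtr-01": "hostname nyc-rtr-01\nsnmp-server community public RO\n",
    "nyc-sw-01": "hostname nyc-sw-01\nsnmp-server community s3cret RO\n",
    "lon-rtr-01": "hostname lon-rtr-01\nntp server 192.0.2.10\n",
}


def find_matches(backups, pattern):
    """Return {hostname: [matching lines]} for configs containing the pattern."""
    regex = re.compile(pattern)
    report = {}
    for host, config in backups.items():
        hits = [line for line in config.splitlines() if regex.search(line)]
        if hits:
            report[host] = hits
    return report


# Example report: which devices still use the default "public" community?
print(find_matches(BACKUPS, r"^snmp-server community public\b"))
```

Because the search runs against the stored data set, it puts no load on the production network no matter how many devices are in the archive.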
Incident Management
This is the fabled "ticketing system", which tracks incidents via records, also known as trouble tickets, event tickets, work orders, task orders, etc. There are as many names for it as there are versions out there. Remedy is one prevalent platform, as is Heat. I've worked on several internally developed platforms that usually outperform both, but that is because they were built from scratch for exactly those environments. Tough to do from a template.
The Incident Management System (IMS) is typically seen as the chronological life of the network from an operations perspective. You can track chronic issues at sites, you can track trends, you can track man-hours spent on projects, you can track the utilization of your personnel, etc. Oftentimes you can use these metrics to justify a project or expenditure: "we currently work 3000 unique tickets a week; with this upgrade we could cut that to 500, freeing up X man-hours". On the flip side, the IMS can also serve as the record of changes made on the network in break-fix situations.
Common things that IMS tickets are used to track are as follows:
- Timeline for incident
- What troubleshooting was done
- What was found to be the exact problem
- What actions were taken to resolve
- What was root cause of problem
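The items above map naturally onto a ticket record. Here is a minimal sketch, with invented field names, sites, and timestamps, that also shows how the same records can surface chronic sites. It is not modeled on Remedy, Heat, or any other product.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Ticket:
    site: str
    opened: str            # ISO 8601 timestamp, kept as text for simplicity
    problem: str = ""      # what was found to be the exact problem
    resolution: str = ""   # what actions were taken to resolve
    root_cause: str = ""   # why the problem happened
    work_log: list = field(default_factory=list)  # troubleshooting timeline

    def log(self, timestamp, note):
        """Append a timestamped troubleshooting note to the ticket."""
        self.work_log.append((timestamp, note))


def chronic_sites(tickets, minimum=2):
    """Sites with at least `minimum` tickets, most affected first."""
    counts = Counter(t.site for t in tickets)
    return [(site, n) for site, n in counts.most_common() if n >= minimum]


# Made-up tickets for illustration only.
tickets = [
    Ticket("branch-12", "2024-03-01T08:15", root_cause="failing WAN circuit"),
    Ticket("branch-12", "2024-03-04T09:40", root_cause="failing WAN circuit"),
    Ticket("hq", "2024-03-05T14:02", root_cause="power outage"),
]
tickets[0].log("2024-03-01T08:20", "Pinged WAN interface, 40% loss")
print(chronic_sites(tickets))
```

The same counting approach extends to tickets per week or man-hours per site, which is where the justification metrics mentioned earlier come from.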
Change Management Controls
This is many people's worst enemy - change management! The idea is to keep a historical record of all the changes that go on in the lifecycle of the network. The benefit of having this kind of looking glass into the past is multifaceted: metric tracking, root cause analysis, accountability to stakeholders (more on that later), and perhaps above all - visibility into the stability of the network.
Part of the difficulty many organizations face with change management is fully integrating the business facet into the IT world. Not only does this require the IT group to accept that the business has the power to approve or decline changes, it also requires the business to understand the strategic and tactical nature of how their IT systems support and/or drive their business vertical. Without going into specifics, if a business is in the process of making you money, you want the network to ASSIST in that process, not be the cause of financial loss. Robust, well-developed, and fully integrated change management policies, paired with an easy-to-use tool to track them, are critical for companies to develop stringent control over the lifecycle of the network.
When a business unit fully realizes the control and peace of mind that can result from this kind of framework, they often buy into it and get involved. Balancing business versus IT needs can be precarious at best; a well-formulated decision matrix can help ease those tensions. When the change control process is followed dogmatically by all of the parties involved, two huge benefits are realized. The vertical can now hold IT accountable for outages they cause - which makes for a more calculated approach to dealing with network changes. On the other hand, the IT group can then say "we made no changes", and the business vertical can verify this by looking into the change management system, giving them a reasonable level of trust that it is true. Checks and balances should always exist, and I've seen large-scale shops build in scripting tools to track EVERY keystroke of an engineer and log it to a third party within the company for reconciliation purposes. While this is an extreme case - it goes to show you how far this kind of concept can be taken to balance the need for action and the requirement to follow policy.
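The "we made no changes" check is easy to picture as a query against the change record. Below is a small sketch, assuming a hypothetical record format with made-up change IDs, devices, and times; a real change management tool would hold far more detail per change.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeRecord:
    change_id: str
    device: str
    approved_by: str
    applied_at: datetime


def changes_before(records, device, outage_start, window_hours=24):
    """Recorded changes applied to `device` in the window before an outage."""
    earliest = outage_start - timedelta(hours=window_hours)
    return [
        r for r in records
        if r.device == device and earliest <= r.applied_at <= outage_start
    ]


# Made-up change records for illustration only.
records = [
    ChangeRecord("CHG-001", "core-sw-02", "business-owner", datetime(2024, 3, 5, 22, 0)),
    ChangeRecord("CHG-002", "edge-rtr-01", "business-owner", datetime(2024, 3, 1, 23, 0)),
]
outage = datetime(2024, 3, 6, 3, 30)
print([r.change_id for r in changes_before(records, "core-sw-02", outage)])
print([r.change_id for r in changes_before(records, "edge-rtr-01", outage)])
```

An empty result for a device backs up the "no changes" claim only if every change actually goes through the process, which is why following it dogmatically matters.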
More to Follow.....
With that background, in future blogs I will go on to show you how you can tie these tools together using policies and processes. Each of these tools provides great value in and of itself, but they truly shine and provide an exponential ROI when your internal practices leverage them properly. Before I can show you that, though, they would need to be up and running in your environment, no? So, I will give you a few walk-throughs on basic deployment of these tools within your environment. These will cover the considerations that need to be addressed, how to pick the best product, the pros and cons of buying Commercial Off The Shelf products versus developing some of them in-house, and so on.
Once we get them up and running, we are going to discuss building your business model as an IT shop around them, how to work with your customers - whether internal or external - and how to re-tool your relationship with them based on these new capabilities. In addition, I'll try to show you how you can leverage them to provide SLAs to internal customers, how you can use these tools to bring truth in advertising to other groups within your organization, and a few other neat features that you'll find.
I hope you've enjoyed this blog. If you have any questions or requests, please leave a comment! I will try to respond as best I can; if you don't get a timely response, PM me and point me towards the thread.
Thanks everyone for reading!
REF:
Network Monitoring Tools