Jun 07

PagerDuty Review

I’m always looking for ways to improve both internal and external services and tools, because I believe that to stagnate is a bad thing. Just because what we have works doesn’t mean we can’t make it better. If we can make it better in a cost-effective way, why aren’t we doing that?

It was in that vein that, when I saw an ad for PagerDuty’s services (with a free T-shirt), I looked a little more closely at what they actually do, and then signed up for a trial.

Disclaimer: I have no affiliation with PagerDuty beyond my use as a trial user. Any links to PagerDuty are free of referral codes; I pay no cost and reap no benefit if anyone clicks through to their site and signs up for their services. PagerDuty offers a free T-shirt on the first alert, but since my address was never requested, explicitly or via a field on the form, I can only assume this applies to paid users. It certainly wasn’t made clear on signup, but it does make sense that you need to give them money before they give things to you.

The Basics

PagerDuty is a cloud-based service that provides alert escalations to duty technicians or administrators based on predefined rules and schedules.

PagerDuty can, based on user preferences, escalate alerts via email, SMS, phone call or by push notifications to their Android and iPhone applications.

PagerDuty accepts alerts from a wide range of tools including (but certainly not limited to) Nagios, Icinga, Zabbix, Pingdom, UptimeRobot and NodePing, harnessing both email queues and API-based tools.

When an alert is received it is assigned to an escalation path based on where it came from. The escalation path then notifies individuals or schedules, and can escalate further if the assigned on-call person fails to acknowledge the alert. If an acknowledged alert goes unresolved for a period of time, it is possible to have it fall back into a triggered state, whereby it starts the escalation process again.

Who Is PagerDuty For?

PagerDuty is for anyone who needs to escalate alerts. PagerDuty is excellent for organizations where there are one or more monitoring systems that need to be consolidated into a single escalation system (e.g. Pingdom for system availability and Nagios for specific sub-checks). PagerDuty also excels at scheduling on-call staff on a daily, weekly or other basis. Override tools are also provided so that if Frank is out for a few days, Keith can be scheduled without breaking the rotation.

In my personal environment, PagerDuty would escalate a set of monitoring systems in a uniform manner and notify me of issues. In my work environment it would be an excellent tool to handle alerts for our level 3 support teams where our Systems, Network and possibly Management teams would each have an escalation path to which some alerts would go directly and others would be initiated after investigation by our Level 1/2 technical staff.

Review

Enough about what PagerDuty is and what it does; let’s get into the meat of it. I signed up for PagerDuty and the first thing I did was poke around. The interface is pretty intuitive and I never really got lost. There were no glaring bugs in the system; everything tied together nicely. A few things do have dependencies (you can’t delete an Escalation Path if it is tied to a Service, for example), but the warnings and errors were more than sufficient to tell me what the problem was and what I needed to do to fix it.

The first thing I did was add my Nagios system, because that is my primary source of alerts. Many of them need tuning to reduce false alerts, but that’s another story. Nagios is an interesting one: PagerDuty can receive alerts by email, but for systems like Nagios and Zabbix that integrate with its notification system it works a little more directly. There is a queue that alerts are posted to, and a cron job that runs every minute to flush that queue. If you prefer to trigger via email, you can do so, and with Nagios there is no reason you can’t have both – an alert that checks the queue is being cleared and sends via email, or an alert that checks mail is being processed and alerts via the queue tool. Once I got the Nagios tools working, I moved on to some of my other external monitoring tools that were only supported by email. Both were on the supported list and integrated perfectly.

Alerts in PagerDuty have three states: Triggered, Acknowledged, and Resolved. A new alert lands in the Triggered state and triggers the Escalation Path associated with the service it came in from. This can be as simple or as convoluted as you want it to be. It might just notify you individually and stop, or it might start by notifying the Level 1 schedule, wait 15 minutes, then notify the Level 1 schedule along with the Level 2 schedule, then wait another 30 minutes before notifying the Management schedule. As soon as someone clicks the “Acknowledge” button for an alert, the Escalation Path is stopped. There is a per-service option to time out an acknowledged alert – that is, if the alert remains “Acknowledged” for that time (default: 30 minutes) it falls back into the Triggered state and the Escalation Path starts again from the beginning.

For alerts from systems like Nagios that can also send the “OK” state to contacts, PagerDuty will automatically resolve the open alerts (triggered or acknowledged) and any escalations will stop for those also.
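For the curious, the API-based side of that integration boils down to something like this – a rough, untested sketch against PagerDuty’s Events API (v2 enqueue endpoint; the routing key, hostname and check name are placeholders I made up), triggering an alert and then resolving it with the same dedup key:

import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_SERVICE_ROUTING_KEY"   # placeholder - one key per PagerDuty service

def send_event(action, dedup_key, summary=None):
    # action is "trigger", "acknowledge" or "resolve"; the dedup_key ties them together.
    body = {"routing_key": ROUTING_KEY, "event_action": action, "dedup_key": dedup_key}
    if action == "trigger":
        body["payload"] = {
            "summary": summary,
            "source": "nagios01.example.com",   # placeholder monitoring host
            "severity": "critical",
        }
    req = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Nagios-style flow: trigger on CRITICAL, resolve when the check returns to OK.
send_event("trigger", "disk-root-web01", "DISK CRITICAL on web01: / is 95% full")
send_event("resolve", "disk-root-web01")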

Now, those who expect to have their alert tools send everything and then filter at the PagerDuty level will be disappointed. There are zero options in PagerDuty to decide on the fly whether an alert should be escalated differently – you need to have defined all of this in advance. For example, if some Nagios alerts should escalate at high priority and others at a much lower priority, you need to set them up to notify two different contacts so that they map to two different services and two different escalation paths in PagerDuty. That means it doesn’t really simplify the problem very much; it just moves it.

The next problem is one that will irk managers who want to define how escalations work at the individual level: each individual user defines how they are contacted. While there may be some API-based way to update all users (I doubt it, but I haven’t looked), it is entirely possible that your engineers have configured their profiles to not be notified for 30 minutes by any method. Personally, I would like to see a way for a manager to enforce an individual escalation procedure (SMS and email at 0 minutes and a phone call at 15 minutes, for example). One handy feature is that you can configure a notification when going on or off call, which can arrive by email or by SMS.

This problem is closely related to the user permission levels. I’d be interested to see a table that shows the various tasks that can be performed and which level is required to perform them. It seemed that most of the critical tasks in PagerDuty could be performed by a “User”, where it would probably be beneficial to have them restricted to Admins. It may be that PagerDuty has separated these deliberately – Team Leads who manage schedules are “Users”, their subordinates who just handle alerts are Limited Users, and anyone dealing with anything billing-related is an Admin – but I was personally surprised by what a “User” could do compared with an “Admin.”

The final major issue I noted is that adding users is not the simplest task I’ve seen. An Admin or User must “invite” the user via email, and it’s possible to add users to schedules before they have accepted their invitations. There didn’t seem to be an easy way to determine whether a user has accepted their invitation and configured their profile.

PagerDuty Staff and Support

I haven’t contacted their support team, though I did note that there doesn’t seem to be an easy way to reach them from any of the logged-in pages, and after hours their live support was unavailable from the main site.

That said, every trial user is (understandably) a sales opportunity, and their sales team is very aggressive in wanting to show you the ins and outs of their tool. After a week the emails started, and I got three messages within three days from their staff wanting to further demonstrate the product. By that point I was well acquainted with it and failed to respond (sorry, guys), but for my use cases there is nothing more I can do with the product than I have already.

The Numbers

PagerDuty is expensive; there is no question there. At $19.95/user/month (paid yearly; $24.95/user/month if paid monthly) it gets very expensive very quickly, even for a small team. If I pushed this to my employer, I estimate cost would be the first question and the reason it would be turned down very quickly. Even if we only entered our Network and Systems teams and the core managers (who weren’t already on either team) into the tool, we would be looking at at least 9, maybe 10 users, which means $199.50/mo just for a monitoring service, assuming we paid yearly – a bill of nearly $2,400 a year. At that cost, we’re better off paying one of our technicians to build the same system in PHP, and in just a couple of months we’d be ahead.

Keep in mind, the $19.95 number is also for fairly basic service. It includes only 25 international alerts per month, so if you have a lot of staff outside the USA you’d best not need to escalate to them very often (more can be purchased at $0.35 each). More worrying is that the $19.95 level doesn’t include any SLA and comes with email-only support. If you want more international alerts (100/month, then $0.35 each), phone support, or the OPTION (i.e. it costs extra) of 24/7 phone support or an SLA, you’ll be paying $39.95/user/month (paid yearly; $49.95/user/month if paid monthly). As an added bonus, the higher tier also allows for Single Sign-On.

The Verdict

Unfortunately, because of the cost alone, I can’t recommend PagerDuty, and I think I’ll be terminating my account at the end of the trial. Cost aside, it’s a great tool with great potential that I would be more than happy to push at work or even keep for myself, but I just can’t get past that cost factor – it’s not quite that valuable.

Apr 16

Surviving a Provider Outage

Last year, EIG (the Endurance International Group) suffered a major outage in one of their facilities in Utah, impacting a number of customers on their Bluehost, Hostgator and Hostmonster brands, possibly among others. Today they have been down for close to six hours and counting, leaving customers offline with no resolution in sight – everyone from a family’s small shared hosting site all the way up to dedicated server customers running business sites and services.

So how do you, as a website owner or a service provider relying on other providers such as Bluehost being online, go about keeping your business-critical server online and functioning? The key is forward planning.

Finagle’s Law of Dynamic Negatives states that anything that can go wrong, will – at the worst possible moment. Like your computer blue-screening right before you hit save (or because you hit save) and losing all of your work since the last time you saved. Or the power going out as the concert starts. You can imagine many, many more. Quite simply, if you are relying on absolutely anyone else to help provide your critical systems, chances are one of them will fail at some point and leave you stranded for a period of time.

So, forward planning. You know that something is going to fail, so, much like with financial investments, you need to diversify your business service portfolio, as it were.

Start with Reliable. Choosing your provider is important; it’s a crucial balance that you should probably revisit as your business needs and abilities change. The saying goes: Good, Cheap, Reliable – pick two. You might be able to get a good host for cheap, but it won’t be reliable. If you want a good, reliable host, you’re going to have to pay a little more. In any case, do your research and ask around – don’t just pick the one with flashy ads on YouTube.

Consider a Disaster Recovery (DR) environment. No matter who you choose as your primary provider, they’re going to have downtime. It might be for maintenance (in which case they should let you know ahead of time) or it might be due to an unexpected failure of some kind. Some failures are relatively minor and only impact one customer (a part in your dedicated server dies) or a handful of customers (a switch or power distribution unit fails). It might be something massive, like the core routers losing connectivity. Your business is critical, so it’s worth investing in an environment that your services fail over to when the primary is unavailable. It can be as complex as a full hardware and software replica of your production environment, which may even share some of the load during regular hours. Or it might be as simple as a cheap virtual server – everything might run a little slower, but it’s enough to help you ride out the storm and gives you somewhere to migrate your critical functions.

Backup, Backup, Backup. Maybe you can’t afford a DR environment; at the very least, keep backups of everything. If your service provider goes bankrupt and simply shuts down, or – as we saw with Volume Drive last year – just ups and leaves their colocation provider and some servers “go missing”, how will you move on? You need a backup of your system so that once you select a new provider, redeploying your service becomes a relatively painless process.

Service credits don’t cover the cost of your lost revenue, and just because they offer a 99.999% guarantee doesn’t mean they’ll spread that 0.001% across the calendar year. It’s a critical item that needs to be considered when planning your IT strategy. After all, anything that can fail, will. And probably at the worst possible time.

Apr 12

Foreman on Debian – Install issues

A few weeks ago I noticed that, as a result of an upgrade, puppet-dashboard had removed itself from the system it ran on. No big deal – I wasn’t really using it, and keeping it would have put a number of more important packages on the removal list. Since then I’ve had an error in my log every time a system polls Puppet, which adds up quickly when multiple machines check in twice an hour.

Quick research suggests that Puppet Dashboard is no longer under development and should be replaced. Work is looking to install Foreman as a frontend for our Puppet install, so I figured: what better tool to replace Puppet Dashboard with than Foreman, and give myself a jump start on the training? It also installs on Debian from Foreman’s own repos.

The install itself was actually a little painful, but surprisingly easy to fix once I got to the bottom of it. The problem showed up when I ran “apt-get install foreman”:

dpkg: error processing package foreman (--configure):
subprocess installed post-installation script returned error exit status 7

The process was failing during the postinst script while dpkg was configuring the package. This script is found in /var/lib/dpkg/info/<packagename>.postinst – in this case the package name was “foreman.” I set the script to debug mode and ran dpkg --configure foreman. The last commands before it errored out and terminated were these:

+ cd /usr/share/foreman
+ [ -f Gemfile.lock ]
+ CMD=bundle install --path ./vendor/ --local --no-prune
+ [ ! -z ]
+ bundle install --path ./vendor/ --local --no-prune

So let’s run them manually:

root@kiwi:/usr/share/foreman# bundle install --path ./vendor/ --local --no-prune
Resolving dependencies...
Some gems seem to be missing from your vendor/cache directory.
Could not find gem 'safemode (~> 1.2) ruby' in the gems available on this machine.
root@kiwi:/usr/share/foreman#

Well then. Apparently we need to install the safemode gem!

gem install safemode
bundle update

A quick apt-get -f install later, and foreman configured itself correctly. Now I’ll spend a few days trying to make it work on my install, and I’ll likely post an update or two as time goes on with what I learn.

Mar 17

Employee Satisfaction

I have a theory which roughly states that an employee’s loyalty to a company is directly proportional to their satisfaction level. This seems obvious, but many things feed into it that aren’t often taken into account, because the number one thing we consider is remuneration – and I think that is wrong.

Sure, how much money an employee receives for the work they perform plays a large role in their satisfaction, but it is a far cry from being the only thing that makes an employee happy. In fact, my theory goes on to suggest that even if you are paying bottom dollar for an employee’s services, you can still retain that employee’s loyalty if you can keep him (or her) happy in other areas.

What do I mean, exactly?

Well, it’s simple. If an employee consistently feels like they are being discriminated against, that is a negative mark against their satisfaction level. If an employee feels like efforts are being made to include them in activities despite it being inconvenient, that is a positive mark. Paying them more is a positive mark; giving them a free lunch just for being an employee is a positive mark; consistently failing to recognize when they pick up the pieces of their co-workers’ failed tasks is a negative mark.

As a relative newcomer to the IT industry, I’ve only worked for a handful of companies, but already I can see that the way a company treats its employees directly impacts their satisfaction and ultimately shows up in the turnover rate for a given department or position. More importantly, I’ve noticed that it isn’t always about how much money an employee is paid – I’ve seen instances where someone leaves and, when asked if a pay increase could be an incentive to stay, says straight out that no amount of money could convince them that staying was a good idea.

If you’re in a position of management, find out how your employees feel about being a member of the team and the company. If they’re not comfortable telling you that they are unhappy, you may need to start looking inward at yourself because it’s likely they don’t trust you’ll protect them if they are truly honest. And if they are unhappy, try to get to the bottom of what’s bothering them – maybe it’s a personal problem at home and they need a couple of days off, or it could be that they are overloaded with work and need a hand with things.

To sum this up:

1) Not all employees are happy, and just because they say they’re fine doesn’t mean they are.

2) Not every employee happiness problem can be solved by throwing money at it.

When was the last time you had a team-building exercise? I don’t mean going out into the woods to swing on ropes, or taking the team to paintball or laser tag – when did you last just sit down and have dinner together? Maybe it’s time to take the company on a picnic: not to talk about work, but to just exist with coworkers, to get to know each other over conversation that doesn’t relate to last week’s sales call or next week’s investor meeting. Who knows – you may make people just as happy with a couple of hundred dollars total as you would by giving each of them a 10% pay increase. Now that’s a smart business decision.

Mar 08

RAID vs. Backup

Occasionally you hear the words “I don’t need backup, I have RAID!” or similar phrases. You may even have used them yourself once or twice. They are not the same thing. Though both are intended to maintain uptime, they perform two different functions which serve two different purposes. Both are defense mechanisms against disaster, but viewing them as the same is wrong and will inevitably backfire.

RAID – Redundant Array of Inexpensive Disks

RAID is all about defending against disaster in the here and now. RAID defends against disk failure and allows your system to continue running (albeit at reduced performance) until the disk is replaced and the array is rebuilt.

In the event that you suffer data corruption or data loss, or become the victim of a virus or malware, RAID does absolutely nothing for you.

Backup/Restore

Backups exist to provide a historical record of what your system looked like at the time the backup was run. At the core of the concept there are two backup types: the full backup and the incremental backup. Most organizations run a full backup regularly during quiet times (e.g. every weekend), as full backups take a while and can tax resources, and run less intensive incremental backups to fill in the gaps (e.g. nightly). There are also other backup systems, such as CDP (continuous data protection), which take a full backup and then keep track of all changes as they happen – they are out of scope for this post.
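As a toy illustration of the full-vs-incremental idea (and nothing more – real backup tools handle deletions, permissions, databases and retention; the paths here are made up):

import os
import shutil
import time

SOURCE = "/var/www"                  # what we're protecting (example path)
DEST = "/mnt/backups"                # where copies go (example path)
STATE = os.path.join(DEST, "last_backup_time")

def backup(full=False):
    # A full backup copies everything; an incremental copies only files
    # modified since the previous run, as recorded in the state file.
    since = 0.0
    if not full and os.path.exists(STATE):
        with open(STATE) as f:
            since = float(f.read())
    target = os.path.join(DEST, time.strftime("%Y-%m-%d-%H%M"))
    for root, _dirs, files in os.walk(SOURCE):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) > since:
                dest = os.path.join(target, os.path.relpath(path, SOURCE))
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                shutil.copy2(path, dest)
    with open(STATE, "w") as f:
        f.write(str(time.time()))

backup(full=True)    # weekend: copy everything
backup()             # weeknights: copy only what changed since the last run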

The purpose of a backup is to grant access to data that has been lost or changed and needs to be retrieved, whether it be a file, a directory, or an entire system. Backup is wonderful for restoring a system that has been compromised or suffered a hardware failure necessitating a reinstall, but it does absolutely nothing to protect a system during the incident; it only helps you recover from disaster.

—-

So you see, RAID and Backup are not the same. Having one is good, but knowing why you have it is better, and having both is better still. And remember, a backup system is only as good as the last backup you tested. If you never test the backup, you’ll never know if it works. And a broken backup system is worse than having no backup at all.

Mar 01

Centralized Logging

We all know how important logs are. I use them regularly to find out why the automated firewall blocked an IP address, to track down an error in the Apache config, or to debug a problem with the Asterisk phone tree. I also know that our network engineers use them to identify problems with switches and routers, and even the folks using Windows use Event Viewer to audit system logins or investigate system crashes.

There are two key issues that arise when relying on local logs (logs stored on the system itself, in Event Viewer or in /var/log, etc.), and they matter because logs are such a critical part of the investigation process.

  1. When using logs to investigate a compromised machine, the logs themselves are almost entirely untrustworthy as there is very little stopping an attacker from modifying the logs and removing traces of themselves.
  2. If the logs are stored in volatile memory, or the system is designed to erase or overwrite logs on reboot (common on network equipment), then the logs are lost forever if the system restarts – whether the restart is part of the problem, part of the resolution, or happens afterwards but before the logs can be retrieved for the investigation.

There are only a small number of solutions to these issues, and the most common is a centralized logging system. If you have more than two or three systems, you can immediately begin reaping the benefits of central logging. Windows Event Viewer has had event forwarding as core functionality since Vista/Server 2008; for XP and 2003 there was an add-in that provided the same feature. From there it is a fairly simple configuration to set up a single collector server and point the clients at it with their events. On Linux there are a number of tools; one of the more common is rsyslog, which can both receive and forward log entries. There are also services such as Loggly or Splunk, which offer free and paid tiers to store logs on your behalf on their side, and downloadable tools such as Graylog, which act as a receiver for syslog connections and let you search the logs, run statistics, or send alert emails when specific log entries are encountered. There are countless other tools, or you can even write your own.
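As a tiny example of the application side of this, Python’s standard library can ship logs straight to a central syslog box – a minimal sketch, with “loghost.example.com” standing in for whatever your rsyslog (or Graylog) collector happens to be:

import logging
import logging.handlers

# Ship this application's logs to the central syslog server as well as to stderr.
central = logging.handlers.SysLogHandler(address=("loghost.example.com", 514))  # UDP 514
central.setFormatter(logging.Formatter("myapp: %(levelname)s %(message)s"))

log = logging.getLogger("myapp")
log.setLevel(logging.INFO)
log.addHandler(central)
log.addHandler(logging.StreamHandler())   # still log locally

log.warning("disk usage on /var is above 90%")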

The key is to ensure your logging server is both reliable and as secure as possible. If the log server is down when an event occurs, it can’t receive logs, so you’ll need to be able to grab the logs from the device itself without relying on the central system. Security matters just as much: if an intruder can reach the logging server, suddenly none of your logs are safe.

Some have already been mentioned above, but other useful tasks that a centralized system makes easier include:

  • Triggering alerts based on log entries. Do you have a Known Issue that happens occasionally but isn’t often enough or critical enough to get fixed, but still warrants immediate attention? Set up an alert so that if it occurs on any of your systems, you can be alerted to it as soon as possible.
  • Statistics and other data. Do you suspect that your trend of bounced emails is going up? Maybe you want to gather page-hit data across all of your web servers for the last month. With all of your logs forwarded to a central location, you have far more power to run analysis on them, mine data, or gather stats on your environments. Each entry should still be tagged with the machine it came from, so extracting the relevant data remains easy while keeping everything together. I also use a tool called logcheck, which pulls all of my logs for the previous hour, drops lines matching regular expressions (for the lines I know are there but really don’t care about) and then emails me the results – by centralizing my logs I get a single email for all of my systems and I only need to maintain one set of ignore files (a toy version of this filtering is sketched after this list).
  • Verifying data integrity on the hosts themselves. If you’ve had a break-in and you’re concerned the logs have been tampered with, you can not only use the central copy to find what is missing, but also run a diff between the log on the central server and the log on the host itself to see exactly what is different between them.
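And a toy version of that logcheck-style filtering, with made-up ignore patterns – pipe an hour of logs through it and mail yourself whatever survives:

import re
import sys

# Lines we know about and don't care to see again (made-up examples).
IGNORE_PATTERNS = [
    re.compile(r"CRON\[\d+\]: pam_unix\(cron:session\)"),
    re.compile(r"dhclient\[\d+\]: DHCPREQUEST"),
]

for line in sys.stdin:
    if not any(p.search(line) for p in IGNORE_PATTERNS):
        sys.stdout.write(line)   # anything unmatched is worth a human's attention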

From here, it’s on you to figure out how to set up remote logging for your environment. I recommend rsyslogd for Linux and similar systems, or you can read up on centralizing the Event Viewer for Windows. Either way, I can hardly recommend centralized logging enough.

Feb 22

Diagnosing Internet Issues, Part Three

This is the third and final installment (for now!) in my brief series on Internet issues. This time we’re addressing throughput, because it’s another one that comes up occasionally.

So here is the scenario: you’ve had a rack of equipment in a datacenter in New York for the last few months, and everything is going well. But you had a brief outage back in 2013 when the hurricane hit, and you’d like to avoid an issue like that in the future, so you order some equipment, lease a rack in a datacenter in Los Angeles, and set about copying your data across.

Only, you hit a snag. You bought 1Gbps ports at each end, and you get pretty close to 1Gbps from speed test servers when you test the systems on each side, but when you start transferring your critical data files you realize you’re only seeing a few Mbps. What gives?!

There are a number of things that can cause low speeds between two systems, it could be the system itself is unable to transmit or receive the data fast enough for some reason, which may indicate a poor network setup (half-duplex, anyone?) or a bad crimp on a cable. It could be that there is congestion on one or more networks between the locations, or in this case, it could simply be due to latency.

Latency, you ask – what difference does it make whether the servers are right beside each other or halfway around the world?! Chances are you are using a transfer protocol that runs over TCP. HTTP does, FTP does, along with many others. TCP has many pros compared with UDP, its alternative. The biggest is that it’s very hard to actually lose data in a TCP transfer, because it’s constantly checking itself. If it finds a packet hasn’t been received, it will resend it until either the connection times out or it receives an acknowledgement from the other side.

[notice]A TCP connection is made to a bar, and it says to the barman “I’d like a beer!”

The barman responds “You would like a beer?”

To which the TCP connection says “Yes, I’d like a beer”

The barman pours the beer, gives it to the TCP connection and says “OK, this is your beer.”

The TCP connection responds “This is my beer?”

The barman says “Yes, this is your beer”

and finally the TCP connection, having drunk the beer and enjoyed it, thanks the barman and disconnects.[/notice]

UDP, on the other hand, will send and forget. It doesn’t care whether the other side got the packet it sent; any error checking needs to be built into the application using UDP. It’s entirely possible that UDP packets will arrive out of order, so your protocol will need to take that into account too.

[notice]Knock knock.

Who’s there?

A UDP packet.

A UDP packet who?[/notice]

If you’re worried about losing data, TCP is the way to go. If you want to just send the stream and not worry about whether it gets there in time or in order, UDP is probably the better alternative. File transfers tend to use TCP, voice and video conversations tend to prefer UDP.
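A quick sketch of that difference at the socket level (the addresses are placeholders, so don’t expect the connections to actually go anywhere):

import socket

# UDP: fire and forget - sendto() returns as soon as the datagram is handed to
# the OS, whether or not anything is listening at the far end.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"did you get this? no way to know", ("192.0.2.10", 9999))

# TCP: connect() doesn't even return until the three-way handshake completes,
# and every byte sent is acknowledged (and retransmitted if it isn't).
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp.settimeout(5)
tcp.connect(("192.0.2.10", 80))
tcp.sendall(b"GET / HTTP/1.0\r\n\r\n")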

But that’s where TCP has its problem with latency: the constant checking. When a TCP stream sends a window of data, it waits for an acknowledgement before sending the next one. If your server in New York is sending data to another server in Los Angeles, remember our calculation from last week? The absolute best ideal-world latency you can hope for is around 40ms, but because fiber doesn’t run in a straight line, and routers and switches on the path slow things down, it’s probably going to be closer to 45 or 50ms. That is, every time you send a window of traffic, your server waits at least 50ms for the acknowledgement before it sends the next one.

The default window size in Debian is 208KB (212,992 bytes). The default in CentOS is 122KB (124,928 bytes). To calculate the maximum throughput, we take the window size in bits and divide it by the latency in seconds. For Debian, our maximum throughput from NY to LA is 212,992 bytes × 8 = 1,703,936 bits, divided by 0.045 = 37,865,244 bps – roughly 38 Mbps, not including protocol overhead. For CentOS we get 999,424 bits / 0.045 = 22,209,422 bps, or about 22 Mbps.
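The same arithmetic in a few lines of Python, using the window sizes above and a 45ms round trip:

def max_tcp_throughput_bps(window_bytes, rtt_seconds):
    # One window per round trip: the bandwidth-delay limit for a single stream.
    return window_bytes * 8 / rtt_seconds

RTT = 0.045   # ~45ms New York <-> Los Angeles

for name, window in [("Debian (208KB)", 212992), ("CentOS (122KB)", 124928)]:
    print(f"{name}: {max_tcp_throughput_bps(window, RTT) / 1e6:.1f} Mbps")

# Debian (208KB): 37.9 Mbps
# CentOS (122KB): 22.2 Mbps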

So each stream gets about 22 Mbps between CentOS servers while you’re paying for 1Gbps at each end. How can we fix this? There are three ways to attack the issue:

1) Reduce the latency between the locations. That isn’t going to happen, because the cities can’t be moved any closer together (at least, not without making both of them really unhappy) and physics limits how quickly we can transmit data over the distance.

2) We can change the default window size in the operating system. That is, we can tell the OS to send bigger windows: instead of sending 208KB and waiting for an acknowledgement, we could send 1024KB and wait, or 4096KB and wait. This has pros and cons. On the plus side, you spend less time per KB waiting for a response, so if the window is sent successfully you don’t wait as long, relatively, for the confirmation. The big negative is that if any part of the window is lost or corrupted, the entire window needs to be resent, and it has to sit in memory on the sending side until it has been acknowledged as received and complete (see the sketch after this list).

3) We can tell the OS to send more windows of data before it needs to receive an acknowledgement. That is, instead of sending one window and waiting for the ack, we send five windows and wait for those to be acknowledged. We have the same con in that the windows need to sit in memory until they are acked, but we are sending more data before we stop to wait, and if one of those windows is lost it’s a smaller amount of data to resend.
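On Linux, the practical knobs for options 2 and 3 are the tcp_rmem/tcp_wmem sysctls system-wide, or the socket buffer sizes per connection – a minimal sketch of the latter, with a made-up destination (the kernel may clamp or double whatever you ask for):

import socket

BUF_SIZE = 4 * 1024 * 1024   # ask for 4MB send/receive buffers

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Bigger socket buffers raise the ceiling on the TCP window this connection
# can use; set them before connect() so window scaling is negotiated properly.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_SIZE)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_SIZE)
sock.connect(("la-backup.example.com", 873))

# See what the kernel actually granted.
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))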

All in all, you need to decide what is best for your needs and what you are prepared to deal with. Option 1 isn’t really an option, but there are a number of settings you can tweak to balance options 2 and 3 for your situation and increase that performance.

On the other hand, you could also just perform your data transfer in a way that sends multiple streams of data over the connection and avoid the TCP tweaking issue altogether.

Feb 15

Diagnosing Internet Issues, Part Two

Last week we covered traceroutes, and why you should gather data on both the forward path and the reverse path. This week we are looking at MTR, why you should use it, and how to interpret the results.

‘traceroute’ is a very handy tool, and it exists on most operating systems. If it isn’t there by default, there is almost certainly a package you can install or, at worst, source code to download and compile for your particular OS. Its one downfall is that it does exactly one task: showing the path. Sometimes that data isn’t enough on its own; you need to see the path over time and observe the situation with averages.

Enter MTR. Initially “Matt’s Trace Route” and since renamed “My Trace Route”, it is a tool that has existed on Unix systems for over 17 years. It has several advantages over traditional traceroute and is preferred by many because of them. It runs multiple traces sequentially and reports the results as it goes, telling you what percentage of packets have been lost at each hop and some latency statistics per hop (average response time, worst response time, etc.), and if the route changes while the trace is in progress it expands each hop with the list of routers it has seen. MTR is available for most Unix systems via their package managers or as a source download; a clone called WinMTR is available for Windows.

Let’s give a quick overview of how traceroute works. First it sends out an ICMP Echo request with a TTL of 1 (classic Unix traceroute actually sends UDP probes by default, while mtr and Windows tracert use ICMP Echo requests; the TTL mechanism is the same). Whenever a packet passes through a router, the router decrements the TTL by 1, and when the TTL on a packet reaches 0 the expectation is that the router will return an ICMP Type 11 packet, Time Exceeded. So when traceroute sends the first packet, with a TTL of 1, it expires at the first router it encounters, and the returned packet contains enough information for traceroute to know the IP address of the first hop and do a reverse DNS lookup for its hostname. Then it sends another ICMP Echo request with a TTL of 2; this packet passes through the first hop, where the TTL is decreased to 1, and then expires at the second hop. This carries on until either the maximum TTL is reached (typically 30 by default) or the destination is reached.

[Image: mtrex1 – traceroute probes with increasing TTL]

In this example, the red line is TTL=1, the green line is TTL=2, the orange line TTL=3 and the blue line TTL=4, where we reach our destination and the trace is complete.
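If you want to see the TTL trick in action yourself, here’s a rough sketch using the scapy library (third-party, needs root, and the target address is a placeholder) – it’s essentially a poor man’s traceroute:

from scapy.all import IP, ICMP, sr1   # third-party library, requires root

TARGET = "203.0.113.10"   # placeholder destination

for ttl in range(1, 31):                       # same default cap of 30 hops
    reply = sr1(IP(dst=TARGET, ttl=ttl) / ICMP(), timeout=2, verbose=0)
    if reply is None:
        print(f"{ttl:2d}  * * *")              # no answer (filtered or deprioritized)
    elif reply[ICMP].type == 11:               # Time Exceeded: this router is hop N
        print(f"{ttl:2d}  {reply.src}")
    else:                                      # Echo Reply: we reached the target
        print(f"{ttl:2d}  {reply.src}  (destination)")
        break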

This is where we start to encounter issues, because in the land of hardware routers there is a distinct difference between how a router handles traffic for which the router itself is the destination (traffic TO the router) and how it handles traffic for other destinations (traffic THROUGH the router). In our example above, the router at the first hop needs to receive, process, and respond to the packet. The other three packets it doesn’t need to look at; it only needs to pass them on. From a hardware perspective they are handled by two entirely different parts of the router.

Cisco routers, like those of the other hardware vendors, typically have grossly underpowered processors for general compute tasks – most are under the 2GHz mark, and many older routers still in production run CPUs under 1GHz. Despite the low clock speed, they remain fast by doing the work in hardware. Unfiltered traffic that only passes through the router is handled by the forwarding or data plane. The forwarding plane is updated as required by the control plane, but beyond that it just sits back and transfers packets like that was its purpose (hint: it was its purpose).

That significantly reduces the load on the CPU, but it will still be busy with any number of tasks from updating the routing tables (any time a BGP session refreshes, the routing table needs to be rebuilt) to processing things like SNMP requests to allowing administrators to log in via Telnet, SSH or the serial console and run commands. Included in that list is processing ICMP requests directed to the router itself (remember, ICMP packets passing through the router aren’t counted).

[Image: mtrex2]

To prevent abuse, most routers have a control plane policy in place to limit different kinds of traffic. BGP updates from known peers are typically accepted without filter, while BGP packets from unknown neighbors are rejected without question. SNMP requests may need to be rate-limited or filtered, but if they come from a known safe source, such as your monitoring server, they should be allowed through. ICMP packets may be dropped by this policy, or they may just be rate-limited. In any case, the CPU tends to treat them as low priority, so if it has more important tasks to do they just sit in the queue until the CPU has time to process them, or they expire.

Why is this important? Because these behaviors play a large role in interpreting an MTR result, such as the one below:

[Image: mtrex3 – example MTR output]

The first item is the green line. For some reason, we saw a 6% loss of the packets that were sent to our first hop. Remember the difference between traffic TO a router and traffic THROUGH a router. We are seeing 6% of packets being dropped at the first hop, but there is not a “packet loss issue” at this router. All that we are seeing is that the router is, either by policy or by its current load, not responding or not responding in time to our trace requests. If the router itself were dropping packets, we’d see that 6% propagating through the rest of the trace.

The second item is the yellow line – notice how the average response time for this hop is unusually high compared to the hops before it? More importantly, notice how the next hop is lower again? This is a further sign that what you might read as an issue in MTR is not really an issue at all. As with the green line, all we see here is that the router at hop 3, either by policy or by current load, is too busy to respond as quickly as the other hops, so its responses are delayed. Traffic through the router is still being forwarded quickly; ICMP traffic to the router is being answered more slowly.

The pink box is, to me, the most interesting one, though this is getting a little off track. Here we see the traffic go from New York (jfk, the airport code for one of New York’s airports; LGA is another common one on New York devices) to London (lon). There are several fiber links between New York (along with some other US east coast cities) and London, but it still takes time for the light to travel between those two cities. The speed of light through a fiber optic cable is somewhere around c/1.46 (where c is the speed of light in a vacuum, roughly 300,000km/s, and 1.46 is the refractive index of fiber optic cable). The distance from New York to London is around 5,575 km. So even if the fiber ran in a straight line, the best latency we could expect between those two locations is 5575 * (1.46 / 300,000) * 1000 * 2, or about 54ms.

[notice]This is a simple calculation for guessing ideal scenarios. It is generally invalid to use it as an argument because a) fiber optic cables are rarely in a straight line, b) routers, switches and repeaters often get in the way, and c) most of these calculations are on estimates which err on the side of a smaller round trip time than a longer one.

The calculation is as follows:

$distancebetweentwocities_km * (1.46 / 300,000) * 1000 * 2

(or for you folk not upgraded to metric, $distancebetweentwocities_miles * (1.46 / 186,000) * 1000 * 2)

The term in brackets is the time (in seconds) it takes light to travel one kilometre of fiber. Multiply it by the distance between your two cities to get the estimated one-way time in seconds, multiply that by 1000 to get milliseconds, then multiply by 2 to get your round trip time in ms.[/notice]
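The same estimate as a quick script, using rough great-circle distances:

FIBER_INDEX = 1.46      # refractive index of fiber
C_KM_PER_S = 300_000    # speed of light in a vacuum, km/s

def ideal_rtt_ms(distance_km):
    # Best-case round trip over a perfectly straight fiber run.
    return distance_km * FIBER_INDEX / C_KM_PER_S * 1000 * 2

print(ideal_rtt_ms(5575))   # New York <-> London: ~54ms
print(ideal_rtt_ms(3940))   # New York <-> Los Angeles: ~38ms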

We can see from the trace that it takes closer to 70ms for the packet to go from New York to London, so we’re actually looking pretty good.

Back to the real results: we see a 2% loss at hops 13 and 15. Over 50 packets that is only one packet each, so it’s difficult to know for sure, but remembering what we said about through vs. to, it is possible that the packet at hop 13 was genuinely lost in transit rather than just deprioritized by the router. We also have a 24% drop at hop 14, so it’s equally possible that the loss at hop 13 was coincidental and the packet missing at hop 15 was actually lost at hop 14.

So there it is, interpreting MTR results. The core notes you should take away are these:

  • If you see packet loss at a hop along the route, but the loss is not carried through the remaining hops, it is more likely a result of ICMP deprioritization at that hop; it is almost certainly not packet loss along the path.
  • The same applies to latency: if you see increased latency at a single hop but it is not carried through the remaining hops, it is more likely a result of ICMP deprioritization at that hop.
  • If you’re submitting an MTR report to your ISP to report an issue, make sure you get one for the return path as well (see Part One, last week’s post).


Feb 08

Diagnosing Internet Issues, Part One

Having worked in the support team of a network services provider, I can say it’s fairly common to see customer tickets come in complaining about packet loss or latency to or through our network. Many of these are the result of the customer running an MTR test to their IP and not fully understanding the results, and with a little education on how to correctly interpret an MTR report they are usually a little happier and more satisfied with the service.

More recently, however, I’ve noticed more and more people giving incorrect advice in some social communities, which perpetuates the problem. There is already a wealth of knowledge on the internet about how to interpret things like ping results and MTR reports, but I’m going to present this anyway as another reference.

This post, however, deals with the basics of how the internet fits together from a networking standpoint; we’ll look at some well-known things and some lesser-known ones.

There is an old phrase that just about everyone has heard: the Internet is a Series of Tubes. It’s not far from the truth, really. They’re just tubes of copper and fiber carrying electrons and light which, through the magic of physics and the progression of technology, allow us to transmit hundreds, thousands, millions of 0s and 1s across great distances in fractions of a second and send each other cat pictures and rock climbing videos.

Take the following as an example. The two squares represent two ends of a connection, say your computer and my web server. In between and all around are any number of routers at your house, your ISP, their uplinks, peers, and providers, and in turn the uplinks, peers, and providers of my server’s host, their routers, and finally the server itself:

[Image: seriesoftubes – two endpoints connected by many routers and links]

The yellow lines represent links between the different routers, and I haven’t included their potential links to other providers outside the image. This, essentially, is what the internet looks like. Via protocols like BGP, each router knows what traffic it is responsible for routing (e.g. my router may be announcing 198.51.100.0/24 to the internet, and through BGP my providers will also let the rest of the internet know that traffic destined for 198.51.100.48 will need to come to my router), and each router also keeps track of what its neighbors are announcing. This allows the internet to stay fluid and dynamic as IP addresses move around between providers and so on.

So let’s say you wanted to reach my server, as you did when you opened this web page. The simplest example is the one we gravitate to: it simply uses the shortest possible path:

[Image: seriesoftubes2 – the shortest path between the two endpoints]

The purple line represents the common “hops” between devices; in this case the traffic passes through six routers on its way from your computer to my server, and then the same six hops when my server sends back the page data. In the “old days” of the internet this was actually a pretty accurate representation of traffic flow, as there were (compared to today’s internet) only a handful of providers and only a couple of links when it came to crossing large distances, such as Washington DC to Los Angeles.

Today there are significantly more providers, and millions of links between various parts of the world. Providers have peering agreements with each other that determine things like how much traffic can be sent across any given link and what it costs to transfer data. As a result, if it costs an ISP $0.10/Mbps to send traffic through provider A but $0.25/Mbps to send it through provider B, there is an incentive to accept traffic over either link but to avoid sending it via provider B whenever cheaper peers are available.

What this means is that it’s entirely possible (and in fact, more common than not) for traffic to go out through one path and come back through a separate path:

[Image: seriesoftubes3 – asymmetric forward and return paths]


In this example we still see purple for the common links, but the red shows traffic going from left to right, while the blue shows traffic from right to left. See how it took a different path? Any number of variables play into this, and it usually comes down to providers preferring certain paths because of capacity concerns or, more likely, the cost of transmitting data.

Let’s take a practical example with two traceroutes. I used a VPS in Las Vegas, NV, and a free account at sdf.org and from each one, traced the other. Here’s the trace from Vegas to SDF:

[Image: reversepathex2 – traceroute from the Las Vegas VPS to SDF]


And the return path:

[Image: reversepathex1 – traceroute from SDF back to the Las Vegas VPS]


Now, it’s cut off in the screen, but I happen to know that “atlas.c …” is Cogent, so from a simple analysis we see that traffic went to SDF via Cogent, and came back via Hurricane Electric, or HE.net:

[Image: reversepath – forward path via Cogent, return path via Hurricane Electric]


For this reason, when you submit traceroutes to your ISP to report an issue, you should always include traces in both directions whenever possible. Having trouble reaching a friend’s FTP server? Ask them to give you a traceroute back to your network. If the issue is in transit, there is a 50/50 chance it’s on the return path, and that won’t show up in a forward trace.

The network engineers investigating your problem will thank you, because they didn’t have to ask.


Dec 22

Read Only Friday

It was about two years ago that I first heard of the concept of Read Only Friday. I thought it was great then, but having spent the last year in a customer-facing role, especially in an organization that doesn’t practice ROF (and a role that includes weekend support), I see the benefits of a Read Only Friday policy more clearly all the time.

For those of you who don’t know, Read Only Friday is a simple concept. It states that if your regular staff are not going to be on duty the following day (e.g. because it’s Friday and tomorrow is Saturday, or today is December 23rd and the next two days are company holidays for Christmas), you do not perform any planned changes to your production environment.

That is to say, you shouldn’t be planning network maintenance or application roll-outs for a Friday, simply because if something goes wrong then, at best, someone is staying late to fix it (and that’s never fun, less so on a Friday) or, at worst, someone is going to get a call and potentially spend a significant part of their weekend resolving whatever happened.

I do see the logic behind Friday maintenance – especially for organizations where staff availability for upgrades is low but the requirement is that the work won’t occur within generally accepted business hours. Personally, I still (naively) think that Sunday night or any other weeknight would work better while achieving the same goal. If anything, it might improve the quality of the work being done, because the person performing the maintenance is more likely to be the one getting the call the next day. Of course, you could also institute a rule that if anything breaks that could be related to work done on a Friday, the individual who did it gets the call.

Now, Read Only Friday doesn’t prohibit changing production at all; sometimes things break on a Friday and necessitate work, but those changes are generally unplanned and are made to restore normal operation.

I am all for making life easier, not just for the folks who have to talk to angry customers, but for the higher-level people who inevitably get the call to fix something that’s broken. More so on Christmas Day, when they should be at home with their families and/or friends, celebrating the fact that they can be at home with their families and friends.

The development environment, on the other hand, is fair game.