Diagnosing Internet Issues, Part One

Having worked in the support team for a Network Services Provider, I see customer tickets come in fairly regularly complaining about packet loss or latency through or to our network. Many of these are the result of the customer running an MTR test to their IP and not fully understanding the results; with a little education on how to correctly interpret an MTR report, they are usually happier and more satisfied with the service.

More recently, however, I've noticed more and more people giving incorrect advice in various online communities, which perpetuates the problem. There is already a wealth of knowledge on the internet about how to interpret things like ping results or MTR reports, but I'm going to present this anyway as another reference.

This post, however, deals with the basics of how the internet fits together from a networking standpoint; we'll look at some well-known things and some lesser-known things.

There is an old saying that just about everyone has heard: the Internet is a Series of Tubes. It's not far from the truth, really: they're just tubes of copper and fiber carrying electrons and light which, through the magic of physics and the progression of technology, allow us to transmit hundreds, thousands, millions of 0s and 1s across great distances in fractions of a second and send each other cat pictures and rock climbing videos.

Take the following as an example. The two squares represent two ends of a connection, say your computer and my web server. In between and all around are any number of routers at your house, your ISP, their uplinks, peers, and providers, and in turn the uplinks, peers, and providers of my server’s host, their routers, and finally the server itself:

[Image: seriesoftubes]

The yellow lines represent links between the different routers, and I haven't included their potential links outside the image to other providers. This, essentially, is what the internet looks like. Via protocols like BGP, each router is aware of what traffic it is responsible for routing (e.g. my router may be announcing 198.51.100.0/24 to the internet; through BGP, my providers will also let the rest of the internet know that traffic destined for 198.51.100.48 will need to come through my router), and each router also keeps track of what its neighbors are announcing. This allows the internet to be fluid and dynamic in terms of IP addresses moving around between providers and so on.
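
If you're curious which network announces a particular address, Team Cymru's public IP-to-ASN whois service will map an IP to the prefix covering it and the name of the AS announcing it. A quick sketch (the address here is just the documentation example from above, so it won't return a real announcement):

$ whois -h whois.cymru.com " -v 198.51.100.48"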

So let’s say you wanted to reach my server, as you did when you opened this web page. The simplest example is the one we gravitate to: it simply uses the shortest possible path:

[Image: seriesoftubes2]

The purple line represents the common "hops" between devices, and in this case the traffic passes through 6 routers on its way from your computer to my server, and then the same 6 hops when my server sends back the page data. In the "old days" of the internet, this was actually a pretty accurate representation of traffic flow, as there were (compared to today's internet) only a handful of providers and only a couple of links when it came to crossing large distances, such as Washington DC to Los Angeles.

Today there are significantly more providers, and millions of links between various parts of the world. Providers have peering agreements with each other that determine things like how much traffic can be sent across any given link and what it costs to transfer data. As a result, if it costs an ISP $0.10/Mbps to send traffic through provider A but $0.25/Mbps to send it through provider B, there is an incentive to happily receive traffic over either link, but to avoid sending traffic via provider B whenever cheaper peers are available.

What this means is that it’s entirely possible (and in fact, more common than not) for traffic to go out through one path and come back through a separate path:

[Image: seriesoftubes3]

In this example, we still see purple for the common links, but the red shows traffic going from the left to the right, while the blue shows traffic from the right to the left. See how it took a different path? Any number of variables play into this, and it usually comes down to the providers preferring certain paths due to capacity concerns or, more likely, the cost of transmitting data.

Let's take a practical example with two traceroutes. I used a VPS in Las Vegas, NV, and a free account at sdf.org, and from each one traced the other. Here's the trace from Vegas to SDF:

[Image: reversepathex2]

And the return path:

[Image: reversepathex1]

Now, it's cut off in the screenshot, but I happen to know that "atlas.c …" is Cogent, so from a simple analysis we can see that traffic went to SDF via Cogent, and came back via Hurricane Electric, or HE.net:

[Image: reversepath]

For this reason, whenever you submit traceroutes to your ISP to report an issue, include traces in both directions whenever possible. Having trouble reaching a friend's FTP server? Ask them to give you a traceroute back to your network. If the issue is in transit, there is a 50/50 chance it's on the return path, and that won't show up in a forward trace.
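
A report-mode MTR from each end is an easy way to capture both directions; a sketch (the hostname is a placeholder for the far end, and your friend would run the same thing back toward your IP):

$ mtr --report --report-cycles 100 server.example.com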

The network engineers investigating your problem will thank you, because they won't have to ask.

Read Only Friday

It was about two years ago that I first heard the concept of Read Only Friday. I thought it was great then, but after a year in a customer-facing role at an organization that doesn't practice ROF (a role that includes weekend support), I see the benefits of having a Read Only Friday policy more clearly than ever.

For those of you who don’t know, Read Only Friday is a simple concept. It states that if your regular staff are not going to be on duty the following day (e.g. because it’s Friday and tomorrow is Saturday, or today is December 23rd and the next two days are company holidays for Christmas), you do not perform any planned changes to your production environment.

That is to say, you shouldn't be planning network maintenance or application roll-outs for a Friday, simply because if something goes wrong then, at best, someone is staying late to fix it (and that's never fun, less so on a Friday) or, at worst, someone is going to get a call and potentially spend a significant part of their weekend resolving whatever happened.

I see the logic behind Friday changes, especially for organizations where staff availability for upgrades is low but the requirement has been set that work won't occur within generally accepted business hours. Personally, I still (naively) think that Sunday night, or any other weeknight, would work better while achieving the same goal. If anything, it may improve the quality of the work being done, because the person performing the maintenance is more likely to be the one getting the call the next day. Of course, you could also institute a rule that if anything breaks that could be related to work done on a Friday, the individual who did it gets the call.

Now, it doesn't restrict you from changing production at all; sometimes things break on a Friday and necessitate work, but those changes are generally unplanned and are made to restore operation.

I am all for making life easier, not just on the plebs who have to talk to angry customers, but on the higher-level people who inevitably get the call to fix something that's broken. More so on Christmas Day, when they should be at home with their families and/or friends, celebrating the fact that they can be at home with their families and friends.

The development environment, on the other hand, is fair game.

Nagios 4 on Ubuntu 13.10

Last night I installed Nagios 4 on an Ubuntu 13.10 VM running on my VMWare ESXi machine.

I followed the steps listed here, which worked overall:

https://raymii.org/s/tutorials/Nagios_Core_4_Installation_on_Ubuntu_12.04.html

I did run into a small number of issues which should be documented for anyone trying this on a later install of Ubuntu, but raymii isn’t allowing comments, so I can’t post them there.

1) The Apache configuration layout changed (as it did in Debian, breaking a few things): instead of a conf.d directory, it now uses conf-available and conf-enabled, just like mods and sites. Further, the configure script doesn't pick this up automatically, so you need to tell it where the config directory is (I pointed it at the conf-available directory and then linked from conf-enabled, per the usual convention); see the commands after this list.

2) The 'default' site which they tell you to disable has been renamed; "a2dissite 000-default" will disable it.

3) Most critically, Apache doesn't have CGI enabled automatically, so you need to enable the CGI module and reload Apache before the Nagios web interface will work.
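
For reference, the adjustments on my install looked roughly like this (a sketch from memory; the --with-httpd-conf option and the paths are as I recall them, so double-check them against ./configure --help and your own layout):

# ./configure --with-httpd-conf=/etc/apache2/conf-available
# ln -s /etc/apache2/conf-available/nagios.conf /etc/apache2/conf-enabled/nagios.conf
# a2dissite 000-default
# a2enmod cgi
# service apache2 reload

On Apache 2.4 the "a2enconf nagios" helper should also work in place of the manual symlink, and depending on your MPM you may need the cgid module instead of cgi.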

Overall, Nagios 4 seems to have installed correctly and is running nicely. I'll try to get my configs over from my production system, see how they run, and start playing with the new features in version 4.

I've also installed Zabbix on another VM; I'll be doing some more reading, seeing if I can duplicate at least some of the checks I have in Nagios into Zabbix, and then writing up a review comparing them.

Term Pronunciation

Something I've noticed in my time working with systems is that the vast majority of our terms are written down 90% of the time; it's very rare that they are verbalized. This leads to interesting divisions within our community, where we develop different ways to pronounce things.

One of the more common examples is SQL. Some people say "sequel" while others spell out each letter, "S, Q, L." Which one is right seems to depend on which specific SQL implementation is being referred to: in Microsoft circles it's Sequel Server, while the popular open-source database is My Ess Queue Ell.

The goal of this page is to list a few of the written terms and how I tend to pronounce them. By no means are they necessarily accurate, and it really doesn't matter so long as whoever is listening understands what you are saying when you speak them out loud. If you say one of them differently, drop me a comment; I'd be interested to know what it is and how you say it, and possibly why.

IRC – Eye Are See. When I was in high school, I had a friend who would say it as “urk”, and his client of choice was “murk.” It took me a while to realize what he was saying, but for me it’s always been an acronym, I, R, C.

SQL – Depends how I feel, but usually again, S, Q, L. If I’m having a conversation with someone, it may vary based on what they call it just to reduce confusion. This one I’m flexible on.

/var – The name stems from the word “variable” as the contents of this directory are subject to change regularly, things like /var/log and /var/spool, but how I say it usually rhymes with “bar,” “far” or “car.” I guess it should possibly be said closer to “there” as in “there-ee-ah-bull” but I don’t care. (If we’re going by that rule, “lib” should be said as “lyb”, since it’s short for “libraries” – most people call it “lib” to rhyme with “bib” or “fib”)

/etc – This gets referred to by different names also. Some people around here call it "etsy," others call it "E, T, C"; I usually skip the debate entirely. When I do have to pronounce it, it tends to be as "etcetera," since that's how you usually pronounce things that are written as "etc."

eth0 – The rarely spoken name of the primary network interface on common Linux, and some other Unix-based systems. My verbalization of this comes from my pronunciation of what it’s short for, ethernet or “ee-ther-net.” That said, I tend to say it as “eeeth”, to sound similar to “beef” (without the b). Others I’ve heard recently say it as “eth” to rhyme with “Beth.” I guess I just contradicted my rule on saying things based on how they look, rather than what they’re short for!

Linux – This is pretty well standard now; "linnix" is the common pronunciation. When it was first introduced, however, it tended to be said more in line with the name of its creator, Linus (Ly-niss): we would say it "lie-nicks" or "lie-nucks."

Those are the ones I can think of, there may be an update to this post, or perhaps a SQL in the future.

I do want to see your comments, however: what are some tech/IT terms you see written down all the time but hear pronounced differently from time to time?

Coincidence

I'm typically not one to believe in coincidence, but I'm also a fan of correlation not necessarily proving causation, while also not disproving it.

We run a large number of Tripplite PDU (power distribution unit) devices in customer cabinets in our facility; they are largely how we determine power usage and bill for overages (using SNMP cards, a private VLAN, and a set of monitoring systems). It's rare that we have issues with them, but just the other day we suddenly started getting multiple communication alerts for one PDU in particular.

I saw it and noted it, verified everything attached was online, and left it alone until I had a few minutes to go and reseat the web card. I didn't think about it again until it tripped an hour later, and then again, and then again. Several times it would come back online on its own without us going to reseat the card. There seemed to be no rhyme or reason as to why it was dropping or when. This was late on a Friday, so we decided to ride it out through the weekend and let the provisioning team know so it could be handled on Monday.

Today (Sunday) I came in and saw a few alerts for the same PDU, and then noticed one for another PDU in a different cab with the same issue. These two have very little in common: they're the same type of PDU, but they're in different cabinets in different rows, attached to two different switches. I checked the alert history and noticed it had done the same thing 4-5 times in the previous 72 hours. It seemed like both had failing web cards, but it seemed odd that they would fail together, especially separated as they are. At this point, our provisioning engineer who works Sundays was already investigating the first one, so I added it to his notes to take a look.

To cut an even longer story short, it was determined that at some point in the past, one of the two cards had crapped itself; it was no longer showing the serial number in the web interface and it had reset its MAC address to "40:66:00:00:00:00." This wasn't a major issue; it was still responsive and everything stayed happy. Until one day (earlier this week, presumably) the other card in the other PDU did the same thing: no serial and MAC address 40:66:00:00:00:00. Now we had a MAC address conflict on the VLAN, and suddenly they began interfering with each other. Once this was determined we pulled one of the cards; the alerts have been quiet for several hours, and pinging the remaining card shows no packet loss over more than 6000 pings.
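
In hindsight, a duplicate MAC like this is easy to spot from any Linux host on the monitoring VLAN; something along these lines (a sketch, run from a box that has recently talked to both PDUs) would show two different IPs resolving to the same hardware address:

# ip neigh show | grep -i 40:66:00:00:00:00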

The good news is that to resolve the MAC address conflict, we really only need to replace one of the two cards. For now at least.

Communication and Its Relation to Customer Satisfaction

As systems administrators, we work in an environment where everything we do provides a service of some kind, whether it's providing a shared hosting server to multiple users at a low cost, managing racks of servers for a client for thousands of dollars a month, or maintaining an Active Directory infrastructure for a small business going about its day-to-day business selling lumber or making pipes. We have been given a set of expectations, and our job is to meet those expectations.

Sometimes, however, these expectations aren't well communicated, and this causes all kinds of problems. We had a customer on our shared platform just this week who had a problem, had done a small amount of research into what he wanted, and started ordering addons to his existing service with us to fix it. Little did either of us realize that his solution to the problem was being poorly communicated, so while we worked with him to provide what he had requested, it didn't actually resolve the problem he had to begin with.

The customer was using our cPanel environment and had been seeing SSL errors while accessing his webmail account at hisdomain.com/webmail. This is typically not an issue, and we have a valid SSL certificate if the site is accessed as server.ourdomain.com:2096, but the customer was concerned that his encrypted connection was not being handled as securely as it should be.

However, what he initially communicated to us, was that he planned to conduct business via his website and that the tax regulations of his country demanded this be done via an encrypted website, necessitating an SSL certificate. He bought a dedicated IP addon for his account and then opened a ticket to request the SSL certificate, explaining his reasoning as above.

And so we provided just that: we assigned the IP to his account, issued an SSL certificate for his domain, and installed it. After 25 pages in the ticket (many of which were the result of an error on his side that caused every response he made to come through 4 times) and a long back and forth, we eventually realized that what he had asked for wasn't even close to what he wanted.

This will inevitably lead to a position where, despite our best efforts and the involvement of a large number of people scrambling over themselves to help meet the customer's needs, the customer will leave the experience deeply unsatisfied and feeling that we have in some way cheated him.

Communication is the key to ensuring that our customers are satisfied: that they understand the problem, and what the solution they are buying can actually do to resolve it.

“If It Isn’t Documented, Then It Isn’t Done”

Wise words from a Senior Sysadmin that I heard today, fortunately not directed at me.

We hear this often: comment your code, document your processes. How many of us actually do it, and do it in a way that someone else can follow?

Documentation is important for many, many reasons, but primarily to ensure that someone else can take over for you if required.

Fridays are a great day for this, especially if you work in an environment that supports the idea of "read-only Friday," where any change to a production system is banned and changes to non-production systems are not recommended. Use your time to write documentation, so that if you are sick, take a vacation, or move on to new opportunities, you can rest easy knowing that those filling your position, either temporarily or permanently, are doing so without cursing your name or screwing things up unnecessarily.

And if you’re working on a new project, whether it be building out a system or developing a new tool: If it isn’t documented, then it isn’t done.

Handling Outages – A Lesson from Hostgator

Yesterday Hostgator had a major outage in one of its Utah datacenters which caused a number of customers to be offline for several hours. The outage actually impacted Hostgator, Hostmonster, Justhost, Bluehost, and possibly more, but this post regards the Hostgator response specifically.

These companies all provide shared webhosting services. I am neither a client nor an employee of any of the businesses, though I do know people who are.

The company I do work for has had its share of outages, and I am doing what I can internally to help improve our own practices when outages happen; I will consider following up on this with my manager next week to see if we can learn anything from it. What I saw during the outage, as an outsider, is interesting. There were three outlets of information provided, which we'll analyze.

The first is the Hostgator Support Boards, their public forums where users can ask each other for help and staff can jump in to provide assistance as well. There was a thread about the outage; I've taken an excerpt (original):

[Image: 2013-08-03_1300]

The thing that stands out most is that it is really the same update over and over again; no new information is being provided to the customer. This might work just fine for brief outages, but when the initial outage notification is at 10:30am, providing the same details until 4pm with nothing of substance in between is unacceptable. For six hours, forum users were told by this thread, in several forms and variations, that "the issue is ongoing, and our staff are working to resolve it."

Another outlet of information was the HostGator Twitter account (here), which had the following to say (note it is in reverse chronological order, captured at 1pm EDT today, Saturday):

[Image: 2013-08-03_1301]

Times below are based on EDT.

Again, an initial report just after 9am, followed shortly by an (incorrect) report at 9:40am that things are returning to normal. At 10:45am the outage is announced, and at midday users are directed to the above forum post, which has no details worth anything to someone wondering why their site has been down for hours. Still no useful news comes via Twitter until just before 4pm, when they announce a new site to provide updates.

And so we reach the third source of information, found here, which had updates every half hour from 3:30pm to 6pm, when the issues were finally resolved for the day. This is the only source where useful data for the technically minded could be found.

[Image: 2013-08-03_1302]

It turns out there were issues with both core switches at the facility, which brought the entire network down. Not only did it take 8-9 hours to fix, it also took 6 hours for the company to provide any useful information as to what the problem was and what was being done to fix it.

Providers should look at this stream of communication, consider whether they would find it acceptable, and review how they handle their own outages. I have been in this situation as a customer, albeit with a different provider. If there is an outage for 10 minutes, I can be quickly placated with a "there is an issue, we're working on it." If the outage extends anywhere beyond about an hour, I want to know what is wrong and what is being done to fix it. Not because I want inside information, but because I want you to demonstrate that you are competent to find and fix the problem; this is what gives me confidence to continue using your service after the issue is fixed. And if your service is down beyond 2 or 3 hours, I am going to expect new, useful updates at least hourly, ideally more often, so that I can follow the progression of solving the problem.

For me as a customer, the issue isn't that your service went down; I understand that things break. It is more important that you provide honest and useful details on why it is down and when you expect to have it fixed, even if those details are subject to change.

Changing Puppet Masters

As users of puppet, occasionally we need to migrate nodes from one master to another.

In my case, I'm decommissioning my old puppet server, having stood up a new one as part of my "migrate home" project.

I ran into a couple of minor issues, but this is essentially the process for moving a node from one master to another.

First, stop puppet (this isn’t necessary, but good practice):

# /etc/init.d/puppet stop

Next, edit your puppet.conf to point the agent at the new master.
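
On a Debian package install that typically means changing the server setting in /etc/puppet/puppet.conf; a minimal sketch, with the hostname standing in for whatever your new master is actually called:

[agent]
    server = puppetmaster.example.com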

Now, if you start puppet again you'll likely get SSL errors and it won't work, because the node's existing certificates were signed by the old master's CA. Clear them out:

# rm -r /var/lib/puppet/ssl/*

Warning: this path is for Debian package-installed systems; if this is not your system, check your puppet.conf to determine where the SSL directory is.

# /etc/init.d/puppet restart

Now switch back to your NEW master and look for the new certificate, and if it checks out, sign it:

# puppet cert list

“swedishchef.i-al.net” (SHA256) 05:5E:23:7E:03:A9:58:B6:F2:FE:F6:D4:A1:C3:CE:FD:8B:64:4D:F2:D5:87:02:22:7A:C1:44:8D:D8:44:8E:E8

# puppet cert sign swedishchef.i-al.net

Notice: Signed certificate request for swedishchef.i-al.net

Notice: Removing file Puppet::SSL::CertificateRequest swedishchef.i-al.net at ‘/var/lib/puppet/ssl/ca/requests/swedishchef.i-al.net.pem’
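
Back on the node, a one-off foreground run is a quick way to confirm the new master is now serving it a catalog (--test is the standard one-shot agent run):

# puppet agent --test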

Check that everything is running and in order; if not, debug as normal. As always, try this in a test environment first; I take no responsibility for broken production environments based on the above.

Migrating Home

My “Personal Project” for a long time has been small web/mail hosting, primarily for myself. For the last year or more I’ve achieved this with a group of small VPS services.

It started out back in 2008, I think, with a 256MB Xen machine hosted by ezVPS (no longer in business). Eventually I picked up a second one from the same provider and spread the different sites across the two servers.

As time went on and my aspirations grew, I rented a 512MB KVM server from BuyVM/Frantech. When ezVPS shut down, I was already in the process of moving one of my servers to a 256MB KVM with BuyVM, and I was able to snag another 256MB instance and move the other one. Right now I'm paying ~$20/mo for three servers ($10 for the 512 and $5 each for the 256s). Money has been a little tight, however, and now that I'm paying for and controlling the internet connection where I live, I felt it was time to start moving things home.

I started by creating some new VMs on my VMWare server. I have one each for Administrative purposes (mostly just Puppet), the Panel (ISPConfig), the Web server, and the Mail server (though it will be shut down and I’ll use one of the 256s). With everything appearing to be running nicely, I started by moving one site to the new server. All appears in order, so it’s time to start moving the rest and slowly getting everything off the 512.

Once everything has moved off, I can shut it down, cancel, and start saving $10 a month. So far, so good.