Read Only Friday

It was about two years ago that I first heard of Read Only Friday. I thought it was a great concept then, but having worked in a customer-facing role for the last year, especially in an organization that doesn't practice ROF (and with weekend support as part of my customer-service duties), I see the shining benefits of a Read Only Friday policy more clearly than ever.

For those of you who don’t know, Read Only Friday is a simple concept. It states that if your regular staff are not going to be on duty the following day (e.g. because it’s Friday and tomorrow is Saturday, or today is December 23rd and the next two days are company holidays for Christmas), you do not perform any planned changes to your production environment.

That is to say, you shouldn't be planning network maintenance or application roll-outs for a Friday, simply because if something goes wrong then, at best, someone is staying late to fix it (never fun, less so on a Friday) or, at worst, someone is going to get a call and potentially spend a significant part of their weekend resolving whatever happened.

I see the logic behind it – especially for organizations where staff availability for upgrades is low but the requirement is that the work won't occur within generally accepted business hours. Personally, I still (naively) think that Sunday night, or any other night of the working week, would serve better while achieving the same goal. If anything, it may improve the quality of the work being done, because the person performing the maintenance is more likely to be the one getting the call the next day. Of course, you could also institute a rule that if anything breaks that could be related to work done on a Friday, the individual who did that work gets the call.

Note that the policy doesn't forbid changing production entirely. Sometimes things break on a Friday and necessitate work, but those changes are generally unplanned, and their purpose is to restore operation.

I am all for making life easier, not just on the plebs who have to talk to angry customers, but on the higher-level people who inadvertently get the call to fix something that's broken. More so on Christmas Day, when they should be at home with their families and/or friends, celebrating the fact that they can be at home with their families and friends.

The development environment, on the other hand, is fair game.

Nagios 4 on Ubuntu 13.10

Last night I installed Nagios 4 on an Ubuntu 13.10 VM running on my VMware ESXi machine.

I followed through the steps listed here, which worked overall:

https://raymii.org/s/tutorials/Nagios_Core_4_Installation_on_Ubuntu_12.04.html

I did run into a small number of issues that should be documented for anyone trying this on a later release of Ubuntu, but raymii isn't allowing comments, so I can't post them there.

1) The Apache configuration layout changed (as it did in Debian, breaking a few things): instead of a conf.d directory, it now uses conf-available and conf-enabled, just like mods and sites. Further, the configure script doesn't pick this up automatically, so you need to tell it where the config directory is. I pointed it at the conf-available directory and then linked from conf-enabled, per the usual convention.
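In practice that looked something like this (paths are for Ubuntu's Apache 2.4; double-check the flag name against ./configure --help on your tarball):

# ./configure --with-httpd-conf=/etc/apache2/conf-available
# make install-webconf
# ln -s /etc/apache2/conf-available/nagios.conf /etc/apache2/conf-enabled/nagios.conf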

2) The 'default' site they tell you to disable has been renamed; you can "a2dissite 000-default" to remove it.

3) Most critically, Apache doesn't have CGI enabled out of the box, and the Nagios web interface depends on it. The fix was enabling the CGI module and reloading Apache.
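On a stock Ubuntu 13.10 Apache that's just the following (you may need cgid instead of cgi, depending on your MPM):

# a2enmod cgi
# service apache2 reload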

Overall, Nagios 4 seems to have installed correctly and is running nicely. I'll try to bring my configs over from my production system, see how they run, and start playing with the new features in version 4.

I've also installed Zabbix on another VM. I'll be doing some more reading, seeing if I can't duplicate at least some of my Nagios checks in Zabbix, and then writing up a review comparing the two.

Term Pronunciation

Something I've noticed in my time working with systems is that the vast majority of our terms are almost always written down; it's very rare that they are verbalized. This leads to interesting divisions within our community, where we develop different ways to pronounce things.

One of those that seems more common is SQL. Some people pronounce it "sequel," while others spell out each letter: "S, Q, L." Which one is right seems to depend on which specific SQL implementation is being referred to. In Microsoft circles, it's Sequel Server; the popular open-source database is considered My Ess Queue Ell.

The goal of this page is to list a few of these written terms and how I tend to pronounce them. By no means are they necessarily accurate, and it really doesn't matter, so long as whoever is listening understands what you are saying when you speak them out loud. If you say one of them differently, drop me a comment; I'd be interested to know what it is, how you say it, and possibly why.

IRC – Eye Are See. When I was in high school, I had a friend who would say it as "urk," and his client of choice was "murk." It took me a while to realize what he was saying, but for me it has always been spelled out: I, R, C.

SQL – Depends how I feel, but usually, again, S, Q, L. If I'm having a conversation with someone, it may vary based on what they call it, just to reduce confusion. This one I'm flexible on.

/var – The name stems from the word "variable," as the contents of this directory are subject to regular change – things like /var/log and /var/spool – but how I say it usually rhymes with "bar," "far," or "car." I guess it should possibly be said closer to "there," as in "there-ee-ah-bull," but I don't care. (If we're going by that rule, "lib" should be said as "lyb," since it's short for "libraries" – yet most people say "lib" to rhyme with "bib" or "fib.")

/etc – This gets referred to by different names also. Some people around here call it "etsy," others call it "E, T, C." I usually skip the debate entirely: when I do have to pronounce it, it tends to be as "etcetera," since that's how you usually pronounce things written as "etc."

eth0 – The rarely spoken name of the primary network interface on common Linux, and some other Unix-based, systems. My verbalization comes from my pronunciation of what it's short for, ethernet or "ee-ther-net." That said, I tend to say it as "eeeth," to sound similar to "beef" (without the b). Others I've heard recently say "eth," to rhyme with "Beth." I guess I just contradicted my rule about saying things based on how they look rather than what they're short for!

Linux – This one is pretty well standard now; "linnix" is the common pronunciation. When it was first introduced, however, it tended to be said more in line with the name of its creator, Linus ("Ly-niss"), so we would say it "lie-nicks" or "lie-nucks."

Those are the ones I can think of; there may be an update to this post, or perhaps a SQL, in the future.

I do want to see your comments, though: what are some tech/IT terms you see written down all the time but hear pronounced differently from time to time?

Coincidence

I'm typically not one to believe in coincidence, but I'm also a fan of the idea that correlation doesn't necessarily prove causation, while not disproving it either.

We run a large number of Tripp Lite PDU (power distribution unit) devices in customer cabinets in our facility; they are largely how we determine power usage and bill for overages (using SNMP cards, a private VLAN, and a set of monitoring systems). It's rare that we have issues with them, but just the other day we suddenly started getting multiple communication alerts for one PDU in particular.

I saw it and noted it, verified everything attached was online, and left it alone until I had a few minutes to go and reseat the web card. I didn't think about it for another hour, until it tripped again, and then again, and then again. Several times it came back online on its own without us going to reseat the card. There seemed to be no rhyme or reason as to why it was dropping, or when. This was late on a Friday, so we decided to ride it out through the weekend and let the provisioning team know so it could be handled on Monday.

Today (Sunday) I came in and saw a few alerts for the same PDU, and then noticed one for another PDU in a different cab with the same issue. These two have very little in common: they're the same type of PDU, but they're in different cabinets in different rows, attached to two different switches. I checked the alert history and noticed the second one had done the same thing 4-5 times in the previous 72 hours. It looked like both had failing web cards, but it seemed odd that they would fail together, especially separated as they are. At this point, our provisioning engineer who works Sundays was already investigating the first one, so I added it to his notes to take a look.

To cut an even longer story short, it was determined that at some point in the past, one of the two cards had crapped itself; it was no longer showing the serial number in the web interface and it had reset its MAC address to "40:66:00:00:00:00." This wasn't a major issue on its own: the card was still responsive and everything stayed happy. Until one day (earlier this week, presumably) the other card in the other PDU did the same thing – no serial, and MAC address 40:66:00:00:00:00. Now we had a MAC address conflict on the VLAN, and suddenly the two began interfering with each other. Once this was determined we pulled one of the cards – the alerts have been quiet for several hours, and pinging the remaining card shows no packet loss over more than 6000 pings.

The good news is that to resolve the MAC address conflict, we really only need to replace one of the two cards. For now at least.
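For anyone chasing a similar ghost: one quick way to spot a duplicate MAC on a segment is arp-scan, since two different IPs answering with the same hardware address is the smoking gun. (The interface name and addresses below are made up for illustration.)

# arp-scan --interface=eth0 --localnet
10.0.50.21   40:66:00:00:00:00   (Unknown)
10.0.50.37   40:66:00:00:00:00   (Unknown)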

Communication and Its Relation to Customer Satisfaction

As systems administrators, we work in an environment where everything we do provides a service of some kind, whether it's providing a shared hosting server to multiple users at a low cost, managing racks of servers for a client for thousands of dollars a month, or maintaining an Active Directory infrastructure that lets a small business go about its day-to-day work selling lumber or making pipes. We have been given a set of expectations, and our job is to meet them.

Sometimes, however, these expectations aren't well communicated, and this causes all kinds of problems. Just this week we had a customer on our shared platform who had a problem, had done a small amount of research into what he wanted, and started buying addons to his existing service with us to fix it. Little did either of us realize that his solution to the problem was being poorly communicated, so while we worked with him to provide what he had requested, it didn't actually resolve the problem he had to begin with.

The customer was using our cPanel environment and had been seeing SSL errors while accessing his webmail account at hisdomain.com/webmail. This is typically not an issue, and we have a valid SSL certificate if the site is accessed as server.ourdomain.com:2096 – but the customer was concerned that his encrypted connection was not being handled as securely as it should be.

However, what he initially communicated to us, was that he planned to conduct business via his website and that the tax regulations of his country demanded this be done via an encrypted website, necessitating an SSL certificate. He bought a dedicated IP addon for his account and then opened a ticket to request the SSL certificate, explaining his reasoning as above.

And so we provided just that: we assigned the IP to his account, then issued an SSL certificate for his domain and installed it. After 25 pages of back and forth in the ticket (many of which were the result of an error on his side that caused every response he made to come through four times), we eventually realized that what he had asked for wasn't even close to what he wanted.

This kind of miscommunication inevitably leads to a position where, despite our best efforts and the involvement of a large number of people scrambling over themselves to meet the customer's needs, the customer leaves the experience deeply unsatisfied, feeling that we have in some way cheated him.

Communication is the key to ensuring that our customers are satisfied, ensuring that they understand the problem, and what the solution they are buying can do to resolve that problem.

“If It Isn’t Documented, Then It Isn’t Done”

Wise words from a Senior Sysadmin that I heard today, fortunately not directed at me.

We hear this often: comment your code, document your processes. How many of us actually do it, and do it in a way that someone else can follow?

Documentation is important for many, many reasons, but primarily it ensures that someone else can take over for you if required.

Fridays are a great day for this, especially if you work in an environment that supports the idea of "read-only Friday," where any change to a production system is banned and changes to non-production systems are not recommended. Use the time to write documentation, so that if you are sick, take a vacation, or move on to new opportunities, you can rest easy knowing that those filling your position, temporarily or permanently, can do so without cursing your name or screwing things up unnecessarily.

And if you’re working on a new project, whether it be building out a system or developing a new tool: If it isn’t documented, then it isn’t done.

Handling Outages – A Lesson from Hostgator

Yesterday Hostgator had a major outage in one of its Utah datacenters, which caused a number of customers to be offline for several hours. The outage actually impacted Hostgator, Hostmonster, Justhost, Bluehost, and possibly more, but this post regards the Hostgator response specifically.

These companies all provide shared webhosting services. I am neither a client nor an employee of any of the businesses, though I do know people who are.

The company I work for has had its share of outages, and I am doing what I can internally to help improve our own practices when outages happen; I will consider following up on this with my manager next week to see if we can learn anything from it. What I saw during the outage, as an outsider, is interesting. There were three outlets of information provided, which we'll analyze.

The first is the Hostgator Support Boards, their public forums where users can ask each other for help and staff can jump in to provide assistance as well. There was a thread about the outage; I've taken an excerpt (original):

[Screenshot: excerpt from the Hostgator forum thread]

 

The thing that stands out most is that it is really the same update over and over again; no new information is being provided to the customer. This might work just fine for brief outages, but when the initial outage notification came at 10:30am, providing the same details until 4pm with nothing of substance in between is unacceptable. For six hours, forum users were told by this thread that "the issue is ongoing, and our staff are working to resolve it," in several forms and variations.

Another outlet of information was the HostGator Twitter account (here), which had the following to say (note the tweets read newest-first; captured 1pm EDT today, Saturday):

[Screenshot: the HostGator Twitter timeline during the outage]

Times below are in EDT.

Again, an initial report came just after 9am, followed shortly by an (incorrect) report at 9:40am that things were returning to normal. At 10:45am the outage was announced, and at midday users were directed to the above forum post, which had no details worth anything to someone wondering why their site had been down for hours. Still no useful news came via Twitter until just before 4pm, when they announced a new site to provide updates.

And so we reach the third source of information, found here, which had updates every half hour from 3:30pm to 6pm, when the issues were finally resolved for the day. This is the only source where useful data for the technically minded could be found.

 

[Screenshot: updates from the dedicated status page]

 

It turns out there were issues with both core switches at the facility, which brought the entire network down. Not only did it take 8-9 hours to fix, it also took six hours for the company to provide any useful information as to what the problem was and what was being done to fix it.

Providers should look at this stream of communication, consider whether they would find it acceptable, and review how they handle their own outages. I have been in this situation as a customer, albeit with a different provider. If there is an outage for 10 minutes, I can be quickly placated with a "there is an issue, we're working on it." If the outage extends much beyond an hour, I want to know what is wrong and what is being done to fix it – not because I want inside information, but because I want you to demonstrate that you are competent to find and fix the problem. That is what gives me confidence to continue using your service after the issue is fixed. And if your service is down beyond two or three hours, I am going to expect new, useful updates at least hourly, ideally more often, so that I can follow the progression of solving the problem.

For me as a customer, it isn’t that your service went down, I understand things break. It is more important that you provide honest and useful details on why it is down and when you expect to have it fixed, even if these things are subject to change.

Changing Puppet Masters

As Puppet users, we occasionally need to migrate nodes from one master to another.

In my case, I'm decommissioning my old Puppet master, having stood up a new one as part of my "migrate home" project.

I ran into a couple of minor issues, but this is essentially the process for moving a node from one master to another.

First, stop puppet (this isn't strictly necessary, but it's good practice):

# /etc/init.d/puppet stop

Next, edit your puppet.conf so the agent points at the new master.
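On my Debian-installed agents that just means changing the server setting; the hostname below is a placeholder for your new master:

[agent]
server = newpuppet.example.com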

If you start puppet again now, you'll likely get certificate errors and it won't work: the agent's existing certificates were signed by the old master's CA, which the new master doesn't trust. Clear out the old SSL data:

# rm -r /var/lib/puppet/ssl/*

[warning]This is for Debian package-installed systems; if this is not your system, check your puppet.conf to determine where the SSL directory is.[/warning]

# /etc/init.d/puppet restart

Now switch back to your NEW master, look for the new certificate request, and if it checks out, sign it:

# puppet cert list

“swedishchef.i-al.net” (SHA256) 05:5E:23:7E:03:A9:58:B6:F2:FE:F6:D4:A1:C3:CE:FD:8B:64:4D:F2:D5:87:02:22:7A:C1:44:8D:D8:44:8E:E8

# puppet cert sign swedishchef.i-al.net

Notice: Signed certificate request for swedishchef.i-al.net

Notice: Removing file Puppet::SSL::CertificateRequest swedishchef.i-al.net at ‘/var/lib/puppet/ssl/ca/requests/swedishchef.i-al.net.pem’
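Back on the node, a one-off foreground run is an easy way to confirm the agent can talk to its new master:

# puppet agent --test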

If the run completes cleanly, you should observe everything is in order; if not, debug as normal. As always, try this in a test environment first – I take no responsibility for broken production environments based on the above.

Migrating Home

My “Personal Project” for a long time has been small web/mail hosting, primarily for myself. For the last year or more I’ve achieved this with a group of small VPS services.

It started out back in 2008, I think, with a 256MB Xen machine, hosted by ezVPS (no longer in business). Eventually I picked up a second one from the same provider, and balanced the load with the different sites across the two servers.

As time went on and my aspirations grew, I rented a 512MB KVM server from BuyVM/Frantech. When ezVPS shut down, I was already in the process of moving one of my servers to a 256MB KVM with BuyVM, and I was able to snag another 256MB and move the other one. Right now I'm paying ~$20/mo for three servers ($10 for the 512 and $5 each for the 256s). Money has been a little tight, however, and now that I'm paying for and controlling the internet connection where I live, I felt it was time to start moving things home.

I started by creating some new VMs on my VMware server: one each for administrative purposes (mostly just Puppet), the panel (ISPConfig), the web server, and the mail server (though that one will be shut down, and I'll use one of the 256s instead). With everything appearing to run nicely, I moved one site to the new server. All appears in order, so it's time to start moving the rest and slowly get everything off the 512.

Once everything has moved off, I can shut it down, cancel, and start saving $10 a month. So far, so good.

Test Case Web

[notice]FAIR WARNING: At the time of writing, this software hasn't been fully tested. During my testing I have found a large number of SQL injection issues in this code, which I have patched on my system and will continue to patch as I check over the package. In the next few days I'll put together a useful diff/patch and submit it to the maintainer, because this is simply unacceptable – especially for a tool designed to help with software QA.[/notice]

Part of my specific duties involves software testing, as part of our quality assurance efforts on tools we have developed in house, both for internal use and for our customers – things like our customer portal, which, among many other things, gives customers the ability to manage what is in their racks and, if they have a PDU that allows it, remotely power hardware up or down.

We’ve been managing this effort using a shared spreadsheet which works well enough, but can easily be improved. So I started looking for tools that would allow us to manage our testing efforts in a much more efficient manner. It might take a little more administration, but it should improve our workflow and hopefully balance out, especially once the initial start-up is out of the way.

Here is what I found: an old application called "TCW," or "Test Case Web," which has been in development for some time. According to SourceForge it is still in fairly active development, the most recent release being just a couple of weeks ago, on April 24th.

It's written for PHP4, so there are a couple of deprecated functions and variables I've had to adjust for, and I had to fight my development server just a little to make it work right, but it's running.

Here are a couple of tips:

The default login is (case sensitive, apparently):
Username: Admin
Password: admin

Line 4 of "adminaction.php" reads "$args=$HTTP_POST_VARS;"; change it to "$args=$_POST;" under PHP 5, which deprecated (and later removed) the long $HTTP_POST_VARS array.

The system has no install script. You'll need to create a database and user in MySQL, then edit the incluido.fil file to hold the credentials. You'll also need to import the schema into MySQL, easily achieved with the mysql command line tool or phpMyAdmin.
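A minimal sketch of that setup, assuming you name both the database and the user "tcw" (the password and the schema filename are placeholders; use whatever ships in the tarball):

# mysql -u root -p
mysql> CREATE DATABASE tcw;
mysql> GRANT ALL PRIVILEGES ON tcw.* TO 'tcw'@'localhost' IDENTIFIED BY 'changeme';
mysql> FLUSH PRIVILEGES;
mysql> quit
# mysql -u tcw -p tcw < schema.sql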

There are also a handful of places in the code that trigger PHP notices and warnings, mostly because they check the contents of a variable without first checking that the variable is set.
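Most of these can be fixed with the standard isset() guard; here's a trivial, hypothetical example of the pattern (the field name is made up):

<?php
// Reading a form field that may not have been submitted.
$args = $_POST;

// Before: comparing $args['action'] directly emits an E_NOTICE
// whenever the field was never submitted.

// After: confirm the index exists before comparing it.
if (isset($args['action']) && $args['action'] == 'delete') {
    echo "Deleting...";
}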

For the “home page” (which is severely out of date, but the docs mostly apply), see here.

For the SourceForge project page with current releases see here.

Also, if you're a PHP dev, it might not be a bad idea to take a look and consider helping out, even just briefly, by reviewing the code for security issues and lending a hand to bring it up to PHP5 standards.