It’s Always DNS

Also, Rule 39 — there’s no such thing as a coincidence (there is, but shush!)

A few weeks ago we had a power outage in one of our larger colocation facilities hosting customer racks. Because (legacy, cost) we have a number of servers providing some core services that don’t have redundant power supplies, and so a couple of these went down with everything else. Ever since, we’d been having serious performance issues with one of those nodes.

Some quick background: At every facility we provide service in, we have at least a couple of servers which exist as hypervisors. Back in the day we were using OpenVZ to operate separate containers, then we were using KVM-based VMs, and now we’re in the process of moving over to VMware. In each iteration, the principle is the same: Provide services within the datacenter that are either not ideal, or not recommended, to traverse the public internet.

For example, we host a speed test server so that we can have our clients test their connectivity within the datacenter, to the next closest facility, or across the country. We have a syslog proxy server which takes plaintext input and sends it back to a central logging server via an encrypted link. We have a couple of DNS servers which we offer to our clients for local DNS resolvers. We have a monitoring system that reaches out to local devices via SNMP (often unencrypted) and reports back to the cloud-based monitoring tool.

Ever since the power outage, we noticed that the monitoring server was a lot more sluggish than usual, and reporting itself as down multiple times per day. I did some digging, noticed a failure on one of the drives, had it replaced — that didn’t go well. Ended up taking the entire hypervisor down and rebuilding it and its VMs (these hosts don’t have any kind of centralized storage). No big, I thought. This won’t take long, I thought.

The process was going a lot slower than usual, and I didn’t put nearly enough thought into why that might be the case. I left the hypervisor building overnight and came back in with a fresh head. I was just settling in when I noticed a flood of alerts arriving from hosts that they weren’t able to reach Puppet. Odd, I thought, the box is fine! Memory is good, CPU is good, I/O Wait time (which is notoriously poor on this hardware) was fine. I set about monitoring — puppet runs were taking forever. Our puppet master typically compiles the manifest in 3-4 seconds, tops. It was taking 20-30s per host — that’s not sustainable with nearly 200 servers checking in.
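To put rough numbers on “not sustainable”, here is a back-of-the-envelope sketch. It assumes catalogs compile one at a time and that every agent checks in once per 30 minutes (1800 seconds, Puppet’s default run interval). Both are assumptions for illustration, since a real master may compile several catalogs in parallel, and the function names are hypothetical.

```shell
# Total compile time needed for one full round of agent check-ins.
# Usage: catalog_seconds_per_cycle <hosts> <seconds-per-compile>
catalog_seconds_per_cycle() {
    echo $(( $1 * $2 ))
}

# Succeeds if every host can be served within one run interval.
# Usage: can_keep_up <hosts> <seconds-per-compile> <interval-seconds>
can_keep_up() {
    [ "$(catalog_seconds_per_cycle "$1" "$2")" -le "$3" ]
}
```

With 200 hosts, 4-second compiles need 800 seconds per cycle and fit comfortably. At 20 seconds each, the same hosts need 4000 seconds, which is more than double the interval.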

What’s embarrassing for me is that it wasn’t until I had a customer ticket come in that they were getting slow DNS lookups that I realized exactly what was happening — the hypervisor I was rebuilding, the one I had consciously deleted configs for “dns1” on, was the culprit.

A quick modification to Puppet Master’s resolv.conf, and also a temporary update to the Puppet-configured resolv.conf that we push out, reversing those internal nameservers, and everything started to clear out. Puppet runs dropped back down to single digits, customers reported everything was fine.
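The resolv.conf change amounts to swapping the order of the nameserver lines so the healthy resolver is tried first. A minimal sketch of that reversal follows; the function name is hypothetical, and it prints the result rather than editing the file in place so it can be reviewed first.

```shell
# Print a resolv.conf with its "nameserver" lines in reverse order.
# All other lines (search, options, comments) keep their position.
# Usage: reverse_nameservers <path-to-resolv.conf>
reverse_nameservers() {
    awk '
        {
            lines[NR] = $0
            if ($1 == "nameserver") { isns[NR] = 1; ns[++n] = $0 }
        }
        END {
            k = n
            for (i = 1; i <= NR; i++) {
                if (i in isns) print ns[k--]
                else print lines[i]
            }
        }
    ' "$1"
}
```

For example, `reverse_nameservers /etc/resolv.conf > /tmp/resolv.conf.new` writes the reordered copy to a scratch file, leaving the live configuration untouched until it has been checked.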

Of course it was DNS. It’s always DNS.

Web of Trust

They say that the internet is just a series of tubes. They’re not entirely wrong, but that’s not all it’s made up of.

It’s also made of a complex series of trust relationships. It is when these trust relationships fall apart for one reason or another that we run into problems.

BGP (Border Gateway Protocol) is a tool used by every internet provider on the planet. Its primary purpose is to allow a router to peer with its neighbors and advertise to them what addresses can be reached through it. Say I’m an internet router for Comcast, I might have peering connections to routers owned by Verizon and by Cox, and I would advertise to those routers that I own Comcast’s IP addresses, and they advertise to me that they own Verizon’s and Cox’s IP addresses, respectively. Now, I might also have a transit agreement in place with Weird Kiwi (we don’t, just using ourselves as an example) that means Comcast will also advertise Weird Kiwi IP addresses (we don’t own any, for what it’s worth). That would mean that Verizon knows that by passing traffic to Comcast, it will reach Weird Kiwi.

There are two types of BGP peering that you should be aware of. They’re often referred to as “Peering” and “Transit.” Transit is the big one, it’s where you advertise your own routes and the provider advertises back most/all of the internet. Peering is much smaller, and only advertises local IP space for the organization you’re peering with and possibly their customers.

Through this protocol, everyone knows how to reach everyone else. How it works in detail isn’t important, but what is important to know is that there is very little in the way of safeguarding built in. Many responsible ISPs will protect their BGP sessions with customers in order to limit what advertisements they will accept. They require LOAs (Letters of Authorization) and utilize router policies to prevent their customers or peers from advertising too many routes, or from advertising routes to IP addresses that they are not authorized to route. Not all ISPs are this responsible. Some ISPs can, have, and will continue to allow anyone to advertise anything, which results in interesting network issues from time to time. Like when Telekom Malaysia advertised a significant portion of the internet to Level 3, who accepted and propagated it. Or when Indosat in Indonesia did much the same thing. Usually these issues are much smaller, like when someone accidentally advertises IP space they don’t own — it just gets tricky when it’s IP space owned by Amazon.
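The two safeguards described above, an allow-list of authorized prefixes and a cap on how many routes a customer may advertise, can be illustrated with a toy sketch. This is not how a router implements it (real policies live in the router configuration and match on prefix length as well); it only shows the logic. The function names and file layout (one prefix per line) are assumptions.

```shell
# Print a verdict for each advertised prefix: "accept" if it appears
# verbatim in the authorized list, "reject" otherwise.
# Usage: filter_advertisements <authorized-file> <advertised-file>
filter_advertisements() {
    authorized_file=$1
    advertised_file=$2
    while IFS= read -r prefix; do
        [ -n "$prefix" ] || continue
        if grep -Fxq -- "$prefix" "$authorized_file"; then
            echo "accept $prefix"
        else
            echo "reject $prefix"
        fi
    done < "$advertised_file"
}

# Succeeds if the number of advertised prefixes is within the limit.
# Usage: within_prefix_limit <advertised-file> <max-prefixes>
within_prefix_limit() {
    count=$(grep -c . "$1" || true)
    [ "$count" -le "$2" ]
}
```

A provider that skips both checks will accept whatever arrives, which is how a single bad advertisement ends up propagated to everyone else.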

The point is, BGP is one of the very base layers of the internet, and it is entirely based on mutual trust between internet service providers to properly advertise themselves and to properly vet and limit their customers.

This isn’t unique. Just about the entire internet is based on trust: trust that when you send some message you will get a response, and trust that the response you receive is appropriately accurate. Think of e-mail. When you send a message to a mail server, you expect it to send back appropriate codes. The most important is “250 OK”. This indicates that the mail server has validated that it can deliver to the recipient, that it has vetted that your server is not malicious by most standard tests (PTR, DNSBL, IP- or domain-based reputation tests), and that the message is not malicious and does not contain spam. All of this can be tested before sending back “250 OK”, so the sending side trusts that if “250 OK” is returned, the message was accepted and will be properly delivered. Some mail providers, however, choose not to do this. They prefer, for some reason, to accept the message as normal and then classify spam or junk mail later. It’s entirely possible for mail to be “properly accepted” at the gateway, and then silently dropped at some point before it would reach the end user’s inbox.
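For reference, the first digit of an SMTP reply is what a sending server acts on. A minimal sketch of that classification, with a hypothetical function name:

```shell
# Classify an SMTP reply line by its first digit.
# Usage: classify_smtp_reply "<reply line>"
classify_smtp_reply() {
    case $1 in
        2[0-9][0-9]*) echo "accepted" ;;
        4[0-9][0-9]*) echo "temporary failure" ;;
        5[0-9][0-9]*) echo "permanent failure" ;;
        *)            echo "other" ;;
    esac
}
```

Note what “accepted” covers here: the receiving server took the message at the gateway. Nothing in the reply code says whether it later reaches the inbox, which is exactly the trust gap described above.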

Being a user of the internet involves bestowing, and being bestowed with, a significant amount of trust in the remainder of the internet. Trust that you won’t advertise via BGP any IP addresses you don’t own. Trust that you keep your devices and networks appropriately secured. Trust that you won’t attempt to violate the security of other people’s devices or networks. Trust that you won’t, knowingly or unknowingly, participate in a Denial of Service attack of any kind. But what happens when you do? It’s still a web of trust. Another one of the standing conventions of the internet is that ISPs will make public their preferred method for receiving complaints of abuse from their networks. Typically this is an email address, and even more commonly that email address is something like abuse@(domain). Trust comes to play when sending an abuse complaint that the receiving party will receive it, review it, and respond to it. Either respond by taking some kind of action (notifying their customer, suspending their service, fixing their own security problems) or by responding to the complainant requesting clarification or additional detail.

There is no obligation on the part of the receiving party to actually do anything, you just trust that they will do the Right Thing.

Qualities of an Employee

Since we just did a post on qualities of a good manager, it only seems appropriate to look at that relationship the other way around and consider the qualities of a good employee.

They’re remarkably similar to those of a good manager. I think my top three are: Communication, Transparency, and Self-Motivating.

Communication

If you can’t tell by now, I personally believe that good communication is the key to just about everything positive. Very rarely does something good come from a failure to effectively communicate, and when it does it’s usually a lucky mistake.

We discussed it in the managerial post, but communication is a two-way street. Sometimes as a manager I can’t be everywhere at once, so I rely on my employees to reach out when they need something. I need to know what is going on, and especially need to know what’s going on that will impact your ability to do work. Can I help? Maybe, maybe not. That doesn’t mean I don’t want to know. I can’t help, or try to help, if I don’t know. Even if the limit of my helping is to offload some work and make life a little easier for a few days.

Transparency

It relates heavily to communication, but if I’m going to be an effective leader, I need to know what you’re working on. I don’t need deep dark details, or heavy technical intel, just enough to get a feeling of what your workload looks like. I can’t fix problems that I don’t know are there, so if other people are giving you work it should be documented in a way that I can see it and re-assign to balance the team.

Self-Motivating

Good employees are capable of walking the fine line between asking for guidance, or being told what to do, and being able to self-motivate and self-initiate new tasks and projects. This also lends back to transparency — I need to know what you’re working on.

I expect my employees to know their role and their responsibilities. I expect my employees to do the work they find hard or they don’t like, in addition to the work they enjoy or find easy. I also expect this to be done without constant prodding or reminding.

If you’ve come up with a new project or idea to develop, that’s fine. It needs to be in your scheduled tasks list, and it needs to be worked on as a priority behind other work I, as your manager, have deemed higher priority — most likely work that has a customer- or management-imposed deadline. I also expect that you will respect a managerial decision to shelve that project.

I will also put in here, a good employee is aware of his surroundings, and of his impact on them. A good employee will recognize when he is assisting others in getting their work done, or when he is detrimental to their productivity. He will also take steps to correct his own behavior as much as possible on that basis.

Review

A good employee understands and respects the goals set by upper and middle management for the direction of the company, of the team, and of the employee. The good employee will work their best towards meeting those goals.

A good employee will take steps to ensure their manager and co-workers are aware of what they are working on, and the progress that is being made. They will reach out for help or guidance when needed, and they will listen to input from their manager and their peers.

A good employee will also reach out to management to discuss personal issues, and keep them “in the loop” as much as possible and as much as relevant to their work life.

Most of all, a good employee will recognize the efforts of a good manager or employer, and respond in kind. They’ll recognize a poor employer, and still take effort to show respect and, if the relationship cannot be turned around, they will move on and find a new location that will give them greater respect.

Qualities of a Manager

The trend on the Sysadmin Subreddit this week has been talking about the struggles of management, what the lowly team members might not realize their team lead or manager is having to deal with, and resenting them accordingly.

Coming off this, combined with some recent personal experiences, I was thinking about what I value the most in a manager or a lead, of “Management” in general. And I came up with three things: Communication, Decisiveness, and Respect.

Communication

Communication is the key to just about everything. If you want to clap, you need co-ordination between the left hand, the right hand, and when appropriate, the rhythm section of the brain. If you want to onboard an employee, you need to be able to notify the appropriate people of the critical details. When are they starting? What are they going to be doing? Where will they be working? Do they need a laptop or workstation set up? Software? An Email address? And it’s not just questions for the IT department, do they need a desk? Is there one there, or do we need to order/build one? Do they need a parking permit? Does HR have someone available to do their basic onboarding class?

A good manager needs to be able to communicate in multiple directions. Passing messages from upper management down to the employees. Passing messages from the employees to upper management. Sending and receiving messages from other teams. Putting team members in contact with the right people in other departments in order to streamline processes.

A manager with good communication skills will have employees who feel that their voices are heard, and feel like they’re up to speed on team goals and company goals. A manager with poor communication skills will have employees who feel they don’t know what is going on, feel they don’t know who to talk to or who to ask for help, and who will ultimately find other places to be.

Decisiveness

The worst thing you can do to me is constantly change your mind, back and forth, without reason. I say that because I fully understand that sometimes changes must be made, decisions are found to be wrong, data is found to be erroneous, etc. But to constantly go back and forth on who we should choose to replace our broken phone system is ridiculous, time-consuming, frustrating, demoralizing, and expensive. It’s difficult to quantify that cost, but that doesn’t mean it isn’t there.

My job is generally not to make decisions, not when it comes to spending money. It is to provide valid data and opinion, your job as manager is to either make the choice yourself or pass it up the chain. I reasonably expect that process to consume some time, but when you come back with a decision, I expect we should be able to stick with that decision unless new data comes to light.

This isn’t just imagined, either. It may be a little misperception (see the last post), but it isn’t imagined. The company I work for has been trying to replace its poorly implemented Asterisk PBX solution. So far it has taken about 9 months. We trialed five companies, recommended a decision to management who sat on it for three months, then told us the company was merging, we should try this other vendor that the other company uses. No problem, trial, recommend, move on. The decision sits around for another six months while the merger completes, we learn more information about their preferred vendor, decide it won’t fit. Start reaching out to other vendors, oh, wait, they will fit, never mind. All the while we were instructed to spend as little time as absolutely necessary to maintain or resolve issues with the existing system, so the number of bugs and issues keeps rising.

I’m fairly certain this also drives other departments to believe that we’re just sitting on our hands.

A good manager will take the data at hand, and make an informed decision as quickly as possible. Progress will be made, and poor choices will have opportunity for correction. A poor manager will sit on a difficult choice for hours, days, weeks or even months or years. Issues will languish and decay. The biggest risk of procrastination is that when decisions finally are made, the data that backed them has changed and what would have been a good choice yesterday is now a terrible one. And of course, when you take so long to make decisions in the first place, the time it takes to correct your poor choices is lengthened as well.

Respect

This seems to be the hardest to come by, at least, if social media is to be believed. “They don’t care about you, they care about your ability to make them profitable.” It may be true at a corporate level, but a good manager recognizes the talent of his employees and treats them appropriately. For one, firing staff and hiring new people is time consuming, expensive, and results in the rest of the team being unproductive while they pick up the slack and help in training the new person.

What is often struggled with is that you must show respect not just to the employee, but also to their family and loved ones. Most specifically for me, I have a wife and young daughter. This fact is not new to any of my recent managers, and it doesn’t impact my day-to-day. It does, however, impact my ability to travel. My wife has a job with slightly odd hours, and my daughter attends a daycare. With our regular schedules we have no issues dropping off and picking up, and providing adequate care. But I also can’t just decide I’m going on a business trip for three days next week — that needs to be coordinated and planned at least a couple of weeks in advance. And then I can’t just stay an extra day because you felt like it. Two years ago that would be showing disrespect to me, and I would suck it up and deal with it. Today, that’s showing disrespect to me and to my family, and holds the potential to impact our overall livelihood if my wife gets fired as a result of it.

A good manager will know and understand his employee’s talents and utilize them to the best of his ability. He will recognize that employees have home lives, and work to serve the needs of the business as the primary goal, while working around the unique personal needs of his staff. We may realize it’s not always possible, but we should at least be able to see that you tried. A bad manager will consider employees as a commodity, to be hired and fired as needed, and to be manipulated in order to avoid said firing. A bad manager will leave employees entirely out of the loop on changes to their schedules until the minute they’re implemented, and will point to contractual terms and company policies when questioned about it.

Review

A good manager will listen to his staff, make decisions as quickly as possible, and communicate those decisions back out to the team as required. A good manager will have respect for employees and their families, at least up until the employee gives reason to negate that respect.

A poor manager will consistently fail to communicate, will take forever to make decisions or flip-flop on those choices. A poor manager will show contempt or otherwise disrespect his staff or their families.

Ultimately bad managers find the door one way or another. But even then, a good manager of managers will recognize the good and the bad beneath them. After all, good managers tend to hold on to employees much longer, and get much better reviews when surveyed. Bad managers drive people away quickly.

Personal Experience is Personal Reality

When everyone in a room shares a similar perception, people tend to get along and understand each other even if they disagree with that perception being right or wrong. As soon as someone in that room has a personal perception that is at odds with the remainder, they become a shard of the group and each side begins to perceive the other as being crazy to some degree.

Remember the 6-image pictures that were all the rage a few years ago, each one showing “What my friends think I do,” etc?

[Image: six-panel “What my friends think I do” meme]

All of those functions are based on the various perceptions of different people. Granted, few of them actually compare with reality, but until a person’s experience is changed, their perception will remain, and for them that is reality.

Consider the people you work with, how do they perceive you and the work you do? In the image above, if they’re anything like what your boss thinks, you may have a real problem — one that involves updating the resume and applying for new work somewhere else really soon. If you’re a manager, how do your employees perceive your leadership skills? Chances are it’s not how you perceive them yourself.

Here’s the thing. Personal Experience drives Personal Perception, and those two things combined are an individual’s Personal Reality. As a manager, or just as a representative of a group (i.e., a member of the IT support team), it is your responsibility to ensure perceptions of you and of your team are positive and accurate. When interacting with someone, that should be foremost in your mind, ensuring their experience is right such as to drive that perception. Because if your employees or your customers perceive you as not caring about them, it doesn’t matter what you do or how hard you try. If their experience working with you can lend them to believe you don’t care for them in the slightest, that’s what they’re likely to believe. That will be their reality.

Upgrading OwnCloud (or, keep up to date to avoid issues)

It turns out that OwnCloud is a finicky piece of software. After being defeated in my attempts to update, I grew determined to win.

Premise:

We have a CentOS 6 x64 server running OC, v7.0.2. Current release is 9.0.1, but OC doesn’t support direct upgrades between major versions, and it wasn’t even as simple as 7.0 to 8.0 to 9.0 — there was an 8.1 and an 8.2 in there as well.

OC was originally installed via an OpenSUSE package. That package has since gone out of date and no upgrade path via that method existed.

(note: back up your system here)

Step 1: Update 7.0.2 to 7.0.11

Step 1 was probably the easiest.

Download the 7.0.11 tar.bz2 from the changelog page (https://owncloud.org/changelog/) and drop it in the DocRoot.

Move the old owncloud directory out of the way, unpack the new one, and copy the config.php (owncloud-old/config/config.php) into the new directory (owncloud/config/).

Hit the webpage, follow the upgrade steps. Owncloud handles the DB changes and makes sure everything is happy.

Once the upgrade is done, log in and make sure all is well.
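The directory swap in this step repeats for every later version, so it is worth sketching as a function. This is a minimal sketch under stated assumptions: the function name is hypothetical, the tarball is assumed to unpack to a top-level owncloud/ directory containing config/, and the old tree is kept as owncloud-old so nothing is deleted.

```shell
# Move the current tree aside, unpack the new release, carry config.php over.
# Usage: swap_owncloud_tree <docroot> <tarball>
swap_owncloud_tree() {
    docroot=$1
    tarball=$2
    # Refuse to clobber a tree saved by an earlier run.
    if [ -e "$docroot/owncloud-old" ]; then
        echo "owncloud-old already exists in $docroot" >&2
        return 1
    fi
    mv "$docroot/owncloud" "$docroot/owncloud-old" || return 1
    # tar detects the compression itself; a .tar.bz2 needs bzip2 installed.
    tar -xf "$tarball" -C "$docroot" || return 1
    cp -p "$docroot/owncloud-old/config/config.php" \
          "$docroot/owncloud/config/" || return 1
}
```

Because the function refuses to run while owncloud-old exists, the previous saved tree has to be archived or removed deliberately before the next version’s swap.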

(note: back up your system here)

Step 2: Update 7.0.11 to 8.0.9

This is where the fun starts. I’m working on a CentOS 6 box, and the packaged version of PHP for Cent6 is 5.3 (5.3.3, to be exact). As it happens, 8.0 requires PHP 5.4, so life just got interesting.

A few commands, and we’re away:

yum install centos-release-SCL
yum install php54 php54-php php54-php-gd php54-php-mbstring php54-php-mysqlnd
mv /etc/httpd/conf.d/php.conf /etc/httpd/conf.d/php.off
service httpd restart

Actually, at this point, you’re best off stopping HTTPd, unless you have a very small OwnCloud instance. The OC forums are riddled with comments and notes about how long this next step took, and by default it will time out after 3600s, or one hour in a browser window.

Download the 8.0.9 tar.bz2 from the changelog page (above) and drop it in the DocRoot.

Move the old owncloud directory out of the way, unpack the new one, and copy the config.php (owncloud-old/config/config.php) into the new directory (owncloud/config/).

Move into your new OwnCloud directory and run:

sudo -u apache /opt/rh/php54/root/usr/bin/php occ upgrade

This runs the upgrade script that would be run from your browser, but instead does it from command line. It also uses the PHP5.4 binary, and runs as the Apache user (in CentOS). No worries about timeouts!

This step takes a long time. Apparently it has to do with rewriting all of the encryption keys for every file uploaded to your cloud. ‘top’ showed 100% CPU usage for the duration. I left it to run overnight, and with our 50GB of data it took about 3 hours.

Once that’s done, migrate the encryption keys:

sudo -u apache /opt/rh/php54/root/usr/bin/php occ encryption:migrate

Restart httpd (if you stopped it) and log in. Make sure all is well.

(note: back up your system here)

Step 3: Update 8.0.9 to 8.1.4

This is essentially the same as Step 2, without a PHP upgrade.

Move the old owncloud directory out of the way, unpack the new one, and copy the config.php (owncloud-old/config/config.php) into the new directory (owncloud/config/).

Move into your new OwnCloud directory and run:

sudo -u apache /opt/rh/php54/root/usr/bin/php occ upgrade

For me, this step was much faster, taking only two minutes.

Restart httpd (if you stopped it) and log in. Make sure all is well.

(note: back up your system here)

Step 4: Update 8.1.4 to 8.2.1

Again, essentially the same.

Move the old owncloud directory out of the way, unpack the new one, and copy the config.php (owncloud-old/config/config.php) into the new directory (owncloud/config/).

Move into your new OwnCloud directory and run:

sudo -u apache /opt/rh/php54/root/usr/bin/php occ upgrade

This took a little longer, mostly hanging on the database schema update check. And by a little longer, I mean about 90 seconds.

Log in once more and check that all is well.

(note: back up your system here)

Step 5: Update 8.2.1 to 9.0.1

Not so simple this time, although you could do it the same way. We wanted to get back onto the point releases, so we moved back to using packages.

Move the owncloud directory out of the way, and remove the owncloud package (7.0.2) from the system.

Remove the old repos, and install the updated ones:

rpm --import https://download.owncloud.org/download/repositories/stable/CentOS_6_SCL_PHP54/repodata/repomd.xml.key
wget http://download.owncloud.org/download/repositories/stable/CentOS_6_SCL_PHP54/ce:stable.repo -O /etc/yum.repos.d/ce:stable.repo
yum clean expire-cache
yum install owncloud-files

Copy the config into the appropriate directory, switch to the OwnCloud directory and run, once more:

sudo -u apache /opt/rh/php54/root/usr/bin/php occ upgrade

Log in and check all is well.

For kicks, back up your system one last time.

You’re now running the latest available OwnCloud, installed from package. Now you just need to keep that up to date, and life will be significantly easier!

Now I can run a yum update again, and not have to see cascading dependency failures, or risk breaking the system because the package available is too far ahead of what was installed.

Why do I need my own backups?

“My provider already has them,” “It’s not that important, anyway,” “backup tools are (too hard|too expensive|too complicated|not possible in my environment).”

“Everyone wants restore, no-one wants backup.” AdminZen, Part 6, Backups, point 1.

Too many times recently I’ve seen excuses like the above for not having backups, followed shortly by long tirades about how it’s someone else’s fault that their data is lost.

Somewhere I saw someone say it well: If it’s not important enough for you to put your own backup system in place, it’s not important enough for you to complain about it being gone when it’s gone.

I work for a service provider, and we take regular backups of our customer-facing systems: systems our customers pay to be hosted on. Not too long ago we found out the hard way that we had a corrupted backup, when the system failed and took all the data with it. We were able to restore the system from a very dated backup, but it had been nearly 2 months since the last successful backup, and that was all we were able to restore. Some of our customers were diligently taking their own backups, however, so they were able to get back up with other providers before we brought our system back up, or to restore their accounts with us at that time, since we brought back very old data.

“You’re a careless provider, it’s your fault we have no data!” Actually, it’s not. Our terms of service are not at all unique in the hosting community, and they show that while we may have backups, there are limits to what we are obligated to provide. They also recommend that you take your own backups independently.

Now, I don’t work for Bluehost, this is used as a generic example (section 17):

For its own operational efficiencies and purposes, Bluehost from time to time backs up data on its servers, but is under no obligation or duty to Subscriber to do so under these Terms. IT IS SOLELY SUBSCRIBER’S DUTY AND RESPONSIBILITY TO BACKUP SUBSCRIBER’S FILES AND DATA ON BLUEHOST SERVERS, AND under no circumstance will Bluehost be liable to anyone FOR DAMAGES OF ANY KIND under any legal theory for loss of Subscriber FILES AND/or data on any Bluehost server. Bluehost will not attempt to back up accounts that exceed 50,000 files or 30 Gigs of space for any reason.

To confirm this as generic, here is InMotion Hosting’s section (see “Data Backup”; I don’t work for them either):

InMotion Hosting maintains, as a convenience to its clients, regular automated data backups on accounts equal to or less than 10 gigabytes in total size. This service is included only with Business Class or Virtual Private Server hosting accounts and is provided at no additional charge. Hosting accounts greater than 10 gigabytes in size will not be included in regular data backups; this service is, however, available for an additional service charge for accounts exceeding the 10 gigabyte size limit.

While InMotion Hosting maintains the previously stated backups, this service is provided as a convenience only and InMotion Hosting assumes no liability as to the availability or completeness of client data backups. Each client is expected and encouraged to maintain backup copies of their own data. InMotion Hosting will provide, upon request, one (1) data restore per four (4) calendar months free of charge. Additional data restores may be provided but are subject to additional service fees.

Note how there are limits on account size for backup, and a specific note is included that there is no liability on the part of the service provider to take a backup for you.

So how do I do backup? There are a number of tools depending on what you are doing.

For Dedicated Servers or VPSs, a simple rsync or rsnapshot is often sufficient. If you have more than one, and your usage on each is well below 50%, consider backing them up to each other. If not, creating a ZIP or Tar file of your data and regularly downloading it is often sufficient. Remember, the purpose of the backup is to ensure that if your provider fails or entirely loses your system, you can find a new provider and get back online as soon as possible.
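As a minimal sketch of the tar option: the function below packs a directory into a date-stamped archive and then lists the archive back, so an unreadable backup is caught at creation time rather than at restore time. The function name and the naming scheme are assumptions for illustration.

```shell
# Create <backup-dir>/<name>-<timestamp>.tar.gz from <source-dir>,
# verify it can be read back, and print the archive path.
# Usage: make_backup <source-dir> <backup-dir>
make_backup() {
    src=$1
    dest=$2
    stamp=$(date +%Y%m%d-%H%M%S)
    archive="$dest/$(basename "$src")-$stamp.tar.gz"
    mkdir -p "$dest" || return 1
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Listing the contents fails if the archive is truncated or corrupt.
    tar -tzf "$archive" > /dev/null || return 1
    echo "$archive"
}
```

The archive still has to leave the server to count as a backup, whether by download or by copying it to a second machine.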

For Shared Hosting users, most of you will have access to cPanel or a similar control panel, which provides an option to generate and download an account backup. Most providers will also frown upon keeping account backups on the server, so make sure you download and then delete it. A cPanel backup would be preferred, however, as it is in the perfect format to upload to another provider and restore your content if you need to move.

If the panel doesn’t have the option, or the provider has locked the option out for some reason, you should still have FTP access. This will let you download your files anywhere with an FTP client. Now, that won’t include any MySQL or other databases, you’ll need a solution for that as well.

For very small backups, something I’ve done in the past is to have a scheduled task (usually crontab task or cronjob on my Linux systems) which dumps the database, packages it up with the files, and sends it via email to one of my GMail addresses. Then there is a filter which immediately deletes the email. It seems odd, but GMail will keep emails in the trash for 30 days by default. Therefore, this setup creates a rolling 30 day backup! Keep in mind, it only works if the email size is less than 10MB, any larger and Google will reject the message at receipt.
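Since the whole scheme fails quietly if the message is rejected for size, a guard before sending is cheap insurance. A minimal sketch, assuming the 10MB figure above means 10485760 bytes and accounting for the fact that base64 encoding makes an attachment about a third larger than the file on disk; the function name is hypothetical.

```shell
# Succeeds if the file, once base64-encoded as an attachment, stays
# under the limit (default: 10 MB, the figure quoted above).
# Usage: fits_mail_limit <file> [limit-bytes]
fits_mail_limit() {
    file=$1
    limit=${2:-10485760}
    size=$(( $(wc -c < "$file") ))
    # base64 turns every 3 bytes into 4 characters.
    encoded=$(( (size + 2) / 3 * 4 ))
    [ "$encoded" -lt "$limit" ]
}
```

A cron job could call this before mailing and fall back to another method, or at least complain loudly, when the archive has outgrown the limit.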

Building for High Availability

There are several principles to consider when designing and building a system for high availability, so-called 100% Uptime.

Firstly, what is the budget? Cost constraints will invariably determine how high that availability is. For this exercise, we’re going to focus on a virtual environment in which the VMs will have 100% Uptime.

Constraining factors:

Power: Reliable power is critical. If grid power goes out, what happens? You need to know that a UPS system is in place, and is well monitored and tested. You also need to know that at least one generator is in place, ideally more, to pick up the load for prolonged power grid outages.

Network: Your VMs may be 100% available internally, but if you can’t reach them from the outside, there is no point. Having a network provider with 100% uptime will be critical, or you accept that 99.999% is your maximum because that is your provider’s maximum. That, or you need to select multiple providers and blend bandwidth yourself, which gets expensive, more so when you consider the equipment you’ll need and the expertise to configure and maintain it.
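To put numbers on those percentages, here is a small Python sketch of the arithmetic. It assumes a 365 day year and treats the layers as independent pieces that all have to be up at once, which is a simplification; the function names are mine.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes per year a service may be down at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def serial_availability(*availability_pcts: float) -> float:
    """Combined availability of layers that must ALL be up (e.g. VMs and network).

    Assumes failures are independent, so the fractions simply multiply.
    """
    combined = 1.0
    for pct in availability_pcts:
        combined *= pct / 100
    return combined * 100


if __name__ == "__main__":
    # Perfect VMs behind a 99.999% network are still only 99.999% reachable.
    reachable = serial_availability(100.0, 99.999)
    print(f"{reachable:.3f}% -> {downtime_minutes_per_year(reachable):.2f} min/year")
```

At 99.999% that works out to a little over five minutes of allowed downtime per year, while a 99.0% guarantee allows roughly 87 hours.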

Level of redundancy: How much of your stack are you able to lose before you can’t provide full service? For example, do you have N+1 hypervisors, N+2, or 2N? In each case, “N” is the minimum number of things you need to provide service. Knowing that you have more than that (and how much more) available is critical.
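The N+1 / N+2 / 2N vocabulary boils down to simple counting. A minimal Python sketch, with a hypothetical function name and label format of my own choosing:

```python
def describe_redundancy(required: int, deployed: int) -> str:
    """Describe deployed capacity relative to N, the minimum needed for service.

    required: N, the minimum number of units needed to provide full service.
    deployed: how many units are actually installed and ready.
    """
    if required < 1:
        raise ValueError("required (N) must be at least 1")
    if deployed < required:
        raise ValueError("fewer units deployed than the minimum needed for service")
    spare = deployed - required
    label = "N" if spare == 0 else f"N+{spare}"
    # 2N means a complete second set: you can lose N units and still serve.
    if deployed >= 2 * required:
        label += " (meets 2N)"
    return label


if __name__ == "__main__":
    for deployed in (4, 5, 6, 8):
        print(deployed, "hypervisors with N=4:", describe_redundancy(4, deployed))
```

The spare count is also the number of simultaneous unit failures you can absorb before service degrades, which is the figure worth knowing by heart.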

Single Points of Failure

To be 100% reliable, you must not have any single point of failure. If you draw your infrastructure out on a piece of paper and throw darts at it, no dart should be able to strike something that takes down the entire stack. Two darts, hopefully not. Three darts, maybe. Four well-placed darts, quite possibly.

In fact, this is an excellent exercise in finding weak points. Draw out your infrastructure and make copies. Mark out one item on each copy, and determine whether the stack will continue to function. The hardest of these is often power or core network. If a member of your storage switch stack failed, would everything continue to function? If a power circuit failed, either due to a breaker trip or a PDU problem, would you stay up and online?
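The paper exercise can also be sketched in code. The Python below is a minimal illustration, assuming you can describe your stack as "traffic flows from this component to those components"; the topology and every component name in it are hypothetical. It removes one component at a time and checks whether the outside world can still reach the VMs. It only models connectivity, so power feeds and anything that needs all of its dependencies at once would need a richer model.

```python
from collections import deque


def reachable(links: dict, start: str, goal: str, removed: str) -> bool:
    """Breadth-first search from start to goal, pretending `removed` has failed."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in links.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(links: dict, start: str, goal: str) -> list:
    """Return every component whose loss alone cuts start off from goal."""
    components = set(links) | {n for targets in links.values() for n in targets}
    components -= {start, goal}
    return sorted(c for c in components if not reachable(links, start, goal, c))


if __name__ == "__main__":
    # Hypothetical stack: everything is doubled except the storage array.
    links = {
        "internet": ["uplink-a", "uplink-b"],
        "uplink-a": ["switch-1"],
        "uplink-b": ["switch-2"],
        "switch-1": ["hypervisor-1", "hypervisor-2"],
        "switch-2": ["hypervisor-1", "hypervisor-2"],
        "hypervisor-1": ["storage"],
        "hypervisor-2": ["storage"],
        "storage": ["vms"],
    }
    print(single_points_of_failure(links, "internet", "vms"))
```

In this made-up topology the lone storage array is the only component a single dart can hit to take everything down.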

At least two of everything

The minimum number of anything in a High Availability deployment is 2. Two power strips, two storage controllers, two switches (per switch stack), two provider uplinks, etc, etc, etc. They may be active/active, they may be active/passive, but there are two and they are hot, ready to pick up the full load with no notice.

For example, if your hypervisors only have one power supply each, you should have two hypervisors anyway, and you will feed each one from a different power circuit. Your network connectivity should run one port to each member of your switch stack, configured for some type of link aggregation or failover in the event that a switch member fails or a switchport on your system fails.

Recalculating as needed

Often we’ll deploy these systems with a set of design goals or a set upper capacity. Time passes and used capacity creeps above the designed upper capacity, so we should have processes in place to notice when that happens. When it does, you’ll need to recalculate and re-evaluate. Maybe you now need another hypervisor or two to retain the advertised redundancy. Maybe you need more storage devices, or more switches.
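One way to notice is to rerun the sizing arithmetic on a schedule. A minimal Python sketch, assuming load and host capacity are measured in the same unit (GB of RAM here, purely as an example) and that the advertised redundancy is expressed as a number of spare hosts; the function names and figures are hypothetical.

```python
import math


def hypervisors_needed(total_load: float, capacity_per_host: float, spares: int = 1) -> int:
    """Hosts required to carry total_load while keeping `spares` extra (N + spares)."""
    if capacity_per_host <= 0:
        raise ValueError("capacity_per_host must be positive")
    n = math.ceil(total_load / capacity_per_host)  # N: the bare minimum
    return n + spares


def redundancy_intact(hosts: int, total_load: float,
                      capacity_per_host: float, spares: int = 1) -> bool:
    """True if the current host count still delivers the advertised redundancy."""
    return hosts >= hypervisors_needed(total_load, capacity_per_host, spares)


if __name__ == "__main__":
    # Designed for 300 GB of VM memory on 128 GB hosts at N+1: four hosts.
    print(redundancy_intact(hosts=4, total_load=300, capacity_per_host=128))
    # A year later the VMs use 400 GB; four hosts no longer cover N+1.
    print(redundancy_intact(hosts=4, total_load=400, capacity_per_host=128))
```

Wiring a check like this into monitoring turns "we realize" into an alert rather than a surprise during the next failure.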

Make someone else do it

Of course, the “easy” way out is to pay someone else to design and build your system for you. Plenty of service providers will be quite happy to do so, from Dell, to VCE, to many smaller MSPs. You still need to understand the above, and be able to question their designs based on it.

Datacenters 101

A lot of people have specific ideas when you say “Datacenter”: dark ambient light, lots of bright lights from servers (hopefully lots of green and blue, and not so much yellow or red). Some big datacenter providers put a lot of effort into that “cool” factor, going out of their way to keep overhead lights dim, brushed steel with LED strings in behind, etc. The fact is that there is a lot to datacenters that isn’t talked about as much because it isn’t so cool. In fact, most datacenters and datacenter operations are pretty boring.

What is a datacenter?

A datacenter is any place you keep and store servers and/or data and/or core networking equipment. For many businesses it’s a closet, or even just a section of a closet. That’s because that is all the business needs. As the business grows, the IT needs grow and this gets outsourced somewhere, possibly to a Datacenter Provider.

What does a datacenter provide?

A professional datacenter provider is supposed to have a range of things.

  • Reliable network connectivity, either through their own network (carrier controlled) or by making available direct connections to providers (carrier-neutral) — some providers do both.
  • Reliable power. A datacenter should have a UPS system to keep all of its customers online for a given period of time, usually long enough for the generator(s) to turn over and start. Depending on the age and size of the facility, most of these will be battery backed. Smaller facilities may use other technologies such as flywheels.
  • Reliable climate. Computers don’t like it when the humidity is too low, and they don’t like swings in temperature. Many facilities will target specific temperatures and tell you why their choice is optimal, but anywhere between 60 and 80 degrees F is probably going to be fine, so long as it isn’t swinging more than a couple of degrees over a period of time.
  • Reliable security. You don’t want other customers being able to steal your data, bandwidth or power. If they do, you want your provider to be able to know and prove it beyond a reasonable doubt so that it can be properly prosecuted. This isn’t limited to your cabinet or cage, either. When you get packages, you want to be able to trust that they are accepted and held for you, and only you.
  • Some kind of remote hands ability. If you need work done on your equipment such as a reboot or a drive swap, your provider should be able to give you the means to do that using their staff. You almost invariably pay extra in some way for this service, be it a bundled number of hours or just an hourly rate with a minimum time commitment per incident.

What does this mean for you?

Everything is variable, and everything has a cost. Datacenter providers will be more than happy to provide 100% Power SLAs, 100% Uptime SLAs, tight security based on blood samples and facial recognition along with a 10-digit PIN, and the best remote hands and managed services technicians in the world. They’re also going to expect you to pay high prices for that package.

On the other hand, you might be handed a physical key to a door, told which cabinet is yours, and wished a nice day. There might be a camera, there might not. There might be a lock on the cabinet, there might not. The cooling might work, it might not. And power/network will come with only a 99.0% uptime guarantee. And it’s dirt cheap.

The key is to find the right mix for your business and your needs. Because migrating is expensive, both in terms of man hours and downtime, picking right the first time is important. What’s important to you, and if you can’t afford the best, where are you able to compromise? Maybe you need maximum power and network, but can afford to relax a little on the security side if it’s “good enough.”

Other notes

Most IT people have heard of SOX or PCI compliance, but have you heard of SAS 70, or SSAE-16? Any datacenter worth its salt will have an SSAE-16 SOC 1 and/or SOC 2 report (SSAE-16 being the updated SAS 70). It isn’t a simple pass/fail certification, though. Really, all it means is that they’ve defined a series of controls for different things, mostly related to reliability or security, and have either demonstrated to an auditor at one point in time that those controls were documented (Type 1), or demonstrated to an auditor that they were documented and then carried out over a period of time (Type 2, usually about 12 months). Each report is unique to the facility or the facility’s owner (covering all of their facilities), and you’ll need to read theirs specifically and carefully to learn what it actually covers. Facilities have an obligation to provide the report on request (after all, it’s why they spent the time and money to get it written), though they may demand that you sign an NDA or similar first.

Don’t Panic

The noise your phone makes on an alert-based SMS.

The ringtone when the office is calling.

“Waiting for controller to initialize” hangs on startup.

Don’t panic. The key to resolving any issue quickly is to slow down and think it through.

I had exactly this happen this morning. Two new servers were installed on my desk and bench tested without issue. Then they were put in a car and transported to another datacenter, racked, patched and powered on. The first was fine, no issues. The second hung while initializing the 3ware RAID controller.

The issue was really quick to resolve. Pulled the card, pulled the riser, reseated the riser, reseated the card. System booted.

Don’t panic.