Qualities of a Manager

The trend on the Sysadmin subreddit this week has been talking about the struggles of management: the things lowly team members might not realize their team lead or manager has to deal with, and how they resent them for it.

Coming off this, combined with some recent personal experiences, I was thinking about what I value the most in a manager or a lead, or in “Management” in general. And I came up with three things: Communication, Decisiveness, and Respect.

Communication

Communication is the key to just about everything. If you want to clap, you need co-ordination between the left hand, the right hand, and, when appropriate, the rhythm section of the brain. If you want to onboard an employee, you need to be able to notify the appropriate people of the critical details. When are they starting? What are they going to be doing? Where will they be working? Do they need a laptop or workstation set up? Software? An email address? And it’s not just questions for the IT department: do they need a desk? Is there one there, or do we need to order/build one? Do they need a parking permit? Does HR have someone available to do their basic onboarding class?

A good manager needs to be able to communicate in multiple directions. Passing messages from upper management down to the employees. Passing messages from the employees to upper management. Sending and receiving messages from other teams. Putting team members in contact with the right people in other departments in order to streamline processes.

A manager with good communication skills will have employees who feel that their voices are heard, and feel like they’re up to speed on team goals and company goals. A manager with poor communication skills will have employees who feel they don’t know what is going on, feel they don’t know who to talk to or who to ask for help, and who will ultimately find other places to be.

Decisiveness

The worst thing you can do to me is constantly change your mind, back and forth, without reason. I say that because I fully understand that sometimes changes must be made, decisions are found to be wrong, data is found to be erroneous, etc. But to constantly go back and forth forever on who we should choose to replace our broken phone system is ridiculous, time-consuming, frustrating, demoralizing, and expensive. It’s difficult to quantify that cost, but that doesn’t mean it isn’t there.

My job is generally not to make decisions, not when it comes to spending money. It is to provide valid data and opinions; your job as manager is to either make the choice yourself or pass it up the chain. I reasonably expect that process to consume some time, but when you come back with a decision, I expect we should be able to stick with that decision unless new data comes to light.

This isn’t just imagined, either. It may be partly misperception (see the last post), but it isn’t imagined. The company I work for has been trying to replace its poorly implemented Asterisk PBX solution, and so far it has taken about nine months. We trialed five companies and recommended a decision to management, who sat on it for three months, then told us the company was merging and we should try the vendor the other company uses. No problem: trial, recommend, move on. The decision sits around for another six months while the merger completes; we learn more about their preferred vendor and decide it won’t fit. We start reaching out to other vendors, then, oh wait, they will fit after all, never mind. All the while we were instructed to spend as little time as absolutely necessary maintaining or resolving issues with the existing system, so the number of bugs and issues keeps rising.

I’m fairly certain this also drives other departments to believe that we’re just sitting on our hands.

A good manager will take the data at hand, and make an informed decision as quickly as possible. Progress will be made, and poor choices will have the opportunity for correction. A poor manager will sit on a difficult choice for hours, days, weeks or even months or years. Issues will languish and decay. The biggest risk of procrastination is that when decisions finally are made, the data that backed them has changed and what would have been a good choice yesterday is now a terrible one. And of course, when you take so long to make decisions in the first place, the time it takes to correct your poor choices is lengthened as well.

Respect

This seems to be the hardest to come by, at least, if social media is to be believed. “They don’t care about you, they care about your ability to make them profitable.” It may be true at a corporate level, but a good manager recognizes the talent of his employees and treats them appropriately. For one, firing staff and hiring new people is time consuming, expensive, and results in the rest of the team being unproductive while they pick up the slack and help in training the new person.

What managers often struggle with is that you must show respect not just to the employee, but also to their family and loved ones. Most specifically for me, I have a wife and young daughter. This fact is not new to any of my recent managers, and it doesn’t impact my day-to-day. It does, however, impact my ability to travel. My wife has a job with slightly odd hours, and my daughter attends a daycare. With our regular schedules we have no issues dropping off and picking up, and providing adequate care. But I can’t just decide I’m going on a business trip for three days next week; that needs to be coordinated and planned at least a couple of weeks in advance. And then I can’t just stay an extra day because you felt like it. Two years ago that would have been showing disrespect to me, and I would suck it up and deal with it. Today, it shows disrespect to me and to my family, and has the potential to impact our overall livelihood if my wife gets fired as a result.

A good manager will know and understand his employee’s talents and utilize them to the best of his ability. He will recognize that employees have home lives, and work to serve the needs of the business as the primary goal, while working around the unique personal needs of his staff. We may realize it’s not always possible, but we should at least be able to see that you tried. A bad manager will consider employees as a commodity, to be hired and fired as needed, and to be manipulated in order to avoid said firing. A bad manager will leave employees entirely out of the loop on changes to their schedules until the minute they’re implemented, and will point to contractual terms and company policies when questioned about it.

Review

A good manager will listen to his staff, make decisions as quickly as possible, and communicate those decisions back out to the team as required. A good manager will have respect for employees and their families, at least up until the employee gives reason to negate that respect.

A poor manager will consistently fail to communicate, will take forever to make decisions or flip-flop on those choices. A poor manager will show contempt or otherwise disrespect his staff or their families.

Ultimately bad managers find the door one way or another. But even then, a good manager of managers will recognize the good and the bad beneath them. After all, good managers tend to hold on to employees much longer, and get much better reviews when surveyed. Bad managers drive people away quickly.

Personal Experience is Personal Reality

When everyone in a room shares a similar perception, people tend to get along and understand each other, even if they disagree about whether that perception is right or wrong. As soon as someone in that room has a personal perception at odds with the rest, they splinter off from the group, and each side begins to perceive the other as being somewhat crazy.

Remember the 6-image pictures that were all the rage a few years ago, each one showing “What my friends think I do,” etc?

(Image: the “What my friends think I do / What I actually do” meme.)

All of those panels are based on the perceptions of different people. Granted, few of them actually line up with reality, but until a person’s experience changes, their perception will remain, and for them that is reality.

Consider the people you work with: how do they perceive you and the work you do? If it’s anything like the “what my boss thinks I do” panel, you may have a real problem, one that involves updating the resume and applying for new work somewhere else really soon. If you’re a manager, how do your employees perceive your leadership skills? Chances are it’s not how you perceive them yourself.

Here’s the thing. Personal Experience drives Personal Perception, and those two things combined are an individual’s Personal Reality. As a manager, or just as a representative of a group (e.g., a member of the IT support team), it is your responsibility to ensure perceptions of you and of your team are positive and accurate. When interacting with someone, that should be foremost in your mind: making sure their experience is the right one, so that it drives the right perception. Because if your employees or your customers perceive you as not caring about them, it doesn’t matter what you do or how hard you try. If their experience of working with you leads them to believe you don’t care for them in the slightest, that’s what they’re likely to believe. That will be their reality.

Upgrading OwnCloud (or, keep up to date to avoid issues)

It turns out that OwnCloud is a finicky piece of software. After being defeated in my attempts to update, I grew determined to win.

Premise:

We have a CentOS 6 x64 server running OC, v7.0.2. The current release is 9.0.1, but OC doesn’t support skipping major versions when upgrading, and it wasn’t even as simple as 7.0 to 8.0 to 9.0; there was an 8.1 and an 8.2 in there as well.

OC was originally installed via an OpenSUSE package. That package has since gone out of date and no upgrade path via that method existed.

(note, backup your system here)

Step 1: Update 7.0.2 to 7.0.11

Step 1 was probably the easiest.

Download the 7.0.11 tar.bz2 from the change page (https://owncloud.org/changelog/), drop in the DocRoot.

Move the old owncloud directory out of the way, unpackage the new one, and copy the config.php (owncloud-old/config/config.php) into the new directory (owncloud/config/).
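In shell terms, that swap looks roughly like this (the DocRoot path and tarball location are assumptions for illustration; the same pattern applies to the later steps too):

cd /var/www/html                              # assumed DocRoot
mv owncloud owncloud-old                      # move the old install aside
tar -xjf /path/to/owncloud-7.0.11.tar.bz2     # unpack the new release here
cp owncloud-old/config/config.php owncloud/config/
chown -R apache:apache owncloud               # assumed web server user on CentOS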

Hit the webpage, follow the upgrade steps. Owncloud handles the DB changes and makes sure everything is happy.

Once the upgrade is done, log in and make sure all is well.

(note, backup your system here)

Step 2: Update 7.0.11 to 8.0.9

This is where the fun starts. I’m working on a CentOS 6 box, and the packaged version of PHP for Cent6 is 5.3 (5.3.3, to be exact). As it happens, 8.0 requires PHP 5.4, so life just got interesting.

A few commands, and we’re away:

yum install centos-release-SCL
yum install php54 php54-php php54-php-gd php54-php-mbstring php54-php-mysqlnd
mv /etc/httpd/conf.d/php.conf /etc/httpd/conf.d/php.off
service httpd restart

Actually, at this point you’re best off stopping httpd, unless you have a very small OwnCloud instance. The OC forums are riddled with comments about how long this next step takes, and a browser-based upgrade will time out after 3600s (one hour) by default.
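If you do stop it, on a stock CentOS 6 box that’s simply:

service httpd stop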

Download the 8.0.9 tar.bz2 from the change page (above), drop in the DocRoot.

Move the old owncloud directory out of the way, unpackage the new one, and copy the config.php (owncloud-old/config/config.php) into the new directory (owncloud/config/).

Move into your new OwnCloud directory and run:

sudo -u apache /opt/rh/php54/root/usr/bin/php occ upgrade

This runs the upgrade script that would normally be run from your browser, but does it from the command line instead. It also uses the PHP 5.4 binary, and runs as the Apache user (on CentOS). No worries about timeouts!

This step takes a long time. Apparently it has to do with rewriting the encryption keys for every file uploaded to your cloud. ‘top’ showed 100% CPU usage for the duration; I left it to run overnight, and with our 50GB of data it took about 3 hours.

Once that’s done, migrate the encryption keys:

sudo -u apache /opt/rh/php54/root/usr/bin/php occ encryption:migrate

Restart httpd (if you stopped it) and log in. Make sure all is well.

(note, backup your system here)

Step 3: Update 8.0.9 to 8.1.4

This is essentially the same as Step 2, without a PHP upgrade.

Move the old owncloud directory out of the way, unpackage the new one, and copy the config.php (owncloud-old/config/config.php) into the new directory (owncloud/config/).

Move into your new OwnCloud directory and run:

sudo -u apache /opt/rh/php54/root/usr/bin/php occ upgrade

For me, this step was much faster, taking only two minutes.

Restart httpd (if you stopped it) and log in. Make sure all is well.

(note, backup your system here)

Step 4: Update 8.1.4 to 8.2.1

Again, essentially the same.

Move the old owncloud directory out of the way, unpackage the new one, and copy the config.php (owncloud-old/config/config.php) into the new directory (owncloud/config/).

Move into your new OwnCloud directory and run:

sudo -u apache /opt/rh/php54/root/usr/bin/php occ upgrade

This took a little longer, mostly hanging on the database schema update check. And by a little longer, I mean about 90 seconds.

Log in once more and check that all is well.

(note, backup your system here)

Step 5: Update 8.2.1 to 9.0.1

Not so simple this time, although you could do it the same way. We wanted to get back onto the point releases, so we moved back to using packages.

Move the owncloud directory out of the way, and remove the owncloud package (7.0.2) from the system.

Remove the old repos, and install the updated ones:

rpm --import https://download.owncloud.org/download/repositories/stable/CentOS_6_SCL_PHP54/repodata/repomd.xml.key
wget http://download.owncloud.org/download/repositories/stable/CentOS_6_SCL_PHP54/ce:stable.repo -O /etc/yum.repos.d/ce:stable.repo
yum clean expire-cache
yum install owncloud-files

Copy the config into the appropriate directory, switch to the OwnCloud directory and run, once more:

sudo -u apache /opt/rh/php54/root/usr/bin/php occ upgrade

Log in and check all is well.

For kicks, back up your system one last time.

You’re now running the latest available OwnCloud, installed from package. Now you just need to keep that up to date, and life will be significantly easier!

Now I can run a yum update again, and not have to see cascading dependency failures, or risk breaking the system because the package available is too far ahead of what was installed.

Why do I need my own backups?

“My provider already has them,” “It’s not that important, anyway,” “backup tools are (too hard|too expensive|too complicated|not possible in my environment).”

“Everyone wants restore, no-one wants backup.” AdminZen, Part 6, Backups, point 1.

Too many times recently I’ve seen excuses like the above for not having backups, followed shortly by long tirades about how it’s someone else’s fault that their data is lost.

Somewhere I saw someone say it well: If it’s not important enough for you to put your own backup system in place, it’s not important enough for you to complain about it being gone when it’s gone.

I work for a service provider, and we take regular backups of our systems: the customer-facing systems our customers pay to be hosted on. Not too long ago we found out the hard way that a backup was corrupted, when a system failed and took all the data with it. We were able to restore from a very dated backup, but it had been nearly two months since the last successful one, and that was all we could restore. Some of our customers had been diligently taking their own backups, however, so they were able to get back up with other providers before we brought our system back online, or to restore their own accounts with us at that point, since what we brought back was very old data.

“You’re a careless provider; it’s your fault we have no data!” Actually, it’s not. Our terms of service are not at all unique in the hosting community, and they make clear that while we may take backups, there are limits to what we are obligated to provide. They also recommend that you take your own backups independently.

Now, I don’t work for Bluehost; theirs is just a generic example (section 17):

For its own operational efficiencies and purposes, Bluehost from time to time backs up data on its servers, but is under no obligation or duty to Subscriber to do so under these Terms. IT IS SOLELY SUBSCRIBER’S DUTY AND RESPONSIBILITY TO BACKUP SUBSCRIBER’S FILES AND DATA ON BLUEHOST SERVERS, AND under no circumstance will Bluehost be liable to anyone FOR DAMAGES OF ANY KIND under any legal theory for loss of Subscriber FILES AND/or data on any Bluehost server. Bluehost will not attempt to back up accounts that exceed 50,000 files or 30 Gigs of space for any reason.

To confirm this as generic, here is InMotion Hosting’s section (see “Data Backup”; I don’t work for them either):

InMotion Hosting maintains, as a convenience to its clients, regular automated data backups on accounts equal to or less than 10 gigabytes in total size. This service is included only with Business Class or Virtual Private Server hosting accounts and is provided at no additional charge. Hosting accounts greater than 10 gigabytes in size will not be included in regular data backups; this service is, however, available for an additional service charge for accounts exceeding the 10 gigabyte size limit.

While InMotion Hosting maintains the previously stated backups, this service is provided as a convenience only and InMotion Hosting assumes no liability as to the availability or completeness of client data backups. Each client is expected and encouraged to maintain backup copies of their own data. InMotion Hosting will provide, upon request, one (1) data restore per four (4) calendar months free of charge. Additional data restores may be provided but are subject to additional service fees.

Note how there are limits on account size for backup, and a specific note is included that there is no liability on the part of the service provider to take a backup for you.

So how do you take backups? There are a number of tools, depending on what you are doing.

For Dedicated Servers or VPSs, a simple rsync or rsnapshot is often sufficient. If you have more than one, and your usage on each is well below 50%, consider backing them up to each other. If not, creating a ZIP or tar file of your data and regularly downloading it will usually do the job. Remember, the purpose of the backup is to ensure that if your provider fails or entirely loses your system, you can find a new provider and get back online as soon as possible.
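As a rough sketch of the rsync approach, pulling one server’s data to the other from cron (hostnames and paths here are invented for illustration):

# on server A: pull a copy of server B's web data, keeping deletions in sync
rsync -az --delete -e ssh root@serverB.example.com:/var/www/ /backups/serverB/www/

# crontab entry on server A to run the same thing nightly at 2am
0 2 * * * rsync -az --delete -e ssh root@serverB.example.com:/var/www/ /backups/serverB/www/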

For Shared Hosting users, most of you will have access to cPanel or a similar control panel, which provides an option to generate and download an account backup. Most providers will also frown upon keeping account backups on the server, so make sure you download and then delete it. A cPanel backup would be preferred, however, as it is in the perfect format to upload to another provider and restore your content if you need to move.

If the panel doesn’t have the option, or the provider has locked the option out for some reason, you should still have FTP access. This will let you download your files from anywhere with an FTP client. Now, that won’t include any MySQL or other databases; you’ll need a solution for those as well.
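For a typical MySQL database, a dump piped through gzip is usually all you need; something like this, with the credentials and database name being placeholders:

mysqldump -u dbuser -p'dbpassword' mydatabase | gzip > mydatabase-$(date +%F).sql.gz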

For very small backups, something I’ve done in the past is to have a scheduled task (usually a cron job on my Linux systems) which dumps the database, packages it up with the files, and sends it via email to one of my Gmail addresses. A filter then immediately deletes the email. It seems odd, but Gmail keeps emails in the trash for 30 days by default, so this setup creates a rolling 30-day backup! Keep in mind it only works if the email is less than 10MB; any larger and Google will reject the message on receipt.
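A rough sketch of that scheduled task, assuming a small site; every path, credential, and address below is a placeholder, and the mail step assumes mutt is available for attachments:

#!/bin/bash
# nightly-backup.sh - dump the DB, bundle it with the site files, email it out
STAMP=$(date +%F)
mysqldump -u dbuser -p'dbpassword' mydatabase > /tmp/db-$STAMP.sql
tar -czf /tmp/site-$STAMP.tar.gz /var/www/mysite /tmp/db-$STAMP.sql
# only send it if it is comfortably under the provider's message size limit
if [ $(stat -c%s /tmp/site-$STAMP.tar.gz) -lt 9000000 ]; then
    echo "Backup $STAMP" | mutt -s "Site backup $STAMP" -a /tmp/site-$STAMP.tar.gz -- backups@example.com
fi
rm -f /tmp/db-$STAMP.sql /tmp/site-$STAMP.tar.gz

The matching crontab entry would be something like 30 3 * * * /usr/local/bin/nightly-backup.sh.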

Building for High Availability

There are several principles to consider when designing and building a system for high availability, so-called 100% uptime.

Firstly, what is the budget? Cost constraints will invariably determine how high that availability is. For this exercise, we’re going to focus on a virtual environment in which the VMs will have 100% Uptime.

Constraining factors:

Power: Reliable power is critical. If grid power goes out, what happens? You need to know that a UPS system is in place, and that it is well monitored and tested. You also need to know that at least one generator is in place, ideally more, to pick up the load for prolonged power grid outages.

Network: Your VMs may be 100% available internally, but if you can’t reach them from the outside, there is no point. Having a network provider with 100% uptime will be critical, or you accept that your maximum is capped by your provider’s maximum, say 99.999%. That, or you select multiple providers and blend the bandwidth yourself; that gets expensive, though, more so when you consider the equipment you’ll need and the expertise to configure and maintain it.

Level of redundancy: How much of your stack are you able to lose before you can’t provide full service? For example, do you have N+1 hypervisors, N+2, or 2N? In each case, “N” is the minimum number of things you need to provide service. Knowing that you have more than that (and how much more) available is critical.
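To make that concrete: if your workload needs three hypervisors to carry everything (N = 3), then N+1 means running four, N+2 means five, and 2N means six, at which point you could lose an entire half of the fleet and still carry the full load.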

Single Points of Failure

To be 100% reliable, you must not have any single point of failure. If you draw your infrastructure out on a piece of paper and throw darts at it, no dart should be able to strike something that takes down the entire stack. Two darts, hopefully not. Three darts, maybe. Four well-placed darts, quite possibly.

In fact, this is an excellent exercise in finding weak points. Draw out your infrastructure and make copies. Mark out one item on each copy, and determine whether the stack will continue to function. The hardest one of these is often power or core network. If a member of your storage switch stack failed, would everything continue to function? If a power circuit failed, either due to a breaker trip or a PDU problem, would you stay up and online?

At least two of everything

The minimum number of anything in a High Availability deployment is two. Two power strips, two storage controllers, two switches (per switch stack), two provider uplinks, etc, etc, etc. They may be active/active or active/standby, but there are two, and both are hot, ready to pick up the full load with no notice.

For example, if your hypervisors have only one power supply each, you should have two hypervisors anyway, and you will feed each from a different power circuit. Your network connectivity should run one port to each member of your switch stack, configured for some type of link aggregation or failover in case a switch member or a switchport on your system fails.
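On a CentOS 6 host, that NIC failover piece is a short bonding configuration. This is a sketch only, assuming the stock network scripts; the interface names and address are placeholders:

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.0.2.10
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=active-backup miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat for eth1, cabled to the other switch member)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none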

Recalculating as needed

Often we’ll deploy these systems with a set of design goals or a set upper capacity. Time passes, and the capacity actually in use creeps above that designed upper limit; we should have processes in place to notice when that happens. When it does, you’ll need to recalculate and re-evaluate. Maybe you now need another hypervisor or two to retain the advertised redundancy. Maybe you need more storage devices, or more switches.

Make someone else do it

Of course, the “easy” way out is to pay someone else to design and build your system for you. Plenty of service providers will be quite happy to do so, from Dell, to VCE, to many smaller MSPs. You still need to understand the above, and be able to question their designs based on it.

Datacenters 101

A lot of people have specific ideas when you say “datacenter”: dark ambient light, lots of bright lights from servers (hopefully lots of green and blue, and not so much yellow or red). Some big datacenter providers put a lot of effort into that “cool” factor, going out of their way to keep overhead lights dim, putting brushed steel with LED strings behind it, and so on. The fact is, there is a lot to datacenters that isn’t talked about as much because it isn’t as cool. In fact, most datacenters and datacenter operations are pretty boring.

What is a datacenter?

A datacenter is any place you keep and store servers and/or data and/or core networking equipment. For many businesses it’s a closet, or even just a section of a closet. That’s because that is all the business needs. As the business grows, the IT needs grow and this gets outsourced somewhere, possibly to a Datacenter Provider.

What does a datacenter provide?

A professional datacenter provider is supposed to have a range of things.

  • Reliable network connectivity, either through their own network (carrier controlled) or by making available direct connections to providers (carrier-neutral) — some providers do both.
  • Reliable power. A datacenter should have a UPS system to keep all of its customers online for a given period of time, usually long enough for the generator(s) to turn over and start. Depending on the age and size of the facility, these will most often be battery-backed. Smaller facilities may use other technologies such as flywheels.
  • Reliable climate. Computers don’t like it when the humidity is too low, and they don’t like swings in temperature. Many facilities will target specific temperatures and tell you why their choice is optimal, but anywhere between 60 and 80 degrees F is probably going to be fine, so long as it isn’t swinging more than a couple of degrees over a period of time.
  • Reliable security. You don’t want other customers being able to steal your data, bandwidth or power. If they do, you want your provider to be able to know and prove it beyond a reasonable doubt so that it can be properly prosecuted. This isn’t limited to your cabinet or cage, either. When you get packages, you want to be able to trust that they are accepted and held for you, and only you.
  • Some kind of remote hands ability. If you need work done on your equipment such as a reboot or a drive swap, your provider should be able to give you means to do that using their staff. You almost invariably pay extra in some way for this service, be it a bundled number of hours or just an hourly rate with minimum time commits per-incident.

What does this mean for you?

Everything is variable, and everything has a cost. Datacenter providers will be more than happy to provide 100% Power SLAs, 100% Uptime SLAs, tight security based on blood samples and facial recognition along with a 10-digit PIN, and the best remote hands and managed services technicians in the world. They’re also going to expect you to pay high prices for that package.

On the other hand, you might get a physical key to a door, be told which cabinet is yours, and be wished a nice day. There might be a camera, there might not. There might be a lock on the cabinet, there might not. The cooling might work, it might not. And power/network will come with only a 99.0% uptime guarantee. And it’s dirt cheap.

The key is to find the right mix for your business and your needs. Migrating is expensive, both in terms of man-hours and downtime, so picking right the first time is important. What matters most to you, and if you can’t afford the best, where can you compromise? Maybe you need maximum power and network, but can afford to be a little lax on the security side if it’s “good enough.”

Other notes

Most IT people have heard of SOX or PCI compliance, but have you heard of SAS 70, or SSAE-16? Any datacenter worth their salt will have been certified under SSAE-16 SOC 1 and/or SOC 2 (the successor to SAS 70). This isn’t simple, though. Really, all it means is that they’ve defined a series of controls for different things, mostly related to reliability or security, and have demonstrated to an auditor either that those controls were documented at one point in time (Type 1), or that they were documented and then carried out over a period of time, usually about 12 months (Type 2). Each report is unique to the facility or the facility’s owner (covering all of their facilities), and you’ll need to read theirs specifically and carefully to learn what it actually covers. Facilities will generally provide the report on request (after all, it’s why they spent the time and money to get it written), though they may demand the signing of an NDA or similar first.

Don’t Panic

The noise your phone makes on an alert-based SMS.

The ringtone when the office is calling.

“Waiting for controller to initialize” hangs on startup.

Don’t panic. The key to resolving any issue quickly is to slow down and think it through.

I had this happen this morning. Two new servers were installed on my desk and bench-tested without issue. Then they were put in a car and transported to another datacenter, racked, patched and powered on. The first was fine, no issues. The second hung while initializing the 3ware RAID controller.

The issue was really quick to resolve. Pulled the card, pulled the riser, reseated the riser, reseated the card. System booted.

Don’t panic.

Oversight and Documentation

When you have an environment as big as ours is, there absolutely need to be processes in place to manage things coming in, existing things, and things going out. That is why I haven’t been posting much recently: I’ve been stuck trying to fix this problem, because those processes are either not written down, don’t exist, or no-one follows them.

Consider this: when I took on this role in July of 2014, we were getting our feet wet in Puppet, had dabbled at some point in Spacewalk, and had gone no further. We had somewhere in the vicinity of 200 servers, be they physical devices or virtual containers of one kind or another, most of them running CentOS 5, and no central processes or tools to manage them. I don’t know how the team managed Shellshock or Heartbleed; I assume they patched the systems they could think of that would be most likely to get hit or would hurt the most, and ignored the rest.

My highest priority coming in was to fix the Puppet implementation, re-deploy Spacewalk, set CentOS 6.x as the standard and get moving on pushing systems into that environment. So far we’ve made good progress: over 150 systems are now in that environment. I don’t have a good count of what is left, but we’re well over 50%. Still, there are systems I don’t know about, or don’t know enough about.

Our cloud solution is one of them. I worked on this project as a Junior after we’d pushed it and had problems. I was astounded: here we were, trying to put together a product to sell to our customers at an extremely high premium, and we were throwing a few hours a week at building it in between panicked days of supporting needy customers. It was no wonder to me that when we rolled it out it was broken, it wasn’t properly documented or monitored, and no-one knew how it all worked. Part of me wonders if we intended for it to fail.

And so we come back to oversight and documentation. As my team is in the midst of conceptual design for our next virtualization platform, the thing fails. By now it has a few shared webhosting servers running on it, and that’s about it, but our support team is still getting slammed and we need to fix it.

  • The control node for the environment had filled its disk, apparently some time back in June or July; I don’t know for sure, because we weren’t monitoring it.
  • The backup server, which stores backups of VMs generated via the control node, had a corrupted disk and had gone into read-only mode, possibly as far back as February. We weren’t monitoring it.
  • Two hypervisors failed simultaneously. One of them came back up, but the VM it hosted was still broken. Still, we only learned this when customers called in and reported issues, and when the VMs themselves generated alerts by being unreachable. We weren’t properly monitoring the hypervisors.

All of these issues should have been handled long before the service became available for sale. Parts of them were documented as needing to be fixed, but no-one seemed too worried about making it happen.

My predecessor once said “if it isn’t documented, it isn’t finished” — I agree. But expanding on that, if it isn’t monitored, it isn’t fully documented. If it isn’t documented, it isn’t finished, and if it isn’t finished, it isn’t ready for full-price-paying customers in production.

Bugs in limits

I came across a new one yesterday; well, not really a new one, but a not particularly well documented one.

It seems to be a problem that was found in CentOS 5, and has, at least on some level, persisted into CentOS 6 as well.

The problem as noted stems from attempting to raise the number of processes per user, which may be why not many notice it — after all, how many systems need to permit more than 1024 processes per user (the system default)?

Well, I did. We have some network maintenance going on that involves replacing the entire network infrastructure. That means a number of maintenance windows, and our network engineers have monitoring systems that repeatedly ping every host they can find on our network throughout each window. Once the window is over the pinging stops, but during the window they can verify that anything they take down comes back up. At last count they were using about 4,000 processes, which is where the problem was noticed.
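(For reference, that 1024 default on CentOS 6 doesn’t live in limits.conf itself; it typically comes from a drop-in file the distribution ships, roughly like the below, so check your own system.)

# /etc/security/limits.d/90-nproc.conf (as shipped on CentOS 6, approximately)
*          soft    nproc     1024
root       soft    nproc     unlimited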

How does one raise the soft/hard limit on maximum processes? How does it get done for a single user, a group of users, or all users? Well, for me at least, it’s usually an edit to /etc/security/limits.conf or one or more files in /etc/security/limits.d/, something along the lines of:

thekiwi       soft     nproc   4096
@kiwis        soft     nproc   4096
*             soft     nproc   4096

And that’s what I did, until I noticed during verification that it wasn’t working:

# cat /etc/centos-release
 CentOS release 6.7 (Final)
# ulimit -i
 3864

With some configurations I also got 3865. Now, there is a known workaround for this, and it involves specifying the first parameter using a UID. E.g.:

1013       soft     nproc   4096
@1091      soft     nproc   4096
500:65535  soft     nproc   4096

 

Long story short: if you’re having issues with limits.conf or ulimit not applying the right values for max processes (nproc) or number of pending signals (sigpending) at login, try specifying the UID or GID instead of the user or group name.

Amplification Attacks and Your Response

Amplification attacks are frustrating, whether you are the target of the flood or you find your system has been taking part in one.

The concept is simple — there are two core items:

  1. You send a small string to a UDP-based service and you get a large response back.
  2. You spoof your IP address so that the response goes somewhere else.

By utilizing both items, you can send a very small amount of traffic to a location and have it send a very large amount of traffic to your target. If you find enough services that are “vulnerable”, you can send a comparatively small amount of data and have those services send a lot of data back out to your target in an effort to flood their connection.

Common vectors for this attack are DNS, NTP, SNMP and others. See the tcpdump excerpt below: we sent a small query to a DNS server (as is common) and got back 163 bytes. Most queries are around 64 bytes, so by sending 64 bytes we got a 163-byte response, roughly 2.5 times larger than the request.

20:34:31.960523 IP (tos 0x0, ttl 51, id 10115, offset 0, flags [none], proto UDP (17), length 163)
 google-public-dns-a.google.com.domain > xxxxxxxxxxxxxxxxx.35760: [udp sum ok] 42237 q: A? google.com. 6/0/1 google.com. [4m59s] A 74.125.136.138, google.com. [4m59s] A 74.125.136.113, google.com. [4m59s] A 74.125.136.100, google.com. [4m59s] A 74.125.136.101, google.com. [4m59s] A 74.125.136.102, google.com. [4m59s] A 74.125.136.139 ar: . OPT UDPsize=512 (135)

That’s a small DNS response; with the right record you could easily get a response 500% larger than the request. Now, let’s be clear: this was a request to a recursive nameserver, but the results are exactly the same if you use an authoritative nameserver.

NTP servers are most prone to attack when they aren’t protected against the monlist command — in most cases they’ll respond with packets about the same size as the request, but with the monlist command they can return a very large response, many times the size of the request.

SNMP is probably one of the highest potential returns with the lowest risk — unless the community string is set to a default like “public” or “rocommunity.”


So we know the problem; what is the solution? Depending on the service, there are many ways to tackle it. The first step is to recognize which of your services have the potential to be a vector for attack. Running NTP? DNS? Other UDP-based services? Make sure you know what requests can be made and what the responses to them might look like. If you’re running NTP, you can disable the monlist command; with SNMP, you can keep the community string complex.
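For NTP and SNMP, the commonly recommended hardening is only a few lines of config; a sketch, with the community string and network range below being placeholders:

# /etc/ntp.conf - turn off the monlist facility and restrict status queries
disable monitor
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery

# /etc/snmp/snmpd.conf - long random community string, limited to a known source range
rocommunity Str0ngRandomString 192.0.2.0/24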

There is also a more generic way to handle this, and that’s by using the firewall. On Linux iptables will allow you to limit the number of packets per second using the limit module:

iptables -A INPUT -p udp --dport 53 -m limit --limit 10/s -j ACCEPT
iptables -A INPUT -p udp --dport 53 -j DROP

This will allow ten requests per second to the DNS port (UDP 53), anything beyond that will be dropped. This is a set of rules that will need to be tweaked for production on your server!

Another option, to complement this, is to look at other iptables modules that allow you to limit per IP; maybe you want to allow 100 requests per second overall, but let any given IP make only 10 requests per second.
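The hashlimit module is one way to do that; as with the rules above, this is a sketch to be tuned before production:

# allow each source IP up to 10 DNS queries per second (with a small burst), drop the rest
iptables -A INPUT -p udp --dport 53 -m hashlimit --hashlimit-name dns-per-ip --hashlimit-mode srcip --hashlimit-upto 10/s --hashlimit-burst 20 -j ACCEPT
iptables -A INPUT -p udp --dport 53 -j DROP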

The third firewall-related solution is a tool such as fail2ban, which can read your daemon’s logs and block clients you consider abusive for a given period. An IP makes more than 3600 requests in an hour? Blocked for an hour. This is a little more dangerous, as an attacker could use spoofed traffic to make one of the major resolvers, like Google’s public nameservers, look abusive; you block it, and then Google’s resolvers can no longer look up any domains hosted on your servers.
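With fail2ban, the jail might look something like this. The named-refused filter named here ships with fail2ban and catches denied queries; counting every query as described above would need query logging and a custom filter, and the log path and thresholds below are assumptions to adjust for your setup:

# /etc/fail2ban/jail.local
[named-refused-udp]
enabled  = true
filter   = named-refused
action   = iptables-multiport[name=named, port="53", protocol=udp]
logpath  = /var/log/named/security.log
findtime = 3600
maxretry = 3600
bantime  = 3600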

As I told someone earlier today, fixing the security holes in your services is important, to be sure. But it shouldn’t be the only solution.

 


Sources:

https://www.us-cert.gov/ncas/alerts/TA14-017A

http://www.watchguard.com/infocenter/editorial/41649.asp

http://blog.cloudflare.com/technical-details-behind-a-400gbps-ntp-amplification-ddos-attack