Oversight and Documentation

When you have an environment as big as ours is, there absolutely needs to be processes in place to manage things coming in, existing things, and things going out. That is why I haven’t been posting much recently – I’ve been stuck trying to fix this problem because those processes are either not written down, don’t exist, or no-one follows them.

Consider this: When I took on this role in July of 2014, we were getting our feet wet in Puppet, we had dabbled at some point in Spacewalk, and no further. We had somewhere in the vicinity of 200 servers, be they physical devices or virtual containers of one kind or another, most of them running CentOS 5, and no central processes or tools to manage them. I don’t know how the team managed Shellshock or Heartbleed, I’m assuming they patched the systems they could think of that would be most likely to get hit or would hurt the most, and ignored the rest.

My highest priority coming in was to fix the Puppet implementation, re-deploy Spacewalk, set CentOS 6.x as the standard and get moving on pushing systems into that environment. So far we’ve made good progress – over 150 systems are now in that environment, I don’t have a good count on what is left but we’re well over 50%. Still there are systems I don’t know about, or don’t know enough about.

Our cloud solution is one of them. I worked on this project as a Junior after we’d pushed it and had problems. I was astounded, here we were trying to put together a product to sell to our customers at an extremely high premium, and we were throwing a few hours a week at building it in between panicked days of supporting needy customers. It was no wonder to me that when we rolled it out it was broken, it wasn’t properly documented or monitored, and no-one knew how it all worked. Part of me wonders if we intended for it to fail.

And so we come back to oversight and documentation. As my team is in the midst of conceptual design for our next virtualization platform, the thing fails. By now it has a few shared webhosting servers running on it, and that’s about it, but our support team is still getting slammed and we need to fix it.

  • The control node for the environment had filled its disk. Apparently some time back in June or July, I don’t know for sure — we weren’t monitoring it.
  • The backup server, which stores backups of VMs generated via the control node, had a corrupted disk and had gone into read-only mode. Possibly as far back as February — we weren’t monitoring it.
  • Two hypervisors failed simultaneously, one of them came back up but the VM it hosted was still broken. Still, we learned this when customers called in and reported issues, and when the VMs themselves generated alerts by being unreachable. We weren’t properly monitoring the hypervisors.

All of these issues should have been handled long before the service became available for sale. Parts of them were documented as needing to be fixed, but no-one seemed too worried about making it happen.

My predecessor once said “if it isn’t documented, it isn’t finished” — I agree. But expanding on that, if it isn’t monitored, it isn’t fully documented. If it isn’t documented, it isn’t finished, and if it isn’t finished, it isn’t ready for full-price-paying customers in production.

Bugs in limits

I came across a new one yesterday; well, not really a new one, but a not particularly well documented one.

It seems to be a problem that was found in CentOS 5, and has at least on some level persisted into Cent 6 as well.

The problem as noted stems from attempting to raise the number of processes per user, which may be why not many notice it — after all, how many systems need to permit more than 1024 processes per user (the system default)?

Well, I did. We have some network maintenance going on which involves replacing the entire network infrastructure. In doing so, we have a number of maintenance windows going on and our Network Engineers have some monitoring systems that ping every host they can find on our network repeatedly throughout the window. Once the window is over, the pinging stops but they are able to verify throughout the window that anything they take down comes back up. At last count they were using about 4,000 processes, which is where the problem was noticed.

How does one raise the soft/hard limit on maximum processes? How does it get done for a single user, a group of users, or all users? Well, for me at least, it’s usually an edit to /etc/security/limits.conf or one/more files in /etc/security/limits.d/, something along the lines of:

thekiwi       soft     nproc   4096
@kiwis        soft     nproc   4096
*             soft     nproc   4096

And that’s what I did, until I noticed during verification that it wasn’t working:

# cat /etc/centos-release
 CentOS release 6.7 (Final)
# ulimit -i
 3864

With some configurations I also got 3865. Now, there is a known workaround for this, and it involves specifying the first parameter using a UID. E.g.:

1013       soft     nproc   4096
@1091      soft     nproc   4096
500:65535  soft     nproc   4096

 

Long story short, if you’re having issues with limits.conf or ulimit not specifying the right values for max processes (nproc) or number of pending signals (sigpending) at login, try specifying using UID or GID instead of the common name.

Amplification Attacks and Your Response

Amplification attacks are frustrating, whether you are the target of the flood or you find your system has been taking part in one.

The concept is simple — there are two core items:

  1. You send a small string to a UDP-based service and you get a large response back.
  2. You spoof your IP address so that the response goes somewhere else.

By utilizing both items, you can send a very small amount of traffic to a location and have it send a very large amount of traffic to your target. If you find enough services that are “vulnerable”, you can send a comparatively small amount of data and have those services send a lot of data back out to your target in an effort to flood their connection.

Common vectors for this attack are DNS, NTP, SNMP and others. See the below section of a tcpdump, we sent a small packet to a DNS server (as is common) and we got back 163 bytes. Most queries are around 64 bytes, so by sending 64 bytes we got a response of 163 bytes, that’s a response about 2.5 times larger than the request.

20:34:31.960523 IP (tos 0x0, ttl 51, id 10115, offset 0, flags [none], proto UDP (17), length 163)
 google-public-dns-a.google.com.domain > xxxxxxxxxxxxxxxxx.35760: [udp sum ok] 42237 q: A? google.com. 6/0/1 google.com. [4m59s] A 74.125.136.138, google.com. [4m59s] A 74.125.136.113, google.com. [4m59s] A 74.125.136.100, google.com. [4m59s] A 74.125.136.101, google.com. [4m59s] A 74.125.136.102, google.com. [4m59s] A 74.125.136.139 ar: . OPT UDPsize=512 (135)

That’s a small DNS response, if the right record is found you could easily get a 500% increase in the response compared with the request. Now, let’s be clear – this was a request to a recursive nameserver, but the results are exactly the same if you use an authoritative nameserver.

NTP servers are most prone to attack when they aren’t protected against the monlist command — in most cases they’ll respond with packets about the same size as the request, but with the monlist command they can return a very large response, many times the size of the request.

SNMP is probably one of the highest potential returns with the lowest risk — unless the community string is set to a default like “public” or “rocommunity.”


So we know the problem, what is the solution? Depending on the service there are many ways to tackle the problem. The first solution is to recognize which of your services have what potential to be a vector for attack. Running NTP? DNS? Other UDP-based services? Make sure you know what requests can be made and what the response to those might be. If you’re running NTP, you can disable the monlist command, with SNMP you can keep the community string complex.

There is also a more generic way to handle this, and that’s by using the firewall. On Linux iptables will allow you to limit the number of packets per second using the limit module:

iptables -A INPUT -p udp --dport 53 -m limit --limit 10/s -j ACCEPT
iptables -A INPUT -p udp --dport 53 -j DROP

This will allow ten requests per second to the DNS port (UDP 53), anything beyond that will be dropped. This is a set of rules that will need to be tweaked for production on your server!

Another option to complement this would be to look at other iptables modules that will allow to limit per-IP, so maybe you want to allow 100 requests per second overall, but any given IP can only make 10 requests per second.

The third firewall-related solution is a tool such as fail2ban which can read logs from your daemon and block users who you consider to be abusive for a given period. An IP makes more than 3600 requests in an hour? Blocked for an hour. This is a little more dangerous as it means an attacker could use your server to spoof-attack one of the major DNS resolvers like Google,which you then block, and then Google’s public nameservers are unable to resolve any domains on your servers.

As I told someone earlier today, fixing the security holes in your services is important, to be sure. But it shouldn’t be the only solution.

 


Sources:

https://www.us-cert.gov/ncas/alerts/TA14-017A

http://www.watchguard.com/infocenter/editorial/41649.asp

http://blog.cloudflare.com/technical-details-behind-a-400gbps-ntp-amplification-ddos-attack

 

Cleaning Up from the Last Guy

It’s inevitable, really, as we move through our careers we won’t just be joining teams but replacing people and taking over their previous duties. We also have a tendency to blame many things about the infrastructure on the last guy who ran it. In my duties as a sysadmin and technical account manager I’ve had the very recent opportunity to “pick up the pieces” from two very different situations and administrators in a very short period.

Our client has had several admins in a short period of time (a year or so) and as a result of some poor decisions by those guys along with some terrible documentation we’ve found their environment was essentially crumbling from the inside. When I came on I found three systems with already failed drives, all of our documentation was missing several systems (physical and virtual), access to each system was non-standard (at least it was for the Linux systems) and at least a few devices we have no detail or access information for at all.

This situation was hardly a surprise, the company had fired their previous sysadmin several months prior and hired a new guy just a few days before I was assigned to the account. The physical infrastructure was a mess, to say the least. Once we’ve cleaned up the remaining cabling I’ll be posting another article showing the difference, despite their racks having no space for cable management. Their power distribution units were overloaded, some key parts of the infrastructure weren’t redundant at all.

While we’re in the middle of working on this, our own company’s Senior Sysadmin decided it was time to move on in his career and many of his responsibilities were transferred to me. This appears to have been an almost completely different experience — the vast majority of our environment is documented, we don’t have standard passwords but there are standardized access groups between different systems which were formally handed off, and current/planned projects were also handed off in a formally documented manner. I have little doubt that if anything fails either I or my manager has the access details needed to log in, and that our documentation has enough detail that I can begin working on the issue.

Again, this was barely a surprise. This was an admin who kept detailed notes on just about everything. He grew frustrated if any of the techs (or customers, for that matter) broke the cabling convention, much less broke the cable management system he had designed for them. The physical infrastructure is certainly aging, but there are projects in place to replace it and either a good backup that we can restore from if necessary or notes on how the systems are configured that we can rebuild from scratch if it comes to that.

As a relatively new admin gaining much-needed experience, this entire situation has been an eye-opener. The idea of clean-as-you-go (or rather, document as you go) is a grand one. Maintaining a well-documented environment is a key component of being invaluable. Many admins have this notion that if they keep everything a secret then they’ll ensure their own job security — this is true, but only in a limited capacity. If that is your only grasp on job security, that tells me you’re not that good at your job and should probably be replaced anyway. On the other hand, if you have a properly documented environment then 1) you can take vacations and trust you won’t get called because something trivial broke, and 2) the guy who follows you, whether you left voluntarily or by force (or joins your team as it expands) won’t be bad-mouthing you to all of your co-workers and other parts of the local administrative community.

PagerDuty Review

I’m always looking for ways to improve both internal and external services and tools because I believe that to stagnate is bad. Just because what we have works, doesn’t mean we can’t make it better. If we can make it better in a cost-effective way, why aren’t we doing that?

It is in that vein that when I saw an ad for PagerDuty‘s services (with a free T-shirt) that I looked a little more at what they actually did, and then signed up for a trial.

Disclaimer: I have no affiliation with PagerDuty beyond my use as a trial user. Any links provided to PagerDuty are free of any referral codes and I pay no cost nor do I reap any benefit by anyone clicking through to their site and signing up for their services. PagerDuty offers a free t-shirt on the first alert, but since my address wasn’t requested either explicitly or given an option to enter on the form, I can only assume this is only for paid users – it certainly wasn’t made clear on signup but it does make sense that you need to give them money for them to give things to you.

The Basics

PagerDuty is a cloud-based service that provides alert escalations to duty technicians or administrators based on predefined rules and schedules.

PagerDuty can, based on user preferences, escalate alerts via email, SMS, phone call or by push notifications to their Android and iPhone applications.

PagerDuty accepts alerts from a wide range of tools including (but certainly not limited to) Nagios, Icinga, Zabbix, Pingdom, UptimeRobot and NodePing, harnessing both email queues and API-based tools.

When an alert is received it is assigned to an escalation path based on where it came from. The escalation path then notifies individuals or schedules and can be escalated if desired when the assigned on-call fails to acknowledge the alert. If the acknowledged alert goes unresolved for a period of time it is possible to have the alert fall back into a triggered state whereby it starts the escalation process again.

Who Is PagerDuty For?

PagerDuty is for anyone who needs to escalate alerts. PagerDuty is excellent for organizations where there are one or more monitoring systems that need to be consolidated into a single escalation system (e.g. Pingdom for system availability and Nagios for specific sub-checks). PagerDuty also excels at scheduling on-call staff on a daily, weekly or other basis. Override tools are also provided so that if Frank is out for a few days, Keith can be scheduled without breaking the rotation.

In my personal environment, PagerDuty would escalate a set of monitoring systems in a uniform manner and notify me of issues. In my work environment it would be an excellent tool to handle alerts for our level 3 support teams where our Systems, Network and possibly Management teams would each have an escalation path to which some alerts would go directly and others would be initiated after investigation by our Level 1/2 technical staff.

Review

Enough about what PagerDuty is and what it does, let’s get into the meat of it. I signed up for PagerDuty and the first thing I did was poke around. The interface is pretty intuitive, I never really got lost. There were no glaring bugs in the system, everything tied together nicely. A few things do have dependencies, you can’t delete an Escalation Path if it is tied to a Service, for example, but the warnings and errors were more than sufficient to tell me what the problem was and what I needed to do to fix it.

The first thing I did was add my Nagios system, because that is my primary source of alerts. Many of them require tuning to reduce false alerts, but that’s another story. Nagios is an interesting one, they can receive alerts by email but for systems like Nagios and Zabbix that can integrate with their notification system it works a little more directly. There is a queue that alerts are posted to, and a cron-job that runs every minute to flush the queue. If you prefer to trigger via e-mail, you can do so. With Nagios there is no reason you can’t have both – an alert that checks the queue is being cleared that sends via email, or an alert that checks mail is being processed that alerts via the queue tool. Once I got the Nagios tools working, I moved onto some of my other external monitoring tools that were only supported by email. Both tools were on the supported list and integrated perfectly.

Alerts in PagerDuty have three states, Triggered, Acknowledged, and Resolved. A new alert will land in the Triggered state and will trigger the Escalation Path associated with the service it came in from. This can be as simple or as convoluted as you want it to be. It might just notify you individually and stop, or it might start by notifying the Level 1 Schedule, wait 15 minutes, then notify the Level 1 Schedule along with the Level 2 schedule, then wait another 30 minutes before notifying the Management Schedule. As soon as someone clicks the “Acknowledge” button for an alert, the Escalation Path is stopped. There is an option (per-service) to time-out an acknowledged an alert, that is – if the alert remains “Acknowledged” for that time (default: 30 minutes) it falls back into the Triggered state and the Escalation Path starts again from the beginning.

For alerts from systems like Nagios that can also send the “OK” state to contacts, PagerDuty will automatically resolve the open alerts (triggered or acknowledged) and any escalations will stop for those also.

Now, for those who expect your alert tools to send everything and then be able to filter at the PD level will be disappointed. There are zero options in PagerDuty to determine whether an alert should be escalated differently or not – you need to have predefined this all. For example, if you have Nagios alerts that should escalate at a high-priority level and others that should escalate at a much lower priority, you need to set those up to escalate to two different contacts so that they can have two different services and two different escalation paths in PagerDuty. That means that it doesn’t really simplify the problem very much, it just moves it.

The next problem is one that will irk managers who want to define how escalations work at the individual level, and that is that each individual user can define how they are contacted. While there may be some API-based tools to update all users (I doubt it, but I haven’t looked), it is completely possible that your engineers have configured their profile to not be notified for 30 minutes by any method. Personally, I would like to see a way for a manager to enforce an individual escalation procedure (SMS and Email at 0 minutes, and a phone call at 15 minutes, for example). One of the handy tools is that you can configure a notification when going on or off call, and this can come by email or by SMS.

This problem is closely related to the user permissions levels. I’d be interested to see a table that shows the various tasks that can be performed, and which level is required to perform them. It seemed that most of the critical tasks performed by PagerDuty could be performed by a “User” where it would probably be beneficial to have these restricted to Admin. It may also be that PagerDuty have separated these where Team Leads who manage schedules are “Users”, their subordinates who just handle alerts are Limited Users and anyone dealing with anything Billing related is an Admin, but I was personally surprised by what a “User” could do compared with an “Admin.”

The final major issue I noted is that adding users is not the simplest task I’ve seen. An Admin or User must “invite” the user via email, and it’s possible to add users who haven’t accepted their invitations to the schedules. There didn’t seem to be an easy way to determine if a user has accepted their invitation and configured themselves.

PagerDuty Staff and Support

I haven’t contacted their support team, though I noted they don’t seem to have an easy way to contact them from any of the logged in pages and after hours their live-support was unavailable from the main site.

That said, every trial user is (understandably) a sales opportunity and their sales team are very aggressive in wanting to show you the ins and outs of their tool. After a week the email started, and I got three messages within three days from their staff wanting to further demonstrate the product. By that point I was well acquainted and failed to respond (sorry guys) but for my use cases there is nothing more I can do with the product than I have already.

The Numbers

PagerDuty is expensive, there is no question there. At $19.95/user/month (paid yearly, $24.95/user/month if paid monthly) it gets very expensive, very quickly, even for a small team. I estimate that if I pushed this to my employer, that would be the first question and it would be turned down very quickly as a result. Even if we only entered our Network, Systems and core managers (who weren’t already on either team) into the tool we would be looking at at least 9, maybe 10 users which means we’re looking at $199.50/mo just for a monitoring service, assuming we paid yearly – a bill of nearly $2400. At that cost, we’re better off paying one of our technicians to build the same system with PHP and in just a couple of months we’d be profiting.

Keep in mind, the $19.95 number is also for fairly basic service. It provides only 25 international alerts per month, so if you have a lot of overseas staff (outside the USA) then you’d best not need to escalate to them very often (it is possible to purchase more at $0.35 each). More worrying is that the $19.95 level doesn’t grant any SLA and only Email support. If you want more international alerts (100/mo then $0.35 each), phone support, or the OPTION (i.e. costs extra) for 24/7 phone support or an SLA, you’ll be paying $39.95/user/month (paid yearly, $49.95/user/month if paid monthly). As an added bonus, the higher level also allows for Single Sign On.

The Verdict

Unfortunately because of the cost alone, I can’t recommend PagerDuty and I think I’ll be terminating at the end of the trial. If we put cost aside, it’s a great tool with great potential that I would be more than happy to push at work or even keep for myself, but I just can’t get past that cost factor – it’s not quite that valuable.

Surviving a Provider Outage

Last year, EIG (or the Endurance International Group) suffered a major outage in one of their facilities in Utah, impacting a number of customers on their Bluehost, Hostgator and Hostmonster brands, possibly among others. Today they have been down for close to 6 hours and counting leaving customers with services ranging from a small shared hosting site for a family all the way up to dedicated server customers running business sites and services all offline with no resolution in sight.

So how do you, as a website owner or a service provider relying on other providers such as Bluehost being online, go about keeping your business-critical server online and functioning? The key is forward planning.

Finagle’s Law of Dynamic Negatives states that Anything that can go wrong, will—at the worst possible moment. Like your computer blue-screening right before you hit save (or because you hit save) and losing all of your work since the last time you saved. Or the power going out as the concert starts. You can imagine many, many more. Quite simply, if you are relying on absolutely anyone else to help provide your critical systems, chances are one of them will fail at some point and leave you stranded for a period of time.

So, forward planning. You know that something is going to fail so much like financial investments you need to diversify your business service portfolio, as it were.

Start with Reliable. Choosing your provider is important, it’s a crucial balance that you should probably reconsider as your business needs and abilities change. The saying goes – Good, Cheap, Reliable: Pick Two. You might be able to get a good host for cheap, but it won’t be reliable. If you want a Good, Reliable host, you’re going to have to pay a little more. In any case, do your research and ask around – don’t just pick one with flashy ads on Youtube.

Consider a Disaster Recovery (or DR) environment. We know that no matter who you choose as your primary provider, they’re going to have downtime. It might be for maintenance (in which case they should let you know ahead of time) or it might be due to an unexpected failure of some kind. Some are relatively minor and only impact one customer (like a part in your dedicated server fails) or a handful of customers (a switch or power distribution unit fails). If might be something massive like the core routers losing connectivity. Your business is critical, so it’s worth investing in an environment that your services fail over to when the primary is unavailable. It can be as complex as a full hardware and software replication of your production environment, or it may even share some of the load during regular hours. Or it might be as simple as a cheap virtual server, everything might run a little slower but it’s enough to help you ride out the storm and gives you somewhere to migrate your critical functions.

Backup, Backup, Backup. Maybe you can’t afford a DR environment, keep backups of everything. If your service provider went bankrupt and simply shuts down, or as we saw with Volume Drive last year just up and leaves their colocation provider and some servers just “go missing” – how will you move on? You need a backup of your system so that when you can select a new provider it becomes a relatively painless process of deploying your service again.

Service credits don’t cover the cost of your lost revenue, and just because they offer a 99.999% guarantee doesn’t mean they’ll spread that 0.001% across the calendar year. It’s a critical item that needs to be considered when planning your IT strategy. After all, anything that can fail, will. And probably at the worst possible time.

Foreman on Debian – Install issues

A few weeks ago I noticed that as a result of an upgrade, puppet-dashboard had removed itself from the system it ran on. No big deal, I wasn’t really using it and it put a number of more important packages on the removal list if I made the required changes to keep it. Since then I’ve had an error in my log every time a system polls puppet – a lot of errors when it’s multiple machines twice an hour.

Quick research suggests that Puppet Dashboard is no longer under development, and it should be replaced. Work is looking to install Foreman as a frontend for our Puppet install, so I figured what better tool to replace Puppet Dashboard with than Foreman, and give myself a jump start on the training. It also installs on Debian from their own repos.

The install itself was actually a little painful, and was surprisingly easy to fix once I got to the bottom of it. The problem was this, when I ran “apt-get install foreman”:

dpkg: error processing package foreman (--configure):
subprocess installed post-installation script returned error exit status 7

The process was failing during the postinst script while dpkg was configuring the package. This script is found in /var/lib/dpkg/info/<packagename>.postinst – in this case the package name was “foreman.” I set the package to debug mode and ran a dpkg –configure foreman. The last commands before error and terminate were thus:

+ cd /usr/share/foreman
+ [ -f Gemfile.lock ]
+ CMD=bundle install --path ./vendor/ --local --no-prune
+ [ ! -z ]
+ bundle install --path ./vendor/ --local --no-prune

So let’s run them manually:

root@kiwi:/usr/share/foreman# bundle install --path ./vendor/ --local --no-prune
Resolving dependencies...
Some gems seem to be missing from your vendor/cache directory.
Could not find gem 'safemode (~> 1.2) ruby' in the gems available on this machine.
root@kiwi:/usr/share/foreman#

Well then. Apparently we need to install the safemode gem!

gem install safemode
bundle update

A quick apt-get -f install, and foreman configured itself correctly. Now I’ll spend a few days trying to make it work on my install, and we’ll likely bring an update or two as time goes on with what I learn and so on.

Employee Satisfaction

I have this theory which roughly states that an employees loyalty to a company is directly proportional to their satisfaction level. This seems obvious, but there are many things which apply to this which aren’t often taken into account, because the number one thing we consider is remuneration – I think this is wrong.

Sure, how much money an employee receives for their work performed plays a large role in their satisfaction, but it is a far cry from being the only thing making an employee happy. In fact, my theory goes on to suggest that even if you are paying bottom dollar for an employee’s services, you can still retain that employees loyalty if you can keep him (or her) happy in other areas.

What do I mean, exactly?

Well, it’s simple. If an employee consistently feels like they are being discriminated against, that is a negative strike against their satisfaction level. If an employee feels like efforts are being made to include them in activities despite it being inconvenient, that is a positive mark against their satisfaction level. Paying them more is a positive mark, giving them a free lunch for being an employee is a positive mark, consistently failing to recognize when they pick up the pieces of their co-workers failed tasks is a negative mark.

As a relative newcomer to the IT industry, I’ve only worked for a handful of companies, but already I can see that the way a company treats it’s employees directly impacts the employees satisfaction and ultimately reflects in the turnover rate in a given department or position. More importantly, I’ve noticed that it isn’t always about how much money an employee is paid – I’ve seen instances where someone will leave and when asked if increasing their pay could be an incentive to stay, they’ll say it straight that no amount of money could convince them that staying was a good idea.

If you’re in a position of management, find out how your employees feel about being a member of the team and the company. If they’re not comfortable telling you that they are unhappy, you may need to start looking inward at yourself because it’s likely they don’t trust you’ll protect them if they are truly honest. And if they are unhappy, try to get to the bottom of what’s bothering them – maybe it’s a personal problem at home and they need a couple of days off, or it could be that they are overloaded with work and need a hand with things.

To sum this up:

1) Not all employees are happy, and just because they say they’re fine doesn’t mean they are.

2) Not every employee happiness problem can be solved by throwing money at it.

When was the last time you had a team building exercise? I don’t mean going out into the woods to swing on ropes, or taking them to paintball or laser tag, but just sat down and had dinner? Maybe it’s time to take the company on a picnic – not to talk about work, but to just exist with coworkers, to get to know each other over conversation that doesn’t relate to last weeks sales call or next weeks investor meeting. Who knows, you may make people just as happy with a couple of hundred dollars total as you would giving each of them a 10% pay increase. Now that’s a smart business decision.

RAID vs. Backup

Occasionally you hear the words “I don’t need backup, I have RAID!” or similar phrases. You may have even used them yourself once or twice. They are not the same. Though both are intended to maintain uptime, they perform two different functions which serve two different purposes. Both of them are defense mechanisms against disaster, but to view them as the same is wrong and will inevitably backfire.

RAID – Redundant Array of Inexpensive Disks

RAID is all about defending disaster in the Here and Now. RAID defends against disk failure and allows your system to continue running (albeit at reduced performance) until the disk is replaced and the array is rebuilt.

In the event that you suffer from data corruption, data loss or become the victim or a virus or malware, RAID does absolutely nothing.

Backup/Restore

Backups exist to provide historical record of what your system looked like at the time the backup was run. There are, at the core of the concept, two backup types: A full backup and an incremental backup. Most organizations will run a full backup regularly during quiet times (e.g. every week over the weekend) as they take a while and can tax resources, and incremental backups which can be less intensive to fill in the gaps (e.g. nightly). There are also other backup systems such as CDP which allow a full backup to be taken and then keeping track of all changes as they happen – they are out of scope for this text.

The purpose of a backup is to grant access to data that has been lost or changed and needs to be retrieved, whether it be a file, a directory, or an entire system. Backup is wonderful for restoring a system that has been compromised or had hardware failure necessitating a reinstall, but it does absolutely nothing to protect a system during the incident, it only helps to recover from disaster.

—-

So you see, RAID and Backup are not the same. Having one is good, but knowing why you have it is better, and having both is better still. And remember, a backup system is only as good as the last backup you tested. If you never test the backup, you’ll never know if it works. And a broken backup system is worse than having no backup at all.

Centralized Logging

We all know how important logs are, I use them regularly to find why the automated firewall blocked an IP address or to figure out an error in the apache config or the problem with the Asterisk phone tree. I also know that our network engineers use them to identify problems with switches or routers and even the guys using Windows use Event Viewer to audit system logins or investigate system crashes.

There are two key issues that arise when trying to use local logs (logs on the system, in Event Viewer or in /var/log, etc), and they’re such a critical part of the investigation process.

  1. When using logs to investigate a compromised machine, the logs themselves are almost entirely untrustworthy as there is very little stopping an attacker from modifying the logs and removing traces of themselves.
  2. If the logs are stored in volatile memory or the system is designed to erase/overwrite logs on reboot (common on network equipment) then the logs are forever lost in the event that the system restarts either as part of the problem, resolution, or afterwards but before the logs can be retrieved for use in investigation.

There are only a small number of solutions to these issues, and the most common is to use a Centralized Logging System. If you have more than 2-3 systems then immediately you can begin reaping the benefits of central logging. Event Viewer for Windows has, since Vista/2008 had the ability to forward events included in its core functionality. Prior to that, in XP and 2003 there was an Add-in available that granted this functionality. From there it should be a fairly simple configuration to set up a single server and point the clients at it with their events. On Linux there are a number of tools, one of the more common is rsyslog which can both accept and transmit log entries. There are also services available such as Loggly or Splunk which offer both free and paid tiers of service to store logs on your behalf on their remote side, and downloadable tools such as Graylog which will act as a receiver for syslog connections and allow searching and statistics to be run, or alert emails to be sent when specific log entries are encountered. There are countless other tools or you can even write your own.

The key is to ensure your logging server is as secure as possible, as well as reliable — if the log server fails it can’t receive logs, so in the event that your log server is down and an event occurs, you’ll need to be able to grab the logs from the device without relying on the central system. At the same time, security is a concern, if an intruder can reach the logging server then suddenly none of your logs are safe.

They have already been mentioned above, but some other useful tasks that can be made easier with a centralized system can include:

  • Triggering alerts based on log entries. Do you have a Known Issue that happens occasionally but isn’t often enough or critical enough to get fixed, but still warrants immediate attention? Set up an alert so that if it occurs on any of your systems, you can be alerted to it as soon as possible.
  • Statistics and other data. Do you suspect that your trend of bounced emails is going up? Maybe you want to be able to gather data across all of your web servers for page hits for the last month. With all of your logs being gathered and forwarded to a central location, you have greater power to run various analysis on them to mine data or gather stats on your environments. They should all still be properly tagged with the machine they came from, so extracting the relevant data remains easy enough to do while still keeping them together. I also use a system called ‘logcheck’ which pulls all of my logs for the previous hour, drops lines which match regex expressions (for the lines I know are there but really don’t care about) and then sends the results in an email – by centralizing my logs I can get a single email for all of the systems and I also only need to maintain one set of ignore files.
  • Verifying data integrity on the hosts themselves. If you’ve had a break-in and you’re concerned the logs have been tampered with then you can not only use the central log to find what is missing, you can run a diff between the log on the central server and the log on the host itself to see exactly what is different between them.

From here, it’s on you to figure out how to set up remote logging for your environment. I recommend rsyslogd for Linux and similar systems, or you can read up on centralizing the Event Viewer for Windows. Either way, I can hardly recommend centralized logging enough.