Linus being Linus

Linus Torvalds is a name that is well known within many tech communities. As the creator and maintainer of the Linux Kernel, he is revered by many users oblivious to his behavior. Behavior that is certainly effective, but also frequently highlighted as unprofessional and unnecessarily abusive towards volunteers whose only interest is to help continue making Linux better.

Which is why I rolled my eyes somewhat when I saw a new mailing response from The Man Himself calling out one of his developers for only considering a subset of cases, the ones escalated, without regard for all of the other cases that are not escalated, not even reported as problems, because the software or the process did exactly what it was supposed to do.

As an admin, I saw one section of this response and wanted to highlight it because, whether or not it’s useful in context, it is very relevant to the work that those of us in service positions do (sysadmins or any other position, technical or non-technical). It is this:

So you’re making that statement without taking into account all the cases that you don’t see, and that you don’t care about, because the [insert something] has already handled them for you

Linus Torvalds [ https://lkml.org/lkml/2019/6/13/1892 ]

It was something I realized far too late in my Front-Line Support career: the people who are calling for help, or looking for your assistance, have either been inadequately informed of your repositories for self-help, or couldn’t understand or follow them for some reason. They are a poor representation of your entire support base, which also includes 1) the people who didn’t have any problems at all, and 2) the people who did have problems but were able to solve them without calling for help. And in the case of the people who couldn’t find your documentation, the easiest solution is to show them how to find it!

Next time you’re on the phone with that one annoying customer who just can’t figure out how to do whatever simple task (that is part of his regular job duties) and you’re helping him for what feels like the fiftieth time this week, consider two things:

  1. What improvements you might be able to make in terms of providing documentation to the user so that they don’t need to be walked through it yet again (really, if it truly is the fiftieth time, you should have been writing the steps in a document as you went, and firing off copies to the user and their manager).
  2. How many people follow this same process, and how often, without calling you because they were able to do it without the additional help?

If your sample size is idiots, it’s easy to see why you might think everyone in the world is an idiot. It’s a valid assumption given the data, but it’s also an unfair assumption against the rest of the population.

False Alert Frustration

For those of us who endure the regular on-call schedule, one thing is more frustrating than anything else: the so-called “False Alert.” It comes at the most inconvenient of hours, it requires getting things out of bags and connecting and the whole rigmarole of engaging, and then you find that the server isn’t really down, or that the router memory usage has dropped back below threshold, or some other thing that wasn’t worth the disruption.

This is almost immediately followed by marking the ticket as resolved and leaving a note that it was a FALSE ALERT before closing the laptop and returning to your day. Or night. Or whatever.

I know – I’ve been there. I’ve been that person ignoring alarms because I know better than the monitoring system. I was wrong, I have seen the light.

It undoubtedly relates to being the dedicated administrator for a monitoring tool in a large organization, but I’ve seen the effects of such ignorant responses, and my thinking has evolved. The duty of a monitoring tool is to tell you that there is a problem, and there are three accepted paths away from an alarm that it generates:

1: Fix the fault condition

This one is obvious. Something broke, it generates an alert, you do the needful and fix it. Alarm clears, you close the ticket.

2: Fix the monitor

Sometimes our monitors are overzealous. Maybe you have an application that manages its own memory, and so it grabs 90% and does its thing. If the monitoring tool is set for a 90% threshold, you’re going to have a bad time – just adjust the threshold! Or fix the app configuration to only use 85%.

Overly sensitive monitors lead to desensitization of the technicians and administrators who have to respond, which also means that when things are truly broken, they may just ignore it.
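
As a minimal sketch of "fix the monitor", assume a hypothetical monitoring setup where the memory threshold can be overridden per host. The host names, the numbers, and the function name below are invented for illustration and do not come from any particular tool.

```python
# Hypothetical per-host overrides; hosts not listed use the default.
DEFAULT_MEMORY_THRESHOLD = 90.0  # percent
THRESHOLD_OVERRIDES = {
    # This app manages its own memory and normally sits at about 90%.
    "db01.example.com": 95.0,
}


def memory_alert(host: str, used_percent: float) -> bool:
    """Return True when memory usage is at or above the host's threshold."""
    threshold = THRESHOLD_OVERRIDES.get(host, DEFAULT_MEMORY_THRESHOLD)
    return used_percent >= threshold


if __name__ == "__main__":
    print(memory_alert("web01.example.com", 91.0))  # default threshold applies
    print(memory_alert("db01.example.com", 91.0))   # override applies
```

The point is that the exception lives in the monitor's configuration, not in the head of whoever is on call that night.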

3: Pass the buck

This is the one that I see a lot of. Some of it falls under the point above, but some tools don’t work that way or are too complicated to set up for all scenarios. It happens when something in the path between monitor and monitored breaks. For example, if a switch breaks and I can’t ping the servers, I generate alerts for all the servers being down. The server admins get frustrated because it’s clearly a network problem, and close their tickets. The same happens when website monitors trigger because the single sign-on (SSO) tool breaks.

THIS IS NOT A FALSE ALERT. It is not fake news. It’s just not your problem. You do need to make sure that it gets escalated to the right people. And, if possible, push your monitoring people to add dependencies so you don’t get disturbed again.
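
Here is a rough sketch of what "adding dependencies" could look like, assuming a hypothetical topology map where each monitored device names the device it sits behind. The device names and the function are made up; real monitoring tools each have their own way of expressing this.

```python
from typing import Optional

# Hypothetical topology: each device maps to the device it depends on.
PARENTS: dict[str, Optional[str]] = {
    "web01": "switch-a",
    "web02": "switch-a",
    "db01": "switch-b",
    "switch-a": None,
    "switch-b": None,
}


def root_causes(down: set[str], parents: dict[str, Optional[str]]) -> set[str]:
    """Keep only the down devices whose parent is not also down."""
    return {device for device in down if parents.get(device) not in down}


if __name__ == "__main__":
    # The switch and both servers behind it look down; only the switch alerts.
    print(root_causes({"web01", "web02", "switch-a"}, PARENTS))
    # A server that is down on its own still alerts.
    print(root_causes({"db01"}, PARENTS))
```

With something like this in place, the network team gets one alert for the switch and the server admins are only paged when a server is down behind a healthy switch.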


False alerts are different. They occur for different reasons that should be rare. Like you accidentally blacklisted the monitoring platform and it couldn’t ping your devices. Or the platform itself failed somehow.

In any case, an alert should always be a call to action – just not always the action that it immediately indicates.

Simple Mail Testing

Insert a witty comment about how it’s been so long since my last post.

In my travels, I came across a handy file which I felt was share-worthy.

Have you ever had a need to quickly test an SMTP connection via telnet or netcat? I have. If you don’t have a file like this already, feel free to steal it. Save it as ‘smtprelaymessage’:

EHLO smtp.example.com
MAIL From:<from@example.com>
RCPT To:<to@example.com>
DATA
Subject: Mail Relay Test

This is a test message.

Command is:

cat smtprelaymessage | telnet localhost 25

In the above example the message is a prepared smtpmail message

Signed,

The Kiwi
.
quit

The command to run is embedded in the message itself, so if you ever forget how to make it work just open the file and remember.

Now, obviously you need to decide whether to default to ‘telnet’ or ‘nc’, as not all OSes have both anymore. You’ll also need to update the top three lines with more useful data for your testing.

But this should be a good stepping stone to easy mail testing, if it is of any use to you.
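
If neither telnet nor nc is available, the same test can be sketched with Python's standard smtplib. This is an alternative to the file above, not part of it; the host, port, and addresses are the same placeholders and need replacing, and the two function names are my own.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender: str, recipient: str) -> EmailMessage:
    """Build the same kind of short relay test message as the file above."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Mail Relay Test"
    msg.set_content("This is a test message.\n")
    return msg


def send_test_message(host: str, port: int, msg: EmailMessage) -> None:
    """Send msg through the given relay, echoing the SMTP conversation."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.set_debuglevel(1)  # print the protocol exchange to stderr
        smtp.send_message(msg)


if __name__ == "__main__":
    message = build_test_message("from@example.com", "to@example.com")
    send_test_message("localhost", 25, message)
```

The debug output shows each command and the server's reply, which is most of what you would be reading in the telnet session anyway.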

Get Training Done Right

Training people on new concepts is difficult. This is not a new idea.

Employers could save significant sums of money, and also keep employee morale up if they did a better job of understanding and accounting for this.

Training should be relatable

When you train someone on a new concept, it is critical that they be able to relate what they’re learning to something that they already know something about. For example, when you teach someone to drive, it may be their first time behind the wheel, but it is rarely their first time in a car, or on the road. They already have a basic understanding of how a car works simply through observation of other people driving. You’re not starting from the basic question of — what is a road? Why do we have roads? What is the purpose of a car? Etc.

Too often, people are pushed into training sessions on topics that they can’t easily relate to. Consider a product that you want someone to learn about: it becomes much easier for them if they have some background on a) what the product is, and b) what the product does. Ideally you would also be able to grant c) some time with the product before training on it, so that they can experiment for themselves and learn.

Training should be structured

Even loosely. When you’re educating someone on something, it becomes far easier if there is an achievement expectation that you’re working towards. By the end of this training session, my students should know A, B, C. You might not have slides. You might not have demonstrations. You might have a lot of tangential discussions. But by the end of the training, if your students understand and know A, B, C, you can be considered successful.

Everything else, the slides, the demos, the limiting of unrelated discussion, all of that is secondary, albeit potentially critical to the success of the program. In a large group, or in formal training, slides are almost a necessity to ensure good time management. They’re also great for learners who operate better when they can read things. Demonstrations can be critical to explaining an idea that isn’t easily explained with just words. But slides and demos are meaningless when the training that they are supporting feels like ramblings of an overly excited or unnecessarily jaded trainer.

Training should be interesting

A lot of the stuff we train or get trained on is boring. So do what you can to make it interesting, and hold as much of the audience’s attention as possible. If you deal with groups, focus on the specific needs of a couple of people who are into the idea, or who would benefit most from a specific example. Not always possible, but it helps. If you need to use examples, try to tailor them to your audience whether it is specific or generic. Maybe you can make it specific to my company, or you might understand that most of the people you train are geeky or nerdy, and so you might use a Star Wars or Star Trek type reference.

I should want to pay attention, not just because I feel guilty for wasting my time and my company’s money, but because I want to learn what you’re teaching.

Training for as many as possible

When you’re training a group of people, or lots of people on the same topic/ideas, there is a simple truth that is often overlooked. There are multiple ways that people learn, and most of the human race only fits into a couple of those buckets. Some learn best by reading, and struggle to retain knowledge when it is spoken. Some learn best by hearing, and struggle to concentrate when expected to read large amounts of information. Some learn best by being given instructions and then being allowed to try it themselves. Others learn better by trying it first, and then being offered instruction and correction.

If you’re dealing with lots of people, it is in your, and their, best interest to put together a training program that covers as many of these as possible: slides that include text (long form or summary) of what you intend to say, demonstrations (on video or in person) that convey complex ideas, and interactive labs (where participants can try doing things themselves, with you or other SMEs on hand to confirm things are right, or offer guidance/correction as necessary).

Training should request feedback

How a training session went is a very subjective topic. Some students will walk away empowered, with new ideas and understandings. Some will walk away bewildered. Others will leave glassy-eyed, trying to understand what just happened.

If the situation is right, a good trainer should hand out surveys near the end of their class and solicit feedback for their own benefit. Accept the criticism and let it mould how you train the next class. Accept the praise and let it do the same. Take both with a grain of salt, but if you’re consistently seeing complaints about heavy focus in one area, or lack of clarity in another, it might be time to re-consider how you teach those. Not necessarily changing the content, but including explanations of why things are done the way they are. Or changing the content.

I’ve been through a number of training sessions, of different sizes, formats, topics. This is what I’ve learned. Your personal mileage may vary. At least think about it.

Setting up New Employees to Fail

Some readers of my ramblings are IT managers, some are lower level IT staff, some stumble upon them from Google or elsewhere. Just about all of you have been, or will at some point be, involved in the hiring and onboarding of a new employee, whether it is as their manager, their colleague, or their underling. This also applies outside IT to an extent.

There are several things that are critical to the success of any new employee that must be communicated or handled early in the onboarding process. Sometimes these are handled all by one or two people, sometimes they are handled by multiple people, across multiple teams. Some of them are obvious, some of them are less so.

  1. “HR Junk” – All of the necessary HR items. Your tax forms. Your non-disclosures. Your direct deposits. Your benefits enrollments. The associated education of how employees submit time off requests (and broadly what that process looks like), how they file benefits claims if needed, how they see open positions and submit resumes (their own or their friends/family) for consideration. How to report abuses, and who to ask questions about organizational concerns.
  2. Job basics. If an employee needs to be getting themselves set up, they need to know what the hardware and software requirements are. Do they need to request a phone and extension? Do they need to request licensing for a piece of software? Ideally this is all handled behind the scenes before day one, but I’ve worked at small (30 people) companies and medium (300 people) companies where this wasn’t always the case.
  3. Management. If an employee has questions, who do they ask for help? If there are multiple people, who do they ask for what, or in what order?
  4. Getting work. Even if an employee is supposed to be self-driven, they still need to get that work assigned or picked up somehow. The employee needs to know how that process works, and how they’re expected to do it.

Far too often, one or more of these items is either not handled at all, or is handled poorly. I firmly believe that there is a point where there is too much formalization of training, and not enough space given for ad-hoc Q&A “I don’t understand” type discussion. But there is also the other extreme, where the onboarding process is not nearly formal enough, where everything is handled by the employee asking questions or the manager thinking about something and realizing they need to educate on it. The worst part of this is that many new employees are in a situation where they don’t know what they don’t know, so they don’t think to ask questions because they didn’t consider it to be a problem. Even experienced employees may not consider that they need access to organization-specific things if they weren’t aware of them. I need a ticket system, a documentation resource, a code repository, and an HR portal. Oh, there are two repositories? There are two ticketing systems? When do I use them, and what do I use each one for? Do I have access to both?

Another risk is the situation where onboarding processes are done by multiple people. This risk is managed by the managers having a defined process, or at least some intercommunication. When they are all very hands-off in their approach, you have a recipe for disaster. Recently I reached out to my two managers and asked what was going on, because I was very much in the dark about my situation. Both of them responded in surprise – they thought I was slammed in one way or another. Then it became a cluster of rushed direction changes in order to make this resource productive.

This leads back to the idea that everything should either be a process, or have a process. It doesn’t need to be complex, but it does need to be flexible. Checklists are usually fine, with a couple of well-educated resources who can answer any questions. If you’re hiring for a new position, check over the checklists and make sure everything is listed that needs to be touched during the first couple of weeks. Even if you’re hiring for existing positions, it’s worth a moment to check that nothing significant has changed since your last person was brought into that position.

And as with just about everything else, the key to solving anything is communication. If you’re a new employee and you’re confused about something, ask. If you’re not sure who to ask, ask the person who sits next to you, or at least ask them who you should ask. If you’re a manager, talk to your employees — not just the new ones, but set aside extra time for them. Talk with them and learn where they’re at, what they need, figure out what they haven’t learned because they didn’t realize they needed to know it. If there are going to be issues or delays, let them know. If there is a training schedule, outline it for them. Set (realistic) expectations and then meet them as best you can.

Logitech Modifier Keys for Mac

I started a new job last month, and they issued me a MacBook Pro — the other choice was a smaller Windows laptop, and since I prefer the *nix environment I chose the Mac. I could make plenty of arguments or even just observations, since it’s been close to ten years since I last used a Mac on any kind of regular basis, but I won’t.

Instead I’ll offer a quick tip for those of you who happen to use a Mac, who happen to like Chrome, and who also happen to use a Logitech Keyboard-and-Mouse combo, like I do. I know, there aren’t many of us.

For the last few weeks I’ve been frustrated at the lack of back/forward functionality in Chrome. I use it all the time on my Windows laptop at home, and it’s annoying that it didn’t work here. Finally I did some searching, and a few suggestions were to change the action in the Logitech Control Center from “Back/Forward” to “Keystroke.” Simple enough, I thought: change the “key” to a ‘[’ or ‘]’ (back and forward in Chrome, respectively), and add the ‘Command’ modifier. Piece of cake.

Except it didn’t work. It didn’t work at all. It took a few minutes, but I realized I had made changes when I installed my keyboard. Because it’s a Windows keyboard on a Mac, the “Alt” key is in the wrong position. I’ve been through this before using external keyboards on my old personal Mac, and I found it handy to re-map the keys so that they are the same whether I’m using the laptop as a desktop (with extra keyboard) or as a laptop, with the built-in one. So I remap alt/option as command, and command as alt/option.

Some of you reading this word-for-word will be a step ahead. By remapping the keyboard functions, the keyboard keys that the mouse was ‘virtually’ pressing were now wrong. By configuring the “keystroke” to be Option+[ and Option+] the mouse now works correctly.
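
For anyone who wasn't a step ahead, here is a toy model of what was going on. It assumes, as I concluded above, that the modifier swap also applies to the keystrokes the mouse simulates; the dictionary and function are purely illustrative and are not how macOS or the Logitech software represent any of this.

```python
# The modifier swap described above: Option and Command trade places.
MODIFIER_REMAP = {"command": "option", "option": "command"}


def effective_shortcut(modifier: str, key: str) -> str:
    """Return the shortcut the application sees after the modifier swap."""
    return f"{MODIFIER_REMAP.get(modifier, modifier)}+{key}"


if __name__ == "__main__":
    # What I configured first: Chrome receives Option+[, which is not "back".
    print(effective_shortcut("command", "["))
    # What works: the mouse sends Option+[, and Chrome receives Command+[.
    print(effective_shortcut("option", "["))
```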

The Times, They Are a-Changin’

“Changing was necessary. Change was right. He was all in favour of change. What he was dead against was things not staying the same.” — Masklin (Terry Pratchett, Diggers)

Today marks the end of a moment. The end of an era. The end of my employment with an organization that has seen much change itself in recent years.

I came to this company nearly five years ago, almost as an act of desperation. Where I was previously was clearly failing, and couldn’t (or wouldn’t) offer me what I was looking for as the next step on my career path — a move out of SDQA and into IT/Systems. In hindsight, it turned out to be the right choice. My old team was laid off within six months, and only a subset of the people I worked with are still with the company.

I came to this company with a 100 mile, each way, commute. For the first six months of my employment I would wake up, spend two hours on the road, work for eight hours, spend another two hours on the road and then sleep until it was time to do it again. But it was worth it — I enjoyed the work I was doing, I liked the people I was doing the work for, and I enjoyed the company of the people I was doing the work with.

This job saw me purchase four cars, move house once, and expand the family by one baby. This job saw my salary increase to more than twice what I started at. This job saw multiple levels of promotion, far beyond what I had hoped when the journey began.

Today marks the end of this moment. The end of this era.

I’ve backed up my critical data, I’ve cleaned my laptop and my desktop, I’ve given away the projects that needed to be transitioned and also the personal items that I’ve chosen not to take with me.

I’ve shared my secrets, I’ve said my goodbyes, I’ve wiped away a couple of tears, and walked out the door with my access badges on my empty desk.

I’ve marked the end of the moment, the end of the era. The sun will soon set beyond the horizon, and I’ll be celebrating with my family at a nearby restaurant.

Soon, Monday will come. The sun will rise, and I’ll drive to a new location. I’ll quell the butterflies as best I can, and I’ll meet a new team of co-workers. I’ll struggle (but eventually succeed) in learning their names.

I will have a brand new set of challenges, and I will rise to meet them.

It will mark the beginning of a new moment. The beginning of a new, and hopefully just as long-lasting era.

I’ll try to keep you all in the loop.

Re-Architecting from the Ground Up

Well, nearly the ground.

I think one of the most enjoyable experiences in my line of work is when I get to design and build an environment from almost nothing, with a handful of requirements and design constraints. We do it with some regularity for customers as part of our support agreements, and I’ve now had the joy of doing this twice for my employer – once as the result of an audit/upgrade process, and once as the result of a merger and associated consolidation effort.

The thing is, there are so many roles involved in operating and maintaining today’s corporate environments, and each one needs to have a decision made, either directly or by association, about what tool or policy will fill that role.

Take User Authentication as an example. There are choices to make, which partly depend on whether your logins are on Windows, Linux, Mac, or something else again. Perhaps the three most common choices for filling this role are 1) Active Directory, 2) LDAP, 3) Local Authentication. For tiny organizations, #3 is quite acceptable. If the organization wants central authentication, then the existing skillset and projected direction of the company will typically dictate whether #1 or #2 is chosen, each having their relative merits for any given situation.

When I first started with my current company, the “corporate standard” (and I use the term loosely) was CentOS 5 (latest available at time of install), with centralized LDAP for authentication and authorization to backend SSH logins as well as several HTTP Basic Auth sites that we use(d). There was no configuration management whatsoever, and all patching was undertaken manually. There was manually configured Nagios monitoring, but there wasn’t even a complete list of systems that the group managed. All of the switches and routers were managed by a separate networking team who configured everything with local authentication, with individual user accounts on each device. They at least had a more accurate list of network devices that they managed.

The second iteration that I saw, and later became a part of the continued development for, was an improvement over the previous. We were using CentOS 6 (latest available). At the beginning it was still centralized LDAP without any patch management, but we were using Puppet to distribute and manage configurations. Once I was involved we took three drastic measures (not all my choice, in fairness): First, we implemented Spacewalk (2.3, from memory) to handle system patch management. Second, we automated the systems monitoring by integrating Nagios with Puppet. And third, we implemented FreeIPA as a replacement for LDAP. Patching systems became a breeze instead of a nightmare. Monitoring was as simple as adding a system to Puppet and running the agent; we were operating Puppet on 30 minute checkins, so service checks appearing in Nagios should take no more than an hour. Adding services to be checked by Nagios also meant only a few lines of code for the Puppet manifests, and once pushed into the Production server it would be updated into Nagios within the hour. FreeIPA made our centralized auth environment simpler, and while our implementation had issues and was far from perfect, it made administration of users and groups much easier overall.
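
To give a feel for why generated monitoring beats hand-edited monitoring, here is a small sketch of the idea in Python. To be clear, this is not our Puppet code: it is a stand-in that renders Nagios service definitions from an inventory, and the inventory contents, the `generic-service` template name, and the check command names are assumptions borrowed from typical sample configurations.

```python
# Hypothetical inventory: host name -> list of (description, check command).
INVENTORY = {
    "web01": [("HTTP", "check_http"), ("SSH", "check_ssh")],
    "db01": [("SSH", "check_ssh")],
}


def nagios_service(host: str, description: str, check_command: str) -> str:
    """Render one Nagios service object definition as text."""
    return (
        "define service {\n"
        "    use                  generic-service\n"
        f"    host_name            {host}\n"
        f"    service_description  {description}\n"
        f"    check_command        {check_command}\n"
        "}\n"
    )


def render_inventory(inventory: dict[str, list[tuple[str, str]]]) -> str:
    """Render every service for every host, in a stable order."""
    blocks = []
    for host in sorted(inventory):
        for description, command in inventory[host]:
            blocks.append(nagios_service(host, description, command))
    return "\n".join(blocks)


if __name__ == "__main__":
    print(render_inventory(INVENTORY))
```

Once the definitions are derived from the list of managed systems, a new host cannot be added to one without appearing in the other, which is exactly what the manual setup could not guarantee.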

Now, we’re in the middle of doing it all again. The new environment is approximately thus:

  • User authentication is handled by Active Directory. Old FreeIPA and LDAP environments will be deprecated and decommissioned. (We considered keeping FreeIPA and creating a trust, and decided that at our size it was more complexity than was warranted. Most of the FreeIPA tasks that can’t be handled easily by Active Directory will be handed back to Puppet).
  • System standard is for CentOS 7. (We like CentOS around here, OK?)
  • Configuration management is still Puppet, though now being handled by The Foreman (vs. manually, as it was before). We really like puppet — it works well for us.
  • Patch management is handled by the Katello plugin for The Foreman. (Spacewalk has been deprecated, and there are a few things about it that we just didn’t like or use so there was little-to-no point in keeping it around aside from its ability to patch things. Katello and The Foreman replace many of the functions that Spacewalk did or could have done, and do it ten times better as well).

Like I said in my last post, everything good must end. I remember when I built my first environment backend management system (something like Puppet + Spacewalk) and naively thought I’d never have to build another design again, that we’d just keep updating the one we’d chosen. No dice. Every opportunity one has to rebuild the environment should be taken: approach it carefully, take into account what was learned in the last round, understand what was done wrong and what was done right, and what can be improved upon with changes in technology and products available.

I, for one, enjoy the challenge.

Everything Good Must End

At one point it was scribed…

Progress is impossible without change, and those who cannot change their minds cannot change anything.

It’s true. Everything that begins must one day end; it is necessary for all progress.

It’s not always big things, though this is true. Lives end, marriages end (either by divorce or death), jobs end (either by retirement, resignation, firing, incapacitation…). Smaller things end also — processes for minor tasks are adjusted over time, and in a way the old process ends while the new process begins in its place.

I’m constantly reminded, and also constantly reminding others, especially when it comes to technology and its relationship with our business, that nothing is forever. Decisions that were made even a year ago are eligible for review and adjustment – the facts that decisions were based on five years ago are almost certainly not true today. Things we build now may, through no fault of our own, no longer be useful in a few short years’ time. If we are doing our jobs right, if we hold the right mindset, this won’t be a problem.

The nature of technology is constantly changing. While the core principles may remain for years at a time – e-mail comes to mind, it’s a technology that hasn’t significantly changed in the last thirty years – the surrounding architecture is constantly moving and developing. We added authentication, encryption, validation, scanning, HTML, and an array of other features. Websites have developed from simple text and hyperlinks to including images, animations, all kinds of client-side dynamic content. We’ve been through eras of Shockwave and Flash, through Java and JavaScript, and finally moving into HTML5.

It’s hard to let go when the time comes to move on, but letting go is what must be done. Very often we are able to take lessons from what’s been done before when building the next iteration, but that doesn’t always make the transition easier. CentOS 7 has changed significantly from CentOS 6, and it’s forcing habits to be broken. Systemd is a bit of a learning curve, though it’s one I embrace. Likewise, Windows Server 2012 was a change from Server 2008, and I expect the next iteration to do the same. 90% may be the same, but it’s always that 10% that throws you for a loop.

It’s Always DNS..

Also, Rule 39 — there’s no such thing as a coincidence (there is, but shush!)

A few weeks ago we had a power outage in one of our larger colocation facilities hosting customer racks. Because (legacy, cost) we have a number of servers providing some core services that don’t have redundant power supplies, and so a couple of these went down with everything else. Ever since, we’d been having serious performance issues with one of those nodes.

Some quick background: At every facility we provide service in, we have at least a couple of servers which exist as hypervisors. Back in the day we were using OpenVZ to operate separate containers, then we were using KVM-based VMs, and now we’re in the process of moving over to VMware. In each iteration, the principle is the same: Provide services within the datacenter that are either not ideal, or not recommended, to traverse the public internet.

For example, we host a speed test server so that we can have our clients test their connectivity within the datacenter, to the next closest facility, or across the country. We have a syslog proxy server which takes plaintext input and sends it back to a central logging server via an encrypted link. We have a couple of DNS servers which we offer to our clients for local DNS resolvers. We have a monitoring system that reaches out to local devices via SNMP (often unencrypted) and reports back to the cloud-based monitoring tool.

Ever since the power outage, we noticed that the monitoring server was a lot more sluggish than usual, and reporting itself as down multiple times per day. I did some digging, noticed a failure on one of the drives, had it replaced — that didn’t go well. Ended up taking the entire hypervisor down and rebuilding it and its VMs (these hosts don’t have any kind of centralized storage). No big, I thought. This won’t take long, I thought.

The process was going a lot slower than usual, and I didn’t put nearly enough thought into why that might be the case. I left the hypervisor building overnight and came back in with a fresh head. I was just settling in when I noticed a flood of alerts arriving from hosts that they weren’t able to reach Puppet. Odd, I thought, the box is fine! Memory is good, CPU is good, I/O Wait time (which is notoriously poor on this hardware) was fine. I set about monitoring — puppet runs were taking forever. Our puppet master typically compiles the manifest in 3-4 seconds, tops. It was taking 20-30s per host — that’s not sustainable with nearly 200 servers checking in.
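
The "not sustainable" part is simple arithmetic. As a back-of-the-envelope sketch, assume the catalogs are compiled one after another and that agents check in on Puppet's default 30-minute run interval; both are simplifying assumptions, not measurements from that environment.

```python
def total_compile_seconds(hosts: int, seconds_per_host: float) -> float:
    """Time to compile one catalog per host, one after another."""
    return hosts * seconds_per_host


HOSTS = 200                 # "nearly 200 servers"
CHECKIN_INTERVAL = 30 * 60  # assumed 30-minute agent interval, in seconds

normal = total_compile_seconds(HOSTS, 4)  # upper end of the usual 3-4 s
slow = total_compile_seconds(HOSTS, 25)   # middle of the observed 20-30 s

print(normal, normal <= CHECKIN_INTERVAL)
print(slow, slow <= CHECKIN_INTERVAL)
```

At 4 seconds each, 200 hosts need 800 seconds of compile time and fit comfortably inside the 1,800-second window. At 25 seconds each they need 5,000 seconds, so under these assumptions the master can never catch up before the next round of checkins arrives.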

What’s embarrassing for me is that it wasn’t until I had a customer ticket come in that they were getting slow DNS lookups that I realized exactly what was happening — the hypervisor I was rebuilding, the one I had consciously deleted configs for “dns1” on, was the culprit.

A quick modification to Puppet Master’s resolv.conf, and also a temporary update to the Puppet-configured resolv.conf that we push out, reversing those internal nameservers, and everything started to clear out. Puppet runs dropped back down to single digits, customers reported everything was fine.
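
The reason the order matters is that a stub resolver typically tries the nameservers in the order they are listed, and waits out a timeout on a dead one before moving to the next. A minimal sketch of the reordering, assuming a hypothetical resolv.conf and documentation-range addresses (the function name and sample content are mine, not our actual Puppet template):

```python
def demote_nameserver(resolv_conf: str, dead_ip: str) -> str:
    """Move the nameserver line for dead_ip to the end of the file."""
    lines = resolv_conf.splitlines()
    target = ["nameserver", dead_ip]
    dead = [line for line in lines if line.split() == target]
    kept = [line for line in lines if line.split() != target]
    return "\n".join(kept + dead) + "\n"


# Hypothetical file contents; 192.0.2.10 stands in for the rebuilt "dns1".
SAMPLE = (
    "search example.com\n"
    "nameserver 192.0.2.10\n"
    "nameserver 192.0.2.11\n"
)

if __name__ == "__main__":
    print(demote_nameserver(SAMPLE, "192.0.2.10"), end="")
```

The dead server stays in the file, so nothing needs changing again once it is rebuilt, but every lookup no longer pays for its timeout first.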

Of course it was DNS. It’s always DNS.