I’m typically not one to believe in coincidence, but I’m also a fan of the idea that correlation doesn’t necessarily prove causation, while not disproving it either.
We run a large number of Tripp Lite PDUs (power distribution units) in customer cabinets in our facility. They are largely how we determine power usage and bill for power overages (using SNMP cards, a private VLAN and a set of monitoring systems). It’s rare that we have issues with them, but just the other day we suddenly started getting multiple communication alerts for one PDU in particular.
I saw it and noted it, verified everything attached was online, and left it alone until I had a few minutes to go and reseat the web card. I didn’t think about it for another hour until it tripped again, and then again, and then again. Several times it came back online on its own without us going to reseat the card. There seemed to be no rhyme or reason as to why it was dropping or when. This was late on a Friday, so we decided to ride it out through the weekend and let the provisioning team know so it could be handled on Monday.
Today (Sunday) I came in and saw a few alerts for the same PDU, and then noticed one for another PDU in a different cab with the same issue. These two have very little in common: they’re the same type of PDU, but they’re in different cabinets in different rows, attached to two different switches. I checked the alert history and noticed the second one had done the same thing 4-5 times in the previous 72 hours. It looked like both had failing web cards, but it seemed odd that they would fail together, especially separated as they are. At this point, our provisioning engineer who works Sundays was already investigating the first one, so I added the second PDU to his notes to take a look.
To cut an even longer story short, it was determined that at some point in the past, one of the two cards had crapped itself; it was no longer showing the serial number in the web interface and it had reset its MAC address to “40:66:00:00:00:00”. This wasn’t a major issue: it was still responsive and everything stayed happy. Until one day (earlier this week, presumably) the other card in the other PDU did the same thing: no serial and MAC address 40:66:00:00:00:00. Now we had a MAC address conflict on the VLAN, and suddenly the two cards began interfering with each other. Once this was determined, we pulled one of the cards. The alerts have been quiet for several hours, and pinging the remaining card shows no packet loss over more than 6000 pings.
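For what it’s worth, a conflict like this is easy to spot once you think to look for it. Here is a minimal Python sketch of the kind of check that would have flagged it, assuming you can get a list of device names and MAC addresses out of your monitoring system or an ARP table. The device names, the third MAC address and the function names are all made up for illustration; this is not something we actually run.

```python
from collections import defaultdict


def normalize_mac(mac: str) -> str:
    """Lowercase a MAC and use ':' separators so equal addresses compare equal."""
    digits = mac.lower().replace("-", "").replace(":", "").replace(".", "")
    if len(digits) != 12 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def find_duplicate_macs(inventory):
    """Return {mac: [device names]} for every MAC claimed by more than one device."""
    by_mac = defaultdict(list)
    for name, mac in inventory:
        by_mac[normalize_mac(mac)].append(name)
    return {mac: names for mac, names in by_mac.items() if len(names) > 1}


# Hypothetical inventory: two cards that fell back to the same MAC, one healthy card.
inventory = [
    ("pdu-row1-cab3", "40:66:00:00:00:00"),
    ("pdu-row4-cab7", "40-66-00-00-00-00"),
    ("pdu-row2-cab1", "00:06:67:12:34:56"),  # made-up address
]

print(find_duplicate_macs(inventory))
```

Normalizing first matters because different tools print the same address with different separators and case, and a plain string comparison would miss the match.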
The good news is that to resolve the MAC address conflict, we really only need to replace one of the two cards. For now at least.