Since March 2013, I have been working as a Systems Administrator for ZedX, Inc., a small company that focuses on meteorological and agricultural information systems. From day one, I have been heavily involved in projects and have helped the company to the best of my ability. ZedX uses CentOS, the community-supported version of Red Hat Enterprise Linux, for nearly all of its computational needs. Within the environment we have over 100 virtual machines, a handful of physical servers, and over 70 TB of data. The systems group consists of only two people, who manage all of ZedX’s networking, systems, and computer needs.
As I stated on the Home page, at ZedX I have worked extensively to implement automation, auto-recovery, smart alerts, and industry-standard security practices, and to make it easier for our developers to do their work. One of the many ways I have done that is by implementing error recovery. For many years, if a process (say, httpd) stopped running, we might get an alert from an external monitoring service. But if that happens in the middle of the night, and we have customers all over the world, we cannot afford to be down for 6-8 hours. This was always a challenge for us: no matter how good our monitoring was, it did nothing but monitor and alert, leaving the fix up to one of possibly three people who understood how to resolve the issue.
My first solution was to build the recovery into our configuration management service, CFEngine. This allowed us to group machines by “service” and have a CFEngine promise check each service every five minutes and restart it if it was not alive. This worked pretty well: it automatically attempted to fix the problem no matter the time of day, leaving us to live happily ever after. Well, not quite. As we moved to a more high-availability mindset, five minutes of possible downtime was too slow for error recovery, and some of our clustering software would keep a master from reconnecting to the group, which I did not want to happen when a simple but quicker restart would resolve the issue.
After some research I discovered the Monit project, which promised to check services, files, protocols, really anything, every 60 seconds. Lo and behold, this was the answer I had been looking for. After a few days of testing, I deployed it to our main cluster. With it we get alerts when a process goes down and when it recovers, and it keeps alerting if the process does not start again, so we know whether we still have to log in and take action.
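To make the check, restart, and alert cycle concrete, here is a minimal Python sketch of the idea. It is not Monit's implementation or our actual configuration; the function names, the use of `pgrep` and `service`, and the alert messages are all assumptions chosen for illustration.

```python
import subprocess
import time


def is_running(service):
    """Return True if a process with this exact name exists (relies on pgrep)."""
    result = subprocess.run(["pgrep", "-x", service], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def restart(service):
    """Ask the init system to restart the service (CentOS 6 style command)."""
    subprocess.run(["service", service, "restart"], check=False)


def check_once(service, was_down, alert, running=is_running, restarter=restart):
    """Run one check cycle and return True if the service was found down."""
    if running(service):
        if was_down:
            alert(f"{service} recovered")
        return False
    if was_down:
        # The previous restart did not help: keep alerting so a human knows.
        alert(f"{service} is still down, restarting again")
    else:
        alert(f"{service} is down, restarting")
    restarter(service)
    return True


def watch(service, interval=60, alert=print):
    """Check the service every `interval` seconds, forever."""
    was_down = False
    while True:
        was_down = check_once(service, was_down, alert)
        time.sleep(interval)
```

Calling `watch("httpd")` would loop with a 60-second interval. Keeping one cycle in `check_once` makes the alerting logic easy to exercise with fake `running` and `restarter` functions, without touching a real service.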
After a resounding yes from testing and a small deployment to show how effective it was, we rolled Monit out to all of our servers and set it up to monitor all of our important 24/7 services. Then, in a fit of paranoia and with inspiration from the graphic novel Watchmen, I set up CFEngine to monitor Monit’s process. “Who watches the watchmen,” indeed.
Another example of my leadership in automation involves server provisioning. Before I arrived, and for about the first year of my work at ZedX, it took my superior about 4 hours to set up and install all of the required sources for our production web servers. We have highly specialized needs. I thought that time was ridiculous, especially since he had a list of every package he needed to install and links to every source build, with instructions on how to install it all. I set out to automate the installation process, and along the way I learned much more about dhcpd, TFTP/PXE boot, Kickstart, and paying very careful attention to detail while Bash scripting. The result whittled the installation time down to ~30 minutes, and the only work left was selecting which Kickstart config to install. I set up one for basic CentOS 6, one for our Prod web CentOS 6, and one for our Prod SQL CentOS 6, and when CentOS 7 rolled out I was able to add a new series of configs very quickly once I understood the differences between CentOS 6 and CentOS 7.
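The "pick a Kickstart config at boot" step can be pictured as a PXE boot menu with one entry per profile. The following Python sketch generates such a menu in pxelinux syntax. It only illustrates the idea and is not the Bash tooling I actually wrote; the server URL, file names, directory layout, and profile labels are all hypothetical. It also assumes the CentOS 6 installer takes `ks=` on the kernel command line while CentOS 7 prefers `inst.ks=`, one of the small differences between the two releases.

```python
from dataclasses import dataclass

# Hypothetical web server that hosts the kickstart files.
KS_BASE_URL = "http://ks.example.internal/kickstart"


@dataclass
class Profile:
    label: str      # menu entry name (no spaces)
    release: int    # CentOS major version
    kickstart: str  # kickstart file name under KS_BASE_URL


# Hypothetical profiles mirroring the configs described above.
PROFILES = [
    Profile("centos6-base", 6, "centos6-base.cfg"),
    Profile("centos6-prod-web", 6, "centos6-prod-web.cfg"),
    Profile("centos6-prod-sql", 6, "centos6-prod-sql.cfg"),
    Profile("centos7-base", 7, "centos7-base.cfg"),
]


def render_entry(profile):
    """Render one pxelinux menu entry that boots the installer with a kickstart."""
    # Assumption: CentOS 7 installers prefer the inst.ks= form of the option.
    ks_arg = "inst.ks" if profile.release >= 7 else "ks"
    boot_dir = f"centos{profile.release}"  # assumed TFTP directory layout
    return "\n".join([
        f"LABEL {profile.label}",
        f"  KERNEL {boot_dir}/vmlinuz",
        f"  APPEND initrd={boot_dir}/initrd.img "
        f"{ks_arg}={KS_BASE_URL}/{profile.kickstart}",
    ])


def render_menu(profiles):
    """Render a complete pxelinux menu file for the given profiles."""
    header = "DEFAULT menu.c32\nPROMPT 0\nMENU TITLE Select a Kickstart profile"
    return "\n\n".join([header] + [render_entry(p) for p in profiles]) + "\n"
```

Writing the output of `render_menu(PROFILES)` to the PXE server's `pxelinux.cfg/default` file would present the four profiles at boot, and adding a new release becomes a matter of appending one more `Profile` to the list.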
Let the machines do the hard work for us.