Archive for January, 2010

basic sysadmin troubleshooting part 2

Friday, January 29th, 2010
  • top
    You’ll probably want to learn some basic top commands. E.g. hit “1” to see the per-CPU break-out. Hit “z” to highlight processes that are in state “R”. Hit “O, n” to sort by memory usage. Hit “u” to type in a particular user name. Hit “c” to see full command lines. See also: 15 Practical Linux Top Command Examples.
  • ps
    You’ll probably want to learn some basic ps switches. “ps -ef” gives you a listing with users and commands. “ps auxf” also adds some CPU and memory information for the processes and shows them as a “forest”. “ps -efL” also shows you threads for multi-threaded processes. “man ps” will tell you way more than you want to know. Another, more useful example: ps -eo pmem,pcpu,rss,vsize,args | sort -k 1 -r -n | head
  • iostat
    iostat will show you some information about the I/O subsystem. I like “iostat -k 5”; it’ll show you statistics in kilobytes, updated every 5 seconds. The very first screen shows averages since boot; subsequent screens show only the last 5 seconds. Add “-x” to see queue lengths and average request size as well as “service time”, i.e. latency of I/O processing.
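
The ps one-liner above can be wrapped in a small function so that it is easier to reuse. This is only a sketch: the names top_rows and top_mem are made up for illustration, and it assumes a procps-style ps that accepts the same column names as the one-liner.

```shell
#!/bin/sh
# Sort ps-style rows on stdin by the first column (numeric, largest first)
# and keep the first N rows (default 10).
top_rows() {
    sort -k 1,1 -r -n | head -n "${1:-10}"
}

# Show the N processes using the most memory.
# Columns: %MEM, %CPU, RSS, VSZ, full command line.
# The ps header line is not numeric, so it sorts to the bottom.
top_mem() {
    ps -eo pmem,pcpu,rss,vsize,args | top_rows "$1"
}
```

After sourcing the file, “top_mem 5” prints the five largest memory consumers. Changing the sort key in top_rows to “-k 2,2” would rank by CPU instead.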

basic sysadmin troubleshooting part 1

Monday, January 25th, 2010

There are a bunch of things that I look at almost any time I log into a machine.

  • date
    Is the output of this command what you expect? Time synchronization issues are often the cause of odd problems. If the time is wrong, check your ntpd config and output of ‘ntpq -p’.
  • w
    This will show you the uptime first. Is that value roughly what you expect? It will also show you the load and the logged-in users. If the machine has been up for less time than you expect, figure out why it rebooted (and consider upgrading your sysadmin philosophy towards change management). If there are users you don’t expect to be logged in…
  • vmstat 2
    Check the swap I/O, regular I/O, check if any processes are blocked, check if memory usage changes drastically over a short time, check the CPU usage. collectl can give you more detail, but it needs installation.
  • dmesg|tail
    Are there any unusual messages here? What will you do about them?
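
If you run these checks often, they can be strung together in a small script. The following is a minimal sketch, not a standard tool: run_check and first_look are hypothetical names, and “vmstat 2 5” (five samples, two seconds apart) is my choice so that the script terminates on its own.

```shell
#!/bin/sh
# run_check CMD [ARGS...]: print a banner, then run the command if it exists.
run_check() {
    printf '== %s ==\n' "$*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        printf '(skipped: %s not found)\n' "$1"
    fi
}

# The first-look checks from the list above, in the same order.
first_look() {
    run_check date
    run_check ntpq -p        # only meaningful if ntpd is running
    run_check w
    run_check vmstat 2 5     # assumption: 5 samples is enough for a first look
    run_check sh -c 'dmesg | tail'
}
```

After sourcing the file, calling “first_look” prints each section in turn. Reading the output is still up to you; the script only saves typing.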

puppet notes

Thursday, January 14th, 2010

There is one note that I came across that probably explains some unusual intermittent problems I’ve seen with puppet. “If you update the config and do not restart puppetmasterd and that new config is invalid puppetmasterd appears to serve up the previous version of the config that it knew worked.” So the workflow looks like this: you update the config, restart puppet on the node, nothing changes, you get confused, mess around with the config, then maybe restart puppetmasterd, then eventually it works. The key is to tail the puppetmasterd logs when you see some unusual behavior, and maybe also tail the puppet log on the client. There will typically be some relevant error message, or at least an unexpected change in behavior (‘huh, why does the node think it no longer has a defined config?’).
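
A rough sketch of the log-tailing step follows. It assumes puppet logs through syslog and that the syslog file is /var/log/messages; both are assumptions that vary by distribution and configuration, so set SYSLOG_FILE to whatever your system uses. The function names are made up for illustration.

```shell
#!/bin/sh
# ASSUMPTION: puppet messages end up in this file; override for your system.
SYSLOG_FILE="${SYSLOG_FILE:-/var/log/messages}"

# Keep only the lines on stdin that mention puppet
# (matches both the master and the client daemon).
puppet_lines() {
    grep -i 'puppet'
}

# Follow the log and show puppet-related lines as they arrive.
watch_puppet() {
    tail -F "$SYSLOG_FILE" | puppet_lines
}
```

Run “watch_puppet” on the master in one terminal, then trigger a run on the node and watch what the master says about the config it is serving.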

I also came across a couple of ‘chef vs puppet’ blog posts. One was posted on the puppet-users mailing list and, not surprisingly, did not draw any criticism there. The blog commenters were less forgiving and did a good job of convincing me that the post was not well written. Some of the comments are worth reading, though. The second post, Puppet vs Chef, is well written and discusses the differing underlying philosophies of the two tools.

I have to say that I agree that the more advanced pre-requisites for Chef make it less appealing to a sysadmin, or even to a non-Ruby developer. It sucks having to know how to configure and run Merb and OpenID just to try the tool out. The comparison of Nagios and Puppet is an apt one, I think; we sysadmins are OK with arcane configuration syntax, so long as it’s well-documented and examples are easy to find.

A new tool that is intended to sound totally awesome is Foreman, but to me it sounds a lot like Cobbler + Puppet, which is what I used to use before I gave up on dealing with “automatically managing” my DNS, DHCP, TFTP configs. Except here you have the additional hassle of having a working Puppet stack and maybe a working Passenger install before you even start up the tool. I’m quite alright with just writing my /etc/ethers and /etc/hosts and /tftpboot/pxelinux.cfg/default by hand.