July 8th, 2010
This is OpenAMD: http://amd.hope.net/frequently-asked-questions-faq/ It’s a cool “hacker” project, but I think it can actually be used to solve a number of real-world problems: small, common problems that conference attendees have.
Problem #1: quickly and easily looking up information about the person next to you, in parallel with talking to them. Ideally this would happen silently and covertly, but the HUD is not yet available.
Implementation #1: use the location API, identify which attendee is you, find the other attendee at minimum distance from you, and display all the info about that person. Perhaps do this in real time, so the view updates when a new person walks up to you. You could probably run this on your smartphone or your laptop.
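To make that concrete, here is a minimal sketch. The URL, the uid,x,y CSV format, and the uid value are all assumptions on my part; the real OpenAMD API may look nothing like this. The nearest-neighbor part itself is just a distance comparison:

    # Hypothetical: the endpoint and the uid,x,y CSV layout are made up.
    MY_UID=1234
    curl -s http://amd.hope.net/api/location.csv | \
    awk -F, -v me="$MY_UID" '
      $1 == me { mx = $2; my = $3; next }     # remember my own position
               { x[$1] = $2; y[$1] = $3 }     # everyone else
      END {
        best = ""; bestd = -1
        for (u in x) {
          d = (x[u] - mx)^2 + (y[u] - my)^2   # squared distance is enough for ranking
          if (bestd < 0 || d < bestd) { bestd = d; best = u }
        }
        print "nearest uid:", best
      }'

From there, look up and display everything known about that uid.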
Problem #2: find a particular person
Implementation #2: this one should be pretty easy. Given their uid, query their current coordinates. However, the grid coordinates might not be very human-readable, so a directional arrow would be nice. For even more icing on the cake, display their historical path, so you can guess which direction they’re moving.
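For the directional arrow, the math is just atan2 over the coordinate difference. A tiny sketch, with made-up coordinates, giving the angle relative to the grid’s x axis:

    # mx,my = my position; tx,ty = the target's position (all made up here)
    awk -v mx=10 -v my=20 -v tx=42 -v ty=7 'BEGIN {
      pi  = atan2(0, -1)
      deg = atan2(ty - my, tx - mx) * 180 / pi
      print "bearing:", deg, "degrees from the grid x axis"
    }'

Sample the target’s last few positions the same way and you have the historical path.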
Problem #3: finding a person by real name or another attribute (zip code?). This is an extension of #2. Use case: you know your friend Dave is there, but you’re not sure where he is. How do you find his tag’s uid? Hopefully you can query by name or handle to get the uid.
Implementation #3: it sounds like you’ll have to query a separate database of metadata to map the information you do have to a uid.
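Assuming the metadata can be exported as something flat, say a uid,name,handle,zip CSV (pure speculation on my part; I don’t know what the project actually provides), the lookup is trivial:

    # attendees.csv is hypothetical: uid,name,handle,zip
    # Crude, but enough to find the uid to feed into the location query.
    grep -i dave attendees.csv | cut -d, -f1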
July 1st, 2010
Iperf is an excellent tool to test your network performance. But it’s also a good troubleshooting tool: run iperf and see if the results “feel” right.
There are two problems I’ve been able to diagnose using iperf. One was a performance problem with a NIC driver. Take two new machines connected by a gigE switch and run iperf between them: you should get very close to wire speed. If you don’t, it’s worth investigating. In my case, I was getting ~650 Mbit/s instead of the expected ~980 Mbit/s (among other problems). After fixing the problem (in this case, upgrading the NIC driver), iperf showed close to wire speed as expected.
Another time, there was what turned out to be a problem with a switch. It was a gigE switch, and single iperf streams were fine, ~950 Mbit/s between two machines. However, running one machine as an iperf server and starting up several iperf clients showed much lower aggregate throughput. Another thing I noticed was that the throughput was not evenly split between nodes. After a lot of other troubleshooting (kernel settings, NIC drivers, switch config), replacing the switch with a different one and re-running iperf showed an even distribution of performance, e.g. ~100 Mbit/s for each of 10 clients, with an aggregate of ~1 Gbit/s at the server.
I use something like 'iperf -s -w 512K -l 56k' and 'iperf -c head -i 3 -t 120 -P 16'.
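To be explicit about which end runs what (hostnames are just placeholders; 'head' is whatever machine you start the server on):

    # On the server machine ('head'):
    iperf -s -w 512K -l 56k

    # On each client machine: 16 parallel streams for 2 minutes,
    # reporting every 3 seconds:
    iperf -c head -i 3 -t 120 -P 16

For the multi-client test described above, start the client command on several machines at roughly the same time and compare the per-client numbers against the aggregate reported at the server.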
June 13th, 2010
The ‘dig’ command is a tool that allows you to query the DNS system. Here are some ways that I use it that are not covered in the man page.
By default, ‘dig’ will use the DNS servers configured in your system resolver (/etc/resolv.conf on Linux), but you can specify any DNS server. Useful ones are the public ones: 8.8.8.8 and 8.8.4.4 are provided by Google. OpenDNS provides 208.67.222.222 and 208.67.220.220 (but beware they don’t return NXDOMAIN). There’s also 4.2.2.2 (not sure who provides it, but it’s easy to remember).
So if your home ISP DNS server does “DNS hijacking” and returns the IP of one of their web servers instead of NXDOMAIN, you can double-check the result with a quick dig command.
It’s also useful for checking how the propagation of a DNS entry is going. Ask the authoritative name server for the entry, then one of these public caching servers, then your ISP.
The two most common flags I use for dig are “+short” and “-x”, for terse output and a reverse lookup, respectively.
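A few concrete examples (ns1.example.com stands in for whatever server is actually authoritative for the zone you care about):

    # Double-check a suspicious answer against Google's public resolver:
    dig @8.8.8.8 www.example.com +short

    # Check propagation: ask the authoritative server, then a public cache:
    dig @ns1.example.com www.example.com +short
    dig @208.67.222.222 www.example.com +short

    # Reverse lookup of an IP:
    dig -x 8.8.8.8 +short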
You can get the ‘dig’ command on Debian/Ubuntu by installing the ‘dnsutils’ package. On RH, it’s in ‘bind-utils’.
April 14th, 2010
While we all strive for maximal Uptime, one thing that often goes by the wayside is the ability to reboot the machine without wondering whether it will come back up correctly.
You do not want to find out that the machine has the wrong boot settings when it gets powered off unexpectedly and then doesn’t come back up.
“Do a reboot” is a good checklist item when adding a new service to a machine. You’ll want to ensure that the service comes up correctly, without intervention, after a reboot. Of course, you’re installing the service during a maintenance period, so rebooting is not an issue, right?
In my day job, we learned this lesson the “medium hard” way when we had to move datacenters. The specialized movers that were hired had a full power cycle on their checklist of things to do to the machines before unracking them. You really want to make sure the machine comes up correctly before moving it to a different datacenter/network where you may have different unrelated problems.
If you ask around, you’ll hear lots of horror stories about machines that had been up and running with great uptimes (1 year! 2 years!) that no one was willing to touch lest some undocumented thing break on reboot.
Today, with many services living in small virtual machines (or VPSs), a reboot only takes a few seconds, so it’s much easier to do.
Add this “best practice” to your sysadmin toolkit. For more details on related topics, see “The Practice of System and Network Administration”.
April 12th, 2010
There are a lot of writers who write about controversial topics not to add anything of value to the debate, but merely to stir up the flames. Today, that typically means attracting a lot of page views and lots of comments.
How Canonical Can Do Ubuntu Right: It Isn’t a Technical Problem by Caitlyn Martin is a perfect example.
We won’t focus on the fact that the sensationalist headline does not match what she says in the article, which is in fact mostly about technical problems she had with Ubuntu.
It’s the “trolling” sentences that are the signature of this type of article.
- “Other distributions which target the desktop and the wider consumer market do a much better job from a technical standpoint. They produce a better product.”
Which “other distributions”? How are they better?
- “Even considering all of that I still feel that the downloaded Ubuntu offerings more often than not have been substandard when compared to other distributions.”
Which “other distributions”? How is Ubuntu “substandard”?
And finally, after rambling about several unrelated topics, the conclusion: “At this point I recommend Mandriva 2010 for newcomers to Linux. No, it is not bug free. No distribution is. Mandriva’s developers are simply more responsive to bug reports and get issues fixed, usually in a timely manner. In addition, while Mandriva has had a few less than stellar releases they have, more often than not, done a pretty good job of getting things out that work. As always, your mileage may vary.”
My technical mind translates that to “Mandriva worked better for me than Ubuntu on the one box I tried it on.”
I’m not writing this to say that Ubuntu is awesome. I’ve had my share of problems with Ubuntu. I’m writing this to say that Caitlyn’s article is awful. She doesn’t say anything new and she makes vague complaints that only trick other people into trying to counter them. Well, I’m not feeding this troll.
April 1st, 2010
Karanbir Singh (one of the CentOS maintainers) asks “Why do you run CentOS?”
I’d say our indirect philosophical reasons are:
- We want to use Free Software
- We want to use the best tool for the job
- We want to hire smart people
For our application, we can use almost all Free Software, except for one commercial package (IBM’s GPFS), which is not Free but is the best tool for the job. IBM only supports GPFS on SuSE and Red Hat, so we chose Red Hat, mainly because it is more common. It is also much easier to find qualified people who are familiar with the RHisms of Linux. While I’m a Debian guy at heart, it’s easy enough to adjust between RH and Debian.
So that’s why we’re on CentOS.
February 23rd, 2010
Today, I went to a presentation given by Vint Cerf and Robert Kahn. One of the problems they presented as still unsolved was the problem of retaining information in a readable format in the long term. Vint made a pretty funny joke about trying to open a PPT file from 1997 in the year 3000. Even using Windows 3000 with Office 2998, the file may not necessarily be readable.
This is a problem that many people have experienced first-hand. Have any old 5 1/4″ floppies lying around? Think you can still read them? And assuming you can, can you then read the file formats, which may be proprietary to software that no longer exists?
Lo and behold, a couple of hours after the talk, there’s a Slashdot story on the very same topic, pointing to an American Scientist article titled “Avoiding a Digital Dark Age”.
My thoughts on the matter: there are three separate layers here (media longevity, media format, and file format), and each needs to be designed with the same longevity goal in mind.
Here is an example of how one software product, Bacula, handles this problem. It uses an open, documented format for its file contents, so you can print the specification out on paper if you like, then sit down, re-implement the code, and still be able to read its files. It also uses the same format on different media, be it tape or disk. AFAIK, this design decision was made after seeing the evolution of ‘tar’ and GNU tar: even with the same name, there are versions of tar that produce incompatible files.
So the key is to use an open, documented format. Furthermore, it needs to be truly Free Software, not merely open source yet encumbered by, for example, a patent.
January 25th, 2010
There are a bunch of things that I look at almost any time I log into a machine.
Check the system clock: is it what you expect? Time synchronization issues are often the cause of odd problems. If the time is wrong, check your ntpd config and the output of ‘ntpq -p’.
Check the uptime, load, and logged-in users. Is the uptime roughly what you expect? If the machine has been up for less time than you expect, figure out why it rebooted (and consider upgrading your sysadmin philosophy towards change management). If there are users you don’t expect to be logged in…
Check the swap I/O and regular I/O, whether any processes are blocked, whether memory usage changes drastically over a short time, and the CPU usage. collectl can give you more detail, but it needs to be installed.
Check the logs. Are there any unusual messages here? What will you do about them?
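In command form, that quick pass looks roughly like this (log file locations vary by distribution):

    date                     # is the clock right?
    ntpq -p                  # is ntpd actually syncing against its peers?
    w                        # uptime, load, and who is logged in
    vmstat 1 5               # swap, I/O, blocked processes, memory, CPU
    dmesg | tail             # recent kernel complaints
    tail /var/log/messages   # or /var/log/syslog on Debian/Ubuntu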
January 14th, 2010
There is one note that I came across that probably explains some unusual intermittent problems I’ve seen with puppet: “If you update the config and do not restart puppetmasterd and that new config is invalid puppetmasterd appears to serve up the previous version of the config that it knew worked.” So the workflow looks like this: you update the config, restart puppet on the node, nothing changes, you get confused, mess around with the config, then maybe restart puppetmasterd, then eventually it works. The key is to tail the puppetmasterd logs when you see some unusual behavior, and maybe also tail the puppet log on the client. There will typically be some relevant error message, or at least an unexpected change in behaviour (‘huh, why does the node think it no longer has a defined config?’).
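Concretely, something like this (the log location is an assumption, since puppet usually logs via syslog, and the command name is from the 0.2x-era puppet; adjust to taste):

    # On the puppetmaster, where the logs usually land in syslog:
    tail -f /var/log/messages

    # On the client, force an immediate verbose run while you watch both logs:
    puppetd --test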
I also came across a couple of ‘chef vs puppet’ blog posts. One was posted on the puppet-users mailing list and, not surprisingly, did not draw any criticism there. The blog commenters were less forgiving and did a good job of convincing me that the post was not well written; some of the comments are worth reading, though. The second post, Puppet vs Chef, is well written and discusses the differing underlying philosophies of the two tools.
I have to say that I agree that the more demanding prerequisites for Chef make it less appealing to a sysadmin, or even to a non-Ruby developer. It sucks having to know how to configure and run Merb and OpenID just to try the tool out. The comparison of Nagios and Puppet is an apt one, I think; we sysadmins are OK with arcane configuration syntax, so long as it’s well documented and examples are easy to find.
A new tool that is supposed to sound totally awesome is Foreman, but to me it sounds a lot like Cobbler + Puppet, which is what I used to use before I gave up on dealing with “automatically managing” my DNS, DHCP, and TFTP configs. Except here you have the additional hassle of needing a working Puppet stack, and maybe a working Passenger install, before you even start up the tool. I’m quite all right with just writing my /etc/ethers, /etc/hosts, and /tftpboot/pxelinux.cfg/default by hand.
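For reference, the hand-written file is not exactly onerous. A minimal /tftpboot/pxelinux.cfg/default looks something like this (the kernel filenames and kickstart URL are just an example):

    DEFAULT install
    PROMPT 0
    TIMEOUT 50

    LABEL install
      KERNEL vmlinuz
      APPEND initrd=initrd.img ks=http://server/ks.cfg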