One of the lesser-used features of DTrace is anonymous tracing: the ability to trace events that occur after early boot has completed but before there is an opportunity to log into the system and start tracing. As Bryan says: you rarely need this facility, but when you do, you REALLY need it. Anonymous tracing relies on stuffing DOF into dtrace(7d)‘s driver.conf(4) file, /kernel/drv/dtrace.conf, and adding forceload directives to /etc/system so that DTrace’s modules are loaded as early as possible; the -A option will helpfully do both of these for you. But what if you’re on SmartOS, where these files are on a ramdisk? The whole point of anonymous tracing is to collect data during the next boot, and the contents of the ramdisk are lost before they can be used.
Since the introduction of GRUB via the New Boot project roughly 10 years ago, illumos on x86 has required the use of a single boot module, known as the boot archive. This module, passed to the kernel via the Multiboot protocol, conventionally contains a UFS filesystem the kernel will use as the root filesystem during the middle phases of boot — after early loading but before enough infrastructure is built to mount the real root filesystem. A very limited VFS (“bootvfs“) provides access to this temporary root filesystem during this time, and facilities such as krtld can read kernel modules, driver.conf files, and other data such as /etc/system from it using a restricted set of filesystem-specific operations. Traditional illumos distributions require rebuilding the boot archive any time one of a number of such files have been modified in the real root filesystem, to preserve the fiction that the filesystem used during boot is the same persistent, ordinary one mounted at / when the system is running. That mechanism, then, picks up the changes needed for anonymous tracing to work and incorporates them into the archive prior to rebooting.
No similar mechanism exists on SmartOS; the only persistent means of modifying the boot archive is to rebuild the platform image (SmartOS’s name for the boot archive, though unlike the traditional boot archive, the platform image holds the entire contents of the normal root filesystem, not just a few special files used by the kernel). Building a new platform image can take anywhere from 15 minutes to 3 or 4 hours, depending on how much has changed and the method you’re using to rebuild it. Certainly it would be nice if there were a better way to override the contents of certain files contained in the platform image. So, during the recent Joyent hackathon, John Sonnenschein and I implemented a facility to pass arbitrary Multiboot modules to the OS at boot time and have them appear in the filesystem hierarchy. The first portion of this work has been integrated under OS-2629; this portion of the work provides facilities for handling any number of boot modules and making them available to the kernel at arbitrary locations in the filesystem hierarchy via a new bootvfs filesystem. This is sufficient to allow engineers to replace device drivers, driver.conf files, /etc/system, and other files consumed by the kernel prior to mounting the real root filesystem. The mechanism for passing in boot modules is described by some sketchy documentation we wrote up during the hackathon; better documentation will follow once the second phase of this work is completed, a filesystem providing access to these additional modules after the system has booted.
With this in hand, we can now enable anonymous tracing relatively easily on a standalone SmartOS host or SDC headnode. All we need is any location that we can modify from the running system and GRUB can read at boot time. If you’re using a USB key to store your SmartOS images, as we do, that’ll do fine. If you use PXE, you will need to modify the GRUB menu.lst your PXE server sends your SmartOS client. Using this facility with read-only optical media is left as an exercise for the reader. If we’re using a USB key, however, there are two basic steps involved.
Setting Up
Modify the GRUB configuration to pass the required two files to the kernel at boot time. You will need to convert the module declarations in your menu.lst from “old style” (which assumed a single module containing the boot archive) to the “new style” described in our writeup. This will end up looking something like this:
title Live 64-bit kernel$ /os/20131227T225949Z/platform/i86pc/kernel/amd64/unix -B console=${os_console},${os_console}-mode="115200,8,n,1,-",headnode=true module /os/20131227T225949Z/platform/i86pc/amd64/boot_archive type=rootfs name=ramdisk module /os/20131227T225949Z/platform/i86pc/amd64/boot_archive.hash type=hash name=ramdisk module /os/bootfs/etc/system type=file name=etc/system module /os/bootfs/kernel/drv/dtrace.conf type=file name=kernel/drv/dtrace.conf
This need only be done once; you can then iterate as many times as needed. When you are finished using anonymous tracing, you can simply remove the two new entries.
Iterating
After modifying your GRUB configuration, it’s easy to set up a new set of enablings for your next boot. Simply run dtrace -A as you normally would, then copy the resulting /etc/system and /kernel/drv/dtrace.conf into the location you chose above (in my example above, it would be <path to USB key>/os/bootfs). When you reboot next, you will be greeted by helpful messages indicating that your enablings have been created:
NOTICE: enabling probe 0 (fbt::rts_new_rtsmsg:entry) NOTICE: enabling probe 1 (proc:::exec-success) NOTICE: enabling probe 2 (dtrace:::ERROR)
You may then log in normally and use dtrace -a to claim the trace data collected during boot.
Solving a Real-World Problem
A long-running source of annoyance to both SmartOS users and Joyent engineers working on SDC has been a bug causing the NTP service to enter the maintenance state on boot, more or less reliably. While this is merely annoying on standalone SmartOS systems, it has some very unfortunate knock-on effects on SDC compute nodes. This problem has resisted efforts to debug it: simply logging in and clearing the service invariably works, and any debugging commands added to the NTP start method chased the problem away entirely. The proximate cause is simple: in order to avoid ntpd hanging forever (itself an egregious bug in this widely-used software) if none of its servers is reachable, we perform a simple application-level ping on each one prior to starting up. This is itself a product of countless past headaches, and perhaps a sign that we should simply have fixed the daemon in the first place. In any case, this check would reliably fail during boot.
Anonymous tracing made this relatively simply to debug. By observing the generation of routing socket messages (see route(7p)) and correlating them with the execution of various programs during boot, it became fairly clear that the problem boiled down to two basic issues:
- ifconfig(1m) does not wait for the kernel’s duplicate address detection to complete before exiting after making a change.
- An unneeded service was using ifconfig to reset every interface’s netmask and broadcast addresses just prior to the NTP service starting.
Both conditions are needed to trigger this problem, and the window in which the NTP service needs to start before ifconfig’s changes take effect is only a couple hundred milliseconds wide, explaining the ease of chasing the bug away. The exceedingly simple solution: get rid of the useless service. In general, however, new services should be using ipadm(1m) instead of ifconfig; it has its own mechanism that does not suffer from this problem. Future investigations of apparent “impossible service races” during startup should consider this as a first-class possibility. Fortunately, with anonymous tracing now back in the toolbox, it’ll be much easier to evaluate this hypothesis.
One Response
Great and very interesting work, thanks!