Capturing System Core Dumps
AHL Tech Article 'Capturing System Core Dumps'.
We recently published an article on process core dumping inside Docker containers. This article is a follow-up on the related topic of capturing system core dumps (aka vmcores or crashdumps).
System core dumping
The goal of capturing system core dumps is the same as for process core dumps: they allow us to debug issues with our systems. In contrast to process core files, however, system core files are generated by the Linux kernel in response to the kernel itself crashing. They are invaluable in debugging kernel crashes, as they let you investigate what the system looked like at the time of crashing.
Traditionally, system core dumps are written to local disk in a special partition, or to a remote NFS share. The potential maximum size of a system core is the amount of RAM and swap space in a server, so these files can get pretty big. As with process core dumps, we don’t want to maintain a large local partition for system core dumps, and we also don’t have NFS in our production environment, so what can we do to enable us to capture system cores?
As with the process core files, we use FTP as a transport mechanism. FTP is very fast and easy to script. However, the native RedHat/Centos crash utilities do not support FTP, so we’ve engineered a solution that does.
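To get a feel for how large a system core could get on a given server, the upper bound mentioned above (RAM plus swap) can be estimated from /proc/meminfo. This is only a minimal sketch: the function name is ours for illustration, and it assumes the standard MemTotal and SwapTotal fields, which are reported in kB.

```python
def max_core_size_bytes(meminfo_text):
    """Upper bound on a system core: total RAM plus total swap, in bytes."""
    totals = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if key in ("MemTotal", "SwapTotal") and fields:
            # /proc/meminfo reports these values in kB (1024-byte units).
            totals[key] = int(fields[0]) * 1024
    return totals.get("MemTotal", 0) + totals.get("SwapTotal", 0)
```

On a live system, the function would be fed the contents of /proc/meminfo, e.g. max_core_size_bytes(open("/proc/meminfo").read()).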
The use of FTP as a transport might sound somewhat controversial, but it has several good qualities:
- It is a very fast protocol, approaching the theoretical speed limit of the network. SSH, for example, is significantly slower, which matters when system core files get large.
- It does not need any special keys / secrets to work for us. We have an anonymous, write-only FTP server that we use to receive core dumps (inside our own network, of course).
- Unlike NFS, we do not need a lot of ports opened up in firewalls, and there is no dependency on RPC.
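To illustrate how little is needed for an anonymous FTP upload, here is a minimal sketch using Python's standard ftplib module. It is a stand-in for illustration only: our actual solution runs inside the crashkernel with ncftpput, and the function names below are hypothetical.

```python
from ftplib import FTP


def connect_anonymous(host, timeout=30):
    """Open an anonymous FTP session (no keys or secrets needed)."""
    ftp = FTP(host, timeout=timeout)
    ftp.login()  # with no arguments, ftplib logs in as 'anonymous'
    return ftp


def upload_core(ftp, local_path, remote_name):
    """Stream a local file to the server over an already-connected session."""
    with open(local_path, "rb") as src:
        # STOR the file in binary mode, reading it in 1 MiB blocks.
        ftp.storbinary("STOR " + remote_name, src, blocksize=1024 * 1024)
```

Keeping the connection and the upload separate means the upload step can be exercised against any object that offers a storbinary() method, without a real server.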
You can find all files related to this article at our GitHub repo.
What happens when we crash?
To understand our approach to capturing system core dumps, it is first necessary to understand how system core generation works in RedHat/Centos.
When Linux first boots, the boot loader will usually instruct the Linux kernel to reserve 128 MB of RAM for crashkernel space. This is an area of memory that the kernel will not touch during its normal course of operation. E.g. on RedHat/Centos 7.x, grub usually passes in the crashkernel=auto parameter to the kernel (see
Systemd starts the kdump.service on boot. The service runs /usr/bin/kdumpctl, which loads a ‘dump-capture’ kernel into the reserved memory space. It does this by calling the /usr/sbin/kexec command line tool (see the kexec(8) man page).
kexec in turn calls the kexec_load() system call with the KEXEC_ON_CRASH flag. In addition to loading the dump-capture kernel and dump-capture initrd into the reserved memory area, it also causes the kernel to automatically start the dump-capture kernel if the system crashes (see the kexec_load(2) man page).
The dump-capture kernel will then start the /usr/sbin/kdump process, which actually creates the system core.
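A quick way to check that this chain is in place on a running system is to look for the crashkernel= parameter on the kernel command line and to ask the kernel whether a dump-capture kernel is currently loaded. The sketch below is illustrative only: the function names are ours, and it assumes the /sys/kernel/kexec_crash_loaded sysfs file is available.

```python
def crashkernel_param(cmdline):
    """Return the value of crashkernel= from a kernel command line, or None."""
    for token in cmdline.split():
        if token.startswith("crashkernel="):
            return token[len("crashkernel="):]
    return None


def crash_kernel_loaded(path="/sys/kernel/kexec_crash_loaded"):
    """True if a dump-capture kernel is currently loaded (sysfs reports '1')."""
    with open(path) as f:
        return f.read().strip() == "1"
```

On a live system, crashkernel_param() would be given the contents of /proc/cmdline.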
The dump-capture disk image contents
Like the real kernel, the dump-capture kernel has an associated image file with its initial file system. We can add files to this image, and they will then be available to us during dump capture.
The initrd image is built when the /usr/bin/kdumpctl script is called from kdump.service. It reads /etc/kdump.conf and creates an image file with the appropriate content (using dracut), and stores it in the /boot file system.
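To see which dump-capture images have been built on a server, something like the following sketch could be used. It assumes the image naming convention we believe RedHat/Centos 7 uses (initramfs-<kernel version>kdump.img); the function name is hypothetical, and the pattern may need adjusting on other distributions.

```python
import glob
import os


def kdump_images(boot_dir="/boot"):
    """Return the kdump initrd images found in boot_dir, sorted by name."""
    # Assumed naming convention: initramfs-<kernel version>kdump.img
    pattern = os.path.join(boot_dir, "initramfs-*kdump.img")
    return sorted(glob.glob(pattern))
```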
Our dump-capture image is a vanilla image, which we extend. We do this by instructing kdump to add extra files into the image (e.g. ncftpput, our pre-crash script, etc.). We use Ansible to create a custom “pre-hook” script per server using a template, which allows us to propagate IP (and other) information to the crashkernel.
Propagating state to the crashkernel
It is important to realise that the dump-capture kernel does not “know” anything about the state of the previously running system. However, as it has access to the whole memory of the server, it can run a program to dump this memory to the crashdump file. In fact, the crashdump is a memory image with some bits stripped out.
Because no state carries over from the previously running OS, the network setup, for example, is lost, and even local filesystems have to be remounted if we want to write a crashdump locally.
In order to send files over FTP, we obviously have to have a working network configuration. If we only had one subnet, we could imagine configuring the crashkernel to run with a specific, well defined, IP address. Kernel crashes are quite rare, and we’d be unlucky to have two kernels crash at the same time, so the fact that both crashkernels used the same IP would be unlikely to cause problems in the real world.
Unfortunately, we have many different subnets across our estate, and we don’t want to dedicate a specific IP per subnet for ‘crashing’. Doing so would also create extra work every time we changed network configurations.
The kdump process inside the crashkernel has a set of hooks it runs at various stages during its operation. We start our own script using the first available hook, and our script then takes over from kdump (and we never return control back to it).
As we have control of the contents of the image, we can pre-populate the script with the IP address of the individual server, so that it has the requisite information to set up its IP configuration.
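The idea of baking per-server settings into the script at image build time can be sketched as follows. We do this with an Ansible template; the example uses Python's string.Template purely as a stand-in, and the variable names and script fragment are hypothetical, not taken from our actual script.

```python
from string import Template

# Hypothetical fragment of a pre-hook script; the variable names are illustrative.
HOOK_TEMPLATE = Template(
    "#!/bin/sh\n"
    "STATIC_IP=$ip\n"
    "STATIC_PREFIX=$prefix\n"
    "STATIC_GATEWAY=$gateway\n"
    "STATIC_HOSTNAME=$hostname\n"
)


def render_hook(ip, prefix, gateway, hostname):
    """Fill in the per-server network settings at image build time."""
    return HOOK_TEMPLATE.substitute(
        ip=ip, prefix=prefix, gateway=gateway, hostname=hostname
    )
```

Because the values are substituted before the image is built, the script inside the crashkernel needs no access to the crashed system's configuration.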
The pre-crash script
When the crash kernel starts, the kdump process inside it executes our pre-crash.sh script, which performs the following steps:
- Attempt to configure IP with DHCP.
- If the previous step fails, use a statically set IP / network route / hostname, as determined and configured when kdump built the unique disk image.
- Dump the stripped memory contents to our FTP server using anonymous FTP. We strip the image to make it smaller, in order to produce a “fast” crash dump. At this point, the server can be rebooted as we can get most of the required information out of the stripped crash dump.
- After the initial dump has completed, we produce a full dump. This is usually not required for kernel debugging, but it can be useful in some instances.
- Reboot the system.
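The order of the steps above, including the fallback from DHCP to the static configuration, can be sketched as below. The real pre-crash.sh is a shell script running inside the dump-capture image; this Python version only illustrates the control flow, and every function and parameter name in it is hypothetical.

```python
def run_pre_crash(try_dhcp, set_static_ip, send_dump, reboot):
    """Run the capture steps in order; each argument is a callable."""
    steps = []
    if try_dhcp():
        steps.append("dhcp")
    else:
        # DHCP failed: fall back to the settings baked into the image.
        set_static_ip()
        steps.append("static")
    send_dump(stripped=True)   # small, fast dump first
    steps.append("fast-dump")
    send_dump(stripped=False)  # full dump afterwards
    steps.append("full-dump")
    reboot()
    steps.append("reboot")
    return steps
```

Sending the stripped dump first means the most useful data is already on the FTP server even if the larger full dump is interrupted.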
How we roll this out system wide
We use Ansible extensively, and we have created a role which:
- Drops the required scripts + binaries on the box.
- Rebuilds the crashkernel image.
The Ansible role is part of our base playbook, so it gets rolled out automatically to any new systems or images that we build.
Sounds great, but does it work in practice?
Yes, it works very well! Kernel crashes are thankfully rare, but we have successfully captured crashdumps from every kernel crash we have experienced since rolling this out. Previously, crashdump generation was hit-and-miss due to the local disk requirement, but now there is no such problem. The server where the crashdump files end up has a lot of disk space as well, so we can keep crashdumps around for a long time for analysis of common issues and trends.
If your current setup relies on local disk, and/or NFS- or SSH-based captures, we think that the FTP-based solution we have developed is well worth a look.
A fun fact about reboot()
It is also possible to start the dump-capture kernel by calling the reboot() system call with the LINUX_REBOOT_CMD_KEXEC flag (reboot.c), which in turn calls kernel_kexec() to start the dump-capture kernel that was previously loaded.
The reboot() system call has to be called with two magic parameters set to the correct values before it will actually cause a reboot. We see this in the function signature:
SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd, void __user *, arg)
The first parameter must be set to the value 4276215469, and the second parameter can be set to any of 672274793, 85072278, 369367448 or 537993216. These are all defined in reboot.h.
You may wonder what the significance of these numbers is. If we convert them to hexadecimal notation using some Python, we see:
>>> [hex(x) for x in [4276215469, 672274793, 85072278, 369367448, 537993216]]
['0xfee1dead', '0x28121969', '0x5121996', '0x16041998', '0x20112000']
The first magic number is now obvious. The remaining numbers represent the birthday of Linus Torvalds and those of his three children. Whoever said us nerds can’t have fun? :-)
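To make the dates explicit, the hex digits of the four second-parameter values can be read as DDMMYYYY. The helper below is our own illustration, not something found in the kernel sources. It does not apply to the first magic number, whose hex digits are not a date.

```python
from datetime import date


def magic_to_date(magic):
    """Read the hex digits of a magic number as DDMMYYYY and return a date."""
    digits = format(magic, "08x")  # zero-pad, e.g. 0x5121996 -> '05121996'
    day, month, year = int(digits[:2]), int(digits[2:4]), int(digits[4:])
    return date(year, month, day)
```

For example, magic_to_date(672274793) gives date(1969, 12, 28), i.e. 28 December 1969.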