The first reports that updated versions of VMware’s ESXi 6.7.0 installer do not start on PC Engines platforms date from the beginning of 2019. We have been aware of the issue since April (1). Older versions of ESXi worked fine.
There were fixes from other firmware vendors for Intel NUC platforms, but apparently those dealt with UEFI memory map problems, as mentioned here. The release notes for 0051 linked in that article (you have to open the BIOS Update page, as the direct link to the release notes is no longer valid) mention that it fixes versions 6.7 and 6.5, so this is probably a different issue altogether.
Symptoms
With older firmware versions, the boot process hung at:
|
|
With newer versions of coreboot, the platform reboots instead:
|
|
This happens before the kernel is even started, in the bootloader. This stage of the bootloader is in mboot.c32 on the installation medium.
Different versions of mboot.c32
A file with this name is a part of SYSLINUX. It is responsible for loading images using the Multiboot specification. During our research, we tried to use mboot.c32 from different versions of SYSLINUX.
ESXi uses its own version, which implements the Mutiboot protocol (we first thought that this was a typo, but apparently it’s not). As the name suggests, it is a mutated variant of Multiboot :) Do not try to start ESXi with SYSLINUX’s modules, as they will not work.
Source code and debug info
There are sources for vSphere available on the VMware website. Code for esxboot is included in the Open Source Disclosure package for VMware vSphere Hypervisor (ESXi). It can be downloaded as an ISO image containing all open source components. There is also a stale GitHub repo with older code.
One of the most useful pieces of information found there is the list of mboot.c32 options. This allowed us to gather more verbose output. From the SYSLINUX menu, press Tab and change the command line to:
|
|
Lines that were printed without the additional flags will be printed twice, sometimes interleaved. This is the output, with most of the lines that are unimportant for this issue removed:
|
|
This is the place where it hangs or reboots. It is a few hundred lines below the <6>Shutting down firmware services... line. It is printed by the code in the install_trampoline() function in reloc.c. With reverse engineering we established that only_em64t was not defined, so only do_reloc() is called before returning from this function.
install_trampoline() is called from main() in mboot.c, followed by Log(), both for success and for failure, so we can assume that install_trampoline() does not return, right? Well, not quite.
We need to go deeper
A binary built by us would most likely differ from the one included on the installation image, because it would be built with a different toolchain. To have 100% identical machine code (up to a certain point), we decided to go with binary patching instead of dealing with different compilers and dependency hell.
It basically came down to disassembling the original image (which was already done to check if only_em64t was defined) and inserting new code at the point we were trying to test, using a hex editor. This code was (Intel syntax):
    mov dx, 0x3f8
    mov al, 0x78
    out dx, al
    jmp $
To write this in machine code, we can either make a dummy file and compile it (this sometimes requires cross-compilation), write it by hand with information from Intel SDM Vol. 2 or, after a while, from memory (which can be tedious), or use online tools like this one. The code above translates to the byte sequence: 66 ba f8 03 b0 78 ee eb fe.
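As a cross-check, the byte sequence can be put together by hand from the individual instruction encodings. The sketch below is our own illustration, not a tool used in the original investigation: the helper name checkpoint() is hypothetical, and the reading of the bytes (write one character to I/O port 0x3f8, the usual base port of the first serial port, then spin forever) comes from decoding the sequence quoted above.

```python
# Hand-assemble the 9-byte checkpoint for 32-bit protected mode.
# checkpoint() is a hypothetical helper written for this illustration.

COM1_PORT = 0x3F8  # conventional base I/O port of the first serial port


def checkpoint(char: str = "x") -> bytes:
    """Return machine code that outputs `char` on COM1 and then loops forever."""
    code = bytes([0x66, 0xBA])               # 0x66: 16-bit operand prefix, 0xBA: mov dx, imm16
    code += COM1_PORT.to_bytes(2, "little")  # immediate operand 0x03f8, little endian
    code += bytes([0xB0, ord(char)])         # mov al, imm8
    code += bytes([0xEE])                    # out dx, al
    code += bytes([0xEB, 0xFE])              # jmp short -2, i.e. jump to itself
    return code


print(checkpoint().hex(" "))  # prints: 66 ba f8 03 b0 78 ee eb fe
```

Changing the character gives every checkpoint its own marker, which helps when several of them are placed in the same binary.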
This code was placed at important points as checkpoints in the flow. It must overwrite the existing code rather than be inserted, because offsets to other functions and structures must not change.
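The difference between overwriting and inserting can be shown with a few lines of Python. This is only a sketch of the idea: we did the actual patching in a hex editor, the function name patch_in_place() is made up for this example, and the image and offset are stand-ins rather than real values from mboot.c32.

```python
# Overwrite bytes at a given offset without changing the size of the image,
# so every other offset in the binary stays valid.
# patch_in_place(), the dummy image and the offset are hypothetical.

CHECKPOINT = bytes.fromhex("66 ba f8 03 b0 78 ee eb fe")


def patch_in_place(image: bytes, offset: int, patch: bytes) -> bytes:
    """Return a copy of `image` with `patch` written over the bytes at `offset`."""
    if offset < 0 or offset + len(patch) > len(image):
        raise ValueError("patch does not fit inside the image")
    patched = image[:offset] + patch + image[offset + len(patch):]
    # An insertion would make the image longer; an overwrite never does.
    assert len(patched) == len(image)
    return patched


image = bytes(range(32))                       # stand-in for the original binary
patched = patch_in_place(image, 8, CHECKPOINT)
print(len(image) == len(patched))              # prints: True
```

Everything before and after the patched span is byte-for-byte identical to the original, which is exactly the property we needed.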
Those checkpoints revealed that not only did do_reloc() and install_trampoline() return, but so did the first Log() after that. Apparently it printed an empty string which is, let’s say, less intolerable than printing random bytes.
This seems like a broken relocation - the call to Log() points to a string that is no longer there. At least the read-only data section of mboot.c32 was relocated and overwritten. The code might also have been relocated, but apparently it wasn’t overwritten, because our checkpoint executed. There is a warning before the do_reloc() code about it being position-independent. Trampoline code and data are objects of type [t] (see the top of the file for a description of types), and because of that they are handled with special care, but main()’s code and data aren’t.
Relocation - why is it needed?
Not all of the code is position-independent. An example of such code is the kernel (at least its initial part). It must be loaded at the address for which it was linked, as printed in the log:
|
|
If this address is not available, i.e. not marked as free RAM in the e820 map (type=1), boot fails. The base address is written in the kernel file, so it is not known to the bootloader before this file is loaded, extracted and parsed. It is very unlikely that the file will be loaded at the correct address on the first write to memory. Also, sections can have different sizes in the file than in memory; usually padding is added after the file is read.
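The availability check described above boils down to asking whether a range is completely covered by type=1 entries. A minimal sketch of such a check follows; it is not taken from esxboot, the function name range_is_free() is ours, and the memory map is a made-up example rather than the map of any real platform.

```python
# Check whether [start, start + size) lies entirely in free RAM (e820 type 1).
# The map is hypothetical and is assumed to contain no overlapping entries.

E820_RAM = 1
E820_RESERVED = 2

# (base, length, type)
e820 = [
    (0x00000000, 0x0009FC00, E820_RAM),
    (0x0009FC00, 0x00000400, E820_RESERVED),
    (0x00100000, 0x7FF00000, E820_RAM),
]


def range_is_free(entries, start, size):
    """Return True if every byte of the range is covered by type=1 entries."""
    end = start + size
    cursor = start  # first address not yet known to be free RAM
    for base, length, kind in sorted(entries):
        if kind != E820_RAM:
            continue
        if base <= cursor < base + length:
            cursor = base + length  # this entry covers us up to its end
            if cursor >= end:
                return True
    return cursor >= end


print(range_is_free(e820, 0x00400000, 0x01000000))  # prints: True
print(range_is_free(e820, 0x00090000, 0x00020000))  # prints: False
```

The second query fails because the range crosses the reserved entry and the hole below 1 MiB, which is the situation in which the bootloader has to give up.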
The bootloader loads all modules at once, before any checks for address ranges are made. In most cases, those modules are initially loaded in the range required by the kernel. This can be deduced from the log lines "mboot __executable_start is at 0x160000" and "Total extracted: 477Mb (500775353 bytes)". Therefore, some juggling is required to make space for the kernel. It is (relatively) easy for the modules: they have not been run yet, so there is little difference between code and data. Relocating a binary that has already been started is a different story altogether.
PIC
mboot.c32 is compiled as position-independent code (PIC). It means that there are no hardcoded addresses; all of them are calculated relative to the program counter - the EIP register. On x86 this involves a trick with reading the return address from the stack; it is much easier on x86_64, as there is support for RIP-relative addressing.
There are some rules that must be followed during relocation. First of all, code responsible for relocation shouldn’t return to the code that called it, if the caller or the stack was being relocated. In that case, there should be no plain return statements, because they read return address from the stack (which might have been relocated), which holds the pointer to the old code (which also might have been relocated). Return address could be patched and stack could be protected, but that’s not all.
An even worse issue is that when the flow returns to the calling function (assuming its code was not overwritten), it still has old pointer values saved in local variables, be it on the stack or in registers. There is no easy way of patching such addresses.
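The stale pointer problem can be reproduced in a toy model, with a bytearray standing in for physical memory and a plain integer standing in for a pointer kept in a local variable. All addresses and sizes below are invented, and zeroing the old location is just one possible way for it to be overwritten; real contents after relocation may be anything.

```python
# Toy model: data is moved, but a pointer saved before the move is not updated.
# Addresses, sizes and the zero fill are assumptions made for illustration.

memory = bytearray(64)
OLD_BASE, NEW_BASE, SIZE = 8, 40, 8

memory[OLD_BASE:OLD_BASE + SIZE] = b"message\0"
saved_pointer = OLD_BASE  # absolute address held in a "local variable"


def relocate(mem, old, new, size):
    """Copy a block to its new place and reuse (here: zero) the old one."""
    mem[new:new + size] = mem[old:old + size]
    mem[old:old + size] = bytes(size)


relocate(memory, OLD_BASE, NEW_BASE, SIZE)

delta = NEW_BASE - OLD_BASE
stale = bytes(memory[saved_pointer:saved_pointer + SIZE])
fixed = bytes(memory[saved_pointer + delta:saved_pointer + delta + SIZE])

print(stale)  # prints: b'\x00\x00\x00\x00\x00\x00\x00\x00'
print(fixed)  # prints: b'message\x00'
```

Read as a C string, the stale pointer now yields an empty string, which resembles what we saw from Log() after the relocation. Nothing in the caller knows that delta has to be applied to its saved pointer.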
It is much easier to relocate global data. When any global variable is accessed, its value is not loaded directly; instead, a pointer to that variable (or any other symbol) is read from a relatively-addressed table containing absolute addresses of all such symbols. This table is called the Global Offset Table (GOT). It is present in the file, where it contains relative offsets to the data, just as if the binary were loaded at base address 0. Pointers in that table are updated (the real base address is added to them) by the binary itself - the loader doesn’t know enough about the layout of the binary’s sections. It happens during self-initialization of a module, but nothing prevents us from doing something similar again after a relocation.
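Both steps, the initial fix-up and a repeated one after relocation, are simple additions. The sketch below models the GOT as a list of integers; the offsets, base addresses and function names are hypothetical and were chosen only to make the arithmetic visible, they do not come from mboot.c32.

```python
# GOT fix-up modelled with plain integers.
# FILE_GOT holds offsets as stored in the file (as if the base address were 0).

FILE_GOT = [0x2000, 0x2010, 0x3400]  # hypothetical symbol offsets


def init_got(file_got, base):
    """Self-initialization: turn file offsets into absolute addresses."""
    return [base + offset for offset in file_got]


def fixup_got(got, old_base, new_base):
    """After relocation: shift every absolute address by the distance moved."""
    delta = new_base - old_base
    return [address + delta for address in got]


got = init_got(FILE_GOT, 0x160000)
moved = fixup_got(got, 0x160000, 0x400000)

print([hex(a) for a in got])    # prints: ['0x162000', '0x162010', '0x163400']
print([hex(a) for a in moved])  # prints: ['0x402000', '0x402010', '0x403400']
```

The result after the second fix-up is the same as if the binary had been initialized at the new base from the start, which is why global data is the easy part.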
The Global Offset Table and PIC in general are described in Eli Bendersky’s article, with examples. It is focused on shared libraries, but the main principles are still the same.
In this particular case, on every function entry the compiler adds a call to a function that copies EIP to EBX. Then some value is added to it, different for each function, depending on the address of its entry point relative to the base of the image. The resulting EBX always holds the address of the same place in the binary - the GOT.
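The arithmetic behind this can be checked with a small model. It is a simplification under stated assumptions: the call to the EIP-copying function is taken to be the very first instruction of each function and to be a 5-byte near call, and the GOT offset and function entry offsets are invented values, not ones read from mboot.c32.

```python
# Model of how EBX ends up pointing at the GOT regardless of the load address.
# GOT_OFFSET, the entry offsets and the position of the call are assumptions.

GOT_OFFSET = 0x5000  # hypothetical offset of the GOT from the image base
CALL_LEN = 5         # near call with a 32-bit displacement: opcode + 4 bytes
FUNCTIONS = {"main": 0x1000, "install_trampoline": 0x2340}  # hypothetical


def got_address(base, entry_offset):
    """Return the value left in EBX after the function prologue."""
    # The called helper reads its own return address from the stack,
    # i.e. the address of the instruction right after the call.
    eip = base + entry_offset + CALL_LEN
    # Constant baked in at build time, different for each function.
    addend = GOT_OFFSET - (entry_offset + CALL_LEN)
    return eip + addend


for base in (0x160000, 0x400000):
    results = {name: got_address(base, off) for name, off in FUNCTIONS.items()}
    print(hex(base), {name: hex(addr) for name, addr in results.items()})
```

For each base address every function computes the same value, base plus GOT_OFFSET, even though each one starts from a different EIP and adds a different constant.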
Note that it may be any register, but most compilers will pick EBX - it is one of the registers least used for other tasks (e.g. multiplication and division are wired to use EAX/EDX, loops use ECX, ESI/EDI are used for string operations etc.). It is also one of the few callee-saved registers in virtually every widely used calling convention, which means that the caller doesn’t have to save it around every function call.
All global and/or static data is accessed through the GOT. Local variables are stored on the stack and accessed relative to ESP or EBP. Functions are called relative to EIP, and the return address is saved on the stack, from where it is read when returning from the function. With all of this, a program should be able to run without making assumptions about any absolute address.
Workaround for booting problem
There is a way to boot ESXi 6.7U3 (perhaps older updates as well, not tested). It comes down to marking as reserved the memory range where mboot.c32 (and other c32 files such as menu.c32) are loaded.
Keep in mind that this is not a solution. It only allows the ESXi installer to boot. It was not tested against booting other OSes or an installed version of ESXi. Use at your own risk.
|
|
The log produced after applying the above change starts with:
|
|
In these lines we can see that the specified region was appended to the previous reserved range, e820[2], because there is no need to use two separate fields when one would suffice. There are other worrisome lines, however.
One of those is "mboot __executable_start is at 0x160000" - it is well within the part of memory where we told it not to be. It was loaded at this address by the previous module - menu.c32 in this case - so it suggests that it is not a bug in mboot.c32, as we initially thought.
The second visible problem is in the malloc arena - it also reports that a part of the memory in the reserved range is free to use by the module. This issue is a direct result of the previous one. Both are caused by the way SYSLINUX scans memory.
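To make the two observations concrete, here is a sketch of how adjacent reserved ranges collapse into one entry, and how to look up which entry a given address falls into. It is not the code used by coreboot or SYSLINUX; the function names are ours, the map is a made-up example, and the reserved range added by the workaround is a hypothetical one that merely happens to contain 0x160000.

```python
# Merge touching e820 entries of the same type, then look up an address.
# The map and the added reserved range are hypothetical; entries are assumed
# not to overlap entries of a different type.

E820_RAM = 1
E820_RESERVED = 2

e820 = [
    (0x00000000, 0x0009FC00, E820_RAM),
    (0x0009FC00, 0x00000400, E820_RESERVED),
    (0x000F0000, 0x00010000, E820_RESERVED),
    (0x00100000, 0x00100000, E820_RESERVED),  # range added by the workaround
    (0x00200000, 0x7FE00000, E820_RAM),
]


def merge_e820(entries):
    """Join entries that have the same type and touch or overlap each other."""
    merged = []
    for base, length, kind in sorted(entries):
        if merged:
            prev_base, prev_len, prev_kind = merged[-1]
            if prev_kind == kind and base <= prev_base + prev_len:
                end = max(prev_base + prev_len, base + length)
                merged[-1] = (prev_base, end - prev_base, kind)
                continue
        merged.append((base, length, kind))
    return merged


def entry_type(entries, address):
    """Return the type of the entry containing `address`, or None."""
    for base, length, kind in entries:
        if base <= address < base + length:
            return kind
    return None


merged = merge_e820(e820)
print(len(e820), len(merged))                             # prints: 5 4
print(entry_type(merged, 0x160000) == E820_RESERVED)      # prints: True
```

In this model the added range is absorbed into the preceding reserved entry, and a lookup for 0x160000 reports reserved memory, so a loader that consulted the map would not place a module there.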
To be continued…
If you think we can help in improving the security of your firmware, or you are looking for someone who can boost your product by leveraging advanced features of the hardware platform you use, feel free to book a call with us or drop us an email at contact<at>3mdeb<dot>com. If you are interested in similar content, feel free to sign up for our newsletter.