Inline function tracing with the kinst DTrace provider

tags: bsd · programming · projects

Table of Contents

  1. Quick background
  2. Usage
  3. Inline function tracing

Quick background

DTrace is a framework that gives administrators and kernel developers the ability to observe kernel behavior in real time. DTrace has modules called “providers”, that perform a particular instrumentation in the kernel (and sometimes userland) using “probes”.

kinst is a new low-level DTrace provider co-authored by Christos Margiolis and Mark Johnston for the FreeBSD operating system, which allows the user to trace arbitrary instructions in kernel functions. It is part of the base system as of FreeBSD 14.0.

kinst probes take the form of kinst::<function>:<instruction>, where <function> is the kernel function to be traced, and <instruction> is the offset to the instruction, relative to the beginning of the function, and can be obtained from the function’s disassembly. If the <instruction> field is left empty, kinst will trace all instructions in that function. Unlike FBT, kinst can also trace the entry and return points of inline functions (see Inline function tracing).

The origin of the name is inspired from an early paper written by A. Tamches and B. Miller discussing a tracing tool they developed called “KernInst”.

Usage

Find the offset corresponding to the third instruction in vm_fault() and trace it, printing the contents of the RSI register:

# kgdb
(kgdb) disas /r vm_fault
Dump of assembler code for function vm_fault:
   0xffffffff80f4e470 <+0>:     55      push   %rbp
   0xffffffff80f4e471 <+1>:     48 89 e5        mov    %rsp,%rbp
   0xffffffff80f4e474 <+4>:     41 57   push   %r15
...

# dtrace -n 'kinst::vm_fault:4 {printf("%#x", regs[R_RSI]);}'
  2  81500                       vm_fault:4 0x827c56000
  2  81500                       vm_fault:4 0x827878000
  2  81500                       vm_fault:4 0x1fab9bef0000
  2  81500                       vm_fault:4 0xe16cf749000
  0  81500                       vm_fault:4 0x13587c366000
  ^C

Trace the return point of critical_enter(), which is an inline function:

# dtrace -n 'kinst::critical_enter:return'
dtrace: description 'kinst::critical_enter:return' matched 130 probes
CPU     ID                    FUNCTION:NAME
  1  71024                spinlock_enter:53
  0  71024                spinlock_enter:53
  1  70992                uma_zalloc_arg:49
  1  70925    malloc_type_zone_allocated:21
  1  70994                uma_zfree_arg:365
  1  70924             malloc_type_freed:21
  1  71024                spinlock_enter:53
  0  71024                spinlock_enter:53
  0  70947         _epoch_enter_preempt:122
  0  70949           _epoch_exit_preempt:28
  ^C

Inline function tracing

How it works

To trace inline functions, libdtrace makes use of the DWARF Debugging Standard, to detect if the function specified is an inline call. If it is, D syntax is transformed to create kinst probes for each of the inline copies found. All work is done in libdtrace, instead of kinst(4). This feature has been added to FreeBSD with this patch.

Contrary to how kinst expects a <function>:<instruction> tuple to create probes, for inline functions, <instruction> is replaced by entry and return.

Syntactic transformations

Suppose the user wants to trace a probe of the form:

kinst::<func>:<entry|return>
/<pred>/
{
	<acts>
}

libdtrace sees that we have specified entry or return, instead of an offset, which is what a regular kinst probe would look like, so it loops through all loaded kernel modules and parses their DWARF and ELF info to see if this function is an inline — if not, the probe is converted to an FBT one, so that we don’t duplicate FBT’s functionality in kinst:

# dtrace -dn 'kinst::malloc:entry {exit(0);}'
fbt::malloc:entry
{
        exit(0x0);
}

dtrace: description 'kinst::malloc:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  2  31144                     malloc:entry

If the function however is an inline, libdtrace will find all calls refering to this function and create new probes for each one of the inline copies found.

# dtrace -dn 'kinst::cam_iosched_has_more_trim:entry { printf("\t%d\t%s", pid, execname); }'
kinst::cam_iosched_get_trim:13,
kinst::cam_iosched_next_bio:13,
kinst::cam_iosched_schedule:40
{
	printf("\t%d\t%s", pid, execname);
}

dtrace: description 'kinst::cam_iosched_has_more_trim:entry ' matched 4 probes
CPU     ID                    FUNCTION:NAME
  0  81502          cam_iosched_schedule:40     2       clock
  0  81501          cam_iosched_next_bio:13     2       clock
  2  81502          cam_iosched_schedule:40     2       clock
  1  81502          cam_iosched_next_bio:13     0	kernel
  1  81503          cam_iosched_schedule:40     0	kernel
^C

There can also be both inline and non-inline definitions of the same function. In this case, kinst creates an additional FBT probe for the non-inline definition.

The -d flag used in these examples to dump the D script after libdtrace has applied syntactic transformations, has been added to DTrace in commit 1e136a9cbd3a.

Heuristic for calculating the entry and return offsets

libdtrace reuses parts of the mechanism implemented in my inlinecall(1) program, which finds and prints all call sites of a given inline function:

$ ./inlinecall vm_page_mvqueue /usr/lib/debug/boot/kernel/kernel.debug
/usr/src/sys/vm/vm_page.c:4142
        [0xffffffff80f91541 - 0xffffffff80f91599]       /usr/src/sys/vm/vm_page.c:4195  vm_page_readahead_finish()
        [0xffffffff80f915f5 - 0xffffffff80f91603]       /usr/src/sys/vm/vm_page.c:4195  vm_page_readahead_finish()
        [0xffffffff80f9163d - 0xffffffff80f916c2]       /usr/src/sys/vm/vm_page.c:4184  vm_page_activate()
        [0xffffffff80f916cd - 0xffffffff80f916e5]       /usr/src/sys/vm/vm_page.c:4184  vm_page_activate()
        [0xffffffff80f916fe - 0xffffffff80f91747]       /usr/src/sys/vm/vm_page.c:4195  vm_page_deactivate()
        [0xffffffff80f91750 - 0xffffffff80f91768]       /usr/src/sys/vm/vm_page.c:4195  vm_page_deactivate()
        [0xffffffff80f94a59 - 0xffffffff80f94aa9]       /usr/src/sys/vm/vm_page.c:4195  vm_page_reclaim_contig_domain()
        [0xffffffff80f94de4 - 0xffffffff80f94df9]       /usr/src/sys/vm/vm_page.c:4195  vm_page_reclaim_contig_domain()
        [0xffffffff80f9661e - 0xffffffff80f96667]       /usr/src/sys/vm/vm_page.c:4202  vm_page_deactivate_noreuse()
        [0xffffffff80f96670 - 0xffffffff80f96688]       /usr/src/sys/vm/vm_page.c:4202  vm_page_deactivate_noreuse()
        [0xffffffff80f9669e - 0xffffffff80f966ea]       /usr/src/sys/vm/vm_page.c:4212  vm_page_launder()
        [0xffffffff80f966f3 - 0xffffffff80f9670b]       /usr/src/sys/vm/vm_page.c:4212  vm_page_launder()
        [0xffffffff80f96d4c - 0xffffffff80f96dac]       /usr/src/sys/vm/vm_page.c:4212  vm_page_advise()
        [0xffffffff80f96dac - 0xffffffff80f96e07]       /usr/src/sys/vm/vm_page.c:4202  vm_page_advise()

Most of the entries above appear twice but with different boundaries:

        [0xffffffff80f9163d - 0xffffffff80f916c2]       /usr/src/sys/vm/vm_page.c:4184  vm_page_activate()
        [0xffffffff80f916cd - 0xffffffff80f916e5]       /usr/src/sys/vm/vm_page.c:4184  vm_page_activate()

This means that the inline copy’s boundaries are split into more than one parts, which can be caused by having early returns inside the inline function. By using this assumption, we can deduce that the entry point of the function is 0xffffffff80f9163d, and each of the upper boundaries are the return points (i.e 0xffffffff80f916c2 and 0xffffffff80f916e5), so we end up with one entry and two return trace points.

However, this is not exactly true, because the final return address given by DWARF corresponds to the instruction after the actual return instruction. We can verify this in GDB using the two return addresses from the example above:

# kgdb
(kgdb) disas vm_page_activate
...
   0xffffffff80f916c0 <+144>:   jmp    0xffffffff80f91673 <vm_page_activate+67>
   0xffffffff80f916c2 <+146>:   add    $0x8,%rsp			<-- first address given by DWARF
...
   0xffffffff80f916e0 <+176>:   call   0xffffffff80bed360 <panic>	<-- last instruction
									<-- second address should be here

It turns out that 0xffffffff80f916e5 is, in fact, outside vm_page_activate() altogether!

(kgdb) x/i 0xffffffff80f916e5
   0xffffffff80f916e5:  data16 cs nopw 0x0(%rax,%rax,1)

After running a couple of tests, I came to the conclusion that there are two possible cases with return addresses:

In order to fix this, we have to go one instruction back whenever we come across one of those 3 cases.

Finally the entry offset is calculated as:

inline_bound_lo - caller_bound_lo

And the return one, including the modifications (if any) discussed above, as:

inline_bound_hi - caller_bound_lo

These offsets are then used to create regular kinst probes of the form kinst::<func>:<instruction>, which is what kinst actually expects:

# dtrace -dn 'kinst::vm_page_mvqueue:entry,kinst::vm_page_mvqueue:return'
dtrace: description 'kinst::vm_page_mvqueue:entry,kinst::vm_page_mvqueue:return' matched 22 probes
kinst::vm_page_readahead_finish:33,
kinst::vm_page_readahead_finish:121,
kinst::vm_page_activate:13,
kinst::vm_page_deactivate:14,
kinst::vm_page_reclaim_contig_domain:1961,
kinst::vm_page_deactivate_noreuse:14,
kinst::vm_page_launder:14,
kinst::vm_page_advise:236,
kinst::vm_page_advise:332,
kinst::vm_page_readahead_finish:220,
kinst::vm_page_activate:146,
kinst::vm_page_activate:176,
kinst::vm_page_deactivate:87,
kinst::vm_page_deactivate:115,
kinst::vm_page_reclaim_contig_domain:2041,
kinst::vm_page_reclaim_contig_domain:2884,
kinst::vm_page_deactivate_noreuse:87,
kinst::vm_page_deactivate_noreuse:115,
kinst::vm_page_launder:90,
kinst::vm_page_launder:118,
kinst::vm_page_advise:330,
kinst::vm_page_advise:421
{
}

CPU     ID                    FUNCTION:NAME
  3  95381              vm_page_activate:13 
  3  95389             vm_page_activate:146 
  2  95381              vm_page_activate:13 
  2  95389             vm_page_activate:146 
  1  95387               vm_page_advise:332 
  1  95400               vm_page_advise:421 
  1  95387               vm_page_advise:332 
  1  95400               vm_page_advise:421 
  1  95387               vm_page_advise:332 
^C