eBPF Oneliner with bpftrace for Tracing NVMe Driver Data Structure Members

Introduction

Tracing kernel and kernel driver data is crucial when you need to analyze complex scenarios, performance problems, or driver bugs.

bpftrace is a tool that provides a high-level language for writing eBPF programs that trace kernel data. It uses LLVM as a backend to compile scripts into eBPF bytecode at runtime, and it relies on libraries such as libbpf and BCC to interact with the kernel's BPF subsystem. You can use bpftrace to write eBPF one-liners or short eBPF programs. Although the bpftrace language is limited and is not suited to complex eBPF programs, there are still many scenarios where it comes in handy. It has built-in support for aggregating information and creating histograms, and it provides output formatting that helps present tracing results clearly in the terminal. With bpftrace you can attach to tracing events such as kprobes, uprobes and tracepoints.

You can list all the tracing events that bpftrace is able to attach to with:

bash
sudo bpftrace -l
bash
man bpftrace
    -l [search]    list probes

For example, to list the probes related to NVMe driver commands:

bash
sudo bpftrace -l | grep 'nvme.*cmd'

kfunc:nvme_core:__traceiter_nvme_setup_cmd
kfunc:nvme_core:nvme_cleanup_cmd
kfunc:nvme_core:nvme_cmd_allowed
kfunc:nvme_core:nvme_dev_uring_cmd
kfunc:nvme_core:nvme_ns_chr_uring_cmd
kfunc:nvme_core:nvme_ns_chr_uring_cmd_iopoll
kfunc:nvme_core:nvme_ns_head_chr_uring_cmd
kfunc:nvme_core:nvme_ns_head_chr_uring_cmd_iopoll
kfunc:nvme_core:nvme_ns_uring_cmd
kfunc:nvme_core:nvme_setup_cmd
kfunc:nvme_core:nvme_submit_sync_cmd
kfunc:nvme_core:nvme_trace_parse_admin_cmd
kfunc:nvme_core:nvme_trace_parse_fabrics_cmd
kfunc:nvme_core:nvme_trace_parse_nvm_cmd
kfunc:nvme_core:nvme_uring_cmd_end_io
kfunc:nvme_core:nvme_uring_cmd_end_io_meta
kfunc:nvme_core:nvme_uring_cmd_io
kfunc:nvme_core:nvme_user_cmd64
kprobe:__nvme_submit_sync_cmd
kprobe:__traceiter_nvme_setup_cmd
kprobe:nvme_cleanup_cmd
kprobe:nvme_cmd_allowed
kprobe:nvme_dev_uring_cmd
kprobe:nvme_ns_chr_uring_cmd
kprobe:nvme_ns_chr_uring_cmd_iopoll
kprobe:nvme_ns_head_chr_uring_cmd
kprobe:nvme_ns_head_chr_uring_cmd_iopoll
kprobe:nvme_ns_uring_cmd
kprobe:nvme_setup_cmd
kprobe:nvme_submit_sync_cmd
kprobe:nvme_submit_user_cmd
kprobe:nvme_trace_parse_admin_cmd
kprobe:nvme_trace_parse_fabrics_cmd
kprobe:nvme_trace_parse_nvm_cmd
kprobe:nvme_uring_cmd_end_io
kprobe:nvme_uring_cmd_end_io_meta
kprobe:nvme_uring_cmd_io
kprobe:nvme_user_cmd.constprop.0
kprobe:nvme_user_cmd64
tracepoint:nvme:nvme_setup_cmd
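
The listing above mixes several probe types (kfunc, kprobe and tracepoint). As a minimal sketch, the shell function below counts how many probes of each type appear in such a list. The name probe_type_summary is made up for this illustration and is not part of bpftrace. It only reads probe names from standard input, so it can be fed from "sudo bpftrace -l" but does not need bpftrace itself:

```shell
# Hypothetical helper, not part of bpftrace: count probes per type.
# Input: one probe per line, e.g. "kprobe:nvme_setup_cmd".
probe_type_summary() {
    # The probe type is the text before the first colon.
    awk -F: 'NF > 1 { count[$1]++ }
             END { for (t in count) print t, count[t] }' | sort
}

# Intended use (needs root and bpftrace):
#   sudo bpftrace -l | grep 'nvme.*cmd' | probe_type_summary
```

Each output line holds a probe type followed by its count, which makes it easy to see whether a function is reachable through a kprobe, a kfunc, a tracepoint, or several of them.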

As an example, we will attach a kprobe to the "nvme_submit_user_cmd" function in order to capture some members of a data structure passed to it: one of the function's arguments is a pointer to an NVMe-specific data structure type.

Setting Up bpftrace

First, install bpftrace (the command below is for Debian or Ubuntu based systems):

bash
sudo apt install bpftrace

Analyzing the nvme_submit_user_cmd Function

Now let's look at how the "nvme_submit_user_cmd" function is defined in the kernel source: what arguments it takes and what we can extract from them.

The kernel source shows that it takes a number of arguments, and that the second one is a pointer of type "struct nvme_command":

c
drivers/nvme/host/ioctl.c:141

static int nvme_submit_user_cmd(struct request_queue *q,
                struct nvme_command *cmd, u64 ubuffer,
                unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
                u32 meta_seed, u64 *result, unsigned timeout, bool vec)

Let's check which members "struct nvme_command" has:

c
grep -rnI "struct nvme_command {" kernel_src/

include/linux/nvme.h:1531

struct nvme_command {
        union {
                struct nvme_common_command common;
                struct nvme_rw_command rw;
                struct nvme_identify identify;
                struct nvme_features features;
                struct nvme_create_cq create_cq;
                struct nvme_create_sq create_sq;
                struct nvme_delete_queue delete_queue;
                struct nvme_download_firmware dlfw;
                struct nvme_format_cmd format;
                struct nvme_dsm_cmd dsm;
                struct nvme_write_zeroes_cmd write_zeroes;
                struct nvme_zone_mgmt_send_cmd zms;
                struct nvme_zone_mgmt_recv_cmd zmr;
                struct nvme_abort_cmd abort;
                struct nvme_get_log_page_command get_log_page;
                struct nvmf_common_command fabrics;
                struct nvmf_connect_command connect;
                struct nvmf_property_set_command prop_set;
                struct nvmf_property_get_command prop_get;
                struct nvme_dbbuf dbbuf;
                struct nvme_directive_cmd directive;
        };
};

Within "struct nvme_command" we would like to access the "struct nvme_common_command common" member. Now let's check the layout of the "struct nvme_common_command" data structure in the Kernel code:

c
grep -rnI "struct nvme_common_command {" kernel_src/

./include/linux/nvme.h:901

struct nvme_common_command {
        __u8                    opcode;
        __u8                    flags;
        __u16                   command_id;
        __le32                  nsid;
        __le32                  cdw2[2];
        __le64                  metadata;
        union nvme_data_ptr     dptr;
        struct_group(cdws,
        __le32                  cdw10;
        __le32                  cdw11;
        __le32                  cdw12;
        __le32                  cdw13;
        __le32                  cdw14;
        __le32                  cdw15;
        );
};

"struct nvme_common_command" has an "opcode" member whose value we want to capture using bpftrace.

Writing the bpftrace Oneliner

Now let's write the bpftrace oneliner using the high-level language provided by bpftrace:

bash
sudo bpftrace -e 'kprobe:nvme_submit_user_cmd { 
printf("opcode: %x\n", ((struct nvme_command *)arg1)->common.opcode);
printf("command_id: %x\n", ((struct nvme_command *)arg1)->common.command_id); 
printf("nsid: %x\n", ((struct nvme_command *)arg1)->common.nsid); 
printf("cdw10: %x\n", ((struct nvme_command *)arg1)->common.cdw10); 
}'

As you can see the oneliner attaches a kprobe to the "nvme_submit_user_cmd" function.

In a kprobe, arg0 is the first argument of the traced function, arg1 is the second argument, and so on.

So in order to refer to the second argument of the "nvme_submit_user_cmd" function, that is "struct nvme_command *cmd", we write arg1 in the oneliner.

bpftrace exposes arg1 as a plain integer, so we cast it to a "struct nvme_command" pointer to match the function signature. To access members through that pointer we use the arrow operator, just like in C/C++. Since the "common" member of "nvme_command" is an embedded "struct nvme_common_command" and not a pointer, we use the dot operator to reach the members of "nvme_common_command".

The "-e" option tells bpftrace to execute the program given on the command line:

bash
man bpftrace
    -e 'program'   execute this program
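
For anything longer than a line or two, the same program can be kept in a script file instead of being passed with "-e". The sketch below writes a hypothetical nvme_opcode.bt (the file name is an assumption made for this example) that stores the cast pointer in a scratch variable, $cmd, so the cast is written only once. Like the oneliner, it assumes that bpftrace can resolve "struct nvme_command" on the running kernel and that "nvme_submit_user_cmd" is available as a kprobe:

```shell
# Write the probe to a script file (the file name is only an example).
cat > nvme_opcode.bt <<'EOF'
#!/usr/bin/env bpftrace

kprobe:nvme_submit_user_cmd
{
    // arg1 is the second argument: struct nvme_command *cmd
    $cmd = (struct nvme_command *)arg1;

    printf("opcode: %x\n", $cmd->common.opcode);
    printf("command_id: %x\n", $cmd->common.command_id);
    printf("nsid: %x\n", $cmd->common.nsid);
    printf("cdw10: %x\n", $cmd->common.cdw10);
}
EOF

# Then run it (needs root and bpftrace):
#   sudo bpftrace nvme_opcode.bt
```

The output format is the same as with the oneliner. The script form is mainly easier to edit and to keep under version control.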

Testing the Oneliner

Now let's install nvme-cli so that we can send admin commands to one of the NVMe SSD devices:

bash
sudo apt install nvme-cli

Then start running the eBPF oneliner through bpftrace:

bash
sudo bpftrace -e 'kprobe:nvme_submit_user_cmd { 
printf("opcode: %x\n", ((struct nvme_command *)arg1)->common.opcode);
printf("command_id: %x\n", ((struct nvme_command *)arg1)->common.command_id); 
printf("nsid: %x\n", ((struct nvme_command *)arg1)->common.nsid); 
printf("cdw10: %x\n", ((struct nvme_command *)arg1)->common.cdw10); 
}'
Attaching 1 probe...

The output shows that one kprobe is now attached.

We can also verify this with:

bash
sudo cat /sys/kernel/debug/kprobes/list
ffffffffc077a380  k  nvme_submit_user_cmd+0x0  nvme_core [FTRACE]

Now let's list the NVMe devices:

bash
sudo nvme list
Node                  Generic               SN                   Model                                    Namespace Usage                      Format           FW Rev  
--------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1          /dev/ng0n1            S4DYNX0R756769       SAMSUNG MZVLB512HBJQ-000L2               1          94.17  GB / 512.11  GB    512   B +  0 B   3L1QEXF7

As soon as we run the "nvme list" command, bpftrace captures and prints the following data:

text
opcode: 6
command_id: 0
nsid: 1
cdw10: 0

opcode: 6
command_id: 0
nsid: 1
cdw10: 3

This means that the nvme-cli tool triggers the "nvme_submit_user_cmd" kernel driver function twice, both times with opcode 0x6 (hex), which is the NVMe admin command "Identify". The two calls differ only in cdw10 (0 and 3).
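
Looking up opcodes by hand gets tedious quickly. As a rough sketch, the hypothetical helper below (nvme_admin_opcode_name is a made-up name) maps the hex value printed by the oneliner to an admin command name. It covers only a subset of the admin opcodes from the NVMe Base Specification, and it assumes the command was an admin command: the same function also submits I/O passthru commands, for which the opcode values mean something else.

```shell
# Hypothetical helper: translate an admin opcode, as printed by the
# oneliner with %x (e.g. "6" or "14"), into a command name.
# Only a subset of the admin opcodes is listed.
nvme_admin_opcode_name() {
    # Normalize to two lowercase hex digits; an optional 0x prefix is accepted.
    op=$(printf '%02x' "$((0x${1#0x}))")
    case "$op" in
        00) echo "Delete I/O Submission Queue" ;;
        01) echo "Create I/O Submission Queue" ;;
        02) echo "Get Log Page" ;;
        04) echo "Delete I/O Completion Queue" ;;
        05) echo "Create I/O Completion Queue" ;;
        06) echo "Identify" ;;
        08) echo "Abort" ;;
        09) echo "Set Features" ;;
        0a) echo "Get Features" ;;
        10) echo "Firmware Commit" ;;
        11) echo "Firmware Image Download" ;;
        14) echo "Device Self-test" ;;
        80) echo "Format NVM" ;;
        *)  echo "unknown (0x$op)" ;;
    esac
}

nvme_admin_opcode_name 6     # prints "Identify"
nvme_admin_opcode_name 0x14  # prints "Device Self-test"
```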

Now let's run an NVMe admin passthru command to trigger a short device self-test on the NVMe SSD:

bash
sudo nvme admin-passthru /dev/nvme0 -n 0x1 --opcode=0x14 --cdw10=0x1 -r

Admin Command Device Self-test is Success and result: 0x00000000

Now bpftrace captures and prints out the following data:

text
opcode: 14
command_id: 0
nsid: 1
cdw10: 1

Opcode 0x14 (hex) means "Device Self-test" according to the "NVM Express Base Specification Revision 2.0a", "Figure 138: Opcodes for Admin Commands". cdw10 (Command Dword 10) is a command-specific field; here it holds the value 0x1 that was passed with --cdw10. The namespace ID is 0x1.
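
Because cdw10 is command specific, its meaning has to be decoded per opcode. As an illustration for the Device Self-test case only, the sketch below decodes the Self-test Code (STC), which, as far as I read the NVMe Base Specification, sits in bits 3:0 of cdw10. The function name nvme_self_test_action is hypothetical, and the value table should be checked against the specification revision you are working with:

```shell
# Hypothetical helper: decode the Self-test Code (STC) carried in
# bits 3:0 of cdw10 for a Device Self-test command (opcode 0x14).
# Input is cdw10 in hex, as printed by the oneliner (e.g. "1").
nvme_self_test_action() {
    stc=$(( 0x${1#0x} & 0xf ))   # keep only bits 3:0
    case "$stc" in
        1)  echo "start short device self-test" ;;
        2)  echo "start extended device self-test" ;;
        14) echo "vendor specific" ;;
        15) echo "abort device self-test" ;;
        *)  echo "reserved" ;;
    esac
}

nvme_self_test_action 1   # prints "start short device self-test"
```

This matches the captured cdw10 value of 1 for the short self-test started with the passthru command.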

Conclusion

In conclusion, we were able to capture members of the "nvme_common_command" structure embedded in the "nvme_command" structure simply by writing a short eBPF one-liner and loading it through bpftrace as a kprobe attached to the "nvme_submit_user_cmd" function.
