What is the easiest way to start writing eBPF programs? How can you write a simple eBPF program that traces specific data structure members used by Kernel functions, and how can you attach your eBPF program to the specific events, syscalls and kernel functions you aim to trace?
There are several different libraries and frameworks that you can use to write eBPF applications. The easiest and most accessible way to write an eBPF program from scratch is to use the BCC Python framework.
The BCC project at https://github.com/iovisor/bcc contains a great number of CLI tools for tracing the Linux system and Kernel; its tools directory contains many Python-based eBPF examples: https://github.com/iovisor/bcc/tree/master/tools.
BCC was the first popular project for implementing eBPF programs, providing a framework for both the user space and kernel space sides, and thereby making eBPF development relatively easy for programmers without much kernel experience. It is still not recommended for production-level eBPF development (libbpf-based programs are the usual choice there), but it is an excellent way to take your first steps in eBPF development.
eBPF programs can be used to dynamically change the behavior of the system: the eBPF code starts taking effect as soon as it is attached to an event, which can be a syscall or a kernel function.
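For example, here is a minimal sketch (based on BCC's classic "hello world" example) that attaches a kprobe to the clone() syscall and prints a trace line every time the syscall fires:

#!/usr/bin/python
from bcc import BPF

# minimal eBPF program: print a line whenever the probed function runs
prog = """
int hello(void *ctx) {
    bpf_trace_printk("clone() called\\n");
    return 0;
}
"""
b = BPF(text=prog)
# resolve the kernel symbol name of the clone syscall and attach the kprobe
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")
b.trace_print()  # stream lines from the kernel trace pipe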
The main objective of this article is to teach you how to write an eBPF program in the easiest way possible, but with a slightly more serious example: tracing the Kernel data structure members used by a Kernel function.
Below is a Python application which contains the eBPF program and uses the BCC framework to compile it and load it into the Kernel:
vi kprobe_nvme.py
--------------------------------------------------------------------
#!/usr/bin/python
from bcc import BPF
from bcc.utils import printb
from time import strftime
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
struct data_t {
    u32 pid;
    char comm[TASK_COMM_LEN];
    u8 opcode;
    u16 command_id;
    u32 nsid;
    u32 cdw10;
};

// local copies of the NVMe command structures (mirroring include/linux/nvme.h)
// so the eBPF program can dereference the traced pointer
struct nvme_sgl_desc {
    __le64 addr;
    __le32 length;
    __u8 rsvd[3];
    __u8 type;
};
struct nvme_keyed_sgl_desc {
    __le64 addr;
    __u8 length[3];
    __u8 key[4];
    __u8 type;
};
union nvme_data_ptr {
    struct {
        __le64 prp1;
        __le64 prp2;
    };
    struct nvme_sgl_desc sgl;
    struct nvme_keyed_sgl_desc ksgl;
};
struct nvme_common_command {
    __u8 opcode;
    __u8 flags;
    __u16 command_id;
    __le32 nsid;
    __le32 cdw2[2];
    __le64 metadata;
    union nvme_data_ptr dptr;
    struct_group(cdws,
        __le32 cdw10;
        __le32 cdw11;
        __le32 cdw12;
        __le32 cdw13;
        __le32 cdw14;
        __le32 cdw15;
    );
};
struct nvme_command {
    union {
        struct nvme_common_command common;
    };
};

BPF_PERCPU_ARRAY(unix_data, struct data_t, 1);  // scratch map (unused in this example)
BPF_PERF_OUTPUT(events);

// kprobe handler: BCC maps the probed function's arguments onto the
// parameters following ctx, so cmd receives nvme_submit_user_cmd()'s
// second argument
int trace_nvme_submit_user_cmd(struct pt_regs *ctx,
    void *q,
    struct nvme_command *cmd)
{
    struct data_t data = {};
    u64 id = bpf_get_current_pid_tgid();
    u32 pid = id >> 32; // PID is the upper 32 bits
    data.pid = pid;

    // get current process name
    bpf_get_current_comm(&data.comm, sizeof(data.comm));

    // copy each traced member out of kernel memory
    __u8 a_opcode;
    bpf_probe_read_kernel(&a_opcode, sizeof(a_opcode), &cmd->common.opcode);
    data.opcode = a_opcode;

    __u16 a_command_id;
    bpf_probe_read_kernel(&a_command_id, sizeof(a_command_id), &cmd->common.command_id);
    data.command_id = a_command_id;

    __le32 a_nsid;
    bpf_probe_read_kernel(&a_nsid, sizeof(a_nsid), &cmd->common.nsid);
    data.nsid = a_nsid;

    __le32 a_cdw10;
    bpf_probe_read_kernel(&a_cdw10, sizeof(a_cdw10), &cmd->common.cdws.cdw10);
    data.cdw10 = a_cdw10;

    // send the record to user space via the perf buffer
    events.perf_submit(ctx, &data, sizeof(data));
    return 0;
}
"""
# process event
def print_event(cpu, data, size):
    event = b["events"].event(data)
    print("%-9s %-9s %-7s %-8x %-12x %-6x %-6x" % (
        strftime("%H:%M:%S"),
        event.comm,
        event.pid,
        event.opcode,
        event.command_id,
        event.nsid,
        event.cdw10,
    ))

# initialize BPF
b = BPF(text=bpf_text)
b.attach_kprobe(event="nvme_submit_user_cmd", fn_name="trace_nvme_submit_user_cmd")

# header
print("%-9s %-9s %-7s %-8s %-12s %-6s %-6s" % (
    "TIME", "COMM", "PID", "OPCODE", "COMMAND-ID", "NSID", "CDW10"))

# read events: loop with callback to print_event
b["events"].open_perf_buffer(print_event, page_cnt=64)
while 1:
    try:
        b.perf_buffer_poll(timeout=1000)
    except KeyboardInterrupt:
        exit()
--------------------------------------------------------------------
More specifically, our Python code loads the embedded eBPF program, written in C, into the Kernel and attaches a kprobe to trace the "nvme_submit_user_cmd" function of the NVMe Kernel driver.
To trigger the invocation of the "nvme_submit_user_cmd" Kernel function, we need the nvme-cli tool, which lets us send admin commands to NVMe SSD devices. Let's install it:
sudo apt install nvme-cli
Now let's look at the Kernel code to see what the "nvme_submit_user_cmd" function definition looks like, what arguments it takes, and what we can extract from the arguments passed to it.
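Assuming you have a kernel source tree checked out locally under kernel_src/ (the same hypothetical layout used in the greps below), you can locate the function with:

grep -rn "nvme_submit_user_cmd" kernel_src/drivers/nvme/host/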
We can see in the Kernel source that it takes a number of arguments, the second of which is a pointer of type "struct nvme_command":
drivers/nvme/host/ioctl.c:141
static int nvme_submit_user_cmd(struct request_queue *q,
        struct nvme_command *cmd, u64 ubuffer,
        unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
        u32 meta_seed, u64 *result, unsigned timeout, bool vec)
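BCC does this argument mapping for us: every parameter we declare after ctx in the kprobe handler is matched positionally to the probed function's arguments, which is how our handler receives cmd as the second argument. As a sketch of the equivalent manual approach (assuming the x86-64 calling convention), you could fetch the argument yourself with BCC's PT_REGS_PARM2() helper:

int trace_nvme_submit_user_cmd(struct pt_regs *ctx)
{
    /* second argument of nvme_submit_user_cmd(), read from the saved registers */
    struct nvme_command *cmd = (struct nvme_command *)PT_REGS_PARM2(ctx);
    __u8 opcode;
    bpf_probe_read_kernel(&opcode, sizeof(opcode), &cmd->common.opcode);
    return 0;
}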
Let's check in the Kernel source code which members "struct nvme_command" has:
grep -rnI "struct nvme_command {" kernel_src/
include/linux/nvme.h:1531
struct nvme_command {
    union {
        struct nvme_common_command common;
        struct nvme_rw_command rw;
        struct nvme_identify identify;
        struct nvme_features features;
        struct nvme_create_cq create_cq;
        struct nvme_create_sq create_sq;
        struct nvme_delete_queue delete_queue;
        struct nvme_download_firmware dlfw;
        struct nvme_format_cmd format;
        struct nvme_dsm_cmd dsm;
        struct nvme_write_zeroes_cmd write_zeroes;
        struct nvme_zone_mgmt_send_cmd zms;
        struct nvme_zone_mgmt_recv_cmd zmr;
        struct nvme_abort_cmd abort;
        struct nvme_get_log_page_command get_log_page;
        struct nvmf_common_command fabrics;
        struct nvmf_connect_command connect;
        struct nvmf_property_set_command prop_set;
        struct nvmf_property_get_command prop_get;
        struct nvme_dbbuf dbbuf;
        struct nvme_directive_cmd directive;
    };
};
Within "struct nvme_command" we would like to access the "struct nvme_common_command common" member. Now let's check the layout of the "struct nvme_common_command" data structure in the Kernel source code:
grep -rnI "struct nvme_common_command {" kernel_src/
./include/linux/nvme.h:901
struct nvme_common_command {
    __u8 opcode;
    __u8 flags;
    __u16 command_id;
    __le32 nsid;
    __le32 cdw2[2];
    __le64 metadata;
    union nvme_data_ptr dptr;
    struct_group(cdws,
        __le32 cdw10;
        __le32 cdw11;
        __le32 cdw12;
        __le32 cdw13;
        __le32 cdw14;
        __le32 cdw15;
    );
};
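The struct_group() macro comes from include/linux/stddef.h; it wraps its members in an anonymous union so they remain accessible both directly and through a named group, which is why our eBPF program reads the field as cmd->common.cdws.cdw10. Roughly, the cdws group expands to:

/* rough expansion of struct_group(cdws, ...) */
union {
    /* members stay addressable directly, e.g. common.cdw10 */
    struct {
        __le32 cdw10, cdw11, cdw12, cdw13, cdw14, cdw15;
    };
    /* ...and as a named group, e.g. common.cdws.cdw10 */
    struct {
        __le32 cdw10, cdw11, cdw12, cdw13, cdw14, cdw15;
    } cdws;
};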
"struct nvme_common_command" has this "opcode" member plus a number of other member like command_id, nsid (Namespace ID), cdw10 (Command Dword 10 is an NVMe command specific field) which values we want to trace with our eBPF program.
Now let's load the eBPF program into the Kernel. The Python script, using the BCC framework, builds and loads our eBPF program; it needs to run with sudo:
sudo ./kprobe_nvme.py
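As an optional check, if bpftool is installed on your system, it can list the eBPF programs currently loaded in the Kernel (the exact output varies by kernel and bpftool version):

sudo bpftool prog show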
As soon as we run the nvme-cli tool to list the NVMe devices on the system:
sudo nvme list
Node Generic SN Model Namespace Usage Format FW Rev
--------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 /dev/ng0n1 S4DYNX0R756769 SAMSUNG MZVLB512HBJQ-000L2 1 93.05 GB / 512.11 GB 512 B + 0 B 3L1QEXF7
The "nvme_submit_user_cmd" function is invoked in the NVMe Kernel driver level, and the kprobe attached by our eBPF program will trace the NVMe data structure members that we are hooked onto, opcode, command-id, nsid, cdw10:
sudo ./kprobe_nvme.py
TIME COMM PID OPCODE COMMAND-ID NSID CDW10
02:34:13 b'nvme' 32086 6 0 1 0
02:34:13 b'nvme' 32086 6 0 1 3
This means that the nvme-cli tool triggered the "nvme_submit_user_cmd" NVMe Kernel driver function twice, both times with the same opcode, 0x6, which is the "Identify" admin command in the NVMe specification. For Identify, CDW10 carries the CNS (Controller or Namespace Structure) field: the value 0 requests the Identify Namespace data structure and 3 the Namespace Identification Descriptor list.
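You can cross-check this mapping by issuing an explicit Identify Controller command (CNS value 1), which should show up in the trace with OPCODE 6 and CDW10 1:

sudo nvme id-ctrl /dev/nvme0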
Now let's run an NVMe admin passthru command to trigger a short device self-test in the NVMe SSD:
sudo nvme admin-passthru /dev/nvme0 -n 0x1 --opcode=0x14 --cdw10=0x1 -r
Admin Command Device Self-test is Success and result: 0x00000000
Our eBPF program's kprobe captures the struct data members and the Python script prints out the following:
sudo ./kprobe_nvme.py
TIME COMM PID OPCODE COMMAND-ID NSID CDW10
...
02:34:47 b'nvme' 32094 14 0 1 1
Opcode 0x14 means "Device Self-test" according to the "NVM Express Base Specification Revision 2.0a", "Figure 138: Opcodes for Admin Commands". CDW10 (Command Dword 10) carries the command-specific Self-test Code field; the value 0x1 selects the short device self-test we requested. The Namespace ID is 0x1.
In conclusion, we were able to capture these data structure members of the "nvme_common_command" structure embedded in the "nvme_command" structure just by writing a simple eBPF program, built and loaded into the Kernel by a Python script using the BCC framework and attached to a kprobe on the "nvme_submit_user_cmd" function. Whenever the "nvme-cli" utility triggers "nvme_submit_user_cmd" at the NVMe Kernel driver level, the eBPF program writes a trace record into a perf buffer; the Python program then reads the record from the perf buffer and displays it to the user.
For this we use the BPF_PERF_OUTPUT BCC macro, which creates a perf ring buffer map for passing messages from the Kernel to user space and lets the eBPF program submit data in a structure of your choosing into it.
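Distilled to its minimal pattern, the kernel side submits a struct into the perf buffer and the user side polls it with a callback. Here is a self-contained sketch of that pattern, tracing the sync() syscall instead of the NVMe driver:

#!/usr/bin/python
from bcc import BPF

b = BPF(text="""
#include <uapi/linux/ptrace.h>
struct data_t { u32 pid; };
BPF_PERF_OUTPUT(events);
int probe(struct pt_regs *ctx) {
    struct data_t data = {};
    data.pid = bpf_get_current_pid_tgid() >> 32;
    events.perf_submit(ctx, &data, sizeof(data));  // push one event record
    return 0;
}
""")
b.attach_kprobe(event=b.get_syscall_fnname("sync"), fn_name="probe")

# user-space side: callback invoked for every record in the perf buffer
def callback(cpu, data, size):
    print("sync() called by PID %d" % b["events"].event(data).pid)

b["events"].open_perf_buffer(callback)
while True:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        break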