Plain AF_PACKET Socket
An AF_PACKET socket lets a user-space application capture raw packets at the link layer, so it sees the whole packet: the link-layer header first, then everything above it up to the transport header and the application payload.
An application creates an AF_PACKET socket with the `socket` function, like any other type of socket. The call takes the form `socket(AF_PACKET, type, htons(protocol))`, where:
- The first argument, `AF_PACKET`, indicates the socket family.
- The second argument is either `SOCK_RAW` or `SOCK_DGRAM`. To receive packets with the 14-byte Ethernet header, use `SOCK_RAW`; with `SOCK_DGRAM` the link-layer header is removed.
- The third argument specifies the link-layer protocol in network byte order: `ETH_P_ALL` selects all protocols and `ETH_P_IP` selects IPv4. Protocol names follow the `ETH_P_xxx` convention and are defined in the `linux/if_ether.h` header file.
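The call described above can be sketched as follows. The helper name `open_packet_socket` is hypothetical and only wraps the standard `socket` call; opening an AF_PACKET socket normally requires the `CAP_NET_RAW` capability, so the helper simply reports the failure when the process lacks it.

```c
#include <arpa/inet.h>       /* htons */
#include <assert.h>
#include <linux/if_ether.h>  /* ETH_P_ALL, ETH_P_IP */
#include <stdio.h>           /* perror */
#include <sys/socket.h>      /* socket, AF_PACKET, SOCK_RAW, SOCK_DGRAM */

/* Hypothetical helper: opens an AF_PACKET socket of the given type for
 * one ETH_P_xxx protocol. Returns the descriptor, or -1 on failure. */
static int open_packet_socket(int sock_type, unsigned short eth_proto)
{
    /* Only the two socket types discussed above are meaningful here. */
    assert(sock_type == SOCK_RAW || sock_type == SOCK_DGRAM);

    /* The protocol number must be passed in network byte order. */
    int fd = socket(AF_PACKET, sock_type, htons(eth_proto));
    if (fd < 0)
        perror("socket(AF_PACKET)");
    return fd;
}
```

For example, `open_packet_socket(SOCK_RAW, ETH_P_ALL)` asks for every protocol with the Ethernet header included.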
AF_PACKET with MMAP
Receiving and sending packets through a plain AF_PACKET socket is very inefficient: it uses very limited buffers, it requires one system call for every packet captured or sent, and the packet data has to be copied between user and kernel space. PACKET_MMAP was introduced to boost performance by eliminating that copy and by reducing the number of system calls. A ring buffer of configurable size is shared between kernel and user space, so on the receiving side the application only has to wait for packets to arrive. For transmission, multiple packets can be placed in the ring buffer, followed by a single system call that tells the kernel to transmit them.
Ring Buffer
AF_PACKET provides `PACKET_RX_RING` and `PACKET_TX_RING` for packet reception and transmission respectively. A ring buffer is a region of memory that is logically segmented into a number of blocks, each of which is physically contiguous. Each block contains a few frames and each frame has two parts:
- frame header: It contains the status of this frame.
- data buffer: It holds the packet data.
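The block and frame layout implies some simple address arithmetic, sketched below. The function `frame_offset` is a hypothetical helper, and it assumes fixed-size frame slots as used by TPACKET_V1 and TPACKET_V2; a TPACKET_V3 RX ring packs variable-length frames into each block, so this arithmetic does not apply there.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte offset of frame `idx` from the start of a
 * ring made of `block_size`-byte blocks holding fixed `frame_size`-byte
 * frames. Frames never straddle a block boundary, so any space left at
 * the end of a block is skipped. */
static size_t frame_offset(size_t idx, size_t block_size, size_t frame_size)
{
    assert(frame_size > 0 && frame_size <= block_size);

    size_t frames_per_block = block_size / frame_size;
    size_t block = idx / frames_per_block;  /* which block holds the frame */
    size_t slot = idx % frames_per_block;   /* position inside that block */
    return block * block_size + slot * frame_size;
}
```

With 4096-byte blocks and 2048-byte frames, frame 3 is the second frame of the second block, at offset 4096 + 2048.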
`PACKET_MMAP` for AF_PACKET has evolved through three versions:
- TPACKET_V1
- TPACKET_V2
  - Timestamp resolution in nanoseconds instead of microseconds.
  - VLAN metadata is available for packets.
- TPACKET_V3
  - Read / poll works at block level instead of frame level.
  - A poll timeout avoids waiting indefinitely on idle links.
  - RX hash data is available to user-space applications.
TPACKET_V1 is used by default, but switching to TPACKET_V3 with the `setsockopt` function is highly recommended: polling at block level brings a 15% - 20% reduction in CPU usage and a ~20% increase in packet capture rate.
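Selecting the version is a single `setsockopt` call with the `PACKET_VERSION` option, made before any ring is configured. A minimal sketch, where `set_tpacket_version` is a hypothetical wrapper name:

```c
#include <assert.h>
#include <linux/if_packet.h>  /* PACKET_VERSION, TPACKET_V1..V3 */
#include <sys/socket.h>       /* setsockopt, SOL_PACKET */

/* Hypothetical wrapper: selects the TPACKET version for an AF_PACKET
 * socket. Returns 0 on success, -1 with errno set on failure. */
static int set_tpacket_version(int fd, int version)
{
    assert(version == TPACKET_V1 || version == TPACKET_V2 ||
           version == TPACKET_V3);
    return setsockopt(fd, SOL_PACKET, PACKET_VERSION,
                      &version, sizeof(version));
}
```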
To set up the rings for RX and TX, TPACKET_V1 and TPACKET_V2 use `struct tpacket_req` and TPACKET_V3 uses `struct tpacket_req3`; both structs are defined in `uapi/linux/if_packet.h`. The following piece of code sets up the `PACKET_RX_RING` with 128 blocks; each block is 4096 bytes and contains 2 frames of 2048 bytes.
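A minimal sketch of that configuration for TPACKET_V3 is shown here. The helper names are hypothetical, the 60 ms block timeout and the RX hash request are illustrative assumptions rather than values from the text, and the 4096-byte block size assumes 4 KiB pages (the block size has to be a multiple of the page size). The socket is expected to have been switched to TPACKET_V3 beforehand.

```c
#include <assert.h>
#include <linux/if_packet.h>  /* struct tpacket_req3, PACKET_RX_RING */
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: builds a TPACKET_V3 ring request. */
static struct tpacket_req3 make_ring_req(unsigned int block_size,
                                         unsigned int block_nr,
                                         unsigned int frame_size)
{
    struct tpacket_req3 req;

    assert(frame_size > 0 && block_size % frame_size == 0);
    memset(&req, 0, sizeof(req));
    req.tp_block_size = block_size;
    req.tp_block_nr = block_nr;
    req.tp_frame_size = frame_size;
    /* Total frames = frames per block * number of blocks. */
    req.tp_frame_nr = (block_size / frame_size) * block_nr;
    req.tp_retire_blk_tov = 60;  /* assumed block timeout, in ms */
    req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;  /* assumed */
    return req;
}

/* Hypothetical helper: asks the kernel to allocate the RX ring. */
static int setup_rx_ring(int fd, const struct tpacket_req3 *req)
{
    return setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
}
```

Calling `make_ring_req(4096, 128, 2048)` yields 256 frames in a 512 KiB ring.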
Similarly, you can use the `setsockopt` function to set up the `PACKET_TX_RING` for packet transmission. Next, the application maps the ring buffer with the `mmap` function to share the memory between user and kernel space. The ring buffer is laid out in blocks and frames according to the parameters used to set up the `PACKET_RX_RING` or `PACKET_TX_RING`. The call takes the form `mmap(addr, length, prot, flags, fd, offset)`, where:
- The first argument specifies the starting address of the shared buffer. If it is `NULL`, the kernel chooses the address at which to create the mapping.
- The second argument specifies the total size of the shared buffer.
- `PROT_READ|PROT_WRITE` as the third argument indicates that the mapping is readable and writable.
- The flag in the fourth argument determines whether updates to the mapping are visible to other processes mapping the same memory.
- The fifth argument is the file descriptor of the AF_PACKET socket.
- The last argument is an offset, which is always set to 0 when mapping the ring buffer for AF_PACKET.
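The mapping step can be sketched as below. `map_ring` is a hypothetical helper, and the choice of `MAP_SHARED` for the fourth argument is an assumption; the ring size is the block size multiplied by the number of blocks requested earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>  /* mmap, PROT_*, MAP_SHARED, MAP_FAILED */

/* Hypothetical helper: maps the ring of an AF_PACKET socket into the
 * process. Returns the base address, or NULL if mmap fails. */
static void *map_ring(int fd, size_t ring_size)
{
    assert(ring_size > 0);

    void *ring = mmap(NULL,                    /* let the kernel pick */
                      ring_size,               /* block size * block count */
                      PROT_READ | PROT_WRITE,  /* readable and writable */
                      MAP_SHARED,              /* assumed flag */
                      fd,                      /* the AF_PACKET socket */
                      0);                      /* offset is always 0 */
    return ring == MAP_FAILED ? NULL : ring;
}
```

The matching cleanup is `munmap(ring, ring_size)` before the socket is closed.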
Receiving Packets
The status of each frame in the ring is expressed with macros defined in `include/linux/if_packet.h`.
The kernel initializes all frames to `TP_STATUS_KERNEL`. When the kernel receives a packet, it puts it in the ring buffer and updates the status with at least the `TP_STATUS_USER` flag. A user-space application polls the socket file descriptor to check whether there are new packets in the ring. It can read a packet once the status has the `TP_STATUS_USER` flag; after reading, it must zero the status field so that the kernel can reuse the frame buffer for the next received packet. The remaining status flags are explained in the following table.
| Macro | Description |
|---|---|
| | This flag indicates that the frame (and associated metadata) has been truncated because it’s larger than |
| | Indicates there were packet drops since the last time statistics were checked with |
| | This flag indicates that at least the transport header checksum of the packet has already been validated on the kernel side. If the flag is not set, user-space applications are free to check the checksum provided that |
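The ownership handshake described above can be sketched for one TPACKET_V2 frame as follows. `consume_frame` is a hypothetical helper. The mapping of the three table rows to `TP_STATUS_COPY`, `TP_STATUS_LOSING` and `TP_STATUS_CSUM_VALID` is an assumption based on the descriptions matching the kernel's packet_mmap documentation, and the `__atomic` builtins are GCC/Clang extensions used here so that the status word is read before, and written after, the packet data.

```c
#include <assert.h>
#include <linux/if_packet.h>  /* struct tpacket2_hdr, TP_STATUS_* */
#include <stdio.h>

/* Hypothetical helper: handles one RX frame. Returns the number of
 * captured bytes, or -1 if the frame still belongs to the kernel. */
static int consume_frame(struct tpacket2_hdr *hdr)
{
    assert(hdr != NULL);

    unsigned int status = __atomic_load_n(&hdr->tp_status, __ATOMIC_ACQUIRE);
    if (!(status & TP_STATUS_USER))
        return -1;  /* nothing to read yet */

    if (status & TP_STATUS_COPY)       /* assumed: first table row */
        fprintf(stderr, "truncated: %u of %u bytes captured\n",
                hdr->tp_snaplen, hdr->tp_len);
    if (status & TP_STATUS_LOSING)     /* assumed: second table row */
        fprintf(stderr, "kernel reported packet drops\n");
    if (status & TP_STATUS_CSUM_VALID) /* assumed: third table row */
        fprintf(stderr, "transport checksum already validated\n");

    /* With SOCK_RAW the packet starts tp_mac bytes into the frame. */
    const unsigned char *pkt = (const unsigned char *)hdr + hdr->tp_mac;
    (void)pkt;  /* a real application would parse the packet here */
    int captured = (int)hdr->tp_snaplen;

    /* Zero the status to hand the frame back to the kernel. */
    __atomic_store_n(&hdr->tp_status, TP_STATUS_KERNEL, __ATOMIC_RELEASE);
    return captured;
}
```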
TPACKET_V3 block descriptor
Since TPACKET_V3 introduced polling at block level, each block starts with a block descriptor that describes the status and contents of that block. The structure of a block is depicted in the following diagram.
The following example code shows how to do block-level polling with TPACKET_V3 and walk through frames in the block.
Load balancing
The `AF_PACKET` fanout mode enables load balancing for packet reception. You can load-balance packet reception among multiple processes or CPUs based on the following policies.
| Fanout Policy | Description |
|---|---|
| | Schedule to socket by |
| | Schedule to socket by round-robin. |
| | Schedule to socket by CPU packet arrives on. |
| | Schedule to socket by random selection. |
| | If one socket is full, rollover to another. |
| | Schedule to socket by |
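Joining a fanout group is done with the `PACKET_FANOUT` socket option, whose integer argument carries the group id in the lower 16 bits and the policy in the upper 16 bits. The sketch below uses hypothetical helper names. The policy constants come from `linux/if_packet.h` (`PACKET_FANOUT_HASH`, `PACKET_FANOUT_LB`, `PACKET_FANOUT_CPU`, `PACKET_FANOUT_RND`, `PACKET_FANOUT_ROLLOVER`, `PACKET_FANOUT_QM`); treating them as the policies listed in the table, in that order, is an assumption based on the descriptions.

```c
#include <assert.h>
#include <linux/if_packet.h>  /* PACKET_FANOUT, PACKET_FANOUT_* */
#include <sys/socket.h>

/* Hypothetical helper: packs group id (low 16 bits) and policy
 * (high 16 bits) into the PACKET_FANOUT option value. */
static int fanout_arg(int group_id, int policy)
{
    assert(group_id >= 0 && group_id <= 0xffff);
    return (policy << 16) | group_id;
}

/* Hypothetical helper: adds the socket to a fanout group. Every
 * socket that joins the same group id shares the incoming traffic. */
static int join_fanout_group(int fd, int group_id, int policy)
{
    int arg = fanout_arg(group_id, policy);
    return setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg));
}
```

Each worker process or thread would open its own AF_PACKET socket and call `join_fanout_group` with the same group id and policy.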
Transmitting Packets
Frames in the TX ring use their own set of status macros.
First, the kernel initializes all frames to `TP_STATUS_AVAILABLE`. To send a packet, the application fills the data buffer of an available frame, sets `tp_len` to the size of the data, and sets the status field to `TP_STATUS_SEND_REQUEST`. This can be done on multiple frames. Once the application is ready to transmit, it calls `send()`, and all buffers with status `TP_STATUS_SEND_REQUEST` are forwarded to the network device. The kernel marks each frame being sent with `TP_STATUS_SENDING` until the end of the transfer, after which the status returns to `TP_STATUS_AVAILABLE`. So when the application fills a frame, it must make sure not to overwrite a packet that is still in transmission.
With TPACKET_V3, unlike the RX ring, where every block has a block descriptor, the TX ring has no block descriptors because it does not need block-level polling. Sending a packet is therefore quite straightforward, as the code below shows.
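A minimal sketch of the transmit path for one TPACKET_V3 frame. The helper names are hypothetical. The data offset `TPACKET3_HDRLEN - sizeof(struct sockaddr_ll)` follows the convention used by the kernel's own `psock_tpacket` selftest and should be treated as an assumption for other setups.

```c
#include <assert.h>
#include <linux/if_packet.h>  /* tpacket3_hdr, TPACKET3_HDRLEN */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>       /* send */

/* Hypothetical helper: copies a packet into one TX frame and marks it
 * for sending. Returns 0 on success, -1 if the frame is still in use. */
static int queue_tx_frame(struct tpacket3_hdr *hdr, size_t frame_size,
                          const void *pkt, uint32_t len)
{
    /* Packet data starts right after the aligned frame header. */
    const size_t data_off = TPACKET3_HDRLEN - sizeof(struct sockaddr_ll);
    assert(data_off + len <= frame_size);

    /* Never overwrite a frame that is queued or being transmitted. */
    if (hdr->tp_status != TP_STATUS_AVAILABLE)
        return -1;

    memcpy((uint8_t *)hdr + data_off, pkt, len);
    hdr->tp_len = len;
    hdr->tp_snaplen = len;
    hdr->tp_next_offset = 0;
    hdr->tp_status = TP_STATUS_SEND_REQUEST;
    return 0;
}

/* Hypothetical helper: one system call that tells the kernel to
 * transmit every frame marked TP_STATUS_SEND_REQUEST. */
static ssize_t flush_tx_ring(int fd)
{
    return send(fd, NULL, 0, 0);
}
```

Several frames can be queued with `queue_tx_frame` before a single `flush_tx_ring` call, which is where the saving in system calls comes from.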
If you want to push transmission speed as far as possible and minimize latency, as packet generator software usually does, the `PACKET_QDISC_BYPASS` option comes to the rescue. You can set this option after the socket is created.
The side effect of this option is that AF_PACKET bypasses the kernel’s qdisc layer and pushes packets directly to the driver. Packets are therefore not buffered and no TC disciplines are applied, which potentially increases loss in the presence of microbursts. Generally, this option suits stress and performance testing, or scenarios where you do not care much about packet loss.
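Enabling the bypass is a single `setsockopt` call on the existing socket. A minimal sketch, with `set_qdisc_bypass` as a hypothetical wrapper name:

```c
#include <assert.h>
#include <linux/if_packet.h>  /* PACKET_QDISC_BYPASS */
#include <sys/socket.h>

/* Hypothetical wrapper: turns qdisc bypass on (1) or off (0) for an
 * AF_PACKET socket. Returns 0 on success, -1 with errno set. */
static int set_qdisc_bypass(int fd, int enable)
{
    assert(enable == 0 || enable == 1);
    return setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS,
                      &enable, sizeof(enable));
}
```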
The Golang Implementation
The github.com/google/gopacket package provides a Golang implementation of the three TPacket versions for AF_PACKET. Examples are available at https://github.com/google/gopacket/tree/master/examples/afpacket. However, the gopacket package doesn’t implement the TX_RING, so I have forked the repo and committed my TX_RING implementation at https://github.com/csulrong/gopacket/tree/master/afpacket.