[Intel-wired-lan] [PATCH v2 1/3] i40e: Initial support for XDP

John Fastabend john.fastabend at gmail.com
Fri Dec 9 17:54:55 UTC 2016


On 16-12-09 09:15 AM, Alexander Duyck wrote:
> On Fri, Dec 9, 2016 at 12:22 AM, Björn Töpel <bjorn.topel at gmail.com> wrote:
>> From: Björn Töpel <bjorn.topel at intel.com>
>>
>> This commit adds basic XDP support for i40e derived NICs. All XDP
>> actions will end up in XDP_DROP.
>>
>> Only the default/main VSI has support for enabling XDP.
>>
>> Acked-by: John Fastabend <john.r.fastabend at intel.com>
>> Signed-off-by: Björn Töpel <bjorn.topel at intel.com>
>> ---
>>  drivers/net/ethernet/intel/i40e/i40e.h         |  13 +++
>>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   3 +
>>  drivers/net/ethernet/intel/i40e/i40e_main.c    |  77 +++++++++++++
>>  drivers/net/ethernet/intel/i40e/i40e_txrx.c    | 146 ++++++++++++++++++++-----
>>  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   2 +
>>  5 files changed, 216 insertions(+), 25 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
>> index ba8d30984bee..05d805f439e6 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>> @@ -545,6 +545,8 @@ struct i40e_vsi {
>>         struct i40e_ring **rx_rings;
>>         struct i40e_ring **tx_rings;
>>
>> +       struct bpf_prog *xdp_prog;
>> +

This is being protected by RCU, right? At least that seems to be the case
in patch 3. We should add the __rcu annotation here:

  struct bpf_prog __rcu *xdp_prog

Then also run sparse over the driver:

  make C=2 ./drivers/net/ethernet/intel/i40e/

This should help catch any RCU annotation errors. Note that sparse only
checks __rcu when CONFIG_SPARSE_RCU_POINTER is set in .config, and
PROVE_RCU gets you the runtime checks on top.
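
For reference, the pattern I have in mind is roughly the below. This is
just a sketch, not the actual driver code, and it assumes the per-ring
xdp_prog pointer gets the same annotation (the per-ring bpf_prog_add()
reference keeps the old program alive for in-flight readers):

  /* writer side, e.g. in i40e_xdp_setup(), running under rtnl */
  old_prog = rtnl_dereference(vsi->xdp_prog);
  rcu_assign_pointer(vsi->xdp_prog, prog);
  if (old_prog)
          bpf_prog_put(old_prog);

  /* reader side, in the napi poll path */
  rcu_read_lock();
  prog = rcu_dereference(rx_ring->xdp_prog);
  if (prog)
          act = bpf_prog_run_xdp(prog, &xdp);
  rcu_read_unlock();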

I wonder if patch 3 could be squashed into patch 1 here? Or do you think
this patch is easier to read without it?

>>         u32  active_filters;
>>         u32  promisc_threshold;
>>
>> @@ -904,4 +906,15 @@ i40e_status i40e_get_npar_bw_setting(struct i40e_pf *pf);
>>  i40e_status i40e_set_npar_bw_setting(struct i40e_pf *pf);
>>  i40e_status i40e_commit_npar_bw_setting(struct i40e_pf *pf);
>>  void i40e_print_link_message(struct i40e_vsi *vsi, bool isup);
>> +
>> +/**
>> + * i40e_enabled_xdp_vsi - Check if VSI has XDP enabled
>> + * @vsi: pointer to a vsi
>> + *
>> + * Returns true if the VSI has XDP enabled.
>> + **/
>> +static inline bool i40e_enabled_xdp_vsi(const struct i40e_vsi *vsi)
>> +{
>> +       return vsi->xdp_prog;
>> +}
> 
> It might be better to have this return !!vsi->xdp_prog.  That way it
> is explicity casting the return value as a boolean value.
> 
>>  #endif /* _I40E_H_ */
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
>> index cc1465aac2ef..831bbc208fc8 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
>> @@ -1254,6 +1254,9 @@ static int i40e_set_ringparam(struct net_device *netdev,
>>         if ((ring->rx_mini_pending) || (ring->rx_jumbo_pending))
>>                 return -EINVAL;
>>
>> +       if (i40e_enabled_xdp_vsi(vsi))
>> +               return -EINVAL;
>> +
>>         if (ring->tx_pending > I40E_MAX_NUM_DESCRIPTORS ||
>>             ring->tx_pending < I40E_MIN_NUM_DESCRIPTORS ||
>>             ring->rx_pending > I40E_MAX_NUM_DESCRIPTORS ||
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> index da4cbe32eb86..b247c3231170 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> @@ -24,6 +24,7 @@
>>   *
>>   ******************************************************************************/
>>
>> +#include <linux/bpf.h>
>>  #include <linux/etherdevice.h>
>>  #include <linux/of_net.h>
>>  #include <linux/pci.h>
>> @@ -2431,6 +2432,13 @@ static int i40e_change_mtu(struct net_device *netdev, int new_mtu)
>>         struct i40e_netdev_priv *np = netdev_priv(netdev);
>>         struct i40e_vsi *vsi = np->vsi;
>>
>> +       if (i40e_enabled_xdp_vsi(vsi)) {
>> +               int max_frame = new_mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
>> +
>> +               if (max_frame > I40E_RXBUFFER_2048)
>> +                       return -EINVAL;
>> +       }
>> +
> 
> I suppose this works for now, but we should probably be using a value
> smaller than 2048.  Most likely we are going to need to restrict this
> down to about 1.5K or something in that neighborhood.  I am already
> looking at changes that clean most of this up to prep it for the DMA
> and page count fixes.  To resolve the jumbo frames case we are
> probably looking at using a 3K buffer size with an order 1 page on
> x86.  That will give us enough room for adding headers and/or any
> trailers needed to the frame.
> 

Also, it looks like we are trying to standardize on 256B of headroom for
headers. So when we add the adjust_head support, this check will need to
account for another 256B, and we will have to offset the DMA mapping of
the buffer accordingly.

I thought Alex was maybe trying to clean some of this up so it would be
easier to do; I vaguely remember something about SKB_PAD.
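
i.e. something like the below for the check in i40e_change_mtu() (sketch
only; XDP_PACKET_HEADROOM here is just a stand-in for whatever 256B
constant the adjust_head work ends up defining):

  int max_frame = new_mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;

  if (i40e_enabled_xdp_vsi(vsi) &&
      max_frame + XDP_PACKET_HEADROOM > I40E_RXBUFFER_2048)
          return -EINVAL;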

>>         netdev_info(netdev, "changing MTU from %d to %d\n",
>>                     netdev->mtu, new_mtu);
>>         netdev->mtu = new_mtu;
>> @@ -3085,6 +3093,15 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
>>         ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
>>         writel(0, ring->tail);
>>
>> +       if (i40e_enabled_xdp_vsi(vsi)) {
>> +               struct bpf_prog *prog;
>> +
>> +               prog = bpf_prog_add(vsi->xdp_prog, 1);
>> +               if (IS_ERR(prog))
>> +                       return PTR_ERR(prog);
>> +               ring->xdp_prog = prog;
>> +       }
>> +
>>         i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
>>
>>         return 0;
>> @@ -9234,6 +9251,65 @@ static netdev_features_t i40e_features_check(struct sk_buff *skb,
>>         return features & ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
>>  }
>>
>> +/**
>> + * i40e_xdp_setup - Add/remove an XDP program to a VSI
>> + * @vsi: the VSI to add the program
>> + * @prog: the XDP program
>> + **/
>> +static int i40e_xdp_setup(struct i40e_vsi *vsi,
>> +                         struct bpf_prog *prog)
>> +{
>> +       struct i40e_pf *pf = vsi->back;
>> +       struct net_device *netdev = vsi->netdev;
>> +       int frame_size = netdev->mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
>> +
>> +       if (prog && prog->xdp_adjust_head)

I think we want a follow-on patch to enable adjust_head next. It would be
best to get that into the same kernel version, but not necessarily in the
same series.
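
Something along these lines in the Rx path once that lands (sketch only;
it assumes xdp_buff grows a data_hard_start member, that we reserve the
headroom when posting buffers, and I40E_RX_HDROOM is a made-up name for
that 256B constant):

  xdp.data_hard_start = page_address(rx_buffer->page) +
                        rx_buffer->page_offset;
  xdp.data = xdp.data_hard_start + I40E_RX_HDROOM;
  xdp.data_end = xdp.data + size;

  act = bpf_prog_run_xdp(xdp_prog, &xdp);

  /* honour any head adjustment the program made */
  size = xdp.data_end - xdp.data;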

>> +               return -EOPNOTSUPP;
>> +
>> +       if (frame_size > I40E_RXBUFFER_2048)
>> +               return -EINVAL;
>> +
>> +       if (!(pf->flags & I40E_FLAG_MSIX_ENABLED))
>> +               return -EINVAL;
>> +
> 
> Why would MSI-X matter?  Is this because you are trying to reserve
> extra Tx rings for the XDP program?
> 
> If so we might want to update i40e to make it more ixgbe like since it
> can in theory support multiple queues with a single vector.  It might
> be useful to add comments on why each of these criteria must be met
> when you perform the checks.
> 
>> +       if (!i40e_enabled_xdp_vsi(vsi) && !prog)
>> +               return 0;
>> +
>> +       i40e_prep_for_reset(pf);
>> +
>> +       if (vsi->xdp_prog)
>> +               bpf_prog_put(vsi->xdp_prog);
>> +       vsi->xdp_prog = prog;
>> +

So I worry a bit about correctness here. Patch 3 likely fixes this, but
with only patch 1 applied it seems like there could be a race on this
pointer (see below in i40e_fetch_rx_buffer()).

This makes me think patch 3 should be merged into this one.

>> +       i40e_reset_and_rebuild(pf, true);
>> +       return 0;
>> +}
>> +
>> +/**
>> + * i40e_xdp - NDO for enabled/query
>> + * @dev: the netdev
>> + * @xdp: XDP program
>> + **/
>> +static int i40e_xdp(struct net_device *dev,
>> +                   struct netdev_xdp *xdp)
>> +{
>> +       struct i40e_netdev_priv *np = netdev_priv(dev);
>> +       struct i40e_vsi *vsi = np->vsi;
>> +
>> +       if (vsi->type != I40E_VSI_MAIN)
>> +               return -EINVAL;
>> +
>> +       switch (xdp->command) {
>> +       case XDP_SETUP_PROG:
>> +               return i40e_xdp_setup(vsi, xdp->prog);
>> +       case XDP_QUERY_PROG:
>> +               xdp->prog_attached = i40e_enabled_xdp_vsi(vsi);
>> +               return 0;
>> +       default:
>> +               return -EINVAL;
>> +       }
>> +}
>> +
>>  static const struct net_device_ops i40e_netdev_ops = {
>>         .ndo_open               = i40e_open,
>>         .ndo_stop               = i40e_close,
>> @@ -9270,6 +9346,7 @@ static const struct net_device_ops i40e_netdev_ops = {
>>         .ndo_features_check     = i40e_features_check,
>>         .ndo_bridge_getlink     = i40e_ndo_bridge_getlink,
>>         .ndo_bridge_setlink     = i40e_ndo_bridge_setlink,
>> +       .ndo_xdp                = i40e_xdp,
>>  };
>>
>>  /**
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> index 352cf7cd2ef4..d835a51dafa6 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> @@ -24,6 +24,7 @@
>>   *
>>   ******************************************************************************/
>>
>> +#include <linux/bpf.h>
>>  #include <linux/prefetch.h>
>>  #include <net/busy_poll.h>
>>  #include "i40e.h"
>> @@ -1040,6 +1041,11 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
>>         rx_ring->next_to_alloc = 0;
>>         rx_ring->next_to_clean = 0;
>>         rx_ring->next_to_use = 0;
>> +
>> +       if (rx_ring->xdp_prog) {
>> +               bpf_prog_put(rx_ring->xdp_prog);
>> +               rx_ring->xdp_prog = NULL;
>> +       }
>>  }
>>
>>  /**
>> @@ -1600,30 +1606,104 @@ static bool i40e_add_rx_frag(struct i40e_ring *rx_ring,
>>  }
>>
>>  /**
>> + * i40e_run_xdp - Runs an XDP program for an Rx ring
>> + * @rx_ring: Rx ring used for XDP
>> + * @rx_buffer: current Rx buffer
>> + * @rx_desc: current Rx descriptor
>> + * @xdp_prog: the XDP program to run
>> + *
>> + * Returns true if the XDP program consumed the incoming frame. False
>> + * means pass the frame to the good old stack.
>> + **/
>> +static bool i40e_run_xdp(struct i40e_ring *rx_ring,
>> +                        struct i40e_rx_buffer *rx_buffer,
>> +                        union i40e_rx_desc *rx_desc,
>> +                        struct bpf_prog *xdp_prog)
>> +{
>> +       u64 qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
>> +       unsigned int size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
>> +                           I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
>> +       struct xdp_buff xdp;
>> +       u32 xdp_action;
>> +
>> +       WARN_ON(!i40e_test_staterr(rx_desc,
>> +                                  BIT(I40E_RX_DESC_STATUS_EOF_SHIFT)));
>> +
> 
> We can't just do a WARN_ON if we don't have the end of the frame.
> Also it probably doesn't even matter unless we are looking at the
> first frame.  I would argue that this might be better as a BUG_ON
> since it should not happen.
> 

Or, if it's really just a catch-all for something that shouldn't happen,
go ahead and drop the check entirely.

>> +       xdp.data = page_address(rx_buffer->page) + rx_buffer->page_offset;
>> +       xdp.data_end = xdp.data + size;
>> +       xdp_action = bpf_prog_run_xdp(xdp_prog, &xdp);
>> +
>> +       switch (xdp_action) {
>> +       case XDP_PASS:
>> +               return false;
>> +       default:
>> +               bpf_warn_invalid_xdp_action(xdp_action);
>> +       case XDP_ABORTED:
>> +       case XDP_TX:
>> +       case XDP_DROP:
>> +               if (likely(!i40e_page_is_reserved(rx_buffer->page))) {
>> +                       i40e_reuse_rx_page(rx_ring, rx_buffer);
>> +                       rx_ring->rx_stats.page_reuse_count++;
>> +                       break;
>> +               }
>> +
>> +               /* we are not reusing the buffer so unmap it */
>> +               dma_unmap_page(rx_ring->dev, rx_buffer->dma, PAGE_SIZE,
>> +                              DMA_FROM_DEVICE);
>> +               __free_pages(rx_buffer->page, 0);
>> +       }
>> +
>> +       /* clear contents of buffer_info */
>> +       rx_buffer->page = NULL;
>> +       return true; /* Swallowed by XDP */
>> +}
>> +
>> +/**
>>   * i40e_fetch_rx_buffer - Allocate skb and populate it
>>   * @rx_ring: rx descriptor ring to transact packets on
>>   * @rx_desc: descriptor containing info written by hardware
>> + * @skb: The allocated skb, if any
>>   *
>> - * This function allocates an skb on the fly, and populates it with the page
>> - * data from the current receive descriptor, taking care to set up the skb
>> - * correctly, as well as handling calling the page recycle function if
>> - * necessary.
>> + * Unless XDP is enabled, this function allocates an skb on the fly,
>> + * and populates it with the page data from the current receive
>> + * descriptor, taking care to set up the skb correctly, as well as
>> + * handling calling the page recycle function if necessary.
>> + *
>> + * If the received frame was handled by XDP, true is
>> + * returned. Otherwise, the skb is returned to the caller via the skb
>> + * parameter.
>>   */
>>  static inline
>> -struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
>> -                                    union i40e_rx_desc *rx_desc)
>> +bool i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
>> +                         union i40e_rx_desc *rx_desc,
>> +                         struct sk_buff **skb)
>>  {
>>         struct i40e_rx_buffer *rx_buffer;
>> -       struct sk_buff *skb;
>>         struct page *page;
>>
>>         rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
>>         page = rx_buffer->page;
>>         prefetchw(page);
>>
>> -       skb = rx_buffer->skb;
> 
> I'm not sure where this line came from.  It was dropped from the
> driver in next-queue a while ago so I don't think your patch applies
> currently.
> 
>> +       /* we are reusing so sync this buffer for CPU use */
>> +       dma_sync_single_range_for_cpu(rx_ring->dev,
>> +                                     rx_buffer->dma,
>> +                                     rx_buffer->page_offset,
>> +                                     I40E_RXBUFFER_2048,
>> +                                     DMA_FROM_DEVICE);
> 
> I would prefer if moving this piece up was handled in a separate
> patch.  It just makes things more readable that way and I have patches
> I will be submitting that will pull this and the few lines ahead of it
> out of the fetch_rx_buffer entirely.
> 
>> +
>> +       if (rx_ring->xdp_prog) {
>> +               bool xdp_consumed;
>> +
>> +               xdp_consumed = i40e_run_xdp(rx_ring, rx_buffer,
>> +                                           rx_desc, rx_ring->xdp_prog);

So here is the race I commented on above: if xdp_prog is set to NULL
between the 'if' check and the i40e_run_xdp() call, we pass a NULL (or
stale) program in.
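
Reading the pointer once into a local and testing that would at least
close the window inside this function, e.g. (sketch):

  /* rcu_dereference() here once the __rcu annotation is in place */
  struct bpf_prog *xdp_prog = READ_ONCE(rx_ring->xdp_prog);

  if (xdp_prog) {
          bool xdp_consumed;

          xdp_consumed = i40e_run_xdp(rx_ring, rx_buffer,
                                      rx_desc, xdp_prog);
          if (xdp_consumed)
                  return true;
  }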


>> +               if (xdp_consumed)
>> +                       return true;
>> +       }
>>
>> -       if (likely(!skb)) {
>> +       *skb = rx_buffer->skb;
>> +
>> +       if (likely(!*skb)) {
> 
> Instead of making skb a double pointer we can just change a few things
> to make this all simpler.  For example one thing we could look at
> doing is taking an hlist_nulls type approach where we signal with a
> value of 1 to indicate that XDP stole the buffer so there is no skb to
> assign.  Then you just have to clean it up in i40e_is_non_eop by doing
> a "&" and continue from there, or could add your own block after
> i40e_is_non_eop.
> 
> For that matter we could probably consolidate a few things into
> i40e_cleanup_headers and have it handled there.
> 
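
If I'm reading that right, something like this would keep
i40e_fetch_rx_buffer() returning a single pointer (completely untested,
I40E_XDP_CONSUMED is a made-up marker value):

  #define I40E_XDP_CONSUMED ((struct sk_buff *)0x1UL)

  skb = i40e_fetch_rx_buffer(rx_ring, rx_desc);
  if (skb == I40E_XDP_CONSUMED) {
          cleaned_count++;
          i40e_update_rx_next_to_clean(rx_ring);
          continue;
  }
  if (!skb)
          break;
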
>>                 void *page_addr = page_address(page) + rx_buffer->page_offset;
>>
>>                 /* prefetch first cache line of first page */
>> @@ -1633,32 +1713,25 @@ struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
>>  #endif
>>
>>                 /* allocate a skb to store the frags */
>> -               skb = __napi_alloc_skb(&rx_ring->q_vector->napi,
>> -                                      I40E_RX_HDR_SIZE,
>> -                                      GFP_ATOMIC | __GFP_NOWARN);
>> -               if (unlikely(!skb)) {
>> +               *skb = __napi_alloc_skb(&rx_ring->q_vector->napi,
>> +                                       I40E_RX_HDR_SIZE,
>> +                                       GFP_ATOMIC | __GFP_NOWARN);
>> +               if (unlikely(!*skb)) {
>>                         rx_ring->rx_stats.alloc_buff_failed++;
>> -                       return NULL;
>> +                       return false;
>>                 }
>>
>>                 /* we will be copying header into skb->data in
>>                  * pskb_may_pull so it is in our interest to prefetch
>>                  * it now to avoid a possible cache miss
>>                  */
>> -               prefetchw(skb->data);
>> +               prefetchw((*skb)->data);
>>         } else {
>>                 rx_buffer->skb = NULL;
>>         }
>>
>> -       /* we are reusing so sync this buffer for CPU use */
>> -       dma_sync_single_range_for_cpu(rx_ring->dev,
>> -                                     rx_buffer->dma,
>> -                                     rx_buffer->page_offset,
>> -                                     I40E_RXBUFFER_2048,
>> -                                     DMA_FROM_DEVICE);
>> -
>>         /* pull page into skb */
>> -       if (i40e_add_rx_frag(rx_ring, rx_buffer, rx_desc, skb)) {
>> +       if (i40e_add_rx_frag(rx_ring, rx_buffer, rx_desc, *skb)) {
>>                 /* hand second half of page back to the ring */
>>                 i40e_reuse_rx_page(rx_ring, rx_buffer);
>>                 rx_ring->rx_stats.page_reuse_count++;
>> @@ -1671,7 +1744,7 @@ struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
>>         /* clear contents of buffer_info */
>>         rx_buffer->page = NULL;
>>
>> -       return skb;
>> +       return false;
>>  }
>>
>>  /**
>> @@ -1716,6 +1789,20 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
>>  }
>>
>>  /**
>> + * i40e_update_rx_next_to_clean - Bumps the next-to-clean for an Rx ing
>> + * @rx_ring: Rx ring to bump
>> + **/
>> +static void i40e_update_rx_next_to_clean(struct i40e_ring *rx_ring)
>> +{
>> +       u32 ntc = rx_ring->next_to_clean + 1;
>> +
>> +       ntc = (ntc < rx_ring->count) ? ntc : 0;
>> +       rx_ring->next_to_clean = ntc;
>> +
>> +       prefetch(I40E_RX_DESC(rx_ring, ntc));
>> +}
>> +
>> +/**
>>   * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
>>   * @rx_ring: rx descriptor ring to transact packets on
>>   * @budget: Total limit on number of packets to process
>> @@ -1739,6 +1826,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>>                 u16 vlan_tag;
>>                 u8 rx_ptype;
>>                 u64 qword;
>> +               bool xdp_consumed;
>>
>>                 /* return some buffers to hardware, one at a time is too slow */
>>                 if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
>> @@ -1764,7 +1852,15 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>>                  */
>>                 dma_rmb();
>>
>> -               skb = i40e_fetch_rx_buffer(rx_ring, rx_desc);
>> +               xdp_consumed = i40e_fetch_rx_buffer(rx_ring, rx_desc, &skb);
>> +               if (xdp_consumed) {
>> +                       cleaned_count++;
>> +
>> +                       i40e_update_rx_next_to_clean(rx_ring);
>> +                       total_rx_packets++;
>> +                       continue;
>> +               }
>> +
> 
> There are a few bits here that aren't quite handled correctly.  It
> looks like we are getting the total_rx_packets, but you didn't update
> total_rx_bytes.
> 
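
The length is right there in the descriptor, so the consumed case could
account for it with something like (sketch):

  if (xdp_consumed) {
          u64 qw = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
          unsigned int size = (qw & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
                              I40E_RXD_QW1_LENGTH_PBUF_SHIFT;

          total_rx_bytes += size;
          total_rx_packets++;
          cleaned_count++;

          i40e_update_rx_next_to_clean(rx_ring);
          continue;
  }
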
>>                 if (!skb)
>>                         break;
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> index e065321ce8ed..957d856a82c4 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> @@ -341,6 +341,8 @@ struct i40e_ring {
>>
>>         struct rcu_head rcu;            /* to avoid race on free */
>>         u16 next_to_alloc;
>> +
>> +       struct bpf_prog *xdp_prog;
>>  } ____cacheline_internodealigned_in_smp;
>>
> 
> You might be better off moving this further up in the structure.
> Maybe into the slot after *tail.  Otherwise it is going to be in a
> pretty noisy cache line, although it looks like nobody ever really
> bothered to optimize the i40e_ring structure so that you had read
> mostly and write mostly cache lines anyway so maybe you don't need to
> bother.

Maybe another patch to clean up the structures.



