[Intel-wired-lan] [PATCH bpf 3/4] net/i40e: Fix concurrency issues between config flow and XSK

Mon Dec 16 07:50:03 UTC 2019

On 2019-12-13 18:10, Maxim Mikityanskiy wrote:
> On 2019-12-11 12:08, Björn Töpel wrote:
>> On 2019-12-10 15:26, Maxim Mikityanskiy wrote:
>>> On 2019-12-06 11:03, Björn Töpel wrote:
>>>> On Thu, 5 Dec 2019 at 16:52, Maxim Mikityanskiy
>>>> <maximmi at mellanox.com> wrote:
>>>>>
>>>>> Use synchronize_rcu to wait until the XSK wakeup function finishes
>>>>> before destroying the resources it uses:
>>>>>
>>>>> 1. i40e_down already calls synchronize_rcu. On i40e_down either
>>>>> __I40E_VSI_DOWN or __I40E_CONFIG_BUSY is set. Check the latter in
>>>>> i40e_xsk_async_xmit (the former is already checked there).
>>>>>
>>>>> 2. After switching the XDP program, call synchronize_rcu to let
>>>>> i40e_xsk_async_xmit exit before the XDP program is freed.
>>>>>
>>>>> 3. Changing the number of channels brings the interface down (see
>>>>> i40e_prep_for_reset and i40e_pf_quiesce_all_vsi).
>>>>>
>>>>> 4. Disabling UMEM sets __I40E_CONFIG_BUSY, too.
>>>>>
>>>>> Signed-off-by: Maxim Mikityanskiy <maximmi at mellanox.com>
>>>>> ---
>>>>>     drivers/net/ethernet/intel/i40e/i40e_main.c | 7 +++++--
>>>>>     drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 4 ++++
>>>>>     2 files changed, 9 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c
>>>>> b/drivers/net/ethernet/intel/i40e/i40e_main.c
>>>>> index 1ccabeafa44c..afa3a99e68e1 100644
>>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>>>>> @@ -6823,8 +6823,8 @@ void i40e_down(struct i40e_vsi *vsi)
>>>>>            for (i = 0; i < vsi->num_queue_pairs; i++) {
>>>>>                    i40e_clean_tx_ring(vsi->tx_rings[i]);
>>>>>                    if (i40e_enabled_xdp_vsi(vsi)) {
>>>>> -                       /* Make sure that in-progress ndo_xdp_xmit
>>>>> -                        * calls are completed.
>>>>> +                       /* Make sure that in-progress ndo_xdp_xmit and
>>>>> +                        * ndo_xsk_async_xmit calls are completed.
>>>
>>> Ooops, I see now some changes were lost during rebasing. I had a version
>>> of this series with ndo_xsk_async_xmit -> ndo_xsk_wakeup fixed.
>>>
>>>>>                             */
>>>>>                            synchronize_rcu();
>>>>>                            i40e_clean_tx_ring(vsi->xdp_rings[i]);
>>>>> @@ -12545,6 +12545,9 @@ static int i40e_xdp_setup(struct i40e_vsi *vsi,
>>>>>                    i40e_prep_for_reset(pf, true);
>>>>>
>>>>>            old_prog = xchg(&vsi->xdp_prog, prog);
>>>>> +       if (old_prog)
>>>>> +               /* Wait until ndo_xsk_async_xmit completes. */
>>>>> +               synchronize_rcu();
>>>>
>>>> This is not needed -- please correct me if I'm missing something! If
>>>> we're disabling XDP, the need_reset-clause will take care or the
>>>> proper synchronization.
>>>
>>> Yes, it was added to cover the case of disabling XDP. I'm not sure how
>>> i40e_reset_and_rebuild will help here. What I wanted to achieve is to
>>> wait until all i40e_xsk_wakeups finish after setting vsi->xdp_prog =
>>> NULL and before doing anything else. I don't see how
>>> i40e_reset_and_rebuild helps here, but I'm not very familiar with i40e
>>> code. Could you guide me through the code of i40e_reset_and_rebuild and
>>> show how it takes care of the synchronization?
>>>
>>>> And we don't need to worry about old_prog for
>>>> the swap-XDP case,
>>>
>>> Right.
>>>
>>>> because bpf_prog_put() does cleanup with
>>>> call_rcu().
>>>
>>> But if i40e_xsk_wakeup is not wrapped with rcu_read_lock, it won't sync
>>> with it.
>>>
>>>>
>>>>>
>>>>>            if (need_reset)
>>>>>                    i40e_reset_and_rebuild(pf, true, true);
>>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
>>>>> b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
>>>>> index d07e1a890428..f73cd917c44f 100644
>>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
>>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
>>>>> @@ -787,8 +787,12 @@ int i40e_xsk_wakeup(struct net_device *dev, u32
>>>>> queue_id, u32 flags)
>>>>>     {
>>>>>            struct i40e_netdev_priv *np = netdev_priv(dev);
>>>>>            struct i40e_vsi *vsi = np->vsi;
>>>>> +       struct i40e_pf *pf = vsi->back;
>>>>>            struct i40e_ring *ring;
>>>>>
>>>>> +       if (test_bit(__I40E_CONFIG_BUSY, pf->state))
>>>>> +               return -ENETDOWN;
>>>>> +
>>>>
>>>> You right that we need to check for BUSY, since the XDP ring might be
>>>> stale! Thanks for catching this! Can you respin this patch, with just
>>>> this hunk? (Unless I'm wrong! :-))
>>>
>>> This check itself will not be enough, unless you wrap i40e_xsk_wakeup
>>> with rcu_read_lock.
>>>
>>> i40e_xsk_wakeup does some checks in the beginning, then the main part of
>>> the code goes. My assumption is that if the checks don't pass, the main
>>> part will fail in some way (e.g., NULL or dangling pointer when
>>> accessing the ring), otherwise you wouldn't need those checks at all.
>>> Without RCU, even if you do these checks, it may still fail in the
>>> following scenario:
>>>
>>> 1. i40e_xsk_wakeup starts running, checks the flag, the flag is set, the
>>> function goes on.
>>>
>>> 2. The flag gets cleared.
>>>
>>> 3. The resources are destroyed.
>>>
>>> 4. i40e_xsk_wakeup tries to access the resources that were destroyed.
>>>
>>> I don't see anything in i40e and ixgbe that currently protects from such
>>> a race condition. If I'm missing anything, please point me to it,
>>> otherwise i40e_xsk_wakeup really needs to be wrapped into rcu_read_lock.
>>> I would prefer to have rcu_read_lock in the kernel, so that all drivers
>>> could benefit from it, because I think it's the same issue in all
>>> drivers, but I'm fine with moving it to the drivers if there is a reason
>>> why some drivers may not need it.
>>>
>>> Thanks for taking a look. Let's settle the case with Intel's drivers, so
>>> that I will know what fixes to include into the respin.
>>>
>>
>> Max, you're right. It's not correct without RCU locking. Thanks for the
>> patience.
>>
>> For the Intel ndo_xsk_wakeup implementation we need to make sure that
>> the 1. umem(socket) exists, and that the 2. XDP rings exists for the
>> duration of the call.
>>
>> 1. In xsk_unbind_dev() the state is changed to UNBOUND and then
>>      followed by synchronize_net(). It would be, as you wrote, incorrect
>>      without RCU locking the ndo_xsk_wakeup() region.
>>
>> 2. To ensure that the XDP rings are intact for the duration of the
>>      ndo_xsk_wakeup(), a synchronize_rcu() is required when the XDP
>>      program is swapped out (prog->NULL).
>>
>> Again, both 1 and 2 requires RCU locking.
>>
>> What do you think about the this patch for xsk.c (and the Intel
>> drivers)? It moves all ndo_xsk_wakeup calls do one place.
> 
> Looks good, please also see a few comments below.
> 
> Checking the __I40E_CONFIG_BUSY and __IXGBE_TX_DISABLED flags in the
> wakeup functions is still required, I believe. You didn't include these
> parts into your patch, I just want to make sure you don't suggest to
> drop these parts of my patches.
>

No, you're right. I missed that!

> How would you like to proceed with it? Do you want me to integrate your
> changes into my patches and respin with both my and your sign-offs?
>

That sounds like a good plan! Let me know if you'd like me to do it, 
given that you're on vacation!

>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h
>> b/drivers/net/ethernet/intel/i40e/i40e.h
>> index cb6367334ca7..4833187bd259 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>> @@ -1152,7 +1152,7 @@ void i40e_set_fec_in_flags(u8 fec_cfg, u32 *flags);
>>
>>    static inline bool i40e_enabled_xdp_vsi(struct i40e_vsi *vsi)
>>    {
>> -    return !!vsi->xdp_prog;
>> +    return !!READ_ONCE(vsi->xdp_prog);
>>    }
>>
>>    int i40e_create_queue_channel(struct i40e_vsi *vsi, struct
>> i40e_channel *ch);
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c
>> b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> index 1ccabeafa44c..fd084f03f00d 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> @@ -12546,8 +12546,11 @@ static int i40e_xdp_setup(struct i40e_vsi *vsi,
>>
>>        old_prog = xchg(&vsi->xdp_prog, prog);
>>
>> -    if (need_reset)
>> +    if (need_reset) {
>> +        if (!prog)
>> +            synchronize_rcu();
>>            i40e_reset_and_rebuild(pf, true, true);
>> +    }
>>
>>        for (i = 0; i < vsi->num_queue_pairs; i++)
>>            WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog);
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> index 25c097cd8100..adff2cbcde1a 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> @@ -10261,7 +10261,11 @@ static int ixgbe_xdp_setup(struct net_device
>> *dev, struct bpf_prog *prog)
>>
>>        /* If transitioning XDP modes reconfigure rings */
>>        if (need_reset) {
>> -        int err = ixgbe_setup_tc(dev, adapter->hw_tcs);
>> +        int err;
>> +
>> +        if (!prog)
>> +            synchronize_rcu();
>> +        err = ixgbe_setup_tc(dev, adapter->hw_tcs);
>>
>>            if (err) {
>>                rcu_assign_pointer(adapter->xdp_prog, old_prog);
>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>> index 956793893c9d..dbf06c3b7061 100644
>> --- a/net/xdp/xsk.c
>> +++ b/net/xdp/xsk.c
>> @@ -334,12 +334,20 @@ bool xsk_umem_consume_tx(struct xdp_umem *umem,
>> struct xdp_desc *desc)
>>    }
>>    EXPORT_SYMBOL(xsk_umem_consume_tx);
>>
>> -static int xsk_zc_xmit(struct xdp_sock *xs)
>> +static int xsk_wakeup(struct xdp_sock *xs, u8 flags)
>>    {
>>        struct net_device *dev = xs->dev;
>> +    int err;
>>
>> -    return dev->netdev_ops->ndo_xsk_wakeup(dev, xs->queue_id,
>> -                           XDP_WAKEUP_TX);
>> +    rcu_read_lock();
>> +    err = dev->netdev_ops->ndo_xsk_wakeup(dev, xs->queue_id, flags);
>> +    rcu_read_unlock();
>> +    return err;
>> +}
>> +
>> +static int xsk_zc_xmit(struct xdp_sock *xs)
>> +{
>> +    return xsk_wakeup(xs, XDP_WAKEUP_TX);
>>    }
>>
>>    static void xsk_destruct_skb(struct sk_buff *skb)
>> @@ -453,19 +461,16 @@ static __poll_t xsk_poll(struct file *file, struct
>> socket *sock,
>>        __poll_t mask = datagram_poll(file, sock, wait);
>>        struct sock *sk = sock->sk;
>>        struct xdp_sock *xs = xdp_sk(sk);
>> -    struct net_device *dev;
>>        struct xdp_umem *umem;
>>
>>        if (unlikely(!xsk_is_bound(xs)))
>>            return mask;
>>
>> -    dev = xs->dev;
>>        umem = xs->umem;
>>
>>        if (umem->need_wakeup) {
>> -        if (dev->netdev_ops->ndo_xsk_wakeup)
>> -            dev->netdev_ops->ndo_xsk_wakeup(dev, xs->queue_id,
>> -                            umem->need_wakeup);
>> +        if (xs->zc)
>> +            xsk_wakeup(umem->need_wakeup);
> 
> Thanks for spotting this. I missed the second place where wakeup is called.
> 
> BTW, does the current code contain a bug? The socket can be opened in
> copy mode even if ndo_xsk_wakeup is not NULL, and we'll call
> ndo_xsk_wakeup in this case, although it's not zero-copy. After your
> patch it will be fixed.
>

Ah, yes indeed. Good catch!

> P.S. I'll be on vacation, so I added my personal email to Cc to be able
> to receive responses.
>

ACK! ...and again let me know if you'd like me to spin the patches.

Cheers,
Björn

>>            else
>>                /* Poll needs to drive Tx also in copy mode */
>>                __xsk_sendmsg(sk);
>>
>