[Intel-wired-lan] ixgbe: 5.10.0 kernel regression for 2.5Gbps link negotiation?

Hisashi T Fujinaka htodd at twofifty.com
Mon Dec 21 15:31:25 UTC 2020


I'm going to answer this from home, where Outlook isn't impeding me.
This is the only time I'm doing this because I can't find your email any
more. Outlook has cleverly hidden it from me.

On Mon, 21 Dec 2020, Fujinaka, Todd wrote:

> I would listen to you on Linus' list, but this is Intel-wired-lan.
>
> Todd Fujinaka
> Software Application Engineer
> Data Center Group
> Intel Corporation
> todd.fujinaka at intel.com
>
> -----Original Message-----
> From: Paul Menzel <pmenzel at molgen.mpg.de>
> Sent: Monday, December 21, 2020 7:10 AM
> To: Fujinaka, Todd <todd.fujinaka at intel.com>; Ben Greear <greearb at candelatech.com>
> Cc: intel-wired-lan at lists.osuosl.org; Greg KH <gregkh at linuxfoundation.org>; Linus Torvalds <torvalds at linux-foundation.org>; Brandeburg, Jesse <jesse.brandeburg at intel.com>; Nguyen, Anthony L <anthony.l.nguyen at intel.com>
> Subject: Re: [Intel-wired-lan] ixgbe: 5.10.0 kernel regression for 2.5Gbps link negotiation?
>
> Dear Todd,
>
>
> I kindly ask you again, please do not top-post. It's impolite, and more importantly, it wastes the readers' time as it loses context and results in misunderstandings.

This is where I should've inserted my comment about Outlook and
intel-wired-lan vs Linus' lists. It's a pain, but

> Am 19.12.20 um 17:19 schrieb Fujinaka, Todd:
>> This is a bad case with no ideal solution. Detecting the case is not
>> possible as autonegotiation happens in the hardware without software
>> involvement.
>>
>> One solution was to update the firmware of the switch that is the
>> link partner that gives us the most trouble. The issue
>> appears to be in competing or half-implemented standards. 2.5G and 5G
>> were initially non-IEEE standards that different manufacturers hacked
>> onto 1G in different ways. We implemented it to one of the standards
>> which should be interoperable, but the corner case of the
>> widely-deployed switch will take the link from 10G to 1G with no
>> automated way to fix it.
>
> Thank you for the background, which should have been in the commit message.
>
> Can you please tell us the name of the problematic switch, the problematic firmware version, and the version where this issue is fixed?

I can ask around. I wasn't on those issues. The problem isn't with the
switch manufacturer, because they've released a fix, but with the
datacenters who don't want to update their switches. I've been loath to
reveal more data because that's confidential to the customer.

>> Updating switches means a lot of downtime for a lot of datacenters and
>> the OEMs we deal with would not accept that answer.
>
> Well, then please discuss the problem and possible solutions on the mailing list. Breaking other people's setups is unacceptable. A Linux kernel runtime parameter would be one solution your customers could have used.

Runtime parameter? That's even higher on the list of "not allowed". I've
said several times that the end-customers wouldn't update their switches
and wouldn't use any boot parameters. These customers are important
enough that executive VPs of several companies called our company and
demanded an immediate fix.

>> Our solution was to disable 2.5G and 5G by default. This fixes 10G
>> linking at 1G on that switch, but 2.5G and 5G will link at 1G by
>> default. And, as I said, I've had very little contact with people
>> using 2.5G and 5G and I'm the guy on all the mailing lists.
>
> Unfortunately, a lot of users are not on the mailing list.

On ANY mailing list. This isn't the only one I'm on.

>> I apologize for making your life harder, but it seems like it's just
>> you so far. Paul seems to be arguing with me just for the fun of it.
>
> Please keep the discussion respectful, and do not insult others.

I'm not being disrespectful, I'm just saying you're just arguing
semantics and "rules".

> Unfortunately, at work we have now been bitten several times by regressions when updating to the current mainline Linux kernel, causing friction in the team about which Linux kernel to use.
>
> I am missing a statement by you, acknowledging that the commit and the whole communication was a big fail, and how you will fix the regression.
> Additionally, an analysis of where the process failed would be nice: why was the commit message incomplete, why did testing (a Tested-by tag is
> present) not spot the issue, and how can the process be improved to avoid such a situation in the future?

Communication was a big fail, and I'm here to try to solve that. We
will not be reverting this, in fact I've been told by my management that
this is required. And my management goes way up the chain to executive
VPs at Intel.

Right now I'm between a rock and a hard place, and 2.5G and 5G are not our
primary market. I'm not the marketing guy, so I didn't make that
decision.

