[Intel-wired-lan] tx hang, server reboot with driver igb under load
Gael Le Mignot
gael at pilotsystems.net
Thu Oct 27 12:20:00 UTC 2016
Hello,
Summary of the problem
We have had a few crashes under network load on production servers
using the igb network driver. These crashes are infrequent
(a couple of times per year at most) but very disruptive
when they happen on production servers.
The setup
Hardware:
- SuperMicro servers
- Dual AMD Opteron
- Network card integrated in the motherboard:
Intel Corporation 82576 Gigabit Network Connection
Software stack:
- Xen hypervisor (4.4.1)
- Debian GNU/Linux stable - Jessie (8.x)
- Linux 3.16.0-4 (Debian’s package)
Integrated igb driver (5.0.5-k)
In addition, we use the following technologies:
- DRBD for disk replication;
- tagged VLANs;
- Ethernet bridges.
Since these servers are currently used in production, additional
testing might be complicated.
Symptoms
Occasionally, the following events occur:
- a timeout on DRBD:
[22020289.869016] block drbd7: Remote failed to finish a request within ko-count * timeout
- followed by a Tx hang:
[22020294.529389] igb 0000:02:00.0: Detected Tx Unit Hang
- followed by an attempt to reset the network adapter:
[22020301.536766] igb 0000:02:00.0 eth0: Reset adapter
[22020304.674250] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
- but the problem persists:
[22020306.530956] igb 0000:02:00.0: Detected Tx Unit Hang
- after a couple of similar cycles, the server reboots.
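For illustration only, the sequence above could be spotted in a saved kernel log with a short Python sketch like the one below. The matched substrings are copied from the excerpts above; the function names, the event labels and the idea of reading the log from a list of lines are assumptions made for this example, not part of the igb driver or of any existing tool.

```python
import re

# Kernel log lines look like:
# [22020294.529389] igb 0000:02:00.0: Detected Tx Unit Hang
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")

# Substrings taken from the log excerpts above, mapped to
# hypothetical short event names used only in this sketch.
EVENTS = {
    "Remote failed to finish a request": "drbd_timeout",
    "Detected Tx Unit Hang": "tx_hang",
    "Reset adapter": "reset",
    "NIC Link is Up": "link_up",
}


def classify(lines):
    """Return a list of (timestamp, event) for recognised kernel log lines."""
    found = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        for needle, name in EVENTS.items():
            if needle in m.group("msg"):
                found.append((float(m.group("ts")), name))
                break
    return found


def hang_persists_after_reset(events):
    """True if a Tx hang is reported again after a reset and link-up."""
    seen_reset = False
    seen_link_up = False
    for _, name in events:
        if name == "reset":
            seen_reset, seen_link_up = True, False
        elif name == "link_up" and seen_reset:
            seen_link_up = True
        elif name == "tx_hang" and seen_reset and seen_link_up:
            return True
    return False


# The log lines quoted in this report, used as sample input.
SAMPLE = """\
[22020289.869016] block drbd7: Remote failed to finish a request within ko-count * timeout
[22020294.529389] igb 0000:02:00.0: Detected Tx Unit Hang
[22020301.536766] igb 0000:02:00.0 eth0: Reset adapter
[22020304.674250] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[22020306.530956] igb 0000:02:00.0: Detected Tx Unit Hang
"""
```

Calling classify(SAMPLE.splitlines()) and passing the result to hang_persists_after_reset() flags the case where the adapter reset does not clear the hang, which is the pattern that precedes the reboot here.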
What we tried
We tried the following, neither of which solved the problem:
- upgrading the kernel to 4.1.0-0.bpo.2 (igb version 5.2.15-k);
- replacing the embedded network card with an external one (which uses the same driver):
Intel Corporation I350 Gigabit Network Connection
We also temporarily removed one of the servers from the
datacenter to perform stress testing, but we couldn’t reproduce
the crash outside real-world operation.
Additional information
Other similar, but slightly older, servers don’t seem to exhibit
the same issue.
We uploaded additional information to http://www-in.pilotsystems.net/igp/ :
- the full logs of the last crash/reboot;
- the output of lspci -v, ethtool -i, ethtool -k and dmidecode on a server with
the issue (gandalf) and on another one that seems fine (buffy).
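For anyone wanting to gather the same diagnostics on a comparable machine, a minimal Python sketch follows. It only wraps the four commands named above; the interface name eth0 is taken from the log excerpts, while the function names and the returned dictionary layout are assumptions made for this example. dmidecode normally needs root privileges.

```python
import shutil
import subprocess


def build_commands(iface="eth0"):
    """Return the diagnostic commands named in this report as argument lists."""
    return [
        ["lspci", "-v"],
        ["ethtool", "-i", iface],   # driver name and version
        ["ethtool", "-k", iface],   # offload settings
        ["dmidecode"],              # usually requires root
    ]


def collect(iface="eth0"):
    """Run each installed command and return {command line: output text}."""
    results = {}
    for cmd in build_commands(iface):
        key = " ".join(cmd)
        if shutil.which(cmd[0]) is None:
            results[key] = "(not installed)"
            continue
        proc = subprocess.run(cmd, capture_output=True, text=True)
        # Keep stderr when the command fails, e.g. for missing privileges.
        results[key] = proc.stdout if proc.returncode == 0 else proc.stderr
    return results
```

Running collect() on both a failing and a healthy host and comparing the two dictionaries key by key is one way to look for differences in driver version or offload settings.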
Regards,
--
Gaël Le Mignot - gael at pilotsystems.net
Pilot Systems - 82, rue de Pixérécourt - 75020 Paris
Tel : +33 1 44 53 05 55 - www.pilot-systems.net
Manage your contacts and newsletters: www.cockpit-mailing.com