[Intel-wired-lan] e1000e driver - hang after 4 hours of uptime - finally bisected!

Thu Jun 18 22:34:23 UTC 2015

On Thu, 2015-06-18 at 12:46 -0400, Valdis Kletnieks wrote:
> (follow up to a report from last week - bisecting took a while as I could
> only do 1 or 2 tests an evening)
> 
> My Dell Latitude E6530 crashes with a specific kernel lockup almost
> exactly 4 hours after boot if there isn't a cable connected to the
> Ethernet port:
> 
> [14508.846327] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> [14468.229720] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> [14463.254791] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> [14491.134413] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
> [14463.396593] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
> [14490.390223] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
> [14494.680591] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> [14513.365378] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
> [14482.271716] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
> [14479.906820] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> 
> As far as I can tell, the timestamp jitter is just how long it takes me to
> enter the cryptLUKS passphrase for the hard drive at boot...
> 
> lspci tells me:
> 
> lspci -vvv -s "00:19.0"
> 00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04)
>         DeviceName:  Onboard LAN
>         Subsystem: Dell Device 0535
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0
>         Interrupt: pin A routed to IRQ 28
>         Region 0: Memory at f7700000 (32-bit, non-prefetchable) [size=128K]
>         Region 1: Memory at f7739000 (32-bit, non-prefetchable) [size=4K]
>         Region 2: I/O ports at f040 [size=32]
>         Capabilities: [c8] Power Management version 2
>                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>                 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
>         Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>                 Address: 00000000fee00318  Data: 0000
>         Capabilities: [e0] PCI Advanced Features
>                 AFCap: TP+ FLR+
>                 AFCtrl: FLR-
>                 AFStatus: TP-
>         Kernel driver in use: e1000e
> 
> 
> The traceback always looks like:
> 
> [14479.906820] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> 
> [14479.906908] Call Trace:
> [14479.906914]  <NMI>  [<ffffffffba94db16>] dump_stack+0x50/0xa8
> [14479.906930]  [<ffffffffba948bb9>] panic+0xcd/0x1e4
> [14479.906940]  [<ffffffffba166a60>] ? perf_event_task_disable+0xc0/0xc0
> [14479.906952]  [<ffffffffba125d8b>] watchdog_overflow_callback+0x9b/0xa0
> [14479.906959]  [<ffffffffba16a684>] __perf_event_overflow+0xc4/0x1f0
> [14479.906968]  [<ffffffffba16b3a4>] perf_event_overflow+0x14/0x20
> [14479.906976]  [<ffffffffba022271>] intel_pmu_handle_irq+0x1e1/0x430
> [14479.906990]  [<ffffffffba01a0f6>] perf_event_nmi_handler+0x26/0x40
> [14479.906999]  [<ffffffffba0085b3>] nmi_handle+0x103/0x340
> [14479.907005]  [<ffffffffba0084b5>] ? nmi_handle+0x5/0x340
> [14479.907017]  [<ffffffffba008a53>] default_do_nmi+0xc3/0x120
> [14479.907032]  [<ffffffffba008b98>] do_nmi+0xe8/0x130
> [14479.907044]  [<ffffffffba95c9a8>] end_repeat_nmi+0x1e/0x2e
> [14479.907055]  [<ffffffffba529886>] ? e1000e_cyclecounter_read+0x16/0xc0
> [14479.907061]  [<ffffffffba529886>] ? e1000e_cyclecounter_read+0x16/0xc0
> [14479.907069]  [<ffffffffba529886>] ? e1000e_cyclecounter_read+0x16/0xc0
> [14479.907075]  <<EOE>>  [<ffffffffba0e9529>] timecounter_read+0x19/0x60
> [14479.907088]  [<ffffffffba53687e>] e1000e_phc_gettime+0x2e/0x60
> [14479.907098]  [<ffffffffba536a31>] e1000e_systim_overflow_work+0x31/0x70
> [14479.907105]  [<ffffffffba07ad19>] process_one_work+0x3c9/0x980
> [14479.907115]  [<ffffffffba07ac62>] ? process_one_work+0x312/0x980
> [14479.907125]  [<ffffffffba07b348>] ? worker_thread+0x78/0x760
> [14479.907134]  [<ffffffffba07b59c>] worker_thread+0x2cc/0x760
> [14479.907144]  [<ffffffffba07b2d0>] ? process_one_work+0x980/0x980
> [14479.907154]  [<ffffffffba082a5e>] kthread+0xfe/0x120
> [14479.907163]  [<ffffffffba08ca50>] ? finish_task_switch+0x50/0x1c0
> [14479.907173]  [<ffffffffba082960>] ? kthread_create_on_node+0x270/0x270
> [14479.907179]  [<ffffffffba95ae4f>] ret_from_fork+0x3f/0x70
> [14479.907188]  [<ffffffffba082960>] ? kthread_create_on_node+0x270/0x270
> [14479.907243] Kernel Offset: 0x39000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> 
> Bisection tells me it's this commit:
> 
> commit 83129b37ef35bb6a7f01c060129736a8db5d31c4
> Author: Yanir Lubetkin <yanirx.lubetkin at intel.com>
> Date:   Tue Jun 2 17:05:45 2015 +0300
> 
>     e1000e: fix systim issues
> 
>     Two issues involving systim were reported.
>     1. Clock is not running in the correct frequency
>     2. In some situations, systim values were not incremented linearly
>     This patch fixes the hardware clock configuration and the spurious
>     non-linear increment.

Thanks, Valdis!  I will have Yanir look into it, and hopefully we will
have a fix soon for you to verify.
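
As a quick sanity check on the "almost exactly 4 hours" observation, the
quoted panic timestamps can be compared against a 4-hour mark. This is only
a minimal sketch: the 14400-second period is an assumption taken from the
wording of the report (it would be consistent with the
e1000e_systim_overflow_work frame in the trace being periodic work, but
nothing in this thread confirms the period), and the names FOUR_HOURS and
offsets_past are made up for illustration.

```python
# Panic timestamps (seconds since boot), copied from the quoted report.
panic_times = [
    14508.846327,
    14468.229720,
    14463.254791,
    14491.134413,
    14463.396593,
    14490.390223,
    14494.680591,
    14513.365378,
    14482.271716,
    14479.906820,
]

# Assumed period: 4 hours in seconds. This value comes from the report's
# "almost exactly 4 hours" wording, not from anything confirmed in the thread.
FOUR_HOURS = 4 * 60 * 60


def offsets_past(period, times):
    """Return how many seconds each timestamp lies past `period`."""
    return [t - period for t in times]


offsets = offsets_past(FOUR_HOURS, panic_times)

if __name__ == "__main__":
    for t, off in zip(panic_times, offsets):
        print(f"{t:.6f} s = {t / 3600:.3f} h ({off:.1f} s past the 4 h mark)")
    print(f"spread between earliest and latest panic: "
          f"{max(offsets) - min(offsets):.1f} s")
```

Every panic lands between one and two minutes past the 4-hour mark, and the
spread is under a minute, which fits the explanation in the report that the
jitter is boot-time delay (passphrase entry) rather than variation in the
trigger itself.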