2011-05-06

When a PC is not Point-and-Shoot: WHEA-Logger EventID 17

When everything's working smoothly, a PC seems a simple device, like a toaster, or, to use a better example, like a digital camera. Turn it on, launch an application, do work, and turn it off. A smoothly working PC offers its own quietly efficient version of Point and Shoot. 

Beneath the covers, with both devices, but more so with today's PC, a lot needs to go right to produce the appearance of a carefree toaster-like experience. Some problems are comparatively common (if not frequent) but straightforward to diagnose and correct. Disk drives still fail, and while SMART doesn't generally provide warning of pending failure, it's usually obvious when a failure happens and the corrective action is also obvious. But once in a while, problems arise that are a reminder of how much complexity lies beneath that ever-thinner chassis.

An incident in this latter category has been occurring with an HP EliteBook 8540p laptop. Several times a minute, Windows 7 writes this entry to the event log:

Event 17, WHEA-Logger
Component: PCI Express Root Port
Error Source: Advanced Error Reporting (PCI Express)
Bus: Device:Function: 0x0:0x3:0x0
Vendor ID:Device ID: 0x8086:0xd138
Class Code: 0x30400
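
For anyone comparing entries across machines, the fields above can be pulled apart mechanically. What follows is only a minimal sketch in Python: the function names (parse_whea_event, summarize) and the tiny vendor table are made up for illustration, and the input is assumed to be the event text copied out of Event Viewer exactly as shown above. The one fact it relies on is that PCI vendor ID 0x8086 belongs to Intel; the class code is left undecoded.

```python
# Sketch: split a copied WHEA-Logger event into fields and format the
# PCI address. Names here are illustrative, not part of any Windows API.

EVENT_TEXT = """\
Event 17, WHEA-Logger
Component: PCI Express Root Port
Error Source: Advanced Error Reporting (PCI Express)
Bus: Device:Function: 0x0:0x3:0x0
Vendor ID:Device ID: 0x8086:0xd138
Class Code: 0x30400
"""

# Deliberately tiny lookup table; 0x8086 is Intel's PCI vendor ID.
KNOWN_VENDORS = {0x8086: "Intel"}


def parse_whea_event(text):
    """Return a dict of label -> value for every 'label: value' line."""
    fields = {}
    for line in text.splitlines():
        if ": " not in line:
            continue  # e.g. the "Event 17, WHEA-Logger" header line
        # Labels may contain colons themselves, so split on the last ": ".
        key, value = line.rsplit(": ", 1)
        # Normalize "Bus: Device:Function" and "Bus:Device:Function" alike.
        fields[key.strip().replace(": ", ":")] = value.strip()
    return fields


def hex_tuple(value):
    """Convert '0x0:0x3:0x0' into (0, 3, 0)."""
    return tuple(int(part, 16) for part in value.split(":"))


def summarize(text):
    """Collect the device-identifying fields from the event text."""
    fields = parse_whea_event(text)
    bus, dev, fn = hex_tuple(fields["Bus:Device:Function"])
    vendor, device = hex_tuple(fields["Vendor ID:Device ID"])
    return {
        "component": fields["Component"],
        "bdf": f"{bus:02x}:{dev:02x}.{fn:x}",  # conventional bus:device.function form
        "vendor_id": vendor,
        "device_id": device,
        "vendor_name": KNOWN_VENDORS.get(vendor, "unknown"),
        "class_code": int(fields["Class Code"], 16),
    }


print(summarize(EVENT_TEXT))
```

The vendor and device IDs are the pieces worth searching on, since they identify the reporting root port itself rather than whatever device sits behind it.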

Windows provides more detail in the full Windows Hardware Error Architecture (WHEA) message, which is exposed through WMI. Several problems with the laptop could have been related to this error, which was logged at least several times a minute, and sometimes more often depending on what processing was occurring. These included infrequent blue screens, fairly frequent glitching in pro audio hardware and software (e.g., Echo AudioFire Pre8 and Ableton Live, as well as various USB audio devices), and an unsatisfactory rating from the Echo-recommended DPC Latency Checker.

Over several months, HP support and I worked on this issue. New device drivers were tried, USB device drivers were removed and added, self-tests and diagnostics were run. Nothing turned up. Finally, the hated "reinstall Windows" suggestion reared its head. Since everything else had been tried, the partition was erased and Windows 7 x64 was reinstalled. Result? Even before all the HP utilities were reinstalled, the problem recurred.

The last HP technician suggested it could be an issue with the Nvidia NVS 5100M discrete graphics card. There are two related threads in forums that illustrate the scope and diligence of some users in trying to resolve this error. One is in NotebookReview (more related to Asus motherboard problems) and the other is in an Intel Community forum (where most of the blame was laid at the feet of AMD ATI graphics cards).

The laptop is off to HP for a repair attempt. Stay tuned. In the meantime, should this post have appeared in TechnologyHead.com or in ErrorProcessing.com? Arguing in favor of the latter, for instance, were several remarks by hardware technicians consulted by worried forum posters that "these problems didn't used to be reported anyway, and can easily be ignored." Arguing in favor of TechnologyHead was the aspect of complexity; after all, the error message was close to what passes for best practice these days in error reporting (if not recovery).

5 comments:

Martin Møller said...

Hi there.

My 8540w model is doing the same thing. Tons of WHEA 17 errors hourly, thousands daily.

They stopped for a little while, when I installed the latest nVidia graphics driver, but after a reboot, they returned...

I would be inclined to ignore them, seeing as they are 'warnings', but my machine crashes almost every time I attempt to suspend it, even if Google Chrome is closed (Chrome seems to have issues with suspend).

Have you had your machine back yet? Is there some clarification of why this happens over and over?

Regards,

/Martin.

knowlengr said...

Martin,

You're not going to like this answer.

A total of five in-warranty case numbers were assigned by HP. It went back to the KY depot four times. They tried replacing the NVidia card twice, the hard drive twice (why, I can't figure), and caused one of the returns by leaving a fan disconnected.

After many, many calls to HP (I keep a journal of such things, and there are too many entries to count here) the final case was escalated to Customer Service. I was given a couple of options, but the only option that would leave me with a MB that supported 16GB was to upgrade to a new -w model. I am currently in the fourth week of waiting for a new machine to be assembled in China.

My own assessment is that there was a problem with the motherboard or an interaction between the MBO and on-board devices. AFAIK HP never tried to replace the MBO.

If you search patiently for this error, you'll find that the same MBO maker in Taiwan supplied this MBO in an Asus machine with similar problems. The disposition was inconclusive, but the symptoms were strikingly similar. Attempts were made to blame ATI in this latter case, but I think that was misplaced.

It goes without saying that HP's handling of the case ranged from humorously inept to shockingly incompetent. Their internal workflow at times was so poor in the handoffs among the call center in India, the call center in New Mexico, Technical Support, and Customer Service that I almost felt sorry for the person at the other end of the phone -- almost. That HP itself did not have a process to intervene after the machine had been back to the depot three times showed an indifference to its customers -- and this is supposed to be their business class laptop with top tier service.

I see that your issues were with suspend; I disabled all that because my machine was used for audio processing. It may or may not be related to the central problem. If you read Microsoft's discussion of the error, it appears that the error wasn't detected or at least reported in previous versions of Windows. This may be a case of Microsoft keeping the hardware boys honest.

I don't have a replacement machine yet, and must confess that I wouldn't be surprised if the replacement machine shows the same flaw. If all you use the machine for is MS Office or Adobe applications, you may be (in my instance) OK, so long as you periodically purge the event log.

For semi-real time events such as I need for audio processing, it's a deal-breaker, because Windows stops servicing USB interrupts briefly to handle the WHEA error.

I'm sure HP had you install all the current firmware and driver updates (and they are numerous), and those are needed to address other problems, though I never found that it affected the core WHEA error 17.

The India-based technicians were effective enough, but they apparently had no initiative to escalate the problem when it recurred. Two different technicians logged in to see the Event log themselves: Didn't they believe me? I could see doing it a first time, but the second time was patronizing.

The representative at Customer Service could not put together a grammatically correct English sentence in her emails, and was both unsympathetic and rigid. I despaired of getting any sort of suitable replacement, when suddenly someone kicked my issue back to my original sales agent! After that, things moved in a sensible path toward a replacement (though time will tell whether the order is actually in the pipeline).

One of the more annoying turns of events in the HP handling of the case was their request that I reinstall Windows. It's reasonable to do that after everything else has failed, but it was apparent from the timestamps in the event log that the WHEA errors were logged *while* the machine was still in the HP depot facility.

To show for all this, I have an altar of laptop shipping boxes, since HP's practice is to ship out a new empty box each time.

Do let me know how your story turns out.

Fingers said...

Hi Mark,

Did you ever get this solved? I'm having similar problems with an EliteBook 8540W with an FX1800 display adapter (WHEA 17 errors and BSODs).

The HP guy is going to swap out the video card as they seem to think that this is the issue. I hope it'll solve the dramas for me.

Thanks
Paul Lehmann

knowlengr said...

Never did. After five returns to the depot, I gave up on that machine and persuaded HP to replace it with a different model. Nothing they tried made any difference, though AFAIK they never tried to replace the motherboard. Pretty much everything else, though -- to no avail.

Fingers said...

Hi Mark,

Hopefully the new machine is going much better than the old one! I'm a bit stumped as to why they wouldn't have tried the MB, seeing as they had replaced everything else. I suppose it doesn't matter now.

HP have swapped out the display adapter in my machine. It's been a day and no WHEA 17 errors so far, no blue screens either. I've got my fingers crossed.

Cheers
Paul