Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
On Fri, May 05, 2006 at 08:23:44AM -0700, cerise@xxxxxxxxxx wrote:
> > Michal Szymanski wrote:
> >
> > >All systems crash (either hang with some "machine check exception"
> > >kernel messages or reset) when loaded with repeating runs of 1.3gb, CPU
> > >intensive with some I/O. I run 2 or 4 jobs simultaneously and they had
> > >never survived more than a few hours.
>
> Let's try the easy stuff first -- if it's crashing with a machine check
> exception, then let's disable machine check exceptions, and see if things
> still break.
>
> Try booting with the parameter "nomce". Be aware that mce is a mechanism
> for the processor to inform the kernel of thermal issues or component
> failure. You'll only want to disable this mechanism if you aren't having
> thermal problems.
I tried "nomce". The machine no longer halts with MCE kernel panic
messages on screen, but it still resets after 3-4 hours of running 2 or
more jobs. As I wrote in my response to Robert's message, it seems to be
a memory issue, as there are no crashes with Kingston 1GB memory modules.
One of the machines, together with the memory, has gone back to the
dealer for tests.
> P.S. I came a little late to this party -- I didn't see the original message.
> Did you include the text of the kernel crash?
Below is the kernel message, OCR-ed from a digital photo of the screen :)
It is followed by the decoded message, as advised by the first message:
Fedora Core release 4 (Stentz)
kernel 2.6.16-1.2069_FC4smp on an x86_64
red10 login:
HARDWARE ERROR
CPU 0: Machine Check Exception: 4 Bank 4: f604a00200000813
TSC 1504205a42ba ADDR 115e47828
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Machine check
Call Trace: <#MC>
<ffffffff80134e6a>{panic+133} <ffffffff801129eb>{mcheck_timer+0}
<ffffffff801131fc>{do_machine_check+753}
<ffffffff8010be43>{machine_check+127} <EOE>
------------------
mcelog --ascii output:
HARDWARE ERROR
CPU 0 BANK 4 TSC 1504205a42ba
MCG status:MCIP
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_ADDR register valid
Processor context corrupt
MCA:BUS Generic Originated-request Read Memory-access Request-timeout Error
Model:
STATUS f604a00200000813 MCGSTATUS 4
------------------
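For anyone who wants to cross-check the flag lines in the mcelog output
against the raw STATUS value, here is a minimal Python sketch. It only
decodes the architectural flag bits at the top of the status word (bits
63-57) and the low three bits of MCGSTATUS; it does not attempt to decode
the low 16-bit MCA error code or any model-specific fields. The table and
function names are made up for illustration and are not part of mcelog.

```python
# Flag bits in the upper part of an MCi_STATUS value (bit number, meaning).
# Names are chosen to match the wording of the mcelog output above.
MCI_STATUS_FLAGS = [
    (63, "Valid"),
    (62, "Error overflow"),
    (61, "Uncorrected error"),
    (60, "Error enabled"),
    (59, "MCi_MISC register valid"),
    (58, "MCi_ADDR register valid"),
    (57, "Processor context corrupt"),
]

# Low bits of the global machine check status register.
MCG_STATUS_FLAGS = [
    (0, "RIPV"),
    (1, "EIPV"),
    (2, "MCIP"),
]


def decode_flags(value, table):
    """Return the names of all flags in `table` whose bit is set in `value`."""
    return [name for bit, name in table if (value >> bit) & 1]


if __name__ == "__main__":
    status = 0xF604A00200000813  # STATUS value from the report above
    mcgstatus = 0x4              # MCGSTATUS value from the report above
    for name in decode_flags(status, MCI_STATUS_FLAGS):
        print("MCi status:", name)
    for name in decode_flags(mcgstatus, MCG_STATUS_FLAGS):
        print("MCG status:", name)
```

The set flags agree with what mcelog printed: overflow, uncorrected,
enabled, address valid, and processor context corrupt, with MCIP set in
the global status.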
regards, Michal.
--
Michal Szymanski (msz at astrouw dot edu dot pl)
Warsaw University Observatory, Warszawa, POLAND