
Re: Question about: EDAC amd4 MC0: Failed to translate InputAddr to csrow for address 0x27b028ff0


On Wed, Aug 08, 2012 at 01:12:39AM +0000, Jiang Wang wrote:
> I just tested with upstream kernel 3.5 and got similar errors. There is one difference between the mainline kernel and my customized SL 6.2 kernel. On the mainline kernel, the InputAddr is larger than the number shown in "init_memory_mapping" during boot time. On my customized SL 6.2 kernel, the InputAddr is smaller than it.
> 
> Following are the details:
> 
> I downloaded linux-3.5.tar.bz2 from kernel.org and compiled it with a config file similar to the one from my SL 6.2 kernel. I added a couple of lines in EDAC to disable syncflood. Then I booted the 3.5 kernel and used the sysfs nodes to inject double-bit errors, as in my first email. I tried four times and saw the following errors in three of them. The remaining test was OK (the address was translated successfully).
> 
> # echo 1 > inject_read 
> # echo 1 > inject_write
> EDAC DEBUG: mcidev_store: mcidev_store() mem_ctl_info ffff8803fe5d7000
> EDAC DEBUG: amd64_inject_write_store: section=0x80000006 word_bits=0x8020088
> Disabling lock debugging due to kernel taint
> [Hardware Error]: 	MC4_ADDR: 0x000000fdf700cdb0
> [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
> EDAC DEBUG: amd64_get_dram_hole_info:   DHAR info for node 0 base 0xe0000000 offset 0x20000000 size 0x20000000
> EDAC DEBUG: sys_addr_to_dram_addr: using DRAM Base register to translate SysAddr 0x41f00cdb0 to DramAddr 0x41f00cdb0
> EDAC DEBUG: dram_addr_to_input_addr:   Intlv Shift=0 DramAddr=0x41f00cdb0 maps to InputAddr=0x41f00cdb0
> EDAC DEBUG: sys_addr_to_input_addr: SysAdddr 0x41f00cdb0 translates to InputAddr 0x41f00cdb0
> EDAC DEBUG: input_addr_to_csrow: no matching csrow for InputAddr 0x41f00cdb0 (MC node 0)
> EDAC amd64 MC0: Failed to translate InputAddr to csrow for address 0x41f00cdb0
> EDAC amd64 MC0: ERROR_ADDRESS (0x41f00cdb0) NOT mapped to CS
> EDAC MC0: UE amd64_edac on any memory (8¬À^ÿÿFpage:0x41f00c offset:0xdb0 grain:0 - ERROR ADDRESS NOT mapped to CS)
> [Hardware Error]: ca                                                                          
> 
> Following is from boot log:
> 
> init_memory_mapping: [mem 0x00000000-0xdfe8ffff]
> init_memory_mapping: [mem 0x100000000-0x41effffff]
> 
> Note the address in EDAC error is 0x41f00cdb0, and it is bigger than 0x41effffff. I guess that is why it cannot translate the address. But then the question becomes why we get this address: 0x41f00cdb0.
> 
> The InputAddr was 0x41f0014b0 and 0x41f014db0 in the other two tests.

Yes, this is right above TOP_MEM2 so of course this is not a valid DRAM
address.
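
As a quick sanity check of that statement, here is a minimal Python sketch
that compares the three reported addresses against the two
init_memory_mapping ranges quoted above. It assumes the end of the last
mapped range is a reasonable stand-in for TOP_MEM2, which the logs do not
show directly. The helper names are made up for illustration.

```python
# Inclusive ranges copied from the quoted init_memory_mapping boot lines.
MAPPED_RANGES = [
    (0x00000000, 0xdfe8ffff),
    (0x100000000, 0x41effffff),
]

# InputAddr values reported in the three failing injection runs.
ERROR_ADDRS = [0x41f00cdb0, 0x41f0014b0, 0x41f014db0]


def is_mapped(addr, ranges=MAPPED_RANGES):
    """True if addr falls inside one of the inclusive (start, end) ranges."""
    return any(start <= addr <= end for start, end in ranges)


def offset_above_top(addr, ranges=MAPPED_RANGES):
    """Distance in bytes from the first address past the highest range."""
    first_unmapped = max(end for _, end in ranges) + 1
    return addr - first_unmapped


for addr in ERROR_ADDRS:
    state = "mapped" if is_mapped(addr) else "not mapped"
    print(hex(addr), state, "offset above top:", hex(offset_above_top(addr)))
```

All three addresses land within the first 128K past the end of the last
mapped range, which is consistent with "right above" the top of memory.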

Can you do

$ lspci -xxxx -s 0x18 > lspci.log

and send me that file? Privately is fine too. Just make sure it contains
the extended PCI config space: the dump for each PCI function should end
at offset 0xfff, i.e. cover the full 4K.
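
One way to check the dump before sending it is sketched below in Python.
It assumes the usual lspci hex dump layout (a header line starting with
the bus:slot.func address, followed by rows of the form "offset: 16 hex
bytes"); the function names are hypothetical and the regular expressions
may need adjusting for other pciutils versions.

```python
import re

# Header line such as "00:18.2 Host bridge: ..." (optional PCI domain prefix).
HEADER_RE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")
# Hex dump row such as "ff0: 00 00 ... 00" (offset, then up to 16 bytes).
ROW_RE = re.compile(r"^([0-9a-f]{2,3}):((?: [0-9a-f]{2})+)\s*$")


def config_space_sizes(dump_text):
    """Map each PCI function in the dump to the number of bytes dumped."""
    sizes = {}
    current = None
    for line in dump_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            sizes[current] = 0
            continue
        row = ROW_RE.match(line)
        if row and current is not None:
            offset = int(row.group(1), 16)
            nbytes = len(row.group(2).split())
            sizes[current] = max(sizes[current], offset + nbytes)
    return sizes


def truncated_functions(dump_text, expected=0x1000):
    """List functions whose dump stops short of the expected size (4K)."""
    sizes = config_space_sizes(dump_text)
    return [dev for dev, size in sizes.items() if size < expected]
```

Calling truncated_functions(open("lspci.log").read()) should return an
empty list for a complete dump. If it does not, a likely cause is that
lspci was not run as root, since the extended config space is normally
readable only by root.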

[ … ]

> CPU0: AMD Opteron(tm) Processor 4226                  stepping 02
> Performance Events: AMD Family 15h PMU driver.
> ... version:                0
> ... bit width:              48
> ... generic registers:      6
> ... value mask:             0000ffffffffffff
> ... max period:             00007fffffffffff
> ... fixed-purpose events:   0
> ... event mask:             000000000000003f
> MCE: In-kernel MCE decoding enabled.
> [Hardware Error]: CPU:0	MC4_STATUS[-|UE|MiscV|PCC|AddrV|-|-|UECC]: 0xbe002000ed080813
> [Hardware Error]: 	MC4_ADDR: 0x000000fdf7014db0
> [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
> [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)

This error got logged before reboot and you're seeing it now.
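
For reference, the flags printed in the MC4_STATUS line above can be
reproduced from the raw value with a short Python sketch. The bit
positions used here are the generic x86 MCA status bits plus the AMD
CECC/UECC bits as I understand them; treat them as an assumption and
check them against the BKDG for your family before relying on them.

```python
# Assumed bit positions in MCi_STATUS (verify against the AMD BKDG).
STATUS_BITS = {
    "Val": 63,    # a valid error is logged in this bank
    "Over": 62,   # error overflow
    "UC": 61,     # uncorrected error (shown as UE in the kernel output)
    "En": 60,     # error reporting was enabled
    "MiscV": 59,  # MISC register contents valid
    "AddrV": 58,  # ADDR register contents valid
    "PCC": 57,    # processor context corrupt
    "CECC": 46,   # correctable ECC error
    "UECC": 45,   # uncorrectable ECC error
}


def decode_status(status):
    """Return a dict of flag name -> bool for the raw status value."""
    return {name: bool((status >> bit) & 1) for name, bit in STATUS_BITS.items()}


flags = decode_status(0xbe002000ed080813)
print(sorted(name for name, is_set in flags.items() if is_set))
```

With the value from the log, the set flags agree with the bracketed list
the kernel printed (UE, MiscV, PCC, AddrV, UECC), and Val being set is
what makes the bank contents show up again after the reboot.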

> Booting Node   0, Processors  #1 #2 #3 #4 #5
> Brought up 6 CPUs
> Total of 6 processors activated (32402.72 BogoMIPS).

[ … ]

> EDAC MC: Ver: 2.1.0
> AMD64 EDAC driver v3.4.0
> EDAC amd64: DRAM ECC enabled.
> usb 1-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
> EDAC amd64: SyncFlood on UE Enabled, Disabling....

Haha, I can see your small change here. And I was wondering why you even
get the UEs reported without the box syncflooding :).

> EDAC amd64: F15h detected (node 0).
> EDAC amd64: MC: 0:  2048MB 1:  2048MB
> EDAC amd64: MC: 2:  2048MB 3:  2048MB
> EDAC amd64: MC: 4:     0MB 5:     0MB
> EDAC amd64: MC: 6:     0MB 7:     0MB
> EDAC amd64: MC: 0:  2048MB 1:  2048MB
> EDAC amd64: MC: 2:  2048MB 3:  2048MB
> EDAC amd64: MC: 4:     0MB 5:     0MB
> EDAC amd64: MC: 6:     0MB 7:     0MB

Right, so you have 16GB of memory on a single node but the injection
address is right above it.
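
The 16GB figure follows from summing the chip select sizes in the quoted
lines. A minimal Python sketch of that arithmetic is below; it assumes
the two identical tables are one per DRAM channel and that both count
towards the total, which is an interpretation of the log rather than
something the log states.

```python
import re

# The eight "MC:" lines exactly as quoted above (two identical tables).
MC_LINES = """\
EDAC amd64: MC: 0:  2048MB 1:  2048MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4:     0MB 5:     0MB
EDAC amd64: MC: 6:     0MB 7:     0MB
EDAC amd64: MC: 0:  2048MB 1:  2048MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4:     0MB 5:     0MB
EDAC amd64: MC: 6:     0MB 7:     0MB
"""


def total_mb(log_text):
    """Sum every '<index>: <size>MB' entry found in the log text."""
    return sum(int(size) for _, size in re.findall(r"(\d+):\s+(\d+)MB", log_text))


print(total_mb(MC_LINES), "MB =", total_mb(MC_LINES) // 1024, "GB")
```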

> EDAC amd64: using x4 syndromes.

This could explain why the double-bit injection on my box was still a CE
- I'm using x8 DIMMs and thus x8 symbols so that 8 bits are covered by a
single symbol.
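
To illustrate the symbol-size argument, here is a simplified Python
sketch. It assumes a code that can correct any error confined to one
symbol, and that consecutive bits are grouped into symbols, which is a
simplification of how bits are really distributed across DRAM devices.
The mask 0x88 (bits 3 and 7) is a hypothetical double-bit pattern chosen
for the example.

```python
def symbols_hit(bit_mask, symbol_width):
    """Return the set of symbol indices that contain a flipped bit."""
    hit = set()
    bit = 0
    while bit_mask >> bit:
        if (bit_mask >> bit) & 1:
            hit.add(bit // symbol_width)
        bit += 1
    return hit


def correctable(bit_mask, symbol_width):
    """True if all flipped bits fall into at most one symbol."""
    return len(symbols_hit(bit_mask, symbol_width)) <= 1


mask = 0x88  # hypothetical injection pattern: bits 3 and 7
for width in (4, 8):
    verdict = "CE" if correctable(mask, width) else "UE"
    print("x%d symbols:" % width, sorted(symbols_hit(mask, width)), verdict)
```

Under these assumptions the same two flipped bits sit in two different
x4 symbols but in a single x8 symbol, so the x4 case is uncorrectable
while the x8 case can still be corrected.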

> EDAC amd64: MCT channel count: 2
> EDAC amd64: CS0: Registered DDR3 RAM
> EDAC amd64: CS1: Registered DDR3 RAM
> EDAC amd64: CS2: Registered DDR3 RAM
> EDAC amd64: CS3: Registered DDR3 RAM
> EDAC MC0: Giving out device to 'amd64_edac' 'F15h': DEV 0000:00:18.2
> EDAC PCI0: Giving out device to module 'amd64_edac' controller 'EDAC PCI controller': DEV '0000:00:18.2' (POLLED)

Ok, that should be it for now. I don't see anything else out of the
ordinary in the logs.

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

