On Mon, Aug 06, 2012 at 08:57:38PM +0000, Jiang Wang wrote:
> Hi there,
>
> I am testing amd64 error injection with EDAC driver from SL6.2 on a machine which have two AMD Opteron(tm) Processor 2378. After I injected two bits errors via sysfs nodes, I got following message:
>
> [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 4: be002000ed080813
> [Hardware Error]: TSC 171bba7b052 ADDR 27b028ff0 MISC c008000201000000
> [Hardware Error]: PROCESSOR 2:100f42 TIME 1343090754 SOCKET 0 APIC 0
> [Hardware Error]: MC4_STATUS[-|UE|MiscV|PCC|AddrV|UECC]: 0xbe002000ed080813
> [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
> EDAC amd64 MC0: Failed to translate InputAddr to csrow for address 0x27b028ff0

Btw, this is an uncorrectable error

MC4_STATUS[Val|UC|EN|MiscV|AddrV|PCC|UECC|EEC: DRAM ECC (0x08) (synd=0xed00)|ET: BUS(pp:SRC;t:NOTIMOUT;r4:RD;ii:MEM;ll:LG)]: 0xbe002000ed080813

because you're injecting 0x88, i.e. two bits, each in a different symbol.
Which could mean that you have x4 DIMMs.

Also, that physical address 0x27b028ff0 is around 10.5 GB-ish, do you have that much memory on the system?

Also, can you send full dmesg pls?

> ---------------------------------------------------------------------
>
> EDAC amd64 MC0: ERROR_ADDRESS (0x27b028ff0) NOT mapped to CS
> EDAC MC0: UE - no information available: amd64_edac
> [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
> [Hardware Error]: Machine check: Processor context corrupt
> Kernel panic - not syncing: Fatal machine check on current CPU
>
> My questions is about this line: Failed to translate InputAddr to csrow for address 0x27b028ff0.
>
> Is this an expected behavior or is it a bug?
> Btw: sometimes, the address can be translate to csrow.
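
For anyone who wants to cross-check the decoded MC4_STATUS fields by hand, here is a minimal Python sketch that pulls the flag bits, the extended error code and the syndrome out of the raw value. The bit positions are assumed from the Family 10h MCi_STATUS layout, the 4-bit symbol size for x4 devices is likewise an assumption, and the names (`FLAG_BITS`, `decode_mc4_status`, `set_bit_symbols`) are made up for illustration; this is not the kernel's code.

```python
# Illustrative sketch only; bit positions assumed from the F10h MCi_STATUS layout.
FLAG_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "EN": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57, "CECC": 46, "UECC": 45,
}


def decode_mc4_status(status):
    """Return (names of set flags, extended error code, 16-bit syndrome)."""
    flags = [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]
    ext_code = (status >> 16) & 0x1F  # bits 20:16
    # syndrome[7:0] sits in bits 54:47, syndrome[15:8] in bits 31:24
    syndrome = ((status >> 47) & 0xFF) | ((status >> 16) & 0xFF00)
    return flags, ext_code, syndrome


def set_bit_symbols(vector, symbol_bits=4):
    """Return the symbol index of each set bit (4-bit symbols are an assumption)."""
    return [bit // symbol_bits
            for bit in range(vector.bit_length())
            if (vector >> bit) & 1]


flags, ext_code, synd = decode_mc4_status(0xBE002000ED080813)
print("|".join(flags), hex(ext_code), hex(synd))
print(set_bit_symbols(0x88))                   # symbols hit by the injected vector
print(round(0x27B028FF0 / 2**30, 2), "GiB")    # size of the reported address
```

With the status value from the report this gives the same flag set, extended error code 0x08 and syndrome 0xed00 as the decoded line above. The two set bits of 0x88 land in symbols 0 and 1 under the 4-bit assumption, and the address works out to roughly 9.92 GiB (about 10.65 GB in decimal units).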
>
> The commands I used to inject errors:
>
> cd /sys/devices/system/edac/mc/mc0
> echo 3 > inject_section
> echo 7 > inject_word
> echo 0x88 > inject_ecc_vector
> echo 1 > inject_read
> echo 1 > inject_write

FWIW, I did the same on a F10h box here with 3.6-rc1 and it looks ok:

[62312.996927] [Hardware Error]: CPU:0 MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc11400088080a13
[62313.006006] [Hardware Error]: MC4_ADDR: 0x00000004252dfe70
[62313.011832] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
[62313.011837] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x4252dfe70
[62313.011843] EDAC DEBUG: f1x_match_to_this_node: (range 0) SystemAddr= 0x4252dfe70 Limit=0x437ffffff
[62313.011846] EDAC DEBUG: f1x_match_to_this_node: Normalized DCT addr: 0x1f696fe40
[62313.011849] EDAC DEBUG: f1x_lookup_addr_in_dct: input addr: 0x1f696fe40, DCT: 0
[62313.011853] EDAC DEBUG: f1x_lookup_addr_in_dct: CSROW=0 CSBase=0x0 CSMask=0xffffffe1fff9ffff
[62313.011857] EDAC DEBUG: f1x_lookup_addr_in_dct: (InputAddr & ~CSMask)=0x60000 (CSBase & ~CSMask)=0x0
[62313.011861] EDAC DEBUG: f1x_lookup_addr_in_dct: CSROW=1 CSBase=0x20000 CSMask=0xffffffe1fff9ffff
[62313.011864] EDAC DEBUG: f1x_lookup_addr_in_dct: (InputAddr & ~CSMask)=0x60000 (CSBase & ~CSMask)=0x20000
[62313.011868] EDAC DEBUG: f1x_lookup_addr_in_dct: CSROW=2 CSBase=0x40000 CSMask=0xffffffe1fff9ffff
[62313.011871] EDAC DEBUG: f1x_lookup_addr_in_dct: (InputAddr & ~CSMask)=0x60000 (CSBase & ~CSMask)=0x40000
[62313.011875] EDAC DEBUG: f1x_lookup_addr_in_dct: CSROW=3 CSBase=0x60000 CSMask=0xffffffe1fff9ffff
[62313.011878] EDAC DEBUG: f1x_lookup_addr_in_dct: (InputAddr & ~CSMask)=0x60000 (CSBase & ~CSMask)=0x60000
[62313.011881] EDAC DEBUG: f1x_lookup_addr_in_dct: MATCH csrow=3
[62313.011894] EDAC MC0: 1 CE on unknown memory (csrow:3 channel:0 page:0x4252df offset:0xe70 grain:0 syndrome:0x8822)
[62313.011904] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

Of course, the address is different.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632
WEEE Registernr: 129 19551
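
The csrow comparison that the EDAC DEBUG lines above walk through can be replayed with a short Python sketch. It only mirrors the check the log prints, (InputAddr & ~CSMask) against (CSBase & ~CSMask), using the values from that log. The function names are hypothetical, a single mask shared by all csrows is a simplification taken from this particular log, and the 4 KiB page split is an assumption that happens to agree with the page/offset printed in the CE line.

```python
# Illustrative replay of the logged comparison; not the driver's actual code.
MASK64 = (1 << 64) - 1


def lookup_csrow(input_addr, cs_bases, cs_mask):
    """Return the first csrow whose base agrees with input_addr on the
    bits that cs_mask leaves unmasked, or None if no csrow matches."""
    compare_bits = ~cs_mask & MASK64
    for csrow, base in enumerate(cs_bases):
        if (input_addr & compare_bits) == (base & compare_bits):
            return csrow
    return None


def split_page_offset(addr, page_shift=12):
    """Split an address into (page, offset), assuming 4 KiB pages."""
    return addr >> page_shift, addr & ((1 << page_shift) - 1)


# Values copied from the debug log above.
cs_bases = [0x0, 0x20000, 0x40000, 0x60000]
cs_mask = 0xFFFFFFE1FFF9FFFF

print(lookup_csrow(0x1F696FE40, cs_bases, cs_mask))
page, offset = split_page_offset(0x4252DFE70)
print(hex(page), hex(offset))
```

With the logged values the lookup lands on csrow 3, and the address splits into page 0x4252df and offset 0xe70, same as in the CE line. When no base matches, the sketch returns None; whether that corresponds to what happened on the reporter's machine cannot be told from the output posted so far.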