Oracle servers: DIMM error

If you have a DIMM problem and unfortunately can't tell which DIMM is faulty, use FMA to identify it:

The send mondo panics are likely to be a result of the ECC errors described in Sun Alert 235041. The dump device is not set up correctly, which can be fixed with the following command.

dumpadm -d swap

Unfortunately the FMA system is disabled, so we cannot determine which DIMM is causing the errors.

svcs-av.out:disabled - Sep_13 - svc:/system/fmd:default
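The same check can be scripted against an explorer's svcs output. This is only a sketch: the helper name fmd_state is made up for illustration, and it assumes the state is the first column and the FMRI the last column, as in the line quoted above.

```shell
# Print the state column for the fmd service from `svcs -av` style text on stdin.
# fmd_state is a hypothetical helper name, not part of Solaris.
fmd_state() {
    awk '$NF == "svc:/system/fmd:default" { print $1 }'
}

# The line quoted above, without the grep filename prefix.
echo 'disabled - Sep_13 - svc:/system/fmd:default' | fmd_state   # prints: disabled
```

On a live system the input would come from svcs -a instead of the explorer file.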

To re-enable the FMA system, first remove the previous fault history, as it appears to be causing problems when FMA starts up.

Shut down the FMA subsystem

$ svcadm disable -s svc:/system/fmd:default

Remove all _files_ from the FMA log directories. Remove only the _files_ found; all directories must be left intact.

$ cd /var/fm/fmd
$ find /var/fm/fmd -type f -exec ls {} \;

Check that only files within the /var/fm/fmd directory are listed, then replace the ls with rm to remove them.
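The list-first, delete-second approach can be rehearsed on a throwaway directory before touching the real one. The sketch below is an illustration only: the function names are made up, and the file and subdirectory names in the demo tree are placeholders rather than a guaranteed copy of the /var/fm/fmd layout.

```shell
# List, then delete, regular files under a directory while keeping the
# directory tree intact. Function names are hypothetical helpers.
list_fmd_files() {
    find "$1" -type f -print
}

purge_fmd_files() {
    find "$1" -type f -exec rm -f {} \;
}

# Demonstration on a scratch tree, NOT on /var/fm/fmd itself.
# The names below are placeholders chosen for the demo.
demo=$(mktemp -d)
mkdir -p "$demo/ckpt" "$demo/rsrc" "$demo/xprt"
touch "$demo/errlog" "$demo/fltlog" "$demo/ckpt/example-module"

list_fmd_files "$demo"        # review this list before deleting anything
purge_fmd_files "$demo"       # removes the files only
find "$demo" -type d -print   # the directories are still there
```

On the real system the argument would be /var/fm/fmd, run only while fmd is disabled.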

Start FMA after the files are removed

$ svcadm enable svc:/system/fmd:default

Check the output of fmdump -e; you will probably see errors being generated. Once errors are being generated, run an explorer or manually collect the FMA errlog and fltlog. A command as simple as the following can be used to identify whether one or more DIMMs are generating CEs (correctable errors).

fmdump -eV errlog | grep unum | sort | uniq -c
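A slightly extended variant of that pipeline sorts the busiest DIMM to the top. This is a sketch under assumptions: count_unums is a made-up helper name, and the sample unum lines are invented stand-ins, not real fmdump -eV output.

```shell
# Count how often each unum line appears in verbose ereport text on stdin.
# Prints "count line", busiest first. count_unums is a hypothetical helper.
count_unums() {
    grep unum | sed 's/^[[:space:]]*//' | sort | uniq -c | sort -rn
}

# Invented sample lines standing in for real `fmdump -eV errlog` output.
sample='    unum = /N0/SB0/P0/B0/D1
    unum = /N0/SB0/P0/B0/D1
    unum = /N0/SB2/P1/B1/D3
    unum = /N0/SB0/P0/B0/D1'

printf '%s\n' "$sample" | count_unums
```

With real data the input would be fmdump -eV errlog piped into count_unums; a DIMM whose count dwarfs the others is the likely suspect.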

As we do not have the core files, we cannot rule out other issues that may have caused the send mondo panics. An OBP upgrade and Solaris patching would be required to fix all the issues.

The findaft script gives a useful summary of which errors in a Solaris messages file are significant, with pointers to some of the send mondo problems.

$ findaft mes*
################################################################################
This script looks for Hardware errors including all AFT and pci ECC events
Written for 108528-16/112233-01 or above. Some tests may fail on other revisions
Report bugs,RFEs or if you have questions email findaft-interest@sun.com
Version 2.52 homepage http://panacea/twiki/bin/view/Tools/ToolPageFindaft
Or runnable from /net/cores.uk/export/hotline/hotlocal/bin/findaft
Infodoc 80270 Findaft an AFT CPU Memory and PCI ECC error message summary script
################################################################################
Input file messages is 2.3 MB
Input file messages.0 is 6.5 MB
Input file messages.1 is 6.5 MB
Input file messages.2 is 6.2 MB
Input file messages.3 is 6.6 MB
A 100MB messages file takes ~5 minutes to process on a 1200Mhz USIII CPU
Running from an explorer using ../sysconfig/prtdiag-v.out
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
V1280 / 2900 lom events
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 lw8: AM Boot: ScApp 5.20.6, RTOS 46^M
2 lw8: PM Boot: ScApp 5.20.6, RTOS 46^M
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Panics, Reboots, Fatal errors etc
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 panic: failed to stop cpu0
1 panic: failed to stop cpu512
1 panic[cpu531]/thread=:
1 send mondo timeout (target 0x0) [7352882 NACK 1 BUSY]
1 Aug 23 E2900 SunOS Release 5.10 Version Generic_127111-09 64-bit
1 Sep 13 E2900 SunOS Release 5.10 Version Generic_127111-09 64-bit
################################################################################
WARNING: send_mondo/failed to stop panics may have been caused by software bugs
Sun Alert 235041 http://sunsolve.sun.com/search/document.do?assetkey=1-66-235041-1
Sun Alert 228406 http://sunsolve.sun.com/search/document.do?assetkey=1-66-228406-1
Sun Alert 200591 http://sunsolve.sun.com/search/document.do?assetkey=1-66-200591-1
Sun Alert 238746 http://sunsolve.sun.com/search/document.do?assetkey=1-66-238746-1
################################################################################
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Note: Solaris 10+ with FMA automates the diagnosis of almost all the faults that
findaft looks for. Checking the output of fmadm faulty to see current faults.
See Infodoc http://sunsolve.sun.com/search/document.do?assetkey=1-61-230797-1
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
OBP 5.20.8 is installed, latest OBP versions linked to from Infodoc 18474
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
################################################################################
WARNING: Sun Alert 238746 applies at the installed OBP level 5.20.8
Sun Fire Server with Solaris 10 may Panic or Reset with lpost message,
Asynchronous event, fail to stop CPU or send_mondo timeout
http://sunsolve.sun.com/search/document.do?assetkey=1-26-238746-1
################################################################################

Yet another Solaris user
