
Oracle servers: DIMM error

If you have a DIMM problem and cannot tell which DIMM is faulty, use FMA (the Solaris Fault Management Architecture) to identify it:


The send mondo panics are likely the result of ECC errors, as described in Sun Alert 235041. The dump device is not set up correctly; this can be fixed with the following command.

dumpadm -d swap

Unfortunately, the FMA system is disabled, so we cannot determine which DIMM is causing the errors.

svcs-av.out:disabled       -             Sep_13        - svc:/system/fmd:default

To re-enable the FMA system, first remove the previous fault history, as it appears to be causing problems when FMA starts up.

Shut down the FMA subsystem

$ svcadm disable -s svc:/system/fmd:default

Remove all _files_ from the FMA log directories. Remove only the _files_; all directories must be left intact.

$ cd /var/fm/fmd
$ find /var/fm/fmd -type f -exec ls {} \;

Check that only files within the /var/fm/fmd directory are listed, then replace the ls with rm to remove them.
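The list-then-remove step can be wrapped in a small function so the directory check happens on every run. This is only a sketch: purge_fma_files is a hypothetical helper name, not a Solaris command, and it assumes a shell with function support (sh, ksh or bash) and that fmd has already been disabled.

```shell
# Hypothetical helper: remove regular files under an FMA log directory
# while leaving every directory in place.
purge_fma_files() {
    dir="${1:-/var/fm/fmd}"    # default path taken from the steps above

    # Stop early if the directory does not exist.
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }

    # Show what is about to be removed, then remove regular files only.
    find "$dir" -type f -exec ls {} \;
    find "$dir" -type f -exec rm {} \;
}

# Usage (as root, with fmd disabled):
#   purge_fma_files /var/fm/fmd
```

Because find is restricted to -type f, the subdirectories are never touched.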

Start FMA after the files are removed

$ svcadm enable svc:/system/fmd:default
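Before looking for new error reports it may be worth confirming that fmd actually came online. A minimal sketch, assuming ksh or bash; fmd_state and fmd_is_online are hypothetical helper names, while svcs -H -o state is the standard SMF query:

```shell
# Hypothetical helper: print the current SMF state of the fmd service.
fmd_state() {
    # -H drops the header, -o state prints only the STATE column;
    # awk trims any column padding.
    svcs -H -o state svc:/system/fmd:default | awk '{print $1}'
}

# Hypothetical helper: succeed only when fmd reports "online".
fmd_is_online() {
    [ "$(fmd_state)" = "online" ]
}

# Usage:
#   fmd_is_online && echo "fmd is online" || echo "fmd state: $(fmd_state)"
```

If the service is reported in any state other than online, svcs -x svc:/system/fmd:default should explain why.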

Check the output of fmdump -e; you will probably see errors being generated. Once errors are being generated, run an explorer or manually collect the FMA errlog and fltlog. A command as simple as the following can identify whether one or more DIMMs are generating CEs (correctable errors).

fmdump -eV errlog | grep unum | sort | uniq -c
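When many DIMMs appear in the log, ranking the counts puts the worst offender first. The sketch below only adds a numeric sort to the pipeline above; rank_unums is a hypothetical helper name and it reads fmdump -eV output on standard input.

```shell
# Hypothetical helper: count how often each unum line appears and
# list the most frequent first.
rank_unums() {
    grep unum | sort | uniq -c | sort -rn
}

# Usage:
#   fmdump -eV errlog | rank_unums
```

A DIMM whose count is far above the rest is the first candidate for replacement, though the counts alone do not prove which part is faulty.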


As we do not have the core files, there are other issues which may have caused the send mondo panics. An OBP upgrade and Solaris patching would be required to fix all the issues.

The findaft script gives a useful summary of which errors in the Solaris messages files are significant, with pointers to some of the send mondo problems.

$ findaft mes*
################################################################################
This script looks for Hardware errors including all AFT and pci ECC events
Written for 108528-16/112233-01 or above. Some tests may fail on other revisions
Report bugs,RFEs or if you have questions email findaft-interest@sun.com
Version 2.52 homepage http://panacea/twiki/bin/view/Tools/ToolPageFindaft
Or runnable from /net/cores.uk/export/hotline/hotlocal/bin/findaft
Infodoc 80270 Findaft an AFT CPU Memory and PCI ECC error message summary script
################################################################################
Input file messages is 2.3 MB
Input file messages.0 is 6.5 MB
Input file messages.1 is 6.5 MB
Input file messages.2 is 6.2 MB
Input file messages.3 is 6.6 MB
A 100MB messages file takes ~5 minutes to process on a 1200Mhz USIII CPU
Running from an explorer using ../sysconfig/prtdiag-v.out
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 V1280 / 2900 lom events
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2       lw8: AM Boot: ScApp 5.20.6, RTOS 46^M
2       lw8: PM Boot: ScApp 5.20.6, RTOS 46^M
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 Panics, Reboots, Fatal errors etc
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1       panic: failed to stop cpu0
1       panic: failed to stop cpu512
1       panic[cpu531]/thread=:
1       send mondo timeout (target 0x0) [7352882 NACK 1 BUSY]
1       Aug 23 E2900 SunOS Release 5.10 Version Generic_127111-09 64-bit
1       Sep 13 E2900 SunOS Release 5.10 Version Generic_127111-09 64-bit
################################################################################
 WARNING: send_mondo/failed to stop panics may have been caused by software bugs
 Sun Alert 235041 http://sunsolve.sun.com/search/document.do?assetkey=1-66-235041-1
 Sun Alert 228406 http://sunsolve.sun.com/search/document.do?assetkey=1-66-228406-1
 Sun Alert 200591 http://sunsolve.sun.com/search/document.do?assetkey=1-66-200591-1
 Sun Alert 238746 http://sunsolve.sun.com/search/document.do?assetkey=1-66-238746-1
################################################################################
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 Note: Solaris 10+ with FMA automates the diagnosis of almost all the faults that
 findaft looks for. Checking the output of fmadm faulty to see current faults.
 See Infodoc http://sunsolve.sun.com/search/document.do?assetkey=1-61-230797-1
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 OBP 5.20.8 is installed, latest OBP versions linked to from Infodoc 18474
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
################################################################################
 WARNING: Sun Alert 238746 applies at the installed OBP level 5.20.8
 Sun Fire Server with Solaris 10 may Panic or Reset with lpost message,
 Asynchronous event, fail to stop CPU or send_mondo timeout
 http://sunsolve.sun.com/search/document.do?assetkey=1-26-238746-1
################################################################################
