
Oracle servers: DIMM error

If you have a DIMM problem and unfortunately cannot tell which DIMM is faulty, use FMA to identify it:

The send mondo panics are likely a result of the ECC errors described in Sun Alert 235041. The dump device is not set up correctly; this can be fixed with the following command.

dumpadm -d swap

Unfortunately the FMA system is disabled so we cannot determine which DIMM is causing the errors.

svcs-av.out:disabled       -             Sep_13        - svc:/system/fmd:default

To re-enable the FMA system, first remove the previous fault history, as it appears to be causing problems when FMA starts up.

Shut down the FMA subsystem

$ svcadm disable -s svc:/system/fmd:default

Remove all _files_ from the FMA log directories. This applies to _files_ only; all directories must be left intact.

$ cd /var/fm/fmd
$ find /var/fm/fmd -type f -exec ls {} \;

Check that only files within the /var/fm/fmd directory are listed, then replace the ls with rm to remove them.
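
The list-then-remove step above can be wrapped in a small function so that both passes use exactly the same find expression. This is only a sketch: the function name fma_files and its arguments are hypothetical, and it assumes a POSIX shell with fmd already disabled.

```shell
# Hypothetical helper (not part of Solaris): list or remove the regular
# files under an FMA log directory, leaving every directory in place.
# Usage: fma_files list|remove [dir]    (dir defaults to /var/fm/fmd)
fma_files() {
    action=$1
    dir=${2:-/var/fm/fmd}
    case "$action" in
        # Dry run: show what would be removed.
        list)   find "$dir" -type f -exec ls {} \; ;;
        # Same expression with ls swapped for rm, as described above.
        remove) find "$dir" -type f -exec rm {} \; ;;
        *)      echo "usage: fma_files list|remove [dir]" >&2; return 2 ;;
    esac
}
```

Run "fma_files list" first and review the output, then "fma_files remove" only once every listed path is under /var/fm/fmd.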

Start FMA after the files are removed

$ svcadm enable svc:/system/fmd:default
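
Before looking for new errors it may be worth confirming that fmd really came online. A minimal sketch, assuming the standard svcs column output; the function name fmd_is_online is made up for illustration.

```shell
# Hypothetical helper: succeed only when SMF reports fmd as "online".
# svcs -H suppresses the header line, -o state prints just the state column.
fmd_is_online() {
    state=$(svcs -H -o state svc:/system/fmd:default 2>/dev/null)
    [ "$state" = "online" ]
}
```

For example: fmd_is_online && echo "fmd is running" || echo "fmd is not online"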

Check the output of fmdump -e; you will probably see errors being generated. Once errors are being generated, run an explorer or manually collect the FMA errlog and fltlog. A command as simple as the following can show whether one or more DIMMs are generating correctable errors (CEs).

fmdump -eV errlog | grep unum | sort | uniq -c
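
The same pipeline can be kept as a function that also ranks the results, so the DIMM reported most often appears first. This is a sketch only; count_unums is a hypothetical name, and it simply counts identical lines containing "unum" without assuming anything else about the fmdump output format.

```shell
# Hypothetical helper: read `fmdump -eV` output on stdin and print each
# distinct line containing "unum" with its count, most frequent first.
count_unums() {
    grep unum |
        sed 's/^[[:space:]]*//' |   # strip leading whitespace so equal entries match
        sort | uniq -c |
        sort -rn                    # highest count at the top
}
```

Usage: fmdump -eV errlog | count_unums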

As we do not have the core files, there are other issues which may have caused the send mondo panics. An OBP upgrade and Solaris patching would be required to fix all the issues.

The findaft output below is a useful summary of which errors in the Solaris messages files are significant, with pointers to some of the send mondo problems.

$ findaft mes*
This script looks for Hardware errors including all AFT and pci ECC events
Written for 108528-16/112233-01 or above. Some tests may fail on other revisions
Report bugs,RFEs or if you have questions email
Version 2.52 homepage http://panacea/twiki/bin/view/Tools/ToolPageFindaft
Or runnable from /net/
Infodoc 80270 Findaft an AFT CPU Memory and PCI ECC error message summary script
Input file messages is 2.3 MB
Input file messages.0 is 6.5 MB
Input file messages.1 is 6.5 MB
Input file messages.2 is 6.2 MB
Input file messages.3 is 6.6 MB
A 100MB messages file takes ~5 minutes to process on a 1200Mhz USIII CPU
Running from an explorer using ../sysconfig/prtdiag-v.out
 V1280 / 2900 lom events
2       lw8: AM Boot: ScApp 5.20.6, RTOS 46
2       lw8: PM Boot: ScApp 5.20.6, RTOS 46
 Panics, Reboots, Fatal errors etc
1       panic: failed to stop cpu0
1       panic: failed to stop cpu512
1       panic[cpu531]/thread=:
1       send mondo timeout (target 0x0) [7352882 NACK 1 BUSY]
1       Aug 23 E2900 SunOS Release 5.10 Version Generic_127111-09 64-bit
1       Sep 13 E2900 SunOS Release 5.10 Version Generic_127111-09 64-bit
 WARNING: send_mondo/failed to stop panics may have been caused by software bugs
 Sun Alert 235041
 Sun Alert 228406
 Sun Alert 200591
 Sun Alert 238746
 Note: Solaris 10+ with FMA automates the diagnosis of almost all the faults that
 findaft looks for. Checking the output of fmadm faulty to see current faults.
 See Infodoc
 OBP 5.20.8 is installed, latest OBP versions linked to from Infodoc 18474
 WARNING: Sun Alert 238746 applies at the installed OBP level 5.20.8
 Sun Fire Server with Solaris 10 may Panic or Reset with lpost message,
 Asynchronous event, fail to stop CPU or send_mondo timeout

