Wednesday, August 25, 2010

In 11gR2 RAC, crsd fails to start on the second node after a server reboot

Problem Description
The Oracle Grid Infrastructure and Oracle software installations completed successfully, as did the creation of the Oracle database. After installation, the CRS daemons were checked on both nodes and all services were shown as online, as below.

# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
However, after both servers were rebooted, Cluster Ready Services did not start on the second node, while on the first node it worked fine.

Some output from the second node:
# crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
The Grid alert log on the second node shows:
[crsd(26294)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/dc-db-02/crsd/crsd.log.
2010-08-25 12:06:55.475
[ohasd(25303)]CRS-2765:Resource 'ora.crsd' has failed on server 'dc-db-02'.
2010-08-25 12:06:56.553
[crsd(26305)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/dc-db-02/crsd/crsd.log.
2010-08-25 12:06:57.499
[ohasd(25303)]CRS-2765:Resource 'ora.crsd' has failed on server 'dc-db-02'.
2010-08-25 12:06:57.499
[ohasd(25303)]CRS-2771:Maximum restart attempts reached for resource 'ora.crsd'; will not restart.
The crsd.log file shows:
2010-08-25 12:06:56.552: [  OCRASM][1855820272]proprasmo: Error in open/create file in dg [DATA]
[ OCRASM][1855820272]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup

2010-08-25 12:06:56.553: [ OCRASM][1855820272]proprasmo: kgfoCheckMount returned [7]
2010-08-25 12:06:56.553: [ OCRASM][1855820272]proprasmo: The ASM instance is down
2010-08-25 12:06:56.553: [ OCRRAW][1855820272]proprioo: Failed to open [+DATA]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-08-25 12:06:56.553: [ OCRRAW][1855820272]proprioo: No OCR/OLR devices are usable
2010-08-25 12:06:56.554: [ OCRASM][1855820272]proprasmcl: asmhandle is NULL
2010-08-25 12:06:56.554: [ OCRRAW][1855820272]proprinit: Could not open raw device
2010-08-25 12:06:56.554: [ OCRASM][1855820272]proprasmcl: asmhandle is NULL
2010-08-25 12:06:56.554: [ OCRAPI][1855820272]a_init:16!: Backend init unsuccessful : [26]
2010-08-25 12:06:56.554: [ CRSOCR][1855820272] OCR context init failure. Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup
] [7]
2010-08-25 12:06:56.554: [ CRSD][1855820272][PANIC] CRSD exiting: Could not init OCR, code: 26
2010-08-25 12:06:56.554: [ CRSD][1855820272] Done.
Cause of the Problem
Whenever you see the "CRS-4535: Cannot communicate with Cluster Ready Services" error, immediately investigate the Grid alert log as well as the crsd.log file. They contain more information about the CRS-4535 error. In the crsd.log file above we see the Oracle error message
ORA-15077: could not locate ASM instance serving a required diskgroup
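
The advice above, to go through crsd.log for the underlying error, can be sketched as a small shell helper. This is only an illustration: `crs_error_codes` is a hypothetical function name, not part of Oracle's tooling, and the pattern assumes error codes look like the ORA-, PROC- and CRS- codes quoted in this post.

```shell
# Hypothetical helper (not an Oracle tool): list the distinct
# ORA-, PROC- and CRS- error codes that appear in a log file.
crs_error_codes() {
    # -o prints only the matched code, -E enables extended regex syntax
    grep -oE '(ORA|PROC|CRS)-[0-9]+' "$1" | sort -u
}

# Example with the log path reported in the alert log of this post;
# adjust it for your own Grid home and host name.
# crs_error_codes /u01/app/11.2.0/grid/log/dc-db-02/crsd/crsd.log
```

Reducing the log to a short list of codes makes it easier to spot the one to research first, here ORA-15077.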

If we try to run the ls command inside asmcmd, it fails with ASMCMD-08102:
$ export ORACLE_SID=+ASM2
$ asmcmd
ASMCMD> ls
ASMCMD-08102: no connection to ASM; command requires ASM to run
Trying to start the ASM instance manually on the second node gives an error like the one below.
ASMCMD> startup;
ORA-27154: post/wait create failed
ORA-27300: OS system dependent operation:semget failed with status: 28
ORA-27301: OS failure message: No space left on device
ORA-27302: failure occurred at: sskgpsemsper
This error message can be misleading, because the disk has plenty of free space, as df -h confirms. The error is entirely related to kernel parameter settings: semget reports "No space left on device" when a system-wide semaphore limit would be exceeded. Checking the semaphore settings shows that not enough semaphores are allowed for the ASM instance to create its processes.
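
To make the semaphore limits easier to read, the sketch below labels the four fields of a kernel.sem value. The field order (SEMMSL, SEMMNS, SEMOPM, SEMMNI) is the one Linux uses; `sem_fields` is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: label the four fields of a kernel.sem value.
# Linux stores them in the order SEMMSL SEMMNS SEMOPM SEMMNI.
sem_fields() {
    # Unquoted on purpose: the kernel separates the values with tabs
    set -- $1
    echo "SEMMSL=$1"   # max semaphores per semaphore set
    echo "SEMMNS=$2"   # max semaphores system-wide
    echo "SEMOPM=$3"   # max operations per semop call
    echo "SEMMNI=$4"   # max number of semaphore sets
}

# Show the limits of the running kernel, if the proc file is readable.
if [ -r /proc/sys/kernel/sem ]; then
    sem_fields "$(cat /proc/sys/kernel/sem)"
fi
```

On systems with the util-linux ipcs tool, `ipcs -ls` prints the same limits and `ipcs -su` prints how many semaphore sets and semaphores are currently in use, which shows how close the system is to SEMMNI and SEMMNS.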

Solution of the problem
Because the problem is caused by semaphore limits that are set too low, the solution is to increase them.

- Check your current semaphore settings by looking at the value of the parameter "kernel.sem" in the file /etc/sysctl.conf
# cat /etc/sysctl.conf

or issue:
# /sbin/sysctl -a | grep sem

- Modify the SEMMNI value (the fourth field of kernel.sem) in /etc/sysctl.conf as below.
# vi /etc/sysctl.conf
kernel.sem = 250 32000 100 200
On a Red Hat Linux system, to make the new value take effect immediately, use:
# /sbin/sysctl -p
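
After reloading, it can be worth confirming that the running kernel really uses the value from the file. The following is a minimal sketch under the assumption that the setting lives in /etc/sysctl.conf itself (not in a drop-in file); `conf_sem` and `live_sem` are hypothetical helper names.

```shell
# Hypothetical helper: print the kernel.sem value set in a sysctl-style
# file, with whitespace normalised to single spaces.
conf_sem() {
    sed -n 's/^[[:space:]]*kernel\.sem[[:space:]]*=[[:space:]]*//p' "$1" |
        tail -n 1 | tr -s '[:space:]' ' ' | sed 's/ *$//'
}

# Hypothetical helper: print the value the running kernel is using.
live_sem() {
    tr -s '[:space:]' ' ' < /proc/sys/kernel/sem | sed 's/ *$//'
}

# Compare the two. A mismatch can mean that the file has no kernel.sem
# entry or that the new setting has not been loaded yet.
if [ -r /etc/sysctl.conf ] && [ -r /proc/sys/kernel/sem ]; then
    if [ "$(conf_sem /etc/sysctl.conf)" = "$(live_sem)" ]; then
        echo "kernel.sem matches /etc/sysctl.conf"
    else
        echo "kernel.sem differs from /etc/sysctl.conf"
    fi
fi
```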

The ASM instance now starts without errors, and the CRS daemon comes up as well. Restart the node to make sure everything still works after a reboot.
