VIP Failover Take Long Time After Network Cable Pulled [ID 403743.1]

ORACLE/RAC 2012. 5. 18. 16:38

--------------------------------------------------------------------------------

Modified 05-JAN-2011 Type PROBLEM Status PUBLISHED

In this Document
Symptoms
Changes
Cause
Solution
References

--------------------------------------------------------------------------------

Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 11.1.0.7 - Release: 10.2 to 11.1
Information in this document applies to any platform.
***Checked for relevance on 05-Jan-2011***
Symptoms
This example is based on the Sun Solaris platform with IPMP configured for the public network. VIP failover takes almost 4 minutes to complete when both public network cables are pulled from one node.

crsd.log shows:

2006-12-07 13:14:05.401: [ CRSAPP][4588] CheckResource error for ora.node1.vip error code = 1
2006-12-07 13:14:05.408: [ CRSRES][4588] In stateChanged, ora.node1.vip target is ONLINE
2006-12-07 13:14:05.409: [ CRSRES][4588] ora.node1.vip on node1 went OFFLINE unexpectedly
<<< detect network cable failure and VIP OFFLINE immediately

2006-12-07 13:14:05.410: [ CRSRES][4588] StopResource: setting CLI values
2006-12-07 13:14:05.420: [ CRSRES][4588] Attempting to stop `ora.node1.vip` on member `node1`
2006-12-07 13:14:06.651: [ CRSRES][4588] Stop of `ora.node1.vip` on member `node1` succeeded.
2006-12-07 13:14:06.652: [ CRSRES][4588] ora.node1.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2006-12-07 13:14:06.667: [ CRSRES][4588] ora.node1.vip failed on node1 relocating.
2006-12-07 13:14:06.758: [ CRSRES][4588] StopResource: setting CLI values
2006-12-07 13:14:06.766: [ CRSRES][4588] Attempting to stop `ora.node1.LISTENER_NODE1.lsnr` on member `node1`
2006-12-07 13:17:41.399: [ CRSRES][4588] Stop of `ora.node1.LISTENER_NODE1.lsnr` on member `node1` succeeded.
<<< takes 3.5 minutes to stop listener

2006-12-07 13:17:41.402: Attempting to stop `ora.node1.ASM1.asm` on member `node1`
<<< stop dependent instance and ASM
2006-12-07 13:17:55.610: [ CRSRES][4588] Stop of `ora.node1.ASM1.asm` on member `node1` succeeded.

2006-12-07 13:17:55.661: [ CRSRES][4588] Attempting to start `ora.node1.vip` on member `node2`
2006-12-07 13:18:00.260: [ CRSRES][4588] Start of `ora.node1.vip` on member `node2` succeeded.
<<< now VIP failover complete after almost 4 mins

ora.node1.LISTENER_NODE1.lsnr.log shows:

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=node1vip)(PORT=1521)(IP=FIRST)))
TNS-12535: TNS:operation timed
2006-12-07 13:17:41.329: [ RACG][1] [23916][1][ora.node1.LISTENER_NODE1.lsnr]: out
   TNS-12560: TNS:protocol adapter error
     TNS-00505: Operation timed out
     Solaris Error: 145: Connection timed out
     Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=10.1.10.100)(PORT=1521)(IP=FIRST)))
The command completed successfully

Client connections hang during this failover period.
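
The intervals annotated in the crsd.log excerpt can be recomputed from its timestamps. The following is a minimal Python sketch; the helper name elapsed_seconds is hypothetical, and the timestamps are copied from the log lines above.

```python
from datetime import datetime

# Timestamp layout used by the crsd.log lines quoted in this note.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two crsd.log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# "Attempting to stop" the listener -> "Stop of ... lsnr ... succeeded"
listener_stop = elapsed_seconds("2006-12-07 13:14:05.401".replace("05.401", "06.766"),
                                "2006-12-07 13:17:41.399")
# "Attempting to stop" ASM -> "Stop of ... asm ... succeeded"
asm_stop = elapsed_seconds("2006-12-07 13:17:41.402", "2006-12-07 13:17:55.610")
# First CheckResource error -> VIP started on node2
total_failover = elapsed_seconds("2006-12-07 13:14:05.401", "2006-12-07 13:18:00.260")

print(f"listener stop : {listener_stop:.3f} s")
print(f"ASM stop      : {asm_stop:.3f} s")
print(f"total failover: {total_failover:.3f} s")
```

By this calculation the listener stop accounts for roughly 215 of the 235 seconds, which matches the "3.5 minutes" and "almost 4 mins" annotations.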

Changes
This may be a new setup, or a setup that was migrated from an earlier release.
Cause
The problem occurs when the first address in the listener's definition in listener.ora uses the TCP protocol.

In this configuration, when the network cables are pulled, "lsnrctl stop" must wait for the TCP connect timeout on the first address before it can try the next one. On Solaris, this timeout is set by tcp_ip_abort_cinterval, which defaults to 180000 ms (3 minutes); TCP timeouts on other platforms may vary. That is why stopping the listener took about 3.5 minutes. The message "Solaris Error: 145: Connection timed out" in ora.node1.LISTENER_NODE1.lsnr.log likewise shows that the stop was waiting on the TCP timeout.
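
The wait described above can be approximated with a small model. This Python sketch is only an illustration and not how lsnrctl is implemented: the function name stop_wait_seconds is hypothetical, each address is reduced to a (protocol, reachable) pair, and the reachability flags are assumptions inferred from the listener log, where the connect to node1vip timed out and the command completed after connecting to 10.1.10.100.

```python
from typing import List, Tuple

# Solaris default quoted in this note: tcp_ip_abort_cinterval = 180000 ms.
TCP_IP_ABORT_CINTERVAL_MS = 180_000


def stop_wait_seconds(addresses: List[Tuple[str, bool]],
                      tcp_timeout_ms: int = TCP_IP_ABORT_CINTERVAL_MS) -> float:
    """Seconds spent waiting before the first reachable address is found.

    Simplified model (assumption): addresses are tried in order, an
    unreachable TCP address costs one full connect timeout, and any
    other unreachable address fails immediately.
    """
    waited = 0.0
    for protocol, reachable in addresses:
        if reachable:
            return waited
        if protocol.upper() == "TCP":
            waited += tcp_timeout_ms / 1000.0
    return waited


# Address order from this scenario: VIP (timed out), host IP, IPC.
scenario_order = [("TCP", False), ("TCP", True), ("IPC", True)]
print(stop_wait_seconds(scenario_order))  # 180.0
```

The model yields 180 seconds of pure timeout, in line with the roughly 3.5 minutes observed once the remaining stop processing is included.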

The listener.ora in this scenario is defined as:

LISTENER_NODE1 =
(DESCRIPTION_LIST =
   (DESCRIPTION =
     (ADDRESS_LIST =
       (ADDRESS = (PROTOCOL = TCP)(HOST = node1vip)(PORT = 1521)(IP = FIRST))
     )
     (ADDRESS_LIST =
       (ADDRESS = (PROTOCOL = TCP)(HOST = 10.1.10.100)(PORT = 1521)(IP = FIRST))
     )
     (ADDRESS_LIST =
       (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC))
     )
   )
)

Solution
To prevent this, make the IPC address the first address for the listener in listener.ora, for example:

LISTENER_NODE1 =
(DESCRIPTION_LIST =
   (DESCRIPTION =
     (ADDRESS_LIST =
       (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC))
     )
     (ADDRESS_LIST =
       (ADDRESS = (PROTOCOL = TCP)(HOST = node1vip)(PORT = 1521)(IP = FIRST))
     )
     (ADDRESS_LIST =
       (ADDRESS = (PROTOCOL = TCP)(HOST = 10.1.10.100)(PORT = 1521)(IP = FIRST))
     )
   )
)
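
One way to check an existing listener.ora for this ordering is sketched below in Python. The function names are hypothetical, and the sketch assumes the text passed in contains a single listener definition; a real file may also hold other listeners or a SID_LIST section, which would need to be split out first.

```python
import re
from typing import List

# Matches "(PROTOCOL = TCP)", "(PROTOCOL=ipc)", and similar spellings.
PROTOCOL_RE = re.compile(r"\(\s*PROTOCOL\s*=\s*(\w+)\s*\)", re.IGNORECASE)


def listener_protocol_order(listener_text: str) -> List[str]:
    """Return the protocols of the ADDRESS entries in the order they appear."""
    return [name.upper() for name in PROTOCOL_RE.findall(listener_text)]


def ipc_is_first(listener_text: str) -> bool:
    """True if the first address in the definition uses the IPC protocol."""
    order = listener_protocol_order(listener_text)
    return bool(order) and order[0] == "IPC"


# Condensed copy of the corrected definition shown above.
corrected = """
LISTENER_NODE1 =
(DESCRIPTION_LIST =
  (DESCRIPTION =
    (ADDRESS_LIST = (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC)))
    (ADDRESS_LIST = (ADDRESS = (PROTOCOL = TCP)(HOST = node1vip)(PORT = 1521)(IP = FIRST)))
    (ADDRESS_LIST = (ADDRESS = (PROTOCOL = TCP)(HOST = 10.1.10.100)(PORT = 1521)(IP = FIRST)))
  )
)
"""

print(listener_protocol_order(corrected))
print(ipc_is_first(corrected))
```

Running the same check against the original definition, where the TCP addresses come first, would report False.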

When lsnrctl tries to stop the listener, it now connects to the IPC address first, which remains available while the network is down, so it does not have to wait for the TCP timeout.

After this change, VIP failover takes only 48 to 50 seconds to complete, regardless of the tcp_ip_abort_cinterval setting.

Note that listener.ora files newly created in 10.2.0.3 through 11.1.0.7 should, in most cases, already list the IPC address first. However, if you upgraded from an earlier release, or manually modified or copied a listener.ora from an earlier installation, the IPC address may not be first, regardless of your version. In that case, edit listener.ora manually to move the IPC address to the first position and avoid the problem described in this note.

References


Posted by [PineTree]