ORACLE/RAC2012. 5. 18. 16:38
반응형

VIP Failover Take Long Time After Network Cable Pulled [ID 403743.1]
--------------------------------------------------------------------------------
 
  수정 날짜 05-JAN-2011     유형 PROBLEM     상태 PUBLISHED  

In this Document
  Symptoms
  Changes
  Cause
  Solution
  References

--------------------------------------------------------------------------------

Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 11.1.0.7 - Release: 10.2 to 11.1
Information in this document applies to any platform.
***Checked for relevance on 05-Jan-2011***
Symptoms
This example is based on SUN Solaris platform, with IPMP configured for the public network. In this case, VIP failover takes almost 4 minutes to complete when both network cables of the public network are pulled from one node.

crsd.log shows:

2006-12-07 13:14:05.401: [ CRSAPP][4588] CheckResource error for ora.node1.vip error code = 1
2006-12-07 13:14:05.408: [ CRSRES][4588] In stateChanged, ora.node1.vip target is ONLINE
2006-12-07 13:14:05.409: [ CRSRES][4588] ora.node1.vip on node1 went OFFLINE unexpectedly
<<< detect network cable failure and VIP OFFLINE immediately

2006-12-07 13:14:05.410: [ CRSRES][4588] StopResource: setting CLI values
2006-12-07 13:14:05.420: [ CRSRES][4588] Attempting to stop `ora.node1.vip` on member `node1`
2006-12-07 13:14:06.651: [ CRSRES][4588] Stop of `ora.node1.vip` on member `node1` succeeded.
2006-12-07 13:14:06.652: [ CRSRES][4588] ora.node1.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2006-12-07 13:14:06.667: [ CRSRES][4588] ora.node1.vip failed on node1 relocating.
2006-12-07 13:14:06.758: [ CRSRES][4588] StopResource: setting CLI values
2006-12-07 13:14:06.766: [ CRSRES][4588] Attempting to stop `ora.node1.LISTENER_NODE1.lsnr` on member `node1`
2006-12-07 13:17:41.399: [ CRSRES][4588] Stop of `ora.node1.LISTENER_NODE1.lsnr` on member `node1` succeeded.
<<< takes 3.5 minutes to stop listener

2006-12-07 13:17:41.402: Attempting to stop `ora.node1.ASM1.asm` on member `node1`
<<< stop dependant inst and ASM
2006-12-07 13:17:55.610: [ CRSRES][4588] Stop of `ora.node1.ASM1.asm` on member `node1` succeeded.

2006-12-07 13:17:55.661: [ CRSRES][4588] Attempting to start `ora.node1.vip` on member `node2`
2006-12-07 13:18:00.260: [ CRSRES][4588] Start of `ora.node1.vip` on member `node2` succeeded.
<<< now VIP failover complete after almost 4 mins


ora.node1.LISTENER_NODE1.lsnr.log shows:

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=node1vip)(PORT=1521)(IP=FIRST)))
TNS-12535: TNS:operation timed
2006-12-07 13:17:41.329: [ RACG][1] [23916][1][ora.node1.LISTENER_NODE1.lsnr]: out
   TNS-12560: TNS:protocol adapter error
     TNS-00505: Operation timed out
     Solaris Error: 145: Connection timed out
     Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=10.1.10.100)(PORT=1521)(IP=FIRST)))
The command completed successfully


Client connection hang during this failover time.


Changes
This may be a new setup, or a setup that was migrated from an earlier release.
Cause
This problem is caused by the first address in the listener.ora configuration being an address that uses the TCP protocol.

In this circumstance, when a network cable is pulled, "lsnrctl stop" listener has to wait for TCP timeout before it can check next address. On the Solaris platform, TCP timeout is defined by tcp_ip_abort_cinterval with a default value of 180000 (3 minutes).   That is why shutting down listener almost took 3.5 minutes. (TCP timeout on other platforms may vary).  The error message "Solaris Error: 145: Connection timed out" in ora.node1.LISTENER_NODE1.lsnr.log also indicates it is waiting for tcp timeout.

The listener.ora in this scenario is defined as:

 

[LISTENER_NODE1 =
 (DESCRIPTION_LIST =
   (DESCRIPTION =
     (ADDRESS_LIST =
       (ADDRESS = (PROTOCOL = TCP)(HOST = node1vip)(PORT = 1521)(IP = FIRST))
     )
     (ADDRESS_LIST =
       (ADDRESS = (PROTOCOL = TCP)(HOST = 10.1.10.100)(PORT = 1521)(IP = FIRST))
     )
     (ADDRESS_LIST =
       (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC))
     )
   )
 )Solution
To prevent this, move the IPC address to be the first address for the listener in the listener.ora, eg:


LISTENER_NODE1 =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
       (ADDRESS_LIST =
          (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC))
       )
       (ADDRESS_LIST =
          (ADDRESS = (PROTOCOL = TCP)(HOST = node1vip)(PORT = 1521)(IP = FIRST))
        )
       (ADDRESS_LIST =
           (ADDRESS = (PROTOCOL = TCP)(HOST = 10.1.10.100)(PORT = 1521)(IP = FIRST))
        )
     )
  )

When lsnrctl tries to stop the listener, it will now connect to the IPC address first, which is available during that time. It will not have to wait for tcp timeout.

After the above change, the VIP failover only takes 48 to 50 seconds to complete regardless of the tcp_ip_abort_cinterval setting.

Please note, listener.ora files newly created from 10.2.0.3 to 11.1.0.7 should have the IPC protocol as the first address in listener.ora in most cases.  However, if you have upgraded from a previous release, or manually modified/copied over a listener.ora from a previous install, you may not have the IPC protocol as the first address, regardless of your version. Manual modification is required to move IPC protocol to be the first address to avoid the problem described in this note.

References


 

반응형
Posted by [PineTree]