



Node Management in Oracle Clusterware
Markus Michalewicz
Senior Principal Product Manager, Oracle RAC and Oracle RAC One Node
The following is intended to outline our general
product direction. It is intended for information
purposes only, and may not be incorporated into any
contract. It is not a commitment to deliver any
material, code, or functionality, and should not be
relied upon in making purchasing decisions.
The development, release, and timing of any
features or functionality described for Oracle’s
products remain at the sole discretion of Oracle.




Agenda
• Oracle Clusterware 11.2.0.1 Processes

• Node Monitoring Basics

• Node Eviction Basics

• Re-bootless Node Fencing (restart)

• Advanced Node Management

• The Corner Cases

• More Information / Q&A
Oracle Clusterware 11g Rel. 2 Processes
Most are not important for node management – focus!




[Diagram: OHASD manages the node-monitoring daemons: CSSD (resource ora.cssd) and CSSDMONITOR, formerly oprocd (resource ora.cssdmonitor)]
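For illustration (not part of the original slide): on an 11.2 cluster node, the lower-stack resources shown above can be listed with crsctl. The output below is abbreviated and the node name is a placeholder.

 [GRID]> crsctl stat res -t -init
  ora.cssd          1   ONLINE   ONLINE   node1
  ora.cssdmonitor   1   ONLINE   ONLINE   node1
  ora.crsd          1   ONLINE   ONLINE   node1
  ...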



 Node Monitoring Basics




Basic Hardware Layout for Oracle Clusterware
Node management is hardware independent

[Diagram: three cluster nodes, each running CSSD, connected via a public LAN, a private LAN (interconnect), and a SAN network to the Voting Disk]
What does CSSD do?
CSSD monitors and evicts nodes
• Monitors nodes using 2 communication channels:
   – Private Interconnect → Network Heartbeat
   – Voting Disk based communication → Disk Heartbeat
• Evicts (forcibly removes from the cluster) nodes
  depending on heartbeat feedback (failures)




[Diagram: two nodes exchanging a network heartbeat (“Ping”) over the interconnect and a disk heartbeat (“Ping”) via the Voting Disk]




Network Heartbeat
Interconnect basics
• Each node in the cluster is “pinged” every second
• Nodes must respond in css_misscount time (defaults to 30 secs.)
   – Reducing the css_misscount time is generally not supported


• Network heartbeat failures will lead to node evictions
   – CSSD-log: [date / time] [CSSD][1111902528]clssnmPollingThread: node
     mynodename (5) at 75% heartbeat fatal, removal in 6.770 seconds
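As a quick check (a sketch added here, not on the original slide), the effective misscount can be queried with crsctl; the output shown is typical for the 30-second default:

 [GRID]> crsctl get css misscount
 CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.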




Disk Heartbeat
Voting Disk basics – Part 1
• Each node in the cluster “pings” (r/w) the Voting Disk(s) every second
• Nodes must receive a response in (long / short) diskTimeout time
   – I/O errors indicate clear accessibility problems → timeout is irrelevant


• Disk heartbeat failures will lead to node evictions
   – CSSD-log: … [CSSD] [1115699552] >TRACE:   clssnmReadDskHeartbeat:
     node(2) is down. rcfg(1) wrtcnt(1) LATS(63436584) Disk lastSeqNo(1)
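Similarly (an added sketch, not on the original slide), the disk timeout can be queried; 200 seconds is the usual long diskTimeout default:

 [GRID]> crsctl get css disktimeout
 CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services.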




[Diagram: each node “pings” (reads/writes) the Voting Disk every second]




Voting Disk Structure
Voting Disk basics – Part 2
• Voting Disks contain dynamic and static data:
   – Dynamic data: disk heartbeat logging
   – Static data: information about the nodes in the cluster


• With 11.2.0.1 Voting Disks got an “identity”:
   – E.g. Voting Disk serial number: [GRID]> crsctl query css votedisk
     1.   2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]


• Voting Disks must therefore not be copied using “dd” or “cp” anymore
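Instead of file copies, Voting Disks are maintained with crsctl. A minimal sketch (the device path and File Universal Id are placeholders; add/delete apply to Voting Disks stored outside of ASM):

 [GRID]> crsctl add css votedisk /dev/sdd7
 [GRID]> crsctl delete css votedisk 1212f9d6e85c4ff7bf80cc9e3f533cc1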




[Diagram: Voting Disk layout – static node information and dynamic disk heartbeat logging]
“Simple Majority Rule”
Voting Disk basics – Part 3
• Oracle supports redundant Voting Disks for disk failure protection
• “Simple Majority Rule” applies:
  – Each node must “see” the simple majority of configured Voting Disks
     at all times in order not to be evicted (to remain in the cluster)

   → trunc(n/2 + 1), with n = number of configured Voting Disks and n >= 1
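A quick worked example of the rule (added for clarity):
   n = 1  →  trunc(1/2 + 1) = 1   (the single Voting Disk must always be accessible)
   n = 3  →  trunc(3/2 + 1) = 2   (access to one of the three disks may be lost)
   n = 5  →  trunc(5/2 + 1) = 3   (access to two of the five disks may be lost)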








Insertion 1: “Simple Majority Rule”…
… In extended Oracle clusters



• Same principles apply
• Voting Disks are just geographically dispersed
• See http://www.oracle.com/goto/rac
   – Using standard NFS to support a third voting file for extended cluster configurations (PDF)
[Diagram: an extended cluster with nodes and Voting Disks dispersed across sites]
Insertion 2: Voting Disk in Oracle ASM
How Voting Disks are stored does not change how they are used

 [GRID]> crsctl query css votedisk
  1.   2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
  2.   2 aafab95f9ef84f03bf6e26adc2a3b0e8 (/dev/sde5) [DATA]
  3.   2 28dd4128f4a74f73bf8653dabd88c737 (/dev/sdd6) [DATA]
 Located 3 voting disk(s).



• Oracle ASM automatically creates 1/3/5 Voting Files
  – Based on External/Normal/High redundancy
    and on Failure Groups in the Disk Group
  – By default, there is one failure group per disk
  – ASM will enforce the required number of disks
  – New failure group type: Quorum Failgroup
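For illustration (not on the original slide): Voting Disks are moved into an ASM disk group with a single crsctl command; the disk group name is a placeholder, and crsctl query css votedisk (as shown above) verifies the result.

 [GRID]> crsctl replace votedisk +DATA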







  Node Eviction Basics
Why are nodes evicted?
To prevent worse things from happening…
• Evicting (fencing) nodes is a preventive measure (a good thing)!
• Nodes are evicted to prevent consequences of a split brain:
   – Shared data must not be written by independently operating nodes
   – The easiest way to prevent this is to forcibly remove a node from the cluster




[Diagram: nodes 1 and 2 operating independently after a split brain]




How are nodes evicted in general?
“STONITH like” or node eviction basics – Part 1
• Once it is determined that a node needs to be evicted,
   – A “kill request” is sent to the respective node(s)
   – Using all (remaining) communication channels


• A node (CSSD) is requested to “kill itself” → “STONITH like”
   – In classic “STONITH”, a remote node kills the node to be evicted




How are nodes evicted?
EXAMPLE: Heartbeat failure
• The network heartbeat between nodes has failed
   – It is determined which nodes can still talk to each other
   – A “kill request” is sent to the node(s) to be evicted
        → using all (remaining) communication channels → Voting Disk(s)


• A node is requested to “kill itself”; executor: typically CSSD



[Diagram: the kill request travels from node 1 to node 2 via the Voting Disk]




How can nodes be evicted?
Using IPMI / Node eviction basics – Part 2
• Oracle Clusterware 11.2.0.1 and later supports IPMI (optional)
   – Intelligent Platform Management Interface (IPMI) drivers required


• IPMI allows remote shutdown of nodes using additional hardware
   – A Baseboard Management Controller (BMC) per cluster node is required
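A sketch of the related configuration commands (run after the IPMI driver and BMC are set up; the user name and address are placeholders):

 [GRID]> crsctl set css ipmiadmin bmcadmin      (stores the BMC administrator credentials; prompts for the password)
 [GRID]> crsctl set css ipmiaddr 192.168.10.45  (registers this node’s BMC IP address)
 [GRID]> crsctl query css ipmidevice            (verifies that the IPMI device can be accessed)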




Insertion: Node Eviction Using IPMI
EXAMPLE: Heartbeat failure
• The network heartbeat between the nodes has failed
   – It is determined which nodes can still talk to each other
   – IPMI is used to remotely shut down the node to be evicted




[Diagram: node 1 remains; the other node has been shut down via IPMI]




Which node is evicted?
Node eviction basics – Part 3
• Voting Disks and heartbeat communication are used to determine which node is evicted


• In a 2-node cluster, the node with the lowest node number should survive
• In an n-node cluster, the biggest sub-cluster should survive (votes based)
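To see the node numbers referred to here (an added sketch; node names are placeholders):

 [GRID]> olsnodes -n
 node1   1
 node2   2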







  Re-bootless Node
  Fencing (restart)




Re-bootless Node Fencing (restart)
Fence the cluster, do not reboot the node
• Until Oracle Clusterware 11.2.0.2, fencing meant “re-boot”
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less, because:
   – Re-boots affect applications that might run on a node, but are not protected
   – Customer requirement: prevent a reboot, just stop the cluster – implemented...




[Diagram: two nodes, each running a standalone application (App X / App Y), an Oracle RAC database instance, and CSSD]
Re-bootless Node Fencing (restart)
How it works
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
   – Instead of fast re-booting the node, a graceful shutdown of the stack is attempted


• It starts with a failure – e.g. network heartbeat or interconnect failure




[Diagram: the network heartbeat between the two nodes fails]




Re-bootless Node Fencing (restart)
How it works
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
   – Instead of fast re-booting the node, a graceful shutdown of the stack is attempted


• Then I/O-issuing processes are killed; it is made sure that no I/O-issuing process remains
   – For a RAC DB, mainly the log writer and the database writer are of concern




[Diagram: on the node being fenced, the RAC instance is gone while the standalone application keeps running]




Re-bootless Node Fencing (restart)
How it works
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
   – Instead of fast re-booting the node, a graceful shutdown of the stack is attempted

• Once all I/O-issuing processes are killed, remaining processes are stopped
   – IF the check for a successful kill of the I/O processes fails → reboot




Re-bootless Node Fencing (restart)
How it works
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
   – Instead of fast re-booting the node, a graceful shutdown of the stack is attempted

• Once all remaining processes are stopped, the stack stops itself with a “restart flag”




[Diagram: on the fenced node only OHASD remains; the other node continues to run its full stack]




Re-bootless Node Fencing (restart)
How it works
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
   – Instead of fast re-booting the node, a graceful shutdown of the stack is attempted

• OHASD will finally attempt to restart the stack after the graceful shutdown
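After the restart, the state of the stack can be verified with crsctl (a sketch added here; the messages correspond to a healthy 11.2 stack):

 [GRID]> crsctl check crs
 CRS-4638: Oracle High Availability Services is online
 CRS-4537: Cluster Ready Services is online
 CRS-4529: Cluster Synchronization Services is online
 CRS-4533: Event Manager is online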




Re-bootless Node Fencing (restart)
EXCEPTIONS
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less, unless…:
   –   IF the check for a successful kill of the IO processes fails → reboot
   –   IF CSSD gets killed during the operation → reboot
   –   IF cssdmonitor (oprocd replacement) is not scheduled → reboot
    –   IF the stack cannot be shut down within “short_disk_timeout” seconds → reboot











  Advanced Node
  Management
Determine the Biggest Sub-Cluster
Voting Disk basics – Part 4
• Each node in the cluster is “pinged” every second (network heartbeat)
• Each node in the cluster “pings” (r/w) the Voting Disk(s) every second




[Diagram: three nodes (1, 2, 3), each running CSSD, all writing their heartbeats to the Voting Disk]



Determine the Biggest Sub-Cluster
Voting Disk basics – Part 4
• In an n-node cluster, the biggest sub-cluster should survive (votes based)




[Diagram: after a network split, the Voting Disk is used to determine the sub-clusters; the biggest sub-cluster survives]
Redundant Voting Disks – Why odd?
Voting Disk basics – Part 5
• Redundant Voting Disks → Oracle-managed redundancy




• Assume for a moment only 2 voting disks are supported…
[Diagram: three nodes (1, 2, 3), each running CSSD, with two Voting Disks]


Redundant Voting Disks – Why odd?
Voting Disk basics – Part 5
• Advanced scenarios need to be considered




• Without the “Simple Majority Rule”, what would we do?
• Even with the “Simple Majority Rule” in place:
   – Each node can see only one voting disk, which would lead to an eviction of all nodes
[Diagram: three nodes with two Voting Disks; after a split, each node can access only one disk]
Redundant Voting Disks – Why odd?
Voting Disk basics – Part 5

[Diagram: the same scenario with three Voting Disks – each node can still access a majority of the configured disks]



 The Corner Cases




Case 1: Partial Failures in the Cluster
When somebody uses a pair of scissors in the wrong way…




• A properly configured cluster with 3 Voting Disks, as shown
• What happens if there is a storage network failure, as shown (lost remote access)?
[Diagram: two nodes running CSSD; the storage network link between the sites is cut]
Case 1: Partial Failures in the Cluster
When somebody uses a pair of scissors in the wrong way…




• There will be no node eviction!
• IF storage mirroring is used (for data files), the respective solution must handle this case.
• Covered in Oracle ASM 11.2.0.2:
   – _asm_storagemaysplit = TRUE
   – Backported to 11.1.0.7
[Diagram: both nodes keep running; each accesses only the storage it can still reach]




Case 2: CSSD is stuck
CSSD cannot execute request
• A node is requested to “kill itself”
• BUT CSSD is “stuck” or “sick” (does not execute) – e.g.:
   – CSSD failed for some reason
   – CSSD is not scheduled within a certain margin


→ CSSDMONITOR (was: oprocd) will take over and execute
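For illustration (not on the original slide): CSSD and its monitor are registered as OHASD-managed resources and can be checked with crsctl; output abbreviated, node name is a placeholder.

 [GRID]> crsctl stat res ora.cssd ora.cssdmonitor -init
 NAME=ora.cssd
 STATE=ONLINE on node1
 NAME=ora.cssdmonitor
 STATE=ONLINE on node1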



[Diagram: on node 1, CSSDMONITOR steps in for the stuck CSSD and executes the eviction]




Case 3: Node Eviction Escalation
Members of a cluster can escalate kill requests
• Cluster members (e.g. Oracle RAC instances) can request
  Oracle Clusterware to kill a specific member of the cluster

• Oracle Clusterware will then attempt to kill the requested member




[Diagram: instance 1 asks Oracle Clusterware (CSSD) to kill instance 2 on the other node]
Case 3: Node Eviction Escalation
Members of a cluster can escalate kill requests
• Oracle Clusterware will then attempt to kill the requested member


• If the requested member kill is unsuccessful, a node eviction
  escalation can be issued, which leads to the eviction of the
  node on which the particular member currently resides







[Diagram: the node hosting instance 2 has been evicted; only the node running instance 1 remains]







  More Information
More Information
• My Oracle Support Notes:
  – ID 294430.1 - CSS Timeout Computation in Oracle Clusterware
  – ID 395878.1 - Heartbeat/Voting/Quorum Related Timeout Configuration
    for Linux, OCFS2, RAC Stack to Avoid Unnecessary Node Fencing,
    Panic and Reboot


• http://www.oracle.com/goto/clusterware
  – Oracle Clusterware 11g Release 2 Technical Overview


• http://www.oracle.com/goto/asm


• http://www.oracle.com/goto/rac
