Friday, October 21, 2016

Check Exadata key InfiniBand fabric error counters via Oracle Linux

Checking the key InfiniBand fabric error counters on your Exadata machine is good practice and can be done from the Linux operating system. This check is part of exachk, which should be run regularly to ensure your Oracle Engineered System is in good shape. However, many people prefer to run some checks more often than a full exachk run, and the key InfiniBand fabric error counters are a good candidate for that.

To check the key InfiniBand fabric error counters on your Oracle Exadata you can use the exachk report, but you can also run the check directly from the Oracle Linux operating system. The code used for this check is shown in the details pages of the exachk report and is reproduced below.

# Xen user domain (domU): /proc/xen exists but has no capabilities file.
if [[ -d /proc/xen && ! -f /proc/xen/capabilities ]]
then
  echo -e "\nThis check will not run in a user domain of a virtualized environment.  Execute this check in the management domain.\n"
else
  # Keep only the lines that report one of the key error counters.
  RAW_DATA=$(ibqueryerrors.pl | egrep 'Recover.*SymbolError|SymbolError.*Recover|SymbolError|LinkDowned|RcvErrors|RcvRemotePhys|LinkIntegrityErrors')
  if [ -z "$RAW_DATA" ]
  then
    echo -e "SUCCESS: Key InfiniBand fabric error counters were not found"
  else
    echo -e "WARNING: Key InfiniBand fabric error counters were found\n\nCounters Found:\n"
    echo -e "$RAW_DATA"
  fi
fi

You can easily take this code and put it into a custom bash script to be executed by your Exadata administrators. If you implement a large set of custom-built checks using Oracle Enterprise Manager, you can also use the above code to build a custom check.

When all results are good you should receive the message "SUCCESS: Key InfiniBand fabric error counters were not found". When building a custom OEM check it might be better to change this into a numeric value, for example with 0 representing all OK.
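A numeric variant could look roughly like the sketch below. It is not part of exachk: the function name count_ib_errors is made up for illustration, and the sketch simply counts the output lines that mention one of the key counters, so that 0 means nothing was found. The exact output format your OEM check expects is not covered here and would still need to be added.

```shell
#!/bin/bash
# Hypothetical helper (not from exachk): reads ibqueryerrors output on stdin
# and prints the number of lines that mention one of the key error counters.
# grep -c prints 0 when nothing matches; "|| true" hides its non-zero exit code.
count_ib_errors() {
  grep -E -c 'SymbolError|LinkDowned|RcvErrors|RcvRemotePhys|LinkIntegrityErrors' || true
}

# Only query the fabric when the tool is actually available on this host.
if command -v ibqueryerrors.pl >/dev/null 2>&1; then
  ibqueryerrors.pl | count_ib_errors
fi
```

Note that this counts matching lines, not the counter values themselves, so a port reported on both a "port ALL" line and its own line is counted twice.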

If there are errors, you will receive a message like the following example:

WARNING: Key InfiniBand fabric error counters were found

Counters Found:

   GUID 0x21286ccbaea0a0 port ALL: [SymbolErrorCounter == 2]
   GUID 0x21286ccbaea0a0 port 34: [SymbolErrorCounter == 2]
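
If you want to feed these lines into another tool, they can be split into fields. The sketch below assumes the lines look exactly like the example above (GUID, port, then a bracketed counter and value); the output of ibqueryerrors may differ between versions, and the function name parse_ib_counters is made up for illustration. When a line lists several counters, only the first one is reported.

```shell
#!/bin/bash
# Hypothetical helper: turns lines such as
#   GUID 0x21286ccbaea0a0 port 34: [SymbolErrorCounter == 2]
# into comma-separated fields: guid,port,counter,value.
# Lines that do not match the assumed layout are silently skipped.
parse_ib_counters() {
  sed -n -E 's/^[[:space:]]*GUID[[:space:]]+([^[:space:]]+)[[:space:]]+port[[:space:]]+([^:]+):[[:space:]]*\[([A-Za-z]+)[[:space:]]*==[[:space:]]*([0-9]+)\].*$/\1,\2,\3,\4/p'
}
```

For example, piping the second sample line through parse_ib_counters gives the GUID, the port number 34, the counter name SymbolErrorCounter and the value 2 as four comma-separated fields.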
