OpenNMS / NMS-8360

If the persistence layer is not available or unresponsive, the WebUI becomes unavailable or extremely slow and key components stop working




      When Newts is enabled and the Cassandra cluster is unavailable for some reason, either because it is offline or because it is too busy to respond, certain key pages such as the node page become unavailable. An equivalent problem exists when using RRDtool/JRobin, for example when the underlying disks are not fast enough to update all the RRD/JRB files.

      I can understand that if the persistence layer is not available or unresponsive, you cannot use the reports or the resource graphs pages. But the rest of the OpenNMS WebUI should remain available (especially the node page), and the key functionality of OpenNMS should remain operational. The main reason for saying this is that the response time graphs and the performance graphs are not as important as being sure that everything in the NOC is up and running (which is the purpose of Pollerd), and that when something goes wrong, the operators are notified (which is the purpose of the NBI, Notifd, and integrations through Scriptd or something else).

      The node page is one example of an unresponsive page. It uses the ResourceDAO to make an expensive query to decide whether the "Resource Graphs" link should be displayed. In my opinion, it is better to always show that link. Then, on the choose resources page (where it makes sense to use the ResourceDAO), display a friendly message saying that there are no resources, or a friendly error message if the persistence backend is not available. That way, key features like polling, alarms, etc. can still be used and accessed through the WebUI (which is how users interact with OpenNMS most of the time).
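
      To illustrate the idea for the choose resources page, here is a minimal sketch in Java. It is not OpenNMS code: ResourceLookup, Result, and the messages are hypothetical names, and the Supplier stands in for whatever ResourceDAO call the page really makes. The point is only that the lookup is bounded by a timeout and that every outcome produces something the page can render.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper, not part of OpenNMS: bounds a slow resource query
// so the page can show a message instead of hanging.
final class ResourceLookup {

    /** What the page renders: the resources found, or a message explaining why there are none. */
    static final class Result {
        final List<String> resources;
        final String message; // null when resources were found

        Result(List<String> resources, String message) {
            this.resources = resources;
            this.message = message;
        }
    }

    /** Runs the query on another thread and waits at most timeoutMillis for it. */
    static Result lookup(Supplier<List<String>> query, long timeoutMillis) {
        CompletableFuture<List<String>> future = CompletableFuture.supplyAsync(query);
        try {
            List<String> found = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            if (found.isEmpty()) {
                return new Result(found, "There are no resources for this node.");
            }
            return new Result(found, null);
        } catch (Exception e) {
            // Timeout, interruption, or a backend failure: stop waiting and degrade gracefully.
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt();
            }
            // Marks the future as cancelled; the query thread itself is not interrupted.
            future.cancel(true);
            return new Result(Collections.<String>emptyList(),
                    "The persistence backend is not available right now.");
        }
    }
}
```

      With something like this, the node page would not need to query the backend at all: it always shows the link, and only the choose resources page pays for the lookup.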

      Now, when Cassandra is not available (which is the equivalent of high I/O when RRDtool is enabled), key components like Pollerd and Collectd can make the whole of OpenNMS unresponsive, and this can even end in memory-related crashes, which should never happen.

      In fact, we might introduce some intelligence into the persistence layer so that it simply stops trying to persist data if the underlying technology (i.e. the disks for RRD/JRB files or the Cassandra cluster) is too busy or unresponsive, and generates a Major alarm when this decision is made. That way, OpenNMS can still be useful and can operate properly, especially for its major feature: notifying the operators when nodes are not responding.
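
      A rough sketch of what that could look like, in Java. Everything here is hypothetical (PersistenceGuard, the failure limit, and the alarmSink callback that stands in for sending a Major alarm); it is not how the persistence layer is written today. It is also deliberately simplistic: writes are serialized, and a write that hangs instead of failing would need a timeout on top of this.

```java
import java.util.function.Consumer;

// Hypothetical guard, not an OpenNMS class: stops persistence attempts after
// repeated failures and reports that decision once.
final class PersistenceGuard {
    private final int maxConsecutiveFailures;
    private final Consumer<String> alarmSink; // stand-in for generating a Major alarm
    private int consecutiveFailures = 0;
    private boolean suspended = false;

    PersistenceGuard(int maxConsecutiveFailures, Consumer<String> alarmSink) {
        this.maxConsecutiveFailures = maxConsecutiveFailures;
        this.alarmSink = alarmSink;
    }

    /** Runs the write unless persistence is suspended; returns true if the write succeeded. */
    synchronized boolean persist(Runnable write) {
        if (suspended) {
            return false; // skip the backend entirely so callers are not blocked by it
        }
        try {
            write.run();
            consecutiveFailures = 0;
            return true;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= maxConsecutiveFailures) {
                suspended = true;
                alarmSink.accept("Major: persistence suspended after "
                        + consecutiveFailures + " consecutive failures");
            }
            return false;
        }
    }

    /** Re-enables persistence, for example once the backend is reachable again. */
    synchronized void resume() {
        suspended = false;
        consecutiveFailures = 0;
    }

    synchronized boolean isSuspended() {
        return suspended;
    }
}
```

      The callers (Collectd, Pollerd) would keep running whether persist returns true or false, which is the whole point: losing metrics is acceptable, losing polling is not.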

      In Newts there is an intermediate cache that starts dropping data when the Cassandra backend cannot accept it. We can make decisions based on how fast we are dropping data, and simply stop trying to push data to Cassandra when a threshold on this cache is passed, or when the Cassandra cluster is just not responding. For RRDtool, we need a way to know whether the underlying disk is overwhelmed with I/O, and then do a similar thing.
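
      One possible way to turn "how fast we are dropping data" into a decision is to track the fraction of dropped samples over a sliding window. The Java sketch below assumes exactly that; DropRateMonitor, the window size, and the threshold are made-up names and values, and nothing here reflects how the Newts cache actually reports drops.

```java
// Hypothetical monitor, not part of Newts: tracks what fraction of the most
// recent samples were dropped and says when to stop pushing to the backend.
final class DropRateMonitor {
    private final boolean[] window; // ring buffer, true = sample was dropped
    private final double threshold; // fraction of drops that triggers the stop
    private int next = 0;    // index of the slot to overwrite next
    private int filled = 0;  // number of slots holding real observations
    private int dropped = 0; // number of true entries currently in the window

    DropRateMonitor(int windowSize, double threshold) {
        this.window = new boolean[windowSize];
        this.threshold = threshold;
    }

    /** Records the outcome of one attempt to hand a sample to the backend. */
    synchronized void record(boolean wasDropped) {
        if (filled == window.length) {
            if (window[next]) {
                dropped--; // the oldest observation leaves the window
            }
        } else {
            filled++;
        }
        window[next] = wasDropped;
        if (wasDropped) {
            dropped++;
        }
        next = (next + 1) % window.length;
    }

    /** Fraction of dropped samples among the observations currently in the window. */
    synchronized double dropRate() {
        return filled == 0 ? 0.0 : (double) dropped / filled;
    }

    /** True once the window is full and the drop rate is above the threshold. */
    synchronized boolean shouldStopPushing() {
        return filled == window.length && dropRate() > threshold;
    }
}
```

      For RRDtool the same shape could work with a different input, for example recording whether each file update finished within some time limit, but that part is only a guess.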

      When this happens, we can generate a Major or Critical alarm to let the operators know that there is a problem with the persistence backend, and keep OpenNMS operational. As I said, if one of the key routers goes down and the operators don't receive a notification, that could have serious consequences, especially when the reason is: I cannot store metrics, and for this reason Pollerd cannot operate and generate events.




            agalue Alejandro Galue