Outage records are not getting written to the database
Description
Activity

Alejandro Galue February 4, 2015 at 11:22 AM
I repeated the tests again this morning, and I take back my earlier statement that Pollerd was working "slow". It is working as expected.
Here are the details of the tests that I did today:
1) I configured two identical CentOS 7 VMs: one with OpenNMS 15 installed through YUM-stable, and the other with OpenNMS 16-SNAPSHOT from jira/ through YUM-branches.
2) I've started OpenNMS on both VMs at the same time, and pushed the same requisition in both servers at the same time.
3) Then I took down several nodes (one at a time) to see when both OpenNMS servers detected the nodeDown (and the nodeUp). In all cases (even those with Path Outages) the alarms arrived at almost the same time, so there are no delays in the Pollerd workflow.
I think the issue with "Not Monitored" vs "100% Available" is not related to this pull request, so I've created another Jira issue for it.
I've merged the changes into develop (revision 82e2861dfc697b5b42ea4b0e35dc8685b528e5fb) and closed the pull-request.

Alejandro Galue February 2, 2015 at 5:44 PM
The binary DB dump was generated with PostgreSQL 9.4.

Alejandro Galue February 2, 2015 at 5:38 PM
In general, with Jesse's changes, Pollerd works better, but it is still far from behaving like Pollerd in OpenNMS 14.
I have a small virtual Cisco environment I use for testing OpenNMS. I compiled the branch for this issue from scratch to start playing with Pollerd. This is what I did:
1) Compile OpenNMS
2) Start OpenNMS
3) Import a requisition for my Cisco environment
4) Simulate an outage for one device, and verify it works. Then, revert the change and verify the outage is cleared.
5) Simulate an outage for the central device (i.e., the device behind the whole virtual Cisco network) and verify the workflow. It took much longer to detect all the nodes as down compared with 14, but it eventually did.
6) Add another requisition to monitor two servers (the OpenNMS server itself and another server) and simulate some outages. The PostgreSQL server, which doesn't exist in poller-configuration.xml, appears as 100% available, while in 14 it appears as "Not Monitored" (the expected behavior). I also simulated an HTTPS outage on the other server: after restoring the service and verifying access, OpenNMS still shows HTTPS as down.
I'm attaching the configurations (as a Git repository, to show the changes) and a binary dump of the database.

Jesse White February 2, 2015 at 2:38 PM
The issue is in fact a race condition that allows events to be received out of order. The poller daemon does not handle this scenario properly, leading to two outages being simultaneously open in the database.
This previously went undetected, but the new Hibernate changes in 15 now verify the uniqueness of these records.
A fix is available in https://github.com/OpenNMS/opennms/pull/220, currently pending peer review.
For those affected by the bug, the only remediation available with 15 is to manually delete the duplicate records from the outages table:
DELETE FROM outages
WHERE ifregainedservice IS NULL
  AND outageid NOT IN (
    SELECT MIN(o.outageid)
    FROM outages o
    WHERE o.ifregainedservice IS NULL
    GROUP BY o.ifserviceid
);
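Before running the delete, a quick way to confirm you are affected is to count the services with more than one open outage. This is a sketch against the same outages table the DELETE above targets (assuming the stock OpenNMS schema):

```sql
-- List services that currently have more than one open outage
-- (ifregainedservice IS NULL means the outage is still open).
SELECT ifserviceid, COUNT(*) AS open_outages
FROM outages
WHERE ifregainedservice IS NULL
GROUP BY ifserviceid
HAVING COUNT(*) > 1;
```

Any rows returned here correspond to the duplicates the DELETE removes, keeping the oldest open outage (lowest outageid) per service.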

lyndonl@mobiletorque.co.za January 31, 2015 at 1:23 PM
I have the same issue on my install; it was an upgrade from 14.0.3.
Notifications are sent, but availability continues to show 100% and the outage does not appear in "Nodes with Outages" or on the "Surveillance" screens.
It does show up in "Nodes with Pending Problems", but again, when selecting the listed down device, the availability is 100% even though the device has been down for hours.
If there is further information or logs I can send, let me know.
Details
Assignee: Jesse White
Reporter: Tarus Balog
Priority: Blocker
With OpenNMS 15, no outage records are getting written to the outages table of the database.
For example, on the demo system a number of devices have an outage due to software needing to be updated. RTC correctly shows the outage, but the outages list does not, and the availability of those services shows 100%.
This is a pretty nasty bug, as it affects all aspects of availability calculations. The lost service events, however, are still being generated. Not sure if this also affects node down or interface down events.
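One way to check whether any open outages are being recorded at all is to query the outages table directly. This is a diagnostic sketch, assuming the stock OpenNMS schema described above:

```sql
-- Count currently-open outages; on an affected 15 install this can
-- stay at 0 even while nodeLostService events are being generated.
SELECT COUNT(*) AS open_outages
FROM outages
WHERE ifregainedservice IS NULL;
```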