Node page very slow to load for nodes with more than 1000 events

Description

The following "long running" query was found by a customer that is reporting that a few nodes have a very long load time on the node page. Several customers are reporting this problem.

SELECT events.*,
                 node.nodelabel,
                 service.servicename,
                 monitoringsystems.id AS systemId,
                 monitoringsystems.label AS systemLabel,
                 monitoringsystems.location AS location
            FROM node
RIGHT OUTER JOIN events
              ON (events.nodeid = node.nodeid)
 LEFT OUTER JOIN monitoringsystems
              ON (events.systemid = monitoringsystems.id)
 LEFT OUTER JOIN service
              ON (service.serviceid = events.serviceid)
           WHERE EVENTACKUSER IS NULL
             AND EVENTS.NODEID = <node id>
             AND EVENTDISPLAY='Y'
        ORDER BY EVENTID
            DESC
           LIMIT 5
          OFFSET 0;

Acceptance / Success Criteria

None

OpenNMS Bamboo

https://bamboo.opennms.org/browse/OPENNMS-ONMS2723/latest

Lucidchart Diagrams

Activity

Jesse White January 31, 2019 at 2:11 PM

PR: https://github.com/OpenNMS/opennms/pull/2336

Jesse White January 30, 2019 at 2:04 PM

Using the database in question:

opennms=> EXPLAIN ANALYZE SELECT events.*,                                                                                                       
                 node.nodelabel,
                 service.servicename,                                                                                                            
                 monitoringsystems.id AS systemId,                                                                                               
                 monitoringsystems.label AS systemLabel,                                                                                         
                 monitoringsystems.location AS location                                                                                          
            FROM node
RIGHT OUTER JOIN events                                                                                                                          
              ON (events.nodeid = node.nodeid)
 LEFT OUTER JOIN monitoringsystems                                                                                                               
              ON (events.systemid = monitoringsystems.id)
 LEFT OUTER JOIN service                                                                                                                         
              ON (service.serviceid = events.serviceid)                                                                                          
           WHERE EVENTACKUSER IS NULL
             AND EVENTS.NODEID = 363
             AND EVENTDISPLAY='Y'                                                                                                                
        ORDER BY EVENTID           
            DESC                                                                                                                             
           LIMIT 5                                                                                                                               
          OFFSET 0;
                                                                              QUERY PLAN                                                                               
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.71..18315.80 rows=5 width=5701) (actual time=4207.683..6440.949 rows=5 loops=1)
   ->  Nested Loop Left Join  (cost=0.71..33879263.33 rows=9249 width=5701) (actual time=4207.682..6440.943 rows=5 loops=1)
         Join Filter: (service.serviceid = events.serviceid)
         Rows Removed by Join Filter: 55
         ->  Nested Loop Left Join  (cost=0.71..33877736.11 rows=9249 width=5185) (actual time=4206.996..6440.247 rows=5 loops=1)
               Join Filter: (events.systemid = monitoringsystems.id)
               ->  Nested Loop Left Join  (cost=0.71..33877388.24 rows=9249 width=5089) (actual time=4206.982..6440.225 rows=5 loops=1)
                     Join Filter: (events.nodeid = node.nodeid)
                     ->  Index Scan Backward using pk_eventid on events  (cost=0.43..33877241.21 rows=9249 width=5071) (actual time=4204.640..6437.870 rows=5 loops=1)
                           Filter: ((eventackuser IS NULL) AND (nodeid = 363) AND (eventdisplay = 'Y'::bpchar))
                           Rows Removed by Filter: 1829130
                     ->  Materialize  (cost=0.28..8.30 rows=1 width=22) (actual time=0.468..0.468 rows=1 loops=5)
                           ->  Index Scan using node_id_type_idx on node  (cost=0.28..8.29 rows=1 width=22) (actual time=2.315..2.315 rows=1 loops=1)
                                 Index Cond: (nodeid = 363)
               ->  Materialize  (cost=0.00..1.04 rows=3 width=96) (actual time=0.002..0.002 rows=1 loops=5)
                     ->  Seq Scan on monitoringsystems  (cost=0.00..1.03 rows=3 width=96) (actual time=0.008..0.008 rows=1 loops=1)
         ->  Materialize  (cost=0.00..1.17 rows=11 width=520) (actual time=0.136..0.137 rows=11 loops=5)
               ->  Seq Scan on service  (cost=0.00..1.11 rows=11 width=520) (actual time=0.678..0.679 rows=11 loops=1)
 Planning time: 0.439 ms
 Execution time: 6441.001 ms
(20 rows)

The query take 6+ seconds.

Adding eventtime to the ORDER BY clause reduces this query down to <1ms:

opennms=> EXPLAIN ANALYZE SELECT events.*,                                                                                                       
opennms->                  node.nodelabel,
opennms->                  service.servicename,                                                                                                            
opennms->                  monitoringsystems.id AS systemId,                                                                                               
opennms->                  monitoringsystems.label AS systemLabel,                                                                                         
opennms->                  monitoringsystems.location AS location                                                                                          
opennms->             FROM node
opennms-> RIGHT OUTER JOIN events                                                                                                                          
opennms->               ON (events.nodeid = node.nodeid)
opennms->  LEFT OUTER JOIN monitoringsystems                                                                                                               
opennms->               ON (events.systemid = monitoringsystems.id)
opennms->  LEFT OUTER JOIN service                                                                                                                         
opennms->               ON (service.serviceid = events.serviceid)                                                                                          
opennms->            WHERE EVENTACKUSER IS NULL
opennms->              AND EVENTS.NODEID = 363
opennms->              AND EVENTDISPLAY='Y'                                                                                                                
opennms->         ORDER BY EVENTID, EVENTTIME
opennms->             DESC
opennms->            LIMIT 5                                                                                                                               
opennms->           OFFSET 0;
                                                                          QUERY PLAN                                                                          
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=36040.97..36040.98 rows=5 width=5701) (actual time=0.399..0.401 rows=5 loops=1)
   ->  Sort  (cost=36040.97..36064.09 rows=9249 width=5701) (actual time=0.399..0.399 rows=5 loops=1)
         Sort Key: events.eventid, events.eventtime DESC
         Sort Method: top-N heapsort  Memory: 30kB
         ->  Hash Left Join  (cost=177.64..35887.35 rows=9249 width=5701) (actual time=0.046..0.337 rows=88 loops=1)
               Hash Cond: (events.serviceid = service.serviceid)
               ->  Nested Loop Left Join  (cost=176.39..35861.82 rows=9249 width=5185) (actual time=0.037..0.295 rows=88 loops=1)
                     Join Filter: (events.systemid = monitoringsystems.id)
                     ->  Nested Loop Left Join  (cost=176.39..35513.95 rows=9249 width=5089) (actual time=0.033..0.262 rows=88 loops=1)
                           Join Filter: (events.nodeid = node.nodeid)
                           ->  Bitmap Heap Scan on events  (cost=176.11..35366.92 rows=9249 width=5071) (actual time=0.025..0.211 rows=88 loops=1)
                                 Recheck Cond: (nodeid = 363)
                                 Filter: ((eventackuser IS NULL) AND (eventdisplay = 'Y'::bpchar))
                                 Heap Blocks: exact=71
                                 ->  Bitmap Index Scan on events_nodeid_idx  (cost=0.00..173.80 rows=9249 width=0) (actual time=0.015..0.015 rows=88 loops=1)
                                       Index Cond: (nodeid = 363)
                           ->  Materialize  (cost=0.28..8.30 rows=1 width=22) (actual time=0.000..0.000 rows=1 loops=88)
                                 ->  Index Scan using node_id_type_idx on node  (cost=0.28..8.29 rows=1 width=22) (actual time=0.006..0.006 rows=1 loops=1)
                                       Index Cond: (nodeid = 363)
                     ->  Materialize  (cost=0.00..1.04 rows=3 width=96) (actual time=0.000..0.000 rows=1 loops=88)
                           ->  Seq Scan on monitoringsystems  (cost=0.00..1.03 rows=3 width=96) (actual time=0.002..0.002 rows=1 loops=1)
               ->  Hash  (cost=1.11..1.11 rows=11 width=520) (actual time=0.006..0.006 rows=11 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 9kB
                     ->  Seq Scan on service  (cost=0.00..1.11 rows=11 width=520) (actual time=0.002..0.003 rows=11 loops=1)
 Planning time: 0.367 ms
 Execution time: 0.454 ms
(26 rows)

David Hustace January 26, 2019 at 3:22 PM

We have a DB dump from the customer and a short list of nodes that are repeatedly effected after cleaning out all events.

Jesse White January 10, 2019 at 7:14 PM

Can we grab the same details as above for the deployments & nodes where this was observed?

Jesse White January 10, 2019 at 7:08 PM

I tried persisting 2 million+ events for a node:

opennms=> select count(*) from events where nodeid = 1;
  count
---------
 2380663
(1 row)

And then running the query:

opennms=> EXPLAIN ANALYZE SELECT events.*,                                                                                                       
                 node.nodelabel,
                 service.servicename,                                                                                                            
                 monitoringsystems.id AS systemId,                                                                                               
                 monitoringsystems.label AS systemLabel,                                                                                         
                 monitoringsystems.location AS location                                                                                          
            FROM node
RIGHT OUTER JOIN events                                                                                                                          
              ON (events.nodeid = node.nodeid)
 LEFT OUTER JOIN monitoringsystems                                                                                                               
              ON (events.systemid = monitoringsystems.id)
 LEFT OUTER JOIN service                                                                                                                         
              ON (service.serviceid = events.serviceid)                                                                                          
           WHERE EVENTACKUSER IS NULL
             AND EVENTS.NODEID = 1
             AND EVENTDISPLAY='Y'                                                                                                                
        ORDER BY EVENTID
            DESC
           LIMIT 5                                                                                                                               
          OFFSET 0;
                                                                            QUERY PLAN                                                           

--------------------------------------------------------------------------------------------------------------------------------------------------
----------------
 Limit  (cost=0.86..2.96 rows=5 width=7138) (actual time=0.055..0.076 rows=5 loops=1)                                                            
   ->  Nested Loop Left Join  (cost=0.86..1001685.12 rows=2381725 width=7138) (actual time=0.054..0.074 rows=5 loops=1)                          
         ->  Nested Loop Left Join  (cost=0.71..614606.79 rows=2381725 width=6622) (actual time=0.049..0.068 rows=5 loops=1)                     
               ->  Nested Loop Left Join  (cost=0.57..221675.17 rows=2381725 width=6526) (actual time=0.038..0.051 rows=5 loops=1)               
                     Join Filter: (events.nodeid = node.nodeid)
                     ->  Index Scan Backward using pk_eventid on events  (cost=0.43..185941.13 rows=2381725 width=6010) (actual time=0.013..0.025
rows=5 loops=1)
                           Filter: ((eventackuser IS NULL) AND (nodeid = 1) AND (eventdisplay = 'Y'::bpchar))                                    
                           Rows Removed by Filter: 8
                     ->  Materialize  (cost=0.14..8.16 rows=1 width=520) (actual time=0.005..0.005 rows=1 loops=5)                               
                           ->  Index Scan using node_id_type_idx on node  (cost=0.14..8.15 rows=1 width=520) (actual time=0.020..0.020 rows=1 loop
s=1)
                                 Index Cond: (nodeid = 1)
               ->  Index Scan using monitoringsystems_pkey on monitoringsystems  (cost=0.15..0.17 rows=1 width=96) (actual time=0.002..0.002 rows=
1 loops=5)
                     Index Cond: (events.systemid = id)
         ->  Index Scan using pk_serviceid on service  (cost=0.14..0.16 rows=1 width=520) (actual time=0.001..0.001 rows=0 loops=5)              
               Index Cond: (serviceid = events.serviceid)
 Planning time: 0.991 ms
 Execution time: 0.172 ms
(17 rows)

Also rendering the page in question:

time curl -u admin:admin "http://localhost:8980/opennms/includes/eventlist.jsp?node=1&throttle=5"
real    0m0.126s
user    0m0.004s
sys     0m0.003s

Everything is nice and quick - similar results with builds of both foundation-2017 and foundation-2018.

Fixed

Details
Assignee
Jesse White
Reporter
David Hustace
Labels
bugfixslacustomersupport
Due date
Dec 31, 2018
Components
Web UI - Node Details
Sprint
None
Fix versions
23.0.3
Meridian-2018.1.5
24.0.0
Affects versions
21.1.0
Priority
Major

PagerDuty

Created December 13, 2018 at 1:23 PM

Updated January 31, 2019 at 9:04 PM

Resolved January 31, 2019 at 9:04 PM

Node page very slow to load for nodes with more than 1000 events

Description

Acceptance / Success Criteria

OpenNMS Bamboo

Lucidchart Diagrams

Activity

Jesse White January 31, 2019 at 2:11 PM

Jesse White January 30, 2019 at 2:04 PM

David Hustace January 26, 2019 at 3:22 PM

Jesse White January 10, 2019 at 7:14 PM

Jesse White January 10, 2019 at 7:08 PM

Details
Assignee
Jesse White
Reporter
David Hustace
Labels
bugfixslacustomersupport
Due date
Dec 31, 2018
Components
Web UI - Node Details
Sprint
None
Fix versions
23.0.3
Meridian-2018.1.5
24.0.0
Affects versions
21.1.0
Priority
Major

Details

Assignee

Reporter

Labels

Due date

Components

Sprint

Fix versions

Affects versions

Priority

PagerDuty

PagerDuty

Flag notifications

Something's gone wrong

Something's gone wrong

Node page very slow to load for nodes with more than 1000 events

Description

Acceptance / Success Criteria

OpenNMS Bamboo

Lucidchart Diagrams

Activity

Jesse White January 31, 2019 at 2:11 PM

Jesse White January 30, 2019 at 2:04 PM

David Hustace January 26, 2019 at 3:22 PM

Jesse White January 10, 2019 at 7:14 PM

Jesse White January 10, 2019 at 7:08 PM

DetailsAssigneeJesse WhiteJesse WhiteReporterDavid HustaceDavid HustaceLabelsbugfixslacustomersupportDue dateDec 31, 2018ComponentsWeb UI - Node DetailsSprintNone+4Fix versions23.0.3Meridian-2018.1.524.0.0Affects versions21.1.0PriorityMajor

Details

Assignee

Reporter

Labels

Due date

Components

Sprint

Fix versions

Affects versions

Priority

PagerDutyPagerDuty Incident

PagerDuty

Flag notifications

Something's gone wrong

Something's gone wrong

Details
Assignee
Jesse White
Reporter
David Hustace
Labels
bugfixslacustomersupport
Due date
Dec 31, 2018
Components
Web UI - Node Details
Sprint
None
Fix versions
23.0.3
Meridian-2018.1.5
24.0.0
Affects versions
21.1.0
Priority
Major

PagerDuty