Node page very slow to load for nodes with more than 1000 events

Description

The following "long running" query was found by a customer that is reporting that a few nodes have a very long load time on the node page.  Several customers are reporting this problem.

SELECT events.*, node.nodelabel, service.servicename, monitoringsystems.id AS systemId, monitoringsystems.label AS systemLabel, monitoringsystems.location AS location FROM node RIGHT OUTER JOIN events ON (events.nodeid = node.nodeid) LEFT OUTER JOIN monitoringsystems ON (events.systemid = monitoringsystems.id) LEFT OUTER JOIN service ON (service.serviceid = events.serviceid) WHERE EVENTACKUSER IS NULL AND EVENTS.NODEID = <node id> AND EVENTDISPLAY='Y' ORDER BY EVENTID DESC LIMIT 5 OFFSET 0;

Acceptance / Success Criteria

None

Lucidchart Diagrams

Activity

Jesse White January 31, 2019 at 2:11 PM

Jesse White January 30, 2019 at 2:04 PM

Using the database in question:

opennms=> EXPLAIN ANALYZE SELECT events.*, node.nodelabel, service.servicename, monitoringsystems.id AS systemId, monitoringsystems.label AS systemLabel, monitoringsystems.location AS location FROM node RIGHT OUTER JOIN events ON (events.nodeid = node.nodeid) LEFT OUTER JOIN monitoringsystems ON (events.systemid = monitoringsystems.id) LEFT OUTER JOIN service ON (service.serviceid = events.serviceid) WHERE EVENTACKUSER IS NULL AND EVENTS.NODEID = 363 AND EVENTDISPLAY='Y' ORDER BY EVENTID DESC LIMIT 5 OFFSET 0; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.71..18315.80 rows=5 width=5701) (actual time=4207.683..6440.949 rows=5 loops=1) -> Nested Loop Left Join (cost=0.71..33879263.33 rows=9249 width=5701) (actual time=4207.682..6440.943 rows=5 loops=1) Join Filter: (service.serviceid = events.serviceid) Rows Removed by Join Filter: 55 -> Nested Loop Left Join (cost=0.71..33877736.11 rows=9249 width=5185) (actual time=4206.996..6440.247 rows=5 loops=1) Join Filter: (events.systemid = monitoringsystems.id) -> Nested Loop Left Join (cost=0.71..33877388.24 rows=9249 width=5089) (actual time=4206.982..6440.225 rows=5 loops=1) Join Filter: (events.nodeid = node.nodeid) -> Index Scan Backward using pk_eventid on events (cost=0.43..33877241.21 rows=9249 width=5071) (actual time=4204.640..6437.870 rows=5 loops=1) Filter: ((eventackuser IS NULL) AND (nodeid = 363) AND (eventdisplay = 'Y'::bpchar)) Rows Removed by Filter: 1829130 -> Materialize (cost=0.28..8.30 rows=1 width=22) (actual time=0.468..0.468 rows=1 loops=5) -> Index Scan using node_id_type_idx on node (cost=0.28..8.29 rows=1 width=22) (actual time=2.315..2.315 rows=1 loops=1) Index Cond: (nodeid = 363) -> Materialize (cost=0.00..1.04 rows=3 width=96) (actual time=0.002..0.002 rows=1 loops=5) -> Seq Scan on monitoringsystems (cost=0.00..1.03 rows=3 width=96) (actual time=0.008..0.008 rows=1 loops=1) -> Materialize (cost=0.00..1.17 rows=11 width=520) (actual time=0.136..0.137 rows=11 loops=5) -> Seq Scan on service (cost=0.00..1.11 rows=11 width=520) (actual time=0.678..0.679 rows=11 loops=1) Planning time: 0.439 ms Execution time: 6441.001 ms (20 rows)

The query take 6+ seconds.

Adding eventtime to the ORDER BY clause reduces this query down to <1ms:

opennms=> EXPLAIN ANALYZE SELECT events.*, opennms-> node.nodelabel, opennms-> service.servicename, opennms-> monitoringsystems.id AS systemId, opennms-> monitoringsystems.label AS systemLabel, opennms-> monitoringsystems.location AS location opennms-> FROM node opennms-> RIGHT OUTER JOIN events opennms-> ON (events.nodeid = node.nodeid) opennms-> LEFT OUTER JOIN monitoringsystems opennms-> ON (events.systemid = monitoringsystems.id) opennms-> LEFT OUTER JOIN service opennms-> ON (service.serviceid = events.serviceid) opennms-> WHERE EVENTACKUSER IS NULL opennms-> AND EVENTS.NODEID = 363 opennms-> AND EVENTDISPLAY='Y' opennms-> ORDER BY EVENTID, EVENTTIME opennms-> DESC opennms-> LIMIT 5 opennms-> OFFSET 0; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=36040.97..36040.98 rows=5 width=5701) (actual time=0.399..0.401 rows=5 loops=1) -> Sort (cost=36040.97..36064.09 rows=9249 width=5701) (actual time=0.399..0.399 rows=5 loops=1) Sort Key: events.eventid, events.eventtime DESC Sort Method: top-N heapsort Memory: 30kB -> Hash Left Join (cost=177.64..35887.35 rows=9249 width=5701) (actual time=0.046..0.337 rows=88 loops=1) Hash Cond: (events.serviceid = service.serviceid) -> Nested Loop Left Join (cost=176.39..35861.82 rows=9249 width=5185) (actual time=0.037..0.295 rows=88 loops=1) Join Filter: (events.systemid = monitoringsystems.id) -> Nested Loop Left Join (cost=176.39..35513.95 rows=9249 width=5089) (actual time=0.033..0.262 rows=88 loops=1) Join Filter: (events.nodeid = node.nodeid) -> Bitmap Heap Scan on events (cost=176.11..35366.92 rows=9249 width=5071) (actual time=0.025..0.211 rows=88 loops=1) Recheck Cond: (nodeid = 363) Filter: ((eventackuser IS NULL) AND (eventdisplay = 'Y'::bpchar)) Heap Blocks: exact=71 -> Bitmap Index Scan on events_nodeid_idx (cost=0.00..173.80 rows=9249 width=0) (actual time=0.015..0.015 rows=88 loops=1) Index Cond: (nodeid = 363) -> Materialize (cost=0.28..8.30 rows=1 width=22) (actual time=0.000..0.000 rows=1 loops=88) -> Index Scan using node_id_type_idx on node (cost=0.28..8.29 rows=1 width=22) (actual time=0.006..0.006 rows=1 loops=1) Index Cond: (nodeid = 363) -> Materialize (cost=0.00..1.04 rows=3 width=96) (actual time=0.000..0.000 rows=1 loops=88) -> Seq Scan on monitoringsystems (cost=0.00..1.03 rows=3 width=96) (actual time=0.002..0.002 rows=1 loops=1) -> Hash (cost=1.11..1.11 rows=11 width=520) (actual time=0.006..0.006 rows=11 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Seq Scan on service (cost=0.00..1.11 rows=11 width=520) (actual time=0.002..0.003 rows=11 loops=1) Planning time: 0.367 ms Execution time: 0.454 ms (26 rows)

David Hustace January 26, 2019 at 3:22 PM

We have a DB dump from the customer and a short list of nodes that are repeatedly effected after cleaning out all events.

 

Jesse White January 10, 2019 at 7:14 PM

Can we grab the same details as above for the deployments & nodes where this was observed?

Jesse White January 10, 2019 at 7:08 PM

I tried persisting 2 million+ events for a node:

opennms=> select count(*) from events where nodeid = 1; count --------- 2380663 (1 row)

And then running the query:

opennms=> EXPLAIN ANALYZE SELECT events.*, node.nodelabel, service.servicename, monitoringsystems.id AS systemId, monitoringsystems.label AS systemLabel, monitoringsystems.location AS location FROM node RIGHT OUTER JOIN events ON (events.nodeid = node.nodeid) LEFT OUTER JOIN monitoringsystems ON (events.systemid = monitoringsystems.id) LEFT OUTER JOIN service ON (service.serviceid = events.serviceid) WHERE EVENTACKUSER IS NULL AND EVENTS.NODEID = 1 AND EVENTDISPLAY='Y' ORDER BY EVENTID DESC LIMIT 5 OFFSET 0; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------- ---------------- Limit (cost=0.86..2.96 rows=5 width=7138) (actual time=0.055..0.076 rows=5 loops=1) -> Nested Loop Left Join (cost=0.86..1001685.12 rows=2381725 width=7138) (actual time=0.054..0.074 rows=5 loops=1) -> Nested Loop Left Join (cost=0.71..614606.79 rows=2381725 width=6622) (actual time=0.049..0.068 rows=5 loops=1) -> Nested Loop Left Join (cost=0.57..221675.17 rows=2381725 width=6526) (actual time=0.038..0.051 rows=5 loops=1) Join Filter: (events.nodeid = node.nodeid) -> Index Scan Backward using pk_eventid on events (cost=0.43..185941.13 rows=2381725 width=6010) (actual time=0.013..0.025 rows=5 loops=1) Filter: ((eventackuser IS NULL) AND (nodeid = 1) AND (eventdisplay = 'Y'::bpchar)) Rows Removed by Filter: 8 -> Materialize (cost=0.14..8.16 rows=1 width=520) (actual time=0.005..0.005 rows=1 loops=5) -> Index Scan using node_id_type_idx on node (cost=0.14..8.15 rows=1 width=520) (actual time=0.020..0.020 rows=1 loop s=1) Index Cond: (nodeid = 1) -> Index Scan using monitoringsystems_pkey on monitoringsystems (cost=0.15..0.17 rows=1 width=96) (actual time=0.002..0.002 rows= 1 loops=5) Index Cond: (events.systemid = id) -> Index Scan using pk_serviceid on service (cost=0.14..0.16 rows=1 width=520) (actual time=0.001..0.001 rows=0 loops=5) Index Cond: (serviceid = events.serviceid) Planning time: 0.991 ms Execution time: 0.172 ms (17 rows)

Also rendering the page in question:

time curl -u admin:admin "http://localhost:8980/opennms/includes/eventlist.jsp?node=1&throttle=5" real 0m0.126s user 0m0.004s sys 0m0.003s

Everything is nice and quick - similar results with builds of both foundation-2017 and foundation-2018.

Fixed

Details

Assignee

Reporter

Due date

Sprint

Affects versions

Priority

PagerDuty

Created December 13, 2018 at 1:23 PM
Updated January 31, 2019 at 9:04 PM
Resolved January 31, 2019 at 9:04 PM

Flag notifications