Node page very slow to load for nodes with more than 1000 events
Description
Acceptance / Success Criteria
None
Lucidchart Diagrams
Activity
Jesse White January 30, 2019 at 2:04 PM
Using the database in question:
opennms=> EXPLAIN ANALYZE SELECT events.*,
node.nodelabel,
service.servicename,
monitoringsystems.id AS systemId,
monitoringsystems.label AS systemLabel,
monitoringsystems.location AS location
FROM node
RIGHT OUTER JOIN events
ON (events.nodeid = node.nodeid)
LEFT OUTER JOIN monitoringsystems
ON (events.systemid = monitoringsystems.id)
LEFT OUTER JOIN service
ON (service.serviceid = events.serviceid)
WHERE EVENTACKUSER IS NULL
AND EVENTS.NODEID = 363
AND EVENTDISPLAY='Y'
ORDER BY EVENTID
DESC
LIMIT 5
OFFSET 0;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.71..18315.80 rows=5 width=5701) (actual time=4207.683..6440.949 rows=5 loops=1)
-> Nested Loop Left Join (cost=0.71..33879263.33 rows=9249 width=5701) (actual time=4207.682..6440.943 rows=5 loops=1)
Join Filter: (service.serviceid = events.serviceid)
Rows Removed by Join Filter: 55
-> Nested Loop Left Join (cost=0.71..33877736.11 rows=9249 width=5185) (actual time=4206.996..6440.247 rows=5 loops=1)
Join Filter: (events.systemid = monitoringsystems.id)
-> Nested Loop Left Join (cost=0.71..33877388.24 rows=9249 width=5089) (actual time=4206.982..6440.225 rows=5 loops=1)
Join Filter: (events.nodeid = node.nodeid)
-> Index Scan Backward using pk_eventid on events (cost=0.43..33877241.21 rows=9249 width=5071) (actual time=4204.640..6437.870 rows=5 loops=1)
Filter: ((eventackuser IS NULL) AND (nodeid = 363) AND (eventdisplay = 'Y'::bpchar))
Rows Removed by Filter: 1829130
-> Materialize (cost=0.28..8.30 rows=1 width=22) (actual time=0.468..0.468 rows=1 loops=5)
-> Index Scan using node_id_type_idx on node (cost=0.28..8.29 rows=1 width=22) (actual time=2.315..2.315 rows=1 loops=1)
Index Cond: (nodeid = 363)
-> Materialize (cost=0.00..1.04 rows=3 width=96) (actual time=0.002..0.002 rows=1 loops=5)
-> Seq Scan on monitoringsystems (cost=0.00..1.03 rows=3 width=96) (actual time=0.008..0.008 rows=1 loops=1)
-> Materialize (cost=0.00..1.17 rows=11 width=520) (actual time=0.136..0.137 rows=11 loops=5)
-> Seq Scan on service (cost=0.00..1.11 rows=11 width=520) (actual time=0.678..0.679 rows=11 loops=1)
Planning time: 0.439 ms
Execution time: 6441.001 ms
(20 rows)
The query take 6+ seconds.
Adding eventtime to the ORDER BY clause reduces this query down to <1ms:
opennms=> EXPLAIN ANALYZE SELECT events.*,
opennms-> node.nodelabel,
opennms-> service.servicename,
opennms-> monitoringsystems.id AS systemId,
opennms-> monitoringsystems.label AS systemLabel,
opennms-> monitoringsystems.location AS location
opennms-> FROM node
opennms-> RIGHT OUTER JOIN events
opennms-> ON (events.nodeid = node.nodeid)
opennms-> LEFT OUTER JOIN monitoringsystems
opennms-> ON (events.systemid = monitoringsystems.id)
opennms-> LEFT OUTER JOIN service
opennms-> ON (service.serviceid = events.serviceid)
opennms-> WHERE EVENTACKUSER IS NULL
opennms-> AND EVENTS.NODEID = 363
opennms-> AND EVENTDISPLAY='Y'
opennms-> ORDER BY EVENTID, EVENTTIME
opennms-> DESC
opennms-> LIMIT 5
opennms-> OFFSET 0;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=36040.97..36040.98 rows=5 width=5701) (actual time=0.399..0.401 rows=5 loops=1)
-> Sort (cost=36040.97..36064.09 rows=9249 width=5701) (actual time=0.399..0.399 rows=5 loops=1)
Sort Key: events.eventid, events.eventtime DESC
Sort Method: top-N heapsort Memory: 30kB
-> Hash Left Join (cost=177.64..35887.35 rows=9249 width=5701) (actual time=0.046..0.337 rows=88 loops=1)
Hash Cond: (events.serviceid = service.serviceid)
-> Nested Loop Left Join (cost=176.39..35861.82 rows=9249 width=5185) (actual time=0.037..0.295 rows=88 loops=1)
Join Filter: (events.systemid = monitoringsystems.id)
-> Nested Loop Left Join (cost=176.39..35513.95 rows=9249 width=5089) (actual time=0.033..0.262 rows=88 loops=1)
Join Filter: (events.nodeid = node.nodeid)
-> Bitmap Heap Scan on events (cost=176.11..35366.92 rows=9249 width=5071) (actual time=0.025..0.211 rows=88 loops=1)
Recheck Cond: (nodeid = 363)
Filter: ((eventackuser IS NULL) AND (eventdisplay = 'Y'::bpchar))
Heap Blocks: exact=71
-> Bitmap Index Scan on events_nodeid_idx (cost=0.00..173.80 rows=9249 width=0) (actual time=0.015..0.015 rows=88 loops=1)
Index Cond: (nodeid = 363)
-> Materialize (cost=0.28..8.30 rows=1 width=22) (actual time=0.000..0.000 rows=1 loops=88)
-> Index Scan using node_id_type_idx on node (cost=0.28..8.29 rows=1 width=22) (actual time=0.006..0.006 rows=1 loops=1)
Index Cond: (nodeid = 363)
-> Materialize (cost=0.00..1.04 rows=3 width=96) (actual time=0.000..0.000 rows=1 loops=88)
-> Seq Scan on monitoringsystems (cost=0.00..1.03 rows=3 width=96) (actual time=0.002..0.002 rows=1 loops=1)
-> Hash (cost=1.11..1.11 rows=11 width=520) (actual time=0.006..0.006 rows=11 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on service (cost=0.00..1.11 rows=11 width=520) (actual time=0.002..0.003 rows=11 loops=1)
Planning time: 0.367 ms
Execution time: 0.454 ms
(26 rows)
David Hustace January 26, 2019 at 3:22 PM
We have a DB dump from the customer and a short list of nodes that are repeatedly effected after cleaning out all events.
Jesse White January 10, 2019 at 7:14 PM
Can we grab the same details as above for the deployments & nodes where this was observed?
Jesse White January 10, 2019 at 7:08 PM
I tried persisting 2 million+ events for a node:
opennms=> select count(*) from events where nodeid = 1;
count
---------
2380663
(1 row)
And then running the query:
opennms=> EXPLAIN ANALYZE SELECT events.*,
node.nodelabel,
service.servicename,
monitoringsystems.id AS systemId,
monitoringsystems.label AS systemLabel,
monitoringsystems.location AS location
FROM node
RIGHT OUTER JOIN events
ON (events.nodeid = node.nodeid)
LEFT OUTER JOIN monitoringsystems
ON (events.systemid = monitoringsystems.id)
LEFT OUTER JOIN service
ON (service.serviceid = events.serviceid)
WHERE EVENTACKUSER IS NULL
AND EVENTS.NODEID = 1
AND EVENTDISPLAY='Y'
ORDER BY EVENTID
DESC
LIMIT 5
OFFSET 0;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
----------------
Limit (cost=0.86..2.96 rows=5 width=7138) (actual time=0.055..0.076 rows=5 loops=1)
-> Nested Loop Left Join (cost=0.86..1001685.12 rows=2381725 width=7138) (actual time=0.054..0.074 rows=5 loops=1)
-> Nested Loop Left Join (cost=0.71..614606.79 rows=2381725 width=6622) (actual time=0.049..0.068 rows=5 loops=1)
-> Nested Loop Left Join (cost=0.57..221675.17 rows=2381725 width=6526) (actual time=0.038..0.051 rows=5 loops=1)
Join Filter: (events.nodeid = node.nodeid)
-> Index Scan Backward using pk_eventid on events (cost=0.43..185941.13 rows=2381725 width=6010) (actual time=0.013..0.025
rows=5 loops=1)
Filter: ((eventackuser IS NULL) AND (nodeid = 1) AND (eventdisplay = 'Y'::bpchar))
Rows Removed by Filter: 8
-> Materialize (cost=0.14..8.16 rows=1 width=520) (actual time=0.005..0.005 rows=1 loops=5)
-> Index Scan using node_id_type_idx on node (cost=0.14..8.15 rows=1 width=520) (actual time=0.020..0.020 rows=1 loop
s=1)
Index Cond: (nodeid = 1)
-> Index Scan using monitoringsystems_pkey on monitoringsystems (cost=0.15..0.17 rows=1 width=96) (actual time=0.002..0.002 rows=
1 loops=5)
Index Cond: (events.systemid = id)
-> Index Scan using pk_serviceid on service (cost=0.14..0.16 rows=1 width=520) (actual time=0.001..0.001 rows=0 loops=5)
Index Cond: (serviceid = events.serviceid)
Planning time: 0.991 ms
Execution time: 0.172 ms
(17 rows)
Also rendering the page in question:
time curl -u admin:admin "http://localhost:8980/opennms/includes/eventlist.jsp?node=1&throttle=5"
real 0m0.126s
user 0m0.004s
sys 0m0.003s
Everything is nice and quick - similar results with builds of both foundation-2017
and foundation-2018
.
Fixed
Details
Assignee
Jesse WhiteJesse WhiteReporter
David HustaceDavid HustaceDue date
Dec 31, 2018Components
Sprint
NoneFix versions
Affects versions
Priority
Major
Details
Details
Assignee
Jesse White
Jesse WhiteReporter
David Hustace
David HustaceDue date
Dec 31, 2018
Components
Sprint
None
Fix versions
Affects versions
Priority
PagerDuty
PagerDuty Incident
PagerDuty
PagerDuty Incident
PagerDuty

PagerDuty Incident
Created December 13, 2018 at 1:23 PM
Updated January 31, 2019 at 9:04 PM
Resolved January 31, 2019 at 9:04 PM
The following "long running" query was found by a customer that is reporting that a few nodes have a very long load time on the node page. Several customers are reporting this problem.
SELECT events.*, node.nodelabel, service.servicename, monitoringsystems.id AS systemId, monitoringsystems.label AS systemLabel, monitoringsystems.location AS location FROM node RIGHT OUTER JOIN events ON (events.nodeid = node.nodeid) LEFT OUTER JOIN monitoringsystems ON (events.systemid = monitoringsystems.id) LEFT OUTER JOIN service ON (service.serviceid = events.serviceid) WHERE EVENTACKUSER IS NULL AND EVENTS.NODEID = <node id> AND EVENTDISPLAY='Y' ORDER BY EVENTID DESC LIMIT 5 OFFSET 0;