Status: Resolved
Affects Version/s: 21.1.0, Meridian-2016.1.12, Meridian-2017.1.7
Component/s: Data Output - Newts
Security Level: Default (Default Security Scheme)
Sprint: Horizon - April 18th 2018, Horizon - April 25th 2018, Horizon - May 2nd 2018, Horizon - May 9th 2018
I've been running our metrics:stress tool against Cassandra clusters to understand how well the clusters behave, and I found a problem associated with something called "Newts Index Inserts".
This feature can be controlled through an undocumented setting called org.opennms.newts.disable.indexing. By default, indexing is enabled.
This feature is required to enumerate the available resources and metrics, which in practice means populating tables on Cassandra, primarily newts.resource_metrics.
This indexing process happens every time OpenNMS is started, and it takes a considerable amount of resources, especially from Cassandra, making the cluster temporarily unavailable (as it is extremely busy). The side effect in OpenNMS is that the ring buffer (regardless of the configured size) fills to its maximum and stays there for a while, especially under heavy load (100K samples per second or higher). When the ring buffer is full, OpenNMS discards samples, which shows up as holes in the graphs.
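To illustrate why a full ring buffer turns into gaps in the graphs, here is a minimal sketch of a bounded buffer whose consumer is slower than its producer. It is not the OpenNMS or Newts implementation; the function name `simulate` and the drain rates in the test scenario are assumptions chosen only for illustration.

```python
def simulate(capacity, produce_rate, drain_rates):
    """Simulate a bounded buffer in one-second steps.

    capacity     -- maximum number of samples the buffer can hold
    produce_rate -- samples arriving per second (constant)
    drain_rates  -- samples the consumer can write per second, one entry per step

    Returns (peak_fill, dropped): the highest fill level observed and the
    number of samples discarded because the buffer was already full.
    """
    fill = 0
    peak = 0
    dropped = 0
    for drain in drain_rates:
        fill += produce_rate
        if fill > capacity:
            # The buffer is full: everything beyond capacity is discarded.
            dropped += fill - capacity
            fill = capacity
        peak = max(peak, fill)
        # The consumer writes as much as it can during this second.
        fill -= min(fill, drain)
    return peak, dropped
```

While the consumer keeps up, nothing is dropped. As soon as the drain rate falls below the arrival rate for long enough, the buffer saturates and every additional sample is lost, regardless of how large the buffer was configured.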
In my tests, using 4 `m4.10xlarge` EC2 instances, each running 3 Cassandra instances (as that is the use case I'm studying at the moment), for a total of 12 Cassandra nodes, the ring buffer stays full for 15 minutes.
Once this indexing work is done, the ring buffer goes to 0 and the Cassandra cluster starts working smoothly. It is able to handle 100K samples per second even with 2 physical nodes down (which means 6 of the 12 Cassandra instances down). From this point, if I disable indexing, I can restart OpenNMS without worrying about performance.
I then ran another test: starting over with a fresh cluster and indexing disabled. I can see that the samples table is being updated, there are no issues in OpenNMS or Cassandra, and the ring buffer is barely used (checked through JMX directly). Unfortunately, because the resource_metrics table is not updated, OpenNMS cannot enumerate the resources and I cannot graph any performance metric. This is why the indexing has to be performed at least once.
The only way I found to reduce the indexing time is brute force, that is, a more powerful Cassandra cluster, which I don't think is the best solution. If I build the cluster using m5.12xlarge instances, the indexing finishes quickly, using 50 percent of the ring buffer while it runs (the size of the ring buffer is 2^22 = 4194304).
The idea would be to understand where the heavy load is created, in either Newts or the persistence strategy in OpenNMS, so that we avoid overwhelming the cluster. Indexing should happen only when it is necessary (not every time OpenNMS starts), and the index inserts should be spread out rather than issued all at once, so that cluster performance is not affected. This matters because the actual metrics are being generated while the index inserts are running, which is why the ring buffer grows so quickly.
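The two ideas above (index only what is not already indexed, and spread the index inserts over time) can be sketched as follows. This is a hypothetical illustration, not how Newts or OpenNMS implement indexing: the class `ThrottledIndexer`, its methods, and the `write_index` callback are all invented names, and a real implementation would need to load the set of already-indexed entries from somewhere persistent.

```python
from collections import deque


class ThrottledIndexer:
    """Queue index inserts, skip known entries, and cap writes per tick."""

    def __init__(self, max_inserts_per_tick, already_indexed=()):
        self.max_inserts_per_tick = max_inserts_per_tick
        # (resource, metric) pairs known to be present in the index already.
        self.indexed = set(already_indexed)
        # Pairs waiting to be written, in arrival order.
        self.pending = deque()
        # Mirror of `pending` for O(1) duplicate checks.
        self.queued = set()

    def offer(self, resource, metric):
        """Queue a pair for indexing; return False if no insert is needed."""
        key = (resource, metric)
        if key in self.indexed or key in self.queued:
            return False
        self.pending.append(key)
        self.queued.add(key)
        return True

    def tick(self, write_index):
        """Write at most max_inserts_per_tick pending pairs; return the count."""
        written = 0
        while self.pending and written < self.max_inserts_per_tick:
            key = self.pending.popleft()
            write_index(*key)  # hypothetical callback that performs the insert
            self.queued.discard(key)
            self.indexed.add(key)
            written += 1
        return written
```

Calling `tick` on a fixed schedule bounds the index write rate, so sample writes are not starved while the index catches up. The trade-off is that newly seen resources become visible for graphing with some delay.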
Finally, I found that 2^22 is the largest ring buffer size that doesn't have a bad impact on OpenNMS performance. Larger values such as 2^23 (the size has to be a power of 2) have a bad impact on the CPU usage of OpenNMS and lead to long Full GCs very quickly (even though the ring buffer is designed to avoid memory issues).
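For reference, the arithmetic behind these sizes can be checked with a short sketch. The helper names are assumptions made for illustration, and the fill-time estimate uses the simplifying assumption that nothing is drained from the buffer while it fills.

```python
def is_power_of_two(n):
    """Return True when n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def seconds_to_fill(capacity, samples_per_second):
    """Seconds until an empty buffer is full if no samples are drained."""
    return capacity / samples_per_second


SIZE_22 = 2 ** 22  # 4194304 slots, the largest size that behaved well
SIZE_23 = 2 ** 23  # 8388608 slots, the next valid size up
HALF_OF_22 = SIZE_22 // 2  # 2097152 slots, the 50 percent mark

# At 100K samples per second with Cassandra fully stalled, a 2^22 buffer
# absorbs about 42 seconds of samples before it starts dropping them.
headroom_seconds = seconds_to_fill(SIZE_22, 100_000)
```

Doubling the buffer to 2^23 only doubles that headroom, so a larger buffer on its own would not be expected to cover an indexing phase that lasts for minutes.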