Any SNMP error-status > 5 treated as unrecognized, aborts AggregateTracker

Description

The AggregateTracker code is unaware of SNMP error-status values greater than 5 (the original set). Higher values are treated as unknown and are invariably fatal to the tracker's operation. RFC 3416 defines values 0-18. We need to put in the work to understand which of these values are actually show-stoppers, which can be ignored, and which may require some kind of adaptive behavior, as in the case of TOO_BIG_ERR.

public boolean processErrors(int errorStatus, int errorIndex) {
    if (errorStatus == TOO_BIG_ERR) {
        int maxVarsPerPdu = m_pduBuilder.getMaxVarsPerPdu();
        if (maxVarsPerPdu <= 1) {
            throw new IllegalArgumentException("Unable to handle tooBigError when maxVarsPerPdu = "+maxVarsPerPdu);
        }
        m_pduBuilder.setMaxVarsPerPdu(maxVarsPerPdu/2);
        reportTooBigErr("Reducing maxVarsPerPdu for this request to "+m_pduBuilder.getMaxVarsPerPdu());
        return true;
    } else if (errorStatus == GEN_ERR) {
        return processChildError(errorStatus, errorIndex);
    } else if (errorStatus == NO_SUCH_NAME_ERR) {
        return processChildError(errorStatus, errorIndex);
    } else if (errorStatus != NO_ERR){
        throw new IllegalArgumentException("Unrecognized errorStatus "+errorStatus);
    } else {
        // Continue on.. no need to retry
        return false;
    }
}
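For reference, the full set of error-status values that RFC 3416 defines could be captured in one place, so that codes above 5 are at least recognized by name. This is only a sketch: the enum `SnmpErrorStatus` and its `fromCode` helper are hypothetical and are not part of the existing tracker code. The constant names are the RFC's own names, converted to Java's upper-case style.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative enum (hypothetical name) of the RFC 3416 error-status values 0-18. */
enum SnmpErrorStatus {
    NO_ERROR(0), TOO_BIG(1), NO_SUCH_NAME(2), BAD_VALUE(3), READ_ONLY(4), GEN_ERR(5),
    NO_ACCESS(6), WRONG_TYPE(7), WRONG_LENGTH(8), WRONG_ENCODING(9), WRONG_VALUE(10),
    NO_CREATION(11), INCONSISTENT_VALUE(12), RESOURCE_UNAVAILABLE(13), COMMIT_FAILED(14),
    UNDO_FAILED(15), AUTHORIZATION_ERROR(16), NOT_WRITABLE(17), INCONSISTENT_NAME(18);

    // Lookup table from numeric code to constant, built once.
    private static final Map<Integer, SnmpErrorStatus> BY_CODE = new HashMap<>();
    static {
        for (SnmpErrorStatus s : values()) {
            BY_CODE.put(s.code, s);
        }
    }

    private final int code;

    SnmpErrorStatus(int code) {
        this.code = code;
    }

    int getCode() {
        return code;
    }

    /** Returns the matching status, or null when the code is outside 0-18. */
    static SnmpErrorStatus fromCode(int code) {
        return BY_CODE.get(code);
    }
}
```

With this in place, the errorStatus of 13 from the report below would resolve to `RESOURCE_UNAVAILABLE` instead of falling into the "Unrecognized errorStatus" branch.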

Environment

Reported by Chaz Hopkins in https://mynms.opennms.com/Ticket/Display.html?id=4627 while scanning a Cisco Nexus router:

OID .1.3.6.1.4.1.9.12.3.1.3.1038
Cisco NX-OS(tm) n5000, Software (n5000-uk9), Version 7.0(6)N1(1), RELEASE SOFTWARE Copyright (c) 2002-2012 by Cisco Systems, Inc. Device Manager Version 6.0(2)N1(1), Compiled 4/7/2015 4:00:00

Received an errorStatus of 13.

Acceptance / Success Criteria

None

Attachments

1
  • 05 Dec 2016, 01:18 PM

Activity

Benjamin Reed December 21, 2016 at 11:47 AM

This got merged yesterday.

Benjamin Reed December 7, 2016 at 1:18 PM

Jeff Gehlbach December 6, 2016 at 3:35 PM

Your reading of RFC 3416 tracks with my own.

noAccess(6) indicates a MIB view problem, which may go away for a future collection but is unlikely to be transient within the lifetime of a single tracker.

The RFC doesn't really say much about authorizationError(16) so I'm not sure what to do there. Could we make the behavior in case of this (and maybe other?) statuses user-configurable? That would enable us to adapt to a broader range of lame agents.
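One way the user-configurable behavior suggested here might look is a per-status override table parsed from a configuration string. Everything in this sketch is an assumption made for illustration: the class `ErrorStatusOverrides`, the `Action` values, and the `status=action` list format do not come from the ticket or from any existing OpenNMS setting.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical per-status overrides, e.g. parsed from "16=fatal,6=skip". */
final class ErrorStatusOverrides {
    enum Action { RETRY, SKIP, FATAL }

    private final Map<Integer, Action> overrides = new HashMap<>();

    /** Parses a comma-separated list of status=action pairs; empty input means no overrides. */
    static ErrorStatusOverrides parse(String spec) {
        ErrorStatusOverrides result = new ErrorStatusOverrides();
        if (spec == null || spec.trim().isEmpty()) {
            return result;
        }
        for (String entry : spec.split(",")) {
            String[] parts = entry.trim().split("=", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed override: " + entry);
            }
            int status = Integer.parseInt(parts[0].trim());
            Action action = Action.valueOf(parts[1].trim().toUpperCase(Locale.ROOT));
            result.overrides.put(status, action);
        }
        return result;
    }

    /** Returns the configured action for a status, or the built-in default when none is set. */
    Action actionFor(int errorStatus, Action defaultAction) {
        return overrides.getOrDefault(errorStatus, defaultAction);
    }
}
```

A site with an agent that misreports authorizationError(16) could then set something like `16=skip` without a code change, while every other status keeps its built-in default.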

Benjamin Reed December 6, 2016 at 3:07 PM
Edited

Changes I'm making:

  • Removed the LOG.* calls I put in the AggregateTracker which could stomp on thread-tracking.

  • Changed badValue(3) and readOnly(4) to continue to be non-fatal (in case we can continue to retrieve other values), but to return false when determining retries.

  • If I am reading RFC 3416 correctly, wrongType(7), wrongLength(8), wrongEncoding(9), wrongValue(10), noCreation(11), inconsistentValue(12), resourceUnavailable(13), commitFailed(14), undoFailed(15), notWritable(17), and inconsistentName(18) are all related only to things that would happen from a SET* call, so they, too, should be non-fatal but non-retried.

Also, currently noAccess(6) and authorizationError(16) are set to fatal: false, retry: true. Should we throw an exception if auth fails? Or retry in the hope it's a transient error? Or just do non-fatal, non-retried, like the others?
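The classification described in this comment can be summarized as a small lookup. This is a sketch of the stated intent, not the merged code: the class `ErrorStatusPolicy` and the `Disposition` names are hypothetical, and treating codes outside 0-18 as fatal is an assumption carried over from the "Unrecognized errorStatus" branch in the original snippet.

```java
/** Hypothetical summary of the fatal/retry handling discussed in this comment. */
final class ErrorStatusPolicy {
    enum Disposition { NO_ERROR, RETRY, SKIP, DELEGATE_TO_CHILD, FATAL }

    static Disposition classify(int errorStatus) {
        switch (errorStatus) {
            case 0:  // noError
                return Disposition.NO_ERROR;
            case 1:  // tooBig: retried after halving maxVarsPerPdu
            case 6:  // noAccess: currently fatal: false, retry: true
            case 16: // authorizationError: currently fatal: false, retry: true
                return Disposition.RETRY;
            case 2:  // noSuchName: handed to processChildError
            case 5:  // genErr: handed to processChildError
                return Disposition.DELEGATE_TO_CHILD;
            case 3:  // badValue
            case 4:  // readOnly
            case 7:  // wrongType
            case 8:  // wrongLength
            case 9:  // wrongEncoding
            case 10: // wrongValue
            case 11: // noCreation
            case 12: // inconsistentValue
            case 13: // resourceUnavailable
            case 14: // commitFailed
            case 15: // undoFailed
            case 17: // notWritable
            case 18: // inconsistentName
                return Disposition.SKIP; // non-fatal, not retried
            default: // assumption: anything outside RFC 3416's 0-18 stays fatal
                return Disposition.FATAL;
        }
    }
}
```

Under this reading, the errorStatus of 13 from the original report maps to `SKIP`, so the tracker would carry on rather than abort.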

Benjamin Reed December 5, 2016 at 2:53 PM

Yeah, a bunch of them were related to setting. I erred on the side of caution but we can certainly make all the set-based ones just reject cleanly...

Fixed

Created August 11, 2016 at 12:03 PM
Updated January 11, 2017 at 9:24 AM
Resolved December 21, 2016 at 11:47 AM