Uploaded image for project: 'OpenNMS'
  1. OpenNMS
  2. NMS-5328

JNA ICMP Pinger loses packet due to incorrect tId

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Resolved (View Workflow)
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.10.1
    • Fix Version/s: 19.0.0
    • Security Level: Default (Default Security Scheme)
    • Labels:
    • Environment:
      Debian Squeeze 64bit
      64GB ram (48GB used as ram disk for RRDs)
      JAVA Heap 5GB
      Logs on there own mirrored spindles
      DB on separate Machine.

      Description

      Firstly a bit of background, We've recently started to test 1.10.1 (520 nodes, 1300+ Interfaces). We currently use 1.8.16 on our live system without any issues (2000+ nodes, 9000+ Interfaces).

      When using the JNA ICMP Pinger we've noticed that we were seeing quite a lot of packet loss on devices that were fine on the live system, after a quick tcpdump we could see all the packets were being sent and recieved well within the timeout.

      This seems to be caused by the tId on the reply packet being read incorrectly (but only on a few packets). Below are some logs/packet captures which will hopefully help:

      Firstly the poller sends the ping request:

      2012-04-25 15:01:34,600 DEBUG [PollerScheduler-30 Pool-fiber15] JnaPingRequest: 1335362494600: Sending Ping Request: [ID=JnaPingRequestId[addr = /10.195.249.218, ident = 16965, seqNum = 1, tId = 14190047],Retries=2,Timeout=3000,Expiration=-1,Callback=org.opennms.netmgt.icmp.LogPrefixPreservingPingResponseCallback@1b8cc6cb]

      and here is the TCPDump capture of ICMP Echo Request being sent with the correct cookie, id, sequence number and tId (00 00 00 00 00 d8 85 df)

      108629 509.949764 10.100.3.147 10.195.249.218 ICMP 98 Echo (ping) request id=0x4245, seq=1/256, ttl=64

      0000 00 00 0c 07 ac 03 e4 11 5b 0e 6f 66 08 00 45 00 ........[.of..E.
      0010 00 54 00 00 40 00 40 01 28 15 0a 64 03 93 0a c3 .T..@.@.(..d....
      0020 f9 da 08 00 6c 19 42 45 00 01 4f 70 65 6e 4e 4d ....l.BE..OpenNM
      0030 53 21 00 01 83 71 da 5a 9c 4b 00 00 00 00 00 d8 S!...q.Z.K......
      0040 85 df 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 .......... !"#$%
      0050 26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35 &'()*+,-./012345
      0060 36 37 67

      We then receive the reply from the device, again note the same cookie, id, sequence number and tId (00 00 00 00 00 d8 85 df):

      108633 509.952207 10.195.249.218 10.100.3.147 ICMP 98 Echo (ping) reply id=0x4245, seq=1/256, ttl=60

      0000 e4 11 5b 0e 6f 66 00 23 34 e1 20 00 08 00 45 00 ..[.of.#4. ...E.
      0010 00 54 84 df 40 00 3c 01 a7 35 0a c3 f9 da 0a 64 .T..@.<..5.....d
      0020 03 93 00 00 74 19 42 45 00 01 4f 70 65 6e 4e 4d ....t.BE..OpenNM
      0030 53 21 00 01 83 71 da 5a 9c 4b 00 00 00 00 00 d8 S!...q.Z.K......
      0040 85 df 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 .......... !"#$%
      0050 26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35 &'()*+,-./012345
      0060 36 37 67

      OpenNMS then receives the packet above, but decodes it incorrectly (notice tId is now incremented by one to 14190048):

      2012-04-25 15:01:34,635 DEBUG [JNA-ICMP-16965-Reply-Processor] RequestTracker: Found a reply to process: org.opennms.netmgt.icmp.jna.JnaPingReply@3c2dc273[Address = /10.195.249.218, JnaPingRequestId = JnaPingRequestId[addr = /10.195.249.218, ident = 16965, seqNum = 1, tId = 14190048]]

      OpenNMS then tries to find the original request for this reply in the Request Map but can't find it as the tId no longer matches, (although the request before does have the tId that was decoded)...

      2012-04-25 15:01:34,635 DEBUG [JNA-ICMP-16965-Reply-Processor] RequestTracker: Looking for request with Id: JnaPingRequestId[addr = /10.195.249.218, ident = 16965, seqNum = 1, tId = 14190048] in map

      { <snip a lot of unrelated pending requests> JnaPingRequestId[addr = /10.195.251.6, ident = 16965, seqNum = 1, tId = 14190048]=[ID= JnaPingRequestId[addr = /10.195.251.6, ident = 16965, seqNum = 1, tId = 14190048],Retries=2,Timeout=3000,Expiration=1335362497601,Callback=org.opennms.netmgt.icmp.LogPrefixPreservingPingResponseCallback@3fd303a3], JnaPingRequestId[addr = /10.195.249.218, ident = 16965, seqNum = 1, tId = 14190047]=[ID= JnaPingRequestId[addr = /10.195.249.218, ident = 16965, seqNum = 1, tId = 14190047],Retries=2,Timeout=3000,Expiration=1335362497600,Callback=org.opennms.netmgt.icmp.LogPrefixPreservingPingResponseCallback@1b8cc6cb], <snip a lot more unrelated pending requests> }

      2012-04-25 15:01:34,635 DEBUG [JNA-ICMP-16965-Reply-Processor] RequestTracker: No request found for reply org.opennms.netmgt.icmp.jna.JnaPingReply@3c2dc273[Address = /10.195.249.218, JnaPingRequestId = JnaPingRequestId[addr = /10.195.249.218, ident = 16965, seqNum = 1, tId = 14190048]]

      I've looked through the code and can't see where or how the tId could be getting incremented when being read back from the packet. If we switch the poller over the the JNI instead of JNA then the issue no longer occurs.

      Cheers

      Mark

        Attachments

          Activity

            People

            • Assignee:
              ranger Benjamin Reed
              Reporter:
              mjaz M. Jazmik
            • Votes:
              2 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: