Performance Troubleshooting Checklist
  • 05 Mar 2024

As you troubleshoot performance problems, use the following checklists to help identify and resolve issues. Keep in mind that every environment and problem is unique, and these troubleshooting checklists represent a baseline to help guide you.

⚠ If you are unable to resolve any problem areas or have questions, please contact the Itential Product Support Team.

Common Performance Issues

Some performance issues you may encounter include, but are not limited to:

  • IAP UI loads slowly.
  • Jobs are running slower than usual.
  • Apps and their functions are unresponsive or working slowly.
  • Pronghorn consumes a lot of resources.

Troubleshooting Steps

Use the checklist below when you encounter any performance problems. These steps are in no particular order.

  1. For IAP in HA, review all the corrective measures outlined below on each applicable IAP server instance. Check the journalctl, application, Mongo, and Redis logs to eliminate any known errors that could be triggering an issue.
  2. Inspect the available applications and adapters in the IAP profile and ensure all the required apps and adapters are "up" and loaded.
  3. Validate system resources such as memory and CPU on each application, adapter, Redis, and MongoDB instance. Ensure that CPU, memory, and disk space are adequately available for the IAP instance.
  4. Check if any recent changes to apps, adapters, network, or infrastructure could have triggered the issue. Validate the connectivity and network latency between IAP and its dependencies, such as MongoDB and Redis.
  5. Determine the number of jobs that were triggered over the entire day and compare it with previous days to see whether it is normal, especially when jobs are running slower than usual.
  6. Collect the job documents of the slow-running jobs from Mongo and export a copy for Itential to investigate the statistics on each job. See the MongoDB Metrics section for more detail.
  7. If the issue is noticed on custom apps, call the custom app APIs outside of IAP by using Postman or curl, and see how they respond.
  8. Locate any COLLSCAN entries in the Mongo logs, which indicate that indexes are missing, and add indexes on the affected collections or queries, as required.
  9. Inspect any network latency issues using mtr. See the My Traceroute (MTR) Tool section for more detail.
  10. Inspect the data payload within the tasks to determine if the processed data is too big.
  11. Check the size of the jobs and tasks collections and archive these collections, if necessary.
  12. Enable the debug logs for investigation when required.
  13. Check the size of the Redis DB and clear the keys if they have not been refreshed in a long time.
  14. Check a tcpdump between two endpoints to identify any network-related issues. Refer to Wireshark for capturing a tcpdump.
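
As a rough illustration of the COLLSCAN check, the sketch below walks a query plan returned by explain() and reports whether any stage is a full collection scan. The helper name hasCollScan is hypothetical, and the code assumes the classic plan shape (a stage field with optional inputStage or inputStages children); plan output can differ between MongoDB versions and sharded deployments, so treat this as a starting point rather than a definitive check.

```javascript
// Hypothetical helper: returns true if any stage in the plan tree is a COLLSCAN.
// Assumes the classic explain() shape: { stage, inputStage?, inputStages? }.
function hasCollScan(stage) {
  if (!stage || typeof stage !== "object") return false;
  if (stage.stage === "COLLSCAN") return true;
  const children = [];
  if (stage.inputStage) children.push(stage.inputStage);
  if (Array.isArray(stage.inputStages)) children.push(...stage.inputStages);
  return children.some((child) => hasCollScan(child));
}

// In the mongo shell this could be used roughly as follows (the query and the
// index are illustrative assumptions, not a recommendation from Itential):
//   var plan = db.jobs.find({"status": "running"}).explain("queryPlanner");
//   hasCollScan(plan.queryPlanner.winningPlan);  // true suggests a missing index
//   db.jobs.createIndex({"status": 1});

// Stand-in plan objects so the helper can be tried without a database.
const unindexedPlan = { stage: "COLLSCAN", filter: { status: { $eq: "running" } } };
const indexedPlan = { stage: "FETCH", inputStage: { stage: "IXSCAN", indexName: "status_1" } };
console.log(hasCollScan(unindexedPlan), hasCollScan(indexedPlan));
```

Adding an index has a write and storage cost, so confirm that the query is actually frequent or slow before creating one.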

My Traceroute (MTR) Tool

To inspect network latency and packet loss, you can use mtr.

mtr is a network diagnostic tool that combines ping and traceroute into one program.
Its output reports two things for each hop: packet loss and latency.

To install mtr on a Linux host that uses yum, run the following command:

$ sudo yum -y install mtr

Sample mtr command to check packet loss and latency to a destination host (--report prints a summary, -T probes over TCP, and -c 20 sends 20 probes):

[iap202115 ~]$ mtr --report -T -c 20 192.168.33.25
Start: Thu Sep 29 18:43:03 2022
HOST: iap202115                   Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.33.25              0.0%    20  2072. 1880. 235.2 2072. 446.7
[iap202115 ~]$

In this report, packet loss is 0%. The average time to reach the destination is 1880 ms, the best is 235.2 ms, and the worst is 2072 ms.
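
If you collect many reports, the hop rows can be parsed programmatically. The following is a minimal sketch that assumes the default report columns shown above (Loss%, Snt, Last, Avg, Best, Wrst, StDev). The function names and the thresholds are hypothetical and should be tuned for your own network.

```javascript
// Parse one hop row from `mtr --report` output into numeric fields.
// Assumes the default column order: Loss% Snt Last Avg Best Wrst StDev.
function parseMtrHop(line) {
  const match = line.trim().match(/^(\d+)\.\|--\s+(\S+)\s+(.*)$/);
  if (!match) return null; // not a hop row (for example the Start: or HOST: line)
  const [loss, snt, last, avg, best, wrst, stdev] = match[3]
    .split(/\s+/)
    .map((value) => parseFloat(value)); // parseFloat accepts "0.0%" and "2072."
  return { hop: Number(match[1]), host: match[2], loss, snt, last, avg, best, wrst, stdev };
}

// Hypothetical thresholds: flag a hop whose average latency or loss is too high.
function isHopSlow(hop, maxAvgMs = 100, maxLossPct = 1) {
  return hop.avg > maxAvgMs || hop.loss > maxLossPct;
}

const sample = "  1.|-- 192.168.33.25              0.0%    20  2072. 1880. 235.2 2072. 446.7";
const hop = parseMtrHop(sample);
console.log(hop.host, hop.avg, hop.loss, isHopSlow(hop));
```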

To watch the results continuously over a period of time and check each hop, run mtr interactively:

$ mtr www.google.com



Figure 1: MTR Tool

Recovery Steps

For instances where the cause of a performance issue is unknown, the following steps can be utilized when trying to implement a recovery:

  1. Pause the Task Worker on all IAP instances to ensure the in-flight tasks are completed, and then restart all instances of IAP.

  2. If the issue persists, stop IAP and restart all the dependencies (i.e., MongoDB and Redis).

  3. Restart Redis by clearing keys.

⚠ Clearing (deleting) Redis keys and restarting Redis requires a full outage. Step 3 should be done during a maintenance window.


Clear Redis Keys

Stop Redis: systemctl stop redis
Cleanup Files: Remove the dump.rdb and appendonly.aof files
Redis Flush All Keys: redis-cli flushall



Start Redis

Start Redis: systemctl start redis

Recommendations

Use the steps listed below to avoid performance issues.

  1. Make sure sufficient resources (RAM, CPU, and disk space) are available on the instances of IAP servers.
  2. Increase the number of IAP instances when the number of jobs has significantly increased.
  3. Add more logging to the custom apps to identify the locations where they consume the most time.
  4. Use DB queries to identify potential problems if there are any jobs that are running in a loop.
  5. Instead of the forEach task, the childJob loop feature can be used as an alternative.
  6. The parallel loop type of the childJob loop feature is powerful; however, it is highly resource intensive. Please evaluate what the childJob is doing as compared to the hardware on which IAP is deployed.

Performance Database Commands

Use the following queries to help identify where your system might be under-performing.

Job Counts

To fetch a count of the jobs that have run 500 or more times in the reporting window (on IAP 2021.2 and above, you can also use Job Metrics in Ops Manager):

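// Reporting window, in epoch milliseconds (the unit used by metrics.start_time).
var now = Date.now();
var numberOfDaysAgo = 1;              // how many days back the window starts
var reportDuration = 1;               // length of the window, in days
var epochDay = 24 * 60 * 60 * 1000;   // milliseconds in one day
var reportStartDate = now - numberOfDaysAgo * epochDay - 1;
var reportEndDate = reportStartDate + reportDuration * epochDay - 1;

db.jobs.aggregate([
    // Keep only the jobs that started inside the reporting window.
    {"$match": {
        //"status": {$in: ["running", "error"]},
        "metrics.start_time": {$gte: reportStartDate, $lte: reportEndDate}
    }},
    // Count the jobs per job name.
    {"$group": {
        //_id: {name: "$name"}
        "_id": "$name",
        "count": {$sum: 1}
    }},
    // Keep the names that ran 500 or more times.
    {"$match": {
        "count": {$gte: 500}
    }},
    // List the most frequent names first.
    {"$sort": {
        "count": -1
    }}
]);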


Jobs & Tasks Collection Count

To fetch the jobs and tasks collection counts for a full day and for each hour:

db.getCollection('jobs').find({
    $and: [
        {"metrics.start_time": {$gte: LOWERBOUNDS}},
        {"metrics.start_time": {$lte: UPPERBOUNDS}}
    ]
}).count()

⚠ Replace LOWERBOUNDS and UPPERBOUNDS in the command with the epoch times (in milliseconds) of the two timestamps that bound the period you want to count.
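
One way to produce these values is sketched below. It computes the epoch-millisecond bounds for a single day and for each hour within it. The function name dayBounds is hypothetical, and the sketch assumes the day is defined in UTC; adjust the offset if your reporting follows local time.

```javascript
const HOUR_MS = 60 * 60 * 1000; // milliseconds in one hour

// Return epoch-millisecond bounds for one UTC day plus its 24 hourly windows.
// isoDate is a calendar date such as "2024-03-05" (interpreted as UTC here).
function dayBounds(isoDate) {
  const lower = Date.parse(isoDate + "T00:00:00Z");
  if (Number.isNaN(lower)) throw new Error("Invalid date: " + isoDate);
  const upper = lower + 24 * HOUR_MS - 1; // last millisecond of the day
  const hours = [];
  for (let h = 0; h < 24; h++) {
    hours.push({ lower: lower + h * HOUR_MS, upper: lower + (h + 1) * HOUR_MS - 1 });
  }
  return { lower, upper, hours };
}

const bounds = dayBounds("2024-03-05");
console.log("LOWERBOUNDS:", bounds.lower, "UPPERBOUNDS:", bounds.upper);
console.log("First hour:", bounds.hours[0].lower, bounds.hours[0].upper);
```

Use the full-day pair for the daily count, then repeat the query with each hourly pair to see how the load is spread across the day.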


MongoDB Metrics

Get a copy of the MongoDB data so that the statistics on particular job performance can be investigated. Locate a few different workflows that took longer than usual to run and copy the JobID of each. Then use the following mongoexport command to export the metrics data for Itential, replacing "JOB-ID-HERE" with the specific JobID. You may have to modify the command for your environment, for example if MongoDB requires a password.

mongoexport --db=pronghorn --collection=tasks --fields="_id,metrics,name" --query='{"job._id": "JOB-ID-HERE"}' --out=JOB-ID-HERE.json


Job Velocity

Compare the number of completed jobs and running jobs to understand the velocity of the jobs.

db.getCollection('jobs').find({"status": "complete"}).count()

db.getCollection('jobs').find({"status": "running"}).count()
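
A single count is only a snapshot. To turn the counts into a rate, one option is to sample the "complete" count twice and divide the difference by the elapsed time. The sketch below assumes that approach; the function name jobsPerMinute and the sample numbers are hypothetical.

```javascript
// Jobs completed per minute between two samples of the "complete" count.
function jobsPerMinute(countBefore, countAfter, elapsedMs) {
  if (elapsedMs <= 0) throw new Error("elapsedMs must be positive");
  return ((countAfter - countBefore) * 60000) / elapsedMs;
}

// In the mongo shell the two samples could be taken roughly like this:
//   var c1 = db.getCollection('jobs').find({"status": "complete"}).count(); var t1 = Date.now();
//   // ...wait a few minutes...
//   var c2 = db.getCollection('jobs').find({"status": "complete"}).count(); var t2 = Date.now();
//   jobsPerMinute(c1, c2, t2 - t1);

// Illustrative numbers only: 120 more completed jobs over 10 minutes.
console.log(jobsPerMinute(1000, 1120, 10 * 60 * 1000));
```

A falling rate while the "running" count keeps growing suggests that jobs are starting faster than they finish.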


Best Practices for Building Workflows & Prebuilts

View the following Itential Academy courses to learn best practices for building workflows in IAP and using Pre-builts to build automations:

