I'm wondering how I can improve my workflow efficiency when dealing with large datasets. I have a couple of workflows with steps that may return hundreds or thousands of results, and they take a long time to complete. I'll share two examples.
This workflow retrieves the CISA KEV list of CVEs from their website. For each CVE, it gets the vulnerability ID from Rapid7 IVM, then for each vulnerability ID found in IVM, gets a list of vulnerable systems. Lastly, it adds a custom tag called CISA KEV (if it doesn't already exist). This workflow has been running for 3 hours and has iterated over 400 items, with 200 more to go. I would love to make this much faster.
This workflow accepts a string input, such as apache-log4j-obsolete-version. It then gets all affected assets in InsightVM and returns them as a list of resource IDs. It then loops over each resource ID to get its hostname from InsightVM. It took about 1 hour 45 minutes to loop over 280 resource IDs.
Is there a better way to build these types of workflows? Should I avoid the built-in loop step? Is there a way to do some parallel processing, or something like that? Also, is there a time limit on how long jobs can run?
Thanks in advance!
You could possibly use the JQ plugin to run a query and extract the data you need from the array, which would speed things up.
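To illustrate the idea, here is a minimal Python sketch of what that kind of extraction does, assuming the KEV feed's JSON shape (a top-level `vulnerabilities` array with a `cveID` field per entry; the sample data here is illustrative):

```python
import json

# Sample shaped like the CISA KEV JSON feed (assumed structure:
# a top-level "vulnerabilities" array with a "cveID" per entry).
kev_feed = json.loads("""
{
  "vulnerabilities": [
    {"cveID": "CVE-2021-44228", "vendorProject": "Apache"},
    {"cveID": "CVE-2018-1234",  "vendorProject": "Example"}
  ]
}
""")

# Equivalent of a jq filter like [.vulnerabilities[].cveID]:
# pull every CVE ID out of the array in one pass, instead of
# handling each item with its own workflow step.
cve_ids = [v["cveID"] for v in kev_feed["vulnerabilities"]]
print(cve_ids)
```

The point is that a single query step hands the rest of the workflow a ready-made list, rather than hundreds of per-item steps.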
Hi, great idea. I will try that and see how much it speeds those two workflows up!
One thing you could try is using the SQL query action in InsightVM to query against your entire dataset, as opposed to doing one-off lookups in a loop. For instance, the query below will find all of the assets vulnerable to a set of CVEs and return the asset ID, hostname, and vulnerability ID in CSV format. Convert the CSV to a JSON object and you are good to go.
SELECT favf.asset_id, favf.vulnerability_id, da.host_name AS hostname, dvf.reference AS "cve id"
FROM fact_asset_vulnerability_finding favf
JOIN dim_vulnerability_reference dvf ON dvf.vulnerability_id = favf.vulnerability_id
JOIN dim_asset da ON da.asset_id = favf.asset_id
WHERE dvf.source = 'CVE'
AND dvf.reference IN ('CVE-2018-1234', 'CVE-2018-5678')
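As a rough sketch of the CSV-to-JSON step, the query result can be parsed into a list of objects in one pass (the sample CSV below is hypothetical, with column names matching the SELECT list above):

```python
import csv
import io

# Hypothetical CSV output from the SQL query above; column names
# match the SELECT list (asset_id, vulnerability_id, hostname, cve id).
csv_text = """asset_id,vulnerability_id,hostname,cve id
101,apache-log4j-core-rce,web01.example.com,CVE-2021-44228
102,apache-log4j-core-rce,web02.example.com,CVE-2021-44228
"""

# One parse turns the whole result set into a list of dicts, so
# downstream steps can read hostnames and CVEs directly instead of
# making a per-asset API call inside a loop.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["hostname"])
```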
If your first workflow is in reference to the Known Exploited Vulnerabilities catalog from CISA, what I have done for that is run a workflow that stores all of the CVEs in a global artifact on the first pass. Subsequent runs use JQ to pull the CVEs from CISA, compare them against the global artifact, and only do lookups on the new values. That saves you from having to loop through all 700 vulnerabilities that are out there right now. You would then delete the old values from the global artifact and replace them with the new list.
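The diff-against-the-artifact step boils down to a set difference. A minimal sketch, assuming the global artifact holds the CVE list from the previous run (the values here are illustrative, not real InsightConnect API calls):

```python
# CVEs stored in the global artifact from the previous run (assumed).
previous_run = {"CVE-2018-1234", "CVE-2018-5678"}

# CVEs from a fresh pull of the CISA KEV feed (assumed).
current_kev = {"CVE-2018-1234", "CVE-2018-5678", "CVE-2021-44228"}

# Only the newly added CVEs need IVM lookups on this run.
new_cves = sorted(current_kev - previous_run)
print(new_cves)

# The artifact would then be replaced with current_kev so the
# next run diffs against today's list.
```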
Thank you, I will test with this method as well!
Double thank you @Eric-Wilson-Rapid7! I will update this post probably by end of weekend on how this all goes and compares!
@Eric-Wilson-Rapid7 and yes you are spot on, it is grabbing that Known Exploited Vulnerabilities from CISA!
@Eric-Wilson-Rapid7 This works nicely. It took literally less than 2 minutes for the workflow to complete, when before it was taking around 2 hours. Thank you, and thanks for the sample SQL query as well.
@Michael-Cochran-Rapid7 This looks to be the route. Will keep you posted!