Deleting Duplicate Assets in Nexpose

Does anyone have a decent way of filtering for and deleting duplicate assets by IP? Right now, I’m exporting my full list of assets, opening it in Excel, and using conditional formatting to find the duplicate IP entries. That’s an OK way of doing it, but I haven’t found a good way to take that list and mass delete only one of the two (or three) duplicates within Nexpose. I could enter those IPs into the query builder, but how would I go about making that list deletable?

Thanks for any help.

@alan_sarz A couple of quick questions so I can point you in the right direction. First, how are you identifying which of the duplicates for an IP should be deleted? Second, are you trying to do this with an InsightConnect workflow?

We do have the ability to delete assets by ID with the following API endpoint: https://help.rapid7.com/insightvm/en-us/api/index.html#operation/deleteAsset. Making sure we have the list of IDs programmatically (or from your Excel file if you do manual filtering) would be a good place to start.
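
If it helps, here is a minimal sketch of calling that endpoint directly with basic auth. The console URL, credentials, and asset ID are placeholders:

import requests

# Minimal sketch: delete a single asset by ID through the v3 API.
# The console URL, credentials, and asset ID are placeholders.
console = "https://nexpose.example.com:3780"
asset_id = 12345

resp = requests.delete(
    f"{console}/api/3/assets/{asset_id}",
    auth=("username", "password"),
    verify=False,  # only if your console uses a self-signed certificate
)
resp.raise_for_status()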

I am searching for an IP, and then deleting all entries except for the one with the most current scan date.

I am not using InsightConnect. Is that something only for InsightVM? We are using on-premises Nexpose, not any cloud-based solution.

I would make a dynamic asset group filtered on assets that have not been scanned recently (last scan older than seven days, or whatever threshold fits), then simply review and delete. However, if the rule is that it MUST be a duplicate as well… that’ll be harder. For that type of thing we use the InsightVM API and run a script overnight. Specifically, we do this for short-lived instances that run the InsightVM agent so they don’t keep cluttering our asset counts, but you can really use it for any type of filtering you can do in code.

I use the API to do the same thing for our systems by hostname. I pull all of the systems that have a hostname, find the duplicates, and then delete the duplicate that is missing a MAC address, has the wrong OS, etc. There may be one or two that I have to do manually, but it’s easier than doing 20+ individually through Excel and the GUI. I use PowerShell for the code, but it will work with any language. Hope that helps.
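
For illustration, here is a rough Python sketch of just the grouping/selection step. The structure of the assets list and the field names (id, hostName, mac) are assumptions about whatever you pull from the API or your export, so adjust them to fit:

from collections import defaultdict

# Rough sketch of the grouping logic only, not a full tool. `assets` is assumed
# to be a list of dicts such as {"id": 1, "hostName": "web01.corp.local", "mac": None}
# already pulled from the console or an export; the field names are assumptions.
def pick_deletions(assets):
    by_host = defaultdict(list)
    for asset in assets:
        host = (asset.get("hostName") or "").lower().split(".", 1)[0]
        if host:
            by_host[host].append(asset)

    to_delete = []
    for host, group in by_host.items():
        if len(group) < 2:
            continue
        # Prefer to delete the duplicates that are missing a MAC address;
        # anything left over still gets a manual look, as noted above.
        missing_mac = [a for a in group if not a.get("mac")]
        to_delete.extend(a["id"] for a in missing_mac[: len(group) - 1])
    return to_delete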

I ran across an article a while ago on a good way to do this in Nexpose.
I think this might be it https://blog.rapid7.com/2017/07/11/cleaning-house-maintaining-an-accurate-and-relevant-vulnerability-management-program/
The big takeaway is making sure you have asset linking turned on. It also talks about removing stale assets and “ghost” assets (asset risk score is 0, OS is empty, and asset name is empty) by creating asset groups to tag them. This won’t get all the duplicates, but it will get quite a few, making the rest of the cleanup easier, because I think you’ll find a lot of the duplicates fall under the ghost category.

Yes, thank you. I did read that before, and I’ve added ghost asset criteria to a deletion asset group. Without moving everything to InsightVM (which I’m not allowed to do), I’m afraid it’ll be a largely manual process for me.

Here’s the easiest way I’ve found to do this:

*Export the entire asset group to CSV
*Open Excel and highlight, filter, and sort duplicates on the IP column
*Create a user-made tag and import it with a file (you might have to copy and paste the Excel list into Notepad and upload the txt file)
*Create an asset group whose only criterion is assets with the tag you created earlier

This way, you have a list of duplicates and you can select and delete whichever ones you want.
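
If you want to skip the manual Excel step, a small script can pull the duplicate IPs straight out of the exported CSV. This is only a sketch; the file name and the “IP Address” column header are assumptions about your export:

import csv
from collections import Counter

# A quick alternative to the Excel step: pull duplicate IPs out of the exported CSV.
# The file name and the "IP Address" column header are assumptions about the export.
with open("asset_export.csv", newline="") as f:
    rows = list(csv.DictReader(f))

counts = Counter(row["IP Address"] for row in rows if row.get("IP Address"))
dupes = sorted(ip for ip, count in counts.items() if count > 1)

# One IP per line, ready to paste into the tag import file.
with open("duplicate_ips.txt", "w") as out:
    out.write("\n".join(dupes))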

Can you use the API with on-premises Nexpose? I don’t have licenses for InsightVM.

Yes, documentation is at https://NexposeServer.domain.com:3780/api/3/html
It supports basic auth. I’ve done tag management, ad hoc scans, and software inventory with it, and it works well.
The InsightVM plugin in InsightConnect works against this API. They recently added the ability to limit a scan to one host in a site, and as soon as my test box gets moved to its new building, I’ll be submitting a PR to add options for defining the other scan parameters.
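
For anyone who hasn’t tried the API yet, here is a minimal sketch of hitting the console directly with basic auth to kick off a scan limited to a single host in a site. The console URL, credentials, site ID, and host are placeholders, and the request body fields should be double-checked against your console’s /api/3/html documentation:

import requests

# Minimal sketch: start an ad hoc scan of one host in a site via the v3 API.
# Console URL, credentials, site ID, and host are placeholders.
console = "https://NexposeServer.domain.com:3780"
site_id = 42
body = {"name": "Ad hoc rescan of one host", "hosts": ["10.0.0.5"]}

resp = requests.post(
    f"{console}/api/3/sites/{site_id}/scans",
    json=body,
    auth=("username", "password"),
    verify=False,  # only if your console uses a self-signed certificate
)
resp.raise_for_status()
print(resp.json())  # the response references the new scan's ID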

The InsightVM v3 API is compatible with InsightVM and Nexpose. Public documentation for it can be found here: https://help.rapid7.com/insightvm/en-us/api/index.html. And as Brandon mentioned, it is also available from the Nexpose host itself.

With this process, it would be possible to automate most of the steps with the API, apart from the filtering by IP. If you add any “human” decisions about which asset you keep versus get rid of for a given IP, then that part would still need manual review; however, if you simply keep the one with the latest scan (or use some other criterion), that decision could also be handled with the API.
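
As a rough sketch of that idea (keep only the most recently seen asset for a given IP and delete the rest), something like the following could work. The console URL, credentials, and IP are placeholders, and the filter field and history shape should be verified against the API documentation:

import requests

# Rough sketch: for one IP, keep the asset with the newest scan/history date and delete the rest.
console = "https://nexpose.example.com:3780"
auth = ("username", "password")
ip = "10.0.0.5"

search = {
    "match": "all",
    "filters": [{"field": "ip-address", "operator": "is", "value": ip}],
}
resources = requests.post(
    f"{console}/api/3/assets/search",
    json=search,
    auth=auth,
    verify=False,
).json().get("resources", [])

def last_seen(asset):
    # Treat the newest history entry's date as the "most current scan date";
    # ISO 8601 dates sort correctly as strings.
    dates = [entry["date"] for entry in asset.get("history", [])]
    return max(dates) if dates else ""

resources.sort(key=last_seen, reverse=True)
for stale in resources[1:]:  # keep the newest, delete the rest
    requests.delete(f"{console}/api/3/assets/{stale['id']}", auth=auth, verify=False)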

I have had the same issue with duplicate assets. Here are some steps I’ve taken to clean up.

I use Reports to create a SQL Query Export report with the query below. The query accounts for duplicate records where the same asset is listed multiple times with or without the FQDN; it strips ‘.corp.contoso.com’ from hostnames containing it.

SELECT
    da.host_name,
    lower(da.host_name),
    replace(lower(da.host_name), '.corp.contoso.com', '') AS HostName,
    da.asset_id AS "Asset ID",
    da.ip_address AS "IP Address",
    da.mac_address AS "MAC Address",
    dau.unique_id AS "R7 Agent ID",
    da.sites AS "Sites",
    dos.description AS "Operating System",
    fad.first_discovered AS "First Discovered",
    fa.scan_finished AS "Last Scan Date",
    da.last_assessed_for_vulnerabilities AS "Last Assessed",
    fa.critical_vulnerabilities AS "Critical Vulnerabilities",
    fa.severe_vulnerabilities AS "Severe Vulnerabilities",
    fa.moderate_vulnerabilities AS "Moderate Vulnerabilities",
    fa.vulnerabilities AS "Total Vulnerabilities",
    fa.malware_kits AS "Malware Kits",
    fa.exploits AS "Exploits",
    to_char(round(fa.riskscore::numeric, 0), '999G999G999') AS "Risk Score"
FROM dim_asset da
JOIN fact_asset fa USING (asset_id)
JOIN fact_asset_discovery fad USING (asset_id)
JOIN dim_asset_unique_id dau USING (asset_id)
JOIN dim_operating_system dos USING (operating_system_id)
WHERE dau.source = 'R7 Agent'
  AND replace(lower(da.host_name), '.corp.contoso.com', '') IN (
      SELECT replace(lower(da.host_name), '.corp.contoso.com', '')
      FROM dim_asset da
      GROUP BY replace(lower(da.host_name), '.corp.contoso.com', '')
      HAVING count(*) > 1
  )
ORDER BY replace(lower(da.host_name), '.corp.contoso.com', '') ASC, da.asset_id ASC

After running the report, I download the CSV and examine the list. The report is sorted by hostname and has the oldest asset ID listed first. I use the asset IDs in the report to run through the REST API and delete.

Add the list of asset IDs to a text file called ‘dups.txt’ and execute the Python script below with your username, password, and host settings.

from __future__ import print_function
import time
import rapid7vmconsole
from rapid7vmconsole.rest import ApiException
from pprint import pprint

import base64
import logging
import sys

config = rapid7vmconsole.Configuration(name='Rapid7')
config.username = 'nxadmin'
config.password = '**********'
config.host = 'https://insightvm.contoso.com'
config.verify_ssl = False
config.assert_hostname = False
config.proxy = None
config.ssl_ca_cert = None
config.connection_pool_maxsize = None
config.cert_file = None
config.key_file = None
config.safe_chars_for_path_param = ''

# Logging
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
ch = logging.StreamHandler(sys.stdout)
ch.setLevel(logging.INFO)
logger.addHandler(ch)
config.debug = False


auth = "%s:%s" % (config.username, config.password)
auth = base64.b64encode(auth.encode('ascii')).decode()
client = rapid7vmconsole.ApiClient(configuration=config)
client.default_headers['Authorization'] = "Basic %s" % auth

# create an instance of the API class
api_instance = rapid7vmconsole.AssetApi(client)

with open("dups.txt","r") as a_file:
    for asset_id in a_file:
        try:
            # Asset
            api_response = api_instance.delete_asset(asset_id)
            pprint(api_response)
        except ApiException as e:
            print("Exception when calling AssetApi->delete_asset: %s\n" % e)

I hope this is helpful to those who need it.

I have this issue currently, and I think many customers will too, due to a recent update in the console that changes asset correlation. Here is my game plan using the IVM API and Python (a rough sketch follows the list):

  • IVM Call to get all asset names
  • Perform an operation to standardize the names (lower() and remove anything after “.”)
  • Find duplicates from those
  • Delete the one with the lower asset ID (these usually occur when we update agents or refresh an asset; the lower asset ID is the older asset because IDs are sequential)
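
Here is a rough sketch of that plan. The console URL and credentials are placeholders, pagination and error handling are minimal, and it deletes assets, so test it read-only first (for example by printing the IDs instead of deleting them):

import requests
from collections import defaultdict

# Rough sketch of the plan above, not a finished tool.
console = "https://nexpose.example.com:3780"
auth = ("username", "password")

# Pull every asset, 500 at a time.
assets, page = [], 0
while True:
    res = requests.get(
        f"{console}/api/3/assets",
        params={"page": page, "size": 500},
        auth=auth,
        verify=False,
    ).json()
    assets.extend(res.get("resources", []))
    page += 1
    if page >= res["page"]["totalPages"]:
        break

# Standardize names: lower-case and drop everything after the first ".".
by_name = defaultdict(list)
for asset in assets:
    name = (asset.get("hostName") or "").lower().split(".", 1)[0]
    if name:
        by_name[name].append(asset["id"])

# For each duplicate group, keep the highest (newest) asset ID and delete the rest.
for name, ids in by_name.items():
    for old_id in sorted(ids)[:-1]:
        requests.delete(f"{console}/api/3/assets/{old_id}", auth=auth, verify=False)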

I wouldn’t do anything drastic yet in relation to update 6.6.229; there is clearly an issue with duplicate assets since this update.

R7 support:
“We are aware of an issue with asset count increase and potential duplication of assets in your environment introduced with the 6.6.229 product release of InsightVM and Nexpose. Our teams are investigating solutions to address asset duplication.”

I did the same thing about a year ago.
Here are the steps I went through:

  • Call IVM to gather all assets
  • I added a function to ignore specific assets if needed.
  • Strip away any FQDN additions to get just the host name of devices
  • Check all host assets for duplicate entries
  • Confirm whether the device has duplicate entries
  • Check the last scan date of the asset based on either agent import or scan_date
  • Delete all entries that are not from the most recent scan

The code is kind of long and a bit hacky, but it’s a one-click solution and will handle an arbitrary number of duplicate entries per asset.

import datetime
import operator
import requests
import json
import logging
import time


#************************************************************************************************
#Global Variables
#************************************************************************************************

hostname_list = []
Sname_list = []
seen = set()
base_api_url = "https://<URL>/api/3/";
date = datetime.datetime.now().strftime('%Y-%m-%d %H-%M-%S')
logging.basicConfig(filename=str(date) + " All Assets.log", level=logging.INFO)
username = ''
password = ''
base_api_search_url = f"https://<url>/api/3/assets/";
headers = {'Content-type': 'application/json', 'Accept': 'application/json'}
size = 500
sorted = "DESC"
subtext = "runner"
ip_hostname = "ip-"
delimiter = "."
base_api_search_url = f"https://<url>/api/3/assets/search";
base_api_delete_url = f"https://<url>/api/3/assets/";
page = 0


#************************************************************************************************************************************************************************************************
#Fetch all assets from nexpose console. - This will take a few hours because of the way that the data is returned, it will only return 500 devices at a time over 30+ pages
#************************************************************************************************************************************************************************************************

def get_devices():
	page_count = 0

	print("Getting the page count")
	request_response = requests.get(
		base_api_search_url + f"?page={page_count}&size={size}&sort={sorted}",
		headers=headers,
		auth=(username, password),
		verify=False
	).content.decode()

	res = json.loads(request_response)

	total_pages = res['page']['totalPages']
	print("total page count is: " + str(total_pages) + "\n")
	logging.info("total page count is: " + str(total_pages) + "\n")
	logging.info("\n fetching the list of all assets in the nexpose console. \n This may take up to 2 hours or more. \n ")
	print("\n fetching the list of all assets in the nexpose console. \n This may take up to 2 hours or more. \n ")
	while page_count <= total_pages:
		logging.info("Getting Page " + str(page_count))
		print("Getting Page " + str(page_count))
		request_response = requests.get(
			base_api_search_url + f"?page={page_count}&size={size}&sort={sorted}",
			headers=headers,
			auth=(username, password),
			verify=False
		).content.decode()

		res = json.loads(request_response)
		
		
	#****************************************************************************************************
	# this ignores specific assets if needed
	#****************************************************************************************************
	
	
		for i in res['resources']:
			try:
				if subtext in i["hostName"]:
					logging.info("Skipping Runner Machine: " + str(i["hostName"]))
				elif ip_hostname in i["hostName"]:
					logging.info("Skipping IP for hostname machine: " + str(i["hostName"]))
				else:
					hostname = i['hostName']
					hostname_list.append(hostname)
					logging.info(hostname + ": this is a valid device")
			except Exception as e:
				ip_address = i["ip"]
				logging.info("The ip " + str(ip_address) + " did not have a hostname associated to it")
				print("The ip " + str(ip_address) + " did not have a hostname associated to it")
		page_count = page_count + 1

	#************************************************************************************************************************************************************************************************************************************************************************************************
	# This strips everything from the host name starting at the first instance of "." this makes it easier to compare the hostnames for duplicates as some will have a FQDN added to them.
	#************************************************************************************************************************************************************************************************************************************************************************************************
	
	for i in hostname_list:
		name_prep = i
		stripped_name = name_prep.split(delimiter,1)[0]
		Sname_list.append(stripped_name)

	#************************************************************************************************************************************************************************************************
	# This checks the stripped down list of host names for duplicates and adds those duplicate devices to another list to be passed on to the duplicate device function.
	#************************************************************************************************************************************************************************************************

	print("checking for dupes, single line method")
	logging.info(("checking for dupes, single line method"))
	assetname_data = [x for x in Sname_list if x in seen or seen.add(x)]
	print(assetname_data)
	logging.info(assetname_data)
	logging.info("there are " + str(len(assetname_data)) + " devices that have duplicate entries.")
	print("there are " + str(len(assetname_data)) + " devices that have duplicate entries.")
	return assetname_data

def find_duplicates(assetname_data):

	for i in assetname_data:
		body = {
			"match": "all",
			"filters": [
				{
					"field": "host-name",
					"operator": "is",
					"value": i
				}
			]
		}
		request_response = requests.post(
			base_api_search_url + f"?page={page}&size={size}",
			data=json.dumps(body),
			headers=headers,
			auth=(username, password),
			verify=False
		).content.decode()

		res = json.loads(request_response)

		if len(res['resources']) == 1:
			logging.info("this device has no duplicate " + i)

		else:
			print("this device has duplicates: " + i)
			Assets = 0
			TotalCount = len(res['resources'])
			AssetList = []
			for h in res['resources']:
				AssetID = h['id']
				AssetList.append(AssetID)

			Asset_list = []

			while Assets < TotalCount:
				print("getting scan history for: " + i)
				logging.info("getting scan history for: " + i)
				for j in reversed(res['resources'][Assets]['history']):
					if j['type'] == "SCAN" or j['type'] == "AGENT-IMPORT":
						actual_scan_date1 = j['date']
						asset1_data = {"ID": AssetList[Assets], "Scan_Date": actual_scan_date1}
						Asset_list.append(asset1_data)
						Assets = Assets + 1
						break

			Asset_list.sort(key=operator.itemgetter('Scan_Date'), reverse=True)

			for k in Asset_list[1:]:
				logging.info("Deleting: " + str(k))
				print("Deleting: " + str(k))
				try:
					request_response = requests.delete(
						base_api_delete_url + str(k['ID']),
						data=json.dumps(body),
						headers=headers,
						auth=(username, password),
						verify=False
					).content.decode()
				except Exception as e:
					logging.info("Deletion Error on Device: " + str(i) + " \n" + e)


if __name__ == "__main__":
	#print("Waiting 1 hour to allow rapid 7 to update the console." "\n")
	#time.sleep(3600)
	assetname_data = get_devices()
	find_duplicates(assetname_data)

We’ve opened a case with support, as we are also badly affected by the changes in correlation introduced with release 6.6.229. No solution yet, and several hundred duplicates every day…

I found an error in the code and can’t edit my post anymore.

In the global variables section there are two variables named base_api_search_url. The second instance should be renamed to something of your choosing and then mirrored in the find_duplicates function.

hostname_list = []
Sname_list = []
seen = set()
base_api_url = "https://<URL>/api/3/";
date = datetime.datetime.now().strftime('%Y-%m-%d %H-%M-%S')
logging.basicConfig(filename=str(date) + " All Assets.log", level=logging.INFO)
username = ''
password = ''
base_api_search_url = f"https://<url>/api/3/assets/";
headers = {'Content-type': 'application/json', 'Accept': 'application/json'}
size = 500
sorted = "DESC"
subtext = "runner"
ip_hostname = "ip-"
delimiter = "."
#**************************HERE************************
<New_Name> = f"https://<url>/api/3/assets/search";
#******************************************************
base_api_delete_url = f"https://<url>/api/3/assets/";
page = 0
def find_duplicates(assetname_data):

	for i in assetname_data:
		body = {
			"match": "all",
			"filters": [
				{
					"field": "host-name",
					"operator": "is",
					"value": i
				}
			]
		}
		request_response = requests.post(
#*********************************HERE***********************
			<New_Name> + f"?page={page}&size={size}",
#*************************************************************
			data=json.dumps(body),
			headers=headers,
			auth=(username, password),
			verify=False
		).content.decode()

Did you get anywhere with support? We also see duplicates on a daily basis; after initially clearing out the ones that were down to the 6.6.229 release, it appears to still be happening.

Unfortunately, no solution yet. Duplicates appear every day that we have to clean out manually. I have escalated this to our CSA.

In our case, the IP address, MAC, hostname, and Windows UUID are persistent; only the agent ID isn’t. The change of the agent ID is caused by a nightly re-deployment (Citrix PVS and, in part, an older image-based procedure). It seems that the agent ID was given a much higher weighting with release 6.6.229.

I suggested tuning this in /opt/rapid7/nexpose/nsc/conf/correlation/correlation-default.properties for our environment, but support stated that this would cause more issues. Since InsightVM is currently unusable for us, I can’t imagine what could get worse by tuning the properties…