InsightIDR Log Search: export only selected columns to CSV

On a fairly regular basis, we work with Process Start Events data extracted from InsightIDR as a CSV. It would be very useful to extract only those columns which are relevant. Is there any way to export only selected columns from an InsightIDR Log Search?

In the InsightIDR Log Search UI, I can use the Manage Columns dropdown to see only the columns I’m interested in for a particular search:

[screenshot: the Manage Columns dropdown in Log Search]

However, when I export Process Start Events log search results to CSV, the export always includes the full set of 500+ columns, even if only a handful of columns are relevant to the context I’m working in.

The full CSV is large and difficult to work with. There are ~400 columns describing environment variables, and the full original message ingested by InsightIDR is also included as JSON within the CSV. It’s great that this information is available if we need it, but it would be nice to choose whether or not to include it in any particular export.

As a bonus, a file with selected columns would be much smaller to store/transfer, which would be good for both Rapid7 and customers.

3 Likes

I did wonder if the API would be one way of achieving this, but while the API is a powerful extra tool for launching log searches, it seems to be a pipeline to the same export functionality as the UI search. I don’t see anything in the Query Individual Logs Using LEQL API documentation which would allow me to select specific columns for an export job.
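For reference, a query via that API looks roughly like this (a sketch based on my reading of those docs, so treat the details as assumptions; the region, log ID, API key, LEQL query, and time range are all placeholders). Whole log entries come back, so any column selection would still have to happen client-side:

import requests

API_KEY = "..."  # placeholder: an Insight platform API key
REGION = "us"    # placeholder: your Insight platform region
LOG_ID = "..."   # placeholder: UUID of the Process Start Events log

# Run a LEQL query against a single log; from/to are epoch milliseconds.
resp = requests.get(
    f"https://{REGION}.api.insight.rapid7.com/log_search/query/logs/{LOG_ID}",
    headers={"x-api-key": API_KEY},
    params={"query": 'where(process.name = "powershell.exe")',  # example LEQL
            "from": 1700000000000, "to": 1700086400000},
)
resp.raise_for_status()
print(resp.json())  # full events; long searches return a continuation link to poll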

I would be interested in this as well.

1 Like

I’ve written a rough-and-ready Python script this afternoon which uses a pandas DataFrame to prune columns from a CSV file. It’s getting me where I need to be, but it’s a workaround, not a solution - I still have to download a massive CSV file from InsightIDR.

Here’s the core of the script:

import re
import pandas as pd

# inputfile and outputfile come from the script's argument handling (not shown)
columnsInclude = ["process.name", "r7_context.asset.rrn", "hostname",
                  "process.exe_file.product_name", "process.exe_file.hashes.md5",
                  "process.exe_file.hashes.sha1", "process.exe_file.hashes.sha256"]
prunedCols = []
pdi = pd.read_csv(inputfile, encoding="UTF-8", dtype=object)
for column in list(pdi.columns):  # iterate over a copy; we drop columns as we go
    cleanCol = re.sub(r'^json\.', '', column)  # treat the json. prefix as if it weren't there
    if cleanCol in columnsInclude:
        print(f"Including column {cleanCol}")
    else:
        prunedCols.append(column)
        pdi.drop(column, axis=1, inplace=True)
print(f"Pruned {len(prunedCols)} columns")
pdi.to_csv(outputfile, encoding="utf-8", index=False)  # index=False: don't write the row index

Note that:

  • I pretend that the json. prefix on [nearly] every column doesn’t exist.
  • columnsInclude is a list of columns I am interested in, e.g. process.name,r7_context.asset.rrn,hostname,process.exe_file.product_name,process.exe_file.hashes.md5,process.exe_file.hashes.sha1,process.exe_file.hashes.sha256
  • I don’t make any attempt to deal with datatypes.
  • The script processes a Process Start Events extract with ~250k lines in about 7 minutes in WSL on my Dell Latitude 7420, so it’s not super-quick, but it’s fast enough. (There’s a potentially quicker variant sketched just below this list.)
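Since the column list is known up front, a variant which should be quicker and much lighter on memory (a sketch; I haven’t timed it against a full-size extract, and inputfile/outputfile come from argument handling as before) is to let pandas skip the unwanted columns at read time with a usecols callable:

import re
import pandas as pd

columnsInclude = ["process.name", "r7_context.asset.rrn", "hostname",
                  "process.exe_file.product_name", "process.exe_file.hashes.md5",
                  "process.exe_file.hashes.sha1", "process.exe_file.hashes.sha256"]

# usecols is called once per header name; columns it rejects are never
# parsed, so the ~400 environment-variable columns never touch memory.
keep = lambda col: re.sub(r'^json\.', '', col) in columnsInclude

pdi = pd.read_csv(inputfile, encoding="UTF-8", dtype=object, usecols=keep)
pdi.to_csv(outputfile, encoding="utf-8", index=False)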
3 Likes

Wow, nice. Thank you for sharing :smiley:

1 Like

I would also like the ability to export only the selected columns. This would be a good feature request.

1 Like

Question for you @dreadpir8robots : Where are you creating that includeColumns list and also where are you referencing it in the for loop? Is columnsInclude supposed to be includeColumns? Thanks in advance!

1 Like

Hi @dreadpir8robots

Sean from Rapid7 here. We are listening! Appreciate the feedback as well as the detailed workaround above. It is an idea we are discussing right now, so great timing.

It would be great to understand the task this formatted output helps with, and any subsequent use of the csv file so we can tackle this optimally.

For example, how do you (or your team) use the CSV file once it’s downloaded (and pruned to show these curated columns)? Would you mind providing more details from that perspective?

thanks
Sean

1 Like

Oops @some_rando, it should be columnsInclude throughout; I accidentally transposed the two words when I was writing my bulletpoint list with the notes, sorry about that.

I’d share the entire script, but the rest of it is just argument handling and some --help info, and there’s stuff specific to my employer in there.

I’ve updated the snippet above to hopefully make it easier to use.

1 Like

Thanks @sean_obrien. That sounds promising and I hope that this is something which would be useful to other people/customers.

You asked:

It would be great to understand the task this formatted output helps with, and any subsequent use of the csv file so we can tackle this optimally.

Fair question, and I know I said this in my original post:

On a fairly regular basis, we work with Process Start Events data extracted from InsightIDR as a CSV. It would be very useful to extract only those columns which are relevant.

I should have explained more clearly, but when I say we work with extracts from InsightIDR, it’s not one specific task which we’re repeating over and over. I can give some examples though; these are usually kicked off either by a request from management for arbitrary information at scale, or by our need to understand exposure in a big hurry.

The first example I’m going to give is Log4j. Since this was such a severe and easily exploitable vulnerability, we needed to figure out which endpoints were vulnerable ASAP, so that we could give senior management the info they’d need to make emergency decisions: do we switch things off? It might be better to have some services offline and safe than online and compromised. Especially since Log4j dropped at the end of the week, we couldn’t wait for InsightVM detections to come through. By looking at Process Start Event data, we could at least say “here’s a list of endpoints which are almost certainly vulnerable”.

Thinking about it, these requests often (but not always) boil down to something like: “which exact version of {software} has been run on {subset of endpoints} during {time period}?” We can pull Process Start Events, and use the process.exe_file.hashes.* fields to determine exactly which version of software is running, even if InsightIDR or other solutions don’t always record the exact version.
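As a concrete sketch of answering that kind of question against a pruned extract (everything below is hypothetical: the hashes, version strings, and file name are placeholders, and in practice the hash-to-version mapping would come from vendor-published release hashes or an internal inventory):

import pandas as pd

# Hypothetical SHA-256 -> exact version mapping (placeholder values).
knownVersions = {
    "aaaa1111...": "ExampleApp 2.14.1",
    "bbbb2222...": "ExampleApp 2.17.0",
}

events = pd.read_csv("pruned_extract.csv", dtype=object)
events["version"] = events["process.exe_file.hashes.sha256"].map(knownVersions)

# Which exact versions ran on which endpoints during the extract window?
summary = (events.dropna(subset=["version"])
           .groupby(["hostname", "version"])
           .size()
           .reset_index(name="executions"))
print(summary.to_string(index=False))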

What I describe above might not be the originally envisaged use of the export functionality. Having said that, I think it’s a valid thing to want to do, and the platform is capable (with a bit of nudging).

I know adding features takes a lot of work/testing/documentation/communication, so I’m keeping my expectations realistic here. I do see advantages to allowing extracts with only custom columns though, and not just for customers: if I can cut down the size of a generated export file from several GB to ~50MB by excluding all those columns I don’t need to see, that might reduce S3 data transfer/storage costs on the R7 side…?

One more thought: if this functionality does get added, it would be a bonus if the user could save lists of desired columns for reuse in the future, since otherwise it’d be necessary to select columns each time.

3 Likes

Hi @dreadpir8robots, nice - thank you!!!

1 Like

If you do end up using the script above, especially on WSL, keep an eye on memory usage. With enough free memory, even big CSVs will take just a few minutes, but if the OS decides to start swapping heavily to/from disk the same operation could take hours. Ask me how I know!
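If that happens to you, one workaround (a sketch of my own, not part of the script above; file names are placeholders) is to process the CSV in chunks so the whole file never has to sit in RAM at once:

import re
import pandas as pd

columnsInclude = ["process.name", "hostname", "process.exe_file.hashes.sha256"]
keep = lambda col: re.sub(r'^json\.', '', col) in columnsInclude

# chunksize makes read_csv yield smaller DataFrames one at a time; each
# pruned chunk is appended to the output file and then freed.
with pd.read_csv("huge_extract.csv", encoding="UTF-8", dtype=object,
                 usecols=keep, chunksize=50_000) as reader:
    for i, chunk in enumerate(reader):
        chunk.to_csv("pruned.csv", mode="w" if i == 0 else "a",
                     header=(i == 0), index=False)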

1 Like

@dreadpir8robots I just wanted to say I really like your username, it’s pretty inconceivable that I didn’t notice it until now.

2 Likes

Thank you @SDavis :slight_smile:

As a 90s kid and a Canadian, I love your avatar!

1 Like

haha much appreciated, painstaking 5 minutes in paint

1 Like

Hi everyone. The Rapid7 team should definitely include this feature. It would be a great and useful option.

1 Like