Plugin Development : Dealing with large inputs/outputs?

Hey Folks,
I’m working on a decorator for InsightVM reports. The InsightVM plugin action GetReport returns a base64-encoded string containing the report data, and this report is 100MB+. Is there supported shared storage across containers somewhere? My workflow hangs because loading the same report as a base64 string, a UTF-8 string, and a StringIO file object is very inefficient. I’d like to download the report CSV and insert it into a PSQL database.

(I’m open to suggestions on a better way :slight_smile: )

My other thought was to tinker with the Dockerfile directly and mount a tmp directory for the workflow. The last step of the workflow would be to empty the tmp directory.
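
For what it’s worth, here’s roughly the shape of what I’m aiming for if I can get a writable temp path: decode the report to disk once, then bulk-load the CSV with COPY instead of row-by-row inserts. Just a sketch on my end; psycopg2 and the connection string are assumptions:

```python
import base64
import tempfile

import psycopg2  # assumption: available in the custom plugin's container


def load_report_into_postgres(report_b64: str, dsn: str, table: str) -> None:
    """Decode the base64 report to a temp file, then bulk-load the CSV with COPY."""
    # Spool the decoded CSV to disk instead of juggling base64 string ->
    # utf-8 string -> StringIO copies in memory.
    with tempfile.NamedTemporaryFile(mode="wb", suffix=".csv", delete=False) as tmp:
        tmp.write(base64.b64decode(report_b64))
        csv_path = tmp.name

    # COPY is far cheaper than row-by-row INSERTs for a 100MB+ CSV.
    # The target table name is interpolated directly, so it must be trusted.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur, open(csv_path, "r") as f:
        cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv, HEADER true)", f)
```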

There is a plugin called “storage” which could probably be used for this.
But it’s a legacy plugin, and it’s probably a better idea to store the data in a DB.
For that you could use the SQL plugin.

I would not use storage for this. It’s not designed to take that kind of data. It was a solution for getting inputs and outputs in loops before we did a lot of the loop improvements. It might work, but it’s risky. If the plugin restarts for any reason (orchestrator reboot for example) you lose all the data since it’s stored in cache.

Conceptually, thinking through this, could you pull the report, immediately save it somewhere with FTP/SCP, then use other workflows to process it? Maybe try to split the processing up into several other workflows? You can use the REST plugin to fire off the other workflows once the file is saved.
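
If you go the save-then-hand-off route, firing the follow-up workflows is just an HTTP POST to each workflow’s API trigger. A rough sketch only; the trigger URL, header name, and key below are all placeholders you’d replace with the real values from the trigger’s configuration:

```python
import requests  # assumption: available wherever the hand-off step runs

# Placeholders -- copy the real values from the API trigger's details in ICON.
TRIGGER_URL = "https://<api-trigger-url-for-the-processing-workflow>"
API_KEY = "<key-configured-on-that-trigger>"


def kick_off_processing(saved_path: str) -> None:
    """Tell a downstream workflow which saved file to process."""
    resp = requests.post(
        TRIGGER_URL,
        # Header name is a placeholder too -- use whatever auth the trigger expects.
        headers={"X-API-Key": API_KEY},
        json={"file_path": saved_path},  # whatever inputs the downstream trigger defines
        timeout=30,
    )
    resp.raise_for_status()
```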

That’s just my first instinct when dealing with big data: how can I break it down into manageable chunks?

1 Like

I think the problem of “I have 100MB of data that I need to read in order to batch” still exists though, right?

I’m writing my own plugin, so it’s easy for me to include boto3, download the file from S3 to /tmp, then read chunks of the file… but for folks who can’t, this will end up being a blocker. Is there a documented limit of “we don’t recommend data over this size”?
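
For anyone following along, the custom-plugin workaround I’m describing looks roughly like this (the bucket and key names are made up, and boto3 has to be added to the plugin’s requirements):

```python
import boto3  # has to be added to the custom plugin's requirements


def stream_report_from_s3(bucket: str, key: str, chunk_size: int = 1024 * 1024):
    """Download the report to /tmp, then yield it back in 1MB chunks."""
    local_path = f"/tmp/{key.split('/')[-1]}"
    boto3.client("s3").download_file(bucket, key, local_path)

    with open(local_path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


# Hypothetical usage: process the CSV in pieces instead of loading 100MB+ at once.
# for chunk in stream_report_from_s3("my-report-bucket", "insightvm/report.csv"):
#     handle(chunk)
```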

(Also, I’m 100% on board with acknowledging the fact that I probably shouldn’t treat ICON like an abstraction of AWS ECS… but the more features y’all release, the closer it gets :stuck_out_tongue: )

Edit: One thing to note: the InsightVM plugin provides the report as base64 data… the InsightConnect orchestrator running the base64 decode plugin WILL choke on 100MB :slight_smile:
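
For anyone else who hits that: base64 can be decoded incrementally as long as each slice is a multiple of 4 characters, so you can decode straight to disk without ever holding the whole decoded report in memory. A rough sketch, assuming the encoded string has no embedded newlines:

```python
import base64


def b64_decode_to_file(encoded: str, out_path: str, chunk_chars: int = 4 * 1024 * 1024) -> None:
    """Decode a large base64 string to a file in slices.

    chunk_chars must be a multiple of 4 so every slice is a valid base64 unit,
    and this assumes the encoded string contains no embedded newlines.
    """
    with open(out_path, "wb") as out:
        for i in range(0, len(encoded), chunk_chars):
            out.write(base64.b64decode(encoded[i:i + chunk_chars]))
```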

ICON like an abstraction of AWS ECS

That’s fine, I think we’re easier to use. :slight_smile:

To your other question, no, there’s not a straight answer to “How much data is too much data?” I’ve successfully run 1GB files through, I think, but it takes forever, and I think I increased the cooling cost for our data center by a fraction of a percent.

But, no, I wouldn’t recommend doing that at all. It’s just not stable, and prone to all sorts of weird problems.

The reason I can’t give you a straight answer is that it’s a combination of both the size of the data and what you’re doing with it. I can imagine for loops on 100MB of data that would kill a WF. However, I can think of other operations on bigger files that would be fine. It’s also a function of how many resources you’re throwing at your orchestrator. If it’s on a sixteen-core box with 128GB of RAM, you can do some awesome things with it. If you’re running the bare minimum or under, it’ll still do neat things, just very slowly, and it may run out of resources.

That’s a long-winded way to say no, we don’t have hard limits on data processing.

So with all that said, is there any way you could “pre-process” the file before it gets into ICON? Could you break it up into smaller queries or anything and then process those in batches?
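
For example, if the CSV lands somewhere you can touch before ICON sees it, a small pre-processing script could split it into fixed-size batches and each batch could be handed to a workflow on its own. Just a sketch; the 10,000-row batch size is pulled out of thin air:

```python
import csv


def split_csv(src_path: str, rows_per_batch: int = 10_000) -> list:
    """Split a big CSV into smaller CSVs, each keeping the original header row."""

    def write_batch(batch_num, rows):
        out_path = f"{src_path}.part{batch_num}.csv"
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows)
        return out_path

    batch_paths, batch, batch_num = [], [], 0
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        for row in reader:
            batch.append(row)
            if len(batch) >= rows_per_batch:
                batch_paths.append(write_batch(batch_num, batch))
                batch, batch_num = [], batch_num + 1
        if batch:
            batch_paths.append(write_batch(batch_num, batch))
    return batch_paths
```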


Ohhh… now that’s an idea :slight_smile: The report is too big, so create multiple reports that ARE of a size we can handle…

@matt_domko_deprecated Out of curiosity, what type of data are you trying to get out of InsightVM and push to Postgres? I assume, given the size, it’s likely asset/vuln/finding data?

I ask because I wanted to make sure you knew about the Data Warehouse functionality of InsightVM, which is an ideal way to get highly read-optimized data into Postgres, and because we are actively working on a plugin for the InsightVM cloud API which will be able to pull some of the data you find in the console. If there is overlap, that might be a good future option.

1 Like

On the nose: it’s vuln data. Because InsightVM doesn’t expose the instance-id, we’re doing a bunch of “fun” work trying to come up with a decent reporting method. I’ve got plans to dig into the DW, just haven’t had time yet (but it might be better than fighting with this report).

Yeah, while reports are VERY flexible, if you want this much data I think investing the time into DW is your best bet (as long as you have a dedicated Postgres db/schema). If Postgres is available, setting up the DW connection in the console is extremely fast, the export will be faster, the export won’t hit your console as hard as running a report, AND you’ll get tons of data. You can then either use the latest data when running queries or use the snapshots (dated tables) to compare over time. A bit of a better fit and less management in my eyes.

Plus, the schema is documented here with all the data you’ll end up getting. Keep in mind this doesn’t have EVERY piece of scan information (just point-in-time snapshots), so the amount of data is significantly less than what the console needs.

https://help.rapid7.com/nexpose/en-us/warehouse/warehouse-schema.html

If you already have Postgres up and an account, it’s a quick setup and you can see if the data fits your needs. Hope this helps!
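
And once the warehouse is populated, the ICON side really is just a SQL step. A sketch of the kind of query you’d run; the table and column names here are taken from the warehouse schema docs linked above, so double-check them against your warehouse version:

```python
import psycopg2  # or just drop the SQL below into the ICON SQL plugin's query input

# Table/column names come from the warehouse schema docs linked above --
# verify them against your warehouse version before relying on this.
VULN_QUERY = """
SELECT da.host_name,
       da.ip_address,
       dv.title,
       dv.severity
FROM   fact_asset_vulnerability_finding favf
JOIN   dim_asset da         ON da.asset_id = favf.asset_id
JOIN   dim_vulnerability dv ON dv.vulnerability_id = favf.vulnerability_id
LIMIT  100;
"""


def fetch_findings(dsn: str):
    """Run the warehouse query; the DSN is a placeholder for your warehouse DB."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(VULN_QUERY)
        return cur.fetchall()
```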

2 Likes

DW changed my life :slight_smile: Thanks! Now I just have ICON run a SQL query instead!

2 Likes

Glad that worked out! Data Warehouse is VERY powerful if you need to get to lots of VM data quickly. Lots of cool stuff you can do with it!