Group by fields in CSV

christopher_hickernell · October 28, 2021, 5:38pm

Given a CSV where Field1, Field2, and Field3 contain the same value across multiple rows, is there a way to group that data so that it is not repeated in the output of the other fields?

Input
Field1,Field2,Field3,Field4,Field5
john,doe,phone,home,555-1212
john,doe,phone,work,555-1313
john,doe,phone,mobile,555-5555

Output
John Doe - Phone
Home: 555-1212
Work: 555-1313
Mobile: 555-5555

mike_rinehart · October 29, 2021, 2:55pm

Hi @christopher_hickernell - I cannot think of any built-in steps or plugins that can accomplish that for you. However, I do think our Python plugin would be up to the task if you are familiar with Python.

elijah_martin-merrill · November 1, 2021, 8:04pm

You can do this with the jq plugin.

here’s a simplified example:
input json

{"json":[
  {"field1": "john", "field2": "home", "field3": "5551210"},
  {"field1": "john", "field2": "work", "field3": "5551211"},
  {"field1": "john", "field2": "mobile", "field3": "5551212"},
  {"field1": "jane", "field2": "home", "field3": "5551213"},
  {"field1": "jane", "field2": "mobile", "field3": "5551214"},
  {"field1": "jane", "field2": "work", "field3": "5551215"}
]}

jq:
{"numbers":[.json|group_by(.field1)|.[]|map({(.field2): .field3, name: .field1})|add]}

results:

{
  "numbers": [
    {
      "home": "5551213",
      "name": "jane",
      "mobile": "5551214",
      "work": "5551215"
    },
    {
      "home": "5551210",
      "name": "john",
      "work": "5551211",
      "mobile": "5551212"
    }
  ]
}

In ICON, you’d feed the output (the entire top-level object, not the array) form the csv plugin’s to json action into the jq step. What you’ll get is a string which you then run through the typeconverter string to object action to have usable JSON in your workflow.

JQ is part templating language, part filtering language, and part query language. The above jq defines the output structure {"numbers":[ .......... ]} and then fills-in the … bit with an array using the query.
The query starts with changing context from the top level to the .json array, then it groups the entries into separate arrays using field1. With our sample input, it creates two arrays. We then iterate over these arrays (.[]) and do some mapping. The add at the end tells it to add the mapped fields to the object instead of creating new objects.

christopher_hickernell · November 3, 2021, 2:36pm

This is great! The keys (home, mobile, work) are not always going to be known. So, they cannot be referenced directly in the output. Displaying the output would be easier if the phone numbers were in a nested object. For Example:

{
  "name": "jane",
  "numbers": {
      "home": "5551213",
      "mobile": "5551214",
      "work": "5551215"
    }
}
{
  "name": "john",
  "numbers":  {
       "home": "5551210", 
       "work": "5551211",
       "mobile": "5551212"
      }
}

Then we can iterate over the nested object to produce the output, regardless of what the keys are.

Is that possible?

elijah_martin-merrill · November 3, 2021, 7:12pm

[.json|group_by(.field1)|.[]|map({(.field2): .field3, name: .field1})|add]|{"people":[.[]|{"name":.name, "numbers":[(del(.name))]}]}

gives

{
  "people": [
    {
      "name": "jane",
      "numbers": [
        {
          "home": "5551213",
          "mobile": "5551214",
          "work": "5551215"
        }
      ]
    },
    {
      "name": "john",
      "numbers": [
        {
          "home": "5551210",
          "work": "5551211",
          "mobile": "5551212"
        }
      ]
    }
  ]
}

There’s probably another way to do it that doesn’t double-iterate, but this works

christopher_hickernell · November 4, 2021, 12:25pm

This did the trick! Thank you!