How to use JSON flattener on arrays without creating duplicate rows

I imported data from a GET API call. There are arrays in the JSON response. Using the JSON Flattener is a double-edged sword: while it flattens the array, it also creates duplicate rows.

Is there a way to flatten JSON arrays and avoid duplicate rows?

Hey John! I certainly can help you with this. For context, it's important that everyone has the same starting point for this question.

Incoming wall of text!

When you pull in data via an API Import, most likely you are pulling in a list of things. Let’s say a list of products.

That initial pull of data is almost guaranteed to be held in an array. Arrays in JSON (the type of notation used by most APIs that one would use in Parabola) are wrapped in [square brackets] and simply say: I am a list of things.

The API Import, sometimes on its own, and sometimes with a little prodding, will unpack that array into a friendly table. Many times, that is all automatic and no one has to know it happened. You see a table of products and you are satisfied!

But other times, like in this case, each Product may have nested arrays within it. What if every product contains a list of all of its variants? That's a common way that product data is structured. I have a product called hat and it has a few variants, such as: red hat, white hat, blue hat. That data is held in another array. Each product has a column called Variant, where there is another array in each of the cells - you'd see something like [{"name":"red hat"},{"name":"white hat"},{"name":"blue hat"}].
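To make that shape concrete, here is a minimal Python sketch of what such a response might look like (the field names `name` and `variants` are just illustrative - your API will use its own):

```python
import json

# Hypothetical API response: each product carries a nested
# array of variants, mirroring the hat example above.
products = [
    {
        "name": "hat",
        "variants": [
            {"name": "red hat"},
            {"name": "white hat"},
            {"name": "blue hat"},
        ],
    },
]

# The Variant cell in the flattened table would hold this raw JSON array:
print(json.dumps(products[0]["variants"]))
```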

We want to get that data out of that JSON array and into something more friendly. So we look to the JSON Flattener for this! You pick that column to flatten, and then, once you update, you see a new option: Expand List into New Rows. When you use that option, all of a sudden, you have duplicate rows.

Specifically, you will see that the hat product now spans 3 rows - 1 for each variant. What the JSON Flattener did was change your frame of reference. Instead of 1 row representing 1 product, your data now has 1 row representing 1 product+variant. red hat, white hat, and blue hat are each on their own row.
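Parabola's internals aren't shown here, but the row expansion itself is easy to sketch in plain Python - one product row becomes one row per product+variant pair:

```python
# A minimal sketch of what "Expand List into New Rows" does:
# the product's other columns are repeated on every new row.
product = {
    "name": "hat",
    "variants": [
        {"name": "red hat"},
        {"name": "white hat"},
        {"name": "blue hat"},
    ],
}

# Build one row per variant, carrying the product name along.
rows = [
    {"product": product["name"], "variant": v["name"]}
    for v in product["variants"]
]

for row in rows:
    print(row)
```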

Why is this done? It's because arrays can have any number of items in them. They are lists, and a list can have any number of things in it. Let's say that instead of making new rows, the expansion of an array created more columns. Each hat variant lived in its own column. So this hat row remains as 1 row, with 3 additional columns. Maybe they are called Name 1, Name 2, and Name 3. That totally works.

But remember we don't just have the hat row - we have a full list of products that we are applying this to. So the next row is called sticker and it has 30 product variants! This new process now generates 30 more columns. To keep the table “square” (more like a rectangle), every row now has those columns. So we have Name 1 through Name 30, even for products like the hat, which only has data for 3 of them.

Okay, that is still kind of working, it's just a little messy. So you then use a column filter, and say, I just want to see the column called Name 10 because it has a piece of data that I want. The next time this flow runs, and your data has been updated, you may only have products with 2 variants each. Now that column filter cannot find Name 10, and it throws an error and stops your flow. That's not good. Additionally, the order of items in an array is not guaranteed by APIs, so Name 10 might be the purple sticker one day, and the blue sticker another day. That kind of inconsistency is not great.

So in order to avoid the mess from above, we move array items into new rows! It means your column names are always guaranteed, and the table still represents the same overall structure.

That was a lot, I know.

So what do we do about this? To take an example from your actual data, John, you have a list of products to start, and each product has a list of measurements. Most items have 1 set of measurements in that list, so those rows stay as a single row. But some products are actually multiple items, each with their own measurements. Let's say you have a table that comes with a chair. It's 1 product, but there is a set of measurements for the table and for the chair. So the JSON Flattener puts that on two rows.

The question then is not how to stop it from making new rows, but more of what do you want to do with those rows - how is your destination system set up to store this information?

In other words, can you describe the ideal shape of your data at the end of all of this?


I have a similar question. I am calling a Search API to return search results. Is there a way to filter the JSON before flattening, so I don't bring in all the rows? Thanks

There is not a way to filter rows inside the JSON specifically before it is expanded! You will need to expand it in order to remove parts of it.

If you want to remove the entire row containing the JSON blob when it contains a certain value, you can use a row filter and target that JSON column.
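The idea behind that workaround can be sketched in a few lines of Python - a plain substring match on the still-unexpanded JSON column, before any flattening happens (the column and field names here are made up for illustration):

```python
# Rows as they arrive from the API: the nested array is still a raw
# JSON string sitting in one column.
rows = [
    {"product": "hat", "variants_json": '[{"name":"red hat"}]'},
    {"product": "sticker", "variants_json": '[{"name":"blue sticker"}]'},
]

# Keep only rows whose JSON blob mentions "hat" - a row filter
# targeting the unexpanded JSON column, not its contents row by row.
kept = [r for r in rows if "hat" in r["variants_json"]]

print([r["product"] for r in kept])
```

Note this drops or keeps whole rows only; to filter individual items inside the array, you still have to expand it first.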


I also have the same problem. I usually have only 150-200 rows max on the first API call, but the next API call, which should return the same kind of data, runs into the problem @brian explained: the second call's response doesn't fit neatly through the JSON Flattener.

So I've used the approach @brian suggested, but it uses up extra credits, which doesn't feel like the right way. I still need to optimize it.