How do I handle rows with partially duplicate data

When duplicate data is found in one or more rows, but not necessarily all rows (e.g., the same Zip code), is there a way to consolidate those rows within the file, or extract them to a different file, rather than deleting them?

Hey Ken! There should be a way to do this, but it depends on the outcome you want. Do you want to take a single file, which has many rows, some of which share Zip Codes, and do one of the following?

  1. Split that file into two tables - one with rows that are unique overall, and one with rows that have duplicates in the original file (sketched just after this list)?
  2. Split the file into one table that represents every Zip Code from the original file, but with only 1 row per Zip, and then another file of any row that was excluded while making the first table?
  3. Something else?
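For concreteness, option 1 amounts to something like this outside Parabola. This is only a rough pandas sketch, assuming the file is a CSV called addresses.csv and the column is literally named Zip:

```python
import pandas as pd

# Hypothetical file and column names, just for illustration.
df = pd.read_csv("addresses.csv")

# Mark every row whose Zip appears more than once in the file.
is_dupe = df.duplicated(subset="Zip", keep=False)

unique_rows = df[~is_dupe]     # rows that are unique overall
duplicate_rows = df[is_dupe]   # every row involved in a shared Zip

unique_rows.to_csv("unique_rows.csv", index=False)
duplicate_rows.to_csv("duplicate_rows.csv", index=False)
```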

Brian, example #1 seems closest to what we’re looking for. Would we end up with two tables in a file, or two separate files?

Let’s say two rows have the same zip code and a third row has a different zip code. We want the two rows with matching Zip Codes to be merged into a single row. I think the problem here may be handling the other non-matching data from those rows. Not sure it can be done, but the preference would be to keep non-matching data from only one of the rows rather than deleting all of the non-matching data.

In the end, the rows with duplicate data need to end up with data in every column. For example, if the Zip Code is the same on two rows and the addresses are different, we want to end up with the Zip Code being a unique value in the column, with only one of the addresses populating the row, and data in each column of that row if data was originally present.

Would it be best to split the data into two sets, and then run the duplicated data through a separate flow?

Ah interesting. With anything in Parabola, you would end up with two separate files. The only way to get two tables into a single file would be to export them to Google Sheets, where they can live in separate tabs on the same sheet (file).

Right, handling non-matching data is the tough part. Let me make sure I have this correct in my head. Given two rows with the same Zip Code, if there is a blank in the City column in one row but a value in the other, you want to make sure the value is preserved. And if both rows have a value, you don’t care which one is kept. That way, you have the “fullest” row possible for each Zip.

That assumption would mean that data from different rows gets intermingled into a single row, which can sometimes be inaccurate.
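In plain pandas, that intermingling assumption would look roughly like the sketch below. It is only an illustration, assuming the column is named Zip, the file is a CSV called addresses.csv, and blanks load as missing values:

```python
import pandas as pd

df = pd.read_csv("addresses.csv")  # hypothetical file; "Zip" column assumed

def first_filled(col):
    """Return the first non-blank value in this column for a group, if any."""
    non_blank = col.dropna()
    return non_blank.iloc[0] if len(non_blank) else pd.NA

# For each Zip, take the first non-blank value of every column, which can
# combine values that originally came from different rows.
fullest_per_zip = df.groupby("Zip").agg(first_filled).reset_index()
```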

But then you mention that the preference is to keep the non-matching data from only one of the rows:

If you want to always keep the data from the same row that the Zip was “chosen” from, then you can just use the Remove Duplicate Rows step, targeting the Zip column, to create the unique table of Zips, along with the other data in each row that was originally present. Then I can show you how to find the other table.
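Outside Parabola, that dedupe-on-Zip idea plus the other table comes out to roughly the following. A sketch only, with the same assumed file and Zip column as above:

```python
import pandas as pd

df = pd.read_csv("addresses.csv")  # hypothetical file; "Zip" column assumed

# Keep the first row seen for each Zip, along with whatever other data it carried.
unique_zips = df.drop_duplicates(subset="Zip", keep="first")

# The "other table": every row that was excluded while building the one above.
excluded_rows = df[df.duplicated(subset="Zip", keep="first")]
```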

If you provide a screenshot of a few rows, and what you want the tables to look like from those, I can work backwards easily for you.

Got it. Here are a couple of screenshots of (a) the original format and (b) the desired format.

Notice that B displays ALL of the unique Zip Codes, with no duplicates, AND retains data from one of the rows that shared a duplicate Zip Code. It doesn’t matter which of the (sharing) rows the data comes from; it just can’t be blank.

In the final output, each Zip Code must be unique.

I hope that makes sense.

A: [screenshot of the original format]

B: [screenshot of the desired format]

However, this would be the preferable outcome.

Output #1: [screenshot]

Output #2: [screenshot]

Same results, just in two different files/tables.

The data from duplicates is in its own file/table.

Alright here we go!

Your first option is to just use the Remove Duplicates step as seen here:

Which gives you this

If you want the second option, you just need to build it out a bit to count the number of duplicates, and then it can filter based on that number into two tables. Like this:
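In plain pandas, that count-then-filter flow would read roughly like this. The file, column name, and count column below are all made up for illustration:

```python
import pandas as pd

df = pd.read_csv("addresses.csv")  # hypothetical file; "Zip" column assumed

# Count how many rows share each Zip (the "count the duplicates" part of the flow).
df["zip_count"] = df.groupby("Zip")["Zip"].transform("size")

# Keep one row per Zip, then filter on the count to split into the two tables.
deduped = df.drop_duplicates(subset="Zip", keep="first")
no_duplicates = deduped[deduped["zip_count"] == 1].drop(columns="zip_count")
from_duplicates = deduped[deduped["zip_count"] > 1].drop(columns="zip_count")
```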

And here are the remaining shots of those other steps:

Now the limitation here is that the Remove Duplicates step is governing which row is actually chosen for each Zip code by selecting the first one it finds for each Zip, going from top to bottom. So if that row happens to be less complete than the others, you will have a less than complete row representing that Zip.

You can overcome this in a few ways. If you expect all Zip codes to have at least 1 row that contains data in every column, then you can use a Remove Rows step to only keep rows that are not blank in each of the columns.
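In the same pandas terms as above, that keep-only-complete-rows approach would look roughly like this. It assumes every Zip really does have at least one fully populated row and that blanks load as missing values:

```python
import pandas as pd

df = pd.read_csv("addresses.csv")  # hypothetical file; "Zip" column assumed

# Keep only rows with no blanks, then dedupe, so the row kept for each Zip
# is always a complete one. Any Zip with no fully populated row would be
# dropped entirely, hence the assumption above.
complete_rows = df.dropna()
one_complete_per_zip = complete_rows.drop_duplicates(subset="Zip", keep="first")
```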


Awesome, thx! I’ll try to build out the 2nd option. Very much appreciated!


Excellent. Ping me here if you get stuck.