Having issues joining two XLS's together

Hey all -

I have two excel sheets where one contains a list of properties and each column shows data specific to each property. So in total, there are 1,000 properties.

I have another excel that contains any person connected to that property and each column would show their contact information and mailing address. So in total there are over 3,700 contacts and match to the 1,000 properties in the other sheet. I don’t have access to the data source but I believe the data is from exporting “any contact info” from a list of 1,000 properties.

So I’m trying to Join the two tables together to match up the property addresses of both spreadsheets. The end result should be 1,000 properties and columns added that show the contact information for each property. So if there are 4 owners for one property, then the end result would show one property and columns for all the owner contact information.

But I’m having trouble creating my workflow. I’ve built formulas into the sheets to concatonate the property information into a full address. Of the 1,000 properties, there are 16 properties that do not a have a perfect full address match, which I believe is because there is not any contact info for these properties. I’m not sure if this is creating an error, but the output of my workflow results in over 61,000 rows of data… most of them are the same property repeated.

Can anyone assist? I’ll link the flow below.

https://parabola.io/app/flow/66439/builder?redirected=true

Hey Austin,

The Join is creating a ton of rows because of duplicates in both of your tables. Take a look at your output from the Join right now, and look at the second row, which has the same address repeated down quite a few times - the 133Rd one.

Copy that cell in the first column and then switch to the input 1 and input 2 tabs in the Join step, and use the search bar at the top to search for that address. I see 2 instances of it in input 1 and 10 in input 2.That makes for 20 results for just 1 address. If you look at those rows in the input tables, some of the columns are unique between them, though.

So either you need to find a way to remove duplicate addresses in one or both of the tables, or you need to Join on something more unique than the address, since it is duplicated so often.

I see that Brian, thanks for your comment.

So if we use an simpler example where one address had two property owners (Bob and Jane), would Join end up creating a duplicate property address? So as a result, it would create two duplicate addresses and add columns of both Bob (for one property record) and Jane (for the duplicated record)?

Is there anyone Flow I should be using? Since the goal would be to add the contact information as new columns if there are any matching property addresses between both sheets, so in the end there would be one property record with new columns for Bob and Jane’s contact info?

Unfortunately I don’t have direct access to the software that pulled this data, so it might be tough to figure out another way to include columns that would be more unique, but I’m wondering if I’d run into the same issue. There may be a property record unique to each property, but I would run into the same problem where 5 owners could be involved with one property, which sounds like there would be 5 duplicate property records created each with the owner information as new columns?

Correct - if you have a table of addresses, and a table of owners (with addresses listed on them), and two people own the same address, then the join will result in 2 rows to express that relationship.

If your goal is to always have 1 row per property, and have all owners expressed on that single row, then that can be done as well, though it also comes with issues.

To generalize, what we are encountering is a one-to-many relationship (assuming no owner has more than 1 property). That is to say, many items from X table can be associated with 1 item from Y table.

There are 3 ways to express this in Parabola (and in any spreadsheet looking thing):

  1. A table of owners, each with the property information columns for the owners. You will see that each rows represents an owner, and addresses may be duplicated across owners.
  2. A table of addresses, each with all associated owners showing up in subsequent columns. So address 123, which has both Bob and Jane as the owners, will have an Owner 1 column and an Owner 2 column. The problem is that if address 456 has 10 owners, now the table needs Owner 1 through Owner 10 columns, So your column count and names will not be stable, which is not good.
  3. A table of addresses, with all associated owners summarized into columns. This would look something like address 123 showing an owner count of 2, and perhaps a column called Owner names with the value Bob, Jane. So we are putting a list of names into a single cell.

Is there one of those that sounds like it will work for you?

I think the best solution would be #2 in your explanation. I would be importing the data into a cold calling system so most importantly I would need each owner’s phone number & email to be in separate columns. But having a summary column for names, phones, and email would be great because I can create a custom data field called “Ownership Summary” to show a summary such as Bob, Jane, Mike, Tommy instead of creating four separate data fields. Would be great if I could create this summary field to be setup such as Owner 1: Bob, Owner 1 Phone: 123-456-7890, Owner 1 Email: bob@gmail.com, Owner 2: Jane...

Okay, that will work, but I will caution you that most of the time folks use a structure like that, it causes issues.

Let’s say today you create that flow, and the available data shows property 1 with 3 owners, so there are 3 sets of owner columns in the final output data. Some of the properties only have 1 owner, so they will be blank in the other owner columns.

Will it be okay for the cold calling system to see blanks for certain properties in columns like Owner 2 Phone?

Let’s say tomorrow the data shows that no property has has more than 2 owners.

Will the system be okay that the Owner 3... columns no longer exist in that export?

That would be okay. If there are are empty values for different columns, such as Phone 2, Phone 3, etc. it’s not a problem on the import. If only 2/5 phone numbers import into the calling software, then that’s no issue. I’ll copy a data list below which I’ve imported, and you can see how each property record has multiple phone numbers or emails, if available.

Alright - then you will want to use a Merge values step after your Join step.

Set it to merge every field that you want to create new columns for - things like the phone number, name, etc. Set the delimiter to something unique like ||

Then set the grouping column to the address column that you only want to see 1 instance of.

Now, send that data to a Split column step, and split all of the merged columns using the delimiter you set.

This will expand those columns into new columns.

From here, you should directly export your data, as any other transform that targets one of these generated columns is likely to break in the future. Most significantly, you can’t really rename the columns after this step.

Hopefully I followed along correctly. I’m getting an error on the Column Split (see below).


Thanks for the screenshots. It looks like there’s one address for which you have a lot of contact information for that’s causing that error with the Split column step. (Your data output has exceeded 5000 columns for the step)

I think I might have a workaround to suggest, but before I do, can you confirm what your ideal output would be. You mentioned you want Owner 1: Bob, Owner 1 Phone: 123-456-7890, Owner 1 Email: bob@gmail.com, Owner 2: Jane... – do you want all of that in a single cell per address or are separate columns for each contact okay?

Hi @sachi. Here’s what my calling software looks like (screenshot). I’ll need each column to have individual data so it can import into the data fields. So as you can see, there were 3 columns with phone numbers. Other data fields include bedroom count, bathroom count, square footage, etc (as you can see on the right). On the left, I imported a custom field which is a line separated concatonate formula of a bunch of different columns. I did that so I can have a quick overview I can see all the information as I’m talking to someone.

The data provider i used to pull this information had the data organized where each phone number was a separate data field. When I import, I have 9 different phone fields (Phone 1 through Phone 9) which will upload the numbers into.

When cold calling, the system will cycle through each of the numbers automatically.

Here is a Google link to the data which I’ve successfully been able to upload. The new data I’m working with is another data provider I don’t have direct access, so I’m not sure I can do a direct export in a similar way to the Google link.

Alright I think I see the problem - it has to do with the duplicate rows that we discussed above.

If you paste this into your flow:

parabola:cb:bdb819c7461a41fda408517175bbd24f

And then pull a new arrow off of your join, and send it into those new steps, you will see that one address has around 57k entries! Some others have high amounts as well.

I poked around, and I think I found a cleaner way to do this that removes the duplicates.

Paste this into a blank area of your flow:

parabola:cb:fad3f0face5b46349eefb9d030b3ad94

That is a revised version that removes the duplicates before the join and does some more merging.

I was hesitant to remove duplicates from the Parks import, but some of the addresses were causing issues due to how many times they were duplicated.

Let me know if that all works!

1 Like

Thank you! I think this is a great solution!

3 Likes