How to build a blacklist feature?

Sebastian_Schaeffer · October 19, 2020, 1:31pm

Hi, I am trying to build a blacklist feature for a list of URLs:

I have a list of URLs in a CSV
I also have a list of “killwords” in a blacklist. (i.e. like “forum”, “comment” etc.) in a Google Sheet

Now I want to delete all rows in the URL list that contain any of the words from the blacklist (a few hundred words).

The expected result is that I receive a list of URLs where all URLs have been deleted where a killword existed.

Can anyone give me a hint how to build this?

Thank you!

Sebastian

daniel · October 20, 2020, 3:06am

Hey @Sebastian_Schaeffer,

Try importing your URLs using the Use CSV file step and your killwords using the Pull from Google Sheets step.

You can remove URLs that contain a blacklisted word by using the Find Overlap step. You’ll want to keep rows from the CSV file that “match” your imported killwords.

To be considered a match, the Find Overlap step will search your URL column using the values in the Killwords column. These sidebar settings should do the trick:

The resulting URLs shouldn’t have killwords. Depending on your dataset, you may have to adjust the approximation percentage, but I found 20% worked well.

Let me know how this works out for you!

Sebastian_Schaeffer · October 20, 2020, 1:52pm

thanks @daniel. This kinda works but it’s very fuzzy and I am not really sure about the quality of result. I tested it with a file with roughly 10k rows and just changing the percentage from 20% to 19% produces massively different results without me actually knowing whats going on. ultimately what i need to know is if the exact string exists in an URL.

My guess is that a 20% match could also mean that I get a positive with “domain.com/blog” even though my killword is “/blog/”, or am I mistaken?

daniel · October 20, 2020, 5:36pm

Hey @Sebastian_Schaeffer,

The results can definitely be fuzzy, so thanks for trying that out. @anon36387498 came up with a great solution that should be accurate. It’s not too different from the first flow.

Import your data tables and add an Insert Column step after each import. Let’s create a new column called Join and add a text value of 1. This will help us merge our tables.

Add a Combine Tables step to combine our tables using the Join column. This will likely cause duplicate rows, but we’ll take care of that.

Next, add an If/Else step and we’ll create a column called Killword? to verify if a URL contains a killword.

We can remove duplicate rows in the URL column by using the remove duplicates step and merge the Killword? column.

Lastly, we can filter rows that do not contain Yes and you should be left with the correct URLs!

Topic		Replies	Views
Delete rows in a Google Sheet Ask a question	2	457	July 15, 2020
Unable to find 'List Contains' function - is it deleted? Ask a question Building-Flows	3	472	October 29, 2020
Find and replace wildcard? Ask a question Building-Flows	4	623	December 20, 2023
Airtable daily update Ask a question Building-Flows	5	712	November 16, 2020
NewBie question - selectively delete email addresses with a specific domain Ask a question Building-Flows	1	238	February 13, 2024

How to build a blacklist feature?

Related topics