How to build a blacklist feature?

Hi, I am trying to build a blacklist feature for a list of URLs:

  • I have a list of URLs in a CSV
  • I also have a list of “killwords” in a blacklist. (i.e. like “forum”, “comment” etc.) in a Google Sheet

Now I want to delete all rows in the URL list that contain any of the words from the blacklist (a few hundred words).

The expected result is that I receive a list of URLs where all URLs have been deleted where a killword existed.

Can anyone give me a hint how to build this?

Thank you!

Sebastian

Hey @Sebastian_Schaeffer,

Try importing your URLs using the Use CSV file step and your killwords using the Pull from Google Sheets step.

You can remove URLs that contain a blacklisted word by using the Find Overlap step. You’ll want to keep rows from the CSV file that “match” your imported killwords.

To be considered a match, the Find Overlap step will search your URL column using the values in the Killwords column. These sidebar settings should do the trick:

The resulting URLs shouldn’t have killwords. Depending on your dataset, you may have to adjust the approximation percentage, but I found 20% worked well.

Let me know how this works out for you!

2 Likes

thanks @daniel. This kinda works but it’s very fuzzy and I am not really sure about the quality of result. I tested it with a file with roughly 10k rows and just changing the percentage from 20% to 19% produces massively different results without me actually knowing whats going on. ultimately what i need to know is if the exact string exists in an URL.

My guess is that a 20% match could also mean that I get a positive with “domain.com/blog” even though my killword is “/blog/”, or am I mistaken?

Hey @Sebastian_Schaeffer,

The results can definitely be fuzzy, so thanks for trying that out. @Adeline came up with a great solution that should be accurate. It’s not too different from the first flow.

Import your data tables and add an Insert Column step after each import. Let’s create a new column called Join and add a text value of 1. This will help us merge our tables.

Add a Combine Tables step to combine our tables using the Join column. This will likely cause duplicate rows, but we’ll take care of that.

Next, add an If/Else step and we’ll create a column called Killword? to verify if a URL contains a killword.

We can remove duplicate rows in the URL column by using the remove duplicates step and merge the Killword? column.

Lastly, we can filter rows that do not contain Yes and you should be left with the correct URLs!

1 Like