Regex: leave match

jim · March 25, 2021, 10:19pm

I’ve been experimenting with the regex step. I can match and remove parts of a string (a URL in this case), but I’d like to match and keep instead.

Is there any easy way of doing this?

I’ll admit regex is my weak point.

I’m validating URLs and trying to remove all subdomains, leaving just the root domain and TLD. I’ve not been able to find or compose a regex that will match potentially multiple subdomains across all TLDs.

I’ve only been able to get regex working that will match all domains. Hence my thinking…

Thanks!

brian · March 26, 2021, 1:00am

Hey Jim!

Regex can certainly be tricky.

If you have an expression that can match the thing you want, then that will work.

Wrap all of the parts of your expression that identify the thing you want to keep in parentheses.

So if you had an expression like this: .*-[0-9]+-.* which matches some numbers surrounded by other stuff, and you just wanted to keep the numbers that you identified, you would wrap the important part in parentheses like this: .*-([0-9]+)-.*

And then in the Replace field in Parabola, put $1.

Explanation
The parentheses create capture groups. They capture the stuff being matched. And then in the Replace field, you can reference any capture group by using the $ character. In this case, we have only 1 capture group, so we want capture group #1, which is expressed as $1. That means a second group in your expression could also be captured and referenced as $2, and so on.
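
For anyone who wants to try the capture-group idea outside Parabola, here is a minimal sketch in Python. It is only an illustration: Python’s re module writes the group reference as \1 rather than $1, and the function name keep_number and the sample string are made up for this example.

```python
import re

# Same expression as above: digits between two hyphens, with anything on either side.
PATTERN = re.compile(r".*-([0-9]+)-.*")

def keep_number(text: str) -> str:
    """Replace the whole match with capture group 1 (the digits)."""
    # r"\1" is Python's spelling of the $1 reference used in the Replace field.
    return PATTERN.sub(r"\1", text)

print(keep_number("order-12345-shipped"))  # prints 12345
```

Because the .* on each side swallows the rest of the string, the replacement overwrites everything and only the captured digits remain. Text that does not match the expression at all is returned unchanged.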

This can be tricky, so let me know if that doesn’t fully solve it for you!

jim · March 26, 2021, 9:28am

Thanks @Brian, I thought the answer was probably with groups and had experimented with $1, $2 etc. but couldn’t quite get it to work…

I’m using the expression: \w+(?:.\w\w)?.\w+$

It has a non-capturing group, and I’m wondering whether that is why, when I wrap it in () and reference it with $1, it returns only the subdomain rather than the domain.

My domains are formatted:

sub1.sub2.this.com
sub.foo.com

etc. I know this regex won’t work for some TLDs but it’s enough for my data.

I’m creating a new column, if this changes things.

Any other insight appreciated.

brian · March 26, 2021, 4:37pm

Hey Jim,

Parsing these parts of URLs is surprisingly difficult. Domains can have any number of subdomains, and the TLDs can have multiple extensions as well.

In your examples, you only show single-level TLDs. Is that true for your data? Or would you need to deal with a URL like http://bbc.co.uk/ which has co.uk as its TLD?

I think the key here will be to figure out your tolerance for edge cases.
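
To make the edge case concrete, here is a rough Python sketch of one way to handle multi-part TLDs without a giant regex. Everything in it is an assumption for illustration: the function name, the tiny hand-picked suffix set (nowhere near complete), and the premise that the input is already a bare hostname with no scheme or path.

```python
# Hypothetical and deliberately incomplete set of two-part suffixes.
# A real job would need a much longer list; this one is only for illustration.
TWO_PART_SUFFIXES = {"co.uk", "org.uk", "com.au"}

def root_with_suffix_list(host: str) -> str:
    """Keep the last two labels, or the last three when the suffix has two parts."""
    labels = host.lower().strip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_PART_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])

print(root_with_suffix_list("sub1.sub2.this.com"))  # prints this.com
print(root_with_suffix_list("www.bbc.co.uk"))       # prints bbc.co.uk
```

Any suffix missing from the set falls back to "last two labels", which is exactly where the edge cases come from.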

jim · March 26, 2021, 4:48pm

Yes, unless I build a large regex with all TLDs, there will be edge cases, as you say…

I do need to deal with TLDs that use second-level domains like .uk, so it gets more complicated. I found a regex that can handle most cases.

In the example \w+(?:.\w\w)?.\w+$, this leaves the subdomain in a column. I thought I’d use a search and replace to find that column’s value in the full URL and remove it, which would leave me with the domain. But you can’t use {fields} in search and replace, unless I’m missing something?

brian · March 26, 2021, 5:26pm

Ah I see.

Try (^.*)\.*(\w+)\.(\w+)\.(\w+)$ instead and Replace with $3.$4

Let me know if that works!
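
For reference, the same expression can be checked outside Parabola with a short Python sketch. This is only an approximation of what the regex step does: Python writes the replacement as \3.\4 instead of $3.$4, and the function name strip_subdomains is made up for this example.

```python
import re

# Expression from the post above; groups 3 and 4 are the last two dot-separated labels.
PATTERN = re.compile(r"(^.*)\.*(\w+)\.(\w+)\.(\w+)$")

def strip_subdomains(host: str) -> str:
    """Replace the whole hostname with its last two labels."""
    return PATTERN.sub(r"\3.\4", host)

for host in ["sub1.sub2.this.com", "sub.foo.com", "foo.com", "bbc.co.uk"]:
    print(host, "->", strip_subdomains(host))
# sub1.sub2.this.com -> this.com
# sub.foo.com -> foo.com
# foo.com -> foo.com
# bbc.co.uk -> co.uk
```

Two things to note: a hostname with only two labels does not match and passes through unchanged, and a two-part TLD such as co.uk loses its root domain, which is the edge case discussed earlier in the thread.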

jim · March 26, 2021, 5:34pm

Perfect, thank you for your help on this @brian
