Regex: leave match

I’ve been experimenting with the regex step. I can match and remove parts of a string (a URL in this case), but I’d like to match and keep instead.

Is there any easy way of doing this?

I’ll admit regex is my weak point.

I’m validating URLs and trying to remove all subdomains and just leave the root domain and TLD. I’ve not been able to find or compose any regex that will match potentially multiple subdomains on all TLDs.

I’ve only been able to get regex working that will match all domains. Hence my thinking…

Thanks!

Hey Jim!

Regex can certainly be tricky.

If you have an expression that can match the thing you want, then that will work.

Wrap all of the parts of your expression that identify the thing you want to keep in parentheses.

So if you had an expression like this: .*-[0-9]+-.* which matches some numbers surrounded by other stuff, and you just wanted to keep the numbers that you identified, you would wrap the important part in parentheses like this: .*-([0-9]+)-.*

And then in the Replace field in Parabola, put $1.

Explanation
The parentheses create capture groups. They capture the stuff being matched. Then, in the Replace field, you can reference any capture group by using the $ character followed by its number. In this case, we have only 1 capture group, so we want capture group #1, which is expressed as $1. A second group in your expression would be captured too and referenced as $2, and so on.
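
To make the capture-group idea concrete, here is a minimal sketch in Python rather than Parabola. It assumes the same expression as above; the sample string "order-12345-shipped" and the helper name keep_number are made up for illustration. Note that Python's re.sub writes the group reference as \1 where Parabola's Replace field uses $1.

```python
import re

# Same expression as above: one capture group around the digits.
PATTERN = r".*-([0-9]+)-.*"


def keep_number(text: str) -> str:
    """Replace the whole match with capture group 1 (Parabola: $1)."""
    return re.sub(PATTERN, r"\1", text)


# Hypothetical sample value, not taken from the thread.
print(keep_number("order-12345-shipped"))  # 12345

# With several groups, each one gets its own number (Parabola: $1, $2, $3).
print(re.sub(r"(.*)-([0-9]+)-(.*)", r"\2 (\1)", "order-12345-shipped"))  # 12345 (order)
```

Because the expression starts and ends with .*, it matches the entire string, so replacing the match with group 1 leaves only the captured digits.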

This can be tricky, so let me know if that doesn’t fully solve it for you!

Thanks @Brian, I thought the answer was probably with groups and had experimented with $1, $2 etc. but couldn’t quite get it to work…

I’m using the expression: \w+(?:.\w\w)?.\w+$

It has a non-capturing group, and I’m wondering whether that is the reason it doesn’t return the domain, only the subdomain, when I wrap it in () and then try to reference it with $1.

My domains are formatted:

sub1.sub2.this.com
sub.foo.com

etc. I know this regex won’t work for some TLDs but it’s enough for my data.

I’m creating a new col if this changes things.

Any other insight appreciated.
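
A side note on one plausible reason the $1 attempt did not help, sketched in Python rather than Parabola (so the group reference is written \1, and Parabola's behavior is assumed to be similar, not confirmed). A replace only swaps out the part of the string that the expression matched. The expression above matches just the tail of the string, so anything in front of the match is left where it is. The expression is copied exactly as posted, unescaped dots included.

```python
import re

posted = r"\w+(?:.\w\w)?.\w+$"  # exactly as posted above
url = "sub1.sub2.this.com"

# The expression only matches the end of the string.
matched = re.search(posted, url).group(0)
print(matched)  # this.com

# Wrapping it in () and replacing with group 1 puts the match straight back.
unchanged = re.sub("(" + posted + ")", r"\1", url)
print(unchanged)  # sub1.sub2.this.com

# Replacing the match with nothing leaves only the subdomain part.
leftover = re.sub(posted, "", url)
print(leftover)  # sub1.sub2.

# To keep only the match, the expression also has to consume what comes before it.
kept = re.sub(r"^.*?(\w+(?:.\w\w)?.\w+)$", r"\1", url)
print(kept)  # this.com
```

The non-capturing group is not the problem in this sketch. What matters is whether the expression covers the whole string, so that the replacement can throw away everything outside the capture group.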

Hey Jim,

Parsing these parts of URLs is surprisingly difficult. Domains can have any number of subdomains, and the TLDs can have multiple extensions as well.

In your examples, you only show single-level TLDs. Is that true for your data? Or would you need to deal with a URL like http://bbc.co.uk/ which has co.uk as its TLD?

I think the key here will be to figure out your tolerance for edge cases.

Yes, unless I build a large regex with all TLDs, there will be edge cases, as you say…

I do need to deal with TLDs that use second-level domains, like .uk, so it gets more complicated. I found regex that can handle most cases.

With the example \w+(?:.\w\w)?.\w+$, I end up with the subdomain in a column. My plan was to use a search and replace that looks up that column’s value and removes it from the full URL, which would leave the domain. But it looks like you can’t use {fields} in search and replace, unless I’m missing something?

Ah I see.

Try (^.*)\.*(\w+)\.(\w+)\.(\w+)$ instead and Replace with $3.$4
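
For anyone who wants to check the expression outside Parabola, here is a rough Python sketch. The function name root_domain and the last two sample hosts are assumptions added for illustration; Python writes the replacement $3.$4 as \3.\4.

```python
import re

# The expression suggested above, unchanged.
PATTERN = re.compile(r"(^.*)\.*(\w+)\.(\w+)\.(\w+)$")


def root_domain(host: str) -> str:
    """Keep the last two dot-separated labels (Parabola: Replace with $3.$4)."""
    return PATTERN.sub(r"\3.\4", host)


for host in ["sub1.sub2.this.com", "sub.foo.com", "foo.com", "news.bbc.co.uk"]:
    print(host, "->", root_domain(host))
# sub1.sub2.this.com -> this.com
# sub.foo.com -> foo.com
# foo.com -> foo.com
# news.bbc.co.uk -> co.uk
```

A host with no subdomain (foo.com) does not match at all, so it passes through unchanged. Two edge cases to be aware of: a two-part TLD such as co.uk loses its registered name, and because \w does not match a hyphen, a hyphenated label is cut at the hyphen.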

Let me know if that works!

Perfect, thank you for your help on this @brian