Strip HTML and create plain text

New Parabola user here. Is there an easy way to strip HTML from a column and turn it into plain text?

Many thanks in advance…

Hey Brian,

Your best bet is to use a Regular Expression in the RegEx step.

I looked at the community-posted expressions on https://regexr.com/ and this one seems best:

(<script(\s|\S)*?<\/script>)|(<style(\s|\S)*?<\/style>)|(<!--(\s|\S)*?-->)|(<\/?(\s|\S)*?>)

Try that out in the expression box. Set it to replace with nothing.

If you have non-tag things you need to remove, like non-breaking spaces (&nbsp;), you will need to take care of those separately.
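For anyone who wants to try the same idea outside Parabola, here is a minimal Python sketch. It applies the expression quoted above, replaces matches with nothing, and then handles entities such as &nbsp; as a separate step. The function name strip_html and the sample string are made up for illustration, and Python's regex engine may not behave identically to the one behind the RegEx step, so treat this as a rough check rather than a reference.

```python
import html
import re

# The expression from the post above, split across two lines for readability.
TAG_PATTERN = re.compile(
    r"(<script(\s|\S)*?<\/script>)|(<style(\s|\S)*?<\/style>)"
    r"|(<!--(\s|\S)*?-->)|(<\/?(\s|\S)*?>)"
)


def strip_html(text: str) -> str:
    """Hypothetical helper: remove tags, then clean up entities."""
    # Step 1: replace every match with nothing, as suggested in the post.
    no_tags = TAG_PATTERN.sub("", text)
    # Step 2: entities are not tags, so decode them separately.
    decoded = html.unescape(no_tags)
    # Step 3: turn non-breaking spaces into ordinary spaces.
    return decoded.replace("\u00a0", " ")


sample = "<p>Hello&nbsp;<b>world</b></p><script>var x = 1;</script><!-- note -->"
print(strip_html(sample))  # Hello world
```

The script block and the comment disappear entirely, contents included, because those alternatives come first in the pattern.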

If you want to extract the text content of a full webpage, using @brian’s regex is pretty safe.

If your columns just contain very simple HTML snippets, and you know there aren’t <script> and <style> tags, you might be able to use a simpler expression:

<.*?>
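A quick Python sketch of the simpler expression, mainly to show why the <script> and <style> caveat matters. The sample strings are invented, and Python is only a stand-in for whatever regex engine the RegEx step uses.

```python
import re

# The simpler expression: any "<", then as little as possible, then ">".
SIMPLE_TAG = re.compile(r"<.*?>")

# A simple snippet: only the tags are removed, the text stays.
snippet = "<p>Price: <b>10</b> EUR</p>"
plain = SIMPLE_TAG.sub("", snippet)
print(plain)  # Price: 10 EUR

# With a script block, the tags go but the script body is left behind.
with_script = "<script>alert(1)</script>Hi"
leaked = SIMPLE_TAG.sub("", with_script)
print(leaked)  # alert(1)Hi
```

In Python, "." does not match a newline by default, so a tag that spans several lines would be missed by this pattern unless the DOTALL flag is set. Whether the same applies in the RegEx step depends on its engine.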

Brian & Zach - thanks so much! Both approaches seem to work.
