Help with web scrap data

Hey everyone.

I’ve been web-scraping the product data from multiple different sites into google sheets.

I’m having an 3 issues

Issue 1 the product is called different names on different sites.

Here’s a example
Lets say I have three different google sheets each representing a different site and I’m combining them into one.

Site A Amd Ryzen 16 core 5950x
Site B Ryzen 16 core 5950x
Site C Amd 5950x

Issue 2 I don’t know if this matters as much but my web scraping is also picking different products with the same issue

For example

Site A amd Ryzen 6 core 5600x
Site B Ryzen 6 core 5600x

Issue 3 I would like to merge all similar columns into one row.
So one amd Ryzen 5950x and with a column of 3 url links to site a site b and site c

I’m sorry this is long I’ve spent 7 hours to no avail trying to figure this out.

A similar data structure but not

Hey @Mack,

Welcome to the community :wave:

There are a handful of ways to go about this, but it really depends on the size and variance of your dataset. The first step I might try to use would be the “Insert if/else column” step.

In this example, we’re creating a column called New Product Name. If the value in the Product Name column contains 5990x we can set the value to our new column.

Once your product names are consistent, we can remove the original column using a “Select columns” step. Next, we can merge the products into a single row with the links in separate columns.

To do this, use an “Insert row numbers step” along with a “Pivot columns” step. Set the Pivot column to Row Number and set the Pivot value to your Link column.

Here’s an outline of what your flow might look like:

1 Like

This was extremely helpful!

One more question is there a way to set If Product Name Contains X number of same characters or same values? So lets say for instance I wouldn’t have to enter 5950x? I could automate it a little more

Hey @mack,

Will each product have the same amount of links or will that amount change? If you wanted to always show 3 links per product, that’s a relatively quick setup in Parabola.

After pulling in your data, you can use the “Insert row numbers” step to repeat based on the number of links you want to insert. In this example, we’re setting the value to 3.

We can then merge the Link column based on the unique values in the Row Number column using a comma , delimiter.

Lastly, we can split the columns in the merged Link column based on the comma , delimiter.

Let me know if that option could work. If not, we may be able to merge your data based on a model number like 5950x or 5600x.

With the web scraping, as long as the product is listed on the site it will create data.

Site A product this there grabs link
Site B product not there No data no link.
Site C product listed Link

So realisticly it could be like 5950x [Site A URL] [Site B No Data No URL] [Site C URL]
Also not every product has numbers so can I do the same Insert Number rows numbers for Text?
That would solve the last issue I’m having.

thanks Daniel this is extremely helpful!

Hey @mack,

Glad to hear it. Would you mind elaborating on inserting row numbers for text? If you want to add a few additional rows and insert text, you can try using the “Insert rows” step.

Alternatively, the “Insert text column” step will create a full column and you can add a custom value in that cell.

A fuzzy search.

Not all my data sets have 5950x or 5600x

When I’m was web scraping Motherboards
Is there a way to fuzzy search by keywords for example

Site A Asus ROG Crosshair VIII Hero
Site B ASUS ROG X570 Crosshair VIII Hero
Site C Asus ROG X570 Crosshair Hero

all the same Motherboard but different names on different sites.
Is there a way to fuzzy search so I don’t have to manually enter each insert if/else column for each product I will have thousands of results.

Hey @mack

Check out the “Find Overlap” step.

You may have to change your data structure around a bit, but this will enable fuzzy search given 2 inputs. You could do this in 2 steps, first finding overlap between Sites A and B, and then Site AB (already combined) and C.