I have written a couple of different Rallypoint articles in the past about Web Scraping with low to code, if you aren’t full bodied on the topic feel free to head over to the Data Pipeline and check out the Intro to Data Engineering course, there is a section on Web Scraping that goes into a deep dive of the topic. For now though, if you don’t know what it is, it’s simply the process of extracting data from a Web site via the use of a scraping tool. There are numerous ways to achieve this ranging from built in browser plugins to python scripts, each with their own level of skill and tech knowledge required. I have recently come across an even easier way to achieve this for small to medium data sets, using functionality that is already built into Power BI. This process is very effective for pulling data that spans over multiple webpages, but I would warn you that the more data you are trying to pull the more resources Power BI will utilize. If you are attempting to pull a large dataset I would recommend using one of the other methods.
Well, let's get stuck into it. If you have used Power BI before you would be familiar with the ‘Get Data’ button, we can import data from a webpage by selecting ‘Web’.
For this example, I am going to import NRL player statistics for the 2021 season, you could also use this method for scraping data for listings on an e-commerce website or reviews.
Here we can see that the data I am trying to scrape resides over 25 pages, it is also important that we inspect the URL.
For this method to work we require some form of page numbering to be present in the URL, luckily here we can see that there is a pageNumber at the end of the URL.
Now in Power BI enter in this URL into the prompt after we have selected to import data from a Web source. Select the table that contains the data required and go to ‘Transform Data’.
In Power Query we will go to Advanced Editor and add a variable to convert the query to a function.
The code will look like this:
All we need to add is the variable before the let function and then replace the page number with the variable. We can call the variable whatever we like.
We then replace the page number with the variable.
Ensuring that we encase the variable with “ and &.
We can then click done to save the changes. This then converts our query into a function.
Now we know that our data lives over 25 different pages, so we will need to create a list of 1 – 25 to use as variables for the function. Simply create a new blank query and then use the following formula to generate the list.
Ensure that you convert the list to a table so that we can use it in the next step.
Some quick housekeeping, we will rename the Function, the Query and also the column name. Now to use the list we will need to change the Data Type of the list to text as per the function that was initially created and then we invoke the formula.
What is happening when we Invoke Custom Function is Power BI is using the PageNumber list that we created to fill the variable in the website URL, so Power BI will query the 25 different URLS and import the data from each page.
Once we expand the column we can see all player stats for the 2021 season, it is simply enough to remove the Page_Number column from the table and clean the data as we see fit.
This is a handy trick that can be used for importing web-based data over multiple pages if you plan on importing the extracted data into Power BI anyway.
If you enjoyed this article or want to know more about Web Scraping please take a look at the Introduction to Data Engineering course in the Data pipeline, we explore other ways of scraping data and how to import data from many other data sources.