but you want to keep your skills sharp, then you need to get your hands on some data and get to work. The question, though, is where are you getting your data? All the exercises and projects within the Pathway come with the data already organized and provided for you. That is a nice luxury to have, but in the real world it is up to you, the Data Analyst, to find the data required to make that exciting, thought-provoking dashboard.
I am still quite a junior in the DA world, and I am ashamed to admit that my Python skills are almost non-existent. I am trying to learn, but it is taking a little longer than anticipated. So, when I had the idea to find some data relating to the property market in West Melbourne, I found myself stuck. I searched far and wide for datasets that would fit my needs, alas, to no avail.
That is when I stumbled across a solution to my problem: Web Scraping. Web Scraping is the process of reading the underlying HTML (Hypertext Markup Language) code of a website, extracting the parts we need, and then formatting the extracted data so that we Data Analysts can go about our work. This may be off-putting if you do not have any experience with HTML, but the solution I chose makes the process as easy as clicking a few buttons. A word of warning, though: Web Scraping is not considered illegal in Australia, but it can be in breach of a website's terms and conditions of use. As a general rule of thumb, if you are using the data for personal use only and not for commercial gain, you have little to worry about.
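To make "reading the underlying HTML and extracting the parts we need" concrete, here is a tiny Python sketch using only the standard library. The HTML snippet and the `price` class name are made up for illustration; real pages are far messier, which is exactly why dedicated tools exist.

```python
import re

# A toy slice of HTML — the tag and class name here are invented
# for illustration, not taken from any real website.
html = '<span class="price">$850,000</span>'

# Pull out just the text between the opening and closing tags.
match = re.search(r'<span class="price">(.*?)</span>', html)
print(match.group(1))  # the extracted price text
```

That is all scraping is at its core: pattern-matching over markup. The plugin described below does the same thing, just with a point-and-click interface instead of code.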
The solution I chose was the Web Scraper – Free Web Scraping plugin for Chrome. This is a little-to-no code solution that allows us to set up our scrape by selecting which HTML elements we want the information from.
The example I will be using is the West Melbourne property data, scraped from a real estate website. Some websites are quite protective of their data, as evidenced by the restrictions they put in place to stop people extracting a complete dataset. The main issue I faced was that I could not view beyond 50 pages of property data, and each page only contained 25 properties. To work around this, I set up multiple scrapes of narrower searches. For example, instead of searching for multiple suburbs at once, I would search each individual suburb and scrape it, then change the "sort" function and re-scrape to try and get a wider set of data.
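Running several narrowed, re-sorted scrapes means the exported files will overlap, so you need to merge them and drop the duplicates. Here is a hedged sketch of that step using only the standard library; the file contents, column names, and the choice of address as the de-duplication key are all made up for illustration.

```python
import csv
import io

# Stand-ins for two exported scrape files. In practice you would use
# open("scrape_a.csv") etc. — these names and columns are hypothetical.
scrape_a = "address,price\n1 Foo St,$500000\n2 Bar Ave,$600000\n"
scrape_b = "address,price\n2 Bar Ave,$600000\n3 Baz Rd,$700000\n"

seen = set()
merged = []
for blob in (scrape_a, scrape_b):
    for row in csv.DictReader(io.StringIO(blob)):
        key = row["address"]  # assume the address uniquely identifies a listing
        if key not in seen:
            seen.add(key)
            merged.append(row)

print(len(merged))  # unique properties across both scrapes
```

The same idea scales to however many narrowed searches you end up running: read each export, keep the first copy of each listing, discard the rest.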
Head over to the Chrome Web Store and install the plugin. After installing, navigate to the page you want to scrape, press F12 on Windows (or Cmd+Option+I on Mac) to open the Developer Tools, and click on Web Scraper.
For the first step, let's create the Sitemap by giving it a title and pasting in the URL of the website we want to scrape. Since we want to scrape multiple pages, we find the page number in the URL and replace it with a range in square brackets, e.g. [1-50]. Since I know that Real Estate.com won't allow us to search beyond 50 pages, I changed the URL to cover pages 1 to 50.
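The [1-50] range in the Sitemap URL is just shorthand for generating fifty page URLs. If you ever wanted to do the same thing by hand, it is a one-liner; the URL template below is invented for illustration and does not match any real site's query string.

```python
# Hypothetical URL pattern — the real site's query string will differ,
# so treat this template as an illustration only.
BASE = "https://www.example-realestate.com/buy/west-melbourne/list-{page}"

# Expand the equivalent of the plugin's [1-50] range notation.
urls = [BASE.format(page=n) for n in range(1, 51)]

print(urls[0])    # first page
print(len(urls))  # 50 pages in total
```

This is exactly what the plugin does behind the scenes: it visits each expanded URL in turn and applies your selectors to every page.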
Once the Sitemap is good to go, we need to set up our selector. The selector is the element of the HTML that the Web Scraper will take from each webpage. To create one, click the "Add new selector" button. From here, give the selector a descriptive name and set the "Type" to "Text"; this will pull the actual text that the HTML renders on the webpage. The next stage is to select the HTML that we want to scrape. You would usually make a selector for each piece of information you want and tie it back to a Parent Selector. In this instance, however, due to the way Real Estate.com has formatted their HTML, it is easiest to scrape the whole 'residential-card' and separate the fields during data cleaning. Upon selecting the HTML, also remove the address identifier to tell the Web Scraper to scrape every instance of div.residential-card rather than just the one we physically selected. Tick "Multiple" and hit save. You can also hit the "Data Preview" button to make sure that you have indeed selected the data you intended.
Now it is as simple as starting the scrape. The plugin will open a mini-window and start querying the 50 pages we selected.
Upon completion, simply hit refresh in the original Web Scraper window (not the pop-up) to display the data, then export it as a CSV ready for cleaning. This sort of data is not normalized or uniform, so you will have to get creative with your cleaning. For this dataset, I found that not all listings had complete data; for example, some properties did not list a land size or have garages. I challenge you to try this out for yourself and see what sort of data you can scrape. Have fun!
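Since each scraped card arrives as one raw string, the cleaning stage is mostly about splitting that string into fields and tolerating the ones that are missing. Here is a hedged sketch of one way to do it; the input format and the regex patterns are hypothetical, so inspect your own export and adjust them to whatever the plugin actually gives you.

```python
import re

def parse_card(text):
    """Pull bed/bath/car counts and land size out of a scraped card string.

    The input format here is hypothetical — check your own CSV export
    to see what the plugin actually produced and tweak the patterns.
    """
    def grab(pattern):
        m = re.search(pattern, text)
        return int(m.group(1)) if m else None  # None marks a missing field

    return {
        "beds": grab(r"(\d+)\s*bed"),
        "baths": grab(r"(\d+)\s*bath"),
        "cars": grab(r"(\d+)\s*car"),
        "land_m2": grab(r"(\d+)\s*m"),
    }

# A complete listing and one with no garage or land size listed.
full = parse_card("3 bed 2 bath 1 car 448 m")
partial = parse_card("2 bed 1 bath")
print(full)
print(partial)
```

Representing absent fields as None (rather than zero) keeps "no land size listed" distinguishable from "land size of zero", which matters once you start aggregating in your dashboard.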
Thanks for reading and I hope this has a useful application in your Data Journey. Feel free to try it out yourself and reach out if you have any issues!