Power BI – Scraping Data from Webpages Over Multiple Pages

If you remember...

I have written a couple of different RallyPoint articles in the past about web scraping with low-to-no code. If you aren't across the topic, feel free to head over to the Data Pipeline and check out the Intro to Data Engineering course; there is a section on Web Scraping that goes into a deep dive on the topic. For now though, if you don't know what it is, it's simply the process of extracting data from a website using a scraping tool. There are numerous ways to achieve this, ranging from built-in browser plugins to Python scripts, each with their own level of skill and tech knowledge required. I have recently come across an even easier way to achieve this for small to medium data sets, using functionality that is already built into Power BI. This process is very effective for pulling data that spans multiple webpages, but be warned that the more data you are trying to pull, the more resources Power BI will use. If you are attempting to pull a large dataset, I would recommend using one of the other methods. 
 
Well, let's get stuck into it. If you have used Power BI before, you will be familiar with the ‘Get Data’ button; we can import data from a webpage by selecting ‘Web’. 

For this example, I am going to import NRL player statistics for the 2021 season. You could also use this method to scrape data such as listings on an e-commerce website or reviews. 

Here we can see that the data I am trying to scrape resides over 25 pages. It is also important that we inspect the URL. 

For this method to work we require some form of page numbering to be present in the URL; luckily, here we can see that there is a pageNumber parameter at the end of the URL. 

Now, in Power BI, enter this URL into the prompt after selecting to import data from a Web source. Select the table that contains the required data and go to ‘Transform Data’. 

In Power Query, we will go to the Advanced Editor and add a variable to convert the query into a function. 

The code will look like this: 

All we need to add is the variable before the let statement and then replace the page number with the variable. We can call the variable whatever we like.  

We then replace the page number with the variable. 

Make sure the variable is enclosed with quotation marks and ampersands (“ and &) so that it is joined back into the URL text. 

We can then click Done to save the changes. This converts our query into a function. 
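As a rough sketch of what this looks like in the Advanced Editor (the URL below is a placeholder, since the real statistics URL isn't reproduced here), the original query might be:

let
    // Original query: the page number is hard-coded at the end of the URL
    Source = Web.Page(Web.Contents("https://www.example.com/nrl/player-stats?pageNumber=1")),
    Data = Source{0}[Data]
in
    Data

and after adding the variable it becomes a function:

(PageNumber as text) =>
let
    // The hard-coded page number is replaced by the variable, closed off with
    // a quotation mark and joined back onto the URL text with an ampersand
    Source = Web.Page(Web.Contents("https://www.example.com/nrl/player-stats?pageNumber=" & PageNumber)),
    Data = Source{0}[Data]
in
    Data

Your own query will have different step names and a different URL; the only changes that matter are the (PageNumber as text) => line at the top and the " & PageNumber joined onto the end of the URL.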

Now we know that our data lives over 25 different pages, so we will need to create a list of 1 – 25 to use as variables for the function. Simply create a new blank query and then use the following formula to generate the list. 

Ensure that you convert the list to a table so that we can use it in the next step. 
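For reference, the formula for the blank query is simply = {1..25}, and after using To Table the generated query looks something like this (the step names are whatever Power Query auto-generates):

let
    Source = {1..25},
    #"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error)
in
    #"Converted to Table"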

Some quick housekeeping: we will rename the Function, the Query and also the column name. Now, to use the list, we need to change the Data Type of the list column to Text, to match the function that was initially created, and then we invoke the function. 

What is happening when we Invoke Custom Function is that Power BI uses the Page_Number list we created to fill the variable in the website URL, so Power BI will query the 25 different URLs and import the data from each page. 
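Behind the scenes, Invoke Custom Function simply adds a step along these lines (assuming the function was renamed GetStats, the list column Page_Number and the previous step Changed Type; your names will differ):

= Table.AddColumn(#"Changed Type", "GetStats", each GetStats([Page_Number]))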

Once we expand the column, we can see all player stats for the 2021 season. It is then simply a matter of removing the Page_Number column from the table and cleaning the data as we see fit. 
 
This is a handy trick for importing web-based data that spans multiple pages, especially if you plan on importing the extracted data into Power BI anyway. 
 
If you enjoyed this article or want to know more about Web Scraping, please take a look at the Introduction to Data Engineering course in the Data Pipeline, where we explore other ways of scraping data and how to import data from many other data sources.  

Architecture Reviews

Synopsis:

Architecture Reviews analyses the frameworks of enterprise and solution architecture, which are vital to technology and operational design.  

Architecture reviews are conducted to understand the designed or eventuated structure of a particular ‘thing’ so that a further design, or an alteration of a design, may be conceived. Architecture is vital: it is the rational understanding of a designed structure. While commonly associated with buildings in the construction industry, architecture can apply to anything. In the business and technology world, architecture is important for providing direction and guidance to the various nodes of activity. Enterprise architecture is the description of elements such as business strategy, plans and key operations. Solution architecture refers to the operational and technology symbiosis used to solve a particular problem, and technology architecture refers to how that technology is structured and designed. On their face, these definitions are simple; however, conducting a review of an organisation’s architecture can be complex and ambiguous.  

Enterprises, like military organisations, deal with complex and ambiguous problems which tend to have high consequences. The complexity eventuates from many competing priorities and numerous nodes of activity. The ambiguity is caused by an evolving environment of threat actors, economic and social effects, and information uncertainty. While this complexity and ambiguity make an architecture review difficult, they are also the reason why understanding the organisation's architecture is so vital; for, in the face of chaos, we must be orderly. The point of a review is to understand the enterprise's strategy and goals, assess the operations, technology and solutions aimed at realising that strategy and those goals, and make a judgement on points of friction, neglect and redundancy. This assessment is then used in designing further architecture that maximises long-term efficiency and excellence.  

The International Organization for Standardization (‘ISO’) asserts that an architecture review should encompass five ‘views’ (ISO 42010:2011); views are synonymous with perspectives or categories. These views are enterprise, information, computation, engineering and technology. Enterprise is the viewpoint of the operations, processes, principles and guidelines by which the enterprise achieves its goals. Information is the viewpoint of where data is stored, how it is gained, and how it is used. Computation is the viewpoint of where computing power is used in the organisation, such as artificial intelligence (‘AI’) and robotic process automation. Engineering is the view of how the entire information technology (‘IT’) system is networked, where data flows and how the administration of the system is managed. The technology viewpoint is very technical, focusing on how the code is structured, what APIs are used and so forth.  

Similar to the ISO standard, Lean-Weng et al. (2014), who conceived an Enterprise Architecture Framework (‘EAF’) for the Singaporean military, use four components (instead of the ISO views): business, information, solution and technical. Ceteris paribus, while the enterprise view equates to the business component, the ‘solution’ component considers the applications used by business operations, processes and administration. Lean-Weng also proposes that architecture ought to be viewed at two levels. The first level is the enterprise architecture, which draws the components together by common principles, values and strategy. The second level is a systematic assessment, which breaks the components down into smaller elements titled ‘views’. Lean-Weng’s conception of a view can be understood by the following table.  

Lean-Weng’s concept of dual layers provides a logical differentiation between the intent of ‘enterprise architecture’ and ‘solution architecture’: enterprise architecture is the assessment of the components’ alignment to business strategies and goals organisation-wide, while solution architecture is the assessment and application of the components and their views towards solving a problem of the business strategy. A solution architecture review would focus on one node of the business’s activity, applying all components and views to assess its efficiency and excellence.  

Although the ISO is a globally recognised authority on standardisation, pragmatically there is no one framework to fit every enterprise. The architect should apply the views or components necessary to achieve the enterprise's goals, although some mixture of Lean-Weng and the ISO standard would serve as a good starting point. 

References  

Yeoh, Lean-Weng and Syn, Hon-Beng 'An Enterprise Architecture Framework for Developing Command and Control Systems' (4 Nov 2014) Choon-Vu Lam, Defence Science and Technology Agency - URL https://onlinelibrary-wiley-com.simsrad.net.ocs.mq.edu.au/doi/abs/10.1002/j.2334-5837.2007.tb02887.x 

International Organization for Standardization 'Systems and Software Engineering - Architecture Description' (1 Dec 2011) ISO/IEC/IEEE 42010:2011(E) - URL https://nanopdf.com/download/iso-iec-ieee-420102011e-systems-and-software-engineering_pdf 

Samir Roshan 'IT Architecture Review: The Basics, the Approach, the Outcome' (9 Nov 2012) Thinking Loud on Cloud - URL https://thinkingloudoncloud.com/2012/11/architecture-review-basics-approach-outcome/  

'Enterprise Architect vs. Solution Architect vs. Technical Architect' (2021) LeanIX - URL https://www.leanix.net/en/wiki/ea/enterprise-architect-vs-solution-architect-vs-technical-architect-whats-the-difference 

Music Speaks to Your Soul, Mathematics Can Tell You Why

The Beauty of Mathematics

Though it can seem dry at first, mathematics can embody a set of abstractions and conceptualisations that can only be described as beautiful.

With the data pathway gearing up to produce more lessons on mathematics to inform our other courses, I thought I would take the opportunity to share some of the more beautiful elements of mathematics that reveal relationships underpinning our fundamental experience of the Universe.

If you are after a business use (outside of operating a business in a universe built on these fundamental concepts), then buried within this is the understanding that informs the way the Power BI analytics automatic predictive tool works.

It may take a few articles, but I thought a good place to start would be to share some of the mathematics of music and sound waves, and how they are processed by the human brain. Biology and anatomy are my least strong areas of knowledge, so apologies for any incorrect use of terminology there.

Music That Sounds Bad

I was raised in a fairly musical household. My father was a guitar player his whole life and my mother, a classical pianist. My grandfather was a singer in the Philharmonic Choir (seen here performing Stairway to Heaven on Andrew Denton in 1989). As an angsty teen, I quickly rejected the music of the establishment in favour of Death Metal, but being quite forgetful, I never remembered to bring my guitar pick to guitar lessons, so I ended up learning classical guitar fingerpicking instead of the chunky, brutal riffs that filled my iPod.

I’ve been playing for around 15 years now, and in that time I’ve managed to cobble together a relatively accurate understanding of music theory. While doing so, I found that mathematics was the most helpful tool for understanding why some notes sound good together and others sound bad ("bad" = "brutal" if you’re into Death Metal, "dissonant" if you’re into Stravinsky).

So why do some notes sound harmonious together, whilst others sound dissonant and harsh?

As with any system, it's best understood by breaking it down into its constituents.

Music is just a combination of sound, and sound is just vibrations in air.

So let’s do some physics!

The first thing we need to do is explain what sound actually is.

Sound

A sound wave is a moving front of increasing and decreasing density that transfers energy without the transfer of matter. We usually think of sound as vibrations in the air, but all matter can carry physical vibrations, that is, sound energy.

The animation below shows a slow-moving wave, but sound in air moves at roughly 343 meters per second, each particle moving back and forth hundreds of times every second.

(Above gif taken from http://resource.isvr.soton.ac.uk/spcg/tutorial/tutorial/Tutorial_files/Web-basics-sound.htm)

The animation above shows what air molecules do as a sound wave passes through them. In the animation, watch the red dot carefully. It is oscillating back and forth, but it always returns to its rest location. This is actually what all of the molecules shown are doing. The wave moves, but the molecules oscillate/vibrate in place.

When a sound wave enters your ear, it causes this same motion that we see in the red dot, in tiny hairs within your inner ear. These hairs are connected to nerve endings that run information to the brain. This allows your brain to measure how fast the air is vibrating.

The length of these hairs determines the frequencies that they are able to vibrate at.

Imagine waving different sticks as fast as you can. You can wave a shorter stick much faster than a longer one. This is describing how the physical size affects the “Resonance Frequency” (more on this later).

Faster vibrations sound higher in pitch (treble) and slower vibrations sound lower in pitch (bass).

The number of oscillations back and forth each second is called the frequency of the wave.

The graph below the wave diagram shows this longitudinal compression wave, represented as a sine wave. This is the shape of the graph you get if you plot y=sin(x).

The sine wave can be a graph of molecule displacement or of density; both will be a sine wave.

Here is what a pure, true sine wave sounds like: https://youtu.be/qNf9nzvnd1k

The sound will start at a low frequency, below the register of human hearing (don't crank the volume). As the frequency increases, you will start to hear the sound. It will then become too high in frequency for the human ear. (Your dog might hear it though)

If you have any hearing damage then as the sine wave runs through the frequencies that correspond to the damaged hairs in your ear, you may notice some tinnitus.

This shows you how the pitch of a note changes with frequency.

Let's pause and realise that we’ve just done something pretty cool.

We have linked a mathematical measure (frequency of a wave) to a biological model, which we’ve then linked to how our brain processes one of its most basic inputs, sound.

If you remember the early days of mobile phones you might remember all of the early monophonic ringtones.

Mono, Greek for one/single

Phonic, Greek for sound

These were songs played using single sine waves of different frequencies, one after the other.

This was definitely cool, but we didn’t really have music yet. No one would sit and listen to Motorola ringtones in their spare time. We need more than just monophonic to get that deep connection that humans have with music.

Wave Addition

This next section will involve a little math exercise.

I’d like you to go to GeoGebra: https://www.geogebra.org/calculator which is a fantastic resource for understanding functions in mathematics. Upon opening this link, many of you will sigh and realise how much frustration you would have been saved in high school if you had had the tools kids have today.

Step 1 Click on the f(x) tab to access the calculator's inbuilt functions.

Step 2 Click ‘sin’

Step 3 Click back to the numbers tab

Step 4 Click x and hit enter

This process will add a sine wave to the graph.

If you then click into the input, you can add another function. This time repeat the process, but click into the function and change it to sin(2x).

We’ve now got 2 waves. By multiplying x by 2, inside our sine function, we’ve doubled the frequency.

In the time it takes our first wave to move up, down and back again, the second wave has done it twice.

Try inputting different values before x to see how it changes your wave.

Now we’re going to add 2 waves together.

We have already input these 2 waves.

sin(x)

sin(2x)

Let's now look at their sum:

sin(x) + sin(2x)

This is exactly how waves add together in nature. As waves move through each other, their amplitudes add together, creating a “superposition” of the 2 waves.

Ripples across the surface of water add their amplitudes, creating an “interference pattern”.

Try adding more waves to see what weird-looking shapes you can create.

Think about the frequency of the resulting, repeating pattern you generate.
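If you want to check your answer algebraically, the standard sum-to-product identity rewrites the two-wave example above as a product:

\sin(x) + \sin(2x) = 2\sin\left(\frac{3x}{2}\right)\cos\left(\frac{x}{2}\right)

The combined wave still repeats, but only once the slower component has completed a full cycle, so the repeating pattern you see has the period of the lowest-frequency wave (here 2π).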

Fourier Theory

In this video, we see the wave pattern that we get from the sound of the presenter’s voice. You may notice that at 43 seconds we see something that looks a little bit like the graph of f(x)=sin(x)+sin(2x)+sin(3x)+sin(4x).

This is followed by the presenter making a much higher-pitched sound, which looks more like a pure sine wave.

f(x)=sin(x)

Sounds like a whistle, a flute, a soprano singer, a male falsetto or a xylophone usually consist of a small number of waves. The more complex or “warmer” tones can be made by stacking many waves on top of each other.

The cool part comes from a field of mathematics developed by Joseph Fourier in the 19th century. Fourier's theorem roughly states that any continuous function can be written as the weighted sum of a (sometimes infinite) series of sine waves.

Any function or shape.

The sounds made by a violin, the static from a TV, the shape of the stock market's value over time, an exponential growth curve, a logarithm, letters written out by hand on a sheet of paper: all of these can be represented as the summation of a series of waves.
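For a signal that repeats with period 2π (keeping the notation simple here), that "weighted sum" can be written out explicitly as:

f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty}\big(a_n\cos(nx) + b_n\sin(nx)\big), \qquad a_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\cos(nx)\,dx, \quad b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\sin(nx)\,dx

The a_n and b_n are the weights: they tell you how much of each frequency is present in the signal.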

For example:

This is the result of successive additions of terms in a series that looks like this:

y = sin(x) + (1/3)sin(3x) + (1/5)sin(5x) + (1/7)sin(7x) + (1/9)sin(9x) + …

The more terms we add, the closer our series gets to the square wave, approaching the true shape in the limit of infinitely many terms.
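Written out in full, for a square wave that jumps between -1 and 1, the series is:

\text{square}(x) = \frac{4}{\pi}\sum_{k=0}^{\infty}\frac{\sin\big((2k+1)x\big)}{2k+1} = \frac{4}{\pi}\left(\sin(x) + \frac{\sin(3x)}{3} + \frac{\sin(5x)}{5} + \cdots\right)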

This video shows the difference in sound between the sine wave and the square wave.

All of the soul and natural timbre of every instrument, voice, or animal call, is just a matter of adding enough sine waves together.

If you want to see the slow build from pure sine waves to organic sounds, here is a video of Wendy Carlos, the pioneer of the synthesizer, creating a xylophone by stacking pure frequencies.  Jump to 2:40 for the demonstration.

This concept, put into a rigorous mathematical statement:

The complex exponentials form a basis for L²([0,1]),

where L²([0,1]) is the set of functions f(t) on [0,1] for which:

\int_0^1 |f(t)|^2 \, dt < \infty

Put into a less rigorous pseudomathematical statement:

"Everything is waves maaaaaaaannn"

The reason this works so well for our brains is that when a sound wave enters our ears, the structure of the ear separates the wave into its Fourier series frequency components.

The wave sin(x)+sin(3x) will wiggle the hair in your ear that corresponds to sin(x) AND the one that corresponds to sin(3x).

Your brain takes the Fourier transform of the wave and represents it as a sum of frequencies.
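"Taking the Fourier transform" has a precise meaning too. In one common convention, it is the operation that turns a signal in time into its recipe of frequencies:

\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t)\,e^{-i\omega t}\,dt

The size of \hat{f}(\omega) tells you how strongly the frequency \omega is present in the signal f(t).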

Is the wave actually made of the many separate waves or is this only an approximation of its true nature?

The two are mathematically and physically equivalent. I leave the rest up to the philosophers.

Tune in next time and I'll talk about Resonance and the Harmonic Series.

Data Entry Forms In Power BI

Introduction

Microsoft Power BI is a powerful data analysis and reporting tool. However, it is not generally used for data entry. Sometimes users would like to be able to capture data from a Power BI report. This paper provides a method of entering data through Power BI using a SQL stored procedure with parameters.

Data Entry

Microsoft has many tools that may be used for entering data for data analysis and display in Power BI, including but not limited to:

  1. Microsoft Forms
  2. SharePoint Forms
  3. Excel
  4. Access
  5. SQL Server Management Studio
  6. SQL Server Reporting Services.

Power BI may also be used to enter and store data using:

  1. parameters
  2. a stored procedure
  3. a SQL Server table.

To illustrate this technique, a simple example of an employee records table TblEmployees was created in SQL Server, with the following columns:

  1. EmployeeID As NVARCHAR(MAX)
  2. FirstName As NVARCHAR(MAX)
  3. Surname As NVARCHAR(MAX).

(Example data was added to tblEmployees for demonstration)

Parameters

The parameter values entered by the user are passed to the stored procedure for entering into the table.

A new parameter is created by opening the Query Editor (Home – Transform data – Transform data) and then selecting Home – Manage Parameters – New Parameter; Figure 1. In the example, three parameters were created:

  1. @EmployeeID
  2. @FirstName
  3. @Surname.

The parameters may be edited through the Query Editor, accessed by Home – Transform data – Transform data, then Manage Parameters – Manage Parameters.

Figure 1. New Parameter Creation Page

Stored Procedure

To use the parameters to enter user names into TblEmployees and show the results in the report card, a stored procedure spCreateEmployee was created. The stored procedure uses the INSERT INTO statement, with the values being the parameters[1]. Note that the data types of the table columns and parameters need to match. Further, an issue with inserting data from a Power BI DirectQuery is that the query may be executed multiple times. Thus an IF EXISTS check was used to ensure that only one record is created, with the added benefit that records may be updated for a given EmployeeID.

The SQL code for spCreateEmployee is:

USE [WYWM]

GO

-- Creates or updates an employee record from the Power BI parameter values.
ALTER PROC [dbo].[spCreateEmployee]
(
    @EmployeeID AS NVARCHAR(MAX),
    @FirstName  AS NVARCHAR(MAX),
    @Surname    AS NVARCHAR(MAX)
)
AS
BEGIN
    -- If the EmployeeID already exists, update the record
    -- (this also stops repeated DirectQuery executions from inserting duplicates).
    IF EXISTS (SELECT * FROM [dbo].[TblEmployees] WHERE [EmployeeID] = @EmployeeID)
    BEGIN
        UPDATE [dbo].[TblEmployees]
        SET [FirstName] = @FirstName,
            [Surname]   = @Surname
        WHERE [EmployeeID] = @EmployeeID;
    END
    ELSE
    -- Otherwise, insert a new record.
    BEGIN
        INSERT INTO [dbo].[TblEmployees]
        (
            [EmployeeID],
            [FirstName],
            [Surname]
        )
        VALUES
        (
            @EmployeeID,
            @FirstName,
            @Surname
        );
    END
END

A tip for using Power BI with parameters is to check View – Parameters – Always allow; Figure 2.

Figure 2. Always Allow Parameters

Linking The Parameters To The Stored Procedure

To link the Power BI parameters to the stored procedure spCreateEmployee[2], tblEmployees was linked as a DirectQuery data source in Power BI as tblEmployeeCreate.

The stored procedure spCreateEmployee is then called from tblEmployeeCreate by adding the following query: open the Query Editor at Home – Transform Data – Transform Data, select the table query and select Advanced Editor.

let
    // Build a native query that executes the stored procedure,
    // passing in the Power BI parameter values
    Source = Sql.Database("STEVEN-DESKTOP", "WYWM",
        [
            Query = "
                EXECUTE spCreateEmployee
                    @FirstName = '" & FirstName & "',
                    @Surname = '" & Surname & "',
                    @EmployeeID = '" & EmployeeID & "'
            "
        ]
    )
in
    Source

A function fnCreateEmployee is created by selecting tblEmployeeCreate, right-clicking and selecting Create Function.
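For reference, the function that Create Function generates should end up looking roughly like this; Power BI writes it for you, so treat it as a sketch rather than something you need to type:

(EmployeeID as text, FirstName as text, Surname as text) =>
let
    // Same native query as above, but the values now come from the
    // function arguments instead of the report-level parameters
    Source = Sql.Database("STEVEN-DESKTOP", "WYWM",
        [Query = "
            EXECUTE spCreateEmployee
                @FirstName = '" & FirstName & "',
                @Surname = '" & Surname & "',
                @EmployeeID = '" & EmployeeID & "'
        "])
in
    Source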

Entering Data

To enter a new employee, the parameter values are entered and the stored procedure executed.

The parameter values may be entered either:

  1. Selecting Home – Transform Data – Edit Parameters, entering the Parameter values and selecting Okay; Figure 3.
  2. In the Query Editor, selecting fnCreateEmployee, entering the Parameter values and selecting Invoke; Figure 4.

Figure 3. Edit Parameters

Figure 4. CreateEmployee Function

Native Mode Check

Once the parameter values are submitted, a Native Database Query prompt may be shown; Figure 5.

This request for permission can be turned off by selecting File – Options – Security. The first option in the dialogue concerns Native Database Queries; by default the Require User Approval checkbox is ticked; Figure 6. Uncheck this box, noting that this changes the setting for all queries.

Figure 5. Native Database Queries Prompt
Figure 6. Native Database Queries Setting

Results Table

Finally, to show the content of tblEmployees:

  1. TblEmployees was linked again as a DirectQuery data source in Power BI as tblEmployees
  2. a Power BI table results card was created; Figure 7.

Figure 7. Results Card

Conclusion

In conclusion, data entry may be conducted directly through Power BI by using an INSERT INTO SQL stored procedure with parameters.

[1] https://www.mssqltips.com/sqlservertutorial/2518/sql-server-insert-command-with-variables/

[2] https://www.c-sharpcorner.com/article/execute-sql-server-stored-procedure-with-user-parameter-in-power-bi/

Demystifying The Study of Maths For Data Science

"You need to learn more maths"

If you’re anything like me, you have been told this over and over and over again. ‘But what the heck does it mean?’, ‘Where do I begin?’, you may ask. I agree with you: maths is a massive and daunting domain to tackle.

I, like many of the readers of this blog, have recently embarked on a technological adventure of a lifetime. I started mine two years ago; I came from a very non-tech background and most certainly did not deal with maths beyond your classic times tables. So, you and I are probably in the same boat, but if you’re not, keep reading; you might find this perspective interesting.

That’s it, context is set; in this article I am going to discuss how you could go about learning more maths, and how it is actually applicable! Yeah, I know right, crazy? Making something applicable; wish school did more of that.

Anyway, back to it.

In the data science world, we can safely delineate three main areas of maths you should learn:

  1. Statistics
  2. Linear algebra
  3. Calculus

In my opinion, in that order, but I am happy to be shown differently.

This is not to discount other areas, such as geometry and wider algebraic concepts, but these will give you a very large step up in designing algorithms, building analytical reports, and proving your hypotheses.

I think the most important first step is to answer: when would each of these concepts be used?

Statistics is broken into two primary components: descriptive and inferential. Stats as a whole is fantastic for summarising data into single, digestible notations which can tell you whether variables correlate, to what extent you can trust that determination, and so on. The descriptive component describes the data as you have it, for example the standard deviation, mean, median and mode. Inferential stats allow you to understand more about why something is happening; this can be differentiated into frequentist and Bayesian statistics. Frequentist statistics assesses how probable an event is by the historical frequency with which that event has occurred. Bayesian statistics, on the other hand, begins with a subjective prediction of the probability of the event, which is then tested and updated; there is a lot more to Bayesian statistics, but the intent of the calculation is to give a more realistic and worldly assessment of probability than what frequentist statistics has to offer.
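That idea of starting with a prior belief and updating it is captured by Bayes' theorem, which is worth recognising on sight:

P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}

Here P(H) is the prior (your starting belief in a hypothesis), P(D | H) is the likelihood of seeing the data if the hypothesis were true, and P(H | D) is the updated (posterior) belief once the data has been observed.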

Linear algebra, on the other hand, is the study of the straight lines in the data, and is best described as focusing on the operations of variables and numbers. This paradigm allows us to step back from the data and understand it as a vector. A vector is simply a way of treating ‘something’ so that it conforms to the rules of arithmetic. By understanding the data as a vector, we can create algorithms to model the data, which is useful for prediction and optimisation.
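As a concrete and very common example of modelling data as vectors, linear regression writes the predictions as a matrix-vector product and finds the weights using the normal equations (assuming X^T X is invertible):

\hat{\mathbf{y}} = X\boldsymbol{\beta}, \qquad \hat{\boldsymbol{\beta}} = (X^{\top}X)^{-1}X^{\top}\mathbf{y}

Here X is a matrix whose rows are your observations and whose columns are your variables, and \boldsymbol{\beta} is the vector of weights the model learns.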

Calculus, conversely, is the study of the curves in the data, and is best described as the focus on the operations of functions and their derivatives. While linear algebra aims to create a vector representation, calculus aims to identify the rate of change in a specific variable. This rate of change is very important in prediction and optimisation as it tells us how a variable will be affected as another measure is changed. Calculus is said to have two stages. The first is differentiation, also coined the cutting stage: the curve is broken down into smaller and smaller pieces so that the rate of change is assessed over increasingly straight lines. The second stage is integration, or the rebuilding stage, which is the reassembly of those smaller pieces so as to recreate the curve. This is important for data science as we can model a variable's behaviour as reacting to very specific moments.
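The cutting and rebuilding stages correspond to two standard results. The derivative measures the rate of change over an ever-smaller piece of the curve, and the fundamental theorem of calculus says that integrating those rates of change rebuilds the original function:

f'(x) = \lim_{h \to 0}\frac{f(x+h) - f(x)}{h}, \qquad \int_a^b f'(x)\,dx = f(b) - f(a)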

Now, the second most important question is, how do we go about learning to use these?

A traditionalist will tell you to work through the textbooks, learn the theory and do the exercises. The realist will tell you not to worry about theoretical maths and go straight to the applied side, meaning learn and use it as necessary. And then we have those who say you should learn maths and then learn Python and whatnot, and others who will say no, you should learn programming first and then go to the maths.

I disagree with all the above. My answer? Do it all at once 😉

No, you think I am joking but I am not, hear me out.

In the modern world we are extremely lucky to have a lot of technology that has already done the real technical, theoretical heavy lifting for us. The issue is, if we do not know what to apply, how could we apply it? Remember my last post, when I spoke about labelling a carrot a tomato: if you want to learn about carrots but you keep searching for the term tomato, you will end up learning an awful amount about tomatoes and nothing about carrots. Yeah, well, the same applies here.

So, this is what I recommend, and what I will be helping you out with. Learn the topics of the above areas well enough that you can say, “yes, if I want to solve this issue, I am going to need differential calculus”. You do not necessarily need to understand how calculus solves the fictional problem; you only need to know the terms and roughly why they relate.

Now, if you couple this with learning a programming language like Python, you can start to look at libraries like NumPy, which let you call on a range of mathematical formulas as functions, put in the different variables, and let the computer do the rest. However, I will reiterate: if you do not know which maths concepts apply, you will not know which function to call upon in Python.

So, in my opinion, get the basics of Python sorted first. Do not worry, it is really simple. Then start to read up on the above maths areas while you learn and practise in Python. After a couple of weeks you will be going great guns and you will be ready to move across or down the T (if you do not know my reference, you really should read my last article).

Further to this, Microsoft Azure has a fantastic service called Azure Machine Learning Studio. Similar to Python, this studio has a range of functions to select from, and without a deep understanding of the maths or programming you can be running your own algorithms in just a few days. But again, you need to understand which maths concept you ought to be engaging.

Now, over the next few weeks I will continue posting to demystify maths and help you understand which specific maths concept you should be using for a specific problem.

Stay tuned. 

Written by

Coopper Kingsley

Associate Instructor

Hit me up at coopper.kingsley@withyouwithme.com


Data Science, what is it?


You’ve probably heard this term thrown around a lot in the last 10 years, and if you’re a nerd, you have probably even thought ‘wow, that is sexy’… no? And you would be correct! Being a self-proclaimed nerd, I think this term and its related meanings are fantastic. It is the present and the future, right? Over the next 50 years we will be driving autonomous cars, wars will be fought with robots and, heck, your psychologist might even end up being some algorithm too.

But what is it? I thought I knew what data science was until I had some professionally transformative experiences over the last few months. I was invited to attend a ‘Data Science for Defence Capability’ course run by the Australian Defence Force Academy in November 2021. On this two-week course, I was to learn enough about what data science is to become conversational in the matter.

I was super excited. As a studious individual, I researched the teachers for the course to learn their backgrounds and what I would learn from each of them individually. Looking over their professional biographies, I thought ‘yup, these are certainly data scientists’. They fitted my presupposed definition to a T: one person built image recognition algorithms, another built the rationale for swarm robots, another was a programming and algorithm expert, and so on. Would you disagree with my presumption?

So anyway, I was sitting there on my very first day and, as a typical course would start, the teaching staff asked, ‘data science, what is it?’. All fourteen students, being engaged and excited, gave their opinions as a matter of fact; ‘machine learning’, ‘AI’, ‘neural networks’, we shouted! These seem like reasonable explanations, right? So you would be just as surprised as I was when the staff said,

‘Well, we don’t think that we are data scientists.’

What?

We were stunned.

The image recognition specialist, now explaining, said, ‘well, you are correct, ML, AI and NNs are data science, but not one of us builds that technology completely.’ For instance, this specialist explained that their role and speciality was in calculus and building the maths models for neural networks, or using pre-created algorithms; however, that is one slice of the cake in making the end NN tech. This specialist works with the programming expert, who works with a data engineer, who works with data analysts, and so on.

The class was shown this graphic;

Photo produced by ADFA Data Science for Data Capability Course - Nov 2021

The image recognition specialist actually makes up a minute section of the ‘models and algorithms’ component. Without their skill set, the process would be useless, correct! But without all the ‘stuff’ before and after, their skill set would likewise be useless.

So, you tell me: what or who is a data scientist? Are they all data scientists who contribute to this process? Does this question even matter? Why do we need to label? Well, I’ll tell you why this is a very important issue to resolve!

This is important because labelling is how we google, and in this modern day, googling is how we learn. So, if you saw a carrot and labelled it as a tomato, and then went to Google to learn about carrots but instead googled tomatoes, you would learn an awful amount about tomatoes, not carrots. Therefore, in your quest for professional development, you must be very careful about what you are intending to google to learn.

It is all well and good to say you wish to be a data scientist, but what does that mean? Do you want to work with image recognition, robotic behaviour, or economic and business optimisation? Do you want to work with theoretical maths, do you want to be programming or more in the data prep or delivery stages?

Now, before I leave you with this article to mull over, I will offer two more pieces of advice.

Think of the T! The T method is about breadth and depth of knowledge. If you go across the T, so broad, you are learning a lot of skills; you might learn engineering, narration and maths. But if you go down the T, so deep, you will be specialising in, let’s say, inferential statistical analysis. There is no correct answer, just do you and what you find interesting, but I would suggest going broad before you dive; you never know what you might find out.

Lastly, think about the analytics continuum and how far up the maturity scale you wish to go. The further up you go, the more studying you will have to undertake, but if you try to do the entirety of the continuum you will be less capable than someone who focused on a single bit (yes, the T model again).

Photo produced by ADFA Data Science for Data Capability Course - Nov 2021

In summary, learn about data science and what you actually want to do, because one thing is very different from the other; apply the T model in your progression, and understand the forms of analysis.

Written by

Coopper Kingsley

Associate Instructor

Hit me up at coopper.kingsley@withyouwithme.com


Power BI – Making a Navigation Pane

Sometimes we want uniformity in our dashboards...

We want all our pages to have a common theme or usability that the customer or end user can rely on. For me, I like my dashboards to have a central Navigation Pane that not only allows users to move through the different dashboards but also provides a common slicing selection that can be used for most of my dashboards. Of course, there may be times where we need to alter this common pane to accommodate different visualizations or slicing options, but for the most part the same structure can be used to great effect. 

To get started, create slicers for each filter selection you would like the end user to be able to filter by. It’s also important to note that if something I want to slice against exists in both the Fact Table and the Dimension Table, I will take it from the Dimension Table. The general rule we want to follow is to almost always slice Facts against Dimensions. 

The slicers will form the basis of the Filter Header. 

Next, we need to construct the navigation pane that will house these slicers. I like a basic setup of navigation arrows (one for going back a page and the other for going forward a page), a custom “Home” button and a “Q&A” button to allow the end user to ask questions directly of the data. These buttons are then placed on top of a text box that includes the title of the page. If you lose your slicers behind the text box, don’t worry, we’ll cover how to edit the Layer Order shortly. You may choose to go a different route; what I have found is there are a million different ways to accomplish the same task in Power BI. If you choose not to have navigation arrows, that is ok; do what you feel will provide the required level of functionality for your dashboard. 

To create the custom “Home” button, simply navigate to Add a button and select a Blank button. This is quite a useful option as you can create a button with a custom image that is linked to an action within the dashboard. 

To link this newly created button to an action, click on the button and go to the Format button pane; the second last option is Action. 

When the user clicks this button, they will be taken to the Power BI page titled “Home Page”. 

Now a bit of housekeeping

Adding some labels to the navigation arrows and changing the background to match that of the text box. 

Next, we need to change the slicer types to best fit in the navigation pane. In doing so I will also alter the year selector to only show the last 4 years as this data will be most relevant. 

Dropdowns for most of the selections will make for a clean easy to use solution, with buttons for the Years to add some variety. 

Now the issue we face is that the dropdowns have disappeared behind the yellow text box as seen below.  

This can be remedied by changing the Layer Order. This is found under the ‘View’ tab and ‘Selection’. 

Under this Selection menu, we can change the layering order of the elements we have within the Dashboard. We want to move the text box containing the title to the back and move the dropdowns to the front. This is done by clicking and dragging the elements in the side menu into the desired order. 
 

All that is left to do is group the navigation pane elements together so we can easily duplicate them onto our other dashboard pages. To do this, in the Selection side menu select all elements of the navigation pane, group them, and name the group to distinguish it from any other groups we may create in the future. 

We now have a complete Navigation Pane that we can utilize throughout the dashboard as we see fit. The Selection and Grouping functions are useful tools that can be mastered to create intricate, multi-layered dashboards and can save you a mountain of time when wanting to re-use large parts of a dashboard across the Power BI file or files. 
 
I hope you have learnt something new or improved your skills in Power BI by reading this article. As always, feel free to reach out on Discord if you have any questions or want to chat about data. 

Power BI – Back to the Future – The Time Dimension Table

In one of my previous RallyPoint articles, Star Schemas and Fact vs Dimension Tables, I introduced you to the Star Schema and the importance of Dimension Tables. In case you missed it or need a refresher, a Dimension Table is a static reference table used within a Star Schema data model to take the load off the main Fact Table (which contains specific event information).  

In order to fully maximize the use of this type of Data Model

we need multiple reference tables (Dimension Tables) spread around the Fact Table, and one of the most commonly required references is time. Almost every Fact Table will contain some type of “time” reference with time-specific data on when specific events occurred. While Power BI does have built-in date slicing functionality, we are limited to slicing only by Year > Quarter > Month > Day. It can also become difficult to compare rows between Fact Tables if there is no common Date Dimension to slice by. 
 
This issue can be solved quickly with an imported Power Query script specifically designed to give you your desired date range with the flexibility to drill down.  

First download the script below:

After that, choose the Power BI file you want the Time Dimension Table added to and add a Blank Query: 

Next, in the newly opened Blank Query, navigate to the View tab and click Advanced Editor. 

Following that, open the script in a text editor (such as notepad) and copy and paste the contents into the Advanced Editor. 

The only parts that require manual input are the FromYear and ToYear. Change these dates to suit your needs and click done.  
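The full script isn't reproduced here, but if you are curious what it does under the hood, a stripped-down date dimension query looks something like this (the real script has many more drilldown columns and its own column names):

let
    // Change these two values to suit your reporting range
    FromYear = 2018,
    ToYear = 2022,

    StartDate = #date(FromYear, 1, 1),
    EndDate = #date(ToYear, 12, 31),

    // One row per day between the two dates
    DayCount = Duration.Days(EndDate - StartDate) + 1,
    Dates = List.Dates(StartDate, DayCount, #duration(1, 0, 0, 0)),
    AsTable = Table.FromList(Dates, Splitter.SplitByNothing(), {"Date"}, null, ExtraValues.Error),
    Typed = Table.TransformColumnTypes(AsTable, {{"Date", type date}}),

    // Add the columns used for slicing and drilling down
    AddYear = Table.AddColumn(Typed, "Year", each Date.Year([Date]), Int64.Type),
    AddQuarter = Table.AddColumn(AddYear, "Quarter", each Date.QuarterOfYear([Date]), Int64.Type),
    AddMonth = Table.AddColumn(AddQuarter, "Month", each Date.Month([Date]), Int64.Type),
    AddMonthName = Table.AddColumn(AddMonth, "Month Name", each Date.MonthName([Date]), type text),
    AddDay = Table.AddColumn(AddMonthName, "Day", each Date.Day([Date]), Int64.Type)
in
    AddDay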


What you are left with is an extensive Time Dimension Table that acts as a centralized time slicer and can be utilized to provide in-depth analysis of your data! This may not seem important for your simple two-table dataset, but once you start to add complexity and additional data from other sources, the Time Dimension Table pays for itself! 

 
As always if you have any issues with this, or anything else data related, feel free to reach out! 

christopher-jaames.dennis@withyouwithme.com 

Data Analytics – Web Scraping (No Programming Required)

If you are like me and have completed the WYWM Data Analytics Pathway...

but you want to keep your skills sharp, then you need to get your hands on some data and get to work. The question is though, where are you getting your data? All the exercises and projects within the Pathway have the data already organized and provided for you. It is a nice luxury to have, but in the real world it is up to you, the Data Analyst, to find the data required to make that exciting, thought-provoking dashboard. 
 
I am still quite a junior in the DA world, and I am ashamed to admit that my Python skills are almost non-existent. I am trying to learn, but it is just taking a little longer than anticipated. So, when I had the idea to find some data relating to the Property Market in West Melbourne, I found myself stuck. I searched far and wide for Data Sets that would fit my needs, alas to no avail. 
 
That is when I stumbled across a solution to my problem: Web Scraping. Web Scraping is the process of reading the underlying HTML (Hypertext Markup Language) code of a website, extracting the parts that we need, then formatting the extracted data in such a way that we Data Analysts can go about our work. Now, this may be off-putting if you do not have any experience with HTML, but the solution I chose to use makes the process as easy as clicking a few buttons. A word of warning though: Web Scraping is not considered illegal in Australia; however, it can be in breach of a website's terms and conditions of use. As a general rule of thumb, if you are not using the data for commercial gain and only for personal use, you have nothing to worry about. 
 
The solution I chose was the Web Scraper – Free Web Scraping plugin for Chrome. This is a little-to-no code solution that allows us to set up our scrape by selecting which HTML elements we want the information from. 
 
The example I will be using is the West Melbourne Property Data, scraping from a real estate website. Some websites are quite protective of their data, evident by restrictions they put in place to stop people from extracting a complete dataset. The main issue I faced was that I could not view beyond 50 pages of property data and each page only contained 25 properties. To work around this, I set up multiple scrapes of more narrowed searches. For example, instead of searching for multiple suburbs, I would search for each individual suburb and do the scrape, then change the “sort” function and re-scrape to try and get a wider set of data. 

To get started

Head over to the Chrome Web Store and install the plugin here. After installing, navigate to the page you want to scrape, hit F12 on Windows (or Cmd+Option+I on Mac) to open the Developer Tools, and click on Web Scraper. 

For the first step, let’s create the Sitemap by giving it a title and copying in the URL of the website that we want to scrape. Since we want to scrape multiple pages, we will find the page number in the URL and replace it with square brackets and a range, e.g. [1-50]. Because I know that Real Estate.com won’t allow us to search beyond 50 pages, I changed the URL to search for pages 1 to 50. 

Once the Sitemap is good to go, we then need to set up our selector. The selector is the element of the HTML that the Web Scraper will take from each webpage. To do this, click the “Add new selector” button. From here, give the selector a descriptive name and set the “Type” to “Text”; this will pull the actual text that the HTML represents on the webpage. The next stage is to select the HTML that we want to scrape. You would usually make a selector for each piece of information you want to scrape and tie this back to a parent selector; however, in this instance, due to the way that Real Estate.com has formatted their HTML, it is easiest to scrape the whole ‘residential-card’ and then separate it in the data cleaning process. Upon selecting the HTML, also remove the address identifier to tell the Web Scraper to scrape all instances of the div.residential-card rather than just the one that we have physically selected. Select “Multiple” and hit Save. You can also hit the Data Preview button to make sure that you have indeed selected the correct data as intended. 

Now it is as simple as starting the scrape. The plugin will then open a mini-window and start querying the 50 pages we have selected.  

Upon completion, simply hit refresh in the original Web Scraper window (not the pop-up window) to display the data and then export as a CSV ready to clean. This sort of data is not normalized or uniform, so you will have to be creative with your cleaning. For this data, I found that not all listings had complete data. For example, some properties did not list a land size or have garages. I challenge you to try this out for yourself and see what sort of data you can scrape! Have fun! 
 
Thanks for reading and I hope this has a useful application in your Data Journey. Feel free to try it out yourself and reach out if you have any issues! 

Dimensional data models – Fact Tables vs Dimension Tables

There comes a time when your data outgrows its model

This will become apparent quite quickly if your idea of a data model is just one super-wide table. The concept of a data model is something that does take some getting used to; I still sometimes manage to get confused by it. A data model is a way of organizing your data so that, as your data set continues to grow over time, little to no manual action needs to be taken by the Data Analyst to incorporate it into the pre-existing file. There are a couple of different types of models that can be used; however, the most common one is the Star Schema model. 

The Star Schema is a desirable model as it promotes usability, performance and scalability, and allows for simpler DAX. Now, you might be like me when I first heard the term ‘Star Schema’ and think to yourself, “Sure, those are some nice words Chris, but why should I care? My data model works just fine the way it is right now.” And sure, you are ok for now, but what happens to your model once you add another 20k rows in a year's time? Will your model (or lack thereof) be able to handle it? 
 
The simple answer is no, even if it does hold up it will cause significant frustration throughout the process.  

As you can see in the above example, when the data is not modelled the table is too wide, making it quite difficult to understand. 
 
The schema is modelled off, you guessed it, a star, with tables categorized as Fact Tables in the middle of the model and Dimension Tables on the points on the outer side.  

A Fact Table is a table that contains specific event data; this can be transactional data such as sales data or appointment data. Each row in this table refers to an individual event, and there may be hundreds or thousands of these entries in any given Fact Table. The information contained in the Fact Table should be specifically tied to the event. An example of this would be a Property Sales table that contains relevant data tied to the sale of properties; only information directly relating to the sale of each property is included. 

 

A Dimension Table is a table containing information that is related to a business event; it is usually static in nature and does not change. These tables sit around the Fact Table and are used as reference tables. An example of this would be a Suburb table. This negates the need to include all the information relating to the suburb in the Fact Table, which would make it longer, harder to read and slower for Power BI to process. 

Dimension Tables include a single unique reference per row that refers to a column within the Fact Table. It is also important to note that Fact and Dimension Tables almost always connect to the opposite table type; it would be extremely rare to see two Fact Tables or two Dimension Tables linked to each other. 

The model is created this way to allow the Fact Tables to continue to grow without this growth inhibiting how we use the data. The Fact Table will reference the Dimension Table when it requires further information that is not held within its own table; this increases the performance and usability of the data model. 

These two types of tables are linked together by what is known as a Surrogate Key

A Surrogate Key is a unique identifier, common to the two table types, that is used to create the linkage between them. Fact Tables will contain multiple entries of the same Surrogate Key, whereas the Dimension Table will contain only one reference to that key. The Fact Table ‘looks up’ the Dimension Table for the extra information held within it relating to that key.  

The tables can then have different relationship types, dependent on the type of table and the direction of the relationship. For the above example, instead of having all the suburb information in the Property Sales Data table, we use a separate Dimension Table called Post Code to obtain all the relevant suburb information like postcode, country, state, etc. These two tables are linked by the Suburb Surrogate Key, with the relationship being Many to One.  

There are four different relationships that the tables can have with each other; these are: 

  1. Many to One - Many instances of the Surrogate Key on the first table and only one instance on the second table 
  2. One to One - One instance of the Surrogate Key on the first table and only one instance on the second table 
  3. One to Many - The same as Many to One but the other way around 
  4. Many to Many - Many instances of the Surrogate Key on the first table and many instances on the second table (it is recommended not to utilize this unless dictated by the complexity of the data model) 

You can also choose whether you want a relationship to transfer information in a single direction or both directions using the Cross Filter Direction setting. For example, if cross-filter directions are set to both, information can travel from table A to table B as well as from table B to table A.  However, if cross filter direction is set to single, information can only travel one way (from table A to table B OR from table B to table A, but not both)  

I hope this sheds some light on the distinct types of tables and their use within the Star Schema Model. It is something that seems foreign at first but once you put it into practice it will change the way you model your data forever. 
 
Next time I will show you how to develop your new Star Schema skills even further with the power of the Time Dimension Table. Feel free to reach out on Discord for any of your Data related needs! 
 
christopher-jaames.dennis@withyouwithme.com