Fisher’s Exact Test – with T-SQL and R

This post is a long-overdue second part to the post on the Chi Square test that I did a few months ago. Like that one, this post addresses relationships between two categorical variables, but in cases where data is sparse and some cell counts are less than 5. The Chi Square test is appropriate when counts are 5 or higher; but what if you have a problem with smaller numbers, and you want to find out whether the variables involved are connected at all? To study this I picked a simple example from our SQL Saturday data. I have a very simple matrix, shown below. It gives me the count of speakers by gender, separated into new (new to our event, not new entirely) and repeat (those who have been to our event before).

Speakers       New   Repeat   Row Total
Male             2       11          13
Female           1        2           3
Column Total     3       13          16

Step 1 – Set up the hypothesis: What is the question I am trying to answer? If I were to choose 3 new speakers at random, what is the probability that at least 1 of them will be a woman? A more simplified way of stating the same problem is: is there a correlation between gender and new speaker count? From a statistical perspective, the assumption to begin with is ‘no’ (also called the null hypothesis). If we disprove this statement, we prove the opposite – that there is a relationship. If not, there isn’t. So, putting it down:
H0, or null hypothesis: There is no statistically significant correlation between gender and new speaker count.
H1, or alternate hypothesis: There is a statistically significant correlation between gender and new speaker count.
What do both of these statements mean mathematically? In other words, on what basis do we make this decision? We will look at that in Step 3.

Step 2: Set up the appropriate test statistic: We choose Fisher’s test because of the sparse values we have, and also because our variables of choice are categorical.

Step 3: How do I decide?: The decision rule in two-sample tests of hypothesis depends on three factors:
1 Whether the test is upper-, lower- or two-tailed (that is, whether the comparison is on the greater, lesser or both sides)
2 The level of significance, or degree of accuracy needed
3 The form of the test statistic
Our test here is just to find out whether gender and speaker count are related, so it is a two-tailed test. For the level of significance we can use the most common 0.05 (a 95% confidence level), which is also the default in R for Fisher’s test. The form of the test statistic is the p-value. So our decision rule is: gender and speaker category are related if the p-value is less than 0.05.

Step 4: Calculation

Now, time to do the math. First, with R:

Input = ("Speaker  New  Repeat
Male      2    11
Female    1    2
")

TestData = as.matrix(read.table(textConnection(Input),
                                header = TRUE,
                                row.names = 1))

fisher.test(TestData, alternative = "two.sided")

[Image: fisher.test output – p-value = 0.4893]

R is telling us that the p-value is 0.4893 – way above 0.05. Hence, per our decision rule, the two variables are not correlated, based on the sparse data we have.
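As a sanity check on that number, the whole two-sided test can be reproduced with nothing but the Python standard library. This is my own minimal sketch of the textbook procedure (enumerate every table with the observed margins and sum the probabilities no larger than the observed table’s), independent of R:

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    n, row1, col1 = a + b + c + d, a + b, a + c

    def prob(x):
        # Hypergeometric probability of a table with x in the top-left cell,
        # margins held fixed
        return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    # Sum over every table whose probability does not exceed the observed one
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs + 1e-12)

print(round(fisher_two_sided(2, 11, 1, 2), 4))  # → 0.4893
```

With our speaker matrix this prints 0.4893, matching fisher.test.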

Now let us try the same thing with T-SQL. The calculation for Fisher’s test is rather elaborate when done manually – which is where you can appreciate how elegant and easy it is to use the built-in function in R. To do it otherwise, you need to not only code the calculation, but also come up with the other possible versions of the same matrix – those that add up to the same row and column totals. You then calculate the probability of each of them and sum the probabilities that are less than or equal to the ‘base probability’, the one we derive from the observed matrix. In this case we have 4 possible matrices, shown below with their probabilities (calculated with T-SQL).

[Image: the four possible matrices with the same row and column totals, and their probabilities]

T-SQL to calculate the probabilities: All probability-related math needs factorials. For this purpose I used a precomputed factorial table ([dbo].[Factorial]), built with the method described by Jeff Moden here.

DECLARE @newmen int , @newwomen int
, @repeatmen int , @repeatwomen int

DECLARE @pvalue numeric(18, 4)
DECLARE @numerator1 float,@numerator2 float,@numerator3 float,@numerator4 float,@numerator5 float 
DECLARE @denominator1 float,@denominator2 float,@denominator3 float,@denominator4 float,@denominator5 float

SELECT @newmen = 2, @newwomen = 1, @repeatmen = 11, @repeatwomen = 2


-- Numerators: factorials of the two row totals and the two column totals
SELECT @numerator1 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen + @newwomen)
SELECT @numerator2 = [n!] FROM [dbo].[Factorial] WHERE N = (@repeatmen + @repeatwomen)
SELECT @numerator3 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen + @repeatmen)
SELECT @numerator4 = [n!] FROM [dbo].[Factorial] WHERE N = (@newwomen + @repeatwomen)

-- Denominators: factorials of each cell and of the grand total
SELECT @denominator1 = [n!] FROM [dbo].[Factorial] WHERE N = @newmen
SELECT @denominator2 = [n!] FROM [dbo].[Factorial] WHERE N = @newwomen
SELECT @denominator3 = [n!] FROM [dbo].[Factorial] WHERE N = @repeatmen
SELECT @denominator4 = [n!] FROM [dbo].[Factorial] WHERE N = @repeatwomen
SELECT @denominator5 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen + @newwomen + @repeatmen + @repeatwomen)

-- Hypergeometric probability of this exact matrix
SELECT @pvalue = (@numerator1 * @numerator2 * @numerator3 * @numerator4)
               / (@denominator1 * @denominator2 * @denominator3 * @denominator4 * @denominator5)
SELECT 'Matrix 1 - Pcutoff' as Matrix, @pvalue as PValue


SELECT @newmen = 1, @newwomen = 2, @repeatmen = 12, @repeatwomen = 1

SELECT @numerator1 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen + @newwomen)
SELECT @numerator2 = [n!] FROM [dbo].[Factorial] WHERE N = (@repeatmen + @repeatwomen)
SELECT @numerator3 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen + @repeatmen)
SELECT @numerator4 = [n!] FROM [dbo].[Factorial] WHERE N = (@newwomen + @repeatwomen)

SELECT @denominator1 = [n!] FROM [dbo].[Factorial] WHERE N = @newmen
SELECT @denominator2 = [n!] FROM [dbo].[Factorial] WHERE N = @newwomen
SELECT @denominator3 = [n!] FROM [dbo].[Factorial] WHERE N = @repeatmen
SELECT @denominator4 = [n!] FROM [dbo].[Factorial] WHERE N = @repeatwomen
SELECT @denominator5 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen + @newwomen + @repeatmen + @repeatwomen)

SELECT @pvalue = (@numerator1 * @numerator2 * @numerator3 * @numerator4)
               / (@denominator1 * @denominator2 * @denominator3 * @denominator4 * @denominator5)
SELECT 'Matrix 2' as Matrix, @pvalue as PValue


SELECT @newmen = 3, @newwomen = 0, @repeatmen = 10, @repeatwomen = 3


SELECT @numerator1 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen + @newwomen)
SELECT @numerator2 = [n!] FROM [dbo].[Factorial] WHERE N = (@repeatmen + @repeatwomen)
SELECT @numerator3 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen + @repeatmen)
SELECT @numerator4 = [n!] FROM [dbo].[Factorial] WHERE N = (@newwomen + @repeatwomen)

SELECT @denominator1 = [n!] FROM [dbo].[Factorial] WHERE N = @newmen
SELECT @denominator2 = [n!] FROM [dbo].[Factorial] WHERE N = @newwomen
SELECT @denominator3 = [n!] FROM [dbo].[Factorial] WHERE N = @repeatmen
SELECT @denominator4 = [n!] FROM [dbo].[Factorial] WHERE N = @repeatwomen
SELECT @denominator5 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen + @newwomen + @repeatmen + @repeatwomen)

SELECT @pvalue = (@numerator1 * @numerator2 * @numerator3 * @numerator4)
               / (@denominator1 * @denominator2 * @denominator3 * @denominator4 * @denominator5)
SELECT 'Matrix 3' as Matrix, @pvalue as PValue

SELECT @newmen = 0, @newwomen = 3, @repeatmen = 13, @repeatwomen = 0


SELECT @numerator1 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen + @newwomen)
SELECT @numerator2 = [n!] FROM [dbo].[Factorial] WHERE N = (@repeatmen + @repeatwomen)
SELECT @numerator3 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen + @repeatmen)
SELECT @numerator4 = [n!] FROM [dbo].[Factorial] WHERE N = (@newwomen + @repeatwomen)

SELECT @denominator1 = [n!] FROM [dbo].[Factorial] WHERE N = @newmen
SELECT @denominator2 = [n!] FROM [dbo].[Factorial] WHERE N = @newwomen
SELECT @denominator3 = [n!] FROM [dbo].[Factorial] WHERE N = @repeatmen
SELECT @denominator4 = [n!] FROM [dbo].[Factorial] WHERE N = @repeatwomen
SELECT @denominator5 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen + @newwomen + @repeatmen + @repeatwomen)

SELECT @pvalue = (@numerator1 * @numerator2 * @numerator3 * @numerator4)
               / (@denominator1 * @denominator2 * @denominator3 * @denominator4 * @denominator5)
SELECT 'Matrix 4' as Matrix, @pvalue as PValue

The response we get is as below.

[Image: T-SQL output – the probabilities of the four matrices]

If we sum the values that are less than or equal to the base value 0.4179, we get 0.4179 + 0.0696 + 0.0018 = 0.4893 – exactly what we got from the R function.
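To double-check that arithmetic, here is a minimal Python sketch (my own illustration, not part of the original post) that applies the same factorial formula the T-SQL uses to the four matrices and sums the qualifying probabilities:

```python
from math import factorial as f

def table_prob(a, b, c, d):
    """P = (row1! * row2! * col1! * col2!) / (N! * a! * b! * c! * d!)"""
    return (f(a + b) * f(c + d) * f(a + c) * f(b + d)) / (
        f(a + b + c + d) * f(a) * f(b) * f(c) * f(d))

# The four matrices that share the observed row and column totals,
# as (new men, repeat men, new women, repeat women)
matrices = [(2, 11, 1, 2), (1, 12, 2, 1), (3, 10, 0, 3), (0, 13, 3, 0)]
probs = [table_prob(*m) for m in matrices]
base = probs[0]                        # observed matrix, ~0.4179
p_value = sum(p for p in probs if p <= base)
print(round(p_value, 4))               # → 0.4893
```

The matrix (3, 10, 0, 3) has probability ~0.5107, above the base, so it is excluded from the sum, just as in the T-SQL results.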

Step 5: Conclusion: Since 0.4893 is greater than our cutoff of 0.05, the decision rule is not met. In other words, we fail to reject the null hypothesis from Step 1: there is no statistically significant correlation between these two variables.

So we can logically conclude that, based on the data we have, there is not enough evidence that speaker gender and new/repeat status are related. Thanks for reading!!

SQL Saturday Louisville precon – an interview with Andy Leonard

This will be year #9 of SQL Saturday in Louisville. Every year (starting with the 3rd or 4th), it has been a tradition to do ‘precons’ on Fridays. For those who don’t know – precons are day-long paid trainings by an expert in the subject, held on the Friday before the event. We have heard repeatedly from many attendees that precons are their best bet at solid training – they are local, priced low and taught by some of the best teachers there are. Quite a number of attendees come to precons on their own dime to learn more. In short, it is impossible to overstate the value a precon adds to the community.

We try to have 2 or 3 precons on different subject areas. This year, we have a precon on query tuning by Grant Fritchey, SSIS/Biml by Andy Leonard and building modern data architecture by Joshua Luedemann. Karla Landrum, the queen of SQL Saturdays, has been doing a blog interview with each precon presenter for SQL Saturday Pensacola. I thought this was a great way to publicize both the event and the presenters, so I am doing it for our events too. This post is an interview with Andy Leonard.

If you are a serious BI/SSIS professional, it is impossible not to know Andy. He is a highly respected person in the BI community – a technical guru and a former MVP who voluntarily gave up his title to enable more new people to get it. I am really glad to have him do a precon for us this year. Below is a short interview with him.

1. Give us a brief background/history of your work with sql server.
I started working with SQL Server in the late 1990s. I was building electrical control systems and human-machine interfaces (HMIs) for manufacturing back then. To stress-test an HMI, I set the data-acquisition frequency to one second for every tag (data-collection point) and left for a 3-day weekend. When I returned I tried to open a 3GB Access file and… nothing happened. The stress test succeeded in letting me know I’d need a bigger database engine. I found SQL Server 6.5 and it passed that same test. Problem solved.

2. What are the top 3 compelling reasons you’d give to someone for attending your precon – ‘intelligent-data-integration-ssis-design-patterns-and-biml’?
Automation is happening all around us. The days of being paid to copy, paste, and edit repetitive SSIS packages are fading fast; as are the days of manually building and deploying configuration scripts between SSIS Catalogs. Attendees will learn:
1) How to use Business Intelligence Markup Language (Biml) to automate the development of repetitive SSIS solutions;
2) How to leverage SSIS Design Patterns to improve SSIS package performance; and
3) How to combine SSIS Design Patterns and Biml to manage enterprise data integration. I’ll even throw in some demos of the free tools and utilities available at DILM Suite – no extra charge!

3. We are going through a lot of changes in the database world. There are a lot of skills required to sustain ourselves as data/BI professionals. Why do you think SSIS/Biml are among them – why is it so important to add them to our skills as BI professionals?

All of the changes in data/BI are driven by economies of scale. Automation is driving down the costs of data integration and management. The net result is data integration developers can now deliver more in less time, with improved quality. There is benefit in spotting a trend and getting in early (ask me how I know!). Data integration automation with Biml is new-enough that one may still get in “early.”

4. What are a few reasons you’d give to someone for attending SQL Saturdays?
SQL Saturdays are awesome for so many reasons! My #1 reason for supporting SQL Saturdays is the networking opportunities they represent. I love the local-ness of the event. I love that SQL Saturdays offer so many in our community the opportunity to present their ideas and solutions to an audience for the very first time. And I love that SQL Saturdays introduce so many to our community. As I tell my lovely bride, Christy, attending a SQL Saturday is just like hanging out with family, except I don’t have to explain acronyms!

5. What do you like to do for fun/relaxation?

I like to read for relaxation. I read science fiction series – some old, others new. I really enjoyed The Expanse novels and I think the television series is doing a good job capturing the story line. I read business books (and listen to business audio books) because I’m interested in becoming a better businessperson and leader. I read (and listen to) books about theology and Christianity because I want to be a better husband, father, grandfather, and person.

I hope you enjoyed reading, and I hope it helps you decide to sign up for his precon!! Thank you.

SQL Saturdays – down memory lane

A casual Twitter conversation with Karla Landrum and some other peeps led me down memory lane on older events. Our SQL Saturday at Louisville will be 9 years old this year. We were event #23, in 2009. SQL Saturdays started two years before that, in 2007.

Our first event was held at a training center – 2 tracks, 6 speakers, 29 sessions submitted, 3 sponsors and about 60 attendees. We outgrew that location the very next year. Our present event has 6 tracks, close to 300 attendees, and 109 sessions submitted so far.

The 22 events before us were as below:
1 Orlando, FL
2 Tampa, FL
3 Jacksonville, FL
4 Orlando, FL
5 Olympia, WA
6 Cleveland, OH (did not actually happen).
7 Birmingham, AL
8 Orlando, FL
9 Greenville, SC
10 Tampa, FL
11 Tacoma, WA
12 Portland, OR
13 Alpharetta, GA
14 Pensacola, FL
15 Jacksonville, FL
16 South Florida
17 Baton Rouge, LA
18 Orlando, FL
19 East Iowa, IA
20 Jacksonville, FL
21 Orlando, FL
22 Pensacola, FL

A lot of the Florida events are past their 10-year anniversary, and many others will hit theirs this year or next. That means 10+ years of free training for many, networking opportunities, small businesses that have profited by providing services, and vendors who have <hopefully> found more customers. If you attend any of these events, make sure to thank the organizers – an event is a LOT of work to put on, and doing it for 10+ years is no mean achievement; it takes considerable motivation and hard work. Some of my personal favorite memories from 9 years of running this include:

1 Once, the breakfast/coffee delivery never arrived. This is probably the biggest thing I remember going wrong during my decade of running the event. The food vendor had an employee who was new to town and made his delivery somewhere else (pre-GPS days). I still recall that frantic morning, with upset speakers and repeated calls to the food vendor.

2 One of the free locations that hosted our event once threatened to cancel on us on the Friday before. The reason given was that there was ‘an inch of snow’ on the ground and they did not want to risk anybody’s safety. I was on my way to the speaker dinner and had to turn around to talk to them and convince them otherwise. One inch of snow is a big deal for some people. My team and the only volunteer we have left from those days, Deana, have stories about planting signs on the road on a frozen morning. Needless to say, we never held an event in winter again.

3 We had 8 tracks at one event. A new speaker doing her first talk had nobody show up at her class; she was in tears. We have never overdone the number of tracks since.

4 Among my other favorite (smaller) memories of the decade:
1 A DBA who was also a new mom attended the WIT session we had with Kevin Kline and Karen Lopez. She had been close to quitting her job and decided to stay on after she heard them.
2 One of my events happened to fall on my birthday. Some of the attendees got a big cake, and I had ‘happy birthday’ sung to me by hundreds of people.
3 Wendy Pastrick, one of the PASS board members, praised our event as among the best-organized smaller events.
4 Tim Ford convinced me to attend SQL Cruise during one of my events. I’ve attended a cruise every year since then.
5 Hearing attendees say ‘do you remember, 5 years ago we came here…’ – I never tire of that, ever.

Thank you to all the organizers of the events above for your dedication and hard work – and I hope to keep this going as long as we can!! If you are an organizer of any of the above events, do write more about your favorite memories!!

Thanks for reading.