To understand this better, I am going to use the same example that I briefly referenced in my earlier post.

In this case the risk of a patient treated with Drug A developing asthma is 56/100 = 0.56. The risk of a patient treated with Drug B developing asthma is 32/100 = 0.32. So the relative risk is 0.56/0.32, which is 1.75. Absolute risk is another term, used here for the difference between the two probabilities (0.56 - 0.32 = 0.24). Some posts argue that absolute risk should be used when comparing two medications and relative risk when comparing one medication against none at all, but this is not a hard rule and there are many variations.
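The arithmetic above can be checked with a few lines of R. This is only a minimal sketch; the variable names are made up for illustration and the counts are the ones quoted in the paragraph.

```r
# Counts from the combined table: 56 of 100 patients on Drug A and
# 32 of 100 patients on Drug B developed asthma
risk_a <- 56 / 100
risk_b <- 32 / 100

relative_risk   <- risk_a / risk_b   # ratio of the two risks
risk_difference <- risk_a - risk_b   # difference between the two risks

round(c(relative_risk = relative_risk, risk_difference = risk_difference), 2)
```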

This Wikipedia article has a great summary of relative risk. Make sure to read the page it links to on absolute risk as well.

Now, applying relative risk to the problem we were trying to solve in the earlier post – we have two groups of data as below.

The relative risk in the first group is (32/40)/(24/60) = 2. In the second group it is (24/60)/(8/40) = 2. So logically, when we combine (add) the two groups, we should still get a relative risk of 2. But we get 1.75, as we saw with the first set of data above. The reason for that skew is the age factor, which acts as a confounding variable. In the earlier post we used the Cochran-Mantel-Haenszel test to mitigate the effect of age when calculating the chi-squared statistic and p-value for the same data. We can use the same approach to calculate a relative risk that is adjusted for age. The formula for doing this is as below (with due thanks to the textbook Introductory Applied Biostatistics).
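To make the skew concrete, here is a small R sketch that computes the relative risk for each age group and for the combined table. The helper name `rr_from_table` is hypothetical (it is not a built-in function); the counts are the ones used in this post.

```r
# Hypothetical helper: relative risk (Drug A vs Drug B) from a 2x2 table
# with rows = drug (A, B) and columns = response (Yes, No)
rr_from_table <- function(tab) {
  risk <- tab[, "Yes"] / rowSums(tab)   # risk of asthma for each drug
  unname(risk["A"] / risk["B"])
}

labels <- list(c("A", "B"), c("Yes", "No"))
group1 <- matrix(c(32, 8, 24, 36), nrow = 2, byrow = TRUE, dimnames = labels)
group2 <- matrix(c(24, 36, 8, 32), nrow = 2, byrow = TRUE, dimnames = labels)

rr_group1 <- rr_from_table(group1)            # (32/40) / (24/60)
rr_group2 <- rr_from_table(group2)            # (24/60) / (8/40)
rr_pooled <- rr_from_table(group1 + group2)   # (56/100) / (32/100)

c(group1 = rr_group1, group2 = rr_group2, pooled = rr_pooled)
```

Both groups give the same ratio, yet the pooled table does not, because the two drugs were given to the two age groups in different proportions.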

Using the formula on the data in T-SQL (you can find the dataset to use here) –

declare @Batch1DrugAYes numeric(18,2), @Batch1DrugANo numeric(18,2), @Batch1DrugBYes numeric(18,2), @Batch1DrugBNo numeric(18,2)
declare @Batch2DrugAYes numeric(18,2), @Batch2DrugANo numeric(18,2), @Batch2DrugBYes numeric(18,2), @Batch2DrugBNo numeric(18,2)
declare @riskratio numeric(18,2), @riskrationumerator numeric(18,2), @riskratiodenom numeric(18,2)
declare @totalcount numeric(18,2)

-- Size of one batch (age group); both batches in this dataset hold 100 rows
SELECT @totalcount = count(*) FROM [dbo].[DrugResponse] WHERE batch = 1

-- Cell counts for batch 1
SELECT @Batch1DrugAYes = count(*) FROM [dbo].[DrugResponse] WHERE batch = 1 AND drug = 'A' AND response = 'Y'
SELECT @Batch1DrugANo = count(*) FROM [dbo].[DrugResponse] WHERE batch = 1 AND drug = 'A' AND response = 'N'
SELECT @Batch1DrugBYes = count(*) FROM [dbo].[DrugResponse] WHERE batch = 1 AND drug = 'B' AND response = 'Y'
SELECT @Batch1DrugBNo = count(*) FROM [dbo].[DrugResponse] WHERE batch = 1 AND drug = 'B' AND response = 'N'

-- Cell counts for batch 2
SELECT @Batch2DrugAYes = count(*) FROM [dbo].[DrugResponse] WHERE batch = 2 AND drug = 'A' AND response = 'Y'
SELECT @Batch2DrugANo = count(*) FROM [dbo].[DrugResponse] WHERE batch = 2 AND drug = 'A' AND response = 'N'
SELECT @Batch2DrugBYes = count(*) FROM [dbo].[DrugResponse] WHERE batch = 2 AND drug = 'B' AND response = 'Y'
SELECT @Batch2DrugBNo = count(*) FROM [dbo].[DrugResponse] WHERE batch = 2 AND drug = 'B' AND response = 'N'

-- Numerator: sum over batches of (Drug A 'Y' count * Drug B total) / batch size
SELECT @riskrationumerator = (@Batch1DrugAYes*(@Batch1DrugBYes+@Batch1DrugBNo))/@totalcount
SELECT @riskrationumerator = @riskrationumerator + (@Batch2DrugAYes*(@Batch2DrugBYes+@Batch2DrugBNo))/@totalcount

-- Denominator: sum over batches of (Drug B 'Y' count * Drug A total) / batch size
SELECT @riskratiodenom = (@Batch1DrugBYes*(@Batch1DrugAYes+@Batch1DrugANo))/@totalcount
SELECT @riskratiodenom = @riskratiodenom + (@Batch2DrugBYes*(@Batch2DrugAYes+@Batch2DrugANo))/@totalcount

--SELECT @riskratiodenom
--SELECT @riskrationumerator,@riskratiodenom
SELECT 'Adjusted Risk Ratio: ', @riskrationumerator/@riskratiodenom

We can write code in R to achieve the above result; there is no built-in function to do this as far as I can see. But when we can write simpler code in T-SQL, I was not sure it was worth the trouble for this particular case. We have at least one scenario where we can do something easily in T-SQL that R does not seem to have built in. I certainly enjoyed that feeling! Thanks for reading.
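For readers who would still like to see the calculation in R, here is a minimal sketch that mirrors the T-SQL arithmetic step by step. The variable names are my own, and the formula is the one implemented in the T-SQL above (per group, Drug A "Yes" count times Drug B total divided by group size, over the same quantity with the drugs swapped); it is not a packaged function.

```r
# 2 x 2 x 2 array: rows = drug (A, B), columns = response (Yes, No),
# third dimension = age group. Values are filled column by column.
strata <- array(c(32, 24, 8, 36,    # age group 1
                  24, 8, 36, 32),   # age group 2
                dim = c(2, 2, 2),
                dimnames = list(drug = c("A", "B"),
                                response = c("Yes", "No"),
                                group = c("1", "2")))

a_yes <- strata["A", "Yes", ]       # Drug A patients who developed asthma, per group
b_yes <- strata["B", "Yes", ]       # Drug B patients who developed asthma, per group
n_a   <- colSums(strata["A", , ])   # Drug A totals per group
n_b   <- colSums(strata["B", , ])   # Drug B totals per group
n     <- n_a + n_b                  # group sizes

# Adjusted risk ratio: same numerator and denominator as the T-SQL version
rr_adjusted <- sum(a_yes * n_b / n) / sum(b_yes * n_a / n)
rr_adjusted
```

Because the array is indexed by group, the same lines work unchanged if more age groups are added as further slices.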


Let us consider a cohort study as an example – we have two medications A and B to treat asthma. We test them on a randomly selected batch of 200 people. Half of them receive drug A and half of them receive drug B. Some of them in either half develop asthma and some have it under control. The data set I have used can be found here. The summarized results are as below.

To understand this data better, let us look at a very important statistical term: relative risk. It is the ratio of two probabilities, that is, the risk of a patient developing asthma with medication A divided by the risk of a patient developing asthma with medication B. In this case the risk of a patient treated with Drug A developing asthma is 56/100 = 0.56. The risk of a patient treated with Drug B developing asthma is 32/100 = 0.32. So the relative risk is 0.56/0.32, which is 1.75.

Let us assume a hypothesis, then, that there is no significant difference in developing asthma from taking drug A versus taking drug B. In other words, the risk from the two medications is the same, so their ratio (the relative risk) is 1. We can test this hypothesis using the chi-square test in R. (If you want the long-winded T-SQL way of doing the chi-square test, refer to my blog post here. The goal of this post is to go further than that, so I am not repeating it in T-SQL and am just using R for this step.)

mymatrix3 <- matrix(c(56,44,32,68),nrow=2,byrow=TRUE)
colnames(mymatrix3) <- c("Yes","No")
rownames(mymatrix3) <- c("A", "B")
chisq.test(mymatrix3)

Since the p-value is well below 0.05, we reject the null hypothesis at the 5% significance level and conclude that the two medications have different effects, not the same.
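The decision itself can also be written down in R instead of being read off the printed output. This is a sketch with illustrative names (`combined`, `alpha`, `reject_null`); it relies only on the `statistic` and `p.value` elements that `chisq.test` returns.

```r
# Combined 2x2 table: rows = drug, columns = developed asthma (Yes/No)
combined <- matrix(c(56, 44, 32, 68), nrow = 2, byrow = TRUE,
                   dimnames = list(c("A", "B"), c("Yes", "No")))

# For 2x2 tables chisq.test applies Yates' continuity correction by default
result <- chisq.test(combined)

alpha <- 0.05                             # significance level used in this post
reject_null <- result$p.value < alpha     # TRUE means we reject the null hypothesis

result$statistic
result$p.value
reject_null
```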

Now, let us take it one step further. Inspection of the data reveals that people selected randomly for the test fall broadly into two age groups, below 65 and over or equal to 65. Let us call these two age groups 1 and 2. If we separate the data into these two groups it looks like this.

Running the chi square test on both of these datasets, we get results like this :

mymatrix1 <- matrix(c(32,8,24,36),nrow=2,byrow=TRUE)
colnames(mymatrix1) <- c("Yes","No")
rownames(mymatrix1) <- c("A","B")
chisq.test(mymatrix1)
mymatrix2 <- matrix(c(24,36,8,32),nrow=2,byrow=TRUE)
colnames(mymatrix2) <- c("Yes","No")
rownames(mymatrix2) <- c("A", "B")
myarray <- array(c(mymatrix1,mymatrix2),dim=c(2,2,2))
chisq.test(mymatrix2)

In the second dataset, for people of age group < 65, we can see that the p-value is greater than 0.05, so we fail to reject the null hypothesis for that group. In other words, when the data is split into two groups based on age, the original conclusion does not hold for both groups. Age, in this case, becomes the confounding variable, the variable that changes the conclusions we draw from the dataset. The chi-square tests show results that take the age variable into account. These results are not wrong, but they do not tell us whether the two variables we are specifically interested in, drug used and nature of response, are related.

To test for the independence of two variables and mute the effect of the confounding variable with repeated measurements, the Cochran-Mantel-Haenszel test can be used. If you have a matrix as below, the formula for the chi-squared (X-squared) statistic and p-value for this test is

The above image is used with thanks from the book Introductory Applied Biostatistics.
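Since the formula is easier to follow with numbers in it, here is a sketch in R of the usual Cochran-Mantel-Haenszel statistic for a set of 2x2 tables, with the 0.5 continuity correction. I am assuming this is the same form as the textbook formula referenced above; the variable names are my own.

```r
# 2 x 2 x 2 array: rows = drug (A, B), columns = response (Yes, No),
# third dimension = age group (values filled column by column)
strata <- array(c(32, 24, 8, 36,
                  24, 8, 36, 32),
                dim = c(2, 2, 2))

a  <- strata[1, 1, ]                     # Drug A / Yes cell in each group
n1 <- strata[1, 1, ] + strata[1, 2, ]    # Drug A row totals
n2 <- strata[2, 1, ] + strata[2, 2, ]    # Drug B row totals
m1 <- strata[1, 1, ] + strata[2, 1, ]    # "Yes" column totals
m2 <- strata[1, 2, ] + strata[2, 2, ]    # "No" column totals
n  <- n1 + n2                            # group sizes

expected_a <- n1 * m1 / n                          # expected count under independence
variance_a <- n1 * n2 * m1 * m2 / (n^2 * (n - 1))  # variance of that cell

# Continuity-corrected statistic: (|sum of observed - expected| - 0.5)^2 / sum of variances
cmh <- (abs(sum(a - expected_a)) - 0.5)^2 / sum(variance_a)
cmh
```

Note that `a - expected_a` equals (ad - bc)/n in each group, which is the form used in the T-SQL version.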

**Using T-SQL:** I used the same formula as the textbook and only added the continuity correction of 0.5 to the numerator, since R applies the correction automatically and we want to compare results with R. (Disclaimer: I have intentionally done this step by step for the sake of clarity and have not tried to optimize the T-SQL by doing it the shortest way. It is my humble opinion that these calculations are best done using R; T-SQL is a backup method and a good means to understand what goes on behind the formula, nothing more.)

declare @Batch1DrugAYes numeric(18,2), @Batch1DrugANo numeric(18,2), @Batch1DrugBYes numeric(18,2), @Batch1DrugBNo numeric(18,2)
declare @Batch2DrugAYes numeric(18,2), @Batch2DrugANo numeric(18,2), @Batch2DrugBYes numeric(18,2), @Batch2DrugBNo numeric(18,2)
declare @xsquared numeric(18,2), @xsquarednumerator numeric(18,2), @xsquareddenom numeric(18,2)
declare @totalcount numeric(18,2)

-- Size of one batch (age group); both batches in this dataset hold 100 rows
SELECT @totalcount = count(*) FROM [dbo].[DrugResponse] WHERE batch = 1

-- Cell counts for batch 1
SELECT @Batch1DrugAYes = count(*) FROM [dbo].[DrugResponse] WHERE batch = 1 AND drug = 'A' AND response = 'Y'
SELECT @Batch1DrugANo = count(*) FROM [dbo].[DrugResponse] WHERE batch = 1 AND drug = 'A' AND response = 'N'
SELECT @Batch1DrugBYes = count(*) FROM [dbo].[DrugResponse] WHERE batch = 1 AND drug = 'B' AND response = 'Y'
SELECT @Batch1DrugBNo = count(*) FROM [dbo].[DrugResponse] WHERE batch = 1 AND drug = 'B' AND response = 'N'

-- Cell counts for batch 2
SELECT @Batch2DrugAYes = count(*) FROM [dbo].[DrugResponse] WHERE batch = 2 AND drug = 'A' AND response = 'Y'
SELECT @Batch2DrugANo = count(*) FROM [dbo].[DrugResponse] WHERE batch = 2 AND drug = 'A' AND response = 'N'
SELECT @Batch2DrugBYes = count(*) FROM [dbo].[DrugResponse] WHERE batch = 2 AND drug = 'B' AND response = 'Y'
SELECT @Batch2DrugBNo = count(*) FROM [dbo].[DrugResponse] WHERE batch = 2 AND drug = 'B' AND response = 'N'

-- Numerator: sum over batches of (a*d - b*c)/n, then apply the 0.5 continuity correction and square
SELECT @xsquarednumerator = ((@Batch1DrugAYes*@Batch1DrugBNo) - (@Batch1DrugANo*@Batch1DrugBYes))/@totalcount
SELECT @xsquarednumerator = @xsquarednumerator + ((@Batch2DrugAYes*@Batch2DrugBNo) - (@Batch2DrugANo*@Batch2DrugBYes))/@totalcount
-- ABS keeps the correction valid when the summed difference is negative
SELECT @xsquarednumerator = SQUARE(ABS(@xsquarednumerator)-0.5)

-- Denominator: sum over batches of (row totals * column totals) / (n^2 * (n-1))
SELECT @xsquareddenom = ((@Batch1DrugAYes+@Batch1DrugANo)*(@Batch1DrugBYes+@Batch1DrugBNo)*(@Batch1DrugAYes+@Batch1DrugBYes)*(@Batch1DrugANo+@Batch1DrugBNo))/(SQUARE(@totalcount)*(@totalcount-1))
SELECT @xsquareddenom = @xsquareddenom + ((@Batch2DrugAYes+@Batch2DrugANo)*(@Batch2DrugBYes+@Batch2DrugBNo)*(@Batch2DrugAYes+@Batch2DrugBYes)*(@Batch2DrugANo+@Batch2DrugBNo))/(SQUARE(@totalcount)*(@totalcount-1))

--SELECT @xsquareddenom
--SELECT @xsquarednumerator,@xsquareddenom
SELECT 'Chi Squared: ', @xsquarednumerator/@xsquareddenom

We get a chi-square value of 17.17. With T-SQL it is hard to take it further than that, so we have to plug this value into a calculator to get the corresponding p-value. The p-value is 3.5E-05. The result is significant at p < 0.05. What this means in layman's terms is that the responses to the two drugs differ in a statistically significant way, so we reject the null hypothesis that says they are statistically the same.
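If a calculator is not at hand, the p-value for a chi-square statistic can also be looked up in R with `pchisq`. The sketch below assumes 1 degree of freedom, which is what applies when each group is a 2x2 table; the variable names are illustrative.

```r
chi_squared <- 17.17   # value produced by the T-SQL calculation above

# Upper-tail probability of a chi-square distribution with 1 degree of freedom
p_value <- pchisq(chi_squared, df = 1, lower.tail = FALSE)

p_value          # on the order of 1e-05
p_value < 0.05   # TRUE: significant at the 5% level
```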

Trying to do the same thing in R is very easy –

mymatrix1 <- matrix(c(32,8,24,36),nrow=2,byrow=TRUE)
colnames(mymatrix1) <- c("Yes","No")
rownames(mymatrix1) <- c("A","B")
mymatrix2 <- matrix(c(24,36,8,32),nrow=2,byrow=TRUE)
colnames(mymatrix2) <- c("Yes","No")
rownames(mymatrix2) <- c("A", "B")
myarray <- array(c(mymatrix1,mymatrix2),dim=c(2,2,2))
mantelhaen.test(myarray)

The results are as below and almost identical to what we found with T-SQL. Hence the conclusion drawn is valid: the responses to the two drugs differ regardless of age.

In the next post I will cover the calculation of relative risk with this method. Thank you for reading.


USE [yourdb]
GO
/****** Object: Table [dbo].[DrugResponse] Script Date: 6/12/2017 6:45:46 AM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[DrugResponse](
[seqno] [int] IDENTITY(1,1) NOT NULL,
[Batch] [smallint] NOT NULL,
[Drug] [char](1) NOT NULL,
[Response] [char](1) NOT NULL,
CONSTRAINT [PK_DrugResponse] PRIMARY KEY CLUSTERED ([seqno] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
SET IDENTITY_INSERT [dbo].[DrugResponse] ON
GO
-- Batch 1, Drug A, response Y (seqno 1 to 32)
INSERT [dbo].[DrugResponse] ([seqno], [Batch], [Drug], [Response]) VALUES
(1, 1, N'A', N'Y'), (2, 1, N'A', N'Y'), (3, 1, N'A', N'Y'), (4, 1, N'A', N'Y'), (5, 1, N'A', N'Y'), (6, 1, N'A', N'Y'), (7, 1, N'A', N'Y'), (8, 1, N'A', N'Y'),
(9, 1, N'A', N'Y'), (10, 1, N'A', N'Y'), (11, 1, N'A', N'Y'), (12, 1, N'A', N'Y'), (13, 1, N'A', N'Y'), (14, 1, N'A', N'Y'), (15, 1, N'A', N'Y'), (16, 1, N'A', N'Y'),
(17, 1, N'A', N'Y'), (18, 1, N'A', N'Y'), (19, 1, N'A', N'Y'), (20, 1, N'A', N'Y'), (21, 1, N'A', N'Y'), (22, 1, N'A', N'Y'), (23, 1, N'A', N'Y'), (24, 1, N'A', N'Y'),
(25, 1, N'A', N'Y'), (26, 1, N'A', N'Y'), (27, 1, N'A', N'Y'), (28, 1, N'A', N'Y'), (29, 1, N'A', N'Y'), (30, 1, N'A', N'Y'), (31, 1, N'A', N'Y'), (32, 1, N'A', N'Y')
-- Batch 1, Drug A, response N (seqno 33 to 40)
INSERT [dbo].[DrugResponse] ([seqno], [Batch], [Drug], [Response]) VALUES
(33, 1, N'A', N'N'), (34, 1, N'A', N'N'), (35, 1, N'A', N'N'), (36, 1, N'A', N'N'), (37, 1, N'A', N'N'), (38, 1, N'A', N'N'), (39, 1, N'A', N'N'), (40, 1, N'A', N'N')
-- Batch 1, Drug B, response Y (seqno 41 to 64)
INSERT [dbo].[DrugResponse] ([seqno], [Batch], [Drug], [Response]) VALUES
(41, 1, N'B', N'Y'), (42, 1, N'B', N'Y'), (43, 1, N'B', N'Y'), (44, 1, N'B', N'Y'), (45, 1, N'B', N'Y'), (46, 1, N'B', N'Y'), (47, 1, N'B', N'Y'), (48, 1, N'B', N'Y'),
(49, 1, N'B', N'Y'), (50, 1, N'B', N'Y'), (51, 1, N'B', N'Y'), (52, 1, N'B', N'Y'), (53, 1, N'B', N'Y'), (54, 1, N'B', N'Y'), (55, 1, N'B', N'Y'), (56, 1, N'B', N'Y'),
(57, 1, N'B', N'Y'), (58, 1, N'B', N'Y'), (59, 1, N'B', N'Y'), (60, 1, N'B', N'Y'), (61, 1, N'B', N'Y'), (62, 1, N'B', N'Y'), (63, 1, N'B', N'Y'), (64, 1, N'B', N'Y')
-- Batch 1, Drug B, response N (seqno 65 to 100)
INSERT [dbo].[DrugResponse] ([seqno], [Batch], [Drug], [Response]) VALUES
(65, 1, N'B', N'N'), (66, 1, N'B', N'N'), (67, 1, N'B', N'N'), (68, 1, N'B', N'N'), (69, 1, N'B', N'N'), (70, 1, N'B', N'N'), (71, 1, N'B', N'N'), (72, 1, N'B', N'N'),
(73, 1, N'B', N'N'), (74, 1, N'B', N'N'), (75, 1, N'B', N'N'), (76, 1, N'B', N'N'), (77, 1, N'B', N'N'), (78, 1, N'B', N'N'), (79, 1, N'B', N'N'), (80, 1, N'B', N'N'),
(81, 1, N'B', N'N'), (82, 1, N'B', N'N'), (83, 1, N'B', N'N'), (84, 1, N'B', N'N'), (85, 1, N'B', N'N'), (86, 1, N'B', N'N'), (87, 1, N'B', N'N'), (88, 1, N'B', N'N'),
(89, 1, N'B', N'N'), (90, 1, N'B', N'N'), (91, 1, N'B', N'N'), (92, 1, N'B', N'N'), (93, 1, N'B', N'N'), (94, 1, N'B', N'N'), (95, 1, N'B', N'N'), (96, 1, N'B', N'N'),
(97, 1, N'B', N'N'), (98, 1, N'B', N'N'), (99, 1, N'B', N'N'), (100, 1, N'B', N'N')
-- Batch 2, Drug A, response Y (seqno 101 to 124)
INSERT [dbo].[DrugResponse] ([seqno], [Batch], [Drug], [Response]) VALUES
(101, 2, N'A', N'Y'), (102, 2, N'A', N'Y'), (103, 2, N'A', N'Y'), (104, 2, N'A', N'Y'), (105, 2, N'A', N'Y'), (106, 2, N'A', N'Y'), (107, 2, N'A', N'Y'), (108, 2, N'A', N'Y'),
(109, 2, N'A', N'Y'), (110, 2, N'A', N'Y'), (111, 2, N'A', N'Y'), (112, 2, N'A', N'Y'), (113, 2, N'A', N'Y'), (114, 2, N'A', N'Y'), (115, 2, N'A', N'Y'), (116, 2, N'A', N'Y'),
(117, 2, N'A', N'Y'), (118, 2, N'A', N'Y'), (119, 2, N'A', N'Y'), (120, 2, N'A', N'Y'), (121, 2, N'A', N'Y'), (122, 2, N'A', N'Y'), (123, 2, N'A', N'Y'), (124, 2, N'A', N'Y')
-- Batch 2, Drug A, response N (seqno 125 to 160)
INSERT [dbo].[DrugResponse] ([seqno], [Batch], [Drug], [Response]) VALUES
(125, 2, N'A', N'N'), (126, 2, N'A', N'N'), (127, 2, N'A', N'N'), (128, 2, N'A', N'N'), (129, 2, N'A', N'N'), (130, 2, N'A', N'N'), (131, 2, N'A', N'N'), (132, 2, N'A', N'N'),
(133, 2, N'A', N'N'), (134, 2, N'A', N'N'), (135, 2, N'A', N'N'), (136, 2, N'A', N'N'), (137, 2, N'A', N'N'), (138, 2, N'A', N'N'), (139, 2, N'A', N'N'), (140, 2, N'A', N'N'),
(141, 2, N'A', N'N'), (142, 2, N'A', N'N'), (143, 2, N'A', N'N'), (144, 2, N'A', N'N'), (145, 2, N'A', N'N'), (146, 2, N'A', N'N'), (147, 2, N'A', N'N'), (148, 2, N'A', N'N'),
(149, 2, N'A', N'N'), (150, 2, N'A', N'N'), (151, 2, N'A', N'N'), (152, 2, N'A', N'N'), (153, 2, N'A', N'N'), (154, 2, N'A', N'N'), (155, 2, N'A', N'N'), (156, 2, N'A', N'N'),
(157, 2, N'A', N'N'), (158, 2, N'A', N'N'), (159, 2, N'A', N'N'), (160, 2, N'A', N'N')
-- Batch 2, Drug B, response Y (seqno 161 to 168)
INSERT [dbo].[DrugResponse] ([seqno], [Batch], [Drug], [Response]) VALUES
(161, 2, N'B', N'Y'), (162, 2, N'B', N'Y'), (163, 2, N'B', N'Y'), (164, 2, N'B', N'Y'), (165, 2, N'B', N'Y'), (166, 2, N'B', N'Y'), (167, 2, N'B', N'Y'), (168, 2, N'B', N'Y')
-- Batch 2, Drug B, response N (seqno 169 to 200)
INSERT [dbo].[DrugResponse] ([seqno], [Batch], [Drug], [Response]) VALUES
(169, 2, N'B', N'N'), (170, 2, N'B', N'N'), (171, 2, N'B', N'N'), (172, 2, N'B', N'N'), (173, 2, N'B', N'N'), (174, 2, N'B', N'N'), (175, 2, N'B', N'N'), (176, 2, N'B', N'N'),
(177, 2, N'B', N'N'), (178, 2, N'B', N'N'), (179, 2, N'B', N'N'), (180, 2, N'B', N'N'), (181, 2, N'B', N'N'), (182, 2, N'B', N'N'), (183, 2, N'B', N'N'), (184, 2, N'B', N'N'),
(185, 2, N'B', N'N'), (186, 2, N'B', N'N'), (187, 2, N'B', N'N'), (188, 2, N'B', N'N'), (189, 2, N'B', N'N'), (190, 2, N'B', N'N'), (191, 2, N'B', N'N'), (192, 2, N'B', N'N'),
(193, 2, N'B', N'N'), (194, 2, N'B', N'N'), (195, 2, N'B', N'N'), (196, 2, N'B', N'N'), (197, 2, N'B', N'N'), (198, 2, N'B', N'N'), (199, 2, N'B', N'N'), (200, 2, N'B', N'N')
GO
SET IDENTITY_INSERT [dbo].[DrugResponse] OFF
GO


| Speakers | New | Repeat | Row Total |
|---|---|---|---|
| Male | 2 | 11 | 13 |
| Female | 1 | 2 | 3 |
| Column Total | 3 | 13 | 16 |

**Step 1 – Setup Hypothesis:** What is the question I am trying to answer? If I were to choose 3 new speakers at random, say, what is the probability that at least 1 of them will be a woman? A simpler way of stating the same problem is: is there a correlation between gender and number of new speakers? From a statistical perspective, the starting assumption is ‘no’ (this is called the null hypothesis). If we disprove this statement, we have evidence for the opposite, that there is a relationship. If we cannot disprove it, we have no evidence of one. So putting it down:

**H0, or Null hypothesis :** There is no correlation between gender and new speaker count that is statistically significant.

**H1: The alternate hypothesis:** There is a correlation between gender and new speaker count that is statistically significant.

What do both of these statements mean mathematically, or in other words, on what basis do we make this decision? We can look at that in Step 3.

**Step 2: Set up the appropriate test statistic:** We choose Fisher’s exact test because of the small counts we have, and also because both our variables are categorical.

**Step 3: How do I decide?** The decision rule in two-sample tests of hypothesis depends on three factors:

1 Whether the test is upper, lower or two tailed (meaning we are looking for an effect in the greater direction, the lesser direction, or either),

2 The level of significance or degree of accuracy needed,

3 The form of the test statistic.

Our test here is just to find out whether gender and speaker category are related, so it is a two-tailed test. The level of significance we can use is the most common one, 0.05, which corresponds to the 95% confidence level that is also the default in R for Fisher’s test. The form of the test statistic is the p-value. So our decision rule is that gender and speaker category are related if the p-value is less than 0.05.

**Step 4: Calculation**

Now, time to do the math... first, with R:

    Input = ("
    Speaker New Repeat
    Male 2 11
    Female 1 2
    ")
    TestData = as.matrix(read.table(textConnection(Input), header=TRUE, row.names=1))
    fisher.test(TestData, alternative="two.sided")

R is telling us that the p-value is 0.4893, way above 0.05. Hence, per our decision rule, the sparse data we have gives no evidence that the two variables are related.

Now let us try the same thing with T-SQL. The calculation for Fisher’s test is rather elaborate when done manually, which is where you can appreciate how elegant and easy it is to use built-in functions with R. To do it otherwise, you need to not only code the calculation, but also enumerate every other matrix that has the same row and column totals. Then calculate the probability of each of them and sum those probabilities that are less than or equal to the ‘base probability’, the one we derive from the observed matrix. In this case we have 4 possible matrices, and their probabilities (calculated with T-SQL) are as shown below.

**T-SQL to calculate probabilities:** All probability-related math needs calculation of factorials. For this purpose I used the method described by Jeff Moden here.

    DECLARE @newmen int, @newwomen int, @repeatmen int, @repeatwomen int
    DECLARE @pvalue numeric(18, 4)
    DECLARE @numerator1 float, @numerator2 float, @numerator3 float, @numerator4 float, @numerator5 float
    DECLARE @denominator1 float, @denominator2 float, @denominator3 float, @denominator4 float, @denominator5 float

    -- Matrix 1: the observed table; its probability is the cutoff
    SELECT @newmen = 2, @newwomen = 1, @repeatmen = 11, @repeatwomen = 2
    SELECT @numerator1 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen+@newwomen)       -- column total: new
    SELECT @numerator2 = [n!] FROM [dbo].[Factorial] WHERE N = (@repeatmen+@repeatwomen) -- column total: repeat
    SELECT @numerator3 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen+@repeatmen)      -- row total: men
    SELECT @numerator4 = [n!] FROM [dbo].[Factorial] WHERE N = (@newwomen+@repeatwomen)  -- row total: women
    SELECT @denominator1 = [n!] FROM [dbo].[Factorial] WHERE N = @newmen
    SELECT @denominator2 = [n!] FROM [dbo].[Factorial] WHERE N = @newwomen
    SELECT @denominator3 = [n!] FROM [dbo].[Factorial] WHERE N = @repeatmen
    SELECT @denominator4 = [n!] FROM [dbo].[Factorial] WHERE N = @repeatwomen
    SELECT @denominator5 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen+@newwomen+@repeatmen+@repeatwomen) -- grand total
    SELECT @pvalue = (@numerator1*@numerator2*@numerator3*@numerator4)/(@denominator1*@denominator2*@denominator3*@denominator4*@denominator5)
    SELECT 'Matrix 1 - Pcutoff' as Matrix, @pvalue as PValue

    -- Matrix 2
    SELECT @newmen = 1, @newwomen = 2, @repeatmen = 12, @repeatwomen = 1
    SELECT @numerator1 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen+@newwomen)
    SELECT @numerator2 = [n!] FROM [dbo].[Factorial] WHERE N = (@repeatmen+@repeatwomen)
    SELECT @numerator3 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen+@repeatmen)
    SELECT @numerator4 = [n!] FROM [dbo].[Factorial] WHERE N = (@newwomen+@repeatwomen)
    SELECT @denominator1 = [n!] FROM [dbo].[Factorial] WHERE N = @newmen
    SELECT @denominator2 = [n!] FROM [dbo].[Factorial] WHERE N = @newwomen
    SELECT @denominator3 = [n!] FROM [dbo].[Factorial] WHERE N = @repeatmen
    SELECT @denominator4 = [n!] FROM [dbo].[Factorial] WHERE N = @repeatwomen
    SELECT @denominator5 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen+@newwomen+@repeatmen+@repeatwomen)
    SELECT @pvalue = (@numerator1*@numerator2*@numerator3*@numerator4)/(@denominator1*@denominator2*@denominator3*@denominator4*@denominator5)
    SELECT 'Matrix 2' as Matrix, @pvalue as PValue

    -- Matrix 3
    SELECT @newmen = 3, @newwomen = 0, @repeatmen = 10, @repeatwomen = 3
    SELECT @numerator1 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen+@newwomen)
    SELECT @numerator2 = [n!] FROM [dbo].[Factorial] WHERE N = (@repeatmen+@repeatwomen)
    SELECT @numerator3 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen+@repeatmen)
    SELECT @numerator4 = [n!] FROM [dbo].[Factorial] WHERE N = (@newwomen+@repeatwomen)
    SELECT @denominator1 = [n!] FROM [dbo].[Factorial] WHERE N = @newmen
    SELECT @denominator2 = [n!] FROM [dbo].[Factorial] WHERE N = @newwomen
    SELECT @denominator3 = [n!] FROM [dbo].[Factorial] WHERE N = @repeatmen
    SELECT @denominator4 = [n!] FROM [dbo].[Factorial] WHERE N = @repeatwomen
    SELECT @denominator5 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen+@newwomen+@repeatmen+@repeatwomen)
    SELECT @pvalue = (@numerator1*@numerator2*@numerator3*@numerator4)/(@denominator1*@denominator2*@denominator3*@denominator4*@denominator5)
    SELECT 'Matrix 3' as Matrix, @pvalue as PValue

    -- Matrix 4
    SELECT @newmen = 0, @newwomen = 3, @repeatmen = 13, @repeatwomen = 0
    SELECT @numerator1 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen+@newwomen)
    SELECT @numerator2 = [n!] FROM [dbo].[Factorial] WHERE N = (@repeatmen+@repeatwomen)
    SELECT @numerator3 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen+@repeatmen)
    SELECT @numerator4 = [n!] FROM [dbo].[Factorial] WHERE N = (@newwomen+@repeatwomen)
    SELECT @denominator1 = [n!] FROM [dbo].[Factorial] WHERE N = @newmen
    SELECT @denominator2 = [n!] FROM [dbo].[Factorial] WHERE N = @newwomen
    SELECT @denominator3 = [n!] FROM [dbo].[Factorial] WHERE N = @repeatmen
    SELECT @denominator4 = [n!] FROM [dbo].[Factorial] WHERE N = @repeatwomen
    SELECT @denominator5 = [n!] FROM [dbo].[Factorial] WHERE N = (@newmen+@newwomen+@repeatmen+@repeatwomen)
    SELECT @pvalue = (@numerator1*@numerator2*@numerator3*@numerator4)/(@denominator1*@denominator2*@denominator3*@denominator4*@denominator5)
    SELECT 'Matrix 4' as Matrix, @pvalue as PValue

The response we get is as below.

If we sum the 3 values that are less than or equal to the base value 0.4179, we get 0.4179 + 0.0696 + 0.0018 = 0.4893, which is exactly what we got from the R function.
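As a cross-check outside of SQL Server and R, the same enumeration can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the function name `fisher_two_sided` is made up for illustration, Python is used purely as a neutral third tool, and `math.comb` stands in for the factorial table. It follows the rule described above: sum the probability of every table with the same totals that is no more likely than the observed one.

```python
from math import comb


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Hypothetical helper for illustration only. In the example below the
    rows are Male/Female and the columns are New/Repeat.
    """
    row1, row2 = a + b, c + d      # row totals
    col1 = a + c                   # first column total
    total = row1 + row2

    def table_prob(x):
        # Hypergeometric probability of x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = table_prob(a)
    lowest = max(0, col1 - row2)   # smallest feasible top-left value
    highest = min(row1, col1)      # largest feasible top-left value
    probs = [table_prob(x) for x in range(lowest, highest + 1)]
    # Keep the tables that are at least as unlikely as the observed one;
    # the tiny tolerance guards against floating-point ties.
    return sum(p for p in probs if p <= observed * (1 + 1e-9))


p_value = fisher_two_sided(2, 11, 1, 2)
print(round(p_value, 4))  # 0.4893
```

The four feasible tables have probabilities 286/560, 234/560, 39/560 and 1/560; the last three add up to 274/560, which rounds to the 0.4893 reported above.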

**Step 5: Conclusion:** Since 0.4893 is greater than our significance level of 0.05, the condition in our decision rule is not met. In other words, we fail to reject the null hypothesis from Step 1: the data shows no significant relationship between these two variables.

So, we can logically conclude that, based on the data we are given, we do not have enough evidence that the gender of speakers and whether they are new or repeat speakers are related. Thanks for reading!!

]]>

We try to have 2 or 3 different precons on different subject areas. This year, we have a precon on Query Tuning by Grant Fritchey, SSIS/BIML by Andy Leonard and Building modern data architecture by Joshua Luedemann. Karla Landrum, the queen of SQL Saturdays, has been doing a blog interview with each presenter for the precons she has at SQL Saturday, Pensacola. I thought this was a great way to publicize the event as well as the presenter and am doing it for our events too. This post is an interview with Andy Leonard.

If you are a serious BI/SSIS professional – it is impossible to not know Andy. He is a highly respected person in the BI community – a technical guru and a former MVP who voluntarily gave up his title to enable more new people to get it. I am really glad to have him do a precon for us this year. Below is a short interview with him.

I started working with SQL Server in the late 1990’s. I was building electrical control systems and human-machine interfaces (HMIs) for manufacturing back then. To stress-test an HMI, I set the data-acquisition frequency to one second for every tag (data-collection point) and left for a 3-day weekend. When I returned I tried to open a 3GB Access file and… nothing happened. The stress test succeeded in letting me know I’d need a bigger database engine. I found SQL Server 6.5 and it passed that same test. Problem solved.

Automation is happening all around us. The days of being paid to copy, paste, and edit repetitive SSIS packages are fading fast; as are the days of manually building and deploying configurations scripts between SSIS Catalogs. Attendees will learn:

1) How to use Business Intelligence Markup Language (Biml) to automate the development of repetitive SSIS solutions;

2) How to leverage SSIS Design Patterns to improve SSIS package performance; and

3) How to combine SSIS Design Patterns and Biml to manage enterprise data integration. I’ll even throw in some demos of the free tools and utilities available at DILM Suite – no extra charge!

3. We are going through a lot of changes in the database world. There are a lot of skills required to sustain ourselves as data/BI professionals. Why do you think SSIS/BIML are among them, and why is it so important to add them to our skills as BI professionals?

All of the changes in data/BI are driven by economies of scale. Automation is driving down the costs of data integration and management. The net result is data integration developers can now deliver more in less time, with improved quality. There is benefit in spotting a trend and getting in early (ask me how I know!). Data integration automation with Biml is new-enough that one may still get in “early.”

SQL Saturdays are awesome for so many reasons! My #1 reason for supporting SQL Saturdays is the networking opportunities they represent. I love the local-ness of the event. I love that SQL Saturdays offer so many in our community the opportunity to present their ideas and solutions to an audience for the very first time. And I love that SQL Saturdays introduce so many to our community. As I tell my lovely bride, Christy, attending a SQL Saturday is just like hanging out with family, except I don’t have to explain acronyms!

5. What do you like to do for fun/relaxation?

I like to read for relaxation. I read science fiction series – some old, others new. I really enjoyed The Expanse novels and I think the television series is doing a good job capturing the story line. I read business books (and listen to business audio books) because I’m interested in becoming a better businessperson and leader. I read (and listen to) books about theology and Christianity because I want to be a better husband, father, grandfather, and person.

I hope you enjoyed reading and hope it helps you sign up for his precon!! Thank you.

]]>

Our first event was held at a training center – 2 tracks, 6 speakers, 29 sessions submitted, 3 sponsors and about 60 attendees. We outgrew that location the very next year. Our present event has 6 tracks, close to 300 attendees, 109 sessions submitted so far.

The 22 events before us were as below:

1 Orlando, FL

2 Tampa, FL

3 Jacksonville, FL

4 Orlando, FL

5 Olympia, WA

6 Cleveland, OH (did not actually happen).

7 Birmingham, AL

8 Orlando, FL

9 Greenville, SC

10 Tampa, FL

11 Tacoma, WA

12 Portland, OR

13 Alpharetta, GA

14 Pensacola, FL

15 Jacksonville, FL

16 South Florida

17 Baton Rouge, LA

18 Orlando, FL

19 East Iowa, IA

20 Jacksonville, FL

21 Orlando, FL

22 Pensacola, FL

A lot of the Florida events are past their 10 year anniversary. Many others will be having one this year or next year. This means 10+ years of free training to many, networking opportunities, small businesses that have profited by providing services and vendors who have (hopefully) found more customers. If you attend any of these events make sure to thank the organizers – an event is a LOT of work to organize and doing it for 10+ years is no mean achievement – it takes considerable motivation and hard work. Some of my favorite memories from 9 years of running this event include –

1 I did not have breakfast/coffee delivered once. This is probably the biggest thing I remember that went wrong during my decade of running the event. The food vendor had an employee who was new to town and made his delivery somewhere else (pre GPS days). I still recall that frantic morning with upset speakers and repeat calls to the food vendor.

2 One of the free locations we hosted our event in once threatened to cancel on us on the Friday before. The reason given was that there was ‘an inch of snow’ on the ground and they did not want to risk anybody’s safety. I was on my way to speaker dinner, and had to turn around to talk to them and convince them otherwise. One inch of snow is a big deal for some people. My team and the only volunteer we have left from those days – Deana, has stories on planting signs on the road on a frozen morning. Needless to say, we never had an event in winter ever again.

3 We had 8 tracks at one event. There was a new speaker who was doing her first talk and had nobody show up at her class. She was in tears. We have never overdone the number of tracks since.

4 My other favorite (smaller) memories of the decade include –

1 A lady DBA who was also a new mom attended the WIT session we had with Kevin Kline and Karen Lopez. She was close to quitting her job and decided to stick on after she heard them.

2 One of my events happened to fall on my birthday. Some of the attendees got a big cake and I had a ‘happy birthday’ sung to me by hundreds of people.

3 Wendy Pastrick, one of the PASS board members appreciated our event as among the best organized smaller events.

4 Tim Ford convinced me to attend SQL Cruise during one of my events. I’ve attended a cruise every year since then.

5 Hearing attendees talk about ‘do you remember 5 years ago…we came here..’ – never tire of that, ever.

Thank you to all the organizers of the events above for your dedication and hard work..and hope to keep this going as long as we can!! If you are an organizer of any of the above events – do write more on your favorite memories!!

Thanks for reading.

]]>

*Given a room of 23 random people, what are the chances that two or more of them have the same birthday?*

This problem is a little different from the earlier ones, where we actually knew what the probability in each situation was.

What are the chances that two people do NOT share the same birthday? Let us exclude leap years for now. The chance that two people do not share a birthday is 364/365, since the first person’s birthday is already a given. In a group of 23 people, there are (23*22)/2 = 253 possible pairs. If we treat the pairs as independent (an approximation), the chance of no two people sharing a birthday is 364/365 raised to the power of 253. The chance of at least two people sharing a birthday, then, per basics of probability, is 1 minus this. Doing the math then, first with T-SQL –

    DECLARE @x int, @numberofpairs int
    DECLARE @daysinyear float, @daysinyearminus1 float, @probability_notapair float

    SELECT @x = 23
    SELECT @numberofpairs = (@x*(@x-1)/2)
    SELECT @daysinyear = 365
    SELECT @daysinyearminus1 = @daysinyear - 1
    -- float keeps the digits that numeric(18, 4) would round away (364/365 = 0.99726...)
    SELECT @probability_notapair = (@daysinyearminus1/@daysinyear)
    SELECT 'Probability of at least one shared birthday', 1-power(@probability_notapair, @numberofpairs)

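In R the exact answer is very easily calculated using the line

    1 - prod(1 - (0:22)/365)

Here prod(1 - (0:22)/365) is the probability that all 23 birthdays are different, so we subtract it from 1 to get the probability of a shared birthday, about 0.507. Note that prod is just a function that multiplies what it is supplied; it is not a special statistical function of any kind. In this case, since the math is really easy, that is all we need to calculate the result.

As we can see it is pretty close to what we got with T-SQL. The small difference is there because the T-SQL version uses the pairwise approximation, while this line is exact.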
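To see the two approaches side by side, here is a minimal Python sketch (Python is used only as a neutral cross-check, and both function names are made up for illustration). One function mirrors the pairwise approximation used in the T-SQL, the other the exact product used in R.

```python
from math import prod


def shared_birthday_exact(people, days=365):
    """Exact chance that at least two of `people` share a birthday."""
    # Person i (counting from 0) must avoid the i birthdays already taken.
    all_different = prod((days - i) / days for i in range(people))
    return 1 - all_different


def shared_birthday_pairwise(people, days=365):
    """Approximation that treats every pair of people as independent."""
    pairs = people * (people - 1) // 2
    return 1 - ((days - 1) / days) ** pairs


print(round(shared_birthday_exact(23), 4))     # 0.5073
print(round(shared_birthday_pairwise(23), 4))  # 0.5005
```

Both land just above 50% for 23 people, which is the surprising part of the puzzle; the approximation is off only in the third decimal place.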

We can play around with R a little bit and get a nice graph illustrating the same thing.

    positiveprob <- numeric(23) # create a vector to hold the values

    # loop and fill values in vector
    for (n in 1:23) {
      negativeprob <- 1 - (0:(n - 1))/365
      positiveprob[n] <- 1 - prod(negativeprob)
    }

    # draw graph to show probability
    plot(positiveprob,
         main = 'Graph of birthday probabilities for 23 people',
         xlab = 'Number of people in room',
         ylab = 'Probability of same birthdays')

As we can see the probability of two or more people sharing a birthday in a room of about 23 is near 50%. Pretty amazing.

There are a ton of very detailed posts on this rather famous problem; this is just a basic intro for those of you learning stats and R.

1 https://www.scientificamerican.com/article/bring-science-home-probability-birthday-paradox/

2 http://blog.revolutionanalytics.com/2012/06/simulating-the-birthday-problem-with-data-derived-probabilities.html

Thanks for reading!!

]]>

**Step 1:** How do I know my problem fits this particular statistical solution?

To determine whether *n* is large enough to use what statisticians call the *normal approximation to the binomial,* both of the following conditions must hold:

n*p ≥ 10 and n*(1 – p) ≥ 10, where n is the number of trials and p is the probability of success on each trial.

In this case 100*0.3 = 30, which is way greater than 10. The second condition, 100*0.7 = 70, is also way greater than 10. So we are good to proceed.

2 Statistically stated, we need to find P(x<=35)

3 The formula to use now is $z = \frac{x - \mu}{\sigma}$, where μ (‘mu’) in the numerator is the mean, n*p, and σ (sigma) in the denominator is the standard deviation, the square root of n*p*(1 – p). Let us use some tried and trusted T-SQL to arrive at this value.

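4 We use 35 as x, but we add 0.5 to it as the suggested ‘corrective value’ (the continuity correction, needed because we are approximating a discrete distribution with a continuous one). Running T-SQL to get the z value as below: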

    DECLARE @X numeric(10, 4), @p numeric(10, 4), @n numeric(10, 4), @Z numeric(10, 4)
    DECLARE @MEAN numeric(10, 4), @VARIANCE numeric(10, 4), @SD numeric(10, 4)

    SELECT @X = 35.5 -- 35 plus the 0.5 continuity correction
    SELECT @P = 0.3
    SELECT @N = 100
    SELECT @MEAN = @N*@p
    SELECT @VARIANCE = @N*@p*(1-@p)
    SELECT @SD = SQRT(@VARIANCE)
    SELECT @Z = (@X-@MEAN)/@SD
    SELECT @Z as 'Z value'

5 To calculate probability from Z value, we can use Z value tables. There are also plenty of online calculators available – I used this one. I get probability to be 0.884969.

6 The same calculation can be achieved with great ease in R by just saying

pbinom(35,size=100,prob=0.3)

The result I get is very close to above.
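For readers who want to see both numbers come out of one script, here is a minimal Python sketch (used only as a neutral cross-check; the helper names `normal_cdf` and `binom_cdf` are made up for illustration). It computes the z value with the 0.5 correction, turns it into a probability with the standard normal CDF, and compares that with the exact binomial sum that pbinom performs.

```python
from math import comb, erf, sqrt


def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))


def binom_cdf(k, n, p):
    """Exact binomial probability P(X <= k)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n, p, x = 100, 0.3, 35
mean = n * p                      # mu = n*p = 30
sd = sqrt(n * p * (1 - p))        # sigma = sqrt(n*p*(1-p)) = sqrt(21)
z = (x + 0.5 - mean) / sd         # 0.5 is the continuity correction

approx = normal_cdf(z)            # normal approximation
exact = binom_cdf(x, n, p)        # what pbinom(35, size=100, prob=0.3) computes

print(round(z, 4), round(approx, 6), round(exact, 4))
```

The approximation agrees with the 0.884969 read off the z table, and the exact value agrees with the R result to about three decimal places, which is why the two are described as very close.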

You can get the same result by calling the R function from within TSQL as below

    EXEC sp_execute_external_script
         @language = N'R'
        ,@script = N'x<-pbinom(35,size=100,prob=0.3); print(x);'
        ,@input_data_1 = N'SELECT 1;'

In other words, there is an 88.39 percent chance that at most 35 out of 100 smokers end up with lung disease. Thank you for reading!

]]>

1 The trials are independent

2 The number of trials, n is fixed.

3 Each trial outcome can be a success or a failure

4 The probability of success in each case is the same.

Applying these rules –

1 The 7 smoking friends are not related or from the same group. (This is important as one friend’s habits can influence another and that does not make for an independent trial).

2 They smoke approximately at the same rate.

3 Either they get a lung disease or they don’t. We are not considering other issues they may have because of smoking.

4 Since all these conditions are met, the probability of each of them getting a lung disease is more or less the same.

The binomial formula, given that

x = total number of “successes” (pass or fail, heads or tails etc.)

p = probability of a success on an individual trial

n = number of trials

q = 1 – p

is as below:

$$P(X = x) = \frac{n!}{x!\,(n-x)!}\; p^{x}\, q^{\,n-x}$$

For those not math savvy, the ! stands for the factorial of a number. In the above example n equals 7. x, the number of ‘successes’ we are looking for (morbid, I know, to define a lung condition as a success, but it is just an example), is 2. p is given to be 0.3 and q is 1 – 0.3, which is 0.7. Now, given the rules of probability, we need to add the probability of none having a lung condition, 1 person having a lung condition and 2 having a lung condition, to get the probability of a maximum of 2 having a lung condition. Let us look at doing this with T-SQL first, then with R and then calling the R script from within T-SQL.

**1 Using T-SQL:**

There are a lot of different ways to write the simple code of calculating a factorial. I found this one to be most handy and reused it. I created the user defined function as ‘factorial’ and used the code below to calculate the probabilities of 0, 1 or 2 people getting a lung illness. If we add the 3 together we get the total probability of a maximum of 2 people getting a lung illness, which is about 0.65 or 65%.

    -- float is used for the probabilities so they are not rounded to 2 decimal places
    DECLARE @n int, @x int, @p float, @q float
    DECLARE @p0 float, @p1 float, @p2 float, @n1 float, @n2 float, @n3 float

    SELECT @n = 7, @p = 0.3, @q = 0.7

    -- P(X = 0)
    SELECT @x = 0
    SELECT @n1 = dbo.factorial(@n)
    SELECT @n2 = dbo.factorial(@n-@x)
    SELECT @n3 = 1 -- 0! = 1
    SELECT @p0 = (@n1/(@n2 * @n3))*power(@p, @x)*power(@q, @n-@x)
    SELECT @p0 as 'Probability of 0 people getting lung illness'

    -- P(X = 1): n!/((n-1)! * 1!) simplifies to n
    SELECT @x = 1
    SELECT @p1 = @n*power(@p, @x)*power(@q, @n-@x)
    SELECT @p1 as 'Probability of 1 person getting lung illness'

    -- P(X = 2)
    SELECT @x = 2
    SELECT @n1 = dbo.factorial(@n)
    SELECT @n2 = dbo.factorial(@n-@x)
    SELECT @n3 = dbo.factorial(@x)
    SELECT @p2 = (@n1/(@n2 * @n3))*power(@p, @x)*power(@q, @n-@x)
    SELECT @p2 as 'Probability of 2 people getting lung illness'

    SELECT @p0 + @p1 + @p2 as 'Probability of at most 2 people getting lung illness'

Results are as below:

**2 Using R:**

The R function for this is seriously simple, one line call as below.

dbinom(0:2, size=7, prob=0.3)

My results, almost exactly the same as what we got with T-SQL.
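The same three probabilities can be reproduced with a minimal Python sketch of the binomial formula (Python is used only as a neutral cross-check, and `binom_pmf` is a made-up helper name). `math.comb(n, x)` is the n!/(x!(n-x)!) part of the formula.

```python
from math import comb


def binom_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    q = 1 - p
    return comb(n, x) * p**x * q ** (n - x)


# 7 smokers, each with a 0.3 chance of lung illness
probs = [binom_pmf(x, 7, 0.3) for x in range(3)]   # x = 0, 1, 2
at_most_two = sum(probs)

for x, prob in enumerate(probs):
    print(x, round(prob, 4))
print("at most 2:", round(at_most_two, 4))
```

The three values come out at roughly 0.0824, 0.2471 and 0.3177, and their sum of about 0.6471 is the 65% quoted earlier.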

**3 Calling R from T-SQL:**

Instead of writing all that code I can simply call this function from within T-SQL –

    EXEC sp_execute_external_script
         @language = N'R'
        ,@script = N'x<-dbinom(0:2, size=7, prob=0.3); print(x);'
        ,@input_data_1 = N'SELECT 1;'

Results as below:

It is a LOT of fun to get our numbers to tie in more than one way. Thanks for reading.

]]>

The sampling distribution is the distribution of means collected from random samples taken from a population. So, for example, say I have a population of life expectancies around the globe and I draw five different samples from it. For each sample I calculate the mean. The collection of those means makes up my sampling distribution. Generally, the mean of the sampling distribution will equal the mean of the population, and the standard deviation of the sampling distribution (the standard error) will equal the standard deviation of the population divided by the square root of the sample size.

The **central limit theorem** states that the sampling distribution of the mean of any independent, random variable will be normal or nearly normal, if the sample size is large enough. How large is “large enough”? The answer depends on two factors.

- Requirements for accuracy. The more closely the sampling distribution needs to resemble a normal distribution, the more sample points will be required.
- The shape of the underlying population. The more closely the original population resembles a normal distribution, the fewer sample points will be required. (from stattrek.com).

The main use of the sampling distribution is to judge how accurately a statistic calculated from a sample (such as the mean) estimates the corresponding value in the population the sample was drawn from.

Let me try demonstrating this with an example in T-SQL. I am going to use the [Production].[WorkOrder] table from Adventureworks2016. To begin with, I am going to test if this data is actually a normal distribution in and of itself. I use the Empirical rule test I have described here for this. Running the code for the test, I get values that tell me that this data is very skewed and hence not a normal distribution.

    DECLARE @sdev numeric(18, 2), @mean numeric(18, 2), @sigma1 numeric(18, 2), @sigma2 numeric(18, 2), @sigma3 numeric(18, 2)
    DECLARE @totalcount numeric(18, 2)

    -- standard deviation and mean of the order quantities
    SELECT @sdev = SQRT(var(orderqty)) FROM [Production].[WorkOrder]
    -- cast before dividing so that integer division does not truncate the mean
    SELECT @mean = sum(CAST(orderqty AS numeric(18, 2)))/count(*) FROM [Production].[WorkOrder]
    SELECT @totalcount = count(*) FROM [Production].[WorkOrder] WHERE orderqty > 0

    -- percentage of rows within 1, 2 and 3 standard deviations of the mean
    SELECT @sigma1 = (count(*)/@totalcount)*100 FROM [Production].[WorkOrder]
    WHERE orderqty >= @mean-@sdev AND orderqty <= @mean+@sdev
    SELECT @sigma2 = (count(*)/@totalcount)*100 FROM [Production].[WorkOrder]
    WHERE orderqty >= @mean-(2*@sdev) AND orderqty <= @mean+(2*@sdev)
    SELECT @sigma3 = (count(*)/@totalcount)*100 FROM [Production].[WorkOrder]
    WHERE orderqty >= @mean-(3*@sdev) AND orderqty <= @mean+(3*@sdev)

    SELECT @sigma1 AS 'Percentage in one SD FROM mean', @sigma2 AS 'Percentage in two SD FROM mean', @sigma3 AS 'Percentage in 3 SD FROM mean'

In order for the data to be a normal distribution – the following conditions have to be met –

68% of data falls within the first standard deviation from the mean.

95% fall within two standard deviations.

99.7% fall within three standard deviations.

The results we get from above query suggest to us that the raw data does not quite align with these rules and hence is not a normal distribution.
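The empirical rule check itself is simple enough to sketch in a few lines of Python (a neutral cross-check only; `empirical_rule` and the toy data are made up for illustration and are not the AdventureWorks data). It uses the sample standard deviation, which is what SQRT(VAR(...)) gives in T-SQL.

```python
from statistics import mean, stdev


def empirical_rule(values):
    """Percentage of values within 1, 2 and 3 sample SDs of the mean."""
    m = mean(values)
    sd = stdev(values)          # sample standard deviation
    n = len(values)
    return tuple(
        100 * sum(1 for v in values if m - k * sd <= v <= m + k * sd) / n
        for k in (1, 2, 3)
    )


# Made-up, strongly right-skewed toy data: many small values, a few huge ones.
skewed = [1] * 90 + [100] * 10
print(empirical_rule(skewed))   # (90.0, 90.0, 100.0)
```

For this toy data 90% of the values sit within one standard deviation and still only 90% within two, which is nowhere near the 68% / 95% / 99.7% pattern. That kind of mismatch is what flags a data set as not normal.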

Now, let us create a sampling distribution from this. To do this we need to pull a few random samples of the data. I used the query suggested here to pull random samples from tables. I pull 30 samples in all and put them into tables.

    SELECT *
    INTO [Production].[WorkOrderSample20]
    FROM [Production].[WorkOrder]
    WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) as int)) % 100) < 20

I run this query 30 times and change the name of the table the results go into, so I am now left with 30 tables with random samples of data from the main table.

Now, I have to calculate the mean of each sample, pool it all together and then re run the test for normal distribution to see what we get. I do all of that below.

    -- numeric rather than INT, with a cast before averaging, so the sample means keep their decimals
    DECLARE @samplingdist TABLE (samplemean numeric(18, 4))

    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample1]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample2]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample3]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample4]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample5]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample6]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample7]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample8]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample9]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample10]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample11]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample12]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample13]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample14]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample15]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample16]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample17]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample18]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample19]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample20]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample21]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample22]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample23]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample24]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample25]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample26]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample27]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample28]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample29]
    INSERT INTO @samplingdist (samplemean) SELECT AVG(CAST(orderqty AS numeric(18, 4))) FROM [Production].[WorkOrderSample30]

    DECLARE @sdev numeric(18, 4), @mean numeric(18, 4), @sigma1 numeric(18, 2), @sigma2 numeric(18, 2), @sigma3 numeric(18, 2)
    DECLARE @totalcount numeric(18, 2)

    SELECT @sdev = SQRT(var(samplemean)) FROM @samplingdist
    SELECT @mean = sum(samplemean)/count(*) FROM @samplingdist
    SELECT @totalcount = count(*) FROM @samplingdist

    -- percentage of sample means within 1, 2 and 3 standard deviations of their mean
    SELECT @sigma1 = (count(*)/@totalcount)*100 FROM @samplingdist
    WHERE samplemean >= @mean-@sdev AND samplemean <= @mean+@sdev
    SELECT @sigma2 = (count(*)/@totalcount)*100 FROM @samplingdist
    WHERE samplemean >= @mean-(2*@sdev) AND samplemean <= @mean+(2*@sdev)
    SELECT @sigma3 = (count(*)/@totalcount)*100 FROM @samplingdist
    WHERE samplemean >= @mean-(3*@sdev) AND samplemean <= @mean+(3*@sdev)

    SELECT @sigma1 AS 'Percentage in one SD FROM mean', @sigma2 AS 'Percentage in two SD FROM mean', @sigma3 AS 'Percentage in 3 SD FROM mean'

The results I get are as below.

The results seem to be close to what is needed for a normal distribution now.

(68% of data should fall within the first standard deviation from the mean.

95% should fall within two standard deviations.

99.7% should fall within three standard deviations.)
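For anyone without AdventureWorks at hand, the same experiment can be sketched in Python with simulated data. Everything here is an assumption made for illustration: the skewed population is generated from an exponential distribution rather than taken from the WorkOrder table, each sample has 500 rows, and 200 samples are drawn instead of 30 so that the percentages are more stable.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Made-up, strongly skewed population standing in for the order quantities.
population = [rng.expovariate(1.0) for _ in range(10_000)]

sample_size = 500
# Each entry is the mean of one random sample drawn without replacement.
sample_means = [mean(rng.sample(population, sample_size)) for _ in range(200)]

pop_mean, pop_sd = mean(population), stdev(population)
means_mean, means_sd = mean(sample_means), stdev(sample_means)

# Share of the sample means that fall within one SD of their own mean.
within_one_sd = 100 * sum(
    1 for m in sample_means if abs(m - means_mean) <= means_sd
) / len(sample_means)

print("population mean / mean of sample means:", round(pop_mean, 3), round(means_mean, 3))
print("population SD / sqrt(n):", round(pop_sd / sample_size ** 0.5, 4))
print("SD of sample means:", round(means_sd, 4))
print("percent of sample means within one SD:", round(within_one_sd, 1))
```

Even though the population is heavily skewed, the sample means should cluster around the population mean with a spread close to the population SD divided by the square root of the sample size, and roughly two thirds of them should fall within one standard deviation, as the empirical rule expects.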

It is almost magical how easily the rule fits. To get this to work I had to try many different sample sizes; remember, the theorem says the sample size needs to be large enough for the sampling distribution to look normal. In the next post I will look into some examples of using R for demonstrating the same theorem. Thank you for reading.

]]>