Statistics with R and SQL

Normal approximation to binomial distribution using T-SQL and R

In the previous post I demonstrated the use of binomial formula to calculate probabilities of events occurring in certain situations. In this post am going to explore the same situation with a bigger sample set. Let us assume, for example, that instead of 7 smokers we had 100 smokers. We want to know what are the chances that a maximum of 35 of them have a serious lung disease. Going by the Binomial formula illustrated in the previous post that would mean a whole lot of calculations – calculating the probability of every situation less than 35 and then stick the values into the formula to derive an answer. That is a very laborious way to derive a result, even if we are using an automated process to do it.  An easier way to do it is to use the normal distribution, or central limit theorem. My post on the theorem illustrates that a sample will follow normal distribution if the sample size is large enough. We will use that as well as the rules around determining probabilities in a normal distribution, to arrive at the probability in this case.
Problem: I have a group of 100 friends who are smokers.  The probability of a random smoker having lung disease is 0.3. What are chances that a maximum of 35 people wind up with lung disease?

Step 1: How do I know my problem fits this particular statistical solution?
To determine whether n is large enough to use what statisticians call the normal approximation to the binomial, both of the following conditions must hold:

image0.png

In this case 100*0.3 = 30, which is way greater than 10. The second condition 100*0.7 is also 70 and way greater than 10. So we are good to proceed.

2 Statistically stated, we need to find P(x<=35)

3 The formula to use now is zee where ‘meu’ in the numerator is the mean and sigma, the denominator is the standard deviation. Let us use some tried and trusted t-sql to arrive at this value.

4 We use 35 as x, but we add 0.5 as suggested ‘corrective value’ to it. Running T-SQL to get z value as below:

DECLARE @X numeric(10,4), @p numeric(10, 4), @n numeric
(10, 4), @Z NUMERIC(10, 4)
DECLARE @MEAN NUMERIC(10, 4), @VARIANCE NUMERIC(10, 4), @SD NUMERIC(10, 4)
SELECT @X = 35.5
SELECT @P = 0.3
SELECT @N = 100
SELECT @MEAN = @N*@p
SELECT @VARIANCE = @N*@p*(1-@p)
SELECT @SD = SQRT(@VARIANCE)
SELECT @Z = (@X-@MEAN)/@SD
SELECT @Z as 'Z value'

zvalue

5 To calculate probability from Z value, we can use Z value tables. There are also plenty of online calculators available – I used this one.  I get probability to be 0.884969.

zvalue

6 The same calculation can be achieved with great ease in R by just saying

pbinom(35,size=100,prob=0.3)

The result I get is very close to above.

zvalue

You can get the same result by calling the R function from within TSQL as below

EXEC sp_execute_external_script
 @language = N'R'
 ,@script = N'x<-pbinom(35,size=100,prob=0.3);
 print(x);'
 ,@input_data_1 = N'SELECT 1;'

zee

In other words, there is a 88.39 percent chance that 35 out of 100 smokers end up with lung disease. Thank you for reading!

 

 

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s