Statistics with R and SQL

# Understanding ANOVA

ANOVA, or analysis of variance, is a set of statistical models used to analyze differences among groups and to determine whether those differences are statistically significant. The models were developed by the statistician and evolutionary biologist Ronald Fisher. To give a very simplistic definition – ANOVA…

# Box-and-whisker plot and data patterns with R and T-SQL

R is particularly good at drawing graphs from data. Some graphs are familiar to most DBAs because we have seen and used them over time: bar charts, pie charts and so on. Some are not. Understanding exploratory graphics is vitally important to the R programmer or data science newbie. This week I wanted…

# Confidence Intervals for a proportion – using R

What is the difference between reading numbers as they are presented and interpreting them in a mature, deeper way? One way, perhaps, to look at the latter is what statisticians call a ‘confidence interval’. Suppose I look at a sample of 100 Americans who are asked if they approve of the job the Supreme Court is…

# Understanding Relative Risk – with T-SQL

In this post we will explore a common statistical term: Relative Risk, otherwise called Risk Ratio. Relative Risk is important to understand when you are doing comparative studies of two groups that differ in some specific way. The most common usage of this is in drug testing – with…

# Fisher’s Exact Test – with T-SQL and R

This post is a long overdue second part to the post on the Chi Square Test that I did a few months ago. It addresses relationships between two categorical variables, but in cases where data is sparse and the numbers (in any cell) are less than 5. The Chi Square test is to be used when…

# The Birthday Problem – with T-SQL and R

When I was on SQLCruise recently, Buck Woody made an interesting statement: in a room of 23 people, there is over a 50% chance that two or more share a birthday. And sure enough, we did end up having more than two people with the same birthday. I was tempted to…
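As a rough sketch of the arithmetic behind that claim, the probability can be computed as one minus the chance that everyone's birthday is different. This assumes 365 equally likely birthdays and ignores leap years; the function name `birthday_prob` is only an illustrative choice, not something from the original post.

```r
# Probability that at least two of n people share a birthday.
# Assumptions: `days` equally likely birthdays, no leap years, independent people.
birthday_prob <- function(n, days = 365) {
  # With more people than days, a shared birthday is certain.
  if (n > days) return(1)
  # P(all different) = (365/365) * (364/365) * ... * ((365 - n + 1)/365)
  p_all_different <- prod((days - seq_len(n) + 1) / days)
  1 - p_all_different
}

birthday_prob(22)  # just under 0.5
birthday_prob(23)  # just over 0.5, roughly 0.507
```

Base R's stats package also ships a `pbirthday()` function for the same kind of calculation, which can serve as a cross-check.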

# Normal approximation to binomial distribution using T-SQL and R

In the previous post I demonstrated the use of the binomial formula to calculate probabilities of events occurring in certain situations. In this post I am going to explore the same situation with a bigger sample. Let us assume, for example, that instead of 7 smokers we had 100 smokers. We want to know what are…

# The Binomial Formula with T-SQL and R

In a previous post I explained the basics of probability. In this post I will use some of those principles to see how to solve certain problems. I will pick a very simple problem that I found in a statistics textbook. Suppose I have 7 friends who are smokers. The probability that a random smoker will develop…

# Sampling Distribution and Central Limit Theorem

In this post I am going to explain (in highly simplified terms) two very important statistical concepts: the sampling distribution and the central limit theorem. The sampling distribution is the distribution of means collected from random samples taken from a population. So, for example, suppose I have a population of life expectancies around the globe. I draw…
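The definition of a sampling distribution can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population below is synthetic (a skewed, made-up set of numbers, not real life-expectancy data), and the sample size of 50 and the 2000 repetitions are arbitrary choices for illustration.

```r
set.seed(42)  # make the simulation reproducible

# Hypothetical, deliberately skewed population (NOT real life-expectancy data).
population <- rexp(100000, rate = 1 / 70)

# Draw many random samples and record the mean of each one.
# The collection of these means approximates the sampling distribution.
sample_size  <- 50
sample_means <- replicate(2000, mean(sample(population, size = sample_size)))

# The mean of the sample means should sit close to the population mean ...
mean(population)
mean(sample_means)

# ... and their spread should be close to sd(population) / sqrt(sample_size).
sd(population) / sqrt(sample_size)
sd(sample_means)

# Even though the population is skewed, the histogram of sample means
# looks roughly bell-shaped, which is what the central limit theorem suggests.
hist(sample_means, main = "Simulated sampling distribution of the mean")
```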