I studied statistics in school and college. I didn’t get to use much of what I learned afterwards, and statistics was one of those subjects. But with the boom in interest in data mining, knowledge of statistics has become more important than ever before. Along with the basics of statistics come numerous tools for mining data and deriving appropriate metrics for business. Many people, including me, would like to get involved with meaningful data mining but don’t know where to start. I personally decided to go back and revise some concepts I once knew off the top of my head. I am going to blog every week on one concept that I am re-learning and applying to some data. I have a great interest in healthcare-related data, so I used a data set available for public download from the World Health Organization. I also used T-SQL to run basic statistical queries. T-SQL is far from the best tool there is for statistical analysis, but it is certainly useful for getting a start on understanding and summarizing data.
You are given a dataset. It could be a single table, or multiple tables. What are the appropriate questions to ask before you get around to analyzing or mining this dataset? Below are a few recommended questions that I found useful to begin with:
- What is the variable being measured? (There may be multiple, but for simplicity’s sake, let us assume there is just one.)
In my case that is the life expectancy of people around the globe, summarized at the country level.
- What is the population used? (The population is the entire collection of people about whom we want to draw conclusions, and from which the sample is drawn.)
People living in the countries around the globe.
- What is the sample used? (The sample is the subset of the population used for this specific study.)
People living in the countries targeted for this study.
The reason for asking these questions is to ensure that the dataset is representative of what we intend to present – if we say we are presenting statistics on global life expectancy, there had better be numbers from many different countries around the world. Every country need not be on the list, but several countries are necessary to make the sample worthy of study. Sampling and sample size is in itself a vast and deep subject; this post is just about understanding its basics and why we ask certain questions.
The last question, which determines the nature of our queries, is about the type of the variable being measured. Data used for statistical analysis falls into one of 4 categories, also called statistical scales – Nominal, Ordinal, Interval and Ratio. Much like the rules of normalization, there are 4 rules that determine the nature of measurement –
- Identity – each value is unique
- Magnitude – values have an ordered relationship to each other (one value can be greater or less than another)
- Equal Intervals – the difference between any two adjacent values on the scale is the same
- Minimum value of 0 – the scale has a true zero point, where 0 means a complete absence of the quantity being measured.
Applying these rules to the above types of data –
Nominal data falls into specific mutually exclusive categories (the kind commonly offered under radio buttons) – male/female, ethnicity, nationality and so on. It satisfies rule 1 only.
Ordinal data is essentially nominal data whose buckets are gradable or have some natural order to them – such as ‘how good was this presentation – excellent to horrible’, ‘how do you feel’, or work-related titles like data entry clerk, programmer, DBA, senior DBA and so on. It satisfies rules 1 and 2.
Interval data is numeric, and unlike ordinal data, the differences between values are meaningful and can be calculated. It satisfies rules 1, 2 and 3 but not 4, because it has no true zero point. A classic example is temperature in Celsius or Fahrenheit – 0 degrees does not mean an absence of temperature.
Ratio data is numeric like interval data, but its minimum stops at a true 0. Because of that true zero, ratios between values are meaningful (a vehicle moving at 60 mph is twice as fast as one at 30 mph), which makes ratio data the most flexible kind for statistical analysis. Examples would be weight, the speed of a vehicle and so on.
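To make the scales concrete, here is a small T-SQL sketch of which summary statistics are appropriate for each scale. The table and column names are hypothetical, invented purely for illustration – they are not from the WHO data set.

```sql
-- Hypothetical survey table; names are illustrative only.
CREATE TABLE #Survey (
    Gender       VARCHAR(10),   -- nominal: categories only
    Satisfaction TINYINT,       -- ordinal: 1 (horrible) .. 5 (excellent)
    TempF        DECIMAL(5,1),  -- interval: differences meaningful, no true zero
    WeightLbs    DECIMAL(6,1)   -- ratio: true zero, ratios meaningful
);

-- Nominal: counting per category (finding the mode) is about all that is valid.
SELECT Gender, COUNT(*) AS Freq
FROM #Survey
GROUP BY Gender
ORDER BY Freq DESC;

-- Ordinal: order is meaningful, so a median is defensible; a mean is debatable.
SELECT DISTINCT
       PERCENTILE_CONT(0.5)
           WITHIN GROUP (ORDER BY Satisfaction) OVER () AS MedianSatisfaction
FROM #Survey;

-- Interval: means and differences are fine; ratios ("twice as hot") are not.
SELECT AVG(TempF) AS AvgTempF,
       MAX(TempF) - MIN(TempF) AS RangeF
FROM #Survey;

-- Ratio: everything above, plus meaningful ratios.
SELECT AVG(WeightLbs) AS AvgWeight,
       MAX(WeightLbs) / NULLIF(MIN(WeightLbs), 0) AS MaxToMinRatio
FROM #Survey;
```

The point of the sketch is that the scale type, not the storage type, decides which queries make sense – Satisfaction is stored as a number, but averaging it quietly assumes interval properties the data may not have.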
In my specific data set, the variable I am analyzing is life expectancy/age, which is considered ratio data by some and ordinal by others. This link addresses why. (There are a lot of such ambiguities, and in the end what we call the data is secondary to the value we derive out of it.) Now that I know my data, I need to go about running some basic statistics on it. More on that in the next post.
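As a small taste of what those basic statistics might look like in T-SQL, here is a hedged sketch. The table and column names are my own guesses for illustration, not the actual schema of the WHO download.

```sql
-- Hypothetical table holding the WHO download; names are assumptions.
CREATE TABLE #LifeExpectancy (
    CountryName    VARCHAR(100),
    LifeExpectancy DECIMAL(4,1)   -- years at birth, summarized at country level
);

-- Basic summary of the variable being measured.
SELECT COUNT(*)              AS Countries,    -- sample size
       MIN(LifeExpectancy)   AS MinYears,
       MAX(LifeExpectancy)   AS MaxYears,
       AVG(LifeExpectancy)   AS MeanYears,
       STDEV(LifeExpectancy) AS StdDevYears   -- sample standard deviation
FROM #LifeExpectancy;
```

Because the query treats life expectancy as ratio (or at least interval) data, the mean and standard deviation are defensible summaries; if you lean toward the ordinal interpretation, a median would be the safer choice.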