December 2016 – Curious..about data

2016 has undoubtedly been a landmark year in my life. To me it marked my first conscious entry into middle age. It was the first year that I really pondered some of the questions that people need to think about as they get older, with a clarity that I had not enjoyed before. I think that clarity only comes with age, and at the right time, no matter how hard we try to make it happen earlier. Some of these questions, for me, included –

1 How much longer do I want to work in IT?
2 Am I doing the kind of work that energizes me and makes me feel like I am contributing something real to the world in some way?
3 What do I want out of where I live? (Or in other words, am I happy with the connections I have and the social life I am having?)

I have been pondering these questions for a couple of years now, but it came to a climax towards the end of 2015. I was at a new job – a DBA position again. I was making great $$$, the benefits were very good and the place was just a couple of miles from where I lived. But the job had certain issues that led me to ponder these questions more deeply. It got to be a mental struggle that made it very hard for me to go in to work every day with a positive mindset. I looked at my savings, and also talked to some of the many connections I had made in the SQL community. It became clear to me that I needed a sabbatical to ponder some of this – with some part-time consulting work to keep my bills paid and stay in touch with technology. So, I decided to leave the position to do just that – take a sabbatical with a part-time job and ponder what I want to do next.

Although it sounds like a romantic/cool thing to do now – it really did not feel that way. It felt like relief, and the extra time was a true blessing – but there were fears that went with it – fears that I was doing something very radical, that I would run out of money, or fall sick, and so on. But in truth, none of that happened. I spent a good 4 months doing consulting and learning some great new things, catching up on my reading, taking walks, meditating, and pondering the questions I had set myself to answer. I was led to the understanding that a switch to BI and Analytics would be a better option for me, after two decades of production DBA work. I also figured that working in healthcare-related analytics would give me the kind of satisfaction I craved – that my work was making a difference, in some small way, to the bigger world and was not just about putting out fires on servers.

By the end of March I found myself a BI position with a healthcare analytics firm, not too far from home. It also involved a significant amount of DBA work, which I was glad for, as someone switching lanes. At the end of December I am glad to say that I am loving what I do and planning to keep at it as much as I can. I am also blogging and writing articles on analytics, in addition to pursuing an associate degree. So, it all lined up like it was meant to. The year was hard in so many ways – there were some health challenges towards the end of it, and finding time to put into learning is still very hard. But I believe I am on the right path and will be guided towards my eventual goal of retiring happy and doing the kind of work I want to be doing.

My goals for 2017 are as below:
1 Make time for what matters – not let work run my life. By that I mean – eat well, exercise, meditate, take time to blog and learn outside work. Time management is easier said than done but setting the goal is the first step.
2 Understand that time is limited, and retain this understanding on a proactive basis. This is the big difference in thinking from youth to middle age. I believe I have 10-12 years of full-time work left. In that time I want to be doing what I enjoy and not give in to fears and insecurities.
3 Take time for connections that matter – for friends, family and people who need me. To me that includes connecting with my family of origin (at least one trip to India every two years), connecting with #sqlfamily (PASS Summit, as many SQL Saturdays as possible), staying active with the local community – organizing SQL Saturdays with my partners in crime – John Morehouse, Chris Yates and several loyal volunteers, speaking at the local user group and so on.

My suggestions to anyone else in the same place as I am – or getting there –

1 Life is short. If you are stuck in a seriously unhappy job or doing work that does not seem to mean anything – reconsider. Honor your heart’s calling, and take time to find it.
2 If you are over 45 – make a bucket list, and check off items. Make solid plans to get at least one or two items off the list every year.
3 Learn proactively – very few people I know got ahead by just learning on the job. You need a fantastically good job for that, and granted, there are a few, but not many of us are that lucky. How does one find time? Yes, that is a hard question, but it should never be left unanswered. Two simple things that I am doing are
1 Listening to podcasts or watching Pluralsight videos when I exercise,
2 Blogging on one thing I learned every week.

I want to increase this as I go, but I am making strides even with this much.

Wish you all health, peace and happiness in 2017!! Thank you for reading.

So far I’ve worked on simple analytical techniques using one or two variables in a dataset. This article is a summary of the techniques we can use for such datasets, depending on the type of variable in question. It covers how to get the relevant summary statistics and how to plot the appropriate descriptive graph. There are fundamentally two types of variables –

1 Continuous – variables which can take any numeric value within a range – such as height, weight, temperature etc.
2 Categorical – variables which can only take one of a fixed set of values (categories) – such as gender or eye color.
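One way to check which kind of variable each column is, before picking a technique, is to look at how R stores it. This is only a minimal sketch: it assumes that factor columns are the categorical ones and numeric columns are the continuous ones, which holds for the built-in Orange dataset but not for every dataset. The name var_types is mine.

```r
# Inspect how R stores each column of the built-in Orange dataset
str(Orange)

# Label each column: factors as categorical, numerics as continuous
# (an assumption that suits this dataset, not a general rule)
var_types <- sapply(Orange, function(col) {
  if (is.factor(col)) {
    "categorical"
  } else if (is.numeric(col)) {
    "continuous"
  } else {
    "other"
  }
})
print(var_types)
```

A numeric column that only holds a handful of codes (1 to 5, say) would still need to be converted with factor() before it is treated as categorical.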

The types of relationships we want to look at are

1 Categorical-Continuous
2 Categorical-Categorical
3 Continuous-Continuous

1 Categorical and Continuous Variable: For this purpose we will use one of R’s built-in datasets, called Orange. This dataset has growth records for five orange trees, identified by the codes 1 to 5, with the circumference and age of the tree at each measurement. The categorical variable here is the tree; for the continuous variable we can consider either circumference or age.

1 List the dataset. We can do this by simply typing Orange at the R prompt.
[Image: listing of the Orange dataset]

2 To get the summary statistics on each tree -
by(Orange$circumference, Orange$Tree, summary)
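If we only need one or two statistics per tree, in a form that is easy to reuse later, tapply is an alternative to by. This is a sketch rather than the only way to do it; the names mean_circ, sd_circ and tree_stats are mine.

```r
# Mean and standard deviation of circumference for each tree
mean_circ <- tapply(Orange$circumference, Orange$Tree, mean)
sd_circ   <- tapply(Orange$circumference, Orange$Tree, sd)

# Combine them into one small table, one row per tree
tree_stats <- data.frame(
  tree = names(mean_circ),
  mean = round(as.numeric(mean_circ), 1),
  sd   = round(as.numeric(sd_circ), 1)
)
print(tree_stats)
```

The rows come out in the order of the Tree factor levels, which in this dataset is not 1 to 5.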

3 Now we want to understand what this data is made of. What are the min and max values, which is the biggest tree and which is the smallest, what is the overlap in values and so on. The easiest way to do this is to get the box-and-whisker plot of this data

boxplot(circumference~Tree, data=Orange, main="Circumference of Orange Trees",  xlab="Tree", ylab="Circumference", col=c("red","blue", "yellow","orange","darkgreen"))

[Image: box plot of circumference by tree]

From the plot we can see that tree 3 has the smallest circumference while tree 4 has the largest, with tree 2 close to tree 4. We can also see that tree 1 has the narrowest spread of circumference while tree 4 has the widest, closely followed by tree 2. There are no outliers in this data, since no points are drawn beyond the whiskers.
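What we read off the plot can be checked with numbers. The sketch below computes the interquartile range for each tree (the height of each box) and uses boxplot.stats, the function behind boxplot, to list any points that would be drawn as outliers. The names iqr_by_tree and outliers are mine.

```r
# Interquartile range of circumference per tree, smallest spread first
iqr_by_tree <- tapply(Orange$circumference, Orange$Tree, IQR)
print(sort(iqr_by_tree))

# Points beyond the whiskers for each tree; numeric(0) means none
outliers <- tapply(Orange$circumference, Orange$Tree,
                   function(x) boxplot.stats(x)$out)
print(outliers)
```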

2 Relationship between two categorical variables:

For this purpose I can use the built-in dataset in R called ‘HairEyeColor’. This dataset has counts of people by hair color, eye color and gender. Gender, hair color and eye color are all categorical variables. For simplicity’s sake I am only going with gender and eye color.

Let us say we want the following information from this dataset –

1 % of men and women across eye color (and the unpivot – % of men and women per eye color)

2 % of men and women across hair color (and the unpivot – % of men and women per hair color)

3 Count of men and women in the mix (total)

4 Count of men and women for each eye/hair color

The code that accomplishes all of this is as below:

# Flatten the data into a gender by eye color table of counts

gendereyemix <- xtabs(Freq ~ Sex + Eye, data = as.data.frame(HairEyeColor))

# % of each eye color within each gender (each row sums to 1)

prop.table(gendereyemix, 1)

# % of men and women within each eye color (each column sums to 1)

prop.table(gendereyemix, 2)

# Number of men and women in the mix

margin.table(gendereyemix, 1)

# Number of people per eye color (men and women combined)

margin.table(gendereyemix, 2)

# Number of men and women per eye color (the table itself)

gendereyemix

Results are as below:

[Image: output of the prop.table and margin.table calls]
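The hair color questions from the list can be answered the same way. This is a sketch that mirrors the eye color code with Hair swapped in for Eye; the name genderhairmix is mine.

```r
# Gender by hair color table of counts, summed over eye color
genderhairmix <- xtabs(Freq ~ Sex + Hair, data = as.data.frame(HairEyeColor))

# % of each hair color within each gender (each row sums to 1)
hair_within_gender <- prop.table(genderhairmix, 1)
print(round(hair_within_gender, 3))

# % of men and women within each hair color (each column sums to 1)
gender_within_hair <- prop.table(genderhairmix, 2)
print(round(gender_within_hair, 3))

# Number of men and women per hair color is the table itself
print(genderhairmix)
```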

We can accomplish the same using 3 variables in the mix for both prop.table and margin.table functions.
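As a sketch of what that could look like: HairEyeColor is already a three-way table, so margin.table and prop.table can take it directly, with the dimensions given by position (1 = Hair, 2 = Eye, 3 = Sex). The names hair_by_sex and by_sex are mine.

```r
# Check the order of the three dimensions: Hair, Eye, Sex
print(names(dimnames(HairEyeColor)))

# Counts by hair color and gender, summed over eye color
hair_by_sex <- margin.table(HairEyeColor, c(1, 3))
print(hair_by_sex)

# Share of each hair/eye combination within each gender
by_sex <- prop.table(HairEyeColor, 3)

# ftable flattens the three-way result into one readable table
print(ftable(round(by_sex, 3)))
```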

We can also run many tests of independence with them.
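One such test is Pearson's chi-squared test, which asks whether eye color is distributed independently of gender. This is a minimal sketch using base R's chisq.test; the table is rebuilt here so the snippet runs on its own, and the name indep is mine.

```r
# Gender by eye color table of counts
gendereyemix <- xtabs(Freq ~ Sex + Eye, data = as.data.frame(HairEyeColor))

# Chi-squared test of independence between gender and eye color
indep <- chisq.test(gendereyemix)
print(indep)

# Counts we would expect in each cell if the two were independent
print(round(indep$expected, 1))
```

A small p value would suggest that eye color and gender are not independent in this sample; a large one means we cannot rule independence out.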

The appropriate chart for comparing categorical variables is often a bar chart.

barplot(gendereyemix, main="Gender-Eye Color Distribution", xlab="Eye Color", col=c("darkblue","red"), legend.text=rownames(gendereyemix))

[Image: bar chart of gender by eye color]
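Because the sample does not have equal numbers of men and women, raw counts can be hard to compare. A variation, sketched here as one option, is to plot the within-gender proportions with the bars side by side (beside = TRUE). The name eye_share is mine.

```r
# Gender by eye color table of counts
gendereyemix <- xtabs(Freq ~ Sex + Eye, data = as.data.frame(HairEyeColor))

# Share of each eye color within each gender
eye_share <- prop.table(gendereyemix, 1)

# Grouped bars: one pair of bars (male, female) per eye color
barplot(eye_share, beside = TRUE,
        main = "Eye Color Share Within Each Gender",
        xlab = "Eye Color", ylab = "Proportion",
        col = c("darkblue", "red"),
        legend.text = rownames(eye_share))
```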

3 Relationship between 2 continuous variables:

For this let us consider the airquality data frame, and two of its continuous variables, wind speed (Wind) and temperature (Temp). To begin with we can get a summary of wind speed for each temperature by

by(airquality$Wind, airquality$Temp, summary)

[Image: summary of wind speed for each temperature]

Although this gives a decent idea of wind speeds at each temperature, it does not tell us whether the two are correlated. To test for a correlation we can use cor.test (a t.test on the two columns would only compare their means, not measure how they move together) –

cor.test(airquality$Wind, airquality$Temp)

[Image: output of the test]

The p value is really small, indicating that there is a statistically significant correlation between wind speed and temperature. We may want to draw a graph to understand what this correlation is – I tried a scatter plot
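The p value alone does not tell us the direction or strength of the relationship. Those can be read from Pearson's correlation coefficient, which base R gives us through cor and cor.test. A minimal sketch; the names r and ct are mine.

```r
# Pearson correlation coefficient between wind speed and temperature
r <- cor(airquality$Wind, airquality$Temp)
print(r)

# The same coefficient with a significance test and confidence interval
ct <- cor.test(airquality$Wind, airquality$Temp)
print(ct$estimate)   # correlation coefficient
print(ct$conf.int)   # 95% confidence interval for the coefficient
print(ct$p.value)    # p value for the null hypothesis of zero correlation
```

A coefficient below zero means one variable tends to fall as the other rises; the closer it is to -1 or 1, the stronger the linear relationship.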

plot(airquality$Wind, airquality$Temp, main="Scatterplot of Wind versus Temperature",
     xlab="Wind", ylab="Temperature", pch=19)

[Image: scatter plot of wind versus temperature]

Now we add a smoothed trend line (a lowess curve) to the graph

lines(lowess(airquality$Wind,airquality$Temp), col="blue")

[Image: scatter plot with the lowess curve added]
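The lowess curve is a local smoother, so it can bend. If we want a straight line of best fit instead, one option is to fit a simple linear regression with lm and draw it with abline. This sketch redraws the scatter plot so it runs on its own; the name fit and the choice of red for the line are mine.

```r
# Straight-line fit of temperature on wind speed
fit <- lm(Temp ~ Wind, data = airquality)
print(coef(fit))   # intercept and slope

# Scatter plot with both the straight line and the lowess curve
plot(airquality$Wind, airquality$Temp,
     main = "Scatterplot of Wind versus Temperature",
     xlab = "Wind", ylab = "Temperature", pch = 19)
abline(fit, col = "red")                                      # linear fit
lines(lowess(airquality$Wind, airquality$Temp), col = "blue") # smoother
```

The sign of the Wind coefficient gives the direction of the trend, and comparing the two lines shows how far the relationship departs from a straight line.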

From this we can see that temperature tends to fall as wind speed increases, which is a negative correlation.

These are just a few examples of how to relate two variables. In many cases there may be a lot more than two variables in the mix, and there are many strategies for studying the correlations among them. But understanding what types of variables they are, and which graphs best suit each combination, helps further our data analysis skills. Thanks for reading!