Getting Insights from Survey Results using Data Science in Python

6/22/2016


I'm neither a statistician nor a professional data scientist. But when a friend and fellow Chattanooga-based developer, Jared Menard, and I started talking about the salary survey he put together, I was intrigued by the insights that could be gleaned from the data. The survey asked questions like:

  • How much money do you make a year?
  • How many years of experience do you have?
  • How many books have you read in the last year?
  • Do you have a degree? Is it related to your career?
  • Name some things that you use at work.

Of the roughly 500 developers targeted, 67 responded to the survey, putting us short of the 80 respondents needed for a 10% margin of error, per SurveyMonkey. Also, the survey was not designed by a professional researcher or survey designer. So while I readily acknowledge that there may be some potential flaws in the findings, I was more interested in the process of using data science and machine learning techniques to get deeper insights from the data than what could be provided by the basic statistical tools of mean, mode, standard deviation, scatter plots and column charts.



Two of the questions I was hoping to answer were:
1. What kinds/groups of developers answered the survey, and what are the general characteristics of those groupings?
2. Of all the features in the dataset, which are the most important in determining salary, and given those features, can we predict an expected salary?

Data Preparation

To get started, I had to do what every data scientist is forced to spend 80% of their time doing: data cleaning and preparation. Due to the small size of the dataset, it was easy enough to pull up the CSV file in a text editor or spreadsheet program and visually examine the data to see what needed fixing. For larger datasets, you would have to rely on tools to find missing values and outliers, generate histograms, and so on. For example, the DataFrame class from the pandas library provides methods like isnull() and fillna() that can be used to manage missing values.
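
For instance, here is a quick way to see which columns have missing values (a minimal sketch; the CSV file name is an assumption):

import pandas as pd

survey_df = pd.read_csv('survey_results.csv')

# Count the missing values in each column
print(survey_df.isnull().sum())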

When it comes to missing values, there are a number of strategies for handling them. You can drop the record with the missing value entirely, set the missing value to 0 / 'NA' / -1, or impute the missing value. Common imputation methods include calculating the mean value, carrying forward the last observation, and using a predicted value from a regression analysis. For this survey, one record was dropped because the respondent failed to complete the survey, a couple of missing values in other records were set to 0 or NA, and a couple of others were imputed based on the previous answers given by the respondent.
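
Here is roughly what those strategies look like in pandas (a sketch; the specific columns chosen are illustrative, not the exact ones I treated):

# Drop any record that is missing a salary
survey_df = survey_df.dropna(subset=['SALARY'])

# Set a missing value to a sentinel like 0
survey_df['CONF_BUDGET'] = survey_df['CONF_BUDGET'].fillna(0)

# Impute a missing value with the column mean
survey_df['BOOKS_READ'] = survey_df['BOOKS_READ'].fillna(survey_df['BOOKS_READ'].mean())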

Another data preparation task that was done was taking a continuous variable (number of employees at your company) and turning it into a categorical variable (micro, small, medium, large, enterprise). This task could easily be accomplished with a few lines of Python code.

def companySizeCat(numEmployees):
    if numEmployees < 7: # micro
        return 1
    elif numEmployees < 250: # small
        return 2
    elif numEmployees < 500: # medium
        return 3
    elif numEmployees < 1000: # large
        return 4
    else: # enterprise
        return 5 
 
survey_df['COMP_SIZE'] = survey_df['COMP_SIZE'].apply(companySizeCat)

Finally, the survey included a free text field for the question "Name some things that you use at work". In order to do any accurate text mining on that field, I opted to perform a number of transformations: I lowercased all the words in the field, standardized/stemmed certain terms (e.g. 'osx' and 'os x' both get standardized to 'osx'), and deduced broader tools from specific ones.
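
The sketch below shows the general shape of those transformations; the synonym map here is illustrative, and the actual one I used was larger:

# Map variant spellings to a single canonical term (illustrative entries only)
SYNONYMS = {'os x': 'osx', 'postgresql': 'postgres', 'node.js': 'node'}

def clean_stack_text(text):
    text = text.lower()
    for variant, canonical in SYNONYMS.items():
        text = text.replace(variant, canonical)
    return text

survey_df['STACK'] = survey_df['STACK'].fillna('').apply(clean_stack_text)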

Examining the Data

Once you get your data into a pandas DataFrame, examining it is quite trivial, especially if you are working in a Jupyter notebook environment. Pandas gives you lots of methods, such as describe, head, tail, hist, info, median, mean, mode, std, nlargest and nsmallest, that you can use to gain an understanding of what your data looks like.

For example, calling the describe() method shows various summary statistics for every column in your dataset. I transpose the output here so the wide table is easier to read.

survey_df.describe().transpose()
 

                 count  mean      std       min    25%    50%    75%    max
YEARS_EXP        67     9.31      7.34      0.5    4.0    7.0    12.0   30
NUM_HOURS_WORK   67     43.04     6.88      20     40     40     47     65
DAYS_OFF         67     21.06     16.74     0      14     20     23.5   100
SALARY           67     82020.15  28621.55  20000  65000  77000  95000  185000
COMP_SIZE        67     198.93    329.75    1      10     30     250    1001
HEALTH_INS       67     0.88      0.33      False  1      1      1      True
COMP_MEALS       67     1.34      2.42      0      0      0      2      10
CONF_BUDGET      67     5302.24   27144.66  0      0      0      475    200000
401K             67     0.63      0.49      False  0      1      1      True
TRAINING_BUDGET  67     0.37      0.49      False  0      0      1      True
BOOKS_READ       67     8.83      10.46     0      2.5    5      10     50
DEGREE           67     0.78      0.42      False  1      1      1      True
CAREER_DEGREE    67     0.61      0.49      False  0      1      1      True
WORK_HIST_COUNT  67     5.33      4.25      1      3      4      6      27

Answering the question "Amongst the survey respondents, what is the most common company size, years of experience, number of jobs held, number of books read and salary?" becomes as easy as:

sel_columns_df = survey_df[['COMP_SIZE','YEARS_EXP', 'WORK_HIST_COUNT', 'BOOKS_READ', 'SALARY']]
sel_columns_df.mode()
 

   COMP_SIZE  YEARS_EXP  WORK_HIST_COUNT  BOOKS_READ  SALARY
0          2         10                3           3   75000
1        NaN        NaN                4         NaN     NaN

Note: for the work history count there is a tie; 3 and 4 are both the most frequently recorded number of jobs held.

Another question that Python's data science tools (scikit-learn and pandas) enable me to answer: what are the most common tools mentioned by the respondents?

To answer this, I first need to split the raw entered text into units representing each item; this is called tokenization. Then I need to create a matrix of the count of occurrences of each token for every entry. This matrix could feed other algorithms, but here I simply sum the counts and display the top ones.

# Define a function to tokenize the raw text entered by the users
def tokenizr(text):
    tokens = text.split(";")
    return [token.strip() for token in tokens if token.strip()]

# Replace missing entries with an empty string before vectorizing
tools = survey_df['STACK'].fillna('')

# Vectorize the text using scikit-learn's CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
tf_vectorizer = CountVectorizer(tokenizer=tokenizr)
tools_matrix = tf_vectorizer.fit_transform(tools)

tf_vectorizer.get_feature_names()
>>>
[u'.net',
 u'airflow',
 u'android',
 u'anger',
 u'angular',
 u'ansible',
 u'apache',
 u'apex',
 u'asp .net',
 u'asp mvc',
 u'atom',
 u'aws',
 u'baby bottles & burping cloths.',
 u'bash',
 u'bootstrap',
 u'brainpower',
 u'c++',
 u'centos', 
... 
]
 
# Turn the matrix into a dataframe
import pandas as pd
tools_matrix_df = pd.DataFrame(data=tools_matrix.toarray(),
                               columns=tf_vectorizer.get_feature_names())

# Get the most mentioned stack tools by summing the columns of tools_matrix_df
X = tools_matrix_df.sum()

X = X.sort_values(ascending=False)
X.head(10)
>>>
javascript    39
linux         23
node          22
ruby          22
rails         19
postgres      19
nginx         16
php           15
mysql         14
angular       12 
 
The results show that many of the survey respondents are frontend developers, or at the very least full stack developers, as javascript, ruby on rails, nginx, php and angular dominate the list.

Clustering the Respondent Data

The first question I had that requires the use of machine learning algorithms is: what kinds/groups of developers answered the survey, and what were the general characteristics of those groupings?

Being able to cluster respondents into groups may be useful for understanding, and anticipating, the behavior of the members of a certain group. It allows us to make assumptions about certain populations that are grounded in data. For instance, the data may show that infrastructure engineers tend to make certain salaries, work 40 hours a week, work for companies of a certain size and have X prior jobs, in contrast to frontend engineers, who may have a different salary range, work slightly more hours, have X years of experience and have read Y books on average.

To get started with clustering, I used the K-Means implementation provided by scikit-learn. I also performed a version of the clustering using scikit-learn's agglomerative clustering implementation, but since the results were similar to the K-Means solution, I chose to focus on iterating on my K-Means solution. K-Means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. My first iteration of the solution simply fed the algorithm the entire dataframe and asked it to come up with 5 clusters:

# Using a basic KMeans algorithm
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5, max_iter=500)

# Attempt 1: Just give it all the data in its raw, naked glory and see what it comes up with
kmeans.fit(survey_df)

# Taking a look at what records were assigned to what cluster, I notice a disturbing trend: Salary is outweighing all other features
survey_df_c1 = survey_df.copy()
survey_df_c1.insert(0, 'CLUSTER1_LABEL', kmeans.labels_)
survey_df_c1.head(10)
 

   CLUSTER1_LABEL  YEARS_EXP  NUM_HOURS_WORK  DAYS_OFF  SALARY  COMP_SIZE  HEALTH_INS  COMP_MEALS  CONF_BUDGET  401K   TRAINING_BUDGET  BOOKS_READ  DEGREE  CAREER_DEGREE  WORK_HIST_COUNT
0  0               5          35              20        70000   1          True        4           0            False  False            0           True    True           8
1  2               10         40              0         120000  1          False       4           3000         False  True             3           False   False          6
2  0               11         45              15        75000   2          True        0           0            False  False            2           True    True           6
3  0               6          40              16        77000   2          True        0           0            False  False            10          True    True           3
4  0               6          50              14        70000   2          True        0           0            False  False            3           False   False          3
5  3               2          50              14        45000   1          False       0           0            False  True             4           True    False          1
6  0               16         50              15        93550   3          True        0           350          True   True             22          True    True           9
7  3               1          40              60        50000   1          False       0           0            True   False            24          True    False          5
8  2               18         50              30        130000  5          True        0           0            True   False            10          False   False          3
9  2               30         20              20        120000  2          True        0           5000         True   True             10          False   False          20

(the ~160 one-hot tool columns produced by vectorizing the STACK field are elided here)
 
 
# Here's a hack in place of visualization:
# to prove that out a bit more, I group by the cluster label and run some stats
import numpy as np

cl1_groups = survey_df_c1.groupby(['CLUSTER1_LABEL'])
cl1_groups.agg([np.mean, np.min, np.max, np.std])
 

(Output abridged to the columns that matter for the discussion below; the full table holds mean/min/max/std for every one of the ~174 columns.)

CLUSTER1_LABEL  YEARS_EXP mean  SALARY mean  SALARY min  SALARY max  CONF_BUDGET mean
0               10.33           82586        68000       100000      424
1               9.00            90000        90000       90000       200000
2               12.83           126666       105000      185000      3125
3               5.15            50426        20000       65000       147
4               1.00            55000        55000       55000       100000
 

This naive solution failed, in my opinion. Because a naive K-Means implementation is sensitive to scale, the salary feature vastly outweighed the other features in determining the clusters, so the result was clusters dictated by salary ranges: respondents with salaries from 20K to 65K landed in cluster 3, 68K to 100K in cluster 0, 105K to 185K in cluster 2, and so on.

In my second iteration of the K-Means solution, I normalized (scaled) the data so that all the fields contained values between 0 and 1.

def normalize(df, columns):
    result = df.copy()
    for feature_name in columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result
 
survey_df_normalized = normalize(survey_df, ['YEARS_EXP', 'NUM_HOURS_WORK', 'DAYS_OFF', 'SALARY', 'CONF_BUDGET', 'BOOKS_READ', 'WORK_HIST_COUNT'])
survey_df_normalized.drop(['STACK', 'COMP_MEALS'], axis=1, inplace=True) 

kmeans.fit(survey_df_normalized)

survey_df_c2 = survey_df.copy()
survey_df_c2.insert(0, 'CLUSTER2_LABEL', kmeans.labels_)
cl2_groups = survey_df_c2.groupby(['CLUSTER2_LABEL'])
cl2_groups.agg([np.mean, np.min, np.max, np.std]) 
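
As an aside, scikit-learn provides the same min-max scaling via its MinMaxScaler, so the hand-rolled normalize() above could be replaced with something like this sketch:

from sklearn.preprocessing import MinMaxScaler

columns = ['YEARS_EXP', 'NUM_HOURS_WORK', 'DAYS_OFF', 'SALARY',
           'CONF_BUDGET', 'BOOKS_READ', 'WORK_HIST_COUNT']
scaler = MinMaxScaler()
survey_df_normalized = survey_df.copy()
survey_df_normalized[columns] = scaler.fit_transform(survey_df[columns])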
 

(Output abridged as before; the full table holds mean/min/max/std for every column.)

CLUSTER2_LABEL  YEARS_EXP mean  SALARY mean  SALARY min  SALARY max  COMP_SIZE mean  HEALTH_INS mean  401K mean  BOOKS_READ mean
0               9.91            82709        40000       115000      2.18            1.00             1.00       11.32
1               13.50           94696        52000       185000      4.57            1.00             0.86       6.71
2               8.64            80500        30000       165000      2.00            0.93             0.36       6.71
3               6.50            71833        20000       120000      1.17            0.00             0.17       17.83
4               5.14            72000        40000       90000       1.82            0.91             0.00       4.32

Thanks to the feature scaling, my second iteration produced good clusters. Or did it? While scanning the tables revealed no noticeable skew from over-weighted features, it's very difficult for humans to understand and visualize clusters in 174 dimensions. So how do you validate clustering results without visualizing them? A full answer is beyond the scope of this article, but a succinct and easy-to-understand document that explores this topic further can be found here.
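
That said, one quick quantitative sanity check (which I did not use in the original analysis) is the mean silhouette coefficient, which scores how well each observation fits its assigned cluster:

from sklearn.metrics import silhouette_score

# Ranges from -1 (wrong clusters) to +1 (dense, well-separated clusters)
print(silhouette_score(survey_df_normalized, kmeans.labels_))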

In order to sidestep the difficulties of visualizing cluster data in high dimensions, for my third iteration I chose 3 features to cluster on. This allowed me to produce a 3D scatter plot of the respondents and verify that the clustering was successful.

survey_df_selected = survey_df_normalized[['SALARY', 'DAYS_OFF', 'WORK_HIST_COUNT']]

kmeans = KMeans(n_clusters=5, max_iter=500)
kmeans.fit(survey_df_selected)
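
The plotting itself can be done with matplotlib's 3D toolkit; here is a minimal sketch of the scatter plot described above, with points colored by cluster label:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(survey_df_selected['SALARY'],
           survey_df_selected['DAYS_OFF'],
           survey_df_selected['WORK_HIST_COUNT'],
           c=kmeans.labels_)
ax.set_xlabel('SALARY')
ax.set_ylabel('DAYS_OFF')
ax.set_zlabel('WORK_HIST_COUNT')
plt.show()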



Principal Component Analysis (PCA) is a dimensionality reduction method that could prove useful in reducing the N-dimensional vector space to the 2 or 3 components that are most significant (show the most variance). While I didn't implement PCA for this project, I was able to make small steps toward marginal success in clustering the survey respondents.
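
If I had pursued PCA, the scikit-learn version is only a few lines (a sketch, not something from the original analysis):

from sklearn.decomposition import PCA

pca = PCA(n_components=3)
reduced = pca.fit_transform(survey_df_normalized)

# Fraction of the total variance captured by each of the 3 components
print(pca.explained_variance_ratio_)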

Predicting Salary

The second question I had that requires the use of machine learning algorithms is: what are the important features in this dataset when it comes to determining salary, and given those features, can we predict an expected salary?

This distills down to a classification problem, and there are many algorithms for solving these kinds of problems: logistic regression, naive Bayes classifiers, support vector machines, neural networks, etc.

I chose a decision tree algorithm because decision trees have built-in feature selection, are easy to understand and interpret, work on both categorical and continuous variables, and require little data preparation. That made it a perfect match for my use case: it automatically answers the question "What are the most important features in this dataset when it comes to determining salary?" And because this is a survey with an expected margin of error, it made sense for me to optimize for usability and demo-ability over accuracy.

The decision tree algorithm works by recursively splitting a set of data observations into subsets based on the attribute that most effectively splits the data set into groups (subsets) of observations with similar outcomes. This results in a tree structure in which each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label or regression value.

from sklearn.tree import DecisionTreeClassifier, export_graphviz

y = survey_df["SALARY"]
X = survey_df.drop(["SALARY", "STACK"], axis=1, inplace=False)
tree = DecisionTreeClassifier(min_samples_split=4)
tree.fit(X, y)

[Figure: the decision tree fitted to the survey data (click to see larger figure)]

For my dataset, the decision tree algorithm found the test "does your company provide more than 8.5 meals per month?" to be the most effective first split. If the answer is no, the left branch is taken and the next test becomes "do you have more or less than 8.5 years of experience?" If the answer to that test is "less than 8.5 years", the left branch is taken; otherwise the right branch is taken. The tests answer the question "What are the most important features in this dataset when it comes to determining salary?" The leaf (terminal) nodes show a prediction based on the observations in the dataset. So, for instance, if your company provides fewer than 8.5 meals per month, you have less than 8.5 years of experience, you've read fewer than 1.5 books, and your company does not provide a 401K, you are likely to earn $70,000. This prediction is based on the fact that 3 observations in the dataset with those same attributes earn that salary.

A few things to note. Like many other machine learning algorithms, decision trees are vulnerable to overfitting, which means the model is fit too closely to a limited set of observations and hence describes the random error instead of the underlying relationship. There are many ways to combat overfitting that are better covered in a machine learning course; one interesting approach is to use random forests in place of decision trees, as sketched below. For the decision tree I present in this post, I excluded the "STACK" feature (i.e. the survey respondents' skills) from the model for simplicity's sake. I have a version of the model that takes skills into account, but it has significantly more tests, and although it appears to be more accurate, it is also more susceptible to overfitting. Finally, confidence in a decision tree's classification prediction can be quantified by outputting class probabilities; for regression trees, the confidence measure is the error. This blog post has more information.
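
Swapping the decision tree for a random forest is nearly a one-line change in scikit-learn (the hyperparameters here are illustrative):

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, min_samples_split=4)
forest.fit(X, y)

# Random forests also expose per-feature importance scores
print(forest.feature_importances_)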

After you train your decision tree model in Python using scikit-learn, predicting outcomes is as easy as calling the predict method on the model object. If you would like to create a rules engine from the decision tree, or you simply want a textual representation of the tree, check out the last section of this blog post or this Stack Overflow question.
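For example, a sketch of predicting a salary for a new (hypothetical) respondent, whose feature vector must have the same columns as the training data X:

# Using an existing row as a stand-in for a new respondent's answers
new_respondent = X.iloc[[0]]

print(tree.predict(new_respondent))        # predicted salary class
print(tree.predict_proba(new_respondent))  # class probabilities mentioned above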

Conclusion

Here's what I learned in the process:
  • To ask specific and deeper questions of your data, custom coding is usually your best bet. Most tools generalize to meet a broader set of needs, whereas custom Python code can be tailored to your questions and to the specific data that you have.
  • The pandas library makes it easy to perform standard statistical analysis on data tables with methods like std, var, mean, quantile, etc. (see the full list in the pandas documentation).
  • Clustering high-dimensional data is easy enough to do with the scikit-learn clustering implementations, but validating those clusters is a harder problem that requires more research.
  • Decision trees are a very robust tool for both feature importance and prediction, but care must be taken to ensure you don't overfit your model.
There are many more insights that could be extracted from the survey data given adequate amounts of data and the right machine learning tools and techniques. If anyone is interested in this project, check out the GitHub repository: github.com/tim-ojo/chadev_survey.



