Getting Insights from Survey Results using Data Science in Python

6/22/2016


I'm neither a statistician nor a professional data scientist. But when a friend and fellow Chattanooga-based developer, Jared Menard, and I started talking about the salary survey he put together, I was intrigued by the insights that could be gleaned from the data. The survey asked questions like:

  • How much money do you make a year?
  • How many years of experience do you have?
  • How many books have you read in the last year?
  • Do you have a degree? Is it related to your career?
  • Name some things that you use at work.

Of the roughly 500 developers targeted, 67 responded to the survey, putting us short of the 80 respondents needed for a 10% margin of error, per SurveyMonkey. Also, the survey was not designed by a professional researcher or survey designer. So while I readily acknowledge that there may be some potential flaws in the findings, I was more interested in the process of using data science and machine learning techniques to get deeper insights from the data than what could be provided by the basic statistical tools of mean, mode, standard deviation, scatter plots and column charts.



Two of the questions I was hoping to answer were:
1. What kinds/groups of developers answered the survey, and what are the general characteristics of those groupings?
2. Of all the features in the dataset, which are the most important in determining salary, and given those features, can we predict an expected salary?

Data Preparation

To get started, I had to do what every data scientist is forced to spend 80% of their time doing: data cleaning and preparation. Due to the small size of the dataset, it was easy enough to pull up the CSV file in a text editor or spreadsheet program and visually examine the data to see what needed fixing. For larger datasets, you would have to rely on tools to find missing values and outliers, generate histograms, and so on. For example, the DataFrame class from the pandas library provides methods like isnull() and fillna() that can be used to manage missing values.
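
For instance, here is a quick way to see which columns have missing values (a minimal sketch; the CSV file name is an assumption):

import pandas as pd

survey_df = pd.read_csv('survey_results.csv')

# Count the missing values in each column
print(survey_df.isnull().sum())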

When it comes to missing values, there are a number of strategies for handling them. You can drop the record with the missing value entirely, set the missing value to 0 / 'NA' / -1, or impute the missing value. Common imputation methods include calculating the mean value, carrying forward the last observation, and using a predicted value from a regression analysis. For this survey, one record was dropped because the respondent failed to complete the survey, a couple of missing values in other records were set to 0 or NA, and a couple of others were imputed based on the previous answers given by the respondent.
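
Here is roughly what those strategies look like in pandas (a sketch; the specific columns chosen are illustrative, not the exact ones I treated):

# Drop any record that is missing a salary
survey_df = survey_df.dropna(subset=['SALARY'])

# Set a missing value to a sentinel like 0
survey_df['CONF_BUDGET'] = survey_df['CONF_BUDGET'].fillna(0)

# Impute a missing value with the column mean
survey_df['BOOKS_READ'] = survey_df['BOOKS_READ'].fillna(survey_df['BOOKS_READ'].mean())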

Another data preparation task that was done was taking a continuous variable (number of employees at your company) and turning it into a categorical variable (micro, small, medium, large, enterprise). This task could easily be accomplished with a few lines of Python code.

def companySizeCat(numEmployees):
    if numEmployees < 7: # micro
        return 1
    elif numEmployees < 250: # small
        return 2
    elif numEmployees < 500: # medium
        return 3
    elif numEmployees < 1000: # large
        return 4
    else: # enterprise
        return 5 
 
survey_df['COMP_SIZE'] = survey_df['COMP_SIZE'].apply(companySizeCat)

Finally, the survey included a free text field for the question "Name some things that you use at work". In order to do any accurate text mining on that field, I opted to perform a number of transformations: I lowercased all the words in the field, standardized/stemmed certain terms (e.g. 'osx' and 'os x' both get standardized to 'osx'), and deduced broader tools from specific ones.
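
The sketch below shows the general shape of those transformations; the synonym map here is illustrative, and the actual one I used was larger:

# Map variant spellings to a single canonical term (illustrative entries only)
SYNONYMS = {'os x': 'osx', 'postgresql': 'postgres', 'node.js': 'node'}

def clean_stack_text(text):
    text = text.lower()
    for variant, canonical in SYNONYMS.items():
        text = text.replace(variant, canonical)
    return text

survey_df['STACK'] = survey_df['STACK'].fillna('').apply(clean_stack_text)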

Examining the Data

Once you get your data into a pandas DataFrame, examining it is quite trivial, especially if you are working in a Jupyter notebook environment. Pandas gives you lots of methods, such as describe, head, tail, hist, info, median, mean, mode, std, nlargest and nsmallest, that you can use to gain an understanding of what your data looks like.

For example, calling the describe() method shows various summary statistics for every column in your dataset. I transpose the output here so the wide table is easier to read.

survey_df.describe().transpose()
 

                 count  mean      std       min    25%    50%    75%    max
YEARS_EXP        67     9.31      7.34      0.5    4.0    7.0    12.0   30
NUM_HOURS_WORK   67     43.04     6.88      20     40     40     47     65
DAYS_OFF         67     21.06     16.74     0      14     20     23.5   100
SALARY           67     82020.15  28621.55  20000  65000  77000  95000  185000
COMP_SIZE        67     198.93    329.75    1      10     30     250    1001
HEALTH_INS       67     0.88      0.33      False  1      1      1      True
COMP_MEALS       67     1.34      2.42      0      0      0      2      10
CONF_BUDGET      67     5302.24   27144.66  0      0      0      475    200000
401K             67     0.63      0.49      False  0      1      1      True
TRAINING_BUDGET  67     0.37      0.49      False  0      0      1      True
BOOKS_READ       67     8.83      10.46     0      2.5    5      10     50
DEGREE           67     0.78      0.42      False  1      1      1      True
CAREER_DEGREE    67     0.61      0.49      False  0      1      1      True
WORK_HIST_COUNT  67     5.33      4.25      1      3      4      6      27

Answering the question "Amongst the survey respondents, what is the most common company size, years of experience, number of jobs held, number of books read and salary?" becomes as easy as:

sel_columns_df = survey_df[['COMP_SIZE','YEARS_EXP', 'WORK_HIST_COUNT', 'BOOKS_READ', 'SALARY']]
sel_columns_df.mode()
 

   COMP_SIZE  YEARS_EXP  WORK_HIST_COUNT  BOOKS_READ  SALARY
0          2         10                3           3   75000
1        NaN        NaN                4         NaN     NaN

Note: for the work history count there is a tie; 3 and 4 are both the most frequently recorded number of jobs held.

Another question that Python's data science tools (scikit-learn and pandas) enable me to answer: what are the most common tools mentioned by the respondents?

To answer this, I first need to split the raw entered text into units representing each item; this is called tokenization. Then I need to create a matrix of the count of occurrences of each token for every entry. This matrix could feed other algorithms, but here I simply sum the counts and display the top ones.

# Define a function to tokenize the raw text entered by the users
def tokenizr(text):
    tokens = text.split(";")
    return [token.strip() for token in tokens if token.strip()]

# Replace missing entries with an empty string before vectorizing
tools = survey_df['STACK'].fillna('')

# Vectorize the text using scikit-learn's CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
tf_vectorizer = CountVectorizer(tokenizer=tokenizr)
tools_matrix = tf_vectorizer.fit_transform(tools)

tf_vectorizer.get_feature_names()
>>>
[u'.net',
 u'airflow',
 u'android',
 u'anger',
 u'angular',
 u'ansible',
 u'apache',
 u'apex',
 u'asp .net',
 u'asp mvc',
 u'atom',
 u'aws',
 u'baby bottles & burping cloths.',
 u'bash',
 u'bootstrap',
 u'brainpower',
 u'c++',
 u'centos', 
... 
]
 
# Turn the matrix into a dataframe
import pandas as pd
tools_matrix_df = pd.DataFrame(data=tools_matrix.toarray(),
                               columns=tf_vectorizer.get_feature_names())

# Get the most mentioned stack tools by summing the columns of tools_matrix_df
X = tools_matrix_df.sum()

X = X.sort_values(ascending=False)
X.head(10)
>>>
javascript    39
linux         23
node          22
ruby          22
rails         19
postgres      19
nginx         16
php           15
mysql         14
angular       12 
 
The results show that many of the survey respondents are frontend developers, or at the very least full stack developers, as javascript, ruby on rails, nginx, php and angular dominate the list.

Clustering the Respondent Data

The first question I had that requires the use of machine learning algorithms is: what kinds/groups of developers answered the survey, and what were the general characteristics of those groupings?

Being able to cluster respondents into groups may be useful for understanding, and anticipating, the behavior of the members of a certain group. It allows us to make assumptions about certain populations that are grounded in data. For instance, the data may show that infrastructure engineers tend to make certain salaries, work 40 hours a week, work for companies of a certain size and have X prior jobs, in contrast to frontend engineers, who may have a different salary range, work slightly more hours, have X years of experience and have read Y books on average.

To get started with clustering, I used the K-Means implementation provided by scikit-learn. I also performed a version of the clustering using scikit-learn's agglomerative clustering implementation, but since the results were similar to the K-Means solution, I chose to focus on iterating on my K-Means solution. K-Means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. My first iteration of the solution simply fed the algorithm the entire dataframe and asked it to come up with 5 clusters:

# Using a basic KMeans algorithm
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5, max_iter=500)

# Attempt 1: Just give it all the data in its raw, naked glory and see what it comes up with
kmeans.fit(survey_df)

# Taking a look at what records were assigned to what cluster, I notice a disturbing trend: Salary is outweighing all other features
survey_df_c1 = survey_df.copy()
survey_df_c1.insert(0, 'CLUSTER1_LABEL', kmeans.labels_)
survey_df_c1.head(10)
 

   CLUSTER1_LABEL  YEARS_EXP  NUM_HOURS_WORK  DAYS_OFF  SALARY  COMP_SIZE  HEALTH_INS  COMP_MEALS  CONF_BUDGET  401K   TRAINING_BUDGET  BOOKS_READ  DEGREE  CAREER_DEGREE  WORK_HIST_COUNT
0  0               5          35              20        70000   1          True        4           0            False  False            0           True    True           8
1  2               10         40              0         120000  1          False       4           3000         False  True             3           False   False          6
2  0               11         45              15        75000   2          True        0           0            False  False            2           True    True           6
3  0               6          40              16        77000   2          True        0           0            False  False            10          True    True           3
4  0               6          50              14        70000   2          True        0           0            False  False            3           False   False          3
5  3               2          50              14        45000   1          False       0           0            False  True             4           True    False          1
6  0               16         50              15        93550   3          True        0           350          True   True             22          True    True           9
7  3               1          40              60        50000   1          False       0           0            True   False            24          True    False          5
8  2               18         50              30        130000  5          True        0           0            True   False            10          False   False          3
9  2               30         20              20        120000  2          True        0           5000         True   True             10          False   False          20

(the ~160 one-hot tool columns produced by vectorizing the STACK field are elided here)
 
 
# Here's a hack in place of visualization:
# to prove that out a bit more, I group by the cluster label and run some stats
import numpy as np

cl1_groups = survey_df_c1.groupby(['CLUSTER1_LABEL'])
cl1_groups.agg([np.mean, np.min, np.max, np.std])
 

(Output abridged to the columns that matter for the discussion below; the full table holds mean/min/max/std for every one of the ~174 columns.)

CLUSTER1_LABEL  YEARS_EXP mean  SALARY mean  SALARY min  SALARY max  CONF_BUDGET mean
0               10.33           82586        68000       100000      424
1               9.00            90000        90000       90000       200000
2               12.83           126666       105000      185000      3125
3               5.15            50426        20000       65000       147
4               1.00            55000        55000       55000       100000
 

This naive solution failed, in my opinion. Because a naive K-Means implementation is sensitive to scale, the salary feature vastly outweighed the other features in determining the clusters, so the result was clusters dictated by salary ranges: respondents with salaries from 20K to 65K landed in cluster 3, 68K to 100K in cluster 0, 105K to 185K in cluster 2, and so on.

In my second iteration of the K-Means solution, I normalized (scaled) the data so that all the fields contained values between 0 and 1.

def normalize(df, columns):
    result = df.copy()
    for feature_name in columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result
 
survey_df_normalized = normalize(survey_df, ['YEARS_EXP', 'NUM_HOURS_WORK', 'DAYS_OFF', 'SALARY', 'CONF_BUDGET', 'BOOKS_READ', 'WORK_HIST_COUNT'])
survey_df_normalized.drop(['STACK', 'COMP_MEALS'], axis=1, inplace=True) 

kmeans.fit(survey_df_normalized)

survey_df_c2 = survey_df.copy()
survey_df_c2.insert(0, 'CLUSTER2_LABEL', kmeans.labels_)
cl2_groups = survey_df_c2.groupby(['CLUSTER2_LABEL'])
cl2_groups.agg([np.mean, np.min, np.max, np.std]) 
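
As an aside, scikit-learn provides the same min-max scaling via its MinMaxScaler, so the hand-rolled normalize() above could be replaced with something like this sketch:

from sklearn.preprocessing import MinMaxScaler

columns = ['YEARS_EXP', 'NUM_HOURS_WORK', 'DAYS_OFF', 'SALARY',
           'CONF_BUDGET', 'BOOKS_READ', 'WORK_HIST_COUNT']
scaler = MinMaxScaler()
survey_df_normalized = survey_df.copy()
survey_df_normalized[columns] = scaler.fit_transform(survey_df[columns])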
 

(Output abridged as before; the full table holds mean/min/max/std for every column.)

CLUSTER2_LABEL  YEARS_EXP mean  SALARY mean  SALARY min  SALARY max  COMP_SIZE mean  HEALTH_INS mean  401K mean  BOOKS_READ mean
0               9.91            82709        40000       115000      2.18            1.00             1.00       11.32
1               13.50           94696        52000       185000      4.57            1.00             0.86       6.71
2               8.64            80500        30000       165000      2.00            0.93             0.36       6.71
3               6.50            71833        20000       120000      1.17            0.00             0.17       17.83
4               5.14            72000        40000       90000       1.82            0.91             0.00       4.32

Thanks to the feature scaling, my second iteration produced good clusters. Or did it? While scanning the tables revealed no noticeable skew from over-weighted features, it's very difficult for humans to understand and visualize clusters in 174 dimensions. So how do you validate clustering results without visualizing them? A full answer is beyond the scope of this article, but a succinct and easy-to-understand document that explores this topic further can be found here.
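
That said, one quick quantitative sanity check (which I did not use in the original analysis) is the mean silhouette coefficient, which scores how well each observation fits its assigned cluster:

from sklearn.metrics import silhouette_score

# Ranges from -1 (wrong clusters) to +1 (dense, well-separated clusters)
print(silhouette_score(survey_df_normalized, kmeans.labels_))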

In order to sidestep the difficulties of visualizing cluster data in high dimensions, for my third iteration I chose 3 features to cluster on. This allowed me to produce a 3D scatter plot of the respondents and verify that the clustering was successful.

survey_df_selected = survey_df_normalized[['SALARY', 'DAYS_OFF', 'WORK_HIST_COUNT']]

kmeans = KMeans(n_clusters=5, max_iter=500)
kmeans.fit(survey_df_selected)
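
The plotting itself can be done with matplotlib's 3D toolkit; here is a minimal sketch of the scatter plot described above, with points colored by cluster label:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(survey_df_selected['SALARY'],
           survey_df_selected['DAYS_OFF'],
           survey_df_selected['WORK_HIST_COUNT'],
           c=kmeans.labels_)
ax.set_xlabel('SALARY')
ax.set_ylabel('DAYS_OFF')
ax.set_zlabel('WORK_HIST_COUNT')
plt.show()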



Principal Component Analysis (PCA) is a dimensionality reduction method that could prove useful in reducing the N-dimensional vector space to the 2 or 3 components that are most significant (show the most variance). While I didn't implement PCA for this project, I was able to make small steps toward marginal success in clustering the survey respondents.
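
If I had pursued PCA, the scikit-learn version is only a few lines (a sketch, not something from the original analysis):

from sklearn.decomposition import PCA

pca = PCA(n_components=3)
reduced = pca.fit_transform(survey_df_normalized)

# Fraction of the total variance captured by each of the 3 components
print(pca.explained_variance_ratio_)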

Predicting Salary

The second question I had that requires the use of machine learning algorithms is: what are the important features in this dataset when it comes to determining salary, and given those features, can we predict an expected salary?

This distills down to a classification problem, and there are many algorithms for solving these kinds of problems: logistic regression, naive Bayes classifiers, support vector machines, neural networks, etc.

I chose a decision tree algorithm because decision trees have built-in feature selection, are easy to understand and interpret, work on both categorical and continuous variables, and require little data preparation. That made it a perfect match for my use case: it automatically answers the question "What are the most important features in this dataset when it comes to determining salary?" And because this is a survey with an expected margin of error, it made sense for me to optimize for usability and demo-ability over accuracy.

The decision tree algorithm works by recursively splitting a set of data observations into subsets based on the attribute that most effectively splits the data set into groups (subsets) of observations with similar outcomes. This results in a tree structure in which each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label or regression value.

from sklearn.tree import DecisionTreeClassifier, export_graphviz

y = survey_df["SALARY"]
X = survey_df.drop(["SALARY", "STACK"], axis=1, inplace=False)
tree = DecisionTreeClassifier(min_samples_split=4)
tree.fit(X, y)

[Figure: the decision tree fitted to the survey data (click to see larger figure)]

For my dataset, the decision tree algorithm found the test "does your company provide more than 8.5 meals per month?" to be the most effective first split. If the answer is no, the left branch is taken and the next test becomes "do you have more or less than 8.5 years of experience?" If the answer to that test is "less than 8.5 years", the left branch is taken; otherwise the right branch is taken. The tests answer the question "What are the most important features in this dataset when it comes to determining salary?" The leaf (terminal) nodes show a prediction based on the observations in the dataset. So, for instance, if your company provides fewer than 8.5 meals per month, you have less than 8.5 years of experience, you've read fewer than 1.5 books, and your company does not provide a 401K, you are likely to earn $70,000. This prediction is based on the fact that 3 observations in the dataset with those same attributes earn that salary.

A few things to note. Like many other machine learning algorithms, decision trees are vulnerable to overfitting, which means the model is fit too closely to a limited set of observations and hence describes the random error instead of the underlying relationship. There are many ways to combat overfitting that are better covered in a machine learning course; one interesting approach is to use random forests in place of decision trees, as sketched below. For the decision tree I present in this post, I excluded the "STACK" feature (i.e. the survey respondents' skills) from the model for simplicity's sake. I have a version of the model that takes skills into account, but it has significantly more tests, and although it appears to be more accurate, it is also more susceptible to overfitting. Finally, confidence in a decision tree's classification prediction can be quantified by outputting class probabilities; for regression trees, the confidence measure is the error. This blog post has more information.
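
Swapping the decision tree for a random forest is nearly a one-line change in scikit-learn (the hyperparameters here are illustrative):

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, min_samples_split=4)
forest.fit(X, y)

# Random forests also expose per-feature importance scores
print(forest.feature_importances_)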

After you train your decision tree model in Python using scikit-learn, predicting outcomes is as easy as calling the predict method on the model object. If you would like to create a rules engine from the decision tree, or you simply want a textual representation of the tree, check out the last section of this blog post or this Stack Overflow question.
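For example, a sketch of predicting a salary for a new (hypothetical) respondent, whose feature vector must have the same columns as the training data X:

# Using an existing row as a stand-in for a new respondent's answers
new_respondent = X.iloc[[0]]

print(tree.predict(new_respondent))        # predicted salary class
print(tree.predict_proba(new_respondent))  # class probabilities mentioned above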

Conclusion

Here's what I learned in the process:
  • To ask specific and deeper questions of your data, custom coding is usually your best bet. Most tools generalize to meet a broader set of needs, whereas custom Python code can be tailored to your questions and to the specific data that you have.
  • The pandas library makes it easy to perform standard statistical analysis on data tables with methods like std, var, mean, quantile, etc. (see the full list in the pandas documentation).
  • Clustering high-dimensional data is easy enough to do with the scikit-learn clustering implementations, but validating those clusters is a harder problem that requires more research.
  • Decision trees are a very robust tool for both feature importance and prediction, but care must be taken to ensure you don't overfit your model.
There are many more insights that could be extracted from the survey data given adequate amounts of data and the right machine learning tools and techniques. If anyone is interested in this project, check out the GitHub repository: github.com/tim-ojo/chadev_survey.



