Random imputation by group using pandas library

view notebook on github

Kaggle's introductory competition, Titanic: Machine Learning from Disaster, provides datasets of passengers who were on board the RMS Titanic. One of the features, age, is missing a sizeable share of its values: 177 out of 891 rows in the training data. Below is an attempt to fill those gaps using random imputation within logical groups.

In [1]:
import numpy as np
import pandas as pd
import re

train = pd.read_csv('../data/raw/train.csv')
train.columns = [x.lower() for x in train.columns]
# identify rows with no age
train['missingage'] = train['age'].isnull()
# count rows with and without an age
train['missingage'].value_counts()
False    714
True     177
Name: missingage, dtype: int64

Extract title

First, we extract people's titles from their names following the feature engineering steps from Titanic best working classifier by Sina.

  1. Individual titles are extracted using a regular expression
  2. Similar titles (e.g., Miss, Ms, and Mlle) are grouped and other rare occurrence titles are grouped separately
In [2]:
def getTitle(name):
    # raw string so that \. is passed to the regex engine unchanged
    title = re.search(r'([A-Za-z]+)\.', name)
    # check if title exists
    if title:
        return title.group(1)
    return ''

train['title'] = train['name'].apply(getTitle)
pd.crosstab(train['sex'], train['title'])
title   Capt  Col  Countess  Don  Dr  Jonkheer  Lady  Major  Master  Miss  Mlle  Mme   Mr  Mrs  Ms  Rev  Sir
female     0    0         1    0   1         0     1      0       0   182     2    1    0  125   1    0    0
male       1    2         0    1   6         1     0      2      40     0     0    0  517    0   0    6    1
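
As a quick check of the pattern, the sketch below applies the same regular expression to a few names. The sample strings are illustrative and simply follow the dataset's "Surname, Title. Given names" format, and extract_title is a stand-in name for the getTitle function above.

```python
import re

def extract_title(name):
    """Return the first run of letters that is followed by a period."""
    match = re.search(r'([A-Za-z]+)\.', name)
    return match.group(1) if match else ''

# illustrative names in the "Surname, Title. Given names" format
samples = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'No Title Here',
]
titles = [extract_title(name) for name in samples]
print(titles)  # ['Mr', 'Miss', '']
```

A name without any "letters followed by a period" sequence yields an empty string rather than raising an error.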
In [3]:
train['title'] = train['title'].replace({
        # group non-common titles as 'other'
        r'^(?:(?!Mlle|Miss|Ms|Mme|Mr|Mrs|Master).)*$': 'Other',
        # replace Mlle and Miss with Ms
        r'^(Mlle|Miss)': 'Ms',
        # replace Mme with Mrs
        'Mme': 'Mrs'
    }, regex=True)
pd.crosstab(train['sex'], train['title'])
title   Master   Mr  Mrs   Ms  Other
female       0    0  126  185      3
male        40  517    0    0     20
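
The negative lookahead in the first pattern can be hard to read. As an alternative sketch (not the approach used above), the same grouping can be written with an explicit dictionary and Series.map, where any title missing from the dictionary falls through to 'Other'. The title_groups dictionary and the sample titles are assumptions made for illustration.

```python
import pandas as pd

# explicit mapping of raw titles to groups (illustrative)
title_groups = {
    'Mr': 'Mr', 'Mrs': 'Mrs', 'Mme': 'Mrs',
    'Ms': 'Ms', 'Miss': 'Ms', 'Mlle': 'Ms',
    'Master': 'Master',
}

raw = pd.Series(['Mr', 'Miss', 'Mlle', 'Mme', 'Dr', 'Master', 'Rev'])
# titles missing from the mapping become NaN, then 'Other'
grouped = raw.map(title_groups).fillna('Other')
print(grouped.tolist())
# ['Mr', 'Ms', 'Ms', 'Mrs', 'Other', 'Master', 'Other']
```

The dictionary makes every grouping decision visible at a glance, at the cost of listing each common title by hand.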

Examine relationship

We will use the extracted title feature to impute missing age values group by group. Before performing the imputation, we examine the relationship between title, sex, and age with the following scatterplots overlaid on boxplots.

In [4]:
# load necessary plotting libraries and 
# function from src.visualization
import os
import sys
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
# make the project's src directory importable
sys.path.append(src_dir)
from visualization import visualize as viz
from bokeh.palettes import Set2
from bokeh.io import output_notebook, output_file
# create the plot
viz.plotAgeByTitleSex(train, Set2[3])
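
The imputation step itself falls outside this excerpt. A minimal sketch of random imputation by group, assuming missing ages are filled by sampling with replacement from the observed ages that share the same title, could look like the following. The function name impute_by_group, the seed, and the toy data are hypothetical and not taken from the notebook.

```python
import numpy as np
import pandas as pd

def impute_by_group(df, group_col='title', value_col='age', seed=0):
    """Fill missing values by sampling observed values from the same group."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for _, idx in out.groupby(group_col).groups.items():
        values = out.loc[idx, value_col]
        observed = values.dropna()
        missing = values[values.isnull()].index
        # skip groups with nothing to fill or nothing to sample from
        if len(observed) > 0 and len(missing) > 0:
            out.loc[missing, value_col] = rng.choice(
                observed.to_numpy(), size=len(missing), replace=True)
    return out

# toy data standing in for the Titanic training set
toy = pd.DataFrame({
    'title': ['Mr', 'Mr', 'Mr', 'Ms', 'Ms', 'Master', 'Master'],
    'age': [30.0, np.nan, 40.0, 22.0, np.nan, 4.0, np.nan],
})
filled = impute_by_group(toy)
print(filled)
```

Sampling from observed values keeps each group's age distribution roughly intact, whereas filling with a group mean or median would shrink its variance.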

Setting up a collaborative Python project environment

Recently, I convinced two of my friends to form a data science study group. We plan to follow courses from online educational platforms such as DataCamp and edX while completing projects using data from Kaggle. Before starting our first project, Kaggle’s introductory competition, we walked through the following steps to set up a common, collaborative project environment.

Create a virtual environment

Conda is an open source package and environment manager included in the Anaconda distribution. The distribution includes the majority of the common scientific Python packages used for data science. We will use Conda to create a virtual environment and manage packages separately for each project. Miniconda is a lighter installer that ships Conda without the full set of Anaconda packages.

Having installed the Conda manager, we can set up an individual Python environment for each project. Each environment can be configured with a different set of packages and even a different Python version. To create a new environment, run the following command in Terminal on macOS and Linux, or in Anaconda Prompt on Windows.

$ conda create -n titanic python=3.6

Here, titanic is the environment name and python=3.6 specifies the Python version for the environment. To activate the environment, use

$ source activate titanic

on macOS or Linux, or

$ activate titanic

on Windows. Conda 4.4 and later also accept conda activate titanic on all three platforms.
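
Once the environment is activated, a quick check from within Python can confirm that the interpreter belongs to it. The sketch below relies on the CONDA_DEFAULT_ENV variable that Conda sets on activation. The describe_environment helper is a hypothetical name, and the variable is absent when no Conda environment is active.

```python
import os
import sys

def describe_environment():
    """Collect basic facts about the running interpreter."""
    return {
        # name of the active Conda environment, or None outside Conda
        'conda_env': os.environ.get('CONDA_DEFAULT_ENV'),
        # directory the interpreter was installed into
        'prefix': sys.prefix,
        # major.minor version of the interpreter
        'python': '{}.{}'.format(*sys.version_info[:2]),
    }

info = describe_environment()
print(info)
```

If everything is set up as described, conda_env should read titanic and python should read 3.6.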

Plotting linked plots using bokeh library

view notebook on github

In this post, I am going to create interlinked, interactive scatter plots using the Bokeh library. Below is the description of the library from the homepage.

Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.

I quite like its clean look and, more than anything, its interactive visualization capabilities. It also allows JavaScript-based browser interactions without learning JavaScript. I have been picking up what it can do from its documentation and the tutorials available on the Bokeh NBViewer Gallery.

Load libraries

First, I am going to load the libraries I will use and run the output_notebook function from the bokeh library. The function configures Bokeh plot objects to be displayed in the notebook.

In [1]:
import pandas as pd
from bokeh.io import output_notebook, show, output_file
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
from bokeh.models import CategoricalColorMapper
from bokeh.models import Plot, Range1d, HoverTool
from bokeh.layouts import gridplot
from bokeh.palettes import Set2
# display Bokeh plots inline in the notebook
output_notebook()
BokehJS 0.12.7 successfully loaded.

Load data

To enable interlinking between plots, a common ColumnDataSource needs to be used as the data source between plots. You can create one from a pandas DataFrame or a dictionary. I am going to use the diabetes dataset originally from here to demonstrate this. Below is a brief description of the dataset from the original source.

Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.

I am going to plot each of the 9 numeric features against the response variable on individual scatter plots. In the code block below, the dataset is loaded as a pandas DataFrame and a ColumnDataSource is defined using the DataFrame.

In [2]:
df = pd.read_table('../data/diabetes_tab.txt')
# assuming 1 is female and 2 is male
df['Gender'] = ['FEMALE' if x == 1 else 'MALE' 
                for x in df.SEX.values]
df.rename(columns={'AGE': 'Age'}, inplace=True)
one_source = ColumnDataSource(df)
# preview the first rows
df.head()
   Age  SEX   BMI     BP   S1     S2    S3   S4      S5  S6    Y  Gender
0   59    2  32.1  101.0  157   93.2  38.0  4.0  4.8598  87  151    MALE
1   48    1  21.6   87.0  183  103.2  70.0  3.0  3.8918  69   75  FEMALE
2   72    2  30.5   93.0  156   93.6  41.0  4.0  4.6728  85  141    MALE
3   24    1  25.3   84.0  198  131.4  40.0  5.0  4.8903  89  206  FEMALE
4   50    1  23.0  101.0  192  125.4  52.0  4.0  4.2905  80  135  FEMALE
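
To illustrate why a shared ColumnDataSource matters, here is a minimal sketch of two plots that read from one source, so that a selection made in one plot is reflected in the other. It uses the first four rows of the table above. The variable names and the choice of tools are assumptions, and keyword arguments may differ slightly between Bokeh versions.

```python
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# first four rows of the table above
source = ColumnDataSource(data={
    'Age': [59, 48, 72, 24],
    'BMI': [32.1, 21.6, 30.5, 25.3],
    'Y': [151, 75, 141, 206],
})

tools = 'box_select,lasso_select,tap,reset'
left = figure(title='Age vs. Y', tools=tools)
right = figure(title='BMI vs. Y', tools=tools)

# both glyphs read from the same source, so selections are shared
left_glyph = left.scatter('Age', 'Y', source=source)
right_glyph = right.scatter('BMI', 'Y', source=source)

# arrange the two figures side by side
grid = gridplot([[left, right]])
```

Passing grid to show would render both plots. Selecting points in one of them should then highlight the same rows in the other, because the selection is stored on the shared source rather than on either figure.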

Create an interactive scatter plot

Next, I am going to create a single scatter plot of age against the response variable. I am going to add a few interactions, including a hover effect showing the x and y values of each point, along with the following tools.

  • Box select: Highlight data points selected in a rectangular box by dragging the mouse
  • Lasso select: Highlight data points selected in a lasso shape by dragging the mouse
  • Tap: Highlight selected data points by clicking the mouse
  • Wheel zoom: Zoom in and out of the plot using the mouse wheel
  • Reset: Reset the plot to its default state
In [3]:
# define a color map for the Gender variable
cmap = CategoricalColorMapper(
    factors=('FEMALE', 'MALE'),
    # NOTE: the palette argument was lost from the source;
    # using the first two Set2 colors is an assumption
    palette=Set2[3][:2])
# define a function to enable reuse
def plot_diabetes(x, width=480, height=320, 
                  legend=None, legend_location=None, 
                  # parameter assumed from its use further down
                  legend_orientation=None):
    hover = HoverTool(
        tooltips=[('Index', '$index'), 
                  (x, '$x'), 
                  ('Progression', '$y'), 
                  ('Gender', '@Gender')])
    tools = [hover, 'box_select', 'tap', 
             'wheel_zoom', 'reset', 'help']
    plt = figure(width=width, height=height,
                 title=x + ' vs. diabetes progression',
                 # tools argument assumed; the call was cut off in the source
                 tools=tools)
    plt.circle(x, 'Y', alpha=0.8, source=one_source,
               fill_color={'field': 'Gender', 'transform': cmap},
               line_color={'field': 'Gender', 'transform': cmap},
               # highlight when selected
               selection_fill_color={'field': 'Gender', 'transform': cmap},
               selection_line_color={'field': 'Gender', 'transform': cmap},
               # mute when not selected
               nonselection_fill_color={'field': 'Gender', 'transform': cmap},
               # legend argument assumed (Bokeh 0.12.x keyword);
               # the rest of this call was cut off in the source
               legend=legend)
    plt.xaxis.axis_label = x
    plt.xaxis.axis_label_text_font_style = 'normal'
    plt.yaxis.axis_label = 'Diabetes progression'
    plt.yaxis.axis_label_text_font_style = 'normal'
    # guard assumed from the indentation of the three lines below
    if legend:
        plt.legend.location = legend_location
        plt.legend.orientation = legend_orientation
        plt.legend.background_fill_alpha = 0.7
    # return the figure so that callers can show or arrange it
    return plt

p1 = plot_diabetes('Age', legend='Gender', legend_location='top_left', 

Hello world!

Welcome to my personal blog. I recently recreated my domain, and my previous blog was wiped out in the process, so it is pretty empty for now. I will make sure to fill it up soon!