Setting up a collaborative Python project environment

Recently, I convinced two of my friends to form a study group on data science. We are planning to follow courses from different online educational platforms such as DataCamp and edX while completing projects using data from Kaggle. Before starting our first project with Kaggle’s introductory competition, we walked through the following steps to set up a common, collaborative project environment.

Create a virtual environment

Conda is an open source package and environment manager included in the Anaconda distribution. The distribution includes the majority of the common scientific Python packages used for data science. We will use Conda to create a virtual environment and manage packages separately per project. Miniconda is a lighter installer that ships only Conda, Python, and a small number of supporting packages.

Having installed the Conda manager, we can set up individual Python environments for each project. Each environment can be configured with different sets of packages and even different Python versions. To create a new environment, we run the following command in Terminal on macOS and Linux or in Anaconda Prompt on Windows.

$ conda create -n titanic python=3.6

Here, titanic is the environment name and python=3.6 specifies the Python version for the environment. To activate the environment, use

$ source activate titanic

on macOS or Linux, or

$ activate titanic

on Windows.

When activated, the environment’s name appears in the command-line prompt. To deactivate, use $ source deactivate on macOS and Linux or $ deactivate on Windows.
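A quick way to confirm that Python is really running from the activated environment is to ask the interpreter itself. This is only a minimal sketch: the helper name describe_interpreter is my own, and the idea that the prefix ends in something like envs/titanic is an assumption based on the commands above.

```python
import sys


def describe_interpreter():
    """Return the version and install prefix of the running interpreter."""
    return {
        "version": "{}.{}".format(sys.version_info.major,
                                  sys.version_info.minor),
        # Inside an activated Conda environment this path normally points
        # at the environment directory, e.g. .../envs/titanic (assumption).
        "prefix": sys.prefix,
    }


info = describe_interpreter()
print("Python", info["version"], "running from", info["prefix"])
```

If the printed path points somewhere other than the environment you expected, the environment is probably not active in that shell.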

Install packages

To install packages in a virtual environment, you can use conda install <package-name> after activating the environment. Initially, we install the following scientific packages.

(titanic) $ conda install scikit-learn
(titanic) $ conda install pandas

Conda also installs their dependencies such as numpy and scipy.

(titanic) $ conda install scikit-learn
Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /Users/xxxxx/anaconda3/envs/titanic:

The following NEW packages will be INSTALLED:

    mkl:          2017.0.3-0        
    numpy:        1.13.1-py36_0     
    scikit-learn: 0.19.0-np113py36_0
    scipy:        0.19.1-np113py36_0

Proceed ([y]/n)? y

mkl-2017.0.3-0 100% |########################################################| Time: 0:00:54   2.13 MB/s
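After the installation finishes, it can be handy to check from Python that the packages are actually importable in the active environment. The sketch below uses only the standard library; the helper name missing_modules is hypothetical, and the list of names is just the set of packages mentioned above.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Import names, not package names: scikit-learn is imported as "sklearn".
wanted = ["numpy", "scipy", "pandas", "sklearn"]
print("Missing:", missing_modules(wanted))
```

An empty list means every module was found. Note that this only locates the modules, it does not import them, so it runs quickly.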

Set up project structure

To enable collaboration among the study group members, we will work off a common GitHub repository per project. A consistent project structure is necessary to keep project artifacts clean and organized with multiple members working on the same project. There is a Python package just for that.

Cookiecutter is a command-line utility that creates projects from templates and there is a template for data science projects. While we may modify structure as needed, the template gives us a starting point.

Since the utility will be used outside any individual project, we install the package in the root Anaconda environment.

$ conda install cookiecutter

After installation, you can create a project structure with the following command in the directory where you want to place the project directory.

$ cookiecutter

The utility asks a few questions about the project and creates a directory structure based on the template.
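To give a feel for what a template-driven tool does, here is a tiny sketch that creates a fixed folder layout with the standard library. It is not Cookiecutter, and the folder names in LAYOUT are only my illustration, not the actual contents of the data science template.

```python
import tempfile
from pathlib import Path

# Illustrative layout only; these folder names are assumptions,
# not the contents of the actual data science template.
LAYOUT = ["data/raw", "data/processed", "notebooks", "src", "reports"]


def create_skeleton(root, project_name, layout=LAYOUT):
    """Create project_name under root with one sub-directory per layout entry."""
    project_dir = Path(root) / project_name
    for relative in layout:
        (project_dir / relative).mkdir(parents=True, exist_ok=True)
    (project_dir / "README.md").write_text("# {}\n".format(project_name))
    return project_dir


# Build the skeleton in a throwaway directory and list what was created.
with tempfile.TemporaryDirectory() as tmp:
    project = create_skeleton(tmp, "titanic")
    for path in sorted(p.relative_to(project) for p in project.rglob("*")):
        print(path)
```

A real template does much more than this (it fills in file contents from your answers, for instance), but the outcome is similar: everyone in the group starts from the same directory structure.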

Upload on GitHub

The git installer is available from here.

We first initialize the project directory as a local git repository using the following command in the directory.

$ git init

Then add and commit all files and sub-directories within the project directory. Note that git requires a commit message for each commit.

$ git add .
$ git commit -m 'Initial commit'

Now connect the local repository to a remote GitHub repository.

$ git remote add origin https://github.com/<user>/<repo-name>.git

The local and remote repositories are connected but the locally committed files aren’t available until they are pushed to the remote repository.

$ git push -u origin master


We now have

  • a common project repository where we can all pull from and push to
  • a standard project structure so that we can merge our individual contributions while keeping the structure organized and clean
  • a Python environment isolated for the project so we can also share our environment specifications per project

Plotting linked plots using bokeh library


In this post, I am going to create interlinked, interactive scatter plots using the Bokeh library. Below is the description of the library from the homepage.

Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.

I quite like its clean look and, more than anything, its interactive visualization capabilities. It also allows using JavaScript-based web browser interactions without learning JavaScript. I have been picking up what it can do from its documentation and the tutorials available on the Bokeh NBViewer Gallery.

Load libraries

First, I am going to load the libraries I am going to use and run the output_notebook function from the Bokeh library. The function configures Bokeh plot objects to be displayed in the notebook.

In [1]:
import pandas as pd
from bokeh.io import output_notebook, show, output_file
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
from bokeh.models import CategoricalColorMapper
from bokeh.models import Plot, Range1d, HoverTool
from bokeh.layouts import gridplot
from bokeh.palettes import Set2

# display Bokeh plots inline in the notebook
output_notebook()
BokehJS 0.12.7 successfully loaded.

Load data

To enable interlinking between plots, a common ColumnDataSource needs to be used as the data source between plots. You can create one from a pandas DataFrame or a dictionary. I am going to use the diabetes dataset originally from here to demonstrate this. Below is a brief description of the dataset from the original source.

Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.

I am going to plot each of the 9 numeric features against the response variable on individual scatter plots. In the code block below, the dataset is loaded as a pandas DataFrame and a ColumnDataSource is defined using the DataFrame.

In [2]:
df = pd.read_table('../data/diabetes_tab.txt')
# assuming 1 is female and 2 is male
df['Gender'] = ['FEMALE' if x == 1 else 'MALE'
                for x in df.SEX.values]
df.rename(columns={'AGE': 'Age'}, inplace=True)
# a single source shared by all plots
one_source = ColumnDataSource(df)
# preview the first rows
df.head()
   Age  SEX   BMI     BP   S1     S2    S3   S4      S5  S6    Y  Gender
0   59    2  32.1  101.0  157   93.2  38.0  4.0  4.8598  87  151    MALE
1   48    1  21.6   87.0  183  103.2  70.0  3.0  3.8918  69   75  FEMALE
2   72    2  30.5   93.0  156   93.6  41.0  4.0  4.6728  85  141    MALE
3   24    1  25.3   84.0  198  131.4  40.0  5.0  4.8903  89  206  FEMALE
4   50    1  23.0  101.0  192  125.4  52.0  4.0  4.2905  80  135  FEMALE
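As a small aside on the two ways of building a ColumnDataSource mentioned above, here is a minimal sketch using three columns from the first three rows of the table. The variable names (toy, source_from_dict, source_from_df) are just for illustration.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# Three columns from the first three rows of the diabetes table.
toy = {
    "Age": [59, 48, 72],
    "Y": [151, 75, 141],
    "Gender": ["MALE", "FEMALE", "MALE"],
}

# From a dictionary: the keys become the column names.
source_from_dict = ColumnDataSource(data=toy)

# From a DataFrame: the columns become the column names,
# and the DataFrame index is added as an extra column.
source_from_df = ColumnDataSource(pd.DataFrame(toy))

print(sorted(source_from_dict.column_names))
print(sorted(source_from_df.column_names))
```

Either way, glyphs then refer to the data by column name, which is what lets several plots share the same source.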

Create an interactive scatter plot

Next, I am going to create a single scatter plot with age and the response variable. I am going to add a few interaction effects including a hover effect showing the x, y values of each point.

  • Box select: Highlight data points selected in a rectangular box by dragging the mouse
  • Lasso select: Highlight data points selected in a lasso shape by dragging the mouse
  • Tap: Highlight selected data points by clicking the mouse
  • Wheel zoom: Zoom in and out of the plot using the mouse wheel
  • Reset: Reset the plot to its default state
In [3]:
# define a color map for the Gender variable
cmap = CategoricalColorMapper(
    factors=('FEMALE', 'MALE'),
    # palette assumed: Set2[3] is what the legend code further down uses
    palette=Set2[3])

# define a function to enable reuse
# (lines lost from the original cell are restored and marked "assumed")
def plot_diabetes(x, width=480, height=320,
                  legend=None, legend_location=None,
                  legend_orientation='vertical'):  # assumed default
    hover = HoverTool(
        tooltips=[('Index', '$index'),
                  (x, '$x'),
                  ('Progression', '$y'),
                  ('Gender', '@Gender')])
    # 'lasso_select' added to match the list of tools described above
    tools = [hover, 'box_select', 'lasso_select', 'tap',
             'wheel_zoom', 'reset', 'help']
    plt = figure(width=width, height=height,
                 title=x + ' vs. diabetes progression',
                 tools=tools)
    # glyph method assumed to be circle
    plt.circle(x, 'Y', alpha=0.8, source=one_source,
               fill_color={'field': 'Gender', 'transform': cmap},
               line_color={'field': 'Gender', 'transform': cmap},
               # highlight when selected
               selection_fill_color={'field': 'Gender', 'transform': cmap},
               selection_line_color={'field': 'Gender', 'transform': cmap},
               # mute when not selected
               nonselection_fill_color={'field': 'Gender', 'transform': cmap},
               legend=legend)
    plt.xaxis.axis_label = x
    plt.xaxis.axis_label_text_font_style = 'normal'
    plt.yaxis.axis_label = 'Diabetes progression'
    plt.yaxis.axis_label_text_font_style = 'normal'
    # only style the legend when one was requested
    if legend:
        plt.legend.location = legend_location
        plt.legend.orientation = legend_orientation
        plt.legend.background_fill_alpha = 0.7
    return plt

p1 = plot_diabetes('Age', legend='Gender', legend_location='top_left',
                   legend_orientation='horizontal')  # assumed value
show(p1)

Bokeh Plot

You can now see an interactive scatter plot. A toolbar is placed beside the plot where you can switch on and off different tools we included. In particular, in this plot you can see the values for each data point when you hover over them. You can set the list of values you want to show by configuring tooltips with a list of (label, value) pairs in the HoverTool object.

You can refer to different variables in the source dataset by prefixing them with @. Fields starting with $ are used for "special fields" such as the index, the coordinates, and the color. Apparently, the color values are pulled from the data source, not from the figure's `fill_color`.
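To make the difference between the two prefixes concrete, here is a minimal sketch of a tooltip list that mixes both kinds of fields. The helper name make_hover is hypothetical, and the BMI row with a {0.0} number format is my own addition to show a value read from a column rather than from the cursor position.

```python
from bokeh.models import HoverTool


def make_hover(x_label):
    """Build a HoverTool whose tooltips mix special ($) and column (@) fields."""
    return HoverTool(tooltips=[
        ("Index", "$index"),       # special field: row index of the point
        (x_label, "$x"),           # special field: x-coordinate under the cursor
        ("Progression", "$y"),     # special field: y-coordinate under the cursor
        ("Gender", "@Gender"),     # column field: value of the Gender column
        ("BMI", "@BMI{0.0}"),      # column field with a one-decimal number format
    ])


hover = make_hover("Age")
for label, value in hover.tooltips:
    print(label, "->", value)
```

Because $x and $y follow the cursor, they can differ slightly from the stored values of the point; an @ field always shows what is in the data source.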

Create multiple linked plots

Now, I am going to create multiple plots and place them in a single grid using bokeh library's gridplot. The plots are linked by a single data source. Selecting data points in one plot will highlight the same data points in all.

In [4]:
plots = [plot_diabetes(x, width=240, height=180) 
         for x in df.columns 
         if x not in ['SEX', 'Gender', 'Y']]

# create an empty plot with only the title
gtitle = figure(width=240, height=80, title="Linked scatter plots")
# draw one invisible point so the plot renders (glyph method assumed to be circle)
gtitle.circle(0, 0, fill_color=None, line_color=None)
gtitle.border_fill_color = None
gtitle.grid.visible = False
gtitle.axis.visible = False
gtitle.outline_line_color = None

# create an empty plot with only the legend
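glegend = figure(width=240, height=80, title=None)
# one point per category, drawn only to populate the legend (glyph method assumed to be circle)
glegend.circle(0, 0, fill_color=Set2[3][0], line_color=Set2[3][0], legend='FEMALE')
glegend.circle(0, 0, fill_color=Set2[3][1], line_color=Set2[3][1], legend='MALE')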
glegend.border_fill_color = None
glegend.grid.visible = False
glegend.axis.visible = False
glegend.outline_line_color = None
glegend.legend.border_line_color = None
glegend.legend.location = 'center'

show(gridplot([gtitle, None, glegend] + plots, ncols=3))

Bokeh Plot

You can now see nine different plots linked with a single data source. When you select any data points in one plot the same data points are highlighted across all while the rest are 'muted'.

This could be useful when inspecting data with multiple dimensions. For example, when I clicked on the person with the highest S1 measurement, I could see that he also had the highest measurements of S2 and S4. Besides, it is just fun playing with these plots. I am looking forward to going through more of the library examples and tutorials.
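The linking itself comes down to every plot drawing from the same ColumnDataSource object. Below is a stripped-down sketch with two plots and three rows taken from the table shown earlier; the variable names are my own, and I use the generic scatter method here rather than the plotting function from this post.

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Three rows taken from the diabetes table, just for illustration.
shared = ColumnDataSource(data={
    "Age": [59, 48, 72],
    "BMI": [32.1, 21.6, 30.5],
    "Y": [151, 75, 141],
})

tools = "box_select,tap,reset"
left = figure(title="Age vs. Y", tools=tools)
right = figure(title="BMI vs. Y", tools=tools)

left_renderer = left.scatter("Age", "Y", source=shared)
right_renderer = right.scatter("BMI", "Y", source=shared)

# Both renderers point at the very same source object, and selections
# are stored on that object, which is why they show up in both plots.
print(left_renderer.data_source is right_renderer.data_source)
```

If each plot were given its own copy of the data instead, selecting points in one plot would have no effect on the other.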

Hello world!

Welcome to my personal blog. I recently recreated my domain, and my previous blog was wiped out in the process, so it is pretty empty now. I will make sure to fill it up soon!