Recently, I have convinced two of my friends to form a study group on data science. We are planning to follow courses from different online educational platforms such as DataCamp and edX while completing projects using data from Kaggle. Before we start our first project with Kaggle’s introductory competition, we walked through the following steps to set up common, collaborative project environments.
Create a virtual environment
Conda is an open source package and environment manager included in Anaconda distribution. The distribution includes majority of common scientific Python packages used for data science. We will use it to create a virtual environment and manage packages separately per project. Miniconda is a lighter installer of the Anaconda distribution.
Having installed the Conda manager, we can set up individual Python environments for each project. Each environment can be configured with different sets of packages and even different Python versions. To create a new environment, we use the following command line in Terminal on macOS and Linux or Anaconda Prompt on Windows.
$ conda create -n titanic python=3.6
titanic is the environment name and the
python=3.6 specifies the version for the environment. To activate the environment on macOS or Linux, use
$ source activate titanic
$ activate titanic
When activated, the environment’s name appears on the command-line interface. To deactivate, use
$ source dactivate on macOS and Linux or
$ deactivate on Windows.
To install packages in a virtual environment, you can use
conda install <package-name> after activating the environment. Initially, we install the following scientific packages.
(titanic) $ conda install scikit-learn (titanic) $ conda install pandas
Conda also installs their dependencies such as
(titanic) $ conda install scikit-learn Fetching package metadata ........... Solving package specifications: . Package plan for installation in environment /Users/xxxxx/anaconda3/envs/titanic: The following NEW packages will be INSTALLED: mkl: 2017.0.3-0 numpy: 1.13.1-py36_0 scikit-learn: 0.19.0-np113py36_0 scipy: 0.19.1-np113py36_0 Proceed ([y]/n)? y mkl-2017.0.3-0 100% |########################################################| Time: 0:00:54 2.13 MB/s
Set up project structure
To enable collaboration among the study group members, we will work off a common GitHub repository per project. A consistent project structure is necessary to keep project artifacts clean and organized with multiple members working on the same project. There is a Python package just for that.
Cookiecutter is a command-line utility that creates projects from templates and there is a template for data science projects. While we may modify structure as needed, the template gives us a starting point.
Since the utility will be used outside any individual project, we install the package on the root Anaconda environment.
$ conda install cookiecutter
After installation, you can create a project structure with the following command in the directory where you want to place the project directory.
$ cookiecutter https://github.com/drivendata/cookiecutter-data-science
The utility asks a few questions on the basic information about the project and creates a directory structure based on the template from
Upload on GitHub
gitinstaller is available from here.
We first initialize the project directory as a local
git repository using the following command in the directory.
$ git init
Then add and commit all files and sub-directories within the project directory. Note a commit message is required for each commit when using
$ git add . $ git commit -m 'Initial commit'
Now connect the local repository to a remote GitHub repository.
$ git remote add origin email@example.com:<user>/<repo-name>
The local and remote repositories are connected but the locally committed files aren’t available until they are pushed to the remote repository.
$ git push -u origin master
We now have
- a common project repository where we can all
- a standard project structure so that we can merge our individual contributions while keeping the structure organized and clean
- a Python environment isolated for the project so we can also share our environment specifications per project