

Get to know Python tools and how to use them

Python tools can be incredibly useful in application development. Learn how to set up a Python environment and build your application.

Python is used for a wide range of applications, from basic web applications to advanced scientific programming. The Python community has created a range of tools to minimize the time it takes to prototype data analysis tools. However, due to the breadth of community contributions, finding and installing a comprehensive set of Python tools and packages is time-consuming and prone to incompatibilities. Here are some tips to reduce the time needed to set up your environment and build your application.

The first step is to find the right tools. The Anaconda Python distribution is recommended for data analysis application development; it includes most of the essential Python packages you will need for such projects. NumPy and SciPy are the foundation of a number of valuable data analysis packages, such as scikit-learn for machine learning. The Pandas data analysis library is also included. Pandas is a useful tool for querying and analyzing large volumes of structured data, such as time series data. Many of its data structures will be familiar to R programmers.
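To give a taste of the NumPy layer these packages build on, here is a minimal sketch of vectorized computation; the price values are invented for illustration:

```python
import numpy as np

# Vectorized arithmetic on whole arrays replaces explicit Python loops.
prices = np.array([10.0, 10.5, 9.8, 11.2])

# Period-over-period returns: (p[t+1] - p[t]) / p[t]
returns = np.diff(prices) / prices[:-1]

print(returns.mean())  # average return across periods
```

Operations like `np.diff` and array division are the building blocks that Pandas and scikit-learn use under the hood.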

The Anaconda Python distribution also includes IPython, an interactive shell for developing Python code. One of IPython's advantages is its browser-based notebooks, which allow for a mix of code, documentation and visualizations. You can share notebooks for others to run, or publish results using NBViewer.

If you are working with Python in AWS, you probably have your data stored there as well. The Python tool for the AWS API, known as boto, includes support for working with Amazon S3, SimpleDB, and DynamoDB. These APIs help streamline the process of getting stored data to an application.
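A minimal sketch of pulling a stored object down from S3 with classic boto; the bucket and key names below are hypothetical, and the call assumes the boto package is installed and valid AWS credentials are configured:

```python
def fetch_s3_object(bucket_name, key_name, local_path):
    """Download one S3 object to a local file using classic boto."""
    import boto  # imported lazily; requires the boto package and AWS credentials

    conn = boto.connect_s3()               # picks up credentials from env/config
    bucket = conn.get_bucket(bucket_name)  # e.g. 'my-analysis-data' (hypothetical)
    key = bucket.get_key(key_name)         # e.g. 'input/timeseries.csv' (hypothetical)
    key.get_contents_to_filename(local_path)

# Usage (requires AWS access):
# fetch_s3_object('my-analysis-data', 'input/timeseries.csv', 'timeseries.csv')
```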

Now that you have a set of Python tools to work with, the next step is to learn how to use them. You'll probably spend a fair amount of time working with Pandas and scikit-learn, although it will depend on your specific needs. scikit-learn includes machine learning packages for classification, regression, clustering, preprocessing and evaluating machine learning models. If you are new to machine learning, the documentation includes tutorials and sample data to assist you. Additionally, scikit-learn offers a broad set of classes and methods for working at a fairly high level.
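As a rough sketch of that high-level workflow, here is a classifier trained on scikit-learn's bundled iris sample data; the choice of model and split parameters is illustrative, not a recommendation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# A small bundled sample data set: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a test set so the model is evaluated on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)  # fraction of correct predictions
```

The same `fit`/`score` pattern applies across scikit-learn's classifiers and regressors, which is what makes it practical to work at a fairly high level.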

Pandas lets developers and analysts work with tabular or relational data. The data frame structure is a basic building block for storing tabular data. It has features for dealing with common tasks, such as importing, merging, aligning and filtering data. It also includes useful methods for dealing with missing data.
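A small sketch of those data frame operations, merging two tables and filling in a missing value; the tables here are invented:

```python
import pandas as pd

orders = pd.DataFrame({'customer': ['alice', 'bob', 'carol'],
                       'total': [25.0, 40.0, 15.0]})
regions = pd.DataFrame({'customer': ['alice', 'bob'],
                        'region': ['east', 'west']})

# Merge (align) the two tables on the shared 'customer' column.
merged = pd.merge(orders, regions, on='customer', how='left')

# carol has no region, so the merge leaves NaN; fill in missing values.
merged['region'] = merged['region'].fillna('unknown')
```

The `how='left'` argument keeps every row from `orders` even when no matching region exists, which is where the missing-data handling comes in.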

If you need to work with statistical tests and measures within a Python environment, you should look into the Statsmodels package. It includes support for linear regression, discrete choice models, non-parametric estimators and code for other statistical functions. It also interoperates with NumPy and Pandas data structures.
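As a sketch of that interoperability, here is a small helper that fits an ordinary least squares trend line with Statsmodels on a NumPy array; the function name is my own, and it assumes the statsmodels package is installed:

```python
import numpy as np

def fit_linear_trend(y):
    """Fit an OLS line y ~ intercept + slope * t with Statsmodels."""
    import statsmodels.api as sm  # imported lazily; requires statsmodels

    t = np.arange(len(y))
    X = sm.add_constant(t)        # adds the intercept column to the design matrix
    result = sm.OLS(y, X).fit()
    return result.params          # [intercept, slope]

# Usage:
# fit_linear_trend(np.array([1.0, 2.1, 2.9, 4.2]))  # slope should be near 1
```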

Once your Python development environment is set up, make sure to save your configuration as an Amazon Machine Image, or AMI, so you can reuse it. As your data sets grow, you may want to take advantage of Amazon EC2 resources and run applications on a cluster. Luckily, IPython is designed to support parallel processing.

Next Steps

Here are the three BI dashboard best practices you need to know

Want to work with Python? Here's a tutorial on the PyCharm IDE


Join the conversation



Have you faced difficulties when installing Python tools? What were they, and how did you overcome them?
Having recently had to work with several of the Python tools mentioned in this article (after more than 10 years of Python programming experience), I can tell you the learning curve is steep, for several reasons.

First, overall these tools simplify the 'mechanics' of using the underlying Python libraries for scientific and numerical data computation and data visualization. Installation is easy, provided you follow the recommended installation parameters. Once installed, the sample and tutorial examples illustrate the basic concepts quite well. 

However, numerical data analysis is a mind-set that does not come easily or quickly, and unless one has a fair amount of prior experience in statistics and probability reasoning, setting up an actual analysis is not as easy. To use an unfortunate but apt analogy: with great power comes great responsibility. Knowing which combination of tools to use on which kinds of data, and then teasing out valid conclusions, is not easily achieved.
If you are tasked with using these tools and you are not a 'numbers geek,' then keeping a copy of a text like the Idiot's Guide to Probability and Statistics handy is essential.
These tools make running sophisticated number-crunching tasks a cakewalk, but they can't help you fake your way through a practical and useful data-vis analysis.
I've installed these Python tools on OSX (Mac) with no problems for Python 2.7.