Python is used for a wide range of applications, from basic web applications to advanced scientific programming. The Python community has created many tools that shorten the time it takes to prototype data analysis applications. However, because community contributions are so broad, finding and installing a comprehensive set of Python tools and packages is time-consuming and prone to incompatibilities. Here are some tips to reduce the time needed to set up your environment and build your application.
The first step is to find the right tools. The Anaconda Python distribution is recommended for data analysis application development; it includes most of the essential Python packages you will need for such projects. NumPy and SciPy are the foundation of a number of valuable data analysis packages, such as scikit-learn for machine learning. The Pandas Data Analysis Library is also included. Pandas is a useful tool for querying and analyzing large volumes of structured data, such as time series data. Many of its data structures, such as the data frame, will be familiar to R programmers.
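As a rough illustration of querying time series data with Pandas, the sketch below builds a small daily series, selects a date range by label and computes a rolling mean. The series name, dates and values are made up for the example.

```python
import pandas as pd


def make_daily_series():
    """Build a hypothetical series of ten daily readings (values 0.0 to 9.0)."""
    index = pd.date_range("2024-01-01", periods=10, freq="D")
    return pd.Series(range(10), index=index, name="reading", dtype="float64")


series = make_daily_series()

# Label-based slicing on a DatetimeIndex includes both endpoints.
window = series.loc["2024-01-03":"2024-01-05"]

# Three-day rolling mean; the first two entries are NaN because
# a full window of three observations is not yet available.
rolling_mean = series.rolling(window=3).mean()

print(window)
print(rolling_mean.tail(3))
```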
The Anaconda Python distribution also includes IPython, an interactive shell for developing Python code that comes with a browser-based notebook interface. One of the advantages of IPython is the ability to create notebooks that mix code, documentation and visualizations. You can share notebooks for others to run, or publish results using nbviewer.
If you are working with Python in AWS, you probably have your data stored there as well. The Python tool for the AWS API, known as boto, includes support for working with Amazon S3, SimpleDB, and DynamoDB. These APIs help streamline the process of getting stored data to an application.
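A minimal sketch of pulling a CSV object out of Amazon S3 follows, assuming the legacy boto 2 interface (connect_s3, get_bucket, get_key, get_contents_as_string) and credentials already configured in the environment. The helper names, bucket name and key name are hypothetical.

```python
import csv
import io


def connect():
    """Open an S3 connection with boto 2 (assumes credentials are configured)."""
    import boto  # imported lazily so the parsing helper works without boto

    return boto.connect_s3()


def read_csv_rows(conn, bucket_name, key_name):
    """Fetch one S3 object and parse it as CSV into a list of dicts."""
    bucket = conn.get_bucket(bucket_name)
    key = bucket.get_key(key_name)  # returns None when the key does not exist
    if key is None:
        raise KeyError("no such key: %s/%s" % (bucket_name, key_name))
    raw = key.get_contents_as_string()
    # The contents may arrive as bytes; decode before handing them to csv.
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    return list(csv.DictReader(io.StringIO(text)))


# Example usage (hypothetical bucket and key names):
#   rows = read_csv_rows(connect(), "example-bucket", "data/sample.csv")
```

Passing the connection in as an argument keeps the parsing logic independent of AWS, so it can be exercised locally with a stand-in object.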
Now that you have a set of Python tools to work with, the next step is to learn how to use them. You'll probably spend a fair amount of time working with Pandas and scikit-learn, although it will depend on your specific needs. Scikit-learn has modules for classification, regression, clustering, preprocessing and model evaluation. If you are new to machine learning, the documentation includes tutorials and sample data to assist you. Additionally, scikit-learn's estimators share a consistent interface for fitting models and making predictions, so you can work at a fairly high level.
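To give a sense of how these pieces fit together, here is a small sketch that uses the iris sample data bundled with scikit-learn to preprocess features, train a classifier and evaluate it on held-out data. The choice of logistic regression, the split size and the function name are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def train_and_evaluate(random_state=0):
    """Fit a scaled logistic regression on iris and score it on a held-out split."""
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state, stratify=y
    )
    # Preprocessing and classification are chained into one estimator.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    return model, accuracy


model, accuracy = train_and_evaluate()
print("held-out accuracy: %.2f" % accuracy)
```

Other estimators follow the same fit and predict pattern, so swapping in a different classifier would typically change only the line that builds the pipeline.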
Pandas lets developers and analysts work with tabular or relational data. Its data frame structure is the basic building block for storing tabular data. Pandas has features for common tasks such as importing, merging, aligning and filtering data, and it includes useful methods for dealing with missing data.
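The tasks listed above can be sketched in a few lines. The example below imports two small tables, merges them, fills in missing values and filters the result. The CSV contents and column names are invented, and in-memory strings stand in for real files.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for two CSV files on disk.
ORDERS_CSV = """order_id,customer_id,amount
1,10,25.0
2,11,
3,10,40.0
4,12,15.5
"""

CUSTOMERS_CSV = """customer_id,region
10,east
11,west
"""


def load_and_merge():
    """Import both tables, join them and fill in missing values."""
    orders = pd.read_csv(io.StringIO(ORDERS_CSV))
    customers = pd.read_csv(io.StringIO(CUSTOMERS_CSV))
    # Left join keeps every order, even when the customer is unknown.
    merged = orders.merge(customers, on="customer_id", how="left")
    # Replace missing amounts and regions with explicit defaults.
    merged["amount"] = merged["amount"].fillna(0.0)
    merged["region"] = merged["region"].fillna("unknown")
    return merged


merged = load_and_merge()

# Boolean filtering: keep only orders above a threshold.
large_orders = merged[merged["amount"] > 20.0]

print(merged)
print(large_orders)
```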
If you need to work with statistical tests and measures within a Python environment, you should look into the statsmodels package. It includes support for linear regression, discrete choice models, non-parametric estimators and other statistical functions. It also works directly with NumPy arrays and Pandas data structures.
Once your Python development environment is set up, make sure to save your configuration as an Amazon Machine Image, or AMI, so you can reuse it. As your data sets grow, you may want to take advantage of Amazon EC2 resources and run applications on a cluster. Luckily, IPython is designed to support parallel processing.