AWS big data developers are testing the limits of the cloud provider's systems, from EC2 instances and Elastic MapReduce to new programming languages such as R.
One such company, Comlinkdata, a Boston-based firm founded in 2010, is a real-time provider of telecom market movement data. If customers switch cell phone providers, Comlinkdata sees when the switch happens, knows how long they were with their previous carrier, knows the approximate location based on phone number and then starts to enrich that data with other sources like demographics, advertising spend and coverage data. The company analyzes about 50,000 such switches per day in the U.S.
SearchAWS.com sat down with Tanya Cashorali, director of analytics for the firm, to uncover what's on the back end of Comlinkdata's number-crunching applications and connection to AWS big data.
What does your AWS deployment look like?
Tanya Cashorali: Our main data warehouse, where we store data that's used for multiple products, is on Amazon Redshift. We use S3 as a means of copying data to and from Redshift, as well as input and output data for [Elastic MapReduce] jobs. We started with all our data in SQL Server. We get all this switching data nightly, but the way the data's structured, those 50,000 switches become about five-and-a-half to 6 million rows in a database per day. Processing that was taking about 12 hours on SQL Server, and we had to speed that up. So, we exchanged the processing to be run in Apache Pig, which is a SQL-like language that runs on top of Hadoop. That runs nightly now on EMR, so we extract from SQL Server for input data, we push that input data to S3, [and] S3 then serves as the input location for EMR jobs.
The EMR job runs for about an hour, so we've gotten it down from 12 hours to one hour. The output from that is also stored on S3. And then we copy that into Amazon Redshift. So then our multiple analytical products are querying Amazon Redshift in real time, where queries over 4 billion rows return [results] in about five seconds or less. It started out at about five minutes, but we made a bunch of optimizations to our data structures to get that time down.
What kind of changes did you make?
Cashorali: For one, we de-normalized the table. Previously, we were doing massive joins over various tables, and joining a three-and-a-half-billion-row table to a 40,000-row table is really expensive [in terms of resources]. That's why we changed the structure -- it took changing the Pig script, but we de-normalized so it's just one massive table scan. We set up our distribution key such that it filters through the data in the most optimal way possible for all of our various queries, and now there's no more need to copy the data over the various nodes to perform these really heavy joins. We also changed our instance types to the SSD-based nodes, C3.8xlarges, rather than the [spinning disk]-based nodes, so that made accessing the data a lot faster.
Anything on your wish list? Anything you'd like to see AWS add or change about the services you use?
Cashorali: Raising storage limits on RDS would be nice, and Amazon Redshift multi-[availability zone] deployments. I don't know much about what kind of support they offer for running R more seamlessly. I know it's possible just doing your own custom work with loading up an EC2 instance and installing R and all that, but potentially offering not just EMR in the cloud but ways to more seamlessly run more complex statistical modeling in R on the cloud.
There are people who are starting to write packages to do it, but it's still kind of a clunky process. It would be nice to say, "I want to run a million different models on EMR, and I have all the R code, I just want to be able to send the code to a bunch of machines on EMR and have it parallelized for me the same way it does already with languages like Pig and Hive." There are clearly customers using R for really massive data analysis problems. It would only help Amazon to integrate that whole process more seamlessly within their products.
[Editor’s note: Currently, AWS offers a third-party app for hosting R within its cloud, from Revolution Analytics in the Amazon Marketplace.]
What exactly is R?
Cashorali: It's an open-source statistical programming environment. It was developed by Ph.D. statisticians, which some people say is the best thing about R and also the worst thing about R. It's not like a typical functional programming language, but I would say it’s best for really complex data analysis, data visualization, it can be used for data processing, but there are other tools that are better suited to that like Python or just BASH … there are a lot of libraries that really smart PhDs have developed that are free to use, you can look at the source code, and all the heavy lifting has been done for you. You just plug your data in and it outputs what you need.
What's the biggest challenge with using AWS?
Cashorali: There are a lot of IT administrative tasks that it requires. You really should have someone owning the whole configuration. There are things, like virtual private cloud, that still need to be set up on our side -- managing security groups, who has access to what databases, to what EC2 instances, all the different permissions, policies for EC2 instances and databases, as well as structuring your data on S3. It's a lot of administrative work … maintaining it is close to a full-time job.