SAN FRANCISCO - Matchmaking can be a data-intensive process, particularly if you are trying to figure out which relationships are really going to work in the long run.
At the JavaOne show, Steve Kuo and Joshua Tuberville, software architects at eHarmony, discussed how the company has been testing variations of its matchmaking models using Hadoop and Elastic MapReduce on Amazon Web Services (AWS).
eHarmony claims this attention to detail is what has enabled it to attract 20 million active worldwide users and led to an average of 236 marriages a day in the US.
The production application that actually determines matches uses a relational database, which works well for matching users. But to score the success of different models, the architects needed a system that could scale out across multiple machines better than a relational database.
They realized that Hadoop, an open source framework for distributing a workload across multiple machines, particularly when vast amounts of data are involved, could work. Others have demonstrated Hadoop successfully scaling across 2,000 servers, and Kuo believes that it will be able to scale to support eHarmony's needs.
Hadoop is also designed to tolerate failures. The Hadoop Distributed File System provides reliability of data through replication, and Hadoop can automatically re-execute a process if a particular server goes down. The only weak spot is the master server, which could bring down an entire process if it were to crash. eHarmony's Tuberville noted that this has not happened to them yet.
In addition to the basic AWS services for storage (S3) and computation (EC2), eHarmony has started using Elastic MapReduce, a hosted Hadoop framework running on AWS. Although eHarmony had previously been implementing Hadoop on its own, the new service has saved it considerable programming and deployment time, Kuo said. Elastic MapReduce eliminated much of the configuration and startup code for allocating clusters and shutting down EC2 instances, cutting about 150 lines of code spread across multiple scripts to about 10 lines in a Ruby wrapper, according to Kuo.
Making sense of MapReduce
MapReduce requires a shift in how programmers think about a problem, noted Tuberville. It breaks a problem into a series of computational maps and reductions, which makes the work easy to distribute across multiple servers. "When I first got involved, it was not intuitive how you could model your data across multiple processes," Tuberville said.
The typical demonstration of MapReduce is a word-counting application. The map step counts the occurrences of each word in each paragraph, with the paragraphs spread across multiple servers, and the reduce step adds up all the instances of a particular word. "At first, it seemed like a weird way to do a word count, but I could not figure out another way to scale to petabytes, and that is what Hadoop is designed for," Tuberville said.
One of the weaknesses of MapReduce is that, in the intermediate steps of the process, the mappers often produce more data than was in the input. If a programmer is not careful, the amount of intermediate data can explode and lead to high application costs.
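The word count Tuberville describes can be sketched in a few lines of plain Python. This is a single-machine illustration of the map, group and reduce steps, not Hadoop code and not anything eHarmony showed; the function names are made up for the example.

```python
from collections import defaultdict

def map_words(paragraph):
    """Map step: emit a (word, 1) pair for every word in one paragraph."""
    return [(word.lower(), 1) for word in paragraph.split()]

def group_by_word(pairs):
    """Group emitted values by word, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_counts(word, counts):
    """Reduce step: add up all the instances of one word."""
    return word, sum(counts)

def word_count(paragraphs):
    # On a real cluster each paragraph could be mapped on a different server;
    # here the map calls simply run one after another.
    intermediate = []
    for paragraph in paragraphs:
        intermediate.extend(map_words(paragraph))
    grouped = group_by_word(intermediate)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())

if __name__ == "__main__":
    print(word_count(["the cat sat", "The cat ran"]))
```

Because each map call looks at only one paragraph and each reduce call at only one word, both steps can be spread across as many machines as there are paragraphs or words.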
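To make the intermediate-data problem concrete, the sketch below compares two hypothetical mappers for a word count in plain Python. The first emits one record per word occurrence; the second aggregates locally before emitting, which is the idea behind Hadoop's combiners. It illustrates the general technique and is not drawn from eHarmony's code.

```python
from collections import Counter

def map_plain(paragraph):
    """Emit one (word, 1) record for every word occurrence."""
    return [(word.lower(), 1) for word in paragraph.split()]

def map_with_local_aggregation(paragraph):
    """Emit one (word, count) record per distinct word in the paragraph."""
    counts = Counter(word.lower() for word in paragraph.split())
    return list(counts.items())

if __name__ == "__main__":
    text = "to be or not to be that is the question to be"
    print(len(map_plain(text)), "records without local aggregation")
    print(len(map_with_local_aggregation(text)), "records with local aggregation")
```

Both mappers lead to the same final counts after the reduce step; the second simply ships fewer intermediate records when words repeat.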
"When we started we did not know how to implement our application in MapReduce," Kuo said. "You have to think about how to decompose your problem into a series of Map and Reduce problems."
Hadoop programming is still pretty cutting edge and can be a challenge, even for experienced database developers. Kuo said they have been experimenting with letting other eHarmony analysts create applications using Pig, a higher-level analytical language that is akin to SQL for database programming. Unfortunately, Pig code runs at about a third the speed of native Hadoop scripts.
Kuo said EC2 and Hadoop are the easier part of the problem. The hardest part of coding is extracting, transforming and loading the data. As new data becomes available, they need to move information about millions of users with hundreds of attributes and hundreds of millions of matches into the engine. Even with compression, it ends up being tens of gigabytes of data.
Then they send the new data up to S3, start the cluster, do the MapReduce jobs, get the resulting data out of S3 and finally kill the cluster.
Kuo said, "One of the things that is nice is you can set up multiple accounts. It is cheap because you only pay for the resources you use. If you make tremendous changes to the code you just spin up a cloud of the same size as the production cloud, and you can find out if the new application is faster or slower."
To determine how they could get the most work done cost-effectively, eHarmony analyzed a variety of scenarios, from 24 low-end computers up to 49 high-end computers. When they looked at the total process execution time, they found that Hadoop accounted for only about 50 percent. Much of the remaining time was spent uploading data and downloading the results, so increasing the machine size shortened only the half of the process spent on analytical processing with Hadoop.
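The effect of that 50 percent split can be worked through with Amdahl's law. The sketch below assumes the data transfer time stays fixed while the Hadoop portion gets faster; the speedup factors are hypothetical inputs, not figures from the presentation.

```python
def overall_speedup(hadoop_fraction, hadoop_speedup):
    """Amdahl's law: only the Hadoop share of the run benefits from faster machines."""
    transfer_fraction = 1.0 - hadoop_fraction  # upload/download time, assumed unchanged
    return 1.0 / (transfer_fraction + hadoop_fraction / hadoop_speedup)

if __name__ == "__main__":
    # hadoop_fraction = 0.5 comes from the article; the speedups are illustrative.
    for factor in (2, 4, 10):
        result = overall_speedup(0.5, factor)
        print(f"Hadoop step {factor}x faster -> whole process {result:.2f}x faster")
```

Under these assumptions, no amount of extra compute can make the whole process more than twice as fast, because half of the time is spent moving data.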
eHarmony found it was spending about $1,200 per month to run the analytical application across 50 high-end servers. This compared quite favorably with the $5,000 their system administrators estimated it would have cost them to run the same machines in-house for power, cooling, network and administration.
After hearing the presentation, Dan Kokotov, lead engineer with 5AM Solutions in Rockville, MD, said that he was inspired to look at using Hadoop for some gene sequence analysis programming his team has been writing using traditional methods. He also thinks the cloud can open new markets for relatively small organizations like his to begin commercially offering high-performance computational services, without having to sink a lot of upfront capital into a server farm to get started.