Comb through cloud databases with Amazon CloudSearch

With a little work and reformatting, Amazon CloudSearch lets developers easily query unstructured data to find useful nuggets.

Cloud databases are ideal for storing and managing structured data that fits neatly into relational tables. But much of the data enterprises process is unstructured or semi-structured, and some of it consists of free-form text that must be easily searchable. Relational databases handle product attributes such as cost, size and quantity well. But add a few paragraphs of detailed description, and a relational database becomes a poor fit for searching that text. Enterprises in this situation need a search engine.

Search engines are applications that allow users to query unstructured or semi-structured data in much the same way users query structured data in relational databases. Enterprises storing and managing large amounts of semi-structured content in AWS could use Amazon CloudSearch to retrieve data.

Some search engines work with semi-structured or unstructured data and can read multiple file types such as DOCX, PDF and TXT. Amazon CloudSearch works with JavaScript Object Notation (JSON) or XML documents. So if your content is in a different form, you'll need to pre-process the data into one of these formats.

CloudSearch organizes semi-structured data in a domain. Just as relational database tables contain rows of data, domains contain documents. Each document consists of field names and values. For example, if you have a domain for searching email messages, your documents would have fields such as sender, recipient, CC, subject and message.
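As a rough illustration of that pre-processing step, the sketch below turns a raw email message into a JSON document with the fields named above. The list of "add" operations with "id" and "fields" keys follows the JSON document batch layout as commonly documented for the 2013-01-01 CloudSearch API; the sample message, the document ID and the lowercase field names are assumptions made for the example.

```python
import json
from email import message_from_string

# Hypothetical raw message; in practice this would come from a mail store.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Cc: carol@example.com
Subject: Quarterly report

The draft report is attached for review.
"""


def email_to_document(doc_id, raw):
    """Parse a raw email and return one 'add' operation for a document batch."""
    msg = message_from_string(raw)
    fields = {
        "sender": msg["From"],
        "recipient": msg["To"],
        "cc": msg["Cc"],
        "subject": msg["Subject"],
        "message": msg.get_payload().strip(),
    }
    return {"type": "add", "id": doc_id, "fields": fields}


# A batch is a JSON array of operations; here it holds a single document.
batch = [email_to_document("msg-0001", RAW_EMAIL)]
batch_json = json.dumps(batch, indent=2)
print(batch_json)
```

The same approach applies to other source formats: extract the text and metadata with whatever parser fits, then emit one JSON object per document.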

The first step to using Amazon CloudSearch is to define fields in your documents. For each field, you can indicate if the data in the field should be searchable, if users should be able to sort on that field, and other processing options. CloudSearch also offers the ability to extract fields from sample data, which can save time otherwise spent manually specifying all fields and processing options.
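The per-field options can be pictured as small configuration records. The sketch below builds such records for the email example; the key names (IndexFieldName, IndexFieldType, TextOptions, LiteralOptions) mirror the shape that the boto3 cloudsearch client's define_index_field call is understood to accept for its IndexField argument, but treat that mapping as an assumption to check against the current API reference. The helper functions and the choice of which fields are sortable are hypothetical.

```python
def text_field(name, sortable=False):
    """Full-text field: searchable by nature, optionally sortable."""
    return {
        "IndexFieldName": name,
        "IndexFieldType": "text",
        "TextOptions": {"ReturnEnabled": True, "SortEnabled": sortable},
    }


def literal_field(name, searchable=True, sortable=False):
    """Exact-match field, such as an email address."""
    return {
        "IndexFieldName": name,
        "IndexFieldType": "literal",
        "LiteralOptions": {
            "SearchEnabled": searchable,
            "ReturnEnabled": True,
            "SortEnabled": sortable,
        },
    }


# Assumed configuration for an email domain; adjust options to your own needs.
EMAIL_FIELDS = [
    literal_field("sender", sortable=True),
    literal_field("recipient"),
    literal_field("cc"),
    text_field("subject", sortable=True),
    text_field("message"),
]

for field in EMAIL_FIELDS:
    print(field["IndexFieldName"], field["IndexFieldType"])
```

Keeping the definitions in one list makes it easy to review which fields are searchable or sortable before applying them to a domain.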

Once a domain is defined, documents can be loaded into CloudSearch. As documents are loaded, they are processed according to the domain's configuration settings. This can include removing common words, known as stop words, which do not help with search and would otherwise take up unnecessary space. Words in the text may also be replaced by their root words in a process known as stemming. This improves matching and reduces storage space because words such as "rain," "rained" and "raining" are all reduced to the root word "rain."
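To make stop-word removal and stemming concrete, here is a deliberately simplified sketch. It is not the algorithm CloudSearch uses; the stop-word list and the suffix-stripping rule are toy assumptions chosen only to show the idea.

```python
import re

# Toy stop-word list and suffix rules; real analyzers use far richer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "it", "is", "on"}
SUFFIXES = ("ing", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, tokenize, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]


terms = analyze("The rain rained and it is raining in the hills")
print(terms)  # ['rain', 'rain', 'rain', 'hill']
```

Ten input words shrink to four index terms, and all three forms of "rain" now match the same query term.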

After documents are loaded and word indexes are built, the CloudSearch domain is ready for queries. As with relational databases, queries can be simple or complicated. Users could search for a simple term, such as "headphones," or something more targeted, such as: "the field description should contain 'headphones,' the price field should be 'less than $25' and the date first available should be 'within the last 12 months.'"

To perform this kind of Boolean search, developers must be familiar with CloudSearch query syntax. They can then build a search interface that lets end users specify fields and values while hiding the complexity of that syntax.
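A small helper can do that hiding. The sketch below assembles the headphones query described above, assuming the structured query parser of the 2013-01-01 API and a domain with hypothetical fields named description, price and first_available. The function name and its parameters are made up for illustration; verify the range and date syntax against the CloudSearch documentation before relying on it.

```python
from datetime import datetime, timedelta, timezone


def build_structured_query(term, max_price, newer_than):
    """Combine a term match, a price ceiling and a date floor with 'and'."""
    # Escape backslashes and single quotes so user input cannot break the query.
    safe_term = term.replace("\\", "\\\\").replace("'", "\\'")
    cutoff = newer_than.strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        "(and "
        f"(term field=description '{safe_term}') "
        f"(range field=price {{,{max_price}}}) "   # curly braces: bound excluded
        f"(range field=first_available ['{cutoff}',}})"  # open-ended upper bound
        ")"
    )


# A fixed "today" keeps the example reproducible; 365 days stands in for 12 months.
today = datetime(2024, 1, 15, tzinfo=timezone.utc)
query = build_structured_query("headphones", 25, today - timedelta(days=365))
print(query)
```

End users only supply a search term, a maximum price and a time window; the parenthesized syntax never reaches them.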

Accessing and scaling Amazon CloudSearch

CloudSearch has three access points for managing domains, loading documents and querying a domain: an administration console, a command-line interface and programming language APIs.
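For the API route, the sketch below wraps document loading and querying in two small functions. The keyword arguments (documents, contentType, query, queryParser, size) are those of the boto3 cloudsearchdomain client's upload_documents and search methods as best understood here; RecordingClient is a hypothetical stand-in so the example runs without AWS credentials, and the field names are assumptions.

```python
import json


def upload_batch(domain_client, batch):
    """Send a list of add/delete operations to the domain's document endpoint."""
    body = json.dumps(batch).encode("utf-8")
    return domain_client.upload_documents(
        documents=body, contentType="application/json"
    )


def run_search(domain_client, query, size=10):
    """Run a structured query against the domain's search endpoint."""
    return domain_client.search(query=query, queryParser="structured", size=size)


class RecordingClient:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def upload_documents(self, **kwargs):
        self.calls.append(("upload_documents", kwargs))

    def search(self, **kwargs):
        self.calls.append(("search", kwargs))


# With real credentials, the client would instead come from something like
# boto3.client("cloudsearchdomain", endpoint_url="https://<your-domain-endpoint>").
client = RecordingClient()
upload_batch(client, [{"type": "add", "id": "msg-0001", "fields": {"subject": "Hi"}}])
run_search(client, "(term field=subject 'hi')")
for name, kwargs in client.calls:
    print(name, sorted(kwargs))
```

Passing the client in as a parameter keeps the functions easy to test and lets the same code work against any domain endpoint.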

Like other AWS services, CloudSearch will scale up if your document indexing or query processing load becomes too high for the existing instance. CloudSearch works with small, large, extra-large and double extra-large search instances; prices range from $0.10 per hour to $1.10 per hour. When CloudSearch scales up, it launches a larger instance. If CloudSearch is already using the largest instance, it will partition documents and use multiple servers to index documents or respond to queries.

CloudSearch supports features found in specialized search engines, including multiple languages, advanced term-searching options, auto-complete queries and highlighting within results. The service also integrates with AWS Identity and Access Management (IAM) to protect content. Administrators can also specify an IP address or addresses that are allowed to load documents into a domain.
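An IP restriction of this kind is typically expressed as an IAM-style access policy attached to the domain. The sketch below generates one that would allow document uploads only from listed addresses. The action name cloudsearch:document and the aws:SourceIp condition follow the policy format as commonly documented, but confirm them against the current CloudSearch access policy reference; the helper function and the sample address are assumptions.

```python
import json


def document_upload_policy(allowed_ips):
    """Build a policy allowing document uploads only from the given addresses."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["cloudsearch:document"],
                # Requests from any other source address are not matched.
                "Condition": {"IpAddress": {"aws:SourceIp": list(allowed_ips)}},
            }
        ],
    }


# 192.0.2.10 is a documentation-only address used as a placeholder.
policy = document_upload_policy(["192.0.2.10/32"])
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Generating the policy from a list keeps the allowed addresses in one place, which helps when the set of loading servers changes.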

About the author:
Dan Sullivan holds a Master of Science degree and is an author, systems architect and consultant with more than 20 years of IT experience. He has had engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education. Dan has written extensively about topics that range from data warehousing, cloud computing and advanced analytics to security management, collaboration and text mining.