Cloud architects and developers use Amazon CloudSearch to build and deploy search indexes on a large collection of data, such as webpages or document files. Through the Amazon console, you can create a search domain, upload your data to the domain, build search indexes and start submitting search requests.
You pay as you go, at low hourly rates, for only the resources you consume. Running search applications on CloudSearch can cost less than operating a large-scale search environment on your own.
To get started, sign in to the AWS Management Console to get to Amazon CloudSearch. Next, create a search domain with a name of your choice.
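Before typing a domain name into the console, it can help to check it locally. The sketch below assumes the naming rule as I understand it from the CloudSearch documentation (3 to 28 characters, starting with a lowercase letter or digit, using only a-z, 0-9 and hyphens); verify the rule against the current documentation. The function name is hypothetical.

```python
import re

# Assumed naming rule: 3-28 characters, first character a lowercase letter
# or digit, remaining characters a-z, 0-9 or hyphen.
_DOMAIN_NAME = re.compile(r"[a-z0-9][a-z0-9-]{2,27}")


def is_valid_domain_name(name: str) -> bool:
    """Return True if the name fits the assumed CloudSearch naming rule."""
    return _DOMAIN_NAME.fullmatch(name) is not None


# "docs-demo" is a made-up domain name used only for illustration.
for candidate in ("docs-demo", "ab", "My_Domain"):
    print(candidate, is_valid_domain_name(candidate))
```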
The next step is to determine the size of the sample file you want to upload and the desired replication count, which depends on the amount of traffic you expect. A count of 1 suits low traffic between the source location of the file(s) and CloudSearch. To handle more traffic for a larger file size, increase the count.
The easiest way to upload a sample file is to use the Amazon CloudSearch console to get it from your local machine. If you don't have a sample file, you could use the sample files from Amazon S3 buckets. You are limited to 10 files.
Then, manually add index fields or configure them later. CloudSearch can also take its data from an Amazon DynamoDB table; if you are unfamiliar with that database, the Amazon DynamoDB Developer Guide is an informative manual.
For the first option, uploading from your local machine, you can use any of the following file types:
- Comma Separated Value (.csv)
- Microsoft Excel or LibreOffice Calc (.xls, .xlsx)
- Microsoft PowerPoint or LibreOffice Impress (.ppt, .pptx)
- Microsoft Word or LibreOffice Writer (.doc, .docx)
If you upload a CSV file, be sure to identify the document fields in the first row of the file. CloudSearch reads that row as field names, so a missing or malformed header row causes the first data record to be misread.
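To illustrate why the header row matters, here is a minimal sketch of how a CSV file might map onto a batch of documents in the JSON shape used by the CloudSearch 2013-01-01 document format ("type", "id", "fields"). CloudSearch performs this conversion itself when you upload through the console; the function name, the `doc_id` column and the sample rows are assumptions made up for this example.

```python
import csv
import io
import json


def csv_to_batch(csv_text: str, id_field: str) -> str:
    """Turn CSV text into a JSON batch of 'add' operations."""
    # DictReader takes the first row as the field names for every record.
    reader = csv.DictReader(io.StringIO(csv_text))
    batch = []
    for row in reader:
        # Every column except the ID column becomes a document field.
        fields = {key: value for key, value in row.items() if key != id_field}
        batch.append({"type": "add", "id": row[id_field], "fields": fields})
    return json.dumps(batch, indent=2)


# Hypothetical sample data; the first row names the document fields.
sample = "doc_id,title,author\n1,Search basics,Judith\n2,Index fields,Judith\n"
print(csv_to_batch(sample, "doc_id"))
```

If the header row were missing, `DictReader` would treat the first data record as the field names, which is the same kind of misreading the header row is meant to prevent.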
For the purposes of this demonstration, a LibreOffice document in .docx format will be used. CloudSearch will not accept the Open Document Format. The steps for finding the general properties of a LibreOffice Writer document and a Microsoft Word document are similar.
To find the general properties of a LibreOffice document, click Properties... on the File menu. You will see information on Document type, Date created, Document size, Last printed, Revision number and other items.
After uploading your document, find the Suggested Index Field Configuration page.
The index fields are taken from the items in the general properties of the document. The available index field types include text, text-array, literal, int and date.
The following CloudSearch index fields are automatically searchable:
- content (searchable items found)
- content_type (application/MS Word)
- resourcename (name of a document file)
You will see shaded, checked-off boxes for them. A shaded box means you cannot uncheck it.
CloudSearch automatically converts the LibreOffice Writer property items (in parentheses) into search index fields:
- creation_date (Created)
- last_modified (Modified)
- last_printed (Last printed)
You can uncheck these items. Keep in mind that you might need these fields for facets to refine or filter search results.
Next, determine whether search results can include the contents of the field, whether search results can be sorted by the field, and whether matched phrases in a text or text-array field can be highlighted in search results.
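The three choices above (return, sort, highlight) can be written down as a small configuration object. The sketch below builds a dictionary modeled on the `IndexField` structure of the CloudSearch configuration API as I understand it (`TextOptions` with `ReturnEnabled`, `SortEnabled` and `HighlightEnabled`). The helper name is hypothetical, and the key names should be checked against the API reference before use.

```python
def text_field(name: str, *, returnable: bool = True,
               sortable: bool = False, highlight: bool = False) -> dict:
    """Describe a text index field and its per-field options."""
    return {
        "IndexFieldName": name,
        "IndexFieldType": "text",
        "TextOptions": {
            "ReturnEnabled": returnable,    # field contents may appear in results
            "SortEnabled": sortable,        # results may be sorted by this field
            "HighlightEnabled": highlight,  # matched phrases may be highlighted
        },
    }


# Example: return and highlight the document body, but do not sort by it.
print(text_field("content", highlight=True))
```

Keeping the options in one place like this makes it easier to compare them with what the console shows on the review page.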
CloudSearch will then ask you to review the search domain name, scaling options, index fields and access policies you chose. If you change your mind, you can edit them.
When processing is done, your search domain is marked as active. The console shows the number of index fields and the search endpoint that the CloudSearch console uses to submit search requests.
Once you re-enter the CloudSearch console, you can submit your search requests.
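Search requests can also be sent to the search endpoint directly over HTTP. The following sketch only builds the request URL; it assumes the 2013-01-01 search API path and its `q`, `size` and `return` parameters. The endpoint host shown is a made-up placeholder; use the one the console displays for your domain.

```python
from urllib.parse import urlencode


def build_search_url(endpoint: str, query: str, size: int = 10,
                     return_fields=("content_type",)) -> str:
    """Build a search URL for a CloudSearch search endpoint (assumed API path)."""
    params = {
        "q": query,                          # the search terms
        "size": size,                        # maximum number of hits to return
        "return": ",".join(return_fields),   # fields to include in each hit
    }
    return f"https://{endpoint}/2013-01-01/search?{urlencode(params)}"


# Hypothetical endpoint, for illustration only.
endpoint = "search-docs-demo-abc123.us-east-1.cloudsearch.amazonaws.com"
print(build_search_url(endpoint, "cloud search", size=5))
```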
A word of caution: The properties of a LibreOffice Writer document are not listed in the way CloudSearch expects when you submit search requests. For example, the LibreOffice Writer properties page shows:
Created: 12/20/2012, 09:43:08, Judith
CloudSearch automatically converts the property item Created into creation_date as a search index field. It picks up the date and time values but not the "Judith" value. Since LibreOffice Writer does not provide a separate author property item, CloudSearch does not recognize "Judith" as the author.
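To make the mismatch concrete, the sketch below splits the Created value shown above into a timestamp and an author name before any upload. It is one possible workaround under stated assumptions, not necessarily the fix described in the follow-up article: the date is assumed to be month/day/year, the time is assumed to be UTC, and the function name is hypothetical. The output uses the `yyyy-MM-ddTHH:mm:ssZ` form that CloudSearch date fields expect.

```python
from datetime import datetime


def split_created(value: str):
    """Split 'MM/DD/YYYY, HH:MM:SS, Name' into (timestamp, author)."""
    date_part, time_part, author = [part.strip() for part in value.split(",", 2)]
    stamp = datetime.strptime(f"{date_part} {time_part}", "%m/%d/%Y %H:%M:%S")
    # Assumption: the time is treated as UTC, hence the trailing "Z".
    return stamp.strftime("%Y-%m-%dT%H:%M:%SZ"), author


print(split_created("12/20/2012, 09:43:08, Judith"))
# prints ('2012-12-20T09:43:08Z', 'Judith')
```

The two values could then be supplied as separate fields, for example a date field and a hypothetical author field.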
In my next article, I will explain how to fix this problem and discuss other topics on submitting search requests.
About the author: Judith M. Myerson is the former ADP security officer/manager at a naval facility where she led enterprise projects for its materiel management system. Currently a consultant and subject matter expert, she is the author of several books and articles on cloud use, compliance regulations, mobile security, software engineering, systems engineering and risk management. She received her master of science degree in engineering from the University of Pennsylvania and is certified in risk and information system control (CRISC).