Amazon CloudSearch is a helpful tool for building search indexes for documents in the cloud. The service builds on the existing property items of the documents and lets developers add new property items to fine-tune search indexes.
I've chosen a LibreOffice Writer document to show how to use CloudSearch to build search indexes and fix problems with them. The document is deliberately small to keep the cost of index processing low.
It takes five steps to build a search index: prepare a document, start Amazon CloudSearch, locate the source of the index fields, add index fields, and run a test search. If the test is successful, you can use the same index to search other LibreOffice documents.
Step 1: Prepare a sample document
- From the File tab, go to Properties.
- Under the General Properties tab, leave the box for Apply user data checked.
- Under Options from the Tools tab, add your name to LibreOffice User Data.
- Check the box for Use data for document properties.
- Click OK.
- From the Edit tab, turn on Record Changes.
- Make necessary edits.
- Save the document in LibreOffice Writer format (.odt), and then save a copy in Microsoft Word format (.doc or .docx) for upload, because CloudSearch doesn't accept documents in LibreOffice Writer format.
Step 2: Start CloudSearch
- Sign in to the AWS Management Console.
- Select CloudSearch.
- Choose an active domain in the dashboard.
- Click Upload Documents.
- Select File(s) on my local disk and click Browse for the sample file you want to upload.
- Click Continue.
After CloudSearch analyzes the sample document, the dashboard provides a list of index fields:
The italicized fields are not configured for the domain. You cannot proceed until you configure them.
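To make the "not configured" warning concrete, here is a minimal sketch of the check it implies. It assumes the upload ends up as a CloudSearch JSON document batch (a list of "add" operations, each with an id and a fields map) and compares the field names in that batch with the fields the domain already defines. The helper names, the document ID "doc1", and the single configured field are hypothetical and only serve the illustration.

```python
import json


def make_add_operation(doc_id, fields):
    """Build one "add" operation in the JSON document batch format."""
    return {"type": "add", "id": doc_id, "fields": fields}


def unconfigured_fields(batch, configured):
    """Return the field names used in the batch that the domain does not define."""
    used = set()
    for operation in batch:
        used.update(operation.get("fields", {}))
    return sorted(used - set(configured))


# Field values are taken from the sample results in this article;
# the document ID and the configured field list are assumptions.
batch = [make_add_operation("doc1", {
    "content": "Biometric vulnerability assessment changes.",
    "author": "Judith",
    "language": "en-US",
})]
configured = ["content"]

print(json.dumps(batch, indent=2))
print(unconfigured_fields(batch, configured))
```

Any name this check reports is a field you would need to add to the domain configuration, or drop from the batch, before the upload can go through.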
Step 3: Locate the source of improperly configured index fields
- Select About LibreOffice from the Help tab to get the source for the application_name field.
- Go to Language Settings under Options from the Tools tab to get the source for the language field. English (USA) is the default option.
- Open the General Properties tab from Properties under the File tab to get the source for the fields in the following table.
|Index field|Source in General Properties|
|author|The author's name follows the date in the Created property item.|
|creator|The creator's name follows the date in the Created property item.|
|last_author|The author's name follows the date in the Modified property item.|
|last_modified|The field corresponds to the Modified property item, excluding the author's name following the date.|
|total_time|The field corresponds to the Total editing time property item.|
You can decide whether to delete or keep these fields before adding new index fields to the domain configuration.
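If you prefer to check these sources outside the LibreOffice dialogs, the copy saved as .docx stores the same properties in a file named docProps/core.xml inside the document archive. The sketch below reads a few of them with the Python standard library. The mapping from index field names to XML elements is my assumption about how the fields line up, not something CloudSearch documents, and the in-memory sample file is built only so the example runs on its own.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml in a .docx package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Assumed mapping from index field name to the core.xml element.
FIELD_SOURCES = {
    "creator": "dc:creator",
    "last_author": "cp:lastModifiedBy",
    "last_modified": "dcterms:modified",
}


def read_core_properties(docx_file):
    """Return {field: value} for each mapped property present in a .docx file."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    values = {}
    for field, path in FIELD_SOURCES.items():
        element = root.find(path, NS)
        if element is not None and element.text:
            values[field] = element.text
    return values


# Build a tiny stand-in archive that holds only the properties file.
sample_core = (
    '<cp:coreProperties xmlns:cp="{cp}" xmlns:dc="{dc}" xmlns:dcterms="{dcterms}">'
    "<dc:creator>Judith</dc:creator>"
    "<cp:lastModifiedBy>Judith</cp:lastModifiedBy>"
    "</cp:coreProperties>"
).format(**NS)

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docProps/core.xml", sample_core)
buffer.seek(0)

print(read_core_properties(buffer))
```

To try it on a real file, pass the path of the saved .docx to read_core_properties instead of the in-memory buffer. Properties that are missing from the file are simply left out of the result.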
Step 4: Add new index fields
- Sign in to CloudSearch to open the dashboard.
- Select an active domain name.
- Select Indexing Options.
- Click Add Index Field once for each field you want to add.
Note: By default, CloudSearch makes each new field searchable.
- Click Submit.
- Click Run indexing to rebuild the index.
- Select OK to start indexing. A small index typically takes several minutes to build and deploy, but a large index can take several hours. The smaller the index, the lower the cost of rebuilding it.
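The console steps above can also be scripted. The following is a rough sketch, assuming a client object that offers define_index_field and index_documents the way the boto3 CloudSearch client does. The field types are my own guesses for illustration (the article does not specify them), total_time is left out because its format is not stated, and the domain name "my-test-domain" is hypothetical. A small recording stand-in replaces the real client so the example runs without AWS access.

```python
# Index field types accepted by CloudSearch.
VALID_TYPES = {
    "int", "double", "literal", "text", "date", "latlon",
    "int-array", "double-array", "literal-array", "text-array", "date-array",
}

# Assumed types, chosen for illustration only.
FIELD_TYPES = {
    "author": "text",
    "creator": "text",
    "last_author": "text",
    "last_modified": "date",
    "language": "literal",
    "application_name": "literal",
}


def build_index_fields(field_types):
    """Turn {name: type} into IndexField dictionaries, rejecting unknown types."""
    index_fields = []
    for name, field_type in field_types.items():
        if field_type not in VALID_TYPES:
            raise ValueError("unsupported index field type: " + field_type)
        index_fields.append({"IndexFieldName": name, "IndexFieldType": field_type})
    return index_fields


def configure_domain(client, domain_name, field_types):
    """Define every field, then request a single re-index for the whole set."""
    for index_field in build_index_fields(field_types):
        client.define_index_field(DomainName=domain_name, IndexField=index_field)
    client.index_documents(DomainName=domain_name)


class RecordingClient:
    """Stand-in that records calls so the sketch runs without AWS access."""

    def __init__(self):
        self.calls = []

    def define_index_field(self, **kwargs):
        self.calls.append(("define_index_field", kwargs))

    def index_documents(self, **kwargs):
        self.calls.append(("index_documents", kwargs))


# With boto3 installed and credentials set up, boto3.client("cloudsearch")
# could take the place of the stand-in.
client = RecordingClient()
configure_domain(client, "my-test-domain", FIELD_TYPES)
for name, arguments in client.calls:
    print(name, arguments)
```

Defining all the fields first and indexing once at the end mirrors the console flow and avoids paying for a rebuild after every single field.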
Step 5: Submit search request
- Select Dashboard.
- Click Upload Documents to upload documents from S3 buckets or your local file system.
- Click Run a Test Search.
- Enter a word in the Search box, for example, "vulnerability."
- Click Go. CloudSearch sorts the results by each document's relevance score in descending order.
Note: The score is based on how often the search term appears in the document compared with how common the term is across all documents in the domain. This example uses a single document.
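The same test search can be sent to the domain's search endpoint instead of the console form. Here is a minimal sketch that only builds the request URL, assuming the 2013-01-01 search API with its q, sort, size, and return parameters. The endpoint host name is a placeholder I made up; yours is shown on the domain dashboard.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(search_endpoint, query, size=10):
    """Build a search URL that sorts hits by relevance score, highest first."""
    params = {
        "q": query,
        "sort": "_score desc",           # same ordering as the console test search
        "size": size,
        "return": "_all_fields,_score",  # include the score with each hit
    }
    return "https://{}/2013-01-01/search?{}".format(search_endpoint, urlencode(params))


# Hypothetical endpoint, for illustration only.
endpoint = "search-example.us-east-1.cloudsearch.amazonaws.com"
url = build_search_url(endpoint, "vulnerability")

print(url)
print(parse_qs(urlsplit(url).query))
```

Opening the resulting URL in a browser or fetching it with any HTTP client should return the hits as JSON, provided the domain's access policy allows the request.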
Here are the results for a document:
The document score is 0.5532488. The search term was found once.
The creator and author is Judith.
The last author is Judith.
The document was revised six times.
The language is English (USA), the default option (en-US).
The creation date is June 20, 2014 at 10:08:15.
The content is: Biometric vulnerability assessment changes.
The modified date is 12:57:38.
The content type is application/vnd.openxmlformats (Word docx).
The resource name is Biometric vulnerability assessment.docx.
In conclusion, run a test search with a small document to keep the processing costs low. If the test is successful, you can use the index to search a large collection of documents.