Learn how to manage unstructured data by building a document database with document, page indexing and retrieval solutions using Elasticsearch and Amazon Web Services
Building an unstructured data management solution with ElasticSearch and Amazon Web Services
A document- and page-level retrieval solution powered by ElasticSearch, proposed to handle a business requirement at Mobius
Topics Covered
• The business need we faced
• Why ElasticSearch to meet our challenge?
• Adopting the Parent-Child relationship in ElasticSearch
• ElasticSearch Document Database Architecture
• Technical implementation of the solution
  • Plugin Creation
  • Index Creation
  • Indexing parent document
  • Indexing child document
  • Retrieving documents by query
• Possible Search Types in ElasticSearch
• How we adapted the phrase search
The Business need we faced
• A UK-based energy intelligence company required a document store database to hold analysis and research documents.
• The documents could be in various file formats, such as PDF, Excel and text files.
• Two kinds of retrieval were needed:
  • Page-level retrieval - to retrieve specific pages that match the searched content and tags.
  • Document-level retrieval - to retrieve an entire document based on the searched content and tags.
Why ElasticSearch to meet our challenge?
• Other document-level tagging and retrieval solutions, such as Aleph and OverviewDocs, did not have a clear feature for page-level retrieval.
• Features of ElasticSearch that appealed to us:
  • Open-source, distributed, readily scalable, enterprise-grade search engine.
  • Can power extremely fast and accurate full-text searches for data discovery applications.
  • Multiple configurations and variations are available to tag and index documents such as PDF and Excel files.
  • Capable of handling up to petabytes of data.
Adopting the Parent-Child relationship in ElasticSearch
• Document-level indexing is a common feature, but page-level indexing was not available by default.
• A tailor-made solution for page-level retrieval had to be built.
• We adopted the Parent-Child relationship in ElasticSearch to cater to our needs.
How would this work?
• The Parent stores the document meta information and document tags.
• The Child refers to the Parent type and indexes the page tags, page content and page-level meta information.
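A minimal sketch, in Python, of how the two record types might relate. All class and field names here (`ParentDocument`, `ChildPage`, `tags`, `page_tags`, and so on) are illustrative assumptions, not the schema used in the actual solution.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParentDocument:
    """Document-level record: meta information and document tags."""
    doc_id: str
    file_name: str                                   # hypothetical meta field
    tags: List[str] = field(default_factory=list)    # hypothetical tag field


@dataclass
class ChildPage:
    """Page-level record; parent_id ties the page back to its document."""
    page_id: str
    parent_id: str
    page_number: int
    content: str = ""
    page_tags: List[str] = field(default_factory=list)


def pages_of(parent: ParentDocument, pages: List[ChildPage]) -> List[ChildPage]:
    """Return the pages that belong to the given parent, in page order."""
    own = [p for p in pages if p.parent_id == parent.doc_id]
    return sorted(own, key=lambda p: p.page_number)
```

Keeping tags on both levels is what would allow a search to answer either "which document?" or "which page?".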
ElasticSearch Document Database Architecture
Though ElasticSearch serves as the core search engine, splitting, encoding and merging pages during retrieval calls for a proper document database system around it.
The architecture comprises four main parts:
• Parser
• AWS S3 Storage
• ElasticSearch
• Query Processor
Overview of the ElasticSearch Document Database Architecture
1. Parser:
• Parses the documents, splits them into pages and encodes each page to base64.
• Pushes the original (unencoded) page to AWS S3, and the base64-encoded page to ElasticSearch along with its AWS S3 location.
2. AWS S3 Storage:
• The document and its pages are saved here for later retrieval by the user.
• When a user searches for a document, we first hit ElasticSearch to fetch the meta information about the document, and then retrieve the corresponding document/page from AWS S3.
3. ElasticSearch:
• ElasticSearch serves as the core search engine for searching tags, documents and pages.
4. Query Processor:
• The end user queries the documents from here.
• When a search query is given, the query processor will:
  • Hit ElasticSearch and get the meta information.
  • Retrieve the actual document/page from AWS S3. This is done to attain maximum speed and performance.
• The result is then published to the end user.
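The parser's encoding step described above could look roughly like this in Python. This is a sketch under assumptions: the `s3_location` field name, the bucket and the key layout are hypothetical. Only the `data` field is taken from the slides, where the ingest pipeline reads it.

```python
import base64


def build_page_record(page_bytes: bytes, bucket: str, key: str) -> dict:
    """Build the body sent to ElasticSearch for a single page.

    The raw bytes are meant to be uploaded to S3 unchanged (not shown here);
    only the copy sent to ElasticSearch is base64 encoded.
    """
    return {
        # "data" is the field the ingest pipeline in the slides reads from
        "data": base64.b64encode(page_bytes).decode("ascii"),
        # hypothetical field name for the pointer back to the stored page
        "s3_location": f"s3://{bucket}/{key}",
    }


record = build_page_record(b"hello", "my-bucket", "docs/report/page-1.pdf")
print(record["s3_location"])
```

Storing only a pointer to S3 next to the indexed text is what would let the query processor fetch the original page after the search hit.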
Technical Implementation of the solution
The retrieval process built on the ElasticSearch engine can be broadly broken down into the following 5 steps:
• Plugin Creation
• Index Creation
• Indexing parent document
• Indexing child document
• Retrieving documents by query
1. Plugin Creation - To create the database in ElasticSearch, we have to convert the pages into base64-encoded content. We need to create an ingest pipeline, using the attachment processor plugin, to ingest base64-encoded PDF, Word and similar files and index them into ElasticSearch.
URL: http://localhost:9200/_ingest/pipeline/parser
Method: PUT
Body:
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}
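As a rough illustration, the same pipeline request can be prepared from Python using only the standard library. The helper name `pipeline_request` is an assumption. The request is built but not sent, so the sketch runs without an ElasticSearch node.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # host used in the slides


def pipeline_request(name: str = "parser") -> urllib.request.Request:
    """Prepare the PUT request that registers the attachment pipeline."""
    body = {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": "data"}}],
    }
    return urllib.request.Request(
        url=f"{ES_URL}/_ingest/pipeline/{name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = pipeline_request()
print(req.get_method(), req.full_url)
# To actually send it, a running node with the attachment processor is needed:
# urllib.request.urlopen(req)
```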
2. Index Creation - An index has to be created to hold the documents. Since there were no special search requirements, a default index with parent and child mappings was created. (Index_name is a placeholder; ElasticSearch requires index names to be lowercase.)
URL: http://localhost:9200/Index_name
Method: PUT
Body:
{
  "mappings": {
    "document": {},
    "pages": {
      "_parent": {
        "type": "document"
      }
    }
  }
}
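A small sketch of generating this mapping body in Python, in case the type names need to vary. The function name is hypothetical. Note also that the `_parent` mapping style shown in the slides applies to ElasticSearch versions before 6.0; later versions replaced it with the join field, so this should be read as tied to that older API.

```python
import json


def parent_child_mappings(parent_type: str = "document",
                          child_type: str = "pages") -> dict:
    """Return an index body in which child_type points at parent_type."""
    return {
        "mappings": {
            parent_type: {},
            child_type: {"_parent": {"type": parent_type}},
        }
    }


print(json.dumps(parent_child_mappings(), indent=2))
```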
3. Indexing parent document - When a new document is added, we index its document-level details in the parent document using the API call below.
URL: http://localhost:9200/Index_name/document/parent_id
Method: POST
Body:
{ "key": "value" }
4. Indexing child document - Once the parent is created, the pages and their related information can be indexed using the API call below. The "data" field holds the base64-encoded page.
URL: http://localhost:9200/Index_name/pages/child_id?parent=parent_id&pipeline=parser
Method: POST
Body:
{
  "filename": "C:\\Users\\myname\\Desktop\\bh1.pdf",
  "title": "Quick",
  "data": "SElHSEFDQ1VSQUNZUE9TVEFMQUREUkVTU0VYVFJBQ1RJT05GUk9NV0VCUEFHRVNieVpoZXl1YW5ZdVN1Ym1pdHRlZGlucGFydGlhbGZ1bGxsbWVudG9mdGhlcmVxdWlyZW1lbnRzZm9ydGhlZGVncmVlb2ZNYXN0ZXJvZkNvbXB1dGVyU2NpZW5jZWF0RGFsaG91c2llVW5pdmVyc2l0eUhhbGlmYXgsTm92YVNjb3RpYU1hcmNoMjAwN2NDb3B5cmlnaHRieVpoZXl1YW5ZdSwyMDA3"
}
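The parent and child URLs from steps 3 and 4 follow a fixed pattern, which can be sketched in Python as below. The helper names and the example ids (`doc1`, `doc1_p3`) are assumptions for illustration. `document_db` is the index name that appears in the phrase search slides.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:9200"  # host used in the slides


def parent_url(index: str, parent_id: str, parent_type: str = "document") -> str:
    """URL for indexing a parent (document-level) record."""
    parts = (index, parent_type, parent_id)
    return BASE_URL + "/" + "/".join(quote(p, safe="") for p in parts)


def child_url(index: str, child_id: str, parent_id: str,
              pipeline: str = "parser", child_type: str = "pages") -> str:
    """URL for indexing a page under its parent, routed through the pipeline."""
    parts = (index, child_type, child_id)
    path = "/".join(quote(p, safe="") for p in parts)
    query = urlencode({"parent": parent_id, "pipeline": pipeline})
    return f"{BASE_URL}/{path}?{query}"


print(parent_url("document_db", "doc1"))
print(child_url("document_db", "doc1_p3", "doc1"))
```

Percent-encoding the ids guards against characters such as spaces or slashes if ids are derived from file names.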
5. Retrieving documents by query - A document can be queried based on text, title and tags, and the method below can be used for all of them.
URL: http://localhost:9200/Index_name/pages/_search
Method: POST
Body:
{
  "query": {
    "match": {
      "attachment.content": {
        "query": "lorem"
      }
    }
  }
}
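Since the same query shape serves text, title and tag lookups, a tiny Python helper can parameterise the field. This is only a sketch: the helper name is hypothetical, and querying a field called `title` assumes the title was indexed under that name, as in the child document example.

```python
def match_query(text: str, field: str = "attachment.content") -> dict:
    """Build a match query body for the given field."""
    return {"query": {"match": {field: {"query": text}}}}


print(match_query("lorem"))
print(match_query("Quick", field="title"))
```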
Possible Search Types in ElasticSearch There are many search types in ElasticSearch by default. Below are a few of them -
How we adapted the phrase search
• Our business requirement was to perform a phrase search for content matching and an exact match for tag matching.
• We used two types of phrase searches:
  • Page Phrase Search
  • Document Phrase Search
Page Phrase Search
URL: http://localhost:9200/document_db/pages/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "attachment.content": {
              "query": "1Q17"
            }
          }
        }
      ]
    }
  },
  "_source": [ "_type", "_id", "Page_Number", "type", "File_Name" ],
  "highlight": {
    "fields": {
      "attachment.content": {}
    }
  }
}
Note: In this page search we select only the needed fields by listing them in the _source field. This avoids retrieving the page's base64-encoded content, which would increase both the response size and the latency.
Document Phrase Search
URL: http://localhost:9200/document_db/document/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "has_child": {
            "type": "pages",
            "query": {
              "match_phrase": {
                "attachment.content": "1-800-SEC-0330."
              }
            }
          }
        }
      ]
    }
  }
}
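The two phrase searches can be sketched as Python builders, together with a helper that pulls ids and highlight snippets out of a search response. The function names are assumptions, and the sample response in the usage lines is hand-made to mirror the usual `hits.hits` response layout rather than real output.

```python
def page_phrase_search(phrase, fields=("Page_Number", "File_Name")):
    """Page-level phrase search returning only selected fields plus highlights."""
    return {
        "query": {"bool": {"must": [
            {"match_phrase": {"attachment.content": {"query": phrase}}}
        ]}},
        "_source": list(fields),  # leaves out the base64 page content
        "highlight": {"fields": {"attachment.content": {}}},
    }


def document_phrase_search(phrase, child_type="pages"):
    """Document-level search: match parents that have a page with the phrase."""
    return {
        "query": {"bool": {"must": [
            {"has_child": {
                "type": child_type,
                "query": {"match_phrase": {"attachment.content": phrase}},
            }}
        ]}}
    }


def summarize_hits(response):
    """Reduce a search response to id, selected fields and highlight snippets."""
    rows = []
    for hit in response.get("hits", {}).get("hits", []):
        rows.append({
            "id": hit["_id"],
            "source": hit.get("_source", {}),
            "snippets": hit.get("highlight", {}).get("attachment.content", []),
        })
    return rows


# Hand-made sample response, for illustration only
sample = {"hits": {"hits": [{
    "_id": "p1",
    "_source": {"Page_Number": 3},
    "highlight": {"attachment.content": ["<em>1Q17</em> results"]},
}]}}
print(summarize_hits(sample))
```

The page search returns the matching pages themselves, while the `has_child` variant returns the parent documents, which is the difference between the two retrieval modes.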
Concluding Thoughts
• The solution outlined here is used as our document store database for document/page retrieval.
• Its response time ranges from a few milliseconds to a few seconds.
• Though the current scope of the solution is limited to PDF documents, we are planning to extend it to other document types such as spreadsheets and text files.
• Do you have another or a similar workaround for document retrieval? Share your ideas in the comment section or mail us at support@mobiusservices.com.
Thank You
Do visit our blog on the topic here: https://blog.mobiusdata.com/building-unstructured-data-management-solution-with-elasticsearch-and-aws/