0 likes | 20 Views
Unlock the power of ArangoDB, the most complete graph database. Explore its scalability for multiple use cases including fraud detection, supply chain, network analysis, traceability, recommendations, and more. Trusted by global enterprises. Explore the advantage today!<br><br>Website: https://arangodb.com/<br>Location: San Francisco, CA 94104-5401 United States<br><br>#upsert operation, #cypher unwind, #fast graphs, #machine database
E N D
A Primer on Query Performance Optimization in ArangoDB Query Analysis. Indexes. Multi-Model.
Welcome In this course you will learn how to achieve speed ups for your queries in ArangoDB. We are going to cover fundamental techniques, with a focus on different indexes and when to use which. We assume that you have written a few AQLs already. As example dataset to try out queries and see the differences in the execution plans and query performance, a modified version of the Amazon meta dataset featuring product information is provided by us. Feel free to experiment with this dataset following the course and enjoy! Special thanks to @kavin and @chintanmishra for their detailed feedback to the beta version of this course!
What you will learn ‣ Profiling of queries ‣ Analyzing queries manually ‣ Usage of different index types ‣ Computational complexity of queries ‣ Query caching ‣ Storage engine differences If you are completely new to ArangoDB, check out our other training material: ArangoDB Training Center – Getting Started, Graph Course, New Features in v3.2 and more 3
Table of Content ‣ Preparations for this course (p.5) ‣ Search value range / sorting (p.42) ‣ Combine hash and skiplist index? (p.45) ‣ Starting with the dataset (p.8) ‣ Importing with arangoimp (p.9) ‣ Dataset structure (p.13) ‣ When to use a graph data model? (p.49) ‣ Category relations (p.50) ‣ similar products graph (p.53) ‣ Queries blocking collections (p.19) ‣ Substring search in attribute (p.58) ‣ Find exact matches (p.23) ‣ Document key (p.24) ‣ other attributes (p.27) ‣ Other (p.65) ‣ Vertex-centric indexes (p.66) ‣ Query caching (p.67) ‣ Fast joins (p.33) ‣ Final Tasks (p.68) ‣ Hidden lookups (p.37) 4
Preparations for this course Requirements & Download and Install ArangoDB 5
Requirements To follow this course and perform the described exercises, you will need: ‣ 4GB of RAM ‣ some free disk space ‣ an internet connection ‣ a web browser which supports Canvas or WebGL ‣ max. two hours of time Note: Click these symbols to get more details on a topic This icon marks queries which will take a long time if run. We recommend to only execute these queries with indexes in place, in order to not interrupt your reading and learning flow. 6
Download and install ArangoDB ‣ Go to arangodb.com/download/ to find the latest Community or Enterprise Edition for your operating system. Follow the instructions on how to download and install it for your OS. Further details can be found here: docs.arangodb.com/latest/Manual/GettingStarted/Installing/ ‣ Once the server booted up, open http://localhost:8529 in your browser to access Aardvark, the ArangoDB WebUI ‣ Login as "root", leave the password field empty and Login. After that select "_system" as your database. 7
Starting with the dataset Importing with Arangoimp & Dataset structure 8
Download and Import the Dataset What we provide is a modified version of an Amazon dataset, comprised of product meta data. Please download the training dataset zip ArangoDB_PerformanceCourse_DemoData.zip (46MB) and unpack it to a folder of your choice. After unpacking you should see two newline-delimited JSON files named products.jsonl and categories.jsonl. JSON Lines website Let’s import the data with ArangoDB's import tool arangoimp. ‣ Run the following in your console (single line): arangoimp --file path to products.jsonl on your machine --collection products --create-collection true ‣ Repeat this for the other file: arangoimp --file path to categories.jsonl --collection categories --create-collection true Remember that we didn’t define a password for the user "root" so please just hit return when asked for a password during import! 9
Download and Import the Dataset You should see something like this in your console after putting in the import command: Please specify a password: Connected to ArangoDB 'http+tcp://127.0.0.1:8529', version 3.2.3, database: '_system', username: 'root' ---------------------------------------- database: _system collection: products create: yes source filename: /path/to/products.jsonl file type: json threads: 2 connect timeout: 5 request timeout: 1200 ---------------------------------------- Starting JSON import... 2017-09-21T05:00:24Z [10364] INFO processed 49838607 bytes (6%) of input file 2017-09-21T05:00:25Z [10364] INFO processed 99677214 bytes (12%) of input file 2017-09-21T05:00:25Z [10364] INFO processed 149483054 bytes (18%) of input file 2017-09-21T05:00:26Z [10364] INFO processed 199321661 bytes (24%) of input file 2017-09-21T05:00:27Z [10364] INFO processed 249127501 bytes (30%) of input file 2017-09-21T05:00:28Z [10364] INFO processed 298966108 bytes (36%) of input file 2017-09-21T05:00:29Z [10364] INFO processed 348771948 bytes (42%) of input file 2017-09-21T05:00:30Z [10364] INFO processed 398610555 bytes (48%) of input file 2017-09-21T05:00:31Z [10364] INFO processed 448416395 bytes (54%) of input file 2017-09-21T05:00:32Z [10364] INFO processed 498255002 bytes (60%) of input file 2017-09-21T05:00:33Z [10364] INFO processed 548093609 bytes (66%) of input file 2017-09-21T05:00:34Z [10364] INFO processed 597899449 bytes (72%) of input file 2017-09-21T05:00:35Z [10364] INFO processed 647738056 bytes (78%) of input file 2017-09-21T05:00:36Z [10364] INFO processed 697543896 bytes (84%) of input file 2017-09-21T05:00:38Z [10364] INFO processed 747382503 bytes (90%) of input file 2017-09-21T05:00:39Z [10364] INFO processed 797188343 bytes (96%) of input file created: 548552 warnings/errors: 0 updated/replaced: 0 ignored: 0 10
What did arangoimp do? ‣ Created two new document collections (products and categories) ‣ Created one document for each line in the source files ‣ Imported around 550,000 product and 14,500 category documents ‣ Please note: ‣ The product ID of each original Amazon record is used as document _key in the edited dataset we provide for this course ‣ Product reviews were removed, because they are not used in the course ‣ The categories documents were generated from the "categories" attribute in the products documents and are referenced by their _key in the "categoryKeys" attribute of each product, to demonstrate another data model. 11
In ArangoDB WebUI ‣ Go to ArangoDB WebUI (http://localhost:8529 in your browser) and choose "COLLECTIONS" in the menu ‣ Collection "products" and "categories" should be there now ‣ The document icon indicates that both are vertex collections, not edge collections ‣ Click on a collection to browse its documents 12
Starting with the dataset Importing with Arangoimp & Dataset structure 13
Data structure of products documents Example product as shown in the document editor of the web interface: Attribute Description _key a unique product identifier (as string) _id collection name + "/" + _key (computed property) ASIN Amazon Standard Identification Number title product title group generic product type (e.g. Book) salesrank numeric value, lower means more units sold similar array of similar products (ASINs as strings) categories array of product categories categoryKeys array of document keys (categories collection) 14
Hands on: First AQL Queries You may run below queries to get an impression of the data. We will later show and explain other queries in more detail. Copy below queries into the query editor and - click Execute or hit - Ctrl + Return, respectively - Cmd + Return (Mac keyboard) ‣ Return ten products FOR prod IN products LIMIT 10 RETURN prod ‣ Return the titles of the most popular DVDs FOR prod IN products FILTER prod.group == 'DVD' AND prod.salesrank > 0 SORT prod.salesrank RETURN prod.title 15
Hands on: First AQL Queries The query results appear below the query editor. Click on next to the total query runtime in the header of the result box to display detailed profiling information. Click the Explain button if you want to see the execution plan for the query. This does not run your query, but merely analyzes and optimizes the query statement by checking for opportunities to use indexes and applying other optimization rules that aid query execution speed. By the way: Query run times mentioned in this course are based on the RocksDB storage engine and a recent desktop computer. 16
Hands on: First AQL Queries The simple product query explained: We can see the query string, how it translates to the execution plan and that there are no indexes used and no optimization rules applied so far: 17
Hands on: First AQL Queries ‣ Return the ten most used categories FOR prod IN products FOR cat IN prod.categories OR [] COLLECT category = cat WITH COUNT INTO count SORT count DESC LIMIT 10 RETURN { category, count } COLLECT in AQL documentation prod.categories OR [] is used to fallback to an empty array in case of no categories (some products don't have any) in order to prevent an error with the FOR loop. By the way: Keywords like FOR, COLLECT and RETURNare written all upper case in the code examples, but it is merely a convention. You may also write them all lower case or in mixed case. Names of variables, attributes and collections are case-sensitive however! 18
Queries blocking collections Memory-mapped files vs. RocksDB 19
Queries blocking collections While running a data modification query that inserts new documents into a collection for instance, other queries that try to read from the same collection are blocked – if you use ArangoDB with the memory mapped files (MMFiles) storage engine. Using this mostly-memory engine, writes block reads and locking is on collection level, i.e. the entire collection is locked for the duration of the query, even if only single document is created or modified. It guarantees, that no simultaneous query interferes with the query, or sees a not yet committed change (ACID transaction). With the RocksDB storage engine however, document-level locks allow for concurrent writes. Writes do not block reads. Reads do not block writes. You can test this with the following query: FOR i IN 1..100 INSERT {value: i} INTO products RETURN SLEEP(1) Storage Engine documentation 20
Queries blocking collections The query inserts documents into the products collection we already have, and waits for a second between each write. This gives us enough time to try to view the products collection in the web interface after starting the query. You will see the "Fetching documents" animation and the documents will not show before the slow query completed (after 100 seconds) unless you are using the RocksDB storage engine. If you are unsure which storage engine you are currently using, check the dashboard in the web interface (versions before 3.2 support MMFiles only and do not show an engine): Switching the storage engine 21
Queries blocking collections If you compare the query execution times of the examples presented in this course, performance will probably be in favor of MMFiles. However, this is under laboratory conditions. You will likely see a notable performance benefit with RocksDB in production with many concurrent requests. MMFiles RocksDB default optional dataset needs to fit into memory work with as much data as fits on disk hot set in memory, data and indexes on disk fast startup (no rebuilding of indexes) collection data always persisted concurrent reads and writes indexes in memory slow restart due to index rebuilding volatile collections (only in memory, optional) collection level locking (writes block reads) See the key differences between both storage engines in alongside table. If RocksDB is suitable for your use case, consider switching from MMFiles. Comparing new RocksDB and MMFiles storage engines 22
Find exact matches Document key & other attributes 23
Find exact match – document key The probably most common operation you want to perform in a database system is to retrieve one particular document. If you know the primary key (or document key in ArangoDB terms) of that document, you can do a lookup in a key-value store fashion. What that means is, that the database system returns a value – the desired document – upon your request, in which you tell it the key for that value. What is in that document doesn’t matter, the database system treats it as opaque. It’s like a book in a shelf with a number on it and you ask the librarian for that book, and he brings it without opening and reading in it. Let’s do such a request in AQL using the example dataset. 24
Find exact match – document key RETURN DOCUMENT('products/31931') { "_key": "31931", "_id": "products/31931", "ASIN": "0851993575", "title": "Avocado: Botany, Production and Uses", "group": "Book", … } We request the document with key "31931" from the products collection, which happens to be a book about Avocado. The query time should be in the sub milliseconds area. There is a primary index on the _key attribute of every collection in ArangoDB, which is used in lookups by _key or _id and it can not be removed. 25
Find exact match – document key Here is an equivalent AQL query, which also returns this document: FOR prod IN products FILTER prod._key == '31931' RETURN prod It is a bit more to type yet the query achieves the same result and is equally fast. The benefit of above query is the flexibility of the FOR loop-style approach. We can easily replace _key by any other attribute key like ASIN or group. The DOCUMENT function on the other hand can only retrieve documents via the document ID or a combination of collection name and document key. DOCUMENT function in AQL documentation 26
Find exact matches Document key & other attributes 27
Find exact match – other attribute Instead of filtering by the document key, we could use the Amazon Standard Identification Number (ASIN) to find the document: FOR prod IN products FILTER prod.ASIN == '0851993575' RETURN prod The result is identical, but wait, why does it suddenly take seconds than milliseconds to compute? Well, to get back to the library metaphor: the book doesn’t have the ASIN written on the cover. And there’s also no rule where to find it based on what the ASIN is, unlike the indexed document key. This leads to the librarian opening every book in the library to check what ASIN is written inside, which takes way longer than to locate a book by its key. He has to perform what is called a full collection scan in database terms. 28
Find exact match – other attribute For a full collection scan, we start at the first document in the collection and check the ASIN attribute if it matches our condition. Then we check the second document and so on until the last document in the collection. If we know that every value exists once at most (which is the case for ASIN), we can stop scanning after the first match because there can't be another match. To express this in a query, we can do: FOR prod IN products FILTER prod.ASIN == '0851993575' LIMIT 1 // stop after first match RETURN prod The limit operation can be omitted if an index with the unique option exists for the attribute. 29
Find exact match – other attribute How can we avoid the full collection scan? By creating an index for the ASIN, so that we can do a fast lookup. Since we are looking for an exact match of the ASIN, the correct choice is a hash index. The primary index on the key attribute is also a hash index by the way. 30
Find exact match – other attribute Now let’s re-run the same query from before: FOR prod IN products FILTER prod.ASIN == '0851993575' RETURN prod It should take like a millisecond, so about the same time as looking up the document by its key. Here is a comparison of the execution plans before and after adding the index (using the Explain feature): 31
Find exact match – other attribute You can see that the hash index is used in the middle section of the explain output, as well as by the optimization rule "use-indexes" listed at the bottom. The execution plan looks different as well, there is no FILTER statement anymore. Instead of a full collection scan, the hash index is consulted to retrieve the right document. The selectivity of the hash index is 100%, which means there is at most one document for a given ASIN. This is always the case for indexes with the unique option enabled. If there were duplicate ASIN values, the index creation would had failed with a unique constraint violation error. 32
Joins Fast joins with hash indexes 33
Fast joins For some queries, the performance gain achieved with a hash index can be higher by orders of magnitude. This is usually the case if you use JOINs to match up documents of multiple collections. FOR prod IN products LET categories = ( FOR catKey IN prod.categoryKeys FOR cat IN categories FILTER cat._key == catKey RETURN cat.label ) RETURN MERGE(prod, {categories: categories}) Long running query! You may add e.g. LIMIT 10 after first line. If there are category keys stored per product and we want to resolve those keys to their readable name, then we would need to iterate over the categoryKeys and join the category labels. Without the primary index on _key, the computational complexity would be really high in the worst case - for each key all documents in "categories" would be checked until a match is found. 34
Fast joins For 5 categoryKeys and with 14,500 categories in total and if we assume that we find a match after checking 50% of all categories, that would sum up to 36,250 document reads - for a single product. To do that for all 550,000 products would mean almost 20 billion document reads. This insanely high computational complexity makes such a query a no-go. No matter how much hardware you throw at it, you can not perform this query in production in a reasonable manner. 35
Fast joins Because there is a hash index on _key that can be used for matching up the categoryKeys however, there is no need to check a lot of documents in categories. Instead, the correct document can be looked up immediately. So for 5 categoryKeys, it takes 5 lookups in the hash index for one product, or 2.75 million for all products. As a rule of thumb: If you write queries that contain nested FOR loops, make sure to check if indexes can be and are actually utilized by inspecting the query explain output. 36
Hidden lookups Lookups in UPSERT operations 37
Hidden lookups The previous example had nested FOR loops in the query, which are easy to spot and should make you think for a second whenever ever you see such a construct. Check the explain output to verify that some index is used for good performance. Sometimes it is less obvious. An UPSERT statement inside a FOR loop can also have a high runtime complexity. If we iterate over each product and either INSERT or UPDATE a document based on whether or not a matching document already exists, then there's an obfuscated lookup to check for the existence of a matching document – which can be another full collection scan if no index can be used. Create a new document collection "ranks" if you want to execute the query: FOR prod IN products UPSERT { salesrank: prod.salesrank } INSERT { salesrank: prod.salesrank, count: 1 } UPDATE { count: OLD.count + 1} IN ranks Extremely long running query! 38
Hidden lookups The query creates one document per salesrank value and counts the number of occurrences. The query explain reveals that there is a nested full collection scan, which is the representation of the UPSERT { salesrank: prod.salesrank } expression to check for an existing document. No indexes are used. 39
Hidden lookups If we create a unique hash index for the "salesrank" attribute for the ranks collection and explain the query, we can see the index being used for the lookup: The query takes about a minute to complete with the hash index scan, and a long time without. 40
Hidden lookups The better choice to compute the same results would be a COLLECT query however, which does not require any index being set up before the query execution and takes only around 15 seconds to complete. Note: Remove the index on salesrank and truncate the ranks collection before the query. FOR prod IN products COLLECT salesrank = prod.salesrank WITH COUNT INTO count INSERT { salesrank, count } INTO ranks 41
Search value range / sorting Using skiplist indexes 42
Search a value range / sorting We have a "salesrank" attribute in the documents, which we could use to sort the products and return the bestsellers. Let's try that without any preparations: FOR prod IN products SORT prod.salesrank FILTER prod.salesrank > 0 LIMIT 100 RETURN prod We sort by salesrank in ascending order (default), filter out some oddball values (-1, 0 and no salesrank value) and return the top 100 products. The query is kind of slow and actually sorts all products over and over again if you run the query multiple times. We could create a hash index on the attribute, but it wouldn't be used at all – because hash indexes exclusively support exact matching and can't compare the indexed values. 43
Search a value range / sorting What we need is a skip list index, which is a sorted list of values with shortcuts to find and return a range of values pretty fast: 3 milliseconds with the index, and around 6 seconds without. 44
Combine hash and skip list indexes? Multiple indexes & indexes over multiple fields 45
Combine hash and skip list indexes? An actually more common query in a web shop however is to return bestsellers for a certain category. What we want to check is that the category a user is interested in is contained in the list of categories, then sort by salesrank and return the top products. It sounds like we should use a hash array index to exactly match the category, and a skip list index to sort by rank. This isn't possible however. Indexes can only be utilized in combination in a few cases. In a case like ours, we could either pre-filter products with a hash index, then sort the documents without index, or go through all products from best to least selling via the skip list index and post- filter on the category. FOR prod IN products SORT prod.salesrank FILTER 'Study & Teaching[13985]' IN prod.categories LIMIT 100 RETURN CONCAT(prod.salesrank, '. ', prod.title) 46
Combine hash and skip list indexes? There is one more option however: a skip list index, that indexes the array elements of "categories" and also the salesrank. This creates a rather large index, but it is the most efficient way to return the top products per category: FOR prod IN products FILTER prod.salesrank > 0 AND prod.salesrank < 100000 SORT prod.salesrank FILTER 'Wild West[513098]' IN prod.categories LIMIT 100 RETURN prod 47
Combine hash and skip list indexes? The upper boundary for salesrank is optional. Any range could be retrieved quickly. If there was a price attribute, we could index that as well and define the relevant price range in the query. As you can see in the query explainer, the index is used to restrict the result to the Wild West category as well as to limit the salesrank. 48
Excursion: When to use a graph data model? Category relations & similar products graph 49
Excursion: When to use a graph data model? We worked with the directly embedded categories as well as with the categoryKeys, which we used to join the category labels. The former approach is the fastest if we query for products, but at the cost of redundant category data, which may become inconsistent over time (i.e. if you want to rename a category, you need to update all product documents which have this category). Moreover, you can't quickly return a list of all distinct categories because you need to check each product document and only return unique category entries. The latter approach using joins overcomes these drawbacks at the cost of a slightly slower query execution speed if the category labels are joined into the result. 50