Lazy Evaluation Magic

by Joe Crean 

Laziness is rarely viewed as a good thing. But used appropriately, lazy evaluation can be a great technique!

Lazy evaluation is the ability to defer processing of data until it is actually needed. The example presented here shows how a call to cts:search() can be set up but only evaluated at a later point, when it is directly bound into an expression. This offers great advantages when we want to process a very long sequence (e.g. the search matches millions of documents) and we don't really want to access all members of the sequence. Lazy evaluation can be very useful; however, it is not always desired. For example, if we need to read out the contents of a range index, we may explicitly want to read all the contents of the index into memory. See this article for an example of this in action.
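Before looking at the XQuery, the core idea can be sketched in Python with a generator (a purely illustrative analogy — the corpus and search function here are made up, but cts:search behaves similarly by returning an iterator rather than a fully materialized sequence):

```python
from itertools import islice

def search(corpus, term):
    """Yield matching documents one at a time, only when the caller asks."""
    for doc in corpus:
        if term in doc["content"]:
            yield doc  # nothing below this line runs until a result is consumed

# A large fake corpus: odd ids contain "foo", even ids contain "bar".
corpus = [{"id": i, "content": "foo" if i % 2 else "bar"}
          for i in range(1_000_000)]

results = search(corpus, "foo")   # returns instantly; no search has run yet
first_three = [doc["id"] for doc in islice(results, 3)]
```

Even though the corpus has a million entries, only enough of it is scanned to produce the three requested matches.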

Recently we needed to implement a common customer requirement: produce a sequence of search results for “child” documents grouped by their “parent” document and specifically, for each parent document return only the most relevant hit. For the purposes of clarity let’s imagine that the parent documents are chapters with tables of contents and the child documents are sections with the actual content. We want to search all sections and receive a list of chapters whose children match the query. This list is in descending relevance order. Each chapter with a hit should only appear once in the list.

Child documents can be identified by an element is-chapter containing a boolean value set to false. Each child document has a parent chapter whose id is stored in the element chapter-id.

To help you visualize this let’s look at some sample XML.

Here is a sample of a chapter document:

<document id="0001">
  <is-chapter>true</is-chapter>
  <content>…details…</content>
</document>
And here is a sample section document. Note that it has an element chapter-id which identifies the chapter to which it belongs:
<document id="0001-0001">
  <chapter-id>0001</chapter-id>
  <is-chapter>false</is-chapter>
  <content>…details…</content>
</document>
And here is an XQuery implementation of this requirement which makes use of lazy evaluation:
let $qtxt := "foo"
let $query := cts:and-query((
    cts:element-range-query(xs:QName("is-chapter"), "=", "false"),
    cts:word-query($qtxt)
  ))
(: extract the parent doc id for every hit :)
let $p := function($result as node()) { $result/chapter-id/fn:string() }
let $options := ( cts:score-order("descending"), "unfiltered" )
return
  fn:distinct-values( fn:map($p, cts:search(/document, $query, $options)) )[1 to 10]
In the and-query we are performing a word search in all "non-chapter" docs (is-chapter = false). We then set up the final call of the query, which returns a list of parents of the matched children in descending relevance order.

Profiling shows that this query executes lightning fast, and that increasing numbers of distinct values (the chapter-ids) are found with remarkably little overhead — even though the actual search query could match many documents.

For example: in order to get the first 10 distinct values in our test case, the query only needed to look at the first 12 search hits (the inline function $p above was called 12 times). To retrieve the first 20 distinct values, it needed to examine the first 31 search hits.

So what exactly is going on here? Shouldn’t we be paying a massive performance penalty for retrieving all search results and then trying to extract the distinct values? Enter lazy evaluation and some pretty useful features of the MarkLogic Search API.

The call to cts:search returns an iterator, and the profiler shows that this function is called just once. The call to fn:map also returns an iterator (although, thanks to a profiler bug, it shows as being invoked many times — it is only invoked once). The function fn:distinct-values takes advantage of these iterators, pulling values from them only as needed to fill up the sequence of unique values. Thanks to these iterators, the query never needs to retrieve all search results up front, which makes it highly efficient.
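To make the contrast with an eager pipeline concrete, here is a hedged Python sketch (made-up data, not MarkLogic output) comparing a version that materializes every mapped value before deduplicating against an iterator-based version:

```python
from itertools import islice

touched = 0

def extract(hit):
    """Stand-in for the $p mapping step; counts how many hits are examined."""
    global touched
    touched += 1
    return hit // 5          # pretend five consecutive hits share one parent

def lazy_distinct(values):
    """Yield each value the first time it appears, pulling input on demand."""
    seen = set()
    for v in values:
        if v not in seen:
            seen.add(v)
            yield v

hits = range(100_000)

# Eager: dict.fromkeys drains the entire generator before we slice.
touched = 0
eager_first_ten = list(dict.fromkeys(extract(h) for h in hits))[:10]
eager_cost = touched          # every single hit was examined

# Lazy: the pipeline stops as soon as ten unique parents are found.
touched = 0
lazy_first_ten = list(islice(lazy_distinct(map(extract, hits)), 10))
lazy_cost = touched           # only a few dozen hits examined
```

Both versions return the same ten parents, but the eager one pays for all 100,000 hits while the lazy one pays only for the handful it actually needed — the same effect the profiler revealed in the XQuery above.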

This is a very useful technique if you want to selectively process large result sets!

Sponsored by MarkLogic
