Using Large Constants in the MongoDB MapReduce Framework

Posted on May 29th, 2013. Updated March 18th, 2015.

If you are working on data aggregation problems with the MongoDB MapReduce framework and your jobs seem slow, check this out: the trick described in this post made ours roughly 20 times faster.

In one of our data aggregation problems, we need to group subdomains under a single top domain (e.g. www.google.com, google.com, and mail.google.com are all grouped under google.com). To determine the primary domain, we find the longest eTLD (effective top-level domain) and take it together with the preceding label (e.g. for www.portugal.gov.pt, both .pt and .gov.pt are eTLDs; since .gov.pt is the longer of the two, we identify portugal.gov.pt as the domain).

To tell whether part of a domain is an eTLD, we match it against Mozilla’s list of public suffix rules. We have a function that uses a set consisting of all the rules, of which there are about 6,000. Since this is JavaScript, we use a plain object to mimic a set.
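For illustration, the set looks something like this (the variable name and the handful of entries below are just examples; the real object holds the full rule list):

```javascript
// A plain object standing in for a set: keys are the rules, values are
// truthy placeholders. The real object has roughly 6,000 entries parsed
// from Mozilla's public suffix list.
var suffixRules = {
  "com": true,
  "pt": true,
  "gov.pt": true
};

// Membership then becomes a simple property lookup:
suffixRules.hasOwnProperty("gov.pt"); // true
```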

Originally, the function had the following prototype:
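In outline, it looked roughly like this (the function name and body details are illustrative; the important part is that ‘host’ is the only parameter and ‘suffixRules’ is a free variable resolved against the global scope):

```javascript
// Sketch of the original shape: 'suffixRules' is a global, injected
// into the MapReduce job via the 'scope' option.
function getDomain(host) {
  var labels = host.split(".");
  var matched = -1;
  // Try progressively longer suffixes from the right; the last match
  // found is the longest eTLD.
  for (var i = labels.length - 1; i > 0; i--) {
    var candidate = labels.slice(i).join(".");
    if (suffixRules.hasOwnProperty(candidate)) {
      matched = i;
    }
  }
  // The domain is the longest eTLD plus the label that precedes it.
  return matched === -1 ? host : labels.slice(matched - 1).join(".");
}
```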

A global variable ‘suffixRules’ holds the set of all rules, and the ‘scope’ parameter is used to pass it to the MapReduce functions. With our current dataset (about 500 thousand records) and configuration, MongoDB took 36m56s to run the MapReduce job. Unsatisfied with this result, we tried a trick: since ‘suffixRules’ is in fact a constant, we embedded it in the function definition.
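For reference, passing the rules through ‘scope’ looks roughly like this (the collection name, output name, and map/reduce bodies are illustrative):

```javascript
db.records.mapReduce(
  function () { emit(getDomain(this.host), 1); },        // map
  function (key, values) { return Array.sum(values); },  // reduce
  {
    out: "domains",
    // 'scope' injects suffixRules as a global inside the job's
    // JavaScript context, where getDomain can see it.
    scope: { suffixRules: suffixRules }
  }
);
```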

First, we changed it so that ‘suffixRules’ becomes a parameter:
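In terms of the sketch above, the signature changed like this:

```javascript
// 'suffixRules' is now an explicit parameter instead of a global.
function getDomain(suffixRules, host) {
  // ... same body as before, now reading the parameter.
}
```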

Then we fixed the first argument before passing the function to MongoDB (we use the ‘system.js’ collection to store functions on the server side), using the following partial-application wrapper:
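Roughly like this (the wrapper name is ours; the key point is that functions saved to ‘system.js’ are serialized to their source text, so a closure over the constant would be lost, and the constant has to be inlined as a literal instead):

```javascript
// Partial application by source rewriting: embed the constant as a
// JSON literal inside a new function's source, so that it survives
// serialization into system.js.
function bindFirstArg(fn, constant) {
  var source = "function (host) { return (" + fn.toString() + ")(" +
               JSON.stringify(constant) + ", host); }";
  return eval("(" + source + ")");
}

// Store the wrapped function server-side; the ~6,000 rules now travel
// as part of the function's source.
db.system.js.save({
  _id: "getDomain",
  value: bindFirstArg(getDomain, suffixRules)
});
```

With this in place, the map function still calls getDomain(host) with a single argument, but the rule set is baked into the function body rather than passed through ‘scope’.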

This cut execution time down to 1m53s, roughly a 20-fold improvement. The performance lost when using the ‘scope’ option seems to be related to how variables are synchronized between MongoDB and the embedded V8 engine. Any thoughts?
