Aaron Miller

Creating a Couchbase 2.0 view of data from Reddit

The chart on reddalyzr.com is powered by Couchbase, using views.

All Reddits Chart

The chart is constructed from JSON documents representing links posted to reddit, such as this:

{
   "over_18": false,
   "banned_by": null,
   "is_self": false,
   "link_flair_text": null,
   "hidden": false,
   "edited": false,
   "kind": "link",
   "subreddit_id": "t5_2qh55",
   "downs": 5,
   "domain": "ibelieveicanfry.com",
   "selftext": "",
   "approved_by": null,
   "score": 5,
   "author": "ibelieveicanfry",
   "name": "t3_yph1p",
   "num_comments": 0,
   "selftext_html": null,
   "link_flair_css_class": null,
   "likes": null,
   "media_embed": {
   },
   "media": null,
   "title": "I don't buy the bottled Thai Sweet Chili Sauce anymore...",
   "thumbnail": "",
   "permalink": "/r/food/comments/yph1p/i_dont_buy_the_bottled_thai_sweet_chili_sauce/",
   "url": "http://www.ibelieveicanfry.com/2012/08/thai-sweet-chili-sauce.html",
   "created": 1345745189,
   "num_reports": null,
   "saved": false,
   "subreddit": "food",
   "ups": 10,
   "created_utc": 1345745189,
   "author_flair_css_class": null,
   "id": "yph1p",
   "author_flair_text": null,
   "clicked": false
}

The map function

The map function of a view is run on each document in the database. For each document, it can output any number of key-value pairs. These pairs can then be looked up by key, or all of the pairs in a range of keys can be looked up at once.

I want to output keys of the form [subreddit, day-of-week]. The collation rules[1] of Couchbase views will sort keys that are arrays first by the first element, then by the second, and so on. This means that all of the items in the view from the same subreddit will be grouped together, and then grouped by day of the week.
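To make the ordering concrete, here is a minimal sketch of a comparator that mimics how array keys sort. It is only a stand-in for illustration: the names typeRank and compareKeys are made up, strings are compared by plain code-unit order rather than the Unicode collation a real server uses, and the type order (null, false, true, numbers, strings, arrays, objects) is written out as an assumption in the code.

```javascript
// Simplified stand-in for view key collation (NOT Couchbase's real implementation).
// Assumed type order: null < false < true < numbers < strings < arrays < objects.
function typeRank(v) {
  if (v === null) return 0;
  if (v === false) return 1;
  if (v === true) return 2;
  if (typeof v === "number") return 3;
  if (typeof v === "string") return 4;
  if (Array.isArray(v)) return 5;
  return 6; // plain objects
}

function compareKeys(a, b) {
  var ra = typeRank(a), rb = typeRank(b);
  if (ra !== rb) return ra - rb;
  if (ra === 3) return a - b;
  // Code-unit comparison: a simplification of real string collation.
  if (ra === 4) return a < b ? -1 : (a > b ? 1 : 0);
  if (ra === 5) {
    // Arrays: compare element by element; a shorter prefix sorts first.
    for (var i = 0; i < Math.min(a.length, b.length); i++) {
      var c = compareKeys(a[i], b[i]);
      if (c !== 0) return c;
    }
    return a.length - b.length;
  }
  return 0; // objects and equal singletons are treated as equal in this sketch
}

var keys = [["funny", 1], ["food", 3], ["funny", 0], ["food", 0]];
keys.sort(compareKeys);
console.log(JSON.stringify(keys));
// [["food",0],["food",3],["funny",0],["funny",1]]
```

Every key for one subreddit ends up adjacent, and within a subreddit the days appear in order, which is the property the range queries rely on.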

All queries on a Couchbase view operate on a contiguous range of keys, so it’s important to order them so that the subsets of data that need to be requested together or aggregated over are grouped together.

function (doc, meta) {
  // Skip documents that aren't JSON
  if (meta.type == "json") {
    // Skip docs that aren't links
    if(doc.kind == "link") {
      var dt = new Date(doc.created_utc * 1000);

      //Get day of week, but start week on Saturday, not Sunday, so that
      //we can pull out the weekend easily.
      var ssday = dt.getUTCDay() + 1;
      if (ssday == 7) ssday = 0;

      emit([doc.subreddit, ssday], {hour: dt.getUTCHours(), score: doc.score});
    }
  }
}
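The map function can be tried outside the server with a small harness. This is only a sketch: the emit function below is a mock that collects rows in an array, the meta object is a hand-built stand-in for what the server would pass, and the document is a trimmed copy of the sample above with only the fields the function reads.

```javascript
// Mock emit: collect emitted rows instead of writing them to an index.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// The map function from above, given a name so it can be called directly.
function map(doc, meta) {
  if (meta.type == "json") {
    if (doc.kind == "link") {
      var dt = new Date(doc.created_utc * 1000);
      var ssday = dt.getUTCDay() + 1;
      if (ssday == 7) ssday = 0;
      emit([doc.subreddit, ssday], { hour: dt.getUTCHours(), score: doc.score });
    }
  }
}

// Trimmed copy of the sample document.
var doc = { kind: "link", subreddit: "food", score: 5, created_utc: 1345745189 };
map(doc, { id: "yph1p", type: "json" });

// A hypothetical non-link document is skipped and emits nothing.
map({ kind: "comment" }, { id: "hypothetical", type: "json" });

console.log(JSON.stringify(rows));
// [{"key":["food",5],"value":{"hour":18,"score":5}}]
```

The sample link was created on a Thursday at 18:06 UTC, so with the week shifted to start on Saturday (0) it lands on day 5.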

A view with this map function can be queried with the parameters startkey and endkey to select all keys within a single subreddit, or only those keys within a subreddit that also fall within a range of days of the week, such as all posts within r/funny/ between Monday and Friday.

It is essentially a huge list of the links in the database, sorted by subreddit, and then by day of the week:

/* ... lots more stuff */
{"id":"zx4sc","key":["funny",0],"value":{"hour":9,"score":0}},
{"id":"zxak2","key":["funny",0],"value":{"hour":13,"score":1}},
{"id":"ytw3t","key":["funny",1],"value":{"hour":0,"score":938}},
{"id":"yv3uf","key":["funny",1],"value":{"hour":19,"score":2508}},
/* ... lots more stuff */

Notice that the emitted pairs are also associated with the ID of the document they came from, so it is not necessary to output this ourselves if we want to be able to look up the posts these rows refer to. In fact, the view can be queried with the parameter include_docs set to true and these rows will have the document included.
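As a rough sketch of what such a query could look like over HTTP, the helper below builds a view URL with JSON-encoded, URL-escaped parameters. The host, port, bucket name (reddit), design document name (links) and view name (by_subreddit_day) are all placeholders I made up for illustration, and viewUrl is a hypothetical helper rather than part of any client library.

```javascript
// Build a view query URL from a base URL and a map of parameters.
function viewUrl(base, params) {
  var parts = [];
  for (var name in params) {
    if (Object.prototype.hasOwnProperty.call(params, name)) {
      // Keys are JSON values, so they are JSON-encoded and then URL-escaped.
      parts.push(name + "=" + encodeURIComponent(JSON.stringify(params[name])));
    }
  }
  return base + "?" + parts.join("&");
}

// Placeholder endpoint: substitute your own host, bucket, design doc and view.
var base = "http://localhost:8092/reddit/_design/links/_view/by_subreddit_day";

// Posts in r/funny from Monday (2) through Friday (6); the week starts on Saturday (0).
var url = viewUrl(base, {
  startkey: ["funny", 2],
  endkey: ["funny", 6],
  reduce: false,      // ask for raw map rows once the view has a reduce function
  include_docs: true  // attach the source document to each row
});

console.log(url);
```

Building the keys as JavaScript arrays and serializing them in one place avoids hand-escaping the brackets and quotes in the query string.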

The reduce function

In addition to a map function, a view can also contain a reduce function, which is used to compute an aggregate value over ranges of the pairs emitted by map. Couchbase has some native reducers that are useful in many cases.[2] These are faster than JavaScript reduce functions and should be used if possible. For this particular problem, though, it’s useful to have the ability to write my own.

For my chart, I want to group the posts by hour and collect the total number of posts and the total post score within each hour, so I’ll output arrays of these values with an element for each hour.

A view’s reduce function is passed three parameters. The important one is values, which holds the values being aggregated. This can be either an array of the value portions of rows emitted by the map function, or an array of previous outputs of reduce.

If the values are from the map function, the third parameter, rereduce, will be false, and keys will be an array of the keys that correspond to the values in values. If rereduce is true, the values in values are the output of the view’s reduce, and keys will be null, as the values in values do not correspond to single keys.

function (keys, values, rereduce) {
  var out = {freqs: [], score: []};
  // Declare loop counters so they don't leak as globals.
  var i, v, h;
  //Prefill the arrays with zeroes.
  for(i = 0; i < 24; i++) {
    out.freqs[i] = 0;
    out.score[i] = 0;
  }
  for(v = 0; v < values.length; v++) {
    if(!rereduce) { //Values are the output of map
      out.freqs[values[v].hour] += 1;
      out.score[values[v].hour] += values[v].score;
    } else { //Values are the output of reduce
      // Combine the arrays
      for(h = 0; h < 24; h++) {
        out.freqs[h] += values[v].freqs[h];
        out.score[h] += values[v].score[h];
      }
    }
  }
  return out;
}
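The key property of a reduce function is that reducing everything at once must give the same answer as reducing in chunks and then rereducing the partial results. The sketch below checks that locally. It uses a named copy of the reduce function (with loop counters declared), a handful of values taken from the sample rows shown earlier, and passes null for keys, which this particular function never reads.

```javascript
function reduce(keys, values, rereduce) {
  var out = { freqs: [], score: [] };
  var i, v, h;
  // Prefill the arrays with zeroes.
  for (i = 0; i < 24; i++) {
    out.freqs[i] = 0;
    out.score[i] = 0;
  }
  for (v = 0; v < values.length; v++) {
    if (!rereduce) { // Values are the output of map
      out.freqs[values[v].hour] += 1;
      out.score[values[v].hour] += values[v].score;
    } else { // Values are the output of reduce
      for (h = 0; h < 24; h++) {
        out.freqs[h] += values[v].freqs[h];
        out.score[h] += values[v].score[h];
      }
    }
  }
  return out;
}

// Values as the map function would emit them.
var mapped = [
  { hour: 9, score: 0 }, { hour: 13, score: 1 },
  { hour: 0, score: 938 }, { hour: 19, score: 2508 }
];

// Reduce everything in one pass...
var all = reduce(null, mapped, false);

// ...or reduce two chunks separately and combine them with rereduce.
var left = reduce(null, mapped.slice(0, 2), false);
var right = reduce(null, mapped.slice(2), false);
var combined = reduce(null, [left, right], true);

console.log(JSON.stringify(all) === JSON.stringify(combined)); // true
console.log(all.freqs[19], all.score[19]); // 1 2508
```

How the server actually splits the rows into chunks is up to the server, so a reduce function should not depend on any particular chunking.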

Querying the view

By default, if a view has a reduce function set, it is enabled. A request for this view with no parameters passed will output the aggregated frequencies and scores for the whole dataset:

{"rows":[{"key":null,
          "value":{"freqs":[20753,19760,15821,15284,14627,13699,11012,8991,
                            7330,6327,6637,7711,10003,12705,15464, 17765,
                            19265,21043,21068,22372,18423,17951,20382,20404],
                   "score":[640304,620266,543505,507882,444247,362853,307157,
                            269177,249111,299142,336299,484781,701107,885255,
                            1006005,1095631,1020605,982352,849484,864482,
                            727186,689255,666884,692730],
                   "total":364797}}]}

The parameter reduce can be set to false in order to get the raw pairs output by the map function, as seen earlier.

The parameter group_level can be passed when the keys emitted by map are arrays.[3] Instead of evaluating the reduction over the entire requested range, the view then evaluates it over each sub-range in which the first group_level elements of the array key are the same.

The view can be queried for the reduce values of each subreddit with a group_level of 1:

{"rows":[
    /* ... lots more stuff ... */
    {"key":["Minecraft"],
     "value":{"freqs":[172,177,142,126,122,113,80,52,66,52,52,67,84,109,119,
                       142,160,148,175,171,198,206,182,186],
              "score":[12309,8582,12970,7416,5284,4295,1356,5896,5157,3457,
                       5222,4660,10685,11485,12745,20232,20032,11887,12746,
                       14606,12684,14662,7699,13316],
              "total":3101}},
    {"key":["minecraftsuggestions"],
     "value":{"freqs":[17,19,15,14,13,10,11,8,5,6,6,13,12,13,16,26,17,15,
                       26,21,13,16,16,18],
              "score":[20,20,15,14,13,12,13,8,6,8,5,19,12,14,17,32,18,16,
                       25,22,14,12,16,17],
              "total":346}},
    /* ... lots more stuff ... */
]}

A group_level of 2 would group by distinct subreddit and day of week.
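The grouping itself can be pictured with a small local simulation. This sketch is not how the server implements it: groupByLevel is a hypothetical helper, it assumes the rows are already sorted by key, and it reports a simple count per group instead of running the real reduce function, to keep the example short.

```javascript
// Group already-sorted rows by the first `level` elements of their array keys.
function groupByLevel(rows, level) {
  var groups = [];
  var current = null;
  rows.forEach(function (row) {
    var prefix = row.key.slice(0, level);
    var tag = JSON.stringify(prefix);
    // Start a new group whenever the key prefix changes.
    if (current === null || current.tag !== tag) {
      current = { tag: tag, key: prefix, values: [] };
      groups.push(current);
    }
    current.values.push(row.value);
  });
  // Stand-in for the reduce step: just count the values in each group.
  return groups.map(function (g) {
    return { key: g.key, count: g.values.length };
  });
}

var rows = [
  { key: ["food", 5], value: { hour: 18, score: 5 } },
  { key: ["funny", 0], value: { hour: 9, score: 0 } },
  { key: ["funny", 0], value: { hour: 13, score: 1 } },
  { key: ["funny", 1], value: { hour: 0, score: 938 } },
  { key: ["funny", 1], value: { hour: 19, score: 2508 } }
];

console.log(JSON.stringify(groupByLevel(rows, 1)));
// [{"key":["food"],"count":1},{"key":["funny"],"count":4}]
console.log(JSON.stringify(groupByLevel(rows, 2)));
// [{"key":["food",5],"count":1},{"key":["funny",0],"count":2},{"key":["funny",1],"count":2}]
```

Because the rows are sorted, each group is a contiguous run, which is why a single pass over the range is enough.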

To pull out one specific subreddit, or any other contiguous range, the parameters startkey and endkey can be used. Set startkey to ["funny"] and endkey to ["funny",{}] (a value that will sort after all of the rows for the “funny” reddit) to get:

{"rows":[
    {"key": null,
     "value": {"freqs": [5185, 5015, 4812, 4504, 4019, 3216, 2358, 1708,
                         1467, 1151, 1234, 1472, 1887, 2488, 3058, 3538,
                         4040, 4345, 4755, 5010, 4837, 5078, 5237, 5072],
               "score": [223614, 191437, 132562, 119852, 112021, 78050,
                         89736, 78804, 94981, 112777, 118679, 180749,
                         221582, 310680, 392977, 420129, 376967, 385918,
                         334743, 306935, 220212, 214143, 230025, 239153]}
    }]}