Forget Gmail filters, let Google sort your inbox using Machine Learning algorithms

Among the multitude of programming APIs provided by Google lies a jewel called Prediction API. It has a high-quality classifier that allows for continuous learning with model updates.

Let's quickly use it to automatically sort incoming mail into your existing labels. The most tedious part is configuration:

Configuration

  1. Create a new Blank Project in Google Apps Script and enable Prediction API in the Resources/Advanced Google services… menu.
  2. Sign up for the Google Developers Console and claim the $300 free credit. Then create your first Developers Console project and enable Prediction API in its API & auth section.
  3. Switch back to your newly created Google Apps Script and link it with your new Developers Console project through the Resources/Developers Console Project menu.

We are done configuring. Now, there are only two functions to implement: one to train the model and the other to classify incoming mail.

Train

The function GmailApp.getUserLabels() ❶ returns all labels that you defined in Gmail and disregards standard labels such as Inbox, All Mail or Spam. Gmail organizes messages into threads, so once you have a handle on a label, you have to get its threads ❷, then grab the individual messages in each thread. We'll use only the first message of each thread for this simple exercise ❸.

For each email, generate a line in the training set that consists of the Gmail label and various features extracted from the email. Again, in the simplest case, these features can be:

  1. Subject of the email
  2. From field
  3. To and Cc fields combined. ❹
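The three features above can be sketched as one small helper, so that training and classification build their rows the same way. This is only a sketch: buildCsvInstance is a hypothetical name that does not appear in the script below, and unlike the inline version it leaves no trailing space when Cc is empty.

```javascript
// Hypothetical helper (not part of the original script): turn the header
// fields of one message into the three-column feature row.
function buildCsvInstance(subject, from, to, cc) {
  // Join To and Cc with a space, skipping whichever one is empty or missing.
  var recipients = [to || '', cc || ''].filter(function (s) {
    return s !== '';
  }).join(' ');
  // Missing fields become empty strings so every row has exactly 3 columns.
  return [subject || '', from || '', recipients];
}
```

Inside Apps Script it would be called as buildCsvInstance(message.getSubject(), message.getFrom(), message.getTo(), message.getCc()).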

If a model has already been trained, call the update() function of the Prediction API for each new instance. ❺ Otherwise, create the model with the insert() function. ❻

function train() {
  var training_instances = [];
  var offset = null;
  var name = null;

  var glabels = GmailApp.getUserLabels(); // ❶
  for (var i = 0; i < glabels.length; ++i) {
    name = glabels[i].getName();
    offset = getOffset(name);
    var threads = glabels[i].getThreads(offset, limit); // ❷
    for (var j = 0; j < threads.length; ++j) {
      var message = threads[j].getMessages()[0]; // ❸
      var csv_instance = [message.getSubject(), message.getFrom(), message.getTo().concat(' ', message.getCc())]; // ❹
      training_instances.push({'output': name, 'csvInstance': csv_instance});
    }
    // Remember how far we got under this label so the next run resumes from there.
    PropertiesService.getUserProperties().setProperty(name, String(offset + threads.length));
  }

  var resource = {'id': id, 'trainingInstances': training_instances};
  try {
    if (Prediction.Trainedmodels.get(project, id).trainingStatus == 'DONE') {
      for (var k = 0; k < training_instances.length; ++k) {
        Prediction.Trainedmodels.update(training_instances[k], project, id); // ❺
        Utilities.sleep(100); // brief pause between update calls
      }
    }
  } catch (e) {
    if (e.message == "No Model found. Model must first be trained.") {
      Prediction.Trainedmodels.insert(resource, project); // ❻
    } else {
      Logger.log(e);
    }
  }
}
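The update-or-insert decision at the end of train() can be pulled out into a function that takes the API object as a parameter, which makes it easy to try with a fake before spending real quota. Treat this as a sketch under assumptions: trainOrUpdate and its return values are hypothetical names, and api is assumed to expose get, insert and update with the same argument order that train() uses on Prediction.Trainedmodels.

```javascript
// Error message that train() matches on when the model does not exist yet.
var NO_MODEL = 'No Model found. Model must first be trained.';

// `api` is a stand-in for Prediction.Trainedmodels (assumption: same
// get/insert/update argument order as in train()).
function trainOrUpdate(api, project, id, instances) {
  var status;
  try {
    status = api.get(project, id).trainingStatus;
  } catch (e) {
    if (e.message !== NO_MODEL) {
      throw e; // unknown failure: let the caller see it
    }
    // No model yet: create it from the whole batch.
    api.insert({id: id, trainingInstances: instances}, project);
    return 'inserted';
  }
  if (status !== 'DONE') {
    return 'skipped'; // model is not ready; nothing was sent
  }
  // Model exists and is ready: feed it one instance at a time.
  for (var i = 0; i < instances.length; ++i) {
    api.update(instances[i], project, id);
  }
  return 'updated';
}
```

In Apps Script the call would be trainOrUpdate(Prediction.Trainedmodels, project, id, training_instances). The 'skipped' result gives the caller a chance not to advance the saved offsets when nothing was actually sent.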

Classify

First off, let's find inbox messages that are not yet filed under any of your labels (the query also skips messages that Gmail marked as important). ❼ Then, call the predict() function of the Prediction API for each of the remaining messages. ❽ It returns a list of labels with associated scores, which we filter by a minimum score to keep only the best matches. ❾ We are now ready to file the message under the best matching labels. ❿

function classify() {
  var exclusion = GmailApp.getUserLabels().map(function(l){return '-label:"' + l.getName() + '"'}).join(' ');
  var offset = getOffset('Inbox');
  var threads = GmailApp.search('-is:important in:inbox ' + exclusion, offset, limit); // ❼
  for (var i = 0; i < threads.length; ++i) {
    var message = threads[i].getMessages()[0];
    var prediction = Prediction.Trainedmodels.predict({ // ❽
      input: {csvInstance: [message.getSubject(), message.getFrom(), message.getTo().concat(' ', message.getCc())]}
    }, project, id);
    // Keep only the labels whose score clears the cutoff.
    var labels = prediction.outputMulti.filter(function(list){
      return list.score > min_score;
    }).map(function(list){
      return {"label": list.label, "score": list.score};
    }); // ❾
    labels.forEach(function(l){
      GmailApp.getUserLabelByName(l.label).addToThread(threads[i]); // ❿
    });
  }
}
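The two pure steps of classify(), building the exclusion query and cutting the scores at a threshold, can be tried out without touching Gmail or the Prediction API. The sketch below assumes hypothetical helper names (buildExclusionQuery, pickLabels) and a made-up response; parseFloat is there so the comparison works whether scores arrive as numbers or as strings.

```javascript
// Build the "-label:..." part of the Gmail search from plain label names.
function buildExclusionQuery(labelNames) {
  return labelNames.map(function (name) {
    return '-label:"' + name + '"';
  }).join(' ');
}

// Return the names of all labels whose score is above minScore.
function pickLabels(outputMulti, minScore) {
  return (outputMulti || []).filter(function (entry) {
    return parseFloat(entry.score) > minScore;
  }).map(function (entry) {
    return entry.label;
  });
}

// Made-up sample in the shape that classify() reads from prediction.outputMulti.
var sampleOutput = [
  {label: 'Work', score: '0.95'},
  {label: 'Family', score: '0.04'},
  {label: 'Travel', score: 0.01}
];

var query = buildExclusionQuery(['Work', 'Family']); // '-label:"Work" -label:"Family"'
var chosen = pickLabels(sampleOutput, 0.9);          // ['Work']
```

With a cutoff of 0.9, at most one label can usually win; lowering min_score lets a message land under several labels at once.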

Limits

I saved the definition of the helper function getOffset() for last because it ties into execution limits. Google Apps Script shuts a script down after 6 minutes, so for any realistically sized mailbox the train() function has to run more than once, saving its intermediate state between runs. For the purpose of this demonstration I've chosen to store that state, one offset per label, in UserProperties. ⓫

function getOffset(name) {
  var value = PropertiesService.getUserProperties().getProperty(name);
  // Properties are stored as strings; a missing key means we start from 0.
  return value == null ? 0 : parseInt(value, 10); // ⓫
}
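To see how the saved offsets let train() resume across runs, here is a small sketch that swaps the real property store for an in-memory fake. Everything in it is hypothetical scaffolding (makeFakeProperties, getOffsetFrom, advanceOffset); the fake only mimics the getProperty and setProperty calls used in this post.

```javascript
// In-memory stand-in for PropertiesService.getUserProperties().
function makeFakeProperties() {
  var data = {};
  return {
    getProperty: function (key) {
      return Object.prototype.hasOwnProperty.call(data, key) ? data[key] : null;
    },
    setProperty: function (key, value) {
      data[key] = String(value); // values are kept as strings
    }
  };
}

// Same logic as getOffset(), but with the store passed in.
function getOffsetFrom(props, name) {
  var value = props.getProperty(name);
  return value === null ? 0 : parseInt(value, 10);
}

// Record that `processed` more threads were handled under this label.
function advanceOffset(props, name, processed) {
  var next = getOffsetFrom(props, name) + processed;
  props.setProperty(name, next);
  return next;
}

var props = makeFakeProperties();
var firstRun = advanceOffset(props, 'Work', 20);  // first run handled 20 threads
var secondRun = advanceOffset(props, 'Work', 15); // next run resumes at 20
```

Each label keeps its own counter, so a label with few threads simply stops advancing while larger ones keep going on later runs.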

Variables

Last but not least, let us define global variables used above:

var project = '00000000000'; // Google Developers Console project
var id = 'classifier'; // model id
var limit = 20; // limit search to 20 items
var min_score = 0.9; // cutoff for model score on 0..1 range

Now, run train a few (or better, a few dozen) times, then run classify, and watch your inbox being sorted automagically. Don't forget that you can configure regular execution of scripts via triggers in Resources/All your triggers.
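If you prefer code to menus, triggers can also be created from the script itself with ScriptApp.newTrigger(). The sketch below is one possible setup, not a recommendation: installTriggers is a hypothetical name, the hourly schedule is an arbitrary choice, and the optional app parameter exists only so the function can be exercised outside Apps Script with a fake.

```javascript
// Create hourly time-based triggers for train() and classify().
// Run once from the script editor with no arguments; `app` defaults
// to the built-in ScriptApp service.
function installTriggers(app) {
  app = app || ScriptApp;
  ['train', 'classify'].forEach(function (functionName) {
    app.newTrigger(functionName)
      .timeBased()
      .everyHours(1) // arbitrary interval chosen for this sketch
      .create();
  });
}
```

Running installTriggers() twice creates duplicate triggers, so check Resources/All your triggers afterwards.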