Among the multitude of programming APIs provided by Google lies a jewel called Prediction API. It has a high-quality classifier that allows for continuous learning with model updates.
Let's quickly use it to automatically sort incoming mail into your existing labels. The most tedious part is configuration:
Configuration
- Create a new Blank Project in Google Apps Script and enable Prediction API in the Resources/Advanced Google services… menu.
- Sign up for the Google Developers Console and take the $300 free credit. Then, create your first Developers Console project and enable Prediction API in its API & auth section.
- Switch back to your newly created Google Apps Script and link it with your new Developers Console project through the Resources/Developers Console Project menu.
We are done configuring. Now, there are only two functions to implement: one to train the model and the other to classify incoming mail.
Train
The function GmailApp.getUserLabels() ❶ gets all labels that you defined in Gmail and disregards standard labels such as Inbox, All Mail or Spam. Mails in Gmail are organized by threads, so once you get a handle on a label, you have to get all of its threads ❷, then grab the individual mails in each thread. We'll use the first email of a thread for this simple exercise ❸.
For each email, generate a line in the training set that consists of the Gmail label and various features extracted from the email. Again, in the simplest case, these features can be:
- Subject of the email
- From field
- To and Cc fields combined. ❹
If a model has already been trained, call the update() function on the Prediction API for each new training instance ❺. Otherwise, create the model with the insert() function ❻.
    function train() {
      var training_instances = [];
      var glabels = GmailApp.getUserLabels(); // ❶
      for (var i = 0; i < glabels.length; ++i) {
        var name = glabels[i].getName();
        var offset = getOffset(name);
        var threads = glabels[i].getThreads(offset, limit); // ❷
        for (var j = 0; j < threads.length; ++j) {
          var message = threads[j].getMessages()[0]; // ❸
          var csv_instance = [message.getSubject(), message.getFrom(),
                              message.getTo().concat(' ', message.getCc())]; // ❹
          training_instances.push({'output': name, 'csvInstance': csv_instance});
        }
        // Remember how far we got in this label so the next run continues from here.
        PropertiesService.getUserProperties().setProperty(name, String(offset + threads.length));
      }
      var resource = {'id': id, 'trainingInstances': training_instances};
      try {
        // get() throws if the model does not exist yet; see the catch block.
        if (Prediction.Trainedmodels.get(project, id).trainingStatus == 'DONE') {
          for (var k = 0; k < training_instances.length; ++k) {
            Prediction.Trainedmodels.update(training_instances[k], project, id); // ❺
            Utilities.sleep(100);
          }
        }
      } catch (e) {
        if (e.message == "No Model found. Model must first be trained.") {
          Prediction.Trainedmodels.insert(resource, project); // ❻
        } else {
          Logger.log(e);
        }
      }
    }
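Both train() and classify() have to send features in exactly the same order, so it may help to build the row in one place. The sketch below is one possible way to do that; the helper name toCsvInstance and the sample message are hypothetical, and the plain object only imitates the four GmailMessage getters used in this post.

```javascript
// Build the feature row for one message. `message` is expected to expose
// getSubject(), getFrom(), getTo() and getCc(), as a GmailMessage does.
function toCsvInstance(message) {
  return [
    message.getSubject(),
    message.getFrom(),
    message.getTo() + ' ' + message.getCc()
  ];
}

// Plain-object stand-in for a GmailMessage (hypothetical sample data).
var sample = {
  getSubject: function () { return 'Quarterly report'; },
  getFrom: function () { return 'alice@example.com'; },
  getTo: function () { return 'bob@example.com'; },
  getCc: function () { return 'carol@example.com'; }
};

// The same row feeds a training instance and a prediction request.
var trainingRow = {output: 'Work', csvInstance: toCsvInstance(sample)};
var predictionInput = {input: {csvInstance: toCsvInstance(sample)}};
```

With a helper like this, adding a feature later (say, the first lines of the body) would change training and prediction together.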
Classify
First off, let's find messages that are not yet filed under any of the labels ❼. Then, call the predict() function of the Prediction API for each of these messages ❽. It returns a set of labels with associated scores, which we filter to keep only the best matches ❾. We are now ready to file the message under the best matching labels ❿.
    function classify() {
      // Build a search clause that excludes every user-defined label.
      var exclusion = GmailApp.getUserLabels().map(function(l) {
        return '-label:"' + l.getName() + '"';
      }).join(' ');
      var offset = getOffset('Inbox');
      var threads = GmailApp.search('-is:important in:inbox ' + exclusion, offset, limit); // ❼
      for (var i = 0; i < threads.length; ++i) {
        var message = threads[i].getMessages()[0];
        var prediction = Prediction.Trainedmodels.predict({ // ❽
          input: {csvInstance: [message.getSubject(), message.getFrom(),
                                message.getTo().concat(' ', message.getCc())]}
        }, project, id);
        var labels = prediction.outputMulti.filter(function(list) {
          return list.score > min_score;
        }).map(function(list) {
          return {"label": list.label, "score": list.score};
        }); // ❾
        labels.forEach(function(l) {
          GmailApp.getUserLabelByName(l.label).addToThread(threads[i]); // ❿
        });
      }
    }
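The filtering step ❾ can be tried out without calling the API at all. Here is a minimal sketch, assuming outputMulti is an array of {label, score} objects and that a score may arrive either as a number or as a numeric string; the function name selectLabels and the sample data are made up for illustration.

```javascript
// Keep only labels whose score exceeds minScore, best match first.
// Number() normalizes scores that arrive as numeric strings.
function selectLabels(outputMulti, minScore) {
  return outputMulti
    .map(function (o) { return {label: o.label, score: Number(o.score)}; })
    .filter(function (o) { return o.score > minScore; })
    .sort(function (a, b) { return b.score - a.score; });
}

// Hypothetical prediction output for a single message.
var sampleOutput = [
  {label: 'Work', score: '0.95'},
  {label: 'Travel', score: '0.04'},
  {label: 'Receipts', score: 0.92}
];

// With a cutoff of 0.9 this keeps Work (0.95) and Receipts (0.92), in that order.
var chosen = selectLabels(sampleOutput, 0.9);
```

A high cutoff such as 0.9 leaves uncertain messages unlabeled in the inbox, which is usually a safer failure mode than filing them under the wrong label.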
Limits
I left aside the definition of the helper function getOffset() to discuss execution limits at the end. Google Apps Scripts are shut down after 6 minutes, so for any realistically sized mailbox you will have to run the train() function more than once, saving the intermediate state between runs. For the purpose of this demonstration I've chosen to store the intermediate state in UserProperties ⓫.
    function getOffset(name) {
      // Properties are stored as strings; a missing key means we start from 0.
      var value = PropertiesService.getUserProperties().getProperty(name);
      return value == null ? 0 : parseInt(value, 10); // ⓫
    }
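To see how the stored offset lets repeated runs walk through a label page by page, here is a small simulation. It is only a sketch: makeFakeProperties is a hypothetical in-memory stand-in for PropertiesService.getUserProperties(), and processPage and readOffset are illustrative names, not part of the script above.

```javascript
// In-memory stand-in for the user properties store (hypothetical, for
// illustration only). Values are kept as strings, like the real service.
function makeFakeProperties() {
  var store = {};
  return {
    getProperty: function (key) {
      return Object.prototype.hasOwnProperty.call(store, key) ? store[key] : null;
    },
    setProperty: function (key, value) { store[key] = String(value); }
  };
}

// Same logic as getOffset(), but with the property store passed in.
function readOffset(props, name) {
  var value = props.getProperty(name);
  return value == null ? 0 : parseInt(value, 10);
}

// Handle one page of at most `limit` items and persist the new offset.
// Returns the items handled in this run.
function processPage(props, name, items, limit) {
  var offset = readOffset(props, name);
  var page = items.slice(offset, offset + limit);
  props.setProperty(name, offset + page.length);
  return page;
}

var props = makeFakeProperties();
var threads = ['t1', 't2', 't3', 't4', 't5'];
var firstRun = processPage(props, 'Work', threads, 2);  // t1, t2
var secondRun = processPage(props, 'Work', threads, 2); // t3, t4
var thirdRun = processPage(props, 'Work', threads, 2);  // t5
```

Each run picks up where the previous one stopped, so no single execution has to cover the whole mailbox.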
Variables
Last but not least, let us define global variables used above:
    var project = '00000000000'; // Google Developers Console project
    var id = 'classifier';       // model id
    var limit = 20;              // limit search to 20 items
    var min_score = 0.9;         // cutoff for model score on 0..1 range
Now, run train a few (or better, a few dozen) times, then classify, and watch your inbox being sorted automagically. Don't forget that you can configure regular execution of scripts via triggers in Resources/All your triggers.
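Triggers can also be created from code rather than through the menu. The following is a sketch using the ScriptApp trigger builder; the function name installTriggers and the chosen intervals (hourly training, classification every 10 minutes) are assumptions for illustration, so pick values that suit your mailbox and quotas.

```javascript
// Install time-based triggers for train() and classify().
// Run this once by hand from the script editor.
function installTriggers() {
  // Remove existing triggers for these handlers to avoid duplicates.
  ScriptApp.getProjectTriggers().forEach(function (t) {
    var fn = t.getHandlerFunction();
    if (fn === 'train' || fn === 'classify') {
      ScriptApp.deleteTrigger(t);
    }
  });
  // Intervals below are assumptions, not recommendations from this post.
  ScriptApp.newTrigger('train').timeBased().everyHours(1).create();
  ScriptApp.newTrigger('classify').timeBased().everyMinutes(10).create();
}
```

Because each run of train() only processes a limited page per label, a recurring trigger is what eventually works through the whole mailbox.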