Forget Gmail filters, let Google sort your inbox using Machine Learning algorithms

Among the multitude of programming APIs provided by Google lies a jewel called the Prediction API. It offers a high-quality classifier that supports continuous learning through model updates.

Let's quickly use it to automatically sort incoming mail into your existing labels. The most tedious part is configuration:

Configuration

  1. Create a new Blank Project in Google Apps Script and enable Prediction API in the Resources/Advanced Google services… menu.
  2. Sign up for the Google Developers Console and take the $300 free credit. Then, create your first Developers Console project and enable Prediction API in its APIs & auth section.
  3. Switch back to your newly created Google Apps Script and link it with your new Developers Console project through the Resources/Developers Console Project menu.

We are done configuring. Now, there are only two functions to implement: one to train the model and the other to classify incoming mail.

Train

The function GmailApp.getUserLabels() ❶ returns all the labels that you defined in Gmail and disregards standard labels such as Inbox, All Mail or Spam. Mail in Gmail is organized by threads, so once you get a handle on a label, you have to get all of its threads ❷, then grab the individual messages in each thread. We'll use only the first message of each thread for this simple exercise ❸.
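
The three steps described above can be sketched roughly as follows. This is only an illustration and not the original training function: the name collectTrainingExamples and the shape of the returned objects are assumptions, and the Gmail service is passed in as a parameter purely so the function can be tried with a stub outside Apps Script. It stops at collecting (label, text) pairs and does not call the Prediction API itself.

```javascript
// Sketch: gather one (label, text) training example per labelled thread.
// `gmail` is expected to behave like the Apps Script GmailApp service;
// in a real script you would pass GmailApp itself.
// The function name and the result shape are hypothetical.
function collectTrainingExamples(gmail) {
  var examples = [];
  gmail.getUserLabels().forEach(function (label) {   // user-defined labels only
    label.getThreads().forEach(function (thread) {   // every thread under this label
      var first = thread.getMessages()[0];           // first message of the thread
      if (!first) {
        return;                                      // skip threads with no messages
      }
      examples.push({
        label: label.getName(),
        text: first.getSubject() + ' ' + first.getPlainBody()
      });
    });
  });
  return examples;
}
```

Inside Apps Script this would be called as collectTrainingExamples(GmailApp). For large mailboxes the paged form label.getThreads(start, max) is probably the safer choice.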

Quick notes on running MacOS X, NetBSD and Windows 7 together under KVM on Linux

As a reminder for myself, mostly:

  1. Only Mavericks works; Yosemite does not yet run under KVM.
  2. KVM SLiRP networking does not work with MacOS X, and bridging is hard to set up for wireless networks, so it is better to use NAT versions of qemu-ifup and qemu-ifdown.
  3. It looks so cool on screenshots

Data-mining users in a screenful of code

Objective

Select like-minded users from a local community website.

Pre-requisites

  1. A Drupal website with the votingapi module enabled and at least a few dozen votes by registered users.
  2. A working installation of the R language.

Extract data

For each user, select all other users who voted on the same nodes and comments:

SELECT v1.uid uid1, v2.uid uid2, u1.name name1, u2.name name2,
  v2.entity_id entity_id, v1.value value1, v2.value value2
FROM votingapi_vote v1
JOIN (votingapi_vote v2, users u1, users u2)
 ON (v1.uid != v2.uid AND v1.entity_id=v2.entity_id
   AND v1.entity_type=v2.entity_type AND v1.uid=u1.uid AND v2.uid=u2.uid)
WHERE v1.uid 

This produces a table

A subtle allusion to the f-word in Microsoft's EU coding week banners

Just stumbled upon a fancy banner by Microsoft that advertises its Embrace and Extend from the childhood program.

For the record: the only reason Microsoft supports this "Coding in classroom initiative" is that they want to push their products through kids. It's a problem, but a bigger problem is that Microsoft has long striven to make computing an elite profession by introducing inconsistencies and complexity into the most basic abstractions: a character, a file, a block device... Their products are designed to fail pupils who want to understand how computers work. And this design is intentional, because the fewer people understand computing, the less competition their business has... and the higher the profits.

Thus, taking money from Microsoft to promote coding in the classroom is akin to taking money from Philip Morris to promote a healthy lifestyle. Shameful.

All permutations of a string for the teacher's sake

Today my kid brought back from school an assignment to guess words from a bag of letters... it took a mum and a programmer to solve all six. I left one for you, though. Guess what C D E E I M R R stands for.

P.S. It's a classic programming interview question about all permutations of a string. Generate all permutations, grep -f them against /usr/share/dict/french and you'll get the answer.
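
A minimal sketch of the generation step, in case you want to try it. The function name uniquePermutations is made up for this illustration. Because the bag contains repeated letters, the sketch skips duplicate arrangements, so C D E E I M R R yields 8!/(2!·2!) = 10 080 candidates rather than 40 320.

```javascript
// Yield every distinct arrangement of the given letters exactly once.
// Sorting first puts equal letters next to each other, which lets us
// skip a letter when its identical left neighbour is not currently in use.
function* uniquePermutations(letters) {
  const pool = Array.from(letters).sort();
  const used = new Array(pool.length).fill(false);
  const current = [];

  function* walk() {
    if (current.length === pool.length) {
      yield current.join('');
      return;
    }
    for (let i = 0; i < pool.length; i++) {
      if (used[i]) continue;
      // Same letter as the previous slot, and that slot is free:
      // choosing this one would repeat an arrangement already produced.
      if (i > 0 && pool[i] === pool[i - 1] && !used[i - 1]) continue;
      used[i] = true;
      current.push(pool[i]);
      yield* walk();
      current.pop();
      used[i] = false;
    }
  }

  yield* walk();
}
```

Looping over uniquePermutations('cdeeimrr') and writing one candidate per line to a file gives the pattern list for grep -f. Lowercasing the letters first and adding -x to match whole lines only should keep the output down to real dictionary words.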

Reducing the size of the codebase by 20 or 30 times is possible, I've done it… twice.

Measuring software productivity by lines of code is like measuring progress on an airplane by how much it weighs. © Bill Gates

I once rewrote 30 000 lines of C++ code in 1000 lines of Ruby. Years passed, and shit hit the fan again. Today, I rewrote 424 lines of Java+Spring+Hibernate in 18 lines of bash. This is less glorious, but if you compare the size of the deliverable, it's 39 MB for the J2EE webapp against… 772 bytes for the shell script.

P.S. It is probably safe to say now, after 5+ years, that the C++ code was TopiEngine and my rewrite was tm4r. The latest version of TopiEngine on Launchpad has 67 279 lines of code. It has doubled in size since I rewrote it in Ruby. My tm4r now counts 1 227 lines of code.

P.P.S. Of course, these rewrites are not exact functional replicas. tm4r is an in-memory engine, while TopiEngine uses SQLite underneath, so their usage patterns may differ wildly. Same with the Java → bash rewrite. But for the task at hand, there was always a reason to rewrite, and the reason was directly related to code bloat, modifiability and maintainability.

P.P.P.S. Both TopiEngine and tm4r have little practical value. Topic Maps are dead.
