Data-mining users in a screenful of code

Objective

Select like-minded users from a local community website.

Pre-requisites

  1. A Drupal website with the votingapi module enabled and at least a few dozen votes by registered users.
  2. A working installation of the R language.

Exract data

For each user, select all other users that voted on same node and comments:

SELECT v1.uid uid1, v2.uid uid2, u1.name name1, u2.name name2,
  v2.entity_id entity_id, v1.value value1, v2.value value2
FROM votingapi_vote v1
JOIN (votingapi_vote v2, users u1, users u2)
 ON (v1.uid != v2.uid AND v1.entity_id=v2.entity_id
   AND v1.entity_type=v2.entity_type AND v1.uid=u1.uid AND v2.uid=u2.uid)
WHERE v1.uid < v2.uid AND v1.uid != 0 AND v2.uid != 0
ORDER BY v1.uid,v2.uid;

This produces a table

uid1    uid2    name1   name2   value1  value2
1       2       Administrator   Bob     100     100
1       2       Administrator   Bob     20      20
1       2       Administrator   Bob     40      40
1       2       Administrator   Bob     100     100
1       2       Administrator   Bob     20      100
1       2       Administrator   Bob     100     100
1       2       Administrator   Bob     100     100
1       2       Administrator   Bob     100     100
1       2       Administrator   Bob     100     100
1       2       Administrator   Bob     80      80
1       2       Administrator   Bob     100     20
1       2       Administrator   Bob     20      20
1       2       Administrator   Bob     60      60
1       2       Administrator   Bob     100     100
1       2       Administrator   Bob     100     100

with five columns:

  1. first user id
  2. second user id
  3. first user's name
  4. second user's name
  5. vote of the first user
  6. vote of the second user

The important parts in the SQL are

  1. the JOIN on the same table, which allows to generate all permutations of uid1 and uid2
  2. the WHERE clause on v1.uid < v2.uid which reduces permutations to combinations.

The uid of 0 is skipped, because it is the uid of the anonymous user. Every anonymous vote is attributed to it.

Calculate similarity

It can be done in PHP, but why bother? Here's a handy R script that takes the above table as in.tsv and produces, for each user, a file with the following columns:

  1. id of the other user
  2. username
  3. number of votes in common
  4. Pearson's correlation coefficient between votes
  5. a p-value that indicates how certain was the algorithm.
#!/usr/bin/env Rscript
d <- read.delim("in.tsv")

unique1 <- unique(c(d$uid1, d$uid2))

for (id1 in unique1) {
  if (file.exists(as.character(id1))) {
    file.remove(as.character(id1))
  }
  temp1 <- d[d$uid1==id1 | d$uid2==id1, ]
  unique2 <- unique(c(temp1$uid1, temp1$uid2))
  unique2 <- unique2[!unique2 == id1] # remove id1
  for (id2 in unique2) {
    if (id1 < id2) {
      result <- temp1[temp1$uid1==id1 & temp1$uid2==id2, ]
      name <- as.character(result$name2[1])
    } else {
      result <- temp1[temp1$uid1==id2 & temp1$uid2==id1, ]
      name <- as.character(result$name1[1])
    }
    n = nrow(result)

    if (n > 7) {
      x <- result$value1
      y <- result$value2
      pvalue <- cor.test(x,y)$p.value
      if (is.finite(pvalue) && pvalue < 0.05) {
        correlation <- cor(x,y)
        cat(id2, name, n, correlation, pvalue, "\n", sep = "\t", file = paste(id1, sep = ""), append = T)
      }
    }
  }
}

Notice the use of the cor(x,y) function that calculates the correlation and cor.test(x,y) that produces additional metrics for the correlation, including p-value. By convention, everything above p-value < 0.05 is considered uncertain, so we only print lines where p-value < 0.05. Jiggling with id1 and id2, and the if-else block are there to select pairs of users in any order.

Here's the output from the above data:

2       Bob     15      0.6039604       0.01710946

Display results

The rest is fairly obvious. I've chosen to display the data as a tag cloud on user profiles. With a hook on hook_menu,

/**
 * Hook into the user menu
 */
function mymodule_menu() {
  $items['user/%user/likeminded'] = array(
    'access callback' => TRUE,
    'access arguments' => array(1),
    'page callback' => 'mymodule_likeminded', // function defined below
    'page arguments' => array(1),
    'title' => 'Likeminded',
    'weight' => 5,
    'type' => MENU_LOCAL_TASK,
  );
  return $items;
}

I fetch the user's data file as generated by the R script above and display the data from it in a bag of words of varying sizes:

/**
 * Display likeminded users
 */
function mymodule_likeminded($arg){

  if (is_object($arg) && !$arg->uid) {
    return;
  }
  # this is my path to the results, your path may be different
  $path =  drupal_get_path('module', 'mymodule') . '/pearsons/' . $arg->uid; 
  $lines = array();
  $min = 0; $max = 0;

  if ($handle = @fopen($path, 'r')) {
    while($line = fgets($handle)) {
      $line = explode("\t", $line);
      if ($line[2] >= $max) { $max = $line[2]; }
      if ($line[2] <  $min) { $min = $line[2]; }
      $lines[] = $line;
    } 
  }
  $output = ''; 
  // Likeminded
  $output .= '

' .t('Likeminded') .'

' ; $output .= '
'; foreach($lines as &$line) { if ($line[3] > 0 ) { $size =mymodule_font_size($min, $max, $line[2]); $opacity = $line[3]; $output .= ""; $output .= l($line[1], 'user/' . $line[0]); $output .= ""; } } $output .= '
'; // Adversaries $output .= '

' .t('Adversaries') .'

' ; $output .= '
'; foreach($lines as &$line) { if ($line[3] < 0 ) { $size =mymodule_font_size($min, $max, $line[2]); $opacity = abs($line[3]); $output .= ""; $output .= l($line[1], 'user/' . $line[0]); $output .= ""; } } $output .= '
'; return $output; } /** * calculate the font size in proportion to the maximum and minimum of common votes */ function mymodule_font_size($min_count, $max_count, $cur_count, $min_font_size=11, $max_font_size=36) { if ($min_count == $max_count) # avoid DivideByZero exception { return $min_font_size; } return ( ($max_font_size - $min_font_size) / ($max_count - $min_count) * ($cur_count - $min_count) + $min_font_size); }

That's it.

This approach scales fairly well. It takes around one minute to extract the data and around 30 minutes to calculate similarity on a database of 100 000 users, 1 000 000 posts and 4 500 000 votes, all on the same server that runs the website.

The lead image shows a real user profile page with a selection of like-minded users and adversaries.

P.S. If there's enough interest, I will rewrite the above code as a Drupal module.

P.P.S. Want to datamine your own data and receive an understandable explanation afterwards? Drop me a line.