Objective
Select like-minded users from a local community website.
Prerequisites
- A Drupal website with the votingapi module enabled and at least a few dozen votes by registered users.
- A working installation of the R language.
Extract data
For each user, select all other users that voted on the same nodes and comments:
SELECT v1.uid uid1, v2.uid uid2, u1.name name1, u2.name name2,
       v2.entity_id entity_id, v1.value value1, v2.value value2
FROM votingapi_vote v1
JOIN (votingapi_vote v2, users u1, users u2)
  ON (v1.uid != v2.uid
      AND v1.entity_id = v2.entity_id
      AND v1.entity_type = v2.entity_type
      AND v1.uid = u1.uid
      AND v2.uid = u2.uid)
WHERE v1.uid < v2.uid AND v1.uid != 0 AND v2.uid != 0
ORDER BY v1.uid, v2.uid;
This produces a table:

uid1  uid2  name1          name2  value1  value2
1     2     Administrator  Bob    100     100
1     2     Administrator  Bob    20      20
1     2     Administrator  Bob    40      40
1     2     Administrator  Bob    100     100
1     2     Administrator  Bob    20      100
1     2     Administrator  Bob    100     100
1     2     Administrator  Bob    100     100
1     2     Administrator  Bob    100     100
1     2     Administrator  Bob    100     100
1     2     Administrator  Bob    80      80
1     2     Administrator  Bob    100     20
1     2     Administrator  Bob    20      20
1     2     Administrator  Bob    60      60
1     2     Administrator  Bob    100     100
1     2     Administrator  Bob    100     100

with six columns:
- first user id
- second user id
- first user's name
- second user's name
- vote of the first user
- vote of the second user
The important parts in the SQL are
- the self-join on votingapi_vote, which generates all permutations of uid1 and uid2
- the condition
v1.uid < v2.uid
in the WHERE clause, which reduces the permutations to combinations (see the short R illustration below).
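To see the difference, here's a toy R illustration with three hypothetical uids (nothing to do with the query itself):

uids <- c(1, 2, 3)
# a bare self-join produces every ordered pair (permutations)
perms <- subset(expand.grid(uid1 = uids, uid2 = uids), uid1 != uid2)
nrow(perms)  # 6 ordered pairs
# keeping only uid1 < uid2 leaves each pair exactly once (combinations)
combs <- subset(perms, uid1 < uid2)
nrow(combs)  # 3 unordered pairs: (1,2) (1,3) (2,3)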
The uid 0 is skipped because it belongs to the anonymous user: every anonymous vote is attributed to it.
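The similarity script in the next section expects this table in a tab-separated file named in.tsv. How you export it depends on your setup; as one sketch, assuming the DBI and RMySQL packages are installed and with placeholder credentials, you can pull it straight from R:

library(DBI)
# placeholder credentials -- substitute your site's database settings
con <- dbConnect(RMySQL::MySQL(), dbname = "drupal",
                 user = "drupal", password = "secret")
query <- "
  SELECT v1.uid uid1, v2.uid uid2, u1.name name1, u2.name name2,
         v2.entity_id entity_id, v1.value value1, v2.value value2
  FROM votingapi_vote v1
  JOIN (votingapi_vote v2, users u1, users u2)
    ON (v1.uid != v2.uid AND v1.entity_id = v2.entity_id
        AND v1.entity_type = v2.entity_type
        AND v1.uid = u1.uid AND v2.uid = u2.uid)
  WHERE v1.uid < v2.uid AND v1.uid != 0 AND v2.uid != 0
  ORDER BY v1.uid, v2.uid"
d <- dbGetQuery(con, query)
dbDisconnect(con)
# write.table with sep = "\t" produces exactly what read.delim expects
write.table(d, "in.tsv", sep = "\t", row.names = FALSE, quote = FALSE)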
Calculate similarity
It can be done in PHP, but why bother? Here's a handy R script that takes the above table as in.tsv
and produces, for each user, a file with the following columns:
- id of the other user
- username
- number of votes in common
- Pearson's correlation coefficient between votes
- a p-value that indicates how confident we can be in the correlation.
#!/usr/bin/env Rscript
d <- read.delim("in.tsv")
unique1 <- unique(c(d$uid1, d$uid2))
for (id1 in unique1) {
  # start each user's output file from scratch
  if (file.exists(as.character(id1))) {
    file.remove(as.character(id1))
  }
  temp1 <- d[d$uid1 == id1 | d$uid2 == id1, ]
  unique2 <- unique(c(temp1$uid1, temp1$uid2))
  unique2 <- unique2[!unique2 == id1]  # remove id1
  for (id2 in unique2) {
    if (id1 < id2) {
      result <- temp1[temp1$uid1 == id1 & temp1$uid2 == id2, ]
      name <- as.character(result$name2[1])
    } else {
      result <- temp1[temp1$uid1 == id2 & temp1$uid2 == id1, ]
      name <- as.character(result$name1[1])
    }
    n <- nrow(result)
    if (n > 7) {
      x <- result$value1
      y <- result$value2
      pvalue <- cor.test(x, y)$p.value
      if (is.finite(pvalue) && pvalue < 0.05) {
        correlation <- cor(x, y)
        cat(id2, name, n, correlation, pvalue, "\n",
            sep = "\t", file = as.character(id1), append = TRUE)
      }
    }
  }
}
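Save the script as, say, similarity.R (the name is arbitrary) in the same directory as in.tsv and run it with Rscript similarity.R; it writes one file per user, named after the uid.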
Notice the use of the cor(x,y) function, which calculates the correlation, and of cor.test(x,y), which produces additional metrics for the correlation, including the p-value. By convention, anything with a p-value above 0.05 is considered uncertain, so we only print lines where the p-value is below 0.05. The juggling with id1 and id2 in the if-else block is there to select pairs of users regardless of their order.
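For instance, a quick check in an R session with made-up vote values:

x <- c(100, 20, 40, 100, 20, 60, 100, 80)
y <- c(100, 20, 40, 100, 100, 60, 100, 80)
cor(x, y)               # Pearson's r, a single number between -1 and 1
cor.test(x, y)          # the same r plus a t statistic, confidence interval and p-value
cor.test(x, y)$p.value  # the piece the script checks against 0.05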
Here's the output from the above data:
2 Bob 15 0.6039604 0.01710946
Display results
The rest is fairly obvious. I've chosen to display the data as a tag cloud on user profiles. With an implementation of hook_menu, I add a "Likeminded" tab to the user page:
/**
 * Hook into the user menu
 */
function mymodule_menu() {
  $items['user/%user/likeminded'] = array(
    'access callback' => TRUE,
    'access arguments' => array(1),
    'page callback' => 'mymodule_likeminded', // function defined below
    'page arguments' => array(1),
    'title' => 'Likeminded',
    'weight' => 5,
    'type' => MENU_LOCAL_TASK,
  );
  return $items;
}
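After adding or changing a hook_menu implementation, clear Drupal's cache so the menu router is rebuilt and the new tab shows up.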
I fetch the user's data file generated by the R script above and display its contents as a bag of words of varying sizes:
/**
 * Display likeminded users
 */
function mymodule_likeminded($arg) {
  if (is_object($arg) && !$arg->uid) {
    return;
  }
  # this is my path to the results, your path may be different
  $path = drupal_get_path('module', 'mymodule') . '/pearsons/' . $arg->uid;
  $lines = array();
  $min = 0;
  $max = 0;
  if ($handle = @fopen($path, 'r')) {
    while ($line = fgets($handle)) {
      $line = explode("\t", $line);
      if ($line[2] >= $max) {
        $max = $line[2];
      }
      if ($line[2] < $min) {
        $min = $line[2];
      }
      $lines[] = $line;
    }
  }
  $output = '';
  // Likeminded
  $output .= '<h3>' . t('Likeminded') . '</h3>';
  $output .= '<div>';
  foreach ($lines as &$line) {
    if ($line[3] > 0) {
      $size = mymodule_font_size($min, $max, $line[2]);
      $opacity = $line[3];
      $output .= "<span style=\"font-size: {$size}px; opacity: {$opacity};\">";
      $output .= l($line[1], 'user/' . $line[0]);
      $output .= '</span> ';
    }
  }
  $output .= '</div>';
  // Adversaries
  $output .= '<h3>' . t('Adversaries') . '</h3>';
  $output .= '<div>';
  foreach ($lines as &$line) {
    if ($line[3] < 0) {
      $size = mymodule_font_size($min, $max, $line[2]);
      $opacity = abs($line[3]);
      $output .= "<span style=\"font-size: {$size}px; opacity: {$opacity};\">";
      $output .= l($line[1], 'user/' . $line[0]);
      $output .= '</span> ';
    }
  }
  $output .= '</div>';
  return $output;
}

/**
 * Calculate the font size in proportion to the maximum and minimum of common votes
 */
function mymodule_font_size($min_count, $max_count, $cur_count, $min_font_size = 11, $max_font_size = 36) {
  if ($min_count == $max_count) { # avoid DivideByZero exception
    return $min_font_size;
  }
  return ($max_font_size - $min_font_size) / ($max_count - $min_count)
    * ($cur_count - $min_count) + $min_font_size;
}
That's it.
This approach scales fairly well. It takes around one minute to extract the data and around 30 minutes to calculate similarity on a database of 100 000 users, 1 000 000 posts and 4 500 000 votes, all on the same server that runs the website.
The lead image shows a real user profile page with a selection of like-minded users and adversaries.
P.S. If there's enough interest, I will rewrite the above code as a Drupal module.
P.P.S. Want to datamine your own data and receive an understandable explanation afterwards? Drop me a line.