Saturday, November 14, 2009

Bayesian filtering of RSS feeds - can you automatically find interesting journal articles?

Introduction

In this long, rambling post (too bad the name Rambling Librarian is taken), I write about filtering RSS feeds (in particular, tables of contents from online journals) using three services: SuxOr, FeedZero and FeedScrub. I also ramble on about social filtering versus Bayesian filters, spam filtering versus filtering of RSS feeds, and some very brief initial thoughts.

Readers who already know the background should just skip to the sections on SuxOr, FeedZero and FeedScrub.


In Aggregating sources for academic research in a web 2.0 world, I wrote about keeping up with your research using RSS feeds from

"traditional databases (citation alerts, table of contents of favourite journals), library opac feeds of searches and new additions, book vendor sites (e.g Amazon) book sharing sites (e.g LibraryThing), social bookmarking sites both generic (e.g. Delicious) and research 2.0 sites (e.g. citeulike), Google alerts and more"


The main problem with this, of course, is that you quickly get overwhelmed with results. In many cases you can't create a custom RSS feed (e.g. many libraries provide RSS feeds of "new additions" only in broad subject areas like Economics), and even in instances where you can, say with an EBSCOhost database search delivered as RSS, even the most finely tuned search query can bring up quite a lot of irrelevant results.

Social filtering


The answer here, of course, is to do some sort of filtering. Currently, with all the focus on social media, the idea of social search or social filtering is all the rage, with Google and Bing competing to add features, many of which attempt to leverage the social web, not least of which is Google's Social Search.

The basic idea is simple: look at what other people in your field are sharing (through social bookmarks, blogs, tweets, likes, etc.).

Or as Chris Anderson (of The Long Tail fame) puts it:

" "Social filtering" is a great way to describe this process. Instead of going directly to the source, we are only going to content that our network suggests is going to be interesting or relevant to us."

Twitter is probably the most famous example, where the people you follow act as your social filter. Many claim that they don't even use their RSS reader anymore; they just look at what the people they follow on Twitter retweet or favourite (e.g. favestar.fm lists tweets that people favourite). Why manually read all your RSS feeds if your friends, who have similar tastes to you, dutifully share and highlight everything that is likely to interest you as well?

This idea of social filtering and social search is embedded in the "wisdom of the crowds" approach, from the old-school Digg/Slashdot model, which aggregated the votes of all users, to more individualized/collaborative filtering approaches like FriendFeed's "best of the day", which takes into account only the actions of your friends (or friends' friends, and so on), or the actions of users similar to yourself (e.g. Netflix). There are hundreds of examples of web 2.0 services using such approaches (basically every social network and recommendation system out there), so I won't bother to list them.

The problem with just social filtering or social search is this: it presumes that your tastes are similar to those of the average masses (in cases like Digg, where every vote is aggregated), or that there are people out there in the network or among your contacts who have similar tastes/interests. But that isn't always the case, particularly when interests are defined very narrowly.

Taking myself as an example, I'm broadly interested in "library stuff": essentially library 2.0, social media, bibliometrics, some aspects of linked data, etc. I look at retweets, faves, Delicious and so on, and at this broad level I do get interesting and useful recommendations, particularly from the sources everyone monitors (Mashable, popular library blogs like the Shifted Librarian, etc.).

But I'm also a PhD student (in theory anyway!) working on an obscure area involving the valuation of library services using a specific technique. There are far fewer people working in this area, and hence social filtering is not a reliable method. I can't really expect someone to read an interesting journal article on that specific topic and share it; possibly no such person exists, or even if they did, they might not be generous enough to share.

Note to self: check whether the few people who have written in the area I'm looking at have a strong online social presence.

Bayesian filtering


So what's the answer? If you can't expect a human to do the hard work for you, you can train a machine to do it instead! There are many machine learning techniques, but in this post I focus on Bayesian filtering, a technique well known for being effective against spam.

The idea of Bayesian filtering for spam is usually credited to Paul Graham's "A plan for spam". For those not into mathematics: essentially, Bayes' theorem lets you calculate the probability of a given event given that some other event has occurred.

Here's a simple layman's explanation.

First you need to train your spam filter on, say, 200 spam emails and 200 ordinary emails. You tell the filter, "this mail is spam", "this one isn't", "this one is", and so on.

The filter will eventually "learn" that when certain words appear in a message (e.g. "viagra") the mail is more likely to be spam, while other words, like "library", tend to appear in mail you want to read. (Some words will be neutral, because they either appear in both types of mail or appear in neither.)

Bayes' theorem lets you calculate the exact probability that a mail is "spam" or "ham" (good mail), and beyond a certain threshold of "spamminess" you can be pretty sure it's spam and classify it as such.
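
To make the mechanics concrete, here is a minimal sketch of the kind of word counting and probability calculation a naive Bayesian filter does. It's Python with made-up training sentences, meant as an illustration of the idea rather than how any particular filter is actually implemented.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; real filters do much more."""
    return text.lower().split()

class NaiveBayes:
    """A tiny two-class naive Bayes text classifier with Laplace smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def p_spam(self, text):
        """Return P(spam | text): Bayes' theorem plus the 'naive'
        assumption that words occur independently of each other."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # log prior: how common this class is in the training data
            log_scores[label] = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                # Laplace smoothing so unseen words don't zero everything out
                count = self.word_counts[label][word] + 1
                log_scores[label] += math.log(count / (total_words + len(vocab)))
        # turn the two log scores back into a probability for "spam"
        m = max(log_scores.values())
        odds = {k: math.exp(v - m) for k, v in log_scores.items()}
        return odds["spam"] / (odds["spam"] + odds["ham"])

filt = NaiveBayes()
filt.train("cheap viagra offer buy now", "spam")
filt.train("library journal article on cataloguing", "ham")
print(filt.p_spam("buy cheap viagra"))           # about 0.89, leans spam
print(filt.p_spam("new library journal issue"))  # about 0.20, leans ham
```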

The history of the use of Bayesian filters for spam filtering is a fascinating one, but I won't recount it here, except to say that they have been very effective despite attempts by spammers to beat them using various tricks.


I used filters such as POPFile and SpamBayes in the past before switching mostly to Gmail, and after a couple of weeks of training I achieved spam-catching rates of 99% and up, with an almost negligible false positive rate.



Bayesian filtering of RSS feeds


There's nothing inherently special about the categories "spam" and "ham": you can teach a standard Bayesian filter to classify text into any two or more categories. Many Bayesian spam filters allow only two categories, but some, like POPFile, let you classify into as many "buckets" as you want.

By now you know where I'm going with this: why not use Bayesian filtering to recognise "interesting" versus "not interesting" articles in RSS feeds?

In the examples that follow, I experiment with Bayesian filtering of RSS feeds from the tables of contents of electronic journals. While I could do this on any RSS feed, I feel it makes little sense to do Bayesian filtering on popular blog feeds, since such feeds will almost certainly be read and filtered (retweeted or otherwise shared) by humans, while the sharing of journal articles, particularly from less popular journals, is far less likely.

A wild idea here is to feed in an RSS feed that is already pre-filtered, adding another layer of filtering on top of purely social filters. E.g. try Bayesian filtering on a CiteULike RSS feed!

To make things simple, I used the free ticTOCs service (there are other similar services) to get 20 RSS feeds from various library and information science journals, then fed them into the Bayesian filters.
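
To give a sense of the raw material, here is a short Python sketch using the feedparser library to pull the titles and abstracts (where present) out of journal TOC feeds. The feed URLs below are placeholders; the real ones came from ticTOCs.

```python
import feedparser  # pip install feedparser

# Placeholder URLs standing in for the journal TOC feeds found via ticTOCs
toc_feeds = [
    "http://example.com/library-quarterly/rss",
    "http://example.com/journal-of-documentation/rss",
]

items = []
for url in toc_feeds:
    feed = feedparser.parse(url)
    for entry in feed.entries:
        items.append({
            "journal": feed.feed.get("title", url),
            "title": entry.get("title", ""),
            # TOC feeds usually put the abstract (if any) in the summary field
            "abstract": entry.get("summary", ""),
        })

for item in items:
    # This title + abstract text is what gets fed to the Bayesian filter
    print(item["journal"], "|", item["title"])
```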


How to do Bayesian filtering of RSS feeds

I'm aware of three services that do Bayesian filtering of RSS feeds. Two are commercial web services (FeedZero and FeedScrub) and one is an open source project (SuxOr).

Google Reader provides a new sort option, "magic", which "is personalized for you, and gets better with time as we learn what you like best — the more you "like" and "share" stuff, the better your magic sort will be". It's unclear if this uses Bayesian filtering or some similar technique.


I have also been experimenting with converting RSS feeds into NNTP or IMAP/POP3 and then using POPFile (an open source Bayesian filter designed mainly for classifying email into any number of arbitrary categories) to classify the items, but I doubt it's going to be a real solution.
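
For what it's worth, the glue I have been experimenting with looks roughly like the sketch below: each feed item becomes a message in a local mbox file that a mail client (and hence an email classifier sitting in front of it) could then see. Treat it as a rough illustration only; POPFile itself normally sits in the POP3 path rather than reading an mbox directly, and the feed URL and addresses here are made up.

```python
import feedparser  # pip install feedparser
import mailbox
from email.mime.text import MIMEText

FEED_URL = "http://example.com/journal-of-documentation/rss"  # placeholder
MBOX_PATH = "toc_items.mbox"

feed = feedparser.parse(FEED_URL)
mbox = mailbox.mbox(MBOX_PATH)
for entry in feed.entries:
    body = entry.get("summary", "")            # the abstract, if the feed has one
    msg = MIMEText(body, "plain", "utf-8")
    msg["Subject"] = entry.get("title", "(no title)")
    msg["From"] = "toc-gateway@example.com"    # made-up addresses
    msg["To"] = "me@example.com"
    mbox.add(msg)                              # one "email" per feed item
mbox.close()                                   # writes the mbox to disk
```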



SuxOr

I first became aware of SuxOr thanks to a tweet by the amazing Ostephens, who tweeted about the Bayesian Feed Filtering (BayesFF) project:

"The Bayesian Feed Filtering (BayesFF) project will be trying to identify those articles that are of interest to specific researchers from a set of RSS feeds of Journal Tables of Content by applying the same approach that is used to filter out junk emails.
We will develop and investigate the performance of a tool that will aggregate and filter a range of RSS and ATOM feeds selected by a user. The algorithm used for the filtering is similar to that used to identify spam in many email filters only in this case it will be “trained” to identify items that are interesting and should be highlighted, not those that should be junked." (Full proposal)


The research project, carried out at Heriot-Watt University, uses the open source SuxOr software. They are getting actual researchers to go through the process of using Bayesian filtering on journal tables of contents, and they will try to evaluate the effectiveness of the method.

The research project is currently in progress, but their blog already has a lot of interesting information.

It's an open source project, so you can download it and set up your own server. Alternatively, if you don't want to go through the hassle of doing this, you can sign up at http://icbl.macs.hw.ac.uk/sux0r206/home to play with it (do note that there are no guarantees for long-term use).

SuxOr is a very sophisticated system, but I was disappointed that there was no way to import a package of RSS feeds using OPML (a bundle of RSS feeds you can export from RSS readers like Google Reader), so you have to add each RSS feed manually.

Another problem is that adding RSS feeds that are not already listed isn't automatic; you need the administrator to approve them first (though I read they are working on changing this).

Fortunately for me, the administrator Lisa Roger was kind enough to quickly approve 20 or so library and information science journal tables of contents (e.g. Library Quarterly, Journal of Documentation, Journal of the American Society for Information Science and Technology), so I was able to set it up to look for articles of interest to me (basically the valuation of library services).

Once you have your RSS feeds set up, the next thing you need to do is set up your categories for filtering.

Click on your name and then "edit bayesian".

First you need to create one or more "vectors". This confused me for a while. I understood, of course, that you could create categories for the filter to classify items into. So you could create the categories "interesting" and "not interesting" (which is basically what the research project is doing), or even specific categories like "Library 2.0", "Cataloguing", "Information Literacy", etc.

But what is a "vector"?

As far as I could make out, a "vector" is a dimension along which you can classify items.
So in theory you could have a vector "relevance" with two categories, "relevant" and "not relevant", as well as another vector "subject" with categories "Library 2.0", "Cataloguing" and "Others".

Personally, I just stuck to having one vector and two categories, to give the filter the best chance of working.
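
In code terms, the mental model I ended up with is something like the snippet below. This is a conceptual sketch only, not SuxOr's actual data model: each vector is an independent dimension of classification, and an item gets at most one category per vector.

```python
# Conceptual sketch only - not SuxOr's real internals.
vectors = {
    "relevance": ["relevant", "not relevant"],
    "subject":   ["Library 2.0", "Cataloguing", "Others"],
}

# A single article can then be classified once along each vector.
article_classification = {
    "relevance": "relevant",
    "subject":   "Cataloguing",
}
```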


Normally you train on existing items from the feeds you import, classifying them one by one, but you can also copy and paste text from anywhere else to train the filter. So, for instance, you can copy and paste abstracts of relevant articles from your bibliographic database to quickly train the filter to watch out for those words.

There are other advanced features, like the ability to share training data, and possibly even to share your filters so that someone could run text through your filter to quickly see whether you would be interested in an article, but I didn't explore those.



The image below shows items in a feed being classified along the vector "interestingness" into two categories, "interesting" and "not interesting". I set the filter threshold to an arbitrary 77%, so it shows the articles in the feed for which the filter calculates a 77% or greater chance of being "interesting".
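
The thresholding itself is simple. Assuming a classifier that returns a probability of "interesting" for a piece of text (like the toy one sketched earlier, relabelled) and feed entries shaped like the dicts built in the feedparser sketch, it amounts to something like this; the 0.77 cut-off mirrors the arbitrary 77% I set in SuxOr.

```python
THRESHOLD = 0.77  # mirrors the arbitrary 77% threshold set in SuxOr

def filter_feed(entries, p_interesting, threshold=THRESHOLD):
    """Keep only entries whose estimated probability of being "interesting"
    meets the threshold. p_interesting is any function mapping text to a
    probability, e.g. a trained Bayesian classifier."""
    kept = []
    for entry in entries:
        text = entry.get("title", "") + " " + entry.get("abstract", "")
        if p_interesting(text) >= threshold:
            kept.append(entry)
    return kept
```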



Of course, the filter is not going to be perfect, particularly at the beginning when it has not been sufficiently trained, so you can change the verdict for each item by clicking on the pull-down menu and changing the category; the filter will then "learn" by adjusting the weights it gives to the text in the document.



FeedZero


FeedZero is a web-based service that allows you to enter feeds and then train the filter with a "thumbs up" or "thumbs down". This is a Bayesian filter, of course.



It has fewer options than SuxOr, in that you can only filter into two categories, but it offers more import options, in particular OPML import, which I think is a must-have, so you can import a bundle of RSS feeds from, say, Google Reader. A great time saver.
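
Since OPML is just XML, pulling the feed URLs out of a Google Reader export takes only a few lines of Python. This is a sketch assuming the usual layout, with feeds as outline elements carrying an xmlUrl attribute.

```python
import xml.etree.ElementTree as ET

def feeds_from_opml(path):
    """Return the feed URLs from an OPML file, e.g. a Google Reader
    'subscriptions.xml' export."""
    tree = ET.parse(path)
    # Feeds are <outline> elements with an xmlUrl attribute; they may be
    # nested inside folder outlines, so walk the whole tree.
    return [node.attrib["xmlUrl"]
            for node in tree.iter("outline")
            if "xmlUrl" in node.attrib]

# Example usage:
# for url in feeds_from_opml("subscriptions.xml"):
#     print(url)
```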

FeedScrub  

 

As I write this, FeedScrub is in closed beta testing and requires an invitation code to test. I managed to get one from here; I'm not sure if the code still works.

The free version is very limited: you can only add 5 RSS feeds. The paid version allows unlimited feeds and lets you import feeds using OPML.


Spam filtering vs RSS filtering - some thoughts


I'm aware of some discussions on how to train Bayesian filters, but they apply mostly to handling spam.
It's unclear whether this carries over to Bayesian filtering of other kinds of text (I'm sure there is work on pure text classification using Bayesian filters, but I haven't looked at it).


But the two basic methods are "train on error" and "train to exhaustion" (see here also).


The former involves

"scanning a corpus of known spam and non-spam messages; only those that are misclassified, or classed as unsure, get registered in the training database. It's been found that sampling just messages prone to misclassification is an effective way to train"

This is the method I suspect most users will use, since it's the most intuitive; the other method is just too much work.
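
In code, train on error for RSS items looks roughly like the sketch below. The classifier object and its classify/train methods are placeholder names of my own, not the API of any of the services above.

```python
def train_on_error(classifier, labelled_items):
    """Train on error: only the items the classifier currently misclassifies
    get registered in the training data. labelled_items is a list of
    (text, true_label) pairs; classifier.classify/.train are assumed hooks."""
    corrections = 0
    for text, true_label in labelled_items:
        if classifier.classify(text) != true_label:
            classifier.train(text, true_label)
            corrections += 1
    return corrections  # how many items needed correcting on this pass

# "Train to exhaustion" would simply repeat the pass until a whole pass
# produces zero corrections:
# while train_on_error(classifier, labelled_items) > 0:
#     pass
```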

Many spam filters are purely "naive Bayesian" filters, and I believe results with such filters should carry over directly to Bayesian filtering of RSS.

But spam has many special characteristics, and many spam filters have added features to handle them.

Firstly, spammers are trying to beat Bayesian filters using various techniques, from using images to adding random words to mail in an attempt to poison or fool the filter - the so-called "word salad" method (see The Spammers' Compendium for more tricks). For instance, POPFile uses pseudo-words to handle the special characteristics of spam.

Secondly, because in spam filtering a false positive (classifying a good mail as spam) is more serious than a false negative (classifying spam as okay), filters such as SpamBayes have an "unsure" category.

It's unclear whether filtering of RSS feeds (in particular those from tables of contents of online journals) is easier or harder than spam filtering. For sure, you don't have the problem of an adversary trying to beat the filter (unless you count authors using interesting but misleading terms in their abstracts to try to catch your attention!), so it seems it might be easier.

On the other hand, trying to judge whether a full article is interesting based on just the abstract is challenging; it's unclear if there is enough information.

In fact it is worse than that: if you look at the image above showing the filtering of tables of contents, you will notice that some RSS feeds of online journal tables of contents list just the title and author, with no abstract at all. In such cases it's unclear whether Bayesian filters will be effective. While many RSS feeds carry "full text" (e.g. blogs with full feeds), many do not (e.g. books from OPAC feeds)...

The other issue is that of "false positives". Assuming one filters RSS feeds into two categories, where one category plays the role of "good mail" (interesting items) and the other "spam mail" (not interesting items), should one try to minimize the false positive rate (that is, the chance of missing an interesting item because the filter wrongly classified it as not interesting)?

SuxOr allows you to select probability thresholds, but the question is what threshold of "interestingness", or rather "uninterestingness", you should use.

I'm also wondering how, in terms of evaluation, one would measure the effectiveness of the filter. This moves us into the realm of information retrieval.

As this report on spam filtering points out, you can calculate either

a) the recall and precision rates for not-interesting items, which is the standard in the information retrieval literature (and what all librarians were taught in library school)

or

b) miss rates and false positive rates, which are commonly used in reporting spam filter results.

I think the recall rate for not-interesting items plus the miss rate will always equal 1, and the precision rate for not-interesting items plus the false positive rate will also equal 1.
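
To check how the numbers relate, here is a small sketch computing them from a made-up confusion matrix, with "not interesting" playing the role of spam. One caveat: the false positive rate is computed here as the fraction of flagged items that were actually interesting; the spam-filtering literature sometimes measures it against all interesting items instead, so whether the second identity holds depends on which convention the report uses.

```python
# Made-up counts, with "not interesting" treated as the class being detected.
tp = 80  # not-interesting items correctly flagged as not interesting
fn = 20  # not-interesting items wrongly let through (misses)
fp = 5   # interesting items wrongly flagged as not interesting
tn = 95  # interesting items correctly let through

recall    = tp / (tp + fn)   # of all not-interesting items, how many were caught
miss_rate = fn / (tp + fn)   # of all not-interesting items, how many slipped through
precision = tp / (tp + fp)   # of the items flagged not-interesting, how many really were
# Fraction of flagged items that were actually interesting; under the
# alternative convention this would instead be fp / (fp + tn).
false_positive_rate = fp / (tp + fp)

print(recall + miss_rate)               # always 1.0
print(precision + false_positive_rate)  # 1.0 under this definition of FP rate
```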

I'm still mulling over this and trying to remember what I studied about information retrieval in library school, but I think it is critical to aim for a low false positive rate, or conversely a high precision for uninteresting items, to avoid missing interesting items. The converse problem, of seeing an item marked interesting that really isn't, is a lesser one, since I would rather err on the side of caution.

It's interesting to speculate how well Bayesian filtering of RSS feeds will do.

My limited experience so far is that it's a bit tricky classifying articles as "interesting" vs "not interesting", "relevant" versus "not relevant", and so on. For spam vs non-spam it is usually a no-brainer (barring the occasional semi-spammy mailing list), but given that relevance or interestingness is a multi-dimensional concept, I find myself debating whether a certain item is really relevant or interesting or not. An article might only be interesting if you haven't read it before, but I still classify it as interesting anyway because of the keywords in it.

Another issue is that with tables of contents you don't get the full text, so if the abstract looks interesting but the full text isn't, should it be classified as interesting or not? Maybe classifying by topic, e.g. "valuation of libraries" and "Library 2.0" plus "others", might be better?


Conclusion

It's still early days, and it will be interesting to see how successful Bayesian filtering turns out to be, but you can follow the discussion here on FriendFeed. According to MrGunn, FriendFeed (based on social filters) still beats Bayesian filtering for him, but he hasn't trained the Bayesian filters for long yet, and it's possible that social filters work well for him because he has strong social networks, thanks to the strong community of life scientists on FriendFeed.

For me, I'm almost certain that with enough training Bayesian filtering will be useful above and beyond the usual social filters, but that's because I have weaker social ties. Of course, if science 2.0 sites that act as social networks for researchers, like Mendeley, take off, it might become easier to find other like-minded researchers. Currently, though, this doesn't seem to be happening.

But it's not really a question of social filters versus Bayesian filters. As MrGunn notes above, "Bayesian filter helps you find best stuff you already know you like, whereas Social search helps you find stuff you didn't know you were looking for."

I would add that one could combine the two systems: in a two-category system, your "likes", "faves" or "thumbs up" generate data that can be used either for Bayesian filtering of articles or for fuelling your social graph (to find people who have "liked" similar articles). The two systems could work in parallel or be chained together.
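
As a sketch of what that might look like: a single "like" could both train a personal Bayesian filter and be pushed out to whatever social service you use, so the same click feeds both systems. The classifier and sharing hooks below are placeholders of my own, not any particular service's API.

```python
def like(item, classifier, share):
    """One "thumbs up" feeds two systems at once: it trains the personal
    Bayesian filter, and it is shared so the social graph can pick it up.
    classifier.train and share are placeholder hooks, not real APIs."""
    text = item.get("title", "") + " " + item.get("abstract", "")
    classifier.train(text, "interesting")   # Bayesian side
    share(item)                             # social side, e.g. post a bookmark

def parallel_filter(entries, p_interesting, shared_ids, threshold=0.77):
    """Parallel combination: keep items that either pass the Bayesian
    threshold or have already been shared by people in your network.
    Chaining would instead run the Bayesian filter only on the items
    your network surfaced (or vice versa)."""
    return [e for e in entries
            if e.get("id") in shared_ids
            or p_interesting(e.get("title", "") + " " + e.get("abstract", "")) >= threshold]
```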

In many ways this is similar to email filtering systems like SpamAssassin, which use a host of filtering techniques, from Bayesian filtering to DNS blacklisting, greylisting, checksum-based filtering and more.

I particularly like the following diagram from the CiteULike blog:

[Diagram from the CiteULike blog]

In this post, I'm basically doing the "more like this" type of recommendation, but there are other kinds of recommendations that could be done.

