Wednesday, July 18, 2012

How big are the indexes of web scale discovery services? How do they affect searching?

With the rise of web scale discovery services like Summon, Ebsco Discovery Service, WorldCat Local and Primo Central, librarians have begun to assess how to teach searching and information literacy differently.

As I blog this, there is an Information Literacy & Summon workshop going on at Sheffield Hallam University.

Librarians are talking about "going beyond Boolean" and invoking Dave's Law - users should not have to become mini-librarians in order to use the library - and one blogger talked about how a library was teaching boolean operators, phrase searching and truncation despite the library having "one search tool on their page, a discovery system in which almost none of these search strategies will work". (Off topic: while the discovery tools I am familiar with - primarily Summon and Ebsco Discovery Service - don't need boolean or phrase searching, they *can* work with it.)

In short, we are talking about a Google-like search where users can key in a few keywords and get relevant results.

In an earlier blog post, "How is Google different from traditional Library OPACs & databases?", I discussed some of the differences between Google and traditional library databases/OPACs. Some of the differences, like ranking by relevance and "implied AND", have filtered down to most library databases and OPACs and are no longer a distinguishing factor.

Other features, like autostemming (for databases) and full-text coverage, are slowly gaining ground in library systems.

As noted in a comment on that post, I forgot to mention the biggest and most obvious difference between Google and traditional library databases/OPACs - the huge size of the index!

A huge index, orders of magnitude larger than traditional databases, coupled with full-text searching means a more forgiving search. It also means that throwing away traditional stop words like "the" becomes a productive strategy.

Currently, the only library systems that come even close to fulfilling the two criteria of a huge index and full-text searching are the web scale discovery services. But how huge is huge?

I am most familiar with Summon, so I will discuss that from here on. 

What I did was to again use lib-web-cats to pull out ARL libraries using Summon. One or two may have been in testing, or perhaps the information was outdated and I was unable to find Summon, but I ended up with 21 university libraries.

The nice thing about Summon is that you can do a "blank search" to search everything, e.g. here's one for Duke University Library.

By doing such searches for all 21 libraries and looking at the facet counts, one can quickly figure out how much is in the index in total and for each category.

For example, in the blank search above, you can see the library has 389 million items, of which 272 million are newspaper articles and 83 million are journal articles, etc.
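As a sketch of the bookkeeping involved, here is how the facet counts copied down from such a blank search can be tallied in Python. The data structure is purely illustrative (this is not a call to any Summon API); the Duke figures are the ones quoted above:

```python
# Tally facet counts pulled manually from a Summon "blank search".
# Illustrative sketch only - the Duke figures are those quoted in the post.
facet_counts = {
    "Duke University Library": {
        "total": 389_000_000,
        "Newspaper Article": 272_000_000,
        "Journal Article": 83_000_000,
    },
}

for library, counts in facet_counts.items():
    # Everything not in the two biggest facets.
    other = counts["total"] - counts["Newspaper Article"] - counts["Journal Article"]
    print(f"{library}: {counts['total']:,} items total, "
          f"{other:,} outside the two largest facets")
```

Extending the dictionary with the other 20 libraries makes computing totals and averages trivial.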

So how much do the typical ARL libraries have in Summon?

As seen above, at the time I did this last week, Princeton University had the biggest index with 392 million items, the smallest was the University of Nevada, Las Vegas with 157 million, and the average was 300 million.

How big is 300 million? Given that even the largest university library probably has no more than 10 million unique titles, the typical ARL library's Summon index is at least 30 times the size of the largest catalogue. (On average, ARL libraries seem to have about 60 times as much in Summon as in their own catalogues.) Also consider the fact that a large amount (unknown how much) is full-text searchable.
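The back-of-envelope arithmetic behind those ratios, using only figures quoted in this post (the 10 million unique titles is my own rough upper bound):

```python
# Compare Summon index sizes against catalogue sizes, using figures from the post.
average_index = 300_000_000      # average ARL Summon index
largest_index = 392_000_000      # Princeton, the biggest observed
smallest_index = 157_000_000     # University of Nevada, Las Vegas
largest_catalogue = 10_000_000   # rough upper bound on unique titles anywhere

print(f"Average index vs largest catalogue: {average_index / largest_catalogue:.0f}x")
print(f"Largest index vs largest catalogue: {largest_index / largest_catalogue:.1f}x")
```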

A large proportion of most Summon libraries' indexes consists of newspaper articles (for obvious reasons); removing that, we see the ARL libraries start to look similar, averaging 100 million items.

How about if one were just interested in journal articles? The average seems to be 72 million. Compare this with Scopus, which claims 40+ million (Sciverse Hub, with ScienceDirect's 10 million + Scopus + third party data, is much larger and is, I believe, in the class of a web scale discovery service). Again, consider that this is almost double, with full-text search....

And finally, we can just look at items in peer reviewed publications - which amount mostly to peer reviewed articles.

Summon knows of 54 million peer reviewed articles; most libraries are roughly on par, at about 45 million.

See the Google doc for more raw data; perhaps one could analyse it by discipline in the future.

While Summon is huge, one suspects it is still smaller than Google - or than Google Scholar, if you think Google is an unfair comparison - but there is no way to tell except by estimating using search terms, which I won't do.

A few qualifiers about the data above. The first thing to note is that such statistics are outdated the moment they are obtained, due to the addition of new packages; and even if no new packages were added, newspapers and journals gain new items over time.

Neither are the figures a 100% representation of what each library has: some libraries might be more complete in terms of putting the packages they subscribe to into Summon, and even if this weren't a factor, what Summon shows is limited to what is indexed in Summon.

The last point probably explains why the libraries all bunch up in terms of numbers of peer reviewed articles. The biggest library possibly has a much bigger set, but due to the "limited" size of the set of articles indexed in Summon, the advantage cannot show as clearly.

Of course, arguably, if an item can't be found in the discovery tool in an environment where the discovery tool is the default, then the effective accessibility for most users would indeed be the amount indexed in Summon.

That said, in this enlightened day and age, we are supposed to move away from input measures and even output measures, and should try to measure outcomes if not impact, so it seems totally archaic to start benchmarking on the size of your discovery tool index; perhaps the individual numbers are not important.

Size is important though, as arguably the search experience changes quite a bit when you scale up from searching a few million items (either full text or just metadata) to hundreds of millions (full text).

(As an aside, I have heard of the 'new' growing field of digital humanities; if more and more full-text of journals and books is indexed in Summon, I wonder if one could use it to run all sorts of analyses?)

Different search strategies need to come into play, and perhaps this explains some librarians' resistance to web scale discovery tools: despite the fact that superficially they look like databases such as JSTOR, they don't quite behave like them...

I speculate, of course. For those of you who have some experience with such tools, how do they feel to you? Like a typical library database? Or do they feel like a different category altogether, where your usual instincts start to feel odd?

Actually, the answer is simple: they feel like Google! Typically you get a list of results that you are unlikely to exhaust reading, so you just look through the first few until you start to get a string of irrelevant results, then you stop.

That I feel is quite a different mind-set from searching a typical A&I database or even a smaller full-text database.

As we just launched Summon here, I don't have that much experience yet, but in future posts I will talk about what I mean.


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.