Thursday, April 25, 2013

How are discovery systems similar to Google? How are they different?

Like many academic libraries, we recently launched our discovery service, Summon. Having worked intensively on the project since the evaluation phase in 2011, followed by implementation in 2012, I had the opportunity to delve into the topic more deeply than many of my colleagues who were not on the team.

I would guess most librarians see Summon and its competitors as "Google, but for academic research", or as "Google Scholar-like". Users certainly see it that way, and so did I.

In a way this isn't a bad way to understand Summon. Similar to Google, Summon builds a centralised index that it queries whenever you search, so you get almost instant results. This is unlike older library federated search products, which pull in results in real time from multiple sources (the library equivalent of web metasearch services like Ixquick) rather than storing the data beforehand in a single index.
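The difference can be sketched with a toy example (all document names and text below are made up for illustration, not Summon's actual data structures): a pre-built inverted index answers an AND query with fast set intersections, with no live calls to remote sources at search time.

```python
# Toy inverted index: the pre-harvested, centralised approach.

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    """AND-search: intersect the posting sets of every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    result = postings[0].copy()
    for p in postings[1:]:
        result &= p
    return result

docs = {
    "a1": "library discovery services and relevancy ranking",
    "a2": "web scale discovery services compared",
    "a3": "cataloguing rules for rare books",
}
index = build_index(docs)
print(sorted(search(index, "discovery services")))  # → ['a1', 'a2']
```

A federated search, by contrast, would fire the query at each remote source and wait for all of them to respond, which is why it is so much slower.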

Similar to Google, Summon generally isn't restricted to searching just the metadata or the bibliographic record; it searches through the full text of most journal articles and many books, where these are available or provided.

Other similarities include the holy grail of "the one search box" that searches "everything" (or close enough), and a heavy focus on relevancy ranking to surface desired results.

As a sidenote, relevancy ranking isn't really new to library catalogues by now (for example, our "next generation catalogue" Encore has relevancy ranking, and so does the older web OPAC). But one thing often missed by librarians is that because Summon searches full text rather than just metadata/library records, Summon's relevancy ranking owes more to how typical web search engines work and is unpredictable to a large extent.

Even if you knew the exact formulas and weightings of each factor, you would still have to crunch the numbers. It doesn't work such that "no matter what, this journal must appear on top because it matched 245$a and....", and you certainly can't "explain" why one result appears on top but not another.

As stated in my post, How is Google different from traditional Library OPACs & databases?, Summon is probably as close to Google/Google Scholar as any library-associated search currently out there, including features like autostemming and search over full text, and Summon 2.0 will come even closer by adding automatic query expansion that will search synonyms.

Other upcoming features like the "topic explorer", which pulls in short entries from reference sources such as Britannica Online and Wikipedia, remind me of a very primitive form of Google's Knowledge Graph, at least visually (as far as I know Summon has no semantic search). For example, compare the following result from Google for "heart attack".

With the topic explorer in Summon 2.0

I would add that such "topic pages" are not unique to Summon; for example, EBSCO Discovery Service is adding topic pages.

Summon 2.0's content spotlighting, which "Groups newspaper content for easy identification" alongside "Local collection and image spotlighting", reminds me of how Google's "universal search" dynamically shows content from Google News and Images when appropriate.

Below is a Google search with news items being distinctly grouped and highlighted.

In short, both functionally and visually, Summon is getting very close to Google, with the main exception that it does not do a soft AND: it doesn't occasionally drop terms from the search.

A sidenote is that there are metadata fields in Summon that are never displayed to the user but are indexed and matched, so occasionally Summon might appear to do a soft AND and pull out results that do not seem to match all terms (taking stemming into account), but it's just an illusion.
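A hypothetical sketch of that illusion (the record, field names, and query are all made up): every field is indexed and matched, but only some are displayed, so a hit on a hidden field looks like the engine dropped a query term.

```python
# Hypothetical record: every field is indexed, but only some are displayed.
record = {
    "title": "Cardiac care in the emergency department",          # displayed
    "abstract": "Triage protocols for acute cases.",              # displayed
    "subject_keywords": "heart attack myocardial infarction",     # indexed, never shown
}
DISPLAYED = ("title", "abstract")

def matches(record, query):
    """Strict AND over ALL indexed fields, hidden ones included."""
    indexed_text = " ".join(record.values()).lower()
    return all(term in indexed_text for term in query.lower().split())

def visible_text(record):
    """Only the text the user actually sees on the results page."""
    return " ".join(record[f] for f in DISPLAYED).lower()

query = "heart attack triage"
print(matches(record, query))   # True: a genuine strict-AND match
print(all(t in visible_text(record) for t in query.split()))  # False: looks like a soft AND
```

The record matched every term, but since "heart attack" only occurs in the hidden keywords field, the user cannot see why, and concludes terms were dropped.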

As such, I think while most librarians know how Summon is similar to Google/Google Scholar, what is often not mentioned is how different Summon is from Google. These differences are often technical, but I suspect they drive a lot of unhappiness towards discovery services because the services can't meet "Google-level expectations".

I am not a technical expert, but I believe the main difference between Google/Google Scholar and Summon stems from the fact that

Google mostly obtains knowledge of webpages/articles by crawling such pages and harvesting them directly using spiders, Summon generally doesn't. 

See also Google's Inside Search

This difference has two effects:

1) Less stability in links
2) Less capability in relevancy ranking

Have you ever wondered why Google or Google Scholar seem to have a much lower broken links rate despite covering so much ground?

Essentially, Google works by sending out bots that visit different webpages and capture the information on those pages; from those pages, the bots crawl to other pages via the links they contain.
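That crawl loop can be sketched in a few lines. This is a minimal breadth-first crawler over an in-memory "web" (a made-up dict of page → outgoing links standing in for real HTTP fetches), not Google's actual crawler:

```python
from collections import deque

# A made-up "web": each page maps to the pages it links to.
WEB = {
    "home": ["about", "articles"],
    "about": ["home"],
    "articles": ["paper1", "paper2"],
    "paper1": [],
    "paper2": ["paper1"],
}

def crawl(start):
    """Visit every page reachable from start, following links as they are found."""
    seen = {start}
    frontier = deque([start])
    harvested = []
    while frontier:
        page = frontier.popleft()
        harvested.append(page)          # "capture the information" on this page
        for link in WEB.get(page, []):  # discover new pages through links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return harvested

print(crawl("home"))  # → ['home', 'about', 'articles', 'paper1', 'paper2']
```

The key point is that the crawler needs nothing but a starting URL and followable links; no publisher has to hand over a feed of records.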

Google Scholar is similar:

"Google Scholar uses automated software, known as "robots" or "crawlers", to fetch your files for inclusion in the search results. It operates similarly to regular Google search. Your website needs to be structured in a way that makes it possible to "crawl" it in this manner. In particular, automatic crawlers need to be able to discover and fetch the URLs of all your articles, as well as to periodically refresh their content from your website."

I recently led a workshop on using Google Scholar for bibliometrics, and despite my best efforts, the questions asked made me suspect many attendees just couldn't wrap their minds around how Google Scholar obtains entries for indexing, compared to how Scopus and Web of Science work. Google Scholar's inclusion guidelines pretty much set out what it will index.

Essentially, a PDF file that looks vaguely article-like (e.g. title in a big font, author on a line before it, a section titled references, etc.) and sits on an .edu domain will be considered scholarly and included in Google Scholar if the spider comes across it.

I believe Summon generally does not find information to index this way (I could be wrong).

This difference means that, in general, Summon relies mostly if not fully on the quality of information supplied by publishers and other providers (whether via FTP/USB/OAI-PMH) and does not really "know" if the information given is correct, as it has not actually "seen" the page or article in question on the site.

While Summon and its competitors try to obtain full text as well as metadata whenever possible, they rely heavily on the cooperation of the content owner. So often they may have just the metadata but not the full text, particularly for smaller, less technically capable content owners. Comparatively, Google Scholar, if its spiders are given permission, can pretty much grab everything, full text and all. My anecdotal testing shows this sometimes makes a big difference: for example, compare a search for "summon eds discovery" in Google Scholar versus in Summon, and you will notice more relevant results appearing in Google Scholar due to more full-text indexing, even though most of the articles shown in Google Scholar are also indexed (metadata/abstract only) in Summon.

This also means that, unlike with Google, linking in Summon is going to be less reliable. Let's leave aside the complication of journal articles residing in different locations and the need to use OpenURL resolvers, and assume all articles reside at a single publisher.

Google is generally sure that when it displays a link, the webpage exists; at least at the point in time the bot harvested the page, it was definitely there. And because Google directly checks whether pages exist, it can easily do link checks and fight link rot. It can even tell which domains tend to have more broken links and penalize such sites more.
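What a crawler gets "for free" here can be sketched as follows. This is a hypothetical example with made-up publisher domains and HTTP statuses, showing how fetch results could be rolled up into a per-domain broken-link rate that a ranking system might then penalize:

```python
from urllib.parse import urlparse

# Hypothetical crawl log: (url fetched, HTTP status the crawler saw).
fetch_log = [
    ("http://pubA.example/art1", 200),
    ("http://pubA.example/art2", 200),
    ("http://pubB.example/art1", 404),
    ("http://pubB.example/art2", 200),
    ("http://pubB.example/art3", 404),
]

def broken_rate_by_domain(log):
    """Fraction of fetches per domain that came back as errors (4xx/5xx)."""
    totals, broken = {}, {}
    for url, status in log:
        domain = urlparse(url).netloc
        totals[domain] = totals.get(domain, 0) + 1
        if status >= 400:
            broken[domain] = broken.get(domain, 0) + 1
    return {d: broken.get(d, 0) / totals[d] for d in totals}

print(broken_rate_by_domain(fetch_log))
# pubA.example scores 0.0; pubB.example around 0.67 -- a candidate for penalties
```

A system that never fetches the pages itself, like Summon, simply has no equivalent log to compute this from.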

Imagine if Summon had such data and could use it to automatically adjust openurl database ordering when there are multiple copies available.

I don't think Summon has a way of knowing which links are broken, though. Even though Summon has "Index-Enhanced Direct Linking", which uses information from the publisher for more reliable linking than OpenURL linking, it is still not directly checking whether the article exists. For instance, I notice many of these partnerships seem to use DOIs, and believe it or not, DOIs occasionally still do not resolve properly.

The other thing people like to moan about is the relevancy ranking. Why isn't Summon's as good? Don't get me wrong, Summon's is very good, but I doubt anyone would say it's better than Google's, and I would guess many if not most would say it isn't as good. I also have anecdotal evidence: so far, the dedicated Google Scholar users I know have not switched to Summon, though they acknowledge Summon is a very good effort, signalling that at the very least Summon isn't enough better to be worth switching to.

Google has a very sophisticated ranking system, of course: it can rank based on social signals, usage, click-tracking data, etc., which leads to fears of filter bubbles where you get totally different results depending on who you are, when you search, and where you are when searching.

In any case, I don't believe Summon currently uses any of this, though I would love to see Summon take click and usage data into account, whether at an institutional or global level, if it hasn't already, similar to how Summon generates "related search" suggestions.

Summon related searches

But the better relevancy also stems from the fact that, because Google directly crawls each page, it can study the linkage patterns between webpages, leading to the famous PageRank algorithm. As you know, each inbound link is a "vote of approval" from the source page that the destination page is important. While this may not be as dominant a factor as it used to be, with other "signals" in the mix, it's easy to believe it is still very useful for Google.

There's a beautiful explanation here.

"The Web is a complex network of interlinked documents and files. It's vast. It's open. Although much of its data is not very well-structured, it does at least share a common structure (HTML, XML) and a common infrastructure. You can write a program that crawls from document to document on the Web and automatically gleans lots of contextual information based on what links to what, the text in which the link is embedded, and lots of other contextual clues. The contextual data might not be 100% accurate, but it's incredibly rich."

Then it goes on to explain why library data is different.

"Library data, on the other hand, consists mostly of various separate pools of records/resources that, 1. have little (if any) contextual data, 2. are not linked together in any meaningful way (not universally and not with unambiguous, machine-readable links), 3. do not share a common structure, 4. do not share a common infrastructure, and 5. are generally not freely/openly available. So much of what Google has leveraged to make Web search work well is simply not part of library data. "

This describes Google, but I would guess it applies to Google Scholar as well, to a lesser degree.

For Summon, the closest equivalent we have is citation data from Web of Science/Scopus. I have no information on how this is used, but regardless, given that most articles are not cited even once (at least as seen in the citation indexes of Scopus or Web of Science), this citation web is a very poor substitute for the link analysis Google uses.
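To see why sparseness matters, here is a hypothetical ranking tweak (the scoring formula, weight, and articles are all invented for illustration, not how Summon actually works): boost a base text-relevance score by the logarithm of the citation count. For the bulk of articles, with zero cites, the boost does nothing at all.

```python
import math

# Made-up articles: two uncited ones with good text matches, one well-cited one.
articles = [
    {"id": "x", "text_score": 2.0, "cites": 0},
    {"id": "y", "text_score": 2.0, "cites": 0},
    {"id": "z", "text_score": 1.8, "cites": 120},
]

def score(article, weight=0.3):
    """Text relevance plus a log-scaled citation boost (log1p(0) == 0)."""
    return article["text_score"] + weight * math.log1p(article["cites"])

ranked = sorted(articles, key=score, reverse=True)
print([a["id"] for a in ranked])  # → ['z', 'x', 'y']
```

The citation signal separates z from the pack, but x and y remain indistinguishable; every page on the web has inbound-link context, whereas most articles have an empty citation record.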

I would add that it's well known Google Scholar generally shows more cites than Web of Science for the same article, due to the "looseness" of what is considered a cite, so the technique of weighting results by cites is far more effective for Google Scholar.

Can Summon further improve its relevancy ranking? Yes. For example, Google is famous for personalizing search results, using either the fact that you are logged into a Google account or a long-term non-expiring cookie, as well as hundreds of other cues, including social-media-related ones.

Google Scholar, as far as I can tell, isn't that personalised, based on doing the same search on different systems and IPs, but that's beside the point.

Could Summon do personalised results? In theory it could take into account logged-in users: what discipline they are in, what level of study, etc., similar to what Primo's ScholarRank claims to do.

But this would still lack the link analysis Google can do by studying the web as a graph of inter-related articles.

One wonders if adding data from citation managers like CiteULike and Mendeley could help improve relevancy ranking. And of course, if altmetrics takes off (in many ways this would be the "social signals" of scholarly works), Summon could exploit that as well.

Beyond that, I am not sure what the solution is for better relevancy. Perhaps moving towards a "linked resource discovery environment" (a concept I don't fully grasp) would help, but that would be a fundamental change compared to the shift towards web scale discovery services. As more and more content gets sucked into Summon and its competitors, this problem of relevancy ranking is not going to get better.


This post is just my educated guess at how Summon and Google work, and I might be totally wrong. If you have more knowledge and are aware of errors, please share what you know in the comments.

If you want to keep up to date with articles, blog posts, and videos on web scale discovery, do subscribe to my curated Flipboard magazine on web scale discovery.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.