This blog is the communication channel between the Odysci Team and you – our users! We will use this space to let you know what is happening and about the new features we develop. We value your opinion – send us your feedback via the comments below or email to feedback@odysci.com.

Academic Search – A Primer

Odysci (www.odysci.com) is a web portal for academic search and technical collaboration in the computer science, electrical engineering and related areas. Academic search is a specialized type of vertical search that returns published scholarly works, ranked according to various criteria.

Before the actual search and rank can be done, there are dozens of tasks that need to be accomplished, many of which involve significant algorithmic and computing challenges in the areas of machine learning, similarity and clustering, graph analysis, information retrieval and databases. This article briefly introduces the reader to these tasks and informally navigates through the steps of developing a service like odysci.com.

Main steps for an academic search site (by no means exhaustive):

  1. Gather metadata for articles
  2. Organize data in databases
  3. Handle data de-duplication
  4. Link analysis and ranking metrics
  5. Information retrieval, indexing and search
  6. Ranking of articles
  7. Client-server architecture
  8. End-user interface

1.  Gather metadata for articles

The first major step in developing an academic search site is gathering metadata for published articles. Metadata includes, for example, the title, publication venue (journal, conference, book), authors, affiliations, abstract, keywords and references. If the full text is available, it can also be used as indexable content. The metadata can, in some cases, be obtained directly from publishers, from other public repositories, or by crawling the web. Data from publishers and repositories tends to be reliable and well structured. Data obtained from crawling the web relies on the software being able to extract the metadata either from the pdf file itself or from the context of the html page that points to the pdf file. Due to the large number of documents being handled, efficient mechanisms for parallel processing and distributed storage are needed. Examples of such distributed storage systems are BigTable[1] and the Hadoop Distributed File System[2].

If an article is obtained as a pdf, it must be converted to structured metadata before it can be imported. This means converting the pdf to text and then extracting the relevant metadata from that text, i.e., parsing the text and recognizing the names of authors, their affiliations, the abstract, the body and the individual references. This is generally known as the text segmentation and labeling problem, and current techniques are based on conditional random fields[3] and Markov models[4].
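To make the labeling task concrete, here is a minimal sketch in Python that segments the header of text extracted from a pdf using hand-written heuristics. The function name and the rules are purely illustrative – production systems train sequence models (CRFs, MEMMs) rather than relying on patterns like these, which only work on clean input.

```python
import re

def segment_header(text: str) -> dict:
    """Heuristic sketch of header segmentation and labeling.

    Real extractors use trained sequence models (CRFs, MEMMs); these
    rules only illustrate the task itself on well-behaved input.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    record = {"title": lines[0] if lines else "", "authors": [], "abstract": ""}
    for i, line in enumerate(lines[1:], start=1):
        if line.lower().startswith("abstract"):
            # everything from here on is treated as the abstract
            record["abstract"] = " ".join(lines[i:])[len("abstract"):].strip(" .:-")
            break
        # crude author-line heuristic: comma-separated capitalized names
        name = r"[A-Z][\w.-]+(?: [A-Z][\w.-]+)+"
        if re.fullmatch(rf"{name}(?:, {name})*", line):
            record["authors"] = [a.strip() for a in line.split(",")]
    return record
```

Even this toy version shows why the problem is hard: a single unusual line break or formatting choice defeats fixed rules, which is exactly what statistical sequence models are trained to tolerate.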

2.  Organize data in databases

There are many ways of organizing and storing data in databases. The metadata components, e.g., title, publication venue, authors, etc., can be organized as a complex graph structure where a venue points to all its articles, an article points to its authors and references, authors point to their affiliations and so on. This whole graph can be referred to as the “Digital Library (DL) Object Graph”. Relational databases (e.g., MySQL) are typically used for storing such structures. However, as the amount of data grows, other approaches that lend themselves better to parallel processing and distributed storage have gained attention[5].
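The linked structure of the DL Object Graph can be sketched with a few in-memory record types. The class and field names below are illustrative, not Odysci's actual schema – in practice these objects live in database tables with foreign keys rather than direct references.

```python
from dataclasses import dataclass, field

@dataclass
class Author:
    name: str
    affiliation: str = ""

@dataclass
class Article:
    title: str
    authors: list = field(default_factory=list)     # list of Author
    references: list = field(default_factory=list)  # articles this one cites

@dataclass
class Venue:
    name: str
    articles: list = field(default_factory=list)    # list of Article

# A venue points to its articles; an article points to its
# authors and to the articles it references.
alice = Author("Alice Smith", "Example University")
older = Article("Foundations of Indexing")
paper = Article("Scalable Entity Resolution", [alice], [older])
dac = Venue("DAC", [paper])
```

Walking the references in one direction gives an article's bibliography; indexing them in the reverse direction gives its citations, which is the raw material for the ranking metrics discussed below.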

3.  Handle data de-duplication

Very often, the same metadata is received in different forms. Due to different sources and different extraction processes, the same article may be represented by two or more metadata files with non-matching fields. In order to avoid duplication in the database, the software must be able to detect whether or not two records represent the same article. This problem is generally known as Named Entity Resolution[6].

Entity resolution or disambiguation is particularly difficult because names can be misspelled or abbreviated, conference names can be written in different forms, and worst of all, references can be very cryptic with several terms abbreviated or omitted. A range of techniques is used to compare two metadata records, from text-based approaches such as Jaro-Winkler, Levenshtein and others[7], to vector space models[8], to neighborhood-analysis methods[9].
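Among the text-based approaches, edit distance is the easiest to illustrate. The sketch below implements the classic Levenshtein dynamic program and a normalized similarity test; the 0.8 threshold is an arbitrary illustrative choice, and real matchers combine several such signals rather than relying on one.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: the minimum number of insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Normalized similarity: 1.0 for identical strings."""
    m = max(len(a), len(b)) or 1
    return 1 - levenshtein(a, b) / m >= threshold
```

For example, “J. Kleinberg” and “Jon Kleinberg” differ by only two edits and pass the threshold, which is the kind of abbreviation mismatch entity resolution must tolerate.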

4.  Link analysis and ranking metrics

The most successful web ranking algorithms are based on analysis of the web graph. The two best known methods that extract the ranking of a web page by exploring the web graph are PageRank[10] and HITS[11]. Developed independently at around the same time, both approaches rely on the simple premise that a web page is relevant if it is pointed to by other relevant web pages. The HITS method adds the distinction between in-links (authority nodes) and out-links (hub nodes) as part of the ranking. Both approaches can be formalized as an eigenvector problem and solved iteratively (which is the most efficient way when considering billions of nodes).
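The iterative (power-iteration) solution of PageRank is compact enough to sketch directly. This is a minimal, single-machine version – the damping factor 0.85 is the commonly cited default, and the dangling-node handling shown is one standard choice among several.

```python
def pagerank(graph: dict, damping: float = 0.85, iters: int = 50) -> dict:
    """Iterative PageRank over an adjacency dict {node: [targets]}.

    Each node spreads damping * rank/out-degree to its targets;
    dangling nodes (no out-links) spread their rank uniformly.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for v, outs in graph.items():
            if outs:
                share = damping * rank[v] / len(outs)
                for w in outs:
                    new[w] += share
            else:  # dangling node
                for w in nodes:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank

# "c" is pointed to by both "a" and "b", so it ends up ranked highest.
g = {"a": ["c"], "b": ["c"], "c": ["a"]}
pr = pagerank(g)
```

Each iteration preserves the total rank mass of 1, and on graphs of this size the ranks converge to several decimal places well within 50 iterations; web-scale graphs use the same iteration distributed over many machines.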

Other methods have also been developed. The PopRank[12] algorithm also considers the links present in the web graph, but applies different propagation factors to links between objects of different types. The Hilltop algorithm[13] scores pages according to the number of links (and their relevance) from independent expert pages on the topic of the user query.

All these methods can be applied to the ranking of scientific publications, using the DL Object Graph formed by articles, citations, authors, conferences, journals, etc. Furthermore, the ranking of an article may be influenced by the relevance of its composing objects. For example, different ranking metrics can be computed for authors (e.g., H-Index[14]) and for conferences and journals (e.g., Impact-Factor[15]).
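The H-Index is simple enough to compute directly from an author's citation counts, following Hirsch's definition: an author has index h if h of their papers have at least h citations each.

```python
def h_index(citation_counts: list) -> int:
    """H-Index[14]: the largest h such that h papers have
    at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for i, c in enumerate(counts, 1):
        if c >= i:
            h = i
        else:
            break
    return h
```

An author with papers cited 10, 8, 5, 4 and 3 times has an H-Index of 4: four papers have at least four citations each, but not five papers with five.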

5.  Information retrieval, indexing and search

Search engines often rely on a specialized data structure called an “inverted index” for efficient text-based document search and retrieval. An inverted index[16] is similar to a hash table where the key is a word and the value is the list of ids of all documents that contain that word, known as its “posting list”.

Given two words in a query, the intersection of their posting lists returns the ids of the documents containing both words, whereas the union of the lists returns the ids of all documents that contain at least one of the words. Extensions of this data structure may include the positions of each word in a document, allowing phrase searches.
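A toy inverted index makes the intersection/union mechanics concrete. This sketch indexes whole lowercase words only; real indexes store compressed, sorted posting lists and the term transformations described below.

```python
from collections import defaultdict

def build_index(docs: dict) -> dict:
    """Inverted index sketch: word -> sorted posting list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {w: sorted(ids) for w, ids in index.items()}

docs = {1: "ranking web pages", 2: "ranking scholarly articles", 3: "web crawling"}
idx = build_index(docs)

def search_and(index, *words):
    """Docs containing every query word (posting-list intersection)."""
    postings = [set(index.get(w, [])) for w in words]
    return sorted(set.intersection(*postings)) if postings else []

def search_or(index, *words):
    """Docs containing any query word (posting-list union)."""
    return sorted(set().union(*(index.get(w, []) for w in words)))
```

Here `search_and(idx, "ranking", "web")` intersects the posting lists {1, 2} and {1, 3}, returning only document 1, while the union query returns all three documents.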

Typically, a searchable term (i.e., a key in the inverted index) can range from a single word that should be present in the article to portions of its metadata (e.g., part of a title, an author name). Ideally, the terms in an article should be indexed to match precisely the way a user will search for them. This is, of course, impossible to predict. However, one can create a very effective index by applying different transformations and expansions to the words in order to cover a multitude of different user search behaviors. Examples of these transformations are tokenization or grouping of words, expansion of synonyms, acronyms and abbreviations, stop-word removal and stemming[17].
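A few of these index-time transformations can be sketched in one small function. The stop-word list and the suffix-stripping rules below are deliberately crude stand-ins – real pipelines use curated stop lists, full stemmers such as Porter's, and synonym/acronym dictionaries.

```python
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "in", "on"}

def normalize_terms(text: str) -> list:
    """Index-time transformation sketch: lowercase, tokenize,
    drop stop words, and apply a toy suffix-stripping stemmer."""
    terms = []
    for token in text.lower().split():
        token = token.strip(".,;:()")
        if not token or token in STOP_WORDS:
            continue
        for suffix in ("ing", "ies", "es", "s"):  # crude stemming rules
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        terms.append(token)
    return terms
```

Applying the same normalization at query time means a search for “index” also retrieves articles mentioning “indexing”, at the cost of some precision (note how “articles” collapses to the non-word stem “articl” – stems need not be dictionary words).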

It is also common to create different indexes for different features of a document, such as, individual inverted indexes for author names and their papers, conference names and all their papers, words in article titles, etc. Using different indexes we can efficiently perform search on multiple fields of an article.

An important part of search is the interpretation of the user query. A search on “Bryant, bdd, DAC” could reasonably be expected to return Bryant’s papers on bdds published at DAC. However, this implies a specific interpretation of the query that assumes “Bryant” as an author, “bdd” as a topic, and “DAC” as a conference. In order for this to work, queries must be processed to meet expected user intent. Techniques for query expansion and rewriting[18] are interesting research challenges.
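One naive way to attempt such an interpretation is dictionary lookup against known author and venue names. The lookup sets below are hypothetical illustrative data, and real query understanding is far more involved (ambiguous terms, partial names, ranking of alternative interpretations).

```python
# Illustrative lookup sets, not real data.
KNOWN_AUTHORS = {"bryant"}
KNOWN_VENUES = {"dac", "icml"}

def interpret(query: str) -> dict:
    """Toy query interpretation: assign each comma-separated term
    to the author, venue, or topic field via dictionary lookup."""
    fields = {"author": [], "venue": [], "topic": []}
    for term in (t.strip().lower() for t in query.split(",")):
        if term in KNOWN_AUTHORS:
            fields["author"].append(term)
        elif term in KNOWN_VENUES:
            fields["venue"].append(term)
        elif term:
            fields["topic"].append(term)
    return fields
```

On the example above, “Bryant” maps to the author field, “DAC” to the venue field, and “bdd” falls through to a topic term – the interpretation a user would most likely intend.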

6.  Ranking of articles

When a user searches for articles using keywords, the system returns a ranked list of papers relevant to the keywords used. The ranking of these articles depends on two major criteria: (1) how well the contents of the article match the keywords used – called query-dependent metrics, and (2) the ranking of the articles based on link analysis metrics extracted from the DL Object Graph – called query-independent metrics.

The final ranking of an article will be a composition of its query-dependent and its query-independent metrics. How exactly these metrics are combined is usually proprietary information.
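As a purely hypothetical illustration of such a composition, the sketch below blends the two metrics linearly. The weight `alpha` and the linear form itself are assumptions for the example; as noted above, a real engine's combination function and weights are proprietary and typically far richer.

```python
def final_score(text_score: float, graph_score: float, alpha: float = 0.7) -> float:
    """Hypothetical linear blend of a query-dependent text-match score
    with a query-independent, graph-based score."""
    return alpha * text_score + (1 - alpha) * graph_score

# (paper id, text-match score, graph-based score)
results = [("p1", 0.9, 0.2), ("p2", 0.6, 0.8), ("p3", 0.4, 0.9)]
ranked = sorted(results, key=lambda r: final_score(r[1], r[2]), reverse=True)
```

With this weighting, the strong textual match “p1” still ranks first, but a higher graph score lets “p2” close most of the gap – tuning `alpha` shifts the balance between relevance to the query and overall article importance.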

Different methods have been developed to dynamically improve the ranking based on usage[19] as well as combine the topic of the context of the query with graph-based methods[20].

7.  Client-Server architecture

A search service may involve several layers of successive requests, which can be handled by one or more servers. In general terms, when the user makes a search, a request is sent from the browser to the web server (or front-end), which parses it to determine what type of services it needs to request from the application server (or back-end). Typically the web server handles the interaction with the client and passes the requests to the application server, which then accesses the index, performs the search and ranks the results. The results are passed back to the web server, which constructs the html page and sends it back to the client browser (i.e., the user).

The communication between the front-end and back-end servers can be done using different protocols and services, which vary in complexity, capability and computational load. Examples include SOAP, WSDL-described web services, and RESTful interfaces. A RESTful service is implemented over HTTP following the REST (Representational State Transfer) architectural style[21]; it is simple yet scalable, uses common Internet media types like XML and JSON, and relies on standard HTTP methods.
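A RESTful exchange between front-end and back-end can be sketched in a few lines. The endpoint path and JSON field names below are hypothetical stand-ins, not Odysci's actual API.

```python
import json
from urllib.parse import urlencode

def build_search_url(base: str, query: str, page: int = 1) -> str:
    """Front-end side: encode a search request as a plain HTTP GET URL."""
    return f"{base}/search?" + urlencode({"q": query, "page": page})

url = build_search_url("http://backend.example", "bryant bdd dac")

# Back-end side: a JSON response body the application server might return.
response_body = '{"results": [{"id": 42, "title": "Graph-Based Algorithms"}], "total": 1}'
payload = json.loads(response_body)
```

Because the request is an ordinary, self-describing HTTP GET and the response is plain JSON, either side can be scaled, cached, or replaced independently – the property that makes REST attractive for this front-end/back-end split.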

8.  End-user interface

The main principle behind the end-user interface for the Odysci site was easy accessibility to search results and related information. The goal is to allow the user to reach any paper, conference, journal or author with the minimum number of clicks. Part of that is, of course, being able to rank the results well. But on top of this comes the ability to navigate easily through the results. For example, from the listing of papers (resulting from a search) one can access six sets of data with a single click of the mouse, namely: (a) details about a paper, (b) the references of a paper, (c) the list of papers which cite a given paper, (d) the papers of any listed author, (e) the papers published in a journal or conference in a given listed year, and (f) all the papers published in a journal or conference over all the years of publication.

Odysci Academic Search Portal

Odysci’s main research focuses on developing novel algorithms for data de-duplication and ranking, since we consider the quality of our data and ranking important aspects for the user. The web site also offers facilities to foster interaction among researchers, such as the ability to comment on papers. Users can also follow papers and get alerts about them (such as when a paper you follow is cited by a new paper). An Online Editorial Board of renowned experts will provide regular comments about their favorite papers. Odysci currently covers computer science, electrical engineering and related areas. We currently list almost 2 million documents (or about 65 million nodes in the DL Object Graph), including data from ACM, IEEE Computer Society and several other public databases. New conferences and journals are added on a weekly basis.

We would like to invite the computing community to try out www.odysci.com for searching research papers.

References

[1] “Bigtable: A Distributed Storage System for Structured Data”, Fay Chang et al., ACM Transactions on Computer Systems, vol. 26, no. 2, 2008.
[2] “The Hadoop Distributed File System”, Konstantin Shvachko et al., 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010.
[3] “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data”, John D. Lafferty et al., 18th International Conference on Machine Learning (ICML 2001), 2001.
[4] “Maximum Entropy Markov Models for Information Extraction and Segmentation”, Andrew McCallum et al., 17th International Conference on Machine Learning (ICML 2000), 2000.
[5] “Cassandra: A Structured Storage System on a P2P Network”, Avinash Lakshman et al., presentation at SIGMOD 2008 (http://www.slideshare.net/jhammerb/data-presentations-cassandra-sigmod/).
[6] “Duplicate Record Detection: A Survey”, Ahmed .K. Elmagarmid et al., IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, 2007.
[7] “A Comparison of String Distance Metrics for Name-Matching Tasks”, William Cohen et al, IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), 2003.
[8] “Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer”, Gerard Salton, Addison-Wesley, 1989.
[9] “SimRank: A Measure of Structural-Context Similarity”, Glen Jeh and Jennifer Widom, 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
[10] “The PageRank Citation Ranking: Bringing Order to the Web”, Larry Page et al., Technical Report 1999-0120, CS Dept., Stanford University, 1999.
[11] “Authoritative Sources in a Hyperlinked Environment”, Jon M. Kleinberg, Journal of the ACM, Vol.46, No.5, September 1999.
[12] “Object-Level Ranking: Bringing Order to Web Objects”, Zaiqing Nie et al., World Wide Web Conference 2005 (WWW 2005).
[13] “When experts agree: using non-affiliated experts to rank popular topics”, Krishna Bharat and George A. Mihaila, ACM Trans. on Information Systems, vol.20, no.1, January 2002.
[14] “An index to quantify an individual’s scientific research output”, J. E. Hirsch, PNAS, vol. 102 no. 46, November 15, 2005.
[15] “The Thomson Reuters Impact Factor”, http://thomsonreuters.com/products_services/science/free/essays/impact_factor/.
[16] “Inverted files for text search engines”, Justin Zobel and Alistair Moffat, ACM Computing Surveys, vol.38, no.2, July 2006.
[17] “Text Operations”, Chapter 7, “Modern Information Retrieval”, Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 1999, Addison-Wesley.
[18] “Concept-based interactive query expansion”, Bruno M. Fonseca et al., 2005 ACM International Conference on Information and Knowledge Management (CIKM), 2005.
[19] “Learning to rank using gradient descent”, Christopher Burges et al., 22nd International Conference on Machine Learning (ICML 2005), 2005.
[20] “Topic-sensitive PageRank”, Taher H. Haveliwala, World Wide Web Conference 2002 (WWW 2002).
[21] “Principled design of the modern Web architecture”, Roy T. Fielding and Richard N. Taylor, ACM Transactions on Internet Technology, vol.2, no.2, May 2002.
