Monday, June 23, 2014

IR: Indexing and Searching


Databases that incorporate an index provide the quickest search results, as the index contains the most relevant terms of documents encompassed within the electronic database.  Just as we refer to an index in a book to direct us to the precise page(s) where a specific term (word) is used, an electronic index does the same thing – it makes searching efficient.

Indexing = fast search results =  efficiency
According to Baeza-Yates, major performance measures for information retrieval systems are:
§  Indexing time: The size of the text corresponds to time necessary to create index.
§  Indexing space: The size of the text correlates to space needed as index is being constructed.
§  Index storage: Ultimately, completed index will require much less space than actual text size.
§  Query latency: Difference in time of initiation of search query and output of results (answer).  This is most noticeable to the user/researcher.
§  Query throughput: Derived from query latency to reflect average queries per second.  The average Google search that I’ve performed takes approximately 1 – 3 seconds; obtaining search results (with hundreds of thousands or millions of hits) in that amount of time is considered to be quite efficient.
Entire indexing of IR systems is normally performed on a daily basis, while incremental indexing may be performed more frequently to provide fresh information to the user/researcher.  It would become very tedious to completely re-index each time new data/information is added to the database.  Implementation of the "basic inverted index” is most often used, as its structure encompasses “all different words (vocabulary)” within the text of the document.  Terminology is fundamental to the indexing process – especially the inverted index. 
“For each word in the vocabulary the index stores the documents which contain that word.  For this reason it is called an inverted index, as we can reconstruct the text using the index” (Baeza-Yates, p 340).
The inverted index is most popular with the English language, wherein suffix arrays are more utilized with languages that require phrasing (several sequential words), such as Asian and Middle-eastern languages and information retrieval of music (which has sequential notes).  As it takes more than one note to express a melody, various languages require more than one word to express a concept – this is the uniqueness of languages.
There are other indexing techniques, which are more complex (than the inverted index), but relevant to information retrieval systems.  However, due to its simplicity, compared to all other techniques, the inverted index is the oldest and most used in today’s advanced technological IR systems.  Algorithms have become more sophisticated, and pseudocode (computer program instructions) maintains the intricate steps of the varied techniques.
New and improved technologies involving multi-core CPU chips, RAM (memory) compressions, better storage devices (such as solid-state hard drives), etc. set the trends for generating better IR systems and research techniques.  As the Internet and WWW continue to expand and evolve, search scalability (relating to object representation – from millions to billions to trillions . . .) will continue to be a challenge.

With a stash of chocolate at my fingertips,
I’m up to the challenge!


References

Baeza-Yates, R., & Ribeiro-Neto. (2011). Modern Information Retrieval; the concepts and technology behind search; second edition. New York: Pearson Education Limited.