Databases that incorporate an index provide the quickest search results, as the index contains the most relevant terms of documents encompassed within the electronic database. Just as we refer to an index in a book to direct us to the precise page(s) where a specific term (word) is used, an electronic index does the same thing – it makes searching efficient.
Indexing = fast search results = efficiency
According
to Baeza-Yates, major performance measures for information retrieval systems
are:
§ Indexing
time: The size of the text corresponds to time necessary to create index.
§ Indexing
space: The size of the text correlates to space needed as index is being
constructed.
§ Index
storage: Ultimately, completed index will require much less space than actual
text size.
§ Query
latency: Difference in time of initiation of search query and output of results
(answer). This is most noticeable to the
user/researcher.
§ Query
throughput: Derived from query latency to reflect average queries per
second. The average Google search that I’ve
performed takes approximately 1 – 3 seconds; obtaining search results (with
hundreds of thousands or millions of hits) in that amount of time is considered
to be quite efficient.
Entire
indexing of IR systems is normally performed on a daily basis, while
incremental indexing may be performed more frequently to provide fresh
information to the user/researcher. It
would become very tedious to completely re-index each time new data/information
is added to the database. Implementation
of the "basic inverted index” is most often used, as its structure
encompasses “all different words (vocabulary)” within the text of the document. Terminology is fundamental to the indexing
process – especially the inverted index.
“For each word in the
vocabulary the index stores the documents which contain that word. For this reason it is called an inverted
index, as we can reconstruct the text using the index” (Baeza-Yates, p 340).
The
inverted index is most popular with the English language, wherein suffix arrays
are more utilized with languages that require phrasing (several sequential
words), such as Asian and Middle-eastern languages and information retrieval of
music (which has sequential notes). As
it takes more than one note to express a melody, various languages require more
than one word to express a concept – this is the uniqueness of languages.
There
are other indexing techniques, which are more complex (than the inverted
index), but relevant to information retrieval systems. However, due to its simplicity, compared to
all other techniques, the inverted index is the oldest and most used in today’s
advanced technological IR systems. Algorithms
have become more sophisticated, and pseudocode (computer program instructions)
maintains the intricate steps of the varied techniques.
New
and improved technologies involving multi-core CPU chips, RAM (memory)
compressions, better storage devices (such as solid-state hard drives), etc.
set the trends for generating better IR systems and research techniques. As the Internet and WWW continue to expand
and evolve, search scalability (relating to object representation – from millions
to billions to trillions . . .) will continue to be a challenge.
With
a stash of chocolate at my fingertips,
I’m
up to the challenge!
References
Baeza-Yates, R., & Ribeiro-Neto. (2011). Modern
Information Retrieval; the concepts and technology behind search; second
edition. New York: Pearson Education Limited.