Course:LIBR557/2020WT2/inverted index

From UBC Wiki

Inverted Index

An inverted index (or inverted file) is a data structure built on a document collection in order to facilitate efficient querying of that collection. The most rudimentary version of an inverted index includes a vocabulary (a list of all of the distinct words in a collection, excluding stop words removed during the tokenizing process) and a list of every document in which each distinct word can be found. Some inverted indexes include additional information such as the frequency at which a term occurs in a document or the position of a term within the document. The inclusion of these pieces of information makes more complex searching (such as proximity searches) possible, but also amplifies the workload of the indexer or web crawler which adds documents to the index. Of course the additional information also requires additional storage space, which can slow the speed of the search.

The index is a core component of most large scale search engines as it prevents the engine from having to scour the text of every document in a collection to find matches for query terms. Instead the search engine finds it’s query terms in the inverted index, and the index tells the search engine the specific documents that the term can be found in. Importantly, the index also tells the search engine which documents do not contain any of the query terms, allowing the search engine to disregard these documents, which improves search speed.

Compression

Compression is the process of reducing the amount of storage space the inverted index requires. In recent years the document collections inverted indexes are concerned with have grown exponentially (in the case of web search engines billions of pages translating to trillions of bytes of data), making compression a common practice. Compression is useful because reducing the size of an inverted index also reduces the size of the file search engines need to read, resulting in a shorter search time. There are many compression techniques, each having trade-offs between:

  • Compression ratio - the better the ratio the more storage space saved.
  • Coding time - faster coding times reduce initial processing costs
  • Decoding time - faster decoding means faster searching
  • Direct searching of compressed text without decoding it

Challenges

Adding to the index incrementally or starting over

One of the major challenges regarding compression of inverted indexes is the decision of how to deal with new material being added to the index. When a new document is added to an existing index, the contents of the document are converted to an inverted index, which is then merged with the original inverted index. Two issues arise here. First, word frequency values that are compressed in the original text may become inaccurate over time as more and more data is added to the index. Second, if a new word occurs in the document being added to the compressed index, that word may not have any compressed value related to it. There is a partial solution to this second problem. A dummy word can be included in the original index. When a new word is added in a merger, it piggybacks on the dummy word, giving it a compressed value. Unfortunately, over time this will have a negative impact on the compression ratio of the index. Recompressing the entire index solves both of these problems, but it is a resource intensive task and can’t be undertaken every time new material is added to the index. The question becomes, how long can an index make incremental changes before it needs to be fully recompressed?

Bibliography

Pibiri, G. E., & Venturini, R. (2021). Techniques for Inverted Index Compression. ACM Computing Surveys, 53(6), 1–36. https://doi.org/10.1145/3415148

Rasmussen, E. (2011). Access models. In I. Ruthven & D. Kelly (Eds.), Interactive Information Seeking, Behaviour and Retrieval (pp. 95-112). Facet. doi:10.29085/9781856049740.008

Ziviani, N., Silva de Moura, E., Navarro, G., & Baeza-Yates, R. (2000). Compression: a key for next-generation text retrieval systems. Computer, 33(11), 37–44. https://doi.org/10.1109/2.881693