Create your own awesome maps

Even on the go

with our free apps for iPhone, iPad and Android

Get Started

Already have an account?
Log In

NLPA An Overview of Information Retrieval by Mind Map: NLPA An Overview of
Information Retrieval
0.0 stars - reviews range from 0 to 5

NLPA An Overview of Information Retrieval

advanced topics

link analysis

matrix decompositions and LSA

statistical approaches

XML retrieval

web search basics

web crawling and indexes

what is "IR" in practice?

text vs knowledge

what we want

what we get

given the vector space model

term frequencies turn text documents into vectors

once in vector form, we can apply regular pattern recognition and neural network models

kinds of models

generative models, assume vectors are distributed according to some distribution

discriminative models, just treat vectors as inputs in some high dimensional space and learn a classifier

very high dimensional

IR problems give rise to very high dimensional and sparse classification problems

kernel methods may help (more features than objects)

also: sparse matrices etc.


how do machine learning, classifiers and information retrieval fit together?

can we learn the mapping: query->result set directly?

usually machine learning is used not for solving retrieval directly, but solving related problems, learn relevance of terms, learn a better scoring function, extract features from terms and determine relevance for the result set, perform clustering on the result set and show only representatives from each cluster (since the rest of the objects in the cluster are very similar), organize the database so that retrieval is faster, perform query expansion, find synonyms in queries, learn document zones etc., relevance feedback

applying machine learning, to apply machine learning, you need..., features, training data, the ability to generalize across samples (can't generalize from "cat" to "dog" without additional information)

learning a better scoring function, perform query as usual, using cosine scoring, normally, we just rank based on the cosine score, but maybe we can improve it, what additional information do we have?, document frequencies, term frequencies for query terms, term frequencies in document, semantic information about query terms, category / class of document (e.g., from clustering), link and popularity information of documents in general, training data consists of the above features and feedback from users whether a document was relevant, put this into a machine learning algorithm and learn a classifier, use the classifier to rescore results; use the probability-of-relevance as the score, instead of the original cosine score

commonly used

k-nearest neighbor classifier, usually, using normalized dot product

Rocchio classifier, for each class, compute the class mean, classify according to the closest mean, (how is this justified in pattern recognition?)

support vector machines

Naive Bayesian

logistic regression

neural networks


decision trees

getting at relevance

search for documents on the inauguration

terms: Obama, 2013, 2012, inaguration

maybe also: president, oath, etc.

generally: get list of matches, need to then check manually

would like to start with the "most likely relevant" documents first

general idea: give positive "scores" or "weight" to relevant terms

general idea

"Obama" - frequent, not very specific, but very relevant

"inauguration" - infrequent, very relevant, but might apply to other presidents


Obama=10, inauguration=10, oath=3, office=3, oath-of-office=10, president=2, Washington=1, etc.

score = 10 * matches("Obama") + 10 * matches("inaguration") + ...

vector space view

recall the matrix of documents vs terms

documents vs terms

the distribution of terms over documents is a vector

the total score for weighted matches is the dot product of a weight vector with the document vector

once we have justified these dot products in such a way, all of linear algebra becomes available

choosing weights

heuristic - TF-IDF, we choose scores that intuitively "make sense" and tune the process until things work well

generative, we make assumptions about how documents are generated, infer probabilities, and then use those probabilities to answer queries

discriminative, we have a training set of documents and retrieval results, train a linear classifier, and then use that to assign weights


term frequency, frequency of a term within a document, boolean - [ t in d ], logarithmic - log(1 + #occurrences-in-d), normalized - #occurrences-in-d / max #occurences of any term

inverse document frequency, what fraction of the documents contain the term, generally, log of (#documents / #documents containing term)

tf-idf weight is the product of the term frequency and the document frequency

terms receive a high weight if they are frequent in the document but infrequent among all documents

heuristic construction of scoring functions - c.f. fuzzy logic


ultimately, a model-based justification of these approaches has to happen in terms of statistics

statistical model for language generation: semantics, syntax

approximation in terms of n-grams

approximation in terms of vector space

tolerant retrieval

with "grep" or "glob", we can search for arbitrary wildcards like /a.*b/ etc.; these are processed sequentially

for large scale IR, we need to be more restrictive since we want faster performance

wildcards restricted to terms

apply the wildcard to the dictionary

then use the set of terms found that way for retrieval

trailing wildcard within a term

very important for handling grammatical endings

represent the dictionary as a trie

efficiently look for the prefix in the trie, enumerate the children

compute the union of the postings for the children

note that you can combine the tree data structure for the prefixes with the postings lists by moving common postings up the tree

leading wildcard within a term

use tree on reverse words

important in some languages

wildcard in the middle

use a "permuterm index"

that is, represent "word$" and all its rotations in the tree, "ord$w", "rd$wo"

prefixes of this string now turn into internal wildcard matches (provided they contain $)

k-gram indexes

consider substrings of size k and index those

"abc*def" must contain the 3-grams "abc" and "def"

general approaches

take both query and index terms

map both into a different space

perform search in that space


spelling correction: string->string or string->set-of-k-grams

phonetic: string->string (phonetic representation)

edit distance: variety of representations, or sequential search

"traditional" IR


simplest version: term-document matrix

"term"?, words are a linguistic entity, term = words + proper nouns + numbers + abbreviations + ..., term = "unit the user can search for"

term document incidence matrix, binary matrix (we could also include counts), rows are vectors giving term occurrences by document, columns are vectors giving terms contained in document, column is a bit vector representation of the set of words

boolean queries, "find all the documents that contain Caesar and Calpurnia", query language: caesar AND calpurnia, conceptually: bit-and of row vectors for terms "caesar" and "calpurnia", conceptually: intersection of document sets, also: caesar OR calpurnia

inverted index, millions of documents, tens of thousands of terms, the matrix is sparse, so most terms do not occur in most documents, inverted index, the list of terms / tokens is collectively referred to as the dictionary, the list of documents for each term are the postings, note that we represent documents as integers (often, the terms as well), if the list of documents is sorted, then intersection and union turn into efficient algorithms similar to mergesort, given two document lists of length N and M, what is the computational complexity of intersection, union?, optionally use skiplists to speed up intersection of postings

simple information retrieval system, INPUT: collection of documents, tokenize each document into a list of tokens, normalize each token using linguistic preprocessing (remove grammatical variation etc.), create an inverted index, parse user queries, translate user queries into efficient sequences of set operations on the inverted index, NB: both the index and the dictionaries can be compressed; prefix-compression, delta compression, etc.

proximity queries, "X near Y", simple approach: biword index, create an index of word pairs, more general data structure: positional indexes, for each term, store the document that contains the term, plus the positions within that document, NEAR queries now turn into a nested "merge" operation, for each term, iterate over the sorted document ids, collect positions by terms, then "merge" the term positions in a similar way, note that this means basically storing one integer per word in each document; the index is about as large as the original text collection

practical considerations

memory and disk space, does the index fit into memory (much easier construction), how does search deal with the different performance characteristics of memory and disks?, how does this relate to relational databases? can we build on top of relational databases? is that efficient?, what do they support that we don't need?, what do they lack that we need to build on top of them?, how can we compress the index?, how can we compress the dictionary?

distributed construction, document collections and indexes have gotten so large that they can't easily fit into a single machine anymore (c.f. Google), solution: distributed computations, common paradigm: MapReduce

structured texts

most texts aren't just collections of words; they have a logical structure, title, authors, abstract, text, references

instead of indexing the whole document, we treat these as "mini-documents" or "zones" within the document

retrieval is still at the whole document level

zone-based indexing is very common; Google supports it, title:, url:, etc.


use cases

CS student writing a paper on efficient sorting algorithms.

Home owner looking for a plumber to fix a leak.

Linguist looking to define the meaning of a word by finding it used in context.

Historian searching for documents about an event.

Lawyer trying to determine whether an opposing party knew or didn't know about an event, based on evidence.

Find the document I wrote last week on ...


What is the collection of documents?, dynamic or static, small or large

What is the collection of users?, experience level?, motivation?

What is the language?, How many grammatical variants does each word have?, How relevant are those grammatical variants?, Are words separated or easily separable?, English (nearly ideal for IR), each word is almost fixed, words are separated by spaces, so tokenization is easy, words are almost identical with morphemes, other languages, words may have hundreds of forms, many "words" may be combined together in complicated ways, words may not be separated from each other

Uses, users just needs a subset of useful candidates as starting points, high precision necessary, low recall tolerated, exhaustive collection of matching documents (e.g., in legal cases), high recall necessary, low precision tolerated

Resources, How much storage and computation is available?, "Personal file indexing" on PC has been limited in the past: users bothered by CPU and storage usage.


inverted indexes

efficient computations, out of memory computations


basic relationship between vector space models and pattern recognition