with our free apps for iPhone, iPad and Android

Get Started

Already have an account?

Log In

NLPA An Overview of
Information Retrieval
by Thomas Breuel
# NLPA An Overview of
Information Retrieval

## advanced topics

### link analysis

### matrix decompositions and LSA

### statistical approaches

### XML retrieval

### web search basics

### web crawling and indexes

## what is "IR" in practice?

### text vs knowledge

### what we want

### what we get

## given the vector space model

### term frequencies turn text
documents into vectors

### once in vector form, we can apply regular
pattern recognition and neural network
models

### kinds of models

### very high dimensional

### connection

### commonly used

## getting at relevance

### search for documents on
the inauguration

### general idea

### vector space view

### choosing weights

### TF-IDF

### justification?

## tolerant retrieval

### with "grep" or "glob", we can search for
arbitrary wildcards like /a.*b/ etc.; these are
processed sequentially

### for large scale IR, we need to be more
restrictive since we want faster
performance

### wildcards restricted to terms

### trailing wildcard within a term

### leading wildcard within a term

### wildcard in the middle

### k-gram indexes

### general approaches

### specifics

## "traditional" IR

### indexes

### practical considerations

### structured texts

## background

### use cases

### properties

## today

### inverted indexes

### efficient computations, out of
memory computations

### TF-IDF

### basic relationship between vector
space models and pattern
recognition

0.0 stars - reviews
range from 0 to 5

generative models, assume vectors are distributed according to some distribution

discriminative models, just treat vectors as inputs in some high dimensional space and learn a classifier

IR problems give rise to very high dimensional and sparse classification problems

kernel methods may help (more features than objects)

also: sparse matrices etc.

how do machine learning, classifiers and information retrieval fit together?

can we learn the mapping: query->result set directly?

usually machine learning is used not for solving retrieval directly, but solving related problems, learn relevance of terms, learn a better scoring function, extract features from terms and determine relevance for the result set, perform clustering on the result set and show only representatives from each cluster (since the rest of the objects in the cluster are very similar), organize the database so that retrieval is faster, perform query expansion, find synonyms in queries, learn document zones etc., relevance feedback

applying machine learning, to apply machine learning, you need..., features, training data, the ability to generalize across samples (can't generalize from "cat" to "dog" without additional information)

learning a better scoring function, perform query as usual, using cosine scoring, normally, we just rank based on the cosine score, but maybe we can improve it, what additional information do we have?, document frequencies, term frequencies for query terms, term frequencies in document, semantic information about query terms, category / class of document (e.g., from clustering), link and popularity information of documents in general, training data consists of the above features and feedback from users whether a document was relevant, put this into a machine learning algorithm and learn a classifier, use the classifier to rescore results; use the probability-of-relevance as the score, instead of the original cosine score

k-nearest neighbor classifier, usually, using normalized dot product

Rocchio classifier, for each class, compute the class mean, classify according to the closest mean, (how is this justified in pattern recognition?)

support vector machines

Naive Bayesian

logistic regression

neural networks

k-means

decision trees

terms: Obama, 2013, 2012, inaguration

maybe also: president, oath, etc.

generally: get list of matches, need to then check manually

would like to start with the "most likely relevant" documents first

general idea: give positive "scores" or "weight" to relevant terms

"Obama" - frequent, not very specific, but very relevant

"inauguration" - infrequent, very relevant, but might apply to other presidents

etc.

Obama=10, inauguration=10, oath=3, office=3, oath-of-office=10, president=2, Washington=1, etc.

score = 10 * matches("Obama") + 10 * matches("inaguration") + ...

recall the matrix of documents vs terms

documents vs terms

the distribution of terms over documents is a vector

the total score for weighted matches is the dot product of a weight vector with the document vector

once we have justified these dot products in such a way, all of linear algebra becomes available

heuristic - TF-IDF, we choose scores that intuitively "make sense" and tune the process until things work well

generative, we make assumptions about how documents are generated, infer probabilities, and then use those probabilities to answer queries

discriminative, we have a training set of documents and retrieval results, train a linear classifier, and then use that to assign weights

term frequency, frequency of a term within a document, boolean - [ t in d ], logarithmic - log(1 + #occurrences-in-d), normalized - #occurrences-in-d / max #occurences of any term

inverse document frequency, what fraction of the documents contain the term, generally, log of (#documents / #documents containing term)

tf-idf weight is the product of the term frequency and the document frequency

terms receive a high weight if they are frequent in the document but infrequent among all documents

heuristic construction of scoring functions - c.f. fuzzy logic

ultimately, a model-based justification of these approaches has to happen in terms of statistics

statistical model for language generation: semantics, syntax

approximation in terms of n-grams

approximation in terms of vector space

apply the wildcard to the dictionary

then use the set of terms found that way for retrieval

very important for handling grammatical endings

represent the dictionary as a trie

efficiently look for the prefix in the trie, enumerate the children

compute the union of the postings for the children

note that you can combine the tree data structure for the prefixes with the postings lists by moving common postings up the tree

use tree on reverse words

important in some languages

use a "permuterm index"

that is, represent "word$" and all its rotations in the tree, "ord$w", "rd$wo"

prefixes of this string now turn into internal wildcard matches (provided they contain $)

consider substrings of size k and index those

"abc*def" must contain the 3-grams "abc" and "def"

take both query and index terms

map both into a different space

perform search in that space

spelling correction: string->string or string->set-of-k-grams

phonetic: string->string (phonetic representation)

edit distance: variety of representations, or sequential search

simplest version: term-document matrix

"term"?, words are a linguistic entity, term = words + proper nouns + numbers + abbreviations + ..., term = "unit the user can search for"

term document incidence matrix, binary matrix (we could also include counts), rows are vectors giving term occurrences by document, columns are vectors giving terms contained in document, column is a bit vector representation of the set of words

boolean queries, "find all the documents that contain Caesar and Calpurnia", query language: caesar AND calpurnia, conceptually: bit-and of row vectors for terms "caesar" and "calpurnia", conceptually: intersection of document sets, also: caesar OR calpurnia

inverted index, millions of documents, tens of thousands of terms, the matrix is sparse, so most terms do not occur in most documents, inverted index, the list of terms / tokens is collectively referred to as the dictionary, the list of documents for each term are the postings, note that we represent documents as integers (often, the terms as well), if the list of documents is sorted, then intersection and union turn into efficient algorithms similar to mergesort, given two document lists of length N and M, what is the computational complexity of intersection, union?, optionally use skiplists to speed up intersection of postings

simple information retrieval system, INPUT: collection of documents, tokenize each document into a list of tokens, normalize each token using linguistic preprocessing (remove grammatical variation etc.), create an inverted index, parse user queries, translate user queries into efficient sequences of set operations on the inverted index, NB: both the index and the dictionaries can be compressed; prefix-compression, delta compression, etc.

proximity queries, "X near Y", simple approach: biword index, create an index of word pairs, more general data structure: positional indexes, for each term, store the document that contains the term, plus the positions within that document, NEAR queries now turn into a nested "merge" operation, for each term, iterate over the sorted document ids, collect positions by terms, then "merge" the term positions in a similar way, note that this means basically storing one integer per word in each document; the index is about as large as the original text collection

memory and disk space, does the index fit into memory (much easier construction), how does search deal with the different performance characteristics of memory and disks?, how does this relate to relational databases? can we build on top of relational databases? is that efficient?, what do they support that we don't need?, what do they lack that we need to build on top of them?, how can we compress the index?, how can we compress the dictionary?

distributed construction, document collections and indexes have gotten so large that they can't easily fit into a single machine anymore (c.f. Google), solution: distributed computations, common paradigm: MapReduce

most texts aren't just collections of words; they have a logical structure, title, authors, abstract, text, references

instead of indexing the whole document, we treat these as "mini-documents" or "zones" within the document

retrieval is still at the whole document level

zone-based indexing is very common; Google supports it, title:, url:, etc.

CS student writing a paper on efficient sorting algorithms.

Home owner looking for a plumber to fix a leak.

Linguist looking to define the meaning of a word by finding it used in context.

Historian searching for documents about an event.

Lawyer trying to determine whether an opposing party knew or didn't know about an event, based on evidence.

Find the document I wrote last week on ...

What is the collection of documents?, dynamic or static, small or large

What is the collection of users?, experience level?, motivation?

What is the language?, How many grammatical variants does each word have?, How relevant are those grammatical variants?, Are words separated or easily separable?, English (nearly ideal for IR), each word is almost fixed, words are separated by spaces, so tokenization is easy, words are almost identical with morphemes, other languages, words may have hundreds of forms, many "words" may be combined together in complicated ways, words may not be separated from each other

Uses, users just needs a subset of useful candidates as starting points, high precision necessary, low recall tolerated, exhaustive collection of matching documents (e.g., in legal cases), high recall necessary, low precision tolerated

Resources, How much storage and computation is available?, "Personal file indexing" on PC has been limited in the past: users bothered by CPU and storage usage.