generative models, assume vectors are distributed according to some distribution
discriminative models, just treat vectors as inputs in some high dimensional space and learn a classifier
IR problems give rise to very high dimensional and sparse classification problems
kernel methods may help (more features than objects)
also: sparse matrices etc.
how do machine learning, classifiers and information retrieval fit together?
can we learn the mapping: query->result set directly?
usually machine learning is used not for solving retrieval directly, but for solving related problems:
    learn relevance of terms
    learn a better scoring function
    extract features from terms and determine relevance for the result set
    perform clustering on the result set and show only representatives from each cluster (since the rest of the objects in the cluster are very similar)
    organize the database so that retrieval is faster
    perform query expansion, find synonyms in queries
    learn document zones etc.
    relevance feedback
applying machine learning: you need features, training data, and the ability to generalize across samples (can't generalize from "cat" to "dog" without additional information)
learning a better scoring function:
    perform the query as usual, using cosine scoring
    normally, we just rank based on the cosine score, but maybe we can improve it
    what additional information do we have?
        document frequencies
        term frequencies for query terms
        term frequencies in the document
        semantic information about query terms
        category / class of document (e.g., from clustering)
        link and popularity information of documents in general
    training data consists of the above features plus feedback from users on whether a document was relevant
    put this into a machine learning algorithm and learn a classifier
    use the classifier to rescore results; use the probability of relevance as the score, instead of the original cosine score
k-nearest neighbor classifier, usually using the normalized dot product (cosine similarity)
Rocchio classifier: for each class, compute the class mean; classify according to the closest mean (how is this justified in pattern recognition?)
support vector machines
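The Rocchio rule can be sketched in a few lines; this is a minimal sketch, with hypothetical function names, plain Python lists as vectors, and squared Euclidean distance as the closeness criterion.

```python
def class_means(vectors, labels):
    """Compute one mean (centroid) vector per class."""
    sums, counts = {}, {}
    for v, c in zip(vectors, labels):
        acc = sums.setdefault(c, [0.0] * len(v))
        for i, x in enumerate(v):
            acc[i] += x
        counts[c] = counts.get(c, 0) + 1
    return {c: [x / counts[c] for x in s] for c, s in sums.items()}

def rocchio_classify(means, v):
    """Assign v to the class whose mean is closest (squared Euclidean distance)."""
    def dist2(m):
        return sum((a - b) ** 2 for a, b in zip(v, m))
    return min(means, key=lambda c: dist2(means[c]))
```

In an IR setting the vectors would be (normalized) tf-idf document vectors, and cosine similarity would replace Euclidean distance.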
terms: Obama, 2013, 2012, inauguration
maybe also: president, oath, etc.
generally: get a list of matches, then need to check them manually
would like to start with the "most likely relevant" documents first
general idea: give positive "scores" or "weight" to relevant terms
"Obama" - frequent, not very specific, but very relevant
"inauguration" - infrequent, very relevant, but might apply to other presidents
Obama=10, inauguration=10, oath=3, office=3, oath-of-office=10, president=2, Washington=1, etc.
score = 10 * matches("Obama") + 10 * matches("inauguration") + ...
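The weighted-sum score above can be written as a one-liner; the weight table is the hypothetical one from the notes, and document tokens are assumed to be lowercased.

```python
# Hypothetical weights from the notes: terms judged relevant get higher scores.
WEIGHTS = {"obama": 10, "inauguration": 10, "oath": 3, "office": 3,
           "oath-of-office": 10, "president": 2, "washington": 1}

def weighted_score(document_tokens, weights=WEIGHTS):
    """score = sum over tokens of weight(token); unknown tokens contribute 0."""
    return sum(weights.get(tok, 0) for tok in document_tokens)
```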
recall the matrix of documents vs terms
documents vs terms
the distribution of terms over documents is a vector
the total score for weighted matches is the dot product of a weight vector with the document vector
once we have justified these dot products in such a way, all of linear algebra becomes available
heuristic - TF-IDF, we choose scores that intuitively "make sense" and tune the process until things work well
generative, we make assumptions about how documents are generated, infer probabilities, and then use those probabilities to answer queries
discriminative, we have a training set of documents and retrieval results, train a linear classifier, and then use that to assign weights
term frequency: frequency of a term within a document; boolean - [ t in d ], logarithmic - log(1 + #occurrences-in-d), normalized - #occurrences-in-d / max #occurrences of any term in d
inverse document frequency, what fraction of the documents contain the term, generally, log of (#documents / #documents containing term)
tf-idf weight is the product of the term frequency and the inverse document frequency
terms receive a high weight if they are frequent in the document but infrequent among all documents
heuristic construction of scoring functions - c.f. fuzzy logic
ultimately, a model-based justification of these approaches has to happen in terms of statistics
statistical model for language generation: semantics, syntax
approximation in terms of n-grams
approximation in terms of vector space
apply the wildcard to the dictionary
then use the set of terms found that way for retrieval
very important for handling grammatical endings
represent the dictionary as a trie
efficiently look for the prefix in the trie, enumerate the children
compute the union of the postings for the children
note that you can combine the tree data structure for the prefixes with the postings lists by moving common postings up the tree
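A minimal sketch of the prefix lookup, with the trie as nested dicts and a "$" end-of-word marker; the union-of-postings step is left out, and the function names are hypothetical.

```python
def build_trie(terms):
    """Nested-dict trie; '$' marks the end of a word."""
    root = {}
    for t in terms:
        node = root
        for ch in t:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def prefix_match(root, prefix):
    """All dictionary terms matching 'prefix*': walk to the prefix node, enumerate descendants."""
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    out = []
    def collect(n, acc):
        for k, child in n.items():
            if k == "$":
                out.append(prefix + acc)
            else:
                collect(child, acc + k)
    collect(node, "")
    return out
```

Retrieval would then take the union of the postings lists of the returned terms.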
use tree on reverse words
important in some languages
use a "permuterm index"
that is, represent "word$" and all its rotations in the tree: "ord$w", "rd$wo", etc.
prefixes of this string now turn into internal wildcard matches (provided they contain $)
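The rotation trick can be sketched like this; the names are hypothetical, and for brevity the index is a flat dict scanned linearly, where a real implementation would do a prefix search in a tree.

```python
def permuterm_keys(term):
    """All rotations of term + '$'."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def wildcard_lookup(permuterm_index, query):
    """Handle a single-'*' query 'X*Y': rotate it to 'Y$X*', then prefix-match rotations."""
    head, tail = query.split("*")
    prefix = tail + "$" + head   # rotation moves the wildcard to the end
    return sorted({t for rot, t in permuterm_index.items() if rot.startswith(prefix)})
```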
consider substrings of size k and index those
"abc*def" must contain the 3-grams "abc" and "def"
take both query and index terms
map both into a different space
perform search in that space
spelling correction: string->string or string->set-of-k-grams
phonetic: string->string (phonetic representation)
edit distance: variety of representations, or sequential search
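The edit-distance representation is usually computed with the classic Levenshtein dynamic program; a minimal sketch with a rolling row:

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard DP table, keeping only one row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution (free if equal)
        prev = cur
    return prev[-1]
```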
simplest version: term-document matrix
"term"?, words are a linguistic entity, term = words + proper nouns + numbers + abbreviations + ..., term = "unit the user can search for"
term-document incidence matrix:
    binary matrix (we could also include counts)
    rows are vectors giving term occurrences by document
    columns are vectors giving the terms contained in a document
    a column is a bit-vector representation of the set of words
boolean queries: "find all the documents that contain Caesar and Calpurnia"
    query language: caesar AND calpurnia
    conceptually: bit-and of the row vectors for the terms "caesar" and "calpurnia"
    conceptually: intersection of document sets
    also: caesar OR calpurnia
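A tiny worked example of the incidence-matrix view; the collection and the helper name are made up for illustration.

```python
# Toy collection; the incidence matrix has one bit-vector row per term.
docs = {0: "caesar brutus calpurnia", 1: "caesar brutus", 2: "calpurnia"}
vocab = {w for text in docs.values() for w in text.split()}
incidence = {t: [int(t in docs[d].split()) for d in sorted(docs)]
             for t in vocab}

def and_query(t1, t2):
    """caesar AND calpurnia = bit-and of the two term rows."""
    return [d for d, (a, b) in enumerate(zip(incidence[t1], incidence[t2])) if a & b]
```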
inverted index:
    with millions of documents and tens of thousands of terms, the matrix is sparse: most terms do not occur in most documents
    the list of terms / tokens is collectively referred to as the dictionary
    the lists of documents for each term are the postings
    note that we represent documents as integers (often, the terms as well)
    if the lists of documents are sorted, then intersection and union turn into efficient algorithms similar to mergesort
    given two document lists of length N and M, what is the computational complexity of intersection, union?
    optionally use skip lists to speed up intersection of postings
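The mergesort-like intersection of two sorted postings lists runs in O(N + M); a minimal sketch:

```python
def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists of doc ids."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out
```

Union is the same merge, emitting every id it sees; skip lists accelerate the `p1[i] < p2[j]` case by jumping ahead several postings at once.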
simple information retrieval system:
    INPUT: collection of documents
    tokenize each document into a list of tokens
    normalize each token using linguistic preprocessing (remove grammatical variation etc.)
    create an inverted index
    parse user queries
    translate user queries into efficient sequences of set operations on the inverted index
    NB: both the index and the dictionaries can be compressed; prefix compression, delta compression, etc.
proximity queries: "X near Y"
    simple approach: biword index, i.e., create an index of word pairs
    more general data structure: positional indexes
        for each term, store the documents that contain the term, plus the positions within each document
        NEAR queries now turn into a nested "merge" operation: for each term, iterate over the sorted document ids, collect positions by term, then "merge" the term positions in a similar way
    note that this means basically storing one integer per word in each document; the index is about as large as the original text collection
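The nested merge can be sketched as follows, assuming positional postings are stored as {doc_id: sorted list of positions}; for brevity the inner position check is a quadratic scan where a real implementation would merge the two sorted position lists linearly.

```python
def positional_near(post1, post2, k):
    """Docs in which the two terms occur within k positions of each other.
    post1/post2: {doc_id: sorted positions of the term in that doc}."""
    hits = []
    for doc in sorted(post1.keys() & post2.keys()):          # outer merge over doc ids
        ps1, ps2 = post1[doc], post2[doc]
        if any(abs(a - b) <= k for a in ps1 for b in ps2):   # inner check on positions
            hits.append(doc)
    return hits
```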
memory and disk space:
    does the index fit into memory? (much easier construction)
    how does search deal with the different performance characteristics of memory and disks?
    how does this relate to relational databases? can we build on top of relational databases? is that efficient? what do they support that we don't need? what do they lack that we need to build on top of them?
    how can we compress the index? how can we compress the dictionary?
distributed construction:
    document collections and indexes have gotten so large that they no longer easily fit on a single machine (c.f. Google)
    solution: distributed computation; common paradigm: MapReduce
most texts aren't just collections of words; they have a logical structure, title, authors, abstract, text, references
instead of indexing the whole document, we treat these as "mini-documents" or "zones" within the document
retrieval is still at the whole document level
zone-based indexing is very common; Google supports it (title:, url:, etc.)
CS student writing a paper on efficient sorting algorithms.
Home owner looking for a plumber to fix a leak.
Linguist looking to define the meaning of a word by finding it used in context.
Historian searching for documents about an event.
Lawyer trying to determine whether an opposing party knew or didn't know about an event, based on evidence.
Find the document I wrote last week on ...
What is the collection of documents?, dynamic or static, small or large
What is the collection of users?, experience level?, motivation?
What is the language?
    How many grammatical variants does each word have? How relevant are those variants? Are words separated or easily separable?
    English (nearly ideal for IR): each word has an almost fixed form, words are separated by spaces (so tokenization is easy), words are almost identical with morphemes
    other languages: words may have hundreds of forms, many "words" may be combined together in complicated ways, words may not be separated from each other
Uses:
    users who just need a subset of useful candidates as starting points: high precision necessary, low recall tolerated
    exhaustive collection of matching documents (e.g., in legal cases): high recall necessary, low precision tolerated
Resources: How much storage and computation is available? "Personal file indexing" on PCs has been limited in the past: users were bothered by CPU and storage usage.