Vollbild-Modus

NLPA An Overview of Information Retrieval

Thomas Breuel

Folgen

Jetzt loslegen. Gratis!

oder registrieren mit Ihrer E-Mail-Adresse

Ähnliche Mindmaps Mindmap-Gliederung

NLPA An Overview of Information Retrieval von Thomas Breuel Mind Map: NLPA An Overview of Information Retrieval

1. advanced topics

1.1. link analysis

1.2. matrix decompositions and LSA

1.3. statistical approaches

1.4. XML retrieval

1.5. web search basics

1.6. web crawling and indexes

2. what is "IR" in practice?

2.1. text vs knowledge

2.2. what we want

2.3. what we get

3. given the vector space model

3.1. term frequencies turn text documents into vectors

3.2. once in vector form, we can apply regular pattern recognition and neural network models

3.3. kinds of models

3.3.1. generative models

3.3.1.1. assume vectors are distributed according to some distribution

3.3.2. discriminative models

3.3.2.1. just treat vectors as inputs in some high dimensional space and learn a classifier

3.4. very high dimensional

3.4.1. IR problems give rise to very high dimensional and sparse classification problems

3.4.2. kernel methods may help (more features than objects)

3.4.3. also: sparse matrices etc.

3.5. connection

3.5.1. how do machine learning, classifiers and information retrieval fit together?

3.5.2. can we learn the mapping: query->result set directly?

3.5.3. usually machine learning is used not for solving retrieval directly, but solving related problems

3.5.3.1. learn relevance of terms

3.5.3.2. learn a better scoring function

3.5.3.3. extract features from terms and determine relevance for the result set

3.5.3.4. perform clustering on the result set and show only representatives from each cluster (since the rest of the objects in the cluster are very similar)

3.5.3.5. organize the database so that retrieval is faster

3.5.3.6. perform query expansion, find synonyms in queries

3.5.3.7. learn document zones etc.

3.5.3.8. relevance feedback

3.5.4. applying machine learning

3.5.4.1. to apply machine learning, you need...

3.5.4.2. features

3.5.4.3. training data

3.5.4.4. the ability to generalize across samples (can't generalize from "cat" to "dog" without additional information)

3.5.5. learning a better scoring function

3.5.5.1. perform query as usual, using cosine scoring

3.5.5.2. normally, we just rank based on the cosine score, but maybe we can improve it

3.5.5.3. what additional information do we have?

3.5.5.3.1. document frequencies

3.5.5.3.2. term frequencies for query terms

3.5.5.3.3. term frequencies in document

3.5.5.3.4. semantic information about query terms

3.5.5.3.5. category / class of document (e.g., from clustering)

3.5.5.3.6. link and popularity information of documents in general

3.5.5.4. training data consists of the above features and feedback from users whether a document was relevant

3.5.5.5. put this into a machine learning algorithm and learn a classifier

3.5.5.6. use the classifier to rescore results; use the probability-of-relevance as the score, instead of the original cosine score

3.6. commonly used

3.6.1. k-nearest neighbor classifier

3.6.1.1. usually, using normalized dot product

3.6.2. Rocchio classifier

3.6.2.1. for each class, compute the class mean

3.6.2.2. classify according to the closest mean

3.6.2.3. (how is this justified in pattern recognition?)

3.6.3. support vector machines

3.6.4. Naive Bayesian

3.6.5. logistic regression

3.6.6. neural networks

3.6.7. k-means

3.6.8. decision trees

4. getting at relevance

4.1. search for documents on the inauguration

4.1.1. terms: Obama, 2013, 2012, inaguration

4.1.2. maybe also: president, oath, etc.

4.1.3. generally: get list of matches, need to then check manually

4.1.4. would like to start with the "most likely relevant" documents first

4.1.5. general idea: give positive "scores" or "weight" to relevant terms

4.2. general idea

4.2.1. "Obama" - frequent, not very specific, but very relevant

4.2.2. "inauguration" - infrequent, very relevant, but might apply to other presidents

4.2.3. etc.

4.2.4. Obama=10, inauguration=10, oath=3, office=3, oath-of-office=10, president=2, Washington=1, etc.

4.2.5. score = 10 * matches("Obama") + 10 * matches("inaguration") + ...

4.3. vector space view

4.3.1. recall the matrix of documents vs terms

4.3.2. documents vs terms

4.3.3. the distribution of terms over documents is a vector

4.3.4. the total score for weighted matches is the dot product of a weight vector with the document vector

4.3.5. once we have justified these dot products in such a way, all of linear algebra becomes available

4.4. choosing weights

4.4.1. heuristic - TF-IDF

4.4.1.1. we choose scores that intuitively "make sense" and tune the process until things work well

4.4.2. generative

4.4.2.1. we make assumptions about how documents are generated, infer probabilities, and then use those probabilities to answer queries

4.4.3. discriminative

4.4.3.1. we have a training set of documents and retrieval results, train a linear classifier, and then use that to assign weights

4.5. TF-IDF

4.5.1. term frequency

4.5.1.1. frequency of a term within a document

4.5.1.2. boolean - [ t in d ]

4.5.1.3. logarithmic - log(1 + #occurrences-in-d)

4.5.1.4. normalized - #occurrences-in-d / max #occurences of any term

4.5.2. inverse document frequency

4.5.2.1. what fraction of the documents contain the term

4.5.2.2. generally, log of (#documents / #documents containing term)

4.5.3. tf-idf weight is the product of the term frequency and the document frequency

4.5.4. terms receive a high weight if they are frequent in the document but infrequent among all documents

4.5.5. heuristic construction of scoring functions - c.f. fuzzy logic

4.6. justification?

4.6.1. ultimately, a model-based justification of these approaches has to happen in terms of statistics

4.6.2. statistical model for language generation: semantics, syntax

4.6.3. approximation in terms of n-grams

4.6.4. approximation in terms of vector space

5. tolerant retrieval

5.1. with "grep" or "glob", we can search for arbitrary wildcards like /a.*b/ etc.; these are processed sequentially

5.2. for large scale IR, we need to be more restrictive since we want faster performance

5.3. wildcards restricted to terms

5.3.1. apply the wildcard to the dictionary

5.3.2. then use the set of terms found that way for retrieval

5.4. trailing wildcard within a term

5.4.1. very important for handling grammatical endings

5.4.2. represent the dictionary as a trie

5.4.3. efficiently look for the prefix in the trie, enumerate the children

5.4.4. compute the union of the postings for the children

5.4.5. note that you can combine the tree data structure for the prefixes with the postings lists by moving common postings up the tree

5.5. leading wildcard within a term

5.5.1. use tree on reverse words

5.5.2. important in some languages

5.6. wildcard in the middle

5.6.1. use a "permuterm index"

5.6.2. that is, represent "word$" and all its rotations in the tree, "ord$w", "rd$wo"

5.6.3. prefixes of this string now turn into internal wildcard matches (provided they contain $)

5.7. k-gram indexes

5.7.1. consider substrings of size k and index those

5.7.2. "abc*def" must contain the 3-grams "abc" and "def"

5.8. general approaches

5.8.1. take both query and index terms

5.8.2. map both into a different space

5.8.3. perform search in that space

5.9. specifics

5.9.1. spelling correction: string->string or string->set-of-k-grams

5.9.2. phonetic: string->string (phonetic representation)

5.9.3. edit distance: variety of representations, or sequential search

6. "traditional" IR

6.1. indexes

6.1.1. simplest version: term-document matrix

6.1.2. "term"?

6.1.2.1. words are a linguistic entity

6.1.2.2. term = words + proper nouns + numbers + abbreviations + ...

6.1.2.3. term = "unit the user can search for"

6.1.3. term document incidence matrix

6.1.3.1. binary matrix (we could also include counts)

6.1.3.2. rows are vectors giving term occurrences by document

6.1.3.3. columns are vectors giving terms contained in document

6.1.3.4. column is a bit vector representation of the set of words

6.1.4. boolean queries

6.1.4.1. "find all the documents that contain Caesar and Calpurnia"

6.1.4.2. query language: caesar AND calpurnia

6.1.4.3. conceptually: bit-and of row vectors for terms "caesar" and "calpurnia"

6.1.4.4. conceptually: intersection of document sets

6.1.4.5. also: caesar OR calpurnia

6.1.5. inverted index

6.1.5.1. millions of documents, tens of thousands of terms

6.1.5.2. the matrix is sparse, so most terms do not occur in most documents

6.1.5.3. inverted index

6.1.5.3.1. the list of terms / tokens is collectively referred to as the dictionary

6.1.5.3.2. the list of documents for each term are the postings

6.1.5.3.3. note that we represent documents as integers (often, the terms as well)

6.1.5.4. if the list of documents is sorted, then intersection and union turn into efficient algorithms similar to mergesort

6.1.5.4.1. given two document lists of length N and M, what is the computational complexity of intersection, union?

6.1.5.4.2. optionally use skiplists to speed up intersection of postings

6.1.6. simple information retrieval system

6.1.6.1. INPUT: collection of documents

6.1.6.2. tokenize each document into a list of tokens

6.1.6.3. normalize each token using linguistic preprocessing (remove grammatical variation etc.)

6.1.6.4. create an inverted index

6.1.6.5. parse user queries

6.1.6.6. translate user queries into efficient sequences of set operations on the inverted index

6.1.6.7. NB: both the index and the dictionaries can be compressed; prefix-compression, delta compression, etc.

6.1.7. proximity queries

6.1.7.1. "X near Y"

6.1.7.2. simple approach: biword index, create an index of word pairs

6.1.7.3. more general data structure: positional indexes

6.1.7.3.1. for each term, store the document that contains the term, plus the positions within that document

6.1.7.3.2. NEAR queries now turn into a nested "merge" operation

6.1.7.3.3. note that this means basically storing one integer per word in each document; the index is about as large as the original text collection

6.2. practical considerations

6.2.1. memory and disk space

6.2.1.1. does the index fit into memory (much easier construction)

6.2.1.2. how does search deal with the different performance characteristics of memory and disks?

6.2.1.3. how does this relate to relational databases? can we build on top of relational databases? is that efficient?

6.2.1.3.1. what do they support that we don't need?

6.2.1.3.2. what do they lack that we need to build on top of them?

6.2.1.4. how can we compress the index?

6.2.1.5. how can we compress the dictionary?

6.2.2. distributed construction

6.2.2.1. document collections and indexes have gotten so large that they can't easily fit into a single machine anymore (c.f. Google)

6.2.2.2. solution: distributed computations

6.2.2.3. common paradigm: MapReduce

6.3. structured texts

6.3.1. most texts aren't just collections of words; they have a logical structure

6.3.1.1. title

6.3.1.2. authors

6.3.1.3. abstract

6.3.1.4. text

6.3.1.5. references

6.3.2. instead of indexing the whole document, we treat these as "mini-documents" or "zones" within the document

6.3.3. retrieval is still at the whole document level

6.3.4. zone-based indexing is very common; Google supports it

6.3.4.1. title:

6.3.4.2. url:

6.3.4.3. etc.

7. background

7.1. use cases

7.1.1. CS student writing a paper on efficient sorting algorithms.

7.1.2. Home owner looking for a plumber to fix a leak.

7.1.3. Linguist looking to define the meaning of a word by finding it used in context.

7.1.4. Historian searching for documents about an event.

7.1.5. Lawyer trying to determine whether an opposing party knew or didn't know about an event, based on evidence.

7.1.6. Find the document I wrote last week on ...

7.2. properties

7.2.1. What is the collection of documents?

7.2.1.1. dynamic or static

7.2.1.2. small or large

7.2.2. What is the collection of users?

7.2.2.1. experience level?

7.2.2.2. motivation?

7.2.3. What is the language?

7.2.3.1. How many grammatical variants does each word have?

7.2.3.2. How relevant are those grammatical variants?

7.2.3.3. Are words separated or easily separable?

7.2.3.4. English (nearly ideal for IR)

7.2.3.4.1. each word is almost fixed

7.2.3.4.2. words are separated by spaces, so tokenization is easy

7.2.3.4.3. words are almost identical with morphemes

7.2.3.5. other languages

7.2.3.5.1. words may have hundreds of forms

7.2.3.5.2. many "words" may be combined together in complicated ways

7.2.3.5.3. words may not be separated from each other

7.2.4. Uses

7.2.4.1. users just needs a subset of useful candidates as starting points

7.2.4.1.1. high precision necessary

7.2.4.1.2. low recall tolerated

7.2.4.2. exhaustive collection of matching documents (e.g., in legal cases)

7.2.4.2.1. high recall necessary

7.2.4.2.2. low precision tolerated

7.2.5. Resources

7.2.5.1. How much storage and computation is available?

7.2.5.2. "Personal file indexing" on PC has been limited in the past: users bothered by CPU and storage usage.