NLPA: An Overview of Information Retrieval

advanced topics

link analysis

matrix decompositions and LSA

statistical approaches

XML retrieval

web search basics

web crawling and indexes

what is "IR" in practice?

text vs knowledge

what we want

what we get

given the vector space model

term frequencies turn text documents into vectors

once in vector form, we can apply standard pattern recognition and neural network models
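A minimal sketch of the step above, assuming whitespace tokenization and raw counts as the term frequencies. The helper names (`tokenize`, `build_vocabulary`, `tf_vector`) and the two sample documents are made up for illustration.

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase and split on whitespace; real systems use richer tokenizers.
    return text.lower().split()

def build_vocabulary(docs):
    # Sorted list of distinct terms; a term's position is its vector dimension.
    return sorted({term for doc in docs for term in tokenize(doc)})

def tf_vector(doc, vocabulary):
    # One raw term count per vocabulary entry (0 for absent terms).
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]

# Hypothetical sample documents.
docs = ["the president gave the speech", "the crowd heard the speech"]
vocabulary = build_vocabulary(docs)
vectors = [tf_vector(doc, vocabulary) for doc in docs]
```

Each document becomes a fixed-length list of numbers, with one dimension per distinct term in the collection, which is why these vectors get very high dimensional on real data.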

kinds of models

very high dimensional


commonly used

getting at relevance

search for documents on the inauguration

general idea

vector space view

choosing weights



tolerant retrieval

with "grep" or "glob", we can search for arbitrary patterns such as /a.*b/; these are matched by scanning the text sequentially

for large-scale IR, we need to restrict the allowed wildcards so that lookups stay fast

wildcards restricted to terms

trailing wildcard within a term

leading wildcard within a term

wildcard in the middle
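One possible way to handle the three wildcard cases above is a sorted term dictionary plus a second dictionary of reversed terms. This is a sketch, not necessarily the method intended here. The class and method names are hypothetical, and the prefix search assumes no term contains the character U+FFFF or anything above it.

```python
from bisect import bisect_left

def prefix_range(sorted_terms, prefix):
    # Terms sharing a prefix form one contiguous run in sorted order.
    # Assumption: no term contains U+FFFF or higher, so it works as an upper sentinel.
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]

class WildcardDictionary:
    def __init__(self, terms):
        unique = set(terms)
        self.forward = sorted(unique)                   # for pre*
        self.backward = sorted(t[::-1] for t in unique)  # for *fix

    def trailing(self, prefix):
        # pre* : prefix search in the forward dictionary
        return prefix_range(self.forward, prefix)

    def leading(self, suffix):
        # *fix : prefix search on reversed terms, then reverse back
        hits = prefix_range(self.backward, suffix[::-1])
        return sorted(t[::-1] for t in hits)

    def middle(self, prefix, suffix):
        # pre*fix : intersect both lookups; the length check stops the
        # prefix and suffix from overlapping inside a short term
        starts = set(self.trailing(prefix))
        min_len = len(prefix) + len(suffix)
        return sorted(t for t in self.leading(suffix)
                      if t in starts and len(t) >= min_len)

# Hypothetical vocabulary.
d = WildcardDictionary(["inaugural", "inauguration", "nation", "national", "integration"])
```

With this vocabulary, `d.trailing("inaug")` covers a query like inaug*, `d.leading("ation")` covers *ation, and `d.middle("in", "tion")` covers in*tion.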

k-gram indexes
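A small sketch of a k-gram index, assuming k = 2 and "$" as the term boundary marker. The function names and the sample terms are illustrative. Because k-gram matching can return false positives, candidates are checked against the original pattern at the end.

```python
import re
from collections import defaultdict

def kgrams(text, k=2):
    # All substrings of length k (empty set if text is shorter than k).
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def build_kgram_index(terms, k=2):
    # Map each k-gram of "$term$" to the set of terms containing it.
    index = defaultdict(set)
    for term in terms:
        for gram in kgrams("$" + term + "$", k):
            index[gram].add(term)
    return index

def wildcard_lookup(index, pattern, k=2):
    # Collect k-grams from the literal pieces between '*' wildcards.
    grams = set()
    for piece in ("$" + pattern + "$").split("*"):
        grams |= kgrams(piece, k)
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # No usable k-gram (e.g. pattern "*"): fall back to every indexed term.
        candidates = set().union(*index.values())
    # Post-filter: k-gram matches are necessary but not sufficient.
    regex = ".*".join(re.escape(p) for p in pattern.split("*"))
    return sorted(t for t in candidates if re.fullmatch(regex, t))

# Hypothetical vocabulary.
index = build_kgram_index(["moon", "month", "mono", "demon"])
```

For the query mon*, the bigrams $m, mo and on also select "moon", which the final regular-expression check removes.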

general approaches


"traditional" IR


practical considerations

structured texts


use cases



inverted indexes
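A minimal in-memory sketch of an inverted index, assuming whitespace tokenization and document ids taken from list position. The function names and sample documents are made up for illustration. Conjunctive queries are answered by merging sorted postings lists.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each term to the sorted list of ids of documents containing it.
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def intersect(p1, p2):
    # Linear-time merge of two sorted postings lists.
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(index, terms):
    # Intersect shortest lists first to keep intermediate results small.
    lists = sorted((index.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result

# Hypothetical sample documents.
docs = [
    "inauguration speech of the president",
    "the president visits the capital",
    "a speech on the economy",
]
index = build_inverted_index(docs)
```

Here `and_query(index, ["president", "speech"])` only touches the postings for those two terms, rather than scanning every document.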

efficient computations, out-of-core computations (data larger than memory)


basic relationship between vector space models and pattern recognition