This is an old version of the compendium, written Nov. 26, 2014, 3:52 p.m. Changes made in this revision were made by thormartin91. View rendered version.

TDT4117: Information retrieval

Disclaimer: Not complete # 1 - Introduction ## 1.1 - Information retrieval Information retrieval deaels with the representation, storage, organization of, and access to information items such as documents, Web pages, online catalogs, structured and semi-structured records, multimedia object. The representation and organization of the information items should be such as to provide the users with easy access to information of their interest. IR includes modeling, Web search, text classification, systems architecture, user interfaces, data visualization, filtering, languages. IR consist mainly of building up efficient indexes, processing user queries with high performance, and developing ranking algorithms to improve the results. In one form or another, indexes are at the core of every modern information retrieval system. The web has become a universal repository of human knowledge and culture. Users have created billions of documents, and finding useful information is not an easy task unless running a search, and search is all about IR and its technologies. ## 1.2 - The IR Problem __The IR Problem__: the primary goal of an IR system is to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible. The user of a retrieval system has to translate their information need into a query in the language provided by the system. The user is concerned more with retrieveing information about a subject than retrieving data that satisfies the query. A user is willing to accept documents that conatin synonys of the query terms in the result set, even when those documents do not contain any query terms. Data retrieval, while providing a solution to the user of a database system, does not solve the problem of retrieving information about a subject or topic. ## 1.3 - The IR System The software architecture: - Document collection - Crawler (if this is a Web collection) - Indexer - Retrieval and ranking process - Query parsing and expansion - System query - Retrieval and ranking The most used index structure is an _inverted index_ composed of all the distinct words of the collection and, for each word, a list of the documents that contain it. A document collection must be indexed before the retrieval process can be performed on it. The retrieval process consists of retrieving documents that satisfy either a user query or a click in a hyperlink. In the last case we say the user is _browsing_. Web search is today the most prominent application of IR and its techniques, and has had a major impact on the development of IR. A new component introduced with the web is the _crawler_ which is discussed in chapter 12. Impacts of the web is that performance and reliability has become critical characteristics of the IR system, and need to handle vast sizes of document collections. Search problem has extended beyond the seeking of text information to encompass user needs (hotel prices, book prices, phone numbers, software download links). Web spam is sometimes some compelling that it is confused with truly relevant content. Practical issues on the web includes security, user privacy, copyrights and patent rights. Is a site which supervises all the information it posts acting as a publisher? And if so, is it responsible for misuse of the information it posts? # 2 - User Interfaces for Search ## 2.1 - Introduction The role of the search user interface is to aid in the searcher's understanding and expression of their information needs, and to help users formulate their queries, select among available information sources, understand search results, and keep track of the progress of their search. ## 2.2 - How People Search Search interfaces should support a range of tasks, while taking into account how people think about searching for information. Marchionini makes a disctiction between _information lookup_ and _exploratory search_. Lookup tasks are fact retrieval or question answering and are satisfied by short, discrete pieces of information. Exploratory search can be divided into learning and investigating tasks. Learning searches require more than single query-response pairs, and involve the searcher spending time scanning and reading multiple information items, and syntesizing content to form new understanding. Investigative search may be done to support planning, discover gaps in knowledge, and to monitor an on-going topic. Users learn as they search where they refine their queries as they go. Such dynamic process is sometimes referred to as the _berry picking_ model of search. A typical searcher will at first opt to query on a more general term and either reformulate the query or visit appropriate Web sites and browse to find the desired product. This is approach is sometimes referred to as _orienteering_. Web search logs suggest this approach is common and the proportion of users who modify their queries is about 52%. Searchers often prefer _browsing_ over keyword searching when the information structure is well-matched to their information needs. Browsing works well only so long as appropriate links are available, and have meaningful cues about the underlying information. If at some point midway through the click the searcher does not see a link leading closer to the goal, then the experience is frustrating and the interface fails from a usability perspective. Studies have shown that it is difficult for people to determine whether or not a document is relevant to a topic, and the less someone knows about a topic, the poorer judge they are about if a search result is relevant to that topic. Searches are biased towards thinking the top one or two results are better than those beneath it. ## 2.3 - Search Interfaces Today At the heart of the typical search session is a cycle of query specification, inspection of retrieval results, and query reformulation. As the process proceeds, the searcher learns about their topic, as well as about the available information sources. The most common way to start a search session is to access a Web browser and use a Web search engine. Second comes selecting a Web site from a personal collection of previously visited sites. Third comes Web directories, such as Yahoo's directory. For Web search engines, the query is specified in textual form. Typically in Web queries text is very short, consisting of one to three words. If the results given do not look relevant, then the user reformulates their query. In many cases searchers would prefer to state their information need in more detail, but past experience with search engines taught them that this methos does not work well, and that keyboard querying combined with orienteering works better. Some systems supports Boolean operators. But, Boolean operators and command-line syntax have been shown time and again to be difficult for most users to understand, and those who try to use them often do so incorrectly. On Web search engines today, conjuctive ranking is the norm, where only documents containing all query terms are displayed in the results. However, Web search engines have become more sophisticated about dropping terms that would result in empty hits, while matching the important terms, ranking the hits higher that have these terms in close proximity to one another, and using the other aspects of ranking that have been found to work well for Web search. In design, studies suggest a relationshi between query length and the width of the entry form; small forms discourage long queries and wide forms encourage longer queries. Auto-complete (or auto-suggest or dynamic query suggestions) which is shown in real time has greatly improved query specification. In some Web search engines the query is run immeadiately; on others the user must hit the Return key or click the Search button. When displaying the search results, the document surrogate refers to the information that summarizes the document. The text summary containing text extracted from the document is also critical for assessment of retrieval results. Several studies have shown that longer results are deemed better than shorter ones for certain types of information needs. Users engaged in _known-item searching_ tend to prefer short surrogates that clearly indicate the desired information. One of the most important query reformulation techniques consists of showing terms related to the query or to the documents retrieved in response to the query. Usually only one suggested alternative is shown; clicking on that alternative re-executes the query. Search engines are increasingly employing related term suggestions, referred to as _term expansion_. Relevance feedback has been shown in non-interactive or artificial settings to be able to greatly improve rank ordering, but this method has not been successful from a usability perspective. In organizing search results, a category system is a set of meaningful labels organized in such a way as to reflect the concepts relevant to a domain. Most commonly used category structures are _flat_, _hierarchical_, and _faceted_ categories. Flat categories are simply lists of topics or subjects. Hierarchical organization online is most commonly seen in desktop file system browsers. It is, however, difficult to maintain a strict hieratchy when the information collection is large. Faceted metadata, has become the primary way to organize Web site content and search results.Faceted metadata consists of a set of categories (flat or hierarchical). Usability studies find that users tlike, and are successful, using faceted navigation, if the interface is designed properly, and that faceted interfaces are overwhelmingly preferred for collection search and browsing as an alternative to the standard keyword-and-results listing interface. Clustering refers to the grouping of items according to some measure of similarity. the greatest advantage is that it is fully automatic and can be easily applied to any text collection. The disadvantages include an unpredictability in the form and quality of results. One drawback of faceted interfaces versus clusters are that the categories of interest must be known in advance. The largest drawback is the fact that in most cases the category hierachies are built by hand. ## 2.4 - Visualization in Search Interfaces Information visualization has become a common presence in news reporting and financial analysis, but visualization of inherently abstract information is more difficult, and visualization of textually represented information is especially challenging. When using boolean syntax, Hertzum and Frokjaer found that a simple Venn diagram representation produced more accurate results. One of the best known experimental visualizations is the TileBars interface which documents are shown as horizontal glyphs with the locations of the query term hits marked along the glyph. Variations of the TileBars display have been proposed, including a simplified version which shows only one square per query term, and color gradation is used to show query term frequency. Other approaches to showing query term hits within document collections include placing the query terms in bar charts, scatter plots, and tables. A usability study that compared five views to the Web-style listing, there were no significant differences for task effectiveness for the other confitions, except for bar charts, which were significantly worse. All confitions had significantly higher mean task times than the Web-style listing. Evaluations that have been conducted so far provide negative evidence as to their usefulness. For text mining, most users of search systems are not interested in seeing how words are distributed across documents. ## 2.5 - Design and Evaluation of Search Interfaces Based on years of experience, a set of practices and guidelines have been developed to facilitate the design of successful interfaces. The practice are collectively referred to as _user-centered design_, which emphasizes building the design around people's activites and thought processes, rather than the converse. The quality of a user interface is determined by how people respond to it. Subjective responses are as, if not more, important than quantitative measures, because if a person has a choice between two systems, they will use the one they prefer. Another major evaluation technique that has become increasingly popular in the last few years is to perform experiments on a massve scale on already heavily-used Web sites. This approach is often referred to as _bucket testing_, _A/B testing_, or _parallel flights_. Participants do not knowingly "opt in" to the study; rather, visitors to the site are selected to be shown alternatives without their knowledge or consent. A potential downside to such studies is that users who are familiar with the control version of the site may initially react to the unfamilirarity of the interface. # 3 - Modeling ## 3.1 - IR Models Modeling in IR is a process aimed at producing a ranking function that assigns scores to documents for a given query. ### 3.1.2 Characterization of an IR Model A information retrieval model consists of four variables [D,Q,F,R(q_i,d_j)] where - D Set of the representations of documents in a collection - Q The set of queries given by the user information needs. - F Framework for modeling document representations, queries, and their relationships (sets and Boolean relations, vectors and linear algebra operation, sample spaces and probability distributions) - R(q_i,d_j)] is a ranking function. Gives a ranking to the documents given the query q_i and orders them by relevance ### Classical Similarity Models #### Boolean model Based on set theory and Boolean algebra. Simple, intuitive, precise semantics. No partial matching or ranking, so more data retrieval model(than a IR-model). Hard to translate queries to boolean expressions. #### Vector Space model Assigns weights to index terms in queries and documents, used to calculate the _degree of similarity_ between a query and documents in the collection. This produces a ranked result. Its term-weighting scheme improves retrieval performance and its partial matching strategy allows retrieval of documents that approximate the query conditions. The disadvantages of the model is that it assumes all terms are independent. It also is not as sound as the probabilistic model. #### Probabilistic model There exists a subset of the documents collection that are relevant to a given query. A probabilistic retrieval model ranks this set by the probability that the document is relevant to the query. The advantage of this model is that documents are ranked in decreasing order of their probability of being relevant. the disadvantages is the need to guess the initial separation of documents into relevant and non-relevant sets. ### Alternative Probabilistic Models #### BM25 Based on experiments with classic probabilistic ranking equation... #### Language model Based on finite state automata... # Retrieval Evaluation ### Precision

The fraction of __retrieved ~~documents~~_ ~~that are~~ _ documents that are __relevant__. $Precision = P(relevant|retrieved)$

### Recall The fraction of _relevant documents_ that are _retrieved_. $Recall = P(retrieved|relevant)$ # Text Operations ## Document Preprocessing There are 5 text transformations (operations) used to prepare a document for indexing. || __Lexical analysis__ || Handle digits, hyphens, punctation marks, and the case of letters. || || __Stopword elimination__ || Filter out words with low discrimination values. || || __Stemming__ || Remove affixes (prefixes and suffixes). || || __Index term selection__ || Select words, stems or group of words to be used as indexing elements. || || __Thesauri__ || Construct a standard vocabulary by categorizing words within the domain. || #### Lexical analysis _Numbers_ are of little value alone and should be removed. Combinations of numbers and words could be kept. _Hyphens_ (dashes) between words should be replaced with whitespace. _Punctuation marks_ are usually removed, except when dealing with program code etc. To deal with the _case of letters_; convert all text to either upper or lower case. #### Stopword elimination Stopwords are words occurring in over 80% of the document collection. These are not good candidates for index terms and should be removed before indexing. (Can reduce the index by 40%) This often includes words as articles, prepositions, conjunctions. Some verbs, adverbs, adjectives. Lists of popular stop words can be found on the Internet. #### Stemming The idea of stemming is to let any conjugation and plurality of a word produce the same result in a query. This improves the retrieval performance and reduce the index size. There are four types of stemming: _Affix removal, Table Lookup, Successor variety, N-grams_. Affix removal is the most intuitive, simple and effective and is the focus of the course, with use of the _Porter Algorithm_. #### Index term selection Instead of using a full text representation (using all the words as index terms) we often select the terms to be used. This can be done manually, by a specialist, or automatically, by identifying noun groups. Most of the semantics in a text is carried by the noun words, but not often alone (e.g. _computer science_). A noun group are nouns that have a predefined maximum distance from each other in the text. (syntactic distance) #### Thesauri To assist a user for proper query terms you can construct a Thesauri. This provides a hierarchy that allows the broadening and narrowing of a query. It is done by creating a list of important words in a given domain. For each word provide a set of related words, derived from synonymity relationship.