TDT4117: Information retrieval
Disclaimer: Not complete
# Tips
Do the quiz exercises at [http://www.intoit.io](http://www.intoit.io). You can test yourself and get an overview of the topics!
# 1 - Introduction
## 1.1 Information retrieval
Information retrieval deals with the representation, storage, organization of, and access to information items such as documents, Web pages, online catalogs, structured and semi-structured records, and multimedia objects. The representation and organization of the information items should be such as to provide the users with easy access to information of their interest.
IR includes modeling, Web search, text classification, systems architecture, user interfaces, data visualization, filtering and languages.
IR consists mainly of building efficient indexes, processing user queries with high performance, and developing ranking algorithms to improve the results.
In one form or another, indexes are at the core of every modern information retrieval system.
The web has become a universal repository of human knowledge and culture. Users have created billions of documents, and finding useful information among them is not an easy task without running a search, and search is all about IR and its technologies.
## 1.2 The IR Problem
__The IR Problem__: the primary goal of an IR system is to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible.
The user of a retrieval system has to translate their information need into a query in the language provided by the system.
The user is concerned more with retrieving information about a subject than retrieving data that satisfies the query. A user is willing to accept documents that contain synonyms of the query terms in the result set, even when those documents do not contain any query terms.
Data retrieval, while providing a solution to the user of a database system, does not solve the problem of retrieving information about a subject or topic.
__Differences__ between data retrieval and information retrieval are shown in the table below:
|| || __Data retrieval__ || __Information retrieval__ ||
|| Content || Data || Information ||
|| Data object || Table || Document ||
|| Matching || Exact || Partial ||
|| Items wanted || Matching || Relevant ||
|| Query language || Artificial || Natural ||
|| Query specification || Complete || Incomplete ||
|| Model || Deterministic || Probabilistic ||
|| Structure || High || Less ||
## 1.3 The IR System

_High level architecture of an IR system. (CC-BY-SA 4.0, Eirik Vågeskar)_
The software architecture:
- Document collection
- Crawler (if this is a Web collection)
- Indexer
- Retrieval and ranking process
- Query parsing and expansion
- System query
- Retrieval and ranking
The most used index structure is an _inverted index_ composed of all the distinct words of the collection and, for each word, a list of the documents that contain it. A document collection must be indexed before the retrieval process can be performed on it. The retrieval process consists of retrieving documents that satisfy either a user query or a click on a hyperlink. In the last case we say the user is _browsing_.
Web search is today the most prominent application of IR and its techniques, and has had a major impact on the development of IR. A new component introduced with the web is the _crawler_ which is discussed in chapter 12.
One impact of the web is that performance and reliability have become critical characteristics of the IR system, which needs to handle document collections of vast size. The search problem has extended beyond the seeking of text information to encompass other user needs (hotel prices, book prices, phone numbers, software download links). Web spam is sometimes so compelling that it is confused with truly relevant content.
Practical issues on the web include security, user privacy, copyrights and patent rights. Is a site which supervises all the information it posts acting as a publisher? And if so, is it responsible for misuse of the information it posts?
# 2 - User Interfaces for Search
## 2.1 Introduction
The role of the search user interface is to aid in the searcher's understanding and expression of their information needs, and to help users formulate their queries, select among available information sources, understand search results, and keep track of the progress of their search.
## 2.2 How People Search
Search interfaces should support a range of tasks, while taking into account how people think about searching for information.
Marchionini makes a distinction between _information lookup_ and _exploratory search_. Lookup tasks are fact retrieval or question answering and are satisfied by short, discrete pieces of information. Exploratory search can be divided into learning and investigating tasks. Learning searches require more than single query-response pairs, and involve the searcher spending time scanning and reading multiple information items, and synthesizing content to form new understanding. Investigative search may be done to support planning, discover gaps in knowledge, and to monitor an on-going topic.
Users learn as they search and refine their queries as they go. Such a dynamic process is sometimes referred to as the _berry picking_ model of search.
A typical searcher will at first opt to query on a more general term and either reformulate the query or visit appropriate Web sites and browse to find the desired product. This approach is sometimes referred to as _orienteering_. Web search logs suggest this approach is common; the proportion of users who modify their queries is about 52%.
Searchers often prefer _browsing_ over keyword searching when the information structure is well-matched to their information needs. Browsing works well only as long as appropriate links are available and provide meaningful cues about the underlying information. If at some point midway through the clicking the searcher does not see a link leading closer to the goal, the experience is frustrating and the interface fails from a usability perspective.
Studies have shown that it is difficult for people to determine whether or not a document is relevant to a topic, and the less someone knows about a topic, the poorer a judge they are of whether a search result is relevant to it. Searchers are also biased towards thinking the top one or two results are better than those beneath them.
## 2.3 Search Interfaces Today
At the heart of the typical search session is a cycle of query specification, inspection of retrieval results, and query reformulation. As the process proceeds, the searcher learns about their topic, as well as about the available information sources.
The most common way to start a search session is to access a Web browser and use a Web search engine. Second comes selecting a Web site from a personal collection of previously visited sites. Third comes Web directories, such as Yahoo's directory.
For Web search engines, the query is specified in textual form. Web query text is typically very short, consisting of one to three words. If the results given do not look relevant, then the user reformulates their query. In many cases searchers would prefer to state their information need in more detail, but past experience with search engines taught them that this method does not work well, and that keyword querying combined with orienteering works better.
Some systems support Boolean operators. But, Boolean operators and command-line syntax have been shown time and again to be difficult for most users to understand, and those who try to use them often do so incorrectly.
On Web search engines today, conjunctive ranking is the norm, where only documents containing all query terms are displayed in the results. However, Web search engines have become more sophisticated about dropping terms that would result in empty hits, while matching the important terms, ranking hits that have these terms in close proximity to one another higher, and using the other aspects of ranking that have been found to work well for Web search.
In design, studies suggest a relationship between query length and the width of the entry form; small forms discourage long queries and wide forms encourage longer queries.
Auto-complete (or auto-suggest or dynamic query suggestions) which is shown in real time has greatly improved query specification.
In some Web search engines the query is run immediately; on others the user must hit the Return key or click the Search button.
When displaying the search results, the document surrogate refers to the information that summarizes the document. The text summary containing text extracted from the document is also critical for assessment of retrieval results. Several studies have shown that longer results are deemed better than shorter ones for certain types of information needs. Users engaged in _known-item searching_ tend to prefer short surrogates that clearly indicate the desired information.
One of the most important query reformulation techniques consists of showing terms related to the query or to the documents retrieved in response to the query. Usually only one suggested alternative is shown; clicking on that alternative re-executes the query. Search engines are increasingly employing related term suggestions, referred to as _term expansion_. Relevance feedback has been shown in non-interactive or artificial settings to be able to greatly improve rank ordering, but this method has not been successful from a usability perspective.
In organizing search results, a category system is a set of meaningful labels organized in such a way as to reflect the concepts relevant to a domain. The most commonly used category structures are _flat_, _hierarchical_, and _faceted_ categories. Flat categories are simply lists of topics or subjects. Hierarchical organization online is most commonly seen in desktop file system browsers. It is, however, difficult to maintain a strict hierarchy when the information collection is large. Faceted metadata has become the primary way to organize Web site content and search results; it consists of a set of categories (flat or hierarchical).
Usability studies find that users like, and are successful, using faceted navigation, if the interface is designed properly, and that faceted interfaces are overwhelmingly preferred for collection search and browsing as an alternative to the standard keyword-and-results listing interface.
Clustering refers to the grouping of items according to some measure of similarity. The greatest advantage is that it is fully automatic and can be easily applied to any text collection. The disadvantages include an unpredictability in the form and quality of results.
One drawback of faceted interfaces compared to clusters is that the categories of interest must be known in advance. The largest drawback is that in most cases the category hierarchies are built by hand.
## 2.4 Visualization in Search Interfaces
Information visualization has become a common presence in news reporting and financial analysis, but visualization of inherently abstract information is more difficult, and visualization of textually represented information is especially challenging.
When using boolean syntax, Hertzum and Frøkjær found that a simple Venn diagram representation produced more accurate results.
One of the best known experimental visualizations is the TileBars interface, in which documents are shown as horizontal glyphs with the locations of the query term hits marked along the glyph. Variations of the TileBars display have been proposed, including a simplified version which shows only one square per query term and uses color gradation to show query term frequency. Other approaches to showing query term hits within document collections include placing the query terms in bar charts, scatter plots, and tables. In a usability study that compared five such views to the Web-style listing, there were no significant differences in task effectiveness, except for bar charts, which were significantly worse. All conditions had significantly higher mean task times than the Web-style listing.
Evaluations that have been conducted so far provide negative evidence as to their usefulness. For text mining, most users of search systems are not interested in seeing how words are distributed across documents.
## 2.5 Design and Evaluation of Search Interfaces
Based on years of experience, a set of practices and guidelines has been developed to facilitate the design of successful interfaces. These practices are collectively referred to as _user-centered design_, which emphasizes building the design around people's activities and thought processes, rather than the converse.
The quality of a user interface is determined by how people respond to it. Subjective responses are as, if not more, important than quantitative measures, because if a person has a choice between two systems, they will use the one they prefer.
Another major evaluation technique that has become increasingly popular in the last few years is to perform experiments on a massive scale on already heavily-used Web sites. This approach is often referred to as _bucket testing_, _A/B testing_, or _parallel flights_. Participants do not knowingly "opt in" to the study; rather, visitors to the site are selected to be shown alternatives without their knowledge or consent. A potential downside to such studies is that users who are familiar with the control version of the site may initially react to the unfamiliarity of the interface.
# 3 - Modeling
## 3.1 IR Models
Modeling in IR is a process aimed at producing a ranking function that assigns scores to documents for a given query.
### 3.1.2 Characterization of an IR Model
An information retrieval model is a quadruple $[D, Q, F, R(q_i, d_j)]$ where:
- $D$ is the set of representations of the documents in the collection.
- $Q$ is the set of queries representing the user information needs.
- $F$ is a framework for modeling document representations, queries, and their relationships (e.g. sets and Boolean relations, vectors and linear algebra operations, sample spaces and probability distributions).
- $R(q_i, d_j)$ is a ranking function that assigns a score to document $d_j$ for query $q_i$, ordering the documents by relevance.
#### Term and Document frequency | TF-IDF
The __raw frequency__ of a term in a document is simply how many times it occurs, but the relevance of the document does not increase proportionally with the term frequency (10 times more occurrences does not mean 10 times more relevant). __Log frequency__ weighting dampens this growth:
$$tf_{i,j} = \left\{
\begin{array}{rr}
1+\log f_{i,j} \quad \text{if } f_{i,j} > 0 \\
0 \quad \text{otherwise}
\end{array} \right.$$
The logarithm base is not important, just be consistent with the choice.
We want high scores for frequent terms, but we want an even higher score for a rare, descriptive term. Such terms have good discrimination value, and their score is captured by the __Inverse Document Frequency__:
$$ idf_i = \log \frac{N}{n_i} $$
Where $N$ is the total number of documents in the collection and $n_i$ is the number of documents which contain keyword $k_i$.
If we combine these two scores we get the best known weighting scheme in IR: __the TF-IDF weighting scheme__. This measure increases with the number of occurrences within a document and with the rarity of the terms in the collection.
$$w_{i,j} = \left\{
\begin{array}{r r}
(1+\log f_{i,j}) \cdot \log\frac{N}{n_i} \quad \text{if $f_{i,j} > 0$}\\
0 \quad \text{otherwise}
\end{array} \right.$$
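A minimal Python sketch of this weighting scheme; the function name and the toy numbers are my own, not from the text:

```python
import math

def tf_idf(f_ij, n_i, N):
    """TF-IDF weight of term k_i in document d_j.

    f_ij: raw frequency of the term in the document
    n_i:  number of documents containing the term
    N:    total number of documents in the collection
    """
    if f_ij == 0:
        return 0.0
    return (1 + math.log2(f_ij)) * math.log2(N / n_i)

# Toy example: a term occurring 3 times in a document,
# appearing in 10 out of 1000 documents in the collection.
print(tf_idf(3, 10, 1000))  # (1 + log2 3) * log2 100 ≈ 17.2
```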
## 3.2 Classical Similarity Models
### Boolean model
Based on set theory and Boolean algebra. Simple and intuitive, with precise semantics. No partial matching or ranking, so it is more of a data retrieval model than an IR model. Queries can be hard to translate into Boolean expressions.
### Vector Space model
Assigns weights to index terms in queries and documents, used to calculate the _degree of similarity_ between a query and documents in the collection. This produces a ranked result. Its term-weighting scheme improves retrieval performance and its partial matching strategy allows retrieval of documents that approximate the query conditions. Document length normalization is also naturally built into the ranking. The main disadvantage of the model is that it assumes all index terms are mutually independent.
__Ranking__ in the Vector Space model: The query and documents are modeled as vectors. The similarity between a query and a document is calculated with the __Cosine Similarity__:
$$ sim(d_j , q) = \frac{\vec{d_j} \cdot \vec{q}}{\left|\vec{d_j}\right|\left|\vec{q}\right|} = \frac{\sum _{i=1}^t w_{ij}\cdot w_{iq} }{\sqrt{\sum _{i=1}^tw_{ij}^2}\cdot \sqrt{\sum _{i=1}^tw_{iq}^2}} $$
This basically measures how much the query vector and the document vector are "pointing in the same direction".
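A minimal sketch of the cosine similarity over two toy weight vectors; the names and numbers are illustrative:

```python
import math

def cosine_similarity(doc_vec, query_vec):
    """Cosine of the angle between a document and a query weight vector."""
    dot = sum(d * q for d, q in zip(doc_vec, query_vec))
    norm_d = math.sqrt(sum(d * d for d in doc_vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Toy TF-IDF weight vectors over a 4-term vocabulary.
print(cosine_similarity([0.5, 0.0, 1.2, 0.3], [1.0, 0.0, 1.0, 0.0]))
```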
### Probabilistic model
There exists a subset of the document collection that is relevant to a given query. A probabilistic retrieval model ranks documents by the probability that the document is relevant to the query. The advantage of this model is that documents are ranked in decreasing order of their probability of being relevant. The main disadvantage is the need to guess the initial separation of documents into relevant and non-relevant sets.
__Ranking__ in the Probabilistic model: The similarity function uses the __Robertson-Sparck Jones__ equation (here without relevance information):
$$ sim(d_j , q) \sim \sum_{k_i \in q \wedge k_i \in d_j } \log \frac{N - n_i + 0.5}{n_i + 0.5} $$
## 3.5 Alternative Probabilistic Models
#### BM25 (Best Match 25)
Extends the classic probabilistic model with information on term frequency and document length normalization. The ranking formula combines the equations for BM11 and BM15, and can be written as
$$ sim_{BM25}(D, Q) = \sum _{i=1}^n IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{avgdl})} $$
where $f(q_i, D)$ is $q_i$'s term frequency in the document $D$, $|D|$ is the length of the document $D$ in words, $avgdl$ is the average document length in the text collection, $k_1$ and $b$ are free parameters, and $IDF$ is the inverse document frequency given by
$$ IDF(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} $$
where $N$ is the total number of documents in the collection, and $n(q_i)$ is the number of documents containing $q_i$.
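A minimal sketch of BM25 scoring for a single document; $k_1 = 1.2$ and $b = 0.75$ are common default choices rather than values prescribed above, and the toy collection statistics are made up:

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, N, avgdl, k1=1.2, b=0.75):
    """BM25 score of one document for a query.

    query_terms: list of query terms
    doc_terms:   list of terms in the document
    doc_freq:    dict term -> number of documents containing the term
    N:           total number of documents; avgdl: average document length
    """
    score = 0.0
    dl = len(doc_terms)
    for q in query_terms:
        f = doc_terms.count(q)                      # term frequency f(q_i, D)
        if f == 0:
            continue
        n_q = doc_freq.get(q, 0)
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5))
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))
    return score

doc = "the quick brown fox jumps over the lazy dog".split()
print(bm25_score(["fox", "dog"], doc, {"fox": 40, "dog": 120}, N=1000, avgdl=9))
```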
#### Language model
A language model is a function that puts a probability measure over strings drawn from some vocabulary. The idea is to define a language model for each document in the collection and use it to inquire about the likelihood of generating a given query, instead of using the query to predict the likelihood of observing a document.
##### Formal Language (Model)
Based on finite state machines or grammatical rules.
##### Stochastic Language Models
The probability that a string $s$ can be generated by the language model $M$.
##### Smoothing
Uses statistics for the entire document collection to avoid assigning zero probability to terms that are not in the document. One popular technique is to move some probability mass from the (query) terms that occur in the document to the terms that do not.
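A minimal sketch of query-likelihood scoring with one popular smoothing technique, Jelinek-Mercer style linear interpolation between the document model and the collection model; the value of λ is an illustrative choice:

```python
def query_likelihood(query, doc, collection, lam=0.5):
    """P(query | document model), with linear-interpolation smoothing.

    query, doc: lists of terms; collection: list of all terms in the collection.
    A term missing from the document still gets non-zero probability
    from the collection statistics.
    """
    score = 1.0
    for term in query:
        p_doc = doc.count(term) / len(doc)
        p_coll = collection.count(term) / len(collection)
        score *= lam * p_doc + (1 - lam) * p_coll
    return score

doc = "information retrieval deals with documents".split()
collection = doc + "users search the web for information".split()
# "search" does not occur in the document, yet the score stays non-zero.
print(query_likelihood(["information", "search"], doc, collection))
```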
# 4 - Retrieval Evaluation
## 4.3 Retrieval Metrics
### Precision & Recall
__Precision__ is the fraction of _retrieved_ documents that are _relevant_. $Precision = P(relevant|retrieved)$
__Recall__ is the fraction of _relevant_ documents that are _retrieved_. $Recall = P(retrieved|relevant)$
These values measure the quality of the retrieved information: how much relevant information did it contain, and how much relevant information did the search miss. Precision and recall are defined on unranked sets. When dealing with __ranked lists__ you compute the measures at each position in the returned list, from the top and down (top 1, top 2, etc.). This gives you the __precision-recall curve__. The __interpolation__ of this result is simply to let each precision value be the maximum of the precisions at all later (higher-recall) points. This removes the "jiggles" in the plot, and is necessary to compute the precision at recall level 0. A recall level is typically a number from 0 to 10, where 0 means recall = 0 and 10 means recall = 100%.
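A minimal sketch of the 11-point interpolated precision described above, for a single toy ranking:

```python
def interpolated_precision(ranked_relevance, total_relevant):
    """11-point interpolated precision from a ranked result list.

    ranked_relevance: list of booleans, True if the document at that rank is
    relevant. Interpolated precision at recall level r is the maximum
    precision observed at any recall >= r.
    """
    precisions, recalls = [], []
    hits = 0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
            recalls.append(hits / total_relevant)
    levels = [i / 10 for i in range(11)]          # recall levels 0.0 .. 1.0
    interp = []
    for r in levels:
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        interp.append(max(candidates) if candidates else 0.0)
    return list(zip(levels, interp))

# Toy ranking: relevant documents at ranks 1, 3 and 6, out of 3 relevant in total.
print(interpolated_precision([True, False, True, False, False, True], 3))
```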
Precision and recall are complementary values and should always be reported together. If reported separately, it is easy to make one value arbitrarily high.
### Single value summaries
#### R-precision
In a search where there are $R$ relevant documents, the R-precision is the fraction of relevant documents in the $R$ first results. Or in other words: precision at the R-th position.
#### The Harmonic Mean / F-measure
This is a combination of Precision and Recall. The Harmonic Mean of the j-th document is given by:
$$F(j) = \frac{2}{\frac{1}{R}+\frac{1}{P}} = \frac{2PR}{P+R}$$
Where $R$ and $P$ are the recall and precision at position $j$ in the ranking.
It is 0 when no relevant documents are retrieved, and 1 when all ranked documents are relevant. A high value of $F$ signals a good compromise between precision and recall.
#### E-measure
Combines recall and precision into a single measure. There is a parameter $\beta$ for giving extra weight to either precision or recall. With $\beta = 1$ the E-measure is the complement of the F-measure; in other words, $F(j) = 1 - E(j)$.
$$E_\beta = 1 - F_\beta = 1 - \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$$
$\beta > 1$ weights recall more.
$\beta < 1$ weights precision more.
#### Mean Average Precision | MAP
MAP is Average Precision averaged across multiple queries/rankings. For one ranking it is a single value summary obtained by averaging the precision values observed after each new relevant document (P@n); these per-query averages are then averaged over all queries.
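A minimal sketch of average precision and MAP over toy relevance judgments, assuming the per-query average is taken over the total number of relevant documents, so missed relevant documents pull the score down:

```python
def average_precision(ranked_relevance, total_relevant):
    """Mean of the precision values obtained after each relevant document."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total_relevant if total_relevant else 0.0

def mean_average_precision(rankings):
    """rankings: list of (ranked_relevance, total_relevant) pairs, one per query."""
    return sum(average_precision(r, n) for r, n in rankings) / len(rankings)

print(mean_average_precision([
    ([True, False, True], 2),          # AP = (1/1 + 2/3) / 2
    ([False, True, False, True], 2),   # AP = (1/2 + 2/4) / 2
]))
```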
#### Mean Reciprocal Rank | MRR
Used when we're interested in finding how high in a rankinglist for a query a relevant hit is. Used when we are interested in the first correct answer to a given query or task. Given $\mathcal{R}_i$, the ranking relative to a query $q_i$; $S_h$, a threshold for the lowest acceptable ranking position; and $S_{correct}$, a function which returns the position of the first correct answer in $\mathcal{R_i}$, we have the reciprocal rank:
$$RR_i = \begin{cases}
\frac{1}{S_{correct}(\mathcal{R}_i)} & \text{if } S_{correct}(\mathcal{R}_i) \leq S_h\\
0 & \text{otherwise}
\end{cases}$$
The mean reciprocal rank is the average of all these for a set $Q$ of $N_q$ queries:
$$MRR(Q) = \frac{1}{N_q}\sum_{i=1}^{N_q}RR_i$$
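A minimal sketch of MRR with a threshold $S_h$, using made-up ranks of the first correct answer per query:

```python
def mean_reciprocal_rank(first_correct_positions, threshold):
    """first_correct_positions: for each query, the rank of the first correct
    answer (None if no correct answer was returned). Positions beyond the
    threshold S_h contribute 0."""
    rrs = []
    for pos in first_correct_positions:
        if pos is not None and pos <= threshold:
            rrs.append(1 / pos)
        else:
            rrs.append(0.0)
    return sum(rrs) / len(rrs)

# First correct answer at rank 1, at rank 3, and not found, with S_h = 5.
print(mean_reciprocal_rank([1, 3, None], threshold=5))  # (1 + 1/3 + 0) / 3
```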
### User-oriented measures
#### Coverage ratio
Defined as the fraction of the documents known and relevant that are in the answer set.
#### Novelty ratio
Defined as the fraction of relevant documents in the answer set that are not known to the user.
#### Relative recall
The ratio between the number of relevant documents found and the number of relevant documents the user expected to find.
#### Recall effort
The ratio between the number of relevant documents the user expected to find and the number of documents examined in an attempt to find the expected relevant documents.
# 5 - Relevance Feedback and Query Expansion
A process to improve the first formulation of a query to make it closer to the user intent. Often done iteratively (cycle).
__Query Expansion (QE)__: Information related to the query used to expand it.
__Relevance Feedback (RF)__: Information on relevant documents explicitly provided by the user for a query.
## 5.4 Explicit feedback (Manual)
Feedback information provided by the users inspecting the retrieved documents.
### Vector Space Model
The _Vector Space Model_ has three, almost identical, methods to improve a user's query:
|| Rocchio Algorithm || $\vec{q_m} = \alpha \vec{q} + \frac{\beta}{\left|D_r\right|} \sum_{\vec{d_j} \in D_r} \vec{d_j} - \frac{\gamma}{|D_n|} \sum_{\vec{d_j} \in D_n} \vec{d_j}$ ||
|| Ide Regular || $\vec{q_m} = \alpha \vec{q} + \beta \sum_{\vec{d_j} \in D_r} \vec{d_j} - \gamma \sum_{\vec{d_j} \in D_n} \vec{d_j}$ ||
|| Ide «dec hi» || $\vec{q_m} = \alpha \vec{q} + \beta \sum_{\vec{d_j} \in D_r} \vec{d_j} - \gamma MaxRank(D_n)$ ||
Where $\vec{q_m}$ is the modified query, $\vec{d_j}$ is a weighted term vector associated with document $d_j$, $D_r$ is the set of relevant documents retrieved, $D_n$ is the set of non-relevant documents, and $\alpha$, $\beta$ and $\gamma$ are constants. (The absolute value means the number of documents in the set.)
The basic idea behind all three is to reformulate the query such that it gets closer to the neighborhood of relevant documents in the vector space and away from the neighborhood of the non-relevant documents. The __differences__ between the methods are that _Rocchio_ normalizes by the number of relevant and non-relevant documents, while _Ide regular_ does not, and _Ide «dec hi»_ uses only the highest ranked non-relevant document instead of the sum of all of them.
They all yield __similar results__ and __improved performance__ (precision and recall). They use both _Query Expansion_ and _Term Re-weighting_, and they are __simple__ (because they compute the modified term weights directly from the set of retrieved documents).
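A minimal sketch of the Rocchio reformulation over term-weight dictionaries; the values of α, β and γ are illustrative defaults, not prescribed by the text, and negative weights are clipped to zero as is commonly done:

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """query: dict term -> weight; relevant / non_relevant: lists of such dicts."""
    terms = set(query)
    for d in relevant + non_relevant:
        terms.update(d)
    modified = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta / len(relevant) * sum(d.get(t, 0.0) for d in relevant)
        if non_relevant:
            w -= gamma / len(non_relevant) * sum(d.get(t, 0.0) for d in non_relevant)
        modified[t] = max(w, 0.0)   # negative weights are usually clipped to zero
    return modified

q = {"jaguar": 1.0}
rel = [{"jaguar": 0.8, "cat": 0.6}]
nonrel = [{"jaguar": 0.5, "car": 0.9}]
print(rocchio(q, rel, nonrel))   # "cat" is added to the query, "car" is suppressed
```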
### Probabilistic model
The Relevance Feedback procedure for the Probabilistic model uses statistics found in retrieved documents.
$P(k_i | R) = \frac{\left|D_{r,i}\right|}{\left|D_r\right|}$ and $P(k_i | \bar{R}) = \frac{n_i - \left|D_{r,i}\right|}{N - \left|D_r\right|}$
Where $D_r$ is the set of relevant and retrieved documents, and $D_{r,i}$ is the set of relevant and retrieved documents that contain $k_i$.
The main __advantage__ of this relevance feedback procedure is that the feedback process is directly related to the derivation of new weights for the query terms. The __disadvantages__ are that it uses only term re-weighting (no Query Expansion), that document term weights are not incorporated, and that the initial query term weights are disregarded.
## 5.5 Implicit feedback (Automated)
No participation of users in the feedback process.
### Local Analysis
#### Clusters
Clusters are a way of deriving synonymy relationships between local terms, by building association matrices that quantify term correlations.
__Association clusters__ are based on the frequency of co-occurrence of terms inside documents; they do not take into account where in the document the terms occur.
__Metric clusters__ are based on the idea that two terms occurring in the same sentence tend to be more correlated, and factor the distance between the terms in the computation of their correlation factor.
__Scalar clusters__ are based on the idea that two terms with similar neighborhoods have some synonymy relationship, and uses this to compute the correlations.
### Global Analysis
Determine term similarity through a pre-computed statistical analysis on the complete collection. Expand queries with statistically most similar terms. Two methods:
__Similarity Thesaurus__ and __Statistical Thesaurus__. (A thesaurus provides information on synonyms and semantically related words and phrases.) Increases recall, may reduce precision.
# 6 - Text Operations
## 6.6 Document Preprocessing
There are 5 text transformations (operations) used to prepare a document for indexing.
|| __Lexical analysis__ || The process of tokenizing a document. A token is a string with an identified meaning. Also removes characters with little value (such as numbers). ||
|| __Stopword elimination__ || Filter out words with low discrimination values. ||
|| __Stemming__ || Remove affixes (prefixes and suffixes). ||
|| __Index term selection__ || Select words, stems or group of words to be used as indexing elements. ||
|| __Thesauri__ || Construct a standard vocabulary by categorizing words within the domain. ||
#### Lexical analysis
_Numbers_ are of little value alone and should be removed. Combinations of numbers and words could be kept. _Hyphens_ (dashes) between words should be replaced with whitespace. _Punctuation marks_ are usually removed, except when dealing with program code etc. To deal with the _case of letters_; convert all text to either upper or lower case.
#### Stopword elimination
Stopwords are words occurring in over 80% of the documents in the collection. These are not good candidates for index terms and should be removed before indexing (this can reduce the index size by 40%). Stopwords typically include articles, prepositions and conjunctions, and sometimes certain verbs, adverbs and adjectives. Lists of popular stopwords can be found on the Internet.
#### Stemming
The idea of stemming is to let any inflected form of a word (conjugation or plural) produce the same result in a query. This improves retrieval performance and reduces the index size. There are four types of stemming: _Affix removal, Table Lookup, Successor variety, N-grams_.
Affix removal is the most intuitive, simple and effective and is the focus of the course, with use of the _Porter Algorithm_.
#### Index term selection
Instead of using a full text representation (using all the words as index terms), we often select the terms to be used. This can be done manually, by a specialist, or automatically, by identifying _noun groups_. Most of the semantics in a text is carried by the nouns, but often not by single nouns alone (e.g. _computer science_). A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold.
#### Thesauri
To assist the user in choosing proper query terms, a thesaurus can be constructed. This provides a hierarchy that allows broadening and narrowing of a query. It is built by creating a list of important words in a given domain and, for each word, providing a set of related words derived from synonymy relationships.
## 6.8 Text compression
There are three basic models used to compress text; _Adaptive_, _Static_ and _Semi-static_. (A __model__ is the probability distribution for the next symbol. __Symbols__ can be characters and words.)
The __adaptive__ model passes over the text once and progressively learn the statistical distribution and encodes the text. Good for general purpose, but decompression must start from the beginning (no direct access).
The __static__ model assumes an average distribution for all input text and it uses one pass to encode the text. It can have poor performance (compression) when the text deviates from the average. (e.g. english literature vs. financial text). This model allows for direct access (?).
The __semi-static__ model passes over the text twice; first for modeling, second for compressing. This ensures a good compression and allows direct access. The disadvantage is the need for two passes and the model should be stored and sent to the decoder.
Using the above models, there are two __general approaches to text compression__: _statistical_ and _dictionary-based_.
A __statistical__ approach has two parts: _(1) the modeling_, which estimates a probability for each next symbol, and _(2) the coding_, which encodes the next symbol as a function of the probability assigned to it by the model. The model can be character- or word-based (treating characters or words as the symbols), while the coder (Huffman, dense codes, arithmetic coding) establishes a representation (codeword) for each source symbol.
A __dictionary__ approach achieves compression by replacing groups of consecutive symbols, or phrases (sequences of symbols), with a pointer to an entry in a dictionary. There is no distinction between modeling and coding in this approach. _Ziv-Lempel_ is an adaptive method and _Re-Pair_ is a semi-static one.
### Huffman coding
Huffman coding is a prefix code for lossless data compression, commonly used to compress, or code, the words or characters of a text. It assigns a variable-length bit code to each word or character (called a node), where the shortest code is assigned to the most frequent node. The codes are defined by a *Huffman tree*.
#### Normal Huffman tree
A normal Huffman tree is built by sorting the nodes by frequency. The two lowest-frequency nodes are joined, and their combined frequency is put back into the sorted queue; this is repeated until a single tree remains. An example can be seen below:

The codes can be extracted from the tree by following the edges. Left edge: 0, right edge: 1.
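A minimal sketch of building a normal Huffman tree with a priority queue and reading the codes off its edges; the frequencies are made up:

```python
import heapq

def huffman_codes(frequencies):
    """frequencies: dict symbol -> count. Returns dict symbol -> bit string."""
    # Each heap entry: (frequency, tie-breaker, node); a node is either
    # a symbol (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)     # the two lowest-frequency nodes ...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))  # ... are joined
        counter += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")       # left edge: 0
            walk(node[1], prefix + "1")       # right edge: 1
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"a": 5, "b": 2, "c": 1, "d": 1}))
# e.g. {'b': '00', 'c': '010', 'd': '011', 'a': '1'}
```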
The shortcoming of the normal Huffman tree is how it handles equal frequencies. If two nodes have the same frequency, one of them is chosen arbitrarily. This gives potentially multiple, equally correct, Huffman trees from the same input. To cope with this we can define a *Canonical* Huffman tree.
#### Canonical Huffman tree
The key with the Canonical Huffman tree is to impose an order such that the same input will yield the same output every time. The subtrees and leaves are ordered, and the Canonical Huffman trees and codes will always be the same.
Unfortunately there are several methods of achieving this order. I will describe three such methods here. Examples of the methods can be found in [this document](http://docdro.id/GuPbkDZ).
**Method 1:** This is the method presented by the book. It assigns the shortest possible code to the most frequent node, but it tends to create a "deep" tree.
1. Order the original list of nodes by frequency and then by content (i.e. the piece of text or code that the node represents)
2. Start assigning the two lowest frequent nodes as leaf nodes.
3. For each more frequent leaf, move «one step up» the tree.
4. If more than two nodes are equally frequent, assign two nodes to that level, move one level up and assign the next (two) nodes.
5. When you are finished, you will see that the most frequent node will be assigned the shortest code: 1.
**Method 2:** This is a more visual method, using your normal Huffman tree and shifting subtrees and nodes around.
1. Take your normal Huffman tree
2. Shift subtrees to the right so that the subtrees to the right of any node are equally deep or deeper than the node's own subtree.
3. Sort codes of the same length alphabetically to find your canonical Huffman tree.
**Method 3:** This is a systematic approach to achieving the same tree as above. It is the method found on Wikipedia.
1. Take the codes assigned from creating the original Huffman tree and sort them by code length and then by content
2. The most frequent node gets assigned a codeword which is the same length as the node's original codeword but all zeros.
3. Each subsequent node is assigned the next binary number in sequence, ensuring that following codes are always higher in value.
4. When you reach a longer codeword, then after incrementing, append zeros until the length of the new codeword is equal to the length of the old codeword. This can be thought of as a left shift.
5. Next, construct your tree.
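A minimal sketch of method 3: starting from the code lengths of an ordinary Huffman tree, assign canonical codewords by counting upward and left-shifting whenever the code length increases. The lengths below are taken from the normal Huffman tree sketched earlier:

```python
def canonical_codes(code_lengths):
    """code_lengths: dict symbol -> length of its original Huffman code.
    Returns dict symbol -> canonical bit string."""
    # Sort by code length, then by symbol content, as in method 3.
    ordered = sorted(code_lengths.items(), key=lambda kv: (kv[1], kv[0]))
    codes = {}
    code = 0
    prev_len = ordered[0][1]
    for i, (sym, length) in enumerate(ordered):
        if i > 0:
            code += 1                         # next binary number in sequence
            code <<= (length - prev_len)      # append zeros if the code got longer
        codes[sym] = format(code, "0{}b".format(length))
        prev_len = length
    return codes

print(canonical_codes({"a": 1, "b": 2, "c": 3, "d": 3}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```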
An example (of method 3) can be seen below:

# 9 - Indexing and searching
## 9.2 Inverted Indexes
__Inverted__ refers to the fact that the index maps terms to the documents that contain them, the inverse of reading documents to find terms; with positional information the original text can even be reconstructed from the index.
__Vocabulary__: the set of all different words in the text. Low space requirement, usually kept in memory.
__Occurrences__: the (position of) words in the text. More space requirement, usually kept on disk.
__Basic Inverted Index__: The oldest and most common index. Keeps track of terms: in _which_ documents and _how many_ times they occur.
__Full Inverted Index__: This keeps track of the same things as the basic index, in addition to _where_ in the document the terms occur (position).
__Block addressing__ can be used to reduce space requirements. This is done by dividing text into blocks and let the occurrences point to the blocks.
__Searching__ in inverted indexes is done in three steps:
1. __Vocabulary search__ - words in queries are isolated and searched separately.
2. __Retrieval of occurrences__ - retrieving the occurrences of all the words.
3. __Manipulation of occurrences__ - occurrences processed to solve phrases, proximity or Boolean operations.
__Ranked retrieval__: When dealing with _weight-sorted_ inverted lists we want the _best_ results. Sequentially searching through all the documents is time-consuming; we just want the _top-k_ documents. This is trivial with a single-word query: the list is already sorted and you return the first _k_ documents. For other queries we need to merge the lists (see _Persin’s algorithm_).
When __constructing__ a full-text inverted index there are two sets of algorithms and methods: __internal algorithms__ and __external algorithms__. The difference is whether or not the text and the index fit in internal (main) memory. The former are relatively simple and low-cost, while the latter need to write partial indexes to disk and then merge them into one index file.
In general, there are three different ways to __maintain__ an inverted index:
- __Rebuild__, simple on small texts.
- __Incremental updates__, done while searching only and when needed.
- __Intermittent merge__, new documents are indexed and the new index is merged with the existing. This is, in general, the best solution.
Inverted indexes can be __compressed__ in the same way as documents (chapter 6.8). Some popular coding schemes are: _Unary_, _Elias_-$\gamma$, _Elias_-$\delta$ and _Golomb_.
__Heaps’ Law__ estimates the number of distinct words in a document or collection, i.e. it predicts the growth of the vocabulary size:
$V = Kn^\beta$, where _n_ is the size of the document or collection (number of words), and typically $10 < K < 100$ and $0 < \beta < 1$.
__Zipf’s law__ approximates the distribution of word frequencies in the collection. It states that if $t_1$ is the most common word in the collection, $t_2$ the next most common, and so on, then the frequency $f_i$ of the _i_-th most common word is proportional to $\frac{1}{i}$. That is: $f_i = \frac{c}{i}$, where _c_ is a constant.
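A minimal sketch of both laws with illustrative parameter values; $K$, $\beta$ and $c$ vary from collection to collection:

```python
def heaps_vocabulary(n_words, K=30, beta=0.5):
    """Heaps' law: predicted number of distinct words in a text of n_words tokens."""
    return K * n_words ** beta

def zipf_frequency(i, c):
    """Zipf's law: approximate frequency of the i-th most common word."""
    return c / i

# With these illustrative parameters, a 1-million-word collection is predicted
# to have about 30 000 distinct words, and the 2nd most common word occurs
# about half as often as the most common one.
print(heaps_vocabulary(1_000_000), zipf_frequency(2, c=50_000))
```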
## 9.3 Signature Files
Signature files are word-oriented index structures based on hashing. They have poorer performance than inverted indexes, since they force a sequential search over the index, but they are suitable for small texts.
A signature file uses a __hash function__ (or «signature») that maps words to bit masks of a given size. The text is divided into blocks, and each block's mask is obtained by bitwise OR-ing the signatures of all the words in the block.
To __search__ a signature file you hash a word to get a bit mask, then compare that mask with each bit mask of all the text blocks. Collect the candidate blocks and perform a sequential search for each.
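A minimal sketch of signature-file search; the mask size and the number of bits set per word are illustrative choices, and false matches are expected and must be removed by the sequential search:

```python
def signature(word, mask_bits=16, hashes=2):
    """Hash a word to a bit mask with a small number of bits set."""
    sig = 0
    for i in range(hashes):
        sig |= 1 << (hash(word + str(i)) % mask_bits)
    return sig

def block_mask(words, mask_bits=16):
    mask = 0
    for w in words:                      # OR together the word signatures
        mask |= signature(w, mask_bits)
    return mask

def candidate_blocks(query_word, blocks, mask_bits=16):
    q = signature(query_word, mask_bits)
    # A block qualifies if every bit of the query signature is set in its mask.
    return [i for i, words in enumerate(blocks)
            if block_mask(words, mask_bits) & q == q]

blocks = [["information", "retrieval"], ["web", "search"], ["text", "mining"]]
# Always includes block 1 (the true match); other blocks may be false matches
# that a sequential search over the block would then discard.
print(candidate_blocks("search", blocks))
```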
## 9.4 Suffix Trees and Suffix Arrays
__Suffix trees__
This is a structure used for indexing, like the inverted index, when dealing with _large alphabets_ (Chinese, Japanese, Korean) or _agglutinative languages_ (Finnish, German).
A __Suffix trie__ is an ordered tree data structure built over the suffixes of a string. A __Suffix tree__ is a compressed trie. And a __Suffix array__ is a «flattened» tree.
These structures treat the whole text as one long string and divide it into suffixes, either by character or by word. Each suffix runs from its start point to the end of the text, making the suffixes progressively smaller.
e.g.:
- mississippi (1)
- ississippi (2)
- …
- pi (10)
- i (11)
These structures make it easier to search for substrings, but they have large __space requirements__: a tree takes up to 20 times the space of the text and an array about 4 times the text space.
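A minimal sketch of a character-level suffix array built by sorting suffix start positions (0-based), with substring search via binary search over the sorted suffixes:

```python
import bisect

def suffix_array(text):
    """Positions of all suffixes, sorted lexicographically by the suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, pattern):
    """All positions where pattern occurs, via binary search over the suffix array."""
    suffixes = [text[i:] for i in sa]
    lo = bisect.bisect_left(suffixes, pattern)
    hi = bisect.bisect_right(suffixes, pattern + "\uffff")  # all suffixes with this prefix
    return sorted(sa[lo:hi])

text = "mississippi"
sa = suffix_array(text)
print(occurrences(text, sa, "iss"))   # [1, 4]
```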
# 11 - Web Retrieval
## 11.4 Search engine Architectures
### Crawler-indexer architecture (centralized)

### Harvest architecture (distributed)

## 11.5 Link-based ranking
### HITS
__HITS__, or _Hyperlink-Induced Topic Search_, divides pages into two sets: _authorities_ and _hubs_. If page _i_ has many links pointing to it, it is called an authority because it is likely to contain authoritative and thus relevant content. If it contains many links to relevant documents (authorities), it is called a hub.
|| __Pros__ || __Cons__ ||
|| Quick on small neighborhood graphs || Hub- and authority-score must be calculated for each query ||
|| Query-oriented (ranking according to user relevance) || Weak against advertisement and spam (outgoing links) ||
$$H\left(p\right)=\sum _{u\in S \;|\; p\rightarrow u} A\left(u\right)$$
$$A\left(p\right)=\sum _{v\in S \;|\; v\rightarrow p} H\left(v\right)$$
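A minimal sketch of the HITS iteration on a made-up neighborhood graph; the normalization step is standard practice to keep the scores bounded, although it is not shown in the formulas above:

```python
import math

def hits(links, iterations=20):
    """links: dict page -> list of pages it points to (the neighborhood graph S)."""
    pages = set(links) | {v for targets in links.values() for v in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A(p) = sum of H(v) over pages v that link to p
        auth = {p: sum(hub[v] for v in pages if p in links.get(v, [])) for p in pages}
        # H(p) = sum of A(u) over pages u that p links to
        hub = {p: sum(auth[u] for u in links.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound
        a_norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        h_norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        auth = {p: a / a_norm for p, a in auth.items()}
        hub = {p: h / h_norm for p, h in hub.items()}
    return hub, auth

graph = {"a": ["b", "c"], "b": ["c"], "d": ["c"]}
hub, auth = hits(graph)
print(max(auth, key=auth.get))   # "c" is the strongest authority
```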
### PageRank
__PageRank__ differs from HITS in that it produces a ranking independent of a user’s query. The concept is that a page is considered important if it is pointed to by other important pages; the PageRank score for a page is determined by summing the PageRanks of all pages that point to it.
|| __Pros__ || __Cons__ ||
|| Robust against spam (create inlinks) || Difficult for new pages ||
|| Global measure (query independent) || Weak against inlink-farms ||
$$PR(p_i) = \frac{1 - d}{N} + d \cdot \sum _{p_j\in M(p_i)} \frac{PR(p_j)}{L(p_j)}$$
_where $d$ is the damping factor, $N$ is the number of pages in the collection, $M(p_i)$ is the set of pages pointing to $p_i$ and $L(p_j)$ is the number of outgoing links of page $p_j$._
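A minimal sketch of PageRank computed by power iteration on a made-up link graph; $d = 0.85$ is a commonly used damping value, not one prescribed by the text:

```python
def pagerank(links, d=0.85, iterations=50):
    """links: dict page -> list of pages it points to."""
    pages = set(links) | {v for targets in links.values() for v in targets}
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Sum PR(p_j) / L(p_j) over all pages p_j that point to p.
            incoming = sum(pr[q] / len(links[q]) for q in links if p in links[q])
            new_pr[p] = (1 - d) / N + d * incoming
        pr = new_pr
    return pr

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(graph))   # "c" gets the highest score, "d" (no in-links) the lowest
```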
Other methods are WebQuery and Most-Cited.
## Spam techniques
__Keyword stuffing__ and __hidden text__ involve misleading meta tags and excessive repetition of keywords hidden from the user with colors, stylesheet tricks, etc., left for the Web crawlers to find. Most search engines catch these nowadays.
__Doorway page__ is a page optimized for a single keyword and redirects to the real target page.
__Lander pages__ are optimized for a single keyword, or a misspelled domain name, designed to attract surfers who will then click on ads.
__Cloaking__ is a technique in which two distinct users are served different content; it is used to serve fake content to the search engine crawler and spam to real users.
__Link spam__ is about creating lots of links pointing to the page you want to promote, and put these links on pages with high PageRank.
__Search Engine Optimization (SEO)__ is a fine balance between spam and legitimate methods. It mostly involves buying recommendations or working hard for promotion.
# 12 - Web Crawling
--TODO--
# 14 - Multimedia Information Retrieval
## 14.1 Introduction
### 14.1.3 Text IR vs. Multimedia IR
There are several ways in which text retrieval differs from multimedia retrieval.
- __Basic units and structure__: Text has words as basic units, and structure is provided by punctuation and paragraphs. In multimedia, much work has to be put into defining the _semantic unit_ and _structure_ for the media. For instance:
- In video, a basic semantic unit may be one shot.
- In speech, the basic unit may be a word. The spacing between words may also convey meaning (smaller pause is a blank space, long pause is a full stop and even longer pause a paragraph).
- __Size__: Text requires the least space of all media types. Images and audio require a lot of space in comparison, and video even more.
- __Browsing__: Society has settled on a set of conventions for finding, summarizing and highlighting text. There are fewer conventions on how to summarize and highlight multimedia search results, and the way of presenting them is not as canonical as with text. Think for instance of the many different ways to present video and audio: Should there be a thumbnail? Should the thumbnail change when the user hovers the cursor over it? Should there be a button which presents a small preview? Should there be a text description of the media?
## 14.2 Challenges
A __feature__ is like an attribute, it holds some information about the media needed to index and search.

_The semantic gap. CC-BY-SA 4.0, Eirik Vågeskar._
__The semantic gap__ denotes the distance between the multimedia content and its meaning. The gap increases from text through speech, image and video to music (with music having the largest distance from its content to its meaning).
__Feature Ambiguity__ means the lack of global information needed to interpret content (e.g. _The Aperture Problem_).
__Data Models__ are used to provide a framework expressing the properties of the data to be stored and retrieved. In multimedia IR the data models __requirements__ are:
- __Extensibility__ - Possibility to add new data types.
- __Representation Ability__ - Ability to represent basic media types and composite objects.
- __Flexibility__ - Ability to specify, query and search items at different abstraction levels.
- __Storage and Search Efficiency__.
__Feature Extraction__ involves gathering information from an object to be used for indexing. There are four levels of features and attributes:
1. __Metadata__ - Formal or factual attributes of multimedia objects.
2. __Text Annotation__ - Text describing object content.
3. __Low Level Content Features__ - Capturing data patterns and statistics of multimedia object.
4. __High Level Content Features__ - Attempting to recognize and understand objects.
__Indexing__ is performed over the features extracted from the multimedia object.
## 14.3 Image Retrieval
There are two approaches to image retrieval: _Text-Based (TBIR)_ and _Content-Based (CBIR)_.
In __Text-Based Image Retrieval__ the features are annotation (tags), made either by people or automatically. The perception of an image is subjective and the annotations can vary (be imprecise). On the other hand, it is easy to capture high level abstractions and concepts, like «smile» and «sunset».
__Content-Based Image Retrieval__ is the task of retrieving images based on their contents. This is done by extracting one or several visual features and then use the corresponding representations in a matching procedure. To illustrate, consider the task of finding similar images to one that the user queries (Query By Example). This method ignores semantics and uses features like color, texture and shape. __Color-Based Retrieval__ can represent the image with a color histogram. This will be independent of e.g resolution, focal point and pose, and the retrieval process is to compare the histograms. __Texture-Based Retrieval__ extracts the repetitive elements in the image and uses this as a feature.
## 14.5 Video Retrieval and Indexing
### Shot detection (segmentation)
A __video shot__ is a logical unit or segment with properties like: from the same scene, a single camera motion, a distinct event or action; in short, a __single indexable event__. The steps in __Shot-based Video Indexing and Retrieval__ are to _segment_ the video into shots, _index_ each shot, apply a _similarity measurement_ between queries and video shots, and _retrieve_ shots with high similarities. We use two different approaches for detecting shots:
|| Color histogram difference || $\sum_{j} \left\lvert H_i(j) - H_{i+1}(j) \right\rvert$ ||
|| $\chi^2$ (Chi-square test) || $\sum_{j} \frac{(H_i(j) - H_{i+1}(j))^2}{H_{i+1}(j)}$ ||
_where i is frame number and j is gray level._
In words, one calculates the difference between the current frame and the next using their color histograms, and then selects an appropriate __threshold__ to decide whether the two frames belong to the same shot or not. Other detection techniques are: twin comparison, false shot detection, motion removal and edge detection.
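A minimal sketch of shot boundary detection with the color histogram difference; the threshold and the toy histograms are illustrative:

```python
def histogram_difference(hist_a, hist_b):
    """Sum of absolute bin differences between two gray-level histograms."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))

def detect_shot_boundaries(frame_histograms, threshold=100):
    """Indices i where a new shot is assumed to start between frame i-1 and i."""
    boundaries = []
    for i in range(len(frame_histograms) - 1):
        if histogram_difference(frame_histograms[i], frame_histograms[i + 1]) > threshold:
            boundaries.append(i + 1)
    return boundaries

# Toy 4-bin histograms for 4 frames; the jump between frame 1 and 2 is a cut.
frames = [[50, 30, 10, 10], [52, 28, 10, 10], [5, 5, 40, 50], [6, 4, 42, 48]]
print(detect_shot_boundaries(frames))   # [2]
```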
### Indexing and retrieval
__R-frames__ are frames picked to represent each shot, and these still images are processed for __feature extraction__ (as with image retrieval).
There are also other ways to extract features for a video, like __metadata__ (title, directors, video type, etc.), __annotation__ either manually or associated with transcripts or subtitles or even speech recognition on sound track.
### Video representation and abstraction
|| Micon || Explanation... ||
|| Video Streamer || ||
|| Clipmap || ||
|| Storyboard || ||
|| Mosaicking || ||
|| Video skimming || ||