TDT46: Information Retrieval
Curriculum from fall 2016.
# Tips
Do the quiz exercises at www.intoit.io . You get to test yourself and keep a better overview.
# A Survey of Named Entity Recognition and Classification
(by David Nadeau, Satoshi Sekine)
This is a non-exhaustive survey of 1991-2006 research in the NERC field, short for "named entity recognition and classification".
A named entity may be the name of a company, person or location, but perhaps more surprisingly also a time, date, money, percent or numeric expression.
## Observations
Activity overview:
1991 - 1995
: Few articles (found 8 in English)
1996 - now
: Activity accelerates and does not slow down.
### Language Factor
A good portion of the studies address English; perhaps even more address language independence.
Well studied:
- English
- German
- Spanish
- Dutch
- Japanese
- Chinese
- French
- Greek
- Italian
Some attention for:
- Basque
- Bulgarian
- Catalan
- Cebuano
- Danish
- Hindi
- Polish
- Romanian
- Russian
- Swedish
- Turkish
Arabic has started to receive attention in large-scale projects.
### Textual Genre and Domain
> The impact of textual genre (journalistic, scientific, informal, etc.) and domain
> (gardening, sports, business, etc.) has been rather neglected in the NERC literature.
The research that exists shows genre and domain have a major impact: going from newswire texts to manual translations of phone conversations and technical e-mails, the performance of all systems drops.
### Entity Type
Named entity - "named" refers to a rigid designator, e.g. "Ford" or "Ford Motor Company" rather than "the automotive company created by Henry Ford". Examples include:
- Proper names
- Biological species
- Substances
- Numerical expressions
- Temporal expressions
Temporal expressions do not include unspecific terms like "June", which may be June in any year.
#### Categorization
Often categorized and further subcategorized. Some common categories:
- enamex (subclass of proper names, most studied)
    - persons
    - locations
    - organizations
- timex
    - date
    - time
- numex
    - money
    - percent
Also cases of marginal types for special purposes like "email address", "phone number", "project name" and so on.
(*miscellaneous* has been used about proper names outside enamex)
#### Open Domain
No limits on possible types. Some recent research. Also some examples of extremely many categories (around 200).
### What's Next?
- Multimedia
- Semi- and unsupervised approaches
- Complex linguistic phenomena
- Large-scale projects - integration of NERC and Machine Translation
Stay tuned.
## Machine Learning Techniques
The ability to recognize previously unknown entities is essential. Systems are mostly based on rules, either created by hand or learned with machine learning. Machine learning is preferred today when enough training data is available.
### Supervised Learning
Typically transforms a large annotated corpus into rules and memoized lists.
Baseline method: Tag words in the test corpus when they are annotated in the training corpus (simple memoization). The performance of this baseline depends on the *vocabulary transfer*.
#### Vocabulary Transfer
Measures the success achievable by simple memoization - proportion of unique words appearing in both training and testing corpus.
A vocabulary transfer of 21% was measured on the MUC-6 training data. Highlights:
- 42% for location names
- 17% for organization names
- 13% for person names
MUC-7 corpus:
- 76% for locations
- 49% for organizations
- 26% for persons
Precision: 70%-90%. Enamex (total): 76% precision, 48% recall
The measure is pessimistic - more occurrences may be recognized due to repetition.
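A minimal sketch of the vocabulary-transfer measure described above, assuming each corpus is reduced to the set of unique (lowercased) entity words it contains; the function name and the toy data are illustrative only.

```python
# Minimal sketch: vocabulary transfer between a training and a test corpus.
# Assumes each corpus is given as a set of unique entity words (lowercased);
# the toy data below is made up, not from the survey.

def vocabulary_transfer(train_entity_words, test_entity_words):
    """Proportion of unique entity words in the test corpus that also
    occur as entity words in the training corpus."""
    if not test_entity_words:
        return 0.0
    return len(test_entity_words & train_entity_words) / len(test_entity_words)

# Toy example: half of the test vocabulary was already seen in training.
train = {"london", "acme", "smith"}
test = {"london", "paris", "smith", "jones"}
print(f"{vocabulary_transfer(train, test):.0%}")  # 50%
```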
### Semi-supervised Learning
Main technique is "bootstrapping" (small degree of supervision).
Example approach:
1. User provides a few example names
2. The system finds all occurrences
3. Use the sentences with occurrences for training.
4. Find new examples using the developed rules.
5. Repeat from (2) using the examples found.
Comparable performance to supervised learning.
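A minimal sketch of this bootstrapping loop, using a naive one-word window on each side of a name as the "context"; the corpus, seed names and pattern representation are made up for illustration and are not from the survey.

```python
# Bootstrapping sketch: grow a set of names from a few seeds by alternating
# between collecting contexts of known names (steps 2-3) and accepting new
# capitalized words seen in known contexts (step 4).

def bootstrap(sentences, seeds, iterations=3):
    names = set(seeds)
    contexts = set()
    for _ in range(iterations):
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                ctx = (tokens[i - 1] if i > 0 else "<s>",
                       tokens[i + 1] if i + 1 < len(tokens) else "</s>")
                if tok in names:
                    contexts.add(ctx)          # record context of a known name
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                ctx = (tokens[i - 1] if i > 0 else "<s>",
                       tokens[i + 1] if i + 1 < len(tokens) else "</s>")
                if ctx in contexts and tok[0].isupper():
                    names.add(tok)             # accept a new name from its context
    return names

sentences = [["Mr.", "Smith", "visited", "Oslo"],
             ["Mr.", "Jones", "visited", "Bergen"]]
print(sorted(bootstrap(sentences, seeds={"Smith"})))  # ['Jones', 'Smith']
```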
#### Examples
S. Brin (1998)
: Develop regex for recognizing book titles and authors. Same format often used throughout a site, thus recognizing one book gives a pattern to recognize the next.
M. Collins and Singer (1999)
: Pattern-based recognition where a pattern is a pair {spelling, context}. Spelling rule example: If the spelling contains "Mr.", it is a person. Aggregate contextual rules, which can be used to find new spellings.
E. Riloff and Jones (1999)
: "Mutual bootstrapping" - grow entities and contexts in turns. Deteriorates quickly when nocie is introduced.
A. Cucchiarelli and Velardi (2001)
: Variant of the previous. Uses syntactic relations for more accurate contextual evidence. Uses existing NER systems for seeds.
M. Pasca et al. (2006)
: Also based on mutual bootstrapping. Generates synonyms to generalize patterns. E.g. "X was born in November" can recognize several months if "November" is a (Lin's) synonym for {"March", "October",...}.
Existing NE classifiers can be improved with bootstrapping. A large collection is not sufficient by itself; documents rich in proper names and coreferences bring the best results.
### Unsupervised Learning
Typically relies on lexical resources, patterns and statistics from a large unannotated corpus. An example is clustering - grouping based on similarity of context.
#### Examples
[#] "WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept."
E. Alfonseca and Manandhar (2002)
: Label input word with NE type. NE types from WordNet. Register words frequently co-occurring in a large corpus ("type signatures"). Compare context to type signatures when trying to determine the NE type of a word.
R. Evans (2003)
: Use hypernyms, e.g. finding occurrences such as "organizations such as X". In the sentence "An $L_0$ is a (kind of) $L_1$", $L_1$ is the hypernym of $L_0$ and the relationship is reflexive and transitive, but not symmetric. (A rough pattern sketch follows after this list.)
Y. Shinyama and Sekine (2004)
: Exploits that named entities appear synchronously (at around the same time) in several news articles.
O. Etzioni et al. (2005)
: Use web search to determine the entity type. E.g. to find the type of London, search for "London is a nation", "London is a city", ...
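As a rough illustration of the hypernym-pattern idea from the Evans entry above, here is a sketch that matches "<hypernym> such as <Instance>" with a regular expression; the pattern and the sample sentence are invented for this note and are far simpler than the approaches in the cited papers.

```python
import re

# Rough hypernym-pattern sketch ("organizations such as X"); illustrative only.
HYPERNYM_PATTERN = re.compile(
    r"\b(?P<hypernym>\w+)\s+such\s+as\s+(?P<instance>[A-Z]\w+(?:\s+[A-Z]\w+)*)")

def find_hypernym_pairs(text):
    """Return (instance, hypernym) pairs, e.g. ('Statoil', 'organizations')."""
    return [(m.group("instance"), m.group("hypernym"))
            for m in HYPERNYM_PATTERN.finditer(text)]

print(find_hypernym_pairs("He worked for organizations such as Statoil."))
# [('Statoil', 'organizations')]
```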
## Features of Words for Algorithmic Consumption
> Features are descriptors or characteristic attributes of words designed for algorithmic
> consumption.
Examples:
- capitalized or not
- length in characters
- lowercased version of the word
The sentence "The president of Apple eats an apple" with the above gives the feature vector
<true, 3, “the”>,
<false, 9, “president”>,
<false, 2, “of”>,
<true, 5, “apple”>,
<false, 4, “eats”>,
<false, 2, “an”>,
<false, 5, “apple”>
Typically used in rules for the NERC system.
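A minimal sketch of computing the three features listed above (capitalized or not, length in characters, lowercased form) for each word in the example sentence; the function name is illustrative.

```python
# Compute the three word-level features for each word in the example sentence.
def word_features(word):
    return (word[0].isupper(), len(word), word.lower())

for w in "The president of Apple eats an apple".split():
    print(word_features(w))
# (True, 3, 'the'), (False, 9, 'president'), (False, 2, 'of'),
# (True, 5, 'apple'), (False, 4, 'eats'), (False, 2, 'an'), (False, 5, 'apple')
```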
The rest of this section basically lists a lot of word features. Read the article for examples.
### Categories
Word-level Features
: Based on the characters making up the word.
List Lookup Features
: Whether or not the word exists in a list of words (e.g. "company names", "nouns",...). May also use other matching techniques, such as fuzzy matching, stemming techniques or using the soundex algorithm.
Document/Corpus Features
: Features based on the context of the word. E.g. multiple occurrences, aliases, document metadata, statistics.
## NERC System Evaluation
Usually compare the solution of the system with the solution from human linguists. The system can produce the following types of errors.
False positive
: System detected entity, but there was none.
Missed
: System did not detect entity, but there was one.
Wrong label
: System detected entity, but gave it the wrong label.
Incorrect boundaries
: System detected entity, but included too much or too little.
Both wrong label and boundaries
: Obviously a combination of the two previous.
### MUC Evaluations
[#] MUC: Message Understanding Conference
Based on two axes:
TYPE
: Correct if the entity is assigned the correct type. Boundaries do not matter as long as they overlap.
TEXT
: Correct if the boundaries match.
Three measures for each axis: the number of correct answers, the number of system guesses, and the number of entities in the solution.
Then calculate precision $p$ and recall $r$ as given by
$$p = \frac{\text{correct guesses}}{\text{total guesses}}$$
$$r = \frac{\text{correct guesses}}{\text{entities in corpus}}$$
where the numbers include both TEXT and TYPE.
The score is the micro-averaged f-measure (MAF),
$$MAF = \frac{2PR}{P + R}$$
where precision and recall are micro-averaged, i.e. averaged across the set of test cases $T$:
$$P = \frac{\sum_{t \in T}\text{correct guesses}}{\sum_{t \in T}\text{total guesses}}$$
$$R = \frac{\sum_{t \in T}\text{correct guesses}}{\sum_{t \in T}\text{entities in corpus}}$$
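A small sketch of the MAF computation, assuming each test case is given as a triple (correct guesses, total guesses, entities in the corpus); the example numbers are made up.

```python
# Micro-averaged f-measure (MAF) over a set of test cases, as defined above.
def micro_averaged_f(test_cases):
    correct = sum(c for c, _, _ in test_cases)
    guesses = sum(g for _, g, _ in test_cases)
    entities = sum(e for _, _, e in test_cases)
    p = correct / guesses    # micro-averaged precision P
    r = correct / entities   # micro-averaged recall R
    return 2 * p * r / (p + r)

# Two test cases: (correct, guessed, entities in solution)
print(micro_averaged_f([(8, 10, 12), (5, 8, 10)]))  # 0.65
```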
### Exact-match Evaluations
Requires exact matches (thus only one axis with a binary scale).
Equations for precision and recall are otherwise the same as for MUC. Score is calculated using the micro-averaged f-measure.
### ACE Evaluation
More complicated than the previous ones (includes subtypes, class, entity mentions, ...), mostly ignored here.
Each entity type (person, company, ...) is assigned a weight and a maximal contribution MAXVAL. Each error (false positive, missed entity, ...) is assigned a cost.
Special rule for partial matches.
Final Entity Detection and Recognition Value (EDR):
$$EDR = 100\% - \text{penalties}$$
## Conclusion/Summary
- Area of NER has thrived for 15 years
- Wide range of languages, domains, textual genres and entity types
- Overview of techniques
- Hand crafted rules vs. machine learning
- Varying levels of supervision
- Listed features
> The use of an expressive and varied set of
> features turns out to be just as important as the choice of machine learning algorithms.
- Overview of evaluation methods
NERC's impact:
- Yellow pages
- Trend monitoring
- Basis for advances in biology and genetics
# Survey of Temporal Information Retrieval and Related Applications
(by Ricardo Campos, Gaël Dias, Alípio M. Jorge, Adam Jatowt)
Temporal Information Retrieval, T-IR, includes the temporal aspect when determining relevance. Five key aspects for document quality:
- *timeliness*
- relevance
- accuracy
- objectivity
- coverage
The introduction further lists some projects with temporal aspects, like the Internet Archive project.
The survey focuses on T-IR for web.
## High-level Overview
Introduction to basic concepts and extraction.
### Notion of Time
Some formal models have been developed.
Less formally "an inherent construct in human life".
Point-in-time values can be represented by units with ranging granularity (seconds, days, years, decades..).
Two types of time in database research.
Valid
: Time at which the event occurred in real life
Transaction
: Time at which the fact was stored
For the web, the *focus time* - the time mentioned in the text - is considered. It is regarded as a counterpart of valid time. There are also multiple transaction times, e.g. created, modified and published.
Other relevant times: reading time and document age.
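As a small illustration (not from the survey), these different times could be modelled on a document like this; the class and field names are invented.

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative sketch of the different times attached to a web document:
# several transaction times plus the focus times mentioned in the text.
@dataclass
class WebDocument:
    created: date                     # transaction time: when first stored
    modified: date                    # transaction time: when last changed
    published: date                   # transaction time: when published
    focus_times: list = field(default_factory=list)  # times mentioned in the text

doc = WebDocument(created=date(2016, 9, 1), modified=date(2016, 9, 10),
                  published=date(2016, 9, 2), focus_times=[date(2009, 12, 25)])
print(doc.focus_times)  # [datetime.date(2009, 12, 25)]
```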
### Events and Timelines
An event has a time and a place. It is defined as an "occurrence" that happened at a given time in a given place.
A sequence of events is a timeline or chronology, as expected.
### Temporal Expressions
Temporal expressions can be grouped into three categories.
Explicit
: A precise moment in time, e.g. 25.12.2009. Varying granularity.
Implicit
: Often associated with events (e.g. "Christmas Day"). Requires more info to get the temporal value.
Relative
: Time relative to another time, e.g. "today".
### Temporal Information Extraction (T-IE)
Usually performed by a *temporal tagger*. Does tokenization, sentence extraction, part-of-speech tagging and named-entity recognition. T-IE used to be part of NER, but is now a separate task consisting of the following steps.
1. Extraction/Recognition
2. Normalization
3. Temporal Annotation
Result: Annotated text (usually TimeML).
Often rule-based, language-specific solutions tailored for one domain (the news). There are no extensive collections of annotated text.
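As a toy illustration of the extraction and normalization steps, here is a sketch that recognizes one explicit pattern (DD.MM.YYYY) and two relative expressions and maps them to ISO dates; the rules and the reference date are illustrative only and far simpler than a real temporal tagger, which would also produce TimeML annotations.

```python
import re
from datetime import date, timedelta

# Toy rule-based normalization of temporal expressions; illustrative only.
EXPLICIT = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")

def normalize(text, reference_date):
    """Return (expression, normalized ISO date) pairs found in the text."""
    results = []
    for m in EXPLICIT.finditer(text):
        day, month, year = map(int, m.groups())
        results.append((m.group(0), date(year, month, day).isoformat()))
    lowered = text.lower()
    if "yesterday" in lowered:
        results.append(("yesterday", (reference_date - timedelta(days=1)).isoformat()))
    elif "today" in lowered:
        results.append(("today", reference_date.isoformat()))
    return results

print(normalize("He arrived on 25.12.2009 and left today.", date(2016, 9, 1)))
# [('25.12.2009', '2009-12-25'), ('today', '2016-09-01')]
```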
Performance measured using $F_\beta$.
$$F_\beta = \frac{(\beta^2+1)PR}{\beta^2P + R}$$
where
$$P = \frac{TP}{TP+FP}$$
$$R = \frac{TP}{TP+FN}$$
TP
: True positives
FP
: False positives
FN
: False negatives
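A direct translation of the $F_\beta$ formula above into code, using made-up counts.

```python
# F_beta from true positive, false positive and false negative counts.
def f_beta(tp, fp, fn, beta=1.0):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

print(round(f_beta(tp=40, fp=10, fn=20), 3))  # 0.727
```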
## Time Related Feature Extraction
## Advances in T-IR
## Future
## Conclusion