TDT46: Information Retrieval
Curriculum from fall 2016.
# Tips
Do the quiz exercises at www.intoit.io . You get to test yourself and keep a better overview.
# A Survey of Named Entity Recognition and Classification
(by David Nadeau, Satoshi Sekine)
This is a non-exhaustive survey of 1991-2006 research in the NERC field, short for "named entity recognition and classification".
A named entity may be the name of a company, person or location, but perhaps more surprisingly also a time, date, money, percent or numeric expression.
## Observations
Activity overview:
1991 - 1995
: Few articles (found 8 in English)
1996 - now
: Activity accelerates and does not slow down.
### Language Factor
A good portion of the studies address English; perhaps even more address language independence.
Well studied:
- English
- German
- Spanish
- Dutch
- Japanese
- Chinese
- French
- Greek
- Italian
Some attention for:
- Basque
- Bulgarian
- Catalan
- Cebuano
- Danish
- Hindi
- Polish
- Romanian
- Russian
- Swedish
- Turkish
Arabic has started to receive attention in large-scale projects.
### Textual Genre and Domain
> The impact of textual genre (journalistic, scientific, informal, etc.) and domain
> (gardening, sports, business, etc.) has been rather neglected in the NERC literature.
The research that exists shows genre and domain have a major impact: going from newswire texts to manual translations of phone conversations and technical e-mails, the performance of all systems drops.
### Entity Type
Named entity - "named" refers to a rigid designator, e.g. "Ford" or "Ford Motor Company" rather than "the automotive company created by Henry Ford". Examples include:
- Proper names
- Biological species
- Substances
- Numerical expressions
- Temporal expressions
Temporal expressions do not include unspecific terms like "June", which may be June in any year.
#### Categorization
Often categorized and further subcategorized. Some common categories:
- enamex (subclass of proper names, most studied)
    - persons
    - locations
    - organizations
- timex
    - date
    - time
- numex
    - money
    - percent
Also cases of marginal types for special purposes like "email address", "phone number", "project name" and so on.
(*miscellaneous* has been used about proper names outside enamex)
#### Open Domain
No limits on possible types. Some recent research. Also some examples of extremely many categories (around 200).
### What's Next?
- Multimedia
- Semi- and unsupervised approaches
- Complex linguistic phenomena
- Large-scale projects - integration of NERC and Machine Translation
Stay tuned.
## Machine Learning Techniques
The ability to recognize previously unknown entities is essential. Systems are mostly based on rules, either created by hand or learned with machine learning. Machine learning is preferred today when enough training data is available.
### Supervised Learning
Typically transforms a large annotated corpus into rules and memoized lists.
Baseline method: Tag words in the test corpus when they are annotated in the training corpus (simple memoization). The performance of this baseline depends on the *vocabulary transfer*.
#### Vocabulary Transfer
Measures the success achievable by simple memoization - proportion of unique words appearing in both training and testing corpus.
A vocabulary transfer of 21% was measured on the MUC-6 training data. Highlights:
- 42% for location names
- 17% for organization names
- 13% for person names
MUC-7 corpus:
- 76% for locations
- 49% for organizations
- 26% for persons
Precision: 70%-90%. Enamex (total): 76% precision, 48% recall
The measure is pessimistic - more occurrences may be recognized due to repetition.
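A minimal sketch of the vocabulary-transfer measure described above, assuming each corpus is reduced to the set of unique (lowercased) entity words it contains; the function name and the toy data are illustrative only.

```python
# Minimal sketch: vocabulary transfer between a training and a test corpus.
# Assumes each corpus is given as a set of unique entity words (lowercased);
# the toy data below is made up, not from the survey.

def vocabulary_transfer(train_entity_words, test_entity_words):
    """Proportion of unique entity words in the test corpus that also
    occur as entity words in the training corpus."""
    if not test_entity_words:
        return 0.0
    return len(test_entity_words & train_entity_words) / len(test_entity_words)

# Toy example: half of the test vocabulary was already seen in training.
train = {"london", "acme", "smith"}
test = {"london", "paris", "smith", "jones"}
print(f"{vocabulary_transfer(train, test):.0%}")  # 50%
```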
### Semi-supervised Learning
Main technique is "bootstrapping" (small degree of supervision).
Example approach:
1. User provides a few example names
2. The system finds all occurrences
3. Use the sentences with occurrences for training.
4. Find new examples using the developed rules.
5. Repeat from (2) using the examples found.
Comparable performance to supervised learning.
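A minimal sketch of this bootstrapping loop, using a naive one-word window on each side of a name as the "context"; the corpus, seed names and pattern representation are made up for illustration and are not from the survey.

```python
# Bootstrapping sketch: grow a set of names from a few seeds by alternating
# between collecting contexts of known names (steps 2-3) and accepting new
# capitalized words seen in known contexts (step 4).

def bootstrap(sentences, seeds, iterations=3):
    names = set(seeds)
    contexts = set()
    for _ in range(iterations):
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                ctx = (tokens[i - 1] if i > 0 else "<s>",
                       tokens[i + 1] if i + 1 < len(tokens) else "</s>")
                if tok in names:
                    contexts.add(ctx)          # record context of a known name
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                ctx = (tokens[i - 1] if i > 0 else "<s>",
                       tokens[i + 1] if i + 1 < len(tokens) else "</s>")
                if ctx in contexts and tok[0].isupper():
                    names.add(tok)             # accept a new name from its context
    return names

sentences = [["Mr.", "Smith", "visited", "Oslo"],
             ["Mr.", "Jones", "visited", "Bergen"]]
print(sorted(bootstrap(sentences, seeds={"Smith"})))  # ['Jones', 'Smith']
```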
#### Examples
S. Brin (1998)
: Develop regex for recognizing book titles and authors. Same format often used throughout a site, thus recognizing one book gives a pattern to recognize the next.
M. Collins and Singer (1999)
: Pattern-based recognition where a pattern is a pair {spelling, context}. Spelling rule example: If the spelling contains "Mr.", it is a person. Aggregate contextual rules, which can be used to find new spellings.
E. Riloff and Jones (1999)
: "Mutual bootstrapping" - grow entities and contexts in turns. Deteriorates quickly when nocie is introduced.
A. Cucchiarelli and Velardi (2001)
: Variant of the previous. Uses syntactic relations for more accurate contextual evidence. Uses existing NER systems for seeds.
M. Pasca et al. (2006)
: Also based on mutual bootstrapping. Generates synonyms to generalize patterns. E.g. "X was born in November" can recognize several months if "November" is a (Lin's) synonym for {"March", "October",...}.
Existing NE classifiers can be improved with bootstrapping. A large collection is not sufficient by itself; documents rich in proper names and coreferences bring the best results.
### Unsupervised Learning
Typically relies on lexical resources, patterns and statistics from a large unannotated corpus. An example is clustering - grouping based on similarity of context.
#### Examples
[#] "WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept."
E. Alfonseca and Manandhar (2002)
: Label input word with NE type. NE types from WordNet. Register words frequently co-occurring in a large corpus ("type signatures"). Compare context to type signatures when trying to determine the NE type of a word.
R. Evans (2003)
: Use hypernyms, e.g. finding occurrences such as "organizations such as X". In the sentence "An $L_0$ is a (kind of) $L_1$", $L_1$ is the hypernym of $L_0$ and the relationship is reflexive and transitive, but not symmetric. (A rough pattern sketch follows after this list.)
Y. Shinyama and Sekine (2004)
: Exploits that named entities appear synchronously (at around the same time) in several news articles.
O. Etzioni et al. (2005)
: Use web search to determine the entity type. E.g. to find the type of London, search for "London is a nation", "London is a city", ...
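As a rough illustration of the hypernym-pattern idea from the Evans entry above, here is a sketch that matches "<hypernym> such as <Instance>" with a regular expression; the pattern and the sample sentence are invented for this note and are far simpler than the approaches in the cited papers.

```python
import re

# Rough hypernym-pattern sketch ("organizations such as X"); illustrative only.
HYPERNYM_PATTERN = re.compile(
    r"\b(?P<hypernym>\w+)\s+such\s+as\s+(?P<instance>[A-Z]\w+(?:\s+[A-Z]\w+)*)")

def find_hypernym_pairs(text):
    """Return (instance, hypernym) pairs, e.g. ('Statoil', 'organizations')."""
    return [(m.group("instance"), m.group("hypernym"))
            for m in HYPERNYM_PATTERN.finditer(text)]

print(find_hypernym_pairs("He worked for organizations such as Statoil."))
# [('Statoil', 'organizations')]
```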
## Features of Words for Algorithmic Consumption
> Features are descriptors or characteristic attributes of words designed for algorithmic
> consumption.
Examples:
- capitalized or not
- length in characters
- lowercased version of the word
The sentence "The president of Apple eats an apple" with the above gives the feature vector
<true, 3, “the”>,
<false, 9, “president”>,
<false, 2, “of”>,
<true, 5, “apple”>,
<false, 4, “eats”>,
<false, 2, “an”>,
<false, 5, “apple”>
Typically used in rules for the NERC system.
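A minimal sketch of computing the three features listed above (capitalized or not, length in characters, lowercased form) for each word in the example sentence; the function name is illustrative.

```python
# Compute the three word-level features for each word in the example sentence.
def word_features(word):
    return (word[0].isupper(), len(word), word.lower())

for w in "The president of Apple eats an apple".split():
    print(word_features(w))
# (True, 3, 'the'), (False, 9, 'president'), (False, 2, 'of'),
# (True, 5, 'apple'), (False, 4, 'eats'), (False, 2, 'an'), (False, 5, 'apple')
```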
The rest of this section basically lists a lot of word features. Read the article for examples.
### Categories
Word-level Features
: Based on the characters making up the word.
List Lookup Features
: Whether or not the word exists in a list of words (e.g. "company names", "nouns",...). May also use other matching techniques, such as fuzzy matching, stemming techniques or using the soundex algorithm.
Document/Corpus Features
: Features based on the context of the word. E.g. multiple occurrences, aliases, document metadata, statistics.
## NERC System Evaluation
Usually compare the solution of the system with the solution from human linguists. The system can produce the following types of errors.
False positive
: System detected entity, but there was none.
Missed
: System did not detect entity, but there was one.
Wrong label
: System detected entity, but gave it the wrong label.
Incorrect boundaries
: System detected entity, but included too much or too little.
Both wrong label and boundaries
: Obviously a combination of the two previous.
### MUC Evaluations
[#] MUC: Message Understanding Conference
Based on two axes:
TYPE
: Correct if the entity is assigned the correct type. Boundaries do not matter as long as they overlap.
TEXT
: Correct if the boundaries match.
Three measures for each axis: the number of correct answers, the number of system guesses, and the number of entities in the solution.
Then calculate precision $p$ and recall $r$ as given by
$$p = \frac{\text{correct guesses}}{\text{total guesses}}$$
$$r = \frac{\text{correct guesses}}{\text{entities in corpus}}$$
where the numbers include both TEXT and TYPE.
The score is the micro-averaged f-measure (MAF),
$$MAF = \frac{2PR}{P + R}$$
where precision and recall are micro-averaged, i.e. averaged across the set of test cases $T$:
$$P = \frac{\sum_{t \in T}\text{correct guesses}}{\sum_{t \in T}\text{total guesses}}$$
$$R = \frac{\sum_{t \in T}\text{correct guesses}}{\sum_{t \in T}\text{entities in corpus}}$$
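A small sketch of the MAF computation, assuming each test case is given as a triple (correct guesses, total guesses, entities in the corpus); the example numbers are made up.

```python
# Micro-averaged f-measure (MAF) over a set of test cases, as defined above.
def micro_averaged_f(test_cases):
    correct = sum(c for c, _, _ in test_cases)
    guesses = sum(g for _, g, _ in test_cases)
    entities = sum(e for _, _, e in test_cases)
    p = correct / guesses    # micro-averaged precision P
    r = correct / entities   # micro-averaged recall R
    return 2 * p * r / (p + r)

# Two test cases: (correct, guessed, entities in solution)
print(micro_averaged_f([(8, 10, 12), (5, 8, 10)]))  # 0.65
```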
### Exact-match Evaluations
Requires exact matches (thus only one axis with a binary scale).
Equations for precision and recall are otherwise the same as for MUC. Score is calculated using the micro-averaged f-measure.
### ACE Evaluation
More complicated than the previous ones (includes subtypes, class, entity mentions, ...), mostly ignored here.
Each entity type (person, company, ...) is assigned a weight and a maximal contribution MAXVAL. Each error (false positive, missed entity, ...) is assigned a cost.
Special rule for partial matches.
Final Entity Detection and Recognition Value (EDR):
$$EDR = 100\% - \text{penalties}$$
## Conclusion/Summary
- Area of NER has thrived for 15 years
- Wide range of languages, domains, textual genres and entity types
- Overview of techniques
- Hand crafted rules vs. machine learning
- Varying levels of supervision
- Listed features
> The use of an expressive and varied set of
> features turns out to be just as important as the choice of machine learning algorithms.
- Overview of evaluation methods
NERC's impact:
- Yellow pages
- Trend monitoring
- Basis for advances in biology and genetics
# Survey of Temporal Information Retrieval and Related Applications
(by Ricardo Campos, Gaël Dias, Alípio M. Jorge, Adam Jatowt)
Temporal Information Retrieval, T-IR, includes the temporal aspect when determining relevance. Five key aspects for document quality:
- *timeliness*
- relevance
- accuracy
- objectivity
- coverage
The introduction further lists some projects with temporal aspects, like the Internet Archive project.
The survey focuses on T-IR for web.
## High-level Overview
Introduction to basic concepts and extraction.
### Notion of Time
Some formal models have been developed.
Less formally "an inherent construct in human life".
Point-in-time values can be represented by units with ranging granularity (seconds, days, years, decades..).
Two types of time in database research.
Valid
: Time at which the event occurred in real life
Transaction
: Time at which the fact was stored
For the web, the *focus time* - the time mentioned in the text - is considered. It is regarded as a counterpart of valid time. There are also multiple transaction times, e.g. created, modified and published.
Other relevant times: reading time and document age.
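As a small illustration (not from the survey), these different times could be modelled on a document like this; the class and field names are invented.

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative sketch of the different times attached to a web document:
# several transaction times plus the focus times mentioned in the text.
@dataclass
class WebDocument:
    created: date                     # transaction time: when first stored
    modified: date                    # transaction time: when last changed
    published: date                   # transaction time: when published
    focus_times: list = field(default_factory=list)  # times mentioned in the text

doc = WebDocument(created=date(2016, 9, 1), modified=date(2016, 9, 10),
                  published=date(2016, 9, 2), focus_times=[date(2009, 12, 25)])
print(doc.focus_times)  # [datetime.date(2009, 12, 25)]
```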
### Events and Timelines
An event has a time and a place. It is defined as an "occurrence" that happened at a given time in a given place.
A sequence of events is a timeline or chronology, as expected.
### Temporal Expressions
Temporal expressions can be grouped into three categories.
Explicit
: A precise moment in time, e.g. 25.12.2009. Varying granularity.
Implicit
: Often associated with events (e.g. "Christmas Day"). Requires more info to get the temporal value.
Relative
: Time relative to another time, e.g. "today".
### Temporal Information Extraction (T-IE)
Usually performed by a *temporal tagger*. Does tokenization, sentence extraction, part-of-speech tagging and named-entity recognition. T-IE used to be part of NER, but is now a separate task consisting of the following steps.
1. Extraction/Recognition
2. Normalization
3. Temporal Annotation
Result: Annotated text (usually TimeML).
Often rule-based, language-specific solutions tailored for one domain (the news). There are no extensive collections of annotated text.
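As a toy illustration of the extraction and normalization steps, here is a sketch that recognizes one explicit pattern (DD.MM.YYYY) and two relative expressions and maps them to ISO dates; the rules and the reference date are illustrative only and far simpler than a real temporal tagger, which would also produce TimeML annotations.

```python
import re
from datetime import date, timedelta

# Toy rule-based normalization of temporal expressions; illustrative only.
EXPLICIT = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")

def normalize(text, reference_date):
    """Return (expression, normalized ISO date) pairs found in the text."""
    results = []
    for m in EXPLICIT.finditer(text):
        day, month, year = map(int, m.groups())
        results.append((m.group(0), date(year, month, day).isoformat()))
    lowered = text.lower()
    if "yesterday" in lowered:
        results.append(("yesterday", (reference_date - timedelta(days=1)).isoformat()))
    elif "today" in lowered:
        results.append(("today", reference_date.isoformat()))
    return results

print(normalize("He arrived on 25.12.2009 and left today.", date(2016, 9, 1)))
# [('25.12.2009', '2009-12-25'), ('today', '2016-09-01')]
```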
Performance measured using $F_\beta$.
$$F_\beta = \frac{(\beta^2+1)PR}{\beta^2P + R}$$
where
$$P = \frac{TP}{TP+FP}$$
$$R = \frac{TP}{TP+FN}$$
TP
: True positives
FP
: False positives
FN
: False negatives
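A direct translation of the $F_\beta$ formula above into code, using made-up counts.

```python
# F_beta from true positive, false positive and false negative counts.
def f_beta(tp, fp, fn, beta=1.0):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

print(round(f_beta(tp=40, fp=10, fn=20), 3))  # 0.727
```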
## Time Related Feature Extraction
## Advances in T-IR
## Future
## Conclusion