TDT46: Information Retrieval
Curriculum from fall 2016.
Tips
Do the quiz exercises at http://www.intoit.io . You can test yourself and get an overview of the topics!
A Survey of Named Entity Recognition and Classification
(by David Nadeau, Satoshi Sekine)
This is a non-exhaustive survey of research from 1991-2006 in the NERC field, short for "named entity recognition and classification".
A named entity may be the name of a company, person or location, but perhaps more surprisingly also a time, date, money, percent or numeric expression.
Observations
Activity overview:
- 1991 - 1995
- Few articles (found 8 in English)
- 1996 - now
- Accelerates and does not go down.
Language Factor
A good portion of the studies are devoted to English; perhaps an even larger portion addresses language independence and multilingualism.
Well studied:
- English
- German
- Spanish
- Dutch
- Japanese
- Chinese
- French
- Greek
- Italian
Some attention for:
- Basque
- Bulgarian
- Catalan
- Cebuano
- Danish
- Hindi
- Polish
- Romanian
- Russian
- Swedish
- Turkish
Arabic has started to receive attention in large-scale projects.
Textual Genre and Domain
The impact of textual genre (journalistic, scientific, informal, etc.) and domain (gardening, sports, business, etc.) has been rather neglected in the NERC literature.
The research that exists shows these have a major impact. For example, when systems developed for newswire texts are applied to manual translations of phone conversations and technical e-mails, the performance of all systems drops.
Entity Type
Named entity - "Named" refers to a rigid designator, e.g. "Ford" or "Ford Motor Company" rather than "the automotive company created by Henry Ford". Examples include:
- Proper names
- Biological species
- Substances
- Numerical expressions
- Temporal expressions
Temporal expressions do not include unspecific terms like "June", which may refer to June in any year.
Categorization
Often categorized and further subcategorized. Some common categories:
- enamex (subclass of proper names, most studied)
- persons
- locations
- organizations
- timex
- date
- time
- numex
- money
- percent
Also cases of marginal types for special purposes like "email address", "phone number", "project name" and so on.
("miscellaneous" has been used for proper names outside enamex)
Open Domain
No limits on possible types. Some recent research exists, including examples with extremely many categories (around 200).
What's Next?
- Multimedia
- Semi- and unsupervised approaches
- Complex linguistic phenomena
- Large-scale projects - integration of NERC and Machine Translation
Stay tuned.
Machine Learning Techniques
The ability to recognize previously unknown entities is essential. Systems are mostly based on rules, either created by hand or learned automatically. Machine learning is preferred today when enough training data is available.
Supervised Learning
Typically transforms a large annotated corpus into rules and memorized lists.
Baseline method: tag words in the test corpus when they are annotated in the training corpus (simple memorization). The performance of this baseline depends on the vocabulary transfer.
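A minimal sketch of this memorization baseline; the data format, tag names and function names are made up for illustration:

```python
def train_baseline(annotated_corpus):
    """Memorize every annotated entity word from the training corpus.

    annotated_corpus: list of (token, tag) pairs; non-entities carry the tag "O".
    """
    memory = {}
    for token, tag in annotated_corpus:
        if tag != "O":
            memory[token] = tag
    return memory


def tag_baseline(memory, tokens):
    """Tag a test token only if the exact same word was annotated in training."""
    return [(token, memory.get(token, "O")) for token in tokens]


train = [("Yesterday", "O"), ("Apple", "ORG"), ("hired", "O"), ("John", "PER")]
print(tag_baseline(train_baseline(train), ["Apple", "fired", "John", "Mary"]))
# [('Apple', 'ORG'), ('fired', 'O'), ('John', 'PER'), ('Mary', 'O')]
```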
Vocabulary Transfer
Measures the success achievable by simple memorization: the proportion of unique words that appear in both the training and the testing corpus.
The MUC-6 training data gives an overall vocabulary transfer of 21%. Highlights:
- 42% for location names
- 17% for organization names
- 13% for person names
MUC-7 corpus:
- 76% for locations
- 49% for organizations
- 26% for persons
Precision: 70%-90%. Enamex (total): 76% precision, 48% recall
This is a pessimistic measure - more occurrences may be recognized due to repetition.
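A small sketch of how the vocabulary transfer proportion could be computed per entity type; the corpora and names below are invented for illustration, not the MUC data:

```python
def vocabulary_transfer(train_words, test_words):
    """Proportion of unique test-corpus entity words that also occur annotated
    in the training corpus, i.e. the ceiling for pure memorization."""
    test_unique = set(test_words)
    if not test_unique:
        return 0.0
    return len(set(train_words) & test_unique) / len(test_unique)


# Per-type computation, mirroring the per-type figures above (data is made up).
train = {"LOC": {"Paris", "Oslo"}, "PER": {"Smith", "Chomsky"}}
test = {"LOC": {"Paris", "Bergen"}, "PER": {"Smith", "Turing", "Lovelace"}}
for entity_type in train:
    print(entity_type, vocabulary_transfer(train[entity_type], test[entity_type]))
# LOC 0.5
# PER 0.333...
```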
Semi-supervised Learning
The main technique is "bootstrapping", which involves only a small degree of supervision. Example approach (see the sketch below):
1. The user provides a few example names.
2. The system finds all occurrences of these names.
3. The sentences containing the occurrences are used for training.
4. New examples are found using the developed rules.
5. Repeat from step 2 using the examples found.
Comparable performance to supervised learning.
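A toy sketch of the loop above, using the immediate left/right words of each occurrence as the learned "rules". The corpus, seed and helper names are invented; real systems use much richer patterns:

```python
def bootstrap(corpus_sentences, seed_names, iterations=3):
    """Toy bootstrapping: start from seed names, learn the left/right context of
    their occurrences, then accept any token appearing in a learned context."""
    names = set(seed_names)
    for _ in range(iterations):
        # Steps 2-3: find occurrences of known names and record their contexts.
        contexts = set()
        for sentence in corpus_sentences:
            tokens = sentence.split()
            for i, token in enumerate(tokens):
                if token in names:
                    left = tokens[i - 1] if i > 0 else "<s>"
                    right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
                    contexts.add((left, right))
        # Steps 4-5: any token appearing in a learned context becomes a new example.
        for sentence in corpus_sentences:
            tokens = sentence.split()
            for i, token in enumerate(tokens):
                left = tokens[i - 1] if i > 0 else "<s>"
                right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
                if (left, right) in contexts:
                    names.add(token)
    return names


corpus = ["the city of Oslo is cold", "the city of Paris is pretty",
          "the city of Lima is far"]
print(bootstrap(corpus, {"Oslo"}))  # also picks up Paris and Lima
```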
Examples
- S. Brin (1998)
- Develops regexes for recognizing book titles and authors. The same format is often used throughout a site, so recognizing one book gives a pattern for recognizing the next (see the sketch at the end of this subsection).
- M. Collins and Singer (1999)
- Pattern-based recognition where a pattern is a pair {spelling, context}. Spelling rule example: if the spelling contains "Mr.", it is a person. Contextual rules are aggregated and can be used to find new spellings.
- E. Riloff and Jones (1999)
- "Mutual bootstrapping" - grow entities and contexts in turns. Deteriorates quickly when nocie is introduced.
- A. Cucchiarelli and Velardi (2001)
- A variant of the previous approach. Uses syntactic relations for more accurate contextual evidence. Uses existing NER systems for seeds.
- M. Pasca et al. (2006)
- Also based on mutual bootstrapping. Generates synonyms to generalize patterns; e.g. "X was born in November" can recognize several months if "November" has (Lin's distributional) synonyms {"March", "October", ...}.
Existing NE classifiers can be improved with bootstrapping. Large collections are not sufficient by themselves; documents rich in proper names and coreferences bring the best results.
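A rough sketch of the Brin-style idea from the first example above: derive a site-specific pattern from one known (author, title) pair and reuse it to find further pairs. The helper name, the page snippet and the pattern heuristics are all hypothetical:

```python
import re


def derive_pattern(page_text, author, title):
    """From one known (author, title) occurrence, capture the literal text just
    before and between the pair, and turn it into a regex for finding new pairs."""
    seed = re.search(re.escape(author) + r"(.{1,20}?)" + re.escape(title), page_text)
    if seed is None:
        return None
    middle = re.escape(seed.group(1))
    prefix = re.escape(page_text[max(0, seed.start() - 10):seed.start()])
    return re.compile(prefix + r"(?P<author>[A-Z][\w. ]+)" + middle + r"(?P<title>[A-Z][\w' ]+)")


page = "<li>by Herman Melville: Moby Dick</li><li>by Jane Austen: Emma</li>"
pattern = derive_pattern(page, "Herman Melville", "Moby Dick")
if pattern:
    for match in pattern.finditer(page):
        print(match.group("author"), "-", match.group("title"))
# Herman Melville - Moby Dick
# Jane Austen - Emma
```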
Unsupervised Learning
Typically relies on lexical resources, patterns and statistics from a large unannotated corpus. An example is clustering: grouping entities based on similarity of context.
Examples
- E. Alfonseca and Manandhar (2002)
- Labels an input word with an NE type. NE types come from WordNet. Registers words frequently co-occurring with each type in a large corpus ("type signatures"). Compares the word's context to the type signatures when determining its NE type (see the sketch after this list).
- R. Evans (2003)
- Uses hypernyms, e.g. finding occurrences such as "organizations such as X". In the sentence "An $L_0$ is a (kind of) $L_1$", $L_1$ is the hypernym of $L_0$; the relationship is reflexive and transitive, but not symmetric.
- Y. Shinyama and Sekine (2004)
- Exploits that named entities appear synchronously (at around the same time) in several news articles.
- O. Etzioni et al. (2005)
- Uses web search to determine the type, e.g. to find the type of London, search for "London is a nation", "London is a city", ...
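A small sketch of the type-signature idea from the Alfonseca and Manandhar bullet above (and of context-based clustering in general): compare a word's context to per-type signatures and pick the most similar. The signatures, types and similarity choice are invented for illustration:

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical "type signatures": words that co-occur with each NE type,
# as if collected from a large unannotated corpus.
signatures = {
    "LOCATION": Counter({"visited": 3, "capital": 4, "north": 2}),
    "PERSON": Counter({"said": 5, "born": 3, "wrote": 2}),
}


def guess_type(context_words):
    """Assign the NE type whose signature is most similar to the word's context."""
    context = Counter(context_words)
    return max(signatures, key=lambda t: cosine(context, signatures[t]))


print(guess_type(["visited", "the", "capital"]))   # LOCATION
print(guess_type(["she", "said", "and", "wrote"]))  # PERSON
```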
Features of Words for Algorithmic Consumption
Features are descriptors or characteristic attributes of words designed for algorithmic consumption.
Examples:
- capitalized or not
- length in characters
- lowercased version of the word
The sentence "The president of Apple eats an apple" with the above gives the feature vector
<true, 3, “the”>,
<false, 9, “president”>,
<false, 2, “of”>,
<true, 5, “apple”>,
<false, 4, “eats”>,
<false, 2, “an”>,
<false, 5, “apple”>
Typically used in rules for the NERC system.
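A minimal sketch that computes exactly these three features for the example sentence (the function name is made up):

```python
def word_features(word):
    """The three toy features from the example above:
    capitalization flag, length in characters, lowercased form."""
    return (word[0].isupper(), len(word), word.lower())


sentence = "The president of Apple eats an apple"
for word in sentence.split():
    print(word_features(word))
# (True, 3, 'the'), (False, 9, 'president'), ..., (False, 5, 'apple')
```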
The rest of this section basically lists a lot of word features. Read the article for examples.
Categories
- Word-level Features
- Based on the characters making up the word.
- List Lookup Features
- Whether or not the word exists in a list of words (e.g. "company names", "nouns",...). May also use other matching techniques, such as fuzzy matching, stemming techniques or using the soundex algorithm.
- Document/Corpus Features
- Features based on the context of the word, e.g. multiple occurrences, aliases, document metadata and corpus statistics.
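A small illustration of a list-lookup feature with simple case normalization; the gazetteer content is made up, and real systems may add fuzzy matching, stemming or Soundex:

```python
# Hypothetical gazetteer of company names, lowercased for case-insensitive lookup.
company_gazetteer = {"apple", "ford motor company", "statoil"}


def in_company_list(word):
    """List-lookup feature: is the (normalized) word a known company name?"""
    return word.lower() in company_gazetteer


print(in_company_list("Apple"))   # True
print(in_company_list("banana"))  # False
```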
NERC System Evaluation
Usually the solution produced by the system is compared with the solution from human linguists. The system can produce the following types of errors:
- False positive
- System detected entity, but there was none.
- Missed
- System did not detect entity, but there was one.
- Wrong label
- System detected entity, but gave it the wrong label.
- Incorrect boundaries
- System detected entity, but included too much or too little.
- Both wrong label and boundaries
- Obviously a combination of the two previous.
MUC Evaluations
Based on two axes:
- TYPE
- Correct if the entity is assigned the correct type. Boundaries do not matter as long as they overlap.
- TEXT
- Correct if the boundaries match.
Three measures for each axis: the number of correct answers (COR), the number of system guesses (ACT) and the number of entities in the gold solution (POS).
Then calculate precision and recall for each axis:
$\text{precision} = \frac{COR}{ACT}$, $\text{recall} = \frac{COR}{POS}$
The score is the micro-averaged f-measure (MAF), the harmonic mean of precision and recall computed over all slots on both axes:
$MAF = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
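A small sketch of the computation, assuming the three counts (here called correct, guessed and possible) have already been summed over both axes; the numbers are invented:

```python
def muc_score(correct, guessed, possible):
    """Micro-averaged f-measure from counts summed over both axes:
    correct = COR, guessed = ACT (system answers), possible = POS (gold answers)."""
    precision = correct / guessed if guessed else 0.0
    recall = correct / possible if possible else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# TYPE axis: 8 of 10 guesses correct; TEXT axis: 7 of 10; 9 gold entities per axis.
print(muc_score(correct=8 + 7, guessed=10 + 10, possible=9 + 9))  # ~0.789
```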
Exact-match Evaluations
Requires exact matches (thus only one axis with a binary scale).
Equations for precision and recall are otherwise the same as for MUC. Score is calculated using the micro-averaged f-measure.
ACE Evaluation
More complicated than the previous ones (includes subtypes, classes, entity mentions, ...); mostly ignored here.
Each entity type (person, company, ...) is assigned a weight and a maximal contribution MAXVAL. Each error (false positive, missed entity, ...) is assigned a cost.
Special rule for partial matches.
The final Entity Detection and Recognition Value (EDR) is 100% minus the penalties accumulated from the errors.
Conclusion/Summary
- Area of NER has thrived for 15 years
- Wide range of languages, domains, textual genres and entity types
- Overview of techniques
- Hand crafted rules vs. machine learning
- Varying levels of supervision
- Listed features
- The use of an expressive and varied set of features turns out to be just as important as the choice of machine learning algorithms.
- Overview of evaluation methods
NERC's impact:
- Yellow pages
- Trend monitoring
- Basis for advances in biology and genetics