TDT46: Information Retrieval
Curriculum from fall 2016.
Tips
Do the quiz exercises at http://www.intoit.io . You can test yourself and get an overview of the topics!
A Survey of Named Entity Recognition and Classification
(by David Nadeau, Satoshi Sekine)
This is a non-exhaustive survey of research from 1991-2006 in the NERC field, short for "named entity recognition and classification".
A named entity may be the name of a company, person or location, but perhaps more surprisingly also a time, date, money, percent or numeric expression.
Observations
Activity overview:
- 1991 - 1995
- Few articles (found 8 in English)
- 1996 - now
- Accelerates and does not go down.
Language Factor
A good portion of the studies are devoted to English; perhaps an even larger portion addresses language independence and multilingualism.
Well studied:
- English
- German
- Spanish
- Dutch
- Japanese
- Chinese
- French
- Greek
- Italian
Some attention for:
- Basque
- Bulgarian
- Catalan
- Cebuano
- Danish
- Hindi
- Polish
- Romanian
- Russian
- Swedish
- Turkish
Arabic has started to receive attention in large-scale projects.
Textual Genre and Domain
The impact of textual genre (journalistic, scientific, informal, etc.) and domain (gardening, sports, business, etc.) has been rather neglected in the NERC literature.
The research that exists shows these have a major impact. For example, when systems developed for newswire texts are applied to manual translations of phone conversations and technical e-mails, the performance of all systems drops.
Entity Type
Named entity - "Named" refers to a rigid designator, e.g. "Ford" or "Ford Motor Company" rather than "the automotive company created by Henry Ford". Examples include:
- Proper names
- Biological species
- Substances
- Numerical expressions
- Temporal expressions
Temporal expressions do not include unspecific terms like "June", which may refer to June in any year.
Categorization
Often categorized and further subcategorized. Some common categories:
- enamex (subclass of proper names, most studied)
- persons
- locations
- organizations
- timex
- date
- time
- numex
- money
- percent
Also cases of marginal types for special purposes like "email address", "phone number", "project name" and so on.
("miscellaneous" has been used for proper names outside enamex)
Open Domain
No limits on possible types. Some recent research exists, including examples with extremely many categories (around 200).
What's Next?
- Multimedia
- Semi- and unsupervised approaches
- Complex linguistic phenomena
- Large-scale projects - integration of NERC and Machine Translation
Stay tuned.
Machine Learning Techniques
The ability to recognize previously unknown entities is essential. Systems are mostly based on rules, either created by hand or learned automatically. Machine learning is preferred today when enough training data is available.
Supervised Learning
Typically transforms a large annotated corpus into rules and memorized lists.
Baseline method: tag words in the test corpus when they are annotated in the training corpus (simple memorization). The performance of this baseline depends on the vocabulary transfer.
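A minimal sketch of this memorization baseline; the data format, tag names and function names are made up for illustration:

```python
def train_baseline(annotated_corpus):
    """Memorize every annotated entity word from the training corpus.

    annotated_corpus: list of (token, tag) pairs; non-entities carry the tag "O".
    """
    memory = {}
    for token, tag in annotated_corpus:
        if tag != "O":
            memory[token] = tag
    return memory


def tag_baseline(memory, tokens):
    """Tag a test token only if the exact same word was annotated in training."""
    return [(token, memory.get(token, "O")) for token in tokens]


train = [("Yesterday", "O"), ("Apple", "ORG"), ("hired", "O"), ("John", "PER")]
print(tag_baseline(train_baseline(train), ["Apple", "fired", "John", "Mary"]))
# [('Apple', 'ORG'), ('fired', 'O'), ('John', 'PER'), ('Mary', 'O')]
```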
Vocabulary Transfer
Measures the success achievable by simple memorization: the proportion of unique words that appear in both the training and the testing corpus.
The MUC-6 training data gives an overall vocabulary transfer of 21%. Highlights:
- 42% for location names
- 17% for organization names
- 13% for person names
MUC-7 corpus:
- 76% for locations
- 49% for organizations
- 26% for persons
Precision: 70%-90%. Enamex (total): 76% precision, 48% recall
This is a pessimistic measure - more occurrences may be recognized due to repetition.
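A small sketch of how the vocabulary transfer proportion could be computed per entity type; the corpora and names below are invented for illustration, not the MUC data:

```python
def vocabulary_transfer(train_words, test_words):
    """Proportion of unique test-corpus entity words that also occur annotated
    in the training corpus, i.e. the ceiling for pure memorization."""
    test_unique = set(test_words)
    if not test_unique:
        return 0.0
    return len(set(train_words) & test_unique) / len(test_unique)


# Per-type computation, mirroring the per-type figures above (data is made up).
train = {"LOC": {"Paris", "Oslo"}, "PER": {"Smith", "Chomsky"}}
test = {"LOC": {"Paris", "Bergen"}, "PER": {"Smith", "Turing", "Lovelace"}}
for entity_type in train:
    print(entity_type, vocabulary_transfer(train[entity_type], test[entity_type]))
# LOC 0.5
# PER 0.333...
```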
Semi-supervised Learning
The main technique is "bootstrapping", which involves only a small degree of supervision. Example approach (see the sketch below):
1. The user provides a few example names.
2. The system finds all occurrences of these names.
3. The sentences containing the occurrences are used for training.
4. New examples are found using the developed rules.
5. Repeat from step 2 using the examples found.
Comparable performance to supervised learning.
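A toy sketch of the loop above, using the immediate left/right words of each occurrence as the learned "rules". The corpus, seed and helper names are invented; real systems use much richer patterns:

```python
def bootstrap(corpus_sentences, seed_names, iterations=3):
    """Toy bootstrapping: start from seed names, learn the left/right context of
    their occurrences, then accept any token appearing in a learned context."""
    names = set(seed_names)
    for _ in range(iterations):
        # Steps 2-3: find occurrences of known names and record their contexts.
        contexts = set()
        for sentence in corpus_sentences:
            tokens = sentence.split()
            for i, token in enumerate(tokens):
                if token in names:
                    left = tokens[i - 1] if i > 0 else "<s>"
                    right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
                    contexts.add((left, right))
        # Steps 4-5: any token appearing in a learned context becomes a new example.
        for sentence in corpus_sentences:
            tokens = sentence.split()
            for i, token in enumerate(tokens):
                left = tokens[i - 1] if i > 0 else "<s>"
                right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
                if (left, right) in contexts:
                    names.add(token)
    return names


corpus = ["the city of Oslo is cold", "the city of Paris is pretty",
          "the city of Lima is far"]
print(bootstrap(corpus, {"Oslo"}))  # also picks up Paris and Lima
```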
Examples
- S. Brin (1998)
- Develops regexes for recognizing book titles and authors. The same format is often used throughout a site, so recognizing one book gives a pattern for recognizing the next (see the sketch at the end of this subsection).
- M. Collins and Singer (1999)
- Pattern-based recognition where a pattern is a pair {spelling, context}. Spelling rule example: if the spelling contains "Mr.", it is a person. Contextual rules are aggregated and can be used to find new spellings.
- E. Riloff and Jones (1999)
- "Mutual bootstrapping" - grow entities and contexts in turns. Deteriorates quickly when nocie is introduced.
- A. Cucchiarelli and Velardi (2001)
- A variant of the previous approach. Uses syntactic relations for more accurate contextual evidence. Uses existing NER systems for seeds.
- M. Pasca et al. (2006)
- Also based on mutual bootstrapping. Generates synonyms to generalize patterns; e.g. "X was born in November" can recognize several months if "November" has (Lin's distributional) synonyms {"March", "October", ...}.
Existing NE classifiers can be improved with bootstrapping. Large collections are not sufficient by themselves; documents rich in proper names and coreferences bring the best results.
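A rough sketch of the Brin-style idea from the first example above: derive a site-specific pattern from one known (author, title) pair and reuse it to find further pairs. The helper name, the page snippet and the pattern heuristics are all hypothetical:

```python
import re


def derive_pattern(page_text, author, title):
    """From one known (author, title) occurrence, capture the literal text just
    before and between the pair, and turn it into a regex for finding new pairs."""
    seed = re.search(re.escape(author) + r"(.{1,20}?)" + re.escape(title), page_text)
    if seed is None:
        return None
    middle = re.escape(seed.group(1))
    prefix = re.escape(page_text[max(0, seed.start() - 10):seed.start()])
    return re.compile(prefix + r"(?P<author>[A-Z][\w. ]+)" + middle + r"(?P<title>[A-Z][\w' ]+)")


page = "<li>by Herman Melville: Moby Dick</li><li>by Jane Austen: Emma</li>"
pattern = derive_pattern(page, "Herman Melville", "Moby Dick")
if pattern:
    for match in pattern.finditer(page):
        print(match.group("author"), "-", match.group("title"))
# Herman Melville - Moby Dick
# Jane Austen - Emma
```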
Unsupervised Learning
Typically relies on lexical resources, patterns and statistics from a large unannotated corpus. An example is clustering: grouping entities based on similarity of context.
Examples
- E. Alfonseca and Manandhar (2002)
- Labels an input word with an NE type. NE types come from WordNet. Registers words frequently co-occurring with each type in a large corpus ("type signatures"). Compares the word's context to the type signatures when determining its NE type (see the sketch after this list).
- R. Evans (2003)
- Uses hypernyms, e.g. finding occurrences such as "organizations such as X". In the sentence "An $L_0$ is a (kind of) $L_1$", $L_1$ is the hypernym of $L_0$; the relationship is reflexive and transitive, but not symmetric.
- Y. Shinyama and Sekine (2004)
- Exploits that named entities appear synchronously (at around the same time) in several news articles.
- O. Etzioni et al. (2005)
- Uses web search to determine the type, e.g. to find the type of London, search for "London is a nation", "London is a city", ...
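A small sketch of the type-signature idea from the Alfonseca and Manandhar bullet above (and of context-based clustering in general): compare a word's context to per-type signatures and pick the most similar. The signatures, types and similarity choice are invented for illustration:

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical "type signatures": words that co-occur with each NE type,
# as if collected from a large unannotated corpus.
signatures = {
    "LOCATION": Counter({"visited": 3, "capital": 4, "north": 2}),
    "PERSON": Counter({"said": 5, "born": 3, "wrote": 2}),
}


def guess_type(context_words):
    """Assign the NE type whose signature is most similar to the word's context."""
    context = Counter(context_words)
    return max(signatures, key=lambda t: cosine(context, signatures[t]))


print(guess_type(["visited", "the", "capital"]))   # LOCATION
print(guess_type(["she", "said", "and", "wrote"]))  # PERSON
```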
Features of Words for Algorithmic Consumption
Features are descriptors or characteristic attributes of words designed for algorithmic consumption.
Examples:
- capitalized or not
- length in characters
- lowercased version of the word
The sentence "The president of Apple eats an apple" with the above gives the feature vector
<true, 3, “the”>,
<false, 9, “president”>,
<false, 2, “of”>,
<true, 5, “apple”>,
<false, 4, “eats”>,
<false, 2, “an”>,
<false, 5, “apple”>
Typically used in rules for the NERC system.
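A minimal sketch that computes exactly these three features for the example sentence (the function name is made up):

```python
def word_features(word):
    """The three toy features from the example above:
    capitalization flag, length in characters, lowercased form."""
    return (word[0].isupper(), len(word), word.lower())


sentence = "The president of Apple eats an apple"
for word in sentence.split():
    print(word_features(word))
# (True, 3, 'the'), (False, 9, 'president'), ..., (False, 5, 'apple')
```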
The rest of this section basically lists a lot of word features. Read the article for examples.
Categories
- Word-level Features
- Based on the characters making up the word.
- List Lookup Features
- Whether or not the word exists in a list of words (e.g. "company names", "nouns",...). May also use other matching techniques, such as fuzzy matching, stemming techniques or using the soundex algorithm.
- Document/Corpus Features
- Features based on the context of the word, e.g. multiple occurrences, aliases, document metadata and corpus statistics.
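A small illustration of a list-lookup feature with simple case normalization; the gazetteer content is made up, and real systems may add fuzzy matching, stemming or Soundex:

```python
# Hypothetical gazetteer of company names, lowercased for case-insensitive lookup.
company_gazetteer = {"apple", "ford motor company", "statoil"}


def in_company_list(word):
    """List-lookup feature: is the (normalized) word a known company name?"""
    return word.lower() in company_gazetteer


print(in_company_list("Apple"))   # True
print(in_company_list("banana"))  # False
```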
NERC System Evaluation
Usually the solution produced by the system is compared with the solution from human linguists. The system can produce the following types of errors:
- False positive
- System detected entity, but there was none.
- Missed
- System did not detect entity, but there was one.
- Wrong label
- System detected entity, but gave it the wrong label.
- Incorrect boundaries
- System detected entity, but included too much or too little.
- Both wrong label and boundaries
- Obviously a combination of the two previous.
MUC Evaluations
Based on two axes:
- TYPE
- Correct if the entity is assigned the correct type. Boundaries do not matter as long as they overlap.
- TEXT
- Correct if the boundaries match.
Three measures for each axis: the number of correct answers (COR), the number of system guesses (ACT) and the number of entities in the gold solution (POS).
Then calculate precision and recall for each axis:
$\text{precision} = \frac{COR}{ACT}$, $\text{recall} = \frac{COR}{POS}$
The score is the micro-averaged f-measure (MAF), the harmonic mean of precision and recall computed over all slots on both axes:
$MAF = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
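A small sketch of the computation, assuming the three counts (here called correct, guessed and possible) have already been summed over both axes; the numbers are invented:

```python
def muc_score(correct, guessed, possible):
    """Micro-averaged f-measure from counts summed over both axes:
    correct = COR, guessed = ACT (system answers), possible = POS (gold answers)."""
    precision = correct / guessed if guessed else 0.0
    recall = correct / possible if possible else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# TYPE axis: 8 of 10 guesses correct; TEXT axis: 7 of 10; 9 gold entities per axis.
print(muc_score(correct=8 + 7, guessed=10 + 10, possible=9 + 9))  # ~0.789
```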
Exact-match Evaluations
Requires exact matches (thus only one axis with a binary scale).
Equations for precision and recall are otherwise the same as for MUC. Score is calculated using the micro-averaged f-measure.
ACE Evaluation
More complicated than the previous ones (includes subtypes, classes, entity mentions, ...); mostly ignored here.
Each entity type (person, company, ...) is assigned a weight and a maximal contribution MAXVAL. Each error (false positive, missed entity, ...) is assigned a cost.
Special rule for partial matches.
The final Entity Detection and Recognition Value (EDR) is 100% minus the penalties accumulated from the errors.
Conclusion/Summary
- Area of NER has thrived for 15 years
- Wide range of languages, domains, textual genres and entity types
- Overview of techniques
- Hand crafted rules vs. machine learning
- Varying levels of supervision
- Listed features
- The use of an expressive and varied set of features turns out to be just as important as the choice of machine learning algorithms.
- Overview of evaluation methods
NERC's impact:
- Yellow pages
- Trend monitoring
- Basis for advances in biology and genetics