TDT46: Information Retrieval

Curriculum from fall 2016.

Table of Contents
  1. Tips
  2. A Survey of Named Entity Recognition and Classification
    1. Observations
      1. Language Factor
      2. Textual Genre and Domain
      3. Entity Type
        1. Categorization
        2. Open Domain
      4. What's Next?
    2. Machine Learning Techniques
      1. Supervised Learning
        1. Vocabulary Transfer
      2. Semi-supervised Learning
        1. Examples
      3. Unsupervised Learning
        1. Examples
    3. Features of Words for Algorithmic Consumption
      1. Categories
    4. NERC System Evaluation
      1. MUC Evaluations
      2. Exact-match Evaluations
      3. ACE Evaluation
    5. Conclusion/Summary
  3. Survey of Temporal Information Retrieval and Related Applications
    1. High-level Overview
      1. Notion of Time
      2. Events and Timelines
      3. Temporal Expressions
      4. Temporal Information Extraction (T-IE)
    2. Time Related Feature Extraction
    3. Advances in T-IR
    4. Future
    5. Conclusion

Tips

Do the quiz exercises at http://www.intoit.io. You can test yourself and get an overview of the topics!

A Survey of Named Entity Recognition and Classification

(by David Nadeau, Satoshi Sekine)

This is a non-exhaustive survey of research in the NERC field ("named entity recognition and classification") from 1991 to 2006.

A named entity may be the name of a company, person or location, but perhaps more surprisingly also a time, date, money, percent or numeric expression.

Observations

Activity overview:

1991 - 1995
Few articles (found 8 in English)
1996 - now
Activity accelerates and does not decline.

Language Factor

A good portion of the studies address English; perhaps even more address language independence.

Well studied:

  • English
  • German
  • Spanish
  • Dutch
  • Japanese
  • Chinese
  • French
  • Greek
  • Italian

Some attention for:

  • Basque
  • Bulgarian
  • Catalan
  • Cebuano
  • Danish
  • Hindi
  • Polish
  • Romanian
  • Russian
  • Swedish
  • Turkish

Arabic has started to receive attention in large-scale projects.

Textual Genre and Domain

The impact of textual genre (journalistic, scientific, informal, etc.) and domain (gardening, sports, business, etc.) has been rather neglected in the NERC literature.

The research that does exist shows that genre and domain have a major impact. Comparing newswire texts with manual translations of phone conversations and technical e-mails, the performance of all systems drops.

Entity Type

Named entity: "named" refers to a rigid designator, e.g. "Ford" or "Ford Motor Company" rather than "the automotive company created by Henry Ford". Examples include:

  • Proper names
  • Biological species
  • Substances
  • Numerical expressions
  • Temporal expressions

Temporal expressions do not include unspecific terms like "June", which may refer to June of any year.

Categorization

Often categorized and further subcategorized. Some common categories:

  • enamex (subclass of proper names, most studied)
    • persons
    • locations
    • organizations
  • timex
    • date
    • time
  • numex
    • money
    • percent

There are also marginal types for special purposes, like "email address", "phone number", "project name" and so on.

(The category "miscellaneous" has been used for proper names outside enamex.)

Open Domain

No limits on the possible types. There is some recent research here, including examples with extremely many categories (around 200).

What's Next?

  • Multimedia
  • Semi- and unsupervised approaches
  • Complex linguistic phenomena
  • Large-scale projects - integration of NERC and Machine Translation

Stay tuned.

Machine Learning Techniques

The ability to recognize previously unknown entities is essential. Systems are mostly based on rules, either hand-crafted or produced by machine learning. Machine learning is preferred today when enough training data is available.

Supervised Learning

Typically transforms a large annotated corpus into rules and memoized lists.

Baseline method: tag words in the test corpus when they are annotated in the training corpus (simple memoization). The performance of this baseline depends on the vocabulary transfer.
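
A minimal sketch of this baseline (the corpus format and names are illustrative assumptions, not from the survey):

```python
# Baseline NERC: memorize every annotated (word, tag) pair seen in
# training, then tag test words only if they were seen in training.

def train_baseline(training_corpus):
    """training_corpus: iterable of (word, tag) pairs, tag is None if unannotated."""
    return {word: tag for word, tag in training_corpus if tag is not None}

def tag_baseline(memo, test_words):
    """Tag each test word with its memorized tag, or None if unseen."""
    return [(word, memo.get(word)) for word in test_words]

memo = train_baseline([("Apple", "ORG"), ("eats", None), ("Paris", "LOC")])
print(tag_baseline(memo, ["Apple", "Paris", "Google"]))
# [('Apple', 'ORG'), ('Paris', 'LOC'), ('Google', None)]
```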

Vocabulary Transfer

Measures the success achievable by simple memoization: the proportion of unique words appearing in both the training and the testing corpus.

The overall proportion was 21% for the MUC-6 training data. Highlights:

  • 42% for location names
  • 17% for organization names
  • 13% for person names

MUC-7 corpus:

  • 76% for locations
  • 49% for organizations
  • 26% for persons

Precision ranges from 70% to 90%. Enamex (total): 76% precision, 48% recall.

Vocabulary transfer is a pessimistic measure: more occurrences may be recognized due to repetition.
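
A sketch of how this proportion might be computed (the exact set definitions here are an assumption):

```python
def vocabulary_transfer(train_words, test_words):
    """Proportion of the unique test-corpus words that also occur in training."""
    train_vocab, test_vocab = set(train_words), set(test_words)
    return len(train_vocab & test_vocab) / len(test_vocab)

print(vocabulary_transfer(["Paris", "Apple", "IBM"], ["Paris", "Google"]))  # 0.5
```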

Semi-supervised Learning

The main technique is "bootstrapping", which involves a small degree of supervision. Example approach (see the sketch below):

  1. The user provides a few example names.
  2. The system finds all occurrences.
  3. Use the sentences with occurrences for training.
  4. Find new examples using the developed rules.
  5. Repeat from (2) using the examples found.

Comparable performance to supervised learning.
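
A minimal sketch of such a bootstrapping loop (the corpus, seed set and pattern shapes are illustrative assumptions):

```python
import re

def bootstrap(sentences, seeds, rounds=3):
    """Grow a set of names by alternating between contexts and occurrences."""
    names = set(seeds)
    for _ in range(rounds):
        # Steps 2-3: collect two-word left contexts of every known name.
        patterns = set()
        for sent in sentences:
            for name in names:
                match = re.search(r"(\w+ \w+) " + re.escape(name), sent)
                if match:
                    patterns.add(match.group(1))
        # Steps 4-5: use the learned contexts to find new capitalized names.
        for sent in sentences:
            for pattern in patterns:
                match = re.search(re.escape(pattern) + r" ([A-Z]\w+)", sent)
                if match:
                    names.add(match.group(1))
    return names

corpus = ["the city of Paris", "the city of Oslo", "the mayor of Berlin"]
print(bootstrap(corpus, {"Paris"}))  # {'Paris', 'Oslo'}
```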

Examples

S. Brin (1998)
Develop regex for recognizing book titles and authors. Same format often used throughout a site, thus recognizing one book gives a pattern to recognize the next.
M. Collins and Singer (1999)
Pattern-based recognition where a pattern is a pair {spelling, context}. Spelling rule example: if the spelling contains "Mr.", it is a person. Aggregates contextual rules, which can be used to find new spellings.
E. Riloff and Jones (1999)
"Mutual bootstrapping": grow entities and contexts in turns. Deteriorates quickly when noise is introduced.
A. Cucchiarelli and Velardi (2001)
A variant of the previous. Uses syntactic relations for more accurate contextual evidence, and existing NER systems for seeds.
M. Pasca et al. (2006)
Also based on mutual bootstrapping. Generates synonyms to generalize patterns. E.g. "X was born in November" can recognize several months if "November" has (Lin's) synonyms {"March", "October", ...}.

Existing NE classifiers can be improved with bootstrapping. Large collections are not sufficient by themselves; documents rich in proper names and coreferences bring the best results.

Unsupervised Learning

Typically relies on lexical resources, patterns and statistics from a large unannotated corpus. An example is clustering: grouping entities based on the similarity of their contexts.

Examples

"WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept."
E. Alfonseca and Manandhar (2002)
Label an input word with an NE type; the NE types come from WordNet. Register words frequently co-occurring in a large corpus ("type signatures"). Compare the context to the type signatures when trying to determine the NE type of a word.
R. Evans (2003)
Uses hypernyms, e.g. finding occurrences such as "organizations such as X" (see the sketch after this list). In the sentence "An $L_0$ is a (kind of) $L_1$", $L_1$ is the hypernym of $L_0$; the relationship is reflexive and transitive, but not symmetric.
Y. Shinyama and Sekine (2004)
Exploits that named entities appear synchronously (at around the same time) in several news articles.
O. Etzioni et al. (2005)
Uses web search to determine the type. E.g. to find the type of London, search for "London is a nation", "London is a city", ...
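
A sketch of the Evans-style hypernym pattern search mentioned above (the single "such as" pattern is an illustrative assumption):

```python
import re

# Find (hypernym, entity) pairs from phrases like "organizations such as X".
HYPERNYM = re.compile(r"(\w+) such as ([A-Z]\w+)")

def hypernym_pairs(text):
    return HYPERNYM.findall(text)

print(hypernym_pairs("He supported organizations such as Greenpeace."))
# [('organizations', 'Greenpeace')]
```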

Features of Words for Algorithmic Consumption

Features are descriptors or characteristic attributes of words designed for algorithmic consumption.

Examples:

  • capitalized or not
  • length in characters
  • lowercased version of the word

The sentence "The president of Apple eats an apple" with the above features gives the feature vectors

<true, 3, “the”>,
<false, 9, “president”>,
<false, 2, “of”>,
<true, 5, “apple”>,
<false, 4, “eats”>,
<false, 2, “an”>,
<false, 5, “apple”>

Typically used in rules for the NERC system.
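
A minimal sketch computing exactly these three features per token:

```python
def word_features(token):
    """(capitalized?, length in characters, lowercased form)."""
    return (token[0].isupper(), len(token), token.lower())

sentence = "The president of Apple eats an apple"
print([word_features(t) for t in sentence.split()])
# [(True, 3, 'the'), (False, 9, 'president'), (False, 2, 'of'),
#  (True, 5, 'apple'), (False, 4, 'eats'), (False, 2, 'an'), (False, 5, 'apple')]
```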

The rest of this section basically lists a lot of word features. Read the article for examples.

Categories

Word-level Features
Based on the characters making up the word.
List Lookup Features
Whether or not the word exists in a list of words (e.g. "company names", "nouns", ...). May also use other matching techniques, such as fuzzy matching, stemming techniques or the soundex algorithm (see the sketch after this list).
Document/Corpus Features
Features based on the context of the word, e.g. multiple occurrences, aliases, document metadata, and statistics.
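
A minimal sketch of a list-lookup feature with fuzzy matching (difflib here is just one stand-in for the matching techniques mentioned; the list is an illustrative assumption):

```python
from difflib import get_close_matches

COMPANY_NAMES = ["Apple", "Microsoft", "Google"]

def in_list(word, word_list, fuzzy=True):
    """List-lookup feature: exact membership, optionally with fuzzy matching."""
    if word in word_list:
        return True
    return fuzzy and bool(get_close_matches(word, word_list, n=1, cutoff=0.8))

print(in_list("Aple", COMPANY_NAMES))    # True (close match to "Apple")
print(in_list("Banana", COMPANY_NAMES))  # False
```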

NERC System Evaluation

The solution produced by the system is usually compared with the solution from human linguists. The system can produce the following types of errors.

False positive
System detected entity, but there was none.
Missed
System did not detect entity, but there was one.
Wrong label
System detected entity, but gave it the wrong label.
Incorrect boundaries
System detected entity, but included too much or too little.
Both wrong label and boundaries
Obviously a combination of the two previous.

MUC Evaluations

MUC: Message Understanding Conference

Based on two axes:

TYPE
Correct if the entity is assigned the correct type. Boundaries do not matter as long as they overlap the solution.
TEXT
Correct if the boundaries match exactly.

Three measures are kept for each axis: the number of correct answers, the number of system guesses, and the number of entities in the solution.

Then calculate precision $p$ and recall $r$ as given by $$p = \frac{\text{correct guesses}}{\text{total guesses}}$$ $$r = \frac{\text{correct guesses}}{\text{entities in corpus}}$$ where the numbers include both TEXT and TYPE.

The score is the micro-averaged f-measure (MAF), $$MAF = \frac{2PR}{P + R}$$ where precision and recall are micro-averaged, that is, averaged across the set of test cases $T$: $$P = \frac{\sum_{t \in T}\text{correct guesses}}{\sum_{t \in T}\text{total guesses}}$$ $$R = \frac{\sum_{t \in T}\text{correct guesses}}{\sum_{t \in T}\text{entities in corpus}}$$
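
A sketch of the MAF computation from per-test-case counts (the tuple representation is an assumption):

```python
def micro_averaged_f(cases):
    """cases: (correct guesses, total guesses, entities in corpus) per test case."""
    correct = sum(c for c, _, _ in cases)
    guesses = sum(g for _, g, _ in cases)
    entities = sum(e for _, _, e in cases)
    P = correct / guesses   # micro-averaged precision
    R = correct / entities  # micro-averaged recall
    return 2 * P * R / (P + R)

print(micro_averaged_f([(8, 10, 12), (3, 5, 6)]))  # P=11/15, R=11/18 -> 2/3
```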

Exact-match Evaluations

Requires exact matches (thus only one axis with a binary scale).

Equations for precision and recall are otherwise the same as for MUC. Score is calculated using the micro-averaged f-measure.

ACE Evaluation

More complicated than the previous ones (it includes subtypes, classes, entity mentions, ...) and is mostly ignored here.

Each entity type (person, company, ...) is assigned a weight and a maximal contribution MAXVAL. Each error (false positive, missed entity, ...) is assigned a cost.

Special rule for partial matches.

Final Entity Detection and Recognition Value (EDR): $$EDR = 100\% - \text{penalties}$$

Conclusion/Summary

  • The area of NER has thrived for 15 years
  • Wide range of languages, domains, textual genres and entity types
  • Overview of techniques
  • Hand-crafted rules vs. machine learning
  • Varying levels of supervision
  • Listed word features: the use of an expressive and varied set of features turns out to be just as important as the choice of machine learning algorithm
  • Overview of evaluation methods

NERC's impact:

  • Yellow pages
  • Trend monitoring
  • Basis for advances in biology and genetics

Survey of Temporal Information Retrieval and Related Applications

(by Ricardo Campos, Gaël Dias, Alípio M. Jorge, Adam Jatowt)

Temporal Information Retrieval (T-IR) includes the temporal aspect when determining relevance. Five key aspects determine document quality:

  • timeliness
  • relevance
  • accuracy
  • objectivity
  • coverage

The introduction further lists some projects with temporal aspects, like the Internet Archive project.

The survey focuses on T-IR for the web.

High-level Overview

Introduction to basic concepts and extraction.

Notion of Time

Some formal models have been developed; less formally, time is "an inherent construct in human life". Point-in-time values can be represented by units of varying granularity (seconds, days, years, decades, ...).

Two types of time are distinguished in database research.

Valid
Time at which the event occurred in real life
Transaction
Time at which the fact was stored

For the web, focus time, i.e. the time mentioned in the text, is considered; it is regarded as a counterpart of valid time. There are also multiple transaction times, e.g. created, modified and published.

Other relevant times: reading time and document age.

Events and Timelines

An event has a time and a place: it is defined as an "occurrence" that happened at a given time in a given place.

A sequence of events is a timeline or chronology, as expected.

Temporal Expressions

Temporal expressions can be grouped into three categories.

Explicit
A precise moment in time, e.g. 25.12.2009. Granularity may vary.
Implicit
Often associated with events (e.g. "Christmas Day"). Requires more information to resolve to a temporal value.
Relative
Time relative to another time, e.g. "today".

Temporal Information Extraction (T-IE)

Usually performed by a temporal tagger, which does tokenization, sentence extraction, part-of-speech tagging and named-entity recognition. T-IE used to be part of NER, but is now a separate task consisting of the following steps.

  1. Extraction/Recognition
  2. Normalization
  3. Temporal Annotation

Result: Annotated text (usually TimeML).

Solutions are often rule-based, language-specific and tailored to a single domain (news). There are no extensive collections of annotated text.
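
A minimal rule-based sketch covering the extraction and normalization steps, handling only explicit dates like "25.12.2009" (real taggers and TimeML annotation are far richer; the regex and output format are illustrative assumptions):

```python
import re

# Recognize explicit day.month.year dates and normalize them to ISO 8601.
EXPLICIT_DATE = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")

def extract_and_normalize(text):
    """Return (surface form, normalized ISO value) pairs."""
    return [(m.group(0), f"{m.group(3)}-{m.group(2)}-{m.group(1)}")
            for m in EXPLICIT_DATE.finditer(text)]

print(extract_and_normalize("Christmas fell on 25.12.2009."))
# [('25.12.2009', '2009-12-25')]
```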

Performance is measured using $F_\beta$: $$F_\beta = \frac{(\beta^2+1)PR}{\beta^2P + R}$$ where $$P = \frac{TP}{TP+FP}$$ $$R = \frac{TP}{TP+FN}$$

TP
True positives
FP
False positives
FN
False negatives
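
A direct transcription of these formulas, as a sketch:

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-measure from true/false positives and false negatives."""
    P = tp / (tp + fp)
    R = tp / (tp + fn)
    return (beta**2 + 1) * P * R / (beta**2 * P + R)

print(f_beta(tp=8, fp=2, fn=4))  # P=0.8, R=2/3 -> F1 = 8/11 ≈ 0.727
```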

Time Related Feature Extraction

Advances in T-IR

Future

Conclusion
