Named Entity Recognition Experiments on Turkish Texts

Dilek Küçük (1) and Adnan Yazıcı (2)
(1) TÜBİTAK - Uzay Institute, Ankara - Turkey
(2) Dept. of Computer Engineering, METU, Ankara - Turkey
Outline

Introduction
Named Entity Recognition in Turkish
Evaluation
Evaluation on News Texts
Evaluation on Child Stories and Historical Texts
Evaluation on Video Texts
Future Work
Conclusion
Introduction [1]

Named entity recognition (NER) is one of the
main information extraction (IE) tasks:

the recognition of names of people, locations, and
organizations, as well as temporal and numeric
expressions, in texts (Nadeau and Sekine, 2007).

The NER task is widely considered a solved problem,
especially for English, with state-of-the-art
performance above 90%.
Introduction [2]

NER research in Turkish is known to be rare.

Language-independent IE system (Cucerzan and
Yarowsky, 1999)

Statistical name tagger for Turkish (Tür et al., 2003)

Person name tagger for financial news texts
(Bayraktar and Taşkaya-Temizel, 2008)

Person mention extractor and a string matching based
coreference resolver (Küçük and Yazıcı, 2008)
Introduction [3]

In this study, we present a rule-based system for
named entity recognition from Turkish texts.

Proposed for the domain of news texts.

Evaluated on

Newswire texts

Child stories and historical texts

News video transcriptions
Named Entity Recognition in
Turkish [1]

The target domain is news texts.

News texts from the METU Turkish corpus (Say et
al., 2002) were examined.

Capitalization and punctuation clues are not
utilized, since they may be missing in automatic speech
recognition (ASR) outputs and in texts obtained
from the Web.
Named Entity Recognition in Turkish [2]

A set of information sources has been compiled.
Named Entity Recognition in Turkish [3]

The lexical resources include

a dictionary of person names in Turkish
comprising about 8300 entries,

a list of well-known political people,

a list of well-known locations (the names of cities
and towns) in Turkey as well as in the world,

a list of well-known organizations in Turkey and
those in the world.
Named Entity Recognition in Turkish [4]

The pattern bases cover the extraction of location and organization
names as well as numeric and temporal expressions.

The system makes use of a simple morphological analyzer to
validate candidates.
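The interplay of lexical resources and morphological validation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny lexicons, the suffix list, and the function names are all hypothetical, and the naive suffix stripping stands in for the system's morphological analyzer, whose details the slides do not give. Matching is case-insensitive because the system does not rely on capitalization.

```python
# Hypothetical mini-lexicons; the real resources are far larger
# (for example, about 8300 person names).
PERSON_NAMES = {"ahmet", "ayşe"}
LOCATIONS = {"ankara", "istanbul"}

# A few common Turkish case suffixes, longer ones first.
# Illustrative only; a real morphological analyzer does much more.
SUFFIXES = ("dan", "den", "tan", "ten", "da", "de", "ta", "te", "ya", "ye")


def tr_lower(word):
    """Lower-case a word, handling Turkish dotted and dotless i explicitly."""
    return word.replace("I", "ı").replace("İ", "i").lower()


def candidate_stems(token):
    """Yield the token itself, then variants with one known suffix removed."""
    yield token
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            yield token[: -len(suffix)]


def tag_tokens(text):
    """Return (token, label) pairs for tokens whose stem is in a lexicon."""
    tagged = []
    for raw in text.split():
        for stem in candidate_stems(tr_lower(raw)):
            if stem in PERSON_NAMES:
                tagged.append((raw, "PERSON"))
                break
            if stem in LOCATIONS:
                tagged.append((raw, "LOCATION"))
                break
    return tagged
```

For instance, `tag_tokens("ahmet ankaradan geldi")` labels "ahmet" as PERSON and "ankaradan" as LOCATION, because stripping the suffix "dan" leaves a lexicon entry.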
Evaluation [1]

The system tags its output with Message
Understanding Conference (MUC) style named
entity tags:

ENAMEX, TIMEX, and NUMEX

An annotation tool has been developed to annotate the
evaluation texts with the same tags to create
answer sets.

Evaluation is performed by comparing the
system output with the answer set.
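As an illustration of MUC-style tagging, the sketch below wraps labelled token spans in ENAMEX, TIMEX, and NUMEX elements. The `TYPE` attribute values follow the usual MUC convention and are an assumption here, since the slides do not list the exact attributes; the function name and the span format are hypothetical.

```python
def to_muc(tokens, spans):
    """Wrap labelled token spans in MUC-style tags.

    spans: list of (start, end, element, ne_type) tuples with end exclusive,
    assumed to be non-overlapping and sorted by start position.
    """
    out = []
    pos = 0
    for start, end, element, ne_type in spans:
        # Copy the untagged tokens before this span.
        out.extend(tokens[pos:start])
        inner = " ".join(tokens[start:end])
        out.append(f'<{element} TYPE="{ne_type}">{inner}</{element}>')
        pos = end
    # Copy whatever follows the last span.
    out.extend(tokens[pos:])
    return " ".join(out)
```

Given the tokens of "Ahmet 2009 yılında Ankara şehrine gitti" with a PERSON span on the first token, a DATE span on the second, and a LOCATION span on the fourth, the function returns the sentence with those three tokens wrapped in ENAMEX, TIMEX, and ENAMEX elements respectively.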
Evaluation [2]

The Annotation Tool
Evaluation [3]

Evaluation is performed in terms of precision, recall,
and f-measure
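A minimal sketch of how these metrics can be computed from an answer set and a system output is given below. It assumes exact-match scoring over (start, end, type) tuples and a balanced f-measure; the slides do not state how partial matches are credited, so the actual scoring scheme may differ. The function name is hypothetical.

```python
def score(answer, output):
    """Precision, recall, and balanced f-measure over exact-match entities.

    answer, output: sets of (start, end, ne_type) tuples.
    """
    correct = len(answer & output)
    precision = correct / len(output) if output else 0.0
    recall = correct / len(answer) if answer else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

With four annotated entities and three system extractions, two of which match exactly, this gives a precision of 2/3, a recall of 1/2, and an f-measure of 4/7.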
Evaluation on News Texts [1]

The precision of person name recognition using only a dictionary of
person names turns out to be too low, since many person names are also
common words.

Savaş ('war'), barış ('peace'), özen ('care')…

During location and organization name recognition, the system
performs erroneous extractions.

anlatmanın yolu ('the way to tell'),
ilk üniversitesi ('first university')…

Organization name recognition also suffers from erroneous
extractions in the case of compound organization names.

İstanbul Üniversitesi Siyasal Bilgiler Fakültesi
('İstanbul University Political Science Faculty')
is extracted as 'İstanbul Üniversitesi' and 'Bilgiler Fakültesi'.
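One way such split extractions can arise is sketched below. The pattern here, "one token followed by an organization head word", is purely a hypothetical illustration and not the system's actual pattern base; the head-word list and the function name are assumptions as well.

```python
# Hypothetical organization head words ('university', 'faculty').
HEAD_WORDS = {"üniversitesi", "fakültesi"}


def extract_orgs(text):
    """Extract '<one token> <head word>' pairs as organization candidates."""
    tokens = text.split()
    found = []
    for i in range(1, len(tokens)):
        if tokens[i].lower() in HEAD_WORDS:
            found.append(f"{tokens[i - 1]} {tokens[i]}")
    return found
```

Applied to "İstanbul Üniversitesi Siyasal Bilgiler Fakültesi", this fixed-width pattern yields 'İstanbul Üniversitesi' and 'Bilgiler Fakültesi' rather than the full compound name, and applied to "ilk üniversitesi" it wrongly yields the whole phrase as an organization.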

Unlike the statistical system (Tür et al., 2003),
the rule-based system also considers numeric and temporal
expressions, in addition to person, location, and organization names.

The statistical system has been trained on a set of news
articles with 492821 words (37277 NEs).

The statistical system has been tested on a news article
set of about 28000 words (2197 NEs) and has achieved a
best performance of 91.56 % in f-measure.

The rule-based system has been tested on a set of 20131
words (1591 NEs) and achieved an f-measure of 78.7 %.

The statistical system performs deeper language
processing compared to the rule-based system.
Evaluation on Child Stories and
Historical Texts [1]

The child stories set comprises two stories by the same
author (Ilgaz, 2003a, 2003b).

The historical text comprises the first three chapters of a
book describing five cities, mostly from a historical perspective
(Tanpınar, 2007).
Evaluation on Child Stories and
Historical Texts [2]

The main problem for the child stories data set is the presence of
foreign person names throughout the stories.

The performance drop for the historical text is due to the
absence of historical person names and organizations
(such as the names of empires) from the lexical resources.

The results are in line with the well-known finding that rule-based
systems suffer from performance degradation when
ported to other domains.
Evaluation on Video Texts [1]

An important research area which can benefit from
IE techniques is automatic multimedia annotation.

Several studies have employed IE output, especially
NER output, for semantic multimedia annotation.

Multimedia indexing system for English, German and
Dutch football videos (Saggion et al., 2004)

Video annotation system for Italian news videos (Basili et
al., 2005)

Automatic annotation system for BBC radio and TV news
(Dowman et al., 2005)

We have compiled a data set of Turkish news videos
from the Web site of Turkish Radio and Television
Company (TRT).

It comprises 16 videos with a total duration of two
hours.

The videos were transcribed manually, yielding a text
of 9804 words, since no general-purpose automatic speech
recognizer exists for Turkish.

The transcription text is annotated with named entity tags,
resulting in 1090 named entities (256 person, 479 location,
and 222 organization names, plus 70 numeric and 63 temporal
expressions).

Evaluation of the recognizer on the text resulted in a
precision of 73.3%, a recall of 77.0%, and hence an f-measure
of 75.1%.

The results on video transcriptions are satisfactory for a
first attempt at named entity recognition on genuine video
texts.

This is a significant step towards the employment of IE
techniques for semantic annotation of videos in Turkish.
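As a quick consistency check, assuming the f-measure is the balanced harmonic mean of precision and recall (the slides do not state the weighting), the reported figures for the video transcriptions agree with one another:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 73.3 \times 77.0}{73.3 + 77.0}
  = \frac{11288.2}{150.3}
  \approx 75.1
```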
Future Work

Future work based on the current study includes

Improvement of the system benefiting from the
error analyses.

Extending the system to output finer grained named
entity classes employing a named entity ontology.

Employment of machine learning algorithms for the
NER task.

The results can be compared with those of the
rule-based recognizer.
Conclusion [1]

Information extraction in Turkish is a rarely studied
research area.

In this study, we have presented a rule-based system
for named entity recognition from Turkish texts.

Initially engineered for news texts.

Employs a set of lexical resources and pattern bases.

Being a rule-based system, needs no training data.

Evaluated on diverse text types including news texts,
child stories, historical texts, and news video
transcriptions.
Conclusion [2]

The evaluation results for the news texts and
news video transcriptions are satisfactory for a
first attempt.

Yet, the results for child stories and historical texts
are very low.

This is in line with the finding that rule-based IE systems
suffer a considerable performance drop when
evaluated on other domains.
References [1]
1. Roberto Basili, Marco Cammisa, and Emanuale Donati. RitroveRAI: A
web application for semantic indexing and hyperlinking of multimedia
news. In Proceedings of International Semantic Web Conference, 2005.
2. Özkan Bayraktar and Tuğba Taşkaya-Temizel. Person name extraction
from Turkish financial news text using local grammar based approach.
In Proceedings of the International Symposium on Computer and
Information Sciences, 2008.
3. Silviu Cucerzan and David Yarowsky. Language independent named
entity recognition combining morphological and contextual evidence. In
Proceedings of the Joint SIGDAT Conference on Empirical Methods in
Natural Language Processing and Very Large Corpora, 1999.
4. Mike Dowman, Valentin Tablan, Hamish Cunningham, and Borislav
Popov. Web-assisted annotation, semantic indexing and search of
television and radio news. In Proceedings of the International Conference
on World Wide Web (WWW), 2005.
5. Rıfat Ilgaz. Bacaksız Kamyon Sürücüsü. Çınar Publications, 2003.
6. Rıfat Ilgaz. Bacaksız Tatil Köyünde. Çınar Publications, 2003.
References [2]
7. Dilek Küçük and Adnan Yazıcı. Identification of coreferential chains in
video texts for semantic annotation of news videos. In Proceedings of the
International Symposium on Computer and Information Sciences, 2008.
8. David Nadeau and Satoshi Sekine. A survey of named entity
recognition and classification. Lingvisticae Investigationes,
30(1):3-26, 2007.
9. Bilge Say, Deniz Zeyrek, Kemal Oflazer, and Umut Özge. Development of
a corpus and a treebank for present-day written Turkish. In Proceedings
of the 11th International Conference of Turkish Linguistics (ICTL), 2002.
10. Ahmet Hamdi Tanpınar. Beş Şehir. Dergah Publications, 2007.
11. Horacio Saggion, Hamish Cunningham, Kalina Bontcheva, Diana
Maynard, Oana Hamza, and Yorick Wilks. Multimedia indexing through
multi-source and multi-language information extraction: MUMIS project.
Data and Knowledge Engineering, 48:247-264, 2004.
12. Gökhan Tür, Dilek Hakkani-Tür, and Kemal Oflazer. A statistical
information extraction system for Turkish. Natural Language
Engineering, 9(2):181-210, 2003.
Thank You