Using Cross-Lingual Distributional Word Vectors for Bitext Alignment

Abstract

After introducing the necessary background through a review of the literature, this paper presents a case study examining how techniques developed in deep learning for natural language processing can be applied to bitext alignment. More specifically, the case study explores how various types of dense vector representations of words, popularly called word embeddings, which are usually used to approximate word meanings within a single language, can be utilised for bitext alignment, an application that, to my knowledge, has not yet been considered in the literature. To this end, several new variants of cross-lingual lexical vector space representations are proposed. One specific goal of the case study is to examine whether cross-lingual vector space models that use subword information improve alignment results compared to cross-lingual embeddings that represent each word as a discrete unit, ignoring its form and internal structure. The ability of the new word-vector-based solutions to identify corresponding segments in a bitext is measured on a small, manually checked gold-standard parallel corpus. The linguistic data on which the study is based comprise several volumes of the English and Hungarian versions of the Official Journal of the European Union.
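The general idea behind the alignment approach described above can be illustrated with a minimal sketch: if English and Hungarian words are embedded in a shared cross-lingual vector space, a segment can be represented by the mean of its word vectors, and corresponding segments can be paired by cosine similarity. The tiny embedding table, the greedy 1:1 pairing strategy, and all function names below are hypothetical illustrations, not the paper's actual method or data.

```python
import math

# Toy cross-lingual embedding space (hypothetical, hand-made vectors):
# English words and their Hungarian translations sit close together.
EMBEDDINGS = {
    "treaty":     [0.90, 0.10, 0.00],
    "szerzodes":  [0.88, 0.12, 0.05],  # "szerződés" = treaty
    "member":     [0.10, 0.90, 0.10],
    "tagallam":   [0.12, 0.85, 0.15],  # "tagállam" = member state
    "regulation": [0.00, 0.20, 0.90],
    "rendelet":   [0.05, 0.18, 0.92],  # "rendelet" = regulation
}

def segment_vector(tokens):
    """Mean of the word vectors of the tokens found in the embedding table."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def align(en_segments, hu_segments):
    """Greedily pair each English segment with the most similar
    Hungarian segment in the shared vector space."""
    pairs = []
    for i, en in enumerate(en_segments):
        ev = segment_vector(en)
        best, best_sim = None, -1.0
        for j, hu in enumerate(hu_segments):
            hv = segment_vector(hu)
            if ev is None or hv is None:
                continue
            sim = cosine(ev, hv)
            if sim > best_sim:
                best, best_sim = j, sim
        pairs.append((i, best))
    return pairs

en = [["treaty", "member"], ["regulation"]]
hu = [["rendelet"], ["szerzodes", "tagallam"]]
print(align(en, hu))  # → [(0, 1), (1, 0)]
```

A subword-aware variant (as the paper investigates) would additionally build vectors for out-of-vocabulary or inflected forms from character n-grams, which is especially relevant for a morphologically rich language such as Hungarian; that part is omitted here for brevity.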

Keywords
natural language processing, word embeddings, bitext alignment