Document vector embeddings for bibliographic record indexing
Abstract
This article presents the eXenSa contribution to the 2016 DEFT Workshop. The proposed task consists of indexing bibliographic records with keywords chosen by professional indexers. We propose a statistical approach that combines graphical and semantic approaches. The first approach defines a document's keywords as the thesaurus terms that are graphically similar to terms contained in the title or the abstract of that document. The second approach assigns to a document the keywords associated with semantically similar documents in the training corpora. Both approaches use models generated with NC-ISC, a stochastic matrix factorisation algorithm. Our system obtains the best F-score on two of the four test corpora and ranks second on the other two.
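To make the two approaches concrete, the following is a minimal Python sketch and not the system described here. It does not use NC-ISC: character-trigram cosine similarity stands in for the graphical model, and bag-of-words cosine similarity stands in for the semantic document embeddings. All function names, thresholds, and the toy data are assumptions introduced for illustration only.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count the character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def graphical_keywords(text, thesaurus, threshold=0.4):
    """Approach 1 (stand-in): keep thesaurus terms spelled like a word of the text.

    Assumption: trigram cosine replaces the NC-ISC graphical model.
    """
    words = [char_ngrams(word) for word in text.split()]
    selected = []
    for term in thesaurus:
        term_vec = char_ngrams(term)
        best = max((cosine(term_vec, word) for word in words), default=0.0)
        if best >= threshold:
            selected.append(term)
    return selected


def semantic_keywords(text, training, k=2):
    """Approach 2 (stand-in): transfer keywords from the k most similar documents.

    Assumption: bag-of-words cosine replaces NC-ISC document embeddings.
    `training` is a list of (document_text, keywords) pairs.
    """
    query = Counter(text.lower().split())
    scored = sorted(
        ((cosine(query, Counter(doc.lower().split())), kws) for doc, kws in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = Counter()
    for score, kws in scored:
        for kw in kws:
            votes[kw] += score  # each neighbour votes with its similarity
    return [kw for kw, score in votes.most_common() if score > 0]


# Toy data, invented for this example.
thesaurus = ["indexing", "thesauri", "chemistry"]
training = [
    ("neural networks for image classification", ["machine learning", "vision"]),
    ("medieval pottery excavation in France", ["archaeology"]),
]

graphical = graphical_keywords("Automatic indexation of records with a thesaurus", thesaurus)
semantic = semantic_keywords("image classification with deep networks", training)
```

In a combined system the two candidate lists would then be merged and ranked; how that merge is done is not specified in this abstract, so the sketch leaves them separate.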