Current Proceedings on Technology
Authors: Mercè Vàzquez, Antoni Oliver
Topics: -
Keywords: Term candidate validation, Ranking metrics, Term extraction, Token slot detection
Abstract: It is often difficult to automatically identify the most representative terms in a specialized corpus and to validate them as correct, because general-language words and terms look alike. To identify the most representative terms in a corpus with an approach that can be easily adapted to any language or terminology extraction tool, we explore the combination of token slot extraction and ranking metrics to select term candidates with a high likelihood of being terminological units. This paper presents the results obtained with four statistical measures. We observe high term detection in English corpora (a precision of 76.92% and a recall of 79.09%) and Spanish corpora (a precision of 60% and a recall of 70.48%) using token slot detection together with four ranking metrics: Dice, True Mutual Information, T-score and Log-likelihood. In conclusion, token slot detection extracts terminological patterns in term candidates and thereby shortens the candidate lists, while ranking metrics improve the results and reduce the number of candidates that must be evaluated manually. In future work, we will evaluate the algorithm's performance in other domains and for other user profiles and needs.
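As an illustration of how the four ranking metrics named in the abstract can order two-word term candidates, here is a minimal Python sketch. It uses common textbook formulations over a 2x2 contingency table; the abstract does not give the authors' exact formulas, so these definitions, the function names and the candidate counts below are all assumptions made for the example, not data or code from the paper.

```python
import math

def contingency(o11, f1, f2, n):
    """Observed and expected cell counts for a bigram (w1, w2).

    o11: joint frequency of (w1, w2); f1, f2: frequencies of w1 and w2
    in first and second position; n: total number of bigrams.
    """
    observed = [o11, f1 - o11, f2 - o11, n - f1 - f2 + o11]
    rows = [f1, f1, n - f1, n - f1]
    cols = [f2, n - f2, f2, n - f2]
    expected = [r * c / n for r, c in zip(rows, cols)]
    return observed, expected

def dice(o11, f1, f2, n):
    return 2 * o11 / (f1 + f2)

def t_score(o11, f1, f2, n):
    return (o11 - f1 * f2 / n) / math.sqrt(o11)

def true_mutual_information(o11, f1, f2, n):
    obs, exp = contingency(o11, f1, f2, n)
    # Cells with zero observations contribute nothing to the sum.
    return sum(o / n * math.log2(o / e) for o, e in zip(obs, exp) if o > 0)

def log_likelihood(o11, f1, f2, n):
    obs, exp = contingency(o11, f1, f2, n)
    return 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)

METRICS = {
    "dice": dice,
    "tmi": true_mutual_information,
    "t_score": t_score,
    "log_likelihood": log_likelihood,
}

def rank(candidates, metric, n):
    """Sort candidates by the chosen metric, highest score first."""
    score = METRICS[metric]
    return sorted(candidates, key=lambda c: score(c[1], c[2], c[3], n), reverse=True)

# Invented counts, for illustration only: (candidate, o11, f1, f2).
N = 1000
CANDIDATES = [
    ("term extraction", 8, 10, 12),
    ("of the", 1, 10, 100),
    ("ranking metric", 4, 20, 6),
]

if __name__ == "__main__":
    for name in METRICS:
        print(name, [c[0] for c in rank(CANDIDATES, name, N)])
```

In a setting like the one the abstract describes, a ranking of this kind would be applied to the candidates that survive token slot detection, so that manual validation can start from the top of the list.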