Adaptation, Comparison, and Improvement of Metaheuristic Algorithms to the Part-of-Speech Tagging Problem

1 Universidad del Cauca (Popayán-Cauca, Colombia). miguelsolano@unicauca.edu.co. ORCID: 0000-0003-1936-3488
2 Universidad del Cauca (Popayán-Cauca, Colombia). josej@unicauca.edu.co. ORCID: 0000-0002-5436-0816
3 Ph. D. Universidad del Cauca (Popayán-Cauca, Colombia). lsierra@unicauca.edu.co. ORCID: 0000-0003-3847-3324
4 Ph. D. Universidad del Cauca (Popayán-Cauca, Colombia). ccobos@unicauca.edu.co. ORCID: 0000-0002-6263-1911
Revista Facultad de Ingeniería (Rev. Fac. Ing.) Vol. 29 (54), e11762. 2020. Tunja-Boyacá, Colombia. L-ISSN: 0121-1129, e-ISSN: 2357-5328, DOI: https://doi.org/10.19053/01211129.v29.n54.2020.11762

Part-of-Speech Tagging (POST) is a complex task in the preprocessing stage of Natural Language Processing applications. Tagging has been tackled with statistical and rule-based approaches, using a range of methods. More recently, metaheuristic algorithms have gained attention and have been used in a wide variety of knowledge areas with good results. They were therefore applied in this research to the POST problem, to assign the best sequence of tags (roles) to the words of a sentence based on statistical information. This process was carried out in two cycles, each comprising four phases, which allowed the adaptation to the tagging problem of metaheuristic algorithms such as Particle Swarm Optimization, Jaya, Random-Restart Hill Climbing, and a memetic algorithm based on Global-Best Harmony Search as a global optimizer and on Hill Climbing as a local optimizer. In the consolidation of each algorithm, preliminary experiments were carried out (using cross-validation) to adjust the parameters of each algorithm and, thus, evaluate them on the datasets of the complete tagged corpora: IULA (Spanish), Brown (English), and Nasa Yuwe (Nasa). The results obtained by the proposed taggers were compared, and the Friedman and Wilcoxon statistical tests were applied, confirming that the proposed memetic algorithm, GBHS Tagger, obtained better results in precision. The proposed taggers make an important contribution to POST for traditional languages (English and Spanish), non-traditional languages (Nasa Yuwe), and their application areas.

Keywords: computational intelligence; computational linguistics; evolutionary computing; heuristic algorithms; natural language processing; parts of speech tagging; search methods.

I. INTRODUCTION
Metaheuristic algorithms are applied every day in a variety of areas of knowledge. It is not unusual, therefore, to use them for the problem of Part-of-Speech Tagging (POST) or Identification. This is a complex task of great importance in Natural Language Processing, given the challenges it faces, such as the ambiguity of words, the size of the tag set, and the tagging of unknown words [1,2].
Metaheuristic algorithms have been used in the tagging problem (POST) to assign the best sequence of tags (roles) to the words of a sentence, based on both statistical information and transformation rules, obtaining outstanding results in contrast to traditional approaches. Related work includes: 1) Alhasan and Al-taani [3], who represented the tagging problem as a graph in which the nodes are the possible tags of a sentence and used the optimization algorithm by Bee […] Said metaheuristic approaches have been applied to tagged corpora in English, such as the Brown Corpus [8] and the Penn Treebank Corpus [9], and in non-traditional languages such as Arabic with the KALIMAT corpus [10], Bengali (Bangladesh) [11], Hindi (India) [12], Telugu (India) [13], and Nasa Yuwe (an indigenous language of Colombia) [14]. Generally, these proposals use the Petrov tag set [15].
Metaheuristic algorithms solve problems through a search process (exploration and exploitation) for optimal solutions to a particular problem [16]. Memetic algorithms [17], in turn, use population-based search to explore solutions and neighborhood-based local search to exploit promising solutions [18,19]; they also incorporate knowledge of the problem in order to solve it. Table 1 describes the metaheuristics studied in this research for their subsequent adaptation to the tagging problem.

Table 1. Metaheuristic algorithms studied to adapt to the tagging problem.

Metaheuristic algorithm | Description
Random-Restart Hill Climbing (RRHC) [20] | Single-state metaheuristic that improves Hill Climbing (HC) [17]. It seeks to prevent HC from being trapped in a local optimum by performing repeated explorations of the problem space, which are generated randomly until the stop criterion is met or no better solution is found.
Particle Swarm Optimization (PSO) [21] | Population metaheuristic motivated by the intelligent collective behavior of swarms in nature. Each potential solution is called a particle, the set of particles is known as a swarm, and the position of each particle changes depending on its own experience and the experience of the swarm [22].
Jaya [23] | Population metaheuristic that seeks to reach the best solution in the shortest possible time while always moving away from failure, which yields a balance between exploration and exploitation. Jaya is a novel, simple, and efficient algorithm for solving constrained and unconstrained optimization problems.
GBHS Tagger [4] | Memetic algorithm adapted to the tagging problem [4], based on the Global-Best Harmony Search (GBHS) metaheuristic, which has the following parameters [17]: HMS (Harmony Memory Size), NI (Number of Improvisations), HMCR (Harmony Memory Consideration Rate), and PARMin and PARMax (minimum and maximum Pitch Adjustment Rate). GBHS Tagger includes knowledge of the tagging problem through a local optimizer, adapted from the Hill Climbing (HC) metaheuristic [17], which is applied to the best harmony in the harmony memory (HM). In addition to the GBHS parameters, three more parameters are defined: ProbOpt, which controls the percentage of times the local optimization process is carried out; MaxNeighbors, which defines the number of neighbors used in the local optimization process; and Alpha, which controls whether the components of each harmony in the population are randomly generated from their possible tags or taken from the tag with the highest probability.

Figure 1 shows the representation of the solution used in this research, which consists of: 1) a first vector whose size is the number of words in the sentence (one position per word), which contains the tags assigned to each word, from position 0 to position n-1 (T_0, T_1, …, T_{n-1}); 2) a second vector containing the cumulative probability of each tagged word and its relationship with its predecessor and successor; and 3) a field that stores the value of the fitness function, calculated as shown in Figure 1, adapted from [5]. In GBHS Tagger, the selected context for the word to be tagged is a trigram (predecessor, word to tag, successor).
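The three-part representation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the exact fitness formula is the one given in Figure 1 and is not reproduced in the text, so the sketch assumes that the score of each position is the sum of a lexical probability and the two tag-transition probabilities of its trigram context, and that the fitness is the sum of the per-position scores. The names `Solution`, `score_position`, `evaluate`, `lexical`, and `transition`, as well as the toy probabilities, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy statistics; real values would be estimated from a tagged corpus.
lexical: Dict[Tuple[str, str], float] = {
    ("the", "DET"): 1.0,
    ("dog", "NOUN"): 0.9, ("dog", "VERB"): 0.1,
    ("barks", "VERB"): 0.7, ("barks", "NOUN"): 0.3,
}
transition: Dict[Tuple[str, str], float] = {
    ("DET", "NOUN"): 0.8, ("DET", "VERB"): 0.05,
    ("NOUN", "VERB"): 0.6, ("VERB", "NOUN"): 0.3,
}


@dataclass
class Solution:
    tags: List[str]                                    # T_0 ... T_{n-1}, one tag per word
    probs: List[float] = field(default_factory=list)   # P_0 ... P_{n-1}, one score per word
    fitness: float = 0.0                               # aggregate quality of the tag sequence


def score_position(words: Sequence[str], tags: Sequence[str], i: int) -> float:
    """Assumed score: lexical probability plus links to predecessor and successor."""
    score = lexical.get((words[i], tags[i]), 0.0)
    if i > 0:
        score += transition.get((tags[i - 1], tags[i]), 0.0)
    if i < len(tags) - 1:
        score += transition.get((tags[i], tags[i + 1]), 0.0)
    return score


def evaluate(words: Sequence[str], tags: Sequence[str]) -> Solution:
    """Build the full representation: tag vector, score vector, and fitness."""
    probs = [score_position(words, tags, i) for i in range(len(words))]
    return Solution(tags=list(tags), probs=probs, fitness=sum(probs))


words = ["the", "dog", "barks"]
good = evaluate(words, ["DET", "NOUN", "VERB"])
bad = evaluate(words, ["DET", "VERB", "NOUN"])
```

Under these assumed statistics, the plausible sequence in `good` receives a higher fitness than the implausible one in `bad`, which is the property the search algorithms rely on.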
In the present work, several metaheuristic algorithms were adapted to the tagging problem, using the solution representation proposed in [4], in order to propose improvements to the memetic algorithm presented in the same work and, at the same time, to evaluate its performance on the IULA (Spanish) [24], Brown (English) [8], and Nasa Yuwe [14] corpora.
The rest of the article is organized as follows: Section 2 presents the methodology used; Section 3 details the adaptation of the selected metaheuristics to the tagging problem; Section 4 shows the results of the experiments carried out, and, finally, Section 5 presents the discussion, conclusions and future work.

II. METHODOLOGY
This section describes the dataset used for the evaluation of the algorithms, the activities carried out in each phase of the cycles of the Iterative Research Pattern (IRP) methodology [25], used for carrying out this work, and how the experiments were set up.

A. Used Method
Two cycles were used for this research. The first cycle focused on the adaptation of the metaheuristic algorithms to the tagging problem and the selection of the best one; the second cycle focused on the adaptation of the selected metaheuristic algorithms to the tagging problem and the proposal of a new version of the memetic algorithm. Table 2 describes the activities carried out in each phase.

B. Dataset and Experimental Setup
As part of this work, the IULA (Spanish), Brown (English), and Nasa Yuwe tagged corpora were integrated into a single database designed and developed in SQL Server. The experiments were carried out on this database and, for their execution (both preliminary and complete), a client-server model was used, in which the clients (machines) request the tasks to be carried out. Each task receives the sentence and the algorithm that it must run and evaluate. Likewise, each task is executed 30 times (repetitions of the experiment) on the local machine. Once the task is finished, the results are recorded in the cloud database.
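The client side of this setup can be sketched as follows. It is a simplified illustration, not the system used in the experiments: the database and the client-server communication are omitted, precision is assumed to be the fraction of correctly tagged words, and `run_task`, `precision`, and the placeholder `random_tagger` are hypothetical names.

```python
import random
from typing import Callable, List, Sequence

Tagger = Callable[[Sequence[str], random.Random], List[str]]


def precision(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Assumed metric: fraction of words that received the gold tag."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def run_task(words: Sequence[str], gold: Sequence[str], tagger: Tagger,
             repetitions: int = 30) -> List[float]:
    """Run one task (sentence plus algorithm) 30 times and collect the scores."""
    scores = []
    for rep in range(repetitions):
        rng = random.Random(rep)  # one seed per repetition keeps runs reproducible
        scores.append(precision(tagger(words, rng), gold))
    return scores  # in the described setup, these would be written to the database


def random_tagger(words: Sequence[str], rng: random.Random) -> List[str]:
    """Placeholder for any of the proposed taggers."""
    return [rng.choice(["NOUN", "VERB"]) for _ in words]


scores = run_task(["dog", "barks"], ["NOUN", "VERB"], random_tagger)
```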

III. RESULTS
This section first presents the adaptation of the algorithms to the tagging problem and a new version of the memetic GBHS Tagger (GBHS4Tagger). It then shows the experiments and the results obtained with the proposed taggers. All the adapted algorithms use the solution representation presented in [4] and described in Figure 1.

A. Proposed JayaTagger
A discrete version of Jaya, called DJaya and proposed in [27], was used; it is free of parameters. The adaptation consisted of moving towards the best-known solution and moving away from the worst solution. The handling of the worst-solution parameter was varied. The JayaTagger algorithm only handles three parameters: , MaxGenerations and 4. The latter controls whether the new solution selects a tag from the worst solution, which makes the algorithm simple to implement and evaluate. The proposed JayaTagger pseudocode is presented in Figure 2.
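The idea of moving towards the best solution and away from the worst one can be sketched for discrete tag vectors as follows. This is not the pseudocode of Figure 2: the probabilities `p_best` and `p_avoid_worst`, the population size `pop_size`, and the function names are hypothetical, and the fitness function is passed in as a callable.

```python
import random
from typing import Callable, List, Sequence


def jaya_move(current: Sequence[str], best: Sequence[str], worst: Sequence[str],
              candidates: Sequence[Sequence[str]], rng: random.Random,
              p_best: float = 0.5, p_avoid_worst: float = 0.9) -> List[str]:
    """Build a trial solution position by position."""
    trial = []
    for i, tag in enumerate(current):
        if rng.random() < p_best:
            tag = best[i]  # move towards the best-known solution
        if tag == worst[i] and tag != best[i] and rng.random() < p_avoid_worst:
            others = [t for t in candidates[i] if t != worst[i]]
            if others:
                tag = rng.choice(others)  # move away from the worst solution
        trial.append(tag)
    return trial


def jaya_tagger(candidates: Sequence[Sequence[str]],
                fitness: Callable[[Sequence[str]], float], rng: random.Random,
                pop_size: int = 10, max_generations: int = 20) -> List[str]:
    """Population loop with the greedy acceptance rule used by Jaya."""
    population = [[rng.choice(c) for c in candidates] for _ in range(pop_size)]
    for _ in range(max_generations):
        ranked = sorted(population, key=fitness)
        worst, best = ranked[0], ranked[-1]
        for k, current in enumerate(population):
            trial = jaya_move(current, best, worst, candidates, rng)
            if fitness(trial) > fitness(current):  # keep only improvements
                population[k] = trial
    return max(population, key=fitness)
```

Here `candidates[i]` is the list of possible tags for word i, and any fitness defined on a tag vector can be plugged in.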

B. Proposed PSOTagger
A discrete version of PSO [26] was used. The proposed adaptation is governed by the following parameters: , which selects a random tag for each dimension of a particle; 1, which selects the tag of the particle's historical best for that word; 2, which selects the tag of the swarm's global best for each dimension of the particle; and , which maintains the components of the current particle. Additionally, PSOTagger involves the parameters and from its original version. The tuning of the , 1, 2, and parameters in PSO was carried out experimentally with 5-fold cross-validation and a small dataset taken as a sample of the evaluation dataset. The PSOTagger pseudocode is presented in Figure 3.
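The per-dimension choice among the four sources of a tag (random tag, particle's historical best, swarm's global best, current component) can be sketched as a weighted selection. This is an illustration only, not the pseudocode of Figure 3; the weight names `w_random`, `w_personal`, `w_global`, and `w_keep` and their default values are hypothetical stand-ins for the four parameters described above.

```python
import random
from typing import List, Sequence


def pso_update(position: Sequence[str], personal_best: Sequence[str],
               global_best: Sequence[str], candidates: Sequence[Sequence[str]],
               rng: random.Random, w_random: float = 0.1, w_personal: float = 0.3,
               w_global: float = 0.4, w_keep: float = 0.2) -> List[str]:
    """Discrete position update: each dimension picks its tag from one of four sources."""
    weights = [w_random, w_personal, w_global, w_keep]
    new_position = []
    for i, current_tag in enumerate(position):
        options = [
            rng.choice(candidates[i]),  # a random tag among those possible for word i
            personal_best[i],           # tag from the particle's own best position
            global_best[i],             # tag from the best position found by the swarm
            current_tag,                # keep the current component
        ]
        new_position.append(rng.choices(options, weights=weights, k=1)[0])
    return new_position
```

After each update, the particle would be re-evaluated and the personal and global bests refreshed, as in standard PSO.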

C. Proposed Random-Restart Hill Climbing (RRHC) Tagger
The adaptation of the RRHCTagger algorithm to the tagging problem was carried out as follows.
1) The following parameters were defined: n_restart, which controls the number of restarts of the solution; , a list that stores the probabilities of the possible tags of a word; , a list that stores the positions of the words that have more than one tag; and , a list that stores the words selected to make a stochastic improvement.
2) The solution is stochastically improved; after a certain number of iterations without improvement, the algorithm saves the current result and the solution is restarted (n_restart parameter), selecting another word from all the possibilities.
3) A tabu memory was implemented, which saves the words that were selected in the solution restart. The proposed RRHCTagger pseudocode is presented in Figure 4.
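The three steps above can be sketched as follows. This is a simplified reading, not the pseudocode of Figure 4: the starting solution (first candidate tag of each word), the `patience` threshold, and the way a restart re-draws the tag of one word are assumptions, and only `n_restart` is a name taken from the text. Candidate tags are assumed to be distinct for each word.

```python
import random
from typing import Callable, List, Sequence


def rrhc_tagger(candidates: Sequence[Sequence[str]],
                fitness: Callable[[Sequence[str]], float], rng: random.Random,
                n_restart: int = 5, patience: int = 10) -> List[str]:
    # Positions of the words that have more than one possible tag.
    ambiguous = [i for i, c in enumerate(candidates) if len(c) > 1]
    current = [c[0] for c in candidates]  # assumed starting point
    best = list(current)
    if not ambiguous:
        return best
    tabu = set()  # words already used to restart the solution
    for _ in range(n_restart):
        stale = 0
        while stale < patience:  # stochastic improvement phase
            i = rng.choice(ambiguous)
            neighbor = list(current)
            neighbor[i] = rng.choice([t for t in candidates[i] if t != current[i]])
            if fitness(neighbor) > fitness(current):
                current, stale = neighbor, 0
            else:
                stale += 1
        if fitness(current) > fitness(best):
            best = list(current)  # save the current result before restarting
        free = [i for i in ambiguous if i not in tabu]
        if not free:  # every ambiguous word was already used: reset the memory
            tabu.clear()
            free = list(ambiguous)
        i = rng.choice(free)
        tabu.add(i)
        current = list(current)
        current[i] = rng.choice([t for t in candidates[i] if t != current[i]])
    return best
```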

D. Proposed GBHS4Tagger
The GBHS4Tagger algorithm is based on the GBHSTagger memetic algorithm proposed in [4], and its improvement consists of the following steps.
1) The Hill Climbing (HC) algorithm was adapted to the tagging problem using two neighborhoods. The first one selects a random word, regardless of its condition, and the second selects the word with the lowest probability. These neighborhoods are controlled by a parameter.
2) The proposed HCTagger was incorporated into GBHS Tagger 2 [4] as a local optimizer, yielding the new memetic version called GBHS4Tagger. The proposed HCTagger pseudocode is shown in Figure 5, and the proposed GBHS4Tagger is shown in Figure 6.
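The two neighborhoods of the local optimizer can be sketched as follows. This is an illustration, not the pseudocode of Figure 5: the switching probability `p_random` is a hypothetical stand-in for the controlling parameter, `score(tags, i)` is an assumed per-word scoring function whose sum plays the role of the fitness, and `max_neighbors` mirrors the MaxNeighbors parameter of GBHS Tagger.

```python
import random
from typing import Callable, List, Sequence


def hc_tagger(tags: Sequence[str], candidates: Sequence[Sequence[str]],
              score: Callable[[Sequence[str], int], float], rng: random.Random,
              max_neighbors: int = 10, p_random: float = 0.5) -> List[str]:
    """Hill Climbing over tag vectors with two ways of choosing the word to change."""
    def fitness(t: Sequence[str]) -> float:
        return sum(score(t, i) for i in range(len(t)))

    current = list(tags)
    n = len(current)
    for _ in range(max_neighbors):
        if rng.random() < p_random:
            i = rng.randrange(n)  # neighborhood 1: any word, chosen at random
        else:
            # neighborhood 2: the word with the lowest score in the current solution
            i = min(range(n), key=lambda j: score(current, j))
        neighbor = list(current)
        neighbor[i] = rng.choice(candidates[i])
        if fitness(neighbor) > fitness(current):  # accept only improvements
            current = neighbor
    return current
```

In a memetic setting such as the one described, this routine would be applied to the best harmony in the harmony memory, with ProbOpt deciding how often it runs.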

E. Experiments with the proposed taggers
To carry out the experiments, the parameters of the taggers were first adjusted (fine-tuned) using a small dataset (sample) of 5000 sentences, in order to select the best combination of parameters for each algorithm. Table 3 shows the distribution of the sentences in the test and training datasets for each fold, with which the experiments were carried out on each complete corpus, as seen in Table 5. All the experiments were executed using 5-fold cross-validation, except for the Nasa Yuwe corpus, for which Leave-One-Out was used, since the dataset has only 175 sentences.
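Both validation schemes can be produced by the same splitting routine, since Leave-One-Out is k-fold cross-validation with k equal to the number of sentences. The sketch below is only illustrative: the function name `k_fold_splits` and the interleaved assignment of sentences to folds are assumptions, and the actual distribution used in the experiments is the one reported in Table 3.

```python
from typing import Iterator, List, Tuple


def k_fold_splits(n_sentences: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_sentences))
    folds = [indices[f::k] for f in range(k)]  # interleaved assignment to folds
    for f in range(k):
        test = folds[f]
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, test


five_fold = list(k_fold_splits(5000, 5))         # 5-fold cross-validation
leave_one_out = list(k_fold_splits(175, 175))    # Leave-One-Out for 175 sentences
```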
Table 4 presents the configuration of the algorithms for the experiments carried out with each corpus. Table 6 shows the ranking of each algorithm in the experiments carried out on each corpus after applying the Friedman NxN non-parametric statistical test, which yielded a p-value smaller than 0.05; the ranking is therefore statistically significant, which complements the evaluation of the algorithms. Additionally, the Wilcoxon test showed, at a 90% confidence level (significance level of 0.10), that the results obtained by the winning algorithms, GBHS4Tagger for Spanish and English and RRHCTagger for Nasa Yuwe, are better than those of the other proposed taggers.
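The ranking step of the Friedman test can be illustrated with a short sketch. It computes the mean rank of each algorithm across blocks (for example, folds) and the Friedman statistic without tie correction; the p-value would then be obtained by comparing the statistic with a chi-square distribution with k-1 degrees of freedom, which is not done here. The scores below are made-up values for three unnamed algorithms, not results from this study, and the function names are hypothetical. The Wilcoxon test is not covered by this sketch.

```python
from typing import List, Sequence, Tuple


def rank_row(row: Sequence[float]) -> List[float]:
    """Rank 1 goes to the highest score; tied scores share the average rank."""
    order = sorted(range(len(row)), key=lambda j: -row[j])
    ranks = [0.0] * len(row)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and row[order[end + 1]] == row[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for j in order[pos:end + 1]:
            ranks[j] = average
        pos = end + 1
    return ranks


def friedman(scores: Sequence[Sequence[float]]) -> Tuple[List[float], float]:
    """Return the mean rank per algorithm and the Friedman chi-square statistic."""
    n, k = len(scores), len(scores[0])
    ranks = [rank_row(row) for row in scores]
    mean_ranks = [sum(r[j] for r in ranks) / n for j in range(k)]
    statistic = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return mean_ranks, statistic


# Made-up precision values: one row per block, one column per algorithm (A, B, C).
toy_scores = [
    [0.90, 0.85, 0.80],
    [0.92, 0.86, 0.81],
    [0.91, 0.84, 0.83],
    [0.89, 0.88, 0.80],
]
mean_ranks, statistic = friedman(toy_scores)
```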

V. DISCUSSION AND CONCLUSIONS
This work achieved the adaptation of the metaheuristic algorithms PSO, Jaya, and RRHC to the part-of-speech tagging (POST) problem, taking into account the characteristics of each algorithm and performing the parameter adjustment required for each algorithm on each corpus, obtaining competitive results with respect to one of the state-of-the-art algorithms. It was also possible to propose an improvement to the state-of-the-art GBHS Tagger 2 memetic algorithm, which continued to demonstrate that the performance of the tagger improves when knowledge of the problem is included, as seen in the IULA (Spanish) and Brown (English) corpora.
Consequently, the presented research reinforced the idea that metaheuristic approaches are capable of performing tagging with good results, using acceptable resources and times. Metaheuristic algorithms should continue to be used for tagging in other traditional and non-traditional languages, and new improvements to the proposed taggers should be sought in combination with other optimization techniques that improve the tagging results.