FINDING TOPICS IN CREATIVE WRITING ON ENVIRONMENTAL PRESERVATION FOR BETTER TEACHING STRATEGIES: A CASE OF STUDY IN AN ELEMENTARY SCHOOL FROM COLOMBIA

In this research, essays on tree preservation written by fourth-grade students at an elementary school in Colombia were evaluated with Latent Dirichlet Allocation (LDA). The objective was to extract the fundamental topics in order to understand, from their creative writing, the students' behavior and awareness towards the environment. The computational results suggest that the students' reflections on environmental preservation focus on five main topics: Teach-Learn to care for the environment, Explore-discover the environment, Well-being of the environment, Concern for the environment, and Restoration and conservation of the environment. This LDA-based text analysis can complement teachers' manual analysis, reducing veracity bias and supporting the improvement of teaching strategies.


INTRODUCTION
When researchers need to analyze a large amount of text, they need tools that can process, decode, and interpret significant information while avoiding veracity bias and information overload (infoxication) (B. Chen, Chen, & Xing, 2015; Seufert, Guggemos, & Sonderegger, 2019). Topic modeling (TM) is a set of computational tools based on machine learning for the analysis and extraction of topics from text, audio, and video (Blei, Ng, & Jordan, 2002; Chuang, Gupta, Manning, & Heer, 2013), including sources with large volumes of information (Letsche & Berry, 1997; Z. Liu, 2013; Röder et al., 2015; Stevens et al., 2012).
When applied to text analysis, TM can find patterns of occurrence of words or terms with sufficient statistical weight to form topics consistent with the texts (Anandarajan et al., 2019; Blei et al., 2003; Ezen-Can & Boyer, n.d.; Landauer et al., 2011; Xun et al., 2017). Basically, these techniques represent combinations of terms as vectors and relate each vector to the others according to their statistical proximity, determining the frequency of each vector or of each term (Blei et al., 2003; Jelodar et al., 2019; Prabhakaran, 2018). After generating a matrix, TM examines (a) which vectors have the greatest statistical weight and (b) which vectors are closest to them; (c) in this way, a group of vectors related by their frequency of occurrence is obtained (Prabhakaran, 2018). With this information, the most frequent latent topics in the collection of documents can be generated. Thus, the volume of information is reduced by synthesizing topics with a hierarchically organized statistical contribution, which reveals concepts and underlying intentions that are not observable with traditional text analysis (Balyan et al., 2020; Fischer et al., 2020). Deciding how many topics are relevant to a collection of documents depends on the specificity and interpretability criteria applied by the data analysts and researchers involved in the study (Lenhart et al., 2020). One of the most relevant TM techniques for text analysis in educational sciences is Latent Dirichlet Allocation (LDA), since it is
Research groups have employed LDA to analyze students' writing in general. These investigations aim to identify student reflections by discovering significant topics in their texts (Y. Chen et al., 2016; Chuang et al., 2013; Ming, N. C. & Ming, 2012). They also focus on peer learning and analyze the relevant topics of the activities in order to generate better strategies for science teaching (Louvigne et al., 2014; Tran et al., 2019). LDA has been used to grade student texts when the topics are defined in advance, which makes it possible to measure the coherence score of each exam while avoiding teacher bias (Y. Chen et al., 2016; Kakkonen et al., 2006). Furthermore, this technique is used to identify patterns in scientific and socio-environmental argumentative writing, which helps teachers improve their class activities and reflections (Xing et al., 2020). Teachers can investigate the students' interpretations, how they think, and what they care about, and use this information to monitor the evolution of the learning process and adjust their class activities (Erkens et al., 2016; Ming, N. C. & Ming, 2012; Xing et al., 2020).
In this study, we proposed a creative writing activity in which 4th grade students wrote a story whose main theme was the preservation of the trees at their educational center. The objective was to discover the main topics on environmental sustainability that students reflected upon after being taught in class. We propose a method using LDA with the MALLET toolkit to automatically explore, assess, and discover the topics of students' creative writing (O'Callaghan et al., 2015). By finding the optimal number of topics in the writings, we aim to strengthen their manual analysis. From these topics, it is possible to determine the associations that students make about caring for the environment, which allows teachers to generate more effective didactic strategies for teaching environmental sustainability.

Data preparation and text analysis
The texts employed in the investigation were the creative writing of 23 fourth-grade students (9-11 years old), with the main theme of the preservation of trees at their educational center (Institución Educativa San Jerónimo Emiliani, Tunja, Colombia). Students were encouraged to include their experiences, imagination, and things that are important to them. The purpose was to investigate and understand how students interpret and interact with the environmental context.

TM was employed to strengthen the teachers' analysis and interpretation and to generate didactic strategies for teaching environmental sustainability based on the significant topics that students reflect upon. The texts were converted from .docx to .txt format so that they could be fed into our TM algorithm (figure SI1, Supplementary Information, SI).
Topic Modeling
For the TM, we used the LDA algorithm with the MALLET toolkit in Python 3.8, with modifications according to our interest (figure SI1). The MALLET toolkit is a machine learning tool for discovering common topics in different text classes and dimensions (O'Callaghan et al., 2015), with good statistical consistency of topics compared to other techniques. The selection of the number of topics is important in the analysis of the texts (Greene et al., 2014). If the number of topics is too large, the topics will be redundant; if it is too small, the different categories cannot be separated from each other and the topics will be too broad (Chuang et al., 2013; Greene et al., 2014). Generally, the number of topics (K) is adjusted by running the modeling iteratively to obtain a score for each K and then choosing the best value of K, following the mathematical analysis of Derek Greene (Greene et al., 2014). The result is then checked against prior knowledge obtained from reading the texts.
Considering that each student reflects on at least one topic of interest, K >= 1 is assumed. Since students could address various topics in their texts, we limited this number to 30 significant topics (K <= 30). The LDA algorithm was used with the MALLET toolkit to determine the model that best fits the 23 texts in the range of 1 to 30 possible topics, with the aim of a better analysis of the texts that avoids human bias (Chang, Boyd-Graber, Gerrish, Wang, & Blei, 2009; Y. Chen et al., 2016).
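The iterative search over K described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `score_model` is a hypothetical callable standing in for "train an LDA model with K topics and return a quality score, higher being better", and `toy_score` is a made-up function, not the study's data. The sketch covers only the automatic part of the selection; in the study the choice was also checked against the teacher's reading of the texts.

```python
from typing import Callable, Dict, Tuple


def select_best_k(score_model: Callable[[int], float],
                  k_min: int = 1,
                  k_max: int = 30) -> Tuple[int, Dict[int, float]]:
    """Score one topic model per K in [k_min, k_max] and return the best K.

    score_model is a hypothetical callable: it should train a model with
    K topics and return a quality score where higher is better.
    """
    scores = {k: score_model(k) for k in range(k_min, k_max + 1)}
    best_k = max(scores, key=scores.get)
    return best_k, scores


def toy_score(k: int) -> float:
    # Made-up stand-in for "train LDA with K topics and score it".
    return 1.0 / (1.0 + abs(k - 5))


best_k, scores = select_best_k(toy_score)
print(best_k, len(scores))
```

Keeping the full `scores` dictionary, rather than only the winner, makes it possible to plot the score against K and to inspect neighbouring values by hand.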
The number of topics was set to K = 5, since it is an intermediate value between two extreme peaks of initial topics (Greene et al., 2014; Sarkar, 2019); this choice was corroborated by the knowledge obtained from reading the texts (figure 1). Each topic is represented by a list of keywords. The topics are named manually according to their keywords (Xing et al., 2020; Zhu et al., 2019), and the difference between two topics is quantified by the agreement or consistency score between them (Lei et al., 2020; Stevens et al., 2012). The final stability score is calculated as the average of the agreement scores among all 5 topics of the chosen model. The higher the stability score, the more robust the model (Greene et al., 2014; Stevens et al., 2012). In our model, the highest stability score was found with K = 5. Fig. 1 shows the stability scores for K from 1 to 30. The result found by the algorithm therefore supports the result obtained from reading the texts, and this model was chosen to characterize the most semantically relevant words of each topic extracted by the MALLET toolkit (Song et al., 2009; Wang & Liu, 2017).
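One way to make an agreement-based stability score concrete is sketched below. It is an assumption-laden illustration, not the authors' implementation: agreement between two sets of topics is taken here as the mean Jaccard similarity of their keyword lists under the best one-to-one matching of topics, and stability as the average agreement between a reference model and models trained on resampled data. The keyword lists are invented toy values. Brute-force matching over permutations is only practical for small K, such as the K = 5 used in this study.

```python
from itertools import permutations
from typing import List, Sequence


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard similarity between two keyword lists (order ignored)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def agreement(model_a: List[List[str]], model_b: List[List[str]]) -> float:
    """Mean Jaccard similarity under the best one-to-one topic matching."""
    k = len(model_a)
    if k != len(model_b):
        raise ValueError("both models must have the same number of topics")
    best = 0.0
    for perm in permutations(range(k)):
        total = sum(jaccard(model_a[i], model_b[j]) for i, j in enumerate(perm))
        best = max(best, total / k)
    return best


def stability(reference: List[List[str]],
              samples: List[List[List[str]]]) -> float:
    """Average agreement between the reference model and each sample model."""
    return sum(agreement(reference, s) for s in samples) / len(samples)


# Toy keyword lists (invented for illustration only).
reference = [["tree", "care", "teach"], ["water", "forest", "animal"]]
samples = [
    [["forest", "water", "animal"], ["tree", "care", "learn"]],
    [["water", "forest", "animal"], ["tree", "care", "teach"]],
]
score = stability(reference, samples)
print(score)
```

The matching step matters because topic indices are arbitrary: the same topic can appear as topic 1 in one run and topic 2 in another.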

Common Topics in Creative Writing
The MALLET toolkit was employed to discover the possible topics covered by the 23 students' writings, which have an average of 581 words (table SI1). All terms that occurred fewer than 5 times and all terms with an occurrence rate above 60% were removed (Greene et al., 2014), which reduced the analysis to a total of 148 words and improved the model. To find the optimal number of topics, an iterative approach was used in which topic models were built for each number of topics in the range of 1 to 30. The topic model with the best coherence score was selected (Greene et al., 2014), and its quality was verified with coherence metrics (Aletras & Stevenson, 2013; Greene et al., 2014; Mimno et al., 2011). Fig. 1 shows the topic models according to their coherence score (Aletras & Stevenson, 2013; O'Callaghan et al., 2015); the best coherence score was obtained for k = 5, which was also corroborated by the teacher's manual reading.
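The vocabulary pruning step can be illustrated with a short, self-contained sketch. It rests on an assumption: the 60% threshold is read here as "appears in more than 60% of the documents", and the lower threshold as a total count across the corpus. The function name, parameter names, and toy documents are hypothetical; the toy call uses a lower `min_count` only to keep the example small.

```python
from collections import Counter
from typing import List


def filter_vocabulary(docs: List[List[str]],
                      min_count: int = 5,
                      max_doc_fraction: float = 0.6) -> List[List[str]]:
    """Drop terms that are too rare or present in too many documents."""
    # Total occurrences of each term across the corpus.
    term_count = Counter(t for doc in docs for t in doc)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(t for doc in docs for t in set(doc))
    n_docs = len(docs)
    keep = {t for t in term_count
            if term_count[t] >= min_count
            and doc_freq[t] / n_docs <= max_doc_fraction}
    return [[t for t in doc if t in keep] for doc in docs]


# Toy tokenized documents (invented for illustration only).
toy_docs = [
    ["tree", "water", "water"],
    ["tree", "forest"],
    ["tree", "water"],
    ["tree", "soil"],
    ["forest", "bird"],
]
filtered = filter_vocabulary(toy_docs, min_count=2, max_doc_fraction=0.6)
print(filtered)
```

In the toy corpus, "tree" is removed for appearing in 4 of 5 documents, while "soil" and "bird" are removed for occurring only once.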

Figure 1. Coherence score as a function of the number of topics.
Note. The best model for the analysis of the texts was selected as k = 5, since it is an intermediate value between two initial external peaks (Greene et al., 2014). The 5 topics found by MALLET helped avoid veracity bias and reinforced the teacher's analysis of the reading.
Figure SI1 shows the results extracted from the topic model. These data are essential for the analysis of the students' creative writing. From table SI1, it is possible to correlate parameters for the text analysis, as shown in fig. 2a, 2b and 2c. Fig. 2a shows the number of terms and fig. 2b shows the semantic contribution of each document. Comparing these results, one observes that the semantic contribution of each document does not depend on the number of terms, but on the connections of meanings, such as the values, behaviors, teachings, and attitudes that students reflect in their writing. For example, according to fig. 2a and 2b, Document 13 has 1070 terms with a semantic contribution of 32.50%, whereas Document 11 has 917 terms with a semantic contribution of 47.17%, which supports this observation. In addition, fig. 2c shows that these two documents belong to the same dominant topic (topic = 5).

Source. Authors
Figure SI2 shows that the model builds the five topics from the set of relevant keywords (terms by topic). Cases of polysemy are observed (Yoshida et al., 2020), since similar words are found in different topics and within the same topic. Such cases are difficult to detect with traditional methods of text analysis and word counting. The model estimates the semantic contribution of the terms, generating the content of the dominant topic in each document, according to tables SI1 and SI2. Fig. 3 shows the dominant topic of the students' texts, where the x-axis represents the texts that contribute to that topic and the y-axis represents the semantic

Camilo Arturo Suárez Ballesteros María Claudia Esperanza Bernal Camargo Nidia Yaneth Torres Merchán
contribution of each text. The results allow a better-informed decision about the topic of each text. Table SI3 shows the documents with the highest semantic contribution to each topic. These documents express the most significant reflections on the solution of socio-environmental problems, which corresponds to a greater construction of meaning and understanding by the student. In contrast, the documents with the lowest semantic contribution indicate that the idea of the topic was not clearly developed, which may be due to factors such as little interest in writing, demotivation, and short texts that do not fully develop the idea.
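The notion of a dominant topic with a percentage contribution can be sketched as follows. This is an assumption about how the quantities relate, not a description of the authors' code: the "semantic contribution" is taken here to be the share of the document's topic weights held by its highest-weighted topic. The function name and the toy weights are hypothetical.

```python
from typing import Sequence, Tuple


def dominant_topic(topic_weights: Sequence[float]) -> Tuple[int, float]:
    """Return (1-based topic number, percentage share) of the top topic."""
    total = sum(topic_weights)
    if total <= 0:
        raise ValueError("topic weights must sum to a positive value")
    best_index = max(range(len(topic_weights)),
                     key=topic_weights.__getitem__)
    return best_index + 1, 100.0 * topic_weights[best_index] / total


# Toy weights for one document over 5 topics (invented for illustration).
topic, contribution = dominant_topic([2, 1, 4, 3, 10])
print(topic, contribution)
```

Because the share is normalized by the document's own total, a long document does not automatically obtain a higher contribution than a short one, which matches the observation that contribution does not track the number of terms.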

Evaluation model using perplexity and coherence metrics
Topic coherence metrics can be used to measure the quality of models. Coherence measures take a word pair or a subset of words, together with their probabilities (Aletras & Stevenson, 2013; Chuang et al., 2013; Röder et al., 2015), and calculate the consistency of the word set with each term (Aletras & Stevenson, 2013). In our case, we used three metrics to evaluate and interpret the model generated by MALLET (Stevens et al., 2012): perplexity, average coherence score (Cv), and average coherence score (UMass). According to the literature, low perplexity indicates a good model, as do a low UMass value and a high Cv value (Stevens et al., 2012). The results for these three parameters are: Model Perplexity: -8.53533, Average Coherence Score (UMass): -0.9222260707279725, and Average Coherence Score (Cv): 0.3408527968278528, which represents a good result for the model compared with other works (Aletras & Stevenson, 2013; Bhardwaj et al., 2010; Y. Chen et al., 2016; Kherwa & Bansal, 2018; Sarkar, 2019; Song et al., 2009; Valdez et al., 2018).
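To make the UMass measure more tangible, the sketch below computes it for a single topic from document co-occurrence counts, using the summed log-ratio form with add-one smoothing described by Mimno et al. (2011). It is a minimal illustration on invented toy documents: the function name is hypothetical, and toolkits may normalize or average the pairwise terms differently, so values from this sketch are not directly comparable with the scores reported above.

```python
import math
from typing import List, Sequence, Set


def umass_coherence(top_terms: Sequence[str], docs: List[Set[str]]) -> float:
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over ordered term pairs.

    top_terms must be ordered from most to least probable in the topic;
    D(.) counts the documents that contain the given term(s).
    """
    score = 0.0
    for m in range(1, len(top_terms)):
        for l in range(m):
            d_l = sum(1 for doc in docs if top_terms[l] in doc)
            d_ml = sum(1 for doc in docs
                       if top_terms[l] in doc and top_terms[m] in doc)
            if d_l == 0:
                raise ValueError(f"term {top_terms[l]!r} not in any document")
            score += math.log((d_ml + 1) / d_l)
    return score


# Toy documents as sets of terms (invented for illustration only).
toy_docs = [
    {"tree", "water"},
    {"tree", "forest"},
    {"tree", "water", "forest"},
    {"tree"},
]
coherence = umass_coherence(["tree", "water", "forest"], toy_docs)
print(coherence)
```

Each pairwise term compares how often two top words appear together with how often the more probable word appears at all, so topics whose keywords co-occur in the same documents receive scores closer to zero.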

Figure 3. The dominant topic of each document with its respective semantic contribution

Source. Authors

DISCUSSION
Currently, the use of topic modeling in educational sciences has gained importance as an alternative for qualitative and quantitative research (Fischer et al., 2020; Seufert et al., 2019). The capability of computational tools to discover the thematic structures of large numbers of documents allows the design of complementary strategies in text analysis contexts (Faustmann, 2018). An outstanding and complementary advantage of these tools is the possibility of reducing the investigator's veracity bias in the analysis of results (Faustmann, 2018; S. Liu et al., 2019), which makes it possible to monitor communicative and creative thinking, examine positive attitudes towards writing, and investigate interdisciplinary scope and curricular integration.
In this work, the teacher designed an activity in which the students expressed, through writing, their reflections and thoughts on the subject of tree preservation. To extract the fundamental topics from the writings with topic modeling techniques, we used Latent Dirichlet Allocation with the MALLET toolkit. Through MALLET, five topics were found in the 23 texts: Teach-Learn to care for the environment, Explore-discover the environment, Well-being of the environment, Concern for the environment, and Restoration and conservation of the environment. The texts with the best semantic contribution to each topic had a greater connection of meanings related to the problems, teachings, values, attitudes, and behaviors regarding environmental sustainability (Ramadhan et al., 2019). In contrast, in the texts with low semantic contribution (table 1), the connections of socio-environmental meanings were shallow, possibly due to a lack of interest in writing. The coherence metrics of the model are good compared to other works in the field of study (Aletras & Stevenson, 2013; Y. Chen et al., 2016; Faustmann, 2018; Stevens et al., 2012; Zhu et al., 2019).
In addition, this semantic characterization allows us to identify the conceptual strengths and weaknesses of students on environmental sustainability issues. With this in mind, teachers can develop didactic strategies that improve their class dynamics and content (Murillo, 2013). For example, teachers can design tasks in which students express their thoughts about social, environmental, and scientific problems in order to characterize their concepts, investigate strategies that motivate them, and then teach from the students' conceptual interests with the support of these computational tools.

CONCLUSION
In this study, using LDA with the MALLET toolkit, we carried out an exploratory analysis of the creative writing of 4th grade students on the main theme of tree preservation.

The results suggest the potential of topic modeling for the semantic analysis and characterization of students' reflections. With the LDA technique, we achieved the research objective of extracting the fundamental topics, which makes it possible to understand the students' behavior and awareness towards the environment from their creative writing. We found five relevant topics in their reflections on the preservation of the environment: Teach-Learn to care for the environment, Explore-discover the environment, Well-being of the environment, Concern for the environment, and Restoration and conservation of the environment. These findings can complement teachers' manual analysis in their classroom research, reducing bias and allowing classroom activities to be investigated with strategies that address conceptual weaknesses in environmental sustainability.
Topic modeling can contribute to the construction of analytical tools for the determination of hidden components in the essays, stories, debates, and writing of students at an early age. From this analysis, one can generate didactic strategies based on the concepts and thoughts of the students to strengthen critical thinking. As a perspective for future work, we seek to design activities related to socio-scientific problems to determine the conceptual veracity and in this way to build dynamics in which the student can be motivated and empowered in scientific thinking.