Smart Product Backlog: Automatic Classification of User Stories Using Large Language Models (LLM)
Abstract
In agile software development, and specifically in intelligent applications that leverage artificial intelligence (AI), the Smart Product Backlog (SPB) is an artifact that contains both functionalities that can be implemented with AI and functionalities that do not use AI. Significant work has been done on Natural Language Processing (NLP) models, and Large Language Models (LLMs) have demonstrated exceptional performance. However, whether LLMs can perform automatic classification without prior annotation, and thereby allow the SPB to be extracted directly, remains an open question. In this study, we compared the effectiveness of fine-tuning techniques with “prompting” methods to determine the potential of models such as ChatGPT-4o, Gemini Pro 1.5, and ChatGPT-Mini. We built a dataset of user stories manually classified by a group of experts, which made it possible to set up the experiments and construct the corresponding contingency tables. The classification performance of each LLM was evaluated statistically, using accuracy, sensitivity, and F1-score to assess the effectiveness of each model. The comparison was designed to highlight the strengths and limitations of each LLM in assisting efficiently and accurately with the construction of the SPB. The results show that ChatGPT-Mini has limitations in balancing precision and sensitivity. Although Gemini Pro 1.5 achieved the highest accuracy scores and ChatGPT performed well, neither is robust enough to serve as the basis for a fully automated user story classification tool. We therefore identified the need to develop a specialized classifier that enables an automated tool to recommend user stories that are viable for AI development, thereby supporting decision-making in agile software projects.
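As a minimal sketch of how the metrics named in the abstract can be derived from a 2x2 contingency table, the Python snippet below computes accuracy, precision, sensitivity, and F1-score. The class name, function name, and cell counts are hypothetical and chosen only for illustration; they are not the study's code or data, and "positive" is assumed here to mean a user story labeled as viable for AI development.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContingencyTable:
    """2x2 table comparing LLM labels with expert labels (hypothetical layout)."""
    tp: int  # expert: AI-viable, LLM: AI-viable
    fp: int  # expert: not AI-viable, LLM: AI-viable
    fn: int  # expert: AI-viable, LLM: not AI-viable
    tn: int  # expert: not AI-viable, LLM: not AI-viable


def classification_metrics(table: ContingencyTable) -> dict[str, float]:
    """Return accuracy, precision, sensitivity (recall), and F1-score."""
    total = table.tp + table.fp + table.fn + table.tn
    predicted_pos = table.tp + table.fp
    actual_pos = table.tp + table.fn

    # Guard every ratio against an empty denominator.
    accuracy = (table.tp + table.tn) / total if total else 0.0
    precision = table.tp / predicted_pos if predicted_pos else 0.0
    sensitivity = table.tp / actual_pos if actual_pos else 0.0

    # F1 is the harmonic mean of precision and sensitivity.
    pr_sum = precision + sensitivity
    f1 = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
    }


if __name__ == "__main__":
    # Illustrative counts only; these are NOT results from the study.
    example = ContingencyTable(tp=40, fp=10, fn=20, tn=30)
    for name, value in classification_metrics(example).items():
        print(f"{name}: {value:.3f}")
```

Reporting precision alongside sensitivity makes the trade-off between the two visible, since a model can score well on accuracy while still favoring one over the other.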
Keywords
Software requirements specification, User story classification, Smart product backlog, Smart user story identifier, Large-scale language models, Artificial intelligence