Evaluating Prompt Variability in Transformer-Based LLMs Through Discrete and Semantic PSI
Abstract
Large Language Models (LLMs) such as GPT, T5, and BART have demonstrated remarkable performance across diverse natural language processing tasks. However, their output variability in response to semantically similar but syntactically different prompts raises critical concerns regarding consistency, reliability, and reproducibility. This phenomenon, termed prompt sensitivity, poses a challenge for both scientific evaluation and real-world deployment, particularly in high-stakes domains such as healthcare, legal systems, and education. This thesis investigates prompt sensitivity by quantitatively evaluating how different paraphrased inputs affect the responses generated by LLMs. The study leverages the BoolQ dataset, a benchmark for yes/no question answering, and applies the Prompt Sensitivity Index (PSI) — a dual-metric framework composed of Discrete PSI and Semantic PSI — to measure variations in model outputs. Discrete PSI captures disagreement among discrete answers, while Semantic PSI measures embedding-based semantic drift across prompt variations.

A comprehensive experimental setup was implemented using multiple transformer-based models (BART, T5, FLAN-T5), and prompt variants were systematically generated using paraphrasing techniques. The results demonstrate significant input-specific variability, with discrete disagreement rates as high as 36% in some cases, even when prompts were semantically equivalent. Visual analytics, statistical summaries, and error distributions are used to highlight model inconsistencies.

The findings underline the need for improved robustness mechanisms in LLMs and suggest that prompt engineering alone is insufficient for ensuring consistent behavior. This thesis contributes a modular pipeline, a reproducible codebase, and actionable insights to guide future research on model stability, fairness, and interpretability.
Keywords: Large Language Models (LLMs), Prompt Sensitivity Index (PSI), Robustness Evaluation, Semantic Consistency, Prompt Variability, Model Reliability
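To make the dual-metric framework concrete, the sketch below shows one plausible formulation of the two metrics: Discrete PSI as the fraction of paraphrase pairs whose discrete (yes/no) answers disagree, and Semantic PSI as the mean pairwise embedding drift (one minus cosine similarity). The exact definitions are given in the thesis body; the pairwise formulation and the function names here are illustrative assumptions.

```python
from itertools import combinations
import math

def discrete_psi(answers):
    """Fraction of paraphrase pairs whose discrete answers disagree.
    (Illustrative pairwise formulation; the thesis defines the exact metric.)"""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 0.0
    disagreements = sum(1 for a, b in pairs if a != b)
    return disagreements / len(pairs)

def _cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

def semantic_psi(embeddings):
    """Mean pairwise semantic drift: 1 - cosine similarity of output embeddings.
    In practice the embeddings would come from a sentence encoder."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(1 - _cosine(u, v) for u, v in pairs) / len(pairs)

# Example: three model answers to paraphrases of one BoolQ question.
answers = ["yes", "yes", "no"]
print(discrete_psi(answers))  # 2 of 3 pairs disagree -> 0.666...
```

A disagreement rate of 36%, as reported above, would mean that roughly one in three paraphrase pairs elicited conflicting answers from the model on the affected inputs.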
