Evaluating Prompt Variability in Transformer-Based LLMs Through Discrete and Semantic PSI
Abstract
Large Language Models (LLMs) such as GPT, T5, and BART have demonstrated remarkable performance across diverse natural language processing tasks. However, their output variability in response to semantically similar but syntactically different prompts raises critical concerns regarding consistency, reliability, and reproducibility. This phenomenon, termed prompt sensitivity, poses a challenge for both scientific evaluation and real-world deployment, particularly in high-stakes domains such as healthcare, legal systems, and education. This thesis investigates prompt sensitivity by quantitatively evaluating how different paraphrased inputs affect the responses generated by LLMs. The study leverages the BoolQ dataset, a benchmark for yes/no question answering, and applies the Prompt Sensitivity Index (PSI) — a dual-metric framework composed of Discrete PSI and Semantic PSI — to measure variations in model outputs. Discrete PSI captures disagreement among discrete answers, while Semantic PSI measures embedding-based semantic drift across prompt variations.

A comprehensive experimental setup was implemented using multiple transformer-based models (BART, T5, FLAN-T5), and prompt variants were systematically generated using paraphrasing techniques. The results demonstrate significant input-specific variability, with discrete disagreement rates as high as 36% in some cases, even when prompts were semantically equivalent. Visual analytics, statistical summaries, and error distributions are used to highlight model inconsistencies.

The findings underline the need for improved robustness mechanisms in LLMs and suggest that prompt engineering alone is insufficient for ensuring consistent behavior. This thesis contributes a modular pipeline, a reproducible codebase, and actionable insights to guide future research on model stability, fairness, and interpretability.
Keywords: Large Language Models (LLMs), Prompt Sensitivity Index (PSI), Robustness Evaluation, Semantic Consistency, Prompt Variability, Model Reliability
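To make the dual-metric framework concrete, the sketch below shows one plausible formulation of the two metrics: Discrete PSI as the fraction of paraphrase pairs whose discrete (yes/no) answers disagree, and Semantic PSI as the mean pairwise embedding drift (one minus cosine similarity). The exact definitions are given in the thesis body; the pairwise formulation and the function names here are illustrative assumptions.

```python
from itertools import combinations
import math

def discrete_psi(answers):
    """Fraction of paraphrase pairs whose discrete answers disagree.
    (Illustrative pairwise formulation; the thesis defines the exact metric.)"""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 0.0
    disagreements = sum(1 for a, b in pairs if a != b)
    return disagreements / len(pairs)

def _cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

def semantic_psi(embeddings):
    """Mean pairwise semantic drift: 1 - cosine similarity of output embeddings.
    In practice the embeddings would come from a sentence encoder."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(1 - _cosine(u, v) for u, v in pairs) / len(pairs)

# Example: three model answers to paraphrases of one BoolQ question.
answers = ["yes", "yes", "no"]
print(discrete_psi(answers))  # 2 of 3 pairs disagree -> 0.666...
```

A disagreement rate of 36%, as reported above, would mean that roughly one in three paraphrase pairs elicited conflicting answers from the model on the affected inputs.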
