Measuring Lexical Diversity of Synthetic Data Generated through Fine-Grained Persona Prompting

Gauri Kambhatla, Chantal Shaib, Venkata Govindarajan

Published: 2025/05/23

Abstract

Fine-grained personas have recently been used for generating 'diverse' synthetic data for pre-training and supervised fine-tuning of Large Language Models (LLMs). In this work, we measure the diversity of persona-driven synthetically generated prompts and responses with a suite of lexical diversity and redundancy metrics. First, we find that synthetic prompts/instructions are significantly less diverse than human-written ones. Next, we sample responses from LLMs of different sizes with fine-grained and coarse persona descriptions to investigate how much fine-grained detail in persona descriptions contributes to the diversity of generated text. Our results indicate that persona prompting produces higher lexical diversity than prompting without personas, particularly in larger models. In contrast, adding fine-grained persona details yields minimal gains in diversity compared to simply specifying a length cutoff in the prompt.
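As a rough illustration of the kind of measurements the abstract describes, the sketch below computes two common lexical diversity/redundancy metrics: distinct-n (the fraction of unique n-grams) and a compression ratio, where more redundant text compresses further. The paper's exact metric suite is not specified here, so these particular metric choices and the tokenization are assumptions, not the authors' implementation.

```python
# Minimal sketch of two common lexical diversity/redundancy metrics.
# NOTE: illustrative assumptions, not the paper's exact implementation.
import zlib


def distinct_n(text: str, n: int = 2) -> float:
    """Fraction of unique n-grams among all n-grams (higher = more diverse)."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def compression_ratio(text: str) -> float:
    """Raw size divided by zlib-compressed size (higher = more redundant)."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


if __name__ == "__main__":
    sample = "the cat sat on the mat and the cat sat on the rug"
    print(f"distinct-2:         {distinct_n(sample, 2):.3f}")
    print(f"compression ratio:  {compression_ratio(sample):.3f}")
```

Comparing these scores between persona-prompted and persona-free generations (aggregated over many samples) is one straightforward way to operationalize the kind of diversity comparison described above.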
