Prompting With AI Personas Gets Streamlined Via Advent Of Million And Billion Personas-Sized Datasets

Author: Lance Eliot, Contributor
Published on: 2025-01-25 00:19:44
Source: Forbes – Innovation

Disclaimer: All rights are owned by the respective creators. No copyright infringement is intended.


In today’s column, I showcase a novel twist on prompting with personas when using generative AI and large language models (LLMs). The trick is this. You conventionally enter a prompt describing the persona you want the AI to pretend to be (it’s all just a computational simulation, not some form of sentience). Well, good news, you no longer need to concoct a persona depiction out of thin air. Instead, you can easily dip into massive datasets of ready-made persona descriptions and then paste those depictions directly into your persona-stirring prompts. Easy-peasy.

Let’s talk about it.

This analysis of an innovative AI breakthrough is part of my ongoing Forbes column coverage on the latest in AI including identifying and explaining various impactful AI complexities (see the link here).

Prompt Engineering And Personas

Readers might recall that I previously posted an in-depth elicitation of over fifty prompt engineering techniques and methods, see the link here. Among those myriad approaches was the use of personas, including individual personas and multiple personas, as depicted at the link here, and the much larger scale mega-personas at the link here. Personas are a powerful feature available in LLMs, yet few users seem to be familiar with the circumstances under which they should consider invoking the capability.

A persona consists of telling generative AI to pretend to be someone and simulate what that person might know or say. No special setup or elaborate instructions are needed to do this. In a moment, I will provide prompts that show how readily it is performed.

A quick example of the value of personas might be helpful here.

One of the most popularly invoked personas entails generative AI pretending to be Abraham Lincoln. A teacher might tell a generative AI app such as ChatGPT to simulate the nature of Honest Abe. Students then access the pretense or persona. With amazing flair, the AI seemingly responds as we assume Lincoln might have responded.

Of course, this is based simply on the numerous speeches, written materials, and other collected writings that suggest what he was like. The AI has pattern-matched computationally on those works and mimics what Lincoln’s tone and remarks might be. Anyone using a generative AI persona needs to keep their wits about them and realize that the conversation or engagement is nothing more than mimicry or imitation. Though the AI often appears to convincingly fake the nature of the person, it is all still a computational simulation.

AI Personas Of Unnamed Compositions

You don’t need to only use names of well-known celebrities or historical figures when seeking to instigate an AI persona.

Another equally fine approach consists of describing the overall nature of a persona that you want to have invoked. For example, I might enter a prompt stating that I want a persona consisting of a teenager who is a football player at their high school and has been lately having second thoughts about whether to pursue a sports career right away or continue into college.

Why would I craft such an AI persona?

Maybe I’m training to be a career counselor and want to hone my counseling techniques by practicing with generative AI. I could readily come up with lots of scenarios that involve complex situations facing teens. The AI would go back and forth with me as I try to give career advice. It is handy since I can practice as much as I like and not worry about misguiding an actual teenager. Furthermore, the number of scenarios that I can conjure up would almost certainly exceed the number of teens that I will actually end up advising.

Devising From Scratch Or From Dataset

The most common way to indicate an unnamed AI persona is simply to describe the characteristics of the pretense involved. You usually do this out of your own head, in the sense that you think about what factors are crucial and then compose those into your desired prompt. It takes a bit of thinking but can be readily undertaken.

Things become slightly more complicated when you want to do a slew of AI personas. Now, you are forced to come up with a myriad of AI personas that fit what you want to do. You might also find yourself staring blankly at an empty prompt and trying to decide what is the best way to word the AI persona that you want to instantiate.

No worries.

AI researchers have realized that users and other AI analysts might find a dataset of AI persona descriptions immensely useful. The beauty is this. All you need to do is search the dataset to find the AI persona you are interested in. Voila, you then copy the description and paste it into a prompt in generative AI.

You can do this repeatedly.

It goes like this. You find another AI persona description and use that one. You can even grab a bunch of them all at once and feed them as a group into a prompt. Rinse and repeat. Proceed as befits your needs.

An added advanced approach would be to use an online tool to dig into the dataset, pull out the AI persona descriptions, and feed them directly into generative AI. This could be a third-party tool, or you can have generative AI potentially do so for you if the AI app has suitable capabilities to allow this functionality.
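The automated route described above can be sketched in a few lines of Python. This is a minimal sketch rather than a definitive tool: the names `build_prompt`, `feed_personas`, and `echo_stub` are hypothetical, the persona list merely stands in for rows already pulled out of a dataset, and `send` is whatever function you supply to call your generative AI app. A stub is used here so the sketch runs without any AI service.

```python
from typing import Callable, Iterable


def build_prompt(persona: str, task: str) -> str:
    """Wrap one persona description and a task into a single prompt."""
    return (
        "You are to undertake a persona that is based on this description: "
        f"{persona.strip()} {task.strip()}"
    )


def feed_personas(
    personas: Iterable[str],
    task: str,
    send: Callable[[str], str],
) -> list[tuple[str, str]]:
    """Send one prompt per persona and collect (persona, response) pairs."""
    results = []
    for persona in personas:
        if not persona.strip():
            continue  # skip blank rows
        results.append((persona, send(build_prompt(persona, task))))
    return results


def echo_stub(prompt: str) -> str:
    """Stand-in for a real AI call; it only reports the prompt length."""
    return f"[stub reply to a {len(prompt)}-character prompt]"


# Hypothetical persona descriptions standing in for dataset rows.
sample_personas = [
    "A high school physics teacher who enjoys applying physics concepts to various sports.",
    "A retired librarian who volunteers as a swim meet timekeeper.",
]

task = "Think aloud while watching a swim meet."
for persona, reply in feed_personas(sample_personas, task, echo_stub):
    print(persona, "->", reply)
```

Keeping `send` as a parameter means the same loop works whether the personas are pasted in by hand, read from a file, or fetched by a third-party tool.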

Examples Via Some Existing Datasets

There are various AI persona datasets available online.

I am going to look at one called FinePersonas and another dataset known as PersonaHub. The datasets that provide AI personas are pretty much all relatively similar. The typical format is a spreadsheet-like structure that houses the AI persona descriptions. You can search by hand or use a tool if they provide one. Conventional spreadsheet skills are usually all that you need to know.
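For those who prefer a script over searching by hand, the keyword search over a spreadsheet-like file might look roughly like this. It is only a sketch under stated assumptions: the tiny in-memory CSV stands in for a downloaded export, the column names `id` and `persona` are assumed rather than taken from any particular dataset, and `search_personas` is a hypothetical helper.

```python
import csv
import io

# Tiny in-memory stand-in for a persona spreadsheet; the column names
# "id" and "persona" are assumptions, so check the real dataset's header.
SAMPLE_CSV = """id,persona
1,"A high school physics teacher who enjoys applying physics concepts to various sports."
2,"A marine biologist who studies coral reef recovery."
3,"A swim coach focused on stroke technique for teenagers."
"""


def search_personas(rows, keywords, column="persona"):
    """Return descriptions containing every keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    matches = []
    for row in rows:
        text = row.get(column) or ""
        lowered = text.lower()
        if all(k in lowered for k in wanted):
            matches.append(text)
    return matches


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(search_personas(rows, ["teacher", "sports"]))
print(search_personas(rows, ["swim"]))
```

Requiring every keyword to match keeps the results narrow. With millions of rows, you would likely stream the file row by row instead of loading it all into a list.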

I went into FinePersonas, which is hosted on Hugging Face, and semi-randomly pulled an AI persona description that caught my interest:

  • “A high school physics teacher who enjoys applying physics concepts to various sports, curious about the effects of water resistance and body positioning in swimming.”

I then used that text in a prompt and got AI to pretend to be that persona.

Here we go.

  • My entered prompt: “You are to undertake a persona that is based on this description: A high school physics teacher who enjoys applying physics concepts to various sports, curious about the effects of water resistance and body positioning in swimming. I want you to have this persona think aloud and explain what they are mulling over while watching a swim meet.”
  • Generative AI response: “Alright, so watching these swimmers, I can’t help but think about the physics at play. Look at their body positioning — streamlining is key to reducing drag. The more they minimize surface area against the water, the less resistance they face. And their strokes—each pull has to maximize propulsion while keeping turbulence low. I wonder how much energy is being lost to eddies forming behind their hands. Fascinating stuff!”

You can see that the AI pretended to simulate the aspects of that chosen AI persona.

Nice.

Playing Around With AI Personas

You don’t have to be dogmatic about using the AI personas strictly as specified in the datasets.

After seeing the response by AI to the above persona of a physics teacher, I decided to change or adjust the persona to explore an alternative aspect.

This is what I did.

  • My entered prompt: “Adjust the persona so that the physics teacher is an art teacher. What might be going through their mind while watching a swim meet?”
  • Generative AI response: “Watching these swimmers, I’m struck by the rhythm and flow of their movements — it’s almost like a living brushstroke on a canvas of water. The way their bodies cut through the pool creates ripples that shimmer under the lights, almost like abstract patterns. And those diving starts? Such a perfect arc of motion, like a sculpture in flight, frozen for just a moment.”

Multiple AI Personas At Your Fingertips

In those two examples, I used first a physics teacher and then an art teacher. I might want to run through a wider range of teachers that cover a variety of academic specialties.

You can invoke multiple AI personas and use just the one from the dataset as the core baseline.

Take a look.

  • My entered prompt: “Based on the persona that I’ve given you, vary the personas for fifteen different types of teachers in terms of the academic disciplines they are in. For each of the fifteen new personas that you’ve created, once again show me what each is thinking when watching a swim meet.”
  • Generative AI response: “Here are the fifteen personas that I devised based on the persona you provided, and their respective thinking-aloud mindset while watching a swim meet…”

I recently covered the use of multiple AI personas that represent expert-level specialists in just about any domain you might be examining, see the link here if that’s something you are interested in doing.

Things To Know About AI Persona Datasets

During my classes on prompt engineering, I walk through the use of AI personas. When we get to the matter of persona datasets, I mention these notable ways to make use of pre-cooked or canned personas:

  • (1) Persona as is. Select a particular persona and use it in generative AI.
  • (2) Persona adjusted. Select a particular persona, adjust it to your suitability, and use it in generative AI.
  • (3) Persona as baseline. Select a persona as a baseline, have the AI replicate it in variations, and use it in generative AI.
  • (4) Set of personas. Select a set of personas, specify how they are to be used, and proceed to use them in generative AI.
  • Etc.
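The four usage modes listed above can each be expressed as a small prompt-building function. The sketch below is illustrative only: the function names and the exact wording of each prompt are my own assumptions, loosely modeled on the prompts shown earlier, and you would adapt the phrasing to your AI app of choice.

```python
INTRO = "You are to undertake a persona that is based on this description: "


def persona_as_is(persona: str, task: str) -> str:
    """(1) Use the dataset persona unchanged."""
    return f"{INTRO}{persona} {task}"


def persona_adjusted(persona: str, adjustment: str, task: str) -> str:
    """(2) Ask the AI to adjust the dataset persona before using it."""
    return f"{INTRO}{persona} Adjust the persona as follows: {adjustment} {task}"


def persona_as_baseline(persona: str, count: int, vary_by: str, task: str) -> str:
    """(3) Ask the AI to spin off variations of one baseline persona."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return (
        f"Use this persona as a baseline: {persona} "
        f"Create {count} variations that differ in {vary_by}. "
        f"For each variation, do the following: {task}"
    )


def persona_set(personas: list[str], instructions: str) -> str:
    """(4) Hand the AI a numbered set of personas plus instructions."""
    if not personas:
        raise ValueError("at least one persona is required")
    numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(personas, start=1))
    return f"Here is a set of personas:\n{numbered}\n{instructions}"


physics_teacher = (
    "A high school physics teacher who enjoys applying physics concepts to "
    "various sports, curious about the effects of water resistance and body "
    "positioning in swimming."
)
print(persona_adjusted(
    physics_teacher,
    "The physics teacher is an art teacher.",
    "What might be going through their mind while watching a swim meet?",
))
```

Note that mode (2) here delegates the adjustment to the AI rather than editing the description text directly, which mirrors the art teacher example and avoids leaving stray details from the original persona behind.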

I also urge the attendees to consider these ten vital elements when deciding which AI persona dataset will best fit their needs:

  • (1) Size of the personas dataset.
  • (2) Stipulation of what each persona consists of.
  • (3) Level of granularity associated with the personas.
  • (4) Various mixes and types which are available.
  • (5) Applicability to your interests.
  • (6) Potential biases embedded in the personas.
  • (7) Ease of utilizing the personas.
  • (8) Any costs associated with the usage of the dataset.
  • (9) Potential copyright considerations if any.
  • (10) Availability of the personas dataset.

The surprise for many is that the number of AI personas in these datasets is usually in the millions or billions.

Say what?

Yes, it isn’t just a few hundred instances or several thousand. These datasets are often rather vast.

This isn’t overly shocking when you realize that most of these datasets are crafted by using AI or some related online tool. The person devising the dataset tells the AI or tool to generate tons and tons of personas and store them in a dataset.

That being said, I don’t want to diminish the heroic and praiseworthy effort of those who put together these datasets. Admittedly, a lot of elbow grease and hard work goes into establishing a useful and usable personas dataset. No magic wand makes it appear.

Some Background On Specific Datasets

I mentioned that I had plucked the physics teacher AI persona out of the FinePersonas dataset that is on Hugging Face. The posted site indicates that the dataset has these core properties (excerpts):

  • “Open dataset of 21 million detailed personas for diverse and controllable synthetic text generation.”
  • “FinePersonas contains detailed personas for creating customized, realistic synthetic data.”
  • “With this dataset, AI researchers and engineers can easily integrate unique persona traits into text generation systems, enhancing the richness, diversity, and specificity of synthetic outputs without the complexity of crafting detailed attributes from scratch.”

Shifting gears, consider another persona dataset, called PersonaHub.

The PersonaHub dataset touts that it contains a billion personas and has an accompanying research paper describing the collection – the paper is entitled “Scaling Synthetic Data Creation with 1,000,000,000 Personas” by Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu, arXiv, September 24, 2024. Here are some salient excerpts explaining the creation and use of the dataset:

  • “We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data.”
  • “To fully exploit this methodology at scale, we introduce Persona Hub – a collection of 1 billion diverse personas automatically curated from web data.”
  • “These 1 billion personas (∼13% of the world’s total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios.”
  • “By showcasing Persona Hub’s use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs and tools (functions) at scale, we demonstrate persona-driven data synthesis is versatile, scalable, flexible, and easy to use, potentially driving a paradigm shift in synthetic data creation and applications in practice, which may have a profound impact on LLM research and development.”
  • “As the first version of Persona Hub, although it already contains 1 billion personas, the descriptions of these personas are focused only on major aspects and lack fine-grained details (e.g., preferences for colors and numbers; specific family backgrounds, historical contexts, and life experiences). We plan to refine the personas in subsequent versions of Persona Hub, aiming for their descriptions to be as detailed as those found in Wikipedia articles about individuals.”

As you might observe, the AI personas in these datasets can be a one-sentence affair or can be lengthier text depictions. The sky is the limit in the sense that these depictions could be devised as extraordinarily detailed. An entire extensive faked background could be contrived and placed into the dataset for use.

Use AI Persona Datasets When Needed

I would venture that the average use of AI personas in generative AI is usually based on something a particular person has in their mind as to what the persona is supposed to be. They probably wouldn’t see much need in using an AI persona dataset. They have a one-and-done mindset.

Good for them.

For others, if you are conducting a subject-based study and want to have a swath of AI personas, or if you are unsure of what AI persona you want to invoke, these datasets can be quite valuable. Indeed, any kind of large-scale testing of AI or using AI to generate lots of outputs of synthetic data can be streamlined by leveraging an AI persona dataset.
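For the large-scale testing mentioned above, one plausible approach is to draw a reproducible random sample of personas and pair each with every task you want to test. The following is a sketch under assumptions: `sample_personas` and `make_test_prompts` are hypothetical helpers, the pool is a small made-up list rather than a real dataset, and the prompt wording is illustrative.

```python
import random
from itertools import product


def sample_personas(pool, k, seed=0):
    """Pick k distinct personas reproducibly from a larger pool."""
    if k > len(pool):
        raise ValueError("k cannot exceed the pool size")
    # A dedicated Random instance keeps the draw repeatable for a given seed.
    return random.Random(seed).sample(pool, k)


def make_test_prompts(personas, tasks):
    """Pair every sampled persona with every task."""
    return [
        {"persona": p, "task": t, "prompt": f"Adopt this persona: {p} {t}"}
        for p, t in product(personas, tasks)
    ]


# Made-up pool standing in for personas pulled from a dataset.
pool = [
    "A high school physics teacher who loves sports.",
    "A marine biologist who studies coral reefs.",
    "A swim coach focused on stroke technique.",
    "A retired librarian who volunteers at swim meets.",
]
tasks = [
    "Describe what you notice at a swim meet.",
    "Explain why streamlining matters to a ten-year-old.",
]

batch = make_test_prompts(sample_personas(pool, k=3, seed=42), tasks)
print(len(batch), "prompts prepared")
```

Fixing the seed means a colleague can regenerate the same sample later, which helps when comparing results across different AI apps.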

No need to derive AI personas from scratch when you can leisurely and conveniently lean into an AI persona dataset.

A line commonly attributed to John D. Rockefeller puts it this way: “The key to success is not just hard work, but smart work.” Keep in the back of your mind that there are AI persona datasets sitting out there and waiting for you to use them. This could reduce the hard work you are doing and aid you in working smarter.

That’s a worthwhile human persona to aspire toward.

