For the past few years, everyone in the A.I. world has been pushing the narrative that prompt engineering is the future and the only career option you'll soon have. And that if you don't learn it now, you're going to be out of luck when it comes to finding a job.
And as someone who's been spending the past few months working on a new A.I. startup, I can say that it definitely is a complex skill to acquire and to get good at, but it really isn't any more or less important than any other technical skill needed to build an app.
So in this post I'll break down the more technical side of prompt engineering that often doesn't get talked about. We're talking structured schemas, tokenization, temperature and parameter configurations. The fun stuff.
What people think it is
Let's start here, because I often talk to people who proclaim that they're prompt engineering, when really they're just typing prompts into ChatGPT and then storing them in an Excel sheet.
Technically not 'wrong', but I'd consider that maybe junior-level prompt engineering. The following is indeed a prompt that I engineered.
Write a function that validates an email :)
And on the more complex side, you oftentimes see massive prompts that span thousands of tokens, with the hope that the output somehow comes out 100% perfect.
# Modern Next.js Portfolio Site Template
Please create a modern, responsive portfolio website using Next.js 14 with the following specifications:
## Project Structure
- Use the Next.js App Router
- Implement TypeScript for type safety
- Structure the project with the following directories:
- app/ (for routes and layouts)
- components/ (reusable UI components)
- lib/ (utility functions and configurations)
- public/ (static assets)
- styles/ (global styles and CSS modules)
...
hundreds of lines more...
More than likely, you'll actually end up with a more accurate response from the simple prompt above than from the complex one. I've seen plenty of long-form prompts online that claim to do miracles, like build and launch an entire SaaS product in an hour. But when you actually go and paste that prompt into any model, the results are random and chaotic at best.
Let's get into the other aspects of prompt engineering that will help increase that accuracy for more complex tasks.
Understanding the models
Every A.I. company implements its own models, model versions and modalities. Some support text, speech and image recognition, while others might only handle text but perform better at it.
Some only accept up to 4,096 tokens, while others accept 100,000. But again, quality matters, and there's a good chance that the 4,096-token model returns more accurate information.
And they all have different costs as well. More expensive does not usually equate to higher accuracy in this regard, either.
Costs
If you're doing all of your prompt engineering on a browser client, like ChatGPT or Claude or Gemini, then your costs are essentially the monthly subscription fee.
If you're using any of the APIs, however, then pricing can get tricky. Most providers offer different tiers of pricing depending on the specific model that you use.
Anthropic API models and pricing
OpenAI models and pricing
And pretty much every provider charges based on the number of tokens that you're using per request. For most developers just getting started, the pricing is typically so low that you won't have to stress too much.
As an example, running a few dozen tests per day with moderate token sizes costs me 10 cents on average.
But if you know that you're going to be pushing volume at some point, then you definitely need to take this into consideration when creating your prompts.
Most people don't realize that the input prompt counts toward your token limit, meaning if you have a 2,000-token prompt, then you only have 2,096 tokens left for output (on a 4,096-token max model).
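To get a rough feel for the math, here's a small back-of-the-napkin estimate. The per-token rates below are placeholders, not any provider's actual prices, so plug in the numbers from your provider's pricing page:

```python
# Rough cost estimate; the per-1K-token rates are placeholder assumptions, not real prices
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_rate_per_1k: float = 0.003,     # assumed $ per 1K input tokens
                  output_rate_per_1k: float = 0.015) -> float:  # assumed $ per 1K output tokens
    return (input_tokens / 1000) * input_rate_per_1k + (output_tokens / 1000) * output_rate_per_1k

# A 2,000-token prompt that leaves room for a 2,096-token response on a 4,096-token model
print(f"${estimate_cost(2000, 2096):.4f}")  # -> $0.0374 with the placeholder rates
```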
Tokenization
Tokenization refers to how the model processes text input into tokens. From a technical standpoint, a token can be a single letter, a word, or even a chunk of a word. It will really depend on the particular model's design and what it considers to be a 'token'.
For example, the sentence "I love programming" might be tokenized as the following:
["I", " love", " programming"]
But you might also have a complex word, such as "unbelievable", and that might be tokenized as follows:
["un", "believable"]
It will really depend on the particular model, but the idea remains the same. And it's an important concept to keep in mind, as every model has its own limit on how many tokens you can use per request.
And depending on the particular inputs and outputs that you are working with, you might need to craft concise, but still very much functional, prompts in order to save both on costs and on generation times.
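If you want to see how your own prompts break down, you can count tokens locally before ever sending a request. Here's a minimal sketch using OpenAI's tiktoken library; other providers ship their own tokenizers, so the exact splits will vary from model to model:

```python
# Count tokens locally with tiktoken (pip install tiktoken)
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")  # pick the tokenizer that matches your model

prompt = "I love programming, and some words are unbelievable."
tokens = encoding.encode(prompt)

print(len(tokens))                             # how many tokens this prompt will consume
print([encoding.decode([t]) for t in tokens])  # the individual token strings
```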
Temperature
When working with A.I. models, temperature refers to the parameter that controls the randomness of a particular response. This value typically ranges from 0 to 2, with 2 being the maximum amount of creative randomness.
Typically, 'low' temperatures will return more deterministic and predictable responses, while 'high' temperature values will return more creative and random responses, though with a slightly higher chance of being incorrect.
High temperatures are typically better for producing any kind of creative works, such as stories, articles or social media content.
At the highest temperatures, you might find too much instability with responses, which is why it'll be important to understand your inputs and expected outputs and to test your prompts heavily.
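Here's what setting the temperature actually looks like, sketched with the OpenAI Python SDK; the model name here is just an assumption, so swap in whichever model and provider you're working with:

```python
# Low temperature for predictable output; bump it toward 1-2 for more creative responses
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    temperature=0.2,      # deterministic-ish; try 1.0+ for creative writing
    messages=[{"role": "user", "content": "Write a one-sentence tagline for a note-taking app."}],
)

print(response.choices[0].message.content)
```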
Experimentation
The trickiest part about working with A.I. models and engineering prompts is that you can't fully predict the output. Sometimes it's plain text, while other times it's structured data in a format that you won't recognize.
The typical solution is to prompt the model with the exact output that you need:
Provide the response in plain text with no special characters.
But this doesn't always work, in my experience. Most of the time you will get the correct output, but every so often you'll get an error and you won't know why.
Edge cases
Edge cases with A.I. are vast and cryptic because, as mentioned above, your prompt might work 90% of the time. But it's that last 10% that's going to cost you time and give your app users a bad experience.
And if you're looking to build a scalable real-world application that many people will use on the daily, then that last 10% needs to be accounted for.
This is why you'll probably have to append numerous edge case checks to your prompt, in order to ensure proper output.
Don't include special characters. Only return plain text. Remove formatting characters, etc.
And there's a good chance that you will probably have to code some logic in order to sanitize and clean up the output if needed.
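For example, a small cleanup function like the one below catches the most common stray formatting. The specific patterns are assumptions about what your model tends to leak through, so tune them to your own outputs:

```python
# A minimal sanitizer for "plain text only" responses; the patterns are assumptions, adjust as needed
import re

def clean_output(raw: str) -> str:
    text = raw.strip()
    text = re.sub(r"^```[a-zA-Z]*\n?", "", text)  # strip an opening code fence
    text = re.sub(r"```$", "", text)              # strip a closing code fence
    text = re.sub(r"[*_#`]", "", text)            # drop leftover markdown characters
    return text.strip()

print(clean_output("```\n**Hello** world\n```"))  # -> Hello world
```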
A/B testing
You might find a prompt that matches your needs and produces an output that is "good enough". It might even be "pretty good". But there's a high chance that you can do better. And this is where A/B testing comes into play, particularly if you have an application that is being used by people.
A big part of prompt engineering is also producing high-quality content, so you will probably need to test multiple types of outputs with your users to see which gets the highest engagement. And you'll need to modify and tweak your prompts accordingly based on what you find.
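A simple way to start is to bucket each user into a prompt variant deterministically, so the same person always sees the same version, and then compare engagement per variant. Here's a minimal sketch (the prompts themselves are just placeholder examples):

```python
# Deterministic A/B bucketing: hash the user ID so a user always gets the same prompt variant
import hashlib

PROMPT_A = "Summarize this article in three short bullet points."
PROMPT_B = "Summarize this article in one friendly paragraph."

def pick_prompt(user_id: str) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 2
    return PROMPT_A if bucket == 0 else PROMPT_B

print(pick_prompt("user-123"))  # log the variant served, then compare engagement across A and B
```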
Contextual priming
Priming is a technique used to guide an A.I. model's responses by providing some kind of relevant background information that it may need. It's important to remember that A.I. models don't have memory by default, and your prompt should contain any and all data needed in order to produce a specific output.
So if you're writing a prompt that helps someone to learn quantum physics, instead of writing:
"Explain quantum physics"
You would rather prompt:
"Explain quantum physics to a high school student using simple language and focus on superposition"
The first prompt above might return an overly complex explanation that gives a bad user experience, while the more specific prompt primes the model ahead of time to produce a more targeted response.
You can also prime the model by giving it a specific persona as well, such as a known figure or a known role:
Take on the role of a quantum physics professor, who uses humor to teach complex topics.
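If you're working against one of the APIs, this kind of priming usually lives in a system message that sits in front of every user prompt. A minimal sketch with the OpenAI Python SDK (the model name is an assumption):

```python
# Persona priming via a system message
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[
        # The system message sets the persona and audience before the user asks anything
        {"role": "system", "content": "You are a quantum physics professor who uses humor to teach complex topics to high school students."},
        {"role": "user", "content": "Explain superposition in simple language."},
    ],
)

print(response.choices[0].message.content)
```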
Few-shot learning
Few-shot learning is a priming technique used to improve the accuracy of a response by giving the model a few examples in order to teach it how to output content in a specific way.
And you can do so by specifying a sample input, the expected output for that sample input, and then the real data followed by a placeholder for the output.
For example:
Sample Transcript: "The CEO emphasized the importance of customer satisfaction. A new product line will be introduced in Q2, and employees are encouraged to share ideas for innovation."
Key Points:
Customer satisfaction is a top priority.
A new product line will be launched in Q2.
Employees are encouraged to contribute innovative ideas.
Transcript: {transcript}
Key Points:
In the snippet above, the sample transcript and the key points after it are not at all related to the real output values. They are just used to show the model what it should be generating.
You then provide the model with the real information, and it infers what it needs to do next.
It's definitely a clever technique that can help you get more exact with how you want to return data.
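In code, a few-shot prompt is usually just a template with the examples baked in and a slot for the real input. A minimal sketch reusing the transcript example above, with a short instruction line added for clarity:

```python
# Few-shot prompt template: one worked example, then a slot for the real transcript
FEW_SHOT_PROMPT = """Summarize the transcript into key points.

Sample Transcript: "The CEO emphasized the importance of customer satisfaction. A new product line will be introduced in Q2, and employees are encouraged to share ideas for innovation."
Key Points:
- Customer satisfaction is a top priority.
- A new product line will be launched in Q2.
- Employees are encouraged to contribute innovative ideas.

Transcript: "{transcript}"
Key Points:"""

def build_prompt(transcript: str) -> str:
    return FEW_SHOT_PROMPT.format(transcript=transcript)

print(build_prompt("We hired two new engineers and shipped the beta on Friday."))
```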
Chain-of-thought prompting
Chain-of-thought prompting is a technique where the AI model is encouraged to "think step-by-step" by explicitly guiding it through intermediate reasoning steps before arriving at a final answer.
This method is particularly effective for tasks that require logical reasoning, complex problem-solving, or multi-step calculations.
Instead of asking the model for a direct answer, you structure the prompt to encourage a progressive breakdown of the problem. The intermediate steps provide a clearer pathway for the model, improving its reasoning and accuracy.
For example, if you wanted to ask a simple math question without chain-of-thought prompting, you might write something like the following:
Prompt: "If a store has 7 apples and sells 3, how many are left?"
Response: 4
With chain-of-thought prompting, your prompt might look something like the following:
Prompt: "A store has 7 apples. It sells 3 apples. First, determine how many apples are sold. Then subtract that number from the total apples. How many are left?"
Response: The store starts with 7 apples. It sells 3 apples. Subtracting 3 from 7 gives 4 apples left.
By specifying the particular steps that you want the model to follow, you are able to get a more detailed and accurate response.
This technique might be more beneficial in cases where you have to process user input in a very specific way in order to get an exact response.
Structured data
Most people are used to A.I. models responding to a prompt with a single output, mainly as one giant string of text.
The latest models, however, are able to respond in more complex ways, such as with structured JSON that follows a schema you specify yourself in the prompt.
For example, let's say that you were working on an app that generated random coding questions for people to practice with. It would be difficult if a model responded with a giant string that you would then have to parse yourself.
Instead, you can specify the JSON schema that you would like the response in and tell the model to only return the data in that format.
Generate 3 technical programming questions and return them in JSON with the following schema:
[{title: 'title goes here', question: 'question goes here'}]
This is really the heart of prompt engineering. In a real-world application, you're going to need structured data in order to do anything with the UI/UX.
If you're building someone a resume using A.I. then you need to know what data corresponds to their past experience, education, skills, etc.
Do note, though, that this isn't guaranteed to work 100% of the time either. From my experiments so far, it works 'most' of the time. But on occasion, the model might respond with something like "You got it. Here's the data! [{...", which makes it difficult to parse.
Most of these edge cases can be fixed by specifying to the model that it shouldn't return anything except valid JSON. But again, this seems to work 'most' of the time.
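One way to handle that last bit of unpredictability is to parse defensively: try strict JSON first, and if that fails, pull the array out of the surrounding chatter. A minimal sketch:

```python
# Defensive JSON parsing: strict parse first, then fall back to extracting the first [...] block
import json
import re

def parse_questions(raw: str):
    try:
        return json.loads(raw)                        # happy path: pure, valid JSON
    except json.JSONDecodeError:
        match = re.search(r"\[.*\]", raw, re.DOTALL)  # fall back: grab the bracketed array
        if match:
            return json.loads(match.group(0))
        raise ValueError("No valid JSON found in model output")

raw = 'You got it. Here\'s the data! [{"title": "FizzBuzz", "question": "Print numbers 1 to 100..."}]'
print(parse_questions(raw))
```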
A few last words
Prompt engineering as a professional skill is still very much new. And it keeps changing as models change over time. New rules are set in place, new restrictions get added and sometimes new 'emotions' and behaviors get thrown into the mix.
But it is a good skill to have in your back pocket. Being able to craft concise input strings for an A.I. model in order to generate exact and accurate data on the other end is going to be huge in the near future as people interact more and more with artificial intelligence.