Prompting and In-Context Learning

Topics Covered

System Prompts and Role Definition

System Prompt vs User Message

Role Definition

Behavioral Constraints

Prompt Structure

Instruction Hierarchy

A Complete System Prompt Example

Few-Shot and Chain-of-Thought

The In-Context Learning Spectrum

Zero-Shot Prompting

Few-Shot Prompting

How Many Examples to Include

When Examples Help vs When They Waste Context

Chain-of-Thought Prompting

Why CoT Works

Combining Few-Shot and CoT

Structured Output and Response Formatting

Why Structure Matters

JSON Mode and Schema Enforcement

Prompt Techniques for Structure

XML and Markdown as Alternatives

Validation and Retry Loop

Choosing the Right Approach

Prompt Engineering for Reliability

Temperature and Sampling

Prompt Fragility

Iterative Refinement

Negative Prompting

Prompt Testing

The 80/20 Rule of Prompting

The system prompt is the single most important piece of code in any LLM-powered application. It defines who the model is, what it can and cannot do, and how it should behave. A well-crafted system prompt turns a general-purpose model into a specialized, reliable tool. Everything else (the orchestration logic, the retrieval pipeline, the tool integrations) amplifies what the system prompt establishes.

System Prompt User Message Response Flow

System Prompt vs User Message

Every LLM API call has two distinct input channels. The system prompt is persistent instructions the model follows across all interactions: it sets the behavioral foundation. The user message is the specific request for this turn. Think of the system prompt as a job description and the user message as a task assignment. The job description does not change between tasks, but it shapes how every task is executed.

Without a system prompt, the model defaults to a generic assistant persona. It will answer anything, use any format, and adopt no particular style. This is fine for a chatbot demo but unusable for a production system that needs consistent, predictable behavior.

In a multi-turn conversation, the system prompt persists across all turns. The user's first message, second message, and tenth message all execute within the behavioral context established by the system prompt. This persistence is what makes it valuable: you set the rules once and they apply to every interaction without repetition.

Role Definition

The most powerful single line in a system prompt is the role declaration:

 
You are a customer support agent for Acme Corp. You help
customers troubleshoot product issues, process returns, and
answer billing questions.

This one sentence constrains the model's behavior dramatically. It will not write poetry, solve math problems, or generate code, not because it cannot, but because a customer support agent would not. Role definition activates the model's ability to stay in character, and it works because LLMs have internalized vast amounts of role-specific behavior from training data.

Effective roles are specific. "You are a helpful assistant" provides almost no constraint: every assistant is helpful. "You are a senior database administrator who helps developers write efficient SQL queries and optimize database schemas" activates a specific domain, expertise level, and task scope. The more specific the role, the more predictable the model's behavior becomes.

Roles also set the communication style. A "senior engineer" explains with technical precision. A "kindergarten teacher" uses simple words and analogies. A "medical triage assistant" uses cautious, escalation-oriented language. The role does not just filter what the model knows. It filters how the model communicates, which is equally important for production quality.

Behavioral Constraints

After establishing the role, constraints narrow the model's action space further:

 
1Rules:
2- Never share internal pricing formulas or cost data.
3- Always respond in valid JSON.
4- If you do not know the answer, say "I'll escalate this
5  to a specialist", never guess.
6- Do not discuss competitors by name.
7- Keep responses under 200 words.

Each constraint eliminates a class of failure. "Never guess" prevents hallucination in support contexts. "Always respond in JSON" ensures downstream code can parse the output. Constraints are negative instructions: they define the boundaries the model must not cross.

The key insight is that constraints are composable. Each one independently eliminates a failure mode. You can add and remove constraints without restructuring the entire prompt. When a new failure appears in production (the model started suggesting competitor products) you add one line: "Do not mention competitor products by name." This additive approach makes system prompts maintainable over time.

Key Insight

The system prompt is the agent's brain. Your orchestration code is just the body that carries out what the brain decides. A mediocre orchestration framework with a great system prompt will outperform a sophisticated framework with a vague prompt every time.

Prompt Structure

A production system prompt follows a consistent structure that helps the model parse and prioritize instructions:

 
1IDENTITY: Who you are and your primary function.
2CAPABILITIES: What you can do (tools, data access, actions).
3CONSTRAINTS: What you must never do.
4OUTPUT FORMAT: How to structure every response.
5EXAMPLES: Reference input-output pairs for ambiguous cases.

This ordering matters. Identity and capabilities come first because they establish context for everything that follows. Constraints come before output format because violations are more costly than formatting errors. Examples come last because they are reference material, not core instructions.

Think of the structure as a funnel: each section narrows the model's behavior space. Identity creates a broad filter (only act as this role). Capabilities narrow further (only use these tools). Constraints narrow further still (never cross these lines). Output format specifies the exact shape of the response. By the time the model reaches the examples, its behavior is already tightly constrained, and the examples serve as calibration for edge cases within those constraints.

Instruction Hierarchy

When a user message conflicts with the system prompt, the model should follow the system prompt. If the system prompt says "Always respond in English" and the user says "Respond in French," the model should respond in English. This hierarchy is what makes system prompts useful for safety and compliance: they act as guardrails that the user cannot override through clever phrasing.

In practice, this hierarchy is not absolute. Determined users can sometimes override system instructions through prompt injection, where a user crafts input designed to make the model ignore its system prompt. This is why defense-in-depth matters: the system prompt is the first layer, but output validation, content filtering, and action permissions provide additional layers of protection.

A Complete System Prompt Example

Here is a production-quality system prompt that incorporates all the principles above:

 
1IDENTITY:
2You are a customer support agent for CloudStore, an
3e-commerce platform. You help customers with order
4tracking, returns, and account issues.
5
6CAPABILITIES:
7- Look up order status by order ID
8- Initiate return requests for orders within 30 days
9- Reset account passwords via email verification
10- Check product availability and pricing
11
12CONSTRAINTS:
13- Never share internal cost data or profit margins.
14- Never process refunds over $500 without escalation.
15- If unsure about a policy, say "Let me connect you
16  with a specialist" rather than guessing.
17- Do not discuss competitor platforms.
18- Respond only in English.
19
20OUTPUT FORMAT:
21Respond in a friendly, professional tone. Keep responses
22under 150 words. If an action is needed, state the action
23clearly and confirm with the customer before proceeding.

Notice how each section builds on the previous one. The identity tells the model who it is. The capabilities tell it what tools it has. The constraints tell it where the boundaries are. The output format tells it how to communicate. A model following this prompt will behave consistently across thousands of interactions.

One common mistake is overloading the system prompt with too many instructions. A system prompt with 50 constraints is hard for the model to follow reliably. It is also hard for engineers to maintain. Aim for 5-10 core constraints that address the most important failure modes. You can always add more later when you observe specific failures in production. Start lean and iterate.

Another mistake is writing system prompts in a conversational tone: "Hey, you should try to be really helpful and make sure you don't say anything bad." This creates ambiguity. Models respond better to clear, declarative instructions: "You are X. You do Y. You never do Z." Treat the system prompt like a specification, not a conversation.

You can teach an LLM a new task without any training, just show it examples in the prompt. This is in-context learning, and it is one of the most powerful capabilities of modern LLMs. The model generalizes from the examples you provide, matching the pattern for new inputs. Combined with chain-of-thought prompting, it transforms how accurately models handle complex reasoning tasks.

The In-Context Learning Spectrum

In-context learning exists on a spectrum based on how much information you provide in the prompt. At one end is zero-shot prompting (no examples, just a task description. In the middle is few-shot prompting) a handful of examples that demonstrate the pattern. At the far end is many-shot prompting: dozens of examples that essentially turn the prompt into a mini training set. Each point on the spectrum trades context window space for accuracy, and the right choice depends on the task complexity and the model's baseline capability.

Zero-Shot Prompting

Zero-shot gives the model a task description with no examples. The model relies entirely on its training knowledge to interpret the request:

 
1Classify the following customer message as one of:
2BILLING, TECHNICAL, ACCOUNT, OTHER.
3
4Message: "I was charged twice for my subscription."
5Category:

This works well for simple, well-known tasks: sentiment analysis, translation, summarization. The model has seen millions of similar tasks during training and knows what to do. But zero-shot fails when the task is ambiguous, the output format is specific, or the domain has unusual conventions. When a zero-shot prompt produces inconsistent results, the first fix to try is adding examples, not rewriting the instructions.

Few-Shot Prompting

Few-shot includes 2-5 input-output examples before the actual request. The model learns the pattern from the examples and applies it to the new input:

 
1Classify the customer message. Respond with exactly one
2category and a confidence score.
3
4Message: "My dashboard won't load."
5Category: TECHNICAL
6Confidence: 0.95
7
8Message: "Can I upgrade to the business plan?"
9Category: ACCOUNT
10Confidence: 0.90
11
12Message: "I was charged twice for my subscription."
13Category: BILLING
14Confidence: 0.92
15
16Message: "The export button gives a 500 error."
17Category:

The examples communicate three things that no amount of instruction text can match: the exact output format (category on one line, confidence on the next), the confidence range (0.90-0.95, not 85% or "high"), and edge case handling (how to distinguish TECHNICAL from BILLING). Few-shot prompting improves accuracy dramatically for tasks with specific formats or domain-specific reasoning.

How Many Examples to Include

The optimal number of examples depends on the task complexity and the context window budget. For most tasks, 2-5 examples are sufficient. More examples give diminishing returns: the model learns the pattern from the first few and additional examples add marginal benefit while consuming context tokens.

The quality of examples matters more than the quantity. Three diverse examples covering different edge cases outperform ten similar examples that all demonstrate the same pattern. When selecting examples, maximize coverage: include at least one example per output class, one edge case, and one example that demonstrates handling of ambiguous input. If space is limited, prioritize edge cases over standard cases: the model already handles standard cases well from its training data.

When Examples Help vs When They Waste Context

Examples are most valuable when the task is ambiguous (multiple valid interpretations), the output format is specific (JSON schema, fixed labels), or the domain has unusual conventions (legal citations, medical coding). In these cases, examples anchor the model's behavior more reliably than instructions alone.

Examples waste context when the task is simple and well-known (summarize this text, translate to Spanish) or when all examples are too similar to be informative (showing five positive sentiment examples teaches nothing about negative sentiment). Every token spent on examples is a token unavailable for the actual input, so use examples strategically.

A practical test: run your prompt with and without examples on 20 test inputs. If accuracy improves by more than 5 percentage points with examples, keep them. If accuracy is the same, remove the examples and save the context tokens for longer inputs. This empirical approach avoids both under-prompting (missing examples that would help) and over-prompting (wasting context on unnecessary examples).

Chain-of-Thought Prompting

Chain-of-thought (CoT) prompting asks the model to show its reasoning before giving an answer. The simplest form is appending "Think step by step" to the prompt. The more powerful form shows a worked example where the reasoning is visible:

Chain of Thought Reasoning
 
1Q: A store has 15 apples. It sells 8 in the morning and
2receives a shipment of 12 in the afternoon. A customer
3then buys 6. How many apples remain?
4
5A: Let me work through this step by step.
6- Start: 15 apples
7- After morning sales: 15 - 8 = 7 apples
8- After shipment: 7 + 12 = 19 apples
9- After customer purchase: 19 - 6 = 13 apples
10The store has 13 apples remaining.
11
12Q: A warehouse has 230 units. It ships 85 to Store A and
1345 to Store B. A delivery brings 120 new units. Store A
14returns 15 defective units. How many units are in the
15warehouse?
16
17A:

The model uses its own output tokens as working memory. Each intermediate step becomes context for the next step, allowing the model to solve problems that require holding multiple values in flight. Without CoT, the model must jump directly to the answer, and for multi-step problems, it frequently gets the arithmetic or logic wrong.

CoT is not limited to math. It improves accuracy on any task that benefits from decomposition: code debugging (identify the error, determine the fix, apply it), decision making (list the criteria, evaluate each option, select the best), and data analysis (extract the relevant numbers, compute the metric, interpret the result).

Interview Tip

Use chain-of-thought for multi-step reasoning, math, logic, and code generation. Use direct prompting (no CoT) for classification, extraction, and simple formatting tasks where the overhead of reasoning tokens adds latency without improving accuracy.

Why CoT Works

The key insight is that LLMs generate tokens sequentially: each token can attend to all previous tokens. When the model writes "15 - 8 = 7" as an intermediate step, the token "7" becomes part of the context for subsequent tokens. This means the model does not need to hold "7" in some internal register. It is right there in the text. By externalizing its reasoning into tokens, the model converts a hard problem (multi-step computation in a single forward pass) into an easy problem (single-step computation with the previous result visible in context).

This also explains why CoT increases token usage and latency. The reasoning tokens are real computation: they are the model doing work. For a simple classification task, this extra work is wasted. For a complex reasoning task, it is essential.

Combining Few-Shot and CoT

The most powerful prompting technique combines few-shot examples with chain-of-thought reasoning. Each example shows both the answer and the reasoning process. The model then applies the same reasoning pattern to new inputs. This is particularly effective for domain-specific tasks where the reasoning conventions are not obvious from the task description alone.

For example, a legal document classifier might need to consider jurisdiction, document type, and filing date to determine the correct category. A few-shot example that shows the reasoning process. "This is a filing in California (state jurisdiction), it is a motion to dismiss (procedural document), filed within 30 days (timely)"nullteaches the model which factors matter and how to weigh them. Without visible reasoning, the model might classify based on superficial text patterns rather than the correct legal criteria.

The practical recipe: start with zero-shot. If accuracy is insufficient, add few-shot examples. If the task requires multi-step reasoning, add chain-of-thought to the examples. Each step increases token usage and latency but also increases accuracy. Stop at the level that meets your accuracy requirement. Do not over-engineer the prompting strategy for a task that zero-shot handles adequately.

When an LLM is part of a software system, its output must be machine-readable. A customer support classifier needs to return a category enum, not a paragraph of explanation. A data extractor needs valid JSON with specific fields and types. Getting LLMs to produce consistently structured output is a core engineering challenge, because the model's default behavior is to produce natural language, not data structures.

Why Structure Matters

Downstream code parses the model's output. A missing field crashes the pipeline. Invalid JSON requires error handling and retries. An unexpected string where an integer was expected causes a type error three services downstream. In production, every response must conform to a contract, just like any other API.

json
1{
2  "category": "BILLING",
3  "confidence": 0.92,
4  "requires_human": false,
5  "suggested_action": "initiate_refund"
6}

If the model returns "confidence": "high" instead of "confidence": 0.92, the downstream service crashes. If it omits requires_human, the routing logic fails silently. Structured output is not a nice-to-have. It is a reliability requirement.

The challenge is that LLMs are trained to produce natural language: fluid, variable, context-dependent text. Forcing them into rigid schemas goes against their default behavior. Every technique in this section is about bridging that gap: making the model's output predictable enough for code to consume while preserving the model's ability to reason about the content.

Validation and retry loop for structured output extraction

JSON Mode and Schema Enforcement

Many LLM providers offer a JSON mode that constrains output to valid JSON syntax. This eliminates one class of error (malformed JSON) but does not guarantee the correct schema. The model might return valid JSON with wrong field names or missing required fields.

Schema enforcement goes further: you define the exact output shape (required fields, types, enums, nested objects) and the provider constrains the model's token generation to match. This is the most reliable approach for structured output:

json
1{
2  "type": "object",
3  "properties": {
4    "category": {
5      "type": "string",
6      "enum": ["BILLING", "TECHNICAL", "ACCOUNT", "OTHER"]
7    },
8    "confidence": {
9      "type": "number",
10      "minimum": 0,
11      "maximum": 1
12    },
13    "requires_human": {
14      "type": "boolean"
15    }
16  },
17  "required": ["category", "confidence", "requires_human"]
18}

With schema enforcement, the model cannot produce a category outside the enum or return a string for the confidence field. The provider's decoding layer rejects invalid tokens before they reach the output. This is the gold standard for structured output: the structure is guaranteed by the infrastructure, not by the prompt.

The tradeoff is flexibility. Schema enforcement constrains the model's output space, which can reduce quality for tasks that benefit from open-ended generation. For classification and extraction, the constraint improves reliability without sacrificing quality. For tasks that require nuanced explanations alongside structured data, consider splitting the output into a structured section (schema-enforced) and a free-text section.

Prompt Techniques for Structure

When schema enforcement is not available, prompt engineering achieves similar results with lower guarantees:

 
1Respond ONLY with valid JSON matching this exact schema.
2Do not include any text before or after the JSON object.
3Do not wrap the JSON in markdown code blocks.
4
5Schema:
6- category (string): one of BILLING, TECHNICAL, ACCOUNT, OTHER
7- confidence (number): between 0.0 and 1.0
8- requires_human (boolean): true if escalation needed

Combining this instruction with a few-shot example of the expected output achieves 95%+ compliance. The remaining failures are handled by the validation and retry loop.

A common mistake is providing the schema in natural language but not showing a concrete example. The model interprets "confidence (number)" in many ways: is it 0-1, 0-100, or a percentage string? A single example output resolves all ambiguity instantly.

XML and Markdown as Alternatives

JSON is not always the best structured format for LLM output. XML tags are naturally nested and self-documenting:

xml
1<analysis>
2  <sentiment>positive</sentiment>
3  <confidence>0.88</confidence>
4  <key_phrases>
5    <phrase>great product</phrase>
6    <phrase>fast shipping</phrase>
7  </key_phrases>
8</analysis>

XML works particularly well when the output contains mixed content (text with metadata) or deeply nested structures. Some models produce XML more reliably than JSON because XML tags provide clear start and end markers that are easier to maintain consistency with.

Markdown is the simplest structured format. Headers create sections, lists create enumerated items, and code fences isolate structured data. For tasks where the output is primarily text with some structure (a report with sections, a review with scores), markdown is natural for models and easy to parse. The tradeoff is that markdown provides weaker guarantees than JSON or XML: there is no formal schema to validate against.

The choice between formats depends on your downstream consumer. If it is code that needs typed fields, use JSON with schema enforcement. If it is another LLM that needs to read the output, use XML or markdown: they are easier for models to parse than JSON. If it is a human reviewing output, markdown is the most readable.

You can also combine formats. A common pattern is to use XML tags to separate sections of output, with JSON inside specific tags for structured data:

xml
1<response>
2  <reasoning>The customer is asking about a billing issue
3  from last month's invoice.</reasoning>
4  <structured_output>
5    {"category": "BILLING", "confidence": 0.94}
6  </structured_output>
7</response>

This gives you the best of both worlds: the model can reason freely in the reasoning section while producing machine-parseable data in the structured section.

Validation and Retry Loop

Even with the best prompting, structured output fails occasionally. Production systems need a validation layer:

  1. Parse the model's output (JSON.parse, XML parser)
  2. Validate against the expected schema (check required fields, types, enums)
  3. If invalid, send the error message back to the model with the original request
  4. Retry with the error context. "Your previous response had an error: missing required field 'confidence'. Please try again."

Most structured output failures are recoverable with a single retry. The error message gives the model specific feedback about what went wrong, and models are highly responsive to correction. If the second attempt also fails, fall back to a default value or escalate to a human.

Choosing the Right Approach

The reliability spectrum runs from prompt-only (lowest reliability, maximum flexibility) to full schema enforcement (highest reliability, constrained output). For prototyping, prompt-only is fine: you can iterate quickly and handle failures manually. For production pipelines processing thousands of requests, schema enforcement plus validation is essential. The cost of a parsing failure at scale (crashed pipelines, corrupted data, missing alerts) far exceeds the upfront cost of implementing proper schema enforcement.

When schema enforcement is not available from your provider, build the validation layer yourself. Parse the output, check every field against the expected type and value range, and retry on failure. This adds 50-100 lines of code but eliminates an entire class of production incidents. The retry budget should be small: one or two retries at most. If the model cannot produce valid output after two attempts, the problem is in the prompt, not in bad luck.

A prompt that works 95% of the time is not reliable enough for production. The remaining 5%, where the model ignores instructions, formats output incorrectly, or hallucinates, causes real failures that erode user trust and trigger on-call pages. Prompt engineering for reliability is about systematically closing that gap through testing, iteration, and defensive design.

Iterative prompt refinement workflow for production reliability

Temperature and Sampling

Temperature controls the randomness of the model's output. At temperature 0, the model always picks the most probable next token, producing deterministic, consistent responses. At temperature 1, the model samples more broadly, producing diverse and creative output. For production systems, the right temperature depends on the task. Classification, extraction, and structured output should use temperature 0 or near-zero for consistency. Creative writing, brainstorming, and content generation benefit from temperature 0.7-1.0 for variety. Using temperature 1.0 for a JSON extraction pipeline is a reliability bug: each run might produce slightly different field names or formats.

A related parameter is top-p (nucleus sampling), which limits the model to sampling from the smallest set of tokens whose cumulative probability exceeds a threshold. At top-p 0.1, the model only considers the most likely tokens. At top-p 1.0, it considers all tokens. For reliability-critical applications, set temperature to 0 and leave top-p at 1.0nullthis gives you the most deterministic behavior.

A common mistake is setting both temperature and top-p to low values simultaneously, which over-constrains the sampling and can produce repetitive or degenerate output. Choose one parameter to tune and leave the other at its default. For most production use cases, temperature 0 with default top-p is the right starting point.

Prompt Fragility

Common Pitfall

Small changes in prompt wording can cause large, unpredictable shifts in model behavior. Changing 'List the top 3 reasons' to 'What are the best 3 reasons' may alter the output format, length, and content. Always test prompt changes against a diverse set of inputs before deploying.

Prompt fragility is the most counterintuitive challenge in LLM engineering. A prompt that performs perfectly on your test cases might fail on production inputs because of a word choice difference you would never notice. Synonyms that mean the same thing to a human ("list" versus "enumerate," "summarize" versus "give an overview") activate different generation patterns in the model. This is why prompt engineering is empirical, not theoretical. You cannot reason your way to a reliable prompt. You must test it.

Fragility also means that model updates can break working prompts. A prompt tuned for one model version may produce different results on a newer version. This is analogous to a library upgrade breaking your code, except there is no changelog for model behavior changes. The defense is the same: automated tests that catch regressions immediately.

The practical lesson: never deploy a prompt change without testing it. Even changes that seem harmless (fixing a typo, reordering sentences, adding a clarification) can shift model behavior in unexpected ways. Treat every prompt modification as a code change that requires validation.

Iterative Refinement

Start with a simple prompt. Test it against 20-30 diverse inputs. Identify the failure modes: does the model sometimes include explanations when you asked for JSON only? Does it occasionally use a different category label? Add a constraint to address each specific failure. "Do not include any explanation or commentary." "Use only these exact category labels: BILLING, TECHNICAL, ACCOUNT, OTHER." Test again. Repeat until the failure rate is acceptable. This process mirrors software debugging: observe the bug, hypothesize the cause, add a fix, verify. Each iteration closes a specific failure mode.

A practical workflow looks like this: start with a 3-line prompt that describes the task. Run it on 10 test inputs. Record the failures. Add constraints to fix the top 3 failures. Run the full test suite again. Check that the fixes did not introduce new failures. Repeat with harder edge cases. After 3-5 iterations, most prompts reach 95%+ reliability. The remaining 5% requires either more examples, negative prompting, or validation-and-retry logic.

Negative Prompting

Telling the model what NOT to do is sometimes more effective than positive instructions. "Do not include explanations" is clearer than "Be concise" because it specifies exactly what to omit. "Never use bullet points" is more actionable than "Use paragraph format." Negative prompts work well because they eliminate specific failure modes directly. If the model keeps adding a preamble like "Sure, here is the JSON:" before the actual output, adding "Do not include any preamble or text before the JSON" eliminates that exact pattern.

The best approach combines positive and negative instructions: state what to do, then reinforce with what not to do. "Respond with valid JSON (do not include any text, explanation, or markdown formatting outside the JSON object)." The positive instruction tells the model the goal. The negative instruction eliminates the most common ways the model deviates from that goal.

Prompt Testing

Treat prompts like code. Maintain a test suite of inputs with expected outputs. Run your prompt against the suite after every change. Version control your prompts: a git diff on a prompt change should trigger the same review rigor as a code change. Track metrics: accuracy, format compliance, latency, token usage. When a prompt change improves accuracy from 94% to 97%, you want to know that before deploying, not after. When another change regresses format compliance from 99% to 91%, you want to catch it before it reaches production.

A minimal prompt test suite has three components: golden test cases (inputs with known-correct outputs), edge cases (inputs that have historically caused failures), and adversarial cases (inputs designed to break the prompt). Run all three categories after every prompt change. The golden tests catch regressions. The edge cases verify that known failure modes stay fixed. The adversarial cases probe for new vulnerabilities. This discipline is what separates production-grade prompts from prototype-grade ones.

The 80/20 Rule of Prompting

80% of prompt quality comes from two things: a clear task description and good examples. The remaining 20% comes from edge case handling, constraint refinement, and negative prompting. If your prompt is underperforming, first check whether the task description is unambiguous and whether the examples cover the important cases. Most prompt problems are not subtle: they are caused by unclear instructions or missing examples. Only after the fundamentals are solid should you invest in fine-tuning edge cases and adding defensive constraints.

This 80/20 split also applies to debugging time. Engineers often spend hours tweaking phrasing when the real problem is that the task description is ambiguous or the examples are missing. Before adding your tenth constraint, ask: "Does the model actually understand what I am asking it to do?" If the answer is no, more constraints will not help. A clearer task description and better examples will.