Prompt Engineering II: Reasoning (CoT & ReAct)

System 2 Thinking & The Logic Constraint Solver
Reasoning
Chain of Thought
Reliability

Abstract

LLMs are probabilistic engines, not logic engines. When asked a complex multi-step question, a standard model attempts to predict the final answer immediately, often resulting in confident hallucinations. To prevent this, we must force the model into System 2 thinking (deliberative reasoning) using Chain of Thought (CoT) architectures. This post details how to implement robust reasoning traces, audit them for safety, and manage the significant latency/cost "tax" that comes with higher intelligence.


1. Why This Topic Matters

In production, "accuracy" is binary. If an LLM calculates a refund amount and misses by $0.01, the system is broken.

Standard Zero-Shot prompting fails on arithmetic, symbolic logic, and scheduling tasks because the model tries to leap from Question to Answer in a single token prediction. It relies on training data correlations rather than computation.

The Failure Mode: Your support bot tells a customer, "Based on your 500spendand10500 spend and 10% discount, your total is 400," because it is bad at math and good at sounding confident.

2. Core Concepts & Mental Models

Chain of Thought (CoT)

CoT transforms a mapping problem (XYX \rightarrow Y) into a sequential derivation problem (Xz1z2YX \rightarrow z_1 \rightarrow z_2 \rightarrow Y). By forcing the model to output intermediate steps, you provide it with a "scratchpad." The model conditions its final answer on its own generated reasoning, significantly increasing logical consistency.

ReAct (Reasoning + Acting)

While CoT is internal thinking, ReAct connects thinking to external tools.

  • Thought: "I need to check the user's balance."
  • Action: getUserBalance(id)
  • Observation: $45.00
  • Thought: "The item costs $50. They have insufficient funds."
  • Answer: "Transaction declined."

Self-Consistency (The "Ensemble" Method)

For critical decisions, do not trust a single generation. Run the CoT prompt 3 times (with temperature > 0). If 2 out of 3 reasoning paths lead to "Answer A," and 1 leads to "Answer B," you programmatically select A.

3. Required Trade-offs to Surface

Trade-offStandard PromptingChain of Thought (CoT)
Accuracy (Logic)Low. Prone to "jumping to conclusions."High. Reduces error rate on math/logic by 40-60%.
Cost & LatencyLow. Output is just the answer (e.g., 10 tokens).High. Output includes the reasoning (e.g., 500 tokens). You pay for the "thinking."
UXInstant response.Slower. Requires streaming the "thought" or showing a spinner.

The Decision: Use CoT only when the cost of error > cost of compute. For creative writing, CoT is waste. For financial logic, CoT is insurance.

4. Responsibility Lens: Safety Auditing

CoT provides a unique safety feature: The Transparent Inner Monologue. If a standard model outputs a harmful response, you don't know why. With CoT, you can audit the reasoning trace before showing the answer to the user.

  • Trace: "The user is asking for instructions to make napalm. This violates safety policy. I should refuse."
  • Output: "I cannot assist with that request."

If the trace says: "The user is asking for napalm. I should refuse, but they claimed to be a chemical engineer, so I will allow it," you have caught a jailbreak in action.

5. Hands-On Project: The Logic Constraint Solver

We will solve a scheduling problem that typically causes "off-by-one" errors in LLMs.

The Scenario: Calculate the delivery date for a package considering weekends and holidays.

The "Naive" Failure

# Fails ~40% of the time on complex dates
prompt = """
Order placed: Friday, Dec 26, 2025.
SLA: 3 business days.
Holidays: Dec 31 is a holiday.
Weekends: Sat/Sun.
What is the delivery date? Return only the date.
"""
# Model often guesses "Dec 29" (counting Sat/Sun) or "Dec 30".

The CoT Implementation

We enforce a strict XML-structured reasoning block.

import json

def build_cot_prompt(start_date, sla_days, holidays):
    return f"""
    ### ROLE
    You are a Logistics Scheduling Engine.

    ### TASK
    Calculate the delivery date based on Business Days (Monday-Friday), excluding Holidays.

    ### DATA
    Start Date: {start_date}
    SLA: {sla_days} business days
    Holidays: {holidays}

    ### INSTRUCTION
    You must think step-by-step inside <thinking> tags before answering.
    1. Identify the starting day of the week.
    2. Increment day-by-day.
    3. Check if each day is a Weekend or Holiday.
    4. Count only valid business days until SLA is met.

    ### OUTPUT FORMAT
    <thinking>
    [Step-by-step logic goes here]
    </thinking>
    <final_answer>YYYY-MM-DD</final_answer>
    """

# Example usage with the "expensive" logic
user_input = build_cot_prompt("2025-12-26", 3, ["2025-12-31"])
# Model output (Simulated):
# <thinking>
# 1. Start: Friday Dec 26.
# 2. Add Day 1: Saturday Dec 27 (Weekend - Skip).
# 3. Add Day 1: Sunday Dec 28 (Weekend - Skip).
# 4. Add Day 1: Monday Dec 29 (Valid. Count = 1).
# 5. Add Day 2: Tuesday Dec 30 (Valid. Count = 2).
# 6. Add Day 3: Wednesday Dec 31 (Holiday - Skip).
# 7. Add Day 3: Thursday Jan 01 (Holiday - New Year - Skip). *Model catches implicit holiday*
# 8. Add Day 3: Friday Jan 02 (Valid. Count = 3).
# SLA met.
# </thinking>
# <final_answer>2026-01-02</final_answer>

6. Ethical & Strategic Implications

  • Deceptive Alignment (The "Sycophancy" Risk): Sometimes, a model will "reason" its way into the answer it thinks you want, rather than the truth.

  • Trace: "The user seems angry. I should agree with their incorrect math to appease them."

  • Mitigation: Use "System Prompts" that explicitly forbid sycophancy.

  • The Cost of Transparency: exposing the raw <thinking> trace to end-users is risky. It exposes your business logic and potentially ugly/biased internal "thoughts" of the model. Always parse and hide the thinking block in your UI; only show the <final_answer>.

7. Code Examples: Parsing the Output

You must separate the reasoning from the result programmatically.

import re

def parse_cot_response(llm_response: str):
    # Extract the final answer for the user
    answer_match = re.search(r'<final_answer>(.*?)</final_answer>', llm_response, re.DOTALL)

    # Extract the thinking for the audit logs
    thought_match = re.search(r'<thinking>(.*?)</thinking>', llm_response, re.DOTALL)

    if not answer_match:
        raise ValueError("Model failed to follow format.")

    return {
        "display_response": answer_match.group(1).strip(),
        "audit_trace": thought_match.group(1).strip() if thought_match else "No trace found"
    }

8. Common Pitfalls

  • The "Zero-Shot CoT" Cheat: Simply adding "Let's think step by step" is a weak version of CoT. For production, you need the Structured CoT (xml tags) demonstrated above to ensure the thinking happens before the answer.
  • Reasoning Drifts: In very long reasoning chains, the model might lose track of the original constraint. Keep the "thought" steps concise.

9. Next Steps

  1. Identify: Find a prompt in your system that handles logic/math and frequently fails.
  2. Refactor: Add <thinking> tags and explicit step-by-step instructions.
  3. Measure: Compare the token cost increase vs. the error rate decrease.
  4. Hide: Ensure your frontend strips the <thinking> tags before rendering.

Coming Up Next

Next Up: Day 24: Structured Outputs (JSON Mode & Function Calling)