Building a Self-Improvement Agent with LangGraph: Reflection vs Reflexion

Original: Building a Self-Improvement Agent with LangGraph: Reflection Vs Reflexion
Author: Piyush Agnihotri
Date: Sep 26, 2025

How do AI agents learn from their mistakes? The role of reflection in AI

Imagine an AI that doesn't just generate an answer but also reviews its own output, spots the weak points, and improves them in the next draft, genuinely learning from its own mistakes. That is what reflection agents are for: self-correction, iterative refinement, and strategies that improve over time.

Reflection is a prompting strategy used to improve the quality and success rate of agents and similar AI systems.

Think of it as a prompting strategy that asks the LLM to reflect on and critique its previous actions. Sometimes the reflector also draws on external knowledge, such as tool outputs or retrieval results, to make the critique more accurate. The result is not a one-shot response but a short feedback loop in which generation and review alternate until the output improves.

Reflection systems generally fall into three broad categories:

  • Basic Reflection Agent: a simple generator-plus-reflector loop. The generator drafts content, the reflector critiques it, and the generator revises. Lightweight and effective for many editing tasks.
  • Reflexion Agent: a more structured approach that tracks past actions, hypotheses, and reflections in a traceable log. Reflexion is useful for problem-solving scenarios where the agent needs to learn from multiple failed attempts.
  • Language Agent Tree Search (LATS): a search-like strategy that explores multiple action branches, reflects on the results, and prunes or keeps promising branches. Best suited for planning and multi-step reasoning.

In this post, we'll focus on Reflection and Reflexion agents, walk through their workflows, and implement them step by step with LangChain and LangGraph.

Table of Contents:

  • Understanding the Basic Reflection Agent
    • Building a Reflection Agent
    • Environment Setup
    • Generator
    • Reflector
    • Repeat
    • Workflow Orchestration with LangGraph
  • Understanding the Reflexion Agent
    • Building a Reflexion Agent
    • Building the Tools
    • Structured Output Parsing
    • Revision
    • Creating the Tool Node
    • Constructing the Graph
  • Reflection vs Reflexion
  • Conclusion

Understanding the Basic Reflection Agent

A Reflection Agent is an AI system that goes beyond simply generating answers: it actively critiques its own output and keeps refining it. You can think of it as a loop between two roles:

  • Generator: drafts the initial response.
  • Reflector: reviews the draft, points out flaws or gaps, and suggests improvements.

This back-and-forth runs for several iterations, and with each pass the output becomes more polished, reliable, and useful. The elegance here is that the AI is essentially learning from its own mistakes in real time, much like a writer rewriting a draft after a round of editorial feedback.

Credits: LangChain

In this section, we'll build a Reflection agent for a LinkedIn post generator using LangGraph, a framework designed for creating self-improving AI systems. The idea is to design a workflow that mimics human reflective thinking: instead of stopping at the first draft, the agent keeps refining until the content feels polished and engaging.

By the end of this guide, you'll know how to set up the generator and reflector roles, use LangChain for structured prompting, and tie everything together into an iterative feedback loop with LangGraph.

Building a Reflection Agent

Let's start by implementing the basic Reflection pattern with a LinkedIn content-creation agent. The flow is simple but powerful: the agent drafts a post, a separate "reflector" role critiques it, and the system revises the content based on that feedback.

Environment Setup

For this implementation we'll use LangChain and LangGraph to build the reflection agent workflow. To avoid confusion, we won't import every library up front; instead, we'll bring them in as they're needed, which keeps the walkthrough easy to follow.

First, let's set up environment variables for the API integrations using a .env file:

ANTHROPIC_API_KEY="your-anthropic-api-key"
# LANGCHAIN_API_KEY="your-langchain-api-key" # optional
# LANGCHAIN_TRACING_V2=True # optional
# LANGCHAIN_PROJECT="multi-agent-swarm" # optional

Now let's load these variables into our notebook:

from langchain_anthropic import ChatAnthropic
from dotenv import load_dotenv

load_dotenv()
load_dotenv(dotenv_path="../.env", override=True)  # mention the .env path

# Initialize Anthropic model
llm = ChatAnthropic(
    model="claude-3-7-sonnet-latest",  # Claude model ID
    temperature=0,
    # max_tokens=1024
)

For this implementation I'm using Anthropic's claude-3-7-sonnet-latest as the chat model, but feel free to swap in another LLM; LangChain supports a wide range of integrations.
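
For example, here is a minimal sketch of swapping in an OpenAI chat model instead (an assumption on my part: it presumes the langchain-openai package is installed and OPENAI_API_KEY is set, and the model name is only illustrative):

# Hypothetical alternative: any LangChain chat model exposes the same interface
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",  # example model name; substitute one available to your account
    temperature=0,
)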

Make sure you set up your virtual environment by following the steps in the project README.md. Create the environment with uv, then add it as a Jupyter Notebook kernel.

Generator

Now that the LLM is configured, we'll create the first agent component: a LinkedIn post generator. This agent drafts posts that can later be refined through self-review.

Let's start by creating a generation prompt for our posts.

from langchain_core.messages import AIMessage, BaseMessage, HumanMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

post_creation_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert LinkedIn content creator tasked with crafting compelling, professional, and high-performing LinkedIn posts. "
            "Create the most effective LinkedIn post possible based on the user's requirements. "
            "If the user provides feedback or suggestions, respond with an improved version that incorporates their input while enhancing overall quality and engagement.",
        ),
        MessagesPlaceholder(variable_name="messages"),
    ]
)

We build the prompt with ChatPromptTemplate. It has two main parts:

  1. System message: defines the assistant's role. Here the assistant acts as an expert LinkedIn content strategist whose job is to produce high-quality posts and, when feedback is provided, revise them to improve readability, engagement, and overall impact.
  2. MessagesPlaceholder: a dynamic slot for user input. At runtime it is filled with the user's request, which keeps the post relevant and tailored.

Next, we connect this prompt to our LLM with the pipe operator (|) to create a complete processing chain:

linkedin_post_generator = post_creation_prompt | llm

The pipe operator acts as a bridge: it passes the output of post_creation_prompt directly as input to the LLM, so the LLM can generate a well-structured post from the user's instructions.

For a more detailed explanation of the pipe operator (|) and its role in streamlining LangChain workflows, see my earlier post, "Minimalist AI Workflows with LCEL: LangChain Expression Language".
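
As a rough sketch of what the pipe is doing under the hood (illustrative only; it reuses the variables defined above):

# Without LCEL: run the prompt, then pass its output to the model
prompt_value = post_creation_prompt.invoke({"messages": [HumanMessage(content="Draft a post about LangGraph.")]})
response = llm.invoke(prompt_value)

# With LCEL: the pipe composes the same two steps into one runnable
response = (post_creation_prompt | llm).invoke({"messages": [HumanMessage(content="Draft a post about LangGraph.")]})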

Let's test our generator with a simple example:

# Example LinkedIn post creation session
generated_post = ""
post_request = HumanMessage(
    content="Create a LinkedIn post on AI tools for developers under 200 words."
)

print("=== INITIAL LINKEDIN POST ===")
for chunk in linkedin_post_generator.stream({"messages": [post_request]}):
    print(chunk.content, end="")
    generated_post += chunk.content

print("\n" + "="*60 + "\n")

Expected response:

=== INITIAL LINKEDIN POST ===
Here's a compelling LinkedIn post for developers about AI tools:

🤖 Fellow developers, let's talk about AI tools that are actually worth your time!

After testing dozens of AI tools, here are 5 game-changers that have transformed my development workflow:

1. GitHub Copilot
Real-time code suggestions that feel like pair programming with an AI. Seriously cuts down on boilerplate code.

2. ChatGPT API
Not just for chat - it's incredible for debugging, code optimization, and even architecture discussions. Pro tip: Use it to explain complex code blocks.

3. Amazon CodeWhisperer
Like Copilot's cousin, but with deeper AWS integration. Perfect for cloud-native development.

4. Tabnine
Context-aware code completions that learn from your coding style. Works across 30+ languages!

5. DeepCode
Catches bugs before they happen with AI-powered code reviews. Has saved my team countless hours.

💡 Pro Tip: These tools should augment, not replace, your development skills. Use them to enhance productivity, not as a crutch.

What AI tools are you using in your dev workflow? Drop them in the comments! 👇

#SoftwareDevelopment #AI #CodingTools #TechCommunity #Programming

Thoughts? Would you like any adjustments to make it more engaging?
============================================================

Great, we've created the content generator, but right now it operates in isolation. The next step is to add reflection capability to evaluate and improve our content.

Reflect

With the LinkedIn post generator ready, we now create a critique agent, essentially a social media strategist. This agent analyzes the generated post from several angles: engagement potential, brand consistency, tone, and overall optimization.

# SOCIAL MEDIA STRATEGIST REFLECTION
social_media_critique_prompt = ChatPromptTemplate.from_messages([
(
"system",
"""You are a LinkedIn content strategist and thought leadership expert. Analyze the given LinkedIn post and provide a comprehensive critique focusing on:

**Content Quality & Professionalism:**
- Overall quality, tone clarity, and LinkedIn best practices alignment
- Structure, readability, and professional credibility building
- Industry relevance and audience targeting

**Engagement & Algorithm Optimization:**
- Hook effectiveness and storytelling quality
- Engagement potential (likes, comments, shares)
- LinkedIn algorithm optimization factors
- Word count and formatting effectiveness

**Technical Elements:**
- Hashtag relevance, reach, and strategic placement
- Call-to-action strength and clarity
- Use of formatting (line breaks, bullet points, mentions)

Provide specific, actionable feedback that includes:
- Key strengths and improvement areas
- Concrete suggestions for enhancing engagement and professionalism
- Practical recommendations for the next revision

Keep your critique constructive and focused on measurable improvements, prioritizing actionable insights that will guide the post's revision and lead to tangible content enhancements."""
),
MessagesPlaceholder(variable_name="messages")
])

social_media_critic = social_media_critique_prompt | llm

Now let's see how the critique agent evaluates the post we generated:

print("=== SOCIAL MEDIA STRATEGIST FEEDBACK ===")
feedback_result = ""
for chunk in social_media_critic.stream({"messages": [post_request, HumanMessage(content=generated_post)]}):
    print(chunk.content, end="")
    feedback_result += chunk.content

print("\n" + "="*60 + "\n")

Expected response:

=== SOCIAL MEDIA STRATEGIST FEEDBACK ===
Here's a comprehensive critique of your LinkedIn post:

**Content Quality & Professionalism:**
Strengths:
- Well-structured with clear, valuable information
- Professional tone that balances expertise with accessibility
- Excellent use of practical examples and specific tools
- Good industry relevance for developer audience

Areas for Improvement:
- Could add brief specific benefits/use cases for each tool
- Consider including one personal experience/result

**Engagement & Algorithm Optimization:**
Strengths:
- Strong hook with "Fellow developers"
- Good length (within optimal 1,300 character range)
- Effective use of emojis
- Strong call-to-action in comments

Optimization Suggestions:
- Consider starting with a compelling statistic or personal result
- Add numbers to benefits (e.g., "reduced coding time by 40%")
- Break up longer paragraphs further for better readability

**Technical Elements:**
Strengths:
- Good hashtag selection
- Clear formatting with numbered lists
- Effective use of emojis as visual breaks

Recommendations:
- Add 1-2 relevant @mentions of tool companies
- Consider more specific hashtags (e.g., #AIforDevelopers)
- Add line breaks between sections for better scanning

**Specific Improvement Suggestions:**

1. Enhanced Hook:
"🚀 These 5 AI tools helped me cut coding time by 40% last month! Here's my real-world review after 100+ hours of testing..."

2. Add Credibility:
Brief one-liner about your experience/role before the list

3. Tool Descriptions:
Add one specific metric/result for each tool
Example: "GitHub Copilot: Cut my boilerplate coding time by 60%. Perfect for repetitive tasks."

4. Hashtags:
Add: #DevTools #AIinTech #DeveloperLife

Overall, it's a strong post that could be enhanced with more specific results and personal experiences to boost engagement and credibility.
============================================================

The strategist's feedback contains specific, actionable insights. This is exactly where the reflection pattern shines: we're not just generating content, we're systematically evaluating it and identifying opportunities for improvement.

Repeat

Now that we have both a generator and a critique agent, let's implement the basic reflection loop. The pattern feeds the critique back to the generator for improvement:

print("=== REFINED LINKEDIN POST ===")
for chunk in linkedin_post_generator.stream(
    {"messages": [post_request, AIMessage(content=generated_post), HumanMessage(content=feedback_result)]}
):
    print(chunk.content, end="")

print("\n" + "="*60 + "\n")

Expected response:

=== REFINED LINKEDIN POST ===
Thank you for the detailed feedback! Here's an improved version incorporating your suggestions:

🚀 These 5 AI tools helped me cut coding time by 40% last month! Here's my real-world review after 100+ hours of testing...

As a senior full-stack developer leading a team of 12, I've vetted countless AI tools. These are the ones actually delivering ROI:

1. GitHub Copilot
Cut boilerplate coding time by 60%. Saved our team 15+ hours last sprint on repetitive tasks.
@GitHub

2. ChatGPT API
Reduced debugging time by 35%. We use it to analyze 100+ lines of complex code in seconds.
@OpenAI

3. Amazon CodeWhisperer
30% faster AWS infrastructure deployment. Game-changer for cloud architecture.
@AWSCloud

4. Tabnine
Increased code completion accuracy by 45%. Learning from our codebase across 5 different projects.
@Tabnine

5. DeepCode
Caught 23 critical bugs last month before production. Reduced QA cycles by 25%.
@DeepCode

💡 Real Talk: These tools supercharged our productivity, but they're not magic. They work best when combined with solid development practices and code review processes.

⚡️ Personal Win: Implemented these tools across our team and saw sprint velocity increase by 28% in just two months.

What's your experience with AI dev tools? Share your metrics below! 👇

#AIforDevelopers #DevTools #SoftwareDevelopment #CodingTools #DeveloperLife #TechInnovation

Here's the flow that defines our core reflection pattern:

Original request → initial post → critique → improved post

The message sequence preserves context while letting the generator incorporate feedback iteratively.

However, while this manual approach works for simple content tasks, it quickly becomes cumbersome for more complex workflows. That's where LangGraph comes in: it provides a structured framework for orchestrating multiple agents efficiently.

Workflow Orchestration with LangGraph

LangGraph lets us build sophisticated agent workflows with:

  • State management to track context across multiple steps
  • Conditional logic to decide whether to continue or end the loop
  • Automatic orchestration for seamless communication between agents

We define a ContentState to structure our workflow. Using an Annotated type with add_messages ensures that messages accumulate correctly throughout the process.

from typing import Annotated, List, Sequence
from langgraph.graph import END, StateGraph, START
from langgraph.graph.message import add_messages
from langgraph.checkpoint.memory import InMemorySaver
from typing_extensions import TypedDict

class ContentState(TypedDict):
    messages: Annotated[list, add_messages]
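
As a rough illustration of what add_messages does (a small sketch of my own, not part of the original walkthrough): it merges the existing history with whatever a node returns, appending the new messages, so each node only needs to return its own additions rather than the full conversation.

# Illustrative only: add_messages appends updates to the existing history
existing = [HumanMessage(content="Create a LinkedIn post about LangGraph.")]
update = [AIMessage(content="Here's a first draft...")]
merged = add_messages(existing, update)
print(len(merged))  # 2: both messages, in order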

Next, we'll create the workflow nodes that represent the generator and the critique agent.

async def post_creation_node(state: ContentState) -> ContentState:
    """Generate or improve LinkedIn post based on current state."""
    return {"messages": [await linkedin_post_generator.ainvoke(state["messages"])]}

async def social_critique_node(state: ContentState) -> ContentState:
    """Provide social media strategy feedback for the LinkedIn post."""
    # Transform message types for the strategist
    message_role_map = {"ai": HumanMessage, "human": AIMessage}

    # Keep the original request and transform subsequent messages
    transformed_messages = [state["messages"][0]] + [
        message_role_map[msg.type](content=msg.content) for msg in state["messages"][1:]
    ]

    strategy_feedback = await social_media_critic.ainvoke(transformed_messages)

    # Return feedback as human input for the post generator
    return {"messages": [HumanMessage(content=strategy_feedback.content)]}

With both graph nodes defined, let's create the conditional logic that decides whether the workflow should continue or end.

def should_continue_refining(state: ContentState):
    """Determine whether to continue the creation-feedback cycle."""
    if len(state["messages"]) > 6:
        # End after 3 complete creation-feedback cycles
        return END
    return "social_critique"

Now we can build and configure the complete workflow:

# Build the workflow graph
content_workflow_builder = StateGraph(ContentState)
content_workflow_builder.add_node("create_post", post_creation_node)
content_workflow_builder.add_node("social_critique", social_critique_node)

# Define workflow edges
content_workflow_builder.add_edge(START, "create_post")
content_workflow_builder.add_conditional_edges("create_post", should_continue_refining)
content_workflow_builder.add_edge("social_critique", "create_post")

# Add conversation memory
content_memory = InMemorySaver()
linkedin_workflow = content_workflow_builder.compile(checkpointer=content_memory)

Let's visualize the final workflow:

from IPython.display import Image, display

# Show the agent
display(Image(linkedin_workflow.get_graph().draw_png()))

Workflow

Testing

The final step is to test our automated linkedin_workflow with a simple example:

session_config = {"configurable": {"thread_id": "user1"}}

content_brief = HumanMessage(
    content="Create a LinkedIn post on AI tools for developers under 180 words."
)

async for workflow_event in linkedin_workflow.astream(
    {"messages": [content_brief]},
    session_config,
):
    print("Workflow Step:", workflow_event)
    print("-" * 50)

### Output:
"""
=== RUNNING AUTOMATED LINKEDIN CONTENT WORKFLOW ===

Workflow Step: {'create_post': {'messages': [AIMessage(content="Here's a compelling LinkedIn post on AI tools for developers:\n\n🤖 5 Game-Changing AI Tools Every Developer Should Know About\n\nAs AI reshapes software development, staying ahead means leveraging the right tools. Here are my top picks that are transforming how we code:\n\n1. GitHub Copilot\nTurn comments into code with AI-powered suggestions that feel like having a senior developer by your side.\n\n2. Amazon CodeWhisperer\nFree for individual use, it's helping developers write more secure, efficient code while reducing bugs by 40%.\n\n3. Tabnine\nThe AI assistant that learns your coding patterns and provides context-aware completions across 30+ languages.\n\n4. DeepCode\nCatch those subtle bugs before they reach production with AI-powered code reviews that go beyond traditional static analysis.\n\n5. ChatGPT + Code Interpreter\nPerfect for debugging, code explanation, and quick prototyping. It's like having a coding mentor 24/7.\n\n💡 Pro Tip: These tools aren't replacements - they're amplifiers. Use them to boost productivity while maintaining code quality.\n\nWhat AI dev tools are you using? Share your experiences below! 👇\n\n#SoftwareDevelopment #AI #CodingTools #TechInnovation #Programming", additional_kwargs={'usage': {'prompt_tokens': 85, 'completion_tokens': 288, 'cache_read_input_tokens': 0, 'cache_write_input_tokens': 0, 'total_tokens': 373}, 'stop_reason': 'end_turn', 'thinking': {}, 'model_id': 'anthropic.claude-3-5-sonnet-20241022-v2:0', 'model_name': 'anthropic.claude-3-5-sonnet-20241022-v2:0'}, response_metadata={'usage': {'prompt_tokens': 85, 'completion_tokens': 288, 'cache_read_input_tokens': 0, 'cache_write_input_tokens': 0, 'total_tokens': 373}, 'stop_reason': 'end_turn', 'thinking': {}, 'model_id': 'anthropic.claude-3-5-sonnet-20241022-v2:0', 'model_name': 'anthropic.claude-3-5-sonnet-20241022-v2:0'}, id='run--bd4444df-0aa9-4859-a4c3-0af9ce16a1e8-0', usage_metadata={'input_tokens': 85, 'output_tokens': 288, 'total_tokens': 373, 'input_token_details': {'cache_creation': 0, 'cache_read': 0}})]}}
--------------------------------------------------
Workflow Step: {'social_critique': {'messages': [HumanMessage(content='Here\'s a comprehensive critique of your LinkedIn post:\n\n**Content Quality & Professionalism:**\nStrengths:\n- Clear, well-structured format with valuable, actionable information\n- Professional tone that balances expertise with accessibility\n- Excellent choice of relevant, current tools that add genuine value\n- Strong industry relevance for the target audience\n\nImprovement areas:\n- Could include brief specific statistics or use cases for more credibility\n- Consider adding personal experience with one or two tools\n\n**Engagement & Algorithm Optimization:**\nStrengths:\n- Strong hook with the "🤖" emoji and numbered list format\n- Good length (within optimal 1,300-character range)\n- Effective use of line breaks for readability\n- Strong closing CTA encouraging comments\n\nImprovement areas:\n- Consider starting with a personal anecdote or problem statement\n- Add one specific result/outcome from using these tools\n- Include a transition sentence between the list and pro tip\n\n**Technical Elements:**\nStrengths:\n- Good use of emojis for visual breaks\n- Effective formatting with numbered lists\n- Strong hashtag selection\n\nRecommendations for enhancement:\n1. Add LinkedIn mentions of the companies (@GitHub, @Amazon)\n2. Include 1-2 more specific statistics for credibility\n3. Consider reducing hashtags to 3 most relevant ones\n4. Add a brief (one-line) personal endorsement of your top tool\n\nSuggested opening revision:\n"🤖 After spending 100+ hours testing AI coding tools this quarter, here are the 5 game-changers that actually delivered results..."\n\nOverall: Strong post that needs minor tweaks to maximize engagement and authority. Focus on adding personal experience elements and specific outcomes to enhance credibility.', additional_kwargs={}, response_metadata={}, id='9821797b-df12-49a4-85b7-a3f96272074f')]}}
--------------------------------------------------
Workflow Step: {'create_post': {'messages': [AIMessage(content="Thank you for the detailed feedback. Here's an enhanced version incorporating your suggestions:\n\n🤖 After spending 100+ hours testing AI coding tools this quarter, I've discovered game-changers that transformed my development workflow. Here's what actually delivered results:\n\n1. @GitHub Copilot\nReduced my coding time by 30% last sprint. It's like having a senior developer who knows your codebase inside out. I use it daily for boilerplate code and complex algorithms.\n\n2. @Amazon CodeWhisperer\nSlashed our team's bug rate by 40%. The free tier for individual developers is a steal, and its security-first approach caught several vulnerabilities in our recent project.\n\n3. Tabnine\nMy personal favorite! Its context-aware completions learned my coding patterns so well that it predicts complex TypeScript snippets with surprising accuracy.\n\n4. DeepCode\nCaught 3 critical security issues last month that our regular code review missed. Game-changer for code quality.\n\n5. ChatGPT + Code Interpreter\nPerfect for debugging and rapid prototyping. Helped me solve a complex regex issue in minutes instead of hours.\n\n💡 Real Talk: These tools amplify your capabilities but don't replace core programming skills. They're most powerful when used to enhance your existing workflow.\n\nJust yesterday, Copilot helped me refactor 200 lines of legacy code in under 15 minutes. What's your experience with AI dev tools? Share below! 👇\n\n#SoftwareDevelopment #AI #TechInnovation", additional_kwargs={'usage': {'prompt_tokens': 758, 'completion_tokens': 349, 'cache_read_input_tokens': 0, 'cache_write_input_tokens': 0, 'total_tokens': 1107}, 'stop_reason': 'end_turn', 'thinking': {}, 'model_id': 'anthropic.claude-3-5-sonnet-20241022-v2:0', 'model_name': 'anthropic.claude-3-5-sonnet-20241022-v2:0'}, response_metadata={'usage': {'prompt_tokens': 758, 'completion_tokens': 349, 'cache_read_input_tokens': 0, 'cache_write_input_tokens': 0, 'total_tokens': 1107}, 'stop_reason': 'end_turn', 'thinking': {}, 'model_id': 'anthropic.claude-3-5-sonnet-20241022-v2:0', 'model_name': 'anthropic.claude-3-5-sonnet-20241022-v2:0'}, id='run--395f227c-1c0b-443a-9f91-3476eb1627e4-0', usage_metadata={'input_tokens': 758, 'output_tokens': 349, 'total_tokens': 1107, 'input_token_details': {'cache_creation': 0, 'cache_read': 0}})]}}
--------------------------------------------------
Workflow Step: {'social_critique': {'messages': [HumanMessage(content='Excellent revision! This version is significantly stronger. Let\'s break down the improvements:\n\n**Content Quality & Professionalism:**\nStrengths:\n+ Added specific metrics and personal experiences\n+ Strong credibility through concrete examples\n+ Excellent balance of professional insight and personal touch\n+ Clear progression from tools to real-world application\n\n**Engagement & Algorithm Optimization:**\nStrengths:\n+ Compelling opening with quantified experience (100+ hours)\n+ Each tool now has a specific value proposition\n+ Personal anecdote at the end creates recency effect\n+ Strong storytelling elements with real examples\n\n**Technical Elements:**\nStrengths:\n+ Proper company mentions (@GitHub, @Amazon)\n+ Reduced hashtags to most relevant ones\n+ Better formatting with specific outcomes\n+ Strong closing with recent personal example\n\nMinor Suggestions for Final Polish:\n1. Consider adding one emoji for each tool to improve scannability\n2. Maybe include one line about who would benefit most from each tool (junior vs. senior devs)\n3. Consider adding "Agree?" before the CTA to boost comment rate\n\nThe post now has:\n- Better social proof\n- Stronger personal authority\n- More engaging narrative\n- Clearer value proposition\n\nThis version should see significantly higher engagement rates due to its combination of personal experience, specific metrics, and clear value demonstration. Excellent revision!', additional_kwargs={}, response_metadata={}, id='8c36fad1-7c5d-4faa-a603-79690eb69fc3')]}}
--------------------------------------------------
Workflow Step: {'create_post': {'messages': [AIMessage(content="Thank you for the detailed analysis! Here's the final polished version incorporating your suggestions:\n\n🤖 After spending 100+ hours testing AI coding tools this quarter, I've discovered game-changers that transformed my development workflow. Here's what actually delivered results:\n\n1. 💻 @GitHub Copilot\nReduced my coding time by 30% last sprint. It's like having a senior developer who knows your codebase inside out. Perfect for mid to senior devs handling complex architectures.\n\n2. 🔐 @Amazon CodeWhisperer\nSlashed our team's bug rate by 40%. The free tier for individual developers is a steal, and its security-first approach caught several vulnerabilities in our recent project. Ideal for teams prioritizing secure code.\n\n3. ⚡ Tabnine\nMy personal favorite! Its context-aware completions learned my coding patterns so well that it predicts complex TypeScript snippets with surprising accuracy. Great for developers working with multiple languages.\n\n4. 🔍 DeepCode\nCaught 3 critical security issues last month that our regular code review missed. Game-changer for code quality. Essential for junior devs learning best practices.\n\n5. 🤝 ChatGPT + Code Interpreter\nPerfect for debugging and rapid prototyping. Helped me solve a complex regex issue in minutes instead of hours. Invaluable for developers of all levels, especially during problem-solving sessions.\n\n💡 Real Talk: These tools amplify your capabilities but don't replace core programming skills. They're most powerful when used to enhance your existing workflow.\n\nJust yesterday, Copilot helped me refactor 200 lines of legacy code in under 15 minutes. \n\nAgree? What's your experience with AI dev tools? Share below! 👇\n\n#SoftwareDevelopment #AI #TechInnovation", additional_kwargs={'usage': {'prompt_tokens': 1408, 'completion_tokens': 422, 'cache_read_input_tokens': 0, 'cache_write_input_tokens': 0, 'total_tokens': 1830}, 'stop_reason': 'end_turn', 'thinking': {}, 'model_id': 'anthropic.claude-3-5-sonnet-20241022-v2:0', 'model_name': 'anthropic.claude-3-5-sonnet-20241022-v2:0'}, response_metadata={'usage': {'prompt_tokens': 1408, 'completion_tokens': 422, 'cache_read_input_tokens': 0, 'cache_write_input_tokens': 0, 'total_tokens': 1830}, 'stop_reason': 'end_turn', 'thinking': {}, 'model_id': 'anthropic.claude-3-5-sonnet-20241022-v2:0', 'model_name': 'anthropic.claude-3-5-sonnet-20241022-v2:0'}, id='run--cfd87aec-c1e9-4218-80a3-d048ecbd0d19-0', usage_metadata={'input_tokens': 1408, 'output_tokens': 422, 'total_tokens': 1830, 'input_token_details': {'cache_creation': 0, 'cache_read': 0}})]}}
--------------------------------------------------
Workflow Step: {'social_critique': {'messages': [HumanMessage(content='Outstanding final version! This is a masterclass in LinkedIn content optimization. Let\'s analyze why this version excels:\n\n**Content Quality & Professionalism:**\nExceptional Elements:\n+ Perfect balance of professional insight and accessibility\n+ Clear audience targeting for each tool\n+ Progressive value demonstration (from tools to specific use cases)\n+ Strong authority establishment through detailed knowledge\n\n**Engagement & Algorithm Optimization:**\nOutstanding Features:\n+ Emojis strategically placed for visual navigation\n+ Ideal length for LinkedIn\'s algorithm\n+ Multiple engagement triggers throughout the post\n+ "Agree?" addition creates a psychological trigger for responses\n\n**Technical Elements:**\nMasterful Execution:\n+ Perfect emoji-to-content ratio\n+ Professional formatting with consistent spacing\n+ Strategic use of company mentions\n+ Focused hashtag strategy\n\nExpected Engagement Metrics:\n- Higher comment rate due to targeted audience segmentation\n- Increased shares from specific audience callouts\n- Better save rate due to practical value\n- Strong viral potential due to comprehensive information\n\nThis post now has all the elements of high-performing LinkedIn content:\n1. Valuable information\n2. Personal experience\n3. Visual appeal\n4. Clear audience targeting\n5. Multiple engagement hooks\n6. Professional credibility\n7. Actionable insights\n\nNo further improvements needed - this is publication-ready and should perform exceptionally well on LinkedIn. Excellent work!', additional_kwargs={}, response_metadata={}, id='d76cba2d-1cf4-4c84-9df5-13c4c3f68faa')]}}
--------------------------------------------------
Workflow Step: {'create_post': {'messages': [AIMessage(content="Thank you for such a comprehensive analysis! This exchange perfectly demonstrates the iterative process of creating high-performing LinkedIn content. Let's capture the key learnings from this evolution:\n\n**Key Success Elements:**\n1. Started with solid foundation (clear structure, valuable information)\n2. Enhanced with specific metrics and personal experience\n3. Added audience segmentation\n4. Incorporated visual elements strategically\n5. Optimized for engagement at multiple points\n\n**Critical Improvements Made:**\n- From generic tool list → Personalized recommendations with metrics\n- From broad statements → Specific use cases and results\n- From general audience → Targeted developer segments\n- From basic formatting → Strategic visual hierarchy\n- From standard CTA → Multi-layered engagement triggers\n\n**Best Practices Demonstrated:**\n- Opening hook with credibility marker (100+ hours)\n- Specific metrics (30% reduction, 40% bug rate)\n- Personal anecdotes\n- Tool-specific audience targeting\n- Strategic emoji placement\n- Engagement-optimized formatting\n- Professional company mentions\n- Focused hashtag strategy\n\nThis case study in post optimization shows how thoughtful iteration can transform good content into exceptional content that serves both the audience and platform algorithm requirements.\n\nWould you like to explore any other types of LinkedIn posts or specific content strategies?", additional_kwargs={'usage': {'prompt_tokens': 2132, 'completion_tokens': 280, 'cache_read_input_tokens': 0, 'cache_write_input_tokens': 0, 'total_tokens': 2412}, 'stop_reason': 'end_turn', 'thinking': {}, 'model_id': 'anthropic.claude-3-5-sonnet-20241022-v2:0', 'model_name': 'anthropic.claude-3-5-sonnet-20241022-v2:0'}, response_metadata={'usage': {'prompt_tokens': 2132, 'completion_tokens': 280, 'cache_read_input_tokens': 0, 'cache_write_input_tokens': 0, 'total_tokens': 2412}, 'stop_reason': 'end_turn', 'thinking': {}, 'model_id': 'anthropic.claude-3-5-sonnet-20241022-v2:0', 'model_name': 'anthropic.claude-3-5-sonnet-20241022-v2:0'}, id='run--fb2de19e-b519-40ff-b35d-d3a0a58b7ed1-0', usage_metadata={'input_tokens': 2132, 'output_tokens': 280, 'total_tokens': 2412, 'input_token_details': {'cache_creation': 0, 'cache_read': 0}})]}}
--------------------------------------------------
"""

Let's inspect the complete conversation flow:

# Get final state
final_state = linkedin_workflow.get_state(session_config)
print("Total messages in conversation:", len(final_state.values["messages"]))

# Display the conversation flow
ChatPromptTemplate.from_messages(final_state.values["messages"]).pretty_print()

Expected output:

Total messages in conversation: 8
================================ Human Message =================================

Create a LinkedIn post on AI tools for developers under 180 words.

================================== Ai Message ==================================
...
...
================================== Ai Message ==================================

Thank you for the detailed analysis! Here's the final polished version incorporating your suggestions:

🤖 After spending 100+ hours testing AI coding tools this quarter, I've discovered game-changers that transformed my development workflow. Here's what actually delivered results:

1. 💻 @GitHub Copilot
Reduced my coding time by 30% last sprint. It's like having a senior developer who knows your codebase inside out. Perfect for mid to senior devs handling complex architectures.

2. 🔐 @Amazon CodeWhisperer
Slashed our team's bug rate by 40%. The free tier for individual developers is a steal, and its security-first approach caught several vulnerabilities in our recent project. Ideal for teams prioritizing secure code.

3. ⚡ Tabnine
My personal favorite! Its context-aware completions learned my coding patterns so well that it predicts complex TypeScript snippets with surprising accuracy. Great for developers working with multiple languages.

4. 🔍 DeepCode
Caught 3 critical security issues last month that our regular code review missed. Game-changer for code quality. Essential for junior devs learning best practices.

5. 🤝 ChatGPT + Code Interpreter
Perfect for debugging and rapid prototyping. Helped me solve a complex regex issue in minutes instead of hours. Invaluable for developers of all levels, especially during problem-solving sessions.

💡 Real Talk: These tools amplify your capabilities but don't replace core programming skills. They're most powerful when used to enhance your existing workflow.

Just yesterday, Copilot helped me refactor 200 lines of legacy code in under 15 minutes.

Agree? What's your experience with AI dev tools? Share below! 👇

#SoftwareDevelopment #AI #TechInnovation

================================ Human Message =================================

...

No further improvements needed - this is publication-ready and should perform exceptionally well on LinkedIn. Excellent work!

================================== Ai Message ==================================

Thank you for such a comprehensive analysis! This exchange perfectly demonstrates the iterative process of creating high-performing LinkedIn content. Let's capture the key learnings from this evolution:

...

This case study in post optimization shows how thoughtful iteration can transform good content into exceptional content that serves both the audience and platform algorithm requirements.

Would you like to explore any other types of LinkedIn posts or specific content strategies?

With this, we've built a basic reflection workflow. It works well for content improvement, but it struggles with knowledge-intensive tasks that require external information. That brings us to the Reflexion pattern.

Understanding the Reflexion Agent

The Reflexion pattern, introduced by Shinn et al., extends basic reflection by combining self-critique with external knowledge integration and structured output parsing.

Unlike simple reflection, Reflexion lets the agent learn from its mistakes in real time while drawing on additional information.

The workflow typically follows these steps (a rough sketch follows the list):

  • Initial generation: the agent produces a response along with a self-critique and research queries.
  • External research: knowledge gaps identified during the critique trigger web searches or other information retrieval.
  • Knowledge integration: the new insights are folded into an improved response.
  • Iterative refinement: the agent repeats this cycle until the response reaches the desired quality threshold.
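
A minimal, stubbed sketch of that loop (the function names here are placeholders of my own, not the actual components we build below):

MAX_CYCLES = 2

def respond(question: str) -> dict:
    # stub: the real responder calls the LLM and returns an answer, a critique, and research queries
    return {"answer": "first draft", "queries": ["example query"]}

def research(queries: list[str]) -> list[str]:
    # stub: the real step calls a web search tool for each query
    return [f"evidence for {q}" for q in queries]

def revise(question: str, draft: dict, evidence: list[str]) -> dict:
    # stub: the real revisor folds the critique and evidence into an improved answer
    return {"answer": draft["answer"] + " (revised)", "queries": []}

question = "How do neural networks actually learn?"
draft = respond(question)
for _ in range(MAX_CYCLES):
    draft = revise(question, draft, research(draft["queries"]))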

Credits: LangChain

In a Reflexion agent, the system is structured around three interlocking roles: the Actor, the Evaluator, and Self-Reflection.

  • The Actor attempts the task: writing code, solving a problem, or taking actions in an environment.
  • The Evaluator provides internal feedback, scoring the quality of the Actor's output.
  • The Self-Reflection module produces textual reflections that record what went wrong or what could be improved.

These reflections are stored in memory (a rough sketch follows the list):

  • Short-term memory tracks the trajectory of the current attempt.
  • Long-term memory accumulates lessons from previous reflections to guide future iterations.
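
As a rough sketch of that split (illustrative only; the Reflexion paper does not prescribe these exact data structures):

short_term_trajectory: list[tuple[str, str]] = []  # (action, observation) pairs for the current attempt
long_term_reflections: list[str] = []              # lessons carried across attempts

def end_attempt(lesson: str):
    # store the lesson, then start the next attempt with a clean trajectory
    long_term_reflections.append(lesson)
    short_term_trajectory.clear()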

(a) Reflexion schematic. (b) The Reflexion reinforcement algorithm (Source: Shinn et al., 2023)

The process is iterative: the Actor tries, the Evaluator scores, the Self-Reflection module critiques, and the Actor uses that feedback on its next attempt. The loop continues until the task succeeds or a maximum number of iterations is reached.

For example, if the Actor fails at a step, the reflection might record:

"I got stuck in a loop; next time try a different strategy or tool."

The next iteration, guided by this reflection, has a better chance of success. Reflexion can work with many kinds of feedback, including numeric rewards, error messages, or human hints, all of which are folded into the reflection process.

Reflexion has shown striking results. On coding benchmarks such as HumanEval, a GPT-4 agent augmented with Reflexion reached a 91% success rate, versus 80% for GPT-4 without reflection. In a decision-making simulation (AlfWorld), a ReAct + Reflexion agent solved 130 of 134 challenges, clearly outperforming non-reflective agents.

This highlights Reflexion's core strength: by building AI agents that think about their actions and retain the lessons they learn, we get systems that improve continuously and handle complex tasks more effectively.

Building a Reflexion Agent

At its core, a Reflexion agent is built around an Actor: an agent that generates an initial response, critiques it, and then re-executes the task with improvements. Several key sub-components support this loop:

  • Tool execution: access to external knowledge sources.
  • Initial responder: generates the first draft along with a self-reflection.
  • Revisor/Revision: produces a refined output by incorporating previous reflections.

Building the Tools

Since Reflexion needs external knowledge, we first define a tool that fetches information from the web. Here we use the TavilySearchResults tool, a wrapper around the Tavily search API, which lets our agent run web searches and gather supporting evidence.

from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_community.utilities.tavily_search import TavilySearchAPIWrapper

# Initialize search tool
web_search = TavilySearchAPIWrapper()
tavily_tool = TavilySearchResults(api_wrapper=web_search, max_results=5)

Defining the Prompt Template

Next, let's define a prompt that guides the behavior of this actor agent. The prompt acts as the agent's "role description", spelling out what it should and shouldn't do. The agent is instructed to:

  • provide an initial explanation.
  • reflect on and critique its own answer.
  • generate search queries to fill knowledge gaps.
import datetime

# Agent prompt template
actor_prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are an expert technical educator specializing in machine learning and neural networks.
Current time: {time}
1. {primary_instruction}
2. Reflect and critique your answer. Be severe to maximize improvement.
3. Recommend search queries to research information and improve your answer.""",
        ),
        MessagesPlaceholder(variable_name="messages"),
        (
            "user",
            "\n\n<reminder>Reflect on the user's original question and the"
            " actions taken thus far. Respond using the {function_name} function.</reminder>",
        ),
    ]
).partial(
    time=lambda: datetime.datetime.now().isoformat(),
)

Enforcing Structured Output

When working with multi-step workflows, it's always a good idea to define structured output models for each sub-agent. To ensure consistency, we define structured outputs with Pydantic models:

from pydantic import BaseModel, Field

# Pydantic models for structured output
class Reflection(BaseModel):
    missing: str = Field(description="Critique of what is missing.")
    superfluous: str = Field(description="Critique of what is superfluous")

class GenerateResponse(BaseModel):
    """Generate response. Provide an answer, critique, and then follow up with search queries to improve the answer."""

    response: str = Field(description="~250 word detailed answer to the question.")
    reflection: Reflection = Field(description="Your reflection on the initial answer.")
    research_queries: list[str] = Field(
        description="1-3 search queries for researching improvements to address the critique of your current answer."
    )

Here we define two data classes using Pydantic's BaseModel:

  • Reflection captures the self-critique, asking the agent to call out which information is missing and which is superfluous (unnecessary).
  • GenerateResponse structures the final output. It ensures the agent provides its main response, includes a reflection (based on the Reflection class), and supplies a list of research_queries.

This structured approach guarantees that our agent produces consistent, parseable responses.
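
For instance, a parsed result is just a GenerateResponse instance; the field values below are made up purely for illustration:

example = GenerateResponse(
    response="Supervised learning trains on labeled examples; unsupervised learning finds structure in unlabeled data...",
    reflection=Reflection(
        missing="No concrete algorithms are named.",
        superfluous="The teacher analogy runs a bit long.",
    ),
    research_queries=["supervised vs unsupervised learning algorithms"],
)
print(example.model_dump_json(indent=2))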

Adding Retry Logic

Structured parsing can fail when the output doesn't match the schema. To handle this, we add retry logic with schema feedback:

import json

from langchain_core.messages import ToolMessage
from pydantic import ValidationError

# Agent with retry logic
class AdaptiveResponder:
    def __init__(self, chain, output_parser):
        self.chain = chain
        self.output_parser = output_parser

    def generate(self, conversation_state: dict):
        llm_response = None
        for retry_count in range(3):
            llm_response = self.chain.invoke(
                {"messages": conversation_state["messages"]}, {"tags": [f"attempt:{retry_count}"]}
            )
            try:
                self.output_parser.invoke(llm_response)
                return {"messages": llm_response}
            except ValidationError as validation_error:
                # Convert the schema dict to a JSON string and feed it back so the model can self-correct
                schema_json = json.dumps(self.output_parser.model_json_schema(), indent=2)
                conversation_state["messages"] = conversation_state["messages"] + [
                    llm_response,
                    ToolMessage(
                        content=f"{repr(validation_error)}\n\nPay close attention to the function schema.\n\n{schema_json}\n\nRespond by fixing all validation errors.",
                        tool_call_id=llm_response.tool_calls[0]["id"],
                    ),
                ]
        return {"messages": llm_response}

This introduces an important concept: retry logic with schema feedback. When structured-output validation fails, we feed the schema and the error details back to the LLM so it can self-correct.

Binding the Data Model

We now bind the GenerateResponse model as a tool, which forces the LLM to emit output that strictly follows the defined structure.

from langchain_core.output_parsers.openai_tools import PydanticToolsParser

# Initial answer chain
initial_response_chain = actor_prompt_template.partial(
    primary_instruction="Provide a detailed ~250 word explanation suitable for someone with basic programming background.",
    function_name=GenerateResponse.__name__,
) | llm.bind_tools(tools=[GenerateResponse])

response_parser = PydanticToolsParser(tools=[GenerateResponse])

initial_responder = AdaptiveResponder(
    chain=initial_response_chain, output_parser=response_parser
)

After invoking initial_response_chain, we get a structured output containing the initial answer, a self-critique, and generated search queries. Let's test our initial responder with a simple query.

example_question = "What is the difference between supervised and unsupervised learning?"
initial = initial_responder.generate(
    {"messages": [HumanMessage(content=example_question)]}
)

initial

Expected response:

{'messages': AIMessage(content="I'll explain the key differences between supervised and unsupervised learning using the GenerateResponse function.", additional_kwargs={'usage': {'prompt_tokens': 660, 'completion_tokens': 524, 'cache_read_input_tokens': 0, 'cache_write_input_tokens': 0, 'total_tokens': 1184}, 'stop_reason': 'tool_use', 'thinking': {}, 'model_id': 'anthropic.claude-3-5-sonnet-20241022-v2:0', 'model_name': 'anthropic.claude-3-5-sonnet-20241022-v2:0'}, response_metadata={'usage': {'prompt_tokens': 660, 'completion_tokens': 524, 'cache_read_input_tokens': 0, 'cache_write_input_tokens': 0, 'total_tokens': 1184}, 'stop_reason': 'tool_use', 'thinking': {}, 'model_id': 'anthropic.claude-3-5-sonnet-20241022-v2:0', 'model_name': 'anthropic.claude-3-5-sonnet-20241022-v2:0'}, id='run--316a2aa2-7cbe-4e12-8814-45121515a285-0', tool_calls=[{'name': 'GenerateResponse', 'args': {'response': "Supervised and unsupervised learning represent two fundamental approaches in machine learning that differ primarily in how they learn from data.\n\nSupervised learning works with labeled data, meaning each input has a corresponding known output or target value. Think of it like learning with a teacher who provides the correct answers. For example, in a supervised learning model classifying emails as spam or not spam, each training email would be labeled with the correct classification. The algorithm learns by comparing its predictions to these known labels and adjusting its parameters to minimize errors. Common supervised learning tasks include classification (predicting categories) and regression (predicting continuous values).\n\nIn contrast, unsupervised learning works with unlabeled data, attempting to find hidden patterns or structures without any predefined correct answers. It's like trying to organize a pile of objects without being told how they should be grouped. For instance, clustering algorithms might group customers into distinct segments based on their purchasing behavior, without any predetermined categories. The system discovers these patterns independently by analyzing relationships and similarities in the data.\n\nA key practical difference is that supervised learning requires the time-consuming and often expensive process of data labeling, while unsupervised learning can work with raw, unlabeled data. However, supervised learning typically provides more precise and measurable results since there's a clear way to evaluate performance against known correct answers.", 'reflection': {'missing': "The explanation lacks concrete examples of popular algorithms for each type. It doesn't address semi-supervised learning as a middle ground. The explanation could benefit from discussing the specific evaluation metrics used in each approach. There's no mention of the computational complexity differences or the scale of data typically required.", 'superfluous': 'The email spam example could be more concise. The analogy of learning with a teacher, while helpful, takes up space that could be used for more technical details.'}, 'research_queries': ['comparison of evaluation metrics in supervised vs unsupervised learning', 'popular algorithms and use cases for supervised vs unsupervised learning', 'semi-supervised learning advantages and applications']}, 'id': 'toolu_bdrk_01QbM3eR5M1wTaaRv14Kn3PG', 'type': 'tool_call'}], usage_metadata={'input_tokens': 660, 'output_tokens': 524, 'total_tokens': 1184, 'input_token_details': {'cache_creation': 0, 'cache_read': 0}})}

Revision

The revision step is the final stage of the reflection loop. Its purpose is to combine three key elements, the original draft, the self-critique, and the research results, into a refined, evidence-backed response.

We start by defining a new set of instructions (improvement_guidelines) that explicitly guide the revisor. They emphasize:

  • incorporating the critique into the revision
  • adding numerical citations tied to the research evidence
  • distinguishing correlation from causation in the explanation
  • including a structured References section with clean URLs only
# Revision instructions
improvement_guidelines = """Revise your previous explanation using the new information.
- You should use the previous critique to add important technical details to your explanation.
- You MUST include numerical citations in your revised answer to ensure it can be verified.
- Add a "References" section to the bottom of your answer (which does not count towards the word limit).
- For the references field, provide a clean list of URLs only (e.g., ["https://example.com", "https://example2.com"])
- You should use the previous critique to remove superfluous information from your answer and make SURE it is not more than 250 words.
- Keep the explanation accessible for someone with basic programming background while being technically accurate.
"""

To enforce the output structure, we introduce a Pydantic schema, ImproveResponse. This class extends GenerateResponse and adds an extra sources field, ensuring every improved answer comes with verifiable references.

class ImproveResponse(GenerateResponse):
    """Improve your original answer to your question. Provide an answer, reflection,
    cite your reflection with references, and finally
    add search queries to improve the answer."""

    sources: list[str] = Field(
        description="List of reference URLs that support your answer. Each reference should be a clean URL string."
    )

With the schema defined, we now build the revision chain by binding the guidelines to the LLM and parsing its output:

# Revision chain
improvement_chain = actor_prompt_template.partial(
    primary_instruction=improvement_guidelines,
    function_name=ImproveResponse.__name__,
) | llm.bind_tools(tools=[ImproveResponse])


improvement_parser = PydanticToolsParser(tools=[ImproveResponse])
response_improver = AdaptiveResponder(chain=improvement_chain, output_parser=improvement_parser)

We can now test the chain by providing the full conversation history, including the initial draft, the critique, and the tool output:

revised = response_improver.generate(
    {
        "messages": [
            HumanMessage(content=example_question),
            initial["messages"],
            ToolMessage(
                tool_call_id=initial["messages"].tool_calls[0]["id"],
                content=json.dumps(
                    tavily_tool.invoke(
                        {
                            "query": initial["messages"].tool_calls[0]["args"][
                                "research_queries"
                            ]
                        }
                    )
                ),
            ),
        ]
    }
)

revised["messages"]

### output

# AIMessage(content="I'll use the ImproveResponse function to provide a more focused and technically precise explanation of supervised vs. unsupervised learning.", additional_kwargs={'usage': {'prompt_tokens': 3245, 'completion_tokens': 622, 'cache_read_input_tokens': 0, 'cache_write_input_tokens': 0, 'total_tokens': 3867}, 'stop_reason': 'tool_use', 'thinking': {}, 'model_id': 'anthropic.claude-3-5-sonnet-20241022-v2:0', 'model_name': 'anthropic.claude-3-5-sonnet-20241022-v2:0'}, response_metadata={'usage': {'prompt_tokens': 3245, 'completion_tokens': 622, 'cache_read_input_tokens': 0, 'cache_write_input_tokens': 0, 'total_tokens': 3867}, 'stop_reason': 'tool_use', 'thinking': {}, 'model_id': 'anthropic.claude-3-5-sonnet-20241022-v2:0', 'model_name': 'anthropic.claude-3-5-sonnet-20241022-v2:0'}, id='run--3c3b46e3-9e62-4158-a3dc-0d215fa35cba-0', tool_calls=[{'name': 'ImproveResponse', 'args': {'response': 'Supervised and unsupervised learning represent distinct machine learning paradigms that differ in their learning approach and evaluation methods.\n\nSupervised learning algorithms learn from labeled training data, where each input has a corresponding target output. Common algorithms include Support Vector Machines (SVM) for classification and Linear Regression for continuous value prediction. Performance is measured through specific metrics - classification tasks use accuracy, precision, and recall, while regression tasks employ Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).\n\nUnsupervised learning discovers hidden patterns in unlabeled data. Popular algorithms include K-means for clustering and Principal Component Analysis (PCA) for dimensionality reduction. These algorithms are evaluated using internal metrics like Silhouette Coefficient and Within-Cluster Sum Square for clustering, or Cumulative Explained Variance for dimensionality reduction.\n\nKey practical distinctions include:\n- Data Requirements: Supervised needs labeled data; unsupervised works with raw data\n- Evaluation: Supervised has clear performance metrics against ground truth; unsupervised uses intrinsic structure metrics\n- Applications: Supervised excels in prediction tasks (classification/regression); unsupervised in pattern discovery (clustering/dimensionality reduction)', 'reflection': {'missing': "The explanation could benefit from including specific real-world applications and success rates. It doesn't address the computational requirements or discuss hybrid approaches like semi-supervised learning. Could include more about the relative advantages and disadvantages of each approach.", 'superfluous': 'The listing of evaluation metrics could be more selective - not all metrics needed to be mentioned. The explanation of algorithms could be more focused on the most commonly used ones.'}, 'research_queries': ['real-world applications and success rates of supervised vs unsupervised learning', 'computational requirements comparison supervised unsupervised learning', 'semi-supervised learning advantages over pure supervised and unsupervised approaches'], 'sources': ['https://www.kdnuggets.com/2023/04/exploring-unsupervised-learning-metrics.html', 'https://medium.com/@manpreetkrbuttar/evaluation-metrics-supervised-ml-9ea9e35b2ebc', 'https://h2o.ai/blog/2022/unsupervised-learning-metrics/']}, 'id': 'toolu_bdrk_014hJaNwDhAb9Yzx7GgvCvaC', 'type': 'tool_call'}], usage_metadata={'input_tokens': 3245, 'output_tokens': 622, 'total_tokens': 3867, 'input_token_details': {'cache_creation': 0, 'cache_read': 0}})

Creating the Tool Node

The next step is executing tool calls inside the LangGraph workflow. Although the responder and the revisor use different schemas, they both rely on the same external tool (in this case, the search API).

Reflexion's key distinction is its ability to identify knowledge gaps and actively research solutions.

Let's implement the search integration:

from langchain_core.tools import StructuredTool
from langgraph.prebuilt import ToolNode

# Tool execution function
def execute_search_queries(research_queries: list[str], **kwargs):
    """Execute the generated search queries."""
    return tavily_tool.batch([{"query": search_term} for search_term in research_queries])

# Tool node
search_executor = ToolNode(
    [
        StructuredTool.from_function(execute_search_queries, name=GenerateResponse.__name__),
        StructuredTool.from_function(execute_search_queries, name=ImproveResponse.__name__),
    ]
)

This brings tool integration into the LangGraph workflow. The ToolNode automatically handles tool execution and result formatting, making it seamless to incorporate external knowledge sources.

Constructing the Graph

Finally, we assemble all the components, the responder, the tool executor, and the revisor, into a cyclic graph. This structure captures the iterative nature of Reflexion, where each loop strengthens the final answer.

We start by defining the graph state and the loop-control functions:

# Graph state definition
class State(TypedDict):
    messages: Annotated[list, add_messages]

# Helper functions for looping logic
def get_iteration_count(message_history: list):
    """Counts backwards through messages until it hits a non-tool, non-AI message.
    This helps determine how many tool execution cycles have occurred recently."""

    iteration_count = 0
    # Iterate through messages in reverse order (most recent first)
    for message in message_history[::-1]:
        if message.type not in {"tool", "ai"}:
            break
        iteration_count += 1
    return iteration_count

def determine_next_action(state: list):
    """
    Conditional edge function that determines whether to continue the loop or end.

    Args:
        state: Current workflow state containing messages

    Returns:
        str: Next node to execute ("search_and_research") or END to terminate

    Logic:
        - Counts recent iterations using get_iteration_count()
        - If we've exceeded MAXIMUM_CYCLES, stop the workflow
        - Otherwise, continue with another tool execution cycle
    """
    # in our case, we'll just stop after N plans
    current_iterations = get_iteration_count(state["messages"])
    if current_iterations > MAXIMUM_CYCLES:
        return END
    return "search_and_research"

Now we can construct the complete Reflexion workflow:

# Graph construction
MAXIMUM_CYCLES = 5
workflow_builder = StateGraph(State)

# Add nodes
workflow_builder.add_node("create_draft", initial_responder.generate)
workflow_builder.add_node("search_and_research", search_executor)
workflow_builder.add_node("enhance_response", response_improver.generate)

# Add edges
workflow_builder.add_edge(START, "create_draft")
workflow_builder.add_edge("create_draft", "search_and_research")
workflow_builder.add_edge("search_and_research", "enhance_response")

# Add conditional edges for looping
workflow_builder.add_conditional_edges("enhance_response", determine_next_action, ["search_and_research", END])

# Compile the graph
reflexion_workflow = workflow_builder.compile()

Visualize the graph:

from IPython.display import Image, display

# Show the agent
display(Image(reflexion_workflow.get_graph().draw_png()))

Workflow

Now let's test the complete Reflexion agent with a technical question:

# Run the agent with the neural networks question
target_question = "How do neural networks actually learn?"

print(f"Running Reflexion agent with question: {target_question}")
print("=" * 60)

events = reflexion_workflow.stream(
    {"messages": [("user", target_question)]},
    stream_mode="values",
)

for i, step in enumerate(events):
    print(f"\nStep {i}")
    print("-" * 40)
    step["messages"][-1].pretty_print()

Expected response:

Running Reflexion agent with question: How do neural networks actually learn?
============================================================

Step 0
----------------------------------------
================================ Human Message =================================

How do neural networks actually learn?

Step 1
----------------------------------------
================================== Ai Message ==================================

I'll provide a detailed explanation of how neural networks learn using the GenerateResponse function.
...
...
Step 7
----------------------------------------
================================== Ai Message ==================================

I'll provide a final, refined explanation of how neural networks learn, focusing on clarity, technical accuracy, and accessibility.
Tool Calls:
ImproveResponse (toolu_bdrk_0131adwc1rPwf6SjUofsiaM5)
Call ID: toolu_bdrk_0131adwc1rPwf6SjUofsiaM5
Args:
response: Neural networks learn through a systematic process of trial, error, and optimization. Here's the core mechanism:

When training begins, the network processes input data through layers of interconnected neurons, each applying two operations: a weighted sum of inputs and an activation function (like ReLU or sigmoid) that introduces non-linearity. Initially, these weights are random, leading to poor predictions.

The learning happens through three key steps:
1. The network calculates its error using a loss function - Mean Squared Error for regression tasks or Cross-Entropy for classification
2. Through backpropagation, it computes how each connection contributed to the error, using calculus to find the gradient of the loss function
3. An optimizer like Adam or SGD updates the weights to reduce future errors, with the learning rate controlling the size of these adjustments

Modern techniques enhance this process: batch normalization stabilizes learning by normalizing layer inputs, while dropout prevents overfitting by randomly deactivating neurons during training. Through thousands of iterations, the network gradually improves its predictions by finding optimal weight values.
reflection: {'missing': 'The explanation could benefit from more specific examples of real-world applications and how different types of networks (CNNs vs RNNs) learn differently. It could also explain the intuition behind why certain activation functions are chosen for specific tasks.', 'superfluous': 'The technical details about modern techniques like batch normalization could be simplified or removed to focus more on the core learning process.'}
research_queries: ['practical examples of neural network applications and their learning processes', 'comparison of learning mechanisms in different neural network architectures', 'how to choose activation functions for specific neural network tasks']
sources: ['https://www.youtube.com/watch?v=pLf_W4OKxEQ', 'https://towardsdatascience.com/loss-functions-and-their-use-in-neural-networks-a470e703f1e9', 'https://365datascience.com/trending/backpropagation/', 'https://neptune.ai/blog/deep-learning-optimization-algorithms', 'https://milvus.io/ai-quick-reference/what-are-the-common-challenges-in-training-neural-networks']

============================================================
Reflexion agent execution completed!

The Reflexion agent will:

  1. Generate an initial technical explanation with a self-critique
  2. Identify specific knowledge gaps that need research
  3. Run targeted web searches to gather up-to-date information
  4. Integrate the findings into a comprehensive, cited response
  5. Repeat the process until the explanation meets the quality bar

Reflection vs Reflexion

Now that we've implemented both patterns, let's look at where each one fits best:

Choose Reflection when

  • you're working on creative or stylistic content
  • internal knowledge is sufficient
  • speed matters more than comprehensiveness
  • you're operating under token/cost constraints

Choose Reflexion when

  • accuracy and factual correctness are critical
  • the content needs current or specialized information
  • citations and references are required
  • quality matters more than speed

The key decision factor is whether the task requires external knowledge acquisition.

Use Reflection if the main goal is polishing existing knowledge. Use Reflexion if the task requires discovering and integrating new information.

Conclusion

Both the Reflection and Reflexion patterns mark meaningful advances in AI system design, each with its own strengths and ideal use cases.

  • Reflection excels at content refinement, creative work, and scenarios where speed and efficiency matter most.
  • Reflexion delivers superior accuracy for knowledge-intensive or citation-heavy applications by incorporating external research and structured feedback.

These approaches do involve extra LLM inference (and therefore more time and cost), but they significantly raise the odds of producing high-quality, reliable output. Moreover, by storing the improvement trajectories in memory or using them for fine-tuning, you can help your model avoid repeating the same mistakes in the future; a rough sketch of persisting such a trajectory follows.
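
A minimal sketch of persisting such a trajectory, assuming you simply append each session's messages to a local JSONL file (the file name and helper function are hypothetical, not part of the walkthrough above):

import json

def save_trajectory(thread_id: str, messages, path: str = "reflection_log.jsonl"):
    # append one record per session so later runs (or a fine-tuning job) can reuse the critique trail
    record = {"thread": thread_id, "turns": [{"role": m.type, "content": m.content} for m in messages]}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# e.g. save_trajectory("user1", linkedin_workflow.get_state(session_config).values["messages"])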

In the next post, we'll dive into Language Agent Tree Search (LATS), introduced by Zhou et al. This general-purpose LLM agent search algorithm combines reflection, evaluation, and Monte Carlo tree search to outperform existing approaches such as ReAct, Reflexion, and Tree of Thoughts.

Acknowledgements

This post draws on information from a range of sources, including official documentation and research papers. Each source is credited under the corresponding image, with a link to the original.

Credits: LangGraph, LangChain-Github

Thanks for reading!