[LangChain]Extraction

0 Useful Docs

1 The Schema

First, use Pydantic to define a schema that specifies what information we want to extract.

from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is an `optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )

Best practices

  1. Document the attributes and the schema itself: provide detailed descriptions for the schema and each of its attributes. This information is sent to the LLM and improves extraction quality; clear descriptions help the model understand what each attribute means and how it is expected to be used (a quick way to inspect what the model actually receives is sketched below).
  2. Do not force the LLM to make up information: declare attributes as Optional so the model can return None when it has no definite answer. This keeps the model from producing inaccurate or fabricated values just to satisfy a required field.
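
The descriptions above end up in the JSON schema generated from the Pydantic model, which is what gets passed to the model provider. A minimal way to inspect that schema, assuming Pydantic v2 (this snippet is illustrative and not part of the original guide):

import json

# Print the JSON schema derived from the Person model; the class doc-string and
# the per-field descriptions appear here and are what the LLM actually sees.
print(json.dumps(Person.model_json_schema(), indent=2))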

2 The Extractor

Create an information extractor based on the schema defined above.

from typing import Optional

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from pydantic import BaseModel, Field

# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        # Please see the how-to about improving performance with
        # reference examples.
        # MessagesPlaceholder('examples'),
        ("human", "{text}"),
    ]
)

Use a chat model to extract the information:

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")
structured_llm = llm.with_structured_output(schema=Person)
text = "Alan Smith is 6 feet tall and has blond hair."
prompt = prompt_template.invoke({"text": text})
structured_llm.invoke(prompt)
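
The call returns a populated Person instance. With the text above, the result is typically something along the lines of Person(name='Alan Smith', hair_color='blond', height_in_meters='1.83') (the exact value varies from run to run); note that the model converts the height from feet to meters because the field description asks for meters.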

3 Multiple Entities

In most cases you will want to extract a list of entities rather than a single one. This is easy to achieve by nesting Pydantic models.

from typing import List, Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is an `optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )


class Data(BaseModel):
    """Extracted data about people."""

    # Creates a model so that we can extract multiple entities.
    people: List[Person]
structured_llm = llm.with_structured_output(schema=Data)
text = "My name is Jeff, my hair is black and i am 6 feet tall. Anna has the same color hair as me."
prompt = prompt_template.invoke({"text": text})
structured_llm.invoke(prompt)
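
This time the result is a Data object containing a list of Person entries, typically something like Data(people=[Person(name='Jeff', hair_color='black', height_in_meters='1.83'), Person(name='Anna', hair_color='black', height_in_meters=None)]); the model can resolve "the same color hair as me" to black for Anna, though the exact output varies by run.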

4 Reference examples

The behavior of LLM applications can be steered with few-shot prompting. For chat models, this takes the form of a sequence of pairs of input and response messages that demonstrate the desired behavior.

For example, we can convey the meaning of a symbol by alternating user and assistant messages:

messages = [
    {"role": "user", "content": "2 🦜 2"},
    {"role": "assistant", "content": "4"},
    {"role": "user", "content": "2 🦜 3"},
    {"role": "assistant", "content": "5"},
    {"role": "user", "content": "3 🦜 4"},
]
response = llm.invoke(messages)
print(response.content)
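
Given these examples, the model usually infers that 🦜 stands for addition and replies with 7 (exact behavior depends on the model).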

Structured output often uses tool calling under the hood. This typically involves generating AI messages that contain tool calls, as well as tool messages that contain the results of those tool calls. What should a sequence of messages look like in this case?

Different chat model providers impose different requirements on valid message sequences. Some providers accept a (repeating) message sequence of the following form:

  1. User message
  2. AI message with tool call
  3. Tool message with result

Other providers additionally require a final AI message containing some kind of response after the sequence above.
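
As a rough sketch, such a sequence can be written out with langchain_core message classes like this (illustrative only; the message contents are made up, and in practice these objects are produced by the helper introduced below):

from langchain_core.messages import AIMessage, HumanMessage, ToolMessage

example_sequence = [
    # 1. User message with the example input.
    HumanMessage("Fiona traveled far from France to Spain."),
    # 2. AI message whose content is empty but which carries a tool call to the Data schema.
    AIMessage(
        content="",
        tool_calls=[
            {
                "name": "Data",
                "args": {"people": [{"name": "Fiona", "hair_color": None, "height_in_meters": None}]},
                "id": "call_1",
                "type": "tool_call",
            }
        ],
    ),
    # 3. Tool message acknowledging the tool call.
    ToolMessage("You have correctly called this tool.", tool_call_id="call_1"),
    # 4. Optional final AI message that some providers expect.
    AIMessage("Detected people."),
]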

To simplify this and generate valid message sequences for most model providers, LangChain offers a utility function called tool_example_to_messages: you only need to supply the Pydantic representation of the corresponding tool call to build structured few-shot examples.

We can convert pairs of input text and the desired Pydantic object directly into a sequence of messages that can be fed to a chat model. Under the hood, LangChain formats the tool calls according to each provider's requirements.

from langchain_core.utils.function_calling import tool_example_to_messages

examples = [
    (
        "The ocean is vast and blue. It's more than 20,000 feet deep.",
        Data(people=[]),
    ),
    (
        "Fiona traveled far from France to Spain.",
        Data(people=[Person(name="Fiona", height_in_meters=None, hair_color=None)]),
    ),
]


messages = []

for txt, tool_call in examples:
    if tool_call.people:
        # This final message is optional for some providers
        ai_response = "Detected people."
    else:
        ai_response = "Detected no people."
    messages.extend(tool_example_to_messages(txt, [tool_call], ai_response=ai_response))

for message in messages:
    message.pretty_print()
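
Printing the result shows that each example expands into four messages: a human message with the example text, an AI message carrying the tool call, a tool message acknowledging the call, and the final AI response ("Detected people." / "Detected no people."), giving eight messages in total for the two examples above.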

API Reference: tool_example_to_messages

Let's compare results with and without the examples:

message_no_extraction = {
    "role": "user",
    "content": "The solar system is large, but earth has only 1 moon.",
}

structured_llm = llm.with_structured_output(schema=Data)
structured_llm.invoke([message_no_extraction])
structured_llm.invoke(messages + [message_no_extraction])
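
Without the examples, the model will often hallucinate a Person from the moon sentence; with the few-shot messages prepended, it typically returns Data(people=[]) as desired.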

For more information, see this guide.

