Development Framework Tutorial (8): Haystack, an Industrial-Grade NLP Framework

1. What is Haystack?

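Haystack, from deepset, is a long-established NLP framework with a distinctly German sense of engineering rigor. It predates LangChain: before LLMs took off, it was already a go-to choice for building question-answering systems on Elasticsearch + BERT.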

Why choose it?

LangChain adds whatever is trending, so it iterates very quickly and its API changes often (breaking changes are routine). Haystack is far more stable. Its core design idea is Pipelines.

The system you build works like a factory assembly line: Retriever (retrieval) -> Ranker (re-ranking) -> Reader/Generator (reading/generation)

Each component is a standardized building block.
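To make the assembly-line idea concrete, here is a minimal sketch in plain Python. It does not use Haystack's API: the `Stage` type, the `run_pipeline` helper, and the three toy stages are hypothetical names invented for this illustration, and the "generator" simply echoes the top document instead of calling an LLM.

```python
import re
from typing import Callable, Dict, List

# A stage takes the shared state dict and returns an updated copy.
Stage = Callable[[Dict], Dict]

DOCS = [
    "Beijing is the capital of China.",
    "Haystack is a framework for building search pipelines.",
    "Xiao Ming lives in Beijing.",
]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def retriever(state: Dict) -> Dict:
    """Keep every document that shares at least one token with the query."""
    terms = set(tokenize(state["query"]))
    hits = [d for d in DOCS if terms & set(tokenize(d))]
    return {**state, "documents": hits}


def ranker(state: Dict) -> Dict:
    """Re-order the retrieved documents by token overlap with the query."""
    terms = set(tokenize(state["query"]))
    ranked = sorted(
        state["documents"],
        key=lambda d: len(terms & set(tokenize(d))),
        reverse=True,
    )
    return {**state, "documents": ranked}


def generator(state: Dict) -> Dict:
    """Stand-in for an LLM call: return the top-ranked document as the answer."""
    docs = state["documents"]
    answer = docs[0] if docs else "No relevant document found."
    return {**state, "answer": answer}


def run_pipeline(stages: List[Stage], query: str) -> Dict:
    """Pass the state through each stage in order, like an assembly line."""
    state: Dict = {"query": query}
    for stage in stages:
        state = stage(state)
    return state


result = run_pipeline(
    [retriever, ranker, generator],
    "Where in Beijing does Xiao Ming live?",
)
print(result["answer"])
```

Because every stage has the same signature, a stage can be swapped or removed without touching the others, which is the property the building-block metaphor is pointing at.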

2. Core Features

1. Extremely powerful retrieval components

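Haystack's support for production-grade search backends and vector databases such as Elasticsearch, OpenSearch, and Milvus is among the deepest of any framework. It lets you fine-tune retrieval parameters and build sophisticated hybrid search, which combines keyword retrieval with vector retrieval.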
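As an illustration of what hybrid search has to do, the sketch below merges a keyword ranking and a vector ranking with reciprocal rank fusion (RRF), one common fusion method. It is plain Python rather than Haystack code; the function name, the document IDs, and the constant `k = 60` are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one list.

    Each document scores 1 / (k + rank) in every list it appears in;
    the scores are summed and documents are sorted by the total.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a BM25 retriever and an embedding retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Documents that appear near the top of both lists (`doc_a`, `doc_c`) rise above documents that only one retriever found, without requiring the two retrievers' raw scores to be on the same scale.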

2. Evaluator (evaluation)

This is Haystack's most praised feature. Many frameworks teach you to get RAG running and stop there. Haystack provides a complete evaluation toolkit:

What is the retrieval recall?
How accurate are the generated answers (Exact Match)?
It can also generate test sets automatically to "quiz" your AI.
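The two metrics named above are simple to state in code. The following is a minimal sketch of how recall and Exact Match are typically computed, not Haystack's own implementation; the function names, the normalization rules, and the sample IDs are assumptions made for illustration.

```python
import string
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant document IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    found = set(retrieved[:k]) & relevant
    return len(found) / len(relevant)


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> int:
    """Return 1 if the normalized prediction equals any normalized gold answer."""
    return int(normalize(prediction) in {normalize(g) for g in gold_answers})


# Hypothetical retrieval result: only one of the two relevant docs is in the top 3.
retrieved_ids = ["d3", "d1", "d7", "d2"]
relevant_ids = {"d1", "d2"}
recall = recall_at_k(retrieved_ids, relevant_ids, k=3)

# Hypothetical generated answer compared against the labeled answer.
em = exact_match("Beijing.", ["beijing"])

print(f"recall@3 = {recall}, exact match = {em}")
```

Exact Match is deliberately strict: an answer such as "He lives in Beijing" would score 0 against the gold answer "Beijing", which is why it is usually reported alongside softer metrics.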

3. Hands-On Code

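# pip install farm-haystack
# Note: this example targets the Haystack 1.x API. Haystack 2.x is published
# as haystack-ai and uses different imports.

import os

from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, PromptNode, PromptTemplate
from haystack.pipelines import Pipeline

# 1. Document store: in-memory, with BM25 indexing enabled
document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents([
    {"content": "My name is Xiao Ming and I live in Beijing."},
    {"content": "Haystack is a very good framework."}
])

# 2. Retriever: keyword (BM25) search over the document store
retriever = BM25Retriever(document_store=document_store)

# 3. Generator (LLM): the retrieved documents and the query are filled into the template
prompt_template = PromptTemplate(
    prompt="Answer the question based on these documents: {join(documents)} \n Question: {query} \n Answer:"
)
# The OpenAI key is read from the OPENAI_API_KEY environment variable
prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key=os.environ["OPENAI_API_KEY"],
    default_prompt_template=prompt_template,
)

# 4. Assemble the Pipeline: Query -> Retriever -> LLM
pipe = Pipeline()
pipe.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipe.add_node(component=prompt_node, name="LLM", inputs=["Retriever"])

# 5. Run the pipeline and print the first generated answer
result = pipe.run(query="Where does Xiao Ming live?")
print(result["results"][0])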

4. Summary

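If your project is a serious enterprise knowledge base that demands data privacy, high concurrency, and rigorous evaluation metrics, and you do not want LangChain's frequent updates to break it, Haystack is a very trustworthy choice. It is genuinely designed for production environments.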