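LLM+RAG Hybrid Retrieval in Practice: The Golden Combination of BM25 and Semantic Search

An E-commerce Search Optimization Approach Based on Rank-BM25 and HuggingFace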

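I. Technical Background and Core Problem

In retrieval-augmented generation (RAG) systems, traditional keyword retrieval (such as BM25) and semantic retrieval (such as vector embeddings) each have strengths and weaknesses:

BM25: strong at exact matching (product model numbers, specification parameters), but weak on synonyms and fuzzy queries.
Semantic retrieval: good at understanding user intent (for example, "shoes suitable for summer" → breathable mesh sneakers), but less sensitive to proper nouns.
Hybrid strategy: fuses the two result sets with a weighted score; in e-commerce scenarios this can improve search accuracy by 15%-30%.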

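II. Algorithm Comparison and Fusion Strategy

1. BM25 vs. semantic retrieval: strengths and weaknesses

Dimension	BM25	Semantic retrieval (Embedding)
Matching logic	Term-frequency statistics + document-length normalization	Vector similarity in a text semantic space
Best-fit scenarios	Exact keywords, alphanumeric model numbers (e.g. "iPhone14")	Synonym expansion, long-tail queries (e.g. "sandals that don't slip off at the heel")
Weaknesses	Cannot generalize semantically	Proper nouns are easily confused (e.g. does "小米" mean the Xiaomi brand or millet, the grain?)
Computational cost	Low (inverted index)	High (vector computation at query time)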
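
To make the "term-frequency statistics + document-length normalization" row concrete, here is a minimal pure-Python sketch of one common Okapi BM25 variant. The parameter values k1=1.5 and b=0.75 and the tiny English-token corpus are assumptions chosen for illustration; the rank_bm25 library used later handles IDF slightly differently, so its absolute scores will not match this sketch.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query (one common BM25 variant)."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # a term that never appears contributes nothing
            # Smoothed IDF; the "+ 1.0" inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical, already-tokenized product titles.
docs = [
    ["iPhone14", "Pro", "Max", "256GB"],
    ["breathable", "mesh", "sneakers"],
    ["sandals", "that", "stay", "on"],
]
exact = bm25_scores(["iPhone14"], docs)    # exact model number: only doc 0 scores
synonym = bm25_scores(["trainers"], docs)  # synonym of "sneakers": nothing scores
```

The second query shows the weakness listed in the table: because "trainers" never appears literally, every document scores zero even though the second title is the obvious match.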

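2. Hybrid retrieval implementation scheme

The two scores are fused with a linear weighted sum:
combined_score = α · BM25_score + (1 − α) · Semantic_score

Weight tuning: use grid search on a validation set to find the best α (in e-commerce scenarios α is typically 0.4-0.6).
Normalization: apply Min-Max normalization to the BM25 scores and the semantic scores separately, so that their different scales do not distort the weighted sum.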
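
The normalization, fusion, and grid-search steps can be sketched together in a few lines of plain Python. Everything below is illustrative: the validation set is made up, top-1 accuracy is just one possible metric, and the helper names (min_max, fuse, top1_accuracy) are hypothetical rather than part of any library.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; return all zeros when every score is equal."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)  # avoids dividing by zero
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(bm25, semantic, alpha):
    """combined = alpha * BM25 + (1 - alpha) * semantic, after normalization."""
    b, s = min_max(bm25), min_max(semantic)
    return [alpha * x + (1 - alpha) * y for x, y in zip(b, s)]

def top1_accuracy(validation, alpha):
    """Fraction of queries whose top-ranked document is the labeled one."""
    hits = 0
    for bm25, semantic, relevant_idx in validation:
        combined = fuse(bm25, semantic, alpha)
        best = max(range(len(combined)), key=combined.__getitem__)
        hits += best == relevant_idx
    return hits / len(validation)

# Hypothetical validation set:
# (raw BM25 scores, raw semantic scores, index of the relevant document)
validation = [
    ([4.2, 0.0, 0.0], [0.55, 0.60, 0.30], 0),  # model-number query: BM25 is right
    ([0.0, 0.0, 0.0], [0.20, 0.71, 0.35], 1),  # paraphrased query: only semantics helps
    ([1.1, 3.0, 0.0], [0.80, 0.50, 0.40], 0),  # the two signals disagree
]

# Grid search over alpha in steps of 0.1.
alphas = [i / 10 for i in range(11)]
results = {a: top1_accuracy(validation, a) for a in alphas}
best_alpha = max(alphas, key=lambda a: results[a])  # first alpha with the best score
```

On this toy data the pure strategies (α = 0 and α = 1) each miss at least one query, while intermediate values rank all three correctly. A real tuning run would use many more labeled queries and a ranking metric suited to the application.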

III. Hands-on Code: Building a Hybrid Retrieval System

1. Environment setup

python

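# Install dependencies (the leading "!" is Jupyter/Colab notebook syntax; drop it in a shell)
# jieba is a Chinese word segmenter, needed when titles are tokenized by word rather than by whitespace
!pip install rank_bm25 sentence-transformers pandas jieba

# Import the libraries
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd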

2. Data preprocessing and index construction

python

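# Sample data: e-commerce product titles
# (kept in Chinese to match the Chinese embedding model loaded below)
products = [
    "男士透气网面运动鞋 2023夏季新款",     # men's breathable mesh sneakers, summer 2023
    "苹果iPhone14 Pro Max 256GB 5G手机",  # Apple iPhone14 Pro Max 256GB 5G phone
    "家用无线吸尘器大吸力静音款",           # cordless home vacuum, strong suction, quiet
    "女式真皮平底乐福鞋 春秋季单鞋",        # women's leather flat loafers, spring/autumn
    "小米手环7 NFC版 智能运动监测"          # Xiaomi Band 7 NFC, smart activity tracking
]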

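# BM25 index construction
# Chinese titles have no spaces between words, so str.split() would leave whole
# phrases as single tokens and a query like "运动鞋" would never match.
# jieba (extra dependency: pip install jieba) is used here for word segmentation.
import jieba

def tokenize(text):
    return [tok for tok in jieba.lcut(text.lower()) if tok.strip()]

tokenized_products = [tokenize(doc) for doc in products]
bm25 = BM25Okapi(tokenized_products)

# Load the semantic model (HuggingFace)
encoder = SentenceTransformer('BAAI/bge-base-zh-v1.5')
# L2-normalize the embeddings so the dot product used later equals cosine similarity
embeddings = encoder.encode(products, normalize_embeddings=True)

3. Hybrid retrieval implementation

python

def min_max(scores):
    # Min-Max normalization; returns zeros when all scores are equal (avoids 0/0 -> NaN)
    span = np.max(scores) - np.min(scores)
    if span == 0:
        return np.zeros_like(scores, dtype=float)
    return (scores - np.min(scores)) / span

def hybrid_search(query, alpha=0.5, top_k=3):
    # BM25 retrieval (the query must use the same tokenizer as the index)
    bm25_scores = min_max(bm25.get_scores(tokenize(query)))

    # Semantic retrieval: cosine similarity between the query and each product
    query_embedding = encoder.encode([query], normalize_embeddings=True)
    semantic_scores = min_max(np.dot(embeddings, query_embedding.T).flatten())

    # Combined score
    combined_scores = alpha * bm25_scores + (1 - alpha) * semantic_scores
    top_indices = np.argsort(combined_scores)[-top_k:][::-1]

    # Return (product, score) pairs, best first
    return [(products[i], combined_scores[i]) for i in top_indices]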
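
The semantic score above is a dot product between embedding vectors. A dot product only equals cosine similarity when both vectors have unit length; otherwise a long vector can outrank a better-aligned short one. The sketch below illustrates this with made-up 2-D vectors (not real model output) and hypothetical helper names.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy 2-D "embeddings" (hypothetical values).
query = [1.0, 0.0]
same_direction_short = [2.0, 0.0]  # points exactly the same way as the query
off_direction_long = [6.0, 8.0]    # points elsewhere, but has a large magnitude

# Raw dot products: the long, off-direction vector wins (6.0 vs 2.0).
raw = (dot(query, same_direction_short), dot(query, off_direction_long))

# After L2 normalization the dot product is the cosine similarity,
# and the well-aligned vector wins (1.0 vs 0.6).
cosine = (
    dot(l2_normalize(query), l2_normalize(same_direction_short)),
    dot(l2_normalize(query), l2_normalize(off_direction_long)),
)
```

With sentence-transformers, passing normalize_embeddings=True to encode() for both the products and the query has the same effect as the l2_normalize step here.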

4. Verifying the results

python

# Test query: "运动鞋" (sneakers)
results = hybrid_search("运动鞋", alpha=0.5)
for product, score in results:
    print(f"Score: {score:.2f} | Product: {product}")

# Example output (illustrative; exact scores depend on tokenization and model version):
# Score: 0.87 | Product: 男士透气网面运动鞋 2023夏季新款
# Score: 0.65 | Product: 女式真皮平底乐福鞋 春秋季单鞋
# Score: 0.23 | Product: 小米手环7 NFC版 智能运动监测