as-rag模块拓展思考

Posted Oct 16, 2024 Updated Oct 24, 2024

4 min read

一些问题：

Langchain选择使用index导入而不是直接从数据库导入，主要是为了在应用中提供灵活性和简便性。以下是原因分析：

本地化管理：通过index导入，可以在应用层级方便地管理和更新数据，而无需涉及数据库层的复杂操作。这使得开发者可以快速刷新或重建索引，适应动态变化的需求。
一体化工作流：直接集成索引生成与使用，提升了开发体验，使得模型、数据预处理、索引管理一体化。无需额外的集成步骤。

对于弊端：

初始数据集成问题：对于已有的向量数据没有现成索引的情况，确实需要重新创建或使用其他工具进行集成。这可能牺牲了一部分灵活性。
工具限制：当前Langchain和LlamaIndex可能都不支持直接利用无需重新处理的已有存储向量。这是一个潜在的限制，需要开发者根据实际需求进行权衡或者扩展。

解决方案的可能途径包括使用适配器或中间件，来桥接现有数据架构和Langchain的索引需求。这样可以在保留灵活性的同时，支持更广泛的使用场景。

关于llamaindex从向量存储加载数据：

https://docs.llamaindex.ai/en/stable/community/integrations/vector_stores/#loading-data-from-vector-stores-using-data-connector

其实就是把原本的向量数据转换成documents，然后用同样的接口加载进index（注意此处并非vector index）。

langchain不支持这种操作，我一开始以为是因为它的add_documents没有判断doc中是否存在”vectors”的环节：

  
def add_documents(
    self,
    documents: list[Document],
    ids: Optional[list[str]] = None,
    **kwargs: Any,
    ) -> list[str]:
"""Add documents to the store."""
texts = [doc.page_content for doc in documents]
vectors = self.embedding.embed_documents(texts)

if ids and len(ids) != len(texts):
    msg = (
        f"ids must be the same length as texts. "
        f"Got {len(ids)} ids and {len(texts)} texts."
    )
    raise ValueError(msg)

    id_iterator: Iterator[Optional[str]] = (
        iter(ids) if ids else iter(doc.id for doc in documents)
    )

ids_ = []

for doc, vector in zip(documents, vectors):
    doc_id = next(id_iterator)
    doc_id_ = doc_id if doc_id else str(uuid.uuid4())
    ids_.append(doc_id_)
    self.store[doc_id_] = {
        "id": doc_id_,
        "vector": vector,
        "text": doc.page_content,
        "metadata": doc.metadata,
    }

    return ids_

其实langchain应该也是支持的，只不过传入index的存储库不能是add_documents中自动含有self.embedding.embed_documents这种向量转换的存储库，实际上llamaindex也是这个原理，使用别的方法加载，绕开向量转换的操作。

另，llamaindex的vectorstoreindex中，有一个方法专门用来直接加载向量存储库为index：.from_vector_store，

经验证是可以的。阅读源码，发现原理其实就是直接传入一个nodes空列表，也就是说这个方法并没有真正建立新的索引：

  
    def from_vector_store(
        cls,
        vector_store: BasePydanticVectorStore,
        embed_model: Optional[EmbedType] = None,
        **kwargs: Any,
    ) -> "VectorStoreIndex":
        if not vector_store.stores_text:
            raise ValueError(
                "Cannot initialize from a vector store that does not store text."
            )

        kwargs.pop("storage_context", None)
        storage_context = StorageContext.from_defaults(vector_store=vector_store)

        return cls(
            nodes=[],
            embed_model=embed_model,
            storage_context=storage_context,
            **kwargs,
        )

并且as中的实现也是必须基于index通过文档创建而非通过向量数据库创建（即必须自带源）

所以~没问题啦~！

This post is licensed under CC BY 4.0 by the author.

Trending Tags