LangChain: Awesome Until It's Not

I was previously writing articles to cover, at a very high level, all the features of LangChain. That was until I started using it in a project that involved the next module I was going to write about: Retrieval.

The Goal

The project entailed a Retrieval-Augmented Generation (RAG) model. We would create vectors (aka embeddings) for several chunks of text based on our CEO's weekly emails and store them in a vector database (hosted by Pinecone). Then, we would use those vectors to find the most similar chunks of text to a given query. Finally, we would use those chunks of text to add more context to the prompt for GPT-4. This would enable GPT-4 to generate more relevant text.

The Problem

The Pinecone Integration uses the Document interface.

interface Document {
  pageContent: string
  metadata: Record<string, any>

Which seems fine and should allow us to add an id to metadata, so we can run the embedding process repeatedly, getting the same results. However, when integrating with Pinecone, if you add id to metadata, it gets placed into a property called metadata, not the id property. So running it two times doubles the number of records in the vector database.

As a workaround, I tried

const doc = new Document({ id, pageContent, metadata })

but this doesn't just throw a TypeScript error, it throws a runtime error. It's effectively a showstopper. I understand why LangChain makes it a runtime error, but I don't understand why they wouldn't have an optional property for id.

The Solution

Drop LangChain for the Pinecone integration portion only. LangChain still helps out a lot with the rest of the project, particularly the store.similaritySearch function.

On the preprocessing side, I did this by using LangChain and OpenAI to create the embeddings and the npm package @pinecone-database/pinecone to connect to Pinecone.

On the querying side, I connect to Pinecone using @pinecone-database/pinecone and use LangChain's PineconeStore.fromExistingIndex to bring it into the LangChain framework. This allows for function calls like similaritySearch.


Use LangChain until you run into problems. The productivity gains outweigh the pain of switching to something else for a small portion of your project.