โŒ

Normal view

There are new articles available, click to refresh the page.
Before yesterdayMain stream

LangChain with OpenAI does not return full products in RAG QnA

I used the Python LangChain UnstructuredURLLoader to retrieve all our products from the company website for RAG purposes. The products were spread across different pages of the company website.

UnstructuredURLLoader was able to retrieve the products in multiple Document objects before they were chunked, embedded and stored in the vector database.

With the OpenAI LLM and RAG module, I asked the AI, "How many products in the company A?" The AI replied, "There are 11 products. You should check the company website for more info..."

If I asked "Please list all the products in the company A", the AI replied with only those 11 products.

The problem is, there are more than 11 products. Why can't the LLM read and aggregate the products across the Documents to count and return all of them?

Is there a context hint or prompt to tell the LLM to read and return all products? Is it caused by the chunking process?
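
A hedged sketch of one thing to check: with the default retriever only a handful of chunks (k=4 by default) ever reach the LLM, so it can only "see" the products contained in those few chunks. Raising k, or using a map_reduce-style chain that runs over every retrieved chunk before combining, lets it aggregate products from more pages. The names vectordb and llm below are assumptions standing in for your own objects.

from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",  # answer per chunk, then combine the partial answers
    retriever=vectordb.as_retriever(search_kwargs={"k": 20}),  # retrieve more product chunks
)

answer = qa_chain.run("Please list all the products in the company A")

Even so, an exhaustive "list everything" query is a poor fit for similarity search; if completeness matters, iterating over all stored Documents (or keeping a separate product table) is more reliable than retrieval.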

'Chroma' object has no attribute 'persist'

I'm persisting the Chroma Database but it's giving me an error.

I'm basically redoing what's in this link.

https://github.com/hwchase17/chroma-langchain/blob/master/persistent-qa.ipynb

Has there been an update in the chromadb version where they removed persist? I don't get it.

!pip -q install chromadb openai langchain tiktoken

!pip install -q langchain-chroma

!pip install -q langchain_chroma  langchain_openai langchain_community

from langchain_chroma import Chroma
from langchain_openai import OpenAI
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import DirectoryLoader

persist_directory ='db'

embedding = OpenAIEmbeddings()

vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embedding,
                                 persist_directory=persist_directory)

vectordb.persist()

Then I'm getting the below error:


AttributeError                            Traceback (most recent call last)
Cell In[47], line 1
----> 1 vectordb.persist()

AttributeError: 'Chroma' object has no attribute 'persist'
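
A minimal sketch of a likely fix: in the langchain-chroma package the persist() method no longer exists; when persist_directory is passed, the underlying chromadb client writes to disk automatically, so the explicit call can simply be dropped.

from langchain_chroma import Chroma

vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embedding,
                                 persist_directory=persist_directory)
# no vectordb.persist() call; the data is already on disk and can be reloaded later with:
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)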

How to create a multi-user chatbot with langchain

Hope you are doing well. I've prepared a chatbot based on the langchain documentation below:

Langchain chatbot documentation

In the above langchain documentation, the prompt template has two input variables - history and human input.

I have variables for UserID and SessionID. I'm storing UserID, SessionID, UserMessage and LLM-Response in a CSV file. I used the Python pandas module to read the CSV, filtered the data frame for the given UserID and SessionID, and prepared the chat history for that specific user session. I'm passing this chat history as the 'history' input to the langchain prompt template (which was discussed in the above link).

As I set verbose=True, langchain was printing the prompt template on the console for every API call. I started the conversation for the first user and first session and sent 3 human_inputs one by one. Later I started a second user session (now the session ID and user ID are changed). After observing the prompt template on the console, I noticed that langchain is not only taking the chat history of the second user session, it's also taking some of the chat history from the previous user session, even though I've written the correct code to prepare the chat history for the given user session. The code to get the chat history is below:

# get chat_history
def get_chat_history(user_id,session_id,user_query):
    chat_history = "You're a chatbot based on a large language model trained by OpenAI. The text followed by Human: will be user input and your response should be followed by AI: as shown below.\n"
    chat_data = pd.read_csv("DB.csv")
    for index in chat_data.index:
        if ((chat_data['user_id'][index] == user_id) and (chat_data['session_id'][index] == session_id)):
            chat_history += "Human: " + chat_data['user_query'][index] + "\n" + "AI: " + chat_data['gpt_response'][index] + "\n"
    chat_history += "Human: " + user_query + "\n" + "AI: "
    return chat_history

How do I get langchain to consider only the given user session's chat history in its prompt? Please help.
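
A minimal sketch (not from the linked docs) of one way to keep LangChain's own memory from mixing sessions: hold one ConversationBufferMemory per (user_id, session_id) in a dict and build the chain with that memory, instead of concatenating the history yourself into a single prompt string. The template text below is illustrative.

from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate

template = """You're a chatbot based on a large language model trained by OpenAI.
{history}
Human: {human_input}
AI:"""
prompt = PromptTemplate(input_variables=["history", "human_input"], template=template)

_memories = {}  # (user_id, session_id) -> ConversationBufferMemory

def get_chain(user_id, session_id):
    key = (user_id, session_id)
    if key not in _memories:
        _memories[key] = ConversationBufferMemory(memory_key="history")
    return LLMChain(llm=ChatOpenAI(temperature=0), prompt=prompt,
                    memory=_memories[key], verbose=True)

# each user/session gets its own isolated history
answer = get_chain("user-1", "session-1").predict(human_input="Hello!")

If you stay with the CSV approach instead, make sure the chain itself has no memory attached, otherwise LangChain appends its own accumulated history on top of the history string you build, which matches the leakage you observed.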

How to specify Nested JSON using Langchain

I am using the StructuredOutputParser of the Langchain library. I am getting a flat dictionary from the parser. Please guide me on how to get a list of dictionaries from the output parser.

PROMPT_TEMPLATE = """ 
You are an android developer. 
Parse this error message and provide me identifiers & texts mentioned in error message. 
--------
Error message is {msg}
--------
{format_instructions}
"""

def get_output_parser():
    missing_id = ResponseSchema(name="identifier", description="This is missing identifier.")
    missing_text = ResponseSchema(name="text", description="This is missing text.")

    response_schemas = [missing_id, missing_text]
    output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
    return output_parser


def predict_result(msg):
    model = ChatOpenAI(openai_api_key="", openai_api_base="", model="llama-2-70b-chat-hf", temperature=0, max_tokens=2000)
    output_parser = get_output_parser()
    format_instructions = output_parser.get_format_instructions()
    
    prompt = ChatPromptTemplate.from_template(template=PROMPT_TEMPLATE)
    message = prompt.format_messages(msg=msg, format_instructions=format_instructions)
    response = model.invoke(message)

    response_as_dict = output_parser.parse(response.content)
    print(response_as_dict)


predict_result("ObjectNotFoundException AnyOf(AllOf(withId:identifier1, withText:text1),AllOf(withId:identifier2, withText:text1),AllOf(withId:identifier3, withText:text1))")

The output I get is

{
    "identifier":"identifier1",
    "text":"text1"
}

Expected output is

[
    {
        "identifier":"identifier1",
        "text":"text1"
    },
    {
        "identifier":"identifier2",
        "text":"text1"
    },
    {
        "identifier":"identifier3",
        "text":"text1"
    }
]

How do I specify such nested JSON in the output parser?
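
A hedged sketch: StructuredOutputParser only describes a single flat object, so one option is PydanticOutputParser with a model that wraps a list. The model and field names below are illustrative, not part of your code.

from typing import List
from langchain.output_parsers import PydanticOutputParser
from langchain.pydantic_v1 import BaseModel, Field

class MissingElement(BaseModel):
    identifier: str = Field(description="This is the missing identifier.")
    text: str = Field(description="This is the missing text.")

class MissingElements(BaseModel):
    elements: List[MissingElement] = Field(description="Every identifier/text pair in the error message.")

output_parser = PydanticOutputParser(pydantic_object=MissingElements)
format_instructions = output_parser.get_format_instructions()
# pass format_instructions into the prompt exactly as before;
# output_parser.parse(response.content) then returns a MissingElements object
# whose .elements attribute is the list of items you expect.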

LangChainDeprecationWarning Langchain is deprecated

I am getting these deprecation warnings:

LangChainDeprecationWarning: Importing LLMs from langchain is deprecated. Importing from langchain will no longer be supported as of langchain==0.2.0. Please import from langchain-community instead:

from langchain_community.llms import OpenAI

To install langchain-community run pip install -U langchain-community.
  warnings.warn(
/Users/$$$$$/LangChain/env/lib/python3.10/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The class langchain_community.llms.openai.OpenAI was deprecated in langchain-community 0.0.10 and will be removed in 0.2.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run pip install -U langchain-openai and import as from langchain_openai import OpenAI.

They are triggered by this import:

from langchain.llms import OpenAI

Although it mentions invoke, the problem only stopped after changing the import method: either

from langchain_community.llms import OpenAI

or

from langchain_openai import OpenAI

needs to be used instead of

from langchain.llms import OpenAI

However, for the following I do not get a solution:

chain = MultiPromptChain(router_chain=router_chain, 
                         destination_chains=destination_chains, 
                         default_chain=default_chain, 
                         verbose=True
                        )
chain.invoke("What is black body radiation?")

Output:

Python/LangChain/env/lib/python3.10/site-packages/langchain/chains/llm.py:316: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.
  warnings.warn(

The code is from DeepLearning.ai.

The same problem has been mentioned on GitHub, but no one seems to know the answer.
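
A hedged sketch of what seems to be going on: that UserWarning is emitted inside the router chain that MultiPromptChain builds (its LLMRouterChain still calls predict_and_parse internally), not by your own code, so beyond updating the imports you can only silence it until the library changes that call. The filter below is an assumption about how you want to handle it.

import warnings

from langchain_openai import OpenAI  # updated import that avoids the deprecation warnings

warnings.filterwarnings(
    "ignore",
    message="The predict_and_parse method is deprecated",
    category=UserWarning,
)

# MultiPromptChain has a single "input" key, so an explicit dict also works:
result = chain.invoke({"input": "What is black body radiation?"})
print(result)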

How to use embeddings as vector_store in Langchain?

I'm trying to create a chatbot using OpenAI, Langchain and a cloud database (MongoDB in my case). What I do is load a PDF, read the data, create chunks from it and then create embeddings using "text-embedding-ada-002" by OpenAI. After that I store in my DB the filename, the text of the PDF, the list of embeddings, and the list of messages. It works well, but the problem is that I want to load the list of embeddings to create the conversation chain, and I don't know if it is possible to create it from that list of embeddings, or whether I should save something else instead, because I don't want to create the embeddings each time I open the chat for the current PDF.

If I use something like the code below to generate the vector store and then run the second function to create the conversation chain, it works, but I want to load the list of embeddings I saved in the DB.

def get_embeddings(chunks: list[str]):
    embeddings = OpenAIEmbeddings()
    vector_store = MongoDBAtlasVectorSearch.from_texts(
        texts=chunks,
        embedding=embeddings,
        collection=embeddings_collection,
        index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    )
    return vector_store

def get_conversation_chain(vector_store):
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        vector_store=vector_store,
        similarity_threshold=0.8,
        max_memory_size=100,
        return_messages=True,
        input_key="question")
    conversation_chain = ConversationalRetrievalChain.from_llm(
        retriever=vector_store.as_retriever(),
        llm=llm,  
        memory=memory)
    result = conversation_chain({"question": "what is the text about"})
    print(result)
    return conversation_chain

Is there a way to create a vector_store from the list of embeddings I saved? Or should I use another type of conversation chain?
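
A hedged sketch: if the chunk text and embeddings are already stored in the Atlas collection (with a vector search index over the embedding field), you can wrap that collection directly instead of re-embedding anything. The text_key / embedding_key values below are the defaults and are assumptions about your document schema.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import MongoDBAtlasVectorSearch

def load_vector_store():
    # reuses the documents already in embeddings_collection; the embedding object
    # is only used to embed incoming queries, not to re-embed the PDF
    return MongoDBAtlasVectorSearch(
        collection=embeddings_collection,
        embedding=OpenAIEmbeddings(),
        index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
        text_key="text",            # field holding the chunk text
        embedding_key="embedding",  # field holding the stored vector
    )

vector_store = load_vector_store()
conversation_chain = get_conversation_chain(vector_store)

So the thing worth persisting per PDF is the chunk text plus its embedding in the shape the vector search index expects, rather than a bare list of embeddings.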

How to create isolated session for ConversationBufferMemory per user in Langchain?

Problem Statement

I wish to create a FastAPI endpoint with isolated user sessions for my LLM, which is using ConversationBufferMemory. This memory serves as the context for the conversation between the AI and the user. Currently, it is shared between the AI and all users. I wish instead to isolate the memory per user.

I have the base implementation of the Langchain core library below.

Boilerplate Code

from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationChain

memory = ConversationBufferMemory(memory_key="chat_history", k=12)

async def interview_function(input_text):
    prompt = PromptTemplate(
        input_variables=["chat_history", "input"], template=interview_template)
    chat_model = ChatOpenAI(model_name="gpt-4-1106-preview", temperature = 0, 
                        openai_api_key = OPENAI_API_KEY, max_tokens=1000)
    llm_chain = ConversationChain(
        llm=chat_model,
        prompt=prompt,
        verbose=True,
        memory=memory,
    )
    
    return llm_chain.predict(input=input_text)

I made progress by subclassing the ConversationChain with the intention of passing custom memory keys, which are related to the user's unique id, from a separate data store, like a SQL table, which I use to reference the various users interacting with my LLM.

Subclassing Progress

def create_extended_conversation_chain(keys: List[str]):
    class ExtendedConversationChain(ConversationChain):
        input_key: List[str] = Field(keys)

        @property
        def input_keys(self) -> List[str]:
            """Override the input_keys property to return the new input_key list."""
            return self.input_key

        @root_validator(allow_reuse=True)
        def validate_prompt_input_variables(cls, values: Dict) -> Dict:
            """Validate that prompt input variables are consistent."""
            memory_keys = values["memory"].memory_variables
            input_key = values["input_key"]
            prompt_variables = values["prompt"].input_variables
            expected_keys = memory_keys + input_key
            
            if set(expected_keys) != set(prompt_variables):
                raise ValueError(
                    "Got unexpected prompt input variables. The prompt expects "
                    f"{prompt_variables}, but got {memory_keys} as inputs from "
                    f"memory, and {input_key} as the normal input keys."
                )
            return values
    return ExtendedConversationChain

However, I am stuck on creating this custom memory key. My memory keys do not seem to be accessible after they have been defined at instantiation, as I did in my boilerplate code section.

Is there a LangChain-specific solution, or do I need to create my own cache and have my LLM interact with it?
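
A minimal sketch of a LangChain-side answer (assumptions: a FastAPI app, and the client sends its user_id): rather than custom memory keys, keep one ConversationBufferMemory per user in a dict (or any cache) and build the chain with that memory on each request. interview_template and OPENAI_API_KEY are reused from the boilerplate above.

from fastapi import FastAPI
from langchain.chains import ConversationChain
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate

app = FastAPI()
_user_memories = {}  # user_id -> ConversationBufferMemory

prompt = PromptTemplate(
    input_variables=["chat_history", "input"], template=interview_template)

@app.post("/interview/{user_id}")
async def interview(user_id: str, input_text: str):
    memory = _user_memories.setdefault(
        user_id, ConversationBufferMemory(memory_key="chat_history"))
    llm_chain = ConversationChain(
        llm=ChatOpenAI(model_name="gpt-4-1106-preview", temperature=0,
                       openai_api_key=OPENAI_API_KEY, max_tokens=1000),
        prompt=prompt,
        memory=memory,  # isolated per user_id
    )
    return {"answer": llm_chain.predict(input=input_text)}

This in-process dict will not survive multiple workers or restarts, so a persistent backend (for example LangChain's RedisChatMessageHistory wired into the memory) is the usual next step.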

Python install dependency: No matching distribution found for nvidia-cublas-cu12==12.1.3.1

I'm having a problem running my Python project on macOS (Chip: Apple M1).

When I'm installing dependencies (sentence-transformers, torch), it throws this error:

6.408      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.2/14.2 MB 29.7 MB/s eta 0:00:00
6.451 ERROR: Could not find a version that satisfies the requirement nvidia-cublas-cu12==12.1.3.1 (from versions: 0.0.1.dev5)
6.451 ERROR: No matching distribution found for nvidia-cublas-cu12==12.1.3.1
------

But it actually works fine on Ubuntu. Does anyone know about this?

I tried running it in an Ubuntu Docker image, "simulating" that I'm on Ubuntu, but this isn't working.
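
A hedged guess at the cause: nvidia-cublas-cu12==12.1.3.1 has no wheel for macOS arm64 (hence pip only sees the 0.0.1.dev5 placeholder), and the linux/arm64 image Docker uses on an M1 may not have a matching wheel either. torch only pulls the nvidia-* packages on x86_64 Linux with CUDA, so if those pins came from a pip freeze done on a Linux box, drop them on the Mac and install the normal macOS builds, e.g.:

pip install torch sentence-transformers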

How to query a mongo collection using langchain

I want to query my mongo collection using langchain.

Just like we have SQLDatabaseChain in langchain to connect with a SQL database such as postgres, do we have something similar to connect with a NoSQL database like mongo?

I looked at the documentation and didn't find any alternative for nosql.

How can you use an already created chromadb collection with an LLM using openai and langchain?

I already have a chromadb collection created with its documents and metadata.

The problem comes when I want to use langchain to create an LLM and pass this chromadb collection to it to use as a knowledge base.

langchain_chroma = Chroma(
    client=persistent_client,
    collection_name=collection.name,
    embedding_function=openai_ef,
)

llm_model = "gtp35turbo-latest"

llm = AzureChatOpenAI(
   api_key=openai_api_key,
   api_version=openai_api_version,
   azure_endpoint=openai_api_base,
   model=llm_model)

qa_chain = RetrievalQA.from_chain_type(
   llm,
   retriever=langchain_chroma.as_retriever(),
   chain_type="refine"
)

When I want to run:

qa_chain.run("How many datascientist do I need for a Object detection problem")

I got this error:

AttributeError                            Traceback (most recent call last)
<ipython-input-81-3cdb65aeb43e> in <cell line: 1>()
----> 1 qa.run("How many datascientist do I need for a Object detection problem")

9 frames
/usr/local/lib/python3.10/dist-packages/langchain/vectorstores/chroma.py in similarity_search_with_score(self, query, k, filter, where_document, **kwargs)
    430             )
    431         else:
--> 432             query_embedding = self._embedding_function.embed_query(query)
    433             results = self.__query_collection(
    434                 query_embeddings=[query_embedding],

AttributeError: 'OpenAIEmbeddingFunction' object has no attribute 'embed_query'

How to solve it?
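
A hedged sketch of the likely fix: LangChain's Chroma wrapper expects a LangChain Embeddings object (something with an embed_query method), whereas openai_ef here appears to be chromadb's own OpenAIEmbeddingFunction, which has no embed_query (exactly what the traceback says). Swapping in an Azure embeddings class is one way around it, assuming the langchain-openai package is available; the deployment name is a placeholder.

from langchain_openai import AzureOpenAIEmbeddings

embedding = AzureOpenAIEmbeddings(
    api_key=openai_api_key,
    api_version=openai_api_version,
    azure_endpoint=openai_api_base,
    azure_deployment="text-embedding-ada-002",  # hypothetical deployment name
)

langchain_chroma = Chroma(
    client=persistent_client,
    collection_name=collection.name,
    embedding_function=embedding,  # should match the model used to embed the collection
)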

Is there a conception of state across multiple chains in langchain?

How do I ensure the variable name from chain1 persists in chain2? Is there a conception of maintaining state in langchain?

from langchain.llms import OpenAI  # needed for OpenAI() below
from langchain.prompts import PromptTemplate
from langchain.schema import StrOutputParser
import dotenv
dotenv.load_dotenv()

prompt1 = PromptTemplate(
  input_variables=["name"], 
  template="I am {name}. Choose one profession for me in one word. Say AYYE when you do.",
)

prompt2 = PromptTemplate(
  input_variables=["profession", "name"],
  template="I am a {profession}. Tell me king of pirates, what is my destiny? Refer to me by my given name.",   
)

llm = OpenAI()

chain1 = prompt1 | llm | StrOutputParser()

chain2 = ({"profession": chain1, } | prompt2 | llm | StrOutputParser())

response = chain2.invoke(({ "name": "Chopper"}))
print(response)

I tried referencing {name} in the second chain, but it wasn't resolved correctly there.
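
A hedged sketch: LCEL has no implicit shared state between chains; anything chain2 needs must be passed through its input mapping explicitly, for example with itemgetter (or RunnablePassthrough), and prompt2's template has to actually contain {name}. The rewritten template below is illustrative.

from operator import itemgetter

prompt2 = PromptTemplate(
    input_variables=["profession", "name"],
    template="I am a {profession}. Tell me king of pirates, what is my destiny? Refer to me as {name}.",
)

chain2 = (
    {"profession": chain1, "name": itemgetter("name")}  # carry name alongside chain1's output
    | prompt2
    | llm
    | StrOutputParser()
)

response = chain2.invoke({"name": "Chopper"})
print(response)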

Automating a Function in Python which runs requests to different Langchain Agents - It often gets stuck

I am trying to run a function in python that takes in a specific input and based on that, runs requests to two different langchain agents (create_pandas_dataframe_agent & ChatOpenAI).

I can provide a snippet of the code below. The problem is that the agent tends to get stuck at random moments and takes forever to process, while at other times it is very fast and simple. Is there something wrong with the way I am writing the code? Should I clean the data better? What can be done?

agent = create_pandas_dataframe_agent(
    ChatOpenAI(temperature=0),
    df,
    verbose=True,
    agent_type=AgentType.OPENAI_FUNCTIONS,
    max_execution_time=3,
    early_stopping_method="generate")

llm_talkative = ChatOpenAI(temperature=0,
                           model="gpt-3.5-turbo",
                           max_tokens=1000)

def envr_trend_analysis(district):
    # Calculate the Environmental Score Trend Over Time
    cs_env_prompt = "What is the current environmental score for district of " + str(district) + "?"
    cs_env = agent.run(cs_env_prompt)

    env_prompt = "What is the percentage increase for the 'environmental score' in the district of " + str(district) + " for each year."
    percentage_increase = agent.run(env_prompt)

    new_prompt = str(cs_env) + "Explain if this is good or bad? If the score is 75, then this is a sign of a healthy environment. If the score is score below 30, then the environment may be at risk. If the score is between 35 and 70, then the environment is performing moderately. " + str(percentage_increase) + " \n Describe this in words. Explain what it means."

    # Describe the trend
    messages = [
        SystemMessage(content=system_prompt),
        HumanMessage(content=new_prompt)
    ]

    response_environmental = llm_talkative(messages)

    return response_environmental.content

I am expecting to be able to provide the function with different districts and to automate saving the outputs.
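
Two hedged observations rather than a definitive fix: max_execution_time is measured in seconds, so max_execution_time=3 with early_stopping_method="generate" may itself be cutting runs short, and for the genuinely hung calls a hard timeout plus retry around agent.run keeps the batch moving. The timeout and retry counts below are arbitrary assumptions.

from concurrent.futures import ThreadPoolExecutor, TimeoutError

def run_with_timeout(prompt, timeout_s=60, retries=2):
    """Run agent.run(prompt) with a hard timeout, retrying if an attempt hangs."""
    for attempt in range(retries + 1):
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(agent.run, prompt)
        try:
            return future.result(timeout=timeout_s)
        except TimeoutError:
            print(f"Attempt {attempt + 1} timed out after {timeout_s}s, retrying...")
        finally:
            pool.shutdown(wait=False)  # don't block on a stuck worker thread
    raise RuntimeError("Agent did not finish within the allotted retries")

# usage inside envr_trend_analysis:
# cs_env = run_with_timeout(cs_env_prompt)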

langchain vectorstore question and answer from a single embedding in vectorstore

I have worked on creating a vector store from a series of paragraphs of a text document. The text of the document has been split into non-overlapping paragraphs for a good reason, as these represent different pieces of information. These paragraphs have metadata, which has been included:

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.schema import Document
import time


paragraphs_document_list = []

for paragraph in paragraph_list:
    # paragraph_id and pageno are tracked elsewhere for each paragraph
    paragraphs_document_list.append(
        Document(page_content=paragraph,
                 metadata=dict(paragraph_id=paragraph_id,
                               page=pageno)))


db = FAISS.from_documents(documents=paragraphs_document_list,
                          embedding=OpenAIEmbeddings(model="gpt-4")
                          )

Normally, I could query the general content of my document by asking a question about it as a whole.

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0.0, model='gpt-4'),
    chain_type="stuff",
    retriever=db.as_retriever(),
    verbose=False
    )

label_output = qa_chain.run(query="What is this document about?")

However, I would like instead to retrieve the different embeddings in my FAISS vector store and then query them individually, using as the query something like "What's this paragraph about?".

Is there any option to query a specific embedding, or to use a single specific embedding as the retriever? In any case, I would like to gain access to the original paragraph I'm querying, together with its metadata.

I tried filtering using metadata to answer based on a specific paragraph:

filter_dict = {"paragraph_id":19, "page":5}

results = db.similarity_search(query, filter=filter_dict, k=1, fetch_k=1)
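
A hedged sketch building on that: give the retriever the metadata filter (and k=1) so RetrievalQA only ever sees the chosen paragraph, and ask for the source documents back to get the original page_content and metadata alongside the answer.

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

filter_dict = {"paragraph_id": 19, "page": 5}

paragraph_qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0.0, model='gpt-4'),
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"filter": filter_dict, "k": 1}),
    return_source_documents=True,
)

result = paragraph_qa({"query": "What's this paragraph about?"})
print(result["result"])                             # the answer
print(result["source_documents"][0].page_content)   # the original paragraph
print(result["source_documents"][0].metadata)       # its metadata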

How do you apply session state to Gradio's ChatInterface?

I have created a chatbot with Gradio's gr.ChatInterface() and Langchain's ConversationalRetrievalChain with chat history. Once uploaded to Huggingface Spaces, I noticed the chat history was being shared across users. I have tried opening the link to my model on Huggingface Spaces in different browsers/devices, and the conversation history is still retained.

I would like the chat history to be different for every user and not to get jumbled between different users. How can I implement Gradio's ChatInterface() with session state, where the chat history is cleared after each session and is different for every user?

My code is here:

import os
from typing import Optional, Tuple

import gradio as gr
from langchain.chains import ConversationChain, ConversationalRetrievalChain
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.memory import ConversationBufferMemory
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.llms import OpenAI
from langchain.schema import AIMessage, HumanMessage

def load_chain(llm_name="gpt-3.5-turbo"):
    """Logic for loading the chain you want to use should go here."""
    
    # define embedding
    embedding = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': False})
    
    # create vector database from data
    persist_directory = 'docs/chroma/'
    vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
    
    # Wrap our vectorstore
    llm = OpenAI(temperature=0)
    compressor = LLMChainExtractor.from_llm(llm)
    
    # define retriever
    compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr")
)    
    # Build prompt
    template = """
    Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. 
    If there are any assumptions or requirements for the answer to apply, please include them in your response. 
    {context}
    Question: {question}
    Helpful Answer:"""
    QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context", "question"],template=template,)
    
    # define memory
    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
    
    # create a chatbot chain
    chain = ConversationalRetrievalChain.from_llm(
        llm=ChatOpenAI(model_name=llm_name, temperature=0), 
        memory=memory, 
        retriever=compression_retriever, 
        combine_docs_chain_kwargs={"prompt": QA_CHAIN_PROMPT}
    )
    return chain

chain = load_chain()

def predict(message, history):
    history_langchain_format = []
    for human, ai in history:
        history_langchain_format.append(HumanMessage(content=human))
        history_langchain_format.append(AIMessage(content=ai))
    history_langchain_format.append(HumanMessage(content=message))
    gpt_response = chain({"question": message})
    return gpt_response['answer']
    
block = gr.Blocks()

with block:
    
    chatbot = gr.ChatInterface(
    fn=predict,
    title="Chatbot")

block.launch()
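
A hedged sketch of one fix: keep load_chain() as above but drop the memory=... argument from ConversationalRetrievalChain.from_llm, and feed the chain the per-session history that gr.ChatInterface already passes into predict(). Each browser session then only sees its own conversation, and it resets when the page is reloaded.

chain = load_chain()  # built WITHOUT a shared ConversationBufferMemory

def predict(message, history):
    # history is session-scoped state managed by gr.ChatInterface: a list of (human, ai) pairs
    chat_history = [(human, ai) for human, ai in history]
    result = chain({"question": message, "chat_history": chat_history})
    return result["answer"]

gr.ChatInterface(fn=predict, title="Chatbot").launch()

The shared history you observed comes from the single module-level chain object: its ConversationBufferMemory lives in the one Python process that serves every visitor to the Space.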

How does SQLDatabase Chain work internally? (Langchain)

Langchain Doc

I want to understand the underlying implementation. I know it uses NLP, but how does it determine whether the requested thing is a table or a column? Maybe they are using spaCy, customised a bit to understand database terms.

What does it store in memory? Obviously they are not storing the whole database. From this answer, I got to know that they are storing the DDL of the database.
But a huge database will mostly have a large DDL. Won't that create issues?
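
A hedged sketch of what actually happens: SQLDatabase inspects the schema and puts the CREATE TABLE statements (plus a few sample rows) into the prompt, and the LLM itself maps the question to tables and columns; as far as I can tell no spaCy-style NER is involved. For large schemas, include_tables and sample_rows_in_table_info keep that prompt small. The connection string below is a placeholder.

from langchain.utilities import SQLDatabase

db = SQLDatabase.from_uri(
    "postgresql://user:password@localhost/mydb",  # hypothetical URI
    include_tables=["orders", "customers"],       # limit the schema sent to the LLM
    sample_rows_in_table_info=2,
)

print(db.get_table_info())  # the DDL-like text that ends up in the prompt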

How to chat with multiple pdfs (that have different information) using langchain?

Currently I have managed to make a web interface to chat with a single PDF document using langchain as a framework, OpenAI as the LLM and Pinecone as the vector store. However, when I wanted to introduce new PDF documents (5 new documents) to the vector store, I realized that their information is different from the first document's.

I have thought about adding the resulting embeddings of all the PDF documents to Pinecone, but I have a doubt about whether the information can get crossed when specific information is requested from only one PDF document.

So I'm thinking that another way could be to add some selectors in the same web interface so that the user can choose which PDF they want to obtain answers from, and thus the query is directed to the specific PDF. But perhaps the user's interaction with the web interface would not be so automatic.

This is why I want to find a way to send all PDF documents to Pinecone and, perhaps, in the vector store itself add an index for each document or add more collections. I would appreciate it if anyone has worked on something similar and can give me advice to continue with my task.
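
A hedged sketch of the metadata-filter approach: keep every PDF in one Pinecone index, tag each chunk with its source file, and filter the retriever by that tag when the user picks a document in the UI. The index name, file name and chunks variable are placeholders; Pinecone client initialisation is omitted.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

embeddings = OpenAIEmbeddings()

# at ingestion time: Document metadata such as {"source": "report_2023.pdf"}
# travels with each chunk into the index
vectorstore = Pinecone.from_documents(chunks, embeddings, index_name="pdf-chat")

# at query time: restrict retrieval to the single PDF chosen by the user
retriever = vectorstore.as_retriever(
    search_kwargs={"filter": {"source": "report_2023.pdf"}, "k": 4}
)

Without the filter, answers are simply drawn from whichever chunks are most similar across all PDFs, which is where the "crossed information" worry comes from.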

Got ValueError while trying to track token usage in Langchain

I am following this tutorial from the langchain official documentation here, where I try to track the number of tokens used. However, I wanted to use gpt-3.5-turbo instead of text-davinci-003, so I changed the LLM class from OpenAI to ChatOpenAI, but this gives a ValueError about an unsupported message type.

Here is the code snippet:

import os

from langchain.chat_models import ChatOpenAI
from langchain.callbacks import get_openai_callback

os.environ['OPENAI_API_KEY'] = "OPENAI-API-KEY"

llm = ChatOpenAI(
  model_name='gpt-3.5-turbo-16k',
  temperature=0.0
)

with get_openai_callback() as cb:
    result = llm("Tell me a joke")
    print(cb)

Getting this error: ValueError: Got unsupported message type: T

Why does changing the class from OpenAI to ChatOpenAI give this error? How do I solve it?
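
A hedged sketch of the likely cause: older ChatOpenAI versions expect a list of messages when called directly, so llm("Tell me a joke") ends up iterating over the string and chokes on its first character (the "T" in the error). Wrapping the text in a HumanMessage (or using llm.predict / llm.invoke) keeps the token-tracking callback working.

from langchain.callbacks import get_openai_callback
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

llm = ChatOpenAI(model_name='gpt-3.5-turbo-16k', temperature=0.0)

with get_openai_callback() as cb:
    result = llm([HumanMessage(content="Tell me a joke")])
    print(cb)  # prompt/completion/total tokens and estimated cost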

Loading different document types in langchain for an all data source qa bot

I am trying to build an application which can be used to chat with multiple types of data using the different langchain document loaders, and Streamlit to build the application.

I am unable to load the files properly with the langchain document loaders.

Here is the loader mapping dict:

FILE_LOADER_MAPPING = {
    ".csv": (CSVLoader, {"encoding": "utf-8"}),
    ".doc": (UnstructuredWordDocumentLoader, {}),
    ".docx": (UnstructuredWordDocumentLoader, {}),
    ".epub": (UnstructuredEPubLoader, {}),
    ".html": (UnstructuredHTMLLoader, {}),
    ".md": (UnstructuredMarkdownLoader, {}),
    ".odt": (UnstructuredODTLoader, {}),
    ".pdf": (PyPDFLoader, {}),
    ".ppt": (UnstructuredPowerPointLoader, {}),
    ".pptx": (UnstructuredPowerPointLoader, {}),
    ".txt": (TextLoader, {"encoding": "utf8"}),
    ".ipynb": (NotebookLoader, {}),
    ".py": (PythonLoader, {}),
 
}

Here is the main function:

def main():
    st.title("Docuverse")

    # Upload files
    uploaded_files = st.file_uploader("Upload your documents", type=["pdf", "md", "txt", "csv", "py", "epub", "html", "ppt", "pptx", "doc", "docx", "odt", "ipynb"], accept_multiple_files=True)
    loaded_documents = []
    if uploaded_files:
        # Process uploaded files
        for uploaded_file in uploaded_files:
            st.write(f"Uploaded: {uploaded_file.name}")
            st.write(f"Uploaded: {type(uploaded_file)}")
            ext = os.path.splitext(uploaded_file.name)[-1][1:].lower()
            if ext in FILE_LOADER_MAPPING:
                loader_class, loader_args = FILE_LOADER_MAPPING[ext]
                loader = loader_class(uploaded_file, **loader_args)
            else:
                loader = UnstructuredFileLoader(uploaded_file)
            loaded_documents.extend(loader.load())

        st.write("Chat with the Document:")
        query = st.text_input("Ask a question:")

        if st.button("Get Answer"):
            if query:
                # Load model, set prompts, create vector database, and retrieve answer
                try:
                    llm = load_model()
                    prompt = set_custom_prompt()
                    CONDENSE_QUESTION_PROMPT = set_custom_prompt_condense()
                    db = create_vector_database(loaded_documents)
                    response = retrieve_bot_answer(query)

                    # Display bot response
                    st.write("Bot Response:")
                    st.write(response)
                except Exception as e:
                    st.error(f"An error occurred: {str(e)}")
            else:
                st.warning("Please enter a question.")

if __name__ == "__main__":
    main()

I am uploading a PDF named protector.pdf and the error I get is:

TypeError: expected str, bytes or os.PathLike object, not UploadedFile


File "/home/user/.local/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
    exec(code, module.__dict__)
File "/home/user/app/app.py", line 395, in <module>
    main()
File "/home/user/app/app.py", line 371, in main
    loaded_documents.extend(loader.load())
File "/home/user/.local/lib/python3.10/site-packages/langchain/document_loaders/unstructured.py", line 86, in load
    elements = self._get_elements()
File "/home/user/.local/lib/python3.10/site-packages/langchain/document_loaders/unstructured.py", line 172, in _get_elements
    return partition(filename=self.file_path, **self.unstructured_kwargs)
File "/home/user/.local/lib/python3.10/site-packages/unstructured/partition/auto.py", line 212, in partition
    filetype = detect_filetype(
File "/home/user/.local/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 244, in detect_filetype
    _, extension = os.path.splitext(_filename)
File "/usr/local/lib/python3.10/posixpath.py", line 118, in splitext
    p = os.fspath(p)

Here is the full code - link

I am not sure if I am correctly handling the uploaded files.

How can I resolve this?
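
A hedged sketch of the usual fix: st.file_uploader returns an UploadedFile (an in-memory buffer), while the LangChain loaders want a filesystem path, so write each upload to a temporary file and pass that path to the loader. Note also that os.path.splitext keeps the leading dot, so the extension should be matched against FILE_LOADER_MAPPING with the dot included.

import os
import tempfile

def load_uploaded_file(uploaded_file):
    ext = os.path.splitext(uploaded_file.name)[-1].lower()  # e.g. ".pdf", dot included
    with tempfile.NamedTemporaryFile(delete=False, suffix=ext) as tmp:
        tmp.write(uploaded_file.getvalue())
        tmp_path = tmp.name
    try:
        if ext in FILE_LOADER_MAPPING:
            loader_class, loader_args = FILE_LOADER_MAPPING[ext]
            loader = loader_class(tmp_path, **loader_args)
        else:
            loader = UnstructuredFileLoader(tmp_path)
        return loader.load()
    finally:
        os.remove(tmp_path)  # clean up the temporary copy

In main(), the per-file block then reduces to loaded_documents.extend(load_uploaded_file(uploaded_file)).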

โŒ
โŒ