
Creating Your Own PDF Chatbot with an LLM

Are you interested in creating your own PDF chatbot and want full control over every aspect of the bot? Look no further! In this tutorial, we will guide you through building a PDF chatbot without depending on libraries that impose excessive abstraction. By following this approach, you not only keep your data private but also gain complete control over preprocessing, the number of results returned, search behavior, and more. We will use two different models, falcon-7b and falcon-40b, to achieve our goal.

Introduction

Chatbots have become increasingly popular in recent years, providing automated responses and assistance in various domains. However, many existing chatbot libraries come with high-level abstractions that limit customization and control. To address this limitation, we will explore a method to create a PDF chatbot with fine-grained control.

Prerequisites

Before we dive into the code, make sure you have the following prerequisites:

1. Python: Make sure you have Python installed on your system.
2. PDFMiner: Install the PDFMiner library, which provides tools for extracting text from PDF documents.

pip install pdfminer.six

3. Sentence Transformers: Install the Sentence Transformers library, which offers an easy way to compute embeddings for sentences and paragraphs.

pip install sentence-transformers
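
4. Text Generation Client: Install the text-generation library, the Python client for Hugging Face's Text Generation Inference server. The code below uses its Client class to query the Falcon endpoints.

pip install text-generation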

Code Implementation

Let’s start by setting up our code. Open your preferred Python editor and create a new file. Copy and paste the following code into the file:

 

import argparse

from pdfminer.high_level import extract_text
from sentence_transformers import SentenceTransformer, CrossEncoder, util

from text_generation import Client

PREPROMPT = "Below are a series of dialogues between various people and an AI assistant. The AI tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable. The assistant is happy to help with almost anything, and will do its best to understand exactly what is needed. It also tries to avoid giving false or misleading information, and it caveats when it isn't entirely sure about the right answer. That said, the assistant is practical and really does its best, and doesn't let caution get too much in the way of being useful.\n"
PROMPT = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to
make up an answer. Don't make up new terms which are not available in the context.

{context}"""

END_7B = "\n{query}"
END_40B = "\nUser: {query}\nFalcon:"

PARAMETERS = {
    "temperature": 0.9,
    "top_p": 0.95,
    "repetition_penalty": 1.2,
    "top_k": 50,
    "truncate": 1000,
    "max_new_tokens": 1024,
    "seed": 42,
    "stop_sequences": ["<|endoftext|>", "</s>"],
}
CLIENT_7B = Client("http://")  # Fill in the URL of your falcon-7b inference endpoint
CLIENT_40B = Client("https://")  # Fill in the URL of your falcon-40b inference endpoint


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--fname", type=str, required=True)
    parser.add_argument("--top-k", type=int, default=32)
    parser.add_argument("--window-size", type=int, default=128)
    parser.add_argument("--step-size", type=int, default=100)
    return parser.parse_args()


def embed(fname, window_size, step_size):
    # Extract the PDF's text and collapse all whitespace into single spaces.
    text = extract_text(fname)
    text = " ".join(text.split())
    text_tokens = text.split()

    # Slide a window of window_size words over the text, advancing by
    # step_size words each time so that consecutive chunks overlap.
    sentences = []
    for i in range(0, len(text_tokens), step_size):
        window = text_tokens[i: i + window_size]
        if len(window) < window_size:
            break
        sentences.append(window)

    paragraphs = [" ".join(s) for s in sentences]
    # Bi-encoder for fast retrieval, cross-encoder for precise re-ranking.
    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    model.max_seq_length = 512
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    embeddings = model.encode(
        paragraphs,
        show_progress_bar=True,
        convert_to_tensor=True,
    )
    return model, cross_encoder, embeddings, paragraphs


def search(query, model, cross_encoder, embeddings, paragraphs, top_k):
    # Embed the query and move it to the same device as the corpus embeddings
    # (this also works on CPU, unlike a hard-coded .cuda() call).
    query_embeddings = model.encode(query, convert_to_tensor=True)
    query_embeddings = query_embeddings.to(embeddings.device)
    hits = util.semantic_search(
        query_embeddings,
        embeddings,
        top_k=top_k,
    )[0]

    # Re-score the retrieved candidates with the cross-encoder, which reads the
    # query and each passage together and is more accurate than the bi-encoder.
    cross_input = [[query, paragraphs[hit["corpus_id"]]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_input)

    for idx in range(len(cross_scores)):
        hits[idx]["cross_score"] = cross_scores[idx]

    # Keep the five best paragraphs after re-ranking.
    results = []
    hits = sorted(hits, key=lambda x: x["cross_score"], reverse=True)
    for hit in hits[:5]:
        results.append(paragraphs[hit["corpus_id"]].replace("\n", " "))
    return results


if __name__ == "__main__":
    args = parse_args()
    model, cross_encoder, embeddings, paragraphs = embed(
        args.fname,
        args.window_size,
        args.step_size,
    )
    print(embeddings.shape)  # sanity check: (number of chunks, embedding dimension)
    while True:
        print("\n")
        query = input("Enter query: ")
        results = search(
            query,
            model,
            cross_encoder,
            embeddings,
            paragraphs,
            top_k=args.top_k,
        )

        # Build each model's prompt: preamble + retrieved context + query suffix.
        query_7b = PREPROMPT + PROMPT.format(context="\n".join(results))
        query_7b += END_7B.format(query=query)

        query_40b = PREPROMPT + PROMPT.format(context="\n".join(results))
        query_40b += END_40B.format(query=query)

        text = ""
        for response in CLIENT_7B.generate_stream(query_7b, **PARAMETERS):
            if not response.token.special:
                text += response.token.text

        print("\n***7b response***")
        print(text)

        text = ""
        for response in CLIENT_40B.generate_stream(query_40b, **PARAMETERS):
            if not response.token.special:
                text += response.token.text

        print("\n***40b response***")
        print(text)

Explanation

Let’s go through the code to understand how it works.

1. Importing Required Libraries

We begin by importing the necessary libraries for our chatbot implementation: the `argparse` module for command-line argument parsing, `extract_text` from `pdfminer.high_level` to extract text from a PDF, `SentenceTransformer`, `CrossEncoder`, and `util` from the `sentence_transformers` library for generating sentence embeddings and performing semantic search, and `Client` from the `text_generation` library for querying the model endpoints.

 

import argparse
from pdfminer.high_level import extract_text
from sentence_transformers import SentenceTransformer, CrossEncoder, util
from text_generation import Client

2. Setting Up Constants and Parameters

Next, we define some constants and parameters that will be used throughout the code. These include:

– `PREPROMPT`: A string containing introductory text for the chatbot’s responses.
– `PROMPT`: A string template for the prompt that will be used for generating responses based on context.
– `END_7B` and `END_40B`: Suffix templates that append the user's query to the prompt for the falcon-7b and falcon-40b models, respectively (the 40b variant uses a User/Falcon dialogue format).
– `PARAMETERS`: A dictionary containing various parameters for generating text, such as temperature, top-p, repetition penalty, top-k, truncate length, maximum number of new tokens, seed, and stop sequences.
– `CLIENT_7B` and `CLIENT_40B`: Instances of the `Client` class from the `text_generation` module, initialized with the appropriate URLs.

 

PREPROMPT = "Below are a series of dialogues..."  # Add the full introductory text
PROMPT = """Use the following pieces of context..."""  # Add the prompt template

END_7B = "\n{query}"
END_40B = "\nUser: {query}\nFalcon:"

PARAMETERS = {
    "temperature": 0.9,
    "top_p": 0.95,
    "repetition_penalty": 1.2,
    "top_k": 50,
    "truncate": 1000,
    "max_new_tokens": 1024,
    "seed": 42,
    "stop_sequences": ["<|endoftext|>", "</s>"],
}
CLIENT_7B = Client("http://")  # Fill in the URL of your falcon-7b inference endpoint
CLIENT_40B = Client("https://")  # Fill in the URL of your falcon-40b inference endpoint
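
For reference, here is a minimal sketch of what the filled-in clients might look like if you serve the models yourself with Hugging Face's Text Generation Inference; the hosts and ports below are placeholders, not values from the original setup:

# Hypothetical endpoints for self-hosted Text Generation Inference servers.
# Replace the host/port with wherever your falcon-7b and falcon-40b are served.
CLIENT_7B = Client("http://127.0.0.1:8080")
CLIENT_40B = Client("https://falcon-40b.example.com")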

3. Command-Line Argument Parsing

The `parse_args()` function is responsible for parsing the command-line arguments provided to the script. Here we expect the `--fname` argument, which is the path to the PDF file we want to extract text from, as well as the optional arguments `--top-k`, `--window-size`, and `--step-size`.

 

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--fname", type=str, required=True)
    parser.add_argument("--top-k", type=int, default=32)
    parser.add_argument("--window-size", type=int, default=128)
    parser.add_argument("--step-size", type=int, default=100)
    return parser.parse_args()
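
Assuming the script is saved as chatbot.py (the filename is just an example) and you have a document.pdf to index, a typical invocation looks like this:

python chatbot.py --fname document.pdf --top-k 32 --window-size 128 --step-size 100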

4. PDF Text Extraction and Embedding

The `embed()` function takes the path to a PDF file, a window size, and a step size as inputs. It uses the `extract_text()` function from `pdfminer.high_level` to extract the text content from the PDF. The extracted text is then split into words and grouped into overlapping fixed-size chunks: each chunk contains `window_size` words, and successive chunks start `step_size` words apart, so neighboring chunks overlap.

Next, the function initializes a Sentence Transformer bi-encoder and a Cross Encoder using the Sentence Transformers library. The bi-encoder computes embeddings for the extracted chunks so they can be searched quickly; the cross-encoder is used later to re-rank the retrieved chunks more precisely.

The function returns the Sentence Transformer model, the Cross Encoder model, the embeddings, and the paragraphs (the overlapping text chunks) for further processing.

 

def embed(fname, window_size, step_size):
    text = extract_text(fname)
    text = " ".join(text.split())
    text_tokens = text.split()

    sentences = []
    for i in range(0, len(text_tokens), step_size):
        window = text_tokens[i: i + window_size]
        if len(window) < window_size:
            break
        sentences.append(window)

    paragraphs = [" ".join(s) for s in sentences]
    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    model.max_seq_length = 512
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    embeddings = model.encode(
        paragraphs,
        show_progress_bar=True,
        convert_to_tensor=True,
    )
    return model, cross_encoder, embeddings, paragraphs
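
To make the chunking concrete, here is a minimal standalone sketch of the same sliding-window logic on a toy 1,000-word document. With the default window size of 128 and step size of 100, consecutive chunks overlap by 28 words, and the loop stops as soon as a full window no longer fits:

# Toy illustration of the sliding-window chunking used in embed().
words = [f"w{i}" for i in range(1000)]  # stand-in for the words of a PDF
chunks = []
for i in range(0, len(words), 100):  # step_size = 100
    window = words[i : i + 128]  # window_size = 128
    if len(window) < 128:  # drop the final partial window
        break
    chunks.append(window)

print(len(chunks))  # 9 chunks, starting at word 0, 100, ..., 800
print(len(chunks[0]))  # each chunk holds exactly 128 words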

5. Semantic Search

The `search()` function takes a query, the Sentence Transformer model, the Cross Encoder model, the embeddings, the paragraphs, and a top-k value as inputs. It first embeds the query and compares it with the chunk embeddings using the `semantic_search()` function from the Sentence Transformers library, retrieving the top-k most similar chunks.

The candidates are then re-scored by the cross-encoder, which reads the query and each chunk together, and the function returns the five best chunks after re-ranking.

def search(query, model, cross_encoder, embeddings, paragraphs, top_k):
    query_embeddings = model.encode(query, convert_to_tensor=True)
    query_embeddings = query_embeddings.to(embeddings.device)
    hits = util.semantic_search(
        query_embeddings,
        embeddings,
        top_k=top_k,
    )[0]

    cross_input = [[query, paragraphs[hit["corpus_id"]]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_input)

    for idx in range(len(cross_scores)):
        hits[idx]["cross_score"] = cross_scores[idx]

    results = []
    hits = sorted(hits, key=lambda x: x["cross_score"], reverse=True)
    for hit in hits[:5]:
        results.append(paragraphs[hit["corpus_id"]].replace("\n", " "))
    return results
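
For context, `util.semantic_search()` returns one list of hits per query, and each hit is a dictionary with a `corpus_id` key that indexes into `paragraphs` and a cosine-similarity `score`; that is why the code can attach a `cross_score` key to the same dictionaries before re-sorting. Here is a minimal standalone sketch of the retrieve-then-rerank pattern on made-up sentences (the documents and query are illustrative, not from the article):

# Minimal retrieve-then-rerank example with toy data.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

docs = [
    "The cat sat on the mat.",
    "Falcons are birds of prey.",
    "Paris is the capital of France.",
]
query = "Which city is the capital of France?"

bi_encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

doc_emb = bi_encoder.encode(docs, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)

# Stage 1: fast bi-encoder retrieval of the two most similar documents.
hits = util.semantic_search(query_emb, doc_emb, top_k=2)[0]

# Stage 2: slower but more accurate cross-encoder re-scoring of the candidates.
scores = reranker.predict([[query, docs[h["corpus_id"]]] for h in hits])
best = max(zip(scores, hits), key=lambda pair: pair[0])[1]
print(docs[best["corpus_id"]])  # expected: "Paris is the capital of France."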

6. Main Execution

In the main execution block, we parse the command-line arguments, then extract the PDF text and create embeddings using the `embed()` function. We then enter an infinite loop in which the user can enter queries.

For each query, we call the `search()` function to retrieve the top results based on semantic search. We then construct prompts for the falcon-7b and falcon-40b models using the retrieved results and the user’s query.

Finally, we generate responses using the falcon-7b and falcon-40b models by making requests to the respective clients (`CLIENT_7B` and `CLIENT_40B`) and print the responses.
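
To visualize what is actually sent to falcon-40b, here is a schematic of the assembled prompt for a hypothetical query (the angle-bracket lines stand for the retrieved chunks; the exact text depends on your PDF):

Below are a series of dialogues between various people and an AI assistant. ...
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to
make up an answer. Don't make up new terms which are not available in the context.

<retrieved chunk 1>
...
<retrieved chunk 5>
User: What is this document about?
Falcon: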

if __name__ == "__main__":
    args = parse_args()
    model, cross_encoder, embeddings, paragraphs = embed(
        args.fname,
        args.window_size,
        args.step_size,
    )
    print(embeddings.shape)
    while True:
        print("\n")
        query = input("Enter query: ")
        results = search(
            query,
            model,
            cross_encoder,
            embeddings,
            paragraphs,
            top_k=args.top_k,
        )

        query_7b = PREPROMPT + PROMPT.format(context="\n".join(results))
        query_7b += END_7B.format(query=query)

        query_40b = PREPROMPT + PROMPT.format(context="\n".join(results))
        query_40b += END_40B.format(query=query)

        text = ""
        for response in CLIENT_7B.generate_stream(query_7b, **PARAMETERS):
            if not response.token.special:
                text += response.token.text

        print("\n***7b response***")
        print(text)

        text = ""
        for response in CLIENT_40B.generate_stream(query_40b, **PARAMETERS):
            if not response.token.special:
                text += response.token.text

        print("\n***40b response***")
        print(text)
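
As a side note, if you do not need token-by-token streaming, the `text_generation` client also offers a blocking `generate()` call that accepts the same parameters; a one-line sketch reusing PARAMETERS would be:

# Non-streaming alternative: wait for the full completion in a single call.
text = CLIENT_7B.generate(query_7b, **PARAMETERS).generated_text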

Conclusion

In this tutorial, we have learned how to create our own PDF chatbot without relying on libraries with excessive abstraction. By utilizing the falcon-7b and falcon-40b models, we have full control over the bot's behavior and can customize preprocessing, the number of results returned, search behavior, and more. This approach keeps your data private and lets you tailor the chatbot to your specific requirements. Feel free to experiment with different models, parameters, and prompt structures to further enhance the capabilities of your PDF chatbot. Happy coding!

Watch this video for reference – https://www.youtube.com/watch?v=hSQY4N1u3v0

Jamaley Hussain: Hello, I am Jamaley. I graduated from Staffordshire University, UK. Fortunately, I find myself quite passionate about computers and technology.