LLM QA Builder

This example deploys a freely available large language model (LLM) and feeds it a chunk of text in order to build question-answer (QA) pairs from that text.

Background

Approaches to building an LLM that can read or analyze custom data typically fall into two categories:

Retrieval augmented generation (RAG), in which a set of documents is divided into chunks, stored in a vector store, searched for relevance to a question, and then delivered to a stock LLM with no fine-tuning so that this LLM can interpret the data, answer questions, and cite sources.
Fine-tuning, in which a stock LLM is re-trained on an additional set of data to produce updated parameters.
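
The retrieval step of RAG can be illustrated with a toy example. The sketch below assumes the documents are already divided into chunks and uses word-count cosine similarity as a stand-in for a real embedding model and vector store. The function names `vectorize`, `cosine`, and `retrieve`, and the two sample sentences, are hypothetical.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, chunks, k=1):
    """Return the k chunks most similar to the question."""
    q = vectorize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]

# hypothetical chunks, for illustration only
chunks = [
    "Supercapacitors store energy electrostatically and charge quickly.",
    "Garlic peels can be converted into porous carbon.",
]
best = retrieve("What can garlic peels be converted into?", chunks, k=1)
print(best)
```

In a real RAG pipeline the retrieved chunks would be placed in the prompt sent to the stock LLM, alongside the question.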

We recommend using both methods in combination when building specialist LLMs intended to draw from customized datasets. Fine-tuning comes in several forms. The most common requires that you curate a set of question-answer pairs, and we can use a stock LLM to generate them.

Requirements

A HuggingFace account and associated API token.
A Lehigh HPC account with training to enable you to access our GPU partitions.
A chunk of text.

Method

1. Sign up for HuggingFace

This example uses HuggingFace (hereafter, HF) infrastructure to work with LLMs. Sign up for an account at huggingface.co and then create a new access token.

Next, you must apply for access to Meta Llama 3 8B Instruct by visiting the model repository on HF.

2. Get an interactive session

To make sure we can use the GPUs for this project, we need to get an interactive session on the lake-gpu partition. Note that this is not a high-availability partition, meaning that you may have to wait a long time to get access. If access is an impediment, the Research Computing team can install this for you upon request.


# make a directory in your ceph space
SPOT=/share/ceph/hawk/lts_proj/rpb222/tmp-llm-qa
mkdir -p $SPOT
cd $SPOT
# select one of the following two commands for an interactive session
salloc -p rapids-express -c 4 -t 60 srun --pty bash
# alternately: salloc -p lake-gpu -c 8 --gres=gpu:1 -t 180 srun --pty bash
# load the new software tree
sol_lake
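
Once the session starts, it can help to confirm that you are inside a scheduler job and, for the GPU allocation, that a device is visible. The following is a minimal sketch that only inspects environment variables. It assumes Slurm sets `SLURM_JOB_ID` and `SLURM_JOB_PARTITION` inside the job and that `CUDA_VISIBLE_DEVICES` is set when a GPU is allocated, which depends on the cluster configuration. The function name `describe_session` is hypothetical.

```python
import os

def describe_session(env=None):
    """Summarize scheduler and GPU variables from an environment mapping."""
    env = os.environ if env is None else env
    return {
        "in_slurm_job": "SLURM_JOB_ID" in env,
        "partition": env.get("SLURM_JOB_PARTITION"),
        # assumption: set by the scheduler only when a GPU was requested
        "visible_gpus": env.get("CUDA_VISIBLE_DEVICES"),
    }

if __name__ == "__main__":
    print(describe_session())
```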

3. Build an environment

Continuing from the previous step, we build a new Python virtual environment. First, create a file called reqs-lake.txt with the lines shown between the EOF markers. You can also paste this block into the command line to write the file automatically.


cat > reqs-lake.txt <<EOF
transformers>=4.40.0
trl
accelerate
bitsandbytes
peft
EOF

Next, create a virtual environment in the usual way.


module load python
python -m venv ./venv-lake 
source ./venv-lake/bin/activate
time pip install -r reqs-lake.txt
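
After the install finishes, you may want to confirm that each requirement is present in the active environment. The sketch below uses only the standard library. The `REQUIRED` list mirrors reqs-lake.txt, and the function name `installed_versions` is hypothetical.

```python
from importlib import metadata

# package names taken from reqs-lake.txt
REQUIRED = ["transformers", "trl", "accelerate", "bitsandbytes", "peft"]

def installed_versions(names):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'MISSING'}")
```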

4. Download the models

Select a central location on your Ceph space to store HF models. Collect your HF access token and set this as an environment variable along with your HF username.


# create a file with your HF token
cat > ~/.load_hf_config.sh <<EOF
export HF_TOKEN=TOKEN_IS_REDACTED_DO_NOT_SHARE_IT 
export HF_USERNAME=bradleyrp
export HF_HOME=/share/ceph/hawk/lts_proj/rpb222/modelcache/hfhome
EOF
source ~/.load_hf_config.sh
mkdir -p $HF_HOME
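
Because later steps depend on these variables, a quick check can catch a missing setting early. This is a minimal sketch that reports which of the three variables from the configuration file are unset or empty. It never prints the token itself. The function name `missing_hf_settings` is hypothetical.

```python
import os

def missing_hf_settings(env=None, required=("HF_TOKEN", "HF_USERNAME", "HF_HOME")):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

if __name__ == "__main__":
    missing = missing_hf_settings()
    if missing:
        print("missing settings:", ", ".join(missing))
    else:
        print("all HF settings are present")
```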

Next, we download the "Meta Llama 3 8B Instruct" model from the HF hub. This consumes about 15 GB of space.


huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --exclude "original/*" --local-dir $HF_HOME/local/Meta-Llama-3-8B-Instruct
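
To confirm that the download completed, one option is to check the size of the target directory and look for the model's config.json file. The following is a sketch under those assumptions. The function name `summarize_model_dir` is hypothetical, and the expected size is only the rough figure quoted above.

```python
import os

def summarize_model_dir(path):
    """Return (has_config, total_size_in_GB) for a downloaded model directory."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    has_config = os.path.isfile(os.path.join(path, "config.json"))
    return has_config, total_bytes / 1e9

if __name__ == "__main__":
    # the path mirrors the --local-dir argument used in the download command
    hf_home = os.environ.get("HF_HOME", "")
    model_dn = os.path.join(hf_home, "local", "Meta-Llama-3-8B-Instruct")
    print(summarize_model_dir(model_dn))
```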

5. Select an example

For this example we take an excerpt from an academic paper. Save the excerpt as example.txt in your working directory, since the script in the next step reads that file.
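
If the source document is longer than the model can handle in a single prompt, one option is to split it into smaller excerpts first and process them one at a time. The following is a minimal sketch that groups paragraphs by a character budget. The function name `split_into_excerpts` and the default of 4000 characters are assumptions for illustration, not values from this example.

```python
def split_into_excerpts(text, max_chars=4000):
    """Group paragraphs into excerpts of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    excerpts, current = [], ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # the budget would be exceeded, so close the current excerpt
            excerpts.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        excerpts.append(current)
    return excerpts
```

Each excerpt could then be written to its own file, or passed directly to the question template in place of the contents of example.txt.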

6. Build a script

We can connect our data to the model with a simple script. Copy the following text to script-llm-qa-builder.py:


import os
import torch
import transformers

question_template = """
The given text below is the result of the text extraction from the PDF files.
Generate 3 meaningful questions on the text and the respective answers.
Reply strictly in the JSON format:
{{
  "questions": ["question1", "question2", "question3"],
  "answers": ["answer1", "answer2", "answer3"]
}}
Ensure that the lists of questions and answers are complete and properly
formatted. DO NOT include any additional information or characters outside the
specified JSON format. The response must consist only of the requested JSON
structure. If the generated content does not meet the specified format, please
make the necessary adjustments to ensure compliance.
```
{reftext:s}
```
"""

class MinimalLLM:
    def __init__(self, model_path):
        self.model_id = model_path
        self.pipeline = transformers.pipeline(
            "text-generation",
            model=self.model_id,
            model_kwargs={
                "torch_dtype": torch.float16,
                "quantization_config": {
                    "load_in_4bit": True,
                    # via https://stackoverflow.com/a/77354686
                    "bnb_4bit_compute_dtype": torch.bfloat16,},
                "low_cpu_mem_usage": True,
            },
        )
        # via https://huggingface.co/blog/not-lain/rag-chatbot-using-llama3
        self.terminators = [
            self.pipeline.tokenizer.eos_token_id,
            self.pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        ]

    def get_response(
        self, query, message_history=None,
        max_tokens=4096, temperature=0.6, top_p=0.9):
        # default to an empty history without sharing a mutable default list
        history = message_history or []
        user_prompt = history + [{"role": "user", "content": query}]
        prompt = self.pipeline.tokenizer.apply_chat_template(
            user_prompt, tokenize=False, add_generation_prompt=True
        )
        outputs = self.pipeline(
            prompt,
            max_new_tokens=max_tokens,
            eos_token_id=self.terminators,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
        )
        response = outputs[0]["generated_text"][len(prompt):]
        return response, user_prompt + [
            {"role": "assistant", "content": response}]

    def chatbot(self, system_instructions=""):
        conversation = [{"role": "system", "content": system_instructions}]
        while True:
            user_input = input("User: ")
            if user_input.lower() in ["exit", "quit"]:
                print("Exiting the chatbot. Goodbye!")
                break
            response, conversation = self.get_response(user_input, conversation)
            print(f"Assistant: {response}")
              
    def question(self, text):
        conversation = [{"role": "system", "content": ""}]
        response, conversation = self.get_response(text, conversation)
        print(f"Assistant: {response}")

if __name__ == "__main__":

    model_cache_dn = os.path.join(os.environ['HF_HOME'],'local')
    if not os.path.isdir(model_cache_dn):
        raise Exception(f'cannot find {model_cache_dn}')
    # the following path must match your call to huggingface-cli download
    model_dn = os.path.join(model_cache_dn,
        'Meta-Llama-3-8B-Instruct')
    if not os.path.isdir(model_dn):
        raise Exception(f'cannot find {model_dn}')
    bot = MinimalLLM(model_dn)
    with open('example.txt') as fp:
        text = fp.read()
    bot.question(question_template.format(reftext=text))
```
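
One detail in the template deserves a note: the literal braces of the JSON example are doubled because `str.format` treats single braces as placeholders. The short sketch below uses a shortened, hypothetical template to show that `{{` and `}}` come out as single braces while `{reftext:s}` is replaced by the text.

```python
# shortened stand-in for question_template, for illustration only
template = """Reply strictly in the JSON format:
{{
  "questions": ["question1"],
  "answers": ["answer1"]
}}
Text:
{reftext:s}
"""

prompt = template.format(reftext="Capacitors store energy electrostatically.")
print(prompt)
```

Braces that appear inside the substituted text are left as they are, since `format` does not reinterpret the values it inserts.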

We can run this script directly:

``` 
python script-llm-qa-builder.py
```

We find that the "Loading checkpoint" step can take up to ten minutes on the first execution, but after that, it may run more swiftly.

This returns the following question-answer pairs generated from the text:

```
{
  "questions": [
    "What is the main difference between capacitors and batteries?",
    "What are the advantages of using supercapacitors over conventional capacitors?",
    "What is the potential use of garlic peels as a precursor for synthesizing porous carbons?"
  ],
  "answers": [
    "Capacitors store energy electrostatically, whereas batteries do not suffer from cyclability issues.",
    "Supercapacitors have a higher specific energy density and equivalent power densities compared to capacitors.",
    "Garlic peels can be converted into porous carbon nano materials as electrodes for supercapacitors, utilizing a large underutilized waste material."
  ]
}
```
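
The model does not always follow the requested format, so it is worth validating a reply before using it. The following is a minimal sketch of a parser for replies in the format above. It assumes you capture the reply as a string, for example from the return value of `get_response`, since the `question` method in the script only prints it. The function name `parse_qa_response` is hypothetical.

```python
import json

def parse_qa_response(response):
    """Extract (question, answer) pairs from a model reply.

    Raises ValueError if no JSON object is found or the lists do not match.
    """
    # tolerate stray text before or after the JSON object
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    try:
        data = json.loads(response[start:end + 1])
    except json.JSONDecodeError as err:
        raise ValueError(f"invalid JSON in response: {err}") from err
    questions = data.get("questions")
    answers = data.get("answers")
    if not isinstance(questions, list) or not isinstance(answers, list):
        raise ValueError("expected 'questions' and 'answers' lists")
    if len(questions) != len(answers):
        raise ValueError("questions and answers differ in length")
    return list(zip(questions, answers))
```

A reply that fails validation can simply be regenerated, since sampling is enabled and each run may produce different text.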

After the one-time configuration above, you can repeat this process with:

```
salloc -p rapids-express -c 4 -t 60 srun --pty bash
sol_lake
SPOT=/share/ceph/hawk/lts_proj/rpb222/tmp-llm-qa
cd $SPOT
module load python
source ~/.load_hf_config.sh 
source ./venv-lake/bin/activate
python script-llm-qa-builder.py
```

Next Steps

The preceding example provides a method for building a database of QA pairs locally without relying on an API provided by a paid cloud service. To continue this project, we will need to connect this component, which can create entries in a training set, to a more systematic database so that we can format the training dataset for fine-tuning.
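
One way to connect the generated pairs to a more systematic store is a small SQLite database that can later be exported for training. The sketch below is an assumption about how this could look, not part of the example above. The table layout, the function names, and the conversational `messages` record format used in the export are all hypothetical choices that you should adapt to your fine-tuning tool.

```python
import json
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a database with a single table of QA pairs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS qa_pairs ("
        "id INTEGER PRIMARY KEY, source TEXT, question TEXT, answer TEXT)"
    )
    return conn

def add_pairs(conn, source, pairs):
    """Insert (question, answer) tuples, recording the source document."""
    conn.executemany(
        "INSERT INTO qa_pairs (source, question, answer) VALUES (?, ?, ?)",
        [(source, q, a) for q, a in pairs],
    )
    conn.commit()

def export_jsonl(conn, path):
    """Write one JSON record per line and return the number of records."""
    rows = conn.execute(
        "SELECT question, answer FROM qa_pairs ORDER BY id").fetchall()
    with open(path, "w") as fp:
        for question, answer in rows:
            # assumed record layout: one user turn and one assistant turn
            record = {"messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]}
            fp.write(json.dumps(record) + "\n")
    return len(rows)
```

Keeping the source filename alongside each pair makes it possible to trace a training entry back to the text it came from.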

Research Computing Systems