Fine-Tune LLM for Code Generation - Data Collection and Preprocessing
Introduction
Hey there! Nice to have you here again. I recently worked on fine-tuning LLMs for Python code generation. LLM stands for Large Language Model, and these models are good at understanding large amounts of text. ChatGPT and Gemini are two popular LLMs. This blog goes over my work.
Data Collection
I chose the MBPP dataset as my base dataset. Each entry includes a Python programming problem, the solution code for that problem, and a number of test statements to check the validity of the code. In the MBPP paper, 374 data points are used for training and 500 for testing; the same split is used throughout this blog.
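I keep pre-split JSON files under data/. The exact export script isn't part of this post, but a minimal sketch of how such files could be produced from the Hugging Face mbpp dataset (which ships with the same 374/500 split) looks like this:
# Minimal sketch (assumption: the Hugging Face "mbpp" dataset is used as the source;
# the script that actually produced data/*.json is not shown in this post).
import datasets

mbpp = datasets.load_dataset("mbpp")               # standard MBPP splits
mbpp["train"].to_json("data/train_dataset.json")   # 374 problems for training
mbpp["test"].to_json("data/test_dataset.json")     # 500 problems for testing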
The dataset files are loaded.
import datasets  # Hugging Face datasets library

train_dataset = datasets.load_dataset("json", data_files = "data/train_dataset.json")
train_dataset = train_dataset['train']
test_dataset = datasets.load_dataset("json", data_files = "data/test_dataset.json")
test_dataset = test_dataset['train']
Let's check what the 11th entry of this dataset looks like.
# Printing a sample training data.
print(train_dataset[10]['text'])
print(train_dataset[10]['code'])
print(train_dataset[10]['test_list'])
Write a function to find the maximum of nth column from the given tuple list.
def max_of_nth(test_list, N):
    res = max([sub[N] for sub in test_list])
    return (res)
['assert max_of_nth([(5, 6, 7), (1, 3, 5), (8, 9, 19)], 2) == 19', 'assert max_of_nth([(6, 7, 8), (2, 4, 6), (9, 10, 20)], 1) == 10', 'assert max_of_nth([(7, 8, 9), (3, 5, 7), (10, 11, 21)], 1) == 11']
Now that the data is loaded, it needs to be preprocessed into a form the LLM can use. The LLM of choice is Llama 2, an open-source LLM developed by Meta that comes in a number of configurations. The one I chose is Llama 2 7B Chat, where 7B means 7 billion parameters and Chat means the model is fine-tuned for chat purposes.
For working with Llama 2 Chat, the prompt needs to be modified. A prompt in the following format is required.
<s> [INST] <<SYS>>
System Prompt
<</SYS>>
User Prompt
[/INST] Model Answer </s>
Here, the System Prompt is a predefined instruction for Llama 2, and the User Prompt is the prompt that the user passes.
Data Preprocessing
So I made the mapping functions map_mbpp_train_data(sample) and map_mbpp_test_data(sample), which take sample as input and add a new field sample['prompt'] containing the instructions for Llama followed by sample['text']. Then sample['test_list'][0] is added inside [TEST] tags in both cases, while sample['code'] is included only for training data and omitted for testing data, since that is what the LLM needs to generate. Apart from this, [PYTHON] and [/PYTHON] tags are used to force the LLM to generate code only inside those tags, which also helps later in the evaluation part.
def map_mbpp_train_data(sample):
    sample['prompt'] = f"""<s>[INST] <<SYS>>
You are a python programming assistant that obeys the constraints and passes the example test case.
You wrap the code answer without any comments between [PYTHON] and [/PYTHON] tags.
In case a test case is available, it is written inside [TEST] and [/TEST] tags.
<</SYS>>
{sample['text']}
[TEST]{sample['test_list'][0]}[/TEST]
[/INST]
[PYTHON]
{sample['code']}
[/PYTHON]</s>
"""
    return sample

def map_mbpp_test_data(sample):
    sample['prompt'] = f"""<s>[INST] <<SYS>>
You are a python programming assistant that obeys the constraints and passes the example test case.
You wrap the code answer without any comments between [PYTHON] and [/PYTHON] tags.
In case a test case is available, it is written inside [TEST] and [/TEST] tags.
<</SYS>>
{sample['text']}
[TEST]{sample['test_list'][0]}[/TEST]
[/INST]
[PYTHON]
"""
    return sample
And finally, the mapped training dataset looks like this:
# Map the train_dataset with the training mapping function
train_dataset = train_dataset.map(map_mbpp_train_data)
# Print a sample prompt from train dataset.
print(train_dataset[0]['prompt'])
<s>[INST] <<SYS>>
You are a python programming assistant that obeys the constraints and passes the example test case.
You wrap the code answer without any comments between [PYTHON] and [/PYTHON] tags.
In case a test case is available, it is written inside [TEST] and [/TEST] tags.
<</SYS>>
Write a function to find the longest chain which can be formed from the given set of pairs.
[TEST]assert max_chain_length([Pair(5, 24), Pair(15, 25),Pair(27, 40), Pair(50, 60)], 4) == 3[/TEST]
[/INST]
[PYTHON]
class Pair(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b
def max_chain_length(arr, n):
    max = 0
    mcl = [1 for i in range(n)]
    for i in range(1, n):
        for j in range(0, i):
            if (arr[i].a > arr[j].b and
                mcl[i] < mcl[j] + 1):
                mcl[i] = mcl[j] + 1
    for i in range(n):
        if (max < mcl[i]):
            max = mcl[i]
    return max
[/PYTHON]</s>
And the testing dataset looks like this:
# Map the test_dataset with the testing mapping function
test_dataset = test_dataset.map(map_mbpp_test_data)
# Print a sample prompt from the test dataset.
print(test_dataset[0]['prompt'])
<s>[INST] <<SYS>>
You are a python programming assistant that obeys the constraints and passes the example test case.
You wrap the code answer without any comments between [PYTHON] and [/PYTHON] tags.
In case a test case is available, it is written inside [TEST] and [/TEST] tags.
<</SYS>>
Write a python function to remove first and last occurrence of a given character from the string.
[TEST]assert remove_Occ("hello","l") == "heo"[/TEST]
[/INST]
[PYTHON]
Evaluating Models and Fine Tuning
Base Llama2 7B for MBPP
First, I made helper functions to test the accuracy of the LLM; these helpers are also reused for testing the fine-tuned models later. The idea is to use a prompt similar to the one used in map_mbpp_train_data(sample). So I made a function to extract the Python code, which I expect to find between tag1 and tag2; in this case, [PYTHON] and [/PYTHON].
import re

def extract_text_between_tags(input_string, tag1, tag2):
    pattern = tag1 + '(.*?)' + tag2
    return re.findall(pattern, input_string, re.DOTALL)
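As a quick sanity check, here is what the extractor returns on a made-up string (not from the dataset):
# Illustrative example with a made-up string
sample_output = "[PYTHON]\ndef add(a, b):\n    return a + b\n[/PYTHON]"
print(extract_text_between_tags(sample_output, r"\[PYTHON\]", r"\[/PYTHON\]"))
# ['\ndef add(a, b):\n    return a + b\n']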
Next, I made a function that executes the Python code generated by the LLM. For the accuracy calculation there are just two outcomes: either the code runs, or it does not run/gets stuck.
def run_python_code(code):
    try:
        exec(code)
        return True
    except:
        return False
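A quick illustration of the two outcomes (both snippets are made up):
print(run_python_code("assert 1 + 1 == 2"))  # True: the code executes without raising
print(run_python_code("assert 1 + 1 == 3"))  # False: the assertion raises an error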
Now, we are ready to tackle the data with the first LLM. As I am going to compare different models fine-tuned with different datasets, I will make a helper function calculate_mbpp_accuracy that determines the accuracy of a model. The baseline is the base Llama2 7B Chat model.
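For reference, the snippets from here on rely on a handful of libraries. The original import cell isn't shown in this post, but roughly the following is assumed:
# Reconstructed import list (the notebook's actual import cell is not shown here)
import gc
import json
import os
import time

import peft
import torch
import tqdm
import transformers
import trl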
First, a 4-bit quantized model is loaded.
# Load tokenizer and model with QLoRA configuration
nf4_config = transformers.BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_use_double_quant = True,
    bnb_4bit_compute_dtype = torch.bfloat16
)
# Load base model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config = nf4_config,
    device_map = {"": 0},
    use_cache = True
)
model.config.pretraining_tp = 1
Then, the tokenizer of the base Llama2 7B Chat model (model_a_name) is used; the same tokenizer is reused for every model variant.
# Load LLaMA tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_a_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
Finally, the input is tokenized, passed through the LLM, and the output is decoded.
# Tokenize input, pass it through the LLM and decode output
input = tokenizer(sentence, return_tensors="pt", padding=True).to(model.device)
output_sequences = model.generate(**input, max_new_tokens=max_new_tokens, do_sample=True, top_p=0.9)
output = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)[0]
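The snippets also refer to model_a_name, model_b_name, model_c_name, active_model_names and model_stats. These are configured once at the top of my notebook, roughly like the following; the base model id is the standard Hugging Face repository, while the two adapter names are placeholders I use here for illustration:
# Model naming convention (the two fine-tuned names are placeholder directory names)
model_a_name = "meta-llama/Llama-2-7b-chat-hf"      # base Llama 2 7B Chat
model_b_name = "llama-2-7b-chat-mbpp"               # adapter fine-tuned on MBPP (placeholder)
model_c_name = "llama-2-7b-chat-synthetic-mbpp"     # adapter fine-tuned on the synthetic dataset (placeholder)
active_model_names = [model_a_name, model_b_name, model_c_name]
model_stats = {}                                    # accuracy per model is collected here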
Testing Base Llama2 7B
From this output, the code can be extracted and run using the helper functions above. I calculated two metrics for each model: the accuracy of its generated code on the MBPP test dataset, and the inference time per query. The overall helper function is as follows.
def calculate_mbpp_accuracy(model_name):
    # Load tokenizer and model with QLoRA configuration
    nf4_config = transformers.BitsAndBytesConfig(
        load_in_4bit = True,
        bnb_4bit_quant_type = "nf4",
        bnb_4bit_use_double_quant = True,
        bnb_4bit_compute_dtype = torch.bfloat16
    )
    # Load base model
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config = nf4_config,
        device_map = {"": 0},
        use_cache = True
    )
    model.config.pretraining_tp = 1
    # Load LLaMA tokenizer (the base model's tokenizer is reused for every variant)
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_a_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    # Load results checkpoint
    results_checkpoint = {}
    if os.path.exists("data/tmp/results_checkpoint.json"):
        with open("data/tmp/results_checkpoint.json", "r") as infile:
            results_checkpoint = json.load(infile)
    # Placeholders for accuracy calculation
    total_samples = len(test_dataset)
    correct_predictions = 0
    total_time = 0
    # Initialize tqdm bar
    progress_bar = tqdm.tqdm(enumerate(test_dataset), total=total_samples, desc="Processing")
    for idx, sample in progress_bar:
        ID = str(sample['task_id'])
        # Create the Task ID entry in the results checkpoint in case it is not available
        if ID not in results_checkpoint:
            results_checkpoint[ID] = {}
        # Compute the result for this model on this task only if it is not checkpointed yet
        if model_name not in results_checkpoint[ID]:
            # Initial output token limit
            max_new_tokens = 50
            # Placeholder for input
            input = None
            results_checkpoint[ID][model_name] = {
                "execution_time": 0
            }
            # Gradually increase the token limit to 500 in steps of 50 so that
            # generation can terminate early once the end of the code is reached.
            while max_new_tokens <= 500:
                if input is None:
                    sentence = [sample['prompt']]
                else:
                    # Use the previous output as input
                    sentence = [output]
                # Start timer
                start_time = time.time()
                # Tokenize input, pass it through the LLM and decode output
                input = tokenizer(sentence, return_tensors="pt", padding=True).to(model.device)
                output_sequences = model.generate(**input, max_new_tokens=max_new_tokens, do_sample=True, top_p=0.9)
                output = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)[0]
                # End timer
                end_time = time.time()
                # Convert to milliseconds
                execution_time = (end_time - start_time) * 1000
                results_checkpoint[ID][model_name] = {
                    "output": output,
                    "execution_time": execution_time + results_checkpoint[ID][model_name]["execution_time"]
                }
                try:
                    code = extract_text_between_tags(output, r"\[PYTHON\]", r"\[/PYTHON\]")[1]
                    # Break the loop if code extraction is successful
                    break
                except:
                    code = ""
                # Increase token limit for the next iteration
                max_new_tokens += 50
            else:
                # The max token count is reached and no code is available yet
                code = ""
            # Append the test statements after the code generated by the LLM
            for test in sample['test_list']:
                code = code + f"\n{test}"
            # Add the code with tests and the run status to the results checkpoint
            results_checkpoint[ID][model_name]["code_with_test"] = code
            results_checkpoint[ID][model_name]["run_status"] = run_python_code(code)
            # Save the results checkpoint
            with open("data/tmp/results_checkpoint.json", "w") as outfile:
                json.dump(results_checkpoint, outfile)
        # Update the running accuracy with the current sample
        if results_checkpoint[ID][model_name]["run_status"]:
            correct_predictions += 1
        accuracy = correct_predictions / (idx + 1)
        # Update progress bar description
        progress_bar.set_description(f"Processing | Sample: {idx + 1}/{total_samples} | Accuracy: {accuracy:.2%}")
        # Update the total runtime for the LLM
        total_time = total_time + results_checkpoint[ID][model_name]["execution_time"]
    # Clear memory
    tokenizer = None
    model = None
    gc.collect()
    torch.cuda.empty_cache()
    # Add the final result to the results checkpoint
    if "FINAL_RESULT" not in results_checkpoint:
        results_checkpoint["FINAL_RESULT"] = {}
    results_checkpoint["FINAL_RESULT"][model_name] = {
        "ACCURACY": accuracy,
        "TOTAL_TIME": total_time,
        "AVERAGE_TIME": total_time / total_samples
    }
    # Save the results checkpoint again, now including the final result
    with open("data/tmp/results_checkpoint.json", "w") as outfile:
        json.dump(results_checkpoint, outfile)
    return correct_predictions / total_samples
With this helper function, the accuracy of the base Llama2 7B Chat model is calculated on the MBPP testing dataset.
if model_a_name in active_model_names:
    # Calculate accuracy of Llama 2 7B Chat on MBPP Testing dataset.
    accuracy = calculate_mbpp_accuracy(model_a_name)
    print(f"Model A Accuracy: {accuracy}")
    model_stats[model_a_name] = {
        "ACCURACY": accuracy
    }
It has an accuracy of 9.4% and a median generation time of 17.31 seconds on an NVIDIA T4 GPU. If only correct answers are considered, the median generation time is 15.79 seconds.
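I report medians rather than means; a minimal sketch of how these medians can be computed from the saved checkpoint (using the fields written by calculate_mbpp_accuracy above) would be:
# Sketch: compute median generation times from the results checkpoint
import json
import statistics

with open("data/tmp/results_checkpoint.json", "r") as infile:
    results = json.load(infile)

all_times = [entry[model_a_name]["execution_time"] / 1000            # ms -> s
             for task_id, entry in results.items()
             if task_id != "FINAL_RESULT" and model_a_name in entry]
correct_times = [entry[model_a_name]["execution_time"] / 1000
                 for task_id, entry in results.items()
                 if task_id != "FINAL_RESULT" and model_a_name in entry
                 and entry[model_a_name]["run_status"]]

print(f"Median generation time (all answers): {statistics.median(all_times):.2f} s")
print(f"Median generation time (correct answers): {statistics.median(correct_times):.2f} s")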
Finetuning Llama2 7B on MBPP
Since I plan to fine-tune further models later using datasets similar to MBPP, I wrap the fine-tuning of Llama2 7B Chat into a function. The base Llama2 7B Chat model is loaded with 4-bit quantization, a LoRA configuration is set up for the fine-tuning, and a supervised fine-tuning trainer is created with the base model, the LoRA configuration and the training dataset. The model is then fine-tuned for 3 epochs.
def finetune_llama_for_mbpp(dataset, output_name):
    # Load tokenizer and model with QLoRA configuration
    nf4_config = transformers.BitsAndBytesConfig(
        load_in_4bit = True,
        bnb_4bit_quant_type = "nf4",
        bnb_4bit_use_double_quant = True,
        bnb_4bit_compute_dtype = torch.bfloat16
    )
    # Load base model
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_a_name,
        quantization_config = nf4_config,
        device_map = {"": 0}
    )
    model.config.use_cache = False
    model.config.pretraining_tp = 1
    # Load LLaMA tokenizer
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_a_name, trust_remote_code=True)
    tokenizer.pad_token = "[PAD]"
    tokenizer.padding_side = "left"
    # Load LoRA configuration
    peft_config = peft.LoraConfig(
        lora_alpha = 16,
        lora_dropout = 0.1,
        r = 64,
        bias = "none",
        task_type = "CAUSAL_LM",
    )
    # Set training parameters
    training_arguments = transformers.TrainingArguments(
        output_dir = "./results",
        num_train_epochs = 3,
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 1,
        gradient_checkpointing = True,
        optim = "paged_adamw_32bit",
        save_steps = 0,
        logging_steps = 20,
        learning_rate = 2e-4,
        weight_decay = 0.001,
        fp16 = True,
        bf16 = False,
        max_grad_norm = 0.3,
        max_steps = -1,
        warmup_ratio = 0.3,
        group_by_length = True,
        lr_scheduler_type = "cosine",
        report_to = "tensorboard"
    )
    # Set supervised fine-tuning parameters
    trainer = trl.SFTTrainer(
        model = model,
        train_dataset = dataset,
        peft_config = peft_config,
        dataset_text_field = "prompt",
        max_seq_length = 1000,
        tokenizer = tokenizer,
        args = training_arguments,
        packing = False,
    )
    # Train model
    trainer.train()
    trainer.model.save_pretrained(output_name)
    # Clear the memory
    model = None
    tokenizer = None
    trainer = None
    gc.collect()
    gc.collect()
    torch.cuda.empty_cache()
The model is fine-tuned as follows.
if model_b_name in active_model_names:
    # Finetune Llama 2 7B Chat using training dataset from MBPP
    finetune_llama_for_mbpp(train_dataset, model_b_name)
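As an aside, calculate_mbpp_accuracy loads the saved adapter directory directly. If a standalone merged checkpoint were needed instead, a sketch like the following could be used (assuming model_b_name is the adapter directory written above; this is not part of the original workflow):
# Optional sketch: merge the LoRA adapter into the base weights
base_model = transformers.AutoModelForCausalLM.from_pretrained(
    model_a_name,
    torch_dtype = torch.float16,
    device_map = {"": 0}
)
merged_model = peft.PeftModel.from_pretrained(base_model, model_b_name).merge_and_unload()
merged_model.save_pretrained(model_b_name + "-merged")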
Testing Finetuned Llama2 7B on MBPP dataset
And finally the model is tested as follows.
if model_b_name in active_model_names:
    # Calculate accuracy of Llama 2 7B Chat finetuned with MBPP training dataset
    # on MBPP Testing dataset.
    accuracy = calculate_mbpp_accuracy(model_b_name)
    print(f"Model B Accuracy: {accuracy}")
    model_stats[model_b_name] = {
        "ACCURACY": accuracy
    }
It has an accuracy of 13.8% and a median generation time of 19.1 seconds on an NVIDIA T4 GPU. If only correct answers are considered, the median generation time is 12.42 seconds. Next, I fine-tune Llama2 with a synthetic dataset to see how well it performs.
Finetune Llama2 7B on Synthetic MBPP dataset.
Synthetic Dataset Generation
Synthetic data generation, powered by a Llama model served through an AWS API, allows us to build a diverse set of training data without relying on manually curated datasets like MBPP, which are time-consuming to create and difficult to scale.
import requests
import json
import re

AWS_API_TOKEN = "XXX"

def llama_generate(prompt,
                   max_gen_len = 512,
                   temperature = 0.5,
                   top_p = 0.9):
    url = 'https://6xtdhvodk2.execute-api.us-west-2.amazonaws.com/dsa_llm/generate'
    body = {
        "prompt": prompt,
        "max_gen_len": max_gen_len,
        "temperature": temperature,
        "top_p": top_p,
        "api_token": AWS_API_TOKEN
    }
    res = requests.post(url, json = body)
    return res
The helper function above generates a response for a given prompt.
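For illustration, a call looks like this (the token above is a placeholder, so treat this as a sketch rather than something you can run as-is); the ['body']['generation'] field of the JSON response is what the rest of this post parses:
# Illustrative call (requires a valid AWS_API_TOKEN)
res = llama_generate("<s>[INST] Write a one-line python program that prints hello world. [/INST]")
print(json.loads(res.text)['body']['generation'])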
For the generation strategy, I borrowed some ideas from the Code Llama paper.
The first step in creating a synthetic dataset is generating a list of Python programming topics. Using one-shot prompting through the AWS Llama API, I prompted the model to generate 20 beginner topics related to Python programming.
# Sample prompt to generate topics using one shot prompting.
prompt = \
"""
<s> [INST] Write me 20 beginner topics for python programming which I can test. Each topic must be between [T] and [/T] tags [/INST]
1. [T]Variables[/T]
2. [T]"""
res = llama_generate(prompt, 2048)
However, these results may vary on each API call. Therefore, while generating the synthetic dataset, I saved intermediate results in synthetic dataset checkpoints; if a corresponding checkpoint exists, the AWS API is not called for that part.
Here, the first synthetic dataset checkpoint is loaded. If it is not available, the relevant Python topics are extracted from the API response.
if os.path.exists("data/tmp/synthetic_dataset/checkpoint/synthetic_dataset_checkpoint_1.json"):
# Load the current synthetic dataset state from checkpoint
with open("data/tmp/synthetic_dataset/checkpoint/synthetic_dataset_checkpoint_1.json", "r") as infile:
synthetic_dataset = json.load(infile)
else:
# Load topics from API response by extracting texts between [T] and [/T]
synthetic_dataset = {
"topics" : extract_text_between_tags(prompt + json.loads(res.text)['body']['generation'], r"\[T\]", r"\[/T\]")[1:]
}
# Save current synthetic dataset checkpoint
if not os.path.exists("data/tmp/synthetic_dataset/checkpoint"):
os.makedirs("data/tmp/synthetic_dataset/checkpoint")
with open("data/tmp/synthetic_dataset/checkpoint/synthetic_dataset_checkpoint_1.json", "w") as outfile:
json.dump(synthetic_dataset, outfile)
Now, questions are generated for the topics chosen above. To enable two-shot prompting for question generation, two questions are manually written for each topic. Along with this, a MINIMUM_NUMBER_OF_QUESTIONS attribute is added to each topic to specify at least how many questions must be generated for that topic.
if os.path.exists("data/tmp/synthetic_dataset/checkpoint/synthetic_dataset_checkpoint_2.json"):
# Load the current synthetic dataset state from checkpoint
with open("data/tmp/synthetic_dataset/checkpoint/synthetic_dataset_checkpoint_2.json", "r") as infile:
synthetic_dataset = json.load(infile)
else:
# Manually write 2 questions for each topic in synthetic dataset. Add a minimum number of question for each topic.
synthetic_dataset['Variables'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 50,
"QUESTIONS" : ["Write a Python program to declare one variable and assign value.", "Write a Python program to declare two variables and assign values"]
}
synthetic_dataset['Data Types'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 60,
"QUESTIONS" : ["Write a Python program to declare variables of int data type.", "Write a Python program to declare variables of float data type."]
}
synthetic_dataset['Operators'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 100,
"QUESTIONS" : ["Write a Python program using basic arithmetic addition operator.", "Write a Python program using basic arithmetic subtraction operator"]
}
synthetic_dataset['Control Structures'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 100,
"QUESTIONS" : ["Write a Python program using if statement.", "Write a Python program using if-else statement"]
}
synthetic_dataset['Functions'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 100,
"QUESTIONS" : ["Write a Python program defining a function to add two numbers.", "Write a Python program defining a function to subtract two numbers"]
}
synthetic_dataset['Modules'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 60,
"QUESTIONS" : ["Write a Python program that imports Open CV library.", "Write a Python program that imports numpy library."]
}
synthetic_dataset['Lists'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 60,
"QUESTIONS" : ["Write a Python program to create a list of int.", "Write a Python program that creates a list of float."]
}
synthetic_dataset['Tuples'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 60,
"QUESTIONS" : ["Write a Python program to create a tuple of int.", "Write a Python program that creates a tuple of float."]
}
synthetic_dataset['Dictionaries'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 60,
"QUESTIONS" : ["Write a Python program to create a dictionary of int.", "Write a Python program to create a dictionary of float."]
}
synthetic_dataset['Sets'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 60,
"QUESTIONS" : ["Write a Python program to create a set of int.", "Write a Python program to create a set of float."]
}
synthetic_dataset['String Manipulation'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 100,
"QUESTIONS" : ["Write a Python program to concatenate two strings", "Write a Python program to replace one string with other"]
}
synthetic_dataset['File Input/Output'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 60,
"QUESTIONS" : ["Write a Python program to import a text file.", "Write a Python program to import a csv file."]
}
synthetic_dataset['Exception Handling'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 60,
"QUESTIONS" : ["Write a Python program to handle division by zero error", "Write a Python program to handle import error"]
}
synthetic_dataset['Regular Expressions'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 100,
"QUESTIONS" : ["Write a python program to find 'a' in string using regex.", "Write a python program to find 'hello' in string using regex."]
}
synthetic_dataset['Object-Oriented Programming'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 5,
"QUESTIONS" : ["Write a Python program to create a class for student.", "Write a Python program to creates a hello world function."]
}
synthetic_dataset['Class and Objects'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 5,
"QUESTIONS" : ["Write a Python program to create a complex number object.", "Write a Python program to create a date object."]
}
synthetic_dataset['Inheritance'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 15,
"QUESTIONS" : ["Write a Python program demonstrating inheritance by creating a parent class and a child class inheriting from it.",
"Write a Python program demonstrating inheritance by creating a parent class and two child classes inheriting from it."]
}
synthetic_dataset['Polymorphism'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 5,
"QUESTIONS" : ["Write a Python program demonstrating polymorphism with a base class and two subclasses.",
"Write a Python program demonstrating polymorphism with a base class and one subclass."]
}
synthetic_dataset['Encapsulation'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 15,
"QUESTIONS" : ["Write a Python program demonstrating encapsulation by defining a class with private attributes.",
"Write a Python program demonstrating encapsulation by defining a class with public attributes."]
}
synthetic_dataset['Decorators'] = {
"MINIMUM_NUMBER_OF_QUESTIONS" : 15,
"QUESTIONS" : ["Write a Python program demonstrating the use of decorators to modify the behavior of a function.",
"Write a Python program demonstrating the creation and application of a decorator to print the execution time of a function."]
}
# Save current synthetic dataset checkpoint.
if not os.path.exists("data/tmp/synthetic_dataset/checkpoint"):
os.makedirs("data/tmp/synthetic_dataset/checkpoint")
with open("data/tmp/synthetic_dataset/checkpoint/synthetic_dataset_checkpoint_2.json", "w") as outfile:
json.dump(synthetic_dataset, outfile)
At this stage, the current state of the synthetic dataset is used to build a prompt, and the AWS API is called to inspect the response.
# Sample prompt to generate synthetic questions using two shot prompting.
prompt = f"""
<s>[INST] Generate {synthetic_dataset["Variables"]["MINIMUM_NUMBER_OF_QUESTIONS"]} python programming questions only on the topic of Variables. Each question must be inside [QUESTION] and [/QUESTION] tags.[/INST]
1. [QUESTION]{synthetic_dataset["Variables"]["QUESTIONS"][0]}[/QUESTION]
2. [QUESTION]{synthetic_dataset["Variables"]["QUESTIONS"][1]}[/QUESTION]
3. [QUESTION]"""
res = llama_generate(prompt, 1000, top_p = 0.9)
Because the questions are generated as expected, this is wrapped into a function that generates at least MINIMUM_NUMBER_OF_QUESTIONS questions for each topic.
def synthetic_questions_generator(MAX_API_LIMIT = 1):
    for topic in synthetic_dataset["topics"]:
        print(f"-- Generating for topic: {topic}")
        print(f"-- Minimum required questions: {synthetic_dataset[topic]['MINIMUM_NUMBER_OF_QUESTIONS']}")
        MAX_API_CALL_LIMIT = MAX_API_LIMIT
        API_CALLS = 0
        synthetic_questions = set(synthetic_dataset[topic]["QUESTIONS"])
        # Generate base prompt for a topic
        prompt = f"""
<s>[INST] Generate {synthetic_dataset[topic]["MINIMUM_NUMBER_OF_QUESTIONS"]} python programming questions only on the topic of {topic}. Each question must be inside [QUESTION] and [/QUESTION] tags.[/INST]
1. [QUESTION]{synthetic_dataset[topic]["QUESTIONS"][-1]}[/QUESTION]
2. [QUESTION]{synthetic_dataset[topic]["QUESTIONS"][-2]}[/QUESTION]
3. [QUESTION]"""
        while len(synthetic_questions) < synthetic_dataset[topic]["MINIMUM_NUMBER_OF_QUESTIONS"] and API_CALLS < MAX_API_CALL_LIMIT:
            print(f"---- Current count of synthetic questions: {len(synthetic_questions)}")
            res = llama_generate(prompt, 1000, top_p = 0.9)
            API_CALLS = API_CALLS + 1
            print(f"---- Call: {API_CALLS}")
            try:
                # Add new questions from AWS API.
                synthetic_questions.update(extract_text_between_tags(prompt + json.loads(res.text)['body']['generation'], r"\[QUESTION\]", r"\[/QUESTION\]")[1:])
                synthetic_dataset[topic]["QUESTIONS"] = list(synthetic_questions)
                prompt = prompt + json.loads(res.text)['body']['generation']
            except:
                print(res.text)
                print("---- Failed API Call: Refreshing prompt")
                # Refresh prompt with latest questions
                prompt = f"""
<s>[INST] Generate {synthetic_dataset[topic]["MINIMUM_NUMBER_OF_QUESTIONS"]} python programming questions on the topic '{topic}'. Each question must be inside [QUESTION] and [/QUESTION] tags.[/INST]
1. [QUESTION]{synthetic_dataset[topic]["QUESTIONS"][-1]}[/QUESTION]
2. [QUESTION]{synthetic_dataset[topic]["QUESTIONS"][-2]}[/QUESTION]
3. [QUESTION]"""
        print(f"---- Current count of synthetic questions: {len(synthetic_questions)}")
if os.path.exists("data/tmp/synthetic_dataset/checkpoint/synthetic_dataset_checkpoint_3.json"):
# Load the current synthetic dataset state from checkpoint
with open("data/tmp/synthetic_dataset/checkpoint/synthetic_dataset_checkpoint_3.json", "r") as infile:
synthetic_dataset = json.load(infile)
else:
# Generate new questions
synthetic_questions_generator(4)
# Save current synthetic dataset checkpoint
if not os.path.exists("data/tmp/synthetic_dataset/checkpoint"):
os.makedirs("data/tmp/synthetic_dataset/checkpoint")
with open("data/tmp/synthetic_dataset/checkpoint/synthetic_dataset_checkpoint_3.json", "w") as outfile:
json.dump(synthetic_dataset, outfile)
A final boilerplate dictionary is created that mirrors the benchmark MBPP dataset, with task_id, text, code and test_list fields. If a checkpoint is available, this part is skipped. The code for each question is not generated in this step.
if os.path.exists("data/tmp/synthetic_dataset/checkpoint/synthetic_dataset_checkpoint_4.json"):
# Load the current synthetic dataset state from checkpoint
with open("data/tmp/synthetic_dataset/checkpoint/synthetic_dataset_checkpoint_4.json", "r") as infile:
synthetic_dataset = json.load(infile)
else:
_ = 0
# Generate a template similar to MBPP dataset.
synthetic_dataset["data"] = []
for topic in synthetic_dataset["topics"]:
for question in synthetic_dataset[topic]["QUESTIONS"]:
data_row = {
"task_id" : _,
"text": question,
"code": None, # Insert code here
"test_list": [""]
}
_ = _ + 1
synthetic_dataset["data"].append(data_row)
# Save current synthetic dataset checkpoint
if not os.path.exists("data/tmp/synthetic_dataset/checkpoint"):
os.makedirs("data/tmp/synthetic_dataset/checkpoint")
with open("data/tmp/synthetic_dataset/checkpoint/synthetic_dataset_checkpoint_4.json", "w") as outfile:
json.dump(synthetic_dataset, outfile)
I sample a random question from the dataset and check the response from the API.
# Sample prompt for python code generation.
prompt = f"""<s>[INST]Write a python code to solve the following coding problem that obeys the constraints and
passes the example test cases. Please wrap your code answer between [PYTHON] and [/PYTHON] tags:
{synthetic_dataset["data"][90]['text']}[/INST]"""
res = llama_generate(prompt, 1000, top_p = 0.9)
print(prompt + json.loads(res.text)['body']['generation'])
<s>[INST]Write a python code to solve the following coding problem that obeys the constraints and
passes the example test cases. Please wrap your code answer between [PYTHON] and [/PYTHON] tags:
How to convert a list to a dictionary in Python?[/INST] Sure! Here's a possible solution to the problem:
[PYTHON]
def list_to_dict(lst):
    """
    Convert a list of key-value pairs to a dictionary.
    Args:
        lst (list): A list of tuples, where each tuple contains a key-value pair.
    Returns:
        dict: A dictionary with the keys and values from the list.
    """
    return dict(lst)
[/PYTHON]
Here's an explanation of the code:
* The `list_to_dict` function takes a list `lst` as input, which contains tuples of key-value pairs.
* The `dict` function is called with the `lst` argument, which converts the list of tuples to a dictionary.
* The `return` statement returns the resulting dictionary.
Here are some example test cases to demonstrate that the code works correctly:
[PYTHON]
print(list_to_dict([("a", 1), ("b", 2), ("c", 3)])) # Output: {'a': 1, 'b': 2, 'c': 3}
print(list_to_dict([])) # Output: {}
print(list_to_dict([("a",), ("b",)])) # Output: {'a': None, 'b': None}
[/PYTHON]
The code passes all three test cases:
* The first test case converts a list of key-value pairs to a dictionary with the correct keys and values.
* The second test case returns an empty dictionary when passed an empty list.
* The third test case returns a dictionary with keys but no values when passed a list of keys without values.
I hope this helps! Let me know if you have any questions or need further clarification.
As proper Python code is observed between the [PYTHON] and [/PYTHON] tags, a wrapper function is created to generate code for every question in the synthetic dataset. Only the relevant Python code is saved as the answer.
def generate_synthetic_answers():
    for data_row in synthetic_dataset["data"]:
        if data_row['task_id'] % 100 == 0:
            print(f"-- Current task: {data_row['task_id']}")
        # Skip a question in case the code already exists.
        if data_row["code"] is not None:
            continue
        # Generate a prompt for the AWS API
        prompt = f"""<s>[INST]Write a python code to solve the following coding problem that obeys the constraints and
passes the example test cases. Please wrap your code answer between [PYTHON] and [/PYTHON] tags:
{data_row["text"]}[/INST]"""
        res = llama_generate(prompt, 1000, top_p = 0.9)
        try:
            code = extract_text_between_tags(json.loads(res.text)['body']['generation'], r"\[PYTHON\]", r"\[/PYTHON\]")[0]
            data_row["code"] = code
            # Save the dataset after each generated answer.
            with open("data/tmp/synthetic_dataset/synthetic_dataset.json", "w") as outfile:
                json.dump(synthetic_dataset, outfile)
        except:
            print(f"-- Error in: {data_row['task_id']} >> {res.text}")
if os.path.exists("data/tmp/synthetic_dataset/synthetic_dataset.json"):
# Load the current synthetic dataset state from checkpoint
with open("data/tmp/synthetic_dataset/synthetic_dataset.json", "r") as infile:
synthetic_dataset = json.load(infile)
generate_synthetic_answers()
This dataset is available at: Hugging Face
The code generated by the AWS Llama API includes a number of comments, print statements and input statements. To make it similar to the MBPP dataset, these parts of the code are cleaned.
# Load the full synthetic dataset
synthetic_dataset = datasets.load_dataset("json", data_files = "data/synthetic_dataset.json")
synthetic_dataset = synthetic_dataset['train']
def clean_synthetic_code(sample):
    # Turn print statements into no-ops (the rest of the line becomes a comment and is stripped below)
    sample['code'] = sample['code'].replace('print', 'pass # print')
    # Comment out assert statements
    sample['code'] = sample['code'].replace('assert', '# assert')
    # Strip comments
    sample['code'] = re.sub(r'#.*', '', sample['code'])
    # Strip triple-quoted docstrings
    sample['code'] = re.sub(r'(\'\'\'(.*?)\'\'\')|(\"\"\"(.*?)\"\"\")', '', sample['code'], flags=re.DOTALL)
    # Replace input() calls with a constant string
    sample['code'] = re.sub(r'input\((?:".*?"|f".*?"|)\)', "'0'", sample['code'])
    return sample
synthetic_dataset = synthetic_dataset.map(clean_synthetic_code)
synthetic_dataset = synthetic_dataset.map(map_mbpp_train_data)
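To sanity-check the cleaning, here is an illustrative before/after on a made-up snippet (not from the dataset):
# Illustrative example: the docstring and print call are removed by the cleaning step
example = {"code": 'def greet(name):\n    """Says hello."""\n    print("Hello", name)\n    return name'}
print(clean_synthetic_code(example)['code'])
# The docstring is stripped and the print call is replaced by a bare `pass`.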
And finally the model is fine-tuned!
if model_c_name in active_model_names:
    finetune_llama_for_mbpp(synthetic_dataset, model_c_name)
Testing Llama2 7B finetuned with Synthetic MBPP Dataset.
And tested!
if model_c_name in active_model_names:
    accuracy = calculate_mbpp_accuracy(model_c_name)
    print(f"Model C Accuracy: {accuracy}")
    model_stats[model_c_name] = {
        "ACCURACY": accuracy
    }
It has an accuracy of 13.4% and a median generation time of 19.11 seconds on an NVIDIA T4 GPU. If only correct answers are considered, the median generation time is 7.21 seconds. In the next blog, I will fine-tune Llama2 with the MBPP and synthetic datasets mixed together to see how well it performs. Until then, see you!