
Utilizing LLM API to Evaluate Mathematical Solutions

Question: many studies on LLMs have explored their capability for ‘reasoning’. Could we use this capability in an educational context - for example, to evaluate mathematical solutions?

This is a wrap-up of a casual research project that proposes a metric for evaluating mathematical solutions based on ‘understanding’, not just correct answers. Even though the research did not end up as a solid official publication, it left some meaningful lessons. I appreciate Prof. Lee of Yonsei University, who provided guidance and instruction on this project.


How could we utilize LLMs better?

We know they are good at reasoning now

In recent years, we have heard a lot about the reasoning capability of LLMs. Not only do they speak and chat like humans, but they have also started to give reasonable outcomes. I was excited to see that many studies have found LLMs capable of solid reasoning, and many suggest prompting techniques that can leverage it better. Among them, I was interested in approaches to solving mathematical problems with LLMs. Some examples include He-Yueya et al., Solving Math Word Problems by Combining Language Models With Symbolic Solvers, and Imani et al., MathPrompter: Mathematical Reasoning using Large Language Models. Many of these studies suggest that 1) LLMs are capable of solving mathematical problems, and 2) proper prompting with techniques such as Chain-of-Thought can maximize the reasoning capability of LLMs.

Solving math problems with an LLM seemed awesome, but I started to have some questions.

  1. If a solution only gives the correct answer, can we say it is a proper (or good) solution in math?
  2. In an educational context, would ‘solving’ problems be the best way to leverage an LLM’s reasoning capability?
  3. Not everyone is capable of training or fine-tuning their own LLMs, because of the huge time and cost constraints of customizing and building your own model. In a highly limited environment in terms of cost, computational power, and time, can we build a meaningful module utilizing LLMs (with publicly available APIs)?

Correct answer is not everything in ‘Education’


Think back to any math course you have taken. You may have had the experience of getting the correct answer to a question but failing to get full points because your logic, approach, or solution was not appropriate.

Since LLMs are capable of reasoning, maybe we can not only solve math problems, but also ‘evaluate’ solutions - not in terms of correct answers, but in terms of the approaches or concepts included in the solution!


Experiment - solve problems, and evaluate the solutions with an LLM

Summary of the process

[Figure: summary of the experiment process]

Full code is in the 🔗Github Repo: https://github.com/ethHong/MSU-Mathematical-Solution-Understanding-of-LLM-Evaluation

In this experiment, we’ll treat LLMs as students who solve the problems - we are not sure how accurate they are at solving them. Therefore, we will:

  1. Prepare a MATH problem dataset (problem - solution pairs). We use the MATH dataset by Hendrycks et al., 2021. The solutions in this dataset are written and validated by humans, so we assume they are correct (the ‘golden’ set). A loading sketch follows this list.
  2. Solve the math problems with LLMs (the students). Here, we use APIs for different levels of LLM models, which are assumed to have different capabilities for solving problems.
    • Since a higher-performing model has better reasoning capability, we expect its solutions to be better.
  3. Evaluate the solutions generated by the LLMs (students) by comparing them to the golden solutions - both in terms of correctness of the answer and properness of the solution.
  4. See how evaluation based on ‘properness of the solution’ correlates with evaluation based on correctness of the answer.
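
For reference, below is a minimal sketch of how the problem - solution pairs might be loaded, assuming the released MATH dataset layout of one JSON file per problem with "problem", "level", "type", and "solution" fields. The data/MATH/train path is a placeholder; the actual repo may load the data differently.

import glob
import json

import pandas as pd

DATA_DIR = "data/MATH/train"  # placeholder path to the extracted MATH dataset

records = []
for path in glob.glob(f"{DATA_DIR}/*/*.json"):
    with open(path) as f:
        item = json.load(f)
    records.append(
        {
            "problem": item["problem"],    # problem statement
            "solution": item["solution"],  # human-written 'golden' solution
            "level": item["level"],        # difficulty, "Level 1" ... "Level 5"
            "type": item["type"],          # subject area, e.g. "Algebra"
        }
    )

df = pd.DataFrame(records)
print(df.shape)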

LIMITATIONS OF THIS RESEARCH APPROACH due to time and resource constraints.

  1. In step 3 - checking correctness of the answer - the answers should ideally be evaluated by human experts, but this research uses another LLM automation to check whether the answer is correct. There could be some mistakes.
  2. To evaluate whether our suggested approach (evaluation based on properness of the solution) works well, interviews with or evaluations by experts should ideally be conducted. However, this experiment only looks at the correlation with the correct-answer-based measure, to check that it does not totally deviate from the traditional measure.

Generate math solutions using LLMs

This experiment used three different prompts

  1. Zero-shot prompt, which does not give any context about the problem
  2. Few-shot prompt, which gives one example of a problem - solution pair
  3. Zero-shot CoT (Chain of Thought), which is zero-shot but instructs the model to ‘think step by step’
...

prompt_solution_generation_zeroshot = """
Solve the question below - give solution and get the answer. 

Problem: {}
Solution:
"""
prompt_solution_generation_fewshot = """
Solve the question below - give solution and get the answer. 
#
    
Problem: {}
Solution: {}

#
Problem: {}
Solution:
"""

prompt_solution_generation_zeroshot_COT = """
Solve the question below - give solution and get the answer. 

Problem: {}
Solution: Let's think step by step.
"""

...

After that, we define a function which takes an input prompt and sends it to an LLM by calling the API. Here, I use the OpenAI API with different variations of GPT models.

def gpt(input_prompt, model="text-davinci-003"):
    """Generate a solution to the question based on the input prompt"""
    response = openai.Completion.create(
        model=model,
        prompt=input_prompt,
        temperature=0.7,
        max_tokens=256,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response["choices"][0]["text"]

### If you are using ChatGPT API (Chat completion API): 
def chatgpt(input_prompt):
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=[{"role": "user", "content": input_prompt}]
    )

    return completion["choices"][0]["message"]["content"]
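
For example, a single problem could be run through each prompt like this (a quick usage sketch; the problem text is made up, and it assumes openai is imported and the API key is set as shown later in the post):

# Hypothetical problem, just for illustration
problem = "What is the sum of the first 10 positive integers?"

# Zero-shot: plug the problem into the template and call the completion model
zeroshot_solution = gpt(prompt_solution_generation_zeroshot.format(problem))

# Zero-shot CoT: same call, but the template appends "Let's think step by step."
cot_solution = gpt(prompt_solution_generation_zeroshot_COT.format(problem))

# ChatGPT variant goes through the chat-completion wrapper instead
chat_solution = chatgpt(prompt_solution_generation_zeroshot.format(problem))

print(zeroshot_solution)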
    

The dataset includes problems of difficulty levels 1 to 5. Below is an example of the results from each prompt.

[Figure: example solutions generated with each prompt]

We can formulate this as follows:

  • Begin with a problem set defined as $X = (x_1, x_2, \ldots, x_n)$.
  • For each problem, the corresponding human-generated solutions are represented as $S_h = (s_{h_1}, s_{h_2}, \ldots, s_{h_n})$.
  • The LLM-generated solutions are obtained by prompting the model on each problem: $S_l = \mathrm{LLM}(prompt_{solving}, X) = (s_{l_1}, s_{l_2}, \ldots, s_{l_n})$.

Evaluate solutions based on concept understanding - Inclusion of concepts.

There might be several different approaches to evaluating math solutions based on the properness of the solution. The approach we took in this research is to check whether the solution includes all the proper concepts. To do this:

  1. Extract notions (concepts, or keywords) included in the solutions.
  2. Compare the set of notions in the ‘golden solution’ with the set in the students’ solutions (LLMs, in this case).

Below is the prompt we used to extract notions, and an example of what the extracted notions look like.

prompt_notion_extraction = """
From a solution of mathematical problem, extract general mathematical concepts required in given solution.
- Only list up mathematical knowledge or concepts
- Use comma seperation

#
concepts: concept1, concept2, concept3 ...

#
solution: {}
concepts:
"""

[Figure: example of extracted notions]

How do we compare ‘properness’ of notion inclusion?

Here, we first embed the texts (extracted keywords) and use cosine similarity to compare the distances between the extracted terms. However, each solution may require a different number of total concepts. For example, one question may need to include ‘derivative’, ‘system of equations’, and ‘coefficient’, while a simpler question may only require one concept, like ‘addition’. Therefore, the complexity of the problem itself, or the number of concepts required to solve it, may influence the evaluation - because depending on how many concepts need to be included, making a mistake on just one can have a different impact on the overall score.

To alleviate this, we used a trick: to evaluate the semantic similarity between two sets of notions $N_g$ and $N_h$, we use the function evaluate_for_question below.

### Evaluate scores
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cos_sim(A, B):
    return dot(A, B) / (norm(A) * norm(B))

def get_embedding(text):
    out = openai.Embedding.create(input=str(text), engine="text-embedding-ada-002")[
        "data"
    ][0]["embedding"]
    return out

def compare(a, b):
    return cos_sim(get_embedding(a), get_embedding(b))

def evaluate_for_question(N_g, N_h):  # <-- This part!!
    score = []
    for n_g in N_g:
        # Each notion in N_g is matched to its most similar notion in N_h
        score.append(max([compare(n_g, i) for i in N_h]))
    return (np.mean(score), score)
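
As a quick illustration of how this function behaves (the notion lists below are made up; running it calls the embedding API, so the key must be set):

# Made-up notion sets, only to illustrate the scoring behaviour
N_g = ["derivative", "chain rule", "trigonometric functions"]  # notions from one solution
N_h = ["differentiation", "chain rule"]                        # notions from the other solution

mean_score, per_notion_scores = evaluate_for_question(N_g, N_h)
print(mean_score)         # average of the per-notion best matches: the MSU score for this question
print(per_notion_scores)  # one max-similarity score for each notion in N_g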

Which means, each notion extracted from the solution is scored based on its maximum similarity when compared against all notions included in the ‘golden solution’. Finally, we take the average of these per-notion scores. This is our final MSU score, which evaluates a solution based on the proper inclusion of concepts, regardless of the answer.
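
In formula form (my own restatement of the code above, where $e(\cdot)$ is the text-embedding function and $N_g$, $N_h$ are the two notion sets):

$$\mathrm{MSU}(N_g, N_h) = \frac{1}{|N_g|} \sum_{n \in N_g} \max_{m \in N_h} \cos\big(e(n),\, e(m)\big)$$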

Finally, compare answer accuracy and MSU score.

Now we have MSU scores for the LLM-generated (student) solutions. We also have to get the answer accuracy of the solutions.

If the MSU evaluation is valid enough, it should broadly correspond to answer accuracy. If so, the MSU score may not be THE only measure for evaluating solutions, but it could work as a supportive measure used alongside answer accuracy - e.g., if a student gets a high accuracy score but a poor MSU score, we might suspect that the solution or the process used to get the answer is not good enough.

It would be great if we could have math experts check whether the LLM-generated solutions also arrived at the correct answers, but due to limitations of time and resources, we decided to give this job to another LLM module.

First, load the output data, and define the prompt and model to evaluate whether the answer is correct.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import openai
import time

f = open("api_key.txt", "r") #Please load your API Keyhere
openai.api_key = f.readline()
new_df = pd.read_csv("output/scoring_output.csv") #new_df has all the outcomes: hunam solution, LLM generated solution and MSU score results

def model(input_prompt, max_length = 256, model="text-davinci-003"):
    response = openai.Completion.create(
        model=model,
        prompt=input_prompt,
        temperature=0.7,
        max_tokens=max_length,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response["choices"][0]["text"]
  
  
# Evaluation prompt engineering: this prompt evaluates whether the answer is correct or not
def evaluate(solution_original, solution_candidate):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
                {"role": "system", "content": """You are validator of mathematical solutions. I will give you original solution, and a candidate solution. 
                Please evaluate if the answer is correct or not.
                """},
                {"role": "assistant", "content": "Okay, I'll evaluate of the answer is correct or not. I'll only say True or False. "},
                {"role": "user", "content": "This is the origianl solution: {}".format(solution_original)},
                {"role": "user", "content": "This is the candidate solution: {}".format(solution_candidate)},
                {"role": "user", "content": "Is answer of candidate solution correct? Tuplize your response: (True/False)"}
                
        ]
    
    )
    return response["choices"][0]["message"]["content"]

Next, for each of the LLM-generated solutions, check whether they got the ‘answer’ right. To make it easier to process, I cleansed and transformed the ‘True / False’ output from the LLM into a bool type.

colnames = [i.split("_notion_SCORE")[0] for i in new_df.columns if i.split("_")[-1]=="SCORE"]

for col in tqdm(colnames):
    print("processing {}...".format(col))
    Accurate = []
    for (solution_original, solution_candidate) in zip(tqdm(new_df["solution"].values), new_df[col].values):
        try:
            result = evaluate(solution_original, solution_candidate)
        except:
            # If the API call fails (e.g., rate limit), wait a minute and retry once
            time.sleep(60)
            result = evaluate(solution_original, solution_candidate)

        Accurate.append(result)
    new_df[col + "_accurate"] = Accurate

for col in tqdm(colnames):
    print("processing {}...".format(col))
    new_df[col + "_accurate"] = new_df[col + "_accurate"].apply(lambda x : True if "True" in x else False)
    
#Plotting
models = ["davinci2", "davinci3", "chatgpt"]
prompts = ["zeroshot", "fewshot", "zeroshotCOT"]
from prompting import *

data = {
    "model": [],
    "prompt" : [],
    "score" : [],
    "accuracy" : []
    
}

for m in models:
    for p in prompts:
        grouped_score = "gpt_{}_{}_solution_notion_SCORE".format(m, p)
        grouped_accuracy = "gpt_{}_{}_solution_accurate".format(m, p)
        temp = new_df[[grouped_score, grouped_accuracy]]
        
        score = temp[grouped_score].mean()
        accuracy = temp[grouped_accuracy].sum()/temp.shape[0]
        
        data["model"].append(m)
        data["prompt"].append(p)
        data["score"].append(score)
        data["accuracy"].append(accuracy)
        
styles = {'zeroshot': 'o', 'fewshot': 's', 'zeroshotCOT': 'v'}
sns.scatterplot(data=data, x="score", y="accuracy", hue="model", style="prompt", markers=styles, s=100)
plt.title("Evaluation score and accuracy of answers")
plt.savefig("output/stats.pdf")

How was the result?

[Figure: evaluation score and accuracy of answers, by model and prompt]

The results revealed a tendency for solutions generated by larger-scale, higher-performance models to exhibit a higher level of accordance with the human-generated solutions. In other words, the higher-performing models (~ students with better mathematical reasoning capability) showed higher MSU scores while also achieving higher answer accuracy. The correlation between MSU score and accuracy was 0.79 (Pearson correlation coefficient).

Still, there are lots of limitations…

This experiment relied heavily on publicly available LLM APIs. Concept extraction and similarity computation were not optimized. Also, it only conducted experiments with OpenAI LLMs, which are basically all different versions of the same model family - ‘GPT’. There were also numerous naive choices, such as using an LLM prompt to evaluate the correctness of answers. I believe these were critical weak points for a ‘solid academic research’, and the reason why it failed to get accepted after being submitted for publication.

But there are some implications, I believe!

However, I believe this research gave me some insight into the third question I raised at the very beginning of this post: in a highly limited environment in terms of cost, computational power, and time, can we build a meaningful module utilizing LLMs (with publicly available APIs)? I believe LLMs are not for ‘everyone’ these days - I mean, ‘owning and building LLMs’ is not for everyone.

While working as a full-time Product Manager at IT software companies for 3+ years, I witnessed many industry players stepping away from building their own models and instead choosing to optimize and utilize API calls from OpenAI, Google, or Azure. As long as AI keeps moving toward the ‘big-model’ game, I believe finding the best ways to utilize LLM APIs will become more and more important.

From this research, despite its many limitations, I learned that making an effort to prompt better and to control the outcomes of an LLM can produce results that go along with our intuition and traditional measures.