Harnessing LLMs for Efficient DataFrame Processing with Pandas
Chapter 1: Introduction to LLM Integration
In today's tech landscape, accessing various large language models (LLMs) through web interfaces or public APIs has become straightforward. But the question arises: can we effectively weave LLMs into our data analysis workflows using Python or Jupyter Notebooks? The answer is a resounding yes! In this article, I will illustrate three distinct methods to achieve this integration. As always, the tools and resources discussed will be freely accessible.
Let’s dive in!
Section 1.1: Utilizing Pandas AI
The first library we will explore is Pandas AI, which enables natural language queries on a Pandas DataFrame. For demonstration, I constructed a simple DataFrame containing European countries along with their populations:
import pandas as pd
df = pd.DataFrame({
"Country": ['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland',
'France', 'Germany', 'Greece', 'Hungary', 'Iceland', 'Ireland', 'Italy', 'Latvia', 'Liechtenstein', 'Lithuania',
'Luxembourg', 'Malta', 'Monaco', 'Montenegro', 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Serbia',
'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland'],
"Population": [8205000, 10403000, 7148785, 4491000, 1102677, 10476000, 5484000, 1291170, 5244000,
64768389, 82369000, 11000000, 9930000, 308910, 4622917, 58145000, 2217969, 35000, 3565000,
497538, 403000, 32965, 666730, 16645000, 4907000, 38500000, 10676000, 21959278, 7344847,
5455000, 2007000, 46505963, 9045000, 7581000]
})
df.to_csv('data.csv', index=False)
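Before involving any model, it is useful to have a ground truth to check the LLM's answer against. Plain pandas answers the "highest population" question in one line with idxmax; the sketch below uses a three-country subset of the DataFrame for brevity:

```python
import pandas as pd

# Small subset of the article's DataFrame, enough to illustrate the pattern
df_small = pd.DataFrame({
    "Country": ["Germany", "France", "Malta"],
    "Population": [82369000, 64768389, 403000],
})

# idxmax() returns the index label of the largest value;
# .loc then pulls the whole row for that label
top = df_small.loc[df_small["Population"].idxmax()]
print(top["Country"])  # Germany
```

Whatever code the LLM generates should reduce to something equivalent to this.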
Before we start using Pandas AI, we need to instantiate the LLM:
from pandasai.llm.local_llm import LocalLLM
from pandasai.llm import OpenAI
# Local LLM
pandas_llm = LocalLLM(api_base="http://localhost:8000/v1")
# OR
# OpenAI
pandas_llm = OpenAI(api_token="...")
You have two options here. If you have an OpenAI API key, the OpenAI class gives faster and better results, but the service is paid. Alternatively, a LocalLLM instance is a viable option if you have an OpenAI-compatible server running (for example, via Llama-CPP). For this example, I will run a CodeLlama-13B-Instruct model locally.
Next, we can pose a question to the LLM regarding our DataFrame. Let’s start with a simple query:
import logging
from pandasai import SmartDataframe
logging.basicConfig(level=logging.DEBUG,
format="[%(levelname)s] [%(asctime)-15s] %(message)s")
sdf = SmartDataframe(df, config={"llm": pandas_llm})
sdf.chat("Find a country with the highest population.")
By enabling logging, we can observe the internal workings of the library. The log shows a complex prompt being crafted, which ultimately aims to generate Python code for our query.
However, when I executed the code using my local LLM, I received no response, only the message: "No code found in the response." This prompt functions well with OpenAI's API but seems a bit too intricate for a 13B model.
Let’s shift to a more advanced method using the Agent class:
from pandasai import Agent

agent = Agent(
    [df],
    config={"llm": pandas_llm},
    description="[INST]Create a Python code and help user to answer the question.[/INST]"
)
query = """You have a dataframe with fields "Country" and "Population",
saved in "data.csv". Find the country with the highest population
in the dataframe using Pandas."""
code = agent.generate_code(query)
agent.execute_code(code)
This refined prompt helps the model better understand its task. After executing the code, I obtained a valid output.
The first video, titled "Natural Language Processing with Pandas DataFrames | SciPy 2021," delves into using Pandas for NLP tasks, showcasing practical applications that align with our discussion.
Section 1.2: Exploring LangChain
The LangChain library also features a dedicated Pandas DataFrame agent capable of executing similar tasks. First, we instantiate a language model:
from langchain_openai import OpenAI as LangChainOpenAI

# A local OpenAI-compatible server does not validate the key,
# so any placeholder value works here
llm = LangChainOpenAI(openai_api_key="12345678",
                      openai_api_base="http://localhost:8000/v1",
                      verbose=True)
With this setup, we can create our agent:
from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain_core.callbacks import StdOutCallbackHandler, BaseCallbackManager
prefix = """[INST]You are a Python expert. Create a Python code and help user to answer the question.[/INST]
You have the following tool:"""
handlers = [StdOutCallbackHandler()]
callback_manager = BaseCallbackManager(handlers)
agent = create_pandas_dataframe_agent(llm,
df,
verbose=True,
agent_executor_kwargs={"handle_parsing_errors": True},
agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
callback_manager=callback_manager,
prefix=prefix)
Now, we’re ready to ask the model a question:
agent.invoke("Write the Python code to calculate total population of all countries.")
This time, the model successfully comprehended the prompt and generated the required code.
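To verify the agent's answer, the same total can be computed directly with pandas. The sketch below uses a three-country subset of the DataFrame rather than the full list:

```python
import pandas as pd

# Subset of the article's DataFrame for brevity
df_small = pd.DataFrame({
    "Country": ["Germany", "France", "Malta"],
    "Population": [82369000, 64768389, 403000],
})

# Series.sum() adds up the entire column
total = df_small["Population"].sum()
print(total)  # 147540389
```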
The second video, "Pandas DataFrame Agent... the future of data analysis?" discusses the potential advancements in data analysis through the use of Pandas and LLMs, complementing our exploration.
Section 1.3: LLM for Text Processing
In our previous examples, we examined the use of agents for generating Python code. Now, let's shift gears and focus on the LLM's inherent capabilities for natural language processing.
For our next example, we will create a DataFrame listing items from a "lost and found" department:
df_goods = pd.DataFrame({
"Item": ["Toshiba laptop", "iPhone 12", "iPhone 14", "Old bicycle",
"Public Transport card", "Pair of gloves", "Kids Helmet",
"Samsung Smartphone", "iPhone 14", "Cap", "Shawl"],
})
display(df_goods)
Assuming we want to categorize these items into "Electronics," "Clothes," "Documents," and "Other," we can leverage LLMs instead of traditional coding methods.
We initiate by loading the Llama 13B model:
from llama_cpp import Llama
llm = Llama(
model_path="llama-2-13b-chat.Q4_K_M.gguf",
n_gpu_layers=-1,
n_ctx=2048,
verbose=True
)
Now, we can formulate our request:
question = "You are a sales expert. You have a list of categories: Electronics, Clothes, Documents, and Other. Write a category for each item."
We construct the prompt, passing the items as comma-separated text:
items_csv = ", ".join(df_goods["Item"])

prompt = f"""[INST]{question}
I will give you a list of items in comma-separated format.
Write the output in JSON format {{NAME}}: {{RESULT}}.
Here is the list: {items_csv}.
Now write the answer.
[/INST]"""
The model categorizes the items correctly and returns well-structured JSON output, which can then be parsed and used for further data processing.
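A minimal sketch of that downstream step is shown below. The response string is a hypothetical example of the requested JSON format, not actual model output:

```python
import json
import pandas as pd

df_goods = pd.DataFrame({
    "Item": ["Toshiba laptop", "Pair of gloves", "Public Transport card"],
})

# Hypothetical model response in the requested {NAME}: {RESULT} JSON format
response = """{
    "Toshiba laptop": "Electronics",
    "Pair of gloves": "Clothes",
    "Public Transport card": "Documents"
}"""

categories = json.loads(response)
# Map each item to its category; anything the model skipped falls back to "Other"
df_goods["Category"] = df_goods["Item"].map(categories).fillna("Other")
print(df_goods)
```

With the categories attached as a regular column, the DataFrame can be grouped, filtered, or sorted with the usual pandas tools.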
Conclusion: The Future of DataFrame Processing
In this article, we explored two primary approaches to processing Pandas DataFrames with large language models. The first employs tools and agents, letting the model generate Python code that is then executed. This is particularly advantageous for data analysis tasks, since it sidesteps the limitations of LLMs around mathematical operations and prompt size.
The second method leverages the LLM's natural language processing capabilities, proving useful for handling unstructured text data. The potential for seamless integration between software libraries and LLMs is promising, as it enables quick experimentation with various queries while witnessing the results in real-time.
While local models may exhibit slower performance, the future of AI assistants in aiding developers and data analysts is undoubtedly bright.
Thank you for reading! If you found this article insightful, consider subscribing to Medium for notifications on future articles and access to a wealth of content from various authors. Feel free to connect with me on LinkedIn, and for the full source code, visit my Patreon page.