
Harnessing LLMs for Efficient DataFrame Processing with Pandas

Chapter 1: Introduction to LLM Integration

In today's tech landscape, accessing various large language models (LLMs) through web interfaces or public APIs has become straightforward. But the question arises: can we effectively weave LLMs into our data analysis workflows using Python or Jupyter Notebooks? The answer is a resounding yes! In this article, I will illustrate three distinct methods to achieve this integration. As always, the tools and resources discussed will be freely accessible.

Let’s dive in!

Section 1.1: Utilizing Pandas AI

The first library we will explore is Pandas AI, which enables natural language queries on a Pandas DataFrame. For demonstration, I constructed a simple DataFrame containing European countries along with their populations:

import pandas as pd

df = pd.DataFrame({
    "Country": ['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland',
                'France', 'Germany', 'Greece', 'Hungary', 'Iceland', 'Ireland', 'Italy', 'Latvia', 'Liechtenstein', 'Lithuania',
                'Luxembourg', 'Malta', 'Monaco', 'Montenegro', 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Serbia',
                'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland'],
    "Population": [8205000, 10403000, 7148785, 4491000, 1102677, 10476000, 5484000, 1291170, 5244000,
                   64768389, 82369000, 11000000, 9930000, 308910, 4622917, 58145000, 2217969, 35000, 3565000,
                   497538, 403000, 32965, 666730, 16645000, 4907000, 38500000, 10676000, 21959278, 7344847,
                   5455000, 2007000, 46505963, 9045000, 7581000]
})

df.to_csv('data.csv', index=False)

Before we start using Pandas AI, we need to instantiate the LLM:

from pandasai.llm.local_llm import LocalLLM
from pandasai.llm import OpenAI

# Local LLM
pandas_llm = LocalLLM(api_base="http://localhost:8000/v1")

# OR

# OpenAI
pandas_llm = OpenAI(api_token="...")

You have two options here. If you have an OpenAI API key, you can use the OpenAI class for faster and more capable results; however, this service is not free. Alternatively, a LocalLLM instance works with any OpenAI-compatible server (such as Llama-CPP). For this example, I will run a CodeLlama-13B-Instruct model locally.
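For reference, an OpenAI-compatible local endpoint can be started with the llama-cpp-python server; the model filename below is a placeholder for whichever GGUF file you downloaded:

```shell
# Install the server extra and launch an OpenAI-compatible endpoint on port 8000
pip install 'llama-cpp-python[server]'
python -m llama_cpp.server --model codellama-13b-instruct.Q4_K_M.gguf --n_gpu_layers -1
```

The `api_base` passed to LocalLLM above then points at this server's `/v1` route.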

Next, we can pose a question to the LLM regarding our DataFrame. Let’s start with a simple query:

import logging

from pandasai import SmartDataframe

logging.basicConfig(level=logging.DEBUG,
                    format="[%(levelname)s] [%(asctime)-15s] %(message)s")

sdf = SmartDataframe(df, config={"llm": pandas_llm})
sdf.chat("Find a country with the highest population.")

By enabling logging, we can observe the internal workings of the library. The log shows a complex prompt being crafted, which ultimately aims to generate Python code for our query.

However, when I executed the code using my local LLM, I received no response, only the message: "No code found in the response." This prompt functions well with OpenAI's API but seems a bit too intricate for a 13B model.
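For comparison, the code the model is ultimately expected to produce boils down to a plain-pandas one-liner (my own sketch on a small subset of the data, not the library's actual output):

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Austria", "Belgium", "Germany"],
    "Population": [8205000, 10403000, 82369000],
})

# idxmax() returns the row label of the largest population;
# .loc then looks up the country name in that row
top_country = df.loc[df["Population"].idxmax(), "Country"]
print(top_country)  # → Germany
```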

Let’s shift to a more advanced method using the Agent class:

from pandasai import Agent

agent = Agent(
    [df],
    config={"llm": pandas_llm},
    description="[INST]Create a Python code and help user to answer the question.[/INST]"
)

query = """You have a dataframe with fields "Country" and "Population",
saved in "data.csv". Find the country with the highest population
in the dataframe using Pandas."""

code = agent.generate_code(query)
agent.execute_code(code)

This refined prompt helps the model better understand its task. After executing the code, I obtained a valid output.

The first video, titled "Natural Language Processing with Pandas DataFrames | SciPy 2021," delves into using Pandas for NLP tasks, showcasing practical applications that align with our discussion.

Section 1.2: Exploring LangChain

The LangChain library also features a dedicated Pandas DataFrame agent capable of executing similar tasks. First, we instantiate a language model:

from langchain_openai import OpenAI as LangChainOpenAI

llm = LangChainOpenAI(openai_api_key="12345678",
                      openai_api_base="http://localhost:8000/v1",
                      verbose=True)

With this setup, we can create our agent:

from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain_core.callbacks import StdOutCallbackHandler, BaseCallbackManager

prefix = """[INST]You are a Python expert. Create a Python code and help user to answer the question.[/INST].
You have the following tool:"""

handlers = [StdOutCallbackHandler()]
callback_manager = BaseCallbackManager(handlers)

agent = create_pandas_dataframe_agent(llm,
                                      df,
                                      verbose=True,
                                      agent_executor_kwargs={"handle_parsing_errors": True},
                                      agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
                                      callback_manager=callback_manager,
                                      prefix=prefix)

Now, we’re ready to ask the model a question:

agent.invoke("Write the Python code to calculate total population of all countries.")

This time, the model successfully comprehended the prompt and generated the required code.
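As a sanity check, the generated code should be equivalent to a direct pandas aggregation (a sketch with a small subset of the data):

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Austria", "Belgium", "Bulgaria"],
    "Population": [8205000, 10403000, 7148785],
})

# Summing the column gives the total population across all rows
total = int(df["Population"].sum())
print(total)  # → 25756785
```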

The second video, "Pandas DataFrame Agent... the future of data analysis?" discusses the potential advancements in data analysis through the use of Pandas and LLMs, complementing our exploration.

Section 1.3: LLM for Text Processing

In our previous examples, we examined the use of agents for generating Python code. Now, let's shift gears and focus on the LLM's inherent capabilities for natural language processing.

For our next example, we will create a DataFrame listing items from a "lost and found" department:

df_goods = pd.DataFrame({
    "Item": ["Toshiba laptop", "iPhone 12", "iPhone 14", "Old bicycle",
             "Public Transport card", "Pair of gloves", "Kids Helmet",
             "Samsung Smartphone", "iPhone 14", "Cap", "Shawl"],
})
display(df_goods)

Assuming we want to categorize these items into "Electronics," "Clothes," "Documents," and "Other," we can leverage LLMs instead of traditional coding methods.
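For contrast, the traditional coding approach would require a hand-written keyword map, which quickly becomes brittle as new items appear (a minimal sketch; the keyword lists are my own illustration):

```python
# Hand-maintained keyword lists: every new item type needs a rule
CATEGORY_KEYWORDS = {
    "Electronics": ["laptop", "iphone", "smartphone"],
    "Clothes": ["gloves", "cap", "shawl"],
    "Documents": ["card"],
}

def categorize(item: str) -> str:
    """Return the first category whose keyword appears in the item name."""
    lowered = item.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "Other"  # fallback for anything unmatched, e.g. "Old bicycle"

print(categorize("iPhone 14"))    # → Electronics
print(categorize("Old bicycle"))  # → Other
```

An LLM sidesteps this maintenance burden by classifying items from their names directly.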

We initiate by loading the Llama 13B model:

from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-13b-chat.Q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,
    verbose=True
)

Now, we can formulate our request:

question = "You are a sales expert. You have a list of categories: Electronics, Clothes, Documents, and Other. Write a category for each item."

We construct a prompt:

# Convert the items to a comma-separated string for the prompt
items_csv = ", ".join(df_goods["Item"])

prompt = f"""[INST]{question}
I will give you a list of items in comma-separated format.
Write the output in JSON format {{NAME}}: {{RESULT}}.
Here is the list: {items_csv}.
Now write the answer.
[/INST]"""

The model effectively categorizes the items and provides a well-structured JSON output.

Lastly, we can apply this method for further data processing in a streamlined manner.
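To close the loop, the model's JSON answer can be parsed back into the DataFrame; the response string below is a hypothetical example of the `{NAME}: {RESULT}` format the prompt asks for, not actual model output:

```python
import json

import pandas as pd

df_goods = pd.DataFrame({"Item": ["Toshiba laptop", "Pair of gloves", "Old bicycle"]})

# Hypothetical model response following the format requested in the prompt
response = '{"Toshiba laptop": "Electronics", "Pair of gloves": "Clothes", "Old bicycle": "Other"}'

categories = json.loads(response)
# Map each item to its predicted category; items missing from the JSON become NaN
df_goods["Category"] = df_goods["Item"].map(categories)
print(df_goods)
```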

Conclusion: The Future of DataFrame Processing

In this article, we explored two primary methods for processing Pandas DataFrames using large language models. The first method employs the concept of tools and agents, allowing the model to generate Python code that is then executed. This is particularly advantageous for data analysis tasks, helping to navigate the limitations of LLMs concerning mathematical operations and prompt size constraints.

The second method leverages the LLM's natural language processing capabilities, proving useful for handling unstructured text data. The potential for seamless integration between software libraries and LLMs is promising, as it enables quick experimentation with various queries while witnessing the results in real-time.

While local models may exhibit slower performance, the future of AI assistants in aiding developers and data analysts is undoubtedly bright.

Thank you for reading! If you found this article insightful, consider subscribing to Medium for notifications on future articles and access to a wealth of content from various authors. Feel free to connect with me on LinkedIn, and for the full source code, visit my Patreon page.
