In this article, we will explore how large language models (LLMs) can be used to integrate data from different sources and formats. We will discuss the challenges of working with heterogeneous data, present an interesting experiment using LLMs for data integration, and explore the potential benefits and risks of this approach. To follow along, a basic understanding of Python and data manipulation is helpful.

The Challenge of Combining Data from Different Sources

Valuable insights can be generated when data from different formats, types, and sources are combined into a single, unified data structure. However, this process often demands considerable manual effort and time, which can be costly. As a result, valuable insights may be missed because the data remains underutilized.

Large language models (LLMs) have gained significant attention in recent months, raising the question of whether these powerful AI tools can help accelerate, or even fully automate, the integration of data from various sources.

An Interesting Experiment

Imagine having data in a JSON-like format (e.g., a Python dictionary) and similar data in a different format, such as an HTML table from a website. We want to parse the data from the HTML table and combine it with the data in the Python dictionary.

To do this, we can call a large language model from Python, such as OpenAI's GPT-3.5 (the model behind ChatGPT), and provide it with a prompt like this*:

data_json = {"restaurant employees": [
    {"name": "Hans", "email": ""},
    {"name": "Vanessa", "email": ""},
]}

# Table rows inferred from the example output shown below
html_table = """<table>
<caption>Restaurant Employees</caption>
<tr><th>Name</th><th>Email</th></tr>
<tr><td>Shyam</td><td></td></tr>
<tr><td>Bob</td><td></td></tr>
<tr><td>Jai</td><td></td></tr>
</table>"""

prompt = f"""Extract the data from an HTML table and append it to a JSON.
Follow these steps:
1. Extract the data from the HTML table into a JSON of the same format as the provided JSON.
2. Join the two JSONs and return a JSON.

The JSON (in python dict format):
{data_json}

The html table:
{html_table}

Structure your output as follows:

extracted_JSON = <json format>

final_JSON = <json format>
"""

response = get_completion(prompt)

The output of this prompt is:

extracted_JSON = {'restaurant employees': [{'name': 'Shyam', 'email': ''},
{'name': 'Bob', 'email': ''}, {'name': 'Jai', 'email': ''}]}

final_JSON = {'restaurant employees': [{'name': 'Hans', 'email': ''},
{'name': 'Vanessa', 'email': ''}, {'name': 'Shyam', 'email': ''},
{'name': 'Bob', 'email': ''}, {'name': 'Jai', 'email': ''}]}

This experiment suggests that LLMs can understand different data formats and merge them into a single structure without human intervention.

The Magic and the Risks

While this might seem like magic, using this approach in production could be careless, expensive, and even dangerous. Sending raw data to a third party might violate data protection policies, and each data transformation requires computationally expensive LLM calls. Additionally, LLMs sometimes produce unexpected outputs or errors, which could lead to a messy outcome.

Despite these concerns, LLMs' capabilities are impressive. Can we use them in a more reliable and secure way?

A Better Approach: Generating Code with LLMs

Instead of asking the LLM to transform the data directly, we can ask it to provide Python code that performs the transformation for us.

Let's try the following prompt:

prompt = f"""
Write python code that extracts the data from an HTML table and appends
it to a python dict.
It should follow these steps:
1. Extract the data from the HTML table into a python dict of the same format as the provided python dict.
2. Join the two python dicts and return a python dict.

The python dict:
{data_json}

The html table:
{html_table}

Your output should be three python functions:
transform_html_to_dict(html)
join_dicts(dict1, dict2)
transform_and_join(html, dict1)
"""

response = get_completion(prompt)

The LLM will output Python code like this:

Here's the code for the three functions:

from bs4 import BeautifulSoup

def transform_html_to_dict(html):
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    rows = table.find_all('tr')
    headers = [header.text.strip() for header in rows[0].find_all('th')]
    data = []
    for row in rows[1:]:
        values = [value.text.strip() for value in row.find_all('td')]
        data.append(dict(zip(headers, values)))
    return {'restaurant employees': data}

def join_dicts(dict1, dict2):
    for key in dict2:
        if key in dict1:
            dict1[key] = dict1[key] + dict2[key]
        else:
            dict1[key] = dict2[key]
    return dict1

def transform_and_join(html, dict1):
    dict2 = transform_html_to_dict(html)
    return join_dicts(dict1, dict2)

The transform_html_to_dict function takes an HTML string as input, uses BeautifulSoup to parse it, and extracts the data from the table into a Python dict. The resulting dict has the same format as the provided dict.

The join_dicts function takes two dicts as input and merges them into a single dict. If a key exists in both dicts, the values are combined into a single list.

The transform_and_join function combines the previous two: it first transforms the HTML into a dict using transform_html_to_dict, then joins the resulting dict with the provided dict using join_dicts, and returns the final result as a single dict.

Copy-pasting this code into a Python script and running it produces the following outcome:

{'restaurant employees': [{'name': 'Hans', 'email': ''}, 
{'name': 'Vanessa', 'email': ''}, 
{'Name': 'Shyam', 'Email': ''}, 
{'Name': 'Bob', 'Email': ''}, 
{'Name': 'Jai', 'Email': ''}]}

Apart from the capitalization of the keys ‘Name’ and ‘Email’, the result is what we would expect.
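If the key mismatch matters downstream, a small post-processing step can normalize it. Here is a minimal sketch (the normalize_keys helper is our own addition, not part of the generated code) that lowercases the keys of every record:

```python
def normalize_keys(data):
    """Lowercase the keys of each record so 'Name'/'Email' match 'name'/'email'."""
    return {
        group: [{key.lower(): value for key, value in record.items()}
                for record in records]
        for group, records in data.items()
    }

result = {'restaurant employees': [
    {'name': 'Hans', 'email': ''},
    {'Name': 'Shyam', 'Email': ''},
]}
print(normalize_keys(result))
```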

So, while more development needs to be done here for fine-tuning and implementing safeguards, this approach can help address some of the challenges posed by the initial method:

  • Only example data, which can be anonymized or dummy data, needs to be sent to the AI vendor.
  • Fewer API calls are needed, reducing costs.
  • Data transformations can happen on-premises, ensuring compliance with data protection regulations.
  • Once the functions are tested, they work as reliably as any other Python script.
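To make that last point concrete, here is a minimal sketch of such a test. It uses a self-contained copy of the joining logic (concatenating lists under shared keys, the behavior the example output shows) so it runs without any other dependencies:

```python
def join_dicts(dict1, dict2):
    # Concatenate list values under shared keys; copy over new keys.
    for key in dict2:
        if key in dict1:
            dict1[key] = dict1[key] + dict2[key]
        else:
            dict1[key] = dict2[key]
    return dict1

def test_join_dicts():
    existing = {'restaurant employees': [{'name': 'Hans', 'email': ''}]}
    extracted = {'restaurant employees': [{'name': 'Shyam', 'email': ''}]}
    merged = join_dicts(existing, extracted)
    assert len(merged['restaurant employees']) == 2
    assert merged['restaurant employees'][0]['name'] == 'Hans'
    assert merged['restaurant employees'][1]['name'] == 'Shyam'

test_join_dicts()
```

Once a generated function passes tests like this, it can be version-controlled and reused like any hand-written code.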

Future Considerations and Challenges

The experiments discussed here suggest that there is enormous potential for using LLMs to help unify heterogeneous data. However, several questions need to be addressed before this approach can be widely adopted in production:

  • Flexibility: Can we write prompts that work with a wide range of data formats, or will we need to write new prompts for each new data type?
  • Reliability: Can we develop general enough tests to ensure the generated functions work as intended? Can we automate the testing process, including the creation of the tests?
  • Security: Allowing AI-generated code to run in an automated manner introduces new cybersecurity concerns that need to be addressed.
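On the reliability point, one possible pattern, sketched here with hypothetical names, is to treat the example data as a fixture and accept a generated function only if it reproduces the expected output:

```python
def passes_fixture(candidate_fn, fixture_args, expected_output):
    """Accept a generated function only if it reproduces a known result.

    candidate_fn is whatever function the LLM produced (hypothetical here);
    any exception counts as failure, since a crashing function must not ship.
    """
    try:
        return candidate_fn(*fixture_args) == expected_output
    except Exception:
        return False

# Usage with a trivial stand-in for a generated merge function:
generated = lambda d1, d2: {**d1, **d2}
ok = passes_fixture(generated, ({'a': 1}, {'b': 2}), {'a': 1, 'b': 2})
```

A check like this does not replace human review, but it can gate which generated functions ever reach the test suite.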

In conclusion, while there is still a long way to go before LLMs can revolutionize data integration, the initial experiments are promising. Exploring this direction further is essential, not only to harness the full potential of AI but also to relieve data scientists from mundane tasks, allowing them to focus on more creative and fulfilling work. Onwards!


*The get_completion method used in the prompts above wraps an API call to OpenAI using the openai Python package:

import openai

def get_completion(prompt, model="gpt-3.5-turbo", temperature=0):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature,
    )
    return response.choices[0].message["content"]

Note that this code relies on the OpenAI API, so you will need valid API credentials and the openai Python package installed in order to use it.