Using CSV data and DDI Codebook metadata with ChatGPT code interpreter

Pascal Heus
8 min readJul 24, 2023

--

Now that code interpreter is available to OpenAI paid ChatGPT users, I wanted to quickly see if I could use it to read and analyze data stored in a CSV file with the associated documentation in a machine actionable DDI-Codebook XML file.

The good news is that the answer is yes. Below a summary of the steps to reproduce my initial experience along with some tips and lessons learned.

What’s DDI?

The Data Documentation Initiative (DDI) is a family of standards widely used around the world by data producers, archives, and researchers for documenting data.

The use of metadata standards such as DDI is highly encouraged to support the FAIR data principles or align on the Data on the Web Best Practices.

Two major flavors of DDI exists:

  • DDI-Codebook (version 2.x): a light model describing a data dictionary and providing contextual information for a dataset (survey, provenance, stakeholders, citations, methodologies, reference materials, and more)
  • DDI-Lifecycle (version 3.x): an more advanced and powerful model for capturing comprehensive metadata across studies, promoting reuse, and enabling the management of higher level elements such as concepts, classifications, or questionnaires.

For the purpose if this exercise, I used the easy to use DDI-Codebook specification.

The dataset and its metadata

I didn’t have a specific use case in mind and just picked a random dataset I found in my local data archive. This ended up being a sample patient beneficiary file from CMS 2008–2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF), specifically the DE1_0_2008_Beneficiary_Summary_File_Sample_1.csv file (14.6Mb), which you can find on this download page.

The synthetic dataset contains 32 variables and 116,352 records, and provides demographic and health information for a sample of Medicare beneficiaries from 2008. It includes demographic details, Medicare coverage information, diagnosis of various chronic conditions, and details about the Medicare reimbursement, beneficiary responsibility, and primary payer reimbursement amounts for different types of services. See the end of this article for a data dictionary.

I had previously documented this dataset with MTNA’s metasheet (an open source Python package that uses a simple spreadsheet format as input), which I used to generate the associated DDI-Codebook XML file (download here). But of course you can reproduce this process with any pair of CSV + DDI-C files.

Activating code interpreter

If you have never used code interpreter before, you’ll first need to enable it in the ChatGT settings under Beta Features:

You can then activate it when you start a new ChatGPT session:

Loading the data and metadata

The first step of the process is similar to what you may have seen in other Code Interpreter demos, whereby you use the new file upload feature to submit the CSV data file.

Once the CSV file has been loaded, ChatGPT will suggest a few potential next steps. But we do not want to do that yet, as we instead want to complement the data with the DDI metadata.

So we rsimply epeat the upload process for the DDI-XML file:

As we can see, ChatGPT suspects that the XML file likely holds documentation for the data file. What we need to is make sure it recognizes it as a DDI-Codebook XML and extracts the variable level documentation containing the variables names, labels, description, and categories (value labels).

Getting ChatGPT to understand the DDI

Answering yes to the initial ChatGPT suggestion though points at some level of uncertainty.

So I provided a bit more context, which helped, but ChatGPT failed to find the dataset information.

Hmmm, that didn’t work out of the box.

To understand what happened, click on Show work which reveals the underlying generated Python code (partially shown below):

The issue in this case is that ChatGPT makes two assumptions (at the top of the code):

These are perfectly valid assumptions, as that particular namespace was used for many years by DDI Codebook (it was changed in version 2.5), and it is common practice to use the ddiprefix.

While DDI is a standard, there are various ways to serialize XML, in particular when it comes to defining the ‘namespace’.

Our XML file actually uses the latest version of DDI-Codebook, whose namespace is ddi:codebook:2_5, and does not use a prefix (default namespace), as can be seen below:

Note that if you are not familiar with XML namespace mechanisms, you can learn more about it on this w3schools page.

I was actually hoping that ChatGPT would figure this out on its own. But not a big deal, as basically we just need to guide the AI in the right direction.

This actually took me a few iterations, as ChatGPT kept going back and forth to using the old namespace or a prefix.

The following hints repeatedly failed to generate the right Python code, with ChatGPT apologizing and recommending me to seek human assistance.

Wait, does this suggests I’m not human? I always suspected that.

This however finally did the trick:

which led to:

Victory! Success! Hooray!

Learning from this, I subsequently reworked an initial prompt in a fresh session that addresses the problem. After uploading the DDI file, using the following worked right away:

Note that the Python xml.etree.ElementTree like to have a namespace prefix so using XML with an explicit prefix like ddi:is a good option anyway.

Performing analysis

At this point we can start working with the data and perform exploratory analysis, ask questions, or create visualizations. A few examples are illustrated below. But this was not the objective of this exercise.

The key is that, thanks to DDI, ChatGPT gained significant knowledge about the content of the dataset, greatly facilitating working with the data and enriching the outputs.

A CSV file on its own lacks such insights, and typically ChatGPT would attempt to guess the meaning of the variables and records, which is naturally error prone and potentially inaccurate.

Note that as I was writing this article, I was notified by ChatGPT that my code interpreter session had timed out, which basically means the environment has been wiped out and we need to restart from scratch. Something to be mindful about.

Conclusions

Using CSV + DDI-Codebook with code interpreter works well! Essentially wiht a bit of guidance, the AI understands what DDI is and how to parse the XML file. I expected this, but it’s great to have it confirmed.

This is of course just the beginning and I’ll continue to explore code interpreter and other AI’s data+metadata capabilities. Beyond analysis, I’m also particularly interested to see how AI can be used to generate documentation and infer metadata around data.

Hope this was useful, and looking forward to see what others can accomplish around data and standard metadata with ChatGPT. And tf you are not familiar with DDI and FAIR, I would also encourage you read more about these initiatives.

Annex: Data Dictionary

A simplified data dictionary generated by ChatGPT from the DDI-XML file.

  • DESYNPUF_ID: Beneficiary Code
  • BENE_BIRTH_DT: Date of birth
  • BENE_DEATH_DT: Date of death
  • BENE_SEX_IDENT_CD: Sex (1: Male, 2: Female)
  • BENE_RACE_CD: Beneficiary Race Code (1: White, 2: Black, 3: Others, 5: Hispanic)
  • BENE_ESRD_IND: End stage renal disease Indicator ('0': The beneficiary does not have ESRD, 'Y': The beneficiary has ESRD)
  • SP_STATE_CODE: State Code (01: Alabama, 02: Alaska, 03: Arizona, ..., 54: Others)
  • BENE_COUNTY_CD: County Code
  • BENE_HI_CVRAGE_TOT_MONS: Total number of months of part A coverage for the beneficiary
  • BENE_SMI_CVRAGE_TOT_MONS: Total number of months of part B coverage for the beneficiary
  • BENE_HMO_CVRAGE_TOT_MONS: Total number of months of HMO coverage for the beneficiary
  • PLAN_CVRG_MOS_NUM: Total number of months of part D plan coverage for the beneficiary
  • SP_ALZHDMTA: Chronic Condition: Alzheimer or related disorders or senile (1: Yes, 2: No)
  • SP_CHF: Chronic Condition: Heart Failure (1: Yes, 2: No)
  • SP_CHRNKIDN: Chronic Condition: Chronic Kidney Disease (1: Yes, 2: No)
  • SP_CNCR: Chronic Condition: Cancer (1: Yes, 2: No)
  • SP_COPD: Chronic Condition: Chronic Obstructive Pulmonary Disease (1: Yes, 2: No)
  • SP_DEPRESSN: Chronic Condition: Depression (1: Yes, 2: No)
  • SP_DIABETES: Chronic Condition: Diabetes (1: Yes, 2: No)
  • SP_ISCHMCHT: Chronic Condition: Ischemic Heart Disease (1: Yes, 2: No)
  • SP_OSTEOPRS: Chronic Condition: Osteoporosis (1: Yes, 2: No)
  • SP_RA_OA: Chronic Condition: Rheumatoid arthritis and osteoarthritis (RA/OA) (1: Yes, 2: No)
  • SP_STRKETIA: Chronic Condition: Stroke/transient Ischemic Attack (1: Yes, 2: No)
  • MEDREIMB_IP: Inpatient annual Medicare reimbursement amount
  • BENRES_IP: Inpatient annual beneficiary responsibility amount
  • PPPYMT_IP: Inpatient annual primary payer reimbursement amount
  • MEDREIMB_OP: Outpatient Institutional annual Medicare reimbursement amount
  • BENRES_OP: Outpatient Institutional annual beneficiary responsibility amount
  • PPPYMT_OP: Outpatient Institutional annual primary payer reimbursement amount
  • MEDREIMB_CAR: Carrier annual Medicare reimbursement amount
  • BENRES_CAR: Carrier annual beneficiary responsibility amount
  • PPPYMT_CAR: Carrier annual primary payer reimbursement amount

--

--

Pascal Heus
Pascal Heus

Written by Pascal Heus

Data Lead at Postman Open Technologies. Information Technologist / Data Engineer / Metadata Expert. Interests in Gamification, Quantum Physics, Astrophysics.

No responses yet