How to convert nested JSON to a dataframe or CSV in Julia

You can’t do it in one line of Julia code, but you can do it!

Ron Erdos
Updated March 9, 2023
Tested with Julia version 1.8.5

Inputs and outputs

Let’s say you have the following nested JSON:

{
    "org": "SpaceX",
    "org_id": 12345,
    "equipment": [
        {
            "launcher": "Falcon 1",
            "max_payload": 180
        },
        {
            "launcher": "Falcon 9",
            "max_payload": 22800
        },
        {
            "launcher": "Falcon Heavy",
            "max_payload": 63800
        }
    ]
}

… and you want to use Julia to convert it to a CSV like this:

orgorg_idlaunchermax_payload
SpaceX12345Falcon 1180
SpaceX12345Falcon 922800
SpaceX12345Falcon Heavy63800

This tutorial will show you how to do that using Julia. Let’s get started!

Converting the JSON to a string variable

First, we need to escape the double quotes, so that we can create this JSON as a string in Julia.

The easiest way to escape double quotes in Julia is to use triple double quotes at the start and end of the data.

I’ve done that here, as well as storing the data inside a new variable I’ve called spacex. The code looks like this:

spacex = """
{
    "org": "SpaceX",
    "org_id": 12345,
    "equipment": [
        {
            "launcher": "Falcon 1",
            "max_payload": 180
        },
        {
            "launcher": "Falcon 9",
            "max_payload": 22800
        },
        {
            "launcher": "Falcon Heavy",
            "max_payload": 63800
        }
    ]
}
"""

So what are we working with here? What sort of variable is spacex?

Let’s check using Julia’s built-in typeof() function:

typeof(spacex)

We get:

String

Okay, spacex is a String. Now we need to turn it into structured data so we can use it to make our table.

Enter the JSON.jl package.

Converting our string variable into structured data

Let’s go ahead and install the JSON.jl package. Instructions are in the box below.

First, let’s tell Julia we want to use the package:

using JSON

Now we can go ahead and actually use it. We can do so like this:

spacex_dict = JSON.parse(spacex)

In the line of code above, we’re creating a new variable, spacex_dict, by parsing the spacex data with the JSON package.

The value of our spacex_dict variable is:

Dict{String, Any} with 3 entries:
  "org_id"    => 12345
  "org"       => "SpaceX"
  "equipment" => Any[Dict{String, Any}("max_payload"=>180, "launcher"=>"Falcon 1"), Dict{String, Any}("max_payload"=>22800, "launcher"=>"Falcon 9"), Dict{String, Any}("max_payload"=>63800, "launcher"=>"Falcon Heavy")]

You can scroll horizontally inside the code block above.

Let’s go ahead and check the type of our new spacex_dict variable:

typeof(spacex_dict)

We get:

Dict{String, Any}

So now we have a dictionary—our spacex_dict variable—to work with.

That’s great—it’s way more structured than our original variable, spacex, which was just a simple String.

Now it’s time to make a table—specifically, a DataFrame—from our spacex_dict variable.

Enter the DataFrames.jl package.

Creating a dataframe from our structured data variable

Let’s install the DataFrames.jl package. The instructions are in the box below.

We now need to tell Julia we want to use the DataFrames package:

using DataFrames

Now we can actually create a dataframe from our data. I’m going to use the conventional df (which stands for “data frame”) as our variable name, but you can choose any variable name you like. Here’s my code:

df = DataFrame(spacex_dict)

This code says “make a variable named df which contains a dataframe of the spacex_dict dictionary.

We get:

3×3 DataFrame
 Row │ equipment                          org     org_id
     │ Any                                String  Int64
─────┼───────────────────────────────────────────────────
   1 │ Dict{String, Any}("max_payload"=…  SpaceX   12345
   2 │ Dict{String, Any}("max_payload"=…  SpaceX   12345
   3 │ Dict{String, Any}("max_payload"=…  SpaceX   12345

The above dataframe is sort of what we want. The good news is that the non-nested JSON (the org_id and org columns) is appearing properly in distinct columns.

However, the equipment column is a bit of a mess. This is because the nested JSON—the launcher names and their max_payloads for low earth orbit—have not been separated into distinct columns.

We’ll fix that next.

Cleaning up the dataframe

Let’s take a closer look at the equipment column. Let’s make a variable called equipment_col so we can do just that:

equipment_col = df[!, :equipment]

I’ve written a tutorial on Julia dataframes if you’re not familiar with them, but let’s walk through this code briefly:

equipment_col =

Here we’re creating a new variable, equipment_col to use later.

df[!, :equipment]

From the dataframe named “df” (df[]), get all rows (signified with an exclamation mark !) in the equipment column (signified with a colon and then the column name: :equipment).

So the value of our newly-created equipment_col variable is:

3-element Vector{Any}:
 Dict{String, Any}("max_payload" => 180, "launcher" => "Falcon 1")
 Dict{String, Any}("max_payload" => 22800, "launcher" => "Falcon 9")
 Dict{String, Any}("max_payload" => 63800, "launcher" => "Falcon Heavy")

Me again, Italics Guy. Remember you can scroll horizontally in the code block above.

Before we go on, let’s recall our desired end result. We want a dataframe and/or CSV that looks like this:

orgorg_idlaunchermax_payload
SpaceX12345Falcon 1180
SpaceX12345Falcon 922800
SpaceX12345Falcon Heavy63800

However, our current dataframe looks like this:

3×3 DataFrame
 Row │ equipment                          org     org_id
     │ Any                                String  Int64
─────┼───────────────────────────────────────────────────
   1 │ Dict{String, Any}("max_payload"=…  SpaceX   12345
   2 │ Dict{String, Any}("max_payload"=…  SpaceX   12345
   3 │ Dict{String, Any}("max_payload"=…  SpaceX   12345

So we have the org_id and org columns already, but we don’t have the launcher and max_payload columns. We do have an equipment column, but it’s a mess, so we’ll delete it once we’ve extracted its data.

To that end, let’s create empty variables for our desired new columns:

launchers_array = String[]
max_payloads_array = Int64[]

These will become our desired columns later.

Now for probably the hardest bit of the whole tutorial. Hang in there! We’re going to iterate over the equipment column, grab each launcher’s name (Falcon 1, Falcon 9 etc) and its max_payload, and put them in the empty arrays we just created earlier.

Time to put this into action. The code looks like this:

for i in eachindex(equipment_col)
    push!(launchers_array, equipment_col[i]["launcher"])
    push!(max_payloads_array, equipment_col[i]["max_payload"])
end

Let’s walk through that code line by line:

for i in eachindex(equipment_col)

Here we are iterating over the number of items in our equipment column. In our SpaceX example, we have three rows (one each for Falcon 1, Falcon 9, and Falcon Heavy). So in our iterator, i will equal 1, then 2, and finally 3. We’ll use this changing value of i below to get our launcher names and their max payloads.

push!(launchers_array, equipment_col[i]["launcher"])

Now we are adding (push!) the name of each launcher (“Falcon 1”, “Falcon 9” etc) into our launchers_array. When i is equal to 1, we are adding equipment_col[1]["launcher"], which is equal to "Falcon 1". When i is equal to 2, we are adding equipment_col[2]["launcher"], which equals "Falcon 9". And finally, when i equals 3, we’re adding the launcher name ("Falcon Heavy") from the third element in our equipment_col. If this is at all unclear, let me know and I’ll do a tutorial just on this.

push!(max_payloads_array, equipment_col[i]["max_payload"])

Here we’re doing the same thing as the line above, but instead of adding launcher names into the launchers array, we’re adding the maximum payload for each launcher into our max_payloads_array.

end

Here we close the for loop we started a few lines earlier.

Let’s take a look at our new launchers_array:

3-element Vector{String}:
 "Falcon 1"
 "Falcon 9"
 "Falcon Heavy"

And our new max_payloads_array:

3-element Vector{Int64}:
  180
22800
63800

Great, everything’s in order.

Now we can add them to our dataframe.

Let’s add the launcher column first:

df[!, :launcher] = launchers_array

… and our max_payload column second:

df[!, :max_payload] = max_payloads_array

Let’s take a look so far:

df

We get:

3×5 DataFrame
Row │ equipment                          org     org_id  launcher      max_payload
    │ Any                                String  Int64   String        Int64
────┼──────────────────────────────────────────────────────────────────────────────
  1 │ Dict{String, Any}("max_payload"=…  SpaceX   12345  Falcon 1              180
  2 │ Dict{String, Any}("max_payload"=…  SpaceX   12345  Falcon 9            22800
  3 │ Dict{String, Any}("max_payload"=…  SpaceX   12345  Falcon Heavy        63800

As you probably know, you can scroll horizontally inside the code box above.

At this point, we no longer need the unstructured equipment column—we have replaced it fully with our new, structured columns launcher and max_payload. We delete the equipment column like this:

select!(df, Not(:equipment))

The above code says “select all columns from our df dataframe but Not the column named equipment, and assign the result to the value of df”. Or in plain English: delete the column named equipment from our dataframe.

We get our clean dataframe:

3×4 DataFrame
 Row │ org     org_id  launcher      max_payload
     │ String  Int64   String        Int64
─────┼───────────────────────────────────────────
   1 │ SpaceX   12345  Falcon 1              180
   2 │ SpaceX   12345  Falcon 9            22800
   3 │ SpaceX   12345  Falcon Heavy        63800

If you want to reorder the columns so that the unique fields in each row—i.e. launcher and max_payload—come first, you could do that like so:

select!(df, [:launcher, :max_payload, :org, :org_id])

Now everything is in the order we want:

3×4 DataFrame
 Row │ launcher      max_payload  org     org_id
     │ String        Int64        String  Int64
─────┼───────────────────────────────────────────
   1 │ Falcon 1              180  SpaceX   12345
   2 │ Falcon 9            22800  SpaceX   12345
   3 │ Falcon Heavy        63800  SpaceX   12345

Writing our finished dataframe to a CSV

We can now trivially write our finished dataframe to a CSV.

First, you’ll need to install the CSV.jl package if you haven’t already. Instructions are in the box below.

Now we need to tell Julia we want to use the CSV package:

using CSV

Finally, we can write our CSV file with the contents of our df dataframe:

CSV.write("df.csv", df)

The code above says “write a CSV named df.csv (you can name it anything you like) with the contents of the df dataframe”.

Using the code above, the CSV will be written to the same folder as your Julia script. If you’re coding this directly into your terminal, then the CSV will be written to your home directory. On my Mac, this is at /Users/ron/.

If you want the CSV to end up in a specific place, you can do it like this:

CSV.write("/Users/ron/Desktop/df.csv", df)

The above code will write the CSV to the desktop of my Mac. Remember to change the /ron/ part if you run this code for yourself!

Get Julia tips in your inbox a few times per year. Unsubscribe anytime.

Despite being an amateur at Julia programming and having no prior web-scraping experience, a JuliaSchool tutorial helped me learn enough to make a contribution to an open-source project in the Julia community using these tools, all in the course of a single afternoon. Thanks JuliaSchool! - Joe from Germany