How to convert nested JSON to a dataframe or CSV in Julia
You can’t do it in one line of Julia code, but you can do it!
Updated March 9, 2023Tested with Julia version 1.8.5
Inputs and outputs
Let’s say you have the following nested JSON:
{
"org": "SpaceX",
"org_id": 12345,
"equipment": [
{
"launcher": "Falcon 1",
"max_payload": 180
},
{
"launcher": "Falcon 9",
"max_payload": 22800
},
{
"launcher": "Falcon Heavy",
"max_payload": 63800
}
]
}
… and you want to use Julia to convert it to a CSV like this:
org | org_id | launcher | max_payload |
---|---|---|---|
SpaceX | 12345 | Falcon 1 | 180 |
SpaceX | 12345 | Falcon 9 | 22800 |
SpaceX | 12345 | Falcon Heavy | 63800 |
This tutorial will show you how to do that using Julia. Let’s get started!
Converting the JSON to a string variable
First, we need to escape the double quotes, so that we can create this JSON as a string in Julia.
The easiest way to escape double quotes in Julia is to use triple double quotes at the start and end of the data.
I’ve done that here, as well as storing the data inside a new variable I’ve called spacex
. The code looks like this:
spacex = """
{
"org": "SpaceX",
"org_id": 12345,
"equipment": [
{
"launcher": "Falcon 1",
"max_payload": 180
},
{
"launcher": "Falcon 9",
"max_payload": 22800
},
{
"launcher": "Falcon Heavy",
"max_payload": 63800
}
]
}
"""
So what are we working with here? What sort of variable is spacex
?
Let’s check using Julia’s built-in typeof()
function:
typeof(spacex)
We get:
String
Okay, spacex
is a String
. Now we need to turn it into structured data so we can use it to make our table.
Enter the JSON.jl
package.
Converting our string variable into structured data
Let’s go ahead and install the JSON.jl
package. Instructions are in the box below.
First, let’s tell Julia we want to use the package:
using JSON
Now we can go ahead and actually use it. We can do so like this:
spacex_dict = JSON.parse(spacex)
In the line of code above, we’re creating a new variable, spacex_dict
, by parsing the spacex
data with the JSON
package.
The value of our spacex_dict
variable is:
Dict{String, Any} with 3 entries:
"org_id" => 12345
"org" => "SpaceX"
"equipment" => Any[Dict{String, Any}("max_payload"=>180, "launcher"=>"Falcon 1"), Dict{String, Any}("max_payload"=>22800, "launcher"=>"Falcon 9"), Dict{String, Any}("max_payload"=>63800, "launcher"=>"Falcon Heavy")]
You can scroll horizontally inside the code block above.
Let’s go ahead and check the type of our new spacex_dict
variable:
typeof(spacex_dict)
We get:
Dict{String, Any}
So now we have a dictionary—our spacex_dict
variable—to work with.
That’s great—it’s way more structured than our original variable, spacex
, which was just a simple String
.
Now it’s time to make a table—specifically, a DataFrame
—from our spacex_dict
variable.
Enter the DataFrames.jl
package.
Creating a dataframe from our structured data variable
Let’s install the DataFrames.jl
package. The instructions are in the box below.
We now need to tell Julia we want to use the DataFrames package:
using DataFrames
Now we can actually create a dataframe from our data. I’m going to use the conventional df
(which stands for “data frame”) as our variable name, but you can choose any variable name you like. Here’s my code:
df = DataFrame(spacex_dict)
This code says “make a variable named df
which contains a dataframe of the spacex_dict
dictionary.
We get:
3×3 DataFrame
Row │ equipment org org_id
│ Any String Int64
─────┼───────────────────────────────────────────────────
1 │ Dict{String, Any}("max_payload"=… SpaceX 12345
2 │ Dict{String, Any}("max_payload"=… SpaceX 12345
3 │ Dict{String, Any}("max_payload"=… SpaceX 12345
The above dataframe is sort of what we want. The good news is that the non-nested JSON (the org_id
and org
columns) is appearing properly in distinct columns.
However, the equipment
column is a bit of a mess. This is because the nested JSON—the launcher
names and their max_payload
s for low earth orbit—have not been separated into distinct columns.
We’ll fix that next.
Cleaning up the dataframe
Let’s take a closer look at the equipment
column. Let’s make a variable called equipment_col
so we can do just that:
equipment_col = df[!, :equipment]
I’ve written a tutorial on Julia dataframes if you’re not familiar with them, but let’s walk through this code briefly:
equipment_col =
Here we’re creating a new variable,
equipment_col
to use later.
df[!, :equipment]
From the dataframe named “df” (
df[]
), get all rows (signified with an exclamation mark!
) in theequipment
column (signified with a colon and then the column name::equipment
).
So the value of our newly-created equipment_col
variable is:
3-element Vector{Any}:
Dict{String, Any}("max_payload" => 180, "launcher" => "Falcon 1")
Dict{String, Any}("max_payload" => 22800, "launcher" => "Falcon 9")
Dict{String, Any}("max_payload" => 63800, "launcher" => "Falcon Heavy")
Me again, Italics Guy. Remember you can scroll horizontally in the code block above.
Before we go on, let’s recall our desired end result. We want a dataframe and/or CSV that looks like this:
org | org_id | launcher | max_payload |
---|---|---|---|
SpaceX | 12345 | Falcon 1 | 180 |
SpaceX | 12345 | Falcon 9 | 22800 |
SpaceX | 12345 | Falcon Heavy | 63800 |
However, our current dataframe looks like this:
3×3 DataFrame
Row │ equipment org org_id
│ Any String Int64
─────┼───────────────────────────────────────────────────
1 │ Dict{String, Any}("max_payload"=… SpaceX 12345
2 │ Dict{String, Any}("max_payload"=… SpaceX 12345
3 │ Dict{String, Any}("max_payload"=… SpaceX 12345
So we have the org_id
and org
columns already, but we don’t have the launcher
and max_payload
columns. We do have an equipment
column, but it’s a mess, so we’ll delete it once we’ve extracted its data.
To that end, let’s create empty variables for our desired new columns:
launchers_array = String[]
max_payloads_array = Int64[]
These will become our desired columns later.
Now for probably the hardest bit of the whole tutorial. Hang in there! We’re going to iterate over the equipment
column, grab each launcher
’s name (Falcon 1
, Falcon 9
etc) and its max_payload
, and put them in the empty arrays we just created earlier.
Time to put this into action. The code looks like this:
for i in eachindex(equipment_col)
push!(launchers_array, equipment_col[i]["launcher"])
push!(max_payloads_array, equipment_col[i]["max_payload"])
end
Let’s walk through that code line by line:
for i in eachindex(equipment_col)
Here we are iterating over the number of items in our equipment column. In our SpaceX example, we have three rows (one each for Falcon 1, Falcon 9, and Falcon Heavy). So in our iterator,
i
will equal1
, then2
, and finally3
. We’ll use this changing value ofi
below to get our launcher names and their max payloads.
push!(launchers_array, equipment_col[i]["launcher"])
Now we are adding (
push!
) the name of each launcher (“Falcon 1”, “Falcon 9” etc) into ourlaunchers_array
. Wheni
is equal to1
, we are addingequipment_col[1]["launcher"]
, which is equal to"Falcon 1"
. Wheni
is equal to2
, we are addingequipment_col[2]["launcher"]
, which equals"Falcon 9"
. And finally, wheni
equals3
, we’re adding the launcher name ("Falcon Heavy"
) from the third element in ourequipment_col
. If this is at all unclear, let me know and I’ll do a tutorial just on this.
push!(max_payloads_array, equipment_col[i]["max_payload"])
Here we’re doing the same thing as the line above, but instead of adding launcher names into the launchers array, we’re adding the maximum payload for each launcher into our
max_payloads_array
.
end
Here we close the
for
loop we started a few lines earlier.
Let’s take a look at our new launchers_array
:
3-element Vector{String}:
"Falcon 1"
"Falcon 9"
"Falcon Heavy"
And our new max_payloads_array
:
3-element Vector{Int64}:
180
22800
63800
Great, everything’s in order.
Now we can add them to our dataframe.
Let’s add the launcher
column first:
df[!, :launcher] = launchers_array
… and our max_payload
column second:
df[!, :max_payload] = max_payloads_array
Let’s take a look so far:
df
We get:
3×5 DataFrame
Row │ equipment org org_id launcher max_payload
│ Any String Int64 String Int64
────┼──────────────────────────────────────────────────────────────────────────────
1 │ Dict{String, Any}("max_payload"=… SpaceX 12345 Falcon 1 180
2 │ Dict{String, Any}("max_payload"=… SpaceX 12345 Falcon 9 22800
3 │ Dict{String, Any}("max_payload"=… SpaceX 12345 Falcon Heavy 63800
As you probably know, you can scroll horizontally inside the code box above.
At this point, we no longer need the unstructured equipment
column—we have replaced it fully with our new, structured columns launcher
and max_payload
. We delete the equipment
column like this:
select!(df, Not(:equipment))
The above code says “select all columns from our df
dataframe but Not
the column named equipment
, and assign the result to the value of df
”. Or in plain English: delete the column named equipment
from our dataframe.
We get our clean dataframe:
3×4 DataFrame
Row │ org org_id launcher max_payload
│ String Int64 String Int64
─────┼───────────────────────────────────────────
1 │ SpaceX 12345 Falcon 1 180
2 │ SpaceX 12345 Falcon 9 22800
3 │ SpaceX 12345 Falcon Heavy 63800
If you want to reorder the columns so that the unique fields in each row—i.e. launcher
and max_payload
—come first, you could do that like so:
select!(df, [:launcher, :max_payload, :org, :org_id])
Now everything is in the order we want:
3×4 DataFrame
Row │ launcher max_payload org org_id
│ String Int64 String Int64
─────┼───────────────────────────────────────────
1 │ Falcon 1 180 SpaceX 12345
2 │ Falcon 9 22800 SpaceX 12345
3 │ Falcon Heavy 63800 SpaceX 12345
Writing our finished dataframe to a CSV
We can now trivially write our finished dataframe to a CSV.
First, you’ll need to install the CSV.jl
package if you haven’t already. Instructions are in the box below.
Now we need to tell Julia we want to use the CSV package:
using CSV
Finally, we can write our CSV file with the contents of our df
dataframe:
CSV.write("df.csv", df)
The code above says “write a CSV named df.csv
(you can name it anything you like) with the contents of the df
dataframe”.
Using the code above, the CSV will be written to the same folder as your Julia script. If you’re coding this directly into your terminal, then the CSV will be written to your home directory. On my Mac, this is at /Users/ron/
.
If you want the CSV to end up in a specific place, you can do it like this:
CSV.write("/Users/ron/Desktop/df.csv", df)
The above code will write the CSV to the desktop of my Mac. Remember to change the /ron/
part if you run this code for yourself!