How to use dataframes in Julia

Julia dataframes let you do anything you want: pivot tables, data cleaning, table joins, filtering, and more, all with a nice clean syntax.

Ron Erdos
Updated May 22, 2023
Tested with Julia version 1.9.0

What are dataframes?

If you’re not familiar with dataframes, read on. If you are, feel free to skip ahead to the next section.

Dataframes are basically a table that you can manipulate in Julia. You can:

  • Do a “find and replace” on a given column, or multiple columns. This is useful for correcting typos, fixing capitalisation, and many more tasks.
  • Perform mathematical operations, such as add ten per cent to every number in a given column. This would be useful, for, say, calculating sales tax on retail prices.
  • Make a pivot table, just like in Excel or Google Sheets, only better! Excel has a row limit of 1,048,576. Julia doesn’t. And Google Sheets has a cell limit of 10 million, which can get used up surprisingly quickly. In addition, pivot tables in Julia are—or can be made into—first-class citizens, ready for more transformations, joins and so on. This is decidedly not the case in Google Sheets or Excel—you often have to copy and paste the pivot table into a new sheet in order to keep working on it.
  • Merge two or more tables (dataframes) together according to rules you set. This is conceptually the same as “joins” in SQL and other database languages.

Okay, so dataframes have a lot of advantages over Excel. Let’s look at how to use them here.

Installing Julia

If you want to play along (as opposed to just reading through this tutorial), you’ll need to download, install, and run Julia if you haven’t already.

Once you have Julia up and running, you’ll see this prompt:

julia>

This prompt will be green in colour, probably on a black background. Yeah, like in The Matrix.

Onwards!

Installing the DataFrames package

We’re using DataFrames.jl version v1.5.0 in this tutorial. And, as detailed at the top of this tutorial, we’re using Julia version 1.9.0.

The first thing we need to is to install the DataFrames package. This is a one-time task.

The easiest way to do this is to input a right square bracket ] into the Julia terminal.

This changes the julia> prompt we saw above into one that says:

pkg>

This prompt will be blue / purple in colour.

This means we are now in Julia’s package manager, hence the abbreviation pkg. We can install packages with a lot less typing in here, compared to at the green julia> prompt.

Go ahead and add the DataFrames package by inputting the line below:

add DataFrames

Julia’s package manager will install the DataFrames package for you.

You’ll be taken back to the package manager prompt:

pkg>

To exit the package manager area, and to go back to the main Julia area, simply hit the delete key. How cool is that?

You’ll now see the Julia prompt again:

julia>

How to convert an array into a dataframe in Julia

Let’s say you input an array like this into your Julia terminal:

space_companies = ["SpaceX", "Blue Origin", "Boeing"]

… and now you want to convert this array into a Julia dataframe.

First, we need to tell Julia that we’ll be using the DataFrames package:

using DataFrames

This loads the DataFrames package into your workspace.

Now, assuming we want our new dataframe to be called df, with a single column we’d like to call space_brands, we input this:

df = DataFrame([space_companies], [:space_brands])

We get:

3×1 DataFrame
 Row │ space_brands
     │ String
─────┼──────────────
   1 │ SpaceX
   2 │ Blue Origin
   3 │ Boeing

Let’s walk through the code above now.

df =

We are creating a new variable, which we have named df.

DataFrame(

Here we tell Julia to create a dataframe, and we open the brackets, inside of which we’ll be configuring said dataframe.

[space_companies]

Here we’re telling Julia which array (space_companies) we want to reference for our dataframe’s content. Important: The array name must be inside square brackets []. This is because Julia is expecting a list (otherwise known as an array) of array names, and just because we happen to only have one array (space_companies) in this example, doesn’t mean Julia isn’t able to handle more—and therefore, it expects an array of array names. If that’s confusing then just remember to put square brackets around your array name.

,

We separate our array name from the next argument with a comma.

[:space_brands]

Here we tell Julia what we want to name the column in our new dataframe. In this case, I’ve gone for space_brands as the column name, but we have to add some additional characters to the name. We first add a colon : to signify that this is a dataframes column name (it’s part of the DataFrames standard). Next, we have to wrap the colon and column name in square brackets []. So instead of :space_brands, we write [:space_brands]. The reason for the square brackets is that Julia needs to be ready for more than one array/column name combo, and therefore expects an array—even if you only have one array and column name.

)

Here we close the brackets we opened earlier.

How to create a Julia dataframe in the terminal

For much of the rest of this tutorial, we’re going to be using a small dataframe of NASA inventory. The data is fictional; it’s just for this tutorial.

We’ll call the dataframe inventory.

It will have three columns: item, id, and kind.

First, as before, we need to tell Julia we want to use the DataFrames package we just installed.

We do that by inputting the line below:

using DataFrames

Next, we input the lines below. This will create our NASA inventory dataframe:

inventory = DataFrame(
	item = [
		"Mars Rover",
		"Venus Explorer",
		"%Lunar Rover",
		"30% Sun Filter"
	],
	id = [
		100,
		101,
		102,
		103
	],
	kind = [
		"rover",
		"spaceship",
		"rover",
		"Sun Filter"
	]
)

We get:

4×3 DataFrame
 Row │ item            id     kind
     │ String          Int64  String
─────┼───────────────────────────────────
   1 │ Mars Rover        100  rover
   2 │ Venus Explorer    101  spaceship
   3 │ %Lunar Rover      102  rover
   4 │ 30% Sun Filter    103  Sun Filter

Let’s quickly talk through what we see above.

Quick tour of a Julia dataframe

The first words of the output above, 4x3 DataFrame says that we have a 4-row, 3-column dataframe. Simple enough.

Next, we have the column headers, Row, item, id, and kind.

The first of these, Row, is not something we asked for, though it is very useful. It’s a column of row numbers that Julia generates automatically. But don’t worry, they’re not part of the dataframe proper, nor will they show up in CSV exports and so forth.

The next column headers, item, id, and kind are those we supplied to Julia in our code above.

Next up, we have the type of each column. Let’s go over what that means now.

We see String, Int64, and another String.

The Row column doesn’t have a type because, as mentioned above, it’s not part of the dataframe proper. It’s just there to help you keep track.

Our first “real” column, item, has the type String.

This means that anything that goes in this column will be treated as text.

The next column, id, has the type Int64. Although this might look weird, it just means “integer” (a whole number, not a decimal). The 64 just refers to 64 bit, which means huge numbers can be stored here. “Huge numbers” is my technical term for the range of integers between -2^63 and 2^63 - 1.

Finally, the last column, kind, is another String-type column, the same as the item column.

And finally, we have the actual data within the rows and columns of our dataframe.

How to “find and replace” in a Julia dataframe

As mentioned above, I’ve deliberately included some typos in the inventory dataframe, so we can practise “find and replace”.

In the kind column, rover should be Rover, and spaceship should be Spaceship.

Let’s clean that up.

Here’s how you do that:

for i in eachrow(inventory)
	i[:kind] = replace(i[:kind], "rover"=>"Rover")
	i[:kind] = replace(i[:kind], "spaceship"=>"Spaceship")
end

To see the results, we call up our dataframe again:

inventory

We get:

4×3 DataFrame
 Row │ item            id     kind
     │ String          Int64  String
─────┼───────────────────────────────────
   1 │ Mars Rover        100  Rover
   2 │ Venus Explorer    101  Spaceship
   3 │ %Lunar Rover      102  Rover
   4 │ 30% Sun Filter    103  Sun Filter

Let’s walk through that code line-by-line:

for i in eachrow(inventory)

Here we open up a “for loop”, where we’ll iterate over each row in the inventory dataframe.

i[:kind] = replace(i[:kind], "rover"=>"Rover")

Here we replace any instance of rover in the description column, and replace it with the correctly-cased Rover.

i[:kind] = replace(i[:kind], "spaceship"=>"Spaceship")

Here we do a similar replacement, but this time we’re swapping out “spaceship” for “Spaceship”.

end

Here we close the “for loop” we started in the first line.

OK, moving on.

There’s one more typo in the inventory dataframe.

In the item column, %Lunar Rover should be Lunar Rover.

Let’s clean that up too. But we’ll need to be careful—we don’t want to remove the percentage sign from 30% Sun Filter.

Read on to see how to do this.

How to “find and replace” with regex in a Julia dataframe

We’re going to use regex (regular expressions) to do our find and replace.

The code looks like this:

for i in eachrow(inventory)
	i[:item] = replace(i[:item], r"^%"=>"")
end

After inputting that code, we call up our dataframe again:

inventory

We get:

4×3 DataFrame
 Row │ item            id     kind
     │ String          Int64  String
─────┼───────────────────────────────────
   1 │ Mars Rover        100  Rover
   2 │ Venus Explorer    101  Spaceship
   3 │ Lunar Rover       102  Rover
   4 │ 30% Sun Filter    103  Sun Filter

Let’s walk through that code line-by-line.

for i in eachrow(inventory)

Here we start a “for loop” over the inventory dataframe. We use eachrow(), a function that ships with Julia’s DataFrames.jl package, to iterate over the rows.

i[:item] = replace(i[:item], r"^%"=>"")

This line instructs Julia to conduct a find-and-replace on each row of the item column, using Julia’s replace() function. The actual “find and replace” part comes in the form of the second argument, r"^%"=>"". It starts with the letter r, to signify that we are using regex. Next, we have double quotes surrounding our regex: "^%". The caret, ^ is a regex character which means “match the beginning of the string”. The percentage sign, %, is a literal match for a percentage sign. Then we have a fat arrow, => which tells Julia what the replacement should be. And finally, we have empty double quote marks, "", which tells Julia to replace our regex match with nothing, i.e. delete it. We want to replace the percentage sign at the start of %Lunar Rover with nothing—that is, delete it.

end

Finally, we close the “for loop” we started on the first line.

How to filter a dataframe in Julia

Here’s our spruced-up inventory dataframe again:

4×3 DataFrame
 Row │ item            id     kind
     │ String          Int64  String
─────┼───────────────────────────────────
   1 │ Mars Rover        100  Rover
   2 │ Venus Explorer    101  Spaceship
   3 │ Lunar Rover       102  Rover
   4 │ 30% Sun Filter    103  Sun Filter

Let’s say we want only the rows which mention “Rover”.

If you’re just joining us now, you can create the dataframe by inputting the following at the Julia prompt:

inventory = DataFrame(
	item = [
		"Mars Rover",
		"Venus Explorer",
		"Lunar Rover",
		"30% Sun Filter"
	],
	id = [
		100,
		101,
		102,
		103
	],
	kind = [
		"Rover",
		"Spaceship",
		"Rover",
		"Sun Filter"
	]
)

Now, to make a new dataframe named rovers containing just the rows with “Rover” in the item column, use this code:

rovers = filter(
	x -> any(occursin.(["Rover"], x.item)),
	inventory
)

We get:

2×3 DataFrame
 Row │ item         id     kind
     │ String       Int64  String
─────┼────────────────────────────
   1 │ Mars Rover     100  Rover
   2 │ Lunar Rover    102  Rover

Let’s walk through that code line by line:

rovers = filter(

Here we create a variable named rovers, which we’ll use to store the filtered, dataframe we’re about to create. We’ll use Julia’s built-in filter function to do so. This function takes two “arguments”, or inputs, inside its brackets—these are covered below.

x -> any(occursin.(["Rover"], x.item)),

The first “argument” is the rule for the filtering. In our case, we want to include only the rows which contain the word “Rover” in the item column.

If you look closely, our rule is actually a function. It’s not a named function like print().

Rather, it’s an “anonymous” function, because it doesn’t have a name. Specifically, it’s a “stabby lambda” anonymous function. “Stabby” because it has a stabby arrow, ->.

And the reason we’re using an anonymous rather than named function is because we’re only using it once, and it’s more concise this way. Whereas we might use our print function every second day, so it makes sense for it to have a name.

So what does this “stabby lambda” anonymous function actually do? It says for a given x, are there any occasions where "Rover" occurs in the item column (x.item). And because we’re using it inside our filter function, we are filtering by this rule, this anonymous function.

inventory

The second “argument” is the dataframe we want to reference. In our case, that’s the inventory dataframe.

)

Here we close the filter function we began using in the first line.

How to join two dataframes in Julia

Did you know Julia can join two tables just like SQL?

Creating our second dataframe

Let’s say we have another dataframe named sku which lists each item and its corresponding sku (stock keeping unit).

Here’s the Julia code to create the sku dataframe:

sku = DataFrame(
	item = [
		"Mars Rover",
		"Venus Explorer",
		"Lunar Rover",
		"30% Sun Filter"
	],
	sku = [
		34566,
		78945,
		15179,
		77254
		]
)

We get:

4×2 DataFrame
│ Row │ item           │ sku   │
│     │ String         │ Int64 │
├─────┼────────────────┼───────┤
│ 1   │ Mars Rover     │ 34566 │
│ 2   │ Venus Explorer │ 78945 │
│ 3   │ Lunar Rover    │ 15179 │
│ 4   │ 30% Sun Filter │ 77254 │

And here’s the code to create our original inventory dataframe:

inventory = DataFrame(
	item = [
		"Mars Rover",
		"Venus Explorer",
		"Lunar Rover",
		"30% Sun Filter"
	],
	id = [
		100,
		101,
		102,
		103
	],
	kind = [
		"Rover",
		"Spaceship",
		"Rover",
		"Sun Filter"
	]
)

We get:

4×3 DataFrame
 Row │ item            id     kind
     │ String          Int64  String
─────┼───────────────────────────────────
   1 │ Mars Rover        100  Rover
   2 │ Venus Explorer    101  Spaceship
   3 │ Lunar Rover       102  Rover
   4 │ 30% Sun Filter    103  Sun Filter

Let’s join those two dataframes in a new dataframe named inventory_sku. It’s only one line of code:

inventory_sku = outerjoin(inventory, sku, on = :item)

We get:

4×4 DataFrame
 Row │ item            id     kind        sku
     │ String          Int64  String      Int64
─────┼──────────────────────────────────────────
   1 │ Mars Rover        100  Rover       34566
   2 │ Venus Explorer    101  Spaceship   78945
   3 │ Lunar Rover       102  Rover       15179
   4 │ 30% Sun Filter    103  Sun Filter  77254

Let’s walk through this code. Since it’s only one line of code, we’ll break it up into parts to discuss.

First, we create a new variable, inventory_sku that we’ll use to store the joined table.

Next, outerjoin() is the function, the instruction to Julia that we want to join two dataframes together. The types of join you can do are innerjoin, leftjoin, rightjoin, outerjoin, semijoin, antijoin, or crossjoin. I chose outerjoin, but since our two dataframes are a perfect match, we could have used innerjoin, leftjoin, or rightjoin and received the same result.

The first two items (known as “arguments”) inside the join() function are the two dataframes we want to join, inventory and sku. Whichever dataframe you mention first inside the join() function will go on the left of our joined dataframe. The unique columns from the other dataframe will go on the right.

The third “argument” inside the join() function is on = :item.

This simply means that the :item column is the one we’re using to match the correct rows from each dataframe. In SQL and other database languages, this is called a “primary key”.

More complex dataframe joins in Julia

Now, this was an easy join. We had four items in inventory and the same four items in sku. There were no missing rows, in other words.

In addition, this join was super easy because the item column had the same name in both dataframes.

If it had been named item in one dataframe, and item_name in the other, we would have had to rename one column. Let’s take a look at how to rename columns next.

How to rename columns in a Julia dataframe

OK, so we want to rename the pings_sum column from the dataframe above so it becomes pings.

We do that like this:

rename!(hubble_pings_weekly, :pings_sum => :pings)

We get:

2×2 DataFrame
│ Row │ iso_week │ pings     │
│     │ Int64    │ Int64     │
├─────┼──────────┼───────────┤
│ 1   │ 12       │ 4850      │
│ 2   │ 13       │ 5199      │

How to add an array as a column to a dataframe in Julia

Julia makes it trivial to add a new column to your dataframe, especially if you already have the data in array form.

For example, if we have the following dataframe we’ve been using for some of this tutorial:

inventory = DataFrame(
	item = [
		"Mars Rover",
		"Venus Explorer",
		"Lunar Rover",
		"30% Sun Filter"
	],
	id = [
		100,
		101,
		102,
		103
	],
	kind = [
		"Rover",
		"Spaceship",
		"Rover",
		"Sun Filter"
	]
)

… it looks like this:

4×3 DataFrame
 Row │ item            id     kind
     │ String          Int64  String
─────┼───────────────────────────────────
   1 │ Mars Rover        100  Rover
   2 │ Venus Explorer    101  Spaceship
   3 │ Lunar Rover       102  Rover
   4 │ 30% Sun Filter    103  Sun Filter

Let’s say we want to add a new column called condition, where the values can be Excellent, Fair or Poor.

And let’s say you already have an array with the condition of each item:

condition_array = ["Fair", "Excellent", "Fair", "Poor"]

To add condition_array as a column to our inventory dataframe, you do this:

inventory[!, :condition] = condition_array

When we call up our newly-expanded dataframe:

inventory

We get:

4×4 DataFrame
 Row │ item            id     kind        condition
     │ String          Int64  String      String
─────┼──────────────────────────────────────────────
   1 │ Mars Rover        100  Rover       Fair
   2 │ Venus Explorer    101  Spaceship   Excellent
   3 │ Lunar Rover       102  Rover       Fair
   4 │ 30% Sun Filter    103  Sun Filter  Poor

Let’s walk through that code. First, we’re telling Julia to add a new column (:condition) to each row (!) of the inventory dataframe, and that it should get this data from the array named condition_array (on the right hand side of the equals sign).

How to reorder the columns of a dataframe in Julia

Let’s say we have the inventory dataframe above. If you’ve just arrived, and want to recreate this in the Julia REPL, here’s the code:

using DataFrames

inventory = DataFrame(
	item = [
		"Mars Rover",
		"Venus Explorer",
		"Lunar Rover",
		"30% Sun Filter"
	],
	id = [
		100,
		101,
		102,
		103
	],
	kind = [
		"Rover",
		"Spaceship",
		"Rover",
		"Sun Filter"
	],
	condition = [
		"Fair",
		"Excellent",
		"Fair",
		"Poor"
	]
)

Now let’s say we want the id column to go first—you know, just like in most databases.

Here’s the code to do that:

select!(inventory, :id, :item, :kind, :condition)

Now if we input our inventory variable into the terminal so we can see what it looks like, we get:

4×4 DataFrame
 Row │ id     item            kind        condition
     │ Int64  String          String      String
─────┼──────────────────────────────────────────────
   1 │   100  Mars Rover      Rover       Fair
   2 │   101  Venus Explorer  Spaceship   Excellent
   3 │   102  Lunar Rover     Rover       Fair
   4 │   103  30% Sun Filter  Sun Filter  Poor

Of course, you can pick any order you like for the columns.

Here’s what our code is actually doing: we’re using the select!() function that ships with the DataFrames package to mutate our inventory dataframe so that:

  • it has only the columns we specify (we chose all of them)
  • in the order we set (we changed the order).

This change is permanent (unless we explicitly do something else to mutate our dataframe further). We know it will be permanent because we’re using a select!() function (with an exclamation mark) instead of select() (with no exclamation mark). Functions in Julia which have a trailing exclamation mark in their name are called “bang functions” and mutate their targets in a way that persists.

Non-permanent column reordering

If you want to leave the original inventory dataframe alone, you could always create a new dataframe—say, inventory_new—and use a non-bang function:

inventory_new = select(inventory, :id, :item, :kind, :condition)

This will yield a dataframe called inventory_new which looks like the original inventory dataframe, except with the id column at the start.

If you start in a new REPL for this part of the exercise and declare inventory from scratch (using the code at the start of this section), you’ll see that it never changes. The only changes will be reflected in our new variable, inventory_new.

How to delete a column from a dataframe in Julia

This section is very similar to the previous one on how to reorder Julia dataframe columns.

Here, we’ll simply select!() all the columns we want, and none of the ones we don’t.

First, here’s the code to create the inventory dataframe if you’re just joining us:

using DataFrames

inventory = DataFrame(
	item = [
		"Mars Rover",
		"Venus Explorer",
		"Lunar Rover",
		"30% Sun Filter"
	],
	id = [
		100,
		101,
		102,
		103
	],
	kind = [
		"Rover",
		"Spaceship",
		"Rover",
		"Sun Filter"
	],
	condition = [
		"Fair",
		"Excellent",
		"Fair",
		"Poor"
	]
)

We get:

4×4 DataFrame
 Row │ item            id     kind        condition
     │ String          Int64  String      String
─────┼──────────────────────────────────────────────
   1 │ Mars Rover        100  Rover       Fair
   2 │ Venus Explorer    101  Spaceship   Excellent
   3 │ Lunar Rover       102  Rover       Fair
   4 │ 30% Sun Filter    103  Sun Filter  Poor

Now let’s say we don’t want the condition column any more.

To permanently delete the condition column, do this:

select!(inventory, :item, :id, :kind)

We get:

4×3 DataFrame
 Row │ item            id     kind
     │ String          Int64  String
─────┼───────────────────────────────────
   1 │ Mars Rover        100  Rover
   2 │ Venus Explorer    101  Spaceship
   3 │ Lunar Rover       102  Rover
   4 │ 30% Sun Filter    103  Sun Filter

All we’re doing here is using the DataFrames package’s in-built function select!() to—you guessed it—select the columns we want.

We have deleted the condition column because we omitted it from the list of desired columns.

The exclamation mark after select means the change will be permanent—unless of course, we explicitly add the column back later.

You can even delete a column and reorder the dataframe at the same time. With the resultant dataframe immediately above, we can:

  • delete the kind column
  • reorder the remaining two columns so that the id column comes first

We do that like this:

select!(inventory, :id, :item)

We get:

4×2 DataFrame
 Row │ id     item
     │ Int64  String
─────┼───────────────────────
   1 │   100  Mars Rover
   2 │   101  Venus Explorer
   3 │   102  Lunar Rover
   4 │   103  30% Sun Filter

We’re down to two columns and now the id column is first.

Non-destructive column deletion

You can also opt to create a new dataframe instead of destructively editing the original one.

If we started again with the original four-column inventory dataframe (see code at the top of this section), we could create a new dataframe without the condition column like this:

inventory_new = select(inventory, :id, :item, :kind)

This wouldn’t change our inventory dataframe at all, because we used the version of select() without an exclamation mark—the non-bang function version. This means our inventory dataframe won’t be mutated; all changes will instead be reflected in our new variable named inventory_new.

How to write a CSV from a dataframe in Julia

It’s easy to write a CSV from our above inventory dataframe.

First, you’ll need to install the CSV.jl package if you haven’t already. (I covered how to install the DataFrames.jl package earlier—simply follow that but add CSV in the pkg> area as well.)

Now, assuming you’ve already told Julia that you’re using the DataFrames and CSV packages:

using DataFrames, CSV

… then we can write our CSV like this:

CSV.write("/Users/ron/Desktop/inventory.csv", inventory)

This takes our inventory dataframe and writes a CSV named inventory.csv to my Mac’s desktop (you can save it wherever you’d like).

How to write a TSV from a dataframe in Julia

If we want to write a TSV (tab-separated values) file from a Julia dataframe, then we use the same process as writing a CSV except for one little change.

Again, you’ll need the DataFrames and CSV packages installed—hit the right square bracket key ] to enter the package manager area, then type add DataFrames, CSV. Once done, hit delete to exit the package manager and return to the Julia prompt.

You’ll also need to tell Julia you want to use these packages:

using DataFrames, CSV

Now let’s write the inventory dataframe we’ve been working on as a TSV:

CSV.write("/Users/ron/Desktop/inventory.csv", inventory; delim="\t")

The only difference here versus in the earlier example where we wrote a CSV file is that we’ve told Julia to make the delimiter (delim) a tab character (which we write using the regex \t).

And voila … Julia will write our dataframe as a .tsv file.

How to import a CSV as a dataframe in Julia

Once we’ve installed the CSV.jl package already (see above section on writing CSV files), then we tell Julia we want to use it:

using CSV

Now we can import a CSV as a dataframe into Julia like this:

df = CSV.read("my-file.csv", DataFrame)

In the code above, we’re telling Julia that we want to create a new dataframe, and that we want to call it df. (You can name it whatever you want, though.)

Next, we tell Julia that we have a CSV file, and where to find it. In the example above, my-file.csv is located in the same folder as the Julia script that calls it.

However, if the CSV were in a different folder, you could reference it relative to the Julia script. For example, if the CSV is one folder higher than the Julia script, on a Mac you’d write this:

df = CSV.read("../my-file.csv", DataFrame)

(The two dots .. signify the parent directory in Unix and Mac systems.)

You can also reference the absolute location of the CSV. For instance, if the CSV were on my Mac desktop, I’d write this:

df = CSV.read("/Users/ron/Desktop/my-file.csv", DataFrame)

Finally, the DataFrame “argument” tells Julia to create a dataframe from the CSV.

How to import a TSV as a dataframe in Julia

For the example I tested, you don’t actually need to do anything different to import a TSV as a Julia dataframe; Julia and/or its DataFrames package automatically detects that tabs are used as separators instead of commas.

However, you can choose to specify the delimiter, like so:

df = CSV.read("/Users/ron/Desktop/my-file.csv", DataFrame; delim="\t")

… where, as in the example above on writing TSV files, the only difference to the CSV version is that we add ; delim="\t" to our code.

By the way, if you’re not familiar with regex, then \t stands for a tab character.

Next up: pivot tables in Julia.

How to use pivot tables with Julia using the FreqTables package

Let’s say we have the following table:

itemmaterialcategory
chairwoodfurniture
cricket batwoodsporting goods
footballleathersporting goods
bookshelfwoodfurniture

… and we want to use a pivot table in Julia so that we end up with:

category / materialleatherwood
furniture01
sporting goods11

It’s actually super simple when you know how.

First, though, if you want to play along, here’s the Julia code to get you started.

Let’s tell Julia we want to use the DataFrames package:

using DataFrames

Now let’s create the dataframe for the first table above:

df = DataFrame(
	item = [
		"chair",
		"cricket bat",
		"football",
		"bookshelf"
	],
	material = [
		"wood",
		"wood",
		"leather",
		"wood"
	],
	category = [
		"furniture",
		"sporting goods",
		"sporting goods",
		"furniture"
	]
)

To create the pivot table we want, we’ll first need to install the FreqTables Julia package.

Now we need to tell Julia we want to use the package:

using FreqTables

Finally, we can create our pivot table. I’ve chosen to store this in a new variable called pivot, but you can name yours whatever you like:

pivot = freqtable(df, :category, :material)

We’ll do a code walkthrough in a minute, but here’s what we get:

2×2 Named Matrix{Int64}
category ╲ material │ leather     wood
────────────────────┼─────────────────
furniture           │       0        2
sporting goods      │       1        1

Easy when you know how, huh?

Okay, let’s do that code walkthrough:

pivot

As mentioned earlier, this is the name of the variable we’ll be storing our pivot table in.

freqtable()

This is the inbuilt function within the FreqTables package we just installed that actually generates a pivot table for us.

df, :category, :material

These are the arguments we supply to the freqtables() function. The first argument is the dataframe (in our case, df) to be pivoted; the second argument (in our example, :category) is where we specify what we want for the rows in our pivot table; and the third is what we want for our columns—in our case, this is :material.

How to convert a FreqTables pivot table to a dataframe or CSV in Julia

Unfortunately, there are a few more steps if we want to get the pivot table into a dataframe for further analysis, or export it into a CSV.

Still, these steps are relatively straightforward, and are vastly preferable to importing data into Google Sheets to pivot it (which is what I used to do numerous times a day).

Anyway, here’s you convert a pivot table into a dataframe or CSV:

First, we convert our pivot table, which we named pivot, into a dataframe. Let’s call this new dataframe df_pivot:

df_pivot = DataFrame(pivot, [:leather, :wood])

Inside the DataFrame() function, we have to supply two arguments:

  1. the object—in our case, a Named Matrix we called pivot—to be converted into a dataframe
  2. an array—in our case [:leather, :wood]—containing the names of the columns we want in our new dataframe.

We end up with:

2×2 DataFrame
 Row │ leather  wood
     │ Int64    Int64
─────┼────────────────
   1 │       0      2
   2 │       1      1

Okay, so we have a dataframe, and yes, the values are there, but we’re missing the category names (“furniture” and “sporting goods”).

We’re going to have to go back to our original pivot to get these, and then add them as a column to our new df_pivot dataframe.

Let’s do this now. Let’s get the category names (“furniture” and “sporting goods”) and store them in a new variable we’ll call cats:

cats = names(pivot)[1]

This gives us:

2-element Vector{String}:
 "furniture"
 "sporting goods"

That’s the data we want—now to add it as a column to our df_pivot dataframe:

df_pivot[!, :category] = cats

When we have Julia show us our updated df_pivot, we get:

2×3 DataFrame
 Row │ leather  wood   category
     │ Int64    Int64  String
─────┼────────────────────────────────
   1 │       0      2  furniture
   2 │       1      1  sporting goods

Okay, everything we want is there. Let’s just reorder the columns so we have the category column at the start:

select!(df_pivot, :category, :leather, :wood)

Now we have our dataframe looking very similar to the original output from FreqTables. This is just what we want for further analysis or exporting to CSV:

2×3 DataFrame
 Row │ category        leather  wood
     │ String          Int64    Int64
─────┼────────────────────────────────
   1 │ furniture             0      2
   2 │ sporting goods        1      1

Okay, so there was a fair bit there. Now it’s time to look at another method of making pivot tables in Julia.

How to use pivot tables using the combine and groupby functions

In this section, we’ll use the DataFrames package exclusively to create our pivot table—we won’t be using the FreqTables package we relied on in the last section.

Let’s say we have a dataframe which stored the (fictional) number of pings the Hubble telescope received on each day over a two week period.

In this section, we’re going to use a pivot table to roll up the daily Hubble telescope ping data into weeks.

First, we’re going to use Julia’s inbuilt Dates package.

Unlike the DataFrames package that we installed around the time we began this tutorial, we don’t have to install the Dates package—it’s built into the Julia application.

However, we do have to tell Julia that we want to use it. The box below describes how:

Onwards. So, if you haven’t already, enter the code below into your Julia terminal:

using Dates

Now let’s create our Hubble telescope pings dataframe with the code below. I’ve shown it in elongated form so you don’t have to scroll horizontally to see it all.

hubble_pings = DataFrame(
	date = [
		Date(2020,03,16),
		Date(2020,03,17),
		Date(2020,03,18),
		Date(2020,03,19),
		Date(2020,03,20),
		Date(2020,03,21),
		Date(2020,03,22),
		Date(2020,03,23),
		Date(2020,03,24),
		Date(2020,03,25),
		Date(2020,03,26),
		Date(2020,03,27),
		Date(2020,03,28),
		Date(2020,03,29)
	],
	iso_week = [
		12,
		12,
		12,
		12,
		12,
		12,
		12,
		13,
		13,
		13,
		13,
		13,
		13,
		13
	],
	pings = [
		774,
		565,
		801,
		739,
		671,
		702,
		598,
		740,
		598,
		761,
		758,
		733,
		803,
		806
	]
)

We get:

14×3 DataFrame
│ Row │ date       │ iso_week │ pings │
│     │ Date       │ Int64    │ Int64 │
├─────┼────────────┼──────────┼───────┤
│ 1   │ 2020-03-16 │ 12       │ 774   │
│ 2   │ 2020-03-17 │ 12       │ 565   │
│ 3   │ 2020-03-18 │ 12       │ 801   │
│ 4   │ 2020-03-19 │ 12       │ 739   │
│ 5   │ 2020-03-20 │ 12       │ 671   │
│ 6   │ 2020-03-21 │ 12       │ 702   │
│ 7   │ 2020-03-22 │ 12       │ 598   │
│ 8   │ 2020-03-23 │ 13       │ 740   │
│ 9   │ 2020-03-24 │ 13       │ 598   │
│ 10  │ 2020-03-25 │ 13       │ 761   │
│ 11  │ 2020-03-26 │ 13       │ 758   │
│ 12  │ 2020-03-27 │ 13       │ 733   │
│ 13  │ 2020-03-28 │ 13       │ 803   │
│ 14  │ 2020-03-29 │ 13       │ 806   │

All that work we did earlier with setting up our dates correctly has paid off—notice how the type of the date column is Date, not String.

Now, you might be wondering what the iso_week column is. That’s just the International Organization for Standardization week number for the given date.

In our dataframe, we have seven days in Week 12 (of the year 2020), and another seven in Week 13 (also of the year 2020).

We’re going to use that iso_week column to pivot our data to get our weekly pings total.

The Julia code to make a pivot table is quite simple:

hubble_pings_weekly = combine(
	groupby(hubble_pings, :iso_week),
	:pings => sum
	)

Let’s walk through this code.

First, we tell Julia that we want to store our forthcoming pivot table in a new variable, hubble_pings_weekly.

Next, we use the combine() function—this is built in to the DataFrames.jl Julia package.

Inside combine(), we’re using two arguments.

The first is another function, groupby() (also built in to DataFrames.jl).

So we’re using a function—groupby()—inside another function, combine(). groupby() is therefore a nested function.

Inside the nested function, groupby(), we use two arguments. The first is hubble_pings. This tells Julia the name of the source dataframe we want to pivot.

The second argument inside groupby() is :iso_week. This tells Julia the column in the source dataframe we want to use for grouping.

Okay, so that’s groupby() done and dusted. Now we need to include the final argument inside combine(). We need to tell DataFrames.jl how we want to combine the data. We do so with the following argument: :pings => sum.

This tells Julia to take a given column, :pings, in the source dataframe and to sum them in our pivot table.

Take a look at our pivot table below—if you’re playing along, input hubble_pings_weekly into your terminal.

We get:

2×2 DataFrame
│ Row │ iso_week │ pings_sum │
│     │ Int64    │ Int64     │
├─────┼──────────┼───────────┤
│ 1   │ 12       │ 4850      │
│ 2   │ 13       │ 5199      │

So, from the above dataframe (which we named hubble_pings_weekly), we can see there were 4,850 pings from the Hubble telescope in Week 12, and 5,199 pings in Week 13.

Why pivot tables are better in Julia compared to Excel or Google Sheets

The reason I love Julia pivot tables is they are a first-class citizen. You can take this hubble_pings_weekly dataframe, run transformations on it, and it works just like any other Julia dataframe.

By contrast, pivot table worksheets in Excel or Google Sheets are decidedly not first-class citizens. Far too often, I had to manually copy and paste an Excel or Google Sheets pivot table into a new worksheet, just so I could run further transformations and analysis on it.

Get Julia tips in your inbox a few times per year. Unsubscribe anytime.

Despite being an amateur at Julia programming and having no prior web-scraping experience, a JuliaSchool tutorial helped me learn enough to make a contribution to an open-source project in the Julia community using these tools, all in the course of a single afternoon. Thanks JuliaSchool! - Joe from Germany