DuckDB and Parquet
DuckDB is an embedded database similar to SQLite. Though its primary selling point is "SQLite, but for OLAP", my favorite feature it boasts is how it can treat files as database tables transparently. Once you grok what DuckDB can do for you, it will become one of those tools that never leaves your PATH.
It's Probably a Duck
Recall the following adage:
If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.
A CSV file certainly looks like a database table, and with DuckDB it is a database table. As an example I grabbed a random CSV from here (note that there is a lot of fun data there). Let's see what we got:
duckdb -markdown -c "select * from 'example.csv' limit 5;"
| Year of Birth | Gender | Ethnicity | Child's First Name | Count | Rank |
|--------------:|--------|-----------|--------------------|------:|-----:|
| 2011 | FEMALE | HISPANIC | GERALDINE | 13 | 75 |
| 2011 | FEMALE | HISPANIC | GIA | 21 | 67 |
| 2011 | FEMALE | HISPANIC | GIANNA | 49 | 42 |
| 2011 | FEMALE | HISPANIC | GISELLE | 38 | 51 |
| 2011 | FEMALE | HISPANIC | GRACE | 36 | 53 |
Looks like this one is popular baby names. Let's see what the most popular name was:
duckdb -markdown -c "select * from 'example.csv' order by Rank asc limit 1;"
| Year of Birth | Gender | Ethnicity | Child's First Name | Count | Rank |
|--------------:|--------|-----------|--------------------|------:|-----:|
| 2011 | FEMALE | HISPANIC | ISABELLA | 331 | 1 |
As you can see, the sky is the limit. Whatever queries you can express in SQL you can now easily run against files. You could even join two different files and then aggregate the results (there is a quick sketch of that right after the list below). You can read from many kinds of files, including:
- JSON
- CSV
- Excel
- Parquet
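For example, here is a rough sketch of a join followed by an aggregation. The second file, ethnicity_groups.csv, and its Ethnicity and Category columns are made up for illustration; only example.csv is the real file from above:
duckdb -markdown -c "select g.Category, count(*) as names from 'example.csv' e join 'ethnicity_groups.csv' g on e.Ethnicity = g.Ethnicity group by g.Category order by names desc;"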
Of these formats, Parquet is particularly interesting, and the rest of this blog post will focus on it.
Parquet Format
Parquet is a format used to store data on disk, just like CSV. However, the similarities end there:
- Parquet is column oriented rather than row oriented
- Parquet files have metadata that allow readers to skip parts of it
- Parquet files have granular compression
- Parquet files have something akin to indexes with bloom filters
The net result is that Parquet files are generally much smaller on disk than CSVs, and they can be read much more efficiently by readers (like DuckDB).
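You can see this for yourself because DuckDB can also write Parquet: a COPY statement will convert the example.csv from earlier into a Parquet file (the output name example.parquet is just my choice):
duckdb -c "copy (select * from 'example.csv') to 'example.parquet' (format parquet);"
Comparing the two files with ls -lh afterwards should show the Parquet file coming in smaller, though by how much depends entirely on the data.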
This is a somewhat over-simplified view of a Parquet file's format:
Basically the file is a series of row groups. Each of these row groups can be read independently of the others. Within each row group the data is further divided by column. These per-row-group, per-column chunks are individually compressed and have their own bloom filters. The net result is that clients do not need to read the whole file as long as they know which columns in which row groups they care about.
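If you want to see this structure for a real file, DuckDB exposes it through the parquet_metadata table function, which returns one row per column chunk per row group. Here I am pointing it at the example.parquet created above; the exact set of columns it returns may vary a little between DuckDB versions:
duckdb -markdown -c "select row_group_id, path_in_schema, num_values, compression, total_compressed_size from parquet_metadata('example.parquet') limit 8;"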
Writing a Parquet File
We will use the parquet-go library to generate an example Parquet file. First we define the type of data in the file with a struct:
// Book represents a book with ID, Title, Author, and Genre.
// This is a row in the Parquet file.
type Book struct {
	ID     string `parquet:"id"`     // unique identifier for the book
	Title  string `parquet:"title"`  // title of the book
	Author string `parquet:"author"` // author of the book
	Genre  string `parquet:"genre"`  // genre of the book
}
Then we open a file for writing:
os.Remove(filePath)
file, err := os.Create(filePath)
if err != nil {
	panic(err)
}
defer file.Close()
Then we instantiate a Parquet writer:
compression := parquet.Compression(&parquet.Snappy)
bloom := parquet.BloomFilters(parquet.SplitBlockFilter(10, "id"))
writer := parquet.NewGenericWriter[Book](file, compression, bloom)
defer writer.Close()
Here we do some important configuration:
- Compress column chunks with snappy
- Add a per-column-chunk bloom filter on id
Then finally we can write the data:
for range numRowGroups {
	// generate the books for a row group
	var books []Book
	for range numBooks / numRowGroups {
		books = append(books, Book{
			ID:     gofakeit.UUID(),
			Title:  gofakeit.BookTitle(),
			Author: gofakeit.BookAuthor(),
			Genre:  gofakeit.BookGenre(),
		})
	}

	// add the books to a row group
	buffer := parquet.NewGenericBuffer[Book]()
	_, err := buffer.Write(books)
	if err != nil {
		panic(err)
	}

	// write the row group to the file
	_, err = writer.WriteRowGroup(buffer)
	if err != nil {
		panic(err)
	}
}
As you can see we have exacting control over how many records are in each row group. The library will automatically create the per column chunks within these row groups for us.
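As a quick sanity check, and assuming the output file is the books.parquet we will query later, you can count the row groups that actually ended up in the file with the same parquet_metadata function:
duckdb -c "select count(distinct row_group_id) as row_groups from parquet_metadata('books.parquet');"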
Reading from a Parquet File
Here we will use parquet-go to read from the file we just created. This is very low-level code, and it just serves to illustrate how clients can make use of Parquet's features to read more efficiently. First we open a file for reading:
file, err := os.Open(filePath)
if err != nil {
	panic(err)
}
defer file.Close()

info, err := file.Stat()
if err != nil {
	panic(err)
}
Then we instantiate a Parquet reader:
reader, err := parquet.OpenFile(file, info.Size())
if err != nil {
	panic(err)
}
Our goal here is to figure out which index a particular id sits at in the file while reading as little of the file itself as possible. To that end we will make use of our bloom filters:
var candidateRowGroups []parquet.RowGroup
for _, rowGroup := range reader.RowGroups() {
	columnChunk := rowGroup.ColumnChunks()[idColumn]
	bloomFilter := columnChunk.BloomFilter()
	if bloomFilter != nil {
		if ok, err := bloomFilter.Check(parquet.ByteArrayValue([]byte(id))); err != nil {
			panic(err)
		} else if !ok {
			continue // bloom filters may return false positives, but never false negatives
		}
	}
	candidateRowGroups = append(candidateRowGroups, rowGroup)
}
Rather than read every value in every column in every row in the file, we get to focus on just the column we care about with rowGroup.ColumnChunks()[idColumn]. That could be a huge win already if the file has lots of columns in it. But even better, all we have to do is load the bloom filter for that column in each row group and do a membership test. With that we can figure out which row groups we actually need to read. This is also a huge win: if you have a file with a dozen columns and a billion rows, and the row groups are only 100k rows each, you will be reading much less data this way.
Finally we need to read the candidate row groups and iterate through them to find the id:
for _, rowGroup := range candidateRowGroups {
	columnChunk := rowGroup.ColumnChunks()[idColumn]
	pages := columnChunk.Pages()
	defer pages.Close()

	var index int64
	for {
		page, err := pages.ReadPage()
		if err != nil {
			if errors.Is(err, io.EOF) {
				break
			}
			panic(err)
		}

		values := make([]parquet.Value, page.NumValues())
		_, err = page.Values().ReadValues(values)
		if err != nil && !errors.Is(err, io.EOF) {
			panic(err)
		}

		for _, value := range values {
			if value.String() == id {
				return index
			}
			index++
		}
	}
}
return -1
Now if all of this seems like a lot of work, that's because it is. Recall, however, that DuckDB can read from Parquet files. Much easier.
DuckDB and Parquet
Let's take a look at the file we generated:
duckdb -markdown -c "select * from 'books.parquet' limit 5;"
| id | title | author | genre |
|--------------------------------------|-------------------------|-------------------|------------|
| fd95a628-6137-4ea8-ae8d-6bdf13d99373 | The Old Man and the Sea | Yasunari Kawabata | Historical |
| 46df0003-2826-49e1-b357-aa1aa8d0b58c | Moby Dick | George Orwell | Saga |
| 4205d03b-8906-4412-af12-2f4654744194 | Wuthering Heights | Charles Dickens | Mystery |
| b76800d1-abde-4347-9759-6643812192a1 | Bostan | Franz Kafka | Political |
| 803ce992-40ec-4c75-915d-46c49ee40791 | War and Peace | William Faulkner | Romance |
Yup, it works. However, unlike with the CSV example we worked through earlier, here DuckDB can use all of the Parquet features we just exploited by hand to limit the amount of data it reads. While that is very useful locally, it's absolutely critical if you are consuming a Parquet file over the network, which you can do with DuckDB using httpfs. That extension allows you to directly consume files from cloud storage providers like S3 or GCS. Recall that the data in a Parquet file can be read per row group per column, and that those chunks are compressed. DuckDB will be clever here and:
- Download just the metadata
- Figure out which chunks it needs
- Download just those chunks
It could be the difference between downloading megabytes and gigabytes of data.
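As a rough sketch, assuming the books.parquet file were sitting in a public S3 bucket (the bucket name here is made up), the query would look like this:
duckdb -c "install httpfs; load httpfs; select count(*) from 's3://example-bucket/books.parquet';"
For a private bucket you would also need to configure S3 credentials first, but the query itself stays the same.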
Additional Notes on Parquet Files
Conclusion
DuckDB is a very cool tool that you will find all kinds of uses for the more you dig into its capabilities. Of the file formats it can read, Parquet is especially interesting because its feature set works extremely well with the DuckDB philosophy.