Revision as of 13:00, 14 September 2023

Overview

Parquet is a file format that stores schema, columnar data, and columnar metadata, typically across a partitioned collection of files. The partitioned files let many processes work on the data in parallel, the columnar layout makes aggregate values fast to compute for data analytics, and the columnar metadata gives each process the information it needs to locate fields quickly.
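As a rough illustration of why a columnar layout with per-column metadata helps, the sketch below pivots a few rows into columns and records a minimum and maximum for each one. It is a toy model in plain Python, not the Parquet implementation: the sample rows and the helper names `to_columns` and `column_stats` are made up for this example.

```python
# Toy illustration only: plain Python lists stand in for Parquet column chunks.
rows = [
    {"city": "Oslo", "year": 2022, "sales": 10.0},
    {"city": "Oslo", "year": 2023, "sales": 12.5},
    {"city": "Lima", "year": 2023, "sales": 7.5},
]


def to_columns(rows):
    """Pivot row dicts into one list per column (the columnar layout)."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def column_stats(columns):
    """Record (min, max) per column, a stand-in for columnar metadata."""
    return {name: (min(values), max(values)) for name, values in columns.items()}


columns = to_columns(rows)
stats = column_stats(columns)

# An aggregate touches only the one column it needs.
total_sales = sum(columns["sales"])

# The statistics let a reader skip this chunk without scanning it:
# no row can match year == 2021 because the recorded minimum is 2022.
year_min, year_max = stats["year"]
can_skip_for_2021 = not (year_min <= 2021 <= year_max)
```

The sum reads only the `sales` list, and the 2021 lookup is ruled out from the statistics alone, which is the kind of shortcut the columnar metadata is meant to provide.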

Partitioning

Partitioning your data well means each part file within a partition holds fewer rows, a greater percentage of those rows is relevant to your query, and fewer accesses to external datasets are required.

Reducing the row length means you can fit more columns, or a larger share of the total dataset, into a Parquet part file of the same size.

Having more columns means you can do more complex statistics without joining against other datasets. Increasing the share of the total dataset means you can get the same job done with fewer task nodes.
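The trade-off between row length and the number of part files can be sketched with simple arithmetic. The numbers here are assumptions chosen for illustration (a 128 MiB target file size, 10 million rows, and uncompressed row widths of 400 and 100 bytes), and the calculation ignores compression and encoding, which change real file sizes considerably.

```python
def rows_per_part_file(target_file_bytes, bytes_per_row):
    """How many fixed-width rows fit in one part file of the target size."""
    return target_file_bytes // bytes_per_row


def part_files_needed(total_rows, target_file_bytes, bytes_per_row):
    """Number of part files required to hold every row (ceiling division)."""
    per_file = rows_per_part_file(target_file_bytes, bytes_per_row)
    return -(-total_rows // per_file)


TARGET_FILE_BYTES = 128 * 1024 * 1024  # assumed target size, not a Parquet rule
TOTAL_ROWS = 10_000_000                # assumed dataset size

wide_files = part_files_needed(TOTAL_ROWS, TARGET_FILE_BYTES, 400)
narrow_files = part_files_needed(TOTAL_ROWS, TARGET_FILE_BYTES, 100)
```

Under these assumptions the narrower rows need far fewer part files, so each file carries a larger share of the dataset and fewer tasks are needed to cover it.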

But the cost is that you are splitting up the dataset. If your query mismatches the index, you lose all the benefit of indexing and may introduce memory footprint problems for a single task.
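The cost of a mismatched query can be shown with a toy model in which a dictionary of lists stands in for partition directories. This is a sketch under stated assumptions, not how any Parquet reader is implemented: the `partition_by` and `query` helpers and the sample rows are hypothetical.

```python
from collections import defaultdict


def partition_by(rows, key):
    """Group rows by the value of `key`, one bucket per partition."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return dict(parts)


def query(parts, partition_key, column, value):
    """Return the matching rows and the number of partitions that were read."""
    if column == partition_key:
        # The filter matches the partitioning: read at most one partition.
        selected = [parts[value]] if value in parts else []
    else:
        # The filter does not match: every partition has to be scanned.
        selected = list(parts.values())
    matches = [row for part in selected for row in part if row[column] == value]
    return matches, len(selected)


rows = [
    {"year": 2021, "city": "Oslo"},
    {"year": 2022, "city": "Oslo"},
    {"year": 2022, "city": "Lima"},
    {"year": 2023, "city": "Lima"},
]
parts = partition_by(rows, "year")

aligned_rows, aligned_reads = query(parts, "year", "year", 2022)
mismatched_rows, mismatched_reads = query(parts, "year", "city", "Lima")
```

Both queries return two rows, but the one filtering on `year` reads a single partition while the one filtering on `city` has to read all three, so a single task may end up holding far more data than the partitioning was designed for.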



Category:Parquet: Difference between revisions
