Get length of parquet-backed dask dataframe from metadata #9973

@LucaMarconato

Description

Using pyarrow, one can get the length of a table stored in a .parquet file immediately.

from pyarrow.parquet import read_table
# read_table loads the full table into memory; len() then gives the row count
table = read_table('my_table.parquet')
print(len(table))

Dask dataframes instead store this quantity as a Delayed. I have a use case in which I only rarely modify the dask dataframe that I load from disk, and I would like to know its length immediately.

Would it be possible for dask dataframes to read the table length from the .parquet metadata and store it, so that the value can be returned immediately as long as the table length has not been modified? Of course, if an operation such as subsetting is performed, the new length must be computed. But dask may know which operations can alter the length and which cannot, so returning the cached length could be a nice feature.
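To make the request concrete, here is a rough sketch of the caching behaviour I have in mind. It is not dask code: `KnownLengthFrame`, `apply` and the `preserves_length` flag are all hypothetical names, and a plain Python list stands in for the dataframe so the snippet runs without dask.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class KnownLengthFrame:
    """Hypothetical wrapper pairing a frame with an optionally known row count."""

    frame: Any
    known_len: Optional[int] = None  # e.g. taken from the Parquet metadata

    def apply(self, func: Callable[[Any], Any], preserves_length: bool) -> "KnownLengthFrame":
        # Keep the cached length only if the operation cannot change the row count.
        new_len = self.known_len if preserves_length else None
        return KnownLengthFrame(func(self.frame), new_len)

    def __len__(self) -> int:
        if self.known_len is not None:
            return self.known_len  # returned immediately, nothing is computed
        return len(self.frame)  # fallback: compute the length


# A list stands in for the dataframe in this sketch.
kf = KnownLengthFrame([1, 2, 3], known_len=3)
doubled = kf.apply(lambda f: [v * 2 for v in f], preserves_length=True)
filtered = kf.apply(lambda f: [v for v in f if v > 1], preserves_length=False)

print(doubled.known_len)   # cached length carried over
print(filtered.known_len)  # cache dropped after subsetting
```

In a real implementation, the knowledge of which operations preserve the length would presumably live inside dask rather than being passed in by the caller.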

Labels: needs triage (Needs a response from a contributor)