Using pyarrow, one can get the length of a table stored in a .parquet file immediately:
from pyarrow.parquet import read_table
table = read_table('my_table.parquet')
print(len(table))
Dask dataframes instead store this quantity as a Delayed. I have a use case in which I only rarely modify the dask dataframe that I load from disk, and I would like to know its length immediately.
Would it be possible for dask dataframes to read the length of a table from the .parquet metadata and store it, so that this value can be returned immediately as long as the table length has not been modified? Of course, if an operation such as subsetting is performed, the new length must be computed. But dask could know which operations may alter the length and which cannot, so returning the cached length would be a nice feature.