Bug: FilePruner cannot prune constant columns #23272

@niebayes

Describe the bug

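For a PartitionedFile with a constant column (its statistics report min(col) = max(col)), FilePruner fails to prune the file even when the predicate is false for that constant value. The trace below shows where the pruning opportunity is lost: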

constant_columns_from_stats()
  └─ detects sid min==max==5 → literal_columns = {"sid": 5}
      └─ replace_columns_with_literals()
          └─ sid > 100 → 5 > 100        ← column reference replaced by literal
              └─ FilePruner::try_new(5 > 100, schema, file)
                  └─ should_prune()
                      └─ build_pruning_predicate(5 > 100, schema) → None  ← constant expression yields no statistics-based predicate
                          └─ return Ok(false)                            ← file not pruned
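
The trace can be modeled with a small self-contained sketch. It does not use DataFusion: `Expr`, `fold_constant`, and this `should_prune` are hypothetical stand-ins, and only the name `replace_columns_with_literals` is borrowed from the trace. It illustrates one possible way to handle the rewritten predicate, assuming a column-free comparison can simply be evaluated: if it folds to `false`, no row in the file can match.

```rust
use std::collections::HashMap;

/// Toy expression tree; a stand-in for a physical expression, not DataFusion's type.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i32),
    Gt(Box<Expr>, Box<Expr>),
}

/// Replace every column that has a known constant value with that literal.
fn replace_columns_with_literals(expr: &Expr, constants: &HashMap<String, i32>) -> Expr {
    match expr {
        Expr::Column(name) => match constants.get(name) {
            Some(v) => Expr::Literal(*v),
            None => expr.clone(),
        },
        Expr::Literal(_) => expr.clone(),
        Expr::Gt(l, r) => Expr::Gt(
            Box::new(replace_columns_with_literals(l, constants)),
            Box::new(replace_columns_with_literals(r, constants)),
        ),
    }
}

/// Evaluate a literal-vs-literal comparison.
/// Returns None when a column reference remains, so nothing can be decided.
fn fold_constant(expr: &Expr) -> Option<bool> {
    match expr {
        Expr::Gt(l, r) => match (l.as_ref(), r.as_ref()) {
            (Expr::Literal(a), Expr::Literal(b)) => Some(a > b),
            _ => None,
        },
        _ => None,
    }
}

/// Hypothetical pruning check: prune only when the predicate folds to false.
fn should_prune(expr: &Expr) -> bool {
    fold_constant(expr) == Some(false)
}

fn main() {
    // sid > 100, with statistics saying sid is always 5 in this file.
    let predicate = Expr::Gt(
        Box::new(Expr::Column("sid".to_string())),
        Box::new(Expr::Literal(100)),
    );
    let constants = HashMap::from([("sid".to_string(), 5)]);

    let rewritten = replace_columns_with_literals(&predicate, &constants);
    println!("{:?} -> prune = {}", rewritten, should_prune(&rewritten));
}
```

In this model the rewritten predicate `5 > 100` is decidable without any statistics, which is the case the trace shows falling through to `Ok(false)`.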

To Reproduce

The following test, added to datafusion/datasource-parquet/src/opener/mod.rs, reproduces the bug:

    #[tokio::test]
    async fn test_prune_on_constant_columns() {
        let store = Arc::new(InMemory::new()) as Arc<dyn ObjectStore>;

        let batch = record_batch!((
            "sid",
            Int32,
            vec![Some(5), Some(5), Some(5), Some(5), Some(5)]
        ))
        .unwrap();
        let data_size =
            write_parquet(Arc::clone(&store), "constant.parquet", batch.clone()).await;
        let schema = batch.schema();

        let file = PartitionedFile::new(
            "constant.parquet".to_string(),
            u64::try_from(data_size).unwrap(),
        )
        .with_statistics(Arc::new(
            Statistics::new_unknown(&schema).add_column_statistics(
                ColumnStatistics::new_unknown()
                    .with_min_value(Precision::Exact(ScalarValue::Int32(Some(5))))
                    .with_max_value(Precision::Exact(ScalarValue::Int32(Some(5))))
                    .with_null_count(Precision::Exact(0)),
            ),
        ));

        let make_opener = |predicate| {
            let metrics = ExecutionPlanMetricsSet::new();
            let morselizer = ParquetMorselizerBuilder::new()
                .with_store(Arc::clone(&store))
                .with_schema(Arc::clone(&schema))
                .with_projection_indices(&[0])
                .with_predicate(predicate)
                .with_metrics(metrics.clone())
                .build();
            (morselizer, metrics)
        };

        let expr = col("sid").gt(lit(100));
        let predicate = logical2physical(&expr, &schema);
        let (opener, metrics) = make_opener(predicate);
        let stream = open_file(&opener, file).await.unwrap();
        let (_num_batches, _num_rows) = count_batches_and_rows(stream).await;

        // The bug: `files_ranges_pruned_statistics` pruned count is 0 because
        // `constant_columns_from_stats` folds the column reference into a literal,
        // and `FilePruner` cannot prune a constant expression.
        let pruned = {
            use datafusion_physical_plan::metrics::MetricValue;
            metrics
                .clone_inner()
                .iter()
                .filter_map(|m| match m.value() {
                    MetricValue::PruningMetrics {
                        name,
                        pruning_metrics,
                    } if name.as_ref() == "files_ranges_pruned_statistics" => {
                        Some(pruning_metrics.pruned())
                    }
                    _ => None,
                })
                .sum::<usize>()
        };
        assert!(
            pruned > 0,
            "file with constant column should be pruned at file level, got pruned={pruned}"
        );
    }

Expected behavior

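The file should be pruned at the file level: with min(sid) = max(sid) = 5, the predicate sid > 100 can never be true, so the pruned count of files_ranges_pruned_statistics should be greater than 0.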
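
As a minimal sketch of the condition the reproduction expects to hold, the check for a `col > threshold` predicate against exact min/max statistics can be written as below. `ColumnRange` and `can_prune_gt` are hypothetical names for illustration, not DataFusion APIs, and the sketch assumes exact statistics and no nulls, as in the test.

```rust
/// Exact min/max of one column in one file (hypothetical stand-in for file statistics).
struct ColumnRange {
    min: i32,
    max: i32,
}

impl ColumnRange {
    /// A column is constant within the file when min and max coincide.
    fn is_constant(&self) -> bool {
        self.min == self.max
    }
}

/// True when no row can satisfy `col > threshold`, i.e. even the largest
/// value in the file does not exceed the threshold.
fn can_prune_gt(range: &ColumnRange, threshold: i32) -> bool {
    range.max <= threshold
}

fn main() {
    // Statistics from the reproduction: sid is 5 in every row.
    let sid = ColumnRange { min: 5, max: 5 };
    println!(
        "constant = {}, prune for sid > 100 = {}",
        sid.is_constant(),
        can_prune_gt(&sid, 100)
    );
}
```

A constant column is just the special case where the range collapses to a single value, so the same min/max reasoning applies to it.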

Additional context

No response
