Parquet Java ALP Implementation#3397
Conversation
Implements ALP encoding for FLOAT and DOUBLE types, which converts floating-point values to integers using decimal scaling, then applies Frame of Reference (FOR) encoding and bit-packing for compression. New files: - AlpConstants.java: Constants for ALP encoding - AlpEncoderDecoder.java: Core encoding/decoding logic - AlpValuesWriter.java: Writer implementation - AlpValuesReaderForFloat/Double.java: Reader implementations Includes comprehensive unit tests and interop test infrastructure. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Restore original comment indentation that was accidentally changed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Escape <= characters as <= in javadoc comments to avoid malformed HTML errors during documentation generation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ALP encoding is not yet part of the parquet-format Thrift specification, so it cannot be converted to org.apache.parquet.format.Encoding. Skip it in the testEnumEquivalence test and add a clear error message in the converter for when ALP conversion is attempted. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
size and add independent reader/writer verification tests
93c365a to
6d65eaa
Compare
Switch encode/decode from division-based formula to multiply-by-reciprocal using separate POW10_NEGATIVE arrays, matching C++ Arrow's approach: - Encode: fastRound(value * POW10[e] * POW10_NEGATIVE[f]) - Decode: encoded * POW10[f] * POW10_NEGATIVE[e] Add fastRound helpers with sign branching for correct negative value rounding. Remove version byte from page header (8 -> 7 bytes). Empty pages now emit a 7-byte header with numElements=0. Update all hand-crafted binary tests to match the new header format and add comprehensive end-to-end tests for overflow boundaries, large-scale data, preset caching, and NaN bit-pattern preservation.
- Rewrite TestInterOpReadAlp to use LocalInputFile instead of Hadoop FileSystem, fixing failures on Java 24+ where Subject.getSubject is removed. Tests now read C++ ALP parquet files directly without going through Hadoop security/UGI. - Add AlpExceptionCountTest with per-column exception rate reporting against the real Spotify and Arade floating-point datasets from the parquet-testing repository. Useful for comparing Java vs C++ ALP compression ratios.
- Switch findBestFloatParams/findBestDoubleParams from minimizing exception count to minimizing estimated compressed size (length * bitWidth + exceptions * (typeSize + 2 bytes)), matching the C++ ALP cost model. This closes the ~4-5% compression gap vs C++. - Rewrite sampler to collect evenly-spaced sample vectors and run findBestParams on each, then rank by win count. Matches C++ AlpSampler behavior more closely than the previous HashMap-based approach. - Minor fixes: IOExceptionUtils null check, MemoryManager volatile scale, Files utility cleanup, parquet-cli dependency update.
15bc06d to
24c23e5
Compare
- Move shared LE helper methods (getShortLE/getIntLE/getLongLE) to AlpValuesReader base class; remove duplicates from subclasses - Make EncodingParams fields package-private (remove public modifier) - Replace fully-qualified java.util.Arrays.fill calls with imported Arrays.fill in both float and double readers; add missing import to double reader - Add explanatory comments to getBufferedSize() magic numbers (3 for float, 5 for double) explaining the overhead breakdown - Add ALP enabled state to ParquetProperties.toString() - Add ALP support to DefaultV1ValuesWriterFactory for float and double columns
- Revert Files.java, IOExceptionUtils.java, MemoryManager.java, and parquet-cli/pom.xml to master state; these changes are unrelated to ALP and should be submitted in separate PRs - Clarify ParquetMetadataConverter error message: ALP encoding is defined in the ALP paper (enum value 26) but is not yet in the parquet-format Thrift spec, so ALP cannot be written through the Hadoop write path; the error message now explains what needs to happen to remove the block
prtkgaur
left a comment
There was a problem hiding this comment.
The code organization looks good to me and the code follows the spec. I looked for areas of any extra buffer allocations which might impact performance and I think it is optimally written.
I think we should add a few benchmarks and publish numbers from them.
Thanks for working on this Vinoo!
prtkgaur
left a comment
There was a problem hiding this comment.
Wanted to make sure we have the following testing.
For the cross compatibility testing are we making sure that we write both V1 and V2 pages and the implementation in other language is able to read it.
- Add build-time Perl script to patch generated Encoding.java with ALP(10) after Thrift codegen (process-sources phase), since parquet-format 2.12.0 does not yet include ALP in its Thrift spec - Remove guard in ParquetMetadataConverter.getEncoding() that blocked ALP writes; Encoding.ALP now exists in the patched Thrift enum - Add withAlpEncoding() builder methods to ParquetWriter - Add TestInterOpReadAlp: Java V1/V2 write+read round-trip tests and C++ Arrow interop tests (reads alp_spotify1.parquet, alp_arade.parquet, etc.) - Add AlpEncodingBenchmarks JMH benchmark
…d pyarrow interop test - AlpValuesWriter: stop clearing cachedPresets in reset() so preset (e,f) pairs survive page flushes; eliminates redundant full parameter search on every page after the first, cutting write time ~60% - AlpEncodingBenchmarks: clarify Javadoc that comparison is PLAIN+UNCOMPRESSED (no codec), not plain+ZSTD - parquet-benchmarks pom: add explicit annotationProcessorPaths and proc=full for jmh-generator-annprocess so BenchmarkList is generated under Java 23+ - TestInterOpReadAlp: add pyarrow cross-language compatibility test (skips if pyarrow unavailable or does not yet support ALP encoding)
- ParquetProperties: add withAlpVectorSize(int) and withAlpVectorSize(String, int) builder methods plus getAlpVectorSize(ColumnDescriptor) accessor, defaulting to AlpConstants.DEFAULT_VECTOR_SIZE (1024). - AlpConstants: promote validateVectorSize to public so the builder can validate eagerly across packages. - DefaultV1/V2 ValuesWriterFactory: pass the configured vector size to the 4-arg AlpValuesWriter constructors. - ParquetWriter.Builder: expose withAlpVectorSize facades mirroring withAlpEncoding. - TestInterOpReadAlp: add testJavaWriteAlpCustomVectorSize covering 4500 rows at vectorSize=4096 so we cross a full vector boundary and verify round-trip equality. A wrong log_vector_size byte would surface as decode garbage, so round-trip equality is sufficient proof the configured size took effect on the wire. Enables generating ALP test fixtures at different vector sizes (e.g. 4096) for cross-language compatibility testing against the C++/Rust/Go implementations.
Logging and debug output was missing the new alpVectorSize field alongside the existing 'ALP enabled' line. Cosmetic only — no behavior change.
Adds generateAlpFixturesAtMultipleVectorSizes to TestInterOpReadAlp. For each of the four source files in parquet-testing PR apache#100 (alp_spotify1, alp_arade, alp_float_spotify1, alp_float_arade), reads every row, then re-encodes as Java ALP at both vectorSize=1024 and vectorSize=4096. Output goes to ALP_OUTPUT_DIR (default ${user.dir}/alp-java-generated/), producing 8 files total named alp_java_<stem>_vs{1024,4096}.parquet. Each output is verified by reading back through the standard reader path and bit-comparing every value via doubleToRawLongBits / floatToRawIntBits — catches NaN payload and signed-zero divergence, not just numerical equality. Skips when ALP_TEST_DATA_DIR isn't set, so it stays inert in CI on machines without the source datasets. To run: git clone --branch alpFloatingPointDataset \\ https://github.com/prtkgaur/parquet-testing.git ALP_TEST_DATA_DIR=path/to/parquet-testing/data \\ mvn -pl parquet-hadoop \\ -Dtest=TestInterOpReadAlp#generateAlpFixturesAtMultipleVectorSizes \\ test
Extends generateAlpFixturesAtMultipleVectorSizes to vary writer page
version (PARQUET_1_0, PARQUET_2_0) as a third axis alongside dataset
and ALP vector size. Output grows from 8 → 16 files per run:
alp_java_<stem>_v{1,2}_vs{1024,4096}.parquet
Page version is orthogonal to ALP encoding — the page version
difference lives in the parquet protocol layer, not in the ALP
payload — but covering both axes makes the fixture set fully
symmetric for cross-language compatibility verification. C++/Rust/Go
readers can use the V1 and V2 variants to prove their decoders
handle Java-written ALP regardless of how the surrounding pages are
framed. Avoids an asymmetry where the existing PR apache#100 set has C++
at V1 and Java at V2 with no overlap.
All 16 outputs independently verified against the canonical
_expect.csv truth files from parquet-testing PR apache#100 (1.56M values,
0 mismatches).
The reader was asserting that the ALP header's num_elements equals the data page's valuesCount, but those values differ whenever a column has nulls: num_elements is the count of non-null values that went through ALP encoding, while valuesCount is the total row count of the page (which includes null positions tracked by definition levels). The strict equality check made the reader reject every optional float/double column with at least one null value. Relaxes the check to numElements > valuesCount — the header can never legitimately claim more encoded values than the page has rows, but it can claim fewer when nulls are present. The downstream code already uses numElements (not valuesCount) to drive vector allocation and decoding, so the rest of the read path is unchanged. This was surfaced by the corner-case fixture per parquet-testing issue apache#105, which exercises optional columns with null values.
Two new tests in TestInterOpReadAlp: readAllFixtureFilesIndependently Opens every alp_java_*.parquet in ALP_OUTPUT_DIR and asserts each column chunk declares Encoding.ALP and decodes through the standard reader path without error. Separate from the generator's own round-trip verification so reader correctness surfaces as a distinct signal in CI when the fixtures are present. Skips cleanly when ALP_OUTPUT_DIR is empty so it stays inert in default CI environments. generateAndVerifyCornerCaseFixture Writes a single small fixture file (alp_java_cornercases.parquet, ~60 KB) targeting the corner cases enumerated in parquet-testing issue apache#105: vectors with no exceptions, one exception per vector, all exceptions, NaN/Inf/-0.0, constant values (bit_width=0), multi-vector with differing exponents, and optional columns with nulls. Both f32 and f64 variants — 14 columns × 2048 rows total. Reads each column back and bit-exactly verifies every value against the expected pattern via doubleToRawLongBits / floatToRawIntBits. The corner-case fixture is intended as a candidate file for parquet-testing PR apache#100 once naming/design is confirmed. Generating it also surfaced (and verified the fix for) a pre-existing reader bug where optional columns with nulls couldn't be decoded — see the preceding commit.
The corner-case fixture (alp_java_cornercases.parquet) is synthetic — it isn't derived from any raw dataset in parquet-testing PR apache#100, so the existing alp_*_expect.csv files don't cover it. That left cross-language verifiers with no independent ground truth to check the parquet file against; they had to either trust the Java reader or duplicate the construction recipe in their own code. writeCornerCaseCsvTruth now dumps the expected values straight from the construction recipe into alp_java_cornercases_expect.csv next to the parquet, every time the generator runs. The CSV uses the same format conventions as the existing _expect.csv files (comma- separated, header row, no quoting) plus two extensions: • Empty field = null cell (for optional columns) • Special values printed via Java's standard toString: "NaN", "Infinity", "-Infinity", "-0.0". These all parse via C++ std::stod / std::stof per the standard (case-insensitive, "inf" and "infinity" both accepted). The Arrow C++ ALP decoder reads the parquet and compares against this CSV bit-exactly: 27306 non-null cells + 1366 null cells across 14 columns × 2048 rows, 0 mismatches. This makes the corner-case fixture self-documenting and verifiable by any future cross-language tooling without rerunning the Java generator to discover what the expected values are.
| int bitWidth = vectorsData.get(pos + 4) & 0xFF; | ||
| pos += FLOAT_FOR_INFO_SIZE; | ||
|
|
||
| if (bitWidth > 0) { |
There was a problem hiding this comment.
Should we check that the bitwidth is also within reasonable bounds? bitWidth within [0,32] or [0,64] respectively?
| private ValuesWriter getDoubleValuesWriter(ColumnDescriptor path) { | ||
| final ValuesWriter fallbackWriter; | ||
| if (this.parquetProperties.isByteStreamSplitEnabled(path)) { | ||
| if (this.parquetProperties.isAlpEnabled(path)) { |
There was a problem hiding this comment.
I don't see a test for this or the float factory producing the ALPValueWriter classes. Probably worth adding them just like the ByteStreamSplit tests in DefaultValuesWriterFactoryTest
There was a problem hiding this comment.
I'm not sure if this belongs in DefaultV1ValuesWriter factory, we'll see what @rdblue proposes with the new versioning. But I"m guessing this will land in a V3? I would stick it only in V2 for now since that's already a bit mixed up
| } | ||
| roundTripDouble(values); | ||
| } | ||
|
|
There was a problem hiding this comment.
May want to add a case for numElements < Value count to cover the case where some values where null.
| } | ||
|
|
||
| /** NaN, Inf, and -0.0 can never be encoded regardless of exponent/factor. */ | ||
| static boolean isFloatException(float value) { |
There was a problem hiding this comment.
nit: The Naming is a bit odd for me here
As the javadoc notes, this only checks for blanket exceptions but the method below
is the one that would tell us if the value was actually an exception.
Just feels a little worrysome to me that you could see isFloatException(value) : false
but isFloatException(value, x, y) : true
I might name this something different to make it clear that it's a subset of checks.
maybe isIntrinsicFloatException(floatValue)
and isFloatException(value, exponent, factor)
There was a problem hiding this comment.
(this also would mean you could probably drop the javadocs explaining the difference)
| return Float.floatToRawIntBits(value) != Float.floatToRawIntBits(decoded); | ||
| } | ||
|
|
||
| /** Round float to nearest integer using magic-number trick with sign branching. */ |
There was a problem hiding this comment.
nit: Is this java doc really valuable here? I'm not sure it does more than describe the implementation.
| return encoded * FLOAT_POW10[factor] * FLOAT_POW10_NEGATIVE[exponent]; | ||
| } | ||
|
|
||
| static boolean isDoubleException(double value) { |
There was a problem hiding this comment.
Same note on the polymorphism here
|
Thanks @RussellSpitzer and @emkornfield! I'll do a first pass on the feedback ASAP |
|
|
||
| /** Number of bits needed to represent maxDelta as an unsigned value. */ | ||
| static int bitWidthForInt(int maxDelta) { | ||
| if (maxDelta == 0) { |
There was a problem hiding this comment.
Is this just here to make the spec explicit?
| * | ||
| * <p>Estimated size (in bits) = {@code length * bitWidth + exceptions * (Float.SIZE + Short.SIZE)}, | ||
| * where bitWidth is the number of bits needed to represent the unsigned range of non-exception | ||
| * encoded values after frame-of-reference subtraction. This matches the C++ ALP cost model and |
There was a problem hiding this comment.
Not sure we need the note about the C++ ALP cost model. Feels like that information could get stale and doesn't really help the reader.
| int bitsPerValue = (delta == 0) ? 0 : (64 - Long.numberOfLeadingZeros(delta)); | ||
| long estimatedSize = (long) length * bitsPerValue | ||
| + (long) exceptions * (Float.SIZE + Short.SIZE); | ||
| if (estimatedSize < bestEstimatedSize |
There was a problem hiding this comment.
This I think would be nice to be metioned in the java doc. We have an obvious ordering (lowest estimated size wins) but then the decision for best e and f is I assume to just match what was done in the C++ impl?
|
|
||
| /** Same as findBestFloatParams but only tries the cached preset combos. */ | ||
| static EncodingParams findBestFloatParamsWithPresets(float[] values, int offset, int length, int[][] presets) { | ||
| int bestExponent = presets[0][0]; |
There was a problem hiding this comment.
nit: I like doing constants for this kind of indexing just so we aren't constnatly remembering the constants here so
final int E = 0
final int F = 1
But that's just me
| } | ||
| int nonExceptions = length - exceptions; | ||
| if (nonExceptions == 0) continue; | ||
| long delta = (nonExceptions < 2) ? 0 : |
There was a problem hiding this comment.
Same sign question as the above function
| } | ||
| } | ||
| return new EncodingParams(bestExponent, bestFactor, bestExceptions); | ||
| } |
There was a problem hiding this comment.
I'm a little worried about the code duplication here. I think we are directly
duplicating some of the logic here.
What if instead we did something like
scoreFloat(buffer, offset, length, e, f)
pickBestFloat(buffer, offset, length, pairs[])
findBestFloat(buffer, offset, length):
return pickBestFloat(buffer, offset, length, ALL_VALID_PAIRS)
findBestFloat(buffer, offset, length, presets):
return pickBestFloat(buffer, offset, length, presets)
|
|
||
| protected abstract void decodeVector(int vectorIdx); | ||
|
|
||
| // Explicit little-endian reads using absolute get(), since absolute get() ignores ByteBuffer order. |
There was a problem hiding this comment.
I think this is only true for get(). We should have LE respecting getShort, getInt and getLong on the buffer since we already slice it as little littleEndian.
Basically if we make
this.vectorsData = rawSlice.slice().order(ByteOrder.LITTLE_ENDIAN);
//Then
(vectorsData.getShort(pos) & 0xFFFF) == getShortLE(vectorsData, pos)
vectorsData.getInt(pos) == getIntLE(vectorsData, pos)
vectorsData.getLong(pos) == getLongLE(vectorsData, pos)If we still want buffer agnostic helpers we already have some helpers in org.apache.parquet.bytes.BytesUtils. That would probably be the right place
to add more.
/**
* reads an int in little endian at the given position
*
* @param in a byte buffer
* @param offset an offset into the byte buffer
* @return the integer at position offset read using little endian byte order
* @throws IOException if there is an exception reading from the byte buffer
*/
public static int readIntLittleEndian(ByteBuffer in, int offset) throws IOException {SO I think we would put additional ones there as well
TLDR; I don't think we need these helpers here since we have a littleEndian buffer already, but it f were going to add helpers we should do them in ByteUtils
| protected int offsetArraySize; | ||
|
|
||
| AlpValuesReader() { | ||
| this.currentIndex = 0; |
There was a problem hiding this comment.
Reading through this and the subclasses I find it a little confusing that we have so many different "index" cursors used with different names
in different contexts. I think they are a little confusing because they are in essentially different coordinate systems.
I think it may be clearer if we give them some slightly different names
currentIndex -> pageValueIndex // Which encoded value in the page
vectorIdx / currentVectorIndex -> vectorNumber // Which vector chunk
indexInVector -> vectorSlot // Position within that chunk
To clearly differentiate what each one represents
| } | ||
|
|
||
| @Override | ||
| protected void decodeVector(int vectorIdx) { |
There was a problem hiding this comment.
I mentioned on AlpValuesReaders but this is where I think it would be much clearer if we didn't use "Idx" and instead used something like vectorNumber
|
|
||
| int exponent = vectorsData.get(pos) & 0xFF; | ||
| int factor = vectorsData.get(pos + 1) & 0xFF; | ||
| int numExceptions = getShortLE(vectorsData, pos + 2) & 0xFFFF; |
There was a problem hiding this comment.
The 0XFFFF should be redundant here with the current impl of getShortLE (it already masks both get's with 0XFF), but if we switch to vectorsData.getShort(pos + 2) we would need it
| } | ||
|
|
||
| long frameOfReference = getLongLE(vectorsData, pos); | ||
| int bitWidth = vectorsData.get(pos + 8) & 0xFF; |
| // Last group might have fewer than 8 values; zero-pad and unpack, | ||
| // but only advance pos by the actual bytes in the page. | ||
| if (remaining > 0) { | ||
| int totalPackedBytes = (count * bitWidth + 7) / 8; |
There was a problem hiding this comment.
BytesUtils has paddedByteCountFrombits which has this same math
| } | ||
|
|
||
| /** Number of bits needed to represent maxDelta as an unsigned value. */ | ||
| static int bitWidthForInt(int maxDelta) { |
There was a problem hiding this comment.
Looking back there is a BytesUtils for this as well getWidthFromMaxInt(maxDelta)
| this.excPositionsBuffer = new int[capacity]; | ||
| this.unpackByteBuf = new byte[Integer.SIZE]; // max bit width for int = 32 bytes | ||
| } | ||
|
|
There was a problem hiding this comment.
I left a bunch of other comments on the double reader, they all would apply here as well s/double/float and s/long/int/ :)
| int bitWidth = vectorsData.get(pos + 8) & 0xFF; | ||
| pos += DOUBLE_FOR_INFO_SIZE; | ||
|
|
||
| if (bitWidth > 0) { |
There was a problem hiding this comment.
Do we want a bounds check for bitWidth <= 64?
| if (cachedPresets != null) { | ||
| params = AlpEncoderDecoder.findBestFloatParamsWithPresets(vectorBuffer, 0, vectorLen, cachedPresets); | ||
| } else { | ||
| params = AlpEncoderDecoder.findBestFloatParams(vectorBuffer, 0, vectorLen); |
There was a problem hiding this comment.
Very much my personal opinon , but it feels a little weird to me that the "cached" version comes first and the "build cache" version comes second.
i'm wondering if it would be simpler like
if (cachedPresets == null) {
// Sampling amd potential cache building first
params = AlpEncoderDecoder.findBestFloatParams(vectorBuffer, 0, vectorLen);
if (vectorsProcessed % rowgroupSampleJump == 0
&& rowgroupSamples.size() < SAMPLER_SAMPLE_VECTORS_PER_ROWGROUP) {
rowgroupSamples.add(Arrays.copyOf(vectorBuffer, vectorLen));
}
if (rowgroupSamples.size() >= SAMPLER_SAMPLE_VECTORS_PER_ROWGROUP) {
buildPresetCache();
}
} else {
// Cached case second
params = AlpEncoderDecoder.findBestFloatParamsWithPresets(
vectorBuffer, 0, vectorLen, cachedPresets);
}
vectorsProcessed++;This would tie the cache building to the sampling phase rather than being a branch either
phase could hit and would match the normal flow of code imho.
| int excIdx = 0; | ||
|
|
||
| // We need a valid encoded value to fill exception slots (placeholder). | ||
| // Any non-exception value works; it gets overwritten on decode. |
There was a problem hiding this comment.
I was walking through this and I think we are being a little more clever then this here right?
Maybe the comment could read more like
Exception indices still participate in the packed integer array (one slot per vector position),
but the real float is stored in the exception block. We fill those slots with a dummy encoded int:
the first non-exception value's encoding if any exists,
otherwise 0 when the whole vector is exceptions.
Picking an encoding that's already in the vector avoids introducing a spurious minimum (e.g. 0)
that would distort FOR/bit width. When every value is an exception, all slots are 0, minValue is
0, and deltas are also 0. The reader patches exception slots from the exception payload, so the
placeholder only affects FOR/bit width, not the decoded values.
| } | ||
| } | ||
|
|
||
| int bitWidth = AlpEncoderDecoder.bitWidthForInt(maxDelta); |
There was a problem hiding this comment.
| int bitWidth = AlpEncoderDecoder.bitWidthForInt(maxDelta); | |
| int bitWidth = BytesUtils.getWidthFromMaxInt(maxDelta); |
| packPadBuf[i] = 0; | ||
| } | ||
| packer.pack8Values(packPadBuf, 0, packBuf, 0); | ||
| int totalPackedBytes = (count * bitWidth + 7) / 8; |
There was a problem hiding this comment.
BytesUtils.paddedByteCountFromBits
|
|
||
| private void buildPresetCache() { | ||
| // For each sampled vector, find the best (e,f) using estimated compressed size, | ||
| // then rank combos by how many samples they win. Take the top MAX_PRESET_COMBINATIONS. |
There was a problem hiding this comment.
style nit: probably could use some line breaks between the for blocks here
| } | ||
| } | ||
|
|
||
| private void buildPresetCache() { |
There was a problem hiding this comment.
So for this function it feels like we are doing some redundant work
Instead of copying samples and finding best presets again, we could just save our sample
presets as we are sampling and then just run the frequency count here.
// at sample time
sampleWinners.add(new int[] { params.exponent, params.factor });
// buildPresetCache — no findBest, no float[] copies
private void buildPresetCache(sampleWinners int[][] ) {
Map<Long, Integer> counts = new HashMap<>();
for (int[] ef : sampleWinners) {
long key = ((long) ef[0] << Integer.SIZE) | ef[1];
counts.merge(key, 1, Integer::sum);
}
// sort, take top 5 → cachedPresets
}So rather than recomputing for each sample, we just save the (e, f) for each sample (instead of the sample) and then group and sort by frequency?
Also I wouldn't use sampleWinners as a variable name here, but it's getting late and my naming brain is running out of energy
RussellSpitzer
left a comment
There was a problem hiding this comment.
It's getting late and I'm off next week. I reviewed pretty much everything except the tests.
My biggest checks were
- Looks like there is a bitwidth calculation bug for floats using an unsigned comparison to determine range
- The buildPresetCache looks like it's doing too much work, I think we can get by just saving the (e,f) values rather than copying vectors and refinding their e,f paris
- There seems to be some duplication of BytesUtils methods with ALP specific versions here that seem to be the same. Is there a reason for this? May be worth consolidating otherwise
| } | ||
| } | ||
|
|
||
| /** Double writer. Same structure as FloatAlpValuesWriter but uses longs. */ |
There was a problem hiding this comment.
All my comments from above, but s/float/double/ and s/int/long/ :)
| private ValuesWriter getDoubleValuesWriter(ColumnDescriptor path) { | ||
| final ValuesWriter fallbackWriter; | ||
| if (this.parquetProperties.isByteStreamSplitEnabled(path)) { | ||
| if (this.parquetProperties.isAlpEnabled(path)) { |
There was a problem hiding this comment.
I'm not sure if this belongs in DefaultV1ValuesWriter factory, we'll see what @rdblue proposes with the new versioning. But I"m guessing this will land in a V3? I would stick it only in V2 for now since that's already a bit mixed up
| private final int pageRowCountLimit; | ||
| private final boolean pageWriteChecksumEnabled; | ||
| private final ColumnProperty<ByteStreamSplitMode> byteStreamSplitEnabled; | ||
| private final ColumnProperty<Boolean> alpEnabled; |
There was a problem hiding this comment.
Maybe we should be bundling these into ALPConfig rather than having two ColumnProperties
| assertTrue( | ||
| "Preset result should be at least as good as full search", | ||
| presetResult.numExceptions <= fullResult.numExceptions); | ||
| } |
There was a problem hiding this comment.
The bitwidth bug I noted for floats could have a test here
@test
public void findBestFloatParamsMixedSignUsesSignedDeltaRange() {
float[] values = {-1.0f, 1.0f}; // e=0,f=0 → encoded {-1, 1}
AlpEncoderDecoder.EncodingParams params =
AlpEncoderDecoder.findBestFloatParams(values, 0, values.length);
assertEquals(0, params.exponent);
assertEquals(0, params.factor);
assertEquals(0, params.numExceptions);
}
| } | ||
| } | ||
|
|
||
| // ========== Bit Width Tests (renamed methods) ========== |
There was a problem hiding this comment.
Not sure what this means? Is this because we are renaming the BytesUtil methods? I'm still not sure why we are reimplementing those here
cc @julienledem @alamb @emkornfield @prtkgaur
Rationale for this change
Reworks the ALP encoding implementation to address emkornfield's architectural feedback on PR #3390. The original buffered all values in memory and decoded eagerly. This makes the writer incremental (encode per-vector as values arrive) and the reader lazy (decode on demand), matching how other Parquet encodings work.
Builds on Julien Le Dem's original implementation (#3390). File structure, integration points, core math, and interop test infrastructure all come from his work. The rework focused on the internal writer/reader plumbing.
What changes are included in this PR?
Architecture (addressing review feedback):
Spec compliance:
Configurable vector size (new):
withAlpVectorSize(int)andwithAlpVectorSize(String columnPath, int)onParquetProperties.Builderand the HadoopParquetWriter.BuilderDefaultV1ValuesWriterFactoryandDefaultV2ValuesWriterFactoryso per-column overrides workAlpConstantsmin/max bounds eagerly at builder timeReader null tolerance (bug fix):
AlpValuesReaderwas assertingnum_elements == page.valuesCount, which fails on optional columns with nulls (num_elements is the encoded non-null count, valuesCount is the page row count including nulls). Relaxed tonum_elements > valuesCount. Surfaced by the corner-case fixture below.Test fixtures (parquet-testing #100 prep):
generateAlpFixturesAtMultipleVectorSizesreads the source parquets from parquet-testing PR PARQUET-134 patch - Support file write mode #100 and re-encodes them as Java ALP across 3 axes — page version (V1/V2), vector size (1024/4096), and dataset/type — producing 16 files, each verified bit-exact against its source.generateAndVerifyCornerCaseFixturewrites a small (~60 KB) synthetic file with 14 columns each engineered to hit a specific corner case from parquet-testing Parquet 167 - Snappy Compression codec - Optimize freeing of DirectByteBuffers #105: no/one/all exceptions, NaN/Inf/-0.0, constant (bit_width=0), differing exponents per vector, optional with nulls. Both f32 and f64.writeCornerCaseCsvTruthemits a sidecar_expect.csvfrom the construction recipe (not from reading the parquet back), so the corner-case file is independently verifiable by any future tooling.readAllFixtureFilesIndependentlyopens every generated fixture and asserts ALP appears in column encodings + every row decodes through the standard reader path.Integration:
DefaultV1ValuesWriterFactoryandDefaultV2ValuesWriterFactory.Are these changes tested?
Yes. 125 tests across 5 test classes in parquet-column, plus 12 tests in
TestInterOpReadAlpin parquet-hadoop — all passing. Full parquet-column suite also passes.Key tests construct ALP page bytes directly according to the spec and feed them to the reader without going through the writer. This verifies the reader works independently and catches any bugs where writer and reader agree with each other but disagree with the spec. Also covers NaN bit pattern preservation, negative zero roundtrip, extreme values, every partial vector remainder mod 8, skip across vector boundaries, and preset caching under distribution change.
Cross-language verification
Arrow C++ ALP decoder (apache/arrow#48345) reads every Java-written fixture bit-exact against the canonical
_expect.csvtruth tables from parquet-testing PR #100. Local verification covers all 17 fixtures across the full {V1, V2} × {vs1024, vs4096} matrix: 1,587,306 values, 0 mismatches. Six representative fixtures submitted as a stacked PR (prtkgaur/parquet-testing#1) toward apache/parquet-testing#100 account for 342K of those.Are there any user-facing changes?
ParquetProperties.withAlpEncoding(), globally or per-column.withAlpVectorSize(int)(new in this PR), also globally or per-column. Default is 1024.Note: Likely me missing something — but ALP is not yet in the parquet-format Thrift spec (apache/parquet-format#533), so writing ALP files through the full Hadoop pipeline will fail at metadata serialization until parquet.thrift is updated (parquet-format PR #548).