ORC-1671: update timestamp doc in specfication by Jefffrey · Pull Request #1867 · apache/orc

Jefffrey · 2024-03-31T00:44:54Z

What changes were proposed in this pull request?

Updating timestamp specification documentation to be more accurate to the implementation.

Why are the changes needed?

Timestamp specification is not clear on some nuances of it's implementation. Pulling knowledge from the following for these updates:

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

No

Jefffrey · 2024-03-31T00:45:26Z

site/specification/ORCv1.md

+* Note that if writer timezone is set, 1 January 2015 is according to
+this timezone and not according to UTC


According to implementation:

orc/c++/src/Timezone.hh

Lines 66 to 72 in 3b5b2a6

/**

* Get the number of seconds between the ORC epoch in this timezone

* and Unix epoch.

* ORC epoch is 1 Jan 2015 00:00:00 local.

* Unix epoch is 1 Jan 1970 00:00:00 UTC.

*/

virtual int64_t getEpoch() const = 0;

orc/c++/src/ColumnReader.cc

Line 335 in 9b79de9

int64_t writerTime = secsBuffer[i] + epochOffset;

Also, I'm not certain if the writer timezone is mandatory in the stripe if a TIMESTAMP column is present in the file. Might need some clarification on this?

IIUC, writer timezone field is mandatory if there is any timestamp type:

orc/java/core/src/java/org/apache/orc/impl/WriterImpl.java

Lines 559 to 565 in 513922a

if (writeTimeZone) {

if (useUTCTimeZone) {

builder.setWriterTimezone("UTC");

} else {

builder.setWriterTimezone(TimeZone.getDefault().getID());

}

}

Thanks, I think I'll add this to the spec as well (maybe proto too, in the other repo?)

Jefffrey · 2024-03-31T00:45:43Z

site/specification/ORCv1.md

+Due to ORC-763, values before the UNIX epoch which have nanoseconds greater
+than 999,999 are adjusted to have 1 second less.


According to

orc/c++/src/ColumnReader.cc

Lines 350 to 352 in 9b79de9

if (secsBuffer[i] < 0 && nanoBuffer[i] > 999999) {

secsBuffer[i] -= 1;

}

Jefffrey · 2024-03-31T00:47:48Z

site/specification/ORCv1.md

+First we must adjust the DATA value to be relative to the UNIX epoch. The ORC
+epoch is 1 January 2015 00:00:00 US/Pacific, since we must take into account the writer
+timezone. This translates to 1 January 2015 08:00:00 UTC, as US/Pacific is equivalent
+to a -08:00 offset from UTC at that date (no daylight savings). The number of seconds
+from 1 January 1970 00:00:00 UTC to 1 January 2015 08:00:00 UTC is 1,420,099,200. This is
+added to the DATA value to produce a value of -20,751,903. As this is before the
+UNIX epoch (since it is negative), and the SECONDARY value, 199,900,000, is
+greater than 999,999, then this DATA value is adjusted to become -20,751,904
+(1 second subtracted).
+
+This value by itself represents 5 May 1969 19:34:56.1999, which now needs to be adjusted
+from US/Pacific (the writer's timezone) to UTC (the reader's timezone). As the value is
+within daylight savings for US/Pacific, 7 hours are subtracted to give the final value
+of 5 May 1969 12:34:56.1999.


I'm kinda unclear on how this exactly works, I'm just going off the C++ implementation here:

orc/c++/src/ColumnReader.cc

Lines 335 to 348 in 9b79de9

int64_t writerTime = secsBuffer[i] + epochOffset;

if (!sameTimezone) {

// adjust timestamp value to same wall clock time if writer and reader

// time zones have different rules, which is required for Apache Orc.

const auto& wv = writerTimezone->getVariant(writerTime);

const auto& rv = readerTimezone->getVariant(writerTime);

if (!wv.hasSameTzRule(rv)) {

// If the timezone adjustment moves the millis across a DST boundary,

// we need to reevaluate the offsets.

int64_t adjustedTime = writerTime + wv.gmtOffset - rv.gmtOffset;

const auto& adjustedReader = readerTimezone->getVariant(adjustedTime);

writerTime = writerTime + wv.gmtOffset - adjustedReader.gmtOffset;

}

}

Happy to be corrected if my understanding or wording is inaccurate anywhere 👍

Basically here it tries to mimic the behavior on the Java side: https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L1280-L1295

Similarly, I'm not inclined to add these details to the specs if we are not 100% sure about it.

wgtmac · 2024-04-02T04:54:52Z

site/_docs/types.md

-* **Timestamp** is a date and time without a time zone, which does not change based on the time zone of the reader.
+* **Timestamp** is a date and time without a time zone, where the timestamp value is stored in the writer timezone
+encoded at the stripe level, if present. ORC readers will read this value back into the reader's timezone. Usually
+both writer and reader timezones default to UTC, however older ORC files may contain non-UTC writer timezones


Usually
both writer and reader timezones default to UTC, however older ORC files may contain non-UTC writer timezones`

The main purpose of this timestamp type is to restore the wall clock time regardless of the reader time zone. IIRC, the Java implementation uses the writer local time zone instead of UTC. Due to insufficient C++ time zone utility support in earlier days, the C++ code by default uses UTC as the writer time zone to avoid complexity.

I see. I was confused as from the discussion/comments here: apache/arrow#34591

It suggested that the reader's timezone was important when deserializing the values

IMHO, we'd better not expose the implementation detail here. The original summary and the table below well explains what to expect from users.

I'm not sure I agree with that, as this timestamp type has been a great source of confusion for me when attempting to implement an ORC to Arrow reader in Rust.

For example, in the test file TestOrcFile.testDate2038.orc, according to PyArrow (which uses the ORC C++ reader underneath), the first value is:

>> from pyarrow import orc >>> orc.read_table("/home/jeffrey/Code/orc/examples/TestOrcFile.testDate2038.orc")[0][0] <pyarrow.TimestampScalar: '2038-05-05T12:34:56.100000000'>

The underlying integer value for the DATA stream is 736_601_696 seconds (let's ignore nanoseconds). When adding this to 1 Jan 2015 00:00:00, regardless of timezone, we get 5 May 2038 11:34:56. Which seems... surprising.

https://www.timeanddate.com/date/dateadded.html?d1=1&m1=1&y1=2015&type=add&ay=&am=&aw=&ad=&h1=0&i1=0&s1=0&ah=&ai=&as=736601696&rec=

This isn't at all what I would expect hence my confusion (perhaps I just don't understand properly how DST factors into this). But diving into the internals and seeing reader & writer timezone offsets being considered when decoding the timestamp value just adds to the confusion, especially as that contradicts the specification.

I was hoping to open up some discussion on clarifying the exact definition for this timestamp type as part of this PR.

Sorry for the late reply. If you find any inconsistent behavior between C++ and Java sides, please refer to the Java implementation as reference. We need to fix any issue on the C++ side.

I'll try check more into this to confirm my understanding 👍

Moving to draft in the meantime

ORC-1671: update timestamp doc in specfication

e363a42

github-actions bot added the DOCS label Mar 31, 2024

Jefffrey commented Mar 31, 2024

View reviewed changes

Jefffrey mentioned this pull request Mar 31, 2024

Fix Timestamp to align with ORC spec datafusion-contrib/orc-rust#26

Open

wgtmac reviewed Apr 2, 2024

View reviewed changes

Jefffrey marked this pull request as draft June 1, 2024 09:26

		* Note that if writer timezone is set, 1 January 2015 is according to
		this timezone and not according to UTC

	/**
	* Get the number of seconds between the ORC epoch in this timezone
	* and Unix epoch.
	* ORC epoch is 1 Jan 2015 00:00:00 local.
	* Unix epoch is 1 Jan 1970 00:00:00 UTC.
	*/
	virtual int64_t getEpoch() const = 0;

	if (writeTimeZone) {
	if (useUTCTimeZone) {
	builder.setWriterTimezone("UTC");
	} else {
	builder.setWriterTimezone(TimeZone.getDefault().getID());
	}
	}

		Due to ORC-763, values before the UNIX epoch which have nanoseconds greater
		than 999,999 are adjusted to have 1 second less.

	if (secsBuffer[i] < 0 && nanoBuffer[i] > 999999) {
	secsBuffer[i] -= 1;
	}

	int64_t writerTime = secsBuffer[i] + epochOffset;
	if (!sameTimezone) {
	// adjust timestamp value to same wall clock time if writer and reader
	// time zones have different rules, which is required for Apache Orc.
	const auto& wv = writerTimezone->getVariant(writerTime);
	const auto& rv = readerTimezone->getVariant(writerTime);
	if (!wv.hasSameTzRule(rv)) {
	// If the timezone adjustment moves the millis across a DST boundary,
	// we need to reevaluate the offsets.
	int64_t adjustedTime = writerTime + wv.gmtOffset - rv.gmtOffset;
	const auto& adjustedReader = readerTimezone->getVariant(adjustedTime);
	writerTime = writerTime + wv.gmtOffset - adjustedReader.gmtOffset;
	}
	}

Conversation

Jefffrey commented Mar 31, 2024

What changes were proposed in this pull request?

Why are the changes needed?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants