Monday, June 4, 2012

GIST NOTES 17 - Java Serialization



[DISCLAIMER: This is solely for non-commercial use. I don't claim ownership of this content. This is the crux of all my reading, study, and analysis. Some of it is excerpted from well-known books and sources on the subject. Some of it is my own reflection on experiments with hand-written code samples, using an IDE or Notepad.


I've created this mainly to reduce an entire book into a few pages of critical content that we should never forget. Even after years, you don't need to read the entire book again to recall its philosophy. I hope these notes will help you replay the entire book in your mind once again.]

>Serialization - converts an object into a byte stream so that it can be saved in a file or sent over the network to another machine
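A minimal round-trip sketch of the idea, using an in-memory byte array in place of a file or socket. The class names Point and RoundTrip are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical example class; any class implementing Serializable is handled the same way.
class Point implements Serializable {
    private static final long serialVersionUID = 1L;
    final int x;
    final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }
}

class RoundTrip {
    // Serialize an object into a byte array.
    static byte[] toBytes(Object obj) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        }
        return buffer.toByteArray();
    }

    // Rebuild an object from the bytes produced by toBytes.
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Point copy = (Point) fromBytes(toBytes(new Point(3, 4)));
        System.out.println(copy.x + "," + copy.y);
    }
}
```

The same two helper methods would work against a FileOutputStream or a socket stream; only the underlying stream changes.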


>Special handling is required for arrays, enum constants, and objects of type Class, ObjectStreamClass, and String. Other objects must implement either the Serializable or the Externalizable interface to be saved in or restored from a stream.

>Enum constants are serialized differently than ordinary serializable or externalizable objects. The serialized form of an enum constant consists solely of its name; field values of the constant are not transmitted. To serialize an enum constant, ObjectOutputStream writes the string returned by the constant's name method. Like other serializable or externalizable objects, enum constants can function as the targets of back references appearing subsequently in the serialization stream. The process by which enum constants are serialized cannot be customized; any class-specific writeObject and writeReplace methods defined by enum types are ignored during serialization. Similarly, any serialPersistentFields or serialVersionUID field declarations are also ignored--all enum types have a fixed serialVersionUID of 0L.
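A small sketch of the enum behaviour described above. Planet and EnumRoundTrip are hypothetical names; the point is that deserialization returns the very same constant, and that the descriptor reports a serialVersionUID of 0L.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;

enum Planet {
    MERCURY(3.30e23), EARTH(5.97e24);

    final double massKg; // constant-specific state; only the constant's name goes to the stream

    Planet(double massKg) {
        this.massKg = massKg;
    }
}

class EnumRoundTrip {
    static Object roundTrip(Object obj) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Identity comparison: the deserialized value is the existing constant, not a copy.
        System.out.println(roundTrip(Planet.EARTH) == Planet.EARTH);
        System.out.println(ObjectStreamClass.lookup(Planet.class).getSerialVersionUID());
    }
}
```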

>The flush method is used to empty any buffers being held by the stream and to forward the flush to the underlying stream. The drain method may be used by subclassers to empty only the ObjectOutputStream's buffers without forcing the underlying stream to be flushed.

>Each subclass of a serializable object may define its own writeObject method. If a class does not implement the method, the default serialization provided by defaultWriteObject will be used. When implemented, the class is only responsible for writing its own fields, not those of its supertypes or subtypes.

>Note - The ObjectInputStream constructor blocks until it completes reading the serialization stream header. Code which waits for an ObjectInputStream to be constructed before creating the corresponding ObjectOutputStream for that stream will deadlock, since the ObjectInputStream constructor will block until a header is written to the stream, and the header will not be written to the stream until the ObjectOutputStream constructor executes. This problem can be resolved by creating the ObjectOutputStream before the ObjectInputStream, or otherwise removing the timing dependency between completion of ObjectInputStream construction and the creation of the ObjectOutputStream.
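To make the note above concrete, here is a sketch that shows the header the ObjectInputStream constructor waits for. HeaderOrder is a hypothetical class name, and an in-memory buffer stands in for a socket. On a real two-way connection, the usual arrangement is for each side to create and flush its ObjectOutputStream before creating its ObjectInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

class HeaderOrder {
    // Returns the bytes present in the underlying stream once the writer has been
    // constructed and flushed, before any object is written.
    static byte[] headerBytes() throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(sink); // writes the stream header
        out.flush(); // pushes the header through any buffering layer in between
        return sink.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] header = headerBytes();
        // The reader's constructor returns immediately because the header is already available.
        new ObjectInputStream(new ByteArrayInputStream(header)).close();
        System.out.println("header length: " + header.length);
    }
}
```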

>The registerValidation method can be called to request a callback when the entire object graph has been restored but before the object is returned to the original caller of readObject. The order of validate callbacks can be controlled using the priority. Callbacks registered with higher values are called before those with lower values. The object to be validated must support the ObjectInputValidation interface and implement the validateObject method. It is only correct to register validations during a call to a class's readObject method. Otherwise, a NotActiveException is thrown. If the callback object supplied to registerValidation is null, an InvalidObjectException is thrown.
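A sketch of a validation callback, assuming a hypothetical Range class whose invariant is low <= high. The constructor is deliberately unchecked so that the example can produce a stream that violates the invariant.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectInputValidation;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class Range implements Serializable, ObjectInputValidation {
    private static final long serialVersionUID = 1L;
    private final int low;
    private final int high;

    // Unchecked on purpose (illustration only) so a bad instance can be serialized.
    Range(int low, int high) {
        this.low = low;
        this.high = high;
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        // Registration is only legal while readObject is running.
        in.registerValidation(this, 0);
        in.defaultReadObject();
    }

    // Invoked after the whole graph is restored, before readObject returns to the caller.
    @Override
    public void validateObject() throws InvalidObjectException {
        if (low > high) {
            throw new InvalidObjectException("low > high");
        }
    }
}

class RangeDemo {
    static Object roundTrip(Object obj) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        roundTrip(new Range(1, 5)); // passes validation
        try {
            roundTrip(new Range(5, 1));
        } catch (InvalidObjectException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```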

>The readObject method of the class, if implemented, is responsible for restoring the state of the class. The values of every field of the object, whether transient or not, static or not, are set to the default value for the field's type. Either ObjectInputStream's defaultReadObject or readFields method must be called once (and only once) before reading any optional data written by the corresponding writeObject method; even if no optional data is read, defaultReadObject or readFields must still be invoked once. If the readObject method of the class attempts to read more data than is present in the optional part of the stream for this class, the stream will return -1 for bytewise reads, throw an EOFException for primitive data reads (e.g., readInt, readFloat), or throw an OptionalDataException with the eof field set to true for object reads.
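A sketch of a matching writeObject/readObject pair. Measurement is a hypothetical class, and its samples field is marked transient purely so that the example has something to write as optional data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class Measurement implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String label;         // written by default serialization
    private transient double[] samples; // skipped by default serialization, written by hand

    Measurement(String label, double[] samples) {
        this.label = label;
        this.samples = samples.clone();
    }

    String label() { return label; }
    double[] samples() { return samples.clone(); }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();          // the class's own non-transient fields
        out.writeInt(samples.length);      // optional data starts here
        for (double s : samples) {
            out.writeDouble(s);
        }
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();            // called once, before any optional data is read
        samples = new double[in.readInt()];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = in.readDouble();
        }
    }
}

class MeasurementDemo {
    static Measurement roundTrip(Measurement m) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(m);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return (Measurement) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Measurement copy = roundTrip(new Measurement("temp", new double[] {1.5, 2.5}));
        System.out.println(copy.label() + " " + copy.samples().length);
    }
}
```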

>Reading an object from the ObjectInputStream is analogous to creating a new object. Just as a new object's constructors are invoked in the order from the superclass to the subclass, an object being read from a stream is deserialized from superclass to subclass. The readObject or readObjectNoData method is called instead of the constructor for each Serializable subclass during deserialization.

>One last similarity between a constructor and a readObject method is that both provide the opportunity to invoke a method on an object that is not fully constructed. Any overridable (neither private, static nor final) method called while an object is being constructed can potentially be overridden by a subclass. Methods called during the construction phase of an object are resolved by the actual type of the object, not the type currently being initialized by either its constructor or readObject/readObjectNoData method. Therefore, calling an overridable method from within a readObject or readObjectNoData method may result in the unintentional invocation of a subclass method before the superclass has been fully initialized.

>For serializable objects, the readObjectNoData method allows a class to control the initialization of its own fields in the event that a subclass instance is deserialized and the serialization stream does not list the class in question as a superclass of the deserialized object. This may occur in cases where the receiving party uses a different version of the deserialized instance's class than the sending party, and the receiver's version extends classes that are not extended by the sender's version. This may also occur if the serialization stream has been tampered with; hence, readObjectNoData is useful for initializing deserialized objects properly despite a "hostile" or incomplete source stream.

  private void readObjectNoData() throws ObjectStreamException;


>A new stream protocol version has been introduced in JDK 1.2 to correct a problem with Externalizable objects. The old definition of Externalizable objects required the local virtual machine to find a readExternal method to be able to properly read an Externalizable object from the stream. The new format adds enough information to the stream protocol so serialization can skip an Externalizable object when the local readExternal method is not available. Due to class evolution rules, serialization must be able to skip an Externalizable object in the input stream if there is not a mapping for the object using the local classes.

Due to the format change, JDK 1.1.6 and earlier releases are not able to read the new format. StreamCorruptedException is thrown when JDK 1.1.6 or earlier attempts to read an Externalizable object from a stream written in PROTOCOL_VERSION_2. Compatibility issues are discussed in more detail in Section 6.3, "Stream Protocol Versions."

>The ObjectStreamClass.getSerialVersionUID method returns the serialVersionUID of this class. Refer to Section 4.6, "Stream Unique Identifiers." If not specified by the class, the value returned is a hash computed from the class's name, interfaces, methods, and fields using the Secure Hash Algorithm (SHA) as defined by the National Institute of Standards.

>4.2 Dynamic Proxy Class Descriptors

ObjectStreamClass descriptors are also used to provide information about dynamic proxy classes (e.g., classes obtained via calls to the getProxyClass method of java.lang.reflect.Proxy) saved in a serialization stream. A dynamic proxy class itself has no serializable fields and a serialVersionUID of 0L. In other words, when the Class object for a dynamic proxy class is passed to the static lookup method of ObjectStreamClass, the returned ObjectStreamClass instance will have the following properties:

 Invoking its getSerialVersionUID method will return 0L.
 Invoking its getFields method will return an array of length zero.
 Invoking its getField method with any String argument will return null.
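The three properties listed above can be checked with a short sketch. ProxyDescriptor is a hypothetical class name, and the proxy class is obtained through Proxy.newProxyInstance rather than getProxyClass.

```java
import java.io.ObjectStreamClass;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

class ProxyDescriptor {
    // Builds a throwaway proxy for Runnable and looks up the descriptor of its class.
    static ObjectStreamClass describeProxy() {
        InvocationHandler handler = (p, m, a) -> null; // no-op handler, illustration only
        Object instance = Proxy.newProxyInstance(
                ProxyDescriptor.class.getClassLoader(),
                new Class<?>[] {Runnable.class},
                handler);
        return ObjectStreamClass.lookup(instance.getClass());
    }

    public static void main(String[] args) {
        ObjectStreamClass desc = describeProxy();
        System.out.println(desc.getSerialVersionUID());
        System.out.println(desc.getFields().length);
        System.out.println(desc.getField("anything"));
    }
}
```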

>4.3 Serialized Form

The serialized form of an ObjectStreamClass instance depends on whether or not the Class object it represents is serializable, externalizable, or a dynamic proxy class.

When an ObjectStreamClass instance that does not represent a dynamic proxy class is written to the stream, it writes the class name and serialVersionUID, flags, and the number of fields. Depending on the class, additional information may be written:

 For non-serializable classes, the number of fields is always zero. Neither the SC_SERIALIZABLE nor the SC_EXTERNALIZABLE flag bits are set.
 For serializable classes, the SC_SERIALIZABLE flag is set, the number of fields counts the number of serializable fields and is followed by a descriptor for each serializable field. The descriptors are written in canonical order. The descriptors for primitive typed fields are written first sorted by field name followed by descriptors for the object typed fields sorted by field name. The names are sorted using String.compareTo. For details of the format, refer to Section 6.4, "Grammar for the Stream Format".
 For externalizable classes, flags includes the SC_EXTERNALIZABLE flag, and the number of fields is always zero.
 For enum types, flags includes the SC_ENUM flag, and the number of fields is always zero.
When an ObjectOutputStream serializes the ObjectStreamClass descriptor for a dynamic proxy class, as determined by passing its Class object to the isProxyClass method of java.lang.reflect.Proxy, it writes the number of interfaces that the dynamic proxy class implements, followed by the interface names. Interfaces are listed in the order that they are returned by invoking the getInterfaces method on the Class object of the dynamic proxy class.

The serialized representations of ObjectStreamClass descriptors for dynamic proxy classes and non-dynamic proxy classes are differentiated through the use of different typecodes (TC_PROXYCLASSDESC and TC_CLASSDESC, respectively); for a more detailed specification of the grammar, see Section 6.4, "Grammar for the Stream Format".
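The canonical field order mentioned above (primitive fields first, then object fields, each group sorted by name) can be observed through ObjectStreamClass.getFields. This is a sketch with a hypothetical Sample class whose fields are declared out of order on purpose.

```java
import java.io.ObjectStreamClass;
import java.io.ObjectStreamField;
import java.io.Serializable;

// Hypothetical class; declaration order is deliberately not the canonical order.
class Sample implements Serializable {
    private static final long serialVersionUID = 1L;
    String b;
    int z;
    Object c;
    int a;
}

class FieldOrder {
    // Joins the serializable field names in the order the descriptor reports them.
    static String describe(Class<?> type) {
        StringBuilder names = new StringBuilder();
        for (ObjectStreamField field : ObjectStreamClass.lookup(type).getFields()) {
            if (names.length() > 0) {
                names.append(',');
            }
            names.append(field.getName());
        }
        return names.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(Sample.class));
    }
}
```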

>'serialver -show' opens a GUI that lets you look up the serialVersionUID of any class

>4.6 Stream Unique Identifiers

Each versioned class must identify the original class version for which it is capable of writing streams and from which it can read. For example, a versioned class must declare:

    private static final long serialVersionUID = 3487495895819393L;

The stream-unique identifier is a 64-bit hash of the class name, interface class names, methods, and fields. The value must be declared in all versions of a class except the first. It may be declared in the original class but is not required. The value is fixed for all compatible classes. If the SUID is not declared for a class, the value defaults to the hash for that class. The serialVersionUID for dynamic proxy classes and enum types always has the value 0L. Array classes cannot declare an explicit serialVersionUID, so they always have the default computed value, but the requirement for matching serialVersionUID values is waived for array classes.

Note - It is strongly recommended that all serializable classes explicitly declare serialVersionUID values, since the default serialVersionUID computation is highly sensitive to class details that may vary depending on compiler implementations, and can thus result in unexpected serialVersionUID conflicts during deserialization, causing deserialization to fail.
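A sketch of the difference between a declared and a defaulted serialVersionUID, read back through ObjectStreamClass. Invoice, Untagged, and SuidLookup are hypothetical names.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

// Declares its identifier explicitly, as the note recommends.
class Invoice implements Serializable {
    private static final long serialVersionUID = 3487495895819393L;
    int amountCents;
}

// Declares nothing, so the runtime computes the hash-based default.
class Untagged implements Serializable {
    int n;
}

class SuidLookup {
    static long suidOf(Class<?> type) {
        return ObjectStreamClass.lookup(type).getSerialVersionUID();
    }

    public static void main(String[] args) {
        System.out.println(suidOf(Invoice.class));  // the declared value
        System.out.println(suidOf(Untagged.class)); // a computed value that depends on class details
    }
}
```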

>The serialVersionUID is computed using the signature of a stream of bytes that reflect the class definition. The National Institute of Standards and Technology (NIST) Secure Hash Algorithm (SHA-1) is used to compute a signature for the stream. The first two 32-bit quantities are used to form a 64-bit hash. A java.io.DataOutputStream is used to convert primitive data types to a sequence of bytes. The values input to the stream are defined by the Java Virtual Machine (VM) specification for classes. Class modifiers may include the ACC_PUBLIC, ACC_FINAL, ACC_INTERFACE, and ACC_ABSTRACT flags; other flags are ignored and do not affect serialVersionUID computation. Similarly, for field modifiers, only the ACC_PUBLIC, ACC_PRIVATE, ACC_PROTECTED, ACC_STATIC, ACC_FINAL, ACC_VOLATILE, and ACC_TRANSIENT flags are used when computing serialVersionUID values. For constructor and method modifiers, only the ACC_PUBLIC, ACC_PRIVATE, ACC_PROTECTED, ACC_STATIC, ACC_FINAL, ACC_SYNCHRONIZED, ACC_NATIVE, ACC_ABSTRACT and ACC_STRICT flags are used. Names and descriptors are written in the format used by the java.io.DataOutputStream.writeUTF method.

The sequence of items in the stream is as follows:


1. The class name.
2. The class modifiers written as a 32-bit integer.
3. The name of each interface sorted by name.
4. For each field of the class sorted by field name (except private static and private transient fields):
   a. The name of the field.
   b. The modifiers of the field written as a 32-bit integer.
   c. The descriptor of the field.
5. If a class initializer exists, write out the following:
   a. The name of the method, <clinit>.
   b. The modifier of the method, java.lang.reflect.Modifier.STATIC, written as a 32-bit integer.
   c. The descriptor of the method, ()V.
6. For each non-private constructor sorted by method name and signature:
   a. The name of the method, <init>.
   b. The modifiers of the method written as a 32-bit integer.
   c. The descriptor of the method.
7. For each non-private method sorted by method name and signature:
   a. The name of the method.
   b. The modifiers of the method written as a 32-bit integer.
   c. The descriptor of the method.
8. The SHA-1 algorithm is executed on the stream of bytes produced by DataOutputStream and produces five 32-bit values sha[0..4].
9. The hash value is assembled from the first and second 32-bit values of the SHA-1 message digest. If the result of the message digest, the five 32-bit words H0 H1 H2 H3 H4, is in an array of five int values named sha, the hash value would be computed as follows:
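  // The 0xFFL mask widens each byte to long; with a plain int mask,
  // the shifts by 32 and more would wrap around in 32-bit arithmetic.
  long hash = ((sha[0] >>> 24) & 0xFFL) |
              ((sha[0] >>> 16) & 0xFFL) << 8 |
              ((sha[0] >>> 8) & 0xFFL) << 16 |
              ((sha[0] >>> 0) & 0xFFL) << 24 |
              ((sha[1] >>> 24) & 0xFFL) << 32 |
              ((sha[1] >>> 16) & 0xFFL) << 40 |
              ((sha[1] >>> 8) & 0xFFL) << 48 |
              ((sha[1] >>> 0) & 0xFFL) << 56;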
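A runnable sketch of the assembly step, with two formulations side by side. SuidHash is a hypothetical class name. The word-based method follows the formula in the text, with the mask written as a long literal so that the shifts happen in 64-bit arithmetic. The byte-based method is my own equivalent restatement (the first eight digest bytes read as a little-endian long) and is not claimed to be the JDK's code.

```java
class SuidHash {
    // Word-based form: sha holds the SHA-1 digest as 32-bit words H0..H4.
    static long fromWords(int[] sha) {
        return ((sha[0] >>> 24) & 0xFFL)
             | ((sha[0] >>> 16) & 0xFFL) << 8
             | ((sha[0] >>> 8) & 0xFFL) << 16
             | ((sha[0] >>> 0) & 0xFFL) << 24
             | ((sha[1] >>> 24) & 0xFFL) << 32
             | ((sha[1] >>> 16) & 0xFFL) << 40
             | ((sha[1] >>> 8) & 0xFFL) << 48
             | ((sha[1] >>> 0) & 0xFFL) << 56;
    }

    // Byte-based form: digest holds the same SHA-1 result as raw bytes.
    static long fromBytes(byte[] digest) {
        long hash = 0;
        for (int i = 7; i >= 0; i--) {
            hash = (hash << 8) | (digest[i] & 0xFF);
        }
        return hash;
    }

    public static void main(String[] args) {
        int[] words = {0x01020304, 0x05060708, 0, 0, 0};
        byte[] bytes = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(fromWords(words) == fromBytes(bytes));
    }
}
```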

Versioning
----------
>Versioning raises some fundamental questions about the identity of a class, including what constitutes a compatible change. A compatible change is a change that does not affect the contract between the class and its callers.

>5.2 Goals

The goals are to:

 Support bidirectional communication between different versions of a class operating in different virtual machines by:
Defining a mechanism that allows Java classes to read streams written by older versions of the same class.
Defining a mechanism that allows Java classes to write streams intended to be read by older versions of the same class.
 Provide default serialization for persistence and for RMI.
 Perform well and produce compact streams in simple cases, so that RMI can use serialization.
 Be able to identify and load classes that match the exact class used to write the stream.
 Keep the overhead low for nonversioned classes.
 Use a stream format that allows the traversal of the stream without having to invoke methods specific to the objects saved in the stream.

>The following are the principal aspects of the design for versioning of serialized object streams.

 The default serialization mechanism will use a symbolic model for binding the fields in the stream to the fields in the corresponding class in the virtual machine.
 Each class referenced in the stream will uniquely identify itself, its supertype, and the types and names of each serializable field written to the stream. The fields are ordered with the primitive types first sorted by field name, followed by the object fields sorted by field name.
 Two types of data may occur in the stream for each class: required data (corresponding directly to the serializable fields of the object); and optional data (consisting of an arbitrary sequence of primitives and objects). The stream format defines how the required and optional data occur in the stream so that the whole class, the required, or the optional parts can be skipped if necessary.
The required data consists of the fields of the object in the order defined by the class descriptor.
The optional data is written to the stream and does not correspond directly to fields of the class. The class itself is responsible for the length, types, and versioning of this optional information.
 If defined for a class, the writeObject/readObject methods supersede the default mechanism to write/read the state of the class. These methods write and read the optional data for a class. The required data is written by calling defaultWriteObject and read by calling defaultReadObject.
 The stream format of each class is identified by the use of a Stream Unique Identifier (SUID). By default, this is the hash of the class. All later versions of the class must declare the Stream Unique Identifier (SUID) that they are compatible with. This guards against classes with the same name that might inadvertently be identified as being versions of a single class.
 Subtypes of ObjectOutputStream and ObjectInputStream may include their own information identifying the class using the annotateClass method; for example, MarshalOutputStream embeds the URL of the class.


>5.6.1 Incompatible Changes

Incompatible changes to classes are those changes for which the guarantee of interoperability cannot be maintained. The incompatible changes that may occur while evolving a class are:

 Deleting fields - If a field is deleted in a class, the stream written will not contain its value. When the stream is read by an earlier class, the value of the field will be set to the default value because no value is available in the stream. However, this default value may adversely impair the ability of the earlier version to fulfill its contract.
 Moving classes up or down the hierarchy - This cannot be allowed since the data in the stream appears in the wrong sequence.
 Changing a nonstatic field to static or a nontransient field to transient - When relying on default serialization, this change is equivalent to deleting a field from the class. This version of the class will not write that data to the stream, so it will not be available to be read by earlier versions of the class. As when deleting a field, the field of the earlier version will be initialized to the default value, which can cause the class to fail in unexpected ways.
 Changing the declared type of a primitive field - Each version of the class writes the data with its declared type. Earlier versions of the class attempting to read the field will fail because the type of the data in the stream does not match the type of the field.
 Changing the writeObject or readObject method so that it no longer writes or reads the default field data or changing it so that it attempts to write it or read it when the previous version did not. The default field data must consistently either appear or not appear in the stream.
 Changing a class from Serializable to Externalizable or vice versa is an incompatible change since the stream will contain data that is incompatible with the implementation of the available class.
 Changing a class from a non-enum type to an enum type or vice versa since the stream will contain data that is incompatible with the implementation of the available class.
 Removing either Serializable or Externalizable is an incompatible change since when written it will no longer supply the fields needed by older versions of the class.
 Adding the writeReplace or readResolve method to a class is incompatible if the behavior would produce an object that is incompatible with any older version of the class.


>5.6.2 Compatible Changes

The compatible changes to a class are handled as follows:

 Adding fields - When the class being reconstituted has a field that does not occur in the stream, that field in the object will be initialized to the default value for its type. If class-specific initialization is needed, the class may provide a readObject method that can initialize the field to nondefault values.
 Adding classes - The stream will contain the type hierarchy of each object in the stream. Comparing this hierarchy in the stream with the current class can detect additional classes. Since there is no information in the stream from which to initialize the object, the class's fields will be initialized to the default values.
 Removing classes - Comparing the class hierarchy in the stream with that of the current class can detect that a class has been deleted. In this case, the fields and objects corresponding to that class are read from the stream. Primitive fields are discarded, but the objects referenced by the deleted class are created, since they may be referred to later in the stream. They will be garbage-collected when the stream is garbage-collected or reset.
 Adding writeObject/readObject methods - If the version reading the stream has these methods then readObject is expected, as usual, to read the required data written to the stream by the default serialization. It should call defaultReadObject first before reading any optional data. The writeObject method is expected as usual to call defaultWriteObject to write the required data and then may write optional data.
 Removing writeObject/readObject methods - If the class reading the stream does not have these methods, the required data will be read by default serialization, and the optional data will be discarded.
 Adding java.io.Serializable - This is equivalent to adding types. There will be no values in the stream for this class so its fields will be initialized to default values. The support for subclassing nonserializable classes requires that the class's supertype have a no-arg constructor and the class itself will be initialized to default values. If the no-arg constructor is not available, the InvalidClassException is thrown.
 Changing the access to a field - The access modifiers public, package, protected, and private have no effect on the ability of serialization to assign values to the fields.
 Changing a field from static to nonstatic or transient to nontransient - When relying on default serialization to compute the serializable fields, this change is equivalent to adding a field to the class. The new field will be written to the stream but earlier classes will ignore the value since serialization will not assign values to static or transient fields.


Behavior when both writeObject/readObject and writeReplace/readResolve methods are present
----------------------------------------------------------------------------------------
The call sequence follows this order: Serialized object class is Seri and its replacement object is SeriR (SeriR extends Seri)

call to writeReplace():Seri
call to writeReplace():SeriR
call to writeObject():SeriR
call to writeReplace():SeriR
call to writeObject():SeriR
call to readObject():SeriR
call to readObject():SeriR
call to readResolve():SeriR
call to readResolve():SeriR

#readResolve() of Seri class is never called
#object returned by writeReplace() can be ignored in writeObject() and something else can be written to the stream
#writeObject() has the final say in what goes to file or stream
#since readResolve() is called last, anything that was read from the file can be overridden
#readResolve() has the final say in what comes out of deserialization
#note that both writeReplace and readResolve should return Object type; these methods can have any access modifiers
#readObject/writeObject should be private for them to be called during serialization/deserialization
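A separate, simpler sketch of writeReplace and readResolve working as a pair. It does not reproduce the Seri/SeriR experiment above: Celsius and CelsiusProxy are hypothetical, unrelated classes, where the real object nominates a stand-in on write and the stand-in hands back the real type on read.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamException;
import java.io.Serializable;

class Celsius implements Serializable {
    private static final long serialVersionUID = 1L;
    private final double degrees;

    Celsius(double degrees) {
        this.degrees = degrees;
    }

    double degrees() { return degrees; }

    // Called before writing: the stand-in is what actually goes to the stream.
    private Object writeReplace() throws ObjectStreamException {
        return new CelsiusProxy(degrees);
    }
}

class CelsiusProxy implements Serializable {
    private static final long serialVersionUID = 1L;
    private final double degrees;

    CelsiusProxy(double degrees) {
        this.degrees = degrees;
    }

    // Called after reading: the caller of readObject receives this return value.
    private Object readResolve() throws ObjectStreamException {
        return new Celsius(degrees);
    }
}

class ReplaceResolveDemo {
    static Object roundTrip(Object obj) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Object result = roundTrip(new Celsius(21.5));
        System.out.println(result.getClass().getSimpleName());
    }
}
```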

>An Externalizable class must have a public no-arg constructor; otherwise a runtime exception (InvalidClassException) is thrown during deserialization
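A minimal Externalizable sketch, assuming a hypothetical Account class. The public no-arg constructor is the one the runtime uses to create the instance before it calls readExternal.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

class Account implements Externalizable {
    private String owner;
    private long balanceCents;

    // Required by the Externalizable contract: public and without arguments.
    public Account() {
    }

    Account(String owner, long balanceCents) {
        this.owner = owner;
        this.balanceCents = balanceCents;
    }

    String owner() { return owner; }
    long balanceCents() { return balanceCents; }

    // The class is fully responsible for its own stream format.
    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(owner);
        out.writeLong(balanceCents);
    }

    // Must read back exactly what writeExternal wrote, in the same order.
    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        owner = in.readUTF();
        balanceCents = in.readLong();
    }
}

class AccountDemo {
    static Account roundTrip(Account account) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(account);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return (Account) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Account copy = roundTrip(new Account("alice", 1250L));
        System.out.println(copy.owner() + " " + copy.balanceCents());
    }
}
```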

>By extending ObjectOutputStream/ObjectInputStream, you can have your own custom streams
>Object replacement, resolution, protection of sensitive data, JVM-wide security policy enforcement, and similar functions can be moved into such custom streams; this hides those complexities from the users of the custom streams
>For example, a custom stream can wrap every object or primitive written into it with a wrapper class that provides uniform custom encryption protection
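A sketch of a custom output stream that centralizes protection of sensitive data. Note the swap: instead of the encryption wrapper mentioned above, this example simply redacts values, to keep it short. Secret and RedactingOutputStream are hypothetical names; enableReplaceObject and replaceObject are the standard ObjectOutputStream hooks.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;

// Hypothetical marker type for values that must not leave the JVM in clear text.
class Secret implements Serializable {
    private static final long serialVersionUID = 1L;
    final String value;

    Secret(String value) {
        this.value = value;
    }
}

class RedactingOutputStream extends ObjectOutputStream {
    RedactingOutputStream(OutputStream out) throws IOException {
        super(out);
        enableReplaceObject(true); // replaceObject is not consulted unless this is enabled
    }

    // Called for each object about to be written; callers of writeObject never see this step.
    @Override
    protected Object replaceObject(Object obj) throws IOException {
        if (obj instanceof Secret) {
            return new Secret("[redacted]");
        }
        return obj;
    }
}

class RedactingDemo {
    static Object writeRedactedThenRead(Object obj) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new RedactingOutputStream(buffer)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Secret copy = (Secret) writeRedactedThenRead(new Secret("hunter2"));
        System.out.println(copy.value);
    }
}
```

An encrypting variant would follow the same shape, with replaceObject returning a wrapper that holds ciphertext and a matching ObjectInputStream subclass overriding resolveObject to unwrap it.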

What's the difference between the SUID (Stream Unique IDentifier) and the private static member serialVersionUID?

Answer:
The SUID is one of a number of things that the serialization protocol writes to the stream in addition to the serialized object (other things include a magic number and the fully-qualified class name of the object). The SUID is not the same as the static variable serialVersionUID, although the SUID is taken from that field if it exists. In pseudocode,

if (serialVersionUID is defined) then
    SUID is set equal to serialVersionUID
else
    SUID is computed algorithmically

Because serialVersionUID is a static member, it is not written to the stream as part of the serialized object. Instead, serialization uses the serialVersionUID to determine the SUID. The SUID is then sent to the stream as part of the stream protocol, not as part of the object definition.

Deserializing requires two things:

 The serialized object. This does not include the static member serialVersionUID, but it does include the SUID, fully-qualified class name, etc.
 The .class file. This does include the static members.

When deserializing, the SUID embedded in the object input stream is compared to the SUID computed from the local .class file according to the pseudocode above. If the SUIDs are equal, then the serialized object is compatible with the class file definition.

Reference: JDK Docs and Tutorials
