Tuesday, May 15, 2012

GIST NOTES 6 - Java Strings, IO Formatting, Parsing




[DISCLAIMER: This is solely for non-commercial use. I don't claim ownership of this content. This is a crux of all my readings studies and analysis. Some of them are excerpts from famous books on  the subject. Some of them are my contemplation upon experiments with direct hand coded code samples using IDE or notepad.


I've created this mainly to reduce an entire book into few pages of critical content that we should never forget. Even after years, you don't need to read the entire book again to get back its philosophy. I hope these notes will help you to replay the entire book in your mind once again.]


[JDK 7]

String, StringBuffer, StringBuilder
-----------------------------------
String - immutable
StringBuffer - mutable, old, synchronized(slower)
StringBuilder - mutable, new, NOT synchronized(so faster)

>once a String object is created, it can never be changed
>String is immutable object
>string contains 16bit Unicode characters
>String class has zillion constructors

String s = "hi"; s = new String(); s = new String("abcd"); s=s.concat(" are alphabets"); and so on

>though string operations make string look like mutable(modifiable), they are actually immutable; the illusion is due to silent creation of new string objects by JVM whenever someone tries/wants to modify a string

>any operation on a string object does not modify that object; it merely returns appropriate result in the form of another string object

>when we use string objects left and right, it looks like the JVM is also creating new String objects left and right, and many of them are lost in the heap without any references to them; what happens to all the memory occupied by orphaned strings? what about the performance cost of creating so many string objects every time? orphaned strings are reclaimed by the garbage collector like any other object, and for string literals java keeps a String Literal Pool where they are cached and reused

>as applications grow, string literals would occupy large part of program memory

String Constant Pool/String Literal Pool
----------------------------------------
whenever a string literal is needed, the JVM checks the pool first to see if a matching string is there; if yes, its reference is returned, otherwise a new string gets created and placed in the pool; the strings in the pool just sit there and their references grow in count as more and more people(variables) start using the same string objects; that's why nobody is allowed to modify string objects(many people are using the same string object without being aware of each other); that's also why java made the String class final, so that nobody can override the behavior of String objects

String s = "abc"; //one object gets created and sits in literal pool; s refers to it

String s = new String("abc"); //two string objects get created; object created using new operator goes to nonpool area
//and is referred by s; whereas "abc" literal goes and sits in literal pool as an orphan (for now)
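The pool behavior described above can be observed with reference comparison. This is only a small sketch; the class and method names (StringPoolDemo, sameObject) are made up for illustration.

```java
class StringPoolDemo {
    // true only when both variables refer to the very same object
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String literal1 = "abc";
        String literal2 = "abc";           // reuses the pooled object
        String viaNew = new String("abc"); // separate object outside the pool

        System.out.println(sameObject(literal1, literal2));        // true
        System.out.println(sameObject(literal1, viaNew));          // false
        System.out.println(literal1.equals(viaNew));               // true
        System.out.println(sameObject(literal1, viaNew.intern())); // true
    }
}
```

intern() hands back the pooled copy, which is why the last comparison is true. For comparing contents, equals() is the one to use.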

>indices addressing characters in string are 0-based

Common String methods
---------------------
charAt()
*concat()
equalsIgnoreCase()
length()
*replace(char oldchar, char newchar) - replaces all occurrences of a given char with another
*substring(int startIndex) - string from startIndex to end of the string
*substring(int startIndex, int endIndex) - return string from startIndex to endIndex excluding endIndex char
*toLowerCase()
toString()
*toUpperCase()
*trim() - cuts leading and trailing whitespace

*none of the starred methods modify the original string, even if they look like they do; they return a new string, so be wary of it

#arrays have member 'length', String objects have method 'length()'; be wary of compiler errors
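A quick sketch of the "starred methods don't modify the original" trap. The names ImmutableStringDemo and shout are hypothetical, picked just for this example.

```java
class ImmutableStringDemo {
    static String shout(String s) {
        s.toUpperCase(); // result thrown away; s is untouched
        s.concat("!");   // same mistake
        return s.trim().toUpperCase().concat("!"); // keep what the methods return
    }

    public static void main(String[] args) {
        String original = "  hi  ";
        System.out.println("[" + shout(original) + "]"); // [HI!]
        System.out.println("[" + original + "]");        // [  hi  ]
        System.out.println("hello".substring(1, 3));     // el
        // arrays use the field 'length', Strings use the method 'length()'
        System.out.println(new char[3].length + " " + "abc".length()); // 3 3
    }
}
```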

StringBuffer and StringBuilder
-------------------------------
>they are in java.lang package
>they are used when lots of modifications to strings of characters are done; when we do many modification operations(so to speak) with string objects, we leave behind many orphaned String objects which waste memory until they are garbage collected; using StringBuffer/StringBuilder leaves no such intermediate objects behind
>used when streams of characters are buffered in IO operations; blocks of data processed by StringBuilder/StringBuffer; for each block same StringBuilder/StringBuffer object can be used(efficient use of memory)
>StringBuilder was added in Java 5; it is not synchronized(not thread safe) and hence faster
>StringBuilder and StringBuffer are exactly the same except for thread safety and speed

String s = "hi";
s = s.concat(" hello"); //leaves "hi" literal behind

StringBuffer s = new StringBuffer("hi");
s.append(" hello"); //does not leave any other StringBuffer objects behind; there is only one object from the beginning;
//all modifications are done on the same object

>since StringBuffer methods return the resulting StringBuffer objects just like String class methods (but unlike String it modifies itself first), operation chaining can be done as follows:-

s.append(" how are you?").reverse().insert(3, "dude "); //all operations are done over and over the same object finally resulting
//in some bizarre output :)

>hmmm, I wonder what happens to those string literals we use as input arguments to StringBuffer constructors and methods; they still stick around?

>StringBuffer - it's called buffer because it is synchronized, threadsafe and like an actual buffer

>StringBuilder - it's called builder, because its only purpose is to allow people build and play with strings of characters (no thread safety)

>StringBuffer.append() can take boolean, byte, char, double, float, int, long and others as well

>StringBuilder.delete(int startIndex, int endIndex) - deletes chars from startIndex to endIndex excluding endIndex char from the original object and returns the reference to itself

>StringBuilder.insert(int indexToPlace, String stringToPlace) - stringToPlace is inserted at index indexToPlace; in the final string the first char of stringToPlace will be at index indexToPlace.

>StringBuffer.reverse() - characters are reversed; last char comes to first; and so on
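Here is a small sketch of delete(), insert() and reverse() chained on one StringBuilder. The class name BuilderDemo, the method edit and the input "0123456789" are my own picks for illustration.

```java
class BuilderDemo {
    static String edit(String start) {
        StringBuilder sb = new StringBuilder(start);
        // "0123456789" -> delete(2,5) -> "0156789"
        //              -> insert(3,"abc") -> "015abc6789"
        //              -> reverse() -> "9876cba510"
        sb.delete(2, 5).insert(3, "abc").reverse();
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(edit("0123456789")); // 9876cba510

        StringBuilder sb = new StringBuilder("hi");
        System.out.println(sb.append(" hello") == sb); // true: same object comes back

        // equals() is not overridden, so compare via toString()
        StringBuilder a = new StringBuilder("a");
        StringBuilder b = new StringBuilder("a");
        System.out.println(a.equals(b));                       // false
        System.out.println(a.toString().equals(b.toString())); // true
    }
}
```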

File Navigation and IO
======================

File -> object pointing to a file or directory; creating new files, deleting files, searching files and making new directories are done by this class

FileReader -> reads character files; low-level character reading methods; every read operation is performance intensive

BufferedReader -> makes low level reader classes more efficient and user friendly; reads relatively large chunks at once and buffers, hence high performance; minimizes no.of times the time-intensive file read operations is performed; works with buffer; can read lines instead of character

FileWriter -> char based low level writer; every write operation is performance intensive/time consuming

BufferedWriter -> wraps low level writers(e.g. FileWriter); fewer writes; works with buffer to save time and effort; writes larger chunk at once; can write a line at a time;

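PrintWriter -> enhanced in java 5; it can now be constructed directly from a File or a String(file name) as the destination, in addition to a Writer or OutputStream; flexible and powerful methods like format(), printf() and append()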

Console -> new in Java 6; reads input from console; writes formatted output to the console;

Stream classes -> read/write bytes (associated with serialization also)
Readers and Writers -> read/write characters
Buffered Reader/Writer -> read/write lines(usually)

File class objects represent files or directories, not the data inside the file.

File
-----
>Instantiating a new File object doesn't create a file on the disk
>Instantiating FileWriter object automatically creates the mentioned file on the disk if it doesn't exist already
>FileWriter.flush() method call is necessary to make sure that every last bit of data we wrote was sent to the file; since a kind of buffering is done in the write stream, one has to make sure the every last bit was written to the file; hence you call flush() when you are done writing
>flush() is not available in Reader classes; no flushing is required while reading
>FileReader/FileWriter.close() method releases all operating system resources related to the file; once your job is over, no need to hang on to those precious OS resources; so we call close()

>FileReader.read(char[]) reads up to as many chars as the array can hold (the whole file only if it fits) and returns the count of chars actually read, or -1 at end of file

File myDir = new File("mydir");
myDir.mkdir();//create directory

File file = new File(myDir, "log.txt");
file.createNewFile(); //create a file under that directory

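file.renameTo(new File("log_renamed.txt")); //rename a file; note that 'file' itself still points to the old path
file.delete(); //delete a file (returns false if nothing exists at that path)
myDir.renameTo(new File("mydir_renamed")); //rename a directory
String[] contents = myDir.list(); //returns an array of names of files and directories in myDir (null if myDir is not an existing directory)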

>PrintWriter(File) can directly write into a file - we don't need FileWriter
>non-empty directory cannot be deleted
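The points above about File, FileWriter, flush() and close() can be tried out with a sketch like this one. The class name FileRoundTrip, the method name and the file name gist-notes-demo.txt are assumptions for illustration; try-with-resources (JDK 7) takes care of close().

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

class FileRoundTrip {
    static String writeThenReadFirstLine(File dir, String text) throws IOException {
        File file = new File(dir, "gist-notes-demo.txt"); // nothing is created on disk yet
        try (BufferedWriter out = new BufferedWriter(new FileWriter(file))) { // FileWriter creates the file
            out.write(text);
            out.newLine();
            out.flush(); // push buffered chars down to the file
        } // close() happens here automatically
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            return in.readLine(); // one whole line, without the line terminator
        } finally {
            file.delete(); // runs after the reader is closed
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(writeThenReadFirstLine(tmp, "hello file"));
    }
}
```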

>System.console() returns reference to java.io.Console object; if the runtime environment(machine) has no console device (application not run from command line), then it returns null

Console
-------
>can accept input both echoed and nonechoed(passwords)
>can print formatted output to the console
>Console.readLine() returns String (input from console)
>Console.readPassword() returns the user input as a char array, because an array can be overwritten (e.g. zeroed out) as soon as you are done with it, whereas a String is immutable and would be lying around in memory until it is garbage collected, where a notorious hacker might get hold of it.
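A minimal Console sketch, assuming the program is started from a real terminal. ConsoleDemo and wipe are made-up names; the null check matters because System.console() returns null when there is no console.

```java
import java.io.Console;
import java.util.Arrays;

class ConsoleDemo {
    // overwrite the password chars once they are no longer needed
    static void wipe(char[] secret) {
        Arrays.fill(secret, '\0');
    }

    public static void main(String[] args) {
        Console console = System.console();
        if (console == null) {
            System.out.println("No console device available");
            return;
        }
        String user = console.readLine("User: ");                          // echoed input
        char[] password = console.readPassword("Password for %s: ", user); // non-echoed input
        if (password != null) {
            console.printf("Read %d password characters%n", password.length);
            wipe(password);
        }
    }
}
```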

Serialization
-------------
>serialization/deserialization can use java.io: DataInputStream,DataOutputStream,FileInputStream,FileOutputStream,ObjectInputStream, ObjectOutputStream, Serializable

>Serialization involves objects - hence Object input/output stream
>Serialization involves persisting objects into files - hence File Input/Output Stream
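>Serialization involves writing primitive values into the stream as well - hence Data Input/Output Stream (transferring objects over the network uses the same object streams wrapped around the socket's streams)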
>Serialization involves binary data - hence it is Stream based classes (not Writer/Reader or Buffered based classes)
>Serialization involves writing binary output for converting objects to bits(serialization) - hence Output classes
>Serialization involves reading binary input for converting bits to objects(deserialization) - hence Input classes

[enough? :)]

Serialization - saving object and its state (all member data except transient variables - it's a feature)

>Mark unwanted variables as transient if you don't want them to be saved/persisted

Serialize: ObjectOutputStream.writeObject(Object obj)
Deserialize: ObjectInputStream.readObject() - returns Object

ObjectOutputStream and ObjectInputStream are considered higher level classes in java.io package because they deal with the cool Objects. Since, they are higher level you can wrap them around lower level classes like FileInputStream and FileOutputStream.

>FileOutputStream(String filename) constructor automatically creates the file if it doesn't exist
>wrap FileOutputStream object into ObjectOutputStream object and start serializing objects with writeObject method
>writeObject() - serializes and writes into the stream (can be a file stream or network stream)
>readObject() - reads from stream and deserializes object; that is reconstructs the objects and gives it to you

>when you serialize an object, java serializes the entire "object graph"; that is all the objects inside your object and further down and so on; "a deep copy" of everything the saved object needs to be restored

>all the objects inside your object should be serializable; otherwise java.io.NotSerializableException is thrown at runtime

>you can choose to ignore unserializable object members in your class by marking them as 'transient'

>sometimes the member object may be of a third party class whose source code you don't have; in such cases, it is hard to make that member as serializable; so you can mark that member as transient and get on with your serializing business; but when you restore your saved object it will not have this third party object member properly initialized; if you can't live without this third party object, there is a way to intervene serialization/deserialization processes in the middle and save/read some extra data obtained from the third party object, so that you will be able to restore that object too.
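The basic writeObject()/readObject() round trip with a transient member might look like the sketch below. Account and SerializationDemo are hypothetical names, and an in-memory byte array stands in for the file so that the example leaves nothing on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class Account implements Serializable {
    private static final long serialVersionUID = 1L;
    String owner;
    transient String password; // not saved

    Account(String owner, String password) {
        this.owner = owner;
        this.password = password;
    }
}

class SerializationDemo {
    static Account roundTrip(Account a) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(a); // serialize
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Account) in.readObject(); // deserialize; returns Object, hence the cast
        }
    }

    public static void main(String[] args) throws Exception {
        Account restored = roundTrip(new Account("sally", "secret"));
        System.out.println(restored.owner);    // sally
        System.out.println(restored.password); // null
    }
}
```

Swapping the byte array streams for FileOutputStream/FileInputStream gives the file based version described above.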

Serialization mechanism provides the following sneaky way to help you add extra bits to your serialized data. Provide the following pair of private methods(with exact signature as given) in your class which you are attempting to serialize but which contains an unserializable third party object:-

private void writeObject(ObjectOutputStream os) throws IOException {
os.defaultWriteObject(); //ask JVM to do the usual serialization
os.writeInt(thirdPartyObject.getData()); //add third party data bits manually to serialization stream
}

private void readObject(ObjectInputStream is) throws IOException, ClassNotFoundException {
is.defaultReadObject(); //ask JVM to do the usual deserialization; this restores your parent object back
thirdPartyObject = new ThirdPartyObject();
thirdPartyObject.setData(is.readInt()); //read back the extra bits you added during serialization, and restore your third party
//object manually
}

#reading the extra data is done in the same order as it was done during writing time

#these two methods look similar to ObjectOutputStream/ObjectInputStream class methods; but these are different; only the names are same; the input argument is different and these are 'private' methods!

#how the hell JVM manages to call our private methods? why serialization mechanism provides such 'informal' way to do some manual serialization?

#this methodology is used when you have to save some part of the object state manually
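A complete, runnable version of the private writeObject()/readObject() pair could look like this sketch. Engine plays the role of the unserializable third party class; Engine, Car and CustomSerializationDemo are all hypothetical names, and the byte array again stands in for a file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class Engine { // stand-in for a third party class that is NOT Serializable
    private int power;
    int getPower() { return power; }
    void setPower(int power) { this.power = power; }
}

class Car implements Serializable {
    private static final long serialVersionUID = 1L;
    String model = "demo";
    transient Engine engine = new Engine(); // skipped by default serialization

    private void writeObject(ObjectOutputStream os) throws IOException {
        os.defaultWriteObject();        // the usual serialization
        os.writeInt(engine.getPower()); // extra data written by hand
    }

    private void readObject(ObjectInputStream is) throws IOException, ClassNotFoundException {
        is.defaultReadObject();         // the usual deserialization
        engine = new Engine();
        engine.setPower(is.readInt());  // read back in the same order as written
    }
}

class CustomSerializationDemo {
    static Car roundTrip(Car car) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(car);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Car) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Car car = new Car();
        car.engine.setPower(150);
        System.out.println(roundTrip(car).engine.getPower()); // 150
    }
}
```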

#why not all classes in java are serializable? why didn't they make Object class serializable?

One thing is for security reasons. Not all objects are safe to be left around in files or sent over network. And those classes which are runtime specific obviously makes no sense to serialize and save them; e.g. streams, threads, runtime, some GUI classes connected to underlying OS, etc.

Inheritance affects Serialization
---------------------------------

>If parent class implements Serializable, all sub classes automatically become serializable; but any new non-serializable member introduced in subclasses will cause runtime error

>so to check whether a class is serializable or not, one has to examine the entire class hierarchy not just that class alone

>Object class does not implement Serializable

>when an object is constructed using new operator, the following things happen in that order

1.All instance variables are assigned default values (even if some of them have an initial value in their declaration, it is not used yet)
2.The constructor is invoked, which invokes the super class constructor(or another constructor of the same class and then, ultimately, the superclass constructor)

3.All superclass constructors complete

4.Instance variables that are assigned initial values as part of their declaration are given those values

5.The constructor completes

#All the above things DO NOT HAPPEN when an object is deserialized(when all ancestors are also serializable)

>Deserialization does not call the constructors (or super class constructors for that matter)
>Deserialization does not provide initially assigned values(as part of the declaration) to data members
>Otherwise, it would defeat the purpose of restoring the object state right? we don't want initial values of a brand new object; we want what we saved
>transient variables also do not get their declared initial values; instead they get default values after restoration (if private void writeObject() is not implemented sneakily)
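The "no constructor, no initializers" behavior can be checked with a sketch like the following. Visit and SkippedConstructorDemo are hypothetical names; the static counter is only there to count constructor calls.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class Visit implements Serializable {
    private static final long serialVersionUID = 1L;
    static int constructorCalls = 0; // static, so never part of the saved state
    int saved = 1;
    transient int cache = 99;        // declared initial value

    Visit() { constructorCalls++; }
}

class SkippedConstructorDemo {
    static Visit roundTrip(Visit v) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(v);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Visit) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Visit original = new Visit(); // constructor runs once here
        original.saved = 7;
        Visit restored = roundTrip(original);
        System.out.println(Visit.constructorCalls); // 1: not called again on restore
        System.out.println(restored.saved);         // 7: the saved value, not the initial 1
        System.out.println(restored.cache);         // 0: default value, not the declared 99
    }
}
```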

Subclass is Serializable Super class is NOT
-------------------------------------------
>So, coming to inheritance, if a class implements Serializable and its super class is not serializable, the above said rule of skipping the normal object initialization routines (constructors, member initialization) is BROKEN partly

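>object initialization process is done starting from the first non-serializable parent all the way up to the top most parent in the class hierarchy; that non-serializable parent must have an accessible no-arg constructor, otherwise InvalidClassException is thrown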

>that is, any member as part of non-serializable super class does not get restored when we deserialize the child; those inherited member variables get brand new values as though we are creating a new object using 'new' operator (remember it's only for parent members)

class Animal {
int weight = 10;
}

class Dog extends Animal implements Serializable {
String name;
Dog(int w, String n)
{
weight = w;
name = n;
}
}

 #in this example, if you persist a dog with weight = 20 and name = Sally, and restore it back, you will see that weight became 10(not good) and name is Sally(as expected).

>So, when you try to serialize your class, be wary of those inherited members of a non-serializable super class; they will not participate in your serialization process no matter what; you should do something else if you want to persist them too

>when serializing an array or collection, all elements should be serializable; otherwise, runtime error
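The Animal/Dog case from this section can be run end to end with a sketch like this. The two classes mirror the example above; InheritanceSerializationDemo and the in-memory byte array are my additions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class Animal { // NOT Serializable
    int weight = 10;
}

class Dog extends Animal implements Serializable {
    private static final long serialVersionUID = 1L;
    String name;

    Dog(int w, String n) {
        weight = w;
        name = n;
    }
}

class InheritanceSerializationDemo {
    static Dog roundTrip(Dog d) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(d);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Dog) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Dog restored = roundTrip(new Dog(20, "Sally"));
        System.out.println(restored.name);   // Sally
        System.out.println(restored.weight); // 10: Animal's initialization ran again
    }
}
```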

Static variables not serialized
-------------------------------
>Serialization only applies to objects; though static variable values can change and needs to be preserved, they are still not part of any object; they are part of the class itself; so they are never bothered about in serialization and deserialization

Versioning issue in serialization
---------------------------------
>when you save an object using one version of the class, and attempt to restore it with an incompatible version of the same class (different serialVersionUID), deserialization will fail with InvalidClassException
>if you open the serialized file, you can still make out the values of private fields of the stored object; if you attempt to mess with the serialized file, restoring the object from that modified file can throw StreamCorruptedException

Object Serialization System Specification
=========================================

>Objects to be saved in the stream may support either the Serializable or Externalizable interface
>For serializable objects, the stream includes sufficient information to restore the fields in the stream to a compatible version of the class
>For Externalizable objects, the class is solely responsible for the external format of its contents
>when an object is stored, all other objects reachable from this object are stored as well to maintain the relationship between the objects
>objects are written with writeObject() method from ObjectOutput and primitives are written to the stream with the methods of DataOutput

========[WE WILL CONTINUE SERIOUS SERIALIZATION in other post]========



Dates, Numbers and Currency
---------------------------

java.util.Date - most of the methods are deprecated; the instance of this class represents a mutable date,time to a millisecond; this object can act as a bridge between Calendar and DateFormat classes

java.util.Calendar - allows manipulation of dates and time; this is an abstract class; use getInstance() overloaded static factory method to get hold of a Calendar object

java.text.DateFormat - it provides many styles of date formatting and also in many Locales

java.text.NumberFormat - formatting numbers and currencies in locales around the world

java.util.Locale - Every Locale around the world has different time, currency, and formatting styles for date, numbers and currency; to cater to this need, Locale class can work with DateFormat, NumberFormat and Calendar classes

new Date() - points to current time

Calendar c = Calendar.getInstance() - c allows you to perform date time manipulations in your locale

Locale loc = new Locale(language) or new Locale(language, country) - gives you locale objects
Calendar c = Calendar.getInstance(loc) - gives you the calendar for a specific locale

A DateFormat object can be created for a particular style and Locale.

A NumberFormat object can be created for a particular Locale.

Since, Date class did not handle the Internationalization and Localization situations well, most of its methods were deprecated, and Calendar could be used instead.

Date.setTime() and getTime() methods are still usable and work with long values (time in milliseconds)

Calendar c = new Calendar();  //Illegal code

Calendar c = Calendar.getInstance();//legal, returns one of the concrete subclasses of Calendar class(could be java.util.GregorianCalendar)

Calendar.setTime(Date d) - sets the calendar time to any arbitrary date you pass

In some locales, first day of the week is 'Monday'. To find this out use, calendar.getFirstDayOfWeek() [returns a predefined Calendar constant, e.g. Calendar.SUNDAY].

calendar.add(Calendar.HOUR, -4); //subtracts 4hrs from calendar object time
calendar.add(Calendar.YEAR, 2); //adds 2yrs to the time
calendar.add(Calendar.DAY_OF_WEEK, -2); //subtracts 2 days from calendar's time

calendar.roll(Calendar.MONTH, 10); //adds 10 units to the specified time unit and lets it wrap around; in this example, it only changes month
//the year of the calendar's time remains as it is; while adding 10 to the current month, if it goes
//beyond the last month, it wraps around to the start of the same year (months are 0-based: JANUARY=0 ... DECEMBER=11); so it
//can either increase or decrease any time unit(hour, day, month or year)

> roll() doesn't change the larger part of the date than the part specified in the input argument.
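The difference between add() and roll() shows up when both start from the same date. In this sketch the start date (15 May 2012), the class name CalendarDemo and its helper methods are my own picks; GregorianCalendar is used directly so that the result does not depend on the default locale.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

class CalendarDemo {
    static Calendar addMonths(int months) {
        Calendar c = new GregorianCalendar(2012, Calendar.MAY, 15);
        c.add(Calendar.MONTH, months);  // overflow carries into the year
        return c;
    }

    static Calendar rollMonths(int months) {
        Calendar c = new GregorianCalendar(2012, Calendar.MAY, 15);
        c.roll(Calendar.MONTH, months); // wraps around, year is left alone
        return c;
    }

    public static void main(String[] args) {
        Calendar added = addMonths(10);
        Calendar rolled = rollMonths(10);
        // Calendar.MARCH has the value 2 because months are 0-based
        System.out.println(added.get(Calendar.YEAR) + " " + added.get(Calendar.MONTH));   // 2013 2
        System.out.println(rolled.get(Calendar.YEAR) + " " + rolled.get(Calendar.MONTH)); // 2012 2
    }
}
```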

DateFormat
----------
DateFormat.getDateInstance(DateFormat.FULL, new Locale("it","IT")) will create italian dateformat object. Locale can be set to DateFormat only at the time of instance creation. After that no way to change the locale on the existing DateFormat object. Applies to NumberFormat objects also.

Locale indiaLocale = new Locale("hi","IN");
indiaLocale.getDisplayCountry() - gives a string name of the country(INDIA) in the default locale(can be any locale)
indiaLocale.getDisplayCountry(new Locale("it","IT")) - gives INDIA in italian locale
indiaLocale.getDisplayLanguage() - gives string name of the language(HINDI) in the default locale
indiaLocale.getDisplayLanguage(new Locale("it","IT")) - gives HINDI in italian locale

API
---
new Date();
new Date(long millis);

Calendar.getInstance();
Calendar.getInstance(Locale l);

Locale.getDefault();
new Locale(String language);
new Locale(String language, String country);

DateFormat.getInstance();
DateFormat.getDateInstance();
DateFormat.getDateInstance(style);
DateFormat.getDateInstance(style, Locale);

NumberFormat.getInstance();
NumberFormat.getInstance(Locale);
NumberFormat.getNumberInstance();
NumberFormat.getNumberInstance(Locale);
NumberFormat.getCurrencyInstance();
NumberFormat.getCurrencyInstance(Locale);
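A short sketch that puts a few of these factory methods to work. LocaleFormatDemo and its helper methods are hypothetical names, and Locale.ITALY / Locale.US are the built-in constants for it_IT and en_US. The exact date strings depend on the JDK's locale data, so no output is claimed for them.

```java
import java.text.DateFormat;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Date;
import java.util.Locale;

class LocaleFormatDemo {
    static String currency(double amount, Locale locale) {
        return NumberFormat.getCurrencyInstance(locale).format(amount);
    }

    static double parseNumber(String text, Locale locale) throws ParseException {
        return NumberFormat.getInstance(locale).parse(text).doubleValue();
    }

    public static void main(String[] args) throws ParseException {
        Date now = new Date();
        // the locale is fixed when the format object is created
        System.out.println(DateFormat.getDateInstance(DateFormat.FULL, Locale.ITALY).format(now));
        System.out.println(DateFormat.getDateInstance(DateFormat.SHORT, Locale.US).format(now));

        System.out.println(currency(1234.5, Locale.US));            // $1,234.50
        System.out.println(parseNumber("1,234.5", Locale.US));      // 1234.5
        System.out.println(Locale.ITALY.getDisplayCountry(Locale.ENGLISH)); // Italy
    }
}
```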

Parsing, Tokenizing and Formatting
==================================

>In general, a regex search is run from left to right, and once a source's character has been used in a match(consumed), it cannot be reused.

\d digits
\s white space
\w word characters(alphabets, digits and underscore)
[a-zA-Z] range of characters
^ negates the characters specified, when it is the first character inside brackets
[][] nested brackets - union of sets
&& intersection of sets
. any single character

Quantifiers:
-----------
+ one or more
* zero or more
? zero or one


Match hexadecimal numbers --> 0[xX]([0-9a-fA-F])+

Match anything that is not a or b or c --> [^abc]

Match 7 digit phone number which may or may not contain a space or a hyphen after the 3rd digit --> \d\d\d([-\s])?\d\d\d\d
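The hexadecimal and phone number patterns above can be tried with Pattern and Matcher. In this sketch the class name RegexFindDemo, the helper findAll and the sample source strings are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFindDemo {
    // collects every match found while scanning the source left to right
    static List<String> findAll(String regex, String source) {
        List<String> hits = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(source);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // "0x" alone and "0xg" have no hex digit after the prefix, so they are skipped
        System.out.println(findAll("0[xX]([0-9a-fA-F])+", "12 0x 0x12 0Xf 0xg")); // [0x12, 0Xf]

        // every regex backslash is doubled inside the java string literal
        System.out.println(findAll("\\d\\d\\d([-\\s])?\\d\\d\\d\\d",
                "1234567 abc 123 4567 xyz 123-4567")); // [1234567, 123 4567, 123-4567]
    }
}
```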

? greedy (zero or one)
?? reluctant (zero or one)
* greedy (zero or more)
*? reluctant (zero or more)
+ greedy (one or more)
+? reluctant (one or more)

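A greedy quantifier tries to consume as much as possible: it first reads the entire source and then works backwards (from right to left), giving characters back until the rest of the pattern matches, so that the match it reports is as long as possible. A reluctant quantifier does the opposite: it starts with as little as possible and extends the match only when it has to.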
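Greedy versus reluctant matching is easiest to see on the same source with two patterns. The source "yyxxxyxx", the patterns and the names GreedyDemo / findAll are illustrative picks, not something prescribed by the regex API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    static List<String> findAll(String regex, String source) {
        List<String> hits = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(source);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String source = "yyxxxyxx";
        // greedy: .* grabs everything, then backs off just enough for the final "xx"
        System.out.println(findAll(".*xx", source));  // [yyxxxyxx]
        // reluctant: .*? stops at the first place where "xx" can follow
        System.out.println(findAll(".*?xx", source)); // [yyxx, xyxx]
    }
}
```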

\ backslash is the escape character both in regex and in java string literals, hence every backslash meant for the regex has to be escaped (doubled) in a java source file.

patternexpression = "\d"; //will not compile
patternexpression = "\\d"; //will compile

patternexpression = "\\."; //search for a dot; dot is not the regex meta char here because it is escaped with a backslash

>delimiters in tokenizing can be as big a string as represented by a complex regex

String.split(String regex) - tokenizes the entire string
Scanner - can do on the fly tokenizing and you can quit tokenizing if you have found your token already (need not tokenize the whole file)

Scanner
--------
>Scanners can be constructed using files, streams or strings as a source.
>Tokenizing is performed in a loop; so you can exit the loop at any time
>Tokens can be converted to their appropriate primitive types automatically
>Scanner's default delimiter is white space

Scanner.useDelimiter() lets you set the delimiter to be any regex.

Scanner s = new Scanner("source string which has to be tokenized");

s.next() returns next token as a String; e.g. "source" in this example
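A small sketch of Scanner tokenizing in a loop next to String.split(). ScannerDemo, sumInts and firstToken are hypothetical names, and the comma separated input is made up for the example.

```java
import java.util.Arrays;
import java.util.Scanner;

class ScannerDemo {
    // adds up the int tokens and skips everything else
    static int sumInts(String source, String delimiterRegex) {
        Scanner s = new Scanner(source).useDelimiter(delimiterRegex);
        int sum = 0;
        while (s.hasNext()) {
            if (s.hasNextInt()) {
                sum += s.nextInt(); // token converted to a primitive for us
            } else {
                s.next();           // not an int: consume and ignore
            }
        }
        s.close();
        return sum;
    }

    // default delimiter is whitespace; we can stop after the first token
    static String firstToken(String source) {
        Scanner s = new Scanner(source);
        String token = s.next();
        s.close();
        return token;
    }

    public static void main(String[] args) {
        System.out.println(sumInts("1,2,x,3", ","));                                  // 6
        System.out.println(firstToken("source string which has to be tokenized"));    // source
        System.out.println(Arrays.toString("a1b22c333".split("\\d+")));               // [a, b, c]
    }
}
```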

java.io.PrintStream
-------------------
>format() and printf() behave exactly the same way

>format() method uses java.util.Formatter class behind the scenes

Format string syntax:

%[arg_index$][flags][width][.precision]conversion_char

values within [] are optional; just the percentage and conversion_char are enough.

e.g. System.out.printf("%2$d + %1$d",123,456); //prints '456 + 123'

arg_index --> an integer followed by $; indicates which argument should be printed in this position

flags -->

- left justify
+ include sign(+ or -) with this argument
0 pad this argument with zeros
, use locale specific grouping separators
( enclose negative numbers in parenthesis

width --> indicates minimum no.of characters to print; used for column formatting

precision --> to format floating point numbers; indicates no.of digits to print after the decimal point

conversion_char -->

%b boolean
%c char
%d integer
%f floating point
%s string

System.out.format("%d", 12.3); //throws IllegalFormatConversionException

System.out.printf("%1$+-7d", 46); //first argument is used, includes sign, left justifies, 7 width space is used, integer is expected
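The flags, width and precision can be tried out with String.format(), which uses the same format string syntax as printf(). FormatDemo and fmt are made-up names; Locale.US is passed explicitly as an assumption so that grouping and decimal separators do not depend on the machine's default locale.

```java
import java.util.IllegalFormatConversionException;
import java.util.Locale;

class FormatDemo {
    static String fmt(String format, Object... args) {
        return String.format(Locale.US, format, args);
    }

    public static void main(String[] args) {
        System.out.println(fmt("%2$d + %1$d", 123, 456)); // 456 + 123
        System.out.println(fmt("[%1$+-7d]", 46));         // [+46    ]
        System.out.println(fmt("%07.2f", 3.14159));       // 0003.14
        System.out.println(fmt("%,d", 1234567));          // 1,234,567
        System.out.println(fmt("%(d", -42));              // (42)
        System.out.println(fmt("%b %c %s", true, 'x', "hi")); // true x hi

        try {
            fmt("%d", 12.3); // %d does not accept a floating point argument
        } catch (IllegalFormatConversionException e) {
            System.out.println("bad conversion: " + e.getConversion()); // bad conversion: d
        }
    }
}
```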

# StringBuffer/StringBuilder.equals() is not overridden; it doesn't compare values

# The String.split() method tokenizes the entire source data all at once, so
large amounts of data can be quite slow to process.

#PrintWriter - is a Writer that wraps another Writer or an OutputStream to write formatted text
#PrintStream - is a stream itself extending from FilterOutputStream class
#System.out and System.err are PrintStream objects
#PrintWriter uses Formatter internally for format()/printf(); when it is given an OutputStream or a file instead of a Writer, it wraps it in an OutputStreamWriter to convert characters to bytes

Reference: Kathy Sierra's SCJP Book
