Data Structures
March 4, 2008 by muralis
The Java Collection classes provide implementations of data structures in three distinct categories, defined as interfaces:
# Sets: a collection of unique elements
# Lists: a collection of elements, duplicates allowed
# Maps: a mapping of keys to values
And there are two sub-interfaces derived from these:
# SortedSet: an ordered collection of unique elements
# SortedMap: a mapping of ordered keys to values
The various implementations of each have their advantages and disadvantages that depend on your requirements. The purpose of this section is to present the performance and usage tradeoffs of the most popular Collection Classes.
Sets and SortedSets
A set is a collection of unique elements; duplicates are not permitted. There are two classes that implement the java.util.Set interface:
# HashSet
# LinkedHashSet
# TreeSet
The difference is that a HashSet stores all of its data inside of a hash table while a TreeSet stores all of its information in a tree.
From a performance standpoint, the Set interface has three methods of interest: add(), remove(), and contains(); add() adds a new object to the Set, remove removes a specified object from the Set and contains() returns a boolean value of true if the specified element is in the Set or false otherwise. The HashSet performs each of these methods in constant time, which is the best you can do. The TreeSet performs each of these methods in O( log N ) time, which is good.
The functional tradeoff between the three is that the elements in the TreeSet are sorted in their “natural” order, those in the LinkedHashSet are ordered in the sequence they were inserted in, and HashSet is not sorted in any way. Therefore, if it is a requirement that you have to be able to display the contents of the Set in its “natural” order, then you must use a TreeSet; if you need to preserve the input order then you must use LinkedHashSet; if the order is not a requirement then you are better off using a HashSet.
Lists
A list is a collection of elements; duplicates are permitted. The following classes implement the java.util.List interface:
# ArrayList
# Vector
# LinkedList
Both ArrayList and Vector are growable arrays that provide constant time random access to its elements. Vector originally appeared in the first release of Java (1.0) while the ArrayList officially appeared in the API in version 1.2. Thus the ArrayList conforms more closely to the Collections API than does the Vector; the Vector was retrofitted to support this functionality. The primary difference between the two is that ArrayList is not synchronized by default while Vector is: if you are running single threaded code against the List, then ArrayList should have better performance. If you want to use an ArrayList in synchronized code, you can make it synchronized with the following code:
List list = Collections.synchronizeList( new ArrayList() )
The java.util.LinkedList class differs from the growable array classes in that it is implemented by a linked list.
The difference between the two categories of lists is that growable arrays are required to maintain a specific size that is equal to or greater than the number of elements it contains and when that size is exceeded it must be resized to support the additional elements; this is an expensive operation. By maintaining that additional overhead however, you gain constant time inserts and deletes. Locating an element in the array however is O( N ), as potentially all elements will have to be examined. Linked lists, on the other hand, use exactly the amount of memory required to hold its data. As new elements are added to the linked list, new Entry’s are created and positioned appropriately in the list. So if memory is at a premium then linked lists will be a good option, but because of the nature of the underlying data structure, inserts into the list, locating elements, and removing specific elements from the list all take O( N ) time.
Maps and Sorted Maps
Maps maintain key and value pairs; keys are maintained in a Set (hence no duplicates) while values are maintained in a Collection (can be thought of as a List – hence duplicates are permitted). The following classes (that you would use as a data structure in your own code) implement the Map interface:
# HashMap
# Hashtable
# LinkedHashMap
# TreeMap
As you might guess, the difference between a HashMap, a LinkedHashMap, and a TreeMap is the ordering of its keys. All of the benefits and drawbacks discussed with Sets is applicable here. Hashtable, like Vector, is a byproduct of the initial 1.0 release of the Java API and retrofitted to support the Map interface while HashMap officially appeared with the Collections API in version 1.2. Subtle differences include Hashtable being inherently synchronized while HashMap can be synchronized as follows:
Map map = Collections.synchronizeMap( new HashMap() );
The final difference is that HashMap permits null keys and values while Hashtable does not.
Summary
I hope that this section has inspired you to investigate algorithm analysis further and carefully examine your code, particularly related to loops and nested loops. Finally I encourage you to spend time looking through the Javadoc and more specifically the source code that ships with your JDK to investigate the implementation details of the differing data structures. Be sure that you understand your requirements before you choose the appropriate data structure from the wealth of implementations presented through the Collections API.