nz.util
Class TrieStringSet

java.lang.Object
  |
  +--java.util.AbstractCollection
        |
        +--java.util.AbstractSet
              |
              +--nz.util.TrieStringSet
All Implemented Interfaces:
java.util.Collection, java.io.Serializable, java.util.Set

public class TrieStringSet
extends java.util.AbstractSet
implements java.io.Serializable

This Collection class provides a Set interface implemented by a data structure called a Trie. As such, it is efficient for certain set operations, like contains(). Note that this implementation is not synchronized. If multiple threads access this set concurrently, and at least one of the threads modifies the set, you must supply external synchronization. Otherwise, the internal data structures almost certainly will be become corrupted.

Because the trie data structure is wholly dependent on the nature of strings, this Collection stores all objects as their String representation; in other words, you can put any kind of object you want to into this collection, but you can only get Strings out of it (insertion is done using the toString() operator.) The most efficient operation of this class can be obtained when you put char[] array objects into it. Note that when you put a char[] array into the set, you still get a String out.

The TrieStringSet Iterator does not support the remove operation, however it is safe to remove the item just returned by the iterator or any previous item. If you the next item to be iterated over, then the iterator will return it anyway. See the description of the clear() method for more remarks about mixing operations.

Note that the Trie is naturally a sorted data structure, although it always sorts in numerical order of its character values, which is almost always not the same as dictionary order. That is why this class implements Set rather than SortedSet. However, you can count on this numerical-sort behavior. The iterator will ALWAYS deliver results in this order, shorter strings first, etc.

TrieStringSet is fully serializable, although the internal structure of the set practically guarantees that the serialized output will be quite large, and that reading the results will be slow. It is almost always faster to get all the strings out with toArray(), serialize that to a file, and later load the strings back into a TrieStringSet.

While the Trie data structure is slower than a good tree or hash table for just about everything, and uses more memory, it can do some operations very quickly. For example, in heavily query-oriented applications, the TrieStringSet outperforms the Java TreeSet and approaches the performance of the Java HashSet; it can even exceed the performance of HashSet in applications where char[] arrays are used instead of Strings. Also, the trie can do some useful string set operations efficiently that the other structures cannot do at all: prefix checking. In this class, the operation to check whether the given string is a prefix of any string in the set is implemented by the containsPrefix() method.

Note that there are a lot more things that trie data structures can do well, under the right circumstances. In this particular case, the TrieStringSet is kept slow by the requirement to handle unicode characters in a reasonable way.

The guts of a trie use an array to hold things. Since the Java character set is unicode, and thus quite large, a simple sparse array implementation is used for this purpose. Note that the current implementation will be more wasteful for strings composed mostly of characters in upper unicode pages. Of course, it is pretty darn wasteful now, actually.

See Also:
Set, nom.ziring.util.PagedArray, Serialized Form

Constructor Summary
TrieStringSet()
          Create a new, empty TrieStringSet.
TrieStringSet(java.lang.String[] list)
          Create a new TrieStringSet that contains all the elements from an array.
 
Method Summary
 boolean add(java.lang.Object o)
          Add a string or char array to this TrieStringSet.
protected  boolean addInternal(nz.util.TrieStringSet.TSSPagedArray p, char[] a, int c, int maxC)
           
protected  boolean addInternal(nz.util.TrieStringSet.TSSPagedArray p, java.lang.String s, int c, int maxC)
           
 void clear()
          Clear out this TrieStringSet.
 boolean contains(char[] a, int n)
          Check whether this TrieStringSet contains a particular element.
 boolean contains(java.lang.Object o)
          Check whether this TrieStringSet contains a particular element.
 boolean containsPrefix(java.lang.Object o)
          Check whether the supplied string or char array is in the set, or is a prefix of any string in the set.
protected  nz.util.TrieStringSet.TSSPagedArray internalContains(nz.util.TrieStringSet.TSSPagedArray p, char[] s, int c, int maxC)
           
protected  nz.util.TrieStringSet.TSSPagedArray internalContains(nz.util.TrieStringSet.TSSPagedArray p, java.lang.String s, int c, int maxC)
           
 java.util.Iterator iterator()
          Create and return a new Iterator for this TrieStringSet.
 boolean remove(java.lang.Object o)
          Remove an element from the TrieStringSet.
 int size()
          Return the number of (unique) Strings in this TrieStringSet.
 java.lang.Object[] toArray(java.lang.Object[] a)
          The toArray(Object []a) method is not supported.
 
Methods inherited from class java.util.AbstractSet
equals, hashCode, removeAll
 
Methods inherited from class java.util.AbstractCollection
addAll, containsAll, isEmpty, retainAll, toArray, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface java.util.Set
addAll, containsAll, isEmpty, retainAll, toArray
 

Constructor Detail

TrieStringSet

public TrieStringSet()
Create a new, empty TrieStringSet.

TrieStringSet

public TrieStringSet(java.lang.String[] list)
Create a new TrieStringSet that contains all the elements from an array. For adding from any other kind of collection, use the addAll(Collection c) method.
Parameters:
list - An array of String values
Method Detail

size

public int size()
Return the number of (unique) Strings in this TrieStringSet. This operation is very cheap, because a count is kept during insertion, so that this method can simply return the value of an internal counter.
Overrides:
size in class java.util.AbstractCollection

add

public boolean add(java.lang.Object o)
Add a string or char array to this TrieStringSet. This function will throw IllegalArgumentException of the object o is null.
Overrides:
add in class java.util.AbstractCollection
Parameters:
o - String object to be added to the set.

addInternal

protected boolean addInternal(nz.util.TrieStringSet.TSSPagedArray p,
                              java.lang.String s,
                              int c,
                              int maxC)

addInternal

protected boolean addInternal(nz.util.TrieStringSet.TSSPagedArray p,
                              char[] a,
                              int c,
                              int maxC)

clear

public void clear()
Clear out this TrieStringSet. After this call, the set will be empty. All outstanding iterators will still work, but will iterate over the old elements, not any new ones that are added after this operation.
Overrides:
clear in class java.util.AbstractCollection

contains

public boolean contains(java.lang.Object o)
Check whether this TrieStringSet contains a particular element. This operation is pretty fast; it is linear in the length of the string (it is not affected by the number of elements added to the set). The input parameter can be a String or a char array (char[]), or anything that delivers a reliable toString() value.
Overrides:
contains in class java.util.AbstractCollection
Parameters:
o - String or char[] object that might be in this set

contains

public boolean contains(char[] a,
                        int n)
Check whether this TrieStringSet contains a particular element. This operation is pretty fast; it is linear in the length of the string (it is not affected by the number of elements added to the set).
Parameters:
a - A char array
n - How many elements of the char array to use, positive

containsPrefix

public boolean containsPrefix(java.lang.Object o)
Check whether the supplied string or char array is in the set, or is a prefix of any string in the set. This operation is just as fast as normal contains(), and is one thing that the TrieStringSet can do well that other Set collection classes cannot.

Example: if the set contained the strings "keyboard", "mouse", and "monitor", then containsPrefix("key") would return true.

Parameters:
o - String or char[] prefix to look for in the set

internalContains

protected nz.util.TrieStringSet.TSSPagedArray internalContains(nz.util.TrieStringSet.TSSPagedArray p,
                                                               java.lang.String s,
                                                               int c,
                                                               int maxC)

internalContains

protected nz.util.TrieStringSet.TSSPagedArray internalContains(nz.util.TrieStringSet.TSSPagedArray p,
                                                               char[] s,
                                                               int c,
                                                               int maxC)

toArray

public java.lang.Object[] toArray(java.lang.Object[] a)
The toArray(Object []a) method is not supported. Use the toArray() method instead.
Overrides:
toArray in class java.util.AbstractCollection
Throws:
ArrayStoreException - always

remove

public boolean remove(java.lang.Object o)
Remove an element from the TrieStringSet. This operation is fairly fast, but memory that might have been used by deleted items is never reclaimed until the entire set is empty. To clear all the items quickly, use clear().
Overrides:
remove in class java.util.AbstractCollection
Parameters:
o - An object purportedly in this set

iterator

public java.util.Iterator iterator()
Create and return a new Iterator for this TrieStringSet. The Iterator that this returns does not support the remove() operation, and its next() operation will always return a String.
Overrides:
iterator in class java.util.AbstractCollection