Wednesday 24 June 2015

Traversing LinkedList considered (somewhat) harmful

I compare traversal of LinkedList, ArrayList and an array. First things first, given the mechanics of linked lists, iterating with a for loop, e.g:
        for (int i = 0; i < N; i++) {
            linkedList.get(i));
        }
is expected to be comparatively slow since the list has to be scanned every time to get() to the i-th element. While sub-optimal, it may still happen in the wild, especially when a call returns a List and the client programmer is not aware of the actual implementation.
 The benchmark measures performance of "for-style" traversal of LinkedList, ArrayList and array as well as "iterator-style" traversal of LinkedList and ArrayList e.g:
        Iterator iter = list.iterator();
        while (iter.hasNext()) {
            iter.next();
        } 

The code is available here. The output from JMH is
Benchmark                           Mode  Cnt       Score      Error  Units        
ListsBenchmark.forArray            thrpt   10  423016.591 ± 2745.121  ops/s        
ListsBenchmark.forArrayList        thrpt   10  345255.508 ± 1367.974  ops/s        
ListsBenchmark.forLinkedList       thrpt   10    2559.053 ±   14.089  ops/s        
ListsBenchmark.iteratorArrayList   thrpt   10  299156.440 ± 8655.560  ops/s        
ListsBenchmark.iteratorLinkedList  thrpt   10  212684.595 ± 1036.017  ops/s        
The score is number of operations per second (so higher is better) where an operation is a single traversal of a list/array of size 1000

Comments
  • As expected, traversal of a LinkedList with a for loop performs badly - about n * n/2 on the average
  • Although ArrayList is backed by an array, it performs significantly worse then a plain array. 
  • Iterating over LinkedList is significantly worse then iterating over ArrayList. This is most likely due to the fact that to move from one node to the next in a LinkedList,  the next node if fetched from the heap and then the contained value is fetched from the heap again. In the case of the ArrayList the references to the values are loaded into CPU caches and so only one visit to the heap is needed

Tuesday 16 June 2015

Lazy Map initialization

Here is one Java idiom I am happy to see go away with the tools made available in Java 8. In this scenario we have a Map with mutable values for which not all keys are known at creation time. Before Java 8, the way to add a mapping would look like this:
    private final Map< String, List< String > > map = new HashMap<>();

    public void addValue(String key, String value) {
        List valueList = map.get(key);
        if (valueList == null) {
            valueList = new ArrayList<>();
            map.put(key, valueList);
        }
        valueList.add(value);
    }
The equivalent logic with Java 8 is:
    public void addValue(String key, String value) {
        map.computeIfAbsent(key, v -> new ArrayList<>()).add(value);
    }
If the Map values are immutable, the following also works:
    private final Map< String, String > map = new HashMap<>();

    public void addValue(String key, String value) {
        map.merge(key, value, (v1, v2) -> value);
    }

Thursday 11 June 2015

Generating collisions in String.hashCode

In a previous post I mentioned collisions in String.hashCode(). Turns out that generating such collisions is quite easy.
String.hashCode() is defined as:
 
    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;

            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }
For a string A of length N (a0, a1,..., aN-2, aN-1),
this can be written in a closed form as: 31^(N-1)a0 + 31^(N-2)a1 + ... + 31aN-2 + aN-1.


Pad by zeros

The first observation is that 0 is a valid character and prepending arbitrary number of 0 characters doesn't change the value of the hash. The following code takes advantage of this:
    public static String padByZeroes(String original, int numberOfExtraZeroes) {
        int length = original.length();
        char[] chars = new char[length + numberOfExtraZeroes];
        for (int i = 0; i < length; i++) {
            chars[i + numberOfExtraZeroes] = original.charAt(i);
        }
        return new String(chars);
    }
padByZeroes produces strings of length greater then the original with identical hashcode.

Pair Perturbation

As a starting point to this approach we take a string B which is identical to A: (b0, b1,..., bN-2, bN-1). We then modify two consecutive character of B without affecting it's hashcode.
If the modified characters are are at index K and K+1 we have: hashCode(A)-hashCode(B) = ... = 31(aK - bK)+(aK+1 - bK+1).
We want the last expression to be zero, that is:  31(aK - bK) =  bK+1 - aK+1. As we are free to choose bK and bK+1, to make things simple (and reduce the chance of overflowing/underflowing the two byte value of char) we choose bK = aK - 1 which gives us bK+1 = aK+1 + 31.
    public static String pairPerturbation(String original, int pairIndex) {
        int length = original.length();
        if (pairIndex > length - 2) {
            throw new IllegalArgumentException("pairIndex can't be greater then " + (length - 2));
        }

        char[] chars = original.toCharArray();

        chars[pairIndex] = (char) (original.charAt(pairIndex) - 1);
        chars[pairIndex + 1] = (char) (original.charAt(pairIndex + 1) + 31);

        return new String(chars);
    }
Here is a link to the Gist of the code and tests: https://gist.github.com/sorokod/a6848e6cd4c9fe32a523#file-stringhashcollisions