The logDice score in Word Sketches

The Dice score gives very good results for collocation candidates. Its only problem is that its values are usually very small numbers. We have defined logDice to fix this problem.

Values of the logDice have the following features:
– The theoretical maximum is 14, reached when all occurrences of X co-occur with Y and all occurrences of Y co-occur with X. Usually the value is less than 10.

– A value of 0 means there is less than 1 co-occurrence of XY per 16,000 occurrences of X or 16,000 occurrences of Y. We can say that negative values mean there is no statistical significance of the XY collocation.

– When comparing two scores, plus 1 point means the collocation is twice as frequent, and plus 7 points means it is roughly 100 times as frequent.

– The score does not depend on the total size of the corpus. It combines the relative frequencies of XY in relation to X and Y.

All these characteristics are useful orientation points for any field linguist working with collocation candidate lists.
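
The features listed above can be checked numerically. This excerpt does not spell out the formula, so the sketch below assumes the definition given in the cited paper: logDice = 14 + log2(2 * f_xy / (f_x + f_y)), where f_x and f_y are the frequencies of X and Y and f_xy is the frequency of their co-occurrence. The function name `log_dice` and the sample counts are made up for illustration.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Assumed definition: 14 + log2 of the Dice score 2*f_xy / (f_x + f_y)."""
    if f_xy <= 0:
        # No co-occurrence at all: the Dice score is 0 and its logarithm is unbounded below.
        return float("-inf")
    dice = 2 * f_xy / (f_x + f_y)
    return 14 + math.log2(dice)


# Maximum: every X occurs with Y and every Y occurs with X.
print(log_dice(500, 500, 500))        # 14.0

# Zero point: one co-occurrence per 2**14 = 16384 occurrences of X and of Y.
print(log_dice(1, 16384, 16384))      # 0.0

# Doubling the co-occurrence count adds 1 point.
print(log_dice(16, 1024, 1024) - log_dice(8, 1024, 1024))    # 1.0

# A factor of 2**7 = 128 (roughly 100) adds 7 points.
print(log_dice(128, 1024, 1024) - log_dice(1, 1024, 1024))   # 7.0
```

Note that the corpus size is not an argument of the function at all, which matches the last feature: only the three frequencies enter the score. The "16,000" in the text corresponds to 2^14 = 16,384 under this definition.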

From: A Lexicographer-Friendly Association Score, by Pavel Rychlý