Saturday, 11 February 2012

Correlations in linguistic data

Geoff Pullum at Language Log recently commented, reluctantly (because it's not yet published), on a paper by a Yale economist, Keith Chen. In this paper, Chen argues that if your language has a grammatical future tense marker, you are less likely to save money, live healthily, etc., because the future seems like some other time, not to be worried about now. If your language uses the present tense to refer to the future, you treat it as an extension of the present and you'll be much more sensible about it. Pullum is guardedly sceptical about these claims, for reasons which you can read about yourself.

He is also sceptical about claims of this kind in general, which rest on correlations found in large amounts of data. He writes:
I also worry that it is too easy to find correlations of this kind, and we don't have any idea just how easy until a concerted effort has been made to show that the spurious ones are not supportable. For example, if we took "has (vs. does not have) pharyngeal consonants", or "uses (vs. does not use) close front rounded vowels", would we find correlations there too? I have some colleagues here at the University of Edinburgh, within Simon Kirby's research group, who have run some informal experiments on the data Chen uses to see if dredging up spurious correlations of this kind is easy or hard, and so far they have found it jaw-droppingly easy.
He doesn't comment further on these experiments, but his remark reminded me of the talk Martin Haspelmath gave when our university's linguistics research centre opened a few years ago, in which he told us about the World Atlas of Language Structures (WALS). After telling us what a wonderful, useful tool it is (and it is; I've found it invaluable), he ended on a note of caution. It's easy, he said, to find false correlations. For example, you can show a map of which languages have different words for 'hand' and 'arm' and which use the same word for both. That map shows that the languages that don't distinguish are, broadly speaking, in the warmer areas of the globe (yellow dots) and the ones that do distinguish are in colder areas (red dots):
(Map from WALS, feature 129A)
Now might one not hypothesise, asked Haspelmath, that this language fact is due to the climate? In colder countries the distinction is important, in that one wears items of clothing that cover only the hands (gloves), or sleeves that come down to the wrist. In warm countries, sleeves are not so long and gloves are not worn, so a separate word for hands never becomes necessary. A far-fetched example, but a lesson in not putting too much faith in correlations.
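To get a feel for how easily chance correlations turn up, here is a minimal simulation sketch. It is not the Edinburgh group's experiment and it uses neither Chen's data nor WALS: the languages, the "outcome" trait and the features are all invented coin flips, and the names `chi_square_2x2` and `count_spurious` are made up for this illustration. The point is only that when every feature is independent of the outcome by construction, testing enough of them at the conventional 5% level should still flag roughly one in twenty as "significant".

```python
import random


def chi_square_2x2(xs, ys):
    """Pearson chi-square statistic for two equal-length lists of 0/1 values."""
    n = len(xs)
    a = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 1)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # One of the traits never varies, so there is nothing to correlate.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def count_spurious(n_languages=200, n_features=500, seed=0):
    """Count random features that pass a p < 0.05 test against a random outcome.

    Every value is an independent coin flip (an assumption of this sketch),
    so any "significant" association found here is spurious by construction.
    """
    rng = random.Random(seed)
    outcome = [rng.randint(0, 1) for _ in range(n_languages)]
    critical = 3.841  # chi-square critical value for 1 degree of freedom, p = 0.05
    hits = 0
    for _ in range(n_features):
        feature = [rng.randint(0, 1) for _ in range(n_languages)]
        if chi_square_2x2(feature, outcome) > critical:
            hits += 1
    return hits


if __name__ == "__main__":
    hits = count_spurious()
    print(f"{hits} of 500 random features 'correlate' with the outcome at p < 0.05")
```

Real typological data is, if anything, less forgiving than this toy case, since related and neighbouring languages are not independent data points the way these coin flips are.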

