
Monday, 30 November 2015

Lazy to do laundry

This advert appeared in my workplace recently:


It offers laundry services under the line 'Lazy to do Laundry?'. This is not quite felicitous for me (that's linguist-speak for 'there's something not quite right about the syntax'). Of course I can understand what it means: it doesn't seem to me to be any different from 'Too lazy to do laundry?' or 'Lazy when it comes to doing your laundry?', and that's obviously the intended meaning.

I can't find any more examples like it: Google, COCA and the British National Corpus all have the string 'lazy to + infinitive' preceded by 'too'. Normally, adjective + to + infinitive means 'To [infinitive] is [adjective]', as in 'It's lazy to sleep all day', and indeed there was one example of this in COCA.

[Update] My friend Stuart pointed out that there are some examples on Twitter. Interesting in itself that they are there but not on Google: they must be rare enough that they're drowned out by the 'too lazy to' examples.

Monday, 4 November 2013

Interesting uses for linguistics 1

In the book '48 hours' by J Jackson Bentley (a fairly crappy crime thriller), forensic corpus linguistics turned up. In the book, a woman is kidnapped and her kidnappers make her send a videoed message to her boyfriend. She's a smart cookie so she gives lots of coded information about where she's being held in the message, including the word 'print'. The police try to find out what she might mean by this. They do that by running the word through a database to see what words 'print' occurs with most often.
“Luke again,” the speaker chirped. “The computer is showing that the word ‘print’ can be associated with the word ‘press’ in the next sentence, as in ‘printing press’. This could be code for Dee telling us that the industrial unit houses a printing press.”
There's a good deal of suspension of disbelief required to get through this book, but this is a real thing. It's called 'collocation': when a word tends to co-occur with another one with greater than chance frequency.
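If you fancy seeing the mechanics, here's a toy sketch of a collocate count in Python. The little 'corpus' is invented for illustration; a real tool like the BNC interface does the same kind of count over millions of words.

```python
from collections import Counter

# Invented mini-corpus, tokenised into words (illustration only)
corpus = (
    "the printing press changed history . "
    "she read the fine print in the contract . "
    "the printing press broke down . "
    "print material was scarce . "
    "large print characters help some readers ."
).split()

# 'print' and its derivatives, as in the search described above
targets = {"print", "printing", "printed"}

# Count each word that appears immediately after a target word
collocates = Counter(
    nxt for word, nxt in zip(corpus, corpus[1:]) if word in targets
)

# Most frequent right-hand collocates, best first
print(collocates.most_common(3))
```

In this tiny sample 'press' comes out on top; with a real corpus the list would look like the screenshot below, ranked by raw frequency (serious collocation work would also weigh chance co-occurrence, e.g. with mutual information scores).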

If you look at a corpus (collection of texts) like the British National Corpus, you can very easily make it tell you this stuff (I'm no corpus linguist and even I can do it). Here's a screenshot of what happens if you look for the words that most frequently occur immediately after the word 'print' or one of its derivatives (such as 'printing'). The photo's a bit small but 'press' is there on the list, in eighth most common position (if I've worked the search terms right), after things like 'characters', 'material' and so on. I suppose, if you were looking for a clue to a place, 'press' would be the first one to give you anything to go on.


You can click on the words and find out what the context is, just in case there's some false results or you want more detail or whatever, and you get this:


That shows you the type of writing it was found in, and gives you the bit of sentence either side so that you can understand the phrase in context. I'm not really au fait with police techniques, but I wouldn't be at all surprised if they do use this kind of method when it's appropriate. There are such people as forensic linguists who work closely with the police, whose job is often to determine if a particular person is the author of a document.

Spoiler alert: 
So they located a nearby printing press and, after much showdown, rescued 'the girls', as the two adult hostages were patronisingly referred to throughout. 

Friday, 20 July 2012

'Informant incompetence'

I can't now remember where I heard the phrase 'informant incompetence', but it's a slightly cruel way of describing a perennial problem in linguistics (and presumably other disciplines too): when the people giving you linguistic data simply fail to understand what you want from them.

Tuesday, 29 May 2012

Double modal or double fluff?

Those of us who are interested in dialect syntax but don't make it our business to conduct experiments into it are always on the listen-out for interesting examples. You can't help it, after a while. On the Antiques Roadshow back in April, I heard one of the experts say this:
What date would that might have been?
He didn't stumble over it, it was very fluent production, so he either meant to say it or didn't notice what he'd said. But we seem to have a double modal construction here, something which is not found in Standard English and is attested but not common in certain dialects.

The modals are would and might, and if we put the sentence into a declarative form, you can see what the issue is:
That would might have been what date.
Either modal on its own is fine, but both together is not permitted in Standard English. As this is not part of my dialect I can't be sure that this particular combination is allowed in any dialect, but certainly two modal verbs can co-occur in many people's speech.

Not, however, in the antiques expert's speech, I'll bet. I would put money on this being a performance error, which went unnoticed because the fronting of the first modal would means that it's not adjacent to the second modal might. I would guess that he started out asking what date it would have been, and switched halfway through to asking what date it might have been, and the two met in the middle in a sticky mess. Perhaps the much higher frequency of would-questions than might-questions had some influence too (frequency estimation not based on any data or actual facts at all).

This kind of thing makes it so much harder to do dialect syntax through data collection. You might only have a few instances of double modal questions in hours of data, if you're working from interviews, and if a couple of them might be performance errors, how can you be sure of anything? This is why dialect syntacticians have to be cunning as a fox who's just been appointed Professor of Cunning at Oxford, and devise data collection methods that they think will cause people to use more double modals, but without telling them that they want them to use double modals. And getting people to say something in a certain way is really bloody hard. Normal people seem to have this quaint idea that what you say is more important than the way you say it.

Friday, 9 March 2012

Doubleplusungood

George Orwell created a new form of English for his novel Nineteen Eighty-Four called Newspeak. Its aim was to reduce the ability of the people to think unorthodox or subversive thoughts, so for instance the word free was reduced in meaning to have only the sense as in 'this dog is free from lice'. One couldn't use it to express other concepts of freedom such as free speech. It also aimed to simplify the language in other ways, such as eliminating antonyms (opposites). Thus there is no good/bad pair, but rather the opposite of good is ungood. Instead of warm we have uncold. It's suggested that the more unpleasant word is chosen to keep out of the pair, but I wonder if there's another explanation.

I was reminded of this when I was reading about marked and unmarked pairs in Greenberg (1966). The basic idea is that out of any correlation pair, one member is marked and the other unmarked. The marked member has a much more restricted meaning, whereas the unmarked member can stand for the neutral value. A few examples will clarify, from maths, lexicon and morphology (all from Greenberg).

-5 can only stand for the negative value of 5, whereas 5 may refer to +5 or to the abstract notion of 5.

Man, traditionally, could have a solely masculine meaning or could refer to humankind as a whole (that is, men and women). Woman could never be used with anything other than a feminine meaning.

Likewise, in Spanish, where nouns and adjectives are inflected according to whether they are masculine or feminine, the same pattern occurs. A group of men, if referred to as 'good', has the adjective buenos with the masculine -os ending. A group of women has the adjective buenas, with the feminine -as ending. A group of men and women together will be described as buenos but never buenas, even if there are ten women and only one man.

In each of the examples, the unmarked option is the one that is used in neutral contexts (and therefore shows up with greater frequency in corpora, so we can count frequencies when testing for a neutral context is not possible).

Back to Newspeak now: Greenberg also notes that
A considerable number of languages, African, Amerind and Oceanic, have no separate term for 'bad' which is expressed by 'not good'. On the other hand, there is as far as is known to me, no language which lacks a separate term for 'good' and expresses it normally by 'not-bad'. (Greenberg 1966: 52)
He gives this as one of a number of universal pairs in which one member is always the marked one. The unmarked member is the one we use in neutral questions such as 'How wide is it?' (not 'How narrow is it?'). It seems that of the pair bad and good, Orwell selected the unmarked option to keep (after all, he clearly didn't follow the rule of keeping the less pleasant adjective). If anyone out there doesn't have a PhD to write, it might be fruitful to test Orwell's adjectives and see if the one that is kept is the unmarked member of its pair. Sometimes this will be possible by looking at neutral contexts; at other times a corpus check of frequency will be in order. For example, I can ask both 'How warm is it?' and 'How cold is it?', so neither can immediately be said to be the neutral option. If, however, one turns up significantly more frequently in the corpus, it is likely to be because it is used to express the unmarked, neutral meaning as well as its own specific meaning, whereas the other member is restricted to its specific meaning.
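The frequency test can be sketched in a few lines of Python. The token list here is a made-up stand-in; in a real study you'd count occurrences in something like the BNC, and you'd want a proper significance test rather than a raw comparison.

```python
from collections import Counter

# Invented stand-in for a corpus (illustration only)
tokens = (
    "how warm is it today it is warm the water is warm "
    "but the wind is cold"
).split()

counts = Counter(tokens)

def likely_unmarked(a, b, counts):
    """Guess the unmarked member of a pair as the more frequent one,
    on the logic that it covers neutral uses as well as its own."""
    return a if counts[a] >= counts[b] else b

print(likely_unmarked("warm", "cold", counts))
```

In this toy sample 'warm' wins 3 to 1, so it would be flagged as the likely unmarked member; with real data you'd check the margin was statistically significant before concluding anything.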

Reference:
Greenberg, J. H. 1966. Language universals, with special reference to feature hierarchies. Janua Linguarum. The Hague: Mouton & Co.

Wednesday, 21 September 2011

Is that a fish in your ear?


There's a new book out which I haven't read yet. However, that never stopped anyone posting an Amazon review, so I'll throw my thoughts into the pot. It's called Is that a fish in your ear?: Translation and the meaning of everything, by David Bellos. His son Alex wrote a book called Alex's Adventures in Numberland, which I also haven't read but which is always on Waterstone's featured displays.

I've got another book called The meaning of everything (which is excellent, by the way - by Simon Winchester, about the Oxford English Dictionary), so no points for the sub-title. Points for the title though, which references the Babelfish from Douglas Adams' Hitchhiker's guide to the galaxy.

There was an extract of this book featured in the Independent the other day, describing how Google Translate works. Google Translate is a much-mocked tool, and originally rightly so. It could be relied upon to give you absolute garbage, no matter what you put into it. Hours of fun could be had translating text from one language to another and back again, and sniggering at the Chinese whispers result. Even better fun if you put it through more than one language on the way. These days, however, Google Translate is disappointingly good. It gets translations pretty much completely accurate most of the time (NB It still should NOT be used to translate if you don't know the output language - you cannot guarantee it isn't utter nonsense).

The section featured in the Independent describes how it works. Here's an extract from the extract:
  
In fact, at bottom, it doesn't deal with meaning at all. Instead of taking a linguistic expression as something that requires decoding, Google Translate (GT) takes it as something that has probably been said before.
The corpus it can scan includes all the paper put out since 1957 by the EU in two dozen languages, everything the UN and its agencies have ever done in writing in six official languages, and huge amounts of other material, from the records of international tribunals to company reports and all the articles and books in bilingual form that have been put up on the web by individuals, libraries, booksellers, authors and academic departments.
It uses vast computing power to scour the internet in the blink of an eye, looking for the expression in some text that exists alongside its paired translation. Drawing on the already established patterns of matches between these millions of paired documents, Google Translate uses statistical methods to pick out the most probable acceptable version of what's been submitted to it.
This is fascinating, and obviously a good way to do it. After all, people do speak and write in fairly formulaic chunks a lot of the time. It's an efficiency device, so that we don't have to create new expressions from scratch all the time. This is why you get annoying cliches like at the end of the day and in any way, shape or form. It's also why you have standard greetings (how's it going) and ways of expressing yourself like I'm so sick of (X).

And as the author points out, human translators basically work this way too: they can often pre-empt the person they're translating and guess what will come next, based on frequently-used expressions. But this way of translating assumes that everything we say or write (or almost everything) has been said before. One of the first things we tell beginning linguistics students is that we can come up with a completely new sentence, that's never been uttered before, and any speaker of English can understand it. The standard practice is then to come up with some ridiculous sentence, like All of my armadillos have been put through too hot a wash and have shrunk.

I suppose that, faced with this sentence, Translate would take its constituent parts and translate them. So, for instance, it might find the string too hot a wash, or even have been put through too hot a wash, paired with a translation, somewhere in its corpus.
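Here's a toy sketch of that phrase-lookup idea in Python. The phrase table and its scores are entirely invented for illustration; a real statistical system learns them from millions of paired documents, and combines phrase scores with a language model rather than picking each phrase independently.

```python
# Invented phrase table: each known English phrase maps to candidate
# French translations with made-up probability scores
phrase_table = {
    "all of my armadillos": [("tous mes tatous", 0.7)],
    "too hot a wash": [
        ("un lavage trop chaud", 0.8),
        ("une lessive trop chaude", 0.2),
    ],
    "have shrunk": [("ont rétréci", 0.9), ("se sont rétrécis", 0.1)],
}

def translate(phrases):
    """Pick the highest-scoring translation for each known phrase;
    leave unknown phrases untranslated as a fallback."""
    out = []
    for p in phrases:
        candidates = phrase_table.get(p)
        if candidates:
            best = max(candidates, key=lambda c: c[1])
            out.append(best[0])
        else:
            out.append(p)
    return " ".join(out)

print(translate(["all of my armadillos", "have shrunk"]))
```

The point is that nothing here 'understands' armadillos or washing: the system just retrieves the most probable previously-seen pairing for each chunk, which is why novel sentences can trip it up.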

In fact, I just tried it and it didn't fare so well. I put it through an English-French-English process and it came back with this translation:
All my tattoos have been too hot to wash one and have narrowed
If you fiddle with the alternate translations you get there eventually, though I'm not sure how idiomatic it is. Ah well. There's jobs for human translators yet.

If you're waiting for the paperback edition of this book, in the meantime I highly recommend Mouse or rat?: Translation as negotiation by Umberto Eco. I have read that one, and it's utterly engrossing.