linguistlaura: July 2021

Monday, 19 July 2021

How (not) to do academic surveys

As you know, my co-author and I were running a survey recently that lots of you took part in (thanks!). We had 959 responses by the time it closed. Most of the data is numerical so I'll be working on the analysis for a while yet, relearning how to do all that stuff, but from a first eyeball it looks like we've got some really clear results. Turns out people's judgements on most of these things are very clear! I thought I'd reflect a little bit on the specific way we set up this survey and the methodological lessons we learnt.

We had a few questions at the end to find out a bit more about the respondents. We asked what their language background was, whether they considered themself to be a native English speaker, how much they socialise online, and whether there was anything else that they felt was relevant. We asked these questions - and these questions only - for reasons.

The 'native speaker' question was because in our corpus study, we'd found lots of examples of the construction that we'd have considered ungrammatical, and that seemed to be written by non-native speakers. However, if this is an 'internet language' phenomenon, and internet language is global, we need to consider those varieties too. Some people were surprised that we had 'other' as an option alongside yes and no. This is because people's language situations are complicated! It's not always easy to define what counts as a native speaker. We wanted people to be able to say 'well technically no, not on the narrow definition, but I think of myself as such' or whatever. And some people did.

We also asked about language background. This was because we thought it might make a difference for the reasons above, and also because we found a lot of examples of some specific types that, again, we found ungrammatical, in tweets written in Indian English, so we wanted to try to capture some of this information. I also wanted to allow people to state this in their own words.

One person (and I hope they don't mind me talking about it here) mentioned in their comment that the survey didn't take into account other varieties such as AAVE (their example). AAVE is also known as African American English or African American Language, so-called because its speakers are mostly Black Americans. It's true, we didn't explicitly ask about this, just as we didn't explicitly ask about Indian English (which we knew might behave differently) or Multicultural London English (the rough equivalent of AAL relevant to our UK context) or French (which we know also has a because-X construction). The 'language background' and 'any other information' boxes were there for people to provide any information that might be relevant there, such as 'I'm bilingual in Mainstream American English and AAL and here are the differences in my judgements' or 'This is exactly the same in French by the way'. No one did this, but they could have done. I kind of wish they had, though, and maybe I should have explicitly asked about it. I'd be interested to know whether this person knows if the because-X construction is similar or different in AAL, and I would very much like to read that study, but this survey wasn't about comparing those two varieties so it was beyond our scope. We were happy to accept responses from any variety of English because we want to know how because-X works in general, and we know that it spans a number of varieties.

We asked about how much people socialise online because we think this phenomenon is more widespread among, or at least more familiar to, people who are more used to 'internet English'. Lots of the comments we got confirmed that other people also think this. This was a vague nod to Gretchen McCulloch's 'internet age', and in fact many people used her scale to give their answer here. This is not about how old you are in years, but about how immersed in the internet you are. It's a complex scale because there are the kiddos who've never known anything but the internet as it is now, late adopters, early adopters... age doesn't match how long you've been online. It's an interesting typology and you should read her book 'Because Internet' to find out more about it.

We didn't ask about people's age - for exactly the same reasons. Some people were surprised we didn't ask this, and gave their age in the 'any other information' question. But also I had asked about age in the survey I did in 2014, and I had no reason to ask for this information again. You should only ask about the personal information that you actually need. For those who are interested, here's the data from that survey, which was a bit sloppy so don't judge me. I hope that if you click this image you can expand it so that it's readable. It shows the acceptability ratings for 22 sentences for each of three age groups (I didn't include the oldest and youngest because there weren't many respondents in those groups), arranged by the overall rating of the youngest age group (blue bar on the left of each group) from lowest to highest.

A graph with 22 sentences ranked in order of acceptability rating from lowest to highest, with each one having the ratings for three different age groups. A description of the key results in words follows the image.

On the right-hand side are the sentences that should be fully acceptable for everyone, like 'I was late because I forgot to set my alarm'. Not much difference in the age groups here. On the left are the ones that everyone hates, like 'I was late because that I got lost'. Here the millennials seem most accepting, with unusually high ratings for verb phrases and full noun phrases with articles. Gen X were particularly happy with 'I'm edgy because if I left the oven on' and with the ones with prepositional phrases like 'I'll be late because at the doctor's'. These are ones that could be elliptical: short for 'I'll be late because I'm at the doctor's'. This is a much, much older and more established construction than because-X, so it makes sense they'd interpret the sentences that way and be happier with them (still not that happy, mind).

From about halfway along the graph until the highest rated ones, there's a clear set where the youngest respondents, aged 18-25, were the most accepting of the sentences. These are the ones that are characteristic of because-X, so they can't really be ellipsis like the prepositional phrases above, and they have a noun or an exclamation after the because, like 'I'm here because the internet', 'Studying because school' or 'I can't believe she did that because honestly'. So yes, we think age does make a difference, but now that we know this, we didn't need to ask about it again. We also only need to know whether a whole group of people simply doesn't accept this construction at all, because then we might need to factor that into the analysis; we're not interested in tracking the change in the construction via an apparent time study, for instance, which is one good reason for asking about age. This is not a sociolinguistic study, so sociolinguistic variables are only relevant to the extent they'll affect our results.

Another thing we didn't ask about was gender. I didn't ask about that last time, either, because I had no reason to think it would make any difference. Even more so than with age, we have no reason to believe that a whole gender of people simply don't use this construction. If there is a bit of difference among genders, that's fine - our analysis can cope with that. We want to know, for the people who use the construction, how does it behave syntactically? The gender of the language users, as long as there's not a total categorical difference, therefore isn't that informative here. Again, we aren't trying to find out who the speakers are; that's a study for another linguist. In terms of sampling it might be a problem if, for instance, we only had men taking our study. That would mean we couldn't generalise to all language users, and if we didn't know who the participants were we wouldn't know that and might generalise wrongly. We took the view that this was very unlikely to be the case with nearly a thousand responses. We know that at least some people of various genders took part because they told us in the comments.

So those are the things that I think we're happy with in terms of how we set up the survey. We got a lot of background data that will be pretty hard to work with, because those answers are all free text, but it's also very rich so we'll see what we can do with it.

(cw: discussion of fatphobia)

But we also did some things that weren't quite right. The biggest one of these was one of our examples, which several people pointed out was fatphobic. The way we created our sentences was to take them from our corpus if a sentence of the right form existed, and then modify them (replacing words) to prevent them being searchable and therefore identifiable (thanks to Mercedes Durham for this tip). For ones that didn't exist, we took similar constructions that did, and modified them to be the right syntactic form. This meant, we hoped, that they were all realistic examples. In doing this, we also thought we had avoided using any that were potentially offensive or harmful (obvious examples being offensive language). Clearly, we messed this one up. I can't speak for my co-author on this but I come from a position of my personal relationship with weight being basically the default/stereotypical societal one, and therefore I have to work harder to remember not everyone's experience is the same - just like as a white person I have to remember that I might miss instances of racism and be more aware. I'm aware of campaigns like Health At Every Size, but I just wasn't aware enough here to catch this. Sorry to anyone who we triggered or upset with that sentence; lesson learnt and thanks for pointing it out to us in the survey comments.

Less harmfully, but annoyingly, we ended up using some wording that didn't chime with everyone. Again, I thought I'd removed anything that was region-specific (for example, I asked people about the verbs 'call' vs 'ring'), but some people mentioned that 'club together' is a British phrase (at least, they thought so). So that might affect that particular item, which is not what we wanted. Similarly, 'on mobile' was a bit unidiomatic for some respondents.

One type of comment that really interested me took issue with the wording of the survey. Some were just along the lines of 'None of these sentences make any sense to me', which is fine, we knew for some people that would be the case. But some said things like 'These are not sentences', and they didn't mean exactly that they're not grammatical, but that they don't meet the definition of a sentence and they're something else. We used the word 'sentence' in the survey because that's what normally seems familiar to people. Linguists typically don't use it in any technical sense, precisely because it doesn't have a good definition. We might use 'utterance' instead, which would have probably been more accurate for these commenters as it doesn't imply a certain form, but that's not a familiar term for everyone. I'm guessing these commenters feel that a sentence must be grammatical, and otherwise it's not a sentence, which is a position similar to that of people who say that something is or is not a real word. It's a perfectly acceptable definition of a sentence for someone whose goal is grammatical writing, but it's a circular definition for a linguist so it's no good if you're studying utterances that are grammatical for some people and not others, as we were here. I'm not sure what we should have done here instead; you can make explicit that some of them might not be full grammatical sentences but we really wanted to get away from priming people to give the 'right' answer.

One last thing that I hope doesn't affect our results too much is that some people missed a part of the instructions. We had a 'fill in the blank' question. We wanted to allow for people to say nothing was missing, but we didn't want to make the questions optional as that wouldn't tell us if they thought the sentence was fine as it was or if they'd just skipped it. So we made the questions required, but said 'put an x in the box if you think it's fine as it is'. Quite a few people didn't see that part of the instructions, which we could tell because they wrote something else like 'This is fine'. So I hope that not too many people wanted to leave it blank but felt obliged to fill it in. If they did, it's OK, because we really just wanted to know what people filled in there, but still, it makes the survey annoying for them to do.
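As a rough illustration of how responses like these might be tidied up before analysis, here is a minimal Python sketch. It is not taken from our actual analysis: the function name code_blank_response and the list of phrases are made up for this example, and a real list would have to be built by reading through the responses themselves.

```python
import re

# Hypothetical list of things respondents might write to mean
# "nothing is missing" instead of the requested 'x'.
NOTHING_MISSING_PHRASES = {"x", "this is fine", "fine", "nothing", "nothing missing", "n/a"}


def code_blank_response(raw: str) -> tuple[str, str]:
    """Code one fill-in-the-blank answer as (category, text).

    Categories:
      "nothing_missing" - an 'x' or an equivalent phrase from the list above
      "blank"           - empty once whitespace and punctuation are removed
      "filled"          - anything else, kept as the respondent typed it
    """
    # Drop punctuation (keeping '/' so that 'n/a' survives), squash runs of
    # whitespace and lowercase, so 'This is fine.' matches 'this is fine'.
    cleaned = re.sub(r"[^\w\s/]", "", raw)
    cleaned = re.sub(r"\s+", " ", cleaned).strip().lower()

    if not cleaned:
        return ("blank", "")
    if cleaned in NOTHING_MISSING_PHRASES:
        return ("nothing_missing", "")
    # Keep the original wording (minus surrounding whitespace) for analysis.
    return ("filled", raw.strip())


if __name__ == "__main__":
    for answer in ["x", " This is fine. ", "of the"]:
        print(repr(answer), "->", code_blank_response(answer))
```

Matching whole answers against a short list, rather than searching for the word 'fine' anywhere in the text, is deliberately cautious: it avoids wrongly throwing away a genuine fill-in that happens to contain one of those words.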

Long post, sorry! But reflecting on this was a really useful experience for me and I hope that it's interesting to you as well. You didn't have to read this far, so thanks for doing so!

Monday, 12 July 2021

Let's lead led away

All the way back in 2013, I declared the spelling of the past tense of the verb lead, which is standardly spelt led, dead. Or ded. I'd noticed it being misspelt as lead so many times, including on the BBC news website, that I thought it was probably simply prolonging its agony to try to preserve led. Of course it's still around, because inexplicably I'm not in charge, and written language doesn't change that fast. But I was reminded about it the other day and was frustrated all over again by the fact that this one is actually one that has sensible spelling and pronunciation, unlike most of our irregular past tenses.

Lead is pronounced with an 'ee' sound, like read, and led is pronounced like red, so it really ought to be totally transparent and memorable and unproblematic. The problem with lead/led is, though, that we also have read/read, which is not spelt red, though we have another word that is. Oh and we also have the word lead, for the metal, which is pronounced like led. All of which obscures the fact that lead and led are pronounced more or less as they're spelt, unlike read (past tense) and lead (the metal).

I feel bad for it, I really do, but I also think it would just be simpler to let it slip quietly away.

Monday, 5 July 2021

Because linguistics, again: your help needed!

Have you ever used ‘because’ like this: Yeah, no, because reasons? You aren’t giving a proper reason at all, you’re making a metalinguistic comment about something. Together with Ellie Cook, one of our graduates from 2020, I’m investigating this phenomenon, which you might remember is known to linguists as ‘because X’.

You might remember it because I first wrote about ‘because X’ all the way back in 2012. It was just a quick blog post noting it as an interesting construction. A couple of people talked about because becoming a preposition during 2013, notably Neal Whitman and Stan Carey. This Atlantic article appeared, quoting me and attributing it to Gretchen McCulloch (to be fair, the author got it from a post where Gretchen was quoting me - though with attribution). Then it was voted 2013's ‘Word of the Year’ by the American Dialect Society, and I did a quick study on it as a holiday project in early 2014. Well, it sort of snowballed from there, and it became obvious that this seemingly unimportant point of usage variation can tell us something about how language works.

I don't talk about it much on here, but behind the scenes I've been working on this off and on for a few years, picking away at little bits of it to find out what's going on. I've given a few conference papers and talks on the topic, including this one and this one (let me know if you want the version for college students, which is very accessible and has bonus #CheekyNandos content).

From that first survey, put together as a quick and fun project with no real aims in mind beyond finding out what the heck was happening, I discovered that although bare nouns like because reasons are frequent, it also shows up with lots of other parts of speech: because fake news is a common one (with a modified noun), and because just in case is the slogan of a well known holiday company in the UK. It also tends to be something that is a complete concept in itself with specific connotations (so because reasons means ‘because of some vague and probably not very well-thought-out unspecified reasons’). We probably share some knowledge (e.g. ‘people do things for stupid reasons or no reason at all’), and it might have a slightly tongue-in-cheek usage (e.g. ‘You and I both know that I have no good reason for this but let's pretend I do’).

Lots of careful research later, and we’ve been able to describe ‘because X’ quite precisely, as involving ‘sentence fragments’ – that is, incomplete sentences that express a whole thought. It’s like when you say ‘Going out!’ in answer to the question ‘What are you doing?’ This is really unexpected because these sentence fragments, by definition, shouldn’t show up within sentences! But this is what gives them their quirky sound: doing something unexpected gives a slightly jarring pragmatic effect to make the listener realise this isn’t normal ‘because’, giving a reason, but new ‘because X’.

So what now? Well, before we can write up this research properly, we need to test a few specific things about this analysis. We've set up another survey. Where the last one necessarily took a scattergun approach, because we didn't know what was acceptable and what was not beyond just what seemed right to us, this one is more careful. Based on the predictions of a number of hypotheses, we've created another list of sentences that might sound more or less natural, and I need people's opinions on this. We need lots of people, because the more people who give their opinion, the more reliable the results are.

If you want to take part in this research, you can! You can fill in the survey here, which will take no more than ten minutes, giving your own opinion on how different sentences work with 'because X'. We really need a lot of participants to get good results, so share it with anyone you think might be interested!