Loading summary
David Palfreyman
Study and play come together on a Windows 11 PC and for a limited time, college students get the best of both worlds. Get the Unreal college deal everything you need to study and play with select Windows 11 PCs. Eligible students get a year of Microsoft 365 Premium and a year of Xbox Game Pass ultimate with a custom color Xbox wireless controller. Learn more@windows.com studentoffer while supplies last ends June 30 terms@akamsCollegePC this episode is brought
Advertisement Voice
to you by Prime Obsession is in session and this summer, Prime Originals have everything you want. Steamy romances, irresistible love stories and the book to screen favorites you've already read twice off campus. Elle every year after the Love Hypothesis, Sterling Point and more Slow burns, Second Chances chemistry you can feel through the screen. Your next obsession is waiting. Watch only on Prime.
David Palfreyman
It's time to refresh your yard during Spring Backyard Days at the Home Depot. Get low prices guaranteed on propane grills starting at $179 like the next grill 3 burner gas grill. Or get $50 off a select Weber Spirit Grill and bring big flavor to your backyard. Then set the scene with Hampton Bay String lights that bring it all together. Shop Spring backyard days for seven days at the Home Depot now through May 6th. Exclusions apply. Seehomedivot.com Pricematch for details welcome to the New Books Network
Ingrid Pillar
welcome to the Language on the Move podcast, a channel on the New Books Network. My name is Ingrid Pillar and I'm a Humboldt professor at the University of Hamburg in Germany, as well as Distinguished professor of Applied Linguistics at Macquarie University in Sydney, Australia. My guest today is Professor David Palfreyman. David is a professor in the Department of Curriculum and Methods of Instruction at the United Arab Emirates University in Al Ain in the United Arab Emirates, not too far away from Dubai. David is an applied linguist whose research expertise is in academic literacies in multilingual contexts, such as in the higher education sector here in the United Arab Emirates, where students study predominantly through English as a medium of instruction, but also need to master reading and writing in Arabic. We are meeting today on the Dubai campus of Syed University to chat about David's most recent book, Bilingual Writers and Corpus Analysis, which was published by Rutledge in 2023. Welcome to the show, David.
David Palfreyman
Thank you. It's great to be great to be with you and discussing this.
Ingrid Pillar
Hi David. So today we want to talk about your new book, Bilingual Writers and Corpus Analysis, which you've co edited together with Nizar Habash. The book is one of the first Examples of a bilingual writer corpus, the Syed Arabic English bilingual undergraduate corpus. That's a mouthful. The abbreviation is Z A E B and the book includes writings of hundreds of students in two languages. Can you maybe start us off by telling our listeners how you actually got to write this book?
David Palfreyman
Right, yes. So I'll go back a bit in my career, where these interests came from. So my first degree was in linguistics and I was always interested in different languages. Not everybody in linguistics is necessarily interested in learning different languages, but I was. So my first degree was in linguistics and then after that I began teaching English as a foreign language. So that was the first part of my career was focused very much on English and I noticed that people in the efl, English as a foreign language world, were not necessarily that interested in other languages. It was very much focused on English. Of course, that was the subject. I was working in other countries in Italy, in Spain, in Turkey, and people who were teaching English would pick up some of the local language. But there was very much of a focus on the native speaker as a separate world from other languages. So when I came to do my master's, I chose, rather than a TESOL related masters, I chose a program in London which was called Language in the Multicultural Society or Community. And so I was in contact with like medical workers, police inspectors, all kinds of people who were, who were research, I mean, who were interested from their work in how language plays a role in multilingual multicultural communities. So that was a real eye opener for me. And then I continued with my career in teaching EFL and then teaching like literacy, university literacy and training teachers in EFL and then working with faculty across disciplines. So that was also interesting to get those different perspectives. So then for the last some years, many years I've been working, I'm British in origin, but I've been working in Dubai, mainly in the United Arab Emirates. And I found it a very interesting situation in relation to, well, it's very multicultural community. There's a local community which is very influential. I mean, the citizens of the UAE are maybe, I don't know if listeners would know, but it's maybe 10% of the population are citizens of the country and all the other 90% are people from all different countries of the world who are doing all different kinds of jobs.
Ingrid Pillar
And yeah, this is something that's really difficult to imagine outside the uae, that the native, if you want to call it that population of the citizens are actually such a tiny minority.
David Palfreyman
That's right, yeah.
Ingrid Pillar
That gives us an interesting sociolinguistic Situation, right?
David Palfreyman
It does, yes. So in the uae, the United Arab Emirates, and especially in Dubai, English is very predominant, you could say. And it's used in sort of high status domains like in higher education, for example. Most of higher education is through the medium of English, but it's also used between all these people from different countries as a lingua franca on a daily basis in shops and taxis and dealing with all kinds of people. So in some ways. So you can see that the 10% of the population who are kind of locals, they have a very close community and they value English. Obviously it's highly promoted in the UAE and in other countries in the Gulf as well. But also they value Arabic as their first language. And Arabic also is a kind of world language. It's used in many different countries across the world, especially Muslim countries. And so that. So both in the uae, both English and Arabic have prestige, but of different kinds. And really foreigners coming to the UAE do not learn Arabic very much.
Ingrid Pillar
And because I know that you've learned Arabic, David, so can you maybe tell us how that came about? Because you've just spoken about the monolingual mindset that prevails often amongst English language teachers and in the TESOL profession, and you're clearly an exception to that, that. How did you get to learn Arabic?
David Palfreyman
Wow. Okay. I would say at the outset that considering how long I have been living in Arabic speaking countries, my Arabic has not developed as much as, for example, I worked in Turkey for a year and I learned a lot of Turkish through. I'm much more fluent in Turkish, for example, than I am in Arabic.
Ingrid Pillar
Interesting.
David Palfreyman
But yeah, so how did I learn Arabic? So there was not a lot of. It's not like I arrived and there was an orientation course set up to teach me Arabic, really apart from a couple of words, maybe just for interest, but I was interested in it. So the first time I came to the UAE beforehand I bought a book called Teach Yourself Gulf Arabic. So I was interested in the Gulf Arabic. So that's another issue about Arabic. But there are many different kinds of Arabic. It's spoken across a great range of countries from Morocco to Yemen. And there are many different dialects which are really not necessarily comprehensible to each other. And then there is a kind of classical Arabic of the Quran, and then there is something called Modern Standard Arabic, which is a kind of not entirely settled standard version, which is not as old as the Quranic Arabic, but tries to sort of be academic and formal, but is really only used like it's used in Writing. And people speak it on the news when they're reading from a script or maybe in very academic discussions, for example, or in a lecture or something, they might use that. So there was a question about what variety of Arabic to learn. So I learned a bit of Gulf Arabic. And at that time, I was teaching in a middle school, teenage boys. So I listened to their version of. So if you can imagine listening to teenage boys in any country, you get a particular impression of what the language is like. And then later, I was teaching at Zayed University, and so I would hear the students. So I can understand a fair amount. I can read the Alphabet. I don't necessarily. My vocabulary is not great, so I don't necessarily know what it is that I've read. Yeah, that's a problem. Oh, yeah. So that was a consideration in preparing this corpus and the book.
Ingrid Pillar
Yeah. Cool. So let's now move on to the book. And you've already sort of given us a really tiny glimpse into the very complex sociolinguistic situation in which this book has developed. And so, as I said earlier, it's about a bilingual writer's corpus. So maybe you need to tell us first what is actually a corpus and what's so special about a bilingual writer's corpus?
David Palfreyman
Okay, so firstly, what is a corpus? Traditionally, the first way, I think the word corpus was used in this sense. So a corpus is a collection of language, often a collection of writing. Now, traditionally, you would talk about the corpus of Shakespeare's writing, for example. So it would be a literary source, and it would be the. So corpus in Latin means body. So it's like the body of writing by that particular person. Or then from. I would say maybe the 1970s, people started to look more at descriptive corpora of everyday language. Or, you know, it could be newspapers, for example. So we started to have big collections of back issues of newspapers. And so this provided a lot of data. And with the advent of computing processing of text, people were able to keep track of, you know, thousands of words, hundreds of thousands of words of text. So a corpus is. So the definition that I used in the book is a large digitized collection of authentic representative texts. Now, so I'll come back in a minute to say a bit more about how large is large and representative of one. But the use of these corpora is really for seeing examples of real language in use. So there are corpora. So corpora is the plural of corpus. Yeah, there are corpora of. For example. Yes. Newspaper writing in English or medical textbook English or conversational English. Like later there came to be maybe transcribed, spoken, corporate. And these have been used to prepare, for example, dictionaries. So for dictionaries, you want to see, okay, what are the words that people are using in real life? What are the, you know, what do those words mean? What are the different meanings they have in everyday use? So corporate of English were very much important as a model, but not a prescriptive model, but a descriptive model of what people actually use in English. So many years ago there was a corpus from Birmingham called the Cobild corpus, and there was a textbook which was based on, on that, using extracts from real conversations that people had, which is quite a difference from the way of analyzing language, which there might have been in the past based on rules like prescriptive rules, or based on classical models, or based on, in the case of generative grammar and Chomsky and perspectives based on intuitions, grammaticality judgments.
Ingrid Pillar
So I guess that's where authentic came in, right? You mentioned earlier that it's an authentic collection.
David Palfreyman
Yeah, So a corpus is a big collection of everyday language. Well, authentic language now. So I said a big collection. How big does it need to be? So as computer power developed, the size of corpora kind of became thousands of words, hundreds of thousands of words, a million words. And then with the advent of the Internet, so there's so much text available online, because this text really needs to be in digital form to be analyzed easily. So with the Internet, Wikipedia for example, now is he recorded? And so there are, for English, there are many billion word corpora.
Ingrid Pillar
Can I just ask you. So many of our listeners will be familiar now, inevitably, with large language models. And large language models are essentially corpora too. Right?
David Palfreyman
So large language models, as I understand it, then it's not my specialization, but they are models of what is happening in a corpus. So what linguists would do 20 years ago is look at these corpora, sort of sift through them, search through them for particular words. How are those words used in general? How frequent is this word? How frequent is that word to compare and analyze the corpus and draw conclusions from it and find patterns. Now. So a large language model is a kind of artificial intelligence. It's a computer program which basically does that, which looks through an even huger corpus and finds patterns, and that's how it's able to understand things, and it's how it's able to produce language as well. So that's the key thing, the generative AI, the probability.
Ingrid Pillar
Okay, so let's maybe. So you've explained to us what a Corpus is. And now the special thing about your book and the corpus you've developed is actually that it is a bilingual writer's corpus. So can you explain that side of the thing a bit more?
David Palfreyman
Sure. So the corpora I've been mentioning are. So I mentioned, for example, the, you know, there's a large corpus based on Wikipedia, for example, or a corpus of newspaper writing in English. So these are the largest developed corpora and they can be more specialized as well, but they're all based on. I mean, I don't think they really have a name, but I've. I call them genre based corpora because they're. So we said that, I said that a corpus is a collection of representative texts. Now what do those texts represent? So the early corpora would be they were supposed to represent kind of native speaker use of English because English was the first one. Later there were more specialized corpora representing, for example, medical English or everyday conversation or, or business writing, things like that. So what they represent is they represent a particular genre of writing. Now, at the same time, in the study of language and in second language acquisition, there's a lot of study of learners, so individual people who are using language. So, so on the one hand you have a corpus which is a very large kind of open collection around a particular genre, and then on the other you have data which is gathered for language research projects which is collected from specific individuals. And we know a lot about those specific individuals. We know which person said this, which person said that, we know their age, we know their gender, we know their nationality, and we have a lot of information about, typically in corpora, the genre corpora. We don't know very much about the writers of that text, like Wikipedia. So many different people write it. Sometimes there's. So corpora are usually annotated with some kind of metadata. So for example, each text, each newspaper article, for example, in a news corpus would be tagged with its date, maybe what section of the newspaper it was in. So that's some kind of data that we have there, but we don't really know anything about the writers of those. So the reason I came to the idea of a writer corpus is because I was interested in creating a bilingual corpus. So I was involved in one stage in the university's planning of graduate outcomes, graduate learning outcomes.
Ingrid Pillar
And when you say the university, you mean Syed Universe.
David Palfreyman
That's right, yeah, some years ago. And so they were saying at that time, okay, when our students graduate, what do we want them to be able to do? And one of the things was that we want them to be able to use English in an academic way, in a professional way. But then there arose the issue of, okay, doesn't their Arabic also need to be academic and professional? So in the UAE there's a tradition, I mean, in the last 50 years of the country, because it's only existed for 50 years, it made an enormous progress. So there's a tradition of promoting English in higher education, but maintaining Arabic through. So students take maybe one course in Arabic in the year to develop their Arabic writing. So there is a bit of an element of the curriculum of that. So the idea was that a student graduating should also. Okay, I'm talking here about local. So these are Emirati students. So their first language, most of them is Arabic, colloquial Arabic. So when they graduate, they're supposed to be able to, for example, write in English, speak in English in professional situations, but also write and speak in Arabic in professional situations. So an intermediate step now is the issue of a learner corpus. Right? So the corpora I've been talking about were kind of model corpora of what native speakers use, even if it's not great language, it's what people are really using and it's what native speakers are really. Now there's another kind of corpus which is called a learner corpus. So that is a collection of learners writing typically in English, like most of them are in English, not all. So a learner corpus is a collection of, for example, essays written by beginning learners of English and intermediate learners of English and high levels of English. So, for example, the Cambridge exams, a lot of the writing that students does in the Cambridge exams, first certificate and proficiency and so on, is collected into a corpus, which is an international corpus of learner English. Now why would you want a corpus of mistake ridden learner writing?
Ingrid Pillar
You know that mistakes don't exist, right?
David Palfreyman
So learner corpus is of interest to researchers on language development and language assessment also to look at what kind of mistakes, for example, occur at this level in beginning learners and in intermediate learners. And also not only errors, but also. So errors are interesting in themselves, but also things like patterns of usage. So maybe students, you know, they use there are two words like I don't know, but and however. And students use however when they're speaking much more than native speakers do, for example. So it's a question not only of, not only of errors or mistakes, but also overuse, what we call overuse, underuse in air quotes, differences. So learner corpora are often compared with native speaker corpora to see what kind of differences are there patterns of grammar, patterns of vocabulary used, patterns of organization of what they say. Okay, so we have in so what I thought was okay, there are kind of genre based native speaker corpora. There are learner corpora, but what about those learners first language? So the students that I was dealing with, you know, they were developing their first language and not getting a lot of practice in writing in their first language. And they were writing a lot of English and using a lot of English
Advertisement Voice
Ready to soundtrack your summer with Red Bull Summer All Day Play. You choose a playlist that fits your summer vibe the best. Are you a festival fanatic, a deep end dj, a road dog, or a trail mixer? Just add a song to your chosen playlist and put your summer on track. Red Bull Summer All Day Play Red Bull gives you wings. Visit red bull.com brightsummer ahead to learn more. See you this Sunday Summer who cares about your poops? Ollie does. That's why Ollie's science backed Gut Health lineup helps support your family's regularity. From daily probiotics to fiber gummies your kiddos will love. Find it all on olli.com that's o l l y.com these statements have not been evaluated by the Food and Drug Administration. This product is not intended to diagnose, treat, cure or provide prevent any disease as the Crispy chicken sandwich from 711 people always call me loud and I'm like yeah, I know I'm crispy. Did you expect me to whisper? If you want quiet, go eat some soup and reflect like I know I'm a handful, I'm bold, I'm juicy, throw some pickles and barbecue sauce on me and baby I'm a whole meal and with seven rewards I'm just $4 quiet, no crispy, saucy and $4 very only at 7 Eleven Valley 362326 participating stores only while supplies lastly app for full terms.
Ingrid Pillar
So can you maybe just to provide a bit more context because I think for most of our listeners we're probably based in the US and Australia and the English speaking world. Just to wrap our head around the linguistic profile of the students we are talking about. So you've mentioned their first language is Arabic, but they're not getting a whole lot of practice in it. And then English is the language they mostly write in. So can you maybe just explain this a bit more for us? Like what your typical writer or your typical student who is part of this bilingual corpus what their linguistic profile is.
David Palfreyman
Right? Okay, so typical students in higher education in the uae, they will have grown up speaking basically Arabic, mostly Arabic at home, Arabic dialect. So local dialect, which is not that understandable to other Arabs, maybe, and also
Ingrid Pillar
relatively low status, comparison status, except in
David Palfreyman
certain kinds of music. Yeah, that's right. So depending on the domain. But definitely, yeah, if somebody gets up and gives a lecture in Gulf Arabic, then Arabs from other countries would kind of find that straight, or might not take them seriously. So, yeah, so the students will have. And I'm talking here about this 10% of the population of this place, the local Emirati students. So they have grown up speaking Arabic dialect, colloquial Arabic, as we say, on a daily basis with their friends, with their family, they will hear some standard Arabic in TV programs. If they watch Arabic tv, even cartoons are kind of dubbed in in quite a formal kind of Arabic to sort of promote formal Arabic standard Arabic. And they will have heard a lot of English. So they're growing up in, you know, they'll be watching English films, listening to English music. Many of them will go to a school where. So over the years, the age of learning, starting to learn English at school has got younger and younger. So now most schools have some English right from the beginning, especially in certain subjects, like in maths and science. Other schools, private schools especially, are completely in English, and students will have maybe just one or two lessons of Arabic a week. So they can end up with a kind of divided proficiency. Colloquially, very much they're using Arabic, colloquial Arabic as a spoken language. As a spoken language. But then academically and formally, they would use English. Now they're. So that's the basic picture. But also one of the characteristics of Gulf Arabic for Arabs from other countries is that, I mean, certainly in the past was that they used a lot of English words because there was a lot of English presence, a lot of Indian presence in this region. And so colloquial Arabic is even more permeable to kind of loans from English, and not even regular loans, but just people will mix the two when they're speaking. So certainly in the spoken language, there's a lot of mixing of English and Arabic. And then at university, students still struggle to write in formal Arabic.
Ingrid Pillar
They might not even know that a particular word they think is Arabic is maybe in the norm, again in air quotes, is not even an Arabic word, Maybe an English word or sort of a Gulf Arabic word.
David Palfreyman
Yeah, exactly. Yeah. And for them, it might be the regular word, but then. And it is Arabic, like, it's not English, but for an Arab from Syria or Egypt, they would say, well, that's not proper Arabic. So, yeah, so basically it's Arabic as the first language. English is their second language, but their English is often much more developed in certain domains. In terms of writing, first and second
Ingrid Pillar
maybe doesn't make so much sense even. It's more about spoken.
David Palfreyman
Because they begin learning English so early, even in kindergarten, even before that, parents will often kind of give them English cartoons to watch because they're keen on children learning English. So for some students, English is a first language. Maybe alongside Arabic, two first languages. Yeah. So that's the linguistic situation.
Ingrid Pillar
Yeah. And I guess one of the amazing things is that you are actually recognizing this complex linguistic situation in your corpus, because I think oftentimes we see that corpora are around these idealized linguistic situations, monolingual situations. And so. So your book really looks at authentic language in this context, which is very diverse. So maybe tell us a bit more about what actually happens in the bilingual.
David Palfreyman
Okay, so what I wanted to put together was a corpus of texts written by students in both their languages. So the idea was to access, not to get an idea of what proper usage is or what normal usage is in English or in Arabic, but to see what they use not in the preliminary corpus, not in spoken language, but in writing. So I was in the university context, and students would. Were doing a lot of writing in English, for example. In my courses, I would be teaching English composition, for example. But I knew that they were also having to write essays in their Arabic courses. And so what I wanted to do was to put together a corpus of students writing in both their languages. And it was important that I knew which student wrote this English essay and which student wrote this Arabic essay so that I could compare for each individual also. So that's why it's a writer corpus. So when we say a corpus is a body, a collection of representative text, in this case, it's representing what we would say, plurilingual proficiency of these particular students. And this is a situation. We can talk a bit more about this, but this is a situation which is not unique to this small population of learners in the uae. There's a lot of places in the world where students are biliterate. So they're writing, for example, in English as a language of higher education, and then they go to work and they need to write in their first language or on a regular basis, they may need to use both. Thinking in terms of literacy.
Ingrid Pillar
Yeah. And maybe just as an aside, here at the Literacy and Diversity Settings Research center at Hamburg University, my colleague, Professor Inge Gogolin and some other colleagues have actually also created a multilingual corpus that is built precisely on this recognition that most people around the world maybe know, except in the English speaking world really nowadays, need to become bilingual writers because becoming educated involves for most people in the world also to learn how to read and write in English in addition to their various first language or languages.
David Palfreyman
And I realized this through contact with this when I was working with university learning outcomes. So we were having to look at not just the students in the classroom in front of you in the university, but also where are they going, you know, how employable are they going to be when they leave this English bubble of a university and are working with Arab customers, for example, or colleagues. So yeah, so that's so the, the. So zeybook corpus. So zaybuk. Oh, that's how you pronounce it, actually. So my co editor, Nizar Habash, who's an Arab and specialist in Arabic linguistics and computing, and he pointed out that ze' bok in Arabic means mercury. It's the element of mercury, like the quitspot.
Ingrid Pillar
Okay, well that's good to know, mercury.
David Palfreyman
But yeah, so it's short for the Zayed Arabic English Bilingual undergraduate corpus. So that gives you an idea of what this corpus is representative of. It's representative of undergraduate students writing in both their languages in Arabic and English.
Ingrid Pillar
And so how did you actually collect the corpus? Did you get the students to write essays as part of their everyday assessments and collect them? Or was it like special writings or how does it work and how many students are involved?
David Palfreyman
So we wanted to get writing from the same students in two different languages. And we needed to do this really through their courses. We couldn't just write to students and say, could you send us an essay? For various reasons. I mean, one is that they might not do it, why should they? And another is that it could be quite random what they wrote about, what help they got from a dictionary or whatever while they were writing. So what we did was in 2019, in the fall semester, so students coming into the university, they begin a course in English writing, English composition, and they do a course called Arabic Concepts, which is kind of writing and formal speaking in Arabic. And so when they come into the university, they do a diagnostic writing, which is for their teacher to sort of see a sample of their writing and get an idea about them. So what we did was we contacted all the teachers across many sections across two campuses. And so, okay, so the English composition teachers had a choice of what diagnostic task they gave so they, what the students would write about at the beginning of their first semester. And so this was not assessed in the sense, because that's another issue in writing. If a piece of writing is assessed then there's certain kind of pressures and responsibilities involved. So this was a diagnostic piece of writing. So it was for the teacher, but it didn't have a grade attached to it. So what we did was in the end was we talked to the Arabic teachers and asked them what kind of writing they were giving. So they gave an actual placement test to students and got them to write about. There were three topics that they were going to write about. There was social media development, national development and tolerance in society. So we found out. So the students in their Arabic class were already going to be writing an essay about this. So what we did was we asked the English composition teachers if they could use the same topics for their initial diagnostic writing for the students. So we managed to get. There were logistical difficulties, But we were confident that we had a sample of writing in both languages on comparable topics because that's important and also that the students did. Most of the students did the writing in a controlled online environment. So they were kind of typing and they didn't have access to a dictionary or anything. So it's a bit more like authentic, their authentic writing, I mean authentic unaided writing. So we ended up with a total of, I believe, about 600 essays across the two languages. We had more English essays than Arabic ones we were able to collect. But there were about 150 students who submitted an essay in both languages. So considering the challenges, I'm quite happy with that. Now in the process of this, I got in touch with Nizar, who's a professor of computing at New York University in Abu Dhabi.
Ingrid Pillar
All right.
David Palfreyman
And he's a, he's a great guy. And I went, I went to see him and I told him about the project and he was very into, he's always very, he's very enthusiastic about things and. But when I told him the, the size of the corpus, he said that's tiny. It's a tiny corpus. And I'd been thinking it was quite a impressive one. But he's used to dealing with, you know, these billion, 10 billion word corpora
Ingrid Pillar
with a large language.
David Palfreyman
I mean, part of what I, what was great with working with them, it was that he adapted like he could see the point of having a corpus like this. He said, well, okay, so we'll just need to do some things a different way. So once the essays were collected. They went through a process of, okay, so first of all, we had people who rated the essays. So we had bilingual Arabic, English bilinguals who were experienced with using the Common European Framework of Reference for language assessment. And so they gave a rating to each of these essays. Now, so for English, for example, there's kind of A, B, and C. So A is the lowest level, A one, A two, intermediate is kind of B one, b two. And then higher level up to near native speaker is C1, C2. Now, the common European framework, the CEFR, has been applied a lot to English, to some other languages, not very much to Arabic, not really to Arabic. So they were applying it to Arabic, really, for the first time. It hasn't been used very much for that. And also the CEFR has not been used for assessing first language proficiency. Now, if we go back to the idea of the corpus, the idea of the original idea of a corpus was that it's a collection of native language to give a kind of model of what English, for example, is like. But then all of these students are writing in their first language, supposedly. I mean, so won't they all get top marks? You know, if somebody's writing like somebody coming from an EFL background, it was like, wait a minute, how are we going to assess somebody's first language and be able to compare it with their proficiency in a second language? So that was an interesting point about the corpus as well. So this brought us to the idea of thinking about, so thinking about writers as learners. So basically, when you talk about a learner corpus, you mean somebody who's learning the language, learning English. And I did find, when I was looking at the literature, trying to find out, has somebody already made a bilingual corpus? And I found that what people mean by a bilingual corpus varies a lot. So there are, for example, translation corpora, which are like EU documents, are routinely translated parallel documents. So that's a bit different because that tells you how translation works. So the person who wrote the original document is not the person who translates it. So it's different people involved. There are also, I mean, I also looked at studies of learner language, learner corpora, and bilingual. And I found that often bilingual is used as a kind of euphemism for non native. So I would look at a study to see, okay, you know, it mentions bilingual, and it mentions a learner corpus, but really it was just a learner corpus of English. But these people were talked about as bilinguals to kind of give a. Because it sounds better than native. It didn't Sometimes. Sometimes it would tell you what their first language was, but that was as far as it went. So it's not. It was not really bilingual in the sense I was looking for. Yeah, so. Right. So the whole assessment issue was really interesting in terms of. So, right. So we can look at these people as learners or as users of a language. So typically we think of native speakers, so called, as users. You know, they are using English. When we have a corpus from Wikipedia that's people using English, when those people, when we know they're learners, we say it's learner English. Maybe a lot of the people writing in Wikipedia are not necessarily native.
Ingrid Pillar
No, we don't.
David Palfreyman
We don't know. So there's these two perspectives looking at. And typically we look at somebody's L1 as, you know, they're a user of that L1, but they're a learner of their L2. But this. I sort of realized a lot from this project, how much people. So people, Everybody is a learner also of their first language. So, you know, I mean, I learn new words.
Ingrid Pillar
Yeah, I think we forget that often because we are. So, I mean, just to talk about language really isn't enough because it's this conflation between spoken and written language. And of course, writing is something that requires education and requires continuous development and develops throughout your life much more than the spoken language does.
David Palfreyman
But also speaking. I mean, you know, some people are better at speaking than others. Like, yeah, in and depending on the context, you know, like, if you put me in a, you know, put me in a lecture and I'm happy to talk for hours, but put me in a business meeting and I, you know, I might. And even, for example, talking with Nizar and his team about machine learning and their kind of interest in this corpus, I found differences. Like, I would ask a question and Nizar would say, oh, that's trivial. And at first I thought, oh, well, okay, it means it's not worth. Not worth asking. But it turned out what. What they mean, like, in that it means that it's easy. It's easy to answer. Like a computer program, a couple of lines of code can answer that. So. So it doesn't mean that it's not worth asking. It means that it's. That it's easy to do, easy to address, and easy to solve. So, yeah, so. So. So we're all. I would say we're all learners of our own. So this language. So language changes, you know, and there's new words that, I don't know, ways of talking but then also on the other hand, people in a second language are also users. They're not only learners, they're also users. Like, depending on the situation, we may view them as a learner or as a user. But for example, our students would be going on to work in company, international companies, or in diplomacy. International diplomacy. And so they would be using the English while they're at university, they're using the English to learn. I mean, their courses are in English. So this dual perspective of both L1 and L2 as we are using them, we're always using them, but we're also learning.
Advertisement Voice
With Plan B emergency contraception, we're in control of our future. It's backup birth control you take after unprotected sex that helps prevent pregnancies before it starts. It works by temporarily delaying ovulation, and it won't impact your future fertility. Plan B is available in all 50 US states at all major retailers near you, with no ID, prescription, or age requirement needed. Together we've got this. Follow Plan B on insta at Plan B. One step to learn more Use as directed.
David Palfreyman
You can't reason with the sun. Trust us. We've tried this summer, it's time to put that angry ball of fire on mute. Columbia's Omnishade technology is engineered to protect you from the sun's harsh rays that can burn and damage your skin. The sun is relentless, but so is our gear. Level up your summer@columbia.com to spend more time outside and less time slathering on aloe lotion. You're welcome, Columbia. Engineered for Whatever.
New Books Network Announcer
Hey, NBN listeners. We're running our 2026 New Books Network Audience Service, and we'd love just a few minutes of your time. NBN has been bringing you in depth conversations with authors and scholars for over 15 years. We haven't done a comprehensive audience survey since 2022, and a lot has changed since then. It's time to hear from you again. Here's why we're asking. We want to understand who's listening, what subjects and podcasts you love most, and where you'd like to see us grow. Your responses help us tell NBN's story to the publishers, live libraries and institutions we partner with, when we can show that our listeners are serious readers, lifelong learners, and heavy library users. It opens doors to new partnerships, better resources, and ultimately, a stronger NBN for everyone. And one more thing. If you leave your email address at the end of the survey, you'll be entered to win a $100 gift card to bookshop.org, a chance to stock up on books while supporting independent bookstores at the same time. The survey takes just five minutes. Your answers are completely confidential and your email will never be shared. Head to new booksnetwork.com to take the survey today. We really appreciate your support. Now go take the survey.
Ingrid Pillar
So I guess there are lots of ideological questions to address when you actually do compile a bilingual writer's corpus like this. All these ideological around what is a native speaker? What is the first language? Are we all continuously learning? Is that a good way to think about things? So that's one of the things I really love about your book. I thought maybe we can now move on to the practical side of the corpus. So you've told us a lot about what the corpus is, how it was designed, but then what is it actually good for? What kind of research can we do with it? And what kind of practical applications in teaching does it have? Can you walk us through the research and the applied side of the corpus?
David Palfreyman
Right, yeah, sure. So we ended up with a corpus, which is a lot of texts. We also have. We know who the writer of that text is. I mean, it's anonymized, but we have an ID for who wrote this essay and that one. We have information about the level which it's been assessed at. We have information about where the student lives, if they're in an urban area or in the countryside, kind of thing, like a more rural background. So for research, there's two kind of areas, from my point of view, that you can find out about. So one is about students development in English, their development, about language development. So how you can compare students whose English essay was rated at intermediate level with ones who were rated at a lower level and see what differences there are between them. And that helps you to understand what happens as a student gets better in English, but also in Arabic in the first language, because some of the Arabic essays were rated low and others were rated much higher and were more articulate and better at writing.
Ingrid Pillar
Are there tendencies that you can see or that you have looked for in terms of high level proficiency in one language also leads to high level writing in the other language. Is that how it works or is it like more of a trade off?
David Palfreyman
Right, so. Yeah, because.
Ingrid Pillar
Oh, you didn't really look at that?
David Palfreyman
Yeah, no, no, I did. Well, so there's a concern in society like in the uae, that students Arabic is suffering because they're learning more English. So it's like a kind of competitive model of how language learning works. So one Thing which led into this corpus, I mean, I did it before was I had a smaller sample of students and I got writing in both languages and looked at the sophistication of their vocabulary in their writing. So did they use just common everyday words or did they use more sort of unusual formal, academic words, the kind of variety or development of their lexis? And I found that it was fairly correlated between the languages. So some students were more. Used a wider range of vocabulary in both languages. Some students used a more narrow everyday range of vocabulary in both languages. So it was not like a competitive. Like one is better, so the other is lower.
Ingrid Pillar
Yeah, because oftentimes writing skills are transferable across languages. Good writers are often good writers in all their languages, and poor writers can be poor writers in all their languages.
David Palfreyman
Yeah, especially in literacy and especially kind of global skills. I mean, I know this from research which has been done not with corpora, but with kind of sample, like smaller numbers of learners, sometimes one learner and you know, just like a case study comparing. So things like organizing an essay, for example, transfers from 1, whereas vocabulary maybe doesn't. Yeah. Or, you know, you might make a lot of grammatical mistakes in your second language, not in your first language. But the more kind of global literacy skills are more definitely more transferable from one to the other. Which is why in some ways, why the students Arabic. So I think a lot of the input that they got in English about how to write essays, for example, how to write an assignment, that kind of fed into their Arabic writing as well, although they hadn't had a lot of input about how to do that. Yeah, so that's one. So looking at development of language, that's one useful research purpose. Another is. So, okay, so for example, let me talk about a specific example. So I didn't want to just create a corpus and then not have it used. So that's why I kind of got people together to do different research studies of this corpus and put it into the book. So one of the studies in there, which I did with Dr. Omni Amin, was to look at students use a metaphor. And we compared the highly rated essays with the lower rated essays in terms of their use of metaphors. And we found that in English, the sort of more advanced students, ones who did better writing, they used more metaphorical use of words rather than literal use of words. So as the levels were higher, the. The percentage of their words which were used in a metaphorical way was more. So that's something. That's a sort of discourse level development as learners go on which you might not have thought of. Now what we also did, so that's looking at those people as learners of English. In Arabic, there wasn't so much of the same increase. Like there wasn't such a difference between the levels. As far as we could tell. That's the learner perspective. Then we took a user perspective on those. So we just treated them as people who were writing. We looked at what metaphors they used for talking about social media. So this is more like the kind of corpus assisted discourse analysis work which exists looking at how people use the language and what that shows you about their perceptions. And we found that there were some metaphors which students used in both Arabic and English about addiction, for example, people being addicted to social media. And a largely negative, because it was about the effects of social media on the individual and society. So they focused a lot on the negative effect. But then there were other metaphors which were use more in one language than the other. So in Arabic there was a bit more sort of struggle and invasion kind of metaphor that they used talking about it. And when they were writing in English, as well as the addiction metaphor, which they also used there, they would talk about kind of exchange and trade and kind of like getting ideas from open doors. And although actually the open door metaphor was often used as like a door to depravity and sort of.
Ingrid Pillar
So not necessarily parts of problems.
David Palfreyman
But yeah, and learning also learning from the Internet was something that came up that they talked about more in English when the same students wrote in Arabic. So that sort of shows the learner perspective and the user perspective and the kind of of findings that you get from looking at those now through working with Nizar and his team, who are basically like a computing team. So they were very interested in corpora, but not from the point of view of teachers or curriculum planners learning from them. Or also actually students can learn from corpora. I don't know if we'll have time to talk about this, but data driven learning is a kind of a way of getting students to look at corpora and look at examples from corpora to get an idea of how the language works. But. So Nizar and his team were very interested in kind of basically what we now call AI and what that can do with a corpus. And. And in fact, in contrast to. So in linguistics, it's nice to make a corpus, but it's like it doesn't really count unless I feel like it doesn't really count in a scholarly way unless you've discovered something from that corpus, like Just creating a corpus in itself. I mean, it is.
Ingrid Pillar
Yeah, that's right. We don't value infrastructure. It's like it's less prestigious to create research infrastructure than it is to, I don't know, come up with a brilliant concept. Or the brilliance is usually attached to a fancy new term or something like that. Yeah, for sure.
David Palfreyman
Whereas I found that in the computing world there are whole conferences which are just about presenting resources for research. So the idea of this corpus, and I would say all corporate, but the idea of this corpus was to not just create a corpus, but to make it available to people and to offer it for use by researchers. So it's downloadable, it's available online@zabuk.org yeah.
Ingrid Pillar
And we'll put that in the show note for our listener.
David Palfreyman
And it has been downloaded a lot, especially by people on the computing side. So people who are. So. No, no. Why do they want this call? So basically, it's providing samples of language to train models on, for example, grammar correction or spelling correction. So that's when we started. That was the kind of application that. That people were interested in. And every corpus, because our corpus is also representative of a genre, it's representative of academic kind of undergraduate writing. And so quite a lot of people download it because maybe they're preparing a computer model which is going to understand academic writing in Arabic, for example. And so it's good for training them on, but they need to know also about what kind of spelling mistakes, what kind of grammar mistakes users will make when they're like, if you have a spelling corrector, what kind of spelling errors is it going to come across? Like, what kind of errors do people make? Now, since 2019, when we've collected the. The main part of the sample which is discussed in the book, AI has kind of taken off. And so there's even more interest. So a lot of people are downloading the corpus because they're making some language model or they're so not researchers in the educational side, which is what I was thinking of to start with. But.
Ingrid Pillar
So there are chatbots out there trained on your corporate.
David Palfreyman
Among many other things. Yeah, that's right. It's part of the input. Yeah, yeah.
Ingrid Pillar
Amazing.
David Palfreyman
Which is. I mean, another thing which has come out of the AI movement or the age is that. Is that a lot of what you find online now is written by AI.
Ingrid Pillar
Yeah.
David Palfreyman
So this is starting to be a concern for people who assemble corpora. Now for some people, it doesn't matter if that's the language that's there for people to see then it doesn't matter who it's written by. But from an educational point of view and various other points of view, it's nice to know that you know where this text came from.
Ingrid Pillar
Yeah, for sure. Otherwise, I mean, I think there's a lot of talk about that.
David Palfreyman
Yeah, exactly, yeah. Like when we say authentic, authentic language, you know, what we mean traditionally is written by people. Now if something's written by ChatGPT, does that make it not authentic? You know, it's an question.
Ingrid Pillar
I don't think it's an open question. It's synthetic, obviously, it's machine created text. I mean, unless you change the definition of authentic to include machine generated. I think it's an amazing corpus, an amazing achievement. And there are so many perspectives in there, I think, that make us question fundamental beliefs about language proficiency, about literacy, but also all these practical applications that you've just outlined for us. So let's maybe close this off by. Can you maybe outline, on the basis of the research that you've done, extend this a bit beyond, beyond the context of the UAE and tell us what you see as the future global research agenda in multilingual literacies?
David Palfreyman
Well, I think one point is making connections between different contexts which have similarities. So when I was, I deliberately didn't put the word Arabic in the title of the book, for example, because when I was putting the book together, I found that people seemed to focus on the sort of the fact that it was about Arabic rather than the fact that it was about first language and second language proficiency and literacy. And I think there's a lot of the issues involved are similar, very relatable to other contexts. So post colonial context, Singapore, Hong Kong, India, There are many parts of the world where different languages are used. Algeria, for example, where everybody seems to use French for many purposes.
Ingrid Pillar
Yeah, as I mentioned, look, Germany, I mean this Hamua corpus includes German, includes English, includes French as another taught language, Spanish as a taught language, and then heritage languages, Turkish and Russian.
David Palfreyman
And there's often a very different perspective on Turkish from what there is on English, on Spanish.
Ingrid Pillar
Yeah, for sure.
David Palfreyman
So there are all these, I mean, every context is different, but a lot of these issues are very, you know, could be explored certainly. I mean, it would be great if people could make biliterate corpora in other contexts and compare them. I mean, I know there are corpora of speaking, bilingual speaking, which brings in a whole other issue of mixing the languages. Like in our context the languages are separate. Like you're in an Arabic course writing an Arabic essay. Or you're in an English course writing an English essay. But in spoken language, there's more sort of permeability between them. And there's a corpus from, I guess it's the University of Louvain in Belgium, a Belgian corpus, which looks at children, I believe, French and Dutch, and then also kind of school pupils in Holland, and comparing it. So they have kind of reference corpus. So I don't know if I mentioned the term reference corpus. So, you know, you often want to compare your corpus with something else, like comparing learner corpus with a native writer corpus, for example. So. Yeah, so there's a. There's a corpus from there. There's a German one. I don't know if it was the. It wasn't really a corpus, but it was like a data set. It was a fairly small data set of. Yes. Writing in both languages. So. So there are corpora from these. But I think I'm really happy that I got this book together, published the book, because it shows something about what, you know, how does this. Having a corpus like this help you? Yeah, and there's a whole. We try to get a whole range of studies from semantic tagging of words, looking at, you know, what kind of ideas do people talk about, to collocation of words like which words co. Occur together. So all of these issues.
Ingrid Pillar
Yeah, and that's really one of the aspects I enjoyed so much about the book, that it has all these diverse studies and different people looking at this fantastic resource. So I'm sure there is. There's more to come before we go. What's next for you in terms of your research, David?
David Palfreyman
Well, okay, so we will soon be releasing a second stage to Zbunk, which has two components. One is writing again, essays in English and in Arabic from the same cohort of students. So they entered the university in 2019, and then in 2021, we got some more written samples.
Ingrid Pillar
Also, there's a longitudinal perspective.
David Palfreyman
Well, there is, yes, but because it's a cohort, it's not necessarily the same students. So a few students we managed to get. There's a very small number of students that we have four essays. Like who. English and Arabic from 2019 and English and Arabic from 2021.
Ingrid Pillar
That's amazing.
David Palfreyman
But, you know, some students we have maybe three essays from, like, only one from one year. It's patchy because of the logistical problems. But the students who are in the 2021 sample, they have not just entered the university. So the ones in the initials that say book corpus, which is in the Book. They were basically just finished high school and were entering university. The new written corpus is adding to that students who are in their second year or going into their third year. And some of them are. You can kind of compare, get a longitudinal dimension. Yeah. So there's that. And then also in the new part of the corpus, we have spoken, spoken corpus. So we got together some students to have conversations about academic. Related to their academic learning, but trying to relate it to the real world or to work. So this is the idea of the transition from. From kind of study to work. So we're adding a corpus of students. So doing two stages of a task. First of all, discussing with another student when they mix a lot of Arabic and English and then speaking kind of presenting their ideas to somebody who is supposed to be only Arabic speaking or only English speaking. So that's been very interesting to look at that. And I've done some analysis of those tasks. For example, looking at the dialogic quality of the talk. And one of the interesting things is it seems like the parts where they are most dialogic, so they're most listening to each other. Students are most listening to each other and generating ideas is when they are mixing language languages rather than the parts when they're.
Ingrid Pillar
Oh, that's so interesting.
David Palfreyman
In English or only in Arabic.
Ingrid Pillar
Yeah. Wow. So that would be support for the whole translanguaging pedagogy and idea.
David Palfreyman
Yeah, it was really interesting to listen to that, look at those recordings in detail and see that in action. Because it's a kind of transition from studying in English at university to maybe having to explain those things to an Arab client. So it's a transition from university to work and also a transition from one language to another or from translanguaging to expressing it to somebody who can only understand one of those languages.
Ingrid Pillar
Yeah. So lots of fascinating research to come. This was really interesting. To learn more about your fascinating book Bilingual Writers and Corpus Analysis, I strongly recommend it to everyone. Thanks so much, David for taking the time. Thanks for listening everyone. If you enjoyed the show, please subscribe to our channel, leave a five star review on your podcast app of choice and recommend the Language on the Move podcast and I'll partner the New Books Network to your students, colleagues and friends. Till next time, switch it on.
David Palfreyman
Foreign. Some follow the noise. Bloomberg follows the money. Whether it's the funds fueling AI or crypto's trillion dollar swings, there's a money side to every story. Get the money side of the story. Subscribe now@bloomberg.com.
Podcast: New Books Network – Language on the Move
Date: May 5, 2026
Host: Ingrid Pillar
Guest: David Palfreyman
Episode Focus: Bilingual Writers and Corpus Analysis (Routledge, 2023) – the creation, purpose, and implications of the Zayed Arabic-English Bilingual Undergraduate Corpus (ZAE-B or “Zabuk”)
In this episode, Professor Ingrid Pillar interviews Professor David Palfreyman about his recent book Bilingual Writers and Corpus Analysis, co-edited with Nizar Habash. The discussion explores the creation and significance of the ZAE-B corpus—a pioneering collection of bilingual (Arabic-English) student writing from the United Arab Emirates. Themes include the unique sociolinguistic context of the UAE, corpus linguistics methods, the challenges and benefits of analyzing bilingual student writing, and the future of multilingual literacy research.
Introduction of Guest & Book
Notable Quote:
Linguistic Landscape in the UAE:
What is a Corpus?
Authenticity & Size:
Large Language Models:
Genre-Based vs. Writer-Based Corpora:
Motivation:
Learner Corpora:
Notable Quote:
Sociolinguistic Complexity:
Code-Mixing:
Notable Quote:
Collection Methods (2019 Cohort):
Assessment:
Challenges & Insights:
Defining ‘Bilingual Corpus’:
The Learner vs. User Perspective:
Types of Research Enabled:
Key Findings:
Metaphor Study Example:
Technology & Wider Use:
Notable Dialogue:
Authenticity in the Age of AI:
Transferability of Research:
Reference Corpora & Comparative Work:
Longitudinal Expansion:
Notable Quote:
On Sociolinguistic Diversity:
“It’s very multicultural...both English and Arabic have prestige, but of different kinds.” (David Palfreyman, 06:45)
On Language Development & Biases:
“Everybody is a learner also of their first language...” (David Palfreyman, 47:00)
On Corpus Impact:
“I think it’s an amazing corpus, an amazing achievement...so many perspectives that make us question fundamental beliefs about language proficiency, about literacy, but also all these practical applications...” (Ingrid Pillar, 65:39)
On Global Relevance:
“When I was putting the book together, I found that people seemed to focus on…Arabic rather than…first language and second language proficiency and literacy. And I think…these issues are very relatable to other contexts.” (David Palfreyman, 66:38)
On Future Directions:
“We will soon be releasing a second stage to Zabuk…writing again, essays in English and in Arabic from the same cohort…[as well as] spoken corpus.” (David Palfreyman, 70:36)
| Topic | Timestamp | |----------------------------------------------|-------------| | UAE Sociolinguistic Landscape | 01:35–06:45 | | David’s Arabic-Learning Journey | 08:27–11:43 | | Corpus Concepts & Evolution | 11:43–18:05 | | Bilingual Writer’s Corpus Introduction | 18:05–26:59 | | Student Linguistic Profiles in UAE | 26:59–32:27 | | ZAE-B Corpus Methods & Challenges | 33:00–46:32 | | Learning vs. Using Languages | 46:32–49:24 | | Corpus Applications & Findings | 52:31–64:32 | | AI, Authenticity, and Corpus Usage | 64:32–66:38 | | Global Relevance & Future Research | 66:38–73:39 | | Longitudinal Data & Spoken Corpus | 70:36–73:39 |
This episode provides a comprehensive look at the conceptual, methodological, and practical dimensions of building and using a bilingual writer’s corpus in a complex multilingual context—not only as a research resource but also as an engine for rethinking language learning, literacy, and education policy across the world. The conversation underscores the need for more nuanced resources—both in written and spoken domains—reflecting the real-world multilingual experiences of students and professionals everywhere.
Resource:
Closing Thought:
“Every context is different, but a lot of these issues could be explored… if people could make biliterate corpora in other contexts and compare them…” (David Palfreyman, 68:08)