Gender-Aware Spoken Language Translation: Arabic to English

Download the paper

During my work at the Microsoft Research lab in Cairo, we were brainstorming research projects related to our work on Skype Translator for the upcoming summer internship. I had the idea of building a machine translation system that keeps the gender information that is otherwise lost when translating from Arabic to English.


The idea is to leverage demographic information, such as the speaker's gender, which can be extracted from speech, to improve the accuracy of machine translation models.

Examples for English to Arabic and English to French:

"I am sure" (female speaker) => انا متأكدة
"I am sure" (female speaker) => je suis certaine

Without the proposed system, the example above would be translated according to whatever bias the corpus has. If the corpus is biased toward male translations, the system assigns a higher probability to the masculine adjective.


Source Wikipedia

English, like about half of the world's languages, does not have strong grammatical gender agreement requirements.

The other half, including Arabic and most European languages, have a gender system.

Importance of Gender Agreement in SLT

Fixing gender biases in the output of a spoken language translation system would also give a better user experience and reduce confusion.

System Diagram

Technical Details

During training, we add a token to indicate the genders of the speaker and the listener. For example, -sf- denotes a female speaker and -lf- denotes a female listener.
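To make the token scheme concrete, here is a minimal sketch of how a source sentence could be tagged. Only -sf- and -lf- come from the project; the male tokens -sm- and -lm-, the function name `add_gender_tokens`, and the choice to place tokens at the start of the sentence are assumptions made for illustration.

```python
# Only -sf- and -lf- appear in this post; -sm- and -lm- are assumed by analogy.
SPEAKER_TOKENS = {"female": "-sf-", "male": "-sm-"}
LISTENER_TOKENS = {"female": "-lf-", "male": "-lm-"}


def add_gender_tokens(source, speaker=None, listener=None):
    """Prefix a source sentence with gender tokens.

    A value of None means the gender is unknown, so no token is added.
    Placing the tokens at the start of the sentence is an assumption.
    """
    tokens = []
    if speaker is not None:
        tokens.append(SPEAKER_TOKENS[speaker])
    if listener is not None:
        tokens.append(LISTENER_TOKENS[listener])
    return " ".join(tokens + [source])


print(add_gender_tokens("I am sure", speaker="female"))
# -sf- I am sure
```

Leaving the sentence untouched when the gender is unknown lets the same model handle both tagged and untagged input.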

POS Tagger

We used a POS tagger to determine the gender of the target sentence and put the corresponding label on the source-language sentence for training. This step is used only to generate the parallel data for training the NMT system.

POS Tagger Results on Arabic for Gender Extraction

Metric    | Speaker is a male | Speaker is a female | Listener is a male | Listener is a female
Precision | 80.00%            | 100.00%             | 63.15%             | 93.33%
Recall    | 19.04%            | 25.00%              | 11.65%             | 51.85%
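As a reminder of how to read these numbers: precision is the share of sentences the tagger labeled with a gender that were labeled correctly, and recall is the share of sentences with that gender that the tagger managed to label. The sketch below uses made-up counts, not counts from our evaluation, and the function name is hypothetical.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute precision and recall from raw counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts chosen only for illustration.
p, r = precision_recall(true_positives=8, false_positives=2, false_negatives=32)
print(f"precision={p:.2%} recall={r:.2%}")
# precision=80.00% recall=20.00%
```

High precision with low recall means the labels the tagger does produce are mostly trustworthy, but most gendered sentences go unlabeled.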


Data Labeling on Parallel Text Data

We ran the POS tagger on 1.5 million sentences of subtitle data. About 150k sentences (roughly 10%) were tagged with a gender. If the tagger could not determine the gender, we did not add a token.
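The labeling pass can be sketched roughly as follows. The `detect_gender` callable is a hypothetical stand-in for the POS tagger: it takes a target sentence and returns a token such as -sf-, or None when it cannot decide. The lookup table used here is a toy, and the French pair is borrowed from the example earlier in this post only to keep the snippet short.

```python
def label_parallel_data(pairs, detect_gender):
    """Return labeled (source, target) pairs and the number of tagged pairs.

    detect_gender is a hypothetical stand-in for the POS tagger. It inspects
    the target sentence and returns a gender token, or None if undecided.
    """
    labeled = []
    tagged_count = 0
    for source, target in pairs:
        token = detect_gender(target)
        if token is not None:
            # The gender is read from the target, but the token goes on the source.
            source = f"{token} {source}"
            tagged_count += 1
        labeled.append((source, target))
    return labeled, tagged_count


# Toy stand-in for the tagger: a lookup table instead of real POS analysis.
KNOWN_GENDERS = {"je suis certaine": "-sf-"}

pairs = [("I am sure", "je suis certaine"), ("Hello", "bonjour")]
labeled, tagged = label_parallel_data(pairs, KNOWN_GENDERS.get)
print(labeled)
print(f"{tagged} of {len(pairs)} pairs tagged")
```

Pairs the tagger cannot decide on stay in the data without a token, which matches the rule described above.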

Test Set

We created 3 test sets: a gender-only set, a random 2k set, and the labeled data within the random 2k set.

Model Architecture

We used an attentional encoder-decoder model with a bidirectional LSTM.

BLEU Scores

Model               | Gender only set | Random 2k | Labeled data within random 2k
Before Adaptation   | 16.48           | 18.36     | 18.8
1 Epoch, α=0.001    | 17.95           | 18.26     | 20.47
2 Epochs, α=0.001   | 16.07           | 17.98     | 20.38
5 Epochs, α=0.001   | 15.12           | 17.97     | 21.16
10 Epochs, α=0.0001 | 21.8            | 14.07     | 19.27
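To make the trade-offs easier to see, the scores can be expressed as differences from the model before adaptation. This is a small sketch over the numbers in the table above; the dictionary keys and the helper name `bleu_deltas` are my own labels for illustration.

```python
# BLEU scores copied from the table above.
BASELINE = {"gender_only": 16.48, "random_2k": 18.36, "labeled_in_random_2k": 18.8}

RUNS = {
    "1 epoch, alpha=0.001": {"gender_only": 17.95, "random_2k": 18.26, "labeled_in_random_2k": 20.47},
    "2 epochs, alpha=0.001": {"gender_only": 16.07, "random_2k": 17.98, "labeled_in_random_2k": 20.38},
    "5 epochs, alpha=0.001": {"gender_only": 15.12, "random_2k": 17.97, "labeled_in_random_2k": 21.16},
    "10 epochs, alpha=0.0001": {"gender_only": 21.8, "random_2k": 14.07, "labeled_in_random_2k": 19.27},
}


def bleu_deltas(scores, baseline=BASELINE):
    """Return the BLEU change of one run relative to the baseline, per test set."""
    return {name: round(scores[name] - baseline[name], 2) for name in baseline}


for run_name, scores in RUNS.items():
    print(run_name, bleu_deltas(scores))
```

Read this way, the 10-epoch run gains the most on the gender-only set (+5.32 BLEU) but loses 4.29 BLEU on the random 2k set, while the 1-epoch run improves the gender-only and labeled sets with almost no loss on random 2k (-0.10).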