Update
This unofficial API no longer works as of February 2023. The library itself hasn’t been updated since September 2021 and the latest release is from October 2020. I’m not much of a Duolingo user these days so I won’t be attempting any other hacks to access the API, unless Duolingo releases an official API. This post is therefore obsolete.
Duolingo is a popular language learning application with millions of users and courses in over 20 languages and counting. It has made language learning more flexible and accessible. I have been using the app since 2014.
Duolingo has a built-in feature to display learned words, which I find very useful as I like to look at my vocabulary list from time to time. An issue I had with this feature is that the learned words table does not display translations. I used to keep a Google Spreadsheet to manage my vocabulary, but it was tedious to maintain. So I looked for other ways to keep a vocabulary list, ideally something that could be automated.
I came across the Unofficial Duolingo API, an open-source Python package that can be used to extract data from your Duolingo account. I created a Python script, duolingo_data.py, which uses this API to produce CSV and PDF files (via LaTeX) of the vocabulary with English translations, extracted from Duolingo’s dictionary and vocabulary overview.
Requirements
Running the script and building PDF files require installations of Python 3 and LaTeX.
Cloning the repository
First, clone the GitLab repository:
- Option 1 - via HTTPS:
  git clone https://gitlab.com/nithiya/duolingo.git
- Option 2 - via SSH:
  git clone git@gitlab.com:nithiya/duolingo.git
Python
After installing Python 3, create and activate a virtual environment and install all dependencies:
- Option 1 - using venv:
  on Linux:
    python3 -m venv env
    source env/bin/activate
    python -m pip install -r requirements.txt
  on Windows:
    py -m venv env
    .\env\Scripts\activate
    py -m pip install -r requirements.txt
- Option 2 - using Anaconda (I recommend the lightweight Miniconda):
    conda create --name duolingo python=3 pandas requests
    conda activate duolingo
    python -m pip install duolingo-api
To view the list of dependencies, see requirements.txt. See the pandas and Requests documentation for more information about these packages.
LaTeX
All required LaTeX packages are available on CTAN. A TeX distribution, such as TeX Live, is recommended to ensure all requirements are satisfied.
The PDF files are built using vocab.tex via XeLaTeX and glossaries:
xelatex vocab.tex
makeglossaries vocab
xelatex vocab.tex
Running the script
Define the languages you are learning in languages.conf. Then, run the Python script:
python duolingo_data.py
Enter your Duolingo username and password when prompted in the terminal and press Enter to continue.
After this script has finished running, CSV, JSON, TeX, and PDF files containing the vocabulary will be saved in the same directory, using the naming convention vocab_XX, where XX refers to the language code as defined by Duolingo. The codes in the table below are the ones I am aware of (mostly corresponding to ISO 639). When tested with the dictionary URL (e.g., https://d2.duolingo.com/api/1/dictionary/hints/en/de?token=hello for translating ‘hello’ from English to German), the languages below produced an output.
Caution
This script is not guaranteed to work for all languages, as it has only been tested on a small subset of languages. (Last checked: May 2022.)
Code | Language | Alternate code |
---|---|---|
ar | Arabic | |
cs | Czech | |
cy | Welsh | |
da | Danish | |
de | German | |
dn | Dutch | nl-NL |
el | Greek | |
eo | Esperanto | |
es | Spanish | |
fi | Finnish | |
fr | French | |
ga | Irish | |
gd | Scottish Gaelic | |
he | Hebrew | |
hi | Hindi | |
ht | Haitian Creole | |
hu | Hungarian | |
hv | High Valyrian | |
hw | Hawaiian | |
id | Indonesian | |
it | Italian | |
ja | Japanese | |
kl | Klingon | tlh |
ko | Korean | |
la | Latin | |
nb | Norwegian (Bokmål) | no-BO |
nv | Navajo | |
pl | Polish | |
pt | Portuguese | |
ro | Romanian | |
ru | Russian | |
sv | Swedish | |
sw | Swahili | |
tr | Turkish | |
uk | Ukrainian | |
vi | Vietnamese | |
yi | Yiddish | |
zs | Chinese | zh |
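The dictionary URL used for the check described before the table can be assembled programmatically. The snippet below is only a minimal sketch: the function name `dictionary_url` and its parameter names are made up for illustration, and the URL layout is copied from the single example given in this post rather than from any official documentation. Since the unofficial API no longer works, the snippet builds the URL without sending a request.

```python
from urllib.parse import quote, urlencode

# Base of the dictionary endpoint, taken from the example URL in this post.
BASE_URL = "https://d2.duolingo.com/api/1/dictionary/hints"


def dictionary_url(word, first_code="en", second_code="de"):
    """Build a dictionary URL shaped like the example in this post.

    The names first_code and second_code are hypothetical; they simply
    mirror the order of the two language codes in the example URL.
    """
    path = f"{BASE_URL}/{quote(first_code)}/{quote(second_code)}"
    # urlencode escapes spaces and non-ASCII characters in the word.
    return f"{path}?{urlencode({'token': word})}"


print(dictionary_url("hello"))
```

Swapping in another code from the table (for example `dictionary_url("hello", "en", "fr")`) gives the corresponding URL for that language.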
I created a for loop to extract the vocabulary information, which is in JSON format, from the Duolingo vocabulary overview. Since I want my vocabulary in table form, I converted vocab into a pandas dataframe and dropped any duplicate entries.
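As a rough illustration of this step, here is a minimal sketch using a small hand-written sample in place of the real vocabulary overview. The `word_string` field appears in the script; the other field names (`skill`, `id`) and all the values are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical sample shaped like a list of vocabulary records.
# Only 'word_string' is a field name taken from the script; the rest are assumed.
vocab = [
    {"word_string": "Hund", "skill": "Animals", "id": "a1"},
    {"word_string": "Katze", "skill": "Animals", "id": "a2"},
    {"word_string": "Hund", "skill": "Animals", "id": "a1"},  # exact duplicate
]

# Convert the records to a table and drop rows that are identical in every column.
vocab_df = pd.DataFrame(vocab).drop_duplicates().reset_index(drop=True)

print(vocab_df)
```

`drop_duplicates()` with no arguments only removes rows that match in every column; passing `subset="word_string"` would instead keep one row per word.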
To translate the words into English, we need to use Duolingo’s built-in dictionary. I ran into a problem where I couldn’t translate the whole vocabulary in one go, as the lookup failed when the list of words was too long. To avoid this, I wrote a function to split the list of words into manageable chunks.
def splitlist(mylist, chunk_size):
    """
    Split the list of words to be translated into chunks of at most
    chunk_size items (to prevent 'Exception: Could not get translations'
    caused by long lists)
    """
    return [
        mylist[offs:offs + chunk_size]
        for offs in range(0, len(mylist), chunk_size)
    ]


# vocab_df is the deduplicated vocabulary dataframe built earlier
word_list = splitlist(vocab_df['word_string'].tolist(), 500)
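The function above builds all the chunks at once and returns them as a list. If a lazy version is preferred, the same slicing can be written as a true generator. This is only a sketch of an alternative, not what the script does; the name `iter_chunks` and the sample words are made up for illustration.

```python
def iter_chunks(items, chunk_size):
    """Yield successive slices of items, each at most chunk_size long."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for offset in range(0, len(items), chunk_size):
        yield items[offset:offset + chunk_size]


# Small made-up word list; the script itself uses a chunk size of 500.
words = ["eins", "zwei", "drei", "vier", "fünf"]
chunks = list(iter_chunks(words, 2))

print(chunks)
```

The last chunk is simply shorter when the list length is not a multiple of the chunk size, which is also how the list-based version behaves.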
Since the translations are done in chunks, all translations have to be merged. I created an empty dataframe, translated each chunk in a for loop and concatenated the results.
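The merging step might look roughly like the sketch below. The real dictionary lookup is replaced by a stand-in function, `fake_translate`, which is entirely hypothetical, as are its return shape (a mapping from each word to a list of translations) and the `translation` column name. Instead of growing an empty dataframe inside the loop, this version collects one small dataframe per chunk and concatenates them once at the end, which gives the same result.

```python
import pandas as pd


def fake_translate(words):
    """Hypothetical stand-in for the dictionary lookup.

    Returns a dict mapping each word to a list of translations.
    """
    lookup = {"Hund": ["dog"], "Katze": ["cat"], "Vogel": ["bird"]}
    return {word: lookup.get(word, []) for word in words}


# Chunks as produced by a splitting step (made-up sample data).
chunks = [["Hund", "Katze"], ["Vogel"]]

frames = []
for chunk in chunks:
    result = fake_translate(chunk)
    # One row per word; multiple translations are joined into one string.
    frames.append(pd.DataFrame({
        "word_string": list(result.keys()),
        "translation": ["; ".join(values) for values in result.values()],
    }))

# Merge the per-chunk tables into a single table with a fresh index.
translations_df = pd.concat(frames, ignore_index=True)

print(translations_df)
```

The resulting table can then be joined to the vocabulary dataframe on `word_string`, for example with `DataFrame.merge`.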
All the data we need have now been extracted and merged into a single dataframe. However, there are a lot of unnecessary columns, such as the IDs of each word and the practice time in milliseconds. I dropped these columns, then sorted the rows by the associated skill and, within each skill, alphabetically by word.
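A minimal sketch of this clean-up, using a made-up dataframe: apart from `word_string`, the column names (`skill`, `translation`, `id`, `last_practiced_ms`) are assumptions chosen to match the description above, not necessarily the names Duolingo uses.

```python
import pandas as pd

# Made-up merged data; column names other than 'word_string' are assumed.
merged_df = pd.DataFrame({
    "word_string": ["Katze", "Apfel", "Hund"],
    "skill": ["Animals", "Food", "Animals"],
    "translation": ["cat", "apple", "dog"],
    "id": ["a2", "f1", "a1"],
    "last_practiced_ms": [1650000000000, 1650000100000, 1650000200000],
})

tidy_df = (
    merged_df
    .drop(columns=["id", "last_practiced_ms"])   # remove unneeded columns
    .sort_values(by=["skill", "word_string"])    # by skill, then by word
    .reset_index(drop=True)
)

print(tidy_df)
```

Note that `sort_values` compares strings by code point, so uppercase letters sort before lowercase ones and accented letters sort after unaccented ones; a `key` function can be passed if a different ordering is needed.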