Extracting Duolingo vocabulary using Python

Update
This unofficial API no longer works as of February 2023. The library itself hasn’t been updated since September 2021 and the latest release is from October 2020. I’m not much of a Duolingo user these days so I won’t be attempting any other hacks to access the API, unless Duolingo releases an official API. This post is therefore obsolete.

Duolingo is a popular language-learning application with millions of users and a growing catalogue of more than 20 languages. It has made language learning more flexible and accessible. I have been using the app since 2014.

Duolingo has a built-in feature to display learned words, which I find very useful as I like to look at my vocabulary list from time to time. An issue I had with this feature is that the learned words table does not show translations. I used to keep a Google Spreadsheet to manage my vocabulary, but it was tedious to maintain. I was therefore looking for another way to keep a vocabulary list, ideally one that could be automated.

I came across the Unofficial Duolingo API, an open-source Python package that can be used to extract data from your Duolingo account. I created a Python script, duolingo_data.py, which uses this API to produce CSV and PDF files (via LaTeX) of the vocabulary with English translations, extracted from Duolingo’s dictionary and vocabulary overview.

Requirements

Running the script and building the PDF files requires Python 3 and a LaTeX installation.

Cloning the repository

First, clone the GitLab repository:

Option 1 - via HTTPS:

git clone https://gitlab.com/nithiya/duolingo.git

Option 2 - via SSH:

git clone git@gitlab.com:nithiya/duolingo.git

Python

After installing Python 3, create and activate a virtual environment and install all dependencies:

Option 1 - using venv:

on Linux:

python3 -m venv env
source env/bin/activate
python -m pip install -r requirements.txt

on Windows:

py -m venv env
.\env\Scripts\activate
py -m pip install -r requirements.txt

Option 2 - using Anaconda (I recommend the lightweight Miniconda):

conda create --name duolingo python=3 pandas requests
conda activate duolingo
python -m pip install duolingo-api

To view the list of dependencies, see requirements.txt. See the pandas and Requests documentation for more information about these packages.

LaTeX

All required LaTeX packages are available on CTAN. A TeX distribution, such as TeX Live, is recommended to ensure all requirements are satisfied.

The PDF files are built using vocab.tex via XeLaTeX and glossaries:

xelatex vocab.tex
makeglossaries vocab
xelatex vocab.tex

Running the script

Define the languages you are learning in languages.conf. Then, run the Python script:

python duolingo_data.py

When prompted in the terminal, enter your Duolingo username and password, then press Enter to continue.

After this script has finished running, CSV, JSON, TeX, and PDF files containing the vocabulary will be saved in the same directory, using the naming convention vocab_XX. XX refers to the language code as defined by Duolingo. The codes in the table below are the ones I am aware of (mostly corresponding to ISO 639). When tested with the dictionary URL (e.g., https://d2.duolingo.com/api/1/dictionary/hints/en/de?token=hello – for translating ‘hello’ from English to German), the languages below produced an output.

Caution
This script is not guaranteed to work for all languages, as it has only been tested on a small subset of languages. (Last checked: May 2022.)

Code	Language	Alternate code
ar	Arabic
cs	Czech
cy	Welsh
da	Danish
de	German
dn	Dutch	nl-NL
el	Greek
eo	Esperanto
es	Spanish
fi	Finnish
fr	French
ga	Irish
gd	Scottish Gaelic
he	Hebrew
hi	Hindi
ht	Haitian Creole
hu	Hungarian
hv	High Valyrian
hw	Hawaiian
id	Indonesian
it	Italian
ja	Japanese
kl	Klingon	tlh
ko	Korean
la	Latin
nb	Norwegian (Bokmål)	no-BO
nv	Navajo
pl	Polish
pt	Portuguese
ro	Romanian
ru	Russian
sv	Swedish
sw	Swahili
tr	Turkish
uk	Ukrainian
vi	Vietnamese
yi	Yiddish
zs	Chinese	zh
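The dictionary URL check described above can be sketched as follows. This is only an illustration: the helper name dictionary_url and its parameter names are my own, the URL pattern is copied from the example in this post, and the endpoint no longer responds, so the sketch only builds the URLs and does not send any requests.

```python
from urllib.parse import urlencode

# URL pattern taken from the example in this post (assumption: the two path
# segments are the language being translated from, then the one translated to)
DICTIONARY_URL = "https://d2.duolingo.com/api/1/dictionary/hints/{source}/{target}"


def dictionary_url(word, source, target):
    """Build the dictionary lookup URL for one word and a language pair."""
    base = DICTIONARY_URL.format(source=source, target=target)
    return base + "?" + urlencode({"token": word})


# Build a test URL for a few of the codes in the table
for code in ["de", "dn", "zs"]:
    print(dictionary_url("hello", "en", code))
```

Using urlencode rather than plain string concatenation keeps the query string valid when a word contains spaces or non-ASCII characters.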

I created a for loop to extract the vocabulary information, which is in JSON format, from the Duolingo vocabulary overview. Since I wanted my vocabulary in table form, I converted vocab into a pandas dataframe and dropped any duplicate entries.
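As a rough sketch of that conversion, the example below uses a small hand-written sample instead of a real API response. The vocab_overview key and every field other than word_string are assumptions about the JSON layout, not something confirmed here.

```python
import pandas as pd

# Hypothetical sample shaped like the vocabulary overview JSON; real
# responses carry more fields, and the key names here are assumptions
vocab = {
    "language_string": "German",
    "vocab_overview": [
        {"id": "a1", "word_string": "hallo", "skill": "Basics 1"},
        {"id": "b2", "word_string": "danke", "skill": "Phrases"},
        {"id": "a1", "word_string": "hallo", "skill": "Basics 1"},
    ],
}

# One row per word entry, then remove rows that are exact duplicates
vocab_df = pd.DataFrame(vocab["vocab_overview"])
vocab_df = vocab_df.drop_duplicates().reset_index(drop=True)
print(vocab_df)
```

Note that drop_duplicates only removes rows where every column matches. If the real data contains list-valued columns, they would need to be dropped or converted first, because lists are not hashable.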

To translate the words into English, we need to use Duolingo’s built-in dictionary. I ran into a problem where I couldn’t translate the whole vocabulary in one go: when the list of words was too long, the lookup failed with ‘Could not get translations’. To avoid this, I wrote a small helper function to split the list of words into manageable chunks.

    def splitlist(mylist, chunk_size):
        """
        Split the list of words to be translated into chunks of at most
        chunk_size items (to prevent 'Exception: Could not get translations'
        caused by long lists). Returns a list of lists.
        """
        return [
            mylist[offs:offs + chunk_size]
            for offs in range(0, len(mylist), chunk_size)
        ]

    word_list = splitlist(vocab_df['word_string'].tolist(), 500)

Since the translations are done in chunks, the results have to be merged afterwards. I created an empty dataframe, translated each chunk in a for loop and concatenated the results.
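The merge step might look something like the sketch below. It is not the code from my script: fake_translate is a hypothetical stand-in for the dictionary lookup (assumed to return a mapping of each word to a list of translations), and the sketch collects the per-chunk dataframes in a list and concatenates them once, rather than growing an empty dataframe.

```python
import pandas as pd


def fake_translate(words):
    """Hypothetical stand-in for the dictionary lookup: word -> translations."""
    lookup = {"hallo": ["hello"], "danke": ["thanks", "thank you"], "ja": ["yes"]}
    return {word: lookup.get(word, []) for word in words}


def translate_in_chunks(chunks, translate):
    """Translate each chunk and concatenate the results into one dataframe."""
    frames = []
    for chunk in chunks:
        result = translate(chunk)
        frames.append(
            pd.DataFrame(
                {
                    "word_string": list(result.keys()),
                    # join alternative translations into a single cell
                    "translation": ["; ".join(v) for v in result.values()],
                }
            )
        )
    if not frames:
        return pd.DataFrame(columns=["word_string", "translation"])
    return pd.concat(frames, ignore_index=True)


# word_list would normally come from splitlist(); two tiny chunks here
word_list = [["hallo", "danke"], ["ja"]]
translations = translate_in_chunks(word_list, fake_translate)
print(translations)
```

Concatenating once at the end avoids copying the accumulated dataframe on every iteration, which matters little for a few chunks but keeps the loop simple.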

All the data we need has now been extracted and merged into a single dataframe. However, it contains many unnecessary columns, such as the ID of each word and the practice time in milliseconds. I dropped these columns and sorted the rows by the associated skill, then alphabetically by word.
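A minimal sketch of this clean-up step is shown below, using made-up sample data. The column names id, skill, last_practiced_ms and translation are assumptions for illustration; only word_string appears in the code above.

```python
import pandas as pd

# Made-up sample data; column names other than word_string are assumptions
vocab_df = pd.DataFrame(
    {
        "id": ["c3", "a1", "b2"],
        "word_string": ["danke", "wasser", "brot"],
        "skill": ["Phrases", "Food", "Food"],
        "last_practiced_ms": [1650000000000, 1650000001000, 1650000002000],
        "translation": ["thanks", "water", "bread"],
    }
)

# Columns that are not needed in the final vocabulary list
unwanted = ["id", "last_practiced_ms"]

tidy_df = (
    vocab_df.drop(columns=unwanted, errors="ignore")  # skip missing columns
    .sort_values(by=["skill", "word_string"])  # by skill, then by word
    .reset_index(drop=True)
)
print(tidy_df)
```

Passing errors="ignore" means the step does not fail if one of the listed columns is absent for a particular language.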