Update
This unofficial API no longer works as of February 2023. The library itself hasn’t been updated since September 2021 and the latest release is from October 2020. I’m not much of a Duolingo user these days so I won’t be attempting any other hacks to access the API, unless Duolingo releases an official API. This post is therefore obsolete.

Duolingo is a popular language learning application with millions of users and a growing catalogue of over 20 languages. It has made language learning more flexible and accessible. I have been using the app since 2014.

Duolingo has a built-in feature to display learned words, which I find very useful as I like to look at my vocabulary list from time to time. One issue I had with this feature is that the learned words table does not show translations. I used to keep a Google Spreadsheet to manage my vocabulary, but it was tedious to maintain. So I looked for another way to keep a vocabulary list, ideally one that could be automated.

I came across the Unofficial Duolingo API, an open-source Python package that can be used to extract data from your Duolingo account. I created a Python script, duolingo_data.py, which uses this API to produce CSV and PDF files (via LaTeX) of the vocabulary with English translations, extracted from Duolingo’s dictionary and vocabulary overview.

Requirements

Running the script and building the PDF files requires Python 3 and LaTeX to be installed.

Cloning the repository

First, clone the GitLab repository:

  • Option 1 - via HTTPS:

    git clone https://gitlab.com/nithiya/duolingo.git
    
  • Option 2 - via SSH:

    git clone git@gitlab.com:nithiya/duolingo.git
    

Python

After installing Python 3, create and activate a virtual environment and install all dependencies:

  • Option 1 - using venv:

    on Linux:

    python3 -m venv env
    source env/bin/activate
    python -m pip install -r requirements.txt
    

    on Windows:

    py -m venv env
    .\env\Scripts\activate
    py -m pip install -r requirements.txt
    
  • Option 2 - using Anaconda (I recommend the lightweight Miniconda):

    conda create --name duolingo python=3 pandas requests
    conda activate duolingo
    python -m pip install duolingo-api
    

To view the list of dependencies, see requirements.txt. See the pandas and Requests documentation for more information about these packages.

LaTeX

All required LaTeX packages are available on CTAN. A TeX distribution, such as TeX Live, is recommended to ensure all requirements are satisfied.

The PDF files are built using vocab.tex via XeLaTeX and glossaries:

xelatex vocab.tex
makeglossaries vocab
xelatex vocab.tex
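
If you would rather run the three build steps from Python, here is a minimal sketch. The helper names build_commands and build_pdf are hypothetical and not part of duolingo_data.py. It assumes xelatex and makeglossaries are on your PATH.

```python
import subprocess


def build_commands(stem="vocab"):
    """Return the three build steps listed above, in order."""
    return [
        ["xelatex", f"{stem}.tex"],
        ["makeglossaries", stem],
        ["xelatex", f"{stem}.tex"],
    ]


def build_pdf(stem="vocab"):
    """Run each step in turn; check=True stops at the first failure."""
    for command in build_commands(stem):
        subprocess.run(command, check=True)
```

Calling build_pdf() in the directory containing vocab.tex should behave like typing the three commands by hand.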

Running the script

Define the languages you are learning in languages.conf. Then, run the Python script:

python duolingo_data.py

Enter your Duolingo username and password when prompted in the terminal and press Enter to continue.

After this script has finished running, CSV, JSON, TeX, and PDF files containing the vocabulary will be saved in the same directory, using the naming convention vocab_XX. XX refers to the language code as defined by Duolingo. The codes in the table below are the ones I am aware of (mostly corresponding to ISO 639). When tested with the dictionary URL (e.g., https://d2.duolingo.com/api/1/dictionary/hints/en/de?token=hello – for translating ‘hello’ from English to German), the languages below produced an output.

Caution
This script is not guaranteed to work for all languages, as it has only been tested on a small subset. (Last checked: May 2022.)

Code  Language             Alternate code
ar    Arabic
cs    Czech
cy    Welsh
da    Danish
de    German
dn    Dutch                nl-NL
el    Greek
eo    Esperanto
es    Spanish
fi    Finnish
fr    French
ga    Irish
gd    Scottish Gaelic
he    Hebrew
hi    Hindi
ht    Haitian Creole
hu    Hungarian
hv    High Valyrian
hw    Hawaiian
id    Indonesian
it    Italian
ja    Japanese
kl    Klingon              tlh
ko    Korean
la    Latin
nb    Norwegian (Bokmål)   no-BO
nv    Navajo
pl    Polish
pt    Portuguese
ro    Romanian
ru    Russian
sv    Swedish
sw    Swahili
tr    Turkish
uk    Ukrainian
vi    Vietnamese
yi    Yiddish
zs    Chinese              zh
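
To try a code from the table against the dictionary URL, a small helper like the one below can build the address. This is only a sketch that follows the pattern of the example URL above. The function name dictionary_url is my own, and since the endpoint is unofficial, the URL may no longer return anything.

```python
from urllib.parse import quote, urlencode

# Base address copied from the example URL in this post (unofficial endpoint).
BASE_URL = "https://d2.duolingo.com/api/1/dictionary/hints"


def dictionary_url(word, from_code, to_code):
    """Build a dictionary URL in the same shape as the example (hypothetical helper)."""
    path = f"{BASE_URL}/{quote(from_code)}/{quote(to_code)}"
    # urlencode escapes spaces and non-ASCII characters in the word
    return f"{path}?{urlencode({'token': word})}"


print(dictionary_url("hello", "en", "de"))
```

Swapping "de" for another code from the table gives the URL to test for that language.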

I created a for loop to extract the vocabulary information, which is in JSON format, from the Duolingo vocabulary overview. Since I wanted my vocabulary in table form, I converted it into a pandas dataframe and dropped any duplicate entries.
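
The step looks roughly like the sketch below. The sample data is made up, and the key name vocab_overview and the id field are assumptions about the shape of the JSON. Only word_string and the skill come from the script described in this post.

```python
import pandas as pd

# Made-up sample standing in for the JSON returned by the vocabulary overview.
# The 'vocab_overview' key and the 'id' field are assumptions for illustration.
overview = {
    "vocab_overview": [
        {"word_string": "Hund", "skill": "Animals", "id": "a1"},
        {"word_string": "Katze", "skill": "Animals", "id": "b2"},
        {"word_string": "Hund", "skill": "Animals", "id": "a1"},  # duplicate
    ]
}

# Collect one record per word
rows = []
for entry in overview["vocab_overview"]:
    rows.append(entry)

# Build the table and drop rows that are exact duplicates
vocab_df = pd.DataFrame(rows).drop_duplicates().reset_index(drop=True)
print(vocab_df)
```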

To translate the words into English, we need to use Duolingo’s built-in dictionary. I ran into a problem where I couldn’t translate everything in one go, as the dictionary request failed when the list of words was too long. To avoid this, I wrote a helper function that splits the list of words into manageable chunks.

    def splitlist(mylist, chunk_size):
        """
        Split the list of words to be translated into chunks of at most
        chunk_size items (to prevent 'Exception: Could not get translations'
        caused by long lists). Returns a list of lists.
        """
        return [
            mylist[offs:offs + chunk_size]
            for offs in range(0, len(mylist), chunk_size)
        ]

    word_list = splitlist(vocab_df['word_string'].tolist(), 500)

Since the translations are done in chunks, all translations have to be merged. I created an empty dataframe, translated each chunk in a for loop and concatenated the results.
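
Here is a minimal sketch of that merge. The function fake_translate is a stand-in I made up for the dictionary lookup, so the example runs without a Duolingo account. I assume the lookup returns a list of translations per word. Instead of growing an empty dataframe, this version collects the pieces in a list and concatenates once at the end, which gives the same result.

```python
import pandas as pd


def splitlist(mylist, chunk_size):
    """Split a list into chunks of at most chunk_size items."""
    return [
        mylist[offs:offs + chunk_size]
        for offs in range(0, len(mylist), chunk_size)
    ]


def fake_translate(words):
    """Stand-in for the dictionary lookup (hypothetical, offline)."""
    lookup = {"Hund": ["dog"], "Katze": ["cat"], "Haus": ["house", "home"]}
    return {word: lookup.get(word, []) for word in words}


words = ["Hund", "Katze", "Haus"]

frames = []
for chunk in splitlist(words, 2):
    result = fake_translate(chunk)
    # One small dataframe per chunk, with translations joined into one string
    frames.append(pd.DataFrame({
        "word_string": list(result.keys()),
        "translation": ["; ".join(values) for values in result.values()],
    }))

# Merge all chunks into a single table
translations_df = pd.concat(frames, ignore_index=True)
print(translations_df)
```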

All the data we need have been extracted and merged nicely into a dataframe. However, there are a lot of unnecessary columns, such as the ID of each word and the practice time in milliseconds. I dropped these columns, then sorted the rows by the associated skill and, within each skill, alphabetically by word.
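
As a rough illustration of this clean-up, here is a sketch on made-up data. The column names id and last_practiced_ms are assumptions standing in for the ID and practice-time columns, and the real dataframe has more columns than this.

```python
import pandas as pd

# Made-up merged table; 'id' and 'last_practiced_ms' are assumed column names.
vocab_df = pd.DataFrame({
    "word_string": ["Katze", "Apfel", "Hund"],
    "skill": ["Animals", "Food", "Animals"],
    "translation": ["cat", "apple", "dog"],
    "id": ["b2", "c3", "a1"],
    "last_practiced_ms": [1650000000000, 1650000100000, 1650000200000],
})

vocab_df = (
    vocab_df
    .drop(columns=["id", "last_practiced_ms"])  # remove columns not needed in the list
    .sort_values(["skill", "word_string"])      # by skill first, then by word
    .reset_index(drop=True)
)
print(vocab_df)
```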
