Duolingo is a popular language-learning application with millions of users and over 20 languages, and counting. It has made language learning more flexible and accessible. I have been using the app since 2014.
Duolingo has a built-in feature to display learned words, which I find very useful as I like to look at my vocabulary list from time to time. One issue I had with this feature is that the learned words table does not display translations. I used to keep a Google Spreadsheet to manage my vocabulary, but it was tedious to maintain. I therefore looked for another way to keep a vocabulary list, ideally one that could be automated.
I came across the Unofficial Duolingo API, an open-source Python package that can be used to extract data from your Duolingo account. I created a Python script, duolingo_data.py, which uses this API to produce CSV and PDF files (via LaTeX) of the vocabulary with English translations, extracted from Duolingo’s dictionary and vocabulary overview.
Running the script and building PDF files require installations of Python 3 and LaTeX.
Cloning the repository
First, clone the GitLab repository:
Option 1 - via HTTPS:
git clone https://gitlab.com/nithiya/duolingo.git
Option 2 - via SSH:
git clone git@gitlab.com:nithiya/duolingo.git
After installing Python 3, create and activate a virtual environment and install all dependencies:
Option 1 - using venv:
On Linux or macOS:
python3 -m venv env
source env/bin/activate
python -m pip install -r requirements.txt
On Windows:
py -m venv env
.\env\Scripts\activate
py -m pip install -r requirements.txt
Option 2 - using Anaconda (I recommend the lightweight Miniconda):
conda create --name duolingo python=3 pandas requests
conda activate duolingo
python -m pip install duolingo-api
To view the list of dependencies, see requirements.txt. See the pandas and Requests documentation for more information about these packages.
All required LaTeX packages are available on CTAN. A TeX distribution, such as TeX Live, is recommended to ensure all requirements are satisfied.
The PDF files are built using vocab.tex via XeLaTeX and glossaries:
xelatex vocab.tex
makeglossaries vocab
xelatex vocab.tex
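The three build steps can also be driven from Python. The following is only a minimal sketch, not part of the original script: the function names build_commands and build_pdf are hypothetical, and it assumes xelatex and makeglossaries are on the PATH.

```python
import subprocess
from pathlib import Path


def build_commands(tex_file):
    """Return the three build commands, in the order they must run."""
    # makeglossaries takes the file name without the .tex extension
    stem = Path(tex_file).stem
    return [
        ["xelatex", tex_file],
        ["makeglossaries", stem],
        ["xelatex", tex_file],
    ]


def build_pdf(tex_file, runner=subprocess.run):
    """Run each build command in turn; check=True stops at the first failure."""
    for command in build_commands(tex_file):
        runner(command, check=True)
```

Calling build_pdf("vocab.tex") would run the same sequence as the commands above. The runner argument exists only so the function can be exercised without a TeX installation.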
Running the script
Define the languages you are learning in languages.conf. Then, run the Python script, duolingo_data.py.
Enter your Duolingo username and password when prompted in the terminal and press Enter to continue.
After the script has finished running, CSV, JSON, TeX, and PDF files containing the vocabulary will be saved in the same directory. In the file names, XX refers to the language code as defined by Duolingo. The codes in the table below are the ones I am aware of (mostly corresponding to ISO 639). When tested with the dictionary URL (e.g., https://d2.duolingo.com/api/1/dictionary/hints/en/de?token=hello, for translating ‘hello’ from English to German), the languages below produced an output.
This script is not guaranteed to work for all languages, as it has only been tested on a small subset of languages. (Last checked: May 2022.)
I created a for loop to extract vocabulary information, which is in JSON format, from the Duolingo vocabulary overview. Since I want my vocabulary in table form, I converted vocab into a pandas dataframe and dropped any duplicate entries.
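As a rough illustration of this step, the sketch below converts a small, made-up vocabulary overview into a dataframe. The sample data is hypothetical, and the key names (vocab_overview, word_string, skill, id) are assumptions about the shape of the JSON; real entries contain more fields.

```python
import pandas as pd

# Hypothetical sample shaped like the JSON vocabulary overview;
# the key names are assumptions and real entries have more fields.
vocab = {
    "language_string": "German",
    "vocab_overview": [
        {"word_string": "hallo", "skill": "Basics", "id": "a1"},
        {"word_string": "danke", "skill": "Phrases", "id": "b2"},
        {"word_string": "hallo", "skill": "Basics", "id": "a1"},  # duplicate
    ],
}

# One row per word entry, then remove rows that are exact duplicates
vocab_df = pd.DataFrame(vocab["vocab_overview"])
vocab_df = vocab_df.drop_duplicates().reset_index(drop=True)

print(vocab_df)
```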
To translate the words into English, we need to use Duolingo’s built-in dictionary. I ran into a problem where I couldn’t automate the translation using the dictionary, as the number of words exceeded what it could handle in a single request. To avoid this, I wrote a helper function to split the list of words into manageable chunks.
def splitlist(mylist, chunk_size):
    """
    Split the list of words to be translated into chunks of at most
    chunk_size items (to prevent 'Exception: Could not get translations'
    caused by long lists).
    """
    return [
        mylist[offs: offs + chunk_size]
        for offs in range(0, len(mylist), chunk_size)
    ]

word_list = splitlist(vocab_df['word_string'].tolist(), 500)
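The same chunking idea can be written as a lazy generator, which yields one chunk at a time instead of building the full list of chunks up front. This is only a sketch of an alternative; the name iter_chunks and the sample word list are hypothetical.

```python
def iter_chunks(items, chunk_size):
    """Yield successive chunks of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


# Hypothetical word list, just to show the resulting chunk sizes
words = [f"word{i}" for i in range(1200)]
sizes = [len(chunk) for chunk in iter_chunks(words, 500)]
print(sizes)  # [500, 500, 200]
```

For a few thousand words the difference is negligible, so returning a plain list of chunks is a perfectly reasonable choice here.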
Since the translations are done in chunks, all translations have to be merged. I created an empty dataframe, translated each chunk in a for loop and concatenated the results.
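The loop can be sketched roughly as follows. To keep the example runnable offline, fake_translate is a hypothetical stand-in for the dictionary lookup and is assumed to return a mapping from each word to a list of translations; the column name translation is also an assumption. Unlike the approach described above, this sketch collects the per-chunk dataframes in a list and concatenates them once at the end rather than growing an empty dataframe.

```python
import pandas as pd


def fake_translate(words):
    """Hypothetical stand-in for the dictionary lookup (word -> list of hints)."""
    return {word: [word.upper()] for word in words}


def translate_in_chunks(chunks, translate):
    """Translate each chunk separately and merge the results into one dataframe."""
    frames = []
    for chunk in chunks:
        hints = translate(chunk)
        frames.append(pd.DataFrame({
            "word_string": list(hints.keys()),
            # join multiple hints for a word into a single cell
            "translation": ["; ".join(values) for values in hints.values()],
        }))
    if not frames:
        return pd.DataFrame(columns=["word_string", "translation"])
    return pd.concat(frames, ignore_index=True)


chunks = [["hallo", "danke"], ["bitte"]]
translations_df = translate_in_chunks(chunks, fake_translate)
print(translations_df)
```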
All the data we need have been extracted and merged nicely into a dataframe. However, there are a lot of unnecessary columns, such as the IDs of each word and the practice time in milliseconds. I dropped these columns and sorted the rows by the associated skill, then alphabetically by word.
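A minimal sketch of this clean-up step is shown below. The sample rows are made up, and the column names (id, last_practiced_ms, skill, word_string, translation) are assumptions used for illustration only.

```python
import pandas as pd

# Hypothetical merged dataframe; column names are assumptions
vocab_df = pd.DataFrame({
    "word_string": ["wasser", "hallo", "brot", "danke"],
    "skill": ["Food", "Basics", "Food", "Basics"],
    "translation": ["water", "hello", "bread", "thanks"],
    "id": ["d4", "a1", "c3", "b2"],
    "last_practiced_ms": [1650000000000, 1650000001000,
                          1650000002000, 1650000003000],
})

# Drop the columns that are not needed in the final table,
# then sort by skill first and by word second
unwanted = ["id", "last_practiced_ms"]
tidy_df = (
    vocab_df.drop(columns=unwanted)
    .sort_values(by=["skill", "word_string"])
    .reset_index(drop=True)
)

print(tidy_df)
```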