Topic modeling in a Jupyter notebook

Before you run any notebook, you will need to install kiara on your computer.

This process is quite manageable, and I will walk you through it step by step.

You will need:

  • a tool called conda or miniconda for managing Python and dependencies

  • a special environment where kiara will live

  • the kiara software itself, plus a few plugins

Install conda (or miniconda)

Conda is a popular tool that helps you manage different versions of Python and software libraries. If you already have Conda (or Miniconda) installed, you can skip this step.

If not:

  • Go to the Miniconda download page.

  • Download the version that matches your computer (Windows, macOS, or Linux).

  • Follow the instructions to install it.

Double-check that you have chosen the correct version for your operating system and architecture (most modern computers use the 64-bit version).

Create a kiara environment

Once Conda is installed, open your Terminal (on macOS or Linux) or the Command Prompt (on Windows).

You are now going to create a special environment just for kiara. This is like giving kiara its own room, so that it will not interfere with other tools or projects on your computer.

Type this command:

conda create -n kiara_testing python jupyter

This creates an environment called kiara_testing and installs a recent version of Python and Jupyter Notebook (the tool you will use to run the notebooks).

You can replace kiara_testing with any name you like for your environment.
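If you want to confirm that the environment was created, you can list all of your Conda environments:

conda env list

The new kiara_testing environment should appear in the output.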

Activate the environment

Next, tell Conda that you want to activate this new environment:

conda activate kiara_testing

Install kiara

Now that your environment is set up, let's install kiara itself.

kiara is installed using a package-management system for Python called pip. Just type:

pip install kiara

The first installation may take a few minutes. That is normal.
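To confirm that the installation succeeded, you can ask pip for details about the package:

pip show kiara

This prints the installed version, along with its location and dependencies.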

Install kiara plugins

To make kiara more useful, you will also install some basic plugins by running:

pip install kiara_plugin.core_types kiara_plugin.onboarding kiara_plugin.tabular 

These plugins provide support for:

  • core data types

  • helpful onboarding tools

  • tabular data (spreadsheets, CSVs, etc.)

To see which versions of kiara and its plugins are installed, you can run:

pip list | grep kiara
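On Windows, where grep is not available in the standard command prompt, you can use findstr instead:

pip list | findstr kiara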

At the time of writing, the versions installed are:

kiara                     0.5.13
kiara_plugin.core_types   0.5.2
kiara_plugin.onboarding   0.5.2
kiara_plugin.tabular      0.5.6

Install the topic modeling package

To install the topic modeling package, run:

pip install git+https://github.com/DHARPA-Project/kiara_plugin.topic_modelling

Alternatively, if you are already working inside a Jupyter notebook, you can run the same command from a notebook cell:

! pip install git+https://github.com/DHARPA-Project/kiara_plugin.topic_modelling

Note: For visualization operations, also install:

pip install observable_jupyter

Start Jupyter notebook

To open the Jupyter interface, run:

jupyter notebook

Import kiara and create an API

Now that kiara and its plugins are installed, let's set up the kiara API.

To start using kiara in Jupyter Notebook, you first need to create a KiaraAPI instance. This instance lets you control kiara and see which operations are available.

To set this up, run the following code in a notebook cell:

from kiara.api import KiaraAPI

kiara = KiaraAPI.instance()
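To check that the API is working, you can list the operations kiara currently knows about. The method below is the one used in kiara's own example notebooks; if it is not available in your installed version, consult the kiara documentation:

kiara.list_operation_ids()

The topic_modelling operations used in the rest of this notebook should appear in the returned list.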

Data onboarding

Before running topic modeling, you must first onboard your corpus. Kiara offers three options for loading textual data, depending on where your files are stored. The third option uses example data present in the topic modelling plugin.

Choose one of the following options:

Option 1: Onboard texts from Zenodo

Use this method if your text files are archived on Zenodo. The operation topic_modelling.create_table_from_zenodo retrieves a ZIP archive from Zenodo using its DOI and extracts its contents into a table with two columns: file_name and content.

Run the following:

create_table_from_zenodo_inputs = {
    "doi": "4596345",
    "file_name": "ChroniclItaly_3.0_original.zip"
}
create_table_from_zenodo_results = kiara.run_job('topic_modelling.create_table_from_zenodo', inputs=create_table_from_zenodo_inputs, comment= " ")
corpus_table_zenodo = create_table_from_zenodo_results['corpus_table']
create_table_from_zenodo_results

The resulting table contains the name of each file and its corresponding text content, ready for further processing.
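To preview the table itself, you can access the value's data, in the same way the distribution step later in this notebook does:

corpus_table_zenodo.data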

Option 2: Onboard texts from GitHub

If your files are hosted in a public GitHub repository, you can use the operation create.table_from_github_files to download and structure the data. Provide the repository owner, name, and path to the folder containing your text files.

Run the following:

create_table_from_github_files_inputs = {
    "download_github_files__user": "DHARPA-Project",
    "download_github_files__repo": "kiara.examples",
    "download_github_files__sub_path": "kiara.examples-main/examples/workshops/dh_benelux_2023/data",
    "download_github_files__include_files": ["txt"]
}
create_table_from_github_files_results = kiara.run_job('create.table_from_github_files', inputs=create_table_from_github_files_inputs, comment=" ")
create_table_from_github_files_results

This method creates a kiara table from the selected .txt files, alongside a downloadable file bundle for inspection or archival.

Option 3: Onboard texts from a local folder

To use text files stored locally on your machine, run the operation import.table.from.local_folder_path. This imports all text files from a specified directory and creates a table like the ones produced by the options above.

Run the following:

import_table_from_local_folder_inputs = {
    "path": "/Users/mariella.decrouychan/Documents/GitHub/kiara_plugin.topic_modelling/tests/resources/data/text_corpus/data"
}
import_table_from_local_folder_results = kiara.run_job('import.table.from.local_folder_path', inputs=import_table_from_local_folder_inputs, comment=" ")
import_table_from_local_folder_results

Make sure to replace the path with the actual location of your text corpus. The resulting table contains metadata and full content for each text file.

Subset creation

After onboarding your corpus, the next step is to enrich it with metadata, explore its temporal distribution, and optionally filter it to create a more focused subset for analysis.

Extract metadata from filenames

To begin, we extract metadata such as publication identifiers and dates directly from the file names with the topic_modelling.lccn_metadata operation. This helps structure the dataset for further filtering and analysis.

Run the following operation to extract the metadata and append it to your corpus table:

lccn_metadata_inputs = {
    "corpus_table": import_table_from_local_folder_results['table'],
    "column_name": "file_name",
    "map": [["sn84037024","sn84037025"],["La Ragione","La Rassegna"]]   
}
lccn_metadata_results = kiara.run_job('topic_modelling.lccn_metadata', inputs=lccn_metadata_inputs, comment = " ")
lccn_metadata_results

This will add three new columns to your table: date, publication_reference, and publication_name, based on patterns identified in the file names.
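To check the new columns, you can preview the enriched table:

lccn_metadata_results['corpus_table'].data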

Visualize corpus distribution

To understand how your documents are distributed over time and by publication, you can group the corpus by time periods with the topic_modelling.corpus_distribution operation.

Run the following to compute the distribution:

corpus_dist_inputs = {
    "corpus_table": lccn_metadata_results["corpus_table"],
    "periodicity": "month",
    "date_col": "date",
    "publication_ref_col": "publication_name",
}
corpus_dist_results = kiara.run_job('topic_modelling.corpus_distribution', inputs=corpus_dist_inputs, comment = " ")
corpus_dist_results['dist_table'].data

This operation returns a table (dist_table) and a list (dist_list) summarizing how many texts exist per publication and time period.

You can visualize the results using observable notebooks:

from observable_jupyter import embed
embed('@dharpa-project/timestamped-corpus', cells=['viewof chart', 'style'], inputs={"data":corpus_dist_results['dist_list'].data.list_data,"scaleType":'height', "timeSelected":'month'})

Create a subset of the corpus

Once you’ve explored the distribution, you can run the query.table operation to filter the corpus based on specific criteria (e.g., a time range) using SQL queries.

To create a subset of documents published in 1917, run the following:

date_ref_1 = "1917-01-01"
date_ref_2 = "1917-12-31"
query = f"SELECT * FROM corpus_table WHERE CAST(date AS DATE) >= DATE '{date_ref_1}' AND CAST(date AS DATE) <= DATE '{date_ref_2}'"
inputs = {
    'query' : query,
    'table': lccn_metadata_results['corpus_table'],
    'relation_name': "corpus_table"
}

subset = kiara.run_job('query.table', inputs=inputs, comment = " ")
subset

This filtered table (query_result) can now be used as input for subsequent topic modeling steps.
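To inspect the filtered rows before moving on, you can preview the query result:

subset['query_result'].data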

Tokenize corpus

With your subset ready, the next step is to convert each document into a list of tokens (words or characters), which will be the basis for topic modeling. This section walks through the process of extracting text content from the corpus, tokenizing it, and applying basic preprocessing steps.

Extract text content as an array

To prepare the corpus for tokenization, you first need to extract the column that contains the text content and convert it into an array.

Run the operation table.pick.column to extract the content column from the table. The example below uses the full imported table; substitute the filtered subset from the previous step if you prefer to work with that instead:

pick_column_inputs = {
    "table": import_table_from_local_folder_results['table'],
    "column_name": "content"   
}
pick_column_results = kiara.run_job('table.pick.column', inputs=pick_column_inputs, comment = " ")

This returns the text contents as an array, which can now be tokenized.

Tokenize the text

Next, tokenize the array using the operation topic_modelling.tokenize_array. By default, tokenization is done by word (not by character).

Run the following:

tokenize_array_inputs = {
    "corpus_array": pick_column_results['array'],
    "column_name": "content"   
}
tokenize_array_results = kiara.run_job('topic_modelling.tokenize_array', inputs=tokenize_array_inputs, comment= " ") 
tokenize_array_results

This operation returns an array where each entry corresponds to a list of tokens (words) extracted from the respective document.

Preprocess the tokens

To clean and standardize the tokens, use the operation topic_modelling.preprocess_tokens. This step is optional but recommended, especially for removing punctuation, digits, and very short tokens.

Run the following to lowercase all tokens, keep only alphabetic tokens, and filter out those shorter than three characters:

preprocess_tokens_inputs = {
    "tokens_array": tokenize_array_results['tokens_array'],
    "lowercase": True,
    "isalpha": True,
    "min_length": 3,  
}
preprocess_tokens_results = kiara.run_job('topic_modelling.preprocess_tokens', inputs=preprocess_tokens_inputs, comment= " ")
preprocess_tokens_results

This will return a cleaned array of token lists, ready to be used for training the topic model.

Remove stopwords

Stopwords are common words (such as and, the, etc.) that usually carry little semantic weight in topic modeling. Removing them helps the model focus on the more meaningful vocabulary of your corpus.

Create a stopwords list

To begin, you can generate a list of stopwords using the operation topic_modelling.stopwords_list. This operation allows you to combine standard stopword lists from the Natural Language Toolkit (NLTK) with any custom stopwords relevant to your project.

Run the following to create a stopword list in both English and Italian, with a few additional custom entries:

stopwords_list_inputs = {
    "languages": ["english","italian"],
    "stopwords_list": ["test","test"]  
}
stopwords_list_results = kiara.run_job('topic_modelling.stopwords_list', inputs=stopwords_list_inputs, comment= " ")
stopwords_list_results

The result is a combined list of stopwords that will be used in the next step to filter your tokenized texts.

Remove stopwords from the tokens

Now that you have a stopword list, you can remove those words from your preprocessed tokens using the operation topic_modelling.remove_stopwords.

Run the following:

remove_stopwords_inputs = {
    "tokens_array": preprocess_tokens_results['tokens_array'],
    "stopwords_list": stopwords_list_results["stopwords_list"] 
}
remove_stopwords_results = kiara.run_job('topic_modelling.remove_stopwords', inputs=remove_stopwords_inputs, comment= " ")
remove_stopwords_results

This returns a cleaned array of tokens, free from common and custom stopwords. These filtered tokens are now ready to be used in the topic modeling stage.

Create bigrams

To improve the coherence of your topic modeling results, you can create bigrams from your preprocessed and stopword-filtered tokens. This step detects commonly co-occurring word pairs, such as "digital_humanities", and treats them as single tokens in the topic modeling process.

To generate bigrams, run the following command:

bigrams_inputs = {
    "tokens_array": remove_stopwords_results['tokens_array'],
    "min_count": 3,
}
bigrams_results = kiara.run_job('topic_modelling.get_bigrams', inputs=bigrams_inputs, comment= " ")
bigrams_results

This operation uses the topic_modelling.get_bigrams module and accepts optional parameters such as min_count (the minimum frequency of token pairs) and threshold (a score threshold for forming phrases, though not provided in the code above). The output is a token array containing the generated bigrams.
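If you also want to set the score threshold mentioned above, the call might look like the sketch below; the threshold value is purely illustrative and should be tuned for your corpus:

bigrams_inputs = {
    "tokens_array": remove_stopwords_results['tokens_array'],
    "min_count": 3,
    "threshold": 10,
}
bigrams_results = kiara.run_job('topic_modelling.get_bigrams', inputs=bigrams_inputs, comment=" ")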

Topic modeling with LDA

After generating bigrams, you can proceed to apply Latent Dirichlet Allocation (LDA) to detect latent thematic structures in the corpus. kiara provides two module options for LDA:

LDA Multicore

The topic_modelling.lda module wraps Gensim’s LdaMulticore implementation and is generally faster on multicore machines than the standard LDA implementation. However, it does not expose all LDA parameters.

To run LDA using the multicore implementation, use the following code:

lda_inputs = {
    "tokens_array": bigrams_results['tokens_array'],
    "num_topics": 3,
    "passes": 20,
    "chunksize": 30 
}
lda_results = kiara.run_job('topic_modelling.lda', inputs=lda_inputs, comment= " ")
lda_results

The results include the top 15 most frequent words and the generated topics.
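To look at the generated topics on their own, you can access the corresponding output value (the same field that is inspected for lineage later on):

lda_results['print_topics'].data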

LDA with extended parameters

For more flexibility, you can use the topic_modelling.lda_extended_params module, which allows configuration of additional parameters such as alpha/eta tuning, topic coherence methods, and minimum topic probability thresholds.

To run this version, use:

lda_ext_params_inputs = {
    "tokens_array": bigrams_results['tokens_array'],
    "passes": 20,
    "chunksize": 30,
    "num_topics": 3,
    "alpha": True,
    "eta": True,
}
lda_ext_params_results =  kiara.run_job('topic_modelling.lda_extended_params', inputs=lda_ext_params_inputs, comment= " ")
lda_ext_params_results

The output includes:

  • most_common_words: Top frequent tokens across the corpus.

  • print_topics: A list of generated topic descriptions.

  • top_topics: Topic descriptions with coherence scores.

To trace the entire pipeline—from importing your local file to tokenization, stopword removal, bigram creation, and finally LDA—you can inspect the lineage of the result:

lda_results['print_topics'].lineage

This command returns a detailed tree structure showing how the output was generated. It includes every kiara operation used, along with their parameters and input/output relationships. This is particularly useful for:

  • Debugging or validating each stage of processing

  • Reproducing or modifying specific parts of the pipeline

  • Documenting your research for transparency and reuse

Test model coherence depending on the number of topics

To compare how well different LDA models fit your data depending on the number of topics, use the topic_modelling.lda_coherence operation.

This module allows you to test multiple topic numbers and returns:

  • Coherence scores for each model, which give a quantitative estimate of how interpretable or semantically consistent the topics are.

  • The corresponding printout of topics for each number of topics tested.

Run the following to evaluate model coherence across two different topic numbers:

lda_coherence_inputs = {
    "tokens_array": bigrams_results['tokens_array'],
    "num_topics_list": [2,5],
    "passes": 20,
    "chunksize": 30,
    "num_topics": 3,
    "alpha": True,
    "eta": True,
}
lda_coherence_results =  kiara.run_job('topic_modelling.lda_coherence', inputs=lda_coherence_inputs, comment= " ")
lda_coherence_results

The coherence_scores output helps you choose the optimal number of topics for interpretation, while the print_topics output displays the corresponding topic-word distributions.
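To compare the scores side by side, you can access these outputs directly:

lda_coherence_results['coherence_scores'].data
lda_coherence_results['print_topics'].data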
