Network analysis in Jupyter notebook
Before running any notebook, you will need to install kiara on your computer using the command-line interface (CLI). We will guide you through the process step by step.
To begin, you must install conda or miniconda, which are tools for managing software environments and dependencies. We recommend installing miniconda, which is the lighter version of conda. Installation instructions for miniconda can be found in the official miniconda documentation.
Be sure to download the right version for your operating system (Windows, macOS, or Linux).
We suggest creating a separate environment for kiara. This makes it easier to manage and avoid conflicts with other software. You can do this by opening your CLI and typing:
You can replace kiara_testing with any name you like for your environment.
Once the environment is created, activate it with:
Now that your environment is set up, you can begin installing the necessary packages. kiara is not available directly through conda, so we’ll use pip, another common package manager:
The first installation may take a few minutes. Once kiara is installed, we will also add some essential plugins by running:
To see which versions of kiara and its plugins are installed, you can run:
At the time of writing, the versions installed are:
As we will see in the next section, depending on which notebooks you wish to run, you may also need to install the plugins for topic modelling and network analysis.
Before we get started with network analysis, we need to check whether kiara and its associated plugins are installed. kiara's features are available through plugins.
There are seven plugins:
To install these, first launch Jupyter from the command line by running:
This will open the Jupyter notebook in your default web browser.
In a notebook cell, run the following code:
Now that the plugins are ready, let's set up kiara itself.
To start using kiara in Jupyter, you need to create an instance of the kiaraAPI. This API provides access to kiara's functions, enabling you to interact with and control your data workflows.
To set this up, run the following code in a notebook cell:
In kiara, a context is your project space. It keeps track of your data, the tasks you run, and the steps you take. A default context is always available, but you can also create your own for specific projects.
To create and use a new context called hello_kiara, run the following code:
This operation will also show all your available contexts and confirm which one is currently active.
The output will be something like:
This confirms that your new context is set up and ready to use.
Now, we can explore the tools kiara offers. To view a list of all available operations (based on the installed plugins), run:
This will return a list of operations, like:
Each operation is a task you can perform in kiara, such as creating a table, calculating network metrics, or exporting files.
Now that kiara is set up, let's bring a file into our notebook using the download.file operation.
To understand what this operation does and what information it needs, run:
We’ll now download a sample CSV file using this operation. First, we define the input (the file URL and name) and then run the job:
This gives you a file object as output, including the downloaded file and some technical metadata. Let’s print it to confirm:
You will see a preview of the file's content. This shows the journal data was successfully downloaded.
To keep using this file later (even if the notebook is closed), we will save it inside kiara using an alias. This works like giving the file a name that kiara remembers.
Now, Journal_Nodes is saved in kiara's internal storage. You can refer to it later by its alias, much like using a variable in Python.
Now that we have downloaded the file, let's turn it into a table so we can work with the data.
We can look through kiara’s available operations by filtering for those that start with create:
This shows a list of operations. Since we’re working with a CSV file, the one we want is create.table.from.file.
This operation will read the file and turn it into a structured table.
To see what inputs and outputs this operation expects, run:
From this, we learn:
Inputs
Required: a file.
Optional:
first_row_is_header – indicates if the first row of a CSV file contains column headers.
delimiter – specifies the column separator (only for CSV), used if kiara cannot auto-detect it.
Outputs
A table object, which can be used in the next steps.
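To make the two optional inputs more concrete, here is a minimal sketch using Python's standard csv module rather than kiara itself. The function name parse_csv_text and the sample rows are hypothetical; only the parameter names first_row_is_header and delimiter mirror the inputs listed above, and kiara's own implementation may differ.

```python
import csv
import io


def parse_csv_text(text, first_row_is_header=True, delimiter=","):
    """Parse CSV text into a list of row dictionaries.

    Illustrative only: this mimics the meaning of the two optional
    inputs described above; it is not kiara's implementation.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return []
    if first_row_is_header:
        header, data = rows[0], rows[1:]
    else:
        # No header row: fall back to generic column names (an assumption).
        header = [f"column_{i}" for i in range(len(rows[0]))]
        data = rows
    return [dict(zip(header, row)) for row in data]


# Hypothetical sample that uses a semicolon as the column separator.
sample = "Id;Label\n1;Journal A\n2;Journal B\n"
table = parse_csv_text(sample, first_row_is_header=True, delimiter=";")
print(table[0])
```

If the delimiter were left at its default here, each row would be read as a single column, which is the kind of situation the delimiter input is meant to resolve.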
Let’s turn the downloaded file (which we saved earlier under the alias Journal_Nodes) into a table:
This will process the CSV file and show the result as a table with columns and rows.
To make it easier to reuse the table later, we can save it in kiara under a new alias:
Now, your data is saved inside kiara and can be accessed at any time using the name Journal_Nodes_table.
Now that we have downloaded the file and converted it into a table, we can start exploring the data. One simple way to do that is by running SQL queries directly on the table using kiara.
To find relevant operations for querying data, search with the keyword 'query':
This returns:
Since we are working with a table, we will use:
This tells us that query.table allows us to write an SQL query to explore the data. The required inputs are:
table: the data you want to query
query: your SQL statement
Let’s find out how many of these journals were published in Berlin:
The result (in outputs['query_result']) is a filtered table showing only journals published in Berlin.
Let's narrow this further and find all the journals that are about general medicine and published in Berlin.
We can re-use the query.table function and the table we've just made, stored in outputs['query_result']:
This returns a smaller table with only the Berlin-based general medicine journals.
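The shape of these two SQL queries can be sketched with Python's built-in sqlite3 module. This is not kiara's query.table operation: the table name data, the column names Label, City and Topic, and the sample rows are all hypothetical stand-ins for the journal data.

```python
import sqlite3

# In-memory database with a hypothetical journals table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (Label TEXT, City TEXT, Topic TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [
        ("Journal A", "Berlin", "general medicine"),
        ("Journal B", "Berlin", "surgery"),
        ("Journal C", "Vienna", "general medicine"),
    ],
)

# Step 1: keep only journals published in Berlin.
berlin = conn.execute(
    "SELECT Label, Topic FROM data WHERE City = 'Berlin' ORDER BY Label"
).fetchall()

# Step 2: narrow further to general medicine journals.
berlin_general = conn.execute(
    "SELECT Label FROM data "
    "WHERE City = 'Berlin' AND Topic = 'general medicine'"
).fetchall()

print(len(berlin), "journals in Berlin")
print(berlin_general)
conn.close()
```

In the notebook, the second query runs against the output of the first. For brevity, this sketch instead combines both conditions in a single statement on the original table.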
Now that we’ve transformed and queried our data, let's review what kiara knows about the outputs we've created and how it tracks changes.
Even though we have made changes along the way, we can still access a lot of information about our data.
Specifically, the operation gave us:
A unique value ID
The data type (in this case, a table)
When the value was created
A record of the job that generated it
Links to the inputs and outputs of previous steps
kiara automatically traces all of these changes, keeping track of inputs and outputs and assigning each a unique identifier, so you always know exactly what has happened to your data.
To see a 'backstage' view of how your data was transformed, including the inputs for each function we have run and how they connect, run the following:
This shows a chain of operations:
The SQL query you ran (query.table)
The table that was queried (from create.table)
The original file that was downloaded (download.file)
Each input is assigned a unique ID, allowing complete transparency and traceability.
Now that we're comfortable with what kiara looks like and what it can do to help track your data and your research process, let's try out some of the digital analysis tools, starting with network analysis.
Network analysis offers a computational and quantitative means to examine and explore relational objects, with proxies to measure structural roles and concepts such as power and influence. Doing so digitally - and at scale - also allows us to consider these kinds of questions with large amounts of material or documents that were not heretofore manageable with qualitative or manual approaches.
Before we begin exploring network analysis, let's make sure everything is ready.
In this step, we will check that the necessary plugins are available and set up the kiara API, which is the interface that allows us to run kiara commands inside our Jupyter notebook.
This is the code to get started:
Next, we will set up the file paths for the data that we are going to use in this notebook. The data file is stored in the same directory as the two Jupyter notebooks. To set the file path, you can either save the full path to the CSV file in the variable below, or use the os.path module in Python to shorten this, as below:
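A minimal sketch of the os.path approach is given here, assuming the notebook was started from the directory that contains the data file. The filename my_data.csv and the variable names are hypothetical; replace them with your own.

```python
import os

# Directory the notebook was started from (the current working directory).
notebook_dir = os.getcwd()

# Hypothetical filename; replace it with the name of your own CSV file.
csv_filename = "my_data.csv"

# Join the two parts into a full path in an OS-independent way.
csv_path = os.path.join(notebook_dir, csv_filename)
print(csv_path)
```

Using os.path.join avoids hard-coding path separators, so the same cell works on Windows, macOS, and Linux.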
Great, we are all set up. We are now going to import some data again using the kiara function import.local.file. This function will allow you to bring in a local file, one stored on your computer. We're using sample data, but you can also use this function to import your own data.
This collection includes about 20,000 letters written by and to 17th-century scholars in the Dutch Republic. By using network analysis, we can explore questions such as:
Who was the most prolific writer?
Which actor connected the most people?
Who operated in closely knit writing groups?
While network analysis can be used to explore and map unknown datasets, in this case, we already know something about the data. The research questions and module parameters in this notebook have been shaped by that prior knowledge. That is important to keep in mind as we proceed.
Let’s now use the import.local.file module from kiara to access our CSV file. We will specify the path to the CSV file in our inputs and save the outputs of the function as 'CKCC'. Alternatively, we can use the download.file module used in the Hello Kiara notebook.
We will leave the comments blank here for you to fill in yourself, but the comment here might indicate why you have chosen this dataset, or a reminder of which version you are working with if you have multiple versions of the same dataset.
Now that we’ve imported our data, it’s time to build a network from it.
As with most network analysis tools, kiara requires the data to be in the form of an edge table first. An edge table shows the connections or relationships between different entities, in this case, between senders and recipients of letters. Later, we could also add a table with nodes (the individual entities), but that is optional, and we will skip it for now.
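To illustrate what an edge table is, here is a small sketch in plain Python, independent of kiara. The letter records and names are invented for illustration, and the column names Source and Target are assumed to match the ones used for this dataset.

```python
import csv
import io

# Hypothetical letter records: one row per letter.
letters_csv = """Source,Target,Date
Alice,Bob,1640
Bob,Alice,1641
Alice,Bob,1642
Alice,Carol,1642
"""

# An edge table is, at its simplest, a list of (source, target) pairs.
edges = [
    (row["Source"], row["Target"])
    for row in csv.DictReader(io.StringIO(letters_csv))
]

# The nodes can be derived from the edges, which is why a separate
# node table is optional.
nodes = sorted({name for edge in edges for name in edge})

print(edges)
print(nodes)
```

A separate node table becomes useful when you want to attach extra attributes to each entity, such as a place or date of birth, that do not belong on the edges.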
To transform our CSV file into an edge table, we will use the create.table.from.file function that we used in the first notebook. We will save this table in a variable called CKCC, which we will reuse later on.
Before running it, we should check the input requirements of the function, just to make sure we are using it correctly. You can do that with the following command:
This will display useful information about the function, such as the inputs it needs and the outputs it produces.
Now, we can turn our CSV data into a kiara table by loading the data file we imported earlier (the one stored in CKCC) and telling kiara that the first row of the file contains the column headers. You can do that with the following command:
Now that we have our edges formatted as a kiara table, we are ready to make our network graph. But before we do that, it is helpful to preview the structure of the network using kiara’s preview.network_info function. All we need to do is select our edges table and the column names for our sources and targets by running:
This function gives us the total number of nodes, but it also helps us think about how different types of graphs - directed, undirected, multi-directed, and multi-undirected - might affect the number of edges in the network.
We see that there are more edges in a directed graph than in an undirected graph. This suggests that some pairs of nodes are connected by reciprocal edges, one in each direction, which is typical of an epistolary network, where people write back and forth to each other.
We also notice that there are even more edges in a multigraph than in either of our non-multigraphs, which means the dataset includes parallel edges (i.e., duplicates in our edge table). Again, this is common for an epistolary network, where someone writes more than one letter to their friend.
The preview shows no isolates (nodes without any edges) and a number of components. However, we see a large number of self-loops. This is unusual in epistolary collections, as people are unlikely to write to themselves.
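The reasoning behind these comparisons can be sketched in plain Python. This is an illustration of the counting logic only, under the assumption that edges are (source, target) pairs. The function count_edges and the sample data are hypothetical, and kiara's preview.network_info may compute its figures differently.

```python
def count_edges(edges):
    """Count nodes, edges under four graph types, and self-loop rows."""
    # Directed: parallel edges collapse, but direction is kept.
    directed = set(edges)
    # Undirected: direction is ignored, so (A, B) and (B, A) are one edge.
    undirected = {frozenset(edge) for edge in edges}
    return {
        "nodes": len({node for edge in edges for node in edge}),
        "directed": len(directed),
        "undirected": len(undirected),
        # Multigraphs keep every row, so direction does not change the count.
        "multi_directed": len(edges),
        "multi_undirected": len(edges),
        # Every row whose source and target are the same entity.
        "self_loops": sum(1 for source, target in edges if source == target),
    }


# Hypothetical edge list with a reciprocal pair, a duplicate, and a self-loop.
edges = [
    ("Alice", "Bob"),
    ("Bob", "Alice"),
    ("Alice", "Bob"),
    ("Alice", "Carol"),
    ("Dave", "Dave"),
]
summary = count_edges(edges)
print(summary)
```

In this toy example the reciprocal pair makes the directed count larger than the undirected one, and the duplicated letter makes the multigraph counts larger still, which mirrors the pattern described above.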
So, in addition to helping us decide what graph type is most useful for our dataset, this module helps us to review our data by flagging up potential errors or inconsistencies in our dataset that we can go back to at some point.
Having access to this kind of information means we can make more informed decisions about the next steps of our research or digital analysis, especially those that are sometimes automated for us.
For our network, a directed graph makes the most sense.
Let's now look at what we need to build one with our assemble.network_graph module using kiara.retrieve_operation_info.
This might seem like a chunky module, but it gets us to do a lot of important decision-making up front. This means that we don't have to keep inputting these decisions later on when we do some more analytical stuff.
If we change our mind about the kind of graph we want to use later on, we can always come back to this step. The preview.network_info function is useful because it allows us to get the information we need to make an informed decision about our network early on.
We already decided that we want to make a directed network, so we can select 'directed' for graph type, and we created our edge table as 'edges' earlier, so we can pop that back in. We also need to specify our Source and Target columns again, and we can copy all this information from our preview module. We do not have a node table for this dataset, but if we did, now is where we would include it.
Now, we can make some more decisions that we have not seen yet. One is deciding whether our network is weighted or not, which might mean a number of things, depending on the data you are using - the number of letters between correspondents, the distance between them, the number of years they have known each other. If all the relationships between nodes in the network are the same, we can set this as False; if not, we need to tell kiara where this weight information is coming from.
If weights already exist in the edges table, for example, you've already assigned weights to the network before uploading the data into kiara, you can just pick the weight column and move on, or choose to do something more with them. If you have parallel edges (which the preview.network_info module will have told you) and you don't want to use a multigraph, you can choose how you want to handle weighted parallel edges. You can either: add all the weights together ('sum'); calculate the average weight for the merged edge ('mean'); find the largest value ('maximum'); or find the smallest value ('minimum'). This will then assign this value as the new weight for this edge in the network.
If you want kiara to calculate the weights for you, you can select 'sum' and total the number of occurrences of this edge as a weight. Note that if you haven't provided any weights already, the edges will be automatically assigned a weight of 1, so choosing 'mean', 'minimum', or 'maximum' for this will just return a value of 1 for every edge, which is equivalent to an unweighted network.
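The four merging strategies can be illustrated with a short sketch in plain Python. The function merge_parallel_edges and the sample letters are hypothetical and only mirror the option names 'sum', 'mean', 'maximum', and 'minimum' described above; this is not kiara's implementation.

```python
from collections import defaultdict
from statistics import mean


def merge_parallel_edges(edges, strategy="sum"):
    """Merge parallel (source, target, weight) rows into one weighted edge."""
    combine = {"sum": sum, "mean": mean, "maximum": max, "minimum": min}[strategy]
    grouped = defaultdict(list)
    for source, target, weight in edges:
        grouped[(source, target)].append(weight)
    return {pair: combine(weights) for pair, weights in grouped.items()}


# Hypothetical letters with no weights supplied: every row gets weight 1.
letters = [("Alice", "Bob"), ("Alice", "Bob"), ("Alice", "Bob"), ("Bob", "Alice")]
weighted = [(source, target, 1) for source, target in letters]

print(merge_parallel_edges(weighted, "sum"))
print(merge_parallel_edges(weighted, "mean"))
```

With 'sum', the edge from Alice to Bob receives a weight of 3, the number of letters sent. With 'mean', 'maximum', or 'minimum', every edge keeps a weight of 1, which is why those options add nothing when no weights were supplied.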
The inputs for this module prompt us to reflect on the decisions we are making as we go along and to think about how our data fits these kinds of measurements. Doing this in kiara also allows us to track those decisions, both by storing the processes and through the comments we add.
We're still working with our letter dataset, so let's get kiara to add all the edges together so that the weight will tell us how many letters each person wrote to each other.
We will not get into any core network theories or their uses in the humanities here, as we're focused on the ways in which network analysis in kiara offers an interesting way to wrap the research process, and think about the decisions we're making and how to trace them. If you're interested in learning more about network analysis, or how to code with the library currently used in these kiara modules, check out our recommended reading at the bottom.
The dataset we’re using is a sample from the Circulation of Knowledge and Learned Practices in the 17th-century Dutch Republic (CKCC) collection, compiled by the Huygens Institute in the Netherlands and made available on the LetterSampo portal, part of the Reassembling the Republic of Letters project. More information about these projects is available on their websites.