Lesson 5: Processing text documents for LLMs


Chris Kroenke


October 21, 2023

Preparing text data for LLM applications.


Welcome to the fifth lesson in the course. Let’s recap our progress so far:

  • Lesson 1: We made a Python environment for LLMs.
  • Lesson 2: Set up a personal blog to track our progress.
  • Lesson 3: Ran our first LLM with the HuggingFace API.
  • Lesson 4: Ran a quantized LLM with llama.cpp.

So far, we’ve used LLMs off the shelf as they are. Now we take our first steps towards augmenting our own LLM.

Specifically, we will augment an LLM with the Diátaxis website. Diátaxis is a framework and approach for writing technical documentation. Our goal is to give an LLM knowledge about Diátaxis and use it to help us write better notebooks.

Let’s start with our running notebook best practice:

%load_ext autoreload
%autoreload 2

Grabbing the Diátaxis data

The Diátaxis docs will be the source of knowledge for our LLM. The pages are available in a repo as reStructuredText files with the extension .rst.

Link to Diátaxis code repo

# clone the Diátaxis repo
git clone https://github.com/evildmp/diataxis-documentation-framework

Converting .rst to .txt files

It is rare that files come in exactly the right format for our ML algorithms. Pre-processing the input is one of the most important steps in ML that often gets overlooked. However, it is a great place to follow one of the Golden Rules of ML: *always* look at your data.

All too often, folks jump right into the code and start training models. This is a fun step, to be sure, but we can learn so much about both our problem and the domain itself by first looking at the data. Without carefully inspecting the data, you are basically flying blind. It is only the sheer and overwhelming power of ML and LLMs that lets us get away with it (sometimes), but that doesn’t mean we should.

With that said, here we only have to do a little bit of pre-processing. We need to convert the Diátaxis .rst files into .txt files, then clean up the text a bit.
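As a small illustration of looking at the data first, here is a minimal sketch of a helper that returns the first few lines of a file. The name peek and the sample file are hypothetical, made up for this example. In practice you would point it at one of the .rst files in the cloned repo.

```python
import os
import tempfile

def peek(path, n=5):
    "Returns the first `n` lines of a text file (hypothetical helper)."
    with open(path, 'r', encoding='utf-8') as f:
        return [line.rstrip('\n') for _, line in zip(range(n), f)]

# a stand-in .rst file, made up for this example
sample = "Tutorials\n=========\n\nA tutorial is a lesson.\n"
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample.rst')
    with open(sample_path, 'w', encoding='utf-8') as f:
        f.write(sample)
    first_lines = peek(sample_path, n=2)

print(first_lines)
```

Even two lines are enough to spot the underlined-heading markup that we will need to clean up later.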


Make sure you are inside of the llm-env virtual environment.

Run the cell below to install the rst processing libraries.

# installing the rst to txt converter and writer
pip install rst2txt docutils

Next we can modify the example in the rst2txt documentation to write a function that turns an .rst file into a .txt file.

from docutils.core import publish_file
import rst2txt

def convert_rst_to_txt(filename):
    "Turns an rst file to a txt file with the same name."
    with open(filename, 'r') as source:
        publish_file(
            source=source,
            destination_path=filename.replace(".rst", ".txt"),
            writer=rst2txt.Writer()
        )

Next up, let’s grab all of the .rst files in the Diátaxis repository and convert them into .txt files.

import os

# NOTE: replace with your path to the Diátaxis repo
path_to_diataxis = '/Users/cck/repos/diataxis-documentation-framework'

# find all rst files in the docs repo
rst_files = [o for o in os.listdir(path_to_diataxis) if o.endswith('.rst')]

# convert all rst files to txt files
for rst in rst_files:
    convert_rst_to_txt(os.path.join(path_to_diataxis, rst))

The following subset of the docs has the relevant information an LLM would need to write notebooks in the Diátaxis style.

# files with important content about writing docs
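valid_files = ['compass.txt', 'complex-hierarchies.txt', 'explanation.txt', 'how-to-guides.txt',
               'how-to-use-diataxis.txt', 'needs.txt', 'quality.txt', 'reference-explanation.txt',
               'reference.txt', 'tutorials-how-to.txt', 'tutorials.txt']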

Let’s read in these text files and store them into a data dictionary.

# stores the text data
data = {}

# read in the relevant files
for f in valid_files:
    with open(f'{path_to_diataxis}/{f}', 'r') as file:
        data[f] = str(file.read())

In data, the file names are the keys and the values are the text in the files. This is a pretty standard pattern when loading ML data: features are loaded into a map (dictionary), indexed by some unique identifier.
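One quick way to sanity-check a dictionary like this is to summarize each entry. The sketch below uses a small stand-in dictionary (the keys and text are made up for illustration, they are not the real Diátaxis files) and a hypothetical summarize helper.

```python
# stand-in for the `data` dictionary; keys and text are made up
toy_data = {
    'a.txt': 'First document. It has two sentences.',
    'b.txt': 'Second document.',
}

def summarize(data):
    "Maps each file name to its character and word counts."
    return {k: {'chars': len(v), 'words': len(v.split())} for k, v in data.items()}

summary = summarize(toy_data)
for name, stats in summary.items():
    print(name, stats)
```

A file with a suspiciously low count is a hint that the conversion or the read step went wrong for that file.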

Cleaning up the text

Take a moment to look through the .txt files we’ve loaded, for example how-to-guides.txt. One thing should immediately stand out: there are some errors from the conversion process.

Specifically, there are some sections it wasn’t able to parse. Here’s an example of a broken parsing output:

<SYSTEM MESSAGE: ... Unknown interpreted text role "ref".>

Thankfully, this is isolated to the single line that failed. The rest of the document is ok.

This means we have two kinds of text cleanup to do:
1. Standard text cleanup and formatting.
2. Errors from the .rst conversion process.

Standard text cleanup

There are a few best-practice steps for cleaning up text data:
- Remove extra and trailing whitespaces.
- Remove special characters, like HTML tags.
- Properly handle escaped characters (\t, \n, etc).

Other steps like lower-casing, removing numbers, or dropping typical stop-words are more task-specific.
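The three standard steps above can be sketched with the standard library. This is a minimal illustration, not the cleanup used later in this lesson: basic_cleanup is a hypothetical name, and the tag-stripping regex is a rough assumption that works for simple tags but is not a full HTML parser.

```python
import html
import re

def basic_cleanup(text):
    "Applies generic cleanup: tags, HTML entities, then whitespace."
    # drop anything that looks like an HTML tag (rough heuristic)
    text = re.sub(r'<[^>]+>', ' ', text)
    # turn HTML entities such as &amp; back into plain characters
    text = html.unescape(text)
    # collapse runs of whitespace (spaces, tabs, newlines) and trim the ends
    return re.sub(r'\s+', ' ', text).strip()

raw = "  <p>Docs &amp; tutorials</p>\n\tare\tuseful.  "
cleaned = basic_cleanup(raw)
print(cleaned)
```

The order is a design choice: tags are removed before whitespace is collapsed, so the spaces left behind by removed tags get tidied up in the same pass.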

Let’s define a clean_text function that cleans up a given string.

import re

def clean_text(text):
    "Cleans up the headers and footers of the text."
    # Replace any run of whitespace (spaces, tabs, newlines) with a single space
    text = re.sub(r'\s+', ' ', text)

    # Define the regex pattern for the headers and footers
    pattern = r'[\*\=\^]+'
    # Substitute the special sequences with an empty string
    text = re.sub(pattern, '', text)

    # TODO: any other cleaning you can think of?
    return text

Let’s call this cleanup function on the raw text of each file.

# cleaning up the text
data = {k: clean_text(v) for k, v in data.items()}
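To see what clean_text does, it helps to run it on a tiny snippet. The sketch below repeats the function so it runs on its own, and the sample string is invented for illustration. Stripping the markers after collapsing whitespace can leave a double space behind, so the hypothetical variant clean_text_v2 strips the markers first.

```python
import re

def clean_text(text):
    "Same steps as above: collapse whitespace, then drop RST markers."
    text = re.sub(r'\s+', ' ', text)
    return re.sub(r'[\*\=\^]+', '', text)

def clean_text_v2(text):
    "Hypothetical variant: drop markers first, then collapse whitespace."
    text = re.sub(r'[\*\=\^]+', '', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "Title\n=====\n\nSome *bold*   text."
print(repr(clean_text(sample)))     # note the double space after 'Title'
print(repr(clean_text_v2(sample)))
```

Whether the leftover double space matters depends on what comes next. It is harmless for embedding, but it is a good reminder to look at the output of every cleaning step.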

Special text cleanup

Now we can handle the errors that popped up when converting the .rst documents. Let’s split the documents into lists of sentences, so we can find the incorrect “SYSTEM MESSAGE” lines.

# split the data into a list of sentences
def split_sentences(text):
    "Turns documents into a list of sentences."
    return [o for o in text.split('. ') if o]

split_data = {k: split_sentences(v) for k, v in data.items()}
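Splitting on '. ' is simple, but it misses sentences that end in '?' or '!' and it removes the period it splits on. As a hedged alternative, the sketch below splits on whitespace that follows sentence-ending punctuation. The name split_sentences_regex is hypothetical, and this is still a heuristic: abbreviations like "e.g." will be split incorrectly.

```python
import re

def split_sentences_regex(text):
    "Splits after '.', '!' or '?' when followed by whitespace."
    parts = re.split(r'(?<=[.!?])\s+', text)
    return [p for p in parts if p]

sample = "What is a tutorial? It is a lesson. Try one!"
sentences = split_sentences_regex(sample)
print(sentences)
```

For this lesson the simple split is good enough, since the sentences only serve as small chunks of text to embed.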

Let’s look at one of the sentences in the how-to-guides.txt file.

# Looking at an example sentence
'How-to guides matter not just because users need to be able to accomplish things: the list of how-to guides in your documentation helps frame the picture of what your product can actually do'

How many processing errors are in this document?

# counting the number of system messages in how-to-guides.txt
doc = 'how-to-guides.txt'

def count_errors(text):
    "Counts the number of system messages in the text."
    return sum(1 for o in text if '<SYSTEM MESSAGE:' in o)

count_errors(split_data[doc])


Let’s count the errors in all of the documents.

# checking the full count of system errors
for f in valid_files:
    print(f"NumErrors {f}: {count_errors(split_data[f])}")
NumErrors compass.txt: 0
NumErrors complex-hierarchies.txt: 3
NumErrors explanation.txt: 0
NumErrors how-to-guides.txt: 1
NumErrors how-to-use-diataxis.txt: 0
NumErrors needs.txt: 1
NumErrors quality.txt: 1
NumErrors reference-explanation.txt: 1
NumErrors reference.txt: 0
NumErrors tutorials-how-to.txt: 3
NumErrors tutorials.txt: 0

Not too bad, but still something we want to clean up.

def clean_rst_errors(txt):
    "Only returns items without system messages."
    return [o for o in txt if '<SYSTEM MESSAGE:' not in o]  

# our cleaned up data split into sentences
clean_data = {k: clean_rst_errors(v) for k, v in split_data.items()}

We can then check if the system messages are gone:

# checking the full count of system errors
for f in valid_files:
    print(f"Clean NumErrors {f}: {count_errors(clean_data[f])}")
Clean NumErrors compass.txt: 0
Clean NumErrors complex-hierarchies.txt: 0
Clean NumErrors explanation.txt: 0
Clean NumErrors how-to-guides.txt: 0
Clean NumErrors how-to-use-diataxis.txt: 0
Clean NumErrors needs.txt: 0
Clean NumErrors quality.txt: 0
Clean NumErrors reference-explanation.txt: 0
Clean NumErrors reference.txt: 0
Clean NumErrors tutorials-how-to.txt: 0
Clean NumErrors tutorials.txt: 0
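Dropping a whole sentence also throws away any valid text that shared it with the error. A possible alternative, sketched below, removes only the message itself before splitting. This assumes each message starts with '<SYSTEM MESSAGE:' and ends at the next '>', with no '>' inside it, so check that against your own files. The sample string and the name strip_system_messages are made up for illustration.

```python
import re

# assumption: a message runs from '<SYSTEM MESSAGE:' to the next '>'
SYSTEM_MSG = re.compile(r'<SYSTEM MESSAGE:.*?>', flags=re.DOTALL)

def strip_system_messages(text):
    "Removes system messages and tidies the leftover whitespace."
    text = SYSTEM_MSG.sub(' ', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = 'Before the error. <SYSTEM MESSAGE: Unknown interpreted text role "ref".> After the error.'
stripped = strip_system_messages(sample)
print(stripped)
```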

Embedding the Diátaxis data

Now we have a set of clean sentences ready for embedding. Text embeddings are usually placed in vector store databases. There are many startups providing this service, or we could spin up our own. For now, we’ll use the chromadb embedding storage.
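To make the idea of a vector store concrete, here is a toy sketch of what a similarity search does under the hood: compare a query vector to the stored vectors and return the closest ones. The three-dimensional vectors are invented for illustration (real embeddings come from a model and have hundreds of dimensions), and cosine similarity is just one common choice of metric, not necessarily what any particular database uses by default.

```python
import math

def cosine_similarity(a, b):
    "Cosine of the angle between two equal-length vectors."
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy "store": sentence -> made-up embedding
store = {
    'Tutorials teach by doing': [1.0, 0.0, 0.0],
    'Reference describes the machinery': [0.0, 1.0, 0.0],
    'How-to guides solve a problem': [0.0, 0.0, 1.0],
}

def search(query_vec, store, k=1):
    "Returns the k stored sentences most similar to the query vector."
    ranked = sorted(store, key=lambda s: cosine_similarity(query_vec, store[s]), reverse=True)
    return ranked[:k]

top = search([0.9, 0.1, 0.0], store, k=2)
print(top)
```

A real vector store adds the pieces this sketch leaves out: computing the embeddings, persisting them, and searching efficiently over many vectors.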

# install chromadb inside llm-env
pip install chromadb
import chromadb
chroma_client = chromadb.Client()
# create a collection
coll_name = 'diaxtaxis_docs'
collection = chroma_client.create_collection(name=coll_name)

Now we can store the embeddings.

# step through our documents and sentences
for fid, sentences in clean_data.items():

    # metadata for the files
    metadatas = [{"source": fid}] * len(sentences)

    # unique id for each sentence in the file
    ids = [f"{fid}_{i}" for i in range(len(sentences))]

    # add the documents
    collection.add(
        documents=sentences,
        metadatas=metadatas,
        ids=ids,
    )
/Users/cck/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [01:42<00:00, 814kiB/s] 

Querying the database

Now we have a stored set of embeddings we can search with queries. Let’s try to find some relevant sentences for writing a Notebook.

# example setup
example_prompt = "Writing code to pre-process and cleanup text."

results = collection.query(
    query_texts=[example_prompt],
    n_results=5,
)
{'ids': [['how-to-use-diataxis.txt_38',
 'distances': [[1.1993290185928345,
 'metadatas': [[{'source': 'how-to-use-diataxis.txt'},
   {'source': 'tutorials.txt'},
   {'source': 'reference.txt'},
   {'source': 'tutorials-how-to.txt'},
   {'source': 'reference.txt'}]],
 'embeddings': None,
 'documents': [["Working like this helps reduce the stress of one of the most paralysing and troublesome aspects of the documentation-writer's work: working out what to do",
   'Provide minimal explanation of actions in the most basic language possible',
   'List commands, options, operations, features, flags, limitations, error messages, etc',
   'You already know these processes',
   'Do nothing but describe  Technical reference has one job: to describe, and to do that clearly, accurately and comprehensively']],
 'uris': None,
 'data': None}
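The result is a dictionary of nested lists, with one inner list per query. A small sketch of how to pair each returned sentence with its source file follows. The stand-in results dictionary copies two of the documents and their sources from the output above and leaves out the other fields, and pair_results is a hypothetical helper name.

```python
# stand-in with the same nesting as the query output (two entries only)
results = {
    'metadatas': [[{'source': 'how-to-use-diataxis.txt'},
                   {'source': 'tutorials.txt'}]],
    'documents': [["Working like this helps reduce the stress of one of the most paralysing and troublesome aspects of the documentation-writer's work: working out what to do",
                   'Provide minimal explanation of actions in the most basic language possible']],
}

def pair_results(results, query_idx=0):
    "Pairs each document with its source file for one query."
    docs = results['documents'][query_idx]
    metas = results['metadatas'][query_idx]
    return [(m['source'], d) for m, d in zip(metas, docs)]

pairs = pair_results(results)
for source, doc in pairs:
    print(f"{source}: {doc}")
```

Keeping the source next to each sentence makes it easier to judge whether a retrieved sentence is actually relevant to the prompt.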


This notebook took the first steps to augment an LLM with extra knowledge. We embedded the Diátaxis documentation to eventually use it for Retrieval-Augmented Generation (RAG). Later on, we will also use other LLMs to generate Question-and-Answer pairs based on these documents, and use them to fine-tune a model.