Lesson 5: Processing text documents for LLMs

Preparing text data for LLM applications.

Intro

Welcome to the fifth lesson in the course. Let's recap our progress so far:

  • Lesson 1: Made a Python environment for LLMs.
  • Lesson 2: Set up a personal blog to track our progress.
  • Lesson 3: Ran our first LLM with the HuggingFace API.
  • Lesson 4: Ran a quantized LLM with llama.cpp.

So far, we've used LLMs off the shelf as they are. Now we take our first steps towards augmenting our own LLM.

Specifically, we will augment an LLM with the Diátaxis website. Diátaxis is a framework and approach for writing technical documentation. Our goal is to give an LLM knowledge about Diátaxis and use it to help us write better notebooks.

Let's start with our running notebook best practice:

%load_ext autoreload
%autoreload 2

Grabbing the Diátaxis data

The Diátaxis docs will be the source of knowledge for our LLM. The pages are available in a repo as reStructuredText files with the extension .rst.

Link to Diátaxis code repo

# clone the Diátaxis repo
git clone https://github.com/evildmp/diataxis-documentation-framework

Converting .rst to .txt files

It is rare that files come in exactly the right format for our ML algorithms. Pre-processing the input is one of the most important steps in ML, yet it often gets overlooked. It is also a great place to follow one of the Golden Rules of ML: *always* look at your data.

All too often, folks jump right into the code and start training models. This is a fun step, to be sure, but we can learn so much about both our problem and the domain itself by first looking at the data. Without carefully inspecting the data, you are basically flying blind. It is only the sheer and overwhelming power of ML and LLMs that lets us get away with it (sometimes), but that doesn't mean we should.

With that said, here we only have to do a little bit of pre-processing. We need to convert the Diátaxis .rst files into .txt files, then clean up the text a bit.

:::: callout-note Make sure you are inside of the llm-env virtual environment. ::::
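
One way to double-check this from inside the notebook is sketched below. It relies on `sys.prefix` differing from `sys.base_prefix` inside a `venv`/`virtualenv` environment, and on the `CONDA_DEFAULT_ENV` variable for conda or mamba. The helper name `active_env_name` is ours, and which of the two mechanisms applies depends on how `llm-env` was created.

```python
import os
import sys

def active_env_name():
    """Best-effort name of the active Python environment, or None."""
    # venv / virtualenv: sys.prefix points inside the environment
    if sys.prefix != sys.base_prefix:
        return os.path.basename(sys.prefix)
    # conda / mamba export the active environment name in this variable
    return os.environ.get("CONDA_DEFAULT_ENV")

print("Active environment:", active_env_name())
```

If this prints something other than `llm-env`, activate the environment before installing anything.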

Run the cell below to install the rst processing libraries.

# installing the rst to txt converter and writer
pip install rst2txt docutils

Next we can modify the example in the rst2txt documentation to write a function that turns an .rst file into a .txt file.

from docutils.core import publish_file
import rst2txt

def convert_rst_to_txt(filename):
    """
    Turns an rst file to a txt file with the same name.
    """
    with open(filename, 'r') as source:
        publish_file(
            source=source,
            destination_path=filename.replace(".rst", ".txt"),
            writer=rst2txt.Writer()
        )
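
One caveat with `filename.replace(".rst", ".txt")` is that it replaces every occurrence of `.rst` in the path, not just the extension. A small sketch of a safer alternative using `pathlib` follows. The helper name `txt_path_for` is hypothetical and not part of rst2txt or docutils.

```python
from pathlib import Path

def txt_path_for(rst_path):
    """Returns the path of the .txt file that sits next to an .rst file."""
    # with_suffix only swaps the final extension of the file name
    return str(Path(rst_path).with_suffix(".txt"))

# a directory name that contains ".rst" is left untouched
print(txt_path_for("docs.rst/index.rst"))
```

The result could be passed as `destination_path` in place of the `replace` call.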

Next up, let's grab all of the .rst files in the Diátaxis repository and convert them into .txt files.

#| output: false
import os

# NOTE: replace with your path to the Diátaxis repo
path_to_diataxis = '/Users/cck/repos/diataxis-documentation-framework'

# find all rst files in the docs repo
rst_files = [o for o in os.listdir(path_to_diataxis) if o.endswith('.rst')]

# convert all rst files to txt files
for rst in rst_files:
    convert_rst_to_txt(f'{path_to_diataxis}/{rst}')
/Users/cck/repos/diataxis-documentation-framework/colofon.rst:62: (ERROR/3) Unknown interpreted text role "ref".
/Users/cck/repos/diataxis-documentation-framework/colofon.rst:70: (ERROR/3) Unknown interpreted text role "doc".
/Users/cck/repos/diataxis-documentation-framework/index.rst:23: (ERROR/3) Unknown interpreted text role "doc".
/Users/cck/repos/diataxis-documentation-framework/index.rst:61: (ERROR/3) Unknown interpreted text role "ref".
/Users/cck/repos/diataxis-documentation-framework/index.rst:65: (ERROR/3) Unknown directive type "toctree".

.. toctree::
   :maxdepth: 1
   :hidden:
   :titlesonly:

   Home <self>
   Tutorials <tutorials>
   How-to guides <how-to-guides>
   Reference <reference>
   Explanation <explanation>

/Users/cck/repos/diataxis-documentation-framework/index.rst:76: (ERROR/3) Unknown directive type "toctree".

.. toctree::
   :maxdepth: 1
   :hidden:
   :titlesonly:

   Tutorials vs how-to guides <tutorials-how-to>
   Reference vs explanation <reference-explanation>
   needs
   compass
   quality
   Complex hierarchies <complex-hierarchies>
   how-to-use-diataxis

/Users/cck/repos/diataxis-documentation-framework/index.rst:89: (ERROR/3) Unknown directive type "toctree".

.. toctree::
   :maxdepth: 1
   :hidden:
   :titlesonly:

   adoption
   colofon
/Users/cck/repos/diataxis-documentation-framework/how-to-guides.rst:38: (ERROR/3) Unknown interpreted text role "ref".
/Users/cck/repos/diataxis-documentation-framework/quality.rst:200: (ERROR/3) Unknown interpreted text role "ref".
/Users/cck/repos/diataxis-documentation-framework/complex-hierarchies.rst:48: (ERROR/3) Error in "code-block" directive:
unknown option: "emphasize-lines".

..  code-block:: text
    :emphasize-lines: 8-11

    home                      <- landing page
        tutorial
            part 1
            part 2
            part 3
        how-to guides         <- landing page
            install           <- landing page
                locally
                Docker
                virtual machine
                Linux container
            deploy
            scale
        reference             <- landing page
            commandline tool
            available endpoints
            API
        explanation           <- landing page
            best practice recommendations
            security overview
            performance

/Users/cck/repos/diataxis-documentation-framework/complex-hierarchies.rst:216: (ERROR/3) Unknown interpreted text role "ref".
/Users/cck/repos/diataxis-documentation-framework/complex-hierarchies.rst:254: (ERROR/3) Unknown interpreted text role "ref".
/Users/cck/repos/diataxis-documentation-framework/tutorials-how-to.rst:34: (ERROR/3) Unknown interpreted text role "ref".
/Users/cck/repos/diataxis-documentation-framework/tutorials-how-to.rst:127: (ERROR/3) Unknown directive type "cssclass".

..  cssclass:: lined

/Users/cck/repos/diataxis-documentation-framework/tutorials-how-to.rst:129: (ERROR/3) Unknown directive type "grid".

..  grid:: 1 2 2 2
    :margin: 0
    :padding: 0
    :gutter: 3

    ..  grid-item::

        A tutorial’s purpose is **to help the pupil acquire basic competence**.

    ..  grid-item::

        A how-to guide’s purpose is **to help the already-competent user perform a particular task
        correctly**.

    ..  grid-item::

        A tutorial **provides a learning experience**. People learn skills through practical, hands-on experience. What matters
        in a tutorial is what the learner *does*, and what they experience while doing it.

    ..  grid-item::

        A how-to guide **directs the user’s work**.

    ..  grid-item::

        The tutorial follows a **carefully-managed path**, starting at a given point and working to
        a conclusion. Along that path, the learner must have the *encounters* that the lesson
        requires.

    ..  grid-item::

        The how-to guide aims for a successful *result*, and guides the user along the safest,
        surest way to the goal, but **the path can’t be managed**: it’s the real world, and
        anything could appear to disrupt the journey.

    ..  grid-item::

        A tutorial **familiarises the learner** with the work: with the tools, the language, the processes and the way that
        what they’re working with behaves and responds, and so on. Its job is to introduce them, manufacturing a structured,
        repeatable encounter with them.

    ..  grid-item::

        The how-to guide can and should **assume familiarity** with them all.

    ..  grid-item::

        The tutorial takes place in a **contrived setting**, a learning environment where as much as possible is set
        out in advance to ensure a successful experience.

    ..  grid-item::

        A how-to guide applies to the **real world**, where you have to deal
        with what it throws at you.

    ..  grid-item::

        The tutorial **eliminates the unexpected**.

    ..  grid-item::

        The how-to guide must **prepare for the unexpected**, alerting the user to its possibility
        and providing guidance on how to deal with it.

    ..  grid-item::

        A tutorial’s path follows a single line. **It doesn’t offer choices or alternatives**.

    ..  grid-item::

        A **how-to guide will typically fork and branch**, describing different routes
        to the same destination: *If this, then that. In the case of ..., an alternative approach
        is to…*

    ..  grid-item::

        A tutorial **must be safe**. No harm should come to the learner; it must always be possible to go back to the beginning
        and start again.

    ..  grid-item::

        A how-to guide **cannot promise safety**; often there’s only one chance to get it right.

    ..  grid-item::

        In a tutorial, **responsibility lies with the teacher**. If the learner gets into trouble, that's the teacher's problem
        to put right.

    ..  grid-item::

        In a how-to guide, **the user has responsibility** for getting themselves in and out of trouble.

    ..  grid-item::

        The learner **may not even have sufficient competence to ask the questions** that a tutorial answers.

    ..  grid-item::

        A how-to guide can assume that **the user is asking the right questions in the first
        place**.

    ..  grid-item::

        The tutorial is **explicit about basic things** - where to do things, where to put them, how to manipulate objects. It
        addresses the embodied experience - in our medical example, how hard to press, how to hold an implement; in a software
        tutorial, it could be where to type a command, or how long to wait for a response.

    ..  grid-item::

        A how-to guide relies on this as **implicit knowledge** - even bodily knowledge.

    ..  grid-item::

        A tutorial is **concrete and particular** in its approach. It refers to the specific, known, defined tools, materials,
        processes and conditions that we have carefully set before the learner.

    ..  grid-item::

        The how-to guide has to take a **general** approach: many of these things will be
        unknowable in advance, or different in each real-world case.

    ..  grid-item::

        The tutorial **teaches general skills and principles** that later could be applied to a
        multitude of cases.

    ..  grid-item::

        The user following a how-to guide is doing so in order to **complete a particular task**.

/Users/cck/repos/diataxis-documentation-framework/reference-explanation.rst:72: (ERROR/3) Unknown interpreted text role "ref".
/Users/cck/repos/diataxis-documentation-framework/needs.rst:110: (ERROR/3) Unknown interpreted text role "ref".
/Users/cck/repos/diataxis-documentation-framework/needs.rst:111: (ERROR/3) Unknown interpreted text role "ref".
/Users/cck/repos/diataxis-documentation-framework/needs.rst:112: (ERROR/3) Unknown interpreted text role "ref".
/Users/cck/repos/diataxis-documentation-framework/needs.rst:113: (ERROR/3) Unknown interpreted text role "ref".
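
The log above is long, but it has a regular shape: `path:line: (LEVEL/n) message`. As a rough sketch of "looking at the data", the snippet below tallies the messages by kind. The sample lines are copied from the output above with the directory prefix shortened. The regex and the `summarize_log` helper are our own assumptions about that format, not part of docutils.

```python
import re
from collections import Counter

# assumed shape of a message line: "<path>:<line>: (<LEVEL>/<n>) <message>"
LOG_LINE = re.compile(r'^(?P<path>.+?):(?P<line>\d+): \((?P<level>\w+)/\d\) (?P<msg>.+)$')

def summarize_log(lines):
    """Counts converter messages by their text."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[match.group('msg')] += 1
    return counts

sample_log = [
    'colofon.rst:62: (ERROR/3) Unknown interpreted text role "ref".',
    'colofon.rst:70: (ERROR/3) Unknown interpreted text role "doc".',
    'index.rst:65: (ERROR/3) Unknown directive type "toctree".',
    'how-to-guides.rst:38: (ERROR/3) Unknown interpreted text role "ref".',
]

for msg, n in summarize_log(sample_log).most_common():
    print(n, msg)
```

Roles such as `ref` and `doc` and the `toctree` directive come from Sphinx, which is why plain docutils reports them as unknown.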

The following subset of docs contains the information an LLM would need to write notebooks in the Diátaxis style.

# files with important content about writing docs
valid_files = [
    'compass.txt',
    'complex-hierarchies.txt',
    'explanation.txt',
    'how-to-guides.txt',
    'how-to-use-diataxis.txt',
    'needs.txt',
    'quality.txt',
    'reference-explanation.txt',
    'reference.txt',
    'tutorials-how-to.txt',
    'tutorials.txt',
]

Let's read in these text files and store them into a data dictionary.

# stores the text data
data = {}

# read in the relevant files
for f in valid_files:
    with open(f'{path_to_diataxis}/{f}', 'r') as file:
        data[f] = str(file.read())

In data, the file names are the keys and the values are the text of each file. This is a pretty standard pattern when loading ML data: features are loaded into a map (dictionary), indexed by some unique identifier.
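
In keeping with the "look at your data" rule, a quick size summary per document can catch empty or truncated files early. The sketch below uses a toy dictionary in place of `data` so it runs on its own. `summarize_docs` is a hypothetical helper name.

```python
def summarize_docs(docs):
    """Maps each file name to its (character count, word count)."""
    return {name: (len(text), len(text.split())) for name, text in docs.items()}

# toy stand-in for the `data` dictionary built above
toy_data = {
    'tutorials.txt': 'A tutorial is a lesson.',
    'reference.txt': '',
}

for name, (n_chars, n_words) in summarize_docs(toy_data).items():
    print(f"{name}: {n_chars} chars, {n_words} words")
```

A file that reports zero words is a sign the conversion step went wrong for that document.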

Cleaning up the text

Take a moment to look through the .txt files we've loaded, for example how-to-guides.txt. One thing should immediately stand out: there are some errors from the conversion process.

Specifically, there are some sections it wasn't able to parse. Here's an example of a broken parsing output:

<SYSTEM MESSAGE: ... Unknown interpreted text role "ref".>

Thankfully, the failure is isolated to a single line; the rest of the document is fine.

This means we have two kinds of text cleanup to do:

  1. Standard text cleanup and formatting.
  2. Errors from the .rst conversion process.

Standard text cleanup

There are a few best-practice steps for cleaning up text data:

  • Remove extra and trailing whitespace.
  • Remove special characters, like HTML tags.
  • Properly handle escaped characters (\t, \n, etc).

Other steps like lower-casing, removing numbers, or dropping typical stop-words are more task-specific.
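
As a rough illustration of the three steps listed above, here is a small standalone sketch that uses only the standard library. The helper name `basic_cleanup` and the simple tag-stripping regex are assumptions made for this example. A regex like this is fine for stray tags but is not a substitute for a real HTML parser.

```python
import html
import re

def basic_cleanup(text):
    """Applies generic cleanup: tags, escaped entities, then whitespace."""
    # drop anything that looks like an HTML tag, e.g. <b> or </p>
    text = re.sub(r'<[^>]+>', ' ', text)
    # turn HTML entities such as &amp; back into plain characters
    text = html.unescape(text)
    # collapse runs of spaces, tabs, and newlines, then trim the ends
    return re.sub(r'\s+', ' ', text).strip()

print(basic_cleanup("  <p>Tutorials &amp;\thow-to\n guides</p>  "))
```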

Let's define a clean_text function that cleans up a given string.

import re

def clean_text(text):
    """
    Collapses whitespace and strips header/footer marker characters.
    """
    # Replace any run of whitespace (spaces, tabs, newlines) with a single space
    text = re.sub(r'\s+', ' ', text)

    # Regex for the characters used in header and footer rules: *, =, ^
    pattern = r'[\*\=\^]+'
    # Remove those characters everywhere they appear in the text
    text = re.sub(pattern, '', text)

    # TODO: any other cleaning you can think of?
    
    return text

Let's call this cleanup function on the raw text files.

# cleaning up the text
data = {k: clean_text(v) for k, v in data.items()}
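
One thing to watch for: `clean_text` deletes every `*`, `=`, and `^` in the document, including ones that appear inside sentences. If that turns out to be too aggressive, a gentler sketch is to remove only the lines made up entirely of those characters before collapsing whitespace. The helper `strip_heading_rules` below is hypothetical, and the set of rule characters is an assumption based on the pattern used above.

```python
import re

# a line consisting only of repeated *, =, ^ (optionally surrounded by spaces)
HEADING_RULE = re.compile(r'^[ \t]*[*=^]{2,}[ \t]*$', flags=re.MULTILINE)

def strip_heading_rules(text):
    """Removes heading underline/overline rows, then collapses whitespace."""
    text = HEADING_RULE.sub('', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "How-to guides\n=============\n\nUse x = y * 2 as an example.\n"
print(strip_heading_rules(sample))
```

The rule line is removed while the `=` and `*` inside the sentence survive.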

Special text cleanup

Now we can handle the errors that popped up when converting the .rst documents. Let's split the documents into lists of sentences, so we can find the incorrect "SYSTEM MESSAGE" lines.

# split the data into a list of sentences
def split_sentences(text):
    "Turns documents into a list of sentences."
    return [o for o in text.split('. ') if o]

split_data = {k: split_sentences(v) for k, v in data.items()}

Let's look at one of the sentences in the how-to-guides.txt file.

# Looking at an example sentence
split_data['how-to-guides.txt'][5]
'How-to guides matter not just because users need to be able to accomplish things: the list of how-to guides in your documentation helps frame the picture of what your product can actually do'
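
Splitting on `'. '` is a deliberately simple heuristic: it drops the trailing period and ignores sentences that end in `?` or `!`. A slightly more careful sketch, still standard-library only, splits on the whitespace that follows sentence-ending punctuation. The function name `split_sentences_regex` is ours, and abbreviations such as "e.g." will still be split incorrectly.

```python
import re

def split_sentences_regex(text):
    """Splits text after '.', '!' or '?' while keeping the punctuation."""
    # the lookbehind keeps the punctuation attached to its sentence
    parts = re.split(r'(?<=[.!?])\s+', text)
    return [p for p in parts if p]

print(split_sentences_regex("What is a tutorial? A lesson. Try it!"))
```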

How many processing errors are in this document?

# counting the number of system messages in how-to-guides.txt
doc = 'how-to-guides.txt'

def count_errors(sentences):
    "Counts the number of sentences that contain a system message."
    return sum(1 for o in sentences if '<SYSTEM MESSAGE:' in o)

count_errors(split_data[doc])
1

Let's count the errors in all of the documents.

# checking the full count of system errors
for f in valid_files:
    print(f"NumErrors {f}: {count_errors(split_data[f])}")
NumErrors compass.txt: 0
NumErrors complex-hierarchies.txt: 3
NumErrors explanation.txt: 0
NumErrors how-to-guides.txt: 1
NumErrors how-to-use-diataxis.txt: 0
NumErrors needs.txt: 1
NumErrors quality.txt: 1
NumErrors reference-explanation.txt: 1
NumErrors reference.txt: 0
NumErrors tutorials-how-to.txt: 3
NumErrors tutorials.txt: 0

Not too bad, but still something we want to clean up.

def clean_rst_errors(txt):
    "Only returns items without system messages."
    return [o for o in txt if '<SYSTEM MESSAGE:' not in o]  

# our cleaned up data split into sentences
clean_data = {k: clean_rst_errors(v) for k, v in split_data.items()}

We can then check if the system messages are gone:

# checking the full count of system errors
for f in valid_files:
    print(f"Clean NumErrors {f}: {count_errors(clean_data[f])}")
Clean NumErrors compass.txt: 0
Clean NumErrors complex-hierarchies.txt: 0
Clean NumErrors explanation.txt: 0
Clean NumErrors how-to-guides.txt: 0
Clean NumErrors how-to-use-diataxis.txt: 0
Clean NumErrors needs.txt: 0
Clean NumErrors quality.txt: 0
Clean NumErrors reference-explanation.txt: 0
Clean NumErrors reference.txt: 0
Clean NumErrors tutorials-how-to.txt: 0
Clean NumErrors tutorials.txt: 0
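
Dropping the whole sentence also discards any legitimate text that happens to share a chunk with the error. An alternative sketch is to delete just the message itself before splitting. This assumes every message starts with `<SYSTEM MESSAGE:` and ends at the next `>`, as in the example shown earlier. The helper name `strip_system_messages` is hypothetical.

```python
import re

# assumed message shape: "<SYSTEM MESSAGE: ...>" with no ">" inside the message
SYSTEM_MESSAGE = re.compile(r'<SYSTEM MESSAGE:[^>]*>')

def strip_system_messages(text):
    """Removes converter error markers but keeps the surrounding text."""
    text = SYSTEM_MESSAGE.sub(' ', text)
    # tidy up the extra spaces left behind by the removal
    return re.sub(r'\s+', ' ', text).strip()

sample = 'See the guide <SYSTEM MESSAGE: Unknown interpreted text role "ref".> for details.'
print(strip_system_messages(sample))
```

Whether to drop the sentence or only the marker is a judgment call; the marker-only version keeps more text but may leave a sentence with a missing cross-reference.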

Embedding the Diátaxis data

Now we have a set of clean sentences ready for embedding. Text embeddings are usually placed in vector store databases. There are many startups providing this service, or we could spin up our own. For now, we'll use the chromadb embedding storage.

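# install chromadb inside llm-env
pip install chromadb

import chromadb
chroma_client = chromadb.Client()

# name of our collection (must be defined before it is used below)
coll_name = 'diaxtaxis_docs'

# remove any collection left over from a previous run of this notebook
try:
    chroma_client.delete_collection(name=coll_name)
except Exception:
    pass  # nothing to delete on the first run

# create a collection
collection = chroma_client.create_collection(name=coll_name)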

Now we can store the embeddings.

# step through our documents and sentences
for fid, sentences in clean_data.items():

    # metadata for the files
    metadatas = [{"source": fid}] * len(sentences)

    # unique id for each sentence: file name plus sentence index
    ids = [f"{fid}_{i}" for i in range(len(sentences))]

    # add the documents
    collection.add(
        documents=sentences,
        metadatas=metadatas,
        ids=ids,
    )
/Users/cck/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [01:42<00:00, 814kiB/s] 
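
The loop above builds three parallel lists that need to line up one-to-one, and the ids should be unique across the whole collection. A small standalone sketch of that bookkeeping, with a sanity check, is shown below. It uses a toy dictionary instead of `clean_data` and does not call chromadb. `build_records` is a hypothetical helper.

```python
def build_records(docs):
    """Flattens {file: [sentences]} into parallel lists for a vector store."""
    documents, metadatas, ids = [], [], []
    for fid, sentences in docs.items():
        for i, sentence in enumerate(sentences):
            documents.append(sentence)
            metadatas.append({"source": fid})  # one dict per sentence
            ids.append(f"{fid}_{i}")           # file name + position
    return documents, metadatas, ids

toy_clean_data = {
    'tutorials.txt': ['A tutorial is a lesson', 'It must be safe'],
    'reference.txt': ['Reference describes the machinery'],
}

documents, metadatas, ids = build_records(toy_clean_data)
assert len(documents) == len(metadatas) == len(ids)
assert len(set(ids)) == len(ids), "ids must be unique"
print(ids)
```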

Querying the database

Now we have a stored set of embeddings that we can search with queries. Let's try to find some relevant sentences for writing a notebook.

# example setup
example_prompt = "Writing code to pre-process and cleanup text."

results = collection.query(
    query_texts=[example_prompt],
    n_results=5
)
results
{'ids': [['how-to-use-diataxis.txt_38',
   'tutorials.txt_104',
   'reference.txt_43',
   'tutorials-how-to.txt_62',
   'reference.txt_30']],
 'distances': [[1.1993290185928345,
   1.2605922222137451,
   1.2740986347198486,
   1.2925323247909546,
   1.304250955581665]],
 'metadatas': [[{'source': 'how-to-use-diataxis.txt'},
   {'source': 'tutorials.txt'},
   {'source': 'reference.txt'},
   {'source': 'tutorials-how-to.txt'},
   {'source': 'reference.txt'}]],
 'embeddings': None,
 'documents': [["Working like this helps reduce the stress of one of the most paralysing and troublesome aspects of the documentation-writer's work: working out what to do",
   'Provide minimal explanation of actions in the most basic language possible',
   'List commands, options, operations, features, flags, limitations, error messages, etc',
   'You already know these processes',
   'Do nothing but describe  Technical reference has one job: to describe, and to do that clearly, accurately and comprehensively']],
 'uris': None,
 'data': None}
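
The raw result is a dictionary of nested lists, with one inner list per query text. To make it easier to read, the sketch below zips the parallel lists into ranked rows. It runs on a trimmed copy of the output above rather than on a live collection, and `flatten_results` is a hypothetical helper name.

```python
def flatten_results(results, query_index=0):
    """Turns one query's results into (distance, source, document) rows."""
    rows = zip(
        results['distances'][query_index],
        results['metadatas'][query_index],
        results['documents'][query_index],
    )
    return [(dist, meta['source'], doc) for dist, meta, doc in rows]

# trimmed copy of the query output shown above (first two hits only)
sample_results = {
    'ids': [['how-to-use-diataxis.txt_38', 'tutorials.txt_104']],
    'distances': [[1.1993290185928345, 1.2605922222137451]],
    'metadatas': [[{'source': 'how-to-use-diataxis.txt'}, {'source': 'tutorials.txt'}]],
    'documents': [[
        "Working like this helps reduce the stress of one of the most paralysing and troublesome aspects of the documentation-writer's work: working out what to do",
        'Provide minimal explanation of actions in the most basic language possible',
    ]],
}

for dist, source, doc in flatten_results(sample_results):
    print(f"{dist:.3f}  {source}: {doc[:60]}")
```

Smaller distances indicate closer matches, so the rows come back ordered from most to least similar.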

Conclusion

This notebook took the first steps to augment an LLM with extra knowledge. We embedded the Diátaxis documentation to eventually use it for Retrieval-Augmented Generation (RAG). Later on, we will also use other LLMs to generate Question-and-Answer pairs based on these documents, and use them to fine-tune a model.

References

:::: {#refs} Procida D. Diátaxis documentation framework ::::