Part 6 of 8
Preparing text data for LLM applications.
Welcome to the sixth lesson in the course. Let's recap our progress: so far, we've used LLMs off the shelf, exactly as they come. Now we take our first steps towards augmenting an LLM ourselves.
Specifically, we will augment an LLM with the Diátaxis website. Diátaxis is a framework and approach for writing technical documentation. Our goal is to give an LLM knowledge about Diátaxis and use it to help us write better notebooks.
Let's start with our running notebook best practice:
%load_ext autoreload
%autoreload 2
The Diátaxis docs will be the source of knowledge for our LLM. The pages are available in a repo as reStructuredText files with the extension .rst.
Link to Diátaxis code repo
# clone the Diátaxis repo
git clone https://github.com/evildmp/diataxis-documentation-framework
It is rare that files come in exactly the right format for our ML algorithms. Pre-processing the input is one of the most important steps in ML, and one that often gets overlooked. It is also a great place to follow one of the Golden Rules of ML: *always* look at your data.
All too often, folks jump right into the code and start training models. That is the fun part, to be sure, but we can learn a great deal about both our problem and the domain itself by first looking at the data. Without carefully inspecting the data, you are basically flying blind. It is only the sheer and overwhelming power of ML and LLMs that lets us get away with it (sometimes), but that doesn't mean we should.
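As a first look at the data, here is a minimal sketch that prints the start of one of the raw .rst files; the path and filename below are placeholders for your local clone of the repo.
# a first look at the raw data: print the start of one .rst file
# NOTE: the path below is a placeholder; point it at your local clone
with open('/Users/cck/repos/diataxis-documentation-framework/how-to-guides.rst', 'r') as f:
    for line in f.readlines()[:15]:
        print(line.rstrip())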
With that said, here we only have to do a little bit of pre-processing. We need to convert the Diátaxis .rst files into .txt files, then clean up the text a bit.
:::: callout-note
Make sure you are inside of the llm-env virtual environment.
::::
Run the cell below to install the rst processing libraries.
# installing the rst to txt converter and writer
pip install rst2txt docutils
Next we can modify the example in the rst2txt documentation to write a function that turns an .rst file into a .txt file.
from docutils.core import publish_file
import rst2txt
def convert_rst_to_txt(filename):
    """
    Turns an rst file to a txt file with the same name.
    """
    with open(filename, 'r') as source:
        publish_file(
            source=source,
            destination_path=filename.replace(".rst", ".txt"),
            writer=rst2txt.Writer()
        )
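Before converting everything, it can help to sanity-check the function on a single file; the path below is just a placeholder for wherever you cloned the repo.
# sanity check: convert a single file and confirm a .txt appears next to the .rst
convert_rst_to_txt('/Users/cck/repos/diataxis-documentation-framework/tutorials.rst')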
Next up, let's grab all of the .rst files in the Diátaxis repository and convert them into .txt files.
#| output: false
import os
# NOTE: replace with your path to the Diátaxis repo
path_to_diataxis = '/Users/cck/repos/diataxis-documentation-framework'
# find all rst files in the docs repo
rst_files = [o for o in os.listdir(path_to_diataxis) if o.endswith('.rst')]
# convert all rst files to txt files
for rst in rst_files:
    convert_rst_to_txt(f'{path_to_diataxis}/{rst}')
The following subset contains the docs with the information an LLM would need to write notebooks in the Diátaxis style.
# files with important content about writing docs
valid_files = [
'compass.txt',
'complex-hierarchies.txt',
'explanation.txt',
'how-to-guides.txt',
'how-to-use-diataxis.txt',
'needs.txt',
'quality.txt',
'reference-explanation.txt',
'reference.txt',
'tutorials-how-to.txt',
'tutorials.txt',
]
Let's read in these text files and store them in a data dictionary.
# stores the text data
data = {}
# read in the relevant files
for f in valid_files:
    with open(f'{path_to_diataxis}/{f}', 'r') as file:
        data[f] = str(file.read())
In data, the file names are the keys and the values are the text in the files. This is a pretty standard pattern when loading ML data: features are loaded into a map (dictionary), indexed by some unique identifier.
Take a moment to look through the .txt files we've loaded, for example how-to-guides.txt. One thing should immediately stand out: there are some errors from the conversion process.
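One quick way to do that inspection right in the notebook, as a rough sketch, is to list the loaded files and print the start of one document:
# list the loaded files and peek at the start of one of them
print(list(data.keys()))
print(data['how-to-guides.txt'][:500])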
Specifically, there are some sections it wasn't able to parse. Here's an example of a broken parsing output:
<SYSTEM MESSAGE: ... Unknown interpreted text role "ref".>
Thankfully each error is isolated to the single line that failed to parse; the rest of the document is fine.
This means we have two kinds of text cleanup to do:

1. General text cleanup (whitespace, stray markup characters).
2. Fixing the errors left behind by the .rst conversion process.

There are a few best-practices steps to cleaning up text data:

- Removing extra whitespace characters (\t, \n, etc).
- Removing special characters and leftover markup.

Other steps like lower-casing, removing numbers, or dropping typical stop-words are more task-specific.
Let's define a clean_text function that cleans up a given string.
import re
def clean_text(text):
    """
    Cleans up whitespace and leftover rst header/footer characters in the text.
    """
    # Replace runs of whitespace with a single space
    text = re.sub(r'\s+', ' ', text)
    # Define the regex pattern for the rst header and footer characters
    pattern = r'[\*\=\^]+'
    # Substitute the special sequences with an empty string
    text = re.sub(pattern, '', text)
    # TODO: any other cleaning you can think of?
    return text
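To see what this does, here is a small, purely illustrative example in the style of an rst heading; the string is made up for demonstration.
# a made-up example in the style of an rst heading
sample = "How-to guides\n=============\n\nHow-to guides are   directions."
print(clean_text(sample))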
Let's call this cleanup function on each of the raw text files.
# cleaning up the text
data = {k: clean_text(v) for k, v in data.items()}
Now we can handle the errors that popped up when converting the .rst documents. Let's split each document into a list of sentences, so we can find the broken "SYSTEM MESSAGE" lines.
# split the data into a list of sentences
def split_sentences(text):
    "Turns documents into a list of sentences."
    return [o for o in text.split('. ') if o]
split_data = {k: split_sentences(v) for k, v in data.items()}
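Before looking at individual sentences, a quick optional check of how many sentences each document produced can catch obviously broken splits:
# how many sentences did each document produce?
for k, v in split_data.items():
    print(f"{k}: {len(v)} sentences")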
Let's look at one of the sentences in the how-to-guides.txt file.
# Looking at an example sentence
split_data['how-to-guides.txt'][5]
How many processing errors are in this document?
# counting the number of system messages in how-to-guides.txt
doc = 'how-to-guides.txt'

def count_errors(text):
    "Counts the number of system messages in the text."
    return sum(1 for o in text if '<SYSTEM MESSAGE:' in o)

count_errors(split_data[doc])
Let's count the errors in all of the documents.
# checking the full count of system errors
for f in valid_files:
    print(f"NumErrors {f}: {count_errors(split_data[f])}")
Not too bad, but still something we want to clean up.
def clean_rst_errors(txt):
    "Only returns items without system messages."
    return [o for o in txt if '<SYSTEM MESSAGE:' not in o]
# our cleaned up data split into sentences
clean_data = {k: clean_rst_errors(v) for k, v in split_data.items()}
We can then check if the system messages are gone:
# checking the full count of system errors
for f in valid_files:
    print(f"Clean NumErrors {f}: {count_errors(clean_data[f])}")
Now we have a set of clean sentences ready for embedding. Text embeddings are usually stored in vector databases. There are many startups providing this as a service, or we could spin up our own. For now, we'll use chromadb for embedding storage.
# install chromadb inside llm-env
pip install chromadb
import chromadb

chroma_client = chromadb.Client()

# create a collection (delete any existing one first so the notebook can be re-run)
coll_name = 'diataxis_docs'
try:
    chroma_client.delete_collection(name=coll_name)
except Exception:
    pass
collection = chroma_client.create_collection(name=coll_name)
Now we can store the embeddings.
# step through our documents and sentences
for fid, sentences in clean_data.items():
    # metadata for the files
    metadatas = [{"source": fid}] * len(sentences)
    # unique id for each sentence in the file
    ids = [f"{fid}_{i}" for i in range(len(sentences))]
    # add the documents to the collection
    collection.add(
        documents=sentences,
        metadatas=metadatas,
        ids=ids,
    )
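As an optional sanity check, chromadb collections have a count() method; the total should match the number of sentences we added.
# sanity check: the collection should hold one entry per sentence
print(collection.count())
print(sum(len(v) for v in clean_data.values()))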
Now we have a stored set of embeddings that we can search with queries. Let's try to find some sentences relevant to writing a notebook.
# example setup
example_prompt = "Writing code to pre-process and cleanup text."
results = collection.query(
    query_texts=[example_prompt],
    n_results=5
)
results
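The query result is a dictionary of parallel lists (ids, documents, metadatas, distances), with one inner list per query text. As a rough sketch, we can print each retrieved sentence next to its source file:
# print each retrieved sentence alongside the file it came from
for sentence, meta in zip(results['documents'][0], results['metadatas'][0]):
    print(f"{meta['source']}: {sentence}")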
This notebook took the first steps to augment an LLM with extra knowledge. We embedded the Diátaxis documentation to eventually use it for Retrieval-Augmented Generation (RAG). Later on, we will also use other LLMs to generate Question-and-Answer pairs based on these documents, and use them to fine-tune a model.
:::: {#refs}
Procida, D. Diátaxis documentation framework.
::::