There were a few times when I had to redact a few details from a document before printing. This made me wonder whether the task could be automated. After all, I was redacting names of people, institutions and places. These can be searched for easily, and their background can be set to black to redact them. So, I decided to try this out in Python.
In Python, there is a package called python-docx which can create and update Microsoft Word (.docx) files. It is easy to add paragraphs with multiple runs and varying styles to an existing or new document.
A document is made up of headers, paragraphs and images. Paragraphs can have one or more runs. A run is a span of inline content in a paragraph, which can be text, pictures or other graphics. We are interested in modifying runs with text.
For the purpose of redacting, we need to search for the pattern(s) to be redacted in run-text, split a run according to the location of the pattern(s) and update its style, namely the background color.
In this article, we will go through a script that takes a list of patterns to redact in a document, redacts them, and saves the result as another document.
While working on this script I got stuck at splitting the runs and inserting the split runs at the same index as the original run. During my research, I came across python-docx-split-run, a repo by alllexx88 on GitHub, which had implementations of all the functionality needed for this project. Due credit to alllexx88 for the work.
alllexx88/python-docx-split-run: python-docx run manipulation.
The run_tools.py script in python-docx-split-run provides a lot of functionality, from which I have used the split_run_by function, which makes use of
Here is the image of the document that I will use to redact:
The patterns we will be redacting are Python, Kotlin, Java, Go, Swift, TypeScript, Ruby.
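One way to search for all of these patterns in a single pass is to combine them into one regular expression. The sketch below is an assumption on my part, not necessarily how the script does it: the helper name `compile_patterns` is hypothetical, and the word boundaries are added so that short patterns such as "Go" or "Java" do not match inside longer words.

```python
import re

patterns = ["Python", "Kotlin", "Java", "Go", "Swift", "TypeScript", "Ruby"]

def compile_patterns(patterns):
    """Combine literal patterns into one regex with word boundaries (hypothetical helper)."""
    # Longest first, so a longer pattern wins over a shorter one that is its prefix.
    ordered = sorted(patterns, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    return re.compile(rf"\b(?:{alternation})\b")

regex = compile_patterns(patterns)
text = "Go and Java are popular; JavaScript and Google are not in the list."
found = [m.group(0) for m in regex.finditer(text)]
print(found)
```

Without the `\b` anchors, "JavaScript" and "Google" would be partially redacted, which may or may not be what you want.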
Now, let's see what a script for redacting a Word document looks like and go through the code.
Hopefully, the comments in the code are good enough to help you understand it. The gist is that we loop through the paragraphs and the runs in each paragraph. The runs are looped through in reverse order because the runs we split get added after the run_index currently being processed; going backwards means we neither skip a run nor process one twice. We search for the patterns in the run-text and get the indices of the matches. The runs are then split using split_run_by and highlighted as necessary.
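The reverse-order trick is easier to see with plain lists. In this sketch a list of strings stands in for the runs of a paragraph, and the hypothetical `split_all` helper splits each string around the first occurrence of a word; no python-docx objects are involved.

```python
def split_all(runs, word):
    """Split every string in `runs` around the first `word` in it, in place."""
    # Walk from the last index to the first: pieces inserted after the current
    # index only shift elements that have already been processed.
    for index in reversed(range(len(runs))):
        before, found, after = runs[index].partition(word)
        if not found:
            continue
        pieces = [piece for piece in (before, found, after) if piece]
        runs[index:index + 1] = pieces  # replace one item with its pieces
    return runs

runs = ["I like Go.", "Go is fast", "No match here"]
result = split_all(runs, "Go")
print(result)
```

Iterating forward over the same list would require adjusting the index after every split to avoid visiting the newly inserted pieces.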
Sometimes the text color is not really black even if it looks black; it might be a very dark gray. In such cases, if you set the background color to black and print, the text might still be legible. Therefore, we also set the text color to the same color as the background color.
Here is the output from the
Let's go through the functions used in
The redact_colors function returns the colors to set as the text color and the background color.
process_matches takes the list of matches from a regex call and returns a list of indices at which to split the run, in a form accepted by the split_run_by function. The boolean list of highlights is of the same length as matches and holds whether to highlight the run.
The split_run_by function does not need the list of indices to start with 0 and end with the length of the run-text; it handles that on its own. However, the implementation of process_matches adds them anyway to get the highlights list correct.
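The idea can be sketched as follows. This is my own approximation, not the script's process_matches: the name `process_matches_sketch` and its signature are hypothetical, and the exact format that split_run_by expects is not shown here. The sketch returns split indices that start at 0 and end at the text length, plus one highlight flag per resulting segment.

```python
import re

def process_matches_sketch(matches, text_length):
    """Turn regex matches into split indices plus one highlight flag per segment."""
    indices = [0]
    highlights = []
    for match in matches:
        start, end = match.span()
        if start > indices[-1]:
            # Plain text between the previous boundary and this match.
            indices.append(start)
            highlights.append(False)
        # The matched text itself.
        indices.append(end)
        highlights.append(True)
    if indices[-1] < text_length:
        # Trailing plain text after the last match.
        indices.append(text_length)
        highlights.append(False)
    return indices, highlights

text = "Go and Swift"
matches = list(re.finditer(r"Go|Swift", text))
indices, highlights = process_matches_sketch(matches, len(text))
segments = [text[a:b] for a, b in zip(indices, indices[1:])]
print(list(zip(segments, highlights)))
```

Including 0 and the text length in the indices is what keeps the flags aligned with the segments: every pair of neighboring indices corresponds to exactly one flag.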
Now, drawing black rectangles to hide some info seems redundant, and if you would like to save ink and not print redundant rectangles, you can set the redaction color to white. The image below shows the output with white as the redaction color. A printer usually optimizes ink usage by not printing white in the region where there is text.
Now, let's take the redaction to the next level. The person reading the redacted document might have some subject knowledge about the document and might be able to decipher the redacted text by looking at the length of the rectangles and the surrounding context.
To guard against this, we can change the redact_document function to replace the patterns to be redacted with a constant-length string. Here, for simplicity, I have used a string of 10 hash characters (#). The function redact_document_with_replace below implements the replacement and redaction mechanism.
The only difference between the redact_document and redact_document_with_replace functions is that the pattern is replaced and then redacted. The function redact_document_with_replace takes an additional parameter, replace_with, and has additional code at line numbers 35 and 47 to replace the pattern.
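The replacement step on its own can be sketched on a plain string. The `mask_text` helper below is hypothetical and works on text rather than on runs; only the `replace_with` parameter name and the ten-hash default come from the description above.

```python
import re

def mask_text(text, patterns, replace_with="#" * 10):
    """Replace every occurrence of any pattern with a fixed-length placeholder."""
    regex = re.compile("|".join(re.escape(p) for p in patterns))
    # A function replacement avoids any backslash processing of `replace_with`.
    return regex.sub(lambda _match: replace_with, text)

masked = mask_text("Go and TypeScript", ["Go", "TypeScript"])
print(masked)
```

Both words end up the same width, so the length of the blacked-out region no longer hints at what was underneath.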
Below is the output where the above code is modified for demo to use black as the text color and yellow as background color to show the replacement. You can see the style stays intact.
And below is the actual redaction output.
One important point to remember: if, for example, you want to redact the name “Firstname Lastname”, add “Firstname” and “Lastname” as two separate patterns instead of one combined pattern. This keeps the regular expression simple and also redacts places where only the first name or the last name is used.
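A quick comparison on a plain string shows the difference. The sample sentence is made up, and the patterns are escaped literals as an assumption about how they would be compiled.

```python
import re

text = "Firstname Lastname signed it. Later, Lastname left; Firstname   Lastname stayed."

# One combined pattern: only matches the exact "Firstname Lastname" spelling.
combined = re.compile(re.escape("Firstname Lastname"))
# Two separate patterns: match each name wherever it appears.
separate = re.compile("|".join(re.escape(p) for p in ["Firstname", "Lastname"]))

combined_hits = combined.findall(text)
separate_hits = separate.findall(text)
print(len(combined_hits), len(separate_hits))
```

The combined pattern misses the lone "Lastname" and the occurrence with extra spaces between the names, while the separate patterns catch every mention.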
The whole code can be found in the repo below where I have merged both the functions into one and implemented a command line access.
arccoder/redact-docx: Python code to redact contents in a Word (docx) document.
Here is the interface to the script.
This project was implemented using Python 3.8.3 and the requirements file is available in the repo with the documents whose images you see in this blog.
Hope you find this article helpful. Visit the repo and if you happen to try it and find any errors or bugs, please submit an issue on GitHub and I will try to resolve it as soon as possible.