Change detection in semi-structured documents

In digital forensics investigations it is sometimes required to sift and analyse a large number of documents submitted as evidence. A computer system could assist an investigator in establishing the exact derivation of changes in a document by summarising tracked changes between two documents. This project was concerned with target documents encoded in a docx format ‒ the default file format of Microsoft Word ‒ and assumed that the documents had text and track change annotations.

A tool, named DocxDiff, was developed in order to compare documents. The developed tool is easy to use and requires minimal user interaction. DocxDiff utilises a change-detection algorithm that is specifically designed to work on documents that are internally represented as hierarchical XML trees, as found in docx files. The algorithm is efficient and is capable of providing accurate deltas. The change-detection algorithm ignores irrelevant deltas, such as nodes that are represented differently, but semantically equivalent. XML nodes offer a shorthand writing notation known as self-closing tags. If they are not considered equal to the longer version, they have to be worked on at a later stage. This increases the size of the output delta. DocxDiff uses the change detection algorithm to only inspect the parts of the documents that are flagged as ‘modified’. This avoids the need to wholly traverse the document, thus significantly improving its performance when large documents would be the target. DocxDiff uses the unique paths extracted from a delta to access changed nodes in the document, and determine their behaviour between the previous and a later document revision. DocxDiff classes the track changes into four categories, namely: inserted text, deleted text, approved/rejected text marked for insertion, and approved/rejected text marked for deletion.

It would be necessary for the investigator to select two documents suspected of being related to one another. DocxDiff would perform a number of preprocessing steps to ensure that the selected documents have a relatively high chance of being related. This step removes the need for the investigator to manually open and compare the content of the documents to identify similarity. Since the investigator would not need to know the sequence of events, it would be possible to input the documents in any order. DocxDiff would determine the sequence of events automatically and, if necessary, swap the supplied order. DocxDiff encapsulates the delta of the two selected documents and presents any altered track-change state to the investigator in a way that would be easy to understand and manage. This is a key factor, as an investigator might need to obtain a summary of other similar documents.

The investigator would be expected to use the presented summary to determine whether to shortlist the documents for further analysis. During testing, DocxDiff achieved a very good track-change detection accuracy. Detected track changes were always classified into the right category. Due to the limitations present in the docx format itself, some documents that were resaved at a later stage tended to cause the pre-processing validation to fail. This caused DocxDiff to mark these documents as being non-related.

Figure 1. A sample document and associated before and after tracked changes
Student: Wayne Bonnici
Course: B.Sc. IT (Hons.) Software Development
Supervisor: Dr Joseph G. Vella