Textblock™ Detection

Textblocks are useful for many tasks, including tracking the dissemination of information, measuring influence between co-workers, and establishing authorship of anonymous documents.

Textblocks are defined in US patent #7,143,091. A textblock is any significant portion of text that is reused between documents. It may be short or long, and its contents may have changed here and there. As long as it is still recognizable as being the same piece of text from the same author, it counts. Examples range from stock text inserted as part of a template, to a particularly good description of a product that gets copied into each piece of marketing material.

While the idea is quite simple, detecting textblocks within large sets of data is difficult. We don’t know how long they are, where they are relative to other text within a document, or how the textblocks have been changed by subsequent edits: so how do we know what to compare to what?

Cataphora has developed a number of novel techniques for this purpose. In one, a training corpus of text is modeled as a graph showing transitions between text features. The graph is accumulated across all documents additively. The software then looks for patterns associated with the same transitions occurring in the same order repeatedly. Transformations on the graph produce filters, which can be used to pinpoint such regions in additional documents.

Read about other Cataphora technologies: