by Mark Donohoe
Last week, I was waiting in line to go to a playoff hockey game. The doors were due to open in twenty minutes and the line was long. It was also raining, so I started talking with the woman in front of me to pass the time. She was a teacher at the university. She shared some stories about the classes she taught. She asked me what I did and I talked about working on PDF software. When I described the process of redacting PDF files, she said she was familiar with redaction, having used the software at her previous job. I am always curious to hear about users' experiences with software, so I asked her what kind of files she had to redact. She ended up describing one of the more interesting, but perhaps less obvious, features of Redax.
She worked with files that originated as data from utility bills, where the data is predictably formatted. The files contained the electric bills for thousands of customers. The goal was to remove the name and address information for each customer and leave behind the electrical usage data. From the redacted versions of the files, they had another process for scraping the month and year data along with the usage numbers. At this point, the doors to the venue opened and the line started to move, but I kept thinking about how she achieved the goal. What follows is how she used templates in Redax to mark up well-formed, repeating data.
She used a facility in Redax called templates. She started by opening a sample data file. Each of the PDFs contained a block of customer records, each one page long. She created redax boxes around the name and address regions on the page. There were two places where this information existed. She only marked the first page, since each page had the information to be redacted in the same positions on the page.
To create the template file, she selected Redax | Export Template File. A dialog came up requesting the number of pages in the template. In her case, a customer record always comprised a single page, so she changed the Page Repeat Increment value from the page count of the file to 1. (If each customer record took up 3 pages, she would have entered 3.) After she okayed that dialog, the next dialog prompted for a file name to save to.
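One way to picture what the Page Repeat Increment does is the short Python sketch below. It is only a model of the idea, not how Redax is implemented: the `Box` class and the `expand_template` function are hypothetical names made up for illustration, and pages are counted from 0. Boxes drawn on the pages of the template are stamped onto every later page that sits a whole number of increments further into the file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A marked region on one page of the template (hypothetical model)."""
    template_page: int  # 0-based page within the template
    x: float
    y: float
    width: float
    height: float


def expand_template(boxes, repeat_increment, page_count):
    """Map each page index of a document to the boxes that land on it.

    A box on template page p is repeated on pages p, p + increment,
    p + 2 * increment, and so on, up to the end of the document.
    """
    if repeat_increment < 1:
        raise ValueError("repeat_increment must be at least 1")
    marks = {page: [] for page in range(page_count)}
    for box in boxes:
        if not 0 <= box.template_page < repeat_increment:
            raise ValueError("box lies outside the template pages")
        for page in range(box.template_page, page_count, repeat_increment):
            marks[page].append(box)
    return marks
```

With two boxes on template page 0 and an increment of 1, a 100-page file ends up with two boxes on every page. With an increment of 3, a box on template page 0 would land on pages 0, 3, 6, and so on.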
To test the template, she closed the current file, discarding the changes. Then she opened a small file with just 100 records. Her goal was to get an idea of how fast redacting 100 pages would be. She selected Redax | Import Template File and picked the template file she had just saved. Though it happened fast enough that she didn't really notice it, Redax added redax boxes on each page in the same places she had originally marked. She turned on Page Thumbnails and could see at a glance that all of the pages in the file had been marked. She redacted the file via Redax | Redact Document. To verify the result, she could have done several things.
She might have done a quick search for one of the customer names from the original file. That search would have come up empty, but she might not have been satisfied with that. Since each page contained at least one zip code, she could have chosen Redax | Find Using Patterns, added PostalCode USA, and hit OK. When the resulting alert indicated that 0 pages had been annotated, she would have been more confident that she had gotten the desired results.
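The same kind of pattern check can be approximated outside Redax on text extracted from the redacted file. The sketch below rests on assumptions: the regular expression is a common five-digit ZIP or ZIP+4 pattern, not the definition Redax uses for PostalCode USA, and `pages_with_zip` is a hypothetical helper that expects one string of text per page.

```python
import re

# Five digits, optionally followed by a hyphen and four more digits.
# This is an assumed pattern, not the one Redax ships with.
ZIP_PATTERN = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def pages_with_zip(page_texts):
    """Return the 0-based indexes of pages whose text still contains a ZIP-like number."""
    return [
        index
        for index, text in enumerate(page_texts)
        if ZIP_PATTERN.search(text)
    ]
```

An empty list plays the role of the "0 pages annotated" alert. A pattern this simple will also flag any other standalone five-digit number, such as a large usage figure, so a hit is a reason to look at the page and not proof that an address survived.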
Using Redax | Reports | Export Document Text ..., she could have exported all the text from the original file, then exported the text again after redacting the file. Using WinMerge or WinDiff, or diff (in a Linux environment), she could have compared the two exports; the difference between them should contain the customer data from the original file.
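For anyone who would rather script that comparison than open a merge tool, here is a minimal sketch using Python's standard `difflib` module. The `removed_lines` function is a hypothetical helper, and it assumes both exports are plain text that can be compared line by line; how Redax lays out the exported text is not something this sketch knows about.

```python
import difflib


def removed_lines(original_text, redacted_text):
    """Return the lines of the original export that are missing or changed in the redacted export."""
    original = original_text.splitlines()
    redacted = redacted_text.splitlines()
    matcher = difflib.SequenceMatcher(a=original, b=redacted, autojunk=False)
    removed = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" and "replace" both mean these original lines did not survive intact.
        if tag in ("delete", "replace"):
            removed.extend(original[i1:i2])
    return removed
```

Reading the two exported files into strings and passing them to `removed_lines` gives a list that should contain the names and addresses and nothing else. If a usage line shows up in it, either the redaction boxes were too generous or the line was only partly redacted.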
And, in case you're wondering, the home team won, allowing them to continue their season for another week.