Search Huma's Emails

How We Did It

When the FBI released the emails discovered on Anthony Weiner's laptop, it published them as Adobe PDF documents containing scanned images of printed emails, which made the documents unsearchable. We applied Optical Character Recognition (OCR) to the documents using pdftotext, generating searchable text documents from the images contained in the PDFs. OCR is not a perfect technology: errors will be present in the text, and its formatting is very basic. The date is parsed from the first line of text (if present) and used to help order search results.
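As a rough illustration of the date step, the sketch below tries a few date layouts against the first line of an extracted text file. The function name parse_first_line_date and the list of formats are assumptions made for this example; the formats actually present in the released emails, and the code used on this site, may differ.

```python
from datetime import datetime
from typing import Optional

# Hypothetical date layouts; the real documents may use other formats.
CANDIDATE_FORMATS = (
    "%A, %B %d, %Y %I:%M %p",  # e.g. Sunday, October 28, 2012 9:15 AM
    "%B %d, %Y",               # e.g. October 28, 2012
    "%m/%d/%Y",                # e.g. 10/28/2012
)


def parse_first_line_date(text: str) -> Optional[datetime]:
    """Return the date found on the first line of text, or None."""
    lines = text.splitlines()
    if not lines:
        return None
    first_line = lines[0].strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(first_line, fmt)
        except ValueError:
            # OCR noise or a different layout: try the next format.
            continue
    return None
```

Documents for which this returns None have no usable date, so they would need a fallback position when results are ordered.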

The documents were then added to MongoDB with full-text search. Search queries are passed to the database unaltered, and results are paginated into sets of 20 documents each. Results are sorted by the date parsed from the documents themselves.
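A minimal sketch of how such a paginated full-text query might be assembled is shown below. The field name date, the ascending sort direction, and the helper build_search are assumptions for illustration only; the page size of 20 comes from the description above. The helper only builds the query arguments, so it runs without a database.

```python
from typing import Any, Dict

PAGE_SIZE = 20  # results per page, as described above


def build_search(query: str, page: int) -> Dict[str, Any]:
    """Build keyword arguments for a paginated MongoDB text search.

    `page` is 1-based. The "date" field and ascending order are
    assumptions for this sketch, not confirmed details of the site.
    """
    if page < 1:
        raise ValueError("page must be 1 or greater")
    return {
        # $text runs the query string against the collection's text index.
        "filter": {"$text": {"$search": query}},
        "sort": [("date", 1)],
        "skip": (page - 1) * PAGE_SIZE,
        "limit": PAGE_SIZE,
    }


# With pymongo and a collection that has a text index, for example
#   collection.create_index([("body", "text")])   # "body" is hypothetical
# the result could be passed straight to find():
#   cursor = collection.find(**build_search("example query", page=2))
```

Keeping the query string untouched inside $search matches the statement that searches are not altered before they reach the database.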

The goal is to provide a best-effort search of the documents. A link to the original PDF is provided at the top of each email message. To see the original formatting as provided by the FBI, open the PDF document and verify that the text you found is accurate and intact.