Extracting the text from PDF documents

Published: 2017-11-05
Last Updated: 2017-11-05 17:20:18 UTC
by Didier Stevens (Version: 1)
1 comment(s)

In my previous diary entry, we looked at a phishing PDF and extracted the URLs.

But what if you want to look at the message contained in the PDF without opening it? There are several tools (online and offline) that can convert PDF documents to text.

It can also be done with my pdf-parser.py tool, but we need to know a bit of PDF internals.

First we search for objects that are pages:

Object 5 is a page, and the content of the page is in object 65 (/Contents 65 0 R).

It contains a compressed stream, let's decompress it:

Text can be rendered on a PDF page using "text-showing operators" like Tj and TJ. Tj takes a string as argument, and TJ a list of strings and integers. If you take a close look at the screenshot above, you will find the TJ operator preceded by a list of strings. It's not that easy to see though, so let's grep for TJ:

A list is represented with square brackets [], and a string with parentheses ().

[( )] TJ instructs the PDF reader to draw a space character on the page. This is a postfix language: the operands are placed before the operator. [( )] is the operand and TJ is the operator. [( )] is a list of strings, containing just one string ( ).

[(D)4(e)7(a)9(r)-18( )30(Cu)27(s)39(t)-15(o)-18(m)23(e)7(r)-18(,)] TJ instructs the PDF reader to draw a text "Dear Customer," on the page. Each letter is found inside a string () in this list, and the integers indicate the amount of space to leave between each letter.

If we extract all the letters from the strings and concatenate them, we can extract the text:

Using regular expression \(.+?\) we can select all strings, but we want the content of the string, not the string itself. We can achieve this with my re-search.py tool and regular expression \((.+?)\). re-search.py searches through (text) files using a regular expression, and outputs all matches. If you define a capture group () inside a regular expression, then re-search will output the content of the capture group, and not the complete match.

This will output each character on a line, and we use tr to remove all newlines and thus join all characters in one big line.

 

This method is not practical of course, it's only to be used if automatic conversion tools can not be used.

Didier Stevens
Microsoft MVP Consumer Security
blog.DidierStevens.com DidierStevensLabs.com

Keywords: maldoc pdf phishing
1 comment(s)

Comments

What's this all about ..?
password reveal .
<a hreaf="https://technolytical.com/">the social network</a> is described as follows because they respect your privacy and keep your data secure:

<a hreaf="https://technolytical.com/">the social network</a> is described as follows because they respect your privacy and keep your data secure. The social networks are not interested in collecting data about you. They don't care about what you're doing, or what you like. They don't want to know who you talk to, or where you go.

<a hreaf="https://technolytical.com/">the social network</a> is not interested in collecting data about you. They don't care about what you're doing, or what you like. They don't want to know who you talk to, or where you go. The social networks only collect the minimum amount of information required for the service that they provide. Your personal information is kept private, and is never shared with other companies without your permission
https://thehomestore.com.pk/
<a hreaf="https://defineprogramming.com/the-public-bathroom-near-me-find-nearest-public-toilet/"> public bathroom near me</a>
<a hreaf="https://defineprogramming.com/the-public-bathroom-near-me-find-nearest-public-toilet/"> nearest public toilet to me</a>
<a hreaf="https://defineprogramming.com/the-public-bathroom-near-me-find-nearest-public-toilet/"> public bathroom near me</a>
<a hreaf="https://defineprogramming.com/the-public-bathroom-near-me-find-nearest-public-toilet/"> public bathroom near me</a>
<a hreaf="https://defineprogramming.com/the-public-bathroom-near-me-find-nearest-public-toilet/"> nearest public toilet to me</a>
<a hreaf="https://defineprogramming.com/the-public-bathroom-near-me-find-nearest-public-toilet/"> public bathroom near me</a>
https://defineprogramming.com/
https://defineprogramming.com/
Enter comment here... a fake TeamViewer page, and that page led to a different type of malware. This week's infection involved a downloaded JavaScript (.js) file that led to Microsoft Installer packages (.msi files) containing other script that used free or open source programs.
distribute malware. Even if the URL listed on the ad shows a legitimate website, subsequent ad traffic can easily lead to a fake page. Different types of malware are distributed in this manner. I've seen IcedID (Bokbot), Gozi/ISFB, and various information stealers distributed through fake software websites that were provided through Google ad traffic. I submitted malicious files from this example to VirusTotal and found a low rate of detection, with some files not showing as malware at all. Additionally, domains associated with this infection frequently change. That might make it hard to detect.
https://clickercounter.org/
Enter corthrthmment here...

Diary Archives