
Extracting and Converting PDF documents

By brandont
Feb. 4, 2011

Recently I was working on a site that required users to upload a PDF document, have its contents indexed by Drupal, and have the PDF converted into HTML for viewing. I found PDFtoText and PDFtoHTML, two utilities that work very well in a Linux/PHP/Drupal environment.

PDFtoText:

PDFtoText takes one required parameter, the path to the PDF file. By default the utility strips the text from the PDF and writes it to a text file in the same folder as the PDF. If you pass "-" as the output file name, as in the example below, the text is sent to standard output instead, so shell_exec() returns it directly. I store the returned text in the $file_text variable, check whether it is empty, and index it within Drupal accordingly. For more info, check the man page for PDFtoText (linuxmanpages.com/man1/pdftotext.1.php).

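function example_resume_index_resume($file_path, $nid, $file_extension){
  // Quote the path safely; the trailing "-" sends the text to standard output.
  $helper_command = '/usr/bin/pdftotext ' . escapeshellarg($file_path) . ' -';
  $file_text = shell_exec($helper_command);
  // shell_exec() returns NULL when the command produces no output.
  if($file_text == ''){
    // Nothing was extracted, so index an empty string for this node.
    search_index($nid, 'resume', '');
  }
  else{
    search_index($nid, 'resume', 'file name: '. $file_path .', text: '. $file_text);
  }
}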
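
One thing the function above cannot tell apart is a PDF with no extractable text and a pdftotext run that failed, because shell_exec() comes back empty in both cases. Below is a minimal sketch of one way to separate the two, using exec() to capture the exit status. The helper names example_pdftotext_command() and example_run_command() are hypothetical and are not part of Drupal or of the original module.

```php
<?php
// Hypothetical helper: build the pdftotext command for a given file.
// escapeshellarg() quotes the path so that spaces or quotes in a file
// name cannot break the command.
function example_pdftotext_command($file_path, $binary = '/usr/bin/pdftotext') {
  // The trailing "-" asks pdftotext to write the text to standard output.
  return $binary . ' ' . escapeshellarg($file_path) . ' -';
}

// Hypothetical helper: run a command and return its output as a string,
// or NULL when the command exits with a non-zero status.
function example_run_command($command) {
  $output = array();
  $status = 0;
  exec($command, $output, $status);
  if ($status !== 0) {
    return NULL;
  }
  // exec() collects the output line by line; join the lines back together.
  return implode("\n", $output);
}
```

With these in place, $file_text = example_run_command(example_pdftotext_command($file_path)); yields NULL when the conversion failed and an empty string when the PDF simply contained no text, so the two cases can be handled differently before calling search_index().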


PDFtoHTML:

PDFtoHTML is a program that converts PDF documents into HTML. When the program is run, it generates the HTML output in the current working directory. In the example below, I pass two options and the file path to the program: “-c” generates complex output and “-i” ignores any images in the document. Once the HTML file has been created, I output it on a view page for the user. For more info on the PDFtoHTML program, check the man page (http://linux.die.net/man/1/pdftohtml).

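function example_resume_create_html_txt($file_extension, $file_path){
  // Only PDF uploads are converted.
  if($file_extension == 'pdf'){
    // -c generates complex output, -i ignores images.
    $html_command = 'pdftohtml -c -i ' . escapeshellarg($file_path);
    shell_exec($html_command);
  }
}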
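
If the generated HTML needs to end up in a predictable place, one option is to switch into the PDF's folder before running the command and pass only the file name. The following is only a sketch under that assumption; example_pdftohtml_command() and example_run_in_directory() are hypothetical names, not part of Drupal or of pdftohtml.

```php
<?php
// Hypothetical helper: build the pdftohtml command for a given PDF.
// Only the file name is passed, because the command is meant to be run
// from inside the folder that contains the PDF.
function example_pdftohtml_command($file_path) {
  // -c generates complex output, -i ignores images.
  return 'pdftohtml -c -i ' . escapeshellarg(basename($file_path));
}

// Hypothetical helper: run a command from inside $directory and then
// restore the previous working directory. Returns TRUE on exit status 0.
function example_run_in_directory($command, $directory) {
  $previous = getcwd();
  if ($previous === FALSE || !is_dir($directory) || !chdir($directory)) {
    return FALSE;
  }
  $output = array();
  $status = 0;
  exec($command, $output, $status);
  // Always switch back so the rest of the request is unaffected.
  chdir($previous);
  return $status === 0;
}
```

A call such as example_run_in_directory(example_pdftohtml_command($file_path), dirname($file_path)); would then leave the HTML output next to the uploaded PDF, assuming the web server user can write to that folder.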


In conclusion, these two utilities made it really simple to index the PDF file so that it can be searched via Drupal. Converting the PDF to HTML also allowed its contents to be viewed in a layout similar to the original document, without the need for Adobe Reader or any other PDF reader.
