Extracting the images from a PDF and cleaning up some of the noise

edited March 2014 in Advanced PHP Discussion
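A PDF produced by a scanner is really just a set of image files. There are many ways to pull those images out, but I found Ghostscript to be the most efficient. I followed this article: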
http://right-sock.net/linux/better-convert-pdf-to-jpg-using-ghost-script/
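
As a rough sketch of that step, the helper below only builds the Ghostscript command line and leaves running it to the caller. The function name `buildGsCommand` and its parameters are made up for illustration; the flags are the ones used in the script further down, and the quoting assumes a Unix-like shell.

```php
<?php
// Hypothetical helper (not part of the original script): builds, but does
// not run, a Ghostscript command that renders every page of $pdf to
// "<outputBase>-<page>.jpg".
function buildGsCommand(string $pdf, string $outputBase, int $dpi = 300): string
{
    // Ghostscript replaces %d in the output file name with the page number.
    $output = escapeshellarg($outputBase . '-%d.jpg');

    return sprintf(
        'gs -dNOPAUSE -sDEVICE=jpeg -sOutputFile=%s -dJPEGQ=100 -r%dx%d -q %s -c quit',
        $output,
        $dpi,
        $dpi,
        escapeshellarg($pdf) // quoting keeps spaces and parentheses intact
    );
}

// Usage: pass the result to exec() once gs is installed.
// exec(buildGsCommand('/tmp/scan (1).pdf', '/tmp/scan-1'));
```

Building the string separately makes it easy to print and inspect the command before handing it to `exec()`.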

The extracted images carry a watermark, so I added the approach described in the following thread to remove part of it:
http://www.imagemagick.org/discourse-server/viewtopic.php?f=1&t=18707
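
A minimal sketch of how that clean-up could be wrapped, assuming ImageMagick's `convert` is on the PATH and a Unix-like shell. The helper name `buildThickenCommands` and the `$passes` parameter are hypothetical; the kernel string is the one used in the script below, which applies it twice.

```php
<?php
// Hypothetical helper (not part of the original script): returns one
// ImageMagick command per pass. Each command rewrites the image in place
// using the thicken morphology with the kernel from the script below.
function buildThickenCommands(string $jpg, int $passes = 2): array
{
    $arg = escapeshellarg($jpg);
    $kernel = escapeshellarg('1x3>:1,0,1');
    $command = "convert {$arg} -morphology thicken {$kernel} {$arg}";

    $commands = array();
    for ($i = 0; $i < $passes; $i++) {
        $commands[] = $command;
    }
    return $commands;
}

// Usage: run each pass in order.
// foreach (buildThickenCommands('/tmp/scan-1-1.jpg') as $cmd) { exec($cmd); }
```

How many passes work best depends on the watermark, so the count is left as a parameter here.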

Besides producing the images, I also build a CSV listing of them. The script ends up looking like this:

<?php
// Project root: this script sits one directory below it.
$path = dirname(dirname(__FILE__));
$fh = fopen($path . '/pdf/pdf2jpg.csv', 'w');
// Columns: id, file name, page number, URL, image width, image height.
fputcsv($fh, array('id', '檔名', '頁數', '網址', '圖寬', '圖高'));
$fileId = 1;
foreach (glob($path . '/pdf/*/*/*.pdf') as $file) {
    $pathinfo = pathinfo($file);
    // Base name for the images: spaces become dashes, parentheses are dropped.
    $pathinfo['filename'] = str_replace(array(' ', '(', ')'), array('-', '', ''), $pathinfo['filename']);
    $firstTargetFile = "{$pathinfo['dirname']}/{$pathinfo['filename']}-1.jpg";
    if (!file_exists($firstTargetFile)) {
        // Render every page to a 300 dpi JPEG in the project root; gs expands %d to the page number.
        $output = escapeshellarg("{$path}/{$pathinfo['filename']}-%d.jpg");
        exec("gs -dNOPAUSE -sDEVICE=jpeg -sOutputFile={$output} -dJPEGQ=100 -r300x300 -q " . escapeshellarg($file) . " -c quit");
        foreach (glob("{$path}/{$pathinfo['filename']}-*.jpg") as $jpg) {
            // Apply the thicken morphology twice to remove part of the watermark.
            $arg = escapeshellarg($jpg);
            exec("convert {$arg} -morphology thicken '1x3>:1,0,1' {$arg}");
            exec("convert {$arg} -morphology thicken '1x3>:1,0,1' {$arg}");
            // Move the finished page next to its source PDF.
            rename($jpg, "{$pathinfo['dirname']}/" . basename($jpg));
        }
    }
    // List every page image that now sits next to the PDF.
    foreach (glob("{$pathinfo['dirname']}/{$pathinfo['filename']}-*.jpg") as $jpg) {
        $size = getimagesize($jpg);
        // The page number is the text between the last dash and the extension.
        $dashPos = strrpos($jpg, '-');
        $dotPos = strpos($jpg, '.', $dashPos);
        $pageNumber = substr($jpg, $dashPos + 1, $dotPos - $dashPos - 1);
        // The offset 48 is hard-coded; it presumably strips the local path prefix on the author's machine.
        fputcsv($fh, array($fileId++, substr($file, 48), $pageNumber, $jpg, $size[0], $size[1]));
    }
}
fclose($fh);
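
The page-number parsing in the script can also be pulled out into a small function, which makes it easy to sort pages numerically (`glob()` returns names in alphabetical order, so page 10 comes before page 2). This is only a sketch: `pageNumberFromJpg` and `sortByPage` are hypothetical names, and it assumes file names end in `-<page>.jpg` as produced above.

```php
<?php
// Hypothetical helper: returns the page number encoded at the end of a
// "<name>-<page>.jpg" file name, or null when the name does not match.
function pageNumberFromJpg(string $jpg): ?int
{
    if (preg_match('/-(\d+)\.jpg$/', $jpg, $matches)) {
        return (int) $matches[1];
    }
    return null;
}

// Hypothetical helper: returns the given paths ordered by page number
// instead of by name.
function sortByPage(array $jpgs): array
{
    usort($jpgs, function ($a, $b) {
        return pageNumberFromJpg($a) <=> pageNumberFromJpg($b);
    });
    return $jpgs;
}
```

Calling `sortByPage(glob(...))` before writing the CSV rows would list the pages in reading order.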


Full source code: https://github.com/kiang/tw-campaign-finance/blob/master/scripts/pdf2jpg.php
