Extracting the images from a PDF and cleaning up some of the noise

edited March 2014 in Advanced PHP Discussion
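A PDF produced by a scanner is really just a set of image files. There are many ways to pull those images out, but I found Ghostscript to be the most efficient. I followed this article: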
http://right-sock.net/linux/better-convert-pdf-to-jpg-using-ghost-script/
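
For anyone who wants to try the Ghostscript step on its own, here is a minimal sketch of how the command line can be assembled from PHP. The helper name `buildGsCommand` and its parameters are my own and only for illustration. The flags are the same ones used in the gs call in this post, and `escapeshellarg()` is there so that paths with spaces or parentheses survive the shell (this assumes a Unix-like shell; on Windows `escapeshellarg()` replaces `%`, which would break the `%d` placeholder).

```php
<?php
// Hypothetical helper (not part of the original script): builds the
// Ghostscript command line that renders each PDF page to a JPEG.
function buildGsCommand(string $pdf, string $outputPattern, int $dpi = 300): string
{
    return 'gs -dNOPAUSE -sDEVICE=jpeg'
        . ' -sOutputFile=' . escapeshellarg($outputPattern) // gs replaces %d with the page number
        . ' -dJPEGQ=100'
        . " -r{$dpi}x{$dpi}"
        . ' -q ' . escapeshellarg($pdf)
        . ' -c quit';
}

// Example: print the command instead of running it.
echo buildGsCommand('/tmp/scan (1).pdf', '/tmp/scan-1-%d.jpg'), PHP_EOL;
```

Passing the returned string to `exec()` would run the conversion. Printing it first is a cheap way to check the quoting.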

The extracted images carry a watermark, so I added the approach described in the thread below to remove part of it:
http://www.imagemagick.org/discourse-server/viewtopic.php?f=1&t=18707
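
A rough sketch of that cleanup step, assuming ImageMagick's `convert` is on the PATH. The kernel string `1x3>:1,0,1` is the one taken from the thread. The helper name `buildThickenCommand` and the idea of chaining both passes into a single `convert` call are my own variation, so the result may differ slightly from running `convert` twice on the saved JPEG.

```php
<?php
// Hypothetical helper: builds one ImageMagick call that applies the
// thicken kernel from the forum thread $passes times in a row and
// writes the result back over the same file.
function buildThickenCommand(string $jpg, int $passes = 2): string
{
    $kernel = escapeshellarg('1x3>:1,0,1');
    $arg = escapeshellarg($jpg);
    // Repeat the operator once per pass; always apply it at least once.
    $ops = str_repeat(" -morphology thicken {$kernel}", max(1, $passes));
    return "convert {$arg}{$ops} {$arg}";
}

// Example: print the command instead of running it.
echo buildThickenCommand('/tmp/scan-1.jpg'), PHP_EOL;
```

How many passes are needed will depend on how thick the watermark strokes are in a given scan, so it is worth testing on a single page first.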

Besides producing the images, the script also builds a CSV list of them. It ends up looking like this:
$path = dirname(dirname(__FILE__));
$pdfRoot = $path . '/pdf/';
$fh = fopen($pdfRoot . 'pdf2jpg.csv', 'w');
// CSV header: id, file name, page number, URL, image width, image height
fputcsv($fh, array('id','檔名','頁數','網址','圖寬','圖高'));
$fileId = 1;
foreach (glob($pdfRoot . '*/*/*.pdf') AS $file) {
    $pathinfo = pathinfo($file);
    // Replace spaces with dashes and drop parentheses so the JPEG names stay simple
    $name = str_replace(array(' ', '(', ')'), array('-', '', ''), $pathinfo['filename']);
    $targetPrefix = "{$pathinfo['dirname']}/{$name}";
    // Only run the conversion when page 1 has not been generated yet
    if (!file_exists("{$targetPrefix}-1.jpg")) {
        // Render every page at 300 dpi into $path first; gs replaces %d with the page number
        $tmpPrefix = "{$path}/{$name}";
        exec('gs -dNOPAUSE -sDEVICE=jpeg -sOutputFile=' . escapeshellarg("{$tmpPrefix}-%d.jpg")
            . ' -dJPEGQ=100 -r300x300 -q ' . escapeshellarg($file) . ' -c quit');
        foreach (glob("{$tmpPrefix}-*.jpg") AS $jpg) {
            // Run the thicken pass twice to remove part of the watermark
            $arg = escapeshellarg($jpg);
            exec("convert {$arg} -morphology thicken '1x3>:1,0,1' {$arg}");
            exec("convert {$arg} -morphology thicken '1x3>:1,0,1' {$arg}");
            // Move the cleaned page next to its PDF
            rename($jpg, "{$pathinfo['dirname']}/" . basename($jpg));
        }
    }
    // List every page image of this PDF, whether new or left from an earlier run
    foreach (glob("{$targetPrefix}-*.jpg") AS $jpg) {
        $size = getimagesize($jpg);
        // The page number sits between the last dash and the extension
        $dashPos = strrpos($jpg, '-');
        $dotPos = strpos($jpg, '.', $dashPos);
        $pageNumber = substr($jpg, $dashPos + 1, $dotPos - $dashPos - 1);
        // The original hard-coded substr($file, 48), presumably the length of this prefix on the author's machine
        fputcsv($fh, array($fileId++, substr($file, strlen($pdfRoot)), $pageNumber, $jpg, $size[0], $size[1]));
    }
}
fclose($fh);
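
The page number in the script is cut out of the file name by locating the last dash and the dot after it. As an alternative sketch, the same thing can be done with a regular expression, which also reports names that do not follow the `name-N.jpg` pattern. The helper name `extractPageNumber` is hypothetical, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the page number encoded in a file name
// such as "report-12.jpg", or null when the name does not match.
function extractPageNumber(string $jpgPath): ?int
{
    if (preg_match('/-(\d+)\.jpg$/i', basename($jpgPath), $m)) {
        return (int) $m[1];
    }
    return null;
}

var_dump(extractPageNumber('/data/pdf/report-2014-12.jpg')); // int(12)
var_dump(extractPageNumber('/data/pdf/cover.jpg'));          // NULL
```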

Full source code: https://github.com/kiang/tw-campaign-finance/blob/master/scripts/pdf2jpg.php
