
Enabling Support for Japanese PDF, DOC and XLS Indexing for Deki Wiki


Assumptions

This configuration was done on Deki Wiki Hayes+ running on Ubuntu 7.10. Because Ubuntu is based on Debian, the same steps may work on Debian as well.

Background

The Deki Wiki indexer uses a filtering scheme to convert attachments such as Acrobat PDFs, Word DOCs, Excel XLSs and PowerPoint PPTs to text, which it then indexes. For PDFs, the default filter uses pdftohtml, a utility that is installed with the poppler-utils package and is said to be “based on xpdf”. Xpdf supports a system-wide resource file called xpdfrc, where you can specify language packages, fonts and so forth. However, I discovered that pdftohtml seems to ignore this file, and there seems to be no way to point pdftohtml at a resource file.

When you run the default pdf2text filter, which calls pdftohtml, to test-convert a Japanese PDF, an error is produced.

root@fire:/var/www/deki-hayes/bin/filters # ./pdf2text < /path/to/myJapanese.pdf > jptestpdf.txt 
Error: Unknown character collection 'Adobe-Japan1' 

This error just indicates your system does not understand the PDF’s character collection “Adobe-Japan1”.

One can download the source for both xpdf and its Japanese language support, xpdf-japanese, from http://www.foolabs.com/xpdf/download.html and compile them. That said, Ubuntu and Debian have packages for these available in the “multiverse” repository, so if you set your system up to look in this repository and update your system’s repository caches, xpdf-japanese is easily installed. The xpdf package comes with a utility called “pdftotext”, which can be used in place of the pdftohtml command called by the Deki Wiki filter “pdf2text”, a shell script in ./deki-hayes/bin/filters.

The original shell script looks like this:

root@fire:/rcutils # cat /var/www/deki-hayes/bin/filters/pdf2text 
#!/bin/sh 
# save stdin to a file since pdftohtml doesn't work on streams 
TEMP=`mktemp` 
dd of=$TEMP 2> /dev/null 
pdftohtml -stdout -i -noframes -enc UTF-8 "$TEMP" | html2text -nobs - - | sed '/^[\=]\+/ d' |sed '$d' | sed '1d' | sed '/^$/d' # trim first, last and blank lines 
rm $TEMP 


It should be possible to replace the pdftohtml call in this shell script with a similar command using pdftotext from xpdf, once the xpdf-japanese package is installed.

Set Up Your Apt Repository

The xpdf-japanese package is available in the multiverse. Edit your sources.list file:

root@esolia-fire:~ # nano /etc/apt/sources.list 

The version I am currently using is "gutsy" so the lines look like this:

... 
deb http://archive.ubuntu.com/ubuntu/ gutsy universe multiverse 
deb-src http://archive.ubuntu.com/ubuntu/ gutsy universe multiverse 
... 
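
Before moving on, it can help to confirm that the multiverse component really is enabled. The following is only a minimal sketch and not part of the original setup: the function name has_multiverse is hypothetical, and it assumes the classic one-line "deb ..." format shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of apt): succeed if an active,
# uncommented "deb" line in the given file names the multiverse component.
# Lines starting with "deb-src" or "#" are deliberately not counted.
has_multiverse() {
    # $1: path to a sources.list-style file
    grep -Eq '^[[:space:]]*deb[[:space:]].*[[:space:]]multiverse([[:space:]]|$)' "$1"
}
```

For example, `has_multiverse /etc/apt/sources.list && echo "multiverse enabled"` prints a message only when a matching line is found.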

Update Your List of Available Packages

This command must be run after any sources.list change; it updates the list of available packages from the apt sources.

# aptitude update 

Enable PDF Indexing

Install XPDF and Related Packages

Install xpdf and xpdf-japanese like so:

# aptitude install xpdf xpdf-japanese 

...where xpdf is the base software and xpdf-japanese provides the fonts and libraries needed for Japanese processing.

If you need to reinstall for any reason, you can do so as follows:

# aptitude reinstall xpdf 
# aptitude reinstall xpdf-japanese 

The xpdf-japanese package installs the prerequisite fonts, but if you need to install them separately, this is the command:

# aptitude install ttf-kochi-gothic 

Update the Includes for the New Resource File

Xpdf has an update facility that must be run after language resources such as xpdf-japanese are added:

# /usr/sbin/update-xpdfrc 

Confirm Versions and Locations

Now confirm the versions and locations.

root@fire:~ # xpdf -v 
xpdf version 3.02 
Copyright 1996-2007 Glyph & Cog, LLC 
root@fire:~ # pdftotext -v 
pdftotext version 3.02 
Copyright 1996-2007 Glyph & Cog, LLC 
root@fire:~ # whereis xpdf 
xpdf: /usr/bin/xpdf.bin /usr/bin/xpdf /etc/xpdf /usr/share/xpdf /usr/share/man/man1/xpdf.1.gz 

Edit the XPDFRC Resource File

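Xpdf installs a system-wide resource file called xpdfrc. You can override it for your own user by creating a copy of the file as ~/.xpdfrc; note that when ~/.xpdfrc exists, the system-wide file is not read unless you include it explicitly. Edit the system-wide resource file to your liking.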

root@fire:/etc/xpdf # cat xpdfrc 
#======================================================================== 
# 
# System-wide xpdfrc file 
# 
# The Xpdf tools look for a config file in two places: 
# 1. ~/.xpdfrc 
# 2. /etc/xpdf/xpdfrc 
# 
# Note that if ~/.xpdfrc exists, Xpdf will NOT read the system 
# configuration file /etc/xpdf/xpdfrc. You may wish to include it 
# from your ~/.xpdfrc using: 
#    include /etc/xpdf/xpdfrc 
# and then add additional settings. 
# 
# For complete details on config file syntax and available options, 
# please see the xpdfrc(5) man page. 
# 
# http://www.foolabs.com/xpdf/ 
# 
#======================================================================== 
 
#----- display fonts 
 
# These map the Base-14 fonts to the Type 1 fonts that ship with 
# ghostscript (gsfonts package). 
 
displayFontT1 Times-Roman        /usr/share/fonts/type1/gsfonts/n021003l.pfb 
displayFontT1 Times-Italic        /usr/share/fonts/type1/gsfonts/n021023l.pfb 
displayFontT1 Times-Bold        /usr/share/fonts/type1/gsfonts/n021004l.pfb 
displayFontT1 Times-BoldItalic        /usr/share/fonts/type1/gsfonts/n021024l.pfb 
displayFontT1 Helvetica            /usr/share/fonts/type1/gsfonts/n019003l.pfb 
displayFontT1 Helvetica-Oblique        /usr/share/fonts/type1/gsfonts/n019023l.pfb 
displayFontT1 Helvetica-Bold        /usr/share/fonts/type1/gsfonts/n019004l.pfb 
displayFontT1 Helvetica-BoldOblique    /usr/share/fonts/type1/gsfonts/n019024l.pfb 
displayFontT1 Courier            /usr/share/fonts/type1/gsfonts/n022003l.pfb 
displayFontT1 Courier-Oblique        /usr/share/fonts/type1/gsfonts/n022023l.pfb 
displayFontT1 Courier-Bold        /usr/share/fonts/type1/gsfonts/n022004l.pfb 
displayFontT1 Courier-BoldOblique    /usr/share/fonts/type1/gsfonts/n022024l.pfb 
displayFontT1 Symbol            /usr/share/fonts/type1/gsfonts/s050000l.pfb 
displayFontT1 ZapfDingbats        /usr/share/fonts/type1/gsfonts/d050000l.pfb 
 
# If you need to display PDF files that refer to non-embedded fonts, 
# you should add one or more fontDir options to point to the 
# directories containing the font files.  Xpdf will only look at .pfa, 
# .pfb, and .ttf files in those directories (other files will simply 
# be ignored). 
 
#fontDir        /usr/local/fonts/bakoma 
 
#----- PostScript output control 
 
# Set the default PostScript file or command. 
 
psFile            "|lpr" 
 
# Set the default PostScript paper size -- this can be letter, legal, 
# A4, or A3.  You can also specify a paper size as width and height 
# (in points). Xpdf uses the paper size in /etc/papersize by default. 
 
psPaperSize        A4 
 
#----- text output control 
 
# Choose a text encoding for copy-and-paste and for pdftotext output. 
# The Latin1, ASCII7, and UTF-8 encodings are built into Xpdf.  Other 
# encodings are available in the language support packages. 
 
textEncoding        UTF-8 
 
# Choose the end-of-line convention for multi-line copy-and-past and 
# for pdftotext output.  The available options are unix, mac, and dos. 
 
textEOL        unix 
 
#----- misc settings 
 
# Enable Type 1 font rasterizing with t1lib. Default "yes". 
 
#enableT1lib        no 
 
# Enable TrueType and Type 1 font rasterizing with FreeType. Default "yes". 
 
#enableFreeType        no 
 
# Enable anti-aliasing of fonts. Default "yes". 
 
#antialias        no 
 
# Set the command used to run a web browser when a URL hyperlink is 
# clicked. 
 
urlCommand    "sensible-browser '%s'" 
 
# Include the language configuration file list generated by update-xpdfrc 
include /etc/xpdf/includes 

Note the include statement at the bottom. It pulls in a file that in turn includes the various language resource files used by xpdf.

root@fire:/etc/xpdf # cat includes 
# This file was automatically generated by /usr/sbin/update-xpdfrc. 
# Instead, add or remove files in /etc/xpdf/ then run 
# /usr/sbin/update-xpdfrc to regenerate this file. 
 
include /etc/xpdf/xpdfrc-thai 
include /etc/xpdf/xpdfrc-turkish 
include /etc/xpdf/xpdfrc-cyrillic 
include /etc/xpdf/xpdfrc-hebrew 
include /etc/xpdf/xpdfrc-latin2 
include /etc/xpdf/xpdfrc-arabic 
include /etc/xpdf/xpdfrc-greek 
include /etc/xpdf/xpdfrc-japanese 
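
If a later conversion still fails, one thing worth ruling out is that the Japanese resource file never made it into the includes list. This is a small sketch under that assumption; has_japanese_include is a hypothetical name, and the check is a plain text match on the file format shown above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given includes file pulls in
# the Japanese resource file through an uncommented "include" line.
has_japanese_include() {
    # $1: path to the includes file, e.g. /etc/xpdf/includes
    grep -Eq '^[[:space:]]*include[[:space:]]+[^#[:space:]]*xpdfrc-japanese[[:space:]]*$' "$1"
}
```

Running `has_japanese_include /etc/xpdf/includes || echo "run /usr/sbin/update-xpdfrc"` gives a reminder when the line is absent.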

Test PDFTOTEXT

Now you can test pdftotext, the utility that comes with xpdf. Recall the initial error from MindTouch's pdf2text script:

root@fire:/var/www/deki-hayes/bin/filters # ./pdf2text < /path/to/myJapanese.pdf > jptestpdf.txt 
Error: Unknown character collection 'Adobe-Japan1' 

This error should not occur with pdftotext. Note the difference between the two similar names: pdf2text is MindTouch's script, while pdftotext is the utility that comes with xpdf.

root@fire:~ # pdftotext -enc UTF-8 /path/to/myJapanese.pdf /path/to/mytest-textfile.txt 

The output text file should have well-formed Japanese.
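
Checking the result by eye is the real test, but a rough automated sanity check is possible too. The sketch below assumes iconv is installed and uses a hypothetical function name, is_nonempty_utf8. It only confirms that the file is non-empty and decodes as UTF-8; it cannot tell whether the Japanese text itself is correct.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the file is non-empty and valid UTF-8.
# iconv exits with a non-zero status when it meets an invalid byte sequence.
is_nonempty_utf8() {
    # $1: path to a text file, e.g. one produced by pdftotext
    [ -s "$1" ] && iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

For example: `is_nonempty_utf8 /path/to/mytest-textfile.txt && echo "looks like UTF-8 text"`.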

Create the Script to Replace MindTouch's PDF2TEXT

Create "jppdf2text" as follows. Copy pdf2text:

root@fire:/var/www/deki-hayes/bin/filters # cp pdf2text jppdf2text

Set permissions so that the Apache user www-data can use the script:

root@fire:/var/www/deki-hayes/bin/filters # chmod a+x jppdf2text 
root@fire:/var/www/deki-hayes/bin/filters # chown www-data jppdf2text

Replace the contents of the new script with the following text, using an editor or other means:

#!/bin/sh 
# use pdftotext from xpdf with xpdf-japanese, since pdftohtml does not read the xpdfrc config file 
# use mktemp to create a temp file and store its name in a variable 
TEMP_IN=`mktemp` 
# use dd to save stdin to the temp file, discarding dd's status messages 
dd of="$TEMP_IN" > /dev/null 2>&1 
# convert the PDF in the temp file to text on stdout (-), then use sed to strip blank lines 
pdftotext -enc UTF-8 "$TEMP_IN" - | sed '/^$/d' 
# clean up 
rm "$TEMP_IN" 
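
The same save-stdin-then-convert pattern recurs in every filter on this page, so it can be factored out. The following is only a sketch of that idea, not MindTouch's code: filter_via_tempfile and pdf_to_stdout are hypothetical names, and the converter is assumed to accept the temp file path as its last argument.

```shell
#!/bin/sh
# Hypothetical helper: save stdin to a temp file, run a converter on it,
# strip blank lines from the converter's output, then remove the temp file.
filter_via_tempfile() {
    # "$@": converter command; the temp file path is appended as its last argument
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    "$@" "$tmp" | sed '/^$/d'
    rm -f "$tmp"
}

# Hypothetical wrapper so that pdftotext receives "file -" in the order it expects.
pdf_to_stdout() {
    pdftotext -enc UTF-8 "$1" -
}
```

With these definitions, `filter_via_tempfile pdf_to_stdout < /path/to/japanese.pdf` would behave like the script above, while any other command that reads a file and writes text to stdout can be dropped in the same way.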

Test the Script

The new script should work on the command line interactively.

root@fire:/var/www/deki-hayes/bin/filters # ./jppdf2text < /path/to/japanese.pdf > /path/to/output.txt

With the redirect, the output goes to the output.txt text file; without it, the output appears on stdout. The PDF I used for testing was just a print-to-PDF of a file listing of the various files I was using for testing.

root@fire:/var/www/deki-hayes/bin/filters # ./jppdf2text < /path/to/japanese.pdf
Deki Search Test 日本語 パワーポイント.ppt Deki Search Test Japanese Powerpoint.ppt Deki Search Test 日本語 エクセル.xls 

Set Up the Filter

Set up the filter to point to the new script. Make sure you edit the correct mindtouch.deki.startup.xml; if you are in doubt, you can trace its location from the /etc/init.d/dekihost init script. Mine is in /etc/dekiwiki:

root@fire:/etc/dekiwiki # ls
mindtouch.deki.startup.xml
root@fire:/etc/dekiwiki # nano mindtouch.deki.startup.xml  

Edit the filter path for PDF:

<indexer>
        <path.store>/usr/local/var/luceneindex</path.store>
        <filter-path extension="doc">/var/www/deki-hayes/bin/filters/wvText</filter-path>
        <filter-path extension="pdf">/var/www/deki-hayes/bin/filters/jppdf2text</filter-path>
        <filter-path extension="xhtml">/var/www/deki-hayes/filters/html2text</filter-path>
        ...
        <filter-path extension="xsl"></filter-path>
        <filter-path extension="xslt"></filter-path>
</indexer>
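
A mistyped filter path is easy to miss, so it may be worth checking that every configured filter actually exists and is executable. The sketch below is an assumption-laden convenience, not a Deki Wiki tool: list_filter_paths and check_filter_paths are hypothetical names, and the parsing is a simple text match that expects each filter-path element on its own line, as in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: print "extension path" for every non-empty
# <filter-path> element in a startup XML file (plain text match, not an XML parser).
list_filter_paths() {
    # $1: path to mindtouch.deki.startup.xml
    sed -n 's:.*<filter-path extension="\([^"]*\)">\([^<][^<]*\)</filter-path>.*:\1 \2:p' "$1"
}

# Hypothetical helper: report configured filters that are missing or not executable.
check_filter_paths() {
    list_filter_paths "$1" | while read -r ext path; do
        [ -x "$path" ] || echo "extension $ext: $path is not executable"
    done
}
```

For example, `check_filter_paths /etc/dekiwiki/mindtouch.deki.startup.xml` prints one line per problem and nothing when all configured filters are executable.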

Enabling DOC Indexing

As with PDF indexing, create a new "jpword2text" script and enable it as the filter. I had to create the script using wvHtml, since the original wvText used by MindTouch did NOT work correctly for Japanese DOCs. For some reason, wvText munges the Japanese inside the DOC, while wvHtml does not. This script converts the DOC to HTML with wvHtml and then passes the result to html2text, producing indexable text.

root@fire:/var/www/deki-hayes/bin/filters # cat jpword2text 
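#!/bin/bash

TEMP_IN=`mktemp`
TEMP_OUT=`mktemp`

# copy stdin to tmp file ignoring any errors from dd
dd of="$TEMP_IN" 2> /dev/null

# convert the DOC to HTML, then convert the HTML to text and strip blank lines
wvHtml "$TEMP_IN" "$TEMP_OUT" 
cat "$TEMP_OUT" | html2text -nobs - - | sed '/^$/d'
# clean up both temp files
rm "$TEMP_IN" "$TEMP_OUT" 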

Enabling XLS Indexing

Enabling XLS indexing was a simple matter of installing Java, which is executed by the original MindTouch filter script (the one that references jxl.jar):

# aptitude install sun-java6-jdk 

Apply the Changes

Now make the changes take effect by restarting dekihost:

# /etc/init.d/dekihost restart 

Then re-index your content in the UI, under Control Panel, Site Admin.

In a few minutes, you should be able to search for the content of attached Japanese PDFs.

The file attached to this page, "Successful Japanese PDF Indexing", shows what a successful result looks like.

Conclusion

It turns out that it is relatively trivial to enable Japanese file indexing on an Ubuntu 7.10 system.

  • PDF - install xpdf and xpdf-japanese, and call xpdf's pdftotext from a new filter script to convert the Japanese PDF to text for indexing.
  • DOC - install a new filter script that passes the Japanese DOC to wvHtml, and then converts the resulting HTML file to Japanese text for indexing.
  • XLS - install Java, so the original MindTouch script can work.
  • PPT - works from the beginning.
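
Since each filter depends on a different external tool, a quick preflight check can show what is still missing. This is a minimal sketch; missing_tools is a hypothetical name, and the tool list in the usage example is simply the set of commands mentioned on this page.

```shell
#!/bin/sh
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}
```

For example, `missing_tools pdftotext wvHtml html2text java` prints nothing when all four commands are available.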

Enjoy!


Attachment: Successful Jp PDF Indexing.pdf (Successful Japanese PDF Indexing), 49.65 kB, attached 02:05, 27 Nov 2007 by Rick Cogley