Enabling Support for Japanese PDF, DOC and XLS Indexing for Deki Wiki
Assumptions
This configuration was done on Deki Wiki Hayes+ running on Ubuntu 7.10. Ubuntu is based on Debian, so that distribution may work as well.

Background
The Deki Wiki indexer uses a filtering scheme to convert attachments such as Acrobat PDFs, Word DOCs, Excel XLSs and PowerPoint PPTs to text, which it then indexes. For Acrobat PDFs, the default filter uses pdftohtml, a utility installed with the poppler-utils package and said to be "based on xpdf". Xpdf lets you specify a system-wide resource file called xpdfrc, where you can configure language packages, fonts and so forth, but I discovered that pdftohtml seems to ignore this file, and there seems to be no way to point pdftohtml at a resource file. Running the default filter on a Japanese PDF fails:

root@fire:/var/www/deki-hayes/bin/filters # ./pdf2text < /path/to/myJapanese.pdf > jptestpdf.txt
Error: Unknown character collection 'Adobe-Japan1'

This error just indicates that your system does not understand the PDF's character collection "Adobe-Japan1". For reference, this is the default filter script:

root@fire:/rcutils # cat /var/www/deki-hayes/bin/filters/pdf2text
#!/bin/sh
# save stdin to a file since pdftohtml doesn't work on streams
TEMP=`mktemp`
dd of=$TEMP 2> /dev/null
pdftohtml -stdout -i -noframes -enc UTF-8 "$TEMP" | html2text -nobs - - | sed '/^[\=]\+/ d' |sed '$d' | sed '1d' | sed '/^$/d'
# trim first, last and blank lines
rm $TEMP

Set Up Your Apt Repository
The xpdf-japanese package is available in the multiverse repository. Edit your sources.list file:

root@esolia-fire:~ # nano /etc/apt/sources.list

The release I am currently using is "gutsy", so the lines look like this:

...
deb http://archive.ubuntu.com/ubuntu/ gutsy universe multiverse
deb-src http://archive.ubuntu.com/ubuntu/ gutsy universe multiverse
...

Update Your List of Available Packages
Run this command after any change to sources.list; it refreshes the list of available packages from the apt sources.

# aptitude update

Enable PDF Indexing

Install XPDF and Related Packages
Install xpdf and xpdf-japanese like so:

# aptitude install xpdf xpdf-japanese

Here xpdf is the base software, and xpdf-japanese provides the fonts and libraries needed for Japanese processing. If they are already installed, you can reinstall them:

# aptitude reinstall xpdf
# aptitude reinstall xpdf-japanese

The xpdf-japanese package installs the prerequisite fonts, but if you need to install them separately, this is the command:

# aptitude install ttf-kochi-gothic

Update the Includes for the New Resource File
Xpdf has an update facility that must be run after language resources such as xpdf-japanese are added:

# /usr/sbin/update-xpdfrc

Confirm Versions and Locations
Now confirm the versions and locations.

root@fire:~ # xpdf -v
xpdf version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
root@fire:~ # pdftotext -v
pdftotext version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
root@fire:~ # whereis xpdf
xpdf: /usr/bin/xpdf.bin /usr/bin/xpdf /etc/xpdf /usr/share/xpdf /usr/share/man/man1/xpdf.1.gz

Edit the XPDFRC Resource File
Xpdf installs a system-wide resource file called xpdfrc. You can override it for your own account by creating a copy of the file as ~/.xpdfrc. Edit the system-wide resource file to your liking.

root@fire:/etc/xpdf # cat xpdfrc
#========================================================================
#
# System-wide xpdfrc file
#
# The Xpdf tools look for a config file in two places:
# 1. ~/.xpdfrc
# 2. /etc/xpdf/xpdfrc
#
# Note that if ~/.xpdfrc exists, Xpdf will NOT read the system
# configuration file /etc/xpdf/xpdfrc. You may wish to include it
# from your ~/.xpdfrc using:
# include /etc/xpdf/xpdfrc
# and then add additional settings.
#
# For complete details on config file syntax and available options,
# please see the xpdfrc(5) man page.
#
# http://www.foolabs.com/xpdf/
#
#========================================================================

#----- display fonts

# These map the Base-14 fonts to the Type 1 fonts that ship with
# ghostscript (gsfonts package).
displayFontT1 Times-Roman /usr/share/fonts/type1/gsfonts/n021003l.pfb
displayFontT1 Times-Italic /usr/share/fonts/type1/gsfonts/n021023l.pfb
displayFontT1 Times-Bold /usr/share/fonts/type1/gsfonts/n021004l.pfb
displayFontT1 Times-BoldItalic /usr/share/fonts/type1/gsfonts/n021024l.pfb
displayFontT1 Helvetica /usr/share/fonts/type1/gsfonts/n019003l.pfb
displayFontT1 Helvetica-Oblique /usr/share/fonts/type1/gsfonts/n019023l.pfb
displayFontT1 Helvetica-Bold /usr/share/fonts/type1/gsfonts/n019004l.pfb
displayFontT1 Helvetica-BoldOblique /usr/share/fonts/type1/gsfonts/n019024l.pfb
displayFontT1 Courier /usr/share/fonts/type1/gsfonts/n022003l.pfb
displayFontT1 Courier-Oblique /usr/share/fonts/type1/gsfonts/n022023l.pfb
displayFontT1 Courier-Bold /usr/share/fonts/type1/gsfonts/n022004l.pfb
displayFontT1 Courier-BoldOblique /usr/share/fonts/type1/gsfonts/n022024l.pfb
displayFontT1 Symbol /usr/share/fonts/type1/gsfonts/s050000l.pfb
displayFontT1 ZapfDingbats /usr/share/fonts/type1/gsfonts/d050000l.pfb

# If you need to display PDF files that refer to non-embedded fonts,
# you should add one or more fontDir options to point to the
# directories containing the font files. Xpdf will only look at .pfa,
# .pfb, and .ttf files in those directories (other files will simply
# be ignored).
#fontDir /usr/local/fonts/bakoma

#----- PostScript output control

# Set the default PostScript file or command.
psFile "|lpr"

# Set the default PostScript paper size -- this can be letter, legal,
# A4, or A3. You can also specify a paper size as width and height
# (in points). Xpdf uses the paper size in /etc/papersize by default.
psPaperSize A4

#----- text output control

# Choose a text encoding for copy-and-paste and for pdftotext output.
# The Latin1, ASCII7, and UTF-8 encodings are built into Xpdf. Other
# encodings are available in the language support packages.
textEncoding UTF-8

# Choose the end-of-line convention for multi-line copy-and-past and
# for pdftotext output. The available options are unix, mac, and dos.
textEOL unix

#----- misc settings

# Enable Type 1 font rasterizing with t1lib. Default "yes".
#enableT1lib no

# Enable TrueType and Type 1 font rasterizing with FreeType. Default "yes".
#enableFreeType no

# Enable anti-aliasing of fonts. Default "yes".
#antialias no

# Set the command used to run a web browser when a URL hyperlink is
# clicked.
urlCommand "sensible-browser '%s'"

# Include the language configuration file list generated by update-xpdfrc
include /etc/xpdf/includes

Note the include statement at the bottom. It references a file that in turn lists the various language resource files used by xpdf.

root@fire:/etc/xpdf # cat includes
# This file was automatically generated by /usr/sbin/update-xpdfrc.
# Instead, add or remove files in /etc/xpdf/ then run
# /usr/sbin/update-xpdfrc to regenerate this file.
include /etc/xpdf/xpdfrc-thai
include /etc/xpdf/xpdfrc-turkish
include /etc/xpdf/xpdfrc-cyrillic
include /etc/xpdf/xpdfrc-hebrew
include /etc/xpdf/xpdfrc-latin2
include /etc/xpdf/xpdfrc-arabic
include /etc/xpdf/xpdfrc-greek
include /etc/xpdf/xpdfrc-japanese

Test PDFTOTEXT
Now you can test the included utility pdftotext. The initial error:

root@fire:/var/www/deki-hayes/bin/filters # ./pdf2text < /path/to/myJapanese.pdf > jptestpdf.txt
Error: Unknown character collection 'Adobe-Japan1'

should now be resolved. Note that pdf2text is MindTouch's script, while pdftotext is the utility that comes with xpdf.

root@fire:~ # pdftotext -enc UTF-8 /path/to/myJapanese.pdf /path/to/mytest-textfile.txt

The output text file should contain well-formed Japanese.

Create the Script to Replace MindTouch's PDF2TEXT
Create "jppdf2text" as follows.
Copy pdf2text:

root@fire:/var/www/deki-hayes/bin/filters # cp pdf2text jppdf2text

Set permissions so that the Apache user www-data can use the script:

root@fire:/var/www/deki-hayes/bin/filters # chmod a+x jppdf2text
root@fire:/var/www/deki-hayes/bin/filters # chown www-data jppdf2text

Replace the contents of the new script with the following text, using an editor or other means:

#!/bin/sh
# use pdftotext from xpdf with xpdf-japanese since pdftohtml does not reference the xpdfrc config file
# use mktemp to make a temp filename and assign it to a variable
TEMP_IN=`mktemp`
# use dd to save stdin to the temp file, discarding dd's status output
dd of="$TEMP_IN" > /dev/null 2>&1
# convert the PDF to text on stdout (-), then use sed to strip blank lines
pdftotext -enc UTF-8 "$TEMP_IN" - | sed '/^$/d'
# cleanup
rm "$TEMP_IN"

Test the Script
The new script should work interactively on the command line. Redirect its output to a text file:

root@fire:/var/www/deki-hayes/bin/filters # ./jppdf2text < /path/to/japanese.pdf > /path/to/output.txt

Without the redirect, the text goes to stdout. The PDF I used for testing was just a print-to-PDF of a listing of the various files I was using for testing:

root@fire:/var/www/deki-hayes/bin/filters # ./jppdf2text < /path/to/japanese.pdf
Deki Search Test 日本語 パワーポイント.ppt
Deki Search Test Japanese Powerpoint.ppt
Deki Search Test 日本語 エクセル.xls

Set Up the Filter
Set up the filter to point to the new script. Make sure you edit the correct mindtouch.deki.startup.xml; if you are in doubt, you can trace its location from the /etc/init.d/dekihost init script. Mine is in /etc/dekiwiki:

root@fire:/etc/dekiwiki # ls mindtouch.deki.startup.xml
root@fire:/etc/dekiwiki # nano mindtouch.deki.startup.xml

Edit the filter path for PDF:

<indexer>
<path.store>/usr/local/var/luceneindex</path.store>
<filter-path extension="doc">/var/www/deki-hayes/bin/filters/wvText</filter-path>
<filter-path extension="pdf">/var/www/deki-hayes/bin/filters/jppdf2text</filter-path>
<filter-path extension="xhtml">/var/www/deki-hayes/filters/html2text</filter-path>
...
<filter-path extension="xsl"></filter-path>
<filter-path extension="xslt"></filter-path>
</indexer>
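
A mistyped filter path is easy to miss, so it can help to check that every script named in the startup file exists and is executable before restarting. The following is a minimal sketch, not part of Deki Wiki: the check_filters function name is hypothetical, and it assumes each filter-path element sits on its own line, as in the excerpt above. Empty entries (such as xsl and xslt) are skipped.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Deki Wiki): list each non-empty
# <filter-path> in a startup XML file and report whether the script it
# names is executable. Assumes one <filter-path> element per line.
check_filters() {
    config="$1"
    # Print only the text between <filter-path ...> and </filter-path>;
    # entries with no path between the tags do not match and are skipped.
    sed -n 's|.*<filter-path extension="[^"]*">\([^<][^<]*\)</filter-path>.*|\1|p' "$config" |
    while IFS= read -r filter; do
        if [ -x "$filter" ]; then
            echo "ok       $filter"
        else
            echo "MISSING  $filter"
        fi
    done
}

# Example call (path taken from this article; adjust to your install):
# check_filters /etc/dekiwiki/mindtouch.deki.startup.xml
```

Any line reported as MISSING points at a path that either does not exist or lacks the execute bit set earlier with chmod.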
Enabling DOC Indexing
In a similar way to the PDF indexing, enable a new "jpword2text" script as the filter. I had to create the script using wvHtml, since the original wvText used by MindTouch did NOT work correctly for Japanese DOCs. For some reason, wvText munges the Japanese inside the DOC, while wvHtml does not. This script pipes the output of wvHtml to html2text, producing indexable text.

root@fire:/var/www/deki-hayes/bin/filters # cat jpword2text
#!/bin/bash
TEMP_IN=`mktemp`
TEMP_OUT=`mktemp`
# copy stdin to tmp file ignoring any errors from dd
dd of="$TEMP_IN" 2> /dev/null
wvHtml "$TEMP_IN" "$TEMP_OUT"
cat "$TEMP_OUT" | html2text -nobs - - | sed '/^$/d'
rm "$TEMP_IN" "$TEMP_OUT"

Enabling XLS Indexing
Enabling XLS indexing was a simple matter of installing Java, which is executed by the original MindTouch filter script that references jxl.jar:

# aptitude install sun-java6-jdk

Apply the Changes
Now make the changes effective. Restart dekihost:

# /etc/init.d/dekihost restart

Then re-index your content in the UI, under Control Panel, Site Admin. In a few minutes, you should be able to search for the content of attached Japanese PDFs.

Conclusion
It turns out to be relatively straightforward to enable Japanese file indexing on an Ubuntu 7.10 system.
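
As a final check, each filter can be smoke-tested outside the wiki. The filter scripts in this article read the attachment on stdin and write text to stdout, so a known phrase from a test document should show up in that output. This is only a sketch under that assumption: the filter_finds function name is hypothetical, and the document path and phrase in the example call are placeholders.

```shell
#!/bin/sh
# Hypothetical smoke test: feed a document to a filter on stdin and
# succeed only if a known phrase appears in the text written to stdout.
filter_finds() {
    filter="$1"      # path to the filter script
    document="$2"    # attachment to convert
    phrase="$3"      # text expected in the converted output
    # -F treats the phrase as a fixed string, -q suppresses output,
    # and -- guards against phrases that begin with a dash.
    "$filter" < "$document" | grep -q -F -- "$phrase"
}

# Example call (placeholder document path and phrase):
# filter_finds /var/www/deki-hayes/bin/filters/jpword2text /path/to/japanese.doc '日本語' \
#     && echo "filter OK" || echo "filter FAILED"
```

If a filter passes here but searches still come up empty, the problem is more likely in the startup file or the re-index step than in the script itself.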
Enjoy!