]po[ Full-Text File Search

Provides full-text indexing for filenames and file contents in the ]po[ file storages. The package uses a number of external filters to periodically scan the ]po[ file storage for new files, builds a full-text index, and lets users retrieve the files through the normal search interface.

Required Software

intranet-search-pg-files requires the following software to extract indexable strings from different file formats:

  • CatDoc: /usr/local/bin/catdoc
  • HTMLtoText: /usr/bin/html2text
  • wvText: /usr/bin/wvText (optional)
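
Before enabling the indexer it can help to confirm that the converters are actually installed at the paths listed above. The following is a minimal sketch, not part of the package: the function name missing_converters and the table layout are assumptions made for illustration, and the paths are simply the ones from the list.

```python
import os

# Converter paths as listed above. The boolean marks whether the tool is
# required (True) or optional (False). Adjust the paths if your system
# installs the tools elsewhere.
CONVERTERS = {
    "catdoc": ("/usr/local/bin/catdoc", True),
    "html2text": ("/usr/bin/html2text", True),
    "wvText": ("/usr/bin/wvText", False),
}


def missing_converters(converters=CONVERTERS):
    """Return the names of required converters that are not executable files."""
    missing = []
    for name, (path, required) in converters.items():
        usable = os.path.isfile(path) and os.access(path, os.X_OK)
        if required and not usable:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Prints an empty list when all required converters are in place.
    print(missing_converters())
```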

Basic Operation

The package periodically (default: every 5 minutes) checks a maximum number of objects (default: 100) for new files. See the Parameters section below for the settings that control indexing behaviour.

This scheduling is necessary to balance the desire for fast indexing against the considerable load that full-text indexing places on your database.
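
The batching idea can be illustrated with a short sketch. This is not the package's implementation (which is written in Tcl); the names run_indexer_pass, pending and index_object are hypothetical. Each scheduled pass handles at most a fixed number of objects and leaves the rest for the next pass.

```python
from collections import deque


def run_indexer_pass(pending, index_object, max_objects=100):
    """Index at most max_objects items from the queue; return how many were handled."""
    handled = 0
    while pending and handled < max_objects:
        index_object(pending.popleft())
        handled += 1
    return handled


if __name__ == "__main__":
    # 250 hypothetical objects waiting to be checked for new files.
    pending = deque(range(250))
    indexed = []
    # Three scheduled passes handle 100, 100 and 50 objects respectively.
    print([run_indexer_pass(pending, indexed.append) for _ in range(3)])
```

With a 5-minute interval, the 250 objects in this example would be fully checked after the third pass, that is, within about 15 minutes.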

Supported File Types

  • txt, text, perl, php, sql:
    These files are considered to consist fully of indexable text.
  • doc:
    We use CatDoc to extract strings from the Microsoft Word format.
  • htm, html, xml, asp:
    We use HTMLtoText to extract the indexable text from these files.
  • The following extensions are explicitly ignored:
    • Image and audio files: gif, jpg, pgp, bmp, png, wav, mp3, ico
    • File types without a reasonable converter: xls, rtf (may be added later)
    • Other files: log, bz2, zip, tar, tgz, rar, gz, js, mso, exe

To add a new file type, see ~/packages/intranet-search-pg-files-procs.tcl and search for "intranet_search_pg_files_fti_content". Very basic Tcl skills are sufficient to add a new converter once you have the converter running at the shell level.
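
The extension handling described above amounts to a lookup from file extension to a handling strategy. The sketch below shows that logic in Python for illustration only; the real code is the Tcl procedure named above. The function name classify, the lower-casing of extensions, and the "unknown" category for unlisted extensions are assumptions of this sketch.

```python
import os

# Extensions treated as fully indexable text.
PLAIN_TEXT = {"txt", "text", "perl", "php", "sql"}

# Extensions that need an external converter (paths from "Required Software").
CONVERTER_COMMANDS = {
    "doc": "/usr/local/bin/catdoc",
    "htm": "/usr/bin/html2text",
    "html": "/usr/bin/html2text",
    "xml": "/usr/bin/html2text",
    "asp": "/usr/bin/html2text",
}

# Extensions that are explicitly ignored.
IGNORED = {
    "gif", "jpg", "pgp", "bmp", "png", "wav", "mp3", "ico",
    "xls", "rtf",
    "log", "bz2", "zip", "tar", "tgz", "rar", "gz", "js", "mso", "exe",
}


def classify(filename):
    """Return (strategy, command) for a file, based only on its extension."""
    # Assumption: extensions are compared case-insensitively.
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext in PLAIN_TEXT:
        return ("plain", None)
    if ext in CONVERTER_COMMANDS:
        # The converter is expected to write the extracted text to stdout.
        return ("convert", [CONVERTER_COMMANDS[ext], filename])
    if ext in IGNORED:
        return ("ignored", None)
    return ("unknown", None)


if __name__ == "__main__":
    for name in ("notes.txt", "offer.doc", "index.html", "logo.png", "plan.pdf"):
        print(name, classify(name))
```

In this sketch, supporting a new file type means adding one entry to CONVERTER_COMMANDS once the converter works from the shell.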

Administration & Control

To control indexing, see the page http://<your_server>/intranet-search-pg-files/. On this page you can see the files found by the indexer and re-index specific business objects.

Please see the error log at ~/log/error.log for detailed messages.

Parameters

  • IndexerMaxFiles - 100
    Limits indexer activity to MaxFiles files per run. To estimate this parameter, divide the number of files in your intranet (example: 30,000) by the time in seconds allowed for checking all files (example: 24*60*60 for one day) and multiply by SearchIndexerInterval (example: 300); with these figures the result is about 104. Make sure the indexer can actually process MaxFiles files within SearchIndexerInterval, otherwise the system may become overloaded.
  • SearchIndexerInterval - 300
    Run the search indexer every X seconds.
  • IndexFileContentsP - 1
    Should the _contents_ of a file be indexed, in addition to its filename?
    Disable this parameter (set it to 0) if you are running a translation business, because your file contents generally relate to your customers rather than to your own business. Set the parameter to 1 if you are interested in the contents of your files.
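
The sizing rule for IndexerMaxFiles can be written as a small calculation. This is a sketch only; the function name indexer_max_files is hypothetical, and rounding up to the next whole file is a choice made here so that the full scan finishes within the target time.

```python
def indexer_max_files(total_files, full_scan_seconds, indexer_interval_seconds):
    """Files to check per indexer run so that all files are covered in full_scan_seconds."""
    # Ceiling division using integers only, to avoid floating point rounding.
    return -((-total_files * indexer_interval_seconds) // full_scan_seconds)


if __name__ == "__main__":
    # 30,000 files, one day for a full scan, indexer run every 300 seconds.
    print(indexer_max_files(30000, 24 * 60 * 60, 300))  # prints 105
```

The result is close to the default of 100, which matches the example figures given for the parameter.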


References

Related Software

  • PostgreSQL - we use the TSearch2 engine from PostgreSQL for full-text indexing.