]project-open[
emphasizes internet collaboration and knowledge management. However,
internet collaboration needs to respect the tight security permissions enforced
by other parts of the system. So the search engine needs to comply with all
of the requirements:
]project-open[ Search is implemented using the TSearch2 full text engine that comes as a part of the PostgreSQL database (on Oracle, ]project-open[ supports Intermedia).
TSearch2 allows for a tight integration of full text indices and SQL statements, so that a single statement can mix structured query conditions (to determine permissions and object relationships) with access to full text indices.
The implementation consists of the following elements:
-- The main search table with Full Text Index.
--
create table im_search_objects (
    object_id           integer,
                        -- may include "object types" outside of OpenACS
                        -- that are not in the "acs_object_types" table.
    object_type_id      integer
                        constraint im_search_objects_object_type_id_fk
                        references im_search_object_types
                        on delete cascade,
                        -- What is the topmost container for this object?
                        -- Allows to speed up the elimination of objects
                        -- that the current user can't access
    biz_object_id       integer
                        constraint im_search_objects_biz_obj_id_fk
                        references acs_objects
                        on delete cascade,
                        -- Owner may not need to be a "user" (in the case
                        -- of a deleted user). Owners can be asked to give
                        -- permissions to a document even if the document
                        -- is not readable for the searching user.
    owner_id            integer
                        constraint im_search_objects_owner_id_fk
                        references persons
                        on delete cascade,
                        -- Bitset with one bit for each "profile":
                        -- We use an integer instead of a "bit varying"
                        -- in order to keep the set compatible with Oracle.
                        -- A set bit indicates that object is readable to
                        -- members of the profile independent of the
                        -- biz_object_id permissions.
    profile_permissions integer,
                        -- counter for number of accesses to this object
                        -- either from the permission() proc or from
                        -- reading in the server log file.
    popularity          integer,
                        -- Full Text Index
    fti                 tsvector,
                        -- For tables that don't respect the OpenACS object
                        -- scheme we may get "object_id"s that start with 0.
    primary key (object_id, object_type_id)
);

create index im_search_objects_fti_idx on im_search_objects using gist(fti);
create index im_search_objects_object_id_idx on im_search_objects (object_id);
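The "profile_permissions" bitset in the table above can be illustrated with a small sketch. This is not the ]project-open[ code: the profile names and their bit positions below are hypothetical, since the text does not say which bit belongs to which profile. It only shows how a single integer can answer "is this object readable for any of the user's profiles?".

```python
# Hypothetical profile-to-bit assignment (an assumption for illustration;
# the real assignment is not given in the text).
PROFILE_BITS = {
    "employees": 1 << 0,
    "customers": 1 << 1,
    "freelancers": 1 << 2,
}


def profile_mask(profiles):
    """Combine the bits of all profiles a user belongs to into one integer."""
    mask = 0
    for name in profiles:
        mask |= PROFILE_BITS[name]
    return mask


def readable_via_profile(profile_permissions, user_profiles):
    """True if at least one of the user's profiles has its bit set
    in the object's profile_permissions value."""
    return (profile_permissions & profile_mask(user_profiles)) != 0
```

With this assignment, an object stored with profile_permissions = 3 would be readable for members of "employees" and "customers", but not for a user who is only in "freelancers". Inside a SQL statement the same test would presumably be a bitwise AND on the integer column, which is why a plain integer works on both PostgreSQL and Oracle.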
The following steps are executed during each search:
TSearch2 contains several features for adapting the search process to specific languages, such as dictionaries and language-specific stop words. However, ]project-open[ needs to be able to operate with content items in several languages at the same time.
It is not always possible to determine the language of a content item, so we have decided not to implement these features at the moment.
Nevertheless, practical experience with TSearch2 and languages such as French, Spanish and German has required us to add a "normalization" feature that normalizes search content and queries in order to deal with accents and notational variants:
This normalization makes it possible to search for "carlos" and to receive search results such as "Carlós" or "carlos@abc.com".
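To make the idea concrete, here is a minimal sketch of such a normalization, written in Python purely for illustration. It is not the ]project-open[ implementation (which runs inside the database); the function name "normalize" and the exact rules (strip accents, fold case, treat every non-alphanumeric character as a separator) are assumptions derived from the "carlos" example above.

```python
import re
import unicodedata


def normalize(text):
    """Reduce text to lowercase ASCII-like tokens separated by single spaces.

    Assumed rules (not taken from the original code):
      1. decompose accented characters and drop the combining marks,
      2. fold case,
      3. replace runs of non-alphanumeric characters with one space.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    folded = without_accents.casefold()
    return re.sub(r"[^a-z0-9]+", " ", folded).strip()
```

Applied to both the indexed content and the query, this maps "Carlós" to "carlos" and "carlos@abc.com" to the three tokens "carlos abc com", so that a query for "carlos" can match both.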
We had to implement this normalization ourselves, because there was no code for it on the PostgreSQL page. Also, the PostgreSQL "conversion" functionality (UTF-8 => SQL_ASCII) did not eliminate the accents. Here is a snapshot of the code. Please check for the latest version at Sourceforge.net.
Ranking is currently limited to the built-in TSearch2 ranking functionality. In the future we are going to use several types of statistics to determine the "popularity" of a content item.