The following sections describe several aspects of the indexing back-end of the proposed search system. This comprises some changes to the OJS back-end but above all includes solr/Lucene configuration recommendations.
Index architecture is one of the most important aspects of solr configuration. We list available options in this area and provide recommendations with respect to the requirements specified for this project.
The main decision with respect to index architecture is whether to use a single index or multiple indexes (and corresponding solr cores).
Advantages of a single index for all journals and document types:
Disadvantages of a single index:
There are two basic design options to index a multilingual document collection:
See http://lucene.472066.n3.nabble.com/Designing-a-multilingual-index-td688766.html for a discussion of multilingual index design.
Advantages of a single index:
Advantages of a multi-index approach:
In our case the advantages of a single-index approach for multilingual content definitely outweigh its disadvantages.
The following sections provide index architecture recommendations for all deployment scenarios.
We generally recommend a single-index architecture if possible.
Several disadvantages of the single index scenario are not relevant in scenarios S1 to S3:
On the other hand there are advantages of a single index architecture (e.g. search across several OJS instances, simplicity of configuration, maintenance, etc.) which are relevant in our case, see above.
There are two potential problems that can occur when consolidating many journals in a single index:
The first point refers to the fact that if the whole index needs to be rebuilt (e.g. due to index corruption) we have to trigger the rebuild from all connected OJS instances. This cannot be automated within OJS as OJS does not allow actions across instances. It can, however, easily be automated with a simple custom shell script if we provide a command-line interface for index rebuilds, which we recommend.
Whether ranking will suffer from a single-index approach depends on the heterogeneity of the journals added to the index. It may become a problem when search terms that have a high selectivity for one journal are much less selective for other journals thereby distorting Lucene’s default inverse document frequency (IDF) scoring measure when restricting query results to a single journal.
An example illustrates this: Imagine two mathematics journals. One accepts contributions from all sub-disciplines while the other specializes in topology. A search for “algebraic topology” may be quite selective in the general maths journal while matching a large share of the articles in the topology journal. This is probably not a problem as long as we search across both journals. If we search within the general maths journal only, then documents matching “algebraic topology” will probably receive lower scores than they should because the index-level document frequency for “algebraic topology” is higher than is appropriate for the article subset of the general maths journal. This means that in a search with several search terms, e.g. “algebraic topology AND number theory”, the second term will probably be overrepresented in the journal-restricted result set. Only experiments with test data can show whether this is relevant in practice. It is reasonable to assume, though, that the majority of queries will run across all indexed journals and will therefore not suffer from such distortion, because most users are interested in a topic rather than in a specific publication.
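The effect described in the two-journal example can be made concrete with a small calculation. The following is a minimal sketch, assuming the classic Lucene TF-IDF formula idf = 1 + ln(N / (df + 1)); the article counts are invented for illustration and are not measured data.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene TF-IDF inverse document frequency: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Invented article counts per journal (illustration only, not measured data).
JOURNALS = {
    "general": {"articles": 1000, "algebraic topology": 20, "number theory": 20},
    "topology": {"articles": 1000, "algebraic topology": 600, "number theory": 5},
}


def index_level_idf(term: str) -> float:
    """IDF as computed over the shared index that holds both journals."""
    doc_freq = sum(journal[term] for journal in JOURNALS.values())
    num_docs = sum(journal["articles"] for journal in JOURNALS.values())
    return classic_idf(doc_freq, num_docs)


def journal_level_idf(term: str, journal: str) -> float:
    """IDF as it would be if the journal had an index of its own."""
    stats = JOURNALS[journal]
    return classic_idf(stats[term], stats["articles"])
```

With these numbers both terms are equally selective within the general journal, yet the shared index weights “number theory” far higher than “algebraic topology”. This is the distortion that a journal-restricted query would inherit.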
NB: We do not have to bother about content heterogeneity on lower granularity levels, e.g. journal sections, as these cannot be selected as search criteria to limit search results.
The same ranking distortion could theoretically apply to multilingual content if we were to collect all languages in a single index field. In the proposed schema, however, we use a separate field per language, see “Multilingual Documents” below. As document frequency counts are per index field, we’ll get correct language-specific document counts. The total document count will also be ok as we’ll denormalize all language versions to the article level.
While we generally recommend a single index design there are cases where a multi-index design may be appropriate and can be optionally implemented by a provider:
Whether these problems occur or not can only be decided by experimentation. While one index per OJS instance is supported, even in a network scenario, it must be kept in mind that multiple indexes may have disadvantages: From a user perspective the most relevant potential disadvantage is that searches across several journals will only be supported when those journals are in the same index. This is because we do not recommend distributed search across several indexes: it is much more complex and therefore costly to implement, and it creates ranking problems that are hard to solve. See the full list above.
While we generally recommend a single-index architecture for all deployment options, there are a few comments to be made with respect to specific deployment scenarios.
In deployment scenario S1 and S2 we only search within the realm of a single OJS installation. This means that a single embedded solr core listening on the loopback IP interface could serve such requests, see “Embedded Deployment” below.
In deployment scenario S3 we search across installations. This means that the default deployment approach with a per-installation embedded solr core will not be ideal as it means searching across a potentially large number of distributed cores. Therefore, the provider will probably want to maintain a single index for all OJS installations deployed on their network.
This has a few implications:
In deployment scenario S4 we have an unspecified number of disparate document types to be indexed. This means that the best index design needs to be defined on a per-case basis. We may distinguish two possible integration scenarios:
The present specification only deals with the second case as the first almost certainly requires provider-specific customization of OJS code that we have no information about.
Our index architecture recommendation for the S4 scenario is to create a separate dedicated solr core with OJS documents exactly as in scenario S3. Then searches to the “OJS core” can be combined with queries to solr cores with non-OJS document types in federated search requests from arbitrary third-party search interfaces within the provider’s network. (See http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-return-one-result-set for one possible solution of federated search.)
This has the advantage that the standard OJS solr search support can be used unchanged based on the same documentation resources that we provide to support S3 (see previous section).
The only extra requirement to support the S4 scenario is to make sure that the unique document ID of other document types does not clash with the OJS unique article id. This is important so that a federated search can uniquely identify OJS documents among other application documents. When working with a globally unique installation ID such clashes are extremely improbable. Potential ID clashes are only a problem when using solr’s built-in federated search feature. Otherwise the search client will query the cores separately and join documents based on application-specific logic (e.g. displaying separate result lists for different document types).
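One way to obtain such clash-free IDs is to prefix the OJS article ID with the installation ID. The sketch below is an assumption for illustration only: the separator, the function names and the sample installation IDs are hypothetical and not taken from the OJS code base.

```python
from typing import Tuple


def solr_document_id(installation_id: str, article_id: int) -> str:
    """Combine a globally unique installation ID with the local article ID."""
    return f"{installation_id}-{article_id}"


def split_document_id(document_id: str) -> Tuple[str, int]:
    """Recover installation ID and article ID from a combined document ID."""
    # Split at the last separator so installation IDs may contain "-" themselves.
    installation_id, _, article_id = document_id.rpartition("-")
    return installation_id, int(article_id)
```

Two installations can then both hold an article with the local ID 17 without producing the same document ID in a shared or federated index.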
Our recommendation for the data model is based on the type of queries and results required according to our feature list. We also try to implement a data model that requires as few schema and index modifications in the future as possible to reduce maintenance cost.
Meta-data fields that we want to search separately (e.g. in an advanced search) must be implemented as separate fields in Lucene. Sometimes all text is joined in an additional “catch-all” field to support unstructured queries. We do not believe that such a field is necessary in our case as we’ll do query expansion instead.
To support multilingual search and proper ranking of multilingual content we need one field per language for all localized meta-data fields, galleys and supplementary files.
In order to avoid ranking problems we also prefer to have separate fields per document format (e.g. PDF, HTML, MS Word) rather than joining all data formats into a single search field. We can use query expansion to cover all formats while still maintaining good ranking metrics even when certain formats are not used as frequently as other formats.
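Query expansion over per-language and per-format fields can be sketched as follows. The field naming pattern (base name, optional format, locale) and the base names used in the usage note are assumptions made for this illustration. The sketch also leaves out escaping of user input, which a real implementation would need.

```python
from typing import Iterable, List, Optional


def expand_fields(base: str, locales: Iterable[str],
                  formats: Optional[Iterable[str]] = None) -> List[str]:
    """List the concrete index fields behind one logical field."""
    locales = list(locales)
    if formats is None:
        # Localized meta-data field, e.g. one title field per locale.
        return [f"{base}_{locale}" for locale in locales]
    # Full-text field, one per format/locale combination.
    return [f"{base}_{fmt}_{locale}" for fmt in formats for locale in locales]


def expand_query(user_query: str, fields: Iterable[str]) -> str:
    """Apply the same user query to every field and OR the clauses together."""
    return " OR ".join(f"{field}:({user_query})" for field in fields)
```

For example, `expand_query("topology", expand_fields("title", ["en_US", "de_DE"]))` yields one clause per language field, so each field keeps its own document frequency statistics.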
The relatively large number of required fields for such a denormalized multilingual/multiformat data model is not a problem in Lucene (see http://lucene.472066.n3.nabble.com/Maximum-number-of-fields-allowed-in-a-Solr-document-td505435.html). Storing sparse or denormalized data is efficient in Lucene, comparable to a NoSQL database.
We prefer dynamic fields over statically configured fields:
The publication date will be indexed to a trie date type field.
Authors are not localized and will be stored verbatim in a multi-valued string type field.
Specific fields are:
These fields will be analyzed for search query use cases, potentially including stemming (see “Analysis” below). The exact data schema obviously depends on the number of languages and data formats used by the indexed journals.
In the case of supplementary files there may be several files for a single locale/document format combination. As we only query for articles, all supplementary file full text can be joined into a single field per language/document format. And as we do not allow queries on specific supplementary file meta-data fields we can even further consolidate supplementary file meta-data into a single field per language.
To reduce index size and minimize communication over the network link all our fields are indexed but not stored. The only field to be stored in the index is the ID field which will also be the only field to be returned over the network in response to the query request. Article data (title, abstract, etc.) will then have to be retrieved locally in OJS for display. As we are using paged result sets this can be done without relevant performance impact.
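A paged, ID-only request could look like the following sketch. It uses solr's standard `q`, `fl`, `start` and `rows` request parameters. The field name `id`, the base URL in the usage note and the default page size are assumptions made for illustration.

```python
from urllib.parse import urlencode


def build_search_params(query: str, page: int, page_size: int = 25) -> dict:
    """Request one page of results and return only the (assumed) ID field."""
    return {
        "q": query,
        "fl": "id",                       # assumption: the stored ID field is named "id"
        "start": (page - 1) * page_size,  # zero-based offset of the first hit on the page
        "rows": page_size,
    }


def build_search_url(core_url: str, params: dict) -> str:
    """Build the request URL for solr's select handler."""
    return f"{core_url}/select?{urlencode(params)}"
```

A call such as `build_search_url("http://localhost:8983/solr/ojs", build_search_params("topology", page=3))` asks for hits 51 to 75 only. OJS would then load the article data for the returned IDs from its own database.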
If we want to support highlighting then the galley fields need to be stored, too.
Further specialized fields will be required for certain use cases. If we want to support auto-suggestions or alternative spelling suggestions then we’ll have to provide textual article meta-data fields in a minimally analyzed (lowercase only, non-localized) version. These fields will be called “xxxxx_spell” where “xxxxx” stands for the field name without locale extension.
Fields that we want to use as optional sort criteria need to be single-valued, indexed, and not tokenized. This means that sortable values will potentially have to be analyzed separately into “xxxxx_xx_XX_txtsort” or “xxxxx_dtsort” fields where “xxxxx” stands for the field name and “xx_XX” for the locale (if any) of the sort field.
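The naming patterns for spelling and sort fields can be captured in two small helpers. This is a sketch only: the helper names are hypothetical and the base field names in the usage note are examples, not a list of the actual schema fields.

```python
from typing import Optional


def spell_field(name: str) -> str:
    """Non-localized, minimally analyzed copy used for spelling suggestions."""
    return f"{name}_spell"


def sort_field(name: str, locale: Optional[str] = None, is_date: bool = False) -> str:
    """Single-valued, untokenized copy of a field used as a sort criterion."""
    if is_date:
        return f"{name}_dtsort"            # date sort fields carry no locale
    if locale:
        return f"{name}_{locale}_txtsort"  # localized text sort field
    return f"{name}_txtsort"               # text sort field without locale
```

For instance, `sort_field("title", "en_US")` gives `title_en_US_txtsort` while `sort_field("publicationDate", is_date=True)` gives `publicationDate_dtsort`.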
Faceting fields (“xxxxx_xx_XX_facet”) need to be localized. They are minimally analyzed (lower case only) and tokenized by separator (e.g. “,” or “;”) rather than by whitespace.
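The difference between separator-based and whitespace-based tokenization matters for multi-word keywords. The following sketch imitates the intended facet analysis in plain Python. Trimming surrounding whitespace is an assumption on our part, and the real analysis is configured in the solr schema rather than in code.

```python
import re
from typing import List


def facet_tokens(value: str, separators: str = ",;") -> List[str]:
    """Split a keyword list at separators only and lowercase each keyword."""
    pattern = "[" + re.escape(separators) + "]"
    parts = (part.strip() for part in re.split(pattern, value))
    # Multi-word keywords stay intact because whitespace is not a separator.
    return [part.lower() for part in parts if part]
```

`facet_tokens("Algebraic Topology; Number Theory")` keeps the two keywords whole, whereas whitespace tokenization would produce four unrelated facet values.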
If we want to support the “more-like-this” feature then we may have to store term vectors for galley fields if we run into performance problems. We do not store term vectors by default, though.
Further technical details of the data model can be found in plugins/generic/solr/embedded/solr/conf/schema.xml.
Article data needs to be submitted to solr and preprocessed so that it can be ingested by solr’s Lucene back-end. This is especially true for binary galley and supplementary file formats that need to be transformed into a UTF-8 character stream. The following sections will describe various options and recommendations with respect to document submission and preprocessing.
The current OJS search engine implements document conversion based on 3rd-party commandline tools that need to be installed on the OJS server. Solr, on the other hand, is well integrated with Tika, a document and document meta-data extraction engine written in pure Java. We have to decide whether to re-use the existing OJS solution or whether to use Tika instead.
Advantages of the existing OJS conversion:
Advantages of Tika:
The only real disadvantage of Tika with respect to our requirements is that it does not support conversion of PostScript (PS) files. PS could be supported indirectly by first converting it to PDF locally and then submitting the PDF to the solr server. It is not clear, however, whether there are still OJS installations with an interest in solr that actually use PostScript as a publishing format. The advantage of solr being able to support the ePub format seems more important than the missing PS support.
Recommendation: Use the Tika conversion engine.
In the multi-installation scenarios S3 and S4 document preprocessing could be done locally to the installation or on the central solr server.
Advantages of local processing are:
Advantages of remote processing are:
Recommendation: Use remote processing, mostly due to the reduced deployment cost and easy use of Solr extensions.
Document load can be initiated on the client side (push processing) or on the server side (pull processing). Both options have their strengths and weaknesses.
Advantages of push configuration:
Advantages of pull configuration:
Recommendation: Use the simpler push configuration by default but check its performance and reliability early on. If it turns out to be slow or unreliable, especially in the network deployment case, then provide instructions and sample configuration for an optional pull configuration for larger deployments, see “OJS/solr Protocol Specification” and “Deployment Options” below.
Both push and pull processing can be implemented with or without callback. We recommend callback for network deployment only where large amounts of data have to be indexed and full index rebuilds can be very costly, see “OJS/solr Protocol Specification” below.
The OJS search feature returns result sets on article level rather than listing galleys or supplementary files as independent entities. This means that ideally our index should contain one entry per article so that we do not have to de-duplicate and join result sets. Different language versions and formats of articles should be spread over separate fields rather than documents. Such a denormalized design also facilitates multilingual search and ranking. A detailed rationale for this preferred index design will be given in the “Multilingual Documents” section below.
For document preprocessing this design implies that we have to join various binary files (galleys and supplementary files in all languages and formats) plus the article meta-data fields into a single solr/Lucene document. As we’ll see in the “Solr Preprocessing plug-ins” section, this considerably influences and restricts the implementation options for document import.
We have to decide whether we want to implement our own custom preprocessing wrapper to solr as in the current OJS search implementation or whether we want to re-use the preprocessing interface and capabilities provided by native solr import and preprocessing plug-ins.
Advantages of a custom preprocessing interface are:
Advantages of standard solr plug-ins:
A priori both options have their strengths and weaknesses. In our case, though, the choice is relatively clear due to our preference for remote document preprocessing and Tika as an extraction engine. Having to maintain custom Java code or creating a separate server-side PHP preprocessing and Tika integration engine are certainly not attractive options for FUB or PKP.
Recommendation: The advantages of using established solr plug-ins for data extraction and preprocessing outweigh the advantages of a custom preprocessing interface in our case.
Currently there are two native solr extensions that support Tika integration: The “Data Import Handler” (DIH) and the “Solr Content Extraction Library” (Solr Cell).
Cell is meant to index large amounts of files with very little configuration requirements. Cell does not support more complex import scenarios with several data sources and complex transformation requirements, though. It also does not support data pull. In our case, these disadvantages rule it out as a solution.
The second standard solr preprocessing plug-in, DIH, is a flexible extraction, transformation and loading framework for solr that allows integration of various data sources and supports both pull and push scenarios.
Unfortunately even DIH has two limitations that are relevant in our case:
Recommendation: Use DIH for document preprocessing with a custom XML document transmission format.
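To give an idea of what a custom XML transmission document might contain, the sketch below serializes one article with its localized meta-data and references to its galley files. All element and attribute names, and the idea of referencing galleys by URL, are hypothetical choices made for this illustration. The actual format is defined in the “OJS/solr Protocol Specification”.

```python
import xml.etree.ElementTree as ET


def article_to_xml(article: dict) -> str:
    """Serialize one article (meta-data plus galley references) to XML."""
    root = ET.Element("article", id=article["id"])
    # One element per locale so that the import can route text to per-language fields.
    for locale, title in article["titles"].items():
        ET.SubElement(root, "title", locale=locale).text = title
    # Galleys are only referenced here; binary content would be fetched separately.
    for galley in article["galleys"]:
        ET.SubElement(root, "galley", locale=galley["locale"],
                      mimetype=galley["mimetype"], url=galley["url"])
    return ET.tostring(root, encoding="unicode")
```

The point of the sketch is the shape of the data: a single XML document per article that carries everything needed to build one denormalized solr/Lucene document.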
Tika can retrieve document meta-data from certain document formats, e.g. MS Word documents. This functionality is also well integrated with DIH.
Using this meta-data is problematic, though:
Recommendation: Do not use Tika to extract document meta-data but use the data provided by OJS instead.
DIH supports several data transmission protocols, e.g. direct file access, HTTP, JDBC, etc. In our case we could use direct file access or JDBC for the embedded deployment scenario. But as we also have to support multi-installation scenarios we prefer channeling all data through the network stack so that we can use a single preprocessing configuration for all deployment options. Using the network locally is only marginally slower than accessing the database and file system directly. By far the most processing time is spent on document conversion and indexing, so document transmission will hardly become a performance bottleneck.
HTTP is the network protocol supported by DIH. HTTP can be used for push and pull configurations. It supports transmission of character stream (meta-)data as well as binary (full text) documents. Our recommendation is therefore to use HTTP as the only data transmission protocol in all deployment scenarios.
Non-HTTP protocols can still be optionally supported (e.g. for performance reasons) by making relatively small custom changes to the default DIH configuration.
Exact details of the transmission protocol will be laid out in the “OJS/solr Protocol Specification” below.
To sum up, our analysis of the data import process revealed that the following requirements should be met by a data preprocessing solution:
We provide a prototypical DIH configuration that serves all these import and preprocessing needs:
Please see plugins/generic/solr/embedded/solr/conf/dih-ojs.xml for details.
In the Lucene context, “analysis” means filtering the character stream of preprocessed document data (e.g. filter out diacritics), splitting it up into indexed search terms (tokenization) and manipulating terms to improve the relevance of search results (e.g. synonym injection, lower casing and stemming).
This part of the document describes how we analyze and index documents and queries to improve precision and recall of the OJS search. In other words: We have to include a maximum number of documents relevant to a given search query (recall) into our result set while including a minimum of false positives (precision).
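Both measures can be computed for a test query once we know which articles are relevant. The following minimal sketch uses sets of article IDs; the function name is ours and the numbers in the usage note are invented.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[int], relevant: Set[int]) -> Tuple[float, float]:
    """Return (precision, recall) of a result set for one test query."""
    hits = len(retrieved & relevant)
    # Precision: share of returned documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

If a query returns articles {1, 2, 3, 4} and the relevant articles are {2, 3, 5}, then two of four results are relevant and two of three relevant articles were found.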
Measures that may improve recall in our case are:
Measures that improve precision may be:
Often there is a certain conflict between optimizing recall and precision. Measures that improve recall by ignoring potentially significant differences between search terms may produce false positives thereby reducing precision.
Please observe that most of the above measures require knowledge about the text language, i.e. its specific notation, grammar or even pronunciation. A notable exception to this rule is n-gram analysis which is language-agnostic. Support for a broad number of languages is one of our most important requirements. Therefore appropriate language-specific treatment of meta-data and full text documents is critical to the success of the proposed design. We’ll therefore treat language-specific analysis in detail in the following section.
Our general approach is to keep the analysis process as simple as possible by default. This also includes minimal stemming and language-specific analysis. This is to honor the “simplicity” design goal as specified for this project. Whenever we discover unsatisfactory relevance of result lists during testing (see our testing approach above), especially insufficient recall of multilingual documents, we’ll further customize analysis chains. This ensures that additional complexity is only introduced when well justified by specific user needs.
It is one of the core requirements of this project to better support search in multilingual content. This is especially true for languages with logographic notation, such as Japanese or Chinese, that are not supported by the current OJS search implementation. We’ve already analyzed the impact of multilingual documents on index and data model design. The most important part of multilingual support lies in the analysis process, though. In fact, allowing for language-specific analysis is one of the reasons why we recommend a “one-field-per-language” data model.
There is no recommended default approach for dealing with multilingual content in solr/Lucene. The range of potential applications is so large that individual solutions have to be found for every use case. We’ll therefore treat this question in considerable detail: First we’ll list a few specific analysis requirements derived from the more general project requirements presented earlier. Then we’ll discuss several approaches to multilingual analysis. Finally we’ll recommend an individual solution for the use cases to be supported in this project.
Requirements for the analysis process must above all be derived from expected user queries and the corresponding correctly ranked result lists. The following list of analysis requirements is therefore derived from properties specific to multilingual OJS search queries:
Further requirements derive from multilingual test queries. Consult the list of test queries linked in the main “Requirements” section above for details.
When multilingual content should be analyzed in a language-specific manner (e.g. stemming, stopwords, etc.) we need to know the document language to be able to branch into the correct analysis chain. There are two basic approaches to obtain such language identity information: machine language recognition and user input.
Advantages of machine language recognition:
Advantages of preset languages:
Reliability of machine language recognition vs. preset languages mainly depends on the reliability of user input in the case of preset languages: In our case user provided language information will probably be quite reliable for meta-data and galleys. This is not the case for the content of supplementary files as these do not have a standardized locale field. This seems to be a minor problem, though: It is assumed that searches on supplementary file content are of minor importance in our case.
Our recommendation therefore is to work with preset languages to avoid unnecessary implementation/maintenance cost and complexity. If we see in practice that important test queries cannot be run with preset languages then we can still plug in language recognition where necessary. We can use solr’s “langid” plug-in in this case, see https://wiki.apache.org/solr/LanguageDetection. It provides field-level language recognition out-of-the-box.
The granularity of multilingual analysis has a great influence on implementation complexity and cost. While document-level language processing is largely supported with standard Lucene components, paragraph or sentence-level language recognition and processing requires considerable custom implementation work. This includes development and maintenance of custom solr/Lucene plug-ins based on 3rd-party natural language processing (NLP) frameworks like OpenNLP or LingPipe.
We identified the following implementation options for multilingual support:
The advantage of the first two options is that they can be implemented with standard solr/Lucene components. The third option will require development and maintenance of custom solr/Lucene plug-ins and integration with third-party language processing tools. This is not an option in our case as it would require custom Java programming which has been excluded as a possibility for this project.
We recommend the second approach which will be further detailed in the next section.
There are two basic approaches to deal with multilingual content: A generic n-gram approach that works in a language-agnostic manner and provides relatively good mixed-language analysis results. Alternatively language-specific analysis chains can be used to analyze text whose language is known at analysis time.
Advantages of an n-gram approach:
Advantages of language-specific analysis chains:
While language-specific analysis chains may not be ideal for mixed-language content, it is improbable that n-gram analysis alone will provide satisfactory relevance of result sets.
We therefore recommend a mixed approach: We should provide language-specific analysis chains for the main language of a document or meta-data fields where the language is known and supported. All fields and documents may additionally undergo partial n-gram (e.g. edge-gram) analysis if we find that this is necessary to support multilingual document fields or fields that do not have a language specified. The results from both analysis processes will have to go into separate fields. This requires separate fields per language (see “Data Model” above) and query expansion to all language fields (see the “Query Transformation and Expansion” below).
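Edge n-gram analysis indexes the leading character sequences of each token, which allows prefix matches without any knowledge of the language. The sketch below only illustrates the principle in plain Python. The gram sizes are arbitrary example values, and in solr this would be configured through an analysis filter rather than written as code.

```python
from typing import List


def edge_ngrams(token: str, min_gram: int = 2, max_gram: int = 5) -> List[str]:
    """Return the prefixes of a token with lengths from min_gram to max_gram."""
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]
```

`edge_ngrams("topology", 2, 4)` produces the prefixes of length two to four. Indexing them in a separate field is what allows a query fragment to match the full term regardless of its language.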
According to our “simplicity by default” approach we do not recommend any character stream filtering unless specific test use cases require us to do so. The recommended stemming filters deal to a large extent with diacritics. Lower case filtering is done on a token level.
Tokenization differs for alphabetic languages on the one side and logographic languages on the other. We recommend standard whitespace tokenization for most Western languages while a bigram approach is usually recommended for Japanese, Chinese and Korean. We therefore recommend the solr CJK-tokenizer for these languages.
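The bigram approach for logographic scripts can be illustrated with a few lines of Python. This is a simplified sketch of the principle only: the actual solr tokenizer also deals with mixed scripts, punctuation and similar cases that are ignored here.

```python
from typing import List


def cjk_bigrams(text: str) -> List[str]:
    """Split a run of CJK characters into overlapping two-character tokens."""
    chars = [char for char in text if not char.isspace()]
    if len(chars) < 2:
        # A single character is kept as its own token.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]
```

For the four-character string “東京都庁” this yields three overlapping tokens. Overlap is what makes matching possible without a dictionary, at the cost of occasional false positives where a bigram spans a word boundary.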
We recommend lowercase filtering for alphabetic languages and language-specific stopword filtering by default. In order to simplify analysis and avoid additional maintenance cost, we do not recommend synonym filtering unless required to support specific test cases.
We recommend solr’s minimal language-specific stemming implementations where they exist. Should these yield insufficient recall during testing then we can replace them with more aggressive stemmers on a case-by-case basis.
One might even want to remove all stemming and cluster all alphabetical languages into a single analysis chain similarly to what currently is being done in standard OJS search. In order to keep flexibility for advanced use cases in scenarios S3 and S4 we do recommend language-specific analysis chains, though, even if not used out-of-the-box. It has to be kept in mind that this complexity is completely transparent to end users.
Keyword fields like discipline, subject, etc. are not usually passed through stemming filters. We therefore recommend a generic, language-agnostic analysis chain for all keyword fields.
We have to support a special analysis chain for the article and issue publication date so that range queries on the publication date can be supported. There are default analyzers and field types for dates which we recommend here.
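A range query on the publication date uses solr's bracket syntax with UTC timestamps. The sketch below builds such a query string; the field name in the usage note is an assumption, and the helper expects timezone-aware datetime values.

```python
from datetime import datetime, timezone
from typing import Optional


def solr_date(value: datetime) -> str:
    """Format a timezone-aware datetime in solr's UTC date syntax."""
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def date_range_query(field: str, start: Optional[datetime] = None,
                     end: Optional[datetime] = None) -> str:
    """Build an inclusive range query; a missing bound becomes the wildcard '*'."""
    lower = solr_date(start) if start else "*"
    upper = solr_date(end) if end else "*"
    return f"{field}:[{lower} TO {upper}]"
```

Calling `date_range_query("publicationDate_dt", datetime(2010, 1, 1, tzinfo=timezone.utc))` restricts results to articles published from 2010 onwards and leaves the upper bound open.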
Text fields to be sorted on must not be tokenized. Date fields to be sorted on must be of a different type than date fields to be queried on. We therefore provide special field types for sort ordering.
Theoretically, geographical coverage could be analyzed with a location analyzer if (and only if) it were given in a well-defined latitude/longitude format. As this is not usually the case in OJS we recommend analyzing geographic coverage in the same way as other keyword fields.
Most use cases only require us to index fields. Storage is not required. The only field we need to store (and return from queries) is the document ID field which will be required by OJS to retrieve article data for display in result sets. There is a notable exception to this rule, though: If we enable highlighting then storage of galley fields is mandatory. This is necessary so that the highlighting component can return search terms in their original context. Therefore highlighting considerably increases the storage space required by OJS solr indexes. This should be considered when deciding whether this feature is to be supported out-of-the-box.
Please see plugins/generic/solr/embedded/solr/conf/schema.xml for our recommended analysis configuration.