Stemming and thesaurus interactions with other search features

As core features of the Navigation Engine search subsystem, stemming and the thesaurus have the following types of interactions with other search features:

  • Search characters: The search character set configured for the application dictates the set of available characters for stemming and thesaurus entries. By default, only alphanumeric ASCII characters may be used in stemming and thesaurus entries. Additional punctuation and other special characters may be enabled for use in stemming and thesaurus entries by adding these characters to the search character set.

    The Navigation Engine matches user query terms to stemming and thesaurus forms using the following rule: all alphanumeric and search characters must match against the stemming and thesaurus forms exactly; other characters in the user search query are treated as word delimiters. For details on search characters, see Using search characters.

  • Spelling: Spelling correction is closely related to stemming and thesaurus functionality, because spelling auto-correction essentially provides an additional mechanism for computing alternate versions of the user query. In the Navigation Engine, spelling is handled as a higher-level feature than stemming and thesaurus. That is, spelling correction considers only the raw form of the user query when producing alternate query forms.

    Alternate spell-corrected queries are then subject to all of the normal stemming and thesaurus processing. For example, if the user enters the query telvision and this query is spell-corrected to television, the results will also include results for the alternate forms televisions, tv, and tvs.

    Note: In some cases, the thesaurus feature is used as a replacement for, or in addition to, the system's standard spelling correction features. In general, this technique is discouraged. The vast majority of actual misspelled user queries can be handled correctly by the spelling correction subsystem. In rare cases, however, the spelling correction feature cannot correct a particular misspelled query of interest, and it is common to add a thesaurus entry to handle the correction. If at all possible, such entries should be avoided, as they can lead to undesirable feature interactions.

  • Stop words: Stop words are words configured to be ignored by the Navigation Engine search query engine. A stop word list typically includes words that occur too frequently in the data to be useful (for example, the word bottle in a wine data set), as well as words that are too general (such as clothing in an apparel-only data set).

    If "the" is marked as a stopword, then a query for "the computer" will match text that contains the word "computer" but may be missing the word "the".

    Stop words are not currently expanded by the stemming and thesaurus equivalence set. For example, suppose you mark item as a stopword and also include a thesaurus equivalence between the words item and items. This will not automatically mark the word items as a stopword; such expansions must be applied manually.

    Stop words are respected when matching thesaurus entries to user queries. For example, suppose you define an equivalence between Muhammad Ali and Cassius Clay and also mark M as a stopword (it is not uncommon to mark all or most single letter words as stopwords). In this case, a query for Cassius M. Clay would match the thesaurus entry and return results for Muhammad Ali as expected.

    For a list of suggested stop words, see Appendix B.

  • Phrase Search: A phrase search is a search query that contains one or more multi-word phrases enclosed in quotation marks. The words inside phrase-query terms are interpreted strictly literally and are not subject to stemming or thesaurus processing. For example, if we define a thesaurus equivalence between Jennifer Lopez and JLo, normal (unquoted) searches for Jennifer Lopez will also return results for JLo, but a quoted phrase search for "Jennifer Lopez" will not return the additional JLo results.
  • Relevance Ranking: In many cases, it is desirable to affect the order in which results are returned based on stemming and thesaurus processing. In particular, it is typically desirable to return results for the actual user query ahead of results for stemming and/or thesaurus transformed versions of the query. This type of result ordering is supported by the interp relevance ranking module. For details, see Using Relevance Ranking.
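
The search-character, stop-word, and phrase-search rules in the list above can be illustrated with a small Python sketch. This is not Navigation Engine code or its API: the function names (tokenize, alternate_forms), the case-insensitive matching, and the AT&T sample query are assumptions made for illustration only, and stemming, spelling correction, and relevance ranking are not modeled.

```python
import re


def tokenize(text, search_chars=""):
    """Split text into lowercase terms.

    ASCII alphanumerics and the configured search characters are kept;
    every other character acts as a word delimiter.
    Lowercasing is an assumption of this sketch.
    """
    keep = "A-Za-z0-9" + re.escape(search_chars)
    return [term.lower() for term in re.findall("[" + keep + "]+", text)]


def find_sub(seq, sub):
    """Return the index of the first occurrence of list sub in seq, or -1."""
    n = len(sub)
    for i in range(len(seq) - n + 1):
        if seq[i:i + n] == sub:
            return i
    return -1


def alternate_forms(query, equivalences, stop_words=(), search_chars=""):
    """Return the query plus its thesaurus alternates (hypothetical helper).

    equivalences is a list of groups of equivalent phrases.
    """
    stops = {word.lower() for word in stop_words}

    def normalize(text):
        # Stop words are dropped before thesaurus matching.
        return [t for t in tokenize(text, search_chars) if t not in stops]

    stripped = query.strip()
    if len(stripped) >= 2 and stripped[0] == '"' and stripped[-1] == '"':
        # Quoted phrase: interpreted literally, no thesaurus expansion.
        # Stop-word handling inside phrases is left out of this sketch.
        return [" ".join(tokenize(stripped, search_chars))]

    terms = normalize(query)
    forms = [" ".join(terms)]
    for group in equivalences:
        normalized = [normalize(phrase) for phrase in group]
        for form in normalized:
            start = find_sub(terms, form) if form else -1
            if start < 0:
                continue
            for other in normalized:
                if other == form:
                    continue
                alternate = terms[:start] + other + terms[start + len(form):]
                joined = " ".join(alternate)
                if joined not in forms:
                    forms.append(joined)
    return forms


ali = [["Muhammad Ali", "Cassius Clay"]]
jlo = [["Jennifer Lopez", "JLo"]]

print(tokenize("AT&T plans"))                    # "&" acts as a delimiter
print(tokenize("AT&T plans", search_chars="&"))  # "&" is a search character
print(alternate_forms("Cassius M. Clay", ali, stop_words=["M"]))
print(alternate_forms("Jennifer Lopez", jlo))
print(alternate_forms('"Jennifer Lopez"', jlo))
```

With M marked as a stop word, the query Cassius M. Clay reduces to the same terms as the thesaurus form Cassius Clay and picks up the Muhammad Ali alternate; without the stop word, the extra term blocks the match. The quoted query yields only its literal form.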
