InformationRetrieval:Lucene

From fritz'wiki
Revision as of 17:28, 30 September 2013 by Fritz (Talk | contribs)


Basic Terminology

Token

A Token is an occurrence of a term from the text of a field. It consists of a term's text, the start and end offset of the term in the text of the field, and a type string. The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser or to show matching text fragments in a KWIC (keyword in context) display. The type is a string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example, an end-of-sentence marker token might be implemented with type "eos". The default token type is "word". A Token can optionally have metadata (a.k.a. Payload) in the form of a variable-length byte array. Use TermPositions.getPayloadLength() and TermPositions.getPayload(byte[], int) to retrieve the payloads from the index.

NOTE: As of 2.9, Token implements all Attribute interfaces that are part of core Lucene and can be found in the tokenattributes subpackage. Although it is no longer necessary to use Token with the new TokenStream API, it can still be used as a convenience class that implements all Attributes, which is especially useful when switching from the old TokenStream API to the new one.
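
The role of the start and end offsets can be illustrated with a small sketch. The classes below (SimpleToken, OffsetDemo) are hypothetical stand-ins written for this page, not Lucene's Token or analyzer classes. They assume a plain whitespace tokenizer that assigns every token the default type "word", and they use the recorded offsets to mark query terms in the original text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; this is NOT Lucene's Token class.
final class SimpleToken {
    final String text;
    final int startOffset; // index of the first character in the source text
    final int endOffset;   // index one past the last character
    final String type;     // lexical class, "word" by default

    SimpleToken(String text, int startOffset, int endOffset, String type) {
        this.text = text;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.type = type;
    }
}

final class OffsetDemo {
    // Splits on whitespace and records where each token sits in the source text.
    static List<SimpleToken> tokenize(String source) {
        List<SimpleToken> tokens = new ArrayList<>();
        int i = 0;
        int n = source.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(source.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < n && !Character.isWhitespace(source.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(new SimpleToken(source.substring(start, i), start, i, "word"));
            }
        }
        return tokens;
    }

    // Wraps each token matching queryTerm (ignoring case) in [ ] using its offsets.
    static String highlight(String source, List<SimpleToken> tokens, String queryTerm) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (SimpleToken t : tokens) {
            if (t.text.equalsIgnoreCase(queryTerm)) {
                out.append(source, last, t.startOffset);
                out.append('[').append(source, t.startOffset, t.endOffset).append(']');
                last = t.endOffset;
            }
        }
        out.append(source, last, source.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "Lucene indexes text and Lucene searches text";
        System.out.println(highlight(source, tokenize(source), "lucene"));
    }
}
```

Because each token carries its own offsets, the highlighter never has to search the source text again; it only copies the spans between matches.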

Term

A Term represents a word from text. This is the unit of search. It is composed of two elements: the text of the word, as a string, and the name of the field that the text occurred in, an interned string. Note that terms are not limited to words from text fields; they may also represent things like dates, email addresses, URLs, etc.
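
A minimal sketch of why a term pairs a field name with text: the same word in two different fields is two different terms, so a search against one field does not match occurrences in another. SimpleTerm and TermIndexDemo are hypothetical names used only for this illustration and are not Lucene classes; the lowercasing and whitespace splitting are simplifying assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for illustration only; this is NOT Lucene's Term class.
final class SimpleTerm {
    final String field;
    final String text;

    SimpleTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SimpleTerm)) {
            return false;
        }
        SimpleTerm other = (SimpleTerm) o;
        return field.equals(other.field) && text.equals(other.text);
    }

    @Override
    public int hashCode() {
        return Objects.hash(field, text);
    }
}

final class TermIndexDemo {
    // Maps each (field, text) term to the sorted ids of documents containing it.
    private final Map<SimpleTerm, SortedSet<Integer>> postings = new HashMap<>();

    void add(int docId, String field, String value) {
        for (String word : value.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(new SimpleTerm(field, word), k -> new TreeSet<>()).add(docId);
        }
    }

    SortedSet<Integer> search(String field, String text) {
        SortedSet<Integer> hits = postings.get(new SimpleTerm(field, text));
        return hits == null ? new TreeSet<Integer>() : hits;
    }

    public static void main(String[] args) {
        TermIndexDemo index = new TermIndexDemo();
        index.add(1, "title", "Lucene in Action");
        index.add(2, "body", "Action items for Lucene");
        System.out.println(index.search("title", "lucene"));
        System.out.println(index.search("body", "lucene"));
    }
}
```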

Payload

A Payload is metadata that can be stored together with each occurrence of a term. This metadata is stored inline in the posting list of the specific term. To store payloads in the index, use a TokenStream that produces payload data. Use TermPositions.getPayloadLength() and TermPositions.getPayload(byte[], int) to retrieve the payloads from the index.
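
The idea of storing metadata inline with each occurrence can be sketched as follows. This is a simplified in-memory model, not Lucene's index format or API: Occurrence and PayloadPostingsDemo are hypothetical names, and the choice of a part-of-speech tag as the payload is an assumption made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration only; NOT part of Lucene.
// One entry in a term's posting list: where the term occurred, plus its payload.
final class Occurrence {
    final int docId;
    final int position;
    final byte[] payload; // variable-length metadata for this single occurrence

    Occurrence(int docId, int position, byte[] payload) {
        this.docId = docId;
        this.position = position;
        this.payload = payload;
    }
}

final class PayloadPostingsDemo {
    // Posting lists keyed by term text; each occurrence keeps its payload inline.
    private final Map<String, List<Occurrence>> postings = new HashMap<>();

    void addOccurrence(String term, int docId, int position, byte[] payload) {
        postings.computeIfAbsent(term, k -> new ArrayList<>())
                .add(new Occurrence(docId, position, payload.clone()));
    }

    List<Occurrence> occurrences(String term) {
        List<Occurrence> list = postings.get(term);
        return list == null
                ? Collections.<Occurrence>emptyList()
                : Collections.unmodifiableList(list);
    }

    public static void main(String[] args) {
        PayloadPostingsDemo index = new PayloadPostingsDemo();
        // Assumed payload content: a part-of-speech tag encoded as UTF-8 bytes.
        index.addOccurrence("run", 1, 0, "verb".getBytes(StandardCharsets.UTF_8));
        index.addOccurrence("run", 1, 5, "noun".getBytes(StandardCharsets.UTF_8));
        for (Occurrence o : index.occurrences("run")) {
            String tag = new String(o.payload, StandardCharsets.UTF_8);
            System.out.println("doc " + o.docId + " pos " + o.position + " payload " + tag);
        }
    }
}
```

The two occurrences of the same term in the same document carry different payloads, which is the property that distinguishes a payload from per-term or per-document metadata.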

Korean Analysis

LuceneKorean Cafe

AnalysisOutput

WordEntry

CompoundEntry

Korean Grammar

Irregular conjugation
Hangul orthography