
Elasticsearch (10) — Built-in Analyzers and Chinese Analyzers

This post covers three topics: basic analysis concepts, Elasticsearch's built-in analyzers, and Chinese analyzers for Elasticsearch.

First, analysis concepts

1, Analysis and Analyzer

Analysis: text analysis is the process of converting full text into a series of words (terms/tokens), also known as tokenization. Analysis is performed by an Analyzer.

When a document is indexed, an inverted index may be created for each field (unless the mapping disables indexing for that field).

Building the inverted index means running the document through an Analyzer to split it into terms; each term then points to the set of documents that contain it.

At query time, Elasticsearch decides (depending on the type of search) whether to analyze the query string, then looks up the resulting terms in the inverted index to find the matching documents.
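To see exactly which terms a piece of text would produce, the _analyze API can be pointed at a field so that the field's configured analyzer is used. A minimal sketch — my_index and title here are placeholder names, not from this post:

GET my_index/_analyze
{
  "field": "title",
  "text": "Quick Brown Foxes!"
}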

2, Analyzer composition

An Analyzer is made up of three kinds of building blocks: character filters, tokenizers, and token filters.

1) Character filters

Before a piece of text is tokenized, it is pre-processed. The most common examples are stripping HTML tags (<b>hello</b> --> hello) and replacing characters, e.g. & --> and (I&you --> I and you).
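Character filters can be tried out directly with the _analyze API. A small sketch using the built-in html_strip filter (the keyword tokenizer is chosen here only so the cleaned text comes back as a single token):

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>hello</b> world"
}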

2) Tokenizers

English can be tokenized by splitting on spaces; Chinese tokenization is more complex and may rely on dictionaries or machine learning algorithms.

3) Token filters

Token filters post-process the tokens produced by the tokenizer: changing case (e.g. lowercasing "Quick"), removing tokens (e.g. stop words such as "a", "and", "the"), or adding tokens (e.g. synonyms such as "jump" and "leap").

The three are applied in order: Character Filters —> Tokenizer —> Token Filters

And in these quantities: analyzer = CharFilters (0 or more) + Tokenizer (exactly 1) + TokenFilters (0 or more)
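All three kinds of blocks can be combined in a single ad-hoc _analyze call, which makes the order easy to observe: html_strip runs first, then the standard tokenizer, then the lowercase and stop token filters. A minimal sketch:

POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<p>The QUICK Brown Foxes</p>"
}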

3, Elasticsearch built-in analyzers

  • Standard Analyzer – the default analyzer; splits text on word boundaries and lowercases the tokens

  • Simple Analyzer – splits on any non-letter character (symbols are dropped) and lowercases

  • Stop Analyzer – lowercases and removes stop words (the, a, is, ...)

  • Whitespace Analyzer – splits on whitespace only and does not lowercase

  • Keyword Analyzer – no tokenization; the input is returned as a single term

  • Pattern Analyzer – splits with a regular expression, \W+ by default (i.e. on non-word characters)

  • Language Analyzers – analyzers for 30+ common languages

  • Custom Analyzer – an analyzer you assemble yourself from character filters, a tokenizer and token filters
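To get a feel for the differences, run the same sentence through two of them; the keyword analyzer should return the whole string as a single term, while the stop analyzer lowercases it and drops the stop words. A small sketch:

POST _analyze
{
  "analyzer": "keyword",
  "text": "The Quick Brown Foxes"
}

POST _analyze
{
  "analyzer": "stop",
  "text": "The Quick Brown Foxes"
}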

4, Setting the analyzer when creating an index

PUT new_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "std_folded": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "asciifolding"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "std_folded" #指定分词器
            },
            "content": {
                "type": "text",
                "analyzer": "whitespace" #指定分词器
            }
        }
    }
}
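Once the index exists, the custom analyzer can be tested against it directly; asciifolding should turn the accented characters into plain ASCII and lowercase should lower-case them. A quick check:

POST new_index/_analyze
{
  "analyzer": "std_folded",
  "text": "Déjà Vu"
}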

Second, ES built-in analyzers

A few common analyzers are explained below: Standard Analyzer, Simple Analyzer, and Whitespace Analyzer.

1, Standard Analyzer (default)

1) Example

standard is the default analyzer. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm) and works well for most languages.

POST _analyze
{
  "analyzer": "standard",
  "text":     "Like X 国庆放假的"
}

Running this returns the token list: the standard analyzer lowercases "Like" and splits the Chinese text into single-character tokens.

2) Configuration

Standard analyzer accepts the following parameters:

    max_token_length: the maximum token length; tokens longer than this are split at that length; default 255

    stopwords: a pre-defined stop word list such as _english_, or an array of stop words; defaults to _none_

    stopwords_path: the path to a file containing stop words

PUT new_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",       #设置分词器为standard
          "max_token_length": 5,    #设置分词最大为5
          "stopwords": "_english_"  #设置过滤词
        }
      }
    }
  }
}
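A quick test shows the effect of the settings: with max_token_length set to 5, longer words are split into 5-character chunks, and the English stop words are removed. Assuming the index from the snippet above has just been created:

POST new_index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}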

2, Simple Analyzer

The simple analyzer breaks the text into terms whenever it encounters a character that is not a letter, and lowercases every term.

POST _analyze
{
  "analyzer": "simple",
  "text":     "Like X 国庆放假 的"
}

Running this returns the resulting terms, all lowercased.

3, Whitespace Analyzer

The whitespace analyzer splits text on whitespace only and does not lowercase the tokens.

POST _analyze
{
  "analyzer": "whitespace",
  "text":     "Like X 国庆放假 的"
}

Running this returns the tokens split only on the spaces, with the original case preserved: Like, X, 国庆放假, 的.

Third, Chinese analyzers

For Chinese text, the most commonly recommended analyzer today is IK; there are also others such as smartCN and HanLP.

This section shows how to use IK as the Chinese analyzer.

1, Installing the IK analyzer

The open-source IK analyzer's GitHub page: https://github.com/medcl/elasticsearch-analysis-ik

Note that the IK version must match your installed ES version. My ES version is 7.1.0, so I find the corresponding release on GitHub and run the install command:

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip

Running the command downloads and installs the plugin.

Note: after installing the plugin you need to restart Elasticsearch for it to take effect.
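After the restart, you can double-check that the plugin was picked up; analysis-ik should appear in the list (this assumes you run it from the Elasticsearch home directory):

./bin/elasticsearch-plugin list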

2, Using IK

IK provides two tokenization granularities:

ik_smart: produces the coarsest-grained split

ik_max_word: produces the finest-grained split of the text

1) ik_smart Split

GET /_analyze
{
  "text":"中华人民共和国国徽",
  "analyzer":"ik_smart"
}

Running this returns the coarse-grained split, roughly 中华人民共和国 / 国徽.

2) ik_max_word Split

GET /_analyze
{
  "text":"中华人民共和国国徽",
  "analyzer":"ik_max_word"
}

Running this returns a much finer-grained split with many overlapping terms (中华人民共和国, 中华人民, 中华, 人民, 共和国, 国徽, and so on).
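In practice, a common setup (also the one suggested in the IK README) is to index with ik_max_word for better recall and to search with ik_smart. A minimal sketch — ik_index and content below are placeholder names:

PUT ik_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}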


