ES 32 - Elasticsearch Data Modeling: Exploration and Practice

Table of Contents

  • 1 What is data modeling?

  • 2 How to model data in ES

      2.1 Field type modeling

      2.2 Modeling for search, aggregation, and sorting

      2.3 Modeling for additional storage

  • 3 ES data modeling example

      3.1 Dynamically created mapping

      3.2 Manually created mapping

      3.3 New requirement: adding a large field

      3.4 Solving the performance problems caused by a large field

      3.5 Common mapping field parameters

      3.6 Summary of mapping settings

  • 4 ES data modeling best practices

      4.1 How to handle relationships

      4.2 Avoid too many fields

      4.3 Avoid regular-expression queries

      4.4 Avoid inaccurate aggregations caused by null values

1 What is data modeling?

Data modeling is the process of creating a data model.

A data model is a tool and method for describing the real world in abstract form; it maps real-world things such as films and TV shows, actors, and audience reviews.

Data modeling proceeds in three stages: conceptual model => logical model => data model (third normal form).

The data model is defined in terms of the specific database in use, and it is finalized once the application's read/write performance requirements are met.

2 How to model data in ES

Data modeling in ES:

From functional requirements such as data storage and retrieval, extract the attributes of the entities and the relationships between them => this forms the logical model;

From performance requirements, define index templates and index mappings (including field configuration and relationship handling) => this forms the physical model.

The basic unit of storage and retrieval in ES is the document in an index, and a document consists of fields, so modeling in ES means modeling fields.

A document is similar to a row in a relational database, and a field corresponds to a column.

2.1 Field type modeling

(1) text versus keyword:

  • text: for full-text fields; the text is tokenized by an analyzer. Aggregation and sorting are not supported by default; setting "fielddata": true enables them;

  • keyword: for IDs and for enumerated text that does not need tokenizing, such as ID numbers, telephone numbers, and email addresses; suited to filtering (exact match), sorting, and aggregations.

  • Multi-fields:

    With dynamic mapping, a string is mapped as text by default, with a keyword sub-field added;
    When dealing with natural language, you can add sub-fields that use analyzers such as "english", "pinyin" (which requires an analysis plugin), or "standard" to improve the accuracy of search results.

(2) Structured data:

  • Numeric types: choose the narrowest type that fits; for example, if byte is enough, do not use long;

  • Enumerated types: map them as keyword, even when the values are numbers; keyword gives better performance for exact-match (term) lookups, whereas numeric types are the better choice when range queries are needed;

  • Other types: date, binary, boolean, geo, and so on.
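
As a rough illustration of the "narrowest type that fits" advice, the Python sketch below picks an integer field type from an expected value range and builds the text-plus-keyword multi-field shape. The helper names smallest_integer_type and text_with_keyword are invented for this example; only the type names and their value ranges come from Elasticsearch.

```python
# Inclusive value ranges of the Elasticsearch integer field types.
INTEGER_TYPES = [
    ("byte", -2**7, 2**7 - 1),
    ("short", -2**15, 2**15 - 1),
    ("integer", -2**31, 2**31 - 1),
    ("long", -2**63, 2**63 - 1),
]


def smallest_integer_type(min_value: int, max_value: int) -> str:
    """Return the narrowest integer type that can hold [min_value, max_value]."""
    for name, low, high in INTEGER_TYPES:
        if low <= min_value and max_value <= high:
            return name
    raise ValueError("range does not fit into a 64-bit long")


def text_with_keyword(ignore_above: int = 256) -> dict:
    """A text field with a keyword sub-field, the shape dynamic mapping gives strings."""
    return {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": ignore_above}},
    }


print(smallest_integer_type(0, 100))    # byte
print(smallest_integer_type(0, 70000))  # integer
```

The returned dictionaries can be placed under "properties" in a mapping; whether a narrower type is safe depends on how confident you are about the future value range.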

2.2 Modeling for search, aggregation, and sorting

  • If a field needs no search, sorting, or aggregation, set "enabled": false;

  • If it needs no search, set "index": false;

  • If it needs no sorting or aggregation, set "doc_values": false (and leave "fielddata" at false for text fields);

  • For keyword fields that are updated frequently and aggregated frequently, setting "eager_global_ordinals": true is recommended.
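
The decision rules in this list can be written down as a small function. This is only a sketch: keyword_field_mapping and its parameters are hypothetical names, and it assumes the field would otherwise be a keyword. Note that "enabled" can only be set on object fields (and on the top-level mapping), so the "nothing needed" case is mapped as a disabled object here.

```python
def keyword_field_mapping(searchable: bool,
                          sort_or_aggregate: bool,
                          hot_aggregation: bool = False) -> dict:
    """Translate functional needs into mapping parameters for one field."""
    if not searchable and not sort_or_aggregate:
        # Kept in _source only: not parsed, not indexed, no doc values.
        return {"type": "object", "enabled": False}

    mapping = {"type": "keyword"}
    if not searchable:
        mapping["index"] = False       # no inverted index
    if not sort_or_aggregate:
        mapping["doc_values"] = False  # no columnar data for sort/aggs
    elif hot_aggregation:
        # Build global ordinals at refresh time instead of on the first aggregation.
        mapping["eager_global_ordinals"] = True
    return mapping


print(keyword_field_mapping(searchable=False, sort_or_aggregate=True))
print(keyword_field_mapping(searchable=True, sort_or_aggregate=True, hot_aggregation=True))
```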

2.3 Modeling for additional storage

  • Does the field's data need to be stored separately?

"store": true stores the original content of the field;

It is generally used together with "_source": { "enabled": false }, because the default is "_source": { "enabled": true }, i.e. the original JSON document is stored in _source when it is indexed.

  • Disabling _source: disabling the _source meta field saves disk space. It suits metric-type data, such as identifier and timestamp fields, that is never updated or highlighted and is mostly used in filters to narrow the result set quickly in support of faster aggregations.

Official advice: if disk space is the main concern, prefer raising the compression level (the "index.codec": "best_compression" setting) over disabling _source;

Without _source you cannot see the original fields in results, and reindex, update, and update_by_query no longer work;

At the time of writing, Kibana cannot discover an index whose _source field is disabled.

In short, be cautious about disabling the _source field. Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html
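
To make the official advice concrete, here is a minimal Python sketch of a create-index body that keeps _source and opts into the best_compression codec instead. The function name books_index_body is hypothetical; "index.codec" is a static setting, so it is assumed here that it is supplied when the index is created.

```python
import json


def books_index_body(save_disk: bool) -> dict:
    """Create-index body that keeps _source; optionally trades CPU for disk space."""
    body = {"mappings": {"_source": {"enabled": True}}}
    if save_disk:
        body["settings"] = {"index": {"codec": "best_compression"}}
    return body


# The printed JSON could be sent as the body of a PUT <index> request.
print(json.dumps(books_index_body(save_disk=True), indent=2))
```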

3 ES Data Modeling Example Demonstration

3.1 Dynamically created mapping

# Index a book document directly:
POST books/_doc
{
  "title": "Thinking in Elasticsearch 7.2.0",
  "author": "Heal Chow",
  "publish_date": "2019-10-01",
  "description": "Master the searching, indexing, and aggregation features in Elasticsearch.",
  "cover_url": "https://healchow.com/images/29dMkliO2a1f.jpg"
}

# View the automatically created mapping:
GET books/_mapping
# The response:
{
  "books" : {
    "mappings" : {
      "properties" : {
        "author" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "cover_url" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "description" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "publish_date" : {
          "type" : "date"
        },
        "title" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

3.2 Manually Creating Mapping Relationships

# Delete the automatically created books index:
DELETE books

# Manually tuned field mappings.
# cover_url sets index to false: it cannot be searched, but terms aggregations still work.
PUT books
{
  "mappings": {
    "_source": { "enabled": true },
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 100
          }
        }
      },
      "author": { "type": "keyword" },
      "publish_date": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyyMMddHHmmss||yyyy-MM-dd||epoch_millis"
      },
      "description": { "type": "text" },
      "cover_url": {
        "type": "keyword",
        "index": false
      }
    }
  }
}

Note: the _source meta field is enabled by default. If it is disabled, search results can no longer show the original document, and reindex, update, and update_by_query operations are not possible.

3.3 New requirement: adding a large field

  • Requirement: add a book content field that supports full-text search and highlighting.

  • Analysis: the new field makes the content of _source much larger. We can trim the fields in the search results with source filtering:

    "_source": {
        "includes": ["title"]  # or "excludes": ["xxx"] to leave out certain fields
    }

    But this only trims what is returned. While fetching, the data nodes still have to read the complete _source of every hit, so the I/O cost of the large field is not substantially reduced.
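
A simplified Python model of source filtering shows why it does not help with a large field: the filter can only run once the whole document is in hand. This is an illustration, not Elasticsearch's implementation. It assumes flat top-level fields, uses Python's fnmatch as an approximation of wildcard patterns, and applies excludes on top of includes; filter_source is a made-up name.

```python
from fnmatch import fnmatchcase


def filter_source(source: dict, includes=None, excludes=None) -> dict:
    """Keep fields matching `includes` (if given), then drop those matching `excludes`."""
    def matches(name, patterns):
        return any(fnmatchcase(name, pattern) for pattern in patterns)

    kept = {}
    for name, value in source.items():  # the complete document is already loaded here
        if includes and not matches(name, includes):
            continue
        if excludes and matches(name, excludes):
            continue
        kept[name] = value
    return kept


book = {"title": "Thinking in Elasticsearch 7.2.0", "author": "Heal Chow", "content": "x" * 100000}
print(filter_source(book, includes=["title"]))
```

The function receives the 100,000-character content value even though only title is returned, which mirrors the cost described above.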

3.4 Resolve Performance Issues with Large Fields

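(1) Disable the _source meta field when creating the mapping: "_source": { "enabled": false };

(2) Then set "store": true on each field.

# Delete the existing index first, because PUT fails if books already exists:
DELETE books

# Disable the _source meta field and set store=true on every field:
PUT books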
{
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "title": {
        "type": "text",
        "store": true,
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 100
          }
        }
      },
      "author": { "type": "keyword", "store": true },
      "publish_date": {
        "type": "date",
        "store": true,
        "format": "yyyy-MM-dd HH:mm:ss||yyyyMMddHHmmss||yyyy-MM-dd||epoch_millis"
      },
      "description": { "type": "text", "store": true },
      "cover_url": {
        "type": "keyword",
        "index": false,
        "store": true
      },
      "content": { "type": "text", "store": true }
    }
  }
}

(3) Add data and run a highlighted query:

# Index a document that includes the new field:
POST books/_doc
{
  "title": "Thinking in Elasticsearch 7.2.0",
  "author": "Heal Chow",
  "publish_date": "2019-10-01",
  "description": "Master the searching, indexing, and aggregation features in Elasticsearch.",
  "cover_url": "https://healchow.com/images/29dMkliO2a1f.jpg",
  "content": "1. Revisiting Elasticsearch and the Changes. 2. The Improved Query DSL. 3. Beyond Full Text Search. 4. Data Modeling and Analytics. 5. Improving the User Search Experience. 6. The Index Distribution Architecture.  .........."
}

# Use stored_fields to name the fields to return:
GET books/_search
{
  "stored_fields": ["title", "author", "publish_date"],
  "query": {
    "match": { "content": "data modeling" }
  },
  "highlight": {
    "fields": { "content": {} }
  }
}

The query returns:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.5753642,
    "hits" : [
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "dukLoG0BdfGBNhbF13CJ",
        "_score" : 0.5753642,
        "highlight" : {
          "content" : [
            "Data Modeling and Analytics. 5. Improving the User Search Experience. 6."
          ]
        }
      }
    ]
  }
}

(4) Notes on the result:

  • The _source field is not included in the returned hits;

  • Fields to be displayed must be named in the query with "stored_fields": ["xxxx", "yyyy"];

  • Highlighting still works after _source is disabled, because the content field is stored.

3.5 Common mapping field parameters

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-params.html

  • enabled: when set to false, the field is only kept in _source; it is not parsed or indexed, so search and aggregation are not supported;

  • index: whether to build an inverted index. When false the field cannot be searched, but aggregations are still supported and the field still appears in _source;

  • norms: for fields used only for filtering and aggregation (metric data) where scoring does not matter, disabling norms is recommended to save storage;

  • doc_values: whether doc values are enabled, for sorting and aggregation;

  • fielddata: must be set to true to sort or aggregate on a text field;

  • coerce: whether to convert data types automatically (for example string to number); enabled by default;

  • fields: multi-fields, i.e. indexing the same value in several ways;

  • dynamic: controls how the mapping is updated dynamically; the values are true, false, and strict.

doc_values versus fielddata:

  • doc_values: needed for aggregation and sorting; on by default for all field types except text; built at index time and stored on disk, so it does not have to fit in heap memory;

  • fielddata: enables sorting and aggregation on text fields; off by default; loaded entirely into heap memory when first used.

3.6 Summary of mapping settings

(1) Adding new fields (including sub-fields, for example to apply a different analyzer) is supported on an existing index; old documents can then be brought up to date with update_by_query.

(2) Index Template: applies different mappings and settings according to the index name;

(3) Dynamic Template: sets the mapping of dynamically added fields according to the detected type or the field name;

(4) Reindex: to modify or delete existing fields, or to change parameters such as the number of shards, the index has to be rebuilt.

Done in place, this requires downtime, and with a large amount of data it takes a long time.

With an index alias, the switch to the rebuilt index can be made with zero downtime.
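
The alias approach can be sketched as two request bodies: one for POST _reindex to copy the data into a new index, and one for POST _aliases to move the alias in a single atomic step. The Python below only builds the bodies; the index names books_v1 and books_v2 and the alias books are assumptions for illustration, as are the function names.

```python
import json


def reindex_body(old_index: str, new_index: str) -> dict:
    """Body for POST _reindex: copy every document from old_index to new_index."""
    return {"source": {"index": old_index}, "dest": {"index": new_index}}


def alias_swap_body(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST _aliases: both actions are applied atomically."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


print(json.dumps(reindex_body("books_v1", "books_v2")))
print(json.dumps(alias_swap_body("books", "books_v1", "books_v2")))
```

Clients that always read and write through the alias never see a moment where it points at no index, which is what makes the switch possible without downtime.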

4 ES data modeling best practices

4.1 How to handle relationships

(1) Normalized design:

Relational databases have the concept of normalization, with the normal forms 1NF, 2NF, 3NF, BCNF, and so on. The main goal is to reduce unnecessary updates while saving storage space; the drawback is that reads may be slower, especially across tables, where many joins are needed.

(2) Denormalized design: the data is flattened; instead of relations, redundant copies of the data are stored in the document's _source.

Advantages: no joins to process, so read performance is good;

Disadvantages: not suitable for data that changes frequently.

=> ES is not good at handling relationships; they are generally handled with the object type, the nested type, or a parent/child relationship (the join field type).

The specifics take up a lot of space and are omitted here.

4.2 Avoid too many fields

(1) A document should not have a large number of fields:

  • Too many fields make the data hard to maintain;

  • The mapping is stored in the cluster state; if it grows too large it affects cluster performance (the cluster state has to be synchronized to all nodes);

  • Deleting or modifying a field requires a reindex;

(2) By default ES allows at most 1000 fields per index; the limit can be changed with the index.mapping.total_fields.limit setting.

Question: how does a document end up with hundreds of fields?

ES is schemaless by default: whenever a document contains a new field, ES automatically adds a mapping for it based on the type it infers.

If the application is not rigorous about what it writes, a field explosion follows. To avoid this, choose a dynamic strategy:

  • true: unknown fields are added to the mapping automatically; this is the default;

  • false: new fields are not indexed, but are still saved in _source;

  • strict: new fields are rejected; the write fails with an exception.

In a production environment, try not to use the default "dynamic": true.
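
The three dynamic strategies can be compared with a toy model. The Python below is a simplified simulation, not Elasticsearch code: it only tracks top-level field names, the inferred type is a placeholder, and index_document and StrictDynamicMappingError are names invented for the sketch.

```python
class StrictDynamicMappingError(Exception):
    """Raised by the model when dynamic is "strict" and an unmapped field appears."""


def index_document(properties: dict, doc: dict, dynamic: str = "true"):
    """Return (updated_properties, indexed_fields) for one incoming document."""
    updated = dict(properties)
    indexed = []
    for field in doc:
        if field in updated:
            indexed.append(field)
        elif dynamic == "true":
            updated[field] = {"type": "<inferred>"}  # mapping grows automatically
            indexed.append(field)
        elif dynamic == "false":
            continue  # stays in _source, is not indexed, mapping unchanged
        elif dynamic == "strict":
            raise StrictDynamicMappingError(f"unmapped field: {field}")
        else:
            raise ValueError(f"unknown dynamic setting: {dynamic}")
    return updated, indexed


mapping = {"title": {"type": "text"}}
doc = {"title": "Thinking in Elasticsearch 7.2.0", "isbn": "hypothetical-value"}
for mode in ("true", "false"):
    props, indexed = index_document(mapping, doc, mode)
    print(mode, sorted(props), indexed)
```

With "true" the mapping gains an isbn entry on every such write, which is how unchecked input leads to a field explosion; with "false" the mapping stays at one field.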

4.3 Avoid regular-expression queries

Regexp, prefix, and wildcard queries are all term-level queries, but they perform poorly because they have to scan and compare entries one by one; a wildcard at the beginning of the pattern in particular can cause a performance disaster.

(1) Case:

  • A field in the documents holds Elasticsearch version information, for example version: "7.2.0";

  • How do you search for the bug_fix releases (version numbers whose last component is non-zero), or for the documents belonging to each major version?

(2) Wildcard query example:

# Index 2 documents:
PUT softwares/_doc/1
{
  "version": "7.2.0",
  "doc_url": "https://www.elastic.co/guide/en/elasticsearch/.../.html"
}

PUT softwares/_doc/2
{
  "version": "7.3.0",
  "doc_url": "https://www.elastic.co/guide/en/elasticsearch/.../.html"
}

# Wildcard query:
GET softwares/_search
{
  "query": {
    "wildcard": {
      "version": "7*"
    }
  }
}

(3) Solution: turn the string field into an object:

# Delete the earlier index, then create a mapping in which version is an object:
DELETE softwares

PUT softwares
{
  "mappings": {
    "properties": {
      "version": {
        "properties": {
          "display_name": { "type": "keyword" },
          "major": { "type": "byte" },
          "minor": { "type": "byte" },
          "bug_fix": { "type": "byte" }
        }
      },
      "doc_url": { "type": "text" }
    }
  }
}

# Index the data:
PUT softwares/_doc/1
{
  "version": {
    "display_name": "7.2.0",
    "major": 7,
    "minor": 2,
    "bug_fix": 0
  },
  "doc_url": "https://www.elastic.co/guide/en/elasticsearch/.../.html"
}

PUT softwares/_doc/2
{
  "version": {
    "display_name": "7.3.0",
    "major": 7,
    "minor": 3,
    "bug_fix": 0
  },
  "doc_url": "https://www.elastic.co/guide/en/elasticsearch/.../.html"
}

# Filter on the numeric sub-fields instead of using a wildcard query; this is much faster:
GET softwares/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": { "version.major": 7 }
        },
        {
          "match": { "version.minor": 2 }
        }
      ]
    }
  }
}
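
On the application side, the version string has to be split before indexing. The Python sketch below is one way to do that under the assumption that versions always have exactly three numeric components; version_object is a hypothetical helper, and the range check reflects the byte type used in the mapping above.

```python
def version_object(version: str) -> dict:
    """Split "major.minor.bug_fix" into the object shape used by the mapping."""
    major, minor, bug_fix = (int(part) for part in version.split("."))
    for number in (major, minor, bug_fix):
        if not 0 <= number <= 127:  # must fit the byte field type
            raise ValueError(f"version component out of range: {number}")
    return {
        "display_name": version,
        "major": major,
        "minor": minor,
        "bug_fix": bug_fix,
    }


print(version_object("7.2.0"))
```

A string with fewer or more than three components raises ValueError, which is preferable to silently indexing a partial version.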

4.4 Avoid inaccurate aggregations caused by null values

(1) Example:

# Index the data, including 1 document with a null value:
PUT ratings/_doc/1
{
  "rating": 5
}
PUT ratings/_doc/2
{
  "rating": null
}

# Aggregate on the field that contains a null value:
GET ratings/_search
{
  "size": 0,
  "aggs": {
    "avg_rating": {
      "avg": { "field": "rating"}
    }
  }
}

# The result:
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,              # 2 documents, but the null one is skipped, so avg_rating is not what we want
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "avg_rating" : {
      "value" : 5.0
    }
  }
}

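(2) Use null_value to solve the null problem:

# Delete the earlier index, then set null_value when creating the mapping:
DELETE ratings

PUT ratings
{
  "mappings": {
    "properties": {
      "rating": {
        "type": "float",
        "null_value": 1.0
      }
    }
  }
}

# Index the same data and aggregate again; the null document now counts as 1.0: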
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "avg_rating" : {
      "value" : 3.0
    }
  }
}
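
The two averages can be reproduced with a few lines of Python. This is only a model of the arithmetic, not of how Elasticsearch computes aggregations: avg_rating is a made-up function in which a missing value is either skipped or replaced by the configured null_value.

```python
def avg_rating(ratings, null_value=None):
    """Average the ratings; None entries are skipped unless null_value is given."""
    values = []
    for rating in ratings:
        if rating is None:
            rating = null_value  # substitute, as the mapping's null_value would
        if rating is not None:
            values.append(rating)
    return sum(values) / len(values) if values else None


print(avg_rating([5, None]))                  # 5.0
print(avg_rating([5, None], null_value=1.0))  # 3.0
```

Which default to substitute is a business decision; 1.0 here simply mirrors the mapping in the example.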

References

"Elasticsearch Core Technology and Practice", a Geek Time video course

Copyright Notice

Author: MA thin air (https://healchow.com)

Source: the author's blog on cnblogs (https://www.cnblogs.com/shoufeng)

Copyright of this article belongs to the blogger. Reprinting is welcome, but the link to the original article must be clearly indicated; otherwise the blogger reserves the right to pursue legal liability.
