Elasticsearch core technology (2) — Basic Concepts
This post covers the following basic concepts: index, type, and document; clusters, nodes, shards, and replicas; and the inverted index.
First, Index, Type, Document
Index: an index is a container for documents, that is, a collection of documents.
In Elasticsearch the word "index" has three meanings:
1), index (noun)
By analogy with a traditional relational database, an index is roughly equivalent to a database in SQL. An index is identified by its name, which must be all lowercase.
2), index (verb)
The process of saving a document into an index (noun). This is very similar to the INSERT keyword in SQL, except that if the document already exists the operation behaves like an UPDATE.
3), inverted index
A relational database speeds up data retrieval by adding a B+ tree index to specified columns. Elasticsearch uses a structure called the inverted index for the same purpose.
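The verb sense of "index" can be illustrated with a small in-memory stand-in. This is only a sketch and not how Elasticsearch is implemented: the `ToyIndex` class and its `index` method are hypothetical names, and the "created" and "updated" strings mirror the result that Elasticsearch reports for an index operation.

```python
class ToyIndex:
    """Hypothetical in-memory stand-in for an index (noun)."""

    def __init__(self, name):
        # Index names must be all lowercase.
        if name != name.lower():
            raise ValueError("index name must be lowercase")
        self.name = name
        self.docs = {}  # _id -> (version, source)

    def index(self, doc_id, source):
        """Index (verb) a document: insert it, or overwrite it if the ID exists."""
        if doc_id in self.docs:
            version = self.docs[doc_id][0] + 1
            result = "updated"
        else:
            version = 1
            result = "created"
        self.docs[doc_id] = (version, dict(source))
        return {"_index": self.name, "_id": doc_id,
                "_version": version, "result": result}


users = ToyIndex("users")
first = users.index("1", {"name": "alice"})
second = users.index("1", {"name": "alice", "age": 30})
print(first["result"], second["result"])
```

Indexing the same ID twice does not fail the way a duplicate INSERT would in SQL; the second call replaces the stored document and bumps its version.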
A type can be understood as a table in a relational database.
Previously there was a type concept between the index and the document: multiple types could be created under each index, and both the index and the type had to be specified when storing a document. Starting with 6.0.0 an index can contain only a single type,
in 7.0.0 types are deprecated, and from 8.0.0 they are no longer supported.
The reason the concept was abandoned:
It is popular to compare an index to a SQL database and a type to a SQL table, but the comparison is not accurate. In SQL, tables are independent of each other, and fields with the same name in two tables have nothing to do with each other.
In ES, however, if different types under the same index have a field with the same name, Lucene treats them as the same field, and they must have the same definition. So I think an index is now closer to a table,
and the type no longer carries much meaning. Type is now deprecated: starting with 7.0, an index can have only one type, named _doc.
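The constraint that same-named fields must share one definition across all types of an index can be sketched as follows. This is an illustrative model only: `add_type` and the flat `index_fields` dictionary are hypothetical, and the field definitions are reduced to a single type string.

```python
def add_type(index_fields, type_name, type_fields):
    """Register a type's fields in the index-wide field map (hypothetical helper).

    All types in one index share a single field map, so a field name may
    only ever have one definition.
    """
    for field, definition in type_fields.items():
        existing = index_fields.get(field)
        if existing is not None and existing != definition:
            raise ValueError(
                f"field '{field}' in type '{type_name}' is '{definition}' "
                f"but the index already defines it as '{existing}'"
            )
    index_fields.update(type_fields)
    return index_fields


index_fields = {}
add_type(index_fields, "user", {"name": "text", "age": "integer"})
add_type(index_fields, "tweet", {"content": "text", "name": "text"})  # same definition: fine

try:
    add_type(index_fields, "product", {"name": "keyword"})  # conflicting definition
    conflict = None
except ValueError as err:
    conflict = str(err)
print(conflict)
```

Two SQL tables could each define their own `name` column freely; here the third call is rejected because `name` already exists in the shared field map with a different definition.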
Document: a single record in an index is called a document. It is equivalent to a row in a relational database table.
Let's look at the fields returned for a document:
_index: the name of the index the document belongs to.
_type: the name of the type the document belongs to.
_id: the primary key of the document. The ID can be specified when the document is written; if it is not specified, the system generates a unique ID automatically.
_version: the version of the document. Elasticsearch uses the version to ensure that changes to a document are applied in the correct order, avoiding data loss caused by out-of-order updates.
_seq_no: a strictly increasing sequence number assigned at the shard level. A document written later is guaranteed to have a larger _seq_no than a document written earlier.
_primary_term: like _seq_no, an integer. Whenever the primary shard is reassigned, for example after a restart or a primary election, _primary_term is incremented by 1.
found: true if a document with the given ID was found; if the ID does not exist, no data is returned and found is false.
_source: the original JSON data of the document.
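How these fields evolve can be sketched with a toy shard. Everything here is a hypothetical model, not Elasticsearch code: `ToyShard`, `write`, `get`, and `promote_new_primary` are made-up names, and the conditional write is modeled loosely on the `if_seq_no` and `if_primary_term` request parameters.

```python
class ToyShard:
    """Hypothetical model of one shard's bookkeeping."""

    def __init__(self):
        self.primary_term = 1
        self.next_seq_no = 0
        self.docs = {}

    def promote_new_primary(self):
        # Models a restart or primary election: the term goes up by 1.
        self.primary_term += 1

    def write(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        current = self.docs.get(doc_id)
        if if_seq_no is not None:
            seen = (if_seq_no, if_primary_term)
            if current is None or (current["_seq_no"], current["_primary_term"]) != seen:
                raise RuntimeError("version conflict: document changed since it was read")
        record = {
            "_id": doc_id,
            "_version": current["_version"] + 1 if current else 1,
            "_seq_no": self.next_seq_no,  # strictly increasing per shard
            "_primary_term": self.primary_term,
            "_source": dict(source),
        }
        self.next_seq_no += 1
        self.docs[doc_id] = record
        return dict(record)

    def get(self, doc_id):
        record = self.docs.get(doc_id)
        if record is None:
            return {"_id": doc_id, "found": False}
        return {"found": True, **record}


shard = ToyShard()
a = shard.write("1", {"stock": 10})
b = shard.write("2", {"stock": 5})
c = shard.write("1", {"stock": 9}, if_seq_no=a["_seq_no"], if_primary_term=a["_primary_term"])

try:
    # A stale writer still holding the old _seq_no is rejected.
    shard.write("1", {"stock": 8}, if_seq_no=a["_seq_no"], if_primary_term=a["_primary_term"])
    stale_rejected = False
except RuntimeError:
    stale_rejected = True

shard.promote_new_primary()
d = shard.write("2", {"stock": 4})
print(c["_version"], c["_seq_no"], d["_primary_term"], stale_rejected)
```

The stale write is refused instead of silently overwriting the newer document, which is the kind of out-of-order data loss the version fields are there to prevent.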
Second, clusters, nodes, shards, and replicas
An Elasticsearch cluster is a distributed system, and it needs two characteristics:
1) High availability
a) Service availability: the cluster keeps serving requests when some nodes stop;
b) Data availability: if some nodes are lost, no data is lost;
2) Scalability
As request volume and data volume grow, the system can distribute data to other nodes and scale horizontally;
A cluster can have one or more nodes.
Cluster health status
green: all primary shards and all replica shards are available
yellow: all primary shards are available, but not all replica shards are available
red: not all primary shards are available
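The three health values follow directly from which shard copies are assigned. A minimal sketch, assuming each shard copy is described by a dictionary with hypothetical `primary` and `assigned` flags (`cluster_health` is a made-up helper, not an Elasticsearch API):

```python
def cluster_health(shards):
    """Derive green / yellow / red from a list of shard copies."""
    primaries_ok = all(s["assigned"] for s in shards if s["primary"])
    replicas_ok = all(s["assigned"] for s in shards if not s["primary"])
    if not primaries_ok:
        return "red"      # at least one primary shard is unavailable
    if not replicas_ok:
        return "yellow"   # primaries are fine, some replica is unavailable
    return "green"        # every primary and replica is available


all_assigned = [
    {"primary": True, "assigned": True},
    {"primary": False, "assigned": True},
]
replica_missing = [
    {"primary": True, "assigned": True},
    {"primary": False, "assigned": False},
]
primary_missing = [
    {"primary": True, "assigned": False},
    {"primary": False, "assigned": True},
]
print(cluster_health(all_assigned), cluster_health(replica_missing), cluster_health(primary_missing))
```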
2, Node
1) What is a node?
a) A node is an instance of Elasticsearch, which is essentially a Java process;
b) Multiple Elasticsearch instances can run on one machine, but in a production environment it is recommended to run only one instance per machine;
A node is a single server in the cluster that stores data and takes part in the cluster's search and indexing. Like a cluster, a node has a unique name; by default a name is generated automatically when the node starts,
and it can also be specified manually. A cluster can be made up of any number of nodes. If only one node is started, it forms a single-node cluster.
Primary Shard
Shards solve the problem of a single node's capacity limit: through primary shards, data can be distributed across all nodes in the cluster.
How they relate to each other:
One node corresponds to one ES instance; a node can hold multiple indices; an index can have multiple shards; a shard is a Lucene index (here "index" is Lucene's own concept, which is not the same thing as an ES index).
The number of primary shards is specified when the index is created and cannot be changed afterwards, except by reindexing.
An index's data is stored across one or more shards, which is comparable to horizontal table partitioning. A shard is a Lucene instance and is a complete search engine in its own right. Our documents are stored and indexed in shards,
but applications interact with the index directly rather than with the shards.
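Why the number of primary shards is fixed can be seen from how a document is routed: the target shard is chosen from a hash of the routing value (the document ID by default) modulo the number of primary shards. The sketch below is an approximation under stated assumptions: `route` is a hypothetical helper and `zlib.crc32` is a stand-in hash, not the hash function Elasticsearch uses.

```python
import zlib


def route(doc_id, number_of_primary_shards):
    # Stand-in hash (assumption); only the "hash modulo shard count" shape matters.
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards


doc_ids = [f"doc-{i}" for i in range(1000)]

# Pretend the index was created with 5 primary shards and later changed to 6.
moved = sum(1 for doc_id in doc_ids if route(doc_id, 5) != route(doc_id, 6))
print(f"{moved} of {len(doc_ids)} documents would be looked up on a different shard")
```

If the shard count changed in place, most existing documents would be searched for on the wrong shard, which is why changing it requires writing the data into a new index.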
Replica Shard
Replicas play two important roles:
1, Service availability: if there is only one copy of the data and the node holding it goes down, all of the data on that node is lost. With replicas, the data is not lost as long as a node holding another copy is still up. For this reason a replica shard is never
allocated to the same node as its primary shard;
2, Scalability: search performance improves because searches can run in parallel across all replicas. Data on replicas is near real-time, so every replica can serve searches, and with a reasonable number of replicas
you can increase search throughput.
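The rule that a replica never shares a node with its primary can be sketched with a toy allocator. This is a simplified round-robin placement written for illustration; `allocate` is hypothetical and the real allocation logic weighs many more factors.

```python
def allocate(nodes, number_of_shards, number_of_replicas):
    """Place shard copies so that no node holds two copies of the same shard."""
    assignment = {node: [] for node in nodes}
    unassigned = []
    for shard in range(number_of_shards):
        for copy in range(number_of_replicas + 1):
            label = f"P{shard}" if copy == 0 else f"R{shard}"
            if copy < len(nodes):
                # Each further copy of a shard moves on to the next node.
                node = nodes[(shard + copy) % len(nodes)]
                assignment[node].append(label)
            else:
                # More copies than nodes: the extra copy stays unassigned.
                unassigned.append(label)
    return assignment, unassigned


three_nodes, none_left = allocate(["node-1", "node-2", "node-3"], 3, 1)
one_node, left_over = allocate(["node-1"], 1, 1)
print(three_nodes)
print(one_node, left_over)
```

On a single-node cluster the replica has nowhere to go and stays unassigned, which matches the yellow health status described earlier.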
Shard settings: for a production environment we need to do capacity planning in advance, because the number of primary shards is fixed when the index is created and cannot be modified afterwards.
If the number of shards is set too small:
The index cannot scale horizontally by adding nodes later;
Each shard holds too much data, so reallocating data is time-consuming.
If the number of shards is set too large:
Relevance scoring of search results and the accuracy of statistical results are affected;
Too many shards on a single node waste resources and also hurt performance.
Third, the inverted index
Elasticsearch's search is built on Lucene, and Lucene searches through an inverted index. What the inverted index contains depends on how the text is tokenized.
1, Assume a collection of five documents. The content of each document is shown in the figure: the leftmost column is the document ID, and next to it is the text of that document.
(Figure borrowed from another source.)
2, First, a tokenizer splits each document into a sequence of words, and for each word the system records which documents contain it. At the end of this process we have the simplest form of inverted index.
3, The index can also record additional information, such as the term frequency shown in the figure. Each document is split into terms (a term is a single character or a word, depending on the tokenizer used).
The inverted index stores each term, its frequency of occurrence (tf, term frequency), and the positions where it occurs. Terms in an inverted index are kept in sorted order, which the figure does not reflect. Note that the content here belongs to
one field of the document: each indexed field has its own inverted index.
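The three steps above can be sketched in a few lines. This is a minimal illustration rather than Lucene's data structures: the function names are hypothetical, the sample documents are made up, and the tokenizer is assumed to simply lowercase the text and split on whitespace.

```python
def tokenize(text):
    # Assumption: lowercase and split on whitespace; real analyzers do more.
    return text.lower().split()


def build_inverted_index(docs):
    """Simplest form: term -> sorted list of IDs of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


def build_positional_index(docs):
    """Richer form: term -> {doc_id: {"tf": count, "positions": [...]}}."""
    index = {}
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return {
        term: {d: {"tf": len(p), "positions": p} for d, p in postings.items()}
        for term, postings in sorted(index.items())
    }


docs = {
    1: {"title": "quick fox", "body": "the quick quick fox"},
    2: {"title": "lazy dog", "body": "the lazy dog"},
}
fields = ("title", "body")

# One inverted index per field, with terms kept in sorted order.
simple = {f: build_inverted_index({i: d[f] for i, d in docs.items()}) for f in fields}
positional = {f: build_positional_index({i: d[f] for i, d in docs.items()}) for f in fields}

print(simple["body"])
print(positional["body"]["quick"])
```

The term "the" appears only in the `body` index, not in the `title` index, because each field is indexed separately.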
A simple search process
Suppose we search for "Google Maps father". The search proceeds as follows:
The tokenizer splits the query into three terms: Google, Maps, father;
The three terms are looked up in the inverted index (this is very efficient, for example with binary search); for each match, the corresponding document IDs are used to fetch the document content.
But how is the order of the results determined?
This is where _score comes in. Lucene scores each document against the matched terms, and the higher the score, the higher the ranking. Several related concepts:
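The lookup described above can be sketched with a sorted term list and Python's `bisect` module. The sample documents are made up, `lookup` and `search` are hypothetical helpers, and the query terms are assumed to be combined with OR (a document matches if it contains any of the terms).

```python
from bisect import bisect_left

documents = {
    1: "google maps father joins facebook",
    2: "google opens new office",
    3: "father of the bride",
}

# Build a tiny inverted index: a sorted term list plus a parallel postings list.
index = {}
for doc_id, text in documents.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)
sorted_terms = sorted(index)
postings = [sorted(index[term]) for term in sorted_terms]


def lookup(term):
    """Binary-search the sorted term list and return the matching document IDs."""
    i = bisect_left(sorted_terms, term)
    if i < len(sorted_terms) and sorted_terms[i] == term:
        return postings[i]
    return []


def search(query):
    matched = set()
    for term in query.lower().split():  # assumption: whitespace tokenizer
        matched.update(lookup(term))
    # Use the matched IDs to fetch the document content.
    return {doc_id: documents[doc_id] for doc_id in sorted(matched)}


results = search("Google Maps father")
print(results)
```

The result says which documents match but not in which order they should be shown; ordering is the job of the score.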
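- TF (term frequency): how often a term appears in the current document. A term that appears 5 times in the current document is more relevant than one that appears once, and scores higher.
- IDF (inverse document frequency): how often the term appears across all documents. The higher this frequency, the lower the score the term contributes.
- Field-length norm: simply put, the shorter the field, the higher its weight. For example, when the term `我` matches both `我123` and `我123456`, `我123` gets the higher score.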
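How the three factors combine can be sketched with a simplified scoring function. The formulas below loosely follow the classic TF/IDF scheme (square root of the frequency, a logarithmic IDF, and one over the square root of the field length); they are an assumption made for illustration, and current Elasticsearch versions score with BM25 by default. The `score` function is a hypothetical helper.

```python
import math


def score(query_terms, doc_tokens, all_docs_tokens):
    """Simplified TF * IDF * field-length norm, summed over the query terms."""
    if not doc_tokens:
        return 0.0
    n_docs = len(all_docs_tokens)
    norm = 1.0 / math.sqrt(len(doc_tokens))       # shorter field -> larger norm
    total = 0.0
    for term in query_terms:
        freq = doc_tokens.count(term)
        if freq == 0:
            continue
        doc_freq = sum(1 for tokens in all_docs_tokens if term in tokens)
        tf = math.sqrt(freq)                      # more occurrences -> higher
        idf = 1.0 + math.log(n_docs / (doc_freq + 1))  # more common -> lower
        total += tf * idf * norm
    return total


# Field-length norm: the example from the list above, tokenized per character.
short_doc, long_doc = list("我123"), list("我123456")
corpus = [short_doc, long_doc]
short_score = score(["我"], short_doc, corpus)
long_score = score(["我"], long_doc, corpus)

# IDF: "a" occurs in every document, "x" in only one.
idf_corpus = [["a", "x"], ["a", "y"], ["a", "z"]]
common_score = score(["a"], idf_corpus[0], idf_corpus)
rare_score = score(["x"], idf_corpus[0], idf_corpus)

print(short_score > long_score, rare_score > common_score)
```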