Elasticsearch Usage Summary

株野 2017-05-25


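I first came across Elasticsearch while building an ELK logging system. As we consumed more and more log data, its powerful search and analytics capabilities won us over. Later, we used Elasticsearch as the core data store and real-time aggregation engine of a user-behavior data collection system, and later still we built our product's search service on it. So far Elasticsearch has proven flexible and has performed well in all three systems. Along the way we also ran into plenty of problems: confusion over basic concepts, how to perform operations, deployment, performance, and so on. This article summarizes how to use Elasticsearch at the application level, covering the WHAT and the HOW: what it is and how to use it.
NOTE: The concepts and methods described here all apply to Elasticsearch 2.3.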

Basic Concepts

Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected.

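That is how Elastic officially positions Elasticsearch. Put simply, Elasticsearch is a document-oriented NoSQL database that uses JSON as its document serialization format. What sets it apart is that it uses Lucene at its core for all indexing and search functionality, so the content of every document can be indexed, searched, sorted, and filtered. It also offers rich aggregations for multi-dimensional analysis of the data. Externally, everything goes through a REST API, i.e. client and server communicate over HTTP.
First, let's look at the basic storage concepts. They are compared with MySQL here to make the meaning of each concept clearer.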

Elasticsearch	MySQL
index (noun)	database
doc type	table
document	row
field	column
mapping	schema
query DSL	SQL

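Next, the inverted index (see the official explanation). The inverted index is the cornerstone of a search engine and the reason Elasticsearch can do fast full-text search. In short, it involves two steps on a document's content: tokenization, and building a term-to-document list. For example, suppose we have the following two documents: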

1. {"content": "The quick brown fox jumped over the lazy dog"}
2. {"content": "Quick brown foxes leap over lazy dogs in summer"}

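Elasticsearch uses an analyzer to tokenize the content field, then builds the list below according to whether each term appears in each document; √ means the term appears in that document. To search for "quick brown", we only need to find which documents each term appears in. If several documents match, they can be scored by how well they match, so that the most relevant documents come first.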

Term	Doc_1	Doc_2
Quick		√
The	√
brown	√	√
dog	√
dogs		√
fox	√
foxes		√
in		√
jumped	√
lazy	√	√
leap		√
over	√	√
quick	√
summer		√
the	√
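
The two steps above (tokenize, then build the term-to-document list) can be sketched in a few lines of Python. This is only an illustration, not how Lucene is implemented: the tokenizer is a plain whitespace split with no lowercasing, chosen so that the output matches the table, and the "score" is simply the number of query terms a document contains. A real analyzer would typically also lowercase terms, which would merge Quick and quick. The names tokenize, build_inverted_index, and search are made up for this sketch.

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

def tokenize(text):
    # Assumption: whitespace split only, no lowercasing or stemming,
    # so "Quick" and "quick" stay separate terms as in the table.
    return text.split()

def build_inverted_index(docs):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    # Crude relevance: count how many query terms each document contains.
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

index = build_inverted_index(docs)
print(search(index, "quick brown"))  # [(1, 2), (2, 1)]
```

Document 1 contains both query terms and ranks first; document 2 contains only brown, because its Quick is a different term under this simplified tokenizer.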

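Finally, back to the mapping concept mentioned above. Just as MySQL declares each column's data type, index type, and so on in the db schema, Elasticsearch uses a mapping for this. Most commonly, the mapping declares a field's data type, whether to build an inverted index for it, and which analyzer to use when building it. By default, Elasticsearch builds an inverted index for all string data using the standard analyzer.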

View the mapping: GET http://localhost:9200/<index_name>/_mapping
NOTE: here the index is blog and the doc type is test

{
  "blog": {
    "mappings": {
      "test": {
        "properties": {
          "activity_type": {
            "type": "string",
            "index": "not_analyzed"
          },
          "address": {
            "type": "string",
            "analyzer": "ik_smart"
          },
          "happy_party_id": {
            "type": "integer"
          },
          "last_update_time": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss"
          }
        }
      }
    }
  }
}

Inserting Data

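In MySQL, we must create the database and table and declare the db schema before inserting data. In Elasticsearch, data can be inserted directly: the system automatically creates the missing index and doc type and builds a mapping for the fields. Because the structure of semi-structured data usually changes dynamically, we cannot know in advance which fields a document will contain. If every insert required creating the index, type, and mapping beforehand, Elasticsearch would lose its advantage as a NoSQL store.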

Insert data directly: POST http://localhost:9200/blog/test

{
  "count": 5,
  "desc": "hello world"
}

View the index mapping: GET http://localhost:9200/blog/_mapping

{
  "blog": {
    "mappings": {
      "test": {
        "properties": {
          "count": {
            "type": "long"
          },
          "desc": {
            "type": "string"
          }
        }
      }
    }
  }
}
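
To make the automatic mapping above less mysterious, here is a simplified Python sketch of the type inference, under the assumption that only flat documents with scalar values are involved. It mimics the 2.x behavior shown in the example (JSON integers become long, strings become string) and deliberately ignores date detection, nested objects, and arrays. The function names infer_field_type and infer_mapping are hypothetical and not part of any Elasticsearch client.

```python
def infer_field_type(value):
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        # Simplification: real dynamic mapping may also detect dates here.
        return "string"
    raise TypeError("unsupported value in this sketch: %r" % (value,))

def infer_mapping(doc):
    # Build a "properties" block like the one returned by GET .../_mapping.
    return {
        "properties": {
            name: {"type": infer_field_type(value)}
            for name, value in doc.items()
        }
    }

mapping = infer_mapping({"count": 5, "desc": "hello world"})
print(mapping)
```

For the document inserted above, this yields count as long and desc as string, the same types the mapping response shows.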

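This flexibility has limits, however. As mentioned above, by default Elasticsearch builds an inverted index for all string data with the standard analyzer, so what if some fields should not be tokenized? Elasticsearch provides templates for this: an index template sets the default mapping for a group of indices, and the dynamic templates inside it decide how newly seen fields are mapped. Any index whose name matches the template's pattern has its fields mapped according to the template.
The example below creates a template named blog that automatically matches every index whose name starts with "blog_" and builds its mapping: each string field in a document automatically gets an additional .raw sub-field, which is not analyzed (the whole value is indexed as a single exact term). This is also what ELK does: if you look at the Elasticsearch templates in an ELK system, you will find one named logstash.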

Create the template: POST http://localhost:9200/_template/blog

{
  "template": "blog_*",
  "mappings": {
    "_default_": {
      "dynamic_templates": [{
        "string_fields": {
          "mapping": {
            "type": "string",
            "fields": {
              "raw": {
                "index": "not_analyzed",
                "ignore_above": 256,
                "type": "string"
              }
            }
          },
          "match_mapping_type": "string"
        }
      }],
      "properties": {
        "timestamp": {
          "doc_values": true,
          "type": "date"
        }
      },
      "_all": {
        "enabled": false
      }
    }
  }
}

Insert data directly: POST http://localhost:9200/blog_2016-12-25/test

{
  "count": 5,
  "desc": "hello world"
}
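
The name-based template selection can be pictured as simple wildcard matching. The sketch below uses Python's fnmatch as a stand-in for Elasticsearch's own pattern matching, which is an assumption that only holds for patterns using the * wildcard. The "logstash-*" pattern and the function name matching_templates are illustrative assumptions, not taken from a real cluster.

```python
from fnmatch import fnmatchcase

# Template name -> index name pattern (the "template" key in the request body).
templates = {
    "blog": "blog_*",
    "logstash": "logstash-*",  # assumed pattern, for illustration only
}

def matching_templates(index_name, templates):
    # Return the names of all templates whose pattern matches the index name.
    return sorted(
        name
        for name, pattern in templates.items()
        if fnmatchcase(index_name, pattern)
    )

print(matching_templates("blog_2016-12-25", templates))  # ['blog']
print(matching_templates("blog", templates))             # []
```

Note that the index named blog from the earlier example does not match "blog_*", so it would not pick up the .raw sub-fields.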

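Another topic related to inserts is bulk insertion. Elasticsearch provides the bulk API for batch operations: you can freely combine the operations and data you need in one request and send them to Elasticsearch in a single call. The format is as follows.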

action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....
action_and_meta_data\n
optional_source\n

For example:

{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }

If all operations target the same index and doc type, it is enough to specify the index and type in the REST API URL. An example of a bulk insert:

Bulk insert: POST http://localhost:9200/blog_2016-12-24/test/_bulk

{"index": {}}
{"count": 5, "desc": "hello world 111"}
{"index": {}}
{"count": 6, "desc": "hello world 222"}
{"index": {}}
{"count": 7, "desc": "hello world 333"}
{"index": {}}
{"count": 8, "desc": "hello world 444"}

View the inserted documents: GET http://localhost:9200/blog_2016-12-24/test/_search
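
Because the bulk body is newline-delimited JSON rather than one JSON document, it is easy to get wrong when assembled by hand. The following Python sketch builds the body for the insert-only case shown above, assuming the index and type are given in the URL so that each action line can be an empty "index" action. The helper name build_bulk_body is hypothetical, and actually sending the body over HTTP is left out.

```python
import json

def build_bulk_body(docs):
    # Each document contributes two lines: the action line and the source line.
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    # The bulk format requires every line, including the last, to end with \n.
    return "\n".join(lines) + "\n"

docs = [
    {"count": 5, "desc": "hello world 111"},
    {"count": 6, "desc": "hello world 222"},
    {"count": 7, "desc": "hello world 333"},
    {"count": 8, "desc": "hello world 444"},
]

body = build_bulk_body(docs)
print(body)
```

The resulting string is what would be sent as the request body of POST http://localhost:9200/blog_2016-12-24/test/_bulk.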

Querying Data

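Elasticsearch's query syntax (query DSL) has two parts, query and filter. The difference is whether results must match exactly or are matched by relevance. A filter asks "does the field value in this document equal the given value?", and the answer is either yes or no. A query asks "how well does the field value in this document match the given value?"; it computes a relevance score for each document and uses that score to rank the documents that matched.
In practice, note two things. First, exact-match filters should run on fields that are not analyzed, i.e. the .raw sub-fields added by the mapping above. Second, filters are usually used to narrow the search scope and queries to do the searching, so the two work together. Here are some examples; note how the three requests differ in how they are written: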

1. Query only:

POST http://localhost:9200/user_action/_search
The result is all documents whose page_name field contains wechat.
size specifies the number of documents to return; by default Elasticsearch returns the first 10.

{
  "query": {
    "bool": {
      "must": [{
        "match": {
          "page_name": "wechat"
        }
      },
      {
        "range": {
          "timestamp": {
            "gte": 1481218631,
            "lte": 1481258231,
            "format": "epoch_second"
          }
        }
      }]
    }
  },
  "size": 2
}

2. Filter only:

POST http://localhost:9200/user_action/_search
The result is all documents whose page_name field equals "example.cn/wechat/view.html".

{
  "filter": {
    "bool": {
      "must": [{
        "term": {
          "page_name.raw": "example.cn/wechat/view.html"
        }
      },
      {
        "range": {
          "timestamp": {
            "gte": 1481218631,
            "lte": 1481258231,
            "format": "epoch_second"
          }
        }
      }]
    }
  },
  "size": 2
}

3. Query and filter together (the filter sits inside the query's bool clause):

POST http://localhost:9200/user_action/_search
The result is all documents whose page_name field equals "job.gikoo.cn/wechat/view.html".

{
  "query": {
    "bool": {
      "filter": [{
        "bool": {
          "must": [{
            "term": {
              "page_name.raw": "job.gikoo.cn/wechat/view.html"
            }
          },
          {
            "range": {
              "timestamp": {
                "gte": 1481218631,
                "lte": 1481258231,
                "format": "epoch_second"
              }
            }
          }]
        }
      }]
    }
  },
  "size": 2
}
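
As a sketch of the "filter narrows, query searches" pattern described above, the Python function below assembles a request body in which a scored match clause and unscored filter clauses sit side by side in one bool query. The field names page_name, page_name.raw, and timestamp come from the examples; the function name build_search_body and its parameters are hypothetical, and the body is only built here, not sent.

```python
import json

def build_search_body(keyword, start_ts, end_ts, exact_page=None, size=2):
    # Filters answer yes/no and do not affect the relevance score.
    filters = [
        {"range": {"timestamp": {
            "gte": start_ts,
            "lte": end_ts,
            "format": "epoch_second",
        }}}
    ]
    if exact_page is not None:
        # Exact match goes against the not_analyzed .raw sub-field.
        filters.append({"term": {"page_name.raw": exact_page}})
    return {
        "query": {
            "bool": {
                # The match clause is scored and drives relevance ranking.
                "must": [{"match": {"page_name": keyword}}],
                "filter": filters,
            }
        },
        "size": size,
    }

body = build_search_body("wechat", 1481218631, 1481258231)
print(json.dumps(body, indent=2))
```

The printed JSON could be used as the body of POST http://localhost:9200/user_action/_search; only the match on page_name contributes to the score, while the time range merely restricts which documents are considered.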

Aggregations

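Just as aggregation in MySQL consists of grouping and aggregate computation, an Elasticsearch aggregation also has two parts: Buckets and Metrics. Buckets correspond to GROUP BY in SQL, and Metrics correspond to SQL aggregate functions such as COUNT, SUM, MAX, and MIN. Aggregation naturally involves grouping by several fields. In MySQL, "group by c1, c2, c3" does this, but Elasticsearch has no such syntax. Instead it offers nested Buckets, a design that, on reflection, arguably fits the way people think. The examples below show how it works: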

1. The simplest aggregation

POST http://localhost:9200/user_action/_search
For simplicity, the query conditions are omitted here.
Group the matching documents by company.
There are two size parameters: the size next to aggs (size=0) means the response contains no search hits, only aggregation results; the size inside terms is the number of buckets to return.

{
  "aggs": {
    "company_terms": {
      "terms": {
        "field": "company",
        "size": 2
      }
    }
  },
  "size": 0
}

2. Buckets combined with a Metric

POST http://localhost:9200/user_action/_search
Group the matching documents by company and get the time of each company's most recent action.

{
  "aggs": {
    "company_terms": {
      "terms": {
        "field": "company",
        "size": 2
      },
      "aggs": {
        "latest_record": {
          "max": {
            "field": "timestamp"
          }
        }
      }
    }
  },
  "size": 0
}

3. Nested Buckets

POST http://localhost:9200/user_action/_search
Group the matching documents by company first, then by store within each company, and get the time of each store's most recent action.

{
  "aggs": {
    "company_terms": {
      "terms": {
        "field": "company",
        "size": 1
      },
      "aggs": {
        "store_terms": {
          "terms": {
            "field": "store",
            "size": 2
          },
          "aggs": {
            "latest_record": {
              "max": {
                "field": "timestamp"
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}
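
Since Elasticsearch has no direct "group by c1, c2, c3", the nesting can be generated mechanically from a list of fields. The Python sketch below does this by building the innermost level first and wrapping outward. The helper name nested_terms_aggs, the "<field>_terms" naming scheme, and the single size shared by all levels are assumptions made for illustration.

```python
import json

def nested_terms_aggs(fields, metric=None, size=10):
    # fields: grouping fields from outermost to innermost, e.g. ["company", "store"].
    # metric: optional metric aggregation placed inside the innermost bucket.
    if not fields:
        raise ValueError("at least one field is required")
    inner = dict(metric) if metric else None
    # Walk from the innermost field outward, wrapping what was built so far.
    for field in reversed(fields):
        level = {"terms": {"field": field, "size": size}}
        if inner:
            level["aggs"] = inner
        inner = {field + "_terms": level}
    # size=0 at the top level: return only aggregation results, no hits.
    return {"aggs": inner, "size": 0}

body = nested_terms_aggs(
    ["company", "store"],
    metric={"latest_record": {"max": {"field": "timestamp"}}},
    size=2,
)
print(json.dumps(body, indent=2))
```

With ["company", "store"] and the max metric, the output has the same shape as the third example: company buckets containing store buckets, each carrying latest_record.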
