1 What is data modeling?
Data modeling is the process of creating a data model.
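Data modeling proceeds in three stages: conceptual model => logical model => data model (third normal form). The data model has to take the target database type into account and arrive at the final definitions while still meeting business requirements such as read and write performance.

2 How to model data in ES
Data modeling in ES works as follows: the functional requirements (data storage, retrieval, and so on) yield the entity attributes and the relationships between entities ==> these form the logical model; the performance requirements yield the index templates and index mappings (field settings and the handling of relationships) ==> these form the physical model.
The basic unit of storage and retrieval in ES is the document, and a document is made up of fields, so data modeling in ES is essentially field modeling.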
2.1 Modeling field types
(1) text versus keyword:
(2) Structured data:
2.2 Modeling for search, aggregation, and sorting
2.3 Modeling for extra storage
3 ES data modeling walkthrough
3.1 Creating the mapping dynamically

    # Index a book directly:
    POST books/_doc
    {
      "title": "Thinking in Elasticsearch 7.2.0",
      "author": "Heal Chow",
      "publish_date": "2019-10-01",
      "description": "Master the searching, indexing, and aggregation features in Elasticsearch.",
      "cover_url": "https:///images/29dMkliO2a1f.jpg"
    }

    # Inspect the mapping that was created automatically:
    GET books/_mapping

    # Response:
    {
      "books" : {
        "mappings" : {
          "properties" : {
            "author" : {
              "type" : "text",
              "fields" : {
                "keyword" : { "type" : "keyword", "ignore_above" : 256 }
              }
            },
            "cover_url" : {
              "type" : "text",
              "fields" : {
                "keyword" : { "type" : "keyword", "ignore_above" : 256 }
              }
            },
            "description" : {
              "type" : "text",
              "fields" : {
                "keyword" : { "type" : "keyword", "ignore_above" : 256 }
              }
            },
            "publish_date" : { "type" : "date" },
            "title" : {
              "type" : "text",
              "fields" : {
                "keyword" : { "type" : "keyword", "ignore_above" : 256 }
              }
            }
          }
        }
      }
    }

3.2 Creating the mapping manually

    # Delete the automatically created books index:
    DELETE books

    # Tune the field mappings by hand.
    # cover_url sets index to false: the field cannot be searched,
    # but Terms aggregations on it are still supported.
    PUT books
    {
      "mappings": {
        "_source": { "enabled": true },
        "properties": {
          "title": {
            "type": "text",
            "fields": {
              "keyword": { "type": "keyword", "ignore_above": 100 }
            }
          },
          "author": { "type": "keyword" },
          "publish_date": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||yyyyMMddHHmmss||yyyy-MM-dd||epoch_millis"
          },
          "description": { "type": "text" },
          "cover_url": {
            "type": "keyword",
            "index": false
          }
        }
      }
    }

Notes:
3.3 New requirement: adding a large field
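One way to add such a field to the existing books index is the update mapping API. The following is a minimal sketch, assuming the new field is the full-text content field that appears in the mapping of section 3.4 and that the index still uses the mapping from section 3.2:

```console
# Add a new text field to the existing mapping; existing fields are left unchanged
PUT books/_mapping
{
  "properties": {
    "content": { "type": "text" }
  }
}

# Confirm that the new field is now part of the mapping
GET books/_mapping
```

Because _source is enabled in that mapping, every search hit returns the whole original document by default, including the large content value.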
3.4 Solving the performance problems caused by large fields
(1) Manually disable the _source meta field when creating the mapping.
(2) Then set store: true on each field.

    # PUT books fails if the index already exists, so delete it first:
    DELETE books

    # Disable the _source meta field and set store to true on every field:
    PUT books
    {
      "mappings": {
        "_source": { "enabled": false },
        "properties": {
          "title": {
            "type": "text",
            "store": true,
            "fields": {
              "keyword": { "type": "keyword", "ignore_above": 100 }
            }
          },
          "author": { "type": "keyword", "store": true },
          "publish_date": {
            "type": "date",
            "store": true,
            "format": "yyyy-MM-dd HH:mm:ss||yyyyMMddHHmmss||yyyy-MM-dd||epoch_millis"
          },
          "description": { "type": "text", "store": true },
          "cover_url": {
            "type": "keyword",
            "index": false,
            "store": true
          },
          "content": { "type": "text", "store": true }
        }
      }
    }

(3) Add data and run a query with highlighting:

    # Index a document that includes the new field:
    POST books/_doc
    {
      "title": "Thinking in Elasticsearch 7.2.0",
      "author": "Heal Chow",
      "publish_date": "2019-10-01",
      "description": "Master the searching, indexing, and aggregation features in Elasticsearch.",
      "cover_url": "https:///images/29dMkliO2a1f.jpg",
      "content": "1. Revisiting Elasticsearch and the Changes. 2. The Improved Query DSL. 3. Beyond Full Text Search. 4. Data Modeling and Analytics. 5. Improving the User Search Experience. 6. The Index Distribution Architecture. .........."
    }

    # Use stored_fields to choose which fields to return:
    GET books/_search
    {
      "stored_fields": ["title", "author", "publish_date"],
      "query": {
        "match": { "content": "data modeling" }
      },
      "highlight": {
        "fields": { "content": {} }
      }
    }

The query returns:

    {
      "took" : 1,
      "timed_out" : false,
      "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
      "hits" : {
        "total" : { "value" : 1, "relation" : "eq" },
        "max_score" : 0.5753642,
        "hits" : [
          {
            "_index" : "books",
            "_type" : "_doc",
            "_id" : "dukLoG0BdfGBNhbF13CJ",
            "_score" : 0.5753642,
            "highlight" : {
              "content" : [
                "<em>Data</em> <em>Modeling</em> and Analytics. 5. Improving the User Search Experience. 6."
              ]
            }
          }
        ]
      }
    }

(4) Notes on the result:
3.5 Common mapping parameters for fields
Reference: https://www./guide/en/elasticsearch/reference/current/mapping-params.html
doc_values versus fielddata:
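3.6 Summary of mapping settings
(1) Adding new fields (including sub-fields), changing the analyzer, and similar operations are supported:
(2) Index Template: applies different mappings and settings depending on the index name;
(3) Dynamic Template: sets field types dynamically within a single mapping;
(4) Reindex: to modify or delete an existing field, or to change parameters such as the number of shards, the index has to be rebuilt.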
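The four items above can be sketched as console requests. This is an illustration under stated assumptions, not a recommended setup: it assumes the books index uses the mapping from section 3.2 (with _source enabled), it uses the legacy _template endpoint because the examples in this article target 7.2, and the names english, books_template, books-*, books_dynamic, strings_as_keyword, and books_v2 are hypothetical.

```console
# (1) Add a sub-field to an existing field, then let existing documents pick it up
PUT books/_mapping
{
  "properties": {
    "description": {
      "type": "text",
      "fields": {
        "english": { "type": "text", "analyzer": "english" }
      }
    }
  }
}
POST books/_update_by_query

# (2) Index template: applied to new indices whose name matches books-*
PUT _template/books_template
{
  "index_patterns": ["books-*"],
  "order": 1,
  "settings": { "number_of_shards": 1, "number_of_replicas": 1 },
  "mappings": {
    "properties": {
      "author": { "type": "keyword" }
    }
  }
}

# (3) Dynamic template: map every newly seen string field to keyword
PUT books_dynamic
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keyword": {
          "match_mapping_type": "string",
          "mapping": { "type": "keyword" }
        }
      }
    ]
  }
}

# (4) Reindex: copy the documents into a new index that has the changed mapping or settings
POST _reindex
{
  "source": { "index": "books" },
  "dest": { "index": "books_v2" }
}
```

Both _update_by_query and _reindex read documents from _source, so they are not available for an index that disables _source as in section 3.4.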
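4 ES data modeling best practices
4.1 Handling relationships
(1) Normalized design: relational databases have the concept of normalization (1NF, 2NF, 3NF, BCNF, and so on). Its main goal is to reduce unnecessary updates. It saves storage space, but reads can be slower, especially across tables, where many joins may be needed.
(2) Denormalized design: the data is flattened; relationships are not used, and the related data is instead kept inside the document itself.
==> ES is not good at handling relationships. They are usually handled with the object type, the nested type, or parent/child relationships (the join field type). The details take up a lot of space and are omitted here.

4.2 Avoid too many fields
(1) A document should preferably not contain a large number of fields: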
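(2) By default a single ES index may hold at most 1000 fields; the limit can be changed with the index.mapping.total_fields.limit setting.
Think about it: what could cause a document to have hundreds or thousands of fields?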
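The limit mentioned in (2) is a per-index setting. A minimal sketch of changing it, reusing the books index from section 3; the value 2000 is arbitrary and only for illustration:

```console
# Raise the maximum number of fields for one index (the default is 1000)
PUT books/_settings
{
  "index.mapping.total_fields.limit": 2000
}

# Read the setting back
GET books/_settings/index.mapping.total_fields.limit
```

Raising the limit only removes the guard; it does not address whatever is producing so many fields in the first place.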
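4.3 Avoid regular expression queries
Regexp, prefix, and wildcard queries are all term-level queries, but they perform poorly, because ES may have to scan a large number of indexed terms and compare them one by one. Putting the wildcard at the start of the pattern is especially disastrous for performance.
(1) Case: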
(2) Wildcard query example:

    # Insert two documents:
    PUT softwares/_doc/1
    {
      "version": "7.2.0",
      "doc_url": "https://www./guide/en/elasticsearch/.../.html"
    }
    PUT softwares/_doc/2
    {
      "version": "7.3.0",
      "doc_url": "https://www./guide/en/elasticsearch/.../.html"
    }

    # Wildcard query:
    GET softwares/_search
    {
      "query": {
        "wildcard": { "version": "7*" }
      }
    }

(3) Solution: turn the string into an object:

    # The index already exists from the previous step, so delete it first:
    DELETE softwares

    # Create a mapping in which the version number is an object:
    PUT softwares
    {
      "mappings": {
        "properties": {
          "version": {
            "properties": {
              "display_name": { "type": "keyword" },
              "major": { "type": "byte" },
              "minor": { "type": "byte" },
              "bug_fix": { "type": "byte" }
            }
          },
          "doc_url": { "type": "text" }
        }
      }
    }

    # Add data:
    PUT softwares/_doc/1
    {
      "version": { "display_name": "7.2.0", "major": 7, "minor": 2, "bug_fix": 0 },
      "doc_url": "https://www./guide/en/elasticsearch/.../.html"
    }
    PUT softwares/_doc/2
    {
      "version": { "display_name": "7.3.0", "major": 7, "minor": 3, "bug_fix": 0 },
      "doc_url": "https://www./guide/en/elasticsearch/.../.html"
    }

    # Filter on the numeric sub-fields instead of running a wildcard query,
    # which performs much better:
    GET softwares/_search
    {
      "query": {
        "bool": {
          "filter": [
            { "match": { "version.major": 7 } },
            { "match": { "version.minor": 2 } }
          ]
        }
      }
    }

4.4 Avoid inaccurate aggregations caused by null values
(1) Example:

    # Add data, including one document with a null value:
    PUT ratings/_doc/1
    {
      "rating": 5
    }
    PUT ratings/_doc/2
    {
      "rating": null
    }

    # Aggregate on the field that contains the null value:
    GET ratings/_search
    {
      "size": 0,
      "aggs": {
        "avg_rating": {
          "avg": { "field": "rating" }
        }
      }
    }

    # Result: hits.total is 2, but the document with the null rating is
    # skipped, so avg_rating is the average of a single value and is not
    # what was intended:
    {
      "took" : 3,
      "timed_out" : false,
      "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
      "hits" : {
        "total" : { "value" : 2, "relation" : "eq" },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "avg_rating" : { "value" : 5.0 }
      }
    }

(2) Use null_value to handle nulls:

    # The index already exists from the previous step, so delete it first:
    DELETE ratings

    # Set null_value when creating the mapping:
    PUT ratings
    {
      "mappings": {
        "properties": {
          "rating": {
            "type": "float",
            "null_value": 1.0
          }
        }
      }
    }

    # Add the same data and aggregate again; both documents are now counted:
    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
      "hits" : {
        "total" : { "value" : 2, "relation" : "eq" },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "avg_rating" : { "value" : 3.0 }
      }
    }
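When the mapping cannot be changed, a query-time alternative is the missing parameter of the avg aggregation, which substitutes a value for documents that have no value in the field. This is a different technique from null_value, shown here only as a sketch; the substitute value 1.0 is chosen to mirror the mapping above and is otherwise arbitrary:

```console
# Treat documents without a rating as if their rating were 1.0, for this aggregation only
GET ratings/_search
{
  "size": 0,
  "aggs": {
    "avg_rating": {
      "avg": { "field": "rating", "missing": 1.0 }
    }
  }
}
```

Unlike null_value, this does not change what is indexed, so each query can choose its own default.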