ES 32 - Elasticsearch Data Modeling: Exploration and Practice

Coder編程 2020-12-17


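1 What is data modeling?
2 How to model data in ES

2.1 Modeling field types
2.2 Modeling for search, aggregation, and sorting
2.3 Modeling extra storage

3 ES data modeling walkthrough

3.1 Creating the mapping dynamically
3.2 Creating the mapping manually
3.3 New requirement: adding a large field
3.4 Solving the performance problems caused by a large field
3.5 Common mapping parameters for fields
3.6 Mapping settings summary

4 ES data modeling best practices

4.1 Handling relationships
4.2 Avoid too many fields
4.3 Avoid regular-expression queries
4.4 Avoid inaccurate aggregations caused by null values

References
Copyright notice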

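1 What is data modeling?

Data modeling is the process of creating a data model.

A data model is a tool and method for describing the real world in abstract terms, mapping real-world things such as films, actors, and audience reviews onto data.

Data modeling goes through three stages: conceptual model => logical model => data model (for example, third normal form).

The final data model has to be defined with the chosen database type in mind, while meeting the business requirements for read and write performance.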

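2 How to model data in ES

Data modeling in ES:

Derive entity attributes and the relationships between entities from functional requirements such as storage and search => this gives the logical model;

Derive index templates and index mappings (field configuration and relationship handling) from performance requirements => this gives the physical model.

The basic unit of storage and search in ES is the document, and a document is made up of fields, so modeling in ES largely comes down to modeling fields.

A document is similar to a row in a relational database, and a field corresponds to a column.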

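2.1 Modeling field types

(1) text vs. keyword:

text: for full-text fields; the text is tokenized by an analyzer. Aggregation and sorting are not supported by default, but can be enabled with "fielddata": true;
keyword: for ids, enumerations, and text that should not be tokenized, such as ID card numbers, phone numbers, and email addresses. Suited to filters (exact-match filtering), sorting, and aggregations.
Multi-fields:
By default, dynamic mapping maps a string as text and adds a keyword sub-field;
When handling natural language, you can add sub-fields that use analyzers such as "english", "pinyin", or "standard" to improve the relevance of search results.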
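A multi-field mapping like the one described above can also be generated programmatically. The following is a minimal Python sketch, not taken from the original article: the helper name text_with_subfields is hypothetical, and it only uses the built-in english and standard analyzers (a pinyin analyzer requires a separately installed plugin, so it is left out here).

```python
import json


def text_with_subfields(analyzers, keyword_ignore_above=256):
    """Build a text field mapping with a keyword sub-field plus one
    text sub-field per extra analyzer (hypothetical helper)."""
    fields = {
        "keyword": {"type": "keyword", "ignore_above": keyword_ignore_above}
    }
    for name in analyzers:
        # Each sub-field indexes the same value with a different analyzer.
        fields[name] = {"type": "text", "analyzer": name}
    return {"type": "text", "fields": fields}


title_mapping = text_with_subfields(["english", "standard"])
# The printed JSON can be used as the "mappings" part of a PUT <index> request.
print(json.dumps({"properties": {"title": title_mapping}}, indent=2))
```

Queries can then target title, title.keyword, or title.english depending on whether full-text, exact, or language-aware matching is wanted.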

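(2) Structured data:

Numeric types: choose the narrowest type that fits, e.g. if byte is enough, do not use long;
Enumerations: map them as keyword, even when the values are numbers, for better performance. Term-level lookups on keyword are usually faster than on numeric types; map a value as numeric only if you need range queries on it;
Other types: date, binary, boolean, geo, and so on.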
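As an illustration of "choose the narrowest type that fits", the sketch below picks the smallest Elasticsearch integer type for a known value range. The function name smallest_integer_type is hypothetical; the ranges are the standard signed 8/16/32/64-bit ranges of byte, short, integer, and long.

```python
# Value ranges of Elasticsearch's integer field types, narrowest first.
INTEGER_TYPES = [
    ("byte", -2**7, 2**7 - 1),
    ("short", -2**15, 2**15 - 1),
    ("integer", -2**31, 2**31 - 1),
    ("long", -2**63, 2**63 - 1),
]


def smallest_integer_type(lowest, highest):
    """Return the narrowest integer type that can hold [lowest, highest]."""
    for name, type_min, type_max in INTEGER_TYPES:
        if type_min <= lowest and highest <= type_max:
            return name
    raise ValueError("range does not fit in a long")


print(smallest_integer_type(0, 100))     # byte
print(smallest_integer_type(0, 40_000))  # integer
```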

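2.2 Modeling for search, aggregation, and sorting

If a field needs no search, sorting, or aggregation at all, set "enabled": false (this parameter applies to object fields and to the top-level mapping);
If a field does not need to be searched, set "index": false;
If a field does not need sorting or aggregation, set "doc_values": false (for text fields, leave "fielddata" at its default of false);
For keyword fields that are updated often and aggregated often, "eager_global_ordinals": true is recommended.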
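The rules above can be read as a small decision table. The following Python sketch encodes them for a keyword field; field_params and its arguments are hypothetical names used only for illustration, and the returned dictionary is a fragment meant to sit under "properties" in a mapping.

```python
def field_params(searchable, aggregatable, hot_aggregation=False):
    """Derive mapping parameters for a keyword field from how it is used.

    searchable      -- the field is used in queries or filters
    aggregatable    -- the field is used for sorting or aggregations
    hot_aggregation -- the field is updated and aggregated frequently
    """
    params = {"type": "keyword"}
    if not searchable:
        params["index"] = False        # skip the inverted index
    if not aggregatable:
        params["doc_values"] = False   # skip the columnar doc_values
    elif hot_aggregation:
        params["eager_global_ordinals"] = True
    return params


print(field_params(searchable=False, aggregatable=True))
print(field_params(searchable=True, aggregatable=True, hot_aggregation=True))
```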

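2.3 Modeling extra storage

Does the field's data need to be stored separately?

"store": true stores the original value of the field;
It is usually combined with "_source": { "enabled": false }, because by default "_source": { "enabled": true } applies, meaning the original JSON of every indexed document is already kept in _source.

Disabling the _source metadata field saves disk space. It suits metric-style data, such as identifier and timestamp fields, which is never updated or highlighted and is mostly used in filters to narrow the result set quickly and support faster aggregations.

Official advice: if disk space is the main concern, consider raising the compression level first rather than disabling _source;
Without _source, reindex, update, and update_by_query no longer work;
At the time of writing, Kibana's Discover cannot explore indices whose _source is disabled.
Disable _source with caution; see: https://www./guide/en/elasticsearch/reference/current/mapping-source-field.html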

3 ES data modeling walkthrough

3.1 Creating the mapping dynamically

# Index a book directly:
POST books/_doc
{
  "title": "Thinking in Elasticsearch 7.2.0",
  "author": "Heal Chow",
  "publish_date": "2019-10-01",
  "description": "Master the searching, indexing, and aggregation features in Elasticsearch.",
  "cover_url": "https:///images/29dMkliO2a1f.jpg"
}

# Look at the mapping that was created automatically:
GET books/_mapping
# Response:
{
  "books" : {
    "mappings" : {
      "properties" : {
        "author" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "cover_url" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "description" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "publish_date" : {
          "type" : "date"
        },
        "title" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

3.2 Creating the mapping manually

# Delete the automatically created books index:
DELETE books

# Define an optimized mapping by hand.
# cover_url sets "index": false, so it cannot be searched but still supports terms aggregations.
PUT books
{
  "mappings": {
    "_source": { "enabled": true },
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 100
          }
        }
      },
      "author": { "type": "keyword" },
      "publish_date": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyyMMddHHmmss||yyyy-MM-dd||epoch_millis"
      },
      "description": { "type": "text" },
      "cover_url": {
        "type": "keyword",
        "index": false
      }
    }
  }
}

Note: the _source metadata field is enabled by default. If it is disabled, search hits no longer return the original document, and reindex, update, and update_by_query stop working.

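3.3 New requirement: adding a large field

Requirement: add a field holding the book's content, with full-text search and highlighting support.
Analysis: the new field makes _source very large. We could use source filtering to trim the fields returned in search results:
```
"_source": {
    "includes": ["title"]  # or "excludes": ["xxx"] to drop fields; a field matching both is excluded
}
```
But this only trims what is returned. To build each hit, the data node still has to load and parse the complete _source, large field included, so the I/O cost is not fundamentally reduced.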

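3.4 Solving the performance problems caused by a large field

(1) Disable the _source metadata field when creating the mapping: "_source": { "enabled": false };

(2) Then set "store": true on every field.

# Drop the index from 3.2 first; PUT fails if the index already exists:
DELETE books

# Disable _source and set store=true: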
PUT books
{
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "title": {
        "type": "text",
        "store": true,
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 100
          }
        }
      },
      "author": { "type": "keyword", "store": true },
      "publish_date": {
        "type": "date",
        "store": true,
        "format": "yyyy-MM-dd HH:mm:ss||yyyyMMddHHmmss||yyyy-MM-dd||epoch_millis"
      },
      "description": { "type": "text", "store": true },
      "cover_url": {
        "type": "keyword",
        "index": false,
        "store": true
      },
      "content": { "type": "text", "store": true }
    }
  }
}

(3) Add data and run a highlighted query:

# Index a document that includes the new field:
POST books/_doc
{
  "title": "Thinking in Elasticsearch 7.2.0",
  "author": "Heal Chow",
  "publish_date": "2019-10-01",
  "description": "Master the searching, indexing, and aggregation features in Elasticsearch.",
  "cover_url": "https:///images/29dMkliO2a1f.jpg",
  "content": "1. Revisiting Elasticsearch and the Changes. 2. The Improved Query DSL. 3. Beyond Full Text Search. 4. Data Modeling and Analytics. 5. Improving the User Search Experience. 6. The Index Distribution Architecture.  .........."
}

# Use stored_fields to name the fields to return:
GET books/_search
{
  "stored_fields": ["title", "author", "publish_date"],
  "query": {
    "match": { "content": "data modeling" }
  },
  "highlight": {
    "fields": { "content": {} }
  }
}

The response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.5753642,
    "hits" : [
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "dukLoG0BdfGBNhbF13CJ",
        "_score" : 0.5753642,
        "highlight" : {
          "content" : [
            "<em>Data</em> <em>Modeling</em> and Analytics. 5. Improving the User Search Experience. 6."
          ]
        }
      }
    ]
  }
}

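(4) Notes on the result:

The response does not contain a _source field;
Fields to display must be requested explicitly with "stored_fields": ["xxx", "yyy"];
Highlighting still works with _source disabled, because the highlighted field (content) is stored.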
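When stored fields are requested, Elasticsearch returns them in a fields object on each hit, and every value is wrapped in an array. A client usually wants to unwrap them. The sketch below shows one way to do that in Python; stored_fields_of is a hypothetical helper, and sample_hit is a hand-written illustration built from the book document above, not captured server output.

```python
def stored_fields_of(hit):
    """Flatten the `fields` section of a search hit.

    Stored field values always arrive as lists; single-element lists
    are unwrapped, multi-valued fields are kept as lists.
    """
    flat = {}
    for name, values in hit.get("fields", {}).items():
        flat[name] = values[0] if len(values) == 1 else values
    return flat


# Hypothetical hit, shaped like a hit returned for a stored_fields request.
sample_hit = {
    "_index": "books",
    "_id": "dukLoG0BdfGBNhbF13CJ",
    "fields": {
        "title": ["Thinking in Elasticsearch 7.2.0"],
        "author": ["Heal Chow"],
    },
}

print(stored_fields_of(sample_hit))
```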

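3.5 Common mapping parameters for fields

See: https://www./guide/en/elasticsearch/reference/current/mapping-params.html

enabled – when false, the field is only kept in _source; it cannot be searched or aggregated (applies to object fields);
index – whether to build an inverted index; when false the field cannot be searched, but it still supports aggregations and still appears in _source;
norms – for fields used only for filtering and aggregation (metric data) where scoring does not matter, disable norms to save storage;
doc_values – whether doc_values are enabled; used for sorting and aggregation;
fielddata – must be set to true to sort or aggregate on a text field;
coerce – whether to convert data types automatically (e.g. string to number); on by default;
fields – multi-fields, which index the same value in more than one way;
dynamic – controls how the mapping is updated dynamically; one of true / false / strict.

doc_values vs. fielddata:

doc_values: needed on fields used for aggregation and sorting; enabled by default on all non-text field types; built at index time and stored on disk, read through the file system cache;
fielddata: enables sorting and aggregation on text fields; disabled by default; built at query time and loaded entirely into heap memory.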

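3.6 Mapping settings summary

(1) New fields (including sub-fields) can be added and analyzers can be changed on an existing index:

Existing documents can then be brought up to date by running update_by_query.

(2) Index Template: applies mappings and settings to new indices whose names match a pattern;

(3) Dynamic Template: sets field types dynamically within a mapping;

(4) Reindex: changing or deleting an existing field, or changing settings such as the number of primary shards, requires rebuilding the index.

Done naively, this means downtime, and it can take a long time on large data sets.
An index alias makes it possible to do this maintenance with zero downtime.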
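The zero-downtime approach typically works like this: applications read and write through an alias, the data is reindexed into a new index in the background, and the alias is then moved in a single _aliases request, whose actions are applied atomically. The Python sketch below only builds that request body; the helper alias_swap_actions and the index names books_v1 and books_v2 are assumptions for illustration.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Body for `POST _aliases` that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Assumed naming scheme: versioned indices behind a stable alias.
body = alias_swap_actions("books", "books_v1", "books_v2")
print(json.dumps(body, indent=2))
```

Because both actions are in one request, clients querying the alias never see a moment where it points at no index.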

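4 ES data modeling best practices

4.1 Handling relationships

(1) Normalized design:

Relational databases have the concept of normalization (1NF, 2NF, 3NF, BCNF, and so on). Its main goal is to reduce unnecessary updates. It saves storage space, but reads can be slower, especially across tables where many joins are needed.

(2) Denormalized design: keep the data flat and avoid relationships, storing redundant copies of the data in each document's _source instead.

Pros: no joins to handle, so reads are fast;
Cons: not suited to data that is modified frequently.

==> ES is not good at handling relationships. They are usually handled with the object type, the nested type, or a parent/child relationship (the join field type).

The details take up a lot of space and are omitted here.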

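4.2 Avoid too many fields

(1) A document should preferably not have a large number of fields:

Too many fields make the data hard to maintain;
Mappings are kept in the cluster state, so a large mapping affects cluster performance (the cluster state has to be synchronized to every node);
Deleting or changing a field requires a reindex.

(2) The default limit on the number of fields in one index is 1000; it can be changed with the index.mapping.total_fields.limit setting.

Question: what leads to documents with hundreds or thousands of fields?

ES is schemaless: by default, every time a new field appears, ES infers its type and adds it to the mapping.
If the application is careless about this, fields multiply uncontrollably (a mapping explosion). To prevent it, set a dynamic policy:
true - unknown fields are added to the mapping automatically; this is the default;
false - new fields are not indexed, but are still kept in _source;
strict - new fields are not indexed; the write fails and an exception is thrown.
In production, avoid the default "dynamic": true where possible.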
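The difference between the three dynamic settings is easiest to see side by side. The following Python sketch is a toy model of the behaviour described above, not Elasticsearch code: index_document and StrictMappingError are hypothetical names, and the mapping is reduced to a set of field names.

```python
class StrictMappingError(Exception):
    """Raised by the toy model when dynamic=strict sees an unmapped field."""


def index_document(mapped_fields, doc, dynamic="true"):
    """Toy model of the `dynamic` setting.

    Returns (fields indexed for this document, mapped fields afterwards).
    """
    mapped = set(mapped_fields)
    unknown = [name for name in doc if name not in mapped]
    if dynamic == "strict" and unknown:
        # The whole write is rejected.
        raise StrictMappingError(f"unmapped fields rejected: {unknown}")
    if dynamic == "true":
        # Unknown fields are added to the mapping and indexed.
        mapped.update(unknown)
    # With dynamic=false, unknown fields stay in _source but are not indexed.
    indexed = sorted(name for name in doc if name in mapped)
    return indexed, sorted(mapped)


doc = {"title": "Thinking in Elasticsearch 7.2.0", "tmp_flag": 1}
print(index_document({"title"}, doc, dynamic="true"))
print(index_document({"title"}, doc, dynamic="false"))
```

In a real mapping the setting is written as "dynamic": "strict" (or false) at the top level of "mappings" or on an individual object field.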

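4.3 Avoid regular-expression queries

Regexp, prefix, and wildcard queries are all term-level queries, but they perform poorly because they may have to walk a large number of terms and compare each one. A leading wildcard is especially bad and can be disastrous for performance.

(1) Example:

A field in the document holds an Elasticsearch version, e.g. version: "7.2.0";
How do we find the bug-fix releases of a series (versions whose last number is not 0), or the documents for each major version?

(2) Wildcard query example:

# Insert 2 documents: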
PUT softwares/_doc/1
{
  "version": "7.2.0",
  "doc_url": "https://www./guide/en/elasticsearch/.../.html"
}

PUT softwares/_doc/2
{
  "version": "7.3.0",
  "doc_url": "https://www./guide/en/elasticsearch/.../.html"
}

# Wildcard query:
GET softwares/_search
{
  "query": {
    "wildcard": {
      "version": "7*"
    }
  }
}

(3) Solution: turn the string into an object:

# Drop the index created above, then create a mapping where version is an object:
DELETE softwares

PUT softwares
{
  "mappings": {
    "properties": {
      "version": {
        "properties": {
          "display_name": { "type": "keyword" },
          "major": { "type": "byte" },
          "minor": { "type": "byte" },
          "bug_fix": { "type": "byte" }
        }
      },
      "doc_url": { "type": "text" }
    }
  }
}

# Add data:
PUT softwares/_doc/1
{
  "version": {
    "display_name": "7.2.0",
    "major": 7,
    "minor": 2,
    "bug_fix": 0
  },
  "doc_url": "https://www./guide/en/elasticsearch/.../.html"
}

PUT softwares/_doc/2
{
  "version": {
    "display_name": "7.3.0",
    "major": 7,
    "minor": 3,
    "bug_fix": 0
  },
  "doc_url": "https://www./guide/en/elasticsearch/.../.html"
}

# Filter on the numeric parts instead of using a wildcard or regexp, which is much faster:
GET softwares/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": { "version.major": 7 }
        },
        {
          "term": { "version.minor": 2 }
        }
      ]
    }
  }
}
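In practice the version string has to be split into its parts before indexing. A minimal Python sketch of that preprocessing step follows; it assumes versions always have exactly three numeric parts, and the helper names version_to_object and bug_fix_releases_query are hypothetical. The second helper builds a query body for the "bug-fix releases of a series" question from the example, using a range filter instead of a wildcard.

```python
def version_to_object(version):
    """Split "major.minor.bug_fix" into the object shape used in the mapping."""
    major, minor, bug_fix = (int(part) for part in version.split("."))
    return {
        "display_name": version,
        "major": major,
        "minor": minor,
        "bug_fix": bug_fix,
    }


def bug_fix_releases_query(major):
    """Query body: documents of one major version whose bug_fix part is > 0."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"version.major": major}},
                    {"range": {"version.bug_fix": {"gt": 0}}},
                ]
            }
        }
    }


print(version_to_object("7.2.0"))
print(bug_fix_releases_query(7))
```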

4.4 Avoid inaccurate aggregations caused by null values

(1) Example:

# Add data, including 1 document with a null value:
PUT ratings/_doc/1
{
  "rating": 5
}
PUT ratings/_doc/2
{
  "rating": null
}

# Aggregate on the field that contains a null value:
GET ratings/_search
{
  "size": 0,
  "aggs": {
    "avg_rating": {
      "avg": { "field": "rating"}
    }
  }
}

# Result:
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,    # 2 documents matched, but avg_rating ignores the null one
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "avg_rating" : {
      "value" : 5.0
    }
  }
}

(2) Use null_value to handle nulls:

# Drop the index from the example above, then set null_value when creating the mapping:
DELETE ratings

PUT ratings
{
  "mappings": {
    "properties": {
      "rating": {
        "type": "float",
        "null_value": 1.0
      }
    }
  }
}

# Index the same data and run the aggregation again; the null rating is now counted as 1.0:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "avg_rating" : {
      "value" : 3.0
    }
  }
}
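The two results can be reproduced with simple arithmetic. The Python sketch below is a toy model of the avg aggregation, not Elasticsearch code: the function avg is hypothetical, documents with a null field are skipped unless a null_value replacement is given, which mirrors the effect of the mapping parameter (the substitution happens at index time and does not change _source).

```python
def avg(values, null_value=None):
    """Average that skips missing values unless a replacement is configured."""
    # Substitute the configured replacement for nulls, as null_value does at index time.
    substituted = [null_value if v is None else v for v in values]
    # Anything still missing is ignored by the aggregation.
    present = [v for v in substituted if v is not None]
    return sum(present) / len(present) if present else None


ratings = [5, None]
print(avg(ratings))                  # 5.0: the null rating is ignored
print(avg(ratings, null_value=1.0))  # 3.0: (5 + 1.0) / 2
```

Whether 1.0 is the right replacement is a business decision; the point is that the choice is made explicitly instead of silently dropping documents from the average.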