
Fingerprint Analyzer

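The fingerprint analyzer implements the fingerprinting algorithm that the OpenRefine project uses to assist in clustering.

Input text is lowercased, normalized to remove extended characters, sorted, deduplicated, and concatenated into a single token. If a stop word list is configured, stop words are also removed.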
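The steps described above can be sketched in a few lines of Python. This is only an approximation for illustration, not the analyzer's actual implementation: the regular expression `\w+` stands in for the real tokenizer, and Unicode decomposition stands in for the real folding of extended characters. The function name `fingerprint` is hypothetical.

```python
import re
import unicodedata


def fingerprint(text):
    """Approximate the fingerprint steps: lowercase, fold, sort, dedupe, join."""
    # Lowercase the input.
    lowered = text.lower()
    # Approximate removal of extended characters: decompose each character
    # and drop the combining marks (for example "ö" becomes "o").
    decomposed = unicodedata.normalize("NFKD", lowered)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: runs of word characters approximate the tokenizer.
    terms = re.findall(r"\w+", folded)
    # Deduplicate, sort, and concatenate into a single token.
    return " ".join(sorted(set(terms)))


print(fingerprint("Yes yes, Gödel said this sentence is consistent and."))
```

Because the terms are sorted and deduplicated, inputs that differ only in word order, case, punctuation, or accents collapse to the same token, which is what makes the output useful as a clustering key.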

Example

POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

The sentence above produces the following single term:

[ and consistent godel is said sentence this yes ]

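Parameters

  • separator: The character used to concatenate the terms. Defaults to a space.
  • max_output_size: The maximum token size to emit. Defaults to 255. Tokens larger than this size are discarded.
  • stopwords: A predefined stop word list such as _english_, or an array containing a list of stop words. Defaults to _none_.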
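To make the effect of the three parameters concrete, here is a small Python sketch under stated assumptions. It is not the analyzer's code: the tokenizer and folding are approximated, `max_output_size` is assumed to be measured in characters, and `SAMPLE_STOPWORDS` is a hypothetical three-word subset chosen for illustration rather than the full `_english_` list.

```python
import re
import unicodedata

# Hypothetical illustrative subset, NOT the complete `_english_` list.
SAMPLE_STOPWORDS = {"and", "is", "this"}


def fingerprint(text, separator=" ", max_output_size=255, stopwords=()):
    """Approximate fingerprinting with the three configurable parameters."""
    # Lowercase, then drop combining marks to approximate character folding.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: runs of word characters approximate the tokenizer.
    terms = set(re.findall(r"\w+", folded))
    # stopwords: remove configured stop words before concatenation.
    terms -= set(stopwords)
    # separator: join the sorted, deduplicated terms.
    output = separator.join(sorted(terms))
    # max_output_size: an oversized token is discarded, not truncated.
    # Assumption: the size is counted in characters.
    if len(output) > max_output_size:
        return None
    return output


text = "Yes yes, Gödel said this sentence is consistent and."
print(fingerprint(text, separator="+"))
print(fingerprint(text, stopwords=SAMPLE_STOPWORDS))
print(fingerprint(text, max_output_size=10))
```

In this sketch the last call prints `None`, because the concatenated token is longer than 10 characters and is dropped entirely.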

Example configuration

In this example, we configure the fingerprint analyzer to use the predefined list of English stop words:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

The example above produces the following term:

[ consistent godel said sentence yes ]

Definition

The fingerprint analyzer consists of:

  • Tokenizer
    • Standard Tokenizer
  • Token Filters (in order)
    • Lower Case Token Filter
    • ASCII Folding Token Filter
    • Stop Token Filter (disabled by default)
    • Fingerprint Token Filter
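The order of the components listed above matters: each filter consumes the token stream produced by the previous stage. The following Python sketch mirrors that chain so the intermediate streams can be inspected. Every function here is a hypothetical stand-in written for illustration (a regular expression for the Standard Tokenizer, Unicode decomposition for ASCII folding), not the real component.

```python
import re
import unicodedata


def standard_tokenize(text):
    # Rough stand-in for the Standard Tokenizer: runs of word characters.
    return re.findall(r"\w+", text)


def lower_case(tokens):
    return [t.lower() for t in tokens]


def ascii_fold(tokens):
    # Approximation: decompose characters and drop combining marks.
    def fold(token):
        decomposed = unicodedata.normalize("NFKD", token)
        return "".join(c for c in decomposed if not unicodedata.combining(c))
    return [fold(t) for t in tokens]


def stop(tokens, stopwords=frozenset()):
    # Disabled by default: an empty stop word set removes nothing.
    return [t for t in tokens if t not in stopwords]


def fingerprint_filter(tokens, separator=" "):
    # Sort, deduplicate, and emit a single concatenated token.
    return [separator.join(sorted(set(tokens)))]


def run_pipeline(text, verbose=False):
    stream = text
    for stage in (standard_tokenize, lower_case, ascii_fold, stop, fingerprint_filter):
        stream = stage(stream)
        if verbose:
            print(f"{stage.__name__}: {stream}")
    return stream


run_pipeline("Yes yes, Gödel said", verbose=True)
```

Running the stages one at a time shows why lowercasing and folding must happen before the fingerprint filter: otherwise "Yes" and "yes" would survive as two distinct terms in the final token.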

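Custom analyzer example

To customize the fingerprint analyzer beyond its configuration parameters, recreate it as a custom analyzer and modify it, typically by adding token filters. The following request rebuilds the built-in fingerprint analyzer as a starting point: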

PUT /fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}