Pattern Analyzer

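The pattern analyzer uses a regular expression to split text into tokens. The regular expression should match the separators between tokens, not the tokens themselves. It defaults to \W+ (all non-word characters).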

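Beware of pathological regular expressions

The pattern analyzer uses Java regular expressions. A badly written regular expression can run very slowly or even throw a StackOverflowError.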
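To make the warning concrete, here is a small sketch written in Python rather than Java (an assumption made for brevity; the standard regex engines of both languages are backtracking engines). The nested quantifier in `(a+)+$` makes the engine try exponentially many ways of dividing a run of `a`s before it can report failure, whereas the equivalent `a+$` fails after a single scan. Both patterns and the input sizes are hypothetical illustrations, not something taken from the analyzer.

```python
import re
import time

# Hypothetical patterns for illustration: both accept the same strings.
PATHOLOGICAL = re.compile(r"(a+)+$")  # nested quantifier: exponential backtracking
SAFE = re.compile(r"a+$")             # same language, linear time


def time_failed_match(pattern, count):
    """Return (matched, seconds) for a run of `count` 'a's followed by 'b'."""
    text = "a" * count + "b"  # the trailing 'b' makes the overall match fail
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    # Keep the sizes small: each extra character roughly doubles the slow case.
    for count in (12, 14, 16, 18):
        _, slow = time_failed_match(PATHOLOGICAL, count)
        _, fast = time_failed_match(SAFE, count)
        print(f"{count} chars: pathological {slow:.4f}s, safe {fast:.6f}s")
```

Timings depend on the machine, so none are quoted here. The point is only the growth rate of the first column compared with the second.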

Example

POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The request above produces the following terms:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
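The default behavior can be approximated outside Elasticsearch with a few lines of Python. This is only a sketch: the function name `pattern_analyze` is made up for illustration, and Python's `re` module stands in for Java's regex engine, with `re.ASCII` used so that `\W` is ASCII-only as it is in Java by default.

```python
import re


def pattern_analyze(text):
    """Approximate the default pattern analyzer: split on non-word runs, then lowercase."""
    # re.ASCII keeps \W ASCII-only, matching Java's default treatment of \W.
    pieces = re.split(r"\W+", text, flags=re.ASCII)
    # A separator at the start or end of the text yields an empty string; drop it.
    return [piece.lower() for piece in pieces if piece]


if __name__ == "__main__":
    print(pattern_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
```

Because the apostrophe and the hyphen are non-word characters, `dog's` becomes two tokens and `Brown-Foxes` is split in the same way, as in the terms listed above.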

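Configuration

The pattern analyzer accepts the following parameters:

  • pattern: A Java regular expression. Defaults to \W+.
  • flags: Java regular expression flags. Flags should be pipe-separated, for example "CASE_INSENSITIVE|COMMENTS".
  • lowercase: Whether terms should be lowercased. Defaults to true.
  • stopwords: A predefined stop word list such as _english_, or an array containing a list of stop words. Defaults to _none_.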
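How the four parameters interact can be sketched in Python. Everything here is an illustrative assumption: the names `parse_flags` and `analyze` are invented, the flag table only covers Java flags that have a close Python counterpart, and stop words are removed after lowercasing.

```python
import re

# Assumed mapping from Java Pattern flag names to Python equivalents.
JAVA_TO_PYTHON_FLAGS = {
    "CASE_INSENSITIVE": re.IGNORECASE,
    "MULTILINE": re.MULTILINE,
    "DOTALL": re.DOTALL,
    "COMMENTS": re.VERBOSE,
}


def parse_flags(flags):
    """Turn a pipe-separated string such as "CASE_INSENSITIVE|COMMENTS" into re flags."""
    combined = re.ASCII  # Java's word classes are ASCII-only by default
    for name in flags.split("|"):
        name = name.strip()
        if name:
            combined |= JAVA_TO_PYTHON_FLAGS[name]
    return combined


def analyze(text, pattern=r"\W+", flags="", lowercase=True, stopwords=()):
    """Split on the separator pattern, optionally lowercase, then drop stop words."""
    pieces = re.split(pattern, text, flags=parse_flags(flags))
    tokens = [piece for piece in pieces if piece]
    if lowercase:
        tokens = [token.lower() for token in tokens]
    stop = set(stopwords)
    return [token for token in tokens if token not in stop]


if __name__ == "__main__":
    print(analyze("The QUICK fox", stopwords=["the"]))
    print(analyze("aXbxc", pattern="x", flags="CASE_INSENSITIVE", lowercase=False))
```

An unknown flag name raises `KeyError` in this sketch, which is a deliberate simplification.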

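Example configuration

In this example, we configure the pattern analyzer to split email addresses on non-word characters or on underscores (\W|_), and to lowercase the result: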

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type": "pattern",
          "pattern": "\\W|_",
          "lowercase": true
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
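Tip

When the pattern is specified as a JSON string, the backslashes in the pattern must be escaped.

The request above produces the following terms: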

[ john, smith, foo, bar, com ]
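The escaping rule and the resulting split can both be checked with a short Python sketch. The helper name `split_email` is hypothetical, and Python's `re` module is again used in place of the Java engine. The regular expression itself contains a single backslash; it is only the JSON encoding that doubles it.

```python
import json
import re

pattern = r"\W|_"  # the regular expression itself: one backslash

# json.dumps shows how the pattern has to be written inside a JSON body.
settings = {"type": "pattern", "pattern": pattern, "lowercase": True}
encoded = json.dumps(settings)
print(encoded)  # the pattern appears as "\\W|_" in the JSON text

# Decoding the JSON gives back the single-backslash pattern.
decoded_pattern = json.loads(encoded)["pattern"]


def split_email(text):
    """Split on non-word characters or underscores, then lowercase."""
    pieces = re.split(decoded_pattern, text, flags=re.ASCII)
    # Adjacent separators would produce empty strings; this sketch drops them.
    return [piece.lower() for piece in pieces if piece]


if __name__ == "__main__":
    print(split_email("John_Smith@foo-bar.com"))
```

The underscore needs its own alternative because `_` counts as a word character and is therefore not matched by `\W`.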

CamelCase tokenizer

The following, more complicated example splits CamelCase text into tokens:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "camel",
  "text": "MooseX::FTPClass2_beta"
}

The request above produces the following terms:

[ moose, x, ftp, class, 2, beta ]

The regular expression above is easier to understand as:

  ([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?<=\D)(?=\d)                 # or non-number followed by number,
| (?<=\d)(?=\D)                 # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?<=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
    [\p{L}&&[^\p{Lu}]]          #   then lower case
  )
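The same splitting rules can be tried in Python, with two caveats that make this an approximation rather than a port. Python's `re` module has no `\p{L}` or `\p{Lu}` classes and no `&&` class intersection, so this sketch substitutes ASCII ranges, and the outer capturing group is dropped because `re.split` would otherwise return the separators as well. Splitting on zero-width matches requires Python 3.7 or later. The name `split_camel` is hypothetical.

```python
import re

# ASCII-only approximation of the CamelCase pattern, written with re.VERBOSE
# so that it can keep the same layout and comments as the annotated form.
CAMEL = re.compile(
    r"""
      [^A-Za-z\d]+                 # swallow non-letters and non-digits,
    | (?<=\D)(?=\d)                # or non-digit followed by digit,
    | (?<=\d)(?=\D)                # or digit followed by non-digit,
    | (?<=[a-z])(?=[A-Z])          # or lower case followed by upper case,
    | (?<=[A-Z])(?=[A-Z][a-z])     # or upper case followed by upper then lower case
    """,
    re.VERBOSE | re.ASCII,
)


def split_camel(text):
    """Split on the CamelCase boundaries above, then lowercase the pieces."""
    return [piece.lower() for piece in CAMEL.split(text) if piece]


if __name__ == "__main__":
    print(split_camel("MooseX::FTPClass2_beta"))
```

The last alternative is what keeps an acronym together: in `FTPClass` the split falls between `P` and `C`, because `C` is the first upper-case letter that is followed by a lower-case one.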

Definition

  • Tokenizer
    • Pattern Tokenizer
  • Token Filters
    • Lower Case Token Filter
    • Stop Token Filter (disabled by default)

Custom analyzer example

PUT /pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "pattern": "\\W+" (1)
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase" (2)
          ]
        }
      }
    }
  }
}
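  1. The default pattern is \W+, which splits on non-word characters. This is where you would change it.
  2. You can add other token filters after lowercase.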
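The tokenizer-plus-filters structure can be modeled as plain function composition. The sketch below is in Python and every name in it (`pattern_tokenizer`, `lowercase_filter`, `stop_filter`, `build_analyzer`) is invented for illustration; it mirrors the shape of the custom analyzer, not Elasticsearch's implementation.

```python
import re


def pattern_tokenizer(pattern):
    """Build a tokenizer that splits text on the given separator pattern."""
    compiled = re.compile(pattern, re.ASCII)

    def tokenize(text):
        return [piece for piece in compiled.split(text) if piece]

    return tokenize


def lowercase_filter(tokens):
    return [token.lower() for token in tokens]


def stop_filter(stopwords):
    """Hypothetical extra filter that removes the given stop words."""
    stop = set(stopwords)

    def apply(tokens):
        return [token for token in tokens if token not in stop]

    return apply


def build_analyzer(tokenizer, filters):
    """Chain one tokenizer with an ordered list of token filters."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens

    return analyze


# Equivalent in shape to the rebuilt_pattern analyzer.
rebuilt_pattern = build_analyzer(pattern_tokenizer(r"\W+"), [lowercase_filter])

# A customized variant with one more filter added after lowercase.
customized = build_analyzer(
    pattern_tokenizer(r"\W+"), [lowercase_filter, stop_filter(["the"])]
)

if __name__ == "__main__":
    print(rebuilt_pattern("Brown-Foxes"))
    print(customized("The lazy dog"))
```

Filter order matters: because the stop filter runs after lowercasing, `The` is removed even though the stop list only contains `the`.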