Finding similar documents: more_like_this

The More Like This query finds documents that are "like" a given set of documents.

Example 1. Basic example

For example, suppose we want all movies whose "title" and "description" fields contain text similar to "Once upon a time", with the number of selected terms limited to 12.

GET /_search
{
    "query": {
        "more_like_this" : {
            "fields" : ["title", "description"],
            "like" : "Once upon a time",
            "min_term_freq" : 1,
            "max_query_terms" : 12
        }
    }
}
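
The request body above can also be assembled programmatically. The following is a minimal sketch in Python that only builds and prints the JSON body; the helper name `more_like_this_body` is hypothetical and not part of any Elasticsearch client, and sending the body to a cluster is left out.

```python
import json


def more_like_this_body(fields, like, min_term_freq=1, max_query_terms=12):
    """Build a search body containing a more_like_this query.

    Hypothetical helper for illustration; the defaults mirror the
    values used in the example request, not the server defaults.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": like,
                "min_term_freq": min_term_freq,
                "max_query_terms": max_query_terms,
            }
        }
    }


body = more_like_this_body(["title", "description"], "Once upon a time")
print(json.dumps(body, indent=4))
```

The printed JSON can be pasted as the body of `GET /_search`.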

Example 2. Finding documents similar to existing documents

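A more complex use case mixes free text with documents that already exist in the index. In this case, the syntax for specifying a document is similar to the one used in the Multi GET API.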

GET /_search
{
    "query": {
        "more_like_this" : {
            "fields" : ["title", "description"],
            "like" : [
                {
                    "_index" : "imdb",
                    "_type" : "movies",
                    "_id" : "1"
                },
                {
                    "_index" : "imdb",
                    "_type" : "movies",
                    "_id" : "2"
                },
                "and potentially some more text here as well"
            ],
            "min_term_freq" : 1,
            "max_query_terms" : 12
        }
    }
}
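
As a rough illustration of how such a mixed "like" list might be put together in code, here is a small Python sketch. The helper `doc_ref` is hypothetical; it only reproduces the `_index` / `_type` / `_id` structure shown in the request above, and `_type` is kept optional on the assumption that some setups do not use it.

```python
import json


def doc_ref(index, doc_id, doc_type=None):
    """Reference to an already indexed document (hypothetical helper)."""
    ref = {"_index": index}
    if doc_type is not None:
        ref["_type"] = doc_type
    ref["_id"] = doc_id
    return ref


# Document references and free text can sit side by side in one list.
like_items = [
    doc_ref("imdb", "1", "movies"),
    doc_ref("imdb", "2", "movies"),
    "and potentially some more text here as well",
]

body = {
    "query": {
        "more_like_this": {
            "fields": ["title", "description"],
            "like": like_items,
            "min_term_freq": 1,
            "max_query_terms": 12,
        }
    }
}
print(json.dumps(body, indent=4))
```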

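Finally, you can mix free text, a chosen set of indexed documents, and documents that are not necessarily present in the index. The latter are supplied inline under the "doc" key.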

GET /_search
{
    "query": {
        "more_like_this" : {
            "fields" : ["name.first", "name.last"],
            "like" : [
                {
                    "_index" : "marvel",
                    "_type" : "quotes",
                    "doc" : {
                        "name": {
                            "first": "Ben",
                            "last": "Grimm"
                        },
                        "_doc": "You got no idea what I'd... what I'd give to be invisible."
                    }
                },
                {
                    "_index" : "marvel",
                    "_type" : "quotes",
                    "_id" : "2"
                }
            ],
            "min_term_freq" : 1,
            "max_query_terms" : 12
        }
    }
}
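
The difference between the two kinds of entries in the request above is only whether an item carries `_id` (an indexed document) or `doc` (an inline document). A minimal Python sketch of that distinction follows; `inline_doc` and `is_inline` are hypothetical helper names used purely for illustration.

```python
def inline_doc(index, doc, doc_type=None):
    """Build a "like" item for a document supplied inline (hypothetical helper)."""
    item = {"_index": index}
    if doc_type is not None:
        item["_type"] = doc_type
    item["doc"] = doc
    return item


def is_inline(item):
    """True if a "like" item carries its content inline rather than by _id."""
    return isinstance(item, dict) and "doc" in item


like_items = [
    inline_doc("marvel", {"name": {"first": "Ben", "last": "Grimm"}}, "quotes"),
    {"_index": "marvel", "_type": "quotes", "_id": "2"},
]

for item in like_items:
    kind = "inline document" if is_inline(item) else "indexed document"
    print(kind, item)
```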

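Parameter	Meaning
fields	The fields to fetch and analyze text from.
like	The text to find similar documents for. It can be free text, references to indexed documents, or inline documents.
unlike	Used together with like to avoid selecting terms found in a chosen set of texts or documents.
min_term_freq	Minimum term frequency in the input document; terms below this frequency are ignored. Defaults to 2.
max_query_terms	Maximum number of terms the query may select. Defaults to 25.
min_doc_freq	Minimum document frequency; terms that appear in fewer documents are ignored. Defaults to 5.
max_doc_freq	Maximum document frequency; terms that appear in more documents are ignored.
min_word_length	Minimum word length; shorter terms are ignored.
max_word_length	Maximum word length; longer terms are ignored.
stop_words	A list of stop words. Terms in this list are not used for finding similar documents.
analyzer	The analyzer used to analyze the free text.
minimum_should_match	Minimum number of selected terms a document must match. Defaults to 30% of the terms selected for the query.
boost_terms	Boost factor applied to the selected terms.
include	Whether the input documents should also be returned in the results.
boost	Boost value of the whole query. Defaults to 1.0.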
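
To make the term-selection parameters in the table more concrete, here is a deliberately simplified sketch in Python. It is not the Elasticsearch implementation: it assumes a naive regex tokenizer, ranks terms by raw term frequency only, and leaves out `min_doc_freq` and `max_doc_freq` because those need document statistics from a real index. The function name `select_terms` is hypothetical, and treating `max_word_length=0` as "no limit" is an assumption of this sketch.

```python
import re
from collections import Counter


def select_terms(text, min_term_freq=2, max_query_terms=25,
                 min_word_length=0, max_word_length=0, stop_words=()):
    """Pick candidate query terms from text (simplified illustration)."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    stop = {word.lower() for word in stop_words}

    candidates = []
    for term, freq in counts.items():
        if freq < min_term_freq:
            continue  # too rare in the input text
        if len(term) < min_word_length:
            continue  # too short
        if max_word_length and len(term) > max_word_length:
            continue  # too long (0 means no upper limit in this sketch)
        if term in stop:
            continue  # explicitly excluded
        candidates.append((term, freq))

    # Most frequent first, ties broken alphabetically, then truncate.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in candidates[:max_query_terms]]


text = "the cat sat on the mat the cat"
print(select_terms(text))                      # terms seen at least twice
print(select_terms(text, stop_words=["the"]))  # same, minus the stop word
```

With the default `min_term_freq` of 2, a short input such as "Once upon a time" yields no terms at all in this sketch, which suggests why the examples above lower it to 1.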
