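Search in ES is not like LIKE in a relational database; it works on the relevance between the search conditions and the documents.
For each search, every document gets a floating-point field, _score, that represents how relevant the document is to the search. The higher the _score, the higher the relevance.
How the score is computed depends on the query type: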
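A fuzzy query computes how similar the spelling is to the keyword.
A terms query computes the percentage of the keyword's component terms that the found content matches.
A full-text search computes how similar the content is to the keywords.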
ES computes TF/IDF (Term Frequency/Inverse Document Frequency) as its relevance measure, which depends on the following three factors:
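Term frequency (TF): for a given record, the more often the search term appears in the queried field, the higher the relevance. For example, if a search term appears four times in the queried field of the first record and three times in that of the second, the first record's TF is somewhat higher than the second's.
Inverse document frequency (IDF): the more often a search term appears in that field across all documents, the less relevance the term carries. For example, with five search terms, if one term appears in every document and another appears only once, the term found in every document can almost be ignored, while the term that appears only once gets a very high weight.
Field length: for a given record, the longer the queried field, the lower the relevance. For example, suppose one record is 10 words long, another is 100 words long, and a keyword appears exactly once in each. The 10-word record is then far more relevant than the 100-word record.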
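The three factors above can be sketched in a few lines of Python. This is a minimal sketch that assumes the classic Lucene TF/IDF similarity (the one that produces tf, idf and fieldNorm lines in explain output); newer ES versions default to BM25 and will give different numbers. The function names are illustrative, not an ES API, and real Lucene stores the field norm with reduced precision, so actual values can differ slightly.

```python
import math

def tf(freq):
    # Term frequency: square root of how often the term occurs in the field.
    return math.sqrt(freq)

def idf(doc_freq, max_docs):
    # Inverse document frequency: terms found in many documents weigh less.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def field_norm(num_terms):
    # Field-length norm: longer fields weigh less.
    return 1.0 / math.sqrt(num_terms)

def field_weight(freq, doc_freq, max_docs, num_terms):
    # Product of the three factors for one term in one document.
    return tf(freq) * idf(doc_freq, max_docs) * field_norm(num_terms)

# One occurrence of the keyword in a 10-word field vs. a 100-word field.
short_doc = field_weight(freq=1, doc_freq=1, max_docs=2, num_terms=10)
long_doc = field_weight(freq=1, doc_freq=1, max_docs=2, num_terms=100)
print(short_doc, long_doc)
```

With the same single occurrence, the 10-word field ends up with a weight roughly three times that of the 100-word field, which is the field-length effect described above.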
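Understanding TF/IDF helps you explain results that look as if they should not have appeared. You should also be clear that this is not an exact-match algorithm but a scoring algorithm: results are sorted by relevance.
If the scores look unreasonable, you can use the following request to inspect how they were computed: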
# Explain how the query is scored
curl -XPOST 'http://127.0.0.1:9200/myindex/user/_search?explain' -d'
{
"query" : { "match" : { "家庭住址" : "魔都大街" }}
}'
# The result is as follows:
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 4,
"hits": [
{
"_shard": 4,
"_node": "5Tv2a5YaQDqmzUFbTp4iaw",
"_index": "myindex",
"_type": "user",
"_id": "u002",
"_score": 4,
"_source": {
"用户ID": "u002",
"姓名": "李四",
"性别": "男",
"年龄": "25",
"家庭住址": "上海市闸北区魔都大街007号",
"注册时间": "2015-02-01 08:30:00"
},
"_explanation": {
"value": 4,
"description": "sum of:",
"details": [
{
"value": 4,
"description": "sum of:",
"details": [
{
"value": 1,
"description": "weight(家庭住址:魔 in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 1,
"description": "score(doc=0,freq=1.0), product of:",
"details": [
{
"value": 0.5,
"description": "queryWeight, product of:",
"details": [
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)",
"details": []
},
{
"value": 0.5,
"description": "queryNorm",
"details": []
}
]
},
{
"value": 2,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
}
]
},
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)",
"details": []
},
{
"value": 2,
"description": "fieldNorm(doc=0)",
"details": []
}
]
}
]
}
]
},
{
"value": 1,
"description": "weight(家庭住址:都 in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 1,
"description": "score(doc=0,freq=1.0), product of:",
"details": [
{
"value": 0.5,
"description": "queryWeight, product of:",
"details": [
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)",
"details": []
},
{
"value": 0.5,
"description": "queryNorm",
"details": []
}
]
},
{
"value": 2,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
}
]
},
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)",
"details": []
},
{
"value": 2,
"description": "fieldNorm(doc=0)",
"details": []
}
]
}
]
}
]
},
{
"value": 1,
"description": "weight(家庭住址:大街 in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 1,
"description": "score(doc=0,freq=1.0), product of:",
"details": [
{
"value": 0.5,
"description": "queryWeight, product of:",
"details": [
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)",
"details": []
},
{
"value": 0.5,
"description": "queryNorm",
"details": []
}
]
},
{
"value": 2,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
}
]
},
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)",
"details": []
},
{
"value": 2,
"description": "fieldNorm(doc=0)",
"details": []
}
]
}
]
}
]
},
{
"value": 1,
"description": "weight(家庭住址:街 in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 1,
"description": "score(doc=0,freq=1.0), product of:",
"details": [
{
"value": 0.5,
"description": "queryWeight, product of:",
"details": [
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)",
"details": []
},
{
"value": 0.5,
"description": "queryNorm",
"details": []
}
]
},
{
"value": 2,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
}
]
},
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)",
"details": []
},
{
"value": 2,
"description": "fieldNorm(doc=0)",
"details": []
}
]
}
]
}
]
}
]
},
{
"value": 0,
"description": "match on required clause, product of:",
"details": [
{
"value": 0,
"description": "# clause",
"details": []
},
{
"value": 0.5,
"description": "_type:user, product of:",
"details": [
{
"value": 1,
"description": "boost",
"details": []
},
{
"value": 0.5,
"description": "queryNorm",
"details": []
}
]
}
]
}
]
}
},
{
"_shard": 0,
"_node": "5Tv2a5YaQDqmzUFbTp4iaw",
"_index": "myindex",
"_type": "user",
"_id": "u003",
"_score": 0.71918744,
"_source": {
"用户ID": "u003",
"姓名": "王五",
"性别": "男",
"年龄": "26",
"家庭住址": "广州市花都区花城大街010号",
"注册时间": "2015-03-01 08:30:00"
},
"_explanation": {
"value": 0.71918744,
"description": "sum of:",
"details": [
{
"value": 0.71918744,
"description": "product of:",
"details": [
{
"value": 1.4383749,
"description": "sum of:",
"details": [
{
"value": 0.71918744,
"description": "weight(家庭住址:大街 in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.71918744,
"description": "score(doc=0,freq=1.0), product of:",
"details": [
{
"value": 0.35959372,
"description": "queryWeight, product of:",
"details": [
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)",
"details": []
},
{
"value": 0.35959372,
"description": "queryNorm",
"details": []
}
]
},
{
"value": 2,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
}
]
},
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)",
"details": []
},
{
"value": 2,
"description": "fieldNorm(doc=0)",
"details": []
}
]
}
]
}
]
},
{
"value": 0.71918744,
"description": "weight(家庭住址:街 in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.71918744,
"description": "score(doc=0,freq=1.0), product of:",
"details": [
{
"value": 0.35959372,
"description": "queryWeight, product of:",
"details": [
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)",
"details": []
},
{
"value": 0.35959372,
"description": "queryNorm",
"details": []
}
]
},
{
"value": 2,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
}
]
},
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)",
"details": []
},
{
"value": 2,
"description": "fieldNorm(doc=0)",
"details": []
}
]
}
]
}
]
}
]
},
{
"value": 0.5,
"description": "coord(2/4)",
"details": []
}
]
},
{
"value": 0,
"description": "match on required clause, product of:",
"details": [
{
"value": 0,
"description": "# clause",
"details": []
},
{
"value": 0.35959372,
"description": "_type:user, product of:",
"details": [
{
"value": 1,
"description": "boost",
"details": []
},
{
"value": 0.35959372,
"description": "queryNorm",
"details": []
}
]
}
]
}
]
}
},
......
]
}
}
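You can see that not only the record containing "魔都大街" was returned; any record containing "大街" was returned as well. The output also tells you why "u002" ranks first: it matches all four analyzed terms (魔, 都, 大街, 街), while the explanation for "u003" shows matches only on 大街 and 街, scaled down by coord(2/4).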
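To make the explain tree easier to follow, the arithmetic can be redone by hand. The sketch below only re-multiplies the numbers copied from the output above (idf, queryNorm, tf, fieldNorm, coord); the helper name term_score is made up for illustration and is not part of ES.

```python
def term_score(idf, query_norm, tf, field_norm):
    # queryWeight = idf * queryNorm
    query_weight = idf * query_norm
    # fieldWeight = tf * idf * fieldNorm
    field_weight = tf * idf * field_norm
    # score of one term = queryWeight * fieldWeight
    return query_weight * field_weight

# u002: all four terms match, queryNorm is 0.5 in its explanation.
u002_term = term_score(idf=1.0, query_norm=0.5, tf=1.0, field_norm=2.0)
u002_score = sum([u002_term] * 4) * (4 / 4)

# u003: only two of the four terms match, queryNorm is 0.35959372.
u003_term = term_score(idf=1.0, query_norm=0.35959372, tf=1.0, field_norm=2.0)
u003_score = sum([u003_term] * 2) * (2 / 4)   # coord(2/4) = 0.5

print(u002_score, u003_score)
```

The results agree with the _score values 4 and 0.71918744 reported for u002 and u003, up to floating-point rounding.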
Another use is to have ES tell you where a query is wrong:
curl -XPOST 'http://127.0.0.1:9200/myindex/user/_validate/query?explain' -d'
{
"query" : { "matchA" : { "家庭住址" : "魔都大街" }}
}'
# The result is as follows:
{
"valid": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"explanations": [
{
"index": "myindex",
"valid": false,
"error": "org.elasticsearch.index.query.QueryParsingException: No query registered for [matchA]"
}
]
}
ES tells you that matchA is where the error is.
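If you want to check such a response in a script rather than by eye, a small helper can pull out the error messages. This is a sketch that assumes the response has the shape shown above ("valid" plus an "explanations" list); the function name validation_errors is hypothetical, and the sample text is copied from the output above instead of being fetched from a live server.

```python
import json

def validation_errors(response_text):
    # Parse a _validate/query?explain response and return a list of
    # (index, error) pairs for every explanation marked invalid.
    body = json.loads(response_text)
    if body.get("valid"):
        return []
    return [
        (item.get("index"), item.get("error"))
        for item in body.get("explanations", [])
        if not item.get("valid")
    ]

# Sample response text, copied from the output shown above.
sample = """
{
  "valid": false,
  "_shards": {"total": 1, "successful": 1, "failed": 0},
  "explanations": [
    {
      "index": "myindex",
      "valid": false,
      "error": "org.elasticsearch.index.query.QueryParsingException: No query registered for [matchA]"
    }
  ]
}
"""

for index, error in validation_errors(sample):
    print(index, error)
```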