To illustrate the different query types in Elasticsearch, we will be searching a collection of book documents with the following fields: title, authors, summary, publish date, and number of reviews.
But first, let’s create a new index and index some documents using the bulk API:
PUT /bookdb_index
{ "settings": { "number_of_shards": 1 }}
POST /bookdb_index/book/_bulk
{ "index": { "_id": 1 }}
{ "title": "Elasticsearch: The Definitive Guide", "authors": ["clinton gormley", "zachary tong"], "summary" : "A distibuted real-time search and analytics engine", "publish_date" : "2015-02-07", "num_reviews": 20, "publisher": "oreilly" }
{ "index": { "_id": 2 }}
{ "title": "Taming Text: How to Find, Organize, and Manipulate It", "authors": ["grant ingersoll", "thomas morton", "drew farris"], "summary" : "organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization", "publish_date" : "2013-01-24", "num_reviews": 12, "publisher": "manning" }
{ "index": { "_id": 3 }}
{ "title": "Elasticsearch in Action", "authors": ["radu gheorge", "matthew lee hinman", "roy russo"], "summary" : "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms", "publish_date" : "2015-12-03", "num_reviews": 18, "publisher": "manning" }
{ "index": { "_id": 4 }}
{ "title": "Solr in Action", "authors": ["trey grainger", "timothy potter"], "summary" : "Comprehensive guide to implementing a scalable search engine using Apache Solr", "publish_date" : "2014-04-05", "num_reviews": 23, "publisher": "manning" }
Examples
Basic Match Query
There are two ways of executing a basic full-text (match) query: using the Search Lite API, which expects all the search parameters to be passed in as part of the URL, or using the full JSON request body, which allows you to use the full Elasticsearch DSL.
Here is a basic match query that searches for the string “guide” in all the fields:
GET /bookdb_index/book/_search?q=guide
[Results]
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "1",
"_score": 0.28168046,
"_source": {
"title": "Elasticsearch: The Definitive Guide",
"authors": [
"clinton gormley",
"zachary tong"
],
"summary": "A distibuted real-time search and analytics engine",
"publish_date": "2015-02-07",
"num_reviews": 20,
"publisher": "manning"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "4",
"_score": 0.24144039,
"_source": {
"title": "Solr in Action",
"authors": [
"trey grainger",
"timothy potter"
],
"summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
"publish_date": "2014-04-05",
"num_reviews": 23,
"publisher": "manning"
}
}
]
The full-body version of this query is shown below and produces the same results as the search lite query above.
{
"query": {
"multi_match" : {
"query" : "guide",
"fields" : ["_all"]
}
}
}
The multi_match keyword is used in place of the match keyword as a convenient shorthand way of running the same query against multiple fields. The fields property specifies which fields to query against, and in this case we want to query against all the fields in the document.
Both APIs also allow you to specify which fields you want to search. For example, to search for books with the words “in Action” in the title field:
GET /bookdb_index/book/_search?q=title:in action
[Results]
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "4",
"_score": 0.6259885,
"_source": {
"title": "Solr in Action",
"authors": [
"trey grainger",
"timothy potter"
],
"summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
"publish_date": "2014-04-05",
"num_reviews": 23,
"publisher": "manning"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "3",
"_score": 0.5975345,
"_source": {
"title": "Elasticsearch in Action",
"authors": [
"radu gheorge",
"matthew lee hinman",
"roy russo"
],
"summary": "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms",
"publish_date": "2015-12-03",
"num_reviews": 18,
"publisher": "manning"
}
}
]
However, the full body DSL gives you more flexibility in creating more complicated queries (as we will see later) and in specifying how you want the results back. In the example below, we specify the number of results we want back, the offset to start from (useful for pagination), the document fields we want returned, and term highlighting.
POST /bookdb_index/book/_search
{
"query": {
"match" : {
"title" : "in action"
}
},
"size": 2,
"from": 0,
"_source": [ "title", "summary", "publish_date" ],
"highlight": {
"fields" : {
"title" : {}
}
}
}
[Results]
"hits": {
"total": 2,
"max_score": 0.9105287,
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "3",
"_score": 0.9105287,
"_source": {
"summary": "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms",
"title": "Elasticsearch in Action",
"publish_date": "2015-12-03"
},
"highlight": {
"title": [
"Elasticsearch <em>in</em> <em>Action</em>"
]
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "4",
"_score": 0.9105287,
"_source": {
"summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
"title": "Solr in Action",
"publish_date": "2014-04-05"
},
"highlight": {
"title": [
"Solr <em>in</em> <em>Action</em>"
]
}
}
]
}
Note: For multi-word queries, the match query lets you specify whether to use the and operator instead of the default or operator. You can also specify the minimum_should_match option to tweak the relevance of the returned results. Details can be found here.
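For example, here is a minimal sketch of a match query that requires both words to be present by setting the operator to and; you could set "minimum_should_match": "75%" in its place to require only a proportion of the terms. (Both are standard match query options; the values are just illustrative.)
POST /bookdb_index/book/_search
{
    "query": {
        "match" : {
            "title": {
                "query": "in action",
                "operator": "and"
            }
        }
    }
}
Against our sample index this still matches both “in Action” titles, since each title contains both words; the difference only shows on queries where some terms are missing from a document.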
Multi-field Search
As we’ve seen already, to query more than one document field in a search (e.g. searching for the same query string in both the title and the summary), you use the multi_match query.
POST /bookdb_index/book/_search
{
"query": {
"multi_match" : {
"query" : "elasticsearch guide",
"fields": ["title", "summary"]
}
}
}
[Results]
"hits": {
"total": 3,
"max_score": 0.9448582,
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "1",
"_score": 0.9448582,
"_source": {
"title": "Elasticsearch: The Definitive Guide",
"authors": [
"clinton gormley",
"zachary tong"
],
"summary": "A distibuted real-time search and analytics engine",
"publish_date": "2015-02-07",
"num_reviews": 20,
"publisher": "manning"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "3",
"_score": 0.17312013,
"_source": {
"title": "Elasticsearch in Action",
"authors": [
"radu gheorge",
"matthew lee hinman",
"roy russo"
],
"summary": "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms",
"publish_date": "2015-12-03",
"num_reviews": 18,
"publisher": "manning"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "4",
"_score": 0.14965448,
"_source": {
"title": "Solr in Action",
"authors": [
"trey grainger",
"timothy potter"
],
"summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
"publish_date": "2014-04-05",
"num_reviews": 23,
"publisher": "manning"
}
}
]
}
Note that hit number 3 matched because the word “guide” was found in the summary.
Boosting
Since we are searching across multiple fields, we may want to boost the scores in a certain field. In the contrived example below, we boost scores from the summary field by a factor of 3 in order to increase the importance of the summary field, which will in turn increase the relevance of document _id 4.
POST /bookdb_index/book/_search
{
"query": {
"multi_match" : {
"query" : "elasticsearch guide",
"fields": ["title", "summary^3"]
}
},
"_source": ["title", "summary", "publish_date"]
}
[Results]
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "1",
"_score": 0.31495273,
"_source": {
"summary": "A distibuted real-time search and analytics engine",
"title": "Elasticsearch: The Definitive Guide",
"publish_date": "2015-02-07"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "4",
"_score": 0.14965448,
"_source": {
"summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
"title": "Solr in Action",
"publish_date": "2014-04-05"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "3",
"_score": 0.13094766,
"_source": {
"summary": "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms",
"title": "Elasticsearch in Action",
"publish_date": "2015-12-03"
}
}
]
Note: Boosting does not merely imply that the calculated score gets multiplied by the boost factor. The actual boost value that is applied goes through normalization and some internal optimization. More information on how boosting works can be found in the Elasticsearch guide.
Bool Query
The AND/OR/NOT operators can be used to fine-tune our search queries in order to provide more relevant or specific results. This is implemented in the search API as a bool query. The bool query accepts a must parameter (equivalent to AND), a must_not parameter (equivalent to NOT), and a should parameter (equivalent to OR). For example, if I want to search for a book with the word “Elasticsearch” OR “Solr” in the title, AND that is authored by “clinton gormley” but NOT by “radu gheorge”:
POST /bookdb_index/book/_search
{
    "query": {
        "bool": {
            "must": [
                { "bool" : { "should": [
                    { "match": { "title": "Elasticsearch" }},
                    { "match": { "title": "Solr" }} ] }},
                { "match": { "authors": "clinton gormley" }}
            ],
            "must_not": { "match": { "authors": "radu gheorge" }}
        }
    }
}
[Results]
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "1",
"_score": 0.3672021,
"_source": {
"title": "Elasticsearch: The Definitive Guide",
"authors": [
"clinton gormley",
"zachary tong"
],
"summary": "A distibuted real-time search and analytics engine",
"publish_date": "2015-02-07",
"num_reviews": 20,
"publisher": "oreilly"
}
}
]
Note: As you can see, a bool query can wrap any other query type, including other bool queries, to create arbitrarily complex or deeply nested queries.
Fuzzy Queries
Fuzzy matching can be enabled on match and multi-match queries to catch spelling errors. The degree of fuzziness is specified based on the Levenshtein distance from the original word.
POST /bookdb_index/book/_search
{
"query": {
"multi_match" : {
"query" : "comprihensiv guide",
"fields": ["title", "summary"],
"fuzziness": "AUTO"
}
},
"_source": ["title", "summary", "publish_date"],
"size": 1
}
[Results]
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "4",
"_score": 0.5961596,
"_source": {
"summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
"title": "Solr in Action",
"publish_date": "2014-04-05"
}
}
]
Note: The fuzziness value of "AUTO" is equivalent to specifying a value of 2 when the term length is greater than 5. However, 80% of human misspellings have an edit distance of 1, so setting the fuzziness to 1 may improve your overall search performance. See the Typos and Misspellings chapter of Elasticsearch: The Definitive Guide for more information.
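As a sketch of that suggestion, here is the earlier query with the fuzziness pinned to an edit distance of 1. Note the milder misspelling in the query string: “comprihensiv” is two edits away from “comprehensive” and would no longer match at this setting, so we use “comprehensiv” (one missing letter) instead.
POST /bookdb_index/book/_search
{
    "query": {
        "multi_match" : {
            "query" : "comprehensiv guide",
            "fields": ["title", "summary"],
            "fuzziness": 1
        }
    }
}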
Wildcard Query
Wildcard queries allow you to specify a pattern to match instead of the entire term. ? matches any single character, and * matches zero or more characters. For example, to find all records that have an author whose name begins with the letter ‘t’:
POST /bookdb_index/book/_search
{
"query": {
"wildcard" : {
"authors" : "t*"
}
},
"_source": ["title", "authors"],
"highlight": {
"fields" : {
"authors" : {}
}
}
}
[Results]
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "1",
"_score": 1,
"_source": {
"title": "Elasticsearch: The Definitive Guide",
"authors": [
"clinton gormley",
"zachary tong"
]
},
"highlight": {
"authors": [
"zachary <em>tong</em>"
]
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "2",
"_score": 1,
"_source": {
"title": "Taming Text: How to Find, Organize, and Manipulate It",
"authors": [
"grant ingersoll",
"thomas morton",
"drew farris"
]
},
"highlight": {
"authors": [
"<em>thomas</em> morton"
]
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "4",
"_score": 1,
"_source": {
"title": "Solr in Action",
"authors": [
"trey grainger",
"timothy potter"
]
},
"highlight": {
"authors": [
"<em>trey</em> grainger",
"<em>timothy</em> potter"
]
}
}
]
Regexp Query
Regexp queries allow you to specify more complex patterns than wildcard queries.
POST /bookdb_index/book/_search
{
"query": {
"regexp" : {
"authors" : "t[a-z]*y"
}
},
"_source": ["title", "authors"],
"highlight": {
"fields" : {
"authors" : {}
}
}
}
[Results]
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "4",
"_score": 1,
"_source": {
"title": "Solr in Action",
"authors": [
"trey grainger",
"timothy potter"
]
},
"highlight": {
"authors": [
"<em>trey</em> grainger",
"<em>timothy</em> potter"
]
}
}
]
Match Phrase Query
The match phrase query requires that all the terms in the query string be present in the document, be in the order specified in the query string, and be close to each other. By default, the terms are required to be exactly beside each other, but you can specify the slop value, which indicates how far apart terms are allowed to be while still considering the document a match.
POST /bookdb_index/book/_search
{
"query": {
"multi_match" : {
"query": "search engine",
"fields": ["title", "summary"],
"type": "phrase",
"slop": 3
}
},
"_source": [ "title", "summary", "publish_date" ]
}
[Results]
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "4",
"_score": 0.22327082,
"_source": {
"summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
"title": "Solr in Action",
"publish_date": "2014-04-05"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "1",
"_score": 0.16113183,
"_source": {
"summary": "A distibuted real-time search and analytics engine",
"title": "Elasticsearch: The Definitive Guide",
"publish_date": "2015-02-07"
}
}
]
Note: In the example above, for a non-phrase-type query, document _id 1 would normally have a higher score and appear ahead of document _id 4 because its field length is shorter. However, as a phrase query the proximity of the terms is factored in, so document _id 4 scores better.
Match Phrase Prefix
Match phrase prefix queries provide search-as-you-type, or a poor man’s version of autocomplete at query time, without needing to prepare your data in any way. Like the match_phrase query, it accepts a slop parameter to make the word order and relative positions somewhat less rigid. It also accepts the max_expansions parameter to limit the number of terms matched in order to reduce resource intensity.
POST /bookdb_index/book/_search
{
"query": {
"match_phrase_prefix" : {
"summary": {
"query": "search en",
"slop": 3,
"max_expansions": 10
}
}
},
"_source": [ "title", "summary", "publish_date" ]
}
[Results]
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "4",
"_score": 0.5161346,
"_source": {
"summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
"title": "Solr in Action",
"publish_date": "2014-04-05"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "1",
"_score": 0.37248808,
"_source": {
"summary": "A distibuted real-time search and analytics engine",
"title": "Elasticsearch: The Definitive Guide",
"publish_date": "2015-02-07"
}
}
]
Note: Query-time search-as-you-type has a performance cost. A better solution is index-time search-as-you-type. Check out the Completion Suggester API or the use of edge-ngram filters for more information.
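As a rough sketch of the index-time approach, an edge_ngram token filter can be wired into a custom analyzer when the index is created. The index, filter, and analyzer names below (autocomplete_index, autocomplete_filter, autocomplete) are made up for this example, and the field mapping that would apply the analyzer is omitted; see the Elasticsearch guide for the full recipe.
PUT /autocomplete_index
{
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "autocomplete_filter"]
                }
            }
        }
    }
}
A field indexed with this analyzer stores the prefixes of each word (“s”, “se”, “sea”, and so on) at index time, so ordinary match queries become prefix-aware without paying the query-time expansion cost.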
Query String
The query_string query provides a means of executing multi_match queries, bool queries, boosting, fuzzy matching, wildcards, regexp, and range queries in a concise shorthand syntax. In the following example, we execute a fuzzy search for the terms “saerch algorithm” in which one of the book authors is “grant ingersoll” or “tom morton”. We search all fields but apply a boost of 2 to the summary field.
POST /bookdb_index/book/_search
{
"query": {
"query_string" : {
"query": "(saerch~1 algorithm~1) AND (grant ingersoll) OR (tom morton)",
"fields": ["_all", "summary^2"]
}
},
"_source": [ "title", "summary", "authors" ],
"highlight": {
"fields" : {
"summary" : {}
}
}
}
[Results]
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "2",
"_score": 0.14558059,
"_source": {
"summary": "organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization",
"title": "Taming Text: How to Find, Organize, and Manipulate It",
"authors": [
"grant ingersoll",
"thomas morton",
"drew farris"
]
},
"highlight": {
"summary": [
"organize text using approaches such as full-text <em>search</em>, proper name recognition, clustering, tagging, information extraction, and summarization"
]
}
}
]
Simple Query String
The simple_query_string query is a version of the query_string query that is more suitable for use in a single search box exposed to users, because it replaces the use of AND/OR/NOT with +/|/- respectively, and it discards invalid parts of a query instead of throwing an exception if a user makes a mistake.
POST /bookdb_index/book/_search
{
"query": {
"simple_query_string" : {
"query": "(saerch~1 algorithm~1) + (grant ingersoll) | (tom morton)",
"fields": ["_all", "summary^2"]
}
},
"_source": [ "title", "summary", "authors" ],
"highlight": {
"fields" : {
"summary" : {}
}
}
}
Term/Terms Query
The above examples have been examples of full-text search. Sometimes we are more interested in structured search, in which we want to find an exact match and return the results. The term and terms queries help us here. In the example below, we are searching for all books in our index published by Manning Publications.
POST /bookdb_index/book/_search
{
"query": {
"term" : {
"publisher": "manning"
}
},
"_source" : ["title","publish_date","publisher"]
}
[Results]
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "2",
"_score": 1.2231436,
"_source": {
"publisher": "manning",
"title": "Taming Text: How to Find, Organize, and Manipulate It",
"publish_date": "2013-01-24"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "3",
"_score": 1.2231436,
"_source": {
"publisher": "manning",
"title": "Elasticsearch in Action",
"publish_date": "2015-12-03"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "4",
"_score": 1.2231436,
"_source": {
"publisher": "manning",
"title": "Solr in Action",
"publish_date": "2014-04-05"
}
}
]
Multiple terms can be specified by using the terms keyword instead and passing in an array of search terms:
POST /bookdb_index/book/_search
{
"query": {
"terms" : {
"publisher": ["oreilly", "packt"]
}
}
}
Term Query - Sorted
Term query results (like any other query results) can easily be sorted, and multilevel sorting is also allowed.
POST /bookdb_index/book/_search
{
"query": {
"term" : {
"publisher": "manning"
}
},
"_source" : ["title","publish_date","publisher"],
"sort": [
{ "publish_date": {"order":"desc"}},
{ "title": { "order": "desc" }}
]
}
[Results]
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "3",
"_score": null,
"_source": {
"publisher": "manning",
"title": "Elasticsearch in Action",
"publish_date": "2015-12-03"
},
"sort": [
1449100800000,
"in"
]
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "4",
"_score": null,
"_source": {
"publisher": "manning",
"title": "Solr in Action",
"publish_date": "2014-04-05"
},
"sort": [
1396656000000,
"solr"
]
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "2",
"_score": null,
"_source": {
"publisher": "manning",
"title": "Taming Text: How to Find, Organize, and Manipulate It",
"publish_date": "2013-01-24"
},
"sort": [
1358985600000,
"to"
]
}
]
Range Query
Another structured query example is the range query. In this example, we search for books published in 2015.
POST /bookdb_index/book/_search
{
"query": {
"range" : {
"publish_date": {
"gte": "2015-01-01",
"lte": "2015-12-31"
}
}
},
"_source" : ["title","publish_date","publisher"]
}
[Results]
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "1",
"_score": 1,
"_source": {
"publisher": "oreilly",
"title": "Elasticsearch: The Definitive Guide",
"publish_date": "2015-02-07"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "3",
"_score": 1,
"_source": {
"publisher": "manning",
"title": "Elasticsearch in Action",
"publish_date": "2015-12-03"
}
}
]
Note: Range queries work on date, number, and string type fields.
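To illustrate the note above with a numeric field from our sample index, here is a sketch of a range query on num_reviews that finds books with between 15 and 22 reviews (against the sample data this would match documents 1 and 3, with 20 and 18 reviews respectively):
POST /bookdb_index/book/_search
{
    "query": {
        "range" : {
            "num_reviews": {
                "gte": 15,
                "lte": 22
            }
        }
    },
    "_source" : ["title","num_reviews"]
}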
Filtered Query
Filtered queries allow you to filter down the results of a query. For our example, we are querying for books with the term “Elasticsearch” in the title or summary, but we want to filter our results to only those with 20 or more reviews.
POST /bookdb_index/book/_search
{
"query": {
"filtered": {
"query" : {
"multi_match": {
"query": "elasticsearch",
"fields": ["title","summary"]
}
},
"filter": {
"range" : {
"num_reviews": {
"gte": 20
}
}
}
}
},
"_source" : ["title","summary","publisher", "num_reviews"]
}
[Results]
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "1",
"_score": 0.5955761,
"_source": {
"summary": "A distibuted real-time search and analytics engine",
"publisher": "oreilly",
"num_reviews": 20,
"title": "Elasticsearch: The Definitive Guide"
}
}
]
Note: Filtered queries do not mandate the presence of a query to filter on. If no query is specified, the match_all query is run, which basically returns all the documents in the index, and the filter is then applied. In actuality, the filter is run first, reducing the surface area that needs to be queried. Also, the filter is cached after the first use, which makes it very performant.
UPDATE: Filtered queries have been removed from the upcoming Elasticsearch 5.0 in favor of the bool query. Here is the same example as above, rewritten to use the bool query instead. The results returned are exactly the same.
POST /bookdb_index/book/_search
{
"query": {
"bool": {
"must" : {
"multi_match": {
"query": "elasticsearch",
"fields": ["title","summary"]
}
},
"filter": {
"range" : {
"num_reviews": {
"gte": 20
}
}
}
}
},
"_source" : ["title","summary","publisher", "num_reviews"]
}
The same rewrite also applies when combining multiple filters, as in the example below.
Multiple Filters
Multiple filters can be combined through the use of the bool filter. In the next example, the filter determines that the returned results must have at least 20 reviews, must not be published before 2015, and should be published by oreilly.
POST /bookdb_index/book/_search
{
"query": {
"filtered": {
"query" : {
"multi_match": {
"query": "elasticsearch",
"fields": ["title","summary"]
}
},
"filter": {
"bool": {
"must": {
"range" : { "num_reviews": { "gte": 20 } }
},
"must_not": {
"range" : { "publish_date": { "lte": "2014-12-31" } }
},
"should": {
"term": { "publisher": "oreilly" }
}
}
}
}
},
"_source" : ["title","summary","publisher", "num_reviews", "publish_date"]
}
[Results]
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "1",
"_score": 0.5955761,
"_source": {
"summary": "A distibuted real-time search and analytics engine",
"publisher": "oreilly",
"num_reviews": 20,
"title": "Elasticsearch: The Definitive Guide",
"publish_date": "2015-02-07"
}
}
]
Function Score: Field Value Factor
There may be a case where you want to factor in the value of a particular field in your document when calculating the relevance score. This is typical in scenarios where you want to boost the relevance of a document based on its popularity. In our example, we would like the more popular books (as judged by the number of reviews) to be boosted. This is possible using the field_value_factor function score.
POST /bookdb_index/book/_search
{
"query": {
"function_score": {
"query": {
"multi_match" : {
"query" : "search engine",
"fields": ["title", "summary"]
}
},
"field_value_factor": {
"field" : "num_reviews",
"modifier": "log1p",
"factor" : 2
}
}
},
"_source": ["title", "summary", "publish_date", "num_reviews"]
}
[Results]
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "1",
"_score": 0.44831306,
"_source": {
"summary": "A distibuted real-time search and analytics engine",
"num_reviews": 20,
"title": "Elasticsearch: The Definitive Guide",
"publish_date": "2015-02-07"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "4",
"_score": 0.3718407,
"_source": {
"summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
"num_reviews": 23,
"title": "Solr in Action",
"publish_date": "2014-04-05"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "3",
"_score": 0.046479136,
"_source": {
"summary": "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms",
"num_reviews": 18,
"title": "Elasticsearch in Action",
"publish_date": "2015-12-03"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "2",
"_score": 0.041432835,
"_source": {
"summary": "organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization",
"num_reviews": 12,
"title": "Taming Text: How to Find, Organize, and Manipulate It",
"publish_date": "2013-01-24"
}
}
]
Note 1: We could have just run a regular multi_match query and sorted by the num_reviews field, but then we would lose the benefits of relevance scoring.
Note 2: There are a number of additional parameters that tweak the extent of the boosting effect on the original relevance score, such as “modifier”, “factor”, and “boost_mode”. These are explored in detail in the Elasticsearch guide.
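For comparison, a sketch of the plain alternative from Note 1: the same multi_match query with a sort clause, where ordering is by review count alone and the relevance score is effectively ignored.
POST /bookdb_index/book/_search
{
    "query": {
        "multi_match" : {
            "query" : "search engine",
            "fields": ["title", "summary"]
        }
    },
    "sort": [
        { "num_reviews": { "order": "desc" }}
    ],
    "_source": ["title", "num_reviews"]
}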
Function Score: Decay Functions
Suppose that instead of wanting to boost incrementally by the value of a field, you have an ideal value you want to target, and you want the boost factor to decay the further away you move from that value. This is typically useful in boosts based on lat/long, numeric fields like price, or dates. In our contrived example, we are searching for books on “search engines” ideally published around June 2014.
POST /bookdb_index/book/_search
{
"query": {
"function_score": {
"query": {
"multi_match" : {
"query" : "search engine",
"fields": ["title", "summary"]
}
},
"functions": [
{
"exp": {
"publish_date" : {
"origin": "2014-06-15",
"offset": "7d",
"scale" : "30d"
}
}
}
],
"boost_mode" : "replace"
}
},
"_source": ["title", "summary", "publish_date", "num_reviews"]
}
[Results]
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "4",
"_score": 0.27420625,
"_source": {
"summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
"num_reviews": 23,
"title": "Solr in Action",
"publish_date": "2014-04-05"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "1",
"_score": 0.005920768,
"_source": {
"summary": "A distibuted real-time search and analytics engine",
"num_reviews": 20,
"title": "Elasticsearch: The Definitive Guide",
"publish_date": "2015-02-07"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "2",
"_score": 0.000011564,
"_source": {
"summary": "organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization",
"num_reviews": 12,
"title": "Taming Text: How to Find, Organize, and Manipulate It",
"publish_date": "2013-01-24"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "3",
"_score": 0.0000059171475,
"_source": {
"summary": "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms",
"num_reviews": 18,
"title": "Elasticsearch in Action",
"publish_date": "2015-12-03"
}
}
]
Function Score: Script Scoring
In the case where the built-in scoring functions do not meet your needs, you have the option of specifying a Groovy script to use for scoring. In our example, we want a script that takes the publish_date into consideration before deciding how much to factor in the number of reviews. Newer books may not have as many reviews yet, so they should not be penalized for that.
The scoring script looks like this:
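// 'threshold' is a date string passed in via the query's params (see the script_score example below);
// doc['publish_date'].value and Date.parse(...).getTime() are both epoch milliseconds, so they compare directly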
publish_date = doc['publish_date'].value
num_reviews = doc['num_reviews'].value
if (publish_date > Date.parse('yyyy-MM-dd', threshold).getTime()) {
my_score = Math.log(2.5 + num_reviews)
} else {
my_score = Math.log(1 + num_reviews)
}
return my_score
To use a scoring script dynamically, we use the script_score parameter:
POST /bookdb_index/book/_search
{
"query": {
"function_score": {
"query": {
"multi_match" : {
"query" : "search engine",
"fields": ["title", "summary"]
}
},
"functions": [
{
"script_score": {
"params" : {
"threshold": "2015-07-30"
},
"script": "publish_date = doc['publish_date'].value; num_reviews = doc['num_reviews'].value; if (publish_date > Date.parse('yyyy-MM-dd', threshold).getTime()) { return log(2.5 + num_reviews) }; return log(1 + num_reviews);"
}
}
]
}
},
"_source": ["title", "summary", "publish_date", "num_reviews"]
}
[Results]
"hits": {
"total": 4,
"max_score": 0.8463001,
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "1",
"_score": 0.8463001,
"_source": {
"summary": "A distibuted real-time search and analytics engine",
"num_reviews": 20,
"title": "Elasticsearch: The Definitive Guide",
"publish_date": "2015-02-07"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "4",
"_score": 0.7067348,
"_source": {
"summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
"num_reviews": 23,
"title": "Solr in Action",
"publish_date": "2014-04-05"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "3",
"_score": 0.08952084,
"_source": {
"summary": "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms",
"num_reviews": 18,
"title": "Elasticsearch in Action",
"publish_date": "2015-12-03"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "2",
"_score": 0.07602123,
"_source": {
"summary": "organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization",
"num_reviews": 12,
"title": "Taming Text: How to Find, Organize, and Manipulate It",
"publish_date": "2013-01-24"
}
}
]
}
Note 1: To use dynamic scripting, it must be enabled for your Elasticsearch instance in the config/elasticsearch.yml file. It’s also possible to use scripts that have been stored on the Elasticsearch server. Check out the Elasticsearch reference docs for more information.
Note 2: JSON cannot include embedded newline characters, so the semicolon is used to separate statements.
I'm neither a statistician nor a professional data scientist. But when a friend and fellow Chattanooga-based developer, Jared Menard, and I started talking about the salary survey he put together, I was intrigued by the insights that could be gleaned from the data. The survey asked questions like:
- How much money do you make a year?
- How many years of experience do you have?
- How many books have you read in the last year?
- Do you have a degree? Is it related to your career?
- Name some things that you use at work.
TAP is Intel's new open source platform for performing data analysis and building data-driven apps over big data stores. As of this writing, the big data platform of choice is the Cloudera Hadoop distribution. TAP is trying to solve the problem of having to deal with many disparate platforms, frameworks, and processes for handling tasks around big data, bringing these different platforms under one roof, at least for management purposes. However, TAP's biggest value proposition is its solution to the common issue of data scientists writing code in Python or R that then has to be productionized (rewritten) to run in a distributed manner. Their solution to this problem is the Analytics Toolkit (ATK).
TAP ATK allows data scientists to write regular-looking Python data science code in a Jupyter Notebook that is automatically translated into Spark jobs and run on a Spark cluster. The scientist can then develop a model and publish it, making it available as a REST API for clients to consume. TAP ATK code must make use of the ATK dataframe (instead of the pandas dataframe) and the ATK machine learning lib (instead of scikit-learn) to take advantage of the automatic translation to Spark. But ATK attempts to maintain compatibility with the tools data scientists are already familiar with.
Examples of problems solved with the TAP ATK are typical machine learning problems like outlier detection of item placement at Levi stores and prediction of ER readmittance and ER workload. In the case of the Levi store example, TAP was used to ingest the events, filter them, store them in HDFS, and create the machine learning model.
Here's an image showing the functions/tools that make up TAP: 
More information can be found at:
The following blog post consists mostly of notes taken from Tim Krajcar's talk at OSCON 2016 in Austin. I am sharing this because I thought it was a great talk that deserved to have something more long form than the speaker deck available here. Although I have some of my own research mixed into these notes, credit for the ideas expressed below belongs to Tim.
Performance
Optimize request queuing time.
Request queuing time is the time after a request enters your production infrastructure and before it reaches your application. It may include an actual queue that requests enter, or it may represent other functions (such as load balancing or internal network latency). If you are monitoring request queuing along with your application metrics, you may find that under heavy loads your request queuing time shoots up even while your service/application performance stays uniform.
The images below show just that. The 1st image shows overall latency including request queuing time while the 2nd image shows latency with the queuing time taken out.
For more information on how to measure and optimize request queuing time checkout the following links:
- https://docs.newrelic.com/docs/apm/applications-menu/features/request-queuing-tracking-front-end-time
- https://www.appneta.com/blog/measuring-the-impact-of-request-queueing/
- http://blog.leansentry.com/all-about-iis-asp-net-request-queues/
- http://blog.scoutapp.com/articles/2011/03/30/detecting-haproxy-apache-passenger-queue-backlogs
- http://railsware.com/blog/2013/02/25/heroku-queuing-time-part1-problem/
- http://railsware.com/blog/2013/02/28/heroku-queuing-time-part2-solution/
Performance metrics during unit tests
Unit tests can be a useful tool for understanding the performance of your system. During unit testing you can run a profiling tool that shows how long tests, functions, and execution paths take to execute. Armed with this information, you can look at a list of the slowest tests / execution paths and try to understand why they are slow. In Ruby, a useful tool is minitest-perf.
Framework reduction
Use smaller, more focused frameworks. For example, if you don't need everything that Rails provides, use Sinatra instead; the same goes for using Sequel instead of ActiveRecord. In an experiment Tim conducted, he was able to optimize his Rails app to produce a 9.4% increase in client requests per minute by getting rid of jbuilder templates and using the oj gem. By replacing ActiveRecord with Sequel he was able to stretch that performance gain to 13.5%. But by replacing Rails with Sinatra, his performance jumped by 372%, from 18,011 requests per minute to 85,017 requests per minute!
You can also take a look at running your Ruby code on faster interpreters like JRuby and Rubinius.
Scaling tips
- Set reasonable timeouts on your client calls. The Ruby HTTP default timeout of 60 seconds is NOT a reasonable timeout. In tandem, use the circuit breaker pattern to avoid flooding struggling servers. Tools that can help here are the Timeout module and the cb2 gem.
- Parallelize operations over collections whenever possible. This is very easy to do with HTTP requests using Typhoeus Hydra. There's also the Parallel library that allows you to operate over collections using multiple processes or threads.
- Rate limit by client. If you are using Rack there's a middleware called rack/throttle that can enforce a minimum time interval between subsequent HTTP requests from a particular client and can also define a maximum number of allowed HTTP requests per a given time period (per minute, hourly, or daily).
I attended the DataStax Cassandra 2016 event in Atlanta and took a ton of notes on things that I found interesting. After going through those notes, it occurred to me that many of the nuggets in them could be useful to someone other than myself, so I’ve published the notes below.
The information below is mostly composed of quotes from DataStax engineers and evangelists; very little context is contained in these notes. However, if you are a beginning-to-intermediate-level Cassandra developer or admin, you’ll likely have the needed context already. I did attempt to organize the notes somewhat coherently, to allow you to jump to a section you care about and to provide some context in the grouping.
Data Modeling Tips
General
- When migrating to C*, don’t just port over your SQL schema. Be query-driven in the design of your schema.
- If you are planning on using Spark with C*, start with a C* use case / data model first and then use Spark on top of it for analytics.
- Patrick McFadin (DataStax evangelist) on not having certain relational DB constructs in C*: “In the past I’ve scaled SQL DBs by removing referential integrity, indexes, and denormalizing. I’ve even built a KV database on an Oracle DB that I was paying per-core licensing dollars for.” The implication is that these constructs bound scalability in relational databases, and in explicitly not having them, Cassandra’s scalability is unbounded (well, at least theoretically).
- You can stop partition hotspots by adding an additional column to the partition key (like the modulus of another column when divided by the number of nodes) or by increasing the resolution of the key in the case where the partition key is a time span.
- Using the “IF NOT EXISTS” clause stops an UPSERT from happening automatically / by default. It also creates a lock on the record while executing, so that multiple writers don’t step on each other trying to insert the same record in a race condition. This is a lightweight transaction (LWT). You can also create an LWT when doing a BATCH UPSERT.
- You can set a default TTL (Time To Live) on an individual table. This will apply to all data inserted into the table. A CQL insert can also specify a TTL for the inserted data that overrides the default.
- DTCS (DateTieredCompactionStrategy) compaction is built for time series data. It groups SSTables together by time so that older tables don’t get compacted and can be efficiently dropped if a TTL is set.
- CQL maps allow you to create complex types inside your data store.
- One of the reasons for limiting the size of elements that can be in a CQL collection is that on reads the entire collection must be denormalized as a whole in the JVM, so you can add a lot of data to the heap.
Secondary indexes
- Secondary indexes are not like the ones you have in relational DBs. They are not built for speed, they are built for access.
- Secondary indexes get slower the more nodes you have (because of network latencies).
- The best thing to do with a secondary index is just to test it out and see its performance, but do it on a cluster, not your laptop, so you can actually see how it would perform in prod. Secondary indexes are good for low cardinality data.
Development Tips
Querying
- Use the DataStax drivers, not ODBC drivers, because the DataStax drivers are token aware and can therefore send queries to the right node, removing the need for the coordinator to make excessive network requests depending on the consistency level.
- Use PreparedStatements for repeated queries. The performance difference is significant.
- Use ExecuteAsync with PreparedStatements when bulk loading. You can have callbacks on Future objects and use the callbacks for things like detecting a failure and responding appropriately.
- BATCH is not a performance optimization. It leads to garbage collection and hotspots because the data stays in memory on the coordinator.
- Use BATCH only to update multiple tables at once atomically. An example is if you have a materialized view / inverted index table that needs to be kept in sync with the main table.
General
- Updates on collections create range tombstones to mark the old version of the collection (map, set, list) as deleted and create the new one. This is important to know because tombstones affect read performance, and at a certain point having too many tombstones (100K) can cause a read to fail. http://www.jsravn.com/2015/05/13/cassandra-tombstones-collections.html
- Cassandra triggers should be used with care and only in specific use cases, because you need to consider the distributed nature of C*.
Ops Tips
Replication Settings
- SimpleStrategy fails if you have multiple datacenters (DCs), because the 50% of your traffic that goes to the other DC becomes terribly slow. Use NetworkTopologyStrategy instead. You can configure how replication goes to each DC individually, so you can have a table that never gets replicated to the US, for example.
- If you are using the NetworkTopologyStrategy, you should use the GossipingPropertyFileSnitch to make C* network-topology aware, instead of the plain property file snitch, because you then don’t have to change the file on every node and reboot them.
Hardware Sizing
Recommended node size:
- 32 GB RAM
- 8-12 cores
- 2 TB SSD
Recommendation for Spark & Cassandra on the same node: Spark jobs run in their own process and therefore have their own heap that can be tuned and managed separately. Depending on how much performance you are trying to get out of C*, Cassandra should get its 32 GB of RAM as usual. Anything over should then go to Spark. So for example to get great performance you could have a 64 GB RAM system with 32 GB to C* and 32 GB to Spark. Same thing for cores. You should have 12-16 cores; 8-12 for C* and the rest for Spark. If vertical scaling starts to get too expensive you can alternatively add more nodes to meet performance expectations.
The recommendation is to have no more than 1 TB of data per node. The reason for a 2 TB disk despite the 1 TB recommendation is that once over 60% of your disk is full, you run the risk of not having enough disk space during compaction. This is especially true if you are using size-tiered compaction. With leveled compaction you can use up to 70% without risk.
Use RAID 0 for your storage configuration. C* does replication for you. You can also use JBOD and C* can intelligently handle failures of some of the disks in your JBOD cluster.
Java Heap Size & Garbage Collection
- As a general rule of thumb, start with the defaults and then walk it up.
- The ParNew/CMS GC works best with 8 GB.
- The G1GC can manage 20 GB of RAM (note: another engineer mentioned to me that 32 GB of RAM is no problem for G1GC). It should not be used if the heap is under 8 GB.
- Use G1GC with Solr / DSE Search nodes.
Memory Usage and Caching
- It’s very important to have ample off-heap RAM. Some C* data structures, such as memtables and bloom filters, are off-heap. You also want to have non-heap RAM available for page caching.
- Row caching can significantly speed up reads because it avoids a table scan (if the page isn’t cached already). However, row caching should be used judiciously. The best use case is for tables with a high density of hotspot data, the reason being that on a large table with varying and disparate data and seemingly random reads, you’ll end up with a lot of cache misses, which invalidates the point of having a cache.
- The row cache is filled on reads; memtables are filled on writes.
- Memtables remain in memory until there is memory pressure (based on configuration in the cassandra.yaml), then they are removed from RAM.
Benchmarking
- Use the Cassandra Stress program that comes with C*.
- Cassandra Stress can be configured; you can specify the number of columns, data size, data model, queries, types of data, cardinality, etc.
- To model production, use multiple clients and multiple threads per client in your benchmarking.
- When stress testing, make sure you run it long enough to run into compactions, GC, and repairs, because when you test you want to know what happens in those situations. You can even stress test while introducing failures and see how the cluster responds. You can/should instrument the JVM during stress testing and then go back and look at the results.
- The general recommended stress test duration is 24 - 48 hours.
- DSE has solr-stress for testing the Solr integration.
Interesting Note: A DataStax customer once conducted a stress test that ran 24 hours a day for 6-8 weeks. They were testing to see how changing the compaction strategy impacted their read-heavy workload.
General
- Turn on user authentication, at least basic auth. This is good for security and auditing purposes. It also keeps you from accidentally dropping a production table because you thought you were connected to staging.
- Use TLS if you are talking between DCs across the public internet.
- Don’t just bump up the heap for greater performance or to solve your problems! You’ll have to pay for it later during GC.
- If you have crappy latency on 1% of your operations, you shouldn’t just ignore it. You should try to understand what happened. Is it compaction? Is it GC? That way you can address the issue that caused the high latency, because that 1% could one day be 5%.
- Why should we have backups? Backups exist to protect against people, not machines. Data corruption is the primary reason for backups; for example, someone accidentally changes all the ‘1’s in your DB to ‘c’s.
- There is no built-in way to count the number of rows in a Cassandra table. The only way to do so is to write a Spark job. You can estimate the row count if you know the amount of data per row by dividing the table size by that amount.
- Use ntpd! C* nodes in a cluster must always be on time, because timestamps are important and are used in resolving conflicts. Clock drift cannot be tolerated.
- Tombstone Hell: queries on partitions with a lot of tombstones require a lot of filtering, which can cause performance problems. Compaction gets rid of tombstones.
- Turn off swap on C* nodes.
- If C* runs out of memory it just dies. But that’s perfectly OK, because the data is distributed / replicated and you can just bring it back up. In the meantime, data will be read from the other nodes.
Cluster Management
- Don’t put a load balancer in front of your C* cluster.
- Make sure you are running repairs. Repairs are essentially network defrag and help maintain consistency. Run repairs a little at a time, all the time.
- If you can model your data to have TTLs, you can run repairs much less often, or not at all.
- If you never delete your data, you can set gc_grace_period to 0.
- Don’t upgrade your C* versions by replacing an outgoing node with a new node running a newer version of C*. C* is very sensitive when it comes to running mixed versions in production; the older nodes may not be able to stream data to the newer node. Instead, you should do an in-place upgrade, i.e. shut down the node (the C* service), upgrade C*, and then bring it back up. (https://docs.datastax.com/en/upgrade/doc/upgrade/cassandra/upgradeCassandraDetails.html)
- When a new node is added in order to increase storage capacity / relieve storage pressure on the existing nodes, ensure you run nodetool cleanup as the final step. This is because C* won’t automatically reclaim the space of the data streamed out to the new node.
Monitoring & Diagnostics
Monitoring services for capturing machine-level metrics:
- Monit
- Munin
- Icinga
- JMX Metrics
Since you are running on multiple machines it becomes important to aggregate your logs.
- Logstash
Useful command-line diagnostic tools:
- htop (a better version of top)
- iostat
- dstat
- strace
- jstack
- tcpdump (monitor network traffic; you can even see plain-text queries coming in)
- nodetool tpstats (can help diagnose performance problems by showing you which thread pools are overwhelmed / blocked; from there you can form hypotheses as to the cause of the blockage / performance problem)
DSE Offering
DSE Max => Cassandra + Support + Solr + Spark
DSE Search
- Solr fixes a couple of rough edges for C*, like joins, ad-hoc querying, fuzzy text searching, and secondary indexing problems in larger clusters.
- DSE Search has tight Solr integration with C*. C* stores the data; Solr stores the indexes. CQL searches that use the solr_query expression in the WHERE clause search Solr first for the location of the data to fetch and then query C* for the actual data.
- You can check out killrvideo’s GitHub for an example of DSE Search in action (https://github.com/LukeTillman/killrvideo-csharp/blob/master/src/KillrVideo.Search/SearchImpl/DataStaxEnterpriseSearch.cs)
- Solr is about a 3x multiplication on the CPU and RAM needed for running regular C*. This is because Solr indexes must live in RAM.
- Solr can do geospatial searches and arbitrary time range searches (another rough edge that C* cannot handle), e.g. “search for all sales in the past 4 mins 30 seconds”.
DSE Spark
- Spark runs over distributed data stores and schedules analytics jobs on workers on those nodes. DSE Max has Spark integration that just requires the flipping of a switch, no additional config.
- There’s no need for definition files; workers automatically have access to the tables, and Spark is data-locality aware, so jobs go to the right node.
- Optionally, with DSE Search integration you can have search running on the same nodes that have the analytics and data, and leverage the search indexes for faster querying instead of doing table scans.
- With DSE analytics, you can create an analytics DC and have 2-way replication between the operations DC and the analytics DC. 2-way is important because it means that the analytics DC can store the result of its computation to the Cassandra table, which then gets replicated back to the ops DC.
- The Spark jobs / workers have access to more than just the C* table data. They can do just about anything you code and can pull data from anything: open files, read queues, JDBC data stores, HDFS, etc. They can write data back out as well.
- Recommendation for Spark & Cassandra on the same node: appropriate resource allocation is important, and having Spark will require more memory. Spark jobs run in their own process and therefore have their own heap that can be tuned and managed separately. As noted under Hardware Sizing, C* should get its 32 GB of RAM as usual and anything over should go to Spark; likewise, 8-12 of your 12-16 cores go to C* and the rest to Spark. If vertical scaling starts to get too expensive, you can alternatively add more nodes to meet performance expectations.
Other Notes
Cassandra has a reference implementation called killrvideo. It is an actual website hosted on MS Azure, at killrvideo.com, written by Luke Tillman in C#. Check out the source code on GitHub (https://github.com/LukeTillman/killrvideo-csharp).
In your cassandra-env.sh, set:
LOCAL_JMX=no
If you want username and password security, keep the default setting for jmxremote.authenticate, which is true. Otherwise set it to false:
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=false"
Note: If set to true, enter a username and password for the JMX user in the jmxremote.password.file. Follow the instructions here on how to secure that file. Setting jmxremote.authenticate to true also requires you to pass in the username and password when running a nodetool command, e.g. nodetool status -u cassandra -pw cassandra
Restart the server (if needed).
Connecting to the node using JConsole for monitoring
Find the JConsole jar in your JDK_HOME/bin or JDK_HOME/lib directory. If you don’t have the Java JDK installed, it can be downloaded from the Oracle website.
Double-click the executable jar to run it (or run it from the command line). Select “Remote Process” and enter the following connection string,
service:jmx:rmi:///jndi/rmi://<target ip address>:7199/jmxrmi
replacing <target ip address> with the address of the machine you intend to manage or monitor.
Note: JConsole may try (and fail) to connect using SSL first. If it does, it will ask whether it can connect over a non-encrypted connection. Answer this prompt in the affirmative and you are good.
Congrats! You now have everything you need to monitor and manage a Cassandra node.
For help with how to monitor Cassandra using JMX and with interpreting the metrics see:
The Right Database for the Right Job - Chattanooga Developer Lunch Presentation
2/19/2016
Does this sound like you? "OMG!! PostgreSQL, Neo4j, Elasticsearch, MongoDB, RethinkDB, Cassandra, SQL Server, Riak, InfluxDB, Oracle NoSQL, SQLite, Hive, Couchbase, CouchDB, DynamoDB. I've got an issue with my current database solution or I'm starting a new project and I don't know what to choose!"
This talk is intended to help you match your data storage needs with suitable solutions from a wide field of contenders. Looking at different data types, structures, and interaction patterns, we will try to understand what makes certain data stores better suited than others and how to implement polyglot persistence.