Don’t use TextField for your unique key in Solr
This seems obvious once you think about it: TextField is what you use for fuzzy searches in Solr, and why would anyone want a fuzzy search on a unique value? While I can come up with some oddball use cases, copy fields are the more sensible approach and fit the typical use of Solr, i.e. you filter on strings and query on text. A sketch of what I mean follows.
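If you really do want fuzzy matching on the key's contents, a minimal sketch of the copy-field approach (the id_search field name here is just illustrative) keeps the unique key a StrField and copies it into a TextField for querying:
<field indexed="true" name="id" stored="true" type="StrField"/>
<field indexed="true" name="id_search" stored="false" type="TextField"/>
<copyField source="id" dest="id_search"/>
The uniqueKey stays pointed at the untokenized id, so lookups and deletes match exactly one document, while id_search holds the analyzed, fuzzy-searchable copy.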
However, people have done this a few times, and it throws me for a loop; in the case of DataStax Enterprise Search (built on Solr) it creates an interesting split between the index and the data.
Given a Cassandra schema of
cqlsh> CREATE TABLE test.my_table (id text, value text, PRIMARY KEY(id));
A Solr schema of (note that the unique key, id, is typed as TextField):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
<types>
<fieldType class="org.apache.solr.schema.TextField" name="TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
</fieldType>
<fieldType class="org.apache.solr.schema.StrField" name="StrField"/>
</types>
<fields>
<field indexed="true" name="value" stored="true" type="TextField"/>
<field indexed="true" name="id" stored="true" type="TextField"/>
</fields>
<uniqueKey>id</uniqueKey>
</schema>
Initial records never get indexed
I'm assuming this is because the part of indexing that checks whether a row has already been visited is thrown off by the tokens:
First fill up a table
cqlsh> INSERT INTO test.my_table (id, value) VALUES ('1234', '999');
cqlsh> INSERT INTO test.my_table (id, value) VALUES ('1235', '999'); //this is the guy that never shows up
cqlsh> INSERT INTO test.my_table (id, value) VALUES ('123', '999');
Then turn on indexing
./bin/dsetool create_core test.my_table schema=/home/user/schema.xml solrconfig=/home/user/solrconfig.xml
Add one more record
cqlsh> INSERT INTO test.my_table (id, value) VALUES ('9999', 'added after');
Then query via Solr and… no '1235' or '1234':
cqlsh> select * from test.my_table where solr_query='*:*';
 id   | solr_query | value
------+------------+-------------
 123  | null       | 999
 9999 | null       | added after   //ok, the new record is indexed.. but where are '1234' and '1235'?
But Cassandra knows all about them
cqlsh> select * from test.my_table;
 id   | solr_query | value
------+------------+-------------
 123  | null       | 999
 1234 | null       | 999
 1235 | null       | 999           //STILL NOT SHOWING UP IN INDEX
 9999 | null       | added after   //BUT NEW RECORD ALREADY IS
To recap: we never indexed '1234' and '1235', yet for some reason '123' indexes, and when I later add '9999' it indexes fine. Later testing showed that as soon as '1234' was re-added it joined the search results, so this only appears to happen to records that existed before the core was created.
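Roughly, the re-add test looks like this (the id:1234 query is just a standard Solr field query against the core):
cqlsh> INSERT INTO test.my_table (id, value) VALUES ('1234', '999'); //same row, inserted again
cqlsh> select * from test.my_table where solr_query='id:1234';     //now '1234' comes back in the results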
Deletes can greedily remove LOTS
With a slightly different set of rows in the table, Solr sees:
cqlsh> select * from test.my_table where solr_query='*:*';
 id        | solr_query | value
-----------+------------+------------
 1230      | null       | 999
 1234      | null       | 999
 9999      | null       | 999
 1234 4566 | null       | with space
I delete id '1234':
cqlsh> DELETE FROM test.my_table where id='1234';
But when I query Solr I find only:
cqlsh> select * from test.my_table where solr_query='*:*';
 id   | solr_query | value
------+------------+-------
 9999 | null       | 999
I deleted id '1234', but reading the table through Solr I found only one other record, with an id of '9999'.
So where did '1234 4566', '1235', and '1230' go? If I query Cassandra directly they're safe and sound; Solr just has no idea about them anymore. My guess is the tokenized key plays a part here too: the StandardTokenizer splits '1234 4566' into the tokens '1234' and '4566', so a delete keyed on '1234' can match documents in the index it was never meant to touch.
cqlsh> select * from test.my_table;
 id        | solr_query | value
-----------+------------+------------
 1230      | null       | 999
 1235      | null       | 999
 1234 4566 | null       | with space
 9999      | null       | 999
To recap: this is just nasty, and the only fix I've found is either reindexing or re-adding the records.
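For the reindex route, DSE's dsetool can rebuild the core; something along these lines should do it (check the dsetool docs for your version — deleteAll=true drops the existing index before rebuilding):
./bin/dsetool reload_core test.my_table reindex=true deleteAll=true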
Summary
Just use a StrField type for your key and everything is happy. Special thanks to J.B. Langston (Twitter: https://twitter.com/jblang3) of DataStax for finding the nooks and crannies and then letting me take credit by posting about it.
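For reference, the only change needed in the schema above is the type on the id field; the uniqueKey declaration stays the same:
<field indexed="true" name="id" stored="true" type="StrField"/>
<uniqueKey>id</uniqueKey>
With the key stored as an untokenized string, the problems above go away: existing rows index when the core is created, and a delete matches only the exact key.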