Introducing Sunspot (Apache Solr) - Actindi Developer Blog

Hello, this is tahara.

We have introduced Sunspot, which uses Apache Solr for full-text search, into Iko-yo (いこーよ).

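As described in a couple of earlier posts, we had been using MySQL's full-text search. When we tried Sunspot, though, we found that:

It is fast
Facets are extremely convenient
The Japanese morphological analyzer Kuromoji is available

That left one train of thought to get past:

Sunspot uses Apache Solr.
Apache Solr uses Java.
Java, huh...

We overcame that mental barrier and made the effort to switch.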

Solr 4.0.0 had just been released, so we decided to use the following versions:

Solr 4.0.0
sunspot 2.0.0.pre.120925
sunspot_rails 2.0.0.pre.120925

vi Gemfile

gem "sunspot_rails", "~> 2.0.0.pre.120925"
gem "sunspot", "~> 2.0.0.pre.120925"

bundle install

Add a searchable block to each model that should be covered by full-text search.

vi app/models/facility.rb

  searchable do
    text(:name, :boost => 1.5)
    text(:kana)
    string(:kana)
    text(:region_name, :boost => 3)
    text(:prefecture_name, :boost => 3)
    text(:address, :boost => 2)
    text(:body) do
      "#{pr} #{description} #{tag_list} #{search_keyword} #{features.map(&:name).join(' ')} #{ages.map(&:name).join(' ')}"
    end
    string(:tag, :multiple => true) do
      tag_list
    end
    latlon(:location) { Sunspot::Util::Coordinates.new(lat, lng) }
    integer(:age_ids, :multiple => true)
    integer(:feature_ids, :multiple => true)
    integer(:prefecture_id)
    integer(:region_id)
    integer(:favorites_count)
    boolean(:has_picture) do
      picture_1_file_size.to_i > 0
    end
    boolean(:publish)
    boolean(:coupon_enabled)
    float(:rating) do
      # For facilities with reviews, the nightly rating-update batch leaves ratings.created_at <> ratings.updated_at.
      if rating.created_at == rating.updated_at
        0
      else
        rating.overall_rating
      end
    end
    boost { coupon_enabled? ? 3.0 : 1.0 }
    time :created_at
  end

  def self.default_search_scope(solr, params)
    params = HashWithIndifferentAccess.new(params) unless HashWithIndifferentAccess === params
    solr.all_of do
      if params[:publish].blank?
        with(:publish, true)
      else
        with(:publish, params[:publish])
      end
      with(:age_ids, params[:age_ids]) if params[:age_ids].present?
      with(:feature_ids, params[:feature_ids]) if params[:feature_ids].present?
      with(:prefecture_id, params[:prefecture_ids]) if params[:prefecture_ids].present?
      with(:region_id, params[:region_ids]) if params[:region_ids].present?
      if params[:tags].present?
        params[:tags].each do |tag|
          with(:tag, tag) if tag.present?
        end
      end
    end
    solr.with(:location).in_radius(params[:lat], params[:lng], params[:distance] || 100, :bbox => true) if params[:lat].present? && params[:lng].present?
    solr.fulltext(params[:word]) if params[:word].present?
  end

Some search patterns come up again and again, so we collect them in self.default_search_scope.

The controller that runs the search:

vi app/controllers/facilities_controller.rb

    @facilities = Facility.search(:include => [:ages, :rating, :tags]) do
      Facility::default_search_scope(self, params)
      if params[:format] == 'rss'
        order_by :created_at, :desc
      elsif params[:lat].present?
        order_by_geodist(:location, params[:lat], params[:lng])
      else
        if params[:word].present?
          order_by :score, :desc
        end
        order_by :coupon_enabled, :desc
        order_by :rating, :desc
        order_by :has_picture, :desc
      end
      paginate(:page => params[:page], :per_page => params[:per_page])
      facet :region_id if params[:region_ids].blank? && params[:prefecture_ids].blank?
      facet :prefecture_id if params[:region_ids].present?
    end

With facet, the number of hits per prefecture (or per region) comes back together with the search results. This feature is very handy.
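As a rough illustration of reading those counts: Sunspot exposes a requested facet through `search.facet(:field).rows`, where each row has a `value` and a `count`. The sketch below uses small stand-in classes (`FakeSearch`, `Facet`, `FacetRow`, all hypothetical) so that it runs without Solr, and `facet_counts` is a helper name made up for this example, not something from the original code.

```ruby
# Hypothetical stand-ins for Sunspot's search and facet objects,
# so this sketch can run without a Solr server.
FacetRow = Struct.new(:value, :count)
Facet    = Struct.new(:rows)

class FakeSearch
  def initialize(facets)
    @facets = facets
  end

  # Mirrors the shape of Sunspot's search.facet(:field).
  def facet(name)
    @facets[name]
  end
end

# Hypothetical helper: turn a facet into { value => hit count }.
# Returns an empty hash when the facet was not requested.
def facet_counts(search, field)
  facet = search.facet(field)
  return {} if facet.nil?
  facet.rows.each_with_object({}) { |row, counts| counts[row.value] = row.count }
end

search = FakeSearch.new(
  :prefecture_id => Facet.new([FacetRow.new(13, 42), FacetRow.new(14, 7)])
)

facet_counts(search, :prefecture_id).each do |prefecture_id, count|
  puts "prefecture #{prefecture_id}: #{count} hits"
end
```

In the controller above, the same helper could presumably be called as `facet_counts(@facilities, :prefecture_id)`. The nil guard matters because each facet is only requested under certain conditions.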

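That is all for the Rails side. Next, the Solr side.

For schema.xml, we start from the one shipped with Sunspot and make a few changes.

Solr 4.0.0 seems to require a _version_ field (along with a long field type for it), so we add: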

    <field name="_version_" type="long" indexed="true" stored="true"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>

We also change the analyzer of the text field type so that it uses Kuromoji, the Japanese morphological analyzer.

    <fieldType name="text" class="solr.TextField" omitNorms="false" autoGeneratePhraseQueries="true" positionIncrementGap="100" >
      <analyzer type="index">
        <tokenizer class="solr.JapaneseTokenizerFactory" mode="search" userDictionary="lang/userdict_ja.txt"/>
        <!-- Reduces inflected verbs and adjectives to their base/dictionary forms (辞書形) -->
        <filter class="solr.JapaneseBaseFormFilterFactory"/>
        <!-- synonyms -->
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <!-- Removes tokens with certain part-of-speech tags -->
        <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" enablePositionIncrements="true"/>
        <!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
        <filter class="solr.CJKWidthFilterFactory"/>
        <!-- Removes common tokens typically not useful for search, but have a negative effect on ranking -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt" enablePositionIncrements="true" />
        <!-- Normalizes common katakana spelling variations by removing any last long sound character (U+30FC) -->
        <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
        <!-- Lower-cases romaji characters -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Converts katakana to hiragana -->
        <filter class="org.apache.lucene.analysis.icu.ICUTransformFilterFactory" id="Katakana-Hiragana" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.JapaneseTokenizerFactory" mode="search" userDictionary="lang/userdict_ja.txt"/>
        <!-- Reduces inflected verbs and adjectives to their base/dictionary forms (辞書形) -->
        <filter class="solr.JapaneseBaseFormFilterFactory"/>
        <!-- synonyms -->
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <!-- Removes tokens with certain part-of-speech tags -->
        <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" enablePositionIncrements="true"/>
        <!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
        <filter class="solr.CJKWidthFilterFactory"/>
        <!-- Removes common tokens typically not useful for search, but have a negative effect on ranking -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt" enablePositionIncrements="true" />
        <!-- Normalizes common katakana spelling variations by removing any last long sound character (U+30FC) -->
        <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
        <!-- Lower-cases romaji characters -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Converts katakana to hiragana -->
        <filter class="org.apache.lucene.analysis.icu.ICUTransformFilterFactory" id="Katakana-Hiragana" />
      </analyzer>
    </fieldType>
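To get a feel for what the later filters in this chain do to a single token, here is a Solr-free approximation in plain Ruby. It is only a sketch under loud assumptions: it does no tokenization, base-form reduction, synonym expansion, or stop-word removal; `unicode_normalize(:nfkc)` covers more than CJKWidthFilter does; and `String#tr` is a stand-in for the ICU Katakana-Hiragana transform, not the same implementation. The name `approximate_normalize` is made up for this example.

```ruby
# Rough approximation of four filters from the analyzer chain above.
# This is NOT what Solr runs; it only mimics the visible effect per token.
def approximate_normalize(token)
  # CJKWidthFilter stand-in: NFKC folds full-width romaji to half-width
  # and half-width kana to full-width (NFKC does more than the filter).
  folded = token.unicode_normalize(:nfkc)

  # JapaneseKatakanaStemFilter stand-in (minimumLength="4"): drop one
  # trailing long-sound mark (U+30FC) from all-katakana tokens of
  # four or more characters.
  if folded.length >= 4 && folded =~ /\A[\p{Katakana}ー]+ー\z/
    folded = folded.chop
  end

  # LowerCaseFilter stand-in.
  folded = folded.downcase

  # Katakana-Hiragana stand-in: map U+30A1..U+30F6 onto U+3041..U+3096.
  folded.tr("ァ-ヶ", "ぁ-ゖ")
end

puts approximate_normalize("ＫＩＤＳ")   # => kids
puts approximate_normalize("サーバー")   # => さーば
puts approximate_normalize("コピー")     # => こぴー (shorter than 4, so not stemmed)
puts approximate_normalize("ﾌﾟｰﾙ")       # => ぷーる
```

Because the index and query analyzers are configured identically, both the indexed text and the user's query end up in the same normalized form, which is what lets, for example, a hiragana query match a katakana word.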

We also need an init script to start Solr.

#! /bin/sh
### BEGIN INIT INFO
# Provides:          solr
# Required-Start:    $remote_fs $syslog
# Required-Stop:     $remote_fs $syslog
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Apache Solr
# Description:       Apache Solr
#                    sudo ln -s /var/www/outing/current/solr/etc/init.d/init.sh solr
#                    sudo update-rc.d solr defaults
### END INIT INFO

# Author: actindi <dev@actindi.net>
#
# Please remove the "Author" lines above and replace them
# with your own name if you copy and modify this script.

# Do NOT "set -e"

# PATH should only include /usr/* if it runs after the mountnfs.sh script
PATH=/sbin:/usr/sbin:/bin:/usr/bin
DESC="Solr"
NAME=solr
PROCESS_NAME=java
SOLR_HOME=/var/www/outing/current/solr
DAEMON=/usr/bin/java
DAEMON_ARGS="-Xmx1024m -Djava.util.logging.config.file=etc/logging.properties -jar start.jar"
PIDFILE=/var/run/$NAME/$NAME.pid
LOG_DIR=/var/log/$NAME
BASE_DIR=/var/lib/$NAME
DATA_DIR=$BASE_DIR/data
SCRIPTNAME=/etc/init.d/$NAME
SOLR_USER=deployer

# Exit if the package is not installed
[ -x "$DAEMON" ] || exit 0

# Read configuration variable file if it is present
[ -r /etc/default/$NAME ] && . /etc/default/$NAME

# Load the VERBOSE setting and other rcS variables
. /lib/init/vars.sh

# Define LSB log_* functions.
# Depend on lsb-base (>= 3.2-14) to ensure that this file is present
# and status_of_proc is working.
. /lib/lsb/init-functions

#
# Function that starts the daemon/service
#
do_start()
{
        mkdir `dirname $PIDFILE` > /dev/null 2>&1 || true
        chown $SOLR_USER `dirname $PIDFILE`
        mkdir $LOG_DIR > /dev/null 2>&1 || true
        chown $SOLR_USER $LOG_DIR
        mkdir -p $DATA_DIR > /dev/null 2>&1 || true
        chown $SOLR_USER $DATA_DIR
        # Return
        #   0 if daemon has been started
        #   1 if daemon was already running
        #   2 if daemon could not be started
        start-stop-daemon -b -m -c $SOLR_USER -d $SOLR_HOME --start --quiet --pidfile $PIDFILE --exec $DAEMON --test > /dev/null \
                || return 1
        start-stop-daemon -b -m -c $SOLR_USER -d $SOLR_HOME --start --quiet --pidfile $PIDFILE --exec $DAEMON -- \
                $DAEMON_ARGS \
                || return 2
        # Add code here, if necessary, that waits for the process to be ready
        # to handle requests from services started subsequently which depend
        # on this one.  As a last resort, sleep for some time.
}

#
# Function that stops the daemon/service
#
do_stop()
{
        # Return
        #   0 if daemon has been stopped
        #   1 if daemon was already stopped
        #   2 if daemon could not be stopped
        #   other if a failure occurred
        start-stop-daemon --stop --quiet --retry=TERM/30/KILL/5 --pidfile $PIDFILE --name $PROCESS_NAME
        RETVAL="$?"
        [ "$RETVAL" = 2 ] && return 2
        # Wait for children to finish too if this is a daemon that forks
        # and if the daemon is only ever run from this initscript.
        # If the above conditions are not satisfied then add some other code
        # that waits for the process to drop all resources that could be
        # needed by services started subsequently.  A last resort is to
        # sleep for some time.
        #start-stop-daemon --stop --quiet --oknodo --retry=0/30/KILL/5 --exec $DAEMON
        #[ "$?" = 2 ] && return 2
        # Many daemons don't delete their pidfiles when they exit.
        rm -f $PIDFILE
        return "$RETVAL"
}

#
# Function that sends a SIGHUP to the daemon/service
#
do_reload() {
        #
        # If the daemon can reload its configuration without
        # restarting (for example, when it is sent a SIGHUP),
        # then implement that here.
        #
        start-stop-daemon --stop --signal 1 --quiet --pidfile $PIDFILE --name $PROCESS_NAME
        return 0
}

case "$1" in
  start)
        [ "$VERBOSE" != no ] && log_daemon_msg "Starting $DESC" "$NAME"
        do_start
        case "$?" in
                0|1) [ "$VERBOSE" != no ] && log_end_msg 0 ;;
                2) [ "$VERBOSE" != no ] && log_end_msg 1 ;;
        esac
        ;;
  stop)
        [ "$VERBOSE" != no ] && log_daemon_msg "Stopping $DESC" "$NAME"
        do_stop
        case "$?" in
                0|1) [ "$VERBOSE" != no ] && log_end_msg 0 ;;
                2) [ "$VERBOSE" != no ] && log_end_msg 1 ;;
        esac
        ;;
  status)
        status_of_proc "$DAEMON" "$NAME" && exit 0 || exit $?
        ;;
  #reload|force-reload)
        #
        # If do_reload() is not implemented then leave this commented out
        # and leave 'force-reload' as an alias for 'restart'.
        #
        #log_daemon_msg "Reloading $DESC" "$NAME"
        #do_reload
        #log_end_msg $?
        #;;
  restart|force-reload)
        #
        # If the "reload" option is implemented then remove the
        # 'force-reload' alias
        #
        log_daemon_msg "Restarting $DESC" "$NAME"
        do_stop
        case "$?" in
          0|1)
                do_start
                case "$?" in
                        0) log_end_msg 0 ;;
                        1) log_end_msg 1 ;; # Old process is still running
                        *) log_end_msg 1 ;; # Failed to start
                esac
                ;;
          *)
                # Failed to stop
                log_end_msg 1
                ;;
        esac
        ;;
  *)
        #echo "Usage: $SCRIPTNAME {start|stop|restart|reload|force-reload}" >&2
        echo "Usage: $SCRIPTNAME {start|stop|status|restart|force-reload}" >&2
        exit 3
        ;;
esac

:

sudo update-rc.d solr defaults
sudo service solr start
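The init script leaves a placeholder for waiting until the process is ready. One possible way to do that from a deploy or reindex script is to poll the Solr port from Ruby. The sketch below is an assumption-laden example: `wait_for_port` is a made-up helper, and the port you pass depends on your own Jetty / sunspot.yml settings (8983 in the usage note is only the usual Solr example default, not something stated in this post).

```ruby
require "socket"

# Hypothetical helper: returns true once a TCP connection to host:port
# succeeds, or false if max_wait seconds pass without success.
def wait_for_port(host, port, max_wait = 30, interval = 0.5)
  deadline = Time.now + max_wait
  loop do
    begin
      # Socket.tcp closes the socket when the block exits.
      Socket.tcp(host, port, :connect_timeout => 1) { return true }
    rescue SystemCallError, SocketError
      # Connection refused, timed out, or host not resolvable yet.
      return false if Time.now >= deadline
      sleep interval
    end
  end
end
```

For example, `wait_for_port("localhost", 8983) or abort("Solr did not come up")` could run right after `sudo service solr start`. An open port only shows that Jetty is listening, so a stricter check would issue a real query.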

We put it into production this morning, and it seems to be running properly. What a relief.

We are hiring engineers. Please feel free to get in touch.