Setting up Solr on Nginx with Let's Encrypt
I want to enable search on my blog, so I started looking into different solutions. I started with ElasticSearch but ran into too many issues getting it set up, so I paused that and moved on to Solr. I fully intend to work with ElasticSearch and Kibana soon.
Setting up Solr comes with its own set of challenges: some were blatantly obvious mistakes that I made, and others required a little bit of digging.
I configured my solution on a virtual server running Fedora 25.
Install Solr
Solr is an open source enterprise search platform, written in Java, from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document handling.
Missing dependency
The Solr installation guide is quite straightforward, but I got an error because of a missing dependency: the lsof package, which "lists open files" and the processes that opened them.
sudo dnf install lsof
Install package
I downloaded the package referenced in the installation guide using curl:
curl http://mirror.za.web4africa.net/apache/lucene/solr/7.6.0/solr-7.6.0.tgz -o solr-7.6.0.tgz
tar xzf solr-7.6.0.tgz solr-7.6.0/bin/install_solr_service.sh --strip-components=2
sudo bash ./install_solr_service.sh solr-7.6.0.tgz
The solr user will be created by the installation script and will own /opt/solr and /var/solr.
Once the script completes, Solr will be installed as a service and running in the background on your server on port 8983.
sudo service solr status
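To double check that the service is actually answering requests, and not just reported as running, you can hit one of Solr's admin endpoints with curl (just my own quick sanity check, not part of the official guide):
# Should return JSON describing the Solr version, JVM and system properties
curl "http://localhost:8983/solr/admin/info/system?wt=json"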
Open firewall
I temporarily opened port 8983 while I was working on the project. The port had to be opened both in my server's firewall and in my hosting provider's firewall, the latter through their admin interface.
First, I need to ensure that the firewall is enabled on my server:
sudo firewall-cmd --state
I need to add the Solr port to the firewall:
sudo firewall-cmd --zone=public --permanent --add-port=8983/tcp
Once I have made a change to the firewall, I need to reload it for the change to take effect:
sudo firewall-cmd --reload
Verify that my changes took effect:
sudo firewall-cmd --zone=public --list-ports
Test connection
Open a browser and browse to http://localhost:8983/solr, or run curl http://localhost:8983/solr.
Test it remotely by accessing it with your public IP address.
Reflection
My connections were timing out and I had double checked my configuration. After Googling, the obvious answer hit me: firewall. I had made the changes but forgot to reload the firewall for the change to take effect.
Add a Solr core
A Solr Core is a running instance of a Lucene index that contains all the Solr configuration files required to use it. We need to create a Solr Core to perform operations like indexing and analyzing. A Solr application may contain one or multiple cores.
Using the solr user created during the installation, I create a new core:
su solr
cd /opt/solr/bin
./solr create -c collection_name
The core is now available in the web interface, ready to index some data: http://localhost:8983/solr/#/~cores/collection_name
Side note: You can delete a core using the delete command ./solr delete -c collection_name
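To convince yourself the new core is alive, you can issue an empty match-all query against it (a quick check I like to run; numFound will be 0 at this point since nothing has been indexed yet):
curl "http://localhost:8983/solr/collection_name/select?q=*:*&rows=0"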
Curve ball: I received a warning about my ulimit settings:
*** [WARN] *** Your open file limit is currently 1024.
It should be set to 65000 to avoid operational disruption.
If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
I set it to the suggested amount. I am not sure if I did this correctly though.
ulimit -a
ulimit -n 65000
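For the record, ulimit -n only raises the limit for the current shell session. A common way to make it persistent, which I have not verified end to end on this particular box, is to raise the open-file limit for the solr user in /etc/security/limits.conf; depending on how the service is launched, the limit may instead need to be set in the service definition itself:
# Sketch: persist a higher open-file limit for the solr user.
# Note: limits.conf applies to PAM sessions; a service managed directly by
# systemd may need LimitNOFILE configured in its unit instead.
sudo tee -a /etc/security/limits.conf > /dev/null <<'EOF'
solr soft nofile 65000
solr hard nofile 65000
EOF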
Reflection
Don't be scared of creating, deleting and recreating things, especially in the beginning, while learning. Break it, fix it, understand it, learn it.
Scrape Solr
Solr needed some data, so I found a really useful Python tutorial for creating a crawler for my blog. The crawler runs on my Fedora server, which is not the server hosting my blog.
Missing dependencies
While setting up, I came across the following missing dependencies:
sudo dnf install python-devel
The libraries and header files needed for Python development.
pip install twisted
An extensible framework for Python programming, with special focus on event-based network programming and multiprotocol integration.
Install Scrapy
Prepare to run Scrapy in a python virtualenv:
PROJECT_DIR=~/projects/scrapy
mkdir $PROJECT_DIR
cd $PROJECT_DIR
virtualenv scrapyenv
source scrapyenv/bin/activate
pip install scrapy
Create a Scrapy application:
scrapy startproject blog
cd blog
Edit blog/items.py to indicate what needs to be indexed:
from scrapy.item import Item, Field

class BlogItem(Item):
    title = Field()
    url = Field()
    text = Field()
Create a spider to crawl my blog:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from blog.items import BlogItem
from urlparse import urljoin
import re

class BlogSpider(BaseSpider):
    name = 'blog'
    allowed_domains = ['curiousprogrammer.io']
    start_urls = ['https://curiousprogrammer.io/']
    seen = set()

    def parse(self, response):
        if response.url in self.seen:
            self.log('already seen %s' % response.url)
        else:
            self.log('parsing %s' % response.url)
            self.seen.add(response.url)

            hxs = HtmlXPathSelector(response)
            # Only blog post pages become Solr documents
            if re.match(r'https://curiousprogrammer.io/blog/\S.*$', response.url):
                item = BlogItem()
                item['title'] = hxs.select('//title/text()').extract()
                item['url'] = response.url
                item['text'] = hxs.select('//div[@id="post"]//child::node()/text()').extract()
                self.log("yielding item " + response.url)
                yield item

            # Follow links on the page, skipping binary assets
            for url in hxs.select('//a/@href').extract():
                url = urljoin(response.url, url)
                if url not in self.seen and not re.search(r'\.(pdf|zip|jar|gif|png|jpg)$', url):
                    self.log("yielding request " + url)
                    yield Request(url, callback=self.parse)
Crawl the blog. An items.json file is generated; each subsequent crawl appends to the same file.
scrapy crawl blog -o items.json -t json
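Before pushing anything to Solr, it is worth eyeballing the scraped output to confirm the crawl actually picked up posts (an optional check on my part):
# Pretty-print the first part of the scraped items
python -m json.tool items.json | head -40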
The tutorial showcases an indexing Python script using pysolr, but it didn't work for me, so I indexed the data directly through the Solr API using curl.
curl "http://localhost:8983/solr/collection_name/update/json/docs?commit=true" -H "Content-type:application/json" --data-binary @items.json
I set up a daily cron job to index the data using crontab. First, create the script at ~/projects/scrapy/blog/crawl-and-index (I used vim):
#!/bin/bash
echo "Delete entries from Solr"
curl http://localhost:8983/solr/oxygen/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
curl http://localhost:8983/solr/oxygen/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
echo "Remove existing scraped database"
cd /root/projects/scrapy/blog
rm items.json
echo "Enter they scrapy virtualenv"
source ../scrapyenv/bin/activate
echo "Start crawling your blog..."
scrapy crawl blog -o items.json -t json
echo "Index Solr with crawled database"
curl "http://localhost:8983/solr/oxygen/update/json/docs?commit=true" -H "Content-type:application/json" --data-binary @items.json
echo "Bye!"
crontab -e
@daily ~/projects/scrapy/blog/crawl-and-index
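One gotcha: cron can only run the script if it is executable, so set the executable bit and confirm the entry was saved:
chmod +x ~/projects/scrapy/blog/crawl-and-index
crontab -l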
Set up your hosting environment
I no longer wanted to access the Solr API publicly via its port. To achieve this, I had to configure a reverse proxy. A great benefit of this approach is that it allows the use of SSL. To get SSL working, I had to start by getting a domain name.
Get a domain name
You can get a free domain name at freenom, or from one of the other services that are available.
To release your inner geek, you can update your host's name:
sudo hostname new_host_name
sudo vim /etc/hostname
Update your /etc/hosts to look something like the following:
127.0.0.1 new_host_name
Add the nameservers that your domain registrar provides to you:
vim /etc/resolv.conf
# Generated by NetworkManager
search new_host_name
nameserver <IP>
nameserver <IP>
nameserver 8.8.8.8 #Google
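Once the nameserver changes have propagated, it is worth checking that the domain actually resolves to the server's public IP (example.com being a placeholder for your own domain):
# dig is provided by the bind-utils package on Fedora
sudo dnf install bind-utils
dig +short example.com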
Create a webserver with Nginx
NGINX is a high-performance HTTP server and reverse proxy, as well as an IMAP/POP3 proxy server. NGINX is known for its high performance, stability, rich feature set, simple configuration, and low resource consumption.
sudo dnf install nginx
# Start NGINX when system boots
sudo systemctl enable nginx
# Start NGINX
sudo systemctl start nginx
# Check NGINX Status
sudo systemctl status nginx
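A quick way to confirm Nginx is up and answering on port 80:
# Expect the headers of the Nginx default page in the response
curl -I http://localhost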
Create an SSL Certificate with Let's Encrypt
Install certbot, configure it, and then create a cron job that runs certbot renew daily. The renew command only replaces certificates that are close to expiry (Let's Encrypt certificates are valid for 90 days).
sudo dnf install certbot-nginx
crontab -e
0 12 * * * /usr/bin/certbot renew --quiet
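Before the renewal cron has anything to renew, the certificate itself has to be obtained. That amounts to a single invocation of certbot's Nginx plugin along these lines (example.com is a placeholder for your own domain); this step is also what adds the "managed by Certbot" lines you will see in the configs below:
# Obtain a certificate and wire it into the Nginx configuration
sudo certbot --nginx -d example.com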
Redirects
Configure server on port 80 to redirect all traffic to SSL:
server {
    listen 80 default_server;
    listen [::]:80 default_server;
    server_name *.example.com;
    root /usr/share/nginx/html;

    # Load configuration files for the default server block.
    include /etc/nginx/default.d/*.conf;

    if ($scheme != "https") {
        return 301 https://$host$request_uri;
    }
}
Configure the server on port 443:
server {
    listen 443 ssl http2 default_server;
    listen [::]:443 ssl http2 default_server;
    server_name *.example.com;
    root /usr/share/nginx/html;

    ssl_certificate /etc/letsencrypt/live/example.com/fullchain.pem; # managed by Certbot
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem; # managed by Certbot
    include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot

    # Load configuration files for the default server block.
    include /etc/nginx/default.d/*.conf;

    location / {
    }
}
Then add hosted applications underneath the last location block. In this case I am directing all incoming /solr traffic to localhost:8983 so that I can serve Solr over HTTPS.
# This is our Solr instance
# We will access it through SSL instead of using the port directly
location /solr {
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_pass "http://localhost:8983";
}

error_page 404 /404.html;
location = /40x.html {
}

error_page 500 502 503 504 /50x.html;
location = /50x.html {
}
Test this in a browser or by running a curl command. If connections time out, double-check the firewall rules, this time making sure that port 443 is open on both the host and the hosting provider, and that the firewall has been reloaded.
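For example, a match-all query through the proxy should now work over HTTPS without specifying the port (example.com and collection_name are placeholders for your own domain and core):
curl -s "https://example.com/solr/collection_name/select?q=*:*&rows=0"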
Dropping access to ports
I no longer need to expose Solr's port so I can drop it from the firewall.
sudo firewall-cmd --zone=public --permanent --remove-port=8983/tcp
sudo firewall-cmd --reload
sudo firewall-cmd --zone=public --list-ports
Consume the API
When ready to consume the API from a JavaScript application, it is highly likely that you will encounter a Cross-Origin Resource Sharing error when making calls to the remote server. The reverse proxy and the Solr application haven't been explicitly told to give you the resources you are requesting.
Cross-Origin Resource Sharing (CORS) is a mechanism that uses additional HTTP headers to tell a browser to let a web application running at one origin (domain) have permission to access selected resources from a server at a different origin. A web application makes a cross-origin HTTP request when it requests a resource that has a different origin (domain, protocol, and port) than its own origin.
The error goes along the lines of Access to XMLHttpRequest at 'https://example.com/solr/collection_name/select' from origin 'http://localhost:8081' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.
Reverse proxy
To enable CORS on the reverse proxy, we need to edit the /etc/nginx/nginx.conf file. In this example I configure Nginx to add CORS headers to the reverse proxied Solr API.
location /solr {
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_pass "http://localhost:8983";

    set $cors '';
    if ($http_origin ~ '^https?://(localhost|www\.curiousprogrammer\.io)') {
        set $cors 'true';
    }

    if ($cors = 'true') {
        add_header 'Access-Control-Allow-Origin' "$http_origin" always;
        add_header 'Access-Control-Allow-Credentials' 'true' always;
        add_header 'Access-Control-Allow-Methods' 'GET, POST, PUT, DELETE, OPTIONS' always;
        add_header 'Access-Control-Allow-Headers' 'Accept,Authorization,Cache-Control,Content-Type,DNT,If-Modified-Since,Keep-Alive,Origin,User-Agent,X-Requested-With' always;
        add_header 'Access-Control-Expose-Headers' 'Authorization' always;
    }

    if ($request_method = 'OPTIONS') {
        # Tell the client that this pre-flight info is valid for 20 days
        add_header 'Access-Control-Max-Age' 1728000;
        add_header 'Content-Type' 'text/plain charset=UTF-8';
        add_header 'Content-Length' 0;
        return 204;
    }
}
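To check that the headers are actually being added, you can simulate a browser pre-flight request with curl and look for the Access-Control-* headers in the response (example.com again being a placeholder):
curl -i -X OPTIONS \
  -H "Origin: http://localhost:8081" \
  -H "Access-Control-Request-Method: GET" \
  "https://example.com/solr/collection_name/select"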
Application layer
CORS can also be handled at the Solr application layer, since Solr ships with the Jetty servlet engine.
cd /opt/solr/server/solr-webapp/webapp/WEB-INF/lib
curl "http://search.maven.org/remotecontent?filepath=org/eclipse/jetty/jetty-servlets/8.1.14.v20131031/jetty-servlets-8.1.14.v20131031.jar" -o jetty-servlets-8.1.14.v20131031.jar
curl "http://search.maven.org/remotecontent?filepath=org/eclipse/jetty/jetty-util/8.1.14.v20131031/jetty-util-8.1.14.v20131031.jar" -o jetty-util-8.1.14.v20131031.jar
Edit server/solr-webapp/webapp/WEB-INF/web.xml and add the following filter definition:
<filter>
  <filter-name>cross-origin</filter-name>
  <filter-class>org.eclipse.jetty.servlets.CrossOriginFilter</filter-class>
  <init-param>
    <param-name>allowedOrigins</param-name>
    <param-value>http://localhost:8081,https://curiousprogrammer.io</param-value>
  </init-param>
  <init-param>
    <param-name>allowedMethods</param-name>
    <param-value>GET,POST,OPTIONS,DELETE,PUT,HEAD</param-value>
  </init-param>
  <init-param>
    <param-name>allowedHeaders</param-name>
    <param-value>origin, content-type, accept</param-value>
  </init-param>
</filter>

<filter-mapping>
  <filter-name>cross-origin</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>
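Jetty only picks up the new filter after a restart, so bounce the Solr service afterwards:
sudo service solr restart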
My final thoughts
This exercise took me a day and a half to figure out. It took me another day to make sense of it all to write about it. I made a lot of silly mistakes that could have been avoided had I just taken a few extra breaks to clear my head.
When something goes wrong and I am highly uncertain about the technology, domain or intricacies of the problem, it becomes extremely challenging to translate that into search terms. This is sometimes overwhelming. The best advice I can give is to break down the problem as much as possible and unpack logically how each component fits together and what could possibly be causing an issue. This could extract more useful search terms or actually help solve the problem.
If it isn't that simple and you have error messages, copy them verbatim and read and re-read the errors and solutions presented to you, even if some of them might not seem related. Something might stand out or spark a thought.
If you don't have error messages, spitball with terms you do know you are working with and problems you think you may be experiencing.
Most importantly, make notes of what you are doing. You never know when you might need them again. You could also jot it all down and blog about it to help yourself and others experiencing the same challenges, regardless of how large or small.
With regards to Solr, it's early days. Looking back at the actual steps I had to take to install it, the setup and installation is really simple. It was finding that information that was the tricky part for me because there is a lot of information out there.
The blog crawler was an interesting find and I am glad that it is in Python because it's a language I feel is worthwhile learning. The crawler does a dandy job and does exactly what I need it to do for now.
I enjoy working with Nginx as I am now accustomed to it and Let's Encrypt was fairly straightforward to configure once the domain name was correctly configured.
Next is actually configuring this solution into a React component on my Gatsby website. A tale for another day.
References
Solr
- Taking Solr to Production
- Create a Core
- File Handles and Processes (ulimit settings)
- Going Cross-Origin with Solr
Scraping
- Install Scrapy on Linux
- Scrapy installation guide
- Indexing websites in Solr with Python
- Crontab in Linux with 29 useful examples to schedule jobs
- StackOverflow: Match rest of string with regex
- Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript
Hosting
- Find a new free domain with freenom
- Ubuntu Linux Change Hostname (computer name)
- Nginx install
- Install Nginx Web Server on Fedora
- Getting started with Let's Encrypt
- Install Let's Encrypt using certbot with nginx on Fedora
- Cross-Origin Resource Sharing (CORS)
- Example Nginx configuration for adding cross-origin resource sharing (CORS) support to reverse proxied APIs