How to set up Semantic Logging: part one with Logstash, Kibana, ElasticSearch and Puppet,

Logging today is mostly done in far too unstructured a way: each application developer has their own syntax for the logs, optimized for their personal requirements. When it is time to deploy, ops consider themselves lucky if there is any logging in the application at all, and luckier still if that logging can be used to find problems as they occur, with verbosity that can be adjusted where needed.

I’ve come to the point where I want a really awesome piece of logging from the get-go – something I can pick up and install in a couple of minutes when I come to a new customer site without proper operations support.

I want to be able to search, drill down into and filter out patterns, and have good tooling that lets logging be an obvious support as the application is brought through its life cycle, from development to production. And I don’t want to write my own log parsers, thank you very much!

That’s where semantic logging comes in – my applications should broadcast log data in a manner that allows code to route, filter and index it. That’s why I’ve spent a lot of time researching how to do logging in a bloody good manner – this post and upcoming ones will teach you how to make your logs talk!
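To make "log data that code can route, filter and index" concrete, here is a minimal sketch in Python using only the standard library. It is not the code from the sample project; the field names (timestamp, level, fields, tags) and the helper make_logger are assumptions chosen for illustration.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        event = {
            # Field names are illustrative assumptions, not a fixed schema.
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Machine-readable extras supplied by the caller via `extra=`.
            "fields": getattr(record, "fields", {}),
            "tags": getattr(record, "tags", []),
        }
        return json.dumps(event, sort_keys=True)


def make_logger(name, stream):
    """Create a logger (hypothetical helper) that writes JSON events to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = make_logger("orders", sys.stdout)
    # The human-readable message and the structured data travel together.
    log.info("order placed", extra={"fields": {"order_id": 42}, "tags": ["checkout"]})
```

Because every event is a JSON object, a downstream tool can filter on tags or index fields without a hand-written parser.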

It’s worth noting that you can read this post no matter your programming language. In fact, the tooling that I’m about to discuss spans multiple operating systems (Linux and Windows) and multiple programming languages: Erlang, Java, Puppet, Ruby, PHP, JavaScript and C#. I will demo logging from C#/Win initially and continue with Python, Haskell and Scala in upcoming posts.

Here’s the outline of this post.

Virtually Awesome

The first thing you have to do is to install Vagrant and VirtualBox. You may install them using the default settings.

Vagrant says it’s used to “create and configure lightweight, reproducible, and portable development environments”. Vagrant means hobo, or someone living without a home. In this context, Vagrant is an excellent tool for packaging your development environment in such a way that you can set it up repeatedly. Imagine how much time you currently spend on bringing a developer’s computer up to speed.

Vagrant orchestrates VirtualBox through its executable ‘VBoxManage’. It is also responsible for tunnelling the output from inside the VM into the console.

Further, the tool uses server provisioning tools that can be equally well used in production as in a test environment, meaning that if you invest time in setting up your development box well, you’ll gain some of that time back when deploying with puppet or chef.

In this post I’m going to use puppet, as I prefer its state-based description language over the imperative, ‘recipe’-like approach of chef.

VirtualBox is a free virtualization product, owned by Oracle, with a corresponding command line tool that vagrant uses to create and destroy the virtual machines.

CAKE NOW, NOT LATER!!

You are now ready to initialize vagrant and have a look at the finished result (before you go through the steps there).

The following bash script will download the git repository corresponding to this blog entry, run vagrant in that directory (while waiting for vagrant to finish) and then start a browser to see the kibana interface. Remember to shut down any RabbitMQ, elasticsearch, logstash or kibana instance, and anything else listening on port 8080, that you may have running locally, and answer yes to the admin prompts.

You’ll be setting up a RabbitMQ broker (http://127.0.0.1:55672) as messaging fabric. You can log into the admin panel using guest/guest. (On Windows without ‘git bash’? Why?? In that case you’ll have to manually download the files – look in each of the download.sh files in modules ‘elasticsearch’ and ‘logstash’.)

The RabbitMQ Management web interface

ElasticSearch (http://127.0.0.1:9200) as a search engine:

LogStash (http://127.0.0.1:9292) for routing the logs:

and finally Kibana (http://127.0.0.1:8080) for watching the logs:

Kibana Screenshot

The Kibana interface is completely devoid of data, though, since everything was just created.

Let’s add some sample data!

Run

or


to insert some random data. Let the process work for as long as you’re interested in investigating the virtual machine.

This is what a drill down on tag looks like in Kibana with the sample program, ‘insert-data.exe’. The data comes from the wonderful Hipster Lipsum site – play around some with the Kibana GUI to explore the random data!

What makes all of this tick? Let’s have a look at the projects in use.

Puppet

“Puppet manages your servers: you describe machine configurations in an easy-to-read declarative language, and Puppet will bring your systems into the desired state and keep them there.”

In this guide, puppet-code is what you will be writing to express where you want resources to end up and what services to start.

This is very useful for deploying to computer clusters and for making deployments repeatable.

ElasticSearch

ElasticSearch solves the problem of distributed, available, RESTful searching. Logstash uses one ES index per day, meaning that previous days’ logs can be heavily optimized. Besides putting your logs in ElasticSearch you can have your application data indexed, or you can simply use it as a key-value store!

Rebalancing of shards and data is fully done in the background.
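The per-day partitioning can be sketched as follows. This is a minimal illustration, assuming the common naming scheme logstash-YYYY.MM.DD and that the day is derived from the event's UTC timestamp; both are assumptions you should check against your own logstash output configuration. The function names are hypothetical.

```python
from datetime import date, datetime, timezone


def daily_index_name(day, prefix="logstash"):
    """Return the per-day index name for `day`, e.g. 'logstash-2012.07.15'."""
    return f"{prefix}-{day:%Y.%m.%d}"


def index_for_event(timestamp, prefix="logstash"):
    """Pick the index for an event from its timezone-aware timestamp.

    Assumption: the day boundary is taken in UTC.
    """
    return daily_index_name(timestamp.astimezone(timezone.utc).date(), prefix)


if __name__ == "__main__":
    print(daily_index_name(date(2012, 7, 15)))
    print(index_for_event(datetime(2012, 7, 15, 23, 30, tzinfo=timezone.utc)))
```

Since yesterday's index no longer receives writes, it can be optimized or dropped as a whole without touching today's data.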

Logstash

“logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use (like, for searching).”

Another word is ‘log router’. You can choose from a lot of different inputs, filters and outputs. In this post we’re using AMQP, the protocol of ActiveMQ, RabbitMQ, StormMQ, Azure-Service-Bus, QPID and others – a well tested, stable and well performing protocol. A message broker simplifies discovery, because it’s a known endpoint that you can send messages to without knowing the recipients.
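The point about a broker being a known endpoint can be illustrated with a toy, in-memory exchange. This is only a sketch of the idea: ToyExchange is a hypothetical class, not RabbitMQ's or any AMQP client's API, and it models nothing beyond exact-match routing keys.

```python
from collections import defaultdict


class ToyExchange:
    """An in-memory stand-in for a direct exchange (illustration only)."""

    def __init__(self):
        # routing key -> list of consumer callables
        self._bindings = defaultdict(list)

    def bind(self, routing_key, consumer):
        """Register a consumer callable for one routing key."""
        self._bindings[routing_key].append(consumer)

    def publish(self, routing_key, message):
        """Deliver to every bound consumer; return how many received it."""
        consumers = self._bindings.get(routing_key, [])
        for consumer in consumers:
            consumer(message)
        return len(consumers)


if __name__ == "__main__":
    exchange = ToyExchange()
    indexed, alerts = [], []
    # Two independent recipients; the publisher knows neither of them.
    exchange.bind("logs.error", indexed.append)
    exchange.bind("logs.error", alerts.append)
    exchange.publish("logs.error", {"message": "disk full"})
    print(indexed, alerts)
```

The publisher only ever talks to the exchange, so recipients can be added or removed without changing the application that logs.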

If you are running a business dealing with runtime data, such as log analysis as a service, performance testing as a service or perhaps even security auditing as a service, you can create an output which collects data from the system under test!

RabbitMQ

My poison of choice for AMQP is RabbitMQ because of its flexibility, ability to do HA-clustering, ability to lie behind a load balancer like HA-Proxy and my previous good experiences with it. It also has excellent performance!

RabbitMQ runs on Erlang/OTP, an actor-oriented framework for distributed programming, written in the Erlang language.

In the sample project accompanying this post, RabbitMQ is running together with RabbitMQ management – a realtime-updated dashboard for what’s going on inside of rabbit, written in Erlang with mochiweb.

Kibana

Kibana is a JavaScript-heavy PHP application that queries the REST interface of ElasticSearch. Then it allows you to plot, search and drill down into the data. While Logstash has its own GUI, Kibana is a single-purpose GUI, which does one thing only, and does it well, in true Open Source spirit.

Linux in VirtualBox

You can also browse the virtual machine using ssh. Ensure that you have git bash installed. You should now be able to run vagrant ssh.

Version 1.0.3 of vagrant unfortunately doesn’t check for ssh properly in Windows, so refer to my fix in this gist:

Now try it again. You’ll be able to explore this virtual machine as you try your puppet skills out in the next section.

Now that you have an idea where you’re heading, let’s have a look at the puppet code!

Three

Wiring it together

A project like this is started by running vagrant init in the root folder. This places a file called Vagrantfile in that folder with contents looking like this:

Vagrant uses boxes, which are bare-bones virtual machines with puppet and a few other things installed. They are meant to be portable, and as you start using vagrant more in your organization you’ll find that you start creating your own boxes with a tool called veewee. You can create both Linux and Windows boxes; Linux uses ssh for communication and Windows uses WinRM. (See my previous blog entries for more on WinRM.)

Setting up Vagrant

You’ll be changing the Vagrantfile to create a virtual machine named ‘semantic-logging’:

Let’s go through this snippet piece by piece.

This tells vim to use the Ruby language colorization scheme.

When the VM is initialized, we want its hostname to be semantic-logging, so that we can refer to it from puppet.

Vagrant needs a box to use, and if you don't have that box already installed on the system, vagrant will fetch it for you, from the address provided.

This forwards the ports to the local computer, so that you can connect to them as if the services were running locally. VirtualBox uses NAT to map the ports.

Since there's an archive in the ./files folder, and using /vagrant/files is an anti-pattern (it is a search path that cannot be reused in non-vagrant VMs), puppet needs a way to find where it keeps its files. This goes together with the option --fileserverconfig=/vagrant/fileserver.conf, and the file fileserver.conf, which contains this:

It tells puppet that the files directory is in /etc/puppet/files and that any requestor is allowed.

Using a ruby block, the parameter vm is an object that you can use to configure the virtual machine before it has started. In this case there's a need for more memory considering the numerous programs being run simultaneously.

Specifies that puppet should use the semantic-logging.pp file to set up the computer. This is the center of your software configuration for the virtual machine.

Writing the boilerplate puppet code

Looking in ./modules, you'll find the different products. Each module encapsulates a reusable puppet configuration. When you ran ./download.sh, ElasticSearch and Logstash binaries were downloaded to the respective ./modules/{logstash,elasticsearch}/files folders.

First though, there's a snippet in manifests/apt-get-update.pp that brings the virtual machine's software up to date before anything else is done (the before => Stage[main] part).

Once you have this, you can create the semantic-logging.pp file. This is what it will look like in the end:

The squiggly arrows specify that the left hand side should occur before the right hand side, and that the right hand side should be notified afterwards.

This specifies that the computer with hostname semantic-logging should have the following declaration.

By adding classes to a node declaration you add functionality. Discrete software or features are broken out as classes. In this case it's a parameterized class, which means that you can pass parameters to it. stage is in fact a built-in parameter (a metaparameter), so you won't find it in the class definition. It signifies that the apt-get update logic should be applied during the pre stage, which was declared to come before the main stage in the apt-get-update.pp file.

You can test your manifest so far by running vagrant provision. You may notice that a puppet apply is made in the virtual machine, applying the node declaration, and if this is the first time you run it, apt-get update and apt-get upgrade -y are run.

The modules are all puppet modules and are fetched as git submodules to your computer. They are fairly stand-alone and can be combined in other ways than what the 'semantic-logging' declaration does in this case. When you write modules, you should aim to parametrize the 'moving bits', either by class parameters or by using templates and overriding which templates are selected through parameters.

Writing the RabbitMQ code

In this code no parametrizations have been made to the classes, thereby using their respective defaults.

include rabbitmq is a way to "drop" all classes and manifest definitions of the rabbitmq-module into the current scope. In this case, it declares the Service and File types required to get RabbitMQ installed.

The second and third statements state that there should be a root exchange and that the management plugin should be installed.

Some code from within the RabbitMQ module:

The 'package' declaration is named $rabbitmq::package, as declared in the 'rabbitmq' class, and is ensured to be installed. Similarly, the corresponding rabbitmq service is declared to be running and to depend on the existence of the package.

Again, running vagrant provision will cause RabbitMQ to be downloaded and installed.

Writing the ElasticSearch code

This declaration is similar to the previous one, except that it changes some of the class parameters; most notably the version parameter, which is in turn used inside the module to find the correct archive.

Writing the logstash code

Logstash is an executable jar file that is run by the JVM. It doesn't fork child processes. I wrote this module as a part of writing this post. From a consuming perspective, all you need to place in your manifest is this:

What follows is a short discussion on how the logstash module is built.

To see the basics of configuring logstash, look at its documentation. For now, it's enough that you know that logstash is started like this: /usr/bin/java -jar /opt/logstash-1.1.1-monolithic.jar agent -f /etc/logstash.conf --log /var/log/logstash.log -- web.

In order to make logstash start every time the operating system is started and then restart if the process fails for some reason, I recommend using upstart. It provides a signal-driven domain specific language for keeping track of jobs and services, and is a re-imagining from scratch of how the SysV/init system should work in the age of increasingly 'agile' computers (a mainframe was started and then not restarted for a long time as opposed to the computers of today which go in and out of networks, change peripherals and connectivity state much more).

Since the upstart files are very easy to read, I'm only going to point a few things out. First, what follows start on, in parentheses: net-device-up and local-filesystems and runlevel [2345] are signals. When all these signals have been fired, the logstash job starts.

respawn specifies that if the process exits for any reason, it should be restarted.

exec /usr/bin/java -jar /opt/logstash-1.1.1-monolithic.jar agent -f /etc/logstash.conf --log /var/log/logstash.log -- web is the meat of the upstart file; it specifies how the process is started.

respawn, exec, start, etc. are called stanzas. An upstart file must at least have an exec-stanza.

This file is then placed in /etc/init using puppet declarations:

Finally, it is specified to be started (also inside the module):

Launching of the jar takes a while, so you might have to try a few times to access http://127.0.0.1:9292.
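Instead of reloading the page by hand, you can poll until the web interface answers. The following is a small sketch using only Python's standard library; wait_for_http is a hypothetical helper, and the timeout and interval values are arbitrary assumptions. It makes a real HTTP request rather than just opening a TCP connection, because a forwarded port on the host may accept connections before the service inside the VM is actually listening.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout=120.0, interval=2.0):
    """Poll `url` until a server answers; True on success, False on timeout."""
    # Bypass any configured HTTP proxy, since the target is on localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while True:
        try:
            with opener.open(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered, just not with a success status; it is up.
            return True
        except (OSError, http.client.HTTPException):
            # Refused, reset, timed out or garbled response: not ready yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Port 9292 is the logstash web interface forwarded by the Vagrantfile.
    ready = wait_for_http("http://127.0.0.1:9292/")
    print("logstash web is up" if ready else "gave up waiting")
```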

Writing the Kibana code

I have intentionally not created a module for Kibana, because we'll walk through how to do that shortly. From the consuming manifest, it's as simple as adding a Kibana class:

Now, Kibana is a PHP web site, but does a lot using JavaScript and the ElasticSearch API, so what you need is a way to host a PHP site. Thankfully, there's a forked nginx module on my github that does the heavy lifting, so what is left is to make sure that the site is un-archived and that php5-fpm and php5-curl are properly installed.

Also, it's a best practice to run web sites as their own users, so while the code is prepared to set it up, it's not fully configured; because this would involve changing the php-fpm configuration file, and I want to save Augeas and configuration handling for a future post in this series. For production though, have a look at what others have written.

Let's go through it.

file { "$kpath": takes the Kibana tar from the puppet files folder and places it in /tmp/kibana.tar.gz.

file { '/var/www': ensure => 'directory' prepares the www-directory which exec will untar into.

exec { "$kpath": Specify how and when to run a command, named after the kibana directory in www, that it creates.

command => "tar -zxvf $kpath -C /var/www && mv `ls /var/www | head -n 1` $wwwpath", this command takes the path of the tarball (the f in -zxvf) and the target path (-C), unpacks the tarball, and then moves whatever directory came out of it (which is now a directory in /var/www, e.g. /var/www/rashidkpc-Kibana-v0.1.5-44-gf5c80c3) to a 'well known' kibana directory.

unless => "find $wwwpath", means that if the /var/www/kibana directory already exists, then don't run the Exec resource.

Before running the command, make sure that both the www-directory and the temporary file are where they need to be.
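The logic of this Exec resource (unpack, rename to a well-known path, and skip everything if the result already exists) can be sketched in Python. This is an illustration of the idempotence idea only, not a replacement for the puppet code; ensure_unpacked is a hypothetical helper, and it assumes a trusted archive with a single top-level directory.

```python
import os
import tarfile


def ensure_unpacked(tarball, www_dir, target):
    """Unpack `tarball` into `www_dir` and rename its top-level directory to
    `target`, unless `target` already exists. Returns True if work was done."""
    if os.path.exists(target):
        # Same role as the `unless` parameter: already in the desired state.
        return False
    # Same role as the file resource that prepares the www directory.
    os.makedirs(www_dir, exist_ok=True)
    with tarfile.open(tarball, "r:gz") as archive:
        # Assumption: the archive contains exactly one top-level directory.
        top_level = archive.getnames()[0].split("/")[0]
        archive.extractall(www_dir)
    os.rename(os.path.join(www_dir, top_level), target)
    return True


if __name__ == "__main__":
    # Paths mirror the ones discussed above; adjust them to your own setup.
    changed = ensure_unpacked("/tmp/kibana.tar.gz", "/var/www", "/var/www/kibana")
    print("unpacked" if changed else "already in place")
```

Running it a second time does nothing, which is exactly the property that lets vagrant provision be repeated safely.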

After the Exec resource, there are some package resources, which are self-explanatory. service { 'php5-fpm': specifies that the php service should be running; however, before it gets requests there must be a web server/proxy forwarding the requests:

nginx::fcgi::site { 'nginx-kibana-enable':

With this manifest, you can now make requests to port 80 (inside the VM), or to your own OS to port 8080. Kibana has been installed.

Four

Logging!

With all of the infrastructure set up, you can embark on your journey to actually logging everything with LogStash. Follow along for some sample code.

Create a new Console Application and do Install-Package NLog.RabbitMQ in that project. This installs a custom NLog target that I have written, together with the files necessary for communicating with RabbitMQ.

Screenshot of VS2010 installing the package

Now, add NLog.config to the project as a Content item, mark it as 'Copy if newer' and insert the configuration.

Then you can write some code that does logging:

Try it out, and watch the results unfurl in your web GUIs!

The next posts will include a refactor of the kibana puppet module, some clustering for high availability of RabbitMQ and ElasticSearch, and furthering the integration of LogStash with the virtual machines.

Till then.

9 Responses to “How to set up Semantic Logging: part one with Logstash, Kibana, ElasticSearch and Puppet,”

  1. Stas says:

    Hi! Nice post.
    But seems references broken:
    fatal: reference is not a tree: 565159b62e13f7ff1114b6417ca73572ee5bc012
    Unable to checkout '6d88a547b9d327c20dde298f29cbbb3e7c80274e' in submodule path
    'modules/elasticsearch'
    Unable to checkout '059cb80977e8514b9de7d981f777c672ed2e8388' in submodule path
    'modules/logstash'
    Unable to checkout '8361272578310563bcc534ea80840b278f6dcdf3' in submodule path
    'modules/nginx'
    Unable to checkout '565159b62e13f7ff1114b6417ca73572ee5bc012' in submodule path
    'modules/rabbitmq'

  2. TonyViet says:

    Thanks for taking the time to discuss this, I feel strongly about it and love learning more on this topic. If possible, as you gain expertise, would you mind updating your blog with more information? It is extremely helpful for me…thanks

  3. Brian says:

    Excellent article! I’ve just gotten started with Logstash and I’m just browsing around for different implementation schemes on google — looking forward to reading part 2 !

  4. csebille says:

    Hi,

    While retrieving submodules via git (git submodule update --init), I have this error:

    fatal: reference is not a tree: 565159b62e13f7ff1114b6417ca73572ee5bc012
    Unable to checkout '565159b62e13f7ff1114b6417ca73572ee5bc012' in submodule path 'modules/rabbitmq'

    It seems that the referenced commit id is not available anymore (404 when clicking on it on github).

    Is there a way to pass this error ?

    Thanks in advance

    Christophe

  5. [...] How to set up Semantic Logging: part one with Logstash, Kibana, ElasticSearch and Puppet, by Henrik Feldt. [...]

  8. Thomas says:

    Excellent post, thank you.
    Is there a reason why you choose to run RabbitMQ instead of redis, which seems to be the preferred way nowadays?

    Regards,
    Thomas

    • Henrik Feldt says:

      Hi Thomas. For a couple of reasons: 1) it seems that AMQP has better bindings in disparate environments; 2) AMQP has been verified to work across WAN, as opposed to Redis, so should I ever wish to ship logs across WAN (e.g. when doing load testing in AWS and analysing locally), Redis doesn’t have a good story for it; 3) Redis doesn’t support SSL; 4) I do a lot of messaging otherwise, with MassTransit, and that works on top of RabbitMQ, so I will already have it set up and pushing messages; and 5) with HA-queues and at least two disk-nodes, RabbitMQ can go on without experiencing failure, while Redis has master-slave replication only, so a failed master fails the cluster.

      Regards,
      Henrik
