Web Server Log Analytics


Chapter 19

Normally we don't use web server log file data as the main data source for our analytics reports. But log file data can fill the gaps that web analytics tools leave.

What's in a Typical Web Server Log File?

The advantage of web server log file data is that it requires no tracking code to be installed in advance. As soon as your website's web server goes live, it starts recording data automatically.

  • When a user visits a page on your website, your web server logs one record (a line).
  • If the page the user visits contains an image, the request for that image is logged as another line.

In general, every file that a user's visit causes to load from your web server is recorded in the log file as its own line.

Below is a typical log file record in which a user (with IP address 192.168.22.10) visited your website's homepage ( / ) successfully (HTTP status 200). The referrer, i.e. the traffic source, is a search on www.google.com, and the user was on Firefox when visiting the page.

192.168.22.10 - - [21/Nov/2003:11:17:55 -0400] "GET / HTTP/1.1" 200 10801 "http://www.google.com/search?q=china+seo&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a" "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7"
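
The fields described above can be pulled out of a record with a short script. The following is a minimal sketch in Python, assuming the log uses the common "combined" layout shown in the record above. The names LOG_PATTERN and parse_line are made up for this illustration, and a real log may use a different layout that needs a different pattern.

```python
import re
from datetime import datetime

# Pattern for the "combined" layout of the sample record (an assumption;
# adjust it if your web server is configured with another log format).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    # Month abbreviations such as "Nov" are parsed with the default (English) locale.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


sample = (
    '192.168.22.10 - - [21/Nov/2003:11:17:55 -0400] "GET / HTTP/1.1" 200 10801 '
    '"http://www.google.com/search?q=china+seo&ie=utf-8&oe=utf-8&aq=t'
    '&rls=org.mozilla:en-US:official&client=firefox-a" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) '
    'Gecko/20070914 Firefox/2.0.0.7"'
)

record = parse_line(sample)
print(record["ip"], record["path"], record["status"])
print(record["referrer"])
print(record["user_agent"])
```

Lines that do not fit the pattern return None, so they can be counted or skipped instead of stopping the whole run.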

Issues of Web Server Log Analytics

Log file data has disadvantages.

A full analytics reporting system cannot be built only from the data collected through web server log files. Most websites nowadays include JavaScript whose main purpose is to let users interact with the web pages. Log files cannot record those JavaScript interactions unless an interaction sends a new request to the web server. As a result, log file analytics misses a large amount of detailed user interaction (or behavior) data. Note that most typical web analytics tools are able to track JavaScript interactions.

When your website has static file caching enabled (as most websites do nowadays), “returning” users get cacheable files from a cache instead of from your web server. Image files, CSS files and JavaScript files are all suitable to be cached. A request that is answered from a cache never reaches your web server, so it is not recorded in the log file.

A website with 100,000 daily sessions may generate a web log file that easily exceeds 30 gigabytes of unprocessed raw data per day. That adds up to almost 1 terabyte of raw data per month (or about 12 terabytes per year). Processing raw data of that size into human-readable reports every day can be a difficult and time-consuming task. Storing the raw data (and the processed data) also takes up a large amount of storage resources (i.e. hard disks).
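
The storage estimate can be checked with a few lines of arithmetic. This is only a sketch: it assumes the 30 gigabytes figure is a daily volume, a 30-day month, and decimal units (1 terabyte = 1,000 gigabytes). The variable names are made up for this illustration.

```python
# Assumption: the 30 GB figure quoted in the text is the raw log volume per day.
GB_PER_DAY = 30
GB_PER_TB = 1000  # decimal units, an assumption

gb_per_month = GB_PER_DAY * 30            # 30-day month
tb_per_month = gb_per_month / GB_PER_TB
tb_per_year = GB_PER_DAY * 365 / GB_PER_TB

print(f"per month: {gb_per_month} GB ({tb_per_month:.2f} TB)")
print(f"per year: {tb_per_year:.2f} TB")
```

At exactly 30 gigabytes per day this gives 900 gigabytes per month and just under 11 terabytes per year, which is in line with the rounded figures in the text for a volume of "more than" 30 gigabytes.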

Search Engine Spider Data in Web Server Log Files

One major advantage of web server log analytics is that search engine spider visits are recorded in log files. This is data that typical web analytics tools aren't able to collect.

Below is a typical log file record of a search engine spider (e.g. Googlebot) visiting a page on your website (/a.html).

66.250.65.101 - - [21/Nov/2003:04:54:20 -0400] "GET /a.html HTTP/1.1" 200 11179 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

This part of the line reveals the visit was from Googlebot:

compatible; Googlebot/2.1; +http://www.google.com/bot.html
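
Picking out spider visits can be automated by checking the user agent field of each record. The sketch below is written in Python and assumes the same log layout as the records above. The token list SPIDER_TOKENS and the function names are made up for this illustration. Keep in mind that a user agent can be faked, so a match is a strong hint rather than proof that the visit came from the real spider.

```python
import re
from collections import Counter

# Hypothetical list of spider tokens to look for; extend it as needed.
SPIDER_TOKENS = ("Googlebot", "bingbot", "Baiduspider")

# The user agent is the last double-quoted field of the record.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')
# The requested path sits inside the quoted request field.
REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST) (\S+) [^"]*"')


def spider_name(line):
    """Return the matching spider token, or None for a non-spider record."""
    match = USER_AGENT_RE.search(line)
    if match is None:
        return None
    user_agent = match.group(1).lower()
    for token in SPIDER_TOKENS:
        if token.lower() in user_agent:
            return token
    return None


def count_spider_hits(lines):
    """Count records per (spider, path) pair."""
    hits = Counter()
    for line in lines:
        name = spider_name(line)
        request = REQUEST_RE.search(line)
        if name is not None and request is not None:
            hits[(name, request.group(1))] += 1
    return hits


lines = [
    '66.250.65.101 - - [21/Nov/2003:04:54:20 -0400] "GET /a.html HTTP/1.1" '
    '200 11179 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)"',
    '192.168.22.10 - - [21/Nov/2003:11:17:55 -0400] "GET / HTTP/1.1" '
    '200 10801 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; '
    'rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7"',
]

hits = count_spider_hits(lines)
for (name, path), count in hits.items():
    print(name, path, count)
```

Counting hits per page in this way shows which pages a spider has crawled and which pages it has not visited at all.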

What We Can Do with Search Engine Spider Data

When dealing with organic search, the traffic funnel is:

Crawl -> Index -> Ranking -> Traffic

Before a search engine can index and rank your web pages, its spider must first crawl them.

Log File Data Reveals Website's Issues

In the log files, every record shows an HTTP status code, whether it is a record of a user's visit or of a search engine spider's visit. Below are some of the most frequently seen HTTP status codes.

  • 200 – OK
  • 301 – Permanently moved
  • 302 – Temporarily moved
  • 404 – Not found
  • 500 – Internal server error
  • 503 – Service Unavailable

In the log file, records that return HTTP status code 200, 301 or 302 show no issues. Records that return 404, 500 or 503 point to potential issues that require attention.
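
This check can be scripted by tallying the status code of every record and setting aside the ones that need attention. The following Python sketch assumes the same log layout as the records above. PROBLEM_CODES, summarize_statuses and the two records with the 203.0.113.7 address are made up for this illustration and are not real log data.

```python
import re
from collections import Counter

# The status code follows the quoted request field.
STATUS_RE = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) ')

# Status codes that the text above flags as potential issues.
PROBLEM_CODES = {404, 500, 503}


def summarize_statuses(lines):
    """Return (count per status code, list of records needing attention)."""
    counts = Counter()
    problems = []
    for line in lines:
        match = STATUS_RE.match(line)
        if match is None:
            continue  # skip records that do not fit the expected layout
        status = int(match.group(1))
        counts[status] += 1
        if status in PROBLEM_CODES:
            problems.append(line)
    return counts, problems


lines = [
    '66.250.65.101 - - [21/Nov/2003:04:54:20 -0400] "GET /a.html HTTP/1.1" '
    '200 11179 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)"',
    # Hypothetical records, made up to show problem codes:
    '203.0.113.7 - - [21/Nov/2003:11:20:01 -0400] "GET /missing.html HTTP/1.1" '
    '404 512 "-" "ExampleBrowser/1.0"',
    '203.0.113.7 - - [21/Nov/2003:11:21:09 -0400] "GET /report HTTP/1.1" '
    '503 0 "-" "ExampleBrowser/1.0"',
]

counts, problems = summarize_statuses(lines)
for status in sorted(counts):
    print(status, counts[status])
print(len(problems), "records need attention")
```

Running the same tally only on spider records shows which pages return errors to search engines, which matters for the crawl step of the funnel.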



Gordon Choi's Analytics Book has been available since August 2016 and is proudly powered by Folks Analytics.


Content on Gordon Choi's Analytics Book is licensed under the CC Attribution-Noncommercial 4.0 International license.