
Website Analytics

I have written about website analytics for this blog before:

  1. Statistics of this Blog: Crossed 110.000 Views
  2. Statistics of this Blog: Crossed 120.000 Views
  3. URL Count Statistics

I do not use cookies or any JavaScript libraries to track users. Instead, I only analyze the web-server logs using a Perl script. To double-check the accuracy of this Perl script I occasionally inserted analytics code from:

  1. statcounter
  2. clicky
  3. Google Analytics
  4. Cloudflare Analytics
  5. Heap Analytics
  6. Cisco Smartlook

These analytics providers used cookies and JavaScript. Therefore my site was not cookie-free when I employed them.

Nowadays I no longer employ those bulky JavaScript libraries. Instead I resort to the web-server logfile alone. This approach has a number of advantages and disadvantages.

Advantages:

  1. The user does not have to be concerned about cookies
  2. The user does not need to download bulky JavaScript libraries
  3. The user does not need to make yet another connection to any other server

Disadvantages:

  1. The detection of bots and nonsense accesses is a little more cumbersome; there seems to be no ready-to-use software to filter out all the bots, so I had to write it myself, see accesslogFilter
  2. There seems to be no off-the-shelf software to fully analyze web logs and generate diagrams, so I had to write that myself as well, see blogurlcnt; a minimal sketch of the counting idea follows below
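
To make the idea more concrete, here is a minimal Perl sketch of the counting part, assuming an Apache combined-format log. It is only an illustration of the principle, not the real blogurlcnt, which additionally handles bots, time granularities and the HTML/chart output:

#!/usr/bin/perl
# Minimal sketch: count successful GET requests per URL and per month
# in an Apache combined-format access log. Illustration only.
use strict;
use warnings;

my %mon = (Jan=>'01', Feb=>'02', Mar=>'03', Apr=>'04', May=>'05', Jun=>'06',
           Jul=>'07', Aug=>'08', Sep=>'09', Oct=>'10', Nov=>'11', Dec=>'12');

my %cnt;    # $cnt{$url}{"$year-$month"} = number of views

while (<>) {
    # e.g.: 203.0.113.7 - - [02/Jul/2024:10:11:12 +0200] "GET /blog/x/ HTTP/1.1" 200 1234 "-" "Mozilla/5.0"
    next unless m{\[\d+/(\w{3})/(\d{4}):[^\]]*\]\s"GET\s(\S+)\sHTTP/[0-9.]+"\s200\s};
    $cnt{$3}{"$2-$mon{$1}"}++;
}

# Tab-separated summary: total views per URL, most viewed first
sub total { my $s = 0; $s += $_ for values %{ $cnt{$_[0]} }; return $s }
for my $url (sort { total($b) <=> total($a) } keys %cnt) {
    printf "%6d\t%s\n", total($url), $url;
}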

Below are the statistics for the year 2024. Some key figures, based on the filtered data:

  1. Ca. 18,000 "real" accesses, as can be seen by looking at pagefind-ui.js: someone who does not load this JavaScript is in many cases not a real reader; a one-liner for this check is shown after this list
  2. On average there are ca. 1,000 monthly accesses to "real" posts
  3. The most intensive access comes from HetrixTools
  4. The best post was Hosting Static Content with GitLab, which had more than 4,500 views in July 2024 and more than 1,300 views in August
  5. The second best was Performance Comparison C vs. Java vs. Javascript vs. LuaJIT vs. PyPy vs. PHP vs. Python vs. Perl with 100-200 views per month
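
The pagefind-ui.js check from point 1 can be reproduced with a simple Perl one-liner on the filtered log (the file name below is just a placeholder):

perl -lne '$c++ if /pagefind-ui\.js/; END { print $c + 0 }' filtered-access.log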

1. URL statistics

Below is the output of the Perl script blogurlcnt.

The generated output uses the JavaScript library DataTables, which allows easy filtering and sorting within the table.

Combined with histograms from Apache ECharts, this gives a visual representation of how the various URLs develop over time.
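
To illustrate the DataTables part: in principle the generating script only has to emit an ordinary HTML table plus a small initialization call, roughly as in the Perl sketch below. The CDN links, version numbers and table rows are placeholder assumptions and not what blogurlcnt actually emits:

#!/usr/bin/perl
# Sketch: emit an HTML table that DataTables turns into a sortable and
# filterable table on the client side. Data and CDN links are placeholders.
use strict;
use warnings;

my @rows = (['/blog/example-post-1/', 4500],
            ['/blog/example-post-2/',  200]);

print <<'HEAD';
<!DOCTYPE html>
<html><head>
<link rel="stylesheet" href="https://cdn.datatables.net/1.13.8/css/jquery.dataTables.min.css">
<script src="https://code.jquery.com/jquery-3.7.1.min.js"></script>
<script src="https://cdn.datatables.net/1.13.8/js/jquery.dataTables.min.js"></script>
</head><body>
<table id="urlstat"><thead><tr><th>URL</th><th>Views</th></tr></thead><tbody>
HEAD

printf "<tr><td>%s</td><td>%d</td></tr>\n", @$_ for @rows;

print <<'TAIL';
</tbody></table>
<script>$('#urlstat').DataTable();</script>
</body></html>
TAIL

An ECharts histogram can be embedded analogously by emitting a div element together with an echarts.init() and setOption() call for the per-month view counts.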

[Interactive DataTables table and ECharts charts, switchable between Year, Month, and Week views]

The above table and charts were generated with the command below:

time blogconcatlog 67 | pv | tee /tmp/a2 | accesslogFilter -o >(tee /tmp/a1.stat | blogstatcnt -m3000 > /srv/http/statcnt.html) | tee /tmp/a1 | blogurlcnt -m470 > /srv/http/urlstat2-m100.html

I.e., the "uncount" statistics are dropped if they count fewer than 3,000 entries. Likewise, the URL statistics and charts are only shown for entries that have more than 470 views.

2. Uncount statistics

The "uncount" statistics are called "uncounted" because these numbers were not counted in the above statistics, i.e., they were filtered out. This filtering was done using accesslogFilter.

A key issue when analyzing the raw web-server log is filtering out all irrelevant entries; a simplified sketch of such a filter is shown after the following list. Examples of irrelevant entries are:

  1. My own access to my blog on my own local network, i.e., on 127.0.0.1 or localhost
  2. Various bots from Google, Ahrefs, Yahoo, etc.
  3. HTTP codes 404 (not found), etc.
  4. IP address ranges from webhosting platforms or known AI platforms
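
As an illustration of the principle (not the actual accesslogFilter, whose rule set is much larger), such a filter can be sketched in a few lines of Perl; the bot names and the IP range below are only examples:

#!/usr/bin/perl
# Heavily simplified sketch of an access-log filter: print lines that look
# like real readers, drop local accesses, error codes and obvious bots.
use strict;
use warnings;

while (my $line = <>) {
    my ($ip)     = $line =~ /^(\S+)/;              # first field: client IP
    my ($status) = $line =~ /"\s(\d{3})\s/;        # status code after the request string
    my ($agent)  = $line =~ /"([^"]*)"\s*$/;       # last quoted field: user agent

    next unless defined $ip && defined $status;
    next if $ip eq '127.0.0.1' || $ip eq '::1';    # my own local accesses
    next if $ip =~ /^162\.55\./;                   # example range: Hetzner's your-server.de
    next if $status =~ /^[45]/;                    # 404 (not found) and other error codes
    next if defined $agent
         && $agent =~ /bot|crawler|spider|ahrefs|yandex/i;   # obvious bots by user agent

    print $line;                                   # looks like a real reader
}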

The table below shows which filters are most effective. Some noteworthy highlights:

  1. There are more than nine times as many irrelevant accesses as good ones, as given by the column %good in the first row, "Sum total". Or phrased differently: only roughly 10% of all accesses are relevant.
  2. Among the irrelevant HTTP codes, the dominant one is 404 (not found)
  3. The most active bots that can only be identified by their IP address range sit in the class B subnet 162.55, which is Hetzner's your-server.de; these bots are more intense than all HTTP error codes combined
  4. Ahrefs is the most frequent visitor of all bots (ca. 21%), far ahead of the Google bot (4.7%); this is in line with observations from last year: Website Checking with Ahrefs
  5. It is interesting that Yandex crawls the blog more frequently than Bing