Thu, 18 Nov 2004

Apache Virtual Host Bandwidth Monitoring

Also, a good example of the READ-COMMITTED isolation level

mod_accounting is a simple Apache module for recording bandwidth usage on a per-virtual host basis. It can write to either a MySQL or PostgreSQL database. You just add a couple directives to httpd.conf to tell mod_accounting which database to write to, the query to run, and how often to write to the database.
Here's the configuration I use.

<IfModule mod_accounting.c>
AccountingQueryFmt "   \
  INSERT INTO          \
    vhost_accounting   \
  (                    \
    bytes_in,          \
    bytes_out,         \
    virtual_host,      \
    host               \
  )                    \
  values               \
  (                    \
    %r,                \
    %s,                \
    LOWER('%h'),       \
    'host.example.com' \
  );"
AccountingTimedUpdates 300
AccountingDatabase vhost_accounting
AccountingDatabaseDriver mysql
AccountingDBHost stats.example.com 3306
AccountingLoginInfo vhost_accounting XXXXXXXX
</IfModule>

And the corresponding table.

CREATE TABLE `vhost_accounting` (
  `virtual_host` varchar(128) NOT NULL default '',
  `bytes_in` int(20) unsigned NOT NULL default '0',
  `bytes_out` int(20) unsigned NOT NULL default '0',
  `timestamp` timestamp(14) NOT NULL,
  `host` varchar(128) NOT NULL default '',
  `id` bigint(20) unsigned NOT NULL auto_increment,
  PRIMARY KEY  (`id`)
) TYPE=InnoDB

If you have tables of hosts and virtual hosts, you'll probably want to normalize the database and modify AccountingQueryFmt to use the foreign key to the vhost. For example,
AccountingQueryFmt "INSERT INTO vhost_accounting (bytes_in, bytes_out, virtual_host_id) SELECT %r, %s, id FROM vhost WHERE virtual_host_name = LOWER('%h');"

There are a couple issues that make deploying mod_accounting to multiple machines a bit tricky. Each apache process does its own logging, and thus, maintains its own connection to the database server. If you have hundreds of Apache processes and multiple servers, you'll quickly have thousands of open connections to the database. To reduce the number of concurrent connections, I made a one-line change to mod_accounting to close the connection to MySQL after writing to the database so a new connection is used each time data is written to the database.

The next problem is simply the large amount of data that is collected. It quickly grows to the point where generating reports in real-time is too slow. To solve this problem, I created a summary table, vhost_accounting_summary, to consolidate the data produced by each Apache process. I also decided that hourly statistics were sufficient for my needs.

The records in the vhost_accounting table are removed once the data has been inserted into vhost_accounting_summary. One other table, vhost_accounting_timestamp, is used to keep track of when the summary table was last updated and to ensure that data from vhost_accounting does not accidentally get written to vhost_accounting_summary multiple times. (More details below.)

CREATE TABLE `vhost_accounting_summary` (
  `virtual_host` varchar(128) NOT NULL default '',
  `bytes_in` int(20) unsigned NOT NULL default '0',
  `bytes_out` int(20) unsigned NOT NULL default '0',
  `timestamp` timestamp(14) NOT NULL,
  `host` varchar(128) NOT NULL default '',
  KEY `virtual_host` (`virtual_host`),
  KEY `idx_timestamp_host` (`timestamp`,`host`)
) TYPE=InnoDB

CREATE TABLE `vhost_accounting_timestamp` (
  `timestamp` timestamp(14) NOT NULL
) TYPE=InnoDB

And here's the SQL to update the vhost_accounting_summary table.

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
BEGIN;
SELECT timestamp FROM vhost_accounting_timestamp FOR UPDATE;
SELECT @max_id:= max(id) FROM vhost_accounting;
INSERT INTO
        vhost_accounting_summary
SELECT
        virtual_host,
        SUM(bytes_in),
        SUM(bytes_out),
        DATE_FORMAT(timestamp, '%Y-%m-%d %H:00:00') AS dayhour,
        host
FROM
        vhost_accounting
WHERE
        id < @max_id
GROUP BY
        virtual_host,
        dayhour;
DELETE FROM vhost_accounting WHERE id < @max_id;
UPDATE vhost_accounting_timestamp SET timestamp = now();
COMMIT;

There are a couple important points to make about the SQL above. We want to make sure that mod_accounting never has to wait to write to the vhost_accounting because of a lock. In addition, we want to ensure that, if multiple copies of the update script get run concurrently, it does not cause the same data to be written to vhost_accounting_summary multiple times.

Mod_accounting does its work while processing a request. If mod_accounting is waiting for a lock, the Apache process will not finish processing the current request until the lock is removed or a lock timeout occurs. This can quickly cause all of the Apache children to be tied up and prevent new requests from being processed.

To prevent a lock from being set, we only act on records with a primary key one less than the maximum. This prevents a next-key lock from being set when we delete the records we have just processed.

To prevent multiple instances of the script from causing redundant data to be stored, we do a locking read of the timestamp by using FOR UPDATE. This lock only affects this summary script since mod_accounting does not use the vhost_accounting_timestamp table. Furthermore, we must use the READ-COMMITTED isolation level to guarantee that once the script receives the lock, it doesn't read any rows from vhost_accounting that have already been deleted.

And finally, we're ready to generate some reports…

SELECT
  SUM(bytes_in) AS bytes_in,
  SUM(bytes_out) AS bytes_out,
  SUM(bytes_in + bytes_out) AS total,
  virtual_host,
  host
FROM
  vhost_accounting_summary
WHERE
  timestamp >= '2004-11-18 00:00:00' AND
  timestamp <= '2004-11-18 23:59:59'
GROUP BY
  virtual_host
ORDER BY
   total DESC

tech | Permanent Link

The state is that great fiction by which everyone tries to live at the expense of everyone else. - Frederic Bastiat