Also, a good example of the READ-COMMITTED isolation level
mod_accounting is a simple
Apache module for recording bandwidth usage on a per-virtual host basis. It
can write to either a MySQL or PostgreSQL database. You just add a couple
directives to httpd.conf to tell mod_accounting which database to write to, the
query to run, and how often to write to the database.
Here's the configuration I use.
<IfModule mod_accounting.c> AccountingQueryFmt " \ INSERT INTO \ vhost_accounting \ ( \ bytes_in, \ bytes_out, \ virtual_host, \ host \ ) \ values \ ( \ %r, \ %s, \ LOWER('%h'), \ 'host.example.com' \ );" AccountingTimedUpdates 300 AccountingDatabase vhost_accounting AccountingDatabaseDriver mysql AccountingDBHost stats.example.com 3306 AccountingLoginInfo vhost_accounting XXXXXXXX </IfModule>
And the corresponding table.
CREATE TABLE `vhost_accounting` ( `virtual_host` varchar(128) NOT NULL default '', `bytes_in` int(20) unsigned NOT NULL default '0', `bytes_out` int(20) unsigned NOT NULL default '0', `timestamp` timestamp(14) NOT NULL, `host` varchar(128) NOT NULL default '', `id` bigint(20) unsigned NOT NULL auto_increment, PRIMARY KEY (`id`) ) TYPE=InnoDB
If you have tables of hosts and virtual hosts, you'll probably want to
normalize the database and modify AccountingQueryFmt
to use the foreign key to
the vhost. For example,
AccountingQueryFmt "INSERT INTO vhost_accounting (bytes_in, bytes_out, virtual_host_id) SELECT %r, %s, id FROM vhost WHERE virtual_host_name = LOWER('%h');"
There are a couple issues that make deploying mod_accounting to multiple machines a bit tricky. Each apache process does its own logging, and thus, maintains its own connection to the database server. If you have hundreds of Apache processes and multiple servers, you'll quickly have thousands of open connections to the database. To reduce the number of concurrent connections, I made a one-line change to mod_accounting to close the connection to MySQL after writing to the database so a new connection is used each time data is written to the database.
The next problem is simply the large amount of data that is collected. It
quickly grows to the point where generating reports in real-time is too slow.
To solve this problem, I created a summary table, vhost_accounting_summary
, to
consolidate the data produced by each Apache process. I also decided that
hourly statistics were sufficient for my needs.
The records in the vhost_accounting
table are removed once the data has been
inserted into vhost_accounting_summary
. One other table,
vhost_accounting_timestamp
, is used to keep track of when the summary table was
last updated and to ensure that data from vhost_accounting
does not
accidentally get written to vhost_accounting_summary
multiple times. (More
details below.)
CREATE TABLE `vhost_accounting_summary` ( `virtual_host` varchar(128) NOT NULL default '', `bytes_in` int(20) unsigned NOT NULL default '0', `bytes_out` int(20) unsigned NOT NULL default '0', `timestamp` timestamp(14) NOT NULL, `host` varchar(128) NOT NULL default '', KEY `virtual_host` (`virtual_host`), KEY `idx_timestamp_host` (`timestamp`,`host`) ) TYPE=InnoDB CREATE TABLE `vhost_accounting_timestamp` ( `timestamp` timestamp(14) NOT NULL ) TYPE=InnoDB
And here's the SQL to update the vhost_accounting_summary
table.
SET TRANSACTION ISOLATION LEVEL READ COMMITTED; BEGIN; SELECT timestamp FROM vhost_accounting_timestamp FOR UPDATE; SELECT @max_id:= max(id) FROM vhost_accounting; INSERT INTO vhost_accounting_summary SELECT virtual_host, SUM(bytes_in), SUM(bytes_out), DATE_FORMAT(timestamp, '%Y-%m-%d %H:00:00') AS dayhour, host FROM vhost_accounting WHERE id < @max_id GROUP BY virtual_host, dayhour; DELETE FROM vhost_accounting WHERE id < @max_id; UPDATE vhost_accounting_timestamp SET timestamp = now(); COMMIT;
There are a couple important points to make about the SQL above. We want to
make sure that mod_accounting never has to wait to write to the
vhost_accounting
because of a lock. In addition, we want to ensure that, if
multiple copies of the update script get run concurrently, it does not cause
the same data to be written to vhost_accounting_summary
multiple times.
Mod_accounting does its work while processing a request. If mod_accounting is waiting for a lock, the Apache process will not finish processing the current request until the lock is removed or a lock timeout occurs. This can quickly cause all of the Apache children to be tied up and prevent new requests from being processed.
To prevent a lock from being set, we only act on records with a primary key one less than the maximum. This prevents a next-key lock from being set when we delete the records we have just processed.
To prevent multiple instances of the script from causing redundant data to be
stored, we do a locking read of the timestamp by using FOR
UPDATE
. This lock only affects this summary script since
mod_accounting does not use the vhost_accounting_timestamp
table.
Furthermore, we must use the READ-COMMITTED
isolation level to guarantee that once the script receives the lock, it
doesn't read any rows from vhost_accounting
that have already been
deleted.
And finally, we're ready to generate some reports
SELECT SUM(bytes_in) AS bytes_in, SUM(bytes_out) AS bytes_out, SUM(bytes_in + bytes_out) AS total, virtual_host, host FROM vhost_accounting_summary WHERE timestamp >= '2004-11-18 00:00:00' AND timestamp <= '2004-11-18 23:59:59' GROUP BY virtual_host ORDER BY total DESC
The state is that great fiction by which everyone tries to live at the expense of everyone else. - Frederic Bastiat