Mirrorbrain
From openSUSE
Revision as of 20:48, 4 July 2009
| Mirrorbrain | |
| The brain of mirrorbrain | |
| General | |
| Website: | http://www.mirrorbrain.org |
| Author: | Peter Pöml |
| License: | GPL and Apache License |
| openSUSE | |
| Category: | Network |
A Download Redirector and Metalink Generator
Introduction
MirrorBrain is an open source framework to run a content delivery network using mirror servers. It solves a challenge that many popular open source projects face - a flood of download requests, often orders of magnitude more than a single site could practically handle.
A central (and probably the most obvious) part is a "download redirector" which automatically redirects requests from web browsers or download programs to a mirror server near them.
Choosing a suitable mirror for a user's request is the key. MirrorBrain uses geolocation and global routing data to make a sensible choice and to achieve load-balancing for the mirrors at the same time. The algorithm used is both sophisticated and easy to control and tune. In addition, MirrorBrain monitors mirrors, scans them for files, generates mirror lists, and more.
Installation
Repositories
Add the needed repositories (use the subdirectory matching your distribution):
http://download.opensuse.org/repositories/Apache:/MirrorBrain/
http://download.opensuse.org/repositories/devel:/languages:/python/
http://download.opensuse.org/repositories/server:/database:/postgresql/
You can do this via commandline (we are using openSUSE 11.1 in our example):
zypper ar http://download.opensuse.org/repositories/Apache:/MirrorBrain/Apache_openSUSE_11.1 Apache:MirrorBrain
zypper ar http://download.opensuse.org/repositories/devel:/languages:/python/openSUSE_11.1 devel:languages:python
zypper ar http://download.opensuse.org/repositories/server:/database:/postgresql/openSUSE_11.1 server:database:postgresql
Packages
Here's a list of packages needed to have one host running the database and the redirector:
apache2 apache2-worker apache2-mod_asn apache2-mod_geoip apache2-mod_mirrorbrain apache2-webthings-collection GeoIP libapr-util1-dbd-pgsql libGeoIP1 perl-Config-IniFiles perl-DBD-Pg perl-Digest-MD4 perl-libwww-perl postgresql postgresql-server python-cmdln python-psycopg2 python-sqlobject mirrorbrain mirrorbrain-scanner mirrorbrain-tools postgresql-ip4r
| If the web server is set up separately from the database server, the web server needs only the package libapr-util1-dbd-pgsql and no other postgresql* packages. |
You can install the packages via the following commandline:
zypper install apache2 apache2-worker apache2-mod_asn apache2-mod_geoip apache2-mod_mirrorbrain apache2-webthings-collection GeoIP libapr-util1-dbd-pgsql libGeoIP1 perl-Config-IniFiles perl-DBD-Pg perl-Digest-MD4 perl-libwww-perl postgresql postgresql-server python-cmdln python-psycopg2 python-sqlobject mirrorbrain mirrorbrain-scanner mirrorbrain-tools postgresql-ip4r
Configure GeoIP
Edit /etc/apache2/conf.d/mod_geoip.conf:
<IfModule mod_geoip.c>
  GeoIPEnable On
  GeoIPDBFile /var/lib/GeoIP/GeoIP.dat
  #GeoIPOutput [Notes|Env|All]
  GeoIPOutput Env
</IfModule>
(Change GeoIPOutput All to GeoIPOutput Env)
Note that a caching mode like MMapCache needs to be used when Apache runs with the worker MPM. In this case, use
<IfModule mod_geoip.c>
  GeoIPEnable On
  GeoIPDBFile /var/lib/GeoIP/GeoIP.dat MMapCache
  GeoIPOutput Env
</IfModule>
Set up automatic GeoIP database updates
New versions of the GeoIP database are released each month. You can set up a cron job to automatically fetch new updates as follows. If you do that, make sure to set the GeoIPDBFile path (see above) to /var/lib/GeoIP/GeoLiteCity.dat.updated .
# update GeoIP database on Mondays
31 2 * * mon root sleep $(($RANDOM/1024)); /usr/bin/geoip-lite-update
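The sleep $(($RANDOM/1024)) prefix in the cron entry adds a small random delay before the download starts. As a rough sketch of the arithmetic: bash's RANDOM yields an integer between 0 and 32767, so integer division by 1024 gives a delay of 0 to 31 seconds. This assumes that cron runs the command with a shell that provides RANDOM (such as bash); the variable names below are only for illustration.

```shell
# Bounds of the delay produced by: sleep $(($RANDOM/1024))
# bash's RANDOM ranges from 0 to 32767 (assumption: the cron shell is bash).
random_min=0
random_max=32767

# Shell arithmetic uses integer division, so the result is truncated.
min_delay=$((random_min / 1024))
max_delay=$((random_max / 1024))

echo "delay range: ${min_delay}-${max_delay} seconds"
```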
Add mirrorbrain User/Group
When you install the mirrorbrain package, a user named "mirrorbrain" in group "mirrorbrain" is automatically created.
Create the mirrorbrain config file
Create /etc/mirrorbrain.conf with the content below:
[general]
# there can be several mirrorbrain instances, each identified by a unique name.
# each instance will have an [instance_name] section below.
instances = mbtest

# general information for mirror monitoring
[mirrorprobe]
mailto = your_mail@example.com, another_mail@example.com

# settings for the mbtest instance
[mbtest]
dbuser = mb
dbpass = 12345
dbdriver = postgresql
dbhost = localhost
# optional: dbport = ...
dbname = mb
The file permission should be 0640, ownership root:mirrorbrain.
chown root:mirrorbrain /etc/mirrorbrain.conf
chmod 0640 /etc/mirrorbrain.conf
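To double-check the result, a small helper along the following lines could be used. It is only a sketch: the function name check_conf_mode is made up for this example, and it assumes GNU stat (as shipped with coreutils), whose -c '%a' option prints the octal permission bits.

```shell
# check_conf_mode FILE
# Succeeds if FILE has permission bits 640 (hypothetical helper, GNU stat assumed).
check_conf_mode() {
    # %a prints the access rights in octal, e.g. 640
    mode=$(stat -c '%a' "$1") || return 2
    [ "$mode" = "640" ]
}
```

For example, check_conf_mode /etc/mirrorbrain.conf && echo "permissions ok" prints the message only if the mode matches. The ownership can be inspected in the same way with stat -c '%U:%G'.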
Other possible options for the instances:
| Option | Values |
|---|---|
| scan_top_include | directory names separated by spaces |
| scan_exclude_rsync | exclude list for rsync (the same rules apply as for rsync's --exclude= option) |
| scan_exclude | FTP excludes that we don't use |
Start the PostgreSQL server
- Start the postgresql database, and configure it to be started at boot time automatically:
root@powerpc:~ # rcpostgresql start
root@powerpc:~ # chkconfig -a postgresql
Install the PostgreSQL ip4r datatype
IP4 and IP4R are types that contain a single IPv4 address and a range of IPv4 addresses respectively. They can be used as a more flexible, indexable version of the cidr type and mirrorbrain uses this kind of database type for improving database speed.
- Install the datatype, done by executing sql statements from the shipped file:
root@powerpc:~ # su - postgres
postgres@powerpc:~> psql -f /usr/share/postgresql-ip4r/ip4r.sql template1
| "template1" means that all databases that are created later will have the datatype. To install it onto an existing database, use your database name instead. |
It is normal to see a good screenful of output printed by psql.
Setup/configure the database
- Create a separate database user account (named mb) and a new database (also named mb):
root@powerpc:~ # su - postgres
postgres@powerpc:~> createuser --no-superuser --no-createdb --no-createrole --pwprompt --login mb
Enter password for new role:
Enter it again:
postgres@powerpc:~> createdb -O mb mb
postgres@powerpc:~> createlang plpgsql mb
- Now back up the PostgreSQL settings and adapt them for mirrorbrain:
postgres@powerpc:~> cp data/pg_hba.conf data/pg_hba.conf.orig
postgres@powerpc:~> vi data/pg_hba.conf
Note that we allow connections via socket and from localhost for the user mb:
# TYPE  DATABASE  USER  CIDR-ADDRESS   METHOD
# "local" is for Unix domain socket connections only
local   mb        mb                   md5
local   all       all                  ident sameuser
# IPv4 local connections:
host    mb        mb    127.0.0.1/32   md5
host    all       all   127.0.0.1/32   ident sameuser
# IPv6 local connections:
host    mb        mb    ::1/128        md5
host    all       all   ::1/128        ident sameuser
- Afterwards, restart the postgresql server as root:
root@powerpc:~ # rcpostgresql restart
- You should now be able to login as user mb with your defined password:
root@powerpc:~ # psql -U mb
Password for user mb:
Welcome to psql 8.3.7, the PostgreSQL interactive terminal.
...
If the database will be large, reserve enough memory for it (mainly by setting shared_buffers), and in any case you should switch off synchronous commit mode (synchronous_commit = off). This can be set in data/postgresql.conf as user postgres. Don't forget to restart the database engine after changing the config.
- Now import table structure, and initial data:
root@powerpc:~ # psql -U mb -f /usr/share/doc/packages/mirrorbrain/sql/schema-postgresql.sql mb
Password for user mb:
BEGIN
...
root@powerpc:~ # psql -U mb -f /usr/share/doc/packages/mirrorbrain/sql/initialdata-postgresql.sql mb
Password for user mb:
INSERT 0 6
INSERT 0 246
root@powerpc:~ #
Create the ASN database table
Now that your initial database exists, execute the SQL statements from asn.sql (shipped with apache2-mod_asn). The resulting table allows Apache to look up the autonomous system number and the network prefix of a client and to set them as environment variables, for use by other Apache modules. In addition, it can send them as response headers to the client.
root@powerpc:~ # psql -U mb -f /usr/share/doc/packages/apache2-mod_asn/asn.sql mb
| The command creates a table named pfx2asn in the mb database. The table name is used in some other places, so you should not change it. |
Test if the mb tool can connect to the database
- You should now be able to run the mb command. If successful, you'll get the help output:
root@powerpc:~ # mb
Usage: ...
Prepare ASN usage
Load the database with routing data
- run the following commands as user mirrorbrain; they download routing data and import it into the database:
root@powerpc:~ # su - mirrorbrain
mirrorbrain@powerpc:~> asn_get_routeviews
- this will take at least a few minutes - about 30MB are downloaded, and the data is about 1 Gig uncompressed. (In the postgresql database it'll need about 40MB, with index.)
- you should set up this script to run once per week by cron, so that the database stays up to date.
# update ASN data once a week (Sundays at 01:01), for all configured mirrorbrain instances:
1 1 * * 7 mirrorbrain sleep $(($RANDOM/1024)); rm oix-full-snapshot-latest.dat.bz2; for i in $(mb instances); do asn_get_routeviews | asn_import -b $i; done
Test if the routing data lookup works
- If everything has worked so far, you should be able to run the following command and get a network prefix and AS number as result of the lookup:
root@powerpc:~ # mb iplookup www.opensuse.org
130.57.0.0/20 (AS3680)
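If you want to reuse the result in a script (for instance to compare it with the asn and prefix values of a mirror), the output line can be split with plain shell parameter expansion. The following is a minimal sketch that works on the sample output shown above; it assumes the output always has the form "PREFIX (ASnumber)", which is inferred from that one example only.

```shell
# Sample output line copied from the lookup above.
line='130.57.0.0/20 (AS3680)'

# Everything before the first space is the network prefix.
prefix=${line%% *}

# Strip everything up to and including "(AS", then the closing parenthesis.
asn=${line##*\(AS}
asn=${asn%\)}

echo "prefix=$prefix asn=$asn"
```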
Configure mod_asn
- simply set ASLookup On in the directory context where you want it.
- the shipped config (mod_asn.conf) shows an example.
- set ASSetHeaders Off if you don't want the data to be added to the HTTP response headers.
- the client IP address is normally the one that the request originates from. If mod_asn runs behind a frontend server, the frontend can pass the client IP in a header, and mod_asn can be configured to look at that header instead:
ASIPHeader X-Forwarded-For
- if you use ASIPHeader, you would probably use it together with GeoIPScanProxyHeaders.
- alternatively, if you want to use mod_rewrite you can also make mod_asn look at a variable in Apache's subprocess environment:
ASIPEnvvar CLIENT_IP
- ASLookupDebug On can be set to switch on debug logging. It can be set per directory.
Configure Apache
Load and Configure needed Apache modules
- First, enable the four Apache modules:
root@powerpc:~ # a2enmod form
root@powerpc:~ # a2enmod dbd
root@powerpc:~ # a2enmod asn
root@powerpc:~ # a2enmod mirrorbrain
- create a DNS alias for your web host, if needed
- configure the database adapter (mod_dbd) and its connection pool. Put the configuration into the server-wide context. Config example:
# for prefork, this configuration is inactive. prefork simply uses 1
# connection per child.
<IfModule !prefork.c>
DBDMin 0
DBDMax 32
DBDKeep 4
DBDExptime 10
</IfModule>
| This is only needed if you use the apache worker module - prefork always uses 1 connection per child. |
Configure Logging
- You may want to log more details than Apache normally logs into the access_log file. You can define a new log format that gives you an access_log, with details from MirrorBrain added:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \
%{X-MirrorBrain-Mirror}o r:%{MB_REALM}e \
%{MB_CONTINENT_CODE}e:%{MB_COUNTRY_CODE}e ASN:%{ASN}e P:%{PFX}e \
size:%{MB_FILESIZE}e %{Range}i" combined_redirect
- This defines a new log format called "combined_redirect", which you can use in your virtual hosts with the CustomLog directive. To change the default logging setup in openSUSE, you would edit /etc/sysconfig/apache2 and change
APACHE_ACCESS_LOG="/var/log/apache2/access_log combined"
- to
APACHE_ACCESS_LOG="/var/log/apache2/access_log combined_redirect"
- if you set up a virtual host, you would most likely add your own CustomLog directive there, to get a separate log file for that virtual host.
Configure a Virtual Host
- Below is a complete example for a working virtual host configuration for mod_mirrorbrain. You'll need to change some details that are specific to your setup, like the base directory (named "/srv/my_tree" below). Some Apache directives need to be placed inside a <Directory> block, others not - the below example should provide guidance regarding this; just copy and paste it and go from there:
<VirtualHost *:80>
# change to something useful:
ServerName localhost
#ServerAdmin webmaster@your_host
# put your mirrored tree here!
DocumentRoot /srv/my_tree
# logging to /var/log/apache2/mb-access_log
CustomLog /var/log/apache2/mb-access_log combined_redirect
DBDriver pgsql
DBDParams "host=localhost user=mb password=12345 dbname=mb connect_timeout=15"
# (optional) location of metalink hashes:
# MirrorBrainMetalinkHashesPathPrefix /srv/metalink-hashes
# MirrorBrainMetalinkPublisher "Your project name" http://example.com/
# MirrorBrainMirrorlistStyleSheet "http://example.com/stylesheet.css"
<Directory /srv/my_tree>
MirrorBrainEngine On
MirrorBrainDebug Off
FormGET On
MirrorBrainHandleHEADRequestLocally Off
MirrorBrainMinSize 0
MirrorBrainHandleDirectoryIndexLocally On
Options FollowSymLinks Indexes
AllowOverride None
Order allow,deny
Allow from all
IndexOptions FancyIndexing +NameWidth=* TrackModified
IndexOptions VersionSort SuppressDescription XHTML
IndexStyleSheet "http://example.com/stylesheet.css"
<IfModule mod_autoindex_mb.c>
IndexOptions Metalink
IndexOptions Mirrorlist
</IfModule>
<IfModule mod_asn.c>
ASLookup On
ASSetHeaders On
</IfModule>
</Directory>
</VirtualHost>
Start Apache
- Start the apache server, and configure it to be started at boot time automatically:
root@powerpc:~ # rcapache2 start
root@powerpc:~ # chkconfig -a apache2
Using Mirrorbrain
Setting up mirrorbrain is only one (but the biggest) part of running mirrorbrain. Afterwards, you use the commandline tool mb to configure your mirrors.
| Even if mirrorbrain is able to handle rsync, ftp and http - the best way to keep informed about the content on the mirrors is rsync. The mirrorbrain scanner tries to verify the content on the mirrors via rsync, ftp or http in this order. So it's ok to add just an http mirror - but if possible, ask for rsync access ;-) |
root@powerpc:~ # mb help
Shows all relevant information - use mb help <command> to get more information about an option.
Adding mirror servers
Adding a new mirror is easy with mb - if you have all relevant data from your mirror servers.
root@powerpc:~ # mb new mirror.example.com \
    -H http://mirror.example.com/download/ \
    -R rsync://mirror.example.com/download/ \
    -e admin@example.com \
    -a "Admins Name" \
    -C "He Who Never Sleeps"
This adds the new mirror with the identifier mirror.example.com to the database. New mirrors are disabled by default and need at least to be verified via mirrorprobe, so a mirror has to be scanned and probed before it is accepted for redirects. Scanning and probing normally run via cron job (see above). For new hosts, you can instead scan and probe manually, and enable the new mirror afterwards if everything is fine.
So the best way to proceed is to initiate a manual scan of the mirror:
root@powerpc:~ # mb scan -v mirror.example.com
The -v option produces verbose output. You should see your files and summaries like "scanned 105 files (22/s) in 4s". Now the database is up to date. Next, verify the content on the mirror:
root@powerpc:~ # /usr/bin/mirrorprobe -t 20
As a result, the field "statusBaseurl" in the database should be "True", which means that the mirror has been checked and verified. The last step is to enable the mirror, so that the Apache module can use it as a redirect target:
root@powerpc:~ # mb enable mirror.example.com
- you should set up the mirrorprobe to run every minute by cron, so the mirrors are closely watched:
# logs to /var/log/mirrorbrain/mirrorprobe.log
-* * * * * mirrorbrain mirrorprobe -t 20 &>/dev/null
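The manual scan, probe and enable steps described above could be collected in a small shell function. This is only a sketch: the name onboard_mirror is made up for this example, and it assumes that mb and mirrorprobe return a non-zero exit status on failure, which is not confirmed here. A successful exit status also does not replace a look at statusBaseurl before enabling a mirror.

```shell
# onboard_mirror IDENTIFIER
# Hypothetical helper: scan, probe, then enable a newly added mirror.
# Stops at the first failing step, so that a broken mirror is not enabled
# (assumption: the tools signal failure through their exit status).
onboard_mirror() {
    id=$1
    mb scan -v "$id" || return 1      # fill the database with the file list
    mirrorprobe -t 20 || return 1     # probe all mirrors, 20 second timeout
    mb enable "$id"                   # allow redirects to this mirror
}
```

It would be called as onboard_mirror mirror.example.com after the mirror has been created with mb new.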
Editing the mirror database
This is done via
root@powerpc:~ # mb edit <identifier>
This command opens your preferred editor and shows a table containing the relevant values for the mirror. You can use parts of the identifier on the commandline - mb completes the name automatically, if it is unique enough.
#
# Note: You cannot modify 'identifier' or 'id'. You can use 'mb rename' though.
#
identifier     : mirror.example.com
operatorName   :
operatorUrl    :
baseurl        : http://mirror.example.com/download/
baseurlFtp     :
baseurlRsync   : rsync://mirror.example.com/download/
region         : eu
country        : de
asn            : 0
prefix         :
regionOnly     : False
countryOnly    : False
asOnly         : False
prefixOnly     : False
otherCountries :
fileMaxsize    : 0
publicNotes    :
score          : 100
enabled        : True
statusBaseurl  : True
admin          : Admins Name
adminEmail     : admin@example.com
---------- comments ----------
host is always down on saturday
---------- comments ----------
The most important fields are baseurl*, score, enabled.
| Field | Explanation |
|---|---|
| identifier | This is the unique ID of the mirror server. It is mainly used to do things with a mirror on the commandline, like with the mb tool. With most mb commands, you can use a part of the identifier (one that you can easily remember), and the name will be completed automatically, if it is unique enough. In the table shown by mb edit, this is the only field that cannot be edited. To rename an identifier, you can use the mb rename command. |
| operatorName | The real name of the mirror operator. It could be a person, or the organization running the mirror, or a sponsor. If the mirror list is exposed in some way, this field could be used to give the operator some visibility. Otherwise, it has no significance other than for your information. |
| operatorUrl | A contact or informative URL. |
| baseurl | The root HTTP URL of the mirrored file tree on the mirror. Used by the scanner and redirector to find/redirect the files via HTTP. If a mirror doesn't offer HTTP, but only FTP, an FTP URL can be entered here as well. |
| baseurlFtp | The root FTP URL of the mirrored file tree on the mirror. Used by the scanner and redirector to find/redirect the files via FTP. |
| baseurlRsync | The root rsync URL used by the scanner and redirector to find/redirect the files via rsync. It's possible to use rsync://<username>:<password>@<hostname>/module as commonly done with rsync. rsync is the preferred method of scanning, so it is beneficial if rsync access exists. If it doesn't, the scanner falls back to FTP or HTTP. |
| region | The region code specifying the continent the mirror server is located in. See also regionOnly. If you create a new mirror, mb tries to fill in this and the following field for you; it's possible to edit it later, though. |
| country | The country code for the server. See also countryOnly. |
| asn | This is optional and is a number of the autonomous system the mirror is located in. It may serve as a more specific "network location" than the country, and is filled in automatically when a mirror is created. If you don't use the autonomous system database together with MirrorBrain, the value will be zero and will be ignored by MirrorBrain. It is not strictly needed. It can also be edited manually, or updated via mb update --asn <identifier> from looked up data. |
| prefix | Same as asn, this value is optional, and if present, it is used for a possibly finer-grained mirror selection. It is filled in automatically, and can be edited like asn. Use mb update --prefix <identifier> to fill in data from routing table lookup. |
| regionOnly | If true, only clients from the same region (continent) as the mirror are redirected to this mirror. |
| countryOnly | If true, only clients from the same country as the mirror are redirected to this mirror. |
| asOnly | If true, the mirror will only get requests from clients that are located within the same network autonomous system. |
| prefixOnly | If true, the mirror will only get requests from clients that are located within the same network prefix. |
| otherCountries | List of other countries that should be sent to this server. This overrides the country and region choice, and can be used to fine-tune mirror selection. The list of country IDs specified here is given in the form of comma-separated two-letter codes. Apache does a simple string match on these, and a value that would make sense would be "ca,mx,ar,bo,br,cl,co,ec,fk,gf,gy,pe,py,sr,uy,ve, jp" for instance. |
| fileMaxsize | Maximum file size that the server can deliver without problems (some servers have problems with files > 2GB, for example). MirrorBrain automatically checks HTTP servers for correct delivery, so there is no need to define this value for that reason. It can be used, however, to send only "small" requests to certain mirrors that are known to have too little bandwidth to deliver large files. If you set a threshold here (in bytes), the mirror will only get requests for files that are smaller. |
| publicNotes | Notes which should be added to a html page listing all mirrors. The field may be used to store information separately from private notes taken in the comments field. The data isn't exposed though, unless you take care of it. |
| score | The score of the server. Higher scored servers are used more often than lower scored servers. Default is 100. A server with score=150 will be used more often than a server with score=50. |
| enabled | disable <identifier>. |
| statusBaseurl | This field is edited by the mirror probe each time it runs (which normally is done frequently via cron). If it's true, the mirror probe found that the mirror is alive the last time it looked. |
| admin | Cleartext name of the responsible admin. |
| adminEmail | Contact Email address. |
| comments | Free text field for additional comments. Use it in any way that suits you. It lends itself to take notes about communication with mirrors, for instance. |
For fields where a boolean is expected, you can type the value (while editing) in the form of 0/1 instead of true/false, which is shorter to type.
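To illustrate what a "simple string match" on the otherCountries list amounts to, here is a minimal shell sketch. It is not the code used by the Apache module; the function name country_listed is made up, and the match is a plain substring test. Because the codes are two letters long and separated by commas, a substring test cannot match across two neighbouring entries.

```shell
# country_listed CODE LIST
# Hypothetical illustration: succeeds if the two-letter CODE occurs
# anywhere in the comma-separated LIST (plain substring match).
country_listed() {
    case "$2" in
        *"$1"*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

With the example value from the table, country_listed br "ca,mx,ar,bo,br,cl" succeeds, while country_listed de "ca,mx,ar,bo,br,cl" fails.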
Related Software
Other pages

