Mirror Setup Howto

From openSUSE

Revision as of 21:34, 4 July 2009
Poeml (Talk | contribs)
Walk-through - fix the random sleep command before downloading routing data to be between 0 and 30 minutes, and not between 0 and 30 seconds.

Walk-through

Below, I'll list steps to set up a mirror for openSUSE content. Feel free to improve this page, or simply mail feedback to ftpadmin at suse.de.

There is one big assumption made: The mirror is running openSUSE itself. This allows me to give specific directions.

If you run a different operating system, the details will differ, but hopefully this howto can serve as an example nevertheless!

  • First, make sure that you can afford the expected traffic, and that your Internet Service Provider won't terminate your contract because of it!
  • packages to install:
    • rsync
    • xntp
    • apache2-prefork or apache2-worker
  • make provisions to regularly update the machine with security fixes
  • firewall:
    • if you use one, open port 80 (HTTP) and 873 (rsync).
  • general things:
    • install xntp
    • add the IP of a time server into /etc/ntp.conf, and configure it to start (rcntp start; chkconfig -a ntp)
    • make sure that the hostname and DNS resolution make sense:
      • check /etc/hosts, /etc/HOSTNAME, /etc/resolv.conf
      • check that the commands 'hostname' and 'hostname -f' return something useful. A functioning hostname and name resolution are really helpful.
  • web server:
    • assuming your mirror hostname is: mirror.example.com
    • create /etc/apache2/vhosts.d/mirror.example.com.conf
<VirtualHost *:80>
   ServerAdmin admin@example.com
   ServerName mirror.example.com

   DocumentRoot "/srv/pub/opensuse"

   <Directory "/srv/pub/opensuse">
       Options FollowSymLinks Indexes
       IndexOptions FancyIndexing VersionSort NameWidth=* Charset=UTF-8 TrackModified FoldersFirst XHTML
       AllowOverride None
       Order allow,deny
       Allow from all
   </Directory>

   Alias /robots.txt /srv/www/mirror.example.com/robots.txt
   <Directory "/srv/www/mirror.example.com">
       Options None
       Order allow,deny
       Allow from all
   </Directory>

   Include /etc/apache2/conf.d/apachestats.conf

</VirtualHost>
    • create a robots.txt to keep web crawlers away:
      • mkdir /srv/www/mirror.example.com
      • put this into /srv/www/mirror.example.com/robots.txt:
User-agent: *
Disallow: /
    • tuning for high performance:
      • adjust the MPM characteristics in /etc/apache2/server-tuning.conf so that they fit the memory size of your machine. The worst thing that can happen is that the machine starts swapping, so Apache's maximum size needs to fit in the memory you have. The worker MPM can make better use of the available memory, but the prefork MPM is easier to configure. Watch the RSS column in ps (you can subtract SHARED), and multiply it by the maximum number of processes...
      • set a low KeepAliveTimeout (decrease it to 3) in /etc/apache2/server-tuning.conf
    • rcapache2 restart; chkconfig -a apache2
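
The memory estimate from the tuning step can be sketched in shell. This is only a rough illustration: the function name estimate_max_procs and the memory budget are made up for this example, and the process name httpd2-prefork is an assumption about how the prefork binary is called on your system. It takes the largest RSS value seen (in KB) as a pessimistic per-process size and divides the memory you are willing to give to Apache by it.

```shell
# estimate_max_procs BUDGET_KB
# Reads per-process RSS values (in KB, one per line) on stdin and prints how
# many processes of the largest observed size fit into BUDGET_KB.
# Hypothetical helper for illustration; it ignores shared memory, so the
# result is on the conservative side.
estimate_max_procs() {
    awk -v budget="$1" '
        $1 + 0 > max { max = $1 + 0 }
        END { if (max > 0) print int(budget / max); else print 0 }'
}

# Usage (assumption: the Apache processes are named httpd2-prefork and
# about 3 GB of RAM may be used by Apache):
#   ps -C httpd2-prefork -o rss= | estimate_max_procs 3000000
```

The printed number is a starting point for the maximum number of processes in server-tuning.conf, not a guarantee; keep watching the machine for swapping after you change it.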


  • content:
    • create a special user, and a directory to mirror to:
      • groupadd mirror
      • useradd -m -g mirror -c "Mirror User" -s /bin/bash mirror
      • mkdir /srv/pub/opensuse
      • mkdir /srv/pub/opensuse/update
      • chown -R mirror:mirror /srv/pub/opensuse
    • pick an rsync module that you want to sync up from. They are described here: Mirror_Infrastructure#rsync_modules. This example will use the "opensuse-hotstuff-160gb" module below.
    • add a cronjob to sync content. Here's an example for the most requested files, which we'll pull frequently (every 6 hours, after a random offset of up to about 34 minutes, because $RANDOM/16 yields 0 to 2047 seconds):
1 */6 * * *    mirror   sleep $(($RANDOM/16)); rsync -rlpt rsync.opensuse.org::opensuse-hotstuff-160gb /srv/pub/opensuse/ --delete-after --delete-excluded --max-delete=4000 --timeout=1800 -hi
    • you can try the command out, and pull the initial sync (and watch it), like this:
      • su - mirror
      • rsync -rlpt rsync.opensuse.org::opensuse-hotstuff-160gb /srv/pub/opensuse/ --delete-after --delete-excluded --max-delete=4000 --timeout=1800 -hi
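
To see why the divisor in the sleep command matters, here is the arithmetic as a small shell sketch. It assumes bash, where $RANDOM yields an integer between 0 and 32767; the variable names are only for illustration.

```shell
# Largest value that bash's $RANDOM can return.
max_random=32767

# Maximum sleep time in seconds for the two divisors.
old_max=$((max_random / 1024))   # divisor before the fix
new_max=$((max_random / 16))     # divisor used in the cron job above

echo "divisor 1024: at most ${old_max} seconds"
echo "divisor 16:   at most ${new_max} seconds (about $((new_max / 60)) minutes)"
```

With a divisor of 16 the start times of the mirrors are spread over roughly half an hour, instead of everybody hitting the rsync server within the same half minute.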


  • give the openSUSE scanner access, by setting up an rsync server:
    • (rcrsyncd start; chkconfig -a rsyncd)
    • add the following to /etc/rsyncd.conf:
 [opensuse]
         path = /srv/pub/opensuse
         comment = rsync access for openSUSE scanner
         uid = nobody
         # if you want to limit access to the openSUSE mirror scanner:
         #hosts allow = 195.135.220.0/22
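
Once the rsync daemon is running, you can check that the module is actually offered. The following is a minimal sketch; check_module is a hypothetical helper name, and it assumes that the daemon lists its modules (the default unless "list = no" is set) with the module name at the start of a line.

```shell
# check_module HOST MODULE
# Asks the rsync daemon on HOST for its module list and succeeds if MODULE
# is among the listed modules. Hypothetical helper for illustration.
check_module() {
    rsync "$1::" | grep -q "^$2[[:space:]]"
}

# Usage:
#   check_module localhost opensuse && echo "module is available"
```

Running it against localhost only proves that the daemon works locally; if you use a firewall, repeat the check from another host to make sure port 873 is reachable.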


  • tell the redirector about it
    • write mail to admin at opensuse org, providing your details, as explained here: Mirror_Infrastructure#Register_Your_Mirror
    • take appropriate care that your webserver is up! The redirector will check it every few minutes... but until the next probe happens, it will continue to redirect clients to your hosts.


  • for extra points, you can considerably increase the service quality for users by configuring cache control headers for certain content. The idea is to mark the metadata files with cache control headers that tell intermediary (proxy) caches not to serve them without first checking for freshness. This greatly reduces the risk that users see inconsistencies (one file served stale from a cache, another served fresh from the origin server). Add this to your Apache config (outside of a directory context):
   <LocationMatch "\.(xml|xml\.gz|xml\.asc)">
       Header set Cache-Control "must-revalidate"
       ExpiresActive On
       ExpiresDefault "now"
   </LocationMatch>
    • mod_headers and mod_expires are required for this configuration. Enable them with the following commands:
a2enmod headers
a2enmod expires
rcapache2 restart
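
After the restart, it is worth checking that the header is really sent. Here is a small sketch, assuming curl is installed; has_must_revalidate is a hypothetical helper name, and the URL in the usage line is only an example path that you need to replace with a metadata file that exists on your mirror.

```shell
# has_must_revalidate
# Reads HTTP response headers on stdin and succeeds if a Cache-Control
# header containing "must-revalidate" is present.
# Hypothetical helper for illustration.
has_must_revalidate() {
    tr -d '\r' | grep -i '^cache-control:' | grep -qi 'must-revalidate'
}

# Usage (the path is an example, pick an .xml file that your mirror serves):
#   curl -sI http://mirror.example.com/some/path/repomd.xml \
#       | has_must_revalidate && echo "cache control header is set"
```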


  • monitoring and mail
    • there are many ways to configure and use a mail system. What I do is:
      • add myself to the root alias in /etc/aliases: "root: poeml@example.com"
      • make sure that sending out mail works (you might need to configure a relay). Make sure YOUR mirror isn't accepting mail from external hosts, which could turn it into a spam hub
      • make the sender more explicit: usermod -c "root at $(hostname)" root
      • a highly useful package is sysstat. After installation, start it (rcsysstat start; chkconfig -a sysstat). The command "sar -A | less" will show various performance data for analysis.

Things to watch out for

If the mirror syncs from our stage rsync server (stage.opensuse.org), a few points need to be observed:

  • rsync needs to be run in a way that directory permissions are respected, and reproduced on the target machine. The above example takes care of that. If the permissions are not correctly reproduced, it interferes with the bitflip release process.
  • always run your mirror scripts under a user id different from the one your web server runs as. An identical user id would make all files readable by the web server, which interferes with the bitflip release process.
  • the user id running the mirror scripts also needs to be different from the user id that runs an rsync daemon
  • never run your web server as root. It also interferes with the bitflip release process.
  • if you happen to also run a public rsync server, make sure that your rsync daemon runs under a different user id than the script which pulls content from openSUSE. Otherwise you might be publicly serving content which is still "staged", i.e. not meant to be public.
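
The user id rules above can be checked with a few lines of shell. This is a sketch under assumptions: check_distinct_uids is a hypothetical helper name, "mirror" and "nobody" are the users from the examples on this page, and "wwwrun" is assumed to be the user your Apache runs as, so adjust the names to your setup.

```shell
# check_distinct_uids USER...
# Fails (return code 1) if any two of the given user names resolve to the
# same numeric user id, and with return code 2 if a user does not exist.
# Hypothetical helper for illustration.
check_distinct_uids() {
    seen=""
    for name in "$@"; do
        uid=$(id -u "$name") || return 2
        case " $seen " in
            *" $uid "*)
                echo "user id $uid is used more than once (user: $name)"
                return 1
                ;;
        esac
        seen="$seen $uid"
    done
    return 0
}

# Usage (mirror script user, web server user, rsync daemon user):
#   check_distinct_uids mirror wwwrun nobody && echo "user ids are distinct"
```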

See also: Mirror_Infrastructure#Conditions_for_access_to_stage.opensuse.org

Protection of resources

If your mirror is very popular, it may receive substantial traffic from download clients that open too many connections. Some download clients open simultaneous connections to grab more of your bandwidth. That's not necessarily wrong in itself, but if they open too many connections (20, or even more than 100), you will have to take countermeasures, both to protect your server and to keep the resources you provide accessible to other legitimate users.

You can see the number of simultaneous connections e.g. with this command:

rcapache2 full-server-status | grep ' W ' | sort -k 11

This command basically takes the output of the Apache server status and sorts it by IP address, making it easy to see how many connections originate from where.
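
If you prefer a count per address instead of a sorted list, the same output can be summarized. This is a minimal sketch; count_per_ip is a hypothetical helper name, and it relies on the client address being the 11th whitespace-separated field of each status line, as in the command above.

```shell
# count_per_ip
# Reads server status lines on stdin and prints the number of lines per
# client address (field 11), busiest address first.
# Hypothetical helper for illustration.
count_per_ip() {
    awk '{ print $11 }' | sort | uniq -c | sort -rn
}

# Usage:
#   rcapache2 full-server-status | grep ' W ' | count_per_ip | head
```

Each output line shows a connection count followed by the address, so clients with dozens of simultaneous connections show up at the top.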

There are a number of Apache modules that can be used to limit such clients. Don't be confused: what you *don't* want in this scenario is connection throttling, because it would make the clients stay even longer and occupy server slots longer. There are two modules that I can recommend:

mod_limitipconn

from http://dominia.org/djao/limitipconn.html. Packages here: http://software.opensuse.org/search?q=apache2-mod_limitipconn

This module limits connections that are handled at the same time, per IP. Example configuration:

<IfModule mod_limitipconn.c>
    <Directory /srv/pub/opensuse>
        MaxConnPerIP 20
        # exempting images from the connection limit is often a good
        # idea if your web page has lots of inline images, since these
        # pages often generate a flurry of concurrent image requests
        NoIPLimit image/*
    </Directory>
</IfModule>

The limit should not be too small, because many simultaneous connections from one IP address can also mean that corporate users are accessing your site via a common proxy.

mod_ip_count

Packages are here: http://software.opensuse.org/search?q=apache2-mod_ip_count_modmemcache. Needs mod_memcache from http://software.opensuse.org/search?q=apache2-mod_memcache and a memcache daemon (http://software.opensuse.org/search?q=memcached).

This module limits the rate at which new connections are accepted, per IP.

<IfModule mod_memcache.c>
    MemcacheServer 127.0.0.1:11211 min=0 smax=16 max=32 ttl=600
</IfModule>
<IfModule mod_ip_count.c>
    # Max number of requests before failing
    MemCacheMaxRequests 800
    # Time period in which the requests have to come (seconds)
    MemCacheMaxTime 120
</IfModule>

The window we look at must be large enough that we don't block clients that download a large directory, like the openSUSE install client, which downloads packages to install from 11.0/repo/i586/...

The required memcache daemon is started with 'rcmemcached start' and configured to start permanently with 'chkconfig -a memcached'.