Libzypp/Refactoring

Refactoring Libzypp

ma@suse.de, dmacvicar@suse.de

This is the design document for developers participating in libzypp refactoring.

Please update the timeline and issues.

Introduction

The amount of software available to Linux users has changed what package managers must expect from a repository.

In former days, repositories were limited to the original media, plus a few favorites.

Nowadays, thanks to the size of the community and the number of available products, an average user has access to hundreds of repositories.

Repositories like Factory, which are used by community contributors, include the whole distribution metadata in a non-optimal format (XML):

16M   filelists.xml.gz
45M   other.xml.gz
6.7M  primary.xml.gz

(Note that these are the compressed file sizes.)

Downloading this data takes time. Parsing it is expensive. Currently zypp has the following limitations:

  • zypp does not separate raw from preprocessed data.
  • zypp does a refresh (checks whether repository data has changed) on every startup.
  • zypp has to parse the raw data (XML or tag files) to initialize the resolvables before it is able to solve.
  • zypp parses the raw data (repository metadata) every time, instead of using a pre-processed format. (Analysis by Dirk Mueller has shown that parsing and applying regular expression matches account for a majority of zypp startup time.)
  • zypp allocates in-memory objects for all the data, even if only a fraction is actually used.

What this means for a user:

  • Looking up information for a package implies refreshing the sources (if the user has Factory in the list, it has probably changed), parsing all the metadata, looking up the package and displaying the information.
  • A simple call to any zypp-dependent tool needs between 15 and 40 seconds to start, depending on the status of the sources.

From 10.1 to 10.2 a lot of improvements were made, mostly by eliminating processing where it was not needed and by giving feedback to the user.

Other package managers, including ZENworks, adopted a cache system. The metadata is still parsed, but the result is stored in a pre-parsed cache.

Smart uses a native Python object cache. YUM and ZENworks adopted SQLite. For YUM, the improvements showed immediately during development (see [1]):

yum -C list updates: (dag,base,updates,freshrpms) (5226 packages)

Sqlite: 1.4 seconds 13MB memory usage
Normal: 8.6 seconds 65 MB memory usage

yum -C install xen: (dag,base,updates,freshrpms,development) (8917 packages and a lot of deps (this pulls in python-2.3 and all other kinds of stuff), all headers are cached and it is run upto the point where it asks for confirmation)

sqlite: 19 seconds 38MB memory usage
Normal: 67 seconds 105MB memory usage

There is a drawback, however: creating the sqlite caches takes some time. Running the sqlite yum list updates as above for the first time needs to create the sqlite caches and takes 20 seconds; memory usage is about 40MB.

Cache creation and refresh can also be done asynchronously from the application and run as a background task.

The same speed differences can be observed with ZENworks: rug is fast for lookup tasks.

We expect the following improvements:

  • No need to parse every time.
  • Lightweight creation of objects using lazy loading of data.
  • Memory savings from lazy data loading.
  • Fast startup due to separate handling of sources, refresh and data queries.
  • Fast lookup using a query interface.

Cache Architecture

The cache mechanism and metadata fetching can be split into several layers according to their responsibilities.

ZYpp Cache and Metadata Fetching Proposed Architecture

Metadata and File Retrieval

You can see how we can save download time here: Libzypp/Downloading_Metadata

Media layer

This layer already exists. It abstracts the different media backends such as HTTP, FTP, local, NFS and ISO. It also provides a small verifier API so you can check that the media you are accessing is the one you expect (you can't always rely on the URL alone, for example in the case of CDs).

MediaSet layer

This component is responsible for handling the different media numbers of a URL, and for asking the user to change the media if the wrong one is inserted. It therefore has to provide an API to set a verifier for each media, and to provide data from a specific media number.

This code used to be tied into SourceImpl. It has now been refactored out into zypp::MediaSetAccess.
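
The per-media verifier idea can be sketched as follows. This is a minimal illustration written in Python for brevity (libzypp itself is C++); the class, method and callback names are hypothetical and are not the zypp::MediaSetAccess API.

```python
from typing import Callable, Dict, Tuple

# All names here are hypothetical; this is not the zypp::MediaSetAccess API.
Verifier = Callable[[str], bool]  # gets the label read from the inserted medium


class WrongMediaError(Exception):
    """The inserted medium did not pass the verifier for the requested number."""


class MediaSet:
    def __init__(self, url: str, read_label: Callable[[], str]):
        self.url = url
        self._read_label = read_label  # stands in for the Media layer
        self._verifiers: Dict[int, Verifier] = {}

    def set_verifier(self, media_nr: int, verifier: Verifier) -> None:
        self._verifiers[media_nr] = verifier

    def provide_file(self, path: str, media_nr: int = 1) -> Tuple[int, str]:
        verifier = self._verifiers.get(media_nr)
        label = self._read_label()
        if verifier is not None and not verifier(label):
            # A real implementation would ask the user to change the medium
            # and retry; the sketch only reports the problem.
            raise WrongMediaError(f"medium {media_nr} expected, found {label!r}")
        return (media_nr, path)


# Usage: "disc-2" is in the drive, so only requests for medium 2 succeed.
media = MediaSet("cd:///", read_label=lambda: "disc-2")
media.set_verifier(1, lambda label: label == "disc-1")
media.set_verifier(2, lambda label: label == "disc-2")
provided = media.provide_file("example.rpm", media_nr=2)
```

Requesting a file from medium 1 in this setup raises WrongMediaError, which is the point where the user would be asked to change the media.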

Fetcher layer

Use case: a source of format X. You need to download one file from media #1, then download its signature and check it, then open the just-transferred file and read a list of more files to transfer (with their checksums), and then transfer them one by one. There is a chance those files were already transferred before. The tasks of queuing, checking and retrieving from cache are common to all metadata sources.

The Fetcher layer uses a MediaSet, and provides a way to queue and start transfer jobs as the component interested in those files learns what it has to transfer.

The API allows the caller to queue remote locations to be transferred, with their checksums or signatures, transfer them to a local directory, then add more jobs, transfer again, and so on. It also allows the caller to specify local cache directories where those files might already be present. In that case the Fetcher just copies them from there (if the checksums match).
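
A minimal sketch of such a queue-and-transfer API, in Python for brevity. The Fetcher class below, its method names and the use of SHA-1 are assumptions made for illustration; the download callback stands in for the MediaSet layer.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


def sha1_of(path: Path) -> str:
    return hashlib.sha1(path.read_bytes()).hexdigest()


class Fetcher:
    """Hypothetical sketch: queue (location, checksum) jobs, reuse cached files."""

    def __init__(self, download: Callable[[str, Path], None]):
        self._download = download            # stands in for the MediaSet layer
        self._jobs: List[Tuple[str, str]] = []
        self._cache_dirs: List[Path] = []
        self.downloaded: List[str] = []      # locations that needed a real transfer

    def add_cache_dir(self, directory: Path) -> None:
        self._cache_dirs.append(directory)

    def enqueue(self, location: str, checksum: str) -> None:
        self._jobs.append((location, checksum))

    def start(self, dest_dir: Path) -> None:
        for location, checksum in self._jobs:
            target = dest_dir / location
            target.parent.mkdir(parents=True, exist_ok=True)
            if not self._copy_from_cache(location, checksum, target):
                self._download(location, target)
                self.downloaded.append(location)
            if sha1_of(target) != checksum:
                raise ValueError(f"checksum mismatch for {location}")
        self._jobs.clear()

    def _copy_from_cache(self, location: str, checksum: str, target: Path) -> bool:
        # Reuse a cached copy only if its checksum matches the expected one.
        for directory in self._cache_dirs:
            candidate = directory / location
            if candidate.is_file() and sha1_of(candidate) == checksum:
                shutil.copyfile(candidate, target)
                return True
        return False


def demo() -> List[str]:
    remote = {"primary.xml.gz": b"unchanged", "other.xml.gz": b"changed"}
    with tempfile.TemporaryDirectory() as tmp:
        old_cache, dest = Path(tmp, "old"), Path(tmp, "new")
        old_cache.mkdir()
        dest.mkdir()
        (old_cache / "primary.xml.gz").write_bytes(b"unchanged")
        (old_cache / "other.xml.gz").write_bytes(b"stale")
        fetcher = Fetcher(lambda loc, target: target.write_bytes(remote[loc]))
        fetcher.add_cache_dir(old_cache)
        for location, data in remote.items():
            fetcher.enqueue(location, hashlib.sha1(data).hexdigest())
        fetcher.start(dest)
        return fetcher.downloaded
```

In demo(), only other.xml.gz is actually transferred, because the cached primary.xml.gz still matches its checksum.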

What about constructing a file from an old file in the cache plus a downloaded .diff? This would have to merge the functionality of creating rpms from deltas and patchrpms, probably using different diff backends depending on the MIME type.

Downloader layer

This layer is the top layer of source metadata download. It encapsulates the knowledge of *WHAT* has to be transferred. For example, a Downloader for a YUM source would use a Fetcher to get the content file. It would then read that file to get the list of files in the source, add those jobs to the Fetcher, and continue.
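
The division of labour can be sketched like this, in Python for brevity. The index format used here (one "<sha1> <location>" pair per line in a file called index.txt) is invented for the example and is neither the YUM nor the SUSETags layout; the fetch callback stands in for the Fetcher layer.

```python
import hashlib
from typing import Callable, Dict


def sha1(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


class IndexDownloader:
    """Hypothetical Downloader: it knows WHAT to fetch for one made-up format."""

    INDEX = "index.txt"

    def __init__(self, fetch: Callable[[str], bytes]):
        self._fetch = fetch  # stands in for the Fetcher layer

    def download(self) -> Dict[str, bytes]:
        # Step 1: get the index file.
        files = {self.INDEX: self._fetch(self.INDEX)}
        # Step 2: read it to learn which further files belong to the source.
        for line in files[self.INDEX].decode().splitlines():
            if not line.strip():
                continue
            checksum, location = line.split(maxsplit=1)
            data = self._fetch(location)
            if sha1(data) != checksum:
                raise ValueError(f"checksum mismatch for {location}")
            files[location] = data
        return files


# Usage with an in-memory "remote" source.
remote = {"primary.xml": b"<primary/>", "other.xml": b"<other/>"}
remote["index.txt"] = "".join(
    f"{sha1(data)} {name}\n" for name, data in remote.items()
).encode()

files = IndexDownloader(lambda location: remote[location]).download()
```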

Resolvable data caching and retrieval (aka Cache)

The cache consists of a sqlite3 database with a set of high-level APIs that allow saving the resolvable content in a pre-parsed form.

Specific Source Parser

A specific source parser takes the local metadata (previously it was also responsible for downloading it) and parses it, adding resolvables and their attributes to the cache using the Cache Store API.

Store Layer

The Store API provides a service to add resolvables and their attributes to the cache.

The biggest problem when designing an API that metadata parsers will use to fill a database is that each format delivers its data in a different order.

YUM is flawed in that it mixes descriptions and user data with the basic solver data, NVRAD (Name, Version, Release, Architecture, Dependencies). But it is easy to write to a store.

SUSETags, on the other hand, has different files for primary and user data.

Consider a simple API where we have data objects to pass to the store:

.------------------.
|    package       |
+------------------+
| NVRAD            |
+------------------+
| summary          |
+------------------+
| other data       |
+------------------+
| description      |
'------------------'

The SQL schema behind the API has a resolvable table, which stores the id and NVRAD, and a package_data table, which stores the rest:

|-----------------| |-----------------|
| resolvable      | | package_data    |
|-----------------| |-----------------|
| id  |   NVRAD   | | package_id      |
|-----------------| |-----------------|
                    | other data      |
                    |-----------------|

Inserting a package means inserting a resolvable entry, getting a new id for it, and then inserting a new package_data entry with package_id set to that id.
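
A minimal sketch of this two-step insert, using Python's sqlite3 module for brevity. The column layout is an assumption based on the diagram above (dependencies are left out to keep it short); it is not the actual cache schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed layout, following the diagram; the D of NVRAD is omitted here.
conn.executescript("""
    CREATE TABLE resolvable (
        id      INTEGER PRIMARY KEY,
        name    TEXT,
        version TEXT,
        release TEXT,
        arch    TEXT
    );
    CREATE TABLE package_data (
        package_id  INTEGER REFERENCES resolvable(id),
        summary     TEXT,
        description TEXT
    );
""")


def insert_package(conn, name, version, release, arch, summary, description):
    # Step 1: insert the resolvable entry and get its new id.
    cur = conn.execute(
        "INSERT INTO resolvable (name, version, release, arch) VALUES (?, ?, ?, ?)",
        (name, version, release, arch),
    )
    package_id = cur.lastrowid
    # Step 2: insert the package entry, filling package_id with that id.
    conn.execute(
        "INSERT INTO package_data (package_id, summary, description) VALUES (?, ?, ?)",
        (package_id, summary, description),
    )
    return package_id


vim_id = insert_package(conn, "vim", "7.0", "1", "i586", "Editor", "A text editor")
row = conn.execute(
    "SELECT r.name, p.summary FROM resolvable r "
    "JOIN package_data p ON p.package_id = r.id WHERE r.id = ?",
    (vim_id,),
).fetchone()
```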

YUM can insert all of this data while parsing, as it is available at the same time. SUSETags can't, so it has to cache the NVRAD and the id in order to insert the second block when it becomes available from the translations file.

This makes the design of the data structures highly dependent on how the metadata is read.

Also, we are looking for a solution that works well in 99% of the cases, while giving flexibility in the remaining 1%.

Interaction between Store and other components (WIP)

Proposed Solution

  • a basic resolvable NVRAD data object
  • a dynamic fields object:

.------------------------.
|      package_data      |
+------------------+-----+
| summary          | [ ] |
| description      | [ ] |
| group            | [ ] |
| packager         | [ ] |
| license          | [ ] |
'------------------+-----'

Every time a package object is inserted in the cache, the resolvable entry is inserted, and the specific data for the resolvable kind is also created in an empty state. The id of the resolvable is returned.

The parser can then write the data by passing a structure like the one described above, where the first column names the field and the second marks whether the field is to be updated. A SQL UPDATE statement is generated from this data object, so adding the fields for the first time is no different from updating them.

This presents one problem. As the SQL is generated from the active fields of the data object, we can't precompile those update statements in advance. This is easily solved. We can assume that if a metadata parser inserts a given combination of fields for many packages, it will use the same combination for all packages in most cases. We can precompile the statement for a combination of fields and cache it in a pool of precompiled statements. When another data block is inserted, we look up whether a precompiled statement for its combination exists and use it. The only case that does not benefit is a parser that keeps updating different combinations of fields (and even that will hit the cache once all field combinations have been seen). So the problem has an easy solution.
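
The statement pool can be sketched as follows, in Python for brevity. The class and method names are hypothetical. The pool here stores the generated SQL text per field combination; Python's sqlite3 module keeps its own cache of compiled statements keyed by SQL text, so reusing the identical string stands in for a real pool of precompiled statements in the C API.

```python
import sqlite3
from typing import Dict, Tuple

# Field names follow the package_data diagram above.
FIELDS = ("summary", "description", "group", "packager", "license")


class PackageDataStore:
    """Hypothetical sketch of dynamic-field updates with a statement pool."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self._pool: Dict[Tuple[str, ...], str] = {}  # field combination -> UPDATE text
        self.pool_misses = 0

    def insert_empty(self) -> int:
        # Create the kind-specific row in an empty state and return its id.
        return self.conn.execute("INSERT INTO package_data DEFAULT VALUES").lastrowid

    def update(self, package_id: int, **fields: str) -> None:
        unknown = set(fields) - set(FIELDS)
        if unknown:
            raise KeyError(f"unknown fields: {sorted(unknown)}")
        # Canonical order, so the same combination always maps to the same key.
        active = tuple(name for name in FIELDS if name in fields)
        if not active:
            return
        sql = self._pool.get(active)
        if sql is None:
            self.pool_misses += 1
            assignments = ", ".join(f'"{name}" = ?' for name in active)
            sql = f"UPDATE package_data SET {assignments} WHERE package_id = ?"
            self._pool[active] = sql
        self.conn.execute(sql, [fields[name] for name in active] + [package_id])


conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE package_data (package_id INTEGER PRIMARY KEY, summary TEXT, '
    'description TEXT, "group" TEXT, packager TEXT, license TEXT)'
)
store = PackageDataStore(conn)
first, second = store.insert_empty(), store.insert_empty()
store.update(first, summary="Editor", description="A text editor")
store.update(second, description="A package manager", summary="Installer")  # same combination
store.update(first, license="GPL")  # new combination
```

After these three updates the pool holds two statements: the summary/description combination was generated once and reused for the second package.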

Query Layer

The Query layer allows iterating through the resolvable data in the cache, and provides specific attributes of a resolvable.
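
A minimal sketch of iterating resolvables and loading an attribute only on demand, in Python for brevity. The table layout and the class and function names are assumptions made for illustration, not the actual Query API.

```python
import sqlite3
from typing import Iterator, Optional

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE resolvable (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE package_data (package_id INTEGER, description TEXT);
    INSERT INTO resolvable VALUES (1, 'vim'), (2, 'zypper');
    INSERT INTO package_data VALUES (1, 'A text editor'), (2, 'A package manager');
""")


class CachedResolvable:
    """Holds only the basic data; the description is fetched on first access."""

    def __init__(self, conn: sqlite3.Connection, res_id: int, name: str):
        self._conn = conn
        self.id = res_id
        self.name = name
        self._description: Optional[str] = None
        self.attribute_queries = 0  # counts trips to the database

    @property
    def description(self) -> str:
        if self._description is None:
            self.attribute_queries += 1
            row = self._conn.execute(
                "SELECT description FROM package_data WHERE package_id = ?",
                (self.id,),
            ).fetchone()
            self._description = row[0] if row else ""
        return self._description


def query_resolvables(conn: sqlite3.Connection) -> Iterator[CachedResolvable]:
    for res_id, name in conn.execute("SELECT id, name FROM resolvable ORDER BY id"):
        yield CachedResolvable(conn, res_id, name)


resolvables = list(query_resolvables(conn))
```

Iterating touches only the resolvable table; the description of a package is read from the database the first time it is asked for and kept afterwards.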

sqlite3x layer

This is a very thin layer for using the sqlite3 C API in an object-oriented way, mostly to reduce the verbosity of the code (see [2]).

sqlite3 database

The schema is designed to avoid parsing when reading and to normalize duplicated strings.
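
String normalization can be illustrated with a small sketch, in Python for brevity. The names and requires tables below are made up for the example and are not the actual schema: each distinct string is stored once, and other rows refer to it by id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables, only to illustrate string normalization.
conn.executescript("""
    CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE requires (
        resolvable_id INTEGER,
        name_id       INTEGER REFERENCES names(id)
    );
""")


def name_id(conn: sqlite3.Connection, name: str) -> int:
    # Store the string only if it is not known yet, then return its id.
    conn.execute("INSERT OR IGNORE INTO names (name) VALUES (?)", (name,))
    return conn.execute("SELECT id FROM names WHERE name = ?", (name,)).fetchone()[0]


def add_requirement(conn: sqlite3.Connection, resolvable_id: int, name: str) -> None:
    conn.execute(
        "INSERT INTO requires VALUES (?, ?)", (resolvable_id, name_id(conn, name))
    )


# A thousand resolvables that all require the same capability.
for resolvable_id in range(1, 1001):
    add_requirement(conn, resolvable_id, "glibc")

stored_strings = conn.execute("SELECT COUNT(*) FROM names").fetchone()[0]
references = conn.execute("SELECT COUNT(*) FROM requires").fetchone()[0]
```

The string "glibc" ends up in the database once, while the thousand requirements only carry an integer reference to it.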

The actual schema and some research used in designing this schema can be found here.

Repository Handling

Use cases

[add repo 1]

  • The user adds a repository via "zypper sa" or YaST inst_source.
  • Input values: alias, url and path
    • Create a RepositoryInfo object.
    • If an alias was given, check that it is not a duplicate.
    • Add the repo info to the /etc/repos.d/alias file.

Add repository workflow

[remove repo 1]

  • The user removes a repository via "zypper sd" or YaST inst_source.
  • Input values: alias, known source list number, or url and path
    • Simply delete /etc/repos.d/alias.
    • Optionally, ask whether to clean the cache for this repo.

Remove repository workflow
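
The add and remove use cases above can be sketched as follows, in Python for brevity. The function names are hypothetical, the INI layout of the file is an assumption (this document does not specify the file format), and a temporary directory is used in place of /etc/repos.d.

```python
import configparser
import tempfile
from pathlib import Path


class DuplicateAliasError(Exception):
    """A repository with this alias is already known."""


def add_repo(repos_dir: Path, alias: str, url: str, path: str = "/") -> Path:
    repo_file = repos_dir / alias
    if repo_file.exists():
        raise DuplicateAliasError(alias)
    # Assumed file layout: one INI section named after the alias.
    info = configparser.ConfigParser(interpolation=None)
    info[alias] = {"url": url, "path": path}
    with repo_file.open("w") as handle:
        info.write(handle)
    return repo_file


def remove_repo(repos_dir: Path, alias: str) -> None:
    # Removing a repository is simply deleting its file.
    (repos_dir / alias).unlink()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        repos_dir = Path(tmp)
        add_repo(repos_dir, "factory", "http://example.com/factory")
        try:
            add_repo(repos_dir, "factory", "http://example.com/other")
            duplicate_rejected = False
        except DuplicateAliasError:
            duplicate_rejected = True
        listed = sorted(entry.name for entry in repos_dir.iterdir())
        remove_repo(repos_dir, "factory")
        return duplicate_rejected, listed, list(repos_dir.iterdir())
```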

[refresh repo]

  • Refresh a repository
  • Input values: alias, known source list number, or url and path
    • Start the Fetcher.
    • Tell the Fetcher to use the old cache location as the place to look for possibly unchanged files.
    • Download the metadata (changed files).
    • Empty the repo cache in the database.
    • Use the repo parser to fill a new cache.

[start pool]

  • Start the pool to do transactions
    • Look up all enabled repos in /etc/repos.d.
    • Look up the id of a repo.
    • If the repo is empty, ask for a refresh.
    • If the repo is set to autorefresh, refresh it.
    • Create a Cached Repository object, passing that id.
    • Add the resolvables into the pool.

[install pkg]

  • Install a package
    • start the pool
    • mark transaction
    • commit

Code Workflow and classes

whiteboard

The idea of the workflow is:

Image:ZyppRepoWorkflowSequence.png

  • The client asks the RepoManager for the known sources. This information is just hints.
  • The client asks the RepoManager to create a source based on these hints, using the cache.
  • The RepoManager complains that the source is not cached, and throws.
  • The client tries again, asking the RepoManager to cache the source based on the hints.
  • It then creates the repository.
  • Finally, the resolvables are created from the repository.
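
The workflow above can be sketched from the client's point of view, in Python for brevity. All class, method and exception names are hypothetical (the real interface is still an open question in this document), and the remote dictionary stands in for downloading and parsing the metadata.

```python
from dataclasses import dataclass
from typing import Dict, List


class RepoNotCachedError(Exception):
    """Thrown when a repository is requested from the cache but is not there."""


@dataclass(frozen=True)
class RepoInfo:
    alias: str
    url: str


@dataclass
class Repository:
    info: RepoInfo
    resolvables: List[str]


class RepoManager:
    def __init__(self, known: List[RepoInfo], remote: Dict[str, List[str]]):
        self._known = known
        self._remote = remote                 # stands in for download + parse
        self._cache: Dict[str, List[str]] = {}
        self.cache_builds = 0

    def known_repositories(self) -> List[RepoInfo]:
        return list(self._known)              # just hints

    def build_cache(self, info: RepoInfo) -> None:
        self.cache_builds += 1
        self._cache[info.alias] = list(self._remote[info.alias])

    def create_from_cache(self, info: RepoInfo) -> Repository:
        if info.alias not in self._cache:
            raise RepoNotCachedError(info.alias)
        return Repository(info, self._cache[info.alias])


def load_pool(manager: RepoManager) -> List[str]:
    pool: List[str] = []
    for info in manager.known_repositories():
        try:
            repo = manager.create_from_cache(info)
        except RepoNotCachedError:
            # Try again after asking the manager to cache the source.
            manager.build_cache(info)
            repo = manager.create_from_cache(info)
        pool.extend(repo.resolvables)
    return pool


manager = RepoManager(
    [RepoInfo("factory", "http://example.com/factory")],
    {"factory": ["vim", "zypper"]},
)
first_pool = load_pool(manager)
second_pool = load_pool(manager)  # served from the cache, no second build
```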

Advantages

  • At any moment the manager has a list of "instantiated" (restored) sources, as in SourceManager. The client has to keep a list of the Repositories it has created.

Open Questions

  • What does the client pass back to the RepoManager to create or cache a source: the complete hint (RepoInfo) or just a part?
  • What is the relation between the .repo file and the RepoInfo object (link, explicit, etc.)?

Target Commit