Libzypp/Refactoring/CacheSchema

From openSUSE

Contents

Current Schema

ZYpp Cache Propposed (WIP) SQL Schema
Enlarge
ZYpp Cache Propposed (WIP) SQL Schema

Solver Requisites

The queries for the solver include: (Kindly provided by schubi)

  1. get all installed items
  2. get all uninstalled items
  3. get all installed/uninstalled items of a specific kind
  4. just find installed item with same kind/name as item Helper::findInstalledByNameAndKind (const ResPool & pool, const string & name, const Resolvable::Kind & kind)
  5. just find uninstalled item with same kind/name as item Helper::findUninstalledByNameAndKind (const ResPool & pool, const string & name, const Resolvable::Kind & kind)
  6. just find best (according to edition) uninstalled item with same kind/name as item
    // *DOES* check edition
    PoolItem_Ref Helper::findUpdateItem (const ResPool & pool, PoolItem_Ref item)
  7. check if the given item is the best one of the pool
    // Takes arch, edition in order to check it
    bool Helper::isBestUninstalledItem (const ResPool & pool, PoolItem_Ref item)
  8. get all requires,obsoletes,provides,recommends,... of an item
  9. find all INSTALLED items which provides foo
  10. find all UNINSTALLED items which provides foo
  11. find all items which provides foo
  12. find all locked items
  13. find items which have a freshens entry with a given name or corresponds with one of a given name list
  14. find items which have a supplements entry with a given name or corresponds with one of a given name list
  15. find items which have a conflicts entry with a given name or corresponds with one of a given name list
  16. find items which have a recommend entry with a given name or corresponds with one of a given name list
  17. find all INSTALLED items which requires foo
  18. find all UNINSTALLED items which requires foo

Most of those can be represented by single table queries which don't depend on the schema at all except:

  • 6. requires edition comparision
  • 9. 10. 11. 17. 18. Style queries are analyzed in our example for different schemas
  • 13. 14. 15. 16. Same as above

Other system schema comparision

xpkg - http://www.xpkg.org

Uses a simple dependency model (name,major,minor). Database schema only has a packages table.

YUM - http://linux.duke.edu/projects/yum

/var/cache/yum/stable/
total 348M
0    cachecookie
16M  filelists.xml.gz
72M  filelists.xml.gz.sqlite
48   headers
45M  other.xml.gz
172M other.xml.gz.sqlite
48   packages
6.8M primary.xml.gz
37M  primary.xml.gz.sqlite
951  repomd.xml

The schema is basically a 1:1 copy of the XML schema with different databases per xml file. They also keep the raw cache.

The names are not normalized and it is easy to see it looking at the file sizes.

Summaries and descriptions are together with resolvable data.

The time to create factory cache from scratch via NFS is:

real    2m20.947s
user    1m47.155s
sys     0m9.697s

Once is created, looking for information about a file:

real    0m12.470s
user    0m7.028s
sys     0m0.464s

Solver queries with yum

Find package that provides foo:

   sqlite> select pk.name,pk.version,pk.release,pk.epoch from packages pk,provides pr where pk.pkgKey=pr.pkgKey and pr.name="kdelibs3";
   kdelibs3|3.5.6|4|0
   kdelibs3|3.5.6|4|0
   real    0m0.008s
   user    0m0.008s
   sys     0m0.000s
   sqlite> explain query plan select pk.name,pk.version,pk.release,pk.epoch from packages pk,provides pr where pk.pkgKey=pr.pkgKey and pr.name="kdelibs3";
   0|1|TABLE provides AS pr WITH INDEX providesname
   1|0|TABLE packages AS pk USING PRIMARY KEY
   sqlite>

Smart

Smart uses its own cache format. Creating it for stable-x86 results in a 6.4M cache file. Creating it takes:

   real    2m18.128s
   user    0m31.826s
   sys     0m2.740s

And later, searching for a package:

   real    0m3.198s
   user    0m2.628s
   sys     0m0.444s

ZYPP Proposed schema Tests

The current schema available in zypp-trunk.

Find package that provides foo:

   select r.name,r.version,r.release,r.epoch from resolvables r,dependencies d,versioned_dependencies vd,names n where d.resolvable_id=r.id and d.dependency_type=1 and vd.name_id=n.id and n.name="kdelibs3" and vd.dependency_id=d.id;
   kdelibs3|3.5.6|4|
   kdelibs3|3.5.6|4|
   
   real    0m0.695s
   user    0m0.636s
   sys     0m0.036s

Slow, this query is resolved as:

   0|2|TABLE versioned_dependencies AS vd
   1|1|TABLE dependencies AS d USING PRIMARY KEY
   2|0|TABLE resolvables AS r USING PRIMARY KEY
   3|3|TABLE names AS n USING PRIMARY KEY

Trying again with a new index:

   CREATE INDEX dependency_type ON dependencies (dependency_type);
   12M
   
   real    0m0.485s
   user    0m0.456s
   sys     0m0.024s

Creating an index for the versioned_dependencies:

   CREATE INDEX versioned_dependencies_dependency ON versioned_dependencies (dependency_id);
   
   0|3|TABLE names AS n WITH INDEX sqlite_autoindex_names_1
   1|1|TABLE dependencies AS d WITH INDEX dependency_type
   2|0|TABLE resolvables AS r USING PRIMARY KEY
   3|2|TABLE versioned_dependencies AS vd WITH INDEX versioned_dependencies_dependency
   
   real    0m0.579s
   user    0m0.552s
   sys     0m0.016s

Still slow. Last try:

create index vd_name_id on versioned_dependencies(name_id);

   real    0m0.007s
   user    0m0.004s
   sys     0m0.004s
 

This means the bottleneck was in the normalized names table. Now the query is done in this way:

   0|3|TABLE names AS n WITH INDEX sqlite_autoindex_names_1
   1|2|TABLE versioned_dependencies AS vd WITH INDEX vd_name_id
   2|1|TABLE dependencies AS d USING PRIMARY KEY
   3|0|TABLE resolvables AS r USING PRIMARY KEY

Adding all those indexes keep the database (without file lists and descriptions) in a 15 M size. Some extra indexes can be dropped if the query plan shows they are not used.

With 40.000 packages. A database without normalized names uses 50mb, while a normalized one uses around 43 mb.

The query in the normalized one takes 2.092s. Adding the names index bumps the size to 50mb and gets the query in 0.010s.

The query in the non-normalized one takes 0.443s. Adding the names index bumps the size to 76 mb and gets the query in 0.010s.


stable-x86 distribution facts

  • resolvable names: 8.204
  • unique resolvable names: 8.200
  • dependency names: 119.425
  • unique dependency names: 30.062

If we take the proportions as a fact for every repository, that means 14.5 dependencies per package in average.

If you take in count that adding a repository which is not factory has not much chance of adding much new names the unique 30k unique dependency names will not grow too much, where the number of dependencies still grows up.

Glob queries (ie: zypper)

   sqlite> explain query plan select * from dependencies where name like "kde%";
   0|0|TABLE dependencies

As we see, this query does not use any index, because, ok, there is none.

   sqlite> CREATE INDEX dependency_name ON dependencies (name);
   sqlite> explain query plan select * from dependencies where name like "kde%";
   0|0|TABLE dependencies

This makes sense, as from [1], "The GLOB and LIKE operators are expensive in SQLite because they can't make use of an index. One reason is that these are implemented by user functions which can be overridden". Index can be bypassed, if they slowdown the query:

Example: "WHERE x=5" can be changed to "WHERE x+0=5" or "WHERE +x=5" to bypass the index. ( [1] 5.5.3 )

In contrast, this query does use an index:

   sqlite> explain query plan select * from resolvables r,dependencies d where d.name="kdelibs";
   0|0|TABLE resolvables AS r
  1|1|TABLE dependencies AS d WITH INDEX dependency_name

Database Size

A zmd style schema made with repotools, of stable-x86 takes 5.2 M It is unrealistic as it does not contains descriptions but useful for comparision.

   CREATE INDEX dependency_name ON dependencies (name);

Ths increases the size of the database to 8.4M!!.

We normalize the database, taking the names to a new table.

   insert into dependencies (id, name_id, deptype_id, resolvable_id, refers_id, relation_id, epoch, version, release, pre ) select db.id, n.id, db.deptype_id, db.resolvable_id, db.refers_id, db.relation_id, db.epoch, db.version, db.release, db.pre from dependencies_backup db, names n where n.name=db.name;

The new result is a 30kb difference, after vacuum. (????????)

   5256192 stable.db
   5283840 stable.old.db
   sqlite> select sum(length(name)) from names;
   563224
   sqlite> select sum(length(name)) from dependencies;
   1922000

We should have saved at least 1.922.000 - 563.224 ?????

However the index size for the names column is much smaller. The database without normalized names jumps from 5.1M to 6.0M when creating a name index, instead of the 8.4M mentioned above.

Growing Test

zypp1.db uses name normalization for dependencies. zypp2.db does not.

8.000 packages
9.9M zypp1.db
9.8M zypp2.db
16.000 packages
18M zypp1.db
20M zypp2.db
24.000 packages
26M zypp1.db
30M zypp2.db
32.000 packages
34M zypp1.db
40M zypp2.db
40.000 packages
43M zypp1.db
50M zypp2.db

If x = thousand of packages, and y = database size. A normalized schema has a slomp of 1.1 and a unnormalized schema a slomp of 1.25

File list data

For factory. Complete file list is

2.840.491 files (unique names: 890.373)
119.070 directories

We ommited using a schema with one table and one column.

Test 1, Normalized Directories

|-----------------| |----------------------------------|
| dirs            | | files                            |
|-----------------| |----------------------------------|
| id  |   name    | | dir_id | package_id | name       |
|-----------------| |----------------------------------|

For yum xml format, this is 192M in filelists.xml which can be compressed to 17M filelists.xml.gz

Using a simple schema normalizing the directories:

186M filelist.db

Obtaining the file list of a package:

select d.name,f.name from files f,dirs d where f.dir_id=d.id and f.package_id=7557 takes 0.135s and was not able to improve it more using indexes. This assumes you have the key of the package.

Reverse lookup, find the id of a package that owns file and dir.

'select f.package_id from files f,dirs d where f.dir_id=d.id and f.name="libzypp.so" and d.name="/usr/lib";'
takes 1.582s

Adding a index on file name, grows the database to 259M, but resolves the query in 0.007s.

Test 1, Normalized Directories and file names

|-----------------| |-----------------|
| dir_names       | | file_names      |
|-----------------| |-----------------|
| id  |   name    | | id  |   name    |
|-----------------| |-----------------|
|----------------------------------|
| files                            |
|----------------------------------|
| dir_id | file_id | package_id    |
|----------------------------------|

This approach generates much slower insertions, which can be reduced with some application level cache, for example for the directories, which probably get inserted lot of files for the same directory. You can avoid N selects, keeping the dir_id and name in memory, where N is the average number of files per directory.

This generates a 193M filelist2.db. This includes an index on filenames, otherwise every insertion requires a select on a non indexed text column. So this size is comparable to the 259M non-normalized table.

I could not get the reverse lookup query under a second.

Application level cache for normalized insertions

|-----------------| |-----------------|
| table1          | | table2          |
|-----------------| |-----------------|
| id  |   name    | | id  |   name    |
|-----------------| |-----------------|
|----------------------------------|
| table1_table1                    |
|----------------------------------|
| t1_id  | t2_id   | other         |
|----------------------------------|

If you are inserting a list of:

table1   table2  other
-----------------------
dog      lalal   423432
dog      lelele  423432
cat      foo     423423
cat      bleh    423423
cat      ruby    423423
duck     nah     432432 

If you insert in table1_table1 you will need to check if an id already exists for table1 and table2, using text comparision, which is slow.

As we insert more than 1 table2 object for the same table1 object, the id of table1 can be cached, and the followings won't need another select statement.

Representing in memory data

There are entities that are in memory, like Architectures and we dont want to maintain them in the database store but we would like to be able to relate database information with memory information.

Virtual tables allow you to do this, so you can create a architectures table that is not stored on disk that is only a set of callbacks to answer queries but you can still use it in SQL.

See [3] for more information about virtual tables.

C style API vs C++ Style API

1000 insertions and iterating using a select

   (sqlite3x insert) 2:05 (u 0.07 s 1.09 c 0.00)
   (sqlite3x select) 0 (u 0.01 s 0.00 c 0.00)
   (sqlite-c insert) 2:05 (u 0.09 s 0.89 c 0.00)
   (sqlite-c select) 0 (u 0.00 s 0.00 c 0.00)

Oh sqlite is really slow to insert!!! no, I had autocommit enabled so there was one transaction per insert, if we do all inserts as a transaction

   (sqlite3x insert) 0 (u 0.01 s 0.00 c 0.00)
   (sqlite3x select) 0 (u 0.00 s 0.00 c 0.00)
   (sqlite-c insert) 1 (u 0.01 s 0.00 c 0.00)
   (sqlite-c select) 0 (u 0.00 s 0.00 c 0.00)

Ok, that is fast, lets do 10.000 inserts in a row.

   (sqlite3x insert) 0 (u 0.20 s 0.00 c 0.00)
   (sqlite3x select) 0 (u 0.04 s 0.00 c 0.00)
   (sqlite-c insert) 0 (u 0.20 s 0.02 c 0.00)
   (sqlite-c select) 0 (u 0.03 s 0.00 c 0.00)

repeat same test with flushed caches:

   (sqlite3x select) 0 (u 0.04 s 0.00 c 0.00)
   (sqlite3x select) 0 (u 0.04 s 0.00 c 0.00)
   (sqlite-c insert) 1 (u 0.18 s 0.00 c 0.00)
   (sqlite-c select) 0 (u 0.04 s 0.00 c 0.00)

Conclusion, the c++ layer does not add a significant performance penalty but does reduce the code by 30%-40% as errors are better handled using exceptions instead of checking error values. same with transactions using block scopes.

Some Conclusions

  • Normalization lead to slower queries (due to JOINs)
  • Normalized tables are faster than non-normalized tables using an index.
  • When both use an index, they reach the 0.01s lower limit
  • The index size is much smaller in normalized tables.

Recommendations

  1. Dependencies names normalization can save space. Especially with indexes.
  2. Dependency type normalization is cheap and can provide a flexible capability representation.
  3. separating dependency types in different tables can also save a JOIN, but if combined with [2], leads to (dep types)X(capability impls) tables.

References

  1. http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html
  2. http://developer.mozilla.org/en/docs/Storage:Performance
  3. http://www.sqlite.org/cvstrac/wiki?p=VirtualTables
  4. http://katastrophos.net/andre/blog/2007/01/04/sqlite-performance-tuning-and-optimization-on-embedded-systems/