Libzypp/Refactoring/CacheSchema
From openSUSE
Contents |
Current Schema
Solver Requisites
The queries for the solver include: (Kindly provided by schubi)
- get all installed items
- get all uninstalled items
- get all installed/uninstalled items of a specific kind
- just find installed item with same kind/name as item Helper::findInstalledByNameAndKind (const ResPool & pool, const string & name, const Resolvable::Kind & kind)
- just find uninstalled item with same kind/name as item Helper::findUninstalledByNameAndKind (const ResPool & pool, const string & name, const Resolvable::Kind & kind)
- just find best (according to edition) uninstalled item with same kind/name as item
- // *DOES* check edition
- PoolItem_Ref Helper::findUpdateItem (const ResPool & pool, PoolItem_Ref item)
- check if the given item is the best one of the pool
- // Takes arch, edition in order to check it
- bool Helper::isBestUninstalledItem (const ResPool & pool, PoolItem_Ref item)
- get all requires,obsoletes,provides,recommends,... of an item
- find all INSTALLED items which provides foo
- find all UNINSTALLED items which provides foo
- find all items which provides foo
- find all locked items
- find items which have a freshens entry with a given name or corresponds with one of a given name list
- find items which have a supplements entry with a given name or corresponds with one of a given name list
- find items which have a conflicts entry with a given name or corresponds with one of a given name list
- find items which have a recommend entry with a given name or corresponds with one of a given name list
- find all INSTALLED items which requires foo
- find all UNINSTALLED items which requires foo
Most of those can be represented by single table queries which don't depend on the schema at all except:
- 6. requires edition comparision
- 9. 10. 11. 17. 18. Style queries are analyzed in our example for different schemas
- 13. 14. 15. 16. Same as above
Other system schema comparision
xpkg - http://www.xpkg.org
Uses a simple dependency model (name,major,minor). Database schema only has a packages table.
YUM - http://linux.duke.edu/projects/yum
/var/cache/yum/stable/ total 348M 0 cachecookie 16M filelists.xml.gz 72M filelists.xml.gz.sqlite 48 headers 45M other.xml.gz 172M other.xml.gz.sqlite 48 packages 6.8M primary.xml.gz 37M primary.xml.gz.sqlite 951 repomd.xml
The schema is basically a 1:1 copy of the XML schema with different databases per xml file. They also keep the raw cache.
The names are not normalized and it is easy to see it looking at the file sizes.
Summaries and descriptions are together with resolvable data.
The time to create factory cache from scratch via NFS is:
real 2m20.947s user 1m47.155s sys 0m9.697s
Once is created, looking for information about a file:
real 0m12.470s user 0m7.028s sys 0m0.464s
Solver queries with yum
Find package that provides foo:
sqlite> select pk.name,pk.version,pk.release,pk.epoch from packages pk,provides pr where pk.pkgKey=pr.pkgKey and pr.name="kdelibs3"; kdelibs3|3.5.6|4|0 kdelibs3|3.5.6|4|0
real 0m0.008s user 0m0.008s sys 0m0.000s
sqlite> explain query plan select pk.name,pk.version,pk.release,pk.epoch from packages pk,provides pr where pk.pkgKey=pr.pkgKey and pr.name="kdelibs3"; 0|1|TABLE provides AS pr WITH INDEX providesname 1|0|TABLE packages AS pk USING PRIMARY KEY sqlite>
Smart
Smart uses its own cache format. Creating it for stable-x86 results in a 6.4M cache file. Creating it takes:
real 2m18.128s user 0m31.826s sys 0m2.740s
And later, searching for a package:
real 0m3.198s user 0m2.628s sys 0m0.444s
ZYPP Proposed schema Tests
The current schema available in zypp-trunk.
Find package that provides foo:
select r.name,r.version,r.release,r.epoch from resolvables r,dependencies d,versioned_dependencies vd,names n where d.resolvable_id=r.id and d.dependency_type=1 and vd.name_id=n.id and n.name="kdelibs3" and vd.dependency_id=d.id; kdelibs3|3.5.6|4| kdelibs3|3.5.6|4| real 0m0.695s user 0m0.636s sys 0m0.036s
Slow, this query is resolved as:
0|2|TABLE versioned_dependencies AS vd 1|1|TABLE dependencies AS d USING PRIMARY KEY 2|0|TABLE resolvables AS r USING PRIMARY KEY 3|3|TABLE names AS n USING PRIMARY KEY
Trying again with a new index:
CREATE INDEX dependency_type ON dependencies (dependency_type); 12M real 0m0.485s user 0m0.456s sys 0m0.024s
Creating an index for the versioned_dependencies:
CREATE INDEX versioned_dependencies_dependency ON versioned_dependencies (dependency_id); 0|3|TABLE names AS n WITH INDEX sqlite_autoindex_names_1 1|1|TABLE dependencies AS d WITH INDEX dependency_type 2|0|TABLE resolvables AS r USING PRIMARY KEY 3|2|TABLE versioned_dependencies AS vd WITH INDEX versioned_dependencies_dependency real 0m0.579s user 0m0.552s sys 0m0.016s
Still slow. Last try:
create index vd_name_id on versioned_dependencies(name_id);
real 0m0.007s user 0m0.004s sys 0m0.004s
This means the bottleneck was in the normalized names table. Now the query is done in this way:
0|3|TABLE names AS n WITH INDEX sqlite_autoindex_names_1 1|2|TABLE versioned_dependencies AS vd WITH INDEX vd_name_id 2|1|TABLE dependencies AS d USING PRIMARY KEY 3|0|TABLE resolvables AS r USING PRIMARY KEY
Adding all those indexes keep the database (without file lists and descriptions) in a 15 M size. Some extra indexes can be dropped if the query plan shows they are not used.
With 40.000 packages. A database without normalized names uses 50mb, while a normalized one uses around 43 mb.
The query in the normalized one takes 2.092s. Adding the names index bumps the size to 50mb and gets the query in 0.010s.
The query in the non-normalized one takes 0.443s. Adding the names index bumps the size to 76 mb and gets the query in 0.010s.
stable-x86 distribution facts
- resolvable names: 8.204
- unique resolvable names: 8.200
- dependency names: 119.425
- unique dependency names: 30.062
If we take the proportions as a fact for every repository, that means 14.5 dependencies per package in average.
If you take in count that adding a repository which is not factory has not much chance of adding much new names the unique 30k unique dependency names will not grow too much, where the number of dependencies still grows up.
Glob queries (ie: zypper)
sqlite> explain query plan select * from dependencies where name like "kde%"; 0|0|TABLE dependencies
As we see, this query does not use any index, because, ok, there is none.
sqlite> CREATE INDEX dependency_name ON dependencies (name); sqlite> explain query plan select * from dependencies where name like "kde%"; 0|0|TABLE dependencies
This makes sense, as from [1], "The GLOB and LIKE operators are expensive in SQLite because they can't make use of an index. One reason is that these are implemented by user functions which can be overridden". Index can be bypassed, if they slowdown the query:
Example: "WHERE x=5" can be changed to "WHERE x+0=5" or "WHERE +x=5" to bypass the index. ( [1] 5.5.3 )
In contrast, this query does use an index:
sqlite> explain query plan select * from resolvables r,dependencies d where d.name="kdelibs"; 0|0|TABLE resolvables AS r 1|1|TABLE dependencies AS d WITH INDEX dependency_name
Database Size
A zmd style schema made with repotools, of stable-x86 takes 5.2 M It is unrealistic as it does not contains descriptions but useful for comparision.
CREATE INDEX dependency_name ON dependencies (name);
Ths increases the size of the database to 8.4M!!.
We normalize the database, taking the names to a new table.
insert into dependencies (id, name_id, deptype_id, resolvable_id, refers_id, relation_id, epoch, version, release, pre ) select db.id, n.id, db.deptype_id, db.resolvable_id, db.refers_id, db.relation_id, db.epoch, db.version, db.release, db.pre from dependencies_backup db, names n where n.name=db.name;
The new result is a 30kb difference, after vacuum. (????????)
5256192 stable.db 5283840 stable.old.db
sqlite> select sum(length(name)) from names; 563224 sqlite> select sum(length(name)) from dependencies; 1922000
We should have saved at least 1.922.000 - 563.224 ?????
However the index size for the names column is much smaller. The database without normalized names jumps from 5.1M to 6.0M when creating a name index, instead of the 8.4M mentioned above.
Growing Test
zypp1.db uses name normalization for dependencies. zypp2.db does not.
8.000 packages 9.9M zypp1.db 9.8M zypp2.db
16.000 packages 18M zypp1.db 20M zypp2.db
24.000 packages 26M zypp1.db 30M zypp2.db
32.000 packages 34M zypp1.db 40M zypp2.db
40.000 packages 43M zypp1.db 50M zypp2.db
If x = thousand of packages, and y = database size. A normalized schema has a slomp of 1.1 and a unnormalized schema a slomp of 1.25
File list data
For factory. Complete file list is
2.840.491 files (unique names: 890.373) 119.070 directories
We ommited using a schema with one table and one column.
Test 1, Normalized Directories
|-----------------| |----------------------------------| | dirs | | files | |-----------------| |----------------------------------| | id | name | | dir_id | package_id | name | |-----------------| |----------------------------------|
For yum xml format, this is 192M in filelists.xml which can be compressed to 17M filelists.xml.gz
Using a simple schema normalizing the directories:
186M filelist.db
Obtaining the file list of a package:
select d.name,f.name from files f,dirs d where f.dir_id=d.id and f.package_id=7557 takes 0.135s and was not able to improve it more using indexes. This assumes you have the key of the package.
Reverse lookup, find the id of a package that owns file and dir.
'select f.package_id from files f,dirs d where f.dir_id=d.id and f.name="libzypp.so" and d.name="/usr/lib";' takes 1.582s
Adding a index on file name, grows the database to 259M, but resolves the query in 0.007s.
Test 1, Normalized Directories and file names
|-----------------| |-----------------| | dir_names | | file_names | |-----------------| |-----------------| | id | name | | id | name | |-----------------| |-----------------|
|----------------------------------| | files | |----------------------------------| | dir_id | file_id | package_id | |----------------------------------|
This approach generates much slower insertions, which can be reduced with some application level cache, for example for the directories, which probably get inserted lot of files for the same directory. You can avoid N selects, keeping the dir_id and name in memory, where N is the average number of files per directory.
This generates a 193M filelist2.db. This includes an index on filenames, otherwise every insertion requires a select on a non indexed text column. So this size is comparable to the 259M non-normalized table.
I could not get the reverse lookup query under a second.
Application level cache for normalized insertions
|-----------------| |-----------------| | table1 | | table2 | |-----------------| |-----------------| | id | name | | id | name | |-----------------| |-----------------|
|----------------------------------| | table1_table1 | |----------------------------------| | t1_id | t2_id | other | |----------------------------------|
If you are inserting a list of:
table1 table2 other ----------------------- dog lalal 423432 dog lelele 423432 cat foo 423423 cat bleh 423423 cat ruby 423423 duck nah 432432
If you insert in table1_table1 you will need to check if an id already exists for table1 and table2, using text comparision, which is slow.
As we insert more than 1 table2 object for the same table1 object, the id of table1 can be cached, and the followings won't need another select statement.
Representing in memory data
There are entities that are in memory, like Architectures and we dont want to maintain them in the database store but we would like to be able to relate database information with memory information.
Virtual tables allow you to do this, so you can create a architectures table that is not stored on disk that is only a set of callbacks to answer queries but you can still use it in SQL.
See [3] for more information about virtual tables.
C style API vs C++ Style API
1000 insertions and iterating using a select
(sqlite3x insert) 2:05 (u 0.07 s 1.09 c 0.00) (sqlite3x select) 0 (u 0.01 s 0.00 c 0.00)
(sqlite-c insert) 2:05 (u 0.09 s 0.89 c 0.00) (sqlite-c select) 0 (u 0.00 s 0.00 c 0.00)
Oh sqlite is really slow to insert!!! no, I had autocommit enabled so there was one transaction per insert, if we do all inserts as a transaction
(sqlite3x insert) 0 (u 0.01 s 0.00 c 0.00) (sqlite3x select) 0 (u 0.00 s 0.00 c 0.00)
(sqlite-c insert) 1 (u 0.01 s 0.00 c 0.00) (sqlite-c select) 0 (u 0.00 s 0.00 c 0.00)
Ok, that is fast, lets do 10.000 inserts in a row.
(sqlite3x insert) 0 (u 0.20 s 0.00 c 0.00) (sqlite3x select) 0 (u 0.04 s 0.00 c 0.00)
(sqlite-c insert) 0 (u 0.20 s 0.02 c 0.00) (sqlite-c select) 0 (u 0.03 s 0.00 c 0.00)
repeat same test with flushed caches:
(sqlite3x select) 0 (u 0.04 s 0.00 c 0.00) (sqlite3x select) 0 (u 0.04 s 0.00 c 0.00)
(sqlite-c insert) 1 (u 0.18 s 0.00 c 0.00) (sqlite-c select) 0 (u 0.04 s 0.00 c 0.00)
Conclusion, the c++ layer does not add a significant performance penalty but does reduce the code by 30%-40% as errors are better handled using exceptions instead of checking error values. same with transactions using block scopes.
Some Conclusions
- Normalization lead to slower queries (due to JOINs)
- Normalized tables are faster than non-normalized tables using an index.
- When both use an index, they reach the 0.01s lower limit
- The index size is much smaller in normalized tables.
Recommendations
- Dependencies names normalization can save space. Especially with indexes.
- Dependency type normalization is cheap and can provide a flexible capability representation.
- separating dependency types in different tables can also save a JOIN, but if combined with [2], leads to (dep types)X(capability impls) tables.

