The full sync threshold was a hack to ensure we synced the deck in a
memory-efficient way if there was a lot of data to send. The problem is that
it's not easy for the user to predict how many changes there are, and so it
might come as a surprise to them when a sync suddenly switches to a full sync.
In order to be able to send changes in chunks rather than all at once, some
changes had to be made:
- Clients now set usn=-1 when they modify an object, which allows us to
distinguish between objects that have been modified on the server, and ones
that have been modified on the client. If we don't do this, we would have to
buffer the local changes in a temporary location before adding the server
changes.
- Before a client sends the objects to the server, it changes the usn to
maxUsn both in the payload and the local storage.
- We do deletions at the start
- To determine which card or fact is newer, we have to fetch the modification
time of the local version. We do this in batches rather than try to load the
entire list in memory.
Ported the sync code to the latest libanki structure. Key points:
No summary:
The old style got each side to fetch ids+mod times and required the client to
diff them and then request or bundle up the appropriate objects. Instead, we now
get each side to send all changed objects, and it's the responsibility of the
other side to decide what needs to be merged and what needs to be discarded.
This allows us to skip a separate summary step, which saves scanning tables
twice, and allows us to reduce server requests from 4 to 3.
Schema changes:
Certain operations that are difficult to merge (such as changing the number of
fields in a model, or deleting models or groups) result in a full sync. The
user is warned about it in the GUI before such schema-changing operations
execute.
Sync size:
For now, we don't try to deal with large incremental syncs. Because the cards,
facts and revlog can be large in memory (hundreds of megabytes in some cases),
they would have to be chunked for the benefit of devices with a low amount of
memory.
Currently findChanges() uses the full fact/card objects which we're planning to
send to the server. It could be rewritten to fetch a summary (just the id, mod
& rep columns) which would save some memory, and then compare against blocks
of a few hundred remote objects at a time. However, it's a bit more
complicated than that:
- If the local summary is huge it could exceed memory limits. Without a local
summary we'd have to query the db for each record, which could be a lot
slower.
- We currently accumulate a list of remote records we need to add locally.
This list also has the potential to get too big. We would need to
periodically commit the changes as we accumulate them.
- Merging a large amount of changes is also potentially slow on mobile
devices.
Given the fact that certain schema-changing operations require a full sync
anyway, I think it's probably best to concentrate on a chunked full sync for
now instead, as provided the user syncs periodically it should not be easy to
hit the full sync limits except after bulk editing operations.
Chunked partial syncing should be possible to add in the future without any
changes to the deck format.
Still to do:
- deck conf merging
- full syncing
- new http proxy
SQLAlchemy is a great tool, but it wasn't a great fit for Anki:
- We often had to drop down to raw SQL for performance reasons.
- The DB cursors and results were wrapped, which incurred a
sizable performance hit due to introspection. Operations like fetching 50k
records from a hot cache were taking more than twice as long to complete.
- We take advantage of sqlite-specific features, so SQL language abstraction
is useless to us.
- The anki schema is quite small, so manually saving and loading objects is
not a big burden.
In the process of porting to DBAPI, I've refactored the database schema:
- App configuration data that we don't need in joins or bulk updates has been
moved into JSON objects. This simplifies serializing, and means we won't
need DB schema changes to store extra options in the future. This change
obsoletes the deckVars table.
- Renamed tables:
-- fieldModels -> fields
-- cardModels -> templates
-- fields -> fdata
- a number of attribute names have been shortened
Classes like Card, Fact & Model remain. They maintain a reference to the deck.
To write their state to the DB, call .flush().
Objects no longer have their modification time manually updated. Instead, the
modification time is updated when they are flushed. This also applies to the
deck.
Decks will now save on close, because various operations that were done at
deck load will be moved into deck close instead. Operations like undoing
buried card are cheap on a hot cache, but expensive on startup.
Programmatically you can call .close(save=False) to avoid a save and a
modification bump. This will be useful for generating due counts.
Because of the new saving behaviour, the save and save as options will be
removed from the GUI in the future.
The q/a cache and field cache generating has been centralized. Facts will
automatically rebuild the cache on flush; models can do so with
model.updateCache().
Media handling has also been reworked. It has moved into a MediaRegistry
object, which the deck holds. Refcounting has been dropped - it meant we had
to compare old and new value every time facts or models were changed, and
existed for the sole purpose of not showing errors on a missing media
download. Instead we just media.registerText(q+a) when it's updated. The
download function will be expanded to ask the user if they want to continue
after a certain number of files have failed to download, which should be an
adequate alternative. And we now add the file into the media DB when it's
copied to th emedia directory, not when the card is commited. This fixes
duplicates a user would get if they added the same media to a card twice
without adding the card.
The old DeckStorage object had its upgrade code split in a previous commit;
the opening and upgrading code has been merged back together, and put in a
separate storage.py file. The correct way to open a deck now is import anki; d
= anki.Deck(path).
deck.getCard() -> deck.sched.getCard()
same with answerCard
deck.getCard(id) returns a Card object now.
And the DB wrapper has had a few changes:
- sql statements are a more standard DBAPI:
- statement() -> execute()
- statements() -> executemany()
- called like execute(sql, 1, 2, 3) or execute(sql, a=1, b=2, c=3)
- column0 -> list