- make sure we're actually stripping text in the field cache
- make sure a default group is added on upgrade
- make sure old style field references are upgraded
SQLAlchemy is a great tool, but it wasn't a great fit for Anki:
- We often had to drop down to raw SQL for performance reasons.
- The DB cursors and results were wrapped, which incurred a
sizable performance hit due to introspection. Operations like fetching 50k
records from a hot cache were taking more than twice as long to complete.
- We take advantage of sqlite-specific features, so SQL language abstraction
is useless to us.
- The anki schema is quite small, so manually saving and loading objects is
not a big burden.
In the process of porting to DBAPI, I've refactored the database schema:
- App configuration data that we don't need in joins or bulk updates has been
moved into JSON objects. This simplifies serializing, and means we won't
need DB schema changes to store extra options in the future. This change
obsoletes the deckVars table.
- Renamed tables:
-- fieldModels -> fields
-- cardModels -> templates
-- fields -> fdata
- a number of attribute names have been shortened
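As a rough illustration of the JSON approach, the sketch below stores options in a single text column and adds a new option without touching the schema. The table layout and every option name other than newSpacing are made up for the example and are not the actual Anki schema.

```python
import json
import sqlite3

# Hypothetical single-row table; the real schema and column names may differ.
db = sqlite3.connect(":memory:")
db.execute("create table deck (id integer primary key, conf text not null)")

conf = {"newSpacing": 60, "sessionLimit": 600}  # illustrative option names
db.execute("insert into deck (id, conf) values (1, ?)", (json.dumps(conf),))

# Adding a new option later is a dict update, not an ALTER TABLE.
loaded = json.loads(db.execute("select conf from deck where id = 1").fetchone()[0])
loaded["showTimer"] = True
db.execute("update deck set conf = ? where id = 1", (json.dumps(loaded),))

roundtrip = json.loads(db.execute("select conf from deck where id = 1").fetchone()[0])
```

The trade-off is that values stored this way cannot be used cheaply in joins or bulk updates, which is why only data that does not need them is moved.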
Classes like Card, Fact & Model remain. They maintain a reference to the deck.
To write their state to the DB, call .flush().
Objects no longer have their modification time manually updated. Instead, the
modification time is updated when they are flushed. This also applies to the
deck.
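A minimal sketch of the flush pattern, assuming a simplified Card that holds the DB handle directly rather than a deck reference. The class body, table and column names are illustrative only, not the real implementation.

```python
import sqlite3
import time


class Card:
    """Hypothetical stand-in for the Card class; column names are illustrative."""

    def __init__(self, deck_db, id, due=0):
        self.db = deck_db
        self.id = id
        self.due = due
        self.mod = 0

    def flush(self):
        # The modification time is stamped here rather than by the caller.
        self.mod = int(time.time())
        self.db.execute(
            "insert or replace into cards (id, due, mod) values (?, ?, ?)",
            (self.id, self.due, self.mod),
        )


db = sqlite3.connect(":memory:")
db.execute("create table cards (id integer primary key, due integer, mod integer)")

card = Card(db, 1)
card.due = 5   # change state in memory; no mod bookkeeping needed
card.flush()   # write to the DB and bump the modification time
row = db.execute("select due, mod from cards where id = 1").fetchone()
```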
Decks will now save on close, because various operations that were done at
deck load will be moved into deck close instead. Operations like undoing
buried cards are cheap on a hot cache, but expensive on startup.
Programmatically you can call .close(save=False) to avoid a save and a
modification bump. This will be useful for generating due counts.
Because of the new saving behaviour, the save and save as options will be
removed from the GUI in the future.
Generation of the q/a cache and field cache has been centralized. Facts will
automatically rebuild the cache on flush; models can do so with
model.updateCache().
Media handling has also been reworked. It has moved into a MediaRegistry
object, which the deck holds. Refcounting has been dropped - it meant we had
to compare old and new values every time facts or models were changed, and
existed for the sole purpose of not showing errors on a missing media
download. Instead we just media.registerText(q+a) when it's updated. The
download function will be expanded to ask the user if they want to continue
after a certain number of files have failed to download, which should be an
adequate alternative. And we now add the file into the media DB when it's
copied to the media directory, not when the card is committed. This fixes
duplicates a user would get if they added the same media to a card twice
without adding the card.
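To illustrate why registering at copy time avoids the duplicate, here is a small sketch. The add_file helper, the media table layout and the checksum lookup are assumptions made for the example, not the MediaRegistry API, and a filename clash between different files is deliberately not handled.

```python
import hashlib
import os
import shutil
import sqlite3
import tempfile


def add_file(db, media_dir, src_path):
    """Copy src_path into media_dir and record it immediately.

    Returns the name stored in the media directory. The checksum lookup is
    an assumption about how a repeated add could be detected.
    """
    with open(src_path, "rb") as f:
        csum = hashlib.md5(f.read()).hexdigest()
    row = db.execute("select fname from media where csum = ?", (csum,)).fetchone()
    if row:
        return row[0]  # same content was already copied: reuse it
    fname = os.path.basename(src_path)
    shutil.copyfile(src_path, os.path.join(media_dir, fname))
    db.execute("insert into media (fname, csum) values (?, ?)", (fname, csum))
    return fname


db = sqlite3.connect(":memory:")
db.execute("create table media (fname text primary key, csum text)")
work = tempfile.mkdtemp()
media_dir = os.path.join(work, "media")
os.mkdir(media_dir)
src = os.path.join(work, "cat.jpg")
with open(src, "wb") as f:
    f.write(b"not really a jpeg")

first = add_file(db, media_dir, src)
second = add_file(db, media_dir, src)  # added twice before the card is saved
count = db.execute("select count(*) from media").fetchone()[0]
```

Because the first add is already in the DB when the second one happens, the second call finds it and no extra copy is made.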
The old DeckStorage object had its upgrade code split in a previous commit;
the opening and upgrading code has been merged back together, and put in a
separate storage.py file. The correct way to open a deck now is import anki; d
= anki.Deck(path).
deck.getCard() -> deck.sched.getCard()
same with answerCard
deck.getCard(id) returns a Card object now.
And the DB wrapper has had a few changes:
- SQL statement methods now follow the standard DBAPI names:
- statement() -> execute()
- statements() -> executemany()
- called like execute(sql, 1, 2, 3) or execute(sql, a=1, b=2, c=3)
- column0 -> list
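The calling convention above could look roughly like this on top of the standard sqlite3 module. The DB class here is a hypothetical sketch written for this note, not the actual wrapper code.

```python
import sqlite3


class DB:
    """Hypothetical thin wrapper over sqlite3; not the actual wrapper code."""

    def __init__(self, path):
        self._db = sqlite3.connect(path)

    def execute(self, sql, *args, **kwargs):
        # Positional args bind to "?" placeholders, keyword args to ":name".
        if kwargs:
            return self._db.execute(sql, kwargs)
        return self._db.execute(sql, args)

    def executemany(self, sql, rows):
        return self._db.executemany(sql, rows)

    def list(self, sql, *args, **kwargs):
        # Replaces column0(): the first column of every row as a plain list.
        return [row[0] for row in self.execute(sql, *args, **kwargs)]


db = DB(":memory:")
db.execute("create table facts (id integer primary key, pos integer)")
db.executemany("insert into facts values (?, ?)", [(1, 10), (2, 20), (3, 30)])
by_position = db.list("select id from facts where pos > ? order by id", 10)
by_name = db.list("select id from facts where pos = :p", p=10)
```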
- removed 'created' column from various tables. We don't care when things like
models are created, and card creation time didn't reflect the actual time a
card was created
- facts were previously ordered by their creation date. The code would
manually offset the creation time of subsequent facts on import by 0.0001
seconds, and then card due times were set by adding the fact time to the
ordinal number*0.000001. This was prone to error, and the number of zeros used
was actually different in different parts of the code. Instead of this, we
replace it with a 'pos' column on facts, which increments for each new fact.
- importing should add new facts with a higher pos, but concurrent updates in
a synced deck can have multiple facts with the same pos
- due times are completely different now, and depend on the card type
- new cards have due=fact.pos or random(0, 10000)
- reviews have due set to an integer representing days since deck
creation/download
- cards in the learn queue use an integer timestamp in seconds
- many columns like modified, lastSync, factor, interval, etc have been converted to
integer columns. They are cheaper to store (large decks can save 10s of
megabytes) and faster to search for.
- cards have their group assigned on fact creation. In the future we'll add a
per-template option for a default group.
- switch to due/random order for the review queue on upgrade. Users can still
switch to the old behaviour if they want, but many people don't care what
it's set to, and due is considerably faster, which may result in a better
user experience
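The three kinds of due value described above can be sketched as follows. The function names are hypothetical, and treating random(0, 10000) as a half-open range is an assumption.

```python
import random

SECONDS_PER_DAY = 86400


def new_card_due(fact_pos, randomize=False):
    # New cards: ordered by the fact's position, or shuffled.
    # Assumption: the random range excludes 10000.
    return random.randrange(0, 10000) if randomize else fact_pos


def review_due(deck_created, due_time):
    # Reviews: whole days elapsed since the deck was created/downloaded.
    return int((due_time - deck_created) // SECONDS_PER_DAY)


def learn_due(due_time):
    # Learn queue: an integer Unix timestamp in seconds.
    return int(due_time)


created = 1_300_000_000  # illustrative deck creation time
example = {
    "new": new_card_due(42),
    "review": review_due(created, created + 3 * SECONDS_PER_DAY + 100),
    "learn": learn_due(created + 600.7),
}
```

Since the meaning of due depends on the card type, any query that sorts or filters on it has to restrict itself to one type at a time.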
Users who want to study small subsections at one time (eg, "lesson 14") are
currently best served by creating lots of little decks. This is because:
- selective study is a bit cumbersome to switch between
- the graphs and statistics are for the entire deck
- selective study can be slow on mobile devices - when the list of cards to
hide/show is big, or when there are many due cards, performance can suffer
- scheduling can only be configured per deck
Groups are intended to address the above problems. All cards start off in the
same group, but they can have their group changed. Unlike tags, cards can only
be a member of a single group at one time. This allows us to divide the deck
up into a non-overlapping set of cards, which will make things like showing
due counts for a single category considerably cheaper. The user interface
might want to show something like a deck browser for decks that have more than
one group, showing due counts and allowing people to study each group
individually, or to study all at once.
Instead of storing the scheduling config in the deck or the model, we move the
scheduling into a separate config table, and link that to the groups table.
That way a user can have multiple groups that all share the same scheduling
information if they want.
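A sketch of how the group and config tables might relate, and why per-group due counts become cheap. All table, column and index names below are assumptions for illustration, and due is simplified to a plain day number.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
create table gconf (id integer primary key, conf text not null);
create table groups (id integer primary key, name text not null,
                     gcid integer not null references gconf(id));
create table cards (id integer primary key, gid integer not null,
                    due integer not null);
create index ix_cards_gid_due on cards (gid, due);
""")

# Two groups sharing one scheduling config.
db.execute("insert into gconf values (1, '{}')")
db.executemany("insert into groups values (?, ?, ?)",
               [(1, "lesson 13", 1), (2, "lesson 14", 1)])
db.executemany("insert into cards values (?, ?, ?)",
               [(1, 1, 3), (2, 1, 9), (3, 2, 2), (4, 2, 4), (5, 2, 5)])

today = 5
# Each card is in exactly one group, so a single GROUP BY gives all counts.
due_counts = dict(db.execute(
    "select gid, count(*) from cards where due <= ? group by gid", (today,)))
```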
And deletion tracking is now in a single table.
- model config is now stored as a json-serialized dict, which allows us to
quickly gather the info and allows for adding extra options more easily in
the future
- denormalize modelId into the cards table, so we can get the model scheduling
information without having to hit the facts table
- remove position - since we will handle spacing differently we don't need a
separate variable from due to define sort order
- remove lastInterval from cards; the new cram mode and review early shouldn't
need it
- successive->streak
- add new columns for learn mode
- move cram mode into new file; learn more and review early need more thought
- initial work on learn mode
- initial unit tests
- move most scheduling parameters from deck to models
- remove obsolete fields in deck and models
- decks->deck
- remove deck id reference in models
- move some deckVars into the deck table
- simplify deckstorage
- lock sessionhelper by default
- add models/currentModel as properties instead of ORM mappings
- remove models.tags
- remove remaining support for memory-backed databases
- use a blank string for syncName instead of null
- remove backup code; will handle in gui
- bump version to 100
- update unit tests
Previously we had an index on the value field, which was very expensive for
long fields. Instead we use a separate column and take the first 8 characters
of the field value's md5sum, and index that. In decks with lots of text in
fields, it can cut the deck size by 30% or more, and many decks improve by
10-20%. Decks with only a few characters in fields may increase in size
slightly, but this is offset by the fact that we only generate a checksum for
fields that have uniqueness checking on.
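The checksum scheme can be sketched as below. The table layout, the helper names, and the extra comparison against the full value (to guard against checksum collisions) are assumptions for the example.

```python
import hashlib
import sqlite3


def field_checksum(value):
    # First 8 hex characters of the MD5 digest of the field's text.
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:8]


db = sqlite3.connect(":memory:")
# Hypothetical table: the checksum column is indexed, the value itself is not.
db.execute("create table fdata (id integer primary key, val text, csum text)")
db.execute("create index ix_fdata_csum on fdata (csum)")


def add_field(val, unique):
    # Only fields with uniqueness checking get a checksum.
    csum = field_checksum(val) if unique else ""
    db.execute("insert into fdata (val, csum) values (?, ?)", (val, csum))


def is_duplicate(val):
    # The index narrows the search; comparing val guards against collisions.
    row = db.execute(
        "select 1 from fdata where csum = ? and val = ?",
        (field_checksum(val), val),
    ).fetchone()
    return row is not None


add_field("hello", unique=True)
add_field("a long example sentence", unique=False)
```

The index now covers at most 8 characters per row regardless of how long the field text is, which is where the size saving comes from.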
Also, fixed import->update reporting the total # of available facts instead of
the number of facts that were imported.
- media is no longer hashed, and instead stored in the db using its original
name
- when adding media, its checksum is calculated and used to look for
duplicates
- duplicate filenames will result in a number tacked onto the filename
- the size column is used to count card references to media. If media is
referenced in a fact but not the question or answer, the count will be zero.
- there is no guarantee media will be listed in the media db if it is unused
on the question & answer
- if rebuildMediaDir(delete=True), then entries with zero references are
deleted, along with any unused files in the media dir.
- rebuildMediaDir() will update the internal checksums, and set the checksum
to "" if a file can't be found
- rebuildMediaDir() is a lot less destructive now, and will leave alone
directories it finds in the media folder (but not look in them either)
- rebuildMediaDir() returns more information about the state of media now
- the online and mobile clients will need to make sure that when
downloading media, entries with no checksum are non-fatal and should not
abort the download process.
- the ref count is updated every time the q/a is updated - so the db should be
up to date after every add/edit/import
- since we look for media on the q/a now, card templates like '<img
src="{{{field}}}">' will work now
- export original files is gone, as it is not needed anymore
- move from per-model media URL to deckVar. downloadMissingMedia() uses this
now. Deck subscriptions will have to be updated to share media another way.
- pass deck in formatQA, as latex support is going to change
- obsolete spaceUntil - it serves no useful purpose
- the old per-model spacing variables are obsolete, as the new approach
requires uniform spacing across all models for new cards
- introduce a new per-deck variable: newSpacing
- don't fill new queue if we've done today's cards
- still need to check cramming / review early
newSpacing is a time in seconds to delay introduction of sibling new cards.
It can be applied as many times as necessary as there is no harm in new cards
being delayed repeatedly. Because the default queue length is 200 and it can
take quite some time for the spaced cards to be placed in the queue again, we
use a separate array to track spaced new cards provided the configured delay
is less than 20 minutes. At times under 20 minutes this number is not a
guaranteed minimum spacing - if the new card queue is empty the spaced cards
will be flushed before checking the new queue again, as otherwise we end up
trying to fill on every repetition. The due counts no longer decrease by more
than one if the spacing is less than the due cutoff, since that confused some
users.
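One way to picture the separate array for spaced new cards is the sketch below. The NewQueue class and its fields are hypothetical, time is passed in explicitly to keep the example deterministic, and the 20 minute threshold and due count handling are left out.

```python
class NewQueue:
    """Hypothetical sketch: a new-card queue with a side list for spaced siblings."""

    def __init__(self, cards, new_spacing):
        self.queue = list(cards)        # (card_id, fact_id) pairs, in order
        self.spaced = []                # (release_time, card_id, fact_id)
        self.new_spacing = new_spacing  # delay in seconds

    def _release(self, now, force=False):
        # Move spaced cards whose delay has passed (or all, if forced) back.
        ready = [s for s in self.spaced if force or s[0] <= now]
        self.spaced = [s for s in self.spaced if s not in ready]
        self.queue.extend((cid, fid) for _, cid, fid in ready)

    def pop(self, now):
        self._release(now)
        if not self.queue:
            # Queue ran dry: flush spaced cards early instead of refilling,
            # so the spacing is not a guaranteed minimum.
            self._release(now, force=True)
        if not self.queue:
            return None
        card_id, fact_id = self.queue.pop(0)
        # Delay siblings (cards of the same fact) still waiting in the queue.
        siblings = [c for c in self.queue if c[1] == fact_id]
        self.queue = [c for c in self.queue if c[1] != fact_id]
        self.spaced.extend(
            (now + self.new_spacing, cid, fid) for cid, fid in siblings)
        return card_id


q = NewQueue([(1, 100), (2, 100), (3, 200)], new_spacing=60)
order = [q.pop(0), q.pop(1), q.pop(2)]
after_empty = q.pop(3)
```

In the usage above, card 2 is a sibling of card 1 and is pushed behind card 3, then released early at time 2 because the queue would otherwise be empty.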
Review cards are now placed at the end of the current review queue, and will
never be rescheduled to a different day. The old approach had a number of
problems:
- the more card models you had, the more likely a card would be spaced
multiple times, resulting in you forgetting the card before you get a chance
to review it
- spacing was applied even if the due card was already late
- repeatedly failing one card over a period of days or weeks would also starve
the other cards of attention