For a long time I've been wanting to be able to search my Mastodon feed - not just my own posts (toots, blah) - so this past weekend I started hacking on a tool to download my Home timeline and dump the data into a local DuckDB database for querying.
Some notes so far:
My first discovery was that the number of Statuses returned from mastodon.timeline() (using the mastodon-py library) is server-dependent. I think these timelines are kept in memory by Mastodon and are capped at a certain number of Statuses. The default seems to be 400, though my server - hachyderm.io - will return 800 results.
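For reference, here's a minimal sketch of the pagination loop for walking backwards through the home timeline until the server runs out of results - the token handling is a placeholder, not my actual code:

```python
from mastodon import Mastodon

# Assumes you've registered an app and have an access token for your server.
mastodon = Mastodon(
    access_token="YOUR_ACCESS_TOKEN",
    api_base_url="https://hachyderm.io",
)

# Page backwards through the home timeline until the server stops
# returning results (the in-memory cap mentioned above).
statuses = []
page = mastodon.timeline(timeline="home", limit=40)
while page:
    statuses.extend(page)
    page = mastodon.fetch_next(page)  # returns None when there's no next page

print(f"Fetched {len(statuses)} statuses")
```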
My initial solution was to define an ORM model using SQLModel (a Python library that combines Pydantic and SQLAlchemy to provide a friendlier ORM interface). I then wrote simple functions to convert a Mastodon API response entity into the appropriate ORM model. This works OK, but I'm unlikely to use the ORM for querying, since I'll probably reach for more data-oriented tools (DuckDB, Pandas, some dashboard app) for search and analytical tasks. That has me wondering if I should just be dumping results to files that DuckDB can load and query directly, like JSON or Parquet.
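A trimmed-down sketch of what that looks like - the real models cover more of the API response, and the field list here is illustrative, not exhaustive:

```python
from datetime import datetime
from sqlmodel import Field, SQLModel


class Status(SQLModel, table=True):
    id: int = Field(primary_key=True)
    created_at: datetime
    content: str
    account_id: int
    reblogged: bool = False
    favourited: bool = False
    bookmarked: bool = False


def status_from_api(entity: dict) -> Status:
    # mastodon-py returns dict-like entities, so plain key access works.
    return Status(
        id=entity["id"],
        created_at=entity["created_at"],
        content=entity["content"],
        account_id=entity["account"]["id"],
        reblogged=entity.get("reblogged", False),
        favourited=entity.get("favourited", False),
        bookmarked=entity.get("bookmarked", False),
    )
```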
However, the dump-to-files approach would probably make more sense if I combined (de-normalized) the data so that status data and account data live in a single file or set of files, and I'm not sure that's what I need. So that's an architectural decision I'm still thinking about.
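Either way, the appeal is that DuckDB can query those dumps in place, no load step needed. A quick sketch, assuming a hypothetical statuses/ directory of JSON dumps with the nested account object left intact:

```python
import duckdb

con = duckdb.connect("mastodon.duckdb")

# DuckDB infers a schema from the JSON, including the nested
# account struct, so dot access works on it directly.
con.sql("""
    SELECT account.username, count(*) AS n
    FROM read_json_auto('statuses/*.json')
    GROUP BY account.username
    ORDER BY n DESC
    LIMIT 10
""").show()
```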
There are annoying differences in the data that Mastodon has (or provides via API). A Status has updated_at but Account doesn't. A Status has reblogged, favourited, and bookmarked fields that reflect whether the logged-in (requesting) user has performed those actions on the post, but Account has no equivalent followed field, which means I need to merge results from the timeline call with results from a call to mastodon.account_following().
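A rough sketch of that merge, reusing the mastodon client and statuses list from the first snippet - account_following() pages the same way timeline() does, and me() is mastodon-py's shorthand for account_verify_credentials():

```python
# Build the set of account IDs I follow, then tag each timeline
# status's account with a synthesized "followed" flag.
me = mastodon.me()
following_ids = set()
page = mastodon.account_following(me["id"], limit=80)
while page:
    following_ids.update(acct["id"] for acct in page)
    page = mastodon.fetch_next(page)

for status in statuses:
    status["account"]["followed"] = status["account"]["id"] in following_ids
```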
More updates as I go.