Fixing iMessage search with DuckDB
iMessage is an instant messaging service developed by Apple for its devices running iOS and macOS operating systems. It allows users to send text messages, photos, videos, music, and other multimedia content over Wi-Fi or cellular data to other Apple devices.
It's one of the best features of the Apple ecosystem, most of the time it actually does just work.
Except for search. Search, for some reason, is still an unsolved topic at Apple, which in current-year is honestly pretty embarrassing.
So, let's fix it.
Getting the data
Did you know that all of your iMessages are available in an SQLite database on your Mac? The "production" database is located in the folder ~/Library/Messages/chat.db
. Instead of running queries on this, let's create a working copy first.
β― mkdir duckdb-imessage && cd duckdb-imessage
β― cp ~/Library/Messages/chat.db .
Great, now we definitely won't mess up anything important!
O Pato
DuckDB is an
in-process SQL OLAP database management system
which is perfect for our use case as we want something lightweight that enables us to do some more analytical type querying.
Let's start it up and load the required extension that allows us to connect to an SQLite database on our filesystem.
β― duckdb
v0.6.1 919cad22e8
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D INSTALL sqlite;
D LOAD sqlite;
Now, we can attach our session to the SQLite file with the sqlite_attach
function.
CALL sqlite_attach('chat.db');
Time to explore!
To get an idea of what tables we can use, run the following command:
D PRAGMA show_tables;
βββββββββββββββββββββββββββββ
β name β
β varchar β
βββββββββββββββββββββββββββββ€
β _SqliteDatabaseProperties β
β attachment β
β chat β
β chat_handle_join β
β chat_message_join β
β deleted_messages β
β handle β
β kvtable β
β message β
β message_attachment_join β
β message_processing_task β
β sqlite_sequence β
β sqlite_stat1 β
β sync_deleted_attachments β
β sync_deleted_chats β
β sync_deleted_messages β
βββββββββββββββββββββββββββββ€
β 16 rows β
βββββββββββββββββββββββββββββ
Alright, not bad, looks like it's fairly well organized. Let's run a few sample queries to get an idea of what we can do!
Let's see how we could rank our messaging partners based on the number of messages exchanged!
SELECT
chat.chat_identifier,
COUNT(chat.chat_identifier) AS message_count
FROM chat
JOIN chat_message_join ON chat."ROWID" = chat_message_join.chat_id
JOIN message ON chat_message_join.message_id = message."ROWID"
GROUP BY chat.chat_identifier
ORDER BY message_count DESC LIMIT 5;
This returns something like the following (phone numbers masked for privacy!)
βββββββββββββββββββββββββ¬ββββββββββββββββ
β chat_identifier β message_count β
β varchar β int64 β
βββββββββββββββββββββββββΌββββββββββββββββ€
β +xxxxxxxxxxx β 25530 β
β +xxxxxxxxxxx β 14913 β
β +xxxxxxxxxxx β 4551 β
β xxxxxxxxxxxxxxxxxxxxx β 2585 β
β +xxxxxxxxxxx β 2112 β
βββββββββββββββββββββββββ΄ββββββββββββββββ
Let's get searching
Now, the original mission of this article is to create something that allows us to search in our iMessage history properly. Let's clean up our data a little first.
We can create a table with cleaned-up timestamps and only the fields that we need
CREATE OR REPLACE TABLE messages_clean AS (
SELECT
message.ROWID AS message_id,
to_timestamp(message.date / 1000000000 + epoch('2001-01-01'::TIMESTAMP)) AS message_at,
chat.chat_identifier,
message.text
FROM
chat
JOIN chat_message_join ON chat."ROWID" = chat_message_join.chat_id
JOIN message ON chat_message_join.message_id = message. "ROWID"
ORDER BY
message_date DESC
);
Now we have a usable dataset that can serve as the source for our search queries.
D select column_name, data_type from information_schema.columns where table_name = 'messages_clean';
βββββββββββββββββββ¬ββββββββββββ
β column_name β data_type β
β varchar β varchar β
βββββββββββββββββββΌββββββββββββ€
β message_id β BIGINT β
β message_at β TIMESTAMP β
β chat_identifier β VARCHAR β
β text β VARCHAR β
βββββββββββββββββββ΄ββββββββββββ
To search we could use the basic SQL functionality to compare strings, but we want to go a bit further than that, so let's build a Full-text Search index over our dataset.
To achieve this, DuckDB provides an extension, that we can use straight out of the box! Let's create our search index:
PRAGMA create_fts_index('messages_clean', 'message_id', '*');
This PRAGMA
builds the index under a newly created schema. The schema will be named after the input table: if an index is created on the table 'main.table_name'
, then the schema will be named 'fts_main_table_name'
, so in our case, the schema is going to be called fts_main_messages_clean
.
And that's it! We can run full search queries against our index now.
For example, if we want to see all the messages that contain the phrase "cute dog", we can do so using something like this:
SELECT text, score
FROM (SELECT *,
fts_main_messages_clean.match_bm25(message_id, 'cute dog') AS score
FROM messages_clean) sq
WHERE score IS NOT NULL
ORDER BY score DESC LIMIT 5;
And our results will look like this
ββββββββββββββββββββββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββ
β text β score β
β varchar β double β
ββββββββββββββββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββ€
β Their dog is still cute tho β 8.606972176668958 β
β Thereβs a very cute dog here β 8.606972176668958 β
β Ok the dog is cute but Misha is waaaaaay better β 5.085938104395293 β
β Too cute β 4.594079954266784 β
β Theyre cute β 4.594079954266784 β
ββββββββββββββββββββββββββββββββββββββββββββββββββββ΄ββββββββββββββββββββ
Nice! The score
gives a great estimation of the accuracy of the search results, as we can see from the actual text.
Exactly what we are looking for ~ and way better than anything Apple provides us so far except for all the plumbing required, but at least we had a reason to play around with DuckDB!
Member discussion