Boox Notes Reverse Engineering (Part 1)

How to get your handwritten notes out of Boox Notes Backups

Posted by Daniel on Sunday, January 2, 2022

A walkthrough of how I reverse-engineered the Boox Note data format to extract the vector shapes of handwritten notes taken in the Android Boox Notes app.

Why?

For Christmas I bought myself a Boox Note Air 2 for note-taking, as an alternative to the Livescribe pen I had been using until then. The idea of an Android tablet with an E-Ink screen and a stylus intrigued me. I was hoping to use OneNote on it (even though I use Joplin for note-taking nowadays) but found it rather unusable due to the lag and the missing palm rejection. The built-in Boox Note app, however, worked great, but its export/sync options only saved PDF files. I also did not want to trust my notes to another cloud service (Onyx) I knew nothing about. So I looked around to see if I could access the notes data somehow and stumbled upon a thread here. That gave me the first pointers on where to start. Basically all data is saved to SQLite databases and some binary blobs. Perfect! A database (usually quite self-explanatory) and some binary data; the binary data is where the reverse engineering work needs to be done. So let’s get started.

Data generation

First we need some data to work on. So I fired up the Note App on my Boox Note Air and started with a notebook where I drew a simple line on the first page, and a circle on the second. Then I created a backup in the App and copied it to the PC. The backup is simply a folder with a couple of files in it:

  • a ShapeDatabase.db file
  • some [UUID].db files
  • a folder named point where some more folders with UUID names reside and that contain the binary data files most likely holding the actual handwriting/drawing data.

ShapeDatabase

The shape database is pretty straightforward. It contains four tables:

  • NoteAccessHistory
  • NoteModel
  • ResourceModel
  • ShapeModel

To start, the NoteModel table sounds right, and lo and behold, it actually has useful data! It contains a row for each notebook I created, including the title, a unique ID for the notebook and even a column named pageNameList that contains a JSON structure with a list of unique IDs. The number of UUIDs in the list matches the number of pages in each notebook, so those must be the page IDs. Good start! Next up are the other .db files.
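A minimal sketch of this first look, using Python's built-in sqlite3 module. The table name NoteModel and the column pageNameList come from the backup described above; the helper names list_tables and load_page_lists are made up for illustration. The inner layout of the JSON is not spelled out here, so the sketch only decodes it and leaves the interpretation open.

```python
import json
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in an SQLite file."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]


def load_page_lists(db_path):
    """Decode the pageNameList JSON of every NoteModel row (structure not interpreted)."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute("SELECT pageNameList FROM NoteModel").fetchall()
    finally:
        con.close()
    return [json.loads(raw) for (raw,) in rows if raw]
```

Calling list_tables on ShapeDatabase.db should show the four tables listed above, and the same helper works on the per-notebook .db files.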

Notebook Databases

The other .db files in the backup directory all have a UUID as a name, and these correspond to the UUIDs in the NoteModel table, so each notebook has its own database file as well. The notebook databases also look quite simple; here are the tables:

  • HWRDataModel
  • HWRShapeModel
  • NewShapeModel

The NewShapeModel table again looked promising, so I looked at it first. It contains a row for each shape, again with a UUID for each row, so shapes have UUIDs too. Now it is time to look at the binary files, the harder nut to crack (but not that hard, as we will see).

Looking at the binary blobs

To make sense of the binary blobs located in the point folder, I drew a simple circle in the Boox Notes app and made a backup. Then I looked up the UUID of the notebook in the database.

The beginning of the binary looks like this:

00000000: 0000 0001 3638 6462 3534 3135 2d34 6566  ....68db5415-4ef
00000010: 652d 3438 6534 2d61 3331 642d 6230 6135  e-48e4-a31d-b0a5
00000020: 6336 3265 3031 6635 3435 3366 6662 3561  c62e01f5453ffb5a
00000030: 2d63 3539 342d 3464 6630 2d39 6532 382d  -c594-4df0-9e28-
00000040: 3837 3839 3265 3066 3036 3135 0000 0000  87892e0f0615....
00000050: 42ab a476 4228 ec3d 0000 0022 0000 0000  B..vB(.=..."....
00000060: 42ab a476 4229 aedc 0000 0063 0000 0003  B..vB).....c....
00000070: 42ab 4328 422a 101b 0000 00a8 0000 000e  B.C(B*..........
00000080: 42ab 1282 422a d2bb 0000 00ab 0000 000e  B...B*..........
00000090: 42aa e1dc 422b 955a 0000 00ac 0000 000e  B...B+.Z........
...
...
...
000017d0: 429e 8796 4255 64b1 0000 04f9 0000 0411  B...BUd.........
000017e0: 429d 9455 4255 64b1 0000 02ff 0000 0414  B..UBUd.........
000017f0: 3833 6566 3832 3061 2d33 6132 652d 3463  83ef820a-3a2e-4c
00001800: 6338 2d39 6563 352d 6436 3666 6136 3032  c8-9ec5-d66fa602
00001810: 3032 3933 0000 004c 0000 17a4 0000 17f0  0293...L........

At first glance it looks perfectly simple: there is some header, some stuff at the end, and the actual data in the middle. It is even perfectly aligned with our 16-byte lines, so you can almost read the point data straight off the hex dump: if you look at the lines starting at 00000050, you can already guess that the repeating data are 16-byte blocks, each consisting of four 32-bit values. The header looks like a UUID, and so does the footer (end of the file). Easy! So I wrote a quick Python script to parse the data and plot the points, and played around with the four values, assuming that two of them had to be x/y coordinates, one maybe pen pressure (since the app can vary line thickness with pen pressure), plus a fourth mystery one. And tadaa! I got a figure that actually looked like what I had drawn:

[Image: plot of the parsed points]
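To make the guessing concrete, here is a small sketch that decodes the 16 bytes shown at offset 00000050 in the dump above. It assumes big-endian byte order, with the first two words read as 32-bit floats and the last two as unsigned integers. That reading is an assumption based on the byte patterns, and the names a and b are just placeholders for the two values that are still unexplained at this point.

```python
import struct

# The 16 bytes at offset 0x50 of the first dump, copied from the hex view.
line = bytes.fromhex("42aba4764228ec3d0000002200000000")

# Assumption: big-endian, two float32 values followed by two uint32 values.
x, y, a, b = struct.unpack(">ffII", line)
print(f"x={x:.2f} y={y:.2f} a={a} b={b}")
```

With that interpretation the first two words come out at roughly 85.8 and 42.2, which are plausible coordinates, and the third word is a small integer that could well be a pressure reading.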

So I ran the script on another notebook, this time a real note of mine, a whole page of text, not just a circle created to make things easy. And the output was complete garbage. Just random lines. Meh. Ok, back to the drawing board.

Looking at the binary blobs (this time for real)

The folder for this page now not only contained multiple binary files (that’s an easy fix, just read them one after the other), the files also looked different.

The start looks fine. The header has the same format. And the data seems to start out the same way:

00000000: 0000 0001 6264 3831 3438 6563 2d32 3062  ....bd8148ec-20b
00000010: 312d 3438 6436 2d62 3965 652d 3764 3566  1-48d6-b9ee-7d5f
00000020: 3465 3431 6432 3631 3836 3530 3034 6439  4e41d261865004d9
00000030: 2d36 6665 342d 3430 6561 2d39 6666 622d  -6fe4-40ea-9ffb-
00000040: 3633 6335 6234 3533 3763 6631 0000 0000  63c5b4537cf1....
00000050: 42f0 d135 424c 457b 0000 0026 0000 0000  B..5BLE{...&....
00000060: 42f1 3282 424c 457b 0000 0054 0000 0004  B.2.BLE{...T....

But then in the middle something changes:

00000210: 42f2 25c1 424a 5efd 0000 008e 0000 0058  B.%.BJ^........X
00000220: 42f2 25c1 424a c03c 0000 0076 0000 005b  B.%.BJ.<...v...[
00000230: 42f2 5669 424a c03c 0000 0056 0000 005d  B.ViBJ.<...V...]
00000240: 42f2 5669 424a c03c 0000 0032 0000 0060  B.ViBJ.<...2...`
00000250: 0000 0000 42f5 302a 4249 3afd 0000 003b  ....B.0*BI:....;
00000260: 0000 0000 42f5 302a 4249 3afd 0000 006a  ....B.0*BI:....j
00000270: 0000 0003 42f5 302a 4249 3afd 0000 00a3  ....B.0*BI:.....
00000280: 0000 0008 42f5 302a 4248 d9be 0000 00b9  ....B.0*BH......

It is obvious that the alignment changes in the middle of the file. WTF! Somehow there is an extra 0000 0000 thrown in there, so I get an offset when reading the data, and of course drawing points read at the wrong offset produces garbage. So I “fixed” it by just skipping a word whenever a 16-byte block started with a zero word. Now it looked better. But of course this is just a dirty hack and by no means correct: a legitimate zero X value would throw me off and create an offset instead of fixing one. And there was another problem: the footer changed, it got longer:

00006460: 0000 1257 4307 2302 4235 d890 0000 0e8d  ...WC.#.B5......
00006470: 0000 1259 4307 fdef 4230 24f5 0000 09c9  ...YC...B0$.....
00006480: 0000 125c 3535 6538 6238 6439 2d66 6335  ...\55e8b8d9-fc5
00006490: 352d 3430 6465 2d39 3063 632d 6630 3163  5-40de-90cc-f01c
000064a0: 3231 6633 3935 6639 0000 004c 0000 0204  21f395f9...L....
000064b0: 6337 6338 6339 3337 2d39 3636 392d 3464  c7c8c937-9669-4d
000064c0: 6231 2d38 6565 322d 6331 6666 3961 6138  b1-8ee2-c1ff9aa8
000064d0: 6463 6436 0000 0250 0000 6234 0000 6484  dcd6...P..b4..d.

There are now two UUIDs at the end. And not only that, there is an 8-byte gap with some unknown data between the first and the second UUID, but 12 bytes of something after the last UUID. Crap. How do I detect where the data points end and the footer begins? And what are these UUIDs anyway?

Looking at the SQLite database again, it started to make sense: the two UUIDs correspond to two of the many shapeUniqueIds. And the extra-zero problem from above occurred exactly once in this file, which has two UUIDs in the footer. So the data before that extra zero is most likely the point data for the first shape in the file, and the data after it belongs to the second shape. But how the hell are they parsing the data in their app? The header stays the same, with no information about lengths or blocks. So it has to be in the footer.

Then it dawned on me: the footer actually describes the file structure. The last 32-bit value, 0000 6484, is the offset where the footer starts! To parse the file you have to start with the last word of the file. That leaves one more unknown: the two 32-bit values after each UUID in the footer. Those must be offsets too! And yes, they are: the word after the first UUID is 0000 004c, which is where the first shape data starts, right after the header. The second word, 0000 0204, is the length of the shape data block. And the second shape starts at 0000 0250, which is exactly 0x4c + 0x204. That solves our strange extra-word problem in the middle of the file: the zero word no longer shifts anything, because each shape data block is read on its own using the offset and length from the footer. I can remove my hack and just parse the footer correctly!
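The back-to-front reading described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code from my repository: the function names read_footer and shape_blocks are made up here, and the sketch assumes big-endian 32-bit words and 36-character ASCII UUIDs, as seen in the dumps.

```python
import struct

ENTRY_SIZE = 44  # 36-byte ASCII UUID + 4-byte offset + 4-byte length


def read_footer(data):
    """Return a list of (shape_uuid, offset, length) tuples from a points file."""
    # The last big-endian 32-bit word is the offset where the footer starts.
    (footer_start,) = struct.unpack(">I", data[-4:])
    footer = data[footer_start:-4]
    if len(footer) % ENTRY_SIZE != 0:
        raise ValueError("footer is not a whole number of 44-byte entries")
    entries = []
    for uuid_bytes, offset, length in struct.iter_unpack(">36sII", footer):
        entries.append((uuid_bytes.decode("ascii"), offset, length))
    return entries


def shape_blocks(data):
    """Map each shape UUID to the raw bytes of its data block."""
    return {uuid: data[off:off + length] for uuid, off, length in read_footer(data)}
```

For the second file above, read_footer should yield two entries, one with offset 0x4c and length 0x204 and one with offset 0x250 and length 0x6234. The header length never has to be hard-coded, because the footer already says where each block starts.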

Still the binary blob (Putting it all together)

So now I had an understanding of the file format. It looks like this:

Points File

Header
  Bytes  Description
  4      Always 0x1, I assume this is a version number
  36     UUID

Shape Data Blocks (repeated)

Footer
  Bytes  Description
  44     Shape Info (repeated)
  4      Footer Offset (from start of file)
  0      EOF (End of file)

Shape Info

  Bytes  Description
  36     Shape UUID
  4      Offset
  4      Length

Shape Data Block

  Bytes  Description
  4      Drawing Tool?
  4      Point X-Axis
  4      Point Y-Axis
  4      Pressure
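A hedged sketch of decoding one shape data block with the field order from the table above. It assumes X and Y are big-endian 32-bit floats and the other two fields are unsigned integers. The "tool" label is only my guess from the table, and decode_block is a name invented for this example. One caveat: the block lengths in the dumps (0x204, 0x17a4, 0x6234) are each 4 bytes more than a multiple of 16, so the sketch decodes only whole 16-byte records and ignores the remainder. What those 4 bytes mean is not settled here.

```python
import struct

# Field order taken from the table: tool?, x, y, pressure (all big-endian).
RECORD = struct.Struct(">IffI")


def decode_block(block):
    """Split one shape data block into (tool, x, y, pressure) tuples."""
    # Assumption: drop any trailing bytes that do not fill a whole record.
    whole = len(block) - len(block) % RECORD.size
    return list(RECORD.iter_unpack(block[:whole]))
```

The (x, y) pairs of each tuple are what gets plotted; the pressure value can drive the line width.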

Completing the parser

Even with the file format figured out, the rendered output of my sample page was still distorted:

[Image: not yet correct page output]

Two things are going wrong here:

  • Some shapes are not in the right place
  • Some shapes are actually deleted (at some point I deleted “Build Table” and wrote “Build Data Table” instead, but both shapes are visible)

Regarding the first issue, I remembered a column in the NewShapeModel table called matrixValues. I had certainly moved stuff around the page, and instead of rewriting the point data, the app stores a transformation matrix that is applied to the shape. That makes total sense, since you can also rotate shapes. So I read the matrix values from the table, applied the transformation matrix, and voilà, the shapes were placed correctly. The second issue is also an easy fix: I only draw a shape when there is a valid entry in the NewShapeModel table. As it turns out, shape data is not deleted from the shape file; only some values are deleted in the Shape table.
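Both fixes can be sketched as follows. This is an illustration under assumptions, not the exact code from my parser: it assumes matrixValues boils down to nine numbers in row-major order, the same order Android's Matrix.getValues() uses (scale x, skew x, translate x, skew y, scale y, translate y, then the perspective row). How the column is actually encoded, and what exactly counts as a "valid entry", is not covered here, so db_shape_ids simply stands in for the set of shape ids that pass that check. Both function names are hypothetical.

```python
def apply_matrix(points, m):
    """Apply a 3x3 affine matrix, given as 9 row-major values, to (x, y) points."""
    # Assumption: the last row is (0, 0, 1), so only the first six values matter.
    a, b, tx, c, d, ty = m[:6]
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


def visible_shapes(blocks, db_shape_ids):
    """Keep only the shapes whose id is still present in the database."""
    return {uuid: pts for uuid, pts in blocks.items() if uuid in db_shape_ids}
```

A pure translation such as [1, 0, 10, 0, 1, 20, 0, 0, 1] moves every point by (10, 20), and the same function covers rotation and scaling because those only change the first two columns.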

Now it looks correct:

[Image: correct page output]

There is more: Handwriting recognition!

The sqlite database also contains data that is gathered from the handwriting recognition software of the Boox Note App. We can leverage that as well to search our notes. So I added some lines to my Python script to take a word as input, search for it on the page and draw a red rectangle around the word if found. So when I pass ‘algorithm’ as a search term to my script I get this output:

[Image: find text in handwritten notes]

Conclusion

Now I have a (probably partially) working parser for Boox Notes data files: https://github.com/DAmesberger/boox-notes-backup-parser

The output can be saved to a PNG file. That in and of itself does not really help: I wanted a better way of syncing data than saving PDFs, and PNGs are arguably worse. But I have the raw shape data now. I am not yet sure what the best way to proceed is. I may take a shot at implementing a proper parser in Rust (maybe Rust WebAssembly) and writing an add-on for Joplin, but I would have to investigate how difficult it would be to add a viewer and search function as a plugin.

Also, currently this only operates on backups. However, I want to root my device anyway, because I heard that it is quite heavily “phoning home” to some (Onyx?) servers, which I don’t want. Maybe that also kills the handwriting recognition (I don’t know if it runs on-device or in the cloud, but given how well it works I have to assume that it is cloud-based). But since we have the data, we can try to run handwriting recognition software on the extracted data later anyway.

Git Repository

https://github.com/DAmesberger/boox-notes-backup-parser