@laurel @Pi_rat @p Hydrus is the most disgusting pile of bloated chanware that I’ve ever used with the worst UI known to man.
I like the idea of the program and I tried to like it so much but I just absolutely despise it.
I will most likely prefer something much more minimal for this, probably like the sub-200 LOC thing that Pi_rat has written.
@Pi_rat @p Anyway I read through the code and I see no major issues with it. It looks simple and clear to read. The database schema also seems properly normalized.
I’m most curious about what the UI for this will look like because that will be what makes or breaks it for me. Tagging an image with multiple tags and then going to the next should require the least amount of clicks and button presses.
@hj @laurel @Pi_rat @p The thing I hate about it the most is that it really wants to take control of your filesystem and move copies of all your files into its own folder structure.
I don’t want it to do that. Just leave my fucking files where they are and don’t make a copy. Just keep track of where my files are in your database.
Hopefully some of this is helpful to you :)
init.sql:
If you are not writing portable sql schemas, then it is not necessary to write "integer not null primary key autoincrement." autoincrement incurs a performance penalty and should be omitted for integer primary keys [1].
Additionally, you may safely use datatype TEXT to store the names of your tags, as under the hood, sqlite will treat these the same. In fact, if you wish, you do not have to specify a datatype at all [2].
The first, biggest thing to consider is the logic related to applying tags to images. Take a look at the non-standard (but very useful) upsert clause [3]. Present in both sqlite and postgresql, it allows you to handle a uniqueness constraint violation inside the query, without using program logic.
For instance, this would let you write something like:
insert into image_tags values (:id, :tag) on conflict do nothing
This turns duplicate tag application into a no-op, saving you some trouble and making your code nicer.
Finally, in the sqlite implementation, cursor.executemany() is functionally equivalent to manually iterating over a loop yourself, and in this case would not require building a list of dictionaries manually to iterate over, if you are concerned with resource usage/speed.
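To make both points concrete, here is a minimal sketch combining executemany with the upsert clause. The image_tags table and its column names are assumptions for illustration, not taken from your actual schema (upsert needs sqlite >= 3.24):

```python
import sqlite3

# Hypothetical schema: a many-to-many link table with a composite primary key.
db = sqlite3.connect(":memory:")
db.execute(
    "create table image_tags ("
    " image_id integer,"
    " tag_id integer,"
    " primary key (image_id, tag_id))"
)

pairs = [(1, 10), (1, 11), (1, 10)]  # note the duplicate pair

# executemany accepts any iterable (here a generator), so there is no need
# to build a list of dicts first; "on conflict do nothing" turns the
# duplicate into a no-op instead of an IntegrityError.
db.executemany(
    "insert into image_tags values (?, ?) on conflict do nothing",
    (pair for pair in pairs),
)
db.commit()

print(db.execute("select count(*) from image_tags").fetchone()[0])  # prints 2
```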
init_db:
It is not necessary to manually parse the database initialization script and execute the commands one at a time. Instead use cursor.executescript() with the loaded contents of the file as a parameter.
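A sketch of what that could look like; the function signature and the "init.sql" path are assumptions about your code, not the real thing:

```python
import sqlite3

def init_db(db_path, schema_path="init.sql"):
    # executescript runs the whole file in one call, doing the
    # statement-splitting that would otherwise be done by hand
    db = sqlite3.connect(db_path)
    with open(schema_path) as f:
        db.executescript(f.read())
    return db
```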
add_images:
It may be interesting to take an alternative approach to loading all images known in the directory, and instead repeatedly ask the database if a given file is already present. e.g.
select 1 from images where dir_id = :dir_id and image = :image
This will either return 1 or you will get an empty result set from your cursor. This will probably be faster and will definitely use less memory.
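In python that check could look something like this (the helper name is invented; the schema follows the query above):

```python
import sqlite3

def image_known(cursor, dir_id, image):
    # returns True if the (dir_id, image) pair already exists in the db
    row = cursor.execute(
        "select 1 from images where dir_id = :dir_id and image = :image",
        {"dir_id": dir_id, "image": image},
    ).fetchone()
    return row is not None
```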
Also, in general, prefer named parameters over positional ones: (:dir_id, :image) is better than (?, ?). This is stylistic and a personal choice, so buyer beware.
create_insert_tuple:
This whole method can probably be gotten rid of by switching your logic to using the upsert syntax to apply tags.
General Notes
Where are you actually committing to the database? AFAICT you're never writing anything to disk unless somehow executemany is autocommitting???
In general, shift all logic related to processing data into the database. It is very good at doing those tasks very fast. Brushing up on sql will go a long way here towards simplifying your python code as much as possible. The sqlite docs are wonderful.
Try to avoid doing data processing in pure python; it is not very fast at that.
init.sql
If you are not writing portable sql schemas, then it is not necessary to write "integer not null primary key autoincrement." autoincrement incurs a performance penalty and should be omitted for integer primary keys [1].
I had read about autoincrement being slow, but I thought the alternative to it would be generating unique ids myself. I will fix the code to use rowids.
Additionally, you may safely use datatype TEXT to store the names of your tags, as under the hood, sqlite will treat these the same.
I was going for VARCHAR(255) (the max file name length in ext4), but then a friend pointed out that the database should not be doing those checks. My reasoning was that if I set a limit on the text it would help the db have a standard row size. I also later came across the fact that sqlite does not actually apply this constraint, and it is my error that I did not change it.
The first, biggest thing to consider is the logic related...without using program logic.
This did come up, but I took a break and later I did not have any motivation. I have forced myself for the past few days, but this is becoming fun again so I will try to put in much more effort. And as you said it is non-standard, so it felt weird; I was originally going to go about it with try/except and logging duplicates.
Finally, in the sqlite implementation, cursor.executemany() is functionally equivalent to manually iterating over a loop yourself, and in this case would not require building a list of dictionaries manually to iterate over, if you are concerned with resource usage/speed.
I thought it would be optimized. I will refactor around this. I was going for dicts since qmark style is going to be deprecated (and, as you have said further along, I agree dicts are cleaner).
init_db...
I had a strong feeling that this could be done in a better way; I was split between that and having the whole schema in the init function (but that did not sit right either). I did come across executescript in the docs, but in the excitement of working on the tagging logic I did not give it much attention. Will fix this.
add_images:
It may be interesting to take an alternative approach to loading all images known in the directory, and instead repeatedly ask the database if a given file is already present. e.g.
select 1 from images where dir_id = :dir_id and image = :image
This will either return 1 or you will get an empty result set from your cursor. This will probably be faster and will definitely use less memory.
As you have pointed out, I'm thinking of the upsert syntax to just ignore duplicate inserts. Originally I was trying to figure out a way to have "SELECT id FROM images WHERE image IN ?" and I tried every combination of qmark and dicts but it was no dice. It could be done with string formatting, but I'd rather not use it since I'm doing parametrized queries everywhere else; that would be bad.
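For what it's worth, there is a middle ground for the IN case: build only the placeholder string dynamically and still bind the actual values as parameters, so no user data ever ends up in the query text. A sketch (the helper name is made up; table/column names follow the select above):

```python
import sqlite3

def select_image_ids(cursor, names):
    # one "?" per value; only this placeholder string is built with
    # formatting, the values themselves are still bound as parameters
    if not names:
        return []  # "in ()" is a syntax error, so short-circuit
    placeholders = ",".join("?" for _ in names)
    query = f"select id from images where image in ({placeholders})"
    return [row[0] for row in cursor.execute(query, list(names))]
```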
create_insert_tuple:
This whole method can probably be gotten rid of by switching your logic to using the upsert syntax to apply tags.
It would be extremely cool and I will try hard for this. This one function was what kept me debugging all of today, and the problem was I was passing "directory" instead of "dir_id" in tagg.py. There was also a problem with an if statement where I errantly assumed an empty list == False, but it was not so. I also don't like setting image_id for every image, and allocating it is inefficient.
General Notes
Where are you actually committing to the database? AFAICT you're never writing anything to disk unless somehow executemany is autocommitting???
Ah yes, I have autocommit turned on as long as there are no exceptions; I will add an explicit commit. I was not even closing the db, mistakenly thinking that the context manager would handle closing it for me. I just learned today that I have to close the cursor too.
In general, shift all logic related to processing data into the database. It is very good at doing those tasks very fast. Brushing up on sql will go a long way here towards simplifying your python code as much as possible. The sqlite docs are wonderful.
Try to avoid doing data processing in pure python; it is not very fast at that.
Is it correct to use "sqlite3" directly, using functions, passing the cursor around them, and writing sql statements inside the functions? I'm not sure if there is a standard way to do this, but personally it seems off. Since python is OOP, shouldn't I be creating classes and adding methods? i.e. cursor.select_image(image_id) == cursor.execute("SELECT image FROM images WHERE id = ?", (image_id,))
My absolute first draft of it compared to this is such a mess; I was not using argparse and the schema was HORROR. I was certain I would have to make a separate table for every tag which would hold image ids. A friend pointed me to the correct many-to-many relation table and that was such a bombshell moment. I learned sql at the end of last year and at that time I did make correct many-to-many schemas, but since then I have not kept up with it. I LOVE sql(ite) and will read all the Docs you have linked with fervour. Thank you VERY much for giving such a detailed review. I will follow up with you when I implement the queries.
(apolocheese for typos and broken sentences, I deleted post because it had formatting issues, hope they are fixed this time)
I had read about autoincrement being slow, but I thought the alternative to it would be generating unique ids myself. I will fix the code to use rowids.
I may not have made myself clear here: you need not fuss with rowids yourself, you may simply use integer primary key.
I thought it would be optimized. I will refactor around this. I was going for dicts since qmark style is going to be deprecated (and, as you have said further along, I agree dicts are cleaner).
This is a good intuition! In other database drivers (psycopg2 for example), executemany is optimized.
As you have pointed out, I'm thinking of the upsert syntax to just ignore duplicate inserts. Originally I was trying to figure out a way to have "SELECT id FROM images WHERE image IN ?" and I tried every combination of qmark and dicts but it was no dice. It could be done with string formatting, but I'd rather not use it since I'm doing parametrized queries everywhere else; that would be bad.
Good catch here, I don't know how I missed using upsert here but you're absolutely right, this is a good use case for it.
Ah yes I have autocommit turned on as long as there are no exceptions, I will add explicit commit. I was not even closing db mistaking that context manager will handle closing it for me. Just learned today that I have to close cursor too.
When you use the actual python context manager syntax, note one sqlite3 quirk: "with" on a sqlite3 connection manages a transaction (commit on success, rollback on exception) rather than closing the connection, and sqlite3 cursors do not support "with" at all, so wrap both in contextlib.closing if you want close() called for you:

from contextlib import closing
import sqlite3

with closing(sqlite3.connect("yourdb.sqlite")) as db:
    with closing(db.cursor()) as cursor:
        ...  # do your query etc etc
    # here the cursor is now closed
    db.commit()
# here the database connection is now closed

In this case, the python context manager will automatically call the appropriate close function as you fall out of the indented block. If you do it in the imperative style you have been doing, you must call close() on each cursor and database connection yourself. In a purely technical sense, all of that is being cleaned up when the program exits anyway, and this is not a long-running server process, but it's good practice to use either of these two methods to keep things tidy yourself. Many other objects use this protocol; iirc you used it when opening and reading a file.
Is it correct to use "sqlite3" directly, using functions, passing the cursor around them, and writing sql statements inside the functions? I'm not sure if there is a standard way to do this, but personally it seems off.
You may pass the connection or cursor object in a function call. However, things get hairier if you start to multithread or multiprocess. sqlite3 module objects are not threadsafe and you will get programming errors if you try to use e.g. a cursor outside of the thread it was created in. This is not really a relevant concern here, though.
As for code style, it might be slightly "cleaner" to pass the database object to functions that use the database, create a cursor for the scope of the function, and close the cursor before you return. This will still allow the callee function control over, e.g., committing to the database in case of a DML statement.
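Sketched out, that style might look like this (the function and table names are invented for the example):

```python
import sqlite3
from contextlib import closing

def add_tag(db, image_id, tag_id):
    # the cursor lives only for the scope of this function; committing
    # is left to the caller, who holds the connection
    with closing(db.cursor()) as cursor:
        cursor.execute(
            "insert into image_tags values (:image_id, :tag_id)"
            " on conflict do nothing",
            {"image_id": image_id, "tag_id": tag_id},
        )
```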
Since python is OOP, shouldn't I be creating classes and adding methods? i.e. cursor.select_image(image_id) == cursor.execute("SELECT image FROM images WHERE id = ?", (image_id,))
It is bad practice to attach methods to objects after the object is instantiated, which is what you would be doing were you to do as you stated in this example.
I wouldn't get too caught up in trying to follow any particular paradigm here; your codebase is small and it's a for-fun project. Short of using an ORM to abstract away handling the database connection, carefully passing a connection object around is a-ok, and is what the ORM is doing under the hood, anyway.
I LOVE sql(ite) and will read all the Docs you have linked with fervour. Thank you VERY much for giving such a detailed review. I will follow up with you when I implement the queries.
You're welcome! SQLite is one of my favorite pieces of software, too, if you couldn't tell! I'm a sql and python guy so I'm happy to help out and had fun reading through this and providing commentary today.
(Idk why but markdown does not give paragraph)
You definitely want the DB to enforce as much as you can, but filename length, maybe not. sqlite3 data types aren't enforced, that is correct, but they do usually act as hints to your ORM or library.
For a small enough program, that's fine. But you should probably wrap it in an object so you don't have to deal with details of the query or the DB in other functions, keep the logic clean so functions are topical and legible.
I will look into this; I'm not sure how I would go about it at all, since I have a very surface-level understanding of OOP.
I don't know how Python deals with iterators, maybe better to model a cursor as a thing that yields rows to whoever asks.
I'm passing it around to execute relevant queries on it.
I'll go sleep now~~ oyasumi
@pwm @p @SuperDicq (again, formatting error)
I may not have made myself clear here: you need not fuss with rowids yourself, you may simply use integer primary key.
You were clear, I just phrased my sentence weirdly.
In this case, the python context manager will automatically call the appropriate close function as you fall out of the indented block. ... This will still allow the callee function control over, e.g., committing to the database in case of a DML statement.
Noted, will pass the connection. (personally this will also look better than 2 levels of indentation)
Short of using an ORM to abstract away handling the database connection, carefully passing a connection object around is a-ok, and is what the ORM is doing under the hood, anyway.
I was looking into sqlalchemy, but combined with everything else my attention was spread too thin, so I left it. Nice to know it's not wrong to pass the connection.
You're welcome! SQLite is one of my favorite pieces of software, too, if you couldn't tell! I'm a sql and python guy so I'm happy to help out and had fun reading through this and providing commentary today.
Eyy!! 
I forgot to ask yesterday: how should I be testing this? For now I manually open up the interpreter and the db and test, but this gets very tedious. My guess is unittests + a test.db which would return expected row(s). I wanted to write them but procrastinated on this, and now I shudder at the magnitude of what I would have to cover; probably it won't be as bad once I get started and write a few.
Probably you've got to double-space.
Hope this werks, I was using new line and single space
OOP is a convenient way to model this, but essentially it's just keeping things organized by topic...I think it's
null?...small and it's sqlite3 so no pool anyway.
Noted. It's called "None" around these parts. And yes, I don't think this will be multi-threaded, not even the GUI version.
But don't get carried away writing code to manage things that you don't need managed yet, you know what I mean.
🫡
I forgot to ask yesterday: how should I be testing this? For now I manually open up the interpreter and the db and test, but this gets very tedious. My guess is unittests + a test.db
For your needs, the built-in unittest module will probably suffice. You'll subclass a test case, write some setup-code that is shared between tests, and then your individual tests. The documentation will give you some good examples. I wouldn't worry about using 3rd party testing frameworks for now (there are plenty).
now I shudder at magnitude of what I would have to cover
You've got ~100loc, that's nothing. Your only wrinkle will be the test database, but just utilize an in-memory one for that (sqlite3.connect(":memory:")) and it will not be persisted to disk for you to clean up later. Or, if you must, you can write some teardown code in your test case that will delete the on-disk database after all the tests run. You've got options.
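A minimal sketch of that shape; the schema and the test body are placeholders, not your actual tables:

```python
import sqlite3
import unittest

class ImageDbTests(unittest.TestCase):
    def setUp(self):
        # a fresh in-memory database per test; nothing touches disk
        self.db = sqlite3.connect(":memory:")
        self.db.executescript(
            "create table images (id integer primary key, image text);"
        )

    def tearDown(self):
        self.db.close()

    def test_insert_and_select(self):
        self.db.execute("insert into images (image) values ('cat.png')")
        row = self.db.execute("select image from images").fetchone()
        self.assertEqual(row[0], "cat.png")

if __name__ == "__main__":
    unittest.main()
```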