A Python feed reader library.

Overview

reader is a Python feed reader library.

It aims to allow writing feed reader applications without any business code, and without enforcing a dependency on a particular framework.


reader allows you to:

  • retrieve, store, and manage Atom, RSS, and JSON feeds
  • mark entries as read or important
  • add tags and metadata to feeds
  • filter feeds and articles
  • full-text search articles
  • get statistics on feed and user activity
  • write plugins to extend its functionality
  • skip all the low-level stuff and focus on what makes your feed reader different

...all these with:

  • a stable, clearly documented API
  • excellent test coverage
  • fully typed Python

What reader doesn't do:

  • provide a UI
  • provide a REST API (yet)
  • depend on a web framework
  • have an opinion about how/where you use it

The following exist, but are optional (and frankly, a bit unpolished):

  • a minimal web interface
    • that works even with text-only browsers
    • with automatic tag fixing for podcasts (MP3 enclosures)
  • a command-line interface
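
For example (a sketch only; the extras name, the python -m reader invocation, and the serve command are assumptions based on the documentation, so check the CLI docs for your version):

$ pip install reader[cli,app]
$ python -m reader update
$ python -m reader search update
$ python -m reader serve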

Documentation: reader.readthedocs.io

Usage:

$ pip install reader
>>> from reader import make_reader
>>>
>>> reader = make_reader('db.sqlite')
>>> reader.add_feed('http://www.hellointernet.fm/podcast?format=rss')
>>> reader.update_feeds()
>>>
>>> entries = list(reader.get_entries())
>>> [e.title for e in entries]
['H.I. #108: Project Cyclops', 'H.I. #107: One Year of Weird', ...]
>>>
>>> reader.mark_entry_as_read(entries[0])
>>>
>>> [e.title for e in reader.get_entries(read=False)]
['H.I. #107: One Year of Weird', 'H.I. #106: Water on Mars', ...]
>>> [e.title for e in reader.get_entries(read=True)]
['H.I. #108: Project Cyclops']
>>>
>>> reader.update_search()
>>>
>>> for e in list(reader.search_entries('year'))[:3]:
...     title = e.metadata.get('.title')
...     print(title.value, title.highlights)
...
H.I. #107: One Year of Weird (slice(15, 19, None),)
H.I. #52: 20,000 Years of Torment (slice(17, 22, None),)
H.I. #83: The Best Kind of Prison ()
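
Feeds can also be tagged (a sketch, assuming the generic tag API available in recent reader versions; method names may differ in older releases):

>>> feed = reader.get_feed('http://www.hellointernet.fm/podcast?format=rss')
>>> reader.set_tag(feed, 'podcast')
>>> reader.set_tag(feed, 'rating', 5)
>>> reader.get_tag(feed, 'rating')
5
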
Issues
  • Increasing "database is locked" errors during update

    Starting with 2020-04-15, there has been an increasing number of "database is locked" errors during update on my reader deployment (update every hour and update --new-only && search update every minute).

    Most errors happen at XX:01:0X, which I think indicates the hourly and minutely updates are clashing. It's likely that search update is hogging the database (since we know it has long-running transactions).

    I didn't see any metric changes on the host around 2020-04-15.

    Logs.
    $ head -n1 /var/log/reader/update.log | cut -dT -f1
    2019-07-26
    $ cat locked.py 
    import sys
    
    last_ts = None
    
    for line in sys.stdin:
        if line.startswith('2020-'):
            last_ts, *_ = line.partition(' ')
        if line.startswith('reader.exceptions.StorageError: sqlite3 error: database is locked'):
            print(last_ts)
    
    $ cat /var/log/reader/update.log | python3 locked.py | cut -dT -f1 | uniq -c
          1 2020-04-15
          4 2020-04-16
          3 2020-04-17
          6 2020-04-18
          2 2020-04-19
          1 2020-04-20
          2 2020-04-21
          3 2020-04-22
          3 2020-04-23
          2 2020-04-24
          3 2020-04-25
          1 2020-04-26
          4 2020-04-27
          2 2020-04-28
          3 2020-04-30
          6 2020-05-01
          8 2020-05-02
         12 2020-05-03
          6 2020-05-04
          4 2020-05-05
          5 2020-05-06
          1 2020-05-07
          2 2020-05-08
          3 2020-05-09
          4 2020-05-10
          4 2020-05-11
          7 2020-05-12
          8 2020-05-13
          9 2020-05-14
          6 2020-05-15
          5 2020-05-16
          9 2020-05-17
         16 2020-05-18
          9 2020-05-19
         10 2020-05-20
         15 2020-05-21
         23 2020-05-22
         20 2020-05-23
         19 2020-05-24
         22 2020-05-25
         22 2020-05-26
         21 2020-05-27
         15 2020-05-28
         18 2020-05-29
         14 2020-05-30
         11 2020-05-31
         17 2020-06-01
         13 2020-06-02
         18 2020-06-03
         10 2020-06-04
         15 2020-06-05
         10 2020-06-06
         14 2020-06-07
         15 2020-06-08
         18 2020-06-09
         17 2020-06-10
         19 2020-06-11
         21 2020-06-12
         19 2020-06-13
         16 2020-06-14
         13 2020-06-15
         24 2020-06-16
         24 2020-06-17
         24 2020-06-18
         24 2020-06-19
         24 2020-06-20
         24 2020-06-21
         24 2020-06-22
         24 2020-06-23
         24 2020-06-24
         11 2020-06-25
    $ cat /var/log/reader/update.log | python3 locked.py | cut -dT -f2 | cut -d: -f2- | cut -c1-4 | sort | uniq -c | sort -rn | head
        510 01:0
        119 03:0
         76 01:1
         46 01:2
         30 00:5
         14 00:3
         13 00:0
          7 01:3
          7 00:4
          3 05:0
    

    I should check if there was a trigger that started this, or if the number of entries/feeds simply hit some threshold.

    Also, it would be nice to show the pid in the logs so I can see which of the commands is failing, and maybe to intercept the exception and show a nicer error message.

    Things that are likely to improve this:

    • [helps] Enabling WAL (#169).
    • Using --workers 20 to give the hourly update a chance to finish before the second minute of the hour. Obviously, this isn't actually addressing the problem.
    • Increasing the timeout passed to sqlite3.connect (at the moment we're using the 5s default); see the sketch after this list.
    • [doesn't help] Using an SQLite build with HAVE_USLEEP (but this is not necessarily a reader issue; we could document it, though).
    • Adding retrying in reader. ಠ_ಠ, SQLite already has it built in; it's only sucky because of the HAVE_USLEEP thing.
    • Wrapping the whole update in a lock. ಠ_ಠ, idem.
    • Making the search update chunks more spaced out, to allow other work to happen in between.
    • Making search update not hog the database by not stripping HTML inside the transaction.
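
    A minimal illustration of the first and third items, using plain sqlite3 (reader manages its own connections, so this is not reader API; it only shows what the two settings do):

    import sqlite3

    # wait up to 30 seconds for a lock instead of the 5-second default
    conn = sqlite3.connect('db.sqlite', timeout=30)
    # enable write-ahead logging (#169), so reads no longer block the writer
    conn.execute('PRAGMA journal_mode=WAL;')
    print(conn.execute('PRAGMA journal_mode;').fetchone())  # ('wal',)
    conn.close()
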
    bug core 
    opened by lemon24 15
  • Entry tags and metadata

    Currently two bits of user data can be added to a feed entry (mark_as_read, mark_as_important).

    Possible use cases:

    1. Add user notes about the entry
    2. Add tags for entry, add entry to 'saved items'
    3. In the case of podcasts, download info (i.e., the path to the file if successfully downloaded, or how many times we tried to download it)

    Optionally, include user_data in search (as an argument to make_reader).
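
    A hypothetical sketch of what this could look like if the generic tag API were extended to entries (illustrative only; whether these calls exist depends on your reader version):

    reader.set_tag(entry, 'note', 'listen to this on the train')           # use case 1
    reader.set_tag(entry, 'saved')                                         # use case 2
    reader.set_tag(entry, 'download', {'path': 'ep1.mp3', 'attempts': 1})  # use case 3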

    help wanted wontfix API core 
    opened by balki 12
  • SQLite objects created in a thread can only be used in that same thread.

    Hi @lemon24

    Thanks for making this library.

    I am attempting to utilise it for a Telegram bot I am working on. However, I run into the following error:

    SQLite objects created in a thread can only be used in that same thread.
    

    Here is my code 😃

    After some quick Google searching, it seems I might need to do something similar to what is recommended in this Stack Overflow post.

    However, I cannot see a way I can provide this to the library currently.

    Hope you can help. Thanks and a merry Christmas to you! 🎄
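
    (A common workaround, sketched under the assumption that each thread can create its own Reader: build the reader inside the thread that uses it, so the underlying SQLite connection never crosses threads.)

    import threading
    from reader import make_reader

    def worker():
        # the connection is created and used only in this thread
        reader = make_reader('db.sqlite')
        try:
            reader.update_feeds()
        finally:
            reader.close()

    threading.Thread(target=worker).start()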

    opened by dbrennand 12
  • Feed decommissioned

    Like #149, but there's no replacement.

    Now, to make the feed stop updating, I can delete it, but I lose the entries.

    Possible ways of keeping the entries:

    • Have a way to mark a feed as "broken / don't update anymore" (obviously, this could be temporary); see the sketch after this list.
    • If we can mark an entire feed as important (https://github.com/lemon24/reader/issues/96#issuecomment-628520935), and have an Archived feed where important entries of deleted feeds go (https://github.com/lemon24/reader/issues/96#issuecomment-460077441), marking the feed as important and then deleting it would preserve the entries.
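
    For the first option, a sketch of what might already work, assuming the enable/disable updates mechanism mentioned in the [Question] issue below is available in your version:

    feed = reader.get_feed('http://example.com/dead-feed.xml')  # hypothetical URL
    reader.disable_feed_updates(feed)  # the feed and its entries stay, but it's no longer updated
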
    API core 
    opened by lemon24 11
  • [Question] Public API for recently fetched entries, or entries added/modified since date?

    For an app I'm working on, I want to update some feeds and do something with the entries that appeared. But I'm unsure how to implement this last part and get only the new entries.

    Workflow I sketched so far is:

    1. Disable updates for all feeds in db (because there might be some I don't wish to update)
    2. Start adding feeds to the db
       2.1. If a feed already exists, enable it for updates
       2.2. If a feed doesn't exist, it will be added and enabled for updates automatically
    3. Update feeds through update_feeds or update_feeds_iter
    4. ???

    update_feeds doesn't return anything. update_feeds_iter gives me back a list of UpdateResult, where each item has a url and either counts or an exception.

    So, I think I can sum all the counts and ask for that many entries. Something like:

    count = sum(
        result.value.new + result.value.updated
        for result in results
        if not isinstance(result.value, ReaderError)
    )
    new_entries = reader.get_entries(limit=count)
    

    But is it guaranteed that get_entries(sort='recent') will include a recently updated entry? Even if that entry originally appeared a long time ago? I might be misunderstanding what it means for an entry to be marked as "updated", so any pointers on that would be helpful, too.

    Perhaps I could change my workflow a little - first get a monotonic timestamp, then run all the steps, and finally ask for all entries that were added or modified after that timestamp. But it seems there is no API for searching by date? search_entries is designed for full-text search and works only on a few columns.

    So, my question is:

    1. What is the preferred way of obtaining all entries added in an update call? Counting and using get_entries? Calling get_entries for everything and discarding all results that were added / modified before the timestamp? Something else?
    2. What does it mean for an entry to be "updated"?
    opened by mirekdlugosz 10
  • Twitter support

    Some notes:

    • Main use case: get updates on someone's tweets, e.g. https://twitter.com/qntm; maybe replies too.
      • Account / API key not required (it kinda defeats the purpose).
    • From 20 minutes of research, snscrape seems to be working (other popular ones seem broken).
      • The lib part is not stable, but usable.
        • We can use our own Requests session (by setting a private attribute).
    • This won't really fit with the retriever/parser model we have now.
      • #222 has the same issue, converge.
      • We can use the date of the last tweet as Last-Modified.
        • We need a limit the first time (scraping is paginated, if we go all the way to the beginning of an account it'll take ages).
    • Should model the URLs on Twitter's (https://twitter.com/$user, https://twitter.com/$user/with_replies, etc.).
    • Presentation matters:
      • Threads should be shown in a sane way.
      • Media should be inlined (e.g. a link to an image should be shown as an <img ...>).
      • Titles should work with the dedupe plugin (likely, no title).
    plugin 
    opened by lemon24 9
  • Handle non-http feeds gracefully in the web app

    ...or don't handle them at all.

    Adding feed /dev/zero kills the web app. People using Reader directly are free to shoot themselves in the foot however they want; the app should not allow them to, especially if it's not their foot they're shooting.

    Update: Oh look, we already have a TODO for this: https://github.com/lemon24/reader/blob/165a0af5f3510dd64fc4c75de17e9c5f45f25c06/src/reader/core/parser.py#L155

    API web app core 
    opened by lemon24 9
  • Some feeds have duplicate entries

    Some feeds have duplicate entries, or their entries' ids change (resulting in an entry being stored twice).

    E.g., a feed that had the entry id format updated:

    $ sqlite3 db.sqlite 'select feed, id, updated, title from entries where title like "RE: xkcd%"' -line
       feed = http://sealedabstract.com/feed/
         id = http://sealedabstract.com/?p=2494
    updated = 2014-09-09 08:30:25
      title = RE: xkcd #1357 free speech
    
       feed = http://sealedabstract.com/feed/
         id = /?p=2494
    updated = 2014-09-09 07:30:25
      title = RE: xkcd #1357 free speech
    

    If possible, only one should be shown (similar to #78).

    API web app 
    opened by lemon24 9
  • CLI options must be passed all the time

    opened by lemon24 8
  • I don't know what's happening during update_feeds()

    I don't know what's happening during update_feeds() until it finishes. Moreover, there's no straightforward way to know programmatically which feeds failed (logging or guessing from feed.last_exception don't count).

    From https://clig.dev/#robustness-guidelines:

    Responsive is more important than fast. Print something to the user in <100ms. If you’re making a network request, print something before you do it so it doesn’t hang and look broken.

    Show progress if something takes a long time. If your program displays no output for a while, it will look broken. A good spinner or progress indicator can make a program appear to be faster than it is.

    Doing either of these is hard at the moment.
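
    A sketch of per-feed progress reporting, assuming the update_feeds_iter() mentioned in the [Question] issue above is available in your version:

    for result in reader.update_feeds_iter():
        if isinstance(result.value, Exception):
            print(f"failed:  {result.url}: {result.value}")
        else:
            print(f"updated: {result.url}: {result.value}")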

    API core 
    opened by lemon24 7
  • REST API

    Have you thought about providing a REST API returning feed data as JSON? This would help with implementing other UIs. I will probably experiment with that and, if you are interested, provide a PR.

    help wanted wontfix 
    opened by clemera 7
  • Gemini subscription support

    https://gemini.circumlunar.space/docs/companion/subscription.gmi

    Prerequisites:

    • a way to fetch gemini:// URLs
      • https://pypi.org/project/aiogemini/
      • https://pypi.org/project/gemurl/
      • https://github.com/kr1sp1n/awesome-gemini#programming
        • https://github.com/cbrews/ignition
        • https://framagit.org/bortzmeyer/agunua
        • https://tildegit.org/solderpunk/AV-98 ("canonical" CLI client)
        • https://tildegit.org/solderpunk/CAPCOM ("canonical" CLI feed reader)
    • figure out how to handle TOFU (trust on first use)
    • a way to render GMI to HTML (re-check awesome-gemini above); only if we fetch linked entries

    This might be a good way to explore how a plugin (PyPI) package would work.

    opened by lemon24 0
  • Sort by recently interacted with

    It would be nice to get entries the user recently interacted with.

    "Interacted" means, at a minimum, marked as (un)read/important. "Set tag" might be nice. "Downloaded enclosure" might be nice too.

    Arguably, the mark_as_read plugin (and plugins in general) should not count as an interaction. If we use read_/important_modified, mark_as_read should probably set them to None – but this would likely break the "don't care" tri-state (https://github.com/lemon24/reader/issues/254#issuecomment-938146589).

    enhancement core 
    opened by lemon24 0
  • 4.0 backwards compatibility breaks

    This is to track all the backwards compatibility breaks we want to do in 4.0.

    Things that require deprecation warnings pre-4.0:

    • ...

    Things that do not require / can't (easily) have deprecation warnings pre-4.0:

    • [ ] make most public dataclass fields KW_ONLY (after we drop Python 3.9)
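
    A generic illustration of what KW_ONLY means (not reader's actual dataclasses; kw_only requires Python 3.10+, hence "after we drop Python 3.9"):

    from dataclasses import dataclass

    @dataclass(frozen=True, kw_only=True)
    class Example:
        url: str
        title: str | None = None

    Example(url='http://example.com/')  # OK
    Example('http://example.com/')      # TypeError: no positional arguments allowed
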
    API core 
    opened by lemon24 0
  • Deleting a feed deletes its important entries

    They should likely be moved to an "Archived" feed (mentioned in https://github.com/lemon24/reader/issues/96#issuecomment-460077441).

    • Old entries in the archived feed should not be deleted (by #96); if they remain important, they won't.
    • The entry source (#276) and original_feed_url should be set to the to-be-deleted feed (if not already set).
    core plugin 
    opened by lemon24 0
  • Typing cleanup

    Clean up typing stuff. Might be able to use pyupgrade for this.

    To do:

    • [x] from __future__ import annotations (and the changes it enables)
    • [ ] don't depend on typing_extensions at runtime (example); see the sketch after this list
    • [ ] typing.Self (supported by mypy master, but not by 0.991)
    • [ ] show types for data objects in docs
    • [ ] ~~move TagFilter, {Feed,Entry}Filter options to _storage (todo)~~ no, used by _search as well, just remove todo
    • [ ] make DEFAULT_RESERVED_NAME_SCHEME public
    • [ ] move reader.core.*Hook to types
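
    A sketch of the usual pattern for the typing_extensions item above (an assumption about the intent, not reader's actual code):

    from __future__ import annotations
    from typing import TYPE_CHECKING

    if TYPE_CHECKING:
        # imported only by type checkers, so there is no runtime dependency
        from typing_extensions import Self

    class Thing:
        def copy(self) -> Self:
            return type(self)()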

    To not do:

    • Consolidate (public) aliases under reader.typing (Flask does it)?
      • No, reader.types is likely good enough (but maybe consolidate them in one place in the file).
    • import typing as t, import typing_extensions as te (Flask does it)?
      • Fewer imports, but makes code look kinda ugly.
      • Also,

        When adding types, the convention is to import types using the form from typing import Union (as opposed to doing just import typing or import typing as t or from typing import *).

    opened by lemon24 0