Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)

Overview

trafilatura: Web scraping tool for text discovery and retrieval

Description

Trafilatura is a Python package and command-line tool which seamlessly downloads, parses, and scrapes web page data: it can extract metadata, main body text and comments while preserving parts of the text formatting and page structure. The output can be converted to different formats.

Distinguishing between a whole page and the page's essential parts can help to alleviate many quality problems related to web text processing, by dealing with the noise caused by recurring elements (headers and footers, ads, links/blogroll, etc.).

The extractor aims to be precise enough not to miss texts or discard valid documents. It must also be robust and reasonably fast. With these objectives in mind, Trafilatura is designed to run in production on millions of web documents. It is based on lxml, with readability and jusText as fallbacks.

Features

  • Seamless parallelized online and offline processing:
    • Download and conversion utilities included
    • URLs, HTML files or parsed HTML trees as input
  • Robust and efficient extraction:
    • Main text and/or comments
    • Structural elements preserved: paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting
    • Extraction of metadata (title, author, date, site name, categories and tags)
  • Several output formats supported:
    • Plain text (minimal formatting)
    • CSV (with metadata, tab-separated values)
    • JSON (with metadata)
    • XML (for metadata and structure) and TEI-XML
  • Link discovery and URL lists:
    • Support for sitemaps and ATOM/RSS feeds
    • Efficient and polite processing of URL queues
    • Blacklisting
  • Optional language detection on extracted content

Evaluation and alternatives

For more detailed results see the evaluation page and evaluation script. To reproduce the tests just clone the repository, install all necessary packages and run the evaluation script with the data provided in the tests directory.

500 documents, 1487 text and 1496 boilerplate segments (2020-11-06)

Python Package                    Precision  Recall  Accuracy  F-Score  Diff.
justext 2.2.0 (tweaked)           0.870      0.584   0.749     0.699    6.1x
newspaper3k 0.2.8                 0.921      0.574   0.763     0.708    12.9x
goose3 3.1.6                      0.950      0.629   0.799     0.757    19.0x
boilerpy3 1.0.2 (article mode)    0.851      0.696   0.788     0.766    4.8x
baseline (text markup)            0.746      0.804   0.766     0.774    1x
dragnet 2.0.4                     0.906      0.689   0.810     0.783    3.1x
readability-lxml 0.8.1            0.917      0.716   0.826     0.804    5.9x
news-please 1.5.13                0.923      0.711   0.827     0.804    184x
trafilatura 0.6.0                 0.924      0.849   0.890     0.885    3.9x
trafilatura 0.6.0 (+ fallbacks)   0.933      0.877   0.907     0.904    8.4x

External evaluations:

Usage and documentation

For further information please refer to the documentation at trafilatura.readthedocs.io.

License

trafilatura is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bound by the license conditions, please try interacting at arm's length, multi-licensing with compatible licenses, or contacting me.

See also GPL and free software licensing: What's in it for business?

Roadmap

  • [-] Duplicate detection at sentence, paragraph and document level using a least recently used (LRU) cache
  • [-] URL lists and document management
  • [-] Configuration and extraction parameters
  • [-] Graphical user interface
  • [ ] Interaction with web archives (notably WARC format)
  • [ ] Integration of natural language processing tools
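
The duplicate detection item above can be illustrated with a small sketch. It is not the package's implementation: the class name `LRUSeen`, its `maxsize` default and the `threshold` parameter are hypothetical, and only the standard library is used. It counts how often a normalized text segment has been seen and forgets the least recently used segments once the cache is full.

```python
from collections import OrderedDict


class LRUSeen:
    """Hypothetical helper: remember recently seen segments in LRU order."""

    def __init__(self, maxsize=1000):
        self.maxsize = maxsize
        self._counts = OrderedDict()

    def is_duplicate(self, text, threshold=1):
        # Normalize whitespace so trivially different copies compare equal.
        key = " ".join(text.split())
        count = self._counts.pop(key, 0)
        # Re-inserting the key marks it as the most recently used entry.
        self._counts[key] = count + 1
        if len(self._counts) > self.maxsize:
            # Evict the least recently used entry.
            self._counts.popitem(last=False)
        return count >= threshold


seen = LRUSeen(maxsize=2)
first_time = seen.is_duplicate("Subscribe to our newsletter")
second_time = seen.is_duplicate("Subscribe  to our newsletter")
```

The same structure could be applied at sentence, paragraph or document level by changing what is passed as `text`.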

Contributing

Contributions are welcome!

Feel free to file issues on the dedicated page. Thanks to the contributors who submitted features and bugfixes!

Author

This effort is part of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. Web corpus construction involves numerous design decisions, and this software package can help facilitate text data collection and enhance corpus quality.

You can contact me via my contact page or GitHub.

Going further

Online documentation: trafilatura.readthedocs.io.

Tutorials: overview.

Trafilatura: Italian word for wire drawing.

Corresponding posts on Bits of Language (blog).

Comments
  • Celery error with v1.2.1: ValueError: signal only works in main thread

    With version 1.2.1, it is not possible to run trafilatura extraction inside an asynchronous task such as a Celery worker: https://github.com/adbar/trafilatura/blob/1bb5fee6a4812e53b6597053c25efde995174d79/trafilatura/core.py#L982 It would be better to have HAS_SIGNAL as a configuration variable rather than a hardcoded value.

    celery_1      |     text = trafilatura.extract(
    celery_1      |   File "/usr/local/lib/python3.8/site-packages/trafilatura/core.py", line 982, in extract
    celery_1      |     signal(SIGALRM, timeout_handler)
    celery_1      |   File "/usr/local/lib/python3.8/signal.py", line 47, in signal
    celery_1      |     handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
    celery_1      | ValueError: signal only works in main thread
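
The traceback above comes from the fact that Python only allows `signal.signal()` to be called from the main thread. The sketch below shows one possible guard, using only the standard library. It is an illustration, not the library's code: the names `can_use_alarm` and `run_with_timeout` are hypothetical, and the fallback simply runs the function without a timeout when no alarm can be set.

```python
import signal
import threading


def can_use_alarm():
    """True only where SIGALRM exists and we are in the main thread."""
    return (
        hasattr(signal, "SIGALRM")
        and threading.current_thread() is threading.main_thread()
    )


def run_with_timeout(func, seconds, *args, **kwargs):
    """Hypothetical helper: apply an alarm-based timeout when possible."""
    if not can_use_alarm() or seconds <= 0:
        # Worker threads (e.g. task queues) end up here: no timeout is applied.
        return func(*args, **kwargs)

    def handler(signum, frame):
        raise TimeoutError("processing took too long")

    previous = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        return func(*args, **kwargs)
    finally:
        signal.alarm(0)  # cancel the pending alarm
        if previous is not None:
            signal.signal(signal.SIGALRM, previous)


# Calling from a worker thread no longer raises ValueError.
results = []
worker = threading.Thread(
    target=lambda: results.append(run_with_timeout(len, 5, "abc"))
)
worker.start()
worker.join()
```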
    
    feedback 
    opened by alex-bender 16
  • No metadata extraction

    Hello,

    Thanks for your beautiful and powerful project. I am testing some websites with trafilatura 0.6.0 in Python 3.8.

    My test:

    import trafilatura
    from trafilatura.core import bare_extraction
    
    downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
    
    result = bare_extraction(downloaded, include_formatting=False, with_metadata=True)
    
    print(result)
    

    The results: ({'title': None, 'author': None, 'url': None, 'hostname': None, 'description': None, 'sitename': None, 'date': None, 'categories': None, 'tags': None, 'fingerprint': None, 'id': None}, 'Leader spotlight: Erin Spiceland Every March we recognize the women who have shaped history—and now, we’re taking a look forward. From driving software development in large companies to maintaining thriving open source communities, we’re spending Women’s History Month with women leaders who are making history every day in the tech community. Erin Spiceland is a Software Engineer for SpaceX. Born and raised in rural south Georgia, she is a Choctaw and Chickasaw mother of two now living in downtown Los Angeles. Erin didn’t finish college—she’s a predominantly self-taught software engineer. In her spare time, she makes handmade Native American beadwork and regalia and attends powwows. How would you summarize your career (so far) in a single sentence? My career has been a winding road through periods of stimulation and health as well as periods of personal misery. During it all, I’ve learned a variety of programming languages and technologies while working on a diverse array of products and services. I’m a domestic abuse survivor and a Choctaw bisexual polyamorous woman. I’m so proud of myself that I made it this far considering where I came from. What was your first job in tech like? In 2007, I had a three-year-old daughter and I was trying to finish my computer science degree one class at a time, all while keeping my house and family running smoothly. I found the math classes exciting and quickly finished my math minor, leaving only computer science classes. I was looking at about five years before I would graduate. Then, my husband at the time recommended me for an entry software developer position at a telecom and digital communications company. 
When faced with the choice between an expensive computer science degree and getting paid to do what I loved, I dropped out of college and accepted the job. I was hired to work on internal tooling, and eventually, products. I did a lot of development on product front-ends, embedded network devices, and a distributed platform-as-a-service. I learned Java/JSP, Python, JavaScript/CSS, Node.js, as well as MySQL, PostgreSQL, and distributed systems architecture. It was an intense experience that required a lot of self-teaching, asking others for help, and daycare, but it set me up for my later successes. What does leadership mean to you in your current role? “Leadership is about enabling those below, above, and around you to be at their healthiest and most effective so that all of you can accurately understand your surroundings, make effective plans and goals for the future, and achieve those goals.” I appreciate and admire technical, effective leaders who care for their reports as humans, not as lines on a burndown chart, and forego heavy-handed direction in favor of communication and mutual dialogue. I think it’s as important for a leader to concern herself with her coworkers’ personal well-being as it is for her to direct their performance. What’s the biggest career risk you’ve ever taken? What did you learn from that experience? Last year I took a pay cut to move from a safe, easy job where I had security to work in a language I hadn’t seen in years and with systems more complicated than anything I’d worked with before. I moved from a place where I had a huge four bedroom house to a studio apartment that was twice the price. I moved away from my children, of who I share custody with my ex-husband. We fly across the U.S. to see each other now. I miss my children every day. However, I get to be a wonderful role model for them. 
“I get to show my children that a Native woman who grew up in poverty, lost her mother and her culture, and who didn’t finish college can learn, grow, and build whatever career and life she wants.” What are you looking forward to next? I can’t wait to wake up every day with my partner who loves me so much. I’m looking forward to showing my children exactly how far they can go. I’m excited to keep exploring Los Angeles. “I expect to learn so much more about software and about life, and I want to experience everything.” Want to know more about Erin Spiceland? Follow them on GitHub or Twitter. Want to learn more about featured leaders for Women’s History Month? Read about: Laura Frank Tacho, Director of Engineering at CloudBees Rachel White, Developer Experience Lead at American Express Kathy Pham, Computer Scientist and Product Leader at Mozilla and Harvard Heidy Khlaaf, Research Consultant at Adelard LLP Check back in soon—we’ll be adding new interviews weekly throughout March.', <Element body at 0x10680a280>, <Element body at 0x1067af080>)

    So no metadata is returned.

    Also, I added an XPath expression to metaxpaths.py and rebuilt your code. I'm sure that //div[contains(@class, "post__categories")]//li//a matches a category at the URL https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/. But no category is returned.

    categories_xpaths = [
        """//div[starts-with(@class, 'post-info') or starts-with(@class, 'postinfo') or
        starts-with(@class, 'post-meta') or starts-with(@class, 'postmeta') or
        starts-with(@class, 'meta') or starts-with(@class, 'entry-meta') or starts-with(@class, 'entry-info') or
        starts-with(@class, 'entry-utility') or starts-with(@id, 'postpath')]//a""",
        "//p[starts-with(@class, 'postmeta') or starts-with(@class, 'entry-categories') or @class='postinfo' or @id='filedunder']//a",
        "//footer[starts-with(@class, 'entry-meta') or starts-with(@class, 'entry-footer') or starts-with(@class, 'post-info')]//a",
        '//*[(self::li or self::span)][@class="post-category" or starts-with(@class, "post__categories") or @class="postcategory" or @class="entry-category"]//a',
        '//header[@class="entry-header"]//a',
        '//div[@class="row" or @class="tags"]//a',
        '//div[contains(@class, "post__categories")]//li//a',
    ]
    

    Another question: can I get the content of the article with its HTML formatting (without the tags being cleaned from the content)?

    Please help me, thanks for your support!

    enhancement 
    opened by phongtnit 16
  • Issue with multiple authors and preference for meta information

    We shouldn't rely on the schema person.

    agenda Current: "author": "Sandy Cheu", Should be: "author": "Stephen Teulan; Nikita Weikhardt",

    aged Current: "author":"Consumers", Should be: "author": "Liz Alderslade",

    meta remove single names cath Current: "author": null, Should be: "author": "Rebecca",

    echo Current: "author": null, Should be: "author": "Katie",

    enhancement 
    opened by felipehertzer 15
  • Navigation bar filtering - some bug fixed

    Does the current repo work well now? I have removed several unused things and fixed a tiny bug that affected the accuracy. I have also added entries to the gitignore, so the branch should now be quite clean as well XD

    opened by immortal-autumn 13
  • No Formatting in Plain Text Output

    When using include_formatting for plain text, I'm not seeing any formatting (bold, italics, etc..). The term I'm using supports this. Is this by design or a bug? I tried both the standalone version and using it as a library with trafilatura.extract(downloaded, include_formatting=True).

    enhancement question 
    opened by peterjschroeder 13
  • Performance enhancement

    I. Test file

    test2.py
    from time import time
    
    import requests
    from trafilatura import extract
    
    
    if __name__ == '__main__':
        urls = ["https://en.wikipedia.org/wiki/List_of_Hindi_songs_recorded_by_Asha_Bhosle",
                "https://en.wikipedia.org/wiki/2022_in_video_games",
                "https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Kuwait",
                "https://en.wikipedia.org/wiki/Presidency_of_Rodrigo_Duterte",
                "https://en.wikipedia.org/wiki/List_of_2021%E2%80%9322_NBA_season_transactions",
                "https://en.wikipedia.org/wiki/2022_in_sports",
                "https://en.wikipedia.org/wiki/Firefox_version_history",
                "https://en.wikipedia.org/wiki/List_of_common_misconceptions",
                "https://en.wikipedia.org/wiki/Same-sex_union_legislation",
                "https://en.wikipedia.org/wiki/Presidency_of_Donald_Trump",]
    
        cum_time = 0
        for url in urls:        
            resp = requests.get(url)
            t0 = time()
            result = extract(resp.text)
            cum_time = cum_time + time() - t0
        print(cum_time)
    

    II. Test line_profiler (kernprof)

    kernprof -lv test2.py
    

    before

    Total time: 0.544693 s
    File: /trafilatura-master/trafilatura/utils.py
    Function: remove_control_characters at line 221
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
       221                                           @profile
       222                                           def remove_control_characters(string):
       223                                               '''Prevent non-printable and XML invalid character errors'''
       224     25998     544693.0     21.0    100.0      return ''.join([c for c in string if c.isprintable() or c.isspace()])
    

    after

    Total time: 0.169241 s
    File: /trafilatura-master/trafilatura/utils.py
    Function: remove_control_characters at line 227
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
       227                                           @profile
       228                                           def remove_control_characters(string):
       229                                               '''Prevent non-printable and XML invalid character errors'''
       230     25998     169241.0      6.5    100.0      return ''.join(filter(is_printable_or_space, string))
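
The two profiled variants can be compared side by side in a minimal sketch. The body of `is_printable_or_space` is not shown in the profiler output above, so the definition here is an assumption, and the suffixed function names are made up for the comparison. The sketch only checks that both variants produce the same result; it makes no claim about timings.

```python
def is_printable_or_space(char):
    # Assumed definition: same test as in the original list comprehension.
    return char.isprintable() or char.isspace()


def remove_control_characters_listcomp(string):
    """'Before' variant: list comprehension with the test written inline."""
    return ''.join([c for c in string if c.isprintable() or c.isspace()])


def remove_control_characters_filter(string):
    """'After' variant: filter() with a helper function."""
    return ''.join(filter(is_printable_or_space, string))


sample = "a\x00b\tc\n\x7f"
before = remove_control_characters_listcomp(sample)
after = remove_control_characters_filter(sample)
```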
    

    III. Test vprof

    vprof -c -h test2.py
    

    before / after: vprof output (images)

    feedback 
    opened by deedy5 10
  • Correction in the extraction of authors by tag and by json

    In this correction:

    • added 'submitted-by' and 'username' tags to the XPath expressions
    • increased the maximum length of the author's name
    • added a regex to remove emoji from author names, often found on sites like BuzzFeed
    • added a regex to minify JSON before running the other regexes, as fetching authors failed when the JSON was formatted
    • added a regex to remove JSON items like images and organization before searching for the author
    • reorganized the extract_json function, as it was overwriting meta tags with None when no JSON was found
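
The emoji removal step in the list above could look roughly like the following sketch. The character ranges and the names `EMOJI_RE` and `strip_emoji` are assumptions chosen for illustration; the regex actually used in this pull request is not shown here and may differ.

```python
import re

# Assumed ranges: pictographs and symbols, dingbats, the variation selector
# and the zero-width joiner used inside emoji sequences.
EMOJI_RE = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]+")


def strip_emoji(name):
    """Remove emoji from an author name and tidy the remaining whitespace."""
    cleaned = EMOJI_RE.sub("", name)
    return " ".join(cleaned.split())


example = strip_emoji("Olivia \u2764\ufe0f")
```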

    qsr Before this fix: "author": null After this fix: "author": "Kevin Santos"

    perthnow Before this fix: "author": "NCA NewsWire" After this fix: "author": "Finn McHugh"

    buzzfeed Before this fix: "author": "Hameda Nafiz BuzzFeed Staff" After this fix: "author": "Hameda Nafiz"

    buzzfeed Before this fix: "author": "Olivia ❤️" After this fix: "author": "Olivia Community Contributor"

    build Before this fix: "author": null After this fix: "author": "Thoams Lane"

    hunterandbligh Before this fix: none After this fix: "author": "REBECCA MAGRO"

    abc - 'data-component' Before this fix: "author": null After this fix: "author": "Charlotte Gore"

    proactiveinvestors Before this fix: "author": null After this fix: "author": "Calum Muirhead"

    banking Before this fix: "author": "Sarah Harman Jul" After this fix: "author": "Sarah Harman"

    hcamag Before this fix: "author": "Sarah Harman Jul" After this fix: "author": "Mark Rosanes"

    spacedaily and + 9 sites Before this fix: "author": null After this fix: "author": "Lucie Aubourg"

    first Before this fix:"author": "Nick Griffin", After this fix: "author": "Stan Shamu",

    racing Before this fix:"author": "Ben Sporle - @bensporle; Ben Sporle", After this fix: "author": "Ben Sporle",

    ajn Before this fix:"author": "RABBI GARY ROBUCK July", After this fix: "author": "RABBI GARY ROBUCK",

    ESPN (not fully fixed, but better) Before this fix: "author": "Andrew Mcglashandeputy Editor, Espncricinfo", After this fix: "author": "Andrew McGlashan Deputy editor; ESPNcricinfo",

    Probono (not fully fixed, but better) Before this fix: "author": null, After this fix: "author": "Luke Michael; Journalist; @Luke_Michael",

    opened by felipehertzer 10
  • Library is redirecting stderr to /dev/null upon every call

    If the readability fallback is activated, the Trafilatura library redirects stderr to /dev/null upon every call: https://github.com/adbar/trafilatura/blob/a56fb3e041175df38a32b1c5ef2e9c7888eeb7a6/trafilatura/external.py#L63

    Within programs involving other libraries, this causes a host of side effects. E.g., generating a chart with seaborn imports ipython (a dependency of seaborn) which pre-checks upon initialization stdin, stdout and stderr and crashes because stderr is /dev/null. I have other side effects as well in other libraries, including disappearing logs (eg when logs settings are modified after calls to Trafilatura).

    This redirection seems to have been necessary to prevent the readability library from printing messages to stderr. A cursory reading of the current version of readability seems to indicate that it no longer does that; it only emits proper logs.

    Consequently, this redirect may be removed (to be tested).
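
If silencing were still needed, a scoped redirect would avoid the side effects described above, because stderr is restored when the block ends. The sketch below uses only the standard library and is not the library's code; the name `silenced_stderr` is hypothetical. Note that `contextlib.redirect_stderr` swaps `sys.stderr` at the Python level only, so output written directly to file descriptor 2 by C extensions would not be affected.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def silenced_stderr():
    """Hypothetical helper: send sys.stderr to the null device temporarily."""
    with open(os.devnull, "w") as devnull:
        with contextlib.redirect_stderr(devnull):
            yield


original_stderr = sys.stderr
with silenced_stderr():
    print("this message is discarded", file=sys.stderr)
    redirected_inside = sys.stderr is not original_stderr
restored_after = sys.stderr is original_stderr
```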

    opened by dmoklaf 10
  • In parallel trafilatura is marginally slower than goose

    I'm not quite sure where to begin with this, it's a strange one. In a real world scenario I tried switching from Goose3 to Trafilatura. I'm processing html extractions in parallel with dask. After switching to trafilatura, I noticed a 30% slowdown. I ended up writing my own evaluation library to verify the results.

    Results from running in parallel: ┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓ ┃ Library ┃ Accuracy ┃ Precision ┃ Recall ┃ FScore ┃ Mean Similarity ┃ Items/sec ┃ ┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩ │ goose3 │ 0.9678 │ 0.8561 │ 0.9547 │ 0.9027 │ 0.8343 │ 383.4737 │ │ trafilatura │ 0.9124 │ 0.9485 │ 0.908 │ 0.9278 │ 0.8567 │ 361.3232 │ └─────────────┴──────────┴───────────┴────────┴────────┴─────────────────┴───────────┘

    Results from running sequentially: ┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓ ┃ Library ┃ Accuracy ┃ Precision ┃ Recall ┃ FScore ┃ Mean Similarity ┃ Items/sec ┃ ┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩ │ goose3 │ 0.9678 │ 0.8561 │ 0.9547 │ 0.9027 │ 0.8343 │ 9.7953 │ │ trafilatura │ 0.9124 │ 0.9485 │ 0.908 │ 0.9278 │ 0.8567 │ 23.0045 │ └─────────────┴──────────┴───────────┴────────┴────────┴─────────────────┴───────────┘

    Note: the dataset evaluated is from the scrapinghub/article-extraction-benchmark tool. The only portion of the code that runs in parallel for the benchmarks is the extraction. Only the extraction is timed for calculating items/sec.

    In summary: trafilatura is marginally slower than Goose3 in parallel. However sequentially it is twice as fast as Goose3.

    I'm not sure where to begin with this. It can be difficult to profile parallel processing. It may be related to some of the memory leak issues reported with trafilatura, although it appears those have been resolved. Or it may be the caching; I haven't looked into how that functions.

    I will work on publishing my benchmarking tool this afternoon.

    question 
    opened by getorca 9
  • Handle pages where article is split into multiple sibling nodes

    This fixes #85 (and #159).

    It involved a bit of a refactor of the extract_content function, but the basic idea is that it looks through all of the children in the subtree returned from tree.xpath(expr), not just stopping at the first child like before. Beyond that, it pulls out the logic that checks whether the BODY_XPATH expression matched in the current loop iteration has found a useful subtree, to make it a little more readable, and only performs the final cleanup and look-elsewhere logic at the very end.

    So essentially, on finding a subtree whose first node is valid, we proceed to consider all of the remaining nodes in that subtree.
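
The difference between stopping at the first match and walking through every match can be shown with a toy example. This is a sketch under simplifying assumptions, not the refactored extract_content function: it uses the standard library's ElementTree instead of lxml, and the markup, the path expression and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Toy page: the article is split into two sibling nodes around an ad block.
PAGE = """<body>
  <div class="article-body"><p>Part one.</p></div>
  <div class="ad">Buy now</div>
  <div class="article-body"><p>Part two.</p></div>
</body>"""


def text_of(node):
    """Collect the text of a node and normalize its whitespace."""
    return " ".join("".join(node.itertext()).split())


def extract_first_match(tree, path):
    """Old behaviour in this sketch: only the first matching node is used."""
    matches = tree.findall(path)
    return text_of(matches[0]) if matches else ""


def extract_all_matches(tree, path):
    """New behaviour in this sketch: every matching node contributes text."""
    parts = [text_of(node) for node in tree.findall(path)]
    return " ".join(part for part in parts if part)


tree = ET.fromstring(PAGE)
path = ".//div[@class='article-body']"
first_only = extract_first_match(tree, path)
all_parts = extract_all_matches(tree, path)
```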

    This seems to work great, although I haven't run it through the automated tests. (I had trouble running the url tests.)

    Let me know what you think. Happy to talk through anything, and if/when this seems good to you, I'll clean it up (print statements, code style, etc.).

    Thanks!

    opened by naftalibeder 9
  • Broken parsing of images

    I'm not quite sure what's wrong with images but here is reproducer:

    $ curl https://en.wikipedia.org/wiki/Tribe > /tmp/tribe.html
    $ python
    Python 3.7.6 (default, Jan  8 2020, 19:59:22) 
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import trafilatura
    >>> html_wiki_tribe = open('/tmp/tribe.html').read()
    >>> text = trafilatura.extract(
    ...     html_wiki_tribe,
    ...     include_images=True
    ... )
    ~/anaconda3/lib/python3.7/site-packages/trafilatura/xml.py in xmltotxt(xmloutput, include_formatting, include_links)
        272             LOGGER.debug('unexpected element: %s', element.tag)
        273             returnlist.extend([textelement, ' '])
    --> 274     return sanitize(''.join(returnlist))
        275 
        276 
    
    TypeError: sequence item 6: expected str instance, NoneType found
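
The TypeError above is what `str.join` raises when the sequence contains `None`. A minimal sketch of a defensive join follows; it is not the fix that was applied in the library, and `join_text_parts` is a hypothetical name.

```python
def join_text_parts(parts):
    """Join text fragments while skipping missing (None) entries."""
    return ''.join(part for part in parts if part is not None)


parts = ['Tribe ', None, 'text']

# The plain join fails as soon as it meets the None entry.
try:
    ''.join(parts)
    error_message = None
except TypeError as err:
    error_message = str(err)

joined = join_text_parts(parts)
```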
    
    

    UPD: Looks like this could help: [image]

    bug 
    opened by alex-bender 9
  • Improve title extraction by removing sitename suffix

    Most sites add a suffix like:

    • My article title | My Site Name
    • My article title - My Site Name

    There is no need for the site name within the article title.

    Common separators are: - | – — • · ‹ › ⁄ « » < > : * ⋆ ~

    Some sites use html entities for this, like: &#8212;
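
One way to approach the request is sketched below, using only the standard library. It is an illustration, not an existing trafilatura function: `strip_sitename` is a hypothetical name, the separator set is the one listed above, and the site name is assumed to be known already (for example from the metadata). HTML entities are decoded first so that `&#8212;` is treated like the character it stands for.

```python
import html
import re

# Separators from the list above, written as escapes where non-ASCII:
# - | en dash, em dash, bullet, middle dot, angle quotes, fraction slash,
# guillemets, < > : * star operator, ~
SEPARATORS = "-|\u2013\u2014\u2022\u00b7\u2039\u203a\u2044\u00ab\u00bb<>:*\u22c6~"
# A separator only counts when it is surrounded by whitespace.
SEPARATOR_RE = re.compile(r"\s+[" + re.escape(SEPARATORS) + r"]\s+")


def strip_sitename(title, sitename):
    """Drop a trailing ' <separator> <sitename>' suffix from a page title."""
    title = html.unescape(title).strip()
    matches = list(SEPARATOR_RE.finditer(title))
    if not matches:
        return title
    last = matches[-1]
    suffix = title[last.end():].strip()
    if suffix.casefold() == sitename.strip().casefold():
        return title[:last.start()].strip()
    return title


cleaned_title = strip_sitename("My article title | My Site Name", "My Site Name")
```

Comparing the suffix with a known site name keeps titles that legitimately contain a separator, such as "Part 1 - Introduction", unchanged.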

    enhancement 
    opened by andremacola 5
  • Remove unwanted html elements with regex or xpaths

    Possibility to remove unnecessary html elements before starting the extraction process.

    There are often some elements within the extracted text that are not article content.

    Titles should by default not come inside the extracted text, or there should be an option to remove them (maybe this requires another issue)

    Something like:

    unwanted = [
      'iframe',
      'button',
      'figcaption',
      'caption',
      'form',
      'aside',
      'script',
      'style',
      'ins',
      'link',
      'header',
      'footer',
      '#comments',
      'nav',
      '.post-comments',
      '.post-tags',
      '.wp-block-embed',
      '.wp-caption-text',
      'svg',
      '[class^=ads]',
      '[class*=ads-]',
      '[style="display:none"]',
      '[style*="display:none"]',
      '[style*="display: none"]',
      '[itemprop*="description"]',
      '.push-web-notification',
      '.mc-column.entities',
      '.newsletter-component',
      '.post-subject',
      '.post-info',
      '.addthis_tool',
      '.pt-cv-wrapper'
    ]
    
    # unwanted_elements is the option proposed in this issue
    article = trafilatura.bare_extraction(document,
            unwanted_elements=unwanted,
            include_comments=False, include_tables=False,
            favor_precision=True, favor_recall=True,
            no_fallback=True, target_language=None,
            date_extraction_params={'extensive_search': True, 'original_date': True, 'outputformat': "%Y-%m-%dT%H:%M:%S%z"},
            config=config)
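
Until such an option exists, the pruning can be done on the parsed tree before extraction. The sketch below is a simplified illustration with the standard library's ElementTree: it handles plain tag names only (not the CSS selectors from the list above), `prune_unwanted` is a hypothetical helper, and any tail text following a removed element is dropped together with it.

```python
import xml.etree.ElementTree as ET

# Tag names taken from the list above; selectors such as '.post-tags'
# would need a CSS-aware library and are left out of this sketch.
UNWANTED_TAGS = {"iframe", "button", "form", "aside", "script", "style", "nav", "footer"}


def prune_unwanted(root, unwanted_tags=UNWANTED_TAGS):
    """Remove every descendant of root whose tag is in unwanted_tags."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in unwanted_tags:
                parent.remove(child)
    return root


document = ET.fromstring(
    "<div><p>Keep me.</p><aside>Related links</aside><script>track()</script></div>"
)
cleaned = ET.tostring(prune_unwanted(document), encoding="unicode")
```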
    
    question 
    opened by andremacola 4
  • feat: Add image urls to metadata

    Sometimes an image is not included in the text body, but it can be extracted from SEO meta tags.

    Issue: https://github.com/adbar/trafilatura/issues/281

    Unfortunately I didn't have time to create the tests

    opened by andremacola 2
  • Add image urls to metadata

    Sometimes an image is not included in the text body, but it can be extracted from SEO meta tags, as some article parsers do (https://github.com/extractus/article-extractor/blob/main/src/utils/extractMetaData.js)

    Here some metatags:

    'image'
    'og:image'
    'og:image:url'
    'og:image:secure_url'
    'twitter:image'
    'twitter:image:src'
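
Reading these tags can be sketched with the standard library's HTML parser. This is an illustration of the idea, not the code from the related pull request: the class and function names are hypothetical, and trying the keys in the order listed above is an assumption.

```python
from html.parser import HTMLParser

# Meta keys from the list above, tried in this order (an assumption).
IMAGE_KEYS = (
    "image",
    "og:image",
    "og:image:url",
    "og:image:secure_url",
    "twitter:image",
    "twitter:image:src",
)


class MetaImageParser(HTMLParser):
    """Collect the content of <meta> tags whose name or property is an image key."""

    def __init__(self):
        super().__init__()
        self.images = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key in IMAGE_KEYS and content:
            # Keep the first value found for each key.
            self.images.setdefault(key, content)


def find_image_url(html_text):
    """Return the first image URL found in the meta tags, or None."""
    parser = MetaImageParser()
    parser.feed(html_text)
    for key in IMAGE_KEYS:
        if key in parser.images:
            return parser.images[key]
    return None


page = (
    '<head><meta property="og:image" content="https://example.org/a.png">'
    '<meta name="twitter:image" content="https://example.org/b.png"/></head>'
)
image_url = find_image_url(page)
```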
    
    enhancement 
    opened by andremacola 1
  • Extraction of Youtube iframes and img elements with links

    Not able to fetch image tags or iframe tags when running from the command prompt on a Windows machine:

    trafilatura --sitemap "https://www.lyricspulp.com/" --list > linklist.txt
    trafilatura --sitemap homepage --list > linklist.txt
    trafilatura -i linklist.txt --xml -o outputfile.txt
    trafilatura -i linklist.txt --formatting --links --images --no-comments --xml -o outputfile.txt

    enhancement 
    opened by sampathmende 3
Releases(v1.4.0)
  • v1.4.0(Oct 18, 2022)

    Impact on extraction and output format:

    • better extraction (#233, #243 & #250 with @knit-bee, #246 with @mrienstra, #258)
    • XML: preserve list type as attribute (#229)
    • XML TEI: better conformity with @knit-bee (#238, #242, #253, #254)
    • faster text cleaning and shorter code (#237 with @deedy5, #245)
    • metadata: add language when detector is activated (#224)
    • metadata: extend fallbacks and test coverage for json_metadata functions by @felipehertzer (#235)
    • TXT: change markdown formatting of headers by @LaundroMat (#257)

    Smaller changes in convenience functions:

    • add function to clear caches (#219)
    • CLI: change exit code if download fails (#223)
    • settings: use "\n" for multiple user agents by @k-sareen (#241)

    Updates:

    • docs updated (and #244 by @dsgibbons)
    • package dependencies updated

    Full Changelog: https://github.com/adbar/trafilatura/compare/v1.3.0...v1.4.0

  • v1.3.0(Jul 29, 2022)

    • fast and robust html2txt() function added (#221)
    • more robust parsing (#228)
    • fixed bugs in metadata extraction, with @felipehertzer in #213 & #226
    • extraction about 10-20% faster, slightly better recall
    • partial fixes for memory leaks (#216)
    • docs extended and updated (#217, #225)
    • prepared deprecation of old process_record() function
    • more stable processing with updated dependencies

    Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.2...v1.3.0

  • v1.2.2(May 18, 2022)

    • more efficient rules for extraction
    • metadata: further attributes used (with @felipehertzer)
    • better baseline extraction
    • issues fixed: #202, #204, #205
    • evaluation updated

    Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.1...v1.2.2

  • v1.2.1(May 2, 2022)

    What's Changed

    • --precision and --recall arguments added to the CLI
    • better text cleaning: paywalls and comments
    • improvements for Chinese websites (with @glacierck & @immortal-autumn): #186, #187, #188
    • further bugs fixed: #189, #192 (with @felipehertzer), #200
    • efficiency: faster module loading and improved RAM footprint

    Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.0...v1.2.1

  • v1.2.0(Mar 7, 2022)

    • efficiency: replaced module readability-lxml by trimmed fork
    • bugs fixed: (#179, #180, #183, #184)
    • improved baseline extraction
    • cleaner metadata (with @felipehertzer)

    Full Changelog: https://github.com/adbar/trafilatura/compare/v1.1.0...v1.2.0

  • v1.1.0(Feb 21, 2022)

    • encodings: better detection, output NFC-normalized Unicode
    • maintenance and performance: more efficient code
    • bugs fixed (#119, #136, #147, #160, #161, #162, #164, #167 and others)
    • prepare compatibility with upcoming Python 3.11
    • changed default settings
    • extended documentation

    Full Changelog: https://github.com/adbar/trafilatura/compare/v1.0.0...v1.1.0

  • v1.0.0(Nov 30, 2021)

    • compress HTML backup files & seamlessly open .gz files
    • support JSON web feeds
    • graphical user interface integrated into main package
    • faster downloads: reviewed backoff, compressed data
    • optional modules: downloads with pycurl, language identification with py3langid
    • bugs fixed (#111, #125, #132, #136, #140)
    • minor optimizations and fixes by @vbarbaresi in #124 & #130
    • fixed array with single or multiples entries on json extractor by @felipehertzer in #143
    • code base refactored with @sourcery-ai #121, improved and optimized for Python 3.6+
    • drop support for Python 3.5

    Full Changelog: https://github.com/adbar/trafilatura/compare/v0.9.3...v1.0.0

  • v0.9.3(Oct 21, 2021)

    • better, faster encoding detection: replaced chardet with charset_normalizer
    • faster execution: updated justext to 3.0
    • better extraction of sub-elements in tables (#78, #90)
    • more robust web feed parsing
    • further defined precision- and recall-oriented settings
    • license extraction in footers (#118)

    Full Changelog: https://github.com/adbar/trafilatura/compare/v0.9.2...v0.9.3

  • v0.9.2(Oct 6, 2021)

    • first precision- and recall-oriented presets defined
    • improvements in authorship extraction (thanks @felipehertzer)
    • requesting TXT output with formatting now results in Markdown format
    • bugs fixed: notably extraction robustness and consistency (#109, #111, #113)
    • setting for cookies in request headers (thanks @muellermartin)
    • better date extraction thanks to htmldate update
  • v0.9.1(Aug 2, 2021)

    • improved author extraction (thanks @felipehertzer!)
    • bugs fixed: HTML element handling, HTML meta attributes, spider, CLI, ...
    • docs updated and extended
    • CLI: option names normalized (heed deprecation warnings), new option explore
  • v0.9.0(Jun 15, 2021)

    • focused crawling functions including politeness rules
    • more efficient multi-threaded downloads + use as Python functions
    • documentation extended
    • bugs fixed: extraction and URL handling
    • removed support for Python 3.4
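
The politeness rules mentioned for focused crawling generally mean that a crawler waits a minimum time between two requests to the same host. The following is a rough sketch of that idea under stated assumptions: the class name `PolitenessTracker`, its methods and the five-second default are all hypothetical and do not reflect Trafilatura's actual implementation.

```python
from urllib.parse import urlsplit


class PolitenessTracker:
    """Track the last request time per host and report how long to wait.

    Hypothetical illustration; names and default delay are assumptions.
    """

    def __init__(self, min_delay: float = 5.0) -> None:
        self.min_delay = min_delay
        self._last_request = {}  # host -> timestamp of the last request

    def wait_time(self, url: str, now: float) -> float:
        """Return the seconds to wait before the host of url may be contacted."""
        host = urlsplit(url).netloc
        last = self._last_request.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (now - last))

    def record(self, url: str, now: float) -> None:
        """Remember that the host of url was contacted at time now."""
        self._last_request[urlsplit(url).netloc] = now


tracker = PolitenessTracker(min_delay=5.0)
tracker.record("https://example.org/a", now=100.0)
# Same host two seconds later: three more seconds of waiting are required
remaining = tracker.wait_time("https://example.org/b", now=102.0)
```

Passing the current time explicitly keeps the sketch deterministic; a real crawler would typically read it from a monotonic clock.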
  • v0.8.2(Apr 21, 2021)

    • better handling of formatting, links and images, title type as attribute in XML formats
    • more robust sitemaps and feeds processing
    • more accurate extraction
    • further consolidation: code simplified and bugs fixed
  • v0.8.1(Mar 11, 2021)

  • v0.8.0(Feb 19, 2021)

    • improved link discovery and handling
    • fixes in metadata extraction, feeds and sitemaps processing
    • breaking change: the extract function now reads target format from output_format argument only
    • new extraction option: preserve links, CLI options re-ordered
    • more opportunistic backup extraction
  • v0.7.0(Jan 4, 2021)

    • customizable configuration file to parametrize extraction and downloads
    • better handling of feeds and sitemaps
    • additional CLI options: cryptographic hash for file name, use Internet Archive as backup
    • more precise extraction
    • faster downloads: requests replaced with bare urllib3 and custom decoding
    • consolidation: bug fixes and improvements, many thanks to the issues reporters!
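
Naming output files after a cryptographic hash gives stable, collision-resistant file names. The sketch below shows the general idea only: which input is hashed and which algorithm is used are assumptions (here, SHA-256 over the document bytes), and `hashed_filename` is a hypothetical helper rather than a Trafilatura function.

```python
import hashlib


def hashed_filename(content: bytes, extension: str = ".txt") -> str:
    """Build a file name from the SHA-256 hex digest of the content.

    Hypothetical helper; algorithm and hashed input are assumptions.
    """
    digest = hashlib.sha256(content).hexdigest()
    return digest + extension


# Identical content always maps to the same name, so re-runs overwrite
# the same file instead of creating duplicates
name = hashed_filename(b"<html>example</html>")
```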
  • v0.6.1(Dec 2, 2020)

    • added bare_extraction function returning Python variables
    • improved link discovery in feeds and sitemaps
    • option to preserve image info
    • fixes (many thanks to bug reporters!)
  • v0.6.0(Nov 6, 2020)

  • v0.5.2(Sep 22, 2020)

    • optional language detector changed: langid → pycld3
    • helper function bare_extraction()
    • optional deduplication off by default
    • better URL handling (courlan), more complete metadata
    • code consolidation (cleaner and shorter)
  • v0.5.1(Jul 15, 2020)

  • v0.5.0(Jun 2, 2020)

    • faster and more robust text and metadata extraction
    • more efficient batch processing (parallel processing, URL queues)
    • support for ATOM/RSS feeds
    • complete command-line tool with corresponding options
  • v0.4.1(Apr 24, 2020)

  • v0.1.0(Sep 25, 2019)

Owner
Adrien Barbaresi
Research scientist – web texts, computational linguistics and digital humanities
Web Crawlers for Data Labelling of Malicious Domain Detection & IP Reputation Evaluation

Web Crawlers for Data Labelling of Malicious Domain Detection & IP Reputation Evaluation This repository provides two web crawlers to label domain nam

1 Nov 05, 2021
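An m3u8 video stream download script

A Python m3u8 stream video download script. Introduction: m3u8 video streams are increasingly common, and many good downloaders already exist. I am sharing a small script I wrote earlier for everyone to use. The purpose of this program is to give video download enthusiasts a download example that can be called directly, so there is no need to reinvent the wheel. Usage: run the program directly in Python or call it externally import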

Nchu 0 Oct 10, 2021
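Trending search rankings - Python crawler + regex (re) + beautifulsoup + xpath

Repository overview: Weibo trending searches, parameter wb; Baidu trending searches, parameter bd; 360 hot topics, parameter 360; CSDN hot list API, see below; other trending lists to be added. How to use? Register on Vercel, fork to your repository, click here in the upper right to complete deployment (one-click deployment). Request parameters: the address configured on Vercel + api?tit= + parameter (the repository overview has the parameter information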

Harry 3 Jul 08, 2022
A package that provides you Latest Cyber/Hacker News from website using Web-Scraping.

cybernews A package that provides you Latest Cyber/Hacker News from website using Web-Scraping. Latest Cyber/Hacker News Using Webscraping Developed b

Hitesh Rana 4 Jun 02, 2022
A webdriver-based script for reserving Tsinghua badminton courts.

AutoReserve A webdriver-based script for reserving badminton courts. Instructions: download chromedriver, choosing the version that matches your current Chrome; install selenium: pip install selenium; change the session and price information dat

Payne Zhang 4 Nov 09, 2021
Ebay Webscraper for Getting Average Product Price

Ebay-Webscraper-for-Getting-Average-Product-Price The code in this repo is used to determine the average price of an item on Ebay given a valid search

17 Jan 05, 2023
Open Crawl Vietnamese Text

Open Crawl Vietnamese Text This repo contains crawled Vietnamese text from multiple sources. This list of a topic-centric public data sources in high

QAI Research 4 Jan 05, 2022
robobrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

RoboBrowser: Your friendly neighborhood web scraper Homepage: http://robobrowser.readthedocs.org/ RoboBrowser is a simple, Pythonic library for browsi

Joshua Carp 3.7k Dec 27, 2022
Web-Scraping using Selenium Master

Web-Scraping using Selenium What is the need of Selenium? Some websites don't like to be scrapped and in that case you need to disguise your webscrapi

Md Rashidul Islam 1 Oct 26, 2021
Meme-videos - Scrapes memes and turns them into video compilations

Meme Videos Scrapes memes from reddit using praw and request and then converts t

Partho 12 Oct 28, 2022
A helper library to scrape data from Instagram effortlessly, using the Influencer Hunters APIs.

Instagram Scraper An utility library to scrape data from Instagram hassle-free Go to the website » View Demo · Report Bug · Request Feature About The

2 Jul 06, 2022
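Tencent Classroom: simulated login, course information retrieval, video download, video decryption.

Tencent Classroom script. I needed to learn some things, but Tencent Classroom does not support custom playback speed, shows a watermark during playback, and some teachers' lessons need more than one viewing, so this script was born. Time is tight, so only major bugs will be fixed, at irregular intervals. Feature updates such as multi-threaded downloads will not come in the short term; if you want to help improve this script, PRs are welcome. Tested working as of 2020.5.22. Usage: very simple, done in three steps: download the code,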

163 Dec 30, 2022
Python script that reads Aliexpress offer URLs from an Excel file (.csv) and posts them in a Telegram channel using a bot

Aliexpress to telegram post Python script that reads Aliexpress offers urls from a Excel filename (.csv) and post then in a Telegram channel using a b

Fernando 6 Dec 06, 2022
Pelican plugin that adds site search capability

Search: A Plugin for Pelican This plugin generates an index for searching content on a Pelican-powered site. Why would you want this? Static sites are

22 Nov 21, 2022
A python tool to scrape NFT's off of OpenSea

Right Click Bot A script to download NFT PNG's from OpenSea. All the NFT's you could ever want, no blockchain, for free. Usage Must Use Python 3! Auto

15 Jul 16, 2022
An arxiv spider

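An Arxiv Spider. As a CS student, Outstanding Boy knows well that the efficient way for a kernel to manage hardware devices connected to a computer is interrupts rather than polling. Whenever a friend sends over a "fresh" good paper just posted on arxiv, Outstanding Boy exclaims: "Senior, you must be on arxiv every day, you are so fast~". So Outstanding Boy searched GitHub and borrowed from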

Jie Liu 11 Sep 09, 2022
WebScraper - A script that prints out a list of all EXTERNAL references in the HTML response to an HTTP/S request

Project A: WebScraper A script that prints out a list of all EXTERNAL references

2 Apr 26, 2022
Auto Join: A GitHub Action script to automatically invite everyone who stars your repository to the organization.

Auto Invite To The Organization By Star A GitHub Action script to automatically invite everyone to your organization that stars your repository. What

Max Base 11 Dec 11, 2022
Command line program to download documents from web portals.

command line document download made easy Highlights list available documents in json format or download them filter documents using string matching re

16 Dec 26, 2022
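A collection of crawler examples, including but not limited to: Taobao, JD.com, Tmall, Douban, Douyin, Kuaishou, Weibo, WeChat, Alibaba, Toutiao, pdd, Youku, iQIYI, Ctrip, 12306, 58, Sohu, Baidu Index, VIP and Wanfang, Zlibraty, Oalib, novels, tender sites, procurement sites, Xiaohongshu

lxSpider: a collection of crawler examples, including but not limited to: Taobao, JD.com, Tmall, Douban, Douyin, Kuaishou, Weibo, WeChat, Alibaba, Toutiao, pdd, Youku, iQIYI, Ctrip, 12306, 58, Sohu, Baidu Index, VIP and Wanfang, Zlibraty, Oalib, novel sites, tender and procurement sites. Introduction: time flies, and I have lost count of how many examples I have written.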

lx 793 Jan 05, 2023