Fix unsmarten text option (#16)

* Create unsmarten.py
* Update unsmarten.py
* Update unsmarten.py
* Create unsmarten.py
19 changed files with 189 additions and 104 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -3,3 +3,4 @@ build/
 dist/
 sdist/
 *.egg-info/
+venv/
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -1,2 +0,0 @@
-graft ebook_converter/data
-exclude .gitignore
--- a/README.rst
+++ b/README.rst
@@ -2,24 +2,39 @@
 Ebook converter
 ===============
-This is impudent ripoff of the bits from `Calibre project`_, and is aimed only
+This is an impudent ripoff of the bits from `Calibre project`_, and is aimed
-for converter thing.
+only for converter thing.
-My motivation is to have only converter for ebooks run from commandline,
-without all of those bells and whistles Calibre has, and with cleanest more
-*pythonic* approach.
+My motivation is to have only the converter for ebooks run from the
+commandline, without all of those bells and whistles Calibre has, and with
+cleanest more *pythonic* approach.
 Requirements
 ------------
 To build and run ebook converter, you'll need:
-- Python 3.6 or newer
+- Python 3.10 or newer
 - `Liberation fonts`_
 - setuptools
 - ``pdftohtml``, ``pdfinfo`` and ``pdftoppm`` from `poppler`_ project for
  conversion from PDF available in ``$PATH``
+- ``libxml2-dev`` and ``libxslt-dev`` as dependencies for format manipulation
+  from some of the Calibre code
+and several Python packages:
+- `beautifulsoup4`_
+- `css-parser`_
+- `filelock`_
+- `html2text`_
+- `html5-parser`_
+- `msgpack`_
+- `odfpy`_
+- `pillow`_
+- `python-dateutil`_
+- `setuptools`_
+- `tinycss`_
 No Python2 support. Even if Calibre probably still is able to run on Python2, I
 do not have an intention to support it.
@@ -28,9 +43,9 @@ do not have an intention to support it.
 What's supported
 ----------------
-To be able to perform some optimization and make converter more reliable and
+To be able to perform some optimization and make the converter more reliable
-easy to use, first I need to remove some of the features, which are totally not
+and easy to use, first I need to remove some of the features, which are totally
-crucial in my opinion, although they might be re-added later, like, for
+not crucial in my opinion, although they might be re-added later, like, for
 instance there is no automatic language translations depending on the locale
 settings.
@@ -44,15 +59,16 @@ Windows is not currently supported, because of the original spaghetti code.
 This may change in the future, after cleanup of mentioned pasta would be
 completed.
-So called `Kindle periodical` format is not supported, since all we do care are
+So called *Kindle periodical* format (which `Amazon has`_ `killed`_ anyway back
-local files. If there would be downloaded periodical thing (using Calibre for
+in September 2023) is not supported, since all we do care are local files. If
-example), it would be treated as common book.
+there would be downloaded periodical thing (using Calibre for example), it
+would be treated as common book.
 Input formats
 ~~~~~~~~~~~~~
-Currently, I've tested following input formats:
+Currently, I've tested the following input formats:
 - Microsoft Word 2007 and up (``docx``)
 - EPUB, both v2 and v3 (``epub``)
@@ -107,7 +123,7 @@ managers), i.e:
   $ . venv/bin/activate
   (venv) $ git clone https://github.com/gryf/ebook-converter
   (venv) $ cd ebook-converter
-   (venv) $ pip install -r requirements.txt .
+   (venv) $ pip install .
 Simple as that. And from now on, you can issue converter:
@@ -122,9 +138,20 @@ License
 This work is licensed on GPL3 license, like the original work. See LICENSE file
 for details.
 .. _Calibre project: https://calibre-ebook.com/
 .. _pypi: https://pypi.python.org
 .. _Liberation fonts: https://github.com/liberationfonts/liberation-fonts
-.. _Kindle periodical: https://sellercentral.amazon.com/gp/help/external/help.html?itemID=202047960&language=en-US
+.. _Amazon has: https://goodereader.com/blog/kindle/amazon-will-discontinue-newspaper-and-magazine-subscriptions-in-september
+.. _killed: https://www.theverge.com/23861370/amazon-kindle-periodicals-unlimited-ended
 .. _poppler: https://poppler.freedesktop.org/
+.. _beautifulsoup4: https://www.crummy.com/software/BeautifulSoup
+.. _css-parser: https://github.com/ebook-utils/css-parser
+.. _filelock: https://github.com/tox-dev/py-filelock
+.. _html2text: https://github.com/Alir3z4/html2text
+.. _html5-parser: https://html5-parser.readthedocs.io
+.. _msgpack: https://msgpack.org
+.. _odfpy: https://github.com/eea/odfpy
+.. _pillow: https://python-pillow.github.io
+.. _python-dateutil: https://github.com/dateutil/dateutil
+.. _setuptools: https://setuptools.pypa.io
+.. _tinycss: http://tinycss.readthedocs.io
--- a/ebook_converter/constants_old.py
+++ b/ebook_converter/constants_old.py
@@ -32,7 +32,7 @@ def debug():
 # plugins {{{
-class Plugins(collections.Mapping):
+class Plugins(collections.abc.Mapping):
    def __init__(self):
        self._plugins = {}
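Review note on the `collections.abc` change above (it applies equally to `OrderedSet` in `ordered_set.py`): the ABC aliases in the top-level `collections` namespace were removed in Python 3.10, so `collections.Mapping` raises `AttributeError` on the interpreter versions this project now targets. A minimal sketch of the pattern, where `Registry` is a hypothetical stand-in and not the project's `Plugins` class:

```python
import collections.abc


class Registry(collections.abc.Mapping):
    """Hypothetical read-only mapping, standing in for Plugins."""

    def __init__(self, items):
        self._items = dict(items)

    # Mapping only requires these three methods; get(), items(),
    # membership tests and equality are provided by the ABC.
    def __getitem__(self, key):
        return self._items[key]

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


registry = Registry({"pdf": "poppler"})
```

Only the base class changes; the subclass body stays the same.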
--- a/ebook_converter/css_selectors/ordered_set.py
+++ b/ebook_converter/css_selectors/ordered_set.py
@@ -19,7 +19,7 @@ def is_iterable(obj):
    return hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes))
-class OrderedSet(collections.MutableSet):
+class OrderedSet(collections.abc.MutableSet):
    """
    An OrderedSet is a custom MutableSet that remembers its order, so that
    every entry has an index that can be looked up.
--- a/ebook_converter/ebooks/conversion/plugins/html_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/html_input.py
@@ -237,7 +237,7 @@ class HTMLInput(InputFormatPlugin):
        if not os.access(link, os.R_OK):
            return link_
        if os.path.isdir(link):
-            self.log.warning(link_, 'is a link to a directory. Ignoring.')
+            self.log.warning('%s is a link to a directory. Ignoring.', link_)
            return link_
        if link not in self.added_resources:
            bhref = os.path.basename(link)
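Review note on the logging fix above (the same pattern is fixed in `mobiml.py`): assuming the project's logger follows the standard library convention, where the first argument is a `%`-style format string and the rest are its arguments, the old call has no placeholder for its extra argument and formatting the record fails. The sketch below uses `logging.LogRecord` from the standard library directly, and the `link` value is made up for illustration:

```python
import logging

link = "chapter1/"  # hypothetical value

# Old call shape: link is taken as the format string and the sentence as
# its argument. With no placeholder in it, formatting raises TypeError.
bad = logging.LogRecord("demo", logging.WARNING, "demo.py", 0, link,
                        ("is a link to a directory. Ignoring.",), None)
try:
    bad.getMessage()
    old_style_ok = True
except TypeError:
    old_style_ok = False

# New call shape: the format string carries a %s placeholder for link.
good = logging.LogRecord("demo", logging.WARNING, "demo.py", 0,
                         "%s is a link to a directory. Ignoring.",
                         (link,), None)
message = good.getMessage()
```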
--- a/ebook_converter/ebooks/conversion/plugins/pml_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/pml_output.py
@@ -62,7 +62,7 @@ class PMLOutput(OutputFormatPlugin):
                    im = Image.open(io.BytesIO(item.data))
                else:
                    im = Image.open(io.BytesIO(item.data)).convert('P')
-                    im.thumbnail((300,300), Image.ANTIALIAS)
+                    im.thumbnail((300,300), Image.LANCZOS)
                data = io.BytesIO()
                im.save(data, 'PNG')
--- a/ebook_converter/ebooks/lrf/html/convert_from.py
+++ b/ebook_converter/ebooks/lrf/html/convert_from.py
@@ -1012,7 +1012,7 @@ class HTMLConverter(object):
            self.image_memory.append(pt)  # Neccessary, trust me ;-)
            try:
                im.resize((int(width), int(height)),
-                          PILImage.ANTIALIAS).save(pt, encoding)
+                          PILImage.LANCZOS).save(pt, encoding)
                pt.close()
                self.scaled_images[path] = pt
                return pt.name
@@ -1970,7 +1970,7 @@ def process_file(path, options, logger):
                options.cover = cf.name
                tim = im.resize((int(0.75 * th), th),
-                                PILImage.ANTIALIAS).convert('RGB')
+                                PILImage.LANCZOS).convert('RGB')
                tf = PersistentTemporaryFile(prefix=__appname__ + '_',
                                             suffix=".jpg")
                tf.close()
--- a/ebook_converter/ebooks/lrf/html/table.py
+++ b/ebook_converter/ebooks/lrf/html/table.py
@@ -145,7 +145,7 @@ class Cell(object):
                continue
            word = token.split()
            word = word[0] if word else ""
-            width = font.getsize(word)[0]
+            width = font.getbbox(word)[2]
            if width > mwidth:
                mwidth = width
        return parindent + mwidth + 2
@@ -191,7 +191,7 @@ class Cell(object):
            if (ff, fs) != (ts['fontfacename'], ts['fontsize']):
                font = get_font(ff, self.pts_to_pixels(fs))
            for word in token.split():
-                width, height = font.getsize(word)
+                _, _, width, height = font.getbbox(word)
                left, right, top, bottom = add_word(width, height, left, right, top, bottom, ls, ws)
        return right+3+max(parindent, 10), bottom
--- a/ebook_converter/ebooks/mobi/mobiml.py
+++ b/ebook_converter/ebooks/mobi/mobiml.py
@@ -452,7 +452,7 @@ class MobiMLizer(object):
                try:
                    item = self.oeb.manifest.hrefs[base.urlnormalize(href)]
                except:
-                    self.oeb.logger.warning('Failed to find image:', href)
+                    self.oeb.logger.warning('Failed to find image: %s', href)
                else:
                    try:
                        width, height = identify(item.data)[1:]
--- a/ebook_converter/ebooks/mobi/writer2/indexer.py
+++ b/ebook_converter/ebooks/mobi/writer2/indexer.py
@@ -444,8 +444,8 @@ class Indexer(object):  # {{{
        if self.is_periodical and self.masthead_offset is None:
            raise ValueError('Periodicals must have a masthead')
-        self.log('Generating MOBI index for a %s', 'periodical' if
+        self.log.info('Generating MOBI index for a %s', 'periodical' if
-                 self.is_periodical else 'book')
+                      self.is_periodical else 'book')
        self.is_flat_periodical = False
        if self.is_periodical:
            periodical_node = next(iter(oeb.toc))
--- a/ebook_converter/ebooks/odt/input.py
+++ b/ebook_converter/ebooks/odt/input.py
@@ -14,13 +14,15 @@ from odf.draw import Frame as odFrame, Image as odImage
 from odf.namespaces import TEXTNS as odTEXTNS
 from ebook_converter.utils import directory
 from ebook_converter.ebooks.oeb import parse_utils
 from ebook_converter.ebooks.oeb.base import _css_logger
 from ebook_converter import polyglot
 class Extract(ODF2XHTML):
-    def extract_pictures(self, zf):
+    def _extract_pictures(self, zf):
        if not os.path.exists('Pictures'):
            os.makedirs('Pictures')
        for name in zf.namelist():
@@ -30,8 +32,8 @@ class Extract(ODF2XHTML):
                with open(name, 'wb') as f:
                    f.write(data)
-    def apply_list_starts(self, root, log):
+    def _apply_list_starts(self, root, log):
-        if not self.list_starts:
+        if not hasattr(self, "list_starts") or not self.list_starts:
            return
        list_starts = frozenset(self.list_starts)
        for ol in root.xpath('//*[local-name() = "ol" and @class]'):
@@ -46,7 +48,7 @@ class Extract(ODF2XHTML):
        self.filter_css(root, log)
        self.extract_css(root, log)
        self.epubify_markup(root, log)
-        self.apply_list_starts(root, log)
+        self._apply_list_starts(root, log)
        html = etree.tostring(root, encoding='utf-8', xml_declaration=True)
        return html
@@ -84,22 +86,21 @@ class Extract(ODF2XHTML):
                    return rule
    def epubify_markup(self, root, log):
        from ebook_converter.ebooks.oeb.base import XPath, XHTML
        # Fix empty title tags
-        for t in XPath('//h:title')(root):
+        for t in parse_utils.XPath('//h:title')(root):
            if not t.text:
                t.text = u' '
        # Fix <p><div> constructs as the asinine epubchecker complains
        # about them
-        pdiv = XPath('//h:p/h:div')
+        pdiv = parse_utils.XPath('//h:p/h:div')
        for div in pdiv(root):
-            div.getparent().tag = XHTML('div')
+            div.getparent().tag = parse_utils.XHTML('div')
        # Remove the position:relative as it causes problems with some epub
        # renderers. Remove display: block on an image inside a div as it is
        # redundant and prevents text-align:center from working in ADE
        # Also ensure that the img is contained in its containing div
-        imgpath = XPath('//h:div/h:img[@style]')
+        imgpath = parse_utils.XPath('//h:div/h:img[@style]')
        for img in imgpath(root):
            div = img.getparent()
            if len(div) == 1:
@@ -119,7 +120,7 @@ class Extract(ODF2XHTML):
        # works in both WebKit and ADE.
        # https://bugs.launchpad.net/bugs/1063207
        # https://bugs.launchpad.net/calibre/+bug/859343
-        imgpath = XPath('descendant::h:div/h:div/h:img')
+        imgpath = parse_utils.XPath('descendant::h:div/h:div/h:img')
        for img in imgpath(root):
            div2 = img.getparent()
            div1 = div2.getparent()
@@ -297,7 +298,7 @@ class Extract(ODF2XHTML):
            with open('index.xhtml', 'wb') as f:
                f.write(polyglot.as_bytes(html))
            zf = ZipFile(stream, 'r')
-            self.extract_pictures(zf)
+            self._extract_pictures(zf)
            opf = OPFCreator(os.path.abspath(os.getcwd()), mi)
            opf.create_manifest([(os.path.abspath(os.path.join(r, f2)), None)
                                 for r, _, fnames in os.walk(os.getcwd())
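Review note on the `_apply_list_starts` guard above: the `hasattr(...) or not ...` check can also be written with a single `getattr` call with a default, which covers both the missing and the empty case. A minimal sketch, where `Extractor` and `starts` are hypothetical names used only for illustration:

```python
class Extractor:
    """Hypothetical stand-in for Extract; list_starts may never be set."""

    def starts(self):
        # A default of None handles the attribute being absent, and the
        # truthiness test handles it being present but empty.
        list_starts = getattr(self, "list_starts", None)
        if not list_starts:
            return frozenset()
        return frozenset(list_starts)


extractor = Extractor()
before = extractor.starts()
extractor.list_starts = {"L1": 3}
after = extractor.starts()
```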
--- a/ebook_converter/ebooks/oeb/transforms/unsmarten.py
+++ b/ebook_converter/ebooks/oeb/transforms/unsmarten.py
@@ -0,0 +1,28 @@
+__license__ = 'GPL 3'
+__copyright__ = '2011, John Schember <john@nachtimwald.com>'
+__docformat__ = 'restructuredtext en'
+from ebook_converter.ebooks.oeb.base import OEB_DOCS, XPath
+from ebook_converter.ebooks.oeb.parse_utils import barename
+from ebook_converter.utils.unsmarten import unsmarten_text
+class UnsmartenPunctuation:
+    def __init__(self):
+        self.html_tags = XPath('descendant::h:*')
+    def unsmarten(self, root):
+        for x in self.html_tags(root):
+            if not barename(x.tag) == 'pre':
+                if getattr(x, 'text', None):
+                    x.text = unsmarten_text(x.text)
+                if getattr(x, 'tail', None) and x.tail:
+                    x.tail = unsmarten_text(x.tail)
+    def __call__(self, oeb, context):
+        bx = XPath('//h:body')
+        for x in oeb.manifest.items:
+            if x.media_type in OEB_DOCS:
+                for body in bx(x.data):
+                    self.unsmarten(body)
--- a/ebook_converter/main.py
+++ b/ebook_converter/main.py
@@ -4,7 +4,6 @@ import os
 import sys
 from ebook_converter import logging
 from ebook_converter.customize.conversion import OptionRecommendation
 from ebook_converter.ebooks.conversion.plumber import Plumber
@@ -68,6 +67,7 @@ def run(args):
    return 0
 def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('from_file', help="Input file to be converted")
@@ -83,5 +83,4 @@ def main():
    LOG.set_verbose(args.verbose, args.quiet)
    print(args)
    sys.exit(run(args))
--- a/ebook_converter/utils/unsmarten.py
+++ b/ebook_converter/utils/unsmarten.py
@@ -0,0 +1,40 @@
+__license__ = 'GPL 3'
+__copyright__ = '2011, John Schember <john@nachtimwald.com>'
+__docformat__ = 'restructuredtext en'
+from ebook_converter.utils.mreplace import MReplace
+_mreplace = MReplace({
+        '&#8211;': '--',
+        '&ndash;': '--',
+        '–': '--',
+        '&#8212;': '---',
+        '&mdash;': '---',
+        '—': '---',
+        '&#8230;': '...',
+        '&hellip;': '...',
+        '…': '...',
+        '&#8220;': '"',
+        '&#8221;': '"',
+        '&#8222;': '"',
+        '&#8243;': '"',
+        '&ldquo;': '"',
+        '&rdquo;': '"',
+        '&bdquo;': '"',
+        '&Prime;': '"',
+        '“':'"',
+        '”':'"',
+        '„':'"',
+        '″':'"',
+        '&#8216;':"'",
+        '&#8217;':"'",
+        '&#8242;':"'",
+        '&lsquo;':"'",
+        '&rsquo;':"'",
+        '&prime;':"'",
+        '‘':"'",
+        '’':"'",
+        '′':"'",
+})
+unsmarten_text = _mreplace.mreplace
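Review note on the replacement table above: `MReplace` is the project's own helper and its implementation is not shown in this diff. Assuming it behaves like a one-pass, dictionary-driven substitution, the idea can be sketched with the standard `re` module. `REPLACEMENTS` is a small subset of the table and `unsmarten_sketch` is a hypothetical name, not the project's function:

```python
import re

# Small subset of the table above, for illustration only.
REPLACEMENTS = {
    "\u2013": "--",    # en dash
    "\u2014": "---",   # em dash
    "\u2026": "...",   # horizontal ellipsis
    "\u201c": '"',     # left double quotation mark
    "\u201d": '"',     # right double quotation mark
    "\u2018": "'",     # left single quotation mark
    "\u2019": "'",     # right single quotation mark
    "&mdash;": "---",  # named entity form
}

# Longest keys first, so a longer key wins over any shorter prefix of it.
_PATTERN = re.compile("|".join(
    re.escape(key) for key in sorted(REPLACEMENTS, key=len, reverse=True)))


def unsmarten_sketch(text):
    """Replace every key found in text with its plain ASCII value."""
    return _PATTERN.sub(lambda match: REPLACEMENTS[match.group(0)], text)
```

A single compiled alternation scans the text once, rather than calling `str.replace` once per key.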
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -0,0 +1,52 @@
+[build-system]
+requires = ["setuptools >= 77.0"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "ebook-converter"
+version = "4.12.0"
+requires-python = ">= 3.10"
+description = "Convert ebook between different formats"
+dependencies = [
+    "beautifulsoup4>=4.9.3",
+    "css-parser>=1.0.6",
+    "filelock>=3.0.12",
+    "html2text>=2020.1.16",
+    "html5-parser==0.4.12",
+    "msgpack>=1.0.0",
+    "odfpy>=1.4.1",
+    "pillow>=8.0.1",
+    "python-dateutil>=2.8.1",
+    "setuptools>=61.0",
+    "tinycss>=0.4"
+]
+readme = "README.rst"
+authors = [
+    {name = "gryf", email = "gryf73@gmail.com"}
+]
+license = "GPL-3.0-or-later"
+classifiers = [
+    "Environment :: Console",
+    "Intended Audience :: Other Audience",
+    "Operating System :: POSIX :: Linux",
+    "Development Status :: 3 - Alpha",
+    "Programming Language :: Python",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3 :: Only",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13"
+]
+[project.urls]
+Repository = "https://github.com/gryf/ebook-converter"
+[project.scripts]
+ebook-converter = "ebook_converter.main:main"
+[tool.setuptools.packages.find]
+exclude = ["snap"]
+[tool.setuptools.package-data]
+"*" = ["*.types", "*.css", "*.html", "*.xhtml", "*.xsl", "*.json"]
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,11 +0,0 @@
-beautifulsoup4>=4.9.3
-css-parser>=1.0.6
-filelock>=3.0.12
-html2text>=2020.1.16
-html5-parser==0.4.9 --no-binary lxml
-msgpack>=1.0.0
-odfpy>=1.4.1
-pillow>=8.0.1
-python-dateutil>=2.8.1
-setuptools>=50.3.2
-tinycss>=0.4
--- a/setup.cfg
+++ b/setup.cfg
@@ -1,46 +0,0 @@
-[metadata]
-name = ebook-converter
-version = 4.12.0
-summary = Convert ebook between different formats
-description-file =
-    README.rst
-author = gryf
-author-email = gryf73@gmail.com
-license = GPL3
-license_file = LICENSE
-url = https://github.com/gryf/ebook-converter
-classifier =
-    Environment :: Console
-    Intended Audience :: Other Audience
-    License :: OSI Approved :: GNU General Public License v3 (GPLv3)
-    Operating System :: POSIX :: Linux
-    Development Status :: 3 - Alpha
-    Programming Language :: Python
-    Programming Language :: Python :: 3
-    Programming Language :: Python :: 3 :: Only
-    Programming Language :: Python :: 3.6
-    Programming Language :: Python :: 3.7
-[options]
-packages = find:
-include_package_data = True
-install_requires =
-    filelock
-    python-dateutil
-    lxml
-    css-parser
-    beautifulsoup4
-    tinycss
-    pillow
-    msgpack
-    html5-parser
-    odfpy
-    setuptools
-    html2text
-[options.entry_points]
-console_scripts =
-    ebook-converter=ebook_converter.main:main
-[options.package_data]
-* = *.types *.css, *.html, *.xsl
--- a/setup.py
+++ b/setup.py
@@ -1,4 +0,0 @@
-import setuptools
-setuptools.setup()
Author	SHA1	Message	Date
Vitaliy Krasnoperov	c89fc132b8	Fix unsmarten text option (#16) * Create unsmarten.py * Update unsmarten.py * Update unsmarten.py * Create unsmarten.py	2026-02-06 09:06:12 +01:00
gryf	8b8a92e9fd	Removed license classifier in favor of SPDX entry.	2025-04-18 16:06:33 +02:00
gryf	6b7f796cfb	README update	2025-03-19 21:28:37 +01:00
gryf	72d0858ad8	Move from setup.cfg/py to pure pyproject.toml project definition	2025-03-13 16:55:40 +01:00
Roman Dobosz	4f548ec882	Merge pull request #10 from zagura/add-pyproject-toml Add pyproject.toml	2025-03-13 12:51:51 +01:00
Michał Zagórski	0faa2c0758	Add pyproject.toml	2025-03-12 23:23:22 +01:00
gryf	d37850520b	Remove getsize method of PIL in favor of getbbox	2025-03-10 18:33:05 +01:00
Roman Dobosz	5e56cb8c7a	Merge pull request #9 from NunoSempere/master add dependencies, fix some typos	2025-02-10 16:43:31 +01:00
NunoSempere	084e0d11ce	fix a few README typos mostly the lack of "the". I've left some others which are more charming	2025-01-05 22:32:56 +01:00
NunoSempere	4c3c5a9e27	add missing dependencies (found in Debian 12)	2025-01-05 22:30:09 +01:00
gryf	c240495c3d	Fix for nonexistent attribute in odt input format	2022-12-04 18:26:09 +01:00
gryf	53dea56929	Removed temporary stuff	2022-12-04 18:24:31 +01:00
gryf	ef02332465	Fix couple of logging errors	2022-12-04 18:22:06 +01:00
gryf	74abaf0de0	Fix imports for collections abstract classes	2022-12-04 18:18:07 +01:00