Fix unsmarten text option (#16)

* Create unsmarten.py
* Update unsmarten.py
* Update unsmarten.py
* Create unsmarten.py
Removed license classifier in favor of SPDX entry.
2026-03-25 11:53:33 +01:00 · 2026-02-06 09:06:12 +01:00 · 2025-04-18 16:06:33 +02:00 · 2025-03-19 21:28:37 +01:00 · 2025-03-13 16:55:40 +01:00 · 2025-03-13 12:51:51 +01:00
19 changed files with 189 additions and 104 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -3,3 +3,4 @@ build/
 dist/
 sdist/
 *.egg-info/
+venv/
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -1,2 +0,0 @@
-graft ebook_converter/data
-exclude .gitignore
--- a/README.rst
+++ b/README.rst
@@ -2,24 +2,39 @@
 Ebook converter
 ===============

-This is impudent ripoff of the bits from `Calibre project`_, and is aimed only
-for converter thing.
-
-My motivation is to have only converter for ebooks run from commandline,
-without all of those bells and whistles Calibre has, and with cleanest more
-*pythonic* approach.
+This is an impudent ripoff of the bits from `Calibre project`_, and is aimed
+only for converter thing.

+My motivation is to have only the converter for ebooks run from the
+commandline, without all of those bells and whistles Calibre has, and with
+cleanest more *pythonic* approach.

 Requirements
 ------------

 To build and run ebook converter, you'll need:

- Python 3.6 or newer
+- Python 3.10 or newer
 - `Liberation fonts`_
 - setuptools
 - ``pdftohtml``, ``pdfinfo`` and ``pdftoppm`` from `poppler`_ project for
  conversion from PDF available in ``$PATH``
+- ``libxml2-dev`` and ``libxslt-dev`` as dependencies for format manipulation
+  from some of the Calibre code
+
+and several Python packages:
+
+- `beautifulsoup4`_
+- `css-parser`_
+- `filelock`_
+- `html2text`_
+- `html5-parser`_
+- `msgpack`_
+- `odfpy`_
+- `pillow`_
+- `python-dateutil`_
+- `setuptools`_
+- `tinycss`_

 No Python2 support. Even if Calibre probably still is able to run on Python2, I
 do not have an intention to support it.
@@ -28,9 +43,9 @@ do not have an intention to support it.
 What's supported
 ----------------

-To be able to perform some optimization and make converter more reliable and
-easy to use, first I need to remove some of the features, which are totally not
-crucial in my opinion, although they might be re-added later, like, for
+To be able to perform some optimization and make the converter more reliable
+and easy to use, first I need to remove some of the features, which are totally
+not crucial in my opinion, although they might be re-added later, like, for
 instance there is no automatic language translations depending on the locale
 settings.

@@ -44,15 +59,16 @@ Windows is not currently supported, because of the original spaghetti code.
 This may change in the future, after cleanup of mentioned pasta would be
 completed.

-So called `Kindle periodical` format is not supported, since all we do care are
-local files. If there would be downloaded periodical thing (using Calibre for
-example), it would be treated as common book.
+So called *Kindle periodical* format (which `Amazon has`_ `killed`_ anyway back
+in September 2023) is not supported, since all we do care are local files. If
+there would be downloaded periodical thing (using Calibre for example), it
+would be treated as common book.


 Input formats
 ~~~~~~~~~~~~~

-Currently, I've tested following input formats:
+Currently, I've tested the following input formats:

 - Microsoft Word 2007 and up (``docx``)
 - EPUB, both v2 and v3 (``epub``)
@@ -107,7 +123,7 @@ managers), i.e:
   $ . venv/bin/activate
   (venv) $ git clone https://github.com/gryf/ebook-converter
   (venv) $ cd ebook-converter
-   (venv) $ pip install -r requirements.txt .
+   (venv) $ pip install .

 Simple as that. And from now on, you can issue converter:

@@ -122,9 +138,20 @@ License
 This work is licensed on GPL3 license, like the original work. See LICENSE file
 for details.

-
 .. _Calibre project: https://calibre-ebook.com/
 .. _pypi: https://pypi.python.org
 .. _Liberation fonts: https://github.com/liberationfonts/liberation-fonts
-.. _Kindle periodical: https://sellercentral.amazon.com/gp/help/external/help.html?itemID=202047960&language=en-US
+.. _Amazon has: https://goodereader.com/blog/kindle/amazon-will-discontinue-newspaper-and-magazine-subscriptions-in-september
+.. _killed: https://www.theverge.com/23861370/amazon-kindle-periodicals-unlimited-ended
 .. _poppler: https://poppler.freedesktop.org/
+.. _beautifulsoup4: https://www.crummy.com/software/BeautifulSoup
+.. _css-parser: https://github.com/ebook-utils/css-parser
+.. _filelock: https://github.com/tox-dev/py-filelock
+.. _html2text: https://github.com/Alir3z4/html2text
+.. _html5-parser: https://html5-parser.readthedocs.io
+.. _msgpack: https://msgpack.org
+.. _odfpy: https://github.com/eea/odfpy
+.. _pillow: https://python-pillow.github.io
+.. _python-dateutil: https://github.com/dateutil/dateutil
+.. _setuptools: https://setuptools.pypa.io
+.. _tinycss: http://tinycss.readthedocs.io
--- a/ebook_converter/constants_old.py
+++ b/ebook_converter/constants_old.py
@@ -32,7 +32,7 @@ def debug():
 # plugins {{{


-class Plugins(collections.Mapping):
+class Plugins(collections.abc.Mapping):

    def __init__(self):
        self._plugins = {}
--- a/ebook_converter/css_selectors/ordered_set.py
+++ b/ebook_converter/css_selectors/ordered_set.py
@@ -19,7 +19,7 @@ def is_iterable(obj):
    return hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes))


-class OrderedSet(collections.MutableSet):
+class OrderedSet(collections.abc.MutableSet):
    """
    An OrderedSet is a custom MutableSet that remembers its order, so that
    every entry has an index that can be looked up.
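
Review note on the two `collections.abc` hunks above: the abstract base classes were only ever aliased in `collections`, and those aliases were removed in Python 3.10, which matches the new minimum version in the README. A minimal sketch of a read-only mapping built on `collections.abc.Mapping` follows; `DemoPlugins` and its contents are hypothetical and only illustrate the three methods the ABC requires.

```python
import collections.abc


class DemoPlugins(collections.abc.Mapping):
    """Hypothetical read-only mapping; not the project's Plugins class."""

    def __init__(self, plugins):
        self._plugins = dict(plugins)

    # Mapping needs only these three methods; get(), items(),
    # __contains__ and friends are derived from them.
    def __getitem__(self, name):
        return self._plugins[name]

    def __iter__(self):
        return iter(self._plugins)

    def __len__(self):
        return len(self._plugins)


plugins = DemoPlugins({'speedup': 'loaded'})
print(plugins['speedup'], 'speedup' in plugins, plugins.get('missing'))
```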
--- a/ebook_converter/ebooks/conversion/plugins/html_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/html_input.py
@@ -237,7 +237,7 @@ class HTMLInput(InputFormatPlugin):
        if not os.access(link, os.R_OK):
            return link_
        if os.path.isdir(link):
-            self.log.warning(link_, 'is a link to a directory. Ignoring.')
+            self.log.warning('%s is a link to a directory. Ignoring.', link_)
            return link_
        if link not in self.added_resources:
            bhref = os.path.basename(link)
--- a/ebook_converter/ebooks/conversion/plugins/pml_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/pml_output.py
@@ -62,7 +62,7 @@ class PMLOutput(OutputFormatPlugin):
                    im = Image.open(io.BytesIO(item.data))
                else:
                    im = Image.open(io.BytesIO(item.data)).convert('P')
-                    im.thumbnail((300,300), Image.ANTIALIAS)
+                    im.thumbnail((300,300), Image.LANCZOS)

                data = io.BytesIO()
                im.save(data, 'PNG')
--- a/ebook_converter/ebooks/lrf/html/convert_from.py
+++ b/ebook_converter/ebooks/lrf/html/convert_from.py
@@ -1012,7 +1012,7 @@ class HTMLConverter(object):
            self.image_memory.append(pt)  # Neccessary, trust me ;-)
            try:
                im.resize((int(width), int(height)),
-                          PILImage.ANTIALIAS).save(pt, encoding)
+                          PILImage.LANCZOS).save(pt, encoding)
                pt.close()
                self.scaled_images[path] = pt
                return pt.name
@@ -1970,7 +1970,7 @@ def process_file(path, options, logger):
                options.cover = cf.name

                tim = im.resize((int(0.75 * th), th),
-                                PILImage.ANTIALIAS).convert('RGB')
+                                PILImage.LANCZOS).convert('RGB')
                tf = PersistentTemporaryFile(prefix=__appname__ + '_',
                                             suffix=".jpg")
                tf.close()
--- a/ebook_converter/ebooks/lrf/html/table.py
+++ b/ebook_converter/ebooks/lrf/html/table.py
@@ -145,7 +145,7 @@ class Cell(object):
                continue
            word = token.split()
            word = word[0] if word else ""
-            width = font.getsize(word)[0]
+            width = font.getbbox(word)[2]
            if width > mwidth:
                mwidth = width
        return parindent + mwidth + 2
@@ -191,7 +191,7 @@ class Cell(object):
            if (ff, fs) != (ts['fontfacename'], ts['fontsize']):
                font = get_font(ff, self.pts_to_pixels(fs))
            for word in token.split():
-                width, height = font.getsize(word)
+                _, _, width, height = font.getbbox(word)
                left, right, top, bottom = add_word(width, height, left, right, top, bottom, ls, ws)
        return right+3+max(parindent, 10), bottom

--- a/ebook_converter/ebooks/mobi/mobiml.py
+++ b/ebook_converter/ebooks/mobi/mobiml.py
@@ -452,7 +452,7 @@ class MobiMLizer(object):
                try:
                    item = self.oeb.manifest.hrefs[base.urlnormalize(href)]
                except:
-                    self.oeb.logger.warning('Failed to find image:', href)
+                    self.oeb.logger.warning('Failed to find image: %s', href)
                else:
                    try:
                        width, height = identify(item.data)[1:]
--- a/ebook_converter/ebooks/mobi/writer2/indexer.py
+++ b/ebook_converter/ebooks/mobi/writer2/indexer.py
@@ -444,8 +444,8 @@ class Indexer(object):  # {{{
        if self.is_periodical and self.masthead_offset is None:
            raise ValueError('Periodicals must have a masthead')

-        self.log('Generating MOBI index for a %s', 'periodical' if
-                 self.is_periodical else 'book')
+        self.log.info('Generating MOBI index for a %s', 'periodical' if
+                      self.is_periodical else 'book')
        self.is_flat_periodical = False
        if self.is_periodical:
            periodical_node = next(iter(oeb.toc))
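
Review note on the logging hunks above (`html_input.py`, `mobiml.py`, `indexer.py`): assuming the project's `logging` wrapper follows the standard library conventions, the first positional argument is a %-style format string and the remaining ones are its arguments. The sketch below uses only the stdlib `logging` module; `render` is a hypothetical helper that formats a record the way a handler would.

```python
import logging


def render(msg, *args):
    """Format a message the way a stdlib handler would (hypothetical helper)."""
    record = logging.LogRecord('demo', logging.WARNING, 'demo.py', 0,
                               msg, args, None)
    return record.getMessage()


# New style from the diff: placeholder in the format string.
fixed = render('Failed to find image: %s', 'images/cover.png')

# Old style: an argument with no placeholder breaks %-formatting.
try:
    render('Failed to find image:', 'images/cover.png')
    old_style_error = None
except TypeError as exc:
    old_style_error = exc

print(fixed)
print(type(old_style_error).__name__)
```

The `indexer.py` change is related: a stdlib `Logger` instance is not callable, so `self.log(...)` has to become a level method such as `self.log.info(...)`.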
--- a/ebook_converter/ebooks/odt/input.py
+++ b/ebook_converter/ebooks/odt/input.py
@@ -14,13 +14,15 @@ from odf.draw import Frame as odFrame, Image as odImage
 from odf.namespaces import TEXTNS as odTEXTNS

 from ebook_converter.utils import directory
+from ebook_converter.ebooks.oeb import parse_utils
 from ebook_converter.ebooks.oeb.base import _css_logger
 from ebook_converter import polyglot


+
 class Extract(ODF2XHTML):

-    def extract_pictures(self, zf):
+    def _extract_pictures(self, zf):
        if not os.path.exists('Pictures'):
            os.makedirs('Pictures')
        for name in zf.namelist():
@@ -30,8 +32,8 @@ class Extract(ODF2XHTML):
                with open(name, 'wb') as f:
                    f.write(data)

-    def apply_list_starts(self, root, log):
-        if not self.list_starts:
+    def _apply_list_starts(self, root, log):
+        if not hasattr(self, "list_starts") or not self.list_starts:
            return
        list_starts = frozenset(self.list_starts)
        for ol in root.xpath('//*[local-name() = "ol" and @class]'):
@@ -46,7 +48,7 @@ class Extract(ODF2XHTML):
        self.filter_css(root, log)
        self.extract_css(root, log)
        self.epubify_markup(root, log)
-        self.apply_list_starts(root, log)
+        self._apply_list_starts(root, log)
        html = etree.tostring(root, encoding='utf-8', xml_declaration=True)
        return html

@@ -84,22 +86,21 @@ class Extract(ODF2XHTML):
                    return rule

    def epubify_markup(self, root, log):
-        from ebook_converter.ebooks.oeb.base import XPath, XHTML
        # Fix empty title tags
-        for t in XPath('//h:title')(root):
+        for t in parse_utils.XPath('//h:title')(root):
            if not t.text:
                t.text = u' '
        # Fix <p><div> constructs as the asinine epubchecker complains
        # about them
-        pdiv = XPath('//h:p/h:div')
+        pdiv = parse_utils.XPath('//h:p/h:div')
        for div in pdiv(root):
-            div.getparent().tag = XHTML('div')
+            div.getparent().tag = parse_utils.XHTML('div')

        # Remove the position:relative as it causes problems with some epub
        # renderers. Remove display: block on an image inside a div as it is
        # redundant and prevents text-align:center from working in ADE
        # Also ensure that the img is contained in its containing div
-        imgpath = XPath('//h:div/h:img[@style]')
+        imgpath = parse_utils.XPath('//h:div/h:img[@style]')
        for img in imgpath(root):
            div = img.getparent()
            if len(div) == 1:
@@ -119,7 +120,7 @@ class Extract(ODF2XHTML):
        # works in both WebKit and ADE.
        # https://bugs.launchpad.net/bugs/1063207
        # https://bugs.launchpad.net/calibre/+bug/859343
-        imgpath = XPath('descendant::h:div/h:div/h:img')
+        imgpath = parse_utils.XPath('descendant::h:div/h:div/h:img')
        for img in imgpath(root):
            div2 = img.getparent()
            div1 = div2.getparent()
@@ -297,7 +298,7 @@ class Extract(ODF2XHTML):
            with open('index.xhtml', 'wb') as f:
                f.write(polyglot.as_bytes(html))
            zf = ZipFile(stream, 'r')
-            self.extract_pictures(zf)
+            self._extract_pictures(zf)
            opf = OPFCreator(os.path.abspath(os.getcwd()), mi)
            opf.create_manifest([(os.path.abspath(os.path.join(r, f2)), None)
                                 for r, _, fnames in os.walk(os.getcwd())
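
Review note on the `list_starts` guard in `_apply_list_starts` above: the `hasattr(...) or not ...` pair can also be written as a single `getattr` with a default. This is only a sketch of the idiom; `DemoExtract` is hypothetical and does not reproduce `ODF2XHTML`.

```python
class DemoExtract:
    """Hypothetical stand-in; list_starts may or may not have been set."""

    def collect_list_starts(self):
        # A missing attribute and an empty value are treated the same way.
        list_starts = getattr(self, 'list_starts', None)
        if not list_starts:
            return frozenset()
        return frozenset(list_starts)


without_attr = DemoExtract()
with_attr = DemoExtract()
with_attr.list_starts = ['L1', 'L1', 'L2']
print(without_attr.collect_list_starts(), with_attr.collect_list_starts())
```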
--- a/ebook_converter/ebooks/oeb/transforms/unsmarten.py
+++ b/ebook_converter/ebooks/oeb/transforms/unsmarten.py
@@ -0,0 +1,28 @@
+__license__ = 'GPL 3'
+__copyright__ = '2011, John Schember <john@nachtimwald.com>'
+__docformat__ = 'restructuredtext en'
+
+from ebook_converter.ebooks.oeb.base import OEB_DOCS, XPath
+from ebook_converter.ebooks.oeb.parse_utils import barename
+from ebook_converter.utils.unsmarten import unsmarten_text
+
+
+class UnsmartenPunctuation:
+
+    def __init__(self):
+        self.html_tags = XPath('descendant::h:*')
+
+    def unsmarten(self, root):
+        for x in self.html_tags(root):
+            if not barename(x.tag) == 'pre':
+                if getattr(x, 'text', None):
+                    x.text = unsmarten_text(x.text)
+                if getattr(x, 'tail', None) and x.tail:
+                    x.tail = unsmarten_text(x.tail)
+
+    def __call__(self, oeb, context):
+        bx = XPath('//h:body')
+        for x in oeb.manifest.items:
+            if x.media_type in OEB_DOCS:
+                for body in bx(x.data):
+                    self.unsmarten(body)
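
Review note on the transform above: it rewrites both `text` and `tail` of every descendant element except `pre`. The sketch below shows the same walk with the stdlib `xml.etree.ElementTree` instead of the project's namespaced `XPath` helper; `QUOTES`, `plain_quotes` and `unsmarten_tree` are hypothetical names, and only a handful of replacements are included.

```python
import xml.etree.ElementTree as ET

# Small illustrative subset of the replacement table.
QUOTES = {'“': '"', '”': '"', '‘': "'", '’': "'"}


def plain_quotes(text):
    for smart, plain in QUOTES.items():
        text = text.replace(smart, plain)
    return text


def unsmarten_tree(root):
    """Rewrite text and tail of every descendant except <pre> elements."""
    for elem in root.iter():
        # root.iter() yields the root too; skip it to mirror 'descendant::'.
        if elem is root or elem.tag == 'pre':
            continue
        if elem.text:
            elem.text = plain_quotes(elem.text)
        if elem.tail:
            elem.tail = plain_quotes(elem.tail)


body = ET.fromstring(
    '<body><p>“Hi”<i>it’s</i> ‘tail’</p><pre>“raw”</pre></body>')
unsmarten_tree(body)
print(ET.tostring(body, encoding='unicode'))
```

As in the diff, skipping a `pre` element leaves its own text and tail untouched, while elements nested inside it are still visited.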
--- a/ebook_converter/main.py
+++ b/ebook_converter/main.py
@@ -4,7 +4,6 @@ import os
 import sys

 from ebook_converter import logging
-from ebook_converter.customize.conversion import OptionRecommendation
 from ebook_converter.ebooks.conversion.plumber import Plumber


@@ -68,6 +67,7 @@ def run(args):

    return 0

+
 def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('from_file', help="Input file to be converted")
@@ -83,5 +83,4 @@ def main():

    LOG.set_verbose(args.verbose, args.quiet)

-    print(args)
    sys.exit(run(args))
--- a/ebook_converter/utils/unsmarten.py
+++ b/ebook_converter/utils/unsmarten.py
@@ -0,0 +1,40 @@
+__license__ = 'GPL 3'
+__copyright__ = '2011, John Schember <john@nachtimwald.com>'
+__docformat__ = 'restructuredtext en'
+
+from ebook_converter.utils.mreplace import MReplace
+
+_mreplace = MReplace({
+        '&#8211;': '--',
+        '&ndash;': '--',
+        '–': '--',
+        '&#8212;': '---',
+        '&mdash;': '---',
+        '—': '---',
+        '&#8230;': '...',
+        '&hellip;': '...',
+        '…': '...',
+        '&#8220;': '"',
+        '&#8221;': '"',
+        '&#8222;': '"',
+        '&#8243;': '"',
+        '&ldquo;': '"',
+        '&rdquo;': '"',
+        '&bdquo;': '"',
+        '&Prime;': '"',
+        '“':'"',
+        '”':'"',
+        '„':'"',
+        '″':'"',
+        '&#8216;':"'",
+        '&#8217;':"'",
+        '&#8242;':"'",
+        '&lsquo;':"'",
+        '&rsquo;':"'",
+        '&prime;':"'",
+        '‘':"'",
+        '’':"'",
+        '′':"'",
+})
+
+unsmarten_text = _mreplace.mreplace
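
Review note on `utils/unsmarten.py` above: the module maps typographic punctuation and its HTML entities to plain ASCII. The sketch below is a stand-in built on `re`, not the project's `MReplace`; it assumes a single-pass, dictionary-driven substitution and uses a reduced table. `REPLACEMENTS` and `demo_unsmarten` are hypothetical names.

```python
import re

# Reduced illustrative table; the module above lists many more entries.
REPLACEMENTS = {
    '&ndash;': '--', '–': '--',
    '&mdash;': '---', '—': '---',
    '&hellip;': '...', '…': '...',
    '“': '"', '”': '"',
    '‘': "'", '’': "'",
}

# Longest keys first, so an entity is never shadowed by a shorter key.
_pattern = re.compile('|'.join(
    re.escape(key) for key in sorted(REPLACEMENTS, key=len, reverse=True)))


def demo_unsmarten(text):
    """Replace every known key in one pass over the text."""
    return _pattern.sub(lambda match: REPLACEMENTS[match.group(0)], text)


sample = '“Wait…” she said — it’s 1–2 &mdash; ok'
print(demo_unsmarten(sample))
```

A single pass means replacement output is never rescanned, so the order of entries in the table does not affect the result.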
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -0,0 +1,52 @@
+[build-system]
+requires = ["setuptools >= 77.0"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "ebook-converter"
+version = "4.12.0"
+requires-python = ">= 3.10"
+description = "Convert ebook between different formats"
+dependencies = [
+    "beautifulsoup4>=4.9.3",
+    "css-parser>=1.0.6",
+    "filelock>=3.0.12",
+    "html2text>=2020.1.16",
+    "html5-parser==0.4.12",
+    "msgpack>=1.0.0",
+    "odfpy>=1.4.1",
+    "pillow>=8.0.1",
+    "python-dateutil>=2.8.1",
+    "setuptools>=61.0",
+    "tinycss>=0.4"
+]
+readme = "README.rst"
+authors = [
+    {name = "gryf", email = "gryf73@gmail.com"}
+]
+license = "GPL-3.0-or-later"
+classifiers = [
+    "Environment :: Console",
+    "Intended Audience :: Other Audience",
+    "Operating System :: POSIX :: Linux",
+    "Development Status :: 3 - Alpha",
+    "Programming Language :: Python",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3 :: Only",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13"
+]
+
+[project.urls]
+Repository = "https://github.com/gryf/ebook-converter"
+
+[project.scripts]
+ebook-converter = "ebook_converter.main:main"
+
+[tool.setuptools.packages.find]
+exclude = ["snap"]
+
+[tool.setuptools.package-data]
+"*" = ["*.types", "*.css", "*.html", "*.xhtml", "*.xsl", "*.json"]
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,11 +0,0 @@
-beautifulsoup4>=4.9.3
-css-parser>=1.0.6
-filelock>=3.0.12
-html2text>=2020.1.16
-html5-parser==0.4.9 --no-binary lxml
-msgpack>=1.0.0
-odfpy>=1.4.1
-pillow>=8.0.1
-python-dateutil>=2.8.1
-setuptools>=50.3.2
-tinycss>=0.4
--- a/setup.cfg
+++ b/setup.cfg
@@ -1,46 +0,0 @@
-[metadata]
-name = ebook-converter
-version = 4.12.0
-summary = Convert ebook between different formats
-description-file =
-    README.rst
-author = gryf
-author-email = gryf73@gmail.com
-license = GPL3
-license_file = LICENSE
-url = https://github.com/gryf/ebook-converter
-classifier =
-    Environment :: Console
-    Intended Audience :: Other Audience
-    License :: OSI Approved :: GNU General Public License v3 (GPLv3)
-    Operating System :: POSIX :: Linux
-    Development Status :: 3 - Alpha
-    Programming Language :: Python
-    Programming Language :: Python :: 3
-    Programming Language :: Python :: 3 :: Only
-    Programming Language :: Python :: 3.6
-    Programming Language :: Python :: 3.7
-
-[options]
-packages = find:
-include_package_data = True
-install_requires =
-    filelock
-    python-dateutil
-    lxml
-    css-parser
-    beautifulsoup4
-    tinycss
-    pillow
-    msgpack
-    html5-parser
-    odfpy
-    setuptools
-    html2text
-
-[options.entry_points]
-console_scripts =
-    ebook-converter=ebook_converter.main:main
-
-[options.package_data]
-* = *.types *.css, *.html, *.xsl
--- a/setup.py
+++ b/setup.py
@@ -1,4 +0,0 @@
-import setuptools
-
-
-setuptools.setup()
Author	SHA1	Message	Date
Vitaliy Krasnoperov	c89fc132b8	Fix unsmarten text option (#16) * Create unsmarten.py * Update unsmarten.py * Update unsmarten.py * Create unsmarten.py	2026-02-06 09:06:12 +01:00
gryf	8b8a92e9fd	Removed license classifier in favor of SPDX entry.	2025-04-18 16:06:33 +02:00
gryf	6b7f796cfb	README update	2025-03-19 21:28:37 +01:00
gryf	72d0858ad8	Move from setup.cfg/py to pure pyproject.toml project definition	2025-03-13 16:55:40 +01:00
Roman Dobosz	4f548ec882	Merge pull request #10 from zagura/add-pyproject-toml Add pyproject.toml	2025-03-13 12:51:51 +01:00
Michał Zagórski	0faa2c0758	Add pyproject.toml	2025-03-12 23:23:22 +01:00
gryf	d37850520b	Remove getsize method of PIL in favor of getbbox	2025-03-10 18:33:05 +01:00
Roman Dobosz	5e56cb8c7a	Merge pull request #9 from NunoSempere/master add dependencies, fix some typos	2025-02-10 16:43:31 +01:00
NunoSempere	084e0d11ce	fix a few README typos mostly the lack of "the". I've left some others which are more charming	2025-01-05 22:32:56 +01:00
NunoSempere	4c3c5a9e27	add missing dependencies (found in Debian 12)	2025-01-05 22:30:09 +01:00
gryf	c240495c3d	Fix for nonexistent attribute in odt input format	2022-12-04 18:26:09 +01:00
gryf	53dea56929	Removed temporary stuff	2022-12-04 18:24:31 +01:00
gryf	ef02332465	Fix couple of logging errors	2022-12-04 18:22:06 +01:00
gryf	74abaf0de0	Fix imports for collections abstract classes	2022-12-04 18:18:07 +01:00