Fix unsmarten text option (#16)

* Create unsmarten.py
* Update unsmarten.py
* Update unsmarten.py
* Create unsmarten.py
Removed license classifier in favor of SPDX entry.
2026-03-25 11:53:33 +01:00 · 2026-02-06 09:06:12 +01:00 · 2025-04-18 16:06:33 +02:00 · 2025-03-19 21:28:37 +01:00 · 2025-03-13 16:55:40 +01:00 · 2025-03-13 12:51:51 +01:00
12 changed files with 170 additions and 85 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -3,3 +3,4 @@ build/
 dist/
 sdist/
 *.egg-info/
+venv/
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -1,2 +0,0 @@
-graft ebook_converter/data
-exclude .gitignore
--- a/README.rst
+++ b/README.rst
@@ -2,24 +2,39 @@
 Ebook converter
 ===============

-This is impudent ripoff of the bits from `Calibre project`_, and is aimed only
-for converter thing.
-
-My motivation is to have only converter for ebooks run from commandline,
-without all of those bells and whistles Calibre has, and with cleanest more
-*pythonic* approach.
+This is an impudent ripoff of the bits from `Calibre project`_, and is aimed
+only for converter thing.

+My motivation is to have only the converter for ebooks run from the
+commandline, without all of those bells and whistles Calibre has, and with
+cleanest more *pythonic* approach.

 Requirements
 ------------

 To build and run ebook converter, you'll need:

- Python 3.6 or newer
+- Python 3.10 or newer
 - `Liberation fonts`_
 - setuptools
 - ``pdftohtml``, ``pdfinfo`` and ``pdftoppm`` from `poppler`_ project for
  conversion from PDF available in ``$PATH``
+- ``libxml2-dev`` and ``libxslt-dev`` as dependencies for format manipulation
+  from some of the Calibre code
+
+and several Python packages:
+
+- `beautifulsoup4`_
+- `css-parser`_
+- `filelock`_
+- `html2text`_
+- `html5-parser`_
+- `msgpack`_
+- `odfpy`_
+- `pillow`_
+- `python-dateutil`_
+- `setuptools`_
+- `tinycss`_

 No Python2 support. Even if Calibre probably still is able to run on Python2, I
 do not have an intention to support it.
@@ -28,9 +43,9 @@ do not have an intention to support it.
 What's supported
 ----------------

-To be able to perform some optimization and make converter more reliable and
-easy to use, first I need to remove some of the features, which are totally not
-crucial in my opinion, although they might be re-added later, like, for
+To be able to perform some optimization and make the converter more reliable
+and easy to use, first I need to remove some of the features, which are totally
+not crucial in my opinion, although they might be re-added later, like, for
 instance there is no automatic language translations depending on the locale
 settings.

@@ -44,15 +59,16 @@ Windows is not currently supported, because of the original spaghetti code.
 This may change in the future, after cleanup of mentioned pasta would be
 completed.

-So called `Kindle periodical` format is not supported, since all we do care are
-local files. If there would be downloaded periodical thing (using Calibre for
-example), it would be treated as common book.
+So called *Kindle periodical* format (which `Amazon has`_ `killed`_ anyway back
+in September 2023) is not supported, since all we do care are local files. If
+there would be downloaded periodical thing (using Calibre for example), it
+would be treated as common book.


 Input formats
 ~~~~~~~~~~~~~

-Currently, I've tested following input formats:
+Currently, I've tested the following input formats:

 - Microsoft Word 2007 and up (``docx``)
 - EPUB, both v2 and v3 (``epub``)
@@ -107,7 +123,7 @@ managers), i.e:
   $ . venv/bin/activate
   (venv) $ git clone https://github.com/gryf/ebook-converter
   (venv) $ cd ebook-converter
-   (venv) $ pip install -r requirements.txt .
+   (venv) $ pip install .

 Simple as that. And from now on, you can issue converter:

@@ -122,9 +138,20 @@ License
 This work is licensed on GPL3 license, like the original work. See LICENSE file
 for details.

-
 .. _Calibre project: https://calibre-ebook.com/
 .. _pypi: https://pypi.python.org
 .. _Liberation fonts: https://github.com/liberationfonts/liberation-fonts
-.. _Kindle periodical: https://sellercentral.amazon.com/gp/help/external/help.html?itemID=202047960&language=en-US
+.. _Amazon has: https://goodereader.com/blog/kindle/amazon-will-discontinue-newspaper-and-magazine-subscriptions-in-september
+.. _killed: https://www.theverge.com/23861370/amazon-kindle-periodicals-unlimited-ended
 .. _poppler: https://poppler.freedesktop.org/
+.. _beautifulsoup4: https://www.crummy.com/software/BeautifulSoup
+.. _css-parser: https://github.com/ebook-utils/css-parser
+.. _filelock: https://github.com/tox-dev/py-filelock
+.. _html2text: https://github.com/Alir3z4/html2text
+.. _html5-parser: https://html5-parser.readthedocs.io
+.. _msgpack: https://msgpack.org
+.. _odfpy: https://github.com/eea/odfpy
+.. _pillow: https://python-pillow.github.io
+.. _python-dateutil: https://github.com/dateutil/dateutil
+.. _setuptools: https://setuptools.pypa.io
+.. _tinycss: http://tinycss.readthedocs.io
--- a/ebook_converter/ebooks/conversion/plugins/pml_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/pml_output.py
@@ -62,7 +62,7 @@ class PMLOutput(OutputFormatPlugin):
                    im = Image.open(io.BytesIO(item.data))
                else:
                    im = Image.open(io.BytesIO(item.data)).convert('P')
-                    im.thumbnail((300,300), Image.ANTIALIAS)
+                    im.thumbnail((300,300), Image.LANCZOS)

                data = io.BytesIO()
                im.save(data, 'PNG')
--- a/ebook_converter/ebooks/lrf/html/convert_from.py
+++ b/ebook_converter/ebooks/lrf/html/convert_from.py
@@ -1012,7 +1012,7 @@ class HTMLConverter(object):
            self.image_memory.append(pt)  # Neccessary, trust me ;-)
            try:
                im.resize((int(width), int(height)),
-                          PILImage.ANTIALIAS).save(pt, encoding)
+                          PILImage.LANCZOS).save(pt, encoding)
                pt.close()
                self.scaled_images[path] = pt
                return pt.name
@@ -1970,7 +1970,7 @@ def process_file(path, options, logger):
                options.cover = cf.name

                tim = im.resize((int(0.75 * th), th),
-                                PILImage.ANTIALIAS).convert('RGB')
+                                PILImage.LANCZOS).convert('RGB')
                tf = PersistentTemporaryFile(prefix=__appname__ + '_',
                                             suffix=".jpg")
                tf.close()
--- a/ebook_converter/ebooks/lrf/html/table.py
+++ b/ebook_converter/ebooks/lrf/html/table.py
@@ -145,7 +145,7 @@ class Cell(object):
                continue
            word = token.split()
            word = word[0] if word else ""
-            width = font.getsize(word)[0]
+            width = font.getbbox(word)[2]
            if width > mwidth:
                mwidth = width
        return parindent + mwidth + 2
@@ -191,7 +191,7 @@ class Cell(object):
            if (ff, fs) != (ts['fontfacename'], ts['fontsize']):
                font = get_font(ff, self.pts_to_pixels(fs))
            for word in token.split():
-                width, height = font.getsize(word)
+                _, _, width, height = font.getbbox(word)
                left, right, top, bottom = add_word(width, height, left, right, top, bottom, ls, ws)
        return right+3+max(parindent, 10), bottom

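Note on the getsize/getbbox change above: getbbox returns a 4-tuple (left, top, right, bottom), and the diff takes the right and bottom edges as width and height. A minimal sketch of a helper that accepts both old and new font objects follows. The name text_size and the two stub font classes are hypothetical and exist only so the snippet runs without Pillow; real code would pass a PIL ImageFont object instead.

```python
def text_size(font, word):
    """Return (width, height) for word, mirroring the diff above.

    Uses getbbox when the font object has it and falls back to the
    legacy getsize otherwise. Hypothetical helper, not part of the project.
    """
    if hasattr(font, "getbbox"):
        # getbbox gives (left, top, right, bottom); the diff keeps the
        # right and bottom edges, so the same choice is made here.
        _, _, right, bottom = font.getbbox(word)
        return right, bottom
    return font.getsize(word)


class NewStyleFont:
    """Stub standing in for a font that only offers getbbox."""

    def getbbox(self, text):
        # Assumed metrics: 6 px per character, 10 px tall, no offset.
        return (0, 0, 6 * len(text), 10)


class OldStyleFont:
    """Stub standing in for a font that only offers getsize."""

    def getsize(self, text):
        return (6 * len(text), 10)


print(text_size(NewStyleFont(), "word"))
print(text_size(OldStyleFont(), "word"))
```

If the bounding box can start at a non-zero left or top offset, computing right - left and bottom - top is the alternative to consider; which of the two matches the old layout behaviour here is not something this sketch settles.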
--- a/ebook_converter/ebooks/oeb/transforms/unsmarten.py
+++ b/ebook_converter/ebooks/oeb/transforms/unsmarten.py
@@ -0,0 +1,28 @@
+__license__ = 'GPL 3'
+__copyright__ = '2011, John Schember <john@nachtimwald.com>'
+__docformat__ = 'restructuredtext en'
+
+from ebook_converter.ebooks.oeb.base import OEB_DOCS, XPath
+from ebook_converter.ebooks.oeb.parse_utils import barename
+from ebook_converter.utils.unsmarten import unsmarten_text
+
+
+class UnsmartenPunctuation:
+
+    def __init__(self):
+        self.html_tags = XPath('descendant::h:*')
+
+    def unsmarten(self, root):
+        for x in self.html_tags(root):
+            if not barename(x.tag) == 'pre':
+                if getattr(x, 'text', None):
+                    x.text = unsmarten_text(x.text)
+                if getattr(x, 'tail', None) and x.tail:
+                    x.tail = unsmarten_text(x.tail)
+
+    def __call__(self, oeb, context):
+        bx = XPath('//h:body')
+        for x in oeb.manifest.items:
+            if x.media_type in OEB_DOCS:
+                for body in bx(x.data):
+                    self.unsmarten(body)
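Note on the transform above: it rewrites both the text and the tail of every descendant element except pre, because text that follows a closing tag lives in the tail of that element. A minimal sketch of the same walk, assuming the standard library's xml.etree.ElementTree in place of lxml and the project's XPath helper, un-namespaced tags, and a hypothetical plain() function standing in for unsmarten_text:

```python
import xml.etree.ElementTree as ET


def plain(text):
    # Stand-in for unsmarten_text: only handles curly double quotes.
    return text.replace("\u201c", '"').replace("\u201d", '"')


def unsmarten_tree(body):
    # iter() yields body itself first, then every descendant in document
    # order; body is skipped to mirror the descendant-only XPath above.
    for el in body.iter():
        if el is body or el.tag == "pre":
            continue
        if el.text:
            el.text = plain(el.text)
        if el.tail:
            el.tail = plain(el.tail)


body = ET.fromstring(
    "<body><p>\u201cone\u201d<b>two</b> \u201cthree\u201d</p>"
    "<pre>\u201ckeep\u201d</pre></body>"
)
unsmarten_tree(body)
print(ET.tostring(body, encoding="unicode"))
```

In this sketch, as in the diff, only the pre element itself is skipped; elements nested inside a pre would still be visited by the loop.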
--- a/ebook_converter/utils/unsmarten.py
+++ b/ebook_converter/utils/unsmarten.py
@@ -0,0 +1,40 @@
+__license__ = 'GPL 3'
+__copyright__ = '2011, John Schember <john@nachtimwald.com>'
+__docformat__ = 'restructuredtext en'
+
+from ebook_converter.utils.mreplace import MReplace
+
+_mreplace = MReplace({
+        '&#8211;': '--',
+        '&ndash;': '--',
+        '–': '--',
+        '&#8212;': '---',
+        '&mdash;': '---',
+        '—': '---',
+        '&#8230;': '...',
+        '&hellip;': '...',
+        '…': '...',
+        '&#8220;': '"',
+        '&#8221;': '"',
+        '&#8222;': '"',
+        '&#8243;': '"',
+        '&ldquo;': '"',
+        '&rdquo;': '"',
+        '&bdquo;': '"',
+        '&Prime;': '"',
+        '“':'"',
+        '”':'"',
+        '„':'"',
+        '″':'"',
+        '&#8216;':"'",
+        '&#8217;':"'",
+        '&#8242;':"'",
+        '&lsquo;':"'",
+        '&rsquo;':"'",
+        '&prime;':"'",
+        '‘':"'",
+        '’':"'",
+        '′':"'",
+})
+
+unsmarten_text = _mreplace.mreplace
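Note on the mapping above: the table pairs each smart-punctuation entity or character with an ASCII replacement, and unsmarten_text applies all of them to a string. The MReplace class itself is not shown in this diff, so the following is only a sketch of one common way to do such a multi-string replacement with the standard re module. The names REPLACEMENTS and unsmarten_sketch are hypothetical, and the mapping is a trimmed subset of the one in the diff.

```python
import re

# Trimmed, illustrative subset of the table in the diff.
REPLACEMENTS = {
    "&ndash;": "--",
    "\u2013": "--",     # en dash
    "&mdash;": "---",
    "\u2014": "---",    # em dash
    "&hellip;": "...",
    "\u2026": "...",    # horizontal ellipsis
    "\u201c": '"',      # left double quotation mark
    "\u201d": '"',      # right double quotation mark
    "\u2018": "'",      # left single quotation mark
    "\u2019": "'",      # right single quotation mark
}

# One alternation over all keys, longest first so a longer key is never
# shadowed by a shorter one that happens to be its prefix.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(REPLACEMENTS, key=len, reverse=True))
)


def unsmarten_sketch(text):
    """Replace every known token in a single pass over the text."""
    return _PATTERN.sub(lambda m: REPLACEMENTS[m.group(0)], text)


print(unsmarten_sketch("\u201cWait\u2026\u201d she said \u2013 it\u2019s late."))
```

A single compiled pattern means the input is scanned once, and replacement output is never re-scanned, so a replacement such as "--" cannot be matched again by another key.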
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -0,0 +1,52 @@
+[build-system]
+requires = ["setuptools >= 77.0"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "ebook-converter"
+version = "4.12.0"
+requires-python = ">= 3.10"
+description = "Convert ebook between different formats"
+dependencies = [
+    "beautifulsoup4>=4.9.3",
+    "css-parser>=1.0.6",
+    "filelock>=3.0.12",
+    "html2text>=2020.1.16",
+    "html5-parser==0.4.12",
+    "msgpack>=1.0.0",
+    "odfpy>=1.4.1",
+    "pillow>=8.0.1",
+    "python-dateutil>=2.8.1",
+    "setuptools>=61.0",
+    "tinycss>=0.4"
+]
+readme = "README.rst"
+authors = [
+    {name = "gryf", email = "gryf73@gmail.com"}
+]
+license = "GPL-3.0-or-later"
+classifiers = [
+    "Environment :: Console",
+    "Intended Audience :: Other Audience",
+    "Operating System :: POSIX :: Linux",
+    "Development Status :: 3 - Alpha",
+    "Programming Language :: Python",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3 :: Only",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13"
+]
+
+[project.urls]
+Repository = "https://github.com/gryf/ebook-converter"
+
+[project.scripts]
+ebook-converter = "ebook_converter.main:main"
+
+[tool.setuptools.packages.find]
+exclude = ["snap"]
+
+[tool.setuptools.package-data]
+"*" = ["*.types", "*.css", "*.html", "*.xhtml", "*.xsl", "*.json"]
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,11 +0,0 @@
-beautifulsoup4>=4.9.3
-css-parser>=1.0.6
-filelock>=3.0.12
-html2text>=2020.1.16
-html5-parser==0.4.9 --no-binary lxml
-msgpack>=1.0.0
-odfpy>=1.4.1
-pillow>=8.0.1
-python-dateutil>=2.8.1
-setuptools>=50.3.2
-tinycss>=0.4
--- a/setup.cfg
+++ b/setup.cfg
@@ -1,46 +0,0 @@
-[metadata]
-name = ebook-converter
-version = 4.12.0
-summary = Convert ebook between different formats
-description-file =
-    README.rst
-author = gryf
-author-email = gryf73@gmail.com
-license = GPL3
-license_file = LICENSE
-url = https://github.com/gryf/ebook-converter
-classifier =
-    Environment :: Console
-    Intended Audience :: Other Audience
-    License :: OSI Approved :: GNU General Public License v3 (GPLv3)
-    Operating System :: POSIX :: Linux
-    Development Status :: 3 - Alpha
-    Programming Language :: Python
-    Programming Language :: Python :: 3
-    Programming Language :: Python :: 3 :: Only
-    Programming Language :: Python :: 3.6
-    Programming Language :: Python :: 3.7
-
-[options]
-packages = find:
-include_package_data = True
-install_requires =
-    filelock
-    python-dateutil
-    lxml
-    css-parser
-    beautifulsoup4
-    tinycss
-    pillow
-    msgpack
-    html5-parser
-    odfpy
-    setuptools
-    html2text
-
-[options.entry_points]
-console_scripts =
-    ebook-converter=ebook_converter.main:main
-
-[options.package_data]
-* = *.types *.css, *.html, *.xsl
--- a/setup.py
+++ b/setup.py
@@ -1,4 +0,0 @@
-import setuptools
-
-
-setuptools.setup()
Author	SHA1	Message	Date
Vitaliy Krasnoperov	c89fc132b8	Fix unsmarten text option (#16) * Create unsmarten.py * Update unsmarten.py * Update unsmarten.py * Create unsmarten.py	2026-02-06 09:06:12 +01:00
gryf	8b8a92e9fd	Removed license classifier in favor of SPDX entry.	2025-04-18 16:06:33 +02:00
gryf	6b7f796cfb	README update	2025-03-19 21:28:37 +01:00
gryf	72d0858ad8	Move from setup.cfg/py to pure pyproject.toml project definition	2025-03-13 16:55:40 +01:00
Roman Dobosz	4f548ec882	Merge pull request #10 from zagura/add-pyproject-toml Add pyproject.toml	2025-03-13 12:51:51 +01:00
Michał Zagórski	0faa2c0758	Add pyproject.toml	2025-03-12 23:23:22 +01:00
gryf	d37850520b	Remove getsize method of PIL in favor of getbbox	2025-03-10 18:33:05 +01:00
Roman Dobosz	5e56cb8c7a	Merge pull request #9 from NunoSempere/master add dependencies, fix some typos	2025-02-10 16:43:31 +01:00
NunoSempere	084e0d11ce	fix a few README typos mostly the lack of "the". I've left some others which are more charming	2025-01-05 22:32:56 +01:00
NunoSempere	4c3c5a9e27	add missing dependencies (found in Debian 12)	2025-01-05 22:30:09 +01:00