mirror of https://github.com/gryf/ebook-converter.git synced 2026-03-25 11:53:33 +01:00

Compare commits

14 Commits

Author SHA1 Message Date
Vitaliy Krasnoperov
c89fc132b8 Fix unsmarten text option (#16)
* Create unsmarten.py

* Update unsmarten.py

* Update unsmarten.py

* Create unsmarten.py
2026-02-06 09:06:12 +01:00
8b8a92e9fd Removed license classifier in favor of SPDX entry. 2025-04-18 16:06:33 +02:00
6b7f796cfb README update 2025-03-19 21:28:37 +01:00
72d0858ad8 Move from setup.cfg/py to pure pyproject.toml project definition 2025-03-13 16:55:40 +01:00
4f548ec882 Merge pull request #10 from zagura/add-pyproject-toml
Add pyproject.toml
2025-03-13 12:51:51 +01:00
Michał Zagórski
0faa2c0758 Add pyproject.toml 2025-03-12 23:23:22 +01:00
d37850520b Remove getsize method of PIL in favor of getbbox 2025-03-10 18:33:05 +01:00
5e56cb8c7a Merge pull request #9 from NunoSempere/master
add dependencies, fix some typos
2025-02-10 16:43:31 +01:00
NunoSempere
084e0d11ce fix a few README typos
mostly the lack of "the". I've left some others which are more
charming
2025-01-05 22:32:56 +01:00
NunoSempere
4c3c5a9e27 add missing dependencies (found in Debian 12) 2025-01-05 22:30:09 +01:00
c240495c3d Fix for nonexistent attribute in odt input format 2022-12-04 18:26:09 +01:00
53dea56929 Removed temporary stuff 2022-12-04 18:24:31 +01:00
ef02332465 Fix couple of logging errors 2022-12-04 18:22:06 +01:00
74abaf0de0 Fix imports for collections abstract classes 2022-12-04 18:18:07 +01:00
19 changed files with 189 additions and 104 deletions

.gitignore

@@ -3,3 +3,4 @@ build/
dist/
sdist/
*.egg-info/
venv/

View File

@@ -1,2 +0,0 @@
graft ebook_converter/data
exclude .gitignore

View File

@@ -2,24 +2,39 @@
Ebook converter
===============
This is impudent ripoff of the bits from `Calibre project`_, and is aimed only
for converter thing.
My motivation is to have only converter for ebooks run from commandline,
without all of those bells and whistles Calibre has, and with cleanest more
*pythonic* approach.
This is an impudent ripoff of the bits from `Calibre project`_, and is aimed
only for converter thing.
My motivation is to have only the converter for ebooks run from the
commandline, without all of those bells and whistles Calibre has, and with
cleanest more *pythonic* approach.
Requirements
------------
To build and run ebook converter, you'll need:
- Python 3.6 or newer
- Python 3.10 or newer
- `Liberation fonts`_
- setuptools
- ``pdftohtml``, ``pdfinfo`` and ``pdftoppm`` from `poppler`_ project for
conversion from PDF available in ``$PATH``
- ``libxml2-dev`` and ``libxslt-dev`` as dependencies for format manipulation
from some of the Calibre code
and several Python packages:
- `beautifulsoup4`_
- `css-parser`_
- `filelock`_
- `html2text`_
- `html5-parser`_
- `msgpack`_
- `odfpy`_
- `pillow`_
- `python-dateutil`_
- `setuptools`_
- `tinycss`_
No Python2 support. Even if Calibre probably still is able to run on Python2, I
do not have an intention to support it.
@@ -28,9 +43,9 @@ do not have an intention to support it.
What's supported
----------------
To be able to perform some optimization and make converter more reliable and
easy to use, first I need to remove some of the features, which are totally not
crucial in my opinion, although they might be re-added later, like, for
To be able to perform some optimization and make the converter more reliable
and easy to use, first I need to remove some of the features, which are totally
not crucial in my opinion, although they might be re-added later, like, for
instance there is no automatic language translations depending on the locale
settings.
@@ -44,15 +59,16 @@ Windows is not currently supported, because of the original spaghetti code.
This may change in the future, after cleanup of mentioned pasta would be
completed.
So called `Kindle periodical` format is not supported, since all we do care are
local files. If there would be downloaded periodical thing (using Calibre for
example), it would be treated as common book.
So called *Kindle periodical* format (which `Amazon has`_ `killed`_ anyway back
in September 2023) is not supported, since all we do care are local files. If
there would be downloaded periodical thing (using Calibre for example), it
would be treated as common book.
Input formats
~~~~~~~~~~~~~
Currently, I've tested following input formats:
Currently, I've tested the following input formats:
- Microsoft Word 2007 and up (``docx``)
- EPUB, both v2 and v3 (``epub``)
@@ -107,7 +123,7 @@ managers), i.e:
$ . venv/bin/activate
(venv) $ git clone https://github.com/gryf/ebook-converter
(venv) $ cd ebook-converter
(venv) $ pip install -r requirements.txt .
(venv) $ pip install .
Simple as that. And from now on, you can issue converter:
@@ -122,9 +138,20 @@ License
This work is licensed on GPL3 license, like the original work. See LICENSE file
for details.
.. _Calibre project: https://calibre-ebook.com/
.. _pypi: https://pypi.python.org
.. _Liberation fonts: https://github.com/liberationfonts/liberation-fonts
.. _Kindle periodical: https://sellercentral.amazon.com/gp/help/external/help.html?itemID=202047960&language=en-US
.. _Amazon has: https://goodereader.com/blog/kindle/amazon-will-discontinue-newspaper-and-magazine-subscriptions-in-september
.. _killed: https://www.theverge.com/23861370/amazon-kindle-periodicals-unlimited-ended
.. _poppler: https://poppler.freedesktop.org/
.. _beautifulsoup4: https://www.crummy.com/software/BeautifulSoup
.. _css-parser: https://github.com/ebook-utils/css-parser
.. _filelock: https://github.com/tox-dev/py-filelock
.. _html2text: https://github.com/Alir3z4/html2text
.. _html5-parser: https://html5-parser.readthedocs.io
.. _msgpack: https://msgpack.org
.. _odfpy: https://github.com/eea/odfpy
.. _pillow: https://python-pillow.github.io
.. _python-dateutil: https://github.com/dateutil/dateutil
.. _setuptools: https://setuptools.pypa.io
.. _tinycss: http://tinycss.readthedocs.io

View File

@@ -32,7 +32,7 @@ def debug():
# plugins {{{
class Plugins(collections.Mapping):
class Plugins(collections.abc.Mapping):
def __init__(self):
self._plugins = {}
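Note on the hunk above: the abstract base classes have lived in `collections.abc` since Python 3.3, and the old aliases in `collections` were removed in Python 3.10, which is why the base class has to change. A minimal sketch of a mapping built on the new import path follows; the method bodies are assumptions for illustration only and are not taken from the real `Plugins` class.

```python
import collections.abc


class Plugins(collections.abc.Mapping):
    """Stripped-down, hypothetical stand-in for the class in the diff."""

    def __init__(self):
        self._plugins = {}

    # Mapping requires these three methods; everything else
    # (get, keys, items, __contains__, ...) is derived from them.
    def __getitem__(self, name):
        return self._plugins[name]

    def __iter__(self):
        return iter(self._plugins)

    def __len__(self):
        return len(self._plugins)


plugins = Plugins()
is_mapping = isinstance(plugins, collections.abc.Mapping)
```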

View File

@@ -19,7 +19,7 @@ def is_iterable(obj):
return hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes))
class OrderedSet(collections.MutableSet):
class OrderedSet(collections.abc.MutableSet):
"""
An OrderedSet is a custom MutableSet that remembers its order, so that
every entry has an index that can be looked up.

View File

@@ -237,7 +237,7 @@ class HTMLInput(InputFormatPlugin):
if not os.access(link, os.R_OK):
return link_
if os.path.isdir(link):
self.log.warning(link_, 'is a link to a directory. Ignoring.')
self.log.warning('%s is a link to a directory. Ignoring.', link_)
return link_
if link not in self.added_resources:
bhref = os.path.basename(link)
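Note on the hunk above: the fix moves the message to the first argument and passes the value as a format argument. The sketch below shows the same call shape with the standard library `logging` module; whether the project's own `ebook_converter.logging` wrapper behaves identically is an assumption, and `ListHandler` and the logger name are hypothetical helpers used only to capture the output.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical handler that stores formatted messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        # getMessage() applies the %-style arguments to the format string.
        self.messages.append(record.getMessage())


log = logging.getLogger('converter_sketch')
log.setLevel(logging.WARNING)
log.propagate = False
handler = ListHandler()
log.addHandler(handler)

link_ = 'images/'
# Format string first, values after: formatting is deferred until the
# record is actually emitted.
log.warning('%s is a link to a directory. Ignoring.', link_)
```

With the old argument order the path itself would have been treated as the format string, so the trailing text could not be interpolated into it.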

View File

@@ -62,7 +62,7 @@ class PMLOutput(OutputFormatPlugin):
im = Image.open(io.BytesIO(item.data))
else:
im = Image.open(io.BytesIO(item.data)).convert('P')
im.thumbnail((300,300), Image.ANTIALIAS)
im.thumbnail((300,300), Image.LANCZOS)
data = io.BytesIO()
im.save(data, 'PNG')

View File

@@ -1012,7 +1012,7 @@ class HTMLConverter(object):
self.image_memory.append(pt) # Neccessary, trust me ;-)
try:
im.resize((int(width), int(height)),
PILImage.ANTIALIAS).save(pt, encoding)
PILImage.LANCZOS).save(pt, encoding)
pt.close()
self.scaled_images[path] = pt
return pt.name
@@ -1970,7 +1970,7 @@ def process_file(path, options, logger):
options.cover = cf.name
tim = im.resize((int(0.75 * th), th),
PILImage.ANTIALIAS).convert('RGB')
PILImage.LANCZOS).convert('RGB')
tf = PersistentTemporaryFile(prefix=__appname__ + '_',
suffix=".jpg")
tf.close()

View File

@@ -145,7 +145,7 @@ class Cell(object):
continue
word = token.split()
word = word[0] if word else ""
width = font.getsize(word)[0]
width = font.getbbox(word)[2]
if width > mwidth:
mwidth = width
return parindent + mwidth + 2
@@ -191,7 +191,7 @@ class Cell(object):
if (ff, fs) != (ts['fontfacename'], ts['fontsize']):
font = get_font(ff, self.pts_to_pixels(fs))
for word in token.split():
width, height = font.getsize(word)
_, _, width, height = font.getbbox(word)
left, right, top, bottom = add_word(width, height, left, right, top, bottom, ls, ws)
return right+3+max(parindent, 10), bottom

View File

@@ -452,7 +452,7 @@ class MobiMLizer(object):
try:
item = self.oeb.manifest.hrefs[base.urlnormalize(href)]
except:
self.oeb.logger.warning('Failed to find image:', href)
self.oeb.logger.warning('Failed to find image: %s', href)
else:
try:
width, height = identify(item.data)[1:]

View File

@@ -444,8 +444,8 @@ class Indexer(object): # {{{
if self.is_periodical and self.masthead_offset is None:
raise ValueError('Periodicals must have a masthead')
self.log('Generating MOBI index for a %s', 'periodical' if
self.is_periodical else 'book')
self.log.info('Generating MOBI index for a %s', 'periodical' if
self.is_periodical else 'book')
self.is_flat_periodical = False
if self.is_periodical:
periodical_node = next(iter(oeb.toc))

View File

@@ -14,13 +14,15 @@ from odf.draw import Frame as odFrame, Image as odImage
from odf.namespaces import TEXTNS as odTEXTNS
from ebook_converter.utils import directory
from ebook_converter.ebooks.oeb import parse_utils
from ebook_converter.ebooks.oeb.base import _css_logger
from ebook_converter import polyglot
class Extract(ODF2XHTML):
def extract_pictures(self, zf):
def _extract_pictures(self, zf):
if not os.path.exists('Pictures'):
os.makedirs('Pictures')
for name in zf.namelist():
@@ -30,8 +32,8 @@ class Extract(ODF2XHTML):
with open(name, 'wb') as f:
f.write(data)
def apply_list_starts(self, root, log):
if not self.list_starts:
def _apply_list_starts(self, root, log):
if not hasattr(self, "list_starts") or not self.list_starts:
return
list_starts = frozenset(self.list_starts)
for ol in root.xpath('//*[local-name() = "ol" and @class]'):
@@ -46,7 +48,7 @@ class Extract(ODF2XHTML):
self.filter_css(root, log)
self.extract_css(root, log)
self.epubify_markup(root, log)
self.apply_list_starts(root, log)
self._apply_list_starts(root, log)
html = etree.tostring(root, encoding='utf-8', xml_declaration=True)
return html
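Note on the `list_starts` guard introduced in this file: checking `hasattr` before reading the attribute protects against instances on which it was never set. A roughly equivalent spelling uses `getattr` with a default, sketched below. `ExtractSketch` and `starts_to_apply` are hypothetical names; the real method goes on to rewrite `ol` elements rather than return a set.

```python
class ExtractSketch:
    """Hypothetical stand-in for the converter's Extract class."""

    def starts_to_apply(self):
        # A default of None covers both "attribute missing" and
        # "attribute present but empty" with a single truthiness test.
        list_starts = getattr(self, 'list_starts', None)
        if not list_starts:
            return frozenset()
        return frozenset(list_starts)


extract = ExtractSketch()
before = extract.starts_to_apply()

extract.list_starts = {'L1': 3}
after = extract.starts_to_apply()
```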
@@ -84,22 +86,21 @@ class Extract(ODF2XHTML):
return rule
def epubify_markup(self, root, log):
from ebook_converter.ebooks.oeb.base import XPath, XHTML
# Fix empty title tags
for t in XPath('//h:title')(root):
for t in parse_utils.XPath('//h:title')(root):
if not t.text:
t.text = u' '
# Fix <p><div> constructs as the asinine epubchecker complains
# about them
pdiv = XPath('//h:p/h:div')
pdiv = parse_utils.XPath('//h:p/h:div')
for div in pdiv(root):
div.getparent().tag = XHTML('div')
div.getparent().tag = parse_utils.XHTML('div')
# Remove the position:relative as it causes problems with some epub
# renderers. Remove display: block on an image inside a div as it is
# redundant and prevents text-align:center from working in ADE
# Also ensure that the img is contained in its containing div
imgpath = XPath('//h:div/h:img[@style]')
imgpath = parse_utils.XPath('//h:div/h:img[@style]')
for img in imgpath(root):
div = img.getparent()
if len(div) == 1:
@@ -119,7 +120,7 @@ class Extract(ODF2XHTML):
# works in both WebKit and ADE.
# https://bugs.launchpad.net/bugs/1063207
# https://bugs.launchpad.net/calibre/+bug/859343
imgpath = XPath('descendant::h:div/h:div/h:img')
imgpath = parse_utils.XPath('descendant::h:div/h:div/h:img')
for img in imgpath(root):
div2 = img.getparent()
div1 = div2.getparent()
@@ -297,7 +298,7 @@ class Extract(ODF2XHTML):
with open('index.xhtml', 'wb') as f:
f.write(polyglot.as_bytes(html))
zf = ZipFile(stream, 'r')
self.extract_pictures(zf)
self._extract_pictures(zf)
opf = OPFCreator(os.path.abspath(os.getcwd()), mi)
opf.create_manifest([(os.path.abspath(os.path.join(r, f2)), None)
for r, _, fnames in os.walk(os.getcwd())

View File

@@ -0,0 +1,28 @@
__license__ = 'GPL 3'
__copyright__ = '2011, John Schember <john@nachtimwald.com>'
__docformat__ = 'restructuredtext en'
from ebook_converter.ebooks.oeb.base import OEB_DOCS, XPath
from ebook_converter.ebooks.oeb.parse_utils import barename
from ebook_converter.utils.unsmarten import unsmarten_text
class UnsmartenPunctuation:
def __init__(self):
self.html_tags = XPath('descendant::h:*')
def unsmarten(self, root):
for x in self.html_tags(root):
if not barename(x.tag) == 'pre':
if getattr(x, 'text', None):
x.text = unsmarten_text(x.text)
if getattr(x, 'tail', None) and x.tail:
x.tail = unsmarten_text(x.tail)
def __call__(self, oeb, context):
bx = XPath('//h:body')
for x in oeb.manifest.items:
if x.media_type in OEB_DOCS:
for body in bx(x.data):
self.unsmarten(body)
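Note on the traversal above: in an element tree, character data sits both in an element's `text` (before its first child) and in its `tail` (after its closing tag), so both have to be rewritten. The sketch below reproduces that walk with the standard library `xml.etree.ElementTree` instead of the project's lxml-based `XPath` helper. It assumes that both the `text` and the `tail` branch sit under the `pre` check, which the flattened indentation here does not show. `apply_to_text` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def apply_to_text(root, func):
    """Apply func to text and tail of every descendant except <pre>."""
    for element in root.iter():
        if element is root:
            # Mirror 'descendant::' semantics: the root itself is skipped.
            continue
        if element.tag == 'pre':
            continue
        if element.text:
            element.text = func(element.text)
        if element.tail:
            element.tail = func(element.tail)


body = ET.fromstring('<body><p>a <b>b</b> c</p><pre>d</pre> e</body>')
apply_to_text(body, str.upper)
result = ET.tostring(body, encoding='unicode')
```

In this sketch the text inside `pre` and the text directly after it are both left alone, since the element is skipped as a whole.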

View File

@@ -4,7 +4,6 @@ import os
import sys
from ebook_converter import logging
from ebook_converter.customize.conversion import OptionRecommendation
from ebook_converter.ebooks.conversion.plumber import Plumber
@@ -68,6 +67,7 @@ def run(args):
return 0
def main():
parser = argparse.ArgumentParser()
parser.add_argument('from_file', help="Input file to be converted")
@@ -83,5 +83,4 @@ def main():
LOG.set_verbose(args.verbose, args.quiet)
print(args)
sys.exit(run(args))

View File

@@ -0,0 +1,40 @@
__license__ = 'GPL 3'
__copyright__ = '2011, John Schember <john@nachtimwald.com>'
__docformat__ = 'restructuredtext en'
from ebook_converter.utils.mreplace import MReplace
_mreplace = MReplace({
'&#8211;': '--',
'&ndash;': '--',
'–': '--',
'&#8212;': '---',
'&mdash;': '---',
'—': '---',
'&#8230;': '...',
'&hellip;': '...',
'…': '...',
'&#8220;': '"',
'&#8221;': '"',
'&#8222;': '"',
'&#8243;': '"',
'&ldquo;': '"',
'&rdquo;': '"',
'&bdquo;': '"',
'&Prime;': '"',
'“':'"',
'”':'"',
'„':'"',
'″':'"',
'&#8216;':"'",
'&#8217;':"'",
'&#8242;':"'",
'&lsquo;':"'",
'&rsquo;':"'",
'&prime;':"'",
'‘':"'",
'’':"'",
'′':"'",
})
unsmarten_text = _mreplace.mreplace
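Note on the module above: `unsmarten_text` maps typographic punctuation and the matching HTML entities back to plain ASCII in one pass. The sketch below shows the same idea with only the standard library. It does not use the project's `MReplace` class, whose internals are not shown here; `REPLACEMENTS` is a shortened, hypothetical subset of the table and `unsmarten_sketch` is a made-up name.

```python
import re

# Shortened subset of the replacement table, for illustration only.
REPLACEMENTS = {
    '&ndash;': '--', '\u2013': '--',      # en dash
    '&mdash;': '---', '\u2014': '---',    # em dash
    '&hellip;': '...', '\u2026': '...',   # horizontal ellipsis
    '\u201c': '"', '\u201d': '"',         # curly double quotes
    '\u2018': "'", '\u2019': "'",         # curly single quotes
}

# One alternation over all keys, longest first so that a longer key is
# never shadowed by a shorter one sharing its prefix.
_PATTERN = re.compile('|'.join(
    re.escape(key) for key in sorted(REPLACEMENTS, key=len, reverse=True)
))


def unsmarten_sketch(text):
    """Replace every known key in text with its ASCII counterpart."""
    return _PATTERN.sub(lambda match: REPLACEMENTS[match.group(0)], text)


plain = unsmarten_sketch('\u201cWait\u2026\u201d \u2013 she said')
```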

pyproject.toml (new file)

@@ -0,0 +1,52 @@
[build-system]
requires = ["setuptools >= 77.0"]
build-backend = "setuptools.build_meta"
[project]
name = "ebook-converter"
version = "4.12.0"
requires-python = ">= 3.10"
description = "Convert ebook between different formats"
dependencies = [
"beautifulsoup4>=4.9.3",
"css-parser>=1.0.6",
"filelock>=3.0.12",
"html2text>=2020.1.16",
"html5-parser==0.4.12",
"msgpack>=1.0.0",
"odfpy>=1.4.1",
"pillow>=8.0.1",
"python-dateutil>=2.8.1",
"setuptools>=61.0",
"tinycss>=0.4"
]
readme = "README.rst"
authors = [
{name = "gryf", email = "gryf73@gmail.com"}
]
license = "GPL-3.0-or-later"
classifiers = [
"Environment :: Console",
"Intended Audience :: Other Audience",
"Operating System :: POSIX :: Linux",
"Development Status :: 3 - Alpha",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3 :: Only",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13"
]
[project.urls]
Repository = "https://github.com/gryf/ebook-converter"
[project.scripts]
ebook-converter = "ebook_converter.main:main"
[tool.setuptools.packages.find]
exclude = ["snap"]
[tool.setuptools.package-data]
"*" = ["*.types", "*.css", "*.html", "*.xhtml", "*.xsl", "*.json"]

View File

@@ -1,11 +0,0 @@
beautifulsoup4>=4.9.3
css-parser>=1.0.6
filelock>=3.0.12
html2text>=2020.1.16
html5-parser==0.4.9 --no-binary lxml
msgpack>=1.0.0
odfpy>=1.4.1
pillow>=8.0.1
python-dateutil>=2.8.1
setuptools>=50.3.2
tinycss>=0.4

View File

@@ -1,46 +0,0 @@
[metadata]
name = ebook-converter
version = 4.12.0
summary = Convert ebook between different formats
description-file =
README.rst
author = gryf
author-email = gryf73@gmail.com
license = GPL3
license_file = LICENSE
url = https://github.com/gryf/ebook-converter
classifier =
Environment :: Console
Intended Audience :: Other Audience
License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Operating System :: POSIX :: Linux
Development Status :: 3 - Alpha
Programming Language :: Python
Programming Language :: Python :: 3
Programming Language :: Python :: 3 :: Only
Programming Language :: Python :: 3.6
Programming Language :: Python :: 3.7
[options]
packages = find:
include_package_data = True
install_requires =
filelock
python-dateutil
lxml
css-parser
beautifulsoup4
tinycss
pillow
msgpack
html5-parser
odfpy
setuptools
html2text
[options.entry_points]
console_scripts =
ebook-converter=ebook_converter.main:main
[options.package_data]
* = *.types *.css, *.html, *.xsl

View File

@@ -1,4 +0,0 @@
import setuptools
setuptools.setup()