Mirror of https://github.com/gryf/ebook-converter.git, synced 2026-03-25 11:53:33 +01:00
Compare commits: 76e604c951...master (14 commits)
| Author | SHA1 | Date |
|---|---|---|
| | c89fc132b8 | |
| | 8b8a92e9fd | |
| | 6b7f796cfb | |
| | 72d0858ad8 | |
| | 4f548ec882 | |
| | 0faa2c0758 | |
| | d37850520b | |
| | 5e56cb8c7a | |
| | 084e0d11ce | |
| | 4c3c5a9e27 | |
| | c240495c3d | |
| | 53dea56929 | |
| | ef02332465 | |
| | 74abaf0de0 | |
.gitignore (vendored, 1 line changed)
@@ -3,3 +3,4 @@ build/
 dist/
 sdist/
 *.egg-info/
+venv/
@@ -1,2 +0,0 @@
-graft ebook_converter/data
-exclude .gitignore
README.rst (61 lines changed)
@@ -2,24 +2,39 @@
 Ebook converter
 ===============
 
-This is impudent ripoff of the bits from `Calibre project`_, and is aimed only
-for converter thing.
-
-My motivation is to have only converter for ebooks run from commandline,
-without all of those bells and whistles Calibre has, and with cleanest more
-*pythonic* approach.
+This is an impudent ripoff of the bits from `Calibre project`_, and is aimed
+only for converter thing.
+
+My motivation is to have only the converter for ebooks run from the
+commandline, without all of those bells and whistles Calibre has, and with
+cleanest more *pythonic* approach.
 
 Requirements
 ------------
 
 To build and run ebook converter, you'll need:
 
-- Python 3.6 or newer
+- Python 3.10 or newer
 - `Liberation fonts`_
 - setuptools
 - ``pdftohtml``, ``pdfinfo`` and ``pdftoppm`` from `poppler`_ project for
   conversion from PDF available in ``$PATH``
+- ``libxml2-dev`` and ``libxslt-dev`` as dependencies for format manipulation
+  from some of the Calibre code
 
+and several Python packages:
+
+- `beautifulsoup4`_
+- `css-parser`_
+- `filelock`_
+- `html2text`_
+- `html5-parser`_
+- `msgpack`_
+- `odfpy`_
+- `pillow`_
+- `python-dateutil`_
+- `setuptools`_
+- `tinycss`_
 
 No Python2 support. Even if Calibre probably still is able to run on Python2, I
 do not have an intention to support it.
@@ -28,9 +43,9 @@ do not have an intention to support it.
 What's supported
 ----------------
 
-To be able to perform some optimization and make converter more reliable and
-easy to use, first I need to remove some of the features, which are totally not
-crucial in my opinion, although they might be re-added later, like, for
+To be able to perform some optimization and make the converter more reliable
+and easy to use, first I need to remove some of the features, which are totally
+not crucial in my opinion, although they might be re-added later, like, for
 instance there is no automatic language translations depending on the locale
 settings.
 
@@ -44,15 +59,16 @@ Windows is not currently supported, because of the original spaghetti code.
 This may change in the future, after cleanup of mentioned pasta would be
 completed.
 
-So called `Kindle periodical` format is not supported, since all we do care are
-local files. If there would be downloaded periodical thing (using Calibre for
-example), it would be treated as common book.
+So called *Kindle periodical* format (which `Amazon has`_ `killed`_ anyway back
+in September 2023) is not supported, since all we do care are local files. If
+there would be downloaded periodical thing (using Calibre for example), it
+would be treated as common book.
 
 
 Input formats
 ~~~~~~~~~~~~~
 
-Currently, I've tested following input formats:
+Currently, I've tested the following input formats:
 
 - Microsoft Word 2007 and up (``docx``)
 - EPUB, both v2 and v3 (``epub``)
@@ -107,7 +123,7 @@ managers), i.e:
     $ . venv/bin/activate
     (venv) $ git clone https://github.com/gryf/ebook-converter
     (venv) $ cd ebook-converter
-    (venv) $ pip install -r requirements.txt .
+    (venv) $ pip install .
 
 Simple as that. And from now on, you can issue converter:
 
@@ -122,9 +138,20 @@ License
 This work is licensed on GPL3 license, like the original work. See LICENSE file
 for details.
 
 
 .. _Calibre project: https://calibre-ebook.com/
-.. _pypi: https://pypi.python.org
 .. _Liberation fonts: https://github.com/liberationfonts/liberation-fonts
-.. _Kindle periodical: https://sellercentral.amazon.com/gp/help/external/help.html?itemID=202047960&language=en-US
+.. _Amazon has: https://goodereader.com/blog/kindle/amazon-will-discontinue-newspaper-and-magazine-subscriptions-in-september
+.. _killed: https://www.theverge.com/23861370/amazon-kindle-periodicals-unlimited-ended
 .. _poppler: https://poppler.freedesktop.org/
+.. _beautifulsoup4: https://www.crummy.com/software/BeautifulSoup
+.. _css-parser: https://github.com/ebook-utils/css-parser
+.. _filelock: https://github.com/tox-dev/py-filelock
+.. _html2text: https://github.com/Alir3z4/html2text
+.. _html5-parser: https://html5-parser.readthedocs.io
+.. _msgpack: https://msgpack.org
+.. _odfpy: https://github.com/eea/odfpy
+.. _pillow: https://python-pillow.github.io
+.. _python-dateutil: https://github.com/dateutil/dateutil
+.. _setuptools: https://setuptools.pypa.io
+.. _tinycss: http://tinycss.readthedocs.io
@@ -32,7 +32,7 @@ def debug():
 # plugins {{{
 
 
-class Plugins(collections.Mapping):
+class Plugins(collections.abc.Mapping):
 
     def __init__(self):
         self._plugins = {}
@@ -19,7 +19,7 @@ def is_iterable(obj):
     return hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes))
 
 
-class OrderedSet(collections.MutableSet):
+class OrderedSet(collections.abc.MutableSet):
     """
     An OrderedSet is a custom MutableSet that remembers its order, so that
     every entry has an index that can be looked up.
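Review note: the two `collections.abc` changes above are needed because the ABC aliases in the top-level `collections` namespace were removed in Python 3.10, which this change set declares as the minimum version. A minimal sketch of the pattern follows; `TinyOrderedSet` is a hypothetical class for illustration and is not part of ebook-converter.

```python
import collections.abc


class TinyOrderedSet(collections.abc.MutableSet):
    """Hypothetical stand-in for OrderedSet; keeps insertion order."""

    def __init__(self, iterable=()):
        # dict preserves insertion order, so its keys act as an ordered set
        self._items = dict.fromkeys(iterable)

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    def add(self, item):
        self._items[item] = None

    def discard(self, item):
        self._items.pop(item, None)


s = TinyOrderedSet("abca")
s.add("d")
s.discard("b")
print(list(s))
```

Only the five abstract methods plus `__init__` are written by hand; operators such as `|` and `&` come from the `MutableSet` mixin.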
@@ -237,7 +237,7 @@ class HTMLInput(InputFormatPlugin):
         if not os.access(link, os.R_OK):
             return link_
         if os.path.isdir(link):
-            self.log.warning(link_, 'is a link to a directory. Ignoring.')
+            self.log.warning('%s is a link to a directory. Ignoring.', link_)
             return link_
         if link not in self.added_resources:
             bhref = os.path.basename(link)
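Review note: the logging fix above puts the format string first and the value second. The sketch below shows why the order matters, assuming the project's `log` objects follow the standard library `logging` call convention (the `%s` placeholders in the new lines suggest this, but it is an assumption). `ListHandler` and the `link_` value are hypothetical.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted messages in a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        # getMessage() applies the %-style arguments to the format string
        self.messages.append(record.getMessage())


log = logging.getLogger("demo")
log.propagate = False  # keep the demo output away from the root logger
handler = ListHandler()
log.addHandler(handler)

link_ = "images/"  # hypothetical value

# New form from the diff: format string first, value second.
log.warning('%s is a link to a directory. Ignoring.', link_)
print(handler.messages[0])

# Old form: the value is taken as the format string. It has no placeholder,
# so applying the extra argument raises TypeError.
old_style = logging.LogRecord(
    "demo", logging.WARNING, "demo.py", 0, link_,
    ('is a link to a directory. Ignoring.',), None)
try:
    old_style.getMessage()
    old_failed = False
except TypeError as exc:
    old_failed = True
    print("old call style fails:", exc)
```

The same reasoning applies to the `'Failed to find image: %s'` change in `MobiMLizer`.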
@@ -62,7 +62,7 @@ class PMLOutput(OutputFormatPlugin):
                     im = Image.open(io.BytesIO(item.data))
                 else:
                     im = Image.open(io.BytesIO(item.data)).convert('P')
-                    im.thumbnail((300,300), Image.ANTIALIAS)
+                    im.thumbnail((300,300), Image.LANCZOS)
 
                 data = io.BytesIO()
                 im.save(data, 'PNG')
@@ -1012,7 +1012,7 @@ class HTMLConverter(object):
             self.image_memory.append(pt) # Neccessary, trust me ;-)
             try:
                 im.resize((int(width), int(height)),
-                          PILImage.ANTIALIAS).save(pt, encoding)
+                          PILImage.LANCZOS).save(pt, encoding)
                 pt.close()
                 self.scaled_images[path] = pt
                 return pt.name
@@ -1970,7 +1970,7 @@ def process_file(path, options, logger):
                 options.cover = cf.name
 
             tim = im.resize((int(0.75 * th), th),
-                            PILImage.ANTIALIAS).convert('RGB')
+                            PILImage.LANCZOS).convert('RGB')
             tf = PersistentTemporaryFile(prefix=__appname__ + '_',
                                          suffix=".jpg")
             tf.close()
@@ -145,7 +145,7 @@ class Cell(object):
                 continue
             word = token.split()
             word = word[0] if word else ""
-            width = font.getsize(word)[0]
+            width = font.getbbox(word)[2]
             if width > mwidth:
                 mwidth = width
         return parindent + mwidth + 2
@@ -191,7 +191,7 @@ class Cell(object):
             if (ff, fs) != (ts['fontfacename'], ts['fontsize']):
                 font = get_font(ff, self.pts_to_pixels(fs))
             for word in token.split():
-                width, height = font.getsize(word)
+                _, _, width, height = font.getbbox(word)
                 left, right, top, bottom = add_word(width, height, left, right, top, bottom, ls, ws)
         return right+3+max(parindent, 10), bottom
 
@@ -452,7 +452,7 @@ class MobiMLizer(object):
                 try:
                     item = self.oeb.manifest.hrefs[base.urlnormalize(href)]
                 except:
-                    self.oeb.logger.warning('Failed to find image:', href)
+                    self.oeb.logger.warning('Failed to find image: %s', href)
                 else:
                     try:
                         width, height = identify(item.data)[1:]
@@ -444,8 +444,8 @@ class Indexer(object): # {{{
         if self.is_periodical and self.masthead_offset is None:
             raise ValueError('Periodicals must have a masthead')
 
-        self.log('Generating MOBI index for a %s', 'periodical' if
-                 self.is_periodical else 'book')
+        self.log.info('Generating MOBI index for a %s', 'periodical' if
+                      self.is_periodical else 'book')
         self.is_flat_periodical = False
         if self.is_periodical:
             periodical_node = next(iter(oeb.toc))
@@ -14,13 +14,15 @@ from odf.draw import Frame as odFrame, Image as odImage
 from odf.namespaces import TEXTNS as odTEXTNS
 
 from ebook_converter.utils import directory
+from ebook_converter.ebooks.oeb import parse_utils
 from ebook_converter.ebooks.oeb.base import _css_logger
 from ebook_converter import polyglot
 
 
+
 class Extract(ODF2XHTML):
 
-    def extract_pictures(self, zf):
+    def _extract_pictures(self, zf):
         if not os.path.exists('Pictures'):
             os.makedirs('Pictures')
         for name in zf.namelist():
@@ -30,8 +32,8 @@ class Extract(ODF2XHTML):
                 with open(name, 'wb') as f:
                     f.write(data)
 
-    def apply_list_starts(self, root, log):
-        if not self.list_starts:
+    def _apply_list_starts(self, root, log):
+        if not hasattr(self, "list_starts") or not self.list_starts:
             return
         list_starts = frozenset(self.list_starts)
         for ol in root.xpath('//*[local-name() = "ol" and @class]'):
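Review note: the new guard in `_apply_list_starts` presumably protects against `list_starts` never having been set on the instance. The same check can be written with `getattr` and a default, as in this sketch; `Extractor` and its `starts` method are hypothetical names, not code from the project.

```python
class Extractor:
    """Hypothetical stand-in for a class whose attribute may be missing."""

    def starts(self):
        # getattr with a default covers both "never set" and "set but empty"
        list_starts = getattr(self, "list_starts", None)
        if not list_starts:
            return frozenset()
        return frozenset(list_starts)


plain = Extractor()
print(plain.starts())

filled = Extractor()
filled.list_starts = {"ol1": 3}  # hypothetical content
print(filled.starts())
```

Either spelling behaves the same; the `getattr` form simply avoids naming the attribute twice.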
@@ -46,7 +48,7 @@ class Extract(ODF2XHTML):
         self.filter_css(root, log)
         self.extract_css(root, log)
         self.epubify_markup(root, log)
-        self.apply_list_starts(root, log)
+        self._apply_list_starts(root, log)
         html = etree.tostring(root, encoding='utf-8', xml_declaration=True)
         return html
 
@@ -84,22 +86,21 @@ class Extract(ODF2XHTML):
         return rule
 
     def epubify_markup(self, root, log):
-        from ebook_converter.ebooks.oeb.base import XPath, XHTML
         # Fix empty title tags
-        for t in XPath('//h:title')(root):
+        for t in parse_utils.XPath('//h:title')(root):
             if not t.text:
                 t.text = u' '
         # Fix <p><div> constructs as the asinine epubchecker complains
         # about them
-        pdiv = XPath('//h:p/h:div')
+        pdiv = parse_utils.XPath('//h:p/h:div')
         for div in pdiv(root):
-            div.getparent().tag = XHTML('div')
+            div.getparent().tag = parse_utils.XHTML('div')
 
         # Remove the position:relative as it causes problems with some epub
         # renderers. Remove display: block on an image inside a div as it is
         # redundant and prevents text-align:center from working in ADE
         # Also ensure that the img is contained in its containing div
-        imgpath = XPath('//h:div/h:img[@style]')
+        imgpath = parse_utils.XPath('//h:div/h:img[@style]')
         for img in imgpath(root):
             div = img.getparent()
             if len(div) == 1:
@@ -119,7 +120,7 @@ class Extract(ODF2XHTML):
         # works in both WebKit and ADE.
         # https://bugs.launchpad.net/bugs/1063207
         # https://bugs.launchpad.net/calibre/+bug/859343
-        imgpath = XPath('descendant::h:div/h:div/h:img')
+        imgpath = parse_utils.XPath('descendant::h:div/h:div/h:img')
         for img in imgpath(root):
             div2 = img.getparent()
             div1 = div2.getparent()
@@ -297,7 +298,7 @@ class Extract(ODF2XHTML):
             with open('index.xhtml', 'wb') as f:
                 f.write(polyglot.as_bytes(html))
             zf = ZipFile(stream, 'r')
-            self.extract_pictures(zf)
+            self._extract_pictures(zf)
             opf = OPFCreator(os.path.abspath(os.getcwd()), mi)
             opf.create_manifest([(os.path.abspath(os.path.join(r, f2)), None)
                                  for r, _, fnames in os.walk(os.getcwd())
ebook_converter/ebooks/oeb/transforms/unsmarten.py (Normal file, 28 lines changed)
@@ -0,0 +1,28 @@
+__license__ = 'GPL 3'
+__copyright__ = '2011, John Schember <john@nachtimwald.com>'
+__docformat__ = 'restructuredtext en'
+
+from ebook_converter.ebooks.oeb.base import OEB_DOCS, XPath
+from ebook_converter.ebooks.oeb.parse_utils import barename
+from ebook_converter.utils.unsmarten import unsmarten_text
+
+
+class UnsmartenPunctuation:
+
+    def __init__(self):
+        self.html_tags = XPath('descendant::h:*')
+
+    def unsmarten(self, root):
+        for x in self.html_tags(root):
+            if not barename(x.tag) == 'pre':
+                if getattr(x, 'text', None):
+                    x.text = unsmarten_text(x.text)
+                if getattr(x, 'tail', None) and x.tail:
+                    x.tail = unsmarten_text(x.tail)
+
+    def __call__(self, oeb, context):
+        bx = XPath('//h:body')
+        for x in oeb.manifest.items:
+            if x.media_type in OEB_DOCS:
+                for body in bx(x.data):
+                    self.unsmarten(body)
@@ -4,7 +4,6 @@ import os
 import sys
 
 from ebook_converter import logging
-from ebook_converter.customize.conversion import OptionRecommendation
 from ebook_converter.ebooks.conversion.plumber import Plumber
 
 
@@ -68,6 +67,7 @@ def run(args):
 
     return 0
 
+
 def main():
     parser = argparse.ArgumentParser()
     parser.add_argument('from_file', help="Input file to be converted")
@@ -83,5 +83,4 @@ def main():
 
     LOG.set_verbose(args.verbose, args.quiet)
 
-    print(args)
     sys.exit(run(args))
ebook_converter/utils/unsmarten.py (Normal file, 40 lines changed)
@@ -0,0 +1,40 @@
+__license__ = 'GPL 3'
+__copyright__ = '2011, John Schember <john@nachtimwald.com>'
+__docformat__ = 'restructuredtext en'
+
+from ebook_converter.utils.mreplace import MReplace
+
+_mreplace = MReplace({
+    '&#8211;': '--',
+    '&ndash;': '--',
+    '–': '--',
+    '&#8212;': '---',
+    '&mdash;': '---',
+    '—': '---',
+    '&#8230;': '...',
+    '&hellip;': '...',
+    '…': '...',
+    '&#8220;': '"',
+    '&#8221;': '"',
+    '&#8222;': '"',
+    '&#8243;': '"',
+    '&ldquo;': '"',
+    '&rdquo;': '"',
+    '&bdquo;': '"',
+    '&Prime;': '"',
+    '“':'"',
+    '”':'"',
+    '„':'"',
+    '″':'"',
+    '&#8216;':"'",
+    '&#8217;':"'",
+    '&#8242;':"'",
+    '&lsquo;':"'",
+    '&rsquo;':"'",
+    '&prime;':"'",
+    '‘':"'",
+    '’':"'",
+    '′':"'",
+})
+
+unsmarten_text = _mreplace.mreplace
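Review note: `unsmarten_text` maps typographic punctuation back to plain ASCII through one replacement table. The sketch below illustrates that idea with a single compiled regular expression. It is an assumption about the general technique, not the actual `MReplace` implementation, and `TABLE` and `unsmarten_sketch` are hypothetical names covering only a few of the mappings.

```python
import re

# A few of the literal-character mappings from the table in the diff.
TABLE = {
    '\u2013': '--',    # en dash
    '\u2014': '---',   # em dash
    '\u2026': '...',   # horizontal ellipsis
    '\u201c': '"',     # left double quotation mark
    '\u201d': '"',     # right double quotation mark
    '\u2019': "'",     # right single quotation mark
}

# Longest keys first, so that a longer key wins over a shorter one
# that is a prefix of it.
_pattern = re.compile('|'.join(
    re.escape(key) for key in sorted(TABLE, key=len, reverse=True)))


def unsmarten_sketch(text):
    """Replace every key of TABLE found in text in a single pass."""
    return _pattern.sub(lambda match: TABLE[match.group(0)], text)


sample = '\u201cWait\u2026\u201d she said \u2013 it\u2019s 1\u20132 pages'
print(unsmarten_sketch(sample))
```

A single pass avoids the ordering problems of chained `str.replace` calls, where the output of one replacement could be matched again by a later one.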
pyproject.toml (Normal file, 52 lines changed)
@@ -0,0 +1,52 @@
+[build-system]
+requires = ["setuptools >= 77.0"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "ebook-converter"
+version = "4.12.0"
+requires-python = ">= 3.10"
+description = "Convert ebook between different formats"
+dependencies = [
+    "beautifulsoup4>=4.9.3",
+    "css-parser>=1.0.6",
+    "filelock>=3.0.12",
+    "html2text>=2020.1.16",
+    "html5-parser==0.4.12",
+    "msgpack>=1.0.0",
+    "odfpy>=1.4.1",
+    "pillow>=8.0.1",
+    "python-dateutil>=2.8.1",
+    "setuptools>=61.0",
+    "tinycss>=0.4"
+]
+readme = "README.rst"
+authors = [
+    {name = "gryf", email = "gryf73@gmail.com"}
+]
+license = "GPL-3.0-or-later"
+classifiers = [
+    "Environment :: Console",
+    "Intended Audience :: Other Audience",
+    "Operating System :: POSIX :: Linux",
+    "Development Status :: 3 - Alpha",
+    "Programming Language :: Python",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3 :: Only",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13"
+]
+
+[project.urls]
+Repository = "https://github.com/gryf/ebook-converter"
+
+[project.scripts]
+ebook-converter = "ebook_converter.main:main"
+
+[tool.setuptools.packages.find]
+exclude = ["snap"]
+
+[tool.setuptools.package-data]
+"*" = ["*.types", "*.css", "*.html", "*.xhtml", "*.xsl", "*.json"]
@@ -1,11 +0,0 @@
-beautifulsoup4>=4.9.3
-css-parser>=1.0.6
-filelock>=3.0.12
-html2text>=2020.1.16
-html5-parser==0.4.9 --no-binary lxml
-msgpack>=1.0.0
-odfpy>=1.4.1
-pillow>=8.0.1
-python-dateutil>=2.8.1
-setuptools>=50.3.2
-tinycss>=0.4
setup.cfg (46 lines changed)
@@ -1,46 +0,0 @@
-[metadata]
-name = ebook-converter
-version = 4.12.0
-summary = Convert ebook between different formats
-description-file =
-    README.rst
-author = gryf
-author-email = gryf73@gmail.com
-license = GPL3
-license_file = LICENSE
-url = https://github.com/gryf/ebook-converter
-classifier =
-    Environment :: Console
-    Intended Audience :: Other Audience
-    License :: OSI Approved :: GNU General Public License v3 (GPLv3)
-    Operating System :: POSIX :: Linux
-    Development Status :: 3 - Alpha
-    Programming Language :: Python
-    Programming Language :: Python :: 3
-    Programming Language :: Python :: 3 :: Only
-    Programming Language :: Python :: 3.6
-    Programming Language :: Python :: 3.7
-
-[options]
-packages = find:
-include_package_data = True
-install_requires =
-    filelock
-    python-dateutil
-    lxml
-    css-parser
-    beautifulsoup4
-    tinycss
-    pillow
-    msgpack
-    html5-parser
-    odfpy
-    setuptools
-    html2text
-
-[options.entry_points]
-console_scripts =
-    ebook-converter=ebook_converter.main:main
-
-[options.package_data]
-* = *.types *.css, *.html, *.xsl