1
0
mirror of https://github.com/gryf/ebook-converter.git synced 2026-03-06 09:15:55 +01:00

Initial import

This commit is contained in:
2020-03-31 17:15:23 +02:00
commit d97ea9b0bc
311 changed files with 131419 additions and 0 deletions

1
.gitignore vendored Normal file
View File

@@ -0,0 +1 @@
__pycache__

674
LICENSE Normal file
View File

@@ -0,0 +1,674 @@
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The GNU General Public License is a free, copyleft license for
software and other kinds of works.
The licenses for most software and other practical works are designed
to take away your freedom to share and change the works. By contrast,
the GNU General Public License is intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users. We, the Free Software Foundation, use the
GNU General Public License for most of our software; it applies also to
any other work released this way by its authors. You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.
To protect your rights, we need to prevent others from denying you
these rights or asking you to surrender the rights. Therefore, you have
certain responsibilities if you distribute copies of the software, or if
you modify it: responsibilities to respect the freedom of others.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must pass on to the recipients the same
freedoms that you received. You must make sure that they, too, receive
or can get the source code. And you must show them these terms so they
know their rights.
Developers that use the GNU GPL protect your rights with two steps:
(1) assert copyright on the software, and (2) offer you this License
giving you legal permission to copy, distribute and/or modify it.
For the developers' and authors' protection, the GPL clearly explains
that there is no warranty for this free software. For both users' and
authors' sake, the GPL requires that modified versions be marked as
changed, so that their problems will not be attributed erroneously to
authors of previous versions.
Some devices are designed to deny users access to install or run
modified versions of the software inside them, although the manufacturer
can do so. This is fundamentally incompatible with the aim of
protecting users' freedom to change the software. The systematic
pattern of such abuse occurs in the area of products for individuals to
use, which is precisely where it is most unacceptable. Therefore, we
have designed this version of the GPL to prohibit the practice for those
products. If such problems arise substantially in other domains, we
stand ready to extend this provision to those domains in future versions
of the GPL, as needed to protect the freedom of users.
Finally, every program is threatened constantly by software patents.
States should not allow patents to restrict development and use of
software on general-purpose computers, but in those that do, we wish to
avoid the special danger that patents applied to a free program could
make it effectively proprietary. To prevent this, the GPL assures that
patents cannot be used to render the program non-free.
The precise terms and conditions for copying, distribution and
modification follow.
TERMS AND CONDITIONS
0. Definitions.
"This License" refers to version 3 of the GNU General Public License.
"Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.
"The Program" refers to any copyrightable work licensed under this
License. Each licensee is addressed as "you". "Licensees" and
"recipients" may be individuals or organizations.
To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy. The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.
A "covered work" means either the unmodified Program or a work based
on the Program.
To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy. Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.
To "convey" a work means any kind of propagation that enables other
parties to make or receive copies. Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.
An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License. If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.
1. Source Code.
The "source code" for a work means the preferred form of the work
for making modifications to it. "Object code" means any non-source
form of a work.
A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.
The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form. A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.
The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities. However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work. For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.
The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.
The Corresponding Source for a work in source code form is that
same work.
2. Basic Permissions.
All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met. This License explicitly affirms your unlimited
permission to run the unmodified Program. The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work. This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.
You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force. You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright. Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.
Conveying under any other circumstances is permitted solely under
the conditions stated below. Sublicensing is not allowed; section 10
makes it unnecessary.
3. Protecting Users' Legal Rights From Anti-Circumvention Law.
No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.
When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.
4. Conveying Verbatim Copies.
You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.
You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.
5. Conveying Modified Source Versions.
You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:
a) The work must carry prominent notices stating that you modified
it, and giving a relevant date.
b) The work must carry prominent notices stating that it is
released under this License and any conditions added under section
7. This requirement modifies the requirement in section 4 to
"keep intact all notices".
c) You must license the entire work, as a whole, under this
License to anyone who comes into possession of a copy. This
License will therefore apply, along with any applicable section 7
additional terms, to the whole of the work, and all its parts,
regardless of how they are packaged. This License gives no
permission to license the work in any other way, but it does not
invalidate such permission if you have separately received it.
d) If the work has interactive user interfaces, each must display
Appropriate Legal Notices; however, if the Program has interactive
interfaces that do not display Appropriate Legal Notices, your
work need not make them do so.
A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit. Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.
6. Conveying Non-Source Forms.
You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:
a) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by the
Corresponding Source fixed on a durable physical medium
customarily used for software interchange.
b) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by a
written offer, valid for at least three years and valid for as
long as you offer spare parts or customer support for that product
model, to give anyone who possesses the object code either (1) a
copy of the Corresponding Source for all the software in the
product that is covered by this License, on a durable physical
medium customarily used for software interchange, for a price no
more than your reasonable cost of physically performing this
conveying of source, or (2) access to copy the
Corresponding Source from a network server at no charge.
c) Convey individual copies of the object code with a copy of the
written offer to provide the Corresponding Source. This
alternative is allowed only occasionally and noncommercially, and
only if you received the object code with such an offer, in accord
with subsection 6b.
d) Convey the object code by offering access from a designated
place (gratis or for a charge), and offer equivalent access to the
Corresponding Source in the same way through the same place at no
further charge. You need not require recipients to copy the
Corresponding Source along with the object code. If the place to
copy the object code is a network server, the Corresponding Source
may be on a different server (operated by you or a third party)
that supports equivalent copying facilities, provided you maintain
clear directions next to the object code saying where to find the
Corresponding Source. Regardless of what server hosts the
Corresponding Source, you remain obligated to ensure that it is
available for as long as needed to satisfy these requirements.
e) Convey the object code using peer-to-peer transmission, provided
you inform other peers where the object code and Corresponding
Source of the work are being offered to the general public at no
charge under subsection 6d.
A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.
A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling. In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage. For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product. A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.
"Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source. The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.
If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information. But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).
The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed. Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.
Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.
7. Additional Terms.
"Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law. If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.
When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it. (Additional permissions may be written to require their own
removal in certain cases when you modify the work.) You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.
Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:
a) Disclaiming warranty or limiting liability differently from the
terms of sections 15 and 16 of this License; or
b) Requiring preservation of specified reasonable legal notices or
author attributions in that material or in the Appropriate Legal
Notices displayed by works containing it; or
c) Prohibiting misrepresentation of the origin of that material, or
requiring that modified versions of such material be marked in
reasonable ways as different from the original version; or
d) Limiting the use for publicity purposes of names of licensors or
authors of the material; or
e) Declining to grant rights under trademark law for use of some
trade names, trademarks, or service marks; or
f) Requiring indemnification of licensors and authors of that
material by anyone who conveys the material (or modified versions of
it) with contractual assumptions of liability to the recipient, for
any liability that these contractual assumptions directly impose on
those licensors and authors.
All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10. If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term. If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.
If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.
Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.
8. Termination.
You may not propagate or modify a covered work except as expressly
provided under this License. Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).
However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.
Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License. If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.
9. Acceptance Not Required for Having Copies.
You are not required to accept this License in order to receive or
run a copy of the Program. Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance. However,
nothing other than this License grants you permission to propagate or
modify any covered work. These actions infringe copyright if you do
not accept this License. Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.
10. Automatic Licensing of Downstream Recipients.
Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License. You are not responsible
for enforcing compliance by third parties with this License.
An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations. If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.
You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License. For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.
11. Patents.
A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based. The
work thus licensed is called the contributor's "contributor version".
A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version. For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.
Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.
In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement). To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.
If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients. "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.
If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.
A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License. You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.
Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.
12. No Surrender of Others' Freedom.
If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all. For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.
13. Use with the GNU Affero General Public License.
Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU Affero General Public License into a single
combined work, and to convey the resulting work. The terms of this
License will continue to apply to the part which is the covered work,
but the special requirements of the GNU Affero General Public License,
section 13, concerning interaction through a network will apply to the
combination as such.
14. Revised Versions of this License.
The Free Software Foundation may publish revised and/or new versions of
the GNU General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the
Program specifies that a certain numbered version of the GNU General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation. If the Program does not specify a version number of the
GNU General Public License, you may choose any version ever published
by the Free Software Foundation.
If the Program specifies that a proxy can decide which future
versions of the GNU General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.
Later license versions may give you additional or different
permissions. However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.
15. Disclaimer of Warranty.
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
16. Limitation of Liability.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.
17. Interpretation of Sections 15 and 16.
If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
Also add information on how to contact you by electronic and paper mail.
If the program does terminal interaction, make it output a short
notice like this when it starts in an interactive mode:
<program> Copyright (C) <year> <name of author>
This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, your program's commands
might be different; for a GUI interface, you would use an "about box".
You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU GPL, see
<https://www.gnu.org/licenses/>.
The GNU General Public License does not permit incorporating your program
into proprietary programs. If your program is a subroutine library, you
may consider it more useful to permit linking proprietary applications with
the library. If this is what you want to do, use the GNU Lesser General
Public License instead of this License. But first, please read
<https://www.gnu.org/philosophy/why-not-lgpl.html>.

26
README.rst Normal file
View File

@@ -0,0 +1,26 @@
===============
Ebook converter
===============
This is impudent ripoff of the bits from `Calibre project`_, and is aimed only
for converter thing.
My motivation is to have only converter for ebooks run from commandline,
without all of those bells and whistles Calibre have, and with cleanest more
*pythonic* approach.
Installation
------------
TBD.
License
-------
This work is licensed on GPL3 license, like the original work. See LICENSE file
for details.
.. _Calibre project: https://calibre-ebook.com/

681
ebook_converter/__init__.py Normal file
View File

@@ -0,0 +1,681 @@
from __future__ import unicode_literals, print_function
''' E-book management software'''
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
import sys, os, re, time, random, warnings
from polyglot.builtins import codepoint_to_chr, unicode_type, range, hasenv, native_string_type
from math import floor
from functools import partial
if not hasenv('CALIBRE_SHOW_DEPRECATION_WARNINGS'):
warnings.simplefilter('ignore', DeprecationWarning)
try:
os.getcwd()
except EnvironmentError:
os.chdir(os.path.expanduser('~'))
from calibre.constants import (iswindows, isosx, islinux, isfrozen,
isbsd, preferred_encoding, __appname__, __version__, __author__,
win32event, win32api, winerror, fcntl, ispy3,
filesystem_encoding, plugins, config_dir)
from calibre.startup import winutil, winutilerror
from calibre.utils.icu import safe_chr
if False:
# Prevent pyflakes from complaining
winutil, winutilerror, __appname__, islinux, __version__
fcntl, win32event, isfrozen, __author__
winerror, win32api, isbsd, config_dir
_mt_inited = False
def _init_mimetypes():
global _mt_inited
import mimetypes
mimetypes.init([P('mime.types')])
_mt_inited = True
def guess_type(*args, **kwargs):
import mimetypes
if not _mt_inited:
_init_mimetypes()
return mimetypes.guess_type(*args, **kwargs)
def guess_all_extensions(*args, **kwargs):
import mimetypes
if not _mt_inited:
_init_mimetypes()
return mimetypes.guess_all_extensions(*args, **kwargs)
def guess_extension(*args, **kwargs):
import mimetypes
if not _mt_inited:
_init_mimetypes()
ext = mimetypes.guess_extension(*args, **kwargs)
if not ext and args and args[0] == 'application/x-palmreader':
ext = '.pdb'
return ext
def get_types_map():
import mimetypes
if not _mt_inited:
_init_mimetypes()
return mimetypes.types_map
def to_unicode(raw, encoding='utf-8', errors='strict'):
if isinstance(raw, unicode_type):
return raw
return raw.decode(encoding, errors)
def patheq(p1, p2):
p = os.path
d = lambda x : p.normcase(p.normpath(p.realpath(p.normpath(x))))
if not p1 or not p2:
return False
return d(p1) == d(p2)
def unicode_path(path, abs=False):
if isinstance(path, bytes):
path = path.decode(filesystem_encoding)
if abs:
path = os.path.abspath(path)
return path
def osx_version():
if isosx:
import platform
src = platform.mac_ver()[0]
m = re.match(r'(\d+)\.(\d+)\.(\d+)', src)
if m:
return int(m.group(1)), int(m.group(2)), int(m.group(3))
def confirm_config_name(name):
return name + '_again'
_filename_sanitize_unicode = frozenset(('\\', '|', '?', '*', '<', # no2to3
'"', ':', '>', '+', '/') + tuple(map(codepoint_to_chr, range(32)))) # no2to3
def sanitize_file_name(name, substitute='_'):
'''
Sanitize the filename `name`. All invalid characters are replaced by `substitute`.
The set of invalid characters is the union of the invalid characters in Windows,
macOS and Linux. Also removes leading and trailing whitespace.
**WARNING:** This function also replaces path separators, so only pass file names
and not full paths to it.
'''
if isbytestring(name):
name = name.decode(filesystem_encoding, 'replace')
if isbytestring(substitute):
substitute = substitute.decode(filesystem_encoding, 'replace')
chars = (substitute if c in _filename_sanitize_unicode else c for c in name)
one = ''.join(chars)
one = re.sub(r'\s', ' ', one).strip()
bname, ext = os.path.splitext(one)
one = re.sub(r'^\.+$', '_', bname)
one = one.replace('..', substitute)
one += ext
# Windows doesn't like path components that end with a period or space
if one and one[-1] in ('.', ' '):
one = one[:-1]+'_'
# Names starting with a period are hidden on Unix
if one.startswith('.'):
one = '_' + one[1:]
return one
sanitize_file_name2 = sanitize_file_name_unicode = sanitize_file_name
def prints(*args, **kwargs):
'''
Print unicode arguments safely by encoding them to preferred_encoding
Has the same signature as the print function from Python 3, except for the
additional keyword argument safe_encode, which if set to True will cause the
function to use repr when encoding fails.
Returns the number of bytes written.
'''
file = kwargs.get('file', sys.stdout)
file = getattr(file, 'buffer', file)
enc = 'utf-8' if hasenv('CALIBRE_WORKER') else preferred_encoding
sep = kwargs.get('sep', ' ')
if not isinstance(sep, bytes):
sep = sep.encode(enc)
end = kwargs.get('end', '\n')
if not isinstance(end, bytes):
end = end.encode(enc)
safe_encode = kwargs.get('safe_encode', False)
count = 0
for i, arg in enumerate(args):
if isinstance(arg, unicode_type):
if iswindows:
from calibre.utils.terminal import Detect
cs = Detect(file)
if cs.is_console:
cs.write_unicode_text(arg)
count += len(arg)
if i != len(args)-1:
file.write(sep)
count += len(sep)
continue
try:
arg = arg.encode(enc)
except UnicodeEncodeError:
try:
arg = arg.encode('utf-8')
except:
if not safe_encode:
raise
arg = repr(arg)
if not isinstance(arg, bytes):
try:
arg = native_string_type(arg)
except ValueError:
arg = unicode_type(arg)
if isinstance(arg, unicode_type):
try:
arg = arg.encode(enc)
except UnicodeEncodeError:
try:
arg = arg.encode('utf-8')
except:
if not safe_encode:
raise
arg = repr(arg)
try:
file.write(arg)
count += len(arg)
except:
from polyglot import reprlib
arg = reprlib.repr(arg)
file.write(arg)
count += len(arg)
if i != len(args)-1:
file.write(sep)
count += len(sep)
file.write(end)
count += len(end)
return count
class CommandLineError(Exception):
pass
def setup_cli_handlers(logger, level):
import logging
if hasenv('CALIBRE_WORKER') and logger.handlers:
return
logger.setLevel(level)
if level == logging.WARNING:
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))
handler.setLevel(logging.WARNING)
elif level == logging.INFO:
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter())
handler.setLevel(logging.INFO)
elif level == logging.DEBUG:
handler = logging.StreamHandler(sys.stderr)
handler.setLevel(logging.DEBUG)
handler.setFormatter(logging.Formatter('[%(levelname)s] %(filename)s:%(lineno)s: %(message)s'))
logger.addHandler(handler)
def load_library(name, cdll):
if iswindows:
return cdll.LoadLibrary(name)
if isosx:
name += '.dylib'
if hasattr(sys, 'frameworks_dir'):
return cdll.LoadLibrary(os.path.join(getattr(sys, 'frameworks_dir'), name))
return cdll.LoadLibrary(name)
return cdll.LoadLibrary(name+'.so')
def extract(path, dir):
extractor = None
# First use the file header to identify its type
with open(path, 'rb') as f:
id_ = f.read(3)
if id_ == b'Rar':
from calibre.utils.unrar import extract as rarextract
extractor = rarextract
elif id_.startswith(b'PK'):
from calibre.libunzip import extract as zipextract
extractor = zipextract
if extractor is None:
# Fallback to file extension
ext = os.path.splitext(path)[1][1:].lower()
if ext in ['zip', 'cbz', 'epub', 'oebzip']:
from calibre.libunzip import extract as zipextract
extractor = zipextract
elif ext in ['cbr', 'rar']:
from calibre.utils.unrar import extract as rarextract
extractor = rarextract
if extractor is None:
raise Exception('Unknown archive type')
extractor(path, dir)
def get_proxies(debug=True):
from polyglot.urllib import getproxies
proxies = getproxies()
for key, proxy in list(proxies.items()):
if not proxy or '..' in proxy or key == 'auto':
del proxies[key]
continue
if proxy.startswith(key+'://'):
proxy = proxy[len(key)+3:]
if key == 'https' and proxy.startswith('http://'):
proxy = proxy[7:]
if proxy.endswith('/'):
proxy = proxy[:-1]
if len(proxy) > 4:
proxies[key] = proxy
else:
prints('Removing invalid', key, 'proxy:', proxy)
del proxies[key]
if proxies and debug:
prints('Using proxies:', proxies)
return proxies
def get_parsed_proxy(typ='http', debug=True):
proxies = get_proxies(debug)
proxy = proxies.get(typ, None)
if proxy:
pattern = re.compile((
'(?:ptype://)?'
'(?:(?P<user>\\w+):(?P<pass>.*)@)?'
'(?P<host>[\\w\\-\\.]+)'
'(?::(?P<port>\\d+))?').replace('ptype', typ)
)
match = pattern.match(proxies[typ])
if match:
try:
ans = {
'host' : match.group('host'),
'port' : match.group('port'),
'user' : match.group('user'),
'pass' : match.group('pass')
}
if ans['port']:
ans['port'] = int(ans['port'])
except:
if debug:
import traceback
traceback.print_exc()
else:
if debug:
prints('Using http proxy', unicode_type(ans))
return ans
def get_proxy_info(proxy_scheme, proxy_string):
'''
Parse all proxy information from a proxy string (as returned by
get_proxies). The returned dict will have members set to None when the info
is not available in the string. If an exception occurs parsing the string
this method returns None.
'''
from polyglot.urllib import urlparse
try:
proxy_url = '%s://%s'%(proxy_scheme, proxy_string)
urlinfo = urlparse(proxy_url)
ans = {
'scheme': urlinfo.scheme,
'hostname': urlinfo.hostname,
'port': urlinfo.port,
'username': urlinfo.username,
'password': urlinfo.password,
}
except Exception:
return None
return ans
# IE 11 on windows 7
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko'
USER_AGENT_MOBILE = 'Mozilla/5.0 (Windows; U; Windows CE 5.1; rv:1.8.1a3) Gecko/20060610 Minimo/0.016'
def is_mobile_ua(ua):
return 'Mobile/' in ua or 'Mobile ' in ua
def random_user_agent(choose=None, allow_ie=True):
from calibre.utils.random_ua import common_user_agents
ua_list = common_user_agents()
ua_list = [x for x in ua_list if not is_mobile_ua(x)]
if not allow_ie:
ua_list = [x for x in ua_list if 'Trident/' not in x and 'Edge/' not in x]
return random.choice(ua_list) if choose is None else ua_list[choose]
def browser(honor_time=True, max_time=2, mobile_browser=False, user_agent=None, verify_ssl_certificates=True, handle_refresh=True):
'''
Create a mechanize browser for web scraping. The browser handles cookies,
refresh requests and ignores robots.txt. Also uses proxy if available.
:param honor_time: If True honors pause time in refresh requests
:param max_time: Maximum time in seconds to wait during a refresh request
:param verify_ssl_certificates: If false SSL certificates errors are ignored
'''
from calibre.utils.browser import Browser
opener = Browser(verify_ssl=verify_ssl_certificates)
opener.set_handle_refresh(handle_refresh, max_time=max_time, honor_time=honor_time)
opener.set_handle_robots(False)
if user_agent is None:
user_agent = USER_AGENT_MOBILE if mobile_browser else USER_AGENT
opener.addheaders = [('User-agent', user_agent)]
proxies = get_proxies()
to_add = {}
http_proxy = proxies.get('http', None)
if http_proxy:
to_add['http'] = http_proxy
https_proxy = proxies.get('https', None)
if https_proxy:
to_add['https'] = https_proxy
if to_add:
opener.set_proxies(to_add)
return opener
def fit_image(width, height, pwidth, pheight):
'''
Fit image in box of width pwidth and height pheight.
@param width: Width of image
@param height: Height of image
@param pwidth: Width of box
@param pheight: Height of box
@return: scaled, new_width, new_height. scaled is True iff new_width and/or new_height is different from width or height.
'''
scaled = height > pheight or width > pwidth
if height > pheight:
corrf = pheight / float(height)
width, height = floor(corrf*width), pheight
if width > pwidth:
corrf = pwidth / float(width)
width, height = pwidth, floor(corrf*height)
if height > pheight:
corrf = pheight / float(height)
width, height = floor(corrf*width), pheight
return scaled, int(width), int(height)
class CurrentDir(object):
def __init__(self, path):
self.path = path
self.cwd = None
def __enter__(self, *args):
self.cwd = os.getcwd()
os.chdir(self.path)
return self.cwd
def __exit__(self, *args):
try:
os.chdir(self.cwd)
except EnvironmentError:
# The previous CWD no longer exists
pass
_ncpus = None
if ispy3:
def detect_ncpus():
global _ncpus
if _ncpus is None:
_ncpus = max(1, os.cpu_count() or 1)
return _ncpus
else:
def detect_ncpus():
"""Detects the number of effective CPUs in the system"""
global _ncpus
if _ncpus is None:
if iswindows:
import win32api
ans = win32api.GetSystemInfo()[5]
else:
import multiprocessing
ans = -1
try:
ans = multiprocessing.cpu_count()
except Exception:
from PyQt5.Qt import QThread
ans = QThread.idealThreadCount()
_ncpus = max(1, ans)
return _ncpus
relpath = os.path.relpath
def walk(dir):
''' A nice interface to os.walk '''
for record in os.walk(dir):
for f in record[-1]:
yield os.path.join(record[0], f)
def strftime(fmt, t=None):
''' A version of strftime that returns unicode strings and tries to handle dates
before 1900 '''
if not fmt:
return ''
if t is None:
t = time.localtime()
if hasattr(t, 'timetuple'):
t = t.timetuple()
early_year = t[0] < 1900
if early_year:
replacement = 1900 if t[0]%4 == 0 else 1901
fmt = fmt.replace('%Y', '_early year hack##')
t = list(t)
orig_year = t[0]
t[0] = replacement
t = time.struct_time(t)
ans = None
if iswindows:
if isinstance(fmt, bytes):
fmt = fmt.decode('mbcs', 'replace')
fmt = fmt.replace('%e', '%#d')
ans = plugins['winutil'][0].strftime(fmt, t)
else:
ans = time.strftime(fmt, t)
if isinstance(ans, bytes):
ans = ans.decode(preferred_encoding, 'replace')
if early_year:
ans = ans.replace('_early year hack##', unicode_type(orig_year))
return ans
def my_unichr(num):
try:
return safe_chr(num)
except (ValueError, OverflowError):
return '?'
def entity_to_unicode(match, exceptions=[], encoding='cp1252',
result_exceptions={}):
'''
:param match: A match object such that '&'+match.group(1)';' is the entity.
:param exceptions: A list of entities to not convert (Each entry is the name of the entity, for e.g. 'apos' or '#1234'
:param encoding: The encoding to use to decode numeric entities between 128 and 256.
If None, the Unicode UCS encoding is used. A common encoding is cp1252.
:param result_exceptions: A mapping of characters to entities. If the result
is in result_exceptions, result_exception[result] is returned instead.
Convenient way to specify exception for things like < or > that can be
specified by various actual entities.
'''
def check(ch):
return result_exceptions.get(ch, ch)
ent = match.group(1)
if ent in exceptions:
return '&'+ent+';'
if ent in {'apos', 'squot'}: # squot is generated by some broken CMS software
return check("'")
if ent == 'hellips':
ent = 'hellip'
if ent.startswith('#'):
try:
if ent[1] in ('x', 'X'):
num = int(ent[2:], 16)
else:
num = int(ent[1:])
except:
return '&'+ent+';'
if encoding is None or num > 255:
return check(my_unichr(num))
try:
return check(bytes(bytearray((num,))).decode(encoding))
except UnicodeDecodeError:
return check(my_unichr(num))
from calibre.ebooks.html_entities import html5_entities
try:
return check(html5_entities[ent])
except KeyError:
pass
from polyglot.html_entities import name2codepoint
try:
return check(my_unichr(name2codepoint[ent]))
except KeyError:
return '&'+ent+';'
_ent_pat = re.compile(r'&(\S+?);')
xml_entity_to_unicode = partial(entity_to_unicode, result_exceptions={
'"' : '&quot;',
"'" : '&apos;',
'<' : '&lt;',
'>' : '&gt;',
'&' : '&amp;'})
def replace_entities(raw, encoding='cp1252'):
return _ent_pat.sub(partial(entity_to_unicode, encoding=encoding), raw)
def xml_replace_entities(raw, encoding='cp1252'):
return _ent_pat.sub(partial(xml_entity_to_unicode, encoding=encoding), raw)
def prepare_string_for_xml(raw, attribute=False):
raw = _ent_pat.sub(entity_to_unicode, raw)
raw = raw.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')
if attribute:
raw = raw.replace('"', '&quot;').replace("'", '&apos;')
return raw
def isbytestring(obj):
return isinstance(obj, bytes)
def force_unicode(obj, enc=preferred_encoding):
if isbytestring(obj):
try:
obj = obj.decode(enc)
except Exception:
try:
obj = obj.decode(filesystem_encoding if enc ==
preferred_encoding else preferred_encoding)
except Exception:
try:
obj = obj.decode('utf-8')
except Exception:
obj = repr(obj)
if isbytestring(obj):
obj = obj.decode('utf-8')
return obj
def as_unicode(obj, enc=preferred_encoding):
if not isbytestring(obj):
try:
obj = unicode_type(obj)
except Exception:
try:
obj = native_string_type(obj)
except Exception:
obj = repr(obj)
return force_unicode(obj, enc=enc)
def url_slash_cleaner(url):
'''
Removes redundant /'s from url's.
'''
return re.sub(r'(?<!:)/{2,}', '/', url)
def human_readable(size, sep=' '):
""" Convert a size in bytes into a human readable form """
divisor, suffix = 1, "B"
for i, candidate in enumerate(('B', 'KB', 'MB', 'GB', 'TB', 'PB', 'EB')):
if size < (1 << ((i + 1) * 10)):
divisor, suffix = (1 << (i * 10)), candidate
break
size = unicode_type(float(size)/divisor)
if size.find(".") > -1:
size = size[:size.find(".")+2]
if size.endswith('.0'):
size = size[:-2]
return size + sep + suffix
def ipython(user_ns=None):
from calibre.utils.ipython import ipython
ipython(user_ns=user_ns)
def fsync(fileobj):
fileobj.flush()
os.fsync(fileobj.fileno())
if islinux and getattr(fileobj, 'name', None):
# On Linux kernels after 5.1.9 and 4.19.50 using fsync without any
# following activity causes Kindles to eject. Instead of fixing this in
# the obvious way, which is to have the kernel send some harmless
# filesystem activity after the FSYNC, the kernel developers seem to
# think the correct solution is to disable FSYNC using a mount flag
# which users will have to turn on manually. So instead we create some
# harmless filesystem activity, and who cares about performance.
# See https://bugs.launchpad.net/calibre/+bug/1834641
# and https://bugzilla.kernel.org/show_bug.cgi?id=203973
# To check for the existence of the bug, simply run:
# python -c "p = '/run/media/kovid/Kindle/driveinfo.calibre'; f = open(p, 'r+b'); os.fsync(f.fileno());"
# this will cause the Kindle to disconnect.
try:
os.utime(fileobj.name, None)
except Exception:
import traceback
traceback.print_exc()

View File

@@ -0,0 +1,343 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
# License: GPLv3 Copyright: 2015, Kovid Goyal <kovid at kovidgoyal.net>
from __future__ import print_function, unicode_literals
from polyglot.builtins import map, unicode_type, environ_item, hasenv, getenv, as_unicode, native_string_type
import sys, locale, codecs, os, importlib, collections
__appname__ = 'calibre'
numeric_version = (4, 12, 0)
__version__ = '.'.join(map(unicode_type, numeric_version))
git_version = None
__author__ = "Kovid Goyal <kovid@kovidgoyal.net>"
'''
Various run time constants.
'''
_plat = sys.platform.lower()
iswindows = 'win32' in _plat or 'win64' in _plat
isosx = 'darwin' in _plat
isnewosx = isosx and getattr(sys, 'new_app_bundle', False)
isfreebsd = 'freebsd' in _plat
isnetbsd = 'netbsd' in _plat
isdragonflybsd = 'dragonfly' in _plat
isbsd = isfreebsd or isnetbsd or isdragonflybsd
ishaiku = 'haiku1' in _plat
islinux = not(iswindows or isosx or isbsd or ishaiku)
isfrozen = hasattr(sys, 'frozen')
isunix = isosx or islinux or ishaiku
isportable = hasenv('CALIBRE_PORTABLE_BUILD')
ispy3 = sys.version_info.major > 2
isxp = isoldvista = False
if iswindows:
wver = sys.getwindowsversion()
isxp = wver.major < 6
isoldvista = wver.build < 6002
is64bit = sys.maxsize > (1 << 32)
isworker = hasenv('CALIBRE_WORKER') or hasenv('CALIBRE_SIMPLE_WORKER')
if isworker:
os.environ.pop(environ_item('CALIBRE_FORCE_ANSI'), None)
FAKE_PROTOCOL, FAKE_HOST = 'clbr', 'internal.invalid'
VIEWER_APP_UID = 'com.calibre-ebook.viewer'
EDITOR_APP_UID = 'com.calibre-ebook.edit-book'
MAIN_APP_UID = 'com.calibre-ebook.main-gui'
STORE_DIALOG_APP_UID = 'com.calibre-ebook.store-dialog'
TOC_DIALOG_APP_UID = 'com.calibre-ebook.toc-editor'
try:
preferred_encoding = locale.getpreferredencoding()
codecs.lookup(preferred_encoding)
except:
preferred_encoding = 'utf-8'
win32event = importlib.import_module('win32event') if iswindows else None
winerror = importlib.import_module('winerror') if iswindows else None
win32api = importlib.import_module('win32api') if iswindows else None
fcntl = None if iswindows else importlib.import_module('fcntl')
dark_link_color = '#6cb4ee'
_osx_ver = None
def get_osx_version():
global _osx_ver
if _osx_ver is None:
import platform
from collections import namedtuple
OSX = namedtuple('OSX', 'major minor tertiary')
try:
ver = platform.mac_ver()[0].split('.')
if len(ver) == 2:
ver.append(0)
_osx_ver = OSX(*map(int, ver)) # no2to3
except Exception:
_osx_ver = OSX(0, 0, 0)
return _osx_ver
filesystem_encoding = sys.getfilesystemencoding()
if filesystem_encoding is None:
filesystem_encoding = 'utf-8'
else:
try:
if codecs.lookup(filesystem_encoding).name == 'ascii':
filesystem_encoding = 'utf-8'
# On linux, unicode arguments to os file functions are coerced to an ascii
# bytestring if sys.getfilesystemencoding() == 'ascii', which is
# just plain dumb. This is fixed by the icu.py module which, when
# imported changes ascii to utf-8
except Exception:
filesystem_encoding = 'utf-8'
DEBUG = hasenv('CALIBRE_DEBUG')
def debug():
global DEBUG
DEBUG = True
def _get_cache_dir():
import errno
confcache = os.path.join(config_dir, 'caches')
try:
os.makedirs(confcache)
except EnvironmentError as err:
if err.errno != errno.EEXIST:
raise
if isportable:
return confcache
ccd = getenv('CALIBRE_CACHE_DIRECTORY')
if ccd is not None:
ans = os.path.abspath(ccd)
try:
os.makedirs(ans)
return ans
except EnvironmentError as err:
if err.errno == errno.EEXIST:
return ans
if iswindows:
w = plugins['winutil'][0]
try:
candidate = os.path.join(w.special_folder_path(w.CSIDL_LOCAL_APPDATA), '%s-cache'%__appname__)
except ValueError:
return confcache
elif isosx:
candidate = os.path.join(os.path.expanduser('~/Library/Caches'), __appname__)
else:
candidate = getenv('XDG_CACHE_HOME', '~/.cache')
candidate = os.path.join(os.path.expanduser(candidate),
__appname__)
if isinstance(candidate, bytes):
try:
candidate = candidate.decode(filesystem_encoding)
except ValueError:
candidate = confcache
try:
os.makedirs(candidate)
except EnvironmentError as err:
if err.errno != errno.EEXIST:
candidate = confcache
return candidate
def cache_dir():
ans = getattr(cache_dir, 'ans', None)
if ans is None:
ans = cache_dir.ans = os.path.realpath(_get_cache_dir())
return ans
plugins_loc = sys.extensions_location
if ispy3:
plugins_loc = os.path.join(plugins_loc, '3')
# plugins {{{
class Plugins(collections.Mapping):
def __init__(self):
self._plugins = {}
plugins = [
'pictureflow',
'lzx',
'msdes',
'podofo',
'cPalmdoc',
'progress_indicator',
'chmlib',
'icu',
'speedup',
'html_as_json',
'unicode_names',
'html_syntax_highlighter',
'hyphen',
'freetype',
'imageops',
'hunspell',
'_patiencediff_c',
'bzzdec',
'matcher',
'tokenizer',
'certgen',
'lzma_binding',
]
if not ispy3:
plugins.extend([
'monotonic',
'zlib2',
])
if iswindows:
plugins.extend(['winutil', 'wpd', 'winfonts'])
if isosx:
plugins.append('usbobserver')
plugins.append('cocoa')
if isfreebsd or ishaiku or islinux or isosx:
plugins.append('libusb')
plugins.append('libmtp')
self.plugins = frozenset(plugins)
def load_plugin(self, name):
if name in self._plugins:
return
sys.path.insert(0, plugins_loc)
try:
del sys.modules[name]
except KeyError:
pass
plugin_err = ''
try:
p = importlib.import_module(name)
except Exception as err:
p = None
try:
plugin_err = unicode_type(err)
except Exception:
plugin_err = as_unicode(native_string_type(err), encoding=preferred_encoding, errors='replace')
self._plugins[name] = p, plugin_err
sys.path.remove(plugins_loc)
def __iter__(self):
return iter(self.plugins)
def __len__(self):
return len(self.plugins)
def __contains__(self, name):
return name in self.plugins
def __getitem__(self, name):
if name not in self.plugins:
raise KeyError('No plugin named %r'%name)
self.load_plugin(name)
return self._plugins[name]
plugins = None
if plugins is None:
plugins = Plugins()
# }}}
# config_dir {{{
CONFIG_DIR_MODE = 0o700
cconfd = getenv('CALIBRE_CONFIG_DIRECTORY')
if cconfd is not None:
config_dir = os.path.abspath(cconfd)
elif iswindows:
if plugins['winutil'][0] is None:
raise Exception(plugins['winutil'][1])
try:
config_dir = plugins['winutil'][0].special_folder_path(plugins['winutil'][0].CSIDL_APPDATA)
except ValueError:
config_dir = None
if not config_dir or not os.access(config_dir, os.W_OK|os.X_OK):
config_dir = os.path.expanduser('~')
config_dir = os.path.join(config_dir, 'calibre')
elif isosx:
config_dir = os.path.expanduser('~/Library/Preferences/calibre')
else:
bdir = os.path.abspath(os.path.expanduser(getenv('XDG_CONFIG_HOME', '~/.config')))
config_dir = os.path.join(bdir, 'calibre')
try:
os.makedirs(config_dir, mode=CONFIG_DIR_MODE)
except:
pass
if not os.path.exists(config_dir) or \
not os.access(config_dir, os.W_OK) or not \
os.access(config_dir, os.X_OK):
print('No write acces to', config_dir, 'using a temporary dir instead')
import tempfile, atexit
config_dir = tempfile.mkdtemp(prefix='calibre-config-')
def cleanup_cdir():
try:
import shutil
shutil.rmtree(config_dir)
except:
pass
atexit.register(cleanup_cdir)
# }}}
dv = getenv('CALIBRE_DEVELOP_FROM')
is_running_from_develop = bool(getattr(sys, 'frozen', False) and dv and os.path.abspath(dv) in sys.path)
del dv
def get_version():
'''Return version string for display to user '''
if git_version is not None:
v = git_version
else:
v = __version__
if numeric_version[-1] == 0:
v = v[:-2]
if is_running_from_develop:
v += '*'
if iswindows and is64bit:
v += ' [64bit]'
return v
def get_portable_base():
'Return path to the directory that contains calibre-portable.exe or None'
if isportable:
return os.path.dirname(os.path.dirname(getenv('CALIBRE_PORTABLE_BUILD')))
def get_windows_username():
'''
Return the user name of the currently logged in user as a unicode string.
Note that usernames on windows are case insensitive, the case of the value
returned depends on what the user typed into the login box at login time.
'''
username = plugins['winutil'][0].username
return username()
def get_windows_temp_path():
temp_path = plugins['winutil'][0].temp_path
return temp_path()
def get_windows_user_locale_name():
locale_name = plugins['winutil'][0].locale_name
return locale_name()
def get_windows_number_formats():
ans = getattr(get_windows_number_formats, 'ans', None)
if ans is None:
localeconv = plugins['winutil'][0].localeconv
d = localeconv()
thousands_sep, decimal_point = d['thousands_sep'], d['decimal_point']
ans = get_windows_number_formats.ans = thousands_sep, decimal_point
return ans

View File

@@ -0,0 +1,12 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2015, Kovid Goyal <kovid at kovidgoyal.net>'
from css_selectors.parser import parse
from css_selectors.select import Select, INAPPROPRIATE_PSEUDO_CLASSES
from css_selectors.errors import SelectorError, SelectorSyntaxError, ExpressionError
__all__ = ['parse', 'Select', 'INAPPROPRIATE_PSEUDO_CLASSES', 'SelectorError', 'SelectorSyntaxError', 'ExpressionError']

View File

@@ -0,0 +1,18 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2015, Kovid Goyal <kovid at kovidgoyal.net>'
class SelectorError(ValueError):
"""Common parent for SelectorSyntaxError and ExpressionError"""
class SelectorSyntaxError(SelectorError):
"""Parsing a selector that does not match the grammar."""
class ExpressionError(SelectorError):
"""Unknown or unsupported selector (eg. pseudo-class)."""

View File

@@ -0,0 +1,133 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2015, Kovid Goyal <kovid at kovidgoyal.net>'
import collections
from polyglot.builtins import string_or_bytes
SLICE_ALL = slice(None)
def is_iterable(obj):
"""
Are we being asked to look up a list of things, instead of a single thing?
We check for the `__iter__` attribute so that this can cover types that
don't have to be known by this module, such as NumPy arrays.
Strings, however, should be considered as atomic values to look up, not
iterables.
"""
return hasattr(obj, '__iter__') and not isinstance(obj, string_or_bytes)
class OrderedSet(collections.MutableSet):
"""
An OrderedSet is a custom MutableSet that remembers its order, so that
every entry has an index that can be looked up.
"""
def __init__(self, iterable=None):
self.items = []
self.map = {}
if iterable is not None:
for item in iterable:
idx = self.map.get(item)
if idx is None:
self.map[item] = len(self.items)
self.items.append(item)
def __len__(self):
return len(self.items)
def __getitem__(self, index):
"""
Get the item at a given index.
If `index` is a slice, you will get back that slice of items. If it's
the slice [:], exactly the same object is returned. (If you want an
independent copy of an OrderedSet, use `OrderedSet.copy()`.)
If `index` is an iterable, you'll get the OrderedSet of items
corresponding to those indices. This is similar to NumPy's
"fancy indexing".
"""
if index == SLICE_ALL:
return self
elif hasattr(index, '__index__') or isinstance(index, slice):
result = self.items[index]
if isinstance(result, list):
return OrderedSet(result)
else:
return result
elif is_iterable(index):
return OrderedSet([self.items[i] for i in index])
else:
raise TypeError("Don't know how to index an OrderedSet by %r" %
index)
def copy(self):
return OrderedSet(self)
def __getstate__(self):
return tuple(self)
def __setstate__(self, state):
self.__init__(state)
def __contains__(self, key):
return key in self.map
def add(self, key):
"""
Add `key` as an item to this OrderedSet, then return its index.
If `key` is already in the OrderedSet, return the index it already
had.
"""
index = self.map.get(key)
if index is None:
self.map[key] = index = len(self.items)
self.items.append(key)
return index
def index(self, key):
"""
Get the index of a given entry, raising an IndexError if it's not
present.
`key` can be an iterable of entries that is not a string, in which case
this returns a list of indices.
"""
if is_iterable(key):
return [self.index(subkey) for subkey in key]
return self.map[key]
def discard(self, key):
index = self.map.get(key)
if index is not None:
self.items.pop(index)
for item in self.items[index:]:
self.map[item] -= 1
return True
return False
def __iter__(self):
return iter(self.items)
def __reversed__(self):
return reversed(self.items)
def __repr__(self):
if not self:
return '%s()' % (self.__class__.__name__,)
return '%s(%r)' % (self.__class__.__name__, list(self))
def __eq__(self, other):
if isinstance(other, OrderedSet):
return len(self) == len(other) and self.items == other.items
try:
return type(other)(self.map) == other
except TypeError:
return False

View File

@@ -0,0 +1,791 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
"""
Tokenizer, parser and parsed objects for CSS selectors.
:copyright: (c) 2007-2012 Ian Bicking and contributors.
See AUTHORS for more details.
:license: BSD, see LICENSE for more details.
"""
import sys
import re
import operator
import string
from css_selectors.errors import SelectorSyntaxError, ExpressionError
from polyglot.builtins import unicode_type, codepoint_to_chr, range
utab = {c:c+32 for c in range(ord(u'A'), ord(u'Z')+1)}
if sys.version_info.major < 3:
tab = string.maketrans(string.ascii_uppercase, string.ascii_lowercase)
def ascii_lower(string):
"""Lower-case, but only in the ASCII range."""
return string.translate(utab if isinstance(string, unicode_type) else tab)
def urepr(x):
if isinstance(x, list):
return '[%s]' % ', '.join((map(urepr, x)))
ans = repr(x)
if ans.startswith("u'") or ans.startswith('u"'):
ans = ans[1:]
return ans
else:
def ascii_lower(x):
return x.translate(utab)
urepr = repr
# Parsed objects
class Selector(object):
"""
Represents a parsed selector.
"""
def __init__(self, tree, pseudo_element=None):
self.parsed_tree = tree
if pseudo_element is not None and not isinstance(
pseudo_element, FunctionalPseudoElement):
pseudo_element = ascii_lower(pseudo_element)
#: A :class:`FunctionalPseudoElement`,
#: or the identifier for the pseudo-element as a string,
# or ``None``.
#:
#: +-------------------------+----------------+--------------------------------+
#: | | Selector | Pseudo-element |
#: +=========================+================+================================+
#: | CSS3 syntax | ``a::before`` | ``'before'`` |
#: +-------------------------+----------------+--------------------------------+
#: | Older syntax | ``a:before`` | ``'before'`` |
#: +-------------------------+----------------+--------------------------------+
#: | From the Lists3_ draft, | ``li::marker`` | ``'marker'`` |
#: | not in Selectors3 | | |
#: +-------------------------+----------------+--------------------------------+
#: | Invalid pseudo-class | ``li:marker`` | ``None`` |
#: +-------------------------+----------------+--------------------------------+
#: | Functinal | ``a::foo(2)`` | ``FunctionalPseudoElement(…)`` |
#: +-------------------------+----------------+--------------------------------+
#:
# : .. _Lists3: http://www.w3.org/TR/2011/WD-css3-lists-20110524/#marker-pseudoelement
self.pseudo_element = pseudo_element
def __repr__(self):
if isinstance(self.pseudo_element, FunctionalPseudoElement):
pseudo_element = repr(self.pseudo_element)
if self.pseudo_element:
pseudo_element = '::%s' % self.pseudo_element
else:
pseudo_element = ''
return '%s[%r%s]' % (
self.__class__.__name__, self.parsed_tree, pseudo_element)
def specificity(self):
"""Return the specificity_ of this selector as a tuple of 3 integers.
.. _specificity: http://www.w3.org/TR/selectors/#specificity
"""
a, b, c = self.parsed_tree.specificity()
if self.pseudo_element:
c += 1
return a, b, c
class Class(object):
"""
Represents selector.class_name
"""
def __init__(self, selector, class_name):
self.selector = selector
self.class_name = class_name
def __repr__(self):
return '%s[%r.%s]' % (
self.__class__.__name__, self.selector, self.class_name)
def specificity(self):
a, b, c = self.selector.specificity()
b += 1
return a, b, c
class FunctionalPseudoElement(object):
"""
Represents selector::name(arguments)
.. attribute:: name
The name (identifier) of the pseudo-element, as a string.
.. attribute:: arguments
The arguments of the pseudo-element, as a list of tokens.
**Note:** tokens are not part of the public API,
and may change between versions.
Use at your own risks.
"""
def __init__(self, name, arguments):
self.name = ascii_lower(name)
self.arguments = arguments
def __repr__(self):
return '%s[::%s(%s)]' % (
self.__class__.__name__, self.name,
urepr([token.value for token in self.arguments]))
def argument_types(self):
return [token.type for token in self.arguments]
def specificity(self):
a, b, c = self.selector.specificity()
b += 1
return a, b, c
class Function(object):
"""
Represents selector:name(expr)
"""
def __init__(self, selector, name, arguments):
self.selector = selector
self.name = ascii_lower(name)
self.arguments = arguments
self._parsed_arguments = None
def __repr__(self):
return '%s[%r:%s(%s)]' % (
self.__class__.__name__, self.selector, self.name,
urepr([token.value for token in self.arguments]))
def argument_types(self):
return [token.type for token in self.arguments]
@property
def parsed_arguments(self):
if self._parsed_arguments is None:
try:
self._parsed_arguments = parse_series(self.arguments)
except ValueError:
raise ExpressionError("Invalid series: '%r'" % self.arguments)
return self._parsed_arguments
def parse_arguments(self):
if not self.arguments_parsed:
self.arguments_parsed = True
def specificity(self):
a, b, c = self.selector.specificity()
b += 1
return a, b, c
class Pseudo(object):
"""
Represents selector:ident
"""
def __init__(self, selector, ident):
self.selector = selector
self.ident = ascii_lower(ident)
def __repr__(self):
return '%s[%r:%s]' % (
self.__class__.__name__, self.selector, self.ident)
def specificity(self):
a, b, c = self.selector.specificity()
b += 1
return a, b, c
class Negation(object):
"""
Represents selector:not(subselector)
"""
def __init__(self, selector, subselector):
self.selector = selector
self.subselector = subselector
def __repr__(self):
return '%s[%r:not(%r)]' % (
self.__class__.__name__, self.selector, self.subselector)
def specificity(self):
a1, b1, c1 = self.selector.specificity()
a2, b2, c2 = self.subselector.specificity()
return a1 + a2, b1 + b2, c1 + c2
class Attrib(object):
"""
Represents selector[namespace|attrib operator value]
"""
def __init__(self, selector, namespace, attrib, operator, value):
self.selector = selector
self.namespace = namespace
self.attrib = attrib
self.operator = operator
self.value = value
def __repr__(self):
if self.namespace:
attrib = '%s|%s' % (self.namespace, self.attrib)
else:
attrib = self.attrib
if self.operator == 'exists':
return '%s[%r[%s]]' % (
self.__class__.__name__, self.selector, attrib)
else:
return '%s[%r[%s %s %s]]' % (
self.__class__.__name__, self.selector, attrib,
self.operator, urepr(self.value))
def specificity(self):
a, b, c = self.selector.specificity()
b += 1
return a, b, c
class Element(object):
"""
Represents namespace|element
`None` is for the universal selector '*'
"""
def __init__(self, namespace=None, element=None):
self.namespace = namespace
self.element = element
def __repr__(self):
element = self.element or '*'
if self.namespace:
element = '%s|%s' % (self.namespace, element)
return '%s[%s]' % (self.__class__.__name__, element)
def specificity(self):
if self.element:
return 0, 0, 1
else:
return 0, 0, 0
class Hash(object):
"""
Represents selector#id
"""
def __init__(self, selector, id):
self.selector = selector
self.id = id
def __repr__(self):
return '%s[%r#%s]' % (
self.__class__.__name__, self.selector, self.id)
def specificity(self):
a, b, c = self.selector.specificity()
a += 1
return a, b, c
class CombinedSelector(object):
def __init__(self, selector, combinator, subselector):
assert selector is not None
self.selector = selector
self.combinator = combinator
self.subselector = subselector
def __repr__(self):
if self.combinator == ' ':
comb = '<followed>'
else:
comb = self.combinator
return '%s[%r %s %r]' % (
self.__class__.__name__, self.selector, comb, self.subselector)
def specificity(self):
a1, b1, c1 = self.selector.specificity()
a2, b2, c2 = self.subselector.specificity()
return a1 + a2, b1 + b2, c1 + c2
# Parser
# foo
_el_re = re.compile(r'^[ \t\r\n\f]*([a-zA-Z]+)[ \t\r\n\f]*$')
# foo#bar or #bar
_id_re = re.compile(r'^[ \t\r\n\f]*([a-zA-Z]*)#([a-zA-Z0-9_-]+)[ \t\r\n\f]*$')
# foo.bar or .bar
_class_re = re.compile(
r'^[ \t\r\n\f]*([a-zA-Z]*)\.([a-zA-Z][a-zA-Z0-9_-]*)[ \t\r\n\f]*$')
def parse(css):
"""Parse a CSS *group of selectors*.
:param css:
A *group of selectors* as an Unicode string.
:raises:
:class:`SelectorSyntaxError` on invalid selectors.
:returns:
A list of parsed :class:`Selector` objects, one for each
selector in the comma-separated group.
"""
# Fast path for simple cases
match = _el_re.match(css)
if match:
return [Selector(Element(element=match.group(1)))]
match = _id_re.match(css)
if match is not None:
return [Selector(Hash(Element(element=match.group(1) or None),
match.group(2)))]
match = _class_re.match(css)
if match is not None:
return [Selector(Class(Element(element=match.group(1) or None),
match.group(2)))]
stream = TokenStream(tokenize(css))
stream.source = css
return list(parse_selector_group(stream))
# except SelectorSyntaxError:
# e = sys.exc_info()[1]
# message = "%s at %s -> %r" % (
# e, stream.used, stream.peek())
# e.msg = message
# e.args = tuple([message])
# raise
def parse_selector_group(stream):
stream.skip_whitespace()
while 1:
yield Selector(*parse_selector(stream))
if stream.peek() == ('DELIM', ','):
stream.next()
stream.skip_whitespace()
else:
break
def parse_selector(stream):
result, pseudo_element = parse_simple_selector(stream)
while 1:
stream.skip_whitespace()
peek = stream.peek()
if peek in (('EOF', None), ('DELIM', ',')):
break
if pseudo_element:
raise SelectorSyntaxError(
'Got pseudo-element ::%s not at the end of a selector'
% pseudo_element)
if peek.is_delim('+', '>', '~'):
# A combinator
combinator = stream.next().value
stream.skip_whitespace()
else:
# By exclusion, the last parse_simple_selector() ended
# at peek == ' '
combinator = ' '
next_selector, pseudo_element = parse_simple_selector(stream)
result = CombinedSelector(result, combinator, next_selector)
return result, pseudo_element
special_pseudo_elements = (
'first-line', 'first-letter', 'before', 'after')
def parse_simple_selector(stream, inside_negation=False):
stream.skip_whitespace()
selector_start = len(stream.used)
peek = stream.peek()
if peek.type == 'IDENT' or peek == ('DELIM', '*'):
if peek.type == 'IDENT':
namespace = stream.next().value
else:
stream.next()
namespace = None
if stream.peek() == ('DELIM', '|'):
stream.next()
element = stream.next_ident_or_star()
else:
element = namespace
namespace = None
else:
element = namespace = None
result = Element(namespace, element)
pseudo_element = None
while 1:
peek = stream.peek()
if peek.type in ('S', 'EOF') or peek.is_delim(',', '+', '>', '~') or (
inside_negation and peek == ('DELIM', ')')):
break
if pseudo_element:
raise SelectorSyntaxError(
'Got pseudo-element ::%s not at the end of a selector'
% pseudo_element)
if peek.type == 'HASH':
result = Hash(result, stream.next().value)
elif peek == ('DELIM', '.'):
stream.next()
result = Class(result, stream.next_ident())
elif peek == ('DELIM', '['):
stream.next()
result = parse_attrib(result, stream)
elif peek == ('DELIM', ':'):
stream.next()
if stream.peek() == ('DELIM', ':'):
stream.next()
pseudo_element = stream.next_ident()
if stream.peek() == ('DELIM', '('):
stream.next()
pseudo_element = FunctionalPseudoElement(
pseudo_element, parse_arguments(stream))
continue
ident = stream.next_ident()
if ident.lower() in special_pseudo_elements:
# Special case: CSS 2.1 pseudo-elements can have a single ':'
# Any new pseudo-element must have two.
pseudo_element = unicode_type(ident)
continue
if stream.peek() != ('DELIM', '('):
result = Pseudo(result, ident)
continue
stream.next()
stream.skip_whitespace()
if ident.lower() == 'not':
if inside_negation:
raise SelectorSyntaxError('Got nested :not()')
argument, argument_pseudo_element = parse_simple_selector(
stream, inside_negation=True)
next = stream.next()
if argument_pseudo_element:
raise SelectorSyntaxError(
'Got pseudo-element ::%s inside :not() at %s'
% (argument_pseudo_element, next.pos))
if next != ('DELIM', ')'):
raise SelectorSyntaxError("Expected ')', got %s" % (next,))
result = Negation(result, argument)
else:
result = Function(result, ident, parse_arguments(stream))
else:
raise SelectorSyntaxError(
"Expected selector, got %s" % (peek,))
if len(stream.used) == selector_start:
raise SelectorSyntaxError(
"Expected selector, got %s" % (stream.peek(),))
return result, pseudo_element
def parse_arguments(stream):
arguments = []
while 1:
stream.skip_whitespace()
next = stream.next()
if next.type in ('IDENT', 'STRING', 'NUMBER') or next in [
('DELIM', '+'), ('DELIM', '-')]:
arguments.append(next)
elif next == ('DELIM', ')'):
return arguments
else:
raise SelectorSyntaxError(
"Expected an argument, got %s" % (next,))
def parse_attrib(selector, stream):
stream.skip_whitespace()
attrib = stream.next_ident_or_star()
if attrib is None and stream.peek() != ('DELIM', '|'):
raise SelectorSyntaxError(
"Expected '|', got %s" % (stream.peek(),))
if stream.peek() == ('DELIM', '|'):
stream.next()
if stream.peek() == ('DELIM', '='):
namespace = None
stream.next()
op = '|='
else:
namespace = attrib
attrib = stream.next_ident()
op = None
else:
namespace = op = None
if op is None:
stream.skip_whitespace()
next = stream.next()
if next == ('DELIM', ']'):
return Attrib(selector, namespace, attrib, 'exists', None)
elif next == ('DELIM', '='):
op = '='
elif next.is_delim('^', '$', '*', '~', '|', '!') and (
stream.peek() == ('DELIM', '=')):
op = next.value + '='
stream.next()
else:
raise SelectorSyntaxError(
"Operator expected, got %s" % (next,))
stream.skip_whitespace()
value = stream.next()
if value.type not in ('IDENT', 'STRING'):
raise SelectorSyntaxError(
"Expected string or ident, got %s" % (value,))
stream.skip_whitespace()
next = stream.next()
if next != ('DELIM', ']'):
raise SelectorSyntaxError(
"Expected ']', got %s" % (next,))
return Attrib(selector, namespace, attrib, op, value.value)
def parse_series(tokens):
"""
Parses the arguments for :nth-child() and friends.
:raises: A list of tokens
:returns: :``(a, b)``
"""
for token in tokens:
if token.type == 'STRING':
raise ValueError('String tokens not allowed in series.')
s = ''.join(token.value for token in tokens).strip()
if s == 'odd':
return (2, 1)
elif s == 'even':
return (2, 0)
elif s == 'n':
return (1, 0)
if 'n' not in s:
# Just b
return (0, int(s))
a, b = s.split('n', 1)
if not a:
a = 1
elif a == '-' or a == '+':
a = int(a+'1')
else:
a = int(a)
if not b:
b = 0
else:
b = int(b)
return (a, b)
# Token objects
class Token(tuple):
def __new__(cls, type_, value, pos):
obj = tuple.__new__(cls, (type_, value))
obj.pos = pos
return obj
def __repr__(self):
return "<%s '%s' at %i>" % (self.type, self.value, self.pos)
def is_delim(self, *values):
return self.type == 'DELIM' and self.value in values
type = property(operator.itemgetter(0))
value = property(operator.itemgetter(1))
class EOFToken(Token):
def __new__(cls, pos):
return Token.__new__(cls, 'EOF', None, pos)
def __repr__(self):
return '<%s at %i>' % (self.type, self.pos)
# Tokenizer
class TokenMacros:
unicode_escape = r'\\([0-9a-f]{1,6})(?:\r\n|[ \n\r\t\f])?'
escape = unicode_escape + r'|\\[^\n\r\f0-9a-f]'
string_escape = r'\\(?:\n|\r\n|\r|\f)|' + escape
nonascii = r'[^\0-\177]'
nmchar = '[_a-z0-9-]|%s|%s' % (escape, nonascii)
nmstart = '[_a-z]|%s|%s' % (escape, nonascii)
def _compile(pattern):
return re.compile(pattern % vars(TokenMacros), re.IGNORECASE).match
_match_whitespace = _compile(r'[ \t\r\n\f]+')
_match_number = _compile(r'[+-]?(?:[0-9]*\.[0-9]+|[0-9]+)')
_match_hash = _compile('#(?:%(nmchar)s)+')
_match_ident = _compile('-?(?:%(nmstart)s)(?:%(nmchar)s)*')
_match_string_by_quote = {
"'": _compile(r"([^\n\r\f\\']|%(string_escape)s)*"),
'"': _compile(r'([^\n\r\f\\"]|%(string_escape)s)*'),
}
_sub_simple_escape = re.compile(r'\\(.)').sub
_sub_unicode_escape = re.compile(TokenMacros.unicode_escape, re.I).sub
_sub_newline_escape =re.compile(r'\\(?:\n|\r\n|\r|\f)').sub
# Same as r'\1', but faster on CPython
if hasattr(operator, 'methodcaller'):
# Python 2.6+
_replace_simple = operator.methodcaller('group', 1)
else:
def _replace_simple(match):
return match.group(1)
def _replace_unicode(match):
codepoint = int(match.group(1), 16)
if codepoint > sys.maxunicode:
codepoint = 0xFFFD
return codepoint_to_chr(codepoint)
def unescape_ident(value):
value = _sub_unicode_escape(_replace_unicode, value)
value = _sub_simple_escape(_replace_simple, value)
return value
def tokenize(s):
pos = 0
len_s = len(s)
while pos < len_s:
match = _match_whitespace(s, pos=pos)
if match:
yield Token('S', ' ', pos)
pos = match.end()
continue
match = _match_ident(s, pos=pos)
if match:
value = _sub_simple_escape(_replace_simple,
_sub_unicode_escape(_replace_unicode, match.group()))
yield Token('IDENT', value, pos)
pos = match.end()
continue
match = _match_hash(s, pos=pos)
if match:
value = _sub_simple_escape(_replace_simple,
_sub_unicode_escape(_replace_unicode, match.group()[1:]))
yield Token('HASH', value, pos)
pos = match.end()
continue
quote = s[pos]
if quote in _match_string_by_quote:
match = _match_string_by_quote[quote](s, pos=pos + 1)
assert match, 'Should have found at least an empty match'
end_pos = match.end()
if end_pos == len_s:
raise SelectorSyntaxError('Unclosed string at %s' % pos)
if s[end_pos] != quote:
raise SelectorSyntaxError('Invalid string at %s' % pos)
value = _sub_simple_escape(_replace_simple,
_sub_unicode_escape(_replace_unicode,
_sub_newline_escape('', match.group())))
yield Token('STRING', value, pos)
pos = end_pos + 1
continue
match = _match_number(s, pos=pos)
if match:
value = match.group()
yield Token('NUMBER', value, pos)
pos = match.end()
continue
pos2 = pos + 2
if s[pos:pos2] == '/*':
pos = s.find('*/', pos2)
if pos == -1:
pos = len_s
else:
pos += 2
continue
yield Token('DELIM', s[pos], pos)
pos += 1
assert pos == len_s
yield EOFToken(pos)
class TokenStream(object):
def __init__(self, tokens, source=None):
self.used = []
self.tokens = iter(tokens)
self.source = source
self.peeked = None
self._peeking = False
try:
self.next_token = self.tokens.next
except AttributeError:
# Python 3
self.next_token = self.tokens.__next__
def next(self):
if self._peeking:
self._peeking = False
self.used.append(self.peeked)
return self.peeked
else:
next = self.next_token()
self.used.append(next)
return next
def peek(self):
if not self._peeking:
self.peeked = self.next_token()
self._peeking = True
return self.peeked
def next_ident(self):
next = self.next()
if next.type != 'IDENT':
raise SelectorSyntaxError('Expected ident, got %s' % (next,))
return next.value
def next_ident_or_star(self):
next = self.next()
if next.type == 'IDENT':
return next.value
elif next == ('DELIM', '*'):
return None
else:
raise SelectorSyntaxError(
"Expected ident or '*', got %s" % (next,))
def skip_whitespace(self):
peek = self.peek()
if peek.type == 'S':
self.next()

View File

@@ -0,0 +1,694 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2015, Kovid Goyal <kovid at kovidgoyal.net>'
import re, itertools
from collections import OrderedDict, defaultdict
from functools import wraps
from itertools import chain
from lxml import etree
from css_selectors.errors import ExpressionError
from css_selectors.parser import parse, ascii_lower, Element, FunctionalPseudoElement
from css_selectors.ordered_set import OrderedSet
from polyglot.builtins import iteritems, itervalues
PARSE_CACHE_SIZE = 200
parse_cache = OrderedDict()
XPATH_CACHE_SIZE = 30
xpath_cache = OrderedDict()
# Test that the string is not empty and does not contain whitespace
is_non_whitespace = re.compile(r'^[^ \t\r\n\f]+$').match
def get_parsed_selector(raw):
try:
return parse_cache[raw]
except KeyError:
parse_cache[raw] = ans = parse(raw)
if len(parse_cache) > PARSE_CACHE_SIZE:
parse_cache.pop(next(iter(parse_cache)))
return ans
def get_compiled_xpath(expr):
try:
return xpath_cache[expr]
except KeyError:
xpath_cache[expr] = ans = etree.XPath(expr)
if len(xpath_cache) > XPATH_CACHE_SIZE:
xpath_cache.pop(next(iter(xpath_cache)))
return ans
class AlwaysIn(object):
def __contains__(self, x):
return True
always_in = AlwaysIn()
def trace_wrapper(func):
@wraps(func)
def trace(*args, **kwargs):
targs = args[1:] if args and isinstance(args[0], Select) else args
print('Called:', func.__name__, 'with args:', targs, kwargs or '')
return func(*args, **kwargs)
return trace
def normalize_language_tag(tag):
"""Return a list of normalized combinations for a `BCP 47` language tag.
Example:
>>> normalize_language_tag('de_AT-1901')
['de-at-1901', 'de-at', 'de-1901', 'de']
"""
# normalize:
tag = ascii_lower(tag).replace('_','-')
# split (except singletons, which mark the following tag as non-standard):
tag = re.sub(r'-([a-zA-Z0-9])-', r'-\1_', tag)
subtags = [subtag.replace('_', '-') for subtag in tag.split('-')]
base_tag = (subtags.pop(0),)
taglist = {base_tag[0]}
# find all combinations of subtags
for n in range(len(subtags), 0, -1):
for tags in itertools.combinations(subtags, n):
taglist.add('-'.join(base_tag + tags))
return taglist
INAPPROPRIATE_PSEUDO_CLASSES = frozenset((
'active', 'after', 'disabled', 'visited', 'link', 'before', 'focus', 'first-letter', 'enabled', 'first-line', 'hover', 'checked', 'target'))
class Select(object):
'''
This class implements CSS Level 3 selectors
(http://www.w3.org/TR/css3-selectors) on an lxml tree, with caching for
performance. To use:
>>> from css_selectors import Select
>>> select = Select(root) # Where root is an lxml document
>>> print(tuple(select('p.myclass')))
Tags are returned in document order. Note that attribute and tag names are
matched case-insensitively. Class and id values are also matched
case-insensitively. Also namespaces are ignored (this is for performance of
the common case). The UI related selectors are not implemented, such as
:enabled, :disabled, :checked, :hover, etc. Similarly, the non-element
related selectors such as ::first-line, ::first-letter, ::before, etc. are
not implemented.
WARNING: This class uses internal caches. You *must not* make any changes
to the lxml tree. If you do make some changes, either create a new Select
object or call :meth:`invalidate_caches`.
This class can be easily sub-classed to work with tree implementations
other than lxml. Simply override the methods in the ``Tree Integration``
block below.
The caching works by maintaining internal maps from classes/ids/tag
names/etc. to node sets. These caches are populated as needed, and used
for all subsequent selections. Thus, for best performance you should use
the same selector object for finding the matching nodes for multiple
queries. Of course, remember not to change the tree in between queries.
'''
combinator_mapping = {
' ': 'descendant',
'>': 'child',
'+': 'direct_adjacent',
'~': 'indirect_adjacent',
}
attribute_operator_mapping = {
'exists': 'exists',
'=': 'equals',
'~=': 'includes',
'|=': 'dashmatch',
'^=': 'prefixmatch',
'$=': 'suffixmatch',
'*=': 'substringmatch',
}
def __init__(self, root, default_lang=None, ignore_inappropriate_pseudo_classes=False, dispatch_map=None, trace=False):
if hasattr(root, 'getroot'):
root = root.getroot()
self.root = root
self.dispatch_map = dispatch_map or default_dispatch_map
self.invalidate_caches()
self.default_lang = default_lang
if trace:
self.dispatch_map = {k:trace_wrapper(v) for k, v in iteritems(self.dispatch_map)}
if ignore_inappropriate_pseudo_classes:
self.ignore_inappropriate_pseudo_classes = INAPPROPRIATE_PSEUDO_CLASSES
else:
self.ignore_inappropriate_pseudo_classes = frozenset()
# External API {{{
def invalidate_caches(self):
'Invalidate all caches. You must call this before using this object if you have made changes to the HTML tree'
self._element_map = None
self._id_map = None
self._class_map = None
self._attrib_map = None
self._attrib_space_map = None
self._lang_map = None
self.map_tag_name = ascii_lower
if '{' in self.root.tag:
def map_tag_name(x):
return ascii_lower(x.rpartition('}')[2])
self.map_tag_name = map_tag_name
def __call__(self, selector, root=None):
''' Return an iterator over all matching tags, in document order.
Normally, all matching tags in the document are returned, is you
specify root, then only tags that are root or descendants of root are
returned. Note that this can be very expensive if root has a lot of
descendants. '''
seen = set()
if root is not None:
root = frozenset(self.itertag(root))
for parsed_selector in get_parsed_selector(selector):
for item in self.iterparsedselector(parsed_selector):
if item not in seen and (root is None or item in root):
yield item
seen.add(item)
def has_matches(self, selector, root=None):
'Return True iff selector matches at least one item in the tree'
for elem in self(selector, root=root):
return True
return False
# }}}
def iterparsedselector(self, parsed_selector):
type_name = type(parsed_selector).__name__
try:
func = self.dispatch_map[ascii_lower(type_name)]
except KeyError:
raise ExpressionError('%s is not supported' % type_name)
for item in func(self, parsed_selector):
yield item
@property
def element_map(self):
if self._element_map is None:
self._element_map = em = defaultdict(OrderedSet)
for tag in self.itertag():
em[self.map_tag_name(tag.tag)].add(tag)
return self._element_map
@property
def id_map(self):
if self._id_map is None:
self._id_map = im = defaultdict(OrderedSet)
lower = ascii_lower
for elem in self.iteridtags():
im[lower(elem.get('id'))].add(elem)
return self._id_map
@property
def class_map(self):
if self._class_map is None:
self._class_map = cm = defaultdict(OrderedSet)
lower = ascii_lower
for elem in self.iterclasstags():
for cls in elem.get('class').split():
cm[lower(cls)].add(elem)
return self._class_map
@property
def attrib_map(self):
if self._attrib_map is None:
self._attrib_map = am = defaultdict(lambda : defaultdict(OrderedSet))
map_attrib_name = ascii_lower
if '{' in self.root.tag:
def map_attrib_name(x):
return ascii_lower(x.rpartition('}')[2])
for tag in self.itertag():
for attr, val in iteritems(tag.attrib):
am[map_attrib_name(attr)][val].add(tag)
return self._attrib_map
@property
def attrib_space_map(self):
if self._attrib_space_map is None:
self._attrib_space_map = am = defaultdict(lambda : defaultdict(OrderedSet))
map_attrib_name = ascii_lower
if '{' in self.root.tag:
def map_attrib_name(x):
return ascii_lower(x.rpartition('}')[2])
for tag in self.itertag():
for attr, val in iteritems(tag.attrib):
for v in val.split():
am[map_attrib_name(attr)][v].add(tag)
return self._attrib_space_map
@property
def lang_map(self):
if self._lang_map is None:
self._lang_map = lm = defaultdict(OrderedSet)
dl = normalize_language_tag(self.default_lang) if self.default_lang else None
lmap = {tag:dl for tag in self.itertag()} if dl else {}
for tag in self.itertag():
lang = None
for attr in ('{http://www.w3.org/XML/1998/namespace}lang', 'lang'):
lang = tag.get(attr)
if lang:
lang = normalize_language_tag(lang)
for dtag in self.itertag(tag):
lmap[dtag] = lang
for tag, langs in iteritems(lmap):
for lang in langs:
lm[lang].add(tag)
return self._lang_map
# Tree Integration {{{
def itertag(self, tag=None):
return (self.root if tag is None else tag).iter('*')
def iterdescendants(self, tag=None):
return (self.root if tag is None else tag).iterdescendants('*')
def iterchildren(self, tag=None):
return (self.root if tag is None else tag).iterchildren('*')
def itersiblings(self, tag=None, preceding=False):
return (self.root if tag is None else tag).itersiblings('*', preceding=preceding)
def iteridtags(self):
return get_compiled_xpath('//*[@id]')(self.root)
def iterclasstags(self):
return get_compiled_xpath('//*[@class]')(self.root)
def sibling_count(self, child, before=True, same_type=False):
' Return the number of siblings before or after child or raise ValueError if child has no parent. '
parent = child.getparent()
if parent is None:
raise ValueError('Child has no parent')
if same_type:
siblings = OrderedSet(child.itersiblings(preceding=before))
return len(self.element_map[self.map_tag_name(child.tag)] & siblings)
else:
if before:
return parent.index(child)
return len(parent) - parent.index(child) - 1
def all_sibling_count(self, child, same_type=False):
' Return the number of siblings of child or raise ValueError if child has no parent '
parent = child.getparent()
if parent is None:
raise ValueError('Child has no parent')
if same_type:
siblings = OrderedSet(chain(child.itersiblings(preceding=False), child.itersiblings(preceding=True)))
return len(self.element_map[self.map_tag_name(child.tag)] & siblings)
else:
return len(parent) - 1
def is_empty(self, elem):
' Return True iff elem has no child tags and no text content '
for child in elem:
# Check for comment/PI nodes with tail text
if child.tail:
return False
return len(tuple(elem.iterchildren('*'))) == 0 and not elem.text
# }}}
# Combinators {{{
def select_combinedselector(cache, combined):
"""Translate a combined selector."""
combinator = cache.combinator_mapping[combined.combinator]
# Fast path for when the sub-selector is all elements
right = None if isinstance(combined.subselector, Element) and (
combined.subselector.element or '*') == '*' else cache.iterparsedselector(combined.subselector)
for item in cache.dispatch_map[combinator](cache, cache.iterparsedselector(combined.selector), right):
yield item
def select_descendant(cache, left, right):
"""right is a child, grand-child or further descendant of left"""
right = always_in if right is None else frozenset(right)
for ancestor in left:
for descendant in cache.iterdescendants(ancestor):
if descendant in right:
yield descendant
def select_child(cache, left, right):
"""right is an immediate child of left"""
right = always_in if right is None else frozenset(right)
for parent in left:
for child in cache.iterchildren(parent):
if child in right:
yield child
def select_direct_adjacent(cache, left, right):
"""right is a sibling immediately after left"""
right = always_in if right is None else frozenset(right)
for parent in left:
for sibling in cache.itersiblings(parent):
if sibling in right:
yield sibling
break
def select_indirect_adjacent(cache, left, right):
"""right is a sibling after left, immediately or not"""
right = always_in if right is None else frozenset(right)
for parent in left:
for sibling in cache.itersiblings(parent):
if sibling in right:
yield sibling
# }}}
def select_element(cache, selector):
"""A type or universal selector."""
element = selector.element
if not element or element == '*':
for elem in cache.itertag():
yield elem
else:
for elem in cache.element_map[ascii_lower(element)]:
yield elem
def select_hash(cache, selector):
'An id selector'
items = cache.id_map[ascii_lower(selector.id)]
if len(items) > 0:
for elem in cache.iterparsedselector(selector.selector):
if elem in items:
yield elem
def select_class(cache, selector):
'A class selector'
items = cache.class_map[ascii_lower(selector.class_name)]
if items:
for elem in cache.iterparsedselector(selector.selector):
if elem in items:
yield elem
def select_negation(cache, selector):
'Implement :not()'
exclude = frozenset(cache.iterparsedselector(selector.subselector))
for item in cache.iterparsedselector(selector.selector):
if item not in exclude:
yield item
# Attribute selectors {{{
def select_attrib(cache, selector):
operator = cache.attribute_operator_mapping[selector.operator]
items = frozenset(cache.dispatch_map[operator](cache, ascii_lower(selector.attrib), selector.value))
for item in cache.iterparsedselector(selector.selector):
if item in items:
yield item
def select_exists(cache, attrib, value=None):
for elem_set in itervalues(cache.attrib_map[attrib]):
for elem in elem_set:
yield elem
def select_equals(cache, attrib, value):
for elem in cache.attrib_map[attrib][value]:
yield elem
def select_includes(cache, attrib, value):
if is_non_whitespace(value):
for elem in cache.attrib_space_map[attrib][value]:
yield elem
def select_dashmatch(cache, attrib, value):
if value:
for val, elem_set in iteritems(cache.attrib_map[attrib]):
if val == value or val.startswith(value + '-'):
for elem in elem_set:
yield elem
def select_prefixmatch(cache, attrib, value):
if value:
for val, elem_set in iteritems(cache.attrib_map[attrib]):
if val.startswith(value):
for elem in elem_set:
yield elem
def select_suffixmatch(cache, attrib, value):
if value:
for val, elem_set in iteritems(cache.attrib_map[attrib]):
if val.endswith(value):
for elem in elem_set:
yield elem
def select_substringmatch(cache, attrib, value):
if value:
for val, elem_set in iteritems(cache.attrib_map[attrib]):
if value in val:
for elem in elem_set:
yield elem
# }}}
# Function selectors {{{
def select_function(cache, function):
"""Select with a functional pseudo-class."""
fname = function.name.replace('-', '_')
try:
func = cache.dispatch_map[fname]
except KeyError:
raise ExpressionError(
"The pseudo-class :%s() is unknown" % function.name)
if fname == 'lang':
items = frozenset(func(cache, function))
for item in cache.iterparsedselector(function.selector):
if item in items:
yield item
else:
for item in cache.iterparsedselector(function.selector):
if func(cache, function, item):
yield item
def select_lang(cache, function):
' Implement :lang() '
if function.argument_types() not in (['STRING'], ['IDENT']):
raise ExpressionError("Expected a single string or ident for :lang(), got %r" % function.arguments)
lang = function.arguments[0].value
if lang:
lang = ascii_lower(lang)
lp = lang + '-'
for tlang, elem_set in iteritems(cache.lang_map):
if tlang == lang or (tlang is not None and tlang.startswith(lp)):
for elem in elem_set:
yield elem
def select_nth_child(cache, function, elem):
' Implement :nth-child() '
a, b = function.parsed_arguments
try:
num = cache.sibling_count(elem) + 1
except ValueError:
return False
if a == 0:
return num == b
n = (num - b) / a
return n.is_integer() and n > -1
def select_nth_last_child(cache, function, elem):
' Implement :nth-last-child() '
a, b = function.parsed_arguments
try:
num = cache.sibling_count(elem, before=False) + 1
except ValueError:
return False
if a == 0:
return num == b
n = (num - b) / a
return n.is_integer() and n > -1
def select_nth_of_type(cache, function, elem):
' Implement :nth-of-type() '
a, b = function.parsed_arguments
try:
num = cache.sibling_count(elem, same_type=True) + 1
except ValueError:
return False
if a == 0:
return num == b
n = (num - b) / a
return n.is_integer() and n > -1
def select_nth_last_of_type(cache, function, elem):
' Implement :nth-last-of-type() '
a, b = function.parsed_arguments
try:
num = cache.sibling_count(elem, before=False, same_type=True) + 1
except ValueError:
return False
if a == 0:
return num == b
n = (num - b) / a
return n.is_integer() and n > -1
# }}}
# Pseudo elements {{{
def pseudo_func(f):
f.is_pseudo = True
return f
@pseudo_func
def allow_all(cache, item):
return True
def get_func_for_pseudo(cache, ident):
try:
func = cache.dispatch_map[ident.replace('-', '_')]
except KeyError:
if ident in cache.ignore_inappropriate_pseudo_classes:
func = allow_all
else:
raise ExpressionError(
"The pseudo-class :%s is not supported" % ident)
try:
func.is_pseudo
except AttributeError:
raise ExpressionError(
"The pseudo-class :%s is invalid" % ident)
return func
def select_selector(cache, selector):
if selector.pseudo_element is None:
for item in cache.iterparsedselector(selector.parsed_tree):
yield item
return
if isinstance(selector.pseudo_element, FunctionalPseudoElement):
raise ExpressionError(
"The pseudo-element ::%s is not supported" % selector.pseudo_element.name)
func = get_func_for_pseudo(cache, selector.pseudo_element)
for item in cache.iterparsedselector(selector.parsed_tree):
if func(cache, item):
yield item
def select_pseudo(cache, pseudo):
func = get_func_for_pseudo(cache, pseudo.ident)
if func is select_root:
yield cache.root
return
for item in cache.iterparsedselector(pseudo.selector):
if func(cache, item):
yield item
@pseudo_func
def select_root(cache, elem):
return elem is cache.root
@pseudo_func
def select_first_child(cache, elem):
try:
return cache.sibling_count(elem) == 0
except ValueError:
return False
@pseudo_func
def select_last_child(cache, elem):
try:
return cache.sibling_count(elem, before=False) == 0
except ValueError:
return False
@pseudo_func
def select_only_child(cache, elem):
try:
return cache.all_sibling_count(elem) == 0
except ValueError:
return False
@pseudo_func
def select_first_of_type(cache, elem):
try:
return cache.sibling_count(elem, same_type=True) == 0
except ValueError:
return False
@pseudo_func
def select_last_of_type(cache, elem):
try:
return cache.sibling_count(elem, before=False, same_type=True) == 0
except ValueError:
return False
@pseudo_func
def select_only_of_type(cache, elem):
try:
return cache.all_sibling_count(elem, same_type=True) == 0
except ValueError:
return False
@pseudo_func
def select_empty(cache, elem):
return cache.is_empty(elem)
# }}}
default_dispatch_map = {name.partition('_')[2]:obj for name, obj in globals().items() if name.startswith('select_') and callable(obj)}
if __name__ == '__main__':
from pprint import pprint
root = etree.fromstring(
'<body xmlns="xxx" xml:lang="en"><p id="p" class="one two" lang="fr"><a id="a"/><b/><c/><d/></p></body>',
parser=etree.XMLParser(recover=True, no_network=True, resolve_entities=False))
select = Select(root, ignore_inappropriate_pseudo_classes=True, trace=True)
pprint(list(select('p:disabled')))

View File

@@ -0,0 +1,843 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2015, Kovid Goyal <kovid at kovidgoyal.net>'
import unittest, sys, argparse
from lxml import etree, html
from css_selectors.errors import SelectorSyntaxError, ExpressionError
from css_selectors.parser import tokenize, parse
from css_selectors.select import Select
class TestCSSSelectors(unittest.TestCase):
# Test data {{{
HTML_IDS = '''
<html id="html"><head>
<link id="link-href" href="foo" />
<link id="link-nohref" />
</head><body>
<div id="outer-div">
<a id="name-anchor" name="foo"></a>
<a id="tag-anchor" rel="tag" href="http://localhost/foo">link</a>
<a id="nofollow-anchor" rel="nofollow" href="https://example.org">
link</a>
<ol id="first-ol" class="a b c">
<li id="first-li">content</li>
<li id="second-li" lang="En-us">
<div id="li-div">
</div>
</li>
<li id="third-li" class="ab c"></li>
<li id="fourth-li" class="ab
c"></li>
<li id="fifth-li"></li>
<li id="sixth-li"></li>
<li id="seventh-li"> </li>
</ol>
<p id="paragraph">
<b id="p-b">hi</b> <em id="p-em">there</em>
<b id="p-b2">guy</b>
<input type="checkbox" id="checkbox-unchecked" />
<input type="checkbox" id="checkbox-disabled" disabled="" />
<input type="text" id="text-checked" checked="checked" />
<input type="hidden" />
<input type="hidden" disabled="disabled" />
<input type="checkbox" id="checkbox-checked" checked="checked" />
<input type="checkbox" id="checkbox-disabled-checked"
disabled="disabled" checked="checked" />
<fieldset id="fieldset" disabled="disabled">
<input type="checkbox" id="checkbox-fieldset-disabled" />
<input type="hidden" />
</fieldset>
</p>
<ol id="second-ol">
</ol>
<map name="dummymap">
<area shape="circle" coords="200,250,25" href="foo.html" id="area-href" />
<area shape="default" id="area-nohref" />
</map>
</div>
<div id="foobar-div" foobar="ab bc
cde"><span id="foobar-span"></span></div>
</body></html>
'''
HTML_SHAKESPEARE = '''
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" debug="true">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
<body>
<div id="test">
<div class="dialog">
<h2>As You Like It</h2>
<div id="playwright">
by William Shakespeare
</div>
<div class="dialog scene thirdClass" id="scene1">
<h3>ACT I, SCENE III. A room in the palace.</h3>
<div class="dialog">
<div class="direction">Enter CELIA and ROSALIND</div>
</div>
<div id="speech1" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.1">Why, cousin! why, Rosalind! Cupid have mercy! not a word?</div>
</div>
<div id="speech2" class="character">ROSALIND</div>
<div class="dialog">
<div id="scene1.3.2">Not one to throw at a dog.</div>
</div>
<div id="speech3" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.3">No, thy words are too precious to be cast away upon</div>
<div id="scene1.3.4">curs; throw some of them at me; come, lame me with reasons.</div>
</div>
<div id="speech4" class="character">ROSALIND</div>
<div id="speech5" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.8">But is all this for your father?</div>
</div>
<div class="dialog">
<div id="scene1.3.5">Then there were two cousins laid up; when the one</div>
<div id="scene1.3.6">should be lamed with reasons and the other mad</div>
<div id="scene1.3.7">without any.</div>
</div>
<div id="speech6" class="character">ROSALIND</div>
<div class="dialog">
<div id="scene1.3.9">No, some of it is for my child's father. O, how</div>
<div id="scene1.3.10">full of briers is this working-day world!</div>
</div>
<div id="speech7" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.11">They are but burs, cousin, thrown upon thee in</div>
<div id="scene1.3.12">holiday foolery: if we walk not in the trodden</div>
<div id="scene1.3.13">paths our very petticoats will catch them.</div>
</div>
<div id="speech8" class="character">ROSALIND</div>
<div class="dialog">
<div id="scene1.3.14">I could shake them off my coat: these burs are in my heart.</div>
</div>
<div id="speech9" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.15">Hem them away.</div>
</div>
<div id="speech10" class="character">ROSALIND</div>
<div class="dialog">
<div id="scene1.3.16">I would try, if I could cry 'hem' and have him.</div>
</div>
<div id="speech11" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.17">Come, come, wrestle with thy affections.</div>
</div>
<div id="speech12" class="character">ROSALIND</div>
<div class="dialog">
<div id="scene1.3.18">O, they take the part of a better wrestler than myself!</div>
</div>
<div id="speech13" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.19">O, a good wish upon you! you will try in time, in</div>
<div id="scene1.3.20">despite of a fall. But, turning these jests out of</div>
<div id="scene1.3.21">service, let us talk in good earnest: is it</div>
<div id="scene1.3.22">possible, on such a sudden, you should fall into so</div>
<div id="scene1.3.23">strong a liking with old Sir Rowland's youngest son?</div>
</div>
<div id="speech14" class="character">ROSALIND</div>
<div class="dialog">
<div id="scene1.3.24">The duke my father loved his father dearly.</div>
</div>
<div id="speech15" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.25">Doth it therefore ensue that you should love his son</div>
<div id="scene1.3.26">dearly? By this kind of chase, I should hate him,</div>
<div id="scene1.3.27">for my father hated his father dearly; yet I hate</div>
<div id="scene1.3.28">not Orlando.</div>
</div>
<div id="speech16" class="character">ROSALIND</div>
<div title="wtf" class="dialog">
<div id="scene1.3.29">No, faith, hate him not, for my sake.</div>
</div>
<div id="speech17" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.30">Why should I not? doth he not deserve well?</div>
</div>
<div id="speech18" class="character">ROSALIND</div>
<div class="dialog">
<div id="scene1.3.31">Let me love him for that, and do you love him</div>
<div id="scene1.3.32">because I do. Look, here comes the duke.</div>
</div>
<div id="speech19" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.33">With his eyes full of anger.</div>
<div class="direction">Enter DUKE FREDERICK, with Lords</div>
</div>
<div id="speech20" class="character">DUKE FREDERICK</div>
<div class="dialog">
<div id="scene1.3.34">Mistress, dispatch you with your safest haste</div>
<div id="scene1.3.35">And get you from our court.</div>
</div>
<div id="speech21" class="character">ROSALIND</div>
<div class="dialog">
<div id="scene1.3.36">Me, uncle?</div>
</div>
<div id="speech22" class="character">DUKE FREDERICK</div>
<div class="dialog">
<div id="scene1.3.37">You, cousin</div>
<div id="scene1.3.38">Within these ten days if that thou be'st found</div>
<div id="scene1.3.39">So near our public court as twenty miles,</div>
<div id="scene1.3.40">Thou diest for it.</div>
</div>
<div id="speech23" class="character">ROSALIND</div>
<div class="dialog">
<div id="scene1.3.41"> I do beseech your grace,</div>
<div id="scene1.3.42">Let me the knowledge of my fault bear with me:</div>
<div id="scene1.3.43">If with myself I hold intelligence</div>
<div id="scene1.3.44">Or have acquaintance with mine own desires,</div>
<div id="scene1.3.45">If that I do not dream or be not frantic,--</div>
<div id="scene1.3.46">As I do trust I am not--then, dear uncle,</div>
<div id="scene1.3.47">Never so much as in a thought unborn</div>
<div id="scene1.3.48">Did I offend your highness.</div>
</div>
<div id="speech24" class="character">DUKE FREDERICK</div>
<div class="dialog">
<div id="scene1.3.49">Thus do all traitors:</div>
<div id="scene1.3.50">If their purgation did consist in words,</div>
<div id="scene1.3.51">They are as innocent as grace itself:</div>
<div id="scene1.3.52">Let it suffice thee that I trust thee not.</div>
</div>
<div id="speech25" class="character">ROSALIND</div>
<div class="dialog">
<div id="scene1.3.53">Yet your mistrust cannot make me a traitor:</div>
<div id="scene1.3.54">Tell me whereon the likelihood depends.</div>
</div>
<div id="speech26" class="character">DUKE FREDERICK</div>
<div class="dialog">
<div id="scene1.3.55">Thou art thy father's daughter; there's enough.</div>
</div>
<div id="speech27" class="character">ROSALIND</div>
<div class="dialog">
<div id="scene1.3.56">So was I when your highness took his dukedom;</div>
<div id="scene1.3.57">So was I when your highness banish'd him:</div>
<div id="scene1.3.58">Treason is not inherited, my lord;</div>
<div id="scene1.3.59">Or, if we did derive it from our friends,</div>
<div id="scene1.3.60">What's that to me? my father was no traitor:</div>
<div id="scene1.3.61">Then, good my liege, mistake me not so much</div>
<div id="scene1.3.62">To think my poverty is treacherous.</div>
</div>
<div id="speech28" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.63">Dear sovereign, hear me speak.</div>
</div>
<div id="speech29" class="character">DUKE FREDERICK</div>
<div class="dialog">
<div id="scene1.3.64">Ay, Celia; we stay'd her for your sake,</div>
<div id="scene1.3.65">Else had she with her father ranged along.</div>
</div>
<div id="speech30" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.66">I did not then entreat to have her stay;</div>
<div id="scene1.3.67">It was your pleasure and your own remorse:</div>
<div id="scene1.3.68">I was too young that time to value her;</div>
<div id="scene1.3.69">But now I know her: if she be a traitor,</div>
<div id="scene1.3.70">Why so am I; we still have slept together,</div>
<div id="scene1.3.71">Rose at an instant, learn'd, play'd, eat together,</div>
<div id="scene1.3.72">And wheresoever we went, like Juno's swans,</div>
<div id="scene1.3.73">Still we went coupled and inseparable.</div>
</div>
<div id="speech31" class="character">DUKE FREDERICK</div>
<div class="dialog">
<div id="scene1.3.74">She is too subtle for thee; and her smoothness,</div>
<div id="scene1.3.75">Her very silence and her patience</div>
<div id="scene1.3.76">Speak to the people, and they pity her.</div>
<div id="scene1.3.77">Thou art a fool: she robs thee of thy name;</div>
<div id="scene1.3.78">And thou wilt show more bright and seem more virtuous</div>
<div id="scene1.3.79">When she is gone. Then open not thy lips:</div>
<div id="scene1.3.80">Firm and irrevocable is my doom</div>
<div id="scene1.3.81">Which I have pass'd upon her; she is banish'd.</div>
</div>
<div id="speech32" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.82">Pronounce that sentence then on me, my liege:</div>
<div id="scene1.3.83">I cannot live out of her company.</div>
</div>
<div id="speech33" class="character">DUKE FREDERICK</div>
<div class="dialog">
<div id="scene1.3.84">You are a fool. You, niece, provide yourself:</div>
<div id="scene1.3.85">If you outstay the time, upon mine honour,</div>
<div id="scene1.3.86">And in the greatness of my word, you die.</div>
<div class="direction">Exeunt DUKE FREDERICK and Lords</div>
</div>
<div id="speech34" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.87">O my poor Rosalind, whither wilt thou go?</div>
<div id="scene1.3.88">Wilt thou change fathers? I will give thee mine.</div>
<div id="scene1.3.89">I charge thee, be not thou more grieved than I am.</div>
</div>
<div id="speech35" class="character">ROSALIND</div>
<div class="dialog">
<div id="scene1.3.90">I have more cause.</div>
</div>
<div id="speech36" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.91"> Thou hast not, cousin;</div>
<div id="scene1.3.92">Prithee be cheerful: know'st thou not, the duke</div>
<div id="scene1.3.93">Hath banish'd me, his daughter?</div>
</div>
<div id="speech37" class="character">ROSALIND</div>
<div class="dialog">
<div id="scene1.3.94">That he hath not.</div>
</div>
<div id="speech38" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.95">No, hath not? Rosalind lacks then the love</div>
<div id="scene1.3.96">Which teacheth thee that thou and I am one:</div>
<div id="scene1.3.97">Shall we be sunder'd? shall we part, sweet girl?</div>
<div id="scene1.3.98">No: let my father seek another heir.</div>
<div id="scene1.3.99">Therefore devise with me how we may fly,</div>
<div id="scene1.3.100">Whither to go and what to bear with us;</div>
<div id="scene1.3.101">And do not seek to take your change upon you,</div>
<div id="scene1.3.102">To bear your griefs yourself and leave me out;</div>
<div id="scene1.3.103">For, by this heaven, now at our sorrows pale,</div>
<div id="scene1.3.104">Say what thou canst, I'll go along with thee.</div>
</div>
<div id="speech39" class="character">ROSALIND</div>
<div class="dialog">
<div id="scene1.3.105">Why, whither shall we go?</div>
</div>
<div id="speech40" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.106">To seek my uncle in the forest of Arden.</div>
</div>
<div id="speech41" class="character">ROSALIND</div>
<div class="dialog">
<div id="scene1.3.107">Alas, what danger will it be to us,</div>
<div id="scene1.3.108">Maids as we are, to travel forth so far!</div>
<div id="scene1.3.109">Beauty provoketh thieves sooner than gold.</div>
</div>
<div id="speech42" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.110">I'll put myself in poor and mean attire</div>
<div id="scene1.3.111">And with a kind of umber smirch my face;</div>
<div id="scene1.3.112">The like do you: so shall we pass along</div>
<div id="scene1.3.113">And never stir assailants.</div>
</div>
<div id="speech43" class="character">ROSALIND</div>
<div class="dialog">
<div id="scene1.3.114">Were it not better,</div>
<div id="scene1.3.115">Because that I am more than common tall,</div>
<div id="scene1.3.116">That I did suit me all points like a man?</div>
<div id="scene1.3.117">A gallant curtle-axe upon my thigh,</div>
<div id="scene1.3.118">A boar-spear in my hand; and--in my heart</div>
<div id="scene1.3.119">Lie there what hidden woman's fear there will--</div>
<div id="scene1.3.120">We'll have a swashing and a martial outside,</div>
<div id="scene1.3.121">As many other mannish cowards have</div>
<div id="scene1.3.122">That do outface it with their semblances.</div>
</div>
<div id="speech44" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.123">What shall I call thee when thou art a man?</div>
</div>
<div id="speech45" class="character">ROSALIND</div>
<div class="dialog">
<div id="scene1.3.124">I'll have no worse a name than Jove's own page;</div>
<div id="scene1.3.125">And therefore look you call me Ganymede.</div>
<div id="scene1.3.126">But what will you be call'd?</div>
</div>
<div id="speech46" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.127">Something that hath a reference to my state</div>
<div id="scene1.3.128">No longer Celia, but Aliena.</div>
</div>
<div id="speech47" class="character">ROSALIND</div>
<div class="dialog">
<div id="scene1.3.129">But, cousin, what if we assay'd to steal</div>
<div id="scene1.3.130">The clownish fool out of your father's court?</div>
<div id="scene1.3.131">Would he not be a comfort to our travel?</div>
</div>
<div id="speech48" class="character">CELIA</div>
<div class="dialog">
<div id="scene1.3.132">He'll go along o'er the wide world with me;</div>
<div id="scene1.3.133">Leave me alone to woo him. Let's away,</div>
<div id="scene1.3.134">And get our jewels and our wealth together,</div>
<div id="scene1.3.135">Devise the fittest time and safest way</div>
<div id="scene1.3.136">To hide us from pursuit that will be made</div>
<div id="scene1.3.137">After my flight. Now go we in content</div>
<div id="scene1.3.138">To liberty and not to banishment.</div>
<div class="direction">Exeunt</div>
</div>
</div>
</div>
</div>
</body>
</html>
'''
# }}}
ae = unittest.TestCase.assertEqual
def test_tokenizer(self): # {{{
tokens = [
type('')(item) for item in tokenize(
r'E\ é > f [a~="y\"x"]:nth(/* fu /]* */-3.7)')]
self.ae(tokens, [
"<IDENT 'E é' at 0>",
"<S ' ' at 4>",
"<DELIM '>' at 5>",
"<S ' ' at 6>",
# the no-break space is not whitespace in CSS
"<IDENT 'f ' at 7>", # f\xa0
"<DELIM '[' at 9>",
"<IDENT 'a' at 10>",
"<DELIM '~' at 11>",
"<DELIM '=' at 12>",
"<STRING 'y\"x' at 13>",
"<DELIM ']' at 19>",
"<DELIM ':' at 20>",
"<IDENT 'nth' at 21>",
"<DELIM '(' at 24>",
"<NUMBER '-3.7' at 37>",
"<DELIM ')' at 41>",
"<EOF at 42>",
])
# }}}
def test_parser(self): # {{{
def repr_parse(css):
selectors = parse(css)
for selector in selectors:
assert selector.pseudo_element is None
return [repr(selector.parsed_tree).replace("(u'", "('")
for selector in selectors]
def parse_many(first, *others):
result = repr_parse(first)
for other in others:
assert repr_parse(other) == result
return result
assert parse_many('*') == ['Element[*]']
assert parse_many('*|*') == ['Element[*]']
assert parse_many('*|foo') == ['Element[foo]']
assert parse_many('foo|*') == ['Element[foo|*]']
assert parse_many('foo|bar') == ['Element[foo|bar]']
# This will never match, but it is valid:
assert parse_many('#foo#bar') == ['Hash[Hash[Element[*]#foo]#bar]']
assert parse_many(
'div>.foo',
'div> .foo',
'div >.foo',
'div > .foo',
'div \n> \t \t .foo', 'div\r>\n\n\n.foo', 'div\f>\f.foo'
) == ['CombinedSelector[Element[div] > Class[Element[*].foo]]']
assert parse_many('td.foo,.bar',
'td.foo, .bar',
'td.foo\t\r\n\f ,\t\r\n\f .bar'
) == [
'Class[Element[td].foo]',
'Class[Element[*].bar]'
]
assert parse_many('div, td.foo, div.bar span') == [
'Element[div]',
'Class[Element[td].foo]',
'CombinedSelector[Class[Element[div].bar] '
'<followed> Element[span]]']
assert parse_many('div > p') == [
'CombinedSelector[Element[div] > Element[p]]']
assert parse_many('td:first') == [
'Pseudo[Element[td]:first]']
assert parse_many('td:first') == [
'Pseudo[Element[td]:first]']
assert parse_many('td :first') == [
'CombinedSelector[Element[td] '
'<followed> Pseudo[Element[*]:first]]']
assert parse_many('td :first') == [
'CombinedSelector[Element[td] '
'<followed> Pseudo[Element[*]:first]]']
assert parse_many('a[name]', 'a[ name\t]') == [
'Attrib[Element[a][name]]']
assert parse_many('a [name]') == [
'CombinedSelector[Element[a] <followed> Attrib[Element[*][name]]]']
self.ae(parse_many('a[rel="include"]', 'a[rel = include]'), [
"Attrib[Element[a][rel = 'include']]"])
assert parse_many("a[hreflang |= 'en']", "a[hreflang|=en]") == [
"Attrib[Element[a][hreflang |= 'en']]"]
self.ae(parse_many('div:nth-child(10)'), [
"Function[Element[div]:nth-child(['10'])]"])
assert parse_many(':nth-child(2n+2)') == [
"Function[Element[*]:nth-child(['2', 'n', '+2'])]"]
assert parse_many('div:nth-of-type(10)') == [
"Function[Element[div]:nth-of-type(['10'])]"]
assert parse_many('div div:nth-of-type(10) .aclass') == [
'CombinedSelector[CombinedSelector[Element[div] <followed> '
"Function[Element[div]:nth-of-type(['10'])]] "
'<followed> Class[Element[*].aclass]]']
assert parse_many('label:only') == [
'Pseudo[Element[label]:only]']
assert parse_many('a:lang(fr)') == [
"Function[Element[a]:lang(['fr'])]"]
assert parse_many('div:contains("foo")') == [
"Function[Element[div]:contains(['foo'])]"]
assert parse_many('div#foobar') == [
'Hash[Element[div]#foobar]']
assert parse_many('div:not(div.foo)') == [
'Negation[Element[div]:not(Class[Element[div].foo])]']
assert parse_many('td ~ th') == [
'CombinedSelector[Element[td] ~ Element[th]]']
# }}}
def test_pseudo_elements(self): # {{{
def parse_pseudo(css):
result = []
for selector in parse(css):
pseudo = selector.pseudo_element
pseudo = type('')(pseudo) if pseudo else pseudo
# No Symbol here
assert pseudo is None or isinstance(pseudo, type(''))
selector = repr(selector.parsed_tree).replace("(u'", "('")
result.append((selector, pseudo))
return result
def parse_one(css):
result = parse_pseudo(css)
assert len(result) == 1
return result[0]
self.ae(parse_one('foo'), ('Element[foo]', None))
self.ae(parse_one('*'), ('Element[*]', None))
self.ae(parse_one(':empty'), ('Pseudo[Element[*]:empty]', None))
# Special cases for CSS 2.1 pseudo-elements
self.ae(parse_one(':BEfore'), ('Element[*]', 'before'))
self.ae(parse_one(':aftER'), ('Element[*]', 'after'))
self.ae(parse_one(':First-Line'), ('Element[*]', 'first-line'))
self.ae(parse_one(':First-Letter'), ('Element[*]', 'first-letter'))
self.ae(parse_one('::befoRE'), ('Element[*]', 'before'))
self.ae(parse_one('::AFter'), ('Element[*]', 'after'))
self.ae(parse_one('::firsT-linE'), ('Element[*]', 'first-line'))
self.ae(parse_one('::firsT-letteR'), ('Element[*]', 'first-letter'))
self.ae(parse_one('::text-content'), ('Element[*]', 'text-content'))
self.ae(parse_one('::attr(name)'), (
"Element[*]", "FunctionalPseudoElement[::attr(['name'])]"))
self.ae(parse_one('::Selection'), ('Element[*]', 'selection'))
self.ae(parse_one('foo:after'), ('Element[foo]', 'after'))
self.ae(parse_one('foo::selection'), ('Element[foo]', 'selection'))
self.ae(parse_one('lorem#ipsum ~ a#b.c[href]:empty::selection'), (
'CombinedSelector[Hash[Element[lorem]#ipsum] ~ '
'Pseudo[Attrib[Class[Hash[Element[a]#b].c][href]]:empty]]',
'selection'))
parse_pseudo('foo:before, bar, baz:after') == [
('Element[foo]', 'before'),
('Element[bar]', None),
('Element[baz]', 'after')]
# }}}
def test_specificity(self): # {{{
def specificity(css):
selectors = parse(css)
assert len(selectors) == 1
return selectors[0].specificity()
assert specificity('*') == (0, 0, 0)
assert specificity(' foo') == (0, 0, 1)
assert specificity(':empty ') == (0, 1, 0)
assert specificity(':before') == (0, 0, 1)
assert specificity('*:before') == (0, 0, 1)
assert specificity(':nth-child(2)') == (0, 1, 0)
assert specificity('.bar') == (0, 1, 0)
assert specificity('[baz]') == (0, 1, 0)
assert specificity('[baz="4"]') == (0, 1, 0)
assert specificity('[baz^="4"]') == (0, 1, 0)
assert specificity('#lipsum') == (1, 0, 0)
assert specificity(':not(*)') == (0, 0, 0)
assert specificity(':not(foo)') == (0, 0, 1)
assert specificity(':not(.foo)') == (0, 1, 0)
assert specificity(':not([foo])') == (0, 1, 0)
assert specificity(':not(:empty)') == (0, 1, 0)
assert specificity(':not(#foo)') == (1, 0, 0)
assert specificity('foo:empty') == (0, 1, 1)
assert specificity('foo:before') == (0, 0, 2)
assert specificity('foo::before') == (0, 0, 2)
assert specificity('foo:empty::before') == (0, 1, 2)
assert specificity('#lorem + foo#ipsum:first-child > bar:first-line'
) == (2, 1, 3)
# }}}
def test_parse_errors(self): # {{{
def get_error(css):
try:
parse(css)
except SelectorSyntaxError:
# Py2, Py3, ...
return str(sys.exc_info()[1]).replace("(u'", "('")
self.ae(get_error('attributes(href)/html/body/a'), (
"Expected selector, got <DELIM '(' at 10>"))
assert get_error('attributes(href)') == (
"Expected selector, got <DELIM '(' at 10>")
assert get_error('html/body/a') == (
"Expected selector, got <DELIM '/' at 4>")
assert get_error(' ') == (
"Expected selector, got <EOF at 1>")
assert get_error('div, ') == (
"Expected selector, got <EOF at 5>")
assert get_error(' , div') == (
"Expected selector, got <DELIM ',' at 1>")
assert get_error('p, , div') == (
"Expected selector, got <DELIM ',' at 3>")
assert get_error('div > ') == (
"Expected selector, got <EOF at 6>")
assert get_error(' > div') == (
"Expected selector, got <DELIM '>' at 2>")
assert get_error('foo|#bar') == (
"Expected ident or '*', got <HASH 'bar' at 4>")
assert get_error('#.foo') == (
"Expected selector, got <DELIM '#' at 0>")
assert get_error('.#foo') == (
"Expected ident, got <HASH 'foo' at 1>")
assert get_error(':#foo') == (
"Expected ident, got <HASH 'foo' at 1>")
assert get_error('[*]') == (
"Expected '|', got <DELIM ']' at 2>")
assert get_error('[foo|]') == (
"Expected ident, got <DELIM ']' at 5>")
assert get_error('[#]') == (
"Expected ident or '*', got <DELIM '#' at 1>")
assert get_error('[foo=#]') == (
"Expected string or ident, got <DELIM '#' at 5>")
assert get_error('[href]a') == (
"Expected selector, got <IDENT 'a' at 6>")
assert get_error('[rel=stylesheet]') is None
assert get_error('[rel:stylesheet]') == (
"Operator expected, got <DELIM ':' at 4>")
assert get_error('[rel=stylesheet') == (
"Expected ']', got <EOF at 15>")
assert get_error(':lang(fr)') is None
assert get_error(':lang(fr') == (
"Expected an argument, got <EOF at 8>")
assert get_error(':contains("foo') == (
"Unclosed string at 10")
assert get_error('foo!') == (
"Expected selector, got <DELIM '!' at 3>")
# Mis-placed pseudo-elements
assert get_error('a:before:empty') == (
"Got pseudo-element ::before not at the end of a selector")
assert get_error('li:before a') == (
"Got pseudo-element ::before not at the end of a selector")
assert get_error(':not(:before)') == (
"Got pseudo-element ::before inside :not() at 12")
assert get_error(':not(:not(a))') == (
"Got nested :not()")
# }}}
def test_select(self): # {{{
document = etree.fromstring(self.HTML_IDS, parser=etree.XMLParser(recover=True, no_network=True, resolve_entities=False))
select = Select(document)
def select_ids(selector):
for elem in select(selector):
yield elem.get('id')
def pcss(main, *selectors, **kwargs):
result = list(select_ids(main))
for selector in selectors:
self.ae(list(select_ids(selector)), result)
return result
all_ids = pcss('*')
self.ae(all_ids[:6], [
'html', None, 'link-href', 'link-nohref', None, 'outer-div'])
self.ae(all_ids[-1:], ['foobar-span'])
self.ae(pcss('div'), ['outer-div', 'li-div', 'foobar-div'])
self.ae(pcss('DIV'), [
'outer-div', 'li-div', 'foobar-div']) # case-insensitive in HTML
self.ae(pcss('div div'), ['li-div'])
self.ae(pcss('div, div div'), ['outer-div', 'li-div', 'foobar-div'])
self.ae(pcss('a[name]'), ['name-anchor'])
self.ae(pcss('a[NAme]'), ['name-anchor']) # case-insensitive in HTML:
self.ae(pcss('a[rel]'), ['tag-anchor', 'nofollow-anchor'])
self.ae(pcss('a[rel="tag"]'), ['tag-anchor'])
self.ae(pcss('a[href*="localhost"]'), ['tag-anchor'])
self.ae(pcss('a[href*=""]'), [])
self.ae(pcss('a[href^="http"]'), ['tag-anchor', 'nofollow-anchor'])
self.ae(pcss('a[href^="http:"]'), ['tag-anchor'])
self.ae(pcss('a[href^=""]'), [])
self.ae(pcss('a[href$="org"]'), ['nofollow-anchor'])
self.ae(pcss('a[href$=""]'), [])
self.ae(pcss('div[foobar~="bc"]', 'div[foobar~="cde"]', skip_webkit=True), ['foobar-div'])
self.ae(pcss('[foobar~="ab bc"]', '[foobar~=""]', '[foobar~=" \t"]'), [])
self.ae(pcss('div[foobar~="cd"]'), [])
self.ae(pcss('*[lang|="En"]', '[lang|="En-us"]'), ['second-li'])
# Attribute values are case sensitive
self.ae(pcss('*[lang|="en"]', '[lang|="en-US"]', skip_webkit=True), [])
self.ae(pcss('*[lang|="e"]'), [])
self.ae(pcss(':lang("EN")', '*:lang(en-US)', skip_webkit=True), ['second-li', 'li-div'])
self.ae(pcss(':lang("e")'), [])
self.ae(pcss('li:nth-child(1)', 'li:first-child'), ['first-li'])
self.ae(pcss('li:nth-child(3)', '#first-li ~ :nth-child(3)'), ['third-li'])
self.ae(pcss('li:nth-child(10)'), [])
self.ae(pcss('li:nth-child(2n)', 'li:nth-child(even)', 'li:nth-child(2n+0)'), ['second-li', 'fourth-li', 'sixth-li'])
self.ae(pcss('li:nth-child(+2n+1)', 'li:nth-child(odd)'), ['first-li', 'third-li', 'fifth-li', 'seventh-li'])
self.ae(pcss('li:nth-child(2n+4)'), ['fourth-li', 'sixth-li'])
self.ae(pcss('li:nth-child(3n+1)'), ['first-li', 'fourth-li', 'seventh-li'])
self.ae(pcss('li:nth-last-child(0)'), [])
self.ae(pcss('li:nth-last-child(1)', 'li:last-child'), ['seventh-li'])
self.ae(pcss('li:nth-last-child(2n)', 'li:nth-last-child(even)'), ['second-li', 'fourth-li', 'sixth-li'])
self.ae(pcss('li:nth-last-child(2n+2)'), ['second-li', 'fourth-li', 'sixth-li'])
self.ae(pcss('ol:first-of-type'), ['first-ol'])
self.ae(pcss('ol:nth-child(1)'), [])
self.ae(pcss('ol:nth-of-type(2)'), ['second-ol'])
self.ae(pcss('ol:nth-last-of-type(1)'), ['second-ol'])
self.ae(pcss('span:only-child'), ['foobar-span'])
self.ae(pcss('li div:only-child'), ['li-div'])
self.ae(pcss('div *:only-child'), ['li-div', 'foobar-span'])
self.ae(pcss('p *:only-of-type', skip_webkit=True), ['p-em', 'fieldset'])
self.ae(pcss('p:only-of-type', skip_webkit=True), ['paragraph'])
self.ae(pcss('a:empty', 'a:EMpty'), ['name-anchor'])
self.ae(pcss('li:empty'), ['third-li', 'fourth-li', 'fifth-li', 'sixth-li'])
self.ae(pcss(':root', 'html:root', 'li:root'), ['html'])
self.ae(pcss('* :root', 'p *:root'), [])
self.ae(pcss('.a', '.b', '*.a', 'ol.a'), ['first-ol'])
self.ae(pcss('.c', '*.c'), ['first-ol', 'third-li', 'fourth-li'])
self.ae(pcss('ol *.c', 'ol li.c', 'li ~ li.c', 'ol > li.c'), [
'third-li', 'fourth-li'])
self.ae(pcss('#first-li', 'li#first-li', '*#first-li'), ['first-li'])
self.ae(pcss('li div', 'li > div', 'div div'), ['li-div'])
self.ae(pcss('div > div'), [])
self.ae(pcss('div>.c', 'div > .c'), ['first-ol'])
self.ae(pcss('div + div'), ['foobar-div'])
self.ae(pcss('a ~ a'), ['tag-anchor', 'nofollow-anchor'])
self.ae(pcss('a[rel="tag"] ~ a'), ['nofollow-anchor'])
self.ae(pcss('ol#first-ol li:last-child'), ['seventh-li'])
self.ae(pcss('ol#first-ol *:last-child'), ['li-div', 'seventh-li'])
self.ae(pcss('#outer-div:first-child'), ['outer-div'])
self.ae(pcss('#outer-div :first-child'), [
'name-anchor', 'first-li', 'li-div', 'p-b',
'checkbox-fieldset-disabled', 'area-href'])
self.ae(pcss('a[href]'), ['tag-anchor', 'nofollow-anchor'])
self.ae(pcss(':not(*)'), [])
self.ae(pcss('a:not([href])'), ['name-anchor'])
self.ae(pcss('ol :Not(li[class])', skip_webkit=True), [
'first-li', 'second-li', 'li-div',
'fifth-li', 'sixth-li', 'seventh-li'])
self.ae(pcss(r'di\a0 v', r'div\['), [])
self.ae(pcss(r'[h\a0 ref]', r'[h\]ref]'), [])
self.assertRaises(ExpressionError, lambda : tuple(select('body:nth-child')))
select = Select(document, ignore_inappropriate_pseudo_classes=True)
self.assertGreater(len(tuple(select('p:hover'))), 0)
def test_select_shakespeare(self):
document = html.document_fromstring(self.HTML_SHAKESPEARE)
select = Select(document)
count = lambda s: sum(1 for r in select(s))
# Data borrowed from http://mootools.net/slickspeed/
# Changed from original; probably because I'm only
self.ae(count('*'), 249)
assert count('div:only-child') == 22 # ?
assert count('div:nth-child(even)') == 106
assert count('div:nth-child(2n)') == 106
assert count('div:nth-child(odd)') == 137
assert count('div:nth-child(2n+1)') == 137
assert count('div:nth-child(n)') == 243
assert count('div:last-child') == 53
assert count('div:first-child') == 51
assert count('div > div') == 242
assert count('div + div') == 190
assert count('div ~ div') == 190
assert count('body') == 1
assert count('body div') == 243
assert count('div') == 243
assert count('div div') == 242
assert count('div div div') == 241
assert count('div, div, div') == 243
assert count('div, a, span') == 243
assert count('.dialog') == 51
assert count('div.dialog') == 51
assert count('div .dialog') == 51
assert count('div.character, div.dialog') == 99
assert count('div.direction.dialog') == 0
assert count('div.dialog.direction') == 0
assert count('div.dialog.scene') == 1
assert count('div.scene.scene') == 1
assert count('div.scene .scene') == 0
assert count('div.direction .dialog ') == 0
assert count('div .dialog .direction') == 4
assert count('div.dialog .dialog .direction') == 4
assert count('#speech5') == 1
assert count('div#speech5') == 1
assert count('div #speech5') == 1
assert count('div.scene div.dialog') == 49
assert count('div#scene1 div.dialog div') == 142
assert count('#scene1 #speech1') == 1
assert count('div[class]') == 103
assert count('div[class=dialog]') == 50
assert count('div[class^=dia]') == 51
assert count('div[class$=log]') == 50
assert count('div[class*=sce]') == 1
assert count('div[class|=dialog]') == 50 # ? Seems right
assert count('div[class~=dialog]') == 51 # ? Seems right
# }}}
# Run tests {{{
def find_tests():
return unittest.defaultTestLoader.loadTestsFromTestCase(TestCSSSelectors)
def run_tests(find_tests=find_tests, for_build=False):
if not for_build:
parser = argparse.ArgumentParser()
parser.add_argument('name', nargs='?', default=None,
help='The name of the test to run')
args = parser.parse_args()
if not for_build and args.name and args.name.startswith('.'):
tests = find_tests()
q = args.name[1:]
if not q.startswith('test_'):
q = 'test_' + q
ans = None
try:
for test in tests:
if test._testMethodName == q:
ans = test
raise StopIteration()
except StopIteration:
pass
if ans is None:
print('No test named %s found' % args.name)
raise SystemExit(1)
tests = ans
else:
tests = unittest.defaultTestLoader.loadTestsFromName(args.name) if not for_build and args.name else find_tests()
r = unittest.TextTestRunner
if for_build:
r = r(verbosity=0, buffer=True, failfast=True)
else:
r = r(verbosity=4)
result = r.run(tests)
if for_build and result.errors or result.failures:
raise SystemExit(1)
if __name__ == '__main__':
run_tests()
# }}}

View File

@@ -0,0 +1,759 @@
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
import os, sys, zipfile, importlib
from calibre.constants import numeric_version, iswindows, isosx
from calibre.ptempfile import PersistentTemporaryFile
from polyglot.builtins import unicode_type
platform = 'linux'
if iswindows:
platform = 'windows'
elif isosx:
platform = 'osx'
class PluginNotFound(ValueError):
pass
class InvalidPlugin(ValueError):
pass
class Plugin(object): # {{{
'''
A calibre plugin. Useful members include:
* ``self.plugin_path``: Stores path to the ZIP file that contains
this plugin or None if it is a builtin
plugin
* ``self.site_customization``: Stores a customization string entered
by the user.
Methods that should be overridden in sub classes:
* :meth:`initialize`
* :meth:`customization_help`
Useful methods:
* :meth:`temporary_file`
* :meth:`__enter__`
* :meth:`load_resources`
'''
#: List of platforms this plugin works on.
#: For example: ``['windows', 'osx', 'linux']``
supported_platforms = []
#: The name of this plugin. You must set it something other
#: than Trivial Plugin for it to work.
name = 'Trivial Plugin'
#: The version of this plugin as a 3-tuple (major, minor, revision)
version = (1, 0, 0)
#: A short string describing what this plugin does
description = _('Does absolutely nothing')
#: The author of this plugin
author = _('Unknown')
#: When more than one plugin exists for a filetype,
#: the plugins are run in order of decreasing priority.
#: Plugins with higher priority will be run first.
#: The highest possible priority is ``sys.maxsize``.
#: Default priority is 1.
priority = 1
#: The earliest version of calibre this plugin requires
minimum_calibre_version = (0, 4, 118)
#: If False, the user will not be able to disable this plugin. Use with
#: care.
can_be_disabled = True
#: The type of this plugin. Used for categorizing plugins in the
#: GUI
type = _('Base')
def __init__(self, plugin_path):
self.plugin_path = plugin_path
self.site_customization = None
def initialize(self):
'''
Called once when calibre plugins are initialized. Plugins are
re-initialized every time a new plugin is added. Also note that if the
plugin is run in a worker process, such as for adding books, then the
plugin will be initialized for every new worker process.
Perform any plugin specific initialization here, such as extracting
resources from the plugin ZIP file. The path to the ZIP file is
available as ``self.plugin_path``.
Note that ``self.site_customization`` is **not** available at this point.
'''
pass
def config_widget(self):
'''
Implement this method and :meth:`save_settings` in your plugin to
use a custom configuration dialog, rather then relying on the simple
string based default customization.
This method, if implemented, must return a QWidget. The widget can have
an optional method validate() that takes no arguments and is called
immediately after the user clicks OK. Changes are applied if and only
if the method returns True.
If for some reason you cannot perform the configuration at this time,
return a tuple of two strings (message, details), these will be
displayed as a warning dialog to the user and the process will be
aborted.
'''
raise NotImplementedError()
def save_settings(self, config_widget):
'''
Save the settings specified by the user with config_widget.
:param config_widget: The widget returned by :meth:`config_widget`.
'''
raise NotImplementedError()
def do_user_config(self, parent=None):
'''
This method shows a configuration dialog for this plugin. It returns
True if the user clicks OK, False otherwise. The changes are
automatically applied.
'''
from PyQt5.Qt import QDialog, QDialogButtonBox, QVBoxLayout, \
QLabel, Qt, QLineEdit
from calibre.gui2 import gprefs
prefname = 'plugin config dialog:'+self.type + ':' + self.name
geom = gprefs.get(prefname, None)
config_dialog = QDialog(parent)
button_box = QDialogButtonBox(QDialogButtonBox.Ok | QDialogButtonBox.Cancel)
v = QVBoxLayout(config_dialog)
def size_dialog():
if geom is None:
config_dialog.resize(config_dialog.sizeHint())
else:
from PyQt5.Qt import QApplication
QApplication.instance().safe_restore_geometry(config_dialog, geom)
button_box.accepted.connect(config_dialog.accept)
button_box.rejected.connect(config_dialog.reject)
config_dialog.setWindowTitle(_('Customize') + ' ' + self.name)
try:
config_widget = self.config_widget()
except NotImplementedError:
config_widget = None
if isinstance(config_widget, tuple):
from calibre.gui2 import warning_dialog
warning_dialog(parent, _('Cannot configure'), config_widget[0],
det_msg=config_widget[1], show=True)
return False
if config_widget is not None:
v.addWidget(config_widget)
v.addWidget(button_box)
size_dialog()
config_dialog.exec_()
if config_dialog.result() == QDialog.Accepted:
if hasattr(config_widget, 'validate'):
if config_widget.validate():
self.save_settings(config_widget)
else:
self.save_settings(config_widget)
else:
from calibre.customize.ui import plugin_customization, \
customize_plugin
help_text = self.customization_help(gui=True)
help_text = QLabel(help_text, config_dialog)
help_text.setWordWrap(True)
help_text.setTextInteractionFlags(Qt.LinksAccessibleByMouse | Qt.LinksAccessibleByKeyboard)
help_text.setOpenExternalLinks(True)
v.addWidget(help_text)
sc = plugin_customization(self)
if not sc:
sc = ''
sc = sc.strip()
sc = QLineEdit(sc, config_dialog)
v.addWidget(sc)
v.addWidget(button_box)
size_dialog()
config_dialog.exec_()
if config_dialog.result() == QDialog.Accepted:
sc = unicode_type(sc.text()).strip()
customize_plugin(self, sc)
geom = bytearray(config_dialog.saveGeometry())
gprefs[prefname] = geom
return config_dialog.result()
def load_resources(self, names):
'''
If this plugin comes in a ZIP file (user added plugin), this method
will allow you to load resources from the ZIP file.
For example to load an image::
pixmap = QPixmap()
pixmap.loadFromData(self.load_resources(['images/icon.png'])['images/icon.png'])
icon = QIcon(pixmap)
:param names: List of paths to resources in the ZIP file using / as separator
:return: A dictionary of the form ``{name: file_contents}``. Any names
that were not found in the ZIP file will not be present in the
dictionary.
'''
if self.plugin_path is None:
raise ValueError('This plugin was not loaded from a ZIP file')
ans = {}
with zipfile.ZipFile(self.plugin_path, 'r') as zf:
for candidate in zf.namelist():
if candidate in names:
ans[candidate] = zf.read(candidate)
return ans
def customization_help(self, gui=False):
'''
Return a string giving help on how to customize this plugin.
By default raise a :class:`NotImplementedError`, which indicates that
the plugin does not require customization.
If you re-implement this method in your subclass, the user will
be asked to enter a string as customization for this plugin.
The customization string will be available as
``self.site_customization``.
Site customization could be anything, for example, the path to
a needed binary on the user's computer.
:param gui: If True return HTML help, otherwise return plain text help.
'''
raise NotImplementedError()
def temporary_file(self, suffix):
'''
Return a file-like object that is a temporary file on the file system.
This file will remain available even after being closed and will only
be removed on interpreter shutdown. Use the ``name`` member of the
returned object to access the full path to the created temporary file.
:param suffix: The suffix that the temporary file will have.
'''
return PersistentTemporaryFile(suffix)
def is_customizable(self):
try:
self.customization_help()
return True
except NotImplementedError:
return False
def __enter__(self, *args):
'''
Add this plugin to the python path so that it's contents become directly importable.
Useful when bundling large python libraries into the plugin. Use it like this::
with plugin:
import something
'''
if self.plugin_path is not None:
from calibre.utils.zipfile import ZipFile
zf = ZipFile(self.plugin_path)
extensions = {x.rpartition('.')[-1].lower() for x in
zf.namelist()}
zip_safe = True
for ext in ('pyd', 'so', 'dll', 'dylib'):
if ext in extensions:
zip_safe = False
break
if zip_safe:
sys.path.insert(0, self.plugin_path)
self.sys_insertion_path = self.plugin_path
else:
from calibre.ptempfile import TemporaryDirectory
self._sys_insertion_tdir = TemporaryDirectory('plugin_unzip')
self.sys_insertion_path = self._sys_insertion_tdir.__enter__(*args)
zf.extractall(self.sys_insertion_path)
sys.path.insert(0, self.sys_insertion_path)
zf.close()
def __exit__(self, *args):
ip, it = getattr(self, 'sys_insertion_path', None), getattr(self,
'_sys_insertion_tdir', None)
if ip in sys.path:
sys.path.remove(ip)
if hasattr(it, '__exit__'):
it.__exit__(*args)
def cli_main(self, args):
'''
This method is the main entry point for your plugins command line
interface. It is called when the user does: calibre-debug -r "Plugin
Name". Any arguments passed are present in the args variable.
'''
raise NotImplementedError('The %s plugin has no command line interface'
%self.name)
# }}}
class FileTypePlugin(Plugin): # {{{
'''
A plugin that is associated with a particular set of file types.
'''
#: Set of file types for which this plugin should be run.
#: Use '*' for all file types.
#: For example: ``{'lit', 'mobi', 'prc'}``
file_types = set()
#: If True, this plugin is run when books are added
#: to the database
on_import = False
#: If True, this plugin is run after books are added
#: to the database. In this case the postimport and postadd
#: methods of the plugin are called.
on_postimport = False
#: If True, this plugin is run just before a conversion
on_preprocess = False
#: If True, this plugin is run after conversion
#: on the final file produced by the conversion output plugin.
on_postprocess = False
type = _('File type')
def run(self, path_to_ebook):
'''
Run the plugin. Must be implemented in subclasses.
It should perform whatever modifications are required
on the e-book and return the absolute path to the
modified e-book. If no modifications are needed, it should
return the path to the original e-book. If an error is encountered
it should raise an Exception. The default implementation
simply return the path to the original e-book. Note that the path to
the original file (before any file type plugins are run, is available as
self.original_path_to_file).
The modified e-book file should be created with the
:meth:`temporary_file` method.
:param path_to_ebook: Absolute path to the e-book.
:return: Absolute path to the modified e-book.
'''
# Default implementation does nothing
return path_to_ebook
def postimport(self, book_id, book_format, db):
'''
Called post import, i.e., after the book file has been added to the database. Note that
this is different from :meth:`postadd` which is called when the book record is created for
the first time. This method is called whenever a new file is added to a book record. It is
useful for modifying the book record based on the contents of the newly added file.
:param book_id: Database id of the added book.
:param book_format: The file type of the book that was added.
:param db: Library database.
'''
pass # Default implementation does nothing
def postadd(self, book_id, fmt_map, db):
'''
Called post add, i.e. after a book has been added to the db. Note that
this is different from :meth:`postimport`, which is called after a single book file
has been added to a book. postadd() is called only when an entire book record
with possibly more than one book file has been created for the first time.
This is useful if you wish to modify the book record in the database when the
book is first added to calibre.
:param book_id: Database id of the added book.
:param fmt_map: Map of file format to path from which the file format
was added. Note that this might or might not point to an actual
existing file, as sometimes files are added as streams. In which case
it might be a dummy value or a non-existent path.
:param db: Library database
'''
pass # Default implementation does nothing
# }}}
class MetadataReaderPlugin(Plugin): # {{{
'''
A plugin that implements reading metadata from a set of file types.
'''
#: Set of file types for which this plugin should be run.
#: For example: ``set(['lit', 'mobi', 'prc'])``
file_types = set()
supported_platforms = ['windows', 'osx', 'linux']
version = numeric_version
author = 'Kovid Goyal'
type = _('Metadata reader')
def __init__(self, *args, **kwargs):
Plugin.__init__(self, *args, **kwargs)
self.quick = False
def get_metadata(self, stream, type):
'''
Return metadata for the file represented by stream (a file like object
that supports reading). Raise an exception when there is an error
with the input data.
:param type: The type of file. Guaranteed to be one of the entries
in :attr:`file_types`.
:return: A :class:`calibre.ebooks.metadata.book.Metadata` object
'''
return None
# }}}
class MetadataWriterPlugin(Plugin): # {{{
'''
A plugin that implements reading metadata from a set of file types.
'''
#: Set of file types for which this plugin should be run.
#: For example: ``set(['lit', 'mobi', 'prc'])``
file_types = set()
supported_platforms = ['windows', 'osx', 'linux']
version = numeric_version
author = 'Kovid Goyal'
type = _('Metadata writer')
def __init__(self, *args, **kwargs):
Plugin.__init__(self, *args, **kwargs)
self.apply_null = False
def set_metadata(self, stream, mi, type):
'''
Set metadata for the file represented by stream (a file like object
that supports reading). Raise an exception when there is an error
with the input data.
:param type: The type of file. Guaranteed to be one of the entries
in :attr:`file_types`.
:param mi: A :class:`calibre.ebooks.metadata.book.Metadata` object
'''
pass
# }}}
class CatalogPlugin(Plugin): # {{{
'''
A plugin that implements a catalog generator.
'''
resources_path = None
#: Output file type for which this plugin should be run.
#: For example: 'epub' or 'xml'
file_types = set()
type = _('Catalog generator')
#: CLI parser options specific to this plugin, declared as namedtuple Option:
#:
#: from collections import namedtuple
#: Option = namedtuple('Option', 'option, default, dest, help')
#: cli_options = [Option('--catalog-title', default = 'My Catalog',
#: dest = 'catalog_title', help = (_('Title of generated catalog. \nDefault:') + " '" + '%default' + "'"))]
#: cli_options parsed in calibre.db.cli.cmd_catalog:option_parser()
#:
cli_options = []
def _field_sorter(self, key):
'''
Custom fields sort after standard fields
'''
if key.startswith('#'):
return '~%s' % key[1:]
else:
return key
def search_sort_db(self, db, opts):
db.search(opts.search_text)
if opts.sort_by:
# 2nd arg = ascending
db.sort(opts.sort_by, True)
return db.get_data_as_dict(ids=opts.ids)
def get_output_fields(self, db, opts):
# Return a list of requested fields
all_std_fields = {'author_sort','authors','comments','cover','formats',
'id','isbn','library_name','ondevice','pubdate','publisher',
'rating','series_index','series','size','tags','timestamp',
'title_sort','title','uuid','languages','identifiers'}
all_custom_fields = set(db.custom_field_keys())
for field in list(all_custom_fields):
fm = db.field_metadata[field]
if fm['datatype'] == 'series':
all_custom_fields.add(field+'_index')
all_fields = all_std_fields.union(all_custom_fields)
if opts.fields != 'all':
# Make a list from opts.fields
of = [x.strip() for x in opts.fields.split(',')]
requested_fields = set(of)
# Validate requested_fields
if requested_fields - all_fields:
from calibre.library import current_library_name
invalid_fields = sorted(list(requested_fields - all_fields))
print("invalid --fields specified: %s" % ', '.join(invalid_fields))
print("available fields in '%s': %s" %
(current_library_name(), ', '.join(sorted(list(all_fields)))))
raise ValueError("unable to generate catalog with specified fields")
fields = [x for x in of if x in all_fields]
else:
fields = sorted(all_fields, key=self._field_sorter)
if not opts.connected_device['is_device_connected'] and 'ondevice' in fields:
fields.pop(int(fields.index('ondevice')))
return fields
def initialize(self):
'''
If plugin is not a built-in, copy the plugin's .ui and .py files from
the ZIP file to $TMPDIR.
Tab will be dynamically generated and added to the Catalog Options dialog in
calibre.gui2.dialogs.catalog.py:Catalog
'''
from calibre.customize.builtins import plugins as builtin_plugins
from calibre.customize.ui import config
from calibre.ptempfile import PersistentTemporaryDirectory
if not type(self) in builtin_plugins and self.name not in config['disabled_plugins']:
files_to_copy = ["%s.%s" % (self.name.lower(),ext) for ext in ["ui","py"]]
resources = zipfile.ZipFile(self.plugin_path,'r')
if self.resources_path is None:
self.resources_path = PersistentTemporaryDirectory('_plugin_resources', prefix='')
for file in files_to_copy:
try:
resources.extract(file, self.resources_path)
except:
print(" customize:__init__.initialize(): %s not found in %s" % (file, os.path.basename(self.plugin_path)))
continue
resources.close()
def run(self, path_to_output, opts, db, ids, notification=None):
'''
Run the plugin. Must be implemented in subclasses.
It should generate the catalog in the format specified
in file_types, returning the absolute path to the
generated catalog file. If an error is encountered
it should raise an Exception.
The generated catalog file should be created with the
:meth:`temporary_file` method.
:param path_to_output: Absolute path to the generated catalog file.
:param opts: A dictionary of keyword arguments
:param db: A LibraryDatabase2 object
'''
# Default implementation does nothing
raise NotImplementedError('CatalogPlugin.generate_catalog() default '
'method, should be overridden in subclass')
# }}}
class InterfaceActionBase(Plugin): # {{{
supported_platforms = ['windows', 'osx', 'linux']
author = 'Kovid Goyal'
type = _('User interface action')
can_be_disabled = False
actual_plugin = None
def __init__(self, *args, **kwargs):
Plugin.__init__(self, *args, **kwargs)
self.actual_plugin_ = None
def load_actual_plugin(self, gui):
'''
This method must return the actual interface action plugin object.
'''
ac = self.actual_plugin_
if ac is None:
mod, cls = self.actual_plugin.split(':')
ac = getattr(importlib.import_module(mod), cls)(gui,
self.site_customization)
self.actual_plugin_ = ac
return ac
# }}}
class PreferencesPlugin(Plugin): # {{{
'''
A plugin representing a widget displayed in the Preferences dialog.
This plugin has only one important method :meth:`create_widget`. The
various fields of the plugin control how it is categorized in the UI.
'''
supported_platforms = ['windows', 'osx', 'linux']
author = 'Kovid Goyal'
type = _('Preferences')
can_be_disabled = False
#: Import path to module that contains a class named ConfigWidget
#: which implements the ConfigWidgetInterface. Used by
#: :meth:`create_widget`.
config_widget = None
#: Where in the list of categories the :attr:`category` of this plugin should be.
category_order = 100
#: Where in the list of names in a category, the :attr:`gui_name` of this
#: plugin should be
name_order = 100
#: The category this plugin should be in
category = None
#: The category name displayed to the user for this plugin
gui_category = None
#: The name displayed to the user for this plugin
gui_name = None
#: The icon for this plugin, should be an absolute path
icon = None
#: The description used for tooltips and the like
description = None
def create_widget(self, parent=None):
'''
Create and return the actual Qt widget used for setting this group of
preferences. The widget must implement the
:class:`calibre.gui2.preferences.ConfigWidgetInterface`.
The default implementation uses :attr:`config_widget` to instantiate
the widget.
'''
base, _, wc = self.config_widget.partition(':')
if not wc:
wc = 'ConfigWidget'
base = importlib.import_module(base)
widget = getattr(base, wc)
return widget(parent)
# }}}
class StoreBase(Plugin): # {{{
supported_platforms = ['windows', 'osx', 'linux']
author = 'John Schember'
type = _('Store')
# Information about the store. Should be in the primary language
# of the store. This should not be translatable when set by
# a subclass.
description = _('An e-book store.')
minimum_calibre_version = (0, 8, 0)
version = (1, 0, 1)
actual_plugin = None
# Does the store only distribute e-books without DRM.
drm_free_only = False
# This is the 2 letter country code for the corporate
# headquarters of the store.
headquarters = ''
# All formats the store distributes e-books in.
formats = []
# Is this store on an affiliate program?
affiliate = False
def load_actual_plugin(self, gui):
'''
This method must return the actual interface action plugin object.
'''
mod, cls = self.actual_plugin.split(':')
self.actual_plugin_object = getattr(importlib.import_module(mod), cls)(gui, self.name)
return self.actual_plugin_object
def customization_help(self, gui=False):
if getattr(self, 'actual_plugin_object', None) is not None:
return self.actual_plugin_object.customization_help(gui)
raise NotImplementedError()
def config_widget(self):
if getattr(self, 'actual_plugin_object', None) is not None:
return self.actual_plugin_object.config_widget()
raise NotImplementedError()
def save_settings(self, config_widget):
if getattr(self, 'actual_plugin_object', None) is not None:
return self.actual_plugin_object.save_settings(config_widget)
raise NotImplementedError()
# }}}
class EditBookToolPlugin(Plugin): # {{{
type = _('Edit book tool')
minimum_calibre_version = (1, 46, 0)
# }}}
class LibraryClosedPlugin(Plugin): # {{{
'''
LibraryClosedPlugins are run when a library is closed, either at shutdown,
when the library is changed, or when a library is used in some other way.
At the moment these plugins won't be called by the CLI functions.
'''
type = _('Library closed')
# minimum version 2.54 because that is when support was added
minimum_calibre_version = (2, 54, 0)
def run(self, db):
'''
The db will be a reference to the new_api (db.cache.py).
The plugin must run to completion. It must not use the GUI, threads, or
any signals.
'''
raise NotImplementedError('LibraryClosedPlugin '
'run method must be overridden in subclass')
# }}}

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,376 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
'''
Defines the plugin system for conversions.
'''
import re, os, shutil, numbers
from calibre import CurrentDir
from calibre.customize import Plugin
from polyglot.builtins import unicode_type
class ConversionOption(object):
'''
Class representing conversion options
'''
def __init__(self, name=None, help=None, long_switch=None,
short_switch=None, choices=None):
self.name = name
self.help = help
self.long_switch = long_switch
self.short_switch = short_switch
self.choices = choices
if self.long_switch is None:
self.long_switch = self.name.replace('_', '-')
self.validate_parameters()
def validate_parameters(self):
'''
Validate the parameters passed to :meth:`__init__`.
'''
if re.match(r'[a-zA-Z_]([a-zA-Z0-9_])*', self.name) is None:
raise ValueError(self.name + ' is not a valid Python identifier')
if not self.help:
raise ValueError('You must set the help text')
def __hash__(self):
return hash(self.name)
def __eq__(self, other):
return self.name == getattr(other, 'name', other)
def clone(self):
return ConversionOption(name=self.name, help=self.help,
long_switch=self.long_switch, short_switch=self.short_switch,
choices=self.choices)
class OptionRecommendation(object):
LOW = 1
MED = 2
HIGH = 3
def __init__(self, recommended_value=None, level=LOW, **kwargs):
'''
An option recommendation. That is, an option as well as its recommended
value and the level of the recommendation.
'''
self.level = level
self.recommended_value = recommended_value
self.option = kwargs.pop('option', None)
if self.option is None:
self.option = ConversionOption(**kwargs)
self.validate_parameters()
@property
def help(self):
return self.option.help
def clone(self):
return OptionRecommendation(recommended_value=self.recommended_value,
level=self.level, option=self.option.clone())
def validate_parameters(self):
if self.option.choices and self.recommended_value not in \
self.option.choices:
raise ValueError('OpRec: %s: Recommended value not in choices'%
self.option.name)
if not (isinstance(self.recommended_value, (numbers.Number, bytes, unicode_type)) or self.recommended_value is None):
raise ValueError('OpRec: %s:'%self.option.name + repr(
self.recommended_value) + ' is not a string or a number')
class DummyReporter(object):
def __init__(self):
self.cancel_requested = False
def __call__(self, percent, msg=''):
pass
def gui_configuration_widget(name, parent, get_option_by_name,
get_option_help, db, book_id, for_output=True):
import importlib
def widget_factory(cls):
return cls(parent, get_option_by_name,
get_option_help, db, book_id)
if for_output:
try:
output_widget = importlib.import_module(
'calibre.gui2.convert.'+name)
pw = output_widget.PluginWidget
pw.ICON = I('back.png')
pw.HELP = _('Options specific to the output format.')
return widget_factory(pw)
except ImportError:
pass
else:
try:
input_widget = importlib.import_module(
'calibre.gui2.convert.'+name)
pw = input_widget.PluginWidget
pw.ICON = I('forward.png')
pw.HELP = _('Options specific to the input format.')
return widget_factory(pw)
except ImportError:
pass
return None
class InputFormatPlugin(Plugin):
'''
InputFormatPlugins are responsible for converting a document into
HTML+OPF+CSS+etc.
The results of the conversion *must* be encoded in UTF-8.
The main action happens in :meth:`convert`.
'''
type = _('Conversion input')
can_be_disabled = False
supported_platforms = ['windows', 'osx', 'linux']
commit_name = None # unique name under which options for this plugin are saved
ui_data = None
#: Set of file types for which this plugin should be run
#: For example: ``set(['azw', 'mobi', 'prc'])``
file_types = set()
#: If True, this input plugin generates a collection of images,
#: one per HTML file. This can be set dynamically, in the convert method
#: if the input files can be both image collections and non-image collections.
#: If you set this to True, you must implement the get_images() method that returns
#: a list of images.
is_image_collection = False
#: Number of CPU cores used by this plugin.
#: A value of -1 means that it uses all available cores
core_usage = 1
#: If set to True, the input plugin will perform special processing
#: to make its output suitable for viewing
for_viewer = False
#: The encoding that this input plugin creates files in. A value of
#: None means that the encoding is undefined and must be
#: detected individually
output_encoding = 'utf-8'
#: Options shared by all Input format plugins. Do not override
#: in sub-classes. Use :attr:`options` instead. Every option must be an
#: instance of :class:`OptionRecommendation`.
common_options = {
OptionRecommendation(name='input_encoding',
recommended_value=None, level=OptionRecommendation.LOW,
help=_('Specify the character encoding of the input document. If '
'set this option will override any encoding declared by the '
'document itself. Particularly useful for documents that '
'do not declare an encoding or that have erroneous '
'encoding declarations.')
)}
#: Options to customize the behavior of this plugin. Every option must be an
#: instance of :class:`OptionRecommendation`.
options = set()
#: A set of 3-tuples of the form
#: (option_name, recommended_value, recommendation_level)
recommendations = set()
def __init__(self, *args):
Plugin.__init__(self, *args)
self.report_progress = DummyReporter()
def get_images(self):
'''
Return a list of absolute paths to the images, if this input plugin
represents an image collection. The list of images is in the same order
as the spine and the TOC.
'''
raise NotImplementedError()
def convert(self, stream, options, file_ext, log, accelerators):
'''
This method must be implemented in sub-classes. It must return
the path to the created OPF file or an :class:`OEBBook` instance.
All output should be contained in the current directory.
If this plugin creates files outside the current
directory they must be deleted/marked for deletion before this method
returns.
:param stream: A file like object that contains the input file.
:param options: Options to customize the conversion process.
Guaranteed to have attributes corresponding
to all the options declared by this plugin. In
addition, it will have a verbose attribute that
takes integral values from zero upwards. Higher numbers
mean be more verbose. Another useful attribute is
``input_profile`` that is an instance of
:class:`calibre.customize.profiles.InputProfile`.
:param file_ext: The extension (without the .) of the input file. It
is guaranteed to be one of the `file_types` supported
by this plugin.
:param log: A :class:`calibre.utils.logging.Log` object. All output
should use this object.
:param accelarators: A dictionary of various information that the input
plugin can get easily that would speed up the
subsequent stages of the conversion.
'''
raise NotImplementedError()
def __call__(self, stream, options, file_ext, log,
accelerators, output_dir):
try:
log('InputFormatPlugin: %s running'%self.name)
if hasattr(stream, 'name'):
log('on', stream.name)
except:
# In case stdout is broken
pass
with CurrentDir(output_dir):
for x in os.listdir('.'):
shutil.rmtree(x) if os.path.isdir(x) else os.remove(x)
ret = self.convert(stream, options, file_ext,
log, accelerators)
return ret
def postprocess_book(self, oeb, opts, log):
'''
Called to allow the input plugin to perform postprocessing after
the book has been parsed.
'''
pass
def specialize(self, oeb, opts, log, output_fmt):
'''
Called to allow the input plugin to specialize the parsed book
for a particular output format. Called after postprocess_book
and before any transforms are performed on the parsed book.
'''
pass
def gui_configuration_widget(self, parent, get_option_by_name,
get_option_help, db, book_id=None):
'''
Called to create the widget used for configuring this plugin in the
calibre GUI. The widget must be an instance of the PluginWidget class.
See the builtin input plugins for examples.
'''
name = self.name.lower().replace(' ', '_')
return gui_configuration_widget(name, parent, get_option_by_name,
get_option_help, db, book_id, for_output=False)
class OutputFormatPlugin(Plugin):
'''
OutputFormatPlugins are responsible for converting an OEB document
(OPF+HTML) into an output e-book.
The OEB document can be assumed to be encoded in UTF-8.
The main action happens in :meth:`convert`.
'''
type = _('Conversion output')
can_be_disabled = False
supported_platforms = ['windows', 'osx', 'linux']
commit_name = None # unique name under which options for this plugin are saved
ui_data = None
#: The file type (extension without leading period) that this
#: plugin outputs
file_type = None
#: Options shared by all Input format plugins. Do not override
#: in sub-classes. Use :attr:`options` instead. Every option must be an
#: instance of :class:`OptionRecommendation`.
common_options = {
OptionRecommendation(name='pretty_print',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('If specified, the output plugin will try to create output '
'that is as human readable as possible. May not have any effect '
'for some output plugins.')
)}
#: Options to customize the behavior of this plugin. Every option must be an
#: instance of :class:`OptionRecommendation`.
options = set()
#: A set of 3-tuples of the form
#: (option_name, recommended_value, recommendation_level)
recommendations = set()
@property
def description(self):
return _('Convert e-books to the %s format')%self.file_type
def __init__(self, *args):
Plugin.__init__(self, *args)
self.report_progress = DummyReporter()
def convert(self, oeb_book, output, input_plugin, opts, log):
'''
Render the contents of `oeb_book` (which is an instance of
:class:`calibre.ebooks.oeb.OEBBook`) to the file specified by output.
:param output: Either a file like object or a string. If it is a string
it is the path to a directory that may or may not exist. The output
plugin should write its output into that directory. If it is a file like
object, the output plugin should write its output into the file.
:param input_plugin: The input plugin that was used at the beginning of
the conversion pipeline.
:param opts: Conversion options. Guaranteed to have attributes
corresponding to the OptionRecommendations of this plugin.
:param log: The logger. Print debug/info messages etc. using this.
'''
raise NotImplementedError()
@property
def is_periodical(self):
return self.oeb.metadata.publication_type and \
unicode_type(self.oeb.metadata.publication_type[0]).startswith('periodical:')
def specialize_options(self, log, opts, input_fmt):
'''
Can be used to change the values of conversion options, as used by the
conversion pipeline.
'''
pass
def specialize_css_for_output(self, log, opts, item, stylizer):
'''
Can be used to make changes to the css during the CSS flattening
process.
:param item: The item (HTML file) being processed
:param stylizer: A Stylizer object containing the flattened styles for
item. You can get the style for any element by
stylizer.style(element).
'''
pass
def gui_configuration_widget(self, parent, get_option_by_name,
get_option_help, db, book_id=None):
'''
Called to create the widget used for configuring this plugin in the
calibre GUI. The widget must be an instance of the PluginWidget class.
See the builtin output plugins for examples.
'''
name = self.name.lower().replace(' ', '_')
return gui_configuration_widget(name, parent, get_option_by_name,
get_option_help, db, book_id, for_output=True)

View File

@@ -0,0 +1,873 @@
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
from calibre.customize import Plugin as _Plugin
from polyglot.builtins import zip
FONT_SIZES = [('xx-small', 1),
('x-small', None),
('small', 2),
('medium', 3),
('large', 4),
('x-large', 5),
('xx-large', 6),
(None, 7)]
class Plugin(_Plugin):
fbase = 12
fsizes = [5, 7, 9, 12, 13.5, 17, 20, 22, 24]
screen_size = (1600, 1200)
dpi = 100
def __init__(self, *args, **kwargs):
_Plugin.__init__(self, *args, **kwargs)
self.width, self.height = self.screen_size
fsizes = list(self.fsizes)
self.fkey = list(self.fsizes)
self.fsizes = []
for (name, num), size in zip(FONT_SIZES, fsizes):
self.fsizes.append((name, num, float(size)))
self.fnames = dict((name, sz) for name, _, sz in self.fsizes if name)
self.fnums = dict((num, sz) for _, num, sz in self.fsizes if num)
self.width_pts = self.width * 72./self.dpi
self.height_pts = self.height * 72./self.dpi
# Input profiles {{{
class InputProfile(Plugin):
author = 'Kovid Goyal'
supported_platforms = {'windows', 'osx', 'linux'}
can_be_disabled = False
type = _('Input profile')
name = 'Default Input Profile'
short_name = 'default' # Used in the CLI so dont use spaces etc. in it
description = _('This profile tries to provide sane defaults and is useful '
'if you know nothing about the input document.')
class SonyReaderInput(InputProfile):
name = 'Sony Reader'
short_name = 'sony'
description = _('This profile is intended for the SONY PRS line. '
'The 500/505/600/700 etc.')
screen_size = (584, 754)
dpi = 168.451
fbase = 12
fsizes = [7.5, 9, 10, 12, 15.5, 20, 22, 24]
class SonyReader300Input(SonyReaderInput):
name = 'Sony Reader 300'
short_name = 'sony300'
description = _('This profile is intended for the SONY PRS 300.')
dpi = 200
class SonyReader900Input(SonyReaderInput):
author = 'John Schember'
name = 'Sony Reader 900'
short_name = 'sony900'
description = _('This profile is intended for the SONY PRS-900.')
screen_size = (584, 978)
class MSReaderInput(InputProfile):
name = 'Microsoft Reader'
short_name = 'msreader'
description = _('This profile is intended for the Microsoft Reader.')
screen_size = (480, 652)
dpi = 96
fbase = 13
fsizes = [10, 11, 13, 16, 18, 20, 22, 26]
class MobipocketInput(InputProfile):
name = 'Mobipocket Books'
short_name = 'mobipocket'
description = _('This profile is intended for the Mobipocket books.')
# Unfortunately MOBI books are not narrowly targeted, so this information is
# quite likely to be spurious
screen_size = (600, 800)
dpi = 96
fbase = 18
fsizes = [14, 14, 16, 18, 20, 22, 24, 26]
class HanlinV3Input(InputProfile):
name = 'Hanlin V3'
short_name = 'hanlinv3'
description = _('This profile is intended for the Hanlin V3 and its clones.')
# Screen size is a best guess
screen_size = (584, 754)
dpi = 168.451
fbase = 16
fsizes = [12, 12, 14, 16, 18, 20, 22, 24]
class HanlinV5Input(HanlinV3Input):
name = 'Hanlin V5'
short_name = 'hanlinv5'
description = _('This profile is intended for the Hanlin V5 and its clones.')
# Screen size is a best guess
screen_size = (584, 754)
dpi = 200
class CybookG3Input(InputProfile):
name = 'Cybook G3'
short_name = 'cybookg3'
description = _('This profile is intended for the Cybook G3.')
# Screen size is a best guess
screen_size = (600, 800)
dpi = 168.451
fbase = 16
fsizes = [12, 12, 14, 16, 18, 20, 22, 24]
class CybookOpusInput(InputProfile):
author = 'John Schember'
name = 'Cybook Opus'
short_name = 'cybook_opus'
description = _('This profile is intended for the Cybook Opus.')
# Screen size is a best guess
screen_size = (600, 800)
dpi = 200
fbase = 16
fsizes = [12, 12, 14, 16, 18, 20, 22, 24]
class KindleInput(InputProfile):
name = 'Kindle'
short_name = 'kindle'
description = _('This profile is intended for the Amazon Kindle.')
# Screen size is a best guess
screen_size = (525, 640)
dpi = 168.451
fbase = 16
fsizes = [12, 12, 14, 16, 18, 20, 22, 24]
class IlliadInput(InputProfile):
name = 'Illiad'
short_name = 'illiad'
description = _('This profile is intended for the Irex Illiad.')
screen_size = (760, 925)
dpi = 160.0
fbase = 12
fsizes = [7.5, 9, 10, 12, 15.5, 20, 22, 24]
class IRexDR1000Input(InputProfile):
author = 'John Schember'
name = 'IRex Digital Reader 1000'
short_name = 'irexdr1000'
description = _('This profile is intended for the IRex Digital Reader 1000.')
# Screen size is a best guess
screen_size = (1024, 1280)
dpi = 160
fbase = 16
fsizes = [12, 14, 16, 18, 20, 22, 24]
class IRexDR800Input(InputProfile):
author = 'Eric Cronin'
name = 'IRex Digital Reader 800'
short_name = 'irexdr800'
description = _('This profile is intended for the IRex Digital Reader 800.')
screen_size = (768, 1024)
dpi = 160
fbase = 16
fsizes = [12, 14, 16, 18, 20, 22, 24]
class NookInput(InputProfile):
author = 'John Schember'
name = 'Nook'
short_name = 'nook'
description = _('This profile is intended for the B&N Nook.')
# Screen size is a best guess
screen_size = (600, 800)
dpi = 167
fbase = 16
fsizes = [12, 12, 14, 16, 18, 20, 22, 24]
input_profiles = [InputProfile, SonyReaderInput, SonyReader300Input,
SonyReader900Input, MSReaderInput, MobipocketInput, HanlinV3Input,
HanlinV5Input, CybookG3Input, CybookOpusInput, KindleInput, IlliadInput,
IRexDR1000Input, IRexDR800Input, NookInput]
input_profiles.sort(key=lambda x: x.name.lower())
# }}}
class OutputProfile(Plugin):
author = 'Kovid Goyal'
supported_platforms = {'windows', 'osx', 'linux'}
can_be_disabled = False
type = _('Output profile')
name = 'Default Output Profile'
short_name = 'default' # Used in the CLI so dont use spaces etc. in it
description = _('This profile tries to provide sane defaults and is useful '
'if you want to produce a document intended to be read at a '
'computer or on a range of devices.')
#: The image size for comics
comic_screen_size = (584, 754)
#: If True the MOBI renderer on the device supports MOBI indexing
supports_mobi_indexing = False
#: If True output should be optimized for a touchscreen interface
touchscreen = False
touchscreen_news_css = ''
#: A list of extra (beyond CSS 2.1) modules supported by the device
#: Format is a css_parser profile dictionary (see iPad for example)
extra_css_modules = []
#: If True, the date is appended to the title of downloaded news
periodical_date_in_title = True
#: Characters used in jackets and catalogs
ratings_char = '*'
empty_ratings_char = ' '
#: Unsupported unicode characters to be replaced during preprocessing
unsupported_unicode_chars = []
#: Number of ems that the left margin of a blockquote is rendered as
mobi_ems_per_blockquote = 1.0
#: Special periodical formatting needed in EPUB
epub_periodical_format = None
class iPadOutput(OutputProfile):
name = 'iPad'
short_name = 'ipad'
description = _('Intended for the iPad and similar devices with a '
'resolution of 768x1024')
screen_size = (768, 1024)
comic_screen_size = (768, 1024)
dpi = 132.0
extra_css_modules = [
{
'name':'webkit',
'props': {'-webkit-border-bottom-left-radius':'{length}',
'-webkit-border-bottom-right-radius':'{length}',
'-webkit-border-top-left-radius':'{length}',
'-webkit-border-top-right-radius':'{length}',
'-webkit-border-radius': r'{border-width}(\s+{border-width}){0,3}|inherit',
},
'macros': {'border-width': '{length}|medium|thick|thin'}
}
]
ratings_char = '\u2605' # filled star
empty_ratings_char = '\u2606' # hollow star
touchscreen = True
# touchscreen_news_css {{{
touchscreen_news_css = '''
/* hr used in articles */
.article_articles_list {
width:18%;
}
.article_link {
color: #593f29;
font-style: italic;
}
.article_next {
-webkit-border-top-right-radius:4px;
-webkit-border-bottom-right-radius:4px;
font-style: italic;
width:32%;
}
.article_prev {
-webkit-border-top-left-radius:4px;
-webkit-border-bottom-left-radius:4px;
font-style: italic;
width:32%;
}
.article_sections_list {
width:18%;
}
.articles_link {
font-weight: bold;
}
.sections_link {
font-weight: bold;
}
.caption_divider {
border:#ccc 1px solid;
}
.touchscreen_navbar {
background:#c3bab2;
border:#ccc 0px solid;
border-collapse:separate;
border-spacing:1px;
margin-left: 5%;
margin-right: 5%;
page-break-inside:avoid;
width: 90%;
-webkit-border-radius:4px;
}
.touchscreen_navbar td {
background:#fff;
font-family:Helvetica;
font-size:80%;
/* UI touchboxes use 8px padding */
padding: 6px;
text-align:center;
}
.touchscreen_navbar td a:link {
color: #593f29;
text-decoration: none;
}
/* Index formatting */
.publish_date {
text-align:center;
}
.divider {
border-bottom:1em solid white;
border-top:1px solid gray;
}
hr.caption_divider {
border-color:black;
border-style:solid;
border-width:1px;
}
/* Feed summary formatting */
.article_summary {
display:inline-block;
padding-bottom:0.5em;
}
.feed {
font-family:sans-serif;
font-weight:bold;
font-size:larger;
}
.feed_link {
font-style: italic;
}
.feed_next {
-webkit-border-top-right-radius:4px;
-webkit-border-bottom-right-radius:4px;
font-style: italic;
width:40%;
}
.feed_prev {
-webkit-border-top-left-radius:4px;
-webkit-border-bottom-left-radius:4px;
font-style: italic;
width:40%;
}
.feed_title {
text-align: center;
font-size: 160%;
}
.feed_up {
font-weight: bold;
width:20%;
}
.summary_headline {
font-weight:bold;
text-align:left;
}
.summary_byline {
text-align:left;
font-family:monospace;
}
.summary_text {
text-align:left;
}
'''
# }}}
class iPad3Output(iPadOutput):
screen_size = comic_screen_size = (2048, 1536)
dpi = 264.0
name = 'iPad 3'
short_name = 'ipad3'
description = _('Intended for the iPad 3 and similar devices with a '
'resolution of 1536x2048')
class TabletOutput(iPadOutput):
name = 'Tablet'
short_name = 'tablet'
description = _('Intended for generic tablet devices, does no resizing of images')
screen_size = (10000, 10000)
comic_screen_size = (10000, 10000)
class SamsungGalaxy(TabletOutput):
name = 'Samsung Galaxy'
short_name = 'galaxy'
description = _('Intended for the Samsung Galaxy and similar tablet devices with '
'a resolution of 600x1280')
screen_size = comic_screen_size = (600, 1280)
class NookHD(TabletOutput):
name = 'Nook HD+'
short_name = 'nook_hd_plus'
description = _('Intended for the Nook HD+ and similar tablet devices with '
'a resolution of 1280x1920')
screen_size = comic_screen_size = (1280, 1920)
class SonyReaderOutput(OutputProfile):
name = 'Sony Reader'
short_name = 'sony'
description = _('This profile is intended for the SONY PRS line. '
'The 500/505/600/700 etc.')
screen_size = (590, 775)
dpi = 168.451
fbase = 12
fsizes = [7.5, 9, 10, 12, 15.5, 20, 22, 24]
unsupported_unicode_chars = [u'\u201f', u'\u201b']
epub_periodical_format = 'sony'
# periodical_date_in_title = False
class KoboReaderOutput(OutputProfile):
name = 'Kobo Reader'
short_name = 'kobo'
description = _('This profile is intended for the Kobo Reader.')
screen_size = (536, 710)
comic_screen_size = (536, 710)
dpi = 168.451
fbase = 12
fsizes = [7.5, 9, 10, 12, 15.5, 20, 22, 24]
class SonyReader300Output(SonyReaderOutput):
author = 'John Schember'
name = 'Sony Reader 300'
short_name = 'sony300'
description = _('This profile is intended for the SONY PRS-300.')
dpi = 200
class SonyReader900Output(SonyReaderOutput):
author = 'John Schember'
name = 'Sony Reader 900'
short_name = 'sony900'
description = _('This profile is intended for the SONY PRS-900.')
screen_size = (600, 999)
comic_screen_size = screen_size
class SonyReaderT3Output(SonyReaderOutput):
author = 'Kovid Goyal'
name = 'Sony Reader T3'
short_name = 'sonyt3'
description = _('This profile is intended for the SONY PRS-T3.')
screen_size = (758, 934)
comic_screen_size = screen_size
class GenericEink(SonyReaderOutput):
name = 'Generic e-ink'
short_name = 'generic_eink'
description = _('Suitable for use with any e-ink device')
epub_periodical_format = None
class GenericEinkLarge(GenericEink):
name = 'Generic e-ink large'
short_name = 'generic_eink_large'
description = _('Suitable for use with any large screen e-ink device')
screen_size = (600, 999)
comic_screen_size = screen_size
class GenericEinkHD(GenericEink):
name = 'Generic e-ink HD'
short_name = 'generic_eink_hd'
description = _('Suitable for use with any modern high resolution e-ink device')
screen_size = (10000, 10000)
comic_screen_size = (10000, 10000)
class JetBook5Output(OutputProfile):
name = 'JetBook 5-inch'
short_name = 'jetbook5'
description = _('This profile is intended for the 5-inch JetBook.')
screen_size = (480, 640)
dpi = 168.451
class SonyReaderLandscapeOutput(SonyReaderOutput):
name = 'Sony Reader Landscape'
short_name = 'sony-landscape'
description = _('This profile is intended for the SONY PRS line. '
'The 500/505/700 etc, in landscape mode. Mainly useful '
'for comics.')
screen_size = (784, 1012)
comic_screen_size = (784, 1012)
class MSReaderOutput(OutputProfile):
name = 'Microsoft Reader'
short_name = 'msreader'
description = _('This profile is intended for the Microsoft Reader.')
screen_size = (480, 652)
dpi = 96
fbase = 13
fsizes = [10, 11, 13, 16, 18, 20, 22, 26]
class MobipocketOutput(OutputProfile):
name = 'Mobipocket Books'
short_name = 'mobipocket'
description = _('This profile is intended for the Mobipocket books.')
# Unfortunately MOBI books are not narrowly targeted, so this information is
# quite likely to be spurious
screen_size = (600, 800)
dpi = 96
fbase = 18
fsizes = [14, 14, 16, 18, 20, 22, 24, 26]
class HanlinV3Output(OutputProfile):
name = 'Hanlin V3'
short_name = 'hanlinv3'
description = _('This profile is intended for the Hanlin V3 and its clones.')
# Screen size is a best guess
screen_size = (584, 754)
dpi = 168.451
fbase = 16
fsizes = [12, 12, 14, 16, 18, 20, 22, 24]
class HanlinV5Output(HanlinV3Output):
name = 'Hanlin V5'
short_name = 'hanlinv5'
description = _('This profile is intended for the Hanlin V5 and its clones.')
dpi = 200
class CybookG3Output(OutputProfile):
name = 'Cybook G3'
short_name = 'cybookg3'
description = _('This profile is intended for the Cybook G3.')
# Screen size is a best guess
screen_size = (600, 800)
comic_screen_size = (600, 757)
dpi = 168.451
fbase = 16
fsizes = [12, 12, 14, 16, 18, 20, 22, 24]
class CybookOpusOutput(SonyReaderOutput):
author = 'John Schember'
name = 'Cybook Opus'
short_name = 'cybook_opus'
description = _('This profile is intended for the Cybook Opus.')
# Screen size is a best guess
dpi = 200
fbase = 16
fsizes = [12, 12, 14, 16, 18, 20, 22, 24]
epub_periodical_format = None
class KindleOutput(OutputProfile):
name = 'Kindle'
short_name = 'kindle'
description = _('This profile is intended for the Amazon Kindle.')
# Screen size is a best guess
screen_size = (525, 640)
dpi = 168.451
fbase = 16
fsizes = [12, 12, 14, 16, 18, 20, 22, 24]
supports_mobi_indexing = True
periodical_date_in_title = False
empty_ratings_char = '\u2606'
ratings_char = '\u2605'
mobi_ems_per_blockquote = 2.0
class KindleDXOutput(OutputProfile):
name = 'Kindle DX'
short_name = 'kindle_dx'
description = _('This profile is intended for the Amazon Kindle DX.')
# Screen size is a best guess
screen_size = (744, 1022)
dpi = 150.0
comic_screen_size = (771, 1116)
# comic_screen_size = (741, 1022)
supports_mobi_indexing = True
periodical_date_in_title = False
empty_ratings_char = '\u2606'
ratings_char = '\u2605'
mobi_ems_per_blockquote = 2.0
class KindlePaperWhiteOutput(KindleOutput):
name = 'Kindle PaperWhite'
short_name = 'kindle_pw'
description = _('This profile is intended for the Amazon Kindle PaperWhite 1 and 2')
# Screen size is a best guess
screen_size = (658, 940)
dpi = 212.0
comic_screen_size = screen_size
class KindleVoyageOutput(KindleOutput):
name = 'Kindle Voyage'
short_name = 'kindle_voyage'
description = _('This profile is intended for the Amazon Kindle Voyage')
# Screen size is currently just the spec size, actual renderable area will
# depend on someone with the device doing tests.
screen_size = (1080, 1430)
dpi = 300.0
comic_screen_size = screen_size
class KindlePaperWhite3Output(KindleVoyageOutput):
name = 'Kindle PaperWhite 3'
short_name = 'kindle_pw3'
description = _('This profile is intended for the Amazon Kindle PaperWhite 3 and above')
# Screen size is currently just the spec size, actual renderable area will
# depend on someone with the device doing tests.
screen_size = (1072, 1430)
dpi = 300.0
comic_screen_size = screen_size
class KindleOasisOutput(KindlePaperWhite3Output):
name = 'Kindle Oasis'
short_name = 'kindle_oasis'
description = _('This profile is intended for the Amazon Kindle Oasis 2017 and above')
# Screen size is currently just the spec size, actual renderable area will
# depend on someone with the device doing tests.
screen_size = (1264, 1680)
dpi = 300.0
comic_screen_size = screen_size
class KindleFireOutput(KindleDXOutput):
name = 'Kindle Fire'
short_name = 'kindle_fire'
description = _('This profile is intended for the Amazon Kindle Fire.')
screen_size = (570, 1016)
dpi = 169.0
comic_screen_size = (570, 1016)
class IlliadOutput(OutputProfile):
name = 'Illiad'
short_name = 'illiad'
description = _('This profile is intended for the Irex Illiad.')
screen_size = (760, 925)
comic_screen_size = (760, 925)
dpi = 160.0
fbase = 12
fsizes = [7.5, 9, 10, 12, 15.5, 20, 22, 24]
class IRexDR1000Output(OutputProfile):
author = 'John Schember'
name = 'IRex Digital Reader 1000'
short_name = 'irexdr1000'
description = _('This profile is intended for the IRex Digital Reader 1000.')
# Screen size is a best guess
screen_size = (1024, 1280)
comic_screen_size = (996, 1241)
dpi = 160
fbase = 16
fsizes = [12, 14, 16, 18, 20, 22, 24]
class IRexDR800Output(OutputProfile):
author = 'Eric Cronin'
name = 'IRex Digital Reader 800'
short_name = 'irexdr800'
description = _('This profile is intended for the IRex Digital Reader 800.')
# Screen size is a best guess
screen_size = (768, 1024)
comic_screen_size = (768, 1024)
dpi = 160
fbase = 16
fsizes = [12, 14, 16, 18, 20, 22, 24]
class NookOutput(OutputProfile):
author = 'John Schember'
name = 'Nook'
short_name = 'nook'
description = _('This profile is intended for the B&N Nook.')
# Screen size is a best guess
screen_size = (600, 730)
comic_screen_size = (584, 730)
dpi = 167
fbase = 16
fsizes = [12, 12, 14, 16, 18, 20, 22, 24]
class NookColorOutput(NookOutput):
name = 'Nook Color'
short_name = 'nook_color'
description = _('This profile is intended for the B&N Nook Color.')
screen_size = (600, 900)
comic_screen_size = (594, 900)
dpi = 169
class PocketBook900Output(OutputProfile):
author = 'Chris Lockfort'
name = 'PocketBook Pro 900'
short_name = 'pocketbook_900'
description = _('This profile is intended for the PocketBook Pro 900 series of devices.')
screen_size = (810, 1180)
dpi = 150.0
comic_screen_size = screen_size
class PocketBookPro912Output(OutputProfile):
author = 'Daniele Pizzolli'
name = 'PocketBook Pro 912'
short_name = 'pocketbook_pro_912'
description = _('This profile is intended for the PocketBook Pro 912 series of devices.')
# According to http://download.pocketbook-int.com/user-guides/E_Ink/912/User_Guide_PocketBook_912(EN).pdf
screen_size = (825, 1200)
dpi = 155.0
comic_screen_size = screen_size
output_profiles = [
OutputProfile, SonyReaderOutput, SonyReader300Output, SonyReader900Output,
SonyReaderT3Output, MSReaderOutput, MobipocketOutput, HanlinV3Output,
HanlinV5Output, CybookG3Output, CybookOpusOutput, KindleOutput, iPadOutput,
iPad3Output, KoboReaderOutput, TabletOutput, SamsungGalaxy,
SonyReaderLandscapeOutput, KindleDXOutput, IlliadOutput, NookHD,
IRexDR1000Output, IRexDR800Output, JetBook5Output, NookOutput,
NookColorOutput, PocketBook900Output,
PocketBookPro912Output, GenericEink, GenericEinkLarge, GenericEinkHD,
KindleFireOutput, KindlePaperWhiteOutput, KindleVoyageOutput,
KindlePaperWhite3Output, KindleOasisOutput
]
output_profiles.sort(key=lambda x: x.name.lower())

View File

@@ -0,0 +1,835 @@
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
import os, shutil, traceback, functools, sys
from collections import defaultdict
from itertools import chain
from calibre.customize import (CatalogPlugin, FileTypePlugin, PluginNotFound,
MetadataReaderPlugin, MetadataWriterPlugin,
InterfaceActionBase as InterfaceAction,
PreferencesPlugin, platform, InvalidPlugin,
StoreBase as Store, EditBookToolPlugin,
LibraryClosedPlugin)
from calibre.customize.conversion import InputFormatPlugin, OutputFormatPlugin
from calibre.customize.zipplugin import loader
from calibre.customize.profiles import InputProfile, OutputProfile
from calibre.customize.builtins import plugins as builtin_plugins
from calibre.devices.interface import DevicePlugin
from calibre.ebooks.metadata import MetaInformation
from calibre.utils.config import (make_config_dir, Config, ConfigProxy,
plugin_dir, OptionParser)
from calibre.ebooks.metadata.sources.base import Source
from calibre.constants import DEBUG, numeric_version
from polyglot.builtins import iteritems, itervalues, unicode_type
builtin_names = frozenset(p.name for p in builtin_plugins)
BLACKLISTED_PLUGINS = frozenset({'Marvin XD', 'iOS reader applications'})
class NameConflict(ValueError):
pass
def _config():
c = Config('customize')
c.add_opt('plugins', default={}, help=_('Installed plugins'))
c.add_opt('filetype_mapping', default={}, help=_('Mapping for filetype plugins'))
c.add_opt('plugin_customization', default={}, help=_('Local plugin customization'))
c.add_opt('disabled_plugins', default=set(), help=_('Disabled plugins'))
c.add_opt('enabled_plugins', default=set(), help=_('Enabled plugins'))
return ConfigProxy(c)
config = _config()
def find_plugin(name):
for plugin in _initialized_plugins:
if plugin.name == name:
return plugin
def load_plugin(path_to_zip_file): # {{{
'''
Load plugin from ZIP file or raise InvalidPlugin error
:return: A :class:`Plugin` instance.
'''
return loader.load(path_to_zip_file)
# }}}
# Enable/disable plugins {{{
def disable_plugin(plugin_or_name):
x = getattr(plugin_or_name, 'name', plugin_or_name)
plugin = find_plugin(x)
if not plugin.can_be_disabled:
raise ValueError('Plugin %s cannot be disabled'%x)
dp = config['disabled_plugins']
dp.add(x)
config['disabled_plugins'] = dp
ep = config['enabled_plugins']
if x in ep:
ep.remove(x)
config['enabled_plugins'] = ep
def enable_plugin(plugin_or_name):
x = getattr(plugin_or_name, 'name', plugin_or_name)
dp = config['disabled_plugins']
if x in dp:
dp.remove(x)
config['disabled_plugins'] = dp
ep = config['enabled_plugins']
ep.add(x)
config['enabled_plugins'] = ep
def restore_plugin_state_to_default(plugin_or_name):
x = getattr(plugin_or_name, 'name', plugin_or_name)
dp = config['disabled_plugins']
if x in dp:
dp.remove(x)
config['disabled_plugins'] = dp
ep = config['enabled_plugins']
if x in ep:
ep.remove(x)
config['enabled_plugins'] = ep
default_disabled_plugins = {
'Overdrive', 'Douban Books', 'OZON.ru', 'Edelweiss', 'Google Images', 'Big Book Search',
}
def is_disabled(plugin):
if plugin.name in config['enabled_plugins']:
return False
return plugin.name in config['disabled_plugins'] or \
plugin.name in default_disabled_plugins
# }}}
# File type plugins {{{
_on_import = {}
_on_postimport = {}
_on_preprocess = {}
_on_postprocess = {}
_on_postadd = []
def reread_filetype_plugins():
global _on_import, _on_postimport, _on_preprocess, _on_postprocess, _on_postadd
_on_import = defaultdict(list)
_on_postimport = defaultdict(list)
_on_preprocess = defaultdict(list)
_on_postprocess = defaultdict(list)
_on_postadd = []
for plugin in _initialized_plugins:
if isinstance(plugin, FileTypePlugin):
for ft in plugin.file_types:
if plugin.on_import:
_on_import[ft].append(plugin)
if plugin.on_postimport:
_on_postimport[ft].append(plugin)
_on_postadd.append(plugin)
if plugin.on_preprocess:
_on_preprocess[ft].append(plugin)
if plugin.on_postprocess:
_on_postprocess[ft].append(plugin)
def plugins_for_ft(ft, occasion):
op = {
'import':_on_import, 'preprocess':_on_preprocess, 'postprocess':_on_postprocess, 'postimport':_on_postimport,
}[occasion]
for p in chain(op.get(ft, ()), op.get('*', ())):
if not is_disabled(p):
yield p
def _run_filetype_plugins(path_to_file, ft=None, occasion='preprocess'):
customization = config['plugin_customization']
if ft is None:
ft = os.path.splitext(path_to_file)[-1].lower().replace('.', '')
nfp = path_to_file
for plugin in plugins_for_ft(ft, occasion):
plugin.site_customization = customization.get(plugin.name, '')
oo, oe = sys.stdout, sys.stderr # Some file type plugins out there override the output streams with buggy implementations
with plugin:
try:
plugin.original_path_to_file = path_to_file
except Exception:
pass
try:
nfp = plugin.run(nfp) or nfp
except:
print('Running file type plugin %s failed with traceback:'%plugin.name, file=oe)
traceback.print_exc(file=oe)
sys.stdout, sys.stderr = oo, oe
x = lambda j: os.path.normpath(os.path.normcase(j))
if occasion == 'postprocess' and x(nfp) != x(path_to_file):
shutil.copyfile(nfp, path_to_file)
nfp = path_to_file
return nfp
run_plugins_on_import = functools.partial(_run_filetype_plugins, occasion='import')
run_plugins_on_preprocess = functools.partial(_run_filetype_plugins, occasion='preprocess')
run_plugins_on_postprocess = functools.partial(_run_filetype_plugins, occasion='postprocess')
def run_plugins_on_postimport(db, book_id, fmt):
customization = config['plugin_customization']
fmt = fmt.lower()
for plugin in plugins_for_ft(fmt, 'postimport'):
plugin.site_customization = customization.get(plugin.name, '')
with plugin:
try:
plugin.postimport(book_id, fmt, db)
except:
print('Running file type plugin %s failed with traceback:'%
plugin.name)
traceback.print_exc()
def run_plugins_on_postadd(db, book_id, fmt_map):
customization = config['plugin_customization']
for plugin in _on_postadd:
if is_disabled(plugin):
continue
plugin.site_customization = customization.get(plugin.name, '')
with plugin:
try:
plugin.postadd(book_id, fmt_map, db)
except Exception:
print('Running file type plugin %s failed with traceback:'%
plugin.name)
traceback.print_exc()
# }}}
# Plugin customization {{{
def customize_plugin(plugin, custom):
d = config['plugin_customization']
d[plugin.name] = custom.strip()
config['plugin_customization'] = d
def plugin_customization(plugin):
return config['plugin_customization'].get(plugin.name, '')
# }}}
# Input/Output profiles {{{
def input_profiles():
for plugin in _initialized_plugins:
if isinstance(plugin, InputProfile):
yield plugin
def output_profiles():
for plugin in _initialized_plugins:
if isinstance(plugin, OutputProfile):
yield plugin
# }}}
# Interface Actions # {{{
def interface_actions():
customization = config['plugin_customization']
for plugin in _initialized_plugins:
if isinstance(plugin, InterfaceAction):
if not is_disabled(plugin):
plugin.site_customization = customization.get(plugin.name, '')
yield plugin
# }}}
# Preferences Plugins # {{{
def preferences_plugins():
customization = config['plugin_customization']
for plugin in _initialized_plugins:
if isinstance(plugin, PreferencesPlugin):
if not is_disabled(plugin):
plugin.site_customization = customization.get(plugin.name, '')
yield plugin
# }}}
# Library Closed Plugins # {{{
def available_library_closed_plugins():
customization = config['plugin_customization']
for plugin in _initialized_plugins:
if isinstance(plugin, LibraryClosedPlugin):
if not is_disabled(plugin):
plugin.site_customization = customization.get(plugin.name, '')
yield plugin
def has_library_closed_plugins():
for plugin in _initialized_plugins:
if isinstance(plugin, LibraryClosedPlugin):
if not is_disabled(plugin):
return True
return False
# }}}
# Store Plugins # {{{
def store_plugins():
customization = config['plugin_customization']
for plugin in _initialized_plugins:
if isinstance(plugin, Store):
plugin.site_customization = customization.get(plugin.name, '')
yield plugin
def available_store_plugins():
for plugin in store_plugins():
if not is_disabled(plugin):
yield plugin
def stores():
stores = set()
for plugin in store_plugins():
stores.add(plugin.name)
return stores
def available_stores():
stores = set()
for plugin in available_store_plugins():
stores.add(plugin.name)
return stores
# }}}
# Metadata read/write {{{
_metadata_readers = {}
_metadata_writers = {}
def reread_metadata_plugins():
global _metadata_readers
global _metadata_writers
_metadata_readers = defaultdict(list)
_metadata_writers = defaultdict(list)
for plugin in _initialized_plugins:
if isinstance(plugin, MetadataReaderPlugin):
for ft in plugin.file_types:
_metadata_readers[ft].append(plugin)
elif isinstance(plugin, MetadataWriterPlugin):
for ft in plugin.file_types:
_metadata_writers[ft].append(plugin)
# Ensure custom metadata plugins are used in preference to builtin
# ones for a given filetype
def key(plugin):
return (1 if plugin.plugin_path is None else 0), plugin.name
for group in (_metadata_readers, _metadata_writers):
for plugins in itervalues(group):
if len(plugins) > 1:
plugins.sort(key=key)
def metadata_readers():
ans = set()
for plugins in _metadata_readers.values():
for plugin in plugins:
ans.add(plugin)
return ans
def metadata_writers():
ans = set()
for plugins in _metadata_writers.values():
for plugin in plugins:
ans.add(plugin)
return ans
class QuickMetadata(object):
def __init__(self):
self.quick = False
def __enter__(self):
self.quick = True
def __exit__(self, *args):
self.quick = False
quick_metadata = QuickMetadata()
class ApplyNullMetadata(object):
def __init__(self):
self.apply_null = False
def __enter__(self):
self.apply_null = True
def __exit__(self, *args):
self.apply_null = False
apply_null_metadata = ApplyNullMetadata()
class ForceIdentifiers(object):
def __init__(self):
self.force_identifiers = False
def __enter__(self):
self.force_identifiers = True
def __exit__(self, *args):
self.force_identifiers = False
force_identifiers = ForceIdentifiers()
def get_file_type_metadata(stream, ftype):
mi = MetaInformation(None, None)
ftype = ftype.lower().strip()
if ftype in _metadata_readers:
for plugin in _metadata_readers[ftype]:
if not is_disabled(plugin):
with plugin:
try:
plugin.quick = quick_metadata.quick
if hasattr(stream, 'seek'):
stream.seek(0)
mi = plugin.get_metadata(stream, ftype.lower().strip())
break
except:
traceback.print_exc()
continue
return mi
def set_file_type_metadata(stream, mi, ftype, report_error=None):
ftype = ftype.lower().strip()
if ftype in _metadata_writers:
customization = config['plugin_customization']
for plugin in _metadata_writers[ftype]:
if not is_disabled(plugin):
with plugin:
try:
plugin.apply_null = apply_null_metadata.apply_null
plugin.force_identifiers = force_identifiers.force_identifiers
plugin.site_customization = customization.get(plugin.name, '')
plugin.set_metadata(stream, mi, ftype.lower().strip())
break
except:
if report_error is None:
from calibre import prints
prints('Failed to set metadata for the', ftype.upper(), 'format of:', getattr(mi, 'title', ''), file=sys.stderr)
traceback.print_exc()
else:
report_error(mi, ftype, traceback.format_exc())
def can_set_metadata(ftype):
ftype = ftype.lower().strip()
for plugin in _metadata_writers.get(ftype, ()):
if not is_disabled(plugin):
return True
return False
# }}}
# Add/remove plugins {{{
def add_plugin(path_to_zip_file):
make_config_dir()
plugin = load_plugin(path_to_zip_file)
if plugin.name in builtin_names:
raise NameConflict(
'A builtin plugin with the name %r already exists' % plugin.name)
plugin = initialize_plugin(plugin, path_to_zip_file)
plugins = config['plugins']
zfp = os.path.join(plugin_dir, plugin.name+'.zip')
if os.path.exists(zfp):
os.remove(zfp)
shutil.copyfile(path_to_zip_file, zfp)
plugins[plugin.name] = zfp
config['plugins'] = plugins
initialize_plugins()
return plugin
def remove_plugin(plugin_or_name):
name = getattr(plugin_or_name, 'name', plugin_or_name)
plugins = config['plugins']
removed = False
if name in plugins:
removed = True
try:
zfp = os.path.join(plugin_dir, name+'.zip')
if os.path.exists(zfp):
os.remove(zfp)
zfp = plugins[name]
if os.path.exists(zfp):
os.remove(zfp)
except:
pass
plugins.pop(name)
config['plugins'] = plugins
initialize_plugins()
return removed
# }}}
# Input/Output format plugins {{{
def input_format_plugins():
for plugin in _initialized_plugins:
if isinstance(plugin, InputFormatPlugin):
yield plugin
def plugin_for_input_format(fmt):
customization = config['plugin_customization']
for plugin in input_format_plugins():
if fmt.lower() in plugin.file_types:
plugin.site_customization = customization.get(plugin.name, None)
return plugin
def all_input_formats():
formats = set()
for plugin in input_format_plugins():
for format in plugin.file_types:
formats.add(format)
return formats
def available_input_formats():
formats = set()
for plugin in input_format_plugins():
if not is_disabled(plugin):
for format in plugin.file_types:
formats.add(format)
formats.add('zip'), formats.add('rar')
return formats
def output_format_plugins():
for plugin in _initialized_plugins:
if isinstance(plugin, OutputFormatPlugin):
yield plugin
def plugin_for_output_format(fmt):
customization = config['plugin_customization']
for plugin in output_format_plugins():
if fmt.lower() == plugin.file_type:
plugin.site_customization = customization.get(plugin.name, None)
return plugin
def available_output_formats():
formats = set()
for plugin in output_format_plugins():
if not is_disabled(plugin):
formats.add(plugin.file_type)
return formats
# }}}
# Catalog plugins {{{
def catalog_plugins():
for plugin in _initialized_plugins:
if isinstance(plugin, CatalogPlugin):
yield plugin
def available_catalog_formats():
formats = set()
for plugin in catalog_plugins():
if not is_disabled(plugin):
for format in plugin.file_types:
formats.add(format)
return formats
def plugin_for_catalog_format(fmt):
for plugin in catalog_plugins():
if fmt.lower() in plugin.file_types:
return plugin
# }}}
# Device plugins {{{
def device_plugins(include_disabled=False):
for plugin in _initialized_plugins:
if isinstance(plugin, DevicePlugin):
if include_disabled or not is_disabled(plugin):
if platform in plugin.supported_platforms:
if getattr(plugin, 'plugin_needs_delayed_initialization',
False):
plugin.do_delayed_plugin_initialization()
yield plugin
def disabled_device_plugins():
for plugin in _initialized_plugins:
if isinstance(plugin, DevicePlugin):
if is_disabled(plugin):
if platform in plugin.supported_platforms:
yield plugin
# }}}
# Metadata sources2 {{{
def metadata_plugins(capabilities):
capabilities = frozenset(capabilities)
for plugin in all_metadata_plugins():
if plugin.capabilities.intersection(capabilities) and \
not is_disabled(plugin):
yield plugin
def all_metadata_plugins():
for plugin in _initialized_plugins:
if isinstance(plugin, Source):
yield plugin
def patch_metadata_plugins(possibly_updated_plugins):
patches = {}
for i, plugin in enumerate(_initialized_plugins):
if isinstance(plugin, Source) and plugin.name in builtin_names:
pup = possibly_updated_plugins.get(plugin.name)
if pup is not None:
if pup.version > plugin.version and pup.minimum_calibre_version <= numeric_version:
patches[i] = pup(None)
# Metadata source plugins dont use initialize() but that
# might change in the future, so be safe.
patches[i].initialize()
for i, pup in iteritems(patches):
_initialized_plugins[i] = pup
# }}}
# Editor plugins {{{
def all_edit_book_tool_plugins():
for plugin in _initialized_plugins:
if isinstance(plugin, EditBookToolPlugin):
yield plugin
# }}}
# Initialize plugins {{{
_initialized_plugins = []
def initialize_plugin(plugin, path_to_zip_file):
try:
p = plugin(path_to_zip_file)
p.initialize()
return p
except Exception:
print('Failed to initialize plugin:', plugin.name, plugin.version)
tb = traceback.format_exc()
raise InvalidPlugin((_('Initialization of plugin %s failed with traceback:')
%tb) + '\n'+tb)
def has_external_plugins():
'True if there are updateable (ZIP file based) plugins'
return bool(config['plugins'])
def initialize_plugins(perf=False):
global _initialized_plugins
_initialized_plugins = []
conflicts = [name for name in config['plugins'] if name in
builtin_names]
for p in conflicts:
remove_plugin(p)
external_plugins = config['plugins'].copy()
for name in BLACKLISTED_PLUGINS:
external_plugins.pop(name, None)
ostdout, ostderr = sys.stdout, sys.stderr
if perf:
from collections import defaultdict
import time
times = defaultdict(lambda:0)
for zfp in list(external_plugins) + builtin_plugins:
try:
if not isinstance(zfp, type):
# We have a plugin name
pname = zfp
zfp = os.path.join(plugin_dir, zfp+'.zip')
if not os.path.exists(zfp):
zfp = external_plugins[pname]
try:
plugin = load_plugin(zfp) if not isinstance(zfp, type) else zfp
except PluginNotFound:
continue
if perf:
st = time.time()
plugin = initialize_plugin(plugin, None if isinstance(zfp, type) else zfp)
if perf:
times[plugin.name] = time.time() - st
_initialized_plugins.append(plugin)
except:
print('Failed to initialize plugin:', repr(zfp))
if DEBUG:
traceback.print_exc()
# Prevent a custom plugin from overriding stdout/stderr as this breaks
# ipython
sys.stdout, sys.stderr = ostdout, ostderr
if perf:
for x in sorted(times, key=lambda x: times[x]):
print('%50s: %.3f'%(x, times[x]))
_initialized_plugins.sort(key=lambda x: x.priority, reverse=True)
reread_filetype_plugins()
reread_metadata_plugins()
initialize_plugins()
def initialized_plugins():
for plugin in _initialized_plugins:
yield plugin
# }}}
# CLI {{{
def build_plugin(path):
from calibre import prints
from calibre.ptempfile import PersistentTemporaryFile
from calibre.utils.zipfile import ZipFile, ZIP_STORED
path = unicode_type(path)
names = frozenset(os.listdir(path))
if '__init__.py' not in names:
prints(path, ' is not a valid plugin')
raise SystemExit(1)
t = PersistentTemporaryFile(u'.zip')
with ZipFile(t, 'w', ZIP_STORED) as zf:
zf.add_dir(path, simple_filter=lambda x:x in {'.git', '.bzr', '.svn', '.hg'})
t.close()
plugin = add_plugin(t.name)
os.remove(t.name)
prints('Plugin updated:', plugin.name, plugin.version)
def option_parser():
parser = OptionParser(usage=_('''\
%prog options
Customize calibre by loading external plugins.
'''))
parser.add_option('-a', '--add-plugin', default=None,
help=_('Add a plugin by specifying the path to the ZIP file containing it.'))
parser.add_option('-b', '--build-plugin', default=None,
help=_('For plugin developers: Path to the directory where you are'
' developing the plugin. This command will automatically zip '
'up the plugin and update it in calibre.'))
parser.add_option('-r', '--remove-plugin', default=None,
help=_('Remove a custom plugin by name. Has no effect on builtin plugins'))
parser.add_option('--customize-plugin', default=None,
help=_('Customize plugin. Specify name of plugin and customization string separated by a comma.'))
parser.add_option('-l', '--list-plugins', default=False, action='store_true',
help=_('List all installed plugins'))
parser.add_option('--enable-plugin', default=None,
help=_('Enable the named plugin'))
parser.add_option('--disable-plugin', default=None,
help=_('Disable the named plugin'))
return parser
def main(args=sys.argv):
parser = option_parser()
if len(args) < 2:
parser.print_help()
return 1
opts, args = parser.parse_args(args)
if opts.add_plugin is not None:
plugin = add_plugin(opts.add_plugin)
print('Plugin added:', plugin.name, plugin.version)
if opts.build_plugin is not None:
build_plugin(opts.build_plugin)
if opts.remove_plugin is not None:
if remove_plugin(opts.remove_plugin):
print('Plugin removed')
else:
print('No custom plugin named', opts.remove_plugin)
if opts.customize_plugin is not None:
name, custom = opts.customize_plugin.split(',')
plugin = find_plugin(name.strip())
if plugin is None:
print('No plugin with the name %s exists'%name)
return 1
customize_plugin(plugin, custom)
if opts.enable_plugin is not None:
enable_plugin(opts.enable_plugin.strip())
if opts.disable_plugin is not None:
disable_plugin(opts.disable_plugin.strip())
if opts.list_plugins:
type_len = name_len = 0
for plugin in initialized_plugins():
type_len, name_len = max(type_len, len(plugin.type)), max(name_len, len(plugin.name))
fmt = '%-{}s%-{}s%-15s%-15s%s'.format(type_len+1, name_len+1)
print(fmt%tuple(('Type|Name|Version|Disabled|Site Customization'.split('|'))))
print()
for plugin in initialized_plugins():
print(fmt%(
plugin.type, plugin.name,
plugin.version, is_disabled(plugin),
plugin_customization(plugin)
))
print('\t', plugin.description)
if plugin.is_customizable():
try:
print('\t', plugin.customization_help())
except NotImplementedError:
pass
print()
return 0
if __name__ == '__main__':
sys.exit(main())
# }}}

View File

@@ -0,0 +1,320 @@
#!/usr/bin/env python2
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2011, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
import os, zipfile, posixpath, importlib, threading, re, imp, sys
from collections import OrderedDict
from functools import partial
from calibre import as_unicode
from calibre.constants import ispy3
from calibre.customize import (Plugin, numeric_version, platform,
InvalidPlugin, PluginNotFound)
from polyglot.builtins import (itervalues, map, string_or_bytes,
unicode_type, reload)
# PEP 302 based plugin loading mechanism, works around the bug in zipimport in
# python 2.x that prevents importing from zip files in locations whose paths
# have non ASCII characters
def get_resources(zfp, name_or_list_of_names):
'''
Load resources from the plugin zip file
:param name_or_list_of_names: List of paths to resources in the zip file using / as
separator, or a single path
:return: A dictionary of the form ``{name : file_contents}``. Any names
that were not found in the zip file will not be present in the
dictionary. If a single path is passed in the return value will
be just the bytes of the resource or None if it wasn't found.
'''
names = name_or_list_of_names
if isinstance(names, string_or_bytes):
names = [names]
ans = {}
with zipfile.ZipFile(zfp) as zf:
for name in names:
try:
ans[name] = zf.read(name)
except:
import traceback
traceback.print_exc()
if len(names) == 1:
ans = ans.pop(names[0], None)
return ans
def get_icons(zfp, name_or_list_of_names):
'''
Load icons from the plugin zip file
:param name_or_list_of_names: List of paths to resources in the zip file using / as
separator, or a single path
:return: A dictionary of the form ``{name : QIcon}``. Any names
that were not found in the zip file will be null QIcons.
If a single path is passed in the return value will
be A QIcon.
'''
from PyQt5.Qt import QIcon, QPixmap
names = name_or_list_of_names
ans = get_resources(zfp, names)
if isinstance(names, string_or_bytes):
names = [names]
if ans is None:
ans = {}
if isinstance(ans, string_or_bytes):
ans = dict([(names[0], ans)])
ians = {}
for name in names:
p = QPixmap()
raw = ans.get(name, None)
if raw:
p.loadFromData(raw)
ians[name] = QIcon(p)
if len(names) == 1:
ians = ians.pop(names[0])
return ians
_translations_cache = {}
def load_translations(namespace, zfp):
null = object()
trans = _translations_cache.get(zfp, null)
if trans is None:
return
if trans is null:
from calibre.utils.localization import get_lang
lang = get_lang()
if not lang or lang == 'en': # performance optimization
_translations_cache[zfp] = None
return
with zipfile.ZipFile(zfp) as zf:
try:
mo = zf.read('translations/%s.mo' % lang)
except KeyError:
mo = None # No translations for this language present
if mo is None:
_translations_cache[zfp] = None
return
from gettext import GNUTranslations
from io import BytesIO
trans = _translations_cache[zfp] = GNUTranslations(BytesIO(mo))
namespace['_'] = getattr(trans, 'gettext' if ispy3 else 'ugettext')
namespace['ngettext'] = getattr(trans, 'ngettext' if ispy3 else 'ungettext')
class PluginLoader(object):
def __init__(self):
self.loaded_plugins = {}
self._lock = threading.RLock()
self._identifier_pat = re.compile(r'[a-zA-Z][_0-9a-zA-Z]*')
def _get_actual_fullname(self, fullname):
parts = fullname.split('.')
if parts[0] == 'calibre_plugins':
if len(parts) == 1:
return parts[0], None
plugin_name = parts[1]
with self._lock:
names = self.loaded_plugins.get(plugin_name, None)
if names is None:
raise ImportError('No plugin named %r loaded'%plugin_name)
names = names[1]
fullname = '.'.join(parts[2:])
if not fullname:
fullname = '__init__'
if fullname in names:
return fullname, plugin_name
if fullname+'.__init__' in names:
return fullname+'.__init__', plugin_name
return None, None
def find_module(self, fullname, path=None):
fullname, plugin_name = self._get_actual_fullname(fullname)
if fullname is None and plugin_name is None:
return None
return self
def load_module(self, fullname):
import_name, plugin_name = self._get_actual_fullname(fullname)
if import_name is None and plugin_name is None:
raise ImportError('No plugin named %r is loaded'%fullname)
mod = sys.modules.setdefault(fullname, imp.new_module(fullname))
mod.__file__ = "<calibre Plugin Loader>"
mod.__loader__ = self
if import_name.endswith('.__init__') or import_name in ('__init__',
'calibre_plugins'):
# We have a package
mod.__path__ = []
if plugin_name is not None:
# We have some actual code to load
with self._lock:
zfp, names = self.loaded_plugins.get(plugin_name, (None, None))
if names is None:
raise ImportError('No plugin named %r loaded'%plugin_name)
zinfo = names.get(import_name, None)
if zinfo is None:
raise ImportError('Plugin %r has no module named %r' %
(plugin_name, import_name))
with zipfile.ZipFile(zfp) as zf:
try:
code = zf.read(zinfo)
except:
# Maybe the zip file changed from under us
code = zf.read(zinfo.filename)
compiled = compile(code, 'calibre_plugins.%s.%s'%(plugin_name,
import_name), 'exec', dont_inherit=True)
mod.__dict__['get_resources'] = partial(get_resources, zfp)
mod.__dict__['get_icons'] = partial(get_icons, zfp)
mod.__dict__['load_translations'] = partial(load_translations, mod.__dict__, zfp)
exec(compiled, mod.__dict__)
return mod
def load(self, path_to_zip_file):
if not os.access(path_to_zip_file, os.R_OK):
raise PluginNotFound('Cannot access %r'%path_to_zip_file)
with zipfile.ZipFile(path_to_zip_file) as zf:
plugin_name = self._locate_code(zf, path_to_zip_file)
try:
ans = None
plugin_module = 'calibre_plugins.%s'%plugin_name
m = sys.modules.get(plugin_module, None)
if m is not None:
reload(m)
else:
m = importlib.import_module(plugin_module)
plugin_classes = []
for obj in itervalues(m.__dict__):
if isinstance(obj, type) and issubclass(obj, Plugin) and \
obj.name != 'Trivial Plugin':
plugin_classes.append(obj)
if not plugin_classes:
raise InvalidPlugin('No plugin class found in %s:%s'%(
as_unicode(path_to_zip_file), plugin_name))
if len(plugin_classes) > 1:
plugin_classes.sort(key=lambda c:(getattr(c, '__module__', None) or '').count('.'))
ans = plugin_classes[0]
if ans.minimum_calibre_version > numeric_version:
raise InvalidPlugin(
'The plugin at %s needs a version of calibre >= %s' %
(as_unicode(path_to_zip_file), '.'.join(map(unicode_type,
ans.minimum_calibre_version))))
if platform not in ans.supported_platforms:
raise InvalidPlugin(
'The plugin at %s cannot be used on %s' %
(as_unicode(path_to_zip_file), platform))
return ans
except:
with self._lock:
del self.loaded_plugins[plugin_name]
raise
def _locate_code(self, zf, path_to_zip_file):
names = [x if isinstance(x, unicode_type) else x.decode('utf-8') for x in
zf.namelist()]
names = [x[1:] if x[0] == '/' else x for x in names]
plugin_name = None
for name in names:
name, ext = posixpath.splitext(name)
if name.startswith('plugin-import-name-') and ext == '.txt':
plugin_name = name.rpartition('-')[-1]
if plugin_name is None:
c = 0
while True:
c += 1
plugin_name = 'dummy%d'%c
if plugin_name not in self.loaded_plugins:
break
else:
if self._identifier_pat.match(plugin_name) is None:
raise InvalidPlugin((
'The plugin at %r uses an invalid import name: %r' %
(path_to_zip_file, plugin_name)))
pynames = [x for x in names if x.endswith('.py')]
candidates = [posixpath.dirname(x) for x in pynames if
x.endswith('/__init__.py')]
candidates.sort(key=lambda x: x.count('/'))
valid_packages = set()
for candidate in candidates:
parts = candidate.split('/')
parent = '.'.join(parts[:-1])
if parent and parent not in valid_packages:
continue
valid_packages.add('.'.join(parts))
names = OrderedDict()
for candidate in pynames:
parts = posixpath.splitext(candidate)[0].split('/')
package = '.'.join(parts[:-1])
if package and package not in valid_packages:
continue
name = '.'.join(parts)
names[name] = zf.getinfo(candidate)
# Legacy plugins
if '__init__' not in names:
for name in tuple(names):
if '.' not in name and name.endswith('plugin'):
names['__init__'] = names[name]
break
if '__init__' not in names:
raise InvalidPlugin(('The plugin in %r is invalid. It does not '
'contain a top-level __init__.py file')
% path_to_zip_file)
with self._lock:
self.loaded_plugins[plugin_name] = (path_to_zip_file, names)
return plugin_name
loader = PluginLoader()
sys.meta_path.insert(0, loader)
if __name__ == '__main__':
from tempfile import NamedTemporaryFile
from calibre.customize.ui import add_plugin
from calibre import CurrentDir
path = sys.argv[-1]
with NamedTemporaryFile(suffix='.zip') as f:
with zipfile.ZipFile(f, 'w') as zf:
with CurrentDir(path):
for x in os.listdir('.'):
if x[0] != '.':
print('Adding', x)
zf.write(x)
if os.path.isdir(x):
for y in os.listdir(x):
zf.write(os.path.join(x, y))
add_plugin(f.name)
print('Added plugin from', sys.argv[-1])

View File

@@ -0,0 +1,216 @@
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
'''
Device drivers.
'''
import sys, time, pprint
from functools import partial
from polyglot.builtins import zip, unicode_type
DAY_MAP = dict(Sun=0, Mon=1, Tue=2, Wed=3, Thu=4, Fri=5, Sat=6)
MONTH_MAP = dict(Jan=1, Feb=2, Mar=3, Apr=4, May=5, Jun=6, Jul=7, Aug=8, Sep=9, Oct=10, Nov=11, Dec=12)
INVERSE_DAY_MAP = dict(zip(DAY_MAP.values(), DAY_MAP.keys()))
INVERSE_MONTH_MAP = dict(zip(MONTH_MAP.values(), MONTH_MAP.keys()))
def strptime(src):
src = src.strip()
src = src.split()
src[0] = unicode_type(DAY_MAP[src[0][:-1]])+','
src[2] = unicode_type(MONTH_MAP[src[2]])
return time.strptime(' '.join(src), '%w, %d %m %Y %H:%M:%S %Z')
def strftime(epoch, zone=time.gmtime):
src = time.strftime("%w, %d %m %Y %H:%M:%S GMT", zone(epoch)).split()
src[0] = INVERSE_DAY_MAP[int(src[0][:-1])]+','
src[2] = INVERSE_MONTH_MAP[int(src[2])]
return ' '.join(src)
def get_connected_device():
from calibre.customize.ui import device_plugins
from calibre.devices.scanner import DeviceScanner
dev = None
scanner = DeviceScanner()
scanner.scan()
connected_devices = []
for d in device_plugins():
ok, det = scanner.is_device_connected(d)
if ok:
dev = d
dev.reset(log_packets=False, detected_device=det)
connected_devices.append((det, dev))
if dev is None:
print('Unable to find a connected ebook reader.', file=sys.stderr)
return
for det, d in connected_devices:
try:
d.open(det, None)
except:
continue
else:
dev = d
break
return dev
def debug(ioreg_to_tmp=False, buf=None, plugins=None,
disabled_plugins=None):
'''
If plugins is None, then this method calls startup and shutdown on the
device plugins. So if you are using it in a context where startup could
already have been called (for example in the main GUI), pass in the list of
device plugins as the plugins parameter.
'''
import textwrap
from calibre.customize.ui import device_plugins, disabled_device_plugins
from calibre.debug import print_basic_debug_info
from calibre.devices.scanner import DeviceScanner
from calibre.constants import iswindows, isosx
from calibre import prints
from polyglot.io import PolyglotBytesIO
oldo, olde = sys.stdout, sys.stderr
if buf is None:
buf = PolyglotBytesIO()
sys.stdout = sys.stderr = buf
out = partial(prints, file=buf)
devplugins = device_plugins() if plugins is None else plugins
devplugins = list(sorted(devplugins, key=lambda x: x.__class__.__name__))
if plugins is None:
for d in devplugins:
try:
d.startup()
except:
out('Startup failed for device plugin: %s'%d)
if disabled_plugins is None:
disabled_plugins = list(disabled_device_plugins())
try:
print_basic_debug_info(out=buf)
s = DeviceScanner()
s.scan()
devices = (s.devices)
if not iswindows:
devices = [list(x) for x in devices]
for d in devices:
for i in range(3):
d[i] = hex(d[i])
out('USB devices on system:')
out(pprint.pformat(devices))
ioreg = None
if isosx:
from calibre.devices.usbms.device import Device
mount = '\n'.join(repr(x) for x in Device.osx_run_mount().splitlines())
drives = pprint.pformat(Device.osx_get_usb_drives())
ioreg = 'Output from mount:\n'+mount+'\n\n'
ioreg += 'Output from osx_get_usb_drives:\n'+drives+'\n\n'
ioreg += Device.run_ioreg()
connected_devices = []
if disabled_plugins:
out('\nDisabled plugins:', textwrap.fill(' '.join([x.__class__.__name__ for x in
disabled_plugins])))
out(' ')
else:
out('\nNo disabled plugins')
found_dev = False
for dev in devplugins:
if not dev.MANAGES_DEVICE_PRESENCE:
continue
out('Looking for devices of type:', dev.__class__.__name__)
if dev.debug_managed_device_detection(s.devices, buf):
found_dev = True
break
out(' ')
if not found_dev:
out('Looking for devices...')
for dev in devplugins:
if dev.MANAGES_DEVICE_PRESENCE:
continue
connected, det = s.is_device_connected(dev, debug=True)
if connected:
out('\t\tDetected possible device', dev.__class__.__name__)
connected_devices.append((dev, det))
out(' ')
errors = {}
success = False
out('Devices possibly connected:', end=' ')
for dev, det in connected_devices:
out(dev.name, end=', ')
if not connected_devices:
out('None', end='')
out(' ')
for dev, det in connected_devices:
out('Trying to open', dev.name, '...', end=' ')
dev.do_device_debug = True
try:
dev.reset(detected_device=det)
dev.open(det, None)
out('OK')
except:
import traceback
errors[dev] = traceback.format_exc()
out('failed')
continue
dev.do_device_debug = False
success = True
if hasattr(dev, '_main_prefix'):
out('Main memory:', repr(dev._main_prefix))
out('Total space:', dev.total_space())
break
if not success and errors:
out('Opening of the following devices failed')
for dev,msg in errors.items():
out(dev)
out(msg)
out(' ')
if ioreg is not None:
ioreg = 'IOREG Output\n'+ioreg
out(' ')
if ioreg_to_tmp:
lopen('/tmp/ioreg.txt', 'wb').write(ioreg)
out('Dont forget to send the contents of /tmp/ioreg.txt')
out('You can open it with the command: open /tmp/ioreg.txt')
else:
out(ioreg)
if hasattr(buf, 'getvalue'):
return buf.getvalue().decode('utf-8', 'replace')
finally:
sys.stdout = oldo
sys.stderr = olde
if plugins is None:
for d in devplugins:
try:
d.shutdown()
except:
pass
def device_info(ioreg_to_tmp=False, buf=None):
from calibre.devices.scanner import DeviceScanner
res = {}
res['device_set'] = device_set = set()
res['device_details'] = device_details = {}
s = DeviceScanner()
s.scan()
devices = s.devices
devices = [tuple(x) for x in devices]
for dev in devices:
device_set.add(dev)
device_details[dev] = dev[0:3]
return res

View File

@@ -0,0 +1,787 @@
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
import os
from collections import namedtuple
from calibre import prints
from calibre.constants import iswindows
from calibre.customize import Plugin
class DevicePlugin(Plugin):
"""
Defines the interface that should be implemented by backends that
communicate with an e-book reader.
"""
type = _('Device interface')
#: Ordered list of supported formats
FORMATS = ["lrf", "rtf", "pdf", "txt"]
# If True, the config dialog will not show the formats box
HIDE_FORMATS_CONFIG_BOX = False
#: VENDOR_ID can be either an integer, a list of integers or a dictionary
#: If it is a dictionary, it must be a dictionary of dictionaries,
#: of the form::
#:
#: {
#: integer_vendor_id : { product_id : [list of BCDs], ... },
#: ...
#: }
#:
VENDOR_ID = 0x0000
#: An integer or a list of integers
PRODUCT_ID = 0x0000
#: BCD can be either None to not distinguish between devices based on BCD, or
#: it can be a list of the BCD numbers of all devices supported by this driver.
BCD = None
#: Height for thumbnails on the device
THUMBNAIL_HEIGHT = 68
#: Compression quality for thumbnails. Set this closer to 100 to have better
#: quality thumbnails with fewer compression artifacts. Of course, the
#: thumbnails get larger as well.
THUMBNAIL_COMPRESSION_QUALITY = 75
#: Set this to True if the device supports updating cover thumbnails during
#: sync_booklists. Setting it to true will ask device.py to refresh the
#: cover thumbnails during book matching
WANTS_UPDATED_THUMBNAILS = False
#: Whether the metadata on books can be set via the GUI.
CAN_SET_METADATA = ['title', 'authors', 'collections']
#: Whether the device can handle device_db metadata plugboards
CAN_DO_DEVICE_DB_PLUGBOARD = False
# Set this to None if the books on the device are files that the GUI can
# access in order to add the books from the device to the library
BACKLOADING_ERROR_MESSAGE = _('Cannot get files from this device')
#: Path separator for paths to books on device
path_sep = os.sep
#: Icon for this device
icon = I('reader.png')
# Encapsulates an annotation fetched from the device
UserAnnotation = namedtuple('Annotation','type, value')
#: GUI displays this as a message if not None. Useful if opening can take a
#: long time
OPEN_FEEDBACK_MESSAGE = None
#: Set of extensions that are "virtual books" on the device
#: and therefore cannot be viewed/saved/added to library.
#: For example: ``frozenset(['kobo'])``
VIRTUAL_BOOK_EXTENSIONS = frozenset()
#: Message to display to user for virtual book extensions.
VIRTUAL_BOOK_EXTENSION_MESSAGE = None
#: Whether to nuke comments in the copy of the book sent to the device. If
#: not None this should be short string that the comments will be replaced
#: by.
NUKE_COMMENTS = None
#: If True indicates that this driver completely manages device detection,
#: ejecting and so forth. If you set this to True, you *must* implement the
#: detect_managed_devices and debug_managed_device_detection methods.
#: A driver with this set to true is responsible for detection of devices,
#: managing a blacklist of devices, a list of ejected devices and so forth.
#: calibre will periodically call the detect_managed_devices() method and
#: if it returns a detected device, calibre will call open(). open() will
#: be called every time a device is returned even if previous calls to open()
#: failed, therefore the driver must maintain its own blacklist of failed
#: devices. Similarly, when ejecting, calibre will call eject() and then
#: assuming the next call to detect_managed_devices() returns None, it will
#: call post_yank_cleanup().
MANAGES_DEVICE_PRESENCE = False
#: If set the True, calibre will call the :meth:`get_driveinfo()` method
#: after the books lists have been loaded to get the driveinfo.
SLOW_DRIVEINFO = False
#: If set to True, calibre will ask the user if they want to manage the
#: device with calibre, the first time it is detected. If you set this to
#: True you must implement :meth:`get_device_uid()` and
#: :meth:`ignore_connected_device()` and
#: :meth:`get_user_blacklisted_devices` and
#: :meth:`set_user_blacklisted_devices`
ASK_TO_ALLOW_CONNECT = False
#: Set this to a dictionary of the form {'title':title, 'msg':msg, 'det_msg':detailed_msg} to have calibre popup
#: a message to the user after some callbacks are run (currently only upload_books).
#: Be careful to not spam the user with too many messages. This variable is checked after *every* callback,
#: so only set it when you really need to.
user_feedback_after_callback = None
@classmethod
def get_gui_name(cls):
if hasattr(cls, 'gui_name'):
return cls.gui_name
if hasattr(cls, '__name__'):
return cls.__name__
return cls.name
# Device detection {{{
def test_bcd(self, bcdDevice, bcd):
if bcd is None or len(bcd) == 0:
return True
for c in bcd:
if c == bcdDevice:
return True
return False
def is_usb_connected(self, devices_on_system, debug=False, only_presence=False):
'''
Return True, device_info if a device handled by this plugin is currently connected.
:param devices_on_system: List of devices currently connected
'''
vendors_on_system = {x[0] for x in devices_on_system}
vendors = set(self.VENDOR_ID) if hasattr(self.VENDOR_ID, '__len__') else {self.VENDOR_ID}
if hasattr(self.VENDOR_ID, 'keys'):
products = []
for ven in self.VENDOR_ID:
products.extend(self.VENDOR_ID[ven].keys())
else:
products = self.PRODUCT_ID if hasattr(self.PRODUCT_ID, '__len__') else [self.PRODUCT_ID]
ch = self.can_handle_windows if iswindows else self.can_handle
for vid in vendors_on_system.intersection(vendors):
for dev in devices_on_system:
cvid, pid, bcd = dev[:3]
if cvid == vid:
if pid in products:
if hasattr(self.VENDOR_ID, 'keys'):
try:
cbcd = self.VENDOR_ID[vid][pid]
except KeyError:
# Vendor vid does not have product pid, pid
# exists for some other vendor in this
# device
continue
else:
cbcd = self.BCD
if self.test_bcd(bcd, cbcd):
if debug:
prints(dev)
if ch(dev, debug=debug):
return True, dev
return False, None
def detect_managed_devices(self, devices_on_system, force_refresh=False):
'''
Called only if MANAGES_DEVICE_PRESENCE is True.
Scan for devices that this driver can handle. Should return a device
object if a device is found. This object will be passed to the open()
method as the connected_device. If no device is found, return None. The
returned object can be anything, calibre does not use it, it is only
passed to open().
This method is called periodically by the GUI, so make sure it is not
too resource intensive. Use a cache to avoid repeatedly scanning the
system.
:param devices_on_system: Set of USB devices found on the system.
:param force_refresh: If True and the driver uses a cache to prevent
repeated scanning, the cache must be flushed.
'''
raise NotImplementedError()
def debug_managed_device_detection(self, devices_on_system, output):
'''
Called only if MANAGES_DEVICE_PRESENCE is True.
Should write information about the devices detected on the system to
output, which is a file like object.
Should return True if a device was detected and successfully opened,
otherwise False.
'''
raise NotImplementedError()
# }}}
def reset(self, key='-1', log_packets=False, report_progress=None,
detected_device=None):
"""
:param key: The key to unlock the device
:param log_packets: If true the packet stream to/from the device is logged
:param report_progress: Function that is called with a % progress
(number between 0 and 100) for various tasks
If it is called with -1 that means that the
task does not have any progress information
:param detected_device: Device information from the device scanner
"""
raise NotImplementedError()
def can_handle_windows(self, usbdevice, debug=False):
'''
Optional method to perform further checks on a device to see if this driver
is capable of handling it. If it is not it should return False. This method
is only called after the vendor, product ids and the bcd have matched, so
it can do some relatively time intensive checks. The default implementation
returns True. This method is called only on Windows. See also
:meth:`can_handle`.
Note that for devices based on USBMS this method by default delegates
to :meth:`can_handle`. So you only need to override :meth:`can_handle`
in your subclass of USBMS.
:param usbdevice: A usbdevice as returned by :func:`calibre.devices.winusb.scan_usb_devices`
'''
return True
def can_handle(self, device_info, debug=False):
'''
Unix version of :meth:`can_handle_windows`.
:param device_info: Is a tuple of (vid, pid, bcd, manufacturer, product,
serial number)
'''
return True
can_handle.is_base_class_implementation = True
def open(self, connected_device, library_uuid):
'''
Perform any device specific initialization. Called after the device is
detected but before any other functions that communicate with the device.
For example: For devices that present themselves as USB Mass storage
devices, this method would be responsible for mounting the device or
if the device has been automounted, for finding out where it has been
mounted. The method :meth:`calibre.devices.usbms.device.Device.open` has
an implementation of
this function that should serve as a good example for USB Mass storage
devices.
This method can raise an OpenFeedback exception to display a message to
the user.
:param connected_device: The device that we are trying to open. It is
a tuple of (vendor id, product id, bcd, manufacturer name, product
name, device serial number). However, some devices have no serial
number and on windows only the first three fields are present, the
rest are None.
:param library_uuid: The UUID of the current calibre library. Can be
None if there is no library (for example when used from the command
line).
'''
raise NotImplementedError()
def eject(self):
'''
Un-mount / eject the device from the OS. This does not check if there
are pending GUI jobs that need to communicate with the device.
NOTE: That this method may not be called on the same thread as the rest
of the device methods.
'''
raise NotImplementedError()
def post_yank_cleanup(self):
'''
Called if the user yanks the device without ejecting it first.
'''
raise NotImplementedError()
def set_progress_reporter(self, report_progress):
'''
Set a function to report progress information.
:param report_progress: Function that is called with a % progress
(number between 0 and 100) for various tasks
If it is called with -1 that means that the
task does not have any progress information
'''
raise NotImplementedError()
def get_device_information(self, end_session=True):
"""
Ask device for device information. See L{DeviceInfoQuery}.
:return: (device name, device version, software version on device, mime type)
The tuple can optionally have a fifth element, which is a
drive information dictionary. See usbms.driver for an example.
"""
raise NotImplementedError()
def get_driveinfo(self):
'''
Return the driveinfo dictionary. Usually called from
get_device_information(), but if loading the driveinfo is slow for this
driver, then it should set SLOW_DRIVEINFO. In this case, this method
will be called by calibre after the book lists have been loaded. Note
that it is not called on the device thread, so the driver should cache
the drive info in the books() method and this function should return
the cached data.
'''
return {}
def card_prefix(self, end_session=True):
'''
Return a 2 element list of the prefix to paths on the cards.
If no card is present None is set for the card's prefix.
E.G.
('/place', '/place2')
(None, 'place2')
('place', None)
(None, None)
'''
raise NotImplementedError()
def total_space(self, end_session=True):
"""
Get total space available on the mountpoints:
1. Main memory
2. Memory Card A
3. Memory Card B
:return: A 3 element list with total space in bytes of (1, 2, 3). If a
particular device doesn't have any of these locations it should return 0.
"""
raise NotImplementedError()
def free_space(self, end_session=True):
"""
Get free space available on the mountpoints:
1. Main memory
2. Card A
3. Card B
:return: A 3 element list with free space in bytes of (1, 2, 3). If a
particular device doesn't have any of these locations it should return -1.
"""
raise NotImplementedError()
def books(self, oncard=None, end_session=True):
"""
Return a list of e-books on the device.
:param oncard: If 'carda' or 'cardb' return a list of e-books on the
specific storage card, otherwise return list of e-books
in main memory of device. If a card is specified and no
books are on the card return empty list.
:return: A BookList.
"""
raise NotImplementedError()
def upload_books(self, files, names, on_card=None, end_session=True,
metadata=None):
'''
Upload a list of books to the device. If a file already
exists on the device, it should be replaced.
This method should raise a :class:`FreeSpaceError` if there is not enough
free space on the device. The text of the FreeSpaceError must contain the
word "card" if ``on_card`` is not None otherwise it must contain the word "memory".
:param files: A list of paths
:param names: A list of file names that the books should have
once uploaded to the device. len(names) == len(files)
:param metadata: If not None, it is a list of :class:`Metadata` objects.
The idea is to use the metadata to determine where on the device to
put the book. len(metadata) == len(files). Apart from the regular
cover (path to cover), there may also be a thumbnail attribute, which should
be used in preference. The thumbnail attribute is of the form
(width, height, cover_data as jpeg).
:return: A list of 3-element tuples. The list is meant to be passed
to :meth:`add_books_to_metadata`.
'''
raise NotImplementedError()
@classmethod
def add_books_to_metadata(cls, locations, metadata, booklists):
'''
Add locations to the booklists. This function must not communicate with
the device.
:param locations: Result of a call to L{upload_books}
:param metadata: List of :class:`Metadata` objects, same as for
:meth:`upload_books`.
:param booklists: A tuple containing the result of calls to
(:meth:`books(oncard=None)`,
:meth:`books(oncard='carda')`,
:meth`books(oncard='cardb')`).
'''
raise NotImplementedError()
def delete_books(self, paths, end_session=True):
'''
Delete books at paths on device.
'''
raise NotImplementedError()
@classmethod
def remove_books_from_metadata(cls, paths, booklists):
'''
Remove books from the metadata list. This function must not communicate
with the device.
:param paths: paths to books on the device.
:param booklists: A tuple containing the result of calls to
(:meth:`books(oncard=None)`,
:meth:`books(oncard='carda')`,
:meth`books(oncard='cardb')`).
'''
raise NotImplementedError()
def sync_booklists(self, booklists, end_session=True):
'''
Update metadata on device.
:param booklists: A tuple containing the result of calls to
(:meth:`books(oncard=None)`,
:meth:`books(oncard='carda')`,
:meth`books(oncard='cardb')`).
'''
raise NotImplementedError()
def get_file(self, path, outfile, end_session=True):
'''
Read the file at ``path`` on the device and write it to outfile.
:param outfile: file object like ``sys.stdout`` or the result of an
:func:`open` call.
'''
raise NotImplementedError()
@classmethod
def config_widget(cls):
'''
Should return a QWidget. The QWidget contains the settings for the
device interface
'''
raise NotImplementedError()
@classmethod
def save_settings(cls, settings_widget):
'''
Should save settings to disk. Takes the widget created in
:meth:`config_widget` and saves all settings to disk.
'''
raise NotImplementedError()
@classmethod
def settings(cls):
'''
Should return an opts object. The opts object should have at least one
attribute `format_map` which is an ordered list of formats for the
device.
'''
raise NotImplementedError()
def set_plugboards(self, plugboards, pb_func):
'''
provide the driver the current set of plugboards and a function to
select a specific plugboard. This method is called immediately before
add_books and sync_booklists.
pb_func is a callable with the following signature::
def pb_func(device_name, format, plugboards)
You give it the current device name (either the class name or
DEVICE_PLUGBOARD_NAME), the format you are interested in (a 'real'
format or 'device_db'), and the plugboards (you were given those by
set_plugboards, the same place you got this method).
:return: None or a single plugboard instance.
'''
pass
def set_driveinfo_name(self, location_code, name):
'''
Set the device name in the driveinfo file to 'name'. This setting will
persist until the file is re-created or the name is changed again.
Non-disk devices should implement this method based on the location
codes returned by the get_device_information() method.
'''
pass
def prepare_addable_books(self, paths):
'''
Given a list of paths, returns another list of paths. These paths
point to addable versions of the books.
If there is an error preparing a book, then instead of a path, the
position in the returned list for that book should be a three tuple:
(original_path, the exception instance, traceback)
'''
return paths
def startup(self):
'''
Called when calibre is starting the device. Do any initialization
required. Note that multiple instances of the class can be instantiated,
and thus __init__ can be called multiple times, but only one instance
will have this method called. This method is called on the device
thread, not the GUI thread.
'''
pass
def shutdown(self):
'''
Called when calibre is shutting down, either for good or in preparation
to restart. Do any cleanup required. This method is called on the
device thread, not the GUI thread.
'''
pass
def get_device_uid(self):
'''
Must return a unique id for the currently connected device (this is
called immediately after a successful call to open()). You must
implement this method if you set ASK_TO_ALLOW_CONNECT = True
'''
raise NotImplementedError()
def ignore_connected_device(self, uid):
'''
Should ignore the device identified by uid (the result of a call to
get_device_uid()) in the future. You must implement this method if you
set ASK_TO_ALLOW_CONNECT = True. Note that this function is called
immediately after open(), so if open() caches some state, the driver
should reset that state.
'''
raise NotImplementedError()
def get_user_blacklisted_devices(self):
'''
Return map of device uid to friendly name for all devices that the user
has asked to be ignored.
'''
return {}
def set_user_blacklisted_devices(self, devices):
'''
Set the list of device uids that should be ignored by this driver.
'''
pass
def specialize_global_preferences(self, device_prefs):
'''
Implement this method if your device wants to override a particular
preference. You must ensure that all call sites that want a preference
that can be overridden use device_prefs['something'] instead
of prefs['something']. Your
method should call device_prefs.set_overrides(pref=val, pref=val, ...).
Currently used for:
metadata management (prefs['manage_device_metadata'])
'''
device_prefs.set_overrides()
def set_library_info(self, library_name, library_uuid, field_metadata):
'''
Implement this method if you want information about the current calibre
library. This method is called at startup and when the calibre library
changes while connected.
'''
pass
# Dynamic control interface.
# The following methods are probably called on the GUI thread. Any driver
# that implements these methods must take pains to be thread safe, because
# the device_manager might be using the driver at the same time that one of
# these methods is called.
def is_dynamically_controllable(self):
'''
Called by the device manager when starting plugins. If this method returns
a string, then a) it supports the device manager's dynamic control
interface, and b) that name is to be used when talking to the plugin.
This method can be called on the GUI thread. A driver that implements
this method must be thread safe.
'''
return None
def start_plugin(self):
'''
This method is called to start the plugin. The plugin should begin
to accept device connections however it does that. If the plugin is
already accepting connections, then do nothing.
This method can be called on the GUI thread. A driver that implements
this method must be thread safe.
'''
pass
def stop_plugin(self):
'''
This method is called to stop the plugin. The plugin should no longer
accept connections, and should cleanup behind itself. It is likely that
this method should call shutdown. If the plugin is already not accepting
connections, then do nothing.
This method can be called on the GUI thread. A driver that implements
this method must be thread safe.
'''
pass
def get_option(self, opt_string, default=None):
'''
Return the value of the option indicated by opt_string. This method can
be called when the plugin is not started. Return None if the option does
not exist.
This method can be called on the GUI thread. A driver that implements
this method must be thread safe.
'''
return default
def set_option(self, opt_string, opt_value):
'''
Set the value of the option indicated by opt_string. This method can
be called when the plugin is not started.
This method can be called on the GUI thread. A driver that implements
this method must be thread safe.
'''
pass
def is_running(self):
'''
Return True if the plugin is started, otherwise false
This method can be called on the GUI thread. A driver that implements
this method must be thread safe.
'''
return False
def synchronize_with_db(self, db, book_id, book_metadata, first_call):
'''
Called during book matching when a book on the device is matched with
a book in calibre's db. The method is responsible for syncronizing
data from the device to calibre's db (if needed).
The method must return a two-value tuple. The first value is a set of
calibre book ids changed if calibre's database was changed or None if the
database was not changed. If the first value is an empty set then the
metadata for the book on the device is updated with calibre's metadata
and given back to the device, but no GUI refresh of that book is done.
This is useful when the calibre data is correct but must be sent to the
device.
The second value is itself a 2-value tuple. The first value in the tuple
specifies whether a book format should be sent to the device. The intent
is to permit verifying that the book on the device is the same as the
book in calibre. This value must be None if no book is to be sent,
otherwise return the base file name on the device (a string like
foobar.epub). Be sure to include the extension in the name. The device
subsystem will construct a send_books job for all books with not- None
returned values. Note: other than to later retrieve the extension, the
name is ignored in cases where the device uses a template to generate
the file name, which most do. The second value in the returned tuple
indicated whether the format is future-dated. Return True if it is,
otherwise return False. calibre will display a dialog to the user
listing all future dated books.
Extremely important: this method is called on the GUI thread. It must
be threadsafe with respect to the device manager's thread.
book_id: the calibre id for the book in the database.
book_metadata: the Metadata object for the book coming from the device.
first_call: True if this is the first call during a sync, False otherwise
'''
return (None, (None, False))
class BookList(list):
'''
A list of books. Each Book object must have the fields
#. title
#. authors
#. size (file size of the book)
#. datetime (a UTC time tuple)
#. path (path on the device to the book)
#. thumbnail (can be None) thumbnail is either a str/bytes object with the
image data or it should have an attribute image_path that stores an
absolute (platform native) path to the image
#. tags (a list of strings, can be empty).
'''
__getslice__ = None
__setslice__ = None
def __init__(self, oncard, prefix, settings):
pass
def supports_collections(self):
''' Return True if the device supports collections for this book list. '''
raise NotImplementedError()
def add_book(self, book, replace_metadata):
'''
Add the book to the booklist. Intent is to maintain any device-internal
metadata. Return True if booklists must be sync'ed
'''
raise NotImplementedError()
def remove_book(self, book):
'''
Remove a book from the booklist. Correct any device metadata at the
same time
'''
raise NotImplementedError()
def get_collections(self, collection_attributes):
'''
Return a dictionary of collections created from collection_attributes.
Each entry in the dictionary is of the form collection name:[list of
books]
The list of books is sorted by book title, except for collections
created from series, in which case series_index is used.
:param collection_attributes: A list of attributes of the Book object
'''
raise NotImplementedError()
class CurrentlyConnectedDevice(object):
def __init__(self):
self._device = None
@property
def device(self):
return self._device
# A device driver can check if a device is currently connected to calibre using
# the following code::
# from calibre.device.interface import currently_connected_device
# if currently_connected_device.device is None:
# # no device connected
# The device attribute will be either None or the device driver object
# (DevicePlugin instance) for the currently connected device.
currently_connected_device = CurrentlyConnectedDevice()

View File

@@ -0,0 +1,41 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
# License: GPLv3 Copyright: 2019, Kovid Goyal <kovid at kovidgoyal.net>
from __future__ import absolute_import, division, print_function, unicode_literals
import bs4
from bs4 import ( # noqa
CData, Comment, Declaration, NavigableString, ProcessingInstruction,
SoupStrainer, Tag, __version__
)
from polyglot.builtins import unicode_type
def parse_html(markup):
from calibre.ebooks.chardet import strip_encoding_declarations, xml_to_unicode, substitute_entites
from calibre.utils.cleantext import clean_xml_chars
if isinstance(markup, unicode_type):
markup = strip_encoding_declarations(markup)
markup = substitute_entites(markup)
else:
markup = xml_to_unicode(markup, strip_encoding_pats=True, resolve_entities=True)[0]
markup = clean_xml_chars(markup)
from html5_parser.soup import parse
return parse(markup, return_root=False)
def prettify(soup):
ans = soup.prettify()
if isinstance(ans, bytes):
ans = ans.decode('utf-8')
return ans
def BeautifulSoup(markup='', *a, **kw):
return parse_html(markup)
def BeautifulStoneSoup(markup='', *a, **kw):
return bs4.BeautifulSoup(markup, 'xml')

View File

@@ -0,0 +1,248 @@
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
'''
Code for the conversion of ebook formats and the reading of metadata
from various formats.
'''
import os, re, numbers, sys
from calibre import prints
from calibre.ebooks.chardet import xml_to_unicode
from polyglot.builtins import unicode_type
class ConversionError(Exception):
def __init__(self, msg, only_msg=False):
Exception.__init__(self, msg)
self.only_msg = only_msg
class UnknownFormatError(Exception):
pass
class DRMError(ValueError):
pass
class ParserError(ValueError):
pass
BOOK_EXTENSIONS = ['lrf', 'rar', 'zip', 'rtf', 'lit', 'txt', 'txtz', 'text', 'htm', 'xhtm',
'html', 'htmlz', 'xhtml', 'pdf', 'pdb', 'updb', 'pdr', 'prc', 'mobi', 'azw', 'doc',
'epub', 'fb2', 'fbz', 'djv', 'djvu', 'lrx', 'cbr', 'cbz', 'cbc', 'oebzip',
'rb', 'imp', 'odt', 'chm', 'tpz', 'azw1', 'pml', 'pmlz', 'mbp', 'tan', 'snb',
'xps', 'oxps', 'azw4', 'book', 'zbf', 'pobi', 'docx', 'docm', 'md',
'textile', 'markdown', 'ibook', 'ibooks', 'iba', 'azw3', 'ps', 'kepub', 'kfx', 'kpf']
def return_raster_image(path):
from calibre.utils.imghdr import what
if os.access(path, os.R_OK):
with open(path, 'rb') as f:
raw = f.read()
if what(None, raw) not in (None, 'svg'):
return raw
def extract_cover_from_embedded_svg(html, base, log):
from calibre.ebooks.oeb.base import XPath, SVG, XLINK
from calibre.utils.xml_parse import safe_xml_fromstring
root = safe_xml_fromstring(html)
svg = XPath('//svg:svg')(root)
if len(svg) == 1 and len(svg[0]) == 1 and svg[0][0].tag == SVG('image'):
image = svg[0][0]
href = image.get(XLINK('href'), None)
if href:
path = os.path.join(base, *href.split('/'))
return return_raster_image(path)
def extract_calibre_cover(raw, base, log):
from calibre.ebooks.BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(raw)
matches = soup.find(name=['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'span',
'font', 'br'])
images = soup.findAll('img', src=True)
if matches is None and len(images) == 1 and \
images[0].get('alt', '').lower()=='cover':
img = images[0]
img = os.path.join(base, *img['src'].split('/'))
q = return_raster_image(img)
if q is not None:
return q
# Look for a simple cover, i.e. a body with no text and only one <img> tag
if matches is None:
body = soup.find('body')
if body is not None:
text = u''.join(map(unicode_type, body.findAll(text=True)))
if text.strip():
# Body has text, abort
return
images = body.findAll('img', src=True)
if len(images) == 1:
img = os.path.join(base, *images[0]['src'].split('/'))
return return_raster_image(img)
def render_html_svg_workaround(path_to_html, log, width=590, height=750):
from calibre.ebooks.oeb.base import SVG_NS
with open(path_to_html, 'rb') as f:
raw = f.read()
raw = xml_to_unicode(raw, strip_encoding_pats=True)[0]
data = None
if SVG_NS in raw:
try:
data = extract_cover_from_embedded_svg(raw,
os.path.dirname(path_to_html), log)
except Exception:
pass
if data is None:
try:
data = extract_calibre_cover(raw, os.path.dirname(path_to_html), log)
except Exception:
pass
if data is None:
data = render_html_data(path_to_html, width, height)
return data
def render_html_data(path_to_html, width, height):
from calibre.ptempfile import TemporaryDirectory
from calibre.utils.ipc.simple_worker import fork_job, WorkerError
result = {}
def report_error(text=''):
prints('Failed to render', path_to_html, 'with errors:', file=sys.stderr)
if text:
prints(text, file=sys.stderr)
if result and result['stdout_stderr']:
with open(result['stdout_stderr'], 'rb') as f:
prints(f.read(), file=sys.stderr)
with TemporaryDirectory('-render-html') as tdir:
try:
result = fork_job('calibre.ebooks.render_html', 'main', args=(path_to_html, tdir, 'jpeg'))
except WorkerError as e:
report_error(e.orig_tb)
else:
if result['result']:
with open(os.path.join(tdir, 'rendered.jpeg'), 'rb') as f:
return f.read()
else:
report_error()
def check_ebook_format(stream, current_guess):
ans = current_guess
if current_guess.lower() in ('prc', 'mobi', 'azw', 'azw1', 'azw3'):
stream.seek(0)
if stream.read(3) == b'TPZ':
ans = 'tpz'
stream.seek(0)
return ans
def normalize(x):
if isinstance(x, unicode_type):
import unicodedata
x = unicodedata.normalize('NFC', x)
return x
def calibre_cover(title, author_string, series_string=None,
output_format='jpg', title_size=46, author_size=36, logo_path=None):
title = normalize(title)
author_string = normalize(author_string)
series_string = normalize(series_string)
from calibre.ebooks.covers import calibre_cover2
from calibre.utils.img import image_to_data
ans = calibre_cover2(title, author_string or '', series_string or '', logo_path=logo_path, as_qimage=True)
return image_to_data(ans, fmt=output_format)
UNIT_RE = re.compile(r'^(-*[0-9]*[.]?[0-9]*)\s*(%|em|ex|en|px|mm|cm|in|pt|pc|rem|q)$')
def unit_convert(value, base, font, dpi, body_font_size=12):
' Return value in pts'
if isinstance(value, numbers.Number):
return value
try:
return float(value) * 72.0 / dpi
except:
pass
result = value
m = UNIT_RE.match(value)
if m is not None and m.group(1):
value = float(m.group(1))
unit = m.group(2)
if unit == '%':
result = (value / 100.0) * base
elif unit == 'px':
result = value * 72.0 / dpi
elif unit == 'in':
result = value * 72.0
elif unit == 'pt':
result = value
elif unit == 'em':
result = value * font
elif unit in ('ex', 'en'):
# This is a hack for ex since we have no way to know
# the x-height of the font
font = font
result = value * font * 0.5
elif unit == 'pc':
result = value * 12.0
elif unit == 'mm':
result = value * 2.8346456693
elif unit == 'cm':
result = value * 28.346456693
elif unit == 'rem':
result = value * body_font_size
elif unit == 'q':
result = value * 0.708661417325
return result
def parse_css_length(value):
try:
m = UNIT_RE.match(value)
except TypeError:
return None, None
if m is not None and m.group(1):
value = float(m.group(1))
unit = m.group(2)
return value, unit.lower()
return None, None
def generate_masthead(title, output_path=None, width=600, height=60):
from calibre.ebooks.conversion.config import load_defaults
recs = load_defaults('mobi_output')
masthead_font_family = recs.get('masthead_font', None)
from calibre.ebooks.covers import generate_masthead
return generate_masthead(title, output_path=output_path, width=width, height=height, font_family=masthead_font_family)
def escape_xpath_attr(value):
if '"' in value:
if "'" in value:
parts = re.split('("+)', value)
ans = []
for x in parts:
if x:
q = "'" if '"' in x else '"'
ans.append(q + x + q)
return 'concat(%s)' % ', '.join(ans)
else:
return "'%s'" % value
return '"%s"' % value

View File

@@ -0,0 +1,189 @@
#!/usr/bin/env python2
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
import re, codecs
from polyglot.builtins import unicode_type
_encoding_pats = (
# XML declaration
r'<\?[^<>]+encoding\s*=\s*[\'"](.*?)[\'"][^<>]*>',
# HTML 5 charset
r'''<meta\s+charset=['"]([-_a-z0-9]+)['"][^<>]*>(?:\s*</meta>){0,1}''',
# HTML 4 Pragma directive
r'''<meta\s+?[^<>]*?content\s*=\s*['"][^'"]*?charset=([-_a-z0-9]+)[^'"]*?['"][^<>]*>(?:\s*</meta>){0,1}''',
)
def compile_pats(binary):
for raw in _encoding_pats:
if binary:
raw = raw.encode('ascii')
yield re.compile(raw, flags=re.IGNORECASE)
class LazyEncodingPats(object):
def __call__(self, binary=False):
attr = 'binary_pats' if binary else 'unicode_pats'
pats = getattr(self, attr, None)
if pats is None:
pats = tuple(compile_pats(binary))
setattr(self, attr, pats)
for pat in pats:
yield pat
lazy_encoding_pats = LazyEncodingPats()
ENTITY_PATTERN = re.compile(r'&(\S+?);')
def strip_encoding_declarations(raw, limit=50*1024, preserve_newlines=False):
prefix = raw[:limit]
suffix = raw[limit:]
is_binary = isinstance(raw, bytes)
if preserve_newlines:
if is_binary:
sub = lambda m: b'\n' * m.group().count(b'\n')
else:
sub = lambda m: '\n' * m.group().count('\n')
else:
sub = b'' if is_binary else u''
for pat in lazy_encoding_pats(is_binary):
prefix = pat.sub(sub, prefix)
raw = prefix + suffix
return raw
def replace_encoding_declarations(raw, enc='utf-8', limit=50*1024):
prefix = raw[:limit]
suffix = raw[limit:]
changed = [False]
is_binary = isinstance(raw, bytes)
if is_binary:
if not isinstance(enc, bytes):
enc = enc.encode('ascii')
else:
if isinstance(enc, bytes):
enc = enc.decode('ascii')
def sub(m):
ans = m.group()
if m.group(1).lower() != enc.lower():
changed[0] = True
start, end = m.start(1) - m.start(0), m.end(1) - m.end(0)
ans = ans[:start] + enc + ans[end:]
return ans
for pat in lazy_encoding_pats(is_binary):
prefix = pat.sub(sub, prefix)
raw = prefix + suffix
return raw, changed[0]
def find_declared_encoding(raw, limit=50*1024):
prefix = raw[:limit]
is_binary = isinstance(raw, bytes)
for pat in lazy_encoding_pats(is_binary):
m = pat.search(prefix)
if m is not None:
ans = m.group(1)
if is_binary:
ans = ans.decode('ascii', 'replace')
return ans
def substitute_entites(raw):
from calibre import xml_entity_to_unicode
return ENTITY_PATTERN.sub(xml_entity_to_unicode, raw)
_CHARSET_ALIASES = {"macintosh" : "mac-roman",
"x-sjis" : "shift-jis"}
def detect(*args, **kwargs):
from chardet import detect
return detect(*args, **kwargs)
def force_encoding(raw, verbose, assume_utf8=False):
from calibre.constants import preferred_encoding
try:
chardet = detect(raw[:1024*50])
except:
chardet = {'encoding':preferred_encoding, 'confidence':0}
encoding = chardet['encoding']
if chardet['confidence'] < 1 and assume_utf8:
encoding = 'utf-8'
if chardet['confidence'] < 1 and verbose:
print('WARNING: Encoding detection confidence for %s is %d%%'%(
chardet['encoding'], chardet['confidence']*100))
if not encoding:
encoding = preferred_encoding
encoding = encoding.lower()
encoding = _CHARSET_ALIASES.get(encoding, encoding)
if encoding == 'ascii':
encoding = 'utf-8'
return encoding
def detect_xml_encoding(raw, verbose=False, assume_utf8=False):
if not raw or isinstance(raw, unicode_type):
return raw, None
for x in ('utf8', 'utf-16-le', 'utf-16-be'):
bom = getattr(codecs, 'BOM_'+x.upper().replace('-16', '16').replace(
'-', '_'))
if raw.startswith(bom):
return raw[len(bom):], x
encoding = None
for pat in lazy_encoding_pats(True):
match = pat.search(raw)
if match:
encoding = match.group(1)
encoding = encoding.decode('ascii', 'replace')
break
if encoding is None:
encoding = force_encoding(raw, verbose, assume_utf8=assume_utf8)
if encoding.lower().strip() == 'macintosh':
encoding = 'mac-roman'
if encoding.lower().replace('_', '-').strip() in (
'gb2312', 'chinese', 'csiso58gb231280', 'euc-cn', 'euccn',
'eucgb2312-cn', 'gb2312-1980', 'gb2312-80', 'iso-ir-58'):
# Microsoft Word exports to HTML with encoding incorrectly set to
# gb2312 instead of gbk. gbk is a superset of gb2312, anyway.
encoding = 'gbk'
try:
codecs.lookup(encoding)
except LookupError:
encoding = 'utf-8'
return raw, encoding
def xml_to_unicode(raw, verbose=False, strip_encoding_pats=False,
resolve_entities=False, assume_utf8=False):
'''
Force conversion of byte string to unicode. Tries to look for XML/HTML
encoding declaration first, if not found uses the chardet library and
prints a warning if detection confidence is < 100%
@return: (unicode, encoding used)
'''
if not raw:
return '', None
raw, encoding = detect_xml_encoding(raw, verbose=verbose,
assume_utf8=assume_utf8)
if not isinstance(raw, unicode_type):
raw = raw.decode(encoding, 'replace')
if strip_encoding_pats:
raw = strip_encoding_declarations(raw)
if resolve_entities:
raw = substitute_entites(raw)
return raw, encoding

View File

@@ -0,0 +1,6 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2009, John Schember <john@nachtimwald.com>'
__docformat__ = 'restructuredtext en'

View File

@@ -0,0 +1,238 @@
/*
:mod:`cPalmdoc` -- Palmdoc compression/decompression
=====================================================
.. module:: cPalmdoc
:platform: All
:synopsis: Compression decompression of Palmdoc implemented in C for speed
.. moduleauthor:: Kovid Goyal <kovid@kovidgoyal.net> Copyright 2009
*/
#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <stdio.h>
#define BUFFER 6000
#define MIN(x, y) ( ((x) < (y)) ? (x) : (y) )
#define MAX(x, y) ( ((x) > (y)) ? (x) : (y) )
typedef unsigned short int Byte;
typedef struct {
Byte *data;
Py_ssize_t len;
} buffer;
#ifdef bool
#undef bool
#endif
#define bool int
#ifdef false
#undef false
#endif
#define false 0
#ifdef true
#undef true
#endif
#define true 1
#define CHAR(x) (( (x) > 127 ) ? (x)-256 : (x))
#if PY_MAJOR_VERSION >= 3
#define BUFFER_FMT "y#"
#define BYTES_FMT "y#"
#else
#define BUFFER_FMT "t#"
#define BYTES_FMT "s#"
#endif
static PyObject *
cpalmdoc_decompress(PyObject *self, PyObject *args) {
const char *_input = NULL; Py_ssize_t input_len = 0;
Byte *input; char *output; Byte c; PyObject *ans;
Py_ssize_t i = 0, o = 0, j = 0, di, n;
if (!PyArg_ParseTuple(args, BUFFER_FMT, &_input, &input_len))
return NULL;
input = (Byte *) PyMem_Malloc(sizeof(Byte)*input_len);
if (input == NULL) return PyErr_NoMemory();
// Map chars to bytes
for (j = 0; j < input_len; j++)
input[j] = (_input[j] < 0) ? _input[j]+256 : _input[j];
output = (char *)PyMem_Malloc(sizeof(char)*(MAX(BUFFER, 8*input_len)));
if (output == NULL) return PyErr_NoMemory();
while (i < input_len) {
c = input[i++];
if (c >= 1 && c <= 8) // copy 'c' bytes
while (c--) output[o++] = (char)input[i++];
else if (c <= 0x7F) // 0, 09-7F = self
output[o++] = (char)c;
else if (c >= 0xC0) { // space + ASCII char
output[o++] = ' ';
output[o++] = c ^ 0x80;
}
else { // 80-BF repeat sequences
c = (c << 8) + input[i++];
di = (c & 0x3FFF) >> 3;
for ( n = (c & 7) + 3; n--; ++o )
output[o] = output[o - di];
}
}
ans = Py_BuildValue(BYTES_FMT, output, o);
if (output != NULL) PyMem_Free(output);
if (input != NULL) PyMem_Free(input);
return ans;
}
static bool
cpalmdoc_memcmp( Byte *a, Byte *b, Py_ssize_t len) {
Py_ssize_t i;
for (i = 0; i < len; i++) if (a[i] != b[i]) return false;
return true;
}
static Py_ssize_t
cpalmdoc_rfind(Byte *data, Py_ssize_t pos, Py_ssize_t chunk_length) {
Py_ssize_t i;
for (i = pos - chunk_length; i > -1; i--)
if (cpalmdoc_memcmp(data+i, data+pos, chunk_length)) return i;
return pos;
}
static Py_ssize_t
cpalmdoc_do_compress(buffer *b, char *output) {
Py_ssize_t i = 0, j, chunk_len, dist;
unsigned int compound;
Byte c, n;
bool found;
char *head;
buffer temp;
head = output;
temp.data = (Byte *)PyMem_Malloc(sizeof(Byte)*8); temp.len = 0;
if (temp.data == NULL) return 0;
while (i < b->len) {
c = b->data[i];
//do repeats
if ( i > 10 && (b->len - i) > 10) {
found = false;
for (chunk_len = 10; chunk_len > 2; chunk_len--) {
j = cpalmdoc_rfind(b->data, i, chunk_len);
dist = i - j;
if (j < i && dist <= 2047) {
found = true;
compound = (unsigned int)((dist << 3) + chunk_len-3);
*(output++) = CHAR(0x80 + (compound >> 8 ));
*(output++) = CHAR(compound & 0xFF);
i += chunk_len;
break;
}
}
if (found) continue;
}
//write single character
i++;
if (c == 32 && i < b->len) {
n = b->data[i];
if ( n >= 0x40 && n <= 0x7F) {
*(output++) = CHAR(n^0x80); i++; continue;
}
}
if (c == 0 || (c > 8 && c < 0x80))
*(output++) = CHAR(c);
else { // Write binary data
j = i;
temp.data[0] = c; temp.len = 1;
while (j < b->len && temp.len < 8) {
c = b->data[j];
if (c == 0 || (c > 8 && c < 0x80)) break;
temp.data[temp.len++] = c; j++;
}
i += temp.len - 1;
*(output++) = (char)temp.len;
for (j=0; j < temp.len; j++) *(output++) = (char)temp.data[j];
}
}
PyMem_Free(temp.data);
return output - head;
}
static PyObject *
cpalmdoc_compress(PyObject *self, PyObject *args) {
const char *_input = NULL; Py_ssize_t input_len = 0;
char *output; PyObject *ans;
Py_ssize_t j = 0;
buffer b;
if (!PyArg_ParseTuple(args, BUFFER_FMT, &_input, &input_len))
return NULL;
b.data = (Byte *)PyMem_Malloc(sizeof(Byte)*input_len);
if (b.data == NULL) return PyErr_NoMemory();
// Map chars to bytes
for (j = 0; j < input_len; j++)
b.data[j] = (_input[j] < 0) ? _input[j]+256 : _input[j];
b.len = input_len;
// Make the output buffer larger than the input as sometimes
// compression results in a larger block
output = (char *)PyMem_Malloc(sizeof(char) * (int)(1.25*b.len));
if (output == NULL) return PyErr_NoMemory();
j = cpalmdoc_do_compress(&b, output);
if ( j == 0) return PyErr_NoMemory();
ans = Py_BuildValue(BYTES_FMT, output, j);
PyMem_Free(output);
PyMem_Free(b.data);
return ans;
}
static char cPalmdoc_doc[] = "Compress and decompress palmdoc strings.";
static PyMethodDef cPalmdoc_methods[] = {
{"decompress", cpalmdoc_decompress, METH_VARARGS,
"decompress(bytestring) -> decompressed bytestring\n\n"
"Decompress a palmdoc compressed byte string. "
},
{"compress", cpalmdoc_compress, METH_VARARGS,
"compress(bytestring) -> compressed bytestring\n\n"
"Palmdoc compress a byte string. "
},
{NULL, NULL, 0, NULL}
};
#if PY_MAJOR_VERSION >= 3
#define INITERROR return NULL
#define INITMODULE PyModule_Create(&cPalmdoc_module)
static struct PyModuleDef cPalmdoc_module = {
/* m_base */ PyModuleDef_HEAD_INIT,
/* m_name */ "cPalmdoc",
/* m_doc */ cPalmdoc_doc,
/* m_size */ -1,
/* m_methods */ cPalmdoc_methods,
/* m_slots */ 0,
/* m_traverse */ 0,
/* m_clear */ 0,
/* m_free */ 0,
};
CALIBRE_MODINIT_FUNC PyInit_cPalmdoc(void) {
#else
#define INITERROR return
#define INITMODULE Py_InitModule3("cPalmdoc", cPalmdoc_methods, cPalmdoc_doc)
CALIBRE_MODINIT_FUNC initcPalmdoc(void) {
#endif
PyObject *m;
m = INITMODULE;
if (m == NULL) {
INITERROR;
}
#if PY_MAJOR_VERSION >= 3
return m;
#endif
}

View File

@@ -0,0 +1,96 @@
#!/usr/bin/env python2
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
import io
from struct import pack
from calibre.constants import plugins
from polyglot.builtins import range
cPalmdoc = plugins['cPalmdoc'][0]
if not cPalmdoc:
raise RuntimeError(('Failed to load required cPalmdoc module: '
'%s')%plugins['cPalmdoc'][1])
def decompress_doc(data):
return cPalmdoc.decompress(data)
def compress_doc(data):
return cPalmdoc.compress(data) if data else b''
def py_compress_doc(data):
out = io.BytesIO()
i = 0
ldata = len(data)
while i < ldata:
if i > 10 and (ldata - i) > 10:
chunk = b''
match = -1
for j in range(10, 2, -1):
chunk = data[i:i+j]
try:
match = data.rindex(chunk, 0, i)
except ValueError:
continue
if (i - match) <= 2047:
break
match = -1
if match >= 0:
n = len(chunk)
m = i - match
code = 0x8000 + ((m << 3) & 0x3ff8) + (n - 3)
out.write(pack('>H', code))
i += n
continue
ch = data[i:i+1]
och = ord(ch)
i += 1
if ch == b' ' and (i + 1) < ldata:
onch = ord(data[i:i+1])
if onch >= 0x40 and onch < 0x80:
out.write(pack('>B', onch ^ 0x80))
i += 1
continue
if och == 0 or (och > 8 and och < 0x80):
out.write(ch)
else:
j = i
binseq = [ch]
while j < ldata and len(binseq) < 8:
ch = data[j:j+1]
och = ord(ch)
if och == 0 or (och > 8 and och < 0x80):
break
binseq.append(ch)
j += 1
out.write(pack('>B', len(binseq)))
out.write(b''.join(binseq))
i += len(binseq) - 1
return out.getvalue()
def find_tests():
import unittest
class Test(unittest.TestCase):
def test_palmdoc_compression(self):
for test in [
b'abc\x03\x04\x05\x06ms', # Test binary writing
b'a b c \xfed ', # Test encoding of spaces
b'0123456789axyz2bxyz2cdfgfo9iuyerh',
b'0123456789asd0123456789asd|yyzzxxffhhjjkk',
(b'ciewacnaq eiu743 r787q 0w% ; sa fd\xef\ffdxosac wocjp acoiecowei '
b'owaic jociowapjcivcjpoivjporeivjpoavca; p9aw8743y6r74%$^$^%8 ')
]:
x = compress_doc(test)
self.assertEqual(py_compress_doc(test), x)
self.assertEqual(decompress_doc(x), test)
return unittest.defaultTestLoader.loadTestsFromTestCase(Test)

View File

@@ -0,0 +1,30 @@
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2011, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
from polyglot.builtins import native_string_type
class ConversionUserFeedBack(Exception):
def __init__(self, title, msg, level='info', det_msg=''):
''' Show a simple message to the user
:param title: The title (very short description)
:param msg: The message to show the user
:param level: Must be one of 'info', 'warn' or 'error'
:param det_msg: Optional detailed message to show the user
'''
import json
Exception.__init__(self, json.dumps({'msg':msg, 'level':level,
'det_msg':det_msg, 'title':title}))
self.title, self.msg, self.det_msg = title, msg, det_msg
self.level = level
# Ensure exception uses fully qualified name as this is used to detect it in
# the GUI.
ConversionUserFeedBack.__name__ = native_string_type('calibre.ebooks.conversion.ConversionUserFeedBack')

View File

@@ -0,0 +1,428 @@
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
'''
Command line interface to conversion sub-system
'''
import sys, os, numbers
from optparse import OptionGroup, Option
from collections import OrderedDict
from calibre.utils.config import OptionParser
from calibre.utils.logging import Log
from calibre.customize.conversion import OptionRecommendation
from calibre import patheq
from calibre.ebooks.conversion import ConversionUserFeedBack
from calibre.utils.localization import localize_user_manual_link
from polyglot.builtins import iteritems
USAGE = '%prog ' + _('''\
input_file output_file [options]
Convert an e-book from one format to another.
input_file is the input and output_file is the output. Both must be \
specified as the first two arguments to the command.
The output e-book format is guessed from the file extension of \
output_file. output_file can also be of the special format .EXT where \
EXT is the output file extension. In this case, the name of the output \
file is derived from the name of the input file. Note that the filenames must \
not start with a hyphen. Finally, if output_file has no extension, then \
it is treated as a directory and an "open e-book" (OEB) consisting of HTML \
files is written to that directory. These files are the files that would \
normally have been passed to the output plugin.
After specifying the input \
and output file you can customize the conversion by specifying various \
options. The available options depend on the input and output file types. \
To get help on them specify the input and output file and then use the -h \
option.
For full documentation of the conversion system see
''') + localize_user_manual_link('https://manual.calibre-ebook.com/conversion.html')
HEURISTIC_OPTIONS = ['markup_chapter_headings',
'italicize_common_cases', 'fix_indents',
'html_unwrap_factor', 'unwrap_lines',
'delete_blank_paragraphs', 'format_scene_breaks',
'dehyphenate', 'renumber_headings',
'replace_scene_breaks']
DEFAULT_TRUE_OPTIONS = HEURISTIC_OPTIONS + ['remove_fake_margins']
def print_help(parser, log):
parser.print_help()
def check_command_line_options(parser, args, log):
if len(args) < 3 or args[1].startswith('-') or args[2].startswith('-'):
print_help(parser, log)
log.error('\n\nYou must specify the input AND output files')
raise SystemExit(1)
input = os.path.abspath(args[1])
if not input.endswith('.recipe') and not os.access(input, os.R_OK) and not \
('-h' in args or '--help' in args):
log.error('Cannot read from', input)
raise SystemExit(1)
if input.endswith('.recipe') and not os.access(input, os.R_OK):
input = args[1]
output = args[2]
if (output.startswith('.') and output[:2] not in {'..', '.'} and '/' not in
output and '\\' not in output):
output = os.path.splitext(os.path.basename(input))[0]+output
output = os.path.abspath(output)
return input, output
def option_recommendation_to_cli_option(add_option, rec):
opt = rec.option
switches = ['-'+opt.short_switch] if opt.short_switch else []
switches.append('--'+opt.long_switch)
attrs = dict(dest=opt.name, help=opt.help,
choices=opt.choices, default=rec.recommended_value)
if isinstance(rec.recommended_value, type(True)):
attrs['action'] = 'store_false' if rec.recommended_value else \
'store_true'
else:
if isinstance(rec.recommended_value, numbers.Integral):
attrs['type'] = 'int'
if isinstance(rec.recommended_value, numbers.Real):
attrs['type'] = 'float'
if opt.long_switch == 'verbose':
attrs['action'] = 'count'
attrs.pop('type', '')
if opt.name == 'read_metadata_from_opf':
switches.append('--from-opf')
if opt.name == 'transform_css_rules':
attrs['help'] = _(
'Path to a file containing rules to transform the CSS styles'
' in this book. The easiest way to create such a file is to'
' use the wizard for creating rules in the calibre GUI. Access'
' it in the "Look & feel->Transform styles" section of the conversion'
' dialog. Once you create the rules, you can use the "Export" button'
' to save them to a file.'
)
if opt.name in DEFAULT_TRUE_OPTIONS and rec.recommended_value is True:
switches = ['--disable-'+opt.long_switch]
add_option(Option(*switches, **attrs))
def group_titles():
return _('INPUT OPTIONS'), _('OUTPUT OPTIONS')
def recipe_test(option, opt_str, value, parser):
assert value is None
value = []
def floatable(s):
try:
float(s)
return True
except ValueError:
return False
for arg in parser.rargs:
# stop on --foo like options
if arg[:2] == "--":
break
# stop on -a, but not on -3 or -3.0
if arg[:1] == "-" and len(arg) > 1 and not floatable(arg):
break
try:
value.append(int(arg))
except (TypeError, ValueError, AttributeError):
break
if len(value) == 2:
break
del parser.rargs[:len(value)]
while len(value) < 2:
value.append(2)
setattr(parser.values, option.dest, tuple(value))
def add_input_output_options(parser, plumber):
input_options, output_options = \
plumber.input_options, plumber.output_options
def add_options(group, options):
for opt in options:
if plumber.input_fmt == 'recipe' and opt.option.long_switch == 'test':
group(Option('--test', dest='test', action='callback', callback=recipe_test))
else:
option_recommendation_to_cli_option(group, opt)
if input_options:
title = group_titles()[0]
io = OptionGroup(parser, title, _('Options to control the processing'
' of the input %s file')%plumber.input_fmt)
add_options(io.add_option, input_options)
parser.add_option_group(io)
if output_options:
title = group_titles()[1]
oo = OptionGroup(parser, title, _('Options to control the processing'
' of the output %s')%plumber.output_fmt)
add_options(oo.add_option, output_options)
parser.add_option_group(oo)
def add_pipeline_options(parser, plumber):
groups = OrderedDict((
('' , ('',
[
'input_profile',
'output_profile',
]
)),
(_('LOOK AND FEEL') , (
_('Options to control the look and feel of the output'),
[
'base_font_size', 'disable_font_rescaling',
'font_size_mapping', 'embed_font_family',
'subset_embedded_fonts', 'embed_all_fonts',
'line_height', 'minimum_line_height',
'linearize_tables',
'extra_css', 'filter_css', 'transform_css_rules', 'expand_css',
'smarten_punctuation', 'unsmarten_punctuation',
'margin_top', 'margin_left', 'margin_right',
'margin_bottom', 'change_justification',
'insert_blank_line', 'insert_blank_line_size',
'remove_paragraph_spacing',
'remove_paragraph_spacing_indent_size',
'asciiize', 'keep_ligatures',
]
)),
(_('HEURISTIC PROCESSING') , (
_('Modify the document text and structure using common'
' patterns. Disabled by default. Use %(en)s to enable. '
' Individual actions can be disabled with the %(dis)s options.')
% dict(en='--enable-heuristics', dis='--disable-*'),
['enable_heuristics'] + HEURISTIC_OPTIONS
)),
(_('SEARCH AND REPLACE') , (
_('Modify the document text and structure using user defined patterns.'),
[
'sr1_search', 'sr1_replace',
'sr2_search', 'sr2_replace',
'sr3_search', 'sr3_replace',
'search_replace',
]
)),
(_('STRUCTURE DETECTION') , (
_('Control auto-detection of document structure.'),
[
'chapter', 'chapter_mark',
'prefer_metadata_cover', 'remove_first_image',
'insert_metadata', 'page_breaks_before',
'remove_fake_margins', 'start_reading_at',
]
)),
(_('TABLE OF CONTENTS') , (
_('Control the automatic generation of a Table of Contents. By '
'default, if the source file has a Table of Contents, it will '
'be used in preference to the automatically generated one.'),
[
'level1_toc', 'level2_toc', 'level3_toc',
'toc_threshold', 'max_toc_links', 'no_chapters_in_toc',
'use_auto_toc', 'toc_filter', 'duplicate_links_in_toc',
]
)),
(_('METADATA') , (_('Options to set metadata in the output'),
plumber.metadata_option_names + ['read_metadata_from_opf'],
)),
(_('DEBUG'), (_('Options to help with debugging the conversion'),
[
'verbose',
'debug_pipeline',
])),
))
for group, (desc, options) in iteritems(groups):
if group:
group = OptionGroup(parser, group, desc)
parser.add_option_group(group)
add_option = group.add_option if group != '' else parser.add_option
for name in options:
rec = plumber.get_option_by_name(name)
if rec.level < rec.HIGH:
option_recommendation_to_cli_option(add_option, rec)
def option_parser():
parser = OptionParser(usage=USAGE)
parser.add_option('--list-recipes', default=False, action='store_true',
help=_('List builtin recipe names. You can create an e-book from '
'a builtin recipe like this: ebook-convert "Recipe Name.recipe" '
'output.epub'))
return parser
class ProgressBar(object):
def __init__(self, log):
self.log = log
def __call__(self, frac, msg=''):
if msg:
percent = int(frac*100)
self.log('%d%% %s'%(percent, msg))
def create_option_parser(args, log):
if '--version' in args:
from calibre.constants import __appname__, __version__, __author__
log(os.path.basename(args[0]), '('+__appname__, __version__+')')
log('Created by:', __author__)
raise SystemExit(0)
if '--list-recipes' in args:
from calibre.web.feeds.recipes.collection import get_builtin_recipe_titles
log('Available recipes:')
titles = sorted(get_builtin_recipe_titles())
for title in titles:
try:
log('\t'+title)
except:
log('\t'+repr(title))
log('%d recipes available'%len(titles))
raise SystemExit(0)
parser = option_parser()
if len(args) < 3:
print_help(parser, log)
if any(x in args for x in ('-h', '--help')):
raise SystemExit(0)
else:
raise SystemExit(1)
input, output = check_command_line_options(parser, args, log)
from calibre.ebooks.conversion.plumber import Plumber
reporter = ProgressBar(log)
if patheq(input, output):
raise ValueError('Input file is the same as the output file')
plumber = Plumber(input, output, log, reporter)
add_input_output_options(parser, plumber)
add_pipeline_options(parser, plumber)
return parser, plumber
def abspath(x):
if x.startswith('http:') or x.startswith('https:'):
return x
return os.path.abspath(os.path.expanduser(x))
def escape_sr_pattern(exp):
return exp.replace('\n', '\ue123')
def read_sr_patterns(path, log=None):
import json, re
pats = []
with open(path, 'rb') as f:
lines = f.read().decode('utf-8').splitlines()
pat = None
for line in lines:
if pat is None:
if not line.strip():
continue
line = line.replace('\ue123', '\n')
try:
re.compile(line)
except:
msg = 'Invalid regular expression: %r from file: %r'%(
line, path)
if log is not None:
log.error(msg)
raise SystemExit(1)
else:
raise ValueError(msg)
pat = line
else:
pats.append((pat, line))
pat = None
return json.dumps(pats)
def main(args=sys.argv):
log = Log()
parser, plumber = create_option_parser(args, log)
opts, leftover_args = parser.parse_args(args)
if len(leftover_args) > 3:
log.error('Extra arguments not understood:', u', '.join(leftover_args[3:]))
return 1
for x in ('read_metadata_from_opf', 'cover'):
if getattr(opts, x, None) is not None:
setattr(opts, x, abspath(getattr(opts, x)))
if opts.search_replace:
opts.search_replace = read_sr_patterns(opts.search_replace, log)
if opts.transform_css_rules:
from calibre.ebooks.css_transform_rules import import_rules, validate_rule
with open(opts.transform_css_rules, 'rb') as tcr:
opts.transform_css_rules = rules = list(import_rules(tcr.read()))
for rule in rules:
title, msg = validate_rule(rule)
if title and msg:
log.error('Failed to parse CSS transform rules')
log.error(title)
log.error(msg)
return 1
recommendations = [(n.dest, getattr(opts, n.dest),
OptionRecommendation.HIGH)
for n in parser.options_iter()
if n.dest]
plumber.merge_ui_recommendations(recommendations)
try:
plumber.run()
except ConversionUserFeedBack as e:
ll = {'info': log.info, 'warn': log.warn,
'error':log.error}.get(e.level, log.info)
ll(e.title)
if e.det_msg:
log.debug(e.detmsg)
ll(e.msg)
raise SystemExit(1)
log(_('Output saved to'), ' ', plumber.output)
return 0
def manual_index_strings():
return _('''\
The options and default values for the options change depending on both the
input and output formats, so you should always check with::
%s
Below are the options that are common to all conversion, followed by the
options specific to every input and output format.''')
if __name__ == '__main__':
sys.exit(main())

View File

@@ -0,0 +1,10 @@
#!/usr/bin/env python2
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2012, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'

View File

@@ -0,0 +1,29 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2011, John Schember <john@nachtimwald.com>'
__docformat__ = 'restructuredtext en'
from calibre.customize.conversion import InputFormatPlugin
from polyglot.builtins import getcwd
class AZW4Input(InputFormatPlugin):
name = 'AZW4 Input'
author = 'John Schember'
description = 'Convert AZW4 to HTML'
file_types = {'azw4'}
commit_name = 'azw4_input'
def convert(self, stream, options, file_ext, log,
accelerators):
from calibre.ebooks.pdb.header import PdbHeaderReader
from calibre.ebooks.azw4.reader import Reader
header = PdbHeaderReader(stream)
reader = Reader(header, stream, log, options)
opf = reader.extract_content(getcwd())
return opf

View File

@@ -0,0 +1,202 @@
from __future__ import absolute_import, division, print_function, unicode_literals
''' CHM File decoding support '''
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>,' \
' and Alex Bramley <a.bramley at gmail.com>.'
import os
from calibre.customize.conversion import InputFormatPlugin
from calibre.ptempfile import TemporaryDirectory
from calibre.constants import filesystem_encoding
from polyglot.builtins import unicode_type, as_bytes
class CHMInput(InputFormatPlugin):
name = 'CHM Input'
author = 'Kovid Goyal and Alex Bramley'
description = 'Convert CHM files to OEB'
file_types = {'chm'}
commit_name = 'chm_input'
def _chmtohtml(self, output_dir, chm_path, no_images, log, debug_dump=False):
from calibre.ebooks.chm.reader import CHMReader
log.debug('Opening CHM file')
rdr = CHMReader(chm_path, log, input_encoding=self.opts.input_encoding)
log.debug('Extracting CHM to %s' % output_dir)
rdr.extract_content(output_dir, debug_dump=debug_dump)
self._chm_reader = rdr
return rdr.hhc_path
def convert(self, stream, options, file_ext, log, accelerators):
from calibre.ebooks.chm.metadata import get_metadata_from_reader
from calibre.customize.ui import plugin_for_input_format
self.opts = options
log.debug('Processing CHM...')
with TemporaryDirectory('_chm2oeb') as tdir:
if not isinstance(tdir, unicode_type):
tdir = tdir.decode(filesystem_encoding)
html_input = plugin_for_input_format('html')
for opt in html_input.options:
setattr(options, opt.option.name, opt.recommended_value)
no_images = False # options.no_images
chm_name = stream.name
# chm_data = stream.read()
# closing stream so CHM can be opened by external library
stream.close()
log.debug('tdir=%s' % tdir)
log.debug('stream.name=%s' % stream.name)
debug_dump = False
odi = options.debug_pipeline
if odi:
debug_dump = os.path.join(odi, 'input')
mainname = self._chmtohtml(tdir, chm_name, no_images, log,
debug_dump=debug_dump)
mainpath = os.path.join(tdir, mainname)
try:
metadata = get_metadata_from_reader(self._chm_reader)
except Exception:
log.exception('Failed to read metadata, using filename')
from calibre.ebooks.metadata.book.base import Metadata
metadata = Metadata(os.path.basename(chm_name))
encoding = self._chm_reader.get_encoding() or options.input_encoding or 'cp1252'
self._chm_reader.CloseCHM()
# print((tdir, mainpath))
# from calibre import ipython
# ipython()
options.debug_pipeline = None
options.input_encoding = 'utf-8'
uenc = encoding
if os.path.abspath(mainpath) in self._chm_reader.re_encoded_files:
uenc = 'utf-8'
htmlpath, toc = self._create_html_root(mainpath, log, uenc)
oeb = self._create_oebbook_html(htmlpath, tdir, options, log, metadata)
options.debug_pipeline = odi
if toc.count() > 1:
oeb.toc = self.parse_html_toc(oeb.spine[0])
oeb.manifest.remove(oeb.spine[0])
oeb.auto_generated_toc = False
return oeb
def parse_html_toc(self, item):
from calibre.ebooks.oeb.base import TOC, XPath
dx = XPath('./h:div')
ax = XPath('./h:a[1]')
def do_node(parent, div):
for child in dx(div):
a = ax(child)[0]
c = parent.add(a.text, a.attrib['href'])
do_node(c, child)
toc = TOC()
root = XPath('//h:div[1]')(item.data)[0]
do_node(toc, root)
return toc
def _create_oebbook_html(self, htmlpath, basedir, opts, log, mi):
# use HTMLInput plugin to generate book
from calibre.customize.builtins import HTMLInput
opts.breadth_first = True
htmlinput = HTMLInput(None)
oeb = htmlinput.create_oebbook(htmlpath, basedir, opts, log, mi)
return oeb
def _create_html_root(self, hhcpath, log, encoding):
from lxml import html
from polyglot.urllib import unquote as _unquote
from calibre.ebooks.oeb.base import urlquote
from calibre.ebooks.chardet import xml_to_unicode
hhcdata = self._read_file(hhcpath)
hhcdata = hhcdata.decode(encoding)
hhcdata = xml_to_unicode(hhcdata, verbose=True,
strip_encoding_pats=True, resolve_entities=True)[0]
hhcroot = html.fromstring(hhcdata)
toc = self._process_nodes(hhcroot)
# print("=============================")
# print("Printing hhcroot")
# print(etree.tostring(hhcroot, pretty_print=True))
# print("=============================")
log.debug('Found %d section nodes' % toc.count())
htmlpath = os.path.splitext(hhcpath)[0] + ".html"
base = os.path.dirname(os.path.abspath(htmlpath))
def unquote(x):
if isinstance(x, unicode_type):
x = x.encode('utf-8')
return _unquote(x).decode('utf-8')
def unquote_path(x):
y = unquote(x)
if (not os.path.exists(os.path.join(base, x)) and os.path.exists(os.path.join(base, y))):
x = y
return x
def donode(item, parent, base, subpath):
for child in item:
title = child.title
if not title:
continue
raw = unquote_path(child.href or '')
rsrcname = os.path.basename(raw)
rsrcpath = os.path.join(subpath, rsrcname)
if (not os.path.exists(os.path.join(base, rsrcpath)) and os.path.exists(os.path.join(base, raw))):
rsrcpath = raw
if '%' not in rsrcpath:
rsrcpath = urlquote(rsrcpath)
if not raw:
rsrcpath = ''
c = DIV(A(title, href=rsrcpath))
donode(child, c, base, subpath)
parent.append(c)
with open(htmlpath, 'wb') as f:
if toc.count() > 1:
from lxml.html.builder import HTML, BODY, DIV, A
path0 = toc[0].href
path0 = unquote_path(path0)
subpath = os.path.dirname(path0)
base = os.path.dirname(f.name)
root = DIV()
donode(toc, root, base, subpath)
raw = html.tostring(HTML(BODY(root)), encoding='utf-8',
pretty_print=True)
f.write(raw)
else:
f.write(as_bytes(hhcdata))
return htmlpath, toc
def _read_file(self, name):
with lopen(name, 'rb') as f:
data = f.read()
return data
def add_node(self, node, toc, ancestor_map):
from calibre.ebooks.chm.reader import match_string
if match_string(node.attrib.get('type', ''), 'text/sitemap'):
p = node.xpath('ancestor::ul[1]/ancestor::li[1]/object[1]')
parent = p[0] if p else None
toc = ancestor_map.get(parent, toc)
title = href = ''
for param in node.xpath('./param'):
if match_string(param.attrib['name'], 'name'):
title = param.attrib['value']
elif match_string(param.attrib['name'], 'local'):
href = param.attrib['value']
child = toc.add(title or _('Unknown'), href)
ancestor_map[node] = child
def _process_nodes(self, root):
from calibre.ebooks.oeb.base import TOC
toc = TOC()
ancestor_map = {}
for node in root.xpath('//object'):
self.add_node(node, toc, ancestor_map)
return toc

View File

@@ -0,0 +1,310 @@
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal kovid@kovidgoyal.net'
__docformat__ = 'restructuredtext en'
'''
Based on ideas from comiclrf created by FangornUK.
'''
import shutil, textwrap, codecs, os
from calibre.customize.conversion import InputFormatPlugin, OptionRecommendation
from calibre import CurrentDir
from calibre.ptempfile import PersistentTemporaryDirectory
from polyglot.builtins import getcwd, map
class ComicInput(InputFormatPlugin):
name = 'Comic Input'
author = 'Kovid Goyal'
description = 'Optimize comic files (.cbz, .cbr, .cbc) for viewing on portable devices'
file_types = {'cbz', 'cbr', 'cbc'}
is_image_collection = True
commit_name = 'comic_input'
core_usage = -1
options = {
OptionRecommendation(name='colors', recommended_value=0,
help=_('Reduce the number of colors used in the image. This works only'
' if you choose the PNG output format. It is useful to reduce file sizes.'
' Set to zero to turn off. Maximum value is 256. It is off by default.')),
OptionRecommendation(name='dont_normalize', recommended_value=False,
help=_('Disable normalize (improve contrast) color range '
'for pictures. Default: False')),
OptionRecommendation(name='keep_aspect_ratio', recommended_value=False,
help=_('Maintain picture aspect ratio. Default is to fill the screen.')),
OptionRecommendation(name='dont_sharpen', recommended_value=False,
help=_('Disable sharpening.')),
OptionRecommendation(name='disable_trim', recommended_value=False,
help=_('Disable trimming of comic pages. For some comics, '
'trimming might remove content as well as borders.')),
OptionRecommendation(name='landscape', recommended_value=False,
help=_("Don't split landscape images into two portrait images")),
OptionRecommendation(name='wide', recommended_value=False,
help=_("Keep aspect ratio and scale image using screen height as "
"image width for viewing in landscape mode.")),
OptionRecommendation(name='right2left', recommended_value=False,
help=_('Used for right-to-left publications like manga. '
'Causes landscape pages to be split into portrait pages '
'from right to left.')),
OptionRecommendation(name='despeckle', recommended_value=False,
help=_('Enable Despeckle. Reduces speckle noise. '
'May greatly increase processing time.')),
OptionRecommendation(name='no_sort', recommended_value=False,
help=_("Don't sort the files found in the comic "
"alphabetically by name. Instead use the order they were "
"added to the comic.")),
OptionRecommendation(name='output_format', choices=['png', 'jpg'],
recommended_value='png', help=_('The format that images in the created e-book '
'are converted to. You can experiment to see which format gives '
'you optimal size and look on your device.')),
OptionRecommendation(name='no_process', recommended_value=False,
help=_("Apply no processing to the image")),
OptionRecommendation(name='dont_grayscale', recommended_value=False,
help=_('Do not convert the image to grayscale (black and white)')),
OptionRecommendation(name='comic_image_size', recommended_value=None,
help=_('Specify the image size as widthxheight pixels. Normally,'
' an image size is automatically calculated from the output '
'profile, this option overrides it.')),
OptionRecommendation(name='dont_add_comic_pages_to_toc', recommended_value=False,
help=_('When converting a CBC do not add links to each page to'
' the TOC. Note this only applies if the TOC has more than one'
' section')),
}
recommendations = {
('margin_left', 0, OptionRecommendation.HIGH),
('margin_top', 0, OptionRecommendation.HIGH),
('margin_right', 0, OptionRecommendation.HIGH),
('margin_bottom', 0, OptionRecommendation.HIGH),
('insert_blank_line', False, OptionRecommendation.HIGH),
('remove_paragraph_spacing', False, OptionRecommendation.HIGH),
('change_justification', 'left', OptionRecommendation.HIGH),
('dont_split_on_pagebreaks', True, OptionRecommendation.HIGH),
('chapter', None, OptionRecommendation.HIGH),
('page_breaks_brefore', None, OptionRecommendation.HIGH),
('use_auto_toc', False, OptionRecommendation.HIGH),
('page_breaks_before', None, OptionRecommendation.HIGH),
('disable_font_rescaling', True, OptionRecommendation.HIGH),
('linearize_tables', False, OptionRecommendation.HIGH),
}
def get_comics_from_collection(self, stream):
from calibre.libunzip import extract as zipextract
tdir = PersistentTemporaryDirectory('_comic_collection')
zipextract(stream, tdir)
comics = []
with CurrentDir(tdir):
if not os.path.exists('comics.txt'):
raise ValueError((
'%s is not a valid comic collection'
' no comics.txt was found in the file')
%stream.name)
with open('comics.txt', 'rb') as f:
raw = f.read()
if raw.startswith(codecs.BOM_UTF16_BE):
raw = raw.decode('utf-16-be')[1:]
elif raw.startswith(codecs.BOM_UTF16_LE):
raw = raw.decode('utf-16-le')[1:]
elif raw.startswith(codecs.BOM_UTF8):
raw = raw.decode('utf-8')[1:]
else:
raw = raw.decode('utf-8')
for line in raw.splitlines():
line = line.strip()
if not line:
continue
fname, title = line.partition(':')[0], line.partition(':')[-1]
fname = fname.replace('#', '_')
fname = os.path.join(tdir, *fname.split('/'))
if not title:
title = os.path.basename(fname).rpartition('.')[0]
if os.access(fname, os.R_OK):
comics.append([title, fname])
if not comics:
raise ValueError('%s has no comics'%stream.name)
return comics
def get_pages(self, comic, tdir2):
from calibre.ebooks.comic.input import (extract_comic, process_pages,
find_pages)
tdir = extract_comic(comic)
new_pages = find_pages(tdir, sort_on_mtime=self.opts.no_sort,
verbose=self.opts.verbose)
thumbnail = None
if not new_pages:
raise ValueError('Could not find any pages in the comic: %s'
%comic)
if self.opts.no_process:
n2 = []
for i, page in enumerate(new_pages):
n2.append(os.path.join(tdir2, '{} - {}' .format(i, os.path.basename(page))))
shutil.copyfile(page, n2[-1])
new_pages = n2
else:
new_pages, failures = process_pages(new_pages, self.opts,
self.report_progress, tdir2)
if failures:
self.log.warning('Could not process the following pages '
'(run with --verbose to see why):')
for f in failures:
self.log.warning('\t', f)
if not new_pages:
raise ValueError('Could not find any valid pages in comic: %s'
% comic)
thumbnail = os.path.join(tdir2,
'thumbnail.'+self.opts.output_format.lower())
if not os.access(thumbnail, os.R_OK):
thumbnail = None
return new_pages
def get_images(self):
return self._images
def convert(self, stream, opts, file_ext, log, accelerators):
from calibre.ebooks.metadata import MetaInformation
from calibre.ebooks.metadata.opf2 import OPFCreator
from calibre.ebooks.metadata.toc import TOC
self.opts, self.log= opts, log
if file_ext == 'cbc':
comics_ = self.get_comics_from_collection(stream)
else:
comics_ = [['Comic', os.path.abspath(stream.name)]]
stream.close()
comics = []
for i, x in enumerate(comics_):
title, fname = x
cdir = 'comic_%d'%(i+1) if len(comics_) > 1 else '.'
cdir = os.path.abspath(cdir)
if not os.path.exists(cdir):
os.makedirs(cdir)
pages = self.get_pages(fname, cdir)
if not pages:
continue
if self.for_viewer:
comics.append((title, pages, [self.create_viewer_wrapper(pages)]))
else:
wrappers = self.create_wrappers(pages)
comics.append((title, pages, wrappers))
if not comics:
raise ValueError('No comic pages found in %s'%stream.name)
mi = MetaInformation(os.path.basename(stream.name).rpartition('.')[0],
[_('Unknown')])
opf = OPFCreator(getcwd(), mi)
entries = []
def href(x):
if len(comics) == 1:
return os.path.basename(x)
return '/'.join(x.split(os.sep)[-2:])
cover_href = None
for comic in comics:
pages, wrappers = comic[1:]
page_entries = [(x, None) for x in map(href, pages)]
entries += [(w, None) for w in map(href, wrappers)] + page_entries
if cover_href is None and page_entries:
cover_href = page_entries[0][0]
opf.create_manifest(entries)
spine = []
for comic in comics:
spine.extend(map(href, comic[2]))
self._images = []
for comic in comics:
self._images.extend(comic[1])
opf.create_spine(spine)
if self.for_viewer and cover_href:
opf.guide.set_cover(cover_href)
toc = TOC()
if len(comics) == 1:
wrappers = comics[0][2]
for i, x in enumerate(wrappers):
toc.add_item(href(x), None, _('Page')+' %d'%(i+1),
play_order=i)
else:
po = 0
for comic in comics:
po += 1
wrappers = comic[2]
stoc = toc.add_item(href(wrappers[0]),
None, comic[0], play_order=po)
if not opts.dont_add_comic_pages_to_toc:
for i, x in enumerate(wrappers):
stoc.add_item(href(x), None,
_('Page')+' %d'%(i+1), play_order=po)
po += 1
opf.set_toc(toc)
with open('metadata.opf', 'wb') as m, open('toc.ncx', 'wb') as n:
opf.render(m, n, 'toc.ncx')
return os.path.abspath('metadata.opf')
def create_wrappers(self, pages):
from calibre.ebooks.oeb.base import XHTML_NS
wrappers = []
WRAPPER = textwrap.dedent('''\
<html xmlns="%s">
<head>
<meta charset="utf-8"/>
<title>Page #%d</title>
<style type="text/css">
@page { margin:0pt; padding: 0pt}
body { margin: 0pt; padding: 0pt}
div { text-align: center }
</style>
</head>
<body>
<div>
<img src="%s" alt="comic page #%d" />
</div>
</body>
</html>
''')
dir = os.path.dirname(pages[0])
for i, page in enumerate(pages):
wrapper = WRAPPER%(XHTML_NS, i+1, os.path.basename(page), i+1)
page = os.path.join(dir, 'page_%d.xhtml'%(i+1))
with open(page, 'wb') as f:
f.write(wrapper.encode('utf-8'))
wrappers.append(page)
return wrappers
def create_viewer_wrapper(self, pages):
from calibre.ebooks.oeb.base import XHTML_NS
def page(src):
return '<img src="{}"></img>'.format(os.path.basename(src))
pages = '\n'.join(map(page, pages))
base = os.path.dirname(pages[0])
wrapper = '''
<html xmlns="%s">
<head>
<meta charset="utf-8"/>
<style type="text/css">
html, body, img { height: 100vh; display: block; margin: 0; padding: 0; border-width: 0; }
img {
width: 100%%; height: 100%%;
object-fit: contain;
margin-left: auto; margin-right: auto;
max-width: 100vw; max-height: 100vh;
top: 50vh; transform: translateY(-50%%);
position: relative;
page-break-after: always;
}
</style>
</head>
<body>
%s
</body>
</html>
''' % (XHTML_NS, pages)
path = os.path.join(base, 'wrapper.xhtml')
with open(path, 'wb') as f:
f.write(wrapper.encode('utf-8'))
return path

View File

@@ -0,0 +1,67 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2011, Anthon van der Neut <anthon@mnt.org>'
__docformat__ = 'restructuredtext en'
import os
from io import BytesIO
from calibre.customize.conversion import InputFormatPlugin
from polyglot.builtins import getcwd
class DJVUInput(InputFormatPlugin):
name = 'DJVU Input'
author = 'Anthon van der Neut'
description = 'Convert OCR-ed DJVU files (.djvu) to HTML'
file_types = {'djvu', 'djv'}
commit_name = 'djvu_input'
def convert(self, stream, options, file_ext, log, accelerators):
from calibre.ebooks.txt.processor import convert_basic
stdout = BytesIO()
from calibre.ebooks.djvu.djvu import DJVUFile
x = DJVUFile(stream)
x.get_text(stdout)
raw_text = stdout.getvalue()
if not raw_text:
raise ValueError('The DJVU file contains no text, only images, probably page scans.'
' calibre only supports conversion of DJVU files with actual text in them.')
html = convert_basic(raw_text.replace(b"\n", b' ').replace(
b'\037', b'\n\n'))
# Run the HTMLized text through the html processing plugin.
from calibre.customize.ui import plugin_for_input_format
html_input = plugin_for_input_format('html')
for opt in html_input.options:
setattr(options, opt.option.name, opt.recommended_value)
options.input_encoding = 'utf-8'
base = getcwd()
htmlfile = os.path.join(base, 'index.html')
c = 0
while os.path.exists(htmlfile):
c += 1
htmlfile = os.path.join(base, 'index%d.html'%c)
with open(htmlfile, 'wb') as f:
f.write(html.encode('utf-8'))
odi = options.debug_pipeline
options.debug_pipeline = None
# Generate oeb from html conversion.
with open(htmlfile, 'rb') as f:
oeb = html_input.convert(f, options, 'html', log,
{})
options.debug_pipeline = odi
os.remove(htmlfile)
# Set metadata from file.
from calibre.customize.ui import get_file_type_metadata
from calibre.ebooks.oeb.transforms.metadata import meta_info_to_oeb_metadata
mi = get_file_type_metadata(stream, file_ext)
meta_info_to_oeb_metadata(mi, oeb.metadata, log)
return oeb

View File

@@ -0,0 +1,34 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
from calibre.customize.conversion import InputFormatPlugin, OptionRecommendation
class DOCXInput(InputFormatPlugin):
name = 'DOCX Input'
author = 'Kovid Goyal'
description = _('Convert DOCX files (.docx and .docm) to HTML')
file_types = {'docx', 'docm'}
commit_name = 'docx_input'
options = {
OptionRecommendation(name='docx_no_cover', recommended_value=False,
help=_('Normally, if a large image is present at the start of the document that looks like a cover, '
'it will be removed from the document and used as the cover for created e-book. This option '
'turns off that behavior.')),
OptionRecommendation(name='docx_no_pagebreaks_between_notes', recommended_value=False,
help=_('Do not insert a page break after every endnote.')),
OptionRecommendation(name='docx_inline_subsup', recommended_value=False,
help=_('Render superscripts and subscripts so that they do not affect the line height.')),
}
recommendations = {('page_breaks_before', '/', OptionRecommendation.MED)}
def convert(self, stream, options, file_ext, log, accelerators):
from calibre.ebooks.docx.to_html import Convert
return Convert(stream, detect_cover=not options.docx_no_cover, log=log, notes_nopb=options.docx_no_pagebreaks_between_notes,
nosupsub=options.docx_inline_subsup)()

View File

@@ -0,0 +1,93 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
from calibre.customize.conversion import OutputFormatPlugin, OptionRecommendation
PAGE_SIZES = ['a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'b0', 'b1',
'b2', 'b3', 'b4', 'b5', 'b6', 'legal', 'letter']
class DOCXOutput(OutputFormatPlugin):
name = 'DOCX Output'
author = 'Kovid Goyal'
file_type = 'docx'
commit_name = 'docx_output'
ui_data = {'page_sizes': PAGE_SIZES}
options = {
OptionRecommendation(name='docx_page_size', recommended_value='letter',
level=OptionRecommendation.LOW, choices=PAGE_SIZES,
help=_('The size of the page. Default is letter. Choices '
'are %s') % PAGE_SIZES),
OptionRecommendation(name='docx_custom_page_size', recommended_value=None,
help=_('Custom size of the document. Use the form widthxheight '
'EG. `123x321` to specify the width and height (in pts). '
'This overrides any specified page-size.')),
OptionRecommendation(name='docx_no_cover', recommended_value=False,
help=_('Do not insert the book cover as an image at the start of the document.'
' If you use this option, the book cover will be discarded.')),
OptionRecommendation(name='preserve_cover_aspect_ratio', recommended_value=False,
help=_('Preserve the aspect ratio of the cover image instead of stretching'
' it out to cover the entire page.')),
OptionRecommendation(name='docx_no_toc', recommended_value=False,
help=_('Do not insert the table of contents as a page at the start of the document.')),
OptionRecommendation(name='extract_to',
help=_('Extract the contents of the generated %s file to the '
'specified directory. The contents of the directory are first '
'deleted, so be careful.') % 'DOCX'),
OptionRecommendation(name='docx_page_margin_left', recommended_value=72.0,
level=OptionRecommendation.LOW,
help=_('The size of the left page margin, in pts. Default is 72pt.'
' Overrides the common left page margin setting.')
),
OptionRecommendation(name='docx_page_margin_top', recommended_value=72.0,
level=OptionRecommendation.LOW,
help=_('The size of the top page margin, in pts. Default is 72pt.'
' Overrides the common top page margin setting, unless set to zero.')
),
OptionRecommendation(name='docx_page_margin_right', recommended_value=72.0,
level=OptionRecommendation.LOW,
help=_('The size of the right page margin, in pts. Default is 72pt.'
' Overrides the common right page margin setting, unless set to zero.')
),
OptionRecommendation(name='docx_page_margin_bottom', recommended_value=72.0,
level=OptionRecommendation.LOW,
help=_('The size of the bottom page margin, in pts. Default is 72pt.'
' Overrides the common bottom page margin setting, unless set to zero.')
),
}
def convert_metadata(self, oeb):
from lxml import etree
from calibre.ebooks.oeb.base import OPF, OPF2_NS
from calibre.ebooks.metadata.opf2 import OPF as ReadOPF
from io import BytesIO
package = etree.Element(OPF('package'), attrib={'version': '2.0'}, nsmap={None: OPF2_NS})
oeb.metadata.to_opf2(package)
self.mi = ReadOPF(BytesIO(etree.tostring(package, encoding='utf-8')), populate_spine=False, try_to_guess_cover=False).to_book_metadata()
def convert(self, oeb, output_path, input_plugin, opts, log):
from calibre.ebooks.docx.writer.container import DOCX
from calibre.ebooks.docx.writer.from_html import Convert
docx = DOCX(opts, log)
self.convert_metadata(oeb)
Convert(oeb, docx, self.mi, not opts.docx_no_cover, not opts.docx_no_toc)()
docx.write(output_path, self.mi)
if opts.extract_to:
from calibre.ebooks.docx.dump import do_dump
do_dump(output_path, opts.extract_to)

View File

@@ -0,0 +1,438 @@
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
import os, re, posixpath
from itertools import cycle
from calibre.customize.conversion import InputFormatPlugin, OptionRecommendation
from polyglot.builtins import getcwd
ADOBE_OBFUSCATION = 'http://ns.adobe.com/pdf/enc#RC'
IDPF_OBFUSCATION = 'http://www.idpf.org/2008/embedding'
def decrypt_font_data(key, data, algorithm):
is_adobe = algorithm == ADOBE_OBFUSCATION
crypt_len = 1024 if is_adobe else 1040
crypt = bytearray(data[:crypt_len])
key = cycle(iter(bytearray(key)))
decrypt = bytes(bytearray(x^next(key) for x in crypt))
return decrypt + data[crypt_len:]
def decrypt_font(key, path, algorithm):
with lopen(path, 'r+b') as f:
data = decrypt_font_data(key, f.read(), algorithm)
f.seek(0), f.truncate(), f.write(data)
class EPUBInput(InputFormatPlugin):
name = 'EPUB Input'
author = 'Kovid Goyal'
description = 'Convert EPUB files (.epub) to HTML'
file_types = {'epub'}
output_encoding = None
commit_name = 'epub_input'
recommendations = {('page_breaks_before', '/', OptionRecommendation.MED)}
def process_encryption(self, encfile, opf, log):
from lxml import etree
import uuid, hashlib
idpf_key = opf.raw_unique_identifier
if idpf_key:
idpf_key = re.sub('[\u0020\u0009\u000d\u000a]', '', idpf_key)
idpf_key = hashlib.sha1(idpf_key.encode('utf-8')).digest()
key = None
for item in opf.identifier_iter():
scheme = None
for xkey in item.attrib.keys():
if xkey.endswith('scheme'):
scheme = item.get(xkey)
if (scheme and scheme.lower() == 'uuid') or \
(item.text and item.text.startswith('urn:uuid:')):
try:
key = item.text.rpartition(':')[-1]
key = uuid.UUID(key).bytes
except:
import traceback
traceback.print_exc()
key = None
try:
root = etree.parse(encfile)
for em in root.xpath('descendant::*[contains(name(), "EncryptionMethod")]'):
algorithm = em.get('Algorithm', '')
if algorithm not in {ADOBE_OBFUSCATION, IDPF_OBFUSCATION}:
return False
cr = em.getparent().xpath('descendant::*[contains(name(), "CipherReference")]')[0]
uri = cr.get('URI')
path = os.path.abspath(os.path.join(os.path.dirname(encfile), '..', *uri.split('/')))
tkey = (key if algorithm == ADOBE_OBFUSCATION else idpf_key)
if (tkey and os.path.exists(path)):
self._encrypted_font_uris.append(uri)
decrypt_font(tkey, path, algorithm)
return True
except:
import traceback
traceback.print_exc()
return False
def set_guide_type(self, opf, gtype, href=None, title=''):
# Set the specified guide entry
for elem in list(opf.iterguide()):
if elem.get('type', '').lower() == gtype:
elem.getparent().remove(elem)
if href is not None:
t = opf.create_guide_item(gtype, title, href)
for guide in opf.root.xpath('./*[local-name()="guide"]'):
guide.append(t)
return
guide = opf.create_guide_element()
opf.root.append(guide)
guide.append(t)
return t
def rationalize_cover3(self, opf, log):
''' If there is a reference to the cover/titlepage via manifest properties, convert to
entries in the <guide> so that the rest of the pipeline picks it up. '''
from calibre.ebooks.metadata.opf3 import items_with_property
removed = guide_titlepage_href = guide_titlepage_id = None
# Look for titlepages incorrectly marked in the <guide> as covers
guide_cover, guide_elem = None, None
for guide_elem in opf.iterguide():
if guide_elem.get('type', '').lower() == 'cover':
guide_cover = guide_elem.get('href', '').partition('#')[0]
break
if guide_cover:
spine = list(opf.iterspine())
if spine:
idref = spine[0].get('idref', '')
for x in opf.itermanifest():
if x.get('id') == idref and x.get('href') == guide_cover:
guide_titlepage_href = guide_cover
guide_titlepage_id = idref
break
raster_cover_href = opf.epub3_raster_cover or opf.raster_cover
if raster_cover_href:
self.set_guide_type(opf, 'cover', raster_cover_href, 'Cover Image')
titlepage_id = titlepage_href = None
for item in items_with_property(opf.root, 'calibre:title-page'):
tid, href = item.get('id'), item.get('href')
if href and tid:
titlepage_id, titlepage_href = tid, href.partition('#')[0]
break
if titlepage_href is None:
titlepage_href, titlepage_id = guide_titlepage_href, guide_titlepage_id
if titlepage_href is not None:
self.set_guide_type(opf, 'titlepage', titlepage_href, 'Title Page')
spine = list(opf.iterspine())
if len(spine) > 1:
for item in spine:
if item.get('idref') == titlepage_id:
log('Found HTML cover', titlepage_href)
if self.for_viewer:
item.attrib.pop('linear', None)
else:
item.getparent().remove(item)
removed = titlepage_href
return removed
def rationalize_cover2(self, opf, log):
''' Ensure that the cover information in the guide is correct. That
means, at most one entry with type="cover" that points to a raster
cover and at most one entry with type="titlepage" that points to an
HTML titlepage. '''
from calibre.ebooks.oeb.base import OPF
removed = None
from lxml import etree
guide_cover, guide_elem = None, None
for guide_elem in opf.iterguide():
if guide_elem.get('type', '').lower() == 'cover':
guide_cover = guide_elem.get('href', '').partition('#')[0]
break
if not guide_cover:
raster_cover = opf.raster_cover
if raster_cover:
if guide_elem is None:
g = opf.root.makeelement(OPF('guide'))
opf.root.append(g)
else:
g = guide_elem.getparent()
guide_cover = raster_cover
guide_elem = g.makeelement(OPF('reference'), attrib={'href':raster_cover, 'type':'cover'})
g.append(guide_elem)
return
spine = list(opf.iterspine())
if not spine:
return
# Check if the cover specified in the guide is also
# the first element in spine
idref = spine[0].get('idref', '')
manifest = list(opf.itermanifest())
if not manifest:
return
elem = [x for x in manifest if x.get('id', '') == idref]
if not elem or elem[0].get('href', None) != guide_cover:
return
log('Found HTML cover', guide_cover)
# Remove from spine as covers must be treated
# specially
if not self.for_viewer:
if len(spine) == 1:
log.warn('There is only a single spine item and it is marked as the cover. Removing cover marking.')
for guide_elem in tuple(opf.iterguide()):
if guide_elem.get('type', '').lower() == 'cover':
guide_elem.getparent().remove(guide_elem)
return
else:
spine[0].getparent().remove(spine[0])
removed = guide_cover
else:
# Ensure the cover is displayed as the first item in the book, some
# epub files have it set with linear='no' which causes the cover to
# display in the end
spine[0].attrib.pop('linear', None)
opf.spine[0].is_linear = True
# Ensure that the guide has a cover entry pointing to a raster cover
# and a titlepage entry pointing to the html titlepage. The titlepage
# entry will be used by the epub output plugin, the raster cover entry
# by other output plugins.
# Search for a raster cover identified in the OPF
raster_cover = opf.raster_cover
# Set the cover guide entry
if raster_cover is not None:
guide_elem.set('href', raster_cover)
else:
# Render the titlepage to create a raster cover
from calibre.ebooks import render_html_svg_workaround
guide_elem.set('href', 'calibre_raster_cover.jpg')
t = etree.SubElement(
elem[0].getparent(), OPF('item'), href=guide_elem.get('href'), id='calibre_raster_cover')
t.set('media-type', 'image/jpeg')
if os.path.exists(guide_cover):
renderer = render_html_svg_workaround(guide_cover, log)
if renderer is not None:
with lopen('calibre_raster_cover.jpg', 'wb') as f:
f.write(renderer)
# Set the titlepage guide entry
self.set_guide_type(opf, 'titlepage', guide_cover, 'Title Page')
return removed
def find_opf(self):
from calibre.utils.xml_parse import safe_xml_fromstring
def attr(n, attr):
for k, v in n.attrib.items():
if k.endswith(attr):
return v
try:
with lopen('META-INF/container.xml', 'rb') as f:
root = safe_xml_fromstring(f.read())
for r in root.xpath('//*[local-name()="rootfile"]'):
if attr(r, 'media-type') != "application/oebps-package+xml":
continue
path = attr(r, 'full-path')
if not path:
continue
path = os.path.join(getcwd(), *path.split('/'))
if os.path.exists(path):
return path
except Exception:
import traceback
traceback.print_exc()
def convert(self, stream, options, file_ext, log, accelerators):
from calibre.utils.zipfile import ZipFile
from calibre import walk
from calibre.ebooks import DRMError
from calibre.ebooks.metadata.opf2 import OPF
try:
zf = ZipFile(stream)
zf.extractall(getcwd())
except:
log.exception('EPUB appears to be invalid ZIP file, trying a'
' more forgiving ZIP parser')
from calibre.utils.localunzip import extractall
stream.seek(0)
extractall(stream)
encfile = os.path.abspath(os.path.join('META-INF', 'encryption.xml'))
opf = self.find_opf()
if opf is None:
for f in walk('.'):
if f.lower().endswith('.opf') and '__MACOSX' not in f and \
not os.path.basename(f).startswith('.'):
opf = os.path.abspath(f)
break
path = getattr(stream, 'name', 'stream')
if opf is None:
raise ValueError('%s is not a valid EPUB file (could not find opf)'%path)
opf = os.path.relpath(opf, getcwd())
parts = os.path.split(opf)
opf = OPF(opf, os.path.dirname(os.path.abspath(opf)))
self._encrypted_font_uris = []
if os.path.exists(encfile):
if not self.process_encryption(encfile, opf, log):
raise DRMError(os.path.basename(path))
self.encrypted_fonts = self._encrypted_font_uris
if len(parts) > 1 and parts[0]:
delta = '/'.join(parts[:-1])+'/'
def normpath(x):
return posixpath.normpath(delta + elem.get('href'))
for elem in opf.itermanifest():
elem.set('href', normpath(elem.get('href')))
for elem in opf.iterguide():
elem.set('href', normpath(elem.get('href')))
f = self.rationalize_cover3 if opf.package_version >= 3.0 else self.rationalize_cover2
self.removed_cover = f(opf, log)
if self.removed_cover:
self.removed_items_to_ignore = (self.removed_cover,)
epub3_nav = opf.epub3_nav
if epub3_nav is not None:
self.convert_epub3_nav(epub3_nav, opf, log, options)
for x in opf.itermanifest():
if x.get('media-type', '') == 'application/x-dtbook+xml':
raise ValueError(
'EPUB files with DTBook markup are not supported')
not_for_spine = set()
for y in opf.itermanifest():
id_ = y.get('id', None)
if id_:
mt = y.get('media-type', None)
if mt in {
'application/vnd.adobe-page-template+xml',
'application/vnd.adobe.page-template+xml',
'application/adobe-page-template+xml',
'application/adobe.page-template+xml',
'application/text'
}:
not_for_spine.add(id_)
ext = y.get('href', '').rpartition('.')[-1].lower()
if mt == 'text/plain' and ext in {'otf', 'ttf'}:
# some epub authoring software sets font mime types to
# text/plain
not_for_spine.add(id_)
y.set('media-type', 'application/font')
seen = set()
for x in list(opf.iterspine()):
ref = x.get('idref', None)
if not ref or ref in not_for_spine or ref in seen:
x.getparent().remove(x)
continue
seen.add(ref)
if len(list(opf.iterspine())) == 0:
raise ValueError('No valid entries in the spine of this EPUB')
with lopen('content.opf', 'wb') as nopf:
nopf.write(opf.render())
return os.path.abspath('content.opf')
def convert_epub3_nav(self, nav_path, opf, log, opts):
from lxml import etree
from calibre.ebooks.chardet import xml_to_unicode
from calibre.ebooks.oeb.polish.parsing import parse
from calibre.ebooks.oeb.base import EPUB_NS, XHTML, NCX_MIME, NCX, urlnormalize, urlunquote, serialize
from calibre.ebooks.oeb.polish.toc import first_child
from calibre.utils.xml_parse import safe_xml_fromstring
from tempfile import NamedTemporaryFile
with lopen(nav_path, 'rb') as f:
raw = f.read()
raw = xml_to_unicode(raw, strip_encoding_pats=True, assume_utf8=True)[0]
root = parse(raw, log=log)
ncx = safe_xml_fromstring('<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1" xml:lang="eng"><navMap/></ncx>')
navmap = ncx[0]
et = '{%s}type' % EPUB_NS
bn = os.path.basename(nav_path)
def add_from_li(li, parent):
href = text = None
for x in li.iterchildren(XHTML('a'), XHTML('span')):
text = etree.tostring(
x, method='text', encoding='unicode', with_tail=False).strip() or ' '.join(
x.xpath('descendant-or-self::*/@title')).strip()
href = x.get('href')
if href:
if href.startswith('#'):
href = bn + href
break
np = parent.makeelement(NCX('navPoint'))
parent.append(np)
np.append(np.makeelement(NCX('navLabel')))
np[0].append(np.makeelement(NCX('text')))
np[0][0].text = text
if href:
np.append(np.makeelement(NCX('content'), attrib={'src':href}))
return np
def process_nav_node(node, toc_parent):
for li in node.iterchildren(XHTML('li')):
child = add_from_li(li, toc_parent)
ol = first_child(li, XHTML('ol'))
if child is not None and ol is not None:
process_nav_node(ol, child)
for nav in root.iterdescendants(XHTML('nav')):
if nav.get(et) == 'toc':
ol = first_child(nav, XHTML('ol'))
if ol is not None:
process_nav_node(ol, navmap)
break
else:
return
with NamedTemporaryFile(suffix='.ncx', dir=os.path.dirname(nav_path), delete=False) as f:
f.write(etree.tostring(ncx, encoding='utf-8'))
ncx_href = os.path.relpath(f.name, getcwd()).replace(os.sep, '/')
ncx_id = opf.create_manifest_item(ncx_href, NCX_MIME, append=True).get('id')
for spine in opf.root.xpath('//*[local-name()="spine"]'):
spine.set('toc', ncx_id)
opts.epub3_nav_href = urlnormalize(os.path.relpath(nav_path).replace(os.sep, '/'))
opts.epub3_nav_parsed = root
if getattr(self, 'removed_cover', None):
changed = False
base_path = os.path.dirname(nav_path)
for elem in root.xpath('//*[@href]'):
href, frag = elem.get('href').partition('#')[::2]
link_path = os.path.relpath(os.path.join(base_path, urlunquote(href)), base_path)
abs_href = urlnormalize(link_path)
if abs_href == self.removed_cover:
changed = True
elem.set('data-calibre-removed-titlepage', '1')
if changed:
with lopen(nav_path, 'wb') as f:
f.write(serialize(root, 'application/xhtml+xml'))
def postprocess_book(self, oeb, opts, log):
rc = getattr(self, 'removed_cover', None)
if rc:
cover_toc_item = None
for item in oeb.toc.iterdescendants():
if item.href and item.href.partition('#')[0] == rc:
cover_toc_item = item
break
spine = {x.href for x in oeb.spine}
if (cover_toc_item is not None and cover_toc_item not in spine):
oeb.toc.item_that_refers_to_cover = cover_toc_item

View File

@@ -0,0 +1,548 @@
#!/usr/bin/env python2
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
import os, shutil, re
from calibre.customize.conversion import (OutputFormatPlugin,
OptionRecommendation)
from calibre.ptempfile import TemporaryDirectory
from calibre import CurrentDir
from polyglot.builtins import unicode_type, filter, map, zip, range, as_bytes
block_level_tags = (
'address',
'body',
'blockquote',
'center',
'dir',
'div',
'dl',
'fieldset',
'form',
'h1',
'h2',
'h3',
'h4',
'h5',
'h6',
'hr',
'isindex',
'menu',
'noframes',
'noscript',
'ol',
'p',
'pre',
'table',
'ul',
)
class EPUBOutput(OutputFormatPlugin):
name = 'EPUB Output'
author = 'Kovid Goyal'
file_type = 'epub'
commit_name = 'epub_output'
ui_data = {'versions': ('2', '3')}
options = {
OptionRecommendation(name='extract_to',
help=_('Extract the contents of the generated %s file to the '
'specified directory. The contents of the directory are first '
'deleted, so be careful.') % 'EPUB'),
OptionRecommendation(name='dont_split_on_page_breaks',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Turn off splitting at page breaks. Normally, input '
'files are automatically split at every page break into '
'two files. This gives an output e-book that can be '
'parsed faster and with less resources. However, '
'splitting is slow and if your source file contains a '
'very large number of page breaks, you should turn off '
'splitting on page breaks.'
)
),
OptionRecommendation(name='flow_size', recommended_value=260,
help=_('Split all HTML files larger than this size (in KB). '
'This is necessary as most EPUB readers cannot handle large '
'file sizes. The default of %defaultKB is the size required '
'for Adobe Digital Editions. Set to 0 to disable size based splitting.')
),
OptionRecommendation(name='no_default_epub_cover', recommended_value=False,
help=_('Normally, if the input file has no cover and you don\'t'
' specify one, a default cover is generated with the title, '
'authors, etc. This option disables the generation of this cover.')
),
OptionRecommendation(name='no_svg_cover', recommended_value=False,
help=_('Do not use SVG for the book cover. Use this option if '
'your EPUB is going to be used on a device that does not '
'support SVG, like the iPhone or the JetBook Lite. '
'Without this option, such devices will display the cover '
'as a blank page.')
),
OptionRecommendation(name='preserve_cover_aspect_ratio',
recommended_value=False, help=_(
'When using an SVG cover, this option will cause the cover to scale '
'to cover the available screen area, but still preserve its aspect ratio '
'(ratio of width to height). That means there may be white borders '
'at the sides or top and bottom of the image, but the image will '
'never be distorted. Without this option the image may be slightly '
'distorted, but there will be no borders.'
)
),
OptionRecommendation(name='epub_flatten', recommended_value=False,
help=_('This option is needed only if you intend to use the EPUB'
' with FBReaderJ. It will flatten the file system inside the'
' EPUB, putting all files into the top level.')
),
OptionRecommendation(name='epub_inline_toc', recommended_value=False,
help=_('Insert an inline Table of Contents that will appear as part of the main book content.')
),
OptionRecommendation(name='epub_toc_at_end', recommended_value=False,
help=_('Put the inserted inline Table of Contents at the end of the book instead of the start.')
),
OptionRecommendation(name='toc_title', recommended_value=None,
help=_('Title for any generated in-line table of contents.')
),
OptionRecommendation(name='epub_version', recommended_value='2', choices=ui_data['versions'],
help=_('The version of the EPUB file to generate. EPUB 2 is the'
' most widely compatible, only use EPUB 3 if you know you'
' actually need it.')
),
}
recommendations = {('pretty_print', True, OptionRecommendation.HIGH)}
def workaround_webkit_quirks(self): # {{{
from calibre.ebooks.oeb.base import XPath
for x in self.oeb.spine:
root = x.data
body = XPath('//h:body')(root)
if body:
body = body[0]
if not hasattr(body, 'xpath'):
continue
for pre in XPath('//h:pre')(body):
if not pre.text and len(pre) == 0:
pre.tag = 'div'
# }}}
def upshift_markup(self): # {{{
'Upgrade markup to comply with XHTML 1.1 where possible'
from calibre.ebooks.oeb.base import XPath, XML
for x in self.oeb.spine:
root = x.data
if (not root.get(XML('lang'))) and (root.get('lang')):
root.set(XML('lang'), root.get('lang'))
body = XPath('//h:body')(root)
if body:
body = body[0]
if not hasattr(body, 'xpath'):
continue
for u in XPath('//h:u')(root):
u.tag = 'span'
seen_ids, seen_names = set(), set()
for x in XPath('//*[@id or @name]')(root):
eid, name = x.get('id', None), x.get('name', None)
if eid:
if eid in seen_ids:
del x.attrib['id']
else:
seen_ids.add(eid)
if name:
if name in seen_names:
del x.attrib['name']
else:
seen_names.add(name)
# }}}
def convert(self, oeb, output_path, input_plugin, opts, log):
self.log, self.opts, self.oeb = log, opts, oeb
if self.opts.epub_inline_toc:
from calibre.ebooks.mobi.writer8.toc import TOCAdder
opts.mobi_toc_at_start = not opts.epub_toc_at_end
opts.mobi_passthrough = False
opts.no_inline_toc = False
TOCAdder(oeb, opts, replace_previous_inline_toc=True, ignore_existing_toc=True)
if self.opts.epub_flatten:
from calibre.ebooks.oeb.transforms.filenames import FlatFilenames
FlatFilenames()(oeb, opts)
else:
from calibre.ebooks.oeb.transforms.filenames import UniqueFilenames
UniqueFilenames()(oeb, opts)
self.workaround_ade_quirks()
self.workaround_webkit_quirks()
self.upshift_markup()
from calibre.ebooks.oeb.transforms.rescale import RescaleImages
RescaleImages(check_colorspaces=True)(oeb, opts)
from calibre.ebooks.oeb.transforms.split import Split
split = Split(not self.opts.dont_split_on_page_breaks,
max_flow_size=self.opts.flow_size*1024
)
split(self.oeb, self.opts)
from calibre.ebooks.oeb.transforms.cover import CoverManager
cm = CoverManager(
no_default_cover=self.opts.no_default_epub_cover,
no_svg_cover=self.opts.no_svg_cover,
preserve_aspect_ratio=self.opts.preserve_cover_aspect_ratio)
cm(self.oeb, self.opts, self.log)
self.workaround_sony_quirks()
if self.oeb.toc.count() == 0:
self.log.warn('This EPUB file has no Table of Contents. '
'Creating a default TOC')
first = next(iter(self.oeb.spine))
self.oeb.toc.add(_('Start'), first.href)
from calibre.ebooks.oeb.base import OPF
identifiers = oeb.metadata['identifier']
uuid = None
for x in identifiers:
if x.get(OPF('scheme'), None).lower() == 'uuid' or unicode_type(x).startswith('urn:uuid:'):
uuid = unicode_type(x).split(':')[-1]
break
encrypted_fonts = getattr(input_plugin, 'encrypted_fonts', [])
if uuid is None:
self.log.warn('No UUID identifier found')
from uuid import uuid4
uuid = unicode_type(uuid4())
oeb.metadata.add('identifier', uuid, scheme='uuid', id=uuid)
if encrypted_fonts and not uuid.startswith('urn:uuid:'):
# Apparently ADE requires this value to start with urn:uuid:
# for some absurd reason, or it will throw a hissy fit and refuse
# to use the obfuscated fonts.
for x in identifiers:
if unicode_type(x) == uuid:
x.content = 'urn:uuid:'+uuid
with TemporaryDirectory('_epub_output') as tdir:
from calibre.customize.ui import plugin_for_output_format
metadata_xml = None
extra_entries = []
if self.is_periodical:
if self.opts.output_profile.epub_periodical_format == 'sony':
from calibre.ebooks.epub.periodical import sony_metadata
metadata_xml, atom_xml = sony_metadata(oeb)
extra_entries = [('atom.xml', 'application/atom+xml', atom_xml)]
oeb_output = plugin_for_output_format('oeb')
oeb_output.convert(oeb, tdir, input_plugin, opts, log)
opf = [x for x in os.listdir(tdir) if x.endswith('.opf')][0]
self.condense_ncx([os.path.join(tdir, x) for x in os.listdir(tdir)
if x.endswith('.ncx')][0])
if self.opts.epub_version == '3':
self.upgrade_to_epub3(tdir, opf)
encryption = None
if encrypted_fonts:
encryption = self.encrypt_fonts(encrypted_fonts, tdir, uuid)
from calibre.ebooks.epub import initialize_container
with initialize_container(output_path, os.path.basename(opf),
extra_entries=extra_entries) as epub:
epub.add_dir(tdir)
if encryption is not None:
epub.writestr('META-INF/encryption.xml', as_bytes(encryption))
if metadata_xml is not None:
epub.writestr('META-INF/metadata.xml',
metadata_xml.encode('utf-8'))
if opts.extract_to is not None:
from calibre.utils.zipfile import ZipFile
if os.path.exists(opts.extract_to):
if os.path.isdir(opts.extract_to):
shutil.rmtree(opts.extract_to)
else:
os.remove(opts.extract_to)
os.mkdir(opts.extract_to)
with ZipFile(output_path) as zf:
zf.extractall(path=opts.extract_to)
self.log.info('EPUB extracted to', opts.extract_to)
def upgrade_to_epub3(self, tdir, opf):
self.log.info('Upgrading to EPUB 3...')
from calibre.ebooks.epub import simple_container_xml
from calibre.ebooks.oeb.polish.cover import fix_conversion_titlepage_links_in_nav
try:
os.mkdir(os.path.join(tdir, 'META-INF'))
except EnvironmentError:
pass
with open(os.path.join(tdir, 'META-INF', 'container.xml'), 'wb') as f:
f.write(simple_container_xml(os.path.basename(opf)).encode('utf-8'))
from calibre.ebooks.oeb.polish.container import EpubContainer
container = EpubContainer(tdir, self.log)
from calibre.ebooks.oeb.polish.upgrade import epub_2_to_3
existing_nav = getattr(self.opts, 'epub3_nav_parsed', None)
nav_href = getattr(self.opts, 'epub3_nav_href', None)
previous_nav = (nav_href, existing_nav) if existing_nav and nav_href else None
epub_2_to_3(container, self.log.info, previous_nav=previous_nav)
fix_conversion_titlepage_links_in_nav(container)
container.commit()
os.remove(f.name)
try:
os.rmdir(os.path.join(tdir, 'META-INF'))
except EnvironmentError:
pass
def encrypt_fonts(self, uris, tdir, uuid): # {{{
from polyglot.binary import from_hex_bytes
key = re.sub(r'[^a-fA-F0-9]', '', uuid)
if len(key) < 16:
raise ValueError('UUID identifier %r is invalid'%uuid)
key = bytearray(from_hex_bytes((key + key)[:32]))
paths = []
with CurrentDir(tdir):
paths = [os.path.join(*x.split('/')) for x in uris]
uris = dict(zip(uris, paths))
fonts = []
for uri in list(uris.keys()):
path = uris[uri]
if not os.path.exists(path):
uris.pop(uri)
continue
self.log.debug('Encrypting font:', uri)
with lopen(path, 'r+b') as f:
data = f.read(1024)
if len(data) >= 1024:
data = bytearray(data)
f.seek(0)
f.write(bytes(bytearray(data[i] ^ key[i%16] for i in range(1024))))
else:
self.log.warn('Font', path, 'is invalid, ignoring')
if not isinstance(uri, unicode_type):
uri = uri.decode('utf-8')
fonts.append('''
<enc:EncryptedData>
<enc:EncryptionMethod Algorithm="http://ns.adobe.com/pdf/enc#RC"/>
<enc:CipherData>
<enc:CipherReference URI="%s"/>
</enc:CipherData>
</enc:EncryptedData>
'''%(uri.replace('"', '\\"')))
if fonts:
ans = '''<encryption
xmlns="urn:oasis:names:tc:opendocument:xmlns:container"
xmlns:enc="http://www.w3.org/2001/04/xmlenc#"
xmlns:deenc="http://ns.adobe.com/digitaleditions/enc">
'''
ans += '\n'.join(fonts)
ans += '\n</encryption>'
return ans
# }}}
def condense_ncx(self, ncx_path): # {{{
from lxml import etree
if not self.opts.pretty_print:
tree = etree.parse(ncx_path)
for tag in tree.getroot().iter(tag=etree.Element):
if tag.text:
tag.text = tag.text.strip()
if tag.tail:
tag.tail = tag.tail.strip()
compressed = etree.tostring(tree.getroot(), encoding='utf-8')
with open(ncx_path, 'wb') as f:
f.write(compressed)
# }}}
def workaround_ade_quirks(self): # {{{
'''
Perform various markup transforms to get the output to render correctly
in the quirky ADE.
'''
from calibre.ebooks.oeb.base import XPath, XHTML, barename, urlunquote
stylesheet = self.oeb.manifest.main_stylesheet
# ADE cries big wet tears when it encounters an invalid fragment
# identifier in the NCX toc.
frag_pat = re.compile(r'[-A-Za-z0-9_:.]+$')
for node in self.oeb.toc.iter():
href = getattr(node, 'href', None)
if hasattr(href, 'partition'):
base, _, frag = href.partition('#')
frag = urlunquote(frag)
if frag and frag_pat.match(frag) is None:
self.log.warn(
'Removing fragment identifier %r from TOC as Adobe Digital Editions cannot handle it'%frag)
node.href = base
for x in self.oeb.spine:
root = x.data
body = XPath('//h:body')(root)
if body:
body = body[0]
if hasattr(body, 'xpath'):
# remove <img> tags with empty src elements
bad = []
for x in XPath('//h:img')(body):
src = x.get('src', '').strip()
if src in ('', '#') or src.startswith('http:'):
bad.append(x)
for img in bad:
img.getparent().remove(img)
# Add id attribute to <a> tags that have name
for x in XPath('//h:a[@name]')(body):
if not x.get('id', False):
x.set('id', x.get('name'))
# The delightful epubcheck has started complaining about <a> tags that
# have name attributes.
x.attrib.pop('name')
# Replace <br> that are children of <body> as ADE doesn't handle them
for br in XPath('./h:br')(body):
if br.getparent() is None:
continue
try:
prior = next(br.itersiblings(preceding=True))
priortag = barename(prior.tag)
priortext = prior.tail
except:
priortag = 'body'
priortext = body.text
if priortext:
priortext = priortext.strip()
br.tag = XHTML('p')
br.text = '\u00a0'
style = br.get('style', '').split(';')
style = list(filter(None, map(lambda x: x.strip(), style)))
style.append('margin:0pt; border:0pt')
# If the prior tag is a block (including a <br> we replaced)
# then this <br> replacement should have a 1-line height.
# Otherwise it should have no height.
if not priortext and priortag in block_level_tags:
style.append('height:1em')
else:
style.append('height:0pt')
br.set('style', '; '.join(style))
for tag in XPath('//h:embed')(root):
tag.getparent().remove(tag)
for tag in XPath('//h:object')(root):
if tag.get('type', '').lower().strip() in {'image/svg+xml', 'application/svg+xml'}:
continue
tag.getparent().remove(tag)
for tag in XPath('//h:title|//h:style')(root):
if not tag.text:
tag.getparent().remove(tag)
for tag in XPath('//h:script')(root):
if (not tag.text and not tag.get('src', False) and tag.get('type', None) != 'text/x-mathjax-config'):
tag.getparent().remove(tag)
for tag in XPath('//h:body/descendant::h:script')(root):
tag.getparent().remove(tag)
formchildren = XPath('./h:input|./h:button|./h:textarea|'
'./h:label|./h:fieldset|./h:legend')
for tag in XPath('//h:form')(root):
if formchildren(tag):
tag.getparent().remove(tag)
else:
# Not a real form
tag.tag = XHTML('div')
for tag in XPath('//h:center')(root):
tag.tag = XHTML('div')
tag.set('style', 'text-align:center')
# ADE can't handle &amp; in an img url
for tag in XPath('//h:img[@src]')(root):
tag.set('src', tag.get('src', '').replace('&', ''))
# ADE whimpers in fright when it encounters a <td> outside a
# <table>
in_table = XPath('ancestor::h:table')
for tag in XPath('//h:td|//h:tr|//h:th')(root):
if not in_table(tag):
tag.tag = XHTML('div')
# ADE fails to render non breaking hyphens/soft hyphens/zero width spaces
special_chars = re.compile('[\u200b\u00ad]')
for elem in root.iterdescendants('*'):
if elem.text:
elem.text = special_chars.sub('', elem.text)
elem.text = elem.text.replace('\u2011', '-')
if elem.tail:
elem.tail = special_chars.sub('', elem.tail)
elem.tail = elem.tail.replace('\u2011', '-')
if stylesheet is not None:
# ADE doesn't render lists correctly if they have left margins
from css_parser.css import CSSRule
for lb in XPath('//h:ul[@class]|//h:ol[@class]')(root):
sel = '.'+lb.get('class')
for rule in stylesheet.data.cssRules.rulesOfType(CSSRule.STYLE_RULE):
if sel == rule.selectorList.selectorText:
rule.style.removeProperty('margin-left')
# padding-left breaks rendering in webkit and gecko
rule.style.removeProperty('padding-left')
# Change whitespace:pre to pre-wrap to accommodate readers that
# cannot scroll horizontally
for rule in stylesheet.data.cssRules.rulesOfType(CSSRule.STYLE_RULE):
style = rule.style
ws = style.getPropertyValue('white-space')
if ws == 'pre':
style.setProperty('white-space', 'pre-wrap')
# }}}
def workaround_sony_quirks(self): # {{{
'''
Perform toc link transforms to alleviate slow loading.
'''
from calibre.ebooks.oeb.base import urldefrag, XPath
from calibre.ebooks.oeb.polish.toc import item_at_top
def frag_is_at_top(root, frag):
elem = XPath('//*[@id="%s" or @name="%s"]'%(frag, frag))(root)
if elem:
elem = elem[0]
else:
return False
return item_at_top(elem)
def simplify_toc_entry(toc):
if toc.href:
href, frag = urldefrag(toc.href)
if frag:
for x in self.oeb.spine:
if x.href == href:
if frag_is_at_top(x.data, frag):
self.log.debug('Removing anchor from TOC href:',
href+'#'+frag)
toc.href = href
break
for x in toc:
simplify_toc_entry(x)
if self.oeb.toc:
simplify_toc_entry(self.oeb.toc)
# }}}

View File

@@ -0,0 +1,179 @@
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2008, Anatoly Shipitsin <norguhtar at gmail.com>'
"""
Convert .fb2 files to .lrf
"""
import os, re
from calibre.customize.conversion import InputFormatPlugin, OptionRecommendation
from calibre import guess_type
from polyglot.builtins import iteritems, getcwd
FB2NS = 'http://www.gribuser.ru/xml/fictionbook/2.0'
FB21NS = 'http://www.gribuser.ru/xml/fictionbook/2.1'
class FB2Input(InputFormatPlugin):
name = 'FB2 Input'
author = 'Anatoly Shipitsin'
description = 'Convert FB2 and FBZ files to HTML'
file_types = {'fb2', 'fbz'}
commit_name = 'fb2_input'
recommendations = {
('level1_toc', '//h:h1', OptionRecommendation.MED),
('level2_toc', '//h:h2', OptionRecommendation.MED),
('level3_toc', '//h:h3', OptionRecommendation.MED),
}
options = {
OptionRecommendation(name='no_inline_fb2_toc',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Do not insert a Table of Contents at the beginning of the book.'
)
)}
def convert(self, stream, options, file_ext, log,
accelerators):
from lxml import etree
from calibre.utils.xml_parse import safe_xml_fromstring
from calibre.ebooks.metadata.fb2 import ensure_namespace, get_fb2_data
from calibre.ebooks.metadata.opf2 import OPFCreator
from calibre.ebooks.metadata.meta import get_metadata
from calibre.ebooks.oeb.base import XLINK_NS, XHTML_NS
from calibre.ebooks.chardet import xml_to_unicode
self.log = log
log.debug('Parsing XML...')
raw = get_fb2_data(stream)[0]
raw = raw.replace(b'\0', b'')
raw = xml_to_unicode(raw, strip_encoding_pats=True,
assume_utf8=True, resolve_entities=True)[0]
try:
doc = safe_xml_fromstring(raw)
except etree.XMLSyntaxError:
doc = safe_xml_fromstring(raw.replace('& ', '&amp;'))
if doc is None:
raise ValueError('The FB2 file is not valid XML')
doc = ensure_namespace(doc)
try:
fb_ns = doc.nsmap[doc.prefix]
except Exception:
fb_ns = FB2NS
NAMESPACES = {'f':fb_ns, 'l':XLINK_NS}
stylesheets = doc.xpath('//*[local-name() = "stylesheet" and @type="text/css"]')
css = ''
for s in stylesheets:
css += etree.tostring(s, encoding='unicode', method='text',
with_tail=False) + '\n\n'
if css:
import css_parser, logging
parser = css_parser.CSSParser(fetcher=None,
log=logging.getLogger('calibre.css'))
XHTML_CSS_NAMESPACE = '@namespace "%s";\n' % XHTML_NS
text = XHTML_CSS_NAMESPACE + css
log.debug('Parsing stylesheet...')
stylesheet = parser.parseString(text)
stylesheet.namespaces['h'] = XHTML_NS
css = stylesheet.cssText
if isinstance(css, bytes):
css = css.decode('utf-8', 'replace')
css = css.replace('h|style', 'h|span')
css = re.sub(r'name\s*=\s*', 'class=', css)
self.extract_embedded_content(doc)
log.debug('Converting XML to HTML...')
with open(P('templates/fb2.xsl'), 'rb') as f:
ss = f.read().decode('utf-8')
ss = ss.replace("__FB_NS__", fb_ns)
if options.no_inline_fb2_toc:
log('Disabling generation of inline FB2 TOC')
ss = re.compile(r'<!-- BUILD TOC -->.*<!-- END BUILD TOC -->',
re.DOTALL).sub('', ss)
styledoc = safe_xml_fromstring(ss)
transform = etree.XSLT(styledoc)
result = transform(doc)
# Handle links of type note and cite
notes = {a.get('href')[1:]: a for a in result.xpath('//a[@link_note and @href]') if a.get('href').startswith('#')}
cites = {a.get('link_cite'): a for a in result.xpath('//a[@link_cite]') if not a.get('href', '')}
all_ids = {x for x in result.xpath('//*/@id')}
for cite, a in iteritems(cites):
note = notes.get(cite, None)
if note:
c = 1
while 'cite%d' % c in all_ids:
c += 1
if not note.get('id', None):
note.set('id', 'cite%d' % c)
all_ids.add(note.get('id'))
a.set('href', '#%s' % note.get('id'))
for x in result.xpath('//*[@link_note or @link_cite]'):
x.attrib.pop('link_note', None)
x.attrib.pop('link_cite', None)
for img in result.xpath('//img[@src]'):
src = img.get('src')
img.set('src', self.binary_map.get(src, src))
index = transform.tostring(result)
with open('index.xhtml', 'wb') as f:
f.write(index.encode('utf-8'))
with open('inline-styles.css', 'wb') as f:
f.write(css.encode('utf-8'))
stream.seek(0)
mi = get_metadata(stream, 'fb2')
if not mi.title:
mi.title = _('Unknown')
if not mi.authors:
mi.authors = [_('Unknown')]
cpath = None
if mi.cover_data and mi.cover_data[1]:
with open('fb2_cover_calibre_mi.jpg', 'wb') as f:
f.write(mi.cover_data[1])
cpath = os.path.abspath('fb2_cover_calibre_mi.jpg')
else:
for img in doc.xpath('//f:coverpage/f:image', namespaces=NAMESPACES):
href = img.get('{%s}href'%XLINK_NS, img.get('href', None))
if href is not None:
if href.startswith('#'):
href = href[1:]
cpath = os.path.abspath(href)
break
opf = OPFCreator(getcwd(), mi)
entries = [(f2, guess_type(f2)[0]) for f2 in os.listdir(u'.')]
opf.create_manifest(entries)
opf.create_spine(['index.xhtml'])
if cpath:
opf.guide.set_cover(cpath)
with open('metadata.opf', 'wb') as f:
opf.render(f)
return os.path.join(getcwd(), 'metadata.opf')
def extract_embedded_content(self, doc):
from calibre.ebooks.fb2 import base64_decode
self.binary_map = {}
for elem in doc.xpath('./*'):
if elem.text and 'binary' in elem.tag and 'id' in elem.attrib:
ct = elem.get('content-type', '')
fname = elem.attrib['id']
ext = ct.rpartition('/')[-1].lower()
if ext in ('png', 'jpeg', 'jpg'):
if fname.lower().rpartition('.')[-1] not in {'jpg', 'jpeg',
'png'}:
fname += '.' + ext
self.binary_map[elem.get('id')] = fname
raw = elem.text.strip()
try:
data = base64_decode(raw)
except TypeError:
self.log.exception('Binary data with id=%s is corrupted, ignoring'%(
elem.get('id')))
else:
with open(fname, 'wb') as f:
f.write(data)

View File

@@ -0,0 +1,203 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2009, John Schember <john@nachtimwald.com>'
__docformat__ = 'restructuredtext en'
import os
from calibre.customize.conversion import OutputFormatPlugin, OptionRecommendation
class FB2Output(OutputFormatPlugin):
name = 'FB2 Output'
author = 'John Schember'
file_type = 'fb2'
commit_name = 'fb2_output'
FB2_GENRES = [
# Science Fiction & Fantasy
'sf_history', # Alternative history
'sf_action', # Action
'sf_epic', # Epic
'sf_heroic', # Heroic
'sf_detective', # Detective
'sf_cyberpunk', # Cyberpunk
'sf_space', # Space
'sf_social', # Social#philosophical
'sf_horror', # Horror & mystic
'sf_humor', # Humor
'sf_fantasy', # Fantasy
'sf', # Science Fiction
# Detectives & Thrillers
'det_classic', # Classical detectives
'det_police', # Police Stories
'det_action', # Action
'det_irony', # Ironical detectives
'det_history', # Historical detectives
'det_espionage', # Espionage detectives
'det_crime', # Crime detectives
'det_political', # Political detectives
'det_maniac', # Maniacs
'det_hard', # Hard#boiled
'thriller', # Thrillers
'detective', # Detectives
# Prose
'prose_classic', # Classics prose
'prose_history', # Historical prose
'prose_contemporary', # Contemporary prose
'prose_counter', # Counterculture
'prose_rus_classic', # Russial classics prose
'prose_su_classics', # Soviet classics prose
# Romance
'love_contemporary', # Contemporary Romance
'love_history', # Historical Romance
'love_detective', # Detective Romance
'love_short', # Short Romance
'love_erotica', # Erotica
# Adventure
'adv_western', # Western
'adv_history', # History
'adv_indian', # Indians
'adv_maritime', # Maritime Fiction
'adv_geo', # Travel & geography
'adv_animal', # Nature & animals
'adventure', # Other
# Children's
'child_tale', # Fairy Tales
'child_verse', # Verses
'child_prose', # Prose
'child_sf', # Science Fiction
'child_det', # Detectives & Thrillers
'child_adv', # Adventures
'child_education', # Educational
'children', # Other
# Poetry & Dramaturgy
'poetry', # Poetry
'dramaturgy', # Dramaturgy
# Antique literature
'antique_ant', # Antique
'antique_european', # European
'antique_russian', # Old russian
'antique_east', # Old east
'antique_myths', # Myths. Legends. Epos
'antique', # Other
# Scientific#educational
'sci_history', # History
'sci_psychology', # Psychology
'sci_culture', # Cultural science
'sci_religion', # Religious studies
'sci_philosophy', # Philosophy
'sci_politics', # Politics
'sci_business', # Business literature
'sci_juris', # Jurisprudence
'sci_linguistic', # Linguistics
'sci_medicine', # Medicine
'sci_phys', # Physics
'sci_math', # Mathematics
'sci_chem', # Chemistry
'sci_biology', # Biology
'sci_tech', # Technical
'science', # Other
# Computers & Internet
'comp_www', # Internet
'comp_programming', # Programming
'comp_hard', # Hardware
'comp_soft', # Software
'comp_db', # Databases
'comp_osnet', # OS & Networking
'computers', # Other
# Reference
'ref_encyc', # Encyclopedias
'ref_dict', # Dictionaries
'ref_ref', # Reference
'ref_guide', # Guidebooks
'reference', # Other
# Nonfiction
'nonf_biography', # Biography & Memoirs
'nonf_publicism', # Publicism
'nonf_criticism', # Criticism
'design', # Art & design
'nonfiction', # Other
# Religion & Inspiration
'religion_rel', # Religion
'religion_esoterics', # Esoterics
'religion_self', # Self#improvement
'religion', # Other
# Humor
'humor_anecdote', # Anecdote (funny stories)
'humor_prose', # Prose
'humor_verse', # Verses
'humor', # Other
# Home & Family
'home_cooking', # Cooking
'home_pets', # Pets
'home_crafts', # Hobbies & Crafts
'home_entertain', # Entertaining
'home_health', # Health
'home_garden', # Garden
'home_diy', # Do it yourself
'home_sport', # Sports
'home_sex', # Erotica & sex
'home', # Other
]
ui_data = {
'sectionize': {
'toc': _('Section per entry in the ToC'),
'files': _('Section per file'),
'nothing': _('A single section')
},
'genres': FB2_GENRES,
}
options = {
OptionRecommendation(name='sectionize',
recommended_value='files', level=OptionRecommendation.LOW,
choices=list(ui_data['sectionize']),
help=_('Specify how sections are created:\n'
' * nothing: {nothing}\n'
' * files: {files}\n'
' * toc: {toc}\n'
'If ToC based generation fails, adjust the "Structure detection" and/or "Table of Contents" settings '
'(turn on "Force use of auto-generated Table of Contents").').format(**ui_data['sectionize'])
),
OptionRecommendation(name='fb2_genre',
recommended_value='antique', level=OptionRecommendation.LOW,
choices=FB2_GENRES,
help=(_('Genre for the book. Choices: %s\n\n See: ') % ', '.join(FB2_GENRES)
) + 'http://www.fictionbook.org/index.php/Eng:FictionBook_2.1_genres ' + _('for a complete list with descriptions.')),
}
def convert(self, oeb_book, output_path, input_plugin, opts, log):
from calibre.ebooks.oeb.transforms.jacket import linearize_jacket
from calibre.ebooks.oeb.transforms.rasterize import SVGRasterizer, Unavailable
from calibre.ebooks.fb2.fb2ml import FB2MLizer
try:
rasterizer = SVGRasterizer()
rasterizer(oeb_book, opts)
except Unavailable:
log.warn('SVG rasterizer unavailable, SVG will not be converted')
linearize_jacket(oeb_book)
fb2mlizer = FB2MLizer(log)
fb2_content = fb2mlizer.extract_content(oeb_book, opts)
close = False
if not hasattr(output_path, 'write'):
close = True
if not os.path.exists(os.path.dirname(output_path)) and os.path.dirname(output_path) != '':
os.makedirs(os.path.dirname(output_path))
out_stream = lopen(output_path, 'wb')
else:
out_stream = output_path
out_stream.seek(0)
out_stream.truncate()
out_stream.write(fb2_content.encode('utf-8', 'replace'))
if close:
out_stream.close()

View File

@@ -0,0 +1,316 @@
#!/usr/bin/env python2
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2012, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
import re, tempfile, os
from functools import partial
from calibre.constants import islinux, isbsd
from calibre.customize.conversion import (InputFormatPlugin,
OptionRecommendation)
from calibre.utils.localization import get_lang
from calibre.utils.filenames import ascii_filename
from calibre.utils.imghdr import what
from polyglot.builtins import unicode_type, zip, getcwd, as_unicode
def sanitize_file_name(x):
ans = re.sub(r'\s+', ' ', re.sub(r'[?&=;#]', '_', ascii_filename(x))).strip().rstrip('.')
ans, ext = ans.rpartition('.')[::2]
return (ans.strip() + '.' + ext.strip()).rstrip('.')
class HTMLInput(InputFormatPlugin):
name = 'HTML Input'
author = 'Kovid Goyal'
description = 'Convert HTML and OPF files to an OEB'
file_types = {'opf', 'html', 'htm', 'xhtml', 'xhtm', 'shtm', 'shtml'}
commit_name = 'html_input'
options = {
OptionRecommendation(name='breadth_first',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Traverse links in HTML files breadth first. Normally, '
'they are traversed depth first.'
)
),
OptionRecommendation(name='max_levels',
recommended_value=5, level=OptionRecommendation.LOW,
help=_('Maximum levels of recursion when following links in '
'HTML files. Must be non-negative. 0 implies that no '
'links in the root HTML file are followed. Default is '
'%default.'
)
),
OptionRecommendation(name='dont_package',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Normally this input plugin re-arranges all the input '
'files into a standard folder hierarchy. Only use this option '
'if you know what you are doing as it can result in various '
'nasty side effects in the rest of the conversion pipeline.'
)
),
}
def convert(self, stream, opts, file_ext, log,
accelerators):
self._is_case_sensitive = None
basedir = getcwd()
self.opts = opts
fname = None
if hasattr(stream, 'name'):
basedir = os.path.dirname(stream.name)
fname = os.path.basename(stream.name)
if file_ext != 'opf':
if opts.dont_package:
raise ValueError('The --dont-package option is not supported for an HTML input file')
from calibre.ebooks.metadata.html import get_metadata
mi = get_metadata(stream)
if fname:
from calibre.ebooks.metadata.meta import metadata_from_filename
fmi = metadata_from_filename(fname)
fmi.smart_update(mi)
mi = fmi
oeb = self.create_oebbook(stream.name, basedir, opts, log, mi)
return oeb
from calibre.ebooks.conversion.plumber import create_oebbook
return create_oebbook(log, stream.name, opts,
encoding=opts.input_encoding)
def is_case_sensitive(self, path):
if getattr(self, '_is_case_sensitive', None) is not None:
return self._is_case_sensitive
if not path or not os.path.exists(path):
return islinux or isbsd
self._is_case_sensitive = not (os.path.exists(path.lower()) and os.path.exists(path.upper()))
return self._is_case_sensitive
def create_oebbook(self, htmlpath, basedir, opts, log, mi):
import uuid
from calibre.ebooks.conversion.plumber import create_oebbook
from calibre.ebooks.oeb.base import (DirContainer,
rewrite_links, urlnormalize, urldefrag, BINARY_MIME, OEB_STYLES,
xpath, urlquote)
from calibre import guess_type
from calibre.ebooks.oeb.transforms.metadata import \
meta_info_to_oeb_metadata
from calibre.ebooks.html.input import get_filelist
from calibre.ebooks.metadata import string_to_authors
from calibre.utils.localization import canonicalize_lang
import css_parser, logging
css_parser.log.setLevel(logging.WARN)
self.OEB_STYLES = OEB_STYLES
oeb = create_oebbook(log, None, opts, self,
encoding=opts.input_encoding, populate=False)
self.oeb = oeb
metadata = oeb.metadata
meta_info_to_oeb_metadata(mi, metadata, log)
if not metadata.language:
l = canonicalize_lang(getattr(opts, 'language', None))
if not l:
oeb.logger.warn('Language not specified')
l = get_lang().replace('_', '-')
metadata.add('language', l)
if not metadata.creator:
a = getattr(opts, 'authors', None)
if a:
a = string_to_authors(a)
if not a:
oeb.logger.warn('Creator not specified')
a = [self.oeb.translate(__('Unknown'))]
for aut in a:
metadata.add('creator', aut)
if not metadata.title:
oeb.logger.warn('Title not specified')
metadata.add('title', self.oeb.translate(__('Unknown')))
bookid = unicode_type(uuid.uuid4())
metadata.add('identifier', bookid, id='uuid_id', scheme='uuid')
for ident in metadata.identifier:
if 'id' in ident.attrib:
self.oeb.uid = metadata.identifier[0]
break
filelist = get_filelist(htmlpath, basedir, opts, log)
filelist = [f for f in filelist if not f.is_binary]
htmlfile_map = {}
for f in filelist:
path = f.path
oeb.container = DirContainer(os.path.dirname(path), log,
ignore_opf=True)
bname = os.path.basename(path)
id, href = oeb.manifest.generate(id='html', href=sanitize_file_name(bname))
htmlfile_map[path] = href
item = oeb.manifest.add(id, href, 'text/html')
if path == htmlpath and '%' in path:
bname = urlquote(bname)
item.html_input_href = bname
oeb.spine.add(item, True)
self.added_resources = {}
self.log = log
self.log('Normalizing filename cases')
for path, href in htmlfile_map.items():
if not self.is_case_sensitive(path):
path = path.lower()
self.added_resources[path] = href
self.urlnormalize, self.DirContainer = urlnormalize, DirContainer
self.urldefrag = urldefrag
self.guess_type, self.BINARY_MIME = guess_type, BINARY_MIME
self.log('Rewriting HTML links')
for f in filelist:
path = f.path
dpath = os.path.dirname(path)
oeb.container = DirContainer(dpath, log, ignore_opf=True)
href = htmlfile_map[path]
try:
item = oeb.manifest.hrefs[href]
except KeyError:
item = oeb.manifest.hrefs[urlnormalize(href)]
rewrite_links(item.data, partial(self.resource_adder, base=dpath))
for item in oeb.manifest.values():
if item.media_type in self.OEB_STYLES:
dpath = None
for path, href in self.added_resources.items():
if href == item.href:
dpath = os.path.dirname(path)
break
css_parser.replaceUrls(item.data,
partial(self.resource_adder, base=dpath))
toc = self.oeb.toc
self.oeb.auto_generated_toc = True
titles = []
headers = []
for item in self.oeb.spine:
if not item.linear:
continue
html = item.data
title = ''.join(xpath(html, '/h:html/h:head/h:title/text()'))
title = re.sub(r'\s+', ' ', title.strip())
if title:
titles.append(title)
headers.append('(unlabled)')
for tag in ('h1', 'h2', 'h3', 'h4', 'h5', 'strong'):
expr = '/h:html/h:body//h:%s[position()=1]/text()'
header = ''.join(xpath(html, expr % tag))
header = re.sub(r'\s+', ' ', header.strip())
if header:
headers[-1] = header
break
use = titles
if len(titles) > len(set(titles)):
use = headers
for title, item in zip(use, self.oeb.spine):
if not item.linear:
continue
toc.add(title, item.href)
oeb.container = DirContainer(getcwd(), oeb.log, ignore_opf=True)
return oeb
def link_to_local_path(self, link_, base=None):
from calibre.ebooks.html.input import Link
if not isinstance(link_, unicode_type):
try:
link_ = link_.decode('utf-8', 'error')
except:
self.log.warn('Failed to decode link %r. Ignoring'%link_)
return None, None
try:
l = Link(link_, base if base else getcwd())
except:
self.log.exception('Failed to process link: %r'%link_)
return None, None
if l.path is None:
# Not a local resource
return None, None
link = l.path.replace('/', os.sep).strip()
frag = l.fragment
if not link:
return None, None
return link, frag
def resource_adder(self, link_, base=None):
from polyglot.urllib import quote
link, frag = self.link_to_local_path(link_, base=base)
if link is None:
return link_
try:
if base and not os.path.isabs(link):
link = os.path.join(base, link)
link = os.path.abspath(link)
except:
return link_
if not os.access(link, os.R_OK):
return link_
if os.path.isdir(link):
self.log.warn(link_, 'is a link to a directory. Ignoring.')
return link_
if not self.is_case_sensitive(tempfile.gettempdir()):
link = link.lower()
if link not in self.added_resources:
bhref = os.path.basename(link)
id, href = self.oeb.manifest.generate(id='added', href=sanitize_file_name(bhref))
guessed = self.guess_type(href)[0]
media_type = guessed or self.BINARY_MIME
if media_type == 'text/plain':
self.log.warn('Ignoring link to text file %r'%link_)
return None
if media_type == self.BINARY_MIME:
# Check for the common case, images
try:
img = what(link)
except EnvironmentError:
pass
else:
if img:
media_type = self.guess_type('dummy.'+img)[0] or self.BINARY_MIME
self.oeb.log.debug('Added', link)
self.oeb.container = self.DirContainer(os.path.dirname(link),
self.oeb.log, ignore_opf=True)
# Load into memory
item = self.oeb.manifest.add(id, href, media_type)
# bhref refers to an already existing file. The read() method of
# DirContainer will call unquote on it before trying to read the
# file, therefore we quote it here.
if isinstance(bhref, unicode_type):
bhref = bhref.encode('utf-8')
item.html_input_href = as_unicode(quote(bhref))
if guessed in self.OEB_STYLES:
item.override_css_fetch = partial(
self.css_import_handler, os.path.dirname(link))
item.data
self.added_resources[link] = href
nlink = self.added_resources[link]
if frag:
nlink = '#'.join((nlink, frag))
return nlink
def css_import_handler(self, base, href):
link, frag = self.link_to_local_path(href, base=base)
if link is None or not os.access(link, os.R_OK) or os.path.isdir(link):
return (None, None)
try:
with open(link, 'rb') as f:
raw = f.read().decode('utf-8', 'replace')
raw = self.oeb.css_preprocessor(raw, add_namespace=False)
except:
self.log.exception('Failed to read CSS file: %r'%link)
return (None, None)
return (None, raw)

View File

@@ -0,0 +1,226 @@
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2010, Fabian Grassl <fg@jusmeum.de>'
__docformat__ = 'restructuredtext en'
import os, re, shutil
from os.path import dirname, abspath, relpath as _relpath, exists, basename
from calibre.customize.conversion import OutputFormatPlugin, OptionRecommendation
from calibre import CurrentDir
from calibre.ptempfile import PersistentTemporaryDirectory
from polyglot.builtins import unicode_type
def relpath(*args):
return _relpath(*args).replace(os.sep, '/')
class HTMLOutput(OutputFormatPlugin):
name = 'HTML Output'
author = 'Fabian Grassl'
file_type = 'zip'
commit_name = 'html_output'
options = {
OptionRecommendation(name='template_css',
help=_('CSS file used for the output instead of the default file')),
OptionRecommendation(name='template_html_index',
help=_('Template used for generation of the HTML index file instead of the default file')),
OptionRecommendation(name='template_html',
help=_('Template used for the generation of the HTML contents of the book instead of the default file')),
OptionRecommendation(name='extract_to',
help=_('Extract the contents of the generated ZIP file to the '
'specified directory. WARNING: The contents of the directory '
'will be deleted.')
),
}
recommendations = {('pretty_print', True, OptionRecommendation.HIGH)}
def generate_toc(self, oeb_book, ref_url, output_dir):
'''
Generate table of contents
'''
from lxml import etree
from polyglot.urllib import unquote
from calibre.ebooks.oeb.base import element
from calibre.utils.cleantext import clean_xml_chars
with CurrentDir(output_dir):
def build_node(current_node, parent=None):
if parent is None:
parent = etree.Element('ul')
elif len(current_node.nodes):
parent = element(parent, ('ul'))
for node in current_node.nodes:
point = element(parent, 'li')
href = relpath(abspath(unquote(node.href)), dirname(ref_url))
if isinstance(href, bytes):
href = href.decode('utf-8')
link = element(point, 'a', href=clean_xml_chars(href))
title = node.title
if isinstance(title, bytes):
title = title.decode('utf-8')
if title:
title = re.sub(r'\s+', ' ', title)
link.text = clean_xml_chars(title)
build_node(node, point)
return parent
wrap = etree.Element('div')
wrap.append(build_node(oeb_book.toc))
return wrap
def generate_html_toc(self, oeb_book, ref_url, output_dir):
from lxml import etree
root = self.generate_toc(oeb_book, ref_url, output_dir)
return etree.tostring(root, pretty_print=True, encoding='unicode',
xml_declaration=False)
def convert(self, oeb_book, output_path, input_plugin, opts, log):
from lxml import etree
from calibre.utils import zipfile
from templite import Templite
from polyglot.urllib import unquote
from calibre.ebooks.html.meta import EasyMeta
# read template files
if opts.template_html_index is not None:
with open(opts.template_html_index, 'rb') as f:
template_html_index_data = f.read()
else:
template_html_index_data = P('templates/html_export_default_index.tmpl', data=True)
if opts.template_html is not None:
with open(opts.template_html, 'rb') as f:
template_html_data = f.read()
else:
template_html_data = P('templates/html_export_default.tmpl', data=True)
if opts.template_css is not None:
with open(opts.template_css, 'rb') as f:
template_css_data = f.read()
else:
template_css_data = P('templates/html_export_default.css', data=True)
template_html_index_data = template_html_index_data.decode('utf-8')
template_html_data = template_html_data.decode('utf-8')
template_css_data = template_css_data.decode('utf-8')
self.log = log
self.opts = opts
meta = EasyMeta(oeb_book.metadata)
tempdir = os.path.realpath(PersistentTemporaryDirectory())
output_file = os.path.join(tempdir,
basename(re.sub(r'\.zip', '', output_path)+'.html'))
output_dir = re.sub(r'\.html', '', output_file)+'_files'
if not exists(output_dir):
os.makedirs(output_dir)
css_path = output_dir+os.sep+'calibreHtmlOutBasicCss.css'
with open(css_path, 'wb') as f:
f.write(template_css_data.encode('utf-8'))
with open(output_file, 'wb') as f:
html_toc = self.generate_html_toc(oeb_book, output_file, output_dir)
templite = Templite(template_html_index_data)
nextLink = oeb_book.spine[0].href
nextLink = relpath(output_dir+os.sep+nextLink, dirname(output_file))
cssLink = relpath(abspath(css_path), dirname(output_file))
tocUrl = relpath(output_file, dirname(output_file))
t = templite.render(has_toc=bool(oeb_book.toc.count()),
toc=html_toc, meta=meta, nextLink=nextLink,
tocUrl=tocUrl, cssLink=cssLink,
firstContentPageLink=nextLink)
if isinstance(t, unicode_type):
t = t.encode('utf-8')
f.write(t)
with CurrentDir(output_dir):
for item in oeb_book.manifest:
path = abspath(unquote(item.href))
dir = dirname(path)
if not exists(dir):
os.makedirs(dir)
if item.spine_position is not None:
with open(path, 'wb') as f:
pass
else:
with open(path, 'wb') as f:
f.write(item.bytes_representation)
item.unload_data_from_memory(memory=path)
for item in oeb_book.spine:
path = abspath(unquote(item.href))
dir = dirname(path)
root = item.data.getroottree()
# get & clean HTML <HEAD>-data
head = root.xpath('//h:head', namespaces={'h': 'http://www.w3.org/1999/xhtml'})[0]
head_content = etree.tostring(head, pretty_print=True, encoding='unicode')
head_content = re.sub(r'\<\/?head.*\>', '', head_content)
head_content = re.sub(re.compile(r'\<style.*\/style\>', re.M|re.S), '', head_content)
head_content = re.sub(r'<(title)([^>]*)/>', r'<\1\2></\1>', head_content)
# get & clean HTML <BODY>-data
body = root.xpath('//h:body', namespaces={'h': 'http://www.w3.org/1999/xhtml'})[0]
ebook_content = etree.tostring(body, pretty_print=True, encoding='unicode')
ebook_content = re.sub(r'\<\/?body.*\>', '', ebook_content)
ebook_content = re.sub(r'<(div|a|span)([^>]*)/>', r'<\1\2></\1>', ebook_content)
# generate link to next page
if item.spine_position+1 < len(oeb_book.spine):
nextLink = oeb_book.spine[item.spine_position+1].href
nextLink = relpath(abspath(nextLink), dir)
else:
nextLink = None
# generate link to previous page
if item.spine_position > 0:
prevLink = oeb_book.spine[item.spine_position-1].href
prevLink = relpath(abspath(prevLink), dir)
else:
prevLink = None
cssLink = relpath(abspath(css_path), dir)
tocUrl = relpath(output_file, dir)
firstContentPageLink = oeb_book.spine[0].href
# render template
templite = Templite(template_html_data)
toc = lambda: self.generate_html_toc(oeb_book, path, output_dir)
t = templite.render(ebookContent=ebook_content,
prevLink=prevLink, nextLink=nextLink,
has_toc=bool(oeb_book.toc.count()), toc=toc,
tocUrl=tocUrl, head_content=head_content,
meta=meta, cssLink=cssLink,
firstContentPageLink=firstContentPageLink)
# write html to file
with open(path, 'wb') as f:
f.write(t.encode('utf-8'))
item.unload_data_from_memory(memory=path)
zfile = zipfile.ZipFile(output_path, "w")
zfile.add_dir(output_dir, basename(output_dir))
zfile.write(output_file, basename(output_file), zipfile.ZIP_DEFLATED)
if opts.extract_to:
if os.path.exists(opts.extract_to):
shutil.rmtree(opts.extract_to)
os.makedirs(opts.extract_to)
zfile.extractall(opts.extract_to)
self.log('Zip file extracted to', opts.extract_to)
zfile.close()
# cleanup temp dir
shutil.rmtree(tempdir)

View File

@@ -0,0 +1,133 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2011, John Schember <john@nachtimwald.com>'
__docformat__ = 'restructuredtext en'
import os
from calibre import guess_type
from calibre.customize.conversion import InputFormatPlugin
from polyglot.builtins import getcwd
class HTMLZInput(InputFormatPlugin):
name = 'HTLZ Input'
author = 'John Schember'
description = 'Convert HTML files to HTML'
file_types = {'htmlz'}
commit_name = 'htmlz_input'
def convert(self, stream, options, file_ext, log,
accelerators):
from calibre.ebooks.chardet import xml_to_unicode
from calibre.ebooks.metadata.opf2 import OPF
from calibre.utils.zipfile import ZipFile
self.log = log
html = u''
top_levels = []
# Extract content from zip archive.
zf = ZipFile(stream)
zf.extractall()
# Find the HTML file in the archive. It needs to be
# top level.
index = u''
multiple_html = False
# Get a list of all top level files in the archive.
for x in os.listdir(u'.'):
if os.path.isfile(x):
top_levels.append(x)
# Try to find an index. file.
for x in top_levels:
if x.lower() in (u'index.html', u'index.xhtml', u'index.htm'):
index = x
break
# Look for multiple HTML files in the archive. We look at the
# top level files only as only they matter in HTMLZ.
for x in top_levels:
if os.path.splitext(x)[1].lower() in (u'.html', u'.xhtml', u'.htm'):
# Set index to the first HTML file found if it's not
# called index.
if not index:
index = x
else:
multiple_html = True
# Warn the user if there multiple HTML file in the archive. HTMLZ
# supports a single HTML file. A conversion with a multiple HTML file
# HTMLZ archive probably won't turn out as the user expects. With
# Multiple HTML files ZIP input should be used in place of HTMLZ.
if multiple_html:
log.warn(_('Multiple HTML files found in the archive. Only %s will be used.') % index)
if index:
with open(index, 'rb') as tf:
html = tf.read()
else:
raise Exception(_('No top level HTML file found.'))
if not html:
raise Exception(_('Top level HTML file %s is empty') % index)
# Encoding
if options.input_encoding:
ienc = options.input_encoding
else:
ienc = xml_to_unicode(html[:4096])[-1]
html = html.decode(ienc, 'replace')
# Run the HTML through the html processing plugin.
from calibre.customize.ui import plugin_for_input_format
html_input = plugin_for_input_format('html')
for opt in html_input.options:
setattr(options, opt.option.name, opt.recommended_value)
options.input_encoding = 'utf-8'
base = getcwd()
htmlfile = os.path.join(base, u'index.html')
c = 0
while os.path.exists(htmlfile):
c += 1
htmlfile = u'index%d.html'%c
with open(htmlfile, 'wb') as f:
f.write(html.encode('utf-8'))
odi = options.debug_pipeline
options.debug_pipeline = None
# Generate oeb from html conversion.
with open(htmlfile, 'rb') as f:
oeb = html_input.convert(f, options, 'html', log,
{})
options.debug_pipeline = odi
os.remove(htmlfile)
# Set metadata from file.
from calibre.customize.ui import get_file_type_metadata
from calibre.ebooks.oeb.transforms.metadata import meta_info_to_oeb_metadata
mi = get_file_type_metadata(stream, file_ext)
meta_info_to_oeb_metadata(mi, oeb.metadata, log)
# Get the cover path from the OPF.
cover_path = None
opf = None
for x in top_levels:
if os.path.splitext(x)[1].lower() == u'.opf':
opf = x
break
if opf:
opf = OPF(opf, basedir=getcwd())
cover_path = opf.raster_cover or opf.cover
# Set the cover.
if cover_path:
cdata = None
with open(os.path.join(getcwd(), cover_path), 'rb') as cf:
cdata = cf.read()
cover_name = os.path.basename(cover_path)
id, href = oeb.manifest.generate('cover', cover_name)
oeb.manifest.add(id, href, guess_type(cover_name)[0], data=cdata)
oeb.guide.add('cover', 'Cover', href)
return oeb

View File

@@ -0,0 +1,136 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2011, John Schember <john@nachtimwald.com>'
__docformat__ = 'restructuredtext en'
import io
import os
from calibre.customize.conversion import OutputFormatPlugin, \
OptionRecommendation
from calibre.ptempfile import TemporaryDirectory
from polyglot.builtins import unicode_type
class HTMLZOutput(OutputFormatPlugin):
name = 'HTMLZ Output'
author = 'John Schember'
file_type = 'htmlz'
commit_name = 'htmlz_output'
ui_data = {
'css_choices': {
'class': _('Use CSS classes'),
'inline': _('Use the style attribute'),
'tag': _('Use HTML tags wherever possible')
},
'sheet_choices': {
'external': _('Use an external CSS file'),
'inline': _('Use a <style> tag in the HTML file')
}
}
options = {
OptionRecommendation(name='htmlz_css_type', recommended_value='class',
level=OptionRecommendation.LOW,
choices=list(ui_data['css_choices']),
help=_('Specify the handling of CSS. Default is class.\n'
'class: {class}\n'
'inline: {inline}\n'
'tag: {tag}'
).format(**ui_data['css_choices'])),
OptionRecommendation(name='htmlz_class_style', recommended_value='external',
level=OptionRecommendation.LOW,
choices=list(ui_data['sheet_choices']),
help=_('How to handle the CSS when using css-type = \'class\'.\n'
'Default is external.\n'
'external: {external}\n'
'inline: {inline}'
).format(**ui_data['sheet_choices'])),
OptionRecommendation(name='htmlz_title_filename',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('If set this option causes the file name of the HTML file'
' inside the HTMLZ archive to be based on the book title.')
),
}
def convert(self, oeb_book, output_path, input_plugin, opts, log):
from lxml import etree
from calibre.ebooks.oeb.base import OEB_IMAGES, SVG_MIME
from calibre.ebooks.metadata.opf2 import OPF, metadata_to_opf
from calibre.utils.zipfile import ZipFile
from calibre.utils.filenames import ascii_filename
# HTML
if opts.htmlz_css_type == 'inline':
from calibre.ebooks.htmlz.oeb2html import OEB2HTMLInlineCSSizer
OEB2HTMLizer = OEB2HTMLInlineCSSizer
elif opts.htmlz_css_type == 'tag':
from calibre.ebooks.htmlz.oeb2html import OEB2HTMLNoCSSizer
OEB2HTMLizer = OEB2HTMLNoCSSizer
else:
from calibre.ebooks.htmlz.oeb2html import OEB2HTMLClassCSSizer as OEB2HTMLizer
with TemporaryDirectory(u'_htmlz_output') as tdir:
htmlizer = OEB2HTMLizer(log)
html = htmlizer.oeb2html(oeb_book, opts)
fname = u'index'
if opts.htmlz_title_filename:
from calibre.utils.filenames import shorten_components_to
fname = shorten_components_to(100, (ascii_filename(unicode_type(oeb_book.metadata.title[0])),))[0]
with open(os.path.join(tdir, fname+u'.html'), 'wb') as tf:
if isinstance(html, unicode_type):
html = html.encode('utf-8')
tf.write(html)
# CSS
if opts.htmlz_css_type == 'class' and opts.htmlz_class_style == 'external':
with open(os.path.join(tdir, u'style.css'), 'wb') as tf:
tf.write(htmlizer.get_css(oeb_book))
# Images
images = htmlizer.images
if images:
if not os.path.exists(os.path.join(tdir, u'images')):
os.makedirs(os.path.join(tdir, u'images'))
for item in oeb_book.manifest:
if item.media_type in OEB_IMAGES and item.href in images:
if item.media_type == SVG_MIME:
data = etree.tostring(item.data, encoding='unicode')
else:
data = item.data
fname = os.path.join(tdir, u'images', images[item.href])
with open(fname, 'wb') as img:
img.write(data)
# Cover
cover_path = None
try:
cover_data = None
if oeb_book.metadata.cover:
term = oeb_book.metadata.cover[0].term
cover_data = oeb_book.guide[term].item.data
if cover_data:
from calibre.utils.img import save_cover_data_to
cover_path = os.path.join(tdir, u'cover.jpg')
with lopen(cover_path, 'w') as cf:
cf.write('')
save_cover_data_to(cover_data, cover_path)
except:
import traceback
traceback.print_exc()
# Metadata
with open(os.path.join(tdir, u'metadata.opf'), 'wb') as mdataf:
opf = OPF(io.BytesIO(etree.tostring(oeb_book.metadata.to_opf1(), encoding='UTF-8')))
mi = opf.to_book_metadata()
if cover_path:
mi.cover = u'cover.jpg'
mdataf.write(metadata_to_opf(mi))
htmlz = ZipFile(output_path, 'w')
htmlz.add_dir(tdir)

View File

@@ -0,0 +1,64 @@
#!/usr/bin/env python2
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
from calibre.customize.conversion import InputFormatPlugin
class LITInput(InputFormatPlugin):
name = 'LIT Input'
author = 'Marshall T. Vandegrift'
description = 'Convert LIT files to HTML'
file_types = {'lit'}
commit_name = 'lit_input'
def convert(self, stream, options, file_ext, log,
accelerators):
from calibre.ebooks.lit.reader import LitReader
from calibre.ebooks.conversion.plumber import create_oebbook
self.log = log
return create_oebbook(log, stream, options, reader=LitReader)
def postprocess_book(self, oeb, opts, log):
from calibre.ebooks.oeb.base import XHTML_NS, XPath, XHTML
for item in oeb.spine:
root = item.data
if not hasattr(root, 'xpath'):
continue
for bad in ('metadata', 'guide'):
metadata = XPath('//h:'+bad)(root)
if metadata:
for x in metadata:
x.getparent().remove(x)
body = XPath('//h:body')(root)
if body:
body = body[0]
if len(body) == 1 and body[0].tag == XHTML('pre'):
pre = body[0]
from calibre.ebooks.txt.processor import convert_basic, \
separate_paragraphs_single_line
from calibre.ebooks.chardet import xml_to_unicode
from calibre.utils.xml_parse import safe_xml_fromstring
import copy
self.log('LIT file with all text in singe <pre> tag detected')
html = separate_paragraphs_single_line(pre.text)
html = convert_basic(html).replace('<html>',
'<html xmlns="%s">'%XHTML_NS)
html = xml_to_unicode(html, strip_encoding_pats=True,
resolve_entities=True)[0]
if opts.smarten_punctuation:
# SmartyPants skips text inside <pre> tags
from calibre.ebooks.conversion.preprocess import smarten_punctuation
html = smarten_punctuation(html, self.log)
root = safe_xml_fromstring(html)
body = XPath('//h:body')(root)
pre.tag = XHTML('div')
pre.text = ''
for elem in body:
ne = copy.deepcopy(elem)
pre.append(ne)

View File

@@ -0,0 +1,38 @@
#!/usr/bin/env python2
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
from calibre.customize.conversion import OutputFormatPlugin
class LITOutput(OutputFormatPlugin):
name = 'LIT Output'
author = 'Marshall T. Vandegrift'
file_type = 'lit'
commit_name = 'lit_output'
def convert(self, oeb, output_path, input_plugin, opts, log):
self.log, self.opts, self.oeb = log, opts, oeb
from calibre.ebooks.oeb.transforms.manglecase import CaseMangler
from calibre.ebooks.oeb.transforms.rasterize import SVGRasterizer
from calibre.ebooks.oeb.transforms.htmltoc import HTMLTOCAdder
from calibre.ebooks.lit.writer import LitWriter
from calibre.ebooks.oeb.transforms.split import Split
split = Split(split_on_page_breaks=True, max_flow_size=0,
remove_css_pagebreaks=False)
split(self.oeb, self.opts)
tocadder = HTMLTOCAdder()
tocadder(oeb, opts)
mangler = CaseMangler()
mangler(oeb, opts)
rasterizer = SVGRasterizer()
rasterizer(oeb, opts)
lit = LitWriter(self.opts)
lit(oeb, output_path)

View File

@@ -0,0 +1,82 @@
#!/usr/bin/env python2
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
import os, sys
from calibre.customize.conversion import InputFormatPlugin
class LRFInput(InputFormatPlugin):
name = 'LRF Input'
author = 'Kovid Goyal'
description = 'Convert LRF files to HTML'
file_types = {'lrf'}
commit_name = 'lrf_input'
def convert(self, stream, options, file_ext, log,
accelerators):
from calibre.ebooks.lrf.input import (MediaType, Styles, TextBlock,
Canvas, ImageBlock, RuledLine)
self.log = log
self.log('Generating XML')
from calibre.ebooks.lrf.lrfparser import LRFDocument
from calibre.utils.xml_parse import safe_xml_fromstring
from lxml import etree
d = LRFDocument(stream)
d.parse()
xml = d.to_xml(write_files=True)
if options.verbose > 2:
open(u'lrs.xml', 'wb').write(xml.encode('utf-8'))
doc = safe_xml_fromstring(xml)
char_button_map = {}
for x in doc.xpath('//CharButton[@refobj]'):
ro = x.get('refobj')
jump_button = doc.xpath('//*[@objid="%s"]'%ro)
if jump_button:
jump_to = jump_button[0].xpath('descendant::JumpTo[@refpage and @refobj]')
if jump_to:
char_button_map[ro] = '%s.xhtml#%s'%(jump_to[0].get('refpage'),
jump_to[0].get('refobj'))
plot_map = {}
for x in doc.xpath('//Plot[@refobj]'):
ro = x.get('refobj')
image = doc.xpath('//Image[@objid="%s" and @refstream]'%ro)
if image:
imgstr = doc.xpath('//ImageStream[@objid="%s" and @file]'%
image[0].get('refstream'))
if imgstr:
plot_map[ro] = imgstr[0].get('file')
self.log('Converting XML to HTML...')
styledoc = safe_xml_fromstring(P('templates/lrf.xsl', data=True))
media_type = MediaType()
styles = Styles()
text_block = TextBlock(styles, char_button_map, plot_map, log)
canvas = Canvas(doc, styles, text_block, log)
image_block = ImageBlock(canvas)
ruled_line = RuledLine()
extensions = {
('calibre', 'media-type') : media_type,
('calibre', 'text-block') : text_block,
('calibre', 'ruled-line') : ruled_line,
('calibre', 'styles') : styles,
('calibre', 'canvas') : canvas,
('calibre', 'image-block'): image_block,
}
transform = etree.XSLT(styledoc, extensions=extensions)
try:
result = transform(doc)
except RuntimeError:
sys.setrecursionlimit(5000)
result = transform(doc)
with open('content.opf', 'wb') as f:
f.write(result)
styles.write()
return os.path.abspath('content.opf')

View File

@@ -0,0 +1,196 @@
#!/usr/bin/env python2
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
import sys, os
from calibre.customize.conversion import OutputFormatPlugin
from calibre.customize.conversion import OptionRecommendation
from polyglot.builtins import unicode_type
class LRFOptions(object):
def __init__(self, output, opts, oeb):
def f2s(f):
try:
return unicode_type(f[0])
except:
return ''
m = oeb.metadata
for x in ('left', 'top', 'right', 'bottom'):
attr = 'margin_'+x
val = getattr(opts, attr)
if val < 0:
setattr(opts, attr, 0)
self.title = None
self.author = self.publisher = _('Unknown')
self.title_sort = self.author_sort = ''
for x in m.creator:
if x.role == 'aut':
self.author = unicode_type(x)
fa = unicode_type(getattr(x, 'file_as', ''))
if fa:
self.author_sort = fa
for x in m.title:
if unicode_type(x.file_as):
self.title_sort = unicode_type(x.file_as)
self.freetext = f2s(m.description)
self.category = f2s(m.subject)
self.cover = None
self.use_metadata_cover = True
self.output = output
self.ignore_tables = opts.linearize_tables
if opts.disable_font_rescaling:
self.base_font_size = 0
else:
self.base_font_size = opts.base_font_size
self.blank_after_para = opts.insert_blank_line
self.use_spine = True
self.font_delta = 0
self.ignore_colors = False
from calibre.ebooks.lrf import PRS500_PROFILE
self.profile = PRS500_PROFILE
self.link_levels = sys.maxsize
self.link_exclude = '@'
self.no_links_in_toc = True
self.disable_chapter_detection = True
self.chapter_regex = 'dsadcdswcdec'
self.chapter_attr = '$,,$'
self.override_css = self._override_css = ''
self.page_break = 'h[12]'
self.force_page_break = '$'
self.force_page_break_attr = '$'
self.add_chapters_to_toc = False
self.baen = self.pdftohtml = self.book_designer = False
self.verbose = opts.verbose
self.encoding = 'utf-8'
self.lrs = False
self.minimize_memory_usage = False
self.autorotation = opts.enable_autorotation
self.header_separation = (self.profile.dpi/72.) * opts.header_separation
self.headerformat = opts.header_format
for x in ('top', 'bottom', 'left', 'right'):
setattr(self, x+'_margin',
(self.profile.dpi/72.) * float(getattr(opts, 'margin_'+x)))
for x in ('wordspace', 'header', 'header_format',
'minimum_indent', 'serif_family',
'render_tables_as_images', 'sans_family', 'mono_family',
'text_size_multiplier_for_rendered_tables'):
setattr(self, x, getattr(opts, x))
class LRFOutput(OutputFormatPlugin):
name = 'LRF Output'
author = 'Kovid Goyal'
file_type = 'lrf'
commit_name = 'lrf_output'
options = {
OptionRecommendation(name='enable_autorotation', recommended_value=False,
help=_('Enable auto-rotation of images that are wider than the screen width.')
),
OptionRecommendation(name='wordspace',
recommended_value=2.5, level=OptionRecommendation.LOW,
help=_('Set the space between words in pts. Default is %default')
),
OptionRecommendation(name='header', recommended_value=False,
help=_('Add a header to all the pages with title and author.')
),
OptionRecommendation(name='header_format', recommended_value="%t by %a",
help=_('Set the format of the header. %a is replaced by the author '
'and %t by the title. Default is %default')
),
OptionRecommendation(name='header_separation', recommended_value=0,
help=_('Add extra spacing below the header. Default is %default pt.')
),
OptionRecommendation(name='minimum_indent', recommended_value=0,
help=_('Minimum paragraph indent (the indent of the first line '
'of a paragraph) in pts. Default: %default')
),
OptionRecommendation(name='render_tables_as_images',
recommended_value=False,
help=_('This option has no effect')
),
OptionRecommendation(name='text_size_multiplier_for_rendered_tables',
recommended_value=1.0,
help=_('Multiply the size of text in rendered tables by this '
'factor. Default is %default')
),
OptionRecommendation(name='serif_family', recommended_value=None,
help=_('The serif family of fonts to embed')
),
OptionRecommendation(name='sans_family', recommended_value=None,
help=_('The sans-serif family of fonts to embed')
),
OptionRecommendation(name='mono_family', recommended_value=None,
help=_('The monospace family of fonts to embed')
),
}
recommendations = {
('change_justification', 'original', OptionRecommendation.HIGH)}
def convert_images(self, pages, opts, wide):
from calibre.ebooks.lrf.pylrs.pylrs import Book, BookSetting, ImageStream, ImageBlock
from uuid import uuid4
from calibre.constants import __appname__, __version__
width, height = (784, 1012) if wide else (584, 754)
ps = {}
ps['topmargin'] = 0
ps['evensidemargin'] = 0
ps['oddsidemargin'] = 0
ps['textwidth'] = width
ps['textheight'] = height
book = Book(title=opts.title, author=opts.author,
bookid=uuid4().hex,
publisher='%s %s'%(__appname__, __version__),
category=_('Comic'), pagestyledefault=ps,
booksetting=BookSetting(screenwidth=width, screenheight=height))
for page in pages:
imageStream = ImageStream(page)
_page = book.create_page()
_page.append(ImageBlock(refstream=imageStream,
blockwidth=width, blockheight=height, xsize=width,
ysize=height, x1=width, y1=height))
book.append(_page)
book.renderLrf(open(opts.output, 'wb'))
def flatten_toc(self):
from calibre.ebooks.oeb.base import TOC
nroot = TOC()
for x in self.oeb.toc.iterdescendants():
nroot.add(x.title, x.href)
self.oeb.toc = nroot
def convert(self, oeb, output_path, input_plugin, opts, log):
self.log, self.opts, self.oeb = log, opts, oeb
lrf_opts = LRFOptions(output_path, opts, oeb)
if input_plugin.is_image_collection:
self.convert_images(input_plugin.get_images(), lrf_opts,
getattr(opts, 'wide', False))
return
self.flatten_toc()
from calibre.ptempfile import TemporaryDirectory
with TemporaryDirectory('_lrf_output') as tdir:
from calibre.customize.ui import plugin_for_output_format
oeb_output = plugin_for_output_format('oeb')
oeb_output.convert(oeb, tdir, input_plugin, opts, log)
opf = [x for x in os.listdir(tdir) if x.endswith('.opf')][0]
from calibre.ebooks.lrf.html.convert_from import process_file
process_file(os.path.join(tdir, opf), lrf_opts, self.log)

View File

@@ -0,0 +1,66 @@
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
import os
from calibre.customize.conversion import InputFormatPlugin
from polyglot.builtins import unicode_type
class MOBIInput(InputFormatPlugin):
name = 'MOBI Input'
author = 'Kovid Goyal'
description = 'Convert MOBI files (.mobi, .prc, .azw) to HTML'
file_types = {'mobi', 'prc', 'azw', 'azw3', 'pobi'}
commit_name = 'mobi_input'
def convert(self, stream, options, file_ext, log,
accelerators):
self.is_kf8 = False
self.mobi_is_joint = False
from calibre.ebooks.mobi.reader.mobi6 import MobiReader
from lxml import html
parse_cache = {}
try:
mr = MobiReader(stream, log, options.input_encoding,
options.debug_pipeline)
if mr.kf8_type is None:
mr.extract_content('.', parse_cache)
except:
mr = MobiReader(stream, log, options.input_encoding,
options.debug_pipeline, try_extra_data_fix=True)
if mr.kf8_type is None:
mr.extract_content('.', parse_cache)
if mr.kf8_type is not None:
log('Found KF8 MOBI of type %r'%mr.kf8_type)
if mr.kf8_type == 'joint':
self.mobi_is_joint = True
from calibre.ebooks.mobi.reader.mobi8 import Mobi8Reader
mr = Mobi8Reader(mr, log)
opf = os.path.abspath(mr())
self.encrypted_fonts = mr.encrypted_fonts
self.is_kf8 = True
return opf
raw = parse_cache.pop('calibre_raw_mobi_markup', False)
if raw:
if isinstance(raw, unicode_type):
raw = raw.encode('utf-8')
with lopen('debug-raw.html', 'wb') as f:
f.write(raw)
from calibre.ebooks.oeb.base import close_self_closing_tags
for f, root in parse_cache.items():
raw = html.tostring(root, encoding='utf-8', method='xml',
include_meta_content_type=False)
raw = close_self_closing_tags(raw)
with lopen(f, 'wb') as q:
q.write(raw)
accelerators['pagebreaks'] = '//h:div[@class="mbp_pagebreak"]'
return mr.created_opf_path

View File

@@ -0,0 +1,337 @@
#!/usr/bin/env python2
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
from calibre.customize.conversion import (OutputFormatPlugin,
OptionRecommendation)
from polyglot.builtins import unicode_type
def remove_html_cover(oeb, log):
from calibre.ebooks.oeb.base import OEB_DOCS
if not oeb.metadata.cover \
or 'cover' not in oeb.guide:
return
href = oeb.guide['cover'].href
del oeb.guide['cover']
item = oeb.manifest.hrefs[href]
if item.spine_position is not None:
log.warn('Found an HTML cover: ', item.href, 'removing it.',
'If you find some content missing from the output MOBI, it '
'is because you misidentified the HTML cover in the input '
'document')
oeb.spine.remove(item)
if item.media_type in OEB_DOCS:
oeb.manifest.remove(item)
def extract_mobi(output_path, opts):
if opts.extract_to is not None:
from calibre.ebooks.mobi.debug.main import inspect_mobi
ddir = opts.extract_to
inspect_mobi(output_path, ddir=ddir)
class MOBIOutput(OutputFormatPlugin):
name = 'MOBI Output'
author = 'Kovid Goyal'
file_type = 'mobi'
commit_name = 'mobi_output'
ui_data = {'file_types': ['old', 'both', 'new']}
options = {
OptionRecommendation(name='prefer_author_sort',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('When present, use author sort field as author.')
),
OptionRecommendation(name='no_inline_toc',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Don\'t add Table of Contents to the book. Useful if '
'the book has its own table of contents.')),
OptionRecommendation(name='toc_title', recommended_value=None,
help=_('Title for any generated in-line table of contents.')
),
OptionRecommendation(name='dont_compress',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Disable compression of the file contents.')
),
OptionRecommendation(name='personal_doc', recommended_value='[PDOC]',
help=_('Tag for MOBI files to be marked as personal documents.'
' This option has no effect on the conversion. It is used'
' only when sending MOBI files to a device. If the file'
' being sent has the specified tag, it will be marked as'
' a personal document when sent to the Kindle.')
),
OptionRecommendation(name='mobi_ignore_margins',
recommended_value=False,
help=_('Ignore margins in the input document. If False, then '
'the MOBI output plugin will try to convert margins specified'
' in the input document, otherwise it will ignore them.')
),
OptionRecommendation(name='mobi_toc_at_start',
recommended_value=False,
help=_('When adding the Table of Contents to the book, add it at the start of the '
'book instead of the end. Not recommended.')
),
OptionRecommendation(name='extract_to',
help=_('Extract the contents of the generated %s file to the '
'specified directory. The contents of the directory are first '
'deleted, so be careful.') % 'MOBI'
),
OptionRecommendation(name='share_not_sync', recommended_value=False,
help=_('Enable sharing of book content via Facebook etc. '
' on the Kindle. WARNING: Using this feature means that '
' the book will not auto sync its last read position '
' on multiple devices. Complain to Amazon.')
),
OptionRecommendation(name='mobi_keep_original_images',
recommended_value=False,
help=_('By default calibre converts all images to JPEG format '
'in the output MOBI file. This is for maximum compatibility '
'as some older MOBI viewers have problems with other image '
'formats. This option tells calibre not to do this. '
'Useful if your document contains lots of GIF/PNG images that '
'become very large when converted to JPEG.')),
OptionRecommendation(name='mobi_file_type', choices=ui_data['file_types'], recommended_value='old',
help=_('By default calibre generates MOBI files that contain the '
'old MOBI 6 format. This format is compatible with all '
'devices. However, by changing this setting, you can tell '
'calibre to generate MOBI files that contain both MOBI 6 and '
'the new KF8 format, or only the new KF8 format. KF8 has '
'more features than MOBI 6, but only works with newer Kindles. '
'Allowed values: {}').format('old, both, new')),
}
def check_for_periodical(self):
if self.is_periodical:
self.periodicalize_toc()
self.check_for_masthead()
self.opts.mobi_periodical = True
else:
self.opts.mobi_periodical = False
def check_for_masthead(self):
found = 'masthead' in self.oeb.guide
if not found:
from calibre.ebooks import generate_masthead
self.oeb.log.debug('No masthead found in manifest, generating default mastheadImage...')
raw = generate_masthead(unicode_type(self.oeb.metadata['title'][0]))
id, href = self.oeb.manifest.generate('masthead', 'masthead')
self.oeb.manifest.add(id, href, 'image/gif', data=raw)
self.oeb.guide.add('masthead', 'Masthead Image', href)
else:
self.oeb.log.debug('Using mastheadImage supplied in manifest...')
def periodicalize_toc(self):
from calibre.ebooks.oeb.base import TOC
toc = self.oeb.toc
if not toc or len(self.oeb.spine) < 3:
return
if toc and toc[0].klass != 'periodical':
one, two = self.oeb.spine[0], self.oeb.spine[1]
self.log('Converting TOC for MOBI periodical indexing...')
articles = {}
if toc.depth() < 3:
# single section periodical
self.oeb.manifest.remove(one)
self.oeb.manifest.remove(two)
sections = [TOC(klass='section', title=_('All articles'),
href=self.oeb.spine[0].href)]
for x in toc:
sections[0].nodes.append(x)
else:
# multi-section periodical
self.oeb.manifest.remove(one)
sections = list(toc)
for i,x in enumerate(sections):
x.klass = 'section'
articles_ = list(x)
if articles_:
self.oeb.manifest.remove(self.oeb.manifest.hrefs[x.href])
x.href = articles_[0].href
for sec in sections:
articles[id(sec)] = []
for a in list(sec):
a.klass = 'article'
articles[id(sec)].append(a)
sec.nodes.remove(a)
root = TOC(klass='periodical', href=self.oeb.spine[0].href,
title=unicode_type(self.oeb.metadata.title[0]))
for s in sections:
if articles[id(s)]:
for a in articles[id(s)]:
s.nodes.append(a)
root.nodes.append(s)
for x in list(toc.nodes):
toc.nodes.remove(x)
toc.nodes.append(root)
# Fix up the periodical href to point to first section href
toc.nodes[0].href = toc.nodes[0].nodes[0].href
def convert(self, oeb, output_path, input_plugin, opts, log):
from calibre.ebooks.mobi.writer2.resources import Resources
self.log, self.opts, self.oeb = log, opts, oeb
mobi_type = opts.mobi_file_type
if self.is_periodical:
mobi_type = 'old' # Amazon does not support KF8 periodicals
create_kf8 = mobi_type in ('new', 'both')
remove_html_cover(self.oeb, self.log)
resources = Resources(oeb, opts, self.is_periodical,
add_fonts=create_kf8)
self.check_for_periodical()
if create_kf8:
from calibre.ebooks.mobi.writer8.cleanup import remove_duplicate_anchors
remove_duplicate_anchors(self.oeb)
# Split on pagebreaks so that the resulting KF8 is faster to load
from calibre.ebooks.oeb.transforms.split import Split
Split()(self.oeb, self.opts)
kf8 = self.create_kf8(resources, for_joint=mobi_type=='both'
) if create_kf8 else None
if mobi_type == 'new':
kf8.write(output_path)
extract_mobi(output_path, opts)
return
self.log('Creating MOBI 6 output')
self.write_mobi(input_plugin, output_path, kf8, resources)
def create_kf8(self, resources, for_joint=False):
from calibre.ebooks.mobi.writer8.main import create_kf8_book
return create_kf8_book(self.oeb, self.opts, resources,
for_joint=for_joint)
def write_mobi(self, input_plugin, output_path, kf8, resources):
from calibre.ebooks.mobi.mobiml import MobiMLizer
from calibre.ebooks.oeb.transforms.manglecase import CaseMangler
from calibre.ebooks.oeb.transforms.rasterize import SVGRasterizer, Unavailable
from calibre.ebooks.oeb.transforms.htmltoc import HTMLTOCAdder
from calibre.customize.ui import plugin_for_input_format
opts, oeb = self.opts, self.oeb
if not opts.no_inline_toc:
tocadder = HTMLTOCAdder(title=opts.toc_title, position='start' if
opts.mobi_toc_at_start else 'end')
tocadder(oeb, opts)
mangler = CaseMangler()
mangler(oeb, opts)
try:
rasterizer = SVGRasterizer()
rasterizer(oeb, opts)
except Unavailable:
self.log.warn('SVG rasterizer unavailable, SVG will not be converted')
else:
# Add rasterized SVG images
resources.add_extra_images()
if hasattr(self.oeb, 'inserted_metadata_jacket'):
self.workaround_fire_bugs(self.oeb.inserted_metadata_jacket)
mobimlizer = MobiMLizer(ignore_tables=opts.linearize_tables)
mobimlizer(oeb, opts)
write_page_breaks_after_item = input_plugin is not plugin_for_input_format('cbz')
from calibre.ebooks.mobi.writer2.main import MobiWriter
writer = MobiWriter(opts, resources, kf8,
write_page_breaks_after_item=write_page_breaks_after_item)
writer(oeb, output_path)
extract_mobi(output_path, opts)
def specialize_css_for_output(self, log, opts, item, stylizer):
from calibre.ebooks.mobi.writer8.cleanup import CSSCleanup
CSSCleanup(log, opts)(item, stylizer)
def workaround_fire_bugs(self, jacket):
# The idiotic Fire crashes when trying to render the table used to
# layout the jacket
from calibre.ebooks.oeb.base import XHTML
for table in jacket.data.xpath('//*[local-name()="table"]'):
table.tag = XHTML('div')
for tr in table.xpath('descendant::*[local-name()="tr"]'):
cols = tr.xpath('descendant::*[local-name()="td"]')
tr.tag = XHTML('div')
for td in cols:
td.tag = XHTML('span' if cols else 'div')
class AZW3Output(OutputFormatPlugin):
name = 'AZW3 Output'
author = 'Kovid Goyal'
file_type = 'azw3'
commit_name = 'azw3_output'
options = {
OptionRecommendation(name='prefer_author_sort',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('When present, use author sort field as author.')
),
OptionRecommendation(name='no_inline_toc',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Don\'t add Table of Contents to the book. Useful if '
'the book has its own table of contents.')),
OptionRecommendation(name='toc_title', recommended_value=None,
help=_('Title for any generated in-line table of contents.')
),
OptionRecommendation(name='dont_compress',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Disable compression of the file contents.')
),
OptionRecommendation(name='mobi_toc_at_start',
recommended_value=False,
help=_('When adding the Table of Contents to the book, add it at the start of the '
'book instead of the end. Not recommended.')
),
OptionRecommendation(name='extract_to',
help=_('Extract the contents of the generated %s file to the '
'specified directory. The contents of the directory are first '
'deleted, so be careful.') % 'AZW3'),
OptionRecommendation(name='share_not_sync', recommended_value=False,
help=_('Enable sharing of book content via Facebook etc. '
' on the Kindle. WARNING: Using this feature means that '
' the book will not auto sync its last read position '
' on multiple devices. Complain to Amazon.')
),
}
def convert(self, oeb, output_path, input_plugin, opts, log):
from calibre.ebooks.mobi.writer2.resources import Resources
from calibre.ebooks.mobi.writer8.main import create_kf8_book
from calibre.ebooks.mobi.writer8.cleanup import remove_duplicate_anchors
self.oeb, self.opts, self.log = oeb, opts, log
opts.mobi_periodical = self.is_periodical
passthrough = getattr(opts, 'mobi_passthrough', False)
remove_duplicate_anchors(oeb)
resources = Resources(self.oeb, self.opts, self.is_periodical,
add_fonts=True, process_images=False)
if not passthrough:
remove_html_cover(self.oeb, self.log)
# Split on pagebreaks so that the resulting KF8 is faster to load
from calibre.ebooks.oeb.transforms.split import Split
Split()(self.oeb, self.opts)
kf8 = create_kf8_book(self.oeb, self.opts, resources, for_joint=False)
kf8.write(output_path)
extract_mobi(output_path, opts)
def specialize_css_for_output(self, log, opts, item, stylizer):
from calibre.ebooks.mobi.writer8.cleanup import CSSCleanup
CSSCleanup(log, opts)(item, stylizer)

View File

@@ -0,0 +1,25 @@
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal kovid@kovidgoyal.net'
__docformat__ = 'restructuredtext en'
'''
Convert an ODT file into a Open Ebook
'''
from calibre.customize.conversion import InputFormatPlugin
class ODTInput(InputFormatPlugin):
name = 'ODT Input'
author = 'Kovid Goyal'
description = 'Convert ODT (OpenOffice) files to HTML'
file_types = {'odt'}
commit_name = 'odt_input'
def convert(self, stream, options, file_ext, log,
accelerators):
from calibre.ebooks.odt.input import Extract
return Extract()(stream, '.', log)

View File

@@ -0,0 +1,122 @@
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
import os, re
from calibre.customize.conversion import (OutputFormatPlugin,
OptionRecommendation)
from calibre import CurrentDir
class OEBOutput(OutputFormatPlugin):
name = 'OEB Output'
author = 'Kovid Goyal'
file_type = 'oeb'
commit_name = 'oeb_output'
recommendations = {('pretty_print', True, OptionRecommendation.HIGH)}
def convert(self, oeb_book, output_path, input_plugin, opts, log):
from polyglot.urllib import unquote
from lxml import etree
self.log, self.opts = log, opts
if not os.path.exists(output_path):
os.makedirs(output_path)
from calibre.ebooks.oeb.base import OPF_MIME, NCX_MIME, PAGE_MAP_MIME, OEB_STYLES
from calibre.ebooks.oeb.normalize_css import condense_sheet
with CurrentDir(output_path):
results = oeb_book.to_opf2(page_map=True)
for key in (OPF_MIME, NCX_MIME, PAGE_MAP_MIME):
href, root = results.pop(key, [None, None])
if root is not None:
if key == OPF_MIME:
try:
self.workaround_nook_cover_bug(root)
except:
self.log.exception('Something went wrong while trying to'
' workaround Nook cover bug, ignoring')
try:
self.workaround_pocketbook_cover_bug(root)
except:
self.log.exception('Something went wrong while trying to'
' workaround Pocketbook cover bug, ignoring')
self.migrate_lang_code(root)
raw = etree.tostring(root, pretty_print=True,
encoding='utf-8', xml_declaration=True)
if key == OPF_MIME:
# Needed as I can't get lxml to output opf:role and
# not output <opf:metadata> as well
raw = re.sub(br'(<[/]{0,1})opf:', br'\1', raw)
with lopen(href, 'wb') as f:
f.write(raw)
for item in oeb_book.manifest:
if (
not self.opts.expand_css and item.media_type in OEB_STYLES and hasattr(
item.data, 'cssText') and 'nook' not in self.opts.output_profile.short_name):
condense_sheet(item.data)
path = os.path.abspath(unquote(item.href))
dir = os.path.dirname(path)
if not os.path.exists(dir):
os.makedirs(dir)
with lopen(path, 'wb') as f:
f.write(item.bytes_representation)
item.unload_data_from_memory(memory=path)
def workaround_nook_cover_bug(self, root): # {{{
cov = root.xpath('//*[local-name() = "meta" and @name="cover" and'
' @content != "cover"]')
def manifest_items_with_id(id_):
return root.xpath('//*[local-name() = "manifest"]/*[local-name() = "item" '
' and @id="%s"]'%id_)
if len(cov) == 1:
cov = cov[0]
covid = cov.get('content', '')
if covid:
manifest_item = manifest_items_with_id(covid)
if len(manifest_item) == 1 and \
manifest_item[0].get('media-type',
'').startswith('image/'):
self.log.warn('The cover image has an id != "cover". Renaming'
' to work around bug in Nook Color')
from calibre.ebooks.oeb.base import uuid_id
newid = uuid_id()
for item in manifest_items_with_id('cover'):
item.set('id', newid)
for x in root.xpath('//*[@idref="cover"]'):
x.set('idref', newid)
manifest_item = manifest_item[0]
manifest_item.set('id', 'cover')
cov.set('content', 'cover')
# }}}
def workaround_pocketbook_cover_bug(self, root): # {{{
m = root.xpath('//*[local-name() = "manifest"]/*[local-name() = "item" '
' and @id="cover"]')
if len(m) == 1:
m = m[0]
p = m.getparent()
p.remove(m)
p.insert(0, m)
# }}}
def migrate_lang_code(self, root): # {{{
from calibre.utils.localization import lang_as_iso639_1
for lang in root.xpath('//*[local-name() = "language"]'):
clc = lang_as_iso639_1(lang.text)
if clc:
lang.text = clc
# }}}

View File

@@ -0,0 +1,37 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2009, John Schember <john@nachtimwald.com>'
__docformat__ = 'restructuredtext en'
from calibre.customize.conversion import InputFormatPlugin
from polyglot.builtins import getcwd
class PDBInput(InputFormatPlugin):
name = 'PDB Input'
author = 'John Schember'
description = 'Convert PDB to HTML'
file_types = {'pdb', 'updb'}
commit_name = 'pdb_input'
def convert(self, stream, options, file_ext, log,
accelerators):
from calibre.ebooks.pdb.header import PdbHeaderReader
from calibre.ebooks.pdb import PDBError, IDENTITY_TO_NAME, get_reader
header = PdbHeaderReader(stream)
Reader = get_reader(header.ident)
if Reader is None:
raise PDBError('No reader available for format within container.\n Identity is %s. Book type is %s' %
(header.ident, IDENTITY_TO_NAME.get(header.ident, _('Unknown'))))
log.debug('Detected ebook format as: %s with identity: %s' % (IDENTITY_TO_NAME[header.ident], header.ident))
reader = Reader(header, stream, log, options)
opf = reader.extract_content(getcwd())
return opf

View File

@@ -0,0 +1,64 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2009, John Schember <john@nachtimwald.com>'
__docformat__ = 'restructuredtext en'
import os
from calibre.customize.conversion import OutputFormatPlugin, \
OptionRecommendation
from calibre.ebooks.pdb import PDBError, get_writer, ALL_FORMAT_WRITERS
class PDBOutput(OutputFormatPlugin):
name = 'PDB Output'
author = 'John Schember'
file_type = 'pdb'
commit_name = 'pdb_output'
ui_data = {'formats': tuple(ALL_FORMAT_WRITERS)}
options = {
OptionRecommendation(name='format', recommended_value='doc',
level=OptionRecommendation.LOW,
short_switch='f', choices=list(ALL_FORMAT_WRITERS),
help=(_('Format to use inside the pdb container. Choices are:') + ' %s' % sorted(ALL_FORMAT_WRITERS))),
OptionRecommendation(name='pdb_output_encoding', recommended_value='cp1252',
level=OptionRecommendation.LOW,
help=_('Specify the character encoding of the output document. '
'The default is cp1252. Note: This option is not honored by all '
'formats.')),
OptionRecommendation(name='inline_toc',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Add Table of Contents to beginning of the book.')),
}
def convert(self, oeb_book, output_path, input_plugin, opts, log):
close = False
if not hasattr(output_path, 'write'):
close = True
if not os.path.exists(os.path.dirname(output_path)) and os.path.dirname(output_path):
os.makedirs(os.path.dirname(output_path))
out_stream = lopen(output_path, 'wb')
else:
out_stream = output_path
Writer = get_writer(opts.format)
if Writer is None:
raise PDBError('No writer available for format %s.' % format)
setattr(opts, 'max_line_length', 0)
setattr(opts, 'force_max_line_length', False)
writer = Writer(opts, log)
out_stream.seek(0)
out_stream.truncate()
writer.write_content(oeb_book, out_stream, oeb_book.metadata)
if close:
out_stream.close()

View File

@@ -0,0 +1,82 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2009, John Schember <john@nachtimwald.com>'
__docformat__ = 'restructuredtext en'
import os
from calibre.customize.conversion import InputFormatPlugin, OptionRecommendation
from polyglot.builtins import as_bytes, getcwd
class PDFInput(InputFormatPlugin):
name = 'PDF Input'
author = 'Kovid Goyal and John Schember'
description = 'Convert PDF files to HTML'
file_types = {'pdf'}
commit_name = 'pdf_input'
options = {
OptionRecommendation(name='no_images', recommended_value=False,
help=_('Do not extract images from the document')),
OptionRecommendation(name='unwrap_factor', recommended_value=0.45,
help=_('Scale used to determine the length at which a line should '
'be unwrapped. Valid values are a decimal between 0 and 1. The '
'default is 0.45, just below the median line length.')),
OptionRecommendation(name='new_pdf_engine', recommended_value=False,
help=_('Use the new PDF conversion engine. Currently not operational.'))
}
def convert_new(self, stream, accelerators):
from calibre.ebooks.pdf.pdftohtml import pdftohtml
from calibre.utils.cleantext import clean_ascii_chars
from calibre.ebooks.pdf.reflow import PDFDocument
pdftohtml(getcwd(), stream.name, self.opts.no_images, as_xml=True)
with lopen('index.xml', 'rb') as f:
xml = clean_ascii_chars(f.read())
PDFDocument(xml, self.opts, self.log)
return os.path.join(getcwd(), 'metadata.opf')
def convert(self, stream, options, file_ext, log,
accelerators):
from calibre.ebooks.metadata.opf2 import OPFCreator
from calibre.ebooks.pdf.pdftohtml import pdftohtml
log.debug('Converting file to html...')
# The main html file will be named index.html
self.opts, self.log = options, log
if options.new_pdf_engine:
return self.convert_new(stream, accelerators)
pdftohtml(getcwd(), stream.name, options.no_images)
from calibre.ebooks.metadata.meta import get_metadata
log.debug('Retrieving document metadata...')
mi = get_metadata(stream, 'pdf')
opf = OPFCreator(getcwd(), mi)
manifest = [('index.html', None)]
images = os.listdir(getcwd())
images.remove('index.html')
for i in images:
manifest.append((i, None))
log.debug('Generating manifest...')
opf.create_manifest(manifest)
opf.create_spine(['index.html'])
log.debug('Rendering manifest...')
with lopen('metadata.opf', 'wb') as opffile:
opf.render(opffile)
if os.path.exists('toc.ncx'):
ncxid = opf.manifest.id_for_path('toc.ncx')
if ncxid:
with lopen('metadata.opf', 'r+b') as f:
raw = f.read().replace(b'<spine', b'<spine toc="%s"' % as_bytes(ncxid))
f.seek(0)
f.write(raw)
return os.path.join(getcwd(), 'metadata.opf')

View File

@@ -0,0 +1,256 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2012, Kovid Goyal <kovid at kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
'''
Convert OEB ebook format to PDF.
'''
import glob, os
from calibre.customize.conversion import (OutputFormatPlugin,
OptionRecommendation)
from calibre.ptempfile import TemporaryDirectory
from polyglot.builtins import iteritems, unicode_type
UNITS = ('millimeter', 'centimeter', 'point', 'inch' , 'pica' , 'didot',
'cicero', 'devicepixel')
PAPER_SIZES = ('a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'b0', 'b1',
'b2', 'b3', 'b4', 'b5', 'b6', 'legal', 'letter')
class PDFOutput(OutputFormatPlugin):
name = 'PDF Output'
author = 'Kovid Goyal'
file_type = 'pdf'
commit_name = 'pdf_output'
ui_data = {'paper_sizes': PAPER_SIZES, 'units': UNITS, 'font_types': ('serif', 'sans', 'mono')}
options = {
OptionRecommendation(name='use_profile_size', recommended_value=False,
help=_('Instead of using the paper size specified in the PDF Output options,'
' use a paper size corresponding to the current output profile.'
' Useful if you want to generate a PDF for viewing on a specific device.')),
OptionRecommendation(name='unit', recommended_value='inch',
level=OptionRecommendation.LOW, short_switch='u', choices=UNITS,
help=_('The unit of measure for page sizes. Default is inch. Choices '
'are {} '
'Note: This does not override the unit for margins!').format(', '.join(UNITS))),
OptionRecommendation(name='paper_size', recommended_value='letter',
level=OptionRecommendation.LOW, choices=PAPER_SIZES,
help=_('The size of the paper. This size will be overridden when a '
'non default output profile is used. Default is letter. Choices '
'are {}').format(', '.join(PAPER_SIZES))),
OptionRecommendation(name='custom_size', recommended_value=None,
help=_('Custom size of the document. Use the form widthxheight '
'e.g. `123x321` to specify the width and height. '
'This overrides any specified paper-size.')),
OptionRecommendation(name='preserve_cover_aspect_ratio',
recommended_value=False,
help=_('Preserve the aspect ratio of the cover, instead'
' of stretching it to fill the full first page of the'
' generated pdf.')),
OptionRecommendation(name='pdf_serif_family',
recommended_value='Times', help=_(
'The font family used to render serif fonts. Will work only if the font is available system-wide.')),
OptionRecommendation(name='pdf_sans_family',
recommended_value='Helvetica', help=_(
'The font family used to render sans-serif fonts. Will work only if the font is available system-wide.')),
OptionRecommendation(name='pdf_mono_family',
recommended_value='Courier', help=_(
'The font family used to render monospace fonts. Will work only if the font is available system-wide.')),
OptionRecommendation(name='pdf_standard_font', choices=ui_data['font_types'],
recommended_value='serif', help=_(
'The font family used to render monospace fonts')),
OptionRecommendation(name='pdf_default_font_size',
recommended_value=20, help=_(
'The default font size')),
OptionRecommendation(name='pdf_mono_font_size',
recommended_value=16, help=_(
'The default font size for monospaced text')),
OptionRecommendation(name='pdf_hyphenate', recommended_value=False,
help=_('Break long words at the end of lines. This can give the text at the right margin a more even appearance.')),
OptionRecommendation(name='pdf_mark_links', recommended_value=False,
help=_('Surround all links with a red box, useful for debugging.')),
OptionRecommendation(name='pdf_page_numbers', recommended_value=False,
help=_('Add page numbers to the bottom of every page in the generated PDF file. If you '
'specify a footer template, it will take precedence '
'over this option.')),
OptionRecommendation(name='pdf_footer_template', recommended_value=None,
help=_('An HTML template used to generate %s on every page.'
' The strings _PAGENUM_, _TITLE_, _AUTHOR_ and _SECTION_ will be replaced by their current values.')%_('footers')),
OptionRecommendation(name='pdf_header_template', recommended_value=None,
help=_('An HTML template used to generate %s on every page.'
' The strings _PAGENUM_, _TITLE_, _AUTHOR_ and _SECTION_ will be replaced by their current values.')%_('headers')),
OptionRecommendation(name='pdf_add_toc', recommended_value=False,
help=_('Add a Table of Contents at the end of the PDF that lists page numbers. '
'Useful if you want to print out the PDF. If this PDF is intended for electronic use, use the PDF Outline instead.')),
OptionRecommendation(name='toc_title', recommended_value=None,
help=_('Title for generated table of contents.')
),
OptionRecommendation(name='pdf_page_margin_left', recommended_value=72.0,
level=OptionRecommendation.LOW,
help=_('The size of the left page margin, in pts. Default is 72pt.'
' Overrides the common left page margin setting.')
),
OptionRecommendation(name='pdf_page_margin_top', recommended_value=72.0,
level=OptionRecommendation.LOW,
help=_('The size of the top page margin, in pts. Default is 72pt.'
' Overrides the common top page margin setting, unless set to zero.')
),
OptionRecommendation(name='pdf_page_margin_right', recommended_value=72.0,
level=OptionRecommendation.LOW,
help=_('The size of the right page margin, in pts. Default is 72pt.'
' Overrides the common right page margin setting, unless set to zero.')
),
OptionRecommendation(name='pdf_page_margin_bottom', recommended_value=72.0,
level=OptionRecommendation.LOW,
help=_('The size of the bottom page margin, in pts. Default is 72pt.'
' Overrides the common bottom page margin setting, unless set to zero.')
),
OptionRecommendation(name='pdf_use_document_margins', recommended_value=False,
help=_('Use the page margins specified in the input document via @page CSS rules.'
' This will cause the margins specified in the conversion settings to be ignored.'
' If the document does not specify page margins, the conversion settings will be used as a fallback.')
),
OptionRecommendation(name='pdf_page_number_map', recommended_value=None,
help=_('Adjust page numbers, as needed. Syntax is a JavaScript expression for the page number.'
' For example, "if (n < 3) 0; else n - 3;", where n is current page number.')
),
OptionRecommendation(name='uncompressed_pdf',
recommended_value=False, help=_(
'Generate an uncompressed PDF, useful for debugging.')
),
OptionRecommendation(name='pdf_odd_even_offset', recommended_value=0.0,
level=OptionRecommendation.LOW,
help=_(
'Shift the text horizontally by the specified offset (in pts).'
' On odd numbered pages, it is shifted to the right and on even'
' numbered pages to the left. Use negative numbers for the opposite'
' effect. Note that this setting is ignored on pages where the margins'
' are smaller than the specified offset. Shifting is done by setting'
' the PDF CropBox, not all software respects the CropBox.'
)
),
}
def specialize_options(self, log, opts, input_fmt):
# Ensure Qt is setup to be used with WebEngine
# specialize_options is called early enough in the pipeline
# that hopefully no Qt application has been constructed as yet
from PyQt5.QtWebEngineCore import QWebEngineUrlScheme
from PyQt5.QtWebEngineWidgets import QWebEnginePage # noqa
from calibre.gui2 import must_use_qt
from calibre.constants import FAKE_PROTOCOL
scheme = QWebEngineUrlScheme(FAKE_PROTOCOL.encode('ascii'))
scheme.setSyntax(QWebEngineUrlScheme.Syntax.Host)
scheme.setFlags(QWebEngineUrlScheme.SecureScheme)
QWebEngineUrlScheme.registerScheme(scheme)
must_use_qt()
self.input_fmt = input_fmt
if opts.pdf_use_document_margins:
# Prevent the conversion pipeline from overwriting document margins
opts.margin_left = opts.margin_right = opts.margin_top = opts.margin_bottom = -1
def convert(self, oeb_book, output_path, input_plugin, opts, log):
self.stored_page_margins = getattr(opts, '_stored_page_margins', {})
self.oeb = oeb_book
self.input_plugin, self.opts, self.log = input_plugin, opts, log
self.output_path = output_path
from calibre.ebooks.oeb.base import OPF, OPF2_NS
from lxml import etree
from io import BytesIO
package = etree.Element(OPF('package'),
attrib={'version': '2.0', 'unique-identifier': 'dummy'},
nsmap={None: OPF2_NS})
from calibre.ebooks.metadata.opf2 import OPF
self.oeb.metadata.to_opf2(package)
self.metadata = OPF(BytesIO(etree.tostring(package))).to_book_metadata()
self.cover_data = None
if input_plugin.is_image_collection:
log.debug('Converting input as an image collection...')
self.convert_images(input_plugin.get_images())
else:
log.debug('Converting input as a text based book...')
self.convert_text(oeb_book)
def convert_images(self, images):
from calibre.ebooks.pdf.image_writer import convert
convert(images, self.output_path, self.opts, self.metadata, self.report_progress)
def get_cover_data(self):
oeb = self.oeb
if (oeb.metadata.cover and unicode_type(oeb.metadata.cover[0]) in oeb.manifest.ids):
cover_id = unicode_type(oeb.metadata.cover[0])
item = oeb.manifest.ids[cover_id]
self.cover_data = item.data
def process_fonts(self):
''' Make sure all fonts are embeddable '''
from calibre.ebooks.oeb.base import urlnormalize
from calibre.utils.fonts.utils import remove_embed_restriction
processed = set()
for item in list(self.oeb.manifest):
if not hasattr(item.data, 'cssRules'):
continue
for i, rule in enumerate(item.data.cssRules):
if rule.type == rule.FONT_FACE_RULE:
try:
s = rule.style
src = s.getProperty('src').propertyValue[0].uri
except:
continue
path = item.abshref(src)
ff = self.oeb.manifest.hrefs.get(urlnormalize(path), None)
if ff is None:
continue
raw = nraw = ff.data
if path not in processed:
processed.add(path)
try:
nraw = remove_embed_restriction(raw)
except:
continue
if nraw != raw:
ff.data = nraw
self.oeb.container.write(path, nraw)
def convert_text(self, oeb_book):
import json
from calibre.ebooks.pdf.html_writer import convert
self.get_cover_data()
self.process_fonts()
if self.opts.pdf_use_document_margins and self.stored_page_margins:
for href, margins in iteritems(self.stored_page_margins):
item = oeb_book.manifest.hrefs.get(href)
if item is not None:
root = item.data
if hasattr(root, 'xpath') and margins:
root.set('data-calibre-pdf-output-page-margins', json.dumps(margins))
with TemporaryDirectory('_pdf_out') as oeb_dir:
from calibre.customize.ui import plugin_for_output_format
oeb_dir = os.path.realpath(oeb_dir)
oeb_output = plugin_for_output_format('oeb')
oeb_output.convert(oeb_book, oeb_dir, self.input_plugin, self.opts, self.log)
opfpath = glob.glob(os.path.join(oeb_dir, '*.opf'))[0]
convert(
opfpath, self.opts, metadata=self.metadata, output_path=self.output_path,
log=self.log, cover_data=self.cover_data, report_progress=self.report_progress
)

View File

@@ -0,0 +1,165 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2009, John Schember <john@nachtimwald.com>'
__docformat__ = 'restructuredtext en'
import glob
import os
import shutil
from calibre.customize.conversion import InputFormatPlugin
from calibre.ptempfile import TemporaryDirectory
from polyglot.builtins import getcwd
class PMLInput(InputFormatPlugin):
name = 'PML Input'
author = 'John Schember'
description = 'Convert PML to OEB'
# pmlz is a zip file containing pml files and png images.
file_types = {'pml', 'pmlz'}
commit_name = 'pml_input'
def process_pml(self, pml_path, html_path, close_all=False):
from calibre.ebooks.pml.pmlconverter import PML_HTMLizer
pclose = False
hclose = False
if not hasattr(pml_path, 'read'):
pml_stream = lopen(pml_path, 'rb')
pclose = True
else:
pml_stream = pml_path
pml_stream.seek(0)
if not hasattr(html_path, 'write'):
html_stream = lopen(html_path, 'wb')
hclose = True
else:
html_stream = html_path
ienc = getattr(pml_stream, 'encoding', None)
if ienc is None:
ienc = 'cp1252'
if self.options.input_encoding:
ienc = self.options.input_encoding
self.log.debug('Converting PML to HTML...')
hizer = PML_HTMLizer()
html = hizer.parse_pml(pml_stream.read().decode(ienc), html_path)
html = '<html><head><title></title></head><body>%s</body></html>'%html
html_stream.write(html.encode('utf-8', 'replace'))
if pclose:
pml_stream.close()
if hclose:
html_stream.close()
return hizer.get_toc()
def get_images(self, stream, tdir, top_level=False):
images = []
imgs = []
if top_level:
imgs = glob.glob(os.path.join(tdir, '*.png'))
# Images not in top level try bookname_img directory because
# that's where Dropbook likes to see them.
if not imgs:
if hasattr(stream, 'name'):
imgs = glob.glob(os.path.join(tdir, os.path.splitext(os.path.basename(stream.name))[0] + '_img', '*.png'))
# No images in Dropbook location try generic images directory
if not imgs:
imgs = glob.glob(os.path.join(os.path.join(tdir, 'images'), '*.png'))
if imgs:
os.makedirs(os.path.join(getcwd(), 'images'))
for img in imgs:
pimg_name = os.path.basename(img)
pimg_path = os.path.join(getcwd(), 'images', pimg_name)
images.append('images/' + pimg_name)
shutil.copy(img, pimg_path)
return images
def convert(self, stream, options, file_ext, log,
accelerators):
from calibre.ebooks.metadata.toc import TOC
from calibre.ebooks.metadata.opf2 import OPFCreator
from calibre.utils.zipfile import ZipFile
self.options = options
self.log = log
pages, images = [], []
toc = TOC()
if file_ext == 'pmlz':
log.debug('De-compressing content to temporary directory...')
with TemporaryDirectory('_unpmlz') as tdir:
zf = ZipFile(stream)
zf.extractall(tdir)
pmls = glob.glob(os.path.join(tdir, '*.pml'))
for pml in pmls:
html_name = os.path.splitext(os.path.basename(pml))[0]+'.html'
html_path = os.path.join(getcwd(), html_name)
pages.append(html_name)
log.debug('Processing PML item %s...' % pml)
ttoc = self.process_pml(pml, html_path)
toc += ttoc
images = self.get_images(stream, tdir, True)
else:
toc = self.process_pml(stream, 'index.html')
pages.append('index.html')
if hasattr(stream, 'name'):
images = self.get_images(stream, os.path.abspath(os.path.dirname(stream.name)))
# We want pages to be orded alphabetically.
pages.sort()
manifest_items = []
for item in pages+images:
manifest_items.append((item, None))
from calibre.ebooks.metadata.meta import get_metadata
log.debug('Reading metadata from input file...')
mi = get_metadata(stream, 'pml')
if 'images/cover.png' in images:
mi.cover = 'images/cover.png'
opf = OPFCreator(getcwd(), mi)
log.debug('Generating manifest...')
opf.create_manifest(manifest_items)
opf.create_spine(pages)
opf.set_toc(toc)
with lopen('metadata.opf', 'wb') as opffile:
with lopen('toc.ncx', 'wb') as tocfile:
opf.render(opffile, tocfile, 'toc.ncx')
return os.path.join(getcwd(), 'metadata.opf')
def postprocess_book(self, oeb, opts, log):
from calibre.ebooks.oeb.base import XHTML, barename
for item in oeb.spine:
if hasattr(item.data, 'xpath'):
for heading in item.data.iterdescendants(*map(XHTML, 'h1 h2 h3 h4 h5 h6'.split())):
if not len(heading):
continue
span = heading[0]
if not heading.text and not span.text and not len(span) and barename(span.tag) == 'span':
if not heading.get('id') and span.get('id'):
heading.set('id', span.get('id'))
heading.text = span.tail
heading.remove(span)
if len(heading) == 1 and heading[0].get('style') == 'text-align: center; margin: auto;':
div = heading[0]
if barename(div.tag) == 'div' and not len(div) and not div.get('id') and not heading.get('style'):
heading.text = (heading.text or '') + (div.text or '') + (div.tail or '')
heading.remove(div)
heading.set('style', 'text-align: center')

View File

@@ -0,0 +1,77 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2009, John Schember <john@nachtimwald.com>'
__docformat__ = 'restructuredtext en'
import os, io
from calibre.customize.conversion import (OutputFormatPlugin,
OptionRecommendation)
from calibre.ptempfile import TemporaryDirectory
from polyglot.builtins import unicode_type
class PMLOutput(OutputFormatPlugin):
name = 'PML Output'
author = 'John Schember'
file_type = 'pmlz'
commit_name = 'pml_output'
options = {
OptionRecommendation(name='pml_output_encoding', recommended_value='cp1252',
level=OptionRecommendation.LOW,
help=_('Specify the character encoding of the output document. '
'The default is cp1252.')),
OptionRecommendation(name='inline_toc',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Add Table of Contents to beginning of the book.')),
OptionRecommendation(name='full_image_depth',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Do not reduce the size or bit depth of images. Images '
'have their size and depth reduced by default to accommodate '
'applications that can not convert images on their '
'own such as Dropbook.')),
}
def convert(self, oeb_book, output_path, input_plugin, opts, log):
from calibre.ebooks.pml.pmlml import PMLMLizer
from calibre.utils.zipfile import ZipFile
with TemporaryDirectory('_pmlz_output') as tdir:
pmlmlizer = PMLMLizer(log)
pml = unicode_type(pmlmlizer.extract_content(oeb_book, opts))
with lopen(os.path.join(tdir, 'index.pml'), 'wb') as out:
out.write(pml.encode(opts.pml_output_encoding, 'replace'))
img_path = os.path.join(tdir, 'index_img')
if not os.path.exists(img_path):
os.makedirs(img_path)
self.write_images(oeb_book.manifest, pmlmlizer.image_hrefs, img_path, opts)
log.debug('Compressing output...')
pmlz = ZipFile(output_path, 'w')
pmlz.add_dir(tdir)
def write_images(self, manifest, image_hrefs, out_dir, opts):
from PIL import Image
from calibre.ebooks.oeb.base import OEB_RASTER_IMAGES
for item in manifest:
if item.media_type in OEB_RASTER_IMAGES and item.href in image_hrefs.keys():
if opts.full_image_depth:
im = Image.open(io.BytesIO(item.data))
else:
im = Image.open(io.BytesIO(item.data)).convert('P')
im.thumbnail((300,300), Image.ANTIALIAS)
data = io.BytesIO()
im.save(data, 'PNG')
data = data.getvalue()
path = os.path.join(out_dir, image_hrefs[item.href])
with lopen(path, 'wb') as out:
out.write(data)

View File

@@ -0,0 +1,28 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2009, John Schember <john@nachtimwald.com>'
__docformat__ = 'restructuredtext en'
from calibre.customize.conversion import InputFormatPlugin
from polyglot.builtins import getcwd
class RBInput(InputFormatPlugin):
name = 'RB Input'
author = 'John Schember'
description = 'Convert RB files to HTML'
file_types = {'rb'}
commit_name = 'rb_input'
def convert(self, stream, options, file_ext, log,
accelerators):
from calibre.ebooks.rb.reader import Reader
reader = Reader(stream, log, options.input_encoding)
opf = reader.extract_content(getcwd())
return opf

View File

@@ -0,0 +1,45 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2009, John Schember <john@nachtimwald.com>'
__docformat__ = 'restructuredtext en'
import os
from calibre.customize.conversion import OutputFormatPlugin, OptionRecommendation
class RBOutput(OutputFormatPlugin):
name = 'RB Output'
author = 'John Schember'
file_type = 'rb'
commit_name = 'rb_output'
options = {
OptionRecommendation(name='inline_toc',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Add Table of Contents to beginning of the book.'))}
def convert(self, oeb_book, output_path, input_plugin, opts, log):
from calibre.ebooks.rb.writer import RBWriter
close = False
if not hasattr(output_path, 'write'):
close = True
if not os.path.exists(os.path.dirname(output_path)) and os.path.dirname(output_path):
os.makedirs(os.path.dirname(output_path))
out_stream = lopen(output_path, 'wb')
else:
out_stream = output_path
writer = RBWriter(opts, log)
out_stream.seek(0)
out_stream.truncate()
writer.write_content(oeb_book, out_stream, oeb_book.metadata)
if close:
out_stream.close()

View File

@@ -0,0 +1,169 @@
#!/usr/bin/env python2
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
import os
from calibre.customize.conversion import InputFormatPlugin, OptionRecommendation
from calibre.constants import numeric_version
from calibre import walk
from polyglot.builtins import unicode_type
class RecipeDisabled(Exception):
pass
class RecipeInput(InputFormatPlugin):
name = 'Recipe Input'
author = 'Kovid Goyal'
description = _('Download periodical content from the internet')
file_types = {'recipe', 'downloaded_recipe'}
commit_name = 'recipe_input'
recommendations = {
('chapter', None, OptionRecommendation.HIGH),
('dont_split_on_page_breaks', True, OptionRecommendation.HIGH),
('use_auto_toc', False, OptionRecommendation.HIGH),
('input_encoding', None, OptionRecommendation.HIGH),
('input_profile', 'default', OptionRecommendation.HIGH),
('page_breaks_before', None, OptionRecommendation.HIGH),
('insert_metadata', False, OptionRecommendation.HIGH),
}
options = {
OptionRecommendation(name='test', recommended_value=False,
help=_(
'Useful for recipe development. Forces'
' max_articles_per_feed to 2 and downloads at most 2 feeds.'
' You can change the number of feeds and articles by supplying optional arguments.'
' For example: --test 3 1 will download at most 3 feeds and only 1 article per feed.')),
OptionRecommendation(name='username', recommended_value=None,
help=_('Username for sites that require a login to access '
'content.')),
OptionRecommendation(name='password', recommended_value=None,
help=_('Password for sites that require a login to access '
'content.')),
OptionRecommendation(name='dont_download_recipe',
recommended_value=False,
help=_('Do not download latest version of builtin recipes from the calibre server')),
OptionRecommendation(name='lrf', recommended_value=False,
help='Optimize fetching for subsequent conversion to LRF.'),
}
def convert(self, recipe_or_file, opts, file_ext, log,
accelerators):
from calibre.web.feeds.recipes import compile_recipe
opts.output_profile.flow_size = 0
if file_ext == 'downloaded_recipe':
from calibre.utils.zipfile import ZipFile
zf = ZipFile(recipe_or_file, 'r')
zf.extractall()
zf.close()
with lopen('download.recipe', 'rb') as f:
self.recipe_source = f.read()
recipe = compile_recipe(self.recipe_source)
recipe.needs_subscription = False
self.recipe_object = recipe(opts, log, self.report_progress)
else:
if os.environ.get('CALIBRE_RECIPE_URN'):
from calibre.web.feeds.recipes.collection import get_custom_recipe, get_builtin_recipe_by_id
urn = os.environ['CALIBRE_RECIPE_URN']
log('Downloading recipe urn: ' + urn)
rtype, recipe_id = urn.partition(':')[::2]
if not recipe_id:
raise ValueError('Invalid recipe urn: ' + urn)
if rtype == 'custom':
self.recipe_source = get_custom_recipe(recipe_id)
else:
self.recipe_source = get_builtin_recipe_by_id(urn, log=log, download_recipe=True)
if not self.recipe_source:
raise ValueError('Could not find recipe with urn: ' + urn)
if not isinstance(self.recipe_source, bytes):
self.recipe_source = self.recipe_source.encode('utf-8')
recipe = compile_recipe(self.recipe_source)
elif os.access(recipe_or_file, os.R_OK):
with lopen(recipe_or_file, 'rb') as f:
self.recipe_source = f.read()
recipe = compile_recipe(self.recipe_source)
log('Using custom recipe')
else:
from calibre.web.feeds.recipes.collection import (
get_builtin_recipe_by_title, get_builtin_recipe_titles)
title = getattr(opts, 'original_recipe_input_arg', recipe_or_file)
title = os.path.basename(title).rpartition('.')[0]
titles = frozenset(get_builtin_recipe_titles())
if title not in titles:
title = getattr(opts, 'original_recipe_input_arg', recipe_or_file)
title = title.rpartition('.')[0]
raw = get_builtin_recipe_by_title(title, log=log,
download_recipe=not opts.dont_download_recipe)
builtin = False
try:
recipe = compile_recipe(raw)
self.recipe_source = raw
if recipe.requires_version > numeric_version:
log.warn(
'Downloaded recipe needs calibre version at least: %s' %
('.'.join(recipe.requires_version)))
builtin = True
except:
log.exception('Failed to compile downloaded recipe. Falling '
'back to builtin one')
builtin = True
if builtin:
log('Using bundled builtin recipe')
raw = get_builtin_recipe_by_title(title, log=log,
download_recipe=False)
if raw is None:
raise ValueError('Failed to find builtin recipe: '+title)
recipe = compile_recipe(raw)
self.recipe_source = raw
else:
log('Using downloaded builtin recipe')
if recipe is None:
raise ValueError('%r is not a valid recipe file or builtin recipe' %
recipe_or_file)
disabled = getattr(recipe, 'recipe_disabled', None)
if disabled is not None:
raise RecipeDisabled(disabled)
ro = recipe(opts, log, self.report_progress)
ro.download()
self.recipe_object = ro
for key, val in self.recipe_object.conversion_options.items():
setattr(opts, key, val)
for f in os.listdir('.'):
if f.endswith('.opf'):
return os.path.abspath(f)
for f in walk('.'):
if f.endswith('.opf'):
return os.path.abspath(f)
def postprocess_book(self, oeb, opts, log):
if self.recipe_object is not None:
self.recipe_object.internal_postprocess_book(oeb, opts, log)
self.recipe_object.postprocess_book(oeb, opts, log)
def specialize(self, oeb, opts, log, output_fmt):
if opts.no_inline_navbars:
from calibre.ebooks.oeb.base import XPath
for item in oeb.spine:
for div in XPath('//h:div[contains(@class, "calibre_navbar")]')(item.data):
div.getparent().remove(div)
def save_download(self, zf):
raw = self.recipe_source
if isinstance(raw, unicode_type):
raw = raw.encode('utf-8')
zf.writestr('download.recipe', raw)

View File

@@ -0,0 +1,323 @@
from __future__ import with_statement, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
import os, glob, re, textwrap
from calibre.customize.conversion import InputFormatPlugin, OptionRecommendation
from polyglot.builtins import iteritems, filter, getcwd, as_bytes
border_style_map = {
'single' : 'solid',
'double-thickness-border' : 'double',
'shadowed-border': 'outset',
'double-border': 'double',
'dotted-border': 'dotted',
'dashed': 'dashed',
'hairline': 'solid',
'inset': 'inset',
'dash-small': 'dashed',
'dot-dash': 'dotted',
'dot-dot-dash': 'dotted',
'outset': 'outset',
'tripple': 'double',
'triple': 'double',
'thick-thin-small': 'solid',
'thin-thick-small': 'solid',
'thin-thick-thin-small': 'solid',
'thick-thin-medium': 'solid',
'thin-thick-medium': 'solid',
'thin-thick-thin-medium': 'solid',
'thick-thin-large': 'solid',
'thin-thick-thin-large': 'solid',
'wavy': 'ridge',
'double-wavy': 'ridge',
'striped': 'ridge',
'emboss': 'inset',
'engrave': 'inset',
'frame': 'ridge',
}
class RTFInput(InputFormatPlugin):
name = 'RTF Input'
author = 'Kovid Goyal'
description = 'Convert RTF files to HTML'
file_types = {'rtf'}
commit_name = 'rtf_input'
options = {
OptionRecommendation(name='ignore_wmf', recommended_value=False,
help=_('Ignore WMF images instead of replacing them with a placeholder image.')),
}
def generate_xml(self, stream):
from calibre.ebooks.rtf2xml.ParseRtf import ParseRtf
ofile = u'dataxml.xml'
run_lev, debug_dir, indent_out = 1, None, 0
if getattr(self.opts, 'debug_pipeline', None) is not None:
try:
os.mkdir(u'rtfdebug')
debug_dir = u'rtfdebug'
run_lev = 4
indent_out = 1
self.log('Running RTFParser in debug mode')
except:
self.log.warn('Impossible to run RTFParser in debug mode')
parser = ParseRtf(
in_file=stream,
out_file=ofile,
# Convert symbol fonts to unicode equivalents. Default
# is 1
convert_symbol=1,
# Convert Zapf fonts to unicode equivalents. Default
# is 1.
convert_zapf=1,
# Convert Wingding fonts to unicode equivalents.
# Default is 1.
convert_wingdings=1,
# Convert RTF caps to real caps.
# Default is 1.
convert_caps=1,
# Indent resulting XML.
# Default is 0 (no indent).
indent=indent_out,
# Form lists from RTF. Default is 1.
form_lists=1,
# Convert headings to sections. Default is 0.
headings_to_sections=1,
# Group paragraphs with the same style name. Default is 1.
group_styles=1,
# Group borders. Default is 1.
group_borders=1,
# Write or do not write paragraphs. Default is 0.
empty_paragraphs=1,
# Debug
deb_dir=debug_dir,
# Default encoding
default_encoding=getattr(self.opts, 'input_encoding', 'cp1252') or 'cp1252',
# Run level
run_level=run_lev,
)
parser.parse_rtf()
with open(ofile, 'rb') as f:
return f.read()
def extract_images(self, picts):
from calibre.utils.imghdr import what
from binascii import unhexlify
self.log('Extracting images...')
with open(picts, 'rb') as f:
raw = f.read()
picts = filter(len, re.findall(br'\{\\pict([^}]+)\}', raw))
hex_pat = re.compile(br'[^a-fA-F0-9]')
encs = [hex_pat.sub(b'', pict) for pict in picts]
count = 0
imap = {}
for enc in encs:
if len(enc) % 2 == 1:
enc = enc[:-1]
data = unhexlify(enc)
fmt = what(None, data)
if fmt is None:
fmt = 'wmf'
count += 1
name = u'%04d.%s' % (count, fmt)
with open(name, 'wb') as f:
f.write(data)
imap[count] = name
# with open(name+'.hex', 'wb') as f:
# f.write(enc)
return self.convert_images(imap)
def convert_images(self, imap):
self.default_img = None
for count, val in iteritems(imap):
try:
imap[count] = self.convert_image(val)
except:
self.log.exception('Failed to convert', val)
return imap
def convert_image(self, name):
if not name.endswith('.wmf'):
return name
try:
return self.rasterize_wmf(name)
except Exception:
self.log.exception('Failed to convert WMF image %r'%name)
return self.replace_wmf(name)
def replace_wmf(self, name):
if self.opts.ignore_wmf:
os.remove(name)
return '__REMOVE_ME__'
from calibre.ebooks.covers import message_image
if self.default_img is None:
self.default_img = message_image('Conversion of WMF images is not supported.'
' Use Microsoft Word or OpenOffice to save this RTF file'
' as HTML and convert that in calibre.')
name = name.replace('.wmf', '.jpg')
with lopen(name, 'wb') as f:
f.write(self.default_img)
return name
def rasterize_wmf(self, name):
from calibre.utils.wmf.parse import wmf_unwrap
with open(name, 'rb') as f:
data = f.read()
data = wmf_unwrap(data)
name = name.replace('.wmf', '.png')
with open(name, 'wb') as f:
f.write(data)
return name
def write_inline_css(self, ic, border_styles):
font_size_classes = ['span.fs%d { font-size: %spt }'%(i, x) for i, x in
enumerate(ic.font_sizes)]
color_classes = ['span.col%d { color: %s }'%(i, x) for i, x in
enumerate(ic.colors) if x != 'false']
css = textwrap.dedent('''
span.none {
text-decoration: none; font-weight: normal;
font-style: normal; font-variant: normal
}
span.italics { font-style: italic }
span.bold { font-weight: bold }
span.small-caps { font-variant: small-caps }
span.underlined { text-decoration: underline }
span.strike-through { text-decoration: line-through }
''')
css += '\n'+'\n'.join(font_size_classes)
css += '\n' +'\n'.join(color_classes)
for cls, val in iteritems(border_styles):
css += '\n\n.%s {\n%s\n}'%(cls, val)
with open(u'styles.css', 'ab') as f:
f.write(css.encode('utf-8'))
def convert_borders(self, doc):
border_styles = []
style_map = {}
for elem in doc.xpath(r'//*[local-name()="cell"]'):
style = ['border-style: hidden', 'border-width: 1px',
'border-color: black']
for x in ('bottom', 'top', 'left', 'right'):
bs = elem.get('border-cell-%s-style'%x, None)
if bs:
cbs = border_style_map.get(bs, 'solid')
style.append('border-%s-style: %s'%(x, cbs))
bw = elem.get('border-cell-%s-line-width'%x, None)
if bw:
style.append('border-%s-width: %spt'%(x, bw))
bc = elem.get('border-cell-%s-color'%x, None)
if bc:
style.append('border-%s-color: %s'%(x, bc))
style = ';\n'.join(style)
if style not in border_styles:
border_styles.append(style)
idx = border_styles.index(style)
cls = 'border_style%d'%idx
style_map[cls] = style
elem.set('class', cls)
return style_map
def convert(self, stream, options, file_ext, log,
accelerators):
from lxml import etree
from calibre.ebooks.metadata.meta import get_metadata
from calibre.ebooks.metadata.opf2 import OPFCreator
from calibre.ebooks.rtf2xml.ParseRtf import RtfInvalidCodeException
from calibre.ebooks.rtf.input import InlineClass
from calibre.utils.xml_parse import safe_xml_fromstring
self.opts = options
self.log = log
self.log('Converting RTF to XML...')
try:
xml = self.generate_xml(stream.name)
except RtfInvalidCodeException as e:
self.log.exception('Unable to parse RTF')
raise ValueError(_('This RTF file has a feature calibre does not '
'support. Convert it to HTML first and then try it.\n%s')%e)
d = glob.glob(os.path.join('*_rtf_pict_dir', 'picts.rtf'))
if d:
imap = {}
try:
imap = self.extract_images(d[0])
except:
self.log.exception('Failed to extract images...')
self.log('Parsing XML...')
doc = safe_xml_fromstring(xml)
border_styles = self.convert_borders(doc)
for pict in doc.xpath('//rtf:pict[@num]',
namespaces={'rtf':'http://rtf2xml.sourceforge.net/'}):
num = int(pict.get('num'))
name = imap.get(num, None)
if name is not None:
pict.set('num', name)
self.log('Converting XML to HTML...')
inline_class = InlineClass(self.log)
styledoc = safe_xml_fromstring(P('templates/rtf.xsl', data=True), recover=False)
extensions = {('calibre', 'inline-class') : inline_class}
transform = etree.XSLT(styledoc, extensions=extensions)
result = transform(doc)
html = u'index.xhtml'
with open(html, 'wb') as f:
res = as_bytes(transform.tostring(result))
# res = res[:100].replace('xmlns:html', 'xmlns') + res[100:]
# clean multiple \n
res = re.sub(b'\n+', b'\n', res)
# Replace newlines inserted by the 'empty_paragraphs' option in rtf2xml with html blank lines
# res = re.sub('\s*<body>', '<body>', res)
# res = re.sub('(?<=\n)\n{2}',
# u'<p>\u00a0</p>\n'.encode('utf-8'), res)
f.write(res)
self.write_inline_css(inline_class, border_styles)
stream.seek(0)
mi = get_metadata(stream, 'rtf')
if not mi.title:
mi.title = _('Unknown')
if not mi.authors:
mi.authors = [_('Unknown')]
opf = OPFCreator(getcwd(), mi)
opf.create_manifest([(u'index.xhtml', None)])
opf.create_spine([u'index.xhtml'])
opf.render(open(u'metadata.opf', 'wb'))
return os.path.abspath(u'metadata.opf')
def postprocess_book(self, oeb, opts, log):
for item in oeb.spine:
for img in item.data.xpath('//*[local-name()="img" and @src="__REMOVE_ME__"]'):
p = img.getparent()
idx = p.index(img)
p.remove(img)
if img.tail:
if idx == 0:
p.text = (p.text or '') + img.tail
else:
p[idx-1].tail = (p[idx-1].tail or '') + img.tail

View File

@@ -0,0 +1,40 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2009, John Schember <john@nachtimwald.com>'
__docformat__ = 'restructuredtext en'
import os
from calibre.customize.conversion import OutputFormatPlugin
class RTFOutput(OutputFormatPlugin):
name = 'RTF Output'
author = 'John Schember'
file_type = 'rtf'
commit_name = 'rtf_output'
def convert(self, oeb_book, output_path, input_plugin, opts, log):
from calibre.ebooks.rtf.rtfml import RTFMLizer
rtfmlitzer = RTFMLizer(log)
content = rtfmlitzer.extract_content(oeb_book, opts)
close = False
if not hasattr(output_path, 'write'):
close = True
if not os.path.exists(os.path.dirname(output_path)) and os.path.dirname(output_path) != '':
os.makedirs(os.path.dirname(output_path))
out_stream = lopen(output_path, 'wb')
else:
out_stream = output_path
out_stream.seek(0)
out_stream.truncate()
out_stream.write(content.encode('ascii', 'replace'))
if close:
out_stream.close()

View File

@@ -0,0 +1,122 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2010, Li Fanxi <lifanxi@freemindworld.com>'
__docformat__ = 'restructuredtext en'
import os
from calibre.customize.conversion import InputFormatPlugin
from calibre.ptempfile import TemporaryDirectory
from calibre.utils.filenames import ascii_filename
from polyglot.builtins import unicode_type
HTML_TEMPLATE = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><title>%s</title></head><body>\n%s\n</body></html>'
def html_encode(s):
return s.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&apos;').replace('\n', '<br/>').replace(' ', '&nbsp;') # noqa
class SNBInput(InputFormatPlugin):
name = 'SNB Input'
author = 'Li Fanxi'
description = 'Convert SNB files to OEB'
file_types = {'snb'}
commit_name = 'snb_input'
options = set()
def convert(self, stream, options, file_ext, log,
accelerators):
import uuid
from calibre.ebooks.oeb.base import DirContainer
from calibre.ebooks.snb.snbfile import SNBFile
from calibre.utils.xml_parse import safe_xml_fromstring
log.debug("Parsing SNB file...")
snbFile = SNBFile()
try:
snbFile.Parse(stream)
except:
raise ValueError("Invalid SNB file")
if not snbFile.IsValid():
log.debug("Invalid SNB file")
raise ValueError("Invalid SNB file")
log.debug("Handle meta data ...")
from calibre.ebooks.conversion.plumber import create_oebbook
oeb = create_oebbook(log, None, options,
encoding=options.input_encoding, populate=False)
meta = snbFile.GetFileStream('snbf/book.snbf')
if meta is not None:
meta = safe_xml_fromstring(meta)
l = {'title' : './/head/name',
'creator' : './/head/author',
'language' : './/head/language',
'generator': './/head/generator',
'publisher': './/head/publisher',
'cover' : './/head/cover', }
d = {}
for item in l:
node = meta.find(l[item])
if node is not None:
d[item] = node.text if node.text is not None else ''
else:
d[item] = ''
oeb.metadata.add('title', d['title'])
oeb.metadata.add('creator', d['creator'], attrib={'role':'aut'})
oeb.metadata.add('language', d['language'].lower().replace('_', '-'))
oeb.metadata.add('generator', d['generator'])
oeb.metadata.add('publisher', d['publisher'])
if d['cover'] != '':
oeb.guide.add('cover', 'Cover', d['cover'])
bookid = unicode_type(uuid.uuid4())
oeb.metadata.add('identifier', bookid, id='uuid_id', scheme='uuid')
for ident in oeb.metadata.identifier:
if 'id' in ident.attrib:
oeb.uid = oeb.metadata.identifier[0]
break
with TemporaryDirectory('_snb2oeb', keep=True) as tdir:
log.debug('Process TOC ...')
toc = snbFile.GetFileStream('snbf/toc.snbf')
oeb.container = DirContainer(tdir, log)
if toc is not None:
toc = safe_xml_fromstring(toc)
i = 1
for ch in toc.find('.//body'):
chapterName = ch.text
chapterSrc = ch.get('src')
fname = 'ch_%d.htm' % i
data = snbFile.GetFileStream('snbc/' + chapterSrc)
if data is None:
continue
snbc = safe_xml_fromstring(data)
lines = []
for line in snbc.find('.//body'):
if line.tag == 'text':
lines.append('<p>%s</p>' % html_encode(line.text))
elif line.tag == 'img':
lines.append('<p><img src="%s" /></p>' % html_encode(line.text))
with open(os.path.join(tdir, fname), 'wb') as f:
f.write((HTML_TEMPLATE % (chapterName, '\n'.join(lines))).encode('utf-8', 'replace'))
oeb.toc.add(ch.text, fname)
id, href = oeb.manifest.generate(id='html',
href=ascii_filename(fname))
item = oeb.manifest.add(id, href, 'text/html')
item.html_input_href = fname
oeb.spine.add(item, True)
i = i + 1
imageFiles = snbFile.OutputImageFiles(tdir)
for f, m in imageFiles:
id, href = oeb.manifest.generate(id='image',
href=ascii_filename(f))
item = oeb.manifest.add(id, href, m)
item.html_input_href = f
return oeb

View File

@@ -0,0 +1,269 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2010, Li Fanxi <lifanxi@freemindworld.com>'
__docformat__ = 'restructuredtext en'
import os
from calibre.customize.conversion import OutputFormatPlugin, OptionRecommendation
from calibre.ptempfile import TemporaryDirectory
from calibre.constants import __appname__, __version__
from polyglot.builtins import unicode_type
class SNBOutput(OutputFormatPlugin):
name = 'SNB Output'
author = 'Li Fanxi'
file_type = 'snb'
commit_name = 'snb_output'
options = {
OptionRecommendation(name='snb_output_encoding', recommended_value='utf-8',
level=OptionRecommendation.LOW,
help=_('Specify the character encoding of the output document. '
'The default is utf-8.')),
OptionRecommendation(name='snb_max_line_length',
recommended_value=0, level=OptionRecommendation.LOW,
help=_('The maximum number of characters per line. This splits on '
'the first space before the specified value. If no space is found '
'the line will be broken at the space after and will exceed the '
'specified value. Also, there is a minimum of 25 characters. '
'Use 0 to disable line splitting.')),
OptionRecommendation(name='snb_insert_empty_line',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Specify whether or not to insert an empty line between '
'two paragraphs.')),
OptionRecommendation(name='snb_dont_indent_first_line',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Specify whether or not to insert two space characters '
'to indent the first line of each paragraph.')),
OptionRecommendation(name='snb_hide_chapter_name',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Specify whether or not to hide the chapter title for each '
'chapter. Useful for image-only output (eg. comics).')),
OptionRecommendation(name='snb_full_screen',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Resize all the images for full screen view. ')),
}
def convert(self, oeb_book, output_path, input_plugin, opts, log):
from lxml import etree
from calibre.ebooks.snb.snbfile import SNBFile
from calibre.ebooks.snb.snbml import SNBMLizer, ProcessFileName
self.opts = opts
from calibre.ebooks.oeb.transforms.rasterize import SVGRasterizer, Unavailable
try:
rasterizer = SVGRasterizer()
rasterizer(oeb_book, opts)
except Unavailable:
log.warn('SVG rasterizer unavailable, SVG will not be converted')
# Create temp dir
with TemporaryDirectory('_snb_output') as tdir:
# Create stub directories
snbfDir = os.path.join(tdir, 'snbf')
snbcDir = os.path.join(tdir, 'snbc')
snbiDir = os.path.join(tdir, 'snbc/images')
os.mkdir(snbfDir)
os.mkdir(snbcDir)
os.mkdir(snbiDir)
# Process Meta data
meta = oeb_book.metadata
if meta.title:
title = unicode_type(meta.title[0])
else:
title = ''
authors = [unicode_type(x) for x in meta.creator if x.role == 'aut']
if meta.publisher:
publishers = unicode_type(meta.publisher[0])
else:
publishers = ''
if meta.language:
lang = unicode_type(meta.language[0]).upper()
else:
lang = ''
if meta.description:
abstract = unicode_type(meta.description[0])
else:
abstract = ''
# Process Cover
g, m, s = oeb_book.guide, oeb_book.manifest, oeb_book.spine
href = None
if 'titlepage' not in g:
if 'cover' in g:
href = g['cover'].href
# Output book info file
bookInfoTree = etree.Element("book-snbf", version="1.0")
headTree = etree.SubElement(bookInfoTree, "head")
etree.SubElement(headTree, "name").text = title
etree.SubElement(headTree, "author").text = ' '.join(authors)
etree.SubElement(headTree, "language").text = lang
etree.SubElement(headTree, "rights")
etree.SubElement(headTree, "publisher").text = publishers
etree.SubElement(headTree, "generator").text = __appname__ + ' ' + __version__
etree.SubElement(headTree, "created")
etree.SubElement(headTree, "abstract").text = abstract
if href is not None:
etree.SubElement(headTree, "cover").text = ProcessFileName(href)
else:
etree.SubElement(headTree, "cover")
with open(os.path.join(snbfDir, 'book.snbf'), 'wb') as f:
f.write(etree.tostring(bookInfoTree, pretty_print=True, encoding='utf-8'))
# Output TOC
tocInfoTree = etree.Element("toc-snbf")
tocHead = etree.SubElement(tocInfoTree, "head")
tocBody = etree.SubElement(tocInfoTree, "body")
outputFiles = {}
if oeb_book.toc.count() == 0:
log.warn('This SNB file has no Table of Contents. '
'Creating a default TOC')
first = next(iter(oeb_book.spine))
oeb_book.toc.add(_('Start page'), first.href)
else:
first = next(iter(oeb_book.spine))
if oeb_book.toc[0].href != first.href:
# The pages before the fist item in toc will be stored as
# "Cover Pages".
# oeb_book.toc does not support "insert", so we generate
# the tocInfoTree directly instead of modifying the toc
ch = etree.SubElement(tocBody, "chapter")
ch.set("src", ProcessFileName(first.href) + ".snbc")
ch.text = _('Cover pages')
outputFiles[first.href] = []
outputFiles[first.href].append(("", _("Cover pages")))
for tocitem in oeb_book.toc:
if tocitem.href.find('#') != -1:
item = tocitem.href.split('#')
if len(item) != 2:
log.error('Error in TOC item: %s' % tocitem)
else:
if item[0] in outputFiles:
outputFiles[item[0]].append((item[1], tocitem.title))
else:
outputFiles[item[0]] = []
if "" not in outputFiles[item[0]]:
outputFiles[item[0]].append(("", tocitem.title + _(" (Preface)")))
ch = etree.SubElement(tocBody, "chapter")
ch.set("src", ProcessFileName(item[0]) + ".snbc")
ch.text = tocitem.title + _(" (Preface)")
outputFiles[item[0]].append((item[1], tocitem.title))
else:
if tocitem.href in outputFiles:
outputFiles[tocitem.href].append(("", tocitem.title))
else:
outputFiles[tocitem.href] = []
outputFiles[tocitem.href].append(("", tocitem.title))
ch = etree.SubElement(tocBody, "chapter")
ch.set("src", ProcessFileName(tocitem.href) + ".snbc")
ch.text = tocitem.title
etree.SubElement(tocHead, "chapters").text = '%d' % len(tocBody)
with open(os.path.join(snbfDir, 'toc.snbf'), 'wb') as f:
f.write(etree.tostring(tocInfoTree, pretty_print=True, encoding='utf-8'))
# Output Files
oldTree = None
mergeLast = False
lastName = None
for item in s:
from calibre.ebooks.oeb.base import OEB_DOCS, OEB_IMAGES
if m.hrefs[item.href].media_type in OEB_DOCS:
if item.href not in outputFiles:
log.debug('File %s is unused in TOC. Continue in last chapter' % item.href)
mergeLast = True
else:
if oldTree is not None and mergeLast:
log.debug('Output the modified chapter again: %s' % lastName)
with open(os.path.join(snbcDir, lastName), 'wb') as f:
f.write(etree.tostring(oldTree, pretty_print=True, encoding='utf-8'))
mergeLast = False
log.debug('Converting %s to snbc...' % item.href)
snbwriter = SNBMLizer(log)
snbcTrees = None
if not mergeLast:
snbcTrees = snbwriter.extract_content(oeb_book, item, outputFiles[item.href], opts)
for subName in snbcTrees:
postfix = ''
if subName != '':
postfix = '_' + subName
lastName = ProcessFileName(item.href + postfix + ".snbc")
oldTree = snbcTrees[subName]
with open(os.path.join(snbcDir, lastName), 'wb') as f:
f.write(etree.tostring(oldTree, pretty_print=True, encoding='utf-8'))
else:
log.debug('Merge %s with last TOC item...' % item.href)
snbwriter.merge_content(oldTree, oeb_book, item, [('', _("Start"))], opts)
# Output the last one if needed
log.debug('Output the last modified chapter again: %s' % lastName)
if oldTree is not None and mergeLast:
with open(os.path.join(snbcDir, lastName), 'wb') as f:
f.write(etree.tostring(oldTree, pretty_print=True, encoding='utf-8'))
mergeLast = False
for item in m:
if m.hrefs[item.href].media_type in OEB_IMAGES:
log.debug('Converting image: %s ...' % item.href)
content = m.hrefs[item.href].data
# Convert & Resize image
self.HandleImage(content, os.path.join(snbiDir, ProcessFileName(item.href)))
# Package as SNB File
snbFile = SNBFile()
snbFile.FromDir(tdir)
snbFile.Output(output_path)
def HandleImage(self, imageData, imagePath):
from calibre.utils.img import image_from_data, resize_image, image_to_data
img = image_from_data(imageData)
x, y = img.width(), img.height()
if self.opts:
if self.opts.snb_full_screen:
SCREEN_X, SCREEN_Y = self.opts.output_profile.screen_size
else:
SCREEN_X, SCREEN_Y = self.opts.output_profile.comic_screen_size
else:
SCREEN_X = 540
SCREEN_Y = 700
# Handle big image only
if x > SCREEN_X or y > SCREEN_Y:
xScale = float(x) / SCREEN_X
yScale = float(y) / SCREEN_Y
scale = max(xScale, yScale)
# TODO : intelligent image rotation
# img = img.rotate(90)
# x,y = y,x
img = resize_image(img, x // scale, y // scale)
with lopen(imagePath, 'wb') as f:
f.write(image_to_data(img, fmt=imagePath.rpartition('.')[-1]))
if __name__ == '__main__':
from calibre.ebooks.oeb.reader import OEBReader
from calibre.ebooks.oeb.base import OEBBook
from calibre.ebooks.conversion.preprocess import HTMLPreProcessor
from calibre.customize.profiles import HanlinV3Output
class OptionValues(object):
pass
opts = OptionValues()
opts.output_profile = HanlinV3Output(None)
html_preprocessor = HTMLPreProcessor(None, None, opts)
from calibre.utils.logging import default_log
oeb = OEBBook(default_log, html_preprocessor)
reader = OEBReader
reader()(oeb, '/tmp/bbb/processed/')
SNBOutput(None).convert(oeb, '/tmp/test.snb', None, None, default_log)

View File

@@ -0,0 +1,39 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2009, John Schember <john@nachtimwald.com>'
__docformat__ = 'restructuredtext en'
from io import BytesIO
from calibre.customize.conversion import InputFormatPlugin
class TCRInput(InputFormatPlugin):
name = 'TCR Input'
author = 'John Schember'
description = 'Convert TCR files to HTML'
file_types = {'tcr'}
commit_name = 'tcr_input'
def convert(self, stream, options, file_ext, log, accelerators):
from calibre.ebooks.compression.tcr import decompress
log.info('Decompressing text...')
raw_txt = decompress(stream)
log.info('Converting text to OEB...')
stream = BytesIO(raw_txt)
from calibre.customize.ui import plugin_for_input_format
txt_plugin = plugin_for_input_format('txt')
for opt in txt_plugin.options:
if not hasattr(self.options, opt.option.name):
setattr(options, opt.option.name, opt.recommended_value)
stream.seek(0)
return txt_plugin.convert(stream, options,
'txt', log, accelerators)

View File

@@ -0,0 +1,56 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2009, John Schember <john@nachtimwald.com>'
__docformat__ = 'restructuredtext en'
import os
from calibre.customize.conversion import OutputFormatPlugin, \
OptionRecommendation
class TCROutput(OutputFormatPlugin):
name = 'TCR Output'
author = 'John Schember'
file_type = 'tcr'
commit_name = 'tcr_output'
options = {
OptionRecommendation(name='tcr_output_encoding', recommended_value='utf-8',
level=OptionRecommendation.LOW,
help=_('Specify the character encoding of the output document. '
'The default is utf-8.'))}
def convert(self, oeb_book, output_path, input_plugin, opts, log):
from calibre.ebooks.txt.txtml import TXTMLizer
from calibre.ebooks.compression.tcr import compress
close = False
if not hasattr(output_path, 'write'):
close = True
if not os.path.exists(os.path.dirname(output_path)) and os.path.dirname(output_path):
os.makedirs(os.path.dirname(output_path))
out_stream = lopen(output_path, 'wb')
else:
out_stream = output_path
setattr(opts, 'flush_paras', False)
setattr(opts, 'max_line_length', 0)
setattr(opts, 'force_max_line_length', False)
setattr(opts, 'indent_paras', False)
writer = TXTMLizer(log)
txt = writer.extract_content(oeb_book, opts).encode(opts.tcr_output_encoding, 'replace')
log.info('Compressing text...')
txt = compress(txt)
out_stream.seek(0)
out_stream.truncate()
out_stream.write(txt)
if close:
out_stream.close()

View File

@@ -0,0 +1,308 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2009, John Schember <john@nachtimwald.com>'
__docformat__ = 'restructuredtext en'
import os
from calibre import _ent_pat, walk, xml_entity_to_unicode
from calibre.customize.conversion import InputFormatPlugin, OptionRecommendation
from polyglot.builtins import getcwd
MD_EXTENSIONS = {
'abbr': _('Abbreviations'),
'admonition': _('Support admonitions'),
'attr_list': _('Add attribute to HTML tags'),
'codehilite': _('Add code highlighting via Pygments'),
'def_list': _('Definition lists'),
'extra': _('Enables various common extensions'),
'fenced_code': _('Alternative code block syntax'),
'footnotes': _('Footnotes'),
'legacy_attrs': _('Use legacy element attributes'),
'legacy_em': _('Use legacy underscore handling for connected words'),
'meta': _('Metadata in the document'),
'nl2br': _('Treat newlines as hard breaks'),
'sane_lists': _('Do not allow mixing list types'),
'smarty': _('Use markdown\'s internal smartypants parser'),
'tables': _('Support tables'),
'toc': _('Generate a table of contents'),
'wikilinks': _('Wiki style links'),
}
class TXTInput(InputFormatPlugin):
name = 'TXT Input'
author = 'John Schember'
description = 'Convert TXT files to HTML'
file_types = {'txt', 'txtz', 'text', 'md', 'textile', 'markdown'}
commit_name = 'txt_input'
ui_data = {
'md_extensions': MD_EXTENSIONS,
'paragraph_types': {
'auto': _('Try to auto detect paragraph type'),
'block': _('Treat a blank line as a paragraph break'),
'single': _('Assume every line is a paragraph'),
'print': _('Assume every line starting with 2+ spaces or a tab starts a paragraph'),
'unformatted': _('Most lines have hard line breaks, few/no blank lines or indents'),
'off': _('Don\'t modify the paragraph structure'),
},
'formatting_types': {
'auto': _('Automatically decide which formatting processor to use'),
'plain': _('No formatting'),
'heuristic': _('Use heuristics to determine chapter headings, italics, etc.'),
'textile': _('Use the TexTile markup language'),
'markdown': _('Use the Markdown markup language')
},
}
options = {
OptionRecommendation(name='formatting_type', recommended_value='auto',
choices=list(ui_data['formatting_types']),
help=_('Formatting used within the document.\n'
'* auto: {auto}\n'
'* plain: {plain}\n'
'* heuristic: {heuristic}\n'
'* textile: {textile}\n'
'* markdown: {markdown}\n'
'To learn more about markdown see {url}').format(
url='https://daringfireball.net/projects/markdown/', **ui_data['formatting_types'])
),
OptionRecommendation(name='paragraph_type', recommended_value='auto',
choices=list(ui_data['paragraph_types']),
help=_('Paragraph structure to assume. The value of "off" is useful for formatted documents such as Markdown or Textile. '
'Choices are:\n'
'* auto: {auto}\n'
'* block: {block}\n'
'* single: {single}\n'
'* print: {print}\n'
'* unformatted: {unformatted}\n'
'* off: {off}').format(**ui_data['paragraph_types'])
),
OptionRecommendation(name='preserve_spaces', recommended_value=False,
help=_('Normally extra spaces are condensed into a single space. '
'With this option all spaces will be displayed.')),
OptionRecommendation(name='txt_in_remove_indents', recommended_value=False,
help=_('Normally extra space at the beginning of lines is retained. '
'With this option they will be removed.')),
OptionRecommendation(name="markdown_extensions", recommended_value='footnotes, tables, toc',
help=_('Enable extensions to markdown syntax. Extensions are formatting that is not part '
'of the standard markdown format. The extensions enabled by default: %default.\n'
'To learn more about markdown extensions, see {}\n'
'This should be a comma separated list of extensions to enable:\n'
).format('https://python-markdown.github.io/extensions/') + '\n'.join('* %s: %s' % (k, MD_EXTENSIONS[k]) for k in sorted(MD_EXTENSIONS))),
}
def shift_file(self, fname, data):
name, ext = os.path.splitext(fname)
candidate = os.path.join(self.output_dir, fname)
c = 0
while os.path.exists(candidate):
c += 1
candidate = os.path.join(self.output_dir, '{}-{}{}'.format(name, c, ext))
ans = candidate
with open(ans, 'wb') as f:
f.write(data)
return f.name
def fix_resources(self, html, base_dir):
from html5_parser import parse
root = parse(html)
changed = False
for img in root.xpath('//img[@src]'):
src = img.get('src')
prefix = src.split(':', 1)[0].lower()
if prefix not in ('file', 'http', 'https', 'ftp') and not os.path.isabs(src):
src = os.path.join(base_dir, src)
if os.access(src, os.R_OK):
with open(src, 'rb') as f:
data = f.read()
f = self.shift_file(os.path.basename(src), data)
changed = True
img.set('src', os.path.basename(f))
if changed:
from lxml import etree
html = etree.tostring(root, encoding='unicode')
return html
def convert(self, stream, options, file_ext, log,
accelerators):
from calibre.ebooks.conversion.preprocess import DocAnalysis, Dehyphenator
from calibre.ebooks.chardet import detect
from calibre.utils.zipfile import ZipFile
from calibre.ebooks.txt.processor import (convert_basic,
convert_markdown_with_metadata, separate_paragraphs_single_line,
separate_paragraphs_print_formatted, preserve_spaces,
detect_paragraph_type, detect_formatting_type,
normalize_line_endings, convert_textile, remove_indents,
block_to_single_line, separate_hard_scene_breaks)
self.log = log
txt = b''
log.debug('Reading text from file...')
length = 0
base_dir = self.output_dir = getcwd()
# Extract content from zip archive.
if file_ext == 'txtz':
zf = ZipFile(stream)
zf.extractall('.')
for x in walk('.'):
if os.path.splitext(x)[1].lower() in ('.txt', '.text'):
with open(x, 'rb') as tf:
txt += tf.read() + b'\n\n'
else:
if getattr(stream, 'name', None):
base_dir = os.path.dirname(stream.name)
txt = stream.read()
if file_ext in {'md', 'textile', 'markdown'}:
options.formatting_type = {'md': 'markdown'}.get(file_ext, file_ext)
log.info('File extension indicates particular formatting. '
'Forcing formatting type to: %s'%options.formatting_type)
options.paragraph_type = 'off'
# Get the encoding of the document.
if options.input_encoding:
ienc = options.input_encoding
log.debug('Using user specified input encoding of %s' % ienc)
else:
det_encoding = detect(txt[:4096])
det_encoding, confidence = det_encoding['encoding'], det_encoding['confidence']
if det_encoding and det_encoding.lower().replace('_', '-').strip() in (
'gb2312', 'chinese', 'csiso58gb231280', 'euc-cn', 'euccn',
'eucgb2312-cn', 'gb2312-1980', 'gb2312-80', 'iso-ir-58'):
# Microsoft Word exports to HTML with encoding incorrectly set to
# gb2312 instead of gbk. gbk is a superset of gb2312, anyway.
det_encoding = 'gbk'
ienc = det_encoding
log.debug('Detected input encoding as %s with a confidence of %s%%' % (ienc, confidence * 100))
if not ienc:
ienc = 'utf-8'
log.debug('No input encoding specified and could not auto detect using %s' % ienc)
# Remove BOM from start of txt as its presence can confuse markdown
import codecs
for bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE, codecs.BOM_UTF8, codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE):
if txt.startswith(bom):
txt = txt[len(bom):]
break
txt = txt.decode(ienc, 'replace')
# Replace entities
txt = _ent_pat.sub(xml_entity_to_unicode, txt)
# Normalize line endings
txt = normalize_line_endings(txt)
# Determine the paragraph type of the document.
if options.paragraph_type == 'auto':
options.paragraph_type = detect_paragraph_type(txt)
if options.paragraph_type == 'unknown':
log.debug('Could not reliably determine paragraph type using block')
options.paragraph_type = 'block'
else:
log.debug('Auto detected paragraph type as %s' % options.paragraph_type)
# Detect formatting
if options.formatting_type == 'auto':
options.formatting_type = detect_formatting_type(txt)
log.debug('Auto detected formatting as %s' % options.formatting_type)
if options.formatting_type == 'heuristic':
setattr(options, 'enable_heuristics', True)
setattr(options, 'unwrap_lines', False)
setattr(options, 'smarten_punctuation', True)
# Reformat paragraphs to block formatting based on the detected type.
# We don't check for block because the processor assumes block.
# single and print at transformed to block for processing.
if options.paragraph_type == 'single':
txt = separate_paragraphs_single_line(txt)
elif options.paragraph_type == 'print':
txt = separate_hard_scene_breaks(txt)
txt = separate_paragraphs_print_formatted(txt)
txt = block_to_single_line(txt)
elif options.paragraph_type == 'unformatted':
from calibre.ebooks.conversion.utils import HeuristicProcessor
# unwrap lines based on punctuation
docanalysis = DocAnalysis('txt', txt)
length = docanalysis.line_length(.5)
preprocessor = HeuristicProcessor(options, log=getattr(self, 'log', None))
txt = preprocessor.punctuation_unwrap(length, txt, 'txt')
txt = separate_paragraphs_single_line(txt)
elif options.paragraph_type == 'block':
txt = separate_hard_scene_breaks(txt)
txt = block_to_single_line(txt)
if getattr(options, 'enable_heuristics', False) and getattr(options, 'dehyphenate', False):
docanalysis = DocAnalysis('txt', txt)
if not length:
length = docanalysis.line_length(.5)
dehyphenator = Dehyphenator(options.verbose, log=self.log)
txt = dehyphenator(txt,'txt', length)
# User requested transformation on the text.
if options.txt_in_remove_indents:
txt = remove_indents(txt)
# Preserve spaces will replace multiple spaces to a space
# followed by the &nbsp; entity.
if options.preserve_spaces:
txt = preserve_spaces(txt)
# Process the text using the appropriate text processor.
self.shifted_files = []
try:
html = ''
input_mi = None
if options.formatting_type == 'markdown':
log.debug('Running text through markdown conversion...')
try:
input_mi, html = convert_markdown_with_metadata(txt, extensions=[x.strip() for x in options.markdown_extensions.split(',') if x.strip()])
except RuntimeError:
raise ValueError('This txt file has malformed markup, it cannot be'
' converted by calibre. See https://daringfireball.net/projects/markdown/syntax')
html = self.fix_resources(html, base_dir)
elif options.formatting_type == 'textile':
log.debug('Running text through textile conversion...')
html = convert_textile(txt)
html = self.fix_resources(html, base_dir)
else:
log.debug('Running text through basic conversion...')
flow_size = getattr(options, 'flow_size', 0)
html = convert_basic(txt, epub_split_size_kb=flow_size)
# Run the HTMLized text through the html processing plugin.
from calibre.customize.ui import plugin_for_input_format
html_input = plugin_for_input_format('html')
for opt in html_input.options:
setattr(options, opt.option.name, opt.recommended_value)
options.input_encoding = 'utf-8'
htmlfile = self.shift_file('index.html', html.encode('utf-8'))
odi = options.debug_pipeline
options.debug_pipeline = None
# Generate oeb from html conversion.
oeb = html_input.convert(open(htmlfile, 'rb'), options, 'html', log, {})
options.debug_pipeline = odi
finally:
for x in self.shifted_files:
os.remove(x)
# Set metadata from file.
if input_mi is None:
from calibre.customize.ui import get_file_type_metadata
input_mi = get_file_type_metadata(stream, file_ext)
from calibre.ebooks.oeb.transforms.metadata import meta_info_to_oeb_metadata
meta_info_to_oeb_metadata(input_mi, oeb.metadata, log)
self.html_postprocess_title = input_mi.title
return oeb
def postprocess_book(self, oeb, opts, log):
for item in oeb.spine:
if hasattr(item.data, 'xpath'):
for title in item.data.xpath('//*[local-name()="title"]'):
if title.text == _('Unknown'):
title.text = self.html_postprocess_title

View File

@@ -0,0 +1,165 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL 3'
__copyright__ = '2009, John Schember <john@nachtimwald.com>'
__docformat__ = 'restructuredtext en'
import os
import shutil
from calibre.customize.conversion import OutputFormatPlugin, \
OptionRecommendation
from calibre.ptempfile import TemporaryDirectory, TemporaryFile
NEWLINE_TYPES = ['system', 'unix', 'old_mac', 'windows']
class TXTOutput(OutputFormatPlugin):
name = 'TXT Output'
author = 'John Schember'
file_type = 'txt'
commit_name = 'txt_output'
ui_data = {
'newline_types': NEWLINE_TYPES,
'formatting_types': {
'plain': _('Plain text'),
'markdown': _('Markdown formatted text'),
'textile': _('TexTile formatted text')
},
}
options = {
OptionRecommendation(name='newline', recommended_value='system',
level=OptionRecommendation.LOW,
short_switch='n', choices=NEWLINE_TYPES,
help=_('Type of newline to use. Options are %s. Default is \'system\'. '
'Use \'old_mac\' for compatibility with Mac OS 9 and earlier. '
'For macOS use \'unix\'. \'system\' will default to the newline '
'type used by this OS.') % sorted(NEWLINE_TYPES)),
OptionRecommendation(name='txt_output_encoding', recommended_value='utf-8',
level=OptionRecommendation.LOW,
help=_('Specify the character encoding of the output document. '
'The default is utf-8.')),
OptionRecommendation(name='inline_toc',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Add Table of Contents to beginning of the book.')),
OptionRecommendation(name='max_line_length',
recommended_value=0, level=OptionRecommendation.LOW,
help=_('The maximum number of characters per line. This splits on '
'the first space before the specified value. If no space is found '
'the line will be broken at the space after and will exceed the '
'specified value. Also, there is a minimum of 25 characters. '
'Use 0 to disable line splitting.')),
OptionRecommendation(name='force_max_line_length',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Force splitting on the max-line-length value when no space '
'is present. Also allows max-line-length to be below the minimum')),
OptionRecommendation(name='txt_output_formatting',
recommended_value='plain',
choices=list(ui_data['formatting_types']),
help=_('Formatting used within the document.\n'
'* plain: {plain}\n'
'* markdown: {markdown}\n'
'* textile: {textile}').format(**ui_data['formatting_types'])),
OptionRecommendation(name='keep_links',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Do not remove links within the document. This is only '
'useful when paired with a txt-output-formatting option that '
'is not none because links are always removed with plain text output.')),
OptionRecommendation(name='keep_image_references',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Do not remove image references within the document. This is only '
'useful when paired with a txt-output-formatting option that '
'is not none because links are always removed with plain text output.')),
OptionRecommendation(name='keep_color',
recommended_value=False, level=OptionRecommendation.LOW,
help=_('Do not remove font color from output. This is only useful when '
'txt-output-formatting is set to textile. Textile is the only '
'formatting that supports setting font color. If this option is '
'not specified font color will not be set and default to the '
'color displayed by the reader (generally this is black).')),
}
def convert(self, oeb_book, output_path, input_plugin, opts, log):
from calibre.ebooks.txt.txtml import TXTMLizer
from calibre.utils.cleantext import clean_ascii_chars
from calibre.ebooks.txt.newlines import specified_newlines, TxtNewlines
if opts.txt_output_formatting.lower() == 'markdown':
from calibre.ebooks.txt.markdownml import MarkdownMLizer
self.writer = MarkdownMLizer(log)
elif opts.txt_output_formatting.lower() == 'textile':
from calibre.ebooks.txt.textileml import TextileMLizer
self.writer = TextileMLizer(log)
else:
self.writer = TXTMLizer(log)
txt = self.writer.extract_content(oeb_book, opts)
txt = clean_ascii_chars(txt)
log.debug('\tReplacing newlines with selected type...')
txt = specified_newlines(TxtNewlines(opts.newline).newline, txt)
close = False
if not hasattr(output_path, 'write'):
close = True
if not os.path.exists(os.path.dirname(output_path)) and os.path.dirname(output_path) != '':
os.makedirs(os.path.dirname(output_path))
out_stream = open(output_path, 'wb')
else:
out_stream = output_path
out_stream.seek(0)
out_stream.truncate()
out_stream.write(txt.encode(opts.txt_output_encoding, 'replace'))
if close:
out_stream.close()
class TXTZOutput(TXTOutput):
name = 'TXTZ Output'
author = 'John Schember'
file_type = 'txtz'
def convert(self, oeb_book, output_path, input_plugin, opts, log):
from calibre.ebooks.oeb.base import OEB_IMAGES
from calibre.utils.zipfile import ZipFile
from lxml import etree
with TemporaryDirectory('_txtz_output') as tdir:
# TXT
txt_name = 'index.txt'
if opts.txt_output_formatting.lower() == 'textile':
txt_name = 'index.text'
with TemporaryFile(txt_name) as tf:
TXTOutput.convert(self, oeb_book, tf, input_plugin, opts, log)
shutil.copy(tf, os.path.join(tdir, txt_name))
# Images
for item in oeb_book.manifest:
if item.media_type in OEB_IMAGES:
if hasattr(self.writer, 'images'):
path = os.path.join(tdir, 'images')
if item.href in self.writer.images:
href = self.writer.images[item.href]
else:
continue
else:
path = os.path.join(tdir, os.path.dirname(item.href))
href = os.path.basename(item.href)
if not os.path.exists(path):
os.makedirs(path)
with open(os.path.join(path, href), 'wb') as imgf:
imgf.write(item.data)
# Metadata
with open(os.path.join(tdir, 'metadata.opf'), 'wb') as mdataf:
mdataf.write(etree.tostring(oeb_book.metadata.to_opf1()))
txtz = ZipFile(output_path, 'w')
txtz.add_dir(tdir)

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,646 @@
#!/usr/bin/env python2
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
import functools, re, json
from math import ceil
from calibre import entity_to_unicode, as_unicode
from polyglot.builtins import unicode_type, range
XMLDECL_RE = re.compile(r'^\s*<[?]xml.*?[?]>')
SVG_NS = 'http://www.w3.org/2000/svg'
XLINK_NS = 'http://www.w3.org/1999/xlink'
convert_entities = functools.partial(entity_to_unicode,
result_exceptions={
'<' : '&lt;',
'>' : '&gt;',
"'" : '&apos;',
'"' : '&quot;',
'&' : '&amp;',
})
_span_pat = re.compile('<span.*?</span>', re.DOTALL|re.IGNORECASE)
LIGATURES = {
# '\u00c6': 'AE',
# '\u00e6': 'ae',
# '\u0152': 'OE',
# '\u0153': 'oe',
# '\u0132': 'IJ',
# '\u0133': 'ij',
# '\u1D6B': 'ue',
'\uFB00': 'ff',
'\uFB01': 'fi',
'\uFB02': 'fl',
'\uFB03': 'ffi',
'\uFB04': 'ffl',
'\uFB05': 'ft',
'\uFB06': 'st',
}
_ligpat = re.compile('|'.join(LIGATURES))
def sanitize_head(match):
x = match.group(1)
x = _span_pat.sub('', x)
return '<head>\n%s\n</head>' % x
def chap_head(match):
chap = match.group('chap')
title = match.group('title')
if not title:
return '<h1>'+chap+'</h1><br/>\n'
else:
return '<h1>'+chap+'</h1>\n<h3>'+title+'</h3>\n'
def wrap_lines(match):
ital = match.group('ital')
if not ital:
return ' '
else:
return ital+' '
def smarten_punctuation(html, log=None):
from calibre.utils.smartypants import smartyPants
from calibre.ebooks.chardet import substitute_entites
from calibre.ebooks.conversion.utils import HeuristicProcessor
preprocessor = HeuristicProcessor(log=log)
from uuid import uuid4
start = 'calibre-smartypants-'+unicode_type(uuid4())
stop = 'calibre-smartypants-'+unicode_type(uuid4())
html = html.replace('<!--', start)
html = html.replace('-->', stop)
html = preprocessor.fix_nbsp_indents(html)
html = smartyPants(html)
html = html.replace(start, '<!--')
html = html.replace(stop, '-->')
return substitute_entites(html)
class DocAnalysis(object):
'''
Provides various text analysis functions to determine how the document is structured.
format is the type of document analysis will be done against.
raw is the raw text to determine the line length to use for wrapping.
Blank lines are excluded from analysis
'''
def __init__(self, format='html', raw=''):
raw = raw.replace('&nbsp;', ' ')
if format == 'html':
linere = re.compile(r'(?<=<p)(?![^>]*>\s*</p>).*?(?=</p>)', re.DOTALL)
elif format == 'pdf':
linere = re.compile(r'(?<=<br>)(?!\s*<br>).*?(?=<br>)', re.DOTALL)
elif format == 'spanned_html':
linere = re.compile('(?<=<span).*?(?=</span>)', re.DOTALL)
elif format == 'txt':
linere = re.compile('.*?\n')
self.lines = linere.findall(raw)
def line_length(self, percent):
'''
Analyses the document to find the median line length.
percentage is a decimal number, 0 - 1 which is used to determine
how far in the list of line lengths to use. The list of line lengths is
ordered smallest to largest and does not include duplicates. 0.5 is the
median value.
'''
lengths = []
for line in self.lines:
if len(line) > 0:
lengths.append(len(line))
if not lengths:
return 0
lengths = list(set(lengths))
total = sum(lengths)
avg = total / len(lengths)
max_line = ceil(avg * 2)
lengths = sorted(lengths)
for i in range(len(lengths) - 1, -1, -1):
if lengths[i] > max_line:
del lengths[i]
if percent > 1:
percent = 1
if percent < 0:
percent = 0
index = int(len(lengths) * percent) - 1
return lengths[index]
def line_histogram(self, percent):
'''
Creates a broad histogram of the document to determine whether it incorporates hard
line breaks. Lines are sorted into 20 'buckets' based on length.
percent is the percentage of lines that should be in a single bucket to return true
The majority of the lines will exist in 1-2 buckets in typical docs with hard line breaks
'''
minLineLength=20 # Ignore lines under 20 chars (typical of spaces)
maxLineLength=1900 # Discard larger than this to stay in range
buckets=20 # Each line is divided into a bucket based on length
# print("there are "+unicode_type(len(lines))+" lines")
# max = 0
# for line in self.lines:
# l = len(line)
# if l > max:
# max = l
# print("max line found is "+unicode_type(max))
# Build the line length histogram
hRaw = [0 for i in range(0,buckets)]
for line in self.lines:
l = len(line)
if l > minLineLength and l < maxLineLength:
l = int(l // 100)
# print("adding "+unicode_type(l))
hRaw[l]+=1
# Normalize the histogram into percents
totalLines = len(self.lines)
if totalLines > 0:
h = [float(count)/totalLines for count in hRaw]
else:
h = []
# print("\nhRaw histogram lengths are: "+unicode_type(hRaw))
# print(" percents are: "+unicode_type(h)+"\n")
# Find the biggest bucket
maxValue = 0
for i in range(0,len(h)):
if h[i] > maxValue:
maxValue = h[i]
if maxValue < percent:
# print("Line lengths are too variable. Not unwrapping.")
return False
else:
# print(unicode_type(maxValue)+" of the lines were in one bucket")
return True
class Dehyphenator(object):
'''
Analyzes words to determine whether hyphens should be retained/removed. Uses the document
itself is as a dictionary. This method handles all languages along with uncommon, made-up, and
scientific words. The primary disadvantage is that words appearing only once in the document
retain hyphens.
'''
def __init__(self, verbose=0, log=None):
self.log = log
self.verbose = verbose
# Add common suffixes to the regex below to increase the likelihood of a match -
# don't add suffixes which are also complete words, such as 'able' or 'sex'
# only remove if it's not already the point of hyphenation
self.suffix_string = (
"((ed)?ly|'?e?s||a?(t|s)?ion(s|al(ly)?)?|ings?|er|(i)?ous|"
"(i|a)ty|(it)?ies|ive|gence|istic(ally)?|(e|a)nce|m?ents?|ism|ated|"
"(e|u)ct(ed)?|ed|(i|ed)?ness|(e|a)ncy|ble|ier|al|ex|ian)$")
self.suffixes = re.compile(r"^%s" % self.suffix_string, re.IGNORECASE)
self.removesuffixes = re.compile(r"%s" % self.suffix_string, re.IGNORECASE)
# remove prefixes if the prefix was not already the point of hyphenation
self.prefix_string = '^(dis|re|un|in|ex)'
self.prefixes = re.compile(r'%s$' % self.prefix_string, re.IGNORECASE)
self.removeprefix = re.compile(r'%s' % self.prefix_string, re.IGNORECASE)
def dehyphenate(self, match):
firsthalf = match.group('firstpart')
secondhalf = match.group('secondpart')
try:
wraptags = match.group('wraptags')
except:
wraptags = ''
hyphenated = unicode_type(firsthalf) + "-" + unicode_type(secondhalf)
dehyphenated = unicode_type(firsthalf) + unicode_type(secondhalf)
if self.suffixes.match(secondhalf) is None:
lookupword = self.removesuffixes.sub('', dehyphenated)
else:
lookupword = dehyphenated
if len(firsthalf) > 4 and self.prefixes.match(firsthalf) is None:
lookupword = self.removeprefix.sub('', lookupword)
if self.verbose > 2:
self.log("lookup word is: "+lookupword+", orig is: " + hyphenated)
try:
searchresult = self.html.find(lookupword.lower())
except:
return hyphenated
if self.format == 'html_cleanup' or self.format == 'txt_cleanup':
if self.html.find(lookupword) != -1 or searchresult != -1:
if self.verbose > 2:
self.log(" Cleanup:returned dehyphenated word: " + dehyphenated)
return dehyphenated
elif self.html.find(hyphenated) != -1:
if self.verbose > 2:
self.log(" Cleanup:returned hyphenated word: " + hyphenated)
return hyphenated
else:
if self.verbose > 2:
self.log(" Cleanup:returning original text "+firsthalf+" + linefeed "+secondhalf)
return firsthalf+'\u2014'+wraptags+secondhalf
else:
if self.format == 'individual_words' and len(firsthalf) + len(secondhalf) <= 6:
if self.verbose > 2:
self.log("too short, returned hyphenated word: " + hyphenated)
return hyphenated
if len(firsthalf) <= 2 and len(secondhalf) <= 2:
if self.verbose > 2:
self.log("too short, returned hyphenated word: " + hyphenated)
return hyphenated
if self.html.find(lookupword) != -1 or searchresult != -1:
if self.verbose > 2:
self.log(" returned dehyphenated word: " + dehyphenated)
return dehyphenated
else:
if self.verbose > 2:
self.log(" returned hyphenated word: " + hyphenated)
return hyphenated
def __call__(self, html, format, length=1):
self.html = html
self.format = format
if format == 'html':
intextmatch = re.compile((
r'(?<=.{%i})(?P<firstpart>[^\W\-]+)(-|)\s*(?=<)(?P<wraptags>(</span>)?'
r'\s*(</[iubp]>\s*){1,2}(?P<up2threeblanks><(p|div)[^>]*>\s*(<p[^>]*>\s*</p>\s*)'
r'?</(p|div)>\s+){0,3}\s*(<[iubp][^>]*>\s*){1,2}(<span[^>]*>)?)\s*(?P<secondpart>[\w\d]+)') % length)
elif format == 'pdf':
intextmatch = re.compile((
r'(?<=.{%i})(?P<firstpart>[^\W\-]+)(-|)\s*(?P<wraptags><p>|'
r'</[iub]>\s*<p>\s*<[iub]>)\s*(?P<secondpart>[\w\d]+)')% length)
elif format == 'txt':
intextmatch = re.compile(
'(?<=.{%i})(?P<firstpart>[^\\W\\-]+)(-|)(\u0020|\u0009)*(?P<wraptags>(\n(\u0020|\u0009)*)+)(?P<secondpart>[\\w\\d]+)'% length)
elif format == 'individual_words':
intextmatch = re.compile(
r'(?!<)(?P<firstpart>[^\W\-]+)(-|)\s*(?P<secondpart>\w+)(?![^<]*?>)', re.UNICODE)
elif format == 'html_cleanup':
intextmatch = re.compile(
r'(?P<firstpart>[^\W\-]+)(-|)\s*(?=<)(?P<wraptags></span>\s*(</[iubp]>'
r'\s*<[iubp][^>]*>\s*)?<span[^>]*>|</[iubp]>\s*<[iubp][^>]*>)?\s*(?P<secondpart>[\w\d]+)')
elif format == 'txt_cleanup':
intextmatch = re.compile(
r'(?P<firstpart>[^\W\-]+)(-|)(?P<wraptags>\s+)(?P<secondpart>[\w\d]+)')
html = intextmatch.sub(self.dehyphenate, html)
return html
class CSSPreProcessor(object):
# Remove some of the broken CSS Microsoft products
# create
MS_PAT = re.compile(r'''
(?P<start>^|;|\{)\s* # The end of the previous rule or block start
(%s).+? # The invalid selectors
(?P<end>$|;|\}) # The end of the declaration
'''%'mso-|panose-|text-underline|tab-interval',
re.MULTILINE|re.IGNORECASE|re.VERBOSE)
def ms_sub(self, match):
end = match.group('end')
try:
start = match.group('start')
except:
start = ''
if end == ';':
end = ''
return start + end
def __call__(self, data, add_namespace=False):
from calibre.ebooks.oeb.base import XHTML_CSS_NAMESPACE
data = self.MS_PAT.sub(self.ms_sub, data)
if not add_namespace:
return data
# Remove comments as the following namespace logic will break if there
# are commented lines before the first @import or @charset rule. Since
# the conversion will remove all stylesheets anyway, we don't lose
# anything
data = re.sub(unicode_type(r'/\*.*?\*/'), '', data, flags=re.DOTALL)
ans, namespaced = [], False
for line in data.splitlines():
ll = line.lstrip()
if not (namespaced or ll.startswith('@import') or not ll or
ll.startswith('@charset')):
ans.append(XHTML_CSS_NAMESPACE.strip())
namespaced = True
ans.append(line)
return '\n'.join(ans)
def accent_regex(accent_maps, letter_before=False):
accent_cat = set()
letters = set()
for accent in tuple(accent_maps):
accent_cat.add(accent)
k, v = accent_maps[accent].split(':', 1)
if len(k) != len(v):
raise ValueError('Invalid mapping for: {} -> {}'.format(k, v))
accent_maps[accent] = lmap = dict(zip(k, v))
letters |= set(lmap)
if letter_before:
args = ''.join(letters), ''.join(accent_cat)
accent_group, letter_group = 2, 1
else:
args = ''.join(accent_cat), ''.join(letters)
accent_group, letter_group = 1, 2
pat = re.compile(r'([{}])\s*(?:<br[^>]*>){{0,1}}\s*([{}])'.format(*args), re.UNICODE)
def sub(m):
lmap = accent_maps[m.group(accent_group)]
return lmap.get(m.group(letter_group)) or m.group()
return pat, sub
def html_preprocess_rules():
ans = getattr(html_preprocess_rules, 'ans', None)
if ans is None:
ans = html_preprocess_rules.ans = [
# Remove huge block of contiguous spaces as they slow down
# the following regexes pretty badly
(re.compile(r'\s{10000,}'), ''),
# Some idiotic HTML generators (Frontpage I'm looking at you)
# Put all sorts of crap into <head>. This messes up lxml
(re.compile(r'<head[^>]*>\n*(.*?)\n*</head>', re.IGNORECASE|re.DOTALL),
sanitize_head),
# Convert all entities, since lxml doesn't handle them well
(re.compile(r'&(\S+?);'), convert_entities),
# Remove the <![if/endif tags inserted by everybody's darling, MS Word
(re.compile(r'</{0,1}!\[(end){0,1}if\]{0,1}>', re.IGNORECASE), ''),
]
return ans
def pdftohtml_rules():
ans = getattr(pdftohtml_rules, 'ans', None)
if ans is None:
ans = pdftohtml_rules.ans = [
accent_regex({
'¨': 'aAeEiIoOuU:äÄëËïÏöÖüÜ',
'`': 'aAeEiIoOuU:àÀèÈìÌòÒùÙ',
'´': 'aAcCeEiIlLoOnNrRsSuUzZ:áÁćĆéÉíÍĺĹóÓńŃŕŔśŚúÚźŹ',
'ˆ': 'aAeEiIoOuU:âÂêÊîÎôÔûÛ',
'¸': 'cC:çÇ',
'˛': 'aAeE:ąĄęĘ',
'˙': 'zZ:żŻ',
'ˇ': 'cCdDeElLnNrRsStTzZ:čČďĎěĚľĽňŇřŘšŠťŤžŽ',
'°': 'uU:ůŮ',
}),
accent_regex({'`': 'aAeEiIoOuU:àÀèÈìÌòÒùÙ'}, letter_before=True),
# If pdf printed from a browser then the header/footer has a reliable pattern
(re.compile(r'((?<=</a>)\s*file:/{2,4}[A-Z].*<br>|file:////?[A-Z].*<br>(?=\s*<hr>))', re.IGNORECASE), lambda match: ''),
# Center separator lines
(re.compile(r'<br>\s*(?P<break>([*#•✦=] *){3,})\s*<br>'), lambda match: '<p>\n<p style="text-align:center">' + match.group('break') + '</p>'),
# Remove <hr> tags
(re.compile(r'<hr.*?>', re.IGNORECASE), ''),
# Remove gray background
(re.compile(r'<BODY[^<>]+>'), '<BODY>'),
# Convert line breaks to paragraphs
(re.compile(r'<br[^>]*>\s*'), '</p>\n<p>'),
(re.compile(r'<body[^>]*>\s*'), '<body>\n<p>'),
(re.compile(r'\s*</body>'), '</p>\n</body>'),
# Clean up spaces
(re.compile(r'(?<=[\.,;\?!”"\'])[\s^ ]*(?=<)'), ' '),
# Add space before and after italics
(re.compile(r'(?<!“)<i>'), ' <i>'),
(re.compile(r'</i>(?=\w)'), '</i> '),
]
return ans
def book_designer_rules():
ans = getattr(book_designer_rules, 'ans', None)
if ans is None:
ans = book_designer_rules.ans = [
# HR
(re.compile('<hr>', re.IGNORECASE),
lambda match : '<span style="page-break-after:always"> </span>'),
# Create header tags
(re.compile(r'<h2[^><]*?id=BookTitle[^><]*?(align=)*(?(1)(\w+))*[^><]*?>[^><]*?</h2>', re.IGNORECASE),
lambda match : '<h1 id="BookTitle" align="%s">%s</h1>'%(match.group(2) if match.group(2) else 'center', match.group(3))),
(re.compile(r'<h2[^><]*?id=BookAuthor[^><]*?(align=)*(?(1)(\w+))*[^><]*?>[^><]*?</h2>', re.IGNORECASE),
lambda match : '<h2 id="BookAuthor" align="%s">%s</h2>'%(match.group(2) if match.group(2) else 'center', match.group(3))),
(re.compile('<span[^><]*?id=title[^><]*?>(.*?)</span>', re.IGNORECASE|re.DOTALL),
lambda match : '<h2 class="title">%s</h2>'%(match.group(1),)),
(re.compile('<span[^><]*?id=subtitle[^><]*?>(.*?)</span>', re.IGNORECASE|re.DOTALL),
lambda match : '<h3 class="subtitle">%s</h3>'%(match.group(1),)),
]
return None
class HTMLPreProcessor(object):
def __init__(self, log=None, extra_opts=None, regex_wizard_callback=None):
self.log = log
self.extra_opts = extra_opts
self.regex_wizard_callback = regex_wizard_callback
self.current_href = None
def is_baen(self, src):
return re.compile(r'<meta\s+name="Publisher"\s+content=".*?Baen.*?"',
re.IGNORECASE).search(src) is not None
def is_book_designer(self, raw):
return re.search('<H2[^><]*id=BookTitle', raw) is not None
def is_pdftohtml(self, src):
return '<!-- created by calibre\'s pdftohtml -->' in src[:1000]
def __call__(self, html, remove_special_chars=None,
get_preprocess_html=False):
if remove_special_chars is not None:
html = remove_special_chars.sub('', html)
html = html.replace('\0', '')
is_pdftohtml = self.is_pdftohtml(html)
if self.is_baen(html):
rules = []
elif self.is_book_designer(html):
rules = book_designer_rules()
elif is_pdftohtml:
rules = pdftohtml_rules()
else:
rules = []
start_rules = []
if not getattr(self.extra_opts, 'keep_ligatures', False):
html = _ligpat.sub(lambda m:LIGATURES[m.group()], html)
user_sr_rules = {}
# Function for processing search and replace
def do_search_replace(search_pattern, replace_txt):
from calibre.ebooks.conversion.search_replace import compile_regular_expression
try:
search_re = compile_regular_expression(search_pattern)
if not replace_txt:
replace_txt = ''
rules.insert(0, (search_re, replace_txt))
user_sr_rules[(search_re, replace_txt)] = search_pattern
except Exception as e:
self.log.error('Failed to parse %r regexp because %s' %
(search, as_unicode(e)))
# search / replace using the sr?_search / sr?_replace options
for i in range(1, 4):
search, replace = 'sr%d_search'%i, 'sr%d_replace'%i
search_pattern = getattr(self.extra_opts, search, '')
replace_txt = getattr(self.extra_opts, replace, '')
if search_pattern:
do_search_replace(search_pattern, replace_txt)
# multi-search / replace using the search_replace option
search_replace = getattr(self.extra_opts, 'search_replace', None)
if search_replace:
search_replace = json.loads(search_replace)
for search_pattern, replace_txt in reversed(search_replace):
do_search_replace(search_pattern, replace_txt)
end_rules = []
# delete soft hyphens - moved here so it's executed after header/footer removal
if is_pdftohtml:
# unwrap/delete soft hyphens
end_rules.append((re.compile(
r'[­](</p>\s*<p>\s*)+\s*(?=[\[a-z\d])'), lambda match: ''))
# unwrap/delete soft hyphens with formatting
end_rules.append((re.compile(
r'[­]\s*(</(i|u|b)>)+(</p>\s*<p>\s*)+\s*(<(i|u|b)>)+\s*(?=[\[a-z\d])'), lambda match: ''))
length = -1
if getattr(self.extra_opts, 'unwrap_factor', 0.0) > 0.01:
docanalysis = DocAnalysis('pdf', html)
length = docanalysis.line_length(getattr(self.extra_opts, 'unwrap_factor'))
if length:
# print("The pdf line length returned is " + unicode_type(length))
# unwrap em/en dashes
end_rules.append((re.compile(
r'(?<=.{%i}[–—])\s*<p>\s*(?=[\[a-z\d])' % length), lambda match: ''))
end_rules.append(
# Un wrap using punctuation
(re.compile((
r'(?<=.{%i}([a-zäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:)\\IAß]'
r'|(?<!\&\w{4});))\s*(?P<ital></(i|b|u)>)?\s*(</p>\s*<p>\s*)+\s*(?=(<(i|b|u)>)?'
r'\s*[\w\d$(])') % length, re.UNICODE), wrap_lines),
)
for rule in html_preprocess_rules() + start_rules:
html = rule[0].sub(rule[1], html)
if self.regex_wizard_callback is not None:
self.regex_wizard_callback(self.current_href, html)
if get_preprocess_html:
return html
def dump(raw, where):
import os
dp = getattr(self.extra_opts, 'debug_pipeline', None)
if dp and os.path.exists(dp):
odir = os.path.join(dp, 'input')
if os.path.exists(odir):
odir = os.path.join(odir, where)
if not os.path.exists(odir):
os.makedirs(odir)
name, i = None, 0
while not name or os.path.exists(os.path.join(odir, name)):
i += 1
name = '%04d.html'%i
with open(os.path.join(odir, name), 'wb') as f:
f.write(raw.encode('utf-8'))
# dump(html, 'pre-preprocess')
for rule in rules + end_rules:
try:
html = rule[0].sub(rule[1], html)
except Exception as e:
if rule in user_sr_rules:
self.log.error(
'User supplied search & replace rule: %s -> %s '
'failed with error: %s, ignoring.'%(
user_sr_rules[rule], rule[1], e))
else:
raise
if is_pdftohtml and length > -1:
# Dehyphenate
dehyphenator = Dehyphenator(self.extra_opts.verbose, self.log)
html = dehyphenator(html,'html', length)
if is_pdftohtml:
from calibre.ebooks.conversion.utils import HeuristicProcessor
pdf_markup = HeuristicProcessor(self.extra_opts, None)
totalwords = 0
if pdf_markup.get_word_count(html) > 7000:
html = pdf_markup.markup_chapters(html, totalwords, True)
# dump(html, 'post-preprocess')
# Handle broken XHTML w/ SVG (ugh)
if 'svg:' in html and SVG_NS not in html:
html = html.replace(
'<html', '<html xmlns:svg="%s"' % SVG_NS, 1)
if 'xlink:' in html and XLINK_NS not in html:
html = html.replace(
'<html', '<html xmlns:xlink="%s"' % XLINK_NS, 1)
html = XMLDECL_RE.sub('', html)
if getattr(self.extra_opts, 'asciiize', False):
from calibre.utils.localization import get_udc
from calibre.utils.mreplace import MReplace
unihandecoder = get_udc()
mr = MReplace(data={'«':'&lt;'*3, '»':'&gt;'*3})
html = mr.mreplace(html)
html = unihandecoder.decode(html)
if getattr(self.extra_opts, 'enable_heuristics', False):
from calibre.ebooks.conversion.utils import HeuristicProcessor
preprocessor = HeuristicProcessor(self.extra_opts, self.log)
html = preprocessor(html)
if is_pdftohtml:
html = html.replace('<!-- created by calibre\'s pdftohtml -->', '')
if getattr(self.extra_opts, 'smarten_punctuation', False):
html = smarten_punctuation(html, self.log)
try:
unsupported_unicode_chars = self.extra_opts.output_profile.unsupported_unicode_chars
except AttributeError:
unsupported_unicode_chars = ''
if unsupported_unicode_chars:
from calibre.utils.localization import get_udc
unihandecoder = get_udc()
for char in unsupported_unicode_chars:
asciichar = unihandecoder.decode(char)
html = html.replace(char, asciichar)
return html

View File

@@ -0,0 +1,881 @@
#!/usr/bin/env python2
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2010, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
import re
from math import ceil
from calibre.ebooks.conversion.preprocess import DocAnalysis, Dehyphenator
from calibre.utils.logging import default_log
from calibre.utils.wordcount import get_wordcount_obj
from polyglot.builtins import unicode_type
class HeuristicProcessor(object):
def __init__(self, extra_opts=None, log=None):
self.log = default_log if log is None else log
self.html_preprocess_sections = 0
self.found_indents = 0
self.extra_opts = extra_opts
self.deleted_nbsps = False
self.totalwords = 0
self.min_chapters = 1
self.chapters_no_title = 0
self.chapters_with_title = 0
self.blanks_deleted = False
self.blanks_between_paragraphs = False
self.linereg = re.compile('(?<=<p).*?(?=</p>)', re.IGNORECASE|re.DOTALL)
self.blankreg = re.compile(r'\s*(?P<openline><p(?!\sclass=\"(softbreak|whitespace)\")[^>]*>)\s*(?P<closeline></p>)', re.IGNORECASE)
self.anyblank = re.compile(r'\s*(?P<openline><p[^>]*>)\s*(?P<closeline></p>)', re.IGNORECASE)
self.multi_blank = re.compile(r'(\s*<p[^>]*>\s*</p>(\s*<div[^>]*>\s*</div>\s*)*){2,}(?!\s*<h\d)', re.IGNORECASE)
self.any_multi_blank = re.compile(r'(\s*<p[^>]*>\s*</p>(\s*<div[^>]*>\s*</div>\s*)*){2,}', re.IGNORECASE)
self.line_open = (
r"<(?P<outer>p|div)[^>]*>\s*(<(?P<inner1>font|span|[ibu])[^>]*>)?\s*"
r"(<(?P<inner2>font|span|[ibu])[^>]*>)?\s*(<(?P<inner3>font|span|[ibu])[^>]*>)?\s*")
self.line_close = "(</(?P=inner3)>)?\\s*(</(?P=inner2)>)?\\s*(</(?P=inner1)>)?\\s*</(?P=outer)>"
self.single_blank = re.compile(r'(\s*<(p|div)[^>]*>\s*</(p|div)>)', re.IGNORECASE)
self.scene_break_open = '<p class="scenebreak" style="text-align:center; text-indent:0%; margin-top:1em; margin-bottom:1em; page-break-before:avoid">'
self.common_in_text_endings = '[\"\'—’”,\\.!\\?\\\\)„\\w]'
self.common_in_text_beginnings = '[\\w\'\"“‘‛]'
def is_pdftohtml(self, src):
return '<!-- created by calibre\'s pdftohtml -->' in src[:1000]
def is_abbyy(self, src):
return '<meta name="generator" content="ABBYY FineReader' in src[:1000]
def chapter_head(self, match):
from calibre.utils.html2text import html2text
chap = match.group('chap')
title = match.group('title')
if not title:
self.html_preprocess_sections = self.html_preprocess_sections + 1
self.log.debug("marked " + unicode_type(self.html_preprocess_sections) +
" chapters. - " + unicode_type(chap))
return '<h2>'+chap+'</h2>\n'
else:
delete_whitespace = re.compile('^\\s*(?P<c>.*?)\\s*$')
delete_quotes = re.compile('\'\"')
txt_chap = delete_quotes.sub('', delete_whitespace.sub('\\g<c>', html2text(chap)))
txt_title = delete_quotes.sub('', delete_whitespace.sub('\\g<c>', html2text(title)))
self.html_preprocess_sections = self.html_preprocess_sections + 1
self.log.debug("marked " + unicode_type(self.html_preprocess_sections) +
" chapters & titles. - " + unicode_type(chap) + ", " + unicode_type(title))
return '<h2 title="'+txt_chap+', '+txt_title+'">'+chap+'</h2>\n<h3 class="sigilNotInTOC">'+title+'</h3>\n'
def chapter_break(self, match):
chap = match.group('section')
styles = match.group('styles')
self.html_preprocess_sections = self.html_preprocess_sections + 1
self.log.debug("marked " + unicode_type(self.html_preprocess_sections) +
" section markers based on punctuation. - " + unicode_type(chap))
return '<'+styles+' style="page-break-before:always">'+chap
def analyze_title_matches(self, match):
# chap = match.group('chap')
title = match.group('title')
if not title:
self.chapters_no_title = self.chapters_no_title + 1
else:
self.chapters_with_title = self.chapters_with_title + 1
def insert_indent(self, match):
pstyle = match.group('formatting')
tag = match.group('tagtype')
span = match.group('span')
self.found_indents = self.found_indents + 1
if pstyle:
if pstyle.lower().find('style') != -1:
pstyle = re.sub(r'"$', '; text-indent:3%"', pstyle)
else:
pstyle = pstyle+' style="text-indent:3%"'
if not span:
return '<'+tag+' '+pstyle+'>'
else:
return '<'+tag+' '+pstyle+'>'+span
else:
if not span:
return '<'+tag+' style="text-indent:3%">'
else:
return '<'+tag+' style="text-indent:3%">'+span
def no_markup(self, raw, percent):
'''
Detects total marked up line endings in the file. raw is the text to
inspect. Percent is the minimum percent of line endings which should
be marked up to return true.
'''
htm_end_ere = re.compile('</(p|div)>', re.DOTALL)
line_end_ere = re.compile('(\n|\r|\r\n)', re.DOTALL)
htm_end = htm_end_ere.findall(raw)
line_end = line_end_ere.findall(raw)
tot_htm_ends = len(htm_end)
tot_ln_fds = len(line_end)
# self.log.debug("There are " + unicode_type(tot_ln_fds) + " total Line feeds, and " +
# unicode_type(tot_htm_ends) + " marked up endings")
if percent > 1:
percent = 1
if percent < 0:
percent = 0
min_lns = tot_ln_fds * percent
# self.log.debug("There must be fewer than " + unicode_type(min_lns) + " unmarked lines to add markup")
return min_lns > tot_htm_ends
def dump(self, raw, where):
import os
dp = getattr(self.extra_opts, 'debug_pipeline', None)
if dp and os.path.exists(dp):
odir = os.path.join(dp, 'preprocess')
if not os.path.exists(odir):
os.makedirs(odir)
if os.path.exists(odir):
odir = os.path.join(odir, where)
if not os.path.exists(odir):
os.makedirs(odir)
name, i = None, 0
while not name or os.path.exists(os.path.join(odir, name)):
i += 1
name = '%04d.html'%i
with open(os.path.join(odir, name), 'wb') as f:
f.write(raw.encode('utf-8'))
def get_word_count(self, html):
word_count_text = re.sub(r'(?s)<head[^>]*>.*?</head>', '', html)
word_count_text = re.sub(r'<[^>]*>', '', word_count_text)
wordcount = get_wordcount_obj(word_count_text)
return wordcount.words
def markup_italicis(self, html):
# self.log.debug("\n\n\nitalicize debugging \n\n\n")
ITALICIZE_WORDS = [
'Etc.', 'etc.', 'viz.', 'ie.', 'i.e.', 'Ie.', 'I.e.', 'eg.',
'e.g.', 'Eg.', 'E.g.', 'et al.', 'et cetera', 'n.b.', 'N.b.',
'nota bene', 'Nota bene', 'Ste.', 'Mme.', 'Mdme.',
'Mlle.', 'Mons.', 'PS.', 'PPS.',
]
ITALICIZE_STYLE_PATS = [
unicode_type(r'(?msu)(?<=[\s>"\'])_\*/(?P<words>[^\*_]+)/\*_'),
unicode_type(r'(?msu)(?<=[\s>"\'])~~(?P<words>[^~]+)~~'),
unicode_type(r'(?msu)(?<=[\s>"\'])_/(?P<words>[^/_]+)/_'),
unicode_type(r'(?msu)(?<=[\s>"\'])_\*(?P<words>[^\*_]+)\*_'),
unicode_type(r'(?msu)(?<=[\s>"\'])\*/(?P<words>[^/\*]+)/\*'),
unicode_type(r'(?msu)(?<=[\s>"\'])/:(?P<words>[^:/]+):/'),
unicode_type(r'(?msu)(?<=[\s>"\'])\|:(?P<words>[^:\|]+):\|'),
unicode_type(r'(?msu)(?<=[\s>"\'])\*(?P<words>[^\*]+)\*'),
unicode_type(r'(?msu)(?<=[\s>"\'])~(?P<words>[^~]+)~'),
unicode_type(r'(?msu)(?<=[\s>"\'])/(?P<words>[^/\*><]+)/'),
unicode_type(r'(?msu)(?<=[\s>"\'])_(?P<words>[^_]+)_'),
]
for word in ITALICIZE_WORDS:
html = re.sub(r'(?<=\s|>)' + re.escape(word) + r'(?=\s|<)', '<i>%s</i>' % word, html)
search_text = re.sub(r'(?s)<head[^>]*>.*?</head>', '', html)
search_text = re.sub(r'<[^>]*>', '', search_text)
for pat in ITALICIZE_STYLE_PATS:
for match in re.finditer(pat, search_text):
ital_string = unicode_type(match.group('words'))
# self.log.debug("italicising "+unicode_type(match.group(0))+" with <i>"+ital_string+"</i>")
try:
html = re.sub(re.escape(unicode_type(match.group(0))), '<i>%s</i>' % ital_string, html)
except OverflowError:
# match.group(0) was too large to be compiled into a regex
continue
except re.error:
# the match was not a valid regular expression
continue
return html
def markup_chapters(self, html, wordcount, blanks_between_paragraphs):
'''
Searches for common chapter headings throughout the document
attempts multiple patterns based on likelihood of a match
with minimum false positives. Exits after finding a successful pattern
'''
# Typical chapters are between 2000 and 7000 words, use the larger number to decide the
# minimum of chapters to search for. A max limit is calculated to prevent things like OCR
# or pdf page numbers from being treated as TOC markers
max_chapters = 150
typical_chapters = 7000.
if wordcount > 7000:
if wordcount > 200000:
typical_chapters = 15000.
self.min_chapters = int(ceil(wordcount / typical_chapters))
self.log.debug("minimum chapters required are: "+unicode_type(self.min_chapters))
heading = re.compile('<h[1-3][^>]*>', re.IGNORECASE)
self.html_preprocess_sections = len(heading.findall(html))
self.log.debug("found " + unicode_type(self.html_preprocess_sections) + " pre-existing headings")
# Build the Regular Expressions in pieces
init_lookahead = "(?=<(p|div))"
chapter_line_open = self.line_open
title_line_open = (r"<(?P<outer2>p|div)[^>]*>\s*(<(?P<inner4>font|span|[ibu])[^>]*>)?"
r"\s*(<(?P<inner5>font|span|[ibu])[^>]*>)?\s*(<(?P<inner6>font|span|[ibu])[^>]*>)?\s*")
chapter_header_open = r"(?P<chap>"
title_header_open = r"(?P<title>"
chapter_header_close = ")\\s*"
title_header_close = ")"
chapter_line_close = self.line_close
title_line_close = "(</(?P=inner6)>)?\\s*(</(?P=inner5)>)?\\s*(</(?P=inner4)>)?\\s*</(?P=outer2)>"
is_pdftohtml = self.is_pdftohtml(html)
if is_pdftohtml:
title_line_open = "<(?P<outer2>p)[^>]*>\\s*"
title_line_close = "\\s*</(?P=outer2)>"
if blanks_between_paragraphs:
blank_lines = "(\\s*<p[^>]*>\\s*</p>){0,2}\\s*"
else:
blank_lines = ""
opt_title_open = "("
opt_title_close = ")?"
n_lookahead_open = "(?!\\s*"
n_lookahead_close = ")\\s*"
default_title = r"(<[ibu][^>]*>)?\s{0,3}(?!Chapter)([\w\:\'\"-]+\s{0,3}){1,5}?(</[ibu][^>]*>)?(?=<)"
simple_title = r"(<[ibu][^>]*>)?\s{0,3}(?!(Chapter|\s+<)).{0,65}?(</[ibu][^>]*>)?(?=<)"
analysis_result = []
chapter_types = [
[(
r"[^'\"]?(Introduction|Synopsis|Acknowledgements|Epilogue|CHAPTER|Kapitel|Volume\b|Prologue|Book\b|Part\b|Dedication|Preface)"
r"\s*([\d\w-]+\:?\'?\s*){0,5}"), True, True, True, False, "Searching for common section headings", 'common'],
# Highest frequency headings which include titles
[r"[^'\"]?(CHAPTER|Kapitel)\s*([\dA-Z\-\'\"\?!#,]+\s*){0,7}\s*", True, True, True, False, "Searching for most common chapter headings", 'chapter'],
[r"<b[^>]*>\s*(<span[^>]*>)?\s*(?!([*#•=]+\s*)+)(\s*(?=[\d.\w#\-*\s]+<)([\d.\w#-*]+\s*){1,5}\s*)(?!\.)(</span>)?\s*</b>",
True, True, True, False, "Searching for emphasized lines", 'emphasized'], # Emphasized lines
[r"[^'\"]?(\d+(\.|:))\s*([\w\-\'\"#,]+\s*){0,7}\s*", True, True, True, False,
"Searching for numeric chapter headings", 'numeric'], # Numeric Chapters
[r"([A-Z]\s+){3,}\s*([\d\w-]+\s*){0,3}\s*", True, True, True, False, "Searching for letter spaced headings", 'letter_spaced'], # Spaced Lettering
[r"[^'\"]?(\d+\.?\s+([\d\w-]+\:?\'?-?\s?){0,5})\s*", True, True, True, False,
"Searching for numeric chapters with titles", 'numeric_title'], # Numeric Titles
[r"[^'\"]?(\d+)\s*([\dA-Z\-\'\"\?!#,]+\s*){0,7}\s*", True, True, True, False,
"Searching for simple numeric headings", 'plain_number'], # Numeric Chapters, no dot or colon
[r"\s*[^'\"]?([A-Z#]+(\s|-){0,3}){1,5}\s*", False, True, False, False,
"Searching for chapters with Uppercase Characters", 'uppercase'] # Uppercase Chapters
]
def recurse_patterns(html, analyze):
# Start with most typical chapter headings, get more aggressive until one works
for [chapter_type, n_lookahead_req, strict_title, ignorecase, title_req, log_message, type_name] in chapter_types:
n_lookahead = ''
hits = 0
self.chapters_no_title = 0
self.chapters_with_title = 0
if n_lookahead_req:
lp_n_lookahead_open = n_lookahead_open
lp_n_lookahead_close = n_lookahead_close
else:
lp_n_lookahead_open = ''
lp_n_lookahead_close = ''
if strict_title:
lp_title = default_title
else:
lp_title = simple_title
if ignorecase:
arg_ignorecase = r'(?i)'
else:
arg_ignorecase = ''
if title_req:
lp_opt_title_open = ''
lp_opt_title_close = ''
else:
lp_opt_title_open = opt_title_open
lp_opt_title_close = opt_title_close
if self.html_preprocess_sections >= self.min_chapters:
break
full_chapter_line = chapter_line_open+chapter_header_open+chapter_type+chapter_header_close+chapter_line_close
if n_lookahead_req:
n_lookahead = re.sub("(ou|in|cha)", "lookahead_", full_chapter_line)
if not analyze:
self.log.debug("Marked " + unicode_type(self.html_preprocess_sections) + " headings, " + log_message)
chapter_marker = arg_ignorecase+init_lookahead+full_chapter_line+blank_lines+lp_n_lookahead_open+n_lookahead+lp_n_lookahead_close+ \
lp_opt_title_open+title_line_open+title_header_open+lp_title+title_header_close+title_line_close+lp_opt_title_close
chapdetect = re.compile(r'%s' % chapter_marker)
if analyze:
hits = len(chapdetect.findall(html))
if hits:
chapdetect.sub(self.analyze_title_matches, html)
if float(self.chapters_with_title) / float(hits) > .5:
title_req = True
strict_title = False
self.log.debug(
unicode_type(type_name)+" had "+unicode_type(hits)+
" hits - "+unicode_type(self.chapters_no_title)+" chapters with no title, "+
unicode_type(self.chapters_with_title)+" chapters with titles, "+
unicode_type(float(self.chapters_with_title) / float(hits))+" percent. ")
if type_name == 'common':
analysis_result.append([chapter_type, n_lookahead_req, strict_title, ignorecase, title_req, log_message, type_name])
elif self.min_chapters <= hits < max_chapters or self.min_chapters < 3 > hits:
analysis_result.append([chapter_type, n_lookahead_req, strict_title, ignorecase, title_req, log_message, type_name])
break
else:
html = chapdetect.sub(self.chapter_head, html)
return html
recurse_patterns(html, True)
chapter_types = analysis_result
html = recurse_patterns(html, False)
words_per_chptr = wordcount
if words_per_chptr > 0 and self.html_preprocess_sections > 0:
words_per_chptr = wordcount // self.html_preprocess_sections
self.log.debug("Total wordcount is: "+ unicode_type(wordcount)+", Average words per section is: "+
unicode_type(words_per_chptr)+", Marked up "+unicode_type(self.html_preprocess_sections)+" chapters")
return html
def punctuation_unwrap(self, length, content, format):
'''
Unwraps lines based on line length and punctuation
supports a range of html markup and text files
the lookahead regex below is meant look for any non-full stop characters - punctuation
characters which can be used as a full stop should *not* be added below - e.g. ?!“”. etc
the reason for this is to prevent false positive wrapping. False positives are more
difficult to detect than false negatives during a manual review of the doc
This function intentionally leaves hyphenated content alone as that is handled by the
dehyphenate routine in a separate step
'''
def style_unwrap(match):
style_close = match.group('style_close')
style_open = match.group('style_open')
if style_open and style_close:
return style_close+' '+style_open
elif style_open and not style_close:
return ' '+style_open
elif not style_open and style_close:
return style_close+' '
else:
return ' '
# define the pieces of the regex
# (?<!\&\w{4});) is a semicolon not part of an entity
lookahead = "(?<=.{"+unicode_type(length)+r"}([a-zა-ჰäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:)\\IAß]|(?<!\&\w{4});))"
em_en_lookahead = "(?<=.{"+unicode_type(length)+"}[\u2013\u2014])"
soft_hyphen = "\xad"
line_ending = "\\s*(?P<style_close></(span|[iub])>)?\\s*(</(p|div)>)?"
blanklines = "\\s*(?P<up2threeblanks><(p|span|div)[^>]*>\\s*(<(p|span|div)[^>]*>\\s*</(span|p|div)>\\s*)</(span|p|div)>\\s*){0,3}\\s*"
line_opening = "<(p|div)[^>]*>\\s*(?P<style_open><(span|[iub])[^>]*>)?\\s*"
txt_line_wrap = "((\u0020|\u0009)*\n){1,4}"
if format == 'txt':
unwrap_regex = lookahead+txt_line_wrap
em_en_unwrap_regex = em_en_lookahead+txt_line_wrap
shy_unwrap_regex = soft_hyphen+txt_line_wrap
else:
unwrap_regex = lookahead+line_ending+blanklines+line_opening
em_en_unwrap_regex = em_en_lookahead+line_ending+blanklines+line_opening
shy_unwrap_regex = soft_hyphen+line_ending+blanklines+line_opening
unwrap = re.compile("%s" % unwrap_regex, re.UNICODE)
em_en_unwrap = re.compile("%s" % em_en_unwrap_regex, re.UNICODE)
shy_unwrap = re.compile("%s" % shy_unwrap_regex, re.UNICODE)
if format == 'txt':
content = unwrap.sub(' ', content)
content = em_en_unwrap.sub('', content)
content = shy_unwrap.sub('', content)
else:
content = unwrap.sub(style_unwrap, content)
content = em_en_unwrap.sub(style_unwrap, content)
content = shy_unwrap.sub(style_unwrap, content)
return content
def txt_process(self, match):
from calibre.ebooks.txt.processor import convert_basic, separate_paragraphs_single_line
content = match.group('text')
content = separate_paragraphs_single_line(content)
content = convert_basic(content, epub_split_size_kb=0)
return content
def markup_pre(self, html):
pre = re.compile(r'<pre>', re.IGNORECASE)
if len(pre.findall(html)) >= 1:
self.log.debug("Running Text Processing")
outerhtml = re.compile(r'.*?(?<=<pre>)(?P<text>.*?)</pre>', re.IGNORECASE|re.DOTALL)
html = outerhtml.sub(self.txt_process, html)
from calibre.ebooks.conversion.preprocess import convert_entities
html = re.sub(r'&(\S+?);', convert_entities, html)
else:
# Add markup naively
# TODO - find out if there are cases where there are more than one <pre> tag or
# other types of unmarked html and handle them in some better fashion
add_markup = re.compile('(?<!>)(\n)')
html = add_markup.sub('</p>\n<p>', html)
return html
def arrange_htm_line_endings(self, html):
html = re.sub(r"\s*</(?P<tag>p|div)>", "</"+"\\g<tag>"+">\n", html)
html = re.sub(r"\s*<(?P<tag>p|div)(?P<style>[^>]*)>\s*", "\n<"+"\\g<tag>"+"\\g<style>"+">", html)
return html
def fix_nbsp_indents(self, html):
txtindent = re.compile(unicode_type(r'<(?P<tagtype>p|div)(?P<formatting>[^>]*)>\s*(?P<span>(<span[^>]*>\s*)+)?\s*(\u00a0){2,}'), re.IGNORECASE)
html = txtindent.sub(self.insert_indent, html)
if self.found_indents > 1:
self.log.debug("replaced "+unicode_type(self.found_indents)+ " nbsp indents with inline styles")
return html
def cleanup_markup(self, html):
# remove remaining non-breaking spaces
html = re.sub(unicode_type(r'\u00a0'), ' ', html)
# Get rid of various common microsoft specific tags which can cause issues later
# Get rid of empty <o:p> tags to simplify other processing
html = re.sub(unicode_type(r'\s*<o:p>\s*</o:p>'), ' ', html)
# Delete microsoft 'smart' tags
html = re.sub('(?i)</?st1:\\w+>', '', html)
# Re-open self closing paragraph tags
html = re.sub('<p[^>/]*/>', '<p> </p>', html)
# Get rid of empty span, bold, font, em, & italics tags
fmt_tags = 'font|[ibu]|em|strong'
open_fmt_pat, close_fmt_pat = r'<(?:{})(?:\s[^>]*)?>'.format(fmt_tags), '</(?:{})>'.format(fmt_tags)
for i in range(2):
html = re.sub(r"\s*<span[^>]*>\s*(<span[^>]*>\s*</span>){0,2}\s*</span>\s*", " ", html)
html = re.sub(
r"\s*{open}\s*({open}\s*{close}\s*){{0,2}}\s*{close}".format(open=open_fmt_pat, close=close_fmt_pat) , " ", html)
# delete surrounding divs from empty paragraphs
html = re.sub('<div[^>]*>\\s*<p[^>]*>\\s*</p>\\s*</div>', '<p> </p>', html)
# Empty heading tags
html = re.sub(r'(?i)<h\d+>\s*</h\d+>', '', html)
self.deleted_nbsps = True
return html
def analyze_line_endings(self, html):
'''
determines the type of html line ending used most commonly in a document
use before calling docanalysis functions
'''
paras_reg = re.compile('<p[^>]*>', re.IGNORECASE)
spans_reg = re.compile('<span[^>]*>', re.IGNORECASE)
paras = len(paras_reg.findall(html))
spans = len(spans_reg.findall(html))
if spans > 1:
if float(paras) / float(spans) < 0.75:
return 'spanned_html'
else:
return 'html'
else:
return 'html'
def analyze_blanks(self, html):
blanklines = self.blankreg.findall(html)
lines = self.linereg.findall(html)
if len(lines) > 1:
self.log.debug("There are " + unicode_type(len(blanklines)) + " blank lines. " +
unicode_type(float(len(blanklines)) / float(len(lines))) + " percent blank")
if float(len(blanklines)) / float(len(lines)) > 0.40:
return True
else:
return False
def cleanup_required(self):
for option in ['unwrap_lines', 'markup_chapter_headings', 'format_scene_breaks', 'delete_blank_paragraphs']:
if getattr(self.extra_opts, option, False):
return True
return False
def merge_blanks(self, html, blanks_count=None):
base_em = .5 # Baseline is 1.5em per blank line, 1st line is .5 em css and 1em for the nbsp
em_per_line = 1.5 # Add another 1.5 em for each additional blank
def merge_matches(match):
to_merge = match.group(0)
lines = float(len(self.single_blank.findall(to_merge))) - 1.
em = base_em + (em_per_line * lines)
if to_merge.find('whitespace'):
newline = self.any_multi_blank.sub('\n<p class="whitespace'+unicode_type(int(em * 10))+
'" style="text-align:center; margin-top:'+unicode_type(em)+'em"> </p>', match.group(0))
else:
newline = self.any_multi_blank.sub('\n<p class="softbreak'+unicode_type(int(em * 10))+
'" style="text-align:center; margin-top:'+unicode_type(em)+'em"> </p>', match.group(0))
return newline
html = self.any_multi_blank.sub(merge_matches, html)
return html
def detect_whitespace(self, html):
blanks_around_headings = re.compile(
r'(?P<initparas>(<(p|div)[^>]*>\s*</(p|div)>\s*){1,}\s*)?'
r'(?P<content><h(?P<hnum>\d+)[^>]*>.*?</h(?P=hnum)>)(?P<endparas>\s*(<(p|div)[^>]*>\s*</(p|div)>\s*){1,})?', re.IGNORECASE|re.DOTALL)
blanks_around_scene_breaks = re.compile(
r'(?P<initparas>(<(p|div)[^>]*>\s*</(p|div)>\s*){1,}\s*)?'
r'(?P<content><p class="scenebreak"[^>]*>.*?</p>)(?P<endparas>\s*(<(p|div)[^>]*>\s*</(p|div)>\s*){1,})?', re.IGNORECASE|re.DOTALL)
blanks_n_nopunct = re.compile(
r'(?P<initparas>(<p[^>]*>\s*</p>\s*){1,}\s*)?<p[^>]*>\s*(<(span|[ibu]|em|strong|font)[^>]*>\s*)*'
r'.{1,100}?[^\W](</(span|[ibu]|em|strong|font)>\s*)*</p>(?P<endparas>\s*(<p[^>]*>\s*</p>\s*){1,})?', re.IGNORECASE|re.DOTALL)
def merge_header_whitespace(match):
initblanks = match.group('initparas')
endblanks = match.group('endparas')
content = match.group('content')
top_margin = ''
bottom_margin = ''
if initblanks is not None:
top_margin = 'margin-top:'+unicode_type(len(self.single_blank.findall(initblanks)))+'em;'
if endblanks is not None:
bottom_margin = 'margin-bottom:'+unicode_type(len(self.single_blank.findall(endblanks)))+'em;'
if initblanks is None and endblanks is None:
return content
elif content.find('scenebreak') != -1:
return content
else:
content = re.sub('(?i)<h(?P<hnum>\\d+)[^>]*>', '\n\n<h'+'\\g<hnum>'+' style="'+top_margin+bottom_margin+'">', content)
return content
html = blanks_around_headings.sub(merge_header_whitespace, html)
html = blanks_around_scene_breaks.sub(merge_header_whitespace, html)
def markup_whitespaces(match):
blanks = match.group(0)
blanks = self.blankreg.sub('\n<p class="whitespace" style="text-align:center; margin-top:0em; margin-bottom:0em"> </p>', blanks)
return blanks
html = blanks_n_nopunct.sub(markup_whitespaces, html)
if self.html_preprocess_sections > self.min_chapters:
html = re.sub('(?si)^.*?(?=<h\\d)', markup_whitespaces, html)
return html
def detect_soft_breaks(self, html):
line = '(?P<initline>'+self.line_open+'\\s*(?P<init_content>.*?)'+self.line_close+')'
line_two = '(?P<line_two>'+re.sub('(ou|in|cha)', 'linetwo_', self.line_open)+ \
'\\s*(?P<line_two_content>.*?)'+re.sub('(ou|in|cha)', 'linetwo_', self.line_close)+')'
div_break_candidate_pattern = line+'\\s*<div[^>]*>\\s*</div>\\s*'+line_two
div_break_candidate = re.compile(r'%s' % div_break_candidate_pattern, re.IGNORECASE|re.UNICODE)
def convert_div_softbreaks(match):
init_is_paragraph = self.check_paragraph(match.group('init_content'))
line_two_is_paragraph = self.check_paragraph(match.group('line_two_content'))
if init_is_paragraph and line_two_is_paragraph:
return (match.group('initline')+
'\n<p class="softbreak" style="margin-top:.5em; page-break-before:avoid; text-align:center"> </p>\n'+
match.group('line_two'))
else:
return match.group(0)
html = div_break_candidate.sub(convert_div_softbreaks, html)
if not self.blanks_deleted and self.blanks_between_paragraphs:
html = self.multi_blank.sub('\n<p class="softbreak" style="margin-top:1em; page-break-before:avoid; text-align:center"> </p>', html)
else:
html = self.blankreg.sub('\n<p class="softbreak" style="margin-top:.5em; page-break-before:avoid; text-align:center"> </p>', html)
return html
def detect_scene_breaks(self, html):
scene_break_regex = self.line_open+'(?!('+self.common_in_text_beginnings+'|.*?'+self.common_in_text_endings+ \
'<))(?P<break>((?P<break_char>((?!\\s)\\W))\\s*(?P=break_char)?)+)\\s*'+self.line_close
scene_breaks = re.compile(r'%s' % scene_break_regex, re.IGNORECASE|re.UNICODE)
html = scene_breaks.sub(self.scene_break_open+'\\g<break>'+'</p>', html)
return html
def markup_user_break(self, replacement_break):
'''
Takes string a user supplies and wraps it in markup that will be centered with
appropriate margins. <hr> and <img> tags are allowed. If the user specifies
a style with width attributes in the <hr> tag then the appropriate margins are
applied to wrapping divs. This is because many ebook devices don't support margin:auto
All other html is converted to text.
'''
hr_open = '<div id="scenebreak" style="margin-left: 45%; margin-right: 45%; margin-top:1.5em; margin-bottom:1.5em; page-break-before:avoid">'
if re.findall('(<|>)', replacement_break):
if re.match('^<hr', replacement_break):
if replacement_break.find('width') != -1:
try:
width = int(re.sub('.*?width(:|=)(?P<wnum>\\d+).*', '\\g<wnum>', replacement_break))
except:
scene_break = hr_open+'<hr style="height: 3px; background:#505050" /></div>'
self.log.warn('Invalid replacement scene break'
' expression, using default')
else:
replacement_break = re.sub('(?i)(width=\\d+\\%?|width:\\s*\\d+(\\%|px|pt|em)?;?)', '', replacement_break)
divpercent = (100 - width) // 2
hr_open = re.sub('45', unicode_type(divpercent), hr_open)
scene_break = hr_open+replacement_break+'</div>'
else:
scene_break = hr_open+'<hr style="height: 3px; background:#505050" /></div>'
elif re.match('^<img', replacement_break):
scene_break = self.scene_break_open+replacement_break+'</p>'
else:
from calibre.utils.html2text import html2text
replacement_break = html2text(replacement_break)
replacement_break = re.sub('\\s', '&nbsp;', replacement_break)
scene_break = self.scene_break_open+replacement_break+'</p>'
else:
replacement_break = re.sub('\\s', '&nbsp;', replacement_break)
scene_break = self.scene_break_open+replacement_break+'</p>'
return scene_break
def check_paragraph(self, content):
content = re.sub('\\s*</?span[^>]*>\\s*', '', content)
if re.match('.*[\"\'.!?:]$', content):
# print "detected this as a paragraph"
return True
else:
return False
def abbyy_processor(self, html):
abbyy_line = re.compile('((?P<linestart><p\\sstyle="(?P<styles>[^\"]*?);?">)(?P<content>.*?)(?P<lineend></p>)|(?P<image><img[^>]*>))', re.IGNORECASE)
empty_paragraph = '\n<p> </p>\n'
self.in_blockquote = False
self.previous_was_paragraph = False
html = re.sub('</?a[^>]*>', '', html)
def convert_styles(match):
# print "raw styles are: "+match.group('styles')
content = match.group('content')
# print "raw content is: "+match.group('content')
image = match.group('image')
is_paragraph = False
text_align = ''
text_indent = ''
paragraph_before = ''
paragraph_after = ''
blockquote_open = '\n<blockquote>\n'
blockquote_close = '</blockquote>\n'
indented_text = 'text-indent:3%;'
blockquote_open_loop = ''
blockquote_close_loop = ''
debugabby = False
if image:
debugabby = True
if self.in_blockquote:
self.in_blockquote = False
blockquote_close_loop = blockquote_close
self.previous_was_paragraph = False
return blockquote_close_loop+'\n'+image+'\n'
else:
styles = match.group('styles').split(';')
is_paragraph = self.check_paragraph(content)
# print "styles for this line are: "+unicode_type(styles)
split_styles = []
for style in styles:
# print "style is: "+unicode_type(style)
newstyle = style.split(':')
# print "newstyle is: "+unicode_type(newstyle)
split_styles.append(newstyle)
styles = split_styles
for style, setting in styles:
if style == 'text-align' and setting != 'left':
text_align = style+':'+setting+';'
if style == 'text-indent':
setting = int(re.sub('\\s*pt\\s*', '', setting))
if 9 < setting < 14:
text_indent = indented_text
else:
text_indent = style+':'+unicode_type(setting)+'pt;'
if style == 'padding':
setting = re.sub('pt', '', setting).split(' ')
if int(setting[1]) < 16 and int(setting[3]) < 16:
if self.in_blockquote:
debugabby = True
if is_paragraph:
self.in_blockquote = False
blockquote_close_loop = blockquote_close
if int(setting[3]) > 8 and text_indent == '':
text_indent = indented_text
if int(setting[0]) > 5:
paragraph_before = empty_paragraph
if int(setting[2]) > 5:
paragraph_after = empty_paragraph
elif not self.in_blockquote and self.previous_was_paragraph:
debugabby = True
self.in_blockquote = True
blockquote_open_loop = blockquote_open
if debugabby:
self.log.debug('\n\n******\n')
self.log.debug('padding top is: '+unicode_type(setting[0]))
self.log.debug('padding right is:' +unicode_type(setting[1]))
self.log.debug('padding bottom is: ' + unicode_type(setting[2]))
self.log.debug('padding left is: ' +unicode_type(setting[3]))
# print "text-align is: "+unicode_type(text_align)
# print "\n***\nline is:\n "+unicode_type(match.group(0))+'\n'
if debugabby:
# print "this line is a paragraph = "+unicode_type(is_paragraph)+", previous line was "+unicode_type(self.previous_was_paragraph)
self.log.debug("styles for this line were:", styles)
self.log.debug('newline is:')
self.log.debug(blockquote_open_loop+blockquote_close_loop+
paragraph_before+'<p style="'+text_indent+text_align+
'">'+content+'</p>'+paragraph_after+'\n\n\n\n\n')
# print "is_paragraph is "+unicode_type(is_paragraph)+", previous_was_paragraph is "+unicode_type(self.previous_was_paragraph)
self.previous_was_paragraph = is_paragraph
# print "previous_was_paragraph is now set to "+unicode_type(self.previous_was_paragraph)+"\n\n\n"
return blockquote_open_loop+blockquote_close_loop+paragraph_before+'<p style="'+text_indent+text_align+'">'+content+'</p>'+paragraph_after
html = abbyy_line.sub(convert_styles, html)
return html
def __call__(self, html):
self.log.debug("********* Heuristic processing HTML *********")
# Count the words in the document to estimate how many chapters to look for and whether
# other types of processing are attempted
try:
self.totalwords = self.get_word_count(html)
except:
self.log.warn("Can't get wordcount")
if self.totalwords < 50:
self.log.warn("flow is too short, not running heuristics")
return html
is_abbyy = self.is_abbyy(html)
if is_abbyy:
html = self.abbyy_processor(html)
# Arrange line feeds and </p> tags so the line_length and no_markup functions work correctly
html = self.arrange_htm_line_endings(html)
# self.dump(html, 'after_arrange_line_endings')
if self.cleanup_required():
# ##### Check Markup ######
#
# some lit files don't have any <p> tags or equivalent (generally just plain text between
# <pre> tags), check and mark up line endings if required before proceeding
# fix indents must run after this step
if self.no_markup(html, 0.1):
self.log.debug("not enough paragraph markers, adding now")
# markup using text processing
html = self.markup_pre(html)
# Replace series of non-breaking spaces with text-indent
if getattr(self.extra_opts, 'fix_indents', False):
html = self.fix_nbsp_indents(html)
if self.cleanup_required():
# fix indents must run before this step, as it removes non-breaking spaces
html = self.cleanup_markup(html)
is_pdftohtml = self.is_pdftohtml(html)
if is_pdftohtml:
self.line_open = "<(?P<outer>p)[^>]*>(\\s*<[ibu][^>]*>)?\\s*"
self.line_close = "\\s*(</[ibu][^>]*>\\s*)?</(?P=outer)>"
# ADE doesn't render <br />, change to empty paragraphs
# html = re.sub('<br[^>]*>', u'<p>\u00a0</p>', html)
# Determine whether the document uses interleaved blank lines
self.blanks_between_paragraphs = self.analyze_blanks(html)
# detect chapters/sections to match xpath or splitting logic
if getattr(self.extra_opts, 'markup_chapter_headings', False):
html = self.markup_chapters(html, self.totalwords, self.blanks_between_paragraphs)
# self.dump(html, 'after_chapter_markup')
if getattr(self.extra_opts, 'italicize_common_cases', False):
html = self.markup_italicis(html)
# If more than 40% of the lines are empty paragraphs and the user has enabled delete
# blank paragraphs then delete blank lines to clean up spacing
if self.blanks_between_paragraphs and getattr(self.extra_opts, 'delete_blank_paragraphs', False):
self.log.debug("deleting blank lines")
self.blanks_deleted = True
html = self.multi_blank.sub('\n<p class="softbreak" style="margin-top:.5em; page-break-before:avoid; text-align:center"> </p>', html)
html = self.blankreg.sub('', html)
# Determine line ending type
# Some OCR sourced files have line breaks in the html using a combination of span & p tags
# span are used for hard line breaks, p for new paragraphs. Determine which is used so
# that lines can be un-wrapped across page boundaries
format = self.analyze_line_endings(html)
# Check Line histogram to determine if the document uses hard line breaks, If 50% or
# more of the lines break in the same region of the document then unwrapping is required
docanalysis = DocAnalysis(format, html)
hardbreaks = docanalysis.line_histogram(.50)
self.log.debug("Hard line breaks check returned "+unicode_type(hardbreaks))
# Calculate Length
unwrap_factor = getattr(self.extra_opts, 'html_unwrap_factor', 0.4)
length = docanalysis.line_length(unwrap_factor)
self.log.debug("Median line length is " + unicode_type(length) + ", calculated with " + format + " format")
# ##### Unwrap lines ######
if getattr(self.extra_opts, 'unwrap_lines', False):
# only go through unwrapping code if the histogram shows unwrapping is required or if the user decreased the default unwrap_factor
if hardbreaks or unwrap_factor < 0.4:
self.log.debug("Unwrapping required, unwrapping Lines")
# Dehyphenate with line length limiters
dehyphenator = Dehyphenator(self.extra_opts.verbose, self.log)
html = dehyphenator(html,'html', length)
html = self.punctuation_unwrap(length, html, 'html')
if getattr(self.extra_opts, 'dehyphenate', False):
# dehyphenate in cleanup mode to fix anything previous conversions/editing missed
self.log.debug("Fixing hyphenated content")
dehyphenator = Dehyphenator(self.extra_opts.verbose, self.log)
html = dehyphenator(html,'html_cleanup', length)
html = dehyphenator(html, 'individual_words', length)
# If still no sections after unwrapping mark split points on lines with no punctuation
if self.html_preprocess_sections < self.min_chapters and getattr(self.extra_opts, 'markup_chapter_headings', False):
self.log.debug("Looking for more split points based on punctuation,"
" currently have " + unicode_type(self.html_preprocess_sections))
chapdetect3 = re.compile(
r'<(?P<styles>(p|div)[^>]*)>\s*(?P<section>(<span[^>]*>)?\s*(?!([\W]+\s*)+)'
r'(<[ibu][^>]*>){0,2}\s*(<span[^>]*>)?\s*(<[ibu][^>]*>){0,2}\s*(<span[^>]*>)?\s*'
r'.?(?=[a-z#\-*\s]+<)([a-z#-*]+\s*){1,5}\s*\s*(</span>)?(</[ibu]>){0,2}\s*'
r'(</span>)?\s*(</[ibu]>){0,2}\s*(</span>)?\s*</(p|div)>)', re.IGNORECASE)
html = chapdetect3.sub(self.chapter_break, html)
if getattr(self.extra_opts, 'renumber_headings', False):
# search for places where a first or second level heading is immediately followed by another
# top level heading. demote the second heading to h3 to prevent splitting between chapter
# headings and titles, images, etc
doubleheading = re.compile(
r'(?P<firsthead><h(1|2)[^>]*>.+?</h(1|2)>\s*(<(?!h\d)[^>]*>\s*)*)<h(1|2)(?P<secondhead>[^>]*>.+?)</h(1|2)>', re.IGNORECASE)
html = doubleheading.sub('\\g<firsthead>'+'\n<h3'+'\\g<secondhead>'+'</h3>', html)
# If scene break formatting is enabled, find all blank paragraphs that definitely aren't scenebreaks,
# style it with the 'whitespace' class. All remaining blank lines are styled as softbreaks.
# Multiple sequential blank paragraphs are merged with appropriate margins
# If non-blank scene breaks exist they are center aligned and styled with appropriate margins.
if getattr(self.extra_opts, 'format_scene_breaks', False):
self.log.debug('Formatting scene breaks')
html = re.sub('(?i)<div[^>]*>\\s*<br(\\s?/)?>\\s*</div>', '<p></p>', html)
html = self.detect_scene_breaks(html)
html = self.detect_whitespace(html)
html = self.detect_soft_breaks(html)
blanks_count = len(self.any_multi_blank.findall(html))
if blanks_count >= 1:
html = self.merge_blanks(html, blanks_count)
detected_scene_break = re.compile(r'<p class="scenebreak"[^>]*>.*?</p>')
scene_break_count = len(detected_scene_break.findall(html))
# If the user has enabled scene break replacement, then either softbreaks
# or 'hard' scene breaks are replaced, depending on which is in use
# Otherwise separator lines are centered, use a bit larger margin in this case
replacement_break = getattr(self.extra_opts, 'replace_scene_breaks', None)
if replacement_break:
replacement_break = self.markup_user_break(replacement_break)
if scene_break_count >= 1:
html = detected_scene_break.sub(replacement_break, html)
html = re.sub('<p\\s+class="softbreak"[^>]*>\\s*</p>', replacement_break, html)
else:
html = re.sub('<p\\s+class="softbreak"[^>]*>\\s*</p>', replacement_break, html)
if self.deleted_nbsps:
# put back non-breaking spaces in empty paragraphs so they render correctly
html = self.anyblank.sub('\n'+r'\g<openline>'+'\u00a0'+r'\g<closeline>', html)
return html

View File

@@ -0,0 +1,11 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
class InvalidDOCX(ValueError):
pass

View File

@@ -0,0 +1,478 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
import numbers
from collections import OrderedDict
from polyglot.builtins import iteritems
class Inherit(object):
def __eq__(self, other):
return other is self
def __hash__(self):
return id(self)
def __lt__(self, other):
return False
def __gt__(self, other):
return other is not self
def __ge__(self, other):
if self is other:
return True
return True
def __le__(self, other):
if self is other:
return True
return False
inherit = Inherit()
def binary_property(parent, name, XPath, get):
vals = XPath('./w:%s' % name)(parent)
if not vals:
return inherit
val = get(vals[0], 'w:val', 'on')
return True if val in {'on', '1', 'true'} else False
def simple_color(col, auto='black'):
if not col or col == 'auto' or len(col) != 6:
return auto
return '#'+col
def simple_float(val, mult=1.0):
try:
return float(val) * mult
except (ValueError, TypeError, AttributeError, KeyError):
pass
def twips(val, mult=0.05):
''' Parse val as either a pure number representing twentieths of a point or a number followed by the suffix pt, representing pts.'''
try:
return float(val) * mult
except (ValueError, TypeError, AttributeError, KeyError):
if val and val.endswith('pt') and mult == 0.05:
return twips(val[:-2], mult=1.0)
LINE_STYLES = { # {{{
'basicBlackDashes': 'dashed',
'basicBlackDots': 'dotted',
'basicBlackSquares': 'dashed',
'basicThinLines': 'solid',
'dashDotStroked': 'groove',
'dashed': 'dashed',
'dashSmallGap': 'dashed',
'dotDash': 'dashed',
'dotDotDash': 'dashed',
'dotted': 'dotted',
'double': 'double',
'inset': 'inset',
'nil': 'none',
'none': 'none',
'outset': 'outset',
'single': 'solid',
'thick': 'solid',
'thickThinLargeGap': 'double',
'thickThinMediumGap': 'double',
'thickThinSmallGap' : 'double',
'thinThickLargeGap': 'double',
'thinThickMediumGap': 'double',
'thinThickSmallGap': 'double',
'thinThickThinLargeGap': 'double',
'thinThickThinMediumGap': 'double',
'thinThickThinSmallGap': 'double',
'threeDEmboss': 'ridge',
'threeDEngrave': 'groove',
'triple': 'double',
} # }}}
# Read from XML {{{
border_props = ('padding_%s', 'border_%s_width', 'border_%s_style', 'border_%s_color')
border_edges = ('left', 'top', 'right', 'bottom', 'between')
def read_single_border(parent, edge, XPath, get):
color = style = width = padding = None
for elem in XPath('./w:%s' % edge)(parent):
c = get(elem, 'w:color')
if c is not None:
color = simple_color(c)
s = get(elem, 'w:val')
if s is not None:
style = LINE_STYLES.get(s, 'solid')
space = get(elem, 'w:space')
if space is not None:
try:
padding = float(space)
except (ValueError, TypeError):
pass
sz = get(elem, 'w:sz')
if sz is not None:
# we dont care about art borders (they are only used for page borders)
try:
width = min(96, max(2, float(sz))) / 8
except (ValueError, TypeError):
pass
return {p:v for p, v in zip(border_props, (padding, width, style, color))}
def read_border(parent, dest, XPath, get, border_edges=border_edges, name='pBdr'):
vals = {k % edge:inherit for edge in border_edges for k in border_props}
for border in XPath('./w:' + name)(parent):
for edge in border_edges:
for prop, val in iteritems(read_single_border(border, edge, XPath, get)):
if val is not None:
vals[prop % edge] = val
for key, val in iteritems(vals):
setattr(dest, key, val)
def border_to_css(edge, style, css):
bs = getattr(style, 'border_%s_style' % edge)
bc = getattr(style, 'border_%s_color' % edge)
bw = getattr(style, 'border_%s_width' % edge)
if isinstance(bw, numbers.Number):
# WebKit needs at least 1pt to render borders and 3pt to render double borders
bw = max(bw, (3 if bs == 'double' else 1))
if bs is not inherit and bs is not None:
css['border-%s-style' % edge] = bs
if bc is not inherit and bc is not None:
css['border-%s-color' % edge] = bc
if bw is not inherit and bw is not None:
if isinstance(bw, numbers.Number):
bw = '%.3gpt' % bw
css['border-%s-width' % edge] = bw
def read_indent(parent, dest, XPath, get):
padding_left = padding_right = text_indent = inherit
for indent in XPath('./w:ind')(parent):
l, lc = get(indent, 'w:left'), get(indent, 'w:leftChars')
pl = simple_float(lc, 0.01) if lc is not None else simple_float(l, 0.05) if l is not None else None
if pl is not None:
padding_left = '%.3g%s' % (pl, 'em' if lc is not None else 'pt')
r, rc = get(indent, 'w:right'), get(indent, 'w:rightChars')
pr = simple_float(rc, 0.01) if rc is not None else simple_float(r, 0.05) if r is not None else None
if pr is not None:
padding_right = '%.3g%s' % (pr, 'em' if rc is not None else 'pt')
h, hc = get(indent, 'w:hanging'), get(indent, 'w:hangingChars')
fl, flc = get(indent, 'w:firstLine'), get(indent, 'w:firstLineChars')
h = h if h is None else '-'+h
hc = hc if hc is None else '-'+hc
ti = (simple_float(hc, 0.01) if hc is not None else simple_float(h, 0.05) if h is not None else
simple_float(flc, 0.01) if flc is not None else simple_float(fl, 0.05) if fl is not None else None)
if ti is not None:
text_indent = '%.3g%s' % (ti, 'em' if hc is not None or (h is None and flc is not None) else 'pt')
setattr(dest, 'margin_left', padding_left)
setattr(dest, 'margin_right', padding_right)
setattr(dest, 'text_indent', text_indent)
def read_justification(parent, dest, XPath, get):
ans = inherit
for jc in XPath('./w:jc[@w:val]')(parent):
val = get(jc, 'w:val')
if not val:
continue
if val in {'both', 'distribute'} or 'thai' in val or 'kashida' in val:
ans = 'justify'
elif val in {'left', 'center', 'right', 'start', 'end'}:
ans = val
elif val in {'start', 'end'}:
ans = {'start':'left'}.get(val, 'right')
setattr(dest, 'text_align', ans)
def read_spacing(parent, dest, XPath, get):
padding_top = padding_bottom = line_height = inherit
for s in XPath('./w:spacing')(parent):
a, al, aa = get(s, 'w:after'), get(s, 'w:afterLines'), get(s, 'w:afterAutospacing')
pb = None if aa in {'on', '1', 'true'} else simple_float(al, 0.02) if al is not None else simple_float(a, 0.05) if a is not None else None
if pb is not None:
padding_bottom = '%.3g%s' % (pb, 'ex' if al is not None else 'pt')
b, bl, bb = get(s, 'w:before'), get(s, 'w:beforeLines'), get(s, 'w:beforeAutospacing')
pt = None if bb in {'on', '1', 'true'} else simple_float(bl, 0.02) if bl is not None else simple_float(b, 0.05) if b is not None else None
if pt is not None:
padding_top = '%.3g%s' % (pt, 'ex' if bl is not None else 'pt')
l, lr = get(s, 'w:line'), get(s, 'w:lineRule', 'auto')
if l is not None:
lh = simple_float(l, 0.05) if lr in {'exact', 'atLeast'} else simple_float(l, 1/240.0)
if lh is not None:
line_height = '%.3g%s' % (lh, 'pt' if lr in {'exact', 'atLeast'} else '')
setattr(dest, 'margin_top', padding_top)
setattr(dest, 'margin_bottom', padding_bottom)
setattr(dest, 'line_height', line_height)
def read_shd(parent, dest, XPath, get):
ans = inherit
for shd in XPath('./w:shd[@w:fill]')(parent):
val = get(shd, 'w:fill')
if val:
ans = simple_color(val, auto='transparent')
setattr(dest, 'background_color', ans)
def read_numbering(parent, dest, XPath, get):
lvl = num_id = inherit
for np in XPath('./w:numPr')(parent):
for ilvl in XPath('./w:ilvl[@w:val]')(np):
try:
lvl = int(get(ilvl, 'w:val'))
except (ValueError, TypeError):
pass
for num in XPath('./w:numId[@w:val]')(np):
num_id = get(num, 'w:val')
setattr(dest, 'numbering_id', num_id)
setattr(dest, 'numbering_level', lvl)
class Frame(object):
all_attributes = ('drop_cap', 'h', 'w', 'h_anchor', 'h_rule', 'v_anchor', 'wrap',
'h_space', 'v_space', 'lines', 'x_align', 'y_align', 'x', 'y')
def __init__(self, fp, XPath, get):
self.drop_cap = get(fp, 'w:dropCap', 'none')
try:
self.h = int(get(fp, 'w:h'))/20
except (ValueError, TypeError):
self.h = 0
try:
self.w = int(get(fp, 'w:w'))/20
except (ValueError, TypeError):
self.w = None
try:
self.x = int(get(fp, 'w:x'))/20
except (ValueError, TypeError):
self.x = 0
try:
self.y = int(get(fp, 'w:y'))/20
except (ValueError, TypeError):
self.y = 0
self.h_anchor = get(fp, 'w:hAnchor', 'page')
self.h_rule = get(fp, 'w:hRule', 'auto')
self.v_anchor = get(fp, 'w:vAnchor', 'page')
self.wrap = get(fp, 'w:wrap', 'around')
self.x_align = get(fp, 'w:xAlign')
self.y_align = get(fp, 'w:yAlign')
try:
self.h_space = int(get(fp, 'w:hSpace'))/20
except (ValueError, TypeError):
self.h_space = 0
try:
self.v_space = int(get(fp, 'w:vSpace'))/20
except (ValueError, TypeError):
self.v_space = 0
try:
self.lines = int(get(fp, 'w:lines'))
except (ValueError, TypeError):
self.lines = 1
def css(self, page):
is_dropcap = self.drop_cap in {'drop', 'margin'}
ans = {'overflow': 'hidden'}
if is_dropcap:
ans['float'] = 'left'
ans['margin'] = '0'
ans['padding-right'] = '0.2em'
else:
if self.h_rule != 'auto':
t = 'min-height' if self.h_rule == 'atLeast' else 'height'
ans[t] = '%.3gpt' % self.h
if self.w is not None:
ans['width'] = '%.3gpt' % self.w
ans['padding-top'] = ans['padding-bottom'] = '%.3gpt' % self.v_space
if self.wrap not in {None, 'none'}:
ans['padding-left'] = ans['padding-right'] = '%.3gpt' % self.h_space
if self.x_align is None:
fl = 'left' if self.x/page.width < 0.5 else 'right'
else:
fl = 'right' if self.x_align == 'right' else 'left'
ans['float'] = fl
return ans
def __eq__(self, other):
for x in self.all_attributes:
if getattr(other, x, inherit) != getattr(self, x):
return False
return True
def __ne__(self, other):
return not self.__eq__(other)
def read_frame(parent, dest, XPath, get):
ans = inherit
for fp in XPath('./w:framePr')(parent):
ans = Frame(fp, XPath, get)
setattr(dest, 'frame', ans)
# }}}
class ParagraphStyle(object):
all_properties = (
'adjustRightInd', 'autoSpaceDE', 'autoSpaceDN', 'bidi',
'contextualSpacing', 'keepLines', 'keepNext', 'mirrorIndents',
'pageBreakBefore', 'snapToGrid', 'suppressLineNumbers',
'suppressOverlap', 'topLinePunct', 'widowControl', 'wordWrap',
# Border margins padding
'border_left_width', 'border_left_style', 'border_left_color', 'padding_left',
'border_top_width', 'border_top_style', 'border_top_color', 'padding_top',
'border_right_width', 'border_right_style', 'border_right_color', 'padding_right',
'border_bottom_width', 'border_bottom_style', 'border_bottom_color', 'padding_bottom',
'border_between_width', 'border_between_style', 'border_between_color', 'padding_between',
'margin_left', 'margin_top', 'margin_right', 'margin_bottom',
# Misc.
'text_indent', 'text_align', 'line_height', 'background_color',
'numbering_id', 'numbering_level', 'font_family', 'font_size', 'color', 'frame',
'cs_font_size', 'cs_font_family',
)
def __init__(self, namespace, pPr=None):
self.namespace = namespace
self.linked_style = None
if pPr is None:
for p in self.all_properties:
setattr(self, p, inherit)
else:
for p in (
'adjustRightInd', 'autoSpaceDE', 'autoSpaceDN', 'bidi',
'contextualSpacing', 'keepLines', 'keepNext', 'mirrorIndents',
'pageBreakBefore', 'snapToGrid', 'suppressLineNumbers',
'suppressOverlap', 'topLinePunct', 'widowControl', 'wordWrap',
):
setattr(self, p, binary_property(pPr, p, namespace.XPath, namespace.get))
for x in ('border', 'indent', 'justification', 'spacing', 'shd', 'numbering', 'frame'):
f = read_funcs[x]
f(pPr, self, namespace.XPath, namespace.get)
for s in namespace.XPath('./w:pStyle[@w:val]')(pPr):
self.linked_style = namespace.get(s, 'w:val')
self.font_family = self.font_size = self.color = self.cs_font_size = self.cs_font_family = inherit
self._css = None
self._border_key = None
def update(self, other):
for prop in self.all_properties:
nval = getattr(other, prop)
if nval is not inherit:
setattr(self, prop, nval)
if other.linked_style is not None:
self.linked_style = other.linked_style
def resolve_based_on(self, parent):
for p in self.all_properties:
val = getattr(self, p)
if val is inherit:
setattr(self, p, getattr(parent, p))
@property
def css(self):
if self._css is None:
self._css = c = OrderedDict()
if self.keepLines is True:
c['page-break-inside'] = 'avoid'
if self.pageBreakBefore is True:
c['page-break-before'] = 'always'
if self.keepNext is True:
c['page-break-after'] = 'avoid'
for edge in ('left', 'top', 'right', 'bottom'):
border_to_css(edge, self, c)
val = getattr(self, 'padding_%s' % edge)
if val is not inherit:
c['padding-%s' % edge] = '%.3gpt' % val
val = getattr(self, 'margin_%s' % edge)
if val is not inherit:
c['margin-%s' % edge] = val
if self.line_height not in {inherit, '1'}:
c['line-height'] = self.line_height
for x in ('text_indent', 'background_color', 'font_family', 'font_size', 'color'):
val = getattr(self, x)
if val is not inherit:
if x == 'font_size':
val = '%.3gpt' % val
c[x.replace('_', '-')] = val
ta = self.text_align
if ta is not inherit:
if self.bidi is True:
ta = {'left':'right', 'right':'left'}.get(ta, ta)
c['text-align'] = ta
return self._css
@property
def border_key(self):
if self._border_key is None:
k = []
for edge in border_edges:
for prop in border_props:
prop = prop % edge
k.append(getattr(self, prop))
self._border_key = tuple(k)
return self._border_key
def has_identical_borders(self, other_style):
return self.border_key == getattr(other_style, 'border_key', None)
def clear_borders(self):
for edge in border_edges[:-1]:
for prop in ('width', 'color', 'style'):
setattr(self, 'border_%s_%s' % (edge, prop), inherit)
def clone_border_styles(self):
style = ParagraphStyle(self.namespace)
for edge in border_edges[:-1]:
for prop in ('width', 'color', 'style'):
attr = 'border_%s_%s' % (edge, prop)
setattr(style, attr, getattr(self, attr))
return style
def apply_between_border(self):
for prop in ('width', 'color', 'style'):
setattr(self, 'border_bottom_%s' % prop, getattr(self, 'border_between_%s' % prop))
def has_visible_border(self):
for edge in border_edges[:-1]:
bw, bs = getattr(self, 'border_%s_width' % edge), getattr(self, 'border_%s_style' % edge)
if bw is not inherit and bw and bs is not inherit and bs != 'none':
return True
return False
read_funcs = {k[5:]:v for k, v in iteritems(globals()) if k.startswith('read_')}

View File

@@ -0,0 +1,302 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
from collections import OrderedDict
from calibre.ebooks.docx.block_styles import ( # noqa
inherit, simple_color, LINE_STYLES, simple_float, binary_property, read_shd)
# Read from XML {{{
def read_text_border(parent, dest, XPath, get):
border_color = border_style = border_width = padding = inherit
elems = XPath('./w:bdr')(parent)
if elems and elems[0].attrib:
border_color = simple_color('auto')
border_style = 'none'
border_width = 1
for elem in elems:
color = get(elem, 'w:color')
if color is not None:
border_color = simple_color(color)
style = get(elem, 'w:val')
if style is not None:
border_style = LINE_STYLES.get(style, 'solid')
space = get(elem, 'w:space')
if space is not None:
try:
padding = float(space)
except (ValueError, TypeError):
pass
sz = get(elem, 'w:sz')
if sz is not None:
# we dont care about art borders (they are only used for page borders)
try:
# A border of less than 1pt is not rendered by WebKit
border_width = min(96, max(8, float(sz))) / 8
except (ValueError, TypeError):
pass
setattr(dest, 'border_color', border_color)
setattr(dest, 'border_style', border_style)
setattr(dest, 'border_width', border_width)
setattr(dest, 'padding', padding)
def read_color(parent, dest, XPath, get):
ans = inherit
for col in XPath('./w:color[@w:val]')(parent):
val = get(col, 'w:val')
if not val:
continue
ans = simple_color(val)
setattr(dest, 'color', ans)
def convert_highlight_color(val):
return {
'darkBlue': '#000080', 'darkCyan': '#008080', 'darkGray': '#808080',
'darkGreen': '#008000', 'darkMagenta': '#800080', 'darkRed': '#800000', 'darkYellow': '#808000',
'lightGray': '#c0c0c0'}.get(val, val)
def read_highlight(parent, dest, XPath, get):
ans = inherit
for col in XPath('./w:highlight[@w:val]')(parent):
val = get(col, 'w:val')
if not val:
continue
if not val or val == 'none':
val = 'transparent'
else:
val = convert_highlight_color(val)
ans = val
setattr(dest, 'highlight', ans)
def read_lang(parent, dest, XPath, get):
ans = inherit
for col in XPath('./w:lang[@w:val]')(parent):
val = get(col, 'w:val')
if not val:
continue
try:
code = int(val, 16)
except (ValueError, TypeError):
ans = val
else:
from calibre.ebooks.docx.lcid import lcid
val = lcid.get(code, None)
if val:
ans = val
setattr(dest, 'lang', ans)
def read_letter_spacing(parent, dest, XPath, get):
ans = inherit
for col in XPath('./w:spacing[@w:val]')(parent):
val = simple_float(get(col, 'w:val'), 0.05)
if val is not None:
ans = val
setattr(dest, 'letter_spacing', ans)
def read_underline(parent, dest, XPath, get):
ans = inherit
for col in XPath('./w:u[@w:val]')(parent):
val = get(col, 'w:val')
if val:
ans = val if val == 'none' else 'underline'
setattr(dest, 'text_decoration', ans)
def read_vert_align(parent, dest, XPath, get):
ans = inherit
for col in XPath('./w:vertAlign[@w:val]')(parent):
val = get(col, 'w:val')
if val and val in {'baseline', 'subscript', 'superscript'}:
ans = val
setattr(dest, 'vert_align', ans)
def read_position(parent, dest, XPath, get):
ans = inherit
for col in XPath('./w:position[@w:val]')(parent):
val = get(col, 'w:val')
try:
ans = float(val)/2.0
except Exception:
pass
setattr(dest, 'position', ans)
def read_font(parent, dest, XPath, get):
ff = inherit
for col in XPath('./w:rFonts')(parent):
val = get(col, 'w:asciiTheme')
if val:
val = '|%s|' % val
else:
val = get(col, 'w:ascii')
if val:
ff = val
setattr(dest, 'font_family', ff)
for col in XPath('./w:sz[@w:val]')(parent):
val = simple_float(get(col, 'w:val'), 0.5)
if val is not None:
setattr(dest, 'font_size', val)
return
setattr(dest, 'font_size', inherit)
def read_font_cs(parent, dest, XPath, get):
ff = inherit
for col in XPath('./w:rFonts')(parent):
val = get(col, 'w:csTheme')
if val:
val = '|%s|' % val
else:
val = get(col, 'w:cs')
if val:
ff = val
setattr(dest, 'cs_font_family', ff)
for col in XPath('./w:szCS[@w:val]')(parent):
val = simple_float(get(col, 'w:val'), 0.5)
if val is not None:
setattr(dest, 'font_size', val)
return
setattr(dest, 'cs_font_size', inherit)
# }}}
class RunStyle(object):
all_properties = {
'b', 'bCs', 'caps', 'cs', 'dstrike', 'emboss', 'i', 'iCs', 'imprint',
'rtl', 'shadow', 'smallCaps', 'strike', 'vanish', 'webHidden',
'border_color', 'border_style', 'border_width', 'padding', 'color', 'highlight', 'background_color',
'letter_spacing', 'font_size', 'text_decoration', 'vert_align', 'lang', 'font_family', 'position',
'cs_font_size', 'cs_font_family'
}
toggle_properties = {
'b', 'bCs', 'caps', 'emboss', 'i', 'iCs', 'imprint', 'shadow', 'smallCaps', 'strike', 'vanish',
}
def __init__(self, namespace, rPr=None):
self.namespace = namespace
self.linked_style = None
if rPr is None:
for p in self.all_properties:
setattr(self, p, inherit)
else:
X, g = namespace.XPath, namespace.get
for p in (
'b', 'bCs', 'caps', 'cs', 'dstrike', 'emboss', 'i', 'iCs', 'imprint', 'rtl', 'shadow',
'smallCaps', 'strike', 'vanish', 'webHidden',
):
setattr(self, p, binary_property(rPr, p, X, g))
read_font(rPr, self, X, g)
read_font_cs(rPr, self, X, g)
read_text_border(rPr, self, X, g)
read_color(rPr, self, X, g)
read_highlight(rPr, self, X, g)
read_shd(rPr, self, X, g)
read_letter_spacing(rPr, self, X, g)
read_underline(rPr, self, X, g)
read_vert_align(rPr, self, X, g)
read_position(rPr, self, X, g)
read_lang(rPr, self, X, g)
for s in X('./w:rStyle[@w:val]')(rPr):
self.linked_style = g(s, 'w:val')
self._css = None
def update(self, other):
for prop in self.all_properties:
nval = getattr(other, prop)
if nval is not inherit:
setattr(self, prop, nval)
if other.linked_style is not None:
self.linked_style = other.linked_style
def resolve_based_on(self, parent):
for p in self.all_properties:
val = getattr(self, p)
if val is inherit:
setattr(self, p, getattr(parent, p))
def get_border_css(self, ans):
for x in ('color', 'style', 'width'):
val = getattr(self, 'border_'+x)
if x == 'width' and val is not inherit:
val = '%.3gpt' % val
if val is not inherit:
ans['border-%s' % x] = val
def clear_border_css(self):
for x in ('color', 'style', 'width'):
setattr(self, 'border_'+x, inherit)
@property
def css(self):
if self._css is None:
c = self._css = OrderedDict()
td = set()
if self.text_decoration is not inherit:
td.add(self.text_decoration)
if self.strike and self.strike is not inherit:
td.add('line-through')
if self.dstrike and self.dstrike is not inherit:
td.add('line-through')
if td:
c['text-decoration'] = ' '.join(td)
if self.caps is True:
c['text-transform'] = 'uppercase'
if self.i is True:
c['font-style'] = 'italic'
if self.shadow and self.shadow is not inherit:
c['text-shadow'] = '2px 2px'
if self.smallCaps is True:
c['font-variant'] = 'small-caps'
if self.vanish is True or self.webHidden is True:
c['display'] = 'none'
self.get_border_css(c)
if self.padding is not inherit:
c['padding'] = '%.3gpt' % self.padding
for x in ('color', 'background_color'):
val = getattr(self, x)
if val is not inherit:
c[x.replace('_', '-')] = val
for x in ('letter_spacing', 'font_size'):
val = getattr(self, x)
if val is not inherit:
c[x.replace('_', '-')] = '%.3gpt' % val
if self.position is not inherit:
c['vertical-align'] = '%.3gpt' % self.position
if self.highlight is not inherit and self.highlight != 'transparent':
c['background-color'] = self.highlight
if self.b:
c['font-weight'] = 'bold'
if self.font_family is not inherit:
c['font-family'] = self.font_family
return self._css
def same_border(self, other):
return self.get_border_css({}) == other.get_border_css({})

View File

@@ -0,0 +1,235 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
import os
from polyglot.builtins import itervalues, range
NBSP = '\xa0'
def mergeable(previous, current):
if previous.tail or current.tail:
return False
if previous.get('class', None) != current.get('class', None):
return False
if current.get('id', False):
return False
for attr in ('style', 'lang', 'dir'):
if previous.get(attr) != current.get(attr):
return False
try:
return next(previous.itersiblings()) is current
except StopIteration:
return False
def append_text(parent, text):
if len(parent) > 0:
parent[-1].tail = (parent[-1].tail or '') + text
else:
parent.text = (parent.text or '') + text
def merge(parent, span):
if span.text:
append_text(parent, span.text)
for child in span:
parent.append(child)
if span.tail:
append_text(parent, span.tail)
span.getparent().remove(span)
def merge_run(run):
parent = run[0]
for span in run[1:]:
merge(parent, span)
def liftable(css):
# A <span> is liftable if all its styling would work just as well if it is
# specified on the parent element.
prefixes = {x.partition('-')[0] for x in css}
return not (prefixes - {'text', 'font', 'letter', 'color', 'background'})
def add_text(elem, attr, text):
old = getattr(elem, attr) or ''
setattr(elem, attr, old + text)
def lift(span):
# Replace an element by its content (text, children and tail)
parent = span.getparent()
idx = parent.index(span)
try:
last_child = span[-1]
except IndexError:
last_child = None
if span.text:
if idx == 0:
add_text(parent, 'text', span.text)
else:
add_text(parent[idx - 1], 'tail', span.text)
for child in reversed(span):
parent.insert(idx, child)
parent.remove(span)
if span.tail:
if last_child is None:
if idx == 0:
add_text(parent, 'text', span.tail)
else:
add_text(parent[idx - 1], 'tail', span.tail)
else:
add_text(last_child, 'tail', span.tail)
def before_count(root, tag, limit=10):
body = root.xpath('//body[1]')
if not body:
return limit
ans = 0
for elem in body[0].iterdescendants():
if elem is tag:
return ans
ans += 1
if ans > limit:
return limit
def wrap_contents(tag_name, elem):
wrapper = elem.makeelement(tag_name)
wrapper.text, elem.text = elem.text, ''
for child in elem:
elem.remove(child)
wrapper.append(child)
elem.append(wrapper)
def cleanup_markup(log, root, styles, dest_dir, detect_cover, XPath):
# Apply vertical-align
for span in root.xpath('//span[@data-docx-vert]'):
wrap_contents(span.attrib.pop('data-docx-vert'), span)
# Move <hr>s outside paragraphs, if possible.
pancestor = XPath('|'.join('ancestor::%s[1]' % x for x in ('p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6')))
for hr in root.xpath('//span/hr'):
p = pancestor(hr)
if p:
p = p[0]
descendants = tuple(p.iterdescendants())
if descendants[-1] is hr:
parent = p.getparent()
idx = parent.index(p)
parent.insert(idx+1, hr)
hr.tail = '\n\t'
# Merge consecutive spans that have the same styling
current_run = []
for span in root.xpath('//span'):
if not current_run:
current_run.append(span)
else:
last = current_run[-1]
if mergeable(last, span):
current_run.append(span)
else:
if len(current_run) > 1:
merge_run(current_run)
current_run = [span]
# Process dir attributes
class_map = dict(itervalues(styles.classes))
parents = ('p', 'div') + tuple('h%d' % i for i in range(1, 7))
for parent in root.xpath('//*[(%s)]' % ' or '.join('name()="%s"' % t for t in parents)):
# Ensure that children of rtl parents that are not rtl have an
# explicit dir set. Also, remove dir from children if it is the same as
# that of the parent.
if len(parent):
parent_dir = parent.get('dir')
for child in parent.iterchildren('span'):
child_dir = child.get('dir')
if parent_dir == 'rtl' and child_dir != 'rtl':
child_dir = 'ltr'
child.set('dir', child_dir)
if child_dir and child_dir == parent_dir:
child.attrib.pop('dir')
# Remove unnecessary span tags that are the only child of a parent block
# element
for parent in root.xpath('//*[(%s) and count(span)=1]' % ' or '.join('name()="%s"' % t for t in parents)):
if len(parent) == 1 and not parent.text and not parent[0].tail and not parent[0].get('id', None):
# We have a block whose contents are entirely enclosed in a <span>
span = parent[0]
span_class = span.get('class', None)
span_css = class_map.get(span_class, {})
span_dir = span.get('dir')
if liftable(span_css) and (not span_dir or span_dir == parent.get('dir')):
pclass = parent.get('class', None)
if span_class:
pclass = (pclass + ' ' + span_class) if pclass else span_class
parent.set('class', pclass)
parent.text = span.text
parent.remove(span)
if span.get('lang'):
parent.set('lang', span.get('lang'))
if span.get('dir'):
parent.set('dir', span.get('dir'))
for child in span:
parent.append(child)
# Make spans whose only styling is bold or italic into <b> and <i> tags
for span in root.xpath('//span[@class and not(@style)]'):
css = class_map.get(span.get('class', None), {})
if len(css) == 1:
if css == {'font-style':'italic'}:
span.tag = 'i'
del span.attrib['class']
elif css == {'font-weight':'bold'}:
span.tag = 'b'
del span.attrib['class']
# Get rid of <span>s that have no styling
for span in root.xpath('//span[not(@class or @id or @style or @lang or @dir)]'):
lift(span)
# Convert <p><br style="page-break-after:always"> </p> style page breaks
# into something the viewer will render as a page break
for p in root.xpath('//p[br[@style="page-break-after:always"]]'):
if len(p) == 1 and (not p[0].tail or not p[0].tail.strip()):
p.remove(p[0])
prefix = p.get('style', '')
if prefix:
prefix += '; '
p.set('style', prefix + 'page-break-after:always')
p.text = NBSP if not p.text else p.text
if detect_cover:
# Check if the first image in the document is possibly a cover
img = root.xpath('//img[@src][1]')
if img:
img = img[0]
path = os.path.join(dest_dir, img.get('src'))
if os.path.exists(path) and before_count(root, img, limit=10) < 5:
from calibre.utils.imghdr import identify
try:
with lopen(path, 'rb') as imf:
fmt, width, height = identify(imf)
except:
width, height, fmt = 0, 0, None # noqa
del fmt
try:
is_cover = 0.8 <= height/width <= 1.8 and height*width >= 160000
except ZeroDivisionError:
is_cover = False
if is_cover:
log.debug('Detected an image that looks like a cover')
img.getparent().remove(img)
return path

View File

@@ -0,0 +1,268 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
import os, sys, shutil
from lxml import etree
from calibre import walk, guess_type
from calibre.ebooks.metadata import string_to_authors, authors_to_sort_string
from calibre.ebooks.metadata.book.base import Metadata
from calibre.ebooks.docx import InvalidDOCX
from calibre.ebooks.docx.names import DOCXNamespace
from calibre.ptempfile import PersistentTemporaryDirectory
from calibre.utils.localization import canonicalize_lang
from calibre.utils.logging import default_log
from calibre.utils.zipfile import ZipFile
from calibre.utils.xml_parse import safe_xml_fromstring
def fromstring(raw, parser=None):
return safe_xml_fromstring(raw)
# Read metadata {{{
def read_doc_props(raw, mi, XPath):
root = fromstring(raw)
titles = XPath('//dc:title')(root)
if titles:
title = titles[0].text
if title and title.strip():
mi.title = title.strip()
tags = []
for subject in XPath('//dc:subject')(root):
if subject.text and subject.text.strip():
tags.append(subject.text.strip().replace(',', '_'))
for keywords in XPath('//cp:keywords')(root):
if keywords.text and keywords.text.strip():
for x in keywords.text.split():
tags.extend(y.strip() for y in x.split(',') if y.strip())
if tags:
mi.tags = tags
authors = XPath('//dc:creator')(root)
aut = []
for author in authors:
if author.text and author.text.strip():
aut.extend(string_to_authors(author.text))
if aut:
mi.authors = aut
mi.author_sort = authors_to_sort_string(aut)
desc = XPath('//dc:description')(root)
if desc:
raw = etree.tostring(desc[0], method='text', encoding='unicode')
raw = raw.replace('_x000d_', '') # Word 2007 mangles newlines in the summary
mi.comments = raw.strip()
langs = []
for lang in XPath('//dc:language')(root):
if lang.text and lang.text.strip():
l = canonicalize_lang(lang.text)
if l:
langs.append(l)
if langs:
mi.languages = langs
def read_app_props(raw, mi):
root = fromstring(raw)
company = root.xpath('//*[local-name()="Company"]')
if company and company[0].text and company[0].text.strip():
mi.publisher = company[0].text.strip()
def read_default_style_language(raw, mi, XPath):
root = fromstring(raw)
for lang in XPath('/w:styles/w:docDefaults/w:rPrDefault/w:rPr/w:lang/@w:val')(root):
lang = canonicalize_lang(lang)
if lang:
mi.languages = [lang]
break
# }}}
class DOCX(object):
def __init__(self, path_or_stream, log=None, extract=True):
self.docx_is_transitional = True
stream = path_or_stream if hasattr(path_or_stream, 'read') else open(path_or_stream, 'rb')
self.name = getattr(stream, 'name', None) or '<stream>'
self.log = log or default_log
if extract:
self.extract(stream)
else:
self.init_zipfile(stream)
self.read_content_types()
self.read_package_relationships()
self.namespace = DOCXNamespace(self.docx_is_transitional)
def init_zipfile(self, stream):
self.zipf = ZipFile(stream)
self.names = frozenset(self.zipf.namelist())
def extract(self, stream):
self.tdir = PersistentTemporaryDirectory('docx_container')
try:
zf = ZipFile(stream)
zf.extractall(self.tdir)
except:
self.log.exception('DOCX appears to be invalid ZIP file, trying a'
' more forgiving ZIP parser')
from calibre.utils.localunzip import extractall
stream.seek(0)
extractall(stream, self.tdir)
self.names = {}
for f in walk(self.tdir):
name = os.path.relpath(f, self.tdir).replace(os.sep, '/')
self.names[name] = f
def exists(self, name):
return name in self.names
def read(self, name):
if hasattr(self, 'zipf'):
return self.zipf.open(name).read()
path = self.names[name]
with open(path, 'rb') as f:
return f.read()
def read_content_types(self):
try:
raw = self.read('[Content_Types].xml')
except KeyError:
raise InvalidDOCX('The file %s docx file has no [Content_Types].xml' % self.name)
root = fromstring(raw)
self.content_types = {}
self.default_content_types = {}
for item in root.xpath('//*[local-name()="Types"]/*[local-name()="Default" and @Extension and @ContentType]'):
self.default_content_types[item.get('Extension').lower()] = item.get('ContentType')
for item in root.xpath('//*[local-name()="Types"]/*[local-name()="Override" and @PartName and @ContentType]'):
name = item.get('PartName').lstrip('/')
self.content_types[name] = item.get('ContentType')
def content_type(self, name):
if name in self.content_types:
return self.content_types[name]
ext = name.rpartition('.')[-1].lower()
if ext in self.default_content_types:
return self.default_content_types[ext]
return guess_type(name)[0]
def read_package_relationships(self):
try:
raw = self.read('_rels/.rels')
except KeyError:
raise InvalidDOCX('The file %s docx file has no _rels/.rels' % self.name)
root = fromstring(raw)
self.relationships = {}
self.relationships_rmap = {}
for item in root.xpath('//*[local-name()="Relationships"]/*[local-name()="Relationship" and @Type and @Target]'):
target = item.get('Target').lstrip('/')
typ = item.get('Type')
if target == 'word/document.xml':
self.docx_is_transitional = typ != 'http://purl.oclc.org/ooxml/officeDocument/relationships/officeDocument'
self.relationships[typ] = target
self.relationships_rmap[target] = typ
@property
def document_name(self):
name = self.relationships.get(self.namespace.names['DOCUMENT'], None)
if name is None:
names = tuple(n for n in self.names if n == 'document.xml' or n.endswith('/document.xml'))
if not names:
raise InvalidDOCX('The file %s docx file has no main document' % self.name)
name = names[0]
return name
@property
def document(self):
return fromstring(self.read(self.document_name))
@property
def document_relationships(self):
return self.get_relationships(self.document_name)
def get_relationships(self, name):
base = '/'.join(name.split('/')[:-1])
by_id, by_type = {}, {}
parts = name.split('/')
name = '/'.join(parts[:-1] + ['_rels', parts[-1] + '.rels'])
try:
raw = self.read(name)
except KeyError:
pass
else:
root = fromstring(raw)
for item in root.xpath('//*[local-name()="Relationships"]/*[local-name()="Relationship" and @Type and @Target]'):
target = item.get('Target')
if item.get('TargetMode', None) != 'External' and not target.startswith('#'):
target = '/'.join((base, target.lstrip('/')))
typ = item.get('Type')
Id = item.get('Id')
by_id[Id] = by_type[typ] = target
return by_id, by_type
def get_document_properties_names(self):
name = self.relationships.get(self.namespace.names['DOCPROPS'], None)
if name is None:
names = tuple(n for n in self.names if n.lower() == 'docprops/core.xml')
if names:
name = names[0]
yield name
name = self.relationships.get(self.namespace.names['APPPROPS'], None)
if name is None:
names = tuple(n for n in self.names if n.lower() == 'docprops/app.xml')
if names:
name = names[0]
yield name
@property
def metadata(self):
mi = Metadata(_('Unknown'))
dp_name, ap_name = self.get_document_properties_names()
if dp_name:
try:
raw = self.read(dp_name)
except KeyError:
pass
else:
read_doc_props(raw, mi, self.namespace.XPath)
if mi.is_null('language'):
try:
raw = self.read('word/styles.xml')
except KeyError:
pass
else:
read_default_style_language(raw, mi, self.namespace.XPath)
ap_name = self.relationships.get(self.namespace.names['APPPROPS'], None)
if ap_name:
try:
raw = self.read(ap_name)
except KeyError:
pass
else:
read_app_props(raw, mi)
return mi
def close(self):
if hasattr(self, 'zipf'):
self.zipf.close()
else:
try:
shutil.rmtree(self.tdir)
except EnvironmentError:
pass
if __name__ == '__main__':
d = DOCX(sys.argv[-1], extract=False)
print(d.metadata)

View File

@@ -0,0 +1,276 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
import re
from calibre.ebooks.docx.index import process_index, polish_index_markup
from polyglot.builtins import iteritems, native_string_type
class Field(object):
def __init__(self, start):
self.start = start
self.end = None
self.contents = []
self.buf = []
self.instructions = None
self.name = None
def add_instr(self, elem):
self.add_raw(elem.text)
def add_raw(self, raw):
if not raw:
return
if self.name is None:
# There are cases where partial index entries end with
# a significant space, along the lines of
# <>Summary <> ... <>Hearing<>.
# No known examples of starting with a space yet.
# self.name, raw = raw.strip().partition(' ')[0::2]
self.name, raw = raw.lstrip().partition(' ')[0::2]
self.buf.append(raw)
def finalize(self):
self.instructions = ''.join(self.buf)
del self.buf
WORD, FLAG = 0, 1
scanner = re.Scanner([
(r'\\\S{1}', lambda s, t: (t, FLAG)), # A flag of the form \x
(r'"[^"]*"', lambda s, t: (t[1:-1], WORD)), # Quoted word
(r'[^\s\\"]\S*', lambda s, t: (t, WORD)), # A non-quoted word, must not start with a backslash or a space or a quote
(r'\s+', None),
], flags=re.DOTALL)
null = object()
def parser(name, field_map, default_field_name=None):
field_map = dict((x.split(':') for x in field_map.split()))
def parse(raw, log=None):
ans = {}
last_option = None
raw = raw.replace('\\\\', '\x01').replace('\\"', '\x02')
for token, token_type in scanner.scan(raw)[0]:
token = token.replace('\x01', '\\').replace('\x02', '"')
if token_type is FLAG:
last_option = field_map.get(token[1], null)
if last_option is not None:
ans[last_option] = None
elif token_type is WORD:
if last_option is None:
ans[default_field_name] = token
else:
ans[last_option] = token
last_option = None
ans.pop(null, None)
return ans
parse.__name__ = native_string_type('parse_' + name)
return parse
parse_hyperlink = parser('hyperlink',
'l:anchor m:image-map n:target o:title t:target', 'url')
parse_xe = parser('xe',
'b:bold i:italic f:entry-type r:page-range-bookmark t:page-number-text y:yomi', 'text')
parse_index = parser('index',
'b:bookmark c:columns-per-page d:sequence-separator e:first-page-number-separator'
' f:entry-type g:page-range-separator h:heading k:crossref-separator'
' l:page-number-separator p:letter-range s:sequence-name r:run-together y:yomi z:langcode')
parse_ref = parser('ref',
'd:separator f:footnote h:hyperlink n:number p:position r:relative-number t:suppress w:number-full-context')
parse_noteref = parser('noteref',
'f:footnote h:hyperlink p:position')
class Fields(object):
def __init__(self, namespace):
self.namespace = namespace
self.fields = []
self.index_bookmark_counter = 0
self.index_bookmark_prefix = 'index-'
def __call__(self, doc, log):
all_ids = frozenset(self.namespace.XPath('//*/@w:id')(doc))
c = 0
while self.index_bookmark_prefix in all_ids:
c += 1
self.index_bookmark_prefix = self.index_bookmark_prefix.replace('-', '%d-' % c)
stack = []
for elem in self.namespace.XPath(
'//*[name()="w:p" or name()="w:r" or'
' name()="w:instrText" or'
' (name()="w:fldChar" and (@w:fldCharType="begin" or @w:fldCharType="end") or'
' name()="w:fldSimple")]')(doc):
if elem.tag.endswith('}fldChar'):
typ = self.namespace.get(elem, 'w:fldCharType')
if typ == 'begin':
stack.append(Field(elem))
self.fields.append(stack[-1])
else:
try:
stack.pop().end = elem
except IndexError:
pass
elif elem.tag.endswith('}instrText'):
if stack:
stack[-1].add_instr(elem)
elif elem.tag.endswith('}fldSimple'):
field = Field(elem)
instr = self.namespace.get(elem, 'w:instr')
if instr:
field.add_raw(instr)
self.fields.append(field)
for r in self.namespace.XPath('descendant::w:r')(elem):
field.contents.append(r)
else:
if stack:
stack[-1].contents.append(elem)
field_types = ('hyperlink', 'xe', 'index', 'ref', 'noteref')
parsers = {x.upper():getattr(self, 'parse_'+x) for x in field_types}
parsers.update({x:getattr(self, 'parse_'+x) for x in field_types})
field_parsers = {f.upper():globals()['parse_%s' % f] for f in field_types}
field_parsers.update({f:globals()['parse_%s' % f] for f in field_types})
for f in field_types:
setattr(self, '%s_fields' % f, [])
unknown_fields = {'TOC', 'toc', 'PAGEREF', 'pageref'} # The TOC and PAGEREF fields are handled separately
for field in self.fields:
field.finalize()
if field.instructions:
func = parsers.get(field.name, None)
if func is not None:
func(field, field_parsers[field.name], log)
elif field.name not in unknown_fields:
log.warn('Encountered unknown field: %s, ignoring it.' % field.name)
unknown_fields.add(field.name)
def get_runs(self, field):
all_runs = []
current_runs = []
# We only handle spans in a single paragraph
# being wrapped in <a>
for x in field.contents:
if x.tag.endswith('}p'):
if current_runs:
all_runs.append(current_runs)
current_runs = []
elif x.tag.endswith('}r'):
current_runs.append(x)
if current_runs:
all_runs.append(current_runs)
return all_runs
def parse_hyperlink(self, field, parse_func, log):
# Parse hyperlink fields
hl = parse_func(field.instructions, log)
if hl:
if 'target' in hl and hl['target'] is None:
hl['target'] = '_blank'
for runs in self.get_runs(field):
self.hyperlink_fields.append((hl, runs))
def parse_ref(self, field, parse_func, log):
ref = parse_func(field.instructions, log)
dest = ref.get(None, None)
if dest is not None and 'hyperlink' in ref:
for runs in self.get_runs(field):
self.hyperlink_fields.append(({'anchor':dest}, runs))
else:
log.warn('Unsupported reference field (%s), ignoring: %r' % (field.name, ref))
parse_noteref = parse_ref
def parse_xe(self, field, parse_func, log):
# Parse XE fields
if None in (field.start, field.end):
return
xe = parse_func(field.instructions, log)
if xe:
# We insert a synthetic bookmark around this index item so that we
# can link to it later
def WORD(x):
return self.namespace.expand('w:' + x)
self.index_bookmark_counter += 1
bmark = xe['anchor'] = '%s%d' % (self.index_bookmark_prefix, self.index_bookmark_counter)
p = field.start.getparent()
bm = p.makeelement(WORD('bookmarkStart'))
bm.set(WORD('id'), bmark), bm.set(WORD('name'), bmark)
p.insert(p.index(field.start), bm)
p = field.end.getparent()
bm = p.makeelement(WORD('bookmarkEnd'))
bm.set(WORD('id'), bmark)
p.insert(p.index(field.end) + 1, bm)
xe['start_elem'] = field.start
self.xe_fields.append(xe)
def parse_index(self, field, parse_func, log):
if not field.contents:
return
idx = parse_func(field.instructions, log)
hyperlinks, blocks = process_index(field, idx, self.xe_fields, log, self.namespace.XPath, self.namespace.expand)
if not blocks:
return
for anchor, run in hyperlinks:
self.hyperlink_fields.append(({'anchor':anchor}, [run]))
self.index_fields.append((idx, blocks))
def polish_markup(self, object_map):
if not self.index_fields:
return
rmap = {v:k for k, v in iteritems(object_map)}
for idx, blocks in self.index_fields:
polish_index_markup(idx, [rmap[b] for b in blocks])
def test_parse_fields(return_tests=False):
import unittest
class TestParseFields(unittest.TestCase):
def test_hyperlink(self):
ae = lambda x, y: self.assertEqual(parse_hyperlink(x, None), y)
ae(r'\l anchor1', {'anchor':'anchor1'})
ae(r'www.calibre-ebook.com', {'url':'www.calibre-ebook.com'})
ae(r'www.calibre-ebook.com \t target \o tt', {'url':'www.calibre-ebook.com', 'target':'target', 'title': 'tt'})
ae(r'"c:\\Some Folder"', {'url': 'c:\\Some Folder'})
ae(r'xxxx \y yyyy', {'url': 'xxxx'})
def test_xe(self):
ae = lambda x, y: self.assertEqual(parse_xe(x, None), y)
ae(r'"some name"', {'text':'some name'})
ae(r'name \b \i', {'text':'name', 'bold':None, 'italic':None})
ae(r'xxx \y a', {'text':'xxx', 'yomi':'a'})
def test_index(self):
ae = lambda x, y: self.assertEqual(parse_index(x, None), y)
ae(r'', {})
ae(r'\b \c 1', {'bookmark':None, 'columns-per-page': '1'})
suite = unittest.TestLoader().loadTestsFromTestCase(TestParseFields)
if return_tests:
return suite
unittest.TextTestRunner(verbosity=4).run(suite)
if __name__ == '__main__':
test_parse_fields()

View File

@@ -0,0 +1,197 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
import os, re
from collections import namedtuple
from calibre.ebooks.docx.block_styles import binary_property, inherit
from calibre.utils.filenames import ascii_filename
from calibre.utils.fonts.scanner import font_scanner, NoFonts
from calibre.utils.fonts.utils import panose_to_css_generic_family, is_truetype_font
from calibre.utils.icu import ord_string
from polyglot.builtins import codepoint_to_chr, iteritems, range
Embed = namedtuple('Embed', 'name key subsetted')
def has_system_fonts(name):
try:
return bool(font_scanner.fonts_for_family(name))
except NoFonts:
return False
def get_variant(bold=False, italic=False):
return {(False, False):'Regular', (False, True):'Italic',
(True, False):'Bold', (True, True):'BoldItalic'}[(bold, italic)]
def find_fonts_matching(fonts, style='normal', stretch='normal'):
for font in fonts:
if font['font-style'] == style and font['font-stretch'] == stretch:
yield font
def weight_key(font):
w = font['font-weight']
try:
return abs(int(w) - 400)
except Exception:
return abs({'normal': 400, 'bold': 700}.get(w, 1000000) - 400)
def get_best_font(fonts, style, stretch):
try:
return sorted(find_fonts_matching(fonts, style, stretch), key=weight_key)[0]
except Exception:
pass
class Family(object):
def __init__(self, elem, embed_relationships, XPath, get):
self.name = self.family_name = get(elem, 'w:name')
self.alt_names = tuple(get(x, 'w:val') for x in XPath('./w:altName')(elem))
if self.alt_names and not has_system_fonts(self.name):
for x in self.alt_names:
if has_system_fonts(x):
self.family_name = x
break
self.embedded = {}
for x in ('Regular', 'Bold', 'Italic', 'BoldItalic'):
for y in XPath('./w:embed%s[@r:id]' % x)(elem):
rid = get(y, 'r:id')
key = get(y, 'w:fontKey')
subsetted = get(y, 'w:subsetted') in {'1', 'true', 'on'}
if rid in embed_relationships:
self.embedded[x] = Embed(embed_relationships[rid], key, subsetted)
self.generic_family = 'auto'
for x in XPath('./w:family[@w:val]')(elem):
self.generic_family = get(x, 'w:val', 'auto')
ntt = binary_property(elem, 'notTrueType', XPath, get)
self.is_ttf = ntt is inherit or not ntt
self.panose1 = None
self.panose_name = None
for x in XPath('./w:panose1[@w:val]')(elem):
try:
v = get(x, 'w:val')
v = tuple(int(v[i:i+2], 16) for i in range(0, len(v), 2))
except (TypeError, ValueError, IndexError):
pass
else:
self.panose1 = v
self.panose_name = panose_to_css_generic_family(v)
self.css_generic_family = {'roman':'serif', 'swiss':'sans-serif', 'modern':'monospace',
'decorative':'fantasy', 'script':'cursive'}.get(self.generic_family, None)
self.css_generic_family = self.css_generic_family or self.panose_name or 'serif'
SYMBOL_MAPS = { # {{{
'Wingdings': (' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '🖉', '', '', '👓', '🕭', '🕮', '🕯', '🕿', '', '🖂', '🖃', '📪', '📫', '📬', '📭', '🗀', '🗁', '🗎', '🗏', '🗐', '🗄', '', '🖮', '🖰', '🖲', '🖳', '🖴', '🖫', '🖬', '', '', '🖎', '', '🖏', '👍', '👎', '', '', '', '🖗', '🖐', '', '😐', '', '💣', '🕱', '🏳', '🏱', '', '', '🌢', '', '🕆', '', '🕈', '', '', '', '', '🕉', '', '', '', '', '', '', '', '', '', '', '', '', '', '🙰', '🙵', '', '🔾', '', '🞏', '🞐', '', '', '🞟', '', '', '', '🞙', '', '', '', '🏵', '🏶', '🙶', '🙷', ' ', '🄋', '', '', '', '', '', '', '', '', '', '', '🄌', '', '', '', '', '', '', '', '', '', '', '🙢', '🙠', '🙡', '🙣', '🙦', '🙤', '🙥', '🙧', '', '', '', '', '🞆', '🞈', '🞊', '🞋', '🔿', '', '🞎', '🟀', '🟁', '', '🟋', '🟏', '🟓', '🟑', '', '', '', '', '', '', '', '🕐', '🕑', '🕒', '🕓', '🕔', '🕕', '🕖', '🕗', '🕘', '🕙', '🕚', '🕛', '', '', '', '', '', '', '', '', '🙪', '🙫', '🙕', '🙔', '🙗', '🙖', '🙐', '🙑', '🙒', '🙓', '', '', '', '', '', '', '', '', '', '', '🡨', '🡪', '🡩', '🡫', '🡬', '🡭', '🡯', '🡮', '🡸', '🡺', '🡹', '🡻', '🡼', '🡽', '🡿', '🡾', '', '', '', '', '', '', '', '', '', '', '🢬', '🢭', '🗶', '', '🗷', '🗹', ' '), # noqa
'Wingdings 2': (' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '🖊', '🖋', '🖌', '🖍', '', '', '🕾', '🕽', '🗅', '🗆', '🗇', '🗈', '🗉', '🗊', '🗋', '🗌', '🗍', '📋', '🗑', '🗔', '🖵', '🖶', '🖷', '🖸', '🖭', '🖯', '🖱', '🖒', '🖓', '🖘', '🖙', '🖚', '🖛', '👈', '👉', '🖜', '🖝', '🖞', '🖟', '🖠', '🖡', '👆', '👇', '🖢', '🖣', '🖑', '🗴', '🗸', '🗵', '', '', '', '', '⮿', '🛇', '', '🙱', '🙴', '🙲', '🙳', '', '🙹', '🙺', '🙻', '🙦', '🙤', '🙥', '🙧', '🙚', '🙘', '🙙', '🙛', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ' ', '', '🌕', '', '', '⸿', '', '🕇', '🕜', '🕝', '🕞', '🕟', '🕠', '🕡', '🕢', '🕣', '🕤', '🕥', '🕦', '🕧', '🙨', '🙩', '', '🞄', '', '', '', '🞅', '🞇', '🞉', '', '⦿', '🞌', '🞍', '', '', '', '🞑', '🞒', '🞓', '🞔', '', '🞕', '🞖', '🞗', '🞘', '', '', '', '🞚', '', '🞛', '🞜', '🞝', '🞞', '', '', '', '🞠', '', '', '', '', '', '', '', '', '', '', '', '', '🞡', '🞢', '🞣', '🞤', '🞥', '🞦', '🞧', '🞨', '🞩', '🞪', '🞫', '🞬', '🞭', '🞮', '🞯', '🞰', '🞱', '🞲', '🞳', '🞴', '🞵', '🞶', '🞷', '🞸', '🞹', '🞺', '🞻', '🞼', '🞽', '🞾', '🞿', '🟀', '🟂', '🟄', '🟆', '🟉', '🟊', '', '🟌', '🟎', '🟐', '🟒', '', '🟃', '🟇', '', '🟍', '🟔', '', '', '', '', ' ', ' ', ' ', ' ', ' ', ' ',), # noqa
'Wingdings 3': (' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '⭿', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '🢠', '🢡', '🢢', '🢣', '🢤', '🢥', '🢦', '🢧', '🢨', '🢩', '🢪', '🢫', '🡐', '🡒', '🡑', '🡓', '🡔', '🡕', '🡗', '🡖', '🡘', '🡙', '', '', '', '', '', '', '', '', '', '', '', '', '🞀', '🞂', '🞁', ' ', '🞃', '', '', '', '', '', '', '', '', '🠐', '🠒', '🠑', '🠓', '🠔', '🠖', '🠕', '🠗', '🠘', '🠚', '🠙', '🠛', '🠜', '🠞', '🠝', '🠟', '🠀', '🠂', '🠁', '🠃', '🠄', '🠆', '🠅', '🠇', '🠈', '🠊', '🠉', '🠋', '🠠', '🠢', '🠤', '🠦', '🠨', '🠪', '🠬', '🢜', '🢝', '🢞', '🢟', '🠮', '🠰', '🠲', '🠴', '🠶', '🠸', '🠺', '🠹', '🠻', '🢘', '🢚', '🢙', '🢛', '🠼', '🠾', '🠽', '🠿', '🡀', '🡂', '🡁', '🡃', '🡄', '🡆', '🡅', '🡇', '', '', '', '', '', '', '', '', '🡠', '🡢', '🡡', '🡣', '🡤', '🡥', '🡧', '🡦', '🡰', '🡲', '🡱', '🡳', '🡴', '🡵', '🡷', '🡶', '🢀', '🢂', '🢁', '🢃', '🢄', '🢅', '🢇', '🢆', '🢐', '🢒', '🢑', '🢓', '🢔', '🢕', '🢗', '🢖', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',), # noqa
'Webdings': (' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '🕷', '🕸', '🕲', '🕶', '🏆', '🎖', '🖇', '🗨', '🗩', '🗰', '🗱', '🌶', '🎗', '🙾', '🙼', '🗕', '🗖', '🗗', '', '', '', '', '', '', '', '', '', '', '', '🗚', '🗳', '🛠', '🏗', '🏘', '🏙', '🏚', '🏜', '🏭', '🏛', '🏠', '🏖', '🏝', '🛣', '🔍', '🏔', '👁', '👂', '🏞', '🏕', '🛤', '🏟', '🛳', '🕬', '🕫', '🕨', '🔈', '🎔', '🎕', '🗬', '🙽', '🗭', '🗪', '🗫', '', '', '🚲', '', '🛡', '📦', '🛱', '', '🚑', '🛈', '🛩', '🛰', '🟈', '🕴', '', '🛥', '🚔', '🗘', '🗙', '', '🛲', '🚇', '🚍', '', '', '', '🚭', '🗮', '', '🗯', '🗲', ' ', '🚹', '🚺', '🛉', '🛊', '🚼', '👽', '🏋', '', '🏂', '🏌', '🏊', '🏄', '🏍', '🏎', '🚘', '🗠', '🛢', '📠', '🏷', '📣', '👪', '🗡', '🗢', '🗣', '', '🖄', '🖅', '🖃', '🖆', '🖹', '🖺', '🖻', '🕵', '🕰', '🖽', '🖾', '📋', '🗒', '🗓', '🕮', '📚', '🗞', '🗟', '🗃', '🗂', '🖼', '🎭', '🎜', '🎘', '🎙', '🎧', '💿', '🎞', '📷', '🎟', '🎬', '📽', '📹', '📾', '📻', '🎚', '🎛', '📺', '💻', '🖥', '🖦', '🖧', '🍹', '🎮', '🎮', '🕻', '🕼', '🖁', '🖀', '🖨', '🖩', '🖿', '🖪', '🗜', '🔒', '🔓', '🗝', '📥', '📤', '🕳', '🌣', '🌤', '🌥', '🌦', '', '🌨', '🌧', '🌩', '🌪', '🌬', '🌫', '🌜', '🌡', '🛋', '🛏', '🍽', '🍸', '🛎', '🛍', '', '', '🛆', '🖈', '🎓', '🗤', '🗥', '🗦', '🗧', '🛪', '🐿', '🐦', '🐟', '🐕', '🐈', '🙬', '🙮', '🙭', '🙯', '🗺', '🌍', '🌏', '🌎', '🕊',), # noqa
'Symbol': (' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '!', '', '#', '', '%', '&', '', '(', ')', '*', '+', ',', '', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '', 'Α', 'Β', 'Χ', 'Δ', 'Ε', 'Φ', 'Γ', 'Η', 'Ι', 'ϑ', 'Λ', 'Μ', 'Ν', 'Ξ', 'Ο', 'Π', 'Θ', 'Ρ', 'Σ', 'Τ', 'Υ', 'ς', 'Ω', 'Ξ', 'Ψ', 'Ζ', '[', '', ']', '', '_', '', 'α', 'β', 'χ', 'δ', 'ε', 'φ', 'γ', 'η', 'ι', 'ϕ', 'λ', 'μ', 'ν', 'ξ', 'ο', 'π', 'θ', 'ρ', 'σ', 'τ', 'υ', 'ϖ', 'ω', 'ξ', 'ψ', 'ζ', '{', '|', '}', '~', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '', 'ϒ', '', '', '', '', 'ƒ', '', '', '', '', '', '', '', '', '', '°', '±', '', '', '×', '', '', '', '÷', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '®', '©', '', '', '', '', '¬', '', '', '', '', '', '', '', '', '', '®', '©', '', '', '', '', '', '', '', '', '', '', '', '', ' ', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ' ',), # noqa
} # }}}
SYMBOL_FONT_NAMES = frozenset(n.lower() for n in SYMBOL_MAPS)
def is_symbol_font(family):
try:
return family.lower() in SYMBOL_FONT_NAMES
except AttributeError:
return False
def do_map(m, points):
base = 0xf000
limit = len(m) + base
for p in points:
if base < p < limit:
yield m[p - base]
else:
yield codepoint_to_chr(p)
def map_symbol_text(text, font):
m = SYMBOL_MAPS[font]
if isinstance(text, bytes):
text = text.decode('utf-8')
return ''.join(do_map(m, ord_string(text)))
class Fonts(object):
def __init__(self, namespace):
self.namespace = namespace
self.fonts = {}
self.used = set()
def __call__(self, root, embed_relationships, docx, dest_dir):
for elem in self.namespace.XPath('//w:font[@w:name]')(root):
self.fonts[self.namespace.get(elem, 'w:name')] = Family(elem, embed_relationships, self.namespace.XPath, self.namespace.get)
def family_for(self, name, bold=False, italic=False):
f = self.fonts.get(name, None)
if f is None:
return 'serif'
variant = get_variant(bold, italic)
self.used.add((name, variant))
name = f.name if variant in f.embedded else f.family_name
if is_symbol_font(name):
return name
return '"%s", %s' % (name.replace('"', ''), f.css_generic_family)
def embed_fonts(self, dest_dir, docx):
defs = []
dest_dir = os.path.join(dest_dir, 'fonts')
for name, variant in self.used:
f = self.fonts[name]
if variant in f.embedded:
if not os.path.exists(dest_dir):
os.mkdir(dest_dir)
fname = self.write(name, dest_dir, docx, variant)
if fname is not None:
d = {'font-family':'"%s"' % name.replace('"', ''), 'src': 'url("fonts/%s")' % fname}
if 'Bold' in variant:
d['font-weight'] = 'bold'
if 'Italic' in variant:
d['font-style'] = 'italic'
d = ['%s: %s' % (k, v) for k, v in iteritems(d)]
d = ';\n\t'.join(d)
defs.append('@font-face {\n\t%s\n}\n' % d)
return '\n'.join(defs)
def write(self, name, dest_dir, docx, variant):
f = self.fonts[name]
ef = f.embedded[variant]
raw = docx.read(ef.name)
prefix = raw[:32]
if ef.key:
key = re.sub(r'[^A-Fa-f0-9]', '', ef.key)
key = bytearray(reversed(tuple(int(key[i:i+2], 16) for i in range(0, len(key), 2))))
prefix = bytearray(prefix)
prefix = bytes(bytearray(prefix[i]^key[i % len(key)] for i in range(len(prefix))))
if not is_truetype_font(prefix):
return None
ext = 'otf' if prefix.startswith(b'OTTO') else 'ttf'
fname = ascii_filename('%s - %s.%s' % (name, variant, ext))
with open(os.path.join(dest_dir, fname), 'wb') as dest:
dest.write(prefix)
dest.write(raw[32:])
return fname

View File

@@ -0,0 +1,65 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
from collections import OrderedDict
from polyglot.builtins import iteritems, unicode_type
class Note(object):
def __init__(self, namespace, parent, rels):
self.type = namespace.get(parent, 'w:type', 'normal')
self.parent = parent
self.rels = rels
self.namespace = namespace
def __iter__(self):
for p in self.namespace.descendants(self.parent, 'w:p', 'w:tbl'):
yield p
class Footnotes(object):
def __init__(self, namespace):
self.namespace = namespace
self.footnotes = {}
self.endnotes = {}
self.counter = 0
self.notes = OrderedDict()
def __call__(self, footnotes, footnotes_rels, endnotes, endnotes_rels):
XPath, get = self.namespace.XPath, self.namespace.get
if footnotes is not None:
for footnote in XPath('./w:footnote[@w:id]')(footnotes):
fid = get(footnote, 'w:id')
if fid:
self.footnotes[fid] = Note(self.namespace, footnote, footnotes_rels)
if endnotes is not None:
for endnote in XPath('./w:endnote[@w:id]')(endnotes):
fid = get(endnote, 'w:id')
if fid:
self.endnotes[fid] = Note(self.namespace, endnote, endnotes_rels)
def get_ref(self, ref):
fid = self.namespace.get(ref, 'w:id')
notes = self.footnotes if ref.tag.endswith('}footnoteReference') else self.endnotes
note = notes.get(fid, None)
if note is not None and note.type == 'normal':
self.counter += 1
anchor = 'note_%d' % self.counter
self.notes[anchor] = (unicode_type(self.counter), note)
return anchor, unicode_type(self.counter)
return None, None
def __iter__(self):
for anchor, (counter, note) in iteritems(self.notes):
yield anchor, counter, note
@property
def has_notes(self):
return bool(self.notes)

View File

@@ -0,0 +1,343 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
import os
from lxml.html.builder import IMG, HR
from calibre.constants import iswindows
from calibre.ebooks.docx.names import barename
from calibre.utils.filenames import ascii_filename
from calibre.utils.img import resize_to_fit, image_to_data
from calibre.utils.imghdr import what
from polyglot.builtins import iteritems, itervalues
class LinkedImageNotFound(ValueError):
def __init__(self, fname):
ValueError.__init__(self, fname)
self.fname = fname
def image_filename(x):
return ascii_filename(x).replace(' ', '_').replace('#', '_')
def emu_to_pt(x):
return x / 12700
def pt_to_emu(x):
return int(x * 12700)
def get_image_properties(parent, XPath, get):
width = height = None
for extent in XPath('./wp:extent')(parent):
try:
width = emu_to_pt(int(extent.get('cx')))
except (TypeError, ValueError):
pass
try:
height = emu_to_pt(int(extent.get('cy')))
except (TypeError, ValueError):
pass
ans = {}
if width is not None:
ans['width'] = '%.3gpt' % width
if height is not None:
ans['height'] = '%.3gpt' % height
alt = None
title = None
for docPr in XPath('./wp:docPr')(parent):
alt = docPr.get('descr') or alt
title = docPr.get('title') or title
if docPr.get('hidden', None) in {'true', 'on', '1'}:
ans['display'] = 'none'
return ans, alt, title
def get_image_margins(elem):
ans = {}
for w, css in iteritems({'L':'left', 'T':'top', 'R':'right', 'B':'bottom'}):
val = elem.get('dist%s' % w, None)
if val is not None:
try:
val = emu_to_pt(val)
except (TypeError, ValueError):
continue
ans['padding-%s' % css] = '%.3gpt' % val
return ans
def get_hpos(anchor, page_width, XPath, get, width_frac):
for ph in XPath('./wp:positionH')(anchor):
rp = ph.get('relativeFrom', None)
if rp == 'leftMargin':
return 0 + width_frac
if rp == 'rightMargin':
return 1 + width_frac
al = None
almap = {'left':0, 'center':0.5, 'right':1}
for align in XPath('./wp:align')(ph):
al = almap.get(align.text)
if al is not None:
if rp == 'page':
return al
return al + width_frac
for po in XPath('./wp:posOffset')(ph):
try:
pos = emu_to_pt(int(po.text))
except (TypeError, ValueError):
continue
return pos/page_width + width_frac
for sp in XPath('./wp:simplePos')(anchor):
try:
x = emu_to_pt(sp.get('x', None))
except (TypeError, ValueError):
continue
return x/page_width + width_frac
return 0
class Images(object):
def __init__(self, namespace, log):
self.namespace = namespace
self.rid_map = {}
self.used = {}
self.resized = {}
self.names = set()
self.all_images = set()
self.links = []
self.log = log
def __call__(self, relationships_by_id):
self.rid_map = relationships_by_id
def read_image_data(self, fname, base=None):
if fname.startswith('file://'):
src = fname[len('file://'):]
if iswindows and src and src[0] == '/':
src = src[1:]
if not src or not os.path.exists(src):
raise LinkedImageNotFound(src)
with open(src, 'rb') as rawsrc:
raw = rawsrc.read()
else:
try:
raw = self.docx.read(fname)
except KeyError:
raise LinkedImageNotFound(fname)
base = base or image_filename(fname.rpartition('/')[-1]) or 'image'
ext = what(None, raw) or base.rpartition('.')[-1] or 'jpeg'
if ext == 'emf':
# For an example, see: https://bugs.launchpad.net/bugs/1224849
self.log('Found an EMF image: %s, trying to extract embedded raster image' % fname)
from calibre.utils.wmf.emf import emf_unwrap
try:
raw = emf_unwrap(raw)
except Exception:
self.log.exception('Failed to extract embedded raster image from EMF')
else:
ext = 'png'
base = base.rpartition('.')[0]
if not base:
base = 'image'
base += '.' + ext
return raw, base
def unique_name(self, base):
exists = frozenset(itervalues(self.used))
c = 1
name = base
while name in exists:
n, e = base.rpartition('.')[0::2]
name = '%s-%d.%s' % (n, c, e)
c += 1
return name
def resize_image(self, raw, base, max_width, max_height):
resized, img = resize_to_fit(raw, max_width, max_height)
if resized:
base, ext = os.path.splitext(base)
base = base + '-%dx%d%s' % (max_width, max_height, ext)
raw = image_to_data(img, fmt=ext[1:])
return raw, base, resized
def generate_filename(self, rid, base=None, rid_map=None, max_width=None, max_height=None):
rid_map = self.rid_map if rid_map is None else rid_map
fname = rid_map[rid]
key = (fname, max_width, max_height)
ans = self.used.get(key)
if ans is not None:
return ans
raw, base = self.read_image_data(fname, base=base)
resized = False
if max_width is not None and max_height is not None:
raw, base, resized = self.resize_image(raw, base, max_width, max_height)
name = self.unique_name(base)
self.used[key] = name
if max_width is not None and max_height is not None and not resized:
okey = (fname, None, None)
if okey in self.used:
return self.used[okey]
self.used[okey] = name
with open(os.path.join(self.dest_dir, name), 'wb') as f:
f.write(raw)
self.all_images.add('images/' + name)
return name
def pic_to_img(self, pic, alt, parent, title):
XPath, get = self.namespace.XPath, self.namespace.get
name = None
link = None
for hl in XPath('descendant::a:hlinkClick[@r:id]')(parent):
link = {'id':get(hl, 'r:id')}
tgt = hl.get('tgtFrame', None)
if tgt:
link['target'] = tgt
title = hl.get('tooltip', None)
if title:
link['title'] = title
for pr in XPath('descendant::pic:cNvPr')(pic):
name = pr.get('name', None)
if name:
name = image_filename(name)
alt = pr.get('descr') or alt
for a in XPath('descendant::a:blip[@r:embed or @r:link]')(pic):
rid = get(a, 'r:embed')
if not rid:
rid = get(a, 'r:link')
if rid and rid in self.rid_map:
try:
src = self.generate_filename(rid, name)
except LinkedImageNotFound as err:
self.log.warn('Linked image: %s not found, ignoring' % err.fname)
continue
img = IMG(src='images/%s' % src)
img.set('alt', alt or 'Image')
if title:
img.set('title', title)
if link is not None:
self.links.append((img, link, self.rid_map))
return img
def drawing_to_html(self, drawing, page):
XPath, get = self.namespace.XPath, self.namespace.get
# First process the inline pictures
for inline in XPath('./wp:inline')(drawing):
style, alt, title = get_image_properties(inline, XPath, get)
for pic in XPath('descendant::pic:pic')(inline):
ans = self.pic_to_img(pic, alt, inline, title)
if ans is not None:
if style:
ans.set('style', '; '.join('%s: %s' % (k, v) for k, v in iteritems(style)))
yield ans
# Now process the floats
for anchor in XPath('./wp:anchor')(drawing):
style, alt, title = get_image_properties(anchor, XPath, get)
self.get_float_properties(anchor, style, page)
for pic in XPath('descendant::pic:pic')(anchor):
ans = self.pic_to_img(pic, alt, anchor, title)
if ans is not None:
if style:
ans.set('style', '; '.join('%s: %s' % (k, v) for k, v in iteritems(style)))
yield ans
def pict_to_html(self, pict, page):
XPath, get = self.namespace.XPath, self.namespace.get
# First see if we have an <hr>
is_hr = len(pict) == 1 and get(pict[0], 'o:hr') in {'t', 'true'}
if is_hr:
style = {}
hr = HR()
try:
pct = float(get(pict[0], 'o:hrpct'))
except (ValueError, TypeError, AttributeError):
pass
else:
if pct > 0:
style['width'] = '%.3g%%' % pct
align = get(pict[0], 'o:hralign', 'center')
if align in {'left', 'right'}:
style['margin-left'] = '0' if align == 'left' else 'auto'
style['margin-right'] = 'auto' if align == 'left' else '0'
if style:
hr.set('style', '; '.join(('%s:%s' % (k, v) for k, v in iteritems(style))))
yield hr
for imagedata in XPath('descendant::v:imagedata[@r:id]')(pict):
rid = get(imagedata, 'r:id')
if rid in self.rid_map:
try:
src = self.generate_filename(rid)
except LinkedImageNotFound as err:
self.log.warn('Linked image: %s not found, ignoring' % err.fname)
continue
img = IMG(src='images/%s' % src, style="display:block")
alt = get(imagedata, 'o:title')
img.set('alt', alt or 'Image')
yield img
def get_float_properties(self, anchor, style, page):
XPath, get = self.namespace.XPath, self.namespace.get
if 'display' not in style:
style['display'] = 'block'
padding = get_image_margins(anchor)
width = float(style.get('width', '100pt')[:-2])
page_width = page.width - page.margin_left - page.margin_right
if page_width <= 0:
# Ignore margins
page_width = page.width
hpos = get_hpos(anchor, page_width, XPath, get, width/(2*page_width))
wrap_elem = None
dofloat = False
for child in reversed(anchor):
bt = barename(child.tag)
if bt in {'wrapNone', 'wrapSquare', 'wrapThrough', 'wrapTight', 'wrapTopAndBottom'}:
wrap_elem = child
dofloat = bt not in {'wrapNone', 'wrapTopAndBottom'}
break
if wrap_elem is not None:
padding.update(get_image_margins(wrap_elem))
wt = wrap_elem.get('wrapText', None)
hpos = 0 if wt == 'right' else 1 if wt == 'left' else hpos
if dofloat:
style['float'] = 'left' if hpos < 0.65 else 'right'
else:
ml, mr = (None, None) if hpos < 0.34 else ('auto', None) if hpos > 0.65 else ('auto', 'auto')
if ml is not None:
style['margin-left'] = ml
if mr is not None:
style['margin-right'] = mr
style.update(padding)
def to_html(self, elem, page, docx, dest_dir):
dest = os.path.join(dest_dir, 'images')
if not os.path.exists(dest):
os.mkdir(dest)
self.dest_dir, self.docx = dest, docx
if elem.tag.endswith('}drawing'):
for tag in self.drawing_to_html(elem, page):
yield tag
else:
for tag in self.pict_to_html(elem, page):
yield tag

View File

@@ -0,0 +1,273 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2014, Kovid Goyal <kovid at kovidgoyal.net>'
from operator import itemgetter
from lxml import etree
from calibre.utils.icu import partition_by_first_letter, sort_key
from polyglot.builtins import iteritems, filter
def get_applicable_xe_fields(index, xe_fields, XPath, expand):
iet = index.get('entry-type', None)
xe_fields = [xe for xe in xe_fields if xe.get('entry-type', None) == iet]
lr = index.get('letter-range', None)
if lr is not None:
sl, el = lr.parition('-')[0::2]
sl, el = sl.strip(), el.strip()
if sl and el:
def inrange(text):
return sl <= text[0] <= el
xe_fields = [xe for xe in xe_fields if inrange(xe.get('text', ''))]
bmark = index.get('bookmark', None)
if bmark is None:
return xe_fields
attr = expand('w:name')
bookmarks = {b for b in XPath('//w:bookmarkStart')(xe_fields[0]['start_elem']) if b.get(attr, None) == bmark}
ancestors = XPath('ancestor::w:bookmarkStart')
def contained(xe):
# Check if the xe field is contained inside a bookmark with the
# specified name
return bool(set(ancestors(xe['start_elem'])) & bookmarks)
return [xe for xe in xe_fields if contained(xe)]
def make_block(expand, style, parent, pos):
p = parent.makeelement(expand('w:p'))
parent.insert(pos, p)
if style is not None:
ppr = p.makeelement(expand('w:pPr'))
p.append(ppr)
ps = ppr.makeelement(expand('w:pStyle'))
ppr.append(ps)
ps.set(expand('w:val'), style)
r = p.makeelement(expand('w:r'))
p.append(r)
t = r.makeelement(expand('w:t'))
t.set(expand('xml:space'), 'preserve')
r.append(t)
return p, t
def add_xe(xe, t, expand):
run = t.getparent()
idx = run.index(t)
t.text = xe.get('text') or ' '
pt = xe.get('page-number-text', None)
if pt:
p = t.getparent().getparent()
r = p.makeelement(expand('w:r'))
p.append(r)
t2 = r.makeelement(expand('w:t'))
t2.set(expand('xml:space'), 'preserve')
t2.text = ' [%s]' % pt
r.append(t2)
# put separate entries on separate lines
run.insert(idx + 1, run.makeelement(expand('w:br')))
return xe['anchor'], run
def process_index(field, index, xe_fields, log, XPath, expand):
'''
We remove all the word generated index markup and replace it with our own
that is more suitable for an ebook.
'''
styles = []
heading_text = index.get('heading', None)
heading_style = 'IndexHeading'
start_pos = None
for elem in field.contents:
if elem.tag.endswith('}p'):
s = XPath('descendant::pStyle/@w:val')(elem)
if s:
styles.append(s[0])
p = elem.getparent()
if start_pos is None:
start_pos = (p, p.index(elem))
p.remove(elem)
xe_fields = get_applicable_xe_fields(index, xe_fields, XPath, expand)
if not xe_fields:
return [], []
if heading_text is not None:
groups = partition_by_first_letter(xe_fields, key=itemgetter('text'))
items = []
for key, fields in iteritems(groups):
items.append(key), items.extend(fields)
if styles:
heading_style = styles[0]
else:
items = sorted(xe_fields, key=lambda x:sort_key(x['text']))
hyperlinks = []
blocks = []
for item in reversed(items):
is_heading = not isinstance(item, dict)
style = heading_style if is_heading else None
p, t = make_block(expand, style, *start_pos)
if is_heading:
text = heading_text
if text.lower().startswith('a'):
text = item + text[1:]
t.text = text
else:
hyperlinks.append(add_xe(item, t, expand))
blocks.append(p)
return hyperlinks, blocks
def split_up_block(block, a, text, parts, ldict):
prefix = parts[:-1]
a.text = parts[-1]
parent = a.getparent()
style = 'display:block; margin-left: %.3gem'
for i, prefix in enumerate(prefix):
m = 1.5 * i
span = parent.makeelement('span', style=style % m)
ldict[span] = i
parent.append(span)
span.text = prefix
span = parent.makeelement('span', style=style % ((i + 1) * 1.5))
parent.append(span)
span.append(a)
ldict[span] = len(prefix)
"""
The merge algorithm is a little tricky.
We start with a list of elementary blocks. Each is an HtmlElement, a p node
with a list of child nodes. The last child may be a link, and the earlier ones are
just text.
The list is in reverse order from what we want in the index.
There is a dictionary ldict which records the level of each child node.
Now we want to do a reduce-like operation, combining all blocks with the same
top level index entry into a single block representing the structure of all
references, subentries, etc. under that top entry.
Here's the algorithm.
Given a block p and the next block n, and the top level entries p1 and n1 in each
block, which we assume have the same text:
Start with (p, p1) and (n, n1).
Given (p, p1, ..., pk) and (n, n1, ..., nk) which we want to merge:
If there are no more levels in n, and we have a link in nk,
then add the link from nk to the links for pk.
This might be the first link for pk, or we might get a list of references.
Otherwise nk+1 is the next level in n. Look for a matching entry in p. It must have
the same text, it must follow pk, it must come before we find any other p entries at
the same level as pk, and it must have the same level as nk+1.
If we find such a matching entry, go back to the start with (p ... pk+1) and (n ... nk+1).
If there is no matching entry, then because of the original reversed order we want
to insert nk+1 and all following entries from n into p immediately following pk.
"""
def find_match(prev_block, pind, nextent, ldict):
curlevel = ldict.get(prev_block[pind], -1)
if curlevel < 0:
return -1
for p in range(pind+1, len(prev_block)):
trylev = ldict.get(prev_block[p], -1)
if trylev <= curlevel:
return -1
if trylev > (curlevel+1):
continue
if prev_block[p].text_content() == nextent.text_content():
return p
return -1
def add_link(pent, nent, ldict):
na = nent.xpath('descendant::a[1]')
# If there is no link, leave it as text
if not na or len(na) == 0:
return
na = na[0]
pa = pent.xpath('descendant::a')
if pa and len(pa) > 0:
# Put on same line with a comma
pa = pa[-1]
pa.tail = ', '
p = pa.getparent()
p.insert(p.index(pa) + 1, na)
else:
# substitute link na for plain text in pent
pent.text = ""
pent.append(na)
def merge_blocks(prev_block, next_block, pind, nind, next_path, ldict):
# First elements match. Any more in next?
if len(next_path) == (nind + 1):
nextent = next_block[nind]
add_link(prev_block[pind], nextent, ldict)
return
nind = nind + 1
nextent = next_block[nind]
prevent = find_match(prev_block, pind, nextent, ldict)
if prevent > 0:
merge_blocks(prev_block, next_block, prevent, nind, next_path, ldict)
return
# Want to insert elements into previous block
while nind < len(next_block):
# insert takes it out of old
pind = pind + 1
prev_block.insert(pind, next_block[nind])
next_block.getparent().remove(next_block)
def polish_index_markup(index, blocks):
# Blocks are in reverse order at this point
path_map = {}
ldict = {}
for block in blocks:
cls = block.get('class', '') or ''
block.set('class', (cls + ' index-entry').lstrip())
a = block.xpath('descendant::a[1]')
text = ''
if a:
text = etree.tostring(a[0], method='text', with_tail=False, encoding='unicode').strip()
if ':' in text:
path_map[block] = parts = list(filter(None, (x.strip() for x in text.split(':'))))
if len(parts) > 1:
split_up_block(block, a[0], text, parts, ldict)
else:
# try using a span all the time
path_map[block] = [text]
parent = a[0].getparent()
span = parent.makeelement('span', style='display:block; margin-left: 0em')
parent.append(span)
span.append(a[0])
ldict[span] = 0
for br in block.xpath('descendant::br'):
br.tail = None
# We want a single block for each main entry
prev_block = blocks[0]
for block in blocks[1:]:
pp, pn = path_map[prev_block], path_map[block]
if pp[0] == pn[0]:
merge_blocks(prev_block, block, 0, 0, pn, ldict)
else:
prev_block = block

View File

@@ -0,0 +1,144 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
import re
from lxml.etree import XPath as X
from calibre.utils.filenames import ascii_text
from polyglot.builtins import iteritems
# Names {{{
TRANSITIONAL_NAMES = {
'DOCUMENT' : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument',
'DOCPROPS' : 'http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties',
'APPPROPS' : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties',
'STYLES' : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles',
'NUMBERING' : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/numbering',
'FONTS' : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable',
'EMBEDDED_FONT' : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/font',
'IMAGES' : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/image',
'LINKS' : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink',
'FOOTNOTES' : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/footnotes',
'ENDNOTES' : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/endnotes',
'THEMES' : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme',
'SETTINGS' : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings',
'WEB_SETTINGS' : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings',
}
STRICT_NAMES = {
k:v.replace('http://schemas.openxmlformats.org/officeDocument/2006', 'http://purl.oclc.org/ooxml/officeDocument')
for k, v in iteritems(TRANSITIONAL_NAMES)
}
TRANSITIONAL_NAMESPACES = {
'mo': 'http://schemas.microsoft.com/office/mac/office/2008/main',
'o': 'urn:schemas-microsoft-com:office:office',
've': 'http://schemas.openxmlformats.org/markup-compatibility/2006',
'mc': 'http://schemas.openxmlformats.org/markup-compatibility/2006',
# Text Content
'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main',
'w10': 'urn:schemas-microsoft-com:office:word',
'wne': 'http://schemas.microsoft.com/office/word/2006/wordml',
'xml': 'http://www.w3.org/XML/1998/namespace',
# Drawing
'a': 'http://schemas.openxmlformats.org/drawingml/2006/main',
'm': 'http://schemas.openxmlformats.org/officeDocument/2006/math',
'mv': 'urn:schemas-microsoft-com:mac:vml',
'pic': 'http://schemas.openxmlformats.org/drawingml/2006/picture',
'v': 'urn:schemas-microsoft-com:vml',
'wp': 'http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing',
# Properties (core and extended)
'cp': 'http://schemas.openxmlformats.org/package/2006/metadata/core-properties',
'dc': 'http://purl.org/dc/elements/1.1/',
'ep': 'http://schemas.openxmlformats.org/officeDocument/2006/extended-properties',
'xsi': 'http://www.w3.org/2001/XMLSchema-instance',
# Content Types
'ct': 'http://schemas.openxmlformats.org/package/2006/content-types',
# Package Relationships
'r': 'http://schemas.openxmlformats.org/officeDocument/2006/relationships',
'pr': 'http://schemas.openxmlformats.org/package/2006/relationships',
# Dublin Core document properties
'dcmitype': 'http://purl.org/dc/dcmitype/',
'dcterms': 'http://purl.org/dc/terms/'
}
STRICT_NAMESPACES = {
k:v.replace(
'http://schemas.openxmlformats.org/officeDocument/2006', 'http://purl.oclc.org/ooxml/officeDocument').replace(
'http://schemas.openxmlformats.org/wordprocessingml/2006', 'http://purl.oclc.org/ooxml/wordprocessingml').replace(
'http://schemas.openxmlformats.org/drawingml/2006', 'http://purl.oclc.org/ooxml/drawingml')
for k, v in iteritems(TRANSITIONAL_NAMESPACES)
}
# }}}
def barename(x):
return x.rpartition('}')[-1]
def XML(x):
return '{%s}%s' % (TRANSITIONAL_NAMESPACES['xml'], x)
def generate_anchor(name, existing):
x = y = 'id_' + re.sub(r'[^0-9a-zA-Z_]', '', ascii_text(name)).lstrip('_')
c = 1
while y in existing:
y = '%s_%d' % (x, c)
c += 1
return y
class DOCXNamespace(object):
def __init__(self, transitional=True):
self.xpath_cache = {}
if transitional:
self.namespaces = TRANSITIONAL_NAMESPACES.copy()
self.names = TRANSITIONAL_NAMES.copy()
else:
self.namespaces = STRICT_NAMESPACES.copy()
self.names = STRICT_NAMES.copy()
def XPath(self, expr):
ans = self.xpath_cache.get(expr, None)
if ans is None:
self.xpath_cache[expr] = ans = X(expr, namespaces=self.namespaces)
return ans
def is_tag(self, x, q):
tag = getattr(x, 'tag', x)
ns, name = q.partition(':')[0::2]
return '{%s}%s' % (self.namespaces.get(ns, None), name) == tag
def expand(self, name, sep=':'):
ns, tag = name.partition(sep)[::2]
if ns and tag:
tag = '{%s}%s' % (self.namespaces[ns], tag)
return tag or ns
def get(self, x, attr, default=None):
return x.attrib.get(self.expand(attr), default)
def ancestor(self, elem, name):
try:
return self.XPath('ancestor::%s[1]' % name)(elem)[0]
except IndexError:
return None
def children(self, elem, *args):
return self.XPath('|'.join('child::%s' % a for a in args))(elem)
def descendants(self, elem, *args):
return self.XPath('|'.join('descendant::%s' % a for a in args))(elem)
def makeelement(self, root, tag, append=True, **attrs):
ans = root.makeelement(self.expand(tag), **{self.expand(k, sep='_'):v for k, v in iteritems(attrs)})
if append:
root.append(ans)
return ans

View File

@@ -0,0 +1,388 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
import re, string
from collections import Counter, defaultdict
from functools import partial
from lxml.html.builder import OL, UL, SPAN
from calibre.ebooks.docx.block_styles import ParagraphStyle
from calibre.ebooks.docx.char_styles import RunStyle, inherit
from calibre.ebooks.metadata import roman
from polyglot.builtins import iteritems, unicode_type
STYLE_MAP = {
'aiueo': 'hiragana',
'aiueoFullWidth': 'hiragana',
'hebrew1': 'hebrew',
'iroha': 'katakana-iroha',
'irohaFullWidth': 'katakana-iroha',
'lowerLetter': 'lower-alpha',
'lowerRoman': 'lower-roman',
'none': 'none',
'upperLetter': 'upper-alpha',
'upperRoman': 'upper-roman',
'chineseCounting': 'cjk-ideographic',
'decimalZero': 'decimal-leading-zero',
}
def alphabet(val, lower=True):
x = string.ascii_lowercase if lower else string.ascii_uppercase
return x[(abs(val - 1)) % len(x)]
alphabet_map = {
'lower-alpha':alphabet, 'upper-alpha':partial(alphabet, lower=False),
'lower-roman':lambda x:roman(x).lower(), 'upper-roman':roman,
'decimal-leading-zero': lambda x: '0%d' % x
}
class Level(object):
def __init__(self, namespace, lvl=None):
self.namespace = namespace
self.restart = None
self.start = 0
self.fmt = 'decimal'
self.para_link = None
self.paragraph_style = self.character_style = None
self.is_numbered = False
self.num_template = None
self.bullet_template = None
self.pic_id = None
if lvl is not None:
self.read_from_xml(lvl)
def copy(self):
ans = Level(self.namespace)
for x in ('restart', 'pic_id', 'start', 'fmt', 'para_link', 'paragraph_style', 'character_style', 'is_numbered', 'num_template', 'bullet_template'):
setattr(ans, x, getattr(self, x))
return ans
def format_template(self, counter, ilvl, template):
def sub(m):
x = int(m.group(1)) - 1
if x > ilvl or x not in counter:
return ''
val = counter[x] - (0 if x == ilvl else 1)
formatter = alphabet_map.get(self.fmt, lambda x: '%d' % x)
return formatter(val)
return re.sub(r'%(\d+)', sub, template).rstrip() + '\xa0'
def read_from_xml(self, lvl, override=False):
XPath, get = self.namespace.XPath, self.namespace.get
for lr in XPath('./w:lvlRestart[@w:val]')(lvl):
try:
self.restart = int(get(lr, 'w:val'))
except (TypeError, ValueError):
pass
for lr in XPath('./w:start[@w:val]')(lvl):
try:
self.start = int(get(lr, 'w:val'))
except (TypeError, ValueError):
pass
for rPr in XPath('./w:rPr')(lvl):
ps = RunStyle(self.namespace, rPr)
if self.character_style is None:
self.character_style = ps
else:
self.character_style.update(ps)
lt = None
for lr in XPath('./w:lvlText[@w:val]')(lvl):
lt = get(lr, 'w:val')
for lr in XPath('./w:numFmt[@w:val]')(lvl):
val = get(lr, 'w:val')
if val == 'bullet':
self.is_numbered = False
cs = self.character_style
if lt in {'\uf0a7', 'o'} or (
cs is not None and cs.font_family is not inherit and cs.font_family.lower() in {'wingdings', 'symbol'}):
self.fmt = {'\uf0a7':'square', 'o':'circle'}.get(lt, 'disc')
else:
self.bullet_template = lt
for lpid in XPath('./w:lvlPicBulletId[@w:val]')(lvl):
self.pic_id = get(lpid, 'w:val')
else:
self.is_numbered = True
self.fmt = STYLE_MAP.get(val, 'decimal')
if lt and re.match(r'%\d+\.$', lt) is None:
self.num_template = lt
for lr in XPath('./w:pStyle[@w:val]')(lvl):
self.para_link = get(lr, 'w:val')
for pPr in XPath('./w:pPr')(lvl):
ps = ParagraphStyle(self.namespace, pPr)
if self.paragraph_style is None:
self.paragraph_style = ps
else:
self.paragraph_style.update(ps)
def css(self, images, pic_map, rid_map):
ans = {'list-style-type': self.fmt}
if self.pic_id:
rid = pic_map.get(self.pic_id, None)
if rid:
try:
fname = images.generate_filename(rid, rid_map=rid_map, max_width=20, max_height=20)
except Exception:
fname = None
else:
ans['list-style-image'] = 'url("images/%s")' % fname
return ans
def char_css(self):
try:
css = self.character_style.css
except AttributeError:
css = {}
css.pop('font-family', None)
return css
class NumberingDefinition(object):
def __init__(self, namespace, parent=None, an_id=None):
self.namespace = namespace
XPath, get = self.namespace.XPath, self.namespace.get
self.levels = {}
self.abstract_numbering_definition_id = an_id
if parent is not None:
for lvl in XPath('./w:lvl')(parent):
try:
ilvl = int(get(lvl, 'w:ilvl', 0))
except (TypeError, ValueError):
ilvl = 0
self.levels[ilvl] = Level(namespace, lvl)
def copy(self):
ans = NumberingDefinition(self.namespace, an_id=self.abstract_numbering_definition_id)
for l, lvl in iteritems(self.levels):
ans.levels[l] = lvl.copy()
return ans
class Numbering(object):
def __init__(self, namespace):
self.namespace = namespace
self.definitions = {}
self.instances = {}
self.counters = defaultdict(Counter)
self.starts = {}
self.pic_map = {}
def __call__(self, root, styles, rid_map):
' Read all numbering style definitions '
XPath, get = self.namespace.XPath, self.namespace.get
self.rid_map = rid_map
for npb in XPath('./w:numPicBullet[@w:numPicBulletId]')(root):
npbid = get(npb, 'w:numPicBulletId')
for idata in XPath('descendant::v:imagedata[@r:id]')(npb):
rid = get(idata, 'r:id')
self.pic_map[npbid] = rid
lazy_load = {}
for an in XPath('./w:abstractNum[@w:abstractNumId]')(root):
an_id = get(an, 'w:abstractNumId')
nsl = XPath('./w:numStyleLink[@w:val]')(an)
if nsl:
lazy_load[an_id] = get(nsl[0], 'w:val')
else:
nd = NumberingDefinition(self.namespace, an, an_id=an_id)
self.definitions[an_id] = nd
def create_instance(n, definition):
nd = definition.copy()
start_overrides = {}
for lo in XPath('./w:lvlOverride')(n):
try:
ilvl = int(get(lo, 'w:ilvl'))
except (ValueError, TypeError):
ilvl = None
for so in XPath('./w:startOverride[@w:val]')(lo):
try:
start_override = int(get(so, 'w:val'))
except (TypeError, ValueError):
pass
else:
start_overrides[ilvl] = start_override
for lvl in XPath('./w:lvl')(lo)[:1]:
nilvl = get(lvl, 'w:ilvl')
ilvl = nilvl if ilvl is None else ilvl
alvl = nd.levels.get(ilvl, None)
if alvl is None:
alvl = Level(self.namespace)
alvl.read_from_xml(lvl, override=True)
for ilvl, so in iteritems(start_overrides):
try:
nd.levels[ilvl].start = start_override
except KeyError:
pass
return nd
next_pass = {}
for n in XPath('./w:num[@w:numId]')(root):
an_id = None
num_id = get(n, 'w:numId')
for an in XPath('./w:abstractNumId[@w:val]')(n):
an_id = get(an, 'w:val')
d = self.definitions.get(an_id, None)
if d is None:
next_pass[num_id] = (an_id, n)
continue
self.instances[num_id] = create_instance(n, d)
numbering_links = styles.numbering_style_links
for an_id, style_link in iteritems(lazy_load):
num_id = numbering_links[style_link]
self.definitions[an_id] = self.instances[num_id].copy()
for num_id, (an_id, n) in iteritems(next_pass):
d = self.definitions.get(an_id, None)
if d is not None:
self.instances[num_id] = create_instance(n, d)
for num_id, d in iteritems(self.instances):
self.starts[num_id] = {lvl:d.levels[lvl].start for lvl in d.levels}
def get_pstyle(self, num_id, style_id):
d = self.instances.get(num_id, None)
if d is not None:
for ilvl, lvl in iteritems(d.levels):
if lvl.para_link == style_id:
return ilvl
def get_para_style(self, num_id, lvl):
d = self.instances.get(num_id, None)
if d is not None:
lvl = d.levels.get(lvl, None)
return getattr(lvl, 'paragraph_style', None)
def update_counter(self, counter, levelnum, levels):
counter[levelnum] += 1
for ilvl, lvl in iteritems(levels):
restart = lvl.restart
if (restart is None and ilvl == levelnum + 1) or restart == levelnum + 1:
counter[ilvl] = lvl.start
def apply_markup(self, items, body, styles, object_map, images):
seen_instances = set()
for p, num_id, ilvl in items:
d = self.instances.get(num_id, None)
if d is not None:
lvl = d.levels.get(ilvl, None)
if lvl is not None:
an_id = d.abstract_numbering_definition_id
counter = self.counters[an_id]
if ilvl not in counter or num_id not in seen_instances:
counter[ilvl] = self.starts[num_id][ilvl]
seen_instances.add(num_id)
p.tag = 'li'
p.set('value', '%s' % counter[ilvl])
p.set('list-lvl', unicode_type(ilvl))
p.set('list-id', num_id)
if lvl.num_template is not None:
val = lvl.format_template(counter, ilvl, lvl.num_template)
p.set('list-template', val)
elif lvl.bullet_template is not None:
val = lvl.format_template(counter, ilvl, lvl.bullet_template)
p.set('list-template', val)
self.update_counter(counter, ilvl, d.levels)
templates = {}
def commit(current_run):
if not current_run:
return
start = current_run[0]
parent = start.getparent()
idx = parent.index(start)
d = self.instances[start.get('list-id')]
ilvl = int(start.get('list-lvl'))
lvl = d.levels[ilvl]
lvlid = start.get('list-id') + start.get('list-lvl')
has_template = 'list-template' in start.attrib
wrap = (OL if lvl.is_numbered or has_template else UL)('\n\t')
if has_template:
wrap.set('lvlid', lvlid)
else:
wrap.set('class', styles.register(lvl.css(images, self.pic_map, self.rid_map), 'list'))
ccss = lvl.char_css()
if ccss:
ccss = styles.register(ccss, 'bullet')
parent.insert(idx, wrap)
last_val = None
for child in current_run:
wrap.append(child)
child.tail = '\n\t'
if has_template:
span = SPAN()
span.text = child.text
child.text = None
for gc in child:
span.append(gc)
child.append(span)
span = SPAN(child.get('list-template'))
if ccss:
span.set('class', ccss)
last = templates.get(lvlid, '')
if span.text and len(span.text) > len(last):
templates[lvlid] = span.text
child.insert(0, span)
for attr in ('list-lvl', 'list-id', 'list-template'):
child.attrib.pop(attr, None)
val = int(child.get('value'))
if last_val == val - 1 or wrap.tag == 'ul' or (last_val is None and val == 1):
child.attrib.pop('value')
last_val = val
current_run[-1].tail = '\n'
del current_run[:]
parents = set()
for child in body.iterdescendants('li'):
parents.add(child.getparent())
for parent in parents:
current_run = []
for child in parent:
if child.tag == 'li':
if current_run:
last = current_run[-1]
if (last.get('list-id') , last.get('list-lvl')) != (child.get('list-id'), child.get('list-lvl')):
commit(current_run)
current_run.append(child)
else:
commit(current_run)
commit(current_run)
# Convert the list items that use custom text for bullets into tables
# so that they display correctly
for wrap in body.xpath('//ol[@lvlid]'):
wrap.attrib.pop('lvlid')
wrap.tag = 'div'
wrap.set('style', 'display:table')
for i, li in enumerate(wrap.iterchildren('li')):
li.tag = 'div'
li.attrib.pop('value', None)
li.set('style', 'display:table-row')
obj = object_map[li]
bs = styles.para_cache[obj]
if i == 0:
wrap.set('style', 'display:table; padding-left:%s' %
bs.css.get('margin-left', '0'))
bs.css.pop('margin-left', None)
for child in li:
child.set('style', 'display:table-cell')

View File

@@ -0,0 +1,21 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
class Settings(object):
def __init__(self, namespace):
self.default_tab_stop = 720 / 20
self.namespace = namespace
def __call__(self, root):
for dts in self.namespace.XPath('//w:defaultTabStop[@w:val]')(root):
try:
self.default_tab_stop = int(self.namespace.get(dts, 'w:val')) / 20
except (ValueError, TypeError, AttributeError):
pass

View File

@@ -0,0 +1,504 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
import textwrap
from collections import OrderedDict, Counter
from calibre.ebooks.docx.block_styles import ParagraphStyle, inherit, twips
from calibre.ebooks.docx.char_styles import RunStyle
from calibre.ebooks.docx.tables import TableStyle
from polyglot.builtins import iteritems, itervalues
class PageProperties(object):
'''
Class representing page level properties (page size/margins) read from
sectPr elements.
'''
def __init__(self, namespace, elems=()):
self.width, self.height = 595.28, 841.89 # pts, A4
self.margin_left = self.margin_right = 72 # pts
def setval(attr, val):
val = twips(val)
if val is not None:
setattr(self, attr, val)
for sectPr in elems:
for pgSz in namespace.XPath('./w:pgSz')(sectPr):
w, h = namespace.get(pgSz, 'w:w'), namespace.get(pgSz, 'w:h')
setval('width', w), setval('height', h)
for pgMar in namespace.XPath('./w:pgMar')(sectPr):
l, r = namespace.get(pgMar, 'w:left'), namespace.get(pgMar, 'w:right')
setval('margin_left', l), setval('margin_right', r)
class Style(object):
'''
Class representing a <w:style> element. Can contain block, character, etc. styles.
'''
def __init__(self, namespace, elem):
self.namespace = namespace
self.name_path = namespace.XPath('./w:name[@w:val]')
self.based_on_path = namespace.XPath('./w:basedOn[@w:val]')
self.resolved = False
self.style_id = namespace.get(elem, 'w:styleId')
self.style_type = namespace.get(elem, 'w:type')
names = self.name_path(elem)
self.name = namespace.get(names[-1], 'w:val') if names else None
based_on = self.based_on_path(elem)
self.based_on = namespace.get(based_on[0], 'w:val') if based_on else None
if self.style_type == 'numbering':
self.based_on = None
self.is_default = namespace.get(elem, 'w:default') in {'1', 'on', 'true'}
self.paragraph_style = self.character_style = self.table_style = None
if self.style_type in {'paragraph', 'character', 'table'}:
if self.style_type == 'table':
for tblPr in namespace.XPath('./w:tblPr')(elem):
ts = TableStyle(namespace, tblPr)
if self.table_style is None:
self.table_style = ts
else:
self.table_style.update(ts)
if self.style_type in {'paragraph', 'table'}:
for pPr in namespace.XPath('./w:pPr')(elem):
ps = ParagraphStyle(namespace, pPr)
if self.paragraph_style is None:
self.paragraph_style = ps
else:
self.paragraph_style.update(ps)
for rPr in namespace.XPath('./w:rPr')(elem):
rs = RunStyle(namespace, rPr)
if self.character_style is None:
self.character_style = rs
else:
self.character_style.update(rs)
if self.style_type in {'numbering', 'paragraph'}:
self.numbering_style_link = None
for x in namespace.XPath('./w:pPr/w:numPr/w:numId[@w:val]')(elem):
self.numbering_style_link = namespace.get(x, 'w:val')
def resolve_based_on(self, parent):
if parent.table_style is not None:
if self.table_style is None:
self.table_style = TableStyle(self.namespace)
self.table_style.resolve_based_on(parent.table_style)
if parent.paragraph_style is not None:
if self.paragraph_style is None:
self.paragraph_style = ParagraphStyle(self.namespace)
self.paragraph_style.resolve_based_on(parent.paragraph_style)
if parent.character_style is not None:
if self.character_style is None:
self.character_style = RunStyle(self.namespace)
self.character_style.resolve_based_on(parent.character_style)
class Styles(object):
'''
Collection of all styles defined in the document. Used to get the final styles applicable to elements in the document markup.
'''
def __init__(self, namespace, tables):
self.namespace = namespace
self.id_map = OrderedDict()
self.para_cache = {}
self.para_char_cache = {}
self.run_cache = {}
self.classes = {}
self.counter = Counter()
self.default_styles = {}
self.tables = tables
self.numbering_style_links = {}
self.default_paragraph_style = self.default_character_style = None
def __iter__(self):
for s in itervalues(self.id_map):
yield s
def __getitem__(self, key):
return self.id_map[key]
def __len__(self):
return len(self.id_map)
def get(self, key, default=None):
return self.id_map.get(key, default)
def __call__(self, root, fonts, theme):
self.fonts, self.theme = fonts, theme
self.default_paragraph_style = self.default_character_style = None
if root is not None:
for s in self.namespace.XPath('//w:style')(root):
s = Style(self.namespace, s)
if s.style_id:
self.id_map[s.style_id] = s
if s.is_default:
self.default_styles[s.style_type] = s
if getattr(s, 'numbering_style_link', None) is not None:
self.numbering_style_links[s.style_id] = s.numbering_style_link
for dd in self.namespace.XPath('./w:docDefaults')(root):
for pd in self.namespace.XPath('./w:pPrDefault')(dd):
for pPr in self.namespace.XPath('./w:pPr')(pd):
ps = ParagraphStyle(self.namespace, pPr)
if self.default_paragraph_style is None:
self.default_paragraph_style = ps
else:
self.default_paragraph_style.update(ps)
for pd in self.namespace.XPath('./w:rPrDefault')(dd):
for pPr in self.namespace.XPath('./w:rPr')(pd):
ps = RunStyle(self.namespace, pPr)
if self.default_character_style is None:
self.default_character_style = ps
else:
self.default_character_style.update(ps)
def resolve(s, p):
if p is not None:
if not p.resolved:
resolve(p, self.get(p.based_on))
s.resolve_based_on(p)
s.resolved = True
for s in self:
if not s.resolved:
resolve(s, self.get(s.based_on))
def para_val(self, parent_styles, direct_formatting, attr):
val = getattr(direct_formatting, attr)
if val is inherit:
for ps in reversed(parent_styles):
pval = getattr(ps, attr)
if pval is not inherit:
val = pval
break
return val
def run_val(self, parent_styles, direct_formatting, attr):
val = getattr(direct_formatting, attr)
if val is not inherit:
return val
if attr in direct_formatting.toggle_properties:
# The spec (section 17.7.3) does not make sense, so we follow the behavior
# of Word, which seems to only consider the document default if the
# property has not been defined in any styles.
vals = [int(getattr(rs, attr)) for rs in parent_styles if rs is not self.default_character_style and getattr(rs, attr) is not inherit]
if vals:
return sum(vals) % 2 == 1
if self.default_character_style is not None:
return getattr(self.default_character_style, attr) is True
return False
for rs in reversed(parent_styles):
rval = getattr(rs, attr)
if rval is not inherit:
return rval
return val
def resolve_paragraph(self, p):
ans = self.para_cache.get(p, None)
if ans is None:
linked_style = None
ans = self.para_cache[p] = ParagraphStyle(self.namespace)
ans.style_name = None
direct_formatting = None
is_section_break = False
for pPr in self.namespace.XPath('./w:pPr')(p):
ps = ParagraphStyle(self.namespace, pPr)
if direct_formatting is None:
direct_formatting = ps
else:
direct_formatting.update(ps)
if self.namespace.XPath('./w:sectPr')(pPr):
is_section_break = True
if direct_formatting is None:
direct_formatting = ParagraphStyle(self.namespace)
parent_styles = []
if self.default_paragraph_style is not None:
parent_styles.append(self.default_paragraph_style)
ts = self.tables.para_style(p)
if ts is not None:
parent_styles.append(ts)
default_para = self.default_styles.get('paragraph', None)
if direct_formatting.linked_style is not None:
ls = linked_style = self.get(direct_formatting.linked_style)
if ls is not None:
ans.style_name = ls.name
ps = ls.paragraph_style
if ps is not None:
parent_styles.append(ps)
if ls.character_style is not None:
self.para_char_cache[p] = ls.character_style
elif default_para is not None:
if default_para.paragraph_style is not None:
parent_styles.append(default_para.paragraph_style)
if default_para.character_style is not None:
self.para_char_cache[p] = default_para.character_style
def has_numbering(block_style):
num_id, lvl = getattr(block_style, 'numbering_id', inherit), getattr(block_style, 'numbering_level', inherit)
return num_id is not None and num_id is not inherit and lvl is not None and lvl is not inherit
is_numbering = has_numbering(direct_formatting)
is_section_break = is_section_break and not self.namespace.XPath('./w:r')(p)
if is_numbering and not is_section_break:
num_id, lvl = direct_formatting.numbering_id, direct_formatting.numbering_level
p.set('calibre_num_id', '%s:%s' % (lvl, num_id))
ps = self.numbering.get_para_style(num_id, lvl)
if ps is not None:
parent_styles.append(ps)
if (
not is_numbering and not is_section_break and linked_style is not None and has_numbering(linked_style.paragraph_style)
):
num_id, lvl = linked_style.paragraph_style.numbering_id, linked_style.paragraph_style.numbering_level
p.set('calibre_num_id', '%s:%s' % (lvl, num_id))
is_numbering = True
ps = self.numbering.get_para_style(num_id, lvl)
if ps is not None:
parent_styles.append(ps)
for attr in ans.all_properties:
if not (is_numbering and attr == 'text_indent'): # skip text-indent for lists
setattr(ans, attr, self.para_val(parent_styles, direct_formatting, attr))
ans.linked_style = direct_formatting.linked_style
return ans
def resolve_run(self, r):
ans = self.run_cache.get(r, None)
if ans is None:
p = self.namespace.XPath('ancestor::w:p[1]')(r)
p = p[0] if p else None
ans = self.run_cache[r] = RunStyle(self.namespace)
direct_formatting = None
for rPr in self.namespace.XPath('./w:rPr')(r):
rs = RunStyle(self.namespace, rPr)
if direct_formatting is None:
direct_formatting = rs
else:
direct_formatting.update(rs)
if direct_formatting is None:
direct_formatting = RunStyle(self.namespace)
parent_styles = []
default_char = self.default_styles.get('character', None)
if self.default_character_style is not None:
parent_styles.append(self.default_character_style)
pstyle = self.para_char_cache.get(p, None)
if pstyle is not None:
parent_styles.append(pstyle)
# As best as I can understand the spec, table overrides should be
# applied before paragraph overrides, but word does it
# this way, see the December 2007 table header in the demo
# document.
ts = self.tables.run_style(p)
if ts is not None:
parent_styles.append(ts)
if direct_formatting.linked_style is not None:
ls = getattr(self.get(direct_formatting.linked_style), 'character_style', None)
if ls is not None:
parent_styles.append(ls)
elif default_char is not None and default_char.character_style is not None:
parent_styles.append(default_char.character_style)
for attr in ans.all_properties:
setattr(ans, attr, self.run_val(parent_styles, direct_formatting, attr))
if ans.font_family is not inherit:
ff = self.theme.resolve_font_family(ans.font_family)
ans.font_family = self.fonts.family_for(ff, ans.b, ans.i)
return ans
def resolve(self, obj):
if obj.tag.endswith('}p'):
return self.resolve_paragraph(obj)
if obj.tag.endswith('}r'):
return self.resolve_run(obj)
def cascade(self, layers):
self.body_font_family = 'serif'
self.body_font_size = '10pt'
self.body_color = 'black'
def promote_property(char_styles, block_style, prop):
vals = {getattr(s, prop) for s in char_styles}
if len(vals) == 1:
# All the character styles have the same value
for s in char_styles:
setattr(s, prop, inherit)
setattr(block_style, prop, next(iter(vals)))
for p, runs in iteritems(layers):
has_links = '1' in {r.get('is-link', None) for r in runs}
char_styles = [self.resolve_run(r) for r in runs]
block_style = self.resolve_paragraph(p)
for prop in ('font_family', 'font_size', 'cs_font_family', 'cs_font_size', 'color'):
if has_links and prop == 'color':
# We cannot promote color as browser rendering engines will
# override the link color setting it to blue, unless the
# color is specified on the link element itself
continue
promote_property(char_styles, block_style, prop)
for s in char_styles:
if s.text_decoration == 'none':
# The default text decoration is 'none'
s.text_decoration = inherit
def promote_most_common(block_styles, prop, default):
c = Counter()
for s in block_styles:
val = getattr(s, prop)
if val is not inherit:
c[val] += 1
val = None
if c:
val = c.most_common(1)[0][0]
for s in block_styles:
oval = getattr(s, prop)
if oval is inherit:
if default != val:
setattr(s, prop, default)
elif oval == val:
setattr(s, prop, inherit)
return val
block_styles = tuple(self.resolve_paragraph(p) for p in layers)
ff = promote_most_common(block_styles, 'font_family', self.body_font_family)
if ff is not None:
self.body_font_family = ff
fs = promote_most_common(block_styles, 'font_size', int(self.body_font_size[:2]))
if fs is not None:
self.body_font_size = '%.3gpt' % fs
color = promote_most_common(block_styles, 'color', self.body_color)
if color is not None:
self.body_color = color
def resolve_numbering(self, numbering):
# When a numPr element appears inside a paragraph style, the lvl info
# must be discarded and pStyle used instead.
self.numbering = numbering
for style in self:
ps = style.paragraph_style
if ps is not None and ps.numbering_id is not inherit:
lvl = numbering.get_pstyle(ps.numbering_id, style.style_id)
if lvl is None:
ps.numbering_id = ps.numbering_level = inherit
else:
ps.numbering_level = lvl
def apply_contextual_spacing(self, paras):
last_para = None
for p in paras:
if last_para is not None:
ls = self.resolve_paragraph(last_para)
ps = self.resolve_paragraph(p)
if ls.linked_style is not None and ls.linked_style == ps.linked_style:
if ls.contextualSpacing is True:
ls.margin_bottom = 0
if ps.contextualSpacing is True:
ps.margin_top = 0
last_para = p
def apply_section_page_breaks(self, paras):
for p in paras:
ps = self.resolve_paragraph(p)
ps.pageBreakBefore = True
def register(self, css, prefix):
h = hash(frozenset(iteritems(css)))
ans, _ = self.classes.get(h, (None, None))
if ans is None:
self.counter[prefix] += 1
ans = '%s_%d' % (prefix, self.counter[prefix])
self.classes[h] = (ans, css)
return ans
def generate_classes(self):
for bs in itervalues(self.para_cache):
css = bs.css
if css:
self.register(css, 'block')
for bs in itervalues(self.run_cache):
css = bs.css
if css:
self.register(css, 'text')
def class_name(self, css):
h = hash(frozenset(iteritems(css)))
return self.classes.get(h, (None, None))[0]
def generate_css(self, dest_dir, docx, notes_nopb, nosupsub):
ef = self.fonts.embed_fonts(dest_dir, docx)
s = '''\
body { font-family: %s; font-size: %s; color: %s }
/* In word all paragraphs have zero margins unless explicitly specified in a style */
p, h1, h2, h3, h4, h5, h6, div { margin: 0; padding: 0 }
/* In word headings only have bold font if explicitly specified,
similarly the font size is the body font size, unless explicitly set. */
h1, h2, h3, h4, h5, h6 { font-weight: normal; font-size: 1rem }
/* Setting padding-left to zero breaks rendering of lists, so we only set the other values to zero and leave padding-left for the user-agent */
ul, ol { margin: 0; padding-top: 0; padding-bottom: 0; padding-right: 0 }
/* The word hyperlink styling will set text-decoration to underline if needed */
a { text-decoration: none }
sup.noteref a { text-decoration: none }
h1.notes-header { page-break-before: always }
dl.footnote dt { font-size: large }
dl.footnote dt a { text-decoration: none }
'''
if not notes_nopb:
s += '''\
dl.footnote { page-break-after: always }
dl.footnote:last-of-type { page-break-after: avoid }
'''
s = s + '''\
span.tab { white-space: pre }
p.index-entry { text-indent: 0pt; }
p.index-entry a:visited { color: blue }
p.index-entry a:hover { color: red }
'''
if nosupsub:
s = s + '''\
sup { vertical-align: top }
sub { vertical-align: bottom }
'''
prefix = textwrap.dedent(s) % (self.body_font_family, self.body_font_size, self.body_color)
if ef:
prefix = ef + '\n' + prefix
ans = []
for (cls, css) in sorted(itervalues(self.classes), key=lambda x:x[0]):
b = ('\t%s: %s;' % (k, v) for k, v in iteritems(css))
b = '\n'.join(b)
ans.append('.%s {\n%s\n}\n' % (cls, b.rstrip(';')))
return prefix + '\n' + '\n'.join(ans)

View File

@@ -0,0 +1,700 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
from lxml.html.builder import TABLE, TR, TD
from calibre.ebooks.docx.block_styles import inherit, read_shd as rs, read_border, binary_property, border_props, ParagraphStyle, border_to_css
from calibre.ebooks.docx.char_styles import RunStyle
from polyglot.builtins import filter, iteritems, itervalues, range, unicode_type
# Read from XML {{{
read_shd = rs
edges = ('left', 'top', 'right', 'bottom')
def _read_width(elem, get):
ans = inherit
try:
w = int(get(elem, 'w:w'))
except (TypeError, ValueError):
w = 0
typ = get(elem, 'w:type', 'auto')
if typ == 'nil':
ans = '0'
elif typ == 'auto':
ans = 'auto'
elif typ == 'dxa':
ans = '%.3gpt' % (w/20)
elif typ == 'pct':
ans = '%.3g%%' % (w/50)
return ans
def read_width(parent, dest, XPath, get):
ans = inherit
for tblW in XPath('./w:tblW')(parent):
ans = _read_width(tblW, get)
setattr(dest, 'width', ans)
def read_cell_width(parent, dest, XPath, get):
ans = inherit
for tblW in XPath('./w:tcW')(parent):
ans = _read_width(tblW, get)
setattr(dest, 'width', ans)
def read_padding(parent, dest, XPath, get):
name = 'tblCellMar' if parent.tag.endswith('}tblPr') else 'tcMar'
ans = {x:inherit for x in edges}
for mar in XPath('./w:%s' % name)(parent):
for x in edges:
for edge in XPath('./w:%s' % x)(mar):
ans[x] = _read_width(edge, get)
for x in edges:
setattr(dest, 'cell_padding_%s' % x, ans[x])
def read_justification(parent, dest, XPath, get):
left = right = inherit
for jc in XPath('./w:jc[@w:val]')(parent):
val = get(jc, 'w:val')
if not val:
continue
if val == 'left':
right = 'auto'
elif val == 'right':
left = 'auto'
elif val == 'center':
left = right = 'auto'
setattr(dest, 'margin_left', left)
setattr(dest, 'margin_right', right)
def read_spacing(parent, dest, XPath, get):
ans = inherit
for cs in XPath('./w:tblCellSpacing')(parent):
ans = _read_width(cs, get)
setattr(dest, 'spacing', ans)
def read_float(parent, dest, XPath, get):
ans = inherit
for x in XPath('./w:tblpPr')(parent):
ans = {k.rpartition('}')[-1]: v for k, v in iteritems(x.attrib)}
setattr(dest, 'float', ans)
def read_indent(parent, dest, XPath, get):
ans = inherit
for cs in XPath('./w:tblInd')(parent):
ans = _read_width(cs, get)
setattr(dest, 'indent', ans)
border_edges = ('left', 'top', 'right', 'bottom', 'insideH', 'insideV')
def read_borders(parent, dest, XPath, get):
name = 'tblBorders' if parent.tag.endswith('}tblPr') else 'tcBorders'
read_border(parent, dest, XPath, get, border_edges, name)
def read_height(parent, dest, XPath, get):
ans = inherit
for rh in XPath('./w:trHeight')(parent):
rule = get(rh, 'w:hRule', 'auto')
if rule in {'auto', 'atLeast', 'exact'}:
val = get(rh, 'w:val')
ans = (rule, val)
setattr(dest, 'height', ans)
def read_vertical_align(parent, dest, XPath, get):
ans = inherit
for va in XPath('./w:vAlign')(parent):
val = get(va, 'w:val')
ans = {'center': 'middle', 'top': 'top', 'bottom': 'bottom'}.get(val, 'middle')
setattr(dest, 'vertical_align', ans)
def read_col_span(parent, dest, XPath, get):
ans = inherit
for gs in XPath('./w:gridSpan')(parent):
try:
ans = int(get(gs, 'w:val'))
except (TypeError, ValueError):
continue
setattr(dest, 'col_span', ans)
def read_merge(parent, dest, XPath, get):
for x in ('hMerge', 'vMerge'):
ans = inherit
for m in XPath('./w:%s' % x)(parent):
ans = get(m, 'w:val', 'continue')
setattr(dest, x, ans)
def read_band_size(parent, dest, XPath, get):
for x in ('Col', 'Row'):
ans = 1
for y in XPath('./w:tblStyle%sBandSize' % x)(parent):
try:
ans = int(get(y, 'w:val'))
except (TypeError, ValueError):
continue
setattr(dest, '%s_band_size' % x.lower(), ans)
def read_look(parent, dest, XPath, get):
ans = 0
for x in XPath('./w:tblLook')(parent):
try:
ans = int(get(x, 'w:val'), 16)
except (ValueError, TypeError):
continue
setattr(dest, 'look', ans)
# }}}
def clone(style):
if style is None:
return None
try:
ans = type(style)(style.namespace)
except TypeError:
return None
ans.update(style)
return ans
class Style(object):
is_bidi = False
def update(self, other):
for prop in self.all_properties:
nval = getattr(other, prop)
if nval is not inherit:
setattr(self, prop, nval)
def apply_bidi(self):
self.is_bidi = True
def convert_spacing(self):
ans = {}
if self.spacing is not inherit:
if self.spacing in {'auto', '0'}:
ans['border-collapse'] = 'collapse'
else:
ans['border-collapse'] = 'separate'
ans['border-spacing'] = self.spacing
return ans
def convert_border(self):
c = {}
for x in edges:
border_to_css(x, self, c)
val = getattr(self, 'padding_%s' % x)
if val is not inherit:
c['padding-%s' % x] = '%.3gpt' % val
if self.is_bidi:
for a in ('padding-%s', 'border-%s-style', 'border-%s-color', 'border-%s-width'):
l, r = c.get(a % 'left'), c.get(a % 'right')
if l is not None:
c[a % 'right'] = l
if r is not None:
c[a % 'left'] = r
return c
class RowStyle(Style):
all_properties = ('height', 'cantSplit', 'hidden', 'spacing',)
def __init__(self, namespace, trPr=None):
self.namespace = namespace
if trPr is None:
for p in self.all_properties:
setattr(self, p, inherit)
else:
for p in ('hidden', 'cantSplit'):
setattr(self, p, binary_property(trPr, p, namespace.XPath, namespace.get))
for p in ('spacing', 'height'):
f = globals()['read_%s' % p]
f(trPr, self, namespace.XPath, namespace.get)
self._css = None
@property
def css(self):
if self._css is None:
c = self._css = {}
if self.hidden is True:
c['display'] = 'none'
if self.cantSplit is True:
c['page-break-inside'] = 'avoid'
if self.height is not inherit:
rule, val = self.height
if rule != 'auto':
try:
c['min-height' if rule == 'atLeast' else 'height'] = '%.3gpt' % (int(val)/20)
except (ValueError, TypeError):
pass
c.update(self.convert_spacing())
return self._css
class CellStyle(Style):
all_properties = ('background_color', 'cell_padding_left', 'cell_padding_right', 'cell_padding_top',
'cell_padding_bottom', 'width', 'vertical_align', 'col_span', 'vMerge', 'hMerge', 'row_span',
) + tuple(k % edge for edge in border_edges for k in border_props)
def __init__(self, namespace, tcPr=None):
self.namespace = namespace
if tcPr is None:
for p in self.all_properties:
setattr(self, p, inherit)
else:
for x in ('borders', 'shd', 'padding', 'cell_width', 'vertical_align', 'col_span', 'merge'):
f = globals()['read_%s' % x]
f(tcPr, self, namespace.XPath, namespace.get)
self.row_span = inherit
self._css = None
@property
def css(self):
if self._css is None:
self._css = c = {}
if self.background_color is not inherit:
c['background-color'] = self.background_color
if self.width not in (inherit, 'auto'):
c['width'] = self.width
c['vertical-align'] = 'top' if self.vertical_align is inherit else self.vertical_align
for x in edges:
val = getattr(self, 'cell_padding_%s' % x)
if val not in (inherit, 'auto'):
c['padding-%s' % x] = val
elif val is inherit and x in {'left', 'right'}:
c['padding-%s' % x] = '%.3gpt' % (115/20)
# In Word, tables are apparently rendered with some default top and
# bottom padding irrespective of the cellMargin values. Simulate
# that here.
for x in ('top', 'bottom'):
if c.get('padding-%s' % x, '0pt') == '0pt':
c['padding-%s' % x] = '0.5ex'
c.update(self.convert_border())
return self._css
class TableStyle(Style):
all_properties = (
'width', 'float', 'cell_padding_left', 'cell_padding_right', 'cell_padding_top',
'cell_padding_bottom', 'margin_left', 'margin_right', 'background_color',
'spacing', 'indent', 'overrides', 'col_band_size', 'row_band_size', 'look', 'bidi',
) + tuple(k % edge for edge in border_edges for k in border_props)
def __init__(self, namespace, tblPr=None):
self.namespace = namespace
if tblPr is None:
for p in self.all_properties:
setattr(self, p, inherit)
else:
self.overrides = inherit
self.bidi = binary_property(tblPr, 'bidiVisual', namespace.XPath, namespace.get)
for x in ('width', 'float', 'padding', 'shd', 'justification', 'spacing', 'indent', 'borders', 'band_size', 'look'):
f = globals()['read_%s' % x]
f(tblPr, self, self.namespace.XPath, self.namespace.get)
parent = tblPr.getparent()
if self.namespace.is_tag(parent, 'w:style'):
self.overrides = {}
for tblStylePr in self.namespace.XPath('./w:tblStylePr[@w:type]')(parent):
otype = self.namespace.get(tblStylePr, 'w:type')
orides = self.overrides[otype] = {}
for tblPr in self.namespace.XPath('./w:tblPr')(tblStylePr):
orides['table'] = TableStyle(self.namespace, tblPr)
for trPr in self.namespace.XPath('./w:trPr')(tblStylePr):
orides['row'] = RowStyle(self.namespace, trPr)
for tcPr in self.namespace.XPath('./w:tcPr')(tblStylePr):
orides['cell'] = CellStyle(self.namespace, tcPr)
for pPr in self.namespace.XPath('./w:pPr')(tblStylePr):
orides['para'] = ParagraphStyle(self.namespace, pPr)
for rPr in self.namespace.XPath('./w:rPr')(tblStylePr):
orides['run'] = RunStyle(self.namespace, rPr)
self._css = None
def resolve_based_on(self, parent):
for p in self.all_properties:
val = getattr(self, p)
if val is inherit:
setattr(self, p, getattr(parent, p))
@property
def css(self):
if self._css is None:
c = self._css = {}
if self.width not in (inherit, 'auto'):
c['width'] = self.width
for x in ('background_color', 'margin_left', 'margin_right'):
val = getattr(self, x)
if val is not inherit:
c[x.replace('_', '-')] = val
if self.indent not in (inherit, 'auto') and self.margin_left != 'auto':
c['margin-left'] = self.indent
if self.float is not inherit:
for x in ('left', 'top', 'right', 'bottom'):
val = self.float.get('%sFromText' % x, 0)
try:
val = '%.3gpt' % (int(val) / 20)
except (ValueError, TypeError):
val = '0'
c['margin-%s' % x] = val
if 'tblpXSpec' in self.float:
c['float'] = 'right' if self.float['tblpXSpec'] in {'right', 'outside'} else 'left'
else:
page = self.page
page_width = page.width - page.margin_left - page.margin_right
try:
x = int(self.float['tblpX']) / 20
except (KeyError, ValueError, TypeError):
x = 0
c['float'] = 'left' if (x/page_width) < 0.65 else 'right'
c.update(self.convert_spacing())
if 'border-collapse' not in c:
c['border-collapse'] = 'collapse'
c.update(self.convert_border())
return self._css
class Table(object):
def __init__(self, namespace, tbl, styles, para_map, is_sub_table=False):
self.namespace = namespace
self.tbl = tbl
self.styles = styles
self.is_sub_table = is_sub_table
# Read Table Style
style = {'table':TableStyle(self.namespace)}
for tblPr in self.namespace.XPath('./w:tblPr')(tbl):
for ts in self.namespace.XPath('./w:tblStyle[@w:val]')(tblPr):
style_id = self.namespace.get(ts, 'w:val')
s = styles.get(style_id)
if s is not None:
if s.table_style is not None:
style['table'].update(s.table_style)
if s.paragraph_style is not None:
if 'paragraph' in style:
style['paragraph'].update(s.paragraph_style)
else:
style['paragraph'] = s.paragraph_style
if s.character_style is not None:
if 'run' in style:
style['run'].update(s.character_style)
else:
style['run'] = s.character_style
style['table'].update(TableStyle(self.namespace, tblPr))
self.table_style, self.paragraph_style = style['table'], style.get('paragraph', None)
self.run_style = style.get('run', None)
self.overrides = self.table_style.overrides
if self.overrides is inherit:
self.overrides = {}
if 'wholeTable' in self.overrides and 'table' in self.overrides['wholeTable']:
self.table_style.update(self.overrides['wholeTable']['table'])
self.style_map = {}
self.paragraphs = []
self.cell_map = []
rows = self.namespace.XPath('./w:tr')(tbl)
for r, tr in enumerate(rows):
overrides = self.get_overrides(r, None, len(rows), None)
self.resolve_row_style(tr, overrides)
cells = self.namespace.XPath('./w:tc')(tr)
self.cell_map.append([])
for c, tc in enumerate(cells):
overrides = self.get_overrides(r, c, len(rows), len(cells))
self.resolve_cell_style(tc, overrides, r, c, len(rows), len(cells))
self.cell_map[-1].append(tc)
for p in self.namespace.XPath('./w:p')(tc):
para_map[p] = self
self.paragraphs.append(p)
self.resolve_para_style(p, overrides)
self.handle_merged_cells()
self.sub_tables = {x:Table(namespace, x, styles, para_map, is_sub_table=True) for x in self.namespace.XPath('./w:tr/w:tc/w:tbl')(tbl)}
@property
def bidi(self):
return self.table_style.bidi is True
def override_allowed(self, name):
'Check if the named override is allowed by the tblLook element'
if name.endswith('Cell') or name == 'wholeTable':
return True
look = self.table_style.look
if (look & 0x0020 and name == 'firstRow') or (look & 0x0040 and name == 'lastRow') or \
(look & 0x0080 and name == 'firstCol') or (look & 0x0100 and name == 'lastCol'):
return True
if name.startswith('band'):
if name.endswith('Horz'):
return not bool(look & 0x0200)
if name.endswith('Vert'):
return not bool(look & 0x0400)
return False
def get_overrides(self, r, c, num_of_rows, num_of_cols_in_row):
'List of possible overrides for the given para'
overrides = ['wholeTable']
def divisor(m, n):
return (m - (m % n)) // n
if c is not None:
odd_column_band = (divisor(c, self.table_style.col_band_size) % 2) == 1
overrides.append('band%dVert' % (1 if odd_column_band else 2))
odd_row_band = (divisor(r, self.table_style.row_band_size) % 2) == 1
overrides.append('band%dHorz' % (1 if odd_row_band else 2))
# According to the OOXML spec columns should have higher override
# priority than rows, but Word seems to do it the other way around.
if c is not None:
if c == 0:
overrides.append('firstCol')
if c >= num_of_cols_in_row - 1:
overrides.append('lastCol')
if r == 0:
overrides.append('firstRow')
if r >= num_of_rows - 1:
overrides.append('lastRow')
if c is not None:
if r == 0:
if c == 0:
overrides.append('nwCell')
if c == num_of_cols_in_row - 1:
overrides.append('neCell')
if r == num_of_rows - 1:
if c == 0:
overrides.append('swCell')
if c == num_of_cols_in_row - 1:
overrides.append('seCell')
return tuple(filter(self.override_allowed, overrides))
def resolve_row_style(self, tr, overrides):
rs = RowStyle(self.namespace)
for o in overrides:
if o in self.overrides:
ovr = self.overrides[o]
ors = ovr.get('row', None)
if ors is not None:
rs.update(ors)
for trPr in self.namespace.XPath('./w:trPr')(tr):
rs.update(RowStyle(self.namespace, trPr))
if self.bidi:
rs.apply_bidi()
self.style_map[tr] = rs
def resolve_cell_style(self, tc, overrides, row, col, rows, cols_in_row):
cs = CellStyle(self.namespace)
for o in overrides:
if o in self.overrides:
ovr = self.overrides[o]
ors = ovr.get('cell', None)
if ors is not None:
cs.update(ors)
for tcPr in self.namespace.XPath('./w:tcPr')(tc):
cs.update(CellStyle(self.namespace, tcPr))
for x in edges:
p = 'cell_padding_%s' % x
val = getattr(cs, p)
if val is inherit:
setattr(cs, p, getattr(self.table_style, p))
is_inside_edge = (
(x == 'left' and col > 0) or
(x == 'top' and row > 0) or
(x == 'right' and col < cols_in_row - 1) or
(x == 'bottom' and row < rows -1)
)
inside_edge = ('insideH' if x in {'top', 'bottom'} else 'insideV') if is_inside_edge else None
for prop in border_props:
if not prop.startswith('border'):
continue
eprop = prop % x
iprop = (prop % inside_edge) if inside_edge else None
val = getattr(cs, eprop)
if val is inherit and iprop is not None:
# Use the insideX borders if the main cell borders are not
# specified
val = getattr(cs, iprop)
if val is inherit:
val = getattr(self.table_style, iprop)
if not is_inside_edge and val == 'none':
# Cell borders must override table borders even when the
# table border is not null and the cell border is null.
val = 'hidden'
setattr(cs, eprop, val)
if self.bidi:
cs.apply_bidi()
self.style_map[tc] = cs
def resolve_para_style(self, p, overrides):
text_styles = [clone(self.paragraph_style), clone(self.run_style)]
for o in overrides:
if o in self.overrides:
ovr = self.overrides[o]
for i, name in enumerate(('para', 'run')):
ops = ovr.get(name, None)
if ops is not None:
if text_styles[i] is None:
text_styles[i] = ops
else:
text_styles[i].update(ops)
self.style_map[p] = text_styles
def handle_merged_cells(self):
if not self.cell_map:
return
# Handle vMerge
max_col_num = max(len(r) for r in self.cell_map)
for c in range(max_col_num):
cells = [row[c] if c < len(row) else None for row in self.cell_map]
runs = [[]]
for cell in cells:
try:
s = self.style_map[cell]
except KeyError: # cell is None
s = CellStyle(self.namespace)
if s.vMerge == 'restart':
runs.append([cell])
elif s.vMerge == 'continue':
runs[-1].append(cell)
else:
runs.append([])
for run in runs:
if len(run) > 1:
self.style_map[run[0]].row_span = len(run)
for tc in run[1:]:
tc.getparent().remove(tc)
# Handle hMerge
for cells in self.cell_map:
runs = [[]]
for cell in cells:
try:
s = self.style_map[cell]
except KeyError: # cell is None
s = CellStyle(self.namespace)
if s.col_span is not inherit:
runs.append([])
continue
if s.hMerge == 'restart':
runs.append([cell])
elif s.hMerge == 'continue':
runs[-1].append(cell)
else:
runs.append([])
for run in runs:
if len(run) > 1:
self.style_map[run[0]].col_span = len(run)
for tc in run[1:]:
tc.getparent().remove(tc)
def __iter__(self):
for p in self.paragraphs:
yield p
for t in itervalues(self.sub_tables):
for p in t:
yield p
def apply_markup(self, rmap, page, parent=None):
table = TABLE('\n\t\t')
if self.bidi:
table.set('dir', 'rtl')
self.table_style.page = page
style_map = {}
if parent is None:
try:
first_para = rmap[next(iter(self))]
except StopIteration:
return
parent = first_para.getparent()
idx = parent.index(first_para)
parent.insert(idx, table)
else:
parent.append(table)
for row in self.namespace.XPath('./w:tr')(self.tbl):
tr = TR('\n\t\t\t')
style_map[tr] = self.style_map[row]
tr.tail = '\n\t\t'
table.append(tr)
for tc in self.namespace.XPath('./w:tc')(row):
td = TD()
style_map[td] = s = self.style_map[tc]
if s.col_span is not inherit:
td.set('colspan', unicode_type(s.col_span))
if s.row_span is not inherit:
td.set('rowspan', unicode_type(s.row_span))
td.tail = '\n\t\t\t'
tr.append(td)
for x in self.namespace.XPath('./w:p|./w:tbl')(tc):
if x.tag.endswith('}p'):
td.append(rmap[x])
else:
self.sub_tables[x].apply_markup(rmap, page, parent=td)
if len(tr):
tr[-1].tail = '\n\t\t'
if len(table):
table[-1].tail = '\n\t'
table_style = self.table_style.css
if table_style:
table.set('class', self.styles.register(table_style, 'table'))
for elem, style in iteritems(style_map):
css = style.css
if css:
elem.set('class', self.styles.register(css, elem.tag))
class Tables(object):
def __init__(self, namespace):
self.tables = []
self.para_map = {}
self.sub_tables = set()
self.namespace = namespace
def register(self, tbl, styles):
if tbl in self.sub_tables:
return
self.tables.append(Table(self.namespace, tbl, styles, self.para_map))
self.sub_tables |= set(self.tables[-1].sub_tables)
def apply_markup(self, object_map, page_map):
rmap = {v:k for k, v in iteritems(object_map)}
for table in self.tables:
table.apply_markup(rmap, page_map[table.tbl])
def para_style(self, p):
table = self.para_map.get(p, None)
if table is not None:
return table.style_map.get(p, (None, None))[0]
def run_style(self, p):
table = self.para_map.get(p, None)
if table is not None:
return table.style_map.get(p, (None, None))[1]

View File

@@ -0,0 +1,29 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
class Theme(object):
def __init__(self, namespace):
self.major_latin_font = 'Cambria'
self.minor_latin_font = 'Calibri'
self.namespace = namespace
def __call__(self, root):
for fs in self.namespace.XPath('//a:fontScheme')(root):
for mj in self.namespace.XPath('./a:majorFont')(fs):
for l in self.namespace.XPath('./a:latin[@typeface]')(mj):
self.major_latin_font = l.get('typeface')
for mj in self.namespace.XPath('./a:minorFont')(fs):
for l in self.namespace.XPath('./a:latin[@typeface]')(mj):
self.minor_latin_font = l.get('typeface')
def resolve_font_family(self, ff):
if ff.startswith('|'):
ff = ff[1:-1]
ff = self.major_latin_font if ff.startswith('major') else self.minor_latin_font
return ff

View File

@@ -0,0 +1,839 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
import sys, os, re, math, errno, uuid, numbers
from collections import OrderedDict, defaultdict
from lxml import html
from lxml.html.builder import (
HTML, HEAD, TITLE, BODY, LINK, META, P, SPAN, BR, DIV, A, DT, DL, DD, H1)
from calibre import guess_type
from calibre.ebooks.docx.container import DOCX, fromstring
from calibre.ebooks.docx.names import XML, generate_anchor
from calibre.ebooks.docx.styles import Styles, inherit, PageProperties
from calibre.ebooks.docx.numbering import Numbering
from calibre.ebooks.docx.fonts import Fonts, is_symbol_font, map_symbol_text
from calibre.ebooks.docx.images import Images
from calibre.ebooks.docx.tables import Tables
from calibre.ebooks.docx.footnotes import Footnotes
from calibre.ebooks.docx.cleanup import cleanup_markup
from calibre.ebooks.docx.theme import Theme
from calibre.ebooks.docx.toc import create_toc
from calibre.ebooks.docx.fields import Fields
from calibre.ebooks.docx.settings import Settings
from calibre.ebooks.metadata.opf2 import OPFCreator
from calibre.utils.localization import canonicalize_lang, lang_as_iso639_1
from polyglot.builtins import iteritems, itervalues, filter, getcwd, map, unicode_type
NBSP = '\xa0'
class Text:
def __init__(self, elem, attr, buf):
self.elem, self.attr, self.buf = elem, attr, buf
self.elems = [self.elem]
def add_elem(self, elem):
self.elems.append(elem)
setattr(self.elem, self.attr, ''.join(self.buf))
self.elem, self.attr, self.buf = elem, 'tail', []
def __iter__(self):
return iter(self.elems)
def html_lang(docx_lang):
lang = canonicalize_lang(docx_lang)
if lang and lang != 'und':
lang = lang_as_iso639_1(lang)
if lang:
return lang
class Convert(object):
def __init__(self, path_or_stream, dest_dir=None, log=None, detect_cover=True, notes_text=None, notes_nopb=False, nosupsub=False):
self.docx = DOCX(path_or_stream, log=log)
self.namespace = self.docx.namespace
self.ms_pat = re.compile(r'\s{2,}')
self.ws_pat = re.compile(r'[\n\r\t]')
self.log = self.docx.log
self.detect_cover = detect_cover
self.notes_text = notes_text or _('Notes')
self.notes_nopb = notes_nopb
self.nosupsub = nosupsub
self.dest_dir = dest_dir or getcwd()
self.mi = self.docx.metadata
self.body = BODY()
self.theme = Theme(self.namespace)
self.settings = Settings(self.namespace)
self.tables = Tables(self.namespace)
self.fields = Fields(self.namespace)
self.styles = Styles(self.namespace, self.tables)
self.images = Images(self.namespace, self.log)
self.object_map = OrderedDict()
self.html = HTML(
HEAD(
META(charset='utf-8'),
TITLE(self.mi.title or _('Unknown')),
LINK(rel='stylesheet', type='text/css', href='docx.css'),
),
self.body
)
self.html.text='\n\t'
self.html[0].text='\n\t\t'
self.html[0].tail='\n'
for child in self.html[0]:
child.tail = '\n\t\t'
self.html[0][-1].tail = '\n\t'
self.html[1].text = self.html[1].tail = '\n'
lang = html_lang(self.mi.language)
if lang:
self.html.set('lang', lang)
self.doc_lang = lang
else:
self.doc_lang = None
def __call__(self):
doc = self.docx.document
relationships_by_id, relationships_by_type = self.docx.document_relationships
self.resolve_alternate_content(doc)
self.fields(doc, self.log)
self.read_styles(relationships_by_type)
self.images(relationships_by_id)
self.layers = OrderedDict()
self.framed = [[]]
self.frame_map = {}
self.framed_map = {}
self.anchor_map = {}
self.link_map = defaultdict(list)
self.link_source_map = {}
self.toc_anchor = None
self.block_runs = []
paras = []
self.log.debug('Converting Word markup to HTML')
self.read_page_properties(doc)
self.current_rels = relationships_by_id
for wp, page_properties in iteritems(self.page_map):
self.current_page = page_properties
if wp.tag.endswith('}p'):
p = self.convert_p(wp)
self.body.append(p)
paras.append(wp)
self.read_block_anchors(doc)
self.styles.apply_contextual_spacing(paras)
self.mark_block_runs(paras)
# Apply page breaks at the start of every section, except the first
# section (since that will be the start of the file)
self.styles.apply_section_page_breaks(self.section_starts[1:])
notes_header = None
orig_rid_map = self.images.rid_map
if self.footnotes.has_notes:
self.body.append(H1(self.notes_text))
notes_header = self.body[-1]
notes_header.set('class', 'notes-header')
for anchor, text, note in self.footnotes:
dl = DL(id=anchor)
dl.set('class', 'footnote')
self.body.append(dl)
dl.append(DT('[', A('' + text, href='#back_%s' % anchor, title=text)))
dl[-1][0].tail = ']'
dl.append(DD())
paras = []
self.images.rid_map = self.current_rels = note.rels[0]
for wp in note:
if wp.tag.endswith('}tbl'):
self.tables.register(wp, self.styles)
self.page_map[wp] = self.current_page
else:
p = self.convert_p(wp)
dl[-1].append(p)
paras.append(wp)
self.styles.apply_contextual_spacing(paras)
self.mark_block_runs(paras)
for p, wp in iteritems(self.object_map):
if len(p) > 0 and not p.text and len(p[0]) > 0 and not p[0].text and p[0][0].get('class', None) == 'tab':
# Paragraph uses tabs for indentation, convert to text-indent
parent = p[0]
tabs = []
for child in parent:
if child.get('class', None) == 'tab':
tabs.append(child)
if child.tail:
break
else:
break
indent = len(tabs) * self.settings.default_tab_stop
style = self.styles.resolve(wp)
if style.text_indent is inherit or (hasattr(style.text_indent, 'endswith') and style.text_indent.endswith('pt')):
if style.text_indent is not inherit:
indent = float(style.text_indent[:-2]) + indent
style.text_indent = '%.3gpt' % indent
parent.text = tabs[-1].tail or ''
list(map(parent.remove, tabs))
self.images.rid_map = orig_rid_map
self.resolve_links()
self.styles.cascade(self.layers)
self.tables.apply_markup(self.object_map, self.page_map)
numbered = []
for html_obj, obj in iteritems(self.object_map):
raw = obj.get('calibre_num_id', None)
if raw is not None:
lvl, num_id = raw.partition(':')[0::2]
try:
lvl = int(lvl)
except (TypeError, ValueError):
lvl = 0
numbered.append((html_obj, num_id, lvl))
self.numbering.apply_markup(numbered, self.body, self.styles, self.object_map, self.images)
self.apply_frames()
if len(self.body) > 0:
self.body.text = '\n\t'
for child in self.body:
child.tail = '\n\t'
self.body[-1].tail = '\n'
self.log.debug('Converting styles to CSS')
self.styles.generate_classes()
for html_obj, obj in iteritems(self.object_map):
style = self.styles.resolve(obj)
if style is not None:
css = style.css
if css:
cls = self.styles.class_name(css)
if cls:
html_obj.set('class', cls)
for html_obj, css in iteritems(self.framed_map):
cls = self.styles.class_name(css)
if cls:
html_obj.set('class', cls)
if notes_header is not None:
for h in self.namespace.children(self.body, 'h1', 'h2', 'h3'):
notes_header.tag = h.tag
cls = h.get('class', None)
if cls and cls != 'notes-header':
notes_header.set('class', '%s notes-header' % cls)
break
self.fields.polish_markup(self.object_map)
self.log.debug('Cleaning up redundant markup generated by Word')
self.cover_image = cleanup_markup(self.log, self.html, self.styles, self.dest_dir, self.detect_cover, self.namespace.XPath)
return self.write(doc)
def read_page_properties(self, doc):
current = []
self.page_map = OrderedDict()
self.section_starts = []
for p in self.namespace.descendants(doc, 'w:p', 'w:tbl'):
if p.tag.endswith('}tbl'):
self.tables.register(p, self.styles)
current.append(p)
continue
sect = tuple(self.namespace.descendants(p, 'w:sectPr'))
if sect:
pr = PageProperties(self.namespace, sect)
paras = current + [p]
for x in paras:
self.page_map[x] = pr
self.section_starts.append(paras[0])
current = []
else:
current.append(p)
if current:
self.section_starts.append(current[0])
last = self.namespace.XPath('./w:body/w:sectPr')(doc)
pr = PageProperties(self.namespace, last)
for x in current:
self.page_map[x] = pr
def resolve_alternate_content(self, doc):
# For proprietary extensions in Word documents use the fallback, spec
# compliant form
# See https://wiki.openoffice.org/wiki/OOXML/Markup_Compatibility_and_Extensibility
for ac in self.namespace.descendants(doc, 'mc:AlternateContent'):
choices = self.namespace.XPath('./mc:Choice')(ac)
fallbacks = self.namespace.XPath('./mc:Fallback')(ac)
if fallbacks:
for choice in choices:
ac.remove(choice)
def read_styles(self, relationships_by_type):
def get_name(rtype, defname):
name = relationships_by_type.get(rtype, None)
if name is None:
cname = self.docx.document_name.split('/')
cname[-1] = defname
if self.docx.exists('/'.join(cname)):
name = name
if name and name.startswith('word/word') and not self.docx.exists(name):
name = name.partition('/')[2]
return name
nname = get_name(self.namespace.names['NUMBERING'], 'numbering.xml')
sname = get_name(self.namespace.names['STYLES'], 'styles.xml')
sename = get_name(self.namespace.names['SETTINGS'], 'settings.xml')
fname = get_name(self.namespace.names['FONTS'], 'fontTable.xml')
tname = get_name(self.namespace.names['THEMES'], 'theme1.xml')
foname = get_name(self.namespace.names['FOOTNOTES'], 'footnotes.xml')
enname = get_name(self.namespace.names['ENDNOTES'], 'endnotes.xml')
numbering = self.numbering = Numbering(self.namespace)
footnotes = self.footnotes = Footnotes(self.namespace)
fonts = self.fonts = Fonts(self.namespace)
foraw = enraw = None
forel, enrel = ({}, {}), ({}, {})
if sename is not None:
try:
seraw = self.docx.read(sename)
except KeyError:
self.log.warn('Settings %s do not exist' % sename)
except EnvironmentError as e:
if e.errno != errno.ENOENT:
raise
self.log.warn('Settings %s file missing' % sename)
else:
self.settings(fromstring(seraw))
if foname is not None:
try:
foraw = self.docx.read(foname)
except KeyError:
self.log.warn('Footnotes %s do not exist' % foname)
else:
forel = self.docx.get_relationships(foname)
if enname is not None:
try:
enraw = self.docx.read(enname)
except KeyError:
self.log.warn('Endnotes %s do not exist' % enname)
else:
enrel = self.docx.get_relationships(enname)
footnotes(fromstring(foraw) if foraw else None, forel, fromstring(enraw) if enraw else None, enrel)
if fname is not None:
embed_relationships = self.docx.get_relationships(fname)[0]
try:
raw = self.docx.read(fname)
except KeyError:
self.log.warn('Fonts table %s does not exist' % fname)
else:
fonts(fromstring(raw), embed_relationships, self.docx, self.dest_dir)
if tname is not None:
try:
raw = self.docx.read(tname)
except KeyError:
self.log.warn('Styles %s do not exist' % sname)
else:
self.theme(fromstring(raw))
styles_loaded = False
if sname is not None:
try:
raw = self.docx.read(sname)
except KeyError:
self.log.warn('Styles %s do not exist' % sname)
else:
self.styles(fromstring(raw), fonts, self.theme)
styles_loaded = True
if not styles_loaded:
self.styles(None, fonts, self.theme)
if nname is not None:
try:
raw = self.docx.read(nname)
except KeyError:
self.log.warn('Numbering styles %s do not exist' % nname)
else:
numbering(fromstring(raw), self.styles, self.docx.get_relationships(nname)[0])
self.styles.resolve_numbering(numbering)
def write(self, doc):
toc = create_toc(doc, self.body, self.resolved_link_map, self.styles, self.object_map, self.log, self.namespace)
raw = html.tostring(self.html, encoding='utf-8', doctype='<!DOCTYPE html>')
with lopen(os.path.join(self.dest_dir, 'index.html'), 'wb') as f:
f.write(raw)
css = self.styles.generate_css(self.dest_dir, self.docx, self.notes_nopb, self.nosupsub)
if css:
with lopen(os.path.join(self.dest_dir, 'docx.css'), 'wb') as f:
f.write(css.encode('utf-8'))
opf = OPFCreator(self.dest_dir, self.mi)
opf.toc = toc
opf.create_manifest_from_files_in([self.dest_dir])
for item in opf.manifest:
if item.media_type == 'text/html':
item.media_type = guess_type('a.xhtml')[0]
opf.create_spine(['index.html'])
if self.cover_image is not None:
opf.guide.set_cover(self.cover_image)
def process_guide(E, guide):
if self.toc_anchor is not None:
guide.append(E.reference(
href='index.html#' + self.toc_anchor, title=_('Table of Contents'), type='toc'))
toc_file = os.path.join(self.dest_dir, 'toc.ncx')
with lopen(os.path.join(self.dest_dir, 'metadata.opf'), 'wb') as of, open(toc_file, 'wb') as ncx:
opf.render(of, ncx, 'toc.ncx', process_guide=process_guide)
if os.path.getsize(toc_file) == 0:
os.remove(toc_file)
return os.path.join(self.dest_dir, 'metadata.opf')
def read_block_anchors(self, doc):
doc_anchors = frozenset(self.namespace.XPath('./w:body/w:bookmarkStart[@w:name]')(doc))
if doc_anchors:
current_bm = set()
rmap = {v:k for k, v in iteritems(self.object_map)}
for p in self.namespace.descendants(doc, 'w:p', 'w:bookmarkStart[@w:name]'):
if p.tag.endswith('}p'):
if current_bm and p in rmap:
para = rmap[p]
if 'id' not in para.attrib:
para.set('id', generate_anchor(next(iter(current_bm)), frozenset(itervalues(self.anchor_map))))
for name in current_bm:
self.anchor_map[name] = para.get('id')
current_bm = set()
elif p in doc_anchors:
anchor = self.namespace.get(p, 'w:name')
if anchor:
current_bm.add(anchor)
def convert_p(self, p):
dest = P()
self.object_map[dest] = p
style = self.styles.resolve_paragraph(p)
self.layers[p] = []
self.frame_map[p] = style.frame
self.add_frame(dest, style.frame)
current_anchor = None
current_hyperlink = None
hl_xpath = self.namespace.XPath('ancestor::w:hyperlink[1]')
def p_parent(x):
# Ensure that nested <w:p> tags are handled. These can occur if a
# textbox is present inside a paragraph.
while True:
x = x.getparent()
try:
if x.tag.endswith('}p'):
return x
except AttributeError:
break
for x in self.namespace.descendants(p, 'w:r', 'w:bookmarkStart', 'w:hyperlink', 'w:instrText'):
if p_parent(x) is not p:
continue
if x.tag.endswith('}r'):
span = self.convert_run(x)
if current_anchor is not None:
(dest if len(dest) == 0 else span).set('id', current_anchor)
current_anchor = None
if current_hyperlink is not None:
try:
hl = hl_xpath(x)[0]
self.link_map[hl].append(span)
self.link_source_map[hl] = self.current_rels
x.set('is-link', '1')
except IndexError:
current_hyperlink = None
dest.append(span)
self.layers[p].append(x)
elif x.tag.endswith('}bookmarkStart'):
anchor = self.namespace.get(x, 'w:name')
if anchor and anchor not in self.anchor_map and anchor != '_GoBack':
# _GoBack is a special bookmark inserted by Word 2010 for
# the return to previous edit feature, we ignore it
old_anchor = current_anchor
self.anchor_map[anchor] = current_anchor = generate_anchor(anchor, frozenset(itervalues(self.anchor_map)))
if old_anchor is not None:
# The previous anchor was not applied to any element
for a, t in tuple(iteritems(self.anchor_map)):
if t == old_anchor:
self.anchor_map[a] = current_anchor
elif x.tag.endswith('}hyperlink'):
current_hyperlink = x
elif x.tag.endswith('}instrText') and x.text and x.text.strip().startswith('TOC '):
old_anchor = current_anchor
anchor = unicode_type(uuid.uuid4())
self.anchor_map[anchor] = current_anchor = generate_anchor('toc', frozenset(itervalues(self.anchor_map)))
self.toc_anchor = current_anchor
if old_anchor is not None:
# The previous anchor was not applied to any element
for a, t in tuple(iteritems(self.anchor_map)):
if t == old_anchor:
self.anchor_map[a] = current_anchor
if current_anchor is not None:
# This paragraph had no <w:r> descendants
dest.set('id', current_anchor)
current_anchor = None
m = re.match(r'heading\s+(\d+)$', style.style_name or '', re.IGNORECASE)
if m is not None:
n = min(6, max(1, int(m.group(1))))
dest.tag = 'h%d' % n
dest.set('data-heading-level', unicode_type(n))
if style.bidi is True:
dest.set('dir', 'rtl')
border_runs = []
common_borders = []
for span in dest:
run = self.object_map[span]
style = self.styles.resolve_run(run)
if not border_runs or border_runs[-1][1].same_border(style):
border_runs.append((span, style))
elif border_runs:
if len(border_runs) > 1:
common_borders.append(border_runs)
border_runs = []
for border_run in common_borders:
spans = []
bs = {}
for span, style in border_run:
style.get_border_css(bs)
style.clear_border_css()
spans.append(span)
if bs:
cls = self.styles.register(bs, 'text_border')
wrapper = self.wrap_elems(spans, SPAN())
wrapper.set('class', cls)
if not dest.text and len(dest) == 0 and not style.has_visible_border():
# Empty paragraph add a non-breaking space so that it is rendered
# by WebKit
dest.text = NBSP
# If the last element in a block is a <br> the <br> is not rendered in
# HTML, unless it is followed by a trailing space. Word, on the other
# hand inserts a blank line for trailing <br>s.
if len(dest) > 0 and not dest[-1].tail:
if dest[-1].tag == 'br':
dest[-1].tail = NBSP
elif len(dest[-1]) > 0 and dest[-1][-1].tag == 'br' and not dest[-1][-1].tail:
dest[-1][-1].tail = NBSP
return dest
def wrap_elems(self, elems, wrapper):
p = elems[0].getparent()
idx = p.index(elems[0])
p.insert(idx, wrapper)
wrapper.tail = elems[-1].tail
elems[-1].tail = None
for elem in elems:
try:
p.remove(elem)
except ValueError:
# Probably a hyperlink that spans multiple
# paragraphs,theoretically we should break this up into
# multiple hyperlinks, but I can't be bothered.
elem.getparent().remove(elem)
wrapper.append(elem)
return wrapper
def resolve_links(self):
self.resolved_link_map = {}
for hyperlink, spans in iteritems(self.link_map):
relationships_by_id = self.link_source_map[hyperlink]
span = spans[0]
if len(spans) > 1:
span = self.wrap_elems(spans, SPAN())
span.tag = 'a'
self.resolved_link_map[hyperlink] = span
tgt = self.namespace.get(hyperlink, 'w:tgtFrame')
if tgt:
span.set('target', tgt)
tt = self.namespace.get(hyperlink, 'w:tooltip')
if tt:
span.set('title', tt)
rid = self.namespace.get(hyperlink, 'r:id')
if rid and rid in relationships_by_id:
span.set('href', relationships_by_id[rid])
continue
anchor = self.namespace.get(hyperlink, 'w:anchor')
if anchor and anchor in self.anchor_map:
span.set('href', '#' + self.anchor_map[anchor])
continue
self.log.warn('Hyperlink with unknown target (rid=%s, anchor=%s), ignoring' %
(rid, anchor))
# hrefs that point nowhere give epubcheck a hernia. The element
# should be styled explicitly by Word anyway.
# span.set('href', '#')
rmap = {v:k for k, v in iteritems(self.object_map)}
for hyperlink, runs in self.fields.hyperlink_fields:
spans = [rmap[r] for r in runs if r in rmap]
if not spans:
continue
span = spans[0]
if len(spans) > 1:
span = self.wrap_elems(spans, SPAN())
span.tag = 'a'
tgt = hyperlink.get('target', None)
if tgt:
span.set('target', tgt)
tt = hyperlink.get('title', None)
if tt:
span.set('title', tt)
url = hyperlink.get('url', None)
if url is None:
anchor = hyperlink.get('anchor', None)
if anchor in self.anchor_map:
span.set('href', '#' + self.anchor_map[anchor])
continue
self.log.warn('Hyperlink field with unknown anchor: %s' % anchor)
else:
if url in self.anchor_map:
span.set('href', '#' + self.anchor_map[url])
continue
span.set('href', url)
for img, link, relationships_by_id in self.images.links:
parent = img.getparent()
idx = parent.index(img)
a = A(img)
a.tail, img.tail = img.tail, None
parent.insert(idx, a)
tgt = link.get('target', None)
if tgt:
a.set('target', tgt)
tt = link.get('title', None)
if tt:
a.set('title', tt)
rid = link['id']
if rid in relationships_by_id:
dest = relationships_by_id[rid]
if dest.startswith('#'):
if dest[1:] in self.anchor_map:
a.set('href', '#' + self.anchor_map[dest[1:]])
else:
a.set('href', dest)
def convert_run(self, run):
ans = SPAN()
self.object_map[ans] = run
text = Text(ans, 'text', [])
for child in run:
if self.namespace.is_tag(child, 'w:t'):
if not child.text:
continue
space = child.get(XML('space'), None)
preserve = False
ctext = child.text
if space != 'preserve':
# Remove leading and trailing whitespace. Word ignores
# leading and trailing whitespace without preserve
ctext = ctext.strip(' \n\r\t')
# Only use a <span> with white-space:pre-wrap if this element
# actually needs it, i.e. if it has more than one
# consecutive space or it has newlines or tabs.
multi_spaces = self.ms_pat.search(ctext) is not None
preserve = multi_spaces or self.ws_pat.search(ctext) is not None
if preserve:
text.add_elem(SPAN(ctext, style="white-space:pre-wrap"))
ans.append(text.elem)
else:
text.buf.append(ctext)
elif self.namespace.is_tag(child, 'w:cr'):
text.add_elem(BR())
ans.append(text.elem)
elif self.namespace.is_tag(child, 'w:br'):
typ = self.namespace.get(child, 'w:type')
if typ in {'column', 'page'}:
br = BR(style='page-break-after:always')
else:
clear = child.get('clear', None)
if clear in {'all', 'left', 'right'}:
br = BR(style='clear:%s'%('both' if clear == 'all' else clear))
else:
br = BR()
text.add_elem(br)
ans.append(text.elem)
elif self.namespace.is_tag(child, 'w:drawing') or self.namespace.is_tag(child, 'w:pict'):
for img in self.images.to_html(child, self.current_page, self.docx, self.dest_dir):
text.add_elem(img)
ans.append(text.elem)
elif self.namespace.is_tag(child, 'w:footnoteReference') or self.namespace.is_tag(child, 'w:endnoteReference'):
anchor, name = self.footnotes.get_ref(child)
if anchor and name:
l = A(name, id='back_%s' % anchor, href='#' + anchor, title=name)
l.set('class', 'noteref')
text.add_elem(l)
ans.append(text.elem)
elif self.namespace.is_tag(child, 'w:tab'):
spaces = int(math.ceil((self.settings.default_tab_stop / 36) * 6))
text.add_elem(SPAN(NBSP * spaces))
ans.append(text.elem)
ans[-1].set('class', 'tab')
elif self.namespace.is_tag(child, 'w:noBreakHyphen'):
text.buf.append('\u2011')
elif self.namespace.is_tag(child, 'w:softHyphen'):
text.buf.append('\u00ad')
if text.buf:
setattr(text.elem, text.attr, ''.join(text.buf))
style = self.styles.resolve_run(run)
if style.vert_align in {'superscript', 'subscript'}:
if ans.text or len(ans):
ans.set('data-docx-vert', 'sup' if style.vert_align == 'superscript' else 'sub')
if style.lang is not inherit:
lang = html_lang(style.lang)
if lang is not None and lang != self.doc_lang:
ans.set('lang', lang)
if style.rtl is True:
ans.set('dir', 'rtl')
if is_symbol_font(style.font_family):
for elem in text:
if elem.text:
elem.text = map_symbol_text(elem.text, style.font_family)
if elem.tail:
elem.tail = map_symbol_text(elem.tail, style.font_family)
style.font_family = 'sans-serif'
return ans
def add_frame(self, html_obj, style):
last_run = self.framed[-1]
if style is inherit:
if last_run:
self.framed.append([])
return
if last_run:
if last_run[-1][1] == style:
last_run.append((html_obj, style))
else:
self.framed[-1].append((html_obj, style))
else:
last_run.append((html_obj, style))
def apply_frames(self):
for run in filter(None, self.framed):
style = run[0][1]
paras = tuple(x[0] for x in run)
parent = paras[0].getparent()
idx = parent.index(paras[0])
frame = DIV(*paras)
parent.insert(idx, frame)
self.framed_map[frame] = css = style.css(self.page_map[self.object_map[paras[0]]])
self.styles.register(css, 'frame')
if not self.block_runs:
return
rmap = {v:k for k, v in iteritems(self.object_map)}
for border_style, blocks in self.block_runs:
paras = tuple(rmap[p] for p in blocks)
for p in paras:
if p.tag == 'li':
has_li = True
break
else:
has_li = False
parent = paras[0].getparent()
if parent.tag in ('ul', 'ol'):
ul = parent
parent = ul.getparent()
idx = parent.index(ul)
frame = DIV(ul)
elif has_li:
def top_level_tag(x):
while True:
q = x.getparent()
if q is parent or q is None:
break
x = q
return x
paras = tuple(map(top_level_tag, paras))
idx = parent.index(paras[0])
frame = DIV(*paras)
else:
idx = parent.index(paras[0])
frame = DIV(*paras)
parent.insert(idx, frame)
self.framed_map[frame] = css = border_style.css
self.styles.register(css, 'frame')
def mark_block_runs(self, paras):
def process_run(run):
max_left = max_right = 0
has_visible_border = None
for p in run:
style = self.styles.resolve_paragraph(p)
if has_visible_border is None:
has_visible_border = style.has_visible_border()
if isinstance(style.margin_left, numbers.Number):
max_left = max(style.margin_left, max_left)
if isinstance(style.margin_right, numbers.Number):
max_right = max(style.margin_right, max_right)
if has_visible_border:
style.margin_left = style.margin_right = inherit
if p is not run[0]:
style.padding_top = 0
else:
border_style = style.clone_border_styles()
if has_visible_border:
border_style.margin_top, style.margin_top = style.margin_top, inherit
if p is not run[-1]:
style.padding_bottom = 0
else:
if has_visible_border:
border_style.margin_bottom, style.margin_bottom = style.margin_bottom, inherit
style.clear_borders()
if p is not run[-1]:
style.apply_between_border()
if has_visible_border:
border_style.margin_left, border_style.margin_right = max_left,max_right
self.block_runs.append((border_style, run))
run = []
for p in paras:
if run and self.frame_map.get(p) == self.frame_map.get(run[-1]):
style = self.styles.resolve_paragraph(p)
last_style = self.styles.resolve_paragraph(run[-1])
if style.has_identical_borders(last_style):
run.append(p)
continue
if len(run) > 1:
process_run(run)
run = [p]
if len(run) > 1:
process_run(run)
if __name__ == '__main__':
import shutil
from calibre.utils.logging import default_log
default_log.filter_level = default_log.DEBUG
dest_dir = os.path.join(getcwd(), 'docx_input')
if os.path.exists(dest_dir):
shutil.rmtree(dest_dir)
os.mkdir(dest_dir)
Convert(sys.argv[-1], dest_dir=dest_dir, log=default_log)()

View File

@@ -0,0 +1,143 @@
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
from collections import namedtuple
from itertools import count
from lxml.etree import tostring
from calibre.ebooks.metadata.toc import TOC
from calibre.ebooks.oeb.polish.toc import elem_to_toc_text
from polyglot.builtins import iteritems, range
def from_headings(body, log, namespace, num_levels=3):
' Create a TOC from headings in the document '
tocroot = TOC()
all_heading_nodes = body.xpath('//*[@data-heading-level]')
level_prev = {i+1:None for i in range(num_levels)}
level_prev[0] = tocroot
level_item_map = {i:frozenset(
x for x in all_heading_nodes if int(x.get('data-heading-level')) == i)
for i in range(1, num_levels+1)}
item_level_map = {e:i for i, elems in iteritems(level_item_map) for e in elems}
idcount = count()
def ensure_id(elem):
ans = elem.get('id', None)
if not ans:
ans = 'toc_id_%d' % (next(idcount) + 1)
elem.set('id', ans)
return ans
for item in all_heading_nodes:
lvl = plvl = item_level_map.get(item, None)
if lvl is None:
continue
parent = None
while parent is None:
plvl -= 1
parent = level_prev[plvl]
lvl = plvl + 1
elem_id = ensure_id(item)
text = elem_to_toc_text(item)
toc = parent.add_item('index.html', elem_id, text)
level_prev[lvl] = toc
for i in range(lvl+1, num_levels+1):
level_prev[i] = None
if len(tuple(tocroot.flat())) > 1:
log('Generating Table of Contents from headings')
return tocroot
def structure_toc(entries):
indent_vals = sorted({x.indent for x in entries})
last_found = [None for i in indent_vals]
newtoc = TOC()
if len(indent_vals) > 6:
for x in entries:
newtoc.add_item('index.html', x.anchor, x.text)
return newtoc
def find_parent(level):
candidates = last_found[:level]
for x in reversed(candidates):
if x is not None:
return x
return newtoc
for item in entries:
level = indent_vals.index(item.indent)
parent = find_parent(level)
last_found[level] = parent.add_item('index.html', item.anchor,
item.text)
for i in range(level+1, len(last_found)):
last_found[i] = None
return newtoc
def link_to_txt(a, styles, object_map):
if len(a) > 1:
for child in a:
run = object_map.get(child, None)
if run is not None:
rs = styles.resolve(run)
if rs.css.get('display', None) == 'none':
a.remove(child)
return tostring(a, method='text', with_tail=False, encoding='unicode').strip()
def from_toc(docx, link_map, styles, object_map, log, namespace):
XPath, get, ancestor = namespace.XPath, namespace.get, namespace.ancestor
toc_level = None
level = 0
TI = namedtuple('TI', 'text anchor indent')
toc = []
for tag in XPath('//*[(@w:fldCharType and name()="w:fldChar") or name()="w:hyperlink" or name()="w:instrText"]')(docx):
n = tag.tag.rpartition('}')[-1]
if n == 'fldChar':
t = get(tag, 'w:fldCharType')
if t == 'begin':
level += 1
elif t == 'end':
level -= 1
if toc_level is not None and level < toc_level:
break
elif n == 'instrText':
if level > 0 and tag.text and tag.text.strip().startswith('TOC '):
toc_level = level
elif n == 'hyperlink':
if toc_level is not None and level >= toc_level and tag in link_map:
a = link_map[tag]
href = a.get('href', None)
txt = link_to_txt(a, styles, object_map)
p = ancestor(tag, 'w:p')
if txt and href and p is not None:
ps = styles.resolve_paragraph(p)
try:
ml = int(ps.margin_left[:-2])
except (TypeError, ValueError, AttributeError):
ml = 0
if ps.text_align in {'center', 'right'}:
ml = 0
toc.append(TI(txt, href[1:], ml))
if toc:
log('Found Word Table of Contents, using it to generate the Table of Contents')
return structure_toc(toc)
def create_toc(docx, body, link_map, styles, object_map, log, namespace):
ans = from_toc(docx, link_map, styles, object_map, log, namespace) or from_headings(body, log, namespace)
# Remove heading level attributes
for h in body.xpath('//*[@data-heading-level]'):
del h.attrib['data-heading-level']
return ans

View File

@@ -0,0 +1,7 @@
#!/usr/bin/env python2
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'

View File

@@ -0,0 +1,258 @@
#!/usr/bin/env python2
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
'''
Input plugin for HTML or OPF ebooks.
'''
import os, re, sys, errno as gerrno
from calibre.ebooks.oeb.base import urlunquote
from calibre.ebooks.chardet import detect_xml_encoding
from calibre.constants import iswindows
from calibre import unicode_path, as_unicode, replace_entities
from polyglot.builtins import is_py3, unicode_type
from polyglot.urllib import urlparse, urlunparse
class Link(object):
'''
Represents a link in a HTML file.
'''
@classmethod
def url_to_local_path(cls, url, base):
path = url.path
isabs = False
if iswindows and path.startswith('/'):
path = path[1:]
isabs = True
path = urlunparse(('', '', path, url.params, url.query, ''))
path = urlunquote(path)
if isabs or os.path.isabs(path):
return path
return os.path.abspath(os.path.join(base, path))
def __init__(self, url, base):
'''
:param url: The url this link points to. Must be an unquoted unicode string.
:param base: The base directory that relative URLs are with respect to.
Must be a unicode string.
'''
assert isinstance(url, unicode_type) and isinstance(base, unicode_type)
self.url = url
self.parsed_url = urlparse(self.url)
self.is_local = self.parsed_url.scheme in ('', 'file')
self.is_internal = self.is_local and not bool(self.parsed_url.path)
self.path = None
self.fragment = urlunquote(self.parsed_url.fragment)
if self.is_local and not self.is_internal:
self.path = self.url_to_local_path(self.parsed_url, base)
def __hash__(self):
if self.path is None:
return hash(self.url)
return hash(self.path)
def __eq__(self, other):
return self.path == getattr(other, 'path', other)
def __str__(self):
return 'Link: %s --> %s'%(self.url, self.path)
if not is_py3:
__unicode__ = __str__
class IgnoreFile(Exception):
def __init__(self, msg, errno):
Exception.__init__(self, msg)
self.doesnt_exist = errno == gerrno.ENOENT
self.errno = errno
class HTMLFile(object):
'''
Contains basic information about an HTML file. This
includes a list of links to other files as well as
the encoding of each file. Also tries to detect if the file is not a HTML
file in which case :member:`is_binary` is set to True.
The encoding of the file is available as :member:`encoding`.
'''
HTML_PAT = re.compile(r'<\s*html', re.IGNORECASE)
TITLE_PAT = re.compile('<title>([^<>]+)</title>', re.IGNORECASE)
LINK_PAT = re.compile(
r'<\s*a\s+.*?href\s*=\s*(?:(?:"(?P<url1>[^"]+)")|(?:\'(?P<url2>[^\']+)\')|(?P<url3>[^\s>]+))',
re.DOTALL|re.IGNORECASE)
def __init__(self, path_to_html_file, level, encoding, verbose, referrer=None):
'''
:param level: The level of this file. Should be 0 for the root file.
:param encoding: Use `encoding` to decode HTML.
:param referrer: The :class:`HTMLFile` that first refers to this file.
'''
self.path = unicode_path(path_to_html_file, abs=True)
self.title = os.path.splitext(os.path.basename(self.path))[0]
self.base = os.path.dirname(self.path)
self.level = level
self.referrer = referrer
self.links = []
try:
with open(self.path, 'rb') as f:
src = header = f.read(4096)
encoding = detect_xml_encoding(src)[1]
if encoding:
try:
header = header.decode(encoding)
except ValueError:
pass
self.is_binary = level > 0 and not bool(self.HTML_PAT.search(header))
if not self.is_binary:
src += f.read()
except IOError as err:
msg = 'Could not read from file: %s with error: %s'%(self.path, as_unicode(err))
if level == 0:
raise IOError(msg)
raise IgnoreFile(msg, err.errno)
if not src:
if level == 0:
raise ValueError('The file %s is empty'%self.path)
self.is_binary = True
if not self.is_binary:
if not encoding:
encoding = detect_xml_encoding(src[:4096], verbose=verbose)[1]
self.encoding = encoding
else:
self.encoding = encoding
src = src.decode(encoding, 'replace')
match = self.TITLE_PAT.search(src)
self.title = match.group(1) if match is not None else self.title
self.find_links(src)
def __eq__(self, other):
return self.path == getattr(other, 'path', other)
def __hash__(self):
return hash(self.path)
def __str__(self):
return 'HTMLFile:%d:%s:%s'%(self.level, 'b' if self.is_binary else 'a', self.path)
def __repr__(self):
return unicode_type(self)
def find_links(self, src):
for match in self.LINK_PAT.finditer(src):
url = None
for i in ('url1', 'url2', 'url3'):
url = match.group(i)
if url:
break
url = replace_entities(url)
try:
link = self.resolve(url)
except ValueError:
# Unparseable URL, ignore
continue
if link not in self.links:
self.links.append(link)
def resolve(self, url):
return Link(url, self.base)
def depth_first(root, flat, visited=None):
yield root
if visited is None:
visited = set()
visited.add(root)
for link in root.links:
if link.path is not None and link not in visited:
try:
index = flat.index(link)
except ValueError: # Can happen if max_levels is used
continue
hf = flat[index]
if hf not in visited:
yield hf
visited.add(hf)
for hf in depth_first(hf, flat, visited):
if hf not in visited:
yield hf
visited.add(hf)
def traverse(path_to_html_file, max_levels=sys.maxsize, verbose=0, encoding=None):
'''
Recursively traverse all links in the HTML file.
:param max_levels: Maximum levels of recursion. Must be non-negative. 0
implies that no links in the root HTML file are followed.
:param encoding: Specify character encoding of HTML files. If `None` it is
auto-detected.
:return: A pair of lists (breadth_first, depth_first). Each list contains
:class:`HTMLFile` objects.
'''
assert max_levels >= 0
level = 0
flat = [HTMLFile(path_to_html_file, level, encoding, verbose)]
next_level = list(flat)
while level < max_levels and len(next_level) > 0:
level += 1
nl = []
for hf in next_level:
rejects = []
for link in hf.links:
if link.path is None or link.path in flat:
continue
try:
nf = HTMLFile(link.path, level, encoding, verbose, referrer=hf)
if nf.is_binary:
raise IgnoreFile('%s is a binary file'%nf.path, -1)
nl.append(nf)
flat.append(nf)
except IgnoreFile as err:
rejects.append(link)
if not err.doesnt_exist or verbose > 1:
print(repr(err))
for link in rejects:
hf.links.remove(link)
next_level = list(nl)
orec = sys.getrecursionlimit()
sys.setrecursionlimit(500000)
try:
return flat, list(depth_first(flat[0], flat))
finally:
sys.setrecursionlimit(orec)
def get_filelist(htmlfile, dir, opts, log):
'''
Build list of files referenced by html file or try to detect and use an
OPF file instead.
'''
log.info('Building file list...')
filelist = traverse(htmlfile, max_levels=int(opts.max_levels),
verbose=opts.verbose,
encoding=opts.input_encoding)[0 if opts.breadth_first else 1]
if opts.verbose:
log.debug('\tFound files...')
for f in filelist:
log.debug('\t\t', f)
return filelist

View File

@@ -0,0 +1,122 @@
#!/usr/bin/env python2
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2011, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'
import textwrap, os, glob
from calibre.customize import FileTypePlugin
from calibre.constants import numeric_version
from polyglot.builtins import unicode_type
class HTML2ZIP(FileTypePlugin):
name = 'HTML to ZIP'
author = 'Kovid Goyal'
description = textwrap.dedent(_('''\
Follow all local links in an HTML file and create a ZIP \
file containing all linked files. This plugin is run \
every time you add an HTML file to the library.\
'''))
version = numeric_version
file_types = {'html', 'htm', 'xhtml', 'xhtm', 'shtm', 'shtml'}
supported_platforms = ['windows', 'osx', 'linux']
on_import = True
def run(self, htmlfile):
import codecs
from calibre import prints
from calibre.ptempfile import TemporaryDirectory
from calibre.gui2.convert.gui_conversion import gui_convert
from calibre.customize.conversion import OptionRecommendation
from calibre.ebooks.epub import initialize_container
with TemporaryDirectory('_plugin_html2zip') as tdir:
recs =[('debug_pipeline', tdir, OptionRecommendation.HIGH)]
recs.append(['keep_ligatures', True, OptionRecommendation.HIGH])
if self.site_customization and self.site_customization.strip():
sc = self.site_customization.strip()
enc, _, bf = sc.partition('|')
if enc:
try:
codecs.lookup(enc)
except Exception:
prints('Ignoring invalid input encoding for HTML:', enc)
else:
recs.append(['input_encoding', enc, OptionRecommendation.HIGH])
if bf == 'bf':
recs.append(['breadth_first', True,
OptionRecommendation.HIGH])
gui_convert(htmlfile, tdir, recs, abort_after_input_dump=True)
of = self.temporary_file('_plugin_html2zip.zip')
tdir = os.path.join(tdir, 'input')
opf = glob.glob(os.path.join(tdir, '*.opf'))[0]
ncx = glob.glob(os.path.join(tdir, '*.ncx'))
if ncx:
os.remove(ncx[0])
epub = initialize_container(of.name, os.path.basename(opf))
epub.add_dir(tdir)
epub.close()
return of.name
def customization_help(self, gui=False):
return _('Character encoding for the input HTML files. Common choices '
'include: cp1252, cp1251, latin1 and utf-8.')
def do_user_config(self, parent=None):
'''
This method shows a configuration dialog for this plugin. It returns
True if the user clicks OK, False otherwise. The changes are
automatically applied.
'''
from PyQt5.Qt import (QDialog, QDialogButtonBox, QVBoxLayout,
QLabel, Qt, QLineEdit, QCheckBox)
config_dialog = QDialog(parent)
button_box = QDialogButtonBox(QDialogButtonBox.Ok | QDialogButtonBox.Cancel)
v = QVBoxLayout(config_dialog)
def size_dialog():
config_dialog.resize(config_dialog.sizeHint())
button_box.accepted.connect(config_dialog.accept)
button_box.rejected.connect(config_dialog.reject)
config_dialog.setWindowTitle(_('Customize') + ' ' + self.name)
from calibre.customize.ui import (plugin_customization,
customize_plugin)
help_text = self.customization_help(gui=True)
help_text = QLabel(help_text, config_dialog)
help_text.setWordWrap(True)
help_text.setTextInteractionFlags(Qt.LinksAccessibleByMouse | Qt.LinksAccessibleByKeyboard)
help_text.setOpenExternalLinks(True)
v.addWidget(help_text)
bf = QCheckBox(_('Add linked files in breadth first order'))
bf.setToolTip(_('Normally, when following links in HTML files'
' calibre does it depth first, i.e. if file A links to B and '
' C, but B links to D, the files are added in the order A, B, D, C. '
' With this option, they will instead be added as A, B, C, D'))
sc = plugin_customization(self)
if not sc:
sc = ''
sc = sc.strip()
enc = sc.partition('|')[0]
bfs = sc.partition('|')[-1]
bf.setChecked(bfs == 'bf')
sc = QLineEdit(enc, config_dialog)
v.addWidget(sc)
v.addWidget(bf)
v.addWidget(button_box)
size_dialog()
config_dialog.exec_()
if config_dialog.result() == QDialog.Accepted:
sc = unicode_type(sc.text()).strip()
if bf.isChecked():
sc += '|bf'
customize_plugin(self, sc)
return config_dialog.result()

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,115 @@
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
"""
This package contains logic to read and write LRF files.
The LRF file format is documented at U{http://www.sven.de/librie/Librie/LrfFormat}.
"""
from calibre.ebooks.lrf.pylrs.pylrs import Book as _Book
from calibre.ebooks.lrf.pylrs.pylrs import TextBlock, Header, \
TextStyle, BlockStyle
from calibre.ebooks.lrf.fonts import FONT_FILE_MAP
from calibre.ebooks import ConversionError
__docformat__ = "epytext"
class LRFParseError(Exception):
pass
class PRS500_PROFILE(object):
screen_width = 600
screen_height = 775
dpi = 166
# Number of pixels to subtract from screen_height when calculating height of text area
fudge = 0
font_size = 10 #: Default (in pt)
parindent = 10 #: Default (in pt)
line_space = 1.2 # : Default (in pt)
header_font_size = 6 #: In pt
header_height = 30 # : In px
default_fonts = {'sans': "Swis721 BT Roman", 'mono': "Courier10 BT Roman",
'serif': "Dutch801 Rm BT Roman"}
name = 'prs500'
def find_custom_fonts(options, logger):
from calibre.utils.fonts.scanner import font_scanner
fonts = {'serif' : None, 'sans' : None, 'mono' : None}
def family(cmd):
return cmd.split(',')[-1].strip()
if options.serif_family:
f = family(options.serif_family)
fonts['serif'] = font_scanner.legacy_fonts_for_family(f)
if not fonts['serif']:
logger.warn('Unable to find serif family %s'%f)
if options.sans_family:
f = family(options.sans_family)
fonts['sans'] = font_scanner.legacy_fonts_for_family(f)
if not fonts['sans']:
logger.warn('Unable to find sans family %s'%f)
if options.mono_family:
f = family(options.mono_family)
fonts['mono'] = font_scanner.legacy_fonts_for_family(f)
if not fonts['mono']:
logger.warn('Unable to find mono family %s'%f)
return fonts
def Book(options, logger, font_delta=0, header=None,
profile=PRS500_PROFILE, **settings):
from uuid import uuid4
ps = {}
ps['topmargin'] = options.top_margin
ps['evensidemargin'] = options.left_margin
ps['oddsidemargin'] = options.left_margin
ps['textwidth'] = profile.screen_width - (options.left_margin + options.right_margin)
ps['textheight'] = profile.screen_height - (options.top_margin + options.bottom_margin) \
- profile.fudge
if header:
hdr = Header()
hb = TextBlock(textStyle=TextStyle(align='foot',
fontsize=int(profile.header_font_size*10)),
blockStyle=BlockStyle(blockwidth=ps['textwidth']))
hb.append(header)
hdr.PutObj(hb)
ps['headheight'] = profile.header_height
ps['headsep'] = options.header_separation
ps['header'] = hdr
ps['topmargin'] = 0
ps['textheight'] = profile.screen_height - (options.bottom_margin + ps['topmargin']) \
- ps['headheight'] - ps['headsep'] - profile.fudge
fontsize = int(10*profile.font_size+font_delta*20)
baselineskip = fontsize + 20
fonts = find_custom_fonts(options, logger)
tsd = dict(fontsize=fontsize,
parindent=int(10*profile.parindent),
linespace=int(10*profile.line_space),
baselineskip=baselineskip,
wordspace=10*options.wordspace)
if fonts['serif'] and 'normal' in fonts['serif']:
tsd['fontfacename'] = fonts['serif']['normal'][1]
book = _Book(textstyledefault=tsd,
pagestyledefault=ps,
blockstyledefault=dict(blockwidth=ps['textwidth']),
bookid=uuid4().hex,
**settings)
for family in fonts.keys():
if fonts[family]:
for font in fonts[family].values():
book.embed_font(*font)
FONT_FILE_MAP[font[1]] = font[0]
for family in ['serif', 'sans', 'mono']:
if not fonts[family]:
fonts[family] = {'normal' : (None, profile.default_fonts[family])}
elif 'normal' not in fonts[family]:
raise ConversionError('Could not find the normal version of the ' + family + ' font')
return book, fonts

View File

@@ -0,0 +1,33 @@
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
from PIL import ImageFont
'''
Default fonts used in the PRS500
'''
LIBERATION_FONT_MAP = {
'Swis721 BT Roman' : 'LiberationSans-Regular',
'Dutch801 Rm BT Roman' : 'LiberationSerif-Regular',
'Courier10 BT Roman' : 'LiberationMono-Regular',
}
FONT_FILE_MAP = {}
def get_font(name, size, encoding='unic'):
'''
Get an ImageFont object by name.
@param size: Font height in pixels. To convert from pts:
sz in pixels = (dpi/72) * size in pts
@param encoding: Font encoding to use. E.g. 'unic', 'symbol', 'ADOB', 'ADBE', 'aprm'
@param manager: A dict that will store the PersistentTemporary
'''
if name in LIBERATION_FONT_MAP:
return ImageFont.truetype(P('fonts/liberation/%s.ttf' % LIBERATION_FONT_MAP[name]), size, encoding=encoding)
elif name in FONT_FILE_MAP:
return ImageFont.truetype(FONT_FILE_MAP[name], size, encoding=encoding)

View File

@@ -0,0 +1,10 @@
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
"""
This package contains code to convert HTML ebooks to LRF ebooks.
"""
__docformat__ = "epytext"
__author__ = "Kovid Goyal <kovid@kovidgoyal.net>"

View File

@@ -0,0 +1,115 @@
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
import re
NAME_MAP = {
'aliceblue': '#F0F8FF',
'antiquewhite': '#FAEBD7',
'aqua': '#00FFFF',
'aquamarine': '#7FFFD4',
'azure': '#F0FFFF',
'beige': '#F5F5DC',
'bisque': '#FFE4C4',
'black': '#000000',
'blanchedalmond': '#FFEBCD',
'blue': '#0000FF',
'brown': '#A52A2A',
'burlywood': '#DEB887',
'cadetblue': '#5F9EA0',
'chartreuse': '#7FFF00',
'chocolate': '#D2691E',
'coral': '#FF7F50',
'crimson': '#DC143C',
'cyan': '#00FFFF',
'darkblue': '#00008B',
'darkgoldenrod': '#B8860B',
'darkgreen': '#006400',
'darkkhaki': '#BDB76B',
'darkmagenta': '#8B008B',
'darkolivegreen': '#556B2F',
'darkorange': '#FF8C00',
'darkorchid': '#9932CC',
'darkred': '#8B0000',
'darksalmon': '#E9967A',
'darkslateblue': '#483D8B',
'darkslategrey': '#2F4F4F',
'darkviolet': '#9400D3',
'deeppink': '#FF1493',
'dodgerblue': '#1E90FF',
'firebrick': '#B22222',
'floralwhite': '#FFFAF0',
'forestgreen': '#228B22',
'fuchsia': '#FF00FF',
'gainsboro': '#DCDCDC',
'ghostwhite': '#F8F8FF',
'gold': '#FFD700',
'goldenrod': '#DAA520',
'indianred ': '#CD5C5C',
'indigo ': '#4B0082',
'khaki': '#F0E68C',
'lavenderblush': '#FFF0F5',
'lawngreen': '#7CFC00',
'lightblue': '#ADD8E6',
'lightcoral': '#F08080',
'lightgoldenrodyellow': '#FAFAD2',
'lightgray': '#D3D3D3',
'lightgrey': '#D3D3D3',
'lightskyblue': '#87CEFA',
'lightslategrey': '#778899',
'lightsteelblue': '#B0C4DE',
'lime': '#87CEFA',
'linen': '#FAF0E6',
'magenta': '#FF00FF',
'maroon': '#800000',
'mediumaquamarine': '#66CDAA',
'mediumblue': '#0000CD',
'mediumorchid': '#BA55D3',
'mediumpurple': '#9370D8',
'mediumseagreen': '#3CB371',
'mediumslateblue': '#7B68EE',
'midnightblue': '#191970',
'moccasin': '#FFE4B5',
'navajowhite': '#FFDEAD',
'navy': '#000080',
'oldlace': '#FDF5E6',
'olive': '#808000',
'orange': '#FFA500',
'orangered': '#FF4500',
'orchid': '#DA70D6',
'paleturquoise': '#AFEEEE',
'papayawhip': '#FFEFD5',
'peachpuff': '#FFDAB9',
'powderblue': '#B0E0E6',
'rosybrown': '#BC8F8F',
'royalblue': '#4169E1',
'saddlebrown': '#8B4513',
'sandybrown': '#8B4513',
'seashell': '#FFF5EE',
'sienna': '#A0522D',
'silver': '#C0C0C0',
'skyblue': '#87CEEB',
'slategrey': '#708090',
'snow': '#FFFAFA',
'springgreen': '#00FF7F',
'violet': '#EE82EE',
'yellowgreen': '#9ACD32'
}
hex_pat = re.compile(r'#(\d{2})(\d{2})(\d{2})')
rgb_pat = re.compile(r'rgb\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)', re.IGNORECASE)
def lrs_color(html_color):
hcol = html_color.lower()
match = hex_pat.search(hcol)
if match:
return '0x00'+match.group(1)+match.group(2)+match.group(3)
match = rgb_pat.search(hcol)
if match:
return '0x00'+hex(int(match.group(1)))[2:]+hex(int(match.group(2)))[2:]+hex(int(match.group(3)))[2:]
if hcol in NAME_MAP:
return NAME_MAP[hcol].replace('#', '0x00')
return '0x00000000'

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,386 @@
from __future__ import absolute_import, division, print_function, unicode_literals
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
import math, sys, re, numbers
from calibre.ebooks.lrf.fonts import get_font
from calibre.ebooks.lrf.pylrs.pylrs import TextBlock, Text, CR, Span, \
CharButton, Plot, Paragraph, \
LrsTextTag
from polyglot.builtins import string_or_bytes, range, native_string_type
def ceil(num):
return int(math.ceil(num))
def print_xml(elem):
from calibre.ebooks.lrf.pylrs.pylrs import ElementWriter
elem = elem.toElement(native_string_type('utf8'))
ew = ElementWriter(elem, sourceEncoding=native_string_type('utf8'))
ew.write(sys.stdout)
print()
def cattrs(base, extra):
new = base.copy()
new.update(extra)
return new
def tokens(tb):
'''
Return the next token. A token is :
1. A string
a block of text that has the same style
'''
def process_element(x, attrs):
if isinstance(x, CR):
yield 2, None
elif isinstance(x, Text):
yield x.text, cattrs(attrs, {})
elif isinstance(x, string_or_bytes):
yield x, cattrs(attrs, {})
elif isinstance(x, (CharButton, LrsTextTag)):
if x.contents:
if hasattr(x.contents[0], 'text'):
yield x.contents[0].text, cattrs(attrs, {})
elif hasattr(x.contents[0], 'attrs'):
for z in process_element(x.contents[0], x.contents[0].attrs):
yield z
elif isinstance(x, Plot):
yield x, None
elif isinstance(x, Span):
attrs = cattrs(attrs, x.attrs)
for y in x.contents:
for z in process_element(y, attrs):
yield z
for i in tb.contents:
if isinstance(i, CR):
yield 1, None
elif isinstance(i, Paragraph):
for j in i.contents:
attrs = {}
if hasattr(j, 'attrs'):
attrs = j.attrs
for k in process_element(j, attrs):
yield k
class Cell(object):
def __init__(self, conv, tag, css):
self.conv = conv
self.tag = tag
self.css = css
self.text_blocks = []
self.pwidth = -1.
if tag.has_attr('width') and '%' in tag['width']:
try:
self.pwidth = float(tag['width'].replace('%', ''))
except ValueError:
pass
if 'width' in css and '%' in css['width']:
try:
self.pwidth = float(css['width'].replace('%', ''))
except ValueError:
pass
if self.pwidth > 100:
self.pwidth = -1
self.rowspan = self.colspan = 1
try:
self.colspan = int(tag['colspan']) if tag.has_attr('colspan') else 1
self.rowspan = int(tag['rowspan']) if tag.has_attr('rowspan') else 1
except:
pass
pp = conv.current_page
conv.book.allow_new_page = False
conv.current_page = conv.book.create_page()
conv.parse_tag(tag, css)
conv.end_current_block()
for item in conv.current_page.contents:
if isinstance(item, TextBlock):
self.text_blocks.append(item)
conv.current_page = pp
conv.book.allow_new_page = True
if not self.text_blocks:
tb = conv.book.create_text_block()
tb.Paragraph(' ')
self.text_blocks.append(tb)
for tb in self.text_blocks:
tb.parent = None
tb.objId = 0
# Needed as we have to eventually change this BlockStyle's width and
# height attributes. This blockstyle may be shared with other
# elements, so doing that causes havoc.
tb.blockStyle = conv.book.create_block_style()
ts = conv.book.create_text_style(**tb.textStyle.attrs)
ts.attrs['parindent'] = 0
tb.textStyle = ts
if ts.attrs['align'] == 'foot':
if isinstance(tb.contents[-1], Paragraph):
tb.contents[-1].append(' ')
def pts_to_pixels(self, pts):
pts = int(pts)
return ceil((float(self.conv.profile.dpi)/72)*(pts/10))
def minimum_width(self):
return max([self.minimum_tb_width(tb) for tb in self.text_blocks])
def minimum_tb_width(self, tb):
ts = tb.textStyle.attrs
default_font = get_font(ts['fontfacename'], self.pts_to_pixels(ts['fontsize']))
parindent = self.pts_to_pixels(ts['parindent'])
mwidth = 0
for token, attrs in tokens(tb):
font = default_font
if isinstance(token, numbers.Integral): # Handle para and line breaks
continue
if isinstance(token, Plot):
return self.pts_to_pixels(token.xsize)
ff = attrs.get('fontfacename', ts['fontfacename'])
fs = attrs.get('fontsize', ts['fontsize'])
if (ff, fs) != (ts['fontfacename'], ts['fontsize']):
font = get_font(ff, self.pts_to_pixels(fs))
if not token.strip():
continue
word = token.split()
word = word[0] if word else ""
width = font.getsize(word)[0]
if width > mwidth:
mwidth = width
return parindent + mwidth + 2
def text_block_size(self, tb, maxwidth=sys.maxsize, debug=False):
ts = tb.textStyle.attrs
default_font = get_font(ts['fontfacename'], self.pts_to_pixels(ts['fontsize']))
parindent = self.pts_to_pixels(ts['parindent'])
top, bottom, left, right = 0, 0, parindent, parindent
def add_word(width, height, left, right, top, bottom, ls, ws):
if left + width > maxwidth:
left = width + ws
top += ls
bottom = top+ls if top+ls > bottom else bottom
else:
left += (width + ws)
right = left if left > right else right
bottom = top+ls if top+ls > bottom else bottom
return left, right, top, bottom
for token, attrs in tokens(tb):
if attrs is None:
attrs = {}
font = default_font
ls = self.pts_to_pixels(attrs.get('baselineskip', ts['baselineskip']))+\
self.pts_to_pixels(attrs.get('linespace', ts['linespace']))
ws = self.pts_to_pixels(attrs.get('wordspace', ts['wordspace']))
if isinstance(token, numbers.Integral): # Handle para and line breaks
if top != bottom: # Previous element not a line break
top = bottom
else:
top += ls
bottom += ls
left = parindent if int == 1 else 0
continue
if isinstance(token, Plot):
width, height = self.pts_to_pixels(token.xsize), self.pts_to_pixels(token.ysize)
left, right, top, bottom = add_word(width, height, left, right, top, bottom, height, ws)
continue
ff = attrs.get('fontfacename', ts['fontfacename'])
fs = attrs.get('fontsize', ts['fontsize'])
if (ff, fs) != (ts['fontfacename'], ts['fontsize']):
font = get_font(ff, self.pts_to_pixels(fs))
for word in token.split():
width, height = font.getsize(word)
left, right, top, bottom = add_word(width, height, left, right, top, bottom, ls, ws)
return right+3+max(parindent, 10), bottom
def text_block_preferred_width(self, tb, debug=False):
return self.text_block_size(tb, sys.maxsize, debug=debug)[0]
def preferred_width(self, debug=False):
return ceil(max([self.text_block_preferred_width(i, debug=debug) for i in self.text_blocks]))
def height(self, width):
return sum([self.text_block_size(i, width)[1] for i in self.text_blocks])
class Row(object):
def __init__(self, conv, row, css, colpad):
self.cells = []
self.colpad = colpad
cells = row.findAll(re.compile('td|th', re.IGNORECASE))
self.targets = []
for cell in cells:
ccss = conv.tag_css(cell, css)[0]
self.cells.append(Cell(conv, cell, ccss))
for a in row.findAll(id=True) + row.findAll(name=True):
name = a['name'] if a.has_attr('name') else a['id'] if a.has_attr('id') else None
if name is not None:
self.targets.append(name.replace('#', ''))
def number_of_cells(self):
'''Number of cells in this row. Respects colspan'''
ans = 0
for cell in self.cells:
ans += cell.colspan
return ans
def height(self, widths):
i, heights = 0, []
for cell in self.cells:
width = sum(widths[i:i+cell.colspan])
heights.append(cell.height(width))
i += cell.colspan
if not heights:
return 0
return max(heights)
def cell_from_index(self, col):
i = -1
cell = None
for cell in self.cells:
for k in range(0, cell.colspan):
if i == col:
break
i += 1
if i == col:
break
return cell
def minimum_width(self, col):
cell = self.cell_from_index(col)
if not cell:
return 0
return cell.minimum_width()
def preferred_width(self, col):
cell = self.cell_from_index(col)
if not cell:
return 0
return 0 if cell.colspan > 1 else cell.preferred_width()
def width_percent(self, col):
cell = self.cell_from_index(col)
if not cell:
return -1
return -1 if cell.colspan > 1 else cell.pwidth
def cell_iterator(self):
for c in self.cells:
yield c
class Table(object):
def __init__(self, conv, table, css, rowpad=10, colpad=10):
self.rows = []
self.conv = conv
self.rowpad = rowpad
self.colpad = colpad
rows = table.findAll('tr')
conv.in_table = True
for row in rows:
rcss = conv.tag_css(row, css)[0]
self.rows.append(Row(conv, row, rcss, colpad))
conv.in_table = False
def number_of_columns(self):
max = 0
for row in self.rows:
max = row.number_of_cells() if row.number_of_cells() > max else max
return max
def number_or_rows(self):
return len(self.rows)
def height(self, maxwidth):
''' Return row heights + self.rowpad'''
widths = self.get_widths(maxwidth)
return sum([row.height(widths) + self.rowpad for row in self.rows]) - self.rowpad
def minimum_width(self, col):
return max([row.minimum_width(col) for row in self.rows])
def width_percent(self, col):
return max([row.width_percent(col) for row in self.rows])
def get_widths(self, maxwidth):
'''
Return widths of columns + self.colpad
'''
rows, cols = self.number_or_rows(), self.number_of_columns()
widths = list(range(cols))
for c in range(cols):
cellwidths = [0 for i in range(rows)]
for r in range(rows):
try:
cellwidths[r] = self.rows[r].preferred_width(c)
except IndexError:
continue
widths[c] = max(cellwidths)
min_widths = [self.minimum_width(i)+10 for i in range(cols)]
for i in range(len(widths)):
wp = self.width_percent(i)
if wp >= 0:
widths[i] = max(min_widths[i], ceil((wp/100) * (maxwidth - (cols-1)*self.colpad)))
itercount = 0
while sum(widths) > maxwidth-((len(widths)-1)*self.colpad) and itercount < 100:
for i in range(cols):
widths[i] = ceil((95/100)*widths[i]) if \
ceil((95/100)*widths[i]) >= min_widths[i] else widths[i]
itercount += 1
return [i+self.colpad for i in widths]
def blocks(self, maxwidth, maxheight):
rows, cols = self.number_or_rows(), self.number_of_columns()
cellmatrix = [[None for c in range(cols)] for r in range(rows)]
rowpos = [0 for i in range(rows)]
for r in range(rows):
nc = self.rows[r].cell_iterator()
try:
while True:
cell = next(nc)
cellmatrix[r][rowpos[r]] = cell
rowpos[r] += cell.colspan
for k in range(1, cell.rowspan):
try:
rowpos[r+k] += 1
except IndexError:
break
except StopIteration: # No more cells in this row
continue
widths = self.get_widths(maxwidth)
heights = [row.height(widths) for row in self.rows]
xpos = [sum(widths[:i]) for i in range(cols)]
delta = maxwidth - sum(widths)
if delta < 0:
delta = 0
for r in range(len(cellmatrix)):
yield None, 0, heights[r], 0, self.rows[r].targets
for c in range(len(cellmatrix[r])):
cell = cellmatrix[r][c]
if not cell:
continue
width = sum(widths[c:c+cell.colspan])-self.colpad*cell.colspan
sypos = 0
for tb in cell.text_blocks:
tb.blockStyle = self.conv.book.create_block_style(
blockwidth=width,
blockheight=cell.text_block_size(tb, width)[1],
blockrule='horz-fixed')
yield tb, xpos[c], sypos, delta, None
sypos += tb.blockStyle.attrs['blockheight']

View File

@@ -0,0 +1,7 @@
from __future__ import absolute_import, division, print_function, unicode_literals
"""
This package contains code to generate ebooks in the SONY LRS/F format. It was
originally developed by Mike Higgins and has been extended and modified by Kovid
Goyal.
"""

View File

@@ -0,0 +1,78 @@
from __future__ import absolute_import, division, print_function, unicode_literals
""" elements.py -- replacements and helpers for ElementTree """
from polyglot.builtins import unicode_type, string_or_bytes
class ElementWriter(object):
def __init__(self, e, header=False, sourceEncoding="ascii",
spaceBeforeClose=True, outputEncodingName="UTF-16"):
self.header = header
self.e = e
self.sourceEncoding=sourceEncoding
self.spaceBeforeClose = spaceBeforeClose
self.outputEncodingName = outputEncodingName
def _encodeCdata(self, rawText):
if isinstance(rawText, bytes):
rawText = rawText.decode(self.sourceEncoding)
text = rawText.replace("&", "&amp;")
text = text.replace("<", "&lt;")
text = text.replace(">", "&gt;")
return text
def _writeAttribute(self, f, name, value):
f.write(' %s="' % unicode_type(name))
if not isinstance(value, string_or_bytes):
value = unicode_type(value)
value = self._encodeCdata(value)
value = value.replace('"', '&quot;')
f.write(value)
f.write('"')
def _writeText(self, f, rawText):
text = self._encodeCdata(rawText)
f.write(text)
def _write(self, f, e):
f.write('<' + unicode_type(e.tag))
attributes = e.items()
attributes.sort()
for name, value in attributes:
self._writeAttribute(f, name, value)
if e.text is not None or len(e) > 0:
f.write('>')
if e.text:
self._writeText(f, e.text)
for e2 in e:
self._write(f, e2)
f.write('</%s>' % e.tag)
else:
if self.spaceBeforeClose:
f.write(' ')
f.write('/>')
if e.tail is not None:
self._writeText(f, e.tail)
def toString(self):
class x:
pass
buffer = []
x.write = buffer.append
self.write(x)
return ''.join(buffer)
def write(self, f):
if self.header:
f.write('<?xml version="1.0" encoding="%s"?>\n' % self.outputEncodingName)
self._write(f, self.e)

Some files were not shown because too many files have changed in this diff Show More